
SEMANTIC IMAGE SYNTHESIS WITH SPATIALLY ADAPTIVE NORMALIZATION

Author: Yeshwanth Buggaveetia

Here we discuss spatially-adaptive normalization, a simple yet effective layer for synthesizing photorealistic images from an input semantic layout. In previous approaches, the semantic layout was fed directly into the deep network and processed through stacks of convolution, normalization, and nonlinearity layers. I demonstrate that this is suboptimal because the normalization layers tend to "wash away" the information contained in the input semantic masks. By feeding the layout in through the normalization layers instead, a compact network can produce considerably better results than other state-of-the-art approaches. The model also gives the user control over both the semantic and stylistic aspects of the output, and an extensive ablation study in the paper shows that the proposed normalization layer outperforms several alternatives on the semantic image synthesis task.
 

Introduction:

The task of producing photorealistic images conditioned on some input data is referred to as conditional image synthesis. Seminal work computed the output image by stitching together pieces from a single image or from an image collection. Recent approaches instead learn the mapping directly with neural networks; these methods are faster and do not require an external image database.

I am interested in a specific form of conditional image synthesis in which a semantic segmentation mask is converted into a photorealistic image. This form, known as semantic image synthesis, has many applications, including content creation and image manipulation. In this study, we show that the conventional network design, built by stacking convolutional, normalization, and nonlinearity layers, is at best suboptimal, because the normalization layers "wash away" the information contained in the input semantic masks. To address this issue, we propose spatially-adaptive normalization, a conditional normalization layer that modulates the activations based on the input semantic layout through a spatially adaptive, learned transformation, and can propagate semantic information effectively throughout the network.

Semantic image synthesis:

Given a source image, we aim to create a realistic image that not only fits the target text description but also preserves the source image characteristics that are unrelated to the target text description. For this image synthesis task, we use adversarial learning to learn implicit loss functions automatically. We refer to t as matching text, t̂ as mismatching text, and t̄ as semantically relevant text, which includes t and other related but not exactly matching texts (e.g., given an image of a specific type of bird, t̄ can describe other types of birds, but not other objects such as flowers or buildings). s denotes the probability of a text matching an image x, and x̂ is the image synthesized by the generator, x̂ = G(x, ϕ(t)).

In our method, we feed three sorts of input pairs to the discriminator D, and the outputs of the discriminator D are the independent probabilities of these types:

  • s_r⁺ ← D(x, ϕ(t)) for a real image with matching text;

  • s_w⁻ ← D(x, ϕ(t̂)) for a real image with mismatching text;

  • s_s⁻ ← D(x̂, ϕ(t)) for a synthesized image with semantically relevant text;

where the superscripts + and − denote positive and negative examples, respectively.

The negative score s⁻ is introduced to give the discriminator a stronger image/text matching signal, which in turn allows the generator G to synthesize realistic images that better match the text descriptions. G generates images as x̂ ← G(x, ϕ(t)) and is optimized in competition with D.
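As a rough illustration of how these three input pairs could be scored in code, here is a minimal PyTorch-style sketch. The D, G, and phi modules and the batch variables are hypothetical placeholders for illustration, not the actual implementation.

```python
import torch

# Hypothetical modules: phi embeds text, G maps (image, text embedding) -> image,
# D maps (image, text embedding) -> matching score.
def discriminator_scores(D, G, phi, x, t, t_hat):
    with torch.no_grad():
        x_fake = G(x, phi(t))      # synthesized image x̂ = G(x, ϕ(t))
    s_r = D(x, phi(t))             # real image + matching text (positive)
    s_w = D(x, phi(t_hat))         # real image + mismatching text (negative)
    s_s = D(x_fake, phi(t))        # synthesized image + semantically relevant text (negative)
    return s_r, s_w, s_s
```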

Figure-1

Why SPADE:

SPADE is an abbreviation for SPatially-Adaptive DEnormalization, a normalization technique in the same family as Batch Norm, Instance Norm, and so on. It is used in GANs to create synthetic photorealistic images from a segmentation mask. In the paper, this normalization is applied throughout the generator's layers. Instead of learning the affine parameters directly, as a Batch Norm layer does, SPADE computes those affine parameters from the semantic map. Not sure what affine parameters are?

Figure-2

This is an image from the batch norm paper. The little gamma and beta terms are the affine parameters. They are learnable and give the model the freedom to shift the normalized activations toward whatever distribution works best. SPADE simply asks: why not compute those gammas and betas, i.e., the scaling and shifting parameters, from the semantic map? So SPADE derives the scaling and shifting parameters from the semantic map instead of using freely learned, arbitrarily initialized ones, and THAT'S IT! The motivation is that the information in the input semantic masks gets washed away by conventional normalization layers; SPADE helps propagate semantic information efficiently throughout the network. The architecture of a SPADE block is shown below.

Figure-3
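To make the contrast concrete, here is a tiny sketch of the difference between the standard batch-norm affine step and SPADE's spatially varying version. It is my own illustration of the idea, not code from the paper; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(8, 64, 32, 32)              # a batch of feature maps
mask = torch.randint(0, 5, (8, 1, 32, 32))  # a segmentation mask with 5 classes

# Standard batch norm: gamma and beta are single learned scalars per channel.
bn = nn.BatchNorm2d(64)                     # affine=True by default
y_bn = bn(x)

# SPADE idea: normalize without affine parameters, then let small convolutions
# predict a gamma and beta *per pixel* from the (resized) semantic mask.
bn_plain = nn.BatchNorm2d(64, affine=False)
to_gamma = nn.Conv2d(1, 64, 3, padding=1)
to_beta  = nn.Conv2d(1, 64, 3, padding=1)
m = F.interpolate(mask.float(), size=x.shape[2:], mode='nearest')
y_spade = bn_plain(x) * (1 + to_gamma(m)) + to_beta(m)
```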

Everything I said should make sense by this point. The conventional normalization is still carried out; only the affine parameters change, and they are essentially just a pair of convolutions applied to the input semantic map. I am grateful to the NVIDIA AI researchers for making this study so visually appealing. So, let's get started and write some code.

SPADE Implementation:

All of the code I'll be demonstrating today is from my GitHub repository, which you can see here. In this notebook I implemented a version of the SPADE paper with fewer features; I've added other features in other notebooks. Because it is a GAN, it has a generator block and a discriminator block, and the SPADE layer lives inside the generator block. The basic SPADE block comes first.

Figure-4

SOURCE: Basic block

It takes both the feature maps and the segmentation mask as input. The segmentation mask is simply a plain 2D long-integer mask; the paper recommends projecting the classes into a hidden embedding, but I chose to keep things simple, so the first convolutional layer has a single input filter. The mask is then resized to match the spatial size of the features: the SPADE layer is used at every level of the generator, so it needs to know the feature size in order to resize the mask before computing the affine parameters. Notice how I set affine to False when initializing the BatchNorm2d layer, to avoid the default learned affine parameters. To stabilize GAN training, spectral normalization is applied to all of the convolutional blocks, as in the paper. The variables ni and nf denote the number of input filters and output filters of the convolutional layers, respectively.
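For reference, here is a minimal sketch of what such a SPADE block can look like in PyTorch. It follows the description above (a single-channel long mask, BatchNorm2d with affine=False, spectral-norm convolutions), but the names SPADE and nhidden and the kernel sizes are my own choices rather than the exact code from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

class SPADE(nn.Module):
    """Minimal SPADE layer: BatchNorm without affine parameters, followed by
    scale (gamma) and shift (beta) maps computed from the segmentation mask."""
    def __init__(self, nf, nhidden=128):
        super().__init__()
        self.bn = nn.BatchNorm2d(nf, affine=False)  # plain normalization, no learned gamma/beta
        # Shared conv over the single-channel mask, then separate heads for gamma and beta.
        self.shared = nn.Sequential(
            spectral_norm(nn.Conv2d(1, nhidden, 3, padding=1)), nn.ReLU())
        self.gamma = spectral_norm(nn.Conv2d(nhidden, nf, 3, padding=1))
        self.beta  = spectral_norm(nn.Conv2d(nhidden, nf, 3, padding=1))

    def forward(self, x, mask):
        # Resize the mask to the spatial size of the features.
        mask = F.interpolate(mask.float(), size=x.shape[2:], mode='nearest')
        h = self.shared(mask)
        return self.bn(x) * (1 + self.gamma(h)) + self.beta(h)
```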

Figure-5

Here's the complete generator architecture. Gaussian noise is fed into a linear layer that maps 256 input values to 16384 output values, which are then reshaped into a (1024, 4, 4) tensor. A SPADE ResNet block with 1024 filters operates on this input (every Conv2d layer is fed a 4x4 input). The output is then upsampled with nearest-neighbor interpolation to twice its size before being passed to the next SPADEResBlk. There are seven such levels, so the output has a spatial size of 4 * 2**7 = 512. A 3x3 Conv2d with three filters (R, G, and B) and tanh activation serves as the output layer.

Figure-6

SOURCE: Generator

Now that we have the fundamental blocks in place, it's time to stack them up as shown in the design below (a rough code sketch follows the figure). I calculated the number of feature maps each layer should produce and used a for loop to construct the generator. To keep my code simple, I use techniques like initializing a module's arguments from a global variable. The nfs variable holds the output channel counts of all the SPADE residual blocks and is used to initialize the layers in the generator. Finally, I rescale the tanh output to the 0–1 range to make it easier to visualize.

Figure-7
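Putting the pieces together, a generator along these lines might look like the following sketch. It continues the SPADE sketch from above (same imports); the SPADEResBlk here is a simplified residual block, and the nfs channel counts are illustrative assumptions rather than the exact values from my repository.

```python
class SPADEResBlk(nn.Module):
    """Simplified SPADE residual block: two SPADE -> ReLU -> Conv stages plus a learned skip."""
    def __init__(self, ni, nf):
        super().__init__()
        self.spade1, self.conv1 = SPADE(ni), spectral_norm(nn.Conv2d(ni, nf, 3, padding=1))
        self.spade2, self.conv2 = SPADE(nf), spectral_norm(nn.Conv2d(nf, nf, 3, padding=1))
        self.skip = spectral_norm(nn.Conv2d(ni, nf, 1, bias=False))

    def forward(self, x, mask):
        h = self.conv1(F.relu(self.spade1(x, mask)))
        h = self.conv2(F.relu(self.spade2(h, mask)))
        return h + self.skip(x)

class SPADEGenerator(nn.Module):
    """256-d noise -> linear -> (1024, 4, 4) -> 7 x (SPADEResBlk + 2x upsample) -> RGB image."""
    def __init__(self, nfs=(1024, 1024, 1024, 512, 256, 128, 64)):
        super().__init__()
        self.fc = nn.Linear(256, 1024 * 4 * 4)
        blocks, ni = [], 1024
        for nf in nfs:                      # build the blocks in a for loop, as described above
            blocks.append(SPADEResBlk(ni, nf))
            ni = nf
        self.blocks = nn.ModuleList(blocks)
        self.out = nn.Conv2d(ni, 3, 3, padding=1)

    def forward(self, z, mask):
        x = self.fc(z).view(-1, 1024, 4, 4)
        for blk in self.blocks:
            x = blk(x, mask)
            x = F.interpolate(x, scale_factor=2, mode='nearest')  # double the resolution
        return (torch.tanh(self.out(x)) + 1) / 2                  # rescale tanh output to 0-1
```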

The discriminator is a multi-scale, PatchGAN-based discriminator. Multi-scale means that it classifies the given image at several different scales. PatchGAN-based means the final output layer is convolutional and the spatial mean of the patch predictions is taken. The outputs from the different scales are simply added to give the final output. The discriminator uses both the mask and the generated/real image at the same time; as shown in the implementation, its forward method accepts both the mask and the image as input.
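A rough sketch of such a multi-scale PatchGAN discriminator is shown below. The number of scales, the layer widths, and the way the mask is concatenated to the image channels are assumptions made for illustration, not necessarily the configuration used in my repository.

```python
class PatchDiscriminator(nn.Module):
    """PatchGAN head: a conv stack ending in a 1-channel conv; spatial mean of the patch scores."""
    def __init__(self, n_in=4, nf=64, n_layers=4):
        super().__init__()
        layers, ni = [], n_in
        for _ in range(n_layers):
            layers += [spectral_norm(nn.Conv2d(ni, nf, 4, stride=2, padding=1)),
                       nn.LeakyReLU(0.2)]
            ni, nf = nf, nf * 2
        layers += [nn.Conv2d(ni, 1, 4, padding=1)]      # per-patch real/fake score map
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).mean(dim=(1, 2, 3))          # spatial mean over the patch scores

class MultiScaleDiscriminator(nn.Module):
    """Runs PatchGAN discriminators on the input at several scales and sums the scores."""
    def __init__(self, n_scales=2):
        super().__init__()
        self.discs = nn.ModuleList([PatchDiscriminator() for _ in range(n_scales)])

    def forward(self, img, mask):
        # The mask is resized and concatenated to the image channels (3 RGB + 1 mask = 4).
        m = F.interpolate(mask.float(), size=img.shape[2:], mode='nearest')
        x, score = torch.cat([img, m], dim=1), 0
        for d in self.discs:
            score = score + d(x)
            x = F.avg_pool2d(x, 2)                      # downsample for the next, coarser scale
        return score
```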

Overall model architecture:

Figure-8

Loss Functions:

Because it is a GAN, there are two loss functions, one for the generator and one for the discriminator. The loss is the hinge loss from the SAGAN paper, which I discussed in my last blog. It is quite simple, consisting of only a few lines of code, but this is the part where I spent the most effort and learned how critical the loss function is in a deep learning problem. The loss function is given below.
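For reference, a hinge-style GAN loss along the lines used in SAGAN can be written as follows (F is torch.nn.functional, as imported above); treat this as a minimal sketch of the idea, applied to the scalar scores from the discriminator sketch, rather than the exact code in my repository.

```python
def discriminator_loss(real_scores, fake_scores):
    # Hinge loss: push scores for real images above +1 and scores for fakes below -1.
    return F.relu(1 - real_scores).mean() + F.relu(1 + fake_scores).mean()

def generator_loss(fake_scores):
    # The generator tries to maximize the discriminator's score on its fakes.
    return -fake_scores.mean()
```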

Preparing Data:

I used land-cover data from the Chesapeake Conservancy's Land Cover Data Project. I used ArcGIS Pro's "Export Training Data For Deep Learning" tool to extract all of the image chips from the classified raster. Once I had the images on disk, I loaded them with the fastai data block API so they could be fed to a fastai learner and its fit function. After building the necessary classes, I used the code below to create a fastai DataBunch object. The class I built is SpadeItemList, which is essentially the reverse of fastai's SegmentationItemList: fastai lets you create a class that inherits from one of its item or label classes and override a few methods to suit your needs, and it then constructs the DataBunch using that class. The DataBunch object has properties containing the PyTorch datasets and dataloaders, and it has a show_batch function that displays your data. Here is the result of my show_batch call.

Figure-9
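The data block code that produced a DataBunch like this might look roughly as follows. This is a sketch in the style of the fastai v1 data block API; SpadeItemList is my custom class, and the paths, helper function, and exact method chain are assumptions, so see the repository for the real code.

```python
# Rough fastai v1 data block sketch (assumed paths and batch size).
data = (SpadeItemList.from_folder(path_to_masks)   # the masks are the items (inputs)
        .split_by_rand_pct(0.1)                    # hold out 10% for validation
        .label_from_func(get_image_for_mask)       # hypothetical function mapping mask -> photo
        .databunch(bs=4))
data.show_batch(rows=2)
```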

Training Process:

In fastai, we build a learner that provides the fit method used to train the model. However, unlike image classification, a GAN requires us to switch between training the generator and training the discriminator. To use the fastai library we rely on its callback mechanism; I cloned and modified the GANLearner from the fastai library. The self.gen_mode flag tells the GANModule when to use the generator and when to use the discriminator. Fastai has a callback that switches the GAN at predefined intervals; for each generator step, I repeated the discriminator step five times. FixedGANSwitcher is used for this, and a few other callbacks are used as well. Please have a look at the code on my GitHub page. We can now use the fit method to train the model.
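To show what that 5:1 switching schedule amounts to, here is a plain-PyTorch illustration of one training step. In my code this alternation is handled by fastai's GANLearner and FixedGANSwitcher callbacks; the batches iterator, optimizers, and loss helpers here are placeholders for illustration.

```python
def train_step(G, D, opt_g, opt_d, batches, n_crit=5):
    # Several discriminator (critic) updates per generator update...
    for _ in range(n_crit):
        mask, real = next(batches)                      # (segmentation mask, real photo) pair
        z = torch.randn(real.size(0), 256)              # Gaussian noise input for the generator
        with torch.no_grad():
            fake = G(z, mask)
        loss_d = discriminator_loss(D(real, mask), D(fake, mask))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # ...then a single generator update.
    mask, real = next(batches)
    z = torch.randn(real.size(0), 256)
    loss_g = generator_loss(D(G(z, mask), mask))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```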

Results:

The results are not photorealistic yet, but with enough training time and computation, and after ironing out the remaining issues, the model should be able to produce good-looking images. The first images generated by the model are shown below.

Figure-10

Figure-11

Note: The image quality in Figure-10 is relatively low; after training for a significantly longer time, we can see the better-quality images in Figure-11.

Conclusion:

I introduced spatially-adaptive normalization, which makes use of the input semantic layout when performing the affine transformation in the normalization layers. The proposed normalization leads to the first semantic image synthesis model capable of producing photorealistic outputs for a wide range of scenes, including indoor, outdoor, landscape, and city scenes. The Fastai library was used for this implementation.

Github:

References:
