2.1 Data augmentation model
The data augmentation model F2GAN comes from the field of few-shot image generation. We use F2GAN because it requires less training data than common GANs when generating diverse images from the input data. As a variant of GAN, F2GAN has a generator and a discriminator. The generator follows a U-Net architecture with skip connections, and the discriminator consists of convolutional layers, down-sampling, and fully connected layers. During training, the model first maps the input images to a low-dimensional feature space by down-sampling, and then maps them back from the low-dimensional feature space to the sample space. In this mapping-back process, the model fuses features from different input images by weighted fusion to generate new images. This is how new images are generated from limited input without requiring much target-scene data. To recover the low-level details that are lost during down-sampling, F2GAN uses skip connections to pass these details directly from input to output, and the fusion weights can also be used to selectively fill in the low-level details. The generated images are evaluated by the discriminator, which distinguishes real images from generated ones: it first applies convolution and pooling to the input images, and the resulting features are then classified by the fully connected layers. The output of the generator is judged by the discriminator, which provides the error for backpropagation. The overall training process is similar to that of a common GAN.
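As an illustration of this encoder-decoder structure, the following is a minimal PyTorch-style sketch of a U-Net-like generator with one skip connection and a simple convolutional discriminator. The layer and channel sizes are illustrative assumptions and do not reproduce the exact F2GAN architecture:

```python
import torch
import torch.nn as nn

class GeneratorSketch(nn.Module):
    """Minimal U-Net-style encoder-decoder with one skip connection.

    Channel counts are illustrative; the real F2GAN generator is deeper
    and fuses features coming from several conditional input images."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU())
        self.bottleneck = nn.Sequential(nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        # after concatenating the skip connection, the channel count doubles (32 + 32)
        self.up2 = nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1)

    def forward(self, x):
        d = self.down(x)              # low-level details kept for the skip connection
        b = self.bottleneck(d)        # low-dimensional feature space
        u = self.up1(b)
        u = torch.cat([u, d], dim=1)  # skip connection passes details to the decoder
        return torch.tanh(self.up2(u))

class DiscriminatorSketch(nn.Module):
    """Convolution + down-sampling + fully connected classifier."""
    def __init__(self, image_size=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64 * (image_size // 4) ** 2, 1)

    def forward(self, x):
        f = self.features(x)
        return self.fc(f.flatten(1))  # real/fake score
```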
For F2GAN, the core equation is the loss function, shown in (1). L is the total loss; LD is the discriminator loss; LGD is the generator loss; L1 is the weighted reconstruction loss; Lc is the classifier loss; Lm is the mode-seeking loss; La is the interpolation regression loss. \({\lambda }_{1}\), \({\lambda }_{m}\), and \({\lambda }_{a}\) are trade-off parameters. More details are given in the literature [33].
$$L={L_D}+{L_{GD}}+{\lambda _1}{L_1}+{L_c} - {\lambda _m}{L_m}+{\lambda _a}{L_a}$$
1
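As a minimal sketch, the total loss in (1) can be assembled from its components as follows; the default trade-off values below are placeholders, and the actual settings follow [33]:

```python
def f2gan_total_loss(L_D, L_GD, L_1, L_c, L_m, L_a,
                     lambda_1=1.0, lambda_m=1.0, lambda_a=1.0):
    """Total F2GAN loss as in Eq. (1).

    lambda_1, lambda_m and lambda_a are trade-off parameters; the values
    used here are placeholders, and the actual settings follow [33]."""
    return L_D + L_GD + lambda_1 * L_1 + L_c - lambda_m * L_m + lambda_a * L_a
```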
F2GAN differs from other GANs in two main ways. First, F2GAN can exploit low-level details of the input images through a non-local attention module when generating new images, which enables it to produce diverse and clear images. This process is modulated by the fusion weights: the low-level details come from the down-sampling process, and they are filled in during up-sampling. The use of low-level details relies on the skip connections that directly link the down-sampling and up-sampling paths. Second, F2GAN is a few-shot image generation model, which means it augments the data without requiring much target data. F2GAN therefore satisfies the practical needs of welding, which is why we adopt it.
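A minimal sketch of the weighted feature fusion described above is given below. It shows only the weighted-sum step and assumes the interpolation weights sum to one; the non-local attention module of F2GAN, which decides where the fused low-level details are filled in, is omitted:

```python
import torch

def fuse_features(features, weights):
    """Weighted fusion of encoder features from several conditional images.

    features: tensor of shape (k, C, H, W), one feature map per input image
    weights:  tensor of shape (k,) that sums to 1 (interpolation coefficients)
    """
    w = weights.view(-1, 1, 1, 1)          # broadcast one weight per input image
    return (w * features).sum(dim=0)       # weighted sum over the k inputs

# example: fuse three encoder feature maps with random interpolation weights
feats = torch.randn(3, 32, 16, 16)
w = torch.softmax(torch.randn(3), dim=0)
fused = fuse_features(feats, w)            # shape (32, 16, 16)
```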
Before using F2GAN, we make some adaptive modifications. First, to prevent overfitting, we apply dropout [34] in both the generator and the discriminator. To obtain the most varied network architectures and the best generalization effect, the dropout ratio is set to 0.5. Unlike the usual practice, dropout is applied in both the training process and the testing process. Expressed in formula:
$$\begin{gathered} {x_{i+1}}=f(W\cdot {x_i}+b){\text{ }} \Rightarrow {\text{ }}{x_{i+1}}=f(W\cdot mask({x_i})+b) \hfill \\ {\text{ }}mask(x)=m\cdot x,{\text{ }}m \in {\{ 0,1\} ^D} \hfill \\ \end{gathered}$$
2
Among them, p is the dropout ratio, xi+1 is the output of the i-th layer of the network, the first equation in (2) describes a layer without dropout, the second describes a layer with dropout, and the mask m is generated by a Bernoulli distribution with probability 0.5.
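A minimal PyTorch sketch of this modification is shown below. Forcing training=True keeps the Bernoulli mask active in the test phase as well; the 1/(1 − p) rescaling applied by F.dropout is the standard inverted-dropout convention rather than part of (2):

```python
import torch.nn as nn
import torch.nn.functional as F

class AlwaysOnDropout(nn.Module):
    """Dropout with p = 0.5 that stays active in both training and testing.

    Standard nn.Dropout is switched off by model.eval(); passing training=True
    keeps the Bernoulli mask of Eq. (2) during the test phase as well."""
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        # training=True keeps the random mask even when the model is in eval mode
        return F.dropout(x, p=self.p, training=True)
```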
Second, to reduce the side effects caused by wrong labels, prevent overconfidence, and at the same time preserve the shape of the optimal decision function for varied outputs, we use one-sided label smoothing to tune the training process of F2GAN. Expressed in formula:
$${P_i}=\left\{ \begin{gathered} 1,{\text{ }}if(i=y) \hfill \\ 0,{\text{ }}if(i \ne y) \hfill \\ \end{gathered} \right.{\text{ }} \Rightarrow {\text{ }}{P_i}=\left\{ \begin{gathered} (1-\varepsilon ),{\text{ }}if(i=y) \hfill \\ 0,{\text{ }}if(i \ne y) \hfill \\ \end{gathered} \right.$$
3
$$Loss=-\sum\limits_{{i=1}}^{K} {{p_i}\log {q_i}{\text{ }} \Rightarrow } {\text{ }}Los{s_i}=\left\{ \begin{gathered} (1-\varepsilon )*Loss,{\text{ }}if(i=y) \hfill \\ Loss,{\text{ }}if(i \ne y) \hfill \\ \end{gathered} \right.$$
4
$$L_{{D\_new}}=\left\{ \begin{gathered} (1 - \varepsilon )\cdot {L_D},{\text{ }}if(i=y) \hfill \\ {L_D},{\text{ }}if(i \ne y) \hfill \\ \end{gathered} \right.{\text{ ,  }}L_{{GD\_new}}=\left\{ \begin{gathered} (1 - \varepsilon )\cdot {L_{GD}},{\text{ }}if(i=y) \hfill \\ {L_{GD}},{\text{ }}if(i \ne y) \hfill \\ \end{gathered} \right.$$
5
$$L={L_{D\_new}}+{L_{GD\_new}}+{\lambda _1}{L_1}+{L_c} - {\lambda _m}{L_m}+{\lambda _a}{L_a}$$
6
Among them, Pi is the true probability distribution and \(\epsilon\) is a small hyper-parameter, set to 0.1 in this article. \({L}_{D\_new}\) is the new loss function of the discriminator and \({L}_{GD\_new}\) is the new loss function of the generator. The left-hand sides of (3) and (4) show the original form, and the right-hand sides show the tuned form. The tuned Pi reduces the confidence of the label, which in turn reduces the risk of overconfidence. After one-sided label smoothing, the model restrains the difference between positive labels and negative samples, which enhances the generalization ability.
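The following sketch illustrates one-sided label smoothing on the discriminator side, assuming a binary cross-entropy adversarial loss for simplicity; only the real (positive) label is softened to 1 − ε = 0.9, as in (3)-(5):

```python
import torch
import torch.nn.functional as F

def d_loss_one_sided_smoothing(real_logits, fake_logits, eps=0.1):
    """Discriminator loss with one-sided label smoothing (eps = 0.1).

    Only the real label is softened from 1 to 1 - eps; the fake label
    stays at 0, matching the one-sided scheme of Eqs. (3)-(5)."""
    real_target = torch.full_like(real_logits, 1.0 - eps)  # smoothed positive label
    fake_target = torch.zeros_like(fake_logits)            # negative label unchanged
    loss_real = F.binary_cross_entropy_with_logits(real_logits, real_target)
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, fake_target)
    return loss_real + loss_fake
```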
Considering that the stability of a GAN can be affected by sparse gradients, we replace the ReLU activation function [35] in the discriminator with leaky ReLU [36] to avoid this phenomenon. Expressed in formula:
$$\operatorname{Re} LU(x)=\hbox{max} (0,x)$$
7
$$Leaky{\text{ }}\operatorname{Re} LU(x)=\hbox{max} (0,x)+\gamma \hbox{min} (0,x)$$
8
Equation (7) is the operation of ReLU and (8) is the operation of leaky ReLU, where γ is a small hyper-parameter, set to 0.01 in this article. With leaky ReLU, the activation function remains active for all inputs, which improves the stability of the model by avoiding sparse gradients.
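A short numerical illustration of the replacement is given below; the negative slope corresponds to γ = 0.01:

```python
import torch
import torch.nn as nn

# Replacing ReLU with LeakyReLU (gamma = 0.01) in the discriminator layers
relu = nn.ReLU()
leaky = nn.LeakyReLU(negative_slope=0.01)

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(relu(x))   # tensor([0.0000, 0.0000, 0.0000, 1.5000])  -> zero output/gradient for x < 0
print(leaky(x))  # tensor([-0.0200, -0.0050, 0.0000, 1.5000]) -> small gradient is kept
```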
At the same time, since the images collected from the same weld bead are similar, we use two methods to increase the variance of the input to the modified F2GAN: one is adding Gaussian noise to the input, and the other is applying a gamma transformation. Gaussian noise means the probability density function of the noise follows a Gaussian distribution, whose expression is shown below:
$${p_G}(z)=\frac{1}{{\sigma \sqrt {2\pi } }}\exp (-\frac{{{{(z-\mu )}^2}}}{{2{\sigma ^2}}})$$
9
$$\begin{gathered} G(input){\text{ }} \Rightarrow {\text{ }}G(noised(input)) \hfill \\ D(input){\text{ }} \Rightarrow {\text{ }}\left\{ \begin{gathered} D(noised(input)),{\text{ }}for{\text{ }}real{\text{ }}image \hfill \\ D(G(noised(input))),for{\text{ }}generated \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered}$$
10
In (9), µ and σ² are the expectation and variance of the Gaussian distribution; in this article µ is 0 and σ ranges from 0 to 0.03. In the gamma transformation, gamma varies from 3/4 to 4/3. In this way, the input of the modified F2GAN differs from the original image but is not distorted, which alleviates overfitting. The inputs of the generator and the discriminator are noised as shown in (10).
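A minimal sketch of the two input perturbations is given below, assuming zero-mean noise with standard deviation up to 0.03, a gamma exponent drawn from [3/4, 4/3], and images normalized to [0, 1]; the order of the two operations is an assumption for illustration:

```python
import numpy as np

def augment_input(img, sigma=0.03, gamma=None, rng=None):
    """Perturb a [0, 1] grayscale image before it is fed to the modified F2GAN.

    sigma: standard deviation of the additive zero-mean Gaussian noise (0 ~ 0.03)
    gamma: exponent of the gamma transform, drawn from [3/4, 4/3] when not given
    """
    rng = rng or np.random.default_rng()
    if gamma is None:
        gamma = rng.uniform(3.0 / 4.0, 4.0 / 3.0)
    noised = img + rng.normal(loc=0.0, scale=sigma, size=img.shape)  # Eq. (9) noise
    noised = np.clip(noised, 0.0, 1.0)  # keep the image in a valid range
    return noised ** gamma              # gamma transformation

# example usage on a random 64 x 64 image
img = np.random.rand(64, 64)
augmented = augment_input(img)
```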