Linear-ResNet GAN-based anime style transfer of face images

Converting real-world images directly into high-quality anime styles using generative adversarial networks is one of the research hotspots in computer vision. The currently popular AnimeGAN and WhiteBox anime generative adversarial networks suffer from distortion of image features and loss of detail in lines and textures. To address these problems, we introduce AnimationGAN. To preserve image details, we use linear bottlenecks in the residual network; furthermore, we employ a hybrid attention mechanism to capture the salient information in images. In addition, we adopt optimized normalizations to improve the accuracy and learning rate of the model. The experimental results show that, compared with AnimeGAN and WhiteBox, the proposed AnimationGAN achieves a smaller FID to cartoon (61.73), a better IS (6.79) and faster network training (405 s per epoch). In summary, the generated animation images show significantly improved line and texture detail and image feature retention, with much faster network training.


Introduction
Animation is a prevalent form of artistic expression, yet its production process is very challenging. To date, most animations are drawn by hand and then rendered by computer. This process is lengthy and usually incurs heavy costs. As a result, it is economically promising to seek a simple approach that transforms a realistic image into a comic image with apparent comic features.
In recent years, with continuous exploration and research on methods for transforming the surrounding real scenes into animation images, the GAN (generative adversarial network) was proposed, which is capable of transforming a real scene from simple texture to texture generation, so as to realize the style transfer of real images. In 2018, Y. Chen et al. proposed a new GAN framework, CartoonGAN, which is mainly used to render realistic images in a cartoon style similar to Hayao Miyazaki's [1]. In 2019, AnimeGAN was proposed by J. Chen et al. at ISICA to improve on CartoonGAN [2]. In 2020, X. Wang proposed the WhiteBox network, which greatly simplifies the network parameters and improves the speed of image generation [3]. Later, S. Hicsonmez et al. proposed the GANILLA network, which studied the transfer of illustration comic styles [4].
However, for the current mainstream GAN-based animation generation networks, such as AnimeGAN and WhiteBox, it is difficult to match the style transfer to the content structure of the image when dealing with face images with complex detail and salient features. As a result, problems arise such as feature deformation, lack of style color, and loss of line and texture detail.
In order to address the aforementioned problems, we propose a new network called AnimationGAN. The main contributions of this work are summarized as follows.
(1) In the residual module, we use a linear bottleneck residual block with fewer parameters instead of the inverted residual block of mainstream networks, which not only greatly reduces the number of model parameters, but also prevents nonlinearities from destroying too much information, thus preserving the details of lines and textures.
(2) In animation style transfer, some salient information often needs special attention. To capture it better, we introduce a hybrid attention mechanism module in the last layer of the residual block. The attention mechanism better captures the tone and texture of anime faces and matches them with real faces, further improving the effect of animation style transfer.
(3) The normalizations are optimized by adopting batch group normalization in the generator and spectral normalization in the discriminator. The optimized normalizations help accelerate network training, leading to improved accuracy and learning rate.
The subsequent sections of this paper are arranged as follows. In Sect. 2, we briefly review the works related to animation style transfer. In Sect. 3, we propose the architecture of our network and describe the key ingredients in detail. The experimental results are presented in Sect. 4. Finally, we draw conclusions in Sect. 5.

Related work
With the continuous development of deep learning, GAN-based style transfer technology has quietly emerged. In 2017, Zhu et al. proposed an unsupervised generative adversarial network, CycleGAN, which contains two pairs of generative adversarial networks for bidirectional domain transformation [5]. This method designs a cycle-consistent adversarial loss and constructs a framework that can use unpaired data for image conversion. In 2018, Yang Chen et al. analyzed the general characteristics of animation images, added an edge-promoting adversarial loss on top of GAN and proposed CartoonGAN, a network architecture suitable for animation style transfer, which successfully realized image animation [1]. Jie Chen et al. proposed a lightweight network called AnimeGAN on the basis of CartoonGAN [2]. AnimeGAN introduces an inverted residual network and constructs three loss functions: grayscale style loss, grayscale adversarial loss and color reconstruction loss, which makes the animation texture features of the generated images more significant and reduces the local artifacts present in the images generated by CartoonGAN. However, AnimeGAN and CartoonGAN generate images with serious loss of detail and color distortion. Based on the characteristics of animation artists' painting behavior, Wang et al. proposed the WhiteBox method [3]. This method uses three independent representation models of surface, structure and texture to 'white-box' animation images, achieving targeted optimization and adjustment of animation images. Later, S. Hicsonmez et al. proposed the GANILLA network for the style transfer of children's comics [4]. By adding skip connections between the layers of the network, the style transfer effect is effectively improved.
Attention mechanisms have been widely used in deep learning in different fields, including natural language processing [6], image processing [7][8][9][10], machine learning [11], etc. Significant progress has been made in research combining GANs with attention mechanisms. In 2021, Shuo Yang et al. proposed a text-to-image generation method based on a multi-attention deep residual generative adversarial network, which introduced the CBAM attention mechanism to improve the quality of high-resolution image generation [12], corroborating that the CBAM module delivers outstanding performance in feature extraction and image generation tasks. In the same year, Yang Liu et al. proposed a network intrusion detection method based on CNN and CBAM [13]. The method implanted a lightweight CBAM into CNN networks and demonstrated that CBAM can be integrated into various convolutional neural networks for end-to-end training.
Based on the above existing works, we know that a single generator is not effective in dealing with face images with complex details and salient features. For face images with rich information and obvious features, the use of linear bottleneck residual blocks and a hybrid attention mechanism is a direction worth researching. Therefore, we propose a new network called AnimationGAN, which can process the complex information and salient features of face images.

Architecture of AnimationGAN
The neural network we propose consists of a generator and a discriminator. The generator learns the distribution of the anime-style dataset and generates anime-style face images; the discriminator takes the generated images and the real images as input and learns to distinguish the generated images from the real images as well as possible. All the images not generated by the generator are passed through VGG19 to compute the grayscale adversarial loss $L_{gra}(G, D)$. The generator ultimately produces images that the discriminator cannot distinguish from the real data distribution.
Our generator can be treated as an encoder-decoder structure. The encoding layer consists of Conv-Block, Down-Conv, DSConv and other modules; the decoding layer consists of Conv-Block, Up-Conv, DSConv and other modules. Between the encoding and decoding layers is a residual network consisting of 8 layers of linear bottleneck residual blocks+CBAM [14]. Thanks to the Convolutional Block Attention Module, the residual blocks can also extract the salient features of face images [15]. The image features are extracted by the encoding layer, the style transfer of the image is performed by the residual module, the image features are then restored by the decoding layer, and finally the image is generated by the convolutional layer.

Fig. 1 The structure of the generator and discriminator of AnimationGAN. In the generator, the numbers such as 64, 128 and 256 above the boxes stand for the number of channels; a is the anime image, x is the anime grayscale map, e is the de-linearized anime image, y is the de-linearized anime grayscale map, and G(p) is the image produced by the generator. In the discriminator, "K" is the kernel size, "S" is the stride of each convolutional layer, "N" is the number of feature maps, and Spec_Norm represents the spectral normalization layer.
In our discriminator, the first four layers use LeakyReLU activation functions for extracting features of the input image, and the last convolutional layer converts them into a one-dimensional feature vector and outputs it for discrimination. The discriminator also uses spectral normalization [16] instead of instance normalization [17] to prevent parameter magnitudes from exploding and to avoid abnormal gradients. This not only makes the training of the network more stable and reliable but also reduces the training time and the training requirements of the network. The structure of the generator and discriminator of AnimationGAN is shown in Fig. 1.
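For illustration, below is a minimal PyTorch sketch of a discriminator in this style: spectrally normalized convolutions with LeakyReLU extract features, and a final convolution outputs a one-channel decision map. The channel widths, kernel sizes and strides are illustrative assumptions loosely following Fig. 1, not exact values from the paper.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm


def make_discriminator(in_channels: int = 3, base: int = 32) -> nn.Sequential:
    """Sketch of a spectrally normalized convolutional discriminator."""
    layers, c = [], in_channels
    for i, out_c in enumerate([base, base * 2, base * 4, base * 8]):
        stride = 2 if i > 0 else 1  # downsample after the first layer (assumption)
        layers += [
            spectral_norm(nn.Conv2d(c, out_c, kernel_size=3, stride=stride, padding=1)),
            nn.LeakyReLU(0.2, inplace=True),
        ]
        c = out_c
    # Final one-channel convolution produces the real/fake decision map.
    layers.append(spectral_norm(nn.Conv2d(c, 1, kernel_size=3, stride=1, padding=1)))
    return nn.Sequential(*layers)
```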
In Fig. 1, Conv-Block is the standard convolution module, DSConv is the depthwise separable convolution [18], Up-Conv is the up-convolution, and Down-Conv is the down-convolution. Between the encoder and decoder are linear bottlenecks; compared with the inverted residual structure, the linear bottleneck residual blocks reduce the loss of image information, making it possible to process images with complex detail. In the non-residual part, we use convolutional layers (from N32 to N256 in Fig. 1) similar to those of mainstream networks (e.g., WhiteBox and AnimeGAN) and an optimized normalization that improves training efficiency. See Fig. 2 for the detailed unfolded structure.
All convolutional and residual layers except the final output layer use batch group normalization and the LeakyReLU activation function [19]. Compared with instance normalization, this speeds up the iterative convergence of the network and requires much less hardware. Meanwhile, LeakyReLU is used in the activation layer to avoid the vanishing gradients and inactive neurons that ReLU produces in the negative interval. The linear bottlenecks+CBAM in the ResNet reduce the dimension first and then use linear activation instead of the ReLU activation function in the end layer to retain complex details, salient features and the content structure.

Convolutional block attention module
The attention mechanism is inspired by the study of human vision: when viewing an object, people can easily distinguish between salient and non-salient areas and thus obtain important information. Attention mechanisms highlight important features and suppress irrelevant information by applying attention weights to image features. The common attention mechanisms mainly include spatial attention mechanisms [20], channel attention modules [21] and hybrid attention mechanisms [15,22].
The spatial attention module applies attention weights from the spatial scale of image features to make the model focus on the region of the target, while the channel attention module applies attention weights from the channel scale of image features to make the model focus on the specific target.
The CBAM (Convolutional Block Attention Module), a representative model of the hybrid attention mechanism, combines the attention mechanisms of the spatial and channel dimensions of image features and can therefore perform a comprehensive analysis of them [15], so that the network focuses more accurately on the target features in the image. The structure of CBAM is shown in Fig. 3.
Given an image feature map $F \in \mathbb{R}^{C \times H \times W}$ as input, CBAM sequentially outputs a 1D channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a 2D spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$, as illustrated in Fig. 3. The overall attention algorithm can be summarized as:

Algorithm 1 Algorithm of CBAM
Require: the input feature map $F$
Ensure: the output result $F''$
1. Compute the channel attention map $M_c(F)$ and the channel-refined feature $F' = M_c(F) \otimes F$.
2. Compute the spatial attention map $M_s(F')$ and the output $F'' = M_s(F') \otimes F'$.
Here $\otimes$ denotes element-wise multiplication, with the attention maps broadcast along the missing dimensions.

We introduce this convolutional attention module and apply it to the deep residual blocks of the generative network to better resolve image features.
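For concreteness, the following is a minimal PyTorch sketch of the CBAM computation above. The reduction ratio of 16 in the channel attention MLP and the 7×7 kernel in the spatial attention are the defaults of the original CBAM paper and are assumptions here, not values confirmed by this text.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to both the average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # M_c(F) in R^{C x 1 x 1}
        return torch.sigmoid(self.mlp(self.avg_pool(f)) + self.mlp(self.max_pool(f)))


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # M_s(F) in R^{1 x H x W}: convolve the channel-wise average and max maps.
        avg_map = f.mean(dim=1, keepdim=True)
        max_map = f.max(dim=1, keepdim=True)[0]
        return torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))


class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        f1 = self.channel_att(f) * f      # F' = M_c(F) ⊗ F
        return self.spatial_att(f1) * f1  # F'' = M_s(F') ⊗ F'
```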

Linear bottleneck residual blocks with CBAM
Linear bottleneck residual networks were first proposed in MobileNetV2 [14]. It has been found that the convolution kernels in the depthwise part tend to fail when using the inverted residual network, so the values within the convolution kernels are mostly zero. This is caused by the ReLU, which maps the low-dimensional information to a high-dimensional space during the transformation and then remaps it back to the low-dimensional space through the ReLU. Therefore, if the output dimension is relatively high, the information loss in the transformation is small; when the output dimension is relatively low, the information loss is large. So the linear bottleneck reduces the dimension first and then uses linear activation instead of the ReLU activation function in the end layer. MobileNetV2 also proved that using linear bottlenecks prevents the nonlinearities from destroying too much information, thus preserving the details of lines and textures.
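As a concrete illustration, here is a minimal sketch of a linear bottleneck residual block in this spirit: a 1×1 convolution reduces the channel dimension, a depthwise 3×3 convolution filters the reduced features, and a final 1×1 convolution restores the channels with linear (identity) activation. The reduction factor of 2 is an illustrative assumption, and the normalization layers (BGN in this paper) are omitted for brevity.

```python
import torch
import torch.nn as nn


class LinearBottleneckRB(nn.Module):
    """Sketch of a linear bottleneck residual block: reduce, filter, restore."""

    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        hidden = channels // reduction
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),  # 1x1: reduce dimension
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise 3x3
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),  # 1x1: restore, linear activation
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # identity shortcut; the last layer stays linear
```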
We choose linear bottleneck residual blocks to replace the inverted residual blocks (IRBs) used in mainstream networks. Since the linear bottleneck residual blocks sit in the middle layers of the generator, the feature maps of these layers tend to have larger receptive fields and contain more image information, so we also combine CBAM with the linear bottleneck residual blocks to capture the important information in images more effectively. The structures of Linear-Bottleneck-RB and Linear-Bottleneck-RB+CBAM are shown in Fig. 4.
We introduce CBAM in the last layer of the Linear-Bottleneck-RB because CBAM can more comprehensively perform attention learning on the abstract features in the residuals, obtaining the attentional features needed for the target task; a sketch of the combined block is given below.
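The following minimal sketch combines the CBAM and LinearBottleneckRB sketches above; applying CBAM to the bottleneck output before the residual addition is our reading of Fig. 4, not a detail stated explicitly in the text.

```python
import torch
import torch.nn as nn


class LinearBottleneckCBAM(nn.Module):
    """Sketch of Linear-Bottleneck-RB+CBAM: attend, then add the shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        self.bottleneck = LinearBottleneckRB(channels)  # sketch defined above
        self.cbam = CBAM(channels)                      # sketch defined above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-weight the bottleneck features with CBAM before the identity shortcut.
        return x + self.cbam(self.bottleneck.block(x))
```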
In order to obtain a better effect of face animation style transfer, the model must be trained to focus on animation features, for example, in transforming animation face style, features such as line strokes, facial color, and facial structure of anime faces need to be focused on. By obtaining the corresponding attention weights in the channel and spatial dimensions of image features, CBAM can "guide" the network to accurately locate the target features and perform finer optimization, which can effectively improve the stylization effect of the target features and the stability of the content structure.

Batch group normalization and spectral normalization
In the generator, we use BGN, which is more efficient in terms of parameters and computation, instead of the instance normalization used in mainstream generators [19]. Although IN, LN and PN perform well in specific tasks, they are less general and perform poorly in general style transfer. To solve the above problems, BGN merges the channel and spatial dimensions into a single new dimension, reshaping the feature map to $F \in \mathbb{R}^{N \times D}$ with $D = C \times H \times W$. This new dimension is divided into $G$ groups (a hyper-parameter), each containing $S = D/G$ features, and the mean $\mu_g$ and variance $\delta_g^2$ of group $g$ are computed over the batch and the group jointly:

$$\mu_g = \frac{1}{N S} \sum_{n=1}^{N} \sum_{i \in \mathcal{S}_g} F_{n,i}, \qquad \delta_g^2 = \frac{1}{N S} \sum_{n=1}^{N} \sum_{i \in \mathcal{S}_g} \left(F_{n,i} - \mu_g\right)^2,$$

and each feature is normalized as $\hat{F}_{n,i} = (F_{n,i} - \mu_g)/\sqrt{\delta_g^2 + \epsilon}$, where $\epsilon$ is a small number added for the stability of the division. Batch group normalization not only inherits the advantages of batch normalization, such as a larger learning rate, a stable training process and very high training speed, but also has the strengths of group normalization, such as avoiding the influence of the batch size and speeding up the convergence of the network [23,24]. Therefore, BGN has better performance, stability and generalizability, and does not require additional trainable parameters, information across multiple layers or iterations, or additional computation.
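A minimal sketch of BGN following the equations above, assuming $D$ is divisible by $G$; G = 32 is an illustrative default rather than a value stated in the paper, and the running statistics that a full implementation would track for inference are omitted for brevity.

```python
import torch
import torch.nn as nn


class BatchGroupNorm(nn.Module):
    """Sketch of BGN: merge (C, H, W) into one dimension, normalize per group
    over the batch and the group jointly."""

    def __init__(self, num_channels: int, num_groups: int = 32, eps: float = 1e-5):
        super().__init__()
        self.num_groups = num_groups
        self.eps = eps
        # Per-channel affine parameters, as in batch normalization.
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # Merge (C, H, W) into one dimension D, then split it into G groups.
        f = x.reshape(n, self.num_groups, -1)      # (N, G, S) with S = D / G
        mu = f.mean(dim=(0, 2), keepdim=True)      # statistics over batch AND group
        var = f.var(dim=(0, 2), unbiased=False, keepdim=True)
        f = (f - mu) / torch.sqrt(var + self.eps)
        x = f.reshape(n, c, h, w)
        return x * self.weight.view(1, c, 1, 1) + self.bias.view(1, c, 1, 1)
```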
In the discriminator, we use spectral normalization instead of instance normalization. The discriminator network $f$ can generally be regarded as a composite function, i.e., a nested operation of many layers $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$, so its Lipschitz constant satisfies

$$\|f\|_{Lip} \le \prod_{l=1}^{L} \|f_l\|_{Lip}.$$

The activation functions ReLU and LeakyReLU all satisfy 1-Lipschitz continuity, so to make the Lipschitz constant of the discriminator not exceed 1, we only need to ensure that $\sigma(W) = 1$ during the convolution operation, where $\sigma(W)$ is the spectral norm (largest singular value) of the weight matrix $W$. So for the weights $W$ of each convolutional layer, we normalize

$$W_{SN} = \frac{W}{\sigma(W)},$$

which gives $\sigma(W_{SN}) = 1$. Substituting this constraint into the previous inequality, we get $\|f\|_{Lip} \le 1$, so the discriminator is finally restricted to be 1-Lipschitz.
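In practice, $\sigma(W)$ is estimated cheaply by power iteration; below is a minimal sketch of one such step on a weight tensor reshaped to a 2D matrix (in PyTorch, `torch.nn.utils.spectral_norm`, used in the discriminator sketch earlier, wraps this logic for a layer). The vector `u` is a persistent buffer that is reused and refined across training steps.

```python
import torch
import torch.nn.functional as F


def spectral_norm_estimate(w: torch.Tensor, u: torch.Tensor, n_iters: int = 1):
    """One power-iteration estimate of sigma(W), the largest singular value."""
    w2d = w.reshape(w.shape[0], -1)          # flatten a conv kernel to a 2D matrix
    for _ in range(n_iters):
        v = F.normalize(w2d.t() @ u, dim=0)  # right singular vector estimate
        u = F.normalize(w2d @ v, dim=0)      # left singular vector estimate
    sigma = torch.dot(u, w2d @ v)            # sigma(W) ≈ u^T W v
    return sigma, u

# The layer then uses W / sigma(W) in its forward pass, enforcing sigma(W_SN) = 1.
```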
According to recent insights into network parameter tuning for GANs, the conditioning of the network largely determines the success or failure of training, because the training of GANs is always unstable. However, normalization helps to speed up training and improve accuracy and the learning rate. Miyato et al. stabilized the training of GANs by applying spectral normalization to the discriminator network [16], because doing so limits the spectral norm of every layer, thereby constraining the Lipschitz constant of the discriminator. Compared with other normalization algorithms, spectral normalization does not require additional hyperparameter tuning and is relatively less computationally expensive. Therefore, inspired by this study, this paper applies spectral normalization to the discriminator of AnimationGAN to prevent parameter magnitudes from exploding and to avoid anomalous gradients. The experiments show that the spectral normalization of the discriminator significantly reduces the computational cost of training and also makes the training more stable.

Experimental setup and datasets
The experimental platform used in this paper is an Intel i7-1165G7, a 4-core processor with a base frequency of 2.8 GHz, an Nvidia GTX 1660 Ti graphics processor, and 16 GB of memory. In the experiments, the batch size is set to 12, the initial learning rate to 0.0002, and the number of training epochs to 50, with 554 iterations per epoch; the number of initialization epochs is 5, and the network is optimized using the Adam optimizer. The resulting experimental model can be made available to others for subsequent improvement.
The training data contain authentic and animation images, while the test images contain only genuine photographs. To better demonstrate the effect of the improved network, the experiments were run on the dataset selected by AnimeGAN with a resolution of 256×256. 1793 anime images in Hayao Miyazaki's style were used for the training set, and 792 real-world images were used for the test set. The validation set contains 68 authentic images. Since the real-world dataset was not paired with the anime image dataset, 6656 real-world photographs were prepared and trained with the CycleGAN strategy to ensure that the anime-style images generated by our generator matched the authentic images.

Image generation effect and comparative analysis
We compare the images generated by AnimationGAN with those of several face style transfer GANs, namely CartoonGAN, WhiteBox, AnimeGAN and GANILLA. The anime faces generated by the different networks are shown in Fig. 5. These four methods can all effectively capture the anime style. CartoonGAN preserves the color and content of the anime image well, but local areas of its generated images produce obvious color artifacts and lose the color of the original content image. The WhiteBox network effectively reduces the artifacts of the generated images, and the stylization effect is obvious while fine details are retained to some extent; however, the method can lead to excessive smoothing or distortion of the face features, and in serious cases the face collapses. The images generated by AnimationGAN not only reduce the artifacts but also largely retain the face features, the content structure and the colors of the corresponding areas; in addition, the overall anime coloring of the generated images is more significant. It is worth noting that AnimeGAN and AnimationGAN share similar network frameworks and loss functions (i.e., grayscale style loss, grayscale adversarial loss and color reconstruction loss). However, we add the linear bottleneck structure to the generative network, so that information such as the content and color of the generated images is preserved better than in AnimeGAN.
To conclude, our method outperforms the other methods to a certain extent. Our approach generates images with significant anime color and little or no face feature distortion, let alone face collapse; it retains anime textures, colors and other stylistic information, and preserves the content structure better, with more stable detail textures.
In addition, we use the Fréchet Inception Distance (FID) and the Inception Score (IS), evaluation metrics widely used for GAN-generated images, to evaluate model quality. FID [25] uses the pre-trained Inception-V3 classification network to extract high-level features of images and calculates the distance between the distributions of two sets of images; generally speaking, the smaller the FID, the closer the two distributions and the higher the similarity of image features. IS [26] is mainly used to evaluate whether a GAN can generate clear and diverse images; it measures the difference between the target-domain images and the generated images by computing the KL divergence of the class probability distributions, and the higher the IS, the better the quality of the generated images. We conducted performance tests by generating Hayao-Miyazaki-style images with the different networks, evaluating FID between the generated images and both their original real images and the corresponding anime-style images, and evaluating the clarity and diversity of the generated images with IS. The final results are shown in Table 1.
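For reference, FID between real features with statistics $(\mu_r, \Sigma_r)$ and generated features with statistics $(\mu_g, \Sigma_g)$ is $\|\mu_r - \mu_g\|^2 + \mathrm{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$. Below is a minimal sketch of this computation on precomputed Inception-V3 features; feature extraction itself is omitted, and `real_feats`/`gen_feats` are assumed to be (num_images × 2048) arrays of pooling-layer activations.

```python
import numpy as np
from scipy import linalg


def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Sketch of the FID formula on precomputed Inception features."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerics
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2 * covmean))
```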
From Table 1, we can see that the FID to cartoon of AnimationGAN is the smallest, which indicates that the images generated by AnimationGAN capture the image content more effectively and balance the content features and animation style features better, yielding generated images with high feature similarity to both the content images and the animation images. The IS of AnimationGAN is also the largest, which further proves that AnimationGAN obtains better generation results. Since AnimationGAN introduces the attention mechanism in the residual module, which enhances the critical information and weakens the irrelevant information in the image, the detail textures and other features of the generated image are retained or discarded according to their importance.

Comparison and analysis of model complexity
The complexity of the algorithm mainly includes the number of network parameters, the model size of the network and the image stylization time. To ensure the reliability of the experiments, we first trained AnimeGAN, WhiteBox, AnimationGAN, CartoonGAN and GANILLA on the same dataset for 50 epochs on Google Colab while recording the training time per epoch. Then, we used the above networks to generate face images to measure the inference time, and used PyTorch's numel function on the model parameters to compute the model size of the different networks, as sketched below. Finally, we performed animation transfer on the same test set and conducted an experimental comparison of model complexity; the test results are shown in Table 2.
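As a point of reference, the parameter count reported in this comparison can be reproduced with a one-liner over the model's parameter tensors (a sketch; `model` stands for any of the networks above):

```python
def count_parameters(model) -> int:
    # Sum the element counts (numel) of all parameter tensors in the model.
    return sum(p.numel() for p in model.parameters())
```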
AnimationGAN is a further improvement over AnimeGAN. In the residual module, we use a linear bottleneck with a smaller number of parameters, so the model size and inference time are significantly improved compared with both AnimeGAN and GANILLA. Although the number of parameters and the model size of WhiteBox are smaller than those of AnimationGAN, its training time is longer: the batch group normalization we adopt in the generator helps speed up the convergence and training of the network, and the spectral normalization structure of the discriminator enhances the stability of training while maintaining a fast training speed, which gives AnimationGAN higher training efficiency.

Ablation experiment
We made four groups of networks, LBN+CBAM, IRB+CBAM, LBN, and IRB, representing a linear bottleneck residual block containing CBAM, an inverted residual block containing CBAM, a linear bottleneck residual block without CBAM, and an inverted residual block without CBAM, respectively. Then, we tested two sets of real face images with these four groups of networks, and the test results are shown in Fig. 6. As shown in Fig. 6, the IRB method can better preserve the facial content structure, but the generated image has no significant anime features; the image generated by LBN is slightly lacking in contour softening, but it can effectively portray high contrast edge strokes, and its generated image has obvious anime face style. The comparison between IRB+CBAM and IRB, LBN+CBAM and LBN shows that adding the CBAM module retains more detailed information and anime features of anime faces, and better softens the face contours. In addition, the faces generated by LBN+CBAM are more remarkable than IRB+CBAM in terms of texture lines and colors, and the animation migration effect is better.
LBN+CBAM combines the advantages of both LBN and CBAM: LBN portrays the face more realistically in terms of anime edge-stroke features, while the CBAM hybrid attention module attends well to the tones and contents of the face images, which compensates for the dull colors and overly dark tones brought by LBN. In this experiment, after adding CBAM, the generated images all retain the tones of the real images better, which makes the animation features of the generated images more effective.

Conclusions
We proposed AnimationGAN, which uses linear bottleneck residual blocks with CBAM and optimized normalizations. In the experimental stage, we used different networks to realize the anime style transfer of faces, compared their generated face images and analyzed the model complexity. We conducted ablation experiments on whether to use the linear bottleneck and whether to add CBAM. The experiments showed that AnimationGAN can effectively match the content and face features; its generated images have significant anime texture, color and other style information, and better retain the content structure and detail textures. In addition, due to the optimized normalizations, the model size and inference time of AnimationGAN are significantly reduced compared with AnimeGAN, and the training efficiency is significantly improved.
AnimationGAN alleviates the problem of insufficient transfer of character anime styles; however, more work is needed in the future to tackle the difficulties of dealing with complex semantic information. We will continue to search for better algorithms for this problem.