Realistic real-time processing of anime portraits based on generative adversarial networks

Nowadays, more and more brands use appealing anime characters to promote their products and increase brand awareness. Real-time, engaging promotional material is one of the key factors in attracting audiences, so real-time processing of anime characters has become an effective way to enhance brand awareness. In recent years, with the rapid development of deep learning, AI-based image style conversion has become a widely studied artificial intelligence application, but it still suffers from complex model structures, slow conversion speed, and weak preservation of identity features, all of which need improvement. In view of this, this paper proposes an Anime Portrait Realization algorithm based on Generative Adversarial Networks (ARF-GAN). The algorithm builds on the Pix2Pix architecture and uses a U-net generator with skip connections to enhance feature extraction and keep the model lightweight.


Introduction
In recent years, along with the continuous development of AI technology, its applications have transformed a wide range of industries. For example, quantum imaging [1] can achieve higher resolution than classical optical imaging by exploiting quantum entanglement, which is valuable for fields that require high-resolution imaging, such as biomedical imaging and astronomical imaging. Social media platforms can use AI to analyze the language, sentiment, and representative views of users in order to provide more personalized recommendations and search results. Computer vision (CV) [2] automatically recognizes people, objects, and scenes in photos and videos; social media platforms can use it to analyze photos uploaded by users, group them into appropriate topics or events, and use this information to provide better recommendations. Artificial intelligence, especially deep learning, provides powerful algorithmic support and processing capability for real-time image processing. By training and optimizing models, AI can process and analyze image data efficiently, enabling faster and more accurate image recognition, classification, enhancement, and generation. This has led to the rapid development and widespread use of real-time image processing in application scenarios such as autonomous driving, intelligent monitoring, and medical diagnosis. The advancement of AI not only drives innovation in real-time image processing but also opens up new possibilities, further promoting its development and application.
Similarly, AI technology can be applied to the task of image style transformation. Anime portraits have long attracted a large fan base with their fresh and cute imagery, and many brands use anime character images on social media to market products and promote their brands. When brands build a presence in the media, new, interesting, and engaging content is crucial. Anime characters are products of human imagination, and their appearance differs somewhat from that of real humans. With the growing demand for advertising and promotion, how to realize anime portraits quickly and in real time has become an issue of great concern. Applying anime portrait realization to brand promotion gives brands a novel and distinctive way to attract and retain audiences. Anime portrait realization is therefore becoming a research topic of considerable interest.
Traditional techniques for realizing anime characters tend to be slow and are generally based on 3D modeling, combining computer graphics and computer animation techniques. Computer graphics contributes texture mapping, lighting, and shading, which are used to create and render 3D scenes and models. Computer animation contributes skeletal animation, skinning, keyframe animation, and physics simulation, which realize the dynamic behavior of 3D models. Auxiliary techniques such as rendering, animation editing, and compositing further improve the quality of the resulting 3D animation. Such pipelines are very complex, require expertise across multiple domains, and consume large amounts of time and computational resources.
In recent years, deep learning has become a popular approach in image processing. Despite various improvements, existing deep learning methods for anime character realization still have notable problems. First, traditional GANs have complex structures and insufficient feature extraction capability, and they generally rely on hand-crafted features of the anime portrait to generate the corresponding real portrait, which often degrades the quality of the generated image. Second, it is essential to retain the identity features and detail information of the anime portrait during the conversion to a real portrait, yet existing GAN-based methods neither take this into account nor offer a corresponding solution. Third, most current deep learning methods for anime character realization concentrate on improving image quality, and very few investigate subsequent downstream visual tasks.
To address the above problems and challenges, this paper proposes a novel GAN, the Anime-to-Real Face Generative Adversarial Network (ARF-GAN), for converting anime portrait images into real portrait images. The main contributions of this paper are as follows: 1. This paper proposes a novel GAN-based anime-to-real face realization model (ARF-GAN). Unlike traditional portrait animization algorithms, it works in the opposite direction, generating the corresponding real portrait image from an input anime portrait image to achieve style conversion. The new algorithm adopts the U-net structure as the generator network to enhance feature extraction and keep the network structure lightweight.
2. This paper introduces the CBAM attention module into the anime portrait realization model, which helps the model focus on the most important feature channels, improving accuracy and strengthening its expressive and generative capabilities so that the generated images are more realistic and diverse. At the same time, this paper constructs a deblurring network [12] that significantly improves the model's image generation and processes the input image quickly, thereby supporting real-time processing.

Generative Adversarial Networks
The Generative Adversarial Network (GAN) [3] is a deep learning architecture proposed by Ian Goodfellow et al. in 2014, whose basic idea is to achieve high-fidelity image and video generation through a competitive game between two neural networks. A GAN consists of two parts, the discriminator and the generator. During training, the two networks compete with and optimise each other; when the generator reaches its optimum it can produce highly realistic images, while the discriminator learns to distinguish generated from genuine images more effectively. In the field of image generation, Alec Radford et al. proposed the DCGAN [4] model in 2015. By combining GANs with convolutional networks, DCGAN uses deep convolutional neural networks for generation and discrimination of feature maps, which reduces the interplay of errors during training; this increases training stability and allows the generator and discriminator to fit the data better. Mehdi Mirza et al. proposed CGAN [5] in 2014 to solve the inability of the primitive GAN to generate images with specific attributes; CGAN incorporates attribute information, which can be any label information, into both the generator and the discriminator. Martin Arjovsky et al. proposed Wasserstein GAN [6] in 2017, which introduces a fresh loss function based on the Wasserstein distance between two distributions, addressing gradient vanishing and mode collapse during training. In the field of image quality enhancement, Wenzhe Shi et al. proposed SRGAN [7] in 2017, which employs a novel loss function and introduces techniques such as residual networks and adversarial training; its primary purpose is to increase the resolution of low-quality photos, achieving high-quality, efficient image super-resolution. Further, in the field of portrait attribute editing, Xiaojuan Qi et al. proposed BeautyGAN [8] in 2018; its main role is efficient and accurate portrait beautification, making the generated images visually more beautiful and natural. Despite GAN's success with still images, numerous issues remain in video translation, mostly because it is challenging to guarantee consistency between the consecutive frames of a generated video. For this, Wang et al. proposed the Vid2Vid [9] model in 2018. Vid2Vid is the video-domain version of image translation: it adds optical flow information to the discriminator, models the foreground and background separately, and focuses on solving frame-to-frame inconsistency in video-to-video translation. In addition, Zheng et al. proposed AdvGAN [10] in 2018; its main role is to generate transferable and generic adversarial samples that enable attackers to deceive multiple machine learning models instead of just one. Similarly, Karen Simonyan et al. proposed BigGAN [11] in 2019, which employs a novel architecture including deep residual generators and conditional instance normalisation, producing generated images with higher resolution, richer details, and more diverse features. When it comes to image editing, Tero Karras et al. proposed the StyleGAN [12] model in 2019; StyleGAN is mainly used to generate high-resolution, realistic portrait images with a high degree of controllability and diversity. Yunsuk Kim et al. proposed the UGATIT [13] model in 2019, which exploits an attention mechanism and adaptive layer-instance normalisation, making the generated images more realistic and diverse. Kaiming He et al.
proposed APEGAN [14] in 2020, which adopts the idea of Energy-Based Generative Adversarial Networks (EBGAN), introducing adversarial perturbations to enhance the diversity and realism of images. Eric R. Chan et al. proposed Pi-GAN [15] in 2021 for high-quality 3D-aware image synthesis, using a SIREN network as an implicit representation and generating images from a SIREN-conditioned radiance field. Yabin et al. in 2023 proposed inaccurate-supervised learning with GGAN [16], a method that uses a generative adversarial network to generate target labels by training a generator network while using imprecise supervision to guide the generation process.

GAN-based image style translation
The anime portrait realisation described in this paper is a type of GAN-based image style translation. Image style translation does not refer to a single research area but is a collective term for a series of tasks, including image stylisation. A set of images sharing the same style constitutes a domain, and what image translation must achieve is the conversion from one domain to another, a task closely linked to domain migration. Phillip Isola et al. proposed Pix2Pix [17] in 2017, a deep learning image transformation model and a typical supervised image translation GAN, which uses paired images to accomplish image-to-image translation. In the same period, Jun-Yan Zhu et al. proposed the CycleGAN [18] model in 2017, an unsupervised image translation model based on cycle consistency whose biggest innovation is the ability to train on unpaired data. In the same year, Taeksoo et al. proposed DiscoGAN [19], which uses the same loss function as CycleGAN but differs in its specific generator and discriminator structures. Weidi Xie et al. proposed CartoonGAN [20] in 2018, which trained an anime image generation network and improved the algorithm's generalisation by introducing conditional instance normalisation (CIN) [21]. Yi-Hsuan Tsai et al. proposed DualGAN [22] in the same year, which uses a semantic consistency loss to retain image feature information at the semantic level, thus preserving higher-level features; it is named for its "X"-shaped model structure. Further, Yunjey Choi et al. proposed StarGAN [23] in 2018, whose discriminator must learn not only whether samples are real but also from which domain the real images originate, achieving multi-domain image interconversion with a single model. Xiaocong Fan et al.
proposed AnimeGAN [24] in 2021, which introduces a new loss function and network structure to generate anime images of different styles and can be trained in a shorter time. William Peebles et al. presented GAN-Supervised Dense Visual Alignment [25], a method for dense visual alignment using GANs: it aligns the input image x with the target image y by training a spatial transformer network (STN), allowing the resulting image to retain more detail and structural information. Similarly, Liu Hang et al. in 2023 proposed an improved GAN style-migration method [26] that uses spectral normalisation and perceptual loss to avoid gradient explosion during training, addressing unstable training and low quality of the generated images.
In summary, although many models realise anime portrait realisation through GAN-based image translation, many problems remain. For complex images, the models with better generation quality tend to have more complex structures, which makes training harder, while image translation models with simple structures tend to produce poorer style translations. Designing a model that is both lightweight in structure and strong in generation quality is therefore the goal of this paper.

Anime Portrait Realisation Based on Generative Adversarial Networks (ARF-GAN)

Basic Design Ideas and Model Framework Processes
The anime portrait realisation model ARF-GAN in this paper is based mainly on the CGAN model architecture, with the generator network changed relative to the traditional GAN. To address the complex network structure and insufficient feature extraction ability of traditional GANs, the generator adopts a U-net design in which skip connections link the encoder's feature maps directly to the decoder, allowing the network to reconstruct the output better while keeping the structure lightweight. To address the traditional GAN's poor generation quality and lack of detail, the new algorithm also introduces the CBAM attention module, which extracts image features more effectively and enhances the model's expressive capability, and adds a deblurring module to improve the generated images.
For ease of explanation, this paper outlines the entire framework of the ARF-GAN anime portrait realisation model in Fig. 1. Its basic structure and steps are as follows. First, the input anime images are uniformly cropped by the preprocessing module, normalised, and adjusted in channel dimension to facilitate subsequent processing. The model then operates in two stages. In the first stage, the preprocessed image is passed into the U-net generator. The U-net architecture is symmetric and has two elements, an encoder and a decoder: the encoder compresses the input into a latent form, which the decoder then decodes to produce the output. The attention module (CBAM) [27] is used in the encoder to assign importance weights to each channel through learning; in this way, features carried by different channels receive different levels of importance. The features are then passed to the deblurring module, which aggregates them through a fusion block of five convolutional layers: the first four contain convolution, batch normalisation, and ReLU activation, the last contains convolution and a Tanh activation, and each layer receives activation information from all previous layers. In the second stage, the generated real face image and the original input anime face image are fed into the discriminator for training so that it can judge real from fake. Throughout this process, the weights of the loss function are adjusted, and the network is iteratively optimised by computing the loss and updating the parameters through backpropagation.
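For concreteness, a minimal sketch of the preprocessing module follows, assuming a PyTorch/torchvision implementation. The 256×256 crop matches the dataset description in the experiments section; the normalisation constants (mean and std of 0.5, mapping pixels into [-1, 1] for a Tanh output) are illustrative assumptions rather than values taken from the paper.

```python
from torchvision import transforms

# Sketch of the preprocessing module: uniform cropping, conversion to
# CHW tensor layout (the channel-dimension adjustment), and
# normalisation. The 0.5 mean/std values are assumptions.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),            # uniform crop
    transforms.ToTensor(),                 # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.5] * 3,   # channel-wise normalisation
                         std=[0.5] * 3),   # maps to [-1, 1]
])
```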

Generator U-net network structure
Traditional GAN generators use a simple CNN [28] structure, which leads to a lack of detail and clarity in the generated images and increases model complexity. Therefore, this paper adopts the U-Net topology as the network architecture of generator G for generating realised images of anime portraits.
The main feature of the U-net structure is its skip connections, which allow the model to capture information at multiple scales. By connecting the feature maps of each encoder layer to the corresponding decoder layer, this structure yields sharper and more realistic images, a lighter network model, and more accurate predictions.
In the implementation, this paper first builds an encoder and a decoder. In the encoder, the input image is downsampled four consecutive times: each step convolves the image with a 3×3 kernel, applies a ReLU activation to produce the feature channels, and downsamples by max pooling with a 2×2 pooling kernel. In other words, the source anime portrait image is gradually scaled down while features are extracted, with convolutional and pooling layers used together to simplify the computation. This process can be regarded as extracting the high-level features of the anime portrait image. Each downsampling layer halves the image size and doubles the number of convolution kernels. The encoder output is then upsampled four consecutive times with a 2×2 kernel, and dropout is applied in the first two upsampling layers to reduce the generator's complexity and prevent overfitting. Each upsampling step is connected with the features of the corresponding encoder layer. In this way, low-level encoder features are coupled with high-level decoder features via the connections between encoder outputs and decoder inputs, as shown in Fig. 2, increasing the accuracy of the generated image while keeping the network structure lightweight. A code sketch of this generator follows.
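The sketch below illustrates a generator of this shape in PyTorch. The 3×3 convolutions with ReLU, 2×2 max pooling, channel doubling, 2×2 transposed convolutions, dropout on the first two decoder layers, and concatenating skip connections follow the description above; the base channel width of 64, the bottleneck block, and the final 1×1 convolution with Tanh are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3 convolution followed by ReLU, as described in the text
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.ReLU(inplace=True))

class UNetGenerator(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)       # kernels double
        self.enc3 = conv_block(base * 2, base * 4)   # at each level
        self.enc4 = conv_block(base * 4, base * 8)
        self.pool = nn.MaxPool2d(2)                  # 2x2 pooling halves H, W
        self.mid = conv_block(base * 8, base * 8)    # bottleneck (latent form)
        # 2x2 transposed convolutions for upsampling; dropout on first two
        self.up4 = nn.ConvTranspose2d(base * 8, base * 8, 2, stride=2)
        self.dec4 = nn.Sequential(conv_block(base * 16, base * 4), nn.Dropout(0.5))
        self.up3 = nn.ConvTranspose2d(base * 4, base * 4, 2, stride=2)
        self.dec3 = nn.Sequential(conv_block(base * 8, base * 2), nn.Dropout(0.5))
        self.up2 = nn.ConvTranspose2d(base * 2, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base)
        self.up1 = nn.ConvTranspose2d(base, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.out = nn.Sequential(nn.Conv2d(base, out_ch, 1), nn.Tanh())

    def forward(self, x):
        f1 = self.enc1(x)                  # H x W
        f2 = self.enc2(self.pool(f1))      # H/2
        f3 = self.enc3(self.pool(f2))      # H/4
        f4 = self.enc4(self.pool(f3))      # H/8
        m = self.mid(self.pool(f4))        # H/16, latent representation
        # skip connections: concatenate encoder features with decoder input
        d4 = self.dec4(torch.cat([self.up4(m), f4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), f3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), f2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), f1], dim=1))
        return self.out(d1)
```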

CBAM Attention Module
To enhance the expressive and generative ability of the ARF-GAN model and make the generated images more realistic and diverse, this paper incorporates the lightweight CBAM attention mechanism into the generator network. CBAM consists of two separate sub-modules, the channel attention module (CAM) and the spatial attention module (SAM), so that both channel and spatial attention are exercised. This preserves the lightweight character of the network structure while conserving computational resources and parameters. The precise structure is displayed in Fig. 3. The following steps describe in detail how the module is computed; a code sketch follows the steps.
Step 1: Compute the channel attention (CAM). The feature map F produced by the previous layer is first subjected to global max pooling and global average pooling over the spatial dimensions, producing two 1×1×C descriptors (where C is the number of channels). Both descriptors are passed through a shared two-layer MLP and summed element-wise, and a sigmoid produces the channel attention map. The MLP uses a ReLU activation, with C/r neurons in its hidden layer (r is the reduction ratio) and C neurons in its output layer. A scale parameter α, defaulting to 1, is adjusted automatically during training.
Step 2: Multiply the channel attention map element-wise with the input feature map to produce the input features needed by the spatial attention module.
Step 3: Compute the spatial attention (SAM). The feature map produced by the channel attention module is pooled along the channel dimension with global max pooling and global average pooling, producing two H×W×1 maps (where H and W are the height and width). The two maps are concatenated, a 7×7 convolution reduces them to a single channel (H×W×1), and a sigmoid produces the spatial attention features.
Here σ is a scale parameter, defaulting to 1, which adjusts itself during training.
Step 4: The final output features are obtained by multiplying the spatial attention features with the channel-refined features.
During training, each intermediate layer computes the attention feature maps corresponding to the previous layer's feature maps through the steps above and applies them to subsequent processing. Because the computed attention maps have the same dimensions as the input feature maps, training proceeds normally without changing the model's structure.
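A compact sketch of the four steps, following the standard CBAM formulation (Woo et al., 2018), is given below. The reduction ratio r = 16 and the bias-free convolutions are conventional assumptions, and the scale parameters mentioned above are omitted for brevity.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of the CBAM module following the four steps above."""

    def __init__(self, channels, r=16, kernel_size=7):
        super().__init__()
        # Step 1: channel attention -- a shared two-layer MLP applied to
        # the global average- and max-pooled descriptors, then summed.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=False))
        # Step 3: spatial attention -- a 7x7 convolution over the
        # channel-wise average and max maps, reduced to one channel.
        self.spatial = nn.Conv2d(2, 1, kernel_size,
                                 padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Step 1: channel attention map from 1x1xC pooled descriptors
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)           # Step 2: reweight channels
        # Step 3: spatial attention map over H x W
        s = torch.cat([torch.mean(x, dim=1, keepdim=True),
                       torch.amax(x, dim=1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.spatial(s))    # Step 4: reweight positions
        return x                                  # same shape as the input
```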

De-blurring network module
GAN-based methods focus on generating images that mimic the original image distribution.However, this approach often results in poor and blurred image quality.
Image deblurring is one of the most difficult problems in image processing. Deblurring networks draw on two main sources: traditional image processing methods on the one hand and deep learning techniques on the other. Traditional image processing already offers mature deblurring results, for example blind deblurring [31] and inverse filtering [32].
These algorithms rely on mathematical models that perform complex operations on the image to remove blur. However, such traditional methods can introduce distortions and artefacts, and they are often ineffective for non-linear, complex blurring. This paper addresses the challenge with a convolutional neural network-based deblurring algorithm that enhances the quality of the generated images.
The network model in this paper learns the blurring features of an image automatically with deep convolutional neural networks and achieves efficient deblurring. The deblurring module has five convolutional layers, as shown in Table 1. The first layer uses 64 convolution kernels of size 3×3 with padding 1, a ReLU activation, and no bias terms; it mainly extracts low-level components of the input image, such as edges and texture. The second, third, and fourth convolutional layers primarily extract higher-level information; the number of convolution kernels [36] doubles layer by layer while the other settings are kept constant. The fifth convolutional layer merges the previously extracted information into a deblurred image; its number of kernels is set to three and its activation function to Tanh.
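A sketch of this five-layer module follows. The kernel counts (64, 128, 256, 512, then 3), the 3×3 kernels with padding 1, the bias-free convolutions, and the final Tanh follow Table 1 and the text; the dense connectivity, in which each layer receives the activations of all previous layers, follows the fusion-block description in the framework section and is one possible interpretation.

```python
import torch
import torch.nn as nn

class DeblurNet(nn.Module):
    """Sketch of the five-layer deblurring module: layers 1-4 use 3x3
    bias-free convolutions with batch normalisation and ReLU, doubling
    the kernel count (64, 128, 256, 512); layer 5 merges the features
    into a 3-channel image with Tanh. Each layer receives the
    activations of all previous layers (a dense-connectivity reading
    of the fusion block described above)."""

    def __init__(self, in_ch=3):
        super().__init__()
        self.blocks = nn.ModuleList()
        c = in_ch
        for width in (64, 128, 256, 512):
            self.blocks.append(nn.Sequential(
                nn.Conv2d(c, width, 3, padding=1, bias=False),
                nn.BatchNorm2d(width),
                nn.ReLU(inplace=True)))
            c += width                       # dense connections grow the input
        self.out = nn.Sequential(
            nn.Conv2d(c, 3, 3, padding=1),   # layer 5: 3 kernels
            nn.Tanh())

    def forward(self, x):
        feats = [x]
        for block in self.blocks:
            feats.append(block(torch.cat(feats, dim=1)))
        return self.out(torch.cat(feats, dim=1))
```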

Overall Loss Function Design
In this paper's method, the generator produces the image while retaining an information distribution structurally similar to the anime portrait image, and the discriminator is designed to distinguish accurately between the real image and the generated portrait image. The network consists of the generator G and the discriminator D. (x, y) denotes a paired training sample, where x is an anime portrait and y is the corresponding real portrait. The goal is to bring the output image as near to the target image as possible through a min-max game, in which the generator G minimises the overall objective while the discriminator D maximises it. The overall loss in this paper is composed of two sub-losses: the adversarial loss and the generative loss. The adversarial loss is designed to raise the quality of the generated images. It is built on the game between generator and discriminator, pushing the generator to produce more realistic images while preventing the discriminator from telling real and generated images apart; this game is typically computed with a binary cross-entropy loss. The generative loss measures the difference between the converted image and the target image; here it is an L1 distance at the pixel level, the sum of the absolute differences between the pixel values of the converted image and those of the target image, weighted by the hyperparameter λ. In this paper's task the generator should produce a realistic portrait that retains the outline information and details of the original anime character image, so the L1 distance is used to minimise the differences between an input image and its associated output. The L1 loss is commonly used in tasks such as image deblurring and image super-resolution, summing the absolute pixel differences between the predicted and ground-truth images to guide the model's training.
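The paper's equation images are not reproduced in this extraction. Under the assumption that they follow the standard Pix2Pix conditional-GAN formulation that the description matches, the adversarial loss, the L1 loss, and the overall min-max objective read:

```latex
% Reconstruction of the objectives described above, assuming the
% standard Pix2Pix formulation; not transcribed from the paper itself.
\begin{align}
  \mathcal{L}_{cGAN}(G,D) &= \mathbb{E}_{x,y}\bigl[\log D(x,y)\bigr]
      + \mathbb{E}_{x}\bigl[\log\bigl(1 - D(x, G(x))\bigr)\bigr] \\
  \mathcal{L}_{L1}(G) &= \mathbb{E}_{x,y}\bigl[\lVert y - G(x)\rVert_{1}\bigr] \\
  G^{*} &= \arg\min_{G}\max_{D}\;
      \mathcal{L}_{cGAN}(G,D) + \lambda\,\mathcal{L}_{L1}(G)
\end{align}
```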

The experiment results and analysis
In this part, the effectiveness of the proposed algorithm is assessed. We conduct extensive experiments using the DanBooru and CartoonfaceAB datasets, presenting a comprehensive and rigorous evaluation of ARF-GAN. Section 4.1 presents the experimental parameter settings, the datasets, and the evaluation metrics and describes their characteristics. Section 4.2 compares ARF-GAN with other image translation methods and analyses the results to verify its advantages. Section 4.3 conducts ablation experiments to elucidate the role of each functional module. Section 4.4 analyses the efficiency of the proposed method through its time and space complexity.

Experimental environment and parameter settings
In this paper, image style translation experiments are conducted on the Cartoonface and DanBooru datasets. Each Cartoonface sample is a 256×256 anime portrait; the DanBooru dataset contains 12,000 training samples and 2,000 test samples, each a 512×512 cartoon image. The convolutional model used in the experiments consists of 4 convolutional layers, 2 pooling layers, 1 batch normalisation layer, and 1 fully connected layer. Convolution kernels of size 3 were used; the network was trained with the Adam optimiser [32], with the ReLU function [33] and the Tanh function as activation functions [34]. The initial learning rate was set to α = 0.0005 with momentum parameters β1 = 0.5 and β2 = 0.999, and the expected results were achieved after 1400 training batches. The experiments ran on a Windows 10 machine with an AMD Ryzen 9 5900HX CPU at 3.30 GHz, an NVIDIA GeForce RTX 3060, and 16 GB of RAM, with the algorithms implemented in Python 3.7 to ensure efficient processing and accurate results. These settings were selected after numerous experiments to optimise model performance; with them, the model can be trained efficiently on the selected datasets and produce high-quality results.
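As a concrete illustration, the reported optimiser settings can be written as follows in PyTorch; `UNetGenerator` and `Discriminator` refer to the illustrative sketches given elsewhere in this paper, not to released code.

```python
import torch

# Sketch of the reported optimiser configuration, assuming PyTorch.
# The two modules below are the illustrative sketches from the
# generator and discriminator subsections.
G = UNetGenerator()
D = Discriminator()
opt_G = torch.optim.Adam(G.parameters(), lr=5e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=5e-4, betas=(0.5, 0.999))
```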
When configuring the generator, it accepts a three-channel image as input and produces a three-channel image. The model contains 3 downsampling layers and 3 upsampling layers with a centre layer in between. Each downsampling layer contains a convolutional layer, a LeakyReLU activation, and a batch normalisation layer; each upsampling layer contains a transposed convolutional layer, a ReLU activation, and a batch normalisation layer. The 1st-6th upsampling layers also contain a dropout layer. Finally, the output layer contains a ReLU activation, a transposed convolution layer, and a Tanh activation. The role of the whole model is to transform the input image into the output image while maintaining the specific details of the input image.
In the discriminator model, the network is configured to accept a 6-channel input image (the RGB channels of two images) and produce a scalar value indicating whether the input image is real. The model contains four convolutional blocks [35] and one output layer. Each convolutional block contains a convolutional layer, a LeakyReLU activation, and a batch normalisation layer. The 1st-3rd convolutional blocks have a stride of 2 and the last convolutional block a stride of 1; the output layer contains a Sigmoid activation. The entire model's function is to determine whether or not the input image pair is real.
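A sketch of a discriminator matching this description follows; the 4×4 kernels and channel widths are assumptions in the PatchGAN style, and the per-patch outputs are averaged to obtain the scalar score described above.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of the discriminator described above: a 6-channel input
    (two RGB images concatenated), four conv blocks with LeakyReLU and
    batch normalisation (strides 2, 2, 2, 1), and a Sigmoid output."""

    def __init__(self, in_ch=6, base=64):
        super().__init__()

        def block(ic, oc, stride):
            return nn.Sequential(
                nn.Conv2d(ic, oc, 4, stride=stride, padding=1, bias=False),
                nn.BatchNorm2d(oc),
                nn.LeakyReLU(0.2, inplace=True))

        self.net = nn.Sequential(
            block(in_ch, base, 2),          # blocks 1-3: stride 2
            block(base, base * 2, 2),
            block(base * 2, base * 4, 2),
            block(base * 4, base * 8, 1),   # block 4: stride 1
            nn.Conv2d(base * 8, 1, 4, padding=1),
            nn.Sigmoid())                   # per-patch "real" probability

    def forward(self, anime, photo):
        # Concatenate the two 3-channel images into one 6-channel input;
        # averaging the patch map yields the scalar score per sample.
        return self.net(torch.cat([anime, photo], dim=1)).mean(dim=(1, 2, 3))
```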
In the field of Generative Adversarial Networks, a variety of assessment metrics are frequently used to judge the quality of generated images. This paper selects five representative metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) [31], Fréchet Inception Distance (FID), Learned Perceptual Image Patch Similarity (LPIPS), and Mean Square Error (MSE), to quantitatively evaluate the performance of the proposed ARF-GAN algorithm. Together these five metrics provide a comprehensive assessment of ARF-GAN's performance and enable a thorough analysis of the quality of the generated images.
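As an illustration, MSE and PSNR can be computed directly, as in the sketch below; SSIM, FID, and LPIPS require dedicated implementations (e.g. the scikit-image, pytorch-fid, and lpips packages).

```python
import torch

def mse_psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0):
    """Compute MSE and PSNR for image tensors scaled to [0, max_val].

    A minimal sketch covering two of the five metrics; the PSNR follows
    the standard definition 10 * log10(MAX^2 / MSE).
    """
    mse = torch.mean((pred - target) ** 2)
    psnr = 10.0 * torch.log10(max_val ** 2 / mse)
    return mse.item(), psnr.item()
```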

Comparison of the experiment results
To demonstrate the image generation performance of ARF-GAN, this paper compares it with four existing algorithms with good performance: Pix2Pix, CycleGAN, DualGAN, and DiscoGAN. During the experiments, we found that ARF-GAN performs better on realistic anime face images, as shown in Fig. 4. The model accurately maps anime-style facial features to real facial structures, achieving high-quality realistic transformation. In the model design, a large amount of training data is used to optimise the model parameters continuously, improving the accuracy and naturalness of the transformation.

Qualitative analysis
As shown in Fig. 4, this paper gives intuitive result images for the several methods compared, in seven sets of image pairs. First, the method in this paper successfully preserves the rich texture and background details of the visible image. Compared with the images of Pix2Pix, CycleGAN, DualGAN, and DiscoGAN (the second, third, fifth, and sixth columns), the method of this paper demonstrates excellent performance in overall structure and contrast for image style transformation, which is particularly evident in the overall image results. Second, compared with the insufficient clarity and blurred detail of Pix2Pix and CycleGAN, the method in this paper effectively generates the detailed texture features that real portraits should have. DiscoGAN suffers from overexposure and colour blurring, and DualGAN from uneven facial colour; compared with these known methods, the pre-trained network model of this paper generates better realised images of higher quality. However, qualitative analyses are usually subjective. Subsequently, this paper quantitatively compares ARF-GAN with the aforementioned competitors; the outcomes are displayed in Table 2 and Figure 5. According to the statistical findings, ARF-GAN exhibits the best performance in PSNR and LPIPS, with a significant improvement in PSNR over the second-place method, while maintaining second place in SSIM. In terms of FID, this paper's method yields results similar to the second place. This indicates that ARF-GAN achieves the best or second-best results on most metrics, and together with the qualitative observations it suggests that the approach performs well in image style translation.

Ablation experiments
In the GAN-based anime portrait realisation method proposed in this paper, CBAM plays a crucial role in extracting image features, enhancing the model's expressive and generative capabilities, and making the generated images more realistic. To investigate the efficacy of CBAM, this paper analyses the model structure and conducts an ablation study. Specifically, the normalised values of SSIM, PSNR, LPIPS, MSE, and FID are calculated, as shown in Table 3 and Figure 6; the approach with CBAM improves performance on all metrics compared with the results obtained without it. In addition, this section further discusses the effect of the deblurring network on the method's image generation. Again the normalised values of SSIM, PSNR, LPIPS, MSE, and FID are calculated, and all metrics except LPIPS improve over the previous results. The experimental findings, given in Figure 7, demonstrate that the deblurring network has a considerable impact on the method's ability to generate images and can enhance their texture details.

Complexity analysis
To fully analyse the computational complexity of this paper's method, the time and space complexity of the four comparison methods and ARF-GAN are analysed, as shown in Table 4. In time complexity, the method in this paper achieves the fastest running speed, 1.9 ms faster than the second-place DiscoGAN. In space complexity, the method achieves the third-smallest parameter size. Although Pix2Pix has a small parameter size, its running time is much larger than the other algorithms, reaching 127.4 ms. Since this paper's model is based on Pix2Pix, its parameter size is only 0.04 G larger than the second-ranked Pix2Pix. This shows the high efficiency of the method in this paper.

Summary and outlook
In this paper, we propose the ARF-GAN algorithm, which converts anime portrait images into real portrait images and can be applied to many fields, including social media, advertising, film and television, and games. Specifically, this paper first constructs a new GAN structure, ARF-GAN, which uses a U-net structure in the generator to lighten the network, employs the CBAM attention mechanism and a deblurring network to improve image realism, and analyses the relevant theoretical underpinnings of the algorithm. The quality and similarity of the generated images are evaluated with a variety of metrics through comparison with other algorithms, and the model's effectiveness and real-time performance are verified in terms of time and space complexity. Although the proposed GAN model performs well on the real-time image processing task, problems remain. For example, the current model still produces a certain degree of distortion and deformation when processing non-realistic anime portraits, owing to its insufficient understanding of different anime styles. One future research direction is therefore to improve the model's understanding of different styles of anime portraits, and this paper will further explore these application challenges and develop more pervasive and sustainable application methods. In conclusion, anime portrait realisation is a promising and challenging application area with many directions to explore, and we hope that the work in this paper can contribute to the development of this field.

Fig. 1
Fig. 1 The flowchart of the ARF-GAN algorithmic framework

Fig. 2
Fig. 2 Schematic diagram of U-net structural model

Fig. 5
Fig. 5 Quantitative Comparison of the Five Indicators in the Dataset

Table 2
Comparison of each evaluation metric between different experimental algorithms

Table 3
Comparison of network models with the addition of different technologies

Table 4
Comparison of the computational efficiency of the four comparison methods and ARF-GAN