Face frontalization with deep GAN via multi-attention mechanism

In recent years, the development of deep learning has led to advances in face synthesis approaches, but large pose variation remains one of the factors that are difficult to overcome. Benefiting from the proposal and development of the generative adversarial network (GAN), face frontalization technology has reached new heights. In this paper, we propose a deep generative adversarial network based on a multi-attention mechanism for multi-pose face frontalization. Specifically, we add a deep feature encoder based on the attention mechanism and residual blocks to the generator. Meanwhile, to capture both global and local facial information, the discriminator of our model consists of four independent discriminators. Quantitative and qualitative experiments on the CAS-PEAL-R1 dataset show that our model is effective. Its recognition rate exceeds or equals the highest recognition rate of other models at some angles, such as 100% at β = 0°, α = 15° and 99.78% at β = 30°, α = 0°.


Introduction
With the rapid development of deep learning, the performance of face recognition techniques has improved significantly [1,2]. However, in practical applications (such as surveillance video), face images are often captured under large pose variations.
At present, multi-pose face recognition methods are divided into two categories. One category entails learning pose-invariant features from original face images [3][4][5]. The other category involves synthesizing an identity-preserved frontal-view face image from a face image in a specific pose, which is called face frontalization; the generated face image is then used to extract features and recognize the face. Previous works [6][7][8] all show strong face recognition performance with this approach. Since Goodfellow et al. [9] proposed the generative adversarial network, many face frontalization methods based on GAN have been proposed [10][11][12][13][14][15][16][17][18]. Attention mechanisms have also been widely used in computer vision. Inspired by this, we propose a deep GAN with a multi-attention mechanism (DMA-GAN) for face frontalization. The main contributions of this work are as follows:
• We add a deep feature encoder based on the attention mechanism and residual blocks in the generator to deepen the network layers and extract more abstract facial details. We also add the attention mechanism in the discriminator.
• Compared with some existing advanced methods, our model is simpler and achieves higher recognition rates at some angles. The results from quantitative and qualitative experiments prove the effectiveness of the proposed method.

Generative adversarial networks
Goodfellow et al. [9] first proposed generative adversarial networks (GANs). The min-max two-player game provides a simple yet powerful way to estimate a target distribution and generate novel image samples [23]. Many variations of GAN have been proposed. DCGAN [24] applies deep convolution to GAN. WGAN [25] and WGAN-GP [26] use the Wasserstein distance instead of the Jensen-Shannon divergence in GAN. BEGAN proposes a new equilibrium enforcing method [27]. These methods have greatly advanced various generation tasks.

Face frontalization
Face frontalization is a computer vision task that synthesizes identity-preserved frontal-view faces from various viewpoints. Existing methods can be divided into two categories: 2D-based methods [6][7][8] and 3D-based methods [28,29]. In recent years, many face frontalization methods based on GAN have been proposed. For instance, Huang et al. [10] propose a deep architecture with two pathways (TP-GAN). Qian et al. [17] propose a face normalization model (FNM) to synthesize frontal face images under unconstrained environments. Tran et al. [11] propose DR-GAN, which extends GAN with an encoder-decoder structured generator and a pose code. Cen et al. [30] design a novel feature fusion module to fuse features more effectively. These methods have all made meaningful contributions. In view of the effectiveness of GAN, our model is also based on GAN.

Attention mechanism
In 2014, Mnih et al. [31] first applied attention to image classification, and it subsequently became widely used in natural language processing tasks [32,33]. In recent years, attention mechanisms have also played an important role in computer vision. In particular, DA-Net [21] aggregates a position attention module and a channel attention module. Zhang et al. [22] introduce self-attention to GAN (SAGAN). For face frontalization tasks, DA-GAN [13] and GSP-GAN [14] place the self-attention module in the generator and discriminator, respectively. In our model, we combine and stack the position attention module and channel attention module to produce more abstract features.

Method
The structure of our model is shown in Fig. 1. The input multi-pose image is denoted as I_P, the corresponding frontal image as I_F, and the synthesized frontal image as Î_F. The encoder, deep feature encoder and decoder are denoted as G_E, G_DFE and G_D, respectively. The frontal face is cropped into three regions: eyes I_E, nose I_N and mouth I_M. The corresponding synthesized regions are denoted as Î_E, Î_N and Î_M. The discriminators are denoted as D_F, D_E, D_N and D_M.

Generator
The generator of our model, shown in Fig. 1, is based on the U-Net architecture, which consists of an encoder-decoder structure for image synthesis. Skip connections are used between the encoder and decoder to enable multi-scale feature fusion. Inspired by [34], we add a deep feature encoder (DFE) behind the encoder. The generation process can be described as:

$$\hat{I}_F = G_D(G_{DFE}(G_E(I_P))) \tag{1}$$

The DFE consists of four stacked modules, as shown in Fig. 2. Each DFE module comprises two parts: a residual block and a dual-attention module. He et al. [35] propose ResNet, which is easier to optimize and gains accuracy from increased depth. In order to extract more abstract facial features, we stack basic residual blocks to deepen the generation network. Inspired by [21], we combine position self-attention and channel self-attention into a dual-attention module.
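As a rough illustration of this wiring, the sketch below composes placeholder encoder, DFE and decoder modules in PyTorch (the paper's stated framework). The layer choices and channel counts are illustrative assumptions, the U-Net skip connections are omitted for brevity, and the DFE modules themselves are sketched in the following two subsections.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the paper's sub-networks; only the wiring
# Î_F = G_D(G_DFE(G_E(I_P))) is taken from the text.
G_E = nn.Sequential(nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.ReLU())      # encoder
G_DFE = nn.Sequential(*[nn.Conv2d(64, 64, 3, padding=1) for _ in range(4)])   # 4 stacked DFE modules (simplified)
G_D = nn.Sequential(nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1), nn.Tanh())  # decoder

I_P = torch.randn(1, 1, 128, 128)   # grayscale multi-pose input (the paper uses 128 × 128 images)
I_F_hat = G_D(G_DFE(G_E(I_P)))      # generation process, Eq. 1
print(I_F_hat.shape)                # torch.Size([1, 1, 128, 128])
```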

Position self-attention module
Given a feature map Y ∈ R^{C×H×W}, we first generate three new feature maps A, B and C by feeding Y into three different 1 × 1 convolutional layers, where {A, B, C} ∈ R^{C×H×W}. We then reshape A and B to R^{C×N}, where N = H × W, perform matrix multiplication between A^T and B, and apply a softmax layer to obtain the spatial attention map D ∈ R^{N×N}:

$$D_{ji} = \frac{\exp(A_i \cdot B_j)}{\sum_{i=1}^{N} \exp(A_i \cdot B_j)} \tag{2}$$

where D_{ji} measures the impact of the i-th position on the j-th position. Meanwhile, we reshape C to R^{C×N}, perform matrix multiplication between C and the transpose of D, and reshape the result to R^{C×H×W}. Finally, we multiply the result by a scale parameter α and perform an element-wise sum operation with the original feature Y, obtaining the final position self-attention feature map M ∈ R^{C×H×W}:

$$M_j = \alpha \sum_{i=1}^{N} (D_{ji} C_i) + Y_j \tag{3}$$

From Eq. 3, we can infer that the feature M_j at each position is a weighted sum of the features at all positions plus the original feature.
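A minimal PyTorch sketch of this position self-attention module (DANet-style [21]); the class and variable names are illustrative, and initializing the scale parameter α to zero follows common practice rather than anything stated in the paper.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Position self-attention: each output position is a weighted sum of
    the features at all positions plus the original feature (Eq. 3)."""
    def __init__(self, channels):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, 1)  # produces A
        self.conv_b = nn.Conv2d(channels, channels, 1)  # produces B
        self.conv_c = nn.Conv2d(channels, channels, 1)  # produces C
        self.alpha = nn.Parameter(torch.zeros(1))       # learnable scale, init 0 (assumption)

    def forward(self, y):
        bsz, ch, h, w = y.shape
        n = h * w
        a = self.conv_a(y).view(bsz, ch, n)                    # B×C×N
        b = self.conv_b(y).view(bsz, ch, n)                    # B×C×N
        c = self.conv_c(y).view(bsz, ch, n)                    # B×C×N
        d = torch.softmax(torch.bmm(a.transpose(1, 2), b), 1)  # B×N×N map D, softmax over source positions
        out = torch.bmm(c, d).view(bsz, ch, h, w)              # weighted sum over all positions
        return self.alpha * out + y                            # Eq. 3

m = PositionAttention(64)
print(m(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```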

Channel self-attention module
Given a feature map Y ∈ R^{C×H×W}, we first reshape Y to R^{C×N} and perform matrix multiplication between Y and Y^T. After that, we apply a softmax layer to obtain the channel attention map E ∈ R^{C×C}:

$$E_{ji} = \frac{\exp(Y_i \cdot Y_j)}{\sum_{i=1}^{C} \exp(Y_i \cdot Y_j)} \tag{4}$$

where E_{ji} indicates the degree of the i-th channel's effect on the j-th channel. Additionally, we perform matrix multiplication between E and Y and reshape the result to R^{C×H×W}. Finally, we multiply the result by a scale parameter β and perform an element-wise sum operation with the original feature Y, obtaining the final channel self-attention feature map N ∈ R^{C×H×W}:

$$N_j = \beta \sum_{i=1}^{C} (E_{ji} Y_i) + Y_j \tag{5}$$

It can be inferred from Eq. 5 that the final feature of each channel is a weighted sum of all channel features plus the original feature. After obtaining M and N, we feed each into a 3 × 3 convolutional layer and perform an element-wise sum operation to obtain the feature map F. Finally, we apply a 1 × 1 convolution to F and add the result to the original feature Y to obtain the final feature map Z.
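A matching sketch of the channel self-attention module and the fusion step that produces Z; as before, the names and the zero initialization of β are assumptions, not from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel self-attention: each output channel is a weighted sum of all
    channels plus the original feature (Eq. 5). No 1×1 convs, as in DANet [21]."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # learnable scale, init 0 (assumption)

    def forward(self, y):
        bsz, ch, h, w = y.shape
        flat = y.view(bsz, ch, h * w)                                # B×C×N
        e = torch.softmax(torch.bmm(flat, flat.transpose(1, 2)), 2)  # B×C×C channel map E
        out = torch.bmm(e, flat).view(bsz, ch, h, w)                 # weighted sum over channels
        return self.beta * out + y                                   # Eq. 5

class DualAttentionFusion(nn.Module):
    """Fuses the position output M and channel output N into the final feature Z:
    3×3 convs, element-wise sum to F, then a 1×1 conv plus the original Y."""
    def __init__(self, channels):
        super().__init__()
        self.conv_m = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_n = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_f = nn.Conv2d(channels, channels, 1)

    def forward(self, m, n, y):
        f = self.conv_m(m) + self.conv_n(n)  # feature map F
        return self.conv_f(f) + y            # final feature map Z
```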

Discriminator
We use a segmentation strategy to implement a facial attention mechanism by cropping the face image into three regions (eyes, nose and mouth), which are the most discriminative areas in face recognition. The whole image and these three regions are fed into four independent discriminators (D_F, D_E, D_N, D_M), as shown in Fig. 1.
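A small sketch of this segmentation strategy, using the fixed region centers and sizes reported later in the implementation details; treating the centers as (column, row) coordinates is an assumption.

```python
import torch

# Region centers (x, y) and sizes (w, h) on the 128×128 aligned face,
# taken from the implementation details; (x, y) as (column, row) is an assumption.
REGIONS = {
    "eyes":  ((64, 42), (84, 25)),
    "nose":  ((64, 64), (24, 38)),
    "mouth": ((64, 90), (52, 18)),
}

def crop_regions(img):
    """img: B×C×128×128 tensor; returns the fixed facial-region crops
    that are fed to the local discriminators D_E, D_N, D_M."""
    crops = {}
    for name, ((cx, cy), (w, h)) in REGIONS.items():
        x0, y0 = cx - w // 2, cy - h // 2
        crops[name] = img[:, :, y0:y0 + h, x0:x0 + w]
    return crops

batch = torch.randn(4, 1, 128, 128)
for name, crop in crop_regions(batch).items():
    print(name, tuple(crop.shape))  # eyes (4, 1, 25, 84), nose (4, 1, 38, 24), mouth (4, 1, 18, 52)
```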
Inspired by SAGAN [22], we also add a self-attention block to the uppermost and second-uppermost layers of each discriminator. The self-attention block is the position self-attention block shown in Fig. 2.

Loss function
The loss function is a weighted sum of five individual loss functions.

Global-local adversarial loss
The loss for distinguishing real images from synthesized images is the sum of a global loss and three local losses:

$$\mathcal{L}_{adv} = \sum_{j \in \{F, E, N, M\}} \left( \mathbb{E}_{I_j}\left[\log D_j(I_j)\right] + \mathbb{E}_{\hat{I}_j}\left[\log\left(1 - D_j(\hat{I}_j)\right)\right] \right)$$

where the subscript j indexes the facial region and the corresponding discriminator.
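A hedged PyTorch sketch of this global-local adversarial loss; the vanilla GAN formulation (with the non-saturating objective for the generator) and the dictionary-based interface are assumptions.

```python
import torch
import torch.nn.functional as F

def global_local_adv_loss(d_outs_real, d_outs_fake):
    """Sum of the global (F) and three local (E, N, M) adversarial losses.
    d_outs_real / d_outs_fake: dicts mapping region label j to the raw
    discriminator logits D_j(I_j) / D_j(Î_j). Returns (D loss, G loss)."""
    loss_d, loss_g = 0.0, 0.0
    for j in ("F", "E", "N", "M"):
        real, fake = d_outs_real[j], d_outs_fake[j]
        # Discriminator: real → 1, fake → 0.
        loss_d += F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) \
                + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake))
        # Generator (non-saturating): fake → 1.
        loss_g += F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
    return loss_d, loss_g
```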

Identity-preserving loss
Preserving identity is a critical part of synthesizing the frontal face image. We exploit the pre-trained LightCNN-29Layers-V2 [36] to give our model the ability to preserve identity. The identity-preserving loss is defined as:

$$\mathcal{L}_{ip} = \left\| \varphi(\hat{I}_F) - \varphi(I_F) \right\|_2$$

where φ(·) is the output feature of the fully connected layer of the pre-trained LightCNN and ‖·‖_2 is the L2-norm.
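A minimal sketch of this loss, where `phi` stands for the frozen, pre-trained LightCNN-29Layers-V2 feature extractor (loading the network itself is omitted here).

```python
import torch

def identity_loss(phi, fake_frontal, real_frontal):
    """L_ip = ||phi(Î_F) - phi(I_F)||_2, with phi the fully connected
    feature of a frozen pre-trained LightCNN-29Layers-V2 [36]."""
    with torch.no_grad():            # the identity network is not updated
        feat_real = phi(real_frontal)
    feat_fake = phi(fake_frontal)    # gradients still flow to the generator
    return torch.norm(feat_fake - feat_real, p=2, dim=1).mean()
```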

Multi-scale pixel-wise loss
Following [12], we employ a multi-scale pixel-wise loss to constrain the content consistency between the synthesized Î_F and the corresponding frontal image I_F:

$$\mathcal{L}_{pixel} = \sum_{i=1}^{3} \frac{1}{C \times W_i \times H_i} \sum_{c=1}^{C} \sum_{w=1}^{W_i} \sum_{h=1}^{H_i} \left| \hat{I}_F^{i,c,w,h} - I_F^{i,c,w,h} \right|$$

where C is the channel number and W_i and H_i are the width and height of the i-th scale. The scales are 128 × 128, 64 × 64 and 32 × 32.
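A sketch under two assumptions: the pixel-wise distance is L1 (the paper does not state the norm explicitly), and the coarser scales are obtained by downsampling the final output rather than from intermediate decoder layers, which is a simplification of the multi-scale scheme in [12].

```python
import torch
import torch.nn.functional as F

def multiscale_pixel_loss(fake_frontal, real_frontal, scales=(128, 64, 32)):
    """Mean absolute error between Î_F and I_F at 128×128, 64×64 and 32×32."""
    loss = 0.0
    for s in scales:
        fake = F.interpolate(fake_frontal, size=(s, s), mode="bilinear", align_corners=False)
        real = F.interpolate(real_frontal, size=(s, s), mode="bilinear", align_corners=False)
        loss += F.l1_loss(fake, real)  # l1_loss already averages over C·W_i·H_i
    return loss
```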

Perceptual loss
Johnson et al. [37] propose using a perceptual loss to measure the similarity between images. We use a pre-trained VGG-19 [38] as the feature extractor to obtain feature maps:

$$\mathcal{L}_{p} = \frac{1}{C \times W \times H} \left\| \psi(\hat{I}_F) - \psi(I_F) \right\|_2^2$$

where ψ(·) denotes the extracted feature map, C is its channel number, and W and H are its width and height.
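A sketch using torchvision's pre-trained VGG-19; the choice of the relu3_3 feature layer (index 16) is an assumption, as the paper does not name the layer, and grayscale inputs would need to be repeated to three channels first.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    """Feature-space distance on a frozen pre-trained VGG-19 [38]."""
    def __init__(self, layer=16):  # relu3_3; layer choice is an assumption
        super().__init__()
        self.features = vgg19(weights="IMAGENET1K_V1").features[:layer].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)  # the extractor is never updated

    def forward(self, fake, real):
        # Inputs must be 3-channel; repeat grayscale images beforehand.
        f_fake, f_real = self.features(fake), self.features(real)
        c, h, w = f_fake.shape[1:]
        return torch.sum((f_fake - f_real) ** 2) / (c * h * w * fake.shape[0])
```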

Total variation regularization
We introduce total variation regularization [37] to remove artifacts and improve the synthesis quality of images:

$$\mathcal{L}_{tv} = \frac{1}{C \times W \times H} \sum_{c=1}^{C} \sum_{w=1}^{W} \sum_{h=1}^{H} \left( \left| \hat{I}_F^{c,w+1,h} - \hat{I}_F^{c,w,h} \right| + \left| \hat{I}_F^{c,w,h+1} - \hat{I}_F^{c,w,h} \right| \right)$$

where C, W and H are the channel number, width and height of the synthesized image Î_F.
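A minimal sketch of this penalty; the anisotropic (L1) variant of total variation, as in Johnson et al. [37], is assumed.

```python
import torch

def tv_loss(img):
    """Total variation regularization: penalizes differences between
    neighboring pixels to remove artifacts in the synthesized image."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().sum()  # vertical neighbors
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().sum()  # horizontal neighbors
    b, c, h, w = img.shape
    return (dh + dw) / (b * c * h * w)
```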

Overall objective function
The final objective function is a weighted sum of all the aforementioned losses:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{adv} + \lambda_2 \mathcal{L}_{ip} + \lambda_3 \mathcal{L}_{pixel} + \lambda_4 \mathcal{L}_{p} + \lambda_5 \mathcal{L}_{tv}$$

where λ_1, λ_2, λ_3, λ_4 and λ_5 are hyper-parameters weighting each loss term.
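Combining the terms with the weights reported in the implementation details; mapping λ_1 through λ_5 to the losses in the order they are introduced is an assumption.

```python
# Weights λ_1..λ_5 from the implementation details; the mapping of each λ_i
# to a loss term follows the order the losses appear in and is an assumption.
LAMBDAS = dict(adv=1.0, ip=0.01, pixel=10.0, perceptual=0.01, tv=0.01)

def total_generator_loss(l_adv, l_ip, l_pixel, l_perc, l_tv):
    """Weighted sum of the five individual loss terms."""
    return (LAMBDAS["adv"] * l_adv + LAMBDAS["ip"] * l_ip
            + LAMBDAS["pixel"] * l_pixel + LAMBDAS["perceptual"] * l_perc
            + LAMBDAS["tv"] * l_tv)
```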

Implementation details
In the training process, we use pairs of frontal and non-frontal face images (I_F, I_P) from the CAS-PEAL-R1 dataset as input. We use the face detection model RetinaFace [41] to preprocess each image and use the position of the nose tip as the center point to crop images to size 128 × 128. In addition, we align images so that the feature points of the left and right eyes lie on the same horizontal line. Based on these operations, we build discriminators on fixed areas of all face images. The center points of the three regions (i.e., eyes, nose, mouth) are (64, 42), (64, 64) and (64, 90), respectively, and their sizes are fixed at 84 × 25, 24 × 38 and 52 × 18, respectively. We use the Adam optimizer (β_1 = 0.9, β_2 = 0.999). The hyper-parameters of the objective function are set as λ_1 = 1.0, λ_2 = 0.01, λ_3 = 10, λ_4 = 0.01, λ_5 = 0.01. The weight decay, batch size and learning rate are fixed at 5 × 10^−4, 6 and 1 × 10^−4, respectively. A single NVIDIA RTX 2080Ti GPU with 12 GB of memory is used in our experiments, and the network is implemented in PyTorch. After training, our model processes grayscale images of size 128 × 128 at 30 fps.
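A sketch of the reported optimizer configuration; the stand-in modules and the decision to apply the same weight decay and learning rate to the generator and all four discriminators are assumptions.

```python
import torch

# Stand-in modules; in practice these are the generator and the four
# discriminators (D_F, D_E, D_N, D_M) described above.
generator = torch.nn.Conv2d(1, 1, 3, padding=1)
discriminators = [torch.nn.Conv2d(1, 1, 3) for _ in range(4)]

# Adam with the reported betas, learning rate and weight decay.
opt_g = torch.optim.Adam(generator.parameters(),
                         lr=1e-4, betas=(0.9, 0.999), weight_decay=5e-4)
opt_d = torch.optim.Adam((p for d in discriminators for p in d.parameters()),
                         lr=1e-4, betas=(0.9, 0.999), weight_decay=5e-4)
batch_size = 6
```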

Qualitative results
To verify the performance of our model, we conduct test experiments on two datasets: CAS-PEAL-R1 and CASIA-FaceV5. On the CAS-PEAL-R1 dataset, Fig. 3 shows DMA-GAN's ability to synthesize high-quality frontal face images that maintain clear facial features and a stable facial structure, even when much semantic information is lost at wide angles (β = −30°, α = 45°).
To further demonstrate the performance of our model, we also visually compare images synthesized on CAS-PEAL-R1 with state-of-the-art methods (TP-GAN [10], CR-GAN [18], M2FPA [12], DA-GAN [13]), as shown in Fig. 4. We can observe that DMA-GAN displays strong performance in both facial texture detail and geometric shape. To demonstrate our model's generalization ability, we use images from the CASIA-FaceV5 dataset to test our model trained solely on CAS-PEAL-R1. The synthesis results, shown in Fig. 5, reveal that our model can faithfully synthesize frontal-view face images.

Table 1: Rank-1 recognition rates (%) across yaw (α) and pitch (β) pose variations on CAS-PEAL-R1. The highest recognition rate in each case is bolded.

Quantitative results
We conduct rank-1 face recognition on the CAS-PEAL-R1 dataset to quantitatively verify the identity-preserving ability of DMA-GAN. We follow the open evaluation protocol, using a pre-trained LightCNN-29Layers-V2 [36] to extract deep features and the cosine-distance metric to calculate similarity. Table 1 shows the recognition rates of our model and several popular methods.
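A sketch of rank-1 identification under this protocol: features come from the pre-trained LightCNN, probes are frontalized test images, and the gallery holds one frontal image per identity. The function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def rank1_accuracy(probe_feats, gallery_feats, probe_ids, gallery_ids):
    """Rank-1 identification with cosine similarity: each probe feature
    (from a frontalized image) is matched to its closest gallery feature."""
    probe = F.normalize(probe_feats, dim=1)     # unit-norm so that the dot
    gallery = F.normalize(gallery_feats, dim=1) # product is cosine similarity
    sims = probe @ gallery.t()                  # probes × gallery similarity matrix
    nearest = sims.argmax(dim=1)                # best-matching gallery entry per probe
    correct = (gallery_ids[nearest] == probe_ids).float()
    return correct.mean().item()
```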
It can be observed that DMA-GAN outperforms CR-GAN and exceeds or equals the highest recognition rates of the other models at some angles (in bold in Table 1). In terms of network structure, TP-GAN uses a two-pathway generative adversarial network, which greatly reduces training efficiency: training TP-GAN lasts one day, while ours lasts about five hours. M2FPA and DA-GAN use a pre-trained face parser to generate masks that serve as additional inputs, and such data are always difficult and time-consuming to acquire. In contrast, our model uses a single generation path and requires no input other than (I_F, I_P). Thus, we achieve competitive, if not always the best, performance with much higher efficiency through a simpler network.

Ablation study
To verify the superiority of DMA-GAN and the contributions of its components, we remove each component individually and test the synthesis performance. We therefore train four partial variants of DMA-GAN: one without the deep feature encoder (DFE), one without the three local discriminators (sub-D), one without self-attention in the discriminators (D-att), and one without the identity-preserving loss (id). The synthesis results of the ablative comparison are shown in Fig. 6, and Table 2 shows the quantitative comparison.
First, the images synthesized without L_ip cannot preserve identity well. Second, the models without sub-D and D-att can hardly capture the details of facial features, especially in the eye, nose and mouth regions. Moreover, the deep feature encoder (DFE) we designed displays notable performance in perceiving local texture and synthesizing more vivid facial details.

Conclusion
In this paper, we propose a deep GAN-based model combining multi-attention mechanisms (DMA-GAN) that can be effectively used for face frontalization. DMA-GAN can synthesize high-quality frontal-view face images from multi-pose face images and requires no prior knowledge of the face as additional input. The synthesized results are compelling, showing that our model has practical significance.
There are two main limitations of our method. First, paired training data are required to train our model, and such data are always difficult and time-consuming to acquire. Liu and Chen [42] propose DRCycleGAN, which can be trained with unpaired data. Second, the performance of the model in the wild remains to be improved. CCFF-GAN [15] uses semi-supervised learning to address this problem.