An attention-erasing stripe pyramid network for face forgery detection

Face forgery detection aims to distinguish between real and fake facial images or videos by identifying manipulated or forged visual media. The main challenge in face forgery detection is achieving high model generalization ability, i.e., satisfactory performance under cross-database scenarios where the training and testing datasets come from different forgery methods. To achieve this goal, this paper presents an attention-erasing stripe pyramid network (ASPNet) that utilizes high-frequency noise and exploits both RGB and fine-grained frequency clues. First, since separately extracting features from different scales and granularities ignores their complementarity, we employ a stripe pyramid block (SPB) to learn multi-scale and multi-granularity features simultaneously. Second, to make the model focus on useful information and suppress noise, a two-stage attention block (TSAB) is introduced that combines spatial attention and channel attention to filter out pixel-wise and channel-wise noise in the learned feature maps. Finally, to dynamically guide the model to pay attention to different areas of the human face, an attention erasing (AE) scheme is adopted that randomly erases units in attention maps. Extensive experiments demonstrate that ASPNet outperforms F³-Net on the FaceForensics++ dataset. The area under the receiver operating characteristic curve (AUC) and the accuracy (ACC) of our model reach 77.4% and 70.85%, respectively, improvements of 0.83% and 1.28% over F³-Net. Our code is available at: https://github.com/NWPU-Zwu.


Introduction
Face forgery technology usually refers to manipulating faces in videos or images or replacing facial identities. Recent studies have shown rapid progress in facial manipulation, which enables an attacker to manipulate the facial area of human faces. Consequently, many face forgery detection methods have been proposed to address this problem. However, these methods still suffer from the following problems. First, some existing methods [1-3] extract multi-scale features for face forgery detection, and some other methods [4,5] note the effectiveness of multi-granularity features, but these approaches rarely combine both extraction patterns. Learning multi-scale features gives the model the ability to extract more comprehensive information, including both global and local details, while different granularities can capture invariant fine-grained manipulation patterns. The combined extraction pattern is more sensitive to forgery clues of different sizes and qualities because it further subdivides the proportions of local and semantic information. Based on this observation, this paper employs a stripe pyramid block (SPB), inspired by [6], to extract multi-granularity features from different scales and concatenate them together to take full advantage of the complementarity between them.
Second, channel attention weights different channel features in the channel dimension so that the network can selectively amplify valuable feature channels and suppress useless ones. The commonly used Squeeze and Excitation Block (SEB) [7] computes a channel-wise attention mask by applying Global Average Pooling (GAP). However, because GAP aggregates a three-dimensional feature map using only summation and averaging, discriminative subtle features from some local regions can be diluted by other regions. Therefore, applying the SEB alone, without auxiliary guidance, may miss useful information during feature learning. To address this problem, we note that the FAB [8] reweights feature maps in both the spatial and channel dimensions, so it can be expected to filter spatial noise and highlight discriminative local details. Thus, we introduce a two-stage attention block (TSAB), inspired by [9], to guide the model to focus on useful information and suppress useless information while filtering spatial noise and highlighting discriminative local details.
Third, existing models [10,11] tend to pay attention to limited regions of the human face (e.g., eyes, nose) when using an attention mechanism. However, these methods may ignore other potential regions that are imperceptible but discriminative. To solve this problem, inspired by [12], this paper adopts an attention erasing (AE) module that randomly selects a region of the attention map to erase. This module can dynamically direct the model to pay attention to different subtle regions and find potentially discriminative forgery traces.
In general, based on a two-stream network, we propose a face forgery detection method named the attention-erasing stripe pyramid network. Features from different scales and different granularities are fused by the stripe pyramid block, useful features are enhanced and useless features are suppressed by the two-stage attention block, and the model is dynamically directed to focus on different regions through attention erasing.
In summary, the main contributions of this paper are as follows:
• A stripe pyramid block (SPB) is employed to extract multi-granularity features in a horizontal stripe fashion from different scales to obtain global and local features, and then fuses them to take full advantage of their complementarity.
• A two-stage attention block (TSAB) is employed to guide the model to focus on useful information and suppress useless information through spatial attention and channel attention, while filtering spatial noise and highlighting discriminative local details.
• An attention erasing (AE) scheme is adopted to strengthen the model's ability to search for regions with potential forgery traces by randomly erasing regions of the face.

Related work
Spatial-Based Forgery Detection
With the development of face forgery technology, a variety of forgery detection algorithms have been widely used and some results have been achieved. Because texture information can provide discriminative feature representations, most methods extract RGB features. Some works [1,13] utilized auxiliary supervision such as blending boundaries or forged masks. Cao et al. [14] presented a reconstruction-classification learning method that mines the common features of genuine faces. Yang et al. [15] proposed a multi-scale Siamese prediction framework. Aloraini [16] proposed a novel approach based on fusing three streams of convolutional neural networks. Atkale et al. [17] designed an approach known as the multiscale feature fusion model followed by a residual network.

Frequency-Based Forgery Detection
Because of the effectiveness of frequency information for forgery detection, several methods pay attention to the frequency domain to explore subtle clues. Qian et al. [18] proposed F³-Net for face forgery detection, utilizing the discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) to collect frequency-aware clues that reveal subtle forgery artifacts and compression errors. Luo et al. [2] used the SRM filter to guide the RGB features. Wang et al. [19] proposed using modified Landweber iterations for reverse image filtering. Jia et al. [20] proposed a novel adversarial attack method based on meta-learning to generate perturbations in the frequency domain.
Attention Mechanism
The attention map can highlight the manipulated regions of an image and thus guide the network to detect these regions, so this approach is useful for face forgery detection. Zhao et al. [21] proposed a multi-attention network to capture discriminative local features from multiple high-attention regions. Tao et al. [22,23] designed an attention mechanism for smoke recognition. Wang et al. [24] proposed an attention-based data augmentation framework that mines training data during training to enhance the model's attention to multiple facial regions. Duan et al. [25] designed multi-spectral channel attention for person re-identification.

The whole network architecture
The proposed network takes a face image as input and is divided into an upper branch and a lower branch. The upper branch uses DCT to obtain the high-frequency noisy image and then uses IDCT to transform it back to the RGB domain. The lower branch uses the sliding window discrete cosine transform (SWDCT) to obtain the high-frequency noisy image, which is also transformed back to the RGB domain. Xception [26] is selected as the backbone network. The last feature maps from the 1st, 5th, 9th and 12th Xception blocks of the upper CNN are sent to the SPB. The last feature maps from the 1st Xception block are sent to the AE. The feature maps from the last block of the lower CNN are sent to the TSAB. The information of the upper and lower branches interacts through dual cross-modality attention (DCMA) [18]. Finally, the outputs of the SPB, AE and TSAB are concatenated together for prediction.
Based on the idea of horizontal stripes, the SPB extracts and fuses features with multiple granularities at different scales in the network. The TSAB guides the model to focus on useful information and suppress useless information through the attention mechanism. The AE dynamically guides the model to pay attention to different regions by erasing attention maps while looking for potential discriminative forgery traces.

Stripe pyramid block (SPB)
Let $f_l \in \mathbb{R}^{C_l \times H_l \times W_l}$ ($l \in \{1, \dots, 4\}$) represent the feature maps generated by the last layer of the $l$-th selected Xception block. After a 1 × 1 convolution layer, followed by a BN (batch normalization) layer and a leaky ReLU layer, each feature map $f_l$ produces a new feature map $f'_l \in \mathbb{R}^{C_l \times H_l \times W_l}$. The process of obtaining the new feature maps can be expressed as:

$$f'_l = \mathrm{LeakyReLU}(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(f_l))).$$

Then, $f'_l$ is processed by the stripe pyramid block, parametrized by the number of pyramid levels $m_l$. As shown in Fig. 2, which depicts the architecture of the SPB (inspired by work [6]), we describe the process of the SPB with $m_l = 3$ for clarity and simplicity.
More specifically, at level $p \in \{1, \dots, m_l\}$ of the pyramid, the feature map $f'_l$ is divided into $s = 2^{p-1}$ horizontal stripes along the height dimension (in other words, the division is only performed when $p > 1$), where each stripe has height $\delta_p = H_l / s$. A max pooling function $P(\cdot)$ is then used to project the $i$-th stripe, with $i \in \{1, \dots, s\}$, onto a feature vector:

$$x_l^{p,i} = P\big(f'_{l,p,i}\big),$$

where $f'_{l,p,i}$ denotes the $i$-th stripe of $f'_l$ at level $p$. The feature vector of each horizontal stripe is obtained separately by this operation, and the vectors of all levels are then concatenated to obtain the stripe pyramidal vector $x_l$.
The last-layer feature maps from the 1st, 5th, 9th and 12th Xception blocks are sent to the SPB with pyramid levels $m_l$, respectively. After a series of experiments, we set $m_l = 4, 3, 2, 1$, which achieves the best performance among the different value combinations. After the above operations, the feature vectors $x_1, x_2, x_3, x_4$ are obtained. Finally, these vectors are concatenated together to form the final stripe pyramidal vector.
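To make the stripe pyramid operation concrete, the following is a minimal PyTorch sketch of one SPB branch under the description above: the feature map is reduced by a 1 × 1 convolution with BN and leaky ReLU, split into $2^{p-1}$ horizontal stripes at each level $p$, max-pooled per stripe, and the resulting vectors are concatenated. The class name, embedding width, and input shape are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of a stripe pyramid block (SPB) branch; dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripePyramidBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, num_levels: int):
        super().__init__()
        # 1x1 conv + BN + LeakyReLU producing f'_l
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(inplace=True),
        )
        self.num_levels = num_levels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.reduce(x)                              # f'_l
        vectors = []
        for p in range(1, self.num_levels + 1):
            s = 2 ** (p - 1)                            # number of horizontal stripes at level p
            for stripe in torch.chunk(f, s, dim=2):     # split along the height dimension
                v = F.adaptive_max_pool2d(stripe, 1)    # max pooling P(.) over each stripe
                vectors.append(v.flatten(1))
        return torch.cat(vectors, dim=1)                # stripe pyramidal vector x_l

# Example: feature maps from one Xception block (illustrative shape), m_l = 3 levels.
feat = torch.randn(2, 728, 19, 19)
x_l = StripePyramidBlock(728, 256, num_levels=3)(feat)  # shape (2, 256 * (1 + 2 + 4))
```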

Two-stage attention block (TSAB)
As shown in Fig. 3, which depicts the architecture of the TSAB (inspired by work [9]), the feature maps from the last block of the lower branch are used as input $X \in \mathbb{R}^{C \times H \times W}$. To filter out both pixel-level and channel-level noise in the feature maps, a two-stage attention block is utilized. The module consists of two basic blocks: the Fully Attentional Block (FAB) [8] and the Squeeze and Excitation Block (SEB) [7]. In Fig. 3, the upper part shows the architecture of the FAB and the lower part shows the architecture of the SEB. $X$ is first sent to the FAB. The attention mask $M_{\mathrm{FAB}} \in \mathbb{R}^{C \times H \times W}$ is obtained after a convolution layer with kernel size 1 × 1, a ReLU activation layer, another convolution layer with kernel size 1 × 1, and a Sigmoid activation layer. $M_{\mathrm{FAB}}$ is multiplied with $X$ to obtain $X'$. Then $X'$ is fed into the SEB, and the attention mask $M_{\mathrm{SEB}} \in \mathbb{R}^{C \times 1 \times 1}$ is obtained after a global average pooling, a fully connected layer, a ReLU activation layer, another fully connected layer, and a Sigmoid activation layer. $M_{\mathrm{SEB}}$ is then multiplied with $X'$ to obtain the output feature map $X''$ of the TSAB. The overall computational process is represented as:

$$X' = M_{\mathrm{FAB}} \odot X, \qquad X'' = M_{\mathrm{SEB}} \odot X',$$

where $\odot$ represents element-wise multiplication.
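The following PyTorch sketch illustrates the two-stage attention described above, with the FAB-style spatial mask implemented as the stated 1 × 1 conv, ReLU, 1 × 1 conv, Sigmoid sequence and the SEB mask as GAP, FC, ReLU, FC, Sigmoid. The reduction ratio and tensor shapes are assumptions made for illustration.

```python
# Minimal sketch of the two-stage attention block (TSAB); reduction ratio is assumed.
import torch
import torch.nn as nn

class TSAB(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Stage 1 (FAB-style pixel-wise mask): 1x1 conv -> ReLU -> 1x1 conv -> Sigmoid
        self.fab = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Stage 2 (SEB channel-wise mask): GAP -> FC -> ReLU -> FC -> Sigmoid
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.seb = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m_fab = self.fab(x)                     # M_FAB, shape (B, C, H, W)
        x1 = m_fab * x                          # X' = M_FAB ⊙ X
        b, c, _, _ = x1.shape
        m_seb = self.seb(self.gap(x1).view(b, c)).view(b, c, 1, 1)  # M_SEB, (B, C, 1, 1)
        return m_seb * x1                       # X'' = M_SEB ⊙ X'

# Example with an illustrative feature-map shape from the lower branch.
feat = torch.randn(2, 728, 19, 19)
out = TSAB(728)(feat)                           # same shape as the input
```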

Attention erasing (AE)
As shown in Fig. 4, which depicts the architecture of the attention erasing (AE) module (inspired by work [12]), the feature maps $X \in \mathbb{R}^{C \times H \times W}$ from the last layer of the first Xception block in the upper branch are used as input. Instead of cropping discriminative regions directly from face landmarks, this paper uses an attention module, LANet [27], to automatically locate discriminative regions. First, an attention map $M_i \in \mathbb{R}^{H \times W}$, $i \in \{1, 2, 3\}$, is generated by the $i$-th LANet branch; it represents the informative face regions. The computation process is:

$$M_i = \mathrm{Conv}(X),$$

where Conv denotes a 1 × 1 convolution layer. Notice that the erasing is only conducted in the training stage. After we obtain $M_i$, a random probability $P$ between 0 and 1 is drawn; when $P$ is greater than 0.5, $M_i$ is sent to the erasing block.
The erasing is conducted on the feature maps in the following steps. First, the erasing block selects a rectangular region of size $S_e$ within the attention map $M_i$ and erases it (sets its values to zero); the ratio of $S_e$ to the area of $M_i$ must lie within a preset range. Then the refined feature map $X_r$ for the $i$-th branch is aggregated from the attention map $M_i$ (erased or original) and the input $X$:

$$X_r = M_i \odot X,$$

where $\odot$ denotes the element-wise multiplication operation. After a Global Average Pooling (GAP) layer, the local facial representation $P_i$ for the $i$-th branch is obtained.
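Below is a minimal sketch of one AE branch, assuming the LANet attention map is produced by a single 1 × 1 convolution and that the erased rectangle covers a fixed fraction of the map; the 0.5 probability threshold and training-only erasing follow the text, while the class name, erased-size ratio, and shapes are illustrative assumptions.

```python
# Minimal sketch of one attention-erasing (AE) branch; erase_ratio is assumed.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class AEBranch(nn.Module):
    def __init__(self, channels: int, erase_ratio: float = 0.25):
        super().__init__()
        self.attn_conv = nn.Conv2d(channels, 1, kernel_size=1)  # LANet-style 1x1 conv
        self.erase_ratio = erase_ratio  # assumed fraction of H and W for the erased rectangle

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        m = self.attn_conv(x)                         # attention map M_i, shape (B, 1, H, W)
        if self.training and random.random() > 0.5:   # erase with probability 0.5 (training only)
            eh = max(1, int(h * self.erase_ratio))
            ew = max(1, int(w * self.erase_ratio))
            top = random.randint(0, h - eh)
            left = random.randint(0, w - ew)
            m = m.clone()
            m[:, :, top:top + eh, left:left + ew] = 0  # zero out a random rectangle
        x_r = m * x                                    # X_r = M_i ⊙ X
        return F.adaptive_avg_pool2d(x_r, 1).flatten(1)  # GAP -> local representation P_i

feat = torch.randn(2, 128, 74, 74)   # illustrative shape of an early Xception feature map
local_repr = AEBranch(128).train()(feat)   # (2, 128)
```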

Loss function
In our model, the loss function is the cross-entropy loss:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big],$$

where $N$ is the number of training samples, $y_i$ is the label of sample $i$ (the label of a real face image is set to 0 and the label of a fake face image is set to 1), and $p_i$ is the predicted probability that sample $i$ is manipulated.
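For reference, this loss corresponds to the standard binary cross-entropy, which in PyTorch could be computed as follows (assuming the network outputs one logit per image; BCEWithLogitsLoss applies the sigmoid internally):

```python
# Binary cross-entropy over a hypothetical batch of 16 faces (0 = real, 1 = fake).
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
logits = torch.randn(16)                     # placeholder model outputs
labels = torch.randint(0, 2, (16,)).float()  # placeholder labels
loss = criterion(logits, labels)
```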

Datasets
We use FaceForensics++ [28] as the training and testing dataset. FaceForensics++ is a face forgery detection video dataset containing 1000 real videos, of which 720 are used for the training set, 140 for the validation set, and 140 for the testing set. Each video is subjected to four forgery methods to generate four manipulated videos, for a total of 5000 videos. The generated videos have different quality levels to create a realistic setting for manipulated videos. In this paper, we use the Dlib [29] library to randomly sample 8 frames from each original low-quality (LQ) video and 2 frames from each LQ video manipulated by the different forgery methods. The dataset therefore comprises a total of 16,000 images, including 8000 real face images and 8000 fake face images. To ensure the same proportion of real and fake face images in the training and testing datasets, 11,520 samples are selected for the training set, including 5760 real and 5760 fake face images, and 4480 samples are used for the testing set, including 2240 real and 2240 fake face images. Figure 5 shows some samples from the FaceForensics++ dataset. It can be seen that many manipulation methods have emerged and the differences between forgery traces are apparent in the images generated by different face forgery methods. This indicates that as forgery techniques continue to evolve, the requirement for model generalization ability also increases.
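A rough sketch of how such frames and face crops could be sampled with Dlib and OpenCV is given below; the file path, crop handling, and output size are illustrative assumptions rather than the authors' exact preprocessing pipeline.

```python
# Sample random frames from a video and crop the detected face; paths are hypothetical.
import random
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def sample_face_crops(video_path: str, num_frames: int, out_size: int = 299):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    crops = []
    for idx in sorted(random.sample(range(total), min(num_frames, total))):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        faces = detector(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 1)
        if not faces:
            continue
        f = faces[0]
        crop = frame[max(f.top(), 0):f.bottom(), max(f.left(), 0):f.right()]
        crops.append(cv2.resize(crop, (out_size, out_size)))
    cap.release()
    return crops

# e.g., 8 frames per real video, 2 per manipulated video (hypothetical path).
real_crops = sample_face_crops("original/000.mp4", num_frames=8)
```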

Implementation details
ASPNet is implemented using the PyTorch deep learning framework. Xception is chosen as the backbone network, without pretraining, to extract image features in the initial stage. During training, horizontal flipping is employed for data augmentation. The input image size is set to 299 × 299 pixels. The total number of training epochs is set to 50. The model is trained using the Adam [46] optimizer with an initial learning rate of 0.0002 and betas of (0.9, 0.999). The batch size is set to 16. We set $m_l = 4, 3, 2, 1$ for the 1st, 5th, 9th and 12th Xception blocks, respectively. The number of LANet branches is 3. The random probability $P$ is sampled from the range [0, 1] and the erasing threshold is set to 0.5. Following works [28,30,31], we use the accuracy score (ACC) and the area under the receiver operating characteristic curve (AUC) as our evaluation metrics.
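The training configuration above can be summarized by the following minimal PyTorch sketch; the stand-in model and random tensors only keep the snippet self-contained and should be replaced by the actual ASPNet and FaceForensics++ data.

```python
# Minimal training-loop sketch matching the stated hyperparameters (Adam, lr 2e-4,
# betas (0.9, 0.999), batch size 16, 50 epochs); the model and data are stand-ins.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(                       # stand-in for ASPNet
    nn.Conv2d(3, 8, 3, stride=4), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1)
)
train_dataset = TensorDataset(               # stand-in for the FaceForensics++ crops
    torch.randn(32, 3, 299, 299), torch.randint(0, 2, (32,)).float()
)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
criterion = nn.BCEWithLogitsLoss()

for epoch in range(50):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images).squeeze(1), labels)
        loss.backward()
        optimizer.step()
```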

Quantitative comparison
Comparison with recent works
To prove the superiority of our model, we conduct quantitative comparison experiments between ASPNet and several advanced models. The results of most comparison methods are taken from [1], and the remaining results are obtained by us.
The comparison is shown in Table 1. Our ASPNet consistently outperforms most compared methods by a considerable margin. For example, compared with the state-of-the-art Xception [26], the AUC of our method exceeds it by 0.88%, and a similar gain is obtained in ACC. Compared with the recent method Face X-ray [32], the proposed method also achieves competitive results. Different from Xception, Face X-ray utilizes a blending boundary that is susceptible to noise to supervise the learning of the model, whereas our method learns a content-independent local pattern through the SPB.
As shown in Table 2, we also train the models on the FaceForensics++ dataset and evaluate them on DeepFakes (DF) [40], Face2Face (F2F) [41], FaceSwap (FS) [42], and NeuralTextures (NT) [43], respectively. We follow the common practice of only using the LQ videos of each manipulation method for training and testing. We can see from Table 2 that our method achieves improvements over F³-Net [18] in most cases. On the widely used forgery subset DF, our method exceeds F³-Net [18] by 3.23% and 0.63% in ACC and AUC, respectively. Improvements are also gained on F2F, of 2.09% and 0.59%, respectively. FS transfers the identity of a source face to a target face while keeping the attributes (e.g., expression, posture, lighting) of the target face unchanged. NT utilizes a conditional GAN to store pixel features (as opposed to RGB) in texture maps, which leaves fewer visible artifacts. These two methods generate forged face images that are much harder to discriminate. On the challenging FS and NT subsets, the performance of our model is comparable to that of F³-Net [18]. It is worth mentioning that the number of training samples for methods such as multi-task [37] is much larger than for our method, which means that our model can achieve satisfactory performance in real-world scenarios that lack facial images. Meanwhile, because of the smaller training set, the training time and cost of our model are also lower than those of most state-of-the-art face forgery detection models.
Generalize from one method to another
As new forgery methods are emerging all the time, the generalizability of detection models directly affects their application in real-world scenarios. We perform the cross-dataset evaluation on the FaceForensics++ (LQ) dataset consisting of four different forgery algorithms, i.e., DF, F2F, FS, and NT.
We train the models using forged images of one method, and the remaining images of all four methods are used for testing. As shown in Table 3, our model exceeds F³-Net [18] in most cases. Although F³-Net [18] is comparable to our ASPNet on DF and F2F, it cannot generalize well to other forgeries, and its performance on FS and NT, which are newer forgery methods, is not as good as ours. On the most challenging NT manipulation, which does not produce noticeable forgery clues, our model exceeds F³-Net [18] by 0.13% and 0.04% in ACC and AUC, respectively. This further illustrates the better generalization ability of our model.

Qualitative comparison
To further qualitatively investigate the advantages of our model for face forgery detection, we employ Grad-CAM [44] to present the class activation maps (CAMs) of our model and F³-Net [18]. The CAM indicates where the network allocates more attention. Figure 6 shows some representative results of the qualitative comparison using CAMs. As can be observed, the images in the first row show that the proposed model pays more attention to the eyes, which is more conducive to distinguishing real and fake face images, indicating that the model can focus on discriminative local details. From the images in the second row, we can see that the proposed model focuses on the philtrum and nose, which indicates that the model can not only focus on significant forgery traces but also mine potential ones. From the images in the third row, we can see that our model can focus more on the forged area by extracting multi-scale and multi-granularity information when the forgery trace is not obvious.
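For readers who wish to reproduce such visualizations, the sketch below shows one common way to compute Grad-CAM heat maps with forward and backward hooks; the backbone and target layer are stand-ins rather than the exact setup used for Fig. 6.

```python
# Compute a Grad-CAM heat map for a chosen convolutional layer; backbone is a stand-in.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18().eval()               # stand-in backbone
target_layer = model.layer4
feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

img = torch.randn(1, 3, 299, 299)              # placeholder input image
score = model(img)[0].max()                    # score of the predicted class
score.backward()

weights = grads["a"].mean(dim=(2, 3), keepdim=True)         # channel-wise importance
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=img.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalized heat map
```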

Ablation studies
We conduct ablation experiments to verify the effectiveness and advantages of each component in our model. All experiments use the same training dataset and parameter settings to ensure a fair comparison. Table 4 presents the results of the ablation experiments conducted on the FaceForensics++ dataset. From the table, it is evident that our ASPNet outperforms the other combinations in ACC and also achieves a high AUC. Specifically, the ACC of F³-Net [18] with TSAB is improved by 0.1%, and the AUC is improved by 0.48%. This demonstrates that TSAB can guide the model to focus on useful information and suppress useless information, while filtering spatial noise through spatial attention and channel attention. Compared with the original F³-Net [18], the ACC of F³-Net [18] using both SPB and TSAB is improved by 0.3%, and the AUC is improved by 1.35%. This is because SPB provides the model with the ability to extract multi-granularity features at different scales, which is complementary to TSAB. The ACC of our full model is improved by 1.28%, and the AUC is improved by 0.83%, which verifies that AE can improve the ability of the model to find potentially discriminative manipulated areas. From this experiment, we observe that the model's performance gradually improves as each module is added step by step, indicating the effectiveness of each module.
The parameters and FLOPs of the proposed ASPNet and other state-of-the-art methods are presented in Table 5. It can be observed that our model achieves better performance than F³-Net [18] with only a few additional parameters. In practice, our model can process 1947 face images per minute, while F³-Net [18] can process 1895 face images per minute, indicating that our model is also suitable for practical application scenarios. Other methods have fewer parameters, but they require far more training samples and time than our model, and they are not guaranteed to achieve better performance.
Since the number of parameters is not directly related to the speed of model convergence, we further analyze the training loss curves of different models. As shown in Fig. 7, the black curve represents the loss of the original F³-Net [18] and the red curve represents the loss of our model. It can be seen that the loss of our model is lower than that of F³-Net [18] during most of the training process, which indicates that the gaps between the predicted labels and the true labels of our model are smaller than those of the original model. At the same time, the loss curve of our model converges faster, which shows that the learning ability of our model is stronger than that of F³-Net [18].

Discussion and conclusion
In this paper, we present an attention-erasing stripe pyramid network for face forgery detection. The stripe pyramid block extracts multi-granularity features from different scales and concatenates them together to take full advantage of the complementarity between them. The introduced two-stage attention block filters spatial noise and highlights discriminative local details while focusing on useful information and suppressing useless information. The employed attention erasing expands the ability of the model to search for areas where there may be forgery traces by randomly erasing the sensitive areas of the face.