FDS_2D: rethinking magnitude-phase features for DeepFake detection

To reduce the harm of forged information, more and more detection methods use frequency domain information, mostly taking spectra as clues to identify fake content. However, current work tends to use only one of the magnitude and phase spectra for learning. In this paper, we observe that the magnitude and phase spectra contain different image information. A single spectrum is easily disturbed by noise, and the robustness of such a method is difficult to guarantee. Therefore, we propose Frequency Domain Separable DeepFake Detection (FDS_2D), a multi-branch network that obtains features from different frequency spectra. In FDS_2D, the spectral information is divided into three categories: the magnitude spectrum, the phase spectrum, and the relationship between the two. According to their characteristics, we design independent modules to extract features from them. Moreover, to use these multiple features more efficiently, we propose a multi-input multi-output attention mechanism for information interaction between branches. The experimental results show that each part of FDS_2D effectively extracts and applies spectral information. The comprehensive performance of our model is verified on FaceForensics++, Celeb-DF, and DFDC, demonstrating that FDS_2D's ability to detect DeepFakes is not inferior to that of existing models.


Introduction
In recent years, deep generative networks have developed rapidly. Owing to the high quality of generated content and the low cost of use, people have gradually applied them to daily life, such as movie special-effects production. However, the abuse of generative networks has given rise to a forgery technology called DeepFake [1,2]. It forges high-quality fake images and videos with the help of Generative Adversarial Networks (GANs) [3], Variational Autoencoders (VAEs) [4], Diffusion Models [5], and their improved variants [6,7]. These false contents spread through social media and deliberately guide public opinion, posing a significant threat to social stability and people's information security.
The large amount of realistic fake content has made researchers pay increasing attention to DeepFake. Early work focuses on finding visual cues [8-13], such as visual artifacts caused by the upsampling process [8,9] and phenomena that do not conform to the laws of nature [10]. As the research deepens, the targeted cues become more local and subtle, and the performance of the models improves accordingly. However, when a forgery model removes these artificial traces from the generated content, such detection models become much less effective. Later, some researchers began to pay attention to the changes brought about by forgery in the frequency domain [14,15]. These methods mainly select one of the magnitude and phase spectra from the Discrete Fourier Transform (DFT) [8,15,16] or the Discrete Cosine Transform (DCT) [17], and then extract features from the selected spectrum. However, which spectrum to choose is a problem for them. We believe that DeepFake detection differs from general classification in that subtler features can be used for detection. Selecting only one spectrum loses some features and increases the difficulty of detection.
Therefore, in this work, we propose Frequency Domain Separable DeepFake Detection (FDS_2D) to exploit the spectral information fully. The proposed method learns spectral information from multiple branches: (1) By exchanging the magnitude spectra, as shown in Fig. 1, we find that the contents contained in the magnitude spectrum and the phase spectrum are independent. To obtain features from the different spectra, we treat the two spectra as different cues for the first time. With the help of the Local Fourier transform, our method only needs a general CNN structure to complete the feature extraction. (2) We find that the relationship between the two spectra changes when an image is forged. Therefore, in addition to the above two clues, we take this relationship as a third clue and design the Frequency Domain PointWise (FDPW) block to extract it. (3) The multi-branch network learns spectral information from different perspectives, and the feature representations obtained by the branches are complementary. To use these features more efficiently, we design a multi-attention mechanism with multiple inputs and multiple outputs.
In summary, our main contributions are as follows: (1) Existing work in the frequency domain selects only one spectrum for DeepFake detection, and such solutions lose the content contained in the other spectrum. We therefore use both the magnitude and phase spectra to obtain the different information contained in each of them. (2) We find that changes in image content lead to changes in the relationship between the two spectra. Therefore, while extracting the features of the two spectra, Frequency Domain Separable DeepFake Detection (FDS_2D) also learns the features of this spectral relationship. (3) In the experiments, we verify that FDS_2D performs well on DeepFake detection across various datasets. The results also prove that each way of processing the spectra and their relationship plays a crucial role.

Related work
According to the form of the data processed, we divide current DeepFake detection methods into two categories: those in the spatial domain [11,18-21] and those in the frequency domain [14].

Detection methods in the spatial domain
DeepFakes generate non-existent video sessions and individual actions by tampering with the attributes of the target content. To find the difference between fake and real content, most detection methods working in the spatial domain exploit the visual traces left by forgery methods on the image. In the early stage, some detection methods researched the shortcomings of generation methods. For example, Odena et al. [18] found that, due to the inconsistency between the convolution kernel size and stride, the results generated by forgery methods contain checkerboard artifacts. Azulay et al. [19] gave checkerboard artifacts a more detailed explanation and proved that CNNs ignore the classical sampling theorem. Until now, this kind of artifact is still one of the bases of many detection methods. Later, some studies found phenomena in fake data that do not conform to physical laws. Wang et al. [20] aligned features to measure the similarity between the face area and the image background, judging real and fake images by their different degrees of similarity. Yang et al. [21] found that the dense optical flow of real and fake videos does not behave consistently, and combined this phenomenon with a feature extraction network for video forgery detection. In addition, some work modified classifiers and training strategies to improve model performance. Zhao et al. [11] used fine-grained classification to extract more subtle and local features in fake images.

Fig. 1 The magnitude spectrum of R_p&F_m is from Fake, and the phase spectrum of R_p&F_m is from Real. The magnitude spectrum of R_m&F_p is from Real, and the phase spectrum of R_m&F_p is from Fake

There are two main problems with detection methods in the spatial domain: (1) the bases these methods rely on for detection are also the aspects that forgery methods need to improve, such as artifacts.
As forgery methods resolve these problems, the effectiveness of these detection methods decreases significantly. (2) Detection methods generally suffer from poor generalization ability: different forged regions, forgery methods, and image qualities all reduce generalization. However, when a picture is converted into spectra, even an imperceptible forgery leads to changes in the high- and low-frequency information of the spectrum. This change is less affected by factors such as the forgery method and the forged area. Therefore, we detect fake faces using spectra to reduce the impact of these problems on our method.

Detection methods in the frequency domain
Besides using information from the spatial domain, some work focused on the changes brought about by forgery methods in the frequency domain. These methods use either the Discrete Fourier Transform (DFT) or the Discrete Cosine Transform (DCT) to obtain spectra in the frequency domain, such as the magnitude and phase spectra. Durall et al. [14] converted the 2D magnitude spectrum into 1D information through azimuthal averaging and found a difference in the high-frequency part of real and fake content. Based on this work, much subsequent work began to learn features of the magnitude spectrum. Qian et al. [17] designed a dual-branch forgery detection model in which both branches separate the high- and low-frequency information of the spectrum. Wang et al. [22] performed the Inverse Discrete Fourier Transform on the magnitude spectrum of different frequency bands and then used general CNNs to obtain spectral information. In addition to the magnitude spectrum, the phase spectrum also retains many necessary image signals. Liu et al. [15] found that the phase spectrum retains rich frequency domain components, and combined the spatial image with the phase spectrum to capture upsampling artifacts.
In our work, we do not choose one of the two spectra as in previous work. Considering that the magnitude and phase spectra contain different image information, we use a multi-branch network to obtain features from the different spectra. In addition, we find that image forgery changes the relationship between the two spectra, and we propose to take this change as an object of learning.

Our method
Since it is difficult for forgery methods to cover their traces in the high- and low-frequency information, we propose Frequency Domain Separable DeepFake Detection (FDS_2D) to better extract frequency domain information. Compared with existing detection methods, FDS_2D takes fuller advantage of spectral information: it extracts features from the magnitude spectrum, the phase spectrum, and the relationship between the two spectra.
In this section, we introduce the various components of FDS_2D. The section is organized as follows: In Subsect. 3.1, we explain the spectral information and the relationship between the two kinds of spectra. Subsection 3.2 introduces the processing methods for the spectra and their relationship. In Subsect. 3.3, we design the Multiple Cross Attention (MCA) to realize the interaction between features.

The analysis of spectra
The DFT is a critical way to obtain the frequency domain information of an image. It can represent any periodic function as the sum of multiple sine and cosine functions. Images also fit this definition if we describe every pixel with a periodic function of infinite period. Assume a pixel x is a periodic signal, and the periodic function corresponding to x is f(x). According to the Fourier expansion, we conclude:

f(x) = C + \sum_{n=1}^{\infty} \left( a_n \cos(n\omega x) + b_n \sin(n\omega x) \right),   (1)

where C is a constant, T is the period of the image information, and a_n and b_n are the Fourier coefficients. We perform a sine-cosine transform on Eq. (1), then:

f(x) = C + \sum_{n=1}^{\infty} A_n \sin(n\omega x + \varphi_n),   (2)

where \omega is the angular frequency, equal to 2\pi/T, and A_n and \varphi_n represent the magnitude and phase.
We can get the magnitude and phase of each pixel by the above calculation. After spectral centering and a power transformation on A and \varphi, we obtain the spectra shown in Fig. 2. Figure 2(b) is the magnitude spectrum of Fig. 2(a). We randomly take a point on the spectrum, shown as the red point in Fig. 2(c). The distance from the red point to the center is the frequency of the corresponding pixel in the original image, and the angle formed with the coordinate axis indicates the direction of the periodic function. The entire magnitude spectrum has high frequencies at the periphery and low frequencies in the middle. Figure 2(d) is the phase spectrum, the aggregation of the image's phase information. We perform the inverse Discrete Fourier Transform (IDFT) on the phase spectrum alone (Fig. 2(e)); the result looks like the profile of the original image. Since the information in the two spectra is different, their influence on the image's appearance also differs. We design the following experiment, shown in Fig. 3, to intuitively display the role of each spectrum: we swap the magnitude spectra of two images (Real and Fake) and compare the obtained images with the originals.
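As a concrete reference, the centered magnitude and phase spectra described above can be computed with NumPy; the log transform on the magnitude is a display-oriented power transform, and the random image here is only a placeholder for a face crop:

```python
import numpy as np

def magnitude_phase_spectra(img):
    # Centered 2D DFT: fftshift moves low frequencies to the middle of the spectrum.
    F = np.fft.fftshift(np.fft.fft2(img))
    magnitude = np.abs(F)                # A(u, v)
    phase = np.angle(F)                  # phi(u, v), in (-pi, pi]
    log_magnitude = np.log1p(magnitude)  # power/log transform for visualization
    return magnitude, phase, log_magnitude

img = np.random.rand(64, 64)             # stand-in for a grayscale face image
A, phi, logA = magnitude_phase_spectra(img)
```

Multiplying the magnitude back with the complex phase and inverting the shifted transform recovers the original image exactly, which confirms the two spectra together carry all of the image's information.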
The results of the above experiment are shown in Fig. 4. The phase spectrum of Fig. 4(c) comes from the real image (Fig. 4(a)), and its magnitude spectrum comes from the fake image (Fig. 4(b)). Compared with Fig. 4(a), Fig. 4(c) has a different source for its magnitude spectrum and more noise texture, which shows that the magnitude spectrum controls the texture of the image. Combining the magnitude spectrum of the real image with the phase spectrum of the fake image, we obtain Fig. 4(d). Compared with Fig. 4(a), Fig. 4(d) has a different source for its phase spectrum and a different profile, which means the phase spectrum contains the contour information of the image. Through this description and the experiments on the two spectra, we conclude that the magnitude and phase spectra contain different image information. To obtain features from both spectra, we extract features from them as independent clues.
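The swap experiment of Figs. 3-4 can be sketched as follows. Swapping an image's magnitude spectrum with its own copy leaves the image unchanged, which makes a convenient sanity check; in practice img_a and img_b would be the Real and Fake face images:

```python
import numpy as np

def swap_magnitude(img_a, img_b):
    # Recombine img_a's phase with img_b's magnitude, and vice versa,
    # then invert the DFT to get the mixed images.
    Fa, Fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    a_phase_b_mag = np.fft.ifft2(np.abs(Fb) * np.exp(1j * np.angle(Fa))).real
    b_phase_a_mag = np.fft.ifft2(np.abs(Fa) * np.exp(1j * np.angle(Fb))).real
    return a_phase_b_mag, b_phase_a_mag
```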

The relationship of spectra
In addition to extracting features from both spectra, our method extracts features of their relationship. According to Eq. (2), we can obtain the magnitude (A) and phase (\varphi) of a single signal. We extend this solution to two-dimensional discrete images. Suppose a two-dimensional image of size L × K, where each pixel is taken as a sampling point. The image is a two-dimensional signal I(x, y), and F(u) is the result of its DFT, where u = (u_1, u_2) denotes the two-dimensional frequency coordinate. We extend Eq. (1) to I(x, y) and combine it with Euler's formula,

e^{i\theta} = \cos\theta + i\sin\theta,   (3)

to obtain the expression shown in Eq. (4):

F(u) = \sum_{x=0}^{L-1} \sum_{y=0}^{K-1} I(x, y)\, e^{-i 2\pi (u_1 x / L + u_2 y / K)},   (4)

where u is related to I(x, y). This means the image content causes changes in F(u). We convert F(u) in Eq. (4) to polar coordinates and obtain the expressions of A_u and \varphi_u, as shown in Eq. (5):

A_u = \sqrt{R(u)^2 + I(u)^2}, \qquad \varphi_u = \arctan\left( I(u) / R(u) \right),   (5)
where R(u) is the real part of F(u) and I(u) is the imaginary part. Since both A_u and \varphi_u are built from R(u) and I(u), we simplify them as:

A_u^2 = R(u)^2 + I(u)^2, \qquad \tan\varphi_u = I(u) / R(u).   (6)

Dividing the two equations in Eq. (6), we get:

h(u) = \frac{A_u^2}{\tan\varphi_u} = \frac{\left( R(u)^2 + I(u)^2 \right) R(u)}{I(u)}.   (7)

According to Eq. (7), h(u) is the relationship between A_u and \varphi_u. Combining it with Eq. (4), we can see that this relationship depends on the image's content, which means the relationship between the magnitude and phase spectra changes after forging. Therefore, we can find forged traces in this relationship as well as in the two kinds of spectra.
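As a numerical sanity check, one concrete reading of Eq. (7), h(u) = A_u^2 / tan(φ_u) = (R(u)^2 + I(u)^2) R(u) / I(u), can be verified directly on the DFT of a random image; bins where R(u) or I(u) is (near-)zero, such as the DC term, are masked out to avoid division by zero:

```python
import numpy as np

img = np.random.rand(32, 32)
F = np.fft.fft2(img)
R, Im = F.real, F.imag
# Mask out (near-)real or (near-)imaginary bins, e.g. the DC term where I(u) = 0.
mask = (np.abs(Im) > 1e-6) & (np.abs(R) > 1e-6)
A2 = R**2 + Im**2                                 # A_u^2
h_polar = A2[mask] / np.tan(np.angle(F)[mask])    # A_u^2 / tan(phi_u)
h_parts = (A2 * R)[mask] / Im[mask]               # (R^2 + I^2) R / I
```

The two expressions agree numerically, confirming that h(u) is fully determined by the real and imaginary parts of F(u), which in turn depend on the image content.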

The local Fourier spectrum
How to process the spectra is the basis of FDS_2D. Currently, the network structure mainly used for feature extraction is the Convolutional Neural Network (CNN) [23]. In the feature extraction process, a CNN converts the information in an image into a matrix with semantic information. However, neither the magnitude nor the phase spectrum is directly suitable for a CNN. We verify this with a one-dimensional slice f(x), x = 0, …, L − 1, of the two-dimensional image in Eq. (4), whose DFT is:

F(l) = \sum_{x=0}^{L-1} f(x)\, e^{-i 2\pi l x / L}, \quad l = 0, 1, \ldots, L-1.   (8)

According to Euler's formula, we can write Eq. (3) in the following form:

e^{-i\theta} = \cos\theta - i\sin\theta.   (9)

We replace l in Eq. (8) with L − l and bring in Euler's formula for verification:

F(L-l) = \sum_{x=0}^{L-1} f(x)\, e^{-i 2\pi (L-l) x / L} = \sum_{x=0}^{L-1} f(x)\, e^{-i 2\pi x}\, e^{i 2\pi l x / L} = \sum_{x=0}^{L-1} f(x)\, e^{i 2\pi l x / L}.   (10)

Continuing to deduce Eq. (10) in conjunction with Eq. (9):

F(L-l) = \sum_{x=0}^{L-1} f(x) \left[ \cos\left( 2\pi l x / L \right) + i \sin\left( 2\pi l x / L \right) \right] = F^{*}(l).   (11)

From Eqs. (10) and (11), we know that the result of the DFT is conjugate symmetric on [0, L − 1]. This means the spectrum is not translation invariant like natural images, so the magnitude and phase spectra cannot be fed directly into a CNN for feature extraction. Considering that CNNs rely on convolution kernels to obtain local receptive fields, we propose Regular Local Fourier (RLF) and Irregular Local Fourier (ILF), which add local area information to the spectra. We resize the image to 3 × H × W, where the height H and the width W keep the same value. RLF and ILF only need a shared Fourier slider: the slider performs a Fourier transform to obtain the local spectra of the covered areas. After iterative calculation, we obtain a spectrum with position information, which is the result of RLF. ILF regards the image center containing the face as a whole and uses a large Fourier slider to calculate this central area; the rest of the image is calculated as in RLF. This calculation method makes the result of ILF contain receptive fields of different sizes. The processes and results are shown in Fig. 5.
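The conjugate symmetry F(L − l) = F*(l) for a real signal can be checked numerically in a few lines:

```python
import numpy as np

# Conjugate symmetry of the DFT of a real 1D signal: F(L - l) = conj(F(l)).
L_len = 16
f = np.random.rand(L_len)
F = np.fft.fft(f)
idx = np.arange(1, L_len)
sym_err = float(np.max(np.abs(F[L_len - idx] - np.conj(F[idx]))))
```

The maximum deviation is on the order of floating-point round-off, confirming the derivation above.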

Fig. 5 The magnitude and phase spectra with local information, generated by RLF and ILF
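The sliding-window idea behind RLF can be sketched as follows. This is a minimal non-overlapping version; the window size is an assumed hyperparameter, and ILF would additionally apply one large window to the central face region while treating the border as below:

```python
import numpy as np

def regular_local_fourier(img, window=8):
    # Slide a (non-overlapping) Fourier window over the image and keep the
    # local magnitude and phase spectra, so each spectral value is tied to
    # a specific image region (a local receptive field).
    h, w = img.shape
    mag = np.zeros_like(img, dtype=float)
    phase = np.zeros_like(img, dtype=float)
    for i in range(0, h - window + 1, window):
        for j in range(0, w - window + 1, window):
            F = np.fft.fftshift(np.fft.fft2(img[i:i+window, j:j+window]))
            mag[i:i+window, j:j+window] = np.abs(F)
            phase[i:i+window, j:j+window] = np.angle(F)
    return mag, phase

mag, phase = regular_local_fourier(np.random.rand(32, 32))
```

Because each spectral patch now corresponds to a fixed spatial location, the resulting maps regain the locality that convolution kernels rely on.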

Pointwise convolution module in the frequency domain
The Pointwise (PW) convolution was initially used in Depthwise Separable Convolution [24,25]. It can replace ordinary convolution and reduce model parameters. In this paper, we need to extract the relationship between the two spectra. If each spectrum is regarded as information on one dimension, the process extracts information across dimensions, which is similar to PW convolution. The feature extraction process for the relationship is shown in Fig. 6. We concatenate the two spectra along the channel dimension and multiply them with N filters, recorded as \alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_N\}:

New\_map_n = \alpha_n \ast [M; P],

where n = 1, 2, …, N and [M; P] denotes the channel-wise concatenation of the magnitude and phase spectra. New_map then needs to be input into the backbone network. However, this process increases the depth of the overall network, which may cause the gradient update to proceed in the direction of exponential decay or exponential explosion. To prevent gradient vanishing and gradient explosion, we design the Frequency Domain PW Block (FDPW) based on the residual structure [26], as shown in Fig. 7.
Mid_Dim1, Mid_Dim2, and UP_Dim each consist of a PW convolution and a Batch Normalization. Their calculation processes are represented by f_MD1(x), f_MD2(x), and f_UD(x), respectively.
The PW convolutions in the two Mid_Dim parts extract the relationship through this module, while UP_Dim gives the input a linear map to match the dimension of the output. If the relationship of the two spectra is regarded as a variable G, the whole process is expressed as:

G = f_{MD2}\left( f_{MD1}(x) \right) + f_{UD}(x).

After the above steps, the FDPW extracts the relationship between the spectra while preventing the overall model from suffering gradient vanishing or explosion.
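The FDPW computation can be sketched with NumPy, modeling a 1×1 (pointwise) convolution as a per-pixel linear map across channels. Batch Normalization and activations are omitted, and the channel widths are assumed for illustration:

```python
import numpy as np

def pw_conv(x, weight):
    # x: (C_in, H, W); weight: (C_out, C_in). A 1x1 convolution mixes
    # channels at each spatial position independently.
    return np.einsum('oc,chw->ohw', weight, x)

def fdpw_block(x, w_md1, w_md2, w_ud):
    g = pw_conv(pw_conv(x, w_md1), w_md2)  # f_MD2(f_MD1(x)): relationship path
    shortcut = pw_conv(x, w_ud)            # f_UD(x): match output channels
    return g + shortcut                    # residual connection

x = np.random.rand(2, 8, 8)                # magnitude + phase as two channels
w_md1, w_md2, w_ud = np.random.rand(4, 2), np.random.rand(6, 4), np.random.rand(6, 2)
out = fdpw_block(x, w_md1, w_md2, w_ud)
```

The shortcut keeps a direct linear path from input to output, which is what protects the deepened network from vanishing or exploding gradients.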

Multiple cross-attention
How to deal with multiple clues reasonably must be considered in multi-branch networks. In previous work, Qian et al. [17] and Luo et al. [27] adopted similar cross-attention methods, which achieved promising results compared with simple concatenation [28]. Inspired by these works and by self-attention structures [29] such as the Vision Transformer (ViT) [30], we propose the Multiple Cross-Attention (MCA) block. The MCA consists of multiple Binary Cross-Attention (BCA) blocks, as shown in Fig. 8. We set the two inputs of the module as I_1 and I_2. After two linear transformations, each input matrix yields the corresponding V and K:

V_i = I_i W_V^i, \qquad K_i = I_i W_K^i,
where i = 1, 2. We use K_1 and K_2 to calculate the correlation between the two input vectors, Mid_attention. Following the attention calculation method, we perform Softmax on the result of the dot product to obtain the relationship weight:

Mid\_attention = \mathrm{Softmax}\left( K_1 K_2^{T} \right).

Then, we use K_1 and K_2 in a dot product with Mid_attention to get the new attention mappings. The mapping here does not correspond one-to-one with the source of K, but cross-corresponds. That is:

Attention_1 = Mid\_attention \cdot K_2, \qquad Attention_2 = Mid\_attention^{T} \cdot K_1.

We sum Attention_i with V_i after weighting, and the result is Map_i:

Map_i = V_i + W_i \cdot Attention_i,

where W_i is the weight of Attention_i.
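The BCA computation can be sketched as follows. The cross-correspondence (the attention for branch 1 built from K_2 and vice versa) and the scalar weighting are our reading of the description above, not a verified reimplementation; all shapes and weights are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bca(I1, I2, Wk1, Wv1, Wk2, Wv2, w1=1.0, w2=1.0):
    K1, V1 = I1 @ Wk1, I1 @ Wv1
    K2, V2 = I2 @ Wk2, I2 @ Wv2
    mid = softmax(K1 @ K2.T)               # Mid_attention: correlation of the inputs
    att1 = mid @ K2                        # cross: branch 1 attends over K2
    att2 = mid.T @ K1                      # cross: branch 2 attends over K1
    return V1 + w1 * att1, V2 + w2 * att2  # Map_1, Map_2

rng = np.random.default_rng(0)
I1 = rng.normal(size=(5, 8))
I2 = rng.normal(size=(5, 8))
Wk1, Wv1, Wk2, Wv2 = (rng.normal(size=(8, 8)) for _ in range(4))
Map1, Map2 = bca(I1, I2, Wk1, Wv1, Wk2, Wv2)
```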
With the help of the BCA block, feature interaction between two feature maps can be realized. However, FDS_2D needs more inputs and outputs, and interaction between only two features cannot satisfy our needs. Therefore, we propose Multiple Cross Attention (MCA) based on the BCA. The structure of the MCA is shown in Fig. 9.
There are three inputs to the MCA: MAP1, MAP2, and MAP3. We pair them and put the pairs into three BCA blocks, obtaining three pairs of new maps, where M_n' and M_n'' are the operation results between MAP_n and the other two maps (n = 1, 2, 3). We then combine M_n' and M_n'' with a linear transformation to get the interaction result for each branch. In this way, we obtain a multi-input multi-output mapping through multiple BCA modules and linear transformations. This process can be extended to more inputs and outputs.

Experiments
FDS_2D is a forgery detection method for DeepFake that makes full use of spectral information. In this section, we demonstrate the experimental results of this method.

Overall model of FDS_2D
We propose a DeepFake detection method that fully utilizes spectral information. In this method, we use the magnitude spectrum, phase spectrum, and the relationship between them as the inputs of the multi-branch network. To facilitate feature interaction in feature extraction, we use similar structures for the backbones of branches. The overall framework of our network is shown in Fig. 10.
In the feature extraction stage, each backbone network is divided into two parts: Feature Extract1 and Feature Extract2. We add the MCA between the two parts to complete the feature interaction. After the above steps, we use a linear layer to achieve the final classification.

Experiment setup
Datasets. To verify the effectiveness and ability of the model, we train and test it on a variety of datasets, including FaceForensics++ (FF++) [31], Celeb-DF [32], and the DeepFake Detection Challenge (DFDC). FaceForensics++ is an extension of FaceForensics, divided into FF++_c23 (low compression rate) and FF++_c40 (high compression rate). This dataset has 1000 original videos (Real) and 4000 forged videos (Fake) from four forgery methods: Face2Face [33] (F2F), FaceSwap (FS), DeepFakes (DF), and NeuralTextures [34] (NT). Following the official suggestion, we split the dataset with the ratio train:test:validation = 720:140:140. Then we extract video frames to build the image dataset. To balance the number of real and fake images in FF++, different numbers of frames are extracted from different videos. The details of each part of the dataset are shown in Table 1.
The Celeb-DF used in this section is the v2 version. Celeb-DF v2 contains the 590 real videos of Celeb-DF (Celeb-real), 5639 related fake videos (Celeb-synthesis), and 300 additional real videos from YouTube. According to the JSON file provided with the dataset, we organize the dataset as shown in Table 2.
To keep the amounts of real and fake data as close as possible to 1:1, the number of real images in training is (482 + 230) × 74 = 52,688 and the number of fake images is 5299 × 10 = 52,990. The number of real images in testing is (108 + 70) × 40 = 7120, and the number of fake images is 340 × 20 = 6800.
The DeepFake Detection Challenge (DFDC) is currently the largest publicly available face-swapping video dataset. Due to the large amount of data and the difficulty of controlling data quality, we select a subset of its videos for the experiments. The frame extraction and cropping method is similar to that of the above two datasets. The obtained images are only used as test images to observe the model's generalization performance.
Evaluation metrics. We regard the detection task as a binary classification task and set metrics for evaluating the model. (1) Accuracy (ACC): the most intuitive evaluation benchmark in classification tasks, equal to the ratio of correctly predicted samples to the total number of samples.

Implementation details. The optimizer is Adam. The initial learning rate is 0.0002; the learning rate decays linearly once every 60k iterations with a decay ratio of 0.8. Since this experiment is entirely about spectrum learning and lacks an official spectrum pre-training model, none of the compared models uses pre-trained weights, to ensure the objectivity of the comparison.
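The ACC metric described above amounts to a single mean over the prediction/label agreement:

```python
import numpy as np

def accuracy(y_true, y_pred):
    # ACC: correctly predicted samples / total number of samples.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())
```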

Ablation study
Our ablation experiments are performed on FF++_c23. We ablate the Local Fourier, the FDPW, and the MCA separately to demonstrate their roles in FDS_2D. The overall results are shown in Table 3. Since models that do not use the Local Fourier cannot obtain features, some metrics cannot be computed; we therefore fill in "-" in the table. In the following descriptions, we use the AUC score and the P-score as the evaluation benchmarks.

Effectiveness of local Fourier
According to our previous description, the most important step is to perform the Local Fourier transform to obtain the spectra. To verify its importance for FDS_2D, we directly use the magnitude and phase spectra obtained by the global DFT as input during training and compare the results with FDS_2D. Figure 11(a) shows the results of FDS_2D with the Local Fourier and FDS_2D with the plain DFT over the iterative process.
As shown in Fig. 11(a), no matter how many iterations we train, the AUC and P of FDS_2D (DFT) remain around 0.5, which means the model cannot learn features from the DFT spectrum to distinguish real faces from fake faces. In contrast, FDS_2D (LF) converges normally. Since the global spectrum does not contain local information, it cannot be directly used as neural network input for feature extraction. In other words, a neural network cannot directly capture the difference between the DFT spectra of real and fake images.

The importance of the relationship between the magnitude and phase spectra
In FDS_2D, we design the FDPW block to extract the relationship between the two spectra. To verify its effectiveness, we remove the entire relationship-extraction branch and compare the result with the original model. In Fig. 11(b), the AUC and P scores of FDS_2D without the FDPW block are lower than those of FDS_2D. This result is consistent with our proof in Sect. 3.1.2: the FDPW extracts the relationship between the spectra, which plays a critical role in making full use of the spectral information.

Effectiveness of MCA/BCA
To improve the robustness and effectiveness of the extracted features, we propose the multi-input, multi-output feature interaction module, MCA. To verify the effectiveness of this structure, we design the following experiments: (1) comparing FDS_2D with the model without MCA; (2) comparing FDS_2D with the model without MCA after removing the FDPW. We design (2) to avoid the influence of the FDPW on the experimental results.
The results are shown in Fig. 12. In the two line charts, the models using MCA (green and red lines) outperform the models that directly splice feature maps (blue and black lines) in both AUC and P. Meanwhile, with the help of boxplots, we verify the effect of MCA on training stability. As shown in Fig. 12(c) and (d), the green boxplots show the results of the models using MCA, and the orange boxplots show the results of the models splicing feature maps. The boxes of the models with MCA/BCA are lower than those without, which means that MCA improves the stability of model training.

Comparison
In this section, we compare FDS_2D with some strong detection methods from 2017-2023, such as Two-stream [35] and DTFA-DOF [21]. The datasets used in the experiments include FF++, Celeb-DF, and DFDC. For works with available source code, such as Xception and F3-Net, we reproduce them and run experiments following the same steps as our method. To reduce the influence of other factors on each module's effect, we do not use any pre-trained models in the comparison. For the remaining works, whose code is not public, we use the results reported in their papers; if a work did not follow the same experimental steps, we fill in "-" in the tables. We compare previous models with ours in two ways: (1) training on one dataset and verifying on the corresponding test set (Table 4); (2) training on one dataset and verifying on the other datasets (Table 5). In Table 4, our model has the best performance among the results trained and tested on FF++ and Celeb-DF (FF++: 0.8709, Celeb-DF: 0.9108). In Table 5, when the training set is FF++, we use Celeb-DF and DFDC as test sets; when the training set is Celeb-DF, we use FF++ and DFDC as test sets. This part of the experiment proves that our model is comparable to the compared models in generalization. Our model also performs best in some cross-dataset experiments (train: Celeb-DF & test: FF++: 0.7456; train: Celeb-DF & test: DFDC: 0.7701).
We also test the performance of the models on datasets with different compression ratios, using FF++_c23 (low compression, high-quality data) and FF++_c40 (high compression, low-quality data). We design two sets of experiments: (1) training and testing on datasets with the same compression ratio; (2) training and testing on datasets with different compression ratios. The results are shown in Table 6. When FF++_c23 is used for training, FDS_2D performs better both on the same dataset and across compression ratios. However, when the training dataset is FF++_c40, our method performs worse than other models.

To explore the cause of this problem, we perform the AbsDiff operation on real and fake images with different compression ratios and on the corresponding spectra. Through AbsDiff, we can clearly observe the difference between real and fake images under different compression rates, as shown in Fig. 13. The experiments show that in the spatial domain, although a change in the compression rate changes how salient the image features are, the features can still be observed. On the frequency spectrum, however, images with low compression rates have more salient features than images with high compression rates; we mark some of these salient features in the results for different compression rates in Fig. 13. In FDS_2D, our feature extraction for the spectra is based on CNNs, which cannot perfectly adapt to spectral features. Therefore, in future work, we hope to address this problem by optimizing the spectral feature extraction method.
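The AbsDiff operation used above is simply a pixel-wise absolute difference, applicable to an image pair or to their spectra:

```python
import numpy as np

def abs_diff(real_img, fake_img):
    # Pixel-wise absolute difference between a real/fake pair (or their spectra),
    # used to visualize where the two differ.
    return np.abs(real_img.astype(float) - fake_img.astype(float))
```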
In addition, we compare models dealing with different data types. FDS_2D is a method based entirely on spectral information. To reflect the effectiveness of our processing of frequency domain information, we compare models that process other data types, as shown in Tables 7 and 8: (1) Xception is a method using only spatial domain information, which extracts spatial features and detects forgeries through a CNN model. (2) F3-Net is divided into two branches: one branch extracts the frequency domain information of the spectrum, and the other processes the spatial domain information after high/low-frequency separation and the inverse Fourier transform. (3) DTFA-DOF targets optical-flow changes across video frames; the features it processes include temporal features. The experimental results show that FDS_2D achieves similar or even better performance than methods that extract more kinds of information. This indicates that FDS_2D is effective through extracting the features of the two kinds of spectra and their relationship.

Conclusions
In this paper, we study frequency domain DeepFake detection and find that current methods always choose only one of the magnitude and phase spectra. To make full use of the spectral information, we propose FDS_2D, which is the first method to learn the magnitude and phase spectra separately. In addition, we find that the relationship between the two kinds of spectra changes with the image content, and adding the learning of this relationship further improves the performance of FDS_2D. Meanwhile, we propose a novel multi-input multi-output cross-attention for multi-feature information interaction. Extensive experiments demonstrate the effectiveness of each block, and comparisons with previous methods further corroborate the effectiveness and significance of our model. However, our method is not good at detecting highly compressed images, as shown in the comparison results of Subsect. 4.4. This problem needs further improvement in future research, until we can find a detection method truly suited to the frequency domain.