This section describes the experiments used to measure the detection and classification accuracy of the proposed model. We also highlight the delicate aspects of the datasets used and describe extensive experiments that demonstrate the suitability of our strategy. Subsection 4.1 describes the evaluation metrics used to measure accuracy, and Subsection 4.2 discusses the datasets in depth. The details of the experimental design are presented in Subsection 4.3. Subsections 4.4 to 4.8 cover the experiments we conducted to evaluate the effectiveness of the proposed technique.
4.1. Evaluation metrics
Several common metrics, including Precision (PR), Recall (RC), Accuracy (AC), and F1-score, were used to assess how well the proposed method performs on deepfake detection. Equations (3) to (6) define the metrics used for the performance evaluation.
$$Ac=\frac{TP+TN}{TP+TN+FP+FN} \tag{3}$$

$$Pr=\frac{TP}{TP+FP} \tag{4}$$

$$Re=\frac{TP}{TP+FN} \tag{5}$$

$$F1\text{-}score=2\times \frac{Precision\times Recall}{Precision+Recall} \tag{6}$$
where \(TP\) denotes true positives (correctly detected deepfakes) and \(TN\) true negatives (correctly identified bonafide samples); similarly, \(FP\) stands for false positives (bonafide samples misclassified as deepfakes) and \(FN\) for false negatives (deepfakes misclassified as bonafide). The evaluation of the proposed classification model using these equations is detailed in the following subsections, and a minimal sketch of how they can be computed is given below.
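For reference, the following sketch shows one way to compute Equations (3) to (6) with scikit-learn; the labels and predictions are hypothetical placeholders, not values from our experiments.

```python
# A minimal sketch of Equations (3)-(6) using scikit-learn.
# y_true / y_pred are hypothetical frame-level labels and predictions
# (1 = deepfake, 0 = bonafide), not results from our experiments.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

print(f"Ac = {accuracy_score(y_true, y_pred):.4f}")   # (TP+TN)/(TP+TN+FP+FN)
print(f"Pr = {precision_score(y_true, y_pred):.4f}")  # TP/(TP+FP)
print(f"Re = {recall_score(y_true, y_pred):.4f}")     # TP/(TP+FN)
print(f"F1 = {f1_score(y_true, y_pred):.4f}")         # 2*Pr*Re/(Pr+Re)
```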
4.2. Dataset
Our proposed model is evaluated using two datasets: DFDC, provided by Facebook [32], and FF++ [33]. The DFDC dataset contains 4,119 manipulated videos and 1,131 original recordings, and is openly available through the Kaggle competition website [32]. The FF++ dataset was created with a variety of AI-based manipulation techniques, including FaceSwap (FS) [20], Face2Face (F2F) [34], and NeuralTextures (NT) [35]; 1,000 authentic and 1,000 altered recordings are included for each manipulation method, and the recordings are provided at raw, light, and high compression settings. Table 3 lists publicly available collections of original and manipulated videos used in DL-based deepfake research. The datasets used in our experiments are randomly split at a 70%/30% ratio into training and testing sets, respectively (a minimal split sketch follows Table 3). Since the manipulated recordings in the FF++ set were created with advanced deepfake algorithms, it is particularly well suited for evaluation [36]: the generated visual records are of high quality and closely resemble real-world conditions. The datasets contain both genuine and fake examples, as shown in Fig. 6.
Table 3
A list of collections that include modified and original videos.
| DF collection and origin | Year | Pristine vs. modified |
|---|---|---|
| UADFV [17] - YouTube | Nov-2018 | 49 vs. 49 |
| DeepFake-TIMIT [37] - YouTube | Dec-2018 | 550 vs. 620 |
| FF++ [38] - YouTube | Jan-2019 | 1000 vs. 4000 |
| Google DFD [39] - Actors | Sep-2019 | 363 vs. 3068 |
| DFDC [32] - Actors | Oct-2019 | 23,654 vs. 104,500 |
| Celeb-DF [40] - YouTube | Nov-2019 | 890 vs. 5639 |
| DeeperForensics [41] - Actors | Jan-2020 | 10,000 vs. 50,000 |
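As noted above, each dataset is randomly split 70%/30% for training and testing. The following minimal sketch illustrates such a split with scikit-learn; the directory layout ("real/", "fake/") and dataset root are hypothetical illustrations, not the actual on-disk organization.

```python
# Hedged sketch of the 70%/30% random split described in Subsection 4.2.
# The directory layout and dataset root below are assumptions for
# illustration only.
import os
from sklearn.model_selection import train_test_split

def list_videos(root):
    """Collect (path, label) pairs: label 0 = real, 1 = fake."""
    samples = []
    for label, sub in enumerate(["real", "fake"]):
        folder = os.path.join(root, sub)
        samples += [(os.path.join(folder, f), label) for f in os.listdir(folder)]
    return samples

paths, labels = map(list, zip(*list_videos("FFpp")))  # assumed dataset root
train_x, test_x, train_y, test_y = train_test_split(
    paths, labels, test_size=0.30, random_state=42, stratify=labels)
```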
4.3. Implementation details
Experiments were run on a Dell PowerEdge R740 server with an Intel Xeon Gold 6130 CPU (2.1 GHz, 16 cores), 128 GB RAM, and a 10 TB HDD, running the Windows 11 OS, together with a 64 GB T4 GPU and a GeForce RTX 2060 graphics card. The proposed DL-based network is developed in Python 3.9.0 using libraries such as Keras, TensorFlow, OpenFace2, scikit-learn, NumPy, random, os, and PIL. To implement the proposed model and perform the required task successfully, the following requirements must be met:
- The OpenFace2 toolbox is used to perform standardization, segmentation, and feature extraction of facial features.
- As the model requires, the detected face patches are resized to 224 × 224 pixels.
- The model is trained for 60 epochs with a learning rate of 0.000001, a batch size of 35, and stochastic gradient descent (SGD) as the optimizer; these settings are illustrated in the sketch below.
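The following minimal sketch in Keras illustrates these training settings; the stand-in network and randomly generated data are placeholders so the calls run end to end, and are not the actual ResNet-Swish-BiLSTM implementation.

```python
# Hedged sketch of the training settings listed above: 60 epochs,
# learning rate 1e-6, batch size 35, SGD optimizer, 224 x 224 inputs.
# The stand-in network and random data are placeholders, NOT the
# proposed ResNet-Swish-BiLSTM architecture.
import numpy as np
import tensorflow as tf

IMG_SIZE = 224  # detected faces are resized to 224 x 224

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-6),  # stated LR
    loss="binary_crossentropy",
    metrics=["accuracy"])

# Hypothetical data: 70 random face crops with binary labels.
x_train = np.random.rand(70, IMG_SIZE, IMG_SIZE, 3).astype("float32")
y_train = np.random.randint(0, 2, size=(70, 1))
model.fit(x_train, y_train, epochs=60, batch_size=35)  # stated settings
```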
4.4. Assessment of the Proposed Model
Accurately detecting visual manipulations is a prerequisite for a trusted forensic analysis model. To illustrate the deepfake detection power of our approach, we evaluated it with several common measures. The results are presented in Fig. 7, highlighting the classification results of the proposed approach on the FF++ deepfake collections. The data in Fig. 7 show that the proposed framework achieves reliable performance in identifying FS, F2F, and NT deepfakes. More specifically, the proposed approach achieved PR, RC, and AC of 99.25%, 97.89%, and 98.99%, respectively, for FS deepfakes; 98.23%, 97.78%, and 98.08%, respectively, for F2F deepfakes; and 95.45%, 97.66%, and 96.23%, respectively, for NT deepfakes. These results demonstrate the resilience of our proposed method: the ResNet-Swish-BiLSTM model selects features that capture complex visual patterns and uses them effectively for classification.
4.5. Assessment of the Swish Function
Activation functions play a major role in training neural networks, so it is worth weighing the pros and cons of the candidates beforehand. Among the activation functions compared in this paper, Swish consistently performed better than the others. Although Mish was closely competitive with Swish, Mish still has implementation problems; for this reason, we used the Swish AF in the proposed network for the deepfake detection task. Table 4 shows the results of several activation functions on both the DFDC and FF++ datasets.
Table 4
Comparative analysis of Swish and other activation functions over DFDC and FF++
| Activation function | Accuracy (%) | Avg. training time (sec) | Avg. classification time (sec) | Remarks |
|---|---|---|---|---|
| Sigmoid | 94.0 | 1110 | 2549 | Cannot simulate Boolean gates |
| Swish | 98.0 | 1166 | 3057 | Worth trying in very deep networks |
| Mish | 98.35 | 1155 | 3524 | Very few implementations; not mature |
| Tanh | 90.0 | 1173 | 2950 | Common in recurrent neural networks |
| ReLU | 97.0 | 1050 | 2405 | Prone to the "dying ReLU" problem |
| Leaky ReLU | 97.5 | 1231 | 2903 | Use when a dying-ReLU problem is expected |
Early in training, ReLU outperformed Leaky ReLU (L-ReLU), but the accuracy gap narrowed as the number of epochs grew, and for longer training L-ReLU eventually performed better than ReLU. Swish and Mish performed far better than the other activation functions, and Swish proved more accurate than Mish in this type of application. Based on these observations, we conclude that Swish outperformed the other activation functions in the list. Another main reason for this choice is the non-monotonic nature of Swish, which allows the output to decrease even as the input increases; this enhances the representation capacity of the proposed approach. Employing Swish also optimizes the model's behavior by improving the feature selection power and recall ability of the proposed approach. Due to these factors, the proposed ResNet-Swish-BiLSTM model presents the highest performance for classifying visual manipulations.
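For reference, Swish is defined as f(x) = x · sigmoid(βx), with β = 1 in the common form. The short TensorFlow sketch below illustrates the function and its non-monotonic dip for negative inputs; the printed values are purely illustrative.

```python
# Minimal sketch of the Swish activation, f(x) = x * sigmoid(beta * x).
# Its non-monotonic dip for negative inputs is what the text refers to:
# the output can decrease even while the input increases.
import numpy as np
import tensorflow as tf

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * tf.sigmoid(beta * x)

x = tf.constant(np.linspace(-5.0, 5.0, 11), dtype=tf.float32)
print(swish(x).numpy())  # note the dip below zero for negative inputs

# Keras also ships a built-in version usable directly in layers:
layer = tf.keras.layers.Dense(64, activation="swish")
```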
4.6. Comparison of the Proposed Model with the Base ResNet-18
In this subsection, we compare the effectiveness of the proposed deepfake detection method with the original ResNet-18 model. Table 5 presents the comparison, and Fig. 9 shows the confusion matrix of the proposed deepfake detection model on the FF++ test set, clearly demonstrating its effectiveness. Across all performance metrics, the proposed model attains better results than ResNet-18 for FS, F2F, and NT deepfakes. For FS deepfake detection, we attained performance gains of 4.72%, 5.57%, 5.16%, and 6.24% in PR, RC, F1, and AC, respectively; for F2F, gains of 5.81%, 4.73%, 3.93%, and 6.99%, respectively; and for NT, gains of 4.72%, 5.75%, 4.33%, and 6.33%, respectively. In addition, we compared the ROC (receiver operating characteristic) curves and their AUC (area under the curve) for the FS, F2F, and NT deepfakes in Fig. 8 (a, b, and c), which clearly show how robust our method is compared to the base model. The analysis shows that the enhanced ResNet-Swish-BiLSTM outperforms the base model on the FF++ dataset, even though the proposed technique has more parameters (18.2 million) than ResNet-18 (13.5 million). The main factor behind this improved recognition performance is the integration of the Swish activation function, which produces more precise feature computation. Furthermore, the added Bi-LSTM layers help the model mitigate overfitting, ultimately improving performance (a sketch of this ResNet-plus-BiLSTM pairing follows Table 5).
Table 5
Comparison of the base and proposed models over the FF++ deepfake collections (all values in %).

| CNN model | PR-FS | PR-F2F | PR-NT | RC-FS | RC-F2F | RC-NT | F1-FS | F1-F2F | F1-NT | AC-FS | AC-F2F | AC-NT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-18 | 94.15 | 89.42 | 94.33 | 90.19 | 92.5 | 94.22 | 91.1 | 88.5 | 85.23 | 89.2 | 92.5 | 86.6 |
| Proposed model | 99.25 | 98.23 | 95.48 | 97.89 | 97.8 | 97.66 | 96.3 | 92.9 | 89.56 | 98.9 | 98.0 | 96.2 |
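To make the ResNet-plus-BiLSTM pairing concrete, the Keras sketch below shows one common way a frame encoder can feed a BiLSTM over a frame sequence. The sequence length, feature width, LSTM units, and the tiny stand-in encoder are illustrative assumptions, not the exact published configuration.

```python
# Hedged sketch: a ResNet-style frame encoder feeding a BiLSTM head.
# Sequence length, feature width, LSTM units, and the small stand-in
# encoder are illustrative assumptions, not the published configuration.
import tensorflow as tf

SEQ_LEN, IMG_SIZE, FEAT = 20, 224, 512  # assumed values

# Frame encoder stand-in; in the paper this role is played by a
# Swish-activated ResNet-18 backbone.
frame_encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    tf.keras.layers.Conv2D(32, 7, strides=2, activation="swish"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(FEAT, activation="swish"),
])

# TimeDistributed applies the encoder to every frame; the BiLSTM then
# aggregates the per-frame embeddings across time before classification.
inputs = tf.keras.layers.Input(shape=(SEQ_LEN, IMG_SIZE, IMG_SIZE, 3))
x = tf.keras.layers.TimeDistributed(frame_encoder)(inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.summary()
```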
4.7. Comparison with Existing DL Models
Creating high-quality datasets is crucial, as current DL-based techniques rely heavily on large amounts of data. To better understand how well the proposed technique detects deepfakes, we contrasted it with several existing CNN-based methodologies evaluated on the same deepfake collections. As deepfake algorithms become more prevalent, new datasets must be created so that more robust detection algorithms can counter new manipulation tactics.
We first compared the accuracy of our deepfake detector with VGG16 [30], VGG-19 [42], ResNet101 [43], InceptionV3 [44], InceptionResV2 [45], XceptionNet [46], DenseNet-169 [15], MobileNetv2, EfficientNet [47], and NasNet-Mobile [48]. The comparative findings are shown in Fig. 10. The results show that the proposed technique outperforms the alternatives: VGG16 obtained the lowest AC at 88%, VGG-19 the second lowest at 88.92%, while the developed ResNet-Swish-BiLSTM model achieved the best AC at 98.06%. The average accuracy of the compared models is 93.84%, whereas our strategy scores 98.06%, an average improvement of 4.22% in the accuracy metric. Figure 10 presents this comparison on the DFDC dataset in terms of accuracy, recall, precision, and F1-score. Additionally, using the FF++ dataset, we evaluated the proposed technique against existing DL models, including XceptionNet [46], VGG16 [30], ResNet34 [49], InceptionV3 [44], VGGFace, and EfficientNet, with the results shown in Fig. 11. From these results, it is clear that the ResNet-Swish-BiLSTM model achieved more accurate classification than the other DL methods for all classes (FS, F2F, and NT). Detection algorithms often show a drop in effectiveness on low-quality videos compared to high-quality ones, as shown in Table 6.
Hence, Fig. 10, Fig. 11, Table 6, and Table 7 demonstrate that the presented ResNet-Swish-BiLSTM model is more robust for visual tampering detection than previous DL techniques across all specified assessment parameters. The proposed method works consistently well due to its improved feature computation capability, which allows it to learn the key points more effectively. Furthermore, the ResNet-18 model's ability to deal with visual changes in videos, such as lighting and background variations, has been improved by employing the Swish activation approach, resulting in superior deepfake detection performance.
Table 6
Comparative analysis with existing methods using various DF datasets.
| Study | Method | Dataset | Performance (AC) |
|---|---|---|---|
| E.D. Cannas [50] | Group of CNNs | FF++ (c23) | 84% |
| Sabir [24] | CNN + GRU + STN | FF++, DF | 96.9% |
| | | FF++, F2F | 94.35% |
| | | FF++, FS | 96.3% |
| J.C. Neves [51] | 3D head poses | UADFV | 97% |
| F. Juang | Eye blink + LRCN | Self-made dataset | 97.5% |
| Ciftci [52] | Biological signals | Self-made deepfake dataset | 91.07% |
| Tarasiou [53] | Lightweight architecture | DFDC | 78.76% |
| Keramatfar [54] | Multi-threading learning with attention | Celeb-DF | 70.2% |
| Nirkin [55] | Face X-ray | Celeb-DF | 81.58% |
| Ciftci [52] | Bio identification | Celeb-DF | 90.50% |
| Proposed method | | FF++, FS | 99.13% |
| | | FF++, F2F | 98.08% |
| | | FF++, NT | 99.09% |
Table 7
Performance Comparison of DL network's over the DFDC dataset
DL-Models | AC | PR | RC | F1 |
Proposed | 0.986 | 0.99 | 0.992 | 0.988 |
EfficientNEt | 0.924 | 0.94 | 0.908 | 0.924 |
MobileNetv2 | 0.9 | 0.93 | 0.888 | 0.904 |
InceptionResV2 | 0.962 | 0.96 | 0.952 | 0.956 |
XceptionNet | 0.968 | 0.96 | 0.964 | 0.964 |
Inception V3 | 0.96 | 0.93 | 0.94 | 0.936 |
DenseNet-169 | 0.98 | 0.97 | 0.968 | 0.972 |
NAISNet-Mobile | 0.922 | 0.93 | 0.904 | 0.916 |
ResNet18 | 0.964 | 0.95 | 0.956 | 0.952 |
VGG-16 | 0.89 | 0.87 | 0.868 | 0.864 |
VGGFace | 0.984 | 0.96 | 0.942 | 0.952 |
VGG-19 | 0.882 | 0.86 | 0.876 | 0.864 |
4.8. Cross-Validation
We conducted an experiment to measure the effectiveness of the proposed deepfake detection method in a cross-corpus scenario. For this challenge, we chose the Celeb-DF dataset, which includes 950 edited videos and 475 original videos; its subtle visual distortions make tampering difficult to detect.
The ResNet-Swish-BiLSTM model was tested under the conditions presented in Table 8: it was trained on the FF++ (FS, F2F, and NT) deepfake collections and evaluated comparatively on the FF++, DFDC, and Celeb-DF deepfakes. The statistics in Fig. 12 show unequivocally that, in the cross-corpus scenario, the performance of the proposed technique decreases compared to the intra-database evaluation. The main reason for this degradation is that the method does not account for temporal variations that evolve across frames during training, which could have helped it capture the underlying manipulation biases more accurately. Table 8 shows that in the first case, when trained on the FF++ (FS, F2F, and NT) deepfake collections, the proposed DL network attains AUC values of 71.56% and 70.04% for the DFDC and Celeb-DF collections, respectively. When trained on the DFDC dataset, the proposed ResNet-Swish-BiLSTM network produces AUC values of 70.12% and 65.23% for the FF++ and Celeb-DF deepfake sources, respectively. These results show how well the model holds up to unseen cases from entirely new deepfake collections (a minimal AUC computation sketch follows Table 8).
Table 8
Cross-dataset validation results of the proposed network: training-set composition (number of real and fake videos) and AUC (%) on each test collection.

| Training dataset | Real | Fake | FS (%) | F2F (%) | NT (%) | DFDC (%) | C-DF (%) |
|---|---|---|---|---|---|---|---|
| FS | 988 | 983 | 99.36 | 61.7 | 68.71 | 64.22 | 49.13 |
| F2F | 988 | 978 | 65.23 | 99.53 | 65.9 | 75.22 | 70.05 |
| NT | 988 | 950 | 47.98 | 84.82 | 99.5 | 71.19 | 80.39 |
| DFDC | 1000 | 980 | 49.68 | 74.25 | 78.55 | 99.36 | 65.23 |
| C-DF | 475 | 950 | 55.88 | 77.22 | 80.23 | 71.56 | 92.22 |
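For reference, the AUC values reported in Table 8 can be computed from model scores as in the minimal sketch below; the labels and scores shown are hypothetical placeholders, not experimental outputs.

```python
# Hedged sketch of the AUC computation behind Table 8. The labels and
# scores are hypothetical placeholders, e.g. ground truth and sigmoid
# outputs of a model trained on FF++ and tested on Celeb-DF.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                   # 1 = deepfake
y_score = [0.1, 0.4, 0.8, 0.6, 0.9, 0.3, 0.7, 0.2]  # model probabilities

print(f"AUC = {roc_auc_score(y_true, y_score):.4f}")
fpr, tpr, _ = roc_curve(y_true, y_score)            # points for the ROC plot
```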