This section describes the experiments used to measure the detection and classification accuracy of the proposed model. We also highlight the delicate aspects of the datasets used and describe extensive experiments that demonstrate the suitability of our strategy. Subsection 4.1 describes the evaluation metrics used to measure accuracy, and Subsection 4.2 discusses the datasets in depth. The details of the experimental design are presented in Subsection 4.3. Subsections 4.4 to 4.8 cover the experiments we conducted to evaluate the effectiveness of the proposed technique.
4.1. Evaluation metrics
Several common metrics, including Precision (PR), Recall (RC), Accuracy (AC), and F1-score, were used to assess how well the proposed method performs on deepfake detection. Equations (3) to (6) define the metrics used for the performance evaluation.
$$Ac=\frac{TP+TN}{TP+TN+FP+FN} \tag{3}$$

$$Pr=\frac{TP}{TP+FP} \tag{4}$$

$$Re=\frac{TP}{TP+FN} \tag{5}$$

$$F1\text{-}score=2\times \frac{Precision\times Recall}{Precision+Recall} \tag{6}$$
where \(TP\) denotes true positives (correctly detected deepfakes) and \(TN\) true negatives (correctly identified bonafide samples); similarly, \(FP\) stands for false positives (bonafide samples misclassified as deepfakes) and \(FN\) for false negatives (deepfakes misclassified as bonafide). The evaluation of the proposed classification model using these equations is detailed in the following subsections, and a minimal sketch of how they can be computed is given below.
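For reference, the following sketch shows one way to compute Equations (3) to (6) with scikit-learn; the labels and predictions are hypothetical placeholders, not values from our experiments.

```python
# A minimal sketch of Equations (3)-(6) using scikit-learn.
# y_true / y_pred are hypothetical frame-level labels and predictions
# (1 = deepfake, 0 = bonafide), not results from our experiments.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

print(f"Ac = {accuracy_score(y_true, y_pred):.4f}")   # (TP+TN)/(TP+TN+FP+FN)
print(f"Pr = {precision_score(y_true, y_pred):.4f}")  # TP/(TP+FP)
print(f"Re = {recall_score(y_true, y_pred):.4f}")     # TP/(TP+FN)
print(f"F1 = {f1_score(y_true, y_pred):.4f}")         # 2*Pr*Re/(Pr+Re)
```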
4.2. Dataset
Our proposed model is evaluated using two datasets: DFDC, provided by Facebook [32], and FF++ [33]. The DFDC dataset contains 4,119 manipulated videos and 1,131 original recordings, and is openly available through the Kaggle competition website [32]. The FF++ dataset was created with a variety of AI-based manipulation techniques, including FaceSwap (FS) [20], Face2Face (F2F) [34], and NeuralTextures (NT) [35]; 1,000 authentic and 1,000 altered recordings are included for each manipulation method, and the recordings are provided at raw, light, and high compression settings. Table 3 lists publicly available collections of original and manipulated videos used in DL-based deepfake research. The datasets used in our experiments are randomly split at a 70%/30% ratio into training and testing sets, respectively (a minimal split sketch follows Table 3). Since the manipulated recordings in the FF++ set were created with advanced deepfake algorithms, it is particularly well suited for evaluation [36]: the generated visual records are of high quality and closely resemble real-world conditions. The datasets contain both genuine and fake examples, as shown in Fig. 6.
Table 3
A list of collections that include modified and original videos.
| DF collection and origin | Year | Pristine vs. modified |
|---|---|---|
| UADFV [17] - YouTube | Nov-2018 | 49 vs. 49 |
| DeepFake-TIMIT [37] - YouTube | Dec-2018 | 550 vs. 620 |
| FF++ [38] - YouTube | Jan-2019 | 1000 vs. 4000 |
| Google DFD [39] - Actors | Sep-2019 | 363 vs. 3068 |
| DFDC [32] - Actors | Oct-2019 | 23,654 vs. 104,500 |
| Celeb-DF [40] - YouTube | Nov-2019 | 890 vs. 5639 |
| DeeperForensics [41] - Actors | Jan-2020 | 10,000 vs. 50,000 |
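As noted above, each dataset is randomly split 70%/30% for training and testing. The following minimal sketch illustrates such a split with scikit-learn; the directory layout ("real/", "fake/") and dataset root are hypothetical illustrations, not the actual on-disk organization.

```python
# Hedged sketch of the 70%/30% random split described in Subsection 4.2.
# The directory layout and dataset root below are assumptions for
# illustration only.
import os
from sklearn.model_selection import train_test_split

def list_videos(root):
    """Collect (path, label) pairs: label 0 = real, 1 = fake."""
    samples = []
    for label, sub in enumerate(["real", "fake"]):
        folder = os.path.join(root, sub)
        samples += [(os.path.join(folder, f), label) for f in os.listdir(folder)]
    return samples

paths, labels = map(list, zip(*list_videos("FFpp")))  # assumed dataset root
train_x, test_x, train_y, test_y = train_test_split(
    paths, labels, test_size=0.30, random_state=42, stratify=labels)
```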
4.3. Implementation details
Experiments were run on a Dell PowerEdge R740 server with an Intel Xeon Gold 6130 CPU (2.1 GHz, 16 cores), 128 GB RAM, and a 10 TB HDD, running the Windows 11 OS, together with a 64 GB T4 GPU and a GeForce RTX 2060 graphics card. The proposed DL-based network is developed in Python 3.9.0 using libraries such as Keras, TensorFlow, OpenFace2, scikit-learn, NumPy, random, os, and PIL. To implement the proposed model and perform the required task successfully, the following requirements must be met:
- The OpenFace2 toolbox is used to perform standardization, segmentation, and feature extraction of facial features.
- As the model requires, the detected face patches are resized to 224 × 224 pixels.
- The model is trained for 60 epochs with a learning rate of 0.000001, a batch size of 35, and stochastic gradient descent (SGD) as the optimizer; these settings are illustrated in the sketch below.
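The following minimal sketch in Keras illustrates these training settings; the stand-in network and randomly generated data are placeholders so the calls run end to end, and are not the actual ResNet-Swish-BiLSTM implementation.

```python
# Hedged sketch of the training settings listed above: 60 epochs,
# learning rate 1e-6, batch size 35, SGD optimizer, 224 x 224 inputs.
# The stand-in network and random data are placeholders, NOT the
# proposed ResNet-Swish-BiLSTM architecture.
import numpy as np
import tensorflow as tf

IMG_SIZE = 224  # detected faces are resized to 224 x 224

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-6),  # stated LR
    loss="binary_crossentropy",
    metrics=["accuracy"])

# Hypothetical data: 70 random face crops with binary labels.
x_train = np.random.rand(70, IMG_SIZE, IMG_SIZE, 3).astype("float32")
y_train = np.random.randint(0, 2, size=(70, 1))
model.fit(x_train, y_train, epochs=60, batch_size=35)  # stated settings
```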
4.4. Assessment of the Proposed Model
Accurately detecting visual manipulations is a prerequisite for a trusted forensic analysis model. To illustrate the deepfake detection power of our approach, we evaluated it with several common measures. The results are presented in Fig. 7, highlighting the classification results of the proposed approach on the FF++ deepfake collections. The data in Fig. 7 show that the proposed framework achieves reliable performance in identifying FS, F2F, and NT deepfakes. More specifically, the proposed approach achieved PR, RC, and AC of 99.25%, 97.89%, and 98.99%, respectively, for FS deepfakes; 98.23%, 97.78%, and 98.08%, respectively, for F2F deepfakes; and 95.45%, 97.66%, and 96.23%, respectively, for NT deepfakes. These results demonstrate the resilience of our proposed method: the ResNet-Swish-BiLSTM model selects features that capture complex visual patterns and uses them effectively for classification.
4.5. Assessment of the Swish Function
Activation functions play a major role in training neural networks, so it is worth weighing the pros and cons of the candidates beforehand. Among the activation functions compared in this paper, Swish consistently performed better than the others. Although Mish was closely competitive with Swish, Mish still has implementation problems; for this reason, we used the Swish AF in the proposed network for the deepfake detection task. Table 4 shows the results of several activation functions on both the DFDC and FF++ datasets.
Table 4
Comparative analysis of Swish and other activation functions over DFDC and FF++
| Activation function | Accuracy (%) | Avg. training time (sec) | Avg. classification time (sec) | Remarks |
|---|---|---|---|---|
| Sigmoid | 94.0 | 1110 | 2549 | Cannot simulate Boolean gates |
| Swish | 98.0 | 1166 | 3057 | Worth trying in very deep networks |
| Mish | 98.35 | 1155 | 3524 | Very few implementations; not mature |
| Tanh | 90.0 | 1173 | 2950 | Common in recurrent neural networks |
| ReLU | 97.0 | 1050 | 2405 | Prone to the "dying ReLU" problem |
| Leaky ReLU | 97.5 | 1231 | 2903 | Use when a dying-ReLU problem is expected |
Early in training, ReLU outperformed Leaky ReLU (L-ReLU), but the accuracy gap narrowed as the number of epochs grew, and for longer training L-ReLU eventually performed better than ReLU. Swish and Mish performed far better than the other activation functions, and Swish proved more accurate than Mish in this type of application. Based on these observations, we conclude that Swish outperformed the other activation functions in the list. Another main reason for this choice is the non-monotonic nature of Swish, which allows the output to decrease even as the input increases; this enhances the representation capacity of the proposed approach. Employing Swish also optimizes the model's behavior by improving the feature selection power and recall ability of the proposed approach. Due to these factors, the proposed ResNet-Swish-BiLSTM model presents the highest performance for classifying visual manipulations.
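For reference, Swish is defined as f(x) = x · sigmoid(βx), with β = 1 in the common form. The short TensorFlow sketch below illustrates the function and its non-monotonic dip for negative inputs; the printed values are purely illustrative.

```python
# Minimal sketch of the Swish activation, f(x) = x * sigmoid(beta * x).
# Its non-monotonic dip for negative inputs is what the text refers to:
# the output can decrease even while the input increases.
import numpy as np
import tensorflow as tf

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x * tf.sigmoid(beta * x)

x = tf.constant(np.linspace(-5.0, 5.0, 11), dtype=tf.float32)
print(swish(x).numpy())  # note the dip below zero for negative inputs

# Keras also ships a built-in version usable directly in layers:
layer = tf.keras.layers.Dense(64, activation="swish")
```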
4.6. Comparison of the Proposed Model with the Base ResNet-18
In this subsection, we compare the effectiveness of the proposed deepfake detection method with the original ResNet-18 model. Table 5 presents the comparison, and Fig. 9 shows the confusion matrix of the proposed deepfake detection model on the FF++ test set, clearly demonstrating its effectiveness. Across all performance metrics, the proposed model attains better results than ResNet-18 for FS, F2F, and NT deepfakes. For FS deepfake detection, we attained performance gains of 4.72%, 5.57%, 5.16%, and 6.24% in PR, RC, F1, and AC, respectively; for F2F, gains of 5.81%, 4.73%, 3.93%, and 6.99%, respectively; and for NT, gains of 4.72%, 5.75%, 4.33%, and 6.33%, respectively. In addition, we compared the ROC (receiver operating characteristic) curves and their AUC (area under the curve) for the FS, F2F, and NT deepfakes in Fig. 8 (a, b, and c), which clearly show how robust our method is compared to the base model. The analysis shows that the enhanced ResNet-Swish-BiLSTM outperforms the base model on the FF++ dataset, even though the proposed technique has more parameters (18.2 million) than ResNet-18 (13.5 million). The main factor behind this improved recognition performance is the integration of the Swish activation function, which produces more precise feature computation. Furthermore, the added Bi-LSTM layers help the model mitigate overfitting, ultimately improving performance (a sketch of this ResNet-plus-BiLSTM pairing follows Table 5).
Table 5
Comparison of the base and proposed models over the FF++ deepfake collections (all values in %).

| CNN model | PR-FS | PR-F2F | PR-NT | RC-FS | RC-F2F | RC-NT | F1-FS | F1-F2F | F1-NT | AC-FS | AC-F2F | AC-NT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-18 | 94.15 | 89.42 | 94.33 | 90.19 | 92.5 | 94.22 | 91.1 | 88.5 | 85.23 | 89.2 | 92.5 | 86.6 |
| Proposed model | 99.25 | 98.23 | 95.48 | 97.89 | 97.8 | 97.66 | 96.3 | 92.9 | 89.56 | 98.9 | 98.0 | 96.2 |
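To make the ResNet-plus-BiLSTM pairing concrete, the Keras sketch below shows one common way a frame encoder can feed a BiLSTM over a frame sequence. The sequence length, feature width, LSTM units, and the tiny stand-in encoder are illustrative assumptions, not the exact published configuration.

```python
# Hedged sketch: a ResNet-style frame encoder feeding a BiLSTM head.
# Sequence length, feature width, LSTM units, and the small stand-in
# encoder are illustrative assumptions, not the published configuration.
import tensorflow as tf

SEQ_LEN, IMG_SIZE, FEAT = 20, 224, 512  # assumed values

# Frame encoder stand-in; in the paper this role is played by a
# Swish-activated ResNet-18 backbone.
frame_encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    tf.keras.layers.Conv2D(32, 7, strides=2, activation="swish"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(FEAT, activation="swish"),
])

# TimeDistributed applies the encoder to every frame; the BiLSTM then
# aggregates the per-frame embeddings across time before classification.
inputs = tf.keras.layers.Input(shape=(SEQ_LEN, IMG_SIZE, IMG_SIZE, 3))
x = tf.keras.layers.TimeDistributed(frame_encoder)(inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.summary()
```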
4.7. Comparison with Existing DL Models
Creating high-quality datasets is crucial, as current DL-based techniques rely heavily on large amounts of data. To better understand how well the proposed technique detects deepfakes, we contrasted it with several existing CNN-based methodologies evaluated on the same deepfake collections. As deepfake algorithms become more prevalent, new datasets must be created so that more robust detection algorithms can counter new manipulation tactics.
We first compared the accuracy of our deepfake detector with VGG16 [30], VGG-19 [42], ResNet101 [43], InceptionV3 [44], InceptionResV2 [45], XceptionNet [46], DenseNet-169 [15], MobileNetv2, EfficientNet [47], and NasNet-Mobile [48]. The comparative findings are shown in Fig. 10. The results show that the proposed technique outperforms the alternatives: VGG16 obtained the lowest AC at 88%, VGG-19 the second lowest at 88.92%, while the developed ResNet-Swish-BiLSTM model achieved the best AC at 98.06%. The average accuracy of the compared models is 93.84%, whereas our strategy scores 98.06%, an average improvement of 4.22% in the accuracy metric. Figure 10 presents this comparison on the DFDC dataset in terms of accuracy, recall, precision, and F1-score. Additionally, using the FF++ dataset, we evaluated the proposed technique against existing DL models, including XceptionNet [46], VGG16 [30], ResNet34 [49], InceptionV3 [44], VGGFace, and EfficientNet, with the results shown in Fig. 11. From these results, it is clear that the ResNet-Swish-BiLSTM model achieved more accurate classification than the other DL methods for all classes (FS, F2F, and NT). Detection algorithms often show a drop in effectiveness on low-quality videos compared to high-quality ones, as shown in Table 6.
Hence, Fig. 10, Fig. 11, Table 6, and Table 7 demonstrate that the presented ResNet-Swish-BiLSTM model is more robust for visual tampering detection than previous DL techniques across all specified assessment parameters. The proposed method works consistently well due to its improved feature computation capability, which allows it to learn the key points more effectively. Furthermore, the ResNet-18 model's ability to deal with visual changes in videos, such as lighting and background variations, has been improved by employing the Swish activation approach, resulting in superior deepfake detection performance.
Table 6
Comparative analysis with existing methods using various DF datasets.
| Study | Method | Dataset | Performance (AC) |
|---|---|---|---|
| E.D. Cannas [50] | Group of CNNs | FF++ (c23) | 84% |
| Sabir [24] | CNN + GRU + STN | FF++, DF | 96.9% |
| | | FF++, F2F | 94.35% |
| | | FF++, FS | 96.3% |
| J.C. Neves [51] | 3D head poses | UADFV | 97% |
| F. Juang | Eye blink + LRCN | Self-made dataset | 97.5% |
| Ciftci [52] | Biological signals | Self-made deepfake dataset | 91.07% |
| Tarasiou [53] | Lightweight architecture | DFDC | 78.76% |
| Keramatfar [54] | Multi-threading learning with attention | Celeb-DF | 70.2% |
| Nirkin [55] | Face X-ray | Celeb-DF | 81.58% |
| Ciftci [52] | Bio identification | Celeb-DF | 90.50% |
| Proposed method | | FF++, FS | 99.13% |
| | | FF++, F2F | 98.08% |
| | | FF++, NT | 99.09% |
Table 7
Performance Comparison of DL network's over the DFDC dataset
DL-Models | AC | PR | RC | F1 |
Proposed | 0.986 | 0.99 | 0.992 | 0.988 |
EfficientNEt | 0.924 | 0.94 | 0.908 | 0.924 |
MobileNetv2 | 0.9 | 0.93 | 0.888 | 0.904 |
InceptionResV2 | 0.962 | 0.96 | 0.952 | 0.956 |
XceptionNet | 0.968 | 0.96 | 0.964 | 0.964 |
Inception V3 | 0.96 | 0.93 | 0.94 | 0.936 |
DenseNet-169 | 0.98 | 0.97 | 0.968 | 0.972 |
NAISNet-Mobile | 0.922 | 0.93 | 0.904 | 0.916 |
ResNet18 | 0.964 | 0.95 | 0.956 | 0.952 |
VGG-16 | 0.89 | 0.87 | 0.868 | 0.864 |
VGGFace | 0.984 | 0.96 | 0.942 | 0.952 |
VGG-19 | 0.882 | 0.86 | 0.876 | 0.864 |
4.8. Cross-Validation
We conducted an experiment to measure the effectiveness of the proposed deepfake detection method in a cross-corpus scenario. For this challenge, we chose the Celeb-DF dataset, which includes 950 edited videos and 475 original videos; its subtle visual distortions make tampering difficult to detect.
The ResNet-Swish-BiLSTM model was tested under the conditions presented in Table 8: it was trained on the FF++ (FS, F2F, and NT) deepfake collections and evaluated comparatively on the FF++, DFDC, and Celeb-DF deepfakes. The statistics in Fig. 12 show unequivocally that, in the cross-corpus scenario, the performance of the proposed technique decreases compared to the intra-database evaluation. The main reason for this degradation is that the method does not account for temporal variations that evolve across frames during training, which could have helped it capture the underlying manipulation biases more accurately. Table 8 shows that in the first case, when trained on the FF++ (FS, F2F, and NT) deepfake collections, the proposed DL network attains AUC values of 71.56% and 70.04% for the DFDC and Celeb-DF collections, respectively. When trained on the DFDC dataset, the proposed ResNet-Swish-BiLSTM network produces AUC values of 70.12% and 65.23% for the FF++ and Celeb-DF deepfake sources, respectively. These results show how well the model holds up to unseen cases from entirely new deepfake collections (a minimal AUC computation sketch follows Table 8).
Table 8
Cross-dataset validation results of the proposed network: training-set composition (number of real and fake videos) and AUC (%) on each test collection.

| Training dataset | Real | Fake | FS (%) | F2F (%) | NT (%) | DFDC (%) | C-DF (%) |
|---|---|---|---|---|---|---|---|
| FS | 988 | 983 | 99.36 | 61.7 | 68.71 | 64.22 | 49.13 |
| F2F | 988 | 978 | 65.23 | 99.53 | 65.9 | 75.22 | 70.05 |
| NT | 988 | 950 | 47.98 | 84.82 | 99.5 | 71.19 | 80.39 |
| DFDC | 1000 | 980 | 49.68 | 74.25 | 78.55 | 99.36 | 65.23 |
| C-DF | 475 | 950 | 55.88 | 77.22 | 80.23 | 71.56 | 92.22 |
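For reference, the AUC values reported in Table 8 can be computed from model scores as in the minimal sketch below; the labels and scores shown are hypothetical placeholders, not experimental outputs.

```python
# Hedged sketch of the AUC computation behind Table 8. The labels and
# scores are hypothetical placeholders, e.g. ground truth and sigmoid
# outputs of a model trained on FF++ and tested on Celeb-DF.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                   # 1 = deepfake
y_score = [0.1, 0.4, 0.8, 0.6, 0.9, 0.3, 0.7, 0.2]  # model probabilities

print(f"AUC = {roc_auc_score(y_true, y_score):.4f}")
fpr, tpr, _ = roc_curve(y_true, y_score)            # points for the ROC plot
```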