The proposed system uses deep learning to identify whether a video is authentic or has been altered. The input videos are separated into two groups, original videos and tampered videos, as shown in Fig 1. Each video is split into non-overlapping frames, and the video is judged authentic only if every one of its frames is real.
A model trained with transfer learning is used to extract features learned from a large amount of data. The proposed technique employs a deep CNN model with two distinct sets of layers: (1) CNN layers (convolutional, pooling, and dense layers) and (2) customized CNN layers. After frame generation, the extracted features are passed to the CNN layers, which compute a hierarchical representation of the video. Finally, the dense layer of the custom CNN, which serves as a classifier, determines whether a clip is authentic or has been tampered with.
3.1 Learning Algorithm
- Input video clips are divided into original and tampered categories.
- Input: split each video clip (VC) into non-overlapping frames.
- Feature extraction: features are extracted from the frames using transfer learning (TL) and CNN layers.
- Frames are extracted.
- Frames are resized to 128×128×3.
- VGG-16 is used for transfer learning.
- Custom CNN layers are added.
- Training is performed on a separate split of the dataset to check accuracy.
- Output: videos categorized as tampered or original.
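The steps above can be sketched end to end as follows. This is a minimal illustration, not the paper's implementation: `classify_frame` is a stub standing in for the trained VGG-16 + custom-CNN frame classifier, and the dictionary-based "frames" are placeholders for real image arrays. The final decision rule matches the text: a video is labelled original only if every frame is real.

```python
# Illustrative sketch of the learning pipeline described above.
# `classify_frame` is a stub standing in for the trained
# VGG-16 + custom-CNN classifier (an assumption for illustration).

def split_into_frames(video):
    """Split a video (here: a list of frame records) into frames."""
    return list(video)

def classify_frame(frame):
    """Stub frame classifier: True means the frame looks original."""
    return frame.get("tampered", False) is False

def classify_video(video):
    """A video is labelled 'original' only if every frame is real."""
    frames = split_into_frames(video)
    predictions = [classify_frame(f) for f in frames]
    return "original" if all(predictions) else "tampered"

clean = [{"tampered": False}, {"tampered": False}]
forged = [{"tampered": False}, {"tampered": True}]
print(classify_video(clean))   # original
print(classify_video(forged))  # tampered
```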
3.2 Preprocessing and Feature Extraction
Before training a classification framework on a video classification dataset, the dataset must first be preprocessed. Real-world data is generally noisy, incomplete, and in a format that cannot be used directly. Data preparation is therefore an essential step for cleaning data and readying it for machine learning algorithms, and it increases the model's accuracy and performance. It is an iterative process that turns raw data into understandable and usable formats; raw datasets are typically characterized by incompleteness, inconsistency, missing patterns, and errors. Preprocessing here consists of three steps: extracting frames, resizing frames, and normalizing frames.

A video is a collection of frames, each of which captures a distinct stage of an object's state. Because an object is difficult to detect and track across an entire video sequence, the first phase of preprocessing is video frame extraction; the extracted frames are then used for object identification, detection, and tracking. Identifying an efficient technique for extracting key frames from video is therefore critical. To be classified, the videos are first separated into non-overlapping frames. OpenCV offers an interface for this: using the Python OpenCV bindings, the videos are converted into frames, and the resulting frames are then resized to a fixed width and height.
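The three preprocessing steps can be sketched as below. To keep the example self-contained, a random array stands in for a decoded video and a nearest-neighbour resize stands in for OpenCV's `cv2.resize`; in practice `cv2.VideoCapture` would read the real frames. Function names here are illustrative, not from the paper.

```python
import numpy as np

# Sketch of the preprocessing stage: extract frames, resize to
# 128x128x3, and normalize to [0, 1]. A random uint8 array stands in
# for a decoded video; in practice OpenCV (cv2.VideoCapture and
# cv2.resize) would read and resize real frames.

def extract_frames(video, step=1):
    """Take every `step`-th frame from a (T, H, W, 3) video array."""
    return video[::step]

def resize_frame(frame, size=(128, 128)):
    """Nearest-neighbour resize (a stand-in for cv2.resize)."""
    h, w = frame.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return frame[rows][:, cols]

def preprocess(video):
    frames = extract_frames(video)
    frames = np.stack([resize_frame(f) for f in frames])
    return frames.astype(np.float32) / 255.0   # normalize to [0, 1]

video = np.random.randint(0, 256, size=(10, 240, 320, 3), dtype=np.uint8)
batch = preprocess(video)
print(batch.shape)  # (10, 128, 128, 3)
```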
3.3 Transfer Learning
Transfer learning is a deep learning method in which an existing model, trained on a large amount of data, is reused and the features it has learned are applied to the problem at hand. Such a model is good at discovering general features because it has learned them from a large dataset, and it can be adapted and trained for specific requirements. One limitation of deep models is that over-fitting occurs when they are trained on a limited dataset. Increasing the size of the training data can prevent over-fitting, but creating a large amount of labeled data is challenging; transfer learning addresses this problem. In transfer learning, a model developed for one task is used as the foundation model for a different task. Rather than training all of the model's layers, transfer learning locks (freezes) some of them and uses the trained weights in the frozen layers to extract general attributes from the data.
The final fully connected layers, such as FC6, FC7, and FC8, can be re-trained because they are customized to the new data.
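The freezing idea can be shown on a toy model: a gradient step updates only the newly added head while the pre-trained base weights stay untouched. This is a minimal numpy illustration of the mechanism (in Keras the same effect comes from setting `layer.trainable = False` on the VGG-16 layers); the two-layer linear model here is an assumption for illustration, not VGG-16.

```python
import numpy as np

# Minimal illustration of layer freezing in transfer learning: the
# "pretrained" base weights are locked and act as a feature extractor,
# while a gradient step updates only the new task-specific head.

rng = np.random.default_rng(0)
W_base = rng.normal(size=(4, 3))   # pretrained weights, frozen
W_head = rng.normal(size=(3, 2))   # new head, trainable

x = rng.normal(size=(5, 4))        # toy inputs
y = rng.normal(size=(5, 2))        # toy targets

features = x @ W_base              # frozen layers extract features
pred = features @ W_head
grad_head = features.T @ (pred - y) / len(x)   # gradient w.r.t. head only

W_base_before = W_base.copy()
W_head -= 0.1 * grad_head          # update only the head
assert np.allclose(W_base, W_base_before)      # frozen weights untouched
print("head updated, base frozen")
```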
3.4 Custom CNN Layers
In the proposed technique, the input layer and the last layers of the VGG-16 model (the dense layers FC6, FC7, FC8, and the softmax) are altered, as shown in Fig 3. The VGG-16 model's input layer accepts a 224×224 input image shape by default; this layer was replaced by one that accepts a 128×128 input shape. The final layers were replaced by an 8-layer custom CNN head. VGG-16's last dense layers learn task-specific information and are trained with the SGD algorithm, which needs a large amount of data and time. The FC6, FC7, FC8, and softmax layers of VGG-16 have therefore been replaced with eight custom CNN layers driven by the preceding layer's activations (Fig 4).
The eight custom layers are made up of two Conv2D layers, two Batch Normalization layers, a MaxPooling2D layer, a GlobalAveragePooling layer, and two Dense layers. Images can exist in HSV, RGB, Grayscale, CMYK, and other color spaces. The convolutional layers compress the image into a more tractable form without losing the information critical to making a successful prediction. ConvNets are not restricted to a single convolutional layer: traditionally, the first conv layer captures low-level information such as edges, color, and gradient direction, while deeper layers respond to high-level features, yielding a network that understands the images in the dataset.

Batch normalization is a technique for training very deep neural networks in which the inputs to each mini-batch are normalized. It stabilizes the learning process, resulting in a considerable reduction in the number of training epochs required. This is accomplished by normalizing the activations of each input parameter for every mini-batch: normalization rescales the data so that the mean is zero and the standard deviation is one, and batch normalization keeps the mean output near zero and the standard deviation near one.

Like the convolutional layer, the pooling layer reduces the dimension of a convolved feature. It extracts dominant characteristics that are both rotationally and positionally invariant, enabling the model to be trained effectively. There are two types of pooling: max pooling and average pooling. Max pooling takes the maximum value from the region of the image covered by the kernel; it also suppresses noisy activations, performing de-noising along with dimensionality reduction. Average pooling only performs dimension reduction as a noise-suppression strategy, so max pooling generally outperforms average pooling.
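The difference between the two pooling operations can be seen on a tiny example: a 4×4 feature map pooled with a 2×2 window and stride 2. The values are chosen for illustration only.

```python
import numpy as np

# Max pooling vs. average pooling on one 4x4 feature map,
# 2x2 window, stride 2.

fmap = np.array([[1, 3, 2, 0],
                 [4, 8, 1, 1],
                 [0, 2, 9, 5],
                 [1, 1, 3, 7]], dtype=float)

# Rearrange into a 2x2 grid of 2x2 windows, then reduce each window.
blocks = fmap.reshape(2, 2, 2, 2).swapaxes(1, 2)
max_pool = blocks.max(axis=(2, 3))    # keeps the strongest activation
avg_pool = blocks.mean(axis=(2, 3))   # smooths the window

print(max_pool)  # [[8. 2.] [2. 9.]]
print(avg_pool)  # [[4. 1.] [1. 6.]]
```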
With global average pooling, the pool size equals the size of the layer input, and the average rather than the maximum is used. Global average pooling layers are frequently used to replace fully connected layers in classifiers. In a dense layer, each neuron receives input from all neurons in the previous layer, making it a densely connected neural network layer; a dense layer produces an m-dimensional vector and is mostly employed to change the vector's dimension. In this architecture, the dense layer is linked to the classification output. Softmax turns a set of values into a probability distribution and is typically used as the activation of the final layer of a classification network, since the output can then be interpreted as a probability distribution.
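The classification head described above can be sketched in a few lines of numpy: global average pooling collapses each feature map to one value, a dense layer maps the resulting vector to two class scores, and softmax turns the scores into probabilities over {original, tampered}. The weights here are random placeholders, not trained values.

```python
import numpy as np

# Sketch of the classification head: GlobalAveragePooling -> Dense ->
# Softmax. Feature maps and weights are random placeholders.

rng = np.random.default_rng(42)
feature_maps = rng.normal(size=(8, 8, 32))    # H x W x channels

gap = feature_maps.mean(axis=(0, 1))          # (32,): one value per channel
W = rng.normal(size=(32, 2))                  # dense layer, 2 classes
b = np.zeros(2)
logits = gap @ W + b

def softmax(z):
    e = np.exp(z - z.max())                   # numerically stable softmax
    return e / e.sum()

probs = softmax(logits)                       # probability distribution
print(probs)
```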
3.5 DATASET DESCRIPTION
3.5.1 VIFFD - Video Inter-Frame Forgeries Detection
VIFFD is a dataset used for detecting video inter-frame forgeries. In inter-frame forgeries, whole frames are copied from one part of a video and inserted into another part of the same video; these are also known as frame copy-move forgeries. The dataset consists of 136 training videos and 136 testing videos, for a total of 272 videos.
3.5.2 ViFoDAC - Video Forgery Detection and Classification
ViFoDAC is a collection of authentic and forged videos. The dataset contains a total of 32 videos: 16 authentic and 16 forged. The authentic videos are recorded with a camera, whereas the forged videos are edited with certain tools by inserting objects into the video sequence. Each video is about 10 to 62 seconds long, and all are captured with a moving camera. This dataset is very useful for forgeries where objects are externally added to the video and, beyond object-based forgery, can be effective for detecting any kind of forgery where the background keeps changing.
3.5.3 VDFF_3D DATASET
VDFF_3D is the third dataset used in this project. It is used for evaluating the detection of small three-dimensional region forgeries in videos and consists of 50 original and tampered videos, all of which were used in the project. Each video is about 6 to 22 seconds long and is captured with a static camera. The dataset is effective for detecting forgeries involving small 3-D objects in videos where the background remains static and does not move.
3.5.4 REWIND_3D DATASET
REWIND_3D is a dataset used for detecting small object-based forgeries in videos. It consists of 20 videos in total: 10 original and 10 tampered. Each video is about 6 to 18 seconds long and is captured with a static camera against an unchanging background. This dataset is useful for detecting forgeries where small objects are inserted into an existing video whose background does not change.
3.5.5 Tampered Video Dataset
The Tampered Video Dataset is a collection of 160 tampered videos derived from six source videos. Tampered videos were created by choosing an object in a video frame and tracking it for a specified number of frames; after various alterations, the duplicated object is cloned into another area of the same video. Transformations such as brightness change, flipping, rotation, scaling, shearing, RGB change, and copy-move of objects without transformation are performed, yielding seven types of forgery transformations from a single video. This dataset is effective for detecting forgeries where the background remains the same and objects from the video are copied to another part of the video after undergoing certain transformations.
Table 2 : Analysis of data used for forgery detection.
| DATASET   | VIFFD     | ViFoDAC | VDFF_3D | REWIND_3D | TAMPERED VIDEO |
| STATIC    | 136 / 272 | -       | 50 / 50 | 20 / 20   | 160 / 160      |
| REAL TIME | -         | 32 / 32 | -       | -         | -              |
| COMBINED  | 136 / 272 | 32 / 32 | 50 / 50 | 20 / 20   | 160 / 160      |
Table 2 above summarizes the number of videos used from each of these datasets.