## 3.1 Employing Residual Information

Compressed videos already carry useful, readily available information such as motion vectors and quantized coefficients. During video compression, each frame is divided into macroblocks, and for each macroblock the best match is found by searching one or more reference frames. The difference between each macroblock and its best match block is then calculated. Equation (1) expresses this difference for each pixel:

$${R_i}={M_i} - M_{i}^{P} \tag{1}$$

Here \({R_i}\), \({M_i}\), and \(M_{i}^{P}\) denote the residual block, the original macroblock, and the predicted best-match macroblock, respectively. The quantized discrete cosine transform of these differences is encoded and stored in the compressed video format. \({R_q}\) denotes the result of applying the \(DCT\) and quantization (\(Q\)) to the residual information:

$${R_q}=Q(DCT({R_i})) \tag{2}$$

Thus, for every macroblock, \({R_q}\) and a 2-D motion vector are generated and transmitted.
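Equations (1) and (2) can be sketched as follows. This is an illustrative toy encoder, not a standard-compliant one: we build an orthonormal DCT-II basis by hand and use a single uniform quantization step `q`, whereas real MPEG-2 encoders use standard-specific quantization matrices.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix of size n x n.
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] *= 1 / np.sqrt(2)
    return C * np.sqrt(2 / n)

def encode_block(block, best_match, q=16):
    # Eq. (1): residual between a macroblock and its best match,
    # followed by Eq. (2): 2-D DCT and uniform quantization with
    # step q (illustrative choice, not the MPEG-2 matrix).
    R = block.astype(float) - best_match.astype(float)
    C = dct_matrix(R.shape[0])
    return np.round(C @ R @ C.T / q)
```

When a macroblock and its best match are identical, the residual is zero and every quantized coefficient is zero, which is why static regions cost almost nothing to encode.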

The size of the macroblocks depends on the compression standard. In some standards the macroblock size is fixed, while recent standards allow frames to be sliced into blocks of different sizes: smaller macroblocks typically encode regions with fast motion and complex structures, while larger ones encode regions with simple structures. In the MPEG-2 standard, the macroblock size is fixed at 16×16.

In the implementation of the proposed method, we assume a compressed video file is available in the MPEG-2 format. The compressed video is then partially decoded to obtain the residues: we apply dequantization (\({Q^{ - 1}}\)) followed by the inverse discrete cosine transform (\(IDCT\)) to estimate the residual information in the spatial domain.

$$\tilde {R}=IDCT({Q^{ - 1}}({R_q})) \tag{3}$$
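The partial-decoding step of Equation (3) can be sketched in the same toy setting: dequantization is multiplication by the quantization step, and the inverse 2-D DCT uses the transpose of the orthonormal basis. The uniform step `q` is an illustrative assumption.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix of size n x n.
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] *= 1 / np.sqrt(2)
    return C * np.sqrt(2 / n)

def decode_residual(R_q, q=16):
    # Eq. (3): dequantize (multiply back by the step q), then apply the
    # inverse 2-D DCT; C is orthonormal, so the inverse is C.T @ X @ C.
    C = dct_matrix(R_q.shape[0])
    return C.T @ (R_q * q) @ C
```

The recovered \(\tilde{R}\) differs from the true residual only by quantization error, which is bounded by the step size, so it is a faithful spatial-domain estimate.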

We use the obtained residuals as the input of a pre-trained neural network for feature extraction. In this paper, *ImageNet VGG-f* is used as the pre-trained model; it was trained with *MatConvNet* on the ImageNet dataset.

The model's input is the residual estimate obtained by partially decoding the compressed video. Each input is resized to \(224 \times 224\) and normalized by subtracting the average image specified in the pre-trained network. The outputs of the \({18^{th}}\) layer, of size \(4096\), are taken as the extracted features: every input produces a \(4096 \times 1\) feature vector, so an input video with \(M\) residual frames generates a matrix of size \(4096 \times M\).
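A minimal sketch of this preprocessing and feature-stacking step is shown below. Since the VGG-f weights are not bundled with standard Python libraries, `extract` is a placeholder for the network's forward pass up to layer 18; the nearest-neighbour resize is also a simplification chosen for brevity.

```python
import numpy as np

def preprocess(residual, avg_image):
    # Resize to 224x224 with nearest-neighbour sampling (for brevity)
    # and subtract the pre-trained network's average image.
    h, w = residual.shape
    rows = np.arange(224) * h // 224
    cols = np.arange(224) * w // 224
    return residual[np.ix_(rows, cols)] - avg_image

def feature_matrix(residuals, extract, avg_image):
    # `extract` stands in for the VGG-f forward pass up to layer 18,
    # mapping one 224x224 input to a 4096-d vector; stacking the M
    # vectors column-wise yields the 4096 x M matrix described above.
    return np.stack([extract(preprocess(r, avg_image))
                     for r in residuals], axis=1)
```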

An SVM classifier with a \(\chi^2\) kernel is used for the action classification step. The classifier requires a fixed-size input, but the output of the feature extraction step has size \(4096 \times M\), which varies with the video length; we therefore use the Pooled Time Series approach [5] to reshape the matrix to a fixed size.

We adopt a temporal partitioning method for data augmentation and action segmentation. Temporal partitioning of motion information was first presented using optical flow images in [30]; in this paper, the input videos are formed from residual information instead. Each residual video is decomposed into several temporal partitions, and a max-pooling operator is applied to each partition. The action of each partition is recognized, and the class of the input video is obtained by a voting strategy. In our implementation the partition count is 8, so the main video is decomposed into eight segments; the pooling method applied to each part produces eight decisions, and we vote among these decisions to classify the action of the input video.
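The partition-pool-vote procedure can be sketched as follows. The `classify` argument stands in for the trained \(\chi^2\)-kernel SVM, and the sketch assumes the video has at least as many feature columns as partitions.

```python
import numpy as np
from collections import Counter

def classify_video(F, classify, parts=8):
    # Split the 4096 x M feature matrix into `parts` temporal partitions,
    # max-pool each partition over time to a fixed-size descriptor,
    # classify every pooled descriptor, and return the majority label.
    # Assumes M >= parts so no partition is empty.
    partitions = np.array_split(F, parts, axis=1)
    decisions = [classify(p.max(axis=1)) for p in partitions]
    return Counter(decisions).most_common(1)[0][0]
```

Because each partition is pooled to a fixed 4096-d vector, every partition can reuse the same classifier regardless of the video length.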

The idea of using residual information was originally proposed in [2], where we showed that residuals carry enough information to replace the original RGB frames. We also investigated the feasibility of this low-complexity approach for healthcare monitoring, as a case of real-time applications, in [1].

In the next section, we accumulate the residual information in order to magnify the relevant information. An accumulation strategy with a fixed time step was used in [28, 29]; in the following section, we instead introduce a dynamic, content-based accumulation approach and demonstrate its effectiveness in the experimental results.

## 3.2 Accumulating Residual Frames

In many cases, residuals contain few significant values, especially when the motion in the video is very slow. Moreover, in some types of videos, such as surveillance footage, the number of frames is so high that analyzing each frame individually is impractical. Feeding each single residual to the CNN in these cases harms time efficiency, resulting in a dramatic reduction in performance. This section proposes a dynamic accumulation algorithm that accumulates similar residuals and feeds the accumulated frame to the CNN.

To better illustrate the efficiency of accumulated residuals, Fig. 2 compares accumulated residuals with a single residual frame, both representing a person moving his hand. The magnified parts show how accumulated residual frames represent the motion more clearly than a single residual.

In this method, feeding residuals to the CNN is improved by deploying a temporal window that moves over the residual frames. This window stores the similarity values between consecutive residuals.

*Similarity* measures how alike two residual frames are: its value lies between zero and one, and the closer it is to one, the more similar the two residuals are. Equation (4) gives the *similarity* formula, in which \({R_i}\) is the current residual, \({R_{i - 1}}\) is the previous one, and \(c\) is a small constant.

$$SIMILARIT{Y_i}=\frac{{2 \times {R_i} \times {R_{i - 1}}+c}}{{R_{i}^{2}+R_{{i - 1}}^{2}+c}} \tag{4}$$
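A small sketch of Equation (4) follows. We read the products and squares as sums over all pixels (one scalar per frame pair); the constant `c` is chosen here only for illustration, and it keeps the ratio stable when both residuals are nearly empty.

```python
import numpy as np

def similarity(R_i, R_prev, c=1e-4):
    # Eq. (4): a single similarity score for a pair of residual frames,
    # with the pixel-wise products and squares summed over the frame.
    # Identical residuals give exactly 1; an empty residual against a
    # non-empty one gives a value near 0.
    num = 2.0 * np.sum(R_i * R_prev) + c
    den = np.sum(R_i ** 2) + np.sum(R_prev ** 2) + c
    return num / den
```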

The window size determines how many consecutive residual similarities are stored in the window's history: if the window size is *N*, the window contains the similarities of the last *N* consecutive residual frames. While processing the residual frames, for each new residual the similarity with the previous one is calculated. If the similarity exceeds the mean of the temporal window, the current residual is accumulated with the last group. Otherwise, we assume a new motion event has begun: the previous group of residuals is cut, a new group containing the current residual is started, and the members of the previous group are accumulated to form a CNN input.

In addition, the window members are updated so that at each step the window contains the similarities of the *N* newest residuals, where *N* is the window size. The steps of the algorithm are explained in more detail in Algorithm 1.

1. Add *N* (the temporal window size) residual similarities to the window and compute the mean of the vector containing the residuals' similarities.
2. Calculate the similarity of the next residual with the last window member.
3. If the similarity is greater than the mean of the window, accumulate this residual with the previous frames and go to step 5; otherwise, go to step 4.
4. Feed the last group of accumulated residuals to the CNN and go to step 1.
5. Move the temporal window forward.
6. Update the mean of the window.
7. If any residual remains, go to step 2.
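The steps above can be sketched as one function. Two assumptions are made explicit here: "accumulating" a group is implemented as an element-wise sum of its residuals, and the input sequence is assumed to contain more than *N* frames so the window can be initialized.

```python
import numpy as np

def similarity(a, b, c=1e-4):
    # Eq. (4), summed over all pixels.
    return (2.0 * np.sum(a * b) + c) / (np.sum(a ** 2) + np.sum(b ** 2) + c)

def accumulate_residuals(residuals, N=4):
    # Algorithm 1 (sketch): keep the last N similarities in a temporal
    # window; while a new residual's similarity to its predecessor stays
    # at or above the window mean, add it to the current group, otherwise
    # flush the group (element-wise sum) as one CNN input and start a new
    # group. Assumes len(residuals) > N.
    window = [similarity(residuals[i], residuals[i + 1]) for i in range(N)]
    group, inputs = [residuals[0]], []
    for i in range(1, len(residuals)):
        s = similarity(residuals[i], residuals[i - 1])
        if s >= np.mean(window):
            group.append(residuals[i])            # same motion event
        else:
            inputs.append(np.sum(group, axis=0))  # event boundary: flush
            group = [residuals[i]]
        window = window[1:] + [s]                 # slide the window forward
    inputs.append(np.sum(group, axis=0))
    return inputs
```

On a sequence of near-identical residuals this yields a single accumulated frame, which is exactly the reduction in CNN inputs the method targets.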

**Algorithm 1.** The steps of detecting and accumulating similar residuals.

By applying this dynamic accumulation algorithm, we improve the efficiency of the frame-by-frame approach, feeding fewer inputs to the CNN for feature extraction. Experimental results show that this method decreases the number of processed residual frames by about 45%. An overview of the dynamic accumulation approach is shown in Fig. 3.