Lite general network and MagFace CNN for micro-expression spotting in long videos

Facial expressions, especially spontaneous micro-expressions, are an intuitive reflection of human emotions and have attracted growing attention alongside rapid advances in computer vision. Micro-expressions are small in amplitude and short in duration, and often appear together with macro-expressions, which makes micro-expression spotting in long videos a challenging task. In this article, we propose an intersection over minimum labelling method combined with a Lite General Network and MagFace CNN (LGNMNet) model to predict the probability of video frames belonging to a micro-expression interval, which balances easy and difficult samples to improve the learning effect of the training process. Experimental results show that our method achieves state-of-the-art performance in spotting micro-expressions in long videos on both the CAS(ME)² and SAMM-LV datasets (with F1-scores of 0.2474 and 0.2555, respectively). Additionally, a new pair-merge method for combining nearby detected apex frames into micro-expression intervals in the post-processing stage has been devised and analysed, providing a feasible solution for the task of macro- and micro-expression spotting in long videos.


Introduction
Micro-Expressions (MEs) occur unconsciously when a person attempts to conceal or hide genuine emotions, and are regarded as involuntary facial expressions [1]. Compared with explicit Macro-Expressions (MaEs), MEs have great potential in many fields such as clinical diagnosis [2], and have recently attracted much attention in automatic facial expression analysis research. Although a series of exploratory works have successfully recognized and analyzed the emotional state of MEs that were manually located in videos, fully automatic recognition of MEs remains unsatisfactory because spotting the temporal locations of facial expressions in videos has yielded only limited results [3]. Unlike ordinary facial expressions or MaEs, MEs are subtle, occur over a very short period of time with low intensity, and sometimes co-occur with MaEs, making ME spotting a challenging problem [4].
The facial expression spotting task can be described as locating an interval consisting of three distinct phases: onset, apex, and offset. Optical flow, which is widely used to detect subtle motion changes, has also been utilized to characterize subtle facial movement during MEs [5]. Early ME spotting works preferred handcrafted features to differentiate MEs from non-MEs in videos, such as main directional maximal difference (MDMD) [6] and mean directional mean optical flow (MDMO) [7]. With the extensive progress in computer vision, recent researchers have widely used deep learning models such as CNNs [8] and LSTMs [9][10][11] to extract ME features from raw images in video sequences [12]. On one hand, to alleviate the imbalance between positive and negative samples in ME spotting, Liong's work [13] formulates the localization task as a regression problem, outperforming prevailing work that treats localization as a classification task. By building a binary search method to automatically locate the apex frames and a shallow optical flow three-stream CNN, Liong et al. effectively captured the apex frames of MEs in video sequences, but two issues remain. (1) Whether frame F_i lies in an ME interval was judged by a simple IoU value, so spotting performance deteriorates near the onset and offset frames. (2) The optical flow feature difference between two frames at a fixed distance was employed as the feature descriptor, resulting in the appearance of pseudo apex frames. On the other hand, the imbalance exists not only between positive and negative ME intervals, but also between hard and easy frames within one ME interval. Clearly, within an ME interval, frames close to the apex frame are easier to detect than those close to the onset/offset frames, resulting in more errors when detecting frames far from the apex.
Hence, inspired by Liong's work, a new approach has been proposed that takes into account both the inter-ME and intra-ME imbalance. We modify the pseudo-labelling strategy and assign data labels according to an intersection over minimum strategy to avoid errors in sample calibration. The classification task is transformed into a regression problem, combined with polynomial fitting, to predict the probability that each frame belongs to an ME. We build a general low-dimensional feature embedding loss model, called the Lite General Network and MagFace CNN (LGNMNet) model, based on the ideas of shaping intra-class feature distributions and face image quality in face recognition [3] and monotonically increasing the weights of the facial features in peak areas. A pair-merge method is then designed to construct the latent ME interval by exploiting the positions of a detected peak frame and its corresponding pseudo-peak frame, which effectively improves the accuracy of hard-frame spotting within an ME interval. The contributions of our paper are summarized as follows:

1. We implement an intersection over minimum strategy for frame labelling in place of the traditional IoU-based labelling method to alleviate the loss of samples near the boundary of a micro-expression.
2. We propose the LGNMNet model, which employs a deeply compressed model to embed the optical flow features extracted from raw videos into a MagFace loss model, thereby improving the learning effect by balancing easy and hard samples and addressing the model overfitting caused by noisy low-quality samples.
3. We present a novel pair-merge method for combining nearby detected apex frames into ME intervals, enhancing the ability to spot the boundary frames of an ME interval.

Related work
Facial expressions, including Macro-Expressions (MaEs) and Micro-Expressions (MEs), are the most important visual behavioural manifestations of human emotions, reflecting people's attitudes, experiences, and corresponding behavioural responses to objects [14, 15]. The major difference between these two classes lies in their duration and intensity [3]. MaEs are explicit, involve facial movements that cover a large facial area, and last between 0.5 and 4 s [16], while MEs are local expressions that last between 0.065 and 0.5 s [17]. Their small amplitude and short duration make MEs brief and hard to observe [4]. ME analysis involves ME spotting and ME recognition, and the temporal position of ME events must be detected before any recognition step. Li et al. found that ME recognition performed better when evaluated on manually segmented ME samples than on samples obtained without spotting [18]. Since an ME is a subtle and imperceptible movement of human facial muscles, the effectiveness of ME spotting relies on discriminative features that can be extracted from image sequences. In past years, a series of hand-crafted features have been designed, including local binary patterns (LBPs) [19], Local Binary Pattern-Three Orthogonal Planes (LBP-TOP) [20], histograms of oriented gradients (HOGs) [21], 3D Histograms of Oriented Gradients (3DHOG) [21], and so on. Optical flow describes the motion pattern of moving objects/scenes in an image sequence and can be detected from the intensity change of pixels between two image frames over time [3]. Based on optical flow features, second-order features have been proposed, ranging from temporal features such as the facial dynamics map (FDM) [22] to spatial features such as main directional maximal difference (MDMD) [6] and mean directional mean optical flow (MDMO) [7]. In terms of ME spotting, Shreve attempted to distinguish MaEs from MEs [23] and to exploit non-rigid facial motion [24] based on optical strain. The spotting task, including onset, apex, and offset location of one ME, has been explored by Patel based on a discriminative response map fitting (DRMF) model [25]. Guo considered motion angle information and designed magnitude and angle combined (MAC) optical flow features to improve spotting efficiency [26]. In summary, conventional unsupervised methods for ME spotting typically rely on setting thresholds for specific hand-crafted features based on human experience or statistical properties. Although some later studies [27, 28] took advantage of machine learning, their performance was still limited because traditional learning methods were not robust enough to handle the subtle movements of MEs. Recently, deep learning approaches have achieved great success in integrating automatic feature extraction and classification. Deep learning models have been widely used to detect the interval or apex frame of an ME using features extracted from raw images in video sequences as input [12]. Some interval-based deep learning models use video clips as input and utilize long short-term memory (LSTM) networks [9][10][11] or a clip proposal network [29] to obtain potential ME intervals. Convolutional neural networks (CNNs) [8] have also been integrated into ME analysis to encode the feature representation of temporal states (onset, apex, and offset) since Kim's first trial [30]. In addition to spotting ME intervals, several other deep learning models focus on spotting a specific type of ME phase, specifically the apex frame, rather than the entire interval. A binary search method was proposed by Liong et al. [31] to automatically locate the apex frame in a video sequence. Pan et al. [32] proposed a CNN-based model to classify all frames into MaE frames, ME frames, or irrelevant frames. The CNN module has commonly been used as the basis for recognizing changes in facial AUs [29, 33, 34], and a series of post-processing methods are then combined to obtain the expression intervals [35].

Proposed framework
Our objective is to spot MEs in long video sequences. The processing pipeline, which includes three stages (image preprocessing, LGNMNet feature learning, and the ME spotting procedure), is illustrated in Fig. 1.

Face alignment
To reduce the effects of camera shake and head turning, all faces need to be normalized to the same scale and a frontal viewing angle. The RetinaFace method [36] was adopted to perform face detection and landmark detection frame by frame on the original video. Face alignment and normalization to 112 × 112 pixels were then performed according to the detected landmarks.
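As an illustration, the detection and alignment step can be sketched with the InsightFace implementation of RetinaFace; the detector settings and the use of `norm_crop` below are assumptions about one possible implementation, not the exact configuration of our pipeline.

```python
from insightface.app import FaceAnalysis
from insightface.utils import face_align

# Assumed detector setup: the RetinaFace detector bundled with InsightFace.
app = FaceAnalysis(allowed_modules=["detection"])
app.prepare(ctx_id=0, det_size=(640, 640))

def align_frame(frame_bgr):
    """Detect the face in one video frame and warp it to a 112 x 112 frontal crop."""
    faces = app.get(frame_bgr)
    if not faces:
        return None                                    # no face found in this frame
    face = max(faces, key=lambda f: f.det_score)       # keep the most confident detection
    # norm_crop applies a similarity transform estimated from the 5 detected landmarks
    return face_align.norm_crop(frame_bgr, landmark=face.kps, image_size=112)
```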

Optical flow feature extraction
In previous work [13, 15, 26], optical flow features were proven to be advantageous for characterizing subtle spatiotemporal movement information. The optical flow f between the i-th frame F_i and the (i + k)-th frame F_{i+k} was calculated with the TV-L1 method [37], and the horizontal and vertical components u and v of the optical flow f were converted to polar coordinates and used as the hue and saturation channels. Then, the optical flow feature of the i-th frame F_i was extracted by converting the HSV values encoded from u and v to BGR values [38]. Here, the value of k was set to half of the average length of expressions, as shown in formula (1).
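A minimal sketch of this step, assuming OpenCV's TV-L1 implementation from `opencv-contrib-python` and the common HSV flow-encoding recipe (the exact channel mapping in our code may differ):

```python
import cv2
import numpy as np

tvl1 = cv2.optflow.createOptFlow_DualTVL1()            # TV-L1 solver; requires opencv-contrib-python

def flow_feature(frame_i, frame_i_plus_k):
    """TV-L1 optical flow between two aligned frames, encoded as a BGR image."""
    g0 = cv2.cvtColor(frame_i, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame_i_plus_k, cv2.COLOR_BGR2GRAY)
    flow = tvl1.calc(g0, g1, None)                     # flow[..., 0] = u, flow[..., 1] = v
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*g0.shape, 3), dtype=np.uint8)
    hsv[..., 0] = ang * 180 / np.pi / 2                # hue encodes flow direction (0-179)
    hsv[..., 1] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # saturation encodes magnitude
    hsv[..., 2] = 255
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```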

Region-of-interest (ROI) extraction
The motion of the nose region is removed to eliminate global motion in each frame [27]. The facial action units involved in an ME are distributed around the eyebrows, mouth, and corners of the eyes [13]. Since the eyes themselves do not present effective action units for characterizing expressions, and blinking can disturb the optical flow, we applied a polygon mask with a 4-pixel boundary dilation to remove each eye area. In our paper, three ROIs with an extra border of 4 pixels were extracted: (1) the left eye and left eyebrow, (2) the right eye and right eyebrow, and (3) the mouth. We cut the ROIs from the aligned image, resized regions (1) and (2) each to 56 × 56 pixels, stitched them into a single 56 × 112 pixel image, and combined this with the resized 56 × 112 pixel image obtained from region (3) to finally obtain a new 112 × 112 pixel image containing the main facial action units [3, 13].
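The cropping and stitching can be sketched as below; the box coordinates for the eyebrow/eye and mouth regions are hypothetical placeholders that would in practice be derived from the detected landmarks (with the eye polygons already masked out).

```python
import cv2
import numpy as np

def build_roi_image(aligned, left_box, right_box, mouth_box):
    """Stitch three ROIs of a 112 x 112 aligned face into one 112 x 112 feature image.

    Each *_box is an (x0, y0, x1, y1) rectangle already padded by the 4-pixel border.
    """
    left = cv2.resize(aligned[left_box[1]:left_box[3], left_box[0]:left_box[2]], (56, 56))
    right = cv2.resize(aligned[right_box[1]:right_box[3], right_box[0]:right_box[2]], (56, 56))
    mouth = cv2.resize(aligned[mouth_box[1]:mouth_box[3], mouth_box[0]:mouth_box[2]], (112, 56))
    top = np.hstack([left, right])      # 56 x 112 strip: left brow/eye next to right brow/eye
    return np.vstack([top, mouth])      # 112 x 112 image: brow strip stacked over the mouth strip
```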

Lite general network and MagFace CNN (LGNMNet)

Pseudo-labelling: intersection over minimum labelling (IML)
The closer an AU is to the onset or offset time of the corresponding expression, the smaller its motion amplitude. Consequently, the distinction between an expression AU and a non-expression AU near the ME boundary may not be significant. Yap [35] calculated the ratio ℜ between the intersection of the current interval of k frames with the duration of the current expression (corresponding to label = 1) and the union of the current k frames with the current expression, as follows:

ℜ = |[F_i, F_{i+k}] ∩ [F_onset, F_offset]| / |[F_i, F_{i+k}] ∪ [F_onset, F_offset]|  (2)

For ℜ ≥ 0.5, the label of the current frame is set to 1; otherwise, it is set to 0. Here, k is the average length of expressions of the same type as the current expression in the target dataset. However, when the length of the current expression is much greater than or much less than k, a large error arises. For this reason, we instead divide the intersection of [F_i, F_{i+k}] and [F_onset, F_offset] by the length of the shorter of the two intervals to obtain an alternative indicator. Similarly, when ℜ′ ≥ 0.5, the label of the current frame is 1. Mathematically, our indicator ℜ′ is computed as follows:

ℜ′ = |[F_i, F_{i+k}] ∩ [F_onset, F_offset]| / min(|[F_onset, F_offset]|, |[F_i, F_{i+k}]|)  (3)
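For clarity, the two labelling rules can be written directly from the definitions above (a minimal sketch; intervals are treated as inclusive frame-index ranges):

```python
def iou_label(i, k, onset, offset):
    """Formula (2): label = 1 when the IoU of [i, i+k] and [onset, offset] is at least 0.5."""
    inter = max(0, min(i + k, offset) - max(i, onset) + 1)
    union = (k + 1) + (offset - onset + 1) - inter
    return int(inter / union >= 0.5)

def iml_label(i, k, onset, offset):
    """Formula (3): label = 1 when the intersection over the shorter interval is at least 0.5."""
    inter = max(0, min(i + k, offset) - max(i, onset) + 1)
    shorter = min(k + 1, offset - onset + 1)
    return int(inter / shorter >= 0.5)
```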

LGNMNet
The ROI optical flow features obtained in the preprocessing stage are used as input images to the LGNMNet model, and the IML-labelled frame samples are used as input labels. The overall structure of our network consists of a backbone network, a neck network, and a head network, as shown in Fig. 1. First, the backbone network, which contains a series of convolutional and pooling layers, is the MobileNet_V2 [39] network without its classification layer. Subsequently, the first part of the neck network is a Conv2d convolutional layer (with a 1 × 1 kernel), a batch normalization (BN) layer, and a ReLU6 layer. After normalizing the output of the first part of the neck network, the rest of the neck network uses a dropout layer to mitigate overfitting and a linear layer to eliminate redundant information. Finally, the head network performs two tasks: loss computation and classification.
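A minimal PyTorch sketch of this backbone/neck/head layout is given below; the embedding dimension, dropout rate, and pooling choice are assumptions rather than the exact LGNMNet configuration, and the returned embedding is what would be fed to the MagFace-style loss described next.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class LGNMNetSketch(nn.Module):
    def __init__(self, embed_dim=128, num_classes=2):
        super().__init__()
        self.backbone = mobilenet_v2(weights=None).features   # MobileNet_V2 without its classifier
        self.neck_conv = nn.Sequential(                        # first part of the neck
            nn.Conv2d(1280, embed_dim, kernel_size=1),         # 1 x 1 convolution
            nn.BatchNorm2d(embed_dim),
            nn.ReLU6(inplace=True),
        )
        self.dropout = nn.Dropout(0.4)                         # assumed dropout rate
        self.fc = nn.Linear(embed_dim, embed_dim)              # linear layer of the neck
        self.classifier = nn.Linear(embed_dim, num_classes)    # classification head

    def forward(self, x):                                      # x: B x 3 x 112 x 112 ROI flow image
        x = self.backbone(x)
        x = self.neck_conv(x)
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)             # collapse spatial dimensions
        emb = self.fc(self.dropout(x))                         # embedding for the MagFace-style loss
        logits = self.classifier(emb)                          # logits for cross-entropy / softmax scores
        return emb, logits
```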
To improve the generalizability of the model and optimize its feature learning capability, we construct a new loss function L, shown in formula (4), which combines the cross-entropy loss (L_CEH) and the MagFace loss (L_Mag). In detail, the feature embedding strategy used in the MagFace model can effectively reshape the feature distribution of AUs with different intensities from the same class during training, so that features whose locations are close to apex frames are attracted to the class centre, while those close to onset/offset frames are pushed toward the adaptive margin of the class. The predicted classification labels and confidence scores are obtained by applying the softmax function to the one-dimensional outputs mapped from the high-dimensional feature embeddings of the MagFace model.

Apex frame spotting
Liong used a regression model to calculate a set of scores for detected expressions to spot apex frames, making the results highly dependent on, and sensitive to, the accuracy of the deep learning model. To obtain more robust apex frame detection results, the classification problem in our method is converted into a regression task; since the score set S_{i,j} tends to be discrete, the distinguishability of different expressions during post-processing can be improved. The prediction score for the i-th frame in the j-th video is calculated using formula (5), where l_{i,j} ∈ {0, 1} is the category to which the model predicts the current frame belongs and c_{i,j} ∈ [0, 1] is the confidence of this category prediction:

S_{i,j} = c_{i,j} · l_{i,j}  (5)

To eliminate errors due to model-based classification, Savitzky-Golay convolution is used to smooth the score set, yielding a score curve for the target video.
We use peak detection to find the peaks in each video. However, differences in lighting and skin colour between video scenes may cause the peak characteristics to differ between scenes; therefore, a dynamic threshold T is used as the minimum value for judging the existence of a peak P, and the value k is used as the nearest-neighbour interval, as shown in Fig. 2. The threshold T is calculated by formula (6), where S_mean is the mean value of the score set S, S_max is its maximum value, and p is a weight coefficient:

T = S_mean + p · (S_max − S_mean)  (6)
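The scoring, smoothing, and thresholded peak search can be sketched with SciPy as follows; the Savitzky-Golay window length and the weight coefficient value are placeholder assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter, find_peaks

def spot_apex_frames(labels, confidences, k, p=0.6):
    """Score each frame (formula (5)), smooth the score curve, and return the peaks that
    exceed the dynamic threshold T of formula (6). The value of p and the window are assumed."""
    scores = np.asarray(confidences) * np.asarray(labels)       # S_i = c_i * l_i
    window = max(5, 2 * (k // 2) + 1)                           # odd Savitzky-Golay window near k
    smoothed = savgol_filter(scores, window_length=window, polyorder=2)
    threshold = smoothed.mean() + p * (smoothed.max() - smoothed.mean())   # dynamic threshold T
    peaks, _ = find_peaks(smoothed, height=threshold, distance=k)          # k as minimum peak spacing
    return peaks, smoothed
```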

Pair-merge strategy for interval construction
In apex-frame-based expression spotting, non-maximal suppression is usually implemented to merge nearby detected apex frames after spotting: the apex with the highest score is selected, and the k frames before and after the detected apex frame x_i are taken as its expression interval I, as shown in formula (7):

I = [x_i − k, x_i + k]  (7)

However, for one apex frame F_i, frame F_{i−k} can also be detected as an apex frame by Liong's method or by ours, because the optical flow feature difference between frame F_{i−k} and frame F_i is extracted as the feature of frame F_{i−k}, and the optical flow feature difference between frame F_i and frame F_{i+k} is extracted as the feature of frame F_i. Hence, one true peak and one pseudo peak may correspond to the same expression interval in our method, and the traditional simple interval construction strategy [F_i − k, F_i + k] for the detected apex frame F_i is not suitable.
Therefore, for the multiple peak frames P = {p_0, …, p_i, p_{i+1}, …} selected by local-maximum detection in our method, a pair-merge strategy for expression construction is proposed: adjacent peaks p_i and p_{i+1} are paired, and the interval [p_i, p_{i+1}] is extended by μ · k frames on each side to obtain I_i = [p_i − μ · k, p_{i+1} + μ · k]. Finally, overlapping intervals obtained from the pairs of adjacent peaks are merged. In this case, for a true apex frame F_i satisfying F_i ∈ P, the expression interval E constructed by our strategy satisfies E ⊆ [F_i − μ · k, F_i + μ · k]. The corresponding pseudo-peak frame F_{i−k} satisfies F_{i−k} ∈ P, and its expression interval E′ constructed by our strategy satisfies an analogous condition; the union U of the expression intervals constructed from the frames of the set P′ can then be simplified, as shown in formula (8).
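A sketch of the pair-merge construction follows, with `mu` standing for the interval-extension factor μ introduced above (its default value here is only illustrative):

```python
def pair_merge_intervals(peaks, k, mu=0.5):
    """Pair adjacent detected peaks, extend each pair by mu * k frames on both sides,
    then merge overlapping intervals into the final predicted expression intervals."""
    ext = int(round(mu * k))
    raw = [(peaks[i] - ext, peaks[i + 1] + ext) for i in range(len(peaks) - 1)]
    merged = []
    for start, end in sorted(raw):
        if merged and start <= merged[-1][1]:                   # overlaps the previous interval
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```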
We now compare the expression interval I constructed by the conventional method with the expression interval U constructed by our method. For a real expression interval TE = [T_i − k, T_i + k], a predicted expression interval PE is judged as a true positive if the intersection-over-union (IoU) between TE and PE exceeds a threshold θ. For the traditional expression interval I, when the predicted apex frame x_i satisfies x_i ∈ [T_i + 2k(θ − 1), T_i + 2k(1 − θ)], its constructed expression interval is judged as true. For our expression interval U, the predicted apex frame x_i needs only to satisfy a looser condition. Based on the above analysis, it is reasonable to expect that our pair-merge method for expression construction has better fault tolerance.

Experiment
To demonstrate the effectiveness of the ME spotting framework, we applied leave-one-subject-out (LOSO) cross-validation for model training and inference [26]. This section describes the datasets analyzed and the details of the model training steps in our algorithm. The code is available at https://github.com/bleakie/LGNMNet.

Datasets
We used the most extensive and authoritative datasets in the field of ME research, namely, the SAMM-LV [40] and CAS(ME)² [41, 42] datasets. The 147 long videos in the SAMM-LV dataset contain 343 MaEs and 159 MEs, and the 87 long videos in the CAS(ME)² dataset contain 300 MaEs and 57 MEs. Basic information for each facial expression, namely the onset, apex, and offset labels, is included in these datasets.

Performance metrics
The intersection-over-union (IoU) criterion was used as the benchmark for MaE and ME spotting, and the F1-score was employed for performance evaluation and comparison between the proposed method and the latest research methods. A prediction is considered correct if the IoU between the predicted onset-to-offset interval and the ground-truth interval is greater than 0.5.
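This criterion can be expressed compactly as below; the greedy one-to-one matching between predictions and ground-truth intervals is a simplifying assumption about the evaluation protocol.

```python
def interval_iou(pred, true):
    """IoU of two (onset, offset) intervals given as inclusive frame indices."""
    inter = max(0, min(pred[1], true[1]) - max(pred[0], true[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (true[1] - true[0] + 1) - inter
    return inter / union

def f1_score(preds, truths, thr=0.5):
    """Count a prediction as a true positive when its IoU with an unmatched ground truth exceeds thr."""
    matched, tp = set(), 0
    for p in preds:
        for j, t in enumerate(truths):
            if j not in matched and interval_iou(p, t) > thr:
                matched.add(j)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(truths) if truths else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```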

Training
Because the frame rates of the SAMM-LV and CAS(ME)² datasets differ and there is an order-of-magnitude difference between the numbers of non-expression and expression frames, we used random frame skipping for training and verification. We under-sample the training samples with random weights to address the sample imbalance problem. We used LOSO cross-validation to evaluate the results, selected the model with the highest F1-score among the validation iterations, and saved it as the best model [3]. In the post-processing step, the weight coefficient p in formula (6) and the extension factor μ were selected through grid search. The values of the parameter k for the CAS(ME)² and SAMM-LV datasets were calculated from the annotated expression durations [23, 40, 41]; for each dataset, the smaller of the two values corresponds to MEs and the larger one to MaEs, the latter also being adopted for the MaE spotting comparison.
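The weighted random under-sampling of non-expression frames can be sketched with PyTorch's `WeightedRandomSampler`; the sampling weights and epoch size below are illustrative assumptions, not the exact values used in training.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, frame_labels, batch_size=64, neg_weight=0.25):
    """Under-sample label-0 (non-expression) frames so each epoch sees a more balanced mix."""
    labels = torch.as_tensor(frame_labels)
    n_pos = int((labels == 1).sum())
    weights = torch.where(labels == 1, torch.tensor(1.0), torch.tensor(neg_weight))
    sampler = WeightedRandomSampler(weights.double(), num_samples=4 * n_pos, replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```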

Comparison with state-of-the-art methods
Compared with recent methods on the facial expression interval spotting task, our method was superior in detecting MEs despite a slight weakness in MaE spotting, achieving the best F1-scores for ME spotting on the two public datasets (CAS(ME)²: 0.2474; SAMM-LV: 0.2555).
The main reason for the weaker MaE detection of our method is the omission, in post-processing, of apex frames falling in the boundary region [T_i + k(1 − θ), T_i + k], caused by the large variance in MaE duration (Table 1).

Ablation study
Table 2 shows the performance of the ablation analysis on all databases using a range of metrics. To undertake these analyses, the model is modified and evaluated by including or removing certain components of the architecture. In the LGNMNet part, the Non-Neck results are obtained by removing the neck module and linearly mapping the output of the backbone module to the input of the head module; the Non-MagFace results are obtained using the cross-entropy function alone as the loss function; and the Non-IML results are obtained using the labelling strategy based on formula (2). The Non-Post-process model simply uses the score set to detect the peak frame F_i and regards the interval [F_i − k, F_i + k] as the predicted expression interval. From Table 2, it can be observed that our method performs well on all databases when the IML, LGNMNet, and post-processing modules are all included. Table 3 compares the performance of deep learning models that use different networks as the backbone module on the ME spotting task. The best F1-score (0.2474) was obtained when MobileNet_v2 was used as the backbone network. In terms of time consumption, MnasNet0_5 is the cheapest (5.97 ms), MobileFaceNet has the smallest number of parameters (1.20 M), and ShuffleNet_v2_x0_5 has the lowest FLOPs (0.01 G). Taking all factors into consideration, MobileNet_v2 is chosen as the backbone module for its higher accuracy at an acceptable cost. In addition, Fig. 3 shows the results of ME spotting on the two datasets when the post-processing parameters p and μ are varied from 0 to 1. It is evident that the best results are always obtained when the corresponding threshold exceeds 0.5.

Detailed discussion
As can be seen from Tables 1 and 2, LGNMNet not only guarantees recall but also improves the precision of positive ME intervals. An intra-ME interval analysis has been performed to evaluate the ability of our method to detect hard frames within an ME interval, i.e., frames close to the onset/offset frames. In detail, the W/o-LGNMNet model employs MobileNet_v2 for regression prediction and Liong's post-processing approach [27] for interval construction. Frames in [T_{i,onset}, 0.5 · (T_{i,onset} + T_{i,apex})] and [0.5 · (T_{i,offset} + T_{i,apex}), T_{i,offset}] are labelled as hard frames, and frames in (0.5 · (T_{i,onset} + T_{i,apex}), 0.5 · (T_{i,offset} + T_{i,apex})) are labelled as easy frames. A predicted hard/easy frame interval is considered positive if its intersection ratio with the true ME interval is greater than 0.5, where the ratio follows the intersection over minimum strategy and can be calculated by formula (3). From Table 4, it can be observed that the LGNMNet method achieves higher recall and precision on hard frames for both the CAS(ME)² and SAMM-LV datasets, demonstrating its ability to detect ME boundary frames. In further examining our LGNMNet model in Table 5, we found that the precision of our method on the CAS(ME)² dataset is better than that on the SAMM-LV dataset, while the recall on the SAMM-LV dataset is better than that on the CAS(ME)² dataset. This finding may be attributed to differences in data sampling during model training. Since the SAMM-LV dataset has a higher frame rate and a wider range of ME interval lengths than the CAS(ME)² dataset, ME spotting on SAMM-LV is more sensitive to the key value k, which is set to half of the average length of expressions, resulting in lower ME spotting precision than on CAS(ME)².

Conclusion
In this paper, we propose the LGNMNet method to spot MEs in long videos based on optical flow features. A local adaptation strategy, IML, is applied to avoid missing potential apex frames when a fixed duration mismatches a true ME interval, given that the expression intervals in different datasets vary over a wide range. The traditional expression classification problem is transformed into a regression problem to improve the distinguishability of AUs by incorporating MagFace, which effectively optimizes the feature distribution of apex and non-apex frames in our proposed network [10, 44]. The pair-merge strategy on adjacent predicted peaks is specifically designed to reinforce the role of the true apex frame and constrain the deviation between true and predicted expression intervals. Experimental results demonstrate that our method produces competitive results for ME spotting in long video sequences. However, the fixed duration k selected in our method may not fit every dataset well, and a more reasonable adaptive selection method for k is needed. Our future work will further address the imbalance between positive and negative sample sizes and simplify the post-processing required to construct expression intervals automatically from detected apex frames.

Fig. 1 An overview of our proposed framework

Fig. 2 Main steps of post-processing

Fig. 3 F1-score of ME spotting on CAS(ME)² and SAMM-LV with different thresholds in the location suppression modules

Table 2 Effects of pre- and post-processing modules on LGNMNet. The performance of the LGNMNet method in different scenarios is shown in bold

Table 3 Performance comparison between various network backbones (for CAS(ME)² ME spotting)

Table 4 Performance of hard and easy intra-ME interval spotting with LGNMNet. The performance of the LGNMNet method in different scenarios is shown in bold