Micro-expression action unit recognition based on dynamic image and spatial pyramid

Most existing studies focus on the expression recognition of micro-expressions, while little research addresses how to recognize the action units of micro-expressions. This is because facial action units in micro-expressions have low intensity and are therefore hard to recognize. We propose a micro-expression action unit recognition algorithm based on dynamic image and spatial pyramid to address this problem. First, the video is passed through a dynamic image generation module, which produces a single dynamic image capturing the motion information contained in all frames. Then, given the subtle movements of micro-expressions, semantic features at different levels are obtained through a spatial pyramid. Since micro-expressions appear over a small range and are concentrated in local areas of the face, a regional feature network and an attention mechanism are applied to the image features of each layer. Finally, a model is trained separately for each action unit because of the weak correlation between action units. Experiments on the CASME and CAS(ME)² datasets verify that our proposed algorithm achieves better action unit recognition performance than other state-of-the-art methods.


Introduction
Studies have found [1] that when liars lie, they subconsciously try to suppress and disguise their true emotions. This suppression and disguise is based on their cognition and experience of various emotions. However, because a liar's cognition and experience of emotions are biased, the "acted" expressions are unnatural, superficial, and fragmentary, and the real emotions are revealed unconsciously through micro-expressions. Micro-expressions are short-lasting, difficult-to-detect, and uncontrollable facial movements that appear when a person tries to hide his or her true emotions, and they often reflect the individual's true emotions. They are characterized by short duration, low intensity, and localized appearance on the face, usually lasting about 1/25 to 1/5 s. In contrast, macro-expressions are controllable facial expressions that usually stay on the face for about 1/5 to 4 s, with obvious facial movement covering a large facial area. The comparison between the two kinds of expressions is shown in Fig. 1 [8]. Micro-expressions were first discovered by Haggard and Isaacs [2] in 1966 and called "micromomentary" expressions. In 1969, Ekman and Friesen [3] reported that they had discovered a special facial expression, which they named micro-expressions. Analyzing micro-expressions is valuable for many potential applications, such as medicine [4], law enforcement [5], political psychology [6], and national security [7].
Micro-expression recognition includes two important branches, namely expression recognition and action unit recognition. Most existing studies have focused on the expression recognition of micro-expressions [9-11], while little research has been done on how to recognize the action units of micro-expressions. Expression recognition can only divide expressions into general categories, such as the six basic human expressions: happy, angry, disgusted, fearful, sad, and surprised. Since human expressions are complex, recognizing a complete expression requires dividing it into facial action units (AUs). AUs are the basic movements of a single muscle or muscle group, and different AU combinations can describe most expressions. The Facial Action Coding System (FACS) [12] shows that successful facial action unit recognition can greatly facilitate the analysis of complex facial actions or expressions. Therefore, exploring AUs is very important for an in-depth interpretation of the facial behavior of micro-expressions. The commonly used AUs [13] are shown in Table 1.
Currently, there are many research methods on AU recognition for macro-expressions [14-17]. Compared with AU recognition research on macro-expressions, there are relatively few AU recognition studies on micro-expressions [18-20], because of the following problems:

• The intensity of micro-expression AUs is much lower and the duration of AU occurrence is much shorter, making recognition difficult;
• Compared with macro-expression AU datasets such as BP4D [21] (328 videos and a total of about 140,000 frames), micro-expression AU datasets such as CAS(ME)² [8] contain very few samples;
• Few AUs coexist in micro-expressions, i.e., the correlation between AUs is weak, so the multi-label learning framework [22] commonly used for macro-expressions is not suitable for micro-expression AU recognition;
• The number of AU samples in micro-expressions is unbalanced: some AUs have many samples, such as AU4 (brow lowerer), while some have only a few, such as AU10 (upper lip raiser).
Moreover, micro-expressions involve subtle and rapid facial muscle changes. It is therefore more difficult to recognize AUs in micro-expressions than in macro-expressions.
To address the above problems, we propose a micro-expression action unit recognition algorithm based on dynamic image and spatial pyramids.
Studies have found that the analysis of micro-expressions is mostly based on video. Therefore, it is crucial to understand and represent video content accurately. A dynamic image [23, 24] is a single RGB image, equivalent to a still image, that captures the dynamics and appearance of an entire video sequence or subsequence, resulting in a long-term, stable representation of motion. Dynamic images can be fed to CNN architectures commonly used for image tasks, such as VGG and ResNet, while the network can still infer long-term dynamics in the video and learn dynamic features. An example dynamic image is shown in Fig. 2.
Because the changes in micro-expressions are subtle, they are not easy to capture and localize. We believe this is similar to the feature localization problem considered in fine-grained image recognition. In the common approach, only the high-level features of the last network layer are used for final recognition; however, because of the limited receptive field, such features only gather local information over a single range, making it impossible to localize micro-expressions from local features at multiple scales. Furthermore, studies have shown that for convolutional neural networks, high-resolution low-level features help capture detailed information in local regions, while low-resolution high-level features contain global semantic information that is crucial for classification. Therefore, our proposed algorithm uses a spatial pyramid network to fuse multi-scale features from different layers to localize micro-expressions.
Unlike facial expression recognition, which analyzes the entire face, facial action units appear in sparse facial areas and need to be analyzed locally. The distribution of different AUs is shown in Fig. 3. Most expression recognition methods use standard convolutional layers to learn image features and assume that the weights of the convolutional kernels are shared across the image. However, the human face is a structured image, and such an assumption cannot capture local subtle appearance changes, so different local feature extraction methods should be used for different facial regions. To this end, we propose a regional feature module to extract local features.
To highlight important features, we propose an attention module to emphasize decisive features and suppress invalid features, and then utilize residuals to improve robustness to partial face occlusion or camera viewpoint changes.
Finally, the recognition of each micro-expression AU is a binary classification task.However, the unbalanced distribution of the number of AU samples makes it easier to recognize AUs with a large number of samples and harder to recognize AUs with a small number of samples.Therefore, we use the focal loss function to solve the problem of the unbalanced distribution of AU samples.
Our main contributions are as follows:

• To address the low intensity and recognition difficulty of action units in micro-expressions, we use dynamic images and spatial pyramids to capture micro-expression information and propose a micro-expression action unit recognition algorithm.
• Since micro-expressions occur in local areas of the face, we propose a regional feature module to extract regional features, reinforcing the effectiveness of micro-expression localization in our action unit recognition algorithm.
• We propose an attention module to emphasize important features, weaken the impact of useless features, and enhance the robustness of the micro-expression action unit recognition algorithm.

Related work
Currently, there are many research methods on AU recognition for macro-expressions [14, 15], mainly based on manual extraction of facial AU appearance features and geometric features. Appearance features represent local or global changes of the face. Commonly used appearance features are Haar features [25], Histogram of Oriented Gradients (HOG) features [26], LBP features [27], Gabor wavelet features [28], and Scale-Invariant Feature Transform (SIFT) features [29]. Rathee et al. [30] detected the intensity of facial action units through cosine similarity-based feature mapping; the resulting features can be fed to a support vector machine to classify action units of various intensities. The work in [31] proposed joint patch and multi-label learning for AU recognition using SIFT descriptors near landmark points. Geometric features represent the changing direction or distance of facial landmarks or skin. The method in [32] detects the intensity of facial action units by combining geometric and visual deformations of facial features: thin-plate splines are used to extract geometric deformations, and Gabor filters are used to extract appearance deformations. Wei et al. [33] proposed a method that obtains robust and accurate AU intensity regression by extracting multi-scale spatial features and corresponding temporal features of faces in video image sequences and learning the local relationships of spatio-temporal features. Tang et al. [34] proposed a Deep Feature Enhancement (DFE) framework, a novel end-to-end three-stage feature learning model that takes into account subject identity bias, dynamic facial variation, and head pose. It consists of three feature enhancement modules: coarse-grained local and holistic spatial feature learning (LHSF), spatio-temporal feature learning (STF), and head pose feature decomposition (FD). The work in [35] proposed a method combining geometric variation and local texture information. However, these hand-crafted features are still unable to represent facial variations well.
In recent years, deep learning methods have been widely studied for AU recognition of macro-expressions because of their strong nonlinear representation capabilities [36, 37]. Li et al. [38] proposed a local convolutional neural network (LCNN) to learn AUs on cropped regions centered on facial landmarks, but this network suffers from severe instability in facial landmark detection. Chu et al. [36] proposed a method based on Deep Region and Multi-label Learning (DRML), which utilizes region layers to obtain important face regions; it extracts facial structure information and achieves good AU recognition results under subtle motion changes. In addition, Li et al. [39] proposed a local feature learning method that embeds an attention map based on facial landmarks in cropped regions. These works strongly suggest that learned features can identify AUs well.
Compared with the AU recognition research on macro-expressions, there are relatively few AU recognition studies on micro-expressions. The work in [18] proposed a deep Spatio-Temporal Adaptive Pooling (STAP) network with focal loss for micro-expression AU recognition. STAP is an end-to-end trainable network capable of identifying subtle and rapidly changing micro-expression AUs in specific regions using effective temporal information. Afterward, Li et al. [19] proposed an end-to-end Spatial and Channel Attention (SCA) network for micro-expression AU recognition, which consists of spatial and channel modules for spatial relationship modeling and local region representation, respectively. The SCA network efficiently identifies subtle AUs by using second-order statistics.
Although there is not much literature on micro-expression action unit recognition, there is another related research field, micro-action recognition. Although the methods of the two fields are not fully interchangeable, they have similar objectives and may provide new ideas for our subsequent research. Mi et al. [40] fused features from the higher and lower layers of a convolutional neural network to improve the accuracy of micro-action recognition. Mi et al. [41] proposed a new two-branch network for micro-action recognition: one branch uses high-level CNN features for classification and the other uses mid-level CNN features. Mi and Wang [42] used a two-stream convolutional neural network (CNN) followed by a temporal pyramid to extract deep features. Yonetani et al. [43] used egocentric paired videos recorded by two interacting people to recognize micro-actions and reactions.

Dynamic image
The dynamic image is obtained by directly applying sorting pooling to the raw image pixels of the video; it is the parameter vector of the sorting function obtained by solving a ranking support vector machine.
Suppose there is a micro-expression image sequence, represented as $X = \{x_1, x_2, \ldots, x_{K-1}, x_K\}$, where $x_t \in \mathbb{R}^{3 \times 224 \times 224}$ is the $t$-th frame, $t = 1, \ldots, K$, and $K$ is the number of micro-expression frames. Define $\psi(x_t)$ as the feature representation of the micro-expression at the $t$-th frame; the original RGB image can be used directly as this representation, i.e., $\psi(x_t) = x_t$. After smoothing the original image sequence, a new sequence $V = \{v_1, v_2, \ldots, v_{K-1}, v_K\}$ is obtained, where the smoothing process is shown in Eq. (1):

$$v_t = \frac{1}{t} \sum_{\tau=1}^{t} \psi(x_\tau) \quad (1)$$
Since action changes are time-related and the frame order of a micro-expression sequence is known, two consecutive frames $v_t$ and $v_{t+1}$ satisfy $v_{t+1} \succ v_t$, i.e., frame $v_{t+1}$ comes after frame $v_t$. More generally, any two frames $v_i$ and $v_j$ with $i < j$ satisfy $v_j \succ v_i$.
Define a sorting function $f(x)$ that determines which frame comes earlier and which comes later according to its value. The property of $f(x)$ is shown in Eq. (2):

$$v_j \succ v_i \iff f(v_j) > f(v_i) \quad (2)$$
Theoretically, $f(x)$ can be any function. For simplicity, it is assumed to be a linear function, $f(x) = \langle w, x \rangle$. From the properties of a linear function, any two features $x_i$ and $x_j$ satisfy the relationship shown in Eq. (3):

$$x_j \succ x_i \iff \langle w, x_j \rangle > \langle w, x_i \rangle \quad (3)$$
The features $v_i$ and $v_j$ satisfy the same relationship, as shown in Eq. (4):

$$v_j \succ v_i \iff \langle w, v_j \rangle > \langle w, v_i \rangle \quad (4)$$

Define positive samples as $v_i - v_j$ and negative samples as $v_j - v_i$, where time $i$ is after time $j$ (i.e., $i > j$); the corresponding ground-truth labels are set as shown in Eq. (5):

$$y(v_i - v_j) = +1, \qquad y(v_j - v_i) = -1, \qquad i > j \quad (5)$$
The above sorting problem is thereby converted into a classification problem, which can be solved using a Support Vector Machine (SVM) by learning the convex optimization problem shown in Eq. (7):

$$w^* = \mathop{\arg\min}_{w} E(w) \quad (7)$$

where

$$E(w) = \frac{\lambda}{2} \|w\|^2 + \frac{2}{K(K-1)} \sum_{i > j} \max\{0,\ 1 - f(v_i) + f(v_j)\} \quad (8)$$

From this, the parameter $w$ can be learned and the sorting function $f(v) = \langle w, v \rangle$ obtained, with $\forall i, j:\ v_i \succ v_j \iff f(v_i) > f(v_j)$. Since the vector $w^*$ contains enough information to rank all the frames in the video, it aggregates the information of all frames and can be used as a video descriptor. The above process of solving for $w^*$ from the video frame sequence is called sorting pooling [23].
The above solution process is still complicated. To obtain the dynamic image more easily, approximate sorting pooling is used. Initialize $w = 0$; then, by the properties of gradient descent, $w^* = 0 - \eta \nabla E(w)|_{w=0} \propto -\nabla E(w)|_{w=0}$, where $\nabla E(w)|_{w=0}$ is shown in Eq. (9):

$$\nabla E(w)\big|_{w=0} \propto -\sum_{i > j} (v_i - v_j) \quad (9)$$
Eq. (10) then follows, since $v_t$ appears with a positive sign $t - 1$ times and a negative sign $K - t$ times in the sum:

$$w^* \propto \sum_{i > j} (v_i - v_j) = \sum_{t=1}^{K} \alpha_t v_t, \qquad \alpha_t = 2t - K - 1 \quad (10)$$

The computed parameter $w^*$ is the required dynamic image. The whole process is shown in Algorithm 2.
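To make the approximate sorting pooling concrete, the following is a minimal NumPy sketch under the formulation above; the frame array layout and the rescaling to an 8-bit image are our assumptions rather than details fixed by the paper.

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a video into a single dynamic image by approximate
    sorting pooling. frames: float array of shape (K, H, W, 3)."""
    K = frames.shape[0]
    # Smoothing step, Eq. (1): v_t is the running mean of the first t frames.
    v = np.cumsum(frames, axis=0) / np.arange(1, K + 1).reshape(-1, 1, 1, 1)
    # Closed-form coefficients from Eq. (10): alpha_t = 2t - K - 1.
    alpha = 2.0 * np.arange(1, K + 1) - K - 1
    w = np.tensordot(alpha, v, axes=1)  # weighted sum over the time axis
    # Rescale to a displayable 8-bit image (an assumption for visualization).
    w = (w - w.min()) / (w.max() - w.min() + 1e-8)
    return (255.0 * w).astype(np.uint8)
```

Because the coefficients $\alpha_t$ are fixed once $K$ is known, the whole video reduces to a single weighted sum of its smoothed frames, which can then be fed to any image CNN.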

Spatial pyramid
Studies have shown that for convolutional neural networks, high-resolution low-level features help capture detailed information in local regions, while low-resolution high-level features contain global semantic information that is critical for classification. Therefore, a spatial pyramid is used to fuse multi-scale features from different layers to locate the positions of micro-expressions. In detail, we use a ResNet50 network with four intermediate convolutional parts and take the output features of the last residual block of each convolutional part as one level of the spatial pyramid. Because the receptive field size differs across layers, the extent of local context in which important features can be observed also differs; considering these levels together helps locate the regions where AU features change. The spatial pyramid is shown in Fig. 5.
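As an illustration, a PyTorch sketch of collecting the four pyramid levels from torchvision's ResNet50 is shown below; wiring these levels into the regional feature and attention modules is omitted here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SpatialPyramidBackbone(nn.Module):
    """Returns the output of the last residual block of each of the four
    convolutional parts of ResNet50: for a 3x224x224 input these have shapes
    256x56x56, 512x28x28, 1024x14x14, and 2048x7x7."""

    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)  # load pretrained weights if desired
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.parts = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        x = self.stem(x)
        levels = []
        for part in self.parts:
            x = part(x)
            levels.append(x)  # one pyramid level per convolutional part
        return levels

# Example: a batch of two dynamic images
levels = SpatialPyramidBackbone()(torch.randn(2, 3, 224, 224))
print([tuple(f.shape) for f in levels])
```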

Regional feature network
Since human faces are structured images, capturing local subtle appearance changes requires different local feature extraction methods for different facial regions. To this end, we propose a Regional Feature Module (Reg) based on [36] to extract local features, as shown in Fig. 6. The regional feature module first divides its input into a 7 × 7 grid, where each cell represents a local region, and then extracts features for each local region. Unlike [36], which uses only one convolutional layer, we use two 1 × 1 convolutions and one 3 × 3 convolution for each local region in order to fully extract subtle features. Of the two 1 × 1 convolutions, the former is used for dimensionality reduction and the latter for dimensionality restoration, ensuring that the output and input sizes are the same. Batch normalization (BN) and ReLU activation functions are used throughout. Furthermore, if no useful AU information is learned in a local region, the original local region features are passed through directly via a residual connection. After passing through the regional feature module, each local region keeps the same size and position as its input; the generated feature map is placed at its original local position and combined with the output feature maps of the other local regions to form a new feature map. In this way, AUs are identified in sparse local facial regions.
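A minimal PyTorch sketch of such a regional feature module follows. The 7 × 7 grid, the 1 × 1 / 3 × 3 / 1 × 1 convolutions with BN and ReLU, and the residual connection come from the description above; the channel count and the bottleneck reduction factor are assumptions.

```python
import torch
import torch.nn as nn

class RegionalFeatureModule(nn.Module):
    """Applies an unshared bottleneck (1x1 -> 3x3 -> 1x1 conv) to each cell
    of a 7x7 grid, with a residual connection per region."""

    def __init__(self, channels, grid=7, reduction=4):
        super().__init__()
        self.grid = grid
        mid = channels // reduction
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
            )
            for _ in range(grid * grid)
        ])

    def forward(self, x):
        n, c, h, w = x.shape
        gh, gw = h // self.grid, w // self.grid
        out = x.clone()
        for r in range(self.grid):
            for s in range(self.grid):
                patch = x[:, :, r*gh:(r+1)*gh, s*gw:(s+1)*gw]
                branch = self.branches[r * self.grid + s]
                # Residual: if a region carries no useful AU information,
                # its original features pass through unchanged.
                out[:, :, r*gh:(r+1)*gh, s*gw:(s+1)*gw] = patch + branch(patch)
        return out
```

Keeping separate weights per region lets the module model region-specific appearance changes, at the cost of roughly 49 times the parameters of a single shared bottleneck.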

Attention mechanism
To highlight important features, we propose the Attention Module (Att), as shown in Fig. 7.
The input feature map of the module is $F \in \mathbb{R}^{C \times H \times W}$. To compute spatial attention, max pooling and average pooling are first applied along the channel axis to obtain, at each position $(i, j)$, the maximum and average feature values across channels, which represent the salient features of that position:

$$F_{\mathrm{Avg}}(i, j) = \frac{1}{C} \sum_{n=1}^{C} F_n(i, j) \quad (11)$$

$$F_{\mathrm{Max}}(i, j) = \max_{1 \le n \le C} F_n(i, j) \quad (12)$$

where $n$ is the channel index, referring to the $n$-th channel, and $C$ is the total number of channels. $F_n(i, j)$ is the feature value at position $(i, j)$ of the $n$-th channel of $F$; $F_{\mathrm{Avg}}(i, j)$ is the average over all channels of $F$ at position $(i, j)$, and $F_{\mathrm{Max}}(i, j)$ is the maximum over all channels of $F$ at position $(i, j)$.
These maps are then flattened into feature vectors of length $H \times W$, and the softmax function is applied to obtain the importance of each position over the entire face, yielding the feature maps $F^{s}_{\mathrm{Avg}}$ and $F^{s}_{\mathrm{Max}}$, as shown in Eqs. (13) and (14):

$$F^{s}_{\mathrm{Avg}} = \mathrm{softmax}\left(\mathrm{vec}\left(F_{\mathrm{Avg}}\right)\right) \quad (13)$$

$$F^{s}_{\mathrm{Max}} = \mathrm{softmax}\left(\mathrm{vec}\left(F_{\mathrm{Max}}\right)\right) \quad (14)$$

The generated feature maps $F^{s}_{\mathrm{Avg}}$ and $F^{s}_{\mathrm{Max}}$ are resized back to $H \times W$ to obtain $F'_{\mathrm{Avg}}$ and $F'_{\mathrm{Max}}$, which are then combined into a feature map $F' \in \mathbb{R}^{2 \times H \times W}$. After dimensionality reduction with a 1 × 1 convolution, the sigmoid function is used to limit all values to the range 0 to 1, producing the final spatial attention map $F_{\mathrm{Att}} \in \mathbb{R}^{1 \times H \times W}$.
Finally, the original feature map $F$ is multiplied by $F_{\mathrm{Att}}$, and the product is added back to $F$ to obtain the final output feature map $F_{sp}$, forming a residual block that avoids vanishing gradients during training. The residual operation is shown in Eq. (15):

$$F_{sp} = F \odot F_{\mathrm{Att}} + F \quad (15)$$
where $F_{sp} \in \mathbb{R}^{C \times H \times W}$ is the final output feature map, $F$ is the original input feature map, $F_{\mathrm{Att}}$ is the spatial attention map, and $\odot$ denotes element-wise multiplication (broadcast across channels).
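Putting Eqs. (11)-(15) together, a minimal PyTorch sketch of the attention module could look as follows; minor details, such as the exact resizing of the softmax maps, are our reading of the text.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Spatial attention with a residual connection, following Eqs. (11)-(15)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)  # reduce the 2 maps to 1

    def forward(self, F):
        n, c, h, w = F.shape
        f_avg = F.mean(dim=1, keepdim=True)          # Eq. (11), shape (N,1,H,W)
        f_max = F.max(dim=1, keepdim=True).values    # Eq. (12), shape (N,1,H,W)
        # Softmax over the flattened H*W positions, Eqs. (13)-(14).
        f_avg = torch.softmax(f_avg.flatten(2), dim=-1).view(n, 1, h, w)
        f_max = torch.softmax(f_max.flatten(2), dim=-1).view(n, 1, h, w)
        # 1x1 conv over the concatenated maps, then sigmoid: F_Att in (0, 1).
        f_att = torch.sigmoid(self.conv(torch.cat([f_avg, f_max], dim=1)))
        return F * f_att + F                         # residual, Eq. (15)
```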

Focal loss function
To address the unbalanced distribution of AU sample numbers, we use the focal loss function shown in Eq. (16):

$$L = -\frac{1}{M} \sum_{i=1}^{M} \left[ \alpha\, y_i \left(1 - \hat{y}_i\right)^{\gamma} \log \hat{y}_i + (1 - \alpha)\left(1 - y_i\right) \hat{y}_i^{\gamma} \log\left(1 - \hat{y}_i\right) \right] \quad (16)$$

Here, $M$ is the total number of samples; $y_i$ is the true label of the $i$-th sample, 1 if the AU appears and 0 otherwise; and $\hat{y}_i$ is the predicted label of the $i$-th sample, representing the probability of AU occurrence. $\gamma$ is usually set to 2 and $\alpha$ to 0.25. $\gamma$ controls the weight of hard samples: it reduces the loss contribution of easy samples so that network training pays more attention to hard samples. $\alpha$ is the class weight, used to balance positive and negative samples.
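A direct PyTorch translation of Eq. (16) is sketched below, assuming the network outputs AU probabilities (e.g., after a sigmoid).

```python
import torch

def focal_loss(y_hat, y, gamma=2.0, alpha=0.25, eps=1e-8):
    """Binary focal loss of Eq. (16). y_hat: predicted probabilities in (0, 1);
    y: ground-truth labels in {0, 1}. gamma down-weights easy samples and
    alpha re-balances positive versus negative samples."""
    pos = -alpha * (1.0 - y_hat) ** gamma * torch.log(y_hat + eps)
    neg = -(1.0 - alpha) * y_hat ** gamma * torch.log(1.0 - y_hat + eps)
    return (y * pos + (1.0 - y) * neg).mean()

# Example: loss = focal_loss(torch.sigmoid(logits), labels.float())
```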

Settings
Datasets We evaluate the proposed algorithm on two spontaneous micro-expression datasets, CASME [44] and CAS(ME)² [8]. Both datasets contain AU labels, but the number of samples per AU label differs considerably. In our experiments, we only explore 8 AUs related to deception detection [45], namely AU4, AU5, AU6, AU10, AU12, AU14, AU17, and AU45. The number of AU samples in each dataset is shown in Table 2.
CASME contains 195 micro-expression samples from 35 subjects, at a frame rate of 60 fps and an image resolution of 640 × 480. Since the average duration of its video samples is only two or three seconds, this dataset is only suitable for micro-expression recognition, not for micro-expression discovery. It contains AU4, AU6, AU10, AU12, AU14, and AU17, a total of 6 AUs related to deception.
CAS(ME)² contains 357 expression samples from 22 subjects, of which 57 are micro-expression samples and 300 are macro-expression samples, at a frame rate of 30 fps and an image resolution of 640 × 480. It is suitable for both micro-expression recognition and discovery tasks. The dataset contains AU4, AU5, AU6, AU10, AU12, AU14, AU17, and AU45, a total of 8 AUs related to deception.

Metrics: AU recognition is a binary classification problem. For binary classification tasks, especially with imbalanced samples, the F1-score better reflects the performance of the algorithm. It is calculated as shown in Eq. (17):

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2PR}{P + R} \quad (17)$$

where $P$ and $R$ denote precision and recall, respectively; TP (true positives) is the number of samples predicted positive that are actually positive; FP (false positives) is the number predicted positive that are actually negative; and FN (false negatives) is the number predicted negative that are actually positive.
The F1-score combines the precision and recall of the model and ranges from 0 to 1; the larger, the better. In our evaluation, the F1-scores of the 6 AUs in CASME and the 8 AUs in CAS(ME)² were calculated. The overall performance of the algorithm is evaluated by the average F1-score over all AUs.
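For reference, a minimal sketch of the per-AU F1 computation from confusion counts, following Eq. (17); the overall metric is a plain mean over AUs.

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts, Eq. (17)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mean_f1(per_au_counts):
    """Average F1 over AUs, given a list of (tp, fp, fn) triples."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in per_au_counts]
    return sum(scores) / len(scores)
```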

Implementation
In our experiments, we used the cropped face images provided by the datasets. The input is a sequence of aligned RGB micro-expression images. Since the average number of micro-expression frames is 10, each micro-expression image sequence is expanded to 10 frames using a temporal interpolation model. The preprocessed image sequence is then input into our proposed algorithm for AU recognition. Training is performed in a one-vs-rest manner, i.e., all samples containing the current AU are labeled as positive, while samples of other AUs are labeled as negative, and a binary classification model is trained for each AU. The ratio of training set to test set is 8:2. The optimizer is Adaptive Moment Estimation (Adam), the learning rate is set to 0.001, the number of training epochs is 100, and the batch size is 60.
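A sketch of the one-vs-rest labeling and optimizer setup described above; the sample format and the stand-in classification head are hypothetical placeholders, not the paper's actual data structures.

```python
import torch
import torch.nn as nn

def binary_labels(samples, target_au):
    """One-vs-rest labels: 1 if a sample is annotated with the target AU,
    else 0. `samples` is assumed to be a list of dicts with an "aus" set."""
    return torch.tensor([1.0 if target_au in s["aus"] else 0.0 for s in samples])

# One independent binary classifier per AU, since AU correlations are weak.
# Optimizer settings follow the paper: Adam, lr 0.001, 100 epochs, batch size 60.
head = nn.Linear(1024 * 7 * 7, 1)  # stand-in for the full network's classifier
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
```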

Results
In this section, we conduct comparative experiments on the CASME and CAS(ME)² datasets and discuss the results from four perspectives: ResNet network depth, ResNet network layer combination, image type, and method type, together with an ablation study.
ResNet network depth As can be seen from Tables 3 and 4, on the CASME and CAS(ME)² datasets, the F1-score of ResNet50 is higher than those of ResNet101 and ResNet152, by 0.022 and 0.11 on CASME and by 0.041 and 0.135 on CAS(ME)², respectively.

Image type To verify the reliability of the dynamic image, we compare it with the average pooling image, the max pooling image, and the micro-expression apex frame on the CASME and CAS(ME)² datasets. From Tables 7 and 8, the dynamic image performs best, achieving average F1-scores of 0.455 and 0.786 on the CASME and CAS(ME)² datasets, respectively, because it contains more of the video's action information. The max pooling image performs worst, with average F1-scores of 0.389 and 0.722 on the CASME and CAS(ME)² datasets, respectively, because max pooling loses much important information. The average pooling image and the apex frame also produce good results, because average pooling combines all feature information and the apex frame contains the features with the largest range of motion.
Ablation study To demonstrate the effectiveness of each module of the proposed method, we performed ablation experiments. In Tables 9 and 10, Base refers to the method using only the dynamic image generation module and the spatial feature module, Att refers to the attention module, and Reg refers to the regional feature module. Note that the spatial feature module depends on the dynamic image generation module, so the two cannot be separated. From the tables, Base+Att improves over Base by 0.038 and 0.03 on the CASME and CAS(ME)² datasets, respectively, suggesting that highlighting important features and suppressing invalid ones contributes to the recognition of micro-expression action units. In addition, Base+Reg obtains higher F1-scores on all action units than Base, indicating that the regional feature module can capture local subtle appearance changes. The highest results are obtained when both the attention module and the regional feature module are used, with F1-scores higher than Base by 0.079 and 0.057 on the CASME and CAS(ME)² datasets, respectively, indicating that the joint use of the two modules better describes local action unit regions and considers the relationships between facial regions, which helps improve the recognition of action units in sparse facial regions.
Method type From Tables 11 and 12, with LBP-TOP [46] as the baseline, the F1-score of our algorithm improves over the baseline by 0.258 and 0.443 on the CASME and CAS(ME)² datasets, respectively. The F1-scores of handcrafted features are generally lower than those of learned features, because handcrafted features may ignore detailed image information, while learned features can capture more discriminative cues for micro-expression AU recognition. Among the three handcrafted features LBP-TOP, LPQ-TOP [47], and LBP-SIP [48], LBP-TOP performs best, obtaining F1-scores of 0.197 and 0.343 on the CASME and CAS(ME)² datasets, respectively. The learned feature I3D [49] is a network that inflates 2D CNNs to 3D CNNs and can achieve significant quality gains with less computation. SCA [19] is an end-to-end spatial and channel attention network for micro-expression AU recognition, consisting of spatial and channel modules for spatial relationship modeling and local region representation, respectively.

Conclusion

In this paper, we proposed a micro-expression action unit recognition algorithm based on dynamic image and spatial pyramid. The motion information of the entire video sequence or sub-sequence is captured using the dynamic image representation. After that, the spatial pyramid network is used to extract subtle features at different layers, the regional feature network is used to capture local appearance changes of the face, and the attention mechanism is used to highlight important features. Finally, we conducted extensive experiments on the CASME and CAS(ME)² datasets to demonstrate the effectiveness of our proposed algorithm. In the future, because of the small sample size of current micro-expression datasets, we plan to build a new dataset to expand the available data.

Fig. 2 Dynamic image. In the angry expression video, the subject frowned; in the disgusted expression video, the subject's left eyebrow was raised; in the happy expression video, the subject's mouth and eyebrows moved; in the neutral expression video, the subject showed no facial movement. All these facial actions are reflected in the dynamic image generated from the corresponding video [23, 24]

Fig. 3 AU distribution map on the face

Fig. 4 Network framework of the proposed method, which consists of four sub-modules: the dynamic image generation module, the spatial feature module, the regional feature module, and the attention module. The outputs of the attention modules at different scales are all downsampled to 256 × 7 × 7 and then concatenated into a 1024 × 7 × 7 feature for classification

Fig. 5 Spatial pyramid module. The output of the last residual block in each of the four convolutional parts of ResNet50 is used as one level of the spatial pyramid

Fig. 6 Regional feature module. Different facial regions use different local feature extraction methods to capture subtle movements

Fig. 7 Attention module

Table 2 Number of AU samples in the CASME and CAS(ME)² datasets

Table 3 F1-score comparison of different ResNet network depths on the CASME dataset

Table 4 F1-score comparison of different ResNet network depths on the CAS(ME)² dataset

Table 5 F1-score comparison of different combinations of ResNet network layers on the CASME dataset

Table 6 F1-score comparison of different combinations of ResNet network layers on the CAS(ME)² dataset

Table 7 F1-score comparison of different types of input images on the CASME dataset

Table 8 F1-score comparison of different types of input images on the CAS(ME)² dataset

Table 9 F1-score comparison of ablation experiments on the CASME dataset

Table 10 F1-score comparison of ablation experiments on the CAS(ME)² dataset

Table 11 F1-score comparison of different methods on the CASME dataset. The best and second-best results are indicated in bold and brackets, respectively