Facial expression recognition based on improved MobileNeXt

Facial expression recognition (FER) plays an important role in human-computer interaction, but it remains one of the difficult open problems in artificial intelligence. In recent years, as researchers have studied FER intensively, the number of network models applied to it has also grown. Relying on their strong feature extraction ability, convolutional neural networks (CNNs) have gradually become the dominant models in FER. However, the excessive number of parameters in CNNs limits their application scenarios, and lightweight techniques for CNNs degrade their ability to extract facial expression features. To address these problems, this paper proposes an improved MobileNeXt-based expression recognition network, which builds on the lightweight MobileNeXt network and improves its feature extraction capability. Firstly, the SandGlass block in the network enhances the transmission of feature information and reduces the loss of expression features during transmission. Secondly, the Ghost module replaces the 1×1 convolution kernels in the network to reduce the number of parameters in the feature extraction layers. Thirdly, a Drop-Activation layer replaces the ReLU layer in the SandGlass block to enhance the generalization ability and accuracy of the network. Finally, the Spatial Group-wise Enhance attention mechanism is introduced to strengthen the network's ability to refine expression features. Experimental results show that the network model improves expression recognition accuracy by 2.6%, 6.5% and 7.15% on the FER2013, RAF-DB and CK+ datasets, respectively, while the parameters and floating-point operations increase by only 0.85M and 2.93M compared with MobileNet V2.

Model), SIFT (Scale-Invariant Feature Transform). Appearance-based feature extraction extracts descriptors from specific parts of the face, generally implemented by operators of different shapes. This method either extracts from the entire face or extracts different feature information for different expression areas. Commonly used methods include Gabor filters, LBP (Local Binary Pattern), histograms of oriented gradients, and others. D. J. Kim [2] proposed a face recognition system based on the ASM: first, ASM was used to normalize the face image and an EHMM (Embedded Hidden Markov Model) was used to generate a probability weight factor; finally, the weight factor was calculated by studying the optimal feature divergence at high resolution. Stefano [3] used the SIFT method to study facial expression recognition on three-dimensional shapes, computing SIFT descriptors on the facial landmarks of a set of depth images, then selecting the most relevant features and performing SVM classification on them. Building on LBP, Guo et al. [4] proposed the ELBP algorithm, combined with an SVM classifier to recognize expressions, to address the problem of insufficient features caused by the high dimensionality of the features extracted by LBP. Chen [5] took advantage of the fact that HOG is sensitive to facial muscle deformation, applying HOG to encode facial components such as eyebrows, eyes and nose as features, and then using SVM to classify facial expressions. The above methods have achieved some success in expression recognition, but traditional feature extraction methods are less robust to realistic factors (e.g., lighting, angle and pose). With the continuous development of deep learning theory, expression recognition algorithms based on deep learning have gradually shown clear advantages in practice. Deep-learning-based expression recognition [6] is an end-to-end process: facial expression features are first extracted through the feature extraction layers of the network model, the neural network is then trained to learn discriminative expression features, and finally a classifier categorizes the input facial expressions according to the learned features. Yu [7] combined multiple CNN models by minimizing the log-likelihood loss and the hinge loss, which significantly improved the accuracy of expression recognition. Jiang [8] merged Gabor convolution and channel-shift modules into the ResNet network to improve expression recognition accuracy. However, most current mainstream convolutional neural networks use complex deep structures that require large computational resources for training and are difficult to deploy on embedded devices. To address the limited application scenarios of convolutional neural networks in expression recognition, researchers have applied lightweight convolutional neural networks to the task. Hewitt [9] proposed improved versions of AlexNet, VGGNet and MobileNet, and applied these three improved networks to facial expression recognition on mobile devices. A.
Lindt [10] proposed VGGFace (a variant of VGG16), a network that operates on facial images using continuous two-dimensional expression labels. Barros [11] proposed the lightweight FaceChannel neural network, which has 10 convolutional layers and 4 pooling layers; its last layer consists of shunting inhibitory neurons, whose function is to output the expression recognition result. Inspired by ResNet and MobileNet, Rodolfo [12] proposed a facial expression recognition network built from residual blocks and depthwise separable convolutions (ResMoNet). The network has clear advantages in Params, FLOPs, and main-memory utilization.
Lightweight network models have relatively simple structures, relatively small numbers of parameters and computations, and a wide range of application scenarios, but their limited depth leads to problems such as weak expression feature extraction ability and low recognition accuracy. In order to maintain the original lightweight characteristics while obtaining high accuracy in expression recognition, this paper improves the extraction of expression feature information based on the MobileNeXt network. The main contributions are as follows: (1) pioneering the application of the MobileNeXt network in expression recognition, whose internal SandGlass block reduces the loss of feature information during transmission within the network; (2) replacing the pointwise convolutions in the SandGlass block with the Ghost module to reduce the computation caused by feature map redundancy; (3) replacing the nonlinear activation function ReLU with Drop-Activation to enhance the generalization capability of the network; (4) adding the lightweight Spatial Group-wise Enhance attention mechanism so that the network focuses more on regions rich in facial expression features, improving the feature extraction ability of the model for input images. On the FER2013, RAF-DB and CK+ datasets, the improved network model improves expression recognition accuracy by 2.6%, 6.5% and 7.15%, respectively. Compared with MobileNet V2, the number of parameters and the amount of floating-point computation increase by only 0.85M and 2.93M.
This paper presents the network model in the Related Work section, analyzes its main modules in the Method section, and finally verifies the effectiveness of the proposed improvements through a variety of experiments in Section 4.

Fig. 1 Block diagram of the overall improved MobileNeXt network
The MobileNeXt network is a new lightweight model redesigned on the basis of MobileNet V2. It replaces the original inverted residual block in MobileNet with the SandGlass block, and the network is constructed from depthwise separable convolutions and SandGlass blocks. In order to improve the accuracy of expression recognition while keeping the network lightweight, this paper selects MobileNeXt as the backbone network, uses the Ghost module and a Drop-Activation layer to replace the pointwise convolutions and ReLU activation functions in the bottleneck, and embeds the SGE attention mechanism. The network structure is shown in Figure 1, where S is the stride. In this model, the input image first passes through a 3×3 standard convolutional layer, which filters and merges the expression features. The main framework of the network is then stacked from the modified SandGlass blocks. The feature information first passes through SandGlass-A, B and C blocks in turn, followed by groups of 2 or 3 SandGlass-B blocks with SandGlass-C blocks for effective extraction of facial expression features. Among them, the SandGlass-A block, a structure without a shortcut connection, performs the initial extraction of facial expression features. The SandGlass-B block further extracts expression features and weights the key information in face images through the SGE attention mechanism, while the Ghost module reduces the computation generated by the module as a whole. The SandGlass-C block is mainly responsible for adjusting the feature dimension. Finally, the features are extracted and classified by an AdaptiveAvgPool layer with a fully connected layer.

Method
SandGlass block
In the traditional MobileNet network, the inverted residual structure is adopted to build the bottleneck module, i.e., a depthwise separable convolution replaces the standard convolution in the residual module. The inverted residual module first takes a low-dimensional compressed tensor as input, expands it into a high-dimensional tensor through pointwise convolution, spatially encodes it through depthwise convolution, and finally compresses it back into a low-dimensional tensor by pointwise convolution. The output low-dimensional tensor serves as input to the next inverted residual module, and shortcut connections are established between the two linear low-dimensional bottlenecks. The depthwise separable convolution replaces the standard convolution of the residual module with depthwise convolution, which reduces the parameter count of the network model, so that a lightweight network can also use residual modules to deepen the network and improve performance. However, since the feature information is compressed in the residual module, information loss inevitably occurs when mapping between low-dimensional bottlenecks. In addition, the decrease in feature dimensionality also leads to gradient confusion, weakens the ability of gradients to propagate across layers, and affects the convergence of model training and model performance. To solve these problems, the SandGlass block (SG block) is used to replace the inverted residual module of the MobileNet network. As shown in Figure 2, different from the previous residual module and inverted residual module, the SandGlass block [13] (SG) keeps the bottleneck between the linear low-dimensional representations inside the block, adds depthwise convolutions at both ends for spatial encoding, and establishes shortcut connections between the linear high-dimensional representations rather than the low-dimensional ones. The SG block reorders the two pointwise convolutions and the depthwise convolutions to ensure that the two high-dimensional representations are connected at the location of the shortcut connection, so that more features of the bottom layers can be preserved when features are transferred to the top layers.
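To make the reordering concrete, the following pure-Python sketch traces the channel widths through one SandGlass block and counts its weights; the 96-channel input width and the reduction ratio t = 6 are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: channel widths and weight counts through one SandGlass block.
# Ordering: depthwise -> pointwise reduce -> pointwise expand -> depthwise,
# with the shortcut joining the two high-dimensional ends.

def depthwise_weights(c, k=3):
    # a k x k depthwise convolution has one k x k kernel per channel
    return c * k * k

def pointwise_weights(c_in, c_out):
    # a 1 x 1 convolution mixes channels: one weight per (in, out) pair
    return c_in * c_out

def sandglass_trace(c_in, t=6):
    lo = c_in // t  # the bottleneck sits in the middle of the block
    return [
        ("depthwise_3x3", c_in, depthwise_weights(c_in)),
        ("pointwise_reduce", lo, pointwise_weights(c_in, lo)),
        ("pointwise_expand", c_in, pointwise_weights(lo, c_in)),
        ("depthwise_3x3", c_in, depthwise_weights(c_in)),
    ]

if __name__ == "__main__":
    layers = sandglass_trace(96)
    for name, channels, weights in layers:
        print(f"{name:17s} -> {channels:3d} channels, {weights} weights")
    print("total weights:", sum(w for _, _, w in layers))
```

Note that the input and output widths are equal, so the identity shortcut can be added between the two high-dimensional ends without any projection.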
Based on the above design principles, the order of the two pointwise convolutions is first flipped. Without considering the depthwise convolutions and activation layers, the block can be written as

$$\hat{G} = \phi_e(\phi_r(F)), \qquad G = \hat{G} + F$$

where $\phi_e$ and $\phi_r$ denote the two pointwise convolutions used for channel expansion and reduction, respectively. In this way the bottleneck layer can be placed in the middle of the residual path to save parameters and computational cost. More importantly, it allows the shortcut connections to join the high-dimensional representation channels instead of the bottleneck channels. The shortcut connections constructed between the high-dimensional representations are "wider" than those in the inverted residual module, allowing more expression features to be transferred from $F$ to $G$. Since pointwise convolution only encodes channel information and cannot capture spatial information, depthwise convolutions are added at both ends of the block, yielding

$$G = \phi_{d_2}\big(\phi_e(\phi_r(\phi_{d_1}(F)))\big) + F$$

where $\phi_{d_1}$ and $\phi_{d_2}$ denote the two depthwise convolutions. Both depthwise convolutions are performed in a high-dimensional space, so richer expression features can be extracted.
The details of the SG block structure are shown in Table 1, where t denotes the dimensionality reduction factor and s denotes the stride. The bottleneck lies between the two pointwise convolutions in the SG block, with a depthwise convolution on either side of the pointwise convolutions. Since the SG module outputs a high-dimensional feature vector, several experiments showed that too many activation functions have a negative impact on expression recognition performance, so activation functions are only added after the first depthwise convolution and the last pointwise convolution.

Ghost module
When feature information is transmitted in a CNN, redundancy in the feature maps is an important characteristic for evaluating the performance of the CNN. When mainstream convolutional neural networks compute facial expression features, redundancy exists in the feature maps, i.e., similar feature maps appear in the output of a convolutional layer, as shown in Figure 3. Let the input be $X \in \mathbb{R}^{c \times h \times w}$ and the output be $Y \in \mathbb{R}^{h' \times w' \times n}$, where $h'$ and $w'$ are the height and width of the output data and $k \times k$ is the kernel size of the convolution kernel $f$. The regular convolution operation can be expressed as

$$Y = X * f + b \qquad (4)$$

As can be seen from equation (4), the computation of the whole convolution operation, $n \cdot h' \cdot w' \cdot c \cdot k \cdot k$, is determined by the dimensionality of the input and output feature maps; yet many of the output features are mutually similar, so a large number of parameters and floating-point operations are spent generating these redundant feature maps. The primary convolution in the Ghost module uses a customizable kernel. The Ghost module decomposes the convolutional layer into two parts: the first part uses an ordinary convolution to produce a few intrinsic feature maps,

$$Y' = X * f' \qquad (5)$$

where $Y' \in \mathbb{R}^{h' \times w' \times m}$ holds the $m$ intrinsic feature maps, $f'$ is the convolution kernel, and $m \le n$. The second part uses cheap linear operations to enhance the features and increase the number of channels:

$$y_{ij} = \Phi_{i,j}(y'_i), \quad i = 1, \ldots, m, \quad j = 1, \ldots, s \qquad (6)$$

where $y'_i$ is the $i$-th intrinsic feature map in $Y'$ and $\Phi_{i,j}$ is the $j$-th linear operation applied to it. The linear operation $\Phi$ runs on each channel, and its computational cost is much lower than that of ordinary convolution; the identity mapping is applied in parallel with the linear transformations in the Ghost module to preserve the intrinsic feature maps. The Ghost module feature mapping is shown in Figure 4.
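As a rough illustration of the cost analysis above, the sketch below compares the multiply-accumulate cost of an ordinary convolution with that of a Ghost module generating the same number of output maps; the layer sizes (64 input channels, 128 output maps, a 56×56 output, s = 2, and 3×3 cheap operations) are assumptions for illustration.

```python
# Hedged sketch: multiply-accumulate cost of an ordinary convolution versus a
# Ghost module producing the same n output maps. All layer sizes are assumptions.

def conv_cost(c_in, n_out, h_out, w_out, k=3):
    # ordinary k x k convolution: every output element mixes all input channels
    return n_out * h_out * w_out * c_in * k * k

def ghost_cost(c_in, n_out, h_out, w_out, k=3, s=2, d=3):
    # primary convolution produces n_out / s intrinsic maps; each intrinsic map
    # then spawns (s - 1) "ghost" maps via cheap d x d per-channel operations
    m = n_out // s
    primary = m * h_out * w_out * c_in * k * k
    cheap = (s - 1) * m * h_out * w_out * d * d
    return primary + cheap

if __name__ == "__main__":
    args = (64, 128, 56, 56)
    ratio = conv_cost(*args) / ghost_cost(*args)
    print(f"theoretical speedup: {ratio:.2f}x")  # close to s = 2
```

The ratio approaches s because the cheap d×d operations contribute far fewer multiply-accumulates than the channel-mixing primary convolution.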

Drop-Activation
Network models often overfit when learning and training on expression images. This phenomenon is often related to network normalization: sometimes normalization alone works effectively, but the combination of normalization and an activation function fails to improve the overall performance of the network. In order to give the model better generalization ability and accuracy, the nonlinear activation function ReLU is replaced with a Drop-Activation [15] layer.
Drop-Activation randomly removes the nonlinearity of the activation function in a way similar to Dropout, dropping the activation function in the training stage using equation (7):

$$f_{\text{train}}(x) = (I - P)\,x + P\,\mathrm{ReLU}(x), \qquad P = \mathrm{diag}(P_1, P_2, \ldots, P_d) \qquad (7)$$

where $P_1$ to $P_d$ are independent and identically distributed random variables following the Bernoulli distribution $B(p)$: each takes the value 1 with probability $p$ and 0 with probability $1-p$. During testing, a deterministic nonlinear function encodes the average effect of randomly dropped activations; taking the expectation of equation (7) with respect to the random variables $P$ gives

$$f_{\text{test}}(x) = \mathbb{E}_P\!\left[f_{\text{train}}(x)\right] = (1-p)\,x + p\,\mathrm{ReLU}(x) \qquad (8)$$

The ReLU activation function of the first depthwise convolution in the SandGlass-B block is replaced with the Drop-Activation layer: after the depthwise convolution encodes the features, they are processed by the normalization and Drop-Activation layers. The specific structure is shown in Figure 5.
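The training-time and test-time behaviour of Drop-Activation can be sketched in a few lines of plain Python; the retain probability p = 0.95 used in the example is an assumed value, not necessarily the one used in the paper's experiments.

```python
import random

# Hedged sketch of Drop-Activation on a vector of pre-activations.

def drop_activation_train(x, p, rng=random):
    # keep the ReLU nonlinearity at each unit with probability p,
    # otherwise let the value pass through unchanged (identity)
    return [max(xi, 0.0) if rng.random() < p else xi for xi in x]

def drop_activation_test(x, p):
    # deterministic test-time form: the expectation over the Bernoulli mask,
    # i.e. a leaky-ReLU-like function with negative slope (1 - p)
    return [p * max(xi, 0.0) + (1.0 - p) * xi for xi in x]

if __name__ == "__main__":
    x = [-2.0, -0.5, 0.0, 1.5]
    print(drop_activation_test(x, 0.95))
```

With p = 1 the training form reduces to plain ReLU, and with p = 0 to the identity; the test-time form interpolates between the two, which is what gives the regularizing effect.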

Fig. 5 Depthwise convolution followed by Batch Normalization and Drop-Activation layers

Spatial Group-wise Enhance attention mechanism
In the field of machine recognition, it has become popular to use attention mechanisms to improve the performance of existing mainstream networks. So far, several attention mechanisms have been applied in expression recognition, such as SENet (Squeeze-and-Excitation Networks), ECANet (Efficient Channel Attention), BAM (Bottleneck Attention Module) and CBAM (Convolutional Block Attention Module). SENet weights feature maps rich in feature information through automatic calibration of channel importance. Based on SENet, ECANet improves the channel attention mechanism through a local cross-channel interaction strategy without dimensionality reduction and an adaptive selection of the one-dimensional convolution kernel size. In addition to channel attention, BAM and CBAM introduce spatial attention mechanisms in a similar way. The advantage of the Spatial Group-wise Enhance attention mechanism (SGE) over the former mechanisms is that it enhances the spatial distribution of different semantic sub-features by improving their learning within each feature map group.
The SGE module [16] generates attention maps by combining the similarities between global and local features. In a feature map containing expression features, a complete expression feature is composed of multiple expression sub-features. The SGE module processes the sub-features of each group in parallel and uses the similarity between the global feature and the local statistical features of each group as an attention guide to enhance the features, obtaining a spatially well-distributed semantic feature representation. SGE adds few parameters and little computation, and the module can highlight multiple active regions with higher-order semantics in expression recognition. These regions are not limited to the facial organs: folded areas such as a furrowed brow, formed when a person feels frustrated, are also noticed by the SGE module.
Suppose the network is capturing specific semantic information in the feature map (e.g., the eyes). Ideally, the eye region in the group space has features with larger vector lengths pointing in similar directions, while other locations are barely activated and close to zero vectors. To reduce the influence of noise and similar patterns, the module uses the global information of the whole group to further enhance the learning of semantic features in key areas. The global feature is first approximated by the spatial average pooling function $\mathcal{F}_{gp}$:

$$g = \mathcal{F}_{gp}(X) = \frac{1}{m}\sum_{i=1}^{m} x_i$$

This global feature is then used to obtain an importance coefficient for each position through a simple dot product, which measures the similarity between the global and local features:

$$c_i = g \cdot x_i$$

To prevent bias in the coefficients across different samples, the module normalizes the coefficients over space:

$$\hat{c}_i = \frac{c_i - \mu_c}{\sigma_c + \epsilon}, \qquad \mu_c = \frac{1}{m}\sum_{i=1}^{m} c_i, \qquad \sigma_c^2 = \frac{1}{m}\sum_{i=1}^{m} (c_i - \mu_c)^2$$

where $\epsilon$ (e.g., 1e-5) is a constant added for numerical stability. Two learnable parameters, $\gamma$ and $\beta$, are introduced at normalization to scale and shift each coefficient $\hat{c}_i$, ensuring that the normalization inserted in the network can represent the identity transform; the enhanced feature at each position is then obtained by gating with a sigmoid function:

$$a_i = \gamma\,\hat{c}_i + \beta, \qquad \hat{x}_i = x_i \cdot \sigma(a_i)$$

The SGE block is introduced into the SandGlass-B block together with the Ghost module, and its structure is shown in Figure 7. The SGE module is placed after the first depthwise convolution, so that the key information points are weighted after the depthwise convolution encodes the feature information, making the feature maps contain more feature information when passing through the bottleneck.
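A minimal pure-Python sketch of the SGE computation on a single group is given below; the feature vectors are toy values chosen so that one spatial position clearly resembles the global feature, and the group layout is simplified (a list of per-position vectors rather than a 4-D tensor).

```python
import math

# Hedged sketch of SGE on one group: xs holds one feature vector per spatial
# position; positions similar to the group's global feature are emphasized.

def sge_group(xs, gamma=1.0, beta=0.0, eps=1e-5):
    n, dim = len(xs), len(xs[0])
    # global group feature via spatial average pooling
    g = [sum(x[d] for x in xs) / n for d in range(dim)]
    # importance coefficient of each position: dot product with the global feature
    c = [sum(gd * xd for gd, xd in zip(g, x)) for x in xs]
    # normalize the coefficients over space
    mu = sum(c) / n
    var = sum((ci - mu) ** 2 for ci in c) / n
    c_hat = [(ci - mu) / math.sqrt(var + eps) for ci in c]
    # scale/shift with learnable gamma, beta, then sigmoid-gate each position
    gate = [1.0 / (1.0 + math.exp(-(gamma * ch + beta))) for ch in c_hat]
    return [[gi * xd for xd in x] for gi, x in zip(gate, xs)]

if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]  # last position matches the mean best
    for pos in sge_group(xs):
        print([round(v, 3) for v in pos])
```

The position whose features align with the group average receives a gate above 0.5 and is preserved, while dissimilar positions are suppressed, which is the spatial enhancement effect described above.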

EXPERIMENTS
Experimental Setup
The experimental software and hardware configurations are shown in the table below. In the experiments, all compared networks run on the same platform.

FER2013
The FER2013 [17] expression dataset is the official dataset of the ICML 2013 facial expression recognition challenge. It consists of 35,887 face expression images of size 48×48, all grayscale. The images in FER2013 all come from real life, covering facial expressions across different ages (0 to 70 years old), races, genders, nationalities and skin tones. When using FER2013, 28,709 images were selected as the training set, 3,589 images were used as the validation set to tune the model weights, and testing was finally performed on a private set consisting of 3,589 images.

RAF-DB
The Real-world Affective Faces Database (RAF-DB) [18] is a large-scale facial expression database. It contains 30,000 images of various faces downloaded from the Internet. The dataset has great diversity of subjects in terms of age, gender, race, head pose, lighting conditions and occlusion. It provides 12,271 training images and 3,068 validation images.

CK+
CK+ [19] is a 2010 extension of the Cohn-Kanade dataset and a more general facial expression dataset, containing 123 participants and 593 image sequences. Both datasets contain emotion labels describing the participants' expressions.

Evaluation Indicators
The evaluation metrics used in this paper are AC (accuracy rate), Params (parameters) and FLOPs (floating-point operations). Params is the total number of weights and biases in the network model; FLOPs is the amount of computation in the network model, used to measure its complexity. Params and FLOPs are calculated as shown in equations (14) and (15):

$$\mathrm{Params} = (C_i \times k \times k + 1) \times C_o \qquad (14)$$

$$\mathrm{FLOPs} = 2 \times H \times W \times (C_i \times k \times k + 1) \times C_o \qquad (15)$$

and the accuracy is $\mathrm{AC} = acc / all$, where $acc$ is the number of accurately predicted images in the test set, $all$ is the total number of images in the test set, $C_o$ is the number of output channels, $C_i$ is the number of input channels, $k$ is the convolution kernel size, and $H$ and $W$ denote the height and width of the feature map, respectively.
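Equations (14) and (15) can be checked on a single convolutional layer with a short script; the example layer (3 input channels, 32 output channels, 3×3 kernel, 48×48 output map) is an arbitrary assumption for illustration.

```python
# Hedged sketch: parameter and FLOP counts for a single convolutional layer,
# following equations (14) and (15). The layer sizes are arbitrary assumptions.

def conv_params(c_in, c_out, k):
    # (14): weights per filter (C_i * k * k) plus one bias, times C_o filters
    return (c_in * k * k + 1) * c_out

def conv_flops(c_in, c_out, k, h, w):
    # (15): each of the H * W * C_o output elements costs C_i * k * k
    # multiply-adds (hence the factor 2) plus the bias add
    return 2 * h * w * (c_in * k * k + 1) * c_out

if __name__ == "__main__":
    print("Params:", conv_params(3, 32, 3))          # e.g. a 3x3 conv on an RGB image
    print("FLOPs :", conv_flops(3, 32, 3, 48, 48))   # with a 48x48 output feature map
```

A whole-network count is obtained by summing these per-layer values, which is what FLOP-counting utilities do internally.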

Experimental results and analysis
In order to test the performance of the model in expression recognition, training and testing were conducted on the FER2013, RAF-DB and CK+ datasets, respectively. During training, the weights are randomly initialized, and the NNI toolkit is used to count the parameters and computations.
Training settings are shown in Table 3. To analyze the impact of the proposed improvements on accuracy and network complexity, the experiments in this section set up different improvement groups for comparison and analysis based on the MobileNet V2 expression recognition model. The results of the comparative experiments for each module are shown in Table 4. In this paper, the depthwise convolution positions of the bottleneck layers in the MobileNet V2 expression recognition model are rearranged so that feature information is transmitted in higher dimensions, effectively reducing its loss during transmission through the network. The experimental results show that the SandGlass block improves accuracy by 1.9%, 3.31% and 2.1% on the three datasets, respectively. As the module changes the bottleneck dimension, feature information is transmitted in higher dimensions and depthwise convolutions are added on both sides of the bottleneck, resulting in increases of 1.65M and 7.13M in the number of parameters and floating-point operations, respectively. The linear operations in the Ghost module effectively reduce redundant information when extracting features from the input images; introducing this module improved accuracy by 0.96%, 2.5% and 0.98% on the three datasets while reducing Params and FLOPs by 37.27% and 13.4%, respectively. When Drop-Activation is introduced into the depthwise convolution, the accuracy and generalization ability of the network are enhanced, and accuracy on the three datasets improves by 0.7%, 1% and 0.69%. The SGE attention mechanism lets the network model focus on the higher-order semantic activity regions rich in expression feature information when extracting features, improving accuracy while only slightly increasing the overall number of parameters and floating-point operations. When the SandGlass block and Ghost module are introduced together, accuracy increases by 2.3%, 5.3% and 5.14% over the base network, and Params and FLOPs increase by only 0.85M and 2.82M. When all four improvements are applied to the base network simultaneously, accuracy increases by 2.6%, 6.5% and 7.15% on the three datasets, while Params and FLOPs increase by only 0.85M and 2.93M. Thus, each of the four modules can improve the accuracy of expression recognition to varying degrees; moreover, the introduction of the Ghost module significantly reduces the overall number of parameters and floating-point operations, preserving the lightweight character of the network.
In the selection of the attention mechanism, this paper compares a variety of attention mechanisms. SE is a channel attention mechanism that generates a weight in the channel dimension and multiplies it with the feature map to assign new weights. When the network model is embedded with the SE attention mechanism, its accuracy increases by 0.3% and Params increase by 0.26M, a small improvement in network performance. The CBAM attention mechanism weights the feature information in the channel and spatial dimensions separately, which improves the accuracy of the network model by 0.5%, but the number of parameters increases by 1.3M. CA is a coordinate attention mechanism that also considers the importance of location information on top of the internal channel information; by embedding location information in channel attention, the network can capture information over a larger area. Embedding the CA attention mechanism improves network accuracy by 0.4%, with Params increasing by 0.36M. The SGE attention mechanism used in this paper generates attention maps by combining the similarities between global and local features, and obtains spatially well-distributed semantic feature representations; its introduction improves the accuracy of the model by 0.45%, while the number of parameters increases by only 0.11M.
The visual heatmaps of feature extraction for each attention mechanism are shown in Figure 9. Observing Figure 9, it can be seen that, because the background occupies a relatively large proportion of the face images, SE and CA cannot accurately focus the model's attention on the most informative areas, as shown in rows 2 and 3 of column 2. CBAM can only lock onto part of the effective area in individual expressions, as shown in rows 2 and 3 of column 3. The model in this paper better locks onto the key areas of the face, and the key areas generated by the attention mechanism better cover the action units relevant to the facial expression. When laughing, SGE focuses on the open-mouth area. During sadness and disgust, SGE focuses on areas of the face with folds (such as frowning brows); although sad and disgusted expressions share many facial similarities, the SGE attention mechanism still discriminates well and achieves high recognition accuracy. For calm expressions, with no obvious areas of change, SGE attends to the entire face area, while the other models focus on uninformative hair or forehead areas. When startled, SGE focuses on the wide-open eyes and the open mouth. For images with curly hair, as shown in column 5, the rich texture of the large curly-hair region attracts every attention mechanism to varying degrees, and the SGE model also mistakes the hair area for a key area. It can be seen that attention mechanisms analyze all areas of the input image and attend according to feature richness, sometimes paying too much attention to background and hair areas, whereas the focus of expression recognition should be on the facial features. To improve the effectiveness of the feature maps, follow-up work could perform semantic-level segmentation of the facial expression image to be recognized, so as to effectively improve the accuracy of the network. From the above analysis, the SGE attention module adopted in this paper improves the feature extraction ability of the model by attending to high-order expression semantic regions, and is less affected by irrelevant areas.

Comparison of mainstream convolutional neural networks
To further evaluate the performance of the model in this paper, three commonly used datasets are selected, the integrated and improved model based on the MobileNeXt network is applied to expression recognition, and the experiments are compared with several typical convolutional neural networks. To ensure fair results, all network models are retrained on the same platform without loading pre-trained weights. The experimental results are shown in Tables 6 and 7. Their analysis shows that, compared with traditional deep convolutional neural networks, the model in this paper maintains a clear lightweight advantage in Params and FLOPs despite some differences in accuracy. On the three datasets, the model in this paper outperformed ResNeXt by 0.2%, 3.69% and 1.01%, respectively. The accuracy improvement is largest on the RAF-DB dataset because the images in the FER2013 dataset are small and the network model has difficulty extracting effective expression information from some of them; the traditional deep convolutional neural network ResNeXt, relying on its greater depth, is more robust in extracting expression features from such images. In the RAF-DB dataset, however, the face images are larger, and the model in this paper can effectively capture the expression features in the images, so its performance improves further. It can be seen that enlarging the input image can significantly improve the recognition of facial expressions. At the same time, under the dual effect of its lightweight architecture and the Ghost module, the model in this paper has advantages in model size and computation: its Params are only 13.35% of ResNeXt's, and its FLOPs are far lower than those of ResNeXt50.
Compared with SqueezeNet, the model in this paper improves accuracy by 0.6%, 13.69% and 6.07% on the three datasets, respectively. In terms of model complexity, the Params of the model in this paper are 2.36M more than SqueezeNet's, but its FLOPs are only 19.14% of SqueezeNet's. SqueezeNet obtains a lower Params count by replacing a large number of 3×3 convolution kernels with 1×1 kernels to compress the overall dimensionality of the network, but the small kernels give the network an insufficient receptive field for expression feature extraction, and the reduced dimensionality weakens the feature extraction ability and increases the model's FLOPs. Therefore SqueezeNet is not truly lightweight.
Compared with ShuffleNet V2, the accuracy of the model in this paper is higher by 2.84%, 6.18% and 9.09% on the three datasets, while its Params and FLOPs are higher by 1.48M and 18.69M, respectively. ShuffleNet V2 achieves lower model complexity by pursuing less computation and continually reducing model depth, but it cannot effectively filter and merge information when extracting expression features, resulting in weaker classification ability for expression recognition.
Compared with GhostNet, MobileNet V2, and MobileNet V3, the accuracy of this model improves on all three datasets. In terms of model complexity, its Params are 0.83M and 1.12M less than those of GhostNet and MobileNet V3, respectively, and 0.86M more than those of MobileNet V2. Its FLOPs increase, including by 8.54M relative to GhostNet, because the dimensionality of feature information transmission in the model is raised to strengthen expression feature extraction and the SGE module is added.

[20] combined the discrete wavelet transform with HOG features to recognize expression features by transforming spatial-domain features into the frequency domain. Liu et al. [21] proposed an enhanced deep belief network that learns and selects effective facial appearance features in a unified recurrent architecture to obtain better expression recognition results. Zheng et al. [22] proposed an oriented attention pseudo-siamese network consisting of two parts, a maintenance branch and an attention branch, compensating for insufficient local information through the attention branch and thus improving expression recognition accuracy. Hua et al. [23] proposed a convolutional neural network with dense backward attention, achieving high-performance expression recognition by aggregating channel attention over multi-level features in the backbone network. Chen et al. [24] proposed a densely connected convolutional neural network with hierarchical spatial attention that adaptively localizes salient regions through a spatial attention mechanism. Ghosh et al. [25] used CapsuleNet as the basis for predicting facial expressions from various information such as face expression information and scene information. Fan et al. [26] proposed the FaceNet2ExpNet network, which divides network training into a pre-training phase and a refinement phase. Zeng et al.
[27] merged multiple datasets to improve the network's learning ability on large datasets through an end-to-end LTNet scheme. In terms of expression recognition accuracy, the network in this paper is 6.97% higher on the CK+ dataset than the W-HOG-based method in [20], showing that deep learning-based methods outperform traditional methods in expression recognition. It is 0.27% higher than the deep belief network proposed by Liu et al. [21], indicating that the convolutional neural network has better recognition ability than the deep belief network. Compared with other approaches using convolutional neural networks, the network in this paper achieves the highest value in all experimental results. Although [22], [23], and [24] use different attention mechanisms to improve their respective accuracies, they all focus only on local features and lack the combination of local and global features. The confusion matrices of the improved network and the base network on the FER2013 dataset are shown in Figure 10, with the improved model on the left and the base model on the right. The recognition rate of the improved model is higher than that of the base model for every expression except happiness, which is slightly lower.

Analysis through visualization
Based on the improved model, a PC version of the facial expression recognition system was designed, which can quickly identify one or more facial expressions in pictures or videos. Given an input picture, the system performs facial expression recognition, with the effect shown in Figure 11. The recognition and statistical results of this system can support specific application scenarios, such as customer preference analysis, classroom effect monitoring, and infant recipe analysis.
tensor. Thus the relationship between the input tensor and the output tensor in the SandGlass module can be written as follows.
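The equation itself does not survive in this extraction. A plausible reconstruction, assuming the standard MobileNeXt SandGlass ordering (depthwise, pointwise reduction, pointwise expansion, depthwise, with an identity shortcut between the high-dimensional ends), is:

```latex
\hat{F} = \phi_{d1}(F), \qquad
\tilde{F} = \phi_{p1}(\hat{F}), \qquad
\hat{G} = \phi_{p2}(\tilde{F}), \qquad
G = \phi_{d2}(\hat{G}) + F
```

where $\phi_{d1}$, $\phi_{d2}$ denote 3×3 depthwise convolutions, $\phi_{p1}$, $\phi_{p2}$ denote 1×1 pointwise convolutions, and the shortcut $+F$ connects the two high-dimensional representations.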

Fig. 2
Fig. 2 Relationship between the SandGlass module and the inverted residual module
Fig. 3 Feature map after Ghost module processing
MobileNet builds an efficient convolutional neural network with fewer FLOPs by introducing depthwise separable convolution, but the pointwise convolution layer in depthwise separable convolution still accounts for a large share of memory and FLOPs. Traditional depthwise separable convolution first processes spatial information within each channel by depthwise convolution, and then fuses cross-channel features with a pointwise convolution. Suppose the input data are

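The cost reduction from depthwise separable convolution can be sketched numerically. The layer shape below is illustrative, not taken from the paper; the count is in multiply-accumulate operations (MACs) with bias and activation costs ignored.

```python
def standard_conv_macs(c_in, c_out, k, h, w):
    """Multiply-accumulate operations of a standard k x k convolution."""
    return k * k * c_in * c_out * h * w

def depthwise_separable_macs(c_in, c_out, k, h, w):
    """Depthwise k x k per channel, then a 1x1 pointwise convolution
    across channels (MobileNet-style factorization)."""
    depthwise = k * k * c_in * h * w      # spatial filtering, one filter per channel
    pointwise = c_in * c_out * h * w      # cross-channel fusion
    return depthwise + pointwise

# Illustrative layer shape (not taken from the paper)
std = standard_conv_macs(64, 128, 3, 56, 56)
sep = depthwise_separable_macs(64, 128, 3, 56, 56)
print(sep / std)  # about 1/k^2 + 1/c_out, i.e. roughly 0.119 here
```

The ratio also shows why the pointwise term dominates the factorized cost: the depthwise part contributes only 1/k², so most of the remaining FLOPs sit in the 1×1 layer, which motivates replacing it with the Ghost module.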
Fig. 6
Fig. 6 Illustration of the lightweight SGE module
As shown in Figure 6, a C × H × W feature map is first divided into G groups along the channel dimension in the SGE module. Without loss of generality, consider one of the groups: each spatial position in the group holds a feature vector expressing different semantic information, so the group can be written as X = {x_1, ..., x_m}, with m = H × W. The initial significance coefficients generated from these vectors are normalized and then scaled by the sigmoid function σ(·) to obtain the enhanced feature vectors.
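The computation above can be sketched with NumPy for a single group. This is a simplified sketch of the Spatial Group-wise Enhance idea; the shapes and the use of plain per-group standardization are assumptions, not the authors' exact implementation (which also has learnable scale and shift parameters).

```python
import numpy as np

def sge_single_group(x):
    """Enhance one feature group x of shape (c, h, w).

    Each spatial position holds a c-dimensional semantic vector; positions
    whose vectors align with the group's global descriptor are amplified.
    """
    c, h, w = x.shape
    feats = x.reshape(c, h * w)            # local vectors x_1 ... x_m, m = h*w
    g = feats.mean(axis=1, keepdims=True)  # global descriptor via average pooling
    coef = (g * feats).sum(axis=0)         # dot products -> initial coefficients
    coef = (coef - coef.mean()) / (coef.std() + 1e-5)  # normalize over positions
    attn = 1.0 / (1.0 + np.exp(-coef))     # sigmoid gating sigma(.)
    return (feats * attn).reshape(c, h, w) # rescale each position's vector

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
y = sge_single_group(x)
print(y.shape)  # (8, 4, 4)
```

Because the descriptor and the gating are computed entirely within the group, the module adds almost no parameters, which matches the paper's claim that SGE preserves the network's lightweight character.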

Fig. 7
Fig. 7 Illustration of the lightweight SandGlass-B block

Fig. 9
Fig. 9 Attention heat maps of different attention mechanisms

Fig. 10
Fig. 10 Confusion matrix diagrams of the improved network and the base network

Fig. 11
Fig. 11 Facial expression recognition effect of the MobileNeXt network

5 Conclusion
To address the problems that current expression recognition models based on convolutional neural networks have too many parameters and that lightweight neural networks have insufficient feature extraction ability, an improved network model based on MobileNeXt is proposed. According to the characteristics of facial expressions, the model relies on the SandGlass block to strengthen the propagation of facial expression features through the network, reduces the parameter and computation amounts through the linear operations in the Ghost module to maintain the original lightness of the network, and increases the generalization ability and accuracy of the network through the Drop-Activation layer. Finally, the SGE attention mechanism is introduced to strengthen the focus on areas rich in facial expression features. The experimental results show that, compared with the benchmark model and various other deep networks, the improved model maintains its lightweight advantage while effectively improving expression recognition accuracy.

References
[1] Revina I M, Emmanuel W R S. A survey on human face expression recognition techniques[J]. Journal of King Saud University-Computer and Information Sciences, 2021, 33(6): 619-628.
[2] Kim D J. Facial expression recognition using ASM-based post-processing technique[J]. Pattern Recognition and Image Analysis, 2016, 26: 576-581. https://doi.org/10.1134/S105466181603010X
[3] Berretti S, Bimbo A D, Pala P, Amor B B, Daoudi M. A set of selected SIFT features for 3D facial expression recognition[C]//2010 20th International Conference on Pattern Recognition, 2010: 4125-4128. doi: 10.1109/ICPR.2010.1002
[4] Guo M, Hou X, Ma Y, et al. Facial expression recognition using ELBP based on covariance matrix transform in KLT[J]. Multimedia Tools and Applications, 2017, 76: 2995-3010. https://doi.org/10.1007/s11042-016-3282-9
[5] Chen J, Chen Z, Chi Z, et al. Facial expression recognition based on facial components detection and HOG features[C]//International Workshops on Electrical and Computer Engineering Subfields, 2014: 884-888.
[6] Li S, Deng W. Deep facial expression recognition: a survey[J]. IEEE Transactions on Affective Computing, 2020.
[7] Yu Z, Zhang C. Image based static facial expression recognition with multiple deep network learning[C]//Proceedings of the 2015 ACM on International Conference on

Table 2
Software and hardware configuration of the experiment

Table 3
Settings in model training

Table 4
Comparison of ablation experimental results

Table 5
Effect comparison of attention mechanism

Table 8 Comparison of ablation experimental results on the CK+ and RAF-DB datasets (Acc)
To further compare the performance of the model in this paper, we compare it with the expression recognition results reported in other recent literature, as shown in Table 8. To represent model performance objectively, this paper uses neither pre-training nor transfer learning in training. Nigam et al.