Facial expression recognition (FER) is a challenging task in computer vision owing to real-world factors such as illumination, viewing angle, and skin color. As research on FER has deepened, convolutional neural networks (CNNs) have been widely adopted for their excellent local feature extraction, and in recent years the Vision Transformer (ViT) has become a popular approach thanks to its strong global feature modeling. However, CNNs pay insufficient attention to global features, ViT handles local features poorly, and both suffer from limited application scenarios because of their large parameter counts. To address these problems, this paper first adopts Mobile-Former as the backbone network, so that the model can combine local and global features when performing expression recognition. Second, the ACmix module is introduced to replace the original stem module, giving the network a sufficient receptive field when it initially extracts features from the input image. Finally, this paper proposes a more lightweight and efficient mobile sub-module to reduce the model's parameter count. Experimental results show that the accuracy of the network on the RAF-DB and CK+ datasets increases by 3.03% and 3%, respectively, while the number of parameters decreases by 1.05M.
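The abstract does not specify the internals of the proposed mobile sub-module, but the parameter savings it targets can be illustrated with a standard, well-known technique from MobileNet-style "mobile" blocks: replacing a full convolution with a depthwise separable one. The sketch below is only a hedged illustration of that general principle, not the paper's actual module; the kernel size and channel counts are arbitrary examples.

```python
# Hedged illustration (not the paper's exact module): why MobileNet-style
# depthwise separable convolutions -- a typical building block of a
# lightweight "mobile" sub-module -- cut parameters versus a standard conv.

def standard_conv_params(k, c_in, c_out):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k conv (one filter per input channel)
    followed by a 1 x 1 pointwise conv mixing channels."""
    return k * k * c_in + c_in * c_out

# Example channel counts chosen for illustration only.
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)        # 73728
sep = depthwise_separable_params(k, c_in, c_out)  # 8768
print(f"standard: {std}, separable: {sep}, reduction: {std / sep:.1f}x")
```

For this configuration the separable variant needs roughly an eighth of the weights, which is the kind of saving that makes the reported 1.05M parameter reduction plausible at the whole-network level.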