Emotion recognition is the process of using computer technology to analyze human emotional information from multiple channels, including facial expressions, voice signals, and physiological signals, in order to determine a person's emotional state, such as happiness, sadness, anger, or surprise. It is an important and valuable research direction in artificial intelligence, with a wide range of applications including smart homes, smart education, smart healthcare, and smart security. The objective is to improve the effectiveness and experience of human-computer interaction and to make AI systems more intelligent and approachable [1]. Research on emotion recognition is inextricably linked to the development of AI and big data. As algorithms approach the level of human perception, understanding human emotions will become a crucial step for AI on the path toward emotional capability of its own. Emotions are not only a uniquely human form of perceptual expression but also an integral part of the psyche, reflecting people's internal states and external behavior and influencing thinking and decision-making. Consequently, understanding and recognizing human emotions is of paramount importance to the development and evolution of AI.
Recognizing emotions through facial expressions is the most intuitive and widely used approach, and the one that most closely mirrors how humans perceive each other's emotions. Facial expressions are a complex and efficient component of human nonverbal communication, produced by the coordinated movement of multiple facial muscles, and they can convey a wide range of emotions and psychological states, including happiness, sadness, anger, surprise, fear, and disgust. Given the significance of facial expressions in communication, automated facial expression recognition (FER) has garnered increasing attention. FER has the potential to transform various fields, including education and healthcare: for instance, it can be used in educational settings to assess the efficacy and quality of teaching [2,3], or in medical settings to assist in analyzing a patient's psychological condition [4]. Advances in GPU technology have also facilitated downstream applications of FER, further contributing to its popularity.

Compared to other image recognition tasks, the main challenges of FER are inter-class similarity and intra-class variability. Inter-class similarity means that different expression categories may differ only subtly, making it difficult to capture and correctly classify these small differences. Intra-class variability, also known as subject variability, means that the images within a single expression category come from subjects with different facial structures, genders, ages, and ethnicities. This variability can degrade learning performance, as the model may struggle to generalize across subjects, reducing accuracy and reliability. For instance, the distinction between anger and disgust may be subtle, whereas the variation among individuals within the same expression category may be considerable. Moreover, existing research has identified additional challenges for FER on in-the-wild datasets, including recognizing negative expressions, performing FER under challenging conditions, and the reliance on large neural networks. The scarcity of negative-expression images on the Internet makes it difficult to collect a representative dataset that reflects real, complex scenes; the resulting imbalance between negative and positive expressions in in-the-wild FER datasets can yield lower recognition rates for negative expressions than for positive ones. In addition, recognition is difficult when the subject's face is viewed at certain angles or is partially occluded by other objects. Accurately recognizing such samples is therefore of great importance, particularly since the same challenging conditions are likely to arise in downstream applications. Finally, in the pursuit of classification performance, existing work has increasingly favored large neural networks. Given the computational resource constraints of downstream applications, however, FER methods must be able to serve these applications without requiring powerful hardware.
The field of emotion recognition currently faces a number of challenges. First, emotion is a complex and diverse psychological phenomenon influenced by many factors, such as individual differences, cultural background, and social environment; consequently, it is difficult to define and measure with a uniform, precise standard [5]. Second, emotion information obtained from different channels may be inconsistent or incomplete: facial expressions may be masked or camouflaged, speech signals may be distorted by noise, and physiological signals may be altered by confounding factors. In addition, existing emotion recognition datasets are often collected in laboratories under a single, controlled environment, whereas emotion recognition in real, complex environments poses different requirements and difficulties. These include ensuring efficiency and robustness under variations in noise, lighting, occlusion, pose, and expression intensity; achieving real-time, accurate recognition in dynamically changing environments; and achieving simple, fast, and accurate recognition in lightweight deployment scenarios. To address these issues, several cutting-edge methods have been proposed in recent years. DialogueRNN [7] employs a memory-network-based model to capture each speaker's emotional state and emotion transfer within a conversation, using global and local attention mechanisms to better exploit contextual information. GraphCRF [8] uses a graph neural network to model the emotional label of each utterance in a conversation and the relationships between neighboring utterances, employing conditional random fields for global optimization. BERT-ERC [9] adopts a pre-trained language model approach, pre-training on large-scale unlabeled text and fine-tuning on the target dataset to enhance the model's generalizability and expressive power. These cutting-edge methods for emotion recognition in complex environments have focused predominantly on textual information. In practical application scenarios, however, integrating visual information such as surveillance images is essential, making machine-vision research an inevitable avenue for progress in this field.
This paper addresses two obstacles in this task: the difficulty of real-world deployment caused by overly complex models, and the narrow range of application scenarios caused by one-sided training data. Building upon previous work, it integrates residual connections with a depthwise separable convolution structure to make the network more lightweight, and it selects a dataset that better reflects real-life scenes for training. Ultimately, it achieves emotion recognition in a variety of complex scenes.
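To make the architectural idea concrete, the following is a minimal sketch, not the paper's exact network, of a residual block built from depthwise separable convolutions, written in PyTorch. All layer sizes, channel counts, and the 48x48 grayscale input are illustrative assumptions. The lightweighting comes from replacing a standard k×k convolution, which has k²·C_in·C_out weights, with a k×k depthwise convolution plus a 1×1 pointwise convolution, which together have k²·C_in + C_in·C_out weights.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch makes each input channel convolve with its own filter
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # 1x1 pointwise convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return self.bn(self.pointwise(self.depthwise(x)))

class ResidualDSBlock(nn.Module):
    """Residual block whose main branch uses depthwise separable convolutions."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = DepthwiseSeparableConv(in_ch, out_ch, stride)
        self.conv2 = DepthwiseSeparableConv(out_ch, out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection so the shortcut matches the branch's output shape
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1,
                                        stride=stride, bias=False))

    def forward(self, x):
        return self.relu(self.conv2(self.relu(self.conv1(x))) + self.shortcut(x))

# Sanity check on a hypothetical 48x48 grayscale face crop
x = torch.randn(1, 1, 48, 48)
block = ResidualDSBlock(1, 32, stride=2)
print(block(x).shape)  # torch.Size([1, 32, 24, 24])
```

As a rough illustration of the savings under these assumptions: a standard 3×3 convolution from 32 to 64 channels has 3·3·32·64 = 18,432 weights, whereas the depthwise separable pair has 3·3·32 + 32·64 = 2,336 weights, a reduction of roughly 8×, while the residual shortcut preserves gradient flow through the deeper stack.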