Teaching is a typical form of emotional labor. As teachers carry out teaching tasks, behaviors such as intonation, facial expressions, and gestures affect how instructional information is transmitted and received [1]. The emotions conveyed by teaching behavior are therefore an important factor in classroom teaching quality. In the classroom, positive emotions in teaching behavior promote positive emotional states in learners, strengthening students' learning motivation, improving cognitive learning outcomes, and helping teachers complete high-quality teaching tasks. With the rapid iteration of artificial intelligence and information technology, emotion recognition models have been widely applied in service robots, education quality evaluation, human-computer interaction, and other fields [2]. Using an emotion recognition model to accurately perceive the emotional states embedded in teaching behavior, and to intervene effectively, thus offers teachers and teaching administrators an effective way to understand and improve teaching, which is of great significance for teachers' professional development and growth.
Emotion recognition refers to inferring a person's emotional state by analyzing different kinds of human behavior data. In recent years, many studies have analyzed the emotions in teaching behavior using various emotion recognition models. For example, Uzuntiryaki et al. [3] used a positive and negative affect scale, an emotion regulation scale, and a teacher efficacy scale to measure the emotional states of 336 science teachers and built a corresponding structural emotion recognition model for analysis. Moreira et al. [4] collected questionnaire data from 350 middle school teachers and analyzed teachers' emotions with a two-step structural emotion recognition model. Park et al. [5] collected facial expressions of 14 pre-service teachers at public undergraduate universities in South Korea and used emotion recognition software to analyze the emotions these expressions conveyed. Kim et al. [6] proposed an intelligent classroom platform based on an emotion recognition algorithm to analyze teachers' emotions and quantify various parameters of the teaching process. Estrada et al. [7] combined an emotion recognition model with evolutionary methods to analyze the emotional polarity of a teacher corpus concerning students' learning states. Cabada et al. [8] used a deep-learning-based sentiment analysis algorithm to detect students' and teachers' classroom emotions. Balahadia et al. [9] developed a performance evaluation system based on opinion mining and sentiment analysis, which treats the emotions expressed in student evaluations as a criterion for assessing teacher performance. Gutierrez et al. [10] introduced a text mining method to analyze students' comments on teacher performance, using machine learning and a random forest algorithm for sentiment analysis. Oika et al. [11] applied a variety of sentiment analysis methods to automatically extract emotional information from teaching evaluations and integrated them into the evaluation module of a university teaching management system. Although these methods achieve reasonable emotion recognition performance, the text, audio, and image emotion data in teaching behavior are temporally misaligned, and these methods cannot effectively fuse and recognize the features across such data, leading to poor emotion classification results.
With the strong learning ability and broad applicability of deep learning, multimodal emotion recognition models have achieved good research results [12–14]. Applying a multimodal emotion recognition model to teaching behavior analysis makes it easier to distinguish different types of emotional data in teaching behavior. The two most important factors in multimodal emotion recognition are extracting modality-specific features and fusing multimodal data. Many models have been designed for text, audio, and image feature extraction [15–17]. For fusing the multimodal data of text, audio, and image, traditional multimodal learning mainly relies on feature-level and decision-level fusion, whereas most current fusion methods characterize each modality with dedicated formulations, such as multimodal hierarchical fusion, conversational memory networks, and word-level concatenation of different features. To capture the dynamic interactions among modalities in multimodal emotion recognition, Zadeh et al. [18] proposed the Tensor Fusion Network, which applies the Cartesian product three times to model the unimodal, bimodal, and trimodal features across text, image, and speech. Arevalo et al. [19] proposed the Gated Multimodal Unit (GMU) model, which uses multiplicative gates to determine how each modality affects unit activation. Pan et al. [20] proposed a multimodal attention network that selectively fuses the three modalities through a multimodal focusing mechanism. These methods achieve good emotion classification performance. To study how the information in one modality influences another, Yang et al. [21] proposed a cross-modal BERT model that uses speech information to dynamically raise the weights of relevant words and lower those of irrelevant words, capturing richer emotional information while reducing the influence of noise. Tsai et al. [22] proposed the Multimodal Transformer to fuse different modal features; building on the Transformer, it introduces a modality enhancement unit that reinforces a target modality with information from a source modality, enabling multimodal fusion of asynchronous sequences. Although these multimodal emotion recognition methods achieve reasonable results on different types of emotional data, they still struggle to discriminate emotional information specific to teaching.
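As a concrete illustration of the Cartesian-product fusion in [18], the minimal sketch below forms the three-way outer product of unimodal embeddings; the variable names and embedding sizes are illustrative assumptions, not the original implementation. Appending a constant 1 to each embedding lets the product retain the unimodal and bimodal terms alongside the trimodal interactions.

    import torch

    def tensor_fusion(z_t, z_a, z_v):
        # Append a constant 1 so the outer product keeps unimodal and
        # bimodal terms alongside the trimodal interactions.
        one = torch.ones(1)
        z_t = torch.cat([z_t, one])
        z_a = torch.cat([z_a, one])
        z_v = torch.cat([z_v, one])
        # Three-way Cartesian (outer) product, flattened for a classifier.
        return torch.einsum('i,j,k->ijk', z_t, z_a, z_v).flatten()

    # Illustrative usage with assumed embedding sizes:
    fused = tensor_fusion(torch.randn(32), torch.randn(16), torch.randn(16))
    # fused has (32+1) * (16+1) * (16+1) = 9537 dimensions

The rapid dimensional growth of this product is one reason later work, including the gating and attention methods above, turned to more parameter-efficient fusion.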
To better identify the different emotional information in multimodal teaching behavior data and help teachers improve their teaching, this paper proposes an emotion recognition model based on dynamic convolution and residual gating. The innovations of this paper are as follows:
(1) Cross-modal dynamic convolution is used to model local information and perform multimodal data fusion, so as to better represent each modality's low-level features, high-level local features, and context dependencies in teaching behavior (a minimal sketch follows this list).
(2) Residual gating automatically learns the contribution of each interaction feature to the final emotion classification, and the fused multimodal features are fed into a classifier to predict emotions in multimodal teaching behavior (see the residual-gate sketch after this list).
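For intuition on innovation (1), the sketch below shows a minimal dynamic convolution layer; the kernel size, dimensions, and names are assumptions for illustration, not the exact design of the proposed model. Unlike a standard convolution, the kernel weights are predicted from the input at every time step, so the local filter adapts to the content at each position.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicConv(nn.Module):
        """Minimal dynamic convolution sketch: a position-specific kernel
        is predicted from the input itself and applied over a local window."""
        def __init__(self, dim, kernel_size=3):
            super().__init__()
            self.kernel_size = kernel_size
            # Predict one kernel per position (shared across channels here
            # for simplicity; per-head kernels are common in practice).
            self.kernel_proj = nn.Linear(dim, kernel_size)

        def forward(self, x):                  # x: (batch, seq_len, dim)
            B, T, D = x.shape
            k = self.kernel_size
            # Position-specific kernels, normalized over the window.
            w = F.softmax(self.kernel_proj(x), dim=-1)   # (B, T, k)
            # Pad the time dimension and unfold sliding windows of length k.
            pad = (k - 1) // 2
            x_pad = F.pad(x, (0, 0, pad, k - 1 - pad))   # (B, T + k - 1, D)
            windows = x_pad.unfold(1, k, 1)              # (B, T, D, k)
            # Weighted sum of each window with its predicted kernel.
            return torch.einsum('btdk,btk->btd', windows, w)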
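Similarly, the residual gating of innovation (2) can be sketched as a learned sigmoid gate that scales each cross-modal interaction feature before adding it back to the target-modality representation; again, the names and dimensions are illustrative assumptions rather than the paper's exact formulation.

    import torch
    import torch.nn as nn

    class ResidualGate(nn.Module):
        """Minimal residual gating sketch: a sigmoid gate learns how much
        of the interaction feature contributes to the fused representation."""
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(2 * dim, dim)

        def forward(self, target, interaction):
            # g in (0, 1) decides how much of the interaction passes through.
            g = torch.sigmoid(self.gate(torch.cat([target, interaction], dim=-1)))
            # Residual connection preserves the target-modality information.
            return target + g * interaction

The residual path guarantees that the original modality representation is preserved even when the gate suppresses an uninformative interaction, which is what allows the contribution of each interaction to be learned automatically.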
Section 2 introduces the multimodal teaching behavior emotion recognition model based on dynamic convolution and residual gating; Section 3 demonstrates the model's classification performance on different emotions in teaching behavior through experiments; Section 4 summarizes the paper and outlines future work.