Teaching is a typical form of emotional labor. As teachers carry out teaching tasks, behaviors such as intonation, facial expressions, and gestures affect how instructional information is transmitted and received [1]. The emotions conveyed by teaching behavior are therefore an important factor in classroom teaching quality. In the classroom, positive emotions in teaching behavior promote positive emotional states in learners, strengthening students' learning motivation, improving cognitive learning outcomes, and helping teachers complete high-quality teaching tasks. With the rapid iteration of artificial intelligence and information technology, emotion recognition models have been widely applied in service robots, education quality evaluation, human-computer interaction, and other fields [2]. Using an emotion recognition model to accurately perceive the emotional states embedded in teaching behavior, and to intervene effectively, thus offers teachers and teaching administrators an effective way to understand and improve teaching, which is of great significance for teachers' professional development and growth.
Emotion recognition refers to inferring a person's emotional state by analyzing different kinds of human behavior data. In recent years, many studies have analyzed the emotions in teaching behavior using various emotion recognition models. For example, Uzuntiryaki et al. [3] used a positive and negative affect scale, an emotion regulation scale, and a teacher efficacy scale to measure the emotional states of 336 science teachers and built a corresponding structural emotion recognition model for analysis. Moreira et al. [4] collected questionnaire data from 350 middle school teachers and analyzed teachers' emotions with a two-step structural emotion recognition model. Park et al. [5] collected facial expressions of 14 pre-service teachers at public undergraduate universities in South Korea and used emotion recognition software to analyze the emotions these expressions conveyed. Kim et al. [6] proposed an intelligent classroom platform based on an emotion recognition algorithm to analyze teachers' emotions and quantify various parameters of the teaching process. Estrada et al. [7] combined an emotion recognition model with evolutionary methods to analyze the emotional polarity of a teacher corpus concerning students' learning states. Cabada et al. [8] used a deep-learning-based sentiment analysis algorithm to detect students' and teachers' classroom emotions. Balahadia et al. [9] developed a performance evaluation system based on opinion mining and sentiment analysis, which treats the emotions expressed in student evaluations as a criterion for assessing teacher performance. Gutierrez et al. [10] introduced a text mining method to analyze students' comments on teacher performance, using machine learning and a random forest algorithm for sentiment analysis. Oika et al. [11] applied a variety of sentiment analysis methods to automatically extract emotional information from teaching evaluations and integrated them into the evaluation module of a university teaching management system. Although these methods achieve reasonable emotion recognition performance, the text, audio, and image emotion data in teaching behavior are temporally misaligned, and these methods cannot effectively fuse and recognize the features across such data, leading to poor emotion classification results.
With the strong learning ability and broad applicability of deep learning, multimodal emotion recognition models have achieved good research results [12–14]. Applying a multimodal emotion recognition model to teaching behavior analysis makes it easier to distinguish different types of emotional data in teaching behavior. The two most important factors in multimodal emotion recognition are extracting modality-specific features and fusing multimodal data. Many models have been designed for text, audio, and image feature extraction [15–17]. For fusing the multimodal data of text, audio, and image, traditional multimodal learning mainly relies on feature-level and decision-level fusion, whereas most current fusion methods characterize each modality with dedicated formulations, such as multimodal hierarchical fusion, conversational memory networks, and word-level concatenation of different features. To capture the dynamic interactions among modalities in multimodal emotion recognition, Zadeh et al. [18] proposed the Tensor Fusion Network, which applies the Cartesian product three times to model the unimodal, bimodal, and trimodal features across text, image, and speech. Arevalo et al. [19] proposed the Gated Multimodal Unit (GMU) model, which uses multiplicative gates to determine how each modality affects unit activation. Pan et al. [20] proposed a multimodal attention network that selectively fuses the three modalities through a multimodal focusing mechanism. These methods achieve good emotion classification performance. To study how the information in one modality influences another, Yang et al. [21] proposed a cross-modal BERT model that uses speech information to dynamically raise the weights of relevant words and lower those of irrelevant words, capturing richer emotional information while reducing the influence of noise. Tsai et al. [22] proposed the Multimodal Transformer to fuse different modal features; building on the Transformer, it introduces a modality enhancement unit that reinforces a target modality with information from a source modality, enabling multimodal fusion of asynchronous sequences. Although these multimodal emotion recognition methods achieve reasonable results on different types of emotional data, they still struggle to discriminate emotional information specific to teaching.
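As a concrete illustration of the Cartesian-product fusion in [18], the minimal sketch below forms the three-way outer product of unimodal embeddings; the variable names and embedding sizes are illustrative assumptions, not the original implementation. Appending a constant 1 to each embedding lets the product retain the unimodal and bimodal terms alongside the trimodal interactions.

    import torch

    def tensor_fusion(z_t, z_a, z_v):
        # Append a constant 1 so the outer product keeps unimodal and
        # bimodal terms alongside the trimodal interactions.
        one = torch.ones(1)
        z_t = torch.cat([z_t, one])
        z_a = torch.cat([z_a, one])
        z_v = torch.cat([z_v, one])
        # Three-way Cartesian (outer) product, flattened for a classifier.
        return torch.einsum('i,j,k->ijk', z_t, z_a, z_v).flatten()

    # Illustrative usage with assumed embedding sizes:
    fused = tensor_fusion(torch.randn(32), torch.randn(16), torch.randn(16))
    # fused has (32+1) * (16+1) * (16+1) = 9537 dimensions

The rapid dimensional growth of this product is one reason later work, including the gating and attention methods above, turned to more parameter-efficient fusion.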
To better identify the different emotional information in multimodal teaching behavior data and help teachers improve their teaching, this paper proposes an emotion recognition model based on dynamic convolution and residual gating. The innovations of this paper are as follows:
(1) Cross-modal dynamic convolution is used to model local information and perform multimodal data fusion, so as to better represent each modality's low-level features, high-level local features, and context dependencies in teaching behavior (a minimal sketch follows this list).
(2) Residual gating automatically learns the contribution of each interaction feature to the final emotion classification, and the fused multimodal features are fed into a classifier to predict emotions in multimodal teaching behavior (see the residual-gate sketch after this list).
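For intuition on innovation (1), the sketch below shows a minimal dynamic convolution layer; the kernel size, dimensions, and names are assumptions for illustration, not the exact design of the proposed model. Unlike a standard convolution, the kernel weights are predicted from the input at every time step, so the local filter adapts to the content at each position.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicConv(nn.Module):
        """Minimal dynamic convolution sketch: a position-specific kernel
        is predicted from the input itself and applied over a local window."""
        def __init__(self, dim, kernel_size=3):
            super().__init__()
            self.kernel_size = kernel_size
            # Predict one kernel per position (shared across channels here
            # for simplicity; per-head kernels are common in practice).
            self.kernel_proj = nn.Linear(dim, kernel_size)

        def forward(self, x):                  # x: (batch, seq_len, dim)
            B, T, D = x.shape
            k = self.kernel_size
            # Position-specific kernels, normalized over the window.
            w = F.softmax(self.kernel_proj(x), dim=-1)   # (B, T, k)
            # Pad the time dimension and unfold sliding windows of length k.
            pad = (k - 1) // 2
            x_pad = F.pad(x, (0, 0, pad, k - 1 - pad))   # (B, T + k - 1, D)
            windows = x_pad.unfold(1, k, 1)              # (B, T, D, k)
            # Weighted sum of each window with its predicted kernel.
            return torch.einsum('btdk,btk->btd', windows, w)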
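Similarly, the residual gating of innovation (2) can be sketched as a learned sigmoid gate that scales each cross-modal interaction feature before adding it back to the target-modality representation; again, the names and dimensions are illustrative assumptions rather than the paper's exact formulation.

    import torch
    import torch.nn as nn

    class ResidualGate(nn.Module):
        """Minimal residual gating sketch: a sigmoid gate learns how much
        of the interaction feature contributes to the fused representation."""
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(2 * dim, dim)

        def forward(self, target, interaction):
            # g in (0, 1) decides how much of the interaction passes through.
            g = torch.sigmoid(self.gate(torch.cat([target, interaction], dim=-1)))
            # Residual connection preserves the target-modality information.
            return target + g * interaction

The residual path guarantees that the original modality representation is preserved even when the gate suppresses an uninformative interaction, which is what allows the contribution of each interaction to be learned automatically.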
Section 2 introduces the multimodal teaching behavior emotion recognition model based on dynamic convolution and residual gating; Section 3 demonstrates the model's classification performance on different emotions in teaching behavior through experiments; Section 4 summarizes the paper and outlines future work.