GTSception: A Deep Learning EEG Emotion Recognition Model Based on Fusion of Global, Time Domain and Frequency Domain Feature Extraction

With the rapid development of deep learning in recent years, automatic electroencephalography (EEG) emotion recognition has received wide attention. At present, most deep learning methods do not normalize EEG data properly and do not fully extract time-domain and frequency-domain features, which affects the accuracy of EEG emotion recognition. To solve these problems, we propose GTSception, a deep learning EEG emotion recognition model. In pre-processing, the EEG time-slice data, including channels, is pre-processed. In our model, global convolution kernels are used to extract overall semantic features, followed by three kinds of temporal convolution kernels representing different emotional periods, then two kinds of spatial convolution kernels highlighting brain hemispheric differences to extract spatial features; finally, emotions are classified into two classes by a fully connected layer. The experiments are based on the DEAP dataset, and our model can effectively normalize the data and fully extract features. For Arousal, our accuracy is 8.76% higher than that of the current optimal Inception-based emotion recognition model. For Valence, the best accuracy of our model reaches 91.51%.


Introduction
Being an essential way for people to communicate and express themselves, emotion is a kind of attitudinal response, with corresponding behavioral responses, to objective things. With the wide application of human-computer interaction [1], emotion recognition has been widely studied by researchers. In order to analyze people's emotions well, researchers need to collect and recognize the emotions expressed in speech [2, 3], facial expressions [4], peripheral physiological signals [5] and behavioral actions [6]. Because the emotional expression of facial expressions is highly subjective and easy to disguise, emotion recognition through them cannot continuously feed back a reliable emotional state that guarantees authenticity. Physiological signals [7] are controlled by the human autonomic nervous system and, compared with other emotional modalities, can show the related emotions more accurately and truly over a long time. In particular, EEG signals are generated by human cerebral cortical neurons and can truly reflect people's emotional state under external stimuli. Therefore, it is relatively reliable to identify emotions through EEG [8].
The traditional machine learning approach to EEG emotion recognition requires manual feature extraction in the time domain (Zheng [9]; Reuderink [10]), frequency domain (Gao [11]; Wang [12]), time-frequency domain (Dang [13]; Zhang [14]) and others, and then establishes a machine learning model to identify emotions. On the other hand, deep learning algorithms can automatically extract high-dimensional features and classify them using multi-layer neural networks, thanks to their good nonlinear learning ability [15]. Most studies consider only one of the time-domain and frequency-domain features, which leads to insufficient feature extraction from the EEG signal and low recognition accuracy. Only a few studies achieve high accuracy by fully combining time-domain and frequency-domain features [16]. For image and signal data, good data pre-processing can effectively reduce noisy data and thus effectively improve accuracy [8]. The emergence of TSception [17] addresses the above problems by effectively combining time-domain and frequency-domain features. However, it is not perfect in global feature extraction and data pre-processing, and its recognition accuracy needs to be further improved.
In order to solve these problems, we propose a deep learning EEG emotion recognition model that fully extracts time-domain and frequency-domain features and has a good data pre-processing effect. The advantages are as follows: (1) Inspired by Inception, GTSception uses decomposed temporal and spatial convolution kernels to fully extract time-domain and frequency-domain features. At the same time, GTSception additionally uses global convolution kernels to extract overall semantic features.
(2) The two normalization methods, Z-Score and Min-Max, are used for pre-processing comparison experiments on channels and time, because appropriate data normalization can effectively improve the accuracy of emotion recognition.
The rest of the article consists of the following parts: the second section describes the related work; the third section describes our methods in detail; the fourth section describes the experiments and the results; conclusions are presented in the final section.
Relevant work

Inception structure
In ILSVRC2014, Google won with GoogLeNet (Inception-V1) [18]. It uses multiple convolution kernels to extract multiscale features from images, and also uses 1x1 convolution kernels to reduce the number of feature channels, which ensures that the model does not overfit while fully extracting features. In order to further reduce the number of parameters and the computational cost, Inception v2 [19] and Inception v3 [20] added operations such as convolution-kernel decomposition (asymmetric convolution kernels, splitting large-scale kernels into smaller ones) and regularization (Dropout, Batch Normalization) to the GoogLeNet network. Inception v4 [21], proposed in 2016, ensures high accuracy and network depth by adding ResNet modules.
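The multi-branch idea described above can be sketched as a small PyTorch module. This is a hypothetical Inception-v1-style block with illustrative branch widths, not GoogLeNet's published ones: parallel 1x1, 3x3 and 5x5 branches (the latter two with 1x1 reductions) plus a pooled branch, concatenated along the channel dimension.

```python
# Hypothetical sketch of an Inception-v1-style block; branch widths are
# illustrative, not GoogLeNet's actual configuration.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)          # 1x1 branch
        self.b3 = nn.Sequential(                               # 1x1 reduce -> 3x3
            nn.Conv2d(in_ch, 8, kernel_size=1),
            nn.Conv2d(8, 16, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(                               # 1x1 reduce -> 5x5
            nn.Conv2d(in_ch, 4, kernel_size=1),
            nn.Conv2d(4, 8, kernel_size=5, padding=2))
        self.bp = nn.Sequential(                               # pool -> 1x1 project
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 8, kernel_size=1))

    def forward(self, x):
        # Each branch preserves the spatial size, so outputs concatenate
        # along channels: 16 + 16 + 8 + 8 = 48 output channels.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

out = InceptionBlock(3)(torch.randn(1, 3, 32, 32))
print(tuple(out.shape))  # (1, 48, 32, 32)
```

The 1x1 reductions before the larger kernels are what keep the parameter count low while still extracting multiscale features.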

Application of EEG emotion recognition in artificial intelligence
In the research on EEG emotion recognition based on machine learning, Candra [22] showed that the recognition accuracies of Valence and Arousal on the DEAP dataset were 65.13% and 65.33% using the time-frequency feature wavelet entropy; Ning [23] obtained accuracies of 69.1% and 71.9% for Valence and Arousal on the DEAP dataset using the frequency-domain feature empirical mode decomposition (EMD). However, most of these methods extract only one of these features, a low-dimensional feature that contains insufficient electroencephalographic information to reliably identify emotions.
In the research on deep learning, there are mainly two kinds of methods: convolutional neural networks and recurrent neural networks. Yanagimoto [24, 25] showed that the accuracy of Valence on the DEAP data could reach 87.27% using a better input format, effective EEG channel filtering, and a CNN. The researchers mentioned above mainly use CNNs, which focus more on the correlation between EEG channels. Alhagry [26] used a 2-layer LSTM to achieve accuracies of 85.65% and 85.45% on Arousal and Valence, respectively, for the DEAP dataset. The LSTM is a classical recurrent neural network structure, which focuses more on the temporal dynamics of EEG signals.
In this paper, GTSception will adopt the global, temporal, spatial convolution kernels to effectively extract the relevant features, and our model achieves better performance in the DEAP dataset compared to TSception.

GTSception EEG emotion recognition framework
The framework proposed in this paper is based on deep learning and applied to EEG emotion recognition. Each sub-module of GTSception is shown in Figure 1. First, the model needs to collect or gather data related to EEG emotions; the DEAP dataset from the public database [27] is used in our model. Secondly, in order to achieve faster convergence of our model, good data pre-processing methods are adopted: Z-Score and Min-Max, two well-established normalization methods, are selected. Then, feature extraction is performed on the pre-processed dataset through the global convolution module, the temporal convolution module and the spatial convolution module. Finally, the classifier analyzes the extracted features to classify the emotions of Valence and Arousal.

Data pre-processing
The original DEAP EEG data is multidimensional signal data indexed by the total number of experiments, EEG channels, time sequence signals, etc. In order to adapt to the training of GTSception, it needs to be transformed into two-dimensional data before it can be applied to the convolution modules in our model. Algorithm 1 is the pre-processing algorithm for our model, which performs well on the DEAP dataset.
According to Algorithm 1, X is the two-dimensional EEG data obtained layer by layer over subject, channel, experiment and time slice; the average value of the first 3 seconds is then subtracted to obtain the pure EEG data shown in Figure 3. In order to minimize the impact of noisy data on the convergence of GTSception, we use two normalization methods, Z-Score and Min-Max, to conduct a comparative experiment and select the optimal normalization scheme.

Fig. 3 Pre-processed EEG data

Z-Score standardization uses the standard deviation and mean of the sequence data to standardize the data. To keep the overall distribution of the standardized data consistent, values of 0 in the EEG data are ignored. The Z-Score formula is shown in Formula 1:

x' = (x − μ) / σ (1)

Min-Max normalization uses the maximum and minimum values of the sequence data to standardize the data, as shown in Formula 2:

x' = (x − x_min) / (x_max − x_min) (2)
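A minimal NumPy sketch of the two formulas follows. Per the text, zero entries are excluded from the Z-Score statistics; exactly how that masking is implemented in the original pipeline is not specified, so the version here is our own assumption.

```python
# Minimal sketch of the two normalization formulas; the zero-masking in
# z_score is an assumption about how the paper "ignores" zero values.
import numpy as np

def z_score(x):
    nz = x[x != 0]                      # ignore zeros when estimating mu, sigma
    mu, sigma = nz.mean(), nz.std()
    return (x - mu) / sigma             # Formula 1

def min_max(x):
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)         # Formula 2

eeg = np.array([0.0, 2.0, 4.0, 6.0])
print(min_max(eeg))                     # approximately [0, 1/3, 2/3, 1]
```

In practice either function would be applied along the channel axis or the time axis of the pre-processed EEG array, which is exactly the choice the comparison experiments below explore.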

Global Convolution
Due to the improvement of hardware devices and the widespread application of image recognition, convolutional neural networks have gradually developed and matured. The Inception structure of GoogLeNet was proposed for image recognition, and our model is also based on a CNN implementation, so we use an image-like input format (length-width-channel). The first part is the global convolution module, which adopts a global convolution kernel of 3x3 to extract the overall semantic features. Compared to TSception, the additional features extracted here provide richer details of the global EEG features for the latter parts of the model.

Temporal convolution
The second part is the time-domain feature extraction module. The input data is a time sequence signal with a sampling rate of 128 Hz. In order to better integrate the EEG emotion features of different time periods, three temporal convolution kernels of 64x1, 32x1 and 16x1 are used to extract feature layers containing different time-domain information, respectively representing the time-domain features of 0.5 s, 0.25 s and 0.125 s of EEG data. Then the features are average-pooled with a 1x16 pooling kernel so that they become more compact and the computation smaller. To extract more complete and richer information later, the three feature layers are merged into a feature layer of size 32x17x16. Finally, batch normalization is performed on the new feature layer so that the data distribution is fixed and the model converges better.
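The three-branch idea can be sketched in PyTorch as follows. This is a hedged sketch, not the paper's exact module: we assume input shaped (batch, 1, 32 electrodes, T samples), slide each kernel along the time axis, and pool every branch to a common width of 17 so the branches can be concatenated; the paper's exact strides, padding, and merge into a 32x17x16 layer may differ.

```python
# Hedged sketch of the temporal module: kernels spanning 64, 32 and 16
# samples (0.5 s, 0.25 s, 0.125 s at 128 Hz). Pooling each branch to a
# common width of 17 is our assumption to make the branches mergeable.
import torch
import torch.nn as nn

class TemporalModule(nn.Module):
    def __init__(self):
        super().__init__()
        # One branch per kernel length; kernels slide along the time axis.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, kernel_size=(1, k)),
                          nn.AdaptiveAvgPool2d((32, 17)))
            for k in (64, 32, 16)])
        self.bn = nn.BatchNorm2d(48)    # fix the distribution of the merged maps

    def forward(self, x):               # x: (batch, 1, 32 electrodes, T samples)
        return self.bn(torch.cat([b(x) for b in self.branches], dim=1))

out = TemporalModule()(torch.randn(4, 1, 32, 512))
print(tuple(out.shape))  # (4, 48, 32, 17)
```

The key design point survives the simplification: longer kernels summarize slower emotional dynamics, shorter kernels faster ones, and batch normalization stabilizes the merged feature distribution.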

Spatial Convolution
The third part is the frequency-domain feature extraction module. In the former module, the time-domain features have been fully extracted. Referring to the input of the model, 32 is the number of brain electrodes, i.e., the length of the input data. Due to the asymmetry of the human brain, the module adopts 1x32 and 1x16 spatial convolution kernels to extract the spatial-domain features of the EEG signals. The features extracted by the 1x32 convolution kernel take into account the spatial-domain features contributed by all brain electrodes, while the frequency-domain features extracted by the 1x16 convolution kernel represent those of the two brain hemispheres. Subsequently, a 1x8 pooling kernel is used to simplify the features of these two feature layers. Finally, the two feature layers are combined into a new feature layer of size 3x2x64, where batch normalization is also performed in consideration of model convergence.
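A hedged sketch of this module: one kernel covering all 32 electrodes and one kernel of 16 electrodes with stride 16, so that the latter produces one response per hemisphere. The input channel count (16, carried over from an assumed temporal stage) and the 64 output maps are our assumptions; the paper only states the merged size of 3x2x64.

```python
# Hedged sketch of the spatial module: a (32,1) whole-brain kernel and a
# (16,1) stride-16 per-hemisphere kernel, concatenated along the
# electrode axis (heights 1 + 2 = 3), then pooled by (1,8).
import torch
import torch.nn as nn

class SpatialModule(nn.Module):
    def __init__(self, in_ch=16):
        super().__init__()
        self.whole = nn.Conv2d(in_ch, 64, kernel_size=(32, 1))                 # all electrodes
        self.hemi = nn.Conv2d(in_ch, 64, kernel_size=(16, 1), stride=(16, 1))  # per hemisphere
        self.pool = nn.AvgPool2d((1, 8))
        self.bn = nn.BatchNorm2d(64)

    def forward(self, x):               # x: (batch, in_ch, 32 electrodes, T')
        h = torch.cat([self.whole(x), self.hemi(x)], dim=2)
        return self.bn(self.pool(h))

out = SpatialModule()(torch.randn(4, 16, 32, 16))
print(tuple(out.shape))  # (4, 64, 3, 2) -- the 3x2x64 layer described above
```

Under these assumptions the output reproduces the 3x2x64 feature layer: height 3 (whole brain plus two hemispheres), width 2 after the 1x8 pooling, and 64 feature maps.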

Classifier
The last part is the classifier. The EEG data has been processed by the former modules into a feature layer of size 3x2x64, which needs to be converted into a fully connected layer of size n. In order to reduce feature loss, we connect a fully connected layer consisting of 1024 neurons before classification, and Dropout is also used. Finally, a fully connected layer of two neurons is used for classification.
The modules of our model are described above. Algorithm 2 contains the training procedure of GTSception. x_train represents the training data, y_train the training labels, x_test the test data and y_test the test labels; E is the number of iteration cycles, B the batch size, C the emotion category, and Dropout the random discard rate of neurons. The model is iterated for E cycles with y_train/B groups of samples per cycle, and is saved after the iterations complete. Finally, the saved model SAM is used to test model performance on x_test.
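The loop described by Algorithm 2 can be sketched as follows. The tiny stand-in model and random data are placeholders of our own; GTSception itself would be plugged in as `model`, and E, B, C would take the paper's values.

```python
# Hedged sketch of Algorithm 2: iterate for E epochs over batches of
# size B, save the trained model (SAM), then score it on the test split.
# The stand-in model and random tensors are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

E, B, C = 3, 8, 2                                   # epochs, batch size, classes
x_train, y_train = torch.randn(32, 10), torch.randint(0, C, (32,))
x_test, y_test = torch.randn(8, 10), torch.randint(0, C, (8,))

model = nn.Sequential(nn.Linear(10, 16), nn.ELU(), nn.Dropout(0.5), nn.Linear(16, C))
opt = torch.optim.Adam(model.parameters(), lr=2e-4)
loss_fn = nn.CrossEntropyLoss()

loader = DataLoader(TensorDataset(x_train, y_train), batch_size=B, shuffle=True)
for epoch in range(E):                              # E iteration cycles
    for xb, yb in loader:                           # len(y_train)/B batches per cycle
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()

torch.save(model.state_dict(), "sam.pt")            # saved model SAM
model.eval()
with torch.no_grad():
    acc = (model(x_test).argmax(dim=1) == y_test).float().mean().item()
```

Calling `model.eval()` before testing matters here because Dropout must be disabled at evaluation time.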

Experiment and result analysis

Data
The experimental data used in this paper are from the multi-modal EEG emotion dataset of Queen Mary University of London, United Kingdom. It contains multi-modal emotional information such as video recordings of the subjects' faces, EEG, ECG, EMG, etc. Thirty-two participants were visually stimulated by 40 carefully selected one-minute music video clips to evoke the relevant emotions. Each subject then scored each video on multiple emotional dimensions on a scale of 1-9. The length of each experiment in the DEAP data is 63 seconds, with the first 3 seconds being the experiment wait time and the last 60 seconds the experiment time while watching the video. The data has been downsampled to 128 Hz, and EOG signal artifacts have been removed. The detailed description of the DEAP dataset is shown in Table 1.
Table 1 Detailed description of the DEAP dataset (labels: High / Low)

Experiment settings
Considering the strong individual specificity of EEG emotion, the model is trained and tested on a single subject. To optimize model performance, the following measures are taken. In order to prevent overfitting, dropout randomly discards neurons with a ratio of 0.5 in the fully connected layer. A learning rate of 2e-4 is used together with an exponential learning-rate decay method. The activation function is ELU, which is a good way to avoid the gradient-vanishing problem that can occur when using ReLU. Batch Normalization ensures that the feature layers before convolution have the same distribution and accelerates the convergence of the model. Weights are initialized by random truncation of a standard normal distribution, which, similarly to the BN operation, accelerates model convergence and weight updating. In this paper, the batch size is 80, the number of epochs is 200, and the model is deployed on a GTX 1070 (8 GB) under Windows.
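These settings can be sketched in PyTorch as below. The decay factor 0.99 and initialization standard deviation 0.1 are illustrative assumptions; the paper does not state the exact values.

```python
# Hedged sketch of the settings above: dropout 0.5, Adam at 2e-4 with
# exponential learning-rate decay, ELU activation, and truncated-normal
# weight initialization. gamma=0.99 and std=0.1 are our assumptions.
import torch
import torch.nn as nn

layer = nn.Sequential(nn.Linear(64, 1024), nn.ELU(), nn.Dropout(0.5),
                      nn.Linear(1024, 2))

def init_trunc_normal(m):
    if isinstance(m, nn.Linear):
        nn.init.trunc_normal_(m.weight, std=0.1)    # truncated normal init
        nn.init.zeros_(m.bias)

layer.apply(init_trunc_normal)
opt = torch.optim.Adam(layer.parameters(), lr=2e-4)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)

for _ in range(5):                                  # one decay step per epoch
    opt.step()                                      # (gradient step elided)
    sched.step()
print(sched.get_last_lr()[0])                       # 2e-4 * 0.99**5
```

Exponential decay shrinks the learning rate by a fixed factor each epoch, which pairs naturally with the 200-epoch schedule used here.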

Experimental scheme
In this paper, the hyperparameters of the model, such as batch size, epoch count, learning rate, training and test set sizes, and dropout rate, are kept consistent so that the control-variable method can be used to design accurate comparison experiments. For the comparison experiment on the model improvement, in order to maximize the effect of the model, it was found that the positive and negative samples of the Valence emotion dimension were more balanced than those of Arousal over the total data. Therefore, only the Valence category was used for the comparison experiment. So that the experiment is more reliable, all subjects were tested on Valence for EEG recognition.
For the pre-processing part, because there was no relevant research on Arousal in TSception, the pre-processing comparison experiment was made specifically on the Arousal emotion dimension. In the analysis of the emotional categories of each subject, the positive and negative samples of subjects S16 and S32 were the most balanced. For this reason, four different pre-processing schemes are designed in this paper.
Scheme 1, the Z-Score method standardizes the data of each EEG channel; Scheme 2, the Z-Score method standardizes the data of each time sample point across channels; Scheme 3, the Min-Max method normalizes the data of each EEG channel; Scheme 4, the Min-Max method normalizes the data of each time sample point across channels. By comparing the four pre-processing schemes, the optimal one is selected to cooperate with the improved model in this paper to achieve the optimal recognition accuracy.

As shown in Table 3, compared with TSception, the global convolution part of our model brings more global feature details, and the results in the table also verify the effect of our model. Ours indicates the accuracy of our model, and Orig that of TSception. For most of the 32 subjects, the accuracy of our model was about 4% higher than TSception's. The 27th subject showed the most significant difference, with nearly 15% higher accuracy, which could be attributed to the imbalance of positive and negative samples for that subject. Figure 4 and Figure 5 show the performance of S16 and S32 under the four pre-processing schemes, respectively. Figure 5 shows that for subject S32, due to the strong individual specificity of EEG, the Z-Score and Min-Max methods performed almost identically after pre-processing the EEG channel data.

Analysis of experimental results
In Figure 4 and Figure 5, the results of the Z-Score pre-processing method on the channel and time dimensions are similar to those of the Min-Max method. To verify the robustness of our model, we observe the changes in accuracy and loss through ten-fold cross-validation. The total data of each individual in DEAP is divided into a training set and a test set at a ratio of 9:1. All samples are scrambled by sampling without replacement. The cross-validation results of subjects S16 and S32, using the Z-Score method with time and channel pre-processing and training for 200 epochs, are shown in Table 4. From Table 4, it is obvious that cross-validation is a good way to improve the robustness and accuracy of the model. According to the above results, the Z-Score method applied to the EEG channel data in pre-processing ensures that the model improves the accuracy of emotion recognition steadily during training.
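The splitting scheme described above can be sketched with NumPy: shuffle all sample indices once (sampling without replacement), then rotate a 10% slice out as the test fold, giving the 9:1 train/test ratio per fold. The function name and seed are our own for illustration.

```python
# Minimal sketch of the ten-fold split: one permutation of the indices,
# then each fold in turn serves as the 10% test set.
import numpy as np

def ten_fold_indices(n_samples, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)        # scramble without replacement
    folds = np.array_split(idx, 10)
    for k in range(10):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        yield train, test

splits = list(ten_fold_indices(100))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10
```

Because the permutation is drawn without replacement, every sample appears in exactly one test fold across the ten runs, which is what makes the averaged accuracy a robust estimate.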