Discriminative and Robust Feature Learning for MIBCI-based Disability Rehabilitation

Background: In recent years, the motor imagery brain-computer interface (MIBCI) has become a valuable assistive technology for people with disabilities. However, effectively improving motor imagery (MI) classification performance by learning discriminative and robust features remains a challenging problem. Methods: In this study, we propose a novel loss function, called correntropy-based center loss (CCL), as the supervision signal for training a convolutional neural network (CNN) model on the MI classification task. With joint supervision of the softmax loss and the CCL, we can train a CNN model to acquire deep discriminative features with large inter-class dispersion and small intra-class variation. Moreover, the CCL also effectively decreases the negative effect of noise during training, which is essential for accurate MI classification. Results: We perform extensive experiments on two well-known public MI datasets, BCI competition IV-2a and IV-2b, to demonstrate the effectiveness of the proposed loss. The results show that our CNNs (with such joint supervision) achieve 78.65% and 86.10% accuracy on IV-2a and IV-2b, respectively, and outperform other baseline approaches. Conclusion: The proposed CCL helps the CNN model learn both discriminative and robust deep features for the MI classification task in BCI rehabilitation applications.
The proposed algorithm has the potential to be used in the practical MIBCI rehabilitation systems.


Background
Individuals with paralysis face significant difficulties engaging in social interactions due to functional limitations. According to statistics from the World Health Organization (WHO) [1], one billion people, approximately 15% of the world's population, experience some type of disability.
Many of them have varying degrees of activity and participation restrictions. To interact with the world, they rely on assistive technologies such as brain-computer interfaces (BCI). A BCI is a computer-based technology that collects brain activities, analyzes them, and translates them into commands that drive an external machine to execute the desired actions [2,3]. The motor imagery brain-computer interface (MIBCI) is one of the most promising approaches in the medical research area [4]. Its popularity is mainly due to the non-invasiveness of the electroencephalogram (EEG, a brain signal collection technique) and the easy implementation of the motor imagery paradigm. Such a convenient system has been deployed in a variety of applications, such as orthoses [5], prostheses [6], robotic arms [7], and mobile robots [8], across different fields, including clinical practice [9,10] and the military [11]. The translation of commands (i.e., classification) from raw MI EEG signals remains a significant challenge. The key to accurate classification is obtaining discriminative features. Conventional strategies usually rely on handcrafted feature extractors, such as the common spatial pattern (CSP) [12], filter bank common spatial pattern (FBCSP) [12], band power [13], and Riemannian covariances [14]. The extracted features are usually combined with linear classifiers, including logistic regression (LR) and support vector machines (SVM) [15,16], for MI classification in BCI rehabilitation devices.
Conventional approaches dominated classification performance for a relatively long time, until the recent development of graphics processing units (GPU) [17]. Modern GPUs offer powerful computational ability, enabling researchers to explore deep learning (DL) approaches, which require an enormous computational budget, to address the BCI classification task. DL strategies achieve distinguished improvements in classification accuracy compared to traditional methods. They also provide the extra benefit that MI EEG data can be processed in an end-to-end manner without any preprocessing [17]. Two well-known and also the earliest DL approaches, EEGNet [18] and shallow ConvNet [19], deliver favorable classification outcomes for MI recognition. Since then, many research groups have thoroughly investigated innovative convolutional neural networks (CNNs, one type of deep learning structure). The most common CNN framework (Fig. 1) performs two steps in succession: learning deep features from raw EEG data with CNN blocks, then making label predictions using the learned features. The training of the model on the classification task is usually guided by the softmax loss [20].
The deep features need to be separable and discriminative to achieve excellent performance in the general classification task. Such features can be acquired through the joint supervision of the softmax loss and the center loss, which was proposed in [20] for face recognition in the computer vision field. Specifically, the softmax loss ensures the separability of the features (Fig. 2(a)), and the center loss concurrently increases the discriminative power on top of the separability by pulling deep features towards their class centers (intra-class distance minimization, Fig. 2(b)). This joint supervision signal is also widely adopted for the MI detection task due to its efficiency and easy setup [21,22]. However, the center loss is sensitive to non-Gaussian noise since it is based on the quadratic L2 norm distance [23]. A few noise points/outliers far from the class centers may dominate the objective function and degrade the classification performance (Fig. 2(c)). Unfortunately, raw EEG data have a low signal-to-noise ratio (SNR) and contain much noise [13], so it is not suitable to directly use the center loss to optimize the CNN model for MI classification.
To address this concern, we propose a new loss, referred to as the correntropy-based center loss (CCL), based on the correntropy-induced distance (CID). Like the original center loss, the CCL also simultaneously learns the class centers and penalizes the distance between deep features and their corresponding class centers to minimize intra-class variation. However, there are two major differences between the CCL and the center loss. First, the class centers learned by the CCL are based on the maximum correntropy criterion mean (MCCM) [24]. Outliers have a low weight in the MCCM calculation. Second, unlike the L2 norm distance, which heavily penalizes deep features far from class centers, the CID used in the CCL assigns only minor penalization to these far feature points. In Fig. 2(c) and (d), we can notice that the class centers updated by the center loss deviate significantly from the true class centers, while the class MCCMs nearly overlap with them. The noise feature points are also filtered out by the CCL, as shown in Fig. 2(d). Such characteristics decrease the negative effect of non-Gaussian outliers during model training, which enables the CNN model to learn a robust feature pattern from raw MI EEG data that usually contain much noise. The main contributions of our method are summarized as follows:
1. To the best of our knowledge, this is the first attempt to use the CCL to guide the training of a CNN model. With joint supervision of the softmax loss and the CCL, discriminative features can be attained for MI classification, which can be easily deployed in BCI applications in either clinical or military fields.
2. The proposed loss is based on the CID and the MCCM, which can significantly reduce the adverse effect of noise during CNN model training.
3. We present extensive experiments on two public MI datasets, BCI competition IV-2a and IV-2b. The results show the effectiveness of the proposed method.

Method
In this section, we first elaborate on the network architecture used in the study. Then, we introduce the proposed loss (CCL) both intuitively and mathematically.

Network architecture
In this study, we follow the model architecture of EEGNet [18] but make a slight modification. We decrease the kernel size of the first CNN block to capture feature patterns above 8 Hz, as previous studies show that MI is mainly related to EEG signals in the 8-30 Hz range [13]. Our CNN model (see Fig. 3) consists of two temporal filters with average pooling, one spatial filter with average pooling, and two dense layers. One dense layer is supervised by the CCL, and the other by the softmax loss [25].
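The shape flow through such a pipeline can be sketched with a plain shape calculation. Note this is an illustrative EEGNet-style walk-through under assumed hyperparameters (filter counts and pooling factors are our own choices, not the paper's exact configuration); only the input shape 22 × 1000 comes from the dataset description.

```python
# Hypothetical shape walk-through for an EEGNet-style pipeline.
# Filter counts and pooling factors below are illustrative assumptions.

def temporal_conv(shape, n_filters):
    # 'same'-padded 1-D convolution along time: only the filter axis changes.
    _, chans, t = shape
    return (n_filters, chans, t)

def spatial_conv(shape):
    # Depthwise convolution spanning all EEG electrodes collapses that axis.
    f, _, t = shape
    return (f, 1, t)

def avg_pool(shape, k):
    # Average pooling along time divides the temporal axis by k.
    f, chans, t = shape
    return (f, chans, t // k)

shape = (1, 22, 1000)             # (filters, electrodes, time) for IV-2a
shape = temporal_conv(shape, 8)   # first temporal filter stage
shape = avg_pool(shape, 2)
shape = temporal_conv(shape, 16)  # second temporal filter stage
shape = avg_pool(shape, 2)
shape = spatial_conv(shape)       # spatial filter across the 22 electrodes
shape = avg_pool(shape, 5)
print(shape)                      # flattened size feeds the two dense layers
```

The flattened output of the last pooling stage is what the two dense layers (one supervised by the CCL, one by the softmax loss) consume.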

Fig. 3.
Illustration of the network architecture. x_i and ŷ_i are the i-th trial of the raw MI EEG data and the i-th predicted label, respectively. n is the number of classes, C is the number of EEG electrodes, T is the number of time stamps, and F1 and F2 are the sizes of the temporal filters.

The proposed CCL

As illustrated in Fig. 2(b), the center loss can increase the discriminative power by reducing the intra-class variation. However, it is based on the quadratic L2 norm distance and is easily affected by noise or non-Gaussian outliers. It is well known that MI EEG series carry significant noise during collection. We therefore propose the CCL, which combines the maximum correntropy criterion (MCC) [24] with the center loss. The CCL not only preserves the primary function of the center loss, increasing the discriminative power of deep features by reducing intra-class variations, but also decreases the adverse effect of noisy feature points by assigning them 'minor' significance. It is mathematically defined as

L_CCL = Σ_{i=1}^{m} [1 − exp(−‖x_i − c_{y_i}‖² / (2 σ_{y_i}²))]
where x_i ∈ ℝ^d represents the i-th sample's deeply learned features with a label y_i ∈ {1, 2, …, n}, d denotes the feature dimension, c_j is the j-th class MCCM of the deep features, and σ_j is the size of the Gaussian kernel for data of the j-th class. The sample size and the number of classes are m and n, respectively. Intuitively, the MCCM (class center) is similar to the conventional mean (averaged value) of each class but assigns a very low weight to outliers/noise in its calculation. More details and its mathematical definition can be found in Appendix A. The merit of this loss is that we can dynamically adjust the kernel size σ_j to control the significance of the feature points that are far from the MCCMs. We implement a two-step approach to update the deep features and class MCCMs in succession within one epoch. We first use the gradient descent method in each epoch to update the network parameters (and hence the deep features x_i). Then, the self-adaptive half-quadratic (HQ) approach [24] is performed to update the class MCCMs c_j. The two-step update approach is given in Appendix A. Similar to the center loss, we adopt the joint supervision of the softmax loss (L_S) [25] and the CCL to train the CNN for discriminative and robust feature learning. The joint supervision signal (L) is formulated as

L = L_S + λ · L_CCL,

where λ is the trade-off scalar balancing the two losses. We summarize the learning details of the CNN model in Algorithm 1 (Appendix B).
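The loss value and the center-update step can be sketched in a few lines of numpy. This is a minimal sketch of the two ingredients, not the paper's implementation: the exact self-adaptive HQ update and kernel-size schedule are in Appendix A, while here `sigmas` is fixed and a single Gaussian-weighted-mean pass stands in for the MCCM update.

```python
import numpy as np

def ccl_loss(feats, labels, centers, sigmas):
    """CCL value (sketch): sum of correntropy-induced distances between
    each deep feature and its class center. feats: (m, d); centers: (n, d)."""
    diffs = feats - centers[labels]              # per-sample offset from center
    sq = np.sum(diffs ** 2, axis=1)              # squared distances
    return np.sum(1.0 - np.exp(-sq / (2.0 * sigmas[labels] ** 2)))

def update_mccm(feats, labels, centers, sigmas, n_classes):
    """One half-quadratic-style pass: each class center becomes a weighted
    mean whose Gaussian weights down-weight points far from the current
    center, approximating the MCCM."""
    new_centers = centers.copy()
    for j in range(n_classes):
        fj = feats[labels == j]
        if len(fj) == 0:
            continue
        sq = np.sum((fj - centers[j]) ** 2, axis=1)
        w = np.exp(-sq / (2.0 * sigmas[j] ** 2))  # outliers get tiny weight
        new_centers[j] = (w[:, None] * fj).sum(axis=0) / w.sum()
    return new_centers

# Toy demo (hypothetical data): nine inlier features at the origin plus one
# gross outlier. The plain mean is dragged to (10, 10); one weighted-mean
# pass pulls the center back to the inlier cluster.
feats = np.vstack([np.zeros((9, 2)), [[100.0, 100.0]]])
labels = np.zeros(10, dtype=int)
centers = feats.mean(axis=0, keepdims=True)
sigmas = np.array([20.0])
centers = update_mccm(feats, labels, centers, sigmas, n_classes=1)
print(centers)   # close to the origin, i.e. the inlier cluster
```

Because each CID term is bounded by 1, the outlier's contribution to `ccl_loss` also stays bounded, which is the robustness property the text describes.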

Evaluation details
In this section, we first introduce the datasets used for the evaluation. Then, the details of the experimental settings are presented.

Data
Two well-known public datasets, called BCI competition IV-2a and IV-2b [26], provided by the Technical University of Graz, are used in this study.
The BCI competition IV-2a dataset was collected from 9 individuals (A01-A09) at a 250 Hz sampling rate using 22 electrodes. It is based on a cue-based BCI paradigm comprising four MI classes: imagined movement of the left hand, right hand, both feet, and tongue. Each subject completed two separate sessions, with 288 trials (72 trials per MI type) per session. We adopt the same data division scheme as the competition, where the first session is used for training and the second for testing, whose labels are to be predicted. We use the 4-s temporal segment from the start of the MI cue until the end of the MI. Given the 250 Hz sampling rate, 1000 samples are used per trial, and the raw MI data input to the CNN is a 22 × 1000 matrix.
The BCI competition IV-2b dataset was also collected from 9 healthy people (B01-B09), but with only 3 EEG electrodes (C3, Cz, and C4) at a sampling frequency of 250 Hz. The recordings were also based on a cue-based screening paradigm, comprising only two classes: the MI of the left hand (class 1) and the right hand (class 2). Five sessions of data were collected for each subject. We also use the same data division as the competition: the first three sessions are for training, and the remaining two are for testing. As with IV-2a, the 4-second interval, from the start of the cue until the end of the MI, is treated as one input trial to the model. Given the 250 Hz recording frequency, the raw input data for one trial is a 3 × 1000 matrix.
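The trial extraction described above amounts to slicing fixed 4-s windows from the continuous recording at each cue onset. The sketch below uses random data and fabricated cue positions purely for illustration; only the sampling rate, window length, and IV-2b electrode count come from the dataset description.

```python
import numpy as np

FS = 250                  # sampling rate (Hz) for both competition datasets
TRIAL_SEC = 4             # MI interval used per trial
SAMPLES = FS * TRIAL_SEC  # 1000 samples per trial

# Hypothetical continuous recording: 3 electrodes (IV-2b style), 60 s long.
rng = np.random.default_rng(0)
recording = rng.standard_normal((3, FS * 60))
cue_onsets = [1250, 5000, 9750]   # illustrative positions, not dataset values

# One 4-s window per cue: trials x electrodes x samples.
trials = np.stack([recording[:, s:s + SAMPLES] for s in cue_onsets])
print(trials.shape)   # (3, 3, 1000)
```

For IV-2a the same slicing applies with a 22-row recording, yielding 22 × 1000 trial matrices.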
Examples of one-trial input data from BCI competition IV-2a and IV-2b are shown in Fig. 4. The fluctuations seem different between subfigures (a) and (b), but they are similar if plotted on the same scale.
Table 1

Experimental settings
We summarize the experimental settings in Table 2. The sampling frequency is 250 Hz for both datasets, so we set the first temporal kernel size to 32 to capture temporal information above 8 Hz. The code for the experiments is given in the Supplementary files.

Comparison to other baselines
Our proposed method is first evaluated on the BCI competition IV-2a and IV-2b datasets. The classification accuracy of each subject, the averaged accuracy, and the standard deviation (SD) are shown in Tables 3 and 4. The best-performing model for each subject and for the average accuracy is highlighted in boldface for clear illustration. We observe that the proposed method (a CNN model jointly supervised by the softmax loss and the CCL) achieves the best classification across most subjects on both datasets. Its largest improvements over the second-best model are 11.26% on subject A05 in dataset IV-2a and 14.63% on subject B03 in IV-2b. At the average level, our method also achieves the best classification accuracy, with an improvement of at least 3.90% on IV-2a and 2.12% on IV-2b compared to the other baselines.
In addition, we also notice that the standard deviation of our method is smaller than that of the others, which shows the high stability of the proposed approach across subjects. More interestingly, the proposed method yields a notably larger improvement on subjects A02, A05, and A06, whose EEG signals have a low SNR [22].

Sensitivity analysis
The hyperparameter λ governs the trade-off between the intra-class variation and the noise influence. It is critical to our model, so we conduct an experiment to investigate its sensitivity. Fig. 6 illustrates the averaged classification accuracy across subjects on the BCI competition IV-2a dataset for λ ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}. We observe that simply using the softmax loss (λ = 0) is not a good option, as the CNN model then has the lowest averaged classification accuracy. We also observe a log-like curve as λ varies between 0 and 1: the classification accuracy increases with λ at the beginning but remains relatively stable for λ ≥ 0.4. The highest classification performance is achieved at λ = 0.6. Therefore, we empirically set λ to 0.6 for the final implementation of our approach.

Feature visualization

To better understand the function of the proposed CCL, we visualize the deeply learned features under four conditions. The models in the first two conditions, Model A (λ = 0) and Model B (λ = 0.6), are trained using the original data from A08 (randomly selected). In addition, we also try to intuitively observe the effectiveness of the CCL in reducing the negative effect of noise/outliers. It is hard to know which trials can be regarded as noise in the original dataset, so we manually add ten trials of simulated power-line interference (a 50 Hz wave), one of the most frequent noise sources in EEG signals, to the data of A08. Five trials are labeled 'left hand', and the other five 'right hand'. The models in the last two conditions, Model C (center loss, trade-off value = 0.01) and Model D (CCL, λ = 0.6), are therefore trained using the data from A08 with the 10 noise trials. For Models A and B, as the dimension of the second-to-last layer (the deeply learned features) is eight, we use principal component analysis (PCA) [30] to convert the high-dimensional features into 2-D vectors for visualization in the plane. For Models C and D, as popular dimension-reduction techniques such as PCA [30] and t-SNE [31] are easily affected by outliers, we decrease the output dimension of the second-to-last layer to 2 for direct visualization. Fig. 7 shows the feature distributions of these four models. We have two key findings. First, comparing subfigures (a) and (b), we observe that the intra-class variation significantly decreases when using the CCL, mirroring the function of the center loss. Second, it is also clearly noted that the CCL filters out the noise/outliers, while the center loss does not (see Fig. 7(c) and (d)).
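The PCA projection used for Models A and B can be sketched via an SVD of the mean-centered feature matrix. The 8-D features below are random stand-ins for the deep features; this is a generic sketch of the projection step, not the paper's exact visualization code.

```python
import numpy as np

def pca_2d(feats):
    """Project high-dimensional deep features to 2-D for plotting, via SVD
    of the mean-centered feature matrix. feats: (m, d) -> (m, 2)."""
    centered = feats - feats.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T   # scores on the top-2 principal components

# Hypothetical stand-in for the 8-D deep features of 20 trials.
rng = np.random.default_rng(1)
feats = rng.standard_normal((20, 8))
coords = pca_2d(feats)
print(coords.shape)   # (20, 2)
```

As the text notes, this projection is itself sensitive to outliers (the components chase large deviations), which is why Models C and D are instead trained with a 2-D final feature layer and visualized directly.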

Discussion
The current study investigated the feasibility of a new loss function, the CCL, as a supervision signal for CNN model training to obtain high classification performance in the MI recognition task. The results reported in Section 4 demonstrate the superiority of the joint supervision of the softmax loss and the CCL. The CNN models trained with this combined loss outperform the other baselines and show better stability across individuals. The reasons behind such performance improvements may be diverse. The conventional/traditional machine learning methods have their limitations. FBCSP only extracts 1-D feature vectors for classification, which may discard the 2-D latent information of the MI EEG signal [12]. The matrix-based approaches, including SMM, CCSP, and SSCSP, can preserve the 2-D structural information but can only apply linear transformations [17]. In contrast, as a DL-based approach, the proposed method can process the EEG data in 2-D space without destroying its structural information and can generate non-linear feature patterns for MI classification. In addition, our approach also outperforms previous DL models. The objective function may be one of the significant reasons for the accuracy increase. EEGNet and shallow ConvNet are supervised only by the softmax loss, which only ensures separable features but does not guarantee their discriminative power, nor does it consider the negative effect of noise. In contrast, the joint supervision of the softmax loss and the CCL can extract discriminative deep features and eliminate the negative impact of noise. These encouraging results demonstrate that, apart from the model structure that has been heavily investigated by previous research groups, an efficient and well-designed objective function can be another research focus for breaking through the bottleneck of CNN models on MI classification.
The center loss has two outstanding characteristics. First, it is stable/insensitive to the trade-off hyperparameter λ. Second, it increases the discriminative power of the deeply learned features by minimizing the intra-class variation [20]. The Results section shows that the CCL, initially inspired by the center loss, also inherits these two characteristics to a certain degree in the MI classification task. The classification performance does not vary significantly when λ ≥ 0.4 (Section 4.2), and the model trained by the joint supervision of the softmax loss and the CCL has a smaller intra-class variation than the one trained only by the softmax loss (Fig. 7). These phenomena show that the CCL preserves the significant properties and functionality of the center loss while changing the quadratic L2 norm distance into the CID (by applying the Gaussian kernel). It still delivers stable performance across different λ (in a particular range) and increases the discriminative power of the deep features by reducing the intra-class variation.
The unique property of the CCL is its effectiveness in handling the negative effect of noise in the MI classification task. We have observed how this property benefits MI classification in two respects in the Results section. First, for the subjects (A02, A05, and A06) with a low signal-to-noise ratio, the CNN trained by the joint supervision of the softmax loss and the CCL shows a more significant improvement. In these subjects, more noise is mixed with the signals during data collection. We assume that deep features extracted by the CNN pipeline from trials containing a large amount of noise are more likely to lie far from the class feature centers. Our CCL assigns very low significance to these far points during training. The greater performance increments on these three subjects may come from such low significance assignments, allowing the CNN to learn more helpful information from the high-SNR trials than from the low-SNR ones. Second, Fig. 7(d) makes it evident that the features of the noise (simulated electrical signals) are filtered out from the feature distributions of the standard EEG data when using the joint supervision of the softmax loss and the CCL for CNN training. This phenomenon may provide direct evidence of the effectiveness of the CCL in reducing the negative effect of 'pure' noise in the MI classification task. According to these two findings, the CCL may help reduce the adverse effects of both low-SNR trials and pure-noise trials.
The proposed CCL shows its efficacy on the MI classification task in our experiments. It may offer similar benefits on other EEG-based BCI classification tasks, such as P300 [32], steady-state visually evoked potentials (SSVEP) [33], and steady-state somatosensory evoked potentials (SSSEP) [34]. However, each type of classification has its own traits, and it is not appropriate to estimate the effectiveness of the CCL on these tasks without actual experiments, even though they share similar EEG signal characteristics. In the future, we plan to extend this work to other EEG-based BCI classification tasks to gain a deeper understanding of the general effectiveness of the CCL on BCI classification. Such insight would give us more confidence in applying the proposed algorithm in real-world BCI applications in both clinical and military fields.

Conclusion
In this study, we proposed a new loss function called the correntropy-based center loss (CCL). By combining the softmax loss with the CCL to jointly supervise the learning of CNNs, the discriminative power of the deeply learned features can be significantly increased for MI classification. Unlike the center loss, which is based on the quadratic L2 norm distance, the CCL is based on the CID, which assigns minor penalization to outliers, thereby eliminating the adverse impact of noise. A novel self-adaptive strategy was used to learn the class MCCMs, relieving the burden of manually setting the kernel size σ. Experiments on two well-known MI datasets demonstrate encouraging results and show the effectiveness of the proposed method. The proposed algorithm has the potential to be used in practical MIBCI rehabilitation systems.

Funding
The work in this paper is supported in part by the Hong Kong Innovation and Technology Fund (MRP/015/18) and the Hong Kong Research Grants Council (PolyU 152006/19E).