A Multi-scale Residual Convolutional Neural Network for Sleep Staging Based on Single Channel Electroencephalography Signal

Sleep disorder is a serious public health problem. Non-hospital sleep monitoring systems that track sleep quality can effectively support the screening of sleep-disorder-related diseases. A new multi-scale residual convolutional neural network (MS-RESCNN) algorithm was proposed to discover features of electroencephalography (EEG) signals detected with a wearable system and to classify sleep stages. EEG signals were analyzed by this algorithm every 30 seconds, and the sleep stage was output as wake (W), rapid eye movement sleep (REM) or non-rapid eye movement sleep (NREM); NREM was further subdivided into the N1, N2 and N3 stages. 5-fold cross validation and independent-subject cross validation were performed on the dataset, yielding Kappa coefficients of 0.7360 and 0.7001, respectively; the corresponding accuracy rates were 92.06% and 91.13%. Compared with other methods, the proposed method can conveniently obtain sleep-stage information from single-channel EEG signals without special feature extraction. It performs well and can support clinical applications based on automatic sleep staging.


Introduction
Sleep plays an important role in human health [1]. The monitoring of human sleep has significant implications for medical research and practice [2]. Sleep specialists usually assess sleep quality by analyzing electrical activity signals from sensors attached to different parts of the body, in accordance with the Rechtschaffen & Kales rules [3] or the American Academy of Sleep Medicine (AASM) sleep scoring manual [4]. In particular, polysomnography (PSG), which records EEG, electrooculogram (EOG), electrocardiograph (ECG), electromyography (EMG), respiratory effort, leg movement and blood oxygen saturation over several nights in a sleep laboratory, is considered the gold standard for evaluating the sleep status of subjects [5]. To improve the efficiency of sleep monitoring, several effective sleep staging methods using multiple physiological parameters based on EEG, ECG and EMG signals have been proposed in recent years [6][7][8][9][10]. However, their signal acquisition mostly relied on silver/silver chloride electrodes fixed to the measurement positions with adhesive or conductive paste, which disturbed the natural sleep of subjects and was not suitable for long-term sleep monitoring in the home environment.
Current research on sleep monitoring and staging falls into two main lines: one simplifies and optimizes the signal acquisition channels, and the other explores how other physiological signals change during sleep [11][12][13][14][15][16][17][18][19][20]. Wearable measurements of physiological signals such as ECG, EOG and respiration, as well as body movements and other behaviors, offer low cost and easy operation for long-term dynamic sleep monitoring [12][13][14][15]. Rahman [15] used EOG signals to obtain up to 73% sleep staging accuracy, but the classification of the N1 stage still needed improvement. In recent years, non-invasive or non-contact measurement of cardiac, respiratory and body movement signals has gradually gained favor among researchers [16][17][18][19][20]. Ballistocardiograph (BCG) measurements based on pressure-sensitive sheets or bed-mounted sensors have been used for sleep analysis and preliminary disease screening [16][17][18]. Methods such as video monitoring and microwave radar have also been used to detect the cardiac, respiratory and body movement signals of patients [19][20]. Of course, their performance mainly depends on the quality of signal acquisition.
Single-channel EEG parameters in the time domain, frequency domain and time-frequency domain can be used for sleep stage classification [21][22]. Channel selection is important for measurement convenience. Fp1, Fp2 and Fpz, located below the hairline, are suitable for wearing and data collection, and the F3 and F4 channels near the hairline also leave room for operation. However, the direct information from these channels supports the W, N1, N2 and N3 stages differently from most models trained on sleep datasets recorded at parietal EEG positions, which can affect the credibility of the staging results. Deep neural network models such as the recurrent neural network (RNN), convolutional neural network (CNN) and deep belief network (DBN) have been used to analyze single-channel EEG signals and achieved good overall average accuracy of sleep staging [23][24][25][26][27][28][29].
A multi-scale convolutional neural network (MSCNN) was also proposed to perform multi-scale feature extraction and classification simultaneously [30][31]. However, vanishing or exploding gradients are likely to occur as the depth of a CNN increases. The residual convolutional neural network (RESCNN) proposed by He [32] aimed to solve this network degradation problem. The method has been applied to machine fault detection and achieved good results [33]. Sleep stage classification based on residual networks has also achieved good results [34], but that work extracted central and occipital EEG from the right hemisphere, left and right EOG, and chin EMG channels from each PSG study.
Considering that a multi-scale RESCNN can capture the detailed signal features required for pattern classification, this idea was applied to single-channel EEG sleep stage classification in our study, aiming at a practical wearable smart eye mask. The work covered the following aspects: applying a deep MS-RESCNN that requires no special EEG feature extraction; training the network model on a PSG dataset; and testing the sleep stage classification performance of the new method by cross validation. The new method uses single-channel EEG signals and dispenses with any special feature extraction process. It shows good performance and application potential, and can support clinical applications such as the screening and diagnosis of sleep disorders.

Method
Sleep staging is a recognition or regression problem on time-series signals. According to the sleep manuals, experts divide the EEG, EOG, EMG and other sensor signals into 30 s epochs and score each epoch as W, REM, N1, N2, N3 or N4. In this paper, multi-scale feature learning was integrated into the RESCNN to automatically learn sleep features of the original physiological signals at different time scales, and classification was performed in a parallel way directly from the complex original physiological sleep signals.

CNN
CNN is a special artificial neural network inspired by the cerebral cortex, which uses convolution to extract signal features and compress signal size. While retaining valid information, it reduces the amount of input data by several orders of magnitude, which yields a network that is easier to optimize and reduces the risk of overfitting. CNN contains two core layers: convolution and pooling. The purpose of convolution is to extract features at different levels from the original data; a certain number of filters are used to produce feature maps of the input data. Pooling layers are periodically inserted between the convolutional layers, and a subsampling operation is used to reduce the number of parameters. At the end, the ReLU activation function, which significantly improves training speed, is used.
In a CNN, the receptive field is defined as the size of the region of the input that is mapped to a single point of the output feature map of each layer. In layman's terms, this is the area of the input that an output feature point can "see". The receptive field is calculated by Equation (1):

RF_i = (RF_{i+1} − 1) × stride_i + kernel_i    (1)

where RF_i is the receptive field of the i-th convolutional layer, RF_{i+1} is the receptive field of the (i+1)-th layer, stride_i is the convolution stride of layer i, and kernel_i is the convolution kernel size of the i-th layer.
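The backward recurrence of Equation (1) can be applied layer by layer; the following sketch computes the receptive field of a stack of one-dimensional convolution and pooling layers. The layer parameters in the example are illustrative, not the paper's exact configuration.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs from first to last layer.
    Returns the receptive field of one output point in the input signal."""
    rf = 1  # one point of the final feature map "sees" itself
    for kernel, stride in reversed(layers):
        # Equation (1): RF_i = (RF_{i+1} - 1) * stride_i + kernel_i
        rf = (rf - 1) * stride + kernel
    return rf

# Example: a kernel-15 conv (stride 1), a max pool of size/stride 2,
# then two kernel-3 convs (stride 1).
layers = [(15, 1), (2, 2), (3, 1), (3, 1)]
print(receptive_field(layers))  # -> 24
```

Walking the stack from the output back to the input in this way gives the receptive field of the whole network, as used later to derive the 563-point figure.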

Residual block
In general, the more layers a convolutional neural network has, the more diverse the features it can extract. But simply increasing the depth of the network can lead to vanishing or exploding gradients. Although CNNs dozens of layers deep can be trained with normalized initialization and batch normalization (BN), they are prone to degradation as the number of layers increases. He [32] proposed the RESCNN to solve this degradation phenomenon: if the layers behind a certain layer of a deep network are identity mappings, the model degenerates into a shallow network, so a deeper network should perform no worse than its shallow counterpart.
For a network with input x, denote the learned feature as H(x); the network is expected to learn the residual F(x) = H(x) − x. As shown in Figure 1, a residual block is formed by a neural network with shortcut connections. It contains two kinds of mapping: the identity mapping, which is the shortcut curve in the figure, and the residual mapping. If the residual is not equal to 0, network performance can still be improved by increasing the number of layers. If the residual is 0, the current layer is simply an identity mapping, which neither improves nor degrades performance; thus the network degradation problem is solved.

Figure 1 Schematic diagram of residual block
The output y of the shortcut operation is given by Equation (2):

y = F(x, {W_i}) + x    (2)

where x and y are the input and output vectors of the layer under consideration, and F(x, {W_i}) is the residual mapping function to be learned, which can be expressed by Equation (3):

F(x, {W_i}) = W_2 σ(W_1 x)    (3)

where σ is the ReLU activation function.
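The residual mapping of Equations (2)-(3) can be sketched as a one-dimensional PyTorch module. The channel counts are illustrative choices, not the paper's exact configuration; the 1 × 1 shortcut convolution corresponds to the dimension-matching case described later.

```python
# Minimal 1-D residual block: y = F(x, W_i) + x, with
# F = W2 * relu(BN(W1 * x)) as in Equations (2)-(3).
import torch
import torch.nn as nn

class ResidualBlock1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        pad = kernel // 2  # "same" padding for odd kernels
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel, padding=pad)
        self.bn1 = nn.BatchNorm1d(out_ch)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel, padding=pad)
        self.bn2 = nn.BatchNorm1d(out_ch)
        # Identity shortcut when dimensions match; otherwise a 1x1
        # convolution maps x to the output dimension.
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv1d(in_ch, out_ch, 1))
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first conv + BN + ReLU
        out = self.bn2(self.conv2(out))           # residual mapping F(x)
        return self.relu(out + self.shortcut(x))  # Equation (2)
```

Because the block only adds F(x) to x, a block whose residual collapses to 0 reduces to an identity mapping, which is exactly the degradation-avoidance argument above.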

MS-RESCNN
Based on the above description, the RESCNN unit we constructed can be expressed by Equations (4)-(10), where ⊗ is the convolution operator and BN is a batch normalization operation used to improve generalization capability.
In order to extract features from different receptive fields, we used three RESCNN pathways built from multiple RESCNN units, as shown in Fig. 2, and time-series signal fragments were used directly as inputs to the network. Each pathway contained four RESCNN units, each with two convolutional layers and a shortcut. Each convolutional layer was followed by batch normalization and the ReLU activation function. A solid-line shortcut meant that the data could be added directly; a dashed-line shortcut indicated that a 1 × 1 convolution was needed to match dimensions. The output of each pathway was reduced to 512 features by average pooling. The three pathways used different convolution kernels, with sizes set to 1 × 3, 1 × 5 and 1 × 7, respectively.
EEG signals input to the network were first processed by a convolutional layer with a kernel length of 15, then passed through batch normalization and the ReLU activation function, followed by max pooling. The output was then sent to the three pathways with different convolution kernel sizes. Finally, the features of the three pathways were concatenated and connected to a fully connected layer with 1536 neurons, and the network classification result was obtained by the Softmax function.
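The data flow just described (shared stem, three parallel pathways with kernels 3, 5 and 7, pooling to 512 features each, and a fully connected layer over the 1536 concatenated features) can be sketched compactly in PyTorch. For brevity each pathway here is collapsed to a single convolution stage; the paper's pathways contain four residual units each, and channel widths are illustrative assumptions.

```python
# Compact sketch of the multi-scale architecture (not the full MS-RESCNN).
import torch
import torch.nn as nn

class MSResNetSketch(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        # Shared stem: kernel-15 conv + BN + ReLU + max pooling.
        self.stem = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=15, padding=7),
            nn.BatchNorm1d(64), nn.ReLU(), nn.MaxPool1d(2))
        # Three pathways with kernel sizes 3, 5 and 7.
        self.paths = nn.ModuleList([self._make_path(k) for k in (3, 5, 7)])
        self.fc = nn.Linear(3 * 512, n_classes)  # 1536 features -> classes

    def _make_path(self, kernel):
        return nn.Sequential(
            nn.Conv1d(64, 512, kernel, padding=kernel // 2),
            nn.BatchNorm1d(512), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))  # average pool to 512 features

    def forward(self, x):                         # x: (N, 1, T)
        x = self.stem(x)
        feats = [p(x).squeeze(-1) for p in self.paths]
        return self.fc(torch.cat(feats, dim=1))   # Softmax applied in the loss

model = MSResNetSketch()
logits = model(torch.randn(2, 1, 2700))
print(logits.shape)  # torch.Size([2, 5])
```

Raw logits are returned because PyTorch's cross-entropy loss applies the Softmax internally.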
The receptive field of each convolutional layer can be calculated according to Equation (1). The receptive field of the output feature map of the last convolutional layer covered 563 points of the input data. The EEG signals had a sampling frequency of 100 Hz, so the effective frequency resolution of 563 data points was 0.35 Hz, which meets the frequency resolution requirements of all rhythmic waves.

Dataset
The dataset used in the experiment was Sleep-EDFX [36], which contains 153 SC* files and 44 ST* files. Records named SC* are EEG recordings from healthy subjects during 24 hours of normal life, and records named ST* are from subjects with mild sleep disturbances who underwent nighttime sleep tests in hospitals. Signals in both subsets include EOG, Fpz-Cz EEG and Pz-Oz EEG, all sampled at 100 Hz. The annotation files include the sleep stages W, R, S1, S2, S3, S4, M (movement time) and UNKNOWN, scored manually by skilled technicians. In this study, we cropped the SC* files so that only signals from 30 minutes before sleep onset to 30 minutes after sleep end were retained, and data from the Fpz-Cz channel was used. Stages M and UNKNOWN were deleted because of their extremely small percentage. At the same time, according to the latest AASM sleep scoring manual, S3 and S4 were combined into N3.

Preprocessing
In order to adapt to different PSG acquisition devices and to individual differences between subjects, the EEG signals of each subject were normalized using the 5th and 95th quantiles [34], as shown in Equation (11):

x_norm = (x − Q_0.05(x)) / (Q_0.95(x) − Q_0.05(x))    (11)

where x_norm is the normalized signal, x is the original signal, and Q_0.05(x) and Q_0.95(x) are the 5th and 95th percentiles of the signal, respectively. In order to expand the dataset and improve the generalization ability of the network, each input epoch was randomly clipped (the 3000 data points of each epoch were randomly cropped to 2700), flipped with a probability of 50%, and perturbed with random noise scaled by 0.01. Finally, the preprocessed data were assembled with batch size and channel number and converted to the tensor type, as shown in Equation (12):

x ∈ R^(N×C×T)    (12)

where N = 16, C = 1 and T = 2700 are the batch size, the channel number and the number of data points in a single epoch, respectively. In order to feed the sleep stage corresponding to each epoch of EEG into the network, the sleep stages were mapped to integer labels as in Equation (13).
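The per-epoch preprocessing steps can be sketched as follows, assuming one 30 s epoch of 3000 samples at 100 Hz. The exact flip and noise conventions (time reversal, Gaussian noise scaled by 0.01) are illustrative assumptions where the text is not specific.

```python
# Sketch of the preprocessing pipeline: quantile normalization (Eq. (11)),
# random 3000 -> 2700 cropping, 50% random flipping, and additive noise.
import numpy as np

def preprocess_epoch(x, rng, crop_len=2700):
    q05, q95 = np.quantile(x, 0.05), np.quantile(x, 0.95)
    x = (x - q05) / (q95 - q05)                  # quantile normalization
    start = rng.integers(0, len(x) - crop_len + 1)
    x = x[start:start + crop_len]                # random crop to 2700 points
    if rng.random() < 0.5:                       # flip with probability 50%
        x = x[::-1].copy()
    x = x + 0.01 * rng.standard_normal(len(x))   # 0.01-scaled random noise
    return x.astype(np.float32)

rng = np.random.default_rng(0)
epoch = rng.standard_normal(3000)                # stand-in for one EEG epoch
out = preprocess_epoch(epoch, rng)
print(out.shape)  # (2700,)
```

Stacking 16 such epochs and adding a channel axis yields the (N, C, T) = (16, 1, 2700) tensor of Equation (12).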

Network training

Loss function
In order to compute the loss for the multi-class sleep staging problem, cross entropy (CE) is used as the loss function, defined in Equation (14):

CE(input, target) = −weight[target] × log(input[target])    (14)

where input is the one-dimensional array of predicted probabilities for each label after being processed by Softmax, target is the actual label, input[j] is the element of the input array with index j, and weight[target] is the weight of the actual label. Since sleep staging is an unbalanced classification task, this paper balances the differences in the amount of data per label by defining weight[target] as in Equation (15), where p(target) is the proportion of the label target among all labels.
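In PyTorch, this weighted cross entropy is available directly via `nn.CrossEntropyLoss` with a per-class `weight` tensor. The inverse-frequency weighting and the class counts below are illustrative assumptions; Equation (15) in the paper defines the exact weighting from p(target).

```python
# Weighted cross-entropy loss for unbalanced sleep-stage classes.
import torch
import torch.nn as nn

counts = torch.tensor([6000., 1500., 9000., 3000., 4000.])  # epochs per stage
p = counts / counts.sum()        # p(target): proportion of each label
weight = (1.0 / p)               # illustrative inverse-frequency weighting
weight = weight / weight.sum()   # normalized for readability

criterion = nn.CrossEntropyLoss(weight=weight)
logits = torch.randn(16, 5)      # raw network outputs for a batch of 16 epochs
target = torch.randint(0, 5, (16,))
loss = criterion(logits, target) # scalar loss; Softmax is applied internally
```

Note that `nn.CrossEntropyLoss` expects raw logits, applying the Softmax of Equation (14) internally.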

Optimizer
For this loss function, the adaptive moment estimation (Adam) solver was used for optimization. The learning rate was set to 5 × 10⁻⁴, and all other parameters were left at their defaults. In each network training run, the algorithm optimized over all training set data 20 times, and the validation set was then used to obtain the system performance indices.
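A minimal training-loop sketch with these settings (Adam at learning rate 5e-4, defaults elsewhere, 20 optimization passes) is shown below. The model and data are simple stand-ins, not the MS-RESCNN, and the loop optimizes a single fixed batch purely for illustration.

```python
# Training-loop sketch: Adam, lr = 5e-4, 20 passes over the training data.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(2700, 5))  # stand-in classifier
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.CrossEntropyLoss()

x = torch.randn(16, 1, 2700)     # one batch: N=16, C=1, T=2700
y = torch.randint(0, 5, (16,))

losses = []
for _ in range(20):              # 20 optimization passes, as stated above
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

After each training run, the held-out validation set would be evaluated to obtain the performance indices described in the next section.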

Operating environment
The multi-scale residual network was built with PyTorch 1.0 and trained on a GTX 950M GPU under Ubuntu 18.04. Other hardware included an Intel Core i7-4710MQ CPU and 12 GB RAM.

Performance evaluation

Cross validation
The cross validation used in this paper included K-fold cross validation and subject cross validation. The former randomly divided the entire dataset into k subsets, with the epoch as the smallest unit. Each subset was taken in turn as the validation set, with the remaining k − 1 subsets as the training set; the experiment was performed k times, and the results over all validation sets were weighted and summarized to obtain the final result. In this paper, k was set to 5. The latter divided the dataset into a training set and a validation set by subject, with a partition ratio of 8:2.
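The epoch-level K-fold split can be sketched as follows; subject-wise validation would instead partition subject identifiers 8:2 before collecting their epochs.

```python
# Epoch-level k-fold split: every sample is validated exactly once.
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

# Example with 100 epochs and k = 5.
splits = list(kfold_indices(100, k=5))
print(len(splits))  # -> 5
```

Each of the five (train, val) pairs would drive one training run, and the per-fold validation results are then summarized into the final performance figures.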

Performance
Recall (Re_k), accuracy (Acc_k) and specificity (Sp_k) were used to evaluate the sleep staging results. The overall recall (Re_ε), overall accuracy (Acc_ε) and overall specificity (Sp_ε) are expressed by Equations (16)-(18):

Re_k = TP / (TP + FN) (%),  Re_ε = (1/n) Σ_{k=1}^{n} Re_k (%)    (16)

Acc_k = (TP + TN) / (TP + FN + TN + FP) (%),  Acc_ε = (1/n) Σ_{k=1}^{n} Acc_k (%)    (17)

Sp_k = TN / (TN + FP) (%),  Sp_ε = (1/n) Σ_{k=1}^{n} Sp_k (%)    (18)

where TP, TN, FP and FN are the true positive, true negative, false positive and false negative cases formed when the classifier judges each category, respectively. In these equations, n = 5 and k = 1, 2, 3, 4, 5 index the five sleep stages. In addition, the kappa coefficient, which describes the overall performance of the system, was calculated as in Equations (19)-(20).
κ = (p_o − p_e) / (1 − p_e)    (19)

p_e = (Σ_i a_i b_i) / n²    (20)

where p_o is the observed agreement, i.e., the fraction of samples whose real label Y_i equals the label P_{Y_i} predicted by the model; a_i is the actual number of samples in class i, b_i is the predicted number of samples in class i, and n is the total number of samples.
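Equations (16)-(20) can all be computed from a single confusion matrix C, where C[i, j] counts samples with true class i predicted as class j. The toy two-class matrix below is only an example.

```python
# Per-class recall, accuracy, specificity and the kappa coefficient
# from a confusion matrix (Equations (16)-(20)).
import numpy as np

def staging_metrics(C):
    C = np.asarray(C, dtype=float)
    n = C.sum()
    tp = np.diag(C)
    fn = C.sum(axis=1) - tp          # actual class counts minus TP
    fp = C.sum(axis=0) - tp          # predicted class counts minus TP
    tn = n - tp - fn - fp
    recall = tp / (tp + fn)          # Eq. (16)
    acc = (tp + tn) / n              # Eq. (17)
    spec = tn / (tn + fp)            # Eq. (18)
    p_o = tp.sum() / n               # observed agreement
    p_e = (C.sum(axis=1) * C.sum(axis=0)).sum() / n**2  # sum(a_i * b_i) / n^2
    kappa = (p_o - p_e) / (1 - p_e)  # Eq. (19)
    return recall, acc, spec, kappa

C = np.array([[50, 5], [10, 35]])    # toy example, not the paper's results
re, acc, sp, kappa = staging_metrics(C)
print(re.round(3), round(kappa, 3))
```

Averaging the per-class vectors over the n = 5 stages gives the overall indices Re_ε, Acc_ε and Sp_ε.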

5-fold cross validation
The 5-fold cross validation was completed on the Sleep-EDFX dataset, and the results are shown in Table 1. The table contains the classification performance indices of each sleep stage and the overall performance indices computed from the confusion matrix obtained after cross validation. It can be seen that the proposed network provides good classification performance, with high resolution for stage W but poor resolution for N1. In particular, the confusion matrix shows that N1 was easily misjudged as N2 or REM, consistent with the results in [24]. This may be because N1 occupies a relatively small proportion of sleep time, resulting in less training data.

Subject cross validation
The results of subject cross validation on the dataset are shown in Table 2. The overall recall rate, accuracy rate, specificity, error rate and Kappa coefficient of classification were 75.81%, 91.13%, 94.65%, 22.19% and 0.7001, respectively. Memar [26] used a random forest to classify W, S1+S2, S3+S4 and REM, with recall rates of 97.99%, 79.62%, 75.39% and 59.62%, respectively. For comparison, our N1 and N2 stages were combined and the results recalculated from the confusion matrix, giving recall rates of 88.56%, 83.00%, 82.92% and 77.31% for the four stages. Except for W, the recall rates of the other three stages were considerably higher than those in [26]. Tsinalis [24] used a convolutional neural network to perform 20-fold subject cross validation on the Sleep-EDFX dataset. From the confusion matrix given in that paper, we calculated an overall recall rate, accuracy rate, specificity, error rate and Kappa coefficient of 73.7%, 89.91%, 93.62%, 25.23% and 0.6535, respectively; the performance of that method is slightly lower than that of the method proposed in this paper.

Performance comparison with different residual networks
Resnet18 [37], the multi-scale 1D convolutional neural network (Multi-scale-1d-Resnet) [35] and our proposed MS-RESCNN were compared under the same conditions, and the 5-fold cross validation results are shown in Tables 3-5. To accommodate one-dimensional input, all two-dimensional layers in the Resnet18 structure were changed to one-dimensional layers. The results show no significant difference in overall accuracy among the three networks, but in terms of overall recall rate, the network proposed in this paper improved by about 2% over the other two. In terms of N1 resolution, the three network structures differed greatly: the classification recall rate of our network for N1 was 72.67%, more than 10 percentage points better than Resnet18 (61.53%) and Multi-scale-1d-Resnet (58.33%).

Confusion matrix visualization processing
The confusion matrices obtained by the two cross validation methods were normalized by the actual number of labels, and heat maps were then generated, as shown in Fig. 3. The heat maps allow the method proposed in this paper to be evaluated intuitively.
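The normalization used for the heat map divides each row of the confusion matrix by its true-label count, so every row sums to 1. The plotting itself (e.g. with matplotlib's `imshow`) is omitted here; the toy matrix is only an example.

```python
# Row-normalize a confusion matrix by the actual number of labels.
import numpy as np

def normalize_confusion(C):
    C = np.asarray(C, dtype=float)
    return C / C.sum(axis=1, keepdims=True)  # each row now sums to 1

C = np.array([[50, 5], [10, 35]])
norm = normalize_confusion(C)
print(norm)
```

Row normalization makes the per-stage recall appear directly on the diagonal of the heat map, regardless of how unbalanced the stage counts are.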

Overfitting phenomenon in training
In order to observe the training of the network, our program output the error rate of the network on the training set and the validation set in real time. The error rate curves for 5-fold cross validation and subject cross validation are shown in Fig. 4. From the two graphs it can be seen that serious overfitting could still occur during training even though the dataset had been expanded during preprocessing, i.e., the network performed better on the training set than on the validation set. In particular, in 5-fold cross validation the error rate of the validation set (red line) still decreased as the number of iterations increased, even though severe overfitting occurred. In subject cross validation, however, the error rate of the training set kept decreasing after the validation error rate had converged. From another perspective, this phenomenon shows that data from unknown subjects have a great impact on system performance.

Conclusion
In this paper, a new sleep staging method based on a multi-scale residual network was proposed. It can automatically extract useful information from the original single-channel EEG signal and classify sleep stages. Cross validation on the dataset showed that, given a large data volume, system performance can be maintained on data from unseen subjects. Compared with other deep learning methods, our method uses only a single-channel EEG and does not require complex data preprocessing or a specialized feature extraction process to obtain good system performance, which makes the clinical application of automated sleep staging possible. In addition, with sufficient computing capacity the multi-scale residual network could be further deepened and trained with a larger amount of data, so that a model with better robustness and system performance could, in theory, be obtained.

Figure 4 Error rate curve in the training process