Automatic Sleep Stage Classification Based on Two-channel EOG and One-channel EMG

Sleep monitoring with polysomnography (PSG) severely degrades sleep quality. To simplify hygienic processing and reduce the load of sleep monitoring, an approach to automatic sleep stage classification without electroencephalogram (EEG) was explored. In total, 108 features from two-channel electrooculogram (EOG) and 6 features from one-channel electromyogram (EMG) were extracted. After feature normalization, a random forest (RF) was used to classify five stages: wakefulness, REM sleep, N1 sleep, N2 sleep and N3 sleep. Using 114 normalized features from the combination of EOG (108 features) and EMG (6 features), the Cohen's kappa coefficient was 0.749 and the accuracy was 80.8% by leave-one-out cross-validation (LOOCV) on 124 records from ISRUC-Sleep. As a reference following the AASM standard, the Cohen's kappa coefficient was 0.801 and the accuracy was 84.7% for the same dataset based on 438 normalized features from the combination of EEG (324 features), EOG (108 features) and EMG (6 features). In conclusion, the EOG+EMG approach with normalization can reduce the load of sleep monitoring and achieves performance comparable to the "gold standard" EEG+EOG+EMG on sleep classification.


Introduction
Sleep is the process of the body's self-repair and self-recovery. Good-quality sleep eliminates fatigue and restores physical strength and energy.
Generally, the quality of sleep is assessed quantitatively from the macro-sleep structure, sleep breathing, and body movement during sleep. Polysomnography (PSG) is used as the clinical basis for the evaluation of macro-sleep structure. Following the American Academy of Sleep Medicine (AASM) standard, sleep stages are divided into wakefulness, rapid eye movement (REM) sleep, and non-REM (NREM) sleep. NREM sleep is further divided into the N1, N2 and N3 stages. Deep sleep (N3 stage) is mainly conducive to physical recovery. REM sleep is mainly conducive to the backtracking, encoding, consolidation, trimming and even strengthening of knowledge and memory. Breathing and blood oxygen form another dimension of sleep quality assessment. Sleep apnea includes central apnea, obstructive apnea and mixed apnea. Frequent apnea can reduce blood oxygenation and blood pH. Besides, frequent movement during sleep is a characteristic of some types of sleep disorders.
Sleep Stage Classification (SSC) from PSG provides sleep stage information for studying sleep patterns. Nowadays, there are two main research areas of Automatic Sleep Stage Classification (ASSC). Correctly identifying sleep stages is important in diagnosing and treating sleep disorders. Hence, researchers have tried to obtain higher accuracy with respect to manual SSC. As the most accurate method, PSG remains the decisive approach in many cases [1]. A joint classification-and-prediction framework based on convolutional neural networks (CNNs) yielded an accuracy of 82.3% on Sleep-EDF Expanded (Sleep-EDF) and an accuracy of 83.6% on the Montreal Archive of Sleep Studies (MASS) dataset [2]. Yan et al. [3] developed a versatile deep-learning architecture for automatic sleep scoring from raw PSG recordings, and obtained an accuracy of 86% and a kappa coefficient of 0.82 on ISRUC-Sleep. A hybrid stacked LSTM neural network obtained an accuracy of 83.1% and a kappa coefficient of 0.78 on 994 subjects when half of the subjects were randomly assigned to the training set and the other half to the testing set [4].
Specifically, the inter-scorer agreement following the AASM standard [5] is only approximately 82.6% [6]. An excessively high accuracy from ASSC may be caused by over-fitting. It is reasonable to seek a trade-off between boosting the classification performance by integrating as many features as possible and not over-fitting the sleep scoring model to a certain sleep stage type. On the other hand, other studies emphasize reducing the burden of data recording during whole-night sleep, using devices such as wearable, on-bed, and actigraphy devices [7]. Researchers have tried to use feasible devices for sleep monitoring in the community, such as ear-EEG [8], a wireless sensor on the neck for tracheal breathing sounds [9], actigraphy-based devices [7] and a pressure sensor mattress [1].
Although PSG is currently the "gold standard" for sleep monitoring, it requires attaching at least 10 electrodes to the head, the face and the body, which seriously interferes with the subjects' natural sleep. Sleep monitoring with PSG severely degrades sleep quality. As a result, PSG is mainly used in hospitals to monitor patients with severe sleep disorders because of its complex operation and high level of discomfort. After sleep experiments, it is not easy to clean the residual conductive paste used for EEG collection out of the hair.
Six-channel electroencephalogram (EEG), two-channel electrooculogram (EOG) and one-channel electromyogram (EMG) are recommended for sleep scoring according to the AASM standard. If the EEG is not collected during sleep monitoring, the problems of injecting conductive paste into the electrode cap before EEG collection and washing the hair after EEG collection can be avoided. As a result, the hygienic treatment is simplified. Besides, signal acquisition without EEG also reduces the load of sleep monitoring on subjects.
Therefore, the study of EEG-free sleep monitoring methods is of great significance for reducing the load of sleep monitoring. EOG and EMG signals can be acquired in a more comfortable way than EEG. Can EOG + EMG achieve sleep classification performance comparable to that of EEG + EOG + EMG? Focusing on this problem, this paper studies a method for low-load sleep monitoring based on EOG and EMG, and evaluates the role of EOG and EMG signals in sleep staging.

Data acquisition
A public dataset called ISRUC-Sleep [10], scored according to the AASM standard, was used. It includes the sleep disorder groups (two subsets, namely subgroup-I and subgroup-II) and the healthy group (subgroup-III). The database was provided by the Sleep Medicine Centre of Coimbra. It can be downloaded freely from the web site "http://sleeptight.isr.uc.pt/ISRUC_Sleep/". The data provide 126 PSG records and sleep stage labels from two experts. There are 19 channels of physiological data for most PSG records. However, record '8' and record '40' were excluded from subgroup-I for the analysis. The former record does not provide the EEG channels F3 and F4, while the latter suffers from some electrode problems. As a result, a total of 124 records were used in this research, including 98 records from subgroup-I, 16 records from subgroup-II and 10 records from subgroup-III.
The sampling frequency is 200 Hz for each channel. All channels in ISRUC-Sleep had already been filtered by the dataset providers to eliminate noise and undesired background, aiming to enhance the PSG signal quality and increase the SNR. The filtering stage comprised: (1) a notch filter to eliminate the 50 Hz electrical noise; (2) a band-pass Butterworth filter with a lower cutoff frequency of 0.3 Hz and a higher cutoff frequency of 35 Hz for EEG and EOG channels, and a lower cutoff frequency of 10 Hz and a higher cutoff frequency of 70 Hz for EMG channels.
According to the AASM rules, the sleep stages of each subject in the dataset were labeled by two experts independently. Therefore, small differences existed between the annotations of the two experts. If sleep scores from only one expert were used, a bias would arise from that rater's style. As a result, only 30-s sequences with consistent annotations in the two sleep diagrams were extracted for analysis in this paper.

The two EOG channels are unipolar, namely 'LOC-A2' and 'ROC-A1'. FFT is applied to each EOG channel to obtain the power spectral density (PSD). The sum of energy in the sub-bands delta (1-4 Hz), theta (4-8 Hz), alpha (8-13 Hz) and beta (13-30 Hz) in each 30-s period is defined as E_delta, E_theta, E_alpha and E_beta, while the sum of E_delta, E_theta, E_alpha and E_beta is defined as E_sum. The entropy derived from E_delta, E_theta, E_alpha and E_beta is defined as E_Entropy. Similarly, the sum of the absolute value in these sub-bands in each 30-s period is defined as S_delta, S_theta, S_alpha and S_beta, while the sum of them is defined as S_sum. The entropy derived from S_delta, S_theta, S_alpha and S_beta is defined as S_Entropy. The feature vector within the four sub-bands is defined as

EogBand4_Ft16 = [E_Entropy, E_beta/E_delta, E_delta/E_sum, E_theta/E_sum, E_alpha/E_sum, E_beta/E_sum, S_Entropy, S_beta/S_delta, S_delta/S_sum, S_theta/S_sum, S_alpha/S_sum, S_beta/S_sum, S_delta, S_theta, S_alpha, S_beta]    (1)

In the same way, for eleven sub-bands, (0.4-4) Hz, (4-8) Hz, (8-10) Hz, (10-13) Hz, (13-18) Hz, ...    (2)

where OneLeadEog represents LOC_LeadEog for lead 'LOC-A2' and ROC_LeadEog for lead 'ROC-A1'.
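The four-band energy features described above can be sketched as follows. This is a minimal illustration in Python, assuming the PSD of one 30-s period is already available as a list of frequencies (Hz) and a list of power values. The function name `band_energy_features`, the half-open band edges, and the use of Shannon entropy over the normalized band energies are assumptions made for illustration; the text does not state the exact entropy definition.

```python
import math

# Sub-band edges in Hz as given in the text. Treating each band as the
# half-open interval [low, high) is an assumption to avoid double counting.
BANDS = {
    "delta": (1.0, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
}


def band_energy_features(freqs, psd, bands=BANDS):
    """Return band energies, relative energies and their entropy for one epoch."""
    energy = {name: 0.0 for name in bands}
    for f, p in zip(freqs, psd):
        for name, (low, high) in bands.items():
            if low <= f < high:
                energy[name] += p
    e_sum = sum(energy.values())
    if e_sum <= 0.0:
        raise ValueError("no spectral energy inside the requested bands")
    ratio = {name: e / e_sum for name, e in energy.items()}
    # Assumed definition: Shannon entropy of the normalized band energies.
    entropy = -sum(r * math.log(r) for r in ratio.values() if r > 0.0)
    return {"E": energy, "E_sum": e_sum, "ratio": ratio, "E_Entropy": entropy}
```

The same helper could be reused for the S-type features by passing the absolute spectral values instead of the energy values, and for the eleven-band variant by passing a different `bands` dictionary.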
The number of features between the two EOG channels is 10, and the feature vector is as follows,

EogBtwn = [r_beta, r_alpha, r_theta, r_delt, r_org, PLV_beta, PLV_alpha, PLV_theta, PLV_delt, PLV_org]    (3)

The number of features for one-channel EOG is 49, and the number of features between the two EOG channels is 10. Therefore, the total number of features for the two-channel EOG 'LOC-A2' and 'ROC-A1' is 49×2 + 10 = 108, and the whole EOG feature vector is defined as,

EogFeat = [LOC_LeadEog, ROC_LeadEog, EogBtwn]    (4)

Features from single-channel EMG
The fractal dimension of EMG is defined as EmgFD, and the root mean square is defined as EmgStd in every 30 seconds.
The EMG signal of every 30-s period is transformed by the Hilbert transform to obtain the envelope signal. After that, the envelope mean is defined as EnvlpMean, the envelope maximum is defined as EnvlpMax, the envelope root mean square is defined as EnvlpStd, and the ratio of EnvlpMax to EnvlpMean is defined as RtMaxdMean. The total number of features for single-channel EMG is 6, and the whole EMG feature vector consists of EmgFD, EmgStd, EnvlpMean, EnvlpMax, EnvlpStd and RtMaxdMean.

A normal night of sleep in adults may last 8 hours. During such a long period, the recording conditions vary with skin humidity, body temperature, body movements or, even worse, electrode contact loss. Besides, the discriminant information for the considered sleep stage classification lies in relative amplitudes rather than absolute amplitudes.
If the maximum and the minimum values in the feature sequence are taken as the reference for feature normalization, an error may result, because both the maximum and the minimum values may be noise points. For example, suppose most values in a feature sequence are near 1, but one noise point is 100 and another noise point is -10. If the normalization scale is set by the maximum and the minimum values, i.e., 100 + 10 = 110, then most of the values in the normalized feature series are clustered around (1 + 10)/110 = 0.1. Only the former noise point maps to 1 and the latter noise point to 0, which is obviously not the expected result of normalization.
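The effect of the two noise points can be checked numerically. The following short Python sketch applies conventional min-max scaling, (x - min)/(max - min), to the example above; the function name `min_max_normalize` is hypothetical and chosen only for this illustration.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using the sequence minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant sequence: no spread to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Eight typical values near 1 plus the two noise points from the example.
feature = [1.0] * 8 + [100.0, -10.0]
scaled = min_max_normalize(feature)

# The typical values are squeezed into a narrow region near the bottom of
# the [0, 1] range, while the noise points occupy its two ends.
print(scaled[0], scaled[8], scaled[9])
```

In this example the usable range of the normalized feature is almost entirely consumed by two outliers, which is the motivation for a percentile-based scale.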
A new 'quasi-normalization' method is designed in this paper. First, the original feature sequence {a(n)} is sorted in ascending order, and the sorted sequence is defined as {f(n)}. Let n1 be the index of {f(n)} at 10% of its length from the beginning, n2 the index at 50% of its length, and n3 the index at 90% of its length.
The standard deviation of the sequence {f(n1:n3)} is defined as Sd.
Sd = std(f(n1:n3))    (8)

Then, using formula (12) for 'quasi-normalization', most elements in {b(n)} are transformed into the interval [-1, 1], but a few elements fall outside that range. In order to bring all elements into the interval [-2, 2], the truncation of formula (13) is applied, which yields {c(n)}. Finally, the feature sequences {c(n)} are used for classification.

Figure 1 is an example of the quasi-normalization for EmgStd of EMG. For the original index EmgStd, as in Fig. 1b, most elements are lower than 2, but none is lower than 0. After using formula (12) for 'quasi-normalization', most elements in {b(n)} are transformed into the interval [-1, 1], as in Fig. 1c, but some elements are still higher than 2. After using formula (13) for data truncation, elements that are higher than 2 are reset to 2.
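The overall idea of the quasi-normalization can be sketched in Python as below. Only the sorting step, the 10%, 50% and 90% indices, the standard deviation Sd of f(n1:n3), and the final truncation to [-2, 2] are taken from the description. The centring on f(n2), the divisor `scale * Sd`, the default `scale=2.0`, the use of the sample standard deviation, and the function name `quasi_normalize` are assumptions made for this illustration, since the intermediate formulas are not reproduced here.

```python
import statistics


def quasi_normalize(a, lo=0.10, mid=0.50, hi=0.90, scale=2.0, limit=2.0):
    """Percentile-based normalization followed by truncation to [-limit, limit]."""
    if not a:
        return []
    f = sorted(a)                       # {f(n)}: ascending copy of {a(n)}
    n = len(f)
    n1, n3 = int(lo * n), int(hi * n)   # 10% and 90% positions
    n2 = min(int(mid * n), n - 1)       # 50% position
    core = f[n1:n3] if n3 - n1 >= 2 else f
    # Sd from the central part only, so extreme noise points are ignored.
    sd = statistics.stdev(core) if len(core) >= 2 else 0.0
    if sd == 0.0:
        sd = 1.0                        # guard against division by zero
    # ASSUMPTION: centre on f(n2) and divide by scale * Sd.
    b = [(x - f[n2]) / (scale * sd) for x in a]
    # Truncation step: values outside [-limit, limit] are reset to the bound.
    return [max(-limit, min(limit, x)) for x in b]
```

With the earlier example (values near 1 plus noise points 100 and -10), the typical values stay inside [-1, 1] and the two noise points are truncated to 2 and -2, so the noise no longer determines the scale.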

Classification model selection
Random Forest (RF) [13] has several advantages, including strong generalization ability, strong resistance to over-fitting, rapid model training, and a simple structure that is easy to construct. It is therefore suitable for processing high-dimensional data sets without feature selection.

Comparison of classification results
A leave-one-record-out cross-validation (LOOCV) strategy was applied to the mixed group (10 healthy recordings and 114 sleep disorder recordings). The training dataset contained 123 records, while the one remaining record was used as the validation set. This step was repeated 124 times until each record had been tested. The pooled results of the 124 tests formed the final results.
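The leave-one-record-out procedure can be sketched as follows. This is a minimal Python illustration under stated assumptions: `records` is a hypothetical dictionary mapping a record identifier to a pair (feature rows, stage labels), and `fit` and `predict` are placeholder callables standing in for the classifier, so none of these names come from the original method.

```python
def leave_one_record_out(record_ids):
    """Yield (training ids, held-out id) once for every record."""
    for held_out in record_ids:
        train = [r for r in record_ids if r != held_out]
        yield train, held_out


def run_loocv(records, fit, predict):
    """Pool true and predicted labels over all held-out records."""
    y_true, y_pred = [], []
    for train_ids, test_id in leave_one_record_out(list(records)):
        # Train on every record except the held-out one.
        model = fit([records[r] for r in train_ids])
        features, labels = records[test_id]
        y_pred.extend(predict(model, features))
        y_true.extend(labels)
    return y_true, y_pred
```

Splitting by record rather than by epoch keeps all epochs of the held-out subject out of training, so the pooled predictions reflect performance on an unseen recording.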
Furthermore, the results derived from each signal type among EEG, EOG and EMG were compared. The evaluation indices employed include accuracy, the multi-class weighted F1 score [14] and Cohen's kappa coefficient.
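For reference, the three evaluation indices can be computed from pooled label lists as in the following Python sketch. The support-weighted averaging of the per-class F1 scores is an assumption about how the weighted F1 score is defined here; the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of epochs whose predicted stage equals the reference stage."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        total += n * (2 * tp / denom if denom else 0.0)
    return total / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for the agreement expected by chance."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)
```

Kappa is reported alongside accuracy because the stage distribution is unbalanced, and accuracy alone can look favourable when a classifier mostly predicts the dominant stage.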

Classification results for individual record
Sleep stage classification results on subgroup-III from the combination of EOG (108 features) and EMG (6 features) are shown in Table 2 and Table 3.
Classification results of No.3 from subgroup-III are shown in Fig. 4. The classification results are to some extent similar to the manual scoring, as shown by the REM F-score (0.822) and the N3 F-score (0.778). The percentage of the REM stage is similar to that of the manual scoring, as shown in the row of No.3 in Table 3. However, the percentage of the N3 stage (22.0%) derived from the classification is significantly lower than that of the manual scoring (38.5% and 35.0%). The visual evidence is obvious in Fig. 4: the second N3 stage in the second expert's manual scoring is mistaken for the N2 stage, and almost half of the last two N3 stages in the second expert's manual scoring are also mistaken for the N2 stage. Besides, the percentage of the wakefulness stage (17.7%) is significantly higher than that of the manual scoring (10.8% and 10.0%).
The worst-case result from Table 4 is No.10, which is shown in Fig. 5. The classification results are to some extent similar to the manual scoring, as shown by the N3 F-score (0.867) and the wakefulness F-score (0.808). The percentage of the N3 stage (13.9%) is similar to that of the manual scoring (14.1%), as shown in the row of No.10 in Table 3. However, the percentage of the N1 stage (6.9%) derived from the classification is significantly lower than that of the manual scoring (27.8%). Besides, the percentage of the N2 stage (42.7%) is significantly higher than that of the manual scoring (22.7%).

... is that only 10 records were used, with 5 records for training and the other 5 records for testing.
In multi-class sleep staging, the best discrimination was achieved by the combination of EEG + EMG + EOG, in which the highest F1-score was 0.894 for both the awake and N3 stages in Table 6, followed by the REM (0.884) and N2 (0.836) stages. However, the lowest F1-score was found in the detection of stage N1 (0.528). The order of the F1-scores for the detection of single stages was consistent with the order of the accuracies in the research of Khalighi et al. [14], with awake (88.59%) > N3 (87.13%) > REM (86.99%) > N2 (79.06%) > N1 (66.91%).
Most studies in Table 6 did not use the whole ISRUC-Sleep dataset for validation. It is not a fair comparison when one study uses more than 100 records for validation and another uses only 5 records. Using 99 records for validation with 6EEG + 2EOG + 1EMG + 1ECG, deep learning [3] obtained an accuracy of 86% and a Cohen's kappa coefficient of 0.82. In comparison, our proposed method used 124 records for validation, and obtained an accuracy of 84.7% and a Cohen's kappa coefficient of 0.807 with 6EEG + 2EOG + 1EMG. The main difference lies in the N1 stage, for which the F-score is 0.528 from our method with 6EEG + 2EOG + 1EMG, whereas the deep learning method [3] obtained an F-score of 0.67.
Furthermore, for the training and testing sets as in Table 7, our method obtains performance comparable with the method of Md Mosheyur Rahman [16].

As shown in Table 8, our proposed feature normalization improves the performance in comparison with the original features. For traditional feature normalization, i.e., y(n) = (x(n) - mean(X))/(max(X) - min(X)), the performance is even worse than with the original features. The reason is that physiological signals often have significant individual characteristics. For example, although the lowest EMG amplitudes in most subjects occurred during deep or REM sleep, a few subjects tended to be different, and the same holds for the highest EMG amplitudes during wakefulness. This is the reason why traditional feature normalization should be reconsidered for noisy physiological signals.
If the features are normalized in the traditional way, they risk being greatly influenced by noise; that is, the test accuracy will be greatly reduced. Therefore, the proposed feature normalization is superior to the traditional way in terms of generalization ability.
The influence of the positions of the indices n1 and n3 in formulas (8)-(10) is also tested. As shown in Table 8, changes in their positions induce fluctuations in accuracy of less than 1%. Evidently, the proposed classification method benefits from feature normalization.

Currently, the gold standard for sleep monitoring is PSG recorded in the hospital or in a sleep laboratory. PSG remains a complex, highly demanding and obtrusive procedure [1]. Shortcomings such as uncomfortable sleep assessment, high cost and labour intensity limit the utility of PSG. In contrast, unattended and portable non-medical devices, also referred to as electronic gadgets, deliver highly unobtrusive measurements at the expense of accuracy and reliability [1]. The key question is "how to assess sleep stages and sleep quality less intrusively but more reliably" [1]. Improving the classification accuracy and reducing the degree of intrusion during sleep are conflicting goals.
When signals for sleep recording are selected from EEG, EOG and EMG signals, the more signal types are used, the higher the classification accuracy.
However, for reducing the burden on subjects and reducing sleep disturbance, the fewer the signal types, the better. According to the AASM, six-channel unipolar EEG acquisition needs 8 electrodes, two-channel EOG acquisition needs 3 electrodes, and one-channel EMG acquisition needs 2 electrodes. These electrodes and their cables will greatly influence the sleep quality.
The brain provides the most useful information about sleep regulation. EEG is the most important signal that directly reflects the state of the brain in sleep [19]. However, due to the structure of the scalp and the effects of the hair, novel materials or electrodes that allow EEG acquisition with both comfort and a high signal-to-noise ratio are still hard to obtain [19].
Using only one type of physiological signal can further reduce the number of electrodes attached to the body and the physiological load. When only one physiological signal is selected from EOG and EMG, as shown in Table 6, the precision for sleep staging derived from EOG is much higher than that from EMG, but lower than that from EEG alone. EOG-based sleep staging performs worse than EEG-based staging, because the criteria for scoring sleep stages mainly depend on the characteristics of EEG signals.
Unlike EEG electrodes, EOG electrodes can be placed below the hairline as self-adhesive electrodes without the assistance of experts [16]. EOG is highly useful for identifying wakefulness and REM sleep, since there are major eye movements during these stages. Eye movements tend to slow down with the depth of sleep. EOG signals, which record the movement of the eyes, are a fundamental indicator for distinguishing between REM and NREM stages [19]. As shown in Fig. 2b, PLV_org of EOG is usually larger than 0 during N2 and N3 sleep, and it is usually smaller than 0 during wakefulness and REM sleep. Visual examples are shown in Fig. 6 and Fig. 7. During REM sleep in Fig. 6, the polarities of the left EOG and the right EOG are opposite, i.e., when a peak appears in the left EOG, a valley appears in the right EOG, such as at the position of 12 s.
During N3 sleep in Fig. 7, the polarities of the left EOG and the right EOG are consistent, i.e., when a peak appears in the left EOG, a similar peak also appears in the right EOG, such as at the position of 5 s. Furthermore, the waveforms in EOG are to some extent similar to those of EEG, which is obvious in Fig. 7.
Compared with the small, microvolt-level amplitude variations of EEG, EOG and EMG amplitudes are at the millivolt level and require less stable contact with the body, which makes them less sensitive to noise and more suitable for unobtrusive measurement apparatus. Hence, the combination of EOG and EMG is a good choice for sleep monitoring.

Sleep monitoring by EOG and EMG
The EOG makes the main contribution to sleep scoring derived from the combination of EOG and EMG. The placements of the EOG electrodes are close to the placements of the Fp1 and Fp2 EEG electrodes [19]. Hence, the EOG recordings are highly influenced by a portion of the EEG activities [19]. During NREM sleep, EOG has a waveform similar to that of the EEG signals recorded at the frontal poles [19].
In addition, using 438 normalized features from the combination of EEG (324 features), EOG (108 features) and EMG (6 features

Conclusion
On a public data set called ISRUC-Sleep, comparative analysis suggests that the sleep classification performance achieved by EOG+EMG is comparable with that of EEG+EOG+EMG. The EOG makes the main contribution to sleep scoring derived from the combination of EOG and EMG. The proposed method based on the combination of EOG and EMG can achieve performance comparable to that of using EEG signals for sleep staging.

(Figure caption: EOG and EEG during N3 sleep.)