Automatic sleep stage classification based on a two-channel electrooculogram and one-channel electromyogram

Objective. Sleep monitoring by polysomnography (PSG) severely degrades sleep quality. In order to reduce the load of sleep monitoring, an approach to automatic sleep stage classification without an electroencephalogram (EEG) was proposed. Approach. A total of 124 records from the public dataset ISRUC-Sleep incorporating American Academy of Sleep Medicine (AASM) standards were used: 10 records were from the healthy group while the others were from sleep disorder groups. The 124 records were collected from 116 subjects (eight subjects had two records each, the others had one record each) with ages ranging from 20 to 85 years. A total of 108 features were extracted from the two-channel electrooculograms (EOGs) and six features were extracted from the one-channel electromyogram (EMG). A novel ‘quasi-normalization’ method was proposed and used for feature normalization. Then the random forest algorithm was used to classify five stages, including wakefulness, rapid eye movement sleep, N1 sleep, N2 sleep and N3 sleep. Main results. Using 114 normalized features from the combination of EOG (108 features) and EMG (6 features) data, Cohen’s kappa coefficient was 0.749 and the accuracy was 80.8% by leave-one-out cross-validation. As a reference for AASM standards using a computer-assisted method, Cohen’s kappa coefficient was 0.801 and the accuracy was 84.7% for the same dataset based on 438 normalized features from a combination of EEG (324 features), EOG (108 features) and EMG (6 features) data. Significance. A combination of EOG and EMG can reduce the load of sleep monitoring, and achieves comparable performance to the ‘gold standard’ signals of EEG, EOG and EMG for sleep stage classification.


Introduction
Sleep is the body's self-repair and self-recovery process (Matar et al 2018). Good-quality sleep will eliminate fatigue and restore physical strength. Generally the quality of sleep is assessed quantitatively from the macrosleep structure, breathing and body movement during sleep (Mendona et al 2019). Polysomnography (PSG) is used as the basis in the clinic for the evaluation of macro-sleep structure. Following American Academy of Sleep Medicine (AASM) standards (Iber et al 2007), sleep stages are divided into wakefulness, rapid eye movement (REM) sleep and non-REM (NREM) sleep. Furthermore, NREM sleep is divided into stages N1, N2 and N3 (Iber et al 2007). Deep sleep (stage N3) is mainly conducive to physical recovery (Matar et al 2018). REM sleep is mainly conducive to the backtracking, encoding, consolidation, trimming and even strengthening of the memory (Matar et al 2018). Breathing and blood oxygen are other dimensions to sleep quality assessment (Mendona et al 2019). Frequent apnea can cause the reduction of both oxygenation and pH value in blood (Mendona et al 2019). Also, frequent movement during sleep is a characteristic of some types of sleep disorder (Mendona et al 2019).
Sleep stage classification (SSC) from PSG provides sleep stage information for the study of sleep patterns (Matar et al 2018). Correct identification of sleep stages is important in diagnosing and treating sleep disorders (Phan et al 2019). Hence, researchers have tried to obtain high accuracy with respect to manual SSC (Yan et al 2021). As the most accurate method, PSG is the decisive approach in many cases (Matar et al 2018).

Data acquisition
A public dataset called ISRUC-Sleep (Khalighi et al 2016) with AASM standards was used, it includes sleep disorder groups (two subsets, namely subgroup I and subgroup II) and a healthy group (subgroup III). The database was provided by the Sleep Medicine Centre of Coimbra. It can be downloaded freely from the website http://sleeptight.isr.uc.pt/ISRUC_Sleep/. More diagnostic information is presented on the website.
The data comprise 126 PSG records and sleep stage labels from two experts. There are 19 channels of physiological data for most PSG records. However, record '8' and record '40'were excluded from subgroup I for the analysis in this paper. The former record does not provide EEG channels F3 and F4, while the latter one suffers some electrode problems. As a result, a total of 124 records were used in this research as shown in table 1, including 98 records from subgroup I, 16 records from subgroup II and 10 records from subgroup III. All records in table 1 were combined as the mixed group (10 healthy recordings and 114 sleep disorder recordings) for sleep scoring in this paper.
The proposed method focused on two-channel EOG and one-channel EMG, while a combination of sixchannel EEG, two-channel EOG and one-channel EMG was used for comparison. The sampling frequency was 200 Hz for each channel. All these channels in ISRUC-Sleep had been filtered to eliminate noise and undesired background by the dataset itself, aiming to enhance the PSG signal quality and increase the signal to noise ratio (SNR). The filtering stage comprised: (1) a notch filter to eliminate the 50 Hz electrical noise; (2) a band-pass Butterworth filter with a low cutoff frequency at 0.3 Hz and a high cutoff frequency at 35 Hz for both EEG and EOG channels, and a low cutoff frequency at 10 Hz and a high cutoff frequency at 70 Hz for EMG channels.
The sleep stages of each subject in the dataset were labeled by two experts individually. Therefore, small differences existed in annotations between two experts. If sleep scores from only one expert were used, a bias would produce from a rater's style. As a result, only those stages with consistent labels from two experts were extracted for analysis in this paper to avoid personal inclination.
2.2. Feature extraction 2.2.1. Features from single-channel EOG Two EOG channels are unipolar, namely 'LOC-A2' and 'ROC-A1'. In this paper the periodogram power spectral density (PSD) is estimated through the fast Fourier transform (FFT) of each EOG channel during each 30 s period using the MATLAB ® function 'periodogram'. The sum of energy in sub-bands delta (1-4 Hz), theta (4-8 Hz), alpha (8-13 Hz) and beta (13-30 Hz) in each 30 s period is defined as E delta , E theta , E alpha and E beta , respectively, while the sum of E delta , E theta , E alpha and E beta is defined as E sum . The Shannon entropy derived from E delta , E theta , E alpha and E beta is defined as E Entropy . Similarly, the sum of the absolute value in these sub-bands in each 30 s period is defined as S delta , S theta , S alpha and S beta , while the sum of them is defined as S sum . The Shannon entropy derived from S delta , S theta , S alpha and S beta is defined as S Entropy . The feature vector derived from the four sub-bands is defined as
The number of features for single-channel EOG is 49. The feature vector is as follows: where OneLeadEog represents LeadEog LOC for lead 'LOC-A2' and LeadEog ROC for lead 'ROC-A1'.

Correlation features between two-channel EOG
Temporal signals within sub-bands delta (1-4 Hz), theta (4-8 Hz), alpha (8-13 Hz) and beta (13-30 Hz) are derived from individual Finite Impulse Response band-pass filters from original EOG data within the frequency band 1-4 Hz, 4-8 Hz, 8-13 Hz and 13-30 Hz, respectively. The Pearson correlation coefficients between twochannel EOG in four sub-bands during each 30 s period are defined as r delta , r theta , r alpha and r beta , respectively (Li et al 2016). The Pearson correlation coefficient between two-channel EOG with the original waveform during each 30 s period is defined as r org . In the same way, the phase-locking value (PLV) is obtained, including PLV beta , PLV alpha , PLV theta , PLV delt and PLV org .
The number of features between two-channel EOG is 10, and the feature vector is as follows: The number of features for one-channel EOG is 49, and the number of features between two-channel EOG is 10. Therefore, the total number of features for two-channel EOG 'LOC-A2'and'ROC-A1' is 49×2+10=108, and the whole vector for EOG is defined as  The EMG signal of every 30 s period is transformed by the Hilbert transform to obtain the enveloping signal. After that, the enveloping mean is defined as EnvlpMean. The enveloping maximum is defined as EnvlpMax. The enveloping root mean square is defined as EnvlpStd, and the ratio of EnvlpMax to EnvlpMean is defined as RtMaxdMean. The total number of features for single-channel EMG is 6, and the whole vector for EMG is defined as

Whole feature vector
For classification based on EOG and EMG, the whole feature vector with 114 features from two-channel EOG (108 features) and one single-channel EMG (6 features) is defined as follows:
For comparison with AASM standards using a computer-assisted method, the whole feature vector with 438 features from two-channel EOG (108 features), one single-channel EMG (6 features) and six-channel EEG (324 features) is defined as follows:

Characteristic normalization
One normal sleep in adults may last for 8 h. During such a long period, the recording conditions, such as skin humidity, body temperature, body position, body movements or even loss of electrode contact, may change from time to time. The discriminant information for sleep stage classification lies in relative amplitudes rather than absolute amplitudes. Therefore, amplitudes of electric physiological signals will change from time to time. Hence normalization of features is important.
If the maximum and the minimum values in the feature sequence are taken as the reference for feature normalization it may cause an error, because both the maximum and the minimum values may be noise points. For example, suppose most values in a feature sequence are near 1, but one noise point is 100 and another noise point is −10. If the scale is normalized according to the maximum and the minimum values, i.e. 100-(−10) =110, then most values in the normalized feature series are clustered around 0.01. Besides, only the noise point 100 is scaled to near 1 and the noise point −10 is scaled to near 0, which is obviously not the expected result of normalization.
A new 'quasi-normalization' method is designed in this paper. First, the original feature sequences {a(n)} are arranged in order from small to large, which is defined as {f (n)}. We set the series number of {f (n)} at the position of 10% of the distance from the beginning as n1, the position of 50% distance from the beginning as n2 and the position of 90% distance from the beginning as n3.
The standard deviation (Sd) of the sequences {f (n1:n3)} is defined by formula (8) Then, using formula (12) for 'quasi-normalization', most elements in {b(n)} are transformed into the interval [−2, 2], but a few elements are outside that range. In order to locate all elements in the interval [−2, 2], the following transform is applied: Finally, the feature sequences {c(n)} are used for classification. Figure 1 shows an example of quasi-normalization for EnvlpStd of EMG. For the original index EnvlpStd, most elements are lower than 2 in figure 1(a), but none is lower than 0. After arranging in order from small to large, sequences f (n) are obtained. As it is shown in figure 1(b), n1 is at the position of 10% of the distance of f (n) from the beginning, n2 is at the position of 50% distance from the beginning, and n3 is at the position of 90% distance from the beginning. Values of {f (n1), f (n2), f (n3)} are used for quasi-normalization. After using formula (12) for 'quasi-normalization', most elements in {b(n)} are transformed into the interval [−2, 2], as in figure 1(c), but some elements are still higher than 2. After using formula (13) for data truncation, elements that are higher than 2 are reset as 2. Figure 2 is an example of quasi-normalization for EmgStd of EMG. For the original index EmgStd, most elements are lower than 2 in figure 2(b), but none is lower than 0. After using formula (12) for 'quasinormalization', most elements in {b(n)} are transformed into the interval [−2, 2], as in figure 2(c), but some elements are still higher than 2. After using formula (13) for data truncation, elements that higher than 2 are reset as 2. Figure 3 is an example of quasi-normalization for PLV org of EOG. For the original index PLV org of EOG, as in figure 3(b), all elements are in the range [−1, 0.5]. After using formula (12) for 'quasi-normalization', most elements in {b(n)} are transformed into the interval [−2, 2], as in figure 3(c), but some elements are lower than −2. After using formula (13) for data truncation, elements that lower than −2 are reset as −2.

Classification model selection
The RF algorithm (Pan et al 2018) has some wonderful advantages, including strong generalization ability, strong anti-over-fitting ability, rapid model training, simple structure and easy construction, making it suitable for processing high-dimensional data sets without the need for feature selection.
In this paper, the number of trees is also set as 500 (Pan et al 2018). When the number of trees is too huge, the calculation speed will slow down, but the accuracy is almost unaffected.

Comparison of classification results
The leave-one-record-out cross-validation (LOOCV) strategy is applied to the mixed group (10 healthy recordings and 114 sleep disorder recordings). The training dataset contains 123 records while the other dataset is used as the validation set. This step repeats 124 times until each record has been tested. The combination of the 124 times of testing forms the final results.
Furthermore, results are obtained that are derived from each signal type among EEG, EOG and EMG. Evaluation indices are employed, including accuracy, the multi-class weighted F1-score (Khalighi et al 2013) and Cohen's kappa coefficient. Cohen's kappa coefficient is a more effective evaluator because it takes prior probability into account (Pan et al 2018).

Classification results for individual records
The results of sleep stage classification of subgroup III from the combination of EOG (108 features) and EMG (6 features) are shown in table 2. According to Cohen's kappa coefficients, results from 8 out of 10 records are in substantial agreement, with kappa coefficients in the range [0.6, 0.8]. Record no. 1 shows almost perfect agreement (0.801), but record no. 10 is only in moderate agreement (0.542).
Classification results for record no. 6 from subgroup III are shown in figure 4. Classification results are similar with manual scoring, supported by F-scores (N3, 0.930; N2, 0.862; awake, 0.811; REM, 0.772). The percentage of each stage is also similar to the manual scoring, as shown in the row for subject 6 in table 3.
Classification results for subject 3 from subgroup III are shown in figure 5. Classification results are to some extent similar to manual scoring, supporting by F-scores (REM 0.822; N3 0.778). The REM stage percentage is similar to manual scoring, as shown in the row for subject 3 in table 3. However, the N3 stage percentage (22.0%) derived from the classification is significantly lower than with manual scoring (38.5% from the first expert, and 35.0% from the second expert). The visual evidence is very obvious in figure 5, as the second N3 stage in the second expert's manual scoring is mistaken for N2 stage by the proposed method, and almost half of the last two N3 stages in the second expert's manual scoring is also mistaken for N2 stages. The wakefulness stage percentage (17.7%) is significantly higher than with manual scoring (10.8% from the first expert and 10.0% from the second expert).
The worst-case result from table 2 is for subject 10, as shown in figure 6. Classification results are to some extent similar to manual scoring, supported by the F-scores (N3, 0.867; wakefulness, 0.808). The N3 stage percentage (13.9%) is similar to the result with manual scoring (14.1%), as shown in the row for subject 10 in table 3. However, the N1 stage percentage (6.9%) derived from the classification is significantly lower than the

Classification results for all records
The results of sleep stage classification from the combination of EOG (108 features) and EMG (6 features) are shown in table 4, and results from the combination of EEG (324 features), EOG (108 features) and EMG (6 features) are shown in table 5. More details are shown in table 6.
When classification results were derived from only one type of the signal from EEG, EOG and EMG, Cohen's kappa coefficients from high to low were in the order C-EEG>F-EEG>O-EEG>EOG>EMG, as shown in table 6.
In multi-class sleep staging, the best discrimination was achieved by a combination of EEG, EMG and EOG, in which the highest F1-score was 0.894 for both awake and N3 stages in table 6, followed by REM (0.884) and N2 (0.836) stages. However, the lowest F1-score was for the detection of the N1 stage (0.528). The order of F1scores for detection of a single stage is consist with the order of accuracy of the research of Khalighi et al (2013), with awake (88.59%)>N3 (87.13%)> REM (86.99%) >N2 (79.06%) >N1 (66.91%) (Khalighi et al 2013).
Most studies in table 6 did not use the whole ISRUC-Sleep dataset for cross-validation. It is not a fair comparison when one study uses more than 100 records for validation and another one only use five.
Furthermore, for training and testing sets as in table 7, our method has comparable performance with the method of Rahman et al (2018).
Using 99 records for validation from the combination of six-channel EEG, two-channel EOG, one-channel EMG and one-channel ECG, deep learning (Yan et al 2021) in table 7 achieved an accuracy of 86% and Cohen's kappa coefficient of 0.82. In comparison with that, our proposed method used 124 records for validation, and obtained an accuracy of 84.7% and Cohen's kappa coefficient of 0.807 from the combination of six-channel EEG, two-channel EOG and one-channel EMG. The main difference is that the F-score with our method is only 0.528 but the deep-learning method (Yan et al 2021) had a better F-score of 0.67 for the N1 stage.

Best features for two-stage classification
The best features from the combination of EEG (324 features), EOG (108 features) and EMG (6 features (1) Three types are easy to classify, including {N1, N3}, {R, N3} and {W, N3}, which all contain N3 but with the other stage nonadjacent to N3. They are labeled as '()' in the legend of figure 7. As shown in   EEG indices of ratios of beta to delta, and three EEG indices of percentage of with frequency higher than 36 Hz.
According to table 8, most good features are associated with EEG. However, there are a few features related to EOG or EMG. The enveloping mean from EMG is the seventh best feature for discriminating stages {W, R}. Energy powers of the alpha band in EOG are the third and the fourth best features for discriminating stages {W, N1}. One remarkable finding is that the best 13 features for discriminating stages {R, N1} are all non-EEG features.    Using only one type of physiological signal can further reduce the number of electrodes attached to the body. The brain provides the most useful information about sleep regulation. The EEG is the most important signal that directly reflects the state of the brain during sleep (Sun et al 2019). When only one physiological signal is selected from EOG and EMG, as shown in table 6, the precision for sleep staging derived from EOG alone is much higher than that from EMG alone, but lower than EEG alone. EOG-based sleep staging is relatively poorer than the method using EEG, because the criterion of scoring sleep stages mainly depends on the characteristics of EEG signals (Sun et al 2019).
EOG is very useful for identifying wakefulness and REM sleep, since there are eye movements during these stages. Eye movements tend to slow down with increase in the depth of sleep. EOG records the movement of the    Table 9. Comparison of performance with normalized or original features by LOOCV on subgroup III of the dataset ISRUC-Sleep using two-channel EOG and one-channel EMG.  figure 3(b), PLV org of EOG is usually greater than 0 during N2 and N3 sleep and it is usually less than 0 during wakefulness and REM sleep. Visual examples are shown in figures 9 and 10. During REM sleep in figure 9, the amplitude polarity between left EOG and right EOG is opposite, i.e. when a peak shows in the left EOG, a valley will show in the right EOG, such as the position at 12 s. During N3 sleep in figure 10, the amplitude polarity between the left EOG and the right EOG is consistent, i.e. when a peak shows in the left EOG, a similar peak will also show in the right EOG, such as the position at 5 s. Furthermore, the waveforms in the EOG are to some extent similar to those of the EEG, which is obvious in figure 10.
Compared with the microvolt values of EEG's small-amplitude variations, EOG and EMG values are in millivolts and require less stable contact with the body (Sun et al 2019), which make them less sensitive to noise and more suitable for unobtrusive measurement apparatus. Hence, the combination of EOG and EMG is a good choice for sleep monitoring.

Sleep monitoring by the combination of EOG and EMG
EOG makes the main contribution to sleep scoring derived from the combination of EOG and EMG. The placements of EOG electrodes are close to the placements of Fp1 and Fp2 EEG electrodes (Sun et al 2019). Hence, the EOG recordings are highly influenced by a portion of EEG activity (Sun et al 2019). During NREM sleep, EOG is similar in waveform to the EEG signals that are recorded at the frontal poles (Sun et al 2019).
In addition, using 438 normalized features from the combination of EEG (324 features), EOG (108 features) and EMG (6 features), Cohen's kappa coefficient by RF from LOOCV (N=124) was 0.801 and the accuracy was 84.7%. The F1-scores were 0.894, 0.884, 0.528, 0.836 and 0.894 for wakefulness, REM sleep, N1 sleep, N2 sleep and N3 sleep, respectively. Consequently, the sleep classification performance achieved by EOG and EMG is comparable to that of EEG, EOG and EMG.

Limitation
ISRUC-Sleep is an extremely unbalanced dataset, including 116 records from subjects with sleep disorders and only 10 records from healthy subjects. This research used 114 records from sleep disorder subjects and all 10 records from healthy subjects. Therefore sleep disorder records make the main contribution to the classification of sleep stages. Whether sleep scoring models derived from a healthy group can be applied to a sleep disorder group or not is still unknown, and vice versa. Balanced numbers in sleep disorder subjects and healthy subjects would be very important to answer that question. In the future, it will be necessary to study the classification effects on a balanced dataset from sleep disorder and healthy subjects, respectively.
According to the percentage of wakefulness and REM stage based on visual scoring from two experts in table 3, the healthy subjects did not have normal sleep but disturbed sleep. It holds true that PSG monitoring requires the attachment of many electrodes to the head, face and body, therefore it seriously interferes with a subject's natural sleep. Therefore bad-quality sleep records make the main contribution to the classification of sleep stages, not only from sleep disorder subjects but also from healthy subjects.
The third limitation is that there is no process for feature selection in this paper. Also, the best combination of features with as few features as possible was not found either.

Conclusion
On a public dataset called ISRUC-Sleep, comparative analysis suggests that the sleep classification performance achieved by EOG and EMG is comparable to that of EEG, EOG and EMG. EOG makes the main contribution to sleep scoring derived from the combination of EOG and EMG. The proposed method from the combination of EOG and EMG can achieve comparable performance to the use of EEG signals for sleep staging.