Dataset
For this study, data from a total of 198 subjects, aged 18 and older and not undergoing continuous positive airway pressure therapy, were obtained from the SOMNIA database [25]. No specific selection criteria were imposed based on body mass index (BMI), sex, or AHI. All subjects underwent routine PSG monitoring at the Kempenhaeghe Center for Sleep Medicine in Heeze, the Netherlands, between June 2017 and November 2017. A summary of their demographic and clinical characteristics is presented in Table 2.
As part of the recommended set of sensors for in-lab PSG, the montage included ECG, recorded using a modified lead II configuration with electrodes from Kendall (Ashbourne, Ireland), and RE, acquired using a Sleepsense (Elgin, USA) RIP belt mounted around the thorax.
Sleep stages and SDB events (obstructive, central, and mixed apneas, and hypopneas) were scored from the PSG by qualified sleep technicians, following the guidelines of the American Academy of Sleep Medicine (2015 rules). Specifically, hypopneas were confirmed by the presence of an SpO2 desaturation of at least 3% or of a cortical arousal.
Table 2. Demographic and clinical characteristics of the study population.

| Characteristic | Value |
| --- | --- |
| Male/Female | 121/77 |
| Age, years | 50.1 ± 14.8 (range: 18–86) |
| BMI, kg/m² | 27.2 ± 4.8 (range: 18.6–45.2) |
| AHI, events/hour | 18.0 ± 18.4 (range: 0–108.4) |
| AHI < 5, n | 48 |
| 5 ≤ AHI < 15, n | 67 |
| 15 ≤ AHI < 30, n | 46 |
| AHI ≥ 30, n | 37 |

Note: Age, BMI, and AHI are presented as mean ± SD (range) over subjects; the AHI severity rows report the number of subjects per class.
Signal pre-processing and segmentation
The pre-processing and segmentation procedures closely followed the methodology of our previous work [15]. Specifically, for the ECG signals, R-peaks were initially detected using an algorithm based on nonlinear transformation and a simple peak-finding strategy [26]. Subsequently, a post-processing algorithm was employed to precisely localize the QRS complexes and eliminate artifacts [27]. Additionally, the method proposed by Mateo and Laguna [28] was applied to address ectopic beats: periods containing artifacts or ectopic beats were marked with a value of 0 for exclusion. The resulting RR intervals were then subjected to linear interpolation and resampled at a frequency of 4 Hz.
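As an illustration, a minimal sketch of this step is given below, assuming the detected R-peak times are already available in seconds and that beats flagged as artifacts or ectopic have been marked for exclusion; the function and variable names are our own, not those of the original implementation.

```python
import numpy as np
from scipy.interpolate import interp1d

def rr_to_tachogram(rpeak_times_s, valid_mask=None, fs_out=4.0):
    """Convert R-peak times (in seconds) into an evenly sampled RR-interval series.

    valid_mask: optional boolean array over beats; beats marked False (artifacts,
    ectopic beats) are dropped before interpolation, mimicking the exclusion step.
    """
    rpeak_times_s = np.asarray(rpeak_times_s, dtype=float)
    rr = np.diff(rpeak_times_s)            # RR intervals in seconds
    t_rr = rpeak_times_s[1:]               # timestamp each interval at its ending beat
    if valid_mask is not None:
        keep = np.asarray(valid_mask, dtype=bool)[1:]
        rr, t_rr = rr[keep], t_rr[keep]
    t_out = np.arange(t_rr[0], t_rr[-1], 1.0 / fs_out)
    rr_4hz = interp1d(t_rr, rr, kind="linear")(t_out)   # linear interpolation, 4 Hz grid
    return t_out, rr_4hz
```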
Similarly, RE signals were resampled at a rate of 4 Hz to ensure consistency with the RR time series. During the resampling process, high-frequency noise (> 2 Hz) was eliminated. Additionally, a high-pass filter with a cut-off frequency of 0.05 Hz was utilized to remove low-frequency noise from the RE signals.
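The corresponding RE pre-processing could look as follows. The filter orders and the anti-aliasing strategy are illustrative assumptions; only the cut-off frequencies (2 Hz and 0.05 Hz) and the 4 Hz output rate come from the description above.

```python
from fractions import Fraction
from scipy.signal import butter, filtfilt, resample_poly

def preprocess_re(re_signal, fs_in, fs_out=4.0):
    """Band-limit a respiratory effort (RIP) signal and resample it to 4 Hz."""
    # Low-pass below 2 Hz to remove high-frequency noise (and avoid aliasing).
    b_lp, a_lp = butter(4, 2.0, btype="low", fs=fs_in)
    x = filtfilt(b_lp, a_lp, re_signal)
    # Resample to the target rate using a rational approximation of the ratio.
    ratio = Fraction(fs_out / fs_in).limit_denominator(1000)
    x = resample_poly(x, ratio.numerator, ratio.denominator)
    # High-pass at 0.05 Hz to remove low-frequency (baseline) noise.
    b_hp, a_hp = butter(2, 0.05, btype="high", fs=fs_out)
    return filtfilt(b_hp, a_hp, x)
```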
Next, all signals were segmented into 5-minute segments with an overlap of 2 minutes. Subsequently, “soft” min-max normalization was applied to the RR interval and RE time series of each segment, with the minimum and maximum values set to the 5th and 95th percentiles, respectively. Finally, the RR and RE data of each segment were stacked to form an input bivariate vector of dimensions 1200 samples × 2 channels.
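A sketch of the segmentation and "soft" min-max normalization is shown below, under the assumption that the RR and RE series are already aligned at 4 Hz; names and epsilon are illustrative.

```python
import numpy as np

def soft_minmax(x, p_lo=5, p_hi=95):
    """Scale using the 5th/95th percentiles instead of the true min/max,
    so that outliers do not dominate the normalization range."""
    lo, hi = np.percentile(x, [p_lo, p_hi])
    return (x - lo) / (hi - lo + 1e-8)

def make_segments(rr_4hz, re_4hz, fs=4, seg_s=300, overlap_s=120):
    """Cut 4 Hz RR and RE series into 5-minute segments with 2-minute overlap
    and stack them into an array of shape (n_segments, 1200, 2)."""
    seg_len = seg_s * fs                       # 1200 samples per segment
    step = (seg_s - overlap_s) * fs            # 720-sample stride (3-minute shift)
    n = min(len(rr_4hz), len(re_4hz))
    segments = []
    for start in range(0, n - seg_len + 1, step):
        rr_seg = soft_minmax(rr_4hz[start:start + seg_len])
        re_seg = soft_minmax(re_4hz[start:start + seg_len])
        segments.append(np.stack([rr_seg, re_seg], axis=-1))
    return np.asarray(segments)
```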
The scoring of SDB events was mapped into segments with the same 5-minute length and 2-minute overlap. Each sample in a segment corresponds to a one-second period and was set to 1 if an SDB event of any type (obstructive, central, or mixed apnea, or hypopnea) occurred during that second, and to 0 during periods of normal breathing. This resulted in a vector of dimensions 300 samples × 1 label for each segment.
Similarly, annotations of sleep stages were mapped into segments with the same length and overlap, but at a sampling frequency of 1/30 Hz, corresponding to the epoch duration of the sleep stages scored from PSG. Each 30-second sample in the segment corresponding to the stage Wake was assigned a label of 0, while samples corresponding to any sleep stage (N1, N2, N3, or REM) were assigned a label of 1. Accordingly, each segment was represented by a sleep label vector of 10 labels (10 epochs × 1 label).
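The two target vectors for one segment could be derived along the following lines; the representation of the PSG annotations (event intervals in seconds, a per-epoch wake flag) is an assumption made for illustration.

```python
import numpy as np

def make_label_vectors(event_intervals, wake_epochs, seg_start_s, seg_len_s=300):
    """Build the two target vectors for one 5-minute segment.

    event_intervals: list of (start_s, end_s) tuples of scored SDB events of any
                     type, in seconds from the start of the recording.
    wake_epochs:     boolean array over 30-s epochs, True where scored Wake.
    Returns a 300 x 1 SDB label vector (1 Hz) and a 10 x 1 sleep label vector
    (1/30 Hz, 1 = sleep, 0 = wake).
    """
    # SDB labels: one-second samples set to 1 during any scored event.
    sdb = np.zeros((seg_len_s, 1), dtype=np.float32)
    for start, end in event_intervals:
        lo = max(int(np.floor(start)) - seg_start_s, 0)
        hi = min(int(np.ceil(end)) - seg_start_s, seg_len_s)
        if hi > lo:
            sdb[lo:hi] = 1.0
    # Sleep labels: 30-second epochs, 1 for any sleep stage (N1, N2, N3, REM), 0 for Wake.
    first_epoch = seg_start_s // 30
    sleep = (~wake_epochs[first_epoch:first_epoch + seg_len_s // 30]).astype(np.float32)
    return sdb, sleep.reshape(-1, 1)
```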
The two label vectors (SDB events and sleep/wake) represent the training targets for the multi-task model. Notably, all segments occurring during the periods at the beginning and end of the recording when the lights were on (as annotated in the PSG) were removed.
Multi-task deep learning model
The multi-task deep learning model architecture illustrated in Fig. 8 consists of three main components: a shared part for both tasks, a specific part for SDB event detection, and another specific part for sleep-wake classification. The shared part was designed to learn common latent representations relevant to both tasks. It comprised two blocks, which were analogous to the feature extraction blocks employed in prior studies [14, 15]. Each block consisted of two layers of bidirectional gated recurrent units (GRU), a batch normalization layer, a max-pool layer, an activation layer utilizing the rectified linear unit (ReLU) activation function, and a dropout layer.
The task-specific part for SDB event detection consisted of a feature extraction block and a classification block. The feature extraction block comprised two layers of bidirectional GRUs, a batch normalization layer, an activation layer using the ReLU activation function, and a dropout layer. The classification block consisted of a fully connected (dense) layer with ReLU activation, followed by a second dense layer with sigmoid activation to generate the output.
Similarly, the task-specific part for sleep-wake classification included a feature extraction block combined with three subblocks. Each subblock was composed of a 2-dimensional convolution layer with ReLU activation, a batch normalization layer, a max-pool layer, and a dropout layer. Additionally, two reshape layers were incorporated at the beginning and end of the feature extraction block to adjust the shape of its input and output. The classification block for sleep-wake classification mirrored that of SDB event detection.
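The following Keras sketch reproduces the overall topology under stated assumptions: the framework, numbers of units and filters, kernel sizes, dropout rates, and pooling factors are not specified in the text and are chosen here only so that the temporal resolutions of the two outputs match the 1 Hz SDB labels (300 samples) and the 1/30 Hz sleep labels (10 epochs).

```python
from tensorflow.keras import layers, models

def shared_block(x, units=64, pool=2, dropout=0.25):
    """Shared feature-extraction block: two bidirectional GRU layers, batch
    normalization, temporal max-pooling, ReLU activation, and dropout."""
    x = layers.Bidirectional(layers.GRU(units, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(units, return_sequences=True))(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling1D(pool_size=pool)(x)
    x = layers.Activation("relu")(x)
    return layers.Dropout(dropout)(x)

inputs = layers.Input(shape=(1200, 2))        # 5-minute RR + RE segment at 4 Hz
x = shared_block(inputs)                      # 1200 -> 600 time steps
shared = shared_block(x)                      # 600 -> 300 time steps (1 Hz)

# SDB-event head: GRU feature block (no pooling) + per-sample classification.
e = layers.Bidirectional(layers.GRU(64, return_sequences=True))(shared)
e = layers.Bidirectional(layers.GRU(64, return_sequences=True))(e)
e = layers.BatchNormalization()(e)
e = layers.Activation("relu")(e)
e = layers.Dropout(0.25)(e)
e = layers.TimeDistributed(layers.Dense(32, activation="relu"))(e)
sdb_out = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"), name="sdb")(e)     # (300, 1)

# Sleep-wake head: reshape, three Conv2D subblocks pooling time by 5*3*2 = 30, reshape back.
s = layers.Reshape((300, 128, 1))(shared)     # 128 features = 2 x 64 bidirectional GRU units
for pool_t in (5, 3, 2):
    s = layers.Conv2D(16, (3, 3), padding="same", activation="relu")(s)
    s = layers.BatchNormalization()(s)
    s = layers.MaxPooling2D(pool_size=(pool_t, 2))(s)
    s = layers.Dropout(0.25)(s)
s = layers.Reshape((10, -1))(s)               # 10 epochs of 30 s each
s = layers.TimeDistributed(layers.Dense(32, activation="relu"))(s)
sleep_out = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"), name="sleep")(s)  # (10, 1)

model = models.Model(inputs, [sdb_out, sleep_out])
```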
Model training
Training was performed with the Adam optimizer, using a learning rate of 0.001, a weight decay of 0.0001, and a batch size of 128. Model weights were initialized with the Xavier uniform initializer. Given the imbalance between apnea/hypopnea events and normal breathing periods, sample weighting was implemented: a weight of 10 was assigned to apnea/hypopnea samples and a weight of 1 to normal breathing samples. Binary cross-entropy was used as the loss function for both tasks, and the final loss was the sum of the two task losses.
To mitigate the risk of overfitting, we used kernel regularization and dropout, and incorporated an early stopping mechanism: training was terminated when the validation loss did not decrease for 10 consecutive epochs.
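Continuing the sketch above, training could be set up as follows. The per-sample 10:1 weighting is approximated here with a weighted binary cross-entropy on the SDB output; the AdamW optimizer (Adam with decoupled weight decay, available in recent TensorFlow releases), the maximum number of epochs, and the data variable names are assumptions, while the learning rate, weight decay, batch size, loss summation, and early-stopping patience follow the description above.

```python
import tensorflow as tf

def weighted_bce(pos_weight=10.0):
    """Binary cross-entropy with extra weight on positive (apnea/hypopnea) samples,
    approximating the 10:1 sample weighting against normal breathing."""
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        bce = tf.keras.backend.binary_crossentropy(y_true, y_pred)  # element-wise
        weights = 1.0 + (pos_weight - 1.0) * y_true
        return tf.reduce_mean(weights * bce)
    return loss

model.compile(
    # Adam with decoupled weight decay; Keras layers use the Xavier (Glorot)
    # uniform initializer by default.
    optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4),
    # The total loss is the sum of the two task losses.
    loss={"sdb": weighted_bce(10.0), "sleep": "binary_crossentropy"},
    loss_weights={"sdb": 1.0, "sleep": 1.0},
)

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)
# x_train, y_sdb_train, y_sleep_train, x_val, ... are placeholders for the segment
# inputs and label vectors described in the previous sections.
model.fit(x_train, {"sdb": y_sdb_train, "sleep": y_sleep_train},
          validation_data=(x_val, {"sdb": y_sdb_val, "sleep": y_sleep_val}),
          batch_size=128, epochs=200, callbacks=[early_stop])
```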
In addition to the multi-task model, we also trained a separate single-task model, as used in our previous study [15], to allow a direct comparison with the proposed multi-task model for AHI estimation.
Training, validation, and testing data splitting
The study implemented a four-fold cross-validation approach to effectively leverage the available dataset for model evaluation and training. The division of the dataset into four folds was done through a stratified random split, designed to maintain a balanced distribution of subjects across all levels of SDB severity. Initially, all subjects were categorized into four SDB severity groups based on their reference AHI values: normal (AHI < 5), mild SDB (5 ≤ AHI < 15), moderate SDB (15 ≤ AHI < 30), and severe SDB (AHI ≥ 30) [1]. Subsequently, subjects within each severity group were randomly assigned to four subsets, and one subset was chosen from each group to form one fold of the cross-validation procedure. The cross-validation procedure is depicted in Fig. 9. During each iteration, one fold was designated as the testing set, while the remaining three folds were merged and further partitioned into training and validation sets.
To create the training and validation sets, the three remaining folds were combined and again divided into the four SDB severity groups: 75% of the subjects from each group were assigned to training and the remaining 25% to validation. To ensure that data from a given subject appeared in only one set (training, validation, or testing), all segments of a subject followed that subject's assignment. Figure 10 illustrates the configuration of the training, validation, and testing sets in the first iteration of the four-fold cross-validation process. This process was repeated for each cross-validation iteration, with a different fold designated as the testing set in each round, while the training and validation sets were assembled as described.
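A possible implementation of the stratified fold assignment is sketched below; the severity boundaries follow the classes above, while the round-robin assignment within each shuffled group is an illustrative choice.

```python
import numpy as np

def stratified_folds(subject_ids, ahi_values, n_folds=4, seed=0):
    """Assign subjects to cross-validation folds so that each fold contains a
    similar mix of SDB severities (normal, mild, moderate, severe)."""
    rng = np.random.default_rng(seed)
    bins = np.digitize(ahi_values, [5, 15, 30])          # 0: normal ... 3: severe
    folds = {k: [] for k in range(n_folds)}
    for severity in range(4):
        group = [s for s, b in zip(subject_ids, bins) if b == severity]
        rng.shuffle(group)
        for i, subj in enumerate(group):
            folds[i % n_folds].append(subj)              # spread the group over folds
    return folds

# In each iteration: one fold is the test set; subjects from the remaining three
# folds are re-stratified by severity and split 75%/25% into training/validation,
# so that all segments of a subject end up in exactly one of the three sets.
```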
In each iteration of the cross-validation process, the model was fitted using the training set, while the validation set was utilized for early stopping and to determine the optimal decision threshold. The results obtained on each recording of the testing set for each cross-validation iteration were finally combined to assess the overall performance for all subjects in the dataset.
Performance evaluation
Performance was evaluated for different tasks, namely sleep-wake classification, SDB event detection, AHI estimation, and SDB severity classification.
Sleep-wake classification was evaluated in terms of epoch-per-epoch agreement between the predictions of the model, and the sleep stages manually scored from PSG. The model outputs a value between 0 to 1; to obtain a binary classification, we automatically selected, on the validation set of each cross-validation iteration, the threshold that yielded the best F1 score for sleep (as positive class, label 1) versus wake classification. This threshold was then used, on the same iteration, to obtain the binary classification on the segments of the recordings of the testing set. Performance was finally evaluated by comparing the classification for the six 30-second epochs of the 3 minutes on the middle of each 5-minute segment, discarding the outer 2 minutes which overlapped neighboring segments. Agreement was assessed by means of the Cohen’s kappa coefficient of agreement and F1 scores between the predictions and the reference class from PSG. Additionally, we calculated the Spearman's correlation coefficient between the estimated TST and the PSG reference TST for each recording.
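The threshold selection and agreement metrics could be computed as sketched below, using scikit-learn metric functions; the threshold grid is an illustrative choice.

```python
import numpy as np
from sklearn.metrics import f1_score, cohen_kappa_score

def select_threshold(y_true, y_prob, candidates=np.arange(0.05, 0.96, 0.01)):
    """Pick the decision threshold that maximizes the F1 score for the positive
    class (sleep, label 1) on the validation epochs."""
    scores = [f1_score(y_true, (y_prob >= t).astype(int)) for t in candidates]
    return candidates[int(np.argmax(scores))]

# On the test set of the same cross-validation iteration:
# y_pred = (y_prob_test >= threshold).astype(int)
# kappa  = cohen_kappa_score(y_true_test, y_pred)
# f1     = f1_score(y_true_test, y_pred)
```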
Similarly, for SDB event detection, we evaluated the middle 3 minutes of each segment. The sample-level outputs were transformed into events following the methodology described in previous studies [14, 15]. Specifically, a one-second sample was assigned to an event if the model output for SDB event detection exceeded a threshold. Samples scored as part of an SDB event during a period detected as wake by the sleep-wake classification described above were reassigned to 'normal breathing'. The threshold used to decide whether a sample was part of an SDB event was automatically determined by maximizing the F1 score for event detection on the validation set of each cross-validation iteration. Consecutive samples classified as SDB events were merged to form a single event. True positives, false positives, and false negatives were then computed according to the same rules as in the previous study [15]. Finally, sensitivity, precision, and F1 score were used as performance measures for SDB event detection.
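A sketch of this post-processing from 1 Hz outputs to discrete events follows, assuming the sleep-wake decision has been upsampled to a 1 Hz wake mask; the event-matching rules used to count true and false positives follow [15] and are not reproduced here.

```python
import numpy as np

def samples_to_events(sdb_prob, wake_mask, threshold):
    """Convert 1 Hz SDB probabilities into discrete events: threshold each
    one-second sample, force samples detected as wake to 'normal breathing',
    and merge runs of consecutive positive samples into single events.
    Returns a list of (start_s, end_s) tuples."""
    positive = (sdb_prob >= threshold) & ~wake_mask
    events, start = [], None
    for i, flag in enumerate(positive):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            events.append((start, i))          # event spans [start, i) seconds
            start = None
    if start is not None:
        events.append((start, len(positive)))
    return events
```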
AHIest was estimated by dividing the number of detected SDB events by the estimated TST, while the reference AHIref was obtained by dividing the number of PSG-scored SDB events by the reference TST, also derived from PSG. Bland-Altman analysis, scatter plots, and Spearman's correlation coefficient (R) were used to assess the agreement between AHIref and AHIest.
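Concretely, with TST expressed in hours, the two indices and their agreement can be computed as follows; the variable names are placeholders.

```python
from scipy.stats import spearmanr

ahi_est = n_detected_events / tst_est_hours   # detected events per hour of estimated sleep
ahi_ref = n_scored_events / tst_ref_hours     # PSG-scored events per hour of reference TST

# Agreement across recordings (ahi_ref_all, ahi_est_all are per-recording arrays).
r, p_value = spearmanr(ahi_ref_all, ahi_est_all)
```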
To evaluate the performance of SDB severity classification, we categorized subjects into classes based on their AHIref and AHIest values: normal (AHI < 5), mild SDB (5 ≤ AHI < 15), moderate SDB (15 ≤ AHI < 30), and severe SDB (AHI ≥ 30). To address potential biases near the class boundaries, the NBL technique was employed [29, 30]. This technique assigns subjects with an AHI near a class boundary to the two adjacent severity classes, and the estimated class is considered correct if it falls into either of the two. The near-boundary zones from Pee et al. [30] were used in this study. The confusion matrix, accuracy, and Cohen's kappa were computed both with and without the NBL technique.
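The principle of the NBL evaluation is sketched below; the fixed margin is only a placeholder, as the actual near-boundary zones follow [30] and are not reproduced here.

```python
import numpy as np

def severity_class(ahi):
    """Map an AHI value to a severity class: 0 normal, 1 mild, 2 moderate, 3 severe."""
    return int(np.digitize(ahi, [5, 15, 30]))

def nbl_correct(ahi_ref, class_est, margin=2.0):
    """Near-boundary double labeling: a reference AHI within +/- margin events/hour
    of a class boundary is assigned to both adjacent classes, and the estimated
    class counts as correct if it matches either (margin is a placeholder)."""
    allowed = {severity_class(ahi_ref)}
    for boundary in (5, 15, 30):
        if abs(ahi_ref - boundary) <= margin:
            allowed.update({severity_class(boundary - 1e-9), severity_class(boundary)})
    return class_est in allowed
```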
Finally, to assess the added value of sleep-wake classification with our multi-task model, we compared the AHI estimation performance with that obtained with a single-task model where the model outputs only SDB event detection, but no estimate of the periods of sleep and wake. In the latter case, the ratio of events per hour is estimated not based on TST (which is not available as an output of the model), but rather, on the TIB. This index, often referred to as REI, is commonly used in applications or systems where a measurement of sleep (and TST) is not available, and we expect AHI and REI to diverge especially in recordings where the subject spends a relatively large amount of the time in bed awake. Accordingly, we calculated the Spearman's correlation coefficient and mean squared error (MSE) between AHIref and AHIest (with the multi-task model) and between AHIref and REIest (with the single-task model) across a varying range of sleep efficiencies. Specifically, we applied a varying threshold on sleep efficiency, obtained from PSG as the ratio between TST and TRT, and calculated the performance for all subjects with a sleep efficiency lower than that threshold.
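The comparison across sleep-efficiency thresholds could be implemented as follows; the threshold grid and the minimum group size are illustrative choices.

```python
import numpy as np
from scipy.stats import spearmanr

def agreement_vs_sleep_efficiency(ahi_ref, index_est, sleep_eff, thresholds):
    """For each sleep-efficiency threshold, compute Spearman's R and MSE between
    the reference AHI and an estimated index (AHI_est or REI_est) over all
    subjects whose PSG sleep efficiency is below that threshold."""
    ahi_ref, index_est, sleep_eff = map(np.asarray, (ahi_ref, index_est, sleep_eff))
    results = []
    for thr in thresholds:
        mask = sleep_eff < thr
        if mask.sum() < 3:                     # too few subjects for a meaningful correlation
            results.append((thr, np.nan, np.nan))
            continue
        r, _ = spearmanr(ahi_ref[mask], index_est[mask])
        mse = np.mean((ahi_ref[mask] - index_est[mask]) ** 2)
        results.append((thr, r, mse))
    return results
```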