Data and study population
This study entails a secondary analysis of data collected during the observational, prospective PRODIGY trial [1]. After IRB/IEC approval and patient consent, general care floor patients receiving parenteral opioids underwent blinded, continuous capnography and pulse oximetry monitoring with the Capnostream 35 or 20p bedside monitor (Medtronic, Boulder, CO, USA) [1]. The median effective monitoring time was 24 hours (IQR 17–26). Data were collected at 16 clinical sites in the United States, Europe, and Asia. Included subjects were adults (≥ 18, 20, and 21 years in the United States/Europe, Japan, and Singapore, respectively) who were able to wear continuous monitoring equipment. A total of 1,458 patients were included. Details of the PRODIGY study can be found in the article by Khanna et al. (2020) [1].
Our study utilized 90-second segments of combined capnography and pulse oximetry monitoring for each event. An event was defined as the exact timestamp at which either abnormal or normal breathing was identified. Abnormal segments started 60 seconds before and ended 30 seconds after the abnormal event. These abnormal patterns were primarily detected automatically via the monitor alarm when certain threshold limits were breached. The abnormal patterns were then reviewed and confirmed by nine anesthesiology experts (see section Labelling and data quality). Normal breathing segments were also 90 seconds long and were randomly selected from the continuous monitoring tracing, at least 30 minutes before and 30 minutes after any detected abnormal segment. Figure 1 shows an example of how the events were identified in a continuous measurement of an individual patient. A total of 10,145 90-second segments were included.
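As an illustration of this windowing, the minimal sketch below slices a 90-second window (60 s before to 30 s after an event timestamp) out of a continuous trace. The 1 Hz sampling rate and the synthetic CO2 signal are assumptions made for the example, not specifications from PRODIGY.

```python
import numpy as np

FS = 1  # assumed sampling rate (samples per second); not specified in the text

def extract_segment(signal: np.ndarray, event_idx: int,
                    pre_s: int = 60, post_s: int = 30) -> np.ndarray:
    """Slice a 90-second window: 60 s before to 30 s after the event sample."""
    start = event_idx - pre_s * FS
    stop = event_idx + post_s * FS
    if start < 0 or stop > len(signal):
        raise ValueError("Event too close to the recording boundary.")
    return signal[start:stop]

# Example: a synthetic 2-hour CO2 trace with an event at t = 3600 s.
co2 = np.random.default_rng(0).normal(38, 3, size=7200 * FS)
segment = extract_segment(co2, event_idx=3600 * FS)
assert len(segment) == 90 * FS
```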
Labelling and data quality
Nine anesthesiology experts with ample experience in capnography assessment adjudicated the data stream and labeled 3,490 events, comprising all primary abnormal detections and 168 normal events. These normal segments were selected for revision based on an exploratory visualization of a small subset of the data, which showed deviations from the regular breathing pattern in several segments. Due to workload considerations, we chose not to revise all 10,145 segments.
The labels assigned to each event consisted of three abnormal labels and one normal label.
Following a learning phase to resolve uncertainties, the nine raters independently labeled 300 events as normal or abnormal. A 5-fold bootstrap analysis evaluated whether the final label would change significantly if the input of only seven raters were taken into account. A Cohen's kappa of 0.80 (SD ± 0.02) indicated that label consistency was maintained when the number of votes per event was reduced from nine to seven. This allowed us to lessen the burden on raters by reducing the number of events to be labeled per rater. Hence, the first round resulted in a final label defined by the majority vote of nine raters, and the following round used a seven-rater majority.
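One possible form of such a rater-reduction analysis is sketched below: a hypothetical 300 × 9 matrix of binary labels is repeatedly subsampled to seven raters, and the seven-rater majority vote is compared with the nine-rater majority vote using scikit-learn's Cohen's kappa. The resampling scheme and the random labels are assumptions for illustration; the paper does not detail the exact bootstrap procedure.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)

# Hypothetical binary label matrix: 300 events x 9 raters (1 = abnormal).
labels = rng.integers(0, 2, size=(300, 9))

def majority_vote(mat: np.ndarray) -> np.ndarray:
    """Per-event majority label across the rater columns."""
    return (mat.mean(axis=1) >= 0.5).astype(int)

full_vote = majority_vote(labels)
kappas = []
for _ in range(5):  # five repetitions, mirroring the 5-fold analysis
    subset = rng.choice(9, size=7, replace=False)
    kappas.append(cohen_kappa_score(full_vote, majority_vote(labels[:, subset])))

print(f"kappa = {np.mean(kappas):.2f} +/- {np.std(kappas):.2f}")
```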
Evaluation of the label revision
Inter-rater agreement was evaluated using two metrics: Fleiss' Kappa and percent agreement.
The percent agreement for each item was calculated as the proportion of raters that agreed with the most frequent label for that item, expressed as a percentage. The overall percent agreement was then obtained by averaging this value across all items. Fleiss' Kappa was calculated as described by Fleiss [14]. Both metrics were calculated separately for the two labeling rounds, and each was assessed for both the multi-class and the binary class definition. The labels resulting from this revision process were then used to create a classification model, as described in the next section.
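Both metrics can be computed from a rater matrix as in the sketch below, which uses statsmodels' implementation of Fleiss' kappa. The 100 × 9 matrix of random four-class labels (0 = normal, 1–3 = abnormal) is a hypothetical stand-in for the adjudicated data.

```python
import numpy as np
from statsmodels.stats import inter_rater as irr

# Hypothetical rater matrix: rows = items, columns = raters,
# values = one of four label classes (0 = normal, 1-3 = abnormal).
rng = np.random.default_rng(1)
ratings = rng.integers(0, 4, size=(100, 9))

# Per-item percent agreement: share of raters matching the modal label.
counts, _ = irr.aggregate_raters(ratings)  # items x categories count table
percent_agreement = (counts.max(axis=1) / counts.sum(axis=1)).mean() * 100

kappa = irr.fleiss_kappa(counts, method="fleiss")
print(f"percent agreement = {percent_agreement:.1f}%, Fleiss' kappa = {kappa:.2f}")

# Binary class definition: collapse the three abnormal classes into one.
binary = (ratings > 0).astype(int)
counts_bin, _ = irr.aggregate_raters(binary)
print(f"binary Fleiss' kappa = {irr.fleiss_kappa(counts_bin):.2f}")
```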
Development of the classifier
The goal of this study is to create a first-stage ML classifier that distinguishes between normal and abnormal segments of combined capnography and pulse oximetry measurements. This classifier is part of a larger concept that applies multiple sequential classifiers to detect significant respiratory depression and that can potentially differentiate artifacts from true respiratory depression. A general overview of the multi-stage classifier approach is shown in Fig. 2. The rest of this section presents the steps taken in the development and evaluation of the first-stage model.
Pre-processing
Segments were removed from the dataset when more than 20% of the CO2, SpO2, or PR measurements were missing, or when the CO2 value over the entire 90-second period was lower than 1.0 mmHg. In cases where less than 20% of the data was missing, linear interpolation was applied, followed by a forward and backward fill to address missing values at the beginning and end of each segment, respectively.
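A minimal sketch of these exclusion and imputation rules with pandas is shown below; the column names (co2, spo2, pr) and the interpretation of the CO2 rule as "below 1.0 mmHg for the whole segment" are assumptions for the example.

```python
from typing import Optional
import pandas as pd

MAX_MISSING = 0.20   # drop segments with > 20% missing samples in any channel
MIN_CO2 = 1.0        # mmHg; a segment entirely below this is deemed unusable

def preprocess_segment(seg: pd.DataFrame) -> Optional[pd.DataFrame]:
    """seg holds columns 'co2', 'spo2', 'pr' over one 90-second window."""
    # Exclusion rule 1: too much missing data in any channel.
    if seg[["co2", "spo2", "pr"]].isna().mean().max() > MAX_MISSING:
        return None
    # Exclusion rule 2: CO2 below 1.0 mmHg over the entire segment.
    if (seg["co2"].dropna() < MIN_CO2).all():
        return None
    # Impute: linear interpolation, then forward/backward fill at the edges.
    return seg.interpolate(method="linear").ffill().bfill()
```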
Using the Python packages TSFresh and NeuroKit2, we extracted 300 relevant features from the raw capnography and pulse oximetry segments [15, 16]. Features with high correlation (> 0.9), low variance (< 0.005), or more than 10% missingness were removed, resulting in 208 features for modeling. Subsequently, these features were processed using Scikit-learn's IterativeImputer for missing data imputation and MinMaxScaler for feature scaling [17].
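The sketch below illustrates this pipeline under simplifying assumptions: a small synthetic single-channel dataset, TSFresh's MinimalFCParameters to keep the demo fast (the study extracted 300 features with the full TSFresh and NeuroKit2 feature sets), and one arbitrary way of resolving correlated pairs.

```python
import numpy as np
import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical long-format input: 20 synthetic 90-sample CO2 segments.
rng = np.random.default_rng(0)
long_df = pd.DataFrame({
    "segment_id": np.repeat(np.arange(20), 90),
    "time": np.tile(np.arange(90), 20),
    "co2": rng.normal(38, 3, size=20 * 90),
})

features = extract_features(long_df, column_id="segment_id",
                            column_sort="time",
                            default_fc_parameters=MinimalFCParameters())

# Drop features with > 10% missingness, variance < 0.005,
# or pairwise correlation > 0.9 (keeping one of each correlated pair).
features = features.loc[:, features.isna().mean() <= 0.10]
features = features.loc[:, features.var() >= 0.005]
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
features = features.drop(columns=[c for c in upper.columns
                                  if (upper[c] > 0.9).any()])

# Impute remaining gaps and scale each feature to [0, 1].
X = MinMaxScaler().fit_transform(IterativeImputer().fit_transform(features))
```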
Model development and training
We divided our dataset into training and test subsets, maintaining an 80:20 split at the individual subject level. The class ratios in the training and test sets were comparable, with a proportion of 0.63 in the training set and 0.61 in the test set for the negative (normal) class label.
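A subject-level split of this kind can be obtained with scikit-learn's GroupShuffleSplit, as in the sketch below; the feature matrix, labels, and subject IDs are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 208))           # 208 features, as above
y = rng.integers(0, 2, size=1000)          # 0 = normal, 1 = abnormal
groups = rng.integers(0, 150, size=1000)   # hypothetical subject IDs

splitter = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# All segments from one subject fall entirely in train or test,
# preventing leakage of subject-specific patterns across the split.
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```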
Five ML models were trained: Gaussian Naive Bayes, eXtreme Gradient Boosting (XGBoost), Random Forest, C-Support Vector Classification (SVC), and K-Nearest Neighbors (KNN). The classifiers were trained using subject-based stratified 5-fold cross-validation on the training set.
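Continuing the split sketch above, subject-based stratified cross-validation of the five models could look as follows; default hyperparameters and accuracy scoring are placeholders (the study optimized toward the F2 score, see Model evaluation).

```python
from sklearn.model_selection import StratifiedGroupKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

models = {
    "GaussianNB": GaussianNB(),
    "XGBoost": XGBClassifier(),
    "RandomForest": RandomForestClassifier(),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
}
# Folds are stratified by class label and grouped by subject ID.
cv = StratifiedGroupKFold(n_splits=5)
for name, model in models.items():
    scores = cross_val_score(model, X[train_idx], y[train_idx],
                             groups=groups[train_idx], cv=cv)
    print(f"{name}: {scores.mean():.3f}")
```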
Model evaluation
The first-stage model developed in this study is an important first step towards a more precise clinical alarm system for respiratory depression. Since missing a patient with an abnormal breathing pattern can be fatal, false negatives are weighted more heavily than false positives. Therefore, we chose to train and optimize the models towards the Fβ score with β = 2, which favors recall over precision. Further performance measures included accuracy, precision, recall, specificity, AUPRC, and AUROC.
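The Fβ score generalizes F1 as Fβ = (1 + β²) · precision · recall / (β² · precision + recall), so with β = 2 recall counts four times as much as precision. The short example below, with made-up predictions, shows how this is computed with scikit-learn and wrapped as a scorer for model selection.

```python
from sklearn.metrics import fbeta_score, make_scorer

# Scorer usable in cross_val_score / GridSearchCV for F2 optimization.
f2_scorer = make_scorer(fbeta_score, beta=2)

# Example: one missed abnormal segment (false negative), two false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
print(fbeta_score(y_true, y_pred, beta=2))  # ~0.714: recall 0.75, precision 0.60
```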