Patient Cohorts
In our study, we used EHR data from two ICU cohorts: 1) eCritical data collected from the ICUs in Alberta, Canada, between March 2013 and December 2019, and 2) the publicly available MIMIC-III database from the ICUs at the Beth Israel Deaconess Medical Center in Boston, Massachusetts, USA, collected between 2001 and 2012. Based on the inclusion and exclusion criteria, the final eCritical and MIMIC-III cohorts differed for each patient outcome (see the Patient Cohort section in the Methods for more details). After applying the inclusion criteria, the 30-day mortality cohort had 39,317 and 31,446 samples in the eCritical and MIMIC-III databases, respectively, whereas the AKI cohort had 32,076 and 26,741 samples. The H_LOS cohort had 37,675 and 30,816 samples, and the ICU_LOS cohort had 38,529 and 30,816 samples. In the eCritical cohort, there were 6,713 (17.07%) 30-day mortalities, whereas the MIMIC-III cohort had 3,900 (12.40%). The eCritical cohort had 4,524 (14.11%) AKI cases, whereas MIMIC-III had 5,789 (21.64%). The eCritical cohort had a median H_LOS of 11.48 days with an interquartile range (IQR) of (5.59, 23.29), whereas the MIMIC-III cohort had a median (IQR) of 7.39 (4.67, 12.32) days. Similarly, the median (IQR) ICU_LOS in the eCritical and MIMIC-III cohorts was 3.97 (2.2, 7.67) and 2.47 (1.59, 4.58) days, respectively. The descriptive statistics for the two cohorts are shown in Table 4.
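For reference, median (IQR) summaries of this kind can be computed as in the minimal Python sketch below; the DataFrames and the h_los_days column are toy stand-ins, not the study data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-ins for the two cohort tables; column names are illustrative only.
ecritical_df = pd.DataFrame({"h_los_days": rng.lognormal(2.4, 0.9, 1000)})
mimic_df = pd.DataFrame({"h_los_days": rng.lognormal(2.0, 0.7, 1000)})

def median_iqr(series: pd.Series) -> str:
    """Format a column as 'median (Q1, Q3)'."""
    q1, med, q3 = series.quantile([0.25, 0.5, 0.75])
    return f"{med:.2f} ({q1:.2f}, {q3:.2f})"

for name, df in [("eCritical", ecritical_df), ("MIMIC-III", mimic_df)]:
    print(f"{name} H_LOS: {median_iqr(df['h_los_days'])} days")
```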
Pre-trained models
In TL, models trained on the source domain for the source prediction task are called pre-trained models. These pre-trained models can then be retrained to improve prediction performance on the target prediction tasks in the target domain. Pre-trained models were selected based on balanced accuracy (BA) for the classification tasks and mean absolute error (MAE) for the regression tasks. Because of the high class imbalance in both eCritical and MIMIC-III (30-day mortality: 17.07% and 12.40%; AKI: 14.11% and 21.64%, respectively), BA was selected as the metric for hyper-parameter tuning and selection of the classification models. MAE is a more natural measure of average error and, unlike MSE, is not overly sensitive to outliers. Because the spread of the data differs between eCritical and MIMIC-III (H_LOS [mean, standard deviation (std), variance (var), maximum (max)]: [18.86, 20.86, 435.18, 127.81] vs. [9.73, 7.48, 56.02, 44.85]), the two datasets contain outliers of different magnitudes. To mitigate the impact of outliers, MAE was selected as the metric for hyper-parameter tuning and selection of the regression models.
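To illustrate why these metrics were chosen, the minimal sketch below (toy labels, not study data) computes BA, which averages per-class recall and is therefore robust to class imbalance, and contrasts MAE with MSE in the presence of a single large outlier.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score,
                             mean_absolute_error, mean_squared_error)

# Balanced accuracy = mean of per-class recall, robust to class imbalance.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # ~20% positives
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 0, 1, 0])
print("BA:", balanced_accuracy_score(y_true, y_pred))  # (recall_0 + recall_1) / 2

# MAE vs. MSE on a regression target with one large outlier (e.g., a long stay).
los_true = np.array([3.0, 5.0, 7.0, 60.0])
los_pred = np.array([4.0, 5.5, 6.0, 30.0])
print("MAE:", mean_absolute_error(los_true, los_pred))  # grows linearly with the outlier error
print("MSE:", mean_squared_error(los_true, los_pred))   # dominated by the squared outlier error
```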
After hyper-parameter tuning for the 30-day mortality source task, the pre-trained model with three hidden layers of 128, 64, and 32 neurons was selected; it had the highest BA, 0.7810. This was the pre-trained model for all four ITL target tasks and for the 30-day mortality DA target task. Similarly, hyper-parameter tuning for the AKI source task selected a pre-trained model with three hidden layers of 256, 128, and 64 neurons (BA of 0.7199), while for both the H_LOS and ICU_LOS source tasks it selected a model with seven hidden layers of 256, 512, 512, 256, 256, 128, and 64 neurons (MAE of 11.8019 and 3.0887, respectively). These were the pre-trained models for the AKI, H_LOS, and ICU_LOS DA target tasks.
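The framework, activation functions, and input width are not reported in this section; the hypothetical PyTorch sketch below renders the selected 30-day mortality architecture (128, 64, and 32 hidden neurons) under those stated assumptions.

```python
import torch
import torch.nn as nn

class MortalityNet(nn.Module):
    """Hypothetical FCNN with the selected 128-64-32 hidden layout.

    The input width and ReLU activations are illustrative assumptions;
    they are not reported in this section.
    """
    def __init__(self, n_features: int = 100):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.head = nn.Linear(32, 1)  # single logit for 30-day mortality

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(x))

model = MortalityNet(n_features=100)
print(model)
```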
Domain Adaptation
The DA pre-trained model re-trained using 100% of the MIMIC-III training dataset for the 30-day mortality task resulted in a median BA of 0.784 with a 95% confidence interval (CI) of (0.7645, 0.8033) and a median area under the receiver operating characteristic curve (AUC) of 0.8602 (95% CI: 0.8424, 0.8778). Reducing the training set to 75% of the data resulted in a slight drop in median BA to 0.7829 (95% CI: 0.7607, 0.8045) and AUC to 0.853 (95% CI: 0.8326, 0.8716). This decline in performance continued as the sample size decreased to 50%, 25%, 10%, 5%, and 1%, as shown in Figure 2(B). Finally, training DA models using 1% of the MIMIC-III dataset resulted in a median BA of 0.6744 (95% CI: 0.5758, 0.7083) and a median AUC of 0.7554 (95% CI: 0.6775, 0.7942). Here, the 1% MIMIC-III training dataset for the 30-day mortality task contains 222 samples.
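Schematically, each DA experiment continues training the pre-trained model on a random fraction of the MIMIC-III training set. The sketch below is a minimal, illustrative version of that loop; the optimizer, epoch count, and any layer-freezing policy are assumptions, not reported details.

```python
import torch
import torch.nn as nn

def finetune_on_fraction(pretrained: nn.Module, X: torch.Tensor, y: torch.Tensor,
                         frac: float, epochs: int = 20, lr: float = 1e-3) -> nn.Module:
    """Continue training a pre-trained classifier on a random fraction of the target data.

    Illustrative only: the study's optimizer, epochs, and fine-tuning policy
    are not described in this section.
    """
    n = max(1, int(frac * len(y)))
    idx = torch.randperm(len(y))[:n]   # on the real cohort, frac=0.01 gives the 222-sample subset
    opt = torch.optim.Adam(pretrained.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(pretrained(X[idx]).squeeze(-1), y[idx])
        loss.backward()
        opt.step()
    return pretrained

# Toy stand-ins for the pre-trained source model and the MIMIC-III target data.
model = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 1))
X = torch.randn(1000, 100)
y = (torch.rand(1000) < 0.12).float()   # ~12% positives, as in MIMIC-III mortality
model = finetune_on_fraction(model, X, y, frac=0.01)
```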
Similarly, fully connected neural network (FCNN) baseline models trained using 100% of the MIMIC-III training dataset for the 30-day mortality task resulted in a median BA of 0.7927 (95% CI: 0.7733, 0.8114) and a median AUC of 0.8591 (95% CI: 0.8405, 0.8776). There was a slight decrease in performance as the sample size decreased to 75%: a median BA of 0.7844 (95% CI: 0.7612, 0.8068) and a median AUC of 0.8556 (95% CI: 0.8349, 0.8752). Performance decreased further as the sample size shrank to 50%, 25%, 10%, 5%, and 1%, as shown in Figure 2(B). Finally, training baseline models using 1% of the MIMIC-III dataset resulted in a median BA of 0.6636 (95% CI: 0.6146, 0.6971) and a median AUC of 0.7527 (95% CI: 0.7214, 0.7827).
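This section does not specify how the medians and 95% CIs were obtained; one common recipe, percentile bootstrapping of a test-set metric, is sketched below as a plausible reading rather than the study's actual procedure.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def bootstrap_median_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Median and percentile 95% CI of BA over bootstrap resamples of a test set.

    One plausible recipe; the study's exact resampling scheme is not given here.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        scores.append(balanced_accuracy_score(y_true[idx], y_pred[idx]))
    lo, med, hi = np.percentile(scores, [100 * alpha / 2, 50, 100 * (1 - alpha / 2)])
    return med, (lo, hi)

# Toy labels and predictions, not study data.
y_true = (np.random.default_rng(1).random(500) < 0.12).astype(int)
y_pred = (np.random.default_rng(2).random(500) < 0.2).astype(int)
print(bootstrap_median_ci(y_true, y_pred))
```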
DA vs. baseline models
We used a paired Wilcoxon rank-sum test to statistically assess the performance differences between the TL and baseline models. The Bonferroni correction was applied because there were repeated comparisons. The classification tasks had 35 comparisons (7 data subsets × 5 metrics), so statistical significance was indicated by p < 0.0014 (0.05/35). The regression tasks had 14 comparisons (7 data subsets × 2 metrics), so statistical significance was indicated by p < 0.0035 (0.05/14).
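In SciPy, the paired nonparametric comparison corresponds to scipy.stats.wilcoxon (the signed-rank test applied to paired differences); assuming that reading, the sketch below, with toy metric values, shows one such comparison together with the Bonferroni-corrected thresholds quoted above.

```python
from scipy.stats import wilcoxon

# Paired per-run BA values for the two models being compared
# (toy numbers; the real values come from the repeated evaluation runs).
ba_da = [0.674, 0.668, 0.681, 0.659, 0.672, 0.677, 0.665, 0.670, 0.679, 0.662, 0.675, 0.669]
ba_lr = [0.582, 0.575, 0.590, 0.571, 0.585, 0.588, 0.569, 0.580, 0.585, 0.576, 0.580, 0.572]

stat, p = wilcoxon(ba_da, ba_lr)  # paired, nonparametric

# Bonferroni-corrected thresholds from the text:
alpha_cls = 0.05 / 35   # classification: 7 data subsets x 5 metrics -> ~0.0014
alpha_reg = 0.05 / 14   # regression:     7 data subsets x 2 metrics -> ~0.0035
print(f"p = {p:.4g}, significant (classification): {p < alpha_cls}")
```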
For 30-day mortality, DA models outperformed both baseline models (LR and FCNN) for the 1% to 50% data subsets. For the 75% and 100% datasets, the DA model outperformed the LR model but underperformed the FCNN model. For example, when the 1% dataset was used for training, the DA model had a median BA of 0.6744 (95% CI: 0.5758, 0.7083), whereas LR had 0.5821 (95% CI: 0.551, 0.6134) and FCNN had 0.6636 (95% CI: 0.6146, 0.6971); both comparisons were statistically significant with p < 0.0014. Refer to Figure 2(B) for the results of the other data subsets.
For AKI, DA models outperformed both baseline models for the 75%, 50%, 25%, 10%, and 5% data subsets and underperformed both baseline models for the 1% subset, whereas for the 100% dataset the DA model outperformed the LR model but underperformed the FCNN model. For example, when the 10% data subset was used for training, the DA model had a median BA of 0.6511 (95% CI: 0.626, 0.6763), whereas the LR model had 0.6052 (95% CI: 0.5779, 0.6262) and the FCNN model had 0.6439 (95% CI: 0.6177, 0.6678). Refer to Figure 3(B) for the results of the other data subsets.
For H_LOS, the DA models outperformed both baseline models for the 25% to 100% data subsets, while for the 1% and 5% subsets they outperformed the FCNN models but underperformed the Lasso models. At the 10% data subset, the difference between the DA and Lasso models was not statistically significant (p = 0.0193 > 0.0035), while the DA model outperformed the FCNN model. For example, when the 25% data subset was used for training, the DA model had a median MAE of 4.9109 (95% CI: 4.5982, 7.389), whereas the Lasso model had 5.0491 (95% CI: 4.8903, 6.457) and the FCNN model had 9.2677 (95% CI: 9.0162, 9.6332). Refer to Figure 4(B) for the results of the other data subsets.
For ICU_LOS, DA models outperformed both baseline models for the 25% to 100% data subsets, while for the 5% and 10% subsets they outperformed the FCNN models but underperformed the Lasso models. At the 1% data subset, the difference between the DA and FCNN models was not statistically significant (p = 0.0468 > 0.0035). For example, when the 25% data subset was used for training, the DA model had a median MAE of 2.2781 (95% CI: 2.0427, 4.643), whereas the Lasso model had 2.4165 (95% CI: 2.2967, 11.1001) and the FCNN model had 3.8481 (95% CI: 3.5641, 5.0331). Refer to Figure 5(B) for the results of the other data subsets.
We primarily discussed the results using the BA and MAE metrics, as they were used for the selection of pre-trained models and for hyper-parameter tuning. Full results for 30-day mortality, including all data subsets (1%, 5%, 10%, 25%, 50%, 75%, and 100%), all DA and baseline models, and all performance metrics (BA, AUC, accuracy, precision, and recall), are summarized in Table 2. Similarly, the full AKI results are available in Table A2. The full H_LOS results with the MAE and MSE metrics are summarized in Table A3 and, similarly, the ICU_LOS results in Table 3.
Inductive Transfer Learning
The ITL pre-trained model retrained using 100% of the eCritical training dataset for the AKI task resulted in a median BA of 0.6933 (95% CI: 0.6701, 0.717) and a median AUC of 0.7762 (95% CI: 0.7531, 0.7999). The performance metrics decreased as the sample size decreased to 75%, 50%, 25%, 10%, 5%, and 1%, as shown in Figure 3(A). Finally, training ITL models using 1% of the eCritical dataset resulted in a median BA of 0.6434 (95% CI: 0.6006, 0.6888) and a median AUC of 0.7103 (95% CI: 0.6606, 0.7484). Here, the 1% eCritical training dataset for the AKI task contains 224 samples.
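One standard way to realize ITL is to keep the pre-trained hidden layers and attach a fresh task-specific output head before retraining on the target task; the hypothetical sketch below illustrates this for the AKI task, though the study's exact retraining recipe is not described in this section.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained body (128-64-32, as selected for 30-day mortality);
# the input width of 100 is an illustrative assumption.
body = nn.Sequential(
    nn.Linear(100, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
)
# body.load_state_dict(torch.load("mortality_pretrained.pt"))  # illustrative path

# For a classification target task (AKI), attach a fresh single-logit head;
# for a regression target (H_LOS / ICU_LOS), the same head predicts days directly.
itl_model = nn.Sequential(body, nn.Linear(32, 1))

# All parameters stay trainable here; whether the study froze any layers
# is not stated in this section.
for p in itl_model.parameters():
    p.requires_grad = True
```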
Similarly, FCNN baseline models trained using 100% of the eCritical training dataset for the AKI task resulted in a median BA of 0.6968 (95% CI: 0.6738, 0.7212) and a median AUC of 0.7694 (95% CI: 0.7448, 0.7938). The performance metrics decreased as the sample size decreased to 75%, 50%, 25%, 10%, 5%, and 1%, as shown in Figure 3(A). Finally, training baseline models using 1% of the eCritical dataset resulted in a median BA of 0.6222 (95% CI: 0.5757, 0.6604) and a median AUC of 0.6796 (95% CI: 0.6343, 0.7189).
ITL vs. baseline models
For AKI, ITL models outperformed both baseline models for all data subsets except 100%, where the ITL model underperformed both baseline models. For example, when the 1% dataset was used for training, the ITL model had a median BA of 0.6434 (95% CI: 0.6006, 0.6888), whereas the LR model had 0.5467 (95% CI: 0.5154, 0.5732) and the FCNN model had 0.6222 (95% CI: 0.5757, 0.6604). Refer to Figure 3(A) for the results of the other data subsets.
For H_LOS, the ITL models outperformed both baseline models for all data subsets. For example, when the 1% dataset was used for training, the ITL model had a median MAE of 13.3182 (95% CI: 12.6128, 13.9609), whereas the Lasso model had 13.7765 (95% CI: 13.3118, 14.2661) and the FCNN model had 18.5363 (95% CI: 17.8711, 19.243). Refer to Figure 4(A) for the results of the other data subsets.
For ICU_LOS, ITL models outperformed both baseline models for all data subsets. For example, when the 1% dataset was used for training, the ITL model had a median MAE of 3.4519 (95% CI: 3.2863, 3.8158), whereas the Lasso model had 3.5883 (95% CI: 3.4255, 3.7376) and the FCNN model had 5.626 (95% CI: 5.4351, 5.8329). Refer to Figure 5(A) for the results of the other data subsets.
As before, the discussion focused primarily on the BA and MAE metrics. Full results for 30-day mortality, including all data subsets (1%, 5%, 10%, 25%, 50%, 75%, and 100%), all ITL and baseline models, and all performance metrics (BA, AUC, accuracy, precision, and recall), are summarized in Table 1. Similarly, the full AKI results are available in Table A1. The full H_LOS results with the MAE and MSE metrics are summarized in Table A3 and, similarly, the ICU_LOS results in Table 3.