Patient Cohorts
In our study, we used EHR data from two ICU cohorts: 1) eCritical data collected from the ICUs in Alberta, Canada, between March 2013 and December 2019, and 2) the publicly available MIMIC-III database from the ICUs at the Beth Israel Deaconess Medical Center in Boston, Massachusetts, USA, collected between 2001 and 2012. Based on the inclusion and exclusion criteria, the final eCritical and MIMIC-III cohorts differed for each patient outcome (see the Patient Cohort section in the Methods for more details). After applying the inclusion criteria, the 30-day mortality cohort had 39,317 and 31,446 samples in the eCritical and MIMIC-III databases, respectively, whereas the AKI cohort had 32,076 and 26,741 samples. The H_LOS cohort had 37,675 and 30,816 samples, and the ICU_LOS cohort had 38,529 and 30,816 samples. In the eCritical cohort, there were 6,713 (17.07%) 30-day mortalities, whereas the MIMIC-III cohort had 3,900 (12.40%). The eCritical cohort had 4,524 (14.11%) AKI cases, whereas MIMIC-III had 5,789 (21.64%). The eCritical cohort had a median H_LOS of 11.48 days with an interquartile range (IQR) of (5.59, 23.29), whereas the MIMIC-III cohort had a median (IQR) of 7.39 (4.67, 12.32) days. Similarly, the median (IQR) ICU_LOS in the eCritical and MIMIC-III cohorts was 3.97 (2.2, 7.67) and 2.47 (1.59, 4.58) days, respectively. The descriptive statistics for the two cohorts are shown in Table 4.
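For reference, median (IQR) summaries of this kind can be computed as in the minimal Python sketch below; the DataFrames and the h_los_days column are toy stand-ins, not the study data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-ins for the two cohort tables; column names are illustrative only.
ecritical_df = pd.DataFrame({"h_los_days": rng.lognormal(2.4, 0.9, 1000)})
mimic_df = pd.DataFrame({"h_los_days": rng.lognormal(2.0, 0.7, 1000)})

def median_iqr(series: pd.Series) -> str:
    """Format a column as 'median (Q1, Q3)'."""
    q1, med, q3 = series.quantile([0.25, 0.5, 0.75])
    return f"{med:.2f} ({q1:.2f}, {q3:.2f})"

for name, df in [("eCritical", ecritical_df), ("MIMIC-III", mimic_df)]:
    print(f"{name} H_LOS: {median_iqr(df['h_los_days'])} days")
```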
Pre-trained models
In TL, models trained on the source domain for the source prediction task are called pre-trained models. These pre-trained models can then be retrained to improve prediction performance on the target prediction tasks in the target domain. Pre-trained models were selected based on balanced accuracy (BA) for the classification tasks and mean absolute error (MAE) for the regression tasks. Because of the high class imbalance in both eCritical and MIMIC-III (30-day mortality: 17.07% and 12.40%; AKI: 14.11% and 21.64%, respectively), BA was selected as the metric for hyper-parameter tuning and selection of the classification models. MAE is a more natural measure of average error and, unlike MSE, is not overly sensitive to outliers. Because the spread of the data differs between eCritical and MIMIC-III (H_LOS [mean, standard deviation (std), variance (var), maximum (max)]: [18.86, 20.86, 435.18, 127.81] vs. [9.73, 7.48, 56.02, 44.85]), the two datasets contain outliers of different magnitudes. To mitigate the impact of outliers, MAE was selected as the metric for hyper-parameter tuning and selection of the regression models.
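To illustrate why these metrics were chosen, the minimal sketch below (toy labels, not study data) computes BA, which averages per-class recall and is therefore robust to class imbalance, and contrasts MAE with MSE in the presence of a single large outlier.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score,
                             mean_absolute_error, mean_squared_error)

# Balanced accuracy = mean of per-class recall, robust to class imbalance.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # ~20% positives
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 0, 1, 0])
print("BA:", balanced_accuracy_score(y_true, y_pred))  # (recall_0 + recall_1) / 2

# MAE vs. MSE on a regression target with one large outlier (e.g., a long stay).
los_true = np.array([3.0, 5.0, 7.0, 60.0])
los_pred = np.array([4.0, 5.5, 6.0, 30.0])
print("MAE:", mean_absolute_error(los_true, los_pred))  # grows linearly with the outlier error
print("MSE:", mean_squared_error(los_true, los_pred))   # dominated by the squared outlier error
```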
After hyper-parameter tuning for the 30-day mortality source task, the pre-trained model with three hidden layers of 128, 64, and 32 neurons was selected; it had the highest BA, 0.7810. This was the pre-trained model for all four ITL target tasks and for the 30-day mortality DA target task. Similarly, hyper-parameter tuning for the AKI source task selected a pre-trained model with three hidden layers of 256, 128, and 64 neurons (BA of 0.7199), while for both the H_LOS and ICU_LOS source tasks it selected a model with seven hidden layers of 256, 512, 512, 256, 256, 128, and 64 neurons (MAE of 11.8019 and 3.0887, respectively). These were the pre-trained models for the AKI, H_LOS, and ICU_LOS DA target tasks.
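The framework, activation functions, and input width are not reported in this section; the hypothetical PyTorch sketch below renders the selected 30-day mortality architecture (128, 64, and 32 hidden neurons) under those stated assumptions.

```python
import torch
import torch.nn as nn

class MortalityNet(nn.Module):
    """Hypothetical FCNN with the selected 128-64-32 hidden layout.

    The input width and ReLU activations are illustrative assumptions;
    they are not reported in this section.
    """
    def __init__(self, n_features: int = 100):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.head = nn.Linear(32, 1)  # single logit for 30-day mortality

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(x))

model = MortalityNet(n_features=100)
print(model)
```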
Domain Adaptation
The DA pre-trained model re-trained using 100% of the MIMIC-III training dataset for the 30-day mortality task resulted in a median BA of 0.784 with a 95% confidence interval (CI) of (0.7645, 0.8033) and a median area under the receiver operating characteristic curve (AUC) of 0.8602 (95% CI: 0.8424, 0.8778). Reducing the training set to 75% of the data resulted in a slight drop in median BA to 0.7829 (95% CI: 0.7607, 0.8045) and AUC to 0.853 (95% CI: 0.8326, 0.8716). This decline in performance continued as the sample size decreased to 50%, 25%, 10%, 5%, and 1%, as shown in Figure 2(B). Finally, training DA models using 1% of the MIMIC-III dataset resulted in a median BA of 0.6744 (95% CI: 0.5758, 0.7083) and a median AUC of 0.7554 (95% CI: 0.6775, 0.7942). Here, the 1% MIMIC-III training dataset for the 30-day mortality task contains 222 samples.
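Schematically, each DA experiment continues training the pre-trained model on a random fraction of the MIMIC-III training set. The sketch below is a minimal, illustrative version of that loop; the optimizer, epoch count, and any layer-freezing policy are assumptions, not reported details.

```python
import torch
import torch.nn as nn

def finetune_on_fraction(pretrained: nn.Module, X: torch.Tensor, y: torch.Tensor,
                         frac: float, epochs: int = 20, lr: float = 1e-3) -> nn.Module:
    """Continue training a pre-trained classifier on a random fraction of the target data.

    Illustrative only: the study's optimizer, epochs, and fine-tuning policy
    are not described in this section.
    """
    n = max(1, int(frac * len(y)))
    idx = torch.randperm(len(y))[:n]   # on the real cohort, frac=0.01 gives the 222-sample subset
    opt = torch.optim.Adam(pretrained.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(pretrained(X[idx]).squeeze(-1), y[idx])
        loss.backward()
        opt.step()
    return pretrained

# Toy stand-ins for the pre-trained source model and the MIMIC-III target data.
model = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 1))
X = torch.randn(1000, 100)
y = (torch.rand(1000) < 0.12).float()   # ~12% positives, as in MIMIC-III mortality
model = finetune_on_fraction(model, X, y, frac=0.01)
```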
Similarly, fully connected neural network (FCNN) baseline models trained using 100% of the MIMIC-III training dataset for the 30-day mortality task resulted in a median BA of 0.7927 (95% CI: 0.7733, 0.8114) and a median AUC of 0.8591 (95% CI: 0.8405, 0.8776). There was a slight decrease in performance as the sample size decreased to 75%: a median BA of 0.7844 (95% CI: 0.7612, 0.8068) and a median AUC of 0.8556 (95% CI: 0.8349, 0.8752). Performance decreased further as the sample size shrank to 50%, 25%, 10%, 5%, and 1%, as shown in Figure 2(B). Finally, training baseline models using 1% of the MIMIC-III dataset resulted in a median BA of 0.6636 (95% CI: 0.6146, 0.6971) and a median AUC of 0.7527 (95% CI: 0.7214, 0.7827).
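This section does not specify how the medians and 95% CIs were obtained; one common recipe, percentile bootstrapping of a test-set metric, is sketched below as a plausible reading rather than the study's actual procedure.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def bootstrap_median_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Median and percentile 95% CI of BA over bootstrap resamples of a test set.

    One plausible recipe; the study's exact resampling scheme is not given here.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        scores.append(balanced_accuracy_score(y_true[idx], y_pred[idx]))
    lo, med, hi = np.percentile(scores, [100 * alpha / 2, 50, 100 * (1 - alpha / 2)])
    return med, (lo, hi)

# Toy labels and predictions, not study data.
y_true = (np.random.default_rng(1).random(500) < 0.12).astype(int)
y_pred = (np.random.default_rng(2).random(500) < 0.2).astype(int)
print(bootstrap_median_ci(y_true, y_pred))
```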
DA vs. baseline models
We used a paired Wilcoxon rank-sum test to statistically assess the performance differences between the TL and baseline models. The Bonferroni correction was applied because there were repeated comparisons. The classification tasks had 35 comparisons (7 data subsets × 5 metrics), so statistical significance was indicated by p < 0.0014 (0.05/35). The regression tasks had 14 comparisons (7 data subsets × 2 metrics), so statistical significance was indicated by p < 0.0035 (0.05/14).
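In SciPy, the paired nonparametric comparison corresponds to scipy.stats.wilcoxon (the signed-rank test applied to paired differences); assuming that reading, the sketch below, with toy metric values, shows one such comparison together with the Bonferroni-corrected thresholds quoted above.

```python
from scipy.stats import wilcoxon

# Paired per-run BA values for the two models being compared
# (toy numbers; the real values come from the repeated evaluation runs).
ba_da = [0.674, 0.668, 0.681, 0.659, 0.672, 0.677, 0.665, 0.670, 0.679, 0.662, 0.675, 0.669]
ba_lr = [0.582, 0.575, 0.590, 0.571, 0.585, 0.588, 0.569, 0.580, 0.585, 0.576, 0.580, 0.572]

stat, p = wilcoxon(ba_da, ba_lr)  # paired, nonparametric

# Bonferroni-corrected thresholds from the text:
alpha_cls = 0.05 / 35   # classification: 7 data subsets x 5 metrics -> ~0.0014
alpha_reg = 0.05 / 14   # regression:     7 data subsets x 2 metrics -> ~0.0035
print(f"p = {p:.4g}, significant (classification): {p < alpha_cls}")
```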
For 30-day mortality, DA models outperformed both baseline models (LR and FCNN) for the 1% to 50% data subsets. For the 75% and 100% datasets, the DA model outperformed the LR model but underperformed the FCNN model. For example, when the 1% dataset was used for training, the DA model had a median BA of 0.6744 (95% CI: 0.5758, 0.7083), whereas LR had 0.5821 (95% CI: 0.551, 0.6134) and FCNN had 0.6636 (95% CI: 0.6146, 0.6971); both comparisons were statistically significant with p < 0.0014. Refer to Figure 2(B) for the results of the other data subsets.
For AKI, DA models outperformed both baseline models for the 75%, 50%, 25%, 10%, and 5% data subsets and underperformed both baseline models for the 1% subset, whereas for the 100% dataset the DA model outperformed the LR model but underperformed the FCNN model. For example, when the 10% data subset was used for training, the DA model had a median BA of 0.6511 (95% CI: 0.626, 0.6763), whereas the LR model had 0.6052 (95% CI: 0.5779, 0.6262) and the FCNN model had 0.6439 (95% CI: 0.6177, 0.6678). Refer to Figure 3(B) for the results of the other data subsets.
For H_LOS, the DA models outperformed both baseline models for the 25% to 100% data subsets, while for the 1% and 5% subsets they outperformed the FCNN models but underperformed the Lasso models. At the 10% data subset, the difference between the DA and Lasso models was not statistically significant (p = 0.0193 > 0.0035), while the DA model outperformed the FCNN model. For example, when the 25% data subset was used for training, the DA model had a median MAE of 4.9109 (95% CI: 4.5982, 7.389), whereas the Lasso model had 5.0491 (95% CI: 4.8903, 6.457) and the FCNN model had 9.2677 (95% CI: 9.0162, 9.6332). Refer to Figure 4(B) for the results of the other data subsets.
For ICU_LOS, DA models outperformed both baseline models for the 25% to 100% data subsets, while for the 5% and 10% subsets they outperformed the FCNN models but underperformed the Lasso models. At the 1% data subset, the difference between the DA and FCNN models was not statistically significant (p = 0.0468 > 0.0035). For example, when the 25% data subset was used for training, the DA model had a median MAE of 2.2781 (95% CI: 2.0427, 4.643), whereas the Lasso model had 2.4165 (95% CI: 2.2967, 11.1001) and the FCNN model had 3.8481 (95% CI: 3.5641, 5.0331). Refer to Figure 5(B) for the results of the other data subsets.
We primarily discussed the results using the BA and MAE metrics, as they were used for the selection of pre-trained models and for hyper-parameter tuning. Full results for 30-day mortality, including all data subsets (1%, 5%, 10%, 25%, 50%, 75%, and 100%), all DA and baseline models, and all performance metrics (BA, AUC, accuracy, precision, and recall), are summarized in Table 2. Similarly, the full AKI results are available in Table A2. The full H_LOS results with the MAE and MSE metrics are summarized in Table A3 and, similarly, the ICU_LOS results in Table 3.
Inductive Transfer Learning
The ITL pre-trained model retrained using 100% of the eCritical training dataset for the AKI task resulted in a median BA of 0.6933 (95% CI: 0.6701, 0.717) and a median AUC of 0.7762 (95% CI: 0.7531, 0.7999). The performance metrics decreased as the sample size decreased to 75%, 50%, 25%, 10%, 5%, and 1%, as shown in Figure 3(A). Finally, training ITL models using 1% of the eCritical dataset resulted in a median BA of 0.6434 (95% CI: 0.6006, 0.6888) and a median AUC of 0.7103 (95% CI: 0.6606, 0.7484). Here, the 1% eCritical training dataset for the AKI task contains 224 samples.
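One standard way to realize ITL is to keep the pre-trained hidden layers and attach a fresh task-specific output head before retraining on the target task; the hypothetical sketch below illustrates this for the AKI task, though the study's exact retraining recipe is not described in this section.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained body (128-64-32, as selected for 30-day mortality);
# the input width of 100 is an illustrative assumption.
body = nn.Sequential(
    nn.Linear(100, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
)
# body.load_state_dict(torch.load("mortality_pretrained.pt"))  # illustrative path

# For a classification target task (AKI), attach a fresh single-logit head;
# for a regression target (H_LOS / ICU_LOS), the same head predicts days directly.
itl_model = nn.Sequential(body, nn.Linear(32, 1))

# All parameters stay trainable here; whether the study froze any layers
# is not stated in this section.
for p in itl_model.parameters():
    p.requires_grad = True
```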
Similarly, FCNN baseline models trained using 100% of the eCritical training dataset for the AKI task resulted in a median BA of 0.6968 (95% CI: 0.6738, 0.7212) and a median AUC of 0.7694 (95% CI: 0.7448, 0.7938). The performance metrics decreased as the sample size decreased to 75%, 50%, 25%, 10%, 5%, and 1%, as shown in Figure 3(A). Finally, training baseline models using 1% of the eCritical dataset resulted in a median BA of 0.6222 (95% CI: 0.5757, 0.6604) and a median AUC of 0.6796 (95% CI: 0.6343, 0.7189).
ITL vs. baseline models
For AKI, ITL models outperformed both baseline models for all data subsets except 100%, where the ITL model underperformed both baseline models. For example, when the 1% dataset was used for training, the ITL model had a median BA of 0.6434 (95% CI: 0.6006, 0.6888), whereas the LR model had 0.5467 (95% CI: 0.5154, 0.5732) and the FCNN model had 0.6222 (95% CI: 0.5757, 0.6604). Refer to Figure 3(A) for the results of the other data subsets.
For H_LOS, the ITL models outperformed both baseline models for all data subsets. For example, when the 1% dataset was used for training, the ITL model had a median MAE of 13.3182 (95% CI: 12.6128, 13.9609), whereas the Lasso model had 13.7765 (95% CI: 13.3118, 14.2661) and the FCNN model had 18.5363 (95% CI: 17.8711, 19.243). Refer to Figure 4(A) for the results of the other data subsets.
For ICU_LOS, ITL models outperformed both baseline models for all data subsets. For example, when the 1% dataset was used for training, the ITL model had a median MAE of 3.4519 (95% CI: 3.2863, 3.8158), whereas the Lasso model had 3.5883 (95% CI: 3.4255, 3.7376) and the FCNN model had 5.626 (95% CI: 5.4351, 5.8329). Refer to Figure 5(A) for the results of the other data subsets.
As before, the discussion focused primarily on the BA and MAE metrics. Full results for 30-day mortality, including all data subsets (1%, 5%, 10%, 25%, 50%, 75%, and 100%), all ITL and baseline models, and all performance metrics (BA, AUC, accuracy, precision, and recall), are summarized in Table 1. Similarly, the full AKI results are available in Table A1. The full H_LOS results with the MAE and MSE metrics are summarized in Table A3 and, similarly, the ICU_LOS results in Table 3.