In this study, we demonstrated that machine learning models utilizing sociodemographic characteristics and medical history can accurately predict the prognosis of COVID–19 patients after diagnosis; the models predict not only the final outcome (i.e., mortality vs. recovery) but also early mortality (i.e., 14- or 30-day mortality). The proposed prediction model aims at the quick triage of patients without having to wait for the results of additional tests such as laboratory or radiologic studies, during a pandemic when limited medical resources have to be wisely allocated without hesitation.
Machine learning prioritizes high predictive accuracy over explaining how that accuracy is achieved. We therefore presented the results of the Cox proportional hazards regression and the variable importance as complements, showing which input variables LASSO and RF relied on most. In line with previous studies7–18, old age, male gender, and the presence of symptoms or underlying medical conditions were significantly associated with a worse prognosis. We additionally found that moderate or severe disability and infection route were independently associated with prognosis.
Almost all previous studies reported that old age was a strong prognostic factor, and our results confirmed this. However, the time to recovery or mortality did not differ by age in our study population.
Previously reported underlying medical conditions associated with poor prognosis include hypertension9,11,12,14–16, DM8,9,11,13–15, lung disease including chronic obstructive lung disease and asthma10–12,14, cardiovascular disease9,10,12–15, cancer14, and chronic renal disease11. In our study, DM (from Cox and LASSO), chronic lung disease or asthma (from Cox regression), cancer (from LASSO), and hypertension (from RF) were significant predictors. LASSO assigned greater importance to DM medication than to the disease itself, which may have been because patients taking medication were more likely to have had DM for longer than those not taking it. The lack of significance of the other medical conditions, such as cardiovascular disease or chronic renal disease, in our study may be due to a different study population, the broadness of our operational definition of the diseases, and correlation with other strong predictive factors.
We also performed multivariable Cox regression to identify drugs associated with increased or decreased risk of mortality and found that the use of loop diuretics or acarbose was an independent risk factor. However, these results need to be interpreted cautiously. There may have been other confounding factors among the patients taking these medications that we were unable to sufficiently adjust for. Loop diuretics are recommended for consideration in patients with congestive heart failure or advanced chronic renal disease19. Alpha-glucosidase inhibitors, including acarbose, are often used with a basal insulin regimen when basal insulin treatment alone does not achieve glycemic control, especially of postprandial glucose levels in Asians20. Thus, the poor outcome in the patients taking these medications may have been because they had more comorbidities of longer duration, not because of direct drug effects. Our main interest in the medication analysis was the angiotensin-converting enzyme (ACE) inhibitors and angiotensin receptor (AR) blockers. There have been concerns regarding a potential harmful effect of ACE inhibitors and AR blockers in COVID–19 patients21. In our study, however, the use of ACE inhibitors or AR blockers was not significantly associated with mortality from COVID–19, in agreement with a recent large-scale study10. Further investigation is warranted to examine the potential association of these drugs with prognosis in COVID–19 patients.
Our data included information on infection route, which showed that patients in nursing homes had a worse prognosis, whereas those who contracted the disease in large clusters had good outcomes. This may be attributed to the age distribution and the status of underlying diseases in these groups; most nursing home residents are elderly people with underlying diseases, whereas the infection clusters in Korea during the current outbreak were mostly churches and service call centers where a majority of attendees were young.
We tested several machine learning algorithms because the most appropriate algorithm may differ depending on the data structure and the given task. LASSO and linear SVM demonstrated high sensitivities (> 90%) in predicting mortality, which is clinically important because identifying at-risk patients matters more than reducing false positive predictions. However, the other models showed low sensitivities despite our efforts to compensate for the class imbalance by up-sampling the rare mortality cases and adding class weights during training. Although we were not able to fully understand and explain why these models failed to overcome the class imbalance, the difference in variable importance between LASSO and RF may help explain the results. The two most important predictors for RF were cluster infection and personal contact, categories in which a very small proportion of the patients (0.6%, 42/7,256) died, implying that RF chose to focus on detecting negative cases to achieve a high AUC. In contrast, LASSO appears to have focused on the predictors associated with increased risk of mortality, including old age and DM.
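The two imbalance remedies mentioned above (up-sampling the rare class and class weighting), together with the sensitivity metric, can be illustrated with a minimal sketch. This is not the implementation used in the study: the function names, the toy data (5 deaths among 100 patients), and the "balanced" weighting formula are assumptions chosen for illustration. The formula shown is the common heuristic n_samples / (n_classes × class_count).

```python
import random
from collections import Counter


def upsample_minority(features, labels, seed=0):
    """Randomly duplicate rows of rarer classes until every class
    has as many rows as the largest class (random over-sampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(features, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so that errors on the rare class cost more during training."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary prediction."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


# Hypothetical toy cohort: 5 deaths (label 1) among 100 patients.
labels = [1] * 5 + [0] * 95
features = [[i] for i in range(100)]

x_up, y_up = upsample_minority(features, labels)
weights = balanced_class_weights(labels)

# A model that always predicts "survived" is 95% accurate here,
# yet it detects none of the deaths.
always_negative = [0] * len(labels)
sens, spec = sensitivity_specificity(labels, always_negative)
```

On this toy cohort, a classifier that always predicts survival has perfect specificity but zero sensitivity, which is the failure mode that up-sampling and class weights are meant to counteract and the reason sensitivity is reported alongside AUC.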