Dynamic Prognosis Model for Predicting Survival in Severe and Critically Ill COVID-19 Patients Using Machine Learning

Background Novel coronavirus disease (COVID-19) is an emerging, rapidly evolving situation. At present, the prognosis of severe and critically ill patients has become an important focus of attention. We sought to develop a prognostic prediction model for severe and critically ill COVID-19 patients. Methods To assess the factors associated with the prognosis of these patients, we retrospectively investigated the clinical and laboratory characteristics of 112 confirmed cases of COVID-19 admitted between 21 January and 6 March 2020 to Huangshi Central Hospital, Huangshi Hospital of Traditional Chinese Medicine, and Daye People’s Hospital. We applied a machine learning method (survival random forest) to select predictors of 28-day survival and took into account the dynamic trajectory of laboratory indicators. Results Fifteen candidate prognostic features, comprising 11 baseline measures (platelet count (PLT), urea, creatine kinase (CK), fibrinogen, creatine kinase isoenzyme activity, aspartate aminotransferase (AST), activated partial thromboplastin time (APTT), albumin, standard deviation of erythrocyte distribution width (RDW-SD), neutrophils (%) and red blood cell count (RBC)) and 4 trajectory clusters (changes during hospitalization in white blood cell count (WBC), PLT large cell ratio (P-LCR), PLT distribution width (PDW) and AST), combined with covariates achieved 100% (95% CI: 99%-100%) AUC in the discovery set and reached 87% (95% CI: 84%-91%) AUC in an external validation set. Conclusions Taking advantage of the random forest technique and dynamic laboratory measures, we developed a forest model to predict the survival outcome of COVID-19 patients, which achieved 87% AUC in the external validation set. Our online tool will help to facilitate the early recognition of patients at high risk.


Introduction
The outbreak of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) is a major worldwide public health concern [1]. By July 11, 2020, the virus had spread rapidly to approximately 215 countries, causing a total of 12,102,328 cases and 551,046 deaths, with no sign of stopping [2].
The typical initial symptoms of COVID-19 usually consist of cough, fever, nasal congestion, expectoration, fatigue, and other signs of upper respiratory tract infection [3]. Approximately 80% of COVID-19 cases are classified as mild, and the symptoms usually disappear within two weeks [4]. However, the remaining patients, classified as having severe or critical illness, experience clinical deterioration with acute respiratory distress syndrome, septic shock, metabolic acidosis, and coagulation dysfunction, which can progress to multi-organ failure and death [3,5].
Mortality among COVID-19 patients overall is 4.4%, but mortality in severe and critical illness can be as high as 49% [4]. Due to the absence of specific therapeutics and an effective vaccine, current COVID-19 treatment mainly focuses on symptomatic and respiratory support [6].
Therefore, identifying patients in high-risk groups is vital for patient management and medical resource allocation to decrease the case-fatality rate.
Potential factors predicting poor COVID-19 outcomes include older age, male sex, organ dysfunction, elevated levels of d-dimer and inflammatory markers, and lymphocytopenia [7][8][9][10]. Liang et al. developed a risk score incorporating 10 characteristics at admission of COVID-19 patients to predict the risk of developing critical illness during hospitalization, achieving an area under the receiver operating characteristic curve (AUC) of 88% in both development and validation cohorts [11]. In addition, Yan and colleagues used machine learning to select three biomarkers from patients' final biospecimens that can predict the mortality of COVID-19 patients more than 10 days in advance with more than 90% accuracy [12].
Of note, Dong et al. developed a nomogram-based survival assessment model incorporating three predictors measured at or immediately after admission, which reached approximately 90% AUC [13]. However, the generalization and stability of existing models warrant further validation. In addition, the impact of dynamic changes in laboratory indices during hospitalization has not been well considered, especially for severe and critically ill patients with high mortality.
Thus, the present study aimed to construct a prognostic prediction model for severe and critically ill COVID-19 patients. Random forest incorporating the demographic and clinical characteristics and laboratory metrics at both admission and during hospitalization was used to develop the model and identify COVID-19 patients with high risks of mortality.

Study Population, Data Sources and Processing
In the discovery set, all severe and critically ill COVID-19 patients were recruited in three hospitals in Huangshi City, Hubei Province, China between 21 January and 6 March 2020: Huangshi Central Hospital, Huangshi Hospital of Traditional Chinese Medicine, and Daye People's Hospital. Huangshi City is located about 100 km from Wuhan City, the center of the outbreak in China. COVID-19 diagnoses were confirmed by real-time reverse-transcription polymerase-chain-reaction (RT-PCR) assay or high-throughput DNA sequencing of nasal and pharyngeal swab specimens [14]. Patients were defined as having severe illness if they met any of the following criteria: respiratory rate of at least 30 breaths per min, oxygen saturation of 93% or lower in a resting state, ratio of arterial partial pressure of oxygen to oxygen concentration no greater than 300 mm Hg, or more than 50% lesion progression in lung imaging within 24-48 h. We defined critical COVID-19 illness as a composite of admission to the intensive care unit (ICU), invasive ventilation, or death [15]. All 112 patients diagnosed as severe or critically ill at admission or during hospitalization in Huangshi City were included in this study.
The ethics committees of the hospitals waived the requirement for written informed consent from patients with COVID-19. Detailed demographics and clinical characteristics, including initial symptoms, comorbidities and disease severity, were recorded at admission. A team of experienced respiratory clinicians reviewed, abstracted and cross-checked the data. Each record was checked independently by 2 clinicians. Laboratory examinations, including routine blood tests, lymphocyte subsets, inflammatory or infection-related biomarkers, and cardiac, renal, liver and coagulation function tests, were obtained at admission and during hospitalization. Baseline laboratory measures with over 40% missing values were excluded from the analysis.
In the validation set, data were obtained from recently published literature [12]. Briefly, the validation population was collected in Tongji Hospital between 10 January and 18 February 2020 and included 375 patients: 197 with general, 27 with severe and 151 with critical COVID-19. Medical records, including epidemiological, demographic, clinical, laboratory and mortality outcome information, were collected.
The follow-up time was defined as the time from hospital admission to death or discharge.

Outcome
Death within 28 days after admission to the hospital was the primary outcome of this study. Patients who were discharged from the hospital within 28 days or who remained hospitalized after 28 days were considered censored. A time-to-event outcome was defined for the following statistical models.
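The outcome coding described above can be sketched as follows. This is a minimal illustration, not the authors' code: the names `Outcome`, `code_outcome` and `HORIZON_DAYS` are hypothetical, and treating a death after day 28 as censored at day 28 is an assumption that the text does not state explicitly.

```python
from dataclasses import dataclass

HORIZON_DAYS = 28  # primary outcome window stated in the text


@dataclass
class Outcome:
    time: float  # follow-up time in days, capped at the horizon
    event: int   # 1 = death within the horizon, 0 = censored


def code_outcome(followup_days: float, died: bool) -> Outcome:
    """Turn raw follow-up (admission to death or discharge) into a 28-day outcome."""
    if died and followup_days <= HORIZON_DAYS:
        return Outcome(time=followup_days, event=1)
    # Discharged alive within 28 days: censored at discharge.
    # Still hospitalized after day 28: censored at day 28.
    # Death after day 28: censored at day 28 (assumption, see lead-in).
    return Outcome(time=min(followup_days, HORIZON_DAYS), event=0)
```

Under these assumptions, a patient who died on day 10 is coded as an event at day 10, and a patient discharged on day 12 is censored at day 12.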

Statistical analysis
Continuous variables were summarized as mean and standard deviation (SD), and categorical variables were described by frequency (n) and proportion (%). The K-nearest neighbor (KNN) method was used to impute missing laboratory values at baseline using the R package KNN Imputation [16].
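To illustrate the idea behind KNN imputation, the sketch below fills each missing value with the mean of that variable among the k most similar patients. It is a simplified pure-Python stand-in, not the R package used in the study: the function name `knn_impute`, the distance definition (root mean squared difference over jointly observed variables) and the default `k=3` are all assumptions, and in practice variables would usually be standardized first.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest rows that observed that column."""

    def distance(a, b):
        # Root mean squared difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no overlap: this row cannot act as a neighbour
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows with column j observed, nearest first.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled
```

Distances are always computed on the original rows, so values imputed earlier in the loop do not influence later imputations.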

Trajectory identification for laboratory measurements
For each laboratory indicator with repeated measures during hospitalization, trajectory analysis was performed to cluster patients based on the dynamic time-series trend of the indicator using the R package traj [17]. According to the requirements of trajectory analysis, patients with fewer than 4 observations of a specific indicator were manually classified into a separate cluster for insufficient data points.

Feature selection
Survival random forest (RF) is a powerful nonparametric, decision-tree-based machine learning approach for handling high-dimensional data and time-to-event outcomes, but false positive or spurious associations may occur if confounding factors are not corrected for. Ranger, a weighted version of random forest, gives potential confounders a 100% probability of being selected as candidates for tree construction.
In the discovery set, the importance of all features, including demographics, baseline clinical characteristics, laboratory examinations at admission, and laboratory trajectories during hospitalization, was evaluated by Ranger [18]. Variable importance scores (VIS) for the features were estimated and ranked in descending order. In addition, the sliding windows sequential forward feature selection method (SWSFS) was used to identify the most important biomarkers [19]. The SWSFS method adds one variable at a time to the Ranger model in order of variable importance. The out-of-bag (OOB) error was plotted to measure the performance of each model consisting of a specific set of biomarkers. The set of biomarkers with the lowest OOB error was selected as the candidate prognostic factors for further analysis.
Further, a Cox proportional hazards model adjusted for age, gender, and number of comorbidities was applied to test the association between the candidate factors and overall survival in both the discovery and validation sets. The difference in hazards was illustrated via Kaplan-Meier survival curves.
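The forward selection step described above (add one variable at a time in order of importance, then keep the set with the lowest OOB error) can be sketched as below. This is a minimal illustration under stated assumptions: `forward_select` is a hypothetical name, the `oob_error` callback stands in for fitting a survival forest and reading its OOB error, and any sliding-window detail of SWSFS that is not described in the text is omitted.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    ranked: Sequence[str],
    oob_error: Callable[[Sequence[str]], float],
) -> Tuple[List[str], List[float]]:
    """Grow the feature set in importance order and keep the lowest-error prefix.

    ranked    : feature names sorted by descending variable importance
    oob_error : hypothetical callback returning the OOB error of a model
                fitted on the given features
    """
    best_set: List[str] = []
    best_err = float("inf")
    errors: List[float] = []
    for size in range(1, len(ranked) + 1):
        subset = list(ranked[:size])
        err = oob_error(subset)
        errors.append(err)  # kept so the error curve can be plotted
        if err < best_err:  # strict '<' prefers the smaller set on ties
            best_set, best_err = subset, err
    return best_set, errors
```

The returned error list corresponds to the OOB error curve mentioned in the text, with one value per model size.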

Prediction forest model construction
For the discovery set, candidate prognostic factors and covariates were entered into Ranger to construct prediction forest models, each comprising 1000 decision trees. The resulting model was further validated in the validation set.
Applying the forest to the feature values of a patient yields a survival probability from each tree, thus forming a survival probability distribution for that patient. The median of this distribution represents the estimate of the patient's survival probability.
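As a small illustration of how the per-tree outputs described above can be summarized, the sketch below takes one patient's tree-level survival probabilities and returns the median as the point estimate. The function name `summarize_tree_predictions` is hypothetical, and the 10th to 90th nearest-rank percentile range is an added assumption used only to show the spread; the text specifies the median but no particular interval.

```python
import math
import statistics


def summarize_tree_predictions(tree_probs, lower=0.1, upper=0.9):
    """Summarize per-tree survival probabilities for one patient.

    Returns the median (the point estimate described in the text) and a
    nearest-rank percentile range as a rough picture of the spread.
    """
    if not tree_probs:
        raise ValueError("need at least one tree prediction")
    ordered = sorted(tree_probs)
    n = len(ordered)

    def nearest_rank(p):
        # Nearest-rank percentile: smallest value with at least p*n values at or below it.
        return ordered[max(0, math.ceil(p * n) - 1)]

    return {
        "median": statistics.median(ordered),
        "low": nearest_rank(lower),
        "high": nearest_rank(upper),
    }
```

With 1000 trees, as in the forest described above, `tree_probs` would hold 1000 values per patient at the chosen time point.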

Assessment of accuracy
Time-dependent ROC analysis was performed with the R packages ROCR [20] and pROC [21] to estimate the area under the ROC curve (AUC) at day 28 after admission to the hospital. The c-index, sensitivity, and specificity indicate the accuracy of the prediction forest model.
To assess the stability of the prediction forest model, the discovery set was randomly divided into training (55 samples) and testing (57 samples) sets. The prediction model was trained in the training set, followed by internal evaluation in the testing set and external validation. This was repeated 1000 times.
Statistical analyses were performed using R version 4.0.1 (The R Foundation for Statistical Computing). P values less than 0.05 were considered statistically significant unless otherwise specified.
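The repeated random split used to assess stability can be sketched as follows. This is an illustrative outline only: `repeated_split_eval` is a hypothetical helper, the `evaluate` callback stands in for training the forest on the training IDs and scoring it (for example by AUC) on the testing IDs, and the fixed seed is an assumption added for reproducibility.

```python
import random


def repeated_split_eval(ids, n_train, n_repeats, evaluate, seed=0):
    """Repeat a random train/test split and collect one score per repeat.

    ids      : patient identifiers in the discovery set
    n_train  : size of the training set (55 in the text, leaving 57 for testing)
    evaluate : hypothetical callback (train_ids, test_ids) -> score
    """
    rng = random.Random(seed)  # local generator, so global random state is untouched
    scores = []
    for _ in range(n_repeats):
        shuffled = list(ids)
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        scores.append(evaluate(train, test))
    return scores
```

A call such as `repeated_split_eval(range(112), 55, 1000, evaluate)` would mirror the 55/57 split repeated 1000 times; the mean and percentile range of the returned scores then summarize stability.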
The characteristics of the 375 patients in the validation set were detailed in the original literature [12]. In brief, the mean (SD) age of these patients was 58.83 (16.46) years, and 224 (59.7%) were male. In the validation set, general, severe and critically ill patients accounted for 52.5%, 7.2%, and 40.3%, respectively, and the mortality rates in the three groups were 6.09%, 51.85%, and 98.01%, respectively (Table S2). Overall, 174 (46.4%) patients in the validation set died during hospitalization (Table S2).

Feature selection
In the discovery set, there were 52 laboratory tests with sufficient (≥4) numbers of repeated measures for use in trajectory classification. In total, 3 covariates, 61 laboratory measures at admission, and 52 laboratory trajectory clusters were included in the Ranger model. SWSFS analysis identified the 15 top laboratory features with minimal OOB error, including 11 laboratory measures at admission: platelet count (PLT), urea, creatine kinase (CK), fibrinogen, creatine kinase isoenzyme activity, aspartate aminotransferase (AST), activated partial thromboplastin time (APTT), albumin, standard deviation of erythrocyte distribution width (RDW-SD), neutrophils (%) and red blood cell count (RBC), as well as 4 trajectory clusters: the trajectories during hospitalization of white blood cell count (WBC), PLT large cell ratio (P-LCR), PLT distribution width (PDW) and AST (Fig. 1; Fig. 2). In addition, Cox regression for trajectory features showed that persistently higher and more varied WBC, P-LCR, PDW, and AST were associated with an increased hazard of death (Fig. 3). After correcting for the false discovery rate, all variables except APTT were significant (Table S3).

Prediction forest model construction and assessment
The prediction forest was constructed in the discovery set using all 15 candidate prognostic features combined with covariates. The random forest model achieved 100% (95% CI: 99%-100%) AUC for predicting mortality within 28 days of admission to the hospital, with 3-fold internal cross-validation to control for overfitting. Further, the prediction forest model was validated in the external validation set. The trajectory cluster of each laboratory measure in the validation set was classified by adding one case at a time to the trajectory model trained in the discovery set. In the validation set, the association between the baseline indicators and outcome was significant except for APTT, consistent with the results of the discovery set (Figure S1). The AUC in the external validation set reached 87% (95% CI: 84%-91%) (Fig. 4a), which was 14% higher than the AUC of the model using the covariates only (P = 6.87×10⁻⁷). The optimal cut-off of the survival probability for decision-making, determined by the Youden index, was 0.58; the corresponding sensitivity and specificity for predicting 28-day mortality were 0.73 and 0.88, respectively; the specificity was 0.62 at a fixed sensitivity of 0.90 (Fig. 4b).
In addition, to evaluate the stability of the modeling strategy, the discovery set was randomly split into a training set (55 samples) and a testing set (57 samples); the prediction forest was developed in the training set, followed by internal validation in the testing set and further evaluation in the external validation set. This was repeated 1000 times. The mean AUC was 0.87 (95% CI: 0.76-0.98) in the testing set and 0.85 (95% CI: 0.80-0.89) in the validation set (Figure S2).
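For readers unfamiliar with the Youden index used above to choose the cut-off, the sketch below scans candidate cut-offs on the predicted survival probability and keeps the one that maximizes sensitivity + specificity - 1. It is a generic illustration, not the authors' R code: the function name `youden_cutoff` and the convention that a patient is predicted to die when the survival probability is at or below the cut-off are assumptions.

```python
def youden_cutoff(survival_probs, died):
    """Return the cut-off maximizing Youden's J = sensitivity + specificity - 1.

    survival_probs : predicted survival probability per patient
    died           : truthy if the patient died within the horizon
    A patient is predicted to die when the survival probability is at or
    below the cut-off (assumed convention).
    """
    if len(survival_probs) != len(died):
        raise ValueError("inputs must have the same length")
    n_dead = sum(1 for d in died if d)
    n_alive = len(died) - n_dead
    if n_dead == 0 or n_alive == 0:
        raise ValueError("need at least one death and one survivor")

    best = None
    for cutoff in sorted(set(survival_probs)):
        true_pos = sum(1 for p, d in zip(survival_probs, died) if d and p <= cutoff)
        true_neg = sum(1 for p, d in zip(survival_probs, died) if not d and p > cutoff)
        sensitivity = true_pos / n_dead
        specificity = true_neg / n_alive
        j = sensitivity + specificity - 1
        if best is None or j > best[0]:  # first (lowest) cut-off wins on ties
            best = (j, cutoff, sensitivity, specificity)
    return {
        "cutoff": best[1],
        "sensitivity": best[2],
        "specificity": best[3],
        "youden_j": best[0],
    }
```

Fixing the sensitivity instead (as in the 0.90 operating point reported above) would replace the maximization with a search for the lowest cut-off whose sensitivity reaches the target.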

Comparison with existing prognostic prediction models
Further, we verified the prognostic prediction models reported in published studies in our discovery dataset. The c-index ranged from 0.64 to 0.74, and the AUC from 0.66 to 0.82 (Table S4).

Web-based application tool
To facilitate the application of our prediction forest model, we developed an online tool that can be accessed at http://106.15.72.70:3838/COSP. By uploading the values of prognostic factors, the tool will output the distribution of likelihood that a given COVID-19 patient will die at a specific time point (Fig. 5).

Discussion
Using machine learning, we developed and validated a prognostic prediction model incorporating both baseline values and dynamic trajectories of laboratory tests that are routinely performed at admission or during the hospital stay to predict the survival outcome of severe or critically ill COVID-19 patients. Compared to existing models used to predict COVID-19 prognostic outcomes, our model is more accurate and uses more readily available metrics. Yan and colleagues developed a decision tree with three laboratory factors and achieved 95% AUC [12]. However, the hs-CRP test is not typically performed, as the standard CRP test is more cost-effective. Hence, we could not validate their tree model in our discovery dataset, but our prediction forest model achieved considerable accuracy (AUC 87%) in Yan's dataset. Notably, Dong et al. built a predictive nomogram using only three indicators at admission to hospital (hypertension, neutrophil-to-lymphocyte ratio and NT-proBNP), which achieved a surprisingly high C-index of 89.2% in the internal validation set [13]. However, NT-proBNP is usually examined in patients with symptoms of cardiac dysfunction rather than regularly measured at admission, which may limit its application. In addition, a model without external validation should be generalized with caution. Liang et al. built a risk score with 10 predictors (COVID-GRAM) to assess the risk of developing critical illness in hospitalized patients with COVID-19, which achieved good accuracy (88% AUC) in both development and external validation sets [11]. Of note, the risk of developing critical illness is highly correlated with the hazard of death. Thus, to test whether the COVID-GRAM model can be generalized to predict survival, we validated COVID-GRAM in our discovery set, where it obtained 77% AUC for predicting the survival of COVID-19 patients. Additional prognostic models proposed by Wu et al. [22] and Xie et al. [23] obtained 93% and 96% AUC, respectively, in their original reports, but in our discovery set these models yielded unsatisfactory AUC values and C-indices of 0.64. Our prediction forest model appears to be superior in accuracy to the existing models for predicting COVID-19 patient survival (Table S4).
We identified 11 laboratory measures at admission which appear to be associated with poor COVID-19 outcomes. Among the regularly measured laboratory indicators at admission, high neutrophil count, low lymphocyte count, high neutrophil-to-lymphocyte ratio, high direct bilirubin, and elevated lactate dehydrogenase have previously been identified as prognostic predictors of an unfavorable outcome [12,13,22,23]. However, in our discovery set, these predictors were not ranked at the top of the VIS list for all candidate factors, possibly because all patients in our study had progressed to severe or critical disease and the previous predictors may be less prognostic at this level of severity. Notably, in our study, PLT had the highest variable importance score, and 2 of the 4 trajectory predictors were platelet related: PDW and P-LCR. PDW and P-LCR are novel markers, but associations of platelet count with the risk or outcome of critically ill patients are well known [24-26]. Since the human lung is a site of platelet biogenesis, an abnormal trajectory of PDW and P-LCR during COVID-19 illness may indicate lung dysfunction and injury and add value to models predicting COVID-19 prognosis [27]. Also, abnormalities in platelet count and function increase the risk of bleeding secondary to clotting disorders, such as heparin-induced thrombocytopenia (HIT) or disseminated intravascular coagulation (DIC), and thus increase the risk of death [28].
There are several strengths of this study. First, random forests have better modeling efficiency than most conventional methods: they can limit overfitting, resist noise to a certain extent, are insensitive to the distribution of variables, can handle non-linear relationships, and detect interactions between features during training to improve predictive ability. Second, by integrating multiple classification trees, the model outputs a series of predicted probabilities that form a survival likelihood distribution and account for uncertainty. Third, our model includes dynamic changes in laboratory measures during hospitalization, which are more relevant to the progression of the illness than baseline predictors. Despite these strengths, the sample size of the discovery set is relatively small, which may limit the ability of the machine learning technique to build a more accurate model, although our model retained considerable predictive accuracy in the external validation dataset. Additionally, some patients in this study had non-negligible missing values for laboratory tests. We thus used the KNN method to impute the missing values by "borrowing" information from correlated variables. Also, our model may be difficult to generalize, as it was created using data from severely and critically ill patients only. The model was verified in an external validation set of patients with general, severe, or critical COVID-19, but requires more external validation in general and severe populations to ensure stability. Finally, the prediction model of this study was trained and validated using data from Chinese populations. Hence, caution is warranted when extending these findings to other populations.

Conclusions
In conclusion, taking advantage of the random forest technique and dynamic laboratory measures, we developed a highly accurate model to predict COVID-19 patient survival, and we have developed our prognostic model into a user-friendly web-based application tool, which outputs the distribution of a patient's survival likelihood to capture predictive uncertainties. Our online tool will help to facilitate the early recognition of patients at high risk.

Figure 2
Baseline laboratory factors associated with the prognosis of severe or critically ill COVID-19 patients. a urea; b neutrophils (%); c creatine kinase isoenzyme activity (CK-MB); d creatine kinase (CK); e platelet count (PLT); f aspartate aminotransferase (AST); g albumin; h standard deviation of erythrocyte distribution width (RDW-SD); i red blood cell count (RBC); j fibrinogen; k activated partial thromboplastin time (APTT); red dashed lines indicate the higher-level group and green solid lines the lower-level group; the optimal cutoff points were obtained by maximizing the discrimination of the survival curves.

Figure 3
Trajectory of laboratory factors during hospitalization associated with prognosis of severe/critically ill COVID-19 patients. a trajectory of white blood cells (WBC) in patients during hospitalization; b trajectory of platelet large cell ratio (P-LCR) in patients during hospitalization; c trajectory of PLT distribution width (PDW) in patients during hospitalization; d trajectory of aspartate aminotransferase (AST) in patients during hospitalization; e association between WBC and prognosis of patients; f association between P-LCR and prognosis of patients; g association between PDW and prognosis of patients; h association between AST and prognosis of patients; red lines indicate lower levels with slighter variation and green lines higher levels with larger variation in a, b, c and d; red dashed lines represent groups with lower levels and slighter variation and green solid lines represent groups with higher levels and greater variation in e, f, g and h.

Figure 4
ROC curve and sensitivity/specificity curve of the optimal cutoff in the validation dataset. a ROC curve and AUC of the RF model in predicting 28-day survival in the validation cohort; b an illustration of the sensitivity and specificity levels retrieved from the ROC-curve analysis, in which sensitivity (blue) and specificity (red) were plotted separately against the potential cutoff probability and the cutoff was marked with a red dashed line where the Youden index was maximal.

Figure 5
Schematic diagram of prediction model. The left panel is the input of a patient's clinical characteristics information, the middle panel is the random forest prediction model, and the right panel is the predicted survival probability distribution of a patient.