Study design and participants
This retrospective study was conducted in the respiratory disease wards of our hospital between December 22, 2022, and February 15, 2023. A total of 1029 COVID-19 patients who tested positive for SARS-CoV-2 by nucleic acid test (NAT) were included. The Chinese Clinical Guidance for COVID-19 Pneumonia Diagnosis and Treatment (7th version) defines four levels of disease severity: mild, moderate, severe, and critically ill [22]. In this study, we classified severe and critically ill patients as severe cases and mild and moderate patients as non-severe cases. Patients who were still hospitalized at the end of the study period or who died within 24 h were excluded.
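The grouping and exclusion rules above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the `Patient` record layout and its field names are hypothetical, and the sketch assumes that "within 24 h" is measured from admission and includes exactly 24 h.

```python
from dataclasses import dataclass
from typing import Optional

# The four guideline levels, collapsed into two groups as described in the text.
SEVERE_LEVELS = {"severe", "critically ill"}
NON_SEVERE_LEVELS = {"mild", "moderate"}


@dataclass
class Patient:
    # Hypothetical record layout, for illustration only.
    severity: str                    # one of the four guideline levels
    discharged: bool                 # False if still hospitalized at the study deadline
    hours_to_death: Optional[float]  # hours from admission to death; None if alive


def severity_group(level: str) -> str:
    """Collapse the four guideline levels into 'severe' or 'non-severe'."""
    if level in SEVERE_LEVELS:
        return "severe"
    if level in NON_SEVERE_LEVELS:
        return "non-severe"
    raise ValueError(f"unknown severity level: {level!r}")


def is_eligible(p: Patient) -> bool:
    """Apply the two exclusion criteria stated in the text."""
    still_in_hospital = (not p.discharged) and p.hours_to_death is None
    # Assumption: 'within 24 h' means 24 h or less after admission.
    died_within_24h = p.hours_to_death is not None and p.hours_to_death <= 24
    return not (still_in_hospital or died_within_24h)
```

For example, `severity_group("critically ill")` returns `"severe"`, and a patient who remains hospitalized after the deadline is not eligible.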
Data Collection
Demographic, clinical, and laboratory data were collected within 24 hours of admission, and outcomes were extracted from electronic medical records by the research team. The main outcome was mortality. Patients who discharged themselves and whose survival status was unknown were followed up by telephone 15 days after discharge.
Demographic variables included age, gender, height, weight, ethnicity, and history of alcohol use and smoking. Clinical data included the severity of COVID-19, number of diagnoses, activities of daily living (ADL) measured by the Modified Barthel Index [23], vital signs on admission (temperature, pulse, respiratory rate, blood pressure, and pain), clinical symptoms (fever, cough, dyspnea, distress, myalgia, chest pain, headache), and comorbidities (hypertension, diabetes, respiratory distress syndrome (RDS), chronic obstructive pulmonary disease (COPD), respiratory failure, phlebothrombosis, anemia, malnutrition, hypoproteinemia, organ transplant, and others). Information on the mode of respiratory support (noninvasive ventilation, intubation, or tracheotomy tube) during hospitalization was also collected.
Blood test data were collected within 24 hours of admission and included complete blood count (erythrocyte count, hemoglobin, leukocyte count, platelet count, neutrophil count, lymphocyte count), biochemical parameters (total protein (TP), alanine aminotransferase (ALT), aspartate aminotransferase (AST), AST/ALT ratio, total bilirubin, albumin, glucose, creatinine, urea, glomerular filtration rate (GFR), lactate dehydrogenase (LDH)), electrolyte levels (Na, K, Cl, Ca, Mg), coagulation indices (prothrombin time (PT), international normalized ratio (INR), thrombin time (TT), fibrinogen), cardiac biomarkers (myoglobin, creatine kinase MB isoenzyme (CK-MB), troponin, brain natriuretic peptide (BNP)), inflammatory markers (procalcitonin (PCT), C-reactive protein (CRP), interleukin-6 (IL-6)), and blood gas analysis (partial pressure of carbon dioxide (PCO2), arterial oxygen partial pressure (PaO2), oxygen saturation (SpO2)). The data were reviewed by two researchers; a third researcher adjudicated any discrepancies and re-verified outliers and missing values against the electronic medical records.
Statistical Analysis
Normally distributed continuous variables were described as mean (SD) and compared using an unpaired, two-tailed t-test. Non-normally distributed continuous variables were expressed as medians with interquartile ranges (IQRs) and compared using the Wilcoxon rank-sum test. The Shapiro-Wilk test was used to assess normality. Categorical variables were reported as counts and percentages and compared using the chi-square test or Fisher’s exact test.
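For a binary variable compared between two groups, the two categorical tests named above reduce to a 2×2 table. The sketch below uses only the Python standard library and is not the study's R code. The rule for switching to Fisher's exact test (any expected cell count below 5) is a common convention assumed here, since the text does not say how the choice was made. The chi-square version omits the continuity correction and assumes all row and column totals are nonzero.

```python
from math import comb, erfc, sqrt
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the upper tail equals erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))


def fisher_exact_2x2(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def compare_categorical(a: int, b: int, c: int, d: int) -> Tuple[str, float]:
    """Pick the test by the expected-count rule of thumb (an assumption)."""
    n = a + b + c + d
    expected = [r * k / n for r in (a + b, c + d) for k in (a + c, b + d)]
    if min(expected) < 5:
        return "fisher", fisher_exact_2x2(a, b, c, d)
    return "chi-square", chi_square_2x2(a, b, c, d)[1]
```

Calling `compare_categorical(3, 1, 1, 3)` selects Fisher's exact test because every expected count is 2, whereas `compare_categorical(10, 20, 20, 10)` uses the chi-square test.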
Variables with more than 20% missing values were removed. The remaining missing values were imputed by random forest (missForest, R package, version 4.2.1). Because the candidate variables were high-dimensional and collinear, we used Lasso regression to select the most useful prognostic risk factors for death in COVID-19 patients. Lasso regression adds a penalty term that shrinks the coefficients: variables only weakly associated with the dependent variable have their coefficients shrunk to 0 and are eliminated, whereas variables strongly associated with it are retained with nonzero coefficients. Tenfold cross-validation was used to select the optimal penalty parameter (λ) based on the minimum criterion (λ min) and the one-standard-error criterion (λ 1SE).
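Two of the rules in this paragraph are simple enough to sketch directly: the 20% missingness filter, and the choice of λ min and λ 1SE from cross-validation results. The Python below is an illustration only, not the R code used in the study. The function names and data layout are hypothetical, and the λ 1SE rule follows the usual glmnet convention: the largest λ whose mean cross-validated error is within one standard error of the minimum.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def drop_sparse_variables(
    columns: Dict[str, List[Optional[float]]], max_missing: float = 0.20
) -> Dict[str, List[Optional[float]]]:
    """Keep variables whose fraction of missing values (None) is at most max_missing."""
    kept = {}
    for name, values in columns.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        if missing_fraction <= max_missing:
            kept[name] = values
    return kept


def select_lambda(
    lambdas: Sequence[float], cv_mean: Sequence[float], cv_se: Sequence[float]
) -> Tuple[float, float]:
    """Return (lambda_min, lambda_1se) from per-lambda CV error means and standard errors."""
    # lambda_min: the penalty with the lowest mean cross-validated error.
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # lambda_1se: the largest (most regularizing) penalty whose error is still
    # within one standard error of that minimum.
    threshold = cv_mean[i_min] + cv_se[i_min]
    lambda_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= threshold)
    return lambdas[i_min], lambda_1se
```

Because λ 1SE is never smaller than λ min, it tends to yield a sparser model with fewer retained variables.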
The variables selected by Lasso regression were used to build a logistic regression (LR) model, and a nomogram was constructed to visualize its results. All variables were included to construct a decision tree (DT) model in order to explore different predictors. An Extreme Gradient Boosting (XGBoost) model was built on the variables selected by Lasso regression to predict the outcome of hospitalized patients with COVID-19. We then used the Shapley Additive Explanations (SHAP) method to interpret the XGBoost model; SHAP values show each feature’s importance and contribution to predicted mortality.
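SHAP attributes a prediction to individual features using Shapley values. The sketch below computes exact Shapley values for a small model by enumerating feature subsets, as a way to show what a per-feature "contribution" means. It is not the `shap` package and not the algorithm it uses for tree models. The toy scoring function, its three inputs, and the single-baseline treatment of absent features are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    predict: Callable[[List[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values of predict at x; absent features take baseline values."""
    n = len(x)

    def value(subset: frozenset) -> float:
        # Features in `subset` use the patient's values, the rest use the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_score(z: List[float]) -> float:
    # Hypothetical model with one interaction term, not the study's model.
    return 2.0 * z[0] + 3.0 * z[1] + z[0] * z[2]


contributions = shapley_values(toy_score, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

The contributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is split equally between the two features involved. Exact enumeration grows exponentially with the number of features, which is why dedicated algorithms are used for real tree ensembles.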
To evaluate the predictive ability of the models, we constructed a confusion matrix for each model to compare the actual and predicted values and plotted receiver operating characteristic (ROC) curves. We also compared the accuracy, sensitivity, specificity, precision, Youden’s index, and area under the curve (AUC, with 95% CI) of the models.
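The metrics listed above all follow from the four cells of the confusion matrix, and the AUC can be computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. The following standard-library Python sketch illustrates the definitions only and is not the study's R code. Labels are assumed to be coded 1 for death and 0 for survival, both classes are assumed to be present, and the 95% CI for the AUC is not computed here.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, precision, and Youden's index from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": tp / (tp + fp),
        "youden": sensitivity + specificity - 1,
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of cases, which is acceptable for a cohort of this size; rank-based formulas give the same value more efficiently.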
All statistical analyses, except for XGBoost and SHAP, were performed in R (version 4.1.3) with the packages “dplyr”, “epiDisplay”, “gtsummary”, “missForest”, “glmnet”, “rms”, “car”, “rpart”, “partykit”, “caret”, “InformationValue”, and “pROC”. XGBoost and SHAP were carried out in Python (version 2.7) with the packages “numpy”, “pandas”, “xgboost”, and “shap”. A two-sided P-value below 0.05 was regarded as statistically significant.