Study design
We collected clinical data from 3563 COVID-19 patients and 18 patients with non-COVID-19 viral pneumonia (designated as non-COVID-19) to build and validate the risk-stratification model. Specifically, data from 548 patients at the First People’s Hospital of Jiangxia District of Wuhan were used as the training dataset (FPHJ-548); data from 3015 patients at Wuhan Huoshenshan Hospital were used as the testing dataset (HSSH-3015); and data from 18 non-COVID-19 viral pneumonia patients at the First People’s Hospital of Jiangxia District of Wuhan were used to differentiate COVID-19 from non-COVID-19 (VPP-18).
The highest severity during hospitalization was recorded for each patient, and blood test results at admission were used to predict disease progression. In the FPHJ-548 dataset, the mean age of patients was 52.4 years (SD=14.2), and 49.8% were female. Notably, the median age of severe patients was significantly higher than that of moderate patients (Fisher’s exact test, P<0.01, supplementary Table 1). The 385 cases (329 moderate and 56 severe) with laboratory data at admission were selected for the following analysis. To predict the severity of COVID-19 patients at admission, we built a risk-stratification model based on a support vector machine (SVM) using laboratory indicators in the FPHJ-548 dataset. This model was further validated in an independent dataset (HSSH-3015) (Figure 1; for details, see Methods). The 51 patients in the HSSH-3015 dataset with laboratory data at admission were selected as the testing dataset.
Then, to predict the survival outcome of severe COVID-19 patients, we selected 1448 survivors and 55 deaths from the HSSH-3015 dataset. Sixty patients without laboratory findings within the first week after admission were excluded. We randomly split the HSSH-3015 dataset into a training set and a held-out test set at a ratio of approximately 1:1. We built a logistic regression (LR) model based on laboratory findings in the training set and validated it in the test set (Figure 1; for details, see Methods). In addition, to distinguish COVID-19 from non-COVID-19 viral pneumonia, we compared laboratory findings between the COVID-19 datasets (FPHJ-548 or HSSH-3015) and the non-COVID-19 dataset (VPP-18) (Figure 1; for details, see Methods).
A risk-stratification model of COVID-19 based on 4 laboratory findings at admission
According to the highest severity of each patient during hospitalization, we explored the differences in laboratory findings between moderate and severe COVID-19 cases in the FPHJ-548 dataset. We found that the high-risk factors related to the progression of COVID-19 included procalcitonin (PCT), C-reactive protein (CRP), neutrophil percentage (NEUT%), lymphocyte percentage (LYMPH%), and lactate dehydrogenase (LDH) (Wilcoxon rank-sum test, P<0.001, Table 1). Most of the severe patients presented with lymphopenia and elevated levels of inflammatory biomarkers. PCT levels at the initial stage were higher in severe patients than in moderate patients (0.225 vs. 0.06, Wilcoxon rank-sum test, P<0.001), suggesting that serial procalcitonin measurement may play a role in predicting evolution towards a more critical form of the disease 13. CRP showed a similar trend to PCT and was significantly higher in severe patients (44.5 vs. 21.8, Wilcoxon rank-sum test, P<0.001). Lymphocyte percentage was significantly higher in moderate COVID-19 patients than in severe COVID-19 patients (22.4% vs. 13.8%, Wilcoxon rank-sum test, P<0.001). The percentage of neutrophils increased with the severity of COVID-19 (77.8 vs. 66.4, Wilcoxon rank-sum test, P<0.001). In addition, LDH was significantly higher in severe patients than in moderate patients (314 vs. 235, Wilcoxon rank-sum test, P<0.001). Considering that most of these differential indicators are related to organ damage, we next explored the impact of pre-existing diseases on the progression of COVID-19. In the FPHJ-548 dataset, we found that only 9% of patients without pre-existing disease progressed to a severe condition. In contrast, 16% of severe patients were diagnosed with at least one kind of pre-existing disease (Fisher's exact test, P=0.029, Figure 2A), suggesting that COVID-19 patients with pre-existing disease were prone to develop severe illness.
Furthermore, we found the same trend in the HSSH-3015 dataset: patients with multiple pre-existing diseases were more likely to progress to severe illness (Figure 2B).
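The comparison of progression rates between patients with and without pre-existing disease rests on a 2x2 contingency table. As a minimal sketch (not the authors' code), a two-sided Fisher's exact test can be computed from the hypergeometric distribution as shown here. The function name and the counts in the usage line are hypothetical, because the underlying counts are not reported in this section.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical counts (NOT from the study): rows are with / without
# pre-existing disease, columns are severe / moderate.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))  # 34/70, about 0.4857
```

With the real counts behind Figure 2A, the same call would return the P value for the association between pre-existing disease and severity.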
The difference in laboratory indicators between severe and moderate patients prompted us to develop a model based on laboratory indicators to predict patient status (Figure 1; for details, see Methods). To assess whether laboratory findings could predict the progression of COVID-19, we performed t-distributed stochastic neighbor embedding (t-SNE) on the laboratory indicators in the FPHJ-548 dataset. The result showed a clear difference in laboratory indicators between moderate and severe patients: 95% of the samples were correctly classified (true positive rate: 0.66, true negative rate: 1, Figure 3A).
For each indicator in FPHJ-548, its corresponding AUC was calculated using the measured value as the predictor and the progression status as the outcome. We selected features with an AUC greater than 0.7 and kept only indicators with data at admission in both the FPHJ-548 and HSSH-3015 datasets. Finally, our model incorporated 4 indicators: LYMPH%, NEUT%, creatinine (CREA), and urea nitrogen (BUN) (Figure 3B). The NEUT% between moderate and severe patients showed a noticeable increase at about four days before admission in the FPHJ-548 dataset (Figure 3C). In contrast, the NEUT% of moderate patients was stable and within the normal reference range.
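The per-indicator screening described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the AUC is computed through its equivalence with the Mann-Whitney U statistic, the helper names `auc` and `select_indicators` are hypothetical, and the AUC is treated as direction-agnostic (an assumption made here so that indicators that are lower in severe cases, such as LYMPH%, are not discarded). The values in the usage example are synthetic.

```python
def auc(values, labels):
    """AUC of one indicator via the Mann-Whitney U statistic.

    labels: 1 for severe, 0 for moderate. Tied pairs count as one half.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def select_indicators(table, labels, threshold=0.7):
    """Return {indicator: AUC} for indicators whose AUC exceeds threshold."""
    selected = {}
    for name, values in table.items():
        a = auc(values, labels)
        # Assumption: an indicator that is lower in severe patients
        # discriminates equally well, so use max(AUC, 1 - AUC).
        a = max(a, 1.0 - a)
        if a > threshold:
            selected[name] = a
    return selected


# Synthetic values for illustration only (not study data).
labels = [0, 0, 0, 1, 1, 1]
table = {
    "NEUT%": [60, 62, 65, 70, 75, 80],
    "LYMPH%": [25, 22, 24, 12, 14, 10],
    "noise": [1, 2, 3, 1, 2, 3],
}
print(sorted(select_indicators(table, labels)))  # ['LYMPH%', 'NEUT%']
```

The additional requirement that an indicator be available at admission in both datasets would be applied as a separate filter on the indicator names.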
Next, we used these four indicators to develop a support vector machine model, with five-fold cross-validation as internal validation. The average sensitivity and specificity across the five folds were 0.89 and 0.84, respectively, and the average AUC was 0.86 (95% CI: 0.84-0.88). The representative receiver operating characteristic (ROC) curve for the external validation (HSSH-3015 dataset) is shown in Figure 3D. The model still achieved satisfactory results in the testing dataset (sensitivity 0.73, specificity 0.96, AUC 0.89). Lastly, to check for bias due to age and sex, we divided patients into two groups by age or by sex and tested our model in each group. The results showed that the model still performed well when age and sex were taken into account (Figure 3E, F).
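To make the validation metrics concrete, the sketch below shows how samples might be partitioned for five-fold cross-validation and how sensitivity and specificity are computed from predictions. It does not implement the SVM itself. The function names are hypothetical, and the stratification by class is an assumption, since the text only states that five-fold cross-validation was used.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds, keeping class ratios similar.

    Stratification is an assumption of this sketch.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Synthetic labels: 1 = severe, 0 = moderate (not study data).
labels = [1] * 10 + [0] * 40
for fold in stratified_folds(labels):
    # Each fold serves once as the validation set; the rest is training data.
    print(len(fold), sum(labels[i] for i in fold))
```

In each round a classifier would be fitted on four folds and scored on the fifth, and the five sensitivity and specificity values would then be averaged.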
Laboratory findings within the first week after admission could predict the risk of death of COVID-19
Because progression of COVID-19 to severe illness increases the risk of death, we predicted the survival outcome of severe patients in the HSSH-3015 dataset based on laboratory findings within the first week after admission (Figure 1). Patients were randomly divided into a training group and a validation group at a ratio of 1:1. To avoid bias caused by the difference between the number of deaths and the number of survivors, we randomly sampled the surviving patients so that their number equaled the number of deaths. We then used stepwise logistic regression to identify the important laboratory indicators. This process was repeated 100 times (for details, see Methods). Thirteen indicators with statistically significant differences between survivors and deaths were identified: albumin/globulin, D-dimer, leukocytes, monocytes, cystatin C, creatinine, lymphocytes, urea nitrogen, thrombin time, prothrombin time, lactate dehydrogenase, fibrinogen, and percentage of neutrophils. We performed multidimensional scaling in the training dataset based on these 13 markers. The results showed that these indicators could distinguish deaths from survivors (accuracy=0.96, true positive rate: 0.82, true negative rate: 0.97, Figure 4A). Then, based on these 13 indicators, we developed a logistic model in the training dataset to predict the survival outcome. The model predicted the survival outcome with high accuracy in the testing dataset (AUC = 0.95, Figure 4B). In addition, the average NEUT% of patients who died exceeded the upper limit of the normal range during hospitalization. In contrast, the NEUT% of survivors was stable and within the normal reference range (Figure 4C).
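The resampling scheme described above (a 1:1 split, followed by repeated down-sampling of survivors to match the number of deaths) can be sketched as follows. This is a minimal illustration with hypothetical function names and synthetic patient identifiers. The stepwise logistic regression that would be fitted on each balanced draw is not shown.

```python
import random


def split_half(ids, seed=0):
    """Randomly split identifiers into two groups at a ratio of about 1:1."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def balanced_resamples(survivor_ids, death_ids, n_repeats=100, seed=0):
    """Yield (survivors, deaths) pairs with equal group sizes.

    Survivors are drawn without replacement within each repeat; all
    deaths are kept in every repeat.
    """
    rng = random.Random(seed)
    survivor_ids = list(survivor_ids)
    death_ids = list(death_ids)
    for _ in range(n_repeats):
        yield rng.sample(survivor_ids, len(death_ids)), list(death_ids)


# Synthetic identifiers for illustration only (not study data).
survivors = list(range(100))
deaths = list(range(100, 110))
train_survivors, test_survivors = split_half(survivors)
train_deaths, test_deaths = split_half(deaths)

draws = list(balanced_resamples(train_survivors, train_deaths))
print(len(draws), len(draws[0][0]), len(draws[0][1]))  # 100 5 5
```

One way to aggregate the 100 repeats would be to keep the indicators that are selected most often across draws. That aggregation rule is an assumption here, as the text defers the details to Methods.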
Distinguishing COVID-19 from non-COVID-19 viral pneumonia based on laboratory findings
A growing number of studies have shown that viral pneumonia may be associated with organ dysfunction 14, 15, 16, 17. Hence, we explored the differences in organ function-related indicators between FPHJ-548 and VPP-18. Interestingly, we found that some indicators related to organ dysfunction showed significant differences between the two groups (Table 2, two-sided Wilcoxon rank-sum test, P<0.05). Patients in the non-COVID-19 group had higher levels of NT-proBNP than those in the COVID-19 group (1259.4 pg/mL vs. 90.285 pg/mL, P=0.045). In addition, the level of LDH was higher in non-COVID-19 patients than in COVID-19 patients (594 vs. 242.85, P<0.001). The levels of alanine aminotransferase (19 U/L vs. 40 U/L, P<0.001) and aspartate aminotransferase (30.1s vs. 36s, P<0.001) were higher in the non-COVID-19 group. The median activated partial thromboplastin time in the non-COVID-19 group was longer than that in the COVID-19 group. The median levels of albumin and hemoglobin in non-COVID-19 patients were lower by more than 5 g/L and 10 g/L, respectively (albumin: 33.8 g/L vs. 38.95 g/L, P<0.001; hemoglobin: 121 g/L vs. 135.25 g/L, P=0.003). Hence, we used three laboratory findings with significant differences (alanine aminotransferase, aspartate aminotransferase, and lactate dehydrogenase) to perform multidimensional scaling on FPHJ-548 and VPP-18 (Figure 5A; for details, see Methods). We found that these indicators could distinguish COVID-19 from non-COVID-19 (accuracy=0.93, true positive rate: 0.94, true negative rate: 0.78). For verification, we applied the same method to HSSH-3015 and VPP-18 and found similar results (accuracy=0.97, true positive rate: 0.97, true negative rate: 0.93, Figure 5B).
Considering that these indicators are related to liver disease and heart disease, we removed patients with liver and heart disease from the HSSH-3015 dataset to exclude the impact of pre-existing disease. The results showed that these indicators could still differentiate COVID-19 from non-COVID-19 (accuracy=0.96, true positive rate: 0.97, true negative rate: 0.92, supplementary Figure 1). In summary, these results demonstrate that laboratory findings can distinguish COVID-19 patients from non-COVID-19 patients.
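The group comparisons in this section rely on the two-sided Wilcoxon rank-sum test. As a rough sketch of how such a P value can be obtained, the code below uses the normal approximation with average ranks for ties and a tie-corrected variance. This is an assumption-laden illustration: the exact variant used in the study (exact test, continuity correction) is not stated here, the function name is hypothetical, and the sample values are synthetic.

```python
from math import erfc, sqrt


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). Ties receive average ranks and the variance is
    tie-corrected; no continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 share one value; their ranks are i+1..j.
        avg_rank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        raise ValueError("all values are identical; the test is undefined")
    z = (rank_sum_x - mean) / sqrt(var)
    p = erfc(abs(z) / sqrt(2.0))  # two-sided tail of the standard normal
    return z, p


# Synthetic values for illustration only (not study data).
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]
z, p = rank_sum_test(group_a, group_b)
print(round(z, 3), p < 0.01)
```

For small groups such as VPP-18 (18 patients), an exact version of the test may be preferable to the normal approximation used in this sketch.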