Classification results for the prediction of death are shown in Table 2. The neural network models (CNN, a convolutional neural network; FNet, a Fourier-transform neural network; and FNet + VAT, a Fourier-transform neural network with virtual adversarial training) produced the worst results, while boosting (LightGBM, the Light Gradient Boosting Machine), Stacking and one traditional statistical model (Generalized Additive Models, GAM) produced the best overall results when considering micro-F1, macro-F1 and AUROC.
The weaker results of the neural networks are somewhat expected, as the dataset is relatively small, with fewer than 10,000 samples. Typically, we expect high-capacity neural networks (millions to billions of parameters) to excel in tasks where very large datasets are available (millions to billions of training instances), which is rare in health-related problems. On such large-scale datasets, neural networks can capture very complex relationships; on smaller samples, however, they show a marked tendency to overfit, hence obtaining poor results in terms of validation error (17,18).
Tree-based ensemble models such as random forests and boosting, by contrast, tend to be more robust to small sample sizes and to overfitting, which is exactly the behavior we observed in our experiments (23). SVM and K-nearest neighbors (KNN), simpler models with fewer parameters, also tend to perform reasonably well on smaller datasets, outperforming the neural network models.
We should stress that the statistical models LASSO regression and GAM showed very competitive results. Unexpectedly, GAM was the runner-up method considering all metrics, performing even better than LASSO and some traditional ML methods such as SVM and KNN. In our work, we tuned GAM directly for the classification task using cross-validation, which yielded superior performance. GAM and LightGBM are statistically tied on all evaluation metrics according to a Wilcoxon signed-rank test at 95% confidence.
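As an illustration of this comparison, the sketch below shows how such a test can be run with SciPy; the per-fold F1 arrays are hypothetical stand-ins, not the paper's actual cross-validation scores.

```python
# A minimal sketch of the Wilcoxon signed-rank comparison described above.
# The per-fold scores below are illustrative assumptions, not real results.
import numpy as np
from scipy.stats import wilcoxon

gam_f1 = np.array([0.53, 0.55, 0.51, 0.54, 0.52])   # hypothetical fold-wise F1 (GAM)
lgbm_f1 = np.array([0.54, 0.54, 0.52, 0.55, 0.53])  # hypothetical fold-wise F1 (LightGBM)

# Paired, two-sided test; p >= 0.05 means we cannot reject equality at
# 95% confidence, i.e. the two models are statistically tied.
stat, p_value = wilcoxon(gam_f1, lgbm_f1)
print(f"W = {stat:.1f}, p = {p_value:.3f}")
```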
In any case, the best overall model, with statistical significance under all considered metrics, was the Stacking model, which combines the outputs of all other individual models and, in turn, allows us to better discriminate patients with higher clinical risk at admission time. Considering micro-F1, macro-F1, F1 for death and AUROC in the task of predicting death, Stacking was statistically significantly better than all other models. The largest gains were in F1 for predicting death, with gains of more than 26% over LASSO, the previous state of the art. Stacking improves the F1 score for the class of interest (death) by 7% over RF, 5% over LightGBM and 6% over GAM, the three best individual models on this metric. Combining models based on different classification premises potentially made Stacking more robust: if a single classifier makes a wrong prediction, the others can still correct it, increasing the robustness of the final model.
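A minimal sketch of such a stacking ensemble, using scikit-learn's StackingClassifier, is shown below; the particular base models, meta-learner and hyperparameters are illustrative assumptions, not necessarily the exact configuration used in this study.

```python
# A sketch of a stacking ensemble over heterogeneous base classifiers.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier

base_models = [
    ("rf", RandomForestClassifier(n_estimators=500, random_state=42)),
    ("lgbm", LGBMClassifier(random_state=42)),
    ("svm", SVC(probability=True, random_state=42)),
    ("knn", KNeighborsClassifier()),
]

# Base models are fit with internal cross-validation; the meta-learner
# combines their predicted probabilities into the final prediction.
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=5,
)
# Usage: stacking.fit(X_train, y_train); stacking.predict_proba(X_test)
```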
Table 2. Micro-F1, macro-F1, class-specific F1 and AUROC results for the prediction of COVID-19 in-hospital death (mean and confidence interval, CI).

| Model | Micro-F1 mean (CI) | Macro-F1 mean (CI) | F1 (Death) mean (CI) | F1 (No Death) mean (CI) | AUROC mean (CI) |
|---|---|---|---|---|---|
| KNN | 0.807 (0.002) | 0.492 (0.007) | 0.091 (0.014) | 0.892 (0.001) | 0.781 (0.010) |
| FNet + VAT | 0.810 (0.013) | 0.677 (0.020) | 0.470 (0.038) | 0.884 (0.009) | 0.772 (0.019) |
| FNet | 0.814 (0.008) | 0.686 (0.017) | 0.486 (0.030) | 0.887 (0.005) | 0.789 (0.015) |
| CNN | 0.815 (0.013) | 0.693 (0.016) | 0.500 (0.026) | 0.886 (0.009) | 0.796 (0.016) |
| SVM | 0.839 (0.010) | 0.691 (0.031) | 0.478 (0.058) | 0.904 (0.005) | 0.833 (0.012) |
| LASSO | 0.842 (0.009) | 0.677 (0.024) | 0.446 (0.044) | 0.908 (0.005) | 0.859 (0.006) |
| LIGHT_GBM | 0.846 (0.008) | 0.723 (0.016) | 0.538 (0.028) | 0.908 (0.005) | 0.865 (0.008) |
| GAM | 0.847 (0.006) | 0.720 (0.014) | 0.532 (0.026) | 0.908 (0.003) | 0.855 (0.012) |
| RF | 0.850 (0.005) | 0.717 (0.013) | 0.524 (0.024) | 0.911 (0.003) | 0.863 (0.007) |
| STACKING | 0.855 (0.007) | 0.739 (0.018) | 0.564 (0.032) | 0.913 (0.004) | 0.871 (0.007) |
Model abbreviations, from top to bottom (ordered by micro-F1, ascending): KNN = K-nearest neighbors, FNet + VAT = Fourier-transform neural network with virtual adversarial training, FNet = Fourier-transform neural network, CNN = convolutional neural network, SVM = support vector machine, LASSO = LASSO regression, LIGHT_GBM = Light Gradient Boosting Machine, GAM = generalized additive models, RF = random forest, STACKING = a stacking classifier that combines all the others.
The ROC curves for all evaluated models are shown in Fig 1. From this figure, we can see the separation of two distinct groups: one group of models with inferior results, composed of the neural network models and K-nearest neighbors, and another group with superior (mutually indistinguishable) results, consisting of SVM, RF, LightGBM, GAM and Stacking.
Despite the similarities in the curves and in AUROC values, these classifiers can yield quite different results when compared by micro-F1 and macro-F1, or by class-specific F1 scores. This shows that (1) AUROC alone is not an adequate metric for evaluating and comparing models, especially in the face of high class imbalance/skewness, and that (2) even though some models, like Stacking and GAM, have very similar AUROC scores, their capacity to discriminate relevant outcomes like death is quite different (F1 of 0.532 for GAM versus 0.564 for Stacking, a significant difference of 6%).
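The sketch below illustrates, with hypothetical predictions, how these metrics can be computed with scikit-learn, and how class skew can make AUROC look flattering while F1 for the minority class remains low.

```python
# Hypothetical predictions on an imbalanced binary task (1 = death).
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.9, 0.2, 0.4, 0.1, 0.6])  # P(death)

print(f1_score(y_true, y_pred, average="micro"))   # 0.75, dominated by the majority class
print(f1_score(y_true, y_pred, average="macro"))   # ~0.67, weighs both classes equally
print(f1_score(y_true, y_pred, pos_label=1))       # 0.50, F1 for the class of interest
print(roc_auc_score(y_true, y_score))              # ~0.92, threshold-free ranking quality
```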
Interestingly, using such curves, we can sensibly calibrate the trade-off between sensitivity and specificity, further customizing the way such models can be used. In particular, when applying Stacking, our model can be tailored to the early identification of high-risk patients, with good discrimination capacity.
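As a sketch of this calibration, one can walk the ROC curve and pick an operating threshold under an explicit policy; the 90% sensitivity target below is an illustrative assumption, not a recommendation from this study.

```python
# Choosing an operating threshold from the ROC curve (hypothetical data).
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.1, 0.2, 0.3, 0.9, 0.2, 0.4, 0.1, 0.6, 0.8, 0.3])

fpr, tpr, thr = roc_curve(y_true, y_score)

# Illustrative policy: require at least 90% sensitivity for death, then
# take the first (most specific) threshold that reaches it.
i = int(np.argmax(tpr >= 0.90))
print(f"threshold={thr[i]:.2f}, sensitivity={tpr[i]:.2f}, specificity={1 - fpr[i]:.2f}")
```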
Explainability
Explainability is an essential aspect of the task if ML methods are to be trusted and actually used by practitioners. Various prognostic factors, including clinical, laboratory and radiological variables, have been proposed for stratifying COVID-19 patients by their risk of death. Among these risk factors, advanced age, multiple comorbidities on admission (such as hypertension, diabetes mellitus, cardiovascular diseases and others), and abnormal levels of C-reactive protein (CRP), lymphocytes, neutrophils, D-dimer, blood urea nitrogen (BUN) and lactate dehydrogenase (LDH) stand out.
A very interesting feature of some ML models, in particular decision trees, RF and boosting forests, is their explainability. This is still a very active research area, but modern advances in tools and visualization allow us to represent which features are most important to the model and at which polarities and intervals. In this context, as previously noted, the best model in our tests was Stacking, a meta-model whose inputs are the outputs of the other classifiers. Since we aim to explain a classifier that works at the level of the features themselves (rather than at the meta-level of other classifiers' outputs), we provide explanations for the second-best model, LightGBM. Furthermore, tree-based boosting and bagging algorithms rank among the most explainable machine learning models and lead many benchmarks, particularly for tabular data where samples are not that large. This unique combination of explainability, reliability and performance, added to the fact that Stacking is a meta-classifier, is why we exploit the boosting model (which, in our case, outperformed the bagging model, random forests/RF) to analyse the correlations among variables.
In a sense, some traditional models, such as regression models, also offer good explainability, as it is possible to inspect the coefficient of each attribute to measure how important a feature is. These models, however, do not measure up to modern tree-based algorithms in many scenarios, especially with larger datasets (24). Another key difference is that, in regression models, we have to explicitly remove collinear variables, whereas tree-based models can retain them; even though such variables might not improve classification performance, they still yield valid model explanations.
In decision-tree-based algorithms, each node represents a feature. The closer a feature is to the root (i.e., the 'first' node of each tree), the better it separates the data classes. For example, in Fig 2, the feature 'SF ratio' with a value less than 233, combined with the feature 'lactate' with a value less than 1.68 mmol/L, results in a subset covering 5.9% of the dataset in which the 'death' outcome is more common.
These algorithms look for the feature values that best separate the classes, seeking to decrease the impurity of the class label in each partition of the decision tree, as measured by the Gini index or by entropy. Both the Gini index and the entropy score tend to isolate records that represent the most frequent class in a branch.
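A minimal sketch of this impurity computation is given below; the split mirrors the 'SF ratio < 233' example, but the class counts are illustrative assumptions.

```python
# Gini impurity of a node and the weighted impurity after a split.
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

# Hypothetical parent node: 80 'no death' (0) vs 20 'death' (1).
parent = np.array([0] * 80 + [1] * 20)
left = np.array([0] * 5 + [1] * 15)    # e.g. SF ratio < 233: deaths concentrate here
right = np.array([0] * 75 + [1] * 5)   # SF ratio >= 233

# The tree greedily picks the split with the largest impurity decrease.
n = len(parent)
after = len(left) / n * gini(left) + len(right) / n * gini(right)
print(f"impurity before = {gini(parent):.3f}, after split = {after:.3f}")
```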
In Fig 3, we present SHapley Additive exPlanations (SHAP) values for our boosting model. This is a special type of explainability technique, which allows us not only to probe which features were important to the model, but also which polarities or intervals push predictions toward each of the training classes, and additionally to evaluate why the model predicted any single instance (25).
For a simple model, such as regression, the model itself is a reasonable explanation of what was learned. For more complex models, which are capable of learning more complex solutions (provided enough data is present at training time), we cannot use the model to explain itself. In these situations, Shapley values build on the idea that the explanation of a model can itself be a model. This technique was recently introduced and further expands the explainability of machine learning models, making them even more useful as they become more interpretable (25).
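For reference, the sketch below shows how such SHAP values can be computed for a boosting model with the shap library; the synthetic data stand in for the actual patient features, which are not reproduced here.

```python
# Computing SHAP values for a LightGBM model (synthetic stand-in data).
import shap
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

# Hypothetical imbalanced data standing in for the patient features.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.85], random_state=0)
model = LGBMClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # exact, fast explainer for tree ensembles
shap_values = explainer.shap_values(X)  # per-feature contribution for each instance

# Global summary (as in Fig 3): importance plus value polarities.
# Depending on the shap version, binary classifiers may return a list
# per class; take the positive ('death') class in that case.
vals = shap_values[1] if isinstance(shap_values, list) else shap_values
shap.summary_plot(vals, X)
```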
With the help of the SHAP values in Figs 3 and 4, we can extract interesting knowledge from our boosting model, the best individual ML model operating on the base features. We can see, for instance, that the most important feature in the prediction of COVID-19 death is age. This is coherent with the previous medical literature and serves as additional validation of the model. Other scores and a recent meta-analysis have shown age to be a key prognostic determinant in COVID-19 (26–29). The meta-analysis included more than half a million COVID-19 patients from different countries and observed that the risk increased exponentially after the fifth decade of life. It is important to highlight that this could be influenced both by the physiological aging process and, especially, by individuals' functional status and reserve, which may hinder the intrinsic capacity to fight infections, increasing susceptibility to infection and to severe clinical manifestations (30).
The second most important feature is the supplemental oxygen requirement, for which, as per Fig 3, lower values (blue tones) indicate higher risk. Although COVID-19 is a multisystem disease, it is well known that lung involvement is the mainstay for assessing disease severity, and oxygen requirement upon hospital admission has been shown to be an independent predictor of severe COVID-19 in several studies (31,32).
We also observed that lower platelet counts and higher levels of urea and C-reactive protein increase the risk of mortality, in line with what was previously observed using statistical models (33). Other studies suggest that C-reactive protein is a marker of the cytokine storm developing in patients with COVID-19 and is associated with disease mortality (34–36).
Interestingly, ML models can return explanations in the form of intervals, such as the behavior seen in Fig 3 for sodium and bicarbonate levels, which implies there is a "safe interval" in which risk is lower, while values either too high or too low yield a higher risk of death. Capturing such effects is an intrinsic limitation of regression models, in which a variable may appear non-significant simply because its association is non-linear.
From a clinical perspective, these results are in line with a recent study, which demonstrated that hypernatremia and hyponatremia during COVID-19 hospitalization are associated with a higher risk of death and respiratory failure, respectively (37). With regard to bicarbonate, low levels are related to acidosis, and high levels are usually related to advanced chronic obstructive pulmonary disease (COPD) with retention of carbon dioxide, both conditions well known to be associated with worse prognosis in clinical practice (38–40). This sort of non-linearity cannot be captured by simple regression models, in which we can only measure how large the coefficient values are and correlate that with the importance of each feature.
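Such interval effects can be inspected directly with a SHAP dependence plot, as in the sketch below; it reuses the hypothetical model and data from the earlier SHAP sketch, with feature index 3 standing in for a lab value such as sodium.

```python
# SHAP dependence plot for one feature (hypothetical data, as before).
import shap
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.85], random_state=0)
model = LGBMClassifier(random_state=0).fit(X, y)
sv = shap.TreeExplainer(model).shap_values(X)
sv = sv[1] if isinstance(sv, list) else sv

# A U-shaped cloud here would indicate a "safe interval": SHAP values rise
# (pushing toward 'death') at both extremes of the feature's range.
shap.dependence_plot(3, sv, X, interaction_index=None)
```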
When exploiting LASSO regression in our previous work (4), we had to exclude, due to high collinearity, some features that proved important in the boosting model. This may explain the difference in the features included in the two models, despite the fact that all features included in both had previous evidence of association with COVID-19 prognosis.
Another interesting observation is shown in Fig 4, which depicts the relative importance of each feature. Here, again, age is the most important single feature (with the highest mean SHAP value), in line with previous studies (3,26,27). However, the remaining features, when combined, yield a higher predictive value in this task than age alone.
Reliability
Finally, we investigate issues related to the reliability of the models. Neural network models, for instance, are known for having irregular error rates regardless of prediction confidence. At the other end of the spectrum, boosting and bagging models tend to have a very interesting reliability profile, with lower error rates at high confidence scores and higher error rates at lower confidence scores. This enables tuning the trade-off between accuracy and sensitivity for specific classifiers.
Accordingly, we show in Fig 5 the reliability profile of our best model (Stacking). In this figure, the x-axis shows ranges of the model's confidence score, while the y-axis shows the percentage of hits and misses. Note that the model makes more correct predictions (hits, in green) when it is more certain of the prediction (range 0.87-0.96). Thus, this classifier yields a useful reliability profile with respect to its confidence score. This characteristic means we can tune how many patients the model flags, as well as how sensitive or specific that indication is. Such tuning can be tailored to any healthcare service, accounting for intensive care unit beds, available professionals and so on.
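The profile in Fig 5 can be reproduced, in spirit, by binning predictions by confidence and computing the hit rate per bin; the sketch below uses synthetic scores, since the stacking model's actual outputs are not reproduced here.

```python
# Reliability profile: hit rate per confidence-score bin (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)   # model confidence in its prediction
hit = rng.random(1000) < conf             # synthetic hits that track confidence

bins = np.linspace(0.5, 1.0, 6)           # confidence ranges on the x-axis
idx = np.digitize(conf, bins) - 1
for b in range(len(bins) - 1):
    in_bin = idx == b
    if in_bin.any():
        print(f"{bins[b]:.2f}-{bins[b + 1]:.2f}: "
              f"hit rate {hit[in_bin].mean():.2f} (n = {in_bin.sum()})")
```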
Based on S1 Table, few prediction studies have performed extensive analyses using AI techniques. In this study, AI techniques were compared to traditional statistical methods to develop a model to predict COVID-19 mortality, considering demographic, comorbidity, clinical presentation and laboratory data. We observed that, for the prediction of the class of interest (death), the best individual method was an ML one (LightGBM), closely followed by a statistical model (GAM), both better than the neural network models, and both surpassed by a meta-learning ensemble model, Stacking, which was the best overall solution under all criteria for the posed prediction problem.
We would like to emphasize that, although in medical research AUROC is widely used as the sole measure of a model's discriminatory ability, our data reinforce that it is an insufficient metric for evaluating and comparing models. The F1 score is a more robust metric, especially in larger, more complex and imbalanced datasets, which are common in health-related scenarios.