To our knowledge, this study presents the highest ROC-AUC results to date across all COVID-19 clinical outcomes considered so far. It also compares human-preprocessed data and raw data using the same ML framework and the same subjects in a complex clinical setting. To do this, a powerful ML tool used mainly for categorical classification was adapted for a time series analysis that considered patient states over time. This is of special interest, since most ML analyses are designed to classify data into categories, but few work with time series. COP-tt introduces this feature by using subject states instead of collapsing patient data into a single file, and by running its analysis on timelines (Fig. 1). This paradigm could be adapted to work not only with random forest classifiers, but with any ML categorical classifier. Moreover, we determined the minimal number of variables needed to obtain ROC-AUCs above 80%, which is around 6 for the present data. However, the relation between variables and time to event is complex, as demonstrated by both ED and ICU prediction from EHR (Fig. 3). Future studies will address this relationship by studying methods to determine optimal variable associations. LR results, in contrast, differed from EHR results and showed more predictable trends. We therefore analyzed the differences between LR and EHR further. We found, on the one hand, that underreported values did not alter prediction times but significantly lowered specificity and ROC-AUCs and, on the other, that overreported CRP had a greater impact on the prediction of ED calculated on EHR, suggesting that selective reporting by clinicians acted as a filter that increased the differences between patient states and facilitated ML classification. Further studies should assess whether human interpretation of the data enhances ML analysis, a concept that could be applied to different clinical problems, both in diagnosis and follow-up.
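The idea of classifying patient states along a timeline, rather than one collapsed record per patient, can be illustrated with a minimal sketch. It is not the COP-tt implementation: the `Timeline` container, the `timeline_to_states` function, the `horizon` parameter, the variable names, and all numeric values are hypothetical and chosen only for illustration. Each day of a stay becomes one labelled state that any categorical classifier could then be trained on.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Timeline:
    """Hypothetical container for one patient's hospital stay."""
    patient_id: str
    # One dict of measurements per day; unreported variables are simply absent.
    daily_measurements: List[Dict[str, float]]
    # Day index on which the outcome (e.g. ICU admission) occurred, or None.
    event_day: Optional[int]


def timeline_to_states(
    timeline: Timeline, horizon: int
) -> List[Tuple[Dict[str, float], int]]:
    """Expand one timeline into (features, label) pairs, one per patient state.

    The label is 1 when the event occurs within `horizon` days after the
    state and 0 otherwise. States on or after the event day are dropped.
    """
    states = []
    for day, measurements in enumerate(timeline.daily_measurements):
        if timeline.event_day is not None and day >= timeline.event_day:
            break
        label = int(
            timeline.event_day is not None
            and timeline.event_day - day <= horizon
        )
        states.append((dict(measurements), label))
    return states


# Toy values, invented for illustration only (not study data).
example = Timeline(
    patient_id="p1",
    daily_measurements=[
        {"age": 70, "crp": 12.0},
        {"age": 70, "crp": 9.5, "pao2_fio2": 310.0},
        {"age": 70, "pao2_fio2": 250.0},
        {"age": 70, "pao2_fio2": 180.0},
    ],
    event_day=3,
)
states = timeline_to_states(example, horizon=2)
labels = [label for _, label in states]
print(labels)
```

In this sketch a patient contributes several training rows, one per day, so the classifier sees how the same variables look at different distances from the event.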
Although LR are integrated into EHR in some healthcare systems, some categorical variables require and represent human interpretation; the distinction is therefore still relevant.
Regarding the changes in variables over time and their weight in prediction, Age ranked first in both sources (EHR and LR), as in other studies, followed by arterial oxygenation measurements. Beyond what other studies reported, however, temporal analysis allowed us to see how the importance of different variables shifted across predictions: for example, the proinflammatory marker CRP was useful for early discharge, whereas the PaO2/FiO2 ratio was more suitable for predicting clinical worsening and ICU admission.
To our knowledge, the COP-tt framework is the first that could be used successfully for both early discharge and prediction of severe COVID-19 outcomes. Throughout the pandemic, the increase in new cases of COVID-19 has been associated with higher hospital occupancy and mortality rates. Despite the vaccination campaigns launched, the number of new cases worldwide continues to rise,16 and there are reports of hospitalizations of vaccinated patients, as well as deaths 17–19. At the same time, it is also vital to start easing restrictions, since virtually all healthcare systems are so closely linked to the economic welfare of their respective countries that an economic downturn triggered by the collapse of several industrial sectors would most certainly undermine the functioning of healthcare systems. In this context, the ML prediction approach could be a useful tool to improve healthcare resource allocation and could help avoid the collapse of hospitals during pandemics. Simply by using EHR modelling and selective restrictions, which are interventions aimed at reducing infection rates, our hospital occupancy rates due to COVID-19 could have decreased by almost 60% (Fig. 2B). These estimates were calculated on the basis that patients with mild respiratory distress could be discharged and monitored from home. Although COP-tt may be a useful tool for early discharge of patients with COVID-19 infection, we propose that this strategy be accompanied by follow-up after discharge (such as monitoring oxygen saturation with a pulse oximeter and/or daily telephone follow-up) to identify potential clinical worsening or complications at home. In this way, the potential risks of relying on the algorithm may be mitigated. Even though this was not possible at the beginning of the pandemic, it is now feasible, and applying our classification framework during the next wave of infection could ease hospital pressure.
Thus, we would expect a COP-tt approach to indirectly increase the efficacy of other interventions such as vaccination. Although the final goal of the latter is virus eradication, which is admittedly more complex due to SARS-CoV-2 variants, one of the primary objectives of vaccination is precisely to lower infection rates and prevent the collapse of the healthcare system. In addition, the impact of vaccination and of different vaccines could be easily monitored using COP-tt by adding a model variable.
The major limitation of the study is that it is based on a single center, although the number of analyzed patients matches, and at times even exceeds, that of multicenter studies12. The second major limitation is its applicability to other healthcare structures. Nevertheless, our study demonstrates that this framework can work using very different, even contrasting, sources, such as EHR and LR. This suggests that other variables or types of records could be used to achieve the same objective. Furthermore, compared to large multicenter models20, which aim to generalize their predictive power as much as possible, our study and framework are aimed at local usage, which can be customized according to the hospital infrastructure and laboratory capabilities. The framework is available to clinicians and researchers in the GitHub repository mentioned in Methods, and a new repository, adapted for a clinical trial, is also available and has been updated.
In summary, the presented ML framework is not a model, but a model generator for various COVID-19 clinical outcome predictions. It can work under very different circumstances, since it successfully processed 31 (EHR) and 120 (LR) variables, achieving similar results, and is highly robust to missing data. The framework in fact transforms missing data into an advantage rather than a limitation, as demonstrated by the highlighted differences between human-reported laboratory results and objective records. Since the framework is a model generator, we expect it to be adaptable to different settings and objectives, such as assessing the clinical impact of SARS-CoV-2 variants21 or emerging clinical variables22. Finally, early hospital discharge should be accompanied by follow-up strategies (e.g., in our institution daily calls were implemented after early discharge) to identify clinical worsening at home and avoid risks related to erroneous outcome classification.
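One generic way in which missing values can carry information is to encode whether each variable was reported as an explicit feature. The sketch below assumes this encoding purely for illustration; it is not necessarily how the framework handles missing data. The function `add_reporting_indicators`, the `fill_value` placeholder, the variable names, and the example values are all hypothetical.

```python
from typing import Dict, List, Optional


def add_reporting_indicators(
    state: Dict[str, Optional[float]],
    variables: List[str],
    fill_value: float = -1.0,
) -> Dict[str, float]:
    """Return a dense feature dict with a `<name>_reported` flag per variable.

    Unreported variables (absent or None) receive `fill_value`, and the flag
    lets a classifier use the reporting pattern itself as a signal.
    """
    features: Dict[str, float] = {}
    for name in variables:
        value = state.get(name)
        reported = value is not None
        features[name] = float(value) if reported else fill_value
        features[f"{name}_reported"] = 1.0 if reported else 0.0
    return features


# Toy patient state, invented for illustration only (not study data).
row = add_reporting_indicators(
    {"crp": 8.2, "pao2_fio2": None},
    ["crp", "pao2_fio2", "age"],
)
print(row)
```

With such an encoding, a value that a clinician chose to report and one that was left out produce different feature rows even when the underlying measurement is similar.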