Data sources and preprocessing
The Hospital del Mar database separates electronic health records (EHRs), which are clinical history entries, from laboratory results (LRs). Because all records are virtually always in use, data retrieval errors may occur, especially during usage peaks, which became longer and more frequent during the epidemic. EHR data were retrieved in a single query with the search word “COVID”, from March through August 2020, inclusive. LRs could not be searched in the same way; therefore, all LR data were retrieved using temporal bounds alone. The identifier numbers of all patients were translated into new random identifiers before analysis; the translation index was kept by JPH and MLS, who were not directly involved in data processing and analysis. EHRs were further processed using keywords and proximity to retrieve numeric or binary variables. The presence of certain conditions, e.g., cardiovascular diseases, was recorded as binary variables, while numerical variables comprised age and transcribed LRs. All EHR variables were located using synonym lists; proximity (distance from the located keyword) was used to parse numerical values. Outcomes were retrieved in the same fashion. To assign outcomes to LRs, EHR data were crossed with LR data. Confirmation of SARS-CoV-2 infection was determined by either LR or EHR, giving preference to LRs.
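As an illustration, the keyword-and-proximity parsing described above can be sketched as follows. The synonym list, window size, and function names are illustrative assumptions, not the actual implementation:

```python
import re

# Hypothetical synonym list for one variable; the real lists were
# compiled by the clinical authors and are not reproduced here.
SYNONYMS = {"CRP": ["c reactive protein", "crp"]}

def extract_numeric(text, variable, synonyms=SYNONYMS, window=30):
    """Return the first number found within `window` characters after
    any synonym of `variable`, or None if no synonym matches."""
    low = text.lower()
    for key in synonyms[variable]:
        idx = low.find(key)
        if idx != -1:
            nearby = low[idx + len(key): idx + len(key) + window]
            match = re.search(r"[-+]?\d+(?:[.,]\d+)?", nearby)
            if match:
                return float(match.group().replace(",", "."))
    return None
```

A binary variable (e.g., a comorbidity) would be handled the same way, except that locating any synonym is enough to set the value to “YES”.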
Tables were constructed from raw EHR and LR data, in which each line corresponded to a patient initialization or update; thus, one or more lines were compiled for each patient. Each initialization or update was located in time by its entry date. If a variable could not be retrieved from an entry at a given time, default values were used: 0.0 for numerical variables and “NO” for binary ones. The special variable sex was treated as binary, referring uniquely to the biological phenotype; subjects for whom no assessment of sex could be retrieved were discarded. Tables were saved as comma-separated values (CSV) files, using “;” as the separator.
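A minimal sketch of this table construction, assuming a dictionary-per-state layout and hypothetical variable names:

```python
import csv
import io

# Illustrative variable lists; the real ones are given in Results.
NUMERIC = ["age", "crp"]
BINARY = ["cardiovascular_disease"]

def build_row(patient_id, entry_date, values):
    """Compile one state line, filling defaults for missing variables:
    0.0 for numerical variables, "NO" for binary ones."""
    row = {"id": patient_id, "date": entry_date}
    for var in NUMERIC:
        row[var] = values.get(var, 0.0)
    for var in BINARY:
        row[var] = values.get(var, "NO")
    return row

def write_states(rows, fields):
    """Serialize state rows to CSV text using ';' as the separator."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, delimiter=";")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```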
During the COVID-19 pandemic, physicians from different specialties attended COVID-19 patients in our hospital. To homogenize patient management, the Infectious Diseases Service elaborated guidelines for both data collection in the EHR and therapeutic management. EHR variables were selected using the clinical history recording guide issued by the COVID-19 task force of our institution, together with the clinical experience of the authors involved in COVID-19 shifts (SG-Z, ILM, and AP). Demographic, clinical, and epidemiological data were collected from the EHR, including age, sex, race, underlying diseases, clinical and outcome data of COVID-19, and complementary tests (kidney and liver profile, electrolytes, myocardial enzymes, blood count, coagulation profile including D-dimer, and inflammatory markers such as serum interleukin-6 (IL-6), ferritin, and C-reactive protein). LR variables, in turn, were selected from among the 20% most requested blood tests during the aforementioned temporal range. For a complete list of EHR and LR variables, see the variable weight tables in Results and Supplementary Results.
The need for rescue treatment was defined as the addition of IL-6 inhibitors, such as tocilizumab or sarilumab, to standard therapy in patients with clinical deterioration (increasing lung infiltrates or inflammatory markers, and/or changes in the need for oxygen or ventilatory support). Regarding respiratory requirements, ventilatory support was defined as the need for non-invasive mechanical ventilation (MV), high-flow nasal cannula, or invasive MV. Intensification of oxygen support was considered to occur when, during hospital admission, it was necessary to increase oxygen needs (need to increase FiO2 by nasal cannula or mask and/or need to use ventilatory support).
COVID-19 was defined as a SARS-CoV-2 infection confirmed by quantitative reverse transcriptase–polymerase chain reaction, rapid antigen detection in a nasopharyngeal sample, or serology. Patients with a negative test who nevertheless fulfilled the clinical diagnostic criteria, namely respiratory symptoms (dyspnea, cough, sore throat, changes in taste/smell), chest X-ray findings (uni- or bilateral interstitial infiltrates), and laboratory abnormalities (lymphopenia and elevated C-reactive protein, ferritin, IL-6, or D-dimer), which made COVID-19 the most likely diagnosis in the current epidemiological situation, were considered probable infections and processed for the anticipation of early discharge.
Clinical Outcome Prediction through time (COP-tt)
As a first step, the framework loads the CSV tables into an internal database, where subject initializations or updates are stored as states. Within states, variables are either numerical or categorical, binary being a subset of categorical. When a variable is categorical, COP-tt reads all possible text entries and automatically assigns each a numerical counterpart, which is used during analysis. Despite being categorical, the outcome variable was not assigned automatically but was defined as shown in Fig. 1 of the main text. Analysis consists of three steps: data sampling, model training, and outcome prediction. For the latter, the outcome of each state is upgraded to the worst outcome reached by the subject; this upgrade can be disabled to create models for outcome detection instead of prediction.
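The categorical encoding and outcome upgrade can be sketched as follows, assuming states are plain dictionaries (an illustrative layout, not the framework's internal one):

```python
def encode_categorical(states, variable):
    """Map each distinct text entry of a categorical variable to an
    integer code, rewriting the states in place."""
    codes = {}
    for state in states:
        value = state[variable]
        state[variable] = codes.setdefault(value, len(codes))
    return codes

def upgrade_outcomes(states_by_subject):
    """Replace each state's outcome with the worst (maximum) outcome the
    subject ever reached; skipping this step yields detection models."""
    for states in states_by_subject.values():
        worst = max(s["outcome"] for s in states)
        for s in states:
            s["outcome"] = worst
```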
During the first step, the database is split into a training and a testing set. The number of subjects in each set is defined by the training set ratio, an arbitrary parameter. In the current study, the ratio was changed depending on the type of prediction and the availability of patients (60 to 90%). In general terms, the bigger and more varied the training set, the more accurate the prediction.
Several random decision forest (RDF)22 models are trained to simulate k-fold validation23, widely used in machine learning (ML). Before training a specific model, the training set is split into training and testing subsets, which are checked to ensure a balanced number of outcomes (50% good). Since it is not always possible to ensure well-balanced sets, an error threshold is employed to bypass unbalanced samplings: when the training subset is unbalanced beyond this arbitrary threshold, the sampling is repeated. In this study, six RDF models were trained before outcome prediction. The number of RDF trees can be set before analysis.
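This balanced-sampling loop can be sketched as below. The 50% target, the six models, and the configurable number of trees come from the text; the threshold value, subset size, and function names are assumptions:

```python
import random
from sklearn.ensemble import RandomForestClassifier

def balanced_sample(X, y, size, threshold=0.05, max_tries=100, rng=None):
    """Resample until the good (0) / bad (1) ratio is within `threshold`
    of 50%; fall back to the last draw if balance cannot be achieved."""
    rng = rng or random.Random(0)
    for _ in range(max_tries):
        idx = rng.sample(range(len(y)), size)
        good_ratio = sum(1 for i in idx if y[i] == 0) / size
        if abs(good_ratio - 0.5) <= threshold:
            break
    return [X[i] for i in idx], [y[i] for i in idx]

def train_models(X, y, n_models=6, n_trees=100):
    """Train several RDF models on balanced resamples of the training set."""
    models = []
    for seed in range(n_models):
        Xs, ys = balanced_sample(X, y, size=len(y) // 2,
                                 rng=random.Random(seed))
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        models.append(clf.fit(Xs, ys))
    return models
```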
During the last step, each RDF model classifies all states of a testing subject. Since 0 is used for a good outcome and 1 for a bad one, a bad outcome is predicted when the average model prediction meets or exceeds a threshold (0.5 in this study, see Fig. 1). For instance, if three models predict a bad outcome while three predict a good one, the final prediction is bad. The prediction is “through time” because it is calculated at each state, after which the time before the target outcome is determined. For example, when calculating the anticipation of intensive care unit (ICU) admission (the target outcome), the final prediction of a bad outcome may be achieved at a state preceding the real ICU admission, which counts as a real prediction; at the same state as the admission, which corresponds to a detection; or after ICU admission, which is a failure to detect or anticipate. In this case, patients not admitted to the ICU did not count toward the final anticipation statistics. Target outcomes, and therefore outcome discriminators, can be set before analysis. The bad versus good discriminators used in this study are shown in Fig. 1 of the main text.
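The per-state ensemble vote and the anticipation measure can be illustrated as follows, assuming models are simple callables over a state (a sketch, not the framework's interface):

```python
def predict_state(models, state, threshold=0.5):
    """Average the 0/1 predictions of all models; the outcome is bad
    when the mean meets or exceeds the threshold (ties count as bad)."""
    votes = [m(state) for m in models]
    return 1 if sum(votes) / len(votes) >= threshold else 0

def anticipation(models, states, event_index):
    """States of anticipation before the target outcome: a positive
    value means anticipation, 0 means detection at the event itself,
    and None means the event was missed (predicted only afterwards)."""
    for i, state in enumerate(states):
        if predict_state(models, state) == 1:
            return event_index - i if i <= event_index else None
    return None
```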
The current version of the framework is written in Python (version 3.8.5), using the Miniconda data science platform (version 4.9.2). The Scikit-learn module (version 0.23.2) was used for the RDF implementation. COP-tt accepts CSV tables and text files as indexes of variable names and types. See code and instructions in the GitHub repository: github.com/joricomico/COVID19_cop (COP-tt test, application to COVID-19).
Variable comparison between EHR and LR
Because of technical retrieval problems, LRs represented a smaller but complete subset of EHRs. For this reason, to compare EHRs and LRs, random samples of the intersection variables (variables present in both sets) were compared using either Student’s t test or the Mann-Whitney U test, depending on their distributions, evaluated through the Shapiro-Wilk test. One hundred samples of 1000 values were compared for each variable. Average p-values and the percentage of times p < 0.05 were used to determine statistical significance.
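The repeated-sampling comparison can be sketched as below. The sample counts (100 samples of 1000 values) and test choices follow the text; the normality criterion on both samples and the function signature are assumptions:

```python
import numpy as np
from scipy import stats

def compare_variable(ehr, lr, n_rounds=100, n=1000, alpha=0.05, seed=0):
    """Compare random samples of one intersection variable.

    Uses Student's t test when both samples look normal under the
    Shapiro-Wilk test, otherwise the Mann-Whitney U test. Returns the
    average p-value and the fraction of rounds with p < alpha."""
    rng = np.random.default_rng(seed)
    pvals = []
    for _ in range(n_rounds):
        a = rng.choice(ehr, size=min(n, len(ehr)), replace=False)
        b = rng.choice(lr, size=min(n, len(lr)), replace=False)
        normal = (stats.shapiro(a).pvalue > alpha
                  and stats.shapiro(b).pvalue > alpha)
        test = stats.ttest_ind if normal else stats.mannwhitneyu
        pvals.append(test(a, b).pvalue)
    return float(np.mean(pvals)), sum(p < alpha for p in pvals) / n_rounds
```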
RDF models are ensembles of decision trees in which variables are randomly split and linked to outcomes. A variable split (a threshold or threshold vector) in a decision tree defines how well a variable can discriminate between two or more outcome groups. Through the discrimination power of each sub-model (a decision tree), variable weights can be assessed as the average discrimination power of each variable in classifying target outcomes. To assess global variable weights, variable weights were calculated for each RDF model, and average weights were then calculated across all models.
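With the Scikit-learn implementation named above, the per-model discrimination power is exposed as `feature_importances_`, so the global averaging step can be sketched as (the function and variable names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def global_variable_weights(models, names):
    """Average each variable's importance across all trained RDF models,
    yielding the global variable weights reported in Results."""
    weights = np.mean([m.feature_importances_ for m in models], axis=0)
    return dict(zip(names, weights))
```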
Data collection, statistical and machine learning data treatments have been performed in accordance with the Declaration of Helsinki. The study was approved by the Clinical Research Ethics Committee of the Parc de Salut Mar (register no 2020/9329/I). Data anonymization was carried out in order to protect the privacy of patients. The need for written informed consent was waived due to the observational nature of the study and retrospective analysis.