RETRAUCI database
RETRAUCI is an observational, prospective, and multicentre nationwide registry that currently includes 52 ICUs in Spain. It has the endorsement of the Neurointensive Care and Trauma Working Group of the Spanish Society of Intensive Care Medicine (SEMICYUC) and currently operates in a web-based electronic format [13]. The present study covers a five-year period (2015–2019). Ethics Committee approval for the registry was obtained (Hospital Universitario 12 de Octubre, Madrid: 12/209). Because this was a retrospective analysis of de-identified data, informed consent was not required.
Mortality within the hospital episode was used as the outcome variable.
The variables collected were classified into several groups (Table 1).
First, we considered patient variables, such as age (Age) and sex (Sex). Injury severity by anatomical area was described according to the AIS model, whose scale ranges from 0 (no involvement) to 6 (maximum involvement) [3]. The anatomical areas were head (AHEAD), neck (ANECK), face (AFACE), thorax (ATHORAX), abdomen (AABDOM), spine (ASPINE), upper extremity (AUPPEREXT), lower extremity (LOWEREXT) and external and thermal injuries (AEXTERNAL). We also considered variables derived from the T-RTS, namely the respiratory rate (PointRR), systolic blood pressure (PointSBP) and Glasgow Coma Score (PointGCS), each scored from 0 points (greatest involvement) to 4 points (normality) [4].
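As an illustration of this 0–4 point coding, the sketch below maps raw clinical values to T-RTS points using the standard coded intervals of the Revised Trauma Score; the registry's exact cut-offs are an assumption here and may differ.

```python
# Hypothetical sketch of the T-RTS point coding (PointGCS, PointSBP, PointRR),
# using the standard Revised Trauma Score intervals (an assumption, not the
# registry's verified implementation).

def point_gcs(gcs: int) -> int:
    """Map a Glasgow Coma Score (3-15) to 0-4 T-RTS points."""
    if gcs >= 13: return 4
    if gcs >= 9:  return 3
    if gcs >= 6:  return 2
    if gcs >= 4:  return 1
    return 0

def point_sbp(sbp: float) -> int:
    """Map systolic blood pressure (mmHg) to 0-4 T-RTS points."""
    if sbp > 89:  return 4
    if sbp >= 76: return 3
    if sbp >= 50: return 2
    if sbp >= 1:  return 1
    return 0

def point_rr(rr: float) -> int:
    """Map respiratory rate (breaths/min) to 0-4 T-RTS points."""
    if 10 <= rr <= 29: return 4
    if rr > 29:        return 3
    if rr >= 6:        return 2
    if rr >= 1:        return 1
    return 0

# A patient with normal physiology scores the maximum T-RTS of 12.
t_rts = point_gcs(15) + point_sbp(120) + point_rr(16)  # 12
```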
Next, we included patient treatment variables, such as the use of mechanical ventilation (MV) in the ICU or the occurrence of a massive haemorrhage (MASSIVEHEM) requiring activation of the massive transfusion protocol [14].
Finally, we used variables that define organ failures: hemodynamic failure (HEMODINAM), defined as an SBP below 90 mmHg requiring the administration of volume, blood products and vasoconstrictor support; respiratory failure (RESPIRATORY), defined as a PO2/FiO2 ratio below 300 during admission; renal failure (AKIDNEY), defined as an increase in creatinine to > 1.5 times the initial value, or a 25% reduction in urine flow to less than 0.5 ml/kg/h for at least 6 h; and coagulopathy (COAGULOP), defined as prolongation of the prothrombin and activated partial thromboplastin times to > 1.5 times the control value, fibrinogen < 150 mg/dl, or thrombocytopenia < 100,000 platelets in determinations from the first 24 hours [13,15,16].
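These definitions can be expressed as simple predicate functions. The sketch below is illustrative only: the function and parameter names are hypothetical, and the logic is simplified (e.g. a single worst PO2/FiO2 value rather than serial readings).

```python
# Illustrative derivation of the binary organ-failure variables from raw
# clinical values, following the definitions in the text. Field names and
# the simplified single-reading logic are assumptions.

def respiratory_failure(po2_fio2: float) -> bool:
    """RESPIRATORY: PO2/FiO2 ratio below 300 during admission."""
    return po2_fio2 < 300

def renal_failure(creatinine: float, creatinine_baseline: float,
                  urine_ml_kg_h: float, hours_oliguric: float) -> bool:
    """AKIDNEY: creatinine > 1.5x the initial value, or urine flow
    reduced to < 0.5 ml/kg/h for at least 6 h."""
    return (creatinine > 1.5 * creatinine_baseline
            or (urine_ml_kg_h < 0.5 and hours_oliguric >= 6))

def coagulopathy(pt_ratio: float, aptt_ratio: float,
                 fibrinogen_mg_dl: float, platelets_per_ul: float) -> bool:
    """COAGULOP: PT or APTT > 1.5x control, fibrinogen < 150 mg/dl,
    or platelet count < 100,000 within the first 24 h."""
    return (pt_ratio > 1.5 or aptt_ratio > 1.5
            or fibrinogen_mg_dl < 150 or platelets_per_ul < 100_000)
```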
Conventional statistics
Variables are described as median (interquartile range) or as a percentage. For the comparison of survivors (A-ALIVE) and non-survivors (D-DIED), the Mann-Whitney U test was used for continuous variables and the chi-square test for categorical variables. A p-value < 0.05 was considered significant.
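For readers reproducing this comparison outside a statistics package, both tests are available in SciPy. The example below uses synthetic data and hypothetical variable names, not the registry data.

```python
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

# Synthetic survivor vs non-survivor comparison (illustrative data only).
rng = np.random.default_rng(0)
age_alive = rng.normal(45, 15, 200)   # continuous variable, e.g. age
age_died = rng.normal(60, 15, 80)

# Mann-Whitney U test for a continuous variable.
u_stat, p_cont = mannwhitneyu(age_alive, age_died, alternative="two-sided")

# Chi-square test for a categorical variable (e.g. MV yes/no),
# expressed as a 2x2 contingency table:
#                  MV=yes  MV=no
table = np.array([[60, 140],    # alive
                  [55,  25]])   # died
chi2, p_cat, dof, expected = chi2_contingency(table)

print(f"Mann-Whitney p = {p_cont:.4g}, chi-square p = {p_cat:.4g}")
```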
Machine learning techniques
We used the WEKA platform (version 3.8): its Explorer module to determine the parameters of the different algorithms and its Experimenter module to establish the differences between the algorithms used. Ten-fold cross-validation was used for all algorithms. WEKA's gain ratio feature evaluator provides an initial selection of variables by ranking them according to their importance [17].
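The gain ratio behind WEKA's evaluator is the information gain of an attribute divided by its split information, which penalises attributes with many values. The following is a from-scratch illustration for discrete attributes, not WEKA's implementation.

```python
import numpy as np

def entropy(labels) -> float:
    """Shannon entropy (bits) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(feature, labels) -> float:
    """Information gain of `feature` divided by its split information."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    info_gain = entropy(labels) - sum(
        w * entropy(labels[feature == v]) for v, w in zip(values, weights))
    split_info = float(-(weights * np.log2(weights)).sum())
    return info_gain / split_info if split_info > 0 else 0.0

perfect = gain_ratio([0, 0, 0, 0, 1, 1, 1, 1],
                     [0, 0, 0, 0, 1, 1, 1, 1])   # fully predictive -> 1.0
useless = gain_ratio([0, 0, 1, 1], [0, 1, 0, 1])  # uninformative -> 0.0
```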
Algorithm selection
Of the multiple algorithms included in WEKA, we selected nine supervised algorithms, classified as traditional or ensemble methods. The first six are traditional models: binary logistic regression (LR), a multilayer perceptron neural network (NN), sequential minimal optimization (SMO), classification rules (JRip), classification trees (CT) and Bayesian networks (BN). We also included three ensemble classification algorithms: adaptive boosting (ADABOOST), bootstrap aggregating (BAGGING) and random forest (RFOREST) [17].
For the LR model, we used a backward stepwise selection procedure, with variable entry at p < 0.05 and removal at p < 0.10. Odds ratios (OR) with 95% confidence intervals were calculated.
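An OR and its confidence interval follow from the fitted coefficient: OR = exp(β), with the 95% CI given by exp(β ± 1.96·SE(β)). The sketch below shows this calculation on synthetic data with a near-unpenalised scikit-learn fit; it is not the registry model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic single-predictor example (illustrative data, not registry data).
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=(n, 1))
logit = -1.0 + 0.8 * x[:, 0]                  # true beta = 0.8, OR ~ 2.23
y = rng.random(n) < 1 / (1 + np.exp(-logit))

model = LogisticRegression(C=1e6).fit(x, y)   # large C ~ no regularisation
beta = model.coef_[0, 0]

# Standard error from the inverse Fisher information at the fitted model.
X = np.hstack([np.ones((n, 1)), x])           # design matrix with intercept
p = model.predict_proba(x)[:, 1]
cov = np.linalg.inv(X.T @ (X * (p * (1 - p))[:, None]))
se = np.sqrt(cov[1, 1])

odds_ratio = np.exp(beta)
ci = np.exp([beta - 1.96 * se, beta + 1.96 * se])
print(f"OR = {odds_ratio:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f})")
```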
In the CT model, we used the J48 algorithm, based on C4.5, to obtain a pruned tree [18]. The JRip algorithm uses the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) rule learner [19]. We limited tree growth (CT) and the number of rules (JRip) by requiring a minimum of 20 instances.
For the BN, we used the TAN (Tree Augmented Naive Bayes) search algorithm to find relations between variables, which generates an interpretable graph. This method does not assume the independence of the variables [20,21].
SMO implements John Platt's sequential minimal optimization algorithm for training a support vector classifier [22]. For the NN, we used automatic selection of the number of nodes in the hidden layer, with a learning rate of 0.3 and a momentum of 0.2 [23]. For RFOREST, we selected ten trees built with the C4.5 algorithm [24]. For the remaining algorithms (ADABOOST and BAGGING), we used WEKA's default parameters [17,25].
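For readers working outside WEKA (which is Java-based), most of these models have rough scikit-learn analogues; JRip and TAN have no direct equivalent there. The sketch below evaluates such analogues with ten-fold cross-validation on synthetic data, mirroring the stated parameters where possible; the mapping between WEKA and scikit-learn models is an approximation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the registry data (illustration only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "NN": MLPClassifier(solver="sgd", learning_rate_init=0.3, momentum=0.2,
                        max_iter=500, random_state=0),
    "SMO~SVC": SVC(kernel="linear"),            # linear SVM, SMO-trained in WEKA
    "CT": DecisionTreeClassifier(min_samples_leaf=20),  # ~20-instance minimum
    "ADABOOST": AdaBoostClassifier(random_state=0),
    "BAGGING": BaggingClassifier(random_state=0),
    "RFOREST": RandomForestClassifier(n_estimators=10, random_state=0),
}

# Ten-fold cross-validated accuracy for each model.
scores = {name: cross_val_score(m, X, y, cv=10).mean()
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name:9s} {acc:.3f}")
```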
Algorithm evaluation
To evaluate the performance of the algorithms, we calculated accuracy, specificity, precision, recall, the F-measure and the area under the ROC curve (AUC). WEKA's Experimenter module, with ten repetitions, allows one to establish whether there are statistically significant differences between the evaluated properties of the algorithms using the corrected paired t-test [17].
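These metrics can be reproduced directly from a model's predictions, as in the sketch below with hypothetical labels and scores (specificity has no dedicated scikit-learn function and is derived from the confusion matrix).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical predictions: y_score would be a model's predicted
# probability of in-hospital death, y_pred its thresholded class.
y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred  = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([.1, .2, .6, .3, .8, .9, .4, .2, .7, .1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)        # true-negative rate

print(f"accuracy    {accuracy_score(y_true, y_pred):.2f}")   # 0.80
print(f"specificity {specificity:.2f}")                      # 0.83
print(f"precision   {precision_score(y_true, y_pred):.2f}")  # 0.75
print(f"recall      {recall_score(y_true, y_pred):.2f}")     # 0.75
print(f"F-measure   {f1_score(y_true, y_pred):.2f}")         # 0.75
print(f"AUC         {roc_auc_score(y_true, y_score):.2f}")   # 0.96
```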