Development and Validation of Dynamic Predictive Models Using Vital Signs for Trauma-associated Severe Hemorrhage: a Comparative Study


Background: This study aimed to develop and validate dynamic predictive models for trauma-associated severe hemorrhage based on vital signs, to enable early warning and dynamic prediction of severe hemorrhage in trauma patients.

Methods: The MIMIC-IV cohort was collected retrospectively. The inclusion criteria were trauma patients aged ≥16 years with complete clinical data. Heart rate, respiratory rate, systolic blood pressure, diastolic blood pressure, and peripheral oxygen saturation were extracted as predictive variables. Based on logistic regression, support vector machine, random forest, adaptive boosting, gated recurrent unit, and gated recurrent unit-D, predictive models for trauma-associated severe hemorrhage were developed and validated to dynamically predict whether severe hemorrhage will occur in trauma patients in the next 1 h/2 h/3 h. The Trauma database of the General Hospital of the People's Liberation Army was used for external validation. The models were developed and validated using Python 3.8.5 software. SPSS.21 software was used for statistical analysis.

Results: Of the 7522 trauma patients in the MIMIC-IV cohort, 283 (3.76%) had a severe hemorrhage. The area under the curve of the gated recurrent unit-D model was the best in the 1 h (0.946±0.029), 2 h (0.940±0.032), and 3 h groups (0.943±0.034), and there was no significant difference among the three groups. In the Trauma cohort, the area under the curve of the gated recurrent unit-D model also achieved the best performance in the 1 h (0.779±0.013), 2 h (0.780±0.008), and 3 h groups (0.778±0.009), and there was no significant difference among the three groups. The gated recurrent unit-D model also retained its advantage when compared with the traditional scoring systems. Moreover, we have developed a web-based predictive system to help clinicians use our models.

Conclusions: This study developed and validated dynamic predictive models for trauma-associated severe hemorrhage based on vital signs to assist pre-hospital or in-hospital emergency personnel in making decisions, and the gated recurrent unit-D model performed best.

Trial registration: The MIMIC-IV database was previously de-identified and reviewed by the institutional review board (IRB) of its host organization and determined to be exempt from further IRB review. We obtained the administrative permissions to use the database (Certification Number: 27959316) for our research, after completing the National Institutes of Health web-based training course: Protecting Human Research Participants. Use of the Trauma database was reviewed and approved by the Ethics Committee of Chinese PLA General Hospital. The ethics approval number is S2021-466-01. Moreover, the informed consent of subjects was waived by the Ethics Committee of Chinese PLA General Hospital.


Background
Trauma is a major global public health issue, contributing to approximately 1 in 10 mortalities and resulting in the annual worldwide death of more than 5.8 million people. Severe hemorrhage is the main cause of preventable death due to trauma. Approximately 40% of trauma deaths are attributed to severe hemorrhage, and up to 50% of such cases are reported dead on arrival at the hospital [1][2][3][4].
Severe hemorrhage can lead to hemorrhagic shock, acute traumatic coagulopathy, and multiple organ dysfunction syndrome in trauma patients with compensatory shock, occult internal bleeding, atypical signs, or multiple injuries. If not found and treated on time, these may eventually lead to death [5]. Emergency personnel involved in pre-hospital and in-hospital first aid should assess the severity of trauma hemorrhage early and identify patients with severe hemorrhage to effectively make triage and evacuation decisions and to implement life-saving interventions, such as damage control surgery, damage control resuscitation, and massive transfusion programs early. Therefore, the construction of trauma-associated severe hemorrhage (TASH) predictive models and early identification of severe hemorrhage are very important to patients' outcomes [6,7].
At present, the predictive models for severe hemorrhage are mostly scoring systems based on logistic regression, such as the TASH score [8], Assessment of Blood Consumption (ABC) score [9], and Prince of Wales (PWH) score [10]. These scores are time-consuming and complex, often requiring laboratory values or ultrasound evaluation upon arrival at the hospital to calculate the results. Furthermore, most of these scoring systems are static evaluations using a single measurement, and the long detection intervals and invasive operations of laboratory or ultrasound examinations make dynamic monitoring difficult to achieve.
In recent years, with the development of machine learning technology, especially its sub-domain deep learning, as well as the emergence of electronic health records and the explosive growth of the information stored in them, studies are increasingly applying machine learning algorithms to the medical field.
This study applies artificial intelligence techniques, namely machine learning and deep learning, to trauma medicine to develop efficient and dynamic trauma-related clinical predictive models that support the decisions of pre-hospital or in-hospital emergency personnel, assist clinicians in diagnosis, improve medical services, and save more lives.

Data sources
Two large databases were selected for this study: the Medical Information Mart for Intensive Care (MIMIC)-IV database (version 0.4) and Trauma database of General Hospital of the People's Liberation Army (PLA) (hereafter referred to as the Trauma database). MIMIC-IV is a database containing real, de-identified medical information obtained during the hospitalization of patients in the Beth Israel Deaconess Medical Center from 2008 to 2019. The MIMIC-IV database includes demographic information, vital signs, laboratory values, imaging reports, treatment records, and death records [11]. The Trauma database is a large, comprehensive database established by the Medical Big Data Research Center of the General Hospital of the PLA. It contains the de-identified data of patients from 2015 to 2020, including vital signs, laboratory values, nursing records, and treatment records.

Data extraction
The same data extraction standards and preprocessing methods were used for the MIMIC-IV and Trauma databases.
Extraction of the study population. The inclusion criteria of the study population were as follows: 1) patients aged ≥16 years admitted to the hospital due to trauma and 2) complete clinical data, with at least one data record for each of the five vital signs, namely, heart rate (HR), respiratory rate (RR), systolic blood pressure (SBP), diastolic blood pressure (DBP), and peripheral oxygen saturation (SpO2).
Patients in the experimental (that is, patients with TASH) and control groups (patients without TASH) were extracted from the study population. The extraction criteria were as follows [12][13][14]: 1) massive transfusion of three or more units of red blood cells within 1 h, anytime during the first 24 h after admission; 2) embolization or hemostatic surgery within 24 h after admission; and 3) death within 24 h after admission. If the patient met any of the above three conditions, it was considered that the patient met the outcome variable and was classified as the experimental group; otherwise, they were classified as the control group.
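The outcome definition above is a logical OR over three criteria. It can be sketched in Python as follows. This is a minimal illustration, not the study's extraction code: the function names, the representation of transfusions as (hours since admission, units) pairs, and the inclusive window boundaries are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: each transfusion is (hours since admission, RBC units).
Transfusion = Tuple[float, float]

def massive_transfusion(events: List[Transfusion],
                        window_h: float = 1.0,
                        min_units: float = 3.0,
                        horizon_h: float = 24.0) -> bool:
    """True if any `window_h`-hour window in the first `horizon_h` hours
    contains at least `min_units` units (boundaries inclusive, an assumption)."""
    evs = sorted((t, u) for t, u in events if 0 <= t <= horizon_h)
    for i, (start, _) in enumerate(evs):
        total = sum(u for t, u in evs[i:] if t - start <= window_h)
        if total >= min_units:
            return True
    return False

def meets_outcome(events: List[Transfusion],
                  procedure_hour: Optional[float],
                  death_hour: Optional[float]) -> bool:
    """Experimental group if any of the three criteria holds."""
    def within_24h(h: Optional[float]) -> bool:
        return h is not None and 0 <= h <= 24.0
    return (massive_transfusion(events)
            or within_24h(procedure_hour)   # embolization or hemostatic surgery
            or within_24h(death_hour))      # death within 24 h
```

A patient for whom `meets_outcome` returns `False` would fall into the control group under this sketch.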
Extraction of vital signs data: 1) Definition of the study section time. For the experimental group, the study section time was defined as the time when the outcome variable was first met. For the control group, the study section time was defined as the time the last vital sign was recorded. 2) Data extraction of the five vital signs. First, the data in the three time intervals of 1-13 h, 2-14 h, and 3-15 h before the study section time were extracted to construct models to predict the probability of severe hemorrhage 1 h/2 h/3 h after trauma (hereinafter referred to as the 1 h, 2 h, and 3 h groups).
Second, no more than 50 of the most recent data samples were extracted from each patient within this time interval. Finally, patients for whom any of the five vital signs was entirely missing were removed after the extraction was completed.
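The windowing step described above can be illustrated with a short Python sketch. It assumes each record is a (time in hours, vital signs) pair and that the window boundaries are inclusive; the function name `extract_window` and its parameters are hypothetical and not taken from the study's code.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (time in hours, {vital sign name: value}).
Record = Tuple[float, Dict[str, float]]

def extract_window(records: List[Record],
                   section_time: float,
                   lead_h: float,
                   span_h: float = 12.0,
                   max_samples: int = 50) -> List[Record]:
    """Return up to `max_samples` of the most recent records that fall
    between (lead_h + span_h) and lead_h hours before the section time.

    lead_h = 1, 2, 3 gives the 1-13 h, 2-14 h, and 3-15 h intervals.
    """
    end = section_time - lead_h
    start = end - span_h
    in_window = sorted((r for r in records if start <= r[0] <= end),
                       key=lambda r: r[0])
    return in_window[-max_samples:]  # keep only the most recent samples
```

For example, with a study section time of 20 h and `lead_h=1`, the sketch keeps records timestamped between 7 h and 19 h.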

Data preprocessing
We cleaned all vital signs data, including HR, RR, SBP, DBP, and SpO2, to remove incorrect values caused by recording errors.
Filling in the missing data. If the value of a vital sign for a given patient was missing at a given moment, the nearest recorded value of that vital sign before or after that moment was used to fill it in.
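One plausible reading of this filling rule is a forward fill followed by a backward fill for any leading gaps. The sketch below follows that reading; the preference for the earlier value over the later one is an assumption, since the text does not state which takes priority, and `fill_missing` is a hypothetical name.

```python
from typing import List, Optional

def fill_missing(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill None entries with the last earlier value; if there is none,
    use the next later value. An all-missing series stays all None."""
    filled = list(series)

    last = None
    for i, value in enumerate(filled):            # forward pass
        if value is None:
            filled[i] = last
        else:
            last = value

    nxt = None
    for i in range(len(filled) - 1, -1, -1):      # backward pass for leading gaps
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled
```

A series that remains entirely `None` after this step corresponds to a patient who would be removed because a vital sign is completely missing.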

Predictive algorithms and statistical methods
In this study, six machine learning and deep learning algorithms were used to develop and to validate the dynamic predictive models of TASH, which were: Logistic Regression (LR), Support Vector Machine (SVM) [15], Random Forests (RF) [16], Adaptive Boosting (AdaBoost) [17], Gated Recurrent Unit (GRU) [18], and Gated Recurrent Unit-D (GRU-D) [19]. The performance of each model was evaluated by accuracy, sensitivity, specificity, false positive rate (FPR), positive predictive value (PPV), negative predictive value (NPV), Youden index, and area under the curve (AUC).
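The threshold-based metrics listed above all derive from the four cells of a confusion matrix. The following sketch uses the standard definitions; the function name and the convention that label 1 means TASH are assumptions for illustration.

```python
from typing import Dict, Sequence

def classification_metrics(y_true: Sequence[int],
                           y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute confusion-matrix metrics for binary labels (1 = TASH, 0 = non-TASH)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")  # undefined when denominator is 0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": ratio(fp, fp + tn),                 # equals 1 - specificity
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "youden": sensitivity + specificity - 1,
    }
```

Because severe hemorrhage is rare in the MIMIC-IV cohort, accuracy alone can look high even when sensitivity is poor, which is why the sensitivity-based metrics and the Youden index are reported alongside it.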
In our study, the trauma data in the MIMIC-IV database were used to develop the predictive models. Ten-fold cross-validation was used to evaluate the model performance. The trauma data in the Trauma database were used for external validation of each model. The models were developed and validated using Python 3.8.5 software. SPSS.21 software was used for statistical analysis. The Mann-Whitney U/Wilcoxon rank-sum test was used to compare two samples of quantitative data, and the Kruskal-Wallis H rank-sum test was used to compare multiple groups of quantitative data. The chi-square test, continuity correction method, or Fisher exact probability test was used to compare categorical data. Differences were considered statistically significant at p<0.05.
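The evaluation procedure above can be sketched in two parts: assigning patients to ten folds and computing the AUC on each held-out fold. The sketch below stratifies the folds by outcome, which is an assumption (the text only says ten-fold cross-validation), and computes the AUC through its rank interpretation, the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. Both function names are hypothetical.

```python
import random
from typing import List, Sequence

def stratified_folds(labels: Sequence[int], k: int = 10, seed: int = 0) -> List[List[int]]:
    """Assign every index to one of k folds, keeping class ratios roughly equal.
    Stratification is an assumption; it helps when positives are rare."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half. Quadratic in sample size, fine for a sketch."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Each fold serves once as the validation set while the other nine are used for training; the per-fold AUC values can then be summarized as mean ± standard deviation.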

Baseline characteristics
In this study, information on 13307 trauma patients was extracted from the MIMIC-IV database. After screening according to the inclusion criteria, a total of 7522 patients were included, with a median age of 63 years (57.30% males). Among them, 283 met the outcome variable of severe hemorrhage, accounting for 3.76% of the study population (Figure 1a, Table 1). Information on a total of 25810 trauma patients was initially extracted from the Trauma database. After screening according to the inclusion criteria, a total of 1686 patients were included, with a median age of 47 years (77.52% males). Among them, 306 met the outcome variable of severe hemorrhage, accounting for 18.15% of the study population (Figure 1b, Table 2).
No significant difference was found in sex, age, and body mass index between the experimental (TASH group) and control groups (non-TASH group) in the MIMIC-IV cohort (Table 1). However, the TASH group had a higher proportion of emergency hospitalization, higher HR and RR, lower blood pressure, and more red blood cell, platelet, and plasma transfusions. Patients in the TASH group received more vasoactive drugs, positive inotropic drugs, and hemostatic drugs, and their sequential organ failure assessment (SOFA), simplified acute physiology (SAPS II), and Oxford acute severity of illness (OASIS) scores were higher; hospital stay and hospital mortality were also increased. In the Trauma cohort (Table 2), patients in the TASH and non-TASH groups showed a similar pattern.

Development of TASH dynamic predictive models based on the MIMIC-IV cohort
As shown in Table 3 (Figure 2a, 2b, 2c, Figure 3a, 3b, 3c). Moreover, the AUC of the GRU-D model showed no statistical difference among the three groups.

External validation of TASH dynamic predictive models based on the Trauma cohort
As shown in Table 4, in the Trauma cohort, the GRU-D model was superior to other models in sensitivity, NPV, Youden index, and AUC, while the RF model performed better in terms of accuracy, specificity, FPR, and PPV. In this study, the purpose was to identify as many patients with TASH as possible; thus, we paid more attention to sensitivity than accuracy. Besides, the AUC is the best index to measure the comprehensive performance of the model. Overall, the GRU-D model is still the best model in the Trauma cohort. Thus, it can be observed that the GRU-D model has better generalization ability than other models. The ROC curve comparison and AUC difference analysis of the six models showed statistical differences between the GRU-D model and the other five models (Figure 3a, 3b, 3c, Figure 4a, 4b, 4c). Furthermore, there was no statistical difference in the AUC of the GRU-D model among the three groups.

Comparison between the GRU-D dynamic predictive models and traditional scoring systems
Figures 5a, 5b, and 5c compare the GRU-D dynamic predictive model with shock index, Larson score, and Vandromme score in the three groups. As shown in Figure 5, the GRU-D model is significantly superior to the traditional severe hemorrhage predictive scores, with the highest AUC. Figures 6a, 6b, and 6c compare the GRU-D dynamic predictive model with the OASIS, SAPS II, and SOFA scores in the three groups. As shown in Figure 6, the GRU-D model is significantly better than the previous severity scores and has the highest AUC.

Development of TASH predictive system
To help doctors use our predictive models, we have developed a web tool, available at http://82.156.217.249:5000/. The predictive system integrates the GRU-D models, which had the best predictive performance, and comprises a data input page and a predictive result display page (as shown in Figure 7a and 7b). The input data template can be downloaded from the data input page. After entering or importing vital signs time series data in the template format, the user clicks the "Submit" button to go to the predictive result display page. This page shows the predicted probability of TASH 1 h/2 h/3 h after trauma together with the input predictive variable data.

Discussion
In this study, we developed and validated three deep learning models to dynamically predict the probability of severe hemorrhage occurring at three points in time following trauma, based on vital signs data of trauma patients from a large-scale public database. The models were further validated in the Trauma database of the University Teaching Hospital. Moreover, we provide an open and accessible data interface for the public to use and to validate our model. Our predictive models can help pre-hospital or in-hospital clinicians in the early identification, dynamic prediction, and decision making regarding patients with severe hemorrhage from trauma, thus saving more lives.
There are already some scoring systems for TASH. For example, the TASH score [8], PWH score [10], traumatic bleeding severity score (TBSS) [20], and modified TBSS [21] require clinical assessment, laboratory values, and ultrasound assessment. Scores such as the Hsu [22], Larson [23], McLaughlin [24], and Vandromme scores [25] require clinical assessment and laboratory values. The ABC score [9] requires clinical and ultrasound assessment. The above scoring systems, which require results of laboratory values or ultrasound assessment, are more complex and often require patients to arrive at the hospital to calculate the score results; thus, it is time-consuming for in-hospital evaluation and not suitable for pre-hospital evaluation [26].
Furthermore, most of these scoring systems are static evaluations using a single measurement, and the long detection intervals and invasive operations of laboratory or ultrasound examinations make dynamic monitoring difficult to achieve.
The TASH dynamic predictive models developed in our study only depend on vital signs, which can be easily obtained in pre-hospital or in-hospital environments, and medical staff can easily record the data regularly. Simple feature selection also ensures that the predictive models can be continuously and automatically recalculated before or during hospitalization, providing valuable information on whether patients are responding to treatment, thus making it easier for medical professionals to modify their treatment plans. In addition, the simplicity of the input and output improves the interpretability of the predictive models, thus increasing the possibility that health care providers trust their predictions [27][28][29][30][31][32].
Comparing the evaluation indexes of each model on the MIMIC-IV database, the GRU-D model is generally better than the GRU model; the GRU model is better than the AdaBoost, RF, and SVM models; there is no obvious difference among the AdaBoost, RF, and SVM models; and these three models are better than the LR model. A likely reason for these differences is that the LR, SVM, RF, and AdaBoost models are traditional machine learning algorithms whose input is a single five-dimensional vector, whereas the GRU and GRU-D models are deep learning algorithms whose input is a time series of five-dimensional vectors comprising the five vital signs. Moreover, the GRU model is a variant of the traditional recurrent neural network that mitigates the vanishing gradient problem. The GRU-D model is a GRU variant proposed by Che et al. in 2018 that can handle irregularly sampled time series data with missing values. Its input includes the time series data, the mask, and the time interval information. During training, it uses the time interval between consecutive recorded values to capture the relationships within the time series, fills in the missing data, and makes predictions at the same time. In GRU-D, data filling and prediction are both performed within the neural network training process; thus, the parameters related to data filling are continuously optimized during training, which improves the predictive result [19].
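To make the three GRU-D inputs concrete, the sketch below builds the mask and the time interval matrices from an irregularly sampled series, following the definition in the GRU-D paper [19]: the mask is 1 where a value was observed, and the interval for each variable is the time elapsed since that variable was last observed (0 at the first step). Only the input construction is shown; the trainable decay mechanism and the network itself are omitted, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

def masks_and_deltas(times: List[float],
                     values: List[List[Optional[float]]]
                     ) -> Tuple[List[List[int]], List[List[float]]]:
    """times[t] is the timestamp of step t; values[t][d] is variable d at
    step t, or None if it was not recorded. Returns (mask, delta)."""
    n_feat = len(values[0]) if values else 0
    masks: List[List[int]] = []
    deltas: List[List[float]] = []
    for t, row in enumerate(values):
        masks.append([0 if v is None else 1 for v in row])
        if t == 0:
            deltas.append([0.0] * n_feat)
            continue
        gap = times[t] - times[t - 1]
        deltas.append([
            # If variable d was missing at the previous step, the elapsed
            # time keeps accumulating; otherwise it restarts from the gap.
            gap + (deltas[t - 1][d] if masks[t - 1][d] == 0 else 0.0)
            for d in range(n_feat)
        ])
    return masks, deltas
```

In this study's setting each row would hold the five vital signs (HR, RR, SBP, DBP, SpO2), so the mask and interval matrices have the same shape as the vital sign time series.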
The GRU-D model, which has the best performance in our study, is compared with the traditional scoring systems. The shock index has been recommended to predict massive blood transfusion and emergency operation after trauma and is widely used in pre-hospital and battlefield environments [33,34]. The Vandromme score was put forward by Vandromme and his colleagues in 2012, which was used to identify patients with massive blood transfusion risk [25]. The Larson score was put forward by Larson and his colleagues in 2010 based on a combat database, which was used to predict the massive blood transfusion needs of combat casualties [23]. The OASIS, SAPS II, and SOFA scores are commonly used severe illness scoring systems at present, which are often used to predict the severity of patients' illness or hospitalization mortality. Based on the MIMIC-IV cohort, our study compares the GRU-D model with the above scoring systems. The GRU-D model has the highest AUC, which reflects the advantages of the GRU-D model in the dynamic prediction of TASH. and ABC score (AUC 0.782). The performance of the TASH score was better than that of the other two scoring systems [36]. Compared with our study, the performance of the TASH score was not as good as that of the GRU-D model in the MIMIC-IV cohort. By comparing the AUC of predictive models among different studies, the GRU-D model based on vital signs still has a good predictive effect, which confirms the advantage of the GRU-D model in predicting TASH.
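Of the comparison scores discussed above, the shock index is the simplest to state: it is the ratio of heart rate to systolic blood pressure, so it uses two of the same vital signs as the dynamic models but only a single measurement. A minimal sketch, with a hypothetical function name and parameter names:

```python
def shock_index(heart_rate_bpm: float, systolic_bp_mmhg: float) -> float:
    """Shock index = heart rate (beats/min) / systolic blood pressure (mmHg)."""
    if systolic_bp_mmhg <= 0:
        raise ValueError("systolic blood pressure must be positive")
    return heart_rate_bpm / systolic_bp_mmhg
```

For instance, a heart rate of 120 beats/min with a systolic pressure of 80 mmHg gives a shock index of 1.5.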
In addition to internal validation, we validate the model externally based on the Trauma database of the General Hospital of the PLA. As shown in Figures 4a, 4b, and 4c, the AUC of the GRU-D model is larger than that of the other models, indicating that our models have significant generalization ability and clinical value. To help clinicians use our models, we have developed a web-based predictive system, which provides a user-friendly interface.
After entering the variables, the probability of severe hemorrhage occurring at three time points after trauma is shown. These results will help clinical decision-makers understand the condition of patients and prepare appropriate treatment strategies.
This study had certain limitations. First, the study population in this study only included adult patients, and further study population division based on age was not considered. However, age plays an important role in predicting the risk of severe hemorrhage. The age of the experimental group was significantly younger than that of the control group in the Trauma dataset in our study. Some studies have shown that elderly patients are more likely to have severe hemorrhage [20]. In future studies, we will divide the patients into different subgroups based on age for further discussion [37]. Second, the severe hemorrhage predictive models can only guide the doctor's clinical decision-making process and cannot replace the doctor's clinical judgment and other diagnostic tests. Finally, this is a retrospective observational study. Although the quality of the MIMIC-IV and Trauma databases is very high, there are still data losses and input errors. Therefore, prospective validation is still needed in the future. In future studies, it is also necessary to determine whether the use of dynamic predictive models for TASH reduces the waiting time before massive blood transfusion or damage control surgery and its impact on the prognosis of trauma patients.
Conclusions

34. Sharma A, Naga Satish U, Tevatia MS, Singh SK. Prehospital shock index, modified shock index, and pulse pressure heart rate ratio as predictors of massive blood transfusions in modern warfare injuries: a retrospective analysis. Med J Armed Forces India. 2019;75:171-5.

Comparison of the GRU-D model with the SI, Larson score, and Vandromme score.