Development of a Machine Learning Model for the Prediction of the Real Time Mortality in Patients in the Intensive Care Unit

Prediction of mortality in intensive care units is very important, and various mortality prediction models have been developed for this purpose. However, they do not accurately reflect the changing condition of the patient in real time. The aim of this study was to develop and evaluate a machine learning model that predicts short-term mortality in the intensive care unit using four easy-to-collect vital signs. Two independent retrospective observational cohorts were included in this study. The primary training cohort included the data of 1968 patients admitted to the intensive care unit at Health Korea from January 2018 to March 2019. The external validation cohort comprised the records of 409 patients admitted to the medical intensive care unit at Seoul National University Hospital, Seoul, South Korea, from January 2019 to December 2019. Datasets of four vital signs (heart rate, systolic blood pressure, diastolic blood pressure, and peripheral capillary oxygen saturation [SpO2]) measured every hour for 10 h were used for the development of the machine learning model. The performances of mortality prediction models generated using five machine learning algorithms, Random Forest (RF), XGBoost, perceptron, convolutional neural network, and Long Short-Term Memory, were calculated and compared using area under the receiver operating characteristic curve (AUROC) values and an external validation dataset.

Abbreviations: AUROC: Area under the receiver operating characteristic curve; CI: Confidence interval; CNN: Convolutional Neural Network; COVID-19: Novel coronavirus disease; DBP: Diastolic blood pressure; EHR: Electronic health record; HR: Heart rate; ICT: Information and communications technology; ICU: Intensive care unit; LSTM: Long Short-term Memory; LTSC: Living and treatment support center; MICE: Multivariate imputation by chained equation; MLP: Multilayer perceptron; MPM: Mortality Probability Model; NRMSE: Normalized root mean square error; ReLU: Rectified linear unit; RF: Random Forest; RNN: Recurrent neural network; SAPS: Simplified Acute Physiology Score; SBP: Systolic blood pressure; SHAP: Shapley Additive exPlanation; SpO2: Peripheral capillary oxygen saturation

To investigate the importance of the variables that influence the performance of the machine learning model, models were generated for each observation time or vital sign using the RF algorithm. The model developed using SpO2 alone showed the best performance (AUROC, 0.89).

Conclusions
The mortality prediction model developed in this study using data from only four types of commonly recorded vital signs is simpler than any existing mortality prediction model. This simple yet powerful new mortality prediction model could be useful for early detection of probable mortality and appropriate medical intervention, especially in rapidly deteriorating patients.

Background
Multiple reports have indicated that changes in vital signs often precede the rapid deterioration of a patient's condition [1,2]. This has led to the development of various models for predicting mortality [3,4].
The most well-known mortality prediction models are the Acute Physiology and Chronic Health Evaluation (APACHE) system, the Simplified Acute Physiology Score (SAPS), and the Mortality Probability Model (MPM) [5][6][7]. These classic mortality prediction models were developed based on logistic regression, which is easy to interpret. However, although these models show excellent predictive power in actual clinical practice, they have some limitations, such as the inability to handle non-linear relationships and limited handling of internal interactions among predictor variables [8].
In 1950, Alan Turing proposed the Turing test (also known as the imitation game), which assesses whether a computer can produce human-like responses. However, the concept of machine learning was first introduced by Arthur Samuel in 1959, with his groundbreaking computer checkers program [9]. Machine learning refers to computer algorithms that use sample data, called training data, to create mathematical models for prediction or decision-making on their own, without the need for sophisticated programming. Traditional statistics produce models that focus on the cause and effect between variables in the data, whereas machine learning produces models that focus only on predictive power [9]. Over the past few decades, machine learning has led to remarkable advances in algorithms, and in recent years its applications in the medical field have been expanding.
Given that the existing mortality prediction models predict mortality using data recorded on the first day of admission to the intensive care unit, they are fundamentally limited in predicting the prognoses of patients with dynamically fluctuating vital signs.
Therefore, the purpose of this study was to develop and evaluate a machine learning-based short-term mortality prediction model for the assessment of real-time mortality in patients in the intensive care unit (ICU).

Data sources
We used the electronic health record (EHR) data of all patients older than 18 years. Data preprocessing was performed for analysis. All non-numeric data were treated as missing values. If vital sign values exceeded the defined range of physiologically possible values (HR > 300 beats per minute, SBP > 300 mmHg, DBP > 200 mmHg, or SpO2 > 100%), the patient data were removed and were not used for the development of the model. The data of 32 patients were excluded based on this criterion. Thus, the data of 1936 patients were used for the development of the model. If vital signs were measured multiple times within 1 h, the last measurement was used for analysis. The same preprocessing procedure was applied to the external validation dataset, in which the record of one of the 409 patients was excluded.
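As a rough illustration (not the authors' code), the exclusion and last-measurement-per-hour rules described above can be sketched in pandas; the column names `patient_id` and `charttime` are assumptions:

```python
import pandas as pd

# Physiologic plausibility limits from the text; any out-of-range value
# marks the whole patient record for exclusion.
LIMITS = {"HR": 300, "SBP": 300, "DBP": 200, "SpO2": 100}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop patients with implausible vitals, then keep the last
    measurement within each hour for each patient."""
    df = df.copy()
    # Non-numeric entries become missing values.
    for col in LIMITS:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Exclude every record of a patient with any out-of-range value.
    bad = df[list(LIMITS)].gt(pd.Series(LIMITS)).any(axis=1)
    bad_patients = df.loc[bad, "patient_id"].unique()
    df = df[~df["patient_id"].isin(bad_patients)]
    # If a vital sign was measured several times within an hour,
    # keep only the last measurement of that hour.
    df["hour"] = df["charttime"].dt.floor("h")
    return (df.sort_values("charttime")
              .groupby(["patient_id", "hour"], as_index=False)
              .last())
```

For example, a patient with SpO2 recorded as 101% would be dropped entirely, while two heart-rate readings within the same hour would collapse to the later one.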
More effort was put into preprocessing the data, especially when dealing with missing values. Approximately 1.8% of the ICU data used for model training had missing values. Although this proportion cannot be called high relative to the size of the entire dataset, the problem was that the distribution of the missing values was uneven. ICU data recorded at time points far from the time a patient was discharged from the ICU (e.g., 10 hours before leaving the ICU) often showed omission rates of less than 0.5%. However, ICU data recorded just before the patient left the ICU showed omission rates of up to 20% in most cases (Figure 1).
All patient data from the training dataset with missing values were removed to generate a complete dataset without missing values (data from 1239 patients in the ICU). Values in each vital sign column were then randomly deleted in the same proportions as those seen in Figure 1, so that the artificially generated dataset had a proportion of missing values similar to that of the original data. The missing values were filled in using five well-known missing-value imputation packages in R (multivariate imputation by chained equation [MICE], Amelia, MissForest, Hmisc, and MI). After filling in the missing values with each R package, the imputation performance was assessed using the normalized root mean square error (NRMSE), defined as

NRMSE = sqrt(mean((X_true − X_imp)^2) / var(X_true)),

where X_true and X_imp are derived from the original complete data matrix and the imputed data matrix, respectively, and mean and var are calculated only over the artificially deleted values [10]. MissForest showed the lowest NRMSE and was therefore used for imputation (Figure 2).
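The mask-and-score procedure above (delete known values, impute, compare) can be sketched as follows; the uniform per-column deletion pattern is a simplification of the time-dependent pattern in Figure 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def nrmse(x_true: np.ndarray, x_imp: np.ndarray, mask: np.ndarray) -> float:
    """sqrt(mean((x_true - x_imp)^2) / var(x_true)), computed only over
    the artificially deleted entries, as in the MissForest evaluation."""
    diff = x_true[mask] - x_imp[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(x_true[mask])))

def delete_at_random(x: np.ndarray, frac: float):
    """Delete a given fraction of entries in each column, returning the
    masked matrix (NaNs) and the boolean deletion mask."""
    x_missing = x.astype(float).copy()
    mask = np.zeros(x.shape, dtype=bool)
    n = x.shape[0]
    k = max(1, int(round(frac * n)))
    for j in range(x.shape[1]):
        rows = rng.choice(n, size=k, replace=False)
        mask[rows, j] = True
    x_missing[mask] = np.nan
    return x_missing, mask
```

An imputation method can then be scored by filling `x_missing` and passing the result to `nrmse`; lower values indicate better reconstruction of the deleted entries.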

Development of the machine learning model
A machine learning method was applied for the prediction of short-term mortality in patients in the ICU. Model development was performed using five machine learning algorithms: Random Forest (RF), XGBoost, multilayer perceptron, convolutional neural network (CNN), and Long Short-term Memory (LSTM). The respiratory rate (RR) was included as one of the features of the first dataset collected for model development. However, because the respiratory rate of a critically ill patient who has undergone airway intubation can be adjusted arbitrarily by the medical staff, it was excluded from the features used in developing the final model.
Given the risk of leaking information about survival and death into the machine learning model, the last vital sign recorded upon discharge from the ICU was deleted and not used for machine learning development. Thus, the ICU data of four vital signs recorded over 10 h were used.
Common sense indicates that using data recorded closer to the outcome provides better predictive power. However, such a model can only predict mortality in the near future; as the time available to manage patients in the ICU appropriately shrinks, the value of the model decreases substantially. Predicting mortality further in advance provides more time for appropriate medical care in the ICU, but the smaller amount of available data makes it difficult to build a good model. For the development of our model, we therefore sought a balance between predictive power and how far in advance ICU mortality could be predicted. We judged that a model able to predict mortality four hours in advance would offer both practicality and good predictive ability. Thus, the development of a model that could predict mortality four hours in advance was the basic aim of this study.
The area under the receiver operating characteristic curve (AUROC) was primarily used to evaluate the ability to predict survival or death of patients in the ICU. The ICU data were divided into a training dataset (2/3) and a test dataset (1/3). To overcome overfitting and selection bias problems, we subjected the training data to five-fold cross-validation for the optimization of hyperparameters and the selection of models.
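The evaluation scheme above (2/3 training split, five-fold cross-validation for model selection, AUROC on the held-out third) can be sketched with scikit-learn on synthetic stand-in data; the feature dimensions and the RF configuration here are illustrative, not the study's settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(1)
# Synthetic stand-in for the 40-column vital-sign matrix (4 signs x 10 h).
X = rng.normal(size=(600, 40))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

# 2/3 training, 1/3 held-out test split, as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=0)

# Five-fold cross-validation on the training data for model selection.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = []
for tr_idx, va_idx in cv.split(X_tr, y_tr):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr[tr_idx], y_tr[tr_idx])
    p = model.predict_proba(X_tr[va_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y_tr[va_idx], p))

# Refit on the full training split and report the test-set AUROC.
final = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
test_auc = roc_auc_score(y_te, final.predict_proba(X_te)[:, 1])
```

Stratified splitting keeps the mortality rate comparable across folds, which matters for an imbalanced outcome such as ICU death.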
For the development of the perceptron model, we used a single-layer feed-forward neural network and tried all combinations of the following settings to find the optimal hyperparameters: the number of first-layer nodes (40, 42, 44, 45, 46, 48, 50, 52), the number of hidden-layer nodes (22, 24, 26, 28, 30, 32, 34), and dropout (0.1, 0.2, 0.3, 0.4). The rectified linear unit (ReLU) was used as the activation function for one input layer and one hidden layer. Sigmoid was used as the activation function for the output layer.
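A minimal Keras sketch of this architecture and its search grid follows (the paper's own scripts are not published; the 40-feature input size and the Adam optimizer are assumptions):

```python
import itertools
import numpy as np
from tensorflow import keras

def build_perceptron(n_first: int, n_hidden: int, dropout: float,
                     n_features: int = 40) -> keras.Model:
    """Feed-forward network matching the search space in the text:
    ReLU input and hidden layers, sigmoid output."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(n_first, activation="relu"),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(n_hidden, activation="relu"),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC()])
    return model

# Exhaustive grid from the text: 8 x 7 x 4 = 224 candidate settings.
grid = list(itertools.product((40, 42, 44, 45, 46, 48, 50, 52),
                              (22, 24, 26, 28, 30, 32, 34),
                              (0.1, 0.2, 0.3, 0.4)))
```

Each grid entry would be trained and scored under the five-fold cross-validation described above, keeping the setting with the highest validation AUROC.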
The development of the CNN model consisted of the use of an embedding layer, one-dimensional convolutional operations followed by max pooling, and one neural input layer with the ReLU activation function. All combinations of the following settings were used to find the optimal hyperparameters: the number of dimensions in the embedding layer (32, 36, 40, 44, 48, 52, 56, 60, 64), the number of neural nodes after max pooling (16, 20, 24, 28, 32, 36, 40, 44), and dropout (0.1, 0.2, 0.3, 0.4). Sigmoid was used as the activation function for the output layer.
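A hedged Keras sketch of such a 1-D CNN over the hourly sequence follows. Because an embedding lookup expects integer tokens while vital signs are continuous, a per-time-step Dense projection stands in for the paper's embedding layer; this substitution, like the kernel size, is an assumption:

```python
import numpy as np
from tensorflow import keras

def build_cnn(embed_dim: int, n_dense: int, dropout: float,
              seq_len: int = 10, n_signs: int = 4) -> keras.Model:
    """1-D CNN over the hourly sequence of the four vital signs:
    projection -> convolution -> max pooling -> dense -> sigmoid."""
    model = keras.Sequential([
        keras.layers.Input(shape=(seq_len, n_signs)),
        # Per-time-step projection in place of an embedding layer.
        keras.layers.Dense(embed_dim, activation="relu"),
        keras.layers.Conv1D(filters=embed_dim, kernel_size=3,
                            activation="relu"),
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dense(n_dense, activation="relu"),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

The convolution slides over the time axis, so each filter learns a short temporal pattern (e.g., a three-hour trend) shared across the whole stay.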
The default RF package was used in the development of the RF model. All combinations of the following settings were used to find the optimal hyperparameters: the number of trees in the forest (1, 2, 4, 8, 16, 32, 64, 100, 200) and the minimum number of data points before node split (0.1, 0.5, 5).
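The same grid can be expressed with scikit-learn's `GridSearchCV` (a stand-in for the unspecified RF package); note that `min_samples_split` accepts either a fraction (0.1, 0.5) or an absolute count (5), matching the mixed values reported:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid from the text: 9 tree counts x 3 split thresholds = 27 settings.
param_grid = {
    "n_estimators": [1, 2, 4, 8, 16, 32, 64, 100, 200],
    "min_samples_split": [0.1, 0.5, 5],
}

# Synthetic stand-in data; the real features are the hourly vital signs.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 40))
y = (X[:, 0] > 0).astype(int)

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=5)
search.fit(X, y)
```

`search.best_params_` then holds the tree count and split threshold with the highest cross-validated AUROC.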
The XGBoost model was developed using default settings without manual parameter adjustments. Default parameter values were used for all other parameters not mentioned. Machine learning was performed using the authors' own Keras scripts written in Python under the Scientific Python Development Environment (Spyder) (Keras version 2.2.5; backend: TensorFlow 1.15.0; Python 3.7).

Explainable machine learning
The Shapley Additive exPlanation (SHAP) algorithm was used to overcome the lack of explanation for machine learning decisions, which is known as the black box issue. The SHAP explanation method uses coalitional game theory to calculate a Shapley value that represents the extent to which each feature contributes to predicting an outcome. The SHAP value was calculated using the TreeShap algorithm, which can be applied to tree-based machine learning, such as decision trees and RF [11].

Statistical analysis
The performance of the machine learning algorithms (perceptron, XGBoost, LSTM, CNN, and RF) and APACHE II scores were compared using AUROC. Accuracy, sensitivity, specificity, positive predictive value, and negative predictive value were calculated for all predictive models in this study. Statistical analyses were performed using Rex (Version 3.0.3, RexSoft Inc., Seoul, Korea) and R 3.5.1 (R Development Core Team; R Foundation for Statistical Computing, Vienna, Austria).

Descriptive statistics
The ICU data of 1968 patients were obtained for the development dataset. The records of 32 patients were excluded because they contained implausible data. Thus, the data of 1936 patients were used for the development of the model. The mean age of the patients was 75.38 ± 8.60 years, and 1742 of them were male. A total of 300 patients (15.2%) died in the ICU (Table 1, Figure 3). The performance of each model is shown in Figure 4.
In the external validation set, the APACHE II score had an AUROC of 0.84 (95% CI, 0.805-0.879). Thus, we decided to use the RF algorithm for model development because of its superior performance.

Effect of observation time and category selection on model performance
To investigate the importance of the variables that influence the performance of the machine learning model, machine learning models were generated for each observation time or vital sign using the RF algorithm.
Datasets of the four vital signs recorded over a 10-hour period were the primary data for this study. Using 5, 6, 7, 8, 9, and 10 h of vital sign data, counted from 10 h before discharge, models for predicting intensive care unit mortality 6, 5, 4, 3, 2, and 1 h in advance, respectively, were developed using the RF algorithm.
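The per-lead-time datasets described above can be sketched as slices of a patients × hours × signs array; the layout (hour index 0 = 10 h before discharge) and the resulting lead time of 11 − n hours are assumptions consistent with the text:

```python
import numpy as np

def make_window_dataset(vitals: np.ndarray, n_hours: int) -> np.ndarray:
    """Flatten the first `n_hours` of hourly vitals into one feature
    row per patient.

    vitals: (patients, 10, signs), where hour index 0 is 10 h before
    ICU discharge; using n_hours of data leaves a lead time of
    11 - n_hours hours before discharge (e.g., 7 h of data -> 4 h lead)."""
    window = vitals[:, :n_hours, :]
    return window.reshape(window.shape[0], -1)

# Synthetic stand-in: 100 patients x 10 hourly records x 4 vital signs.
rng = np.random.default_rng(4)
vitals = rng.normal(size=(100, 10, 4))
per_window = {h: make_window_dataset(vitals, h) for h in (5, 6, 7, 8, 9, 10)}
```

One RF model would then be fitted per entry of `per_window`, trading observation length against how far in advance mortality is predicted.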
To determine the best vital sign for predicting mortality, a model was developed using only one of the four vital signs recorded during the seven hours used for model development. Comparison of the performance of the models based on AUROC showed that the model based on SpO2 had the best performance. The order of performance of the models was as follows: SpO2 > DBP > SBP > HR (Figure 5).
Relative importance of each feature to model performance

Comparison of the performance of models using a single variable or multiple observation times has the disadvantage of being unable to investigate complex interrelationships between multiple variables.
The SHAP algorithm was applied to the machine learning model to determine the effect of a single feature on its output in the presence of correlated features. The SHAP values, calculated using a game-theoretical approach, represent both the importance (magnitude of contribution) and the direction (sign) of each feature's effect on the output of the model [12].
The machine learning model generated using the RF algorithm was analyzed using the SHAP algorithm with a 7-hour vital sign dataset. The results indicated that the higher the last recorded SpO2 and SBP levels, the better the patient's chance of survival (Figure 6).

Discussion
In this study, we developed a short-term mortality prediction model for patients in the ICU using four vital signs (HR, SpO2, DBP, and SBP) recorded at one-hour intervals. The performances of mortality prediction models constructed using five machine learning algorithms (RF, XGBoost, perceptron, CNN, and LSTM) were evaluated using AUROC values. The best performance was achieved with the RF algorithm (AUROC, 0.922; 95% CI, 0.881-0.951). To address the black-box problem observed in machine learning models, we applied the SHAP algorithm to explain our model.
Models that predict mortality in critically ill patients are already available. At present, the APACHE II, SAPS, and MPM are the most popular models [5][6][7]. Although the clinical usefulness of these models is already well known, their major drawback is that they use several static physiological parameters to predict patient risk early during hospitalization. Thus, the application of static predictive models to rapidly changing clinical situations in the ICU is limited. Changes in the medical technology environment, such as improvements in computer performance and predictive software algorithms, and expansion of the use of electronic medical records, have made it possible to develop dynamic tools that can predict patient risk in real time.
Thorsen-Meyer et al. reported a real-time LSTM model predicting 90-day mortality, developed on a dataset of 15,625 ICU admissions [13]. Meyer et al. described the development of a real-time recurrent neural network (RNN)-based model for the prediction of postsurgical complications, such as mortality, renal failure, and bleeding, in cardiac patients [14]. Kim et al. reported a real-time mortality prediction model based on a CNN algorithm created using pediatric ICU datasets [15]. All authors of these previous reports asserted that their new models outperformed the corresponding clinical reference tools.
Given that it is challenging to evaluate the performance of a model by collecting patient data in real time for research purposes, it is common to retrospectively validate a model using an independent external dataset. However, whether the evaluation is retrospective or prospective, there are always missing values in the data collected. How these missing values are dealt with is always of great interest to researchers. In the present study, we handled missing values in our dataset by deleting the relevant data or replacing the missing values. In most cases, imputation of missing values was performed when there were not too many missing values.
Imputation of missing values is the most important and sophisticated step in data processing for machine learning modeling. In the present study, the missing-value imputation method used in the processing of the training dataset was also used for the validation dataset. It is not easy to recognize the problem of incorrect imputation of missing values until the model is tested on new real data to show its real performance.
In simple datasets, it is possible to replace missing values with summary statistics such as the mean or median of the dataset. However, multiple imputation (model-based imputation) using complex statistics or machine learning models is currently more popular. Multiple imputation is an advanced statistical method that replaces missing values with a number of plausible values computed using various methods, such as MICE, the RF algorithm, and k-nearest neighbors [16]. In this study, we used five imputation models (MICE, Amelia, MissForest, Hmisc, and MI) provided as R packages.
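The contrast between summary-statistic and model-based imputation can be illustrated with scikit-learn, whose `IterativeImputer` works in the round-robin spirit of MICE (the R packages the study used are not reproduced here; the correlated synthetic data is an assumption chosen so model-based imputation has something to exploit):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(5)
# Two pairs of strongly correlated columns, mimicking related vitals.
base = rng.normal(size=(300, 2))
noise = 0.3 * rng.normal(size=(300, 2))
X = np.column_stack([base[:, 0], base[:, 0] + noise[:, 0],
                     base[:, 1], base[:, 1] + noise[:, 1]])

# Delete ~10% of entries at random.
X_missing = X.copy()
miss = rng.random(X.shape) < 0.1
X_missing[miss] = np.nan

# Summary-statistic imputation: replace with the column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)

# Model-based imputation in the spirit of MICE: each feature is
# regressed on the others in round-robin fashion.
X_iter = IterativeImputer(random_state=0).fit_transform(X_missing)

def rmse(a: np.ndarray, b: np.ndarray, m: np.ndarray) -> float:
    """Reconstruction error over the deleted entries only."""
    return float(np.sqrt(np.mean((a[m] - b[m]) ** 2)))
```

On data with correlated features, the model-based imputer reconstructs the deleted values far more accurately than the column mean, which is exactly the situation with physiologically linked vital signs.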
In our internal validation dataset, most of the missing data were found in the datasets of the last three hours (68.3%). It is not clear what causes missing values just before patient discharge. It is possible that the medical staff accidentally entered an incorrect value while preparing the patient for discharge from the ICU, or that vital sign monitoring was turned off while the patient was waiting to be moved to the ward.
In the present study, the basic mortality prediction model did not include vital signs from the last three hours. However, since the trend of vital signs until discharge may affect the quality of the model, we performed missing-value imputation for the entire 10-hour dataset. The unequal distribution of missing values made it difficult to assess the reliability of the missing-value imputation method. To overcome this drawback, after removing all patient data with missing values from the dataset, data were randomly deleted according to time and vital sign at a rate similar to that of the original data. Thereafter, the missing values were imputed using the five multiple imputation methods mentioned above. In the evaluation of the imputation methods using NRMSE, MissForest showed the best performance on the dataset used in this study.
Two types of machine learning algorithms were used in this study: neural networks (multilayer perceptron, CNN, and LSTM) and tree types (RF and XGboost). The main difference between the two algorithms is that the decision trees that predict the output label in tree types are independent of each other, whereas the neurons of neural networks are interconnected [17][18][19].
Three types of neural network architectures were used in this study: multilayer perceptron (MLP), CNN, and LSTM. The MLP is a classical algorithm that consists of three layers (input, hidden, and output). The MLP processes inputs in the forward direction, and any nonlinear function can be approximated using an MLP [20]. A CNN consists of three layers: convolution, pooling, and output. Convolution layers act as filters that extract relevant features from the input data. In addition, CNNs show impressive performance in image classification [19]. An LSTM network is a type of recurrent neural network that specializes in sequential data processing. A block of memory-storing information is attached to the RNN to facilitate the learning of temporal relationships [21].
The basic concept of tree-based algorithms is to create a series of if-then rules to predict the output from the input. There is a large difference between the RF algorithm and the XGBoost algorithm in model generation: RF randomly resamples data to build a better model, whereas XGBoost improves the model by passing the problems found in the previous model to the next model [22]. In this study, RF performed best in predicting ICU mortality. Most of the machine learning algorithms used in this study performed better than the APACHE II score. The APACHE II score is based on initial patient data recorded immediately after admission, whereas the model developed in this study is based on patient data recorded close to the time of discharge, a situation for which the APACHE II model is not suited. We analyzed the development of the model using selective features and the SHAP algorithm to specifically observe how the model behaves. Examining the performance of the model at various observation points showed that a better model was obtained using the longer vital sign record extending close to the time of discharge.
We compared the performance of machine learning models generated using a single vital sign. The model developed with only SpO2 data showed the best performance, whereas the model developed with only HR data showed the worst. This is somewhat consistent with the results of the analysis conducted using the SHAP algorithm, which showed the strongest positive correlation between the last recorded SpO2 value and the survival of the patient. Further research is needed to determine whether early prediction of mortality using the model developed in this study can improve a patient's prognosis. In addition, considering that the last vital sign data recorded before a patient is discharged were used in the development of this model, further research is needed to determine whether the model can be useful in predicting mortality events, such as sudden cardiac arrest, during hospitalization.
Faced with the challenge of managing the spread of the novel coronavirus disease (COVID-19), the Korean government entrusted our hospital with the operation of living and treatment support centers (LTSCs) for the management of clinically healthy patients with COVID-19. We implemented information and communications technology (ICT)-based remote patient management systems at a COVID-19 LTSC by adopting new electronic health record templates, hospital information system dashboards, cloud-based medical image sharing, a mobile app, and smart vital sign monitoring devices [23]. This mortality prediction model will be useful for optimizing outcomes when applied through our ICT-based tools and applications, which are becoming increasingly important in healthcare.

Conclusions
We developed a model for predicting patient mortality using only four types of commonly recorded vital sign data, which is simpler than any existing mortality prediction model. Further, we showed that even a single one of the four vital signs can predict mortality reasonably well, which may be useful in more resource-limited healthcare settings. This simple yet powerful new mortality prediction model could be useful for early detection of probable mortality and appropriate medical intervention, especially in rapidly deteriorating patients.
Figure legends

Figure 1. Percentage of missing values for each feature of the five vital signs. The time on the X-axis indicates the number of hours before discharge from the intensive care unit. HR, heart rate; SBP, systolic blood pressure; DBP, diastolic blood pressure; RR, respiratory rate; SpO2, peripheral capillary oxygen saturation.

Figure 2. Normalized root mean square error (NRMSE) values obtained by imputing the missing values of artificially generated datasets using five R packages.

Figure 5. Receiver operating characteristic curves and tables for model performance according to observation time (a, b) and category selection (c, d) for predicting mortality four hours in advance. HR, heart rate; SBP, systolic blood pressure; DBP, diastolic blood pressure; SpO2, peripheral capillary oxygen saturation.