2.1 Data sources
The Medical Information Mart for Intensive Care (MIMIC-IV, v1.0) is a longitudinal, single-center database maintained by the Laboratory for Computational Physiology at the Massachusetts Institute of Technology (MIT). It contains de-identified data for over 60,000 patients admitted to the ICUs of a US tertiary academic medical center, the Beth Israel Deaconess Medical Center (BIDMC) [14]. We received permission to use the database, and all reporting followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines [15]. The datasets generated and/or analysed during the current study are available in the MIMIC Code Repository (https://mimic-iv.mit.edu).
The STROBE guideline focuses on reporting the results of observational studies and does not explicitly address the reporting of machine learning analyses. For this reason, the recommendations for reporting machine learning analyses in clinical research presented by Stevens et al. [16] were used as a template when preparing and submitting our manuscript. That work recommends statistical methods for machine learning analyses in clinical research, gives an overview of the machine learning analysis workflow, and lists several key reporting elements according to study design.
The project was approved by the institutional review boards of MIT (Cambridge, MA, USA) and BIDMC (Boston, MA, USA) and was granted a waiver of informed consent. One of the authors (M.K.A., Record ID: 35109067) obtained permission to access the database after completing the online training of the Collaborative Institutional Training Initiative (CITI Program). Our research was conducted entirely on publicly available, anonymized data; therefore, individual patient consent was not required. All methods were carried out in accordance with the relevant guidelines and regulations to protect the privacy of patients.
Data availability: The data that support the findings of this study are available from MIMIC-IV (v1.0), but restrictions apply. The data were used under license for the current study under the PhysioNet Credentialed Health Data Use Agreement 1.5.0; they are therefore not publicly available, and we cannot share access to PhysioNet restricted data with anyone (https://physionet.org/content/mimiciv/1.0/).
2.2 Study design
We conducted this retrospective study on a subgroup of adult patients from the MIMIC-IV dataset meeting the Sepsis-3 criteria; sepsis was defined as a suspected infection combined with an acute increase in the Sequential/Sepsis-related Organ Failure Assessment (SOFA) score of ≥ 2 points [17].
We retrospectively evaluated the adult patients (age ≥ 18 years) with sepsis from the MIMIC-IV dataset who had at least two serum lactate measurements recorded within the first 6 hours of sepsis diagnosis and who also had an ICU length of stay of ≥ 24 hours.
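For illustration, a minimal Python/pandas sketch of these inclusion criteria is given below; the file and column names (`sepsis_cohort.csv`, `stay_id`, `icu_los_hours`, `sepsis_onset`, `lactate_time`) are hypothetical and do not correspond to the actual MIMIC-IV table schema.

```python
import pandas as pd

# Hypothetical, pre-extracted cohort table; column names are illustrative only.
cohort = pd.read_csv("sepsis_cohort.csv", parse_dates=["sepsis_onset", "lactate_time"])

# Adults only.
cohort = cohort[cohort["age"] >= 18]

# ICU length of stay of at least 24 hours.
cohort = cohort[cohort["icu_los_hours"] >= 24]

# Keep ICU stays with at least two lactate measurements within 6 h of sepsis diagnosis.
within_6h = cohort[
    (cohort["lactate_time"] >= cohort["sepsis_onset"])
    & (cohort["lactate_time"] <= cohort["sepsis_onset"] + pd.Timedelta(hours=6))
]
counts = within_6h.groupby("stay_id")["lactate_time"].count()
cohort = cohort[cohort["stay_id"].isin(counts[counts >= 2].index)]
```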
2.3 Definition of outcomes
To perform the lactate trend analysis, we first need to define the trends. Three trend states were constructed according to the change in blood lactate value. For the 24-hour observation period, a change of 1 mmol/L or more was considered a trend indicator. We calculated the change between two lactate values measured at most 6 hours apart. According to this setup, every sample in the data cohort was labeled as increase, decrease, or constant. The trend definition is shown in Fig. 1.
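A minimal sketch of this labeling rule is shown below; the 1 mmol/L threshold follows the definition above, while the function name and interface are illustrative.

```python
def label_trend(first_lactate, last_lactate, threshold=1.0):
    """Label the lactate trend between two measurements taken at most 6 h apart.

    A change of at least `threshold` mmol/L (1 mmol/L in this study) is an
    increase or decrease; smaller changes are labeled constant.
    """
    delta = last_lactate - first_lactate
    if delta >= threshold:
        return "increase"
    if delta <= -threshold:
        return "decrease"
    return "constant"

# Example: a rise from 2.1 to 3.4 mmol/L within 6 h is labeled as an increase.
print(label_trend(2.1, 3.4))  # -> "increase"
```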
2.4 Variable selection
According to the clinical literature, we identified the variables most relevant to lactate trend analysis: age, the first and last lactate values, the time interval between the two lactate measurements, and the means of the hemodynamic and respiratory monitoring parameters measured in this interval (heart rate, systolic blood pressure, diastolic blood pressure, mean blood pressure, oxygen saturation, and PaO2/FiO2 ratio) (Table 1).
The main reason behind this variable selection is to reduce the dependence of the lactate trend analysis on laboratory measurements and to perform the analysis in a noninvasive manner. Preprocessing is a vital step in building robust machine learning models: it helps reduce noise, remove redundant data, and produce consistent data, and thus increases the performance of prediction models. To increase data quality, we applied several preprocessing steps to the data cohort. Outliers were removed to obtain consistency between data points, and unity-based (min-max) normalization was applied to make the value ranges coherent, transforming all variables to the range [0, 1]. We obtained 18,653 data samples after these preprocessing steps.
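The sketch below illustrates these two preprocessing steps under stated assumptions: the exact outlier criterion used in the study is not detailed here, so a z-score cut-off of 3 is assumed for illustration, followed by the unity-based (min-max) scaling described above.

```python
import pandas as pd

def preprocess(df, feature_cols, z_thresh=3.0):
    """Illustrative preprocessing: outlier removal followed by min-max scaling.

    The z-score threshold is an assumption made for this sketch; the study's
    exact outlier rule may differ.
    """
    X = df[feature_cols].astype(float)

    # Drop rows that contain an extreme value in any feature.
    z = (X - X.mean()) / X.std(ddof=0)
    X = X[(z.abs() <= z_thresh).all(axis=1)]

    # Unity-based (min-max) normalization to the [0, 1] range.
    return (X - X.min()) / (X.max() - X.min())
```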
Feature selection strategies on clinical data help identify the right parameters for analyzing a given disease, reduce treatment costs, and lower the computational burden [18]. To achieve these goals, we further investigated the variable space using correlation-based feature selection (CFS). The CFS algorithm identifies important and pertinent features from the intrinsic characteristics of the data rather than from a machine learning model [19]. In many cases, some features are highly correlated with others; such features carry redundant information and thus reduce the performance of prediction models. The CFS algorithm evaluates the correlations among features and discards those that are highly correlated [19]. Using CFS, we identified four variables that are less correlated than the others and can be used to predict lactate trends in sepsis patients: heart rate, oxygen saturation, the lactate value before sepsis diagnosis, and the time interval. The overall ranking of all features is shown as a pyramid in Fig. 2, with high-ranked features at the top and low-ranked features at the bottom.
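For readers unfamiliar with CFS, the simplified sketch below reproduces its core idea: a greedy forward search over feature subsets that maximizes Hall's merit, which rewards correlation with the class and penalizes correlation among the selected features. It uses a numerically encoded class label and Pearson correlation for simplicity, whereas the implementation cited in [19] uses symmetrical uncertainty.

```python
import numpy as np
import pandas as pd

def cfs_merit(X, y, subset):
    """Hall's CFS merit: k * r_cf / sqrt(k + k*(k-1) * r_ff)."""
    k = len(subset)
    r_cf = np.mean([abs(X[f].corr(y)) for f in subset])      # mean feature-class correlation
    r_ff = 0.0 if k == 1 else np.mean(                        # mean feature-feature correlation
        [abs(X[a].corr(X[b])) for i, a in enumerate(subset) for b in subset[i + 1:]]
    )
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_search(X, y):
    """Greedy forward search over features, stopping when the merit no longer improves."""
    remaining, selected, best = list(X.columns), [], 0.0
    while remaining:
        scores = {f: cfs_merit(X, y, selected + [f]) for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best = scores[f_best]
    return selected
```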
2.5 Proposed Machine Learning Framework
Our proposed machine learning-based framework takes clinical and demographic data and feeds them to a classifier to predict the lactate trend in the ICU setting. We used a traditional supervised classification pipeline that consists of two parts: a training phase and a test/evaluation phase.
First, training data consisting of annotated samples are acquired from the MIMIC-IV dataset and passed through the preprocessing stage to increase data quality for the classification model. Every training sample has a lactate trend label (increase/constant/decrease). A classifier is trained on these samples to construct the machine learning model. In the test stage, a preprocessed test sample is fed to the trained classifier, which predicts its lactate trend label. Finally, classification performance is reported in the evaluation phase (Fig. 3).
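A minimal end-to-end sketch of this train/test workflow is given below using scikit-learn; the 80/20 split, the random seed, and the choice of a random forest as the example classifier are assumptions made for illustration only.

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# X: preprocessed feature matrix, y: lactate trend labels (increase/constant/decrease).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)                      # training phase
y_pred = clf.predict(X_test)                   # test phase
print(classification_report(y_test, y_pred))   # evaluation phase
```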
2.6 Selected Classifiers for Proposed Framework
For the prediction of lactate trends in sepsis patients, we evaluated several classifiers on the MIMIC-IV dataset: Naïve Bayes (NB), the J48 decision tree, logistic regression (LR), random forest (RF), and the logistic model tree (LMT). Naïve Bayes is a traditional and simple machine learning approach that treats the dataset attributes as independent [18]. Its outputs are class probabilities. Naïve Bayes is based on Bayes' theorem, which gives the probability of an event occurring given that another event has occurred, and the class with the highest probability is selected as the outcome. It became very popular in the machine learning field due to its advantages, such as handling the overfitting problem well and allowing the classification process to be parallelized [20]. The J48 decision tree algorithm is an updated version of the popular ID3 decision tree algorithm [21]. It can be used with both numerical and categorical data. J48 aims to find the specific attribute that best partitions the training data, i.e., the attribute with the highest information gain in the dataset [22]. By evaluating the possible values of this attribute, a branch pruning process starts and J48 assigns target values; in the meantime, J48 searches for other attributes with high information gain. This process continues until an explicit decision is made on the combination of attributes that yields a definite rule for determining the target value. At the end of the algorithm, all attributes have been evaluated and every sample has been assigned a target value accordingly [22]. J48 has become a popular machine learning tool in many areas due to its easily implementable and robust nature [21, 23, 24].
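In its standard textbook form, the naïve Bayes decision rule for an attribute vector $(x_1, \dots, x_n)$ selects the class

$$\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c),$$

where the attributes $x_i$ are assumed to be conditionally independent given the class $c$.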
Random forest (RF) belongs to the family of decision tree methods that employ a supervised ensemble learning strategy [25]. It gained popularity in classification and regression problem domains due to its robustness against overfitting and its low computational load [25-27]. RF builds many decision trees, each trained on a random bootstrap sample of the data and considering a random subset of variables at each split. Other decision tree learners aim to find the best variable available, whereas RF restricts the search to these random subsets. The main motivation for this approach is to reduce the correlation between the candidate trees: if highly correlated variables dominate every tree, the trees make similar errors, which leads to poor prediction performance. As an output, the predictions of all random trees are combined to obtain the final result [26].
The logistic regression algorithm is mainly used for tackling classification problems and for modeling class probabilities [28]. It fits the data to a logistic curve to predict the probability of the occurrence of an event [29], so the modeled relationship between the predictors and the outcome probability is nonlinear.
The LMT algorithm is a hybrid decision tree approach that combines logistic regression and decision tree learning [30]. The leaves of the tree hold logistic regression models, which together form a piecewise model over the input space; these logistic regression functions are constructed with the LogitBoost algorithm [31]. Pruning of the tree is performed with standard decision tree pruning procedures, and splitting is implemented via a logistic variant of information gain. The algorithm has several positive aspects: it can capture linear relationships, overfitting can be easily avoided, and it is easy to implement. Because of these advantages, it has been used in many different research areas in recent years [30-32].
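The sketch below illustrates how such a classifier comparison could be reproduced with scikit-learn stand-ins. J48 and LMT originate from Weka: the former is approximated here by an entropy-based decision tree, and the latter has no direct scikit-learn equivalent and is therefore omitted. The 10-fold cross-validation and accuracy metric are assumptions made for illustration, not the study's exact evaluation protocol.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# scikit-learn stand-ins for the classifiers named above (LMT omitted).
classifiers = {
    "Naive Bayes": GaussianNB(),
    "Decision tree (J48-like)": DecisionTreeClassifier(criterion="entropy"),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# X: preprocessed feature matrix, y: lactate trend labels.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```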
Ethics approval and consent to participate
Our research was conducted entirely on publicly available, anonymized data; therefore, individual patient consent was not required. All methods were carried out in accordance with the relevant guidelines to protect the privacy of patients.