What do Black-box Machine Learning Prediction Models See?- An Application Study With Sepsis Detection

doi:10.21203/rs.3.rs-1991366/v1


Research Article



This work is licensed under a CC BY 4.0 License

Version 1


Purpose: The purpose of this study is to identify additional clinical features for sepsis detection by using a novel mechanism for interpreting trained black-box machine learning models, and to provide a suitable evaluation of that mechanism.

Methods: We use the publicly available dataset from the 2019 PhysioNet Challenge. It has around 40,000 Intensive Care Unit (ICU) patients with 40 physiological variables. Using Long Short-Term Memory (LSTM) as the representative black-box machine learning model, we adapted the Multi-set Classifier to globally interpret the black-box model for concepts it learned about sepsis. To identify relevant features, the result is compared against: i) features used by a computational sepsis expert, ii) clinical features from clinical collaborators, iii) academic features from literature, and iv) significant features from statistical hypothesis testing.

Results: Random Forest (RF) was found to be the computational sepsis expert because it achieved high accuracy on both the detection and early detection tasks, and showed a high degree of overlap with clinical and literature features. Using the proposed interpretation mechanism and the dataset, we identified 17 features that the LSTM used for sepsis classification, 11 of which overlap with the top 20 features from the RF model, 10 with academic features and 5 with clinical features. Clinical opinion suggests that 3 LSTM features correlate strongly with some clinical features that were not identified by the mechanism. We also found that age, chloride ion concentration, pH and oxygen saturation should be investigated further for a connection with developing sepsis.

Conclusion: Interpretation mechanisms can bolster the incorporation of state-of-the-art machine learning models into clinical decision support systems, and might help clinicians address the issue of early sepsis detection. The promising results from this study warrant further investigation into the creation of new, and the improvement of existing, interpretation mechanisms for black-box models, and into clinical features that are currently not used in the clinical assessment of sepsis.

machine learning

interpretability

deep network

sepsis

early detection

Sepsis is a life-threatening organ dysfunction caused by a dysregulated host response to infection[38]. It causes about 6 million deaths per year worldwide, and costs the healthcare system over 16 billion dollars in the USA alone[40-42]. Figure 1A shows the standard medical features and assessments that are currently used to clinically diagnose sepsis. However, “early sepsis detection” remains an open question in the sepsis research community, because by the time these features/assessments indicate sepsis it is often too late for the necessary treatment intervention; it is known that if a sepsis patient is not treated within six hours of hospital admission, the mortality rate increases by 9%[4]. This suggests that it is important to find other medical/clinical features that can help clinicians detect sepsis in a timely manner.

Since machine learning (ML) algorithms are well known for their ability to discover patterns in data that are not apparent to the human eye, PhysioNet hosted a computational challenge in 2019 for sepsis detection[39]. However, despite their ability to optimize, most ML algorithms (especially state-of-the-art deep networks) are not good at explaining what they have learnt, let alone elucidating, in a human-interpretable manner, the factors a model used to reach its decision. In fact, it has been demonstrated that ML models with high computational accuracy can sometimes fail to grasp the actual concept (or factors) they were supposed to learn. Thus, the human-interpretability of these algorithms needs to be addressed before ML can be used comprehensively in sepsis detection to discover additional, relevant clinical factors.

Currently, with regard to human-interpretability, ML algorithms can be broadly classified into two categories: i) ones that are intrinsically interpretable, and ii) complex, black-box models that are explainable to an extent via post-hoc analysis. Intrinsically interpretable ML models are limited in number, and include linear regression, logistic regression and decision trees; explainable models, in contrast, are black-box models such as support vector machines (SVMs) and deep learning networks (DNNs). Post-hoc analysis of explainable models often relies on an external, model-specific or model-agnostic method that can only provide local explanations of the model. Local explanations only provide information on why one particular instance was classified a certain way, as opposed to an explanation of what the model has learnt or how well the model has understood the intended concept (i.e., global explanations).

However, the accuracy-interpretability trade-off[34-35] in ML is still an open research challenge: greater model complexity generally brings higher accuracy but lower human-interpretability. Examples of this can also be seen in sepsis detection[10-29]. As a result, even though ML has greatly advanced healthcare data analysis[1-8], computational healthcare studies (including sepsis detection[3, 9-13]) often choose statistics, intrinsically interpretable ML models and/or feature selection methods for analysis, instead of state-of-the-art ML models with higher accuracy. Moreover, there is no guarantee that the local explanation provided (via post-hoc analysis using LIME[36]/SHAP[73-74]) for one instance in the dataset will be the same for a different instance in the same dataset, even if they share the same class membership. As such, features obtained from local explanations will not be a good representation of the additional relevant features needed for timely sepsis detection. Thus, a global interpreter for state-of-the-art models is essential, because it can aid the task of identifying relevant sepsis factors. A global interpreter also allows black-box models to retain their high accuracy while becoming more transparent to human beings. Various works have addressed the challenge of creating global interpreters; notable examples include [30], which shows complete equivalency between fuzzy logic and neural networks; [31], which makes use of a decision tree structure to train deep networks; [32], which proposes the use of concept vectors over saliency maps to ensure the correctness of convolutional networks; and [33], which uses a deep network to create a decision tree.

In this paper, we propose a post-hoc, model-agnostic interpretable mechanism (IM) to globally understand the sepsis-related concepts learnt during training by a state-of-the-art, black-box ML model. Since state-of-the-art models are good at classification tasks, investigating the features they use globally for classification should be a good indicator of additional features that need to be investigated for timely sepsis detection. Unlike the interpretable solutions presented in[30-33], which use fuzzy rules and decision trees (both intrinsically interpretable) and visual aids (saliency maps), the IM proposed here leverages the “nearest-neighbor” concept of k-nearest neighbors. Additionally, for human interpretability, it is important to explain decisions in simple terms that are understandable by the general population; yet some methods, such as covariance matrices[29,37], require a certain expertise to interpret. Thus, the proposed IM presents its results in a format that both computational and healthcare personnel can easily understand visually via qualitative assessment.

The evaluation of the proposed IM presents an additional challenge because, as of now, there is no standard evaluation technique for assessing “human-interpretability” in the field of ML and computer science. This is further complicated by the fact that literature studies[3,22-23,28-29,39,43-47] and our own experiments show that it is difficult to obtain high and balanced specificity and sensitivity for ML models on sepsis detection, especially when using the publicly available physiological dataset[39] on sepsis; as such, a large body of work on sepsis[18,43-47] uses text data and Electronic Health Records (EHR).

Thus, a second contribution of this paper is the 3-way evaluation scheme proposed to assess the IM, as shown in Figure 2:

Since this is a healthcare research study, it is imperative for the results of the IM to be validated against clinical features and by clinical expert(s). Figure 1A presents the different scoring criteria for sepsis as used in clinical practice, based on our two medical collaborators (co-authors of this paper) and[60-62]. These factors in Figure 1A, which we refer to as clinical features, will be used throughout this paper for evaluation.
The clinical features, however, are not the only features that might affect sepsis diagnosis. Medical factors related to sepsis detection are still an ongoing research area, and hence the IM is also validated against academic literature. Figure 1B presents features that are not used for clinical diagnosis, but are still important in treating sepsis patients. We refer to them as literature features.
Last but not least, the IM needs a computational benchmark for comparison of both accuracy and interpretation (local or global). Thus, we experiment with various classes of ML algorithms, and apply a series of assessments to narrow them down to one model, which we call the computational sepsis expert or CSE. As part of the assessment, we also provide a label-shift training paradigm (Section 3.1.2).

The remainder of the paper is broadly divided into two experiments: one for finding and assessing the CSE (Experiment 1.1), and the other for creating and evaluating the IM (Experiment 1.2). We choose the Long Short-Term Memory (LSTM) model as the black-box representative that the IM will interpret. The IM itself is based on the lesser-known multi-set classifier, MSC[66], which can be made intrinsically interpretable, both semantically and visually, using the concept of nearest neighbors. For the CSE, we experiment with a convolutional network (Conv1D), support vector machine (SVM), AdaBoost, random forest (RF) and MSC. Even though it would have been ideal if MSC had been found to be the CSE, our results from the 5-step assessment (Figure 2) show that RF outperforms most other ML models in our experiments and in the literature on sepsis detection. RF also satisfies the list of assessments we set up for the CSE. The MSC-based IM for the LSTM, while limited by the performance of MSC itself, still identified factors that matched those of clinical experts and the CSE. Additionally, the IM supports the expansion of clinical features to include some of the features from literature, while highlighting a few other features that are currently not used in clinical assessment or not studied in literature as being directly related to sepsis.

The rest of the paper is organized as follows: we present the dataset description and the preprocessing steps in Section 2. Section 3.1 presents the characteristics we expect CSE to have, and then Sections 3.1.1 - 3.1.5 list and describe the assessments for finalizing the CSE, with Section 3.1.6 providing implementation details. Section 3.2.1 provides a brief description of MSC and the IM creation, with Section 3.2.2 providing the evaluation procedure for the IM. Section 4.1 illustrates the results from Experiment 1.1 and Section 4.2 shows the results from Experiment 1.2. Finally, we end with our discussion in Section 5 and conclusion in Section 6.

The publicly available dataset used here is from the 2019 PhysioNet Challenge: Early Detection of Sepsis from Clinical Data[39]. It is an electronic health records dataset sourced from Beth Israel Deaconess Medical Center (hospital system A) and Emory University Hospital (hospital system B). It consists of around 40,000 ICU patients with 40 clinical variables for each hour of a patient’s stay in the ICU. The 40 clinical variables[39] can be divided into vital signs (heart rate, blood pressure, etc.), laboratory values (pH, platelet count, hemoglobin, etc.), and demographics (age, gender, etc.). The features are listed in Figure 3. The data collectors used a combination of the patient's Sequential Organ Failure Assessment (SOFA) score and the time of clinical suspicion of infection (blood culture or IV antibiotics ordered)[38] to determine whether the patient was septic or not. The dataset has 37404 non-septic and 2932 septic patients.

2.1. Preprocessing
When possible, for a particular feature, missing values were filled in either by taking the mean of the preceding and subsequent observations, or by using the available value from the most recent past. The remaining missing values were given a value of zero. To distinguish actual data from missing values and to standardize the range of values across the variables, each feature was rescaled to lie between 1 and 6 (inclusive).
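The preprocessing described above can be sketched for a single feature of a single patient as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, gaps are filled from the nearest observed neighbours, rescaling is assumed to be min-max, and the order of operations (impute, rescale, then set the remaining gaps to zero) is an assumption, since the text does not specify it.

```python
from typing import List, Optional


def impute_series(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill gaps with the mean of the nearest observed neighbours,
    or with the most recent past value when no later value exists."""
    filled: List[Optional[float]] = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)
            continue
        # Nearest observed value before and after position i (original data only).
        prev = next((values[j] for j in range(i - 1, -1, -1) if values[j] is not None), None)
        nxt = next((values[j] for j in range(i + 1, len(values)) if values[j] is not None), None)
        if prev is not None and nxt is not None:
            filled.append((prev + nxt) / 2)
        elif prev is not None:
            filled.append(prev)
        else:
            filled.append(None)  # no past value available: stays missing
    return filled


def rescale_feature(values: List[Optional[float]]) -> List[float]:
    """Min-max rescale observed values to [1, 6]; remaining gaps become 0,
    so that 0 can only mean 'missing'."""
    observed = [v for v in values if v is not None]
    if not observed:
        return [0.0] * len(values)
    lo, hi = min(observed), max(observed)
    span = hi - lo
    out = []
    for v in values:
        if v is None:
            out.append(0.0)
        elif span == 0:
            out.append(1.0)  # assumption: a constant feature maps to the lower bound
        else:
            out.append(1.0 + 5.0 * (v - lo) / span)
    return out


hourly = [None, 2.0, None, 4.0, None]
imputed = impute_series(hourly)
scaled = rescale_feature(imputed)
```

Because every observed value ends up in [1, 6], a zero in the processed data can be read unambiguously as a value that could not be imputed.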

As shown in Fig. 2, we organize this paper into two experiments: one for finding CSE (Section 3.1), and another for creating and evaluating the IM (Section 3.2).

3.1. Finding CSE and establishing a benchmark (Experiment 1.1)

Any computational diagnostic model, no matter how good, lacks a clinician’s experience and intuition for making a judgment call when necessary. For a fair evaluation of the proposed IM, instead of comparing it only to clinical features, we should also compare it to what a CSE would use for sepsis classification. Hence, the CSE works as a benchmark against which the results of the proposed IM can be evaluated.

Our search for the CSE is guided by the following characteristic criteria of an ideal CSE. For a machine learning classifier to be considered a sepsis expert:

The CSE must be good at solving both sepsis detection (detection at clinical diagnosis time) and early sepsis detection (detection prior to clinical diagnosis). “Goodness” can be measured using the following evaluation metrics:
- Traditional measures: accuracy, specificity, sensitivity, precision
- Imbalanced-data measures: F1 score, Matthews correlation coefficient
- PhysioNet early detection measure: utility [39]. The PhysioNet challenge organizers provided the utility measure as a way to evaluate ML models’ ability to detect sepsis early with only one number.
When solving the detection task, CSE must demonstrate a good overlap with clinical and literature features (Fig. 1).
- This can be measured by the intersection of important features by CSE and medical features presented in Fig. 1A. This requires CSE to be interpretable to some extent (or at least explainable through the use of LIME[36] or SHAP[73, 74]).
We expect the CSE to require the use of more features than those presented in Fig. 1A.
- The need for these “extra” variables can be justified through the use of Wilcoxon’s ranksum test to check for significance and the intersection of these features with literature features in Fig. 1B.
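For reference, the traditional and imbalanced-data measures listed above can all be derived from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name and the convention of returning 0 when a denominator is zero are assumptions made for illustration. The PhysioNet utility score is not reproduced here, since its official implementation is provided by the challenge organizers[39].

```python
import math
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute standard binary classification metrics (1 = sepsis, 0 = non-sepsis)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def safe_div(num: float, den: float) -> float:
        # Assumption: an undefined ratio is reported as 0.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)  # also called recall
    specificity = safe_div(tn, tn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
```

On a heavily imbalanced dataset such as this one, accuracy and specificity can be high even when few septic patients are found, which is why the F1 score and the Matthews correlation coefficient are reported alongside them.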

Thus, we adapt the following 5-step assessment (Sections 3.1.1–3.1.5) to select the CSE from a myriad of literature works and models that we trained.

3.1.1. CSE determination, Step 1: Evaluation against ML models using traditional metrics for sepsis detection at clinical time. To identify the model that can achieve best performance on sepsis detection, here we use four representative ML algorithm groups:

1D Convolution Network[72] classifier or Conv1D – this is a state-of-the-art deep network, and requires LIME/SHAP, which can only provide local interpretations.
AdaBoost and Support Vector Machine or SVM – these are traditional ML models and both require LIME/SHAP for local interpretation. AdaBoost models are usually tree-based models.
Random Forest or RF – this is a powerful, non-linear, tree-based model in the class of intrinsically interpretable models; it uses the Gini impurity index to calculate feature importance.
Multi-set Classifier or MSC – this uses concepts of nearest neighbors and unsupervised learning for classification. In this paper, we show how it can be both conceptually and visually human-interpretable. As such, this is the algorithm on which we propose to base the IM.

The performance of these models is measured using traditional ML evaluation metrics: i) accuracy, ii) precision, iii) specificity, iv) sensitivity, v) F1 score, and vi) Matthews correlation coefficient. The model with the best performance on these metrics, i.e., the CSE candidate, is then passed on to the next stage for further assessment.

3.1.2. CSE determination, Step 2: Evaluation of CSE candidate on early sepsis detection (earlier than clinical time) using the proposed label-shift training paradigm. Sepsis detection done x hours before medical determination of sepsis is considered early sepsis detection. Computational detection of sepsis 6 hours prior to medical determination is considered optimal, while 12 hours prior is considered early[39].

Given this clinical definition and the computational task of early sepsis detection, we propose the use of a label-shift training paradigm to train the CSE candidate, because none of the models listed in Section 3.1.1 is as capable as an LSTM at capturing temporal patterns. As such, we adapt the training paradigm from one of our earlier works[75] on time series signals to allow any one of these models to process temporal information. In this training paradigm, instead of trying to optimize model hyperparameters and/or architecture, we rearrange the data that is fed to the model.

Suppose each patient is denoted by \(P_{id}\), where \(id\) is unique to each patient. Each \(P_{id}\) contains a set of rows, with the features/attributes in separate columns. The rows in each \(P_{id}\) are temporally dependent, such that any two rows \({}^{id}r_i, {}^{id}r_j \in P_{id}\), where \(i < j\) (and \(i\), \(j\) denote the \(i\)-th and \(j\)-th rows in \(P_{id}\)), can be expressed as \({}^{id}r_t\) and \({}^{id}r_{t+c}\) respectively, where \(c\) is a positive constant that represents a time interval; for this sepsis dataset, \(c = 1\) hour if \(j = i + 1\). The sepsis label associated with \({}^{id}r_t\) is denoted by \({}^{id}l_t\). Thus, the traditional training dataset is denoted by \({}^{Sepsis}T = \{[{}^{id}r_t, {}^{id}l_t] \mid t \in \mathbb{Z}^+\}\). Under the label-shift paradigm, each \({}^{id}r_t\) is instead associated with a label from a ‘future’, i.e. subsequent, time interval. The basis for this shift is that we want to train the ML model to predict at the present time by learning to associate past data with the future state.

Thus, the training dataset is now denoted by \({}^{Sepsis}T_c = \{[{}^{id}r_t, {}^{id}l_{t+c}] \mid t, c \in \mathbb{Z}^+\}\). Since we are interested in predicting sepsis 12 hours, 6 hours and 1 hour earlier than sepsis onset[39], we restrict \(c \in \{1, 6, 12\}\), resulting in three different training sets. This allows us to measure the performance of the CSE candidate separately for each value of \(c\), using: i) accuracy, ii) precision, iii) specificity, iv) sensitivity, v) F1 score, and vi) Matthews correlation coefficient.
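The label-shift rearrangement can be sketched for one patient as follows. This is an illustrative sketch rather than the authors' code: the function name is hypothetical, rows are assumed to be hourly and in temporal order, and rows that have no label c hours ahead are assumed to be dropped.

```python
from typing import List, Sequence, Tuple, TypeVar

Row = TypeVar("Row")


def label_shift(rows: Sequence[Row], labels: Sequence[int], c: int) -> List[Tuple[Row, int]]:
    """Pair the row at hour t with the label at hour t + c for one patient."""
    if c < 1:
        raise ValueError("c must be a positive number of hours")
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    # The last c rows have no future label and are left out (assumption).
    return [(rows[t], labels[t + c]) for t in range(len(rows) - c)]


# Toy patient with four hourly rows; sepsis label turns positive at hour 2.
patient_rows = ["r0", "r1", "r2", "r3"]
patient_labels = [0, 0, 1, 1]

shifted = {c: label_shift(patient_rows, patient_labels, c) for c in (1, 6, 12)}
```

One training set is built per value of c, so any classifier, including one with no notion of time, learns to map the present row to the state c hours later.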

3.1.3. CSE determination, Step 3: Evaluation using the utility metric against entries from 2019 PhysioNet Challenge. The 2019 PhysioNet Challenge[39], which focused on early sepsis detection, provided a new ML evaluation metric called utility. The utility measures whether the sepsis diagnosis model is able to detect sepsis at an early stage, i.e. at most 12 hours earlier than the medical expert detection, with detection 6 hours prior to medical detection obtaining a perfect score. This score takes a minimum value of 0 and a maximum of 1, with higher values indicating better ML model performance. The steps to calculate utility are given in[39], and the code is available at https://github.com/physionetchallenges/evaluation-2019.

Since we are using the data that was made available by [39], in addition to Step 2, we also compare the utility values of the listed models with the values obtained from top performing entries[22–24, 28–29, 39, 69] in the challenge [39].

3.1.4. CSE determination, Step 4: Comparison with sepsis diagnosis models in clinical literature. Because of the need for early sepsis detection, sepsis is a frequently studied condition in the medical/clinical field, with a significant body of literature. We identified a few of these models that were tested in a clinical setting and compared their performance with that of the CSE candidate.

The literature models that were chosen are: AISE[3], the Epic Sepsis Model (ESM)[43], and the EPIC native sepsis model[44].

3.1.5. CSE determination, Step 5: Overlap comparison with clinical features. We use the respective explainability/interpretability technique (Gini index for RF; LIME/SHAP otherwise) to obtain the twenty features that are most important to the CSE for detecting sepsis. Their five-number summaries (maximum, third quartile, median, first quartile, minimum) are then recorded. Additionally, Wilcoxon’s rank-sum hypothesis test is calculated using the median for the sepsis and non-sepsis groups to obtain the statistical significance (for all 40 features).

The overlap of these top twenty features with the clinical features and the features from literature (Fig. 1) is then recorded and presented.
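As a worked illustration of the overlap computation, the sketch below counts shared feature names and expresses them as a percentage of the top twenty. The feature names are copied from the rows of Table 1; the helper name and the name normalisation (for example, writing FiO2 and PaCO2) are assumptions made for this example, and the reference sets are restricted to the entries that appear in Table 1 rather than the full contents of Fig. 1.

```python
from typing import Iterable


def overlap_percent(features: Iterable[str], reference: Iterable[str], denominator: int) -> float:
    """Percentage of `denominator` features that also appear in `reference`."""
    shared = set(features) & set(reference)
    return 100 * len(shared) / denominator


# RF (CSE) top-twenty features, grouped as in Table 1.
rf_clinical = ["Platelets", "Temperature", "White blood cell count", "FiO2", "Heart rate",
               "Systolic blood pressure", "Respiration rate", "Mean arterial pressure",
               "Creatinine"]
rf_literature = ["Age", "Hemoglobin", "Hematocrit", "Glucose", "Blood urea nitrogen",
                 "Bicarbonate", "Potassium"]
rf_other = ["ICU length-of-stay", "Hours between hospital and ICU admit",
            "Diastolic blood pressure", "Chloride"]
rf_top20 = rf_clinical + rf_literature + rf_other

# Features attributed to the LSTM by the IM, as listed in Table 1.
lstm_features = ["PaCO2", "FiO2", "Platelets", "Creatinine", "Glucose", "Blood urea nitrogen",
                 "Bicarbonate", "Calcium", "Magnesium", "Phosphate", "Age", "Hemoglobin",
                 "Hematocrit", "Potassium", "pH", "Diastolic blood pressure",
                 "Oxygen saturation"]

rf_vs_medical = overlap_percent(rf_top20, rf_clinical + rf_literature, denominator=20)
lstm_vs_rf = overlap_percent(lstm_features, rf_top20, denominator=20)
```

With these lists, the two values correspond to the clinical-plus-literature overlap for RF and the RF-versus-IM overlap reported in Table 1.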

3.1.6. Implementation details. The convolution model was set up in MATLAB with one 1D convolution layer (15 filters of size 3x3, ReLU activation), followed by a max pooling layer and a fully connected layer with a softmax activation function. RF was implemented in Python, using 100 estimators, and accounted for the imbalanced dataset by setting the built-in hyperparameter “class weight” to “balanced”. AdaBoost was also implemented in Python, using 200 estimators; SMOTE[70] with “minority” sampling was used to handle class imbalance. For SVM, we used the RBF kernel with the regularization constant set to 0.5, along with the built-in Python mechanism for handling class imbalance. For MSC, we used k-means++ to choose 10 anchors, set the model to have 1 class profile for sepsis and 1 class profile for non-sepsis, and used the Euclidean distance to measure similarity.

3.2. Creating and Evaluating the IM (Experiment 1.2)

The Multi-set Classifier, MSC[66], is not a commonly used algorithm and relies on nearest-neighbor principles for classification. However, its outputs include class profiles (in addition to class labels), which are low-dimensional data structures[71]. The low dimensionality of the profiles makes the model globally and intrinsically interpretable, and easy to visualize as bar charts or line graphs. Thus, it is a good candidate for the basis of our IM. Another advantage of MSC is that its choice of anchors is completely data-driven and requires no human intervention, unlike[46–47].

3.2.1. Proposed Strategy for Explaining Black-box ML Models Globally Using MSC. MSC (full algorithm in[66]) assumes that each patient/entity is described by a set of feature vectors (instead of one). It begins by using a clustering technique to select a subset of feature vectors called anchors, which act as the base concepts/patterns and are representative of the dataset. Every entity/patient and every class profile is then defined as a histogram in terms of these anchors, as follows:

Entity as a collection of concepts

Suppose the set of all feature vectors describing a disease perfectly is called the concept set, and is denoted by \(Q = \{q_1, \ldots, q_z, \ldots, q_{|Q|}\}\), where \(q_z \in \mathbb{R}^d\) and \(d > 0\). Then \(E_j\), the \(j\)-th entity in a given dataset, is a member of the power set of \(Q\) (excluding the empty set), and is denoted as \(E_j \in \mathcal{P}(Q) \setminus \{\emptyset\}\), where \(\emptyset\) is the empty set.

Base concepts or anchors

The anchors are defined as a subset of \(Q\), and denoted by \(\widehat{Q} = \{\hat{q}_i\}\), where \(\hat{q}_i\) is the \(i\)-th anchor. The number of anchors (i.e. \(|\widehat{Q}|\)) is determined by the user, subject to the restriction that \(0 < |\widehat{Q}| \ll |Q|\). The anchors are chosen using a clustering technique. Usually, instead of using the entire dataset, the anchors are chosen from a representative initial sample.

Fingerprints

The fingerprint of \(E_j\), which has class label \(k\), is a histogram over \(\widehat{Q}\) and is written as:

\({}^{k}fp_j = \{{}^{k}p_{ji}\}\), where \({}^{k}p_{ji}\) is the proportion of \(\hat{q}_i\) present in \(E_j\).

Class profiles

If class \(k\) is restricted to having only one profile, then that single profile is the average of all the fingerprints belonging to class \(k\); in our case, \(k = 1\) for a sepsis patient and \(k = 0\) otherwise. However, since MSC allows one class to be described by one or more profiles, the set of profiles for the \(k\)-th class is given by \(C_k = \{{}^{r}c_k\}\), where \(r > 0\) indexes the profiles in \(C_k\). Thus, the \(r\)-th profile of class \(k\) is the average of a subset of the fingerprints belonging to class \(k\), and is denoted by \({}^{r}c_k = \{{}^{r}cp_{ki}\}\), where \({}^{r}cp_{ki}\) is the average of \({}^{k}p_{ji}\) over a subset of all \(E_j\) in class \(k\).

Building Class Profiles During Training Phase

The user determines the number of profiles required to describe each class. The initial sample is used to initialize the class profiles. Then, during the training phase, MSC algorithm executes the following steps to update the class profiles with respect to the feature vectors in each class. During any particular training iteration:

i. A feature vector, maintained in temporal order, is fetched from the dataset. Suppose it belongs to class k (for example k = 0) and patient E_j.

ii. The algorithm then determines which anchor is most similar to this feature vector. Suppose the best match is \(\hat{q}_m\); then the fingerprint \({}^{k}fp_j\) (here \({}^{0}fp_j\)) is updated as follows:

\(f = ({}^{k}p_{jm} \cdot n_j) + 1\)

\(n_j = n_j + 1\)

\({}^{k}p_{jm} = f / n_j\),

where \(n_j\) is the number of feature vectors seen so far by the algorithm for entity \(E_j\).

iii. The updated fingerprint is then compared only with the profiles in \(C_k\) (here \(C_0\)) to find the closest match. If \({}^{r}c_k\) (here \({}^{r}c_0\)) is the best match, then it is updated as follows:

\(c = (n_{r'} \cdot {}^{r}c_k) + {}^{k}fp_j\)

\(n_{r'} = n_{r'} + 1\)

\({}^{r}c_k = c / n_{r'}\),

where \(n_{r'}\) is the number of fingerprints seen so far by the algorithm for profile \({}^{r}c_k\).

iv. The algorithm fetches the next feature vector and repeats.

v. Once the class profiles are obtained, testing is achieved by comparing the testing fingerprints to the mature class profiles and recording whether the best-matched profile belongs to \(C_{k=0}\) or \(C_{k=1}\).
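Steps i to v can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the reference MSC implementation from[66]: the class and function names are hypothetical, the anchors are supplied directly instead of being chosen by clustering, each class has exactly one profile (the configuration used in this paper), profiles start empty rather than being initialized from an initial sample, and the non-matching fingerprint bins are rescaled so that the fingerprint remains a histogram of proportions.

```python
import math
from typing import Dict, Hashable, List, Sequence


def nearest(vec: Sequence[float], candidates: Sequence[Sequence[float]]) -> int:
    """Index of the candidate closest to `vec` in Euclidean distance."""
    return min(range(len(candidates)), key=lambda i: math.dist(vec, candidates[i]))


class MSCSketch:
    """One-profile-per-class sketch of the MSC training and testing steps."""

    def __init__(self, anchors: Sequence[Sequence[float]], n_classes: int = 2) -> None:
        self.anchors = anchors
        self.counts: Dict[Hashable, List[int]] = {}  # per-entity anchor counts
        self.profiles = {k: [0.0] * len(anchors) for k in range(n_classes)}
        self.profile_n = {k: 0 for k in range(n_classes)}  # fingerprints seen per profile

    def train_step(self, entity_id: Hashable, vec: Sequence[float], label: int) -> List[float]:
        # Step ii: find the best-matching anchor and update the entity's fingerprint.
        m = nearest(vec, self.anchors)
        counts = self.counts.setdefault(entity_id, [0] * len(self.anchors))
        counts[m] += 1
        n_j = sum(counts)
        fingerprint = [c / n_j for c in counts]  # bin m equals (p_jm * n_j + 1) / (n_j + 1)
        # Step iii: running-average update of the (single) profile of class `label`.
        n_r = self.profile_n[label]
        self.profiles[label] = [(n_r * p + f) / (n_r + 1)
                                for p, f in zip(self.profiles[label], fingerprint)]
        self.profile_n[label] = n_r + 1
        return fingerprint

    def predict(self, fingerprint: Sequence[float]) -> int:
        # Step v: the predicted class is the one whose profile is closest.
        return min(sorted(self.profiles), key=lambda k: math.dist(fingerprint, self.profiles[k]))


msc = MSCSketch(anchors=[(0.0, 0.0), (10.0, 10.0)])
stream = [("a", (1.0, 1.0), 0), ("a", (0.0, 2.0), 0), ("b", (9.0, 9.0), 1), ("b", (1.0, 0.0), 1)]
for entity_id, vec, label in stream:  # steps i and iv: fetch vectors in temporal order
    msc.train_step(entity_id, vec, label)
```

Each profile is a short vector with one entry per anchor, which is what allows it to be drawn as a bar chart and compared across classes by eye.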

Thus, the outputs of the MSC algorithm are labels and class profiles, where each class profile can be a set of sub-profiles. It is important to note that while MSC allows feature vectors to change sub-profile membership within a particular class \(C_k\), it does not allow them to jump between different classes, i.e. \(C_{k=0}\) and \(C_{k=1}\). Thus, traditionally, each class is described by a set of profiles based on information from the ground truth labels, because that is how the membership of feature vectors is determined during training.

As a result, if we were to randomly change the ground truth for each feature vector during the training phase, the resulting class profiles would be quite different from the ground truth class profiles. Similarly, we can theoretically ask an “oracle” to provide us with class labels, because either the ground truth is not accessible or the “oracle” sees unexplainable, hidden patterns. These labels may (or may not) be different from the ground truth. The class profiles produced by MSC will then reflect the patterns seen by the “oracle”. Thus, if a black-box ML model is the “oracle”, i.e. the model’s predictions (instead of the ground truth) are used when training MSC, the resulting class profiles will approximate what the black-box ML model sees with regard to the anchors chosen by the MSC algorithm.

3.2.2. Evaluation of the IM. A hybrid model refers to the IM that was created by training MSC with labels from the LSTM/RF as the oracle. Thus, explanation from the MSC-LSTM hybrid model can be taken as approximation for what the LSTM sees, while the MSC-CSE hybrid model is for comparison purposes.

Even though the hybrid models are not the actual predictive model for sepsis, we still need to take their accuracy and fidelity into account because these values provide insight into how well the interpretable hybrid models can explain the black-box models. The fidelity[33] of the MSC-hybrid models is calculated by recording consensus in classification between the LSTM/RF model(s) and their respective hybrids. Fidelity is calculated in the same way as accuracy, but instead of using the ground truth, we use the labels from the oracle.
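The distinction between accuracy and fidelity can be made concrete with a short sketch. The function name and the toy label vectors are hypothetical; the only difference between the two quantities is the reference labels against which the hybrid model's predictions are compared.

```python
from typing import Sequence


def agreement(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of instances on which two label sequences agree."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    if not predicted:
        raise ValueError("label sequences must not be empty")
    return sum(1 for p, r in zip(predicted, reference) if p == r) / len(predicted)


# Toy labels for four test instances (illustrative values only).
ground_truth = [0, 0, 1, 1]
oracle_labels = [1, 0, 1, 1]   # predictions of the black-box model (e.g. the LSTM)
hybrid_labels = [1, 0, 0, 1]   # predictions of the MSC hybrid trained on the oracle labels

accuracy = agreement(hybrid_labels, ground_truth)   # compared with the ground truth
fidelity = agreement(hybrid_labels, oracle_labels)  # compared with the oracle
```

A hybrid with high fidelity but modest accuracy is still a useful explainer, since it mirrors the black-box model even where that model is wrong.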

In addition, we compare the sepsis profiles obtained from the hybrid models to identify what features the LSTM and RF models are looking at for sepsis classification. We present all our results as comparisons between:

MSC-LSTM versus MSC-CSE model (profile comparison to obtain what the LSTM learnt)
MSC-CSE versus CSE (feature overlap, including clinical and literature features)
MSC-LSTM versus CSE (feature overlap, including clinical and literature features)

3.2.3. Implementation details. The LSTM model was set up in MATLAB with one LSTM layer of 100 hidden units, followed by a fully connected layer with a softmax activation function. For this experiment, we use the LSTM as the black-box model that is globally explained by creating a hybrid, using its predicted labels to train MSC (as described in Sections 3.2.1 and 3.2.2). The MSC is also trained using the ground-truth labels and the labels from the CSE (to get the benchmark hybrid model). All three models are evaluated using i) accuracy, ii) specificity, iii) sensitivity, iv) precision, v) F1 score, vi) Matthews correlation coefficient and vii) fidelity.

In this section, we present our findings and Table 1 summarizes the results of our experiments.

Using RF as the CSE, we identify the top twenty features that computationally help the CSE to detect sepsis. These twenty features are then used to assess the effectiveness of the proposed IM on the LSTM, through comparison with each other and with clinical and literature features (Fig. 1). The empirical distribution information (maximum, third quartile, median, first quartile, and minimum) of these twenty features is presented in Table 1.

The explanation for finding Random Forest (RF) as the CSE using the 5-step assessment (described in Section 3.1) is detailed in Section 4.1. Additionally, Section 4.2 provides more details about how the IM is used to obtain important features used by LSTM for sepsis classification, and its efficacy and limitations.

4.1. Finding Random Forest, RF, as the CSE.

4.1.1. Finding the CSE candidate. The observation here is that, despite accounting for imbalances in the dataset, none of the models has a balanced performance across all the evaluation metrics (Table 2). Regardless, Random Forest (RF) obtains the highest accuracy, precision, specificity, F1 score and Matthews correlation coefficient, but has very low sensitivity (Table 2). Additionally, since RF can use the Gini index for global interpretation and does not require a local explainer, we choose RF as the CSE and implement the label-shift training paradigm on RF.

Table 1

This table summarizes our results and shows the breakdown of the top twenty (out of forty) features found by the CSE to be relevant for sepsis detection. These twenty features are then used to compare the features found by the IM with i) clinical features, ii) features from literature, and iii) features from human medical experts (marked with asterisks). The table also shows relevant features that the CSE and LSTM used for the sepsis classification task, but currently not regarded as important in the sepsis medical community. The overlap between the CSE and IM is also displayed (underlined).
	Clinical Features	Medical Literature Features	Other Possible Relevant Features	Top 20 Feature Overlap
RF, CSE	Platelets, Temperature, White blood cell/Leukocyte count, Fraction of inspired oxygen (FiO₂), Heart rate, Systolic blood pressure, Respiration rate, Mean arterial pressure, Creatinine* TOTAL: 9	Age, Hemoglobin, Hematocrit, Glucose, Blood urea nitrogen, Bicarbonate, Potassium* TOTAL: 7	ICU length-of-stay, hours between hospital and ICU admit, Diastolic blood pressure**, Chloride TOTAL: 3	1) Clinical and literature features overlap with RF = (9 + 7)(100/20) = 80%* 2) Overlap with medical expert opinion = (17/20)100 = 85%*
LSTM (IM hybrid)	Partial carbon dioxide pressure arterial blood*, Fraction of inspired oxygen (FiO₂), Platelets, Creatinine TOTAL: 4	Glucose, Blood urea nitrogen, Bicarbonate, Calcium, Magnesium, Phosphate, Age, Hemoglobin, Hematocrit, Potassium TOTAL: 10	pH, Diastolic blood pressure**, Oxygen saturation TOTAL: 3	1) Clinical and literature features overlap with IM = (4 + 10)(100/20) = 70%* 2) RF, CSE overlap with IM (underlined) = 11(100/20) = 55%* 3) Overlap with medical expert opinion (marked with asterisk) = (15/20)100 = 75%*
Not in dataset	Glasgow Coma Scale		n/a

Table 2

Evaluation metric values for the different machine learning models from Step 1 of Experiment 1.1. The highest value in each column (i.e., for each evaluation metric) is boldfaced.
	Accuracy (%)	Precision (%)	Specificity (%)	Recall/Sensitivity (%)	F1 Score	Matthews Coefficient
Conv1D	81.5	74.4	63.8	63.8	0.66	0.42
RF	99.0	92.5	99.9	56.3	0.70	0.92
SVM	85.0	9.7	85.0	74.4	0.17	0.23
AdaBoost	88.0	10.2	88.3	60.8	0.17	0.21
MSC	67.5	50.5	66.3	69.8	0.58	0.34
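The metrics reported in Table 2 can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' evaluation code; the function name `classification_metrics` and the example counts are hypothetical and chosen only to illustrate how a model can reach high accuracy and specificity on imbalanced data while its sensitivity stays low.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "specificity": specificity,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for an imbalanced test set (90 positives, 910 negatives).
example = classification_metrics(tp=50, fp=10, tn=900, fn=40)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy is 0.95 even though only 50 of the 90 positive cases are detected, which is the kind of imbalance between metrics discussed in Section 4.1.1.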

4.1.2. Applying the label-shift paradigm to RF. Since RF performs relatively better than the other models presented in Table 2, we apply the label-shift paradigm to RF. The results in Table 3 show that RF performs well for early detection of sepsis (predictions made 1, 6, and 12 hours prior to clinical determination) with comparable accuracy and specificity across shifts, and attains its best precision, recall, F1 score and Matthews coefficient at the 12-hour shift, while utility peaks at the 6-hour shift. Since SVM has the highest recall in Table 2, we also tested SVM for early sepsis detection, but it did not perform as well as RF (hence, the SVM results are omitted from Table 3 for readability).
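The idea of training on shifted labels can be illustrated with a short sketch. The exact label-shift procedure is defined earlier in the paper and is not reproduced here, so the code below rests on an assumption: that a shift of k hours means each hourly record receives the label observed k hours later, so the model learns to flag sepsis k hours before clinical determination. The function name `shift_labels_earlier` and the example patient are hypothetical.

```python
def shift_labels_earlier(labels, hours):
    """Return hourly labels moved `hours` steps earlier in time.

    The label at hour t becomes the original label at hour t + hours.
    The final `hours` entries reuse the last observed label, which is an
    assumption made for this sketch.
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    n = len(labels)
    return [labels[min(t + hours, n - 1)] for t in range(n)]


# Hypothetical patient whose sepsis label turns positive at hour 4.
patient_labels = [0, 0, 0, 0, 1, 1]
for k in (0, 1, 6, 12):
    print(k, shift_labels_earlier(patient_labels, k))
```

A zero shift leaves the labels unchanged, which corresponds to the "no shift" rows in Table 3.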

Table 3

RF accuracy measures for early sepsis detection. The results from Table 2 are included in this table as the “no shift” (i.e., zero shift) case to put sepsis detection at clinical determination time in context with early sepsis detection. Utility is the additional evaluation metric [39] that measures how well the model performs at early sepsis detection.
	Accuracy (%)	Precision (%)	Specificity (%)	Recall/Sensitivity (%)	F1 Score	Matthews Coefficient	Utility
No Shift (Table 2)
RF	99.0	92.5	99.9	56.3	0.70	0.92	0.83
1-Hour Shift
RF	99.0	93.6	99.8	60.4	0.73	0.93	0.82
6-Hour Shift
RF	99.0	95.8	99.8	71.4	0.82	0.95	0.89
12-Hour Shift
RF	99.0	97.0	99.8	78.8	0.87	0.96	0.88

4.1.3. Comparing RF utility with entry models from PhysioNet Challenge:

In the 2019 PhysioNet Challenge [39], which had a total of 104 teams from academia and industry (of which only 88 qualified), the top five submissions had average utility scores of 0.4260 [28], 0.4105 [39], 0.4085 [22], 0.4025 [23], and 0.4025 [29] on the dataset from the two hospitals used for this paper. Table 3 shows that RF had better utility than these entries (0.82 to 0.89). However, a word of caution is in order: one team [24], using an XGBoost model with a Bayesian optimizer and an ensemble learning framework, obtained a utility of 0.522 on the two public datasets, but the utility dropped to 0.364 when the model was tested on a hidden dataset from a third hospital.

4.1.4. Comparison with sepsis diagnosis models from existing medical literature:

Based on a literature search, one good candidate for the CSE could have been AISE [3], which achieved an AUROC between 0.83 and 0.85, a maximum accuracy of 72%, a sensitivity of 85% and a specificity of 67%. However, AISE’s accuracy drops to 60% as the prediction window moves from 0 to 12 hours, whereas the RF trained under our proposed training paradigm (12-hour shift) has an accuracy of 99%, a precision of 97%, a specificity of 99.8% and a recall/sensitivity of 78.8%. In addition, RF reaches a maximum utility of 0.89 under the 6-hour shift, which drops only slightly to 0.88 under the 12-hour shift; even the no-shift RF achieves a utility of 0.83. This suggests that RF is better at early sepsis detection.

Another candidate could have been the Epic Sepsis Model (ESM) [43], a proprietary sepsis prediction model, but [43] concludes that “This external validation cohort study suggests that the ESM has poor discrimination and calibration in predicting the onset of sepsis. The widespread adoption of the ESM despite its poor performance raises fundamental concerns about sepsis management on a national level.” This conclusion excludes the ESM as a CSE. To be fair, however, our paper and [43] use different datasets, and the RF model in this paper was not externally validated.

A third candidate for the CSE could have been the Epic native sepsis model [44]. However, the reported evaluation metric values for the model used in [44] lie in the 33%-78% range.

4.1.5. Comparison with clinical features: Table 4 presents statistical description for the top twenty RF features (relative importance > 0.02) in order of importance, with ICU (Intensive Care Unit) length-of-stay having the highest relative importance of 0.146.

Comparing Fig. 1A and Table 4, we find that all clinical features in Fig. 1A, with the exception of bilirubin and the Glasgow Coma Score (not present in the dataset), appear in the top twenty features used by RF to classify sepsis, confirming a good overlap: 12 of the 14 clinical features were among the top twenty RF features for sepsis detection. While RF also uses bilirubin for classification, it does not appear in the top twenty features. RF also shows an excellent overlap with the features in Fig. 1B (summarized in Table 1).

Moreover, the Wilcoxon rank-sum test for these top twenty features showed a significant difference in medians at the 5% significance level. The associated p-values were reported as 0.000, except for “mean arterial pressure” (p-value = 0.0479). Among the forty variables, “phosphate”, “partial thromboplastin time”, “gender”, and the two ICU unit indicators were not significant, with p-values of 0.2796, 0.6579, 1.000, 1.000 and 1.000, respectively. This further supports the view that the features used by RF, even if not used in clinical assessments (SIRS, SOFA, qSOFA), are indeed important features for computationally identifying sepsis.
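For readers unfamiliar with the rank-sum test, the sketch below shows one common way to compute it: pool the two groups, rank the pooled values (averaging ranks for ties), and compare the rank sum of one group against its expectation using a normal approximation. This is an illustrative implementation under those assumptions, without a tie correction to the variance, and is not the authors' analysis code; the function names and the toy samples are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of `values`, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Returns (z, p_value). No tie correction is applied to the variance.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    std_dev = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - expected) / std_dev
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Toy samples standing in for one feature in the two groups.
non_sepsis_values = [1, 2, 3, 4, 5]
sepsis_values = [6, 7, 8, 9, 10]
z, p = rank_sum_test(non_sepsis_values, sepsis_values)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A feature would be flagged as significant at the 5% level when the returned p-value is below 0.05.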

After examining the results obtained from performing the 5-step assessment described in Section 3.1, we move forward with RF as the CSE.

Table 4

Statistical description for the top 20 features from the Random Forest (zero-shift) model. Q1 and Q3 stand for the first and third quartiles. The minimum, first quartile, median, third quartile and maximum of a variable are often used as an empirical distribution approximation. Note that a minimum of 0 refers to missing values.
	ICU length-of-stay	Platelets	Temperature	Age	Leukocyte count	Fraction of inspired oxygen	Heart Rate	Hours between hospital and ICU admits	Systolic Blood pressure	Hemoglobin
SEPSIS:
Maximum	6	4.026	5.6107	6	3.659	1.5	6	6	5.9821	4.809
Q3	2.2687	1.6032	4.8274	4.7791	1.1648	1.25	2.7801	5.9679	3.2008	2.5101
Median	1.6269	1.3993	4.5929	4.1395	1.1194	1.063	2.4423	5.9678	2.8533	2.2752
Q1	1.2388	1.196	2.75	3.3256	1.076	0	2.1538	5.9085	2.5444	2.04
Minimum	1	0	0	1.0233	0	0	0	1.762	0	0
NON-SEPSIS:
Maximum	6	6	6	6	6	6	6	6	6	6
Q3	1.4627	1.5231	4.7359	4.722	1.133	1.062	2.7277	5.9919	3.1622	2.5268
Median	1.2836	1.3274	2.925	4.0233	1.0864	0	2.3846	5.9678	2.8436	2.2094
Q1	1.1343	0	2.65	3.2674	0	0	2.1257	5.9336	2.5714	0
Minimum	1	0	0	1	0	0	0	0	0	0
	Hematocrit	Respiration Rate	Mean arterial pressure	Creatinine	Glucose	Blood urea nitrogen	Bicarbonate	Chloride	Diastolic Blood Pressure	Potassium
SEPSIS:
Maximum	5.5745	6	5.9643	3.433	6	4.774	5.9303	6	5.8571	4.517
Q3	3.0468	2.4706	2.2222	1.1505	1.7311	1.618	3.2727	4.4034	1.8813	1.8621
Median	2.743	2.1111	2.0185	1.086	1.593	1.3396	2.727	4.067	1.6964	1.6038
Q1	2.4124	1.8586	1.8426	1.052	1.4727	1.187	0	0	1.5216	1.509
Minimum	0	0	0	0	0	0	0	0	0	0
NON-SEPSIS:
Maximum	6	6	6	6	6	6	6	6	6	6
Q3	3.0393	2.2626	2.2143	1.1075	1.6876	1.3774	3.0909	4.2353	1.875	1.8621
Median	2.6465	2.0101	2.0179	1.0645	1.5368	1.2075	0	0	1.6786	1.6038
Q1	0	1.8088	1.8393	0	1.3783	0	0	0	1.4137	0
Minimum	0	0	0	0	0	0	0	0	0	0
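The five-number summary used in Table 4 can be computed per feature and per class as sketched below. This is a minimal illustration using Python's standard library; the quartile interpolation method (`"inclusive"`) is an assumption, since the paper does not state which convention was used, and the function name and sample values are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return the minimum, Q1, median, Q3 and maximum of a sample."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    # Quartile convention is an assumption; other methods give slightly
    # different Q1/Q3 values on small samples.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "minimum": data[0],
        "Q1": q1,
        "median": median,
        "Q3": q3,
        "maximum": data[-1],
    }


# Hypothetical scaled values of one feature for one class.
summary = five_number_summary([7, 1, 3, 5, 2, 6, 4])
print(summary)
```

Because Table 4 encodes missing values as 0, a summary of observed values only would require filtering out those zeros before calling the function.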

4.2. Using the proposed IM to explain LSTM

Table 5 shows the computational evaluation metrics and fidelity of the hybrid models and of the individual models themselves. We find that even though MSC has good utility (compared to the highest utility score of 0.4260 reported in [39]), its accuracy is not as high as that of RF, which limits the performance of the hybrid models as well. Ideally, the accuracy and fidelity values would be as close to one another as possible for model explanation, but that is hard to achieve in practice. Thus, going forward, we need to keep in mind that the hybrid models explain at least 50% (fidelity) and at most 67% (accuracy) of the concepts learnt by RF and LSTM.

Table 5

Predictive performance of the hybrid models on the same test set.
	Accuracy (%)	Fidelity (%)	Precision (%)	Specificity (%)	Recall (%)	F1-Score	Matthews Coefficient	Utility
MSC	67.5	n/a	50.5	66.3	69.8	0.58	0.34	0.51
RF	99.0	n/a	92.5	99.9	56.3	0.70	0.92	0.83
LSTM	91.1	n/a	81.4	74.1	74.1	0.77	0.42	n/a
MSC-RF	67.1	53.6	49.9	66.3	68.9	0.58	0.33	0.51
MSC-LSTM	67.9	53.0	46.9	65.4	73.9	0.57	0.36	0.51
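The distinction between accuracy and fidelity in Table 5 can be made concrete with a small sketch. It assumes the common definition of fidelity as the fraction of test samples on which the hybrid (surrogate) model predicts the same class as the black-box model, whereas accuracy compares the hybrid's predictions with the ground-truth labels; the paper's own definition is given in its methods section. The function name `agreement` and the toy predictions are hypothetical.

```python
def agreement(reference, predicted):
    """Fraction of positions where two equally long label sequences match."""
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must be non-empty")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


# Toy example: ground truth, black-box predictions, hybrid predictions.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_blackbox = [1, 0, 0, 1, 0, 0, 0, 0]
y_hybrid = [1, 0, 1, 1, 0, 0, 1, 0]

accuracy = agreement(y_true, y_hybrid)      # hybrid versus ground truth
fidelity = agreement(y_blackbox, y_hybrid)  # hybrid versus black-box model
print(f"accuracy = {accuracy:.3f}, fidelity = {fidelity:.3f}")
```

The two numbers answer different questions: accuracy measures how well the hybrid solves the task, while fidelity measures how faithfully it mimics the model it is meant to explain.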

4.2.1: Hybrid-LSTM versus Hybrid-CSE. Figure 4 shows that the profiles from MSC-RF and MSC-LSTM not only look similar but agree to the first decimal place (even though they differ from the MSC profiles in major ways). This is expected because RF and LSTM each have an accuracy over 90% (Table 5).

Concepts of sepsis as seen by RF and LSTM through the eyes of the hybrid models

For the two hybrids, from Fig. 4, we find that

Anchors 3 and 9 occur in higher proportions in the sepsis group, while Anchor 2 occurs in a higher proportion in the non-sepsis group, indicating that the main features for differentiating sepsis (as seen by the hybrids) are:
- pH, partial carbon dioxide pressure from arterial blood, diastolic pressure, fraction of inspired oxygen, glucose, platelets, bicarbonate, blood urea nitrogen, calcium, magnesium, phosphate, potassium, hematocrit, hemoglobin, and creatinine. Most of these features match those in Fig. 1 and were found to be significant by the Wilcoxon rank-sum test (Experiment 1.1). However, the lower proportion of Anchor 8, taken together with Anchors 1, 2, 3, 9, and 10, suggests that the hybrids see pH, partial carbon dioxide pressure and oxygen saturation as necessary but not very common features.
Anchor 6 occurs in a high proportion in the non-sepsis class. The main difference between Anchor 6 and all the other anchors is:
- The lower value of the “age” variable. This is supported by Fig. 1B and Table 4, which show that the non-sepsis group tends to have a lower age. Also, the rank-sum test (done in Experiment 1.1) found “age” to be a significant variable.
Anchors 4, 7 and 9 are very similar pattern-wise, but Anchor 9 displays a wider range of variable values. Thus, even though these anchors do not help explain the classification, the higher proportions of Anchor 9 in the hybrids (compared to MSC) might point to the ability of LSTM and RF to represent finer details than the MSC models.
Anchors 1, 5 and 10 are not informative in terms of sepsis differentiation, and seem to capture the common patterns present in both classes.
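The anchor proportions discussed above can be thought of as follows: each sample is assigned to its closest anchor, and the share of samples per anchor is then tabulated separately for each predicted class. The sketch below illustrates this idea only; it assumes Euclidean distance and a given set of anchors, and does not reproduce how MSC selects its anchors. All names and toy values are hypothetical.

```python
import math
from collections import Counter


def nearest_anchor(sample, anchors):
    """Index of the anchor closest to `sample` (Euclidean distance assumed)."""
    return min(range(len(anchors)), key=lambda i: math.dist(sample, anchors[i]))


def anchor_proportions(samples, labels, anchors):
    """Per-class share of samples assigned to each anchor."""
    counts = {}
    totals = Counter(labels)
    for sample, label in zip(samples, labels):
        idx = nearest_anchor(sample, anchors)
        counts.setdefault(label, Counter())[idx] += 1
    return {
        label: {i: counts[label][i] / totals[label] for i in range(len(anchors))}
        for label in counts
    }


# Toy two-feature example with two anchors; labels are model predictions.
anchors = [(0.0, 0.0), (10.0, 10.0)]
samples = [(1.0, 1.0), (0.0, 2.0), (9.0, 9.0), (11.0, 10.0)]
predicted = ["non-sepsis", "non-sepsis", "sepsis", "non-sepsis"]
print(anchor_proportions(samples, predicted, anchors))
```

An anchor whose share differs strongly between the two classes is informative for differentiation, while an anchor with similar shares in both classes captures patterns common to both.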

4.2.2: (Hybrid-RF, Hybrid-LSTM) versus RF. According to the IM, for sepsis classification, the LSTM and the RF models use pH, partial carbon dioxide pressure from arterial blood, age, diastolic pressure, fraction of inspired oxygen, glucose, platelets, bicarbonate, blood urea nitrogen, calcium, magnesium, phosphate, potassium, hematocrit, hemoglobin, and creatinine, along with oxygen saturation (Table 1). Eleven of these 17 features coincide with the top twenty features from RF (Table 1). In addition, similar to the CSE, the IM tells us that the LSTM model considers “age” to be important. The LSTM model also places some importance on pH, partial carbon dioxide pressure and oxygen saturation (which are not among the top twenty RF features). We also find that, unlike the RF CSE, the IM does not identify ICU length-of-stay, temperature, systolic blood pressure, the interval between hospital and ICU admission, heart rate, leukocyte count, etc. (Table 1) as important features.
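The overlap figures quoted here and in Table 1 can be checked with simple set arithmetic. The sketch below transcribes the two feature lists from Table 1 (with shortened, normalized names chosen for this example) and follows the table's convention of expressing overlap as a percentage of twenty features.

```python
# Top twenty RF (CSE) features, transcribed from Table 1.
rf_top20 = {
    "platelets", "temperature", "leukocyte count", "FiO2", "heart rate",
    "systolic blood pressure", "respiration rate", "mean arterial pressure",
    "creatinine", "age", "hemoglobin", "hematocrit", "glucose",
    "blood urea nitrogen", "bicarbonate", "potassium", "ICU length-of-stay",
    "hours between hospital and ICU admit", "diastolic blood pressure",
    "chloride",
}

# Seventeen features identified by the IM for the LSTM, from Table 1.
im_features = {
    "PaCO2", "FiO2", "platelets", "creatinine", "glucose",
    "blood urea nitrogen", "bicarbonate", "calcium", "magnesium",
    "phosphate", "age", "hemoglobin", "hematocrit", "potassium", "pH",
    "diastolic blood pressure", "oxygen saturation",
}

shared = rf_top20 & im_features
overlap_percent = len(shared) * 100 / 20  # Table 1 normalizes by 20 features

print(sorted(shared))
print(len(shared), overlap_percent)  # 11 55.0
print("only in RF:", sorted(rf_top20 - im_features))
print("only in IM:", sorted(im_features - rf_top20))
```

The set differences also list the features that only one of the two approaches identifies, which is the basis for the comparison in the remainder of this paragraph.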

In this paper, we propose the development of a data-driven, semi-automated IM for qualitatively evaluating concepts learnt by black-box ML models. This IM is designed to mitigate the prevalent accuracy-interpretability trade-off [34–35] in machine learning while addressing the need for transparency in healthcare, and is tested on an LSTM model to aid timely sepsis diagnosis.

The strength of this work is threefold. One, it shows that it is feasible to create an IM using the nearest neighbor concept, in addition to the use of decision trees and fuzzy rules, though better accuracy for MSC is desirable. Two, it presents an evaluation method for the IM using clinical and literature features through the establishment of the CSE, specifically for sepsis detection. And three, we report features that are not currently considered by the sepsis medical/research community but might aid in timely sepsis detection.

By reinforcing the use of anchors (which are selected based on the nearest neighbor and clustering concepts), MSC ensures that the IM’s output can be interpreted in terms of clinical features (as opposed to the data representation from black-box models, which remains uninterpretable to a human, or covariance matrices, which require statistical knowledge). Moreover, the choice of anchors is data driven and does not require human/expert intervention; however, should the need arise, MSC can also incorporate anchors selected by medical experts. By compressing the complex data into low-dimensional structures, MSC allows the proposed IM to produce outputs in the form of bar charts or line graphs, which further serves to make the results interpretable to a wider population.

The establishment of the CSE (through the use of the proposed label-shift paradigm and literature comparisons), which had excellent overlap with clinical features (Table 1), showed that machine learning models are not just computational optimizers; rather, ML models have the ability to pick up trends in clinical features. The CSE also showed that literature features, even when independently linked to sepsis, can be useful features for (early) sepsis diagnosis. This supports our belief that, even while lacking a clinician’s experience, ML models can aid clinicians in making real-life decisions if given enough data and relevant features (including features not used in standard clinical diagnosis). With black-box ML models at the forefront of data analysis, the development of an IM that works independently of model type could revolutionize healthcare systems.

In this paper, out of the 40 features present in the dataset, we studied the top 20 features from the RF CSE and the 17 features from the proposed IM (Table 1). These two sets share 11 features. Both the RF CSE and the IM agree that features found in the medical literature (such as age, hemoglobin, hematocrit, glucose, blood urea nitrogen, bicarbonate, potassium, etc.) but not used in SIRS, SOFA or qSOFA are important for sepsis detection. Other features that were unique to the RF CSE are ICU length-of-stay, hours between hospital and ICU admit, and chloride ion concentration. Features unique to the IM are pH and oxygen saturation. Based on medical expert opinion, treating clinicians do not find it ethical or reasonable to use “age”, “length of ICU stay” and/or “hospital admit time” to “discriminate” between patients and their treatment plans, even though older patients have been shown to be more susceptible to sepsis [58–59]. Thus, the features that can be relevant in clinical settings and warrant further investigation are chloride ions, pH and oxygen saturation. In addition, further investigation into “age” is warranted because the medical literature [58–59], the RF CSE and the IM all indicate that, even though patients of any age can develop sepsis, older patients are at higher risk.

The lack of a 100% overlap between the RF CSE and the IM might be attributed to the difference in accuracy seen between them in Tables 2 and 5. Another possible explanation, based on the opinion of our clinical collaborator, may be that strong correlations exist between: i) pH and systolic blood pressure, ii) partial carbon dioxide pressure and respiratory rate, and iii) oxygen saturation, fraction of inspired oxygen, and respiratory rate. Since the LSTM (or the IM) picks up on fraction of inspired oxygen, pH, partial carbon dioxide pressure, and oxygen saturation, it may find that respiratory rate and systolic blood pressure no longer provide additional information.
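One way to examine the suggested correlations on the dataset would be to compute a correlation coefficient for each feature pair. The sketch below uses the Pearson coefficient as an assumed choice (a rank-based coefficient may suit skewed clinical variables better); the function name and the toy values are hypothetical and do not come from the study.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Toy, made-up values standing in for two paired feature columns.
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_b = [2.1, 3.9, 6.2, 8.1, 9.8]
print(f"r = {pearson_correlation(feature_a, feature_b):.3f}")
```

A coefficient close to 1 or -1 for a pair such as pH and systolic blood pressure would be consistent with the explanation that one feature adds little information once the other is used.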

However, there are also limitations to our current study. One, MSC has a much lower accuracy than either RF or LSTM, so the fidelity is limited by MSC’s reduced discriminative ability. Two, as implemented in this paper, even though we were able to interpret the LSTM model qualitatively and used indirect quantitative measures for evaluation, a direct quantitative measure is still lacking. Three, none of the models used here has been externally validated (a key requirement for using ML models in clinical settings). Nevertheless, this study warrants further investigation into the development of IMs and the use of the nearest neighbor concept to help black-box ML models attain transparency. We hope to address these three limitations in our future work.

In this paper, we investigated the interpretability of complex ML models through the creation of interpretable hybrids using MSC. Using the proposed label-shift paradigm, five ML models, literature comparison and seven evaluation metrics (including one for measuring early sepsis detection), we found Random Forest to be a good computational expert for sepsis diagnosis. Next, we used MSC (the base of our proposed IM) to create hybrid models for Random Forest and LSTM to gain insight into what the two models have learnt regarding sepsis. Both hybrid models showed substantial feature overlap with the CSE and the clinical features. The results of the Wilcoxon rank-sum test also supported the features identified by the hybrids as features that play a crucial role in recognizing sepsis. The results presented here show promise for the continued use and further exploration of MSC for unraveling black-box ML models in healthcare studies.

Competing Interest

Author Contribution: All persons who meet authorship criteria are listed as authors. RS conceived of the study. ES implemented. RS and ES participated in interpretation of the data, and drafting the manuscript. RS and JT¹ worked on the final version. BB and JT² (even though not sepsis experts) provided clinical insight.

Conflict Of Interest: The authors have no competing interests to declare.

Funding: ES received funding from NSF REU, Award Abstract # 2050978

Callahan A. (2017). Key Advances in Clinical Informatics. Chapter 19 Machine Learning in Healthcare. ISBN: 9780128095232
Konstantina Kourou, Themis P. Exarchos, Konstantinos P. Exarchos, Michalis V. Karamouzis, Dimitrios I. Fotiadis. Machine learning applications in cancer prognosis and prediction, Computational and Structural Biotechnology Journal, Volume 13, 2015, Pages 8–17, ISSN 2001-0370
Nemati, S., Holder, A., Razmi, F., Stanley, M. D., Clifford, G. D., & Buchman, T. G. (2018). An Interpretable Machine Learning Model for Accurate Prediction of Sepsis in the ICU. Critical care medicine, 46(4), 547–553. https://doi.org/10.1097/CCM.0000000000002936
M. Veta, J. P. W. Pluim, P. J. van Diest and M. A. Viergever, "Breast Cancer Histopathology Image Analysis: A Review," in IEEE Transactions on Biomedical Engineering, vol. 61, no. 5, pp. 1400–1411, May 2014, doi: 10.1109/TBME.2014.2303852.
Charron, Martin; Beyer, Thomas; Bohnen, Nicholas N.; Kinahan, Paul E.; Dachille, Marsha; Jerin, Jeff; Nutt, Ronald; Meltzer, Carolyn Cidis; Villemagne, Victor; Townsend, David W. Image Analysis in Patients with Cancer Studied with a Combined PET and CT Scanner, Clinical Nuclear Medicine: November 2000 – Volume 25 - Issue 11 - p 905–910
Trevor J. Huff, Parker E. Ludwig & Jorge M. Zuniga (2018) The potential for machine learning algorithms to improve and reduce the cost of 3-dimensional printing for surgical planning, Expert Review of Medical Devices, 15:5, 349–356, DOI: 10.1080/17434440.2018.1473033
Vamathevan, J., Clark, D., Czodrowski, P. et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 18, 463–477 (2019). https://doi.org/10.1038/s41573-019-0024-5
J. Doyne Farmer, Norman H Packard, Alan S Perelson, The immune system, adaptation, and machine learning, Physica D: Nonlinear Phenomena, Volume 22, Issues 1–3,1986, Pages 187–204,ISSN 0167–2789, https://doi.org/10.1016/0167-2789(86)90240-X
Kong G, Lin K, Hu Y. Using machine learning methods to predict in-hospital mortality of sepsis patients in the ICU. BMC Medical Informatics and Decision Making. 2020;20(1):1–10
Yao Rq, Jin X, Wang Gw, Yu Y, Wu Gs, Zhu Yb, et al. A machine learning-based prediction of hospital mortality in patients with postoperative sepsis. Frontiers in Medicine. 2020;7:445
Song W, Jung SY, Baek H, Choi CW, Jung YH, Yoo S. A Predictive Model Based on Machine Learning for the Early Detection of Late-Onset Neonatal Sepsis: Development and Observational Study. JMIR Medical Informatics. 2020;8(7):e15965
Chaudhary P, Gupta DK, Singh S. Outcome Prediction of Patients for Different Stages of Sepsis Using Machine Learning Models. In: Advances in Communication and Computational Technology. Springer; 2021. p. 1085–1098
Delahanty RJ, Alvarez J, Flynn LM, Sherwin RL, Jones SS. Development and evaluation of a machine learning model for the early identification of patients at risk for sepsis. Annals of emergency medicine. 2019;73(4):334–344
Hou N, Li M, He L, Xie B, Wang L, Zhang R, et al. Predicting 30-days mortality for MIMIC-III patients with sepsis-3: a machine learning approach using XGboost. Journal of Translational Medicine. 2020;18(1):1–14. pmid:33287854
Aşuroğlu T, Oğul H. A deep learning approach for sepsis monitoring via severity score estimation. Computer Methods and Programs in Biomedicine. 2021;198:105816
Kok C, Jahmunah V, Oh SL, Zhou X, Gururajan R, Tao X, et al. Automated prediction of sepsis using temporal convolutional network. Computers in Biology and Medicine. 2020;127:103957. pmid:32938540
Li Q, Li L, Zhong J, Huang LF. Real-time sepsis severity prediction on knowledge graph deep learning networks for the intensive care unit. Journal of Visual Communication and Image Representation. 2020;72:102901
Svenson P, Haralabopoulos G, Torres MT. Sepsis Deterioration Prediction Using Channeled Long Short-Term Memory Networks. In: International Conference on Artificial Intelligence in Medicine. Springer; 2020. p. 359–370
Lauritsen SM, Kalør ME, Kongsgaard EL, Lauritsen KM, Jørgensen MJ, Lange J, et al. Early detection of sepsis utilizing deep learning on electronic health record event sequences. Artificial Intelligence in Medicine. 2020;104:101820. pmid:32498999
Narayanaswamy L, Garg D, Narra B, Narayanswamy R. Machine Learning Algorithmic and System Level Considerations for Early Prediction of Sepsis. In: 2019 Computing in Cardiology (CinC). IEEE; 2019. p. Page–1
Henry KE, Hager DN, Pronovost PJ, Saria S. A targeted real-time early warning score (TREWScore) for septic shock. Science translational medicine. 2015;7(299):299ra122–299ra122
M. Zabihi, S. Kiranyaz and M. Gabbouj, "Sepsis Prediction in Intensive Care Unit Using Ensemble of XGboost Models," 2019 Computing in Cardiology (CinC), 2019, pp. Page 1-Page 4, doi: 10.23919/CinC49843.2019.9005564
J. Singh, K. Oshiro, R. Krishnan, M. Sato, T. Ohkuma and N. Kato, "Utilizing Informative Missingness for Early Prediction of Sepsis," 2019 Computing in Cardiology (CinC), 2019, pp. 1–4, doi: 10.23919/CinC49843.2019.9005809
Yang, Meicheng & Wang, Xingyao & Hongxiang, Gao & Li, Yuwen & Liu, Xing & Li, Jianqing & Liu, Chengyu. (2019). Early Prediction of Sepsis Using Multi-Feature Fusion Based XGBoost Learning and Bayesian Optimization. 10.22489/CinC.2019.020
Futoma, J., Hariharan, S., Heller, K., Sendak, M., Brajer, N., Clement, M., Bedoya, A.; O’Brien, C. (2017). An Improved Multi-Output Gaussian Process RNN with Real-Time Validation for Early Sepsis Detection. Proceedings of the 2nd Machine Learning for Healthcare Conference, in Proceedings of Machine Learning Research 68:243–254
Moor, M., Horn, M., Rieck, B., Roqueiro, D. & Borgwardt, K. (2019). Early Recognition of Sepsis with Gaussian Process Temporal Convolutional Networks and Dynamic Time Warping. Proceedings of the 4th Machine Learning for Healthcare Conference, Proceedings of Machine Learning Research 106:2–26. Available from https://proceedings.mlr.press/v106/moor19a.html.
Mao Q, Jay M, Hoffman JL, Calvert J, Barton C, Shimabukuro D, Shieh L, Chettipally U, Fletcher G, Kerem Y, Zhou Y, Das R. Multicentre validation of a sepsis prediction algorithm using only vital sign data in the emergency department, general ward and ICU. BMJ Open. 2018 Jan 26;8(1):e017833. doi: 10.1136/bmjopen-2017-017833. PMID: 29374661; PMCID: PMC5829820
J. Morrill, A. Kormilitzin, A. Nevado-Holgado, S. Swaminathan, S. Howison and T. Lyons, "The Signature-Based Model for Early Detection of Sepsis From Electronic Health Records in the Intensive Care Unit," 2019 Computing in Cardiology (CinC), 2019, pp. Page 1-Page 4, doi: 10.23919/CinC49843.2019.9005805
Yang M, Liu C, Wang X, Li Y, Gao H, Liu X, Li J. An Explainable Artificial Intelligence Predictor for Early Detection of Sepsis. Crit Care Med. 2020 Nov;48(11):e1091-e1096. doi: 10.1097/CCM.0000000000004550. PMID: 32885937.
J. M. Benítez, J. L. Castro, and I. Requena, Are artificial neural networks black boxes?, IEEE Transactions on Neural Networks, vol. 8, pp. 1156–1164, 1997
N. Frosst and G. Hinton, “Distilling a neural network into a soft decision tree,” CExAIIA, 2017; P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang, “Concept bottleneck models,” 7 2020. [Online]. Available: http://arxiv.org/abs/2007.04612
C.K. Yeh, B. Kim, S. O. Arik, C.L. Li, T. Pfister, and P. Ravikumar, “On completeness aware concept-based explanations in deep neural networks,” 10 2019. [Online]. Available: http://arxiv.org/abs/1910.07969
Mark W. Craven and Jude W. Shavlik. 1995. Extracting tree-structured representations of trained networks. In Proceedings of the 8th International Conference on Neural Information Processing Systems (NIPS'95). MIT Press, Cambridge, MA, USA, 24–30
Bratko, I., Machine Learning: Between Accuracy and Interpretability, Learning, Networks and Statistics, 163–177, 1997
A. Bibal, B. Frénay, Interpretability of Machine Learning Models and Representations: an Introduction, ESANN 2016 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 27–29 April 2016, i6doc.com publ., ISBN 978-287587027-8.
M. T. Ribeiro, S. Singh, and C. Guestrin, “Why Should I Trust You?: Explaining the predictions of any classifier,”Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016. [Online]. Available: https://doi.org/10.1145/2939672.2939778
Rosnati M, Fortuin V (2021) MGP-AttTCN: An interpretable machine learning model for the prediction of sepsis. PLoS ONE 16(5): e0251248. https://doi.org/10.1371/journal.pone.0251248
Singer M, Deutschman CS, Seymour CW, et al. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA. 2016 Feb 23;315(8):801 – 10. doi: 10.1001/jama.2016.0287. PMID: 26903338; PMCID: PMC4968574.
Reyna MA, Josef CS, Jeter R, et al. Early Prediction of Sepsis From Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019. Crit Care Med. 2020;48(2):210–217. doi:10.1097/CCM.0000000000004145
Vincent X. Liu, Vikram Fielding-Singh, John D. Greene, Jennifer M. Baker, Theodore J. Iwashyna, Jay Bhattacharya, Gabriel J. Escobar, “The Timing of Early Antibiotics and Hospital Mortality in Sepsis” https://doi.org/10.1164/rccm.201609-1848OC
Ferrer R, Artigas A, Levy MM, Blanco J, Gonzalez-Diaz G, Garnacho-Montero J, Ibanez J, Palencia E, Quintana M, De la Torre-Prados MV, et al. Improvement in process of care and outcome after a multicenter severe sepsis educational program in Spain. JAMA. 2008;299(19):2294–303
Rosnati M, Fortuin V (2021) MGP-AttTCN: An interpretable machine learning model for the prediction of sepsis. PLoS ONE 16(5): e0251248. https://doi.org/10.1371/journal.pone.0251248
Wong A, Otles E, Donnelly JP, et al. External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. JAMA Intern Med. 2021;181(8):1065–1070. doi:10.1001/jamainternmed.2021.2626
Bennett, T., Russell, S., King, J., Schilling, L., Voong, C., Rogers, N., & Ghosh, D. (2019). Accuracy of the Epic sepsis prediction model in a regional health system. arXiv preprint arXiv:1902.07276
Horng S, Sontag DA, Halpern Y, Jernite Y, Shapiro NI, et al. (2017) Creating an automated trigger for sepsis clinical decision support at emergency department triage using machine learning. PLOS ONE 12(4): e0174708. https://doi.org/10.1371/journal.pone.0174708
Halpern, Y., Choi, Y., Horng, S., & Sontag, D. (2014). Using Anchors to Estimate Clinical State without Labeled Data. AMIA Annual Symposium proceedings. AMIA Symposium, 2014, 606–615
Halpern, Y., Horng, S., Choi, Y., & Sontag, D. (2016). Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association, 23(4), 731–740
Wernly, B., Lichtenauer, M., Hoppe, U. C., & Jung, C. (2016). Hyperglycemia in septic patients: an essential stress survival response in all, a robust marker for risk stratification in some, to be messed with in none. Journal of thoracic disease, 8(7), E621–E624. https://doi.org/10.21037/jtd.2016.05.24
Zhang Z, Zhu C, Mo L, Hong Y. Effectiveness of sodium bicarbonate infusion on mortality in septic patients with metabolic acidosis. Intensive Care Med. 2018 Nov;44(11):1888–1895. doi: 10.1007/s00134-018-5379-2. Epub 2018 Sep 25. PMID: 30255318.
Li, X., Li, T., Wang, J., Dong, G., Zhang, M., Xu, Z., Hu, Y., Xie, B., Yang, J., & Wang, Y. (2021). Higher blood urea nitrogen level is independently linked with the presence and severity of neonatal sepsis. Annals of medicine, 53(1), 2192–2198. https://doi.org/10.1080/07853890.2021.2004317
Collage, R. D., Howell, G. M., Zhang, X., Stripay, J. L., Lee, J. S., Angus, D. C., & Rosengart, M. R. (2013). Calcium supplementation during sepsis exacerbates organ failure and mortality via calcium/calmodulin-dependent protein kinase signaling. Critical care medicine, 41(11), e352–e360. https://doi.org/10.1097/CCM.0b013e31828cf436
Velissaris D, Karamouzos V, Pierrakos C, Aretha D, Karanikolas M. Hypomagnesemia in Critically Ill Sepsis Patients. J Clin Med Res. 2015 Dec;7(12):911–8. doi: 10.14740/jocmr2351w. Epub 2015 Oct 23. PMID: 26566403; PMCID: PMC4625810.
Limaye CS, Londhey VA, Nadkart MY, Borges NE. Hypomagnesemia in critically ill medical patients. J Assoc Physicians India. 2011 Jan;59:19–22. PMID: 21751660.
Al Harbi, S.A., Al-Dorzi, H.M., Al Meshari, A.M. et al. Association between phosphate disturbances and mortality among critically ill patients with sepsis or septic shock. BMC Pharmacol Toxicol 22, 30 (2021). https://doi.org/10.1186/s40360-021-00487-w
Tongyoo, S., Viarasilpa, T., & Permpikul, C. (2018). Serum potassium levels and outcomes in critically ill patients in the medical intensive care unit. The Journal of international medical research, 46(3), 1254–1262. https://doi.org/10.1177/0300060517744427
Jung, S. M., Kim, Y. J., Ryoo, S. M., & Kim, W. Y. (2019). Relationship between low hemoglobin levels and mortality in patients with septic shock. Acute and critical care, 34(2), 141–147. https://doi.org/10.4266/acc.2019.00465
Jansma, G., de Lange, F., Kingma, W.P. et al. ‘Sepsis-related anemia’ is absent at hospital presentation; a retrospective cohort analysis. BMC Anesthesiol 15, 55 (2015). https://doi.org/10.1186/s12871-015-0035-7
Martin-Loeches, I., Guia, M. C., Vallecoccia, M. S., Suarez, D., Ibarz, M., Irazabal, M., Ferrer, R., & Artigas, A. (2019). Risk factors for mortality in elderly and very elderly critically ill patients with sepsis: a prospective, observational, multicenter cohort study. Annals of intensive care, 9(1), 26. https://doi.org/10.1186/s13613-019-0495-x
Nasa, P., Juneja, D., & Singh, O. (2012). Severe sepsis and septic shock in the elderly: An overview. World journal of critical care medicine, 1(1), 23–30. https://doi.org/10.5492/wjccm.v1.i1.23
Seymour, C. W., Liu, V. X., Iwashyna, T. J., Brunkhorst, F. M., Rea, T. D., Scherag, A., Rubenfeld, G., Kahn, J. M., Shankar-Hari, M., Singer, M., Deutschman, C. S., Escobar, G. J., & Angus, D. C. (2016). Assessment of Clinical Criteria for Sepsis: For the Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA, 315(8), 762–774. https://doi.org/10.1001/jama.2016.0288
Vincent, J. L., Moreno, R., Takala, J., Willatts, S., De Mendonça, A., Bruining, H., ... & Thijs, L. G. (1996). The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure.
Evans T. (2018). Diagnosis and management of sepsis. Clinical medicine (London, England), 18(2), 146–149. https://doi.org/10.7861/clinmedicine.18-2-146
Parmar A., Katariya R., Patel V. (2019) A Review on Random Forest: An Ensemble Classifier. In: Hemanth J., Fernando X., Lafata P., Baig Z. (eds) International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI) 2018. ICICI 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-030-03146-6_86
John Shawe-Taylor, Shiliang Sun, A review of optimization methodologies in support vector machines, Neurocomputing, Volume 74, Issue 17, 2011, Pages 3609–3618, ISSN 0925-2312, https://doi.org/10.1016/j.neucom.2011.06.026
Ruihu Wang, AdaBoost for Feature Selection, Classification and Its Relation with SVM, A Review, Physics Procedia, Volume 25, 2012, Pages 800
Charu C. Aggarwal. 2014. The setwise stream classification problem. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '14). Association for Computing Machinery, New York, NY, USA, 432–441. https://doi.org/10.1145/2623330.2623751
Marco Marozzi, Amitava Mukherjee, Jan Kalina, Interpoint distance tests for high-dimensional comparison studies, Pages 653–665, Published online: 31 Jul 2019, https://doi.org/10.1080/02664763.2019.1649374
S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," in Neural Computation, vol. 9, no. 8, pp. 1735–1780, 15 Nov. 1997, doi: 10.1162/neco.1997.9.8.1735
X. Li, Y. Kang, X. Jia, J. Wang and G. Xie, "TASP: A Time-Phased Model for Sepsis Prediction," 2019 Computing in Cardiology (CinC), 2019, pp. 1–4, doi: 10.23919/CinC49843.2019.9005773
NV Chawla, KW Bowyer, LO Hall, WP Kegelmeyer, SMOTE: synthetic minority over-sampling technique, 2002, Journal of artificial intelligence research 16, 321–357
Shamsuddin, R., Sawant, A., & Prabhakaran, B. (2017). Developing a low dimensional patient class profile in accordance to their respiration-induced tumor motion. Proceedings of the VLDB Endowment, 10(12), 1610–1621
Serkan Kiranyaz, Onur Avci, Osama Abdeljaber, Turker Ince, Moncef Gabbouj, Daniel J. Inman, 1D convolutional neural networks and applications: A survey, Mechanical Systems and Signal Processing, Volume 151, 2021, 107398, ISSN 0888-3270, https://doi.org/10.1016/j.ymssp.2020.107398.
A. Messalas, Y. Kanellopoulos and C. Makris, "Model-Agnostic Interpretability with Shapley Values," 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA), 2019, pp. 1–7, doi: 10.1109/IISA.2019.8900669.
Elizabeth Kumar, Suresh Venkatasubramanian, Carlos Scheidegger, Sorelle Friedler, Problems with Shapley-value-based explanations as feature importance measures, Proceedings of the 37th International Conference on Machine Learning, PMLR 119:5491–5500, 2020.
Balasubramanian, A., Shamsuddin, R., Prabhakaran, B. (2017). Predictive modeling of respiratory tumor motion for real-time prediction of baseline shifts. Physics in Medicine and Biology, IOP Publishing, 62(5), 1791.

No competing interests reported.
