3.1. Finding CSE and establishing a benchmark (Experiment 1.1)
No computational diagnostic model, however accurate, has a clinician’s experience and intuition to fall back on when a judgment call is needed. For a fair evaluation of the proposed IM, instead of comparing it only to clinical features, we should also compare it to the features that a CSE would use for sepsis classification. Hence, the CSE works as a benchmark against which the results of the proposed IM can be evaluated.
Our search for the CSE is guided by the following characteristic criteria of an ideal CSE. For a machine learning classifier to be considered a sepsis expert:

The CSE must be good at solving both sepsis detection (detection at clinical diagnosis time) and early sepsis detection (detection prior to clinical diagnosis). “Goodness” can be measured using the following evaluation metrics:

Traditional measures: accuracy, specificity, sensitivity, precision

Imbalanced-data measures: F1 score, Matthews correlation coefficient

PhysioNet early detection measure: utility [39]. The PhysioNet challenge organizers provided the utility measure as a way to summarize an ML model’s ability to detect sepsis early in a single number.

When solving the detection task, the CSE must demonstrate a good overlap with clinical and literature features (Fig. 1).

We expect the CSE to require the use of more features than those presented in Fig. 1A.
Thus, we adopt the following 5-step assessment (Sections 3.1.1–3.1.5) to select the CSE from the many models in the literature and the models that we trained.
3.1.1. CSE determination, Step 1: Evaluation against ML models using traditional metrics for sepsis detection at clinical time. To identify the model that achieves the best performance on sepsis detection, we use four representative groups of ML algorithms:

1D Convolution Network[72] classifier or Conv1D – this is a state-of-the-art deep network that requires LIME/SHAP, which can only provide local interpretations.

AdaBoost and Support Vector Machine or SVM – these are traditional ML models, and both require LIME/SHAP for local interpretation. AdaBoost models are usually tree-based models.

Random Forest or RF – this is the most powerful nonlinear model in the class of intrinsically interpretable models; it uses the Gini impurity index to calculate feature importance. This is a tree-based model.

Multiset Classifier or MSC – this uses concepts of nearest neighbors and unsupervised learning for classification. In this paper, we show how it can be both conceptually and visually human-interpretable. As such, this is the algorithm on which we propose to base the IM.
The performance of these models is measured using traditional ML evaluation metrics: i) accuracy, ii) precision, iii) specificity, iv) sensitivity, v) F1 score, and vi) Matthews correlation coefficient. The model with the best performance on these metrics, i.e., the CSE candidate, is then passed on to the next stage for further assessment.
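As an illustrative sketch (not the implementation used in this study), the six metrics can be computed from the confusion-matrix counts of a binary classifier. The function name `binary_metrics`, the label coding (1 = sepsis, 0 = non-sepsis), and the convention of returning 0 when a denominator is zero are assumptions made for this example.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute the six metrics for binary labels (1 = sepsis, 0 = non-sepsis)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Assumption: an undefined ratio (zero denominator) is reported as 0.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
        "sensitivity": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

Reporting F1 and the Matthews correlation coefficient next to accuracy matters for imbalanced data, where a model that labels every record as non-sepsis can still reach a high accuracy.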
3.1.2. CSE determination, Step 2: Evaluation of CSE candidate on early sepsis detection (earlier than clinical time) using the proposed label-shift training paradigm. Sepsis detection performed x hours before the medical determination of sepsis is considered early sepsis detection. Computational detection of sepsis 6 hours prior to medical determination is considered optimal, while 12 hours prior is considered early[39].
Given this clinical definition and the computational task of early sepsis detection, we propose a label-shift training paradigm to train the CSE candidate, because none of the models listed in Section 3.1.1 is as capable as an LSTM at capturing temporal patterns. We therefore adapt the training paradigm from one of our earlier works[75] on time series signals so that any of these models can process temporal information. In this training paradigm, instead of optimizing the model hyperparameters and/or architecture, we rearrange the data that is fed to the model.
Suppose each patient is denoted by \(P_{id}\), where id is unique to each patient. Each \(P_{id}\) contains a set of rows, with the features/attributes in separate columns. The rows in each \(P_{id}\) are temporally dependent, such that any two rows \({}^{id}r_{i}, {}^{id}r_{j} \in P_{id}\), where \(i < j\) (and i, j denote the ith and jth rows in \(P_{id}\)), can be expressed as \({}^{id}r_{t}\) and \({}^{id}r_{t+c}\) respectively, where c is a positive constant that represents a time interval; for this sepsis dataset, c = 1 hour if j = i + 1. The sepsis label associated with \({}^{id}r_{t}\) is denoted by \({}^{id}l_{t}\). Thus, the traditional training paradigm dataset is denoted by \(Sepsis_{T} = \{[{}^{id}r_{t}, {}^{id}l_{t}] \mid t \in \mathbb{Z}^{+}\}\). Under the label-shift paradigm, each \({}^{id}r_{t}\) is associated with a label from a ‘future’ or subsequent time interval. The basis for this shift is that we want to train the ML model to predict at the present time by learning to associate the past data with the future state.
Thus, the training dataset is now denoted by \(Sepsis_{T_{c}} = \{[{}^{id}r_{t}, {}^{id}l_{t+c}] \mid t, c \in \mathbb{Z}^{+}\}\). Since we are interested in predicting sepsis 12 hours, 6 hours, and 1 hour earlier than sepsis onset[39], we restrict \(c \in \{1, 6, 12\}\), resulting in three different training sets. This allows us to measure the performance of the CSE candidate separately for each value of c, using: i) accuracy, ii) precision, iii) specificity, iv) sensitivity, v) F1 score, and vi) Matthews correlation coefficient.
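The label-shift rearrangement can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: each patient record is assumed to be a pair of equal-length lists (hourly rows and hourly labels), rows that have no label c hours ahead are assumed to be dropped, and the names `label_shift` and `build_shifted_sets` are hypothetical.

```python
def label_shift(rows, labels, c):
    """Pair each hourly row r_t with the label l_{t+c} of the same patient."""
    if c < 0:
        raise ValueError("c must be non-negative")
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    # Assumption: the last c rows have no future label and are dropped.
    return [(rows[t], labels[t + c]) for t in range(len(rows) - c)]


def build_shifted_sets(patients, shifts=(1, 6, 12)):
    """Build one training set per shift c.

    `patients` maps a patient id to (rows, labels). The shift is applied
    within each patient, so a row is never paired with another patient's label.
    """
    return {
        c: [
            pair
            for rows, labels in patients.values()
            for pair in label_shift(rows, labels, c)
        ]
        for c in shifts
    }
```

Because only the pairing of rows and labels changes, any of the classifiers in Section 3.1.1 can be trained on the three resulting sets without modification.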
3.1.3. CSE determination, Step 3: Evaluation using the utility metric against entries from 2019 PhysioNet Challenge. The 2019 PhysioNet Challenge[39], which focused on early sepsis detection, provided a new ML evaluation metric called utility. The utility measures whether a sepsis diagnosis model is able to detect sepsis at an early stage, i.e., at most 12 hours earlier than detection by the medical expert, with detection 6 hours prior to medical detection obtaining a perfect score. This score takes a minimum value of 0 and a maximum of 1, with higher values indicating better ML model performance. The steps to calculate utility are given in[39], and the code is available at https://github.com/physionetchallenges/evaluation2019.
Since we are using the data made available by [39], in addition to Step 2, we also compare the utility values of the listed models with the values obtained by the top-performing entries[22–24, 28–29, 39, 69] in the challenge [39].
3.1.4. CSE determination, Step 4: Comparison with sepsis diagnosis models in clinical literature. Because early detection is needed, sepsis is a frequently studied condition in the medical/clinical field, with a significant body of literature. We identified a few of these models that were tested in a clinical setting and compared their performance with that of the CSE candidate.
The literature models that were chosen are: AISE[3], the Epic Sepsis Model (ESM)[43], and the EPIC native sepsis model[44].
3.1.5. CSE determination, Step 5: Overlap comparison with clinical features. We use the respective explainability/interpretability technique (Gini index if RF; LIME/SHAP otherwise) to obtain the twenty most important features used by the CSE to detect sepsis. Their five-number summaries (maximum, third quartile, median, first quartile, minimum) are then recorded. Additionally, Wilcoxon’s rank-sum hypothesis test is calculated using the median for the sepsis and non-sepsis groups to obtain the statistical significance (for all 40 features).
The overlap of these top twenty features with the clinical features and the features from literature (Fig. 1) is then recorded and presented.
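The per-feature statistics in this step can be sketched as below. This is an illustration only: the quartile convention (`method="inclusive"`), the normal approximation without tie or continuity correction for the rank-sum test, and all function names are assumptions, and the test is applied here to the raw values of one feature in the two groups, which may differ from the exact procedure used in this study.

```python
import math
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def rank_sum_test(sepsis, non_sepsis):
    """Wilcoxon rank-sum z statistic and two-sided p-value (normal approximation)."""
    n1, n2 = len(sepsis), len(non_sepsis)
    ranks = average_ranks(list(sepsis) + list(non_sepsis))
    w = sum(ranks[:n1])  # rank sum of the sepsis group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return z, math.erfc(abs(z) / math.sqrt(2))


def feature_overlap(top_features, reference_features):
    """Features that appear in both lists, in alphabetical order."""
    return sorted(set(top_features) & set(reference_features))
```

For small groups an exact test from a statistics library would be preferable to the normal approximation used here.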
3.1.6. Implementation details. The convolution model was set up in MATLAB with one 1D convolution layer (15 filters of size 3x3, ReLU activation), followed by a max pooling layer and a fully connected layer with a softmax activation function. RF was implemented in Python using 100 estimators, and accounted for the imbalanced dataset by setting the built-in hyperparameter “class weight” to “balanced”. AdaBoost was also implemented in Python, using 200 estimators. SMOTE[70] with “minority” sampling was used to handle class imbalance. For SVM, we used the RBF kernel with the regularization constant set to 0.5, along with the built-in mechanism in Python to handle the imbalance. For MSC, we used k-means++ to choose 10 anchors, set the model to have 1 class profile for sepsis and 1 class profile for non-sepsis, and used the Euclidean distance to measure similarity.
3.2. Creating and Evaluating the IM (Experiment 1.2)
Multiset Classifier, MSC[66], is not a commonly used algorithm and relies on nearest-neighbor principles for classification. However, its outputs include class profiles (in addition to class labels), which are low-dimensional data structures[71]. The low dimensionality of the profiles makes the model globally and intrinsically interpretable, and easy to visualize as bar charts or line graphs. Thus, it is a good candidate for the basis of our IM. Another advantage of MSC is that its choice of anchors is completely data-driven and requires no human intervention, unlike[46–47].
3.2.1. Proposed Strategy for Explaining Black-box ML Models Globally Using MSC. MSC (full algorithm in[66]) assumes that each patient/entity is described by a set of feature vectors (instead of one). It begins by using a clustering technique to select a subset of feature vectors called anchors, which act as the base concepts/patterns and are representative of the dataset. Every entity/patient and every class profile is then defined as a histogram in terms of these anchors, as follows:
Entity as a collection of concepts
Suppose the set of all feature vectors describing a disease perfectly is called the concept set, denoted by \(Q = \{q_{1}, \ldots, q_{z}, \ldots, q_{|Q|}\}\), where \(q_{z} \in \mathbb{R}^{d}\) and \(d > 0\). Then \(E_{j}\), the jth entity in a given dataset, is a member of the power set of Q (excluding the empty set), denoted as \(E_{j} \in \mathrm{PowerSet}(Q) \setminus \{\emptyset\}\), where \(\emptyset\) is the empty set.
Base concepts or anchors
The anchors are defined as a subset of Q, denoted by \(\widehat{Q} = \{\hat{q}_{i}\}\), where \(\hat{q}_{i}\) is the ith anchor. The number of anchors (i.e., \(|\widehat{Q}|\)) is determined by the user, following the restriction that \(0 < |\widehat{Q}| \ll |Q|\). The anchors are chosen using a clustering technique. Usually, instead of using the entire dataset, the anchors are chosen from a representative initial sample.
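One way to pick anchors that are themselves members of the initial sample is k-means++ seeding. The sketch below assumes that the anchors are the seed points (no subsequent centroid refinement), which may differ from the procedure in[66]. The function name `kmeanspp_anchors` and the `seed` parameter are hypothetical.

```python
import math
import random


def kmeanspp_anchors(sample, k, seed=0):
    """Pick k anchors from `sample` (a list of numeric tuples) by k-means++ seeding."""
    if not 0 < k <= len(sample):
        raise ValueError("k must be between 1 and the sample size")
    rng = random.Random(seed)
    anchors = [rng.choice(sample)]  # first anchor: uniform at random
    while len(anchors) < k:
        # Squared Euclidean distance from each vector to its nearest anchor.
        d2 = [min(math.dist(v, a) ** 2 for a in anchors) for v in sample]
        if sum(d2) == 0:
            raise ValueError("sample has fewer than k distinct vectors")
        # Vectors far from every current anchor are more likely to be picked;
        # vectors already chosen have weight 0 and cannot be picked again.
        anchors.append(rng.choices(sample, weights=d2, k=1)[0])
    return anchors
```

Because the weights favor vectors far from the current anchors, the chosen anchors tend to spread over the sample rather than cluster in its densest region.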
Fingerprints
The fingerprint of \(E_{j}\), which has a class label k, is a histogram over \(\widehat{Q}\) and is written as:
\({}^{k}fp_{j} = \{{}^{k}p_{ji}\}\), where \({}^{k}p_{ji}\) is the proportion of \(\hat{q}_{i}\) present in \(E_{j}\).
Class profiles
If class k is restricted to having only one profile, then that single profile is the average of all the fingerprints belonging to class k; in our case, k = 1 for a sepsis patient and k = 0 otherwise. However, since MSC allows one class to be described by one or more profiles, the set of profiles for the kth class is given by \(C_{k} = \{{}^{r}c_{k}\}\), where r > 0 indexes the profiles in \(C_{k}\). Thus, the rth profile of class k is the average of a subset of fingerprints belonging to class k, and is denoted by \({}^{r}c_{k} = \{{}^{r}cp_{ki}\}\), where \({}^{r}cp_{ki}\) is the average of \({}^{k}p_{ji}\) over a subset of all \(E_{j}\) in class k.
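The fingerprint and the single-profile case can be illustrated with a short batch computation. This is a sketch under stated assumptions rather than the algorithm of[66]: each feature vector is assigned to its nearest anchor by Euclidean distance, and the function names are hypothetical.

```python
import math


def nearest_anchor(vector, anchors):
    """Index of the anchor closest to `vector` (Euclidean distance)."""
    return min(range(len(anchors)), key=lambda i: math.dist(vector, anchors[i]))


def fingerprint(entity, anchors):
    """Histogram over the anchors: proportion of the entity's vectors matched to each."""
    counts = [0] * len(anchors)
    for vector in entity:
        counts[nearest_anchor(vector, anchors)] += 1
    return [count / len(entity) for count in counts]


def single_class_profile(fingerprints):
    """Single profile of a class: element-wise average of its fingerprints."""
    n = len(fingerprints)
    return [sum(column) / n for column in zip(*fingerprints)]
```

With 10 anchors, every patient and every class profile is a vector of 10 proportions, which is what allows them to be plotted as bar charts.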
Building Class Profiles During Training Phase
The user determines the number of profiles required to describe each class. The initial sample is used to initialize the class profiles. Then, during the training phase, the MSC algorithm executes the following steps to update the class profiles with respect to the feature vectors in each class. During any particular training iteration:
i. A feature vector, maintained in temporal order, is fetched from the dataset. Suppose it belongs to class k (for example k = 0) and patient \(E_{j}\).
ii. The algorithm then determines which anchor is most similar to this feature vector. Suppose the best match is \(\hat{q}_{m}\); then the fingerprint \({}^{k}fp_{j}\) (or \({}^{0}fp_{j}\)) is updated as follows:
\(f = ({}^{k}p_{jm} \cdot n_{j}) + 1\)
\(n_{j} = n_{j} + 1\)
\({}^{k}p_{jm} = f / n_{j}\),
where \(n_{j}\) is the number of feature vectors seen so far by the algorithm for entity \(E_{j}\).
iii. The updated fingerprint is then compared only with the profiles in \(C_{k}\) (or \(C_{0}\)) to find the closest match. If \({}^{r}c_{k}\) (or \({}^{r}c_{0}\)) is the best match, then it is updated as follows:
\(c = (n'_{r} \cdot {}^{r}c_{k}) + {}^{k}fp_{j}\)
\(n'_{r} = n'_{r} + 1\)
\({}^{r}c_{k} = c / n'_{r}\),
where \(n'_{r}\) is the number of fingerprints seen so far by the algorithm for profile \({}^{r}c_{k}\).
iv. The algorithm fetches a new fingerprint and repeats.
v. Once the class profiles are obtained, testing is performed by comparing the testing fingerprints to the mature class profiles and recording whether the best-matched profile belongs to \(C_{k=0}\) or \(C_{k=1}\).
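The running-average updates in steps ii and iii and the test rule in step v can be sketched as follows. This is not the reference implementation of[66]. Two points are assumptions: the bins other than the matched anchor are rescaled by \(n_{j}/(n_{j}+1)\) so that the fingerprint remains a histogram of proportions, and the Euclidean distance is used to find the closest profile. All function names are hypothetical.

```python
import math


def update_fingerprint(fp, n_j, m):
    """Step ii: fold one more feature vector, matched to anchor m, into fp."""
    new_n = n_j + 1
    # Assumption: unmatched bins are rescaled so the proportions still sum to 1.
    new_fp = [p * n_j / new_n for p in fp]
    new_fp[m] = (fp[m] * n_j + 1) / new_n  # f = p_jm * n_j + 1, then f / n_j
    return new_fp, new_n


def closest_profile(fp, profiles):
    """Index of the profile nearest to fp (Euclidean distance)."""
    return min(range(len(profiles)), key=lambda r: math.dist(fp, profiles[r]))


def update_profile(profile, n_r, fp):
    """Step iii: running average of the fingerprints assigned to this profile."""
    new_n = n_r + 1
    return [(n_r * c + p) / new_n for c, p in zip(profile, fp)], new_n


def classify(fp, class_profiles):
    """Step v: label of the class that owns the best-matched profile.

    `class_profiles` maps a class label to its list of profiles.
    """
    return min(
        class_profiles,
        key=lambda k: min(math.dist(fp, c) for c in class_profiles[k]),
    )
```

During training, `closest_profile` is called only on the profiles of the class given by the training label, which is what keeps feature vectors from moving between classes.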
Thus, the outputs of the MSC algorithm are labels and class profiles, where each class profile can be a set of subprofiles. It is important to note that while MSC allows feature vectors to change subprofile membership within a particular class \(C_{k}\), it does not allow them to jump between different classes, e.g., \(C_{k=0}\) and \(C_{k=1}\). Thus, traditionally, each class is described by a set of profiles based on information from the ground truth labels, because that is how the membership of feature vectors is determined during training.
As a result, if we were to randomly change the ground truth for each feature vector during the training phase, the resulting class profiles would be quite different from the ground truth class profiles. Similarly, we can theoretically ask an “oracle” to provide us with class labels, because either the ground truth is not accessible or the “oracle” sees unexplainable, hidden patterns. These labels may (or may not) be different from the ground truth. The class profiles produced by MSC will then reflect the patterns seen by the “oracle”. Thus, if a black-box ML model is the “oracle”, i.e., the model’s predictions (instead of the ground truth) are used when training MSC, the resulting class profiles will approximate what the black-box ML model sees with regard to the anchors chosen by the MSC algorithm.
3.2.2. Evaluation of the IM. A hybrid model refers to the IM that was created by training MSC with labels from the LSTM/RF as the oracle. Thus, the explanation from the MSC-LSTM hybrid model can be taken as an approximation of what the LSTM sees, while the MSC-CSE hybrid model is used for comparison purposes.
Even though the hybrid models are not the actual predictive model for sepsis, we still need to take their accuracy and fidelity into account because these values provide insight into how well the interpretable hybrid models can explain the black-box models. The fidelity[33] of the MSC hybrid models is calculated by recording the consensus in classification between the LSTM/RF model(s) and their respective hybrids. Fidelity is calculated in the same way as accuracy, but instead of using the ground truth, we use the labels from the oracle.
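Since fidelity and accuracy differ only in the reference labels, both can be expressed with one agreement function. The sketch below is illustrative; the function name `agreement` and the example labels are hypothetical.

```python
def agreement(predicted, reference):
    """Fraction of records on which two label sequences agree."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical labels for four records.
hybrid_pred = [1, 0, 1, 1]    # predictions of the MSC hybrid
oracle_labels = [1, 0, 0, 1]  # predictions of the black-box model (oracle)
ground_truth = [1, 1, 0, 1]   # clinical labels

fidelity = agreement(hybrid_pred, oracle_labels)  # hybrid versus oracle
accuracy = agreement(hybrid_pred, ground_truth)   # hybrid versus ground truth
```

A hybrid can therefore have high fidelity and low accuracy at the same time, which would indicate that it reproduces the oracle faithfully, including the oracle's mistakes.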
In addition, we compare the sepsis profiles obtained from the hybrid models to identify what features the LSTM and RF models are looking at for sepsis classification. We present all our results as comparisons between:

MSC-LSTM versus MSC-CSE model (profile comparison to obtain what the LSTM learnt)

MSC-CSE versus CSE (feature overlap, including clinical and literature features)

MSC-LSTM versus CSE (feature overlap, including clinical and literature features)
3.2.3. Implementation details. The LSTM model was set up in MATLAB with one LSTM layer of 100 hidden units, followed by a fully connected layer with a softmax activation function. For this experiment, we use the LSTM as the black-box model that is globally explained by creating a hybrid using its predicted labels to train MSC (as described in Sections 3.2.1 and 3.2.2). The MSC is also trained using the ground-truth labels and the labels from the CSE (to get the benchmark hybrid model). All three models are evaluated using i) accuracy, ii) specificity, iii) sensitivity, iv) precision, v) F1 score, vi) Matthews correlation coefficient, and vii) the fidelity metric.