Selective classification with machine learning uncertainty estimates improves ACS prediction: A retrospective study in the prehospital setting

Accurate identification of acute coronary syndrome (ACS) in the prehospital setting is important for timely treatments that reduce damage to the compromised myocardium. Current machine learning approaches lack sufficient performance to safely rule in or rule out ACS. Our goal is to identify a method that bridges this gap. To do so, we retrospectively evaluate two promising approaches, an ensemble of gradient boosted decision trees (GBDT) and selective classification (SC), on consecutive patients transported by ambulance to the ED with chest pain and/or anginal equivalents. On the task of ACS classification with 23 prehospital covariates, we found the fusion of the two (GBDT+SC) improves the best reported sensitivity and specificity by 8% and 23%, respectively. Accordingly, GBDT+SC is safer than current machine learning approaches for the rule-in and rule-out of ACS in the prehospital setting.


Introduction
Accurate identification of acute coronary syndrome (ACS) in the prehospital setting is important for timely treatments that reduce damage to the compromised myocardium. Accordingly, the community has developed Machine Learning (ML) methods to improve the prediction of ACS with prehospital covariates 1,2. Nevertheless, performance remains insufficient for safe rule-out or rule-in of ACS 3. Current research in cardiovascular disease detection from ECG has observed a possible trade-off between performance and coverage (i.e. the percentage of cases to automatically classify) as a viable way to mitigate errors 4-6. This trade-off is known as selective classification 7,8, and it provides more accurate predictions by identifying a subpopulation better suited for automatic classification 4,5,9. In our work, we evaluate a cutoff in the predictive uncertainty of an ensemble of gradient boosted decision trees (GBDT 10) as the filter for selective classification. We observe an 8% increase in sensitivity and a 23% increase in specificity, at the cost of 25% coverage. More concretely, our contributions are: 1. Identification of a ML model that achieves the best reported ACS prediction performance in the prehospital setting [see tables 1, 2, 4 and 5]. 2. Empirical evidence that selective classification boosts performance at the expense of 25% coverage in the prehospital setting [see tables 2 and 4].

Patient characteristics and outcomes
Data were collected by Zègre-Hemsey and colleagues 11. Enrolled patients (n=3646) were over 21 years old and transported by ambulance to the ED with non-traumatic chest pain and/or anginal equivalents. Emergency healthcare personnel collected clinical information in the ambulance (i.e. the prehospital setting). The primary outcome recorded any ACS event (i.e. the acute manifestation of coronary heart disease, including ST-elevation myocardial infarction (STEMI), non-ST elevation myocardial infarction (NSTEMI), and unstable angina (UA)) within 30 days post ED admission. The observed prevalence was ACS (20%), STEMI (14%), NSTE-ACS (7%) and unstable angina (3%).

Dataset derivation and preparation
We divide the dataset into two cohorts: an internal cohort (n=1756 cases before 06/2016) for training and validation, and an external cohort (n=1127 cases after 06/2016) for testing. Furthermore, we select 23 covariates (see table 1) commonly associated with ACS 12 and available in the prehospital setting 13. We discarded patients with a missing initial troponin value (25 total) or without an ECG date; fewer than 2% of patients had missing covariates, which were imputed with a constant 14.

Note {θ^(m)}_{m=1}^{M} corresponds to an ensemble of M Gradient Boosted Trees (GBT) parametrized by θ^(m). Each θ^(m) is sampled i.i.d. from an approximate distribution q(θ), which converges weakly to a posterior distribution p(θ|D). p(Y|X, θ^(m)) corresponds to the output of a GBT (i.e. a function of X that outputs a distribution over labels Y). Note the GBT algorithm is not the standard one; it is modified to guarantee weak convergence. Please see Section 3 of the original paper 10 for more details. It is worth mentioning that total uncertainty decomposes into two non-negative quantities, model uncertainty and data uncertainty 15:

H(Y|X, D) = I(Y, θ|X, D) + E_{q(θ)}[H(Y|X, θ)].     (3)
Intuitively, model uncertainty (i.e. I(Y, θ|X, D)) is high when the input (x) is sufficiently different from our training set D (e.g. cases with age < 5). On the other hand, data uncertainty is high when the input is inherently random (e.g. cases with no ST-elevation). I(·, ·) denotes mutual information.
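To make the decomposition concrete, both components can be estimated from an ensemble's per-member class probabilities p(Y|X, θ^(m)). The following is a minimal numpy sketch, not the authors' implementation; the function name and the toy probability values are illustrative only.

```python
import numpy as np

def uncertainty_decomposition(member_probs):
    """Split predictive uncertainty for one input X into its two parts.

    member_probs: shape (M, K) array; row m is p(Y | X, theta_m), the
    class distribution predicted by ensemble member m.
    Returns (total, data, model) uncertainty in nats.
    """
    p = np.asarray(member_probs, dtype=float)
    mean_p = p.mean(axis=0)                                  # posterior predictive p(Y | X, D)
    total = -np.sum(mean_p * np.log(mean_p + 1e-12))         # H(Y | X, D)
    data = -np.mean(np.sum(p * np.log(p + 1e-12), axis=1))   # E_q[H(Y | X, theta)]
    model = total - data                                     # I(Y, theta | X, D)
    return total, data, model

# Members that agree: little model uncertainty, mostly data uncertainty.
t, d, m = uncertainty_decomposition([[0.90, 0.10], [0.88, 0.12], [0.92, 0.08]])
# Members that disagree: model uncertainty dominates (input unlike training data).
t2, d2, m2 = uncertainty_decomposition([[0.95, 0.05], [0.05, 0.95]])
```

Both components are non-negative and sum to the total uncertainty.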

Selective classification
Selective classification (SC) 7,8 filters out cases at test time with the goal of improving predictive performance over the filtered-in subpopulation. In this work, our filter rule is "total uncertainty greater than cutoff value" (i.e. H(Y|X, D) > cutoff). We use the validation split (D_val), disjoint from D, to determine a total uncertainty cutoff such that 80% of the cases in D_val have smaller total uncertainty. This corresponds to the 0.8 quantile of {H(Y|X, D) : (X, Y) ∈ D_val}. We deemed 80% the most appropriate coverage to remain clinically useful. However, coverage could be further traded for performance with smaller cutoff values.
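Concretely, the cutoff is the 0.8 quantile of validation total uncertainties, and test cases above it are filtered out. A minimal numpy sketch under these definitions (the function names are ours, not the study's):

```python
import numpy as np

def fit_cutoff(val_uncertainty, coverage=0.8):
    # Cutoff such that `coverage` of validation cases have smaller total
    # uncertainty: the coverage-quantile of {H(Y|X, D) : (X, Y) in D_val}.
    return np.quantile(np.asarray(val_uncertainty, dtype=float), coverage)

def filtered_in(test_uncertainty, cutoff):
    # Boolean mask of cases kept for automatic classification.
    return np.asarray(test_uncertainty, dtype=float) <= cutoff

val_u = np.array([0.05, 0.10, 0.20, 0.30, 0.65])   # toy validation uncertainties
cutoff = fit_cutoff(val_u, coverage=0.8)           # 0.8 quantile of the validation split
accepted = filtered_in(np.array([0.02, 0.50, 0.25]), cutoff)
```

Smaller cutoff values trade coverage for performance, as noted above.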

Classification performance metrics and estimation
Classification performance is measured in terms of: coverage, area under the receiver operating characteristic curve (AUROC), accuracy (ACC), positive predictive value (PPV), negative predictive value (NPV), sensitivity and specificity. These metrics were estimated by 5-fold stratified cross-validation in the internal cohort. More concretely, for each fold: the corresponding training set is used to estimate the model and the selective classification cutoff; the corresponding test set is used to estimate internal cohort performance (table 3); and the entire external cohort is used to estimate external cohort performance (table 2). This yields a total of 5 samples of performance. For each metric, we report the mean (µ) and two times the standard error (2σ). For reference, we also report the prevalence of ACS in the test data, as this affects PPV and NPV.
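The per-metric summary can be reproduced from the five fold-level samples. A small sketch, assuming σ here denotes the standard error of the mean as stated in the text; the fold values below are made up for illustration, not results from the study:

```python
import numpy as np

def summarize_folds(metric_samples):
    """Summarize a metric over k cross-validation folds as (mu, 2*sigma),
    where sigma is the standard error of the mean across folds."""
    x = np.asarray(metric_samples, dtype=float)
    mu = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))   # standard error of the mean
    return mu, 2.0 * se

# e.g. sensitivity measured on each of the 5 folds (illustrative values)
mu, two_se = summarize_folds([0.80, 0.82, 0.78, 0.81, 0.79])
```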

Classification performance of ACS
The label for this task is either presence or absence of ACS. ACS is the acute manifestation of coronary heart disease and includes ST-elevation myocardial infarction (STEMI), non-ST elevation myocardial infarction (NSTEMI), and unstable angina (UA). Table 2 compares the ACS predictive performance of GBDT 10, GBDT+SC, and the reported performance from alternative methods 1,2. GBDT provides better predictive performance, as noted by a 24% improvement in sensitivity and a 13% improvement in specificity. The rest of the metrics follow suit, with only PPV as the exception. The reason for the exception is that PPV can be arbitrarily high due to prevalence: even though the (Takeda, 2022) 2 discriminator is worse, their higher ACS prevalence masks this in the PPV. With respect to work with similar prevalence, such as (Al-Zaiti, 2020) 1, our PPV is considerably better. Selective classification (SC) further improves performance (see first row in table 2) by filtering out uncertain cases (i.e. H(Y|X, D) > cutoff). For the filtered-in subpopulation of the external cohort, sensitivity and specificity improve by 4% and 10% points respectively, creating a considerable difference with respect to (Takeda, 2022) and (Al-Zaiti, 2020). Table 3 showcases slightly better performance for the internal cohort. This is expected, as the model and cutoff are estimated from this cohort. Nevertheless, performance is similar to that on the external cohort. This result suggests machine learning uncertainty estimates correlate with predictive performance, and that constraining predictions to a subset of patients may increase confidence in the model's predictions.

Classification performance of NSTE-ACS
The label for this task is either presence or absence of ACS derived from NSTE-ACS. Like (Al-Zaiti, 2020) 1, we consider NSTE-ACS as the presence of non-ST elevation MI or unstable angina. Table 4 compares the NSTE-ACS predictive performance of GBDT 10 and the reported performance in (Al-Zaiti, 2020) 1; (Takeda, 2022) 2 did not report NSTE-ACS performance due to low prevalence (3.2%). For reference, we also include the prevalence of NSTE-ACS in the test samples, as this inflates or deflates certain metrics (e.g. PPV and NPV). GBDT improves sensitivity and specificity by 14% and 7%, respectively. As in the ACS task, selective classification further improves performance by reducing coverage to 80%. For this subpopulation of the test set, average sensitivity and average specificity improve by 2% and 8% points, respectively. These results reinforce the notion that machine-learned uncertainty estimates correlate with predictive performance, and that constraining predictions to a subset of patients may increase confidence in the model's predictions. This leads us to suggest GBDT for this task only when prevalence is close to 7%; note this was not the case for the dataset used in (Al-Zaiti, 2020) 1.

What impact do input covariates have on performance?
The more covariates we consider, the more performance improves. We ablate the impact that different sources of data have on classification (see table 5). Baseline corresponds to age, sex and ECG interpretations; Baseline + Symptoms corresponds to all baseline covariates plus the symptom covariates in table 1; Baseline + Symptoms + Medical History corresponds to all baseline, symptom and medical history covariates in table 1. As expected, performance increases with the number of covariates. Notably, we observe a larger increase in sensitivity when we include medical history.

Does total uncertainty correlate with performance?
Figure 1 (red line with squares) suggests a positive correlation between the average performance of GBDT (y-axis) and the percentage of uncertain samples excluded (x-axis).

Does total uncertainty correlate with performance of other ACS classifiers?
Figure 1 suggests the AUROC performance of unrelated methods (HEAR and HEART) correlates with the percentage of samples excluded. Samples are excluded using the predictive uncertainty estimated from GBDT (Eq. 1): the larger the percentage of uncertain samples excluded, the higher the performance we expect. This assesses whether the excluded samples are also deemed uncertain by other predictors. It is surprising that this is the case for both the HEAR and HEART scores. This is important because we may use GBDT for patient selection and a different method for classification. For instance, the benefit of choosing HEAR or HEART as the classifier is that their predictions are explainable, a valuable feature for healthcare providers.
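The curve in Figure 1 can be emulated for any classifier: rank cases by GBDT total uncertainty, exclude the most uncertain fraction, and score the rest. A hypothetical numpy sketch (the rank-based AUROC and all names are ours; it assumes both classes survive each exclusion step):

```python
import numpy as np

def auroc(y_true, scores):
    # Rank-based AUROC: probability a random positive outscores a random
    # negative, counting ties as one half.
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def auroc_vs_exclusion(y_true, clf_scores, gbdt_uncertainty, fractions):
    """AUROC of any classifier's scores after excluding the most uncertain
    fraction of cases according to GBDT total uncertainty (Eq. 1)."""
    y = np.asarray(y_true)
    s = np.asarray(clf_scores, dtype=float)
    u = np.asarray(gbdt_uncertainty, dtype=float)
    curve = []
    for f in fractions:
        keep = u <= np.quantile(u, 1.0 - f)   # drop the top-f uncertain cases
        curve.append(auroc(y[keep], s[keep]))
    return curve

# Toy example: excluding the two most uncertain cases perfects the ranking.
y = np.array([1, 1, 0, 0])
scores = np.array([0.9, 0.4, 0.6, 0.1])
unc = np.array([0.1, 0.9, 0.8, 0.2])
curve = auroc_vs_exclusion(y, scores, unc, fractions=(0.0, 0.5))
```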

Does GBDT outperform other uncertainty quantification methods on ACS prediction?
We repeated the ACS classification experiment with two other popular approaches for uncertainty quantification, Deep Ensembles 16 and MCDropout 17 (see table 6).

Does GBDT outperform other prediction methods?
It depends (see table 7). If we are interested in rule-out performance (i.e. sensitivity and NPV), the answer is yes. If we are interested in rule-in performance (i.e. specificity and PPV), then XGBoost+SC is superior. Since rule-out performance is more desirable than rule-in performance, and GBDT is designed for uncertainty quantification, we lean towards GBDT over XGBoost. For these experiments we repeated the ACS classification experiment with XGBoost, and its predictive entropy (i.e. H(Y|X)) was used for selective classification. Note hyperparameter grid search was used for all methods.
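For a single model such as XGBoost there is no ensemble to average, so selective classification falls back on the predictive entropy H(Y|X) of its class probabilities; unlike the ensemble's H(Y|X, D), this quantity cannot separate model from data uncertainty. A minimal sketch (illustrative, not the study's code):

```python
import numpy as np

def predictive_entropy(probs):
    # H(Y|X) in nats for each row of an (N, K) array of class probabilities,
    # e.g. the output of an XGBoost classifier's predict_proba.
    p = np.asarray(probs, dtype=float)
    return -np.sum(p * np.log(p + 1e-12), axis=1)

H = predictive_entropy([[0.50, 0.50], [0.99, 0.01]])  # uncertain vs confident case
cutoff = np.quantile(H, 0.8)      # in practice, fit the cutoff on a validation split
accepted = H <= cutoff            # filtered-in cases for automatic classification
```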

Discussion
In this study we measured the ACS and NSTE-ACS classification performance of GBDT and GBDT+SC. Results show that both methods achieve the best sensitivity and specificity reported for the prehospital setting. This is important because any method that improves on these metrics is a better candidate to aid the early rule-out or rule-in of ACS. Compared to previously reported results 1,2, GBDT 10 is a better ML algorithm to rule out ACS, with a 3% improvement in NPV and a 13% improvement in sensitivity. Selective classification (SC) further improves both rule-in and rule-out performance, with a 23% improvement in specificity, a 7% improvement in PPV, a 17% improvement in sensitivity and a 5% improvement in NPV. Nevertheless, selective classification introduces a tradeoff: on average, coverage (i.e. the percentage of filtered-in test cases) reduces from 100% to 75% (see table 2).

Figure 1. AUROC performance for three different methods (GBDT, HEAR 13, HEART 12) as we exclude more uncertain cases. Performance is computed with the non-excluded cases. Uncertainty is the predictive uncertainty from GBDT (Eq. 1). Highlighted is the mean and shaded is two standard deviations from 5-fold cross-validation. For this experiment, GBDT uses fewer covariates than those in table 1, to match the HEAR 13 covariates. Traditional HEART 12 requires troponin in addition to the HEAR covariates; a troponin measurement is generally unavailable in the prehospital setting.

For the task of NSTE-ACS, the performance narrative is similar to the ACS case. GBDT provides the best rule-out performance with a 14% increase in sensitivity and a 5% increase in NPV. Selective classification further improves both rule-in and rule-out performance, with an increase of 16% in sensitivity, 5% in NPV, 15% in specificity and 6% in PPV. NSTE-ACS is important because it represents patients without ST-elevation, a naturally ambiguous class of patients difficult to triage from ECG alone (see table 4). Methodologically, the main difference with respect to previous work is the ML model used for prediction. GBDT 10 is designed for predictive uncertainty quantification, whereas previous methods 1,2 are designed for predictive accuracy. This difference in design permits more elaborate decision making through the identification of the uncertainty source (see equation 3). Furthermore, we observe GBDT has better rule-out performance than the best previously found predictor 2 (see table 7). In regards to input covariates, like previous work 2, we consider symptoms, an interpretation of the ECG and age in our prediction of ACS. However, we did not consider vital signs. We conjecture the addition of vital signs would improve performance, as symptoms and history did in table 5. Unlike other works 1, our methodology requires EMS personnel to interpret the ECG and determine the presence/absence of three conditions (see table 1). However, given how blackbox predictors are prone to random errors 18 and overconfidence 19, we argue EMS personnel should interpret the patient's ECG, especially when rule-in and rule-out performance is insufficient. With respect to leveraging uncertainty in cardiovascular disease prediction outside the prehospital setting 4-6,20-23, we also observed a positive correlation between selective classification and performance 4-6,23.
However, the deep learning methods 16,17 employed in most of these studies 4,5,20,21 are outperformed by GBDT on this task (see table 6) and have a more complex implementation. Additionally, we reemphasize that deep learning models are generally unpredictable under imperceptible or irrelevant changes to the input signal 18,24.
This study was approved by the institutional review board of the University of North Carolina at Chapel Hill, and all relevant ethical regulations on human experiments, including the Declaration of Helsinki, were followed. Data were collected through a healthcare registry, and all consecutive eligible patients were enrolled under a waiver of informed consent approved by the same institutional review board.

Table 1 .
Statistics of covariates used as input to the machine learning model GBDT 11. Statistics are calculated separately for the internal and external cohorts. For the ECG interpretations, the type {0, 1}^11 indicates a binary vector; each position corresponds to the ECG lead used for the interpretation.

Ensemble of Gradient Boosted Decision Trees (GBDT)
(Malinin et al., 2021) 10 first proposed GBDT to classify tabular data and improve predictive uncertainty estimates. Accordingly, we chose this method because our data is tabular (see table 1) and we use estimates of predictive uncertainty to filter out patients unsuitable for automatic classification. In this work, the estimate of uncertainty used to make the classification is known as "total uncertainty" 15, and corresponds to the entropy of the posterior predictive distribution H(Y|X, D), where X is the test input covariates, Y is the unknown outcome (i.e. {ACS, ¬ACS}), D is our training split and H(·) is the entropy function. Total uncertainty is estimated by a Monte-Carlo approximation 10:

H(Y|X, D) ≈ H( (1/M) Σ_{m=1}^{M} p(Y|X, θ^(m)) ).     (1)

Table 2 .
ACS classification performance on the external cohort. Reported is µ ± 2σ where the samples come from 5-fold stratified cross-validation. For (Al-Zaiti) 1 and (Takeda) 2, the results presented are their reported results.

Table 3 .
ACS classification performance in the internal cohort. Reported is µ ± 2σ where the samples come from 5-fold cross-validation.

Table 5 .
ACS classification performance for different input covariates: Baseline (i.e. ECG interpretations, age and sex); Baseline and Symptoms; Baseline, Symptoms and Medical History. Reported is µ ± 2σ where the samples come from 5-fold cross-validation.
Table 6.
ACS classification performance on the external cohort. Reported is µ ± 2σ where the samples come from 5-fold stratified cross-validation on the training set. As expected, since excluding uncertain samples should mitigate errors, results in table 6 suggest GBDT performs best. Posterior predictive entropy (i.e. H(Y|X, D), or total uncertainty) was used for selective classification across all methods. Note hyperparameter grid search was used for all methods.

Table 7 .
ACS classification performance on the external cohort. Reported is µ ± 2σ where the samples come from 5-fold stratified cross-validation on the training set. XGBoost is the predictor with the best reported performance in previous work 2.