Importance: Despite sex and race disparities in the symptom presentation, diagnosis, and management of acute coronary syndrome (ACS), these differences have not been investigated in the development and validation of machine learning (ML) models using individualized patient information from electronic health records (EHRs) to diagnose ACS.
Objective: To evaluate ML-based ACS diagnosis performance across different subpopulations in a multi-site emergency department (ED) setting and to determine how bias mitigation techniques influence ML performance.
Design, Setting, and Participants: This retrospective observational study included data from 2,334,316 ED patients (aged >18 years) from January 2007 to June 2020.
Exposure: Logistic regression (LR) and neural network (NN) models were assessed in ED encounters grouped by sex, race, presence or absence of chest pain, EHR data quality, and timeliness of several key ED procedures. Prejudice regularization, reweighting, and within-subpopulation training were evaluated for bias mitigation.
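To make the reweighting idea concrete, the sketch below computes per-sample weights that equalize the weighted outcome rate across subgroups. The abstract does not specify which reweighting scheme was used; this example assumes the common "reweighing" formulation, weight = P(group) × P(label) / P(group, label), and the function name `reweighing_weights` and the toy data are hypothetical.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Return one weight per sample: P(group) * P(label) / P(group, label).

    Assumption: this is one common reweighting scheme, not necessarily
    the one used in the study.
    """
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        group_counts[g] * label_counts[y] / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


# Toy, made-up data: subgroup membership and a binary ACS label per encounter.
groups = ["A", "A", "A", "B", "B", "B", "B", "B"]
labels = [1, 0, 0, 1, 1, 1, 0, 0]
weights = reweighing_weights(groups, labels)


def weighted_positive_rate(group):
    """Weighted share of positive labels within one subgroup."""
    total = sum(w for w, g in zip(weights, groups) if g == group)
    positive = sum(w for w, g, y in zip(weights, groups, labels) if g == group and y == 1)
    return positive / total
```

Such weights would typically be passed to a model's training routine as sample weights, so that under-represented group and outcome combinations count more during fitting.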
Main Outcomes/Measures: Metrics including area under the receiver operating characteristic curve (AUROC) were used to assess performance.
Results: We analyzed 4,268,165 ED visits in which patient demographics by race were 67.40% White, 19.20% Black, 2.40% Asian, and 11.00% Other or Unknown. Patient composition was 54.80% female and 45.20% male. Both models’ AUROCs were significantly higher in White vs. Black patients (z-score = 3.23 for LR and 4.26 for NN; P < 0.0006), in males vs. females (z-score = 3.81 for LR and 4.16 for NN; P < 0.0001), and in the subpopulation without chest pain vs. the subpopulation with chest pain (z-score = 13.32 for LR and 17.70 for NN; P < 0.0001). Prejudice regularization and reweighting techniques did not reduce biases. Training in race-specific and sex-specific populations also did not yield statistically significant improvements in ML algorithm performance. Chest pain-specific training led to significantly improved AUROC.
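As an illustration of how a z-score comparing AUROCs between two non-overlapping subpopulations can be obtained, the sketch below uses the Hanley-McNeil standard error approximation and a two-sided normal test. The abstract does not state which test produced the reported z-scores, so this is one plausible approach under that assumption; all function names are hypothetical.

```python
import math


def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5).

    Uses a simple O(n_pos * n_neg) pairwise count, suitable only for small samples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg)), len(pos), len(neg)


def hanley_mcneil_se(a, n_pos, n_neg):
    """Approximate standard error of an AUROC (Hanley and McNeil, 1982)."""
    q1 = a / (2 - a)
    q2 = 2 * a * a / (1 + a)
    variance = (
        a * (1 - a)
        + (n_pos - 1) * (q1 - a * a)
        + (n_neg - 1) * (q2 - a * a)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def compare_auroc(a1, n_pos1, n_neg1, a2, n_pos2, n_neg2):
    """z-score and two-sided P value for two AUROCs from independent groups."""
    se1 = hanley_mcneil_se(a1, n_pos1, n_neg1)
    se2 = hanley_mcneil_se(a2, n_pos2, n_neg2)
    z = (a1 - a2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p
```

Because the subgroups compared here (for example, male vs. female patients) do not share encounters, the two AUROC estimates are treated as independent; a paired method such as DeLong's test would be needed for two models scored on the same encounters.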
Conclusion: EHR-derived ML models trained and tested within similar demographic subpopulations and symptom groups may perform better than ML models that are trained in random populations, and provide less biased clinical decision support for ACS diagnosis.