Medical imaging algorithms exacerbate biases in underdiagnosis

Artificial intelligence (AI) systems have increasingly achieved expert-level performance, particularly in medical imaging (1). However, there is growing concern that AI systems will reflect and amplify human bias against under-served subpopulations (2-7). Such biases are especially troubling in the context of underdiagnosis: if AI systems falsely predict that patients are healthy, those patients will be denied care when they need it most. This use case is particularly relevant given existing health disparities, where high underdiagnosis rates for under-served subgroups are well documented (8-11). Although bias in underdiagnosis can delay access to medical treatment unequally, underdiagnosis due to AI has been relatively unexplored. In this work we examine algorithmic underdiagnosis in chest X-ray pathology classifiers and find that classifiers consistently and selectively underdiagnose under-served patients, actively amplifying the existing biases in clinical care. These effects are worse for intersectional subpopulations, e.g., Black female patients, and persist across three large chest X-ray datasets as well as a multi-source dataset combining all three. Our work demonstrates that deploying AI systems risks exacerbating biases present in current care practices. Developers, clinical staff, and regulators must address the serious ethical concerns of -- and barriers to -- effective deployment of these models in the clinic.

While there is much work on algorithmic bias (14) and bias in health (2-11), the topic of AI-driven underdiagnosis has been relatively unexplored. Crucially, underdiagnosis -- defined as falsely claiming the patient is healthy -- leads to no clinical treatment when a patient needs it most. The existing clinical landscape demonstrates biases in underdiagnosis against under-served subpopulations. For example, Black patients with chronic obstructive pulmonary disease are underdiagnosed more often than non-Hispanic White patients (9); in some clinical applications, risk score thresholds are adjusted by race, which is potentially harmful for Black patients (8); and the quality of care delivered to patients within the same hospital varies by patient insurance type (11). Such biases can manifest algorithmically, e.g., females receiving a longer time to diagnosis than males with the same medical conditions (10). In medical imaging specifically, algorithmic bias has also been reported.

In this work, we perform a systematic study of underdiagnosis bias in a chest X-ray prediction model across three large public radiology datasets, MIMIC-CXR (CXR) (17), CheXpert (CXP) (18), and ChestX-ray8 (NIH) (19), as well as a multi-source dataset combining all three on 10 shared diseases. Motivated by known differences in disease manifestation by patient sex (6), age (20), and race (8), and the effect of insurance type on the quality of received care (11), we report results across all of these factors. We also use insurance type as an imperfect proxy for socioeconomic status, e.g., patients with Medicaid insurance are often low income.
We find that algorithms trained in all settings exhibit systemic underdiagnosis biases against under-served subpopulations, including female patients, Black and Hispanic patients, younger patients, and patients of lower socioeconomic status (those with Medicaid insurance). Further, we show that our observations are not consistent with an increase in overall noise for these subgroups, but instead reflect a specific increase in underdiagnosis alone. We find these effects persist for intersectional subgroups (e.g., Black female patients), and are not consistently worse in the smallest intersectional groups.

Methodology
We train distinct chest X-ray diagnosis models in four settings: the MIMIC-CXR dataset (CXR, 371,858 images from 65,079 patients) (17), CheXpert (CXP, 223,648 images from 64,740 patients) (18), ChestX-ray8 (NIH, 112,120 images from 30,805 patients) (19), and a multi-source combination of all three (ALL, 707,626 images from 129,819 patients). The CXR, CXP, and NIH datasets have relatively equal rates of male and female patients, and most patients are between 40 and 80 years old. Note that the CXP and NIH datasets report only patient sex and age, whereas CXR additionally reports patient race and insurance type for a large subset of images. The race and insurance type attributes are highly skewed in the dataset: White patients are the majority, and patients with Medicaid insurance are a minority. The NIH dataset contains only frontal-view images, whereas the other datasets also include lateral views. For more detailed summary statistics across datasets, see Table 1. For our medical imaging predictive model and training scheme, we follow best practices (7) and train a 121-layer DenseNet (21), with weights initialized using ImageNet (22). Additional details of model training and construction can be found in the Appendix, Section Model Training Details.
To assess and compare underdiagnosis rates across subpopulations, we compute the false positive rate (FPR) of model predictions for the "No Finding" label; a false positive here means the model predicts no disease even though the patient has at least one diagnosed disease condition. We then compare these FPRs across subpopulations including age and sex on all four datasets, as well as race and insurance type on the CXR dataset specifically. For a visual illustration of our model pipeline, see Fig. 1.
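As a concrete sketch, the per-subgroup "No Finding" FPR can be computed as below. The record fields (`no_finding_true`, `no_finding_pred`, `sex`) and the toy data are illustrative assumptions, not taken from the paper's code or datasets.

```python
def no_finding_fpr(records):
    """FPR for "No Finding": among patients who truly have at least one
    disease (true "No Finding" = 0), the fraction the model falsely flags
    as healthy (predicted "No Finding" = 1) -- the underdiagnosis rate."""
    diseased = [r for r in records if r["no_finding_true"] == 0]
    if not diseased:
        return float("nan")
    return sum(r["no_finding_pred"] == 1 for r in diseased) / len(diseased)

def fpr_by_subgroup(records, attribute):
    """Underdiagnosis rate per value of a subgroup attribute (e.g. sex, race)."""
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r)
    return {g: no_finding_fpr(rs) for g, rs in groups.items()}

# Toy illustration (synthetic records, not real data):
records = [
    {"sex": "F", "no_finding_true": 0, "no_finding_pred": 1},
    {"sex": "F", "no_finding_true": 0, "no_finding_pred": 0},
    {"sex": "M", "no_finding_true": 0, "no_finding_pred": 0},
    {"sex": "M", "no_finding_true": 1, "no_finding_pred": 1},
]
print(fpr_by_subgroup(records, "sex"))  # {'F': 0.5, 'M': 0.0}
```

Note that truly healthy patients are excluded from the denominator, so this metric isolates missed disease rather than overall error.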
Additionally, we measure the false negative rate (FNR) for the "No Finding" label across all subgroups. This measure helps us differentiate between overall model noise (e.g., predictions flipped at random in either direction), which would result in roughly correlated FPR and FNR across subgroups, and selective model noise (e.g., predictions selectively biased towards "No Finding"), which would result in uncorrelated or anti-correlated FPR and FNR. While both kinds of noise are problematic, the latter is a form of technical bias amplification, as it would show that the known bias of clinical underdiagnosis is being selectively amplified by the algorithm -- i.e., the model is not only failing to diagnose the patients whom clinicians fail to diagnose, but is also failing to diagnose additional patients.

Results
Underdiagnosis occurs in under-served patient subpopulations. We find the underdiagnosis rate for all datasets differs dramatically across all measured subpopulations. In Fig. 2A we show the subgroup-specific underdiagnosis rates for the CXR dataset by race, sex, age, and insurance type. We observe that female patients, patients under 20 years old, Black patients, Hispanic patients, and patients with Medicaid insurance receive higher rates of algorithmic underdiagnosis than other groups. In other words, these groups are at higher risk of being falsely flagged as "healthy" and receiving no clinical treatment. Results on the other datasets follow a similar trend and are shown in the Appendix.

Underdiagnosis occurs in intersectional groups.
We investigate intersectional groups -- here defined as patients who belong to two subpopulations, e.g., Black female patients. Similar to prior work in face detection (15), we find that intersectional subgroups (see Fig. 2B) often have compounded biases in algorithmic underdiagnosis. For instance, in the CXR dataset, Hispanic females have a higher "No Finding" FPR than White females (see Fig. 2B-1). Likewise, patients under 20 years old who are also female, Black, or insured by Medicaid (and hence often low income) have the largest underdiagnosis rates (see Fig. 2B-2). The underdiagnosis rates for the intersection of Black patients with subgroups of age, sex, and insurance type (see Fig. 2B-3), and of patients with Medicaid insurance with subgroups of sex, age, and race (see Fig. 2B-4), are also depicted in Fig. 2B. Patients who are members of two under-served subgroups experience larger underdiagnosis rates. In other words, although female patients as a whole show a larger underdiagnosis rate (Fig. 2A), not all females are misdiagnosed at the same rate (see Fig. 2B-1). The intersectional underdiagnosis results for the other datasets, shown in the Appendix, follow a similar pattern.
Underdiagnosis is not a result of subgroup-specific overall noise. As illustrated in Fig. 2C (FNR for "No Finding"), FPR and FNR show inverse relationships across different under-served subgroups on the CXR dataset (this finding is consistent across all datasets; see Appendix), for both overall and intersectional subgroups (see Fig. 2D). For emphasis, we restate that this finding is not consistent with a simple increase in overall noise for specific subgroups, but instead indicates that under-served subpopulations are being aggressively and erroneously flagged as healthy by the algorithm, without a corresponding increase in false negatives.

Conclusion
We demonstrate evidence of AI-based underdiagnosis against under-served subpopulations in diagnostic algorithms trained on chest X-rays. Clinically, underdiagnosis is of key importance because undiagnosed patients incorrectly receive no treatment. We observe, across three large-scale datasets and a combined multi-source dataset, that under-served subpopulations are consistently at significant risk of algorithmic underdiagnosis. Additionally, patients in intersectional subgroups (e.g., Black female patients) are particularly susceptible to algorithmic underdiagnosis.

Fig. 2.
Analyzing underdiagnosis over subgroups of sex, age, race, and insurance type within the MIMIC-CXR (CXR) dataset. The results are averaged over 5 runs with different random seeds ± 95% confidence intervals. A. The underdiagnosis rate (measured by "No Finding" FPR). Female, 0-20, Black, and/or low-income patients under Medicaid insurance have the largest underdiagnosis rates, indicating the greatest disparity. B. The intersectional underdiagnosis rates within only female (B1), ages 0-20 (B2), Black (B3), or Medicaid (B4) patients. In these plots, we see that intersectional identities are often underdiagnosed even more heavily than the group in aggregate (e.g., Medicaid female patients are underdiagnosed more than Medicare female patients). C, D. We compute the same analyses on the overall subgroups (C) and the intersectional subgroups (D1-4), but now examining the "No Finding" FNR. If we observed a commensurate increase in FNR alongside the increase in FPR observed in A, B, this would indicate these results track an increase in overall noise. Instead, we typically observe an inverse correlation between FPR and FNR, indicating the model is selectively underdiagnosing these vulnerable subpopulations. Throughout, subgroups labeled in gray text, with results omitted, indicate that the subgroup has too few members (≤15) to be used reliably.
This discrepancy is especially interesting in the context of known biases in clinical care itself, in which under-served subpopulations are often underdiagnosed by doctors without a simultaneous increase in overdiagnosis of privileged groups (9). Our prediction labels are provided by these same doctors, and are therefore not an unbiased ground truth -- in other words, our labels should already suffer from the same bias that our model then additionally exhibits. This is a form of bias amplification, in which a model's predicted outputs amplify a known source of error in the data-generating process (23) or data distribution (24). This is an especially dangerous outcome for machine learning models in healthcare, as it indicates that the existing biases in health practice risk being magnified, rather than ameliorated, by algorithmic decisions based on large (707,626 images), multi-source datasets. While this evaluation covers chest X-ray diagnostic imaging, this issue is likely widespread across data sources and prediction tasks.
Our findings demonstrate a concrete way that algorithms escalate existing systemic health inequities. As algorithms move from the lab to the real world, we must consider the ethical concerns about the accessibility of medical treatment for under-served subpopulations and the effective and ethical deployment of these models.

Funding: Natural Sciences and Engineering Research Council of Canada (NSERC), funding number PDF-516984; Microsoft Research; Canadian Institute for Advanced Research (CIFAR); NSERC Discovery Grant. Author contributions: Each named author has substantially contributed to conducting the underlying research and drafting the manuscript. Competing interests: The authors declare no competing interests. Data and materials availability: All three datasets used in this work are public under data use agreements. The MIMIC-CXR dataset is available at: https://physionet.org/content/mimic-cxr/2.0.0/. The CheXpert dataset is available at: https://stanfordmlgroup.github.io/competitions/chexpert/. The ChestX-ray8 dataset is available at: https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community. Code availability: All code used in the analysis will be available in a public repository for purposes of reproducing or extending the analysis. The link to the code will be added to the text of the paper for the camera-ready version.

Supplementary Materials:
Model Training Details
Additional Results
Figs. S1 to S3