Racism In the American Healthcare System
Systematic racism, defined as “a system in which public policies, institutional practices, cultural representations, and other norms work in various, often reinforcing ways to perpetuate racial group inequity,” permeates almost every aspect of American society, and healthcare is no exception.[20] While it is difficult to pinpoint when racial bias was first introduced into the field of medicine, racism has been an integral part of the U.S. government’s structuring and financing of the healthcare system since the Jim Crow era.[21]
Jim Crow segregation laws – which legalized racial discrimination – separated America’s Black population from the rest of society. Some of these laws significantly affected the ability of Black Americans to access medical care and necessary medical treatments.[22] Medical institutions, including hospitals, doctors’ offices, and clinics, were segregated by race, and care at Black hospitals tended to be lower in quality due to significant underfunding and lack of resources. Because the education system was also segregated, trained Black medical professionals were scarce, which often meant that when Black patients did receive care, they faced personal discrimination and inhumane treatment. White physicians could care for any patients, while Black physicians (if granted admitting privileges, which they were often denied) were restricted to the Black wards.[23] Fully integrated hospitals were rare, and even in those, Black patients were still required to be treated in separate, subpar wards and could not share a room with White patients.[24] Healthcare segregation had serious clinical consequences, although we may never know their full extent, since data on racial disparities was not collected until the first National Health Interview Survey in 1958.[25]
Legally sanctioned hospital discrimination continued until the mid-1960s, when a Supreme Court decision made segregation in hospitals illegal.[26], [27] For Black Americans who lived through Jim Crow, however, the damage had already been done, creating a legacy of racial health disparities known as the Jim Crow effect. For example, Black women born in Jim Crow states before 1965 were more likely than both White women born in the same era and Black women born later to have estrogen-receptor-negative breast cancer, which is particularly aggressive and difficult to treat. Black women born after the abolition of the racist laws did not experience the same effect, suggesting the racial disparity was a direct product of discrimination.[28] The harm extended through generations: the infants of Black women born in the early 1960s were at a much higher risk of low birth weight than the infants of those born in the late 1960s.[29] Health disparities along racial lines are not due to innate genomic differences, a hypothesis that has been disproven by multiple researchers over the years, but to the influence of inequitable social factors dating back to even before Jim Crow.[30] Bundled with the appalling social and environmental conditions that Black Americans faced post-Reconstruction, the segregated healthcare system worked to recode their poor health outcomes so systematically that it was “as if they were [their] genetic material”.[31]
Despite the elimination of segregation in hospitals in 1965, systematic racism persists on individual, intra-organizational, and extra-organizational levels, still limiting access to care and quality of treatment for patients of color.
Covid-19 Racial Disparities
The COVID-19 pandemic has exposed racist social determinants of health, highlighting the ways historical injustices still linger today. On April 8th, 2020, the CDC published its surveillance data of confirmed COVID-19-associated hospitalizations across 14 states. Black Americans were disproportionately affected by the disease, making up 33.1% of patients despite representing only 18% of the catchment population.[32] The same phenomenon appeared in government statistics from cities across the United States; the racial disparities persisted.
There are several ways social determinants of health have led to racial disparities in COVID-19 infection, morbidity, death, and vaccination rates. Black Americans are more likely to live in densely populated neighborhoods, leading to increased exposure, and in areas of lower socioeconomic status, where they have decreased access to health care and fewer COVID-19 testing sites.[33] Racial minorities more often work in essential settings, such as public transportation, healthcare facilities, factories, and restaurants, where their chances of exposure are higher due to the nature of their work.[34] Minorities may also have a greater risk of severe illness due to comorbidities that exacerbate COVID-19 symptoms, such as hypertension, diabetes, or heart disease.[35] Additionally, historic and current experiences of racism in medicine have built a strong mistrust of the American healthcare system among racial minorities, which may have extended to vaccine uptake. As of April 4th, 2022, only 57% of Black Americans had received at least one COVID-19 vaccine dose, compared to 85% of Asian Americans, 65% of Latinx Americans, and 63% of White Americans.[36]
Algorithmic Bias
New medical technologies, including the machine learning algorithms presented in this paper, are built on a foundation of historical and structural racism that has long produced inequitable health outcomes for marginalized communities. Without a conscious effort to counteract it, racial bias can be transmitted through new, algorithmic channels.
AI systems have successive stages in their development and management that serve as potential entry points for bias, creating a domino effect of prejudices with alarming consequences.[37] The “bias cascade” begins with the process of data collection. Machines are only as accurate as the data they receive, so if an algorithm is trained on data containing disparities, such as underrepresentation of minority groups or inequitable measurement techniques, the outcome will be similarly skewed.[38] The developers may even fall victim to their own biases, further compounding any bias in the way the code is actually written. Lastly, minority groups’ lack of access to healthcare and medical technology introduces another level of inequality for healthcare algorithms. If minority groups are not exposed to these technologies, it becomes more difficult to identify biased results, further adding to the misrepresentation of the data if it is used to train the algorithm at a later date.
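The first step of the cascade can be made concrete with a toy illustration. In the sketch below (all numbers hypothetical, and the "model" reduced to a single decision threshold for simplicity), a cutoff chosen to maximize accuracy on training data dominated by one group performs noticeably worse on an underrepresented group whose measurements are distributed differently:

```python
# Toy illustration of training-data bias: a single decision threshold fit
# to data dominated by majority group A misclassifies minority group B,
# whose disease presents at lower measurement values (cf. inequitable
# measurement techniques). All scores and labels are hypothetical.

def best_threshold(scores, labels):
    """Pick the cutoff that maximizes overall accuracy on the training data."""
    def accuracy(t):
        return sum((s >= t) == bool(y) for s, y in zip(scores, labels)) / len(labels)
    return max(sorted(set(scores)), key=accuracy)

# Majority group A: 8 patients; minority group B: only 2 patients.
scores_a = [0.2, 0.3, 0.35, 0.4, 0.7, 0.75, 0.8, 0.9]
labels_a = [0,   0,   0,    0,   1,   1,    1,   1]
scores_b = [0.1, 0.32]
labels_b = [0,   1]

t = best_threshold(scores_a + scores_b, labels_a + labels_b)

def group_accuracy(scores, labels, t):
    return sum((s >= t) == bool(y) for s, y in zip(scores, labels)) / len(labels)

acc_a = group_accuracy(scores_a, labels_a, t)  # 1.0: perfect for group A
acc_b = group_accuracy(scores_b, labels_b, t)  # 0.5: B's sick patient missed
```

Because group B contributes so few training examples, the overall-accuracy objective sacrifices B's sick patient entirely; the skew in the data becomes a skew in the outcome.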
Analytic Framework
This paper aims to answer the following: What harmful biases are present in predictive digital health algorithms, and how can they best be regulated? Using prognostic models that predict a care course for patients diagnosed with COVID-19, this research applies an analytic framework that offers a conceptual map for evaluating automated decision systems, their development, their outcomes, and their risk of bias.
The models will be pulled from COVID PRECISE, a comprehensive, systematic, and continuously updated review of diagnostic, prognostic, and general population prediction models for COVID-19 and their accuracy, quality, and applicability.[39] COVID PRECISE catalogues the three main types of prediction models related to COVID-19: models predicting COVID-19 susceptibility in the population, to guide the use of preventative healthcare resources in high-risk areas; models predicting the presence of COVID-19 in symptomatic patients, to direct diagnostic capacity to patients with a high probability of having the disease; and models predicting a care course in patients diagnosed with COVID-19, to allocate hospital resources to those with an estimated poor prognosis. This thesis will focus only on the third type, prognostic models, which are or have been used in clinical settings.
In statistics, bias refers to any systematic error that may occur when using statistical analyses. Within the context of this thesis, bias will refer to a predicted outcome that is “systematically less favorable to individuals within a particular group and where there are no relevant differences between the groups that justify such harms”.[40] While the main purpose of this paper is to reveal the risk of harmful racial bias, the analytic framework will also lend itself to exposing other forms of bias, such as gender and income-level bias, while paying particular attention to any compounding interaction effects.
There are various types of statistical bias that could give rise to racial discrimination. The following are most likely to be relevant in this context:
- Selection bias occurs when the individuals, groups, or data selected are not representative, so that randomization is not achieved.[41] Any calculation using such a dataset will not represent the whole population.[42]
- Confirmation bias, a subset of selection bias, is the tendency to favor information that confirms an individual’s beliefs.[43]
- Omitted variable bias originates from the absence of a relevant variable in a model, making it inaccurate and underfit.[44]
- Susceptibility bias occurs when cause, effect, and correlation are incorrectly interchanged, ignoring that correlation does not imply causation.[45]
- Funding bias, also known as sponsorship bias, is the tendency to alter a study or its outcome to support a financial sponsor.[46]
- Status quo bias is a cognitive bias that refers to an exaggerated preference for the status quo, where an individual prefers to keep their context or environment as it was before.[47]
- Label bias arises when the outcome variable has different meanings across groups and should be anticipated when a proxy is used.[48]
It is important to note, however, that even if an algorithm’s outcome is statistically perfect and presents no form of statistical bias through these mechanisms, it is still possible for racial bias to occur. Racism is so ingrained in the architecture of American society that it has effectively normalized discrimination as objective reality. Notions of race are embedded in the medical field everywhere from research to clinical practice, from medical school training to insurance claims. Take, for instance, the spirometer, a device for measuring and assessing lung function. Drawing on the once-standard assumption that there are racial differences in lung capacity, the spirometer has a button that produces different measurements of lung normalcy by race.[49] To register at the same level as their White counterparts, Black patients must demonstrate worse lung function and more severe clinical symptoms. According to prospect theory, decision-making is influenced by options that may rest on biased judgment.[50] A measurement from such a spirometer would be technically accurate, but only in the context of a racially biased system. As we observe digital predictive health algorithms for signs of harmful bias, racial or otherwise, it is imperative to consider the broader context upon which the medical field is built.
The analytic framework consists of seven domains that each give a distinct insight into the operation of the algorithm: Constitution, Inputs and Outputs, Training Data, Transparency, Outcome, Scale, and Policy. The framework incorporates certain elements from the Prediction Model Study Risk of Bias Assessment Tool (PROBAST), which assesses both the risk of bias and concerns surrounding the applicability of multivariable diagnostic or prognostic prediction models.[51] The framework expands upon this recently developed risk assessment tool by incorporating details on either side of the algorithm’s development, including developer diversity, where the training data originates, the scale of the application, and any governance attempts. A brief definition and justification for each domain is included below.
I will observe each domain for signs of statistical bias that risk producing inequitable outcomes for historically marginalized groups. Each domain includes a set of descriptive and signaling questions to facilitate structured judgment of the algorithm, designed to reveal flaws in the algorithm’s design, conduct, or analysis.[52] The signaling questions (italicized) are phrased so that the answer ‘yes’ indicates no bias, and the answer ‘no’ indicates a gap in the algorithm’s design allowing for the potential introduction of bias. If the information provided by the model’s developers does not give a concrete ‘yes’ or ‘no’ to a signaling question, I will make a contextual deduction in either direction, with the other possible answers being ‘probably yes’ or ‘probably no’. In terms of my own judgment, I will tend to assume an answer is ‘probably no’ rather than ‘probably yes’ out of an abundance of caution.
1. Constitution
- What is the algorithm and how does it operate?
- Who developed the algorithm?
- Who funded the development of the algorithm?
- Is the team who developed the algorithm representative of traditionally marginalized groups?
Definition
The constitution of an algorithm refers to what makes up its composition and development. It looks at who actually developed the algorithm, what the development team’s diversity make-up is, how the research was funded, how it is supposed to work, and its main objectives.
Justification
As previously mentioned, blind spots or biases that a model presents reflect the priorities and judgments of its creators. If a development team is diverse and representative of marginalized groups, there is a higher probability that potential bias was flagged and acknowledged at the development stage.
2. Inputs and Outputs
- What data is inputted in the algorithm? Was this appropriate?
- What data is outputted from the algorithm? Was this appropriate?
- Were all data exclusions appropriate?
Definition
Every algorithm takes in data (inputs) and produces a different set of data (outputs) based on the specified input values. An input is deemed appropriate if it is defined and assessed in a similar way across patients, has unbiased measurement, and does not act as an unintended race proxy.[53] An outcome is appropriate if it was determined using data collected for the purpose of the algorithm, standardized across all patients, and determined without knowledge of the predictors, all within an appropriate time interval.[54] Exclusions of participants are appropriate if the excluded participants did not meet the developer’s inclusion criteria.
Justification
A model’s inputs and outputs may be observable and controllable; in other cases, they are not. Sometimes a model must use a closely related variable, known as a proxy, as a stand-in for a data point that could not be measured directly. If the relationship between the proxy and the intended measure is imperfect, or the proxy carries an unintended racial element, this can skew the output. In addition, studies that exclude certain participants may produce a biased estimate, as the model would be based on a group that may not be representative of the target population.[55]
3. Training Data
- What training data does the algorithm use?
- Where does the training data come from?
- How often is the training data updated?
- Was the training data representative and appropriate?
Definition
Training data is a set of historical data that facilitates an algorithm’s machine learning. These data sets help the algorithm identify the correct outputs for certain scenarios. Using the training data, the algorithm develops a model that maps the inputs to the outputs and is later applied to other scenarios where it will be used again. Under this domain, appropriate training data is defined as accurate, free from bias or discrimination, and representative of marginalized races and genders. For prospective models, the most appropriate use of training data with the lowest risk of bias is a prospective longitudinal cohort design, where pre-specified and consistent methods ensure that data is systematically and validly recorded specifically for the design of the model.[56]
Justification
An algorithm is at risk of producing biased results if it relies on inappropriate training data. If the training data that an algorithm uses is biased in any way, the algorithm will be biased as well, as that is what it has been taught to replicate.
4. Transparency
- Is information on the algorithm, its decision-making process, its training data, and any decisions made readily available and accessible to the public?
- Is the information presented in an accessible and understandable way?
Definition
The transparency of a model refers to the availability and accessibility of information about an algorithm, its decision-making process, and the decisions that it ultimately makes. Transparency benefits patients, who can better understand why outcomes are the way they are, and developers, who can more easily identify issues and problems.
Justification
AI transparency helps us to understand the inner workings of a particular model and can make instances of bias or inaccurate outcomes more visible and consequently easier to fix. The more transparent a model is, the less likely it is that sources of bias are able to fly under the radar of public judgment.
5. Outcome
- What was the outcome?
- Was there a standard outcome definition used?
- Were there a reasonable number of participants with the outcome?
- Were participants with missing data handled appropriately?
- Was relevant model overfitting accounted for in model performance?
- Was the outcome determined in an identical way for all participants?
- Was the outcome accuracy constant between groups of participants?
- Is there a mechanism for human oversight or judgement if deemed appropriate?
Definition
This domain relates to the risk of bias in the definition and determination of a model’s outcome. Appropriate handling refers to optimal data analysis methods. In terms of the number of participants, the events per variable (EPV) rule of thumb will be applied for different racial groups. This rule recommends that at least 10 individuals develop the outcome for every predictor variable included in the model.[57] An EPV equal to or greater than 20 generally eliminates bias in regression coefficients.[58]
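The EPV calculation is simple enough to sketch directly. The counts below are hypothetical, chosen to show how an overall EPV can look adequate while a racial subgroup's EPV falls below the rule of thumb:

```python
# Sketch of the events-per-variable (EPV) rule of thumb.
# All event counts and the predictor count are hypothetical.

def events_per_variable(n_outcome_events, n_predictors):
    """EPV = participants who developed the outcome per predictor variable."""
    return n_outcome_events / n_predictors

def epv_assessment(epv):
    if epv >= 20:
        return "bias in regression coefficients generally eliminated"
    if epv >= 10:
        return "meets the minimum rule of thumb"
    return "at risk of bias: too few outcome events per predictor"

# A model with 6 predictors, assessed overall and for one racial subgroup:
overall = events_per_variable(120, 6)   # 20.0 -> generally unbiased
subgroup = events_per_variable(45, 6)   # 7.5  -> below the threshold of 10
```

Applying the rule per racial group, as this framework does, catches exactly this failure mode: a subgroup underrepresented in the training data can fall below the threshold even when the pooled EPV passes.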
Justification
By observing a model’s outcomes and the specific choices made to reach them, such as the outcome definition, how missing data was handled, and the general accuracy among groups of participants, we can detect the points with potential for biased results. In most of these studies, accuracy is measured using the Area Under the Curve (AUC), an aggregate measure of a model’s performance based on a curve plotting the true positive rate against the false positive rate.[59] This number represents the probability that the model ranks a random positive example above a random negative one. If the researchers report this measure for each racial group, it is possible to see the effectiveness of the algorithm across races, allowing us to detect any unintended bias. If the algorithm has a specific mechanism for human oversight in the case of biased results, this may alleviate the risk of algorithmic bias presenting as harmful outcomes.
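A per-group AUC check can be sketched from the probabilistic definition above: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counted as half). All scores, labels, and group names here are hypothetical:

```python
# Minimal sketch: computing AUC separately per (hypothetical) group to
# check whether a model's accuracy is constant across groups.
from itertools import product

def auc(scores, labels):
    """AUC via its definition: P(random positive scored above random negative)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    pairs = list(product(positives, negatives))
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

def auc_by_group(scores, labels, groups):
    """Recompute AUC within each group separately."""
    result = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        result[g] = auc([scores[i] for i in idx], [labels[i] for i in idx])
    return result

# Hypothetical predicted risks, true outcomes, and group membership:
scores = [0.9, 0.8, 0.3, 0.2, 0.7, 0.4, 0.6, 0.1]
labels = [1,   1,   0,   0,   1,   1,   0,   0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
per_group = auc_by_group(scores, labels, groups)
# per_group -> {"A": 1.0, "B": 0.75}: a gap signals unequal accuracy.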
6. Scale
- Is the algorithm being used in a clinical setting?
- If yes, at what scale?
- Is this appropriate?
Definition
A model’s scale refers to an algorithm’s ability to grow exponentially, and whether it is being used as such. For an algorithm’s scale to be appropriate, its training data, inputs, and outputs need to remain relevant in the additional contexts.
Justification
When an algorithm is used to aid or replace decision-making on a large scale, there are more opportunities for any bias to become exponentially harmful.[60] The risk of bias increases further if the algorithm is used in contexts it was not originally designed for. Models that have been externally validated are more likely to be appropriate for use on a larger scale.
7. Policy
- Are there performance evaluation measures in place and were they used appropriately?
- Are there any existing policies governing the use of the algorithm?
- Who is held accountable for any consequences of the algorithm?
Definition
The policy domain observes any pre-existing U.S. regulations or evaluation frameworks that attempt to regulate the algorithm or mitigate the effects of harmful bias. An appropriate performance evaluation would measure not only the overall accuracy of an algorithm but also its precision, specificity, accuracy among different populations, AUC, and external validity.
Justification
If there were previous attempts at regulating the use of an algorithm or evaluating its performance, the risk of bias is already much lower, as the effects of the bias have been acknowledged and mitigated. The creation of the policy itself is already a channel for bias mitigation, and its presence conveys an awareness of the potential impacts of the algorithm’s outcomes.
I hypothesize four possible answers to my original question: no bias present (0 ‘no’s), low risk of harmful bias (1–5 ‘no’s), medium risk of harmful bias (6–10 ‘no’s), and high risk of harmful bias (11–17 ‘no’s). A ‘probably no’ answer will count as half of a definite ‘no’ (0.5). The ‘probably yes/no’ answers were added once the analysis began, due to the lack of transparency in the research papers. Depending on where the risk of bias introduces itself, I will give policy recommendations for how best to alleviate it within an AI model and for regulating that risk going forward.
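The scoring rule above can be sketched as follows. The answer list is hypothetical, and one detail is my own assumption: since the brackets are defined over whole numbers, a fractional total falling between brackets (e.g. 5.5) is assigned to the higher-risk category out of caution:

```python
# Sketch of the risk-of-bias scoring rule: 'no' counts as 1, 'probably no'
# as 0.5, and the total over the 17 signaling questions maps to a category.

def no_score(answers):
    weights = {"no": 1.0, "probably no": 0.5, "yes": 0.0, "probably yes": 0.0}
    return sum(weights[a] for a in answers)

def risk_category(score):
    if score == 0:
        return "no bias present"
    if score <= 5:          # fractional totals above 5 fall to medium
        return "low risk of harmful bias"
    if score <= 10:         # an assumption; the brackets are defined on integers
        return "medium risk of harmful bias"
    return "high risk of harmful bias"

# A hypothetical set of answers to the 17 signaling questions:
answers = ["yes"] * 9 + ["probably yes"] * 2 + ["no"] * 4 + ["probably no"] * 2
score = no_score(answers)         # 4 * 1.0 + 2 * 0.5 = 5.0
category = risk_category(score)   # "low risk of harmful bias"
```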
The boundaries between high risk, medium risk, and low risk were created by dividing the signaling questions into three equal segments. It is important to note, however, that in reality not every bias mechanism is equal; some types may be more harmful than others. Weighting by harm is outside the scope of this thesis, which intends to observe where the bias is coming from, so each bias domain will be weighted equally.