Demographic diversity of genetic databases used in Alzheimer’s disease research

For several years, experts have warned about the lack of diversity in genetic research databases, and researchers have devoted time and resources to recruiting subjects from underrepresented subgroups. In this study, we review published reports in academic journals of genetic studies of Alzheimer’s disease to note whether demographic diversity was indicated in the reports and, if so, the extent of representation of non-European subjects over the period from 1997 to 2022. We use multivariate regression analysis to analyze changes over time and to explain variation across studies. Our analysis indicates that reported diversity has not changed over time. Rather, it appears to have remained relatively constant, since Genome-Wide Association Studies (GWASs) were first used in the 1990s. We find most variation to be across journals rather than within journals, suggesting that characteristics of journals are an important influence on the dissemination of research with diverse samples. Lack of racial diversity in genetic databases used to develop clinical applications could lead to disparities in the effectiveness of those applications for underrepresented groups.


Introduction
To translate genetic research findings into clinical applications that have widespread effectiveness, it is important to study a diverse pool of subjects (Sirugo et al. 2019). This is especially true in the development of precision medicine, in which therapies are tailored to a patient's genetic characteristics (Cohn et al. 2017). Black Americans and other racial minorities are the population groups most commonly underrepresented in medical research, and they are the ones most vulnerable to the effects of such underrepresentation (Prasanna et al. 2019).
Several authors have noted the effects that underrepresentation of minority groups can have on the utility of clinical applications derived from medical research (Grahan et al. 2018). If a patient's genetic traits are not represented in a database used for research, therapies based on that research may be less effective for that patient (Prasanna et al. 2019). This can create yet another instance of racial disadvantage in the healthcare system.
A recent interview study of genetic researchers by Trinidad et al. found that investigators tend to give only limited consideration to demographic diversity when selecting a database for use in research, with more attention paid to ease of access and database features (Trinidad et al. 2022). This suggests that the importance of diversity to the applicability of findings is underappreciated. As the authors note, structural factors and institutional values may thereby present an ongoing underlying impediment to the collection and use of more inclusive data.
To study the extent to which underrepresentation of historically excluded minorities exists in genetic databases used in research, we examined a specific field of research: the study of genetic determinants of Alzheimer's disease. We also examined trends over time. Representation of minorities was measured as the reported percentage of subjects who are not of European ancestry in databases used in this research during the period from 1997 to 2022. A similar approach has been used to study the racial diversity of genetic samples used to study cardiovascular disease (Prasanna et al. 2019). We noted the extent to which genetic diversity is disclosed in study reports in academic journals and trends during the study period. We focused on research regarding Alzheimer's disease because it holds particular promise for clinical applications and because the number of studies presents a manageably sized research universe. We suspect that it is representative of the diversity in genetic research more broadly.

Materials and methods
We selected the studies for our review using the standard three-step PRISMA methodology, as diagrammed in the appendix (Page et al. 2021). The steps were as follows: First, we identified records of published studies by searching the PubMed database. The search terms included the following combination of keywords: genome AND genetics AND ((genome-wide association study) OR (genome-wide association studies) OR (risk score)) AND (race OR ethnicity OR demographics) AND (medical or medicine) AND (genetic data) AND (Alzheimer's). These terms ensured that the resulting studies were likely to be analyses of genetic data with applications for Alzheimer's disease and that they were likely to include references to the demographics of the sample that was studied. They were designed to exclude studies that referenced genetics tangentially, in a non-medical context, in a way that was unrelated to Alzheimer's disease, or with insufficient information to assess diversity. This search yielded 222 records.
Second, two independent researchers screened these records. In the initial screening, both read the article abstracts to exclude records that did not focus on Alzheimer's disease as a primary outcome. This yielded 168 records. The researchers then assessed the full-text articles to determine whether they actually analyzed genetic data related to Alzheimer's disease. This resulted in the exclusion of an additional 13 records.
Third, the researchers read the remaining 153 studies and collected the following data: author(s), year of publication, journal, source(s) of genetic data, sample size, and demographic characteristics of subjects. These data were then used to create coded variables for analysis. These included four binary variables: whether demographic information was reported; if it was, whether the sample was predominantly European; whether the sample was predominantly non-European and racially non-diverse; and whether the sample was racially diverse. Other coded variables included the percent of European ancestry in the sample, the percent of non-European ancestry, and the predominant reported ancestry.

The extent of reporting of demographic information
We first measured the extent of reporting of demographic information. When equally weighted across all studies, we found that 92% reported it. However, larger samples may be more important in developing precision medicine tools.
When weighted by sample size, we found that 96% of the studies reported demographic information.
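The difference between the two weighting schemes can be illustrated with a minimal sketch. The records below are hypothetical and are not drawn from the reviewed studies; the function names are likewise illustrative.

```python
# Hypothetical records: (sample_size, reported_demographics).
# These values are illustrative only, not data from the review.
studies = [
    (1200, True),
    (300, False),
    (50000, True),
    (450, True),
]

def equal_weighted_share(records):
    """Share of studies reporting demographics, one vote per study."""
    return sum(1 for _, reported in records if reported) / len(records)

def size_weighted_share(records):
    """Share of all subjects who belong to studies reporting demographics."""
    total = sum(size for size, _ in records)
    return sum(size for size, reported in records if reported) / total

equal_share = equal_weighted_share(studies)
weighted_share = size_weighted_share(studies)
print(f"equal-weighted: {equal_share:.1%}")
print(f"size-weighted:  {weighted_share:.1%}")
```

Because one large study can dominate the subject count, the size-weighted share can differ noticeably from the per-study share, which is the pattern the two reported figures reflect.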

The distribution of reported ancestry across all studies
After narrowing our sample to studies that report demographic information, we compiled that information into a dataset to compute the distribution of reported ancestry. We did not attempt to deduce the genealogical lineage of the subjects in each study, nor did we make any assumptions about their "races." Instead, we created three categories based on the way the authors of each study reported the ancestry of their subjects, typically based on the subjects' self-reports.
Throughout the literature, considerations of underrepresentation of minority subgroups typically focus on the overrepresentation of European ancestry. Thus, we denoted one category as "European non-diverse." However, many studies overrepresent other ancestries relative to the global population or to the relevant national population. For instance, a study in Japan may contain only Japanese subjects, or a study in the United States may focus only on Black subjects. While these studies increase the diversity of the literature overall by counterbalancing the European-only studies, their samples are not diverse in and of themselves. Therefore, we categorized these studies as "non-European non-diverse." The final category, for studies containing multiple ancestries in large proportions, was "diverse."
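The three-way categorization can be sketched as follows. The text does not state a numeric cutoff for "large proportions," so the 0.95 threshold below is purely an assumption for illustration, as are the function name and the ancestry labels.

```python
# Hypothetical cutoff: the share above which a single ancestry makes a
# sample "non-diverse". The paper does not state a threshold.
NON_DIVERSE_CUTOFF = 0.95

def categorize(ancestry_shares, cutoff=NON_DIVERSE_CUTOFF):
    """Map reported ancestry shares (fractions summing to about 1) to a label."""
    top_ancestry, top_share = max(ancestry_shares.items(), key=lambda kv: kv[1])
    if top_share < cutoff:
        return "diverse"
    if top_ancestry == "European":
        return "European non-diverse"
    return "non-European non-diverse"

print(categorize({"European": 1.0}))
print(categorize({"Asian": 0.98, "European": 0.02}))
print(categorize({"European": 0.6, "African": 0.4}))
```

A different cutoff would move studies between categories, so any replication would need to fix and report the threshold it uses.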

Results
As shown in Fig. 1a, the presence of widespread underrepresentation is empirically validated. Only 23% of the studies we analyzed qualified as "diverse." The largest category is "European non-diverse," accounting for 42%. The imbalance is even more pronounced when we weight the studies by sample size (Fig. 1b). By this metric, 84% of the genetic data used in Alzheimer's disease research are "European non-diverse." While these studies may be counterbalanced somewhat by the number of "non-European non-diverse" studies, these latter studies tend to use very small samples. Hence, underrepresentation is partly the result of the lack of diverse study populations but also partly the result of the small sample size when non-European subjects are included.
Another way to measure genetic diversity is to count the predominant reported ancestry in each study, that is, the ancestry with the largest representation in a given study. In Fig. 2a, we assigned equal weights to each study, indicating how often each ancestry is the predominant focus of a given study. We found 61% to be European or European American, 22% Asian or Asian American (reflecting a sizable number of studies undertaken by universities in Asian countries), 9% Native American or Latin American, and 8% African or African American. In Fig. 2b, we show the studies weighted by sample size, which more closely measures the predominant ancestry of all the databases used. Because European or European American datasets are much larger, these groups are far more predominant in research by this measure: 94%, compared with 5% Asian or Asian American, 1% African or African American, and less than 1% Native American or Latin American.
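A minimal sketch of the predominant-ancestry tally, under both weighting schemes, might look like this. The study records are hypothetical and chosen only to show how a single large dataset shifts the size-weighted result.

```python
from collections import defaultdict

# Hypothetical studies: total sample size and reported ancestry counts.
studies = [
    {"n": 40000, "counts": {"European": 39000, "Asian": 1000}},
    {"n": 500, "counts": {"Asian": 500}},
    {"n": 300, "counts": {"African": 200, "European": 100}},
    {"n": 200, "counts": {"Latin American": 200}},
]

def predominant(counts):
    """Return the ancestry with the largest reported count in one study."""
    return max(counts, key=counts.get)

def predominant_shares(records, weight_by_size):
    """Share of studies (or of subjects) attributed to each predominant ancestry."""
    totals = defaultdict(float)
    for record in records:
        weight = record["n"] if weight_by_size else 1
        totals[predominant(record["counts"])] += weight
    grand_total = sum(totals.values())
    return {ancestry: value / grand_total for ancestry, value in totals.items()}

by_study = predominant_shares(studies, weight_by_size=False)
by_size = predominant_shares(studies, weight_by_size=True)
print(by_study)
print(by_size)
```

In this toy data each ancestry is predominant in one of four studies, yet the European-predominant study accounts for nearly all subjects once sample size is used as the weight.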

Factors associated with reported ancestry over time
To investigate factors associated with level of diversity, we used three sets of variables collected from the studies: year of publication, sample size, and whether the study was a genome-wide association study (GWAS) or a smaller study focused on a specific population. These variables allowed us to run a series of multivariate linear regressions to estimate their relationships with different outcome measures indicating diversity or lack thereof. All four regressions followed a similar form:

$$Y_{ijt} = \beta_0 + \beta_1 \,\text{Year}_t + \beta_2 \ln S_{it} + \beta_3 G_{it} + \beta_4 C_{it} + \varepsilon_{ijt}$$

where $Y_{ijt}$ indicates a demographic measure of study $i$ at time $t$, $\ln S_{it}$ is the natural logarithm of the sample size, $G_{it}$ is a binary indicator for a GWAS (controlling for the possibility that these studies may be different, because they use large samples that overlap with other GWASs), and $C_{it}$ is a binary indicator for a clinical study. (There is a third omitted category for studies with both GWAS and focused data.) The coefficients and robust standard errors for each variable are listed for four separate regressions in Table 1.
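As a rough illustration of a regression of this form, the sketch below fits ordinary least squares on a small synthetic dataset. It makes several assumptions: the data, the coefficient values in `true_beta`, and the choice of 1997 as the baseline year are all hypothetical; the outcome is generated without noise so that recovery of the coefficients can be checked; and robust standard errors are not computed here.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def design_row(year, size, gwas, clinical):
    """Intercept, years since 1997 (assumed baseline), ln(sample size), G, C."""
    return [1.0, float(year - 1997), math.log(size), float(gwas), float(clinical)]

# Hypothetical studies: (year, sample size, GWAS indicator, clinical indicator).
# Rows with both indicators at 0 stand in for the omitted third category.
raw = [
    (1998, 200, 0, 1), (2003, 1500, 0, 1), (2007, 12000, 1, 0),
    (2010, 800, 0, 0), (2014, 54000, 1, 0), (2017, 350, 0, 1),
    (2020, 90000, 1, 0), (2022, 5000, 0, 0),
]
X = [design_row(*study) for study in raw]

# Hypothetical coefficients used to build a noise-free synthetic outcome.
true_beta = [0.9, -0.01, 0.02, 0.05, -0.03]
y = [sum(b * x for b, x in zip(true_beta, row)) for row in X]

estimates = ols(X, y)
print([round(e, 4) for e in estimates])
```

With real data the outcome would contain noise, and inference would require standard errors (robust ones, as the text describes), which this sketch deliberately leaves out.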
Across all four regressions, there is little evidence of change over time. The year variable is not statistically significant in explaining (a) whether the study reports demographic information, (b) whether it is more likely to be European non-diverse (indicating zero or close to zero subjects from other ancestries), or (c) whether it is more likely to be diverse. The non-significant point estimates suggest that the number of European non-diverse studies may have declined over time, but they also suggest that the number of diverse studies is declining. There is some evidence that the publication of studies with a preponderance of European subjects overall (a less stringent test than "European non-diverse," as it only requires European subjects to comprise a larger share than the other ancestries) is declining over time, but the coefficient is fairly small (−0.0167) and significant only at the 0.05 level. In contrast, there is a stronger significant finding that larger studies are more likely to include only European non-diverse subjects or predominantly European subjects (coefficient = 0.3172). Finally, there is a significant finding that GWASs are more likely to be predominantly European, although that coefficient is fairly small (0.0892), and it does not lead to different results for our other coefficients.
Given the mixed evidence regarding changes in the diversity of subjects over time, we converted the year variable into a series of binary indicator variables (with the first year, 1996, as the omitted category) and investigated the coefficients with their standard errors. (A graph is presented in the online appendix.) Consistent with the regressions described above, there is little evidence of a significant downward trend. The negative point estimates obtained in Table 1 likely reflect the few outliers of very high percent European subjects in the first half of the time period. However, the point estimates from 2011 to 2022 are indistinguishable from the estimates in 1997, 2001-2004, and 2006.

The role of journals in publishing studies with diverse samples
We also investigated the role of the journals themselves in publishing genetic studies by introducing journal-level fixed effects into the regressions. The resulting coefficients (Table 2) show within-journal effects. While the previous regressions compare across journals, these regressions show significant effects only if the variables' impacts on demographics have changed for the average journal. This analysis produced even fewer significant results, suggesting that any change detected in the previous analysis likely resulted from a propensity of different journals to publish studies with different levels of diversity in different time periods. Thus, what appears to matter most for representativeness is the journal, not general trends over time. This conclusion is confirmed by comparing the R² statistics, which increase from 0.030-0.193 in Table 1 to 0.550-0.735 in Table 2. Journal fixed effects alone explain approximately half of the variation in the outcome variables, compared to less than 20% explained by all of the other variables combined.
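The distinction between across-journal and within-journal comparisons can be illustrated with a small sketch. It uses a single regressor (year) and the within transformation, in which each variable is demeaned by journal, which is equivalent to including journal fixed effects in this simple case. The journal names and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical (journal, year, percent European) records. Each journal is
# constant over time, but the journals publish in different periods.
records = [
    ("Journal A", 2000, 100.0), ("Journal A", 2010, 100.0), ("Journal A", 2020, 100.0),
    ("Journal B", 2012, 60.0), ("Journal B", 2016, 60.0), ("Journal B", 2020, 60.0),
]

def slope(xs, ys):
    """Slope of a simple least-squares line of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def demean_by_journal(rows):
    """Subtract each journal's own mean from its years and outcomes."""
    groups = defaultdict(list)
    for journal, year, pct in rows:
        groups[journal].append((year, pct))
    xs, ys = [], []
    for members in groups.values():
        mean_year = sum(year for year, _ in members) / len(members)
        mean_pct = sum(pct for _, pct in members) / len(members)
        for year, pct in members:
            xs.append(year - mean_year)
            ys.append(pct - mean_pct)
    return xs, ys

pooled_slope = slope([r[1] for r in records], [r[2] for r in records])
within_slope = slope(*demean_by_journal(records))
print(f"pooled slope: {pooled_slope:.3f}")
print(f"within-journal slope: {within_slope:.3f}")
```

In this toy data the pooled slope is negative even though no journal changes over time, so the apparent trend comes entirely from which journals publish in which years.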
To show this variation across journals, we graphed the distribution of the journals' average percent of European research subjects (Fig. 3). Of particular note is the number of journals that have only published studies with 100% European subjects. There is also a nontrivial minority of journals that have published studies with less than 50% European subjects.

Discussion
Based on our analysis of more than 150 studies involving genetic factors related to Alzheimer's disease, we conclude that underrepresentation of minority populations is real and continuing. We noted variation among academic journals in the proportion of non-European subjects in the databases used in published studies, a pattern that has not changed in almost a quarter century. This raises the prospect that the research may ingrain disparities in the effectiveness of clinical tools derived from genomic research.
The results confirm previous findings that the preponderance of published studies of genetic correlates of disease relies on subject pools that are non-diverse and skewed toward subjects of European ancestry (Prasanna et al. 2019). While our investigation focused on studies of genetic correlates of Alzheimer's disease, our findings are consistent with analyses of other conditions. The reliance on subjects of European ancestry is more evident when measured by numbers of subjects than by numbers of studies, which suggests that larger GWASs are the most likely to rely on non-diverse subject pools. Yet these are the studies for which broad genetic representation is most important.
We also found that subject diversity in genetic studies has not changed appreciably over time dating back to 1996. It is notable that there is considerable variation in subject diversity between journals but not within journals. It appears, therefore, that journals are consistent over time in the extent of diversity in the studies they publish. This suggests that promoting the dissemination of research with diverse subject pools is not only theoretically possible; several journals are already doing it.
In addition to concerns over clinical effectiveness, lack of inclusion of Black subjects and other minority groups in genomic research may engender mistrust in precision medicine and other genetically based therapies. Racial disparities in care already pervade much of the American health care system (Roberts 2018). This has diminished trust in medicine by many Black patients (Allen 1997;Idan et al. 2020). Lack of racial representation in the research used to develop new genetically based clinical tools could reinforce disparities in their use and foster further inequities in American health care (Alsan et al. 2022).
This study has several limitations. Most notably, it was limited to a review of genetic investigations involving Alzheimer's disease, which is a small subset of all genetic studies. It is possible that studies in other fields would display different trends. In addition, our study measured demographic diversity as European origin versus origin in all other continents. It is possible that genetic diversity within these broad geographic categories is more important to the applicability of findings than diversity between them. Nevertheless, our study indicates that underrepresentation of members of historically excluded communities exists in an important field of genetic research, to their possible disadvantage in medical interventions based on that research. It should be investigated in other fields, as well.
Based on our finding that underrepresentation of historically excluded populations has persisted over time in one area of genetic research, strategies are called for to address it. One approach would be for journals to more carefully consider genetic diversity in the research databases used in the studies they publish, for example, by promoting consideration of diversity as a limitation on the generalizability of findings. Journals serve as gatekeepers for research diffusion by determining which findings are published and by setting the preconditions for publication. Their ability to use these powers to promote awareness of research on racial inequities in health care has been noted by others (Boyd et al. 2020; Krieger et al. 2021). Genomic medicine holds tremendous promise for curing diseases and saving lives, but in doing so, it should not extend the disparities that are already endemic in American health care.

Fig. 3 Histogram of mean percent of research subjects with European ancestry by journal. Each journal is counted once in the graph, indicating the average of all genetic studies on Alzheimer's research published in that respective journal.