BioVU fibroid case and control subjects were selected as previously described (Bray, Edwards, Wellons, Jones, Hartmann and Velez Edwards, 2017; Feingold-Link et al., 2014). Briefly, the BioVU repository is a collection of stored DNA linked to de-identified EHRs at Vanderbilt University Medical Center, a resource that currently includes more than 240,000 samples for the investigation of phenotype-genotype associations (Roden et al., 2008). Fibroid cases and controls were selected from female BioVU participants over the age of 18 with at least one record of pelvic imaging. Individuals with an International Classification of Diseases, Ninth Revision (ICD-9) diagnostic code for uterine fibroids were selected as cases (n = 1,195 White cases, 583 Black cases), while individuals without the code, with a second pelvic image, and with no history of hysterectomy, myomectomy, or uterine artery embolization were selected as controls (n = 1,164 White controls, 797 Black controls). A comparison with manually reviewed records indicated a 96% positive predictive value and a 98% negative predictive value. Measurements of fibroid characteristics were manually abstracted from pelvic imaging reports and surgical reports. These characteristics include fibroid volume (n = 396 White subjects, 450 Black subjects), largest dimension (n = 579 White subjects, 450 Black subjects), and presence of multiple fibroids (i.e., single vs. multiple; n = 356 White single-fibroid subjects, 359 multiple-fibroid White subjects, 192 Black single-fibroid subjects, 258 multiple-fibroid Black subjects).
The study was approved by the Institutional Review Board at Vanderbilt University Medical Center (#110407).
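The predictive values reported for the case-control selection algorithm come from comparing algorithm labels with manual record review. A minimal sketch of that calculation follows. The counts are hypothetical and chosen only for illustration, because the underlying confusion matrix is not reported here.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts.

    tp / fp: algorithm-positive subjects confirmed / refuted by manual review.
    tn / fn: algorithm-negative subjects confirmed / refuted by manual review.
    """
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Hypothetical counts, not taken from the study.
ppv, npv = predictive_values(tp=96, fp=4, tn=98, fn=2)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```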
SNP genotyping and quality control
Fibroid cases and controls were genotyped as previously described (Giri, Edwards, Hartmann, Torstenson, Wellons, Schreiner and Velez Edwards, 2017). Briefly, subjects were genotyped using the Affymetrix Axiom Biobank array (Affymetrix, Inc., Santa Clara, CA) and the Axiom World Array 3 (Affymetrix, Inc., Santa Clara, CA). DNA was purified and quantitated by PicoGreen (Invitrogen, Inc., Grand Island, NY). Standard quality control measures were applied using PLINK2 (Chang et al., 2015). Sample exclusion criteria included genotypic duplicates, excess heterozygosity, call rate below 95%, and discordance between genetically-inferred sex and database sex. Closely related individuals identified by identity-by-descent (IBD) sharing were removed. Variants with low call rate (<95%) were excluded from subsequent analyses. Genotype data were pruned for linkage disequilibrium (LD) using a window size of 50 base pairs (bp) shifting by ten bp at an r² threshold of 0.1.
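The variant call-rate filter and LD pruning described above can be illustrated with a small pure-Python sketch. It is not the PLINK implementation, which differs in detail. Two assumptions apply: the window and step are counted in variants rather than base pairs, and the later variant of each correlated pair is the one dropped. Genotypes are coded as allele dosages (0, 1, 2), with None marking a missing call.

```python
from statistics import mean


def call_rate(calls):
    """Fraction of non-missing genotype calls (None marks a missing call)."""
    return sum(c is not None for c in calls) / len(calls)


def r_squared(x, y):
    """Squared Pearson correlation of two dosage vectors over jointly called samples."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        return 0.0
    mx = mean(a for a, _ in pairs)
    my = mean(b for _, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic variant: correlation undefined, treated as unlinked
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    return sxy ** 2 / (sxx * syy)


def ld_prune(variants, window=50, step=10, r2_max=0.1):
    """Greedy windowed pruning; returns indices of retained variants."""
    keep = [True] * len(variants)
    for start in range(0, len(variants), step):
        stop = min(start + window, len(variants))
        idx = [i for i in range(start, stop) if keep[i]]
        for pos, i in enumerate(idx):
            if not keep[i]:
                continue
            for j in idx[pos + 1:]:
                if keep[j] and r_squared(variants[i], variants[j]) > r2_max:
                    keep[j] = False  # drop the later variant of the pair
    return [i for i, k in enumerate(keep) if k]


# Toy data: four variants typed in six samples.
variants = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],        # perfect LD with variant 0
    [2, 0, 1, 1, 0, 2],        # uncorrelated with variant 0
    [0, None, None, 1, 2, 0],  # call rate 4/6, fails the 95% filter
]
passing = [i for i, v in enumerate(variants) if call_rate(v) >= 0.95]
kept = [passing[k] for k in ld_prune([variants[i] for i in passing])]
print(kept)
```

In this toy example only variants 0 and 2 remain: variant 3 fails the call-rate filter and variant 1 is removed as redundant with variant 0.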
1000 Genomes reference genotype data were downloaded from the UCSC server (http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/). Genotype data for 1000 Genomes samples were pruned for LD using a window size of 50 bp shifting by ten bp at an r² threshold of 0.1. Variants with low call rate (<95%) were excluded from subsequent analyses. Genotype data were then randomly thinned to include 100,000 variants. For analysis of geographic ancestry proportions, LD-pruned genotype data for cases and controls were merged separately for Black and White subjects with reference genotype data. Variants with low call rate (<95%) in each merged set were excluded from subsequent analyses. Merged genotype data were then randomly thinned to include 100,000 variants.
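Random thinning to a fixed number of variants amounts to drawing a random subset without replacement. The sketch below shows one way to do this while preserving the original variant order. The function name and the fixed seed are assumptions added for reproducibility of the illustration; the text does not state how the thinning was seeded or implemented.

```python
import random


def thin_variants(variant_ids, n_keep, seed=0):
    """Return a random subset of n_keep variant IDs, preserving input order."""
    if n_keep >= len(variant_ids):
        return list(variant_ids)
    rng = random.Random(seed)  # seed is an assumption, used only for repeatability
    chosen = set(rng.sample(range(len(variant_ids)), n_keep))
    return [v for i, v in enumerate(variant_ids) if i in chosen]


# Toy example: thin 1,000 hypothetical variant IDs down to 100.
all_ids = [f"var{i}" for i in range(1000)]
thinned = thin_variants(all_ids, n_keep=100)
print(len(thinned))
```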
Assessment and cleaning of genetically-inferred reference ancestries
1000 Genomes samples from each reference population (n=26) were randomly partitioned into training and testing sets. Supervised ADMIXTURE (Alexander et al., 2009) analysis (K=26), specifying the population for each training set and estimating ancestry proportions in each testing set, was used to identify heterogeneous populations. Analysis showed sharing within, but not between, populations corresponding to the five continental ancestries, with two exceptions: sharing between Black and White populations and sharing between East and South Asian populations (Supplementary Figure 1). Populations were excluded from subsequent analysis if ancestry proportions for the specified training set population were below 60% in the testing set (Supplementary Table 1). Six 1000 Genomes reference populations were excluded from subsequent analysis due to heterogeneity. These populations included Americans of African Ancestry in the southwestern USA (ASW), Southern Han Chinese (CHS), British in England and Scotland (GBR), African Caribbeans in Barbados (ACB), Kinh in Ho Chi Minh City, Vietnam (KHV), and Indian Telugu from the UK (ITU) samples (Supplementary Table 1). Additionally, admixed American populations (Mexican Ancestry from Los Angeles, USA [MXL], Puerto Ricans from Puerto Rico [PUR], Colombians from Medellin, Colombia [CLM], and Peruvians from Lima, Peru [PEL]) were excluded from further analysis.
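The partitioning and exclusion rule above can be sketched as follows. Several details are assumptions made for illustration: the even training/testing split, the fixed seed, and the use of a single summary proportion per population for the 60% rule. None of these is specified in the text. The proportions in the example are hypothetical, not values from Supplementary Table 1.

```python
import random


def split_samples(sample_ids, train_fraction=0.5, seed=0):
    """Randomly partition sample IDs into (training, testing) lists."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


def heterogeneous_populations(own_label_proportion, threshold=0.60):
    """Return populations whose testing-set samples were assigned less than
    `threshold` ancestry from their own training-set label."""
    return sorted(p for p, v in own_label_proportion.items() if v < threshold)


train, test = split_samples([f"sample{i}" for i in range(10)])

# Hypothetical proportions for illustration only.
example = {"YRI": 0.93, "ASW": 0.41, "FIN": 0.88, "GBR": 0.35}
print(heterogeneous_populations(example))
```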
Genotype data for 1000 Genomes samples were analyzed using ADMIXTURE (Alexander, Novembre and Lange, 2009) at several values of K to determine the maximum number of ancestries that could be resolved by the software. Cross-validation error decreased for K between one and five, stabilized for K between five and ten, and began to increase for K greater than ten (Supplementary Figure 2). Subjects from the remaining 1000 Genomes populations were divided into six geographic populations. East African (EAFR) included Luhya in Webuye, Kenya (LWK) samples. West African included Gambian in Western Divisions in the Gambia (GWD), Esan in Nigeria (ESN), Mende in Sierra Leone (MSL), and Yoruba in Ibadan, Nigeria (YRI) samples. Northern European included Finnish in Finland (FIN) and Utah Residents (CEPH) with Northern and Western European ancestry (CEU) samples. Southern European included Iberian Population in Spain (IBS) and Toscani in Italia (TSI) samples. East Asian included Chinese Dai in Xishuangbanna, China (CDX), Han Chinese in Beijing, China (CHB), and Japanese in Tokyo, Japan (JPT) samples. South Asian included Punjabi from Lahore, Pakistan (PJL), Bengali from Bangladesh (BEB), Sri Lankan Tamil from the UK (STU), and Gujarati Indian from Houston, Texas (GIH) samples.
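For reference, the grouping of the retained 1000 Genomes populations into six geographic populations can be written as a simple lookup. The population codes and group names are those given above; the variable names are illustrative.

```python
# Six geographic reference groups and their 1000 Genomes population codes.
REFERENCE_GROUPS = {
    "East African": ["LWK"],
    "West African": ["GWD", "ESN", "MSL", "YRI"],
    "Northern European": ["FIN", "CEU"],
    "Southern European": ["IBS", "TSI"],
    "East Asian": ["CDX", "CHB", "JPT"],
    "South Asian": ["PJL", "BEB", "STU", "GIH"],
}

# Reverse lookup: population code -> geographic group.
POP_TO_GROUP = {
    pop: group for group, pops in REFERENCE_GROUPS.items() for pop in pops
}

print(POP_TO_GROUP["YRI"])
```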
Analysis of geographic ancestry proportions in BioVU
Unsupervised ADMIXTURE analysis (K=6) of 1000 Genomes reference genotype data from each merged set (Black women and White women) was performed, and ancestry proportions for each of the six reference groups were calculated (Supplementary Tables 2 and 3). These ancestry proportions were then projected onto BioVU fibroid case and control samples in ADMIXTURE using their genotype data from the respective merged sets. Mean ancestry proportions are presented in Table 1.
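ADMIXTURE writes per-sample ancestry proportions to a whitespace-delimited Q file with one row per sample and one column per ancestry. A minimal sketch of summarizing such output into mean proportions is given below. The column-to-group ordering is an assumption that must be established from the reference samples in practice, and the two rows shown are hypothetical.

```python
def mean_ancestry(q_text, labels):
    """Average each ancestry column of ADMIXTURE Q-file text across samples."""
    rows = [[float(x) for x in line.split()]
            for line in q_text.strip().splitlines()]
    if not rows or any(len(r) != len(labels) for r in rows):
        raise ValueError("each row must have one value per ancestry label")
    n = len(rows)
    return {lab: sum(r[i] for r in rows) / n for i, lab in enumerate(labels)}


# Assumed column order; in practice this is inferred from reference samples.
labels = ["East African", "West African", "Northern European",
          "Southern European", "East Asian", "South Asian"]

# Hypothetical Q-file content for two samples.
q_text = """
0.10 0.70 0.05 0.05 0.05 0.05
0.20 0.60 0.05 0.05 0.05 0.05
"""
means = mean_ancestry(q_text, labels)
for lab in labels:
    print(f"{lab}: {means[lab]:.3f}")
```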
Association of geographic ancestry proportions with fibroid status and fibroid traits
Associations with global genetic ancestry proportions were computed using R (R Core Team, 2015). Dichotomous fibroid outcomes (case/control status and single vs. multiple fibroids) were modeled using logistic regression against each ancestry proportion, separately for Black and White subjects. Continuous fibroid traits (fibroid volume and largest fibroid dimension) were modeled using linear regression against each ancestry proportion, separately for Black and White subjects. Continuous outcomes were log10-transformed to approximate normality. All models were adjusted for age. Effect estimates are reported per 10% increase for a given inferred ancestry proportion.
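Reporting effects per 10% increase in ancestry is a rescaling of the fitted coefficient. The sketch below assumes the ancestry proportion entered the model on a 0 to 1 scale, so that the fitted coefficient corresponds to a change from 0% to 100%. That coding is an assumption; the predictor may equally have been rescaled before fitting. The function names are illustrative.

```python
import math


def odds_ratio_per_10pct(beta_per_unit):
    """Odds ratio for a 10% (0.1) increase in ancestry proportion,
    given a logistic-regression coefficient on the 0-1 scale."""
    return math.exp(0.1 * beta_per_unit)


def fold_change_per_10pct(beta_per_unit):
    """Multiplicative change in a log10-transformed outcome for a 10% (0.1)
    increase in ancestry proportion, given a linear-regression coefficient."""
    return 10 ** (0.1 * beta_per_unit)


# Hypothetical coefficients for illustration only.
print(round(odds_ratio_per_10pct(1.5), 3))
print(round(fold_change_per_10pct(0.8), 3))
```

On the log10 scale, a coefficient of b per unit of ancestry proportion corresponds to a fold change of 10^(0.1 b) in the untransformed trait per 10% increase.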