Genotype-based “virtual” metabolomics in a clinical biobank identifies novel metabolite-disease associations

Circulating metabolites act as biomarkers of dysregulated metabolism, and may inform disease pathophysiology. A portion of the inter-individual variability in circulating metabolites is influenced by common genetic variation. We evaluated whether a genetics-based “virtual” metabolomics approach can identify novel metabolite-disease associations. We examined the association between polygenic scores for 726 metabolites (derived from OMICSPRED) with 1,247 clinical phenotypes in 57,735 European ancestry and 15,754 African ancestry participants from the BioVU DNA Biobank. We probed significant relationships through Mendelian randomization (MR) using genetic instruments constructed from the METSIM Study, and validated significant MR associations using independent GWAS of candidate phenotypes. We found significant associations between 336 metabolites and 168 phenotypes in European ancestry and 107 metabolites and 56 phenotypes among African ancestry. Of these metabolite-disease pairs, MR analyses confirmed associations between 73 metabolites and 53 phenotypes in European ancestry. Of 22 metabolite-phenotype pairs evaluated for replication in independent GWAS, 16 were significant (false discovery rate p<0.05). Validated findings included the metabolites bilirubin and X–21796 with cholelithiasis, phosphatidylcholine(16:0/22:5n3,18:1/20:4) and arachidonate(20:4n6) with inflammatory bowel disease and Crohn’s disease, and campesterol with coronary artery disease and myocardial infarction. These associations may represent biomarkers or potentially targetable mediators of disease risk.


INTRODUCTION
2][3] The adverse clinical consequences of extreme disruptions of metabolite homeostasis caused by inborn errors of metabolism are well recognized. 4However, modest, long-term perturbations of metabolites attributable to common genetic variation also may contribute to disease risk.The clinical consequences of these perturbations remains incompletely de ned, but may underlie the residual risk that exists for many complex diseases that is not explained by our current knowledge of disease biology and mechanisms 5 .Identifying associations between circulating metabolites and diseases not only has the potential to identify biomarkers that can be used to risk-stratify individuals, but also to provide insight into disease mechanisms and enable targeted therapies.
][8][9][10] These data can be repurposed to develop genetic instruments of individual metabolite levels which can be used to test for associations between metabolites and disease [11][12][13] .High throughput methodologies, such as Phenome-Wide Association Studies (PheWAS), test associations between genetic instruments and large number of clinical phenotypes using Electronic Health Record (EHR)-linked DNA biobanks. 14,15These approaches can have signi cant advantages over traditional epidemiological approaches, allowing for highly-powered analyses which would otherwise be unfeasible due to cost or logistics.In this context, a 'virtual' metabolomics approach provides a powerful tool to identify candidate pathways that could be targeted to modulate diseases, and to advance risk prediction beyond standard genetic models.
To de ne the broader phenome associated with circulating metabolites, we applied a virtual metabolomics approach that leveraged a large collection of clinical phenotypes derived from Vanderbilt's BioVU EHR-linked biobank.We constructed virtual metabolomes based on metabolite polygenic scores (PGS), to identify clinical diagnoses that shared genetic modulators with metabolites.Mendelian randomization approaches were then used to better de ne the relationship between candidate metabolitephenotype pairs.Signi cant associations were further validated using external data sets.Our data shed light on multiple metabolite-disease relationships and highlight novel pathways for potential therapeutic intervention.

Predicted circulating levels of metabolites associate with a broad range of clinical phenotypes
We tested for associations among PGS for 726 metabolites and up to 1,247 clinical phenotypes in BioVU.There were 336 metabolites signi cantly associated with 168 phenotypes in European ancestry (Supplementary Table 1) and 107 metabolites that were signi cantly (false discovery rate [FDR] p < 0.05) associated with 56 phenotypes in the African ancestry individuals (Supplementary Table
Many of these associations were driven by instruments composed of only one or two SNPs, increasing the likelihood of associations due to SNPs with pleiotropic effects.We thus selected only metabolites with genetic instruments composed of 3 or more SNPs for further validation.Similarly, to avoid spurious associations driven by pleiotropy, we excluded associations with signi cant heterogeneity (p < 0.05).After applying these exclusion criteria, 47 signi cant associations (FDR < 0.05) among 32 metabolites and 34 phenotypes remained.A summary of the retained metabolite pairs is presented in Fig. 2 and Supplementary Table 3.These metabolites map to four super-pathways, with the majority mapping to lipid pathways.Distinct phenotypes with a high number of signi cant associations with metabolites In the African ancestry population, of 107 metabolites with signi cant associations in the PGS analysis, 85 had available summary statistics in the METSIM study and among unmatched metabolites, 14 were unknown.The IVWA method identi ed 22 signi cant (FDR < 0.05) associations comprising of 15 metabolites and 13 phenotypes (Fig. 1B, Supplementary Table 4).These included several associations between lipids and infectious or acute in ammatory diseases, including urinary tract infections, sepsis, and fever.
A summary of the associations between the individual SNPs used in the genetic instrument for each metabolite and the clinical phenotypes is presented for European (Supplementary Table 5) and African (Supplementary Table 6) ancestry individuals.

Validation of the signi cant association
To validate the signi cant ndings from MR, we tested associations between the metabolite genetic instruments and phenotypes with available external GWAS summary statistics.After excluding associations with signi cant heterogeneity, < 3 SNPs and non-speci c phenotypes (e.g."Other mental disorder"), there were 15 phenotypes (with 12 associated metabolites) taken forward for further validation from European ancestry (Fig. 3A).There were no suitable external GWAS datasets available to evaluate the signi cant associations in African ancestry.

DISCUSSION
Metabolites are highly relevant integrative markers of health and disease, that can inform disease prediction and pathophysiology.However, measuring metabolites in the large datasets required to robustly interrogate metabolite-phenotype associations is costly, logistically challenging, and often unfeasible.In this "virtual" metabolomics study, we leveraged a state-of-the-art genetic methods in conjunction with large, phenotypically diverse clinical and genetic data sets to interrogate an extended set of metabolites against a broad clinical phenome.Among 726 metabolites analyzed, there were 336 and 107 metabolites that showed signi cant associations among BioVU participants of European and African ancestries, respectively.Of these, 159 and 22, respectively, were associated under a MR framework using genetic instruments for metabolites constructed in an independent population, suggesting they may be mediators of disease risk.Among associations identi ed in the European ancestry population, we validated associations for 16 of 22 metabolite-phenotype pairs using phenotypes derived from independent GWAS studies.Among the validated phenotypes were IBD, cholelithiasis, CAD, MI, neutropenia and lipid phenotypes.These analyses highlight the value of applying the "virtual" metabolomic approach in diverse, phenotype-rich biobanks to identify novel associations.
We found consistent associations between gastrointestinal disease phenotypes and bioactive lipids, highlighting both in ammation and resolution of in ammation as important disease mediators.We found inverse associations between phosphatidylcholine (PC) (16:0/22:5n3, 18:1/20:4) and arachidonate (20:4n6) with IBD and Crohn's disease, both in ammatory diseases of the gut mucosa.Circulating phosphatidylcholines have been reported to be reduced in in ammatory bowel disease, suggesting that they may have a protective role in the gut mucosa. 16,17The protective effects of PCs may be attributed to anti-in ammatory action and prevention of mucosal damage 16 , with potential therapeutic application for IBD. 18It is important to identify the speci c PC involved in protecting the gut mucus against disease.One of the abundant main species of phosphatidylcholines in gut mucus is PC 16:0/18:1. 16This is consistent with our data indicating that lower genetically-predicted phosphatidylcholine (16:0/22:5n3, 18:1/20:4) associates with IBD and Crohn's disease.Arachidonate (20:4n6) was also associated with IBD.
[22] We observed several other plausible disease speci c associations.There were positive associations between bilirubin (E,E) and X-21796 and cholelithiasis (gallstone disease).A causal association has been reported between extreme levels of bilirubin and increased risk of gallstone disease. 23This could be due to increased e ux of this metabolite into bile and/or the variation in the expression of genes controlling both bilirubin levels and the disease. 23Bilirubin (E,E) is one of the water soluble isomers of bilirubin that is converted from unconjugated bilirubin (Z,Z) upon exposure to light. 24As X-21796 has an unknown identity, the associated pathway is unknown.However, SNPs associated with X-21796 map to several members of the UGT1A family of genes, which have been associated with bilirubin levels and risk of gallstones 23 , and SLCO1B, which is involved in bilirubin transport into the liver. 25.This also highlights the utility of our approach to de ne the underlying mechanistic basis of associations with unknown metabolites using the underlying genetic data, which is generally not feasible using other standard epidemiological approaches.Interestingly, the "virtual" metabolomics approach provided us with a considerable opportunity for novel discovery in relation to cardiovascular disease (CVD).Previously, a meta-analysis showed that there is no association between serum concentrations of two common plant sterols (sitosterol and campesterol) and risk of CVD. 26 However, through this large well-powered study, we found a positive association between campesterol and risk of CAD and MI.Campesterol was also strongly associated with most of the phenotypes categorized in the lipid-related disorders group.Several factors have been proposed as the potential mechanisms linking elevated concentration of campesterol and increased risk of these two diseases, including common pathways in uencing the absorption of cholesterol and plant sterols in the intestines, 27 shared genetics linking lipoproteins and phytosterols to MI and atherosclerosis, 2829 poor nutritional status, 30 and poor metabolic health. 31However, we anticipate that future analyses may validate and explore the mechanistic bases and the underlying pathophysiology of this novel nding.This unbiased discovery approach allowed us to create and validate a resource of associations which identi ed metabolites that are biomarkers and potential mediators of several other clinical phenotypes.For instance, we successfully validated an inverse association between the plasmalogen 1-(1-enylpalmitoyl)-2-oleoyl-GPC (P-16:0/18:1) and hypercholesterolemia.This metabolite was reported as inversely related to visceral adipose tissue volume and the percentage of fat in the liver and pancreas. 32 also found associations between 1-palmitoyl-2-stearoyl-GPC (16:0/18:0) and low-density lipoprotein (LDL) and total cholesterol; this metabolite has been found to be positively associated with dyslipidemia. 33Our data demonstrated that hypertriglyceridemia was positively associated with oleoyl-linoleoyl-glycerol (18:1/18:2), potentially a novel association.We also found and validated a signi cant association between phosphatidylcholine (16:0/22:5n3, 18:1/20:4) and low blood cell count (neutropenia).There were other interesting associations we were unable to validate using external data sets due to lack of available data.For instance, we observed positive signi cant associations between stearidonate (18:4n3) and 1-stearoyl-2-meadoyl-GPC (18:0/20:3n9) and nasal polyps.Dysregulated lipid metabolism has been reported in Nasal polyps. 34These metabolites potentially represent new biomarkers of this disorder.An inverse association between methylsuccinate and Alzheimer's disease (AD) was not validated, however given published data linking methylsuccinate supplementation to improvement in neuron dysfunction in AD, this may merit further study.
A signi cant strength of this study was the use of large datasets which have proven robust for discovery of SNPs associated with both metabolites and disease.A further strength is that we utilized genetic approaches that are well-validated for the applications we propose. 35,36We analyzed data from multiple sources, including independent cohorts using independent metabolite measurement platforms, and analysis in both European and African American populations where possible.This allowed us to maximize discovery through increased sample sizes and a more diverse population sample, to ensure generalizability, reproducibility and rigor of the association. 37Moreover, validating the observed associations using available external GWAS additionally strengthened our ndings.
Our study also has some limitations.An important limitation of a genetics-based association approach is that the association may not be consistent when using directly measured levels of the metabolite.This can be due to pleiotropic associations, such as when a SNP in the predictor tags a genetic locus that is associated with an outcome through a mechanism unrelated to the metabolite, or due to weak instrument bias. 3839Further, some metabolites are heavily modulated by environment and homeostatic physiology, which may mask an association.A second limitation is that we could not nd GWAS data for all the phenotype showing a signi cant association with metabolites.This limited the number of total novel ndings we could evaluate in external data sets.
In summary, we identi ed novel metabolite-phenotype associations, and con rmed known relationships between metabolites and disease.Further studies are needed to replicate and clinically validate these ndings.This study highlights the utility of a genetics-based "virtual" metabolomics approach in conjunction with DNA biobanks to link metabolites to clinical diseases and clinical diagnoses.As genetic biobanks continue to grow, the potential to discover genetic underpinnings of the metabolome will also expand.This approach can be used to identify additional metabolite-disease associations, uncover novel disease biology and move towards application in clinical populations.

Vanderbilt BioVU Study Population
Genetic and phenotypic data were obtained from BioVU, Vanderbilt University Medical Center's (VUMC) DNA Biobank linked to a de-identi ed electronic health record. 40The study population comprised individuals of genetic white European (n = 57,735) and African American (n = 15,754) ancestries, 18 years and older who had existing SNP genotyping.Genetic ancestry of individuals was determined using principal component analysis in conjunction with HAPMAP reference sets. 40,41This study was reviewed by the VUMC Institutional Review Board (IRB) in accordance with the informed consent guidelines and was determined to be non-human subjects research.

Genetic Data and Quality Control
BioVU participants were genotyped on the Illumina In nium Multi-Ethnic Genotyping Array (MEGA EX ) platform.Quality control procedures for this population have been described previously. 42Individuals with a biological sex discrepancy or who were related (one participant from each related pair [pi-hat > 0.2] was randomly excluded) were excluded.Analyses used PLINK v1.9. 43Genotype imputation was performed using IMPUTE4 44 version 2.3.0 (University of Oxford), using the 10/2014 release of the 1000 Genomes cosmopolitan reference haplotypes.Genetic variants with imputation quality scores less than 0.3 were excluded.Principal components (PCs) to adjust for residual population strati cation were generated using SmartPCA. 45

Phenotype Data
For the BioVU population, the primary analyses examined clinical diagnoses based on PheCodes (v1.2), which are derived from International Classi cation of Disease (ICD) billing codes (ICD-9-CM and ICD-10 diagnosis codes). 46,47For each phenotype, cases were de ned as participants with at least two PheCode instances in their medical record.Individuals without any closely related PheWAS codes and who fell within the observed age of the cases were used as controls.We analyzed associations for 1,247 and 600 PheCodes with ≥ 100 cases in the European and African ancestry population, respectively.

Speci cation of a Virtual Metabolome via Human Genetics
OMICSPRED: Validated PGSs for 726 metabolites were obtained from the OMICSPRED resource (www.omicspred.org). 48These PGS were developed using SNPs that signi cantly associated with concentrations of human blood metabolites in the INTERVAL cohort (n = 8,153 healthy individuals in England). 49Brie y, metabolites were measured in plasma by an untargeted mass spectrometry metabolomics platform (Metabolon HD4), and participants were genotyped using the Affymetrix Biobank Axiom array. 50Bayesian ridge regression was used to develop genetic scores for each metabolite, and scores were validated (Spearman correlation) using an independent validation INTERVAL subset (n = 8,114 non-overlapping participants, 527 validated metabolites) and an external validation cohort (ORCADES, n = 1,007 European participants, 455 validated metabolites).METSIM: SNP instruments used for validation of the OMICSPRED associations by MR analyses were derived from the independent METSIM Finnish population study using publicly available GWAS summary statistics for metabolites. 51This study included 1,391 metabolites quanti ed in 6,136 non-diabetic male participants.Summary statistics were obtained from the METSIM Metabolomics PheWeb server (https://pheweb.org/metsim-metab).

Polygenic Score Analysis
SNPs associated with each of the 726 OMICSPRED metabolites were used to calculate PGSs as a weighted sum of trait-associated alleles for BioVU subjects described above, with PLINK v2.00a3LM. 52he association between metabolite PGS and each PheCode phenotype was tested using a multivariable logistic regression model, adjusting for sex and age.All analyses were strati ed by genetic ancestry.Within each phenotype, association p-values were adjusted for multiple testing using a Benjamini-Hochberg false discovery rate (FDR) correction, (rstatix v0.7.0 R package).

Mendelian Randomization Analysis to Validate PGS associations
Phenotype and metabolite pairs that were signi cantly associated (FDR p < 0.05) with PGS through PheWAS in BioVU, were selected for MR analysis.MR tests for associations under three assumptions: (1)  the SNPs are associated with the exposure; (2) the SNPs are not associated with confounders; and (3) the SNPs affect the outcome only through the exposure. 53For these analyses, SNPs associated with the metabolite were used as the exposure instrumental variables.Genetic instruments for each metabolite were selected using a clumping algorithm that selected an LD-reduced (r-square < 0.05) set of SNPs associated with metabolites (p < 5x10 − 6 ) in the METSIM Study.The association between metaboliteassociated SNPs and the BioVU clinical phenotype of interest was computed using an additive logistic regression genetic model that adjusted for age, sex and 10 principal components (PLINK v2.00a3LM software).The IVWA, MR-Egger and weighted median methods, as implemented in the MendelianRandomization R package 54 were used to perform the analyses.Horizontal pleiotropy was determined by a low heterogeneity p-value (p < 0.05) based on the Cochran's Q statistic.P-values were adjusted for multiple testing using a Benjamini-Hochberg FDR correction, per tested phenotype.For nonpleiotropic associations (heterogeneity p > 0.05), we selected signi cant (FDR p < 0.05) metabolitephenotype pairs based on the IVWA model, that showed consistent ndings across the other MR methods.For associations with evidence of pleiotropy, we used MR-PRESSO to identify and evaluate the contributions of pleiotropic SNPs.

MR Validation in Independent disease-speci c GWAS Datasets
We validated signi cant MR associations using summary statistics from published GWAS datasets, where available.GWAS summary statistics for IBD and Crohn's disease were obtained from a metaanalysis of 59,957 individuals of European ancestry.Summary statistics for cholelithiasis were obtained from FinnGen (19,023 cases, 195,144

Figures
Figures

Figure 1 Overview
Figure 1