Electronic Health Record-Based Genome-Wide Meta-Analysis and Mendelian Randomization Identify Metabolic and Phenotypic Consequences of Non-Alcoholic Fatty Liver Disease

Non-alcoholic fatty liver disease (NAFLD) is a complex disease linked with several chronic diseases. We aimed at identifying genetic variants associated with NAFLD as well as blood biomarkers that may be causally impacted by NAFLD. We performed a genome-wide meta-analysis of four cohorts of electronic health record-documented NAFLD (8434 cases and 770,180 controls) and conrmed known susceptibility loci (GCKR, MAU2/TM6SF2, APOE, and PNPLA3). We also identied potentially new loci (LPL, FTO and TR1B1) and report an effect of lower LPL expression in adipose tissue on NAFLD susceptibility. Mendelian randomization analyses identied an effect of NAFLD on tyrosine metabolism and on blood levels of three proteins. Positive genetic correlations between NAFLD and cardiometabolic traits and negative genetic correlations with parental lifespan, socio-economic factors and ketone bodies were observed. Altogether, this analysis revealed novel susceptibility loci for NAFLD and early biomarkers of NAFLD that could be used to identify patients with NAFLD. 571.9, ICD10: K75.81, ICD10: K76.0 and ICD10: K76.9. Exclusion criteria included, but were not limited to alcohol dependence, alcoholic liver disease, alpha-1 antitrypsin deciency, Alagille syndrome, liver transplant, cystic brosis, hepatitis, abetalipoproteinemia, LCAT deciency, lipodystrophy, disorders of copper metabolism Reye’s syndrome, inborn errors of metabolism, HELLP syndrome, starvation and acute fatty liver (as suggested by the American Association for the Study of Liver Disease [AASLD]) Logistic regression analysis was performed on over 7 million SNPs with MAF >1% adjusted for age, sex, body mass index, genotyping site and the rst three ancestry based principal components. We performed a new GWAS for NAFLD in the UK Biobank (data application number 25205). NAFLD diagnosis was established from hospital records (ICD10: K74.0 and K74.2 (hepatic brosis), K75.8 (NASH), K76.0 (NAFLD) and ICD10: K76.9 (other specied diseases of the liver). Exclusion criteria were the same as those used in the eMERGE study. In the UK Biobank genome-wide genotyping was available for over 28 million genetic markers directly genotyped or imputed by the Haplotype Reference Consortium (HRC) panel. We used the SAIGE (Scalable and Accurate Implementation of Generalized Mixed Models) method to performed the GWAS. This method is based on generalized mixed models and was developed to control for case-control imbalance, sample relatedness and population structure. In this analysis, sex, age and the 10 main ancestry-based principal components were used as covariates. This UK Biobank analysis included 2558 NAFLD cases and 395,241 controls. We also performed a GWAS for NAFLD using SAIGE in the Estonian Biobank. This study and the use of data from 4119 cases and 190,120 controls was approved by the Research Ethics Committee of the University of Tartu (Approval number 288/M-18). We used the same case denition and inclusion/exclusion


Introduction
Non-alcoholic fatty liver disease (NAFLD) is one of the most prevalent chronic liver diseases. 1,2 According to recent estimates, about 25% of the adult population worldwide may have NAFLD. 3,4 This disease has been predicted to become the most frequent indication for liver transplantation in Western countries by 2030. 5 NAFLD is a progressive liver disease with potential consequences for several other chronic disorders such as cardiovascular disease (CVD) (the leading cause of death in patients with NAFLD), 6-9 type 2 diabetes (T2D), 10,11 dyslipidaemia 12 and other extrahepatic manifestations such as chronic kidney disease 13 and gastrointestinal neoplasms. 14 According to the National Institutes of Health U.S. National Library of Medicine, there are currently more than 300 ongoing randomized clinical trials (RCTs) enrolling patients with NAFLD. Such RCTs are challenging because NAFLD "diagnosis" often requires invasive methods and/or imaging approaches, which are clinically burdensome and costprohibitive, especially since NAFLD has reached epidemic proportions in developing countries that may not have the clinical, nancial and infrastructural resources to identify and adequately treat patients with NAFLD. For example, liver biopsy is not only invasive and expensive but is also prone to sampling error. 15 Affordable and easily obtainable tests are required to identify NAFLD patients who may bene t from therapies under investigation. Blood biomarkers of NAFLD that are not modulated by secondary non-causal pathways, are promising candidates for the identi cation of at-risk individuals and to develop tailored therapy for NAFLD.
Mendelian randomization (MR) is a modern epidemiology investigation technique that is increasingly used to explore whether risk factors associated with disease traits re ect true causal associations or not. 16 Akin to a RCT, MR takes advantage of the random allocation of genetic variation at conception to determine the consequences of human traits that are at least in part under genetic control. MR has also been used to determine whether a genetic susceptibility to certain chronic diseases in uences other biological traits such as the blood proteome or the blood metabolome, thereby identifying early biomarkers of disease-related traits. 17,18 High-quality MR studies rely on the availability of the standardized effect sizes of the top genetic variants associated with a trait of interest when this trait is used as an exposure and on the availability of genome-wide association studies (GWAS) summary statistics with this trait is used as an outcome. Although GWASes have identi ed genetic variants associated with liver fat accumulation 19,20 , liver enzymes 21 and different forms of liver diseases 22,23 , less than a handful of small GWAS sought to identify genetic variants associated with a clinical diagnosis of NAFLD. The GWAS of the Electronic Medical Records and Genomics (eMERGE) network included 1106 NAFLD cases and 8571 controls identi ed only one NAFLD susceptibility locus (PNPLA3). The NAFLD GWAS of the UK Biobank included 1664 NAFLD cases and 400,055 controls identi ed only two regions robustly associated with NAFLD (PNPLA3 and PBX4/TM6SF2). The UK Biobank analysis did not exclude participants with secondary causes of NAFLD (such as hepatitis or alcoholism) and used a rather vague de nition of NAFLD (phecode 571.5: other forms of nonalcoholic liver disease). Genetic variation at these two loci also are also associated with NAFLD in the data freeze #4 of the FinnGen cohorts (651 NAFLD cases and 176,248 controls).
Here, we performed a meta-analysis of electronic health record (EHR)-based GWAS to identify genetic variants robustly associated with NAFLD. This analysis included GWAS summary statistics from the eMERGE and FinnGen cohorts, an updated NAFLD GWAS in the UK Biobank (2558 cases and 395,241 controls) and a new GWAS performed in the Estonian Biobank (4119 cases and 190,120 controls) for a total of 8434 NAFLD cases and 770,180 controls. We then used a MR study design to identify novel blood proteins and metabolites that may causally be in uenced by the presence of NAFLD and a combination of genetic correlation analysis and phenome-wide MR study to identify other traits and human diseases that may be in uenced by NAFLD or variants in uencing NAFLD susceptibility.

Results
Identi cation of genetic variants associated with non-alcoholic fatty liver disease The study design is presented in Figure 1. In order to identify independent genetic variants robustly associated with NAFLD and suitable for MR analyses, we rst performed two new GWASes in the UK Biobank and Estonian Biobank and performed a meta-analysis of four cohorts (UK Biobank, Estonian Biobank, eMERGE and FinnGen) totalling 8434 NAFLD cases, all identi ed through electronic health records, and 770,180 controls. We identi ed four genetic loci that harboured at least one SNP that passed the genome-wide signi cance threshold of p≤5x10 -8 (TRIB1, MAU2 [TM6SF2], APOE and PNPLA3). Figure 2A presents the Manhattan plot of the NAFLD GWAS meta-analysis identifying genetic regions with a p-value for association with NAFLD ≤5x10 -8 . The associated quantile-quantile plot is presented in Supplementary Figure 1. In order to add more SNPs to our genetic instruments and to identify potentially new relevant NAFLD genetic loci, we used a Bayesian approach (bGWAS) recently described by Mounier and Kutalik 24 . This method seeks to identify new variants associated with complex diseases using inference from risk factors of these diseases. By leveraging GWAS summary statistics from risk factors likely causally associated with NAFLD in a previous MR study 25 (T2D, body mass index [BMI] and triglyceride levels) as priors, this analysis revealed genetic variation at three additional loci (GCKR, LPL, and FTO) associated with NAFLD (Supplementary Table 1). Variation at these new loci act through selected NAFLD risk factors on Bayes Factors ( Figure 2B), rather than through direct effects ( Figure 2C) or posterior effects ( Figure 2D). The association of lead SNPs at these loci with NAFLD as well as those from the conventional GWAS are presented in Supplementary Table 2 in each cohort separately and in the GWAS meta-analysis. Because some of these SNPs showed evidence of heterogeneity, p-values are presented from xed effects and random effects meta-analysis. This table also presents the F statistic for instrument strength for each SNPs. These range from 16.3 to 212.0. Altogether, through a combination of conventional GWAS and risk factor informed GWAS, our analysis identi ed genetic variation at seven loci that may in uence susceptibility to NAFLD.
Evaluation of the functionality of variants associated with NAFLD Some of the top variants linked with NAFLD in this analysis may have functional consequences. For instance, the rs1260326 at GCKR is a missense variant (p.P446L). The rs1260326 at APOE is also a missense variant (p.R130C).
The lead variant at MAU2/TM6SF2 rs73001065 is in linkage disequilibrium (r 2 =0.90) with the missense variant p.E167K at TM6SF2 and the lead variant at PNPLA3 is in high linkage disequilibrium (r 2 =0.98) with the missense variant p.I148M at PNPLA3. Table 1 presents the details of these results as well as the effect of other previously associated variants with NAFLD (p.A165T at MTARC1, a splice variant HSD17B13 and another variant at MBOAT7). This analysis con rmed previous NAFLD functional variants at MTARC1 and MBOAT7, but not at HSD17B13. Genetic variation at the PNPLA3, TM6SF2, APOE and GCKR have been linked with NAFLD-related traits in previous studies 26,27 . Recent studies identi ed APOE, TR1B1 and FTO as potential new loci for liver enzymes 28,29 . To our knowledge, our study is the rst to link variation at these loci with a clinical diagnosis of NAFLD and to identify LPL as a potential new susceptibility locus for NAFLD. Interestingly, the minor allele (C) at rs13702 associated here with a protection against NAFLD has been predicted to disrupt a micro RNA recognition element seed site for human micro RNA miR-410, resulting in higher LPL expression 30 . We therefore sought to determine whether genetically-predicted LPL expression was associated with NAFLD. We performed a transcriptome-wide association study for NAFLD to map genetically-regulated genes from the Genotype Tissue Expression (GTEx, version 8) consortium 31 with NAFLD using S-PrediXcan. This analysis did not reveal new NAFLD genes outside those who had a genome-wide signal such as PNPLA3 and TM6SF2 (data not shown). Genetically-predicted LPL expression could be estimated in 11 tissues. The association between genetically-predicted LPL expression in these 11 tissues and NAFLD is presented in (Supplementary Table 3). This analysis suggests a negative association between genetically-predicted LPL expression in subcutaneous adipose tissue and NAFLD (p=3.1e-04). The LocusCompare plot ( Figure 3) further suggests shared genetic etiology at this locus with the rs13702 variant being signi cantly associated with both subcutaneous adipose tissue expression of LPL and NAFLD 32 . In summary, most of the seven SNPs identi ed in this analysis or SNPs in close proximity may be considered as functional.

Association of variants associated with NAFLD with NAFLD related phenotypes
We investigated the effect of these variants in another cohort and with NAFLD related traits such as liver fat accumulation and liver enzymes in the UK Biobank. In the Mass General Brigham Biobank, 4312 patients with nonalcoholic steatohepatitis (NASH) or NAFLD (diagnosed by computed tomography and/or magnetic resonance imaging) were compared to 26,404 controls. The direction of the effects of the seven SNPs were concordant with those observed in the GWAS meta-analysis. All SNPs were nominally associated with NAFLD in the Mass General Brigham Biobank, with the exception of the variants at the FTO and at the LPL loci (Supplementary Table 4). Liver fat accumulation in the UK Biobank was quanti ed via machine learning of abdominal MRI images as previously described. 33 We analyzed liver fat accumulation as a continuous trait and as a categorical trait (liver fat ≥5.5%) in 32,976 study participants. The direction of the effects of the seven SNPs on liver fat accumulation were concordant with those observed in the GWAS meta-analysis and all SNPs were nominally associated with liver fat accumulation (whether considered as a continuous or as a categorical trait), with the exception of the variant at the LPL locus (Supplementary Table 4). Finally, the association between the seven variants associated NAFLD with the liver enzymes ALT (alanine aminotransferase), AST (aspartate aminotransferase), GGT (gamma-glutamyl transferase) and ALP (alkaline phosphatase) was investigated in 361,194 participants of the UK Biobank. Results presented in Supplementary Table 4, suggest that all variants were positively associated with liver enzymes, except that the variant at GCKR was not associated with ALT levels, the variant at APOE was not associated with AST levels and the variant at PNPLA3 was not associated with GGT levels. Variants at the GCKR, LPL, TRIB1 and APOE were positively associated with ALP levels, the variant at FTO was not associated with ALP levels and the variants at MAU2/TM6SF2 and PNPLA3 were negatively associated with ALP levels. Overall, results of this analysis suggest that the seven variants associated with NAFLD are associated with NAFLD-related traits such as liver fat accumulation and/or liver enzymes.
Impact of non-alcoholic fatty liver disease variants on the blood metabolome and proteome In order to explore the metabolic effect of variants in uencing NAFLD, we investigated the impact of these variants on the blood metabolome using GWAS summary statistics on 123 blood lipids, lipoproteins and metabolites measured in 24,925 individuals from 10 European cohorts, as described by Kettunen et al. 34 and on the blood proteome using GWAS summary statistics from ve large-scale protein datasets [35][36][37][38] included in the Phenoscanner v2 39 . Figure 4 presents the impact of the NAFLD variants on the blood metabolome. This analysis showed that the lead NAFLD variant at the GCKR locus was associated with several blood metabolites such as higher lipoprotein/lipid levels, branched chain amino acids and other amino acids as well as glycolysis and gluconeogenesis metabolites.
The lead NAFLD variants at the LPL, TRIB1 and MAU2/TM6SF2 loci were also associated with higher lipoprotein/lipid levels while the lead NAFLD variant at the APOE locus was associated with lower lipoprotein/lipid levels. The impact of NAFLD associated variants at the FTO and PNPLA3 loci did not appear to have a major in uence on the blood metabolome. Supplementary Table 5 presents the impact of the NAFLD variants on blood proteins (with p-value for association <1e -05 ). Variants in GCKR, TRIB1, MAU2/TM6SF2 and APOE were associated with plasma levels of nine, three, two and 61 plasma proteins, respectively, while variants in LPL, FTO and PNPLA3 did not show associations with any of the circulating proteins at that threshold. Results of this analysis suggests that the variants in uencing NAFLD might have divergent effect on the blood metabolome and a trivial effect on the blood proteome, with the exception of APOE.
Mendelian randomization analysis on the impact of non-alcoholic fatty liver disease on the blood metabolome We used the top SNPs from each of the four loci that showed association with NAFLD in the conventional GWAS and the top SNPs from each of the three loci showed association with NAFLD using the risk factor informed GWAS to create a multilocus genetic instrument for genetically-predicted NAFLD. We performed a two-sample MR analysis to determine the impact of NAFLD (rather than each variant investigated individually) on the blood metabolome. For this purpose, we used the GWAS summary statistics of 123 blood lipids as described above. Using IVW-MR, we found that genetically-predicted NAFLD was robustly associated with higher levels of tyrosine after correction for falsediscovery rate (pFDR<0.10) with the Benjamini-Hochberg method ( Figure 4 and Supplementary Table 6). We also found an association between NAFLD and the tyrosine precursor phenylalanine, although this association did not pass the FDR-corrected statistical signi cance threshold. The association between NAFLD and tyrosine and phenylalanine levels was consistent across MR methods and robust to outliers and pleiotropy (Table 2 and Supplementary Figure 2). Because there was sample overlap between the exposure (genetically predicted NAFLD) and outcomes (blood metabolites), with the Estonian Biobank contributing to both datasets, we redid the NAFLD We next investigated whether the presence of NAFLD was associated with higher plasma levels of tyrosine and phenylalanine in the Estonian Biobank. Tyrosine and phenylalanine levels were measured in 10,809 individuals including 359 patients with NAFLD (obtained from EHR). Supplementary Figure 3 presents the distribution of tyrosine and phenylalanine levels in cases and controls. Supplementary Table 7 presents the association between tyrosine and phenylalanine levels per one-standard deviation increment before and after multivariable adjustment. After adjusting for age, sex, smoking, education, and BMI, tyrosine levels, but not phenylalanine levels were positively associated with the presence of NAFLD in the Estonian Biobank (odds ratio per 1-SD increment = 1.23 (95% con dence interval = 1.12-1.36, p = 2.19e-05), further suggesting that NAFLD might in uence tyrosine metabolism. Impact of non-alcoholic fatty liver disease on the blood proteome We used a similar approach as described above to determine the impact of genetic exposure to NAFLD on the blood proteome using GWAS summary statistics on >3000 circulating blood proteins from the INTERVAL study. 37 After FDR correction, we found that NAFLD was associated with higher levels of three circulating proteins: alpha L-iduronidase (encoded by the IDUA gene), glutathione-S-transferase A1 (encoded by the GSTA1 gene), alcohol dehydrogenase 4 (encoded by the ADH4 gene) ( Figure 5A and Supplementary Table 8). The association between NAFLD and plasma levels of these circulating proteins was consistent across MR methods and robust to outliers and pleiotropy (Table 2 and Supplementary Figure 4). In order to gain insight into potential tissue speci city of the genes encoding these proteins, we obtained the tissue-speci c gene expression metric (Tau) from the Genotype-Tissue Expression dataset resource, as described by Kryuchkova-Mostacci and Robinson-Rechavi. 40 Genes with evidence of tissue-speci c expression have a Tau value closer to 1 while ubiquitous genes have a Tau value closer to 0. This analysis revealed that two of the genes (ADH4 and GSTA1) encoding circulating proteins that may causally be in uenced by NAFLD had tissue-speci c expression (Tau ≥0.80) ( Figure 5B). Altogether, this analysis revealed additional proteins that are in uenced by the presence of NAFLD and that may represent new biomarkers of NAFLD.
One of the key assumptions of MR is that genetic variants used as a proxy of the exposure in uence the outcome via their effect on the exposure and not via other related traits (horizontal pleiotropy). To identify NAFLD genetic instruments, we leveraged prior effect sizes of NAFLD-related traits, which might increase the chance of nding associations that may be in uenced by NAFLD related traits and not by NAFLD per se. We therefore investigated the impact of genetically-predicted NAFLD on blood levels of tyrosine, alpha-L-iduronidase, glutathione S-transferase A1 and alcohol dehydrogenase 4 using all independent (one SNP per region) NAFLD SNPs (with p-value for association <5e-06) using multiple methods (as in Table 2). Results presented in Supplementary Table 9 show strong associations with the other genetic instrument, suggesting that the effect of genetically-predicted NAFLD on these biomarkers may not be driven by horizontal pleiotropy.
Association of non-alcoholic fatty liver disease with human metabolic and phenotypic traits We performed cross-trait genetic correlation analyses between NAFLD and 240 human traits centralized in the LD Hub database. LD Hub includes GWAS publicly available summary statistics on hundreds of human traits and enables the assessment of LD score regression among those traits. Results presented in Figure 6 show high levels of genetic correlation between NAFLD and cardiometabolic traits such as obesity, insulin resistance and triglycerides and negative genetic correlation with parental lifespan, socio-economic factors and the ketone body acetoacetate.
Impact of non-alcoholic fatty liver disease variants on the human disease-related phenome In order to determine the effect of the variants associated with NAFLD on the human disease related phenome, we performed phenome-wide MR analyses in the UK Biobank and in the FinnGen cohorts (853 and 1404 disease-speci c binary traits in the UK Biobank FinnGen cohorts, respectively). Supplementary Table 10 and Supplementary Table 11 present the effect of the seven NAFLD-associated variants on disease-related traits scaled to their effect on NAFLD after correction for false-discovery rate (pFDR<0.10) with the Benjamini-Hochberg method. This analysis revealed that in accordance to its effect on lipid levels, the NAFLD-associated missense variant at the GCKR locus is associated with hypercholesterolemia. This variant is also positively associated with gout and other crystal arthropathies but negatively associated with T2D and cholelithiasis. The NAFLD-associated variants at the LPL and TRIB1 loci were positively associated with cardiometabolic diseases such as hyperlipidemia and CVD. The NAFLDassociated variant the TRIB1 locus was however negatively associated with cholelithiasis. The NAFLD-associated variant the FTO locus was positively associated with several cardiometabolic disorders such as obesity, T2D, sleep apnea, osteoarthrosis and hypertension. The NAFLD-associated variant the MAU2/TM6SF2 locus was positively associated with various forms of liver disease and T2D (UK Biobank only) and was negatively associated with hyperlipidemia in both cohorts and with CVD (in FinnGen only). The NAFLD-associated missense variant at the APOE locus is negatively associated with the presence of several neurological disorders, hypercholesterolemia and CVD and positively associated with T2D and asthma. Finally, the NAFLD-associated variant at the PNPLA3 locus is associated with several forms of liver disease such as cirrhosis and alcoholic liver damage and diabetes-related traits (T2D in UK Biobank and diabetic maculopathy in FinnGen) and negatively associated with the presence of gout and other crystal arthropathies. Altogether, this analysis identi ed several phenotypic consequences of NAFLD associated variants and revealed signi cant heterogeneity of NAFLD-raising SNPs on the human disease-related phenome.

Discussion
We performed two new genome-wide association studies for NAFLD in the UK Biobank and in the Estonian Biobank and combined these results to those of two publicly available NAFLD GWAS (from the eMERGE network and FinnGen). This GWAS meta-analysis included 8434 NAFLD cases available via EHRs and 770,180 controls, making it the largest genome-wide analysis for a clinical diagnosis of NAFLD. In combination with a risk factor-informed Bayesian GWAS, this analysis identi ed three known susceptibility loci for NAFLD (GCKR, TM6SF2 and PNPLA3) and four new candidate genetic regions for a clinical diagnosis NAFLD based on EHRs (TRIB1, LPL, FTO, APOE). We investigated the impact of the top variant at each region with the blood metabolome, the blood proteome and the human disease-related phenome. We established a MR framework aimed at identifying novel early biomarkers of NAFLD that may be causally impacted by the presence of NAFLD. This analysis revealed an intriguing effect of NAFLD on tyrosine metabolism and on the presence of three circulating blood proteins, which may represent clinical biomarkers of NAFLD.
Our conventional GWAS analysis reports that variation at the TRIB1, MAU2/TM6SF2, APOE and PNPLA3 loci may be linked with NAFLD. While genetic variants at these loci have been associated with some liver phenotypes. 22,26,27,41 , this GWAS meta-analysis revealed important information on the genetic architecture of NAFLD. Using bGWAS, our study identi ed known, and potentially new loci for NAFLD (GCKR, LPL, and FTO) that may be associated with NAFLD through their effects on NAFLD risk factors (BMI, T2D and triglycerides). Although the biological relevance of variation at the FTO locus is still a matter of debate, FTO is a well-characterized genetic locus for obesity. 42 Other studies reported an association of variants at the GCKR loci and liver fat accumulation 19 and liver enzymes 21 . Lipoprotein lipase (LPL) on the other hand is a key enzyme that regulates the catabolism of triglycerides-rich lipoproteins such as chylomicrons and very-low-density lipoproteins in adipose tissue, skeletal muscle and heart. Gain-of-function mutations in LPL were associated with lower triglyceride levels and lower risk of coronary artery disease. 43 In this study, we found a potentially causal association between genetically-predicted LPL expression in subcutaneous adipose tissue and NAFLD. These results are in line with the recent study of Maltais et al. 44 who have reported that 4 out of 10 patients with familial chylomicronemia syndrome and almost 3 out of 4 patients with multifactorial chylomicronemia syndrome (two disorders of impaired LPL function) met the criteria of NAFLD, independently of their BMI. It should be noted that although additional associations the variant at the LPL locus linked with higher NAFLD was associated with liver enzymes in the UK Biobank, it was not associated with liver fat accumulation in the UK Biobank or with NAFLD in the Mass General Brigham Biobank. Additionally, although these results did not reach the level of genome-wide signi cance, we found nominally signi cant associations at the MTARC1 and MBOAT7 loci, thereby con rming the role of these genes in the etiology of NAFLD.
Several observational studies have suggested that liver fat accumulation or NAFLD negatively impacts triglyceriderich lipoprotein metabolism, glucose-insulin homeostasis as well as branched-chain amino acid levels. [45][46][47][48][49] Sliz et al. 50 also documented the individual impact of 4 variants (at the PNPLA3, TM6SF2, GCKR and LYPLAL1 loci) on the blood metabolome and found inconsistent associations. Here, we used a similar approach to investigate the effect of the top NAFLD variant (included 3 already investigated by Sliz et al.). This analysis con rmed that the lead NAFLD variant at the GCKR locus is linked with several blood metabolites levels such as higher lipoprotein/lipid, branched chain amino acids and other amino acids as well as glycolysis and gluconeogenesis metabolites. Variation at the LPL, TRIB1 and MAU2/TM6SF2 loci were also associated with higher lipoprotein/lipid levels. On the other hand, variation at the APOE locus was associated with lower lipoprotein/lipid levels. We investigated whether the presence of NAFLD impacted lipoprotein levels and metabolites of these pathways to identify early biomarkers of NAFLD using MR. This analysis did not nd evidence of a causal association of NAFLD with triglyceride-rich lipoprotein metabolism, which is expected since some variants were associated with higher lipid levels while other variants were associated with lower lipid levels. NAFLD was not however associated with glucose-insulin homeostasis markers or branched-chain amino acids. We did however nd a signi cant impact of NAFLD on tyrosine and, to a lesser extent, its metabolic precursor phenylalanine. Although the impact of NAFLD on tyrosine metabolism has been reported decades ago 51 , our analysis adds to this body of evidence by suggesting that the impact of NAFLD on tyrosine metabolism might be a direct consequence NAFLD, and that this association might not be driven by secondary causes of NAFLD. We also reported that patients with higher blood levels of tyrosine had a higher prevalence of NAFLD in the Estonian Biobank.
Our MR analysis identi ed three proteins that may be causally impacted by NAFLD. Glutathione-S-transferase A1 (encoded by the GSTA1 gene) has been shown to be a sensitive biomarker of hepatocellular damage. 52 Alcohol dehydrogenase 4 (encoded by the ADH4 gene) is also a liver expressed enzyme that mediates oxidative pathways involved in alcohol metabolism. 53 Finally, Alpha L-iduronidase (encoded by the IDUA gene), is a lysosomal enzyme involved in glycosaminoglycan degradation. Additional studies are required to determine if these blood-based biomarkers could predict NAFLD onset or identify patients with more severe forms of liver diseases.
Previous studies have shown that NAFLD could be associated with, or predict the risk of chronic diseases such as CVD or T2D. Our genetic correlation analyses revealed associations with these diseases as well as risk factors for these diseases such as obesity and insulin resistance. We also report interesting negative correlations between NAFLD and the ketone body acetoacetate (as previously suggested in an observational study) 54 as well as parental lifespan, suggesting that NAFLD may be a critical component of long-term disease risk potentially in uencing human lifespan. Similar to the impact on the blood metabolome, our study also documented considerable heterogeneity with regards to disease-traits associated with NAFLD variants. It appears that only the LPL variant and to a lesser extent TR1B1 seemed to be associated with diseases in the same direction as the NAFLD association, thereby suggesting that targeting the LPL pathway may prevent NAFLD as well as other diseases such as hyperlipidemia and CVD without increasing the risk of other human diseases. Drugs targeting the LPL pathway under investigation for NAFLD include the angiopoietin-like protein-3 (ANGPTL3) inhibitors 55 , glucagon-like peptide-1 (GLP-1) receptor agonists 56 and dual glucose-dependent insulinotropic peptide (GIP)/GLP-1 receptor agonists 57 .
Our study has limitations. For instance, although we have excluded secondary causes of NAFLD whenever possible, an EHR-based diagnosis of complex diseases such as NAFLD might be prone to misclassi cation of cases and controls. The prevalence of NAFLD was also not available in some of the cohorts used to document the impact of NAFLD on the blood metabolome (24,925 individuals from 10 European cohorts) and the blood proteome (INTERVAL). We also did not have a validation cohort to replicate the effect of NAFLD on the blood proteome that we have identi ed nor could we determine if these biomarkers were only elevated in speci c NAFLD stages or subtypes. Studies documenting the impact of NAFLD resolution on these biomarkers could also consolidate the causal effect of NAFLD on the blood metabolome and proteome. Finally, there was also sample overlap as subjects in the UK Biobank and of the FinnGen cohorts were used to create our study exposure and were used to assess the phenotypic consequences of variants linked with NAFLD.
In conclusion, we conducted a large NAFLD GWAS based on EHRs from four cohorts to identify genetic variants of NAFLD susceptibility. We identi ed known NAFLD variants and show that variants associated with liver fat accumulation and liver enzymes may also be associated with the presence of NAFLD. Our analysis revealed a potentially causal effect of lower adipose-tissue expression of LPL and NAFLD that will need con rmation by other larger studies. We also identi ed plasma metabolites and proteins that may be causally in uenced by the presence of NAFLD, including a potential effect of NAFLD on tyrosine metabolism. These nding shed light on the metabolic consequences of NAFLD but also identi es potential early biomarkers of NAFLD that could be used to identify patients who may bene t from therapies targeting NAFLD and/or for risk strati cation in this population. Additional studies will be required to determine whether our ndings could be helpful optimize NAFLD risk prediction as well as patient recruitment for trials aiming at preventing and/or treating NAFLD.

Genome-wide association study summary statistics NAFLD
To obtain a comprehensive set of NAFLD GWAS summary statistics, we performed a GWAS meta-analysis of four cohorts: The Electronic Medical Records and Genomics (eMERGE) 58 network, the UK Biobank, the Estonian Biobank and FinnGen. The NAFLD GWAS in the eMERGE network has previously been published. 59 The study sample included 1106 NAFLD cases and 8571 controls participants of European ancestry. Of them, 396 NAFLD cases and 846 controls participants (47% males) were derived from a pediatric population and 710 NAFLD cases and 7725 controls participants (42% males) were derived from an adult population. NAFLD was de ned by the used of EHR codes (ICD9: 571.5, ICD9: 571.8, ICD9: 571.9, ICD10: K75.81, ICD10: K76.0 and ICD10: K76.9. Exclusion criteria included, but were not limited to alcohol dependence, alcoholic liver disease, alpha-1 antitrypsin de ciency, Alagille syndrome, liver transplant, cystic brosis, hepatitis, abetalipoproteinemia, LCAT de ciency, lipodystrophy, disorders of copper metabolism Reye's syndrome, inborn errors of metabolism, HELLP syndrome, starvation and acute fatty liver (as suggested by the American Association for the Study of Liver Disease [AASLD]) Logistic regression analysis was performed on over 7 million SNPs with MAF >1% adjusted for age, sex, body mass index, genotyping site and the rst three ancestry based principal components. We performed a new GWAS for NAFLD in the UK Biobank (data application number 25205). NAFLD diagnosis was established from hospital records (ICD10: K74.0 and K74.2 (hepatic brosis), K75.8 (NASH), K76.0 (NAFLD) and ICD10: K76.9 (other speci ed diseases of the liver). Exclusion criteria were the same as those used in the eMERGE study. In the UK Biobank genome-wide genotyping was available for over 28 million genetic markers directly genotyped or imputed by the Haplotype Reference Consortium (HRC) panel. We used the SAIGE (Scalable and Accurate Implementation of Generalized Mixed Models) method to performed the GWAS. This method is based on generalized mixed models and was developed to control for casecontrol imbalance, sample relatedness and population structure. In this analysis, sex, age and the 10 main ancestrybased principal components were used as covariates. This UK Biobank analysis included 2558 NAFLD cases and 395,241 controls. We also performed a GWAS for NAFLD using SAIGE in the Estonian Biobank. This study and the use of data from 4119 cases and 190,120 controls was approved by the Research Ethics Committee of the University of Tartu (Approval number 288/M-18). We used the same case de nition and inclusion/exclusion criteria as in the UK Biobank. Age, sex and the 10-main ancestry-based PCs were used as covariates. Finally, SAIGE was also used to obtain GWAS summary statistics of the FinnGen cohort. In this study, GWAS was performed using over 16 million genetic markers genotyped with the Illumina or Affymetrix arrays or imputed using the population speci c SISu v3 reference panel. Variables included in the models were sex, age, the 10-main ancestry-based principal components and genotyping batch. In the FinnGen data freeze 4 (November 30, 2020), 651 patients had a NAFLD diagnosis (EHR code K76.0). They were compared to 176,248 controls. We performed a xed-effect GWAS meta-analysis of the eMERGE, UK Biobank, FinnGen and Estonian Biobank cohorts using the METAL package. 60 When variants showed evidence of heterogeneity, we performed a random effect meta-analysis. A total of 6,797,908 SNPs with a minor allele frequency equal or above 0.01 were investigated.

Risk-factor informed Bayesian genome-wide association study
We used bGWAS to identify more SNPs associated with NAFLD. 24 The aim of bGWAS is to identify new variants associated with complex diseases using inference from risk factors of focal traits. We used GWAS summary statistics from three risk factors causally associated with NAFLD in a previous MR study 25 (T2D, BMI and triglyceride levels) as priors and worked with default parameters of the package. GWAS summary statistics for these risk factors are included in the bGWAS package. These were obtained from the Global Lipids Genetic Consortium, Genetics of Anthropometric Traits (GIANT) and the Diabetes Genetics Replication and Meta-analysis (DIAGRAM) consortia. Brie y, bGWAS derives informative prior effects from these risk factors and their causal effect on NAFLD using multivariable MR. Prior estimates (mu) are calculated for each SNP by multiplying the SNP-risk factor effect by the SNP-NAFLD causal effect estimates. By combining observed effects from the NAFLD GWAS meta-analysis and prior effects, Bayes factors, posterior effects and direct effects and their corresponding p-values are generated.
Transcriptome-wide association study of NAFLD Tissues from the GTEx consortium (version 8) with less than 70 samples were not used to provide su cient statistical power for eQTL discovery, resulting in a set of 48 tissues. Only non-sex-speci c tissues (N=43) were analyzed. Alignment to the human reference genome hg28/GRCh38 was performed using STAR v2.6.1d, based on the GENCODE v30 annotation. RNA-seq expression outliers were excluded using a multidimensional extension of the statistic described by Wright et al. 61 Samples with less than 10 million mapped reads were removed. For samples with replicates, replicate with the greatest number of reads were selected. Expression values were normalized between samples using TMM as implemented in edgeR 62 . For each gene, expression values were normalized across samples using an inverse normal transformation. eQTL prediction models were performed using elastic net, a regularized regression method, as implemented in S-PrediXcan 63,64 . We used SNPs with a minor allele frequency greater than 1% from European ancestry participants. Locuscompare function from the LocuscompareR R package 65 was used to depict the colocalization event at the LPL locus. Locuscompare enables visualization of the strengths of eQTLs and outcomes associations by plotting p-values for each within a given genomic location, thereby contributing to distinguish candidates from false-positive genes.

Replication of variants associated with NAFLD in the Mass General Brigham Biobank
The Mass General Brigham Biobank is a hospital-based biorepository with genetic data linked to clinical records as previously described. 66 Patients were de ned as having NAFLD or NASH according to diagnosis codes in the electronic health care record and were compared to controls without such diagnoses. In this cohort, genotyping was performed using the Illumina MEGA array. Association of each the seven variants associated with NAFLD was assessed using logistic regression of disease status with age, sex and ve principal components of ancestry as covariates.
Impact of NAFLD variants on liver fat accumulation in the UK Biobank As part of the study protocol of the UK Biobank, a subset of individuals underwent detailed imaging between years 2014 and 2019 including abdominal MRI. 67 Liver fat in this cohort was quanti ed via machine learning of abdominal MRI images as previously described. 33 We excluded samples that had no imputed genetic data, a genotyping call rate < .98, a mismatch between submitted and inferred sex, sex chromosome aneuploidy, exclusion from kinship inference, excessive third-degree relatives, or that were outliers in heterozygosity or genotype missingness rates, all of which were previously de ned centrally by the UK Biobank 68 Due to the small percentage of samples of non-European ancestries, to avoid artifacts from population strati cation we restricted our GWAS to samples of European ancestries, determined via self-reported ancestry of British, Irish, or other white and outlier detection using the R package aberrant, resulting in a total of 32,976 individuals. We did not remove related individuals from this analysis as we used a linear mixed model able to account for cryptic relatedness in common variant association studies. 69 For analysis of liver fat as a continuous trait, we applied a rank-based inverse normal transformation. We took the residuals of liver fat in a linear model that included sex, year of birth, age at time of MRI, age at time of MRI squared, genotyping array, MRI device serial number, and the rst ten principal components of ancestry. We then performed the inverse normal transform on the residuals from this model, yielding a standardized output with mean 0 and standard deviation of 1. We measured the association of genetic variants with rank inverse normal transformed liver fat via a linear mixed model using BOLT-LMM (version 2.3.4) to account for ancestry, cryptic population structure, and sample relatedness. The default European linkage disequilibrium panel provided with BOLT was used. We also determined the effects of each of the seven variants on presence of hepatic steatosis (liver fat >5.5%) 70 using logistic regression in the same 32,976 individuals, adjusting for sex, year of birth, age at time of MRI, age at time of MRI squared, genotyping array, MRI device serial number, and the rst ten principal components of ancestry.

Impact of NAFLD variants on liver enzymes in the UK Biobank
Age, sex and ancestry-based principal components-adjusted GWAS summary statistics on ALT, AST, GGT and ALP concentrations in 361,194 participants of the UK Biobank of European ancestry, were obtained from the Neale lab. Details on the protocols used to measure these biomarkers is available on the UK Biobank website: https://biobank.ndph.ox.ac.uk/showcase/showcase/docs/serum_biochemistry.pdf.

Impact of NAFLD on the blood metabolome
We used GWAS summary statistics from the study of Kettunen et al. 34 In this study, 123 blood lipids and metabolites were measured in 24,925 individuals from 10 European cohorts using high-throughput nuclear magnetic resonance spectroscopy. Metabolites measured using this platform represent a broad molecular signature of systemic metabolism and include metabolites from multiple metabolic pathways (lipoprotein lipids and subclasses, fatty acids as well as amino acids, glycolysis precursors, etc.). The association of each the seven variants associated with NAFLD was assessed using logistic regression and the association of genetically-determined NAFLD and the blood metabolome was assessed using the IVW-MR with the mr function from TwoSampleMR package in R. 16 The IVW-MR is comparable to performing a meta-analysis of each Wald ratio (the effect of the genetic instrument on the outcome divided by its effect on the exposure). Additional MR analysis were performed to evaluate heterogeneity (intercept pvalue from MR Egger 71 ) and the presence of outliers. We used MR-PRESSO 72 , an outlier-robust method, to detected the presence of outliers (variants potentially causing pleiotropy and in uencing causal estimates) and causal estimates were obtained before and after excluding outliers. We also used the simple median and weighted median consensus methods, which give more weight to more precise genetic instruments.

Impact of NAFLD on tyrosine and phenylalanine levels in the Estonian Biobank
Blood plasma levels of tyrosine and phenylalanine were measured using nuclear magnetic resonance spectroscopy in 10,809 participants of the Estonian Biobank. Odds-ratios and corresponding p-values were estimated using logistic regression model implemented in R version 3.6.1. Metabolite values were scaled and centered prior to analysis. Two models were run: raw model with adjusting for age and sex; and adjusted model, which was additionally adjusted for smoking status, education and body-mass index. Impact of NAFLD on the blood proteome A comparable analytical framework as the one used above for the discovery of NAFLD-associated metabolites was used to identify NAFLD-associated proteins. For that purpose, we used GWAS summary statistics from the INTERVAL cohort. In that study, the relative concentrations of 3,622 plasma proteins or protein complexes were assayed using 4,034 modi ed aptamers (SomaSCAN) in 3,301 participants from the INTERVAL study, as described by Sun et al. 37 Tissue-speci city of gene expression of proteins in uences by non-alcoholic fatty liver disease The tissue-speci c gene expression metric (Tau) was obtained from all genes encoding proteins causally in uenced by NAFLD. We used the formula from Yanai et al. 73 to compare the level of gene expression across selected tissues based on RNA sequencing data from European ancestry donors from GTEx. All the genes with expression <1 RPKM were set as not expressed. The RNA-sequencing data were rst log-transformed. After the normalization, a mean value from all replicates for each tissue separately was calculated. A Tau value closer to 1 indicates tissue-speci city while a Tau value closer to 0 indicates ubiquitous gene expression. We considered that genes encoding proteins found to be causally impacted by NAFLD had tissue-speci c expression when their Tau statistic was ≥0.80.

Phenome-wide Mendelian randomization studies in the UK Biobank and FinnGen cohorts
A recent study performed in the UK Biobank generated GWAS summary statistics on 1403 disease-speci c binary traits in 408,961 white British participants. 74 A scheme was used to de ned disease-speci c binary traits by combining International Classi cation of Diseases (ICD)-9 codes into hierarchical "PheCodes". UK Biobank participants were assigned a PheCode if they had one or more of the PheCode-speci c ICD codes. A detailed description of the EHR codes included in this phecode are available on the Center for Precision Health Data Science of the University of Michigan website: http://prsweb.sph.umich.edu:8080/phecodeData/searchPhecode. In the present analysis, outcomes with a case: control ratio <1:1000 were excluded leaving 853 traits for PheWAS. In the FinnGen cohorts (data freeze 4), outcomes with <500 cases were excluded leaving 1404 traits for PheWAS. In both datasets, we determined the effect of each SNP on disease related traits, scaled on its effect on NAFLD using Wald ratios where the effect of each variant on the disease-related trait divided by its effect on NAFLD.
Data availability GWAS summary statistics of the genome-wide meta-analysis of NAFLD will be made available on the GWAS catalog at time of publication.  Figure 1 Schematic overview of the analytical framework used to identify novel genetic loci for non-alcoholic fatty liver disease, their metabolic and phenotypic effects and to identify the metabolic consequences of NAFLD.

Figure 2
Main results of the meta-analysis of genome-wide association studies (GWAS). A) Manhattan plot depicting singlenucleotide polymorphisms (SNPs) associated with non-alcoholic fatty liver disease in the GWAS meta-analysis of the eMERGE, FinnGen, UK Biobank and Estonian Biobank cohorts. Identi cation of genetic variants linked with NAFLD via a risk factor informed Bayesian GWAS based on B) Bayes Factors (BFs), C) direct effects and D) posterior effects. Genetic loci harboring SNPs associated with NAFLD (p<5.0x10-8) are shown.

Figure 3
Shared genetic etiology at the LPL locus. LocusCompare plot depicting colocalization of the top SNP associated with subcutaneous adipose tissue LPL expression and non-alcoholic fatty liver disease (NAFLD). Each dot represents a single-nucleotide polymorphism (SNP) at the LPL locus. In the left panel, these SNPs are plotted to represent their effect on LPL expression (top right) against their effect on NAFLD (bottom right).

Figure 4
Causal impact of non-alcoholic fatty liver disease (NAFLD) and NAFLD variants on the blood metabolome. Balloon plot depicting the effect of the seven NAFLD variants on blood metabolites and of the effect of NAFLD on blood metabolites using inverse-variance weighted Mendelian randomization.

Figure 5
Causal impact of non-alcoholic fatty liver disease on the blood proteome. A) Volcano plot depicting and blood proteins in uenced by the presence of non-alcoholic fatty liver disease using inverse-variance weighted Mendelian randomization. Green dots represent proteins signi cant in uenced by the presence of NAFLD following correction for false discovery rate (FDR). B) Tissue-speci city of genes encoding proteins in uenced by the presence of NAFLD. Heat map showing the tissue-speci city of genes encoding proteins in uenced by the presence of NAFLD. Tau value is shown in parentheses after the gene name. TPM indicates transcript per per million mapped reads.

Figure 6
Results of the LD regression analysis between non-alcoholic fatty liver disease and other human diseases and traits. LD regression analyses were performed in LD Hub to test the genetic correlation of NAFLD with 240 human diseases and traits. Nominally signi cant (p<0.05) genetic correlation coe cients (Rg) and their 95% con dence interval are presented. HOMA-IR indicates homeostatic model of insulin resistance, adjBMI indicates adjusted for body mass index, VLDL indicates very-low-density lipoproteins and FEV1/FVC indicates forced expiratory volume 1/forced vital capacity.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.