Associations of Gene Expressions in Liver and Blood Cells and Circulating Proteins With Coronary Artery Disease


 Background: Circulating low-density lipoprotein cholesterol (LDL-C) plays a causal role in coronary artery disease (CAD). However, genetic risk factors of CAD were not yet fully understood. The aim of the study was to identify additional factors that contributed to CAD.Methods: We conducted integrative analysis on publicly available data from genome-wide association studies and quantitative trait locus studies by employing two-sample Mendelian randomization methods to examine the associations of gene expression and circulating protein levels with LDL-C and CAD.Results: We found that mRNA expression levels of SORT1, PSRC1 and CELSR2 in liver cells were significantly associated with both LDL-C and CAD (P < 5.0 × 10-8). Higher expression levels of SORT1, PSRC1 and CELSR2 in the liver were significantly associated with lower circulating LDL-C level (beta= -0.142, -0.138 and -0.171, respectively) and CAD risk (beta= -0.10, -0.097 and -0.121, respectively). Expression levels of 31 genes in blood cells associated with LDL-C level were found. Expressions of PSRC1, IL6R, GGCX, VAMP8, LIPA, NT5C2, SWAP70, EIF2B2, FURIN, FES and ATP5G1 in blood cells were significantly associated with CAD. Higher circulating granulins and apolipoprotein B levels, which were strongly affected by PSRC1 and APOB variants, were significantly associated with higher LDL-C level (beta = 0.22 and 1.35) and CAD risk, with odds ratios of 1.15 (1.10-1.19) and 1.45 (1.21-1.74), respectively. Conclusions: This study identified gene expressions in liver and blood cells and circulating granulins and apolipoprotein B that were genetically associated with LDL-C and CAD and provided evidence for exploring the potential causal relationship.


Introduction
Coronary artery disease (CAD) is one of the most common cardiovascular diseases, which is associated with high morbidity and mortality [1]. Special professional groups such as military pilots are at higher risk of CAD than the general population. The pathogenesis of CAD is not fully understood but is known to be the result of the interaction of various genetic and environmental factors. It is a process in which lipoprotein accumulates in the arteries that supply blood to the heart. Dyslipidemia is thought to be a major factor in the development of CAD, and low-density lipoprotein cholesterol (LDL-C) has been shown to be one of the most important causal risk factors. In view of the well-established causal role of LDL-C in CAD, biological molecules associated with LDL-C may be risk factors for CAD.
Large-scale genome-wide association studies (GWAS) have robustly identi ed large numbers of riskassociated genetic variants implicated in LDL-C [2][3][4] and CAD [5,6]. However, the GWAS identi ed loci were located in complex genomic regions containing multiple genes [7]. It is unclear which of these genes are functionally related to LDL-C and CAD functions. The colocalization of genetic associations for disease traits with those for intermediate molecular phenotypes, such as gene expression and metabolomics, provides powerful evidence to advance hypotheses regarding the genes and pathways through which these disease-associated variants mediate their effects [8]. Quantitative trait loci (QTLs) for gene expressions and circulating proteins can offer a route to a comprehensive molecular interpretation of the GWAS ndings. Therefore, by integrative analysis of omics data have the potential to nd pathogenic genes for CAD.
Mendelian randomization (MR) is an analytical technique that assessed the correlation between genetic alternatives to intermediate biomarkers and subsequent diseases that are supposed to be caused by intermediate biomarkers based on the random distribution of genetic variation speci c to biomarkers.
The recent developed two-sample multi-instrumental MR methods that base on GWAS summary data provide feasible ways to integrate omics data from independent GWAS, including QTL studies on genome-wide mRNA expression and circulating protein levels (eQTL and pQTL, respectively) [9][10][11]. The two-sample MR methods are data integration methods that have being widely applied in identi cation of intermediate molecular for disease traits and therefore are effective strategies to explicate the GWAS ndings.
In terms of lipoprotein metabolism, the liver is the major site of removal of LDL-C from the circulation.
Altered liver expression of genes involved in lipid metabolism [12][13][14]. Hepatic de novo lipogenesis in uences hepatic cholesterol content as well as its effects on the circulating lipid levels [15]. Blood cell gene expressions also regulate cholesterol homeostasis [16][17][18]. In this study, we applied several data integration methods to identify potential risk factors such as gene expressions in liver and blood cells and circulating proteins for LDL-C and CAD by using a combination of data from GWAS, eQTL and pQTL studies (Fig. 1).

Data resources
This study used datasets from two LDL-C and one CAD GWAS, four eQTL studies (1 in liver and 3 in whole blood cells), and 3 pQTL studies on circulating protein levels. The two LDL-C datasets were obtained from the lipids GWAS that were conducted by the Global Lipids Genetics Consortium (GLGC).
The rst GWAS evaluated the associations between almost 2.6 million SNPs and lipid levels in 188,578 Europeans [4]. The dataset containing summary results of the association between almost 2.6 million SNPs and LDL-C level was downloaded at http://csg.sph.umich.edu/abecasis/public/lipids2013/ and used in our analysis. The second LDL-C GWAS was a meta-analysis of exome-wide association studies that evaluated the association between almost 292,417 variants and LDL-C level in 47,532 East Asians and more than 300,000 individuals primarily (84%) of European descent (http://csg.sph.umich.edu/abecasis/public/lipids2017EastAsian) [2].
We used the summary results from a large-scale CAD GWAS meta-analysis which was conducted by the CARDIoGRAMplusC4D consortium [6]. This study enrolled about 185,000 individuals that were mainly (77%) of European ancestry. The dataset contained the summary statistics of associations between almost 9.5 million SNPs and CAD tested under the additive model in the initial GWAS (http://www.cardiogramplusc4d.org/data-downloads/).
For the four eQTL studies, one eQTL study was conducted by Westra et al. This study performed eQTL meta-analysis in non-transformed peripheral blood samples from 5,311 individuals of European decent [19]. It is the largest eQTL meta-analysis in blood cells to date. The second eQTL study is the genetic architecture of gene expression (GAGE) study that detected eQTLs in whole blood of 2,765 unrelated Europeans [20]. The remaining two eQTL datasets contained the cis-eQTL summary level data on mRNA expression levels in liver and whole blood cells from the GTEx project [21]. The summary data of these four eQTL studies was available at https://cnsgenomics.com/software/smr/#eQTLsummarydata. We acquired pQTL summary data from three studies that examined the associations between genomewide SNPs and circulating protein levels in thousands of individuals. The rst pQTL study tested genomewide associations between 509,946 SNPs and circulating levels of 1,124 proteins in blood samples of 1,000 individuals from the KORA study [22]. The summary data was available at http://metabolomics.helmholtz-muenchen.de/pgwas/index.php?task=download. The second pQTL study performed genome-wide testing of 10.6 million imputed autosomal variants against levels of 2,994 circulating proteins in 3,301 individuals of European descent from the INTERVAL study [23]. The summary data was available at http://www.phpc.cam.ac.uk/ceu/proteins/. The third pQTL study analyzed 123 metabolites in up to 24,925 individuals [24]. The 123 metabolic traits were quanti ed by nuclear magnetic resonance spectroscopy in blood samples. The summary data was available at http://www.computationalmedicine. /data#NMR_GWAS.

SMR analysis
The summary data-based MR (SMR) approach is a kind of two-sample multi-instrumental MR method that provides feasible ways for integrating GWAS summary data with QTL data [11]. SMR facilitates the goal of integration of summary statistics from large-scale GWAS with transcriptome-wide association data. Besides, in such MR approaches, the exposure and outcome are not necessary measured in the same samples, and the data are summary statistics rather than the raw data of the individuals [11]. The instrumental variables (SNPs that were both tested in the independent QTL studies and GWAS) provided the summary data (e.g., the regression coe cient beta values and standard error) on the effects of SNPs on the levels of biomolecules and outcomes. By analyzing this summary data SMR estimates the causal effects of the biomolecules on the outcomes.
The SMR software (version 0.712) is a command-line program, which was downloaded from http://cnsgenomics.com/software/smr/. The parameters were left at the default setting in the analysis.
The outcome data (i.e., SNP rs number, alleles, allele frequency, beta values, standard error, P values) required for SMR analysis was collected from the LDL-C and CAD GWAS datasets, and then was organized to a .ma le with 8 columns speci c for the SMR analysis by using the R program. The les contained eQTL summary data in binary format for the SMR analysis were available at http://cnsgenomics.com/software/smr/#DataResource (described above). We used genotype data of HapMap r23 CEU as a reference panel to calculate the linkage disequilibrium (LD) correlation matrix for SMR. The genome-wide signi cance threshold for the SMR analysis was 5.0 × 10 -6 . We also performed the heterogeneity in dependent instruments (HEIDI) test to test the 'no horizontal pleiotropy' assumption.
The HEIDI test was conducted to examine if there is a single causal SNP affecting gene expression and the phenotype. The genes with P HEIDI ≥ 0.05 (without heterogeneity) were considered. Functional annotation enrichment analyses were performed for the identi ed genes by using the DAVID online tools.

pQTL analysis
We searched for pQTLs among SNPs that located in the identi ed genes. The pQTLs were expected to provide us with clues to nd out proteins that were related to LDL-C and CAD risk. We picked out SNPs within the genes of interest according to the UCSC database, then examine whether these SNPs were pQTLs according to the results from the KORA and INTERVAL pQTL studies described above.

MR analysis on proteins
In order to obtain further supporting evidence for proteins identi ed in pQTL analysis, we employed the inverse-variance weighted (IVW) MR [25], MR-Egger [26] and the MR pleiotropy residual sum and outlier (MR-PRESSO) [27] methods to test for potential causal relationships between circulating protein levels and LDL-C and CAD. The inverse-variance weighted method combines the ratio estimates from each IV in a meta-analysis model [25]. If the associations with circulating protein levels were to lead to horizontal pleiotropy, the intercept from MR-Egger would be expected to differ from zero [26]. In such cases, we interpreted the coe cient from MR-Egger as being the more valid causal estimate. Conversely, in the absence of statistical evidence for horizontal pleiotropy from the intercept on MR-Egger, we used the IVW MR analysis as it retains greater power. The IVW MR and MR-Egger analyses were performed by using the MendelianRandomization R package [28]. We also detected the horizontal pleiotropy and the outliercorrected causal estimation by using the MR-PRESSO tests [27]. The outlier test in MR-PRESSO is the procedure to test for the MR assumption of no pleiotropy. The source code and documents for MR-PRESSO were available at https://github.com/rondolab/MR-PRESSO. The default parameters were used for the MR-PRESSO analysis.
Data used in these MR analyses were the pQTL data from the three studies and LDL-C (2013 GWAS) [4] and CAD [6] GWAS data that have been described above. The data required in the MR analysis (i.e., the SNP rs number, beta values, standard errors, and the P values) was extracted from each of the LDL-C and CAD GWAS and pQTL datasets. Then we used the "merge" function of the R program to transform the data into a speci c le (an ordinary document with 7 columns) for each protein-trait pair. In the pQTL summary data, SNPs with P value less than 1.0 × 10 -4 were selected to be potential instrumental variables, except for the NMR_GWAS pQTL study (5.0 × 10 -8 ). The selection criterion was set to 1.0 × 10 -4 for the two pQTL studies because 5.0 × 10 -8 would lead to too few instrumental variables. We clumped SNPs (LD r 2 < 0.01 within 10,000 kb) based on data of the Europeans from the 1000 Genomes project using the "clump_data" function on the R package of TwoSampleMR to select independent instrumental variables. The effect allele of each SNP in the LDL-C and CAD GWAS and pQTL studies was manually checked for consistency.

Gene expressions in liver cells associated with LDL-C and CAD
We carried out an SMR analysis to nd gene expressions in liver cells that were associated with circulating LDL-C level and CAD risk by integrating eQTL data from the GTEx project with large-scale GWAS data. The mRNA expression levels of a total of 21,032 genes were analyzed, and expression levels of 15 genes (RHD, RHCE, ANGPTL3, CELSR2, PSRC1, SORT1, SYPL2, ATXN7L2, DNAH11, FADS3, ST3GAL4, NYNRIN, CETP, EFCAB13 and SPTLC3) in the liver were signi cantly associated with LDL-C (P < 5.0 × 10 − 6 ) ( Table 1, Supplemental Table S1). Ten of them pass the HEIDI test. These genes were enriched in gene ontology (GO) biological process of cellular lipid metabolic process (P = 5.0 × 10 − 3 ). The mRNA expression of 3 genes (CELSR2, PSRC1 and SORT1) in the liver were signi cantly associated with CAD risk (P < 5.0 × 10 − 6 ) ( Table 2, Supplemental Table S2). Therefore, the associations of mRNA expression levels of SORT1, PSRC1 and CELSR2 in liver with both LDL-C and CAD passed the threshold of 5.0 × 10 − 8 (Fig. 2). Higher expression levels of SORT1, PSRC1 and CELSR2 in liver cells were associated with lower LDL-C levels (beta= -0.142, -0.138 and − 0.171, respectively) and CAD risk (beta= -0.10, -0.097 and − 0.121, respectively).

Gene expressions in blood cells associated with LDL-C and CAD
By integrating LDL-C GWAS data with eQTL data, we found that expression levels of 31 genes were signi cantly associated with LDL-C level, including expression level of PSRC1 (Fig. 2), without signi cant heterogeneity (Supplemental Table S3). We then performed SMR analysis for CAD. By integrating CAD GWAS data with data from eQTL studies, we found that mRNA levels of 17 probes of 11 genes (PSRC1, IL6R, GGCX, VAMP8, LIPA, NT5C2, SWAP70, EIF2B2, FURIN, FES and ATP5G1) were signi cantly associated with CAD (Table 3), including expression level of PSRC1 (Fig. 2), without signi cant heterogeneity (P HEIDI > 0.05). These genes were enriched in GO biological process of myeloid leukocyte mediated immunity (P = 9.8 × 10 − 4 ). We also found additional 88 genes that were nominally associated with CAD in blood cells (Supplemental Table S4). B. Among these SNPs, rs1367117 was associated with both LDL-C (P = 1.45 × 10 − 282 ) and CAD (P = 1.10 × 10 − 4 ); rs693 (P = 3.16 × 10 − 5 ) and rs10199768 (P = 3.10 × 10 − 3 ) were associated with LDL-C; rs2337383 was associated with CAD (P = 3.70 × 10 − 4 ). A total of 81 SNPs in ERGIC3 were signi cantly associated with circulating level of copine 1, and 19 and 5 of them were associated with LDL-C and CAD, respectively. In addition, we found that the associations of some SNPs with circulating levels of granulins, apolipoprotein B and copine 1 were replicated in the two pQTL studies.

Proteins causally associated with CAD
We tested that if the three proteins, i.e., copine 1, granulins and apolipoprotein B, were genetically associated with CAD using several MR methods. The results were presented in Table 4. We rst examined the associations between circulating levels of these proteins and LDL-C and found that the associations between circulating levels of granulins and apolipoprotein B and LDL-C were signi cant in every MR analyses using data from KORA and INTERVAL studies. In datasets available from the NMR_GWAS, we only found data for apolipoprotein B among the tested metabolites. By using this data, the association between circulating level of apolipoprotein B and LDL-C was validated. Circulating copine 1 was found to be signi cantly associated with LDL-C by using KORA data, but not in MR-PRESSO analysis using the INTERVAL data. Because circulating levels of granulins and apolipoprotein B were signi cantly associated with LDL-C, we thought that these proteins may be risk factors of CAD. As expected, signi cant associations between granulins and apolipoprotein B levels and CAD were found (Table 4). These associations were validated in several MR analyses on data from two pQTL studies. The effect of protein granulins levels on CAD risk was relatively small, with an odds ratio (OR) of about 1.15 (95% con dence interval: 1.10-1.19). The effect of protein apolipoprotein B levels on CAD risk was larger than that of granulins, with an OR of about 1.45 (95% con dence interval: 1.21-1.74). As it is known, LDL-C level was causally associated with CAD. Therefore, granulins, apolipoprotein B, LDL-C and CAD were genetically correlated with each other.

Discussion
In this study we took the advantage of MR approaches to nd out potential causal factors (e.g., gene expressions in liver and blood cells and circulating protein levels) for CAD by integrating data from GWAS. We found many genes that may play causal roles in LDL-C metabolism and CAD. We also found that circulating levels of granulins and apolipoprotein B were causally associated with LDL-C and CAD.
Until now, GWAS have successfully identi ed over 175 genetic loci for lipids. Some of these loci involved therapeutic targets such as HMGCR (statins), PCSK9 (antibodies), and NPC1L1 (ezetimibe). The causal effect of LDL-C on CAD has been well established. Identi cation of related factors for LDL-C may help to understand the etiology of CAD and develop novel therapies. Elucidation of the causal factors underlying GWAS signals remains challenging due to the complexities of the genomic loci (e.g., LD) and the interactions. It is di cult to determine the most functionally relevant genes for LDL-C and CAD only based on the genomic data. On the other hand, previous case-control studies have found gene expressions in blood cells implicated in dyslipidemia and CAD, however, these observational studies are subject to confounding and reverse causations. And there was not liver gene expression pro le study with large samples. MR is an analytical technique that assess the correlation between genetic alternatives to phenotypes and outcomes based on the random distribution of genetic variation speci c to biomarkers, and thus makes up the shortages of traditional epidemiology study [29,30]. By applying the two-sample MR approach we identi ed the most relevant genes in the complex genomic loci for LDL-C and CAD. The ndings of this study also indicated that the GWAS-identi ed loci contained causal factors for LDL-C and CAD, e.g., expressions of the genes in the loci.
The expression level of PSRC1 (Proline and Serine Rich Coiled-Coil 1) was associated with LDL-C and CAD in liver and blood cells. PSRC1 codes a proline-rich protein, which is a target for regulation by the tumor suppressor protein p53. The genetic associations between PSRC1 variants and LDL-C and CAD have been well established but the role of this gene in LDL-C and CAD has not been well discussed. Our analysis suggested that SNPs in PSRC1 were strongly associated with gene expression, LDL-C and CAD, and furthermore, these SNPs were signi cantly associated with circulating levels of granulins and apolipoprotein B. Moreover, we demonstrated that circulating levels of these proteins were causally associated with LDL-C and CAD. Circulating progranulin is a dimer of high-density lipoprotein (HDL) [31]. HDL/apolipoprotein A-I was known to bind to macrophage-derived progranulin and suppress its conversion into proin ammatory granulins [32]. However, how PSRC1 interacts with GRN is unclear but we found that these genes were connected via other genes that were related to cardiovascular diseases (e.g., APP, YY1) (Supplemental Fig. 1). Based on the multiple sources of evidence, the identi ed genes were suggested to be potential functional candidate genes. The interaction among these factors probably point to a pathway for LDL-C and CAD.
Our study found that apolipoprotein B was causally associated with LDL-C and CAD. It is known that one molecule of apolipoprotein B is required for the proper assembly and secretion of each very low density lipoproteins [33], which are converted into LDL-C after the hydrolysis of their triglyceride moiety [34]. Indeed, apolipoprotein B is a proposed LDL-C-lowering target. Proprotein convertase subtilisin/kexin type 9 (PCSK9) regulates circulating LDL-C level. Several studies have shown that PCSK9 regulates apolipoprotein B synthesis and secretion [35][36][37]. In animals, a single dose of siRNA targeting PCSK9 resulted in a lowering of circulating PCSK9, apolipoprotein B, and LDL-C, without measurable effects on either high density lipoprotein cholesterol or triglycerides [38]. In the MR analyses in human samples we showed that there was a causal effect of circulating apolipoprotein B on CAD.
This study has some limitations. First, we did not validate the associations found in the SMR analysis in an independent sample due to lacking in data. Second, the identi ed "candidates" of the causal factors were inferred by statistical models. Additional studies such as experimental studies and randomized intervention trials were needed to strengthen the evidence. Third, although we have identi ed a set of causal genes, the connections between these genes were unclear. How the lipid metabolism related genes connect to each other and how they affect CAD risk is needed to test in future studies.

Conclusion
This study elucidates the biological basis of genetic linkage and promotes the identi cation of susceptibility genes. The ndings suggested that gene expressions in liver and blood cells and circulating protein levels were genetically associated with LDL-C and CAD and may be novel risk factors for CAD.
This study also provided evidence for exploring the potential causal relationship between granulins and apolipoprotein B and LDL-C and CAD. The ndings may provide the clues for seeking new therapeutic target for dyslipidemia and CAD.

Declarations
The study was supported by the National Natural

Availability of data and materials
All data generated or analysed during this study are included in this published article and its supplementary information les.

Ethics approval and consent to participate
The study was approved by the Ethics Committee of Fujian Medical University Union Hospital. Written informed consent was waived by the Ethics Commission of each hospital for emerging infectious.

Consent for publication
No individual participant data is reported that would require consent to publish from the participant (or legal parent or guardian for children). Figure 1 The owchart of the study design. We designed this study to explored potential risk factors such as SNPs, gene expression in liver and blood cells, and circulating protein levels for CAD. This study comprised multiple steps of analysis. The rst step is the SMR analyses. In this step we integrative analyzed data from 4 eQTL studies (1 in liver and 3 in whole blood cells) and data from 2 LDL-C and 1 CAD GWAS to identify mRNAs that were associated with LDL-C and CAD. Second, we found out the shared genes for LDL-C and CAD. Third, we looked for pQTLs inside the genes identi ed in the previous steps using data from 3 pQTL studies on circulating protein levels. Finally, for circulating protein levels that were signi cantly affected by SNPs (i.e., pQTLs) in the identi ed genes, we applied 3 MR methods (IVW, MR-Egger, MR-PRESSO) to examine whether they were genetically associated with LDL-C and CAD by integrating data from the 3 pQTL studies and LDL-C and CAD GWAS.

Figure 2
The association between mRNA expression levels of 1p13.3 genes and LDL-C and CAD in liver and blood cells. The two panels present the results of the associations between mRNA expression levels of 1p13.3 genes and LDL-C and CAD in liver (left) and blood cells (right). Each panel consists of two parts. The xaxis represents the genomic position (GRCh37.p13). The lower part of each panel shows the results of eQTL. The y-axis represents -log10(P eQTL). In this part we can see that SNPs in 1p13.3 were strongly associated with mRNA expression levels of SORT1, PSRC1 or CELSR2. The upper part of each panel shows the results of GWAS (gray dots) and SMR (red rhombus) analysis. The y-axis represents -log10(P GWAS or SMR). In this part we can see that SNPs in 1p13.3 were strongly associated with LDL-C and CAD. According to the SMR analysis, mRNA expression levels of SORT1, PSRC1 and CELSR2 in liver cells were signi cantly associated with LDL-C and CAD, and mRNA expression level of PSRC1 in blood cells was signi cantly associated with LDL-C and CAD.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.