Identification of potential biomarkers for CAD using integration with expression and methylation data and validation by case-control study

Background: DNA methylation plays an essential role in the pathogenesis of coronary artery disease (CAD) through regulating mRNA expressions. This study aimed to identify hub genes as biomarkers functioning in CAD and regulated by DNA methylation. Methods: Gene expression and methylation datasets of peripheral blood leukocytes (PBLs) of CAD were downloaded from Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) and differentially methylated genes (DMGs) were identified based on differentially methylated CpG sites (DMCs) in intragenic regions. The overlapped genes in DEGs and DMGs were filtered and defined as differentially expressed-methylated genes (DEMGs). Multiple computational approaches were performed to analyze the regulatory networks of DEMGs and recognize hub genes, subsequently. Finally, top hub genes were verified in a case-control study based on their differential expression levels between CAD cases and controls. Results: Totally 535 DEMGs were identified and partitioned into 4 subgroups. TSS200 and 5’UTR were confirmed as high enrichment areas of DMCs. DEMGs mainly enriched in processes of histone H3-K27 methylation, regulation of posttranscription and DNA-directed RNA polymerase activity. Pathway enrichment showed DEMGs participated in VEGF signaling pathway, adipocytokine signaling pathway and PI3K-Akt signaling pathway. Besides, hub genes FN1 (fibronectin 1), PTEN (phosphatase and tensin homolog), POLR3A (RNA polymerase III subunit A) were experimentally proved discordantly expressed in CAD patients when compared with controls. Conclusions: In conclusion, our study demonstrated the probable functional genes of PBLs in CAD, in which FN1, PTEN and POLR3A were confirmed. The underlying mechanism that differential expression-methylation of FN1, PTEN and POLR3A in CAD need further study. LDL-C, lipoprotein cholesterol; HDL-C, cholesterol; TG, triglyceride; LMR, lymphocyte to monocyte ratio; NLR, neutrophil to lymphocyte ratio.

2000 to 2029 in China [2]. The total social economy devotion to CAD in developing countries was estimated to approximately 3.7 trillion dollars in 2010, which is roughly equal to 1-3% of Gross Domestic Product (GDP) across developing countries [3].
Coronary arteriography (CAG) is the gold standard of CAD diagnosis, but the high cost and invasiveness limit its application [4]. Whereas, the cheaper cost and less invasive detection make blood biomarkers easier in promotion [5]. Epigenetics is defined as the heritable posttranscriptional modifications that are not induced by the nucleotide sequence alterations of DNA [6]. Multiple factors such as environment, diet, oxidative stress and inflammatory stimuli, have influence on epigenetics contents, which including DNA methylation, RNA methylation, chromatin histone modification, and non-coding RNA, and DNA methylation is the most indagated [7].
Aberrant DNA methylation participates in various processes of CAD development by regulating mRNA expression of interrelated genes. For instance, ABCA1 plays an essential part in reverse cholesterol transport (RCT) by promoting the excretion of free cholesterol and phospholipids from cells [8]. The alterative hypermethylation of ABCA1 promotor has been verified to be related with the high expression of ABCA1, which can accelerate the process of CAD by expediting the formation of foam cells [9][10]. Furthermore, cystathionine gamma-lyase, encoded by CTH, is a crucial part of the homocysteine metabolism pathway [11]. Previous studies have found that hypermethylation of CTH promotor in hyperhomocysteinemia mouse can lead to the decrease of CTH expression, which in turn prevents homocysteine from being catabolized and causes vascular endothelial cells injury, eventually results in CAD [12][13]. In the past a few years, numbers of researches on DNA methylation mainly focused on the connection between methylation condition of promoter regions and the expression level of related genes. Nevertheless, the aberrant methylation status of other gene regions has also been identified to be associated with CAD and the complex regulation networks remain largely unexplored [14][15][16].
To the best of our knowledge, an integrative research that combined both mRNA expression and DNA methylation microarrays of PBLs in CAD in Chinese was required. In our study, we calculated the methylation status of 5'-C-phosphate-G-3' (CpGs) sites in different intragenic gene regions, including TSS1500, TSS200, 5'UTR, 1stExon, body, and 3'UTR. Besides, we consolidated DNA methylation and mRNA expression data in order to recognize genes functioning in CAD and associated with DNA methylation, which might be potential PBLs biomarkers. We identified hub genes that were both aberrantly methylated and differentially expressed in CAD patients compared with controls. Vital hub genes were validated in a case-control study to enhance the reliability of bioinformatics analysis.
Through combination of bioinformatics analysis and clinical sample validation, we aimed to ascertain novel feasible PBLs biomarkers and shed light on potential mechanisms in the pathogenesis of CAD.

Methods
The methods used in our study mainly contained microarray data collection, differential expression and methylation analysis, functional and pathway enrichment analysis, PPI network establishment, module analysis and hub genes identification, followed by quantitative real-time PCR (qPCR) validation in PBLs, correlation analysis and multivariate stepwise linear regression analysis. The research flow diagram of this study is shown in Additional file 1: Fig. S1.

Microarray Data Collection
We retrieved GEO of The National Center for Biotechnology Information (NCBI) to screen datasets that contained profiling information about mRNA expressions and DNA methylations in CAD patients versus controls. A series of datasets were obtained and only those that met both the inclusion and exclusion criteria were analyzed subsequently. The detailed inclusion criteria were as followed: (1) datasets involved mRNA expression information or DNA methylation status detected from PBLs; (2) contained both CAD patients and controls; (3) sample size was no less than 5 of each subgroup.
Besides, datasets were excluded if specimen type was one of the components of PBLs, such as monocyte, granulocyte or platelet. Only two datasets were up to the selection criteria, GSE42148 and GSE107143. Gene expression profiling array (GSE42148), measured by the Agilent-028004 SurePrint G3 Human GE 8 × 60K Microarray, provided mRNA expression data from 11 controls with normal electrocardiogram diagnoses and 13 CAD patients. The series matrix and platform files (GPL13607) were downloaded from GEO database. Genome-wide DNA methylation profiling array (GSE107143) contained information of DNA methylation status from 8 controls with normal physical condition and 8 CAD patients. The data were measured by Illumina HumanMethylation450 BeadChip and the series matrix file as well as the platform file (GPL13534) were obtained from GEO database. Besides, in consideration of mRNA expression array GSE71226 didn't meet the inclusion criteria with a small sample size of 3 CAD patients and 3 controls, we used it only to evaluate the discriminating ability of candidate gene mining.

Differential Expression Analysis
The R package named "limma" was utilized to select differentially expressed genes (DEGs) from the series matrix file downloaded from GEO database [17]. Probes without matching gene symbols were deleted and genes with multiple probes were averaged in the subsequent analysis. We took P < 0.05 and absolute value of log 2 FC (fold change) > 0.3 as threshold of significant DEGs. The heatmap based on the expression data was drawn using the R package "pheapmap". Differential DNA Methylation Analysis AS one of the mainstream detection platforms for DNA methylation, Illumina HumanMethylation450 BeadChip covers roughly 450,000 CpGs that randomly separate in different gene regions, including TSS1500, TSS200, 5'UTR, 1stExon, body, 3'UTR, and regions between adjacent genes. TSS1500 and TSS200 are regions from 201 to 1500 bases and 1 to 200 bases of the upstream of transcriptional start site (TSS), respectively. The "5'UTR (5' untranslated region)" is considered as the region between TSS and the first initiation codon. "1stExon (the first exon)" is one of the most extensively studied translated regions that is deemed to be influenced by the methylation status generally.
"Body" stands for the sequence from the first initiation codon to the stop codon of a gene. The "3'UTR (3' untranslated region)" is the area between stop codon and poly-A tail. The 6 intragenic regions mentioned above are the main components of a gene and we took the average of the beta value of CpGs from the same region as the comprehensive methylation level of each intragenic region. The limma package of R was used for identification of differentially methylated CpGs sites (DMCs), differentially methylated regions (DMRs), and differentially methylated genes (DMGs) with the threshold P < 0.05 and log 2 FC > 0.3. Single CpGs met the threshold were taken as DMCs, meanwhile, intragenic regions matched the threshold were identified as DMRs. Genes with one or more DMRs that differentially methylated in the same direction were considered as DMGs. We defined genes that were identified both as DEGs and DMGs as differentially expressed-methylated genes (DEMGs). The Upset plot performed by R package "UpSetR" was utilized to describe the distribution of DMCs in different intragenic regions [18]. The locations of DMCs on chromosomes were visualized by R package "RIdeogram" [19].
Functional And Pathway Enrichment Analysis R package "clusterProfiler" was resorted to implement Gene ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis [20,21]. More precisely, GO enrichment analysis was carried out within 3 classical subschemas: biological process (BP), cellular component (CC), and molecular function (MF). Subsequently, we utilized "ggplot2", one of the most extensively cited R packages, for visualization of the analysis results. The cutoff value of statistical significance was set as P < 0.05.

PPI network establishment, module analysis and hub genes identification
Protein-protein interaction (PPI) network was preliminarily constructed through Search Tool for the Retrieval of Interacting Genes (STRING) database to explore the inherent relation and regularity of DEMGs. The cutoff value of interaction score in STRING database was set at 0.4. To make the PPI network more legible, we resorted Cytoscape to visualize the network based on interaction information calculated from STRING [22]. An auxiliary application named Molecular Complex Detection (MCODE) from Cytoscape was used for module analysis to identify modules with significant interaction under threshold MCODE scores > 3, k-score = 2 and nodes numbers > 4. CytoHubba, another frequently-used application from Cytoscape, provided 12 algorithms to estimate evidence levels of interaction within genes from PPI network [23]. We took summation of these 12 evaluation scores as the comprehensive assessment standard for screening top hub genes.

Study Population And PBLs Collection
We performed a case-control study to consolidate the expression status of hub genes filtered through bioinformatics analysis. PBLs of 30 CAD patients from Zhongnan Hospital of Wuhan University (Wuhan, China) were collected during December 2018 and July 2019. The diagnostic criterion for CAD was based on coronary angiography that showing stenosis caused by atherosclerotic plaque was more than 50% in at least one coronary artery. Meanwhile, 30 age and sex matched people who were negative in examination of ultrasound or coronary CTA or coronary angiography were enrolled as controls. Besides, none of the participants were diagnosed with the following diseases: cancer, acute inflammation, hematological system disorders, congenital heart disease, history of previous myocardial infarction (MI), hepatic failure, or other severe disorders. The basic information and clinical characteristics of participants were shown in Table 1. Our study was authorized by the Medical Ethics Committee of Zhongnan Hospital of Wuhan University. < 0.001 Data were showed as mean ± SD, median (25 percentiles, 75 percentiles). HP, hypertension; DM, diabetes mellitus; TC, total cholesterol; TG, triglycerides; LDL-C, low-density lipoprotein cholesterol; HDL-C, high-density lipoprotein cholesterol; FPG, fasting plasma glucose; WBC, white blood cell; LMR, lymphocyte to monocyte ratio; NMR, neutrophil to monocyte ratio; NLR, neutrophil to lymphocyte ratio.  Table S1. All experiments were implemented in duplicate to intensify the credibility. Relative gene expression status was calculated by 2 −ΔCq method, in which ΔCq stands for the difference between the mean Cq (quantification cycle) of a target gene and the endogenous reference gene (GAPDH).

Statistical analysis
Mean ± standard deviation (SD) was utilized to describe the basic information and clinical characteristics that were normal distributed continuous variables. Abnormal distributed continuous variables were depicted as median and inter-quartile range. Categorical variables were exhibited by frequencies. We applied student's t test or Mann-Whitney U tests to compare the difference between 2 groups based on the distribution type. Chi-square test or Fisher's exact test was performed for comparison of categorical variables between groups. Pearson or Spearman test was used for correlation analysis. We utilized multivariate stepwise linear regression to eliminate interference factors in regression analysis. All statistical analyses of this research were conducted through SPSS To probe into the potential effect of the whole intragenic regions' methylation status on gene function, we considered genes with one or more DMRs that differentially methylated in the same direction as DMGs. About 2413 hypermethylated genes and 2952 hypomethylated genes were classified based on DMRs. Subsequently, 135 genes were identified as up-expressed and hypermethylated (up-hyper genes), 212 genes were confirmed as up-expressed and hypomethylated (up-hypo genes), 100 genes were taken as down-expressed and hypermethylated (down-hyper genes), 88 genes were considered as down-expressed and hypomethylated (down-hypo genes) by overlapping DEGs and DMGs (Fig. 2a).
In addition, to further explore whether the integration analysis of expression and methylation data was able to identify potential functional genes in CAD efficiently, another expression microarray In total, 535 differentially expressed and methylated genes were screened out as DEMGs. It was worth noting that up-hyper genes and down-hypo genes occupied virtually half of DEMGs, which indicated the multidirectional regulation of methylation on gene function that worth of further study. Heatmaps were formed according to hierarchical clustering of gene expressions or methylations levels to exhibit the top 50 ranked DEMGs by log 2 FC, respectively (Fig. 2b, c).

Distributions Of DMCs In Intragenic Regions
According to the analysis result of DMCs distributions, it showed inhomogeneous distributions of DMCs in 6 intragenic regions no matter for DMGs or DEMGs. As Fig. 3a and Fig. 3b showed, DMCs located in TSS1500 and body of both DMGs and DEMGs, accounted for over 50% of the total DMCs numbers. On the contrast, there were only less than 3% DMCs distributed in 3'UTR. In consideration of the linear lengths of TSS1500 and body are much longer than TSS200 and 5'UTR, it is more reasonable to take TSS200 and 5'UTR as the high enrichment areas of DMCs. These results above indicated the significant correlation of TSS200 and 5'UTR methylation status and gene expression.
Besides, the similar distribution mode can be found in 4 kinds of DEMGs (Fig. 3c). It can be observed that DMCs in 6 intragenic regions were mostly possessed by up-hypo genes, which manifested up-hypo genes as major roles in epigenetic regulation of CAD.
To demonstrate the relevance among intragenic regions, UpSet plots were drawn to describe the methylation status of a certain DMG with one or more DMRs. Over 70% of both hypermethylated genes and hypomethylated genes were single region-specific DMGs in TSS1500, TSS200, 5'UTR and body (Fig. 3d, e). While, DMGs with multiple DMRs were mainly occupied by 5'UTR and 1stExon, TSS1500 and TSS200, TSS200 and 5'UTR. Even more noteworthy is that approximately 6% DMGs had 3 or more differentially methylated regions, which represented a general differential methylation status of a certain gene in CAD patients compared to controls.

GO Functional And KEGG Pathway Enrichment Analysis Of DEMGs
We performed GO functional and KEGG pathway enrichment analysis on up-hyper, up-hypo, downhyper, down-hypo genes separately to explore the inner connection of DEMGs. The top 5 GO enrichment terms were illustrated in Table 2, from which we could find DEMGs enriched in numerous processes. Up-hyper genes were basically enriched in the biological process and 2 terms were associated with actin cytoskeleton reorganization. Four fifths of terms enriched on up-hypo genes were related to organelle membrane or granule membrane. It is notable that the rest 1 term of uphypo genes was neutrophil activation which enriched most genes among the top 5 terms. AS for down-hyper genes, GO terms were majorly centered on DNA-directed RNA polymerase activity, which indicate the potential connection between DNA methylation and mRNA expression. Besides, 2 terms of down-hypo genes were involved in regulation of calcium ion transportation.   algorithms based on bioinformatic analysis.
As displayed in Fig. 4a and Fig. 4b, top 10

Clinical Characteristics Of Subjects In Validation Study
The clinical parameters of CAD patients and controls were summarized in Table 1. Groups of participants were matched in gender and age structure. More CAD patients suffered from hypertension (HP) (P = 0.009) and diabetes mellitus (DM) (P = 0.025) compared with controls. Among the serum lipid parameters, no conspicuous differences were observed in total cholesterol (TC), low density lipoprotein cholesterol (LDL-C), high density lipoprotein cholesterol (HDL-C). While, the level of triglyceride (TG) was much higher in CAD patients (P < 0.001). Leucocyte differential count reveled significant increase of monocytes (P < 0.001) and neutrophils (P = 0.031) but prominent decrease of lymphocytes (P = 0.025) in CAD patients. Parameters based on leucocyte differential counts such as the lymphocyte to monocyte ratio (LMR) (P < 0.001) was decreased while the neutrophil to lymphocyte ratio (NLR) (P < 0.001) was increased in CAD patients.

Hub Genes Expression In Validation Study
In order to validate the significance of hub genes identified based on bioinformatic analysis, the mRNA expression levels of top 1 hub genes from 4 DEMGs groups were detected by qPCR. In accord with the bioinformatic results, FN1 from up-hyper genes and PTEN belonged to up-hypo genes were remarkably upregulated in CAD patients when compared with controls ( Fig. 6a, b). The mRNA expression level of POLR3A, the top 1 hub gene of down-hyper genes, was conspicuously decreased in patients with CAD, which was also consistent with the bioinformatic results above (Fig. 6c).
Nevertheless, the foremost hub gene UBR1 from down-hypo genes was not differentially expressed between CAD patients and controls (Fig. 6d).
In consideration of these hub genes were screened from DEMGs, we speculated the significant expression differences of FN1, PTEN and POLR3A were in all probability correlated with the aberrant DNA methylation status in CAD patients, which could be a more stable biomarker compared with mRNA expression.

Correlation Between Hub Genes Expressions And Clinical Characteristics
Correlation analysis and multivariate stepwise linear regression analysis were performed to investigate the underlying connection between clinical characteristics of all enrolled subjects and expression levels of top 1 hub genes. As displayed in Table 4, both FN1 (r = 0.268, P = 0.039) and PTEN (r = 0.326, P = 0.011) were identified to be positively correlated with monocyte counts. LMR was observed to be negatively related with FN1 (r = -0.255, P = 0.049) and PTEN (r = -0.315, P = 0.014), and positively correlated with POLR3A (r = 0.288, P = 0.026). Besides, PTEN was reversely associated with NMR (r = -0.311, P = 0.016) and the expression of POLR3A was decreased with aging (r = -0.320, P = 0.013). In addition, although the expression of UBR1 was not confirmed to be different between CAD patients and controls, a prominent reverse correlation was verified between UBR1 and TG when involved all subjects (r = -0.312, P = 0.012). Table 4 Correlation remained to be related with age (β = -0.320, P = 0.013) after adjusting LMR.

Discussion
Identify epigenetic pattern and certain biomarkers from PBLs would be conducive to the diagnosis, therapy and monitor of CAD in non-invasive approach. In this study, we filtrated genes that were both discrepantly expressed and methylated in CAD patients compared with controls. Pathways enriched by these genes were demonstrated and hub genes were screened out based on PPI network. To verify the results of data analysis, top hub genes were experimentally confirmed through qPCR in CAD patients and controls. Furthermore, we investigated methylation patterns based on separate gene regions and gave evidence of the most vulnerable region to methylation in CAD. The differential expressions of hub genes filtered from DEMGs were with great possibility to be associated with altered DNA methylation status, which shed light on the underlying regulation mechanism of DNA methylation in CAD and helped identification of the stable novel DNA methylation biomarkers.
As the chromosome distribution map exhibited, differentially methylated CpGs covered almost every regions of each chromosome. This can be regarded as the universality of methylated regulation in the pathogenesis of CAD [24]. It is interesting to note that DMCs distributed in regions near centromeres were relatively fewer than other regions. The phenomenon might be partly attribute to the supercoiling structure and hyper-reiterated DNA sequence around the centromeres [25].
Numerous published studies took promoters as key differentially methylated regions in CAD, while other intragenic regions were less concerned [26,27]. Recently, a few studies have demonstrated methylation sites in gene body or TSS1500 were also essential to the pathogenesis of CAD [28]. Our KEGG pathway enrichment analysis suggested up-hyper genes mainly enriched in adipocytokine signaling pathway and VEGF signaling pathway. It was consistent with previous studies that increased expression of the adipocytokine omentin was detected in the epicardial adipose tissue of CAD patients and unbalance of isoforms of VEGF was associated with the complexity and severity of CAD [29,30]. Up-hypo genes primarily enriched in neutrophil activation and it has been confirmed by several researches that increased neutrophil counts was connected with the severity of CAD [31].
Intriguingly, hub genes belonged to down-hyper genes were directly correlated with regulation of histone H3-K27 methylation according to GO enrichment analysis, which indicated underlying association between DNA methylation and histone methylation. Regulation of calcium ion transport into cytosol was the term enriched by down-hypo genes and has been proved to affect the progress of CAD through coronary smooth muscle [32].
In view of the small sample size of datasets obtained from GEO database and accumulative deviation from bioinformatics analysis, we performed laboratorial verification with relatively larger sample size Inflammatory cytokines activated TGF-β signaling pathway, which promoted the expression of FN1 in human endothelial cells [33]. Fibronectin, encoded by FN1, was enriched in vascular subendothelial basement membrane during the early process of atherosclerotic plaque formation and aggravated the monocytes recruitment [34]. The significant up-regulation of FN1 was positively related with monocyte amount in CAD patients confirmed the previous explored regulation pathways. The vascular endothelial cells injury caused by chronic inflammatory in atherosclerosis accelerated the atherosclerotic plaque formation. A series of microRNAs could bind to the 3'UTR of PTEN, altered the proliferation of vascular endothelial cells through PI3K-Akt pathway and influenced the procession of CAD [35]. Previous reports suggested PTEN was up-regulated in peripheral blood mononuclear cells of CAD patients, which was consistent with our founding in PBLs and confirmed the reliability of our research [36]. In addition to the involvement of microRNAs in the regulation of PTEN, we conjectured DNA methylation might also participated in the regulation network. POLR3A mainly triggered leukodystrophy [37]. And currently, no study was carried out to research the function of POLR3A in CAD or atherosclerosis. The aberrant expression and methylation of POLR3A may serve as novel potential biomarkers of CAD.

Conclusions
In summary, our study consolidated both mRNA expression and DNA methylation microarrays of PBLs in CAD into bioinformatics analysis and executed experimental validation for the first time.     Relative mRNA expressions of top 1 hub genes from 4 DEMGs groups.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.
Additional file 2.pdf Additional file 1.pdf Additional file 3.pdf