Identication of Methylation Driven Biomarkers for Diagnosis and Prognosis in Colon Cancer by Integrative Analysis of TCGA, GTEx, and GEO Data

Background: The intention of the present work was to investigate methylation driven biomarkers for diagnosis and prognosis in colorectal cancer (CRC) by integrative analysis of DNA methylation and gene expression data from The Cancer Genome Atlas (TCGA), The Genotype-Tissue Expression (GTEx), and Gene Expression Omnibus (GEO). Methods and Results: The differentially expression genes (DEGs) and differentially methylated genes (DMGs) were screened using mRNA expression and DNA methylation data from TCGA, respectively. And the methylation driven genes (MDGs) of CRC were further identied using MethylMix R package. Subsequently, the MDGs were underwent Random Forest (RF) analyses to establish diagnosis prediction model using mRNA expression data from TCGA and GTEx, which were then validated by GSE39582 from GEO dataset. In addition, prognostic biomarkers that were used to establish the methylation related risk score model was generated by the univariate and multivariate Cox regression analysis. 9 out of 10 DMGs revealed excellent performance as independent diagnostic predictors, including CLDN1, EPHX4, TCN1, ARHGAP20, LY6G6D, FAM150A, KCNJ12, KRT7 and STK33. Furthermore, STK33 and EPHX4 were found to be associated with Overall survival (OS). Conclusions: Our ndings suggest that the identied MDGs could be potential biomarkers for diagnosis and prognosis of CRC.


Introduction
CRC is one of the most prevalent malignant tumors and the second leading cause of mortality worldwide. [2] Despite the great progress made in diagnosis and therapy, CRC patients usually develop recurrence and metastasis, leading to the 5-year survival rate dramatically decreased. [23] Therefore, it is urgent to improve the CRC diagnosis, treatment and prognosis.
Molecular characterization is extensively utilized to predict tumor diagnosis, treatment and prognosis, which offers great potential for improving understanding of tumor development. [5,27] Previous studies on epigenetics have indicated that aberrant DNA methylation could led to gene silencing, [14,25] which may promoted oncogenesis. [6,13] DNA methylation related biomarkers for diagnosis and prognosis have been researched in various tumors, such as endometrial cancer, [32] prostate cancer, [18] and hepatocellular carcinoma. [30] MethylMix, a R package has been veri ed to be able to identify MDGs by integrative analyze DNA methylation and gene expression data from normal and tumor samples, [3,9] and the relationship between MDGs and the prognosis of CRC patients has been investigated. [4,28] However, individualized diagnosis and prognosis models for CRC based on MDGs have not been reported.
On the basis of existing literature data, we carried out studies in an effort to identity a set of MDGs using DNA methylation and gene expression data, and construct independent diagnosis model using RF algorithm. Additionally, the MDGs, presumed as potential prognostic indicators, were analyzed using Page 3/12 univariate and multivariate Cox regression. Our ndings may suggest that the potential methylation driven biomarkers could prompt a more individualized diagnosis and therapies for CRC.

TCGA and GTEx Data acquisition
The mRNA expression and DNA methylation data (Illumina Human Methylation 450) of CRC from TCGA and the mRNA expression of normal samples from GTEx were downloaded from the UCSC Xena (https://xenabrowser.net/datapages/). Subsequently, the samples from TCGA were divided into normal and tumor group. Finally, we selected 667 samples (51 normal and 616 tumor) for DEGs analysis, 433 samples (45 normal and 388 tumor) for MDGs analysis, detailed information was shown in Table 1. In addition, we selected mRNA expression data of colon tissue from GTEx to provide supplementary data for normal samples from TCGA, and 307 normal samples were obtained.

Geo Data Acquisition
We downloaded the gene expression matrix of GSE39582 based on platform GPL570 from GEO database (https://www.ncbi.nlm.nih.gov/geo/), with a total of 566 tumor and 19 normal samples (Table 1).

Screening Degs, Dmgs And Mdgs
The DEGs were identi ed by comparison tumor with normal groups using the limma R package, [22] with the cutoff criteria de ned as | log2 fold change (FC)| >0.585 and P < 0.05. In addition, | logFC| >0.3 and P < 0.05 was used to screen DMGs using ChAMP R package. [24] Subsequently, the tumor methylation matrix, normal methylation matrix and tumor expression matrix with the overlapping genes of DEGs and DMGs were constructed to identify MDGs using R package MethylMix. [3,9] The signi cant methylation events were ltered using the cutoff criteria set as correlation coe cient < -0.3 and P < 0.05, and the correlation of the MDGs were visualized by corrplot R package. Moreover, the expression differences of the MDGs between tumor and normal samples were statistically analyzed using unpaired t test.

Identi cation And Validation Of Diagnosis Biomarkers Using Rf Algorithm
The mRNA expression datasets from TCGA and GTEx were integrated and randomly divided into training dataset and validation dataset with a 3:1 ratio in tumor and normal samples. The 10 MDGs were evaluated as independent indicators and combined indicator for diagnosis of CRC using RF algorithm. In addition, The GSE39582 dataset were analyzed as validation dataset combined with training dataset from TCGA and GTEx using 7 of the MDGs, owing to the absence of the other 3 MDGs in GSE39582. The receiver operating characteristic (ROC) curve was utilized to depict the sensitivity and speci city of diagnosis models using independent MDG or the combined MDGs panel.

Identi cation And Evaluation Of Prognosis Biomarkers
The mRNA expression data with OS information were select to determine the prognostic biomarkers. The univariable Cox proportional hazards regression model was performed using Survival R package, using Pvalue cutoff of 0.05. Subsequently, the Kaplan-Meier survival analysis and multivariate Cox regression analysis were performed to determine independent prognostic factors. Accordingly, the coe cient of MDGs was obtained from the multivariate Cox results. The risk score based on the signature of each CRC patients was calculated using the following formula: To further evaluate the predictive e ciency of the constructed MDGs risk score model, we performed the ROC curve to re ect the sensitivity and speci city of survival prediction, and quanti ed the area under the curve (AUC) using survivalROC R package. The optimal cutoff risk score was designated at the turning point of ROC curve, where the difference between true positive and false positive is the most signi cant. The patients above the cutoff value were in the high-risk group, while patients below it in the low-risk group. In addition, Kaplan-Meier curves were plotted to distinguish the two groups using survminer R package.

Identi cation of MDGs as candidate diagnostic and prognostic biomarkers
The whole data generation and analysis work ow was shown in Fig. 1. Expression and DNA methylation data from TCGA were separately analyzed to screen DEGs and DMGs by comparison between normal and tumor samples. As a result, 529 DEGs (273 upregulated and 256 downregulated) were identi ed ( Fig. 2a and Table S1). 91 overlapping genes of DEGs and DMGs were used for the MethylMix analysis with a lter criterion set as Cor < -0.3, |log2FC| > 0.3 and P < 0.05, and 10 MDGs were identi ed as candidate diagnostic and prognostic biomarkers (Fig. S1). Pearson's correlation test was statistically analyzed the correlations between methylation degree and gene expression of the MDGs using the cor.test function of the R language (https://www.r-project.org/) (Fig. S2). It was found that the methylation degree of those 10 MDGs was negatively correlated with their expression. The higher the methylation degree was, the lower the gene expression, and the detailed information of MDGs was shown Table 2. To quantify the difference of each candidate biomarker between normal and tumor samples, the unpaired t test was conducted. As a result, signi cant difference was found in all the candidate biomarkers (Fig. 2b). In addition, the correlations between each candidate biomarker were investigated using corrplot R package (Fig. 2c), which may help us choose the optimal independent gene set for diagnosis of CRC.

Evaluation Of The Mdgs As Candidate Diagnostic Biomarkers
We investigated the potential of using the MDGs for prediction of diagnosis in samples from TCGA and GTEx, 9 out 10 MDGs revealed excellent performance as independent diagnostic predictors, with the area under curve (AUC) of CLDN1 is 0.988, EPHX4 is 0.972, TCN1 is 0.962, ARHGAP20 is 0.929, LY6G6D is 0.888, FAM150A is 0.863, KCNJ12 is 0.858, KRT7 is 0.852, and STK33 is 0.847 (Fig. 3a-i). Additionally, the performance of the combined MDGs were further assessed in samples from TCGA-GTEx and GEO, and the outcome of AUC is 1.0 and 0.996, respectively (Fig. 3k, l). The importance of each predictor for the combined predictive model was shown in Table 3. The larger the number of MeanDecreaseAccuracy and MeanDecreaseGini, the more important of the predictor for the model. It was found that the number of MeanDecreaseAccuracy and MeanDecreaseGini was positively correlated with the AUC of the predictive model of the predictor. Moreover, unsupervised hierarchical clustering of the combined panels with ten makers or seven makers indicated that the constructed models could distinguish CRC patients from normal controls accurately (Fig. 2d-g). Collectively, those 10 MDGs could be candidate biomarkers for diagnosis of CRC samples, and 9 of them performed well as independent indicator. The 'MeanDecreaseAccuracy' and 'MeanDecreaseGini' of each indicator was produced by the importance function in Random Forest model.

Construction and evaluation of MDGs related prognostic model in CRC patients
Univariate Cox regression analysis revealed that 2 of 10 MDGs were independent prognostic indicators of OS, with the hazard ratio of STK33 was 1.09 (95 % CI: 1.01-1.17, p = 0.021) and EPHX4 was 0.91 (CI: 0.83-0.99, p = 0.032) (Fig. 4a). Subsequently, Kaplan-Meier analysis with log-rank test using STK33 and EPHX4 as independent prognostic indicator, the results indicated that patients with high-risk scores suffered poor OS, with the P-value was 0.008 and 0.047, respectively. Moreover, we calculated the risk score of each CRC patient using the formula as follows: (1.09 × STK33) + (0.91 × EPHX4). And then CRC patients (n=635) were categorized into high or low score group, according to the optimal cut-off value of the risk score obtained from survminer R package. And the results also indicated that high-risk score patients had a worse OS than those of low-risk score patients (P < 0.0001) (Fig. 4d). The prognostic accuracy of the risk score model was investigated as a continuous variable (Fig. 4e). The AUC of the prognostic model for OS was 0.569 at 3 years, 0.633 at 4 years, and 0.626 at 5 years.

Discussion
In present study, we identi ed 10 MDGs that could be considered as potential diagnostic and prognostic predictors for CRC samples, and seven of those MDGs have been reported to be associated with cancers as following: 1) CLDN1 (Claudin 1) is a member of the claudin family and tight junction-related proteins, which were demonstrated to be associated with dysfunction or abnormal expression in various tumors. [1,21] In addition, a previous study revealed that aberrant expression of CLDN1 regulated AMPK/STAT1/ULK1 signaling pathway, lead to the promotion of proliferation and metastasis in esophageal squamous cancer. [29] Moreover, CLDN1 was experimentally demonstrated remarkably upregulated in CRC patients and could be considered as methylated diagnostic biomarker in CRC and normal control groups. [17] 2) EPHX4 (Epoxide Hydrolase 4) is a protein coding gene, which was signi cantly upregulated in rectal cancer con rmed by immunohistochemistry. [8] Additionally, a previous study performed exon array analysis from pseudomyxoma peritonei and normal colonic epithelia and found that EPHX4 was upregulated in pseudomyxoma peritonei. Our study also demonstrated that EPHX4 was up-regulated in CRC tumor samples. However, no previous research reported that EPHX4 was associated with CRC.
3) TCN1 (Transcobalamin I) encoded a member of the vitamin B12-binding protein family that transports of cobalamin into cells. A growing number of studies have veri ed that the overexpression of TCN1 obviously associated with tumor invasion and metastasis in CRC. [7,19] Moreover, it was experimentally demonstrated that TCN1 was signi cantly overexpressed in colon cancer tissues compared with normal controls at mRNA and protein level, and it could be considered as potential prognostic biomarker of colon cancer. [20] 4) LY6G6D (Lymphocyte Antigen 6 Family Member G6D) belongs to a cluster of leukocyte antigen-6 genes, which was conspicuously overexpressed (around 15-fold) and considered as a promising biomarker of immunotherapy for human Microsatellite stable CRC. [10] 5) FAM150A (family with sequence similarity 150A) can active lymphoma kinase (ALK) by binding to its extracellular domain, [11,15] and ALK has been used as effective biomarkers in various human cancers, such as neuroblastoma and non-small cell lung cancer. [12] Additionally, the DNA methylation status of FAM150A was demonstrated to be a diagnostic and prognostic indicator for clear cell renal cell carcinoma (ccRCC) using high performance liquid chromatography method. [33] However, no studies found that FAM150A associated with CRC.
6) KRT7 (Keratin 7) is a member of the keratin gene family. Previous studies showed that KRT7 play a signi cant role in tumor metastasis and considered as prognostic biomarker and potential targets for therapeutic prevention of metastasis. [16] In addition, KRT7 was down-regulated and hypermethylated in CRC tissues compared with adjacent normal tissues and may lead to the occurrence of CRC. [26] 7) STK33 (Serine/threonine kinase 33) was experimentally demonstrated that downregulation and hypermethylation of STK33 in CRC tissues compared with normal tissues with P < 0.001, and STK33 methylation was also remarkably associated with lymph node metastasis, tumor invasion, distant metastases and tumor stage. [31] The precious study and my research work both suggested that the hypermethylation and downregulation of STK33 may be a promising biomarker for the diagnosis and prognosis of CRC.
To our best knowledge, there is still a lack of research about the relationship between the left three MDGs and oncogenesis, which may represent new predictive biomarkers.

Conclusions
In conclusion, we identi ed 10 MDGs that could be used as potential biomarkers for CRC, of which 9 performed well as independent diagnostic predictors and 2 could be used as prognostic indicators. Availability of data and material: The data that support the ndings of this study are openly available in TCGA and GTEx from the UCSC Xena (https://xenabrowser.net/datapages/) and GEO database (https://www.ncbi.nlm.nih.gov/geo/).

Declarations
Author Contributions: Lichao Cao were involved in the study concept and design, and drafting the manuscript. Erfei Chen put forward some kind suggestions. Lichao Cao and Jin Yang helped analyze and interpret the data.
Con ict of Interest: The authors declare that they have no con ict of interest.