Identication of Key Biomarkers Associated with Survival and Prognosis in Colon Adenocarcinoma

Background: Colorectal cancer (CRC) has a high rate of relapse and recurrence that result in poor prognosis and unsatisfactory outcomes. Colon adenocarcinoma (COAD) is the most prevalent type of CRC. It is crucial to identify novel molecular biomarkers for the early diagnosis, prognosis evaluation and disease monitoring of COAD. Methods: Three gene expression prole data were downloaded from the Gene Expression Omnibus(GEO), and the differential expression genes(DEGs) were identied by GEO2R. Gene Ontology (GO) and KEGG pathway enrichment analysis were conducted by WebGestalt online tool. String database and Cytoscape software were used for protein–protein interaction (PPI) network construction and module analysis. The top 20 Hub Genes were screened from the PPI network using MCC algorithm on CytoHubba app of Cytoscape software, and were veried by ONCOMINE database then. The core genes affecting CRC prognosis were screened by GEPIA2 survival analysis web tool. Finally, the expression level and clinical indicators including core genes was analyzed by TCGA-COAD dataset. Results: In total, 413 differentially expressed genes (DEGs) were identied, and the GO and KEGG enrichment analyses of DEGs were processed. After, the protein–protein interaction (PPI) network was constructed and 20 hub genes were identied. Furthermore, three core genes were selected via survival analysis . Finally, the diagnostic and prognostic value of these core genes was veried by clinical analysis of TCGA-COAD dataset. Conclusion: SPP1, GRP and GNGT1 were all over-expressed in COAD, and may be regarded as novel diagnostic and prognostic biomarkers for COAD.


Background
Colorectal cancer (CRC) is one of the most common cancer types worldwide, which accounting for about 11% of all diagnosed cancer cases [1,2]. Additionally, CRC is the second most lethal cancer worldwide [3].
Previous studies have indicated that the pathogenesis of CRC is a multi-step and multi-path evolution process in which chromosomal instability, dysregulation of oncogenes and tumor suppressor, epigenetic alterations, metabolic alterations, abnormal immune response occured [4,5].Although numerous treatments, including surgery, chemotherapy radiotherapy, and targeted therapy, have been improved, nearly 55% of CRC patients eventually relapse and suffer from the recurrence [6], and the prognosis of patients is poor [7]. Colon adenocarcinoma (COAD) is the most prevalent type of CRC, the incidence and mortality of COAD was 10.2 and 9.2% [3,8]. The exact mechanisms of CRC development are still poorly understood, therefore, it is crucial to better understand the exact mechanism of tumorigenesis and identify novel molecular biomarkers for early diagnosis, prognosis evaluation, disease monitoring. [9,10] The recently adopted high-throughput gene microarray technology has been extensively applied in several elds of medicine and biology, which allows us to share and explore various molecular mechanisms [11]. Furthermore, we can study the key genes and select potential molecular targets and diagnostic markers [12].
In this paper, 413 differentially expressed genes (DEGs) in CRC were identi ed by GEO2R. Gene Ontology (GO) and KEGG pathway enrichment analysis were conducted by WebGestalt online tool to determine the functional annotation and pathway of DEGs. At the same time, the protein-protein interaction (PPI) network was constructed and 20 hub genes were identi ed. Three core genes were nally selected by survival analysis, and the association between hub genes and its diagnostic as well as prognostic value in COAD was investigated (Fig. 1).

Microarray data
Gene Expression Omnibus(GEO) (http://www.ncbi.nlm.nih.gov/geo) functions as a public functional genomics database of high throughout gene expression data, chips and microarrays [13]. Three gene expression pro les in colorectal cancer and normal epithelium were downloaded from GEO, that is GSE21815, GSE35279 and GSE71187. Microarray data of these three datasets were all on account of GPL6480 Platforms, Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name version). The GSE21815 dataset contained 132 colorectal cancer samples and 9 normal epithelium samples. The GSE35279 dataset contained 74 colorectal cancer samples and 5 normal epithelium samples. The GSE71187 dataset contained 47 Colorectal cancer tissue specimen and 12 normal colorectal tissue specimen.
Identi cation of DEGs GEO2R (http://www.ncbi.nlm.nih.gov/geo/geo2r) is regarded as an interactive online tool designed to compare two or more datasets in a GEO series for the purpose of DEGs identi cation across experimental conditions. The DEGs between CRC tissues and normal epithelium tissues were identi ed using GEO2R with the P value < 0.05 which were considered of statistically signi cance. For the next step, we utilized the "dplyr" R package to identify the DEGs between CRC samples and normal epithelium samples. Adjusted P < 0.05 and |log fold change (FC)| > 2 were chosen as the cutoff threshold. The volcano plot and venn diagram containing these DEGs were drawn with the "ggplot2" package in R 4.0.4.

Functional enrichment analysis
WEB-based GEne SeT AnaLysis Toolkit(WebGestalt) ( http://www.webgestalt.org/) is an online biological information software tool that integrates functional categories derived from centrally and publicly curated databases as well as computational analyses [14]. We analyzed GO and KEGG pathway enrichment analyses for the DEGs by WebGestalt. GO terms enrichment analysis were categorized into biological process (BP), cellular component (CC), and molecular function (MF). Top 10 terms were selected according to FDR.

Result
Identi cation of DEGs in colorectal cancer Via GEO2R online tools, DEGs in three datasets (1357 DEGs in GSE35279, 1493 DEGs in GSE21815, and 1884 DEGs in GSE71187, respectively) were extracted after gene expression pro le data processing and standardization with the cutoff standard of P value < 0.05 and |logFC|>2. The volcano plot and venn diagram were drawn with the "ggplot2" package in R 4.0.4. The overlapping DEGs among these three datasets contained 413 genes as shown in the Venn diagram (Fig. 2).

Enrichment analysis for DEGs
To elucidate the biological functions of the overlapping DEGs, we performed functional annotation and pathway enrichment analysis via WebGestalt online tool. Results indicated that the overlapping DEGs in biological process of GO enrichment were markedly associated with tissue development, tube morphogenesis, tissue morphogenesis, tube development, skeletal system development, anatomical PPI network construction and module analysis Search Tool for the Retrieval of Interacting Genes(STING)(http://string-db.org) online database was used to predict the PPI network which may further explain the mechanisms of the occurrence and progression of diseases [15]. By using STRING database, PPI network of DEGs was analyzed and an interaction with a combined score > 0.7 was recognized as statistical signi cance. The plug-in MCODE (Molecular Complex Detection) app of Cytoscape (an public bioinformatics software, version 3.7.1) is constructed for clustering a network based on topology to determine intensively connected regions [16]. The PPI network was plotted with the application of Cytoscape and the most signi cant module in the PPI network was narrowed down using MCODE with the following criteria: degree cutoff = 2, node score cutoff = 0.2, k-core = 2, max depth = 100.

Hub genes selection and analysis
The plug-in Cytohubba of Cytoscape is an APP provided with 11 topological analysis methods for ranking nodes in a PPI network by their network features [17]. In the present study, the top 20 hub genes were ranked according to the maximal clique centrality (MCC). Oncomine (http://www.oncomine.com) online database was applied for explore the mRNA expression levels of the hub genes in various kinds of cancers, including CRC [18].
Gene Expression Pro lling Interactive Analysis (GEPIA2)(http://gepia2.cancer-pku.cn/#index) online tool was used for mRNA expression level and survival analysis of the hub genes [19]. The survival map of 20 hub genges was plotted by "Survival Analysis" function in GEPIA2. Accordingly, three core genes affecting CRC prognosis were screened.
The veri cation of the expression level of core genes and their diagnostic and prognostic value was conducted via analysising the TCGA-COAD dataset. The box plot and ROC curve were plotted by R 4.0.4. structure formation involved in morphogenesis, animal organ morphogenesis, circulatory system development, cell proliferation and epithelium development (Fig. 3A). As for molecular function of GO enrichment, DEGs were remarkably related to receptor regulator activity, receptor ligand activity, RNA polymerase II regulatory region sequence-speci c DNA binding, RNA polymerase II regulatory region DNA binding, extracellular matrix structural constituent, transcription regulatory region sequence-speci c DNA binding, sequence-speci c double-stranded DNA binding, extracellular matrix structural constituent conferring tensile strength, hormone activity, transcription regulatory region DNA binding (Fig. 3B). In addition to cellular component, collagen-containing, extracellular matrix, basement membrane, extracellular matrix component, complex of collagen trimers, basolateral plasma membrane, plasma membrane region, endoplasmic reticulum lumen supramolecular polymer, supramolecular complex (Fig. 3C). Besides, signaling pathway analysis of KEGG demonstrated that those DEGs played pivotal roles in Wnt signaling pathway, AGE-RAGE signaling pathway in diabetic complications, Central carbon metabolism in cancer, Biosynthesis of amino acids, Breast cancer, Malaria, Gastric cancer, ECMreceptor interaction, TGF-beta signaling pathway, MicroRNAs in cancer (Fig. 3D).

PPI network construction and signi cant module identi cation
To evaluate the interrelationships among these DEGs, we rst drew the network of DEGs in STRING, we set the maximum number of interacting bodies to 0 and used a con dence score of 0.7 as the cut-off criterion, the PPI network included 388 nodes and 334 edges (Fig. 4A). Additionally, the Molecular Complex Detection (MCODE) app was also employed to select modules of the PPI network in Cytoscape according to node score cut-off = 0.2, degree cut-off = 2, max.depth = 100, and k − core = 2. The top three functional clusters of modules were selected (module 1, MCODE score = 12.000; module 2, MCODE score = 6.333; module 3, MCODE score = 6.167) (Fig. 4B-4D ).
KEGG pathway analysis of each module was performed by Enrichr (https://maayanlab.cloud/Enrichr/) online tool. The KEGG pathway analysis of module 1 indicated that these genes were involved in the Neuroactive ligand-receptor interaction, Chemokine signaling pathway, and Cytokine-cytokine receptor interaction; genes in module 2 enriched in Progesterone-mediated oocyte maturation and cell cycle; and genes in module 3 related to the Protein digestion and absorption, ECM-receptor interaction, and carbon metabolism.

Hub genes selection and validation
Based on querying STRING protein information from the public database, we constructed a PPI network of the top 20 hub genes according to the maximal clique centrality (MCC). The top 20 hub genes with maximal clique centrality were as follows: SST, PPBP, SPP1, GCG, INSL5, CXCL8, EDN3, CXCL11, PYY, P2RY1, HTR1D, GPR4, SAA1, IL6, PLCB1, APLN, GRP, ADRA2C, GNGT1, GAL (Fig. 5). Then, mRNA expressions of hub genes in 20 types of cancers were measured and compared to normal tissues by ONCOMINE database, and 15 hub genes were found to be over-expression in colorectal cancer (Fig. 6).
We used GEPIA2 online tool to identify the survival data of 20 hub genes and found that the expression levels of three hub genes, namely SPP1, GRP and GNGT1, were remarkably related to the survival of colon adenocarcinoma(COAD) patients (Fig. 7,8). Furthermore, we validated the expression level of these three hub genes in COAD and normal tissues via TCGA-COAD dataset, and explored their diagnostic value as well as clinical signi cance. The results showed that expression levels of the three genes were signi cantly increased in the COAD samples and the over-expression of these three genes was correlated with the TMN stage of COAD patients (Fig. 9-10).

Discussion
CRC is a great threat to people's health, which has been the 3rd most common cancer throughout the world, according to the global burden of disease study in 2018. COAD is the most prevalent type of CRC. In the last decade, increasingly researches have focused on the pathogenesis mechanisms of CRC and the therapeutic strategies have been developed. Nevertheless, CRC mortality and morbidity rates remain high due to postsurgical recurrence and metastasis of primary tumours. Therefore, it is of great importance to identify novel molecular targets for the early diagnosis, treatment, and prognosis of CRC.
Bioinformatics analysis is an effective method to discover the pathogenesis of cancer by studying the occurrence and development of cancer at the molecular level. In the present study, DEGs between CRC and non-cancerous tissues were obtained from three mRNA microarray datasets via GEO2R. The functions of the DEGs were identi ed by GO and KEGG enrichment analysis. The results demonstrated that the DEGs were enriched in the function items and pathways which were associated with progression and prognosis of CRC, such as the cell proliferation, extracellular matrix structural constituent, Wnt signaling pathway, Central carbon metabolism in cancer, ECM-receptor interaction and TGF-beta signaling pathway. The extracellular matrix (ECM) regulates tissue development and homeostasis, and its dysregulation have been appreciated as key drivers for both development and cancer progression [20,21].
Aberrant Wnt/β-catenin signaling has often been reported in different cancers [22], particularly CRC [23], and this signaling cascade is central to carcinogenesis which has potential value as a therapeutic target in the treatment of CRC [24]. Transforming growth factor-beta (TGF-β) signaling is one of the important cellular pathways that play key roles for tissue maintenance [25]. In particular, it is important in the context of in ammation and tumorigenesis by modulating cell growth, differentiation, apoptosis, and homeostasis [25,26].
Furthermore, we constructed a PPI network via STRING online tools and detected 20 hub genes by CytoHubba. The expression and diagnostic value of these hub genes were veri ed in the ONCOMINE database, and the survival analysis by TCGA database reveals the high expression of SPP1, GRP and GNGT1 were signi cantly associated with poor cancer survival rates.
The secreted phosphoprotein 1 (SPP1, also known as osteopontin), is an extracellular matrix chemokinelike phosphoglycoprotein that facilitates cell-matrix interaction which is overexpressed in various malignant neoplasms and plays a role in tumourigenesis and metastasis [27]. Wei et al. [28] found that SPP1 overexpression led to enhanced anchorage-independent growth, cell migration and invasion in KRAS gene mutant cells and overexpression also induced PI3K signalling, expression of Snail and Matrix metallopeptidase 9 (MMP9), and suppressed the expression of E-cadherin in KRAS mutant cells. Zeng et al. [29] found the expression of SPP1 can activate activated Integrin β1/FAK/AKT pathway promotes ovarian cancer progression. Zhang et al. [30] found SPP1 could upregulate PD-L1 which mediates macrophage polarization and facilitates immune escape in lung adenocarcinoma. In our research, we inferred that SPP1 was enriched in ECM-receptor interaction, which was signi cantly associated with EMT pathway [31].
Gastrin-releasing peptide(GRP) is a small regulatory peptide with homology to bombesin [32]. GRP and its receptor play an important role in a multitude of physiological functions including sensory transmission, regulation of central autonomic pathways, thermoregulation, secretion of pituitary hormones, gastric and pancreatic secretion, and food intake and satiety [33][34][35].Increasing evidences demonstrated that GRP acts as a mitogen, morphogen and pro-angiogenic factor in certain cancers [36,37]. Patel et al. [38] found that nonamidated peptides derived from the C terminus of pro-GRP are expressed in signi cant quantities in CRC cell lines and tumors and stimulate the proliferation of CRC cells and of normal colonic mucosa. Such peptides are attractive targets for novel CRC therapies. Ni et al. [39] found GRP was highly expressed in breast cancer patients with lymph node metastasis. Besides, among the patients with lymph node metastasis, the ones with higher expression of gastrin-releasing peptide had shorter survival time.
Overexpression of gastrin-releasing peptide signi cantly enhanced cell invasiveness. Conversely, a knockdown of gastrin-releasing peptide through the short hairpin RNA approach remarkably reduced MCF-7 cell invasion. Additionally, our study reveals that the high expression of GRP signi cantly associated with poor cancer survival rates, and may be a novel diagnosis biomarker for colon cancer.
The transducing gamma subunit gene (GNGT1) encodes a member (gamma1) of the family of heterotrimeric G-protein gamma subunits that is speci c to rod photoreceptors. [40] Alsaleem et al. [41] found reduced/loss E-cadherin expression was associated with differential expression genes including GNGT1 which regulats PIK3-AKT signaling pathways. Mucaki et al. [42] found GNGT1 is a gene that has an e cient response to chemotherapy by carboplatin. In the current study, GNGT1 was signi cantly upregulated and high mRNA expression of GNGT1 was associated with poor overall survival in CRC patients.
Although hub genes were veri ed with different data sources, the main limitation of our research is that our study is only at the level of bioinformatics analysis. It was probably because of SPP1 and GRP were secretory proteins, we couldn't nd positive results from the human protein atlas (THPA) database. Further experimental analysis, including using a cell model and CRC tissues from the clinic, and different experimental methods such as western blot and ELISA to validate the prediction is urgently needed.

Conclusion
In conclusion, with the integrated bioinformatics analysis for gene expression pro les from GEO database, we dug out three core molecules associated with the prognosis of COAD, including SPP1, GRP and GNGT1. These core genes were all over-expressed in COAD, and may be regarded as novel diagnostic and prognostic biomarkers for COAD.  Figure 1 Page 13/19

Figures
The analysis process of present study.

Figure 2
DEGs' volcanic map and VENN diagram. A-C, differential genes were screened from three data sets (GSE35279, GSE21815, and GSE71187), and the distribution of differential genes in each data sets was observed by volcanic map. D, VENN diagram were drawn according to all differential genes(DEGs).  Transcriptional expression of Hub genes in 20 different types of cancer diseases. Difference of transcriptional expression was compared by students' t test. Cut-off of p value and fold change were as follows: p value, 1E-4; fold change, 2; gene rank, 10%; data type, mRNA.

Figure 7
Overall survival analysis of hub genes in pan cancer was performed by utilizing the GEPIA2 Survival Map online tool. Signi cance Level, 0.05; Group Cutoff, 25% high.

Figure 8
Page 18/19 The detection of the survival rate of the chosen hub genes in colon adenocarcinoma (COAD). A-C, Kaplan-Meier curves of SPP1, GRP, and GNGT1. Log-rank test was carried on the relevant results. Signi cance Level, 0.05; Group Cutoff, 25% high.

Figure 9
The detection of the expression level the diagnostic value the chosen hub genes. A-C, Box plot of SPP1, GRP, and GNGT1 expression level in COAD,the result show that these core genes were over-expressed in COAD (p<0.001). D-F, ROC curves of SPP1, GRP, and GNGT1 in COAD. AUC and CI were carried on the relevant results Figure 10