Genomic Expression Proling of Colorectal Cancer in Silico

Background: Colorectal cancer (CRC) is one of the most common malignancies of the digestive system; the progression and prognosis of which are affected by a complicated network of genes and pathways. The aim of this study was to identify potential hub genes associated with the progression and prognosis of colorectal cancer (CRC). Methods: We obtained gene expression proles from GEO database to search differentially expressed genes (DEGs) between CRC tissues and normal tissue. Subsequently, we conducted a functional enrichment analysis, generated a protein–protein interaction (PPI) network to identify the hub genes, and analyzed the expression validation of the hub genes. Kaplan–Meier plotter survival analysis tool was performed to evaluate the prognostic value of hub genes expression in CRC patients. Results: A total of 370 samples, involving CRC and normal tissues were enrolled in this article. 283 differentially expressed genes (DEGs), including 62 upregulated genes and 221 downregulated genes between CRC and normal tissues were selected. We nally ltered out 6 hub genes, including INSL5, MTIM, GCG, SPP1, HSD11B2, and MAOB. In the database of TCGA-COAD, the mRNA expression of INSL5, MT1M, HSD11B2, MAOB in tumor is lower than that in normal; the mRNA expression of SPP1 in tumor is higher than that in normal. In the HPA database, the expression of INSL5, GCG, HSD11B2, MAOB in tumor is lower than that in normal tissues; the expression of SPP1 in the tumor is higher than that in normal tissues. Survival analysis revealed that INSL5, GCG, SPP1 and MT1M may serve as prognostic biomarkers in CRC. Conclusions: We screened out six hub genes to predict the occurrence and prognosis of patients with CRC using bioinformatics methods, which may provide new targets and ideas for diagnosis, prognosis and individualized treatment for CRC.

critical roles in the development of CRC [8][9][10][11][12]. These genes and pathways can severe as diagnostic biomarkers and therapeutic targets, and also be used to evaluate the clinicopathological features of CRC.
Currently, stool-based tests (eg, guaiac-based fecal occult blood testing [gFOBT], immunochemical-based fecal occult blood testing [FIT], stool DNA [sDNA] testing), endoscopy (eg, exible sigmoidoscopy [SIG], colonoscopy), imaging (eg, double-contrast barium enema, computed tomographic colonography [CTC]) and pathological biopsy are effective screening and diagnostic methods for CRC [13,14]. Despite strong evidence that CRC screening reduces the incidence and mortality effectively, CRC screening rates are still low worldwide. Approximately 31% of US adults had never been screened for CRC in 2018 [15], Japan's target population of CRC screenings that had been done was less than 50% in 2013, and the target population's participation rate in CRC screening was more lower in China [16]. Therefore, the treatment of CRC is still very important. In the treatment of cancer, precision medicine is the current trend [17]. With the deepening of research, doctors' therapeutic decisions are being developed from "one gene, one drug" to "multi-gene, multi-drug" [18]. Although many aberrant genes and biomarkers have already been identi ed in CRC, there are still many challenges to overcome in order to establish a personalized therapy, ranging from the early diagnosis and treatment to the determination of prognosis factors. Therefore, it is urgent and imperative to search for reliable biomarkers for diagnosis and therapy.
At present, bioinformatics methods have been applied as very important tools to predict cancer-related genes [19]. In our study, we screened the differentially expressed genes (DEGs) between CRC and normal tissues in ve datasets from Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) [20].
Subsequently, Gene Ontology (GO) functional annotation analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways analysis were performed for the screened DEGs. Then, we established protein-protein interaction (PPI) network of the DEGs and selected the hub genes related to CRC.
Moreover, expression validation and survival analysis of these hub genes were analyzed using GEPIA platform and Kaplan-Meier plotter. This study would provide reliable molecular biomarkers for screening and diagnosis, prognosis, as well as novel therapeutic targets for CRC.

Methods
Gene expression microarray data.
High-throughput gene expression pro les of normal colorectal tissues and CRC tissues were extracted from the GEO database. The independent datasets of GSE28000 [21], GSE33113 [22], GSE41328 [23], GSE47063 [24], and GSE74602 were selected which included 86 CRC samples and 40 normal tissues, 90 CRC tissues and 6 normal tissues, 5 CRC samples and 5 normal tissues, 31 CRC samples and 47 normal tissues, and 30 CRC samples and 30 normal tissues, respectively.
Data Processing of DEGs.
The online analysis tool --GEO2R (https://www.ncbi.nlm.nih.gov/geo/geo2r/) was used to search DEGs between CRC samples and normal samples. The adjusted p values were utilized to decrease the false positive rate using Benjamini and Hoch-berg false discovery rate method by default. The adjust p value < 0.01 and |logFC|≥1 were adjusted as the cutoff criterion.

Analysis of enriched GO terms and KEGG pathways.
GO analysis is commonly performed for functional enrichment analysis; gene function can be classi ed into biological process (BP), molecular function (MF) and cellular component (CC) terms. KEGG is widely used for exploring the advanced functions and mechanisms involved in the biological system at the molecular level. In this study, GO terms and KEGG pathways analysis was performed with the Database for Annotation, Visualization and Integrated Discovery (DAVID) platform ( v6.8; https://david.ncifcrf.gov/) [25]. P < 0.05 and gene counts ≥ 10 were considered statistically signi cant.
Construction of PPI network and identi cation of module's seeds.
The Search Tool for the Retrieval of Interacting Genes (STRING) database (http://string-db.org/) [26] was performed for PPI network construction. The minimum required interaction score was set to 0.4 as default. Then, Cytoscape software v3.7.1 (http://www.cytoscape.org/index.html) [27] was used for the analysis and visualization of the PPI network. In addition, the remarkable module of PPI network was selected via Molecular Complex Detection(MCODE) app (node score cutoff = 0.2, degree cutoff = 2, max. depth = 100, and k-score = 2). Nodes with higher degree of connectivity tend to be more essential in maintaining the stability of the entire network. The CytoHubba plugin was used to calculate the degree of each protein node. In our study, hub genes were selected based on a node degree ≥ 4.

Expression and survival analysis of hub genes
We checked the expression of hub genes corresponding to key proteins in TCGA database (NIH) through GEPIA platform (http://gepia.cancer-pku.cn/), which is the largest oncogene chip database currently in the world, constituting a conveniently integrated data mining platform [28]. The expression of hub genes was compared between CRC tissues and normal tissues in HPA database. Student's unpaired t-test was used in determining the statistical signi cance of the calculated differential expression. The fold change was de ned as 2 and P < 0.01 was considered statistically signi cant.
To investigate the potential prognostic role of hub genes, Kaplan-Meier (KM) survival curves was performed to illuminate correlations between expression of hub genes and the overall survival of CRC patients in the TCGA database. P < 0.01 was considered statistically signi cant.

Identi cation of DEGs
In the present study, we analyzed gene expression datasets of 5 datasets and extracted DEGs on the basis of the cut-off criteria. The present study covered 242 CRC and 128 normal tissues in total. The results showed that 1573, 3081, 1431, 2505, and 1675 DEGs between CRC and normal tissues respectively. There were 532 (1041) (Fig. 1A), 857 (2224) (Fig. 1B), 649 (782) (Fig. 1C), 1185 (1320) (Fig. 1D), 674 (1001) (Fig. 1E) up-(down-) regulated genes among these DEGs (Table 1). There were a total of 62 (221) overlapping up-(down-) regulated genes in these datasets, and the number of overlapping DEGs is 283 ( Fig. 1F and G). The numbers of biological process (BP), cellular component (CC) and molecular function (MF) among these enriched GO terms are 49, 16, and 20, respectively. Table 2 showed the enriched GO terms of the overlapping DEGs based on the counts of genes ≥ 10. These terms were integral component of membrane, plasma membrane, extracellular exosome, integral component of plasma membrane, extracellular space, extracellular region, zinc ion binding, endoplasmic reticulum, calcium ion binding, proteolysis, cell surface, cell proliferation, negative regulation of cell proliferation, apical plasma membrane, proteinaceous extracellular matrix, receptor binding, ion transmembrane transport, and actin binding.  NAAA, GNA11, PTGS1, SELENBP1, ANPEP,  CXCL12, SLC26A2, CMBL, CKB, AZGP1,  ACTG2, HSPH1, SMPDL3A, CD46, TGFBI,  GUCA2A, SLC22A5, FAM129A, SLC4A4,  GUCA2B, CEACAM1, AHNAK, GNG7, MB,  ACAA2, CLCA4, PIGR, METTL7A, NEBL,  SERPINB5, AKR1B10, PGM1, SCIN, CA4,  CA2, CA1, UGP2, C7, GCNT3, UGDH, PLPP1,  ITM2C, ITM2A, KRT24, TIMP1, PBLD, SIAE,  GLIPR2, HSPA2, CD177, ENTPD5, PLCD1 Table 3 showed the most signi cantly enriched KEGG pathways of the overlapping DEGs. These DEGs were enriched in metabolic pathways, pancreatic secretion, mineral absorption, drug metabolism-cytochrome P450, bile secretion, proximal tubule bicarbonate reclamation, tyrosine metabolism, aldosterone-regulated sodium reabsorption, fatty acid degradation, retinol metabolism, glycolysis / gluconeogenesis, nitrogen metabolism, galactose metabolism, starch and sucrose metabolism, and pentose and glucuronate interconversions. PPI network analysis and module identi cation Protein interactions among the DEGs were performed using STRING tools. A total of 166 nodes and 322 edges were identi ed (Fig. 2). These nodes and edges represented the overlapping DEGs involved in the PPI and the interaction between these DEGs, respectively. Table 4 displayed the degrees of all nodes. We obtained 11 Modules and 10 seeds in the PPI network based on MCODE app ( Table 5). The modules with a score of at least 4 were selected as hub genes (Fig. 3). These modules might play important roles in CRC.  Expression and survival analysis of the 6 selected hub genes GEPIA was utilized to compare the expression levels of the hub genes between CRC tumor samples and adjacent normal samples from the TCGA database. We selected the seeds of the modules with a score of at least 4 (INSL5, GCG, SPP1, HSD11B2, MT1M and MAOB) for further analysis. In the database of TCGA-COAD, the mRNA expression of INSL5 in tumor is lower than that in normal (Fig. 4B, p < 0.05); the mRNA expression of SPP1 in tumor is higher than that in normal (Fig. 4C, p < 0.05); the mRNA expression of MT1M in tumor is lower than that in normal (Fig. 4D, p < 0.05); the mRNA expression of HSD11B2 in tumor is lower than that in normal (Fig. 4E, p < 0.05); and the mRNA expression of MAOB in tumor is lower than that in normal (Fig. 4F, p < 0.05).
In the HPA database, the expression of INSL5 in tumor is lower than that in normal (Fig. 5A, B); the expression of GCG in tumor is lower than that in normal ( Fig. 5C and D), the expression of SPP1 in the tumor is higher than that in normal ( Fig. 5E and F); the expression of HSD11B2 in tumor is lower than that in normal ( Fig. 5G and H); and the expression of MAOB in tumor is lower than that in normal ( Fig. 5I and J). Unfortunately, there was no MT1M related immunohistochemistry data of patients with or without CRC in the HPA database.
To investigate the prognostic values of the six potential hub genes, we used Kaplan-Meier plotter bioinformatics analysis platform. Through survival analysis in the HPA database, high level expression of INSL5 (Fig. 6A), GCG (Fig. 6B), and MT1M (Fig. 6D) and low level expression of SPP1 (Fig. 6C) were associated with better prognosis patients with CRC (P < 0.05). However, the expression of HSD11B2 (Fig. 6E) and MAOB (Fig. 6F) were not signi cantly associated with prognosis in patients with CRC (P > 0.05).

Discussion
CRC is the second largest malignant tumor of the digestive system, and the third most lethal malignancy in the world [29]. The 5-year survival rate of patients with CRC at its early stage is 90%, while that of patients at later stages was less than 12%[30] [31]. Currently, the treatment of colorectal cancer is still mainly surgical resection, chemotherapy and physical therapy. However, due to the insu cient diagnostic methods for early-onset CRC, the vast majority of patients have been diagnosed at a late stage; and due to gene mutation and changes in immune microenvironment in the process of tumor development, chemotherapy cannot achieve targeted e cacy, resulting in a high mortality rate [32]. Therefore, it is of great signi cance to identify the key marker genes and to understand the molecular mechanisms of CRC for its early diagnosis, effective treatment strategy development and prognosis evaluation. As an important discipline of biological science, integrated bioinformatics analysis can easily access gene expression pro le information, which is quite useful for exploring various molecular mechanisms [32], and is considered as an e cient method for systematically predicting disease-related genes.
In the present study, we screened the common differentially expressed genes (DEGs) in 5 GSE datasets via GEO2R, and obtained a total of 62 (221) overlapping up-(down-) regulated genes in these datasets, and 283 overlapping DEGs. To obtain an in-depth understanding of these DEGs, we performed the GO function and KEGG pathway analysis of these DEGs. The results of GO and KEGG pathway enrichment analysis suggested that these DEGs were signi cantly enriched in different cancer-related functions and pathways. Furthermore, we constructed a PPI network and performed survival analysis along with veri cation of the RNA expression levels via the GEO database, DAVID and STRING online tools, Cytoscape software. After the above series of operations, 6 hub genes, including INSL5, MT1M, GCG, SPP1, HSD11B2, and MAOB were screened out.
In addition, we tested the expression of these 6 genes at the mRNA as well as protein levels and found that the mRNA expression of INSL5, MT1M, HSD11B2, MAOB in CRC tissues were signi cantly lower than that in normal tissues, while SPP1 in tumor is higher than that in normal. Through survival analysis in the HPA database, the high level expression of INSL5, GCG, and MT1M and the low level expression of SPP1 were associated with better prognosis patients with CRC.
INSL5 (insulin-like 5) was a member of the insulin/relaxin superfamily. It can activate the G-proteincoupled receptor relaxin/insulin-like family peptide receptor 4 (RXFP4). Mashima et al [33] have shown that INSL5 is expressed in enteroendocrine cells (EECs) along the colorectum with a gradient increase toward the rectum. INSL5-RXFP4 signaling may play a role in an autocrine/paracrine fashion in the colorectum with effects other than those on cell proliferation, in ammation and mucosal healing, and INSL5 might represent a unique marker of colorectal EECs and NETs.
Recently, Shi-Bing Li's work found INSL5 was elevated in nasopharyngeal carcinoma (NPC) tumor tissue and the plasma of NPC patients [34]. INSL5 acts on its own receptors through an autocrine pathway to promote tumor growth. Mechanistically, INSL5 enhanced phosphorylation and nuclear translocation of STAT5 and promoted glycolytic gene expression, leading to induced glycolysis in cancer cells.
Monoamine oxidases (MAOs) including MAOA and MAOB are enzymes located on the outer membranes of mitochondria, which are responsible for catalyzing monoamine oxidation. Recently, increased level of MAOs was shown in several cancer types. Elevation of MAOB activity in cells resulted in induction of oxidative stress in association with mitochondrial dysfunction. Yi-Chieh Yang et al [35] found that MAOB is highly expressed in CRC tissues compared to normal colorectal tissues, and its overexpression was signi cantly associated with a higher recurrence rate and poorer prognosis. However, in this study, we found that the expression of MAOB were not signi cantly associated with prognosis, which might be owing to the different samples used for the research; thus, this point deserves further investigation.
As a major member of the metallothionein family, MT1M is a cysteine-rich metal-binding protein that is induced by in ammation and protects the cell against carcinogens. Emerging evidence has shown that MT1M is related to pancreatic ductal adenocarcinoma [36], breast cancer [37], renal cell carcinoma [38] and colorectal carcinoma [39]. In colorectal cancer, MT1M was a highly expressed isoform in tumor epithelia. It is reported that low MT1M expression is an independent prognosticator for relatively poor survival of patients. Heya Na's study rst found that MT1M regulated the expression of lncRNA STEAP3-AS1 and that high levels of STEAP3-AS1 were associated with poor overall survival in colon cancer patients [40].
The protein encoded by the glucagon (GCG) gene is a preproprotein which is cleaved into four distinct mature peptides. These peptides are involved in maintaining nutrient homeostasis, and are regulators of cell proliferation, differentiation, and apoptosis [41]. GCG, a remarkable decrease was found in the gene expression already in the adenoma stage. Previous study showed that activity of GCG possibly interferes with tumor progression [42]. The current study showed that GCG expression was signi cantly down regulated in CRC tissues. Therefore, we inferred that GCG might play a role in suppressing the development of CRC. Further molecular studies are necessary to elucidate the regulatory mechanisms of GCG in CRC progression.
Type-2 11β-hydroxysteroid dehydrogenase (HSD11B2) is a key enzyme which converts cortisol to inactive cortisone and is involved in tumor progression and metastasis. Several studies have shown that the promotion of tumor progression and metastasis by HSD11B2 resulted from its physiological function of inactivating glucocorticoids (GC). Lu Qi et al [43] found that the expression of the HSD11B2 was markedly decreased in colorectal cancer tissues. Jin Chen et al [44] performed a series of in vivo and in vitro assays, and found the ectopic expression of HSD11B2 signi cantly promoted the migration, invasion and metastasis of colorectal cancer (CRC) cells via the Fgfbp1-AKT pathway. Therefore, targeting HSD11B2 may be a novel treatment strategy for inhibiting the metastasis of CRC.
SPP1, also known as osteopontin, is a phosphoprotein secreted by malignant cells. SPP1 is involved in multiple physiological and pathological processes. Recent studies have reported that SPP1 is signi cantly associated with cell growth, adherence and invasion in tumourigenesis and metastasis, and is over-expressed in breast, colorectal, liver, and lung cancers [45][46][47]. In addition, high levels of SPP1 gene expression were associated with a poor prognosis for these cancers. Cheng et al. [47] found that the expression level of SPP1 in CRC tissues was higher than that in normal tissue, which was closely correlated to tumor TNM stage and aggressiveness. A study by Choe et al [48] con rmed the higher SPP1 expression along with shorter survival times in CRC patients.
While providing insight into the prediction of cancer-related genes and biomarkers of CRC, our study still had several limitations. Firstly, bioinformatics was used to analyze the related genes and biomarkers of CRC, but there was no further validation of its function and mechanism by clinical samples or cytological experiments, therefore, further experiments in cell, animal models and clinical samples are required to validate the present results, and these critical issues will be further investigated in our future work.
Secondly, the disparity between platforms and analytical tools and algorithms poses challenge for bioinformatics analysis, as results for the same data using different bioinformatics tools may vary.

Conclusions
In summary, our integrated bioinformatics analysis screened out a total of 283 DEGs and 6 hub genes, which might play crucial roles in the occurrence, development, and prognosis of CRC. In order to get more accurate correlation results, further experiments are needed to validate the results of the prediction. Anyway, this study provided new diagnostic biomarkers and therapeutic targets, as well as promising prognostic indicator for CRC patients. Declarations