Identication of Key Genes Involved in Colorectal Cancer: A Bioinformatics Analysis

Colorectal cancer is the life-threatening tumor with both high prevalence and mortality worldwide. However, the molecular mechanism behind it remains unknown. Methods: Herein, the potential prognostic candidate biomarkers for colorectal cancer were tested by Bioinformatical analysis combined with the CRC clinical samples. Three data sets (GSE32323, GSE44076 and GSE43078) were collected from the gene expression omnibus (GEO). The limma and clusterProler packages were used to identify differentially expressed genes (DEGs) and conduct functional enrichment analysis, respectively. To retrieve Interacting Genes (STRING) database, protein–protein interaction (PPI) network was built up using Search Tool, and Cytoscape was applied to carry out the module analysis. Subsequently, the online tool GEPIA was employed to conduct overall survival analysis (http://gepia.cancer-pku.cn/index.html), and the Oncomine database was used to analyze prognostic candidate biomarkers. Finally, 4 key hub genes were selected for validation of their expression levels in 9 patients newly diagnosed with CRC via reverse transcription ‐ quantitative PCR (RT ‐ qPCR). To evaluate the accuracy of prediction, time-dependent receiver operating characteristic (ROC) was applied.

survival rate of the patient is <20% [5].Lacking sensitive and speci c early biomarkers, a high possibility of drug resistance and metastasis is considered to contribute the high mortality of this disease. Therefore, there has a pressing need for identifying the more sensitive and speci c biomarkers or drug targets of CRC for developing effective diagnosis and treatment strategies.
Microarray technology provides an all-in-one system biology solution from hardware to software systems. It can simultaneously scan the hybridization signals of tens of thousands of gene probes in the chip and carry out quantitative analysis on the transcriptome pro le of samples. Recent advances especially in the algorithms of probe signal detection and analysis, such as the introduction of arti cial intelligence technologies, will make the results of microarray more accurate and reliable [6][7][8].The microarray technique also provides a powerful tool for exploring the gene regulation pattern and molecular mechanisms involved in oncogenesis and progression of CRC.Recently, different types of biomarkers including coding genes, miRNAs, long non-coding RNAs and circRNAs have been identi ed in colorectal cancer. Dysregulation of these molecules is involved in the tumor progression or is associated with the prognosis of patients [9][10][11][12].In view of the complexity of the molecular regulatory network of CRC, current studies on tumor biomarkers are not su cient. Therefore, it is still necessary to identify novel prognostic biomarkers, which will help us develop more sensitive and effective diagnostic and therapeutic strategies. However, limited sample size and signi cant variability among different projects make it hard to obtain credible results.In this study, three microarray datasets containing mRNA expression data between CRC and non-cancerous tissues were downloaded from Gene Expression Omnibus (GEO) and the differentially expressed genes (DEGs) were screened out. Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and protein-protein interaction (PPI) network analyses were performed to explore the key modules and hub genes involved in CRC progression. In sum, 547 DEGs and 10 hub genes were screened out, which may be candidate biomarkers for CRC.

Methods
In line with the process shown in Fig. 1, bioinformatics analysis was carried out. Through getGEO function in package GEOquery, three data sets (GSE32323, GSE44076 and GSE23878) containing the gene expression data on CRC and normal tissues were obtained from GEO (https://www.ncbi.nlm.nih.gov/geo/) [13]. Table 1 shows the speci cs of GEO data sets. Wih the aforementioned method applied, the raw expression les of three microarray datasets were processed preliminarily [14]. The selected chips were downloaded by the online tool GEO2R (https:// www.) Ncbi. Nlm. Nih.gov/geo/geo2r/) and the differentially expressed genes were screened. The screening criterion is adj. P. Val < 0.01, |log2 Fold Change| ≥ 1, and the probe name was transformed into the standard gene name.Then, based on annotation information, the array probes were transformed into matching genetic symbols. When there are a number of probes matching the same gene, the value of gene expression was treated as the mean of the probes. When the same probe matches a number of genes, the probe gets removed. The possible regulatory mechanisms behind the occurrence and development of the disease can be revealed by the identi cation of DEGs at various states and investigation into their correlations. A web tool called ClustVis was applied to build up heatmaps. Differentially up-regulated and differentially down-regulated genes were subject to screening for the three datasets bu. Finally, in the online Veeny tool (Https://bioinfogp.cnb.csic.es/tools/venny/index.html), as were the differential intersection genes of the three microarrays.

GO and KEGG enrichment analysis
GO functional analysis is a useful method for annotating genes and identifying characteristic biological attributes for high-throughput genome or transcriptome data [15]. KEGG incorporates a wide range of databases, including those on genomes, biological pathways, diseases, drugs and chemical substances [16].Gene Ontology and KEGG analyses were conducted using enrich GO and enrich KEGG functions of Database for Annotation, Visualization and Integrated Discovery (DAVID; http://david.ncifcrf.gov/), an online bioinformatics database that aims to provide tools for the functional interpretation of genes or proteins [17], respectively. P.adjust (FDR) < 0.05 was considered to be statistically signi cant.

Construction of PPI network and module analysis
The PPI network was constructed by Search Tool for the Retrieval of Interacting Genes (STRING) (https://string-db.org/) with interaction score of 0.9 as the threshold [18]. Subsequently, Cytoscape software [19] was used to construct a PPI network and analyze the interactions of the DEGs. The cytoHubba plug-in was used to identify hub genes. The Molecular Complex Detection (MCODE) plug-in was used to screen modules of the PPI network in Cytoscape with a degree cut-off =2, node score cut-off =0.2, k-core =2 and max depth =100. The KEGG pathway enrichment analysis of genes in the top module was performed using DAVID.

Hub gene analysis
The Gene Expression Pro ling Interactive Analysis (GEPIA; http://gepia.cancer-pku.cn/index.html) is an online tool, which delivers fast and customizable analysis for RNA sequencing expression data based on The Cancer Genome Atlas and GTEx databases, which include data on 9,736 tumors and 8,587 normal samples [20]. GEPIA provides key interactive and customizable functions, including differential expression analysis between cancer and normal tissues. The seed genes in modules referred to hub genes. The overall survival analyses were performed using online tool GEPIA (http://gepia.cancer-pku.cn/) . The logrank P < 0.05 was considered statistically signi cant. The association of expression level of hub genes with clinical traits were analysis using the Oncomine Database [21].
Tissue specimens, RNA extraction and qRT-PCR analysis. 9 tumor tissues and tumor-adjacent tissues from CRC donors were enrolled in our study to validate the expression levels of co-expressed DEGs. Prior patient consent and ethical approval from the ethics committee of the Jiangsu Provincial Hospital of Traditional Chinese Medicine. All methods were performed in accordance with the ethics guidelines and regulations. We selected 4 co-expressed DEGs, including CCNB1,CCNA2,AURKA and BUB1B, for validation. All tissues were histologically diagnosed. Total RNA from the tissue specimens was isolated using TRIzol reagent (Invitrogen, Carlsbad, California, USA), and qRT-PCR was performed with SYBR® Green dye (TaKaRa, Shiga, Japan), following the manufacturer's instructions. The primer sequences are provided in Supplementary Table S1. β-actin was used as a reference gene.
Ethics statement.The research protocol was subject to review and approval from the Research Ethics Committee of the First A liated Hospital. All of the experiments were carried out on the basis of the instructions from the First A liated Hospital, JiangSu Provincial Hospital of Traditional Chinese Medicine.Basic information of clinical patients were shown in supplement Table S2.
Ethical approval. All processes conducted in the research with human participants involved conformed to the ethical standards set out by the ethical committee of the First A liated Hospital, Nanjing University of Chinese Medicine, the 2019NL-KS1129 Declaration and the subsequent amendments to it. NO tissues were derived from prisoners.
Ethical standards. Expressed consent was received from all of those participating in this research.
With Graphpad Prime 8.3 used, the ROC curves were visualized to evaluate how accurate the prediction would be of independent prognostic hub genes such as CCNB1 CCNA2 AURKA and BUB1B. For comparing the discriminatory ability of above-mentioned prognostic genes, the area under the ROC curve (AUC) was calculated.
Statistical analysis. Data are indicated by the mean±SD for continuous variables. Both t-test and analysis of variance were conducted to assess the disparities of demographic data. All P values were two-sided, and P 0.05.

Identi cation of DEGs between CRC and tumor-adjacent tissues
The three data sets were subject to standardization, with the results presented in Figure 2. Subsequently, the duplicate genes and values with no speci c genetic symbols were removed. Totally, 3387 DEGs were obtained in GSE23878. Among them, 784 genes experienced up-regulation and 2603 genes underwent down-regulation. Besides, 4883 DEGs were obtained from GSE32323. There were 919 genes experiencing up-regulation and 3964 genes undergoing down-regulation. In GSE44076, 2485 DEGs were collected. Of them, 1254 were subject to up-regulation and 1231 were subject to down-regulation. Figure 3 presents the DEGs obtained from each data set. Clustering was performed using Euclidean distance. Figure 4 shows the top 100 DEGs conducted by heatmap.

Data preprocessing and DEG screening
Three GEO datasets were downloaded, pre-processed and merged into a global dataset which contained  Table 2.

GO and KEGG analysis
The biological functions and pathways analyses were conducted using DAVID 3.  Table 3; Fig. 6D).

Construction of PPI network and module analysis
Protein-protein interaction network re ect the spatiotemporal relationship of macromolecules within the cell which will provide valuable information about molecular mechanisms in physiological and pathological process. To explore the molecular mechanisms underlying CRC progression, the online STRING database was applied to construct the PPI network. The interaction score of 0.4 (highest con dence) was set as threshold, and nodes without connections were removed from network. Finally,the PPI network consisted of 542 nodes with 1338 edges, and average local clustering coe cient was 0.367 (PPI enrichment P-value < 1.0E-16) (Fig. 7A). Then,the hub genes were identi ed via CytoHubba plugin.The top 10 hub genes, including minichromosome maintenance3(MCM3), minichromosome maintenance7(MCM7), budding uninhibited by benzimidazoles 1 homolog beta(BUB1B), Cyclin A2(CCNA2),Cyclin B1(CCNB1), aurora kinase A(AUPKA), targeting protein for Xklp2(TPX2), threonine and tyrosine kinase(TTK), Flap endonuclease I(FEN1), kinesin family member 4A(KIF4A), were identi ed using the cytoHubba plug-in, with a higher degree of connectivity (Fig.7B) .Finally,the key modules were identi ed via MCODE plugin. A total of 22 functional clusters of modules and related hub genes were detected. The top three signi cant modules were presented in Figs. 7C-7E. The KEGG analysis of module genes revealed that the top three modules were mainly associated with the cell cycle(hsa04110), Neuroactive ligand-receptor interaction(hsa04080), DNA replication(hsa03030),chemokine signaling pathway (hsa04062) and Progesterone-mediated oocyte maturation(hsa04914) (Fig. 7F) Hub genes analysis A total of 10 genes were identi ed as hub genes. We selected 4 of the hub genes that are most closely related to the occurrence and development, which were CRC, CCNB1, CCNA2, AURKA and BUB1B, and analyzed their expression in colorectal cancer (Fig.8A-D).Sometimes,the overall survival analysis of the hub genes was performed using the online tool GEPIA(http://gepia.cancer-pku.cn/). Except CCNB1 and CCNA2,CRC patients with upregulated MM3, MCM7, BUB1B, AUPKA, TPX2, TTK, FEN1,KIF4A showed no statistically signi cant on overall survival (Figs. 9A-H). It is worth noting that CCNB1 and CCNA2 are upregulated in CRC patients, but the high expression level is associated with better overall survival (Fig.  9I-J). Subsequently, the expression status of hub genes with Logrank P 0.05 were further validated using the Oncomine database. The result showed that CCNB1and CCNA2 were markedly overexpressed in CRC in the different datasets (Fig. 10A-B). In the Oncomine dataset, the alternation of CCNB1 and CCNA2 were associated with tumor grade (Fig. 11A-B), implicating vital roles of these genes in the carcinogenesis or progression of CRC.
The validation of core genes expression in clinical samples.
For ascertaining which gene is crucial to the develpment of CRC, real-time qPCR was applied to examine the expression of 4 DEGs through clinical sampling, inclusive of CCNB1 CCNA2 AURKA and BUB1B . Notably, relative to tumor-adjacent tissues, CCNB1 CCNA2 AURKA and BUB1B were normally subject to upregulation in colorectal cancer tissues, which conforms to the aforementioned results of bioinformatics analysis (Fig. 12A-D

Discussion
A number of pre-clinical and clinical studies have been conducted to reveal the underlying mechanisms of CRC in the past decades; however, the incidence and mortality of CRC remain high. This is primarily due to the majority of the studies focusing on a single genetic event, or the results were generated from a single cohort study [1].In this study, three GEO datasets were analyzed and 547 DEGs were identi ed, including 72 up-regulated and 475 down-regulated genes. The 547 DEGs were classi ed into three groups (BP, CC and MF groups) by GO terms, and the KEGG pathway enrichment analysis of the DEGs was conducted using the DAVID database. Finally, the DEGs PPI network was constructed, and the top 10 hub genes and the most signi cant module was ltered from the PPI network.
The KEGG analysis revealed some DEGs were signi cantly enriched in Pathways in cancer,PI3K-Akt signaling pathway,cGMP-PKG signaling pathway.These annotation results provided valuable clues to reveal molecular interactions in the development of CRC.Additionally, the most signi cant module was ltered from the PPI network, among which the majority of the corresponding genes were mostly associated with cell cycle, Neuroactive ligand-receptor interaction and DNA replication. Studies have shown [1][2][3][4] that there are countless signaling pathways that play an important role in the cell cycle of cancer and the process of cell proliferation.The signaling pathway of the protein kinase B (PI3K)/ protein kinase B (PKB/ Akt) mammalian target of rapamycin (mTOR) is abnormally induced in a variety of human tumors, including the breast gland cancer, the pancreatic gland cancer, the non-small cell lung cancer, CRC, etc. [5]. The activation of PI3K-Akt signaling pathway is closely related to colorectal cancer metastasis and poor prognosis [6][7]. Research shows [8] that a deregulated PI3K-AKT signaling pathway in CRC patients, which might serve as therapeutic target(s). The cell cycle is controlled by various mechanisms, which ensure correct cell division; loss of normal cell cycle control is a hallmark of cancer [9]. An increasing number of studies has revealed that targeting the deregulation of the cell cycle in cancer is a potential therapeutic strategy [10]. Therefore, investigating the cell cycle pathway may promote the understanding of carcinogenic mechanisms and insights into CRC treatment options..These data may provide new ideas and directions for the mechanism research and therapeutic strategy of CRC.GO term enrichment analysis showed that the up-regulated DGEs were signi cantly enriched in cellular response to hypoxia, extracellular exiosome, heparin binding , suggesting that some of these DEGs could locate in the nucleus and be involved in cell cycle processes to promote cell proliferation by enhancing DNA helicase activity in colorectal cancer. Increasing experimental evidence indicates that hypoxia is prevalent in the development of solid tumors.In order to ensure their survival, tumor cells under hypoxia can undergo a series of changes in biological behavior to adapt to hypoxia, speci cally: (choose genotypes (such as TP53 mutation) that are conducive to survival of hypoxia repair [11] ,changes in expression of pro-survival genes [12] and inhibition of apoptosis [13], cell support for enhanced autophagy [14] and anabolic conversion [15], etc. Hypoxia can also enhance tumor angiogenesis [16], epithelial cells Interstitialization [17], immunosuppression [18], invasion and metastasis [19][20], etc. Through these changes, tumor cells can ght the body's autoimmunity and existing medical treatments, affecting their prognosis.It can be seen that the up-regulated DGEs play an important role in the occurrence and development of colorectal cancer.
Through integrated bioinformatics analysis, 10 hub genes were identi ed, including MCM3,MCM7, BUB1B,AUPKA,TPX2, TTK,FEN1, cyclin B1(CCNB1),cyclin A2(CCNA2) and KIF4A with a high degree of connectivity.Among them, MCM3 and MCM7 belong to the MCMs family, and their biological functions are primarily involved in the DNA replication [2].Multiple previous studies have demonstrated that minichromosome maintenance(MCM) protein additionally serve a role in the occurrence and development of varies malignancies [3][4][5].The MCM participates in DNA synthesis [6] and can be used as a biomarker of oral squamous cell carcinoma [7], melanoma [8], glioma [9] and colon cancer [10]. In this study, the KEGG pathway showed that the DEGs were mainly involved in the cell cycle and DNA replication. CCNB1, one of the highly conserved cyclin family members, is involved in regulating cell cycle at the G2/M transition by forming maturation-promoting factor (MPF) with p34, suggesting that its overexpression can promote the progression of cancers [11][12]. AURKA gene mainly exists on chromosome 1 and encodes cell cycle regulated protein kinase, which is involved in chromosome mitosis and plays a role in tumorigenesis and development.BUB1B is an important constituent protein of the mitotic checkpoint, and is a multidomain protein kinase that responds to centromere tension [13][14]. Studies have demonstrated that BUB1B is overexpressed in various different types of tumor, such as colorectal cancer, renal and breast carcinoma, and its [15][16]. Therefore, further investigation on BUB1B may lead to a greater understanding of its importance in the CRC process, and novel ideas for investigating its molecular mechanisms and establishing more effective treatments [17].The TPX2 gene is located at 20q11.2 and encodes a microtubule-associated protein involved in spindle assembly during cell mitosis. TPX2 overexpression is common to many tumor types. In hepatocellular carcinoma, it was correlated with increased proliferation, apoptosis inhibition, and induction of EMT [18]. In breast cancer, TPX2 silencing repressed PI3K/AKT and activated p53 signaling, which inhibited proliferation and promoted apoptosis [19].As a DNA repair protein, Flap endonuclease 1 (FEN1) plays crucial parts in preventing carcinogenesis. Two functional germ line variants (-69G > A and 4150G > T) in the FEN1 gene have been associated with DNA damage levels in coke oven workers and lung cancer risk in general populations.Studies showed lower FEN1 expression may lead to higher risk for malignant transformation of gastrointestinal cells [20][21].TTK is a class of bispeci c kinases rst discovered in Saccharomyces cerevisiae, which can phosphorylate serine, threonine, and tyrosine residues.It is the basic component of the spindle assembly checkpoint (SAC), and plays an important role in the replication of mitotic centrosomes and the correct separation of chromosomes [22].TTK is overexpressed and proposed as a therapeutic target for CRC and other cancers,and closely related to poor clinical prognosis [23][24][25][26].
Some reports are contradictory to our ndings, and this is potentially attributed to the heterogeneity of the tumor. Thus, it is necessary to further conduct sample functional veri cation. For the further veri cation of the expression of the 4 hub gene, RT-qPCR was carried out in 9 clinical samples. As illustrated in Figure  12, we noted up-regulation on the expression of the 4 hub genes in colorectal cancer tissues comparing to tumor-adjacent tissues,which were consistent with our bioinformatics analysis results above,indicating that they would be one of the effective biomarkers for cancer treatment.
Herein, a set of candidate biomarkers possibly signi cant to the development of CRC were identi ed.
They can be taken as research subjects to investigate their roles in the development of diseases, which can improve our understanding as to the mechanism of CRC on a molecular level. Besides, to gure out their prognostic effects, clinical validation study can be conducted with them as possible prognostic biomarkers. Nevertheless, the study remains subject to some limitations. Firstly, it is premised on bioinformatics analysis of published data. Secondly, it is uncertain whether the correlation between the differential expression of hub genes and disease progression is a causal one. Finally, despite the combination of four GEO data sets, the sample size remains small, which can render the results unreliable. Thus, it is essential to conduct further bioinformatics analysis and experimental veri cation with a larger sample size.

Conclusion
In general, we have excavated biomarkers for predicting the recurrence, metastasis and prognosis of colorectal cancer through the big biological data (GEO, TCGA, Ocomine database). Among these biomarkers, CCNA2 and CCNB1 are closely related to the survival and prognosis of colorectal cancer, and can predict the patient's OS and DFS. More importantly, we selected CCNB1, CCNA2, AURKA and BUB1B for veri cation on clinical colorectal tumor samples, which were consistent with our bioinformatics analysis results above, indicating that they would be one of the effective biomarkers for cancer treatment.However, further prospective and multi-center, large-sample studies should be conducted to ensure these results and provide evidence for individualized treatment.

Declarations
Ethical Approval and consent to participate All procedures performed in studies involving human participants were in accordance with the ethical standards of the ethical committee of the First A liated Hospital, Nanjing University of Chinese Medicine were obtained and with the 2019NL-KS1129 Declaration and its later amendments or comparable. NO tissues were procured from prisoners.All procedures performed in the present study involving human participants were in accordance with the ethical standards of institutional. Written informed consent was obtained from all individual participants included in the study.

Consent for publication
Not applicable.

Availability of Data and Materials
The Primer sequence and clinical sample information used to support the ndings of this study are included within the supplementary information le.

Competing interests
The authors declare that they have no con icts of interest.     Figure 1 Flow diagram of bioinformatics analysis.      The overall survival analysis of the 10 hub genes(A-J).

Figure 11
The association between the expression of selected hub genes and tumor stage.