Weighted Gene Coexpression Network Analysis of Key Genes Controlling Cancer Stem Cell Characteristics by Mrnasi in Colon Adenocarcinoma

Jie Zhang Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center for Cancer, Tianjin Cancer Hospital Airport Hospital Honghao Yin the First Hospital of China Medical University Zhongsheng Tong (  tongzhongsheng63@163.com ) Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center for Cancer, Tianjin Cancer Hospital Airport Hospital

However, the research on mRNAsi of colon cancer has not been realized. We downloaded patient information of colon adenocarcinoma (COAD) from the Cancer Genome Atlas (TCGA), and obtained the corresponding stem indices on the basis of previous research.
Weighted gene coexpression network analysis (WGCNA) is a systematic biological method that illustrates the interaction between gene modules and cancer. Focus on differentially expressed genes (DEGs) between cancer and normal tissues by calculating the relationship of gene sets and the correlation between gene sets and phenotypes, WGCNA can be performed to identify candidate biomarker and therapeutic target [15].
The purpose of this paper was to identify key genes and pathways associated with COAD stemness.
WGCNA method was utilized to analyze the high-throughput sequencing data of TCGA database, obtained modules closely related to mRNAsi, and identi ed relevant key genes. Data sets generated from GEO, GEPIA, and Oncomine were utilized to verify the key genes' expression in COAD. Our study used a new method to identify stemness-related genes, which offered new opinions on the action of some CSCrelated genes in COAD.

Data acquisition
The RNA-seq data (FPKM) of 41 normal samples, 473 COAD samples, and 452 patients' clinical information were downloaded from TCGA (https://portal.gdc.cancer.gov) on October31 2020. mRNAsi and EREG-mRNAsi of COAD and normal samples were obtained from previous studies [8]. The RNA-seq data of each sample was merged into a matrix le with Perl (https://www.perl.org), gene names were transformed into gene symbols according to the Ensemble database (https://asia.ensembol.org/index.html ). During the analysis of the correlations between mRNAsi and basic clinical characteristics, 393 cases with detailed and complete clinical data were adopted. Any sample with incomplete clinical information was excluded.

Prognostic Analysis
The "survival" package from R was performed to analyze the prognostic signi cance of mRNAsi. The "beeswarm" package in R was utilized to investigate the relationships between mRNAsi and clinical characteristics. Death from any cause was identi ed an event in the survival analysis.
Screening of differentially expressed genes (DEGs) The "beeswarm" package from R was used to compare the mRNAsi scores and clinical subgroup characteristics in tumor and normal tissues. The "limma" package was employed to compare DEGs between tumor and normal tissues [118]. The selection criteria: false discovery rate (FDR) < 0.05, |log2 fold change|> 1 and P < 0.05. DEGs that met the criteria were selected for further analysis. The "pheatmap" and "limma" packages were utilized to map heatmap and volcano plot, respectively.

Weighted Gene Coexpression Network Analysis (WGCNA)
The "WGCNA" package from R was employed to establish a coexpression network targeting DEGs [15].
Firstly, the coexpression similarity matrixs of all genes were built by the average linkage method and Pearson's correlation matrices. The soft thresholding parameter (β) was de ned as the correlation coe cient with strong correlation between genes. correlation coe cient with strong correlation between genes Subsequently, the network connectivity of genes was measured by topological overlap matrix, and the network gene ratio was obtained by summing adjacent genes, and corresponding dissimilarity is calculated. Next, DEGs with similar expression pro les were clustered to form modules. The minimum size of the gene dendrogram was 50.
Identi cation of key modules and key genes mRNAsi was chosen as the sample trait for CSC-related modules and genes recognition. Genes in CSCrelated module were considered as CSC-related coexpressed genes. Gene signi cance (GS) was computed for evaluating the interaction between gene expression and mRNAsi of each module. Module signi cance (MS) was the average GS in a particular module, the module with the largest average MS was considered the most CSC-related module. Module membership (MM) was used to selected gene highly related to mRNAsi in each module. Set the selected threshold as follows: cor. gene GS > 0.5, cor. gene MM > 0.75.

Functional enrichment
The clusterPro ler package in R was used to conduct GO and KEGG analyses for exploring biological functions of key genes [119]. The GO analysis included biological processes (BP), cellular component (CC) and molecular function (MF). P < 0.05 and FDR < 0.05 were identi ed as statistical signi cance.

Gene coexpression analysis and Protein-Protein Interaction (PPI)
The Pearson correlation analysis in R package corrplot was performed to analyze relationships between key genes at transcriptional level. Next, the key genes identi ed by coexpression analysis were used to build PPI network. The PPI network was established by online Search Tool for the Retrieval of Interacting Genes (STRING) [120]. The more edges a gene has, the stronger its relationship with other genes, and the stronger its role. When a gene has at least two edges, that gene will show up in the diagram.

Validation of key genes
Online databases GEO (www.ncbi.nlm.nih.gov/geo/), GEPIA (http://gepia.cancer-pku.cn/), and Oncomine (https://www.oncomine.org) were performed to compare the mRNA expression levels of key genes between cancer and normal samples. Gene expression pro le GSE44076 containing 98 colon tumor samples and 148 normal mucosa tissues was obtained from the GEO. The chip contains the expression of all key genes. The thresholds of Oncomine screening were fold change, 2; P-value, 1E−4, gene rank, top10%.

Results
The mRNAsi and clinical characteristics in COAD We used the TCGA dataset to compare mRNAsi expression between normal (n=41) and tumor tissues (n=473) to explore mRNAsi in COAD. The mRNAsi index of COAD was apparently higher than that of normal tissue ( Figure 1A). Patients were subdivided into six subgroups by age, gender, stage, and TNM stage ( Figure 1B-G). Remarkable differences were found for N stages ( Figure 1F) and M stages ( Figure  1G). The overall survival of patients with higher mRNAsi was signi cantly better than that of patients with lower mRNAsi (P=0.014, Figure1H). Tumor tissue was made up of thousands of different cells including tumor cells, stromal cells, immune cells, and other types of cells, reminding us that tumor purity could be a factor affecting the assessment of mRNAsi in clinical properties. Finally, we used the corrected mRNAsi (mRNAsi/tumor purity) which was calculated in the same way as a previous study [9], to eliminate the in uence of tumor purity and recalculated the overall survival between patients with higher mRNAsi and lower mRNAsi (P=0.228, Figure1I).

Screening of DEGs
The mRNAsi indices between tumor tissue and normal tissue were signi cantly different. Therefore, a total of 6478 DEGs were identi ed, among which 4562 were highly expressed genes and 1916 were low expressed genes ( Figure 2).

Construction of WGCNA
WGCNA was performed to build the gene coexpression network of 6478 DEGs with the soft threshold β = 4 (scale-free R 2 = 0.90) to determine biologically signi cant gene modules. We eliminated outlier samples and got 11 modules for future research ( Figure 3A-B). Among them, green, brown, and red modules had the closest relevance to mRNAsi ( Figure 3C-H). The genes in the brown (cor = 0.94, P < 1e-200), red (cor = 0.77, P = 1e-200) and green modules (cor = 0.94, P < 1e-200) showed high GS and MM. Additionally, green module was positively related to mRNAsi, while brown module and red module were negatively related to mRNAsi. Hence, we selected the green module for subsequent research.
The key genes screening threshold of the green module was cor. MM>0. 75

Correlations between key genes
We used the Pearson correlation to explore the mutual correlations between 27 key genes in green module, and found strong positive correlations between candidate genes at transcription level( Figure 5A).
The Pearson correlation coe cient ranged from 0.35 to 0.88. The maxmum correlation coe cient was between CCNA2 and MAD2L1(0.88), followed by KIF23 and BUB1B(0.84), CHEK1 and DDIAS(0.82), BUB1 and KIF18A(0.80). Next, we mapped the PPI network for identifying the relationship between key genes at the protein level. The PPI network consisted of 26 nodes and 211 edges, those key gene proteins composed a wide-ranging and closer interaction except for DDIAS. CDK1, BUB1, CCNA2, and CENPA had the highest node number: 22 ( Figure 5B-C,).
Functional and pathway enrichment of key genes As shown in Figure 6A, the result revealed that main biological functions were organelle ssion, chromosomal region, and protein serine/threonine kinase activity. The enriched signaling pathway were cell cycle, progesterone−mediated oocyte maturation, human T−cell leukemia virus 1 infection, oocyte meiosis, DNA replication, cellular senescence, viral carcinogenesis, p53 signaling pathway ( Figure 6B). These results indicate that key genes are remarkably correlated with cell division.

Validation of key gene expression
We compared 27 key genes' expression between colon cancer (98 cases) and normal tissues (148 cases) in GSE44076, all genes were signi cantly overexpressed in tumor tissues (Figure7). Furthermore, we used GEPIA and Oncomine databases to analyse the expressions of key genes in multiple cancer types. These genes were obviously upregulated in COAD and many other cancer types, compared to normal samples(Figure8-9). And these genes were not only ranked in the top 10% in GENE RANK, but also had more than one data to support the results(Figure9).

Discussion
Colon cancer is one of the most common malignant tumors in the world, with high morbidity and mortality. Although systematic medical treatment and surgical technique have made a great progress, many patients with recurrence and metastasis of colon cancer still have poor prognosis. CSCs are a small group of hidden tumor cells with the potential for self-renewal and multi-differentiation. Nowadays, accumulated studies have shown the action of CSCs in tumor initiation, progression and malignization, and therapeutic resistance. However, it is still a challenge to identify the therapeutic modalities targeting CSCs. Herein, we used WGCNA to screen out key genes associated with COAD stemness based on mRNAsi.
Firstly, we found COAD tissue exhibited higher stemness than normal tissue, consistent with other reports in bladder cancer [9], gastric cancer [11], lung cancer [12], and breast cancer [14]. Patients with higher mRNAsi had better survivals, which was similar to a previous report [16], but contrary to what we expected. After eliminating the effect of tumor purity, we found no statistically signi cant difference in survival curves between the two groups. Differences in treatment options might result in similar survival rates between the two groups. Next, we constructed a coexpression network to analyze the expression of DEGs, and divided DEGs into different gene modules. Green module had the highest positive correlation with mRNAsi, brown and red modules had negative correlations with mRNAsi. Hence, key genes affecting the COAD stemness should be screened from green module, where 27 key genes were obtained based on GS and MM. These genes were upregulated in COAD tissue. There were high coexpression relationships among them at the protein and transcription levels.
Next, KEGG and GO analyses showed that key genes were involved in organelle ssion and cell cycle, DNA replication, cellular senescence pathways. The enriched signaling pathway included human T−cell leukemia virus 1 infection and viral carcinogenesis, indicating that these genes may have a role in the origin of cancer and tumorigenesis. Previous studies have shown a high prevalence of JCV infection in colon cancer [17], which induces chromosomal instability[18] and provoke metastasis [19]. HPV infection may represent an important factor in the transformation of precancerous lesions to neoplastic phenotypes in colon cancer [20]. P53 signaling pathway plays an important role in metabolism, aging, apoptosis, proliferation, cell cycle regulation, and inhibition of tumor expression [21][22][23][24]. P53 mutations play a key role in adenoma-carcinoma transformation, chemoresistance, and poorer prognosis in colon cancer [25][26][27][28][29]. These ndings suggest that key genes may play important roles in tumorigenesis, progression and drug resistance. Finally, the expression pro le of each key gene was veri ed in multiple cancer tissues by external data from GEO, GEPIA, and Oncomine.
DNA2, ORC6, MCM10, and RFC4 are essential for the initiation of the DNA replication, chromosome maintenance and cell proliferation in eukaryotic cells [75][76][77][78][79][80][81]. ORC6 was implicated in colorectal cancer initiation and progression [82]. Reduction of ORC6 expression could increase the sensitivity of colon cancer cells to cisplatin and 5-uorouracil [83]. High levels of RFC4 were in association with differentiation and poor prognosis in colorectal cancer, the loss of RFC4 suppressed cell proliferation and promoted the cessation of the S-phase cell cycle arrest [76]. These four genes were high expressed and related to prognosis in other cancers, and could be potential therapeutic targets [45,[84][85][86][87][88]. Experimental research is required to demonstrate their role in CSCs.
The differential expression of CHEK1, XRCC2, DDIAS and NEIL3 were related to DNA damage and repair [89][90][91][92]. XRCC2 polymorphisms and its protein were associated with susceptibility of colorectal cancer [93,94]. CHK1 inhibitors are promising in chemo-and radio-sensitization [95]. XRCC2 single nucleotide polymorphisms may in uence the risk and survival of breast cancer [96]. NEIL3 were highly expressed in tumor tissues and cells with high proliferative potential [97].
KIF18A and KIF23 are members of the kinesin superfamily of microtubule-associated molecular motors, which are related to chromosome separation, intracellular transport, mitotic spindle formation and cytokinesis. Aberrant expressions of KIF18A and KIF23 were related to metastasis and poor prognosis in colorectal cancer [98,99]. TTK could regulate tumor proliferation and differentiation and predict prognosis in colon cancer [100]. PLK4 expression plays a key role in centrosome duplication for cell division. The upregulation of PLK4 may induce carcinogenesis and metastasis by regulating the Wnt/β-catenin signaling pathway, which may be a therapeutic target [101].
Studies found that upregulation of MTFR2 promoted proliferation, invasion, migration through switching glucose metabolism in breast cancer and oral squamous carcinoma [102,103]. Upregulated expression of MTFR2 predicted a poor prognosis [104][105][106]. Activation of MELK was signi cantly associated with prolonging survival and promoting proliferation of CSCs in many organs [107]. MELK expression was correlated with resistance of radiation and chemotherapy in colorectal cancer[108], as well as survival of patients with multiple cancer types [109][110][111].The clinical trials and basic experiments for cancers with MELK inhibitor are ongoing. The overexpression of NUP107 increased resistance to oxidative damage in cervical cancer [112]. Previous studies have indicated that DEPDC1B induces tumor cell proliferation, migration, invasion and metastasis through various mechanisms [113][114][115][116]. PNPT1 encodes a polyribonucleotide nucleotidyltransferase, mutations in PNPT1 inhibit RNA from importing into mitochondria and result in respiratory-chain de ciency [117]. There no study has showed the role of PNPT1 in tumorigenesis and development.

Conclusion
The conclusion was that 27 key genes that maintained the stemness of COAD were identi ed. The key genes were signi cantly associated with cell cycle events, which could be therapeutic targets to inhibit CSCs properties. Our conclusion was based on bioinformatic analysis of retrospective research. In order to make these ndings credible, further basic research is needed.

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.

Competing interests
There are no competing interests.

Funding
The present study was supported by grants from the Research Program of Tianjin Medical University Cancer Institute and Hospital (grant no. 2001).
Authors' contributions JZ, HY and ZT conceived and designed the study, JZ and HY analyzed the data, JZ and HY wrote the paper, ZT provided administrative support and gave nal approval of the version to be published.       The expression of the 27 key genes were veri ed in GSE32867 from GEO database.  The expression of the 27 key genes between tumor and normal tissues in multiple cancer types from GEPIA database.