CLCA1 has a prognostic indicator for the colorectal cancer by Weighted Gene Co-expression Network Analysis

Colorectal cancer (CRC) has become the second most common digestive tract tumor. Even though the means to treat colon cancer have improved, patients prognosis is low due to the lack of accurate molecular targets. Hence, it urgently demanded better biomarkers for prognosis and progression of colon cancer. This study explores the hub gene associated with the prognosis of colorectal cancer and further analyzes the hub gene function. In this study, all genes mRNA expression data were from the cancer genome atlas (TCGA) colon cancer database and the Gene Expression Omnibus (GEO). These databases were used to screen the differentially expressed co-genes between colon cancer tissue and normal tissue. Weighted Gene Co-expression Network Analysis screened out a total of 103 differential coexpression genes (WGCNA). According to the R cluster profile package annotation analysis, these genes biological functions mainly concentrate on energy metabolism. Moreover, in the protein-protein interaction (PPI) network, the CytoHubba plugin of Cytoscape was used to screen out ten genes (CLCA1, ZG16, GUCA2B, GUCA2A, CLCA4, SLC26A3, MS4A12, GCG, SI, and NR1H4). According to the survival analysis results, high expression of CLCA1has better overall survival and disease-free survival in patients with CRC. Simultaneously, the mRNA expression of CLCA1 in normal tissues was higher than that in CRC tissues. Besides, there were significant differences in the expression of CLCA1 in pathological stage, T stage, and M stage. By using a gene set enrichment analysis, we found several considerable enrichment pathways in the high-groups. CIBERSORT analysis for the proportion of TICs revealed that B-cell naive, dendritic cells, plasma cells, and CD4+ T cells were positively correlated with CLCA1 expression, suggesting that CLCA1 might be responsible for the preservation of immune-dominant status for TME. Finally, in the Human Protein Atlas (HPA) database, the protein level of CLCA1 in the colorectal cancer samples decreased, consistent with the down-regulation of the mRNA expression level CLCA1. To sum up, by integrating WGCNA with differential gene expression analysis, this research generated a significant survival correlative gene called CLCA1 that can predict prognosis prediction in colon cancer.


Introduce
Colorectal cancer (CRC) is the most fatally and commonly diagnosed cancer globally in gastrointestinal tumors [1]. Currently, the early colorectal adenocarcinoma can be surgically removed with a good prognosis. Although great progress has been made in diagnosis of CRC, the fatality rate of disease is very high. Some studies have shown that patients with stage I-III and stage IV CRC recurrence rate were 30% and 65% after curative treatment, respectively [2,3]. Therefore, understanding prognostic biomarkers may effectively contribute to guidance to selection of treatment and predicting survival rate. With the development of microarray technologies，gene expression profiles have been widely used to tumor research. Microarray analyses have related to the comparison between tumor and normal samples [4]. With the rapidly development of bioinformatics analysis, weighted gene co-expression network analysis (WGCNA) has become an emerging technique for analyzing expressed genes (DEGs), but also high degree of interconnection between genes. [5]. In WGCNA, the basic concept is construction of co-expression modules, which are clusters of genes maintaining consistent expression patterns and even playing similar biological roles, and these modules are derived from the data of mRNA expression profiles by performing unsupervised hierarchical clustering [6,7].Moreover, WGCNA analyzes the internal relationships between modules and sample traits, which provides a way to reveal the specific hidden mechanisms. [5]. WGCNA has been widely applied to identify gene modules associated with clinical parameters in many cancers. The usefulness of WGCNA in the detection of the colon cancer recurrence gene module has been demonstrated in many literatures. [7,8].
In the current study, WGCNA and differential gene expression analysis were used for analysis of the mRNA expression data of colon cancer, which from the TCGA and GEO databases. The authors explored development of colon cancer by intersecting functional enrichment, protein-protein interaction (PPI) analysis and survival analysis. By analyzing different co-expressed genes, this study further reveals the potential molecular mechanism of tumor recurrence and makes certain contribution to clinical diagnosis or treatment.

Material and methods
The flow chart of this study is shown in Figure 1. Specific steps will be shown in the following sub-heading.

Data download and preprocessing
In the TCGA database, we select the colon cancer category and download the gene expression profiles related to colon cancer. In addition, we downloaded gene expression profiles from the GEO database with colorectal cancer as the keyword. All clinical information and mRNA data related to CRC were obtained after the R package TCGAbiolinks [9]. There were 473 colorectal cancer and 41 normal tissues in the TCGA database.
These data were preprocessed as follows: 1) Samples lacking clinical data were removed; 2) Samples with FPKM of 0 were excluded; 3) The CPM (count per million) > 1 gene was retained; 4) Samples with OS < 1 month were removed. Moreover, the expression profiles of GSE44861 were obtained using the R package GEOquery [10]. The GSE44861 included 49 tumor samples and 48 paired normal tissues. These data were preprocessed as follows: 1) Annotate the relationship between chip probes and the human gene SYMBOL using the Bioconductor package; 2) Samples with OS < 1 month were removed.

Application of WGCNA in identifying key Co-expres sion modules
Through the WGCNA package, we constructed gene co-expression networks based on the data in TCGA and GSE44861, respectively [11]. Firstly, the Pearson correlation analysis is carried out with all pairwise genes to construct the co-expression matrix.
Secondly, we converted the Pearson correlation matrix to an adjacency matrix (scale-free network). To decide the most appropriate β value, we calculated the scale-free fit index and meant connectivity for each supposed β from 3 to 20. To select the proper β value, we set β (from 3 to 20) to calculate the scale-free fit index. Next, transform the adjacency matrix to TOM (topological overlap matrix), reducing the false correlation. Finally, we divided the gene into several co-expression modules by different TOM values. To further identify functional modules, based on the algorithm, the characteristic correlation of clinical modules is constructed [12]. To better understand the logic, we found a more detailed method of the WGCNA [12].

Modules of Interest
The R package limma has a powerful function, which can label the biological function of genes and screen different genes [13]. Firstly, we used a limma package to screen out the differentially expressed genes in the TCGA and GSE44861 datasets. The inclusion criteria for screening are as follows: 1) cut-off value = |logFC| ≥ 1.0; 2) P < 0.05. The screened genes were then shown using volcanic maps [14]. Finally, the obtained DEGs gene and co-expression genes were analyzed interactively to obtain the potential prognostic genes overlapping part [15].

Functional Annotation for Genes of Interest
In order to reveal the biological functions and signaling pathways of the screened genes, Gene Ontology (GO) [16] enrichment analysis and Kyoto Encyclopedia of Genes and Genomes [17] were fully utilized. With cluster Profiler package in R package, we carried out enrichment analysis on genes of Interest, and the results were shown in bubble diagrams. The enrichment standard was set as p<0.05. Among the many products, we mainly focus on the result of the biological process.

Construction of PPI and Screening of Hub Genes
After enrichment analysis, the genes were screened by a visual network model to determine the hub genes. Firstly, we input the genes into the STRING online tool to build a network model. Secondly, we choose the genes with a score ≥ 0.4 to build a network model visualized by Cytoscape (v3.7.2) [18]. Then, the best central node was found in the co-expression network. Finally, the hub genes were the top ten in MCC value, which was calculated by CytoHubba.

Values of Hub Genes
Survival analysis of the hub genes was performed to verify the prognostic value of the genes. In the TCGA database, we use the survival package under the R environment to obtain relevant survival analysis data. We divided the patients into the high expression group and the low expression group to explore the relationship between the two groups overall survival (OS). Only patients with completed follow-up times were included for survival analysis. We used the Kaplan-Meier method to conduct a bilateral logarithmic test to explore the differences between the two groups, and P < 0.05 was considered statistically significant. Moreover, the association between disease-free survival (DFS) and hub genes expressed in CRC patients was determined using the online tool GEPIA2 [19].

Genes by the HPA Database
The Human Protein Atlas (https://www.proteinatlas.org) is a website that contains immunohistochemistry-based expression data for near 20 widespread kinds of cancers, and each tumor type includes 12 individual tumors [20]. In this study, the differences in CLCA1 in normal and CRC tissues were compared by immunohistochemical images.

Gene Set Enrichment Analysis
To understand the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways of CLCA1 signature, a gene set enrichment analysis (GSEA) was used to analyze the enrichment terms in the entire TCGA cohort. The GSEA version was 4.11, and meeting NOM p < 0.05 and FDR q < 0.25 were set to be statistically significant.

TICs Profile
CIBERSORT(https://cibersort.stanford.edu/)was used to estimate the abundance of TIC in all tumor samples. Quality filtering criteria were p < 0.05, and detailed methods were detailed in this study.

Statistical Analysis
Mann-Whitney U test tested the difference in CLCA1 expression between CRC tissues and adjacent nontumor tissues. The differences in CLCA1 among multiple groups were compared by Kruskal-Wallis test. Chi-square (χ2) test was used to evaluate the interrelation between CLCA1 expression and clinicopathological parameters. Kaplan-Meier analysis and logrank test were used to compare the significant differences in survival rates between the high-and the low-CLCA1 expression groups. All statistical analyses were performed with R software (version 4.03), and P < 0.05 was used to determine the significance level.

Weighted Gene Co-expression Modules construction
In order to move forward a single step to discover the functional clusters in CRC patients, the gene co-expression networks were made up of the TCGA-CRC and GSE44861 datasets with the WGCNA package. TCGA-CRC dataset and GSE44861 dataset were divided into 6 modules ( Figure 2A) and 15 modules ( Figure 3A) respectively.
In order to verify the relationship between each module and two clinical traits (cancer and normal), we plotted the adjacency heatmap of all genes. As was shown in Figure 2B, 3B, different colors represent different modules, the results of the module-trait that the green module in the TCGA-CRC and blue module in the GSE44861 were found to have the highest association with normal tissues (green module: r = 0.8, p = 1e−10; blue module: r = 0.68, p = 1e−14).

The DEG analysis and identification of CRC relation Modules
Base on the cut-off criteria of |log FC| ≥ 1.0 and adj. P < 0.05, 1135 DEGs in the TCGA dataset ( Figure 4A) and 278 DEGs in the GSE44861 dataset ( Figure 4B) were identified by the Limma-package. Differential expression analysis was performed on the RNA-sequencing data from TCGA and GEO. In Figure 4C which pointed out that, 1154 coexpression genes which were identified in the green module of TCGA dataset and 531 coexpression genes which were identified in the blue module of GSE44861. In the end, 103 overlapping genes were extracted from co-expression modules ( Figure 4C).

Go and KEGG pathway Enrichment Analyses for the 103 Genes
To explore the biological functions and signal pathways, we carried out GO and KEGG pathway enrichment analyses on genes in CRC related. Figure

PPI Network Construction and Hub Genes Identification
The PPI network of overlapping genes was established by Using the STRING database ( Figure 6A). The centrality of genes was analyzed in the sub-network with MCC method. The results of the hub genes from the PPI network were shown in Figure 6B. The results are shown in Figure 8 and the top ten highest-scored genes are highlighted. These including CLCA1, ZG16, GUCA2B, GUCA2A, CLCA4, SLC26A3, MS4A12, GCG, SI, NR1H4, were selected as the hub genes ( Figure 6B).

Hub Genes
We did a survival analysis of the ten hub genes through the use of Kaplan-Meier plotter which took advantage of the R survival package. The results showed that high gene expression of CLCA1, GCG, and GUCA2A was positively correlated with longer survival time (P < 0.05) ( Figure 7A, 7B, 7C). In addition, the DFS analysis of ten hub genes was performed by the GEPIA2 database. The results showed that high gene expression of CLCA1 was positively correlated with longer survival time (P < 0.05) ( Figure 8A).
Moreover, the protein levels of CLCA1 in colon tumor tissues are lower than those in normal tissues in the HPA database ( Figure 10). Therefore, it is concluded that high-expression of CLCA1 is associated with good prognosis and higher overall survival in CRC patients.

The Difference in CLCA1 Expression in CRC
In TCGA database, there were 514 tissues (including 473 CRC tissues and 41 adjacent nontumor tissues) associated with the mRNA expression level of CLCA1 gene.
As shown in this scatter plot ( Figure 9A), mRNA expression profiles of CLCA1 in colon tumor tissues was significantly down-regulated compared with normal tissues (P < 0.001).
Furthermore, the different expression group of CLCA1 was analyzed. The results shown that different expression group is associated with pathological stage (P < 0.05, Figure 9B), T stage (P<0.05, Figure 9C), and M stage (P<0.05, Figure 9E).

Identification of CLCA1-Related Signaling Pathways by GSEA
The function of CLCA1 was explored through the mining of TCGA database. In addition, CLCA1-related signaling pathways were explored through GSEA. According to NES, FDR q-value, and nominal p-value, the signal pathways with obvious enrichment were selected. Figure 11 exhibits the top-10 up-regulated in the high-expression group of CLCA1. The results revealed that the CLCA1 was related to biological processes of "Ascorbate and aldarate metabolism", "Butanoate metabolism", "Citrate cycle tca cycle", "Fatty acid metabolism", "Glycolysis gluconeogenesis", and "peroxisome" in colon.

Correlation of CLCA1 With the Proportion of TICs
In order to study the correlation between CLCA1 expression and immune microenvironment, the proportion of tumor filtered subset of the immune was analyzed by making full use of CIBERSORT algorithm. In the colorectal cancer samples, we finally

DISCUSSION
In recent years, Colon cancer has become the second most common digestive tract tumor [21]. The mechanisms underlying colon cancer development and progression remain unclear. Even though the means to treat colon cancer has improved, the prognosis of patients is poor due to the lack of accurate molecular targets. Hence, it was urgently demand better biomarkers for prognosis and progression of colon cancer.
In this paper, bioinformatics analysis methods were fully utilized to identify the 103 significant gene, which has the same expression trends in the TCGA and GSE44861 databases. According to KEGG and GO results, these genes were mainly concentrated in tumor energy metabolism. Cell energy metabolism is divided into aerobic oxidation and glycolysis. In most cases, tumor cells prefer to generate energy through glycolysis under aerobic conditions, which is the basis for tumor cell proliferation and invasion. [22][23][24]. In addition, 10 genes were screened out by the MCC scores (namely CLCA1, ZG16, GUCA2B, GUCA2A, CLCA4, SLC26A3, MS4A12, GCG, SI, NR1H4).
Among them, CLCA1 high expression, and that was highly relevant to good overall survival in colon cancers. Furthermore, CLCA1 expression levels were different in groups classified including the basis of pathological stage, T stage, and M stage. The expression of CLCA1 increased with high differentiation, and which was downregulated with rising tumor stage. Then, we revealed several significantly enriched pathways by the GSEA. In the patient with high-risk scores subgroup, these pathways are primarily enriched in the "Ascorbate and aldarate metabolism", "Butanoate metabolism", "Citrate cycle tca cycle", "Fatty acid metabolism", "Glycolysis gluconeogenesis", and "peroxisome" in cancer ( Figure   5A). From this perspective, high-risk patients may benefit more from metabolic therapy.
However, we need to explore the relationship further.
In addition, the Immune infiltration analysis for CLCA1 was carried out. As was known to all, CACA1 was closely related to the functions of B lymphocytes. Previous studies have shown a positive correlation between naive B cells and CACA1 expression in asthma though the CIBERSORT analysis for the proportion of TICs. Ching have found that epithelial cells could release CLCA1 to activate airway macrophages and eventually induce the pro-inflammatory response in the airway by inducing IL-1 β, IL-6, TNF-α and IL-8 [25].
However, it has been shown that the non-hydrolase mutant CLCA1 retains the ability to activate macrophages. [26]. Although these findings indicate that CLCA1 is closely related to the tumorigenesis mechanism, the relationship is still further study Calcium-activated chloride channel (CLCA) regulators are proteins with a symmetrical homocysteine motif in the terminal tail of the amino group [27]. The loci of the human CLCA genes are at chromosome 1p31-1p22 [28]. Calcium-activated chloride channel regulator 1 (CLCA1) is a group of secreted self-cleaving protein that activates calcium-dependent chloride channels. Recently, the function of CLCA1 in metalloprotease property and involvement in mucus homeostasis and immune modulation has become a research hotspot. More and more studies have shown that CLCA1 may be involved in the pathophysiological process of colon cancer. There is growing interest in utilizing CLCA1 as a diagnostic, prognostic and predictive biomarker, as well as a potential therapeutic target.
Whether CLCA1 can be used as a potential target for the treatment of colon cancer has attracted wide attention [29]. Knockdown of CLCA1 in Caco-2 cell line has been shown to inhibit cell differentiation and promote cell proliferation [30]. In vitro experiments further showed that CLCA1 may inhibit the proliferation and metastasis of colon cancer by inhibiting the Wnt/beta-catenin signaling pathway and the epithelial-mesenchymal transition process. In addition, in vivo experiments showed that overexpression of CLCA1 could inhibit the proliferation and metastasis of colon cancer [31]. It has been proved that CLCA1 is closely related to TMEM16A, and CLCA1 can regulate TMEM16A to participate in the proliferation, migration and invasion of colorectal cancer cells [32][33][34]. It is well known that c-myc is a pro-oncogene and its products closely regulate cell proliferation and apoptosis. It has been reported that there is a close relationship between CLCA1and cmyc transcription [35]. However, the specific mechanism of action remains to be studied cancer. In addition, important information such as tumor size is not available, and the clinical information was incomplete. Moreover, there was a shortage of specific details, such as surgical treatment and surgical details, which are crucial to the prognosis of patients. In the end, like all biogenetic studies, our study was unable to explore the results of CLCA1 in different subtypes of colorectal cancer (Left colon, right colon, sigmoid and rectal cancers). In despite of the fact that we provided a comprehensive bioinformatics analysis to identify potential diagnostic genes between cancer and normal tissues, it may be difficult to make accurate results for each patient with colon cancer subtypes.
To sum up, by integrating WGCNA with differential gene expression analysis, this research generated the significant survival correlative gene, which was called CLCA1 that has potential for prognosis prediction in colon cancer.

Acknowledgments
Not applicable.

Funding
This study was supported by the Fujian province teaching reform project (grant no.

Availability of data and materials
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Authors' contributions
SQC designed the study, obtained funding, analyzed the data and revised the manuscript. JJL drafted the manuscript and supervised the study. SQC was a supporter of the funding (grant no. FBJG20190063) and participating in the drafting and subsequent revision of the manuscript. SYL, ZHC and LZY collected and analyzed the data. All authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.      (B) Identification of the hub genes from the PPI network using maximal clique centrality (MCC) algorithm. Edges represent the protein-protein associations.
The red nodes represent genes with a high MCC sores, while the yellow node represent genes with a low MCC sore. The patients were stratified into high-level group (red) and low-level group (green) according to median expression of the gene. Log-rank P < 0.05 was considered to be a statistically significant difference. The patients were stratified into high-level group (red) and low-level group (green) according to median expression of the gene. Log-rank P < 0.05 was considered to be a statistically significant difference.