Construction of an RNA Binding Protein-based Model for Prognosis Prediction of Colorectal Cancer

Background: Colorectalcancer (CRC) is a prevalent gastrointestinal tumor with high incidence and mortality. Dysregulation of RNA binding proteins (RBPs) has been found in a variety of cancers and is related to oncogenesis and progression. This study aimed to develop and validate new biomarkers related to CRC prognosis by a series of bioinformatics analysis. Methods: We mined the gene expression data of 510 CRC samples from The Cancer Genome Atlas (TCGA) database, differentially expressed genes were screened and prognosis-related genes were identied. Furthermore, gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were carried out. A prognosis-related gene signature was constructed by univariate and multivariate Cox analysis. Kaplan–Meier curves and time-dependent receiver operating characteristic (ROC) curves were utilized to evaluate the signature,The test set was used to validate the RBPs risk score model.Survival analysis was carried out to determine the independent prognostic signicance of the signature. A nomogram combined with the gene signature was constructed. Results: A total of 224 aberrantly expressed RBPs were obtained, comprising 78 downregulated and 146 upregulatedRBPs. 13 RBPs with p < 0.005 were revealed in univariateCox regression analysis of train group, then stepwise multivariate Cox regression was applied for constituting an eight- RBP (BRCA1, TERT, TDRD7, PPARGC1A, LUZP4, CELF4, ZC3H12C, PNLDC1) signature prognostic biomarkers. Further analysis demonstrated that high risk score for patients was signicantly related to poor overall survival according to the model. The area under the time-dependent receiver operator characteristic curve of the prognostic model was 0.730 at 5 years. The signature-based risk score was an independent prognostic factor in CRC patients. We also established a nomogram based on eight RBPs and internal validation in the train set, which displayed a favorable discriminating ability for Colorectal cancer. Conclusions: The established eight-RBP signature may serve as a novel independent prognostic factor that


Background
Colorectal cancer (CRC) is a major cause of cancer mortality and morbidity worldwide. CRC patients usually show a survival rate of < 5 years due to early metastasis. Although treatments (such as surgery, radiotherapy, chemotherapy, and targeted therapy) have been developed eetly, high recurrence and poor prognosis remain troubling issues [1]. The molecular pathogenesis of colorectal cancer (CRC) is an elaborate multistep process. RBPs are a class of proteins that interact with a variety of types of RNAs involve in rRNAs, ncRNAs, snRNAs, miRNAs, mRNAs, tRNAs and snoRNAs. It can directly interact with RNA or can be part of ribonucleoprotein complexes without direct contact with RNA. An emerging node of regulation of intestinal epithelial homeostasis, response to injury and malignant transformation is through RNA binding proteins (RBPs) [2]. Currently, there are more than 1500 experimentally validated RBP coding genes in the human genome, accounting for about 7.5% of all protein-coding genes [3].
RBPs are vital for the regulation of several essential cellular processes, such as RNA splicing, modi cations, transport, localization, stability, degradation, and translation. RBPs function by binding to their target RNA, forming ribonucleoprotein (RNP) complexes [4] and regulating gene expression posttranscriptionally in a plethora of ways. As a crucial participant in post-transcriptional regulation, RBPs participates in most processes of post-transcriptional regulation, and eventually engages in a series of processes including cell proliferation, differentiation, migration, apoptosis and angiogenesis by regulating the expression of related proteins, and its abnormal regulation is one of the important mechanisms for development of cancer [5]. Previous studies have indicated that RBPs such as SRSF1, Quaking, Muscleblind and HuR as pivotal moderators to regulate the occurrence and progression of cardiovascular diseases by mediating a wide range of post-transcriptional events [6,7]. There have been some previous studies on RBPs and Colorectal cancer. Several studies have shown that LIN28B is overexpressed in about 30% of colorectal tumors [8,9]. IMP1 is overexpressed in more than 80% of human CRC [10] and linked to TNM stage, tumor size, lymph node metastasis, and poor prognosis [11]. The overexpression of MSI is su cient to transform the intestinal epithelium and form tumors via activation of the mTORC1 complex with inhibition of Pten [12,13]. High HuR protein expression is found in both the nucleus and cytoplasm of human colon cancers [14]. While low HuR protein expression is observed in the normal colon [15]. The forkhead box K2 protein (FOXK2) promotes colorectal cancer metastasis by upregulating mRNA expression of zinc nger E-box binding homeobox 1 (ZEB1) [16]. Therefore, RBPs might play a critical role in regulating cancer, however, the functions of most RBPs have not yet been determined in cancers. A systematic functional analysis of RBPs will help us thoroughly investigate its role in tumors.
In the present study, RNA-seq data and clinic pathological data were accessed from the TCGA database for the identi cation of prognosis-related RBPs. Univariate Cox and multivariate Cox regression were performed to select survival-related RBPs, and systematically explored them to analyze the potential functional and clinical signi cance. The study was designed and carried out strictly according to the Reporting Recommendations for Tumor Marker Prognostic Studies [17] and the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis statement [18]. Our results might be potentially helpful for developing prognosis biomarkers. It will enhance our understanding of molecular mechanisms of the model in CRC.

Data source and preprocessing
The Cancer Genome Atlas (TCGA) database (TCGA, https://portal.gdc.cancer.gov/)was utilized to access CRC mRNA data with corresponding clinical information (510 samples). All raw data was preprocessed by Limma package and excluded genes with an average count value less than 1. Wilcox Test was performed to test CRC samples. Genes with logFC > 0.5 and FDR < 0.05 were selected for follow-up analyses. Heatmaps and volcano plot were utilized to visualize the different expression patterns of RBPs through "pheatmap" R package.

KEGG pathway and GO enrichment analysis
These differently expressed RBPs were evaluated by functional enrichment analysis to explore the functions using the clusterPro ler (version 3.10.1) and org.Hs.eg.db (version 3.7.0) package [19].
Signi cantly enriched Gene Ontology (GO) terms and Kyoko Encyclopedia of Genes and Genomes (KEGG) pathways with a P value < 0.05 and q value < 0.05 were visualized using the R (version 4.0.0) software.

PPI network construction and module screening
The differently expressed RBPs were submitted to the STRING database (http://www.string-db.org/) [20] to identify protein-protein interaction information. The network was constructed with the Cytoscape 3.7.0 software. Subsequently, the key modules were screened from the PPI network with scores > 7 and node counts > 5 by using the MCODE (Molecular Complex Detection) plug-in in Cytoscape. All P ≤ 0.05 were considered as signi cant difference.
Construction and validation of the eight-RBP signature.
Univariate Cox regression analysis was conducted to screen survival-related genes in the key modules of TCGA dataset using survival R package. The "caret" package was used to randomly divide the samples with complete survival information into two subgroups: A training set (n = 222) and a test set (n = 222). Based on the above preliminary screened signi cant candidate genes. A series of multivariate Cox was therefore used in the training set based on the feature genes selected before using the survival R package [21,22]. Constructing an optimal model to calculate a risk score to predict the prognosis of CRC patients, consequently identifying the optimal feature genes. The risk score formula for each sample was as follows: where represents the coe cient value, and Exp represented the gene expression level. According to the median risk score survival analysis, CRC patients were divided into low-risk and high-risk groups. Then, Kaplan-Meier (KM) and log-rank methods were used to test the difference in the two groups by using the 'survival' R package (version 2.43; https://cran.r-project.org/web/packages/survival/index.html). After that, the model was validated in the test set. Additionally, a receiver operating characteristic (ROC) curve analysis was implemented in both the training set and test set using the SurvivalROC package to evaluate the prognostic capability of the above model [23]. And the AUC values of the 1-year, 3-year and 5year survival rates were calculated. Independent prognostic ability of the multi-gene signature

GO and KEGG pathway enrichment analysis of the differently expressed RBPs
To investigate the function and mechanisms of the identi ed RBPs, we divided these differently expressed RBPs into two groups: upregulated or downregulated expression. Then we uploaded these differently expressed RBPs for functional enrichment analysis. The results indicated that downregulated differently expressed RBPs were signi cantly enriched in the biological process related to regulation of cellular amide metabolic process, regulation of translation, RNA splicing (Fig. 2a). The upregulated differently expressed RBPs were signi cantly enriched in ncRNA processing, RNA phosphodiester bond hydrolysis, nucleic acid phosphodiester bond hydrolysis. (Fig. 2d). In terms of molecular function, the downregulated RBPs were notably enriched in catalytic activity, acting on RNA, mRNA 3'−UTR binding, endonuclease activity. (Fig. 2a), while the upregulated differently expressed RBPs were signi cantly enriched in catalytic activity, acting on RNA, nuclease activity, ribonuclease activity (Fig. 2d). Through the cellular component (CC) analysis, we found that the decreased differently expressed RBPs were enriched in cytoplasmic ribonucleoprotein granule, ribonucleoprotein granule, endolysosome membrane (Fig. 2a), and upregulated differently expressed RBPs were mainly enriched in cytoplasmic ribonucleoprotein granule, ribonucleoprotein granule, preribosome. (Fig. 2d).
The results of KEGG pathways about the downregulated differently expressed RBPs are 6 ( Fig. 2b-c), of which were mainly enriched in the Hepatitis C, TGF-beta signaling pathway, Progesterone − mediated oocyte maturation, Oocyte meiosis. The outcome of KEGG pathways about the upregulated differently Univariate and multivariate Cox regression analyses were conducted both in training and test set to determine whether risk score and corresponding clinicopathologic features were independent prognostic factors for CRC patients.

Development of the nomogram.
In order to provide clinicians with a quantitative tool to predict the individual probability of survival time, the nomogram was conducted using rms R package to forecast the likelihood of OS. P < 0.05 was considered to be a signi cant difference. expressed RBPs is 6, enriched in the mRNA surveillance pathway, RNA transport, Ribosome biogenesis in eukaryotes (Fig. 2e-f).

Protein-protein interaction (PPI) network construction and key modules selecting
To further investigated the roles of differently expressed RNA binding proteins in CRC, we created the PPI network using Cytoscape software, which incorporated 190 nodes and 895 edges based on the data from STRING database (Fig. 3a). The co-expression network was processed using the MODE tool to identify possible key modules (Fig. 3b). Module 1 included 24 nodes and 254 edges (Fig. 3c), and module 2 consisted of 8 nodes and 23 edges (Fig. 3d). Module 3 contained 5 nodes and 10 edges (Fig. 3e), The GO and pathway analyses showed that the RBPs in the key module 1 greatly abounded in ribosome biogenesis,rRNA processing,rRNA metabolic process. While the RBPs in module 2 were associated with mRNA 3'-UTR binding, RNA splicing, mRNA 3'-UTR AU-rich region binding. Module 3 were mainly enriched in defense response to virus, response to virus, cellular response to exogenous dsRNA (Table 1). Prognosis related RBPs selecting A total of 224 key differently expressed RBPs were identi ed from the PPI network. To investigate the prognostic signi cance of these RBPs, 13 survival-related feature genes were identi ed in the training set as being associated with prognosis following univariate Cox proportional hazards regression analysis (Fig. 4a). Subsequently, these 13candidate RBPs were analyzed by multiple stepwise Cox regression to investigate their impact on patient survival time and clinical outcomes, eight hub RBPs were found, including BRCA1, TERT, TDRD7, PPARGC1A, LUZP4, CELF4, ZC3H12C and PNLDC1 (Fig. 4b, Table 2). Expression pattern of 8 selected RBPs between tumor and normal colon tissue (Fig. 4c).  According to the median risk score, patients were divided into high-and low-risk groups. Moreover, to know better about the role of the risk score in the patient's survival, Kaplan-Meier analysis was performed on both the high and low-risk samples. The results demonstrated that the patients in the highrisk subgroup were with poor OS compared to those in the low-risk subgroup (Additional le 2a). Besides, 1-, 3-and 5-year time-dependent ROC analyses were performed to evaluate the signature prognostic sensitivity and speci city (Additional le 2b). The AUCs of the signature was 0.670, 0.675 and 0.678 at 1-, 3-and 5-year survival times, respectively, suggesting that this signature may have high prognostic accuracy.
According to the formula, the risk score of patients was calculated and arranged from low to high. As shown in Additional le 2c-e, we could observe that the risk score of the patients with high survival risk was over 1, while those of low-risk patients were all less than 1. Then, scatter diagrams of patients' survival time were plotted based on the risk score. It was found that with the increase of the risk score, the mortality of CRC patients was increased. However, the survival rate of patients no signi cant downward trend. Moreover, the expression levels of the 8 RBPs were analyzed in heatmaps. TERT exhibited high expression in the high-risk group, ZC3H12C, and PPARGC1A showed low expression in the high-risk group relative to the low-risk group.
Validation of the signature in the test set.
By using the established cut-off point, patients were assigned to a low-or high-risk group in the test sets.
The patients' risk score distribution and survival status were ranked by the risk score in the test set. The results from KM analysis for overall survival suggested that patients in the low-risk group had signi cantly better survival than those in the high-risk group (p < 0.05). (Fig. 5a).Furthermore, ROC analysis was conducted to determine the sensitivity and speci city of the risk score for survival prediction. As plotted in Fig. 5b, the AUC values of 1-year, 3-year and 5-year survival rates in training set were 0.701,0.709, 0.730, respectively (Fig. 5b). These results suggested that the prognostic model has better sensitivity and speci city. The risk score, survival time distribution and patients' status are the same as the training set ( Fig. 5c-e). Taken together, these results indicated that this eight-RBP signature could effectively screen out high-risk CRC patients with relatively worse clinical outcomes.
The signature-based risk score was an independent prognostic factor in CRC patients In order to determine whether the eight-RBP risk signature acts as an independent prognostic indicator, we performed univariate and multivariate Cox regression analyses of signature-based risk score in test set.
The univariate Cox analysis indicated that signature-based risk score was signi cantly associated with worse OS with HR = 1.759 (p = 0.003, 95% CI [1.212-2.554]) (Fig. 6a). Meanwhile, age (HR = 1.031, 95% CI [1.001-1.063], p = 0.045) were also proved to be signi cantly associated with the OS (Fig. 6a). After that, all the variables were enlisted into the multivariate Cox analysis. Notably, the signature-based risk score still exerted as a risk factor for lower overall survival (HR = 1.635, 95% CI [1.083-2.469], p = 0.019) of CRC patients (Fig. 6b). Hence, these data demonstrated that the signature-based risk score was an independent prognostic factor in CRC patients.

Construction of a nomogram based on the eight hub RBPs
Then, we constructed a nomogram that integrated eight-RBP signature to develop a quantitative method for CRC prognosis (Fig. 6c). Based on the multivariate Cox analysis, points were assigned to individual variables by using the point scale in the nomogram. We draw a horizontal line to determine the point of each variable and calculate the total points for each patient by summing the points of all variables, and normalize it to a distribution of 0 to 100. We can calculate the estimated survival rates for CRC patients.

Discussion
Colorectal cancer is a highly malignant tumor, which is particularly prone to liver and lung metastasis, seriously affecting the survival prognosis of patients [24]. Therefore, nding a prognostic marker with high speci city and sensitivity is the advisable choice applied in the clinical practice. Extensive evidence displayed RBPs play a pivotal role in the development and progression of various human tumors [25][26][27].
However, only a small part of RBPs have been studied in-depth and partially con rmed that they contributed to occurrence and development of cancers [28].
In the current study, we download mRNA expression pro les and corresponding patients' clinical information of CRC from TCGA database. 224 RBPs identi ed as signi cantly different by using the Wilcox Test for the differential analysis, consisting of 146 up-regulated RBPs and 78 down-regulated RBPs. Then, we systematically explored the potential functional pathways and constructed a PPI network of these differently expressed RBPs. Moreover, univariate Cox regression analysis, survival analysis, multivariate stepwise Cox regression analysis and ROC analysis were performed on Hub RBPs to explore its biological function and clinical signi cance further. These ndings may contribute to developing novel biomarkers for prognosis of CRC patients.
The function enrichment analysis demonstrated that the differently expressed RBPs were signi cantly related to regulation of translation, RNA splicing, cytoplasmic ribonucleoprotein granule, ribonucleoprotein granule, catalytic activity, acting on RNA, mRNA 3'−UTR binding. The KEGG pathways analysis showed that the aberrantly expressed RBPs regulate the tumorigenesis and progression of CRC by affecting mRNA surveillance pathway, RNA transport, TGF − beta signaling pathway. The above information suggests that most of the enrichment pathways relate to RNA modi cation, which also indicates that the modi cation and processing of RNA is related to tumorigenesis. RNA metabolism and RNA processing have been increasingly recognized in various diseases [29,30], Post-transcriptional regulation of RNA stability is an important procedure in gene expression processing. Previous studies have proved that RBPs as a key participant in post-transcriptional regulation, and are involved in most processes such as cell proliferation, differentiation, migration, apoptosis and angiogenesis by regulating the expression of related proteins. HuR, a member of ELA V family of RBPs ,High HuR protein expression is found in both the nucleus and cytoplasm of human colorectal cancer [14]. HuR protein can recognize and combined messenger RNA of target genes, thus stable target genes to inhibit the degradation of messenger RNA, participate in the post-transcriptional regulation of target genes, and regulate a variety of physiological and biochemical processes such as tumor cell growth, proliferation, differentiation, signal transduction, transcription and translation, angiogenesis, protein transport, cell apoptosis and its metabolism [31]. The insulin-like growth factor-2 mRNA binding proteins (IGF2BPs or IMPs) belong to a conserved subfamily of RBPs. The IMPs have been studied for their roles in regulation of posttranscriptional processes such as mRNA localization, turnover, and translational control [32]. IMP1 plays a functional role in the RNA stability by binding and shielding several mRNAs that play critical roles in cell growth and proliferation from proteolytic degradation [33]. IMP1 is overexpressed in more than 80% of human CRC [10], which regulates cell cycle progression and migration [34], and correlates with invasion, lymph node metastasis, and worse prognosis [11,35,36]. Moreover, ribonucleoprotein granule is an important area that implements protein synthesis. The mutation of ribonucleoprotein regulates the translation process and associated with tumor development [37]. By applying whole-exome sequencing, RNA seq, or whole-genome sequencing, ribonucleoprotein regulates gene mutations have been detected in the genome of cancer cells, including from endometrial cancer, T-cell acute lymphoblastic leukemia (T-ALL), chronic lymphocytic leukemia (CLL), colorectal carcinomas, and high-grade gliomas [38][39][40][41]. In brief, RBPs can bind to mRNAs and regulate their expression, which play an important role in the development of many diseases.
Furthermore, the hub RBPs were selected based on univariate Cox regression analysis, and multiple Cox regression analysis. 8 feature genes associated with CRC patient's survival were identi ed, including BRCA1, TERT, TDRD7, PPARGC1A, LUZP4, CELF4, ZC3H12C, and PNLDC1. Among these genes, several RBPs have been reported to be closely associated with colorectal cancer. BRCA1 associated hereditary breast cancer, and recently, meta-analysis results provide clinicians and healthcare regulatory agencies with evidence of the increased risk of colorectal cancer in BRCA1 mutation carriers [42]. The human telomerase reverse transcriptase (TERT) also plays an important role in colorectal cancer growth, initially affecting the tumor's stromal microenvironment [43]. CUG binding protein 4 (CUBP4) or CELF4 is a multifunctional RBP studied primarily for its role in RNA metabolism related processes like decay, translation and splicing. Previous studies have shown that CELF2 expression is consistently reduced during neoplastic transformation, suggesting that it might play a crucial role in tumor initiation and progression [44]. It is consistent with our ndings. PPARGC1A was indicated as a tumor suppressor in colorectal cancer [45] and ovarian cancer [46], as well as a negative prognostic biomarker for CRC.
Although the connection between most of the differently expressed RBPs and CRC remains unclear, some RBPs have been reported to be associated with other tumors. LUZP is a regulatory protein that contains 20 amino acids, Excessive expression or mutation of LUZP may lead to the occurrence of malignant tumors, and abnormal expression of LUZP can be detected in solid tumors such as cervical cancer, breast cancer and pancreatic cancer [47][48][49]. ZC3H12A, also known as MCPIP1 or Regnase-1, is a novel RNAbinding protein (RBP) that plays a key role in post-transcriptional regulation and immune homeostasis [50,51], a study shows that ZC3H12A is involved in circHECTD1/HECTD1-mediated macrophage activation in lung cancer [52]. PNLDC1 is a PARN-like 3'-to-5' exonuclease localized on the mitochondrial surface in mouse, Associated with the emergence of the mature piRNA [53]. TUDOR domain-containing proteins (TDRDs) are chie y responsible for recognizing methyl-lysine/arginine residue. TDRD dysregulation contributes to breast tumorigenesis [54].
Then, an 8-gene signature-based risk score formula was developed for survival analysis in the training and test set. High expression of BRCA1, TDRD7, PPARGC1A, and ZC3H12C was associated with a good prognosis in patients with CRC, whereas that of TERT, LUZP4, CELF4, and PNLDC1 were related to poor prognosis. Next, the eight RBPs were used to construct a risk model by multiple stepwise Cox regression analysis to predict prognosis in CRC patients. ROC analysis was performed and showed the higher analysis, the signature-based risk score was an independent prognostic factor in CRC patients. A nomogram was constructed to enable practitioners to predict1-year, 3-year, and 5-year OS of CRC patients.
Nevertheless, our study had several limitations. Firstly, our prognostic model was only based on the data from TCGA database, this study was a pure bioinformatics study, and the scienti c hypothesis was not proved by biological experiments. Secondly, our study was designed based on a retrospective analysis, and a prospective study should be performed to verify the results. Thirdly, the lack of some clinical characteristics in the datasets from TCGA may have decreased the statistical effectiveness and reliability of the multivariate stepwise Cox regression analysis.
In summary, our study not only constructed a new predictive model of RBPs signature prognosis but also by grouping to verify and evaluate the predictive ability of the model. The most important thing is can be used as an independent prognostic factor in CRC. As far as we know, this is the rst report of developing a RBPs-associated prognostic model for CRC. Our results would greatly contribute to show the pathogenesis of CRC and enhance our comprehension to treatment and progression of colorectal cancer.

Conclusions
We constructed an eight-RBP signature to predict prognosis for patients with colorectal cancer and validated the model in training and test sets. ROC analysis showed a higher accuracy of the risk score model. The eight-RBP signature was an independent prognostic factor for overall survival. Prospective Declarations version to be published. All authors read and approved the nal manuscript.

Funding
None.

Availability of data and materials
The datasets generated and analyzed during the current study are available in The Cancer Genome Atlas (TCGA), https://portal.gdc.cancer.gov/.
Ethics approval and consent to participate Not applicable.

Consent for publication:
All listed authors have actively participated in the study and have read and approved the submitted manuscript.