Development and Validation of a Novel Five-Gene-Based RNA Binding Protein Associated Prognostic Model for Human Colon Cancer

Background: Dysregulation of RNA binding protein (RBP) expression has been reported in various malignant tumors, and it is related to the occurrence and development of cancer. However, the role of RBPs in colon cancer remains unclear. Methods: We downloaded the RNA sequencing data of colon cancer from The Cancer Genome Atlas (TCGA) database, and determined the differently expressed RBPs between normal and cancer tissues. Then, through a series of bioinformatics analysis, we systematically studied the expression and prognostic value of these RBPs. Result: A total of 490 different expression differently expressed RBPs were identied, including 323 upregulated and 167 down regulated RBPs. Five RBPs (PNLDC1, NSUN6, NOL3, PPARGC1A, LRRFIP2) were identied as prognosis related genes for the construction of prognostic model. Further analysis showed that the overall survival rate (OS) of patients in the high-risk subgroup was worse than that in the low-risk subgroup based on this model. The area under the characteristic curve of time-dependent receiver was 0.691 in TCGA and 0.624 in GEO, which conrmed the prognostic model to be a good one. We also established a nominal map based on the internal validation in 5 RBPs mRNAs and TCGA sequeues, showing a good ability to differentiate colon cancer. Conclusions: We screened RBPs expression differences between colon cancer and adjacent non tumor colon tissues using the TCGA database to identify potential gene biomarkers.Besides,a very effective prediction model was constructed and tested based on the differential expression of RBPs using the TCGA and Gene Expression Omnibus (GEO) database.We also Validated of the relationship between the expression of ve RBPs and prognosis

identi ed as prognosis related genes for the construction of prognostic model. Further analysis showed that the overall survival rate (OS) of patients in the high-risk subgroup was worse than that in the low-risk subgroup based on this model. The area under the characteristic curve of time-dependent receiver was 0.691 in TCGA and 0.624 in GEO, which con rmed the prognostic model to be a good one. We also established a nominal map based on the internal validation in 5 RBPs mRNAs and TCGA sequeues, showing a good ability to differentiate colon cancer.
Conclusions: We screened RBPs expression differences between colon cancer and adjacent non tumor colon tissues using the TCGA database to identify potential gene biomarkers.Besides,a very effective prediction model was constructed and tested based on the differential expression of RBPs using the TCGA and Gene Expression Omnibus (GEO) database.We also Validated of the relationship between the expression of ve RBPs and prognosis Background Colon cancer is the third most common cancer and the second leading cause of cancer death worldwide.
In 2018, there were over 1.8 million new cases and 881 thousand deaths globally. The global burden of colon cancer is expected to increase by 60% to more than 2.2 million new cases and 1.1 million deaths by 2030 [1] . In addition, although incidence rate and mortality rate of many high-income countries were stable or declining, but the age-speci c incidence and mortality rates are still the highest in the world, especially the incidence among the young adults [2,3] .
In this context, RNA binding proteins (RBPs) play an important role as regulatory factors of RNA life cycle, because they can bind to RNA sequences [4] , In humans, more than 1,500 RBPs have been recently identi ed, accounting for 7.5% of the proteome. Each contemporary RBP binds to hundreds of RNAs, including both coding and non-coding RNAs, and precisely guides nuclear output, translation / degradation rates and intracellular localization of target transcripts. Thus, it plays an extensive regulatory role in almost all physiological processes. Accordingly, Alterations in RBP expression or function lead to abnormal RNA translation and may lead to the occurrence and development of various pathological conditions, including cancer [5] . Recently, the next generation sequencing analysis of tumor samples showed that the expression of gene encoding RBP was signi cantly higher than that in the gene not encoding RBP [6] . In addition, evidence showed that RBPs were closely related to the regulation of most cancer hallmarks, including cell proliferation, cell death resistance, stemness, cell dissemination and immune system evasion, which may act as a potential biomarker for tumor progression [7] . In some cases, two RBPs can bind to the same RNA target to stabilize it, either enhancing or repressing translation [8] .
RBPs can also have dichotomous functions where they can both enhance [9,10] or repress [11,12] tumorigenesis depending upon the cellular context.
In recent years, advanced structural-analysis experiments provided evidence for complex protein-RNA interactions that do not require a typical RNA binding domain. RNA interactome capture (RIC) has identi ed "unconventional" RBPs in some organisms that have no recognizable RNA binding domains and have no known relationship with RNA biology [13,14] . Further studies have also shown that disordered protein regions can also promote the interactions between protein and RNA, which can be speci c or nonspeci c. These unorthodox interactions regulate RNA metabolism and different RNA processes, including co-and post-transcriptionally [15] .
At least 16 families of RBPs are deregulated in cancer [16] . Several speci c RBPs have been published in the context of intestinal homeostasis and intestinal tumorigenesis including colon cancer including LIN-28, IGF2BPs/IMPs, musashi, HuR, Mex-3A, CELF1 and RBM3. .
Tumor is a kind of polygenic disease. Usually, each tumor is caused by the disorder of polygenic expression which form a complex network of signaling pathways, leading to the occurrence and development of tumor.
Numbers of projects began to explore methods to accurately predict the prognosis of cancer. Our study used mRNA-seq database from the Cancer Genome Atlas (TCGA). Univariate method was used to construct the gene marker of prognosis, signi cance analysis of RBPs expression and Cox regression survival analysis. The nal regression model represents a potentially useful tool but needs to be veri ed in clinical practice. Nevertheless, our ndings will help to elucidate the pathogenesis of colon cancer.

Methods
Data processing and screening for differentially expressed RBPS in colon cancer the RNA-seq dataset of 39 normal colon tissue samples and 389 colon cancer samples with matched clinical data were downloaded from The Cancer Genome Atlas database (TCGA, https://portal.gdc.cancer.gov/). The negative binomial distribution method was used to identify the differentially expressed RBPs between the two group. The Limma package (http://www. bioconductor.org/packages/release/bioc/html/limma.html ) was based on the negative binomial distribution, applied to perform the analysis and preprocess the raw data,siimultaneously. It ts a generalized linear model for each gene and uses empirical Bayes shrinkage for dispersion and foldchange estimation. All raw data were preprocessed by the Limma package. The Limma package was also used to identify the differently expressed RBPs in view of |log2 fold change (FC)|≥1 and false discovery rate (FDR) < 0.05.

Visualization of Differentlly expressed RBPs with the method of KEGG and GO enrichment analysis
The biological functions of these differently expressed RBPs were comprehensively detected by GO(Gene Ontology) enrichment and Kyoto encyclopedia of genes and genomes (KEGG) pathway analysis(Visualization and Integrated Discovery (DAVID) Bioinformatics Tool, version 6.8.). The GO analysis terms, including cellular component (CC), molecular function (MF), and biological process (BP). All enrichment analyses were carried out by utilizing online WEB-based Gene Set Analysis Toolkit (WebGestalt, http://www.webgestalt.org/) [17] . Both P and FDR values were less than 0.05 as statistically signi cant. RBPs PPI network construction and module screening The differently expressed RBPs were submitted to the STRING database (http://www.stringdb.org/) to identify protein-protein interaction(PPI) information [18] . The Cytoscape 3.8.0 software was used to construct the PPI network further and visualized. The key modules and genes were selected in the PPI network by using Molecular Complex Detection (MCODE) plug-in with both MCODE score and node count number more than 5 [18] . All P ≤ 0.05 was considered a signi cant difference.
Construction of a weighted OS predictive score model related with the RBPs expression Univariate Cox regression analysis was performed on all differently expressed RBPs using the survival R package. A log-rank test was used to screen the signi cant candidate genes further. Subsequently, based on the above preliminary screened signi cant candidate genes, A multivariate Cox proportional hazards regression model was constructed and calculated a risk score to assess patient prognosis outcomes. The risk score formula for each sample was as follows: Risk score = β1*Exp1 + β2*Exp2 + βi*Expi Where β is the coe cient value, and Exp was the gene expression level.
According to the median risk score survival analysis, colon cancer patients were divided into low-and high-risk groups. A log-rank test compared the difference of OS between the two subgroups. Additionally, a receiver operating characteristic (ROC) curve analysis was implemented by using the Survival ROC package to evaluate the prognostic capability of the above model [19] . Besides, colon cancer patient samples with reliable prognostic information from the GSE75500 dataset (https://www.ncbi.nlm.nih.gov/ geo/query/acc.cgi?acc = GSE75500) were used as a validation cohort to con rm the predictive capability of this prognostic model. Finally, the nomogram with calibration plots was conducted using the RMS R package to forecast the likelihood of OS. P < 0.05 was considered to be a signi cant difference.

Results
Differentlly expressed RBPs between colon cancer and normal tissues The databases of colon cancer were downloaded from TCGA contained 389 tumor and 39 normal colon tissue samples. Applying the R software packages3.64, the differently expressed RBPs were identi ed. A total of 1542 RBPs [17] were included in the analysis, and 490 RBPs met the screening standard of this study (P < 0.05, |log2FC)| >0.5), which consist of 323 upregulated and 167 downregulated RBPs. The expression distribution and relationship each other of these differently expressed RBPs was displayed in Fig. 1.

GO and KEGG pathway enrichment analysis of the differently expressed RBPs and their downstream regulator
In order to study the function and mechanism of RBPs, we divided these differently expressed RBPs into two groups: up-regulated and down regulated. Then we use R software package to process them for functional enrichment analysis. The results showed that the upregulated RBPs were enriched in the biological processes(BP) related to RNA ribosome biogenesis, rRNA metabolic process, nucleic acid phosphodiester bond hydrolysis, RNA localization, RNA phosphodiester bond hydrolysis, tRNA metabolism, RNA 3 '-terminal processing, RNA modi cation (Fig. 1c). The down regulated RBPs were enriched in negative regulation of cellular amide metabolism, negative translation regulation, RNA splicing, mRNA catabolic process, tRNA metabolism, cytoplasmic ribonucleoprotein granule, ribonucleoprotein granule (Fig. 1e). Through cellular component (CC) analysis, we found that the upregulated RBPs with differential expression were enriched in preribosome, nuclear part, cytoplasmic ribonucleoprotein granule, ribonucleoprotein granule, spliceosomal complex, small-subunit processome, preribosome, large subunit precursor, 90 s preribosome, exoribonuclease complex, cytoplasmic exsosome(RNase complex) (Fig. 1c) In addition, The cellular component (CC) analysis indicated that the upregulated differently expressed RBPs were enriched in cytoplasmic ribonucleoprotein granule, ribonucleoprotein granule, nuclear speck, P-body, spiceseosomal complex, ribosome, sytoplasmic stress granule, ribosomal subunit, catalytic step 2 spliceosome, U2 − type spliceosomal complex (Fig. 1e). In terms of molecular function (MF), upregulated RBPs with different expressions are rich in RNA catalytic activity, nuclease activity, ribo nuclease activity, tRNA catalytic activity, endonuclease activity, ribonucleoprotein complex binding, mRNA 3' -UTR binding, RNA helicase activity, tRNA binding, RNA methyltrasnsferase activity (Fig. 1c). In addition, the down regulated RBPs in MF have rich expression in RNA catalytic activity, mRNA 3' -UTR binding, translation regulator activity, double-stranded RNA binding, translation repressor activity, translation regulator activity, mRNA 3'−UTR AU − rich region binding, Veri cation of express level and prognostic signi cance The Human Protein Atlas (HPA) online database (http://www.proteinatlas.org/) was used to detect the expression of ve RBPs at a translational level [20] . The prognostic value of the ve RBPs in colon cancer was veri ed by using the Kaplan Meier plotter (https://kmplot.com/analysis/) online tool [21] . translation repressor activity, mRNA regulatory element binding (Fig. 1e). With the method of the enrichment analysis of KEGG pathway, we found that the up regulated RBPs were enriched in ribosome biosgenesis in eukaryotes, RNA transport, spliceosome, mRNA surveillance pathway, RNA degradation, ribosome, RNA polymerase, DNA replication.A degradation, ribosome, mRNA monitoring pathway, RNA transport and splices (Fig. 1d), while the downregulated RBPs were enriched in RNA transport, spliceosome, ribosome, TGF-βsignaling pathway and RNA degradation (Fig. 1f).
Protein-protein interaction (PPI) network construction and key modules selecting.
To further investigated the roles of differently expressed RNA binding proteins in colon cancer, we used the STRING database to create the PPI network (Fig. 1a). Moreover, Cytoscape software was used to sort and analyze the data (Fig. 2b). The expression network was processed using the MODE tool to identify possible key modules and the three key modules acquired (Fig. 2c-f).

Prognosis-related RBPs selecting
With the method of univariate Cox regression analysis, the prognostic signi cance of these 5 differently expressed RBPs identi ed were investigate and 5 prognostic-associated candidates RBPs were validated ( Fig. 3a). This ve-gene signature involved PNLDC1, NSUN6, NOL3, PPARGC1A and LRRFIP2, Of these, PPARGC1A and LRRFIP2, were protective factors while the rest were risk factors. This ve-gene signature is of great importance for the prognostic evaluation of colon cancers. In addition, these 5 prognosticassociated candidate hub RBPs were analyzed by Multivariate Cox regression analysis and investigated their impact on patient survival time and clinical outcomes, and all the ve RBPs were found to be independent predictors in the patients with colon cancer (Fig. 3b).

Prognosis-related genetic risk score model construction and analysis
The ve RBPs identi ed from the Multivariate Cox regression analysis were used to construct the predictive model. The predictive model was characterized by the linear combination of the expression levels of the ve genes weighted by their relative coe cient in the multivariate Cox regression as follows: We then conducted a survival analysis to assess predictive ability of the 5 RBPs. A total of 337 patients with colon cancer were divided into low-risk and high-risk subgroups according to the risk score. The higher the risk score, the worse the clinical prognosis.The K-M OS curves of the two groups, based on the ve genes, were signi cantly different(p = 8.432e − 08; Fig. 3c). the area under the curve( AUC) of the timedependent ROC curve was calculated to evaluate the prognostic capacity of the 5-gene features. The higher the AUC, the better the model performance. The ve-gene-AUCs of biomarker prognostic model were 0.720 and 0.725 for the three and ve-year survival time, respectively, indicating that the predict model had high sensitivity and speci city ( Fig. 3d;3e).In addition, The expression heat map, survival status of patients, and risk score of the signature consisting of ve RBPs in the low-and high-risk subgroups are shown in Fig. 3f ~ h. Verifying the the prognostic model in the GEO dataset The prognostic model was evaluated in other datasets to verify the reliability for the patients with colon cancer. The ve-gene model was assessed in the GEO microarray data GSE75500 with same method in GSE75500 data. The OS of the colon cancer patients in the high-risk group was lower than that in the lowrisk group (p = 3.17e − 3; Fig. 3i). The time-dependent ROC analyses for the survival prediction of the prognostic model obtained AUCs of 0.691 at three years and 0.624 at ve years. (Fig. 3j;3 k), and the risk score of the signature consisting of ve RBPs in the low-and high-risk subgroups are displayed in Fig. 3l n. All these demonstrate that this ve-RBP-gene prognostic model was capable of predicting OS in HCC patients.

A nomogram based on the ve RBPs
In order to found a quantitative way for predicting the prognosis of the colon cancer patients, we integrated the ve RBPs signature based on the multivariate Cox analysis to establish a nomogram that might help doctors to make the clinical decision for the patients (Fig. 4a). We also assessed the prognostic signi cance of different clinical characteristics from TCGA by using COX regression analysis.
However, The results showed that the tumor stage and risk score were independent prognostic factors correlated with OS through univariate and multivariate analysis( Fig. 4b;4c).

Validation of the relationship between the expression of ve RBPs and prognosis
To further explore the prognostic value of ve RBPs in colon cancer, we drew Kaplan-Meier survival curves (PNLDC1, NSUN6, NOL3, PPARGC1A and LRRFIP2) to investigate the relationship between hub-RBPs and OS. The results of the log-rank test showed that the ve RBPs were associated with OS in colon patients ( Fig. 4d ~ h). In order to determine the relationship between each gene expression and clinical characters, we analyzed the expression of proteins encoded by these ve RBPs using clinical samples in the human protein map database. NOL3 in colon tissue was signi cantly increased and LRRFIP2 was decreased compared with normal colon tissue (Fig. 5). However, there was no signi cant difference in PNLDC1 and NSUN6 protein expression between tumor and normal colon tissues (Fig. 5). However, PPARGC1A was not found on the website.

Disscussion
We studied gene expression differences between colon cancer and adjacent non tumor colon tissues to identify potential gene biomarkers using the TCGA database. To establish a risk model and predict the prognosis of colon cancer, The differentially expressed genes were screened single variable, lasso and multivariate Cox analysis. Five RBPs: PNLDC1, NSUN6, NOL3, PPARGC1A and LRRFIP2 were identi ed to be related to the prognosis of colon cancer. Of these, PPARGC1A and LRRFIP2 were found to be protective factors. As an overall prognosis predictor, high levels of 5 RBPs were associated with poor prognosis of colon cancer. AUCs of ROC curve for the prognosis model for predicting the 3-and 5-year survival were 0.720 and 0.725, respectively, showed a good performance in survival prediction.
According to the RBPs-based risk scoring prognosis model, the patients with colon cancer were divided into high-and low-risk group. Clinicians can change patients' treatment plan to achieve personalized treatment. Strategies should be developed to prevent or detect metastasis of colon cancer in high-risk groups. Therefore, we should pay more attention to the high-risk groups, Until now, the study of PNLDC1 in colon cancer and even all tumors has not been reported. Some results provided evidence that mammalian PNLDC1 is a regulator of piRNA biogenesis, transposon silencing and spermatogenesis, protecting the germline genome in mice [22][23][24] . PIWI-interacting RNA (piRNA) is a small non-coding RNA (ncRNA) that interacts with PIWI proteins to form the piRNA silencing complex (piRISC).
PIWI is a subfamily of Argonaute, and piRNA must bind to PIWI to exert its regulatory role in gene silencing and gene modi cation upstream or downstream of oncogenes in cancer cell lines or cancer tissues. Current studies indicated that piRNA and PIWI are signi cantly abnormally expressed in gastric, breast, kidney, colon, and lung cancers, and are involved in the initiation, progression, and metastasis of cancers, which may be the potential diagnostic tools, prognostic markers, and therapeutic targets for cancers. So PNLDC1 might affect the development and progress through regulating piRNA in colon cancer [25] .
Human NSUN6 is an RNA methyltransferase that catalyzes the transfer of the methyl group from Sadenosyl-l-methionine (SAM) to C72 of some types of tRNAs, it relies on a delicate network for RNA recognition, which involves both the primary sequence and tertiary structure of tRNA substrates. Until now, the biochemical properties and functions of NSUN6 homologs were not clear and no studies were reported in cancer until now [26] .
NOL3 gene encodes the apoptosis repressor with caspase recruitment domain (ARC) which is a potent and multifunctional death repressor that inhibits both death receptor and mitochondrial apoptotic signaling. It has been previously reported to play a role in colon carcinogenesis [27] . As a result of the increased ARC expression, NOL3 played a role in TRAIL-induced apoptosis reduced by hypoxia in colon cancer [28] .
Proteins of the peroxisome proliferator-activated receptor γ (PPARγ) coactivator 1 (PGC-1) family of transcriptional coactivators coordinate physiological adaptations in many tissues, usually in response to demands for higher nutrient and energy supply. Of the founding members of the family, PGC-1α (also known as PPARGC1A) is the most highly regulated gene, using multiple promoters and alternative splicing to produce a growing number of co-activator variants. PGC-1α promoters are selectively active in distinct tissues in response to speci c stimuli [29] . Some results showed that PPARGC1A was signi cantly decreased in colon cancer. Taken together the data suggest that the transcriptional activity of PPARγ may not only be decreased by mutation but also by down-regulation of the co-activator PGC-1 of PPARγ in colon cancer [30] .
The preoperative pathology of colon cancer is often only used to determine its pathological nature. In combination with the imaging data of patients such as MRI scan, clinicians determine the treatment plans for patients with colon cancer. Although the preoperative clinical stage of some colon cancer is not late, the survival time of some patients with colon cancer is greatly shortened because of peritoneal implant metastasis or distant metastasis post-operation and even though receiving postoperative chemotherapy without considering the biological characteristics of the tumor. A model to predict the prognosis of patients could help clinicians to develop individualized treatment plan for the patients. For example, the method of neoadjuvant chemotherapy, intraoperative hyperthermia and shorten the interval between postoperative reexamination can be used to improve the overall survival time of the high-risk patients.
However, several limitations of the current study should be considered. First, the population ethnicities in the TCGA database are mainly con ned to people from western countries, and extrapolating the ndings to other ethnic groups needs to be substantiated. Second, our data did not perform PPARGC1A expression because of lacking the experimental results.
A robust nomogram should be validated externally in different queues; therefore, it needs to be further validated in multicenter clinical trials and prospective studies. In the future, we will also explore the possibility of including more predictive variables to further improve performance. Other regression modeling methods will be used to determine whether the prediction accuracy can be further improved. To sum up, our results show that the 5-RBP-gene prognostic model is a reliable tool for predicting OS of colon cancer, and the nomogram including the prognosis model can help clinicians choose personalized treatment for colon cancer patients.

Conclusions
We screened RBPs expression differences between colon cancer and adjacent non tumor colon tissues using the TCGA database to identify potential gene biomarkers.Besides,a very effective prediction model was constructed and tested based on the differential expression of RBPs using the TCGA and GEO database.
We also Validated of the relationship between the expression of ve RBPs and prognosis.

Declarations Acknowledgments
We thank platforms of TCGA, and GEO datasets for data download.

Competing interests
The authors declare that the research was conducted in the absence of any commercial or nancial relationships that could be construed as a potential con ict of interest.

Availability of data and materials
Publicly available datasets were analyzed in this study. This data can be found here: GSE75500, TCGA:The data that support the ndings of this study are available from the corresponding author upon reasonable request.   The differentially expressed RBPs and the relationship each other in HCC. (a) Heat map; In the transverse type, blue represents 50 normal tissues, red represents 100 colon cancer tissues, in the longitudinal thermogram, each point represents the expression dysregulation of RBP in this specimen, the color strip in the upper right corner represents the degree of RBP dysregulation, red represents the highest degree of up regulation, and green represents the highest degree of down regulation. (b) Volcano plot. The abscissa log FC represents the pair value of times for the difference of expression level between cancer and normal tissues, and the ordinate -log10 (FDR) represents the negative pair value of FDR. The larger the ordinate value is, the more signi cant the difference between the expression of RBP gene in the two samples is, and the more reliable the differential expression gene is, Red represents up-regulated expression of RBP in cancer, green represents down regulated expression, and black represents no difference or exclusion. KEGG pathway and GO enrichment analysis of aberrantly expressed RBPs,   (a) Nomogram for predicting 1-, 3-, and 5-year OS of colon cancer patients in the TCGA cohort. In the nomogram, the parameters that affect the prognosis are not the same distribution, some are continuous data, some are binary or multivariate data, so after calculating the regression coe cient, the incomparable variables are uniformly calculated as counting variables, and nally the total score is calculated, and the prognosis is quantitatively determined by the total score comparison table  (a) Nomogram for predicting 1-, 3-, and 5-year OS of colon cancer patients in the TCGA cohort. In the nomogram, the parameters that affect the prognosis are not the same distribution, some are continuous data, some are binary or multivariate data, so after calculating the regression coe cient, the incomparable variables are uniformly calculated as counting variables, and nally the total score is calculated, and the prognosis is quantitatively determined by the total score comparison table