Identification of differentially expressed RBPs
The transcriptome data and corresponding clinical information for 473 COAD samples and 41 normal samples were downloaded from TCGA. After preprocessing and difference analysis of the raw data of the 1542 RBPs , we found that there were 123 upregulated and 54 downregulated RBPs (Fig. 1A, B).
GO And KEGG Enrichment Analysis Of Differentially Expressed RBPs
In order to explore the function and molecular mechanism of these differentially expressed RBPs, we performed GO and KEGG enrichment analysis on upregulated and downregulated RBPs, respectively. Upregulated differentially expressed RBPs (UP-DE-RBPs) were significantly enriched in biological processes associated with ncRNA processing, RNA phosphodiester bond hydrolysis, and ribosome biogenesis. In the cellular component enrichment analysis, UP-DE-RBPs were significantly enriched in cytoplasmic ribonucleoprotein granule, ribonucleoprotein granule, and nucleolar part. The molecular function analysis showed that UP-DE-RBPs were significantly enriched in catalytic activity, acting on RNA, ribonuclease activity, and translation regulator activity (Fig. 2A, B). Downregulated differentially expressed RBPs(DOWN-DE-RBPs) were significantly enriched in biological processes associated with defense response to virus, regulation of cytoplasmic translation, and regulation of translation. In cellular component enrichment analysis, DOWN-DE-RBPs were significantly enriched in endolysosome membrane, apical dendrite, and mitochondrial matrix. The molecular function analysis showed that DOWN-DE-RBPs were significantly enriched in mRNA 3'-UTR AU-rich region binding, AU-rich element binding and mRNA 3'-UTR binding and (Fig. 2C, D). In regard to the KEGG enrichment analysis, the UP-DE-RBPs were mainly enriched in ribosome biogenesis in eukaryotes, mRNA surveillance pathway, and RNA transport (Fig. 2E, F). The DOWN-DE-RBPs were mainly enriched in progesterone-mediated oocyte maturation, oocyte meiosis, and hepatitis C (Fig. 2G, H).
PPI Network Construction
In order to more clearly understand the mechanism of these DE-RBPs, we used the STRING database and Cytoscape software to construct a PPI network (Fig. 3A). This PPI network contained a total of 145 nodes and 596 edges. Next, we used the plug-in MODE in Cytoscape to analyze the PPI network to identify the potential key subnetwork. Finally, we determined the first three important subnetworks. Subnetwork 1 included 21 nodes and 195 edges, subnetwork 2 included 13 nodes and 38 edges, and subnetwork 3 included 5 nodes and 10 edges. An overall analysis was conducted by integrating these three key subnetworks into a network (Fig. 3B). The GO and KEGG analyses showed that the subnetwork 1 genes were mainly enriched in ncRNA processing, preribosome, snoRNA binding, and ribosome biogenesis in eukaryotes. The subnetwork 2 genes were significantly enriched in DNA alkylation, cytoplasmic ribonucleoprotein granule, and catalytic activity acting on RNA. The subnetwork 3 genes were significantly enriched in RNA splicing via transesterification reactions with bulged adenosine as nucleophile, perikaryon, and the mRNA 3'-UTR AU-rich region binding.
Prognosis-related RBP Screening
We conducted a univariate Cox regression analysis of 177 differentially expressed RBPs and found that TDRD6, POP1, TDRD7, LUZP4, PPARGC1A, LIN28B, PABPC1L, LRRFIP2, ZC3H12C, RBM47, CELF4, and PNLDC1 were significantly related to survival (Fig. 4A).
Prognosis-related Genetic Risk Score Model Construction And Validation
To further determine the RBPs with the greatest potential prognosis ability, multivariate Cox regression analyses were used to identify seven RBPs: TDRD6, POP1, TDRD7, PPARGC1A, LIN28B, LRRFIP2, and PNLDC1. The seven RBPs were analyzed and used to construct a predictive model (Fig. 4B). The formula for calculating the risk-score of COAD patient is as follows:
Risk-score = (-1.9076 × ExpTDRD6) + (-0.5828 × ExpPOP1) + (-0.6433 × ExpTDRD7) + (-0.5888 × ExpPPARGC1A) + (1.2823 × ExpLIN28B) + (-0.8568 × ExpLRRFIP2) + (0.3925 × ExpPNLDC1)
In order to evaluate the predictive ability of the model, we randomly divided 446 samples of COAD patients with detailed survival information into two groups, namely the train group and the test group. At the same time, we divided the train group and the test group into high- risk and low-risk groups according to the median risk score for survival analysis. In the train and test groups, patients in the high-risk group exhibited a significantly lower overall survival rate than low-risk group (Fig. 4C, D). Then, we performed a time-dependent receiver operating characteristic (ROC) analysis to further evaluate the prognostic accuracy of this model. The AUC of the ROC curve for overall survival in the train and test groups was 0.715 and 0.681, respectively (Fig. 4E, F). The risk-score, survival status, and expression heatmap of patients in the train and test groups are shown in Fig. 5.
Independent Prognostic Analysis
In order to further explore the independent prognostic factors of COAD, we integrated the patient's age, gender, cancer stage, risk value, and other information, and conducted single-factor and multi-factor independent prognostic analyses on the train group and the test group, respectively. The results are shown in Fig. 6A, B. In the single-factor independent prognostic analysis of the train group and the test group, the P-value of age and cancer stage was less than 0.05, indicating that these two factors are significantly related to the survival of COAD patients. In the multi-factor independent prognostic analysis, the P-value of age and cancer stage was less than 0.05, indicating that these two factors can be used as independent prognostic factors for COAD patients.
In order to more accurately predict the survival rate of COAD patients, we constructed a nomogram using the expression levels and points of model genes (Fig. 7). By corresponding to the expression level and score of the Cox model genes, the total points of the patient can be calculated. Based on the total points, the patient’s predicted 1- year, 3- year, and 5-year survival rates are intuitively displayed.
Prognostic Model Gene Expression Validation
In order to further confirm the expression of key genes in COAD patients in the risk model, we used the Human Protein Atlas database to search for the expression of these seven genes. It was found that TDRD6, TDRD7, LIN28B, LRRFIP2, and PNLDC1 were significantly increased in COAD patients (Fig. 8). At present, the database does not contain data related to POP1 or PPARGC1A protein expression.