Construction and Validation of a Prognostic Model of Gastric Cancer with 10 RBPs Based on 6 Microarrays

To explore the potential connection between RNA-binding proteins (RBPs) and gastric cancer (GC), we downloaded GPL10558 and GPL6947 platform microarray data from the Gene Expression Omnibus (GEO) and ArrayExpress databases. We then integrated the datasets, identified the differentially expressed RBPs, and performed enrichment analysis on them to understand how they influence tumors. Univariate Cox, LASSO, and multivariate Cox regression analyses were used to screen independent prognostic parameters and construct a prognostic model. The area under the time-dependent receiver operating characteristic curve (AUC) and survival analysis were used to evaluate its prognostic ability. The GSE15459 and GSE62254 cohorts were used to validate the hub signature. Finally, we also verified the prognostic value and expression of the hub RBPs. Systematic analysis identified 23 up-regulated and 30 down-regulated RBPs, and enrichment analysis showed that they mainly act by binding to mRNA and affecting its modification and stability, thereby influencing the progression of GC. After multiple statistical analyses, we obtained a prognostic signature of 10 RBPs with reasonable predictive performance (AUC = 0.685). Through comprehensive bioinformatics analysis, we obtained 10 key gastric cancer RBPs as potential prognostic biomarkers, providing new perspectives for the treatment and prognosis of GC.

Introduction
transcriptional processes influences 7. RBPs show abundant mutations and overexpression in most cancers. For example, IGF2BP1, IFIT1B, PABPC1, TLR8, GAPDH, PIWIL4, RNPC3, and ZC3H12C are closely related to the prognosis of lung cancer 8, and high expression of APOBEC3C and EIF4E3 is associated with a better prognosis in breast cancer 9.
Multiple reports suggest that RBPs can affect the occurrence and development of GC. For example, GC patients with high HUR expression 10,11 or low TTP expression 12 have a worse prognosis.
QKI5 acts as a tumor suppressor through an H2A-type histone variant 13. In addition, LIN28 directly binds to the neuropilin-1 (NRP-1) 3'UTR, enhancing its mRNA stability and reducing sensitivity to platinum drugs 14. However, there is currently no systematic research on RBPs in GC. In this study, we systematically analyzed the expression of RBPs in GC by integrating Illumina HumanHT-12 microarray data and analyzed the mechanisms by which they influence tumor progression. We also constructed a prognostic signature of 10 RBPs and validated it on GSE15459 and GSE62254, showing the accuracy and generalizability of the signature. In addition, we constructed a nomogram based on the signature and verified the accuracy of its predictions. Finally, we verified the expression and prognostic value of the 10 RBPs, and the results were consistent with our analysis. In short, we provide a new approach to the prognosis and treatment of GC, which may help improve the quality of life of GC patients.

Results
Identification of differentially expressed RBPs (DERBPs) in GC patients.
In this study, we downloaded GC-related data from the GEO database and the raw microarray data from the ArrayExpress database. After data processing, a total of 827 tumor samples and 118 normal samples were obtained. Differential expression analysis yielded 1261 probes (1102 genes). Screening these for RBPs gave 61 probes (53 RBPs), of which 23 were up-regulated and 30 were down-regulated (Table S1). A heat map of the differentially expressed RBPs is shown in Fig. 1.

PPI network construction and DERBP enrichment analysis.
We constructed a PPI network from the differentially expressed RBPs; in the figure, red indicates up-regulated genes and blue indicates down-regulated genes (Fig. 2). In total, a network with 46 nodes and 183 edges was obtained (Fig. 2a). MCODE identified 2 key modules (Fig. 2b, c): one contains 10 nodes and 20 edges (k = 4.444), and the other contains 11 nodes and 20 edges (k = 4.000). We then performed GO and KEGG functional enrichment analysis on the RBPs of the key modules. No KEGG pathway met the significance threshold.
The GO enrichment analysis returned 39 terms (Table S2). The RBPs were enriched in RNA catabolism and its regulation, RNA transport, degradation and processing, nucleic acid phosphodiester bond hydrolysis, RNA binding, peptide metabolism, and negative regulation of cellular amide metabolism.
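The k values reported for the two MCODE modules match what one gets if k is taken to be the MCODE cluster score, that is, module density multiplied by the number of nodes, which simplifies to 2E / (N - 1). That reading is an assumption, since the text does not define k; the short Python check below only reproduces the arithmetic.

```python
def mcode_score(n_nodes: int, n_edges: int) -> float:
    """Module density times node count, equivalent to 2E / (N - 1).

    Assumption: this is the quantity reported as k in the text.
    """
    if n_nodes < 2:
        raise ValueError("a module needs at least two nodes")
    density = 2 * n_edges / (n_nodes * (n_nodes - 1))
    return density * n_nodes


# Node and edge counts taken from the two modules described in the text.
module_1 = round(mcode_score(10, 20), 3)
module_2 = round(mcode_score(11, 20), 3)
print(module_1, module_2)
```

Both modules have the same number of edges, so the smaller module is the denser one and receives the higher score.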
Screening of prognostic hub RBPs.
Prognosis-related RBPs risk score model construction and analysis.
Patients were divided into low-risk and high-risk groups according to the median risk score. The risk score plot shows each patient's survival time and risk score, and indicates that the low-risk group had a better survival rate (Fig. 4a, b). Survival analysis confirmed that a low risk score was associated with a better prognosis (P = 1.6631e-09) (Fig. 4c). The accuracy of the risk scoring system was evaluated by the AUC; the 5-year AUC was 0.685 (Fig. 4d), which indicates a certain predictive value. We then combined the risk score with the patients' clinical information. In univariate Cox analysis, which does not adjust for other factors, age (P = 0.002, HR 1.020), N stage (P < 0.001, HR 1.676), and T stage (P < 0.001, HR 1.740) were associated with survival. In multivariate Cox analysis, age (P < 0.001, HR 1.022), N stage (P < 0.001, HR 1.431), T stage (P < 0.001, HR 1.521), and risk score (P < 0.001, HR 1.696) were all independent prognostic factors (Fig. 4e, f). To validate the model, we applied the same formula and method to the GSE15459 and GSE62254 datasets. In GSE15459, the area under the ROC curve was 0.670 and the low-risk group had significantly better survival than the high-risk group (P = 1.896e-03). In GSE62254, the 5-year AUC was 0.645 and patients with low risk scores had better OS than those with high risk scores (P = 3.769e-04). Compared with other clinicopathological characteristics, the risk score of the model was an independent prognostic factor with good prognostic ability (Fig. 5, 6). These results indicate that the RBP-based prognostic model is reproducible and may be applicable to most clinical patients.
Bioinformatics analysis of different risk groups.
To characterize GC patients in the high- and low-risk groups, we first analyzed their clinicopathological characteristics. Most patients in the high-risk group were in N stage N2-N3 and T stage T4, and this group had more deaths. This indicates that GC in the high-risk group is more malignant than in the low-risk group (Fig. 7a). We further analyzed the relationship between each hub RBP and the clinical characteristics. The results suggest that CPEB3 is positively correlated with N stage, CDC20 and MYH11 are positively correlated with T stage, and ANXA1, AUH, and KIAA0101 are positively correlated with both (Fig. 7b). In addition, we used GSEA to analyze differences in the mechanisms of GC development between the high- and low-risk groups. TGF-beta, Angiogenesis, EMT, Wnt/β-catenin, Hypoxia, KRAS, Hedgehog, Myogenesis, Coagulation, Apical surface, Apical junction, UV response DN, and Apoptosis were significantly enriched. The enrichment of these pathways suggests that the risk grouping is closely related to the recurrence and distant metastasis of GC (Fig. 8).
Construction of a nomogram based on the 10 hub RBPs.
Based on the results of the multivariate Cox analysis, we used a nomogram to visualize the regression and present the 1- to 5-year OS of patients with GC more intuitively (Fig. 9a). We then used a calibration plot to assess the accuracy of the nomogram. The slope of the red line was almost 1 (Fig. 9b), showing that the predicted survival rate closely matched the actual survival rate and that the nomogram has good predictive power. The nomogram may therefore help clinicians make more accurate treatment decisions and improve patients' quality of life.
Validation of the prognostic value and expression of hub RBPs.
We used the Kaplan-Meier plotter to draw survival curves for the signature RBPs to further verify their prognostic value in GC. For IREB1 (an alias of ACO1), AUH, CPEB3, DAZAP1, and KIAA0101, survival in the high-expression group was significantly better than in the low-expression group (P < 0.05), consistent with our analysis. For ANXA1, CDC20, EEF1A2, ITGB1, and MYH11, survival in the low-expression group was better than in the high-expression group (Fig. 10a). We used GEPIA to verify the mRNA expression levels of the 10 RBPs: CDC20, DAZAP1, ITGB1, and KIAA0101 were significantly increased, whereas EEF1A2 and MYH11 were significantly reduced (Fig. 10b). Finally, we used the HPA database to verify the protein levels of the model RBPs in normal and tumor tissues. In tumor tissues, CDC20 and ITGB1 were highly expressed and ANXA1 and AUH were expressed at low levels; in normal tissues, MYH11 was expressed at a low level. This is consistent with our analysis (Fig. S1).

Discussion
As one of the top five malignant tumors in humans, GC still poses challenges for diagnosis and treatment. Surgery is recognized as the gold standard for resectable advanced GC 15. However, reliable methods for the follow-up of patients after surgery are lacking. According to previous research, RBPs mainly contribute to the occurrence of cancer by regulating the "mRNA life cycle", so abnormal expression of RBPs is usually closely related to the prognosis of cancer patients 16. In our study, we first obtained the differentially expressed RBPs between cancer and normal tissues through data processing and differential analysis. We then constructed a PPI network of these RBPs and systematically studied the relevant biological pathways. Next, we performed GSEA, K-M survival analysis, and ROC analysis to explore the potential biological functions and clinical value of the hub RBPs. Finally, we constructed a risk model to predict the prognosis of GC based on the 10-RBP gene signature.
We constructed a protein-protein interaction network for the differentially expressed RBPs and screened key modules and key RBPs. These RBPs may contribute to GC by regulating RNA and mRNA processes. GO enrichment analysis showed the involvement of Biological Process (BP) terms such as RNA catabolic process, type I interferon signaling pathway, and mRNA catabolic process, and Molecular Function (MF) terms such as mRNA 3'-UTR binding, translation initiation factor activity, poly(U) RNA binding, and single-stranded RNA binding. According to previous research, the regulation of RNA translation, processing, and catabolism is related to the occurrence of various diseases and plays a key role in the development of cancer 17,18. Type I interferon can activate immune cells, and feedback inhibition in cancer cells is also closely related to cancer 19. A growing number of studies have shown that type I interferon controls the autocrine or paracrine circuits that form the basis of cancer immune surveillance. A variety of chemotherapeutic drugs are fully effective only in the presence of intact type I IFN signaling. The type I interferon signaling pathway is therefore worthy of further study 20. Our results also support that RBPs are closely related to GC through the control of RNA and mRNA processes, such as mRNA 3'-UTR binding and poly(U) RNA binding. RBPs may also affect the biological processes of GC by regulating the type I interferon signaling pathway.
In view of the key role that RBPs play in solid tumors such as GC, we used a series of statistical methods to screen out 10 RBPs with independent prognostic value for GC and constructed a prognostic model. Five RBPs with HR > 1 are considered risk genes (ANXA1, CDC20, EEF1A2, ITGB1, MYH11), and the other five RBPs with HR < 1 are considered protective genes (ACO1, AUH, CPEB3, DAZAP1, KIAA0101). According to previous reports, these hub RBPs play an important role in solid tumors. Forced ANXA1 expression in GC cells leads to cell growth inhibition and also affects the development of GC by regulating the expression of COX-2 21,22. CDC20 plays a key role in the occurrence and development of tumors, and high expression of CDC20 is often accompanied by a later stage and a worse prognosis. EEF1A2 is rarely found in normal gastric tissues but is highly expressed in GC tissues 23,24. Zhou et al. have shown that ITGB1 is a sensitive predictor in advanced GC 25,26. In colorectal cancer, a frameshift mutation in the single-nucleotide repeat (C8) of the MYH11 gene is one of the important mechanisms, and the same mutation can also be found in GC 27. These genes were also identified in our prognostic model, and the reports that their high expression indicates a worse prognosis are consistent with our results. A growing body of research recognizes that the expression of many RBPs is altered in GC tissues. Moreover, studies have shown that ACO1 can interact with LINC00477, which inhibits the ACO1-mediated conversion of citrate to isocitrate and leads to the occurrence of GC 28. CPEB3 can inhibit epithelial-mesenchymal transition (EMT) by disrupting the crosstalk between colorectal cancer cells and tumor-associated macrophages via IL-6R/STAT3 signaling 29; however, its role in GC has not been studied in depth and more research is needed. Notably, the EMT pathway was also enriched in our study. KIAA0101 was first discovered in 2001 and is well known as p15PAF, a proliferating cell nuclear antigen (PCNA)-associated factor 30. One study found that patients with high expression of KIAA0101 had a significantly higher postoperative recurrence rate than other patients, and the main mechanism may be related to its effects at the mRNA and protein levels 31. However, in our study KIAA0101 was a protective gene, which requires further research to confirm. We then used the GSE15459 and GSE62254 datasets to validate the prognostic signature, combining the clinicopathological characteristics in the analysis. In both the training set and the test sets, the risk score was an independent predictor of GC prognosis. These results are consistent with our model, indicating that the prognostic model is accurate. Compared with other prognostic models, such as the ImmunoScore signature 32 and the long noncoding RNA (lncRNA) prognostic signature of GC 33, our study offers a new point of view.
We further analyzed the high-risk and low-risk groups by GSEA and found the following pathways to be enriched: TGF-beta signaling, Angiogenesis, Wnt/β-catenin signaling, Epithelial-mesenchymal transition, Hypoxia, KRAS signaling up, Hedgehog signaling, Coagulation, Myogenesis, Apical surface, Apical junction, UV response DN, and Apoptosis. TGF-beta is an important regulatory growth factor that mainly maintains tissue development and the homeostasis of the internal environment, so it is related to the onset of many diseases 34. Previous reports have shown that RBPs participate in these pathways and affect tumor progression. Among them, RHBDF2 promotes the lysis and high expression of TGF-β by regulating the TGF-β signaling pathway, and accelerates the invasion of GC cells into the extracellular matrix and lymphatic vessels, which ultimately increases the recurrence rate after surgery 35. Kyung Ho Pak et al. also confirmed that TGF-β1 can induce VEGF-C in GC to enhance tumor-induced lymphangiogenesis and ultimately promote the recurrence and metastasis of GC. It may therefore also be a potential target for the prevention and treatment of GC 36 [45][46][47][48]. In-depth studies of the Hh signaling pathway have found that high expression often indicates poor prognosis. Coagulation has also been studied in relation to the recurrence and metastasis of GC. Kentaro et al. retrospectively analyzed the D-dimer level of 448 patients with GC on the 7th day after surgery and found that a hypercoagulable state was associated with a higher recurrence rate and poorer survival, which may be related to the impact of surgical stress on the coagulation system, increasing the chance of micrometastasis 49. Furthermore, coagulation can promote platelet activation, which provides vascular endothelial growth factor (VEGF), transforming growth factor (TGF), platelet-derived growth factor (PDGF), etc. for tumor growth and promotes recurrence and metastasis 50. All of the above pathways were highly enriched in our study. The occurrence and development of tumors are usually the result of the interaction of multiple pathways. For example, in our study, the KRAS, TGF-beta, and Wnt/β-catenin signaling pathways are ultimately closely related to the EMT pathway.
The use of nomograms based on biomarker gene expression to predict GC recurrence has been reported 51. We therefore drew a nomogram to evaluate the 1- to 5-year survival rate and assessed it with a calibration curve. The results show that the nomogram has good predictive ability, which may help clinicians make precise decisions. Finally, we verified the 10 hub RBPs at the gene and protein levels, and the prognostic relationships obtained were consistent with our analysis. This further supports that the prognostic model is accurate and practical.
However, this study also has some shortcomings. First, the results were obtained through retrospective analysis and therefore lack a certain degree of sensitivity; prospective studies are needed to verify them. Second, we only analyzed data downloaded from the GEO and ArrayExpress databases, which introduces certain limitations and heterogeneity.
In the current study, we used bioinformatics methods to identify 10 RBPs that can be used to predict the prognosis of GC, which has potential value for clinical application.

Materials And Methods
Dataset processing and differentially expressed RBP (DERBP) screening.
We downloaded the raw data of the GSE26942, GSE29998, GSE38024, and GSE8443 microarray datasets from the GEO database (https://www.ncbi.nlm.nih.gov/gds/) 52 and the E-MTAB-1338 and E-MTAB-1440 microarray datasets from the ArrayExpress database (https://www.ebi.ac.uk/arrayexpress/). We used the "lumi" package 53 to process the raw data and obtain the expression level of each RBP, and the "sva" package 54 to remove batch effects between the microarray datasets, which were then normalized and merged into a single combined dataset. Differential expression in the combined dataset was analyzed with the "limma" package 55. Differentially expressed RBPs were screened with the filter |log2 fold change| > 0.585 and P < 0.05.
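The cutoff of 0.585 on the log2 scale corresponds to roughly a 1.5-fold change, since log2(1.5) is about 0.585. The study ran this step with limma in R; the Python sketch below only illustrates the filter logic, and the probe names and values in it are made up for illustration.

```python
import math

LOG2FC_CUTOFF = 0.585  # about log2(1.5), i.e. a 1.5-fold change
P_CUTOFF = 0.05


def classify(log2fc: float, p_value: float) -> str:
    """Return 'up', 'down', or 'ns' (not significant) for one probe."""
    if p_value >= P_CUTOFF or abs(log2fc) <= LOG2FC_CUTOFF:
        return "ns"
    return "up" if log2fc > 0 else "down"


# Hypothetical probes as (log2 fold change, P value); not data from the study.
examples = {
    "probe_A": (1.20, 0.001),   # large increase, significant
    "probe_B": (-0.90, 0.010),  # large decrease, significant
    "probe_C": (0.40, 0.001),   # significant but below the fold-change cutoff
    "probe_D": (2.00, 0.200),   # large change but not significant
}
calls = {name: classify(fc, p) for name, (fc, p) in examples.items()}
print(calls)
print(round(math.log2(1.5), 3))
```

A probe must pass both conditions to be called differentially expressed, so probes C and D are excluded for different reasons.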
Protein-protein interaction (PPI) network construction and enrichment analysis.
The DERBPs were submitted to STRING (http://www.string-db.org/) with a minimum required interaction score of 0.150 to identify protein-protein interaction information 56. Cytoscape 3.7.2 was used to construct and visualize the PPI network, and Molecular Complex Detection (MCODE) was used to select key modules and RBPs in the network. To study the role of these key RBPs in GC, we used the "clusterProfiler" package 57 to perform Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. Terms with P < 0.05 and adjusted P < 0.05 were considered significant.
Construction of a prognostic model based on RBPs.
We used univariate Cox regression analysis to screen RBPs related to overall survival, followed by Least Absolute Shrinkage and Selection Operator (LASSO) regression to reduce overfitting and further screen for candidate RBPs significantly related to prognosis. Based on these candidate RBPs, a proportional hazards prognostic model was constructed through multivariate Cox regression analysis, and a risk score was calculated to assess patient prognosis.
The risk score for each sample was calculated as follows: $\text{Risk score} = \beta_1 \cdot \mathrm{Exp}_1 + \beta_2 \cdot \mathrm{Exp}_2 + \beta_3 \cdot \mathrm{Exp}_3 + \dots + \beta_i \cdot \mathrm{Exp}_i$ (Exp is the expression level of each prognostic gene and β is its regression coefficient).
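For context, a risk score of this form is the linear predictor of a Cox proportional hazards model. The standard form of that model is sketched below; the baseline hazard $h_0(t)$ and the index $k$ are notation introduced here for illustration and do not appear in the original text.

```latex
\[
h(t \mid \mathrm{Exp}) = h_0(t)\,\exp\!\Big(\sum_{k=1}^{i} \beta_k \,\mathrm{Exp}_k\Big),
\qquad
\mathrm{HR}_k = e^{\beta_k}
\]
```

Under this model, a gene with $\beta_k > 0$ has HR > 1, so higher expression raises the risk score, and a gene with $\beta_k < 0$ has HR < 1, so higher expression lowers it.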
Patients were divided into low-risk and high-risk groups based on the median risk score. First, we plotted K-M curves for the high- and low-risk groups. We then used R to draw the receiver operating characteristic (ROC) curve and estimate the area under the curve (AUC); a higher AUC indicates better predictive performance. In addition, we used univariate and multivariate Cox regression to analyze the prognostic ability of the risk score and other clinical characteristics. To confirm the predictive capability of the model, patient samples with reliable prognostic information from the GSE15459 and GSE62254 datasets were used as validation cohorts.
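The scoring and grouping steps can be sketched in a few lines of Python. The gene names, coefficients, and expression values are hypothetical, and treating a score exactly equal to the median as low risk is an assumption, since the text does not say how ties are handled.

```python
from statistics import median


def risk_score(coefs: dict[str, float], expr: dict[str, float]) -> float:
    """Sum of coefficient times expression over the signature genes."""
    return sum(beta * expr[gene] for gene, beta in coefs.items())


def split_by_median(scores: dict[str, float]) -> dict[str, str]:
    """Label a patient 'high' if the score exceeds the cohort median, else 'low'."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical coefficients and expression values, for illustration only.
coefs = {"GENE_A": 0.5, "GENE_B": -0.25}
cohort = {
    "P1": {"GENE_A": 2.0, "GENE_B": 4.0},
    "P2": {"GENE_A": 4.0, "GENE_B": 2.0},
    "P3": {"GENE_A": 6.0, "GENE_B": 0.0},
    "P4": {"GENE_A": 0.0, "GENE_B": 8.0},
}
scores = {pid: risk_score(coefs, expr) for pid, expr in cohort.items()}
groups = split_by_median(scores)
print(scores)
print(groups)
```

To apply a trained signature to a validation cohort, the same coefficients would be reused and only the expression values would change.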
Gene Set Enrichment Analysis.
Gene Set Enrichment Analysis (GSEA) was introduced by Subramanian et al. in 2005. Compared with single-gene analysis, GSEA has clear advantages because it works at the level of gene sets. It consists of three key steps: calculation of an enrichment score, estimation of the significance level of the enrichment score, and adjustment for multiple hypothesis testing 58. In this study, we used GSEA 4.0 with the hallmark gene sets (v7.1) as the reference collection and 1000 permutations, and selected gene sets with normalized enrichment score (NES) > 1 and false discovery rate (FDR) < 0.001 to identify differences between the high- and low-risk groups.
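The first of the three steps, the enrichment score, can be illustrated with a simplified running-sum statistic. The sketch below uses the unweighted form, in which every gene-set member contributes equally; the GSEA software normally weights hits by their ranking metric, so this is not a reimplementation of the tool. Gene names are hypothetical.

```python
def enrichment_score(ranked_genes: list[str], gene_set: set[str]) -> float:
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up for genes in the set and down
    for genes outside it, and return the largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranking, most up-regulated in the high-risk group first.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})     # set concentrated at the top
es_bottom = enrichment_score(ranked, {"g5", "g6"})  # set concentrated at the bottom
print(es_top, es_bottom)
```

A positive score means the set is concentrated near the top of the ranking and a negative score means it is concentrated near the bottom. Significance and the normalized score come from the permutation step, which is not shown.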
Nomogram construction and evaluation.
To develop a quantitative prognostic approach, we constructed a nomogram to predict the impact of each gene on 1- to 5-year overall survival. Based on the multivariate Cox analysis, the point scales in the nomogram were used to assign values to individual variables. A horizontal line was used to determine the points for each variable, and the total points for each patient were calculated by adding up the points of all variables, with the point scale normalized from 0 to 100. Then, to compare the predicted and actual survival rates, we plotted predicted against observed outcomes in a calibration curve. The closer the predicted survival rate is to the actual survival rate, the better the predictive ability of the nomogram.
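One common convention for assigning nomogram points is to give 100 points to the full range of the variable with the largest effect and to scale every other variable relative to it. The sketch below follows that convention as an assumption, since the text does not state the exact scaling used; the variable names, coefficients, and ranges are hypothetical.

```python
def nomogram_points(
    coefs: dict[str, float],
    ranges: dict[str, tuple[float, float]],
    patient: dict[str, float],
) -> dict[str, float]:
    """Points per variable, scaled so the widest effect range spans 0 to 100."""
    spans = {v: abs(coefs[v]) * (hi - lo) for v, (lo, hi) in ranges.items()}
    widest = max(spans.values())
    points = {}
    for v, (lo, hi) in ranges.items():
        # The end of the axis with the lowest contribution scores 0 points.
        zero_end = lo if coefs[v] >= 0 else hi
        points[v] = 100.0 * abs(coefs[v]) * abs(patient[v] - zero_end) / widest
    return points


# Hypothetical model and patient, for illustration only.
coefs = {"t_stage": 0.5, "risk_score": 1.0}
ranges = {"t_stage": (1.0, 4.0), "risk_score": (0.0, 3.0)}
patient = {"t_stage": 3.0, "risk_score": 1.5}

points = nomogram_points(coefs, ranges, patient)
total_points = sum(points.values())
print(points, round(total_points, 2))
```

In a full nomogram, the total points would then be mapped to a predicted 1- to 5-year survival probability using the fitted baseline survival, which this sketch does not cover.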
Verification of expression level and prognostic significance.
The Kaplan-Meier plotter (http://kmplot.com) 59 is designed for the discovery and verification of survival biomarkers based on meta-analysis. We therefore used it to verify the prognostic value of the 10 RBPs in GC. GEPIA provides key interactive and customizable functions, including differential expression analysis, profiling plotting, correlation analysis, patient survival analysis, similar gene detection, and dimensionality reduction analysis. Using GEPIA (http://gepia.cancer-pku.cn/) 60

Enrichment pathways for high-risk groups.