Comprehensive Analysis of the Functions, Prognostic and Diagnostic Values of RNA Binding Proteins in Head and Neck Squamous Cell Carcinoma


 Background: Increasing evidence has suggested that RNA binding protein (RBP) dysregulation plays an important part in tumorigenesis. Here, we sought to explore the potential molecular functions and clinical significance of RBP and develop diagnostic and prognostic signatures based on RBP in patients with head and neck squamous cell carcinoma (HNSCC). Methods: The Limma package was applied to identify the differently expressed RBPs between HNSCC and normal samples with |log2 fold change (FC)|≥1 and false discovery rate (FDR)<0.05. the immunohistochemistry images from the Human Protein Atlas database The diagnostic signature based on RBP was built by LASSO-logistic regression and random forest and the prognostic signature based on RBP was constructed by LASSO and stepwise Cox regression analysis in training cohort and validated in validation cohort. All these analyses were performed using the R software.Results: A total of 84 aberrantly expressed RBPs were obtained, comprising 41 up-regulated and 43 down-regulated RBPs. Seven RBP genes (CPEB3, PDCD4, ENDOU, PARP12, DNMT3B, IGF2BP1, EXO1) were identified as diagnostic related hub gene and were used to establish a diagnostic RBP signature risk score (DRBPS) model by the coefficients in LASSO-logistic regression analysis and shown high specificity and sensitivity in the training (area under the receiver operating characteristic curve [AUC] = 0.998), and in all validation cohorts (AUC > 0.95 for all). Similarly, seven RBP genes (MKRN3, ZC3H12D, EIF5A2, AFF3, SIDT1, RBM24 and NR0B1) were identified as prognosis associated hub genes by least absolute shrinkage and selection operator (LASSO) and stepwise multiple Cox regression analyses and were used to construct the prognostic model named as PRBPS. The area under the curve of the time-dependent receiver operator characteristic curve of the prognostic model were 0.664 at 3 years and 0.635 at 5 years in training cohort and 0.720, 0.777 in the validation cohort, showing a favorable predictive effificacy for prognosis in HNSCC.Conclusions: Our results demonstrate the values of consideration of RBP in the diagnosis and prognosis for HNSCC and provide a novel insights to understand potential role of dysregulated RBP in HNSCC.


Introduction
Head and neck cancer is one of the malignant diseases which is still one of the main causes of cancer deaths worldwide. It is mainly composed of tumors of the larynx, pharynx throat and oral cavity, and its main histology is squamous cell carcinoma [1]. More than 600,000 patients with head and neck squamous cell carcinoma are diagnosed each year, making it the sixth most common malignant tumor in the world [2,3]. Despite there has been great improvements in the methods of treatment including ablative surgery, chemotherapy and radiotherapy over the past few decades, the prognosis of patients with this highly malignant disease remains unsatisfactory as proved by unfavorable 5-year survival rates still remains around 50% [4,5]. Various clinicopathological parameters such as tumor size, cervical lymph node involvement and clinical stage (TNM) have been identi ed as the pivotal prognostic indicators [6].
However, due to the speci city, sensitivity and performance of prognostic prediction, these prognostic indicators are far from optimal [7,8]. Therefore, identify novel biomarkers with su cient performance and clinical feasibility for early diagnosis and prognostic prediction will be crucial for improving the therapeutic effect and quality of life in patients with HNSCC.
RNA binding protein (RBP) is a type of protein that binds to various types of RNA through one or more spherical RNA binding domains (RBD) and is widely expressed in cells [9]. To date, more than 1,54 proteins have been determined as RBPs based on the presence of characteristic RNA-binding domains or their residence in established ribonucleoprotein complexes by genome-wide screening in human genome, approximately 7.5% of all protein coding genes [10]. These RBP have a far-going of biological functions, including post-transcriptional regulation of mRNA stability, mRNA splicing, mRNA editing and translation, mRNA polyadenylation and localization, which will ultimately affect the expression of genes in the cell [9][10][11][12]. Given the importance of post-transcriptional regulation in life processes, it is not surprising that abnormal expression of RBP is closely associated with the occurrence and development of various human cancers. Undoubtedly, the potential application of RBP as either a therapeutic target or a diagnostic marker in cancer represents a fast-growing research eld.
In recent years, the fast-paced development in bioinformatics as well as the increasing availability of large-scale RNAsequencing (RNA-seq) transcriptome data of multiple cancers combined with clinical follow-up information makes exploring biomarkers for diagnosis and prognosis much more e cient and accurate, providing better approaches for the classifcation of patients and the implementation of personalized therapy, as well as uncovering the underlying mechanism of tumorigenesis [13]. Recently, genome-wide analysis has also shown that, compared to adjacent normal tissues, many RBPs are dysregulated in tumors, and their dysregulated expression were involved in carcinogenesis and related to the prognosis of patients with cancers [14][15][16][17] In this study, we collected all available RNA sequencing (RNA-seq) data for HNSCC from Gene the Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases and performed comprehensive analysis to explore the potential molecular functions and clinical signi cance of RBPs in HNSCC. We distinguished diverse differentially expressed RBPs related to HNSCC, which provide new understanding of the pathogenesis of HNSCC. Then we also applied several machine learning methods to recognize pivotal RBP predictors and develop diagnostic and prognostic RBP signatures, which could provide new insights into predicting and evaluating the clinical survival of patients with HNSCC.

Data collection and preprocessing
The original TCGA RNA-seq data were downloaded from TCGA database. And all relevant epidemiological, clinicopathological as well as follow-up data were obtained and collated. Moreover, six independent HNSCC datasets (GSE41613, GSE42743, GSE23036, GSE30784, GSE25099 and GSE12452) from GEO database were found and utilized. Of which, the follow-up data of GSE41613 and GSE42743 were also downloaded from GEO database.
Identi cation of differentially expressed RBPs and gene function analysis A total of 1542 RBPs were included in the analysis [10]. The Limma package was applied to identify the differently expressed RBPs between HNSCC and normal samples with |log2 fold change (FC)|≥1 and false discovery rate (FDR)<0.05. Moreover, the biological functions of these differently expressed RBPs were comprehensively detected by gene ontology (GO) enrichment analysis which including biological process (BP), cellular component (CC) and molecular function (MF). And the enrichment analyses were performed by clusterPro ler package in R with a P-value less than 0.05.

Protein-Protein Interaction (PPI) Network Construction, Module and Hub genes Screening
We obtained the protein-protein interaction information of differently expressed RBPs form the STRING database. Then we constructed the PPI network of RBPs by using the Cytoscape 3.6.1 and the keys modules were screened by using the MCODE (Molecular Complex Detection) with the MCODE score and nodes numbers more than 5. The cytoHubba was used to select hub genes in Cytoscape. The two side P<0.05 was were considered statistically signi cant. Moreover, the immunohistochemical images from the Human Protein Atlas (HPA) database were used to detect the expression of the hub genes at the translational level.
The construction of diagnostic model for HNSCC The TCGA HNSC dataset was de ned as the training cohort to construct the diagnostic model. The random forest analysis was performed by using the "randomForest" package and the top ten important varliables were screen based on the Gini values. Then LASSO-logistic analysis, as a statistical method, was carried out to identify the most important RBP genes, whose predictive accuracy could be improved signi cantly. The overlapping markers between these two methods were nally selected to build the diagnostic model which was built based on the coe cients of LASSO-logistic analysis. The GSE42743, GSE23036, GSE30784, GSE25099 and GSE12452 datasets were used to validate the result from TCGA dataset. In addition, the receiver operating characteristic (ROC) curve was plotted to evaluate the sensitivity and speci city by using the "ROCR" package in R software.
The construction of prognostic model for HNSCC patients All HNSCC patients from TCGA-HNSC dataset were de ned as the training cohort. The prognostic signature was constructed according to the following steps. Firstly, the univariate Cox regression was used to identify survival-related RBPs from the differently expressed RBPs with the P-value<0.05. Secondly, the LASSO regression was performed on these survival-related metabolic pathways and the non-zero coe cients' RBPs were ltered out for further analyses. Finally, the nal prognostic model was constructed by stepwise multivariate Cox regression analysis. The LASSO analysis was performed by "glmnet" package. Moreover, Datasets in patients from GSE41613 and GSE42743 were pooled and de ned as an independent, validation cohort to validate the result from TCGA dataset. And"Combat" in R package "sva" was used to remove batch effects. All these analyses were performed using the R software (version 3.6.3).

Statistical analyses
All HNSCC patients were divided into low-risk and high-risk groups based on the median value. The overall survival curve were plotted by Kaplan-Meier method and the difference were compared with the log-rank test. A time-dependent ROC curve with 1, 3, and 5 years as the de ning point was performed to evaluate the predictive value of prognostic risk score by using the "survivalROC" package. Univariate and multivariate Cox regression analyses were employed to determine prognostic factors associated with survival. The evaluation of statistically signi cant differences of DRBPS between HNSCC and normal tissue in several datasets were performed by one-way ANOVA analyses. All these analyses were performed using the R software (version 3.6.3). And all statistical tests were two sided and p values less than 0.05 were considered statistically signi cant.

Results
Patient cohorts and RNA-seq datasets 502 HNSCC samples and 44 normal samples with original RNA-seq data and clinical follow-up information from TCGA were retrieved and de ned as the training cohort. RNA-seq data and clinical information of 171 HNSCC samples (de ned as validation cohort for prognostic model) were obtained from GSE41613 and GSE42743. Moreover, a total of 392 HNSCC samples and 111 normal samples were obtained from GSE42743, GSE23036, GSE30784, GSE25099 and GSE12452 (de ned as validation cohorts for diagnostic model).

Identi cation of differently expressed RBP between HNSCC and normal samples from TCGA
The work ow of this study design was illustrated in Figure 1. The dataset of HNSCC were downloaded from TCGA contained 502 tumor samples and 44 normal tissue samples. The limma packages were applied to handle the data and discover the differently expressed RBPs. A total of 1542 RBPs were included in the analysis [10], and 84 RBPs met the screening standard of this study (P<0.05, |log2FC)| >1.0), which consist of 41 upregulated and 43 downregulated RBPs were shown in Supplementary Table  1. The expression distribution of these differently expressed RBPs was displayed in Figure 2.
Functional Enrichment Analysis of the Differentially Expressed RBP To explore the potential functional and molecular mechanisms of the identi ed RBPs, they were divided into two groups based on their expression level. Then we carried out GO analysis for these differentially expressed RBPs by R software. As showed in Figure 3, upregulated differentially expressed RBPs were signi cantly enriched in biological processes associated with defense response to virus, response to virus, RNA catabolic process and nucleic acid phosphodiester bond hydrolysis (Figure3A). The downregulated differentially expressed RBPs were notably enriched in the regulation of cellular amide metabolic process, regulation of translation, RNA splicing and regulation of mRNA metabolic process ( Figure 3B). The molecular function analysis showed that, among the differentially expressed RBPs, the upregulated RBPs were signi cantly enriched in RNA binding catalytic activity, acting on RNA, doublestranded RNA binding and RNA helicase activity ( Figure 3A), whereas the downregulated RBPs were signi cantly enriched in mRNA 3'-UTR binding, AU-rich element binding and translation factor activity, RNA binding ( Figure 3B). In regard to the cellular component, the upregulated RBPs and downregulated RBPs were mainly enriched in the cytoplasmic ribonucleoprotein granule and ribonucleoprotein granule ( Figure 3).

Protein-protein interaction (PPI) network construction and key modules selecting
In order to further study the potential molecular functions of abnormal expressed RNA binding proteins in HNSCC, we used Cytoscape software to create a PPI network, which merged 84 nodes and 158 edges based on the data from the STRING database ( Figure 4A). Then we used cytoHubba in Cytoscape to screen the hub genes by calculation and obtained the top 10 hub genes: IFIH1, DDX60, DDX58, EIF2AK2, OASL, OAS2, PARP12, IFIT3, IFIT1 and IFIT2. These hub genes were all unregulated in HNSCC samples.
In order to further determine these hub genes expression in HNSCC, we utilized immunohistochemistry results from the Human Protein Atlas database (HPA) to indicate that IFIH1, DDX60, DDX58, EIF2AK2, OASL, OAS2, PARP12, IFIT3, IFIT1 and IFIT2 were uncommonly increased in HNSCC compared with normal tissues.( Figure 5). Subsequently, we further analyzed the co-expression network by using the plugin MODE in Cytoscape to detect potential key modules, and identi ed the two most important modules. Module 1 consisted of 12 nodes and 64 edges ( Figure 4B), Module 2 contained 6 nodes and 14 edges ( Figure 4C). The GO analysis indicated that the genes from module 1 were predominantly enriched in defense response to virus and double-stranded RNA binding, whereas the genes in module 2 were signi cantly enriched in piRNA metabolic process, macromolecule methylation and cytoplasmic ribonucleoprotein granule (Supplementary Figure S1).

Diagnostic signature building
The TCGA HNSCC cohort was selected as the training cohort to construct the diagnostic signature and validated in several HNSCC cohorts from GEO databases. Twenty-four potential predictors in the TCGA HNSCC cohort were identi ed and were features with nonzero coe cients in the LASSO-logistic regression model. Then, the random-forest analysis ( Figure 6A The violin plot showed that the DRBPS value was markedly upregulated in HNSCC compared to normal tissues in all cohorts from TCGA and GEO databases ( Figure   6E). The receiver operating characteristic curve revealed the DRBPS has a high accuracy in distinguishing HNSCC patients from normal controls with the AUC was 0.998 in the training cohort ( Figure 6F). Furthermore, And the ROC curves also showed the excellent ability of DRBPS in differentiating between tumor and normal tissues in all validation cohorts with the AUC values were 0.958 in GSE42743, 0.994 in GSE23036, 0.993 in GSE30784, 0.997 in GSE25099 and 0.974 in GSE12452 cohorts (Figure 6 G-K).

Prognostic signature building
The TCGA HNSCC cohort was selected as the training cohort to construct the prognostic signature and validated in the validation cohort which combined with GSE41613 and GSE42743 datasets from GEO database. We identi ed 84 dysregulated RBP genes which differs between HNSCC and normal samples with P<0.05 and then found 14 survival-related RBP genes by univariate Cox regression analysis (P<0.05).  To further explore the prognostic value of seven prognostic RBPs in HNSCC, we divided patients into low or high expression subgroups based on the median values of gene expression. The Kaplan Meier-plotter was used to determine the relationship between seven RBP genes and OS. And the results demonstrated that most prognostic RBP genes were associated with OS in HNSCC patients (P<0.05) except SIDT1 and NR0B1 (Supplementary Figure S3).

Univariate and multivariate Cox regression analyses for PRBPS
To further delineate the prognostic value of PRBPS in HNSCC, we performed univariate and multivariate Cox regression using the TCGA HNSCC cohort. Our data revealed that the PRBPS was associated with overall survival in HNSCC patients (HR, 1.961; 95% CI: 1.489-2.584; P <0.0001). In addition, some wellestablished clinicopathological parameters such as margin status and tumor stage were also identi ed to be prognostic factors (Supplementary Figure S4A). Moreover, we performed multivariate Cox regression analyses to eliminate confounding factors and highlight the prognostic value of PRBPS. And PRBPS was identi ed as an independent predictive factor for HNSCC (HR, 2.019; 95% CI: 1.477-2.760; P <0.0001) (Supplementary Figure S4B).

Discussion
The poor long-term survival ability of HNSCC and the lack of valuable prognostic prediction tools highlights the urgent need to identify new biomarkers with the best predictive performance to facilitate the diagnosis, treatment and prognosis monitoring of cancer. In recent years, many studies have shown that the expression of RBPs is dysregulated in a variety of human cancers [20][21][22]. However, the expression patterns and roles of RBPs in HNSCC are poorly understood. Microarray or high-throughput sequencing technology provides an effective detection tool for understanding the changes in key gene expression during cancer development, and also provides promising biomarkers for cancer treatment, diagnosis and prognosis [23][24][25]. In this study, we summarized all HNSCC-related TCGA RNA sequencing data, and identi ed 84 differentially expressed RBP genes between cancer and normal tissues. We systematically studied the relevant biological pathways and constructed the PPI network of these RBPs. Moreover, we established a diagnostic prediction signature (DRBPS) based on ten RBPs genes and we also constructed a prognostic RBP signature (PRBPS) to predict HNSCC prognosis in the light of seven prognostic-related RBP genes. These ndings may be bene cial to the development of new biomarkers, thereby helping the diagnosis and prognosis prediction of HNSCC patients.
The enrichment analysis of functional pathways showed that differently expressed RBPs were greatly enriched in the adjustment of biological processes associated with mRNA metabolic process, RNA processing, RNA catabolic process, cellular amide metabolic process, regulation of translation, RNA splicing and regulation of mRNA metabolic process. Previous studies have proved that the regulatory processes of RNA processing, translation and RNA metabolism are associated with the occurrence and development of various human diseases [26]. The post-transcriptional regulation of RNA stability is an important biological process in the process of gene expression. RBPs can interact with RNA to form ribonucleoprotein complexes, thereby increasing the stability of target mRNA and promoting gene expression, which plays a key role in the development of various diseases [27]. For example, The RNAbinding protein NELFE is an oncogenic protein that may cause the imbalance of the HCC transcriptome by regulating MYC signaling [28]. RNA binding protein IGF2BP3 targets oncogenic transcripts and promotes proliferation of hematopoietic progenitor cells [15]. In addition, we created the protein-protein interaction network of these differently expressed RBPs and identi ed the two most important modules. Module 1 included 12 nodes and 64 edges, and module 2 consisted of 6 nodes and 14 edges. Then we screen the hub genes by computing degree, and obtained the top 10 hub genes including IFIH1, DDX60, DDX58, EIF2AK2, OASL, OAS2, PARP12, IFIT3, IFIT1 and IFIT2. These hub genes were all upregulated at transcriptional level and translation level in HNSCC evidenced by the results from TCGA and HPA database. Among these Hub RBP genes, most of them have been shown to play an important role in the occurrence and development of cancer. For example, RBP-related proteins IFIT1 and IFIT3 contribute to the metastasis of oral squamous cell carcinoma and promote the anti-tumor effect of ge tinib by enhancing the p-EGFR recycling process [29]. EIF2AK2 impedes the DNA damage response, and is closely related to low survival rate of AML [30].
As we all know, prognostic and diagnostic features might be able to clarify the likelihood of a certain disease and its relative progress, recurrence and death, which could conduce to patient classi cation, treatment management, and monitoring of disease status in clinical practice [31]. Therefore, the role of abnormally expressed genes or other markers in cancer tissues on diagnosis and prognosis has aroused great interest. For example, Yang et al established a diagnostic and prognostic immune cell in ltration risk score for the diagnosis of digestive system cancer [32]. Zhou et al, established a diagnostic immune cell in ltration risk score for the diagnosis of colon cancer [33]. In this study, we established a diagnostic RBP signature (DRBPS) risk score model based on the seven RBP genes, including CPEB3, PDCD4, ENDOU, PARP12, DNMT3B, IGF2BP1, EXO1. The receiver operating characteristic curve revealed the DRBPS has a superior performance in distinguishing HNSCC from normal controls in the training cohort and it also showed the excellent ability in differentiating between tumor and normal tissues in all validation cohorts with the high AUC values. The signi cant stepwise increase in the DRBPS value from a normal tissue to tumor tissue, as well as the high AUC value not only demonstrated that the DRBPS model could effectively identify patients with HNSCC from individuals with healthy controls but also suggested that the RBPs participates in HNSCC carcinogenesis.
Recent studies have shown that the combination of several biomarkers has better predictive performance than a single biomarker. For example, Hess J et al build a 5-microRNA signature that can predict the survival and disease control of HPV-negative patients with head and neck cancer [34]. Liu et al obtained 5-lncRNA signatures using high-throughput lncRNAs data in the TCGA database to predict the survival time of patients with head and neck squamous cell carcinoma [35]. In addition, Wei Li et al found that a 9-RBP gene markers can predict the prognostic value of LUSC [36]. In this study, we analyzed the RBP related genes to nd a more e cient and accurate features for prognosis-related RBPs, the prognostic RBP signature (PRBPS) were constructed by using LASSO regression analysis and stepwise multiple Cox regression analysis. We nally identi ed seven RBP-coding genes: MKRN3, ZC3H12D, EIF5A2, AFF3, SIDT1, RBM24 and NR0B1. High expression of ZC3H12D, SIDT1 and AFF3 was related to a good prognosis of HNSCC patients, whereas that of RBM24, EIF5A2, MKRN3 and NR0B1 were associated with poor prognosis. Previous research have con rmed that the dysregulated expression of SIDT1, AFF3, and NR0B1 were closely related to the occurrence and development of human tumors [37][38][39]. Consistently, abnormal overexpression of EIF5A2 was related to poor prognosis of patients with oral squamous cell carcinoma [40]. Our PRBPS robustly strati ed patients into subgroups and served as an independent factor affecting the survival and prognosis. This predictive performance of RBPS was validated in an independent cohort, thus reinforce its translational potentials. In addition, the RBPs-related gene showed key biological function, suggesting that they have a potential value for clinical adjuvant therapy. Together, these results revealed the prognostic value of PRBPS for HNSCC, and suggest that dysregulated RBP genes involved might facilitate the initiation and progression of HNSCC.
Nonetheless, we have to admit that our research has some unavoidable limitations. Firstly, the amount of data released in the publicly available data set is limited, so that the clinicopathological parameters analyzed in this study are not comprehensive, which might lead to potential error or bias. Secondly, our research was conducted on the basis of a retrospective analysis, and prospective research should be conducted to verify the results. Thirdly, the number of HNSCC sample is relatively small, this signature is still needed to be independently validated in more cohorts. However, data from multiple databases and our training validation queue design might rescue for these shortcomings.
In conclusion, our study demonstrates the values of consideration of RBPs in the diagnosis and prognosis for HNSCC. The proposed DRBPS and PRBPS models might provide much needed comprehensive clinical information for improving the personalized management of HNSCC patients. Figure 1 The procedures for analyzing RBPs in head neck squamous cell carcinoma (HNSCC).