Identification of Potential Prognostic Biomarkers for Tongue Squamous Cell Carcinoma

Background : Tongue squamous cell carcinoma (TSCC) is one of the most common types of oral cancer and has a poor prognosis owing to a limited understanding of the pathogenesis mechanisms. The purpose of this study was to explore and identify potential biomarkers in TSCC by integrated bioinformatics analysis. Methods : The RNA sequencing data and clinical characteristics of TSCC patients were downloaded from The Cancer Genome Atlas (TCGA), and then differentially expressed RNAs (DERNAs), including differentially expressed long noncoding RNAs (DElncRNAs) and differentially expressed messenger RNAs (DEmRNAs), were identified in TSCC by bioinformatics analysis. Subsequently, Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Hallmark pathway analyses were used to analyze the DERNAs. Using univariate and multivariate Cox regression analyses, four-lncRNA and two-mRNA signatures were developed and used for predicting survival in TSCC patients. We established a risk model to predict the overall survival (OS) of TSCC patients based on the DERNAs with Kaplan–Meier analysis and the log-rank p test. Furthermore, weighted gene coexpression network analysis (WGCNA) was performed in Cytoscape, and a protein-protein interaction (PPI) in the Genes/Proteins Results : A total of 2,006 DEmRNAs and 1,001 DElncRNAs were found to be dysregulated in TSCC. A total of 417 DERNAs were used to construct the coexpression network, and the PPI network included 103 DEmRNAs. Univariate regression analysis of the DERNAs revealed 51 DElncRNAs and 90 DEmRNAs that were associated with OS in TSCC patients. Multivariate Cox regression analysis demonstrated that four of those lncRNAs (MGC32805, RP1-35C21.2, RP11-108K3.1, RP11-109M17.2) and two mRNAs (CA9, GTSF1L) had significant prognostic value, and their cumulative risk score indicated that these four-lncRNA and two-mRNA signatures independently predicted OS in TSCC patients. Conclusions : The current findings provide novel insights into the molecular mechanisms of TSCC and identify four lncRNAs and two mRNAs that are potential biomarkers that may be independent prognostic signatures for TSCC diagnosis and treatment. expressed lncRNA; expressed


Introduction
Tongue squamous cell carcinoma (TSCC) is one of the most common and lethal types of oral cancer [1] , and is characterized by poor prognosis, frequent lymphatic metastasis, and a high rate of regional recurrence. Currently, the most preferred treatment for TSCC is surgery combined with postoperative radiotherapy and chemotherapy. However, the 5-year overall survival (OS) rate of TSCC has not improved significantly over the past decades. It is pivotal to diagnose TSCC in the early stage. Therefore, elucidating the molecular mechanisms and exploring effective molecular therapeutic targets play essential roles in improving the survival rate of patients with TSCC.
Less than 2% of the human genome contains protein-coding genes, while the majority of transcripts are non-protein-coding RNAs [2] . Among them, long noncoding RNAs (lncRNAs) are RNA transcripts longer than 200 nucleotides and have no obvious protein-coding capacity. LncRNAs affect various aspects of biological processes in TSCC, including cell proliferation, tumorigenesis, survival, apoptosis, and migration [3] [4] [5] [6] . However, the underlying mechanisms have not been fully clarified. Moreover, the expression patterns of cancer-specific lncRNAs seems to be more tissue-and stagespecific than those of protein-coding genes, indicating the potential development of lncRNAs as promising alternative biomarkers and therapeutic targets [7] . To date, multiple lncRNAs have been identified in TSCC. For example, upregulation of the lncRNA FALEC suppresses malignant behaviors in TSCC [6] . Overexpression of the lncRNA HOTTIP in TSCC patients indicates a poor clinical prognosis [5] . Moreover, the lncRNAs KCNQ1OT1 and SNHG17 act as competing endogenous RNAs (ceRNAs) to control cell proliferation and cancer progression in TSCC [3,8] . In addition, recent studies revealed that lncRNA combined with mRNA biomarkers could improve diagnosis [9] [10] . However, there are limited valuable lncRNA and mRNA biomarkers used for the diagnosis of TSCC.
In this study, we aimed to analyze the differentially expressed profile of noncoding RNAs in TSCC and establish a Cox regression model to predict the OS of patients with TSCC. Moreover, we constructed a lncRNA-mRNA coexpression network and a protein-protein interaction (PPI) network. Our findings provide novel insights and may be conducive to understanding the molecular mechanisms. Further, verification of the lncRNAs and mRNAs could provide new therapeutic targets for TSCC.

Patient information collection and data preparation
Data associated with TSCC, including the RNA sequencing (RNA-Seq) data of transcriptome profiling and clinical data, were retrieved from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/) and were downloaded through the Genomic Data Commons (GDC) data transfer tool. In the predictive survival-related model, we excluded 27 samples because they lacked complete clinical data. Finally, 146 TSCC samples (119 samples for the predictive survival-related model) and 15 normal control samples were collected in our study.

Data processing and differentially expressed gene analysis
The EdgeR package in R (version 3.6.3) was used to identify the differentially expressed mRNAs (DEmRNAs) with the thresholds of |log2 (fold change [FC])|>2.0 and false discovery rate (FDR) (adjusted p-value) <0.05 [11] . Then, we used the annotation file in GTF format (Homo_sapiens.GRCh38.100.chr.gtf) to identify and annotate differentially expressed lncRNAs (DElncRNAs) with the thresholds of |log2FC|>2.0 and FDR<0.05. A heatmap and volcano plot were constructed by the gplots package in R software. A scatter diagram and paired plot were generated by the Wilcox test in R software.

Cell lines
Two human TSCC cell lines, CAL-27 and SCC-9 were purchased from the American Type Culture Collection (ATCC; Manassas, VA, USA), and the human normal oral keratinocyte (hNOK) cell line was obtained from the Institute of Biochemistry and Cell Biology of the Chinese Academy of Sciences (Shanghai, China). CAL-27 cells and hNOK cells were grown in DMEM (Gibco) supplemented with 10% FBS (Invitrogen, Carlsbad, CA, USA). SCC-9 cells were cultured in DMEM/F-12 (Gibco) supplemented with 10% FBS and 0.4 μg/ml hydrocortisone. All cells were maintained at 37°C in an incubator supplied with 5% CO2.

RNA extraction and quantitative real-time PCR analysis
Total RNA was extracted from cells using TRIzol reagent (Invitrogen, CA, USA) according to the manufacturer's instructions. Total RNA was reverse transcribed into cDNA using a PrimeScript RT reagent kit (TaKaRa). RT-qPCR analyses were performed using SYBR Green Master Mix (TaKaRa). This process was performed using an Applied Biosystems 7500 Real-Time PCR System. The results were normalized to the expression of glyceraldehyde-3-phosphate dehydrogenase (GAPDH).
The relative expression level was calculated by the 2-∆∆ Ct method. The primers were provided by the SunYa Company.

Functional enrichment analysis
Gene Ontology (GO) analysis included biological processes, cellular components and molecular functions. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis was used to annotate the potential functions. GO and KEGG functional analyses were conducted for the DEmRNAs. A significance level of p < 0.05 was set as the cutoff criteria, and the plots were constructed by the ggplot2 and clusterProfiler packages in R software [12] .

Gene set enrichment analysis (GSEA)
The presumptive biological roles of the DElncRNAs correspond to those of their associated mRNAs. The expression profile data were ranked according to the signature score, and the data were divided into a high-risk group and a low-risk group by the mean score. Then, GSEA (version 4.0.3) software was used for analysis.

Cox risk regression model construction and validation
The OS and hazard ratios (HRs) were calculated by the Kaplan-Meier algorithm and univariate Cox regression analysis, respectively. The log-rank method was used to test the differences between the survival curves. The univariate Cox proportional hazards regression method was applied to analyze the relationships between the DElncRNAs and OS when the significance level was set at 0.05 to determine those with prognostic value in TSCC. We used the following formula to establish a prognostic risk score model: Risk score = explncRNA1 * βlncRNA1 + explncRNA2 * βlncRNA2 + ··· explncRNAn * βlncRNAn; Risk score = expGene1 * βGene1 + expGene2 * βGene2 + ···expGenen * βGenen (where exp is the expression level of the prognostic DElncRNA or DEmRNA and β is the corresponding multivariate Cox regression model regression coefficient). The gene signature score as a prognostic predictor for TSCC patients was evaluated in the model.
We determined the significant variables through univariate Cox regression analysis.
Candidate variables with a p-value of < 0.015 for DElncRNAs and a p-value of < 0.01 for DEmRNAs in univariate analysis were included in the multivariable model. Setting the median risk score as the threshold, the TSCC patients were stratified into high-and low-risk groups. The survival ROC package in R was used to construct time-dependent receiver operating characteristic (ROC) curves within 5 years as the defining point and to calculate the risk prediction rate of specific lncRNAs and mRNAs between the two groups. Finally, univariate and multivariate analyses were implemented to evaluate the effects of other clinical variables of TSCC patients on the OS risk score. R software was used for all statistical analyses.

Weighted gene coexpression network analysis (WGCNA)
The WGCNA package in R was used to construct a coexpression network utilizing the expression values of the mRNAs and lncRNAs screened above. In brief, we constructed the weighted adjacency matrix using a power function based on a softthresholding parameter β. In this study, we chose a β value of 12. The DElncRNA and DEmRNA coexpression network was visualized in Cytoscape (Version 3.7.2) software.
Subsequently, module analysis [13] of the coexpression network was performed using the Molecular Complex Detection (MCODE) tool of Cytoscape software.

PPI analysis
The DEmRNAs with FDR < 0.001 and confidence score > 0.9 were included in the PPI network through the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database (https://string-db.org/), and the PPI network was visualized in Cytoscape. The subnetworks were analyzed by the MCODE tool, and GO and KEGG analyses of the modules was carried out using R software.

Statistical analysis
All statistical analyses were performed using R software. For the survival analysis, the differences between two groups were assessed by the log-rank p test, and for the paired plot analysis, the differences between two groups were assessed by the Wilcox test. P-values < 0.05 were considered statistically significant.

Identification of DElncRNAs and DEmRNAs
The RNA expression profiles and corresponding clinical data of 146 TSCC patients and 15 normal controls were downloaded from the TCGA database. |Log2FC|>2 and FDR < 0.05 were used as the thresholds. A total of 1,001 DElncRNAs (680 upregulated and 321 downregulated, Figure 1A) and 2,006 DEmRNAs (856 upregulated and 1,150 downregulated, Figure 1B) were identified.

Functional analysis of the DEmRNAs
GO and KEGG enrichment analyses were used to explore the potential functions of the DEmRNAs. The results indicated that these genes were mainly enriched in extracellular structure organization, cell differentiation and transporter activity ( Figure   2C). In addition, KEGG pathway analyses demonstrated that the most significant pathways were the starch and sucrose metabolism, glycine, serine and threonine metabolism and glycolysis/gluconeogenesis pathways ( Figure 2D).

First, univariate Cox regression analysis and Kaplan-Meier analysis (log-rank p test)
were used to identify lncRNAs associated with the OS of TSCC patients. With the significance level cutoff threshold set at p < 0.05 (Table 1) (Table 2), and all candidate lncRNAs involved in the multivariate Cox regression model were used to develop a lncRNA prognostic model.
We also adopted the same approach to verify the mRNAs associated with the OS of TSCC patients (Table 3), and a total of 90 mRNAs were associated with the OS of TSCC patients. As shown in Figure 4, many mRNAs, including ADH1B, ASCL4, C6, CA9, FAM65C, GNG8, GTSF1L, ODF4, OVOS2, PCK1, and SHOX2, were found to have significant prognostic value. Fifteen candidate DEmRNAs with p-value < 0.01 were all fitted into the multivariate Cox regression model, and the results showed that CA9, OVOS2, LRRTM3, C7, GTSF1L, PCK1, F2RL2 and ODF4 had significant prognostic value in TSCC (Table 4).
Risk score analysis of the lncRNAs was performed for each patient, and based on the risk scores, the patients were divided into the low-risk and high-risk groups. The survival rate of the low-risk patients was significantly higher than that of the high-risk patients ( Figure 5E). The heatmap in Figure 5F shows the relationships between the expression of 12 candidate lncRNAs and the risk status of each patient. The 5-year survival correlation of the 12-lncRNA signature was assessed by ROC analysis, and the area under the curve (AUC) was computed to assess the discriminatory capacity of the lncRNA signature ( Figure 5G). The AUC of the 12-lncRNA signature was 0.876, indicating its utility as a prognostic model for predicting the survival status of TSCC patients. Significantly, we found that FOXD2-AS1, MGC32805, PLS3-AS1, RP1-35C21.2, RP11-108K3.1, and RP11-109M17.2 were simultaneously identified to be associated with OS in Kaplan-Meier and multivariate Cox regression analyses.
Moreover, risk score analysis of the eight candidate mRNAs was performed for each patient ( Figure 5H) and a heatmap was constructed ( Figure 5I). Subsequently, the ROC curve was plotted, and the AUC was 0.84 ( Figure 5J). Likewise, CA9, GTSF1L, ODF4, OVOS2, and PCK1 were simultaneously verified to be associated with OS in Kaplan-Meier and multivariate Cox regression analyses.

Prognostic value of the four-lncRNA and two-mRNA signatures in TSCC
Univariate regression analysis was then used to predict the potential risk factors for the long-term all-cause death of patients, including age, sex, grade, stage, T stage, N stage, and the expression of each lncRNA or mRNA. Multivariate regression models were conducted to assess the prognostic power of each lncRNA and mRNA signature separately. Potential risk factors with p < 0.05 were included in the multivariate Cox regression analysis to screen independent risk factors for predicting long-term death.
Univariate analysis indicated that grade, stage, T stage, and N stage were significantly correlated with the OS of TSCC patients (p < 0.05) (Supplementary Table 1). As with the DElncRNAs, the DEmRNA results were obtained using the same approach (Supplementary Table 2). Of particular note, multivariate analysis found that the expression of MGC32805, RP1-35C21.2, RP11-108K3.1, RP11-109M17.2, CA9 and GTSF1L were significantly correlated with the OS of TSCC patients (p < 0.05) ( Figure   8).

GSEA of the four-lncRNA and two-mRNA signatures
The GSEA results showed that for patients with high signature scores, MGC32805 was mainly enriched in the GO terms of RNA binding and nuclease activity; RP1-

Weighted gene coexpression network construction
Moreover, a coexpression network was constructed successfully. The expression levels of 3,007 genes, including 2,006 mRNAs and 1,001 lncRNAs, were used for coexpression network construction by the WGCNA package. In the current study, to ensure a scale-free network, we selected β = 12 as the soft-thresholding power (Supplementary Figure 1). Then, the coexpression network was constructed by 417 DERNAs and visualized in Cytoscape. Subsequently, three subnetworks of the coexpression network were constructed by using the MCODE tool of Cytoscape ( Figure 10).

PPI network analysis
The PPI network contained a total of 619 proteins and 2,639 edges. With the thresholds of confidence score > 0.9, 103 genes were selected to construct three subnetworks with degree >=20 by using the MCODE tool of Cytoscape ( Figure 11).
According to GO term analysis, the three modules were related to extracellular matrix component, G protein-coupled receptor signaling pathway, coupled to cyclic nucleotide second messenger, actin binding, and posttranslational protein modification, showing that the genes in these modules play a pivotal role in cancer. Regarding the KEGG enrichment analysis, module 1 regulated metabolic pathways such as protein digestion and absorption, human papillomavirus infection, and the PI3K-Akt signaling pathway.
Module 2 was significantly associated with hypertrophic cardiomyopathy (HCM). Moreover, module 3 was significantly related to the chemokine signaling pathway, calcium signaling pathway, and cAMP signaling pathway, which are associated with the genesis and progression of tumors (Supplementary Tables 4, 5, and 6).

Validation of the biomarkers
In summary, all the above results show that four lncRNAs, including MGC32805, RP1-35C21.2, RP11-108K3.1, and RP11-109M17.2, and two mRNAs, including CA9 and GTSF1L, were simultaneously identified to be associated with the OS of TSCC patients. Thus, they could act as potential prognostic biomarkers in TSCC. The differential expression information of the biomarkers is shown in Supplementary Table   7. qRT-PCR was performed to verify the expression levels of the candidate biomarkers in TSCC cell lines. The expression of MGC32805, RP1-35C21.2, RP11-108K3.1, RP11-109M17.2, CA9, and GTSF1L were consistent with the sequencing results ( Figure 12). All primers information are listed in Supplementary Table 8.

Discussion
TSCC is a common lethal malignant cancer, and the lack of specific diagnostic and prognostic biomarkers contribute to the currently low 5-year survival rate of TSCC patients. Therefore, it is crucial to explore the comprehensive regulatory mechanisms of TSCC initiation and progression and to identify potential biomarkers for the early diagnosis and prognosis of TSCC that are predictive of outcomes.
Emerging evidence indicates that most lncRNAs play essential roles in TSCC progression, with diverse underlying mechanisms. In this study, a total of 2,006 DEmRNAs and 1,001 DElncRNAs were identified. GO analysis demonstrated that the functions of the DERNAs were mainly associated with extracellular structure organization, cell differentiation and transporter activity, which play vital roles in tumorigenesis. In addition, KEGG pathway analyses revealed that the most significant pathways were the starch and sucrose metabolism, glycine, serine and threonine metabolism and glycolysis/gluconeogenesis pathways. All of these pathways are significantly associated with cancers. For example, the genetic activation of serine biosynthesis has been indicated to directly drive cancer proliferation; intriguingly, genetic and functional evidence supports the importance of the activity of serine and glycine biosynthesis as drivers of oncogenesis [14] . Furthermore, differentially expressed proteins showed significant enrichment for the glycolysis/gluconeogenesis pathways in breast cancer patients [15] .
With the goal of identifying lncRNAs and mRNAs significantly associated with OS, we established a Cox regression model to predict the OS of patients with TSCC.
Univariate regression analysis of the DElncRNAs identified 17 candidate lncRNAs associated with OS. Multivariate analysis showed that 12 of them were related to OS, and 10 of those lncRNAs were significantly associated with OS (p<0.05). A cumulative risk score of the 12 lncRNAs was calculated, which indicated that they may independently predict OS in TSCC patients. Then, six of these lncRNAs were identified to have significant prognostic value after Kaplan-Meier analysis was carried out. The same approach was also applied for the DEmRNAs. Univariate and multivariate regression analyses were then used to predict the potential risk factors for long-term all-cause death in TSCC patients. Four of those lncRNAs (MGC32805, RP1-35C21.2, RP11-108K3.1, and RP11-109M17.2) and two mRNAs (CA9 and GTSF1L) were simultaneously identified to be associated with OS in TSCC patients. Moreover, we also established a lncRNA-mRNA coexpression network and a PPI network using the information obtained from the TCGA database. Our findings could help improve the understanding of lncRNA-mRNA coexpression regulatory mechanisms and provide new insights for therapeutic targets in TSCC.
In the current study, among these four lncRNA biomarkers, a genome-wide association study (GWAS) showed that MGC32805 may be involved in paclitaxel in vitro drug response phenotypes, which are moderately associated with the time to epithelial ovarian cancer recurrence [16] . Another GWAS revealed that the SNP rs146949893 in RP1-35C21.2 may be related to height [17] . In addition, previous studies have shown that RP11-108K3.1 could serve as a potential independent biomarker for the prognostic prediction of colorectal cancer [18] . RP11-108K3.1 is one of the DElncRNAs in lung adenocarcinoma [19] . However, no study so far has reported any association of RP11-109M17.2 with cancer, which means that it may serve as a novel therapeutic target in TSCC. All of these studies confirm that the DElncRNAs screened in our research play critical roles in tumor pathogenesis. However, the underlying mechanisms remain unknown. In our study, the GO analysis of the four lncRNAs showed that they were mainly enriched in RNA binding and nuclease activity, regulator and regulation of lymphocyte apoptotic process, identical protein binding and positive regulation of cellular component biogenesis, and RNA splicing and mRNA splice site selection. All of these processes are significantly associated with cancers.
For the two-mRNA signature, research has shown that high CA9 expression is related to poor prognosis and tumor grade in TSCC [20] ; in addition, the GAA haplotype of 3 CA9 SNPs (rs2071676, rs3829078, and rs1048638) is correlated with a higher risk of oral cancer [21] . Natural selection shapes cancer genomes, and the study revealed that negative selection is a hallmark of cell essentiality and immune response in cancer.
Although there was no significant difference, GTSF1L may be related to negative selection [22] . In addition, GTSF1L could also serve as a biomarker of common reproductive impairments in males [23] . In our findings, the KEGG analysis of CA9 and GTSF1L showed that they were mainly enriched in glycolysis/gluconeogenesis, the B cell receptor signaling pathway, the JAK/STAT signaling pathway, natural killer cellmediated cytotoxicity, and the T cell receptor signaling pathway. Furthermore, the Hallmark analysis of CA9 and GTSF1L showed that they were mainly enriched in mTORC1 signaling, MYC targets V2, PI3K/AKT/MTOR signaling, IL2/STAT5 signaling, and IL6/JAK/STAT3 signaling. These pathways all play essential roles in tumor immunity.
In our study, the expression of MGC32805, RP1-35C21.2, RP11-108K3.1, RP11-109M17.2, CA9, and GTSF1L was then analyzed in TSCC and normal tissues, and their expression in the paired groups was also found to be significantly associated with poor clinical outcomes in TSCC patients. This is the first study to indicate the aberrant expression of MGC32805, RP1-35C21.2, RP11-108K3.1, and RP11-109M17.2 in TSCC and indicates a potential prognostic role of this 4-lncRNA signature in TSCC. In addition, our findings also support that CA9 and GTSF1L could be potential biomarkers in TSCC. The bioinformatics-based investigation of lncRNAs will contribute to future experimental research.

Conclusion
Taking all the above results together, we identified the four-lncRNA and two-mRNA signatures as potential prognostic biomarkers for TSCC patients by analyzing genomewide lncRNA and mRNA expression data from the TCGA database. Furthermore, we established a lncRNA-mRNA coexpression network that is beneficial to understanding the relationships between lncRNAs and mRNAs and provides effective strategies for further functional studies of lncRNAs. Moreover, construction of the PPI network provides novel insights for TSCC treatment, and the risk score model contributes to

FIGURE 5
The survival rate of the low-risk patients was significantly higher than that of the high-risk patients (E). The heatmap shows the relationships between the expression of 12 candidate DElncRNAs and the risk score of each patient (F). The 5- year survival correlation of the 12 DElncRNAs was assessed by ROC analysis, and AUCs were computed to assess the discriminatory capacity of the lncRNA signature (G). Similar to that for DElncRNAs, risk score analysis (H) of eight candidate DEmRNAs was performed for each patient and a heatmap was constructed (I).
Subsequently, the ROC curve was plotted, and the AUC was 0.84 (J).   CA9, and GTSF1L based on the risk score model.

FIGURE 10
Three subnetworks of weighted gene coexpression network analysis were constructed using the MCODE tool of Cytoscape.

FIGURE 11
A total of 103 genes were selected to construct the three PPI subnetworks with degree>=20 by using the MCODE tool.