Discovery of Potential Therapeutic Drugs for Oral Tongue Squamous Cell Carcinoma Based on Text Mining and Data Analysis


Background: Oral tongue squamous cell carcinoma (OTSCC) is the most common malignant tumor of the oral cavity. The aim of this study was to use text mining and data analysis to identify existing drugs that target OTSCC-associated genes and thereby explore potential therapeutic drugs for OTSCC.
Methods: We used the text mining tool pubmed2ensembl to extract genes associated with OTSCC, and two datasets (GSE30784 and GSE23558) from the Gene Expression Omnibus (GEO) were used for the data analysis. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were then performed on the intersection of the three gene sets. A protein-protein interaction (PPI) network was constructed with STRING, and gene module analysis was performed using Molecular Complex Detection (MCODE), a Cytoscape plug-in. Lastly, the genes of the most significant module were queried against a drug-gene interaction database to explore potential drugs for the treatment of OTSCC.
Results: Text mining yielded 403 unique genes associated with OTSCC. Data analysis identified 1637 and 1159 differentially expressed genes (DEGs) in GSE23558 and GSE30784, respectively. The intersection of the three gene sets contained 28 genes, including 20 up-regulated and 8 down-regulated genes. The most significant module screened with MCODE contained 8 core genes associated with OTSCC. Eventually, 12 drugs were found to target 6 of these 8 genes.
Conclusions: In this study, PLAU, SERPINE1, MMP1, MMP3, MMP10, CXCL10, CXCL12, and SPP1 were identified as potentially key genes for the treatment of OTSCC. Furthermore, 12 drugs were identified as potential therapeutic agents for OTSCC treatment and management.


Introduction
Oral squamous cell carcinoma (OSCC) is a malignant tumor that poses a serious threat to human health. Oral tongue squamous cell carcinoma (OTSCC) is a malignant tumor that occurs in the anterior 2/3 of the tongue in the oral cavity and is the sixth most common cancer in the world (1,2). It is an aggressive head and neck malignancy characterized by local invasion and distant metastasis, with a high recurrence rate and no significant improvement in 5-year survival (3-5). The incidence of oral cancer varies globally (6,7). Typical risk factors are smoking and excessive alcohol consumption (8). Another risk factor is HPV (9). Cervical lymph node metastasis is an important prognostic factor (10). Early symptoms of OTSCC are mild, and patients are usually diagnosed at a later stage (11). Therefore, the prognosis is relatively poor. OTSCC is a complex disease. In the clinic, surgery and chemoradiotherapy are considered effective methods for the treatment of OTSCC, but research on drug therapy remains limited (12). There is therefore an urgent need for new approaches to the treatment and prevention of OTSCC.
In this study, we aimed to find existing drugs that could provide new ideas for the prevention and treatment of OTSCC. The development of new drugs requires physical experimentation and compound development, which are time-consuming and costly. Drug repositioning, by contrast, may reduce both the cost and the duration of drug development. Further research is needed to provide a theoretical basis for the development of new treatment modalities that maximize functional preservation and minimize recurrence and metastasis rates.
Data analysis and text mining are now applied in many areas, such as disease diagnosis, identification of potential key genes, and prediction of disease occurrence. Compared with other cancers, few bioinformatics studies have focused on OTSCC using text mining and data analysis. Following the repurposing paradigm, we used bioinformatics strategies in this study to search for existing drugs that could provide new ideas for the treatment of OTSCC. First, we obtained unique gene lists through text mining and data analysis and intersected them to obtain the common genes. Then, we performed protein-protein interaction (PPI) analysis on these genes and identified the most significant gene modules. Finally, drug candidates were obtained from drug-gene interaction analysis of the module genes. Figure 1 illustrates the workflow of this study.

Method And Materials
Text mining and data analysis
We used pubmed2ensembl (http://pubmed2ensemble.ls.manchester.ac.uk), an open website, to perform text mining (13). We first entered the keyword "tongue squamous cell carcinoma" into pubmed2ensembl and then removed all duplicates to obtain a unique gene set.
GSE23558 and GSE30784 were downloaded from the National Center for Biotechnology Information Gene Expression Omnibus (NCBI-GEO) database (http://www.ncbi.nlm.nih.gov/geo/) (14). The platforms used were GPL6480 and GPL570. GSE23558 contains 27 patients and 5 healthy controls. GSE30784 includes 229 samples. After the gene expression profiles were obtained, R software (run in RStudio) was used to analyze the original datasets. The gene probes were converted into the corresponding gene symbols. Probes that mapped to multiple genes and non-mRNA probes were removed. The limma package was used to identify the DEGs between patients and healthy control samples (15). The DEGs were selected for subsequent analysis using an adjusted p-value < 0.01 and |log2 fold change (FC)| ≥ 1.5 as cutoff criteria. A Venn diagram tool (http://www.molbiotools.com/listcompare.html) was used to draw the intersection of the three gene sets.
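The DEG cutoff described above can be sketched in a few lines. The study itself used limma in R; the Python snippet below only illustrates the thresholding step, and the record type `ProbeResult`, its field names, and the toy rows are hypothetical, not part of the original analysis.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One row of a differential-expression table (hypothetical field names)."""
    gene: str
    log2_fc: float
    adj_p: float


def select_degs(results, p_cutoff=0.01, lfc_cutoff=1.5):
    """Return (up, down) gene sets passing adjusted p < p_cutoff and |log2FC| >= lfc_cutoff."""
    up, down = set(), set()
    for r in results:
        if r.adj_p < p_cutoff and abs(r.log2_fc) >= lfc_cutoff:
            (up if r.log2_fc > 0 else down).add(r.gene)
    return up, down


# Made-up rows, for illustration only.
toy = [
    ProbeResult("GENE_A", 2.3, 0.0004),   # passes, up-regulated
    ProbeResult("GENE_B", -1.8, 0.002),   # passes, down-regulated
    ProbeResult("GENE_C", 1.2, 0.0001),   # fails the fold-change cutoff
    ProbeResult("GENE_D", 3.0, 0.03),     # fails the adjusted p-value cutoff
]
up_genes, down_genes = select_degs(toy)
```

Both conditions must hold at once, so a gene with a large fold change but a weak adjusted p-value (or the reverse) is excluded.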
Finally, we intersected the three gene sets and used the resulting common genes for further analysis.
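As a minimal sketch of this intersection step, assuming each gene set is available as a collection of gene symbols, the common genes are simply a three-way set intersection. The function name `common_genes` and the placeholder symbols are illustrative assumptions, not the study's data.

```python
def common_genes(text_mined, degs_first, degs_second):
    """Return the sorted gene symbols present in all three gene sets."""
    return sorted(set(text_mined) & set(degs_first) & set(degs_second))


# Placeholder symbols, for illustration only.
text_mined = {"GENE_A", "GENE_B", "GENE_C", "GENE_E"}
degs_gse23558 = {"GENE_A", "GENE_B", "GENE_D"}
degs_gse30784 = {"GENE_A", "GENE_B", "GENE_C"}

shared = common_genes(text_mined, degs_gse23558, degs_gse30784)
```

Here `shared` contains only the symbols found in the text-mining list and in both DEG lists.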
Functional and Signal Pathway enrichment Analysis
DAVID Bioinformatics Resources 6.8 (https://david.ncifcrf.gov/), an online resource, provides researchers with a comprehensive set of functional annotation tools to understand the biological significance behind a large number of genes (16). Gene Ontology (GO) analysis annotates genes and gene products from the molecular to the organism level in three categories: biological process (BP), cellular component (CC), and molecular function (MF) (17). The Kyoto Encyclopedia of Genes and Genomes (KEGG) is an open-access database for systematic analysis that links high-level functional information to biological systems (18), based on gene-chip and high-throughput experimental technologies. P < 0.05 was used as the cutoff for statistical significance.
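To make the P < 0.05 criterion concrete, the sketch below computes a plain hypergeometric over-representation p-value, the standard idea behind term enrichment. It is not DAVID's exact statistic (DAVID reports a modified Fisher exact p-value, the EASE score), and the background size and term size in the usage lines are assumed numbers chosen only for illustration.

```python
from math import comb


def enrichment_p(hits, list_size, term_size, background_size):
    """Upper-tail hypergeometric probability P(X >= hits).

    X counts how many genes of a list of `list_size`, drawn without
    replacement from `background_size` genes, fall in a term that
    annotates `term_size` of the background genes.
    """
    upper = min(list_size, term_size)
    total = comb(background_size, list_size)
    favourable = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / total


# Assumed numbers: a 28-gene list, a 20000-gene background, and a term
# annotating 300 background genes, 6 of which are in the list.
p_value = enrichment_p(6, 28, 300, 20000)
is_enriched = p_value < 0.05
```

With about 0.4 hits expected by chance under these assumed numbers, observing 6 gives a p-value well below the cutoff.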

Protein-Protein interaction Analysis and Gene Module Analysis
The STRING database (http://string-db.org) is a global open resource for predicting and analyzing protein-protein associations; its evidence channels include text mining, experiments, databases, co-expression, neighborhood, gene fusion, and co-occurrence (19). We submitted the 28 common genes to STRING to build the PPI network. The minimum required interaction score was set to medium confidence (0.400), disconnected nodes were hidden, and all other STRING parameters were left at their defaults. The Cytoscape plug-in MCODE and the STRING app were then used to screen the most significant gene modules (20,21). In MCODE, we set k-core = 5 and left the other parameters at their defaults.
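The two thresholds in this step (interaction score ≥ 0.400 and k-core) can be illustrated with a small pure-Python sketch. It shows only score filtering followed by k-core pruning; MCODE itself additionally weights vertices and grows modules from seed nodes, which is not reproduced here. The function name `k_core`, the edge format, and the toy network are assumptions for illustration.

```python
def k_core(edges, k, min_score=0.4):
    """Return the nodes of the k-core of the graph built from edges with score >= min_score.

    `edges` is an iterable of (node_a, node_b, score) tuples.
    """
    adjacency = {}
    for a, b, score in edges:
        if score >= min_score and a != b:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    # Repeatedly delete nodes with fewer than k neighbours until none remain.
    changed = True
    while changed:
        changed = False
        weak = [n for n, nbrs in adjacency.items() if len(nbrs) < k]
        for node in weak:
            for nbr in adjacency.pop(node):
                if nbr in adjacency:
                    adjacency[nbr].discard(node)
            changed = True
    return set(adjacency)


# Toy network: a 4-clique (A-D), a pendant node E, and a low-score edge to F.
toy_edges = [
    ("A", "B", 0.7), ("A", "C", 0.7), ("A", "D", 0.7),
    ("B", "C", 0.7), ("B", "D", 0.7), ("C", "D", 0.7),
    ("A", "E", 0.9),   # E has a single neighbour, so it is pruned
    ("B", "F", 0.2),   # below the 0.4 score cutoff, so it is ignored
]
core_nodes = k_core(toy_edges, k=3)
```

A higher k keeps only densely interconnected nodes, which is why a strict k-core setting tends to return a small, tightly connected module.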

Drug-gene interaction and functional analysis of potential genes
The drug-gene interaction database (DGIdb: http://www.dgidb.org) is an open website that consolidates and organizes information on drug-gene interactions and gene druggability from papers, databases, and online resources (22). The genes of the most significant module were entered into DGIdb as potential targets to explore existing drugs or compounds, and these candidate genes were then subjected to enrichment analysis.
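Once interaction records have been retrieved, the lookup reduces to grouping drugs by target gene and noting which module genes have no known drug. The sketch below assumes the records are already available as (gene, drug) pairs; it does not call DGIdb, and the function name `drugs_by_gene`, the gene placeholders, and the drug placeholders are hypothetical.

```python
from collections import defaultdict


def drugs_by_gene(records, genes):
    """Group drugs by target gene and list the genes with no drug interaction.

    `records` is an iterable of (gene, drug) pairs; `genes` is the ordered
    list of module genes to look up.
    """
    table = defaultdict(set)
    wanted = set(genes)
    for gene, drug in records:
        if gene in wanted:
            table[gene].add(drug)
    targeted = {g: sorted(table[g]) for g in genes if g in table}
    untargeted = [g for g in genes if g not in table]
    return targeted, untargeted


# Placeholder records, for illustration only (not DGIdb output).
toy_records = [
    ("GENE_A", "drug_1"),
    ("GENE_A", "drug_2"),
    ("GENE_B", "drug_1"),
    ("GENE_X", "drug_9"),   # not a module gene, so it is ignored
]
targeted, untargeted = drugs_by_gene(toy_records, ["GENE_A", "GENE_B", "GENE_C"])
```

Reporting the untargeted genes explicitly makes it clear when the number of druggable genes is smaller than the number of module genes.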

Results
Text mining and microarray data analysis
GSE23558 and GSE30784 were analyzed separately to identify genes differentially expressed in oral squamous cell carcinoma. In GSE23558, 1637 DEGs (1094 down-regulated and 543 up-regulated) were identified between patients and healthy controls.
In GSE30784, 1159 DEGs (585 down-regulated and 574 up-regulated) were identified between cancer and healthy samples. Text mining yielded 403 unique genes related to tongue squamous cell carcinoma. A total of 28 common genes were obtained from the intersection of the three gene sets (Figure 1). Among these common genes, 20 were up-regulated and 8 were down-regulated (Table 1).

Functional and Signal Pathway enrichment Analysis
We input the list of 28 genes into DAVID for GO and KEGG enrichment analysis (Figure 2). Figure 2 shows the six most significantly enriched terms for the BP, CC, and MF categories and the KEGG signaling pathways of the common genes. In the BP category, the genes were mainly enriched in regulation of cellular component movement, extracellular matrix organization, and extracellular structure. In the CC category, they were enriched in extracellular region part, extracellular region, and cell surface. In the MF category, they were enriched in receptor binding, cell adhesion molecule binding, and serine-type endopeptidase activity. In the KEGG category, the enriched pathways were ECM-receptor interaction, focal adhesion, and leukocyte transendothelial migration.

Protein-Protein interaction Analysis and Gene Module Analysis
We submitted the 28 genes to STRING to build the PPI network, downloaded the results in TSV format, and imported the files into Cytoscape for analysis. The PPI network contained 23 nodes and 49 edges; the remaining 5 genes were not connected to the network (Figure 3A). MCODE and the STRING app in Cytoscape were used to screen the most significant gene modules. Module 1 was composed of 8 nodes and 24 edges (Figure 3B). This yielded eight hub genes, namely PLAU, SERPINE1, MMP1, CXCL10, SPP1, CXCL12, MMP3, and MMP10. These genes were subjected to drug-gene interaction analysis.

Drug-gene interaction and functional analysis of potential genes
We selected the 8 genes from the most significant gene module as potential targets for drug-gene interaction analysis. In total, 12 drugs were found to target 6 of the 8 genes (the exceptions were MMP10 and CXCL12), and these 12 drugs were then classified (Figure 4, Table 2). In addition, as shown in Table 3, enrichment analysis of the 8 genes using DAVID showed that the GO terms were mainly enriched in extracellular matrix disassembly (BP), extracellular space (CC), and serine-type endopeptidase activity (MF).

Discussion
OTSCC is a highly malignant tumor that harms human health, and its incidence has been rising in recent years. The tongue has an abundant blood supply and is prone to lymph node metastasis, which often leads to a poor prognosis. Currently, OTSCC is usually treated with a combination of surgery, radiation therapy, and chemotherapy. However, surgical treatment of OTSCC inevitably causes oral dysfunction. Drug therapy is therefore adopted to preserve the oral function of patients as much as possible and to improve their quality of life.
In our study, we identified 28 genes related to OTSCC through text mining and data analysis, with the purpose of discovering new indications for existing drugs. As expected, the PPI network and module analysis identified 8 core genes, namely PLAU, SERPINE1, MMP1, MMP3, MMP10, CXCL10, CXCL12, and SPP1, and 12 existing drugs targeting these genes might be used for the treatment of OTSCC.
These 8 core genes are involved in the NF-kappa B signaling pathway, Toll-like receptor signaling pathway, TNF signaling pathway, Complement and coagulation cascades and Rheumatoid arthritis.
Urokinase-type plasminogen activator (uPA), encoded by PLAU, is involved in the proteolysis of basement membrane and extracellular matrix structures and activates fibroblast growth factor, vascular endothelial growth factor, and transforming growth factor-β, all of which play an important role in tumorigenesis. Increased expression of uPA has been observed in tumors of the colorectum, stomach, and oral cavity. Marianna et al. reported that uPA is highly expressed in high-grade tumors and in those with the worst pattern of invasion, and that it is involved in the progression of OTSCC toward a poor prognosis (23).
Serpin family E member 1 (SERPINE1), a serine proteinase inhibitor (serpin), is the main inhibitor of tissue plasminogen activator (tPA) and urokinase (uPA). Defects in this gene lead to plasminogen activator inhibitor-1 deficiency (PAI-1 deficiency). SERPINE1 is upregulated in OTSCC and plays an oncogenic role (24). Dhanda J et al. found that the combination of SERPINE1 and alpha smooth muscle actin (SMA) immunohistochemistry offers potential as a prognostic biomarker in OSCC (25).
Matrix metalloproteinases (MMPs), a family of zinc-dependent endopeptidases that includes MMP1, MMP3, and MMP10, can degrade various proteins in the extracellular matrix (ECM). MMPs may therefore play an important role in the development of cancer. Matrix metallopeptidase 1 (MMP1) is the interstitial collagenase, which is activated in aggressive oral squamous cell carcinoma. Jordan et al. showed that MMP1 was overexpressed and that the mRNA level of MMP1 was significantly elevated in OSCC (26). Overexpression of MMP3 can lead to ECM degradation and appears to kill epithelial cells (27).
Chemokines and their receptors are involved in tumor development and metastasis. Chemokines also promote anti-tumor activity by guiding the mobilization and colonization of cells. CXCL10 binds the receptor CXCR3, which inhibits tumor growth. In different human cancers, high expression of CXCL10 at the tumor site is associated with a favorable prognosis (28-30). C-X-C motif chemokine ligand 12 (CXCL12), also known as stromal cell-derived factor-1 (SDF-1), belongs to the chemokine family and binds the chemokine receptors CXCR4 and CXCR7. The interaction between CXCL12 and CXCR4 regulates a variety of biological and pathological processes, including hematopoiesis and apoptosis, immune and mitotic activity, cancer cell growth, migration, and neovascularization (31). The CXCL12/CXCR4 axis is therefore considered a new drug target for the treatment of oral cancer.
Secreted phosphoprotein 1 (SPP1), also known as osteopontin, is a secreted, chemokine-like glycophosphoprotein involved in immune responses, cancer progression, and cell signaling (32). Feng et al. confirmed that SPP1 plays an important role in oral cancer invasion (33). Additionally, Liu et al. found that SPP1 was significantly associated with survival and that the survival rate of patients with high SPP1 expression was significantly reduced. They also demonstrated that the regulation of SPP1 expression influenced proliferation, migration, and invasion, and inhibited apoptosis in cell lines (34).
In total, we obtained 12 drugs from the drug-gene interaction database. These drugs are classified as antineoplastic agents, small molecules, thrombolytic agents, and proteins (Table 2). Although these existing drugs are likely to be of further help in the treatment of OTSCC, the lack of experimental validation is a limitation of this study, and further experimental studies are needed to verify the results.

Conclusions
Based on text mining (keyword: tongue squamous cell carcinoma) and microarray data analysis (datasets: GSE30784 and GSE23558), we found 12 existing FDA-approved drugs that target 6 of the 8 core genes, which are involved in the inflammatory pathway. These drugs might be used for OTSCC in addition to their original indications.

Tables
Table 1. There were a total of 28 intersection genes from oral tongue squamous cell carcinoma, including 20 up-regulated genes and 8 down-regulated genes.
Figure 1. Summary of research process design. Text mining was conducted using pubmed2ensembl to identify genes associated with OTSCC. DAVID was then used to analyze the gene ontology (GO) biological process terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways for the intersection genes. Gene enrichment was analyzed using STRING and MCODE. The drug-gene interaction database (DGIdb) was used to obtain the final list of enriched genes and their corresponding drugs.
The protein-protein interaction (PPI) network construction and significant gene module analysis. (A) The entire PPI network of 23 genes; (B) the significant gene module, including 8 genes.

Declarations