Acquisition of the datasets and identification of the DEGs
Datasets were searched in GEO (http://www.ncbi.nlm.nih.gov/geo) and the three datasets with the largest sample size (GSE39366,GSE40774,GSE55550[10–12]) were sorted for further analysis. GEO2R is a web analysis tool for identifying DEGs across different experimental conditions. HPV 16-positive and HPV 16-negative samples (the HPV-inactive (DNA + RNA−) samples in GSE55550 were excluded to obtain the most significant results) were separated into two groups and analyzed with GEO2R. The Benjamini-Hochberg (false discovery rate, FDR) method was applied to adjust the P-values (adjusted P-value, adj.P-value). The probe identifiers were converted to gene symbols, and gene symbols with duplication or loss were deleted. The DEGs meeting the criteria of an adj.P-value of < 0.05 and a |log (FC) | of ≥ 1 were considered statistically significant DEGs. Metascape (https://metascape.org) is a web-based portal designed to provide a comprehensive gene list annotation and analysis resource . Three groups of DEGs were uploaded to Metascape for meta-analysis to explore the most highly enriched pathways and the overlapping genes of the three datasets were visualized.
Go Term And KEGG Pathway Enrichment Analyses
The GO database is the largest database worldwide that provides information on the functions of genes, including their biological processes (BPs), cellular components (CCs) and molecular functions (MFs) . KEGG is a web database for exploring advanced functions of biological systems . The Database for Annotation, Visualization and Integrated Discovery(DAVID, https://david.ncifcrf.gov) tool is an online tool comprising integrated biological knowledge bases and analytical tools. The overlapping DEGs were submitted to GO and KEGG pathway analyses by DAVID (version 6.8). To identify the direct interactions among the DEGs, “GOTERM_DIRECT” BP, CC and MF categories were selected, and a P-value of < 0.05 was considered statistically significant.
Construction of the PPI network and identification of hub genes
The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING, https://string-db.org/) database is widely used to predict PPI networks, as a biological database, it integrates PPI information from publicly available sources . To examine the functional connections among the DEGs, STRING (version 11.0) was used to determine the direct interactions among the DEGs. An interaction score of > 0.4 was considered statistically significant. The PPI network was visualized by platform Cytoscape (version 3.8.0) . The Cytoscape plugin Molecular Complex Detection (MCODE)  was used to identify the top two modules in the PPI network with the following parameters: degree cutoff = 2, node density cutoff = 0.2, node score cutoff = 0.1, K-core = 2, and max.depth = 100. In addition, the cytoHubba plugin was used to calculate the degree of each protein node, and the top 11 genes were identified as hub genes. The expression of these hub genes in the three GSE datasets was visualized with R software (version 3.6.3).
Verification Of Hub Gene mRNA Expression Levels
Oncomine (https://www.oncomine.org/) is a cancer microarray database and integrated data mining platform . The mRNA expression levels of the 11 hub genes were verified in Oncomine, and only two HNSCC datasets (Zhai Cervix  and Slebos Head-Neck ) met the criteria of including the HPV infection status. Differential gene expression was visualized with the R software.
The Cancer Genome Atlas (TCGA, https://www.cancer.gov/tcga.) contains many patient samples across 33 cancer types, with clinical information as well as genome sequencing data. In addition, University of California Santa Cruz (UCSC) Xena (https://xenabrowser.net/) is an online database that could be used to explore correlations between genomic and phenotypic variables , and TCGA data are included. To verify mRNA expression, HNSCC samples and mRNA expression data of the hub genes were downloaded from TCGA by UCSC Xena. Data for 604 samples, including HNSCC and solid normal tissues were obtained. Samples without mRNA expression data were removed, and the differential expression of the key genes was analyzed using GraphPad Prism software (version 8.0.0). The Mann-Whitney U test was performed, and a two-tailed exact P-value of < 0.05 was considered statistically significant.
The Human Protein Atlas (HPA, http://www.proteinatlas.org,version 19.3) is the largest database mapping human proteins. Since HPV status data of HNSCC samples were not available in the HPA, immunohistochemical data for the key genes in HPV-positive and HPV-negative samples were not provided. Thus, immunohistochemical analysis of HNSCC and normal tissues (all normal tissues were oral mucosal tissues) was performed.
Survival Analysis Based On TCGA Database Data
Clinical information for the patients was downloaded from UCSC Xena, and the OS of patients stratified by the mRNA levels of the 11 key genes was analyzed. High expression or low expression was defined as an expression level of greater than or less than the median value, respectively. A log-rank (Mantel-Cox) test was performed, and a P-value of < 0.05 was considered to indicate a significant difference. For the comparison between the HPV-positive (+) and the HPV-negative (-) groups, the hazard ratio (HR) was calculated by the log-rank test as follows: HR = expression in the HPV (+) group /expression in the HPV (-) group. For comparisons between other groups, the HR was calculated as follows: HR = high expression/low expression. The disease-free interval (DFI), disease-specific survival (DSS) and progression-free interval (PFI) of patients stratified by the expression levels of genes with statistical significance in the OS analysis were analyzed.
GSEA Of AREG, CAV1, STAG3 And C19orf57
GSEA software (version 4.0.3, https://www.gsea-msigdb.org/gsea/downloads.jsp) was used to explore the association between gene expression and dysregulated pathways in cancer. The 73 samples from the TCGA database (HPV-negative) were divided into high and low expression groups based on the levels of AREG, CAV1, STAG3 and C19orf57, with the median value of each gene signifying the cutoff value. The predefined gene sets “c2.all.v7.1.symbols.gmt”, “c4.all.v7.1.symbols.gmt” and “c6.all.v7.1symbols.gmt” from the Molecular Signatures Database (MSigDB) were used for analyses. The normalized enrichment score (NES) was calculated, the cutoff for significance was defined as follows: a nominal p-value of < 0.05 and FDR q-value of < 0.25.
Identification Of Potential Small Molecule Drugs
The Connectivity Map (https://clue.io/cmap) is a resource that uses cellular responses to perturbations to find relationships between diseases, genes, and therapeutics. The CLUE platform (https://clue.io) was used to identify potential small molecules based on data from 9 tumor cell lines . The DEGs co-expressed in the three GEO datasets were divided into up- and downregulated genes and were then uploaded to run the query. The parameters were defined as follows: Gene expression (L1000) and Touchstone. After calculation, each small molecule was assigned an enrichment score ranging from − 100 to 100. A positive score indicates similarity between the signature of a given perturbagen and that of the query, while a negative score indicates a disparity between the two signatures. Scores of + 90 or higher and of -90 or lower were considered strong scores for developing hypotheses for further study. Here, the molecules with the top five positive and negative scores were identified for further study. The 2D structures of these molecules are available in PubChem (https://pubchem.ncbi.nlm.nih.gov/), a public database of small molecules and their biological properties .
Evaluation of the therapeutic value of small molecule drugs in vitro
HN4, HN6, HN30, SCC9, SCC25 and CAL27 cell lines were obtained from the Shanghai Key Laboratory of Stomatology; zonisamide, NVP-AUY922, PP-2 and fostamatinib were purchased from Absin Bioscience Inc. Zonisamide and PP-2 were dissolved in Dimethyl sulfoxide (DMSO) and diluted with medium with a gradient of 5 µM. Similarly, NVP-AUY922 and fostamatinib were diluted with a gradient of 5 nM. Approximately 3 × 103 cells per well were seeded in 96-well culture plates. After culture for 24 h, the medium was removed. The four drugs at different concentrations were added with the volume of 100 µl. For the negative control wells, drug-free medium was added. In addition, the background control wells were added only drug-free medium but without cells. After culture for 72 h, the medium was removed, 100 µl of fresh medium containing 10% Cell Counting Kit-8 (CCK8, Dojindo, Japan) solution was added. The medium in the control wells was replaced by fresh medium. After incubation for 3 h, the absorbance at 450 nm was measured in a multimode microplate reader (SpectraMax i3, Molecular Devices). The experiments were repeated in triplicate. Finally, the cell inhibition rate was calculated with the following formula: %cell inhibition rate = 100- (mean optical density (OD) of the treatment wells - mean OD of the background control wells)/(mean OD of the negative control wells – mean OD of the background control wells) × 100, and the half-maximal inhibitory concentrations (IC50) of the drugs were calculated with GraphPad Prism software (version 8.0.0).