Differential expression analysis of GEO merged dataset
We downloaded three datasets GSE13601, GSE34105 and GSE34106 containing tongue squamous cell carcinoma (TSCC) in GEO database. Each dataset contains tumor samples and normal tissue samples. We used SVA package to merge these datasets and obtain a combined dataset. To identify differences between TSCC samples and normal tongue samples, we performed principal component analysis (PCA) and found that the two groups could be separated in the combined dataset (Figure 1A). We used limma package (|log2 (FC)| ≥ 1.5, p < 0.05) in R to analyze the 178 samples and screened for DEGs. We identified 170 DEGs, including 104 up-regulated genes and 66 down-regulated gene which were shown in the volcano diagram, where the red on the right represents the up-regulated differentially expressed genes and the blue one represents the down-regulated differentially expressed genes (Figure 1B). In the heatmap, groups were made according to datasets and sample characteristics to explore the expression of different genes. The top annotation represents the different datasets. DEGs are marked on the right side of the figure (Figure 1C). In order to further explore the functions of the screened DEGs, we use R language for further functional enrichment analysis of these genes in KEGG database, and enriched some up-regulated pathways such as Cytokine-Cytokine receptor interaction, Viral protein interaction with cytokine and cytokine receptor and primary immunodeficiency (Figure 1D). Then GO analysis was performed on these genes, and the pathways enriched in up-regulated genes were mainly cartilage development, connective tissue development, skeletal system development and chondrocyte differentiation in the biological process (BP) module (Figure 1E). In cellular component (CC) module, the enrichment pathways were mainly collagen-containing extracellular matrix, basement membrane, apical part of cell and laminin complex. In Molecular Function (MF) module, mainly enriched in receptor ligand activity, signaling receptor activator activity and metalloendopeptidase activity. GSEA analysis was also performed on the first six up-regulated pathways, and the results were consistent with the previous analysis (Figure 2). The KEGG correspondence is in Supplementary Table2.
Differential expression analysis of TSCC in TCGA database
DEGs of TSCC vs normal samples in TCGA database was performed using DESeq package in R language (|log2 (FC)| ≥ 1.5, p < 0.05). 644 DEGs up-regulated and 945 down-regulated DEGs were found as shown in the volcano diagram (Figure 3A). We used Metascape website to analyze these up-regulated DEGs and found that the mainly enriched GO pathway was aminoglycoside antibiotic metabolic process,proximal/distal pattern formation and negative regulation of blood coagulation (Figure 3B). KEGG enrichment analysis of up-regulated DEGs showed that the up-regulated pathways were mainly Folate biosynthesis, Steroid hormone biosynthesis, Complement and coagulation cascades, Staphylococcus aureus infection and Cytokine-cytokine receptor interaction. The main down-regulated pathways were Salivary secretion, Calcium signaling pathway and Pancreatic secretion (Figure 3C).
Acquisition and analysis of hub genes
DEGs acquired in GEO were crossed with those in TCGA, and screened 5 common up-regulated genes (CCL20, SCG5, SPP1, FOLR3 and KRT75) and 15 common down-regulated genes (MAOB, HLFSELENBP1, CRISP3, TMPRSS2, TLX1, TGFBR3, APOD, PPP1R3C, SLITRK5, MFAP4, STATH, GSTM5, RNASE4, COX7A1, FHL1, SLC25A4, EFHD1, MYH11, MGP, SCGB2A1, LPIN1, ELF5, ADH1B, CKMT2) (Figure 4A). The protein-protein interaction (PPI) network analysis of these five up-regulated genes found that proteins expressed by other genes were all functionally related except FOLR3 (Figure 4B). Then, we conducted interaction network analysis of these genes individually (Figure 4C).
Further expression verification of these hub genes
We further verified the expression of up-regulated DEGs using GSE13601 dataset. This dataset includes tumor samples and their paired normal samples. After remove the individual samples in the database, we obtained 20 pairs of paired samples and then we compared the expressions of CCL20, SCG5, SPP1, FOLR3 and KRT75 genes in tumor tissues vs normal tissues. It was shown that the expression of these 5 genes was higher in TSCC than in normal tissue both in the whole sample and the most paired samples (Figure 5A). The same results can be obtained using TCGA dataset (Figure 5B).
To explore the relationship between the expression of these Hub genes and clinical stage,we used TCGA database to explore and found that the expression level of CCL20 was significantly increased in patients with stage III and IV compared with stage I (P < 0.05), the expression of SPP1 and SCG5 in stage IV was significantly higher than that in stage I patients (P < 0.05) (Figure 5C).
Multivariate COX regression of TSCC in TCGA database for each clinical feature and prognosis
To further explore the clinical data of TCGA, Multivariate Cox regression was performed using clinical data from TCGA database (Figure 6). We found that ajcc_pathologic_T > 2 vs ajcc_pathologic_T <= 2 the risk increased. Similarly, years of smoking also has an impact on the survival of patients, when the smoking time is more than 29 years, the risk of patients is significantly increased. Besides, we found that when TSCC stage is higher than grade III, age > 59, N stage > 1 and the number of cigarettes smoked per day > 2.5 increased the risk, but P value of multivariate Cox did not meet the statistical standards. Kaplan-meier analysis was performed on clinical data using SPSS software, and we found that patients with stage higher than III had a significantly worse prognosis than patients with stage lower than III. AJCC_pathologic_T had a poor prognosis when it was higher than T2. Similarly, AJCC_pathologic_N stage higher than N1 and smoking years higher than 29 years remained risk factors (Figure 7).
Hub genes analysis in HNCS
These up-regulated hub genes were further verified by GEPIA, and the results showed that the expression of these 5 up-regulated genes in Head and Neck squamous cell carcinoma (HCNC) was also higher than that in normal tissues (Figure 8). Kaplan – Meier Plotter was used to analyze the relationship between the expression of these genes and prognosis, and the results showed that high expression of these genes was strongly associated with poor prognosis in HCNC (P<0.05) (Figure 9).