Identifying KRT20 as Key Gene in Lymphatic Metastasis of Head and Neck Squamous Cell Carcinoma

(TCGA) to identify differentially expressed (DEGs) between and Random the and were to identify in including (Cytokeratins We analyzed survival and pan-cancer of KRT20 in the After overexpressing KRT20 in HNSCC cell lines, migration and invasion were investigated. We analyzed KRT20’s expression in HNSCC tissue microarrays, as well as the survival and clinicopathological features.


Results
In total, we identi ed 243 DEGs-143 upregulated genes and 100 downregulated genes. Further analysis revealed KRT20 as a potential key gene associated with lymphatic metastasis and overall survival rates among patients with HNSCC. Overexpression of KRT20 increased migration and invasion ability of Tu686 and FD cells. Our analysis of tissue microarrays revealed an overexpression of KRT20 among N1+ patients. Survival analysis results and HNSCC tissue microarrays' clinicopathological features were consistent with our analysis of the TCGA. Moreover, high level of KRT20 expression might suggest an adverse HNSCC prognosis. Additionally, our GSEA analysis revealed that KRT20 participates in many compounds' metabolic pathways-including pathways related to tumorigenesis and development.

Conclusion
We proposed that KRT20 may be a key gene in HNSCC with LM.

Introduction
Head and neck squamous cell carcinoma (HNSCC) was the seventh-most-common cancer worldwide in 2018 (with 890,000 new cases and 450,000 related deaths) [1]. Cervical lymphatic metastasis (LM) is among the leading causes of HNSCC's recurrence and fatality [2]. HNSCC has often been associated with lymphatic-rather than hematogenous-metastasis, and patients with LM require neck dissection and prophylactic neck radiotherapy [3]. Although multi-therapy has signi cantly improved the prognosis of patients with locally advanced HNSCC, the emergence of local LM and distant metastases remains a prognostic challenge. Previous studies have shown that the gradual metastasis of primary tumor cells necessarily destroys the association between adjacent tumors and stromal cells, the survival rate of blood and lymphatic vessels, regional and systemic circulation, and growth after implantation or deposition into regional or distant tissues [4]. However, the mechanism driving LM remains unclear.
Therefore, an exploration of LM's key genes and the potential biomarkers of HNSCC are urgently needed.
Bioinformatics analysis has become a powerful tool with which to identify potential key genes in cancers.
Machine learning, such as the Random Forest method, can improve the classi cation performance of transcriptome data screening. One of this method's primary merits is its feature selection. Random Forest has shown great potential to identify key therapeutic targets [5]. Meanwhile, The Cancer Genome Atlas (TCGA) is a landmark cancer genome project that has identi ed the molecular features of more than 20,000 primary cancers [6] [7]. The TCGA's creation-and the ubiquity of high-throughput sequencing data -means that genome data sets related to cancer research have become more and more widely shared. In the current study, we aimed to screen LM and non-LM HNSCC data for differentially expressed genes (DEGs). More importantly, we identi ed the hub genes that affect LM among patients with HNSCC using the Random Forest model and protein-protein interaction (PPI) network.

Sources of expression pro les and clinical data
We obtained a dataset containing the mRNA expression pro les of paired samples of HNSCC tissue. As of October 29, 2020, this dataset-which we acquired from the TCGA (https://portal.gdc.cancer.gov/)contained 298 HNSCC tissue samples including 135 N0 samples and 163 N1+ samples. All of this study's sequencing procedures were performed using the Illumina HiSeq-Counts platform. Moreover, all data used were obtained from the TCGA. Because the data's original deposition adhered to the TCGA's ethical guidelines, no additional ethical clearance was needed from our local research ethics committee.

DEG identi cation
We identi ed differentially expressed genes (DEGs) using the DESeq2 package [8]. DEGs were identi ed using the following criteria: an absolute log2fold change (FC) > 1.0 and p < 0.05.

Functional and pathway enrichment analysis
We performed gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses using the DAVID 6.8 tool (https://david.ncifcrf.gov/tools.jsp). This study's cutoff for signi cance was set to p < 0.05. PPI network construction and the Random Forest model Next, we used the Search Tool for the Retrieval of Interacting Genes (STRING) database to analyze DEGs' PPI network. Then, these results were visually represented using Cytoscape software. The cutoff criterion for the combined score was > 0.4. Subsequently, we used a CytoHubba plug-in in Cytoscape to further screen candidate genes [10]. At this point, a degree algorithm was applied, and the screening criterion was degree ≥ 6. We used a Random Forest model (ntree = 500) for DEGs with a Random Forest package. Next, we ranked DEGs by their MeanDecreaseGini scores and selected the top 30 genes. We then repeated this process three times to verify its accuracy.

Survival analysis of hub genes
We used GEPIA 2 (http://gepia2.cancer-pku.cn/#survival), a tool based on the TCGA [9], to analyze survival. The top 50% of samples with higher expression levels were classi ed as high expression, while the remaining 50% of samples were classi ed as low expression. Then, we used the Kaplan-Meier estimator with a log rank test to compare the prognostic differences between the two groups. The hazard ratios (HRs) are shown, along with their 95% con dence intervals and log rank p values.

The hub gene's expression in pan-cancer contexts
We used the TIMER tool (https://cistrome.shinyapps.io/timer/), which systematically analyzes expression and immune in ltration in more than 30 cancer types [11]. Speci cally, we used "Cancer Exploration" in this study.

Tissue microarrays' immuno uorescence
Referring to our previous articles [12][13], we conducted an immuno uorescence analysis of paired tissues from a total of 68 HNSCC patients from our cohort. We used cytokeratin 20 (ab76126, abcam, 1:200 dilution) for this analysis. We also conducted survival analysis of the hub gene in human tissue microarrays. Patients were divided into two groups, according to their median KRT20 expression. A Kaplan-Meier estimator with a log rank test was used to examine the prognostic difference between the two groups. All the participants signed a written informed consent form. The Clinical Research Ethics Committee of Fudan University's Eye & ENT Hospital approved the protocols (NO. KJ2008-01).

Statistical analysis of tissue microarrays' clinicopathological features
We analyzed and visually represented statistical data using GraphPad Prism 5.0 (GraphPad Software, La Jolla, CA, USA) and SPSS 22.0 software. An χ2 test was used to analyze the correlations between KRT20 expression levels and the clinicopathological features of patients with HNSCC.
Transwell Migration/Invasion Assays 2×10 4 cells were seeded in the upper chamber of the 8µm transwell inserts (BD Biosciences, Franklin Lakes, NJ, USA) for the transwell/migration assay with 100µl of serum-free medium. The lower chamber was lled with medium (50µl) containing 10% bovine serum albumin. After 48 hours of incubation, cells in the upper chamber were carefully removed. The cells that adhered to the membrane were xed in paraformaldehyde for 15 minutes before being stained with 1% crystal violet (KeyGEN Biotech, Nanjing City, China) for 20 minutes. In invasion assays, the upper chamber was coated with 50µl of Matrigel (BD Biosciences, Bedford, MD, USA) diluted 1:8 with serum-free medium before seeding 2×10 5 cells in 100µl of serum-free medium. The rest of the procedure followed the same pattern as the transwell migration in our published articles [12].
Gene set enrichment analysis of whole-genome dataset and KRT20 expression levels To further examine the potential mechanism underlying diverse KRT20 expression groups in HNSCC tumor tissues, we then performed a gene set enrichment analysis (GSEA). Results meeting the following criteria were regarded as statistically signi cant: nominal p < 0.05, normalized enrichment score > 1, and FDR < 0.3.

Differentially expressed gene identi cation and bioinformatics analysis
After data preprocessing and normalization using the DESeq package, we identi ed 243 DEGscomprising 143 upregulated genes and 100 downregulated genes-gures 1A and 1B show. Next, we conducted KEGG and GO pathway enrichment analyses using the clusterPro ler R packages. To examine their cellular components, we enriched the DEGs in corni ed envelopes, intermediate laments, and extracellular space. Additionally, to assess their molecular function, we enriched the DEGs with peptide crossing-linking, keratinocyte differentiation, and keratinization. KEGG pathway clustering revealed that the DEGs were mostly enriched by serotonergic synapse, tyrosine metabolism, and salivary secretion ( Figure 1C). The DEGs' top 30 GO terms are shown in Figure 1D. Our cluster analysis of biological processes suggested that the DEGs were signi cantly enriched by "structural molecule activity," "serinetype endopeptidase inhibitor activity," and "structural constituent of epidermis." PPI network construction and Random Forests analysis of differentially expressed genes Using STRING analysis, we input the DEGs into the PPI network, including 58 nodes and 128 sides ( Figure  1E). Then, we used cytoHubba-a plugin to rank nodes by their network capabilities. We also used the MCC (maximal clique centrality) method to identify the top 20 genes (Table 1). Our ndings may indicate that these 20 genes, including LOR, LCE3E and KRT2, play important roles in HNSCC. To further identify hub genes, we used Random Forest packages in R to identify hub genes. The top 30 hub genes are shown in Table 2.
Identifying KRT20 as the hub gene Next, we used Venn plots to select intersect genes. KRT20, KRT3, and ALOX12B were identi ed as key genes (Figure 2A). We then conducted survival analysis of KRT20, KRT3, and ALOX12B ( gures 2B-2D). For all HNSCC cases in the TCGA, the results of this analysis revealed that high expression levels of KRT20 were associated with poorer overall survival rates among patients with HNSCC-rather than high expression levels of KRT3 or ALOX12B. We also used TIMER 2 to explore KRT20 expression in various tumors. The results of this exploration revealed that KRT20 expression can vary between normal tissue and tumor tissue, based on the TCGA ( Figure 2E). Among patients with HNSCC, KRT20 was highly expressed in tumor tissue compared to normal tissue. Our analysis also revealed a similar tendency to highly express KRT20 across the LIHC, LUAD, LUSC, and PRAD samples. KRT20 expression is higher among HNSCC patients with N1+, indicating poorer overall survival rates Total 68 patients were selected from our database and were analyzed. KRT20 immuno uorescence staining showed higher KRT20 expression among patients with N1+ HNSCC compared to patients with N0 HNSCC (p = 0.0373) (Figures 3A and 3B). Additionally, our survival analysis showed that the overexpression of KRT20 and the N1+ stage were related to poorer overall survival rates among patients with HNSCC (p = 0.0203 and p = 0.0359, respectively) ( Figure 3C and 3D). Further analysis revealed that KRT20 expression levels were related to cervical lymph node metastasis ( Table 3).

Overexpression of KRT20 promotes migration and invasion in HNSCC cells.
Based on bioinformatic analysis of KRT20, we investigated the ability of KRT20 in HNSCC cell lines. KRT20 expressing lentivirus were successfully transfected into HNSCC cell lines (Figures 4A-B). We performed transwell migration/invasion assays in Tu686 and FD-LSC-1 cell lines. The results of the transwell migration/invasion assays showed that more cells migrated or invaded through the chamber in the KRT20 overexpression (KRT20 OE group) group vs. the vector group (KRT20 Vector group) ( Figures  4C-F).

KRT20 is related to important biological processes
To further explore KRT20's molecular mechanism in HNSCC, we used the GSEA method to analyze a whole-genome dataset of HNSCC tumor tissues across different KRT20 expression levels. This GSEA analysis revealed that drug metabolism cytochrome P450, glycerophospholipid metabolism, glutathione metabolism, retinol metabolism, lysine degradation, valine leucine and isoleucine degradation, and steroid hormone biosynthesis were signi cantly associated with high KRT20 expression levels among patients with HNSCC ( Figure 5).

Discussion
In this study, we screen a total of 243 DEGs, comprising 143 upregulated genes and 100 downregulated genes. After this screening with the PPI network and a Random Forest model, we selected KRT20 as a potentially key gene related to HNSCC's LM and prognosis. Our analysis of HNSCC tissue microarrays was consistent with the results of our bioinformatics analysis, both indicating that KRT20 is highly expressed in N1+ HNSCC patients and predicts a poor prognosis.
Cytokeratins are lament cytoskeleton proteins that are found in all cells. Keratins are the most prevalent intermediate lament category found in epithelial cells, and they are expressed differently depending on the cell type. Tono laments made of keratin laments braid the nucleus, span the cytoplasm, and adhere to the cytoplasmic plaques of epithelial cell-cell junctions (desmosomes) in epithelia [14]. Keratins are an integral part of the stability continuum from single cells through tissue formation, and they play a key role in the integrity and mechanical stability of both single epithelial cells and epithelial tissues. [15][16]. Several types of cytokeratins have been described in both normal urothelium and neoplastic urothelium [17][18]. KRT20, also known as keratin 20, is a low-molecular-weight cytokeratin encoded by the KRT20 gene, and it is located on chromosome 17q21.2. KRT20 is expressed differently in tumors. In the context of urothelial bladder cancer, strong evidence has suggested KRT20's diagnostic and prognostic value. Moreover, molecular techniques have demonstrated higher KRT20 expression levels in bladder cancer samples compared to non-neoplastic bladder tissue [19]. In 2010, Ye et al. found that KRT20's staining intensity in cancer tissues signi cantly correlated with bladder carcinoma grades, distant metastasis, and TNM grades [20]. Moreover, a previous study suggested similar results for colorectal tumors. KRT20 expression levels were signi cantly higher in colorectal tumors than in normal mucosa. Furthermore, KRT20 expression was associated with colorectal cancer recurrence and median survival rates among that study's patient population [21]. Yet, there are also several studies indicated different roles of KRT20 in tumor progression. Researchers found that tumors characterized by the complete absence of KRT20 expression were very poorly differentiated and contained high percentages of Ki67+ cells [22]. More research was required to learn about KRT20. KRT20's role in HNSCC remains unclear-especially in the LM context. Through genome-wide coexpression analysis, we found that KRT20 may participate in drug metabolism cytochrome P450, glycerophospholipid metabolism, glutathione metabolism, retinol metabolism, lysine degradation, valine leucine and isoleucine degradation, and steroid hormone biosynthesis. Cytochrome P450s' major function is the metabolism of foreign compounds. Previous studies have demonstrated cytochromes involvement in different tumors' development and drug resistance. Furthermore, CYP1A1, a member of the cytochromes family, is regarded as a typical lung cancer biomarker [23]. CYP1A1 is also thought to play a role in the pathophysiology of in ammatory illnesses by acting as an oxygenase in eicosanoid metabolism, according to recent research [24]. The relationship between in ammation and malignancies has been well-studied. Chemokines, prostaglandins, and cytokines, all of which have been demonstrated to inhibit CYP enzyme function, are components of cancer in ammation [25]. The proin ammatory cytokines IL-6, INF-, TNF-, and IL-1 are thought to be the most powerful mediators of cytochrome activity and expression reduction. Eicosanoids generated from cytochromes are affected by cancer-related in ammation [26].
Glycerophospholipid metabolism plays an important role in tumor cell survival, and it affects fundamental cellular processes-including signal transduction and gene expression. Highly proliferating cancer cells require the synthesis of new fatty acids to provide a constant glycerophospholipids supply.
New fatty acids are needed to continuously provide glycerophospholipids for membrane production. Therefore, unlike normal tissues, endogenously synthesized fatty acids are primarily esteri ed to phospholipids-rather than triacylglycerols-from phosphatidic acid in membrane production [27]. Glutathione (GSH) is the most common antioxidant in living creatures, and it has a variety of roles, the majority of which are related to maintaining the redox equilibrium of cells. Glutamyl transferase (GGT) is a membrane-bound enzyme that catalyzes the breakdown of extracellular GSH, allowing the creation of the constituent glutamate and cysteine required for intracellular GSH synthesis to take place. Under oxidative stress, GGT levels are considerably increased, particularly in highly metabolic cancer cells. A considerable glutathione route and dipeptide metabolites have been linked to stage IV clear-cell renal cell carcinoma, according to metabolomic analysis [28] [29].
The results of the current study indicated that KRT20 is highly expressed among patients with N1+ HNSCC. Moreover, KRT20 overexpression was found to relate to poorer overall survival rates among patients with HNSCC. These results were veri ed in our clinical samples database. The research also showed the overexpression of KRT20 promoted migration and invasion in HNSCC cell lines. Accordingly, we propose that KRT20 is a potential biomarker involved in the LM of HNSCC.
However, this study entailed several limitations. First, it was based on the TCGA. Although the results were veri ed in our clinical samples containing 68 HNSCC patients, the sample size was insu cient. Importantly, this study's results must be validated using external databases. Secondly, our study was veri ed in cell experiments but not in in vivo experiments. Besides, proposed molecular mechanism and its genotype-phenotype correlation were not investigated in corresponding in vitro and in vivo experiments. KRT20's molecular mechanism in the LM of HNSCC remains unclear, and further studies are needed to elucidate this mechanism.

Conclusion
Our study has shown that high KRT20 expression levels among patients with HNSCC are associated with both LM and unfavorable prognoses. KRT20 could also promote the ability of migration and invasion in HNSCC cells. Through genome-wide functional enrichment analysis, we found that KRT20 may be involved in the following biological processes and pathways in HNSCC: drug metabolism cytochrome P450, glycerophospholipid metabolism, glutathione metabolism, retinol metabolism, lysine degradation, valine leucine and isoleucine degradation, and steroid hormone biosynthesis. However, our results must be veri ed by future studies. Authors' contributions YFZ, QH, HLR and LZ conceptualized the study. YFZ and QH performed experiments. YFZ, QH, YJS and HCL performed bioinformatics data analysis. HYH, YG and CYH contributed to the enrolment of patients, collection and processing of clinical samples, and collection and analysis of clinical data. YFZ and QH wrote analysis scripts and drafted the manuscript. HLR and LZ revised the manuscript. HLR. and LZ managed funding. All authors reviewed the manuscript.      Gene set enrichment analysis was performed to further screen for the signi cant pathway between the different expression-level groups of KRT20 among patients with HNSCC. p < 0.05 was considered signi cant.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.