A Global Genome-wide Scan with Optimal Cutoﬀ Mining for Emerging Biomarkers in Head and Neck Squamous Cell Carcinoma

Background: The survival analysis of the Cancer Genome Atlas (TCGA) dataset is a well-known method to discover the gene expression-based prognostic biomarkers of head and neck squamous cell carcinoma (HNSCC). In order to utilize a continuous gene expression for survival analysis, it is necessary to determine a cutoﬀ point by the dichotomization of the patients. There is some optimization software for cutoﬀ determination. However, those predetermined cutoﬀs by software usually set at the median, 1/4 quantile, or 3/4 quantile of RNA sequencing (RNA-Seq) value to ﬁnd a signiﬁcant P-value of the Kaplan-Meier curve. There are few clinicopathological features available on their pre-processed data sets. Methods: We developed a comprehensive workﬂow by R script, running on the Rstudio platform. It includes data retrieving and pre-processing, feature selection, cutoﬀ mining engine, Kaplan-Meier survival analysis, Cox proportional hazard modeling, and biomarker discovery. Results: Using this workﬂow on the TCGA HNSCC cohort, we scanned human protein-coding genes (20,500) programmatically. After adjustment with other confounders, we found that the clinical tumor stage and the surgical margin involvement are independent risk factors in patient survival. According to the resulting tables with Bonferroni adjusted P-value under optimal cutoﬀ as well as hazard ratio ( > = 1 . 5) , there were ten candidate biomarkers, named as DKK1, CAMK2N1, STC2, PGK1, SURF4, USP10, NDFIP1, FOXA2, STIP1, and DKC1, which are signiﬁcantly associated with the poor prognosis of overall survival (OS). At the same time, the other ten genes were over-expressed in the better survival patients (with hazard ratio < = 0 . 5 ), named as ZNF557, ZNF266, IL19, MYO1H, FCGBP, LOC148709, EVPLL, PNMA5, IQCN (previous name as KIAA1683), and NPB. Further validations are warranted. Conclusions: We suggested this analysis tool equipped with an optimal cutoﬀ ﬁnder will help with biomarker discovery of protein-coding genes, in terms of therapy. 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 29 31


Background
Head and neck squamous cell carcinoma (HNSCC), including oral, oropharyngeal, and hypopharyngeal origin, is the fourth leading cancer causes of death for males in Taiwan [1]. The age-standardized incidence rate of HNSCC in males is 42.43 per 100,000 persons [2]. The treatment strategies of HNSCC are surgery alone, systemic therapy with concurrent radiation therapy (systemic therapy/RT), or surgery with adjuvant systemic therapy/RT (according to National Comprehensive Cancer Network, NCCN Clinical Practice Guidelines in HNSCC, Version 2.2020) [3]. Despite the improvement in those interventions, the survival of HNSCC has improved only marginally over the past decade worldwide [4]. The critical advancement of targeted therapy and immuno-oncology should benefit from emerging prognostic biomarkers, which guide the development of modern systemic therapy.
Accumulative knowledge showed that some biomarkers have prognostic significance in HNSCC. For example, node-negative HNSCC patients with p53 overexpression were found to have lower survival [5]. Overexpression of hypoxia-inducible factor (HIF)-1 alpha [6] or Ki-67 [7] was found to be correlated with poor response to radiotherapy of HNSCC. The epidermal growth factor receptor (EGFR) [8] [9] and matrix metalloproteinase (MMP) [10] were found to be over-expressed to promote invasion and metastasis of HNSCC. From 2000 to 2006, the anti-EGFR antibody-drug (cetuximab) has been developed and combined with radiotherapy, known as bio-RT, to increase survival of unresectable locoregionally advanced disease [11]. The systemic therapy of cetuximab plus platinum-fluorouracil chemotherapy (EXTREME regimen) improves overall survival when given as first-line treatment in patients with recurrent or metastatic HNSCC [12] [13]. It was approved by the US Food and Drug Administration (FDA) in 2008. In advance, the bio-RT could have proceeded with docetaxel, cisplatin, and 5-fluorouracil (Tax-PF) induction chemotherapy to overcome the radio-resistance of HNSCC [14].
However, Rampias and his colleagues suggested Harvey rat sarcoma viral oncoprotein (HRAS) mutations could mediate cetuximab resistance in systemic therapy of HNSCC via the EGFR/rat sarcoma (RAS)/extracellular signal-regulated kinases (ERK) signaling pathway [15]. After that, the EGFR tyrosine kinase inhibitor (TKI) was introduced to help cetuximab in 2018. The anti-tumor activity was observed in a phase 1 trial for HNSCC patients using cetuximab and afatinib, a TKI of EGFR, human epidermal growth factor receptor (HER)2, and HER4 [16]. Other EGFR TKI, such as gefitinib, erlotinib, osimertinib, were also developed to treat advanced HN-SCC. Although 90% of HNSCC has overexpression of EGFR, cetuximab has only 10% to 20% response rate on those patients. So far, cetuximab is still the only drug of choice with proven efficacy, which targeted the selected HNSCC patients [17].
In our previous proteomic study in 2017, thymosin beta-4 X-linked (TMSB4X) was reported to be related to tumor growth and metastasis of HNSCC [24]. It was also found by the subsequent investigations that TMSB4X engaged in tumor aggressiveness through epithelial-mesenchymal-transition (EMT) on pancreatic [25], gastric [26], colorectal [27], lung [28], ovarian [29] and melanoma [30] cancers. Thus, it might be suggested that TMSB4X is possible for tumor type-agnostic therapy [31] as a common biomarker crossing several types of cancer.
In summary, identifying predictive biomarkers for selecting standard-of-care or advanced systemic therapy [32] in HNSCC is crucial. However, there are three challenges of biomarker discovery from survival analysis, so far. Firstly, although TCGA genomics data were harmonized, there is unclean data, including null expressed genes, which over 50% of the cohort, should be manually investigated and cleaned. Second, we need to find a way to determine candidates from the expression level of 20,500 human protein-coding genes [33]. Usually, the investigators should get the rationale or revelation of the genes of interest on a specific cancer type. They should upload those genes manually onto bioinformatics tools, such as SurvExpress (http://bioinformatica.mty.itesm.mx:8080/Biomatec/SurvivaX.jsp, which has been lost since Oct/2019 and currently out of funds), analyze with TCGA cohort. After downloading the survival results, they could curate plots and tables carefully. It is not possible to scan the whole human protein-coding genome in this way. Third, we need to find an optimal cutpoint of those RNA expression data to maximize candidate mining coverage. The above mentioned online tools might set a cutpoint at the median, 1/4 quantile, or 3/4 quantile for subsequent analyses. There are several visualization software or R packages which deal with cutoff determination, such as Prognoscan [34], Cutoff Finder [35], Findcut [36], Human protein atlas [37], OptimalCutpoints [32], cutpointr (available at https://github.com/thie1e/cutpointr), and cutoffR (available at https://cran.rproject.org/web/ packages/cutoffR). However, non of them could combine the scanning of the protein-coding genes and cutoff optimization programmatically.
In our approach, this article described a comprehensive workflow implemented in the R script, which runs on the Rstudio server. Its functions include data retrieving and pre-processing, feature selection, cutoff mining engine, Kaplan-Meier survival analysis, Cox proportional hazard modeling, and biomarker selection. Using this workflow on the TCGA HNSCC cohort, the 20,500 human protein-coding genes were scanned. The analysis workflow is shown in Figure 1.

Patient Cohort
The Cancer Genome Atlas (TCGA) profiled 528 HNSCC clinical and genomic data, which has been standardized and available on a unified data portal, Genomic Data Commons (GDC) of the the National Cancer Institute (NCI). GDC is available at https://portal.gdc.cancer.gov/projects/TCGA-HNSC. TCGA offers several computational tools to the public for facilitating cancer research. Broad Genome Data Analysis Center (GDAC) Firebrowse (firebrowse.org, version 1.1.35, 2016 09 27) is one of those tools to provide data access to each TCGA disease through a Representational State Transfer (REST) Application Programmable Interface (API). The 528 TCGA HNSCC patients' clinical information and RNA-Seq data were obtained from the Firebrowse RESTful API with an R package, FirebrowseR (available at https://github.com/mariodeng/FirebrowseR) [38]. We utilized FirebrowseR with our analysis workflow (see Figure 1, step 1) to receive standardized data frames while avoiding data re-formatting, often causing some errors.

RNA Sequencing Data
The number of protein-coding genes was suggested as 20,500 [33]. The GDC Data Portal provided TCGA data has been harmonized and re-aligned RNA sequencing data against an official reference genome build (Genome Reference Consortium Homo sapiens genome assembly 38, GRCh38). RNA-Seq expression level read counts produced by Illumina HiSeq have been normalized using the Fragments per kilobase per million reads mapped (FPKM) method, as described in reference [39]. The RNA-Seq preprocessor of Broad GDAC picked the RNA-Seq by Expectation-Maximization (RSEM) value from Illumina HiSeq/GA2 messenger RNA-Seq level 3 (v2) dataset of NCI GDC. It made the messenger RNA-Seq matrix with log2 transformed for the downstream analysis, as described in their reference [40]. We utilized FirebrowseR's function call, Samples.mRNASeq(cohort = "HNSC", gene=GeneName, format="csv"), to download each RNA-Seq data of all HNSCC patients and to save as 20,499 data frame files, named as "HN-SCC.mRNA.Exp.[GeneName].Fire.Rda". After careful investigation of the genomics dataset, the RNA-Seq values of "solute carrier family 35 member E2A (SLC35E2A)" and "solute carrier family 35 member E2B (SLC35E2B)" should be considered two distinct expression entities. We concluded that the number of protein-coding genes in the TCGA dataset is 20,500. We removed null expressed genes, which over 50% of the cohort, to avoid the useless result.

Cutoff Finder Core Engine
To evaluate the effect of gene expression on the patient's survival, we introduced the stratifying of patients with Kaplan-Meier survival analysis according to each gene's low/high expression. Our cutofFinder func subroutine employs the minimum Pvalue approach to recognizing cutoff points in continuous gene expression measurement for patients sub-population. First, patients were ordered by RNA-Seq value (RSEM) of a given gene. Next, patients were stratified at a serial cut (counted by person ranked in 30% to 70% percentile of the cohort; please see Figure 1 "HN-SCC cohort"). The survival risk differences of the two groups were estimated by log-rank test to yield around 165 Kaplan-Meier P-values for each gene. Then, the optimal cutoff of RNA-Seq, giving the minimum P-value, was selected by the cut-ofFinder func subroutine. This iteration method could calculate all possible cutoff of each gene expression in this cohort. At each run of cutofFinder func function call for an individual gene, it returned an optimal cutoff value (e.x. 0.027 for gene calcium/calmodulin dependent protein kinase II inhibitor 1, CAMK2N1). The optimal cutoff value and its correlated patient grouping size (e.x. low-expression in 262 persons vs. high-expression in 152 persons with gene CAMK2N1) would be returned to the main program to allow downstream Cox survival analysis. The percentile range we applied as 30% to 70% was used to avoid a small grouping effect [44] [34]. In case there was no significant P-value, a median expression of this gene was set as its cutpoint as usual.

Statistical Consideration
Our workflow has loops to call function survival marginSFP(GeneName) with given GeneName to process the survival analysis gene by gene. We dichotomized the clinicopathological features, which includes gender (male/female), age at diagnosed (below/beyond 65 year-old), clinical tumor size (T1-2/T3-4), clinical nodal status (negative/positive), clinical distant metastasis (negative/positive), TNM staging (early/late), surgical margin status (negative/positive) and tobacco exposure (low/high). The patients were grouped by an RNA-Seq value of each gene, cut at low-or high-expression on an optimal P-value obtained from the cutofFinder func subroutine (see the section of "Cutoff Finder Core Engine"). Pearson's chi-square test was used for these binary variables. Kaplan-Meier survival was analyzed using the log-rank test for two groups OS comparison. The Cox proportional hazards regression is the widely accepted approach for modeling survival while accounting for confounding factors [45]. Univariate and multivariate Cox proportional regression model [46], using the "coxph" function in R package "survival", was applied to calculate hazard ratio, 95% confidence interval (95% CI) and its significance, and to estimate the independent contributions of each clinicopathological features to the OS. Results were considered statistically significant when a two-sided P- value   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 < 0.05, or a lower threshold if indicated. The false discovery rate (FDR) (< 0.05) could be used to pick up the optimal P-value to ensure the control for type I error of the Kaplan-Meier survival test during the cutoff finding procedure. There were also multiple correlated tests of null hypotheses in the Kaplan-Meier family during our global scanning of protein-coding genes. The stringent Bonferroni correction could result in an adjusted P-value to ensure the control for type I error.
The resulting data, including Kaplan-Meier curves, cumulative P-value plots, and Cox regression tables, were exported to ".xlsx" and ".Rda" files (by R package "r2excel") for subsequent biomarker selection.

Biomarker Selection
Those genes with prognostic impact, whose hazard ratio >= 1.5 or <= 0.5 in both Cox's univariate/multivariate model, were assigned as provisional candidates. Bonferroni adjusted (Kaplan-Meier) P-value was used to make a ranking of candidates for the final decision (see Figure 1, step 2).

Results
The 9416 Kaplan-Meier plots with associated Cox's univariate and multivariate tables were generated at workflow step 1 (see Figure 1) and justified by the ranking of hazard ratios. By uncorrected P-value below 0.05, we selected 967 genes in which hazard ratio (HR) is greater than 1.5 or less than 0.5 (see Figure 2A univariate, and Figure 2B multivariate plots). At the final step, a Bonferroni P-value correction was used to yield the twenty candidate genes, under the stringent criteria (see Figure  2C, D). The ten candidates, including DKK1, CAMK2N1, STC2, PGK1, SURF4, USP10, NDFIP1, FOXA2, STIP1, and DKC1, have significantly associated with the poor prognosis of OS (see Table 1), while the other ten genes were over-expressed in the better survival patients, named as ZNF557, ZNF266, IL19, MYO1H, FCGBP, LOC148709, EVPLL, PNMA5, IQCN (previous name as KIAA1683), and NPB (see Table 3, with their full name of genes). We made a volcano plot for 9416 genes by Kaplan-Meier P-value (less than 0.05, obtained during cutoff finding procedure) against the Cox hazard ratio (see Figure 3). The plot revealed that the most significant (Bonferroni-adjusted P-value < 0.05) candidate genes are located above the dotted line. At the same time, Cox's HR separated them on the two-side with prognostic impact.
In our previous work, altered glucose metabolism (e.g., the Warburg effect [61]) promotes the progression of HNSCC, which is partially attributed to the solute carrier family 2 member A4 (SLC2A4) (or glucose transporters 4 (GLUT4)) and tripartite motif-containing 24 (TRIM24) pathway [62]. In this study, LOC148709 (a long non-coding RNA (lncRNA)) was suggested as a biomarker of HNSCC (see Table 3). It was also found to has a contribution to the Warburg effect on esophagus cancer [63].
Although we combined the power of genome-wide scanning and an optimal cutoff finder for Kaplan-Meier survival analysis, the study has some limitations. We are aware of the importance of direct assessment of protein products comprising the basic functional units in cancer cells' biological processes. The announcement of the Cancer Proteome Atlas (TCPA, http://tcpaportal.org) excited the cancer research community [65] [66]. By the utility of the reverse-phase protein arrays (RPPAs) or reverse-phase protein lysate microarray (RPMA), a microarray of "Western blots" in the TCPA could help to test our hypotheses. However, there are only 237 antibodies available so far. We found our 20 candidates are not included in the TCPA database (v3.0 [67]). At the same time, we found our candidates (DKK1, CAMK2N1, STC2, PGK1, SURF4, NDFIP1, STIP1, DKC1) are also on the list of unfavorable prognostic genes for HNSCC from the Human Protein Atlas (HPA) (available at https://www.proteinatlas.org/humanproteome/pathology/head+and+neck+cancer, accessed 28 September 2020). The ZNF557, ZNF266, and FCGBP are on the list of favorable prognostic genes as well. Our strategy still has the strength to explore the more possible biomarkers from RNA-Seq datasets in cancer research. In line with tumor-agnostic research, we plan to explore more TCGA diseases to find common biomarkers. However, the GDC provided standardized data frames that could not directly fit our workflow's scope. Before the global genes scanning process, we needed to re-format, transpose and, merge the 528 patients' clinical datasets and correlated 20,500 expressions of bio-specimen. It should be carefully curated to confirm the data integrity with the correct definition. We also plan to upgrade our plain R script at the Rstudio platform to program in the C++ language and source it in R. The high performance of C++ could speed up the cutoff finding engine in this workflow involving heavy computations [68]. Thus, it is possible to 1 2 3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 introduce the Rstudio Shiny app (https://shiny.rstudio.com) as a web application of the "pvalueTex" packaged with our workflow in the future.

Conclusion
Our findings suggested 20 candidate biomarkers, DKK1, CAMK2N1, STC2, PGK1, SURF4, USP10, NDFIP1, FOXA2, STIP1, DKC1, as well as ZNF557, ZNF266, IL19, MYO1H, FCGBP, LOC148709, EVPLL, PNMA5, IQCN (previous name as KIAA1683), and NPB, are all heavily associated with the prognosis of OS under optimal cutoff points with stringent Bonferroni P-values and proper confounders control. They also might be potential common biomarkers for subsequent study. We wish this bioinformatics tool will be available for the broad usage of tumor-agnostic research [69] to cross several TCGA diseases to make translational impacts.

Declarations
Ethics approval and consent to participate Not applicable.