Identication of a Novel Epithelial-Mesenchymal Transition Gene Signature Regulated by KEAP1-NRF2 Pathway in Esophageal Carcinoma

KEAP1-NRF2 pathway alterations were identied in many cancers including, esophageal cancer (ESCA). Identifying biomarkers that are associated with mutations in this pathway will aid in dening this cancer subset; and hence in supporting precision and personalized medicine. into and we Enrichment integrated with KEAP1-NRF2 we of cell


Background
Esophageal cancer is the sixth most common cause of cancer death and the eighth in incidence worldwide. In fact, it accounts for 4% of cancer diagnoses and for 6% of cancer deaths. The prognosis for esophageal carcinoma is poor, with a 5-year survival rate of 19% and only 0.9% for advanced esophageal carcinoma (1).
To maintain oxidative homeostasis, cancer cells increase the transcription of antioxidant genes by acquiring either stabilizing mutations in NFE2L2 (encoding NRF2, the master transcriptional regulator of the cellular antioxidant program) or by selecting for inactivating mutations in its negative regulator, KEAP1 (2). KEAP1 is a substrate receptor of the Cul3-RING ubiquitin ligase (CRL3) that, under physiological conditions, constitutively binds and targets NRF2 for degradation. In response to oxidative stress, the KEAP1-NRF2 binding is inhibited and, consequently, NRF2 is stabilized (3).
The TCGA network has revolutionized the cancer research by enriching the cancer research community with a huge amount of cancer-related data. This revolution has enabled researchers to identify cancer driver genes, cancer dependency, prognostic biomarkers and therapeutic targets. In addition, it has enabled researchers to segregate one cancer type into subgroups in order to assist personalized and precision medicine eld. Identi cation of genes and biological processes that are regulated by the KEAP1-NRF2 pathway in different cancers may provide an effective approach for therapy of subset of cancers that harbor KEAP1-NRF2 pathway alterations. Moreover, these gene signatures can be used to predict survival of patients.
In previous studies, we identi ed gene signatures that are regulated by the KEAP1-NRF2 pathway in lung adenocarcinoma (LUAD) and head and neck squamous cancer (HNSC) (4)(5)(6). In the current study, we used the genomics, and transcriptomics data of the ESCA cohort from TCGA to identify a gene signature regulated by the KEAP1-NRF2 pathway in ESCA.

Methods
Overall database selection TCGA RNA-Seq gene expression version 2 level 3 data (Illumina HiSeq platform) for 182 ESCA tissues were downloaded from the Broad GDAC (Global Data Assembly Centers) Firehose website (http://gdac.broadinstitute.org/). All the mutation data used in the present study was obtained from CVCDAP (https://omics.bjcancer.org/cvcdap/home.do) and UCSC Xena Browser (https://xenabrowser.net/) (7,8). Twelve percent (22 out of 185) of the TCGA-ESCA patients were found to harbor mutations in either KEAP1, NRF2 or both. Then, we segregated these patients into two groups, one had 22 patients with mutations in either KEAP1, NRF2 or both (mutated group) while the other had 160 patients with neither KEAP1 nor NRF2 mutations (wild-type group).

RNA-Seq data analysis
TCGA RNA-Seq gene expression version 2 level 3 data (Illumina Hiseq platform) for 182 ESCA tissues were subjected to differential gene expression analysis. Brie y, level 3 transcriptomic data was normalized by the FPKM (Fragments per Kilobase of transcript per Million mapped reads) method. All gene expression values were log-transformed to approximate the data to a normal distribution. In addition, genes with zero values in more than 25% of the patients were excluded. The differentially expressed genes (DEGs) were identi ed by applying the two-tailed t-test assuming unequal variance.
Then, P values were adjusted using the FDR method. DEGs with FDR 0.05 were considered signi cant.
Gene Set Enrichment Analysis (GSEA) GSEA (https://omics.bjcancer.org/) was performed to determine all signi cantly affected biological pathways. GSEA is a computational method that determines whether a de ned set of genes shows statistically signi cant, concordant differences between two biological states. The primary result of the GSEA is the enrichment score (ES), which re ects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. The gene list metric was built using ratio between classes (biological states). A positive ES indicates gene set enrichment at the top of the ranked list (up-regulated); a negative ES indicates gene set enrichment at the bottom of the ranked list (down-regulated).
Cancer Cell Line Encyclopedia (CCLE) data analysis RNA-seq gene expression data for 1019 cell lines were downloaded from CCLE https://portals.broadinstitute.org/ccle/data. We identi ed ESCA cell lines that harbor mutations in either KEAP1, NRF2 or both using the COSMIC data base https://cancer.sanger.ac.uk/cell_lines. Then, we divided the ESCA cell lines into two groups mutated and wild-type. Differential gene expression analysis was performed as described above.

Identi cation of NRF2 binding sites by in silico analysis
To identify the NRF2 binding sites within the promoter regions of the putative NRF2 regulated genes, we used the transcription factor-binding site nding tool ConTra V3 (9) with stringency core = 0.95 and similarity matrix = 0.85. The search was limited to the -1 kb upstream the transcription start site (TSS).

Survival analysis
For the identi cation of prognostic biomarkers, Kaplan-Meier curves were generated by using the webbased patients survival analysis tool SurvExpress (10). Log rank test P<0.05 was used as the cutoff for signi cance. The method of analysis has been discussed in a previous study (5).

Overview of the ESCA mutational landscape
The lack of prognostic biormarkers for the ESCA subgroup with KEAP1-NRF2 mutations motivated us to focus on identifying prognostic biormakers that is regulated by KEAP1-NRF2 in ESCA. First, we investigated the different driver mutations in the TCGA-ESCA patients and we found that TP53 was the most frequently mutated gene in ESCA as it was found mutated in 86% of patients (Fig.2B). 42% of ESCA patients harbored TTN mutations which made TTN in the second place. MUC16, SYNE1, CSMD3, FLAG, DNAH5, HMCN1, LRP1B, PCLO and RYR2 were mutated in 23%, 20%, 20%, 18%, 16%, 16%, 16%, 15% and 12% of ESCA patients, respectively. KEAP1-NRF2 was found mutated in 12% of ESCA patients.
The TCGA-ESCA cohort included 182 patient samples. We segregated the cohort into two groups: KEAP1-NRF2-mutated (22 patients) and wild-type (160 patients). In order to ensure that the differences between the two groups were due to KEAP1-NRF2 mutations, we rst investigated the driver mutations in the KEAP1-NRF2-mutated group. We found that TP53 was mutated in 95% of patients in this group, NRF2 was found mutated in 85% patients, TTN, KMTD2, MUC13, KEAP1, PATCH, and SACS were found mutated in 41%, 32%, 27%, 23%, 23% and 23%, respectively (Fig.3A). Then, we performed differential gene mutation analysis between the two groups to investigate the percentages of mutations of these driver genes in the two groups. Only NRF2 and KEAP1 weren't mutated in wild-type group while the other driver genes were found mutated in both the KEAP1-NRF2-mutated and wild-type groups, with similar percentages (Fig.3B). Therefore, none of these driver genes can be considered as variables that contribute to differences between the two groups.
In order to better understand the mutational landscape of KEAP1-NRF2 in ESCA, we used the USCS Xena browser to examine the types of mutations and their positions in the domain structure of KEAP1 and NRF2 proteins. As noted earlier, we found that 2.19% of TCGA-ESCA patient samples had KEAP1 mutations while NRF2 was mutated in 9.34% and both were mutated in 0.54%. All the detected KEAP1 mutations were missense mutations while 77.8% (14/18) of NRF2 mutation were missense mutations, 11.1% were intron (2/18), 5.6% (1/18) were in-frame-deletions and 5.6% (1/18) were in-frame-insertions. KEAP1 consists of 605 amino-acids, and 3 main domains with two mutations were detected in the BTB (broad-complex, tramtrack, and bric-a-brac) domain, three in the IVR (intervening region), and one in the Kelch domain, which is essential for the binding of NRF2 (Fig.3C). In the case of NRF2 structure, the majority of mutations (17) occurred in the crucial KEAP1-binding domain Neh2, and only one was found in the Neh1 domain.
Identi cation of genes regulated by the KEAP1-NRF2 pathway in ESCA In order to identify genes that are regulated by the KEAP1-NRF2 pathway in ESCA, we subjected the TCGA-ESCA data set (22 KEAP1-NRF2-mutated versus 160 wild-type tumor samples) to differential gene expression analysis. We identi ed 896 DEGs with log FC >|1| (p < 0.05 with FDR adjustment) (Fig.4A). Of these DEGs, 403 were up-regulated and 493 were down-regulated (Additional le1: Table S1). Since the ultimate effect of changes in the KEAP1-NRF2 pathway is increased activity of NRF2; and hence the overexpression of its target genes, it was not surprising that several bona de NRF2 target genes were among the up-regulated genes including, AKR1C1, AKR1C2, AKR1B10, GSTM2, UGT1A6, AKR1C3, G6PD, GCLC, GCLM, GSTM3, GPX2, ABCC1, OSGIN1, SRXN1, and TXNRD1. The gene expression pro les of 22 KEAP1-NRF2-mutated and 160 wild-type ESCA patient samples were visualized on a heatmap produced by unsupervised hierarchical clustering, and major differences between the gene expression patterns enabled cluster analysis to discriminate between sample types. Signi cant differences or trends between KEAP1-NRF2-mutated and wild-type ESCA patient samples were detectable for DEGs with log FC> |1| (Fig.4B). CES1, AKR1C1, ADH7, ALDH3A, and CYP4F11 were the top ve up-regulated genes in KEAP1-NRF2-mutated ESCA patient samples (Fig.4C), while PIGR, MUC13, TASPAN8, LGALS4, and OLFM4 were the top ve down-regulated genes (Fig.4D).
Epithelial-mesenchymal transition is regulated by the KEAP1-NRF2 pathway in ESCA The expression signatures of the hallmark gene sets, each containing 50 speci c gene sets, were derived by concentrating multiple gene sets from the Molecular Signatures Database to represent well-de ned biological statuses or courses. GSEA was performed to determine whether the identi ed gene sets showed statistically notable differences between the KEAP1-NRF2-mutated and their wild-type counterparts groups (Additional le 2: Table S2). Interestingly, four gene sets were up-regulated in the KEAP1-NRF2-mutated ESCA, namely, estrogen response late, hypoxia, reactive oxygen species pathway and EMT, the 4 gene sets were greatly enriched, with FDR < 0.05 (Fig.5). The gene set with the lowest FDR, namely, EMT (FDR = 0.001), which contained 194 genes was selected for further analysis. In order to speci cally identify EMT genes that is associated with the KEAP1-NRF2 pathway in ESCA, we integrated the DEGs between KEAP1-NRF2-mutated and wild-type ESCA patient samples (log FC> |1.5|, FDR < 0.05) with the set of 194 EMT-enriched genes (Fig.6A) using Venny 2.1 web-based tool (http://bioinfogp.cnb.csic.es/tools/venny/index.html). Intriguingly, we found 11 common genes: SPP1, PTHLH, WNT5A, COL11A1, COL7A1, GPC1, SNAI2, ADAM12, FBN2, PFN2, and IGFBP3 (Fig.6B).

Epithelial-mesenchymal transition genes were validated using ESCA cell lines
In order to validate these 11 EMT-related genes as potential NRF2 targets, we used CCLE to download RNA-seq mRNA expression data of 1019 cell lines. Then, using the COSMIC data base we identi ed human ESCA cell lines that harbors NRF2 and/or KEAP1 mutations. We selected three cell lines (TE6, TE11 and KYSE180) as the KEAP1-NRF2-mutated group. In addition, we selected another three human ESCA cell lines that have neither NRF2 nor KEAP1 mutations (TE5, TE9 and KYSE150) as the wild-type group. Then, we carried out differential gene expression analysis between the two groups. As shown in GSTM3, AKR1C1 and TXNRD1 (well-known NRF2 targets) showed signi cant up-regulation in the mutated group compared to the wild-type counterpart, which ensures KEAP1-NRF2 pathway alteration in the group (Fig.6C). Furthermore, we investigated the expression of these 11 EMT-related genes between the two groups. Interestingly, ve of the 11 genes (SPP1, WNT5A, PTHLH, PNF2, and GPC1) were signi cantly up-regulated in the mutated ESCA cell lines (Fig.6D). This nding suggests that these ve genes are potential NRF2 targets. For further evidence, we investigated the presence of the putative and known antioxidant responsive elements (AREs), the NRF2 binding site, (Fig.6E) in the promoter region of these ve genes by using ConTra V3 web tool. We performed insilico analysis within the -1 kb upstream the transcription start site (TSS) of the 5 genes. Interestingly, we identi ed highly conserved NRF2 binding sites (AREs) in the promoter regions of human PTHLH (positions: -71 and -916), WNT5A (position: -434), SPP1 (positions: -545,-605 and -870), PFN2 (positions:-737 and -864) and GPC1 (position: -458) (Fig.6F).

Evaluation of prognostic power of EMT gene-signature regulated by the KEAP1-NRF2 pathway in ESCA
In order to evaluate the prognostic power of these 5 genes (SPP1, WNT5A, PTHLH, PFN2, and GPC1) as an EMT-derived signature for KEAP1-NRF2 pathway alterations in ESCA, we rst analyzed overall survival in the TCGA-ESCA cohort using the SurvExpress database. A total of 184 patient samples were divided into high-risk (n = 127) and low-risk groups (n = 57) based on their expression patterns (Fig.7A). The separation of risk groups was optimized using the 'maximize risk group' option provided in the SurvExpress database. The survival probability estimates in the two risk groups were visualized as Kaplan-Meier plots. Strikingly, overall survival analysis revealed that the patients in the high-risk group had poorer survival (HR = 1.67 (CI: 1.01-2.78); Log-Rank p = 0.04443) than the low-risk group (Figure.7B). Moreover, the Rao Giddings (GSE11595) cohort (34 ESCA patient samples) showed that the expression of SPP1, WNT5A, PTHLH, PFN2, and GPC1 in the high-risk group (n=17) was associated with poorer survival (HR = 6.84 (CI: 2.36 -19.8); Log-Rank p = 3.072×e-5) than the low-risk group (n=17) (Fig.7C). In addition, we analyzed the overall survival in the Peters C.Fitzgerald (GSE19417) cohort available in the SurvExpress database. After optimized risk group separation, a total of 70 ESCA patient samples were divided into high-risk (n = 27) and low-risk groups (n = 43) based on their expression patterns (Fig.7D). The survival probability estimates in the two risk groups were represented as Kaplan-Meier plots. Similarly, overall survival analysis showed that the patients in the high-risk group had poorer survival (HR = 2.25 (CI: 1.34 -3.79); Log-Rank p = 0.001659) than the low-risk group. As shown in Fig.5, The 5 genes were signi cantly over-expressed in the high risk patients compared to the low risk group. Moreover, the expression of SPP1, WNT5A, PTHLH, PFN2, and GPC1 successfully discriminated the survival of the ESCA high risk group from that of the low risk group in three ESCA cohorts (288 patients). These ndings indicated that over-expression of the 5 genes is associated with a poor prognosis of ESCA and presents SPP1, WNT5A, PTHLH, PFN2, and GPC1 as an EMT signature based on changes in the KEAP1-NRF2 pathway in ESCA.

Discussion
The key role of KEAP1-NRF2 pathway alterations in developing drug-and radio-resistance in ESCA is wellestablished. The lack of speci c biomarkers for KEAP1-NRF2 pathway alterations in ESCA motivated us to analyze TCGA-ESCA data in order to identify different biological processes that are regulated by KEAP1-NRF2 in ESCA; hence identifying new therapeutic targets and biomarkers to predict prognosis of ESCA patients. We found that the KEAP1-NRF2 pathway was altered in 12% of TCGA-ESCA patients. Swada et al., performed whole-exome sequence analysis of tumor and nontumor esophageal tissues collected from 144 patients with ESCA (11). They found that NRF2 was mutated in 16.7% of the patients and it was one of the most frequently mutated gene in their cohort while KEAP1 was mutated in almost 5% of patients. We performed GSEA to detect gene sets that showed statistically-notable differences between KEAP1-NRF2-mutated samples and their wild-type counterparts groups. Interestingly, 4 gene sets, namely, estrogen response late, hypoxia, reactive oxygen species pathway and epithelial mesenchymal transition, were greatly enriched, with FDR < 0.05. Since the ultimate effect of KEAP1-NRF2 pathway alterations is the stabilization of NRF2, the master transcriptional regulator of the cells antioxidant program (12), it was not surprising to nd the reactive oxygen species pathway among the top pathways that show statistically notable differences between the KEAP1-NRF2-mutated and wild-type groups. Surprisingly, pathways such as EMT and hypoxia were greatly enriched. As EMT was more enriched, we selected EMT pathway to identify a NRF2-KEAP1 pathway signature in ESCA. EMT, an evolutionarily conserved developmental program, has been implicated in carcinogenesis and confers metastatic properties upon cancer cells by enhancing mobility, invasion, and resistance to apoptotic stimuli. Furthermore, EMT-derived tumor cells acquire stem cell properties and exhibit marked therapeutic resistance (13). Given these attributes, the complex biological process of the EMT has been heralded as a key hallmark of carcinogenesis, and targeting EMT pathways constitutes an attractive strategy for cancer treatment (14). Recently, it has been suggested that NRF2 contributes to malignant transformation of pancreatic duct epithelium through distinct EMT-related mechanisms accounting for an invasive phenotype (15). Furthermore, the expression of NRF2 is correlated with the lymph node metastasis of esophageal squamous cell carcinoma and blockage of NRF2 enhances the expression of E-cadherin , the well-known marker of epithelial cell polarity (16). In order to identify an EMT-derived signature for the KEAP1-NRF2 pathway in ESCA, we integrated the EMT gene list obtained from GSEA with the DEGs between the mutated and wild-type groups (log FC> |1.5|, FDR < 0.05). Interestingly, 11 genes were identi ed (SPP1, PTHLH, WNT5A, COL11A1, COL7A1, GPC1, SNAI2, ADAM12, FBN2, PFN2, and IGFBP3). Then, we validated these 11 genes by subjecting the KEAP1-NRF2-muatated and wild-type ESAC cell lines to differential gene expression analysis. Intriguingly, only 5 of the 11 genes showed signi cant upregulation between the two groups (SPP1, WNT5A, PTHLH, PFN2, and GPC1). Further, we evaluated the prognostic power of these 5 genes using three ESCA cohorts (288 patients), and we found that the overexpression of these ve genes were associated with a poor prognosis in ESCA.
In agreement with our analysis, SPP1 and EMT have been shown by bioinformatics analysis to have a close association in colorectal cancer (17). Moreover, Osteopontin, encoded by SPP1 promotes the EMT in hepatocellular carcinoma through regulating vimentin (18), and high SPP1 expression in hepatocellular carcinoma is associated with poor survival outcome (19). Additionally, up-regulation of WNT5A has been suggested to promote EMT and metastasis in pancreatic cancer models, which involves activation of βcatenin-dependent canonical Wnt signaling (20). Furthermore, it has been illustrated that WNT5A promotes EMT and metastasis in non-small-cell lung cancer (NSCLC), and high WNT5A expression is associated with poor prognosis in NSCLC patients (21). In addition, parathyroid hormone related-protein, encoded by PTHLH has been found to promote EMT in prostate and pancreatic cancers (22,23). It has been indicated that PTHLH is a poor prognosis marker and promotes cell growth of HNSC. Besides, it has been illustrated that inhibition of PFN2 hinders cell invasion and migration, as well as induces an EMT phenotype, including increased expression of epithelial marker E-cadherin, decreased mesenchymal marker Vimentin, Snail, Slug and ZEB1, and morphological changes in ESCA cells in vitro (24). High PFN2 expression independently predicts poor overall survival in primary HNSC and ESCA (24,25). Moreover, Over-expression of GPC1 activates EMT which then increases invasion and migration in colorectal cancer and ESCA (26)(27)(28). Additionally, it has been suggested that GPC1 plays an important role in regulating TGF-β-mediated EMT and stemness, and could be a potential future therapeutic target to prevent progression of gastric cancer (29). It has been pointed out that GPC1 is over-expressed and implies a poor prognosis in several cancers including, ESCA, uterine cervical cancer, pancreatic cancer, and methothelioma (30)(31)(32)(33)(34). Altogether, the above evidence suggests an oncogenic role of the 5-gene signature in many cancers.

Conclusion
Our study identi ed an EMT-derived gene signature regulated by the KEAP1-NRF2 pathway that is strongly associated with tumorigenesis, metastasis, and drug resistance in ESCA. This 5-gene signature provides potential biomarkers and therapeutic targets for ESCA patients in whom KEAP1-NRF2 pathway is activated.