Potentially functional genetic variants in the c-Myb signaling pathway genes are associated with survival of non-small cell lung cancer patients in the PLCO Cancer Screening Trial

The c-Myb signaling pathway plays a critical role in regulating expression of genes involved in differentiation, proliferation of cells and cell death, which are related to cancer progression and metastasis. To test the hypothesis that genetic variants in the c-Myb pathway genes are associated with survival of non-small cell lung cancer (NSCLC) patients, we used genotyping data from 1,185 NSCLC patients in the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial as the discovery dataset and 984 NSCLC patients in the Harvard Lung Cancer Susceptibility (HLCS) Study as the validation dataset. With genotyping data for 8587 SNPs in 83 genes of the c-Myb pathway from the PLCO dataset, we performed Cox proportional hazards regression analysis to evaluate association between each SNP and survival, and resultant significant SNPs were validated in the HLCS dataset and further evaluated for their functional relevance.

2014), and the imputed SNPs with an info score ≥ 0.8 were used in further analyses. The Manhattan plot of all of the candidate SNPs was generated by using Haploview software (v4.2). The regional association plot was made for the final significant genes to show all genotyped and imputed SNPs in that gene.

Statistical analysis
In the single locus analysis, we used multivariate Cox proportional hazards regression analysis with adjustment for age, sex, smoking status, histology, tumor stage, chemotherapy, radiotherapy, surgery and the first four principal components of the population structures of the PLCO dataset to evaluate the associations between extracted SNPs and survival in an additive genetic model using the GenABEL package in R software.
We estimated hazards ratio (HR) and 95% confidence interval (CI) for each SNP. We first used the false discovery rate (FDR)method for multiple testing correction; nevertheless, considering the vast majority of the SNPs in the present study were imputed and thus not independent as required by FDR or other correction methods such as the Bonferroni correction , the false positive report probability (FPRP) method was then chosen to perform the multiple testing correction with a strict cut-off of 0.10. We assigned a prior probability of 0.10 to detect an HR of 2.0 for an association with variant genotypes or minor alleles of the SNPs with P <0.05. The SNPs that were identified with a P-value <0.05 by Cox regression analysis and passed the FPRP threshold were subjected to linkage disequilibrium (LD) analysis using Haploview software. Pairwise r2>0.80 was considered in high LD, and the representative SNPs (r 2 >0.80 in LD with other SNPs in the same gene) were used for further population and functional validation. Then, we assessed the associations of those SNPs, identified in the discovery PLCO dataset and validated the HLCS dataset, with OS/DSS of NSCLC using additive, dominant and recessive genetic models. To assess their collective effect, we also combined the protective genotypes of those validated SNPs as a genetic score for the number of protective genotypes in association with OS/DSS.
To assess any possible interactions, we first performed stratified analysis of the combined effect of protective genotypes as defined by the genetic score on NSCLC OS/DSS by clinical characteristics in the PLCO dataset, including age, sex, smoking status, histology, tumor stage, chemotherapy, radiotherapy and surgery, because the HLCS did not have such detailed clinical information. To combine the two GWAS datasets by a meta-analysis approach, a fixed-effects model was used when no heterogeneity was found between two studies (Q-test P-value > 0.10 and I 2 < 50.0%); otherwise, a random-effects model was applied. Kaplan-Meier survival curves, plotted using GraphPad Prism 6.0 (GraphPad Software Inc, San Diego, CA, USA), and the log-ranks tests were used to estimate the effects of protective genotypes on the cumulative probability of OS/DSS. In all the findings, a P-value < 0.05 was considered statistically significant. All statistical analyses were performed with SAS software (v9.4; SAS Institute, Cary, NC, USA), if not specified otherwise.
The expression data of lung cancer tissues from the Cancer Genome Atlas (TCGA) database (dbGaP Study Accession: phs000178.v1.p1),obtained from the Broad TCGA GDAC site (http://gdac.broadinstitute.org), were used for the expression quantitative trait loci (eQTL) analysis of the validated SNPs. We also compared differential expression levels of validated genes in lung cancer tissues with that in the adjacent normal lung tissues.
Additional eQTL analysis was performed, in which the correlations between the genotypes of those validated SNPs and their mRNA expression levels were explored using RNA sequencing data from lymphoblastoid cells of 373 European descendants available in the 1000 Genomes Project.

Study population characteristics
Demographic and clinical characteristics of NSCLC patients from the PLCO trial have been previously described as also shown in Supplementary Table 1. There were 798 (67.3%) deaths during the follow-up with a median survival time of 23.77 months. There were 11 subjects with missing clinical information, including one for pack years, two for tumor stage, and eight for chemotherapy/radiotherapy/surgery). Overall, subjects with age 71 years, being female and never smoker, having a lower stage and chemotherapy or surgery tended to have a better survival.
The identified SNPs obtained from the PLCO discovery analysis were further validated by the GWAS dataset from the HLCS study. After applying quality control, 984 histologyconfirmed Caucasian patients remained in the HLCS study, who were older than 18 years old, with newly diagnosed, histologically confirmed primary NSCLC, as previously described 15 . Gene and SNP extraction from the PLCO trial The overall workflow is shown in Figure 1. Eighty-four candidate genes in the c-Myb signaling pathway were identified from MSigDB, with one gene located on chromosome X removed from further analyses (Supplementary Table 2 Table 3). In the combined analysis of two datasets for these two validated SNPs showed an improved OS of NSCLC associated with the rs11606640 A and rs62053220 G alleles [HR=0.78 (0.70-0.86) with P =2.55x10 -6 and HR=0.73 (0.63-0.83) with P=2.54x10 -6 , respectively] ( Table 1). All genotyped and imputed SNPs are shown in the regional association plots with an expansion of 500 KB in the flanks of the gene region, in which the two validated SNPs are each labeled in purple ( Supplementary Figure 2A).

Independent SNPs and further survival analyses with genetic models
We further conducted a stepwise Cox regression analysis of selected clinical variables from the PLCO dataset plus the two validated SNPs to confirm whether these SNPs were independent predictors of survival. We did this using the PLCO dataset, because only the PLCO dataset, but not the HLCS dataset, had detailed clinical variables, including age, sex, smoking status, tumor stage, tumor histology, chemotherapy, radiotherapy, and surgery. The Cox regression analysis also included top four principal components calculated based on all the SNPs in the 83 genes. The final stepwise analysis included 1,163 patients, because there were 10 patients with missing data for clinical variables (two for tumor stage and eight for chemotherapy/radiotherapy/surgery), eight patients with missing data for ETS1 rs11606640, and four patients with missing data for ZFHX3 rs62053220. Both of the two SNPs remained significant in the final multivariate model (

Combined analyses of the two SNPs
Because both ETS1 rs11606640 GA+AA and ZFHX3 rs62053220 CG+GG genotypes were associated with a favorable prognosis in NSCLC patients, we combined these genotypes into a genetic score to assess their collective effect. All the NSCLC patients were divided into three groups with zero, one and two of this genetic score. As shown in Table 3, the increase per score unit was correlated with an improved OS after adjustment for age, sex, smoking, tumor stage, histology, chemotherapy, radiotherapy, surgery and the top four principal components (P trend <0.001). In the multivariate OS analysis with the dichotomized genetic score of 0 and 1-2, the latter was associated with a significant

Assessment of interaction and stratification analysis
Stratification analysis between the two groups of protective genotypes (score 0 versus 1-2) was performed with the PLCO dataset by other covariates, such as male vs. female, patients ≤71 vs. >71years, patients with early vs. late tumor stage, surgery vs. no surgical therapy, with vs. without radiotherapy/chemotherapy, and the results are depicted by Kaplan-Meier survival curves and log-rank tests, as shown in Figure 2. The association between scores 1-2 and a favorable survival was more evident for male patients, patients aged >71years, patients with late tumor stage and without surgery (all P<0.001), but a multiplicative interaction was found only between the genetic scores with age and radiotherapy (P=0.002 and 0.040, respectively) (Supplementary Table 5). Hence, as shown in Supplementary Tables 6 and 7, we further reassessed the associations of ETS1 rs11606640 and ZFHX3 rs62053220 with OS of NSCLC using different genetic models by subgroups of age ≤ 71 or > 71years and with or without radiotherapy.

The eQTL analyses
Different mRNA expression levels of ETS1 and ZFHX3 between lung cancer tissues and adjacent normal lung tissues were compared using the TCGA dataset. The results showed that ETS1 and ZFHX3 mRNA expressionlevels were significantly lower in lung cancer tissues than in adjacent normal lung tissues, adenocarcinoma tissues as well as squamous tissues (all P<0.05) (Figure 3).
In addition, the ETS1 rs11606640 genotypes were significantly associated with the corresponding mRNA expression levels in lung squamous cancer tissues in an additive and recessive model (P additive =0.045, P recessive =0.042, respectively), which was not found in lung adenocarcinoma tissues; ZFHX3 rs62053220 genotypes were significantly associated with the corresponding mRNA expression levels in adenocarcinoma tissues only in a dominant model (P dom inant =0.048) (Figure 4). Furthermore, based on survival analysis of available online data, low mRNA expression levels of ETS1 and ZFHX3 in lung cancer tissue were associated with a poor outcome of the lung cancer patients (Supplementary Figure   2B).
However, additional eQTL analysis using datasets in the 1000 Genomes Projectshowed thatETS1 rs11606640 and ZFHX3 rs62053220 genotypes were not significantly associated with their levels of the corresponding mRNA expression in the lymphoblastoid cells in an additive model (P additive = 0.459 and 0.979, respectively) (Supplementary Figure 2C).

Functional validation analyses in silico
In a functional prediction analysis of the two validated SNPs, we found a RegulomeDB score of 4 for both of ETS1 rs11606640 and ZFHX3 rs62053220.

Discussion
In recent years, a pathway-based hypothesis-driven strategy for the post-GWAS analysis has generated some interesting results. Multiple functionally related genes in the same biological pathway may work together to confer susceptibility to disease development and progression, but independently they may not reach a significant threshold as defined in In the present study, we found that variant genotypes of two SNPs (i.e., ETS1 rs11606640 GA+AA and ZFHX3 rs62053220 CG+GG), as well as their combined effect, were independently and significantly associated with a favorable survival outcome of NSCLC patients. Additional eQTL analysis of the target tissues and tissue-specific survival analysis suggests that the ETS1 genes may be tumor suppressors, because ETS1 mRNA expression was lower in lung cancer tissue than that in adjacent normal tissue, and the variant ETS1 rs11606640 A genotypes were associated with an increased mRNA expression in tumor tissues and because the variant-associated high expression improved survival of the NSCLC patients. However, the fact that we failed to find a correlation between these two SNPs and their mRNA levels in lymphoblastoid cells may further reflect that such gene expression levels are indeed tissue-specific. Therefore, the molecular mechanisms underlying the observed associations needs additional investigations. ETS1 rs11606640 is located on chromosome 11q24.3, encoding the ETS1 protein, a founding member of the ETS family of transcription factors, which likely binds to DNA through the action of an 85 amino acid DNA-binding domain. ETS1 is known to target a wide variety of genes involved in proliferation, angiogenesis, apoptosis differentiation, migration and invasion of lung cancer. A previous GWAS reported that another two ETS1 SNPs (rs4937362 and rs4938573) were associated with risk of haematopoietic malignancies; but there is no report for an association of ETS1 SNPs with lung cancer survival. In the present study, we reported an association between ETS1 rs11606640 and survival in NSCLC patients.
However, ETS1 is considered as a marker of poor prognosis in several other cancer. For example, an elevated ETS1 expression has been observed in several invasive and metastatic solid tumors, such as lymph node metastasis. In addition, ETS1 functions as a tissue-specific transcription factor, targeting genes including cyclin D1, cyclin D3, MMP-1, MMP-3 and MMP-9, which are key elements in the process of carcinogenesis and metastasis. Therefore, the exact mechanisms underlying lung cancer progression remain to be investigated. ZFHX3 rs62053220 is located on chromosome 16q22.2-q22.3, encoding a zinc finger homeobox, also named ATBF1 (the AT motif binding factor 1), which is a large transcription factor with 23 zinc fingers and four homeodomains. ZFHX3 functions in multiple biological processes, including embryonic development, neuronal differentiation, and neuronal death in response to DNA damage or oxidative stress, and ZFHX3 abnormalities play a role in multiple human diseases, including cancer. Several published studies on ZFHX3 SNPs have focused on atrial fibrillation and obesity, , although the present study identified ZFHX3 rs62053220 to be associated with survival in NSCLC patients.
The existence of four homeoboxes in ZFHX3 suggests its dynamic function in biological processes, because few homeobox genes have more than one homeobox that interacts with other transcription factors to regulate the transcription of many genes. ZFHX3 is thought to be a tumor suppressor through multiple mechanisms, because its expression has been frequently reported to be reduced in multiple types of cancers, , which is consistent with our results using the TCGA database and also supported by animal studies.
For example, the deletion of ZFHX3 in mice led to disturbance of epithelial homeostasis and neoplastic morphology. Because ZFHX3 is normally localized in the nucleus, mislocalization to the cytoplasm in cancer cells has been reported to be associated with disease progression and decreased survival. In a study that assessed ZFHX3 mRNA expression in NSCLC tissue samples from 140 patients, a poor five-year overall survival rate among patients was associated with low levels of ZFHX3 expression. Therefore, additional investigations are needed to unravel the role of ZFHX3 in tumor progression.
The present study has some limitations. First, because the GWAS datasets were only    Tables 1-3 are only available    Comparison of mRNA expression levels of ETS1 and ZFHX3 between lung cancer tissue and normal lung tissues in TCGA. a.The ETS1 mRNA expression levels in lung cancer tissues were significantly lower than that in normal lung tissues (P<0.001); b. The ETS1 mRNA expression levels in lung adenocarcinoma tissues were significantly lower than that in normal lung tissues (P<0.001); c. The ETS1 mRNA expression levels in lung squamous tissues were significantly lower than 30 that in normal lung tissues (P<0.001); d. The ZFHX3 mRNA expression levels in lung cancer tissues were significantly lower than that in normal lung tissues (P=0.018); e. The ZFHX3 mRNA expression levels in lung adenocarcinoma tissues were significantly lower than that in normal lung tissues (P=0.037); f. The ZFHX3 mRNA expression levels in lung squamous tissues were significantly lower than that in normal lung tissues (P=0.002); ETS1_t= lung cancer tissues; ETS1_n= adjacent normal lung tissues; ZFHX3_t= lung cancer tissues; ZFHX3_n= adjacent normal lung tissues.