From tobacco smoking to mutational signature: the role of epigenetic changes in human cancers

Background Tobacco smoking is associated with a unique mutational signature in the human cancer genome. It is unclear whether tobacco smoking -altered DNA methylations and gene expressions affect smoking-related mutational signature. Methods We evaluated the smoking-related DNA methylation sites reported from ve previous studies using peripheral blood cells to identify possible target genes. Using the mediation analysis approach, we evaluated whether the association of tobacco smoking with mutational signature was mediated through altered DNA methylation and expression of these target genes. Results Based on data obtained from 21,108 blood samples, we identied 374 smoking-related DNA methylation sites, annotated to 248 target genes. Using data from DNA methylations, gene expressions and smoking-related mutational signature generated from ~7,700 tumor tissue samples across 26 cancer types from The Cancer Genome Atlas (TCGA), we found 11 of the 248 target genes whose expressions were associated with smoking-related mutational signature at a Bonferroni-correction P < 0.001. This included four for head and neck cancer, and seven for lung adenocarcinoma. In lung adenocarcinoma, our results showed that smoking increased the expression of three genes, AHRR , GPR15 , and HDGF, and decreased the expression of two genes, CAPN8 , and RPS6KA1 , which were consequently associated with increased smoking-related mutational signature. Additional evidence showed that the elevated expression of AHRR (cg14817490) and GPR15 (cg19859270), were associated with smoking-altered hypomethylations. Lastly, we showed that the elevated expression of HDGF and decreases expression of RPS6KA1, were associated with poor survival of lung cancer patients.


Background
Somatic mutations are one of the most common causes of carcinogenesis in humans [1,2]. Recent studies using data from The Cancer Genome Atlas (TCGA) have created a landscape of somatic mutations in each cancer genome, ranging from hundreds to thousands of somatic mutations across multiple cancer types [2,3]. To explore the biological processes of somatic mutations, Alexandrov and colleagues developed a mathematical framework to deconvolute them into mutational signatures. The approach characterized 96 mutation classi cations that included six substitution types, together with a anking base pair to the mutated base [3]. More than 30 mutational signatures have been identi ed across cancer types in TCGA [3,4]. Certain mutational signatures were associated with tobacco smoking, exposure to ultraviolet (UV) light, aging, de cient mismatch repair (MMR), mutations in POLE, increased activity of the APOBEC family of cytidine deaminases, and DNA polymerase POLH [3][4][5]. In particular, a smoking-related mutational signature featured by predominantly C > A mutations with a transcriptional strand bias was observed in multiple human cancer types, including lung adenocarcinoma, lung small cell carcinomas, head and neck squamous, liver, larynx, oral cavity, and esophagus cancers [3,6,7].
Tobacco smoking is a well-known risk factor for multiple cancer types, especially lung cancer [8][9][10]. DNA methylation, one of the major forms of epigenetic modi cation, essentially plays a regulatory role in gene expression. It has been a focus of multiple studies as a potential underlying molecular mechanism for tobacco smoking-related cancers. Previous epigenome-wide association studies (EWAS) have reported thousands of DNA methylations at CpG sites associated with tobacco smoking in blood, buccal cell and tumor-adjacent normal lung tissue samples [11][12][13][14][15][16][17][18]. These epidemiological studies showed that tobacco smoking was consistently associated with DNA hypomethylated CpG sites in speci c genes such as AHRR (encoding aryl-hydrocarbon receptor repressor) and GPR15 (encoding G protein-coupled receptor 15) in multiple studies [19]. In particular, Stueve and colleagues identi ed seven smokingassociated hypomethylated CpG sites in adjacent normal tissues from 237 lung cancer patients. Of note, ve of the seven sites including a hypomethylated CpG site in AHRR had been reported by previous bloodbased EWAS, which suggests that methylation biomarkers identi ed from blood samples might re ect methylation changes in the target tissues [15].
In our study, we evaluated the previously reported smoking-related DNA methylations from a total of 21,108 blood samples to identify candidate target genes [11-13, 17, 18]. Using data from DNA methylations, gene expressions and smoking-related mutational signature generated from approximately 7,700 tumor samples across 26 cancer types, we evaluated the associations of expression of these target genes with the smoking-related mutational signature for each cancer type. Using a mediation approach, we further evaluated whether the association of tobacco smoking with the mutational signature may be mediated through an altered expression of these target genes. Similar analyses were performed to evaluate the association of tobacco smoking with the gene expression mediated through smoking-altered DNA methylation.
Detailed information about datasets, analyses, and data sources are described at Firebrowse (http://gdac.broadinstitute.org/). A total of 30 somatic mutational signatures for each sample in TCGA have been characterized from mSignatureDB (http://tardis.cgu.edu.tw/msignaturedb). We downloaded the data and only analyzed the known tobacco-associated "mutational signature 4" reported in the mSignatureDB, corresponding to tobacco-associated mutational signature in this study. We measured the enrichment score of this mutational signature for each sample (details described in our previous work [20]).

The analysis of predicted neoantigen load
We downloaded the number of neoantigen loads for each sample from TCIA and applied log2 transfer to t it into a better distribution. Mutational neoantigens were predicted by the use of HLA typing and MHC class I/II binding capabilities. The established neoantigen prediction algorithm NetMHCcons [21] was applied to missense somatic mutations to estimate their binding a nity to the HLA alleles. A more detailed analysis of the processing has been described in previous literature [22,23].

Statistical analysis
The distribution for relative contribution of smoking-related mutational signature to overall mutation burden is severely right-skewed. To better t regression models, we used the ordinal semi-parametric regression models [24] to evaluate the associations of smoking-related mutational signature with tobacco smoking, gene expression and DNA methylation. The analyses were implemented in the 'orm' function from the 'rms' library of the R package [24]. To evaluate the enrichment of the signi cant associations for the 248 smoking-related target genes, we compared the proportion of the signi cant associations from samples of 248 gene randomly selected from the whole genome -this process was repeated 1000 times. To explore the mediation effects of DNA methylation on the association of tobacco smoking with smoking-related gene expression and the mediation effects of the smoking-related gene expression on the association of tobacco smoking with the smoking-related mutational signature, we conducted mediation analyses using the R package 'mediation' [25] to estimate the average direct effect (ADE) and the average causal mediation effect (ACME) of the mediators, which represent the population averages of these causal mediation and direct effects. All the analyses were adjusted for age and gender.
To estimate the association between the smoking-related gene expression and overall survival of lung cancer patients, we conducted survival analysis using the Cox proportional hazards model with the adjustment of age, gender and tobacco smoking.

Results
Identifying blood-based DNA methylations associated with tobacco smoking To identify smoking-related DNA methylations at CpG sites, we evaluated previously reported methylations in blood samples from ve EWAS  Identifying genes associated with the smoking-related mutational signature in a pan-cancer study The smoking-related mutational signature was characterized in TCGA samples in previous studies [3,26] ( Fig. 1B). Utilizing this study, we used the relative contribution of the mutational signature to overall mutation burden, with values ranging from 0 to 1, for each sample across 26 cancer types in TCGA (see Materials and Methods). Using regression analyses, adjusting for gender and age, we observed that tobacco smoking was signi cantly associated with increased smoking-related mutational signature in lung adenocarcinoma (P = 1.75 × 10 − 9 ; Fig. 1C). In line with previous studies, we observed that the contributions of smoking-related mutational signature to the overall mutation burdens varied in different cancers, with the most enrichments being observed in lung adenocarcinoma (median of contribution: 42%) and lung carcinoma (median of contribution: 35%) (Fig. 1D). Using regression analyses, adjusting for gender and age (see Materials and Methods), we evaluated the associations between the expressions of the identi ed 248 smoking-related target genes and smoking-related mutational signature for each cancer type. Of these target genes, we found that 234 genes were associated with smoking-related mutational signature in 19 cancer types (at a P < 0.05) (Supplementary table 3), suggesting that these genes were over-represented (P < 0.001 for the enrichment analysis, see Materials and Methods). At a more strict threshold of a P < 1 × 10 − 4 , a total of 59 genes were identi ed in six cancer types: breast (n = 2), colon (n = 1), head and neck (n = 24), lung adenocarcinoma (n = 28), lung carcinoma (n = 2), and melanoma (n = 2) ( Fig. 1E; Supplementary table 3).
In the end, using a Bonferroni-correction of P < 0.001 (corresponding to a raw P = 5.0 × 10 − 8 , given 20,000 tests in a genome-wide level), we identi ed four genes for head and neck cancer and seven genes for lung adenocarcinoma. Speci cally, for head and neck cancer, the expression levels of three genes, NFE2L2, RMND5A and SLC44A1, were associated with increased smoking-related mutational signature, while an inverse association was observed for one gene, ARRB1 (Fig. 1F, Table 1). For lung adenocarcinoma, we found that the expression levels of three genes, GPR15, HDGF, and AHHR, were associated with increased smoking-related mutational signature, while an inverse association was observed for the other four genes, NWD1, KCNQ1, CAPN8 and RPS6KA1 (Fig. 1F, Table 1). GPR15 showed the most signi cant association with a P < 2.22 × 10 − 16 (Table 1). Table 1 Associations between smoking-associated mutational signature and expression of candidate genes (Bonferroni-correction P < 0.01). "N" refers to sample size for each cancer type. A regression analysis was constructed to include tobacco smoking-associated mutational signature as a dependent variable and gene expression levels as the independent variable for each gene of each cancer type.
Mediation effects of the identi ed seven genes on the association of smoking with mutational signature in lung adenocarcinoma For the identi ed seven genes for lung adenocarcinoma, we evaluated the associations between their expression and tobacco smoking (see Materials and Methods). We found that tobacco smoking was signi cantly associated with an increased expression of AHRR, GPR15 and HDGF with a P = 6.9 × 10 − 5 , P = 2.7 × 10 − 7 and P = 3.3 × 10 − 4 , respectively, and a decreased expression of CAPN8 and RPS6KA1 with a P = 9.6 × 10 − 4 and P = 0.01, respectively ( Fig. 2A; Supplementary table 4). Using a mediation analysis approach, we further estimated the ACME of the expression of these genes that would be altered by smoking on the mutational signature. We found that they showed signi cant mediation effects on the association of smoking with the signature (Fig. 2B, C). Speci cally, we observed a signi cant percentage of ACME for the smoking-related gene expressions: 13.4% (95% CI: 0.046 and 0.256) with a P = 2.0 × 10 − 4 for AHRR, 9.8% (95% CI: 2.4% and 21.7%) with a P = 2.2 × 10 − 3 for CAPN8, 22.8% (95% CI: 11.3% and 39.4%) with a P < 1 × 10 − 4 for GPR15, 12.3% (95% CI: 4.7% and 24.6%) with a P = 8.0 × 10 − 4 for HDGF, and 8.6% (95% CI: 0.5 and 20.6%) with a P = 0.032 for RPS6KA1 ( Fig. 2C; Table 2). " * ": "ACME" refers to the average causal mediation effects. "ADE" refers to the average direct effects. "Prop" refers to the proportion of ACME related to total affects. Prop 8.6% 5% 20.6% 0.032 " * ": "ACME" refers to the average causal mediation effects. "ADE" refers to the average direct effects. "Prop" refers to the proportion of ACME related to total affects.

Mediation effects of smoking-related DNA methylation on the association of smoking with gene expression in lung adenocarcinoma
In the above mediation analysis, we found that ve genes, AHRR, CAPN8, GPR15, HDGF, and RPS6KA1, mediated the association between smoking and mutational signature in lung adenocarcinoma. For these, six smoking-related DNA methylations, cg11554391, cg14817490, cg21446172, cg19859270, cg00867472 and cg13092108, have been reported in blood cells [11-13, 17, 18]. We further evaluated the associations between these methylations and tobacco smoking in lung adenocarcinoma. In line with previous ndings, we found that consumed tobacco smoking was signi cantly associated with hypomethylations at the CpG sites: cg11554391 (AHRR), cg14817490 (AHRR), and cg19859270 (GPR15) (P < 0.05 for all; Fig. 3A; Supplementary table 4). Next, we evaluated the association between the methylation at each CpG site and gene expression. Interestingly, our results showed that the smokingaltered hypomethylation was associated with an elevated expression for each gene (P < 0.05 for all): AHRR (cg11554391 or cg14817490) and GPR15 (cg19859270), indicating that these smoking-altered hypomethylations likely play an up-regulation role in their gene expression ( Fig. 3B; Supplementary table 5). In particular, these hypomethylated CpG sites are located in regions with evidence of enhancer activities associated with their target genes ( Supplementary Fig. 1).
Using a mediation analysis approach, we further estimated the ACME of the methylations that would be altered by smoking on gene expressions. We found that the methylations at two CpG sites, AHRR (cg14817490, P = 0.03) and GPR15 (cg19859270, P < 1 × 10 − 4 ), showed signi cant mediation effects on the association of smoking with gene expression (Fig. 3C, D; Table 3). Speci cally, we observed a signi cant percentage of ACME for both smoking-related DNA methylations: 8.5% (95% CI: 8% and 24.5%) with a P = 0.03 for AHRR, and 15.9% (95% CI: 5.2% and 32.9%) with a P < 1.0 × 10 − 4 for GRP15 ( Fig. 3D; Table 3). " * " ACME refers to the average causal mediation effects. ADE refers to the average direct effects (ADE). "Prop" refers to the proportion of ACME related to total affects.

Supplementary Data
Expression of HDGF and RPS6KA1 associated with overall survival of lung cancer patients To explore the association between overall survival of lung cancer patients and the identi ed ve genes that mediated the association between smoking and mutational signature in lung adenocarcinoma, we conducted the Cox regression analysis using data from TCGA (see Materials and Methods). Our results revealed that the elevated expression level of HDGF was associated with the reduced overall survival of lung cancer patients, while an opposite trend was observed for RPS6KA1 when comparing the high level of gene expression (> median) versus low level ( < = median) (Hazard Ratio [HR] = 1.45 and HR = 0.61, P = 0.02 and P = 1.5 × 10 − 3 for HDGF and RPS6KA1, respectively) (Supplementary Fig. 2A, B). These ndings are in line our initial results that tobacco smoking increased expression level of HDGF and decreased expression level of RPS6KA1. No signi cant associations with overall survival of lung cancer patients were observed for other three genes.

Discussion
In the present study, a total of 374 smoking-related methylations annotated to 248 target genes were identi ed using strict statistical criteria from previous EWASs in blood samples. Using data from TCGA, we identi ed a total of 11 candidate genes of 248 target genes whose expressions were associated with smoking-related mutational signature, including four in head and neck cancer and seven in lung adenocarcinoma. Of seven genes for lung adenocarcinoma, our results further showed that smoking increased the expression of three genes, AHRR, GPR15, and HDGF, and decreased the expression of two genes, CAPN8, and RPS6KA1. These smoking-altered gene expressions were consequently associated with increased smoking-related mutational signature. In addition, our results showed that the elevated expression of AHRR (cg14817490) and GPR15 (cg19859270), were associated with smoking-altered hypomethylations.
Our analysis focused on the identi ed 374 blood-based methylations associated with tobacco smoking, which have strong evidence of statistical associations from previous studies. In particular, the initial discovery of methylations associated with tobacco smoking is based on a study with the largest sample size we have found so far (N = 15,907) (see Materials and Methods) [13]. In addition to studies of blood, two studies have investigated methylations associated with tobacco smoking in buccal cells (N = 790) [16] and tumor adjacent normal lung tissue (N = 237) [15]. Notably, both studies had limited sample sizes and were insu cient in statistical power to identify smoking-related methylation sites, while they have revealed evidence that blood-based methylation biomarkers could re ect changes in their target tissues.
Recently, Ma and Li performed pathway enrichment analyses based on 320 smoking-affected genes identi ed in blood. Their results showed that 104 of these genes were signi cantly enriched in pathways associated with the etiology of different cancers [27]. Consistent with these ndings, two recent epidemiology studies showed that smoking-related hypomethylations in blood cells were associated with lung cancer risk [28,29]. Thus, our study shows a connection of blood-based methylations with tobacco smoking-related mutational signature in tumor tissue. Our study not only provides an understanding of the molecular mechanisms underlying tobacco smoking carcinogenesis, but also can potentially lead to a new avenue for target intervention.
Using mediation analyses, we concluded that two genes, AHRR and GPR15, signi cantly contributed to smoking-related mutational signature, mediated by smoking-altered methylation and gene expression in lung adenocarcinoma. These conclusions are supported by multiple layers of evidence from previous literature. It is known that the AHRR gene encodes a suppressor of the aryl hydrocarbon receptor (AHR). It is involved in the AHR signaling cascade, which plays an essential role in dioxin toxicity, including polycyclic aromatic hydrocarbons (PAHs), an important class of smoking carcinogens [30,31]. It has been documented that the AHRR gene is associated with tobacco smoking, based on EWAS from blood, buccal cell and normal lung tissue [11][12][13][14][15][16][17][18]. In recent studies, the hypomethylated CpG sites in the AHRR gene in pre-diagnostic peripheral blood samples was reported to be associated with lung cancer risk [28,29]. Based on in vitro experiments in lung tissue from both humans and mice, the evaluated AHRR expression has been validated by tobacco smoking-altered methylations [14]. In addition to AHRR, GPR15 encodes an orphan G-protein-coupled receptor involved in the regulation of innate immunity and T-cell tra cking in the intestinal epithelium [32,33]. Previous studies have shown that the GPR15 gene is associated with tobacco smoking, which is based on EWAS from blood, while no studies have reported it for lung cancer [11,12,[16][17][18]. Our nding was the rst to identify this smoking-related novel gene that signi cantly contributed to smoking-related mutational signature in lung cancer.
Our results showed three additional genes, CAPN8, HDGF and RPS6KA1, contributed to smoking-related mutational signature, mediated by gene expression altered by tobacco smoking in lung adenocarcinoma. Tobacco smoking-related methylations in these genes have been reported in the previous EWAS in blood samples. However, we did not observe that these methylations were associated with tobacco smoking in lung adenocarcinoma, although consistent association directions were observed for HDG and RPS6KA1 (Data not shown). Notably, unlike the studies in large sample size from blood studies, the statistical analysis in detecting association between DNA methylation and tobacco smoking is still challenge in tumor tissues due to possible factors, such as tumor heterogeneity, potential confounders, and limited sample size. In fact, our focus on the analysis of the reported blood-based smoking-related DNA methylation sites could identify reliably smoking-related target genes and reduce the possibility of reverse causation. Nevertheless, given the tissue-speci cities of some methylations in blood, further studies with a large sample size are still needed to replicate the associations for these candidate tobacco smokingrelated genes in lung adenocarcinoma. In fact, our results showed that smoking-related methylations of these genes were associated with decreased expressions of these genes (P < 0.01 for all), indicating that they may play a down-regulation role in their gene expression in lung adenocarcinoma ( Supplementary   Fig. 3). Further in vitro or in vivo functional assays are needed to validate the genes that are affected by tobacco smoking in lung cancer.
It is known that neoantigens (or neoepitopes) result from missense somatic mutations in cancer cells [34]. However, how smoking-related mutational signature contribute to neoantigen loads remain unclear. We additionally evaluated the associations between smoking-related mutation signature and predicted neoantigen loads (see Materials and Methods). We observed that smoking-related mutational signature were signi cantly associated with increased neoantigen loads in three cancer types, head and neck, lung adenocarcinoma, and lung carcinoma (see Materials and Methods). An inverse association was observed in melanoma (P < 1 × 10 − 4 for all; Supplementary Fig. 4A, B; Supplementary Table 6). The most signi cant association was observed in lung adenocarcinoma with a P < 2.2 × 10 − 16 . In addition, we also observed that neoantigen loads were associated with all ve identi ed genes (P < 1 × 10 − 5 ) and tobacco smoking (P = 2.16 × 10 − 11 ) in lung adenocarcinoma ( Supplementary Fig. 4C, D). In particular, the expressions of AHRR and GPR15 had associations with an increased predicted neoantigen load with P = 7.6 × 10 − 10 and P = 7.7 × 10 − 7 , respectively ( Supplementary Fig. 4D). Thus, our ndings may provide new clues to explore the biological and immunological mechanisms through which smoking-related mutational signature may be involved in carcinogenesis, and provide potential genomic biomarkers for the development of cancer prevention and immunotherapy.

Conclusions
Our results showed that the smoking-altered DNA methylations and gene expressions play an important role in contributing to smoking-related mutational signature in human cancers.  Figure 1 Identi cation of genes and their associations with smoking-related mutational signature. A) A ow chart to illustrate the identi cation of candidate smoking-related DNA methylations from the previously reported blood-based methylations in ve EWAS. "N" represents the sample size for each study. B) Smoking-related mutational signature displayed according to the 96 substitution classi cations characterized by six substitution types, together with a anking base pair to the mutated base (Alexandrov et al 2013). C) A scatter plot indicating tobacco smoking correlated with known smokingrelated mutational signature in lung adenocarcinoma. The dotted line refers to association coe cient.
Each point represents one sample. The x axis represents the number of packs per year for each sample, the y axis represents the contribution of smoking-related mutational signature to overall mutation burden for each sample. The color from red to green refers to a higher to lower density of samples (this note applies to all other gure legends). D) Box plots of the enrichment score of smoking-related mutational signature across 26 cancer types. E) Bar plots indicating the P value of associations between the candidate genes and smoking-related mutational signature in six cancer types. Only genes with a P value of less than 1 × 10-4 were presented. The dashed dot box highlights the genes with signi cant associations at a Bonferroni-correction P < 0.001. F) Scatter plots for each gene with signi cant associations at a Bonferroni-correction P < 0.001. From the left to the right panel, four genes in head and neck and seven genes in lung adenocarcinoma are presented.

Figure 2
Mediation analysis illustrating the effect of the expression of ve genes that would be altered by smoking on smoking-related mutational signature in lung adenocarcinoma. A) Scatter plots indicating the statistical signi cance between ve candidate genes and tobacco smoking in lung adenocarcinoma. B) A diagram to illustrate a mediation analysis framework, where gene expression can be a mediator to affect smoking-related mutational signature. C) Five candidate genes are presented with signi cant mediation effect, at P < 0.05. "ACME" refers to the average causal mediation effects via gene expression on smoking-related mutational signature. Mediation analysis illustrating the effect of tobacco smoking-altered methylation on gene expression in lung adenocarcinoma. A) Scatter plots indicating the statistical signi cance of associations between methylations at three candidate CpG sites and tobacco smoking in lung adenocarcinoma. B) Scatter plots indicating negative correlations between DNA methylation at three candidate CpG sites and gene expression in lung adenocarcinoma. C) A diagram to illustrate a mediation analysis framework, where DNA methylation can be a mediator to affect the expression of tobacco smoking-altered genes. D) Two candidate CpG sites are presented with signi cant mediation effects on gene expression, at P < 0.05.
"ACME" refers to the average causal mediation effects via DNA methylation on gene expression.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.