Pan-cancer DNA methylation signature quantification of lifestyle exposures and cancer prognosis

Background: Alcohol consumption, body mass index (BMI) and cigarette smoking are among the most well-studied lifestyle exposures for cancer risk, and they can also change the host's epigenetic methylation patterns. Some of the changes associated with lifestyle exposure are specific and stable over time and can therefore be used to predict and quantify the exposure. Although the link between these lifestyle exposures and an increased odds ratio (OR) for different cancer types is well known, their role in predicting cancer survival remains less clear. We hypothesized that, by using lifestyle exposures predicted from the methylation profiles in tumour DNA, we could predict the overall survival probability associated with these exposures in cancer patients.
Results: The Cancer Genome Atlas (TCGA) Pan-Cancer dataset was used to test the prognostic value of the predicted DNA methylation (DNAm) alcohol, BMI and smoking exposures in 24 cancer types (n = 8,238 subjects). Multivariable Cox proportional hazards models with adjustment for age, cancer stage and other exposures were used to calculate the hazard ratio (HR) for overall survival associated with these predicted DNAm exposures. We observed specific cancer types with strong associations between poorer survival and higher alcohol consumption (bladder, brain, esophageal, and head and neck cancers), higher BMI (bladder, pancreatic and post-menopausal breast cancers), and smoking (B-cell lymphoma, stomach, bladder and lung cancers). Interestingly, we also observed associations between better survival from kidney cancer and higher alcohol consumption and smoking exposures. For alcohol consumption we found a positive association between HR and OR across all cancers, indicating that for cancers where alcohol is a significant risk factor, it is also associated with poorer survival (p = 0.022). This was not the case for the BMI (p = 0.548) or smoking (p = 0.193) exposures.
Conclusions: These DNAm exposure signatures may provide novel information on the relationship between these lifestyle factors and cancer outcomes.
for this study. All participants in the TCGA were originally recruited with informed consent in line with the


Introduction
Obesity, alcohol consumption and smoking are among the most studied lifestyle exposures known to be associated with increased risk for many cancer types (1)(2)(3)(4)(5). However, less is known about the value of these lifestyle factors in predicting cancer patients' survival. Epidemiological studies have reported that higher alcohol consumption is associated with poorer esophageal, head and neck, pancreatic and colorectal cancer survival (6)(7)(8)(9); higher BMI is associated with poorer breast, ovarian, pancreatic, bladder and colorectal cancer survival (10)(11)(12)(13)(14)(15); and smoking is associated with poorer lung, B-cell lymphoma, stomach and bladder cancer outcomes (16)(17)(18)(19). These studies used reported clinical or questionnaire exposure data to analyse the association with prognosis. However, reported exposure data are typically captured by questionnaires, which can be subject to measurement error, recall bias and patient underestimation. Furthermore, these studies frequently analysed only one cancer type at a time, which makes comparisons between cancer types difficult. DNA methylation is a commonly studied epigenetic modification characterised by the addition of a methyl group to DNA, typically at a cytosine-phosphate-guanine (CpG) site. These modifications have been shown to be dynamic yet stable, tissue and cell-specific, involved in transcription and gene regulation, and can be influenced by genetic, demographic and lifestyle exposures (20,21). Many epigenome-wide association studies (EWAS) and meta-analyses have identified DNAm signatures associated with lifestyle exposures. These signatures encompass genome-wide CpG methylation differences between individuals at the extreme levels of each exposure and those without the exposure (22)(23)(24)(25).
Additionally, age acceleration, BMI, alcohol, smoking and estrogen DNAm exposure signatures have been found to be associated with breast and lung cancer risk (26)(27)(28)(29)(30)(31). These studies usually measured DNAm lifestyle exposures only in patients' blood and did not investigate whether DNAm exposures are associated with prognosis. However, DNAm signatures in blood can also be observed in other tissues and cells (26,32,33), and the TCGA Pan-Cancer dataset holds patients' DNA methylation and clinical information for over 30 cancer types, which can be used to compare DNAm exposure associated survival across multiple cancers.
In this study, we therefore hypothesised that published alcohol, BMI and smoking DNAm exposure signatures can be used to measure lifestyle exposures in tumour DNA, that these measurements can predict patients' overall survival probability in different types of cancer, and that they are further related to cancer risk. Firstly, for 24 TCGA cancer types, we extracted the ORs for the association of these reported lifestyle exposures with cancer risk from published meta-analyses (1)(2)(3)(4)(5)(34)(35)(36). Next, we measured these lifestyle exposures using their respective published DNAm exposure signatures in DNA methylation data for these cancer types and available matched adjacent normal tissues, from the TCGA Pan-Cancer database. For each cancer type we then calculated the DNAm exposure associated HRs and compared them with the literature-reported exposure associated ORs, to measure how well the association of each exposure with cancer survival correlated with its association with cancer risk.

Study population and data
The TCGA collection contains approximately 11,000 patient tissue samples covering 33 cancer types to date and includes patient molecular assay datasets and clinical data. This collection has also been standardized for comparisons between cancer types, with four major clinical endpoints, and optimized for various omics studies in the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR) (37). In this study, we first identified and reviewed published meta-analyses that contained information on the associations of reported BMI, alcohol consumption and smoking exposures with cancer risk. The reported exposure ORs (relative risks (RR) were also called ORs in this study) and confidence intervals (CIs) were then extracted from these studies for each cancer type for which DNA methylation, clinical and survival data were also available in the TCGA Pan-Cancer collection (1)(2)(3)(4)(5)(34)(35)(36). In total this covered 24 cancer types. For these, patients' age, tumour-node-metastasis (TNM) stage ('stages I and II' and 'stages III and IV' were combined), overall survival times and vital status clinical data, together with Illumina methylation BeadChip beta-value data for their tumour and available adjacent normal tissue, were obtained from the TCGA Pan-Cancer datasets stored on the TCGA University of California Santa Cruz (UCSC) Xena browser (https://xenabrowser.net). Of these patients, 8,238 had DNA methylation data available for their primary tumour tissue, 722 for their adjacent normal tissue, and 696 for both tissues. The TCGA 450K methylation beta-value data had previously been pre-processed through standard quality control steps such as probe filtering, normalisation, and correction for batch effects using the minfi package (38).

DNAm exposure signature measurements
The DNAm exposure signatures used in this study were obtained from previously published EWASs measuring overweight/obese BMI, moderate to heavy alcohol consumption and current smoking behaviour exposures (25). These DNAm exposure signatures were then used to predict these lifestyle exposures in patients' primary tumour and available adjacent normal tissue from the pre-processed TCGA 450K methylation data. Because some CpGs were missing from the pre-processed TCGA methylation data, the number of signature CpG sites with methylation beta-values available, out of the total in each signature, was 612/1109, 262/450 and 132/233 for BMI, alcohol and smoking respectively. For each exposure and patient, a DNAm score was calculated as the sum, over the exposure's CpG sites, of the patient's beta-value at each site multiplied by the corresponding DNAm exposure signature beta coefficient. Patients' DNAm exposure scores were then standardized to z-scores within the study cohort for subsequent comparison between cancers. A summary of the study patients' clinical data and predicted DNAm alcohol, BMI and smoking exposures for each cancer type is available in Supplementary Table 1.
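
The score calculation described above can be sketched in a few lines. This is only an illustration, not the authors' code (the study used R): the function names, CpG identifiers, coefficients and beta-values below are all hypothetical, and the use of the sample standard deviation for the z-scores is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def dnam_score(betas, coefficients):
    """Sum of beta-value x coefficient over CpGs present in both mappings."""
    shared = betas.keys() & coefficients.keys()
    return sum(betas[cpg] * coefficients[cpg] for cpg in shared)


def to_z_scores(scores):
    """Standardize scores within the cohort (sample SD is an assumption)."""
    mu, sd = mean(scores), stdev(scores)
    return [(s - mu) / sd for s in scores]


# Hypothetical signature: CpG id -> beta coefficient (not real EWAS weights).
signature = {"cg_a": 0.5, "cg_b": -1.0, "cg_c": 2.0}

# Hypothetical beta-values; cg_c is absent, mimicking CpGs missing from the data.
cohort = {
    "P1": {"cg_a": 0.2, "cg_b": 0.8},
    "P2": {"cg_a": 0.6, "cg_b": 0.4},
    "P3": {"cg_a": 0.4, "cg_b": 0.6},
}

scores = [dnam_score(betas, signature) for betas in cohort.values()]
z_scores = to_z_scores(scores)
```

Restricting the sum to CpGs present in both the signature and the data mirrors the situation described above, where only a subset of each published signature was available in the TCGA data.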

DNAm exposures in tumour and adjacent normal tissue comparison
The predicted DNAm exposures derived from patients' adjacent normal tissue were used to assess the performance of the DNAm exposure signatures in the primary tumour tissues, that is, to investigate whether tumour DNA gave an accurate representation of the exposure as determined in the normal tissue. Firstly, to compare the consistency of the DNAm exposure measurements between tumour and adjacent normal tissue, Spearman's rank correlation coefficients were calculated for the beta-values of each DNAm exposure signature CpG site between the patients' tumour and matched adjacent normal tissue (n = 696 patients). The frequencies of these correlation coefficients were then normalized to proportions by dividing by the total number of CpG sites associated with each exposure. For each exposure, the distribution of these CpG site tumour versus normal correlation coefficients was then plotted against their proportional frequencies. Next, for each exposure, the tumour versus normal correlation coefficients were plotted against the DNAm exposure signature beta coefficients of the CpG sites (the impact of each CpG site in the DNAm exposure signature), to examine the relationship between the two tissues at each CpG site according to its weight of contribution. For each exposure, hierarchical cluster analysis using the Manhattan distance was then carried out on the patients' tumour and adjacent normal tissue DNAm exposure CpG beta-values and visualised in dendrograms, to see how similar the DNAm exposure measurements were in the two tissue types. Lastly, Pearson's correlation coefficients were calculated between the predicted DNAm exposure z-scores in patients' tumour and matched adjacent normal tissue.
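
The per-CpG tumour versus normal comparison can be illustrated with a small, dependency-free sketch. It is not the authors' implementation (the study used R). The helper names and toy beta-values are hypothetical, and Spearman's coefficient is computed here as Pearson's correlation of average ranks, which is the standard definition when ties are present.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical beta-values for four matched patients at two signature CpGs.
tumour = {"cg_a": [0.1, 0.4, 0.3, 0.9], "cg_b": [0.8, 0.6, 0.4, 0.2]}
normal = {"cg_a": [0.2, 0.5, 0.3, 0.7], "cg_b": [0.1, 0.3, 0.5, 0.9]}

# One tumour-versus-normal coefficient per CpG site.
per_cpg_rho = {cpg: spearman(tumour[cpg], normal[cpg]) for cpg in tumour}
```

The same `pearson` helper applies directly to the final step, the correlation between tumour and normal DNAm exposure z-scores.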

Exposures, cancer prognosis and risk
DNAm exposure survival analysis was then carried out by univariable and multivariable Cox proportional hazards analysis, calculating the HR associated with each DNAm exposure z-score for each cancer type. The multivariable DNAm exposure HR models were adjusted for age at diagnosis, TNM stage (where available) and relevant DNAm exposure scores, for each cancer type. The DNAm alcohol and BMI exposure associated HR models were adjusted for DNAm smoking, and the DNAm smoking associated HR models for DNAm BMI, as these DNAm exposures were found to significantly confound the analyses. Additionally, the breast cancer data were stratified into pre- and post-menopausal breast cancers because of the well-known association between BMI and menopausal status at time of diagnosis (1). These breast cancer subgroups were then also analysed for DNAm BMI associated survival, with adjustment for age at diagnosis, TNM stage (where available) and DNAm smoking scores. Next, to test the association between DNAm exposure cancer survival and reported exposure cancer risk, for each exposure the log-transformed multivariable DNAm exposure HRs of the cancer types were regressed on the respective log-transformed reported exposure ORs by linear regression, with further adjustment for the group size of each cancer type. The group size was normalized by dividing by the largest group size among the studied cancer types. All statistical analyses in this study were carried out using R version 3.6.0.

(Fig. 1B). After the hierarchical clustering analysis, the DNAm alcohol and BMI exposure signature CpG beta-values did not tend to separate into normal and tumour tissue samples, indicating that these groups did not cluster separately (Figs. 2A, 2B).
In contrast, the DNAm smoking exposure CpG beta-values showed more of a tendency to separate into normal and tumour tissue samples, indicating a systematic difference between the smoking exposure signature beta-values in the two tissue types (Fig. 2C). DNAm alcohol and BMI exposure z-scores in patients' tumour and adjacent normal tissue were moderately correlated, with Pearson's correlation coefficients of 0.55 and 0.39 respectively (Figs. 3A and 3B). The DNAm smoking exposure z-scores in patients' tumour and adjacent normal tissue were weakly correlated, with a correlation coefficient of 0.2 (Fig. 3C). This is consistent with the hierarchical clustering findings, indicating that tumour and normal tissue methylation levels differed at the DNAm smoking exposure CpG sites.

DNAm exposures and cancer prognosis
The DNAm exposure associated HRs for cancer survival and the reported exposure associated ORs for cancer risk, for each of the 24 cancer types, are shown in Tables 1-3. For some tumour types we were also able to adjust for response to first line treatment (complete response versus stable disease/progression) as a potential confounder and found that the majority of results remained similar (Supplementary Table 3).

Abbreviations: body mass index (BMI), confidence interval (CI), deoxyribonucleic acid methylation (DNAm), number of patients (N), hazard ratio (HR), odds ratio (OR) and tumour-node-metastasis (TNM). Significant HR associations are shown in bold. * p < 0.01, ** p < 0.001, *** p < 0.0001

For the full, pre-menopausal and post-menopausal DNAm BMI associated HR analyses, DNAm BMI, age and late TNM stage were all significant predictors of survival for the full and post-menopausal BRCA groups. No variables were significant predictors of survival for the pre-menopausal BRCA group (Table 4). Furthermore, ovarian cancer was excluded from the subsequent analyses due to low patient numbers.

corresponding reported exposures were also associated with cancer risk, usually in the same direction. In contrast, for the DNAm exposures that were significantly associated with cancer survival in bladder (BLCA) and brain (LGG) cancers for higher alcohol consumption, in bladder (BLCA) cancer for higher BMI, and in B-cell lymphoma (DLBC) for smoking, the corresponding reported exposures were not associated with cancer risk. Interestingly, reported smoking exposure increased the risk of developing kidney (KIRC) cancer, but DNAm smoking exposure appeared to be protective in terms of prognosis.
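
The comparison of survival HRs with risk ORs rests on the regression set out in the methods: log HR regressed on log OR across cancer types, with normalized group size as a further covariate. A minimal sketch of that model follows. It is not the authors' code (the study used R), every number in it is synthetic, and the hazard ratios are generated from known coefficients purely so that the fit can be checked. The helper names are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(design, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Synthetic inputs: one entry per hypothetical cancer type.
odds_ratios = [1.0, 1.5, 2.0, 3.0, 1.2]
group_sizes = [200, 500, 1000, 350, 800]

# Group size is normalized by the largest group, as described in the methods.
norm_size = [n / max(group_sizes) for n in group_sizes]
log_or = [math.log(v) for v in odds_ratios]

# Synthetic HRs built from known coefficients (0.1, 0.5, 0.2) for checking.
hazard_ratios = [math.exp(0.1 + 0.5 * lo + 0.2 * s) for lo, s in zip(log_or, norm_size)]
log_hr = [math.log(v) for v in hazard_ratios]

design = [[1.0, lo, s] for lo, s in zip(log_or, norm_size)]
intercept, slope_log_or, slope_size = ols(design, log_hr)
```

A positive `slope_log_or` corresponds to the pattern reported for alcohol: cancers with a higher risk OR also tend to have a higher survival HR.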

Discussion
In this study we have used existing prediction models for the alcohol, BMI and smoking lifestyle exposures based on DNAm signatures to predict patients' exposures from their tumour DNA samples. Previous work has developed and validated these DNAm exposure signatures in numerous tissue samples, predominantly blood sample DNA, but this study is, to our knowledge, the first to use tumour DNA to predict the exposures of individuals. We first show that, for the alcohol and BMI exposures, the DNAm exposure signatures observed in tumour DNA are correlated with the signatures predicted from matching adjacent normal tissues. This is important for addressing the potential limitation that tumour DNA methylation profiles change dramatically compared with the normal tissue in which they occur. We then used these predicted DNAm exposures to investigate how the exposures relate to overall survival in cancer patients. We find that specific cancer types have strong associations between poorer survival and higher alcohol consumption (bladder (BLCA), brain (LGG), esophageal (ESCA), and head and neck (HNSC) cancers), higher BMI (bladder (BLCA), pancreatic (PAAD) and post-menopausal breast (BRCA) cancers), and smoking (B-cell lymphoma (DLBC), stomach (STAD), bladder (BLCA), and lung (LUSC) cancers). Kidney (KIRC) cancer, unusually, was found to have improved survival with higher alcohol consumption and smoking exposures. For alcohol consumption we found a positive association between HRs and ORs across all cancers, indicating that for cancers where alcohol consumption is a significant risk factor, it is also associated with poorer survival.
For the smoking exposure, we found that the normal tissue and tumour tissue did not correlate strongly and were separated in the hierarchical clustering. We propose two possible explanations for this. Firstly, unlike the other two exposures, smoking is known to induce many mutations directly in CpG sites, which could affect the DNA methylation patterns observed in tumour compared with normal tissue. Alternatively, it could be that methylation patterns in the tumours represented in this analysis are more divergent at the smoking related CpG sites than at the other exposure CpG sites.
Many of the findings in our study are consistent with the existing literature. The hazardous role of high alcohol consumption in patients with esophageal (ESCA) and head and neck (HNSC) cancers (8,9), of high BMI in breast (BRCA), bladder (BLCA) and pancreatic (PAAD) cancers (12)(13)(14), and of smoking in stomach (STAD), lung (LUSC) and B-cell lymphoma (DLBC) cancers (16,18,19) is supported by studies that were based on clinical or self-reported phenotypes. However, we did not find studies supporting our findings of the hazardous role of high alcohol consumption in patients with bladder (BLCA) and brain (LGG) cancers, and these represent novel findings. Furthermore, some reported associations of lifestyle exposures with cancer prognosis were not supported by our study. These include the reported associations of poorer prognosis with high alcohol consumption in colorectal cancer (6), and with higher BMI in ovarian (11) and colorectal (15) cancers. This lack of replication could be due to the different patient cohorts used in these studies or to low statistical power for these tumour types, or it could reflect an interesting biological difference in the way the exposures are measured. For example, BMI as used in reported datasets is calculated from a patient's current weight and height and is typically a single measurement used as a proxy for a quantity that may fluctuate throughout life, while the DNAm BMI exposure measurement may reflect a longer-term history of high or low adiposity.
This study has many strengths. Firstly, the large sample size of the TCGA Pan-Cancer collection allowed us to examine and compare the effect of the lifestyle-associated DNAm exposures in multiple cancer types and granted us sufficient statistical power in the survival analysis. The pre-standardized molecular data prevented influence from batch effects or other technical confounders. The use of the revised version of the clinical endpoint data also increased the accuracy of the survival analysis.
However, this study is not without limitations. Firstly, we acknowledge up-front that the variability in DNA methylation profiles in tumour DNA may influence the accuracy of these exposure predictions.
Nevertheless, this prediction model can represent the biologically measured exposure rather than the phenotype itself as reported by individuals. In the case of smoking, it has been confirmed that hypomethylation at the AHRR and CYP1B1 genes induced by cigarette smoking is found in lung tissue, blood and other tissues in the body (32). Therefore, it is not unexpected that the exposures can also be detected in tumour DNA. This biologically measured exposure may be more accurate than what can be achieved with questionnaires that ask about historical alcohol consumption, which are subject to considerable recall bias.
Additionally, the DNAm exposure prediction model we used to quantify the lifestyle exposures was developed from methylation data measured in blood samples. However, in the TCGA dataset, DNA methylation was measured in the target organ, with the majority of samples being taken from the primary tumour tissue. Whether these organ tissues have a DNA methylation profile consistent with that of blood at the CpG sites associated with lifestyle exposure remains unclear. Although one study has pointed out that tissue from alveoli has a similar epigenetic profile to blood-derived samples at CpG sites associated with smoking exposure (32), we are unable to ensure that this is the case for the remaining organ tissues and lifestyle exposures, due to the lack of blood-derived DNA methylation data in the TCGA dataset. We are also unable to account for potential disparities in exposure or methylation associated with variables such as ethnicity and recruitment centre, which may be biased in some tumour types compared with others, as these data were not available in this dataset. While we were able to adjust for treatment response in some tumour types, we were not able to do so for all; this could be improved in future studies.
Another limitation lies with the missing values in the TCGA DNA methylation data, which prevented us from investigating the complete set of exposure associated CpG sites. The consequence of not including these missing CpG sites in the DNAm exposure prediction models cannot be assessed without comparison to the complete methylation data. In the future, blood-derived DNA methylation data measured in cancer patients could be used to validate our study. There may also be unobserved confounding factors that remain unadjusted, as we adjusted only for the most relevant confounding factors in order to preserve statistical power.

Conclusions
In summary, we presented the lifestyle exposure mediated cancer risk and survival risk in multiple cancer types. We found that DNAm exposure signatures can be measured in tumour DNA and that higher alcohol consumption, higher BMI and smoking exposures are associated with poorer survival in many cancers. Cancer types whose survival probability is affected by the predicted DNAm exposures are also likely to have a reported exposure associated cancer risk in the same direction, with few exceptions. Cancers that originate in organs with direct contact to the exposure also tend to have a positive association between cancer survival and cancer risk.