This study describes and validates two novel epigenetic signatures of lifestyle exposures. The epigenetic signature for tobacco consumption was associated with increased odds of self-reported CVD, as well as an increased risk of all-cause and lung cancer-related mortality in an independent cohort. Although the signature for alcohol consumption was also associated with increased odds of self-reported CVD, it was not associated with all-cause or cause-specific mortality in the LBC cohorts.
The EpiTob and EpiAlc signatures require fewer CpGs than previously published epigenetic signatures and show a similar, if not higher, correlation with their associated self-reported phenotype (i.e., tobacco and alcohol consumption, respectively). For example, Liu et al proposed four epigenetic signatures for alcohol consumption (grams per day + 1) that were based on 5, 23, 78 and 144 CpGs [4]. Ignoring results based on the initial training dataset, the signature composed of 144 CpGs explained the most variance (difference in reported R² between the models with and without the epigenetic signature = 13.8%; data from the Atherosclerosis Risk in Communities study [13]), while the maximum variance explained by the five-CpG signature was about 10% in the Lothian Birth Cohort 1936 (R² = 10.4%). In comparison, the three-CpG EpiAlc signature explained 13% of the variance when also using data from the LBC1936 cohort (adjusting for age, sex and BMI, as done in the study by Liu et al [4]). The EpiTob signature similarly outperformed other published signatures. For example, a risk score proposed by Maas et al included 13 CpGs and achieved an AUC of 0.91 in an external validation cohort when discriminating smokers from non-smokers [14]. Another proposed epigenetic signature (or polyepigenetic biomarker, to use the authors' term) was based on 2,623 CpGs identified from an independent EWAS [15][16]. Although comparable statistics were not provided, Sugden et al reported a similar distribution pattern of methylation, with the signature increasing with smoking status and pack-years smoked (Fig. 1 from Sugden et al) [16]. Finally, one of the original approaches to discriminating between smokers and non-smokers is based on a single CpG, cg05575921 [17].
Using data from 61 smokers and non-smokers (who had not smoked in the 10 years prior to measurement), Philibert et al reported a discriminatory AUC value of 0.99 [17]. However, this single CpG was assessed more recently in a study that combined four separate, publicly available datasets from Liu et al (GSE42861), Ventham et al (GSE87648), Tsaprouni et al (GSE50660) and Su et al (GSE85210). Although a high discriminatory value was reported across the combined datasets (AUC = 0.80–0.90), it was not as high as that originally reported by Philibert et al [18][19][20][21][22]. It is important to note, however, that the AHRR-associated CpG, cg05575921, is included in the EpiTob signature.
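The comparisons above rest on two statistics: the difference in R² between nested linear models with and without the signature, and the AUC for discriminating smokers from non-smokers. The following is a minimal sketch of how both can be computed, not the code used in this study or in the cited studies. All function names, variable names and the simulated data are hypothetical, and the covariate set simply mirrors the age, sex and BMI adjustment mentioned above.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added here)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def incremental_r2(covariates, signature, y):
    """Difference in R^2 between the models with and without the signature."""
    base = r_squared(covariates, y)
    full = r_squared(np.column_stack([covariates, signature]), y)
    return full - base


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half (Mann-Whitney formulation)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))


# Simulated (hypothetical) data: three covariates standing in for
# age, sex and BMI, plus a signature that drives the outcome.
rng = np.random.default_rng(0)
n = 500
covariates = rng.normal(size=(n, 3))
signature = rng.normal(size=n)
consumption = 2.0 * signature + rng.normal(size=n)

delta_r2 = incremental_r2(covariates, signature, consumption)
print(f"Incremental R^2 of the signature: {delta_r2:.3f}")

# Small hand-checkable AUC example: 3 of the 4 positive/negative
# pairs are ranked correctly.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

Because the two models are nested, the incremental R² cannot be negative in-sample; a comparison across studies additionally requires that the same covariates are adjusted for in each.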
Biological interpretation
The five CpGs for the signature of tobacco consumption were located on three genes (NDE1, CACNA1D, and AHRR), while the three CpGs composing the signature of alcohol consumption were located on two genes (SLC7A11, JDP2). The CpG sites included in both epigenetic signatures were initially identified through a literature search of published EWAS results that identified upwards of 3,000 CpGs (alcohol and tobacco consumption combined). In creating these epigenetic signatures, the models were optimized for parsimony, using as few variables as possible while explaining as much variance as possible. Thus, the selected CpGs likely capture only a small portion of the whole, representing a few cogs in a greater physiologic machinery, and so it is unsurprising that the gene enrichment analysis identified neither significant nor necessarily relevant terms (see Supplementary table). However, with respect to the annotated genes, the aryl hydrocarbon receptor repressor (AHRR) gene encodes a protein involved in the aryl hydrocarbon receptor (AhR) signaling cascade, which links environmental chemical stimuli (e.g., smoking) with adaptive responses [23]. In terms of the EpiAlc signature, the Solute Carrier Family 7 Member 11 (SLC7A11) gene is involved in the transport of cystine and glutamate. A recent study by Lohoff et al used data from a population-based cohort, a case-control study, a postmortem mRNA analysis of human brain samples, and an mRNA analysis of liver tissues from mice to provide evidence for the downregulation of SLC7A11 in association with alcohol consumption; such downregulation has also been linked to increased oxidative stress [24]. Moreover, Lohoff et al also identified CpGs annotated to the c-Jun dimerization protein 2 (JDP2) gene in association with alcohol consumption [24].
Utility of epigenetic signatures in epidemiological research
Cohort studies are often hindered by the type of data collected, unmeasured confounding, the reliability of self-reported data and associated biases (e.g., recall or response bias), as well as the duration and intensity of exposure. Epigenetic signatures of lifestyle exposures and health conditions, and those that predict disease risk and progression, have the potential to alleviate such issues in epidemiological research. Currently, objective biomarkers are available to verify self-reported tobacco consumption (e.g., cotinine, with a half-life of 12–20 hours) or alcohol consumption (e.g., ethyl glucuronide [EtG], with a half-life of about 2–3 hours) [25][26][27][28]. However, these biomarkers require separate additional tests, which can add to participant burden. Importantly for prospective cohort studies, participant burden can play a role in study attrition, which in turn may lead to a loss of power or even bias study results, resulting in erroneous conclusions [29]. Moreover, measuring the impact of an exposure at a mechanistic (biological) level can help refine risk stratification in epidemiological studies, and in turn improve the identification of individuals most susceptible to developing disease.
Utility of epigenetic signatures in the clinical setting
Epigenetic signatures could become a means of tracking the effectiveness of behavioral interventions at the individual level, which would be relevant to the clinical management of tobacco and alcohol use disorders by providing a call to action and a motivational tool. The EpiTob and EpiAlc signatures were created to maximize prediction and minimize the number of CpGs included. In terms of clinical utility, a parsimonious signature is advantageous because it can be measured with less expensive techniques than chip arrays. Advancements in technology that are already underway (such as those by Oxford Nanopore Technologies® and MassARRAY®) could facilitate measurement of only those CpG sites of interest, helping drive down costs and ensure accessibility. However, using these signatures to track the effectiveness of behavioral changes would depend on the plasticity of the measured CpGs in response to changes in exposure. Although the plasticity of the epigenetic signatures could not be demonstrated directly in the present analysis, there are arguments for its existence. For example, in a longitudinal analysis, Dugué et al observed that changes in reported alcohol intake were associated with changes in methylation levels at 513 CpG sites; these included two of the three CpGs in the EpiAlc signature (cg06690548 and cg00716257) [30]. In a separate analysis using data from the Framingham Heart Study, Liu et al observed that methylation levels among heavy drinkers reverted to those of non-drinkers within four years [4]. However, these results were based on samples taken at roughly four-year intervals, which makes it difficult to quantify a precise timeline of signature plasticity.
Similar to epigenetic changes induced by alcohol consumption, epigenetic changes following smoking cessation also appear to be relatively dynamic [31]. For example, using data from the Generation Scotland: Scottish Family Health Study (a large, population-based cohort), McCartney et al found that among light (so-called low-dose) smokers, only prolonged exposure to tobacco consumption induced epigenetic changes that could adequately characterize smoking status [32]. Additionally, while it took less than a year for the methylation profile of low-dose ex-smokers to revert to that of a never smoker, it took up to nine years for high-dose ex-smokers [33]. Further demonstrating this plasticity, a recent study of adolescents observed that methylation levels of the cg05575921 CpG site (included in the EpiTob signature) remained stable over the course of two years among non-smokers, but decreased among smokers, even within a 6-month period [34]. However, while the current evidence base is promising in terms of the clinical utility of EpiAlc and EpiTob to track the progress of cessation efforts, additional research is needed to demonstrate the plasticity of these signatures and to better understand the dose-response relationship. A final note regarding the utility of these, as well as other, epigenetic signatures in the clinical setting is the need to address ethical concerns surrounding the use of this sensitive information, most notably with regard to privacy and confidentiality [35]. While many countries have legislation in place that protects genetic information garnered from widely accessible genetic testing, such laws may need to be updated to address concerns unique to epigenetic information.
Strengths and limitations
The epigenetic signatures included in the present manuscript were initially created using data from the SKIPOGH cohort, a population-based cohort including participant pairs or families with shared genetics. The genetic homogeneity and limited size of the SKIPOGH study population may be a limitation of the present study, as they could have contributed to the lack of identification of relevant SNPs for inclusion in the final signatures, particularly given that genetic associations generally have very small effect sizes and hence would require a much larger sample size [36]. Similarly, the cohorts included in the present study are primarily composed of European populations. Recent evidence points towards an influence of ancestry on the epigenetic architecture that is largely driven by genetics, and has also shown differential methylation patterns according to ancestry [37][38]. As such, additional validation in a diverse population is necessary to ensure generalizability to populations of non-European descent.
A strength of the present study is the validation of epigenetic signatures in independent samples with varied populations in terms of inclusion criteria, mean sample age, and background. With respect to mean sample age, comparisons across the SKIPOGH and LBC cohorts may provide insights into the influence of bias on study outcomes. For example, while the epigenetic signatures for alcohol and tobacco consumption were associated with self-reported CVD in the SKIPOGH cohort, these associations appeared weaker in the LBC1936 cohort and were absent in the LBC1921 cohort. The diminishing strength of association across the three cohorts may be influenced by the higher average age, and thus reflect survivor bias. This is similar to the observation that epigenetic age increases at a slower rate than chronological age, especially among older individuals, which is likely a consequence of survivor bias [39]. Moreover, both LBC cohorts have proportionally fewer smokers at baseline, which is also likely influenced by the older age of cohort participants. Another potential limitation of the present study that may have affected comparability across studies is that normalization techniques for the DNA methylation data were not standardized across the included studies and samples, nor did all included samples use the same methylation profiling array. For example, DNA methylation data for the Lothian Birth Cohorts were obtained using the Infinium® HumanMethylation450 BeadChip assay [9], while the majority of the SKIPOGH participant methylation data were obtained using the Infinium® Methylation EPIC BeadChip. However, regardless of the array used, replication results suggest that the signatures are robust to the potential added noise associated with varying normalization techniques.
Finally, self-reported levels of alcohol and cigarette consumption were used to assess the relationship between tobacco or alcohol consumption and the respective epigenetic signature. Given that self-reported data are subject to self-report or recall bias (e.g., smokers may underreport their tobacco intake), effect estimates may be over- or underestimated. This could have contributed to a less than optimal choice of CpGs for inclusion in the signatures. However, the association of the epigenetic signatures with long-term health outcomes such as self-reported CVD or mortality supports the validity of the included CpGs as biomarkers of long-term effects, even though they were selected on the basis of self-reported exposures.