Subject recruitment, demographic/lifestyle variables acquisition and DNA methylation measurements
EPIC Italy. Study participants were drawn from the Italian component of the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort [46], a large general population cohort consisting of ~520,000 individuals, with standardized lifestyle and personal history questionnaires, measured anthropometric data and blood samples collected for DNA extraction. Smoking habits data were collected at study enrolment using a questionnaire, and participants were categorized as ‘never’, ‘former’ and ‘current’ smokers. Height and weight were measured at enrolment with a standardized protocol, and body mass index (BMI) was calculated as the ratio between weight in kg and squared height in meters, treated as a continuous variable. Methods for measurements of blood pressure, cholesterol levels, triglycerides, and PAI1, D-dimer, and CRP are reported elsewhere [47].
This study sample includes individuals from five nested case-control studies on breast, colon, and lung cancer, lymphomas, and myocardial infarction [48,49]. Participants were sampled from the 47,749 participants of the EPIC Italy cohort and included 354 incident breast cancer cases, 169 incident colon cancer cases, 192 incident lung cancer cases, 72 incident lymphoma cases, 295 incident myocardial infarction cases and their 1,079 matched controls. Controls were individually matched on age (±5 years), sex, the season of blood collection, centre, and length of follow-up. Since the disease diagnoses were made years after the blood draw, all the subjects were treated as healthy at recruitment. In the time to CVD event analyses, the follow-up time was (right) censored at the time of diagnosis for incident cancer cases.
Overall, after DNA methylation data quality controls and sample filtering 1,803 EPIC Italian subjects were used in this analysis.
EXPOsOMICS CVD. Study participants pertain to the EPIC Italy cohort. 160 incident CVD cases and one-to-one matched controls (not overlapping with the EPIC Italy dataset described hereafter) were extracted using the incident density sampling method [37]. After DNAm data quality control and sample filtering, 315 individuals were included in this study.
For the microarray, DNA samples were extracted from buffy coats using the QIAsymphony DNA Midi Kit (Qiagen, Crawley, UK). Bisulphite conversion of 500 ng of each sample was performed using the EZ-96 DNA Methylation-Gold™ Kit according to the manufacturer’s protocol (Zymo Research, Orange, CA). Then, bisulphite-converted DNA was used for hybridization on the Infinium HumanMethylation 450 BeadChip, following the Illumina Infinium HD Methylation protocol. Briefly, a whole genome amplification step was followed by enzymatic end-point fragmentation and hybridization to HumanMethylation 450 BeadChips at 48°C for 17 h, followed by single nucleotide extension. The incorporated nucleotides were labeled with biotin (ddCTP and ddGTP) and 2,4-dinitrophenol (DNP) (ddATP and ddTTP). After the extension step and staining, the BeadChip was washed and scanned using the Illumina HiScan SQ scanner. The intensities of the images were extracted using the GenomeStudio (v.2011.1) Methylation module (1.9.0) software, which normalizes within-sample data using different internal controls that are present on the HumanMethylation 450 BeadChip and internal background probes. The methylation score for each CpG was represented as a β-value according to the fluorescent intensity ratio representing any value between 0 (unmethylated) and 1 (completely methylated).
The Irish Longitudinal Study on Ageing (TILDA) is a large prospective cohort study examining the social, economic and health circumstances of 8,175 community-dwelling older adults aged 50 years and over resident in the Republic of Ireland. The sample was generated using a 3-stage selection process and the Irish Geodirectory as the sampling frame. The Irish Geodirectory is a comprehensive listing of all addresses in the Republic of Ireland, which is compiled by the national post service and ordnance survey Ireland. Subdivisions of district electoral divisions pre-stratified by socio-economic status, age, and geographical location, served as the primary sampling units. The second stage involved the selection of a random sample of 40 addresses from within each PSU resulting in an initial sample of 25,600 addresses. The third stage involved the recruitment of all members of the household aged 50 years and over. Consequently, the response rate was defined as the proportion of households including an eligible participant from whom an interview was successfully obtained. A response rate of 62% was achieved at the household level. There were three components to the survey. Respondents completed a computer-assisted personal interview and a separate self-completion paper and pencil module which collected information that was considered sensitive. All participants were invited to undergo an independent health assessment at one of two national centres using trained nursing staff. Blood samples were taken during the clinical assessment with the consent of participants. A more detailed exposition of study design, sample selection and protocol are available elsewhere [19]. The present study sample included 500 healthy individuals: 125 for each of the four socioeconomic classes: stable professional, any downward mobility, any upward mobility, and stable unskilled. Buffy coat or peripheral blood mononuclear cells (PBMC) samples were available for all the individuals. Overall, after DNA methylation data quality controls and sample filtering, 490 subjects were analysed in this study.
For the microarray, DNA samples were extracted from buffy coats using the QIAGEN GENTRA AUTOPURE LS (Qiagen, Crawley, UK). Bisulphite conversion of 500 ng of each sample was performed using the EZ DNA Methylation-Lightning™ Kit according to the manufacturer’s protocol (Zymo Research, Orange, CA). Then, bisulphite-converted DNA was used for hybridization on the Infinium HumanMethylation 850k BeadChip, following the Illumina Infinium HD Methylation protocol. Briefly, a whole-genome amplification step was followed by enzymatic end-point fragmentation and hybridization to HumanMethylation EPIC Chip at 48°C for 17 h, followed by single nucleotide extension. The incorporated nucleotides were labeled with biotin (ddCTP and ddGTP) and 2,4-dinitrophenol (DNP) (ddATP and ddTTP). After the extension step and staining, the BeadChip was washed and scanned using the Illumina HiScan SQ scanner. The intensities of the images were extracted using the GenomeStudio (v.2011.1) Methylation module (1.9.0) software, which normalizes within-sample data using different internal controls that are present on the HumanMethylation 850k BeadChip and internal background probes. The methylation score for each CpG was represented as a β-value according to the fluorescent intensity ratio representing any value between 0 (unmethylated) and 1 (completely methylated).
Understanding Society. The study sample consisted of participants from the United Kingdom Household Panel Study (UKHLS), also known as Understanding Society [50], an ongoing longitudinal, nationally representative study of the UK, designed as a two-stage stratified random sample of the general population. While Understanding Society is a panel survey, the data used here consist of two pooled cross-sectional waves where a nurse collected blood samples from the respondents, among other physiological measures. The eligibility criteria for collecting blood samples were: (a) participation in the previous main interviews in England (had participated in all annual interviews between 1999 (BHPS wave 9) and 2011–2013 (Understanding Society wave 2 and 3); (b) age 16 and over; (c) living in England, Wales, or Scotland. From the potential pool of 6,337 survey respondents, eligibility requirements for epigenetic analyses meant that the samples for DNA methylation measurement were restricted to participants of white ethnicity, resulting in 1,175 subjects; more details can be found elsewhere [30]. Details about laboratory analyses for DNAm and how to access raw data can be found at the Understanding Society web site
(https://www.understandingsociety.ac.uk/documentation/mainstage/dataset-documentation/variable/epigenetics).
For the GSE174818 (Covid-19 case-control) study, details of sample characteristics and laboratory methods for DNAm and biomarker analyses are described in the original publication [31].
DNA methylation data pre-processing and quality controls
For all the studies, raw DNAm data were pre-processed and normalized using in-house software written for the R statistical computing environment, including background and colour bias correction, quantile normalization, and BMIQ procedure to remove type I/type II probes bias, as described elsewhere [51]. DNAm levels were expressed as the ratio of the intensities of methylated cytosines over the total intensities (β values). Samples were excluded if the bisulphite conversion control fluorescence intensity was less than 10,000 for both type I and type II probes. Methylation measures were set to missing if the detection P-value was greater than 0.01. Additionally, the set of cross-reactive and/or polymorphic (with minor allele frequency greater than 0.01 in Europeans) CpGs (N=39,238) described by Chen et al. [52] was excluded due to the low reliability of methylation measure.
The Fernández-Sanlés methylation risk score (MRS) was computed as a standardised weighted sum of 34 CpG sites, with weights defined by the estimates described by the authors in the Supplementary material of their original publication [1]. DNAmGrimAge and other epigenetic clocks were computed using Steve Horvath online DNAmAge calculator
(https://horvath.genetics.ucla.edu/html/dnamage/).
Statistical analyses
Development and validation of DNAm surrogates
We developed DNAm surrogates for BMI, systolic and diastolic blood pressure, and ten blood measured biomarkers. We used the EPIC Italy dataset randomly split into training (N=1,352; 75% of the sample) and test set (N=451; 25% of the sample). For each risk factor/biomarker, we created a DNAm surrogate through a three-step procedure:
1) We identify risk factors/biomarkers showing significant differences across EPIC Italy centres (Turin, Varese, Naples, Ragusa) via ANOVA analyses. We employed a linear model with a random intercept component, accounting for differences across centres for this subset of biomarkers, consisting of all but PAI-1, CRP, D-dimer, and triglycerides. We used a fixed-effect linear model for the other biomarkers.
2) Log-transformed risk factors/biomarkers were regressed on DNAm through a linear model adjusted for age, gender (fixed effect), and centre of recruitment (random effect, where necessary) to identify the top 1% ranked CpGs based on the P-value.
3) DNAm surrogates of risk factors/biomarkers were constructed, regressing the response variables on the top 1% CpG sites, adjusting for sex and age. Finally, we applied L1 penalised estimation for enforcing sparsity in the regression coefficients employing the LASSO procedure [53] or the corresponding penalised mixed model [54] (for the biomarkers showing difference by centre) depending on the biomarker. For the latter method ad-hoc R routines were devised: the source code is freely available in the form of an R package at https://github.com/AndreaCappozzo/mixedelnet.
We validated the DNAm surrogates investigating their association (Pearson correlation coefficients) with the corresponding measured risk factor/biomarker in the EPIC Italy testing set (N=451, 25% of the sample), and four additional independent studies: Understanding Society (N=1,174), TILDA (N=490), EXPOsOMICS CVD (N=315), and GSE174818 (N=128). We used fixed-effect meta-analysis (inverse variance weights) to combine the results across the four validation datasets into a single estimate. As a result, we defined as ‘validated’ DNAm surrogates with significant associations (P < 0.05) in both EPIC Italy and the combined validation sets. As further validation, we investigated the correlation of our newly developed DNAm surrogates with those previously developed for BMI, HDL cholesterol [23], and PAI-1 [11].
Derivation of DNAmCVDscore
We developed a blood DNAm based biomarker (that integrates several DNAm surrogates) for predicting the risk of future CVD events named DNAmCVDscore. We used Cox regression model with elastic net regularisation to regress the time from recruitment to CVD event, and for selecting the most critical features from 60 (standardised: mean=0, sd=1) previously described blood DNAm surrogates.
The best λ parameter was derived from ten-fold cross-validation to minimise the Harrel concordance C-index. The overall procedure includes 1,000 permutations using each time 80% of the whole EPIC Italy dataset (n=1,443). The DNAm surrogates comprising the DNAmCVDscore were selected among those with non-zero coefficients in at least half of the permutations. Finally, DNAmCVDscore was computed as a linear combination of the selected DNAm surrogates where weights correspond to the average (non-zero) coefficient among the 1,000 permutations.
Validation of DNAmCVDscore and comparison with MRS, SCORE2 and DNAmGrimAge
We validated the DNAmCVDscore in an independent dataset (EXPOsOMICS CVD, N=315), including incident CVD cases and matched controls. Since the testing set is designed as a case-control study nested in a cohort, we ran logistic regression analyses, and we evaluated the predictive performance of DNAmCVDscore through ROC curve analysis.
We compared the performance of five logistic regression models:
1) Based on DNAmCVDscore adjusted for matching parameters (age, sex, and centre of recruitment).
2) Based on MRS adjusted for matching parameters.
3) The SCORE2 prediction model based on chronological age, sex, diabetes, smoking, systolic blood pressure, total and HDL cholesterol, adjusting for matching parameters.
4) An enriched version of SCORE2, denoted with SCORE2 + DNAmCVDscore, in which DNAmCVDscore is included in the set of covariates.
5) Based on DNAmGrimAge adjusted for matching parameters.
To investigate the predictive performance of the five composite biomarkers at different time points, we computed the area under the ROC curve (AUC), sensitivity, and specificity as a function of the time from recruitment to diagnosis, right-censoring follow-up at constant intervals of one year from 18 to two years. Confidence intervals for AUC were computed according to De Long et al. [55].
DNAm surrogates and DNAmCVDscore vs COVID-19 case-control status and severity
As an additional sensitivity analysis, despite being out of the main scope of this work, we investigated the usefulness of using DNAm surrogate biomarkers in epidemiological studies on COVID-19 using the GSE174818 dataset (101 patients with COVID-19 infection and 26 controls hospitalised with respiratory problems). Specifically, we investigated the association of BMI and blood measured CRP with COVID-19 case-control status and severity (using the GRAM score as a proxy), and we compared the results with those obtained using their DNAm surrogates (DNAmBMI and DNAmCRP). Finally, since CVDs and COVID-19 share several risk factors [43] we investigated the association of the DNAmCVDscore with COVID-19 case-control status and severity. We used logistic and linear regression models adjusted for age and gender to investigate the association with case-control status and GRAM score, respectively.