DOI: https://doi.org/10.21203/rs.3.rs-2097550/v2
Background: The genome-wide assessment of the DNA methylome has revolutionized our comprehension of epigenome alterations linked to complex human traits and diseases. The ability of epigenome-wide association studies (EWAS) to translate into biologically meaningful results relies on detecting epigenomic signatures with a high level of statistical certainty. However, the classical analyses of EWAS are prone to statistical inflation and bias, leading to spurious associations, particularly in case series with small sample sizes, such as those analyzing patients with rare inherited disorders.
Methods: Based on the co-methylation pattern of CpG dinucleotides within the CpG islands, we propose the smoothing method at the genome-wide level through a sliding window approach to calculate and visualize data from EWAS to decipher the most informative epigenetic alterations of EWAS with a high degree of accuracy.
Results: The smoothing method is a simple method that identifies epigenomic signatures with a high degree of certainty while controlling the risk of spurious findings outside the significant loci at a genome-wide level. We have systematically compared the smoothing method with a classical supervised approach in several EWAS settings, including two monogenic epigenetic diseases (epi-cblC and primary constitutional MLH1epimutation) and epigenetic predictors of aging. In the latter example, we showed that the smoothing method remained efficient even after applying an 80% reduction of the original sample size.
Conclusions:
The smoothing method for DNA methylation analyses is based on the biological correlate of the epigenome structure and identifies highly accurate epigenomic signatures in DNA methylation analyses. Its application to several settings of epigenome-wide analyses confirmed its usefulness for deciphering the most informative epigenomic signatures with a high degree of certainty while controlling the risk of spurious findings outside the significant loci at a genome-wide level.
Our results suggest revisiting EWAS by applying the smoothing method to already available datasets to re-analyze and potentially identify highly accurate epigenomic signatures that could translate into biologically meaningful results.
The advent of genome-wide methods for assessing the epigenome landscape, notably the DNA methylome, has revolutionized our understanding of changes and alterations of epigenetic architecture in complex human traits and human diseases’, respectively [1]. Epigenome-wide association studies (EWAS) use microarrays to analyze the methylation of CpG dinucleotides (also called CpG probes) at the genome-wide level. The ability of EWAS to translate into biologically meaningful results relies on detecting epigenomic signatures with a high level of statistical certainty. However, EWAS may be prone to bias, potentially leading to spurious associations, notably in case series with small sample sizes [1, 2].
In a landmark study, Eckhardt et al. used bisulfite DNA sequencing to generate high-resolution methylation profiles of human chromosomes 6, 20, and 22, and showed that over 90% of CpG sites within 50 bases had the same methylation status [3–5]. Based on this concept of co-methylation pattern of CpG dinucleotides, notably within CpG islands, we proposed the smoothing method at the genome-wide level through a sliding window approach to calculate and visualize data from EWAS in order to decipher the most informative epigenetic alterations with a high degree of accuracy. We have illustrated the smoothing method by performing several EWAS from various settings, including two monogenic epigenetic diseases (epi-cblC and primary constitutional MLH1 epimutation) and an EWAS on epigenetic predictors of aging. To assess the magnitude of signal enhancement at the top significant loci and evaluate the benefit provided by the smoothing method, we used several metrics, including the signal-to-noise ratio (SNR), variance reduction, and variance ratio (VR), that demonstrated the method’s efficiency. Importantly, by applying the smoothing method, only the genomic regions carrying an aberrant epimutation related to monogenic disorders (e.g., MMACHC promoter CpG island and MLH1 CpG island) reached a top significant association without any other spurious association at the genome-wide level. Using the smoothing method in an EWAS of aging epigenetic predictors, we showed that it remained efficient even after applying an 80% reduction of the original sample size.
We carried out bisulfite conversion of 600 ng of DNA extracted from whole blood using the EZ DNA Methylation kit (Zymo Research, Proteigene, Saint-Marcel, France). The genome-wide profiling of DNA methylome was determined using the Infinium MethylationEPIC BeadChip array (Illumina, Paris, France), according to the manufacturer’s instructions. The Infinium MethylationEPIC BeadChip provides coverage of 850,000 CpG probes in enhancer regions, gene bodies, promoters, and CpG islands. The arrays were scanned on an Illumina iScan® system, and raw methylation data were extracted using Illumina’s Genome Studio methylation module. For each CpG probe, the methylation level was described as a β value, ranging between 0 (fully unmethylated CpG probe) and 1 (fully methylated CpG probe). Background correction and normalization were implemented using the SWAN method (R Package Minfi) [6]. Probe annotation information, including sequence and chromosome location for the Infinium MethylationEPIC BeadChip array, was retrieved from the Infinium MethylationEPIC v1.0 B5 Manifest File.
We studied 16 subjects with proven MMACHC epimutation. All exhibited aberrant methylation on the CpG:33 on the MMACHC gene promoter (Chr1: 45,965,587–45,966,049, GRCh37). Of the 16 patients with MMACHC epimutation, five were previously reported [7]. For the control group, we used in silico data generated using the Infinium Human Methylation 450k BeadChip array from the MARTHA cohort, which included 350 unrelated European-ancestry patients [8]. In the EWAS, we compared the mean β value of each CpG probe between cases and controls using a t-test.
We used DNA methylome data generated using the Infinium Human Methylation 450k BeadChip array using blood samples from 12 patients carrying MLH1 epimutation (CpG: 93, EPM2AIP1-MLH1), 61 patients with Lynch syndrome patients, and 41 healthy controls (GSE107351) [9]. In the EWAS, we compared the 12 MLH1 epimutation carriers (cases) with the 102 remaining patients without MLH1 epimutation from the series (controls). In the EWAS, we compared the mean β value of each CpG probe between cases and controls using a t-test.
We used DNA methylome data generated using the Infinium Human Methylation 450k BeadChip array using blood samples from 656 subjects (426 Caucasians and 230 Hispanics), aged 19 to 101 (GSE40279) [10]. We performed an EWAS using linear regression to look for epigenetic predictors associated with age as a continuous dependent variable after adjusting for gender and ethnicity. We selected the CpG probes with a rho Spearman’s correlation coefficient ≥ 0.5 in association with age from the top significant loci. We elaborated the “Polyepigenetic CpG aging score” by calculating the sum of the beta values corresponding to the top CpG probes (n = 15). We assessed the correlation between the “Polyepigenetic CpG aging score” and age using Spearman’s rank correlation coefficient. To assess the distribution of the “Polyepigenetic CpG aging score” in an independent dataset, we used DNA methylome data generated using the Infinium Human Methylation 450k from 20 newborns and 20 nonagenarians (ArrayExpress, GEOD-30870) [11]. We compared the median value of the “Polyepigenetic CpG aging score” between newborns and nonagenarians using the Wilcoxon-Mann-Whitney test.
We visually inspected the genome-wide distribution of the CpG probes according to their β value. We assessed population stratification according to the whole methylome landscape by performing numeric principal component analysis on normalized beta values of each CpG probe across the methylation array. The top ten principal components (eigenvectors) were calculated with their respective eigenvalue.
2.5. Description of the smoothing method for calculating and visualizing data from EWAS results using nominal –Log10( P -values)
The smoothing method relies on the DNA co-methylation pattern, which is the correlation between adjacent CpG sites. The first description of the co-methylation pattern of CpG sites was reported by Eckhardt et al. using bisulfite DNA sequencing to generate high-resolution methylation profiles of human chromosomes 6, 20, and 22, providing a resource of about 1.9 million CpG methylation values derived from 12 different tissues [3]. In this landmark study, over 90% of CpG sites within 50 bases had the same methylation status [3]. Taking into account this assumption, we proposed the conversion of the nominal –Log10(P-value) obtained for each CpG probe at the genome-wide level to smoothed –Log10(P-values) using a pre-specified window width (w) value ranging between 1 and 3 (Smoothed[Pk,w]) (Figure S1). We used a symmetric smoothing method so that the Smoothed(Pk,w) of each CpG probe corresponds to the mean of the –Log10(P-values) of the central CpG probe (k) and those located on either side of the central CpG probe within the limit of the pre-defined window width, through a sliding window approach, according to the following formula:
, where k corresponds to the central CpG probe and w to the number of CpG probes on either side of the central CpG probe (window width). The Infinium Human Methylation 450k BeadChip array encompasses 482,421 CpG probes and 26,658 CpG islands with an average number of 5.63 CpG probes per CpG island [5]. Considering this resolution, we estimated the optimal smoothing width at 2 since this value corresponds to a total number of 5 CpG sites (2w + 1) to calculate the Smoothed(Pk,w) value.
To assess the magnitude of signal enhancement at a given top significant locus, we used several metrics, including the signal-to-noise ratio (SNR), variance reduction, and variance ratio (VR). For a given top significant locus, the signal-to-noise ratio (SNR) was defined as the ratio of the power of a signal (meaningful information at the top locus) and the power of background noise (unwanted signal in the remaining genomic regions at the genome-wide level) according to the following formula [12]:
, where ‘µ (signal)’ corresponds to the mean of Smoothed(Pk,w) of the CpG probes located in the analyzed top significant locus and ‘σ (noise)’ to the standard deviation of Smoothed(Pk,w) of all the CpG probes located outside the analyzed top significant locus at the genome-wide level, using the same window width value (w).
We calculated the variance ratio according to the following formula:
, where ‘VAR (Smoothed[Pk,w]) (signal)’ corresponds to the variance calculated from the Smoothed[Pk,w] of the CpG probes located in the studied genomic region and ‘VAR (Smoothed[Pk,w]) (noise)’ to the variance calculated from the Smoothed[Pk,w] of all the CpG probes located outside the studied genomic region at the genome-wide level, using the same window width value (w).
We reported the results of epigenome-wide association studies using epi-Manhattan plots for nominal and smoothing transformed –Log10(P-values). In the epi-Manhattan illustrating the smoothing transformation of –Log10(P-values), we used (2×σ2) as a significance threshold value, where σ2 corresponds to ‘VAR (Smoothed[Pk,w]) (noise)’, as previously described. The smoothing method and all statistical analyses were performed using the SNP & Variation Suite (v8.8.1; Golden Helix, Inc., Bozeman, MT, USA) and MedCalc, version 19.5.3 (MedCalc Software, Ostend, Belgium).
DNA methylome analysis and quality controls
All the DNA methylome profiles of the 16 subjects with an MMACHC promoter epimutation (CpG island CpG:33) were of optimal quality and were used for statistical analyses. A total of 449,517 CpG probes were analyzed. As shown in the epi-Manhattan plot, the EWAS using untransformed –Log10(P-values) did not retrieve any top significant locus at the genome-wide level (Fig. 1A).
After applying a smoothing transformation at the genome-wide level using a window width of 1 (three CpG probes in each window), we observed a top significance at the MMACHC promoter CpG island (CpG:33) without any other DNA methylome signature at the genome-wide level. The SNRMMACHC/outside_MMACHC was increased by ~ 50% (SNR = 16.2) and the VRMMACHC/outside MMACHC increased by + 95% (VR = 12.7). Using this window width, the median Smoothed(Pk,1) (k: from cg03123370 to cg15896098) of the MMACHC promoter CpG island (CpG:33) was 161 (IQR, 129–177) vs. 9 (IQR, 5–15) for the genomic regions outside the CpG:33 of MMACHC (Tables 1–2 and Fig. 1B). The increase in VRMMACHC/outside_MMACHC was related to a -61% decrease in noise variance (outside CpG:33, MMACHC) that was concomitant with a decrease but to a much lesser extent (-23%) of the signal variance (CpG:33, MMACHC) (Table 1 and Fig. 2).
Minimum | Maximum | Mean | Median | IQR, 25th − 75th | Variance | SD | Variance reduction (%) | Signal-to-noise Ratio* | Variance Ratio † | |
---|---|---|---|---|---|---|---|---|---|---|
Genomic region: Outside CpG:33 (MMACHC) (449,507 CpG probes) | ||||||||||
Untransformed –Log10 P-value (window width = 0; 1 CpG probe) | 0 | 311 | 11 | 6 | 2–14 | 225 | 15 | — | — | — |
Smoothed –Log10 P-value (window width = 1; 3 CpG probes) | 0 | 255 | 11 | 9 | 5–15 | 89 | 9 | -60% | — | — |
Smoothed –Log10 P-value (window width = 2; 5 CpG probes) | 0 | 212 | 11 | 10 | 6–14 | 45 | 7 | -80% | — | — |
Smoothed –Log10 P-value (window width = 3; 7 CpG probes) | 0 | 209 | 11 | 10 | 7–14 | 37 | 6 | -84% | — | — |
Genomic region: CpG:33 (MMACHC) (10 CpG probes) | ||||||||||
Untransformed –Log10 P-value (window width = 0; 1 CpG probe) | 104 | 217 | 163 | 165 | 133–193 | 1467 | 38 | — | 10.9 | 6.5 |
Smoothed –Log10 P-value (window width = 1; 3 CpG probes) | 89 | 197 | 153 | 161 | 129–177 | 1128 | 34 | -23% | 16.2 | 12.7 |
Smoothed –Log10 P-value (window width = 2; 5 CpG probes) | 100 | 181 | 139 | 140 | 114–159 | 810 | 28 | -45% | 20.7 | 18.0 |
Smoothed –Log10 P-value (window width = 3; 7 CpG probes) | 103 | 167 | 135 | 131 | 121–150 | 417 | 20 | -72% | 22.1 | 11.2 |
Note. IQR: interquartile range; SD: standard deviation. | ||||||||||
* Signal-to-noise ratio is defined as the ratio of the power of a signal (meaningful information) and the power of background noise (unwanted signal) according to the following formula: SNR = Mean of –Log10(P-values) (CpG:33, MMACHC gene) / SD of –Log10(P-values) (outside CpG:33, MMACHC gene). | ||||||||||
† Variance-ratio = Variance of –Log10(P-values) (CpG:33, MMACHC gene) / Variance of –Log10(P-values) (outside CpG:33, MMACHC gene). |
CpG probe | Chr. | Position* | t-test P-value | Mean β-value, epi-cblC | Mean β-value, Controls | –Log10 P-value (Untransformed) (WW = 0) | Smoothed(Pk,1) (WW = 1) ‡ | Smoothed(Pk,2) (WW = 2) ‡ | Smoothed(Pk,3) (WW = 3) ‡ |
---|---|---|---|---|---|---|---|---|---|
cg03123370† | 1 | 45965343 | 1.43×10–133 | 0.51 | 0.11 | 133 | 89 | 100 | 119 |
cg04700938† | 1 | 45965449 | 4.98×10–130 | 0.44 | 0.06 | 129 | 122 | 108 | 122 |
cg13848568 | 1 | 45965625 | 3.54×10–105 | 0.36 | 0.04 | 104 | 127 | 131 | 120 |
cg00081251 | 1 | 45965679 | 5.99×10–148 | 0.51 | 0.07 | 147 | 151 | 149 | 137 |
cg09323228 | 1 | 45965727 | 2.02×10–203 | 0.57 | 0.05 | 203 | 181 | 156 | 160 |
cg27393325 | 1 | 45965846 | 7.18×10–194 | 0.56 | 0.02 | 193 | 176 | 169 | 167 |
cg22536808 | 1 | 45965870 | 1.03×10–133 | 0.40 | 0.00 | 133 | 170 | 181 | 152 |
cg14836864 | 1 | 45965990 | 7.49×10–184 | 0.49 | 0.03 | 183 | 178 | 160 | 141 |
cg03108114 | 1 | 45966048 | 2.77×10–217 | 0.59 | 0.02 | 217 | 197 | 131 | 125 |
cg15896098 | 1 | 45966115 | 2.11×10–192 | 0.56 | 0.03 | 192 | 136 | 105 | 103 |
Note. Chr.: chromosome; WW: window width. | |||||||||
* Genomic positions according to GRCh37 (hg19). | |||||||||
† The two CpG probes, cg03123370 and cg04700938, were in the vicinity of the CpG island:33 and were included in the analysis of the MMACHC locus. | |||||||||
‡ The median and IQR –Log10(P-values) much lower in the CpG probes outside the CpG:33, MMACHC gene were equal to 6 (IQR, 2–14), 9 (5–15), 10 (6–14), and 10 (7–14) for the untransformed –Log10(P-values) and smoothed –Log10(P-values) using a window width of 1, 2, or 3, respectively. |
The smoothing transformation using a window width of 2 (five CpG probes in each window) achieved a + 90% increase of the SNRMMACHC/outside_MMACHC (20.7) and a + 176% increase in the VRMMACHC/outside_MMACHC (18.8). Using this window width, the median Smoothed(Pk,2) of the MMACHC promoter CpG island (CpG:33) was 140 (IQR, 114–159) vs. 10 (IQR, 6–14) for the genomic regions outside the CpG:33 of MMACHC (Tables 1–2 and Fig. 1C). Using the window width of 2, the decrease in noise variance was − 80%, while that of the signal variance (CpG:33, MMACHC) was − 45% (Table 1 and Fig. 2).
The smoothing transformation using a window width of 3 (seven CpG probes in each window) provided a 103% increase of the SNRMMACHC/outside_MMACHC (22.1) and a + 73% increase in the VRMMACHC/outside_MMACHC (11.2) (Table 1 and Fig. 1D). Using this window width, the median Smoothed(Pk,3) of the MMACHC promoter CpG island (CpG:33) was 131 (IQR, 121–150) vs. 10 (IQR, 7–14) for the genomic regions outside the CpG:33 of MMACHC (Tables 1–2 and Fig. 1D). Using the window width of 3, the decrease in noise variance was − 84%, while that of the signal variance (CpG:33, MMACHC) was − 72% (Table 1 and Fig. 2).
The EWAS using untransformed –Log10(P-values) retrieved a significant association in the EPM2AIP1–MLH1 CpG island (CpG:93, gene promoter) (Fig. 3A) with a hemimethylated profile among patients harboring the CpG:93 epimutation when compared to controls who had a fully unmethylated CpG:93 (Fig. 3B). The smoothing transformation using a window width of 3 maintained the top significance at CpG:93, reduced the noise variance in the remaining genomic regions, and did not retrieve any other DNA methylome signature at the genome-wide level (Fig. 3A).
Using linear regression EWAS, we looked for epigenetic predictors associated with age as a continuous dependent variable after adjusting for gender and ethnicity. The EWAS using untransformed –Log10(P-values) found no clear association at the genome-wide level (Fig. 4A). After applying the smoothing transformation and a window width of 3, we retrieved a top significance at four loci: ZYG11A (CpG:112, gene promoter), FHL2 (CpG:81, gene promoter), ELOVL2/ELOVL2-AS1 (CpG:141 bidirectional promoter of the head-to-head sense-antisense gene pair), and KLF14 (CpG:156, gene promoter) (Fig. 4B). Fifteen CpG probes from the four top significant loci were highly and positively correlated with age (Supplemental Table 1). The ‘Polyepigenetic CpG aging score’ based on the 15 top CpG was highly and positively correlated with age (Spearman’s rho = 0.89; 95% CI, 0.88–0.90; P = 3.00×10–219) (Fig. 4C). Notably, on the four top loci retrieved in the present analysis, two were highlighted in the original published study (ELOVL2, KLF14), and two remaining loci have been very well documented in later studies as being involved in age-related epigenomic signatures [13–19]. We performed a sensitivity analysis on independent sub-samples by randomly selecting five subgroups, each of them representing 20% of the total population (Supplemental Table 2). Using a Smoothed(Pk,3) threshold of 10, all the five sensitivity analyses confirmed the FHL2, ELOVL2/ELOVL2-AS1, and KLF14 loci (Fig. 5 and Supplemental Fig. 2). Using the same Smoothed(Pk,3) threshold, the ZYGA11 locus was confirmed in three EWAS (Fig. 5 and Supplemental Fig. 2). In the replication EWAS, the Polyepigenetic CpG aging score differed significantly between newborns and nonagenarians with a median score of 2.00 (IQR, 1.89–2.08) and 5.60 (IQR, 5.52–5.72), respectively (P = 3.40×10–8) (Fig. 6).
The smoothing method for DNA methylation analyses is based on the biological correlate of the epigenome structure and identifies highly accurate epigenomic signatures in DNA methylation analyses. Its application to several settings of epigenome-wide analyses confirmed its usefulness for deciphering the most informative epigenomic signatures with a high degree of certainty while controlling the risk of spurious findings outside the significant loci at a genome-wide level.
Several analytical methods have been developed to identify epigenomic signatures in EWAS [20]. Among them, the supervised approaches based on differentially methylated region (DMR) calling and the intelligent classifiers represent the most widely used [21–32]. The DMR calling methods are based on CpG annotations and use conventional statistical approaches (e.g., Student’s t-test [33]) or more elaborated methods (e.g., beta-binomial regression fit [34] or Shannon entropy [35]). All these methods and algorithms are well detailed in the review by Fan et al. [26], and as suggested by the authors, the supervised methods are in line with the classical paradigms of study power and multiple testing correction [26]. The smoothing method utilizes the concepts of signal processing theory to mirror the biological reality of the DNA methylome architecture. Formerly, several smoothing methods have been developed as mathematical models for signal de-noising, notably in engineering, spectrometry, and image processing [36–40].
Jaffe et al. reported a ‘bump hunting’ approach for DNA methylome analysis that relies on the co-methylation pattern of CpG probes [22]. They modeled the methylation level of CpG probes as a smooth function of genomic position and they implemented a multistep approach to identify differentially methylated regions based on a logit-transformation of methylation levels of CpG probes depending on the outcome of interest, an estimation of smoothed beta-values, and statistical testing to assess the statistical uncertainty to each estimated differentially methylated region [22]. In this way, the smoothing method we propose in the setting of EWAS analyses overcomes both statistical power and adjustment to multiple testing constraints since it handles nominal P-values to reconstruct an updated epi-Manhattan plot after signal processing. We showed that by applying the smoothing method, we optimized the signal-to-noise ratio, which allowed highlighting potentially relevant epigenomic signatures. The optimization of the signal-to-noise ratio resulted from a downward shift (shift to the left) of the distribution of the –Log10(P-values) at the genome-wide level, allowing background noise to be reduced while maintaining the strength of the signal (Supplemental Fig. 3). Interestingly, the signal-to-noise ratio approach has been tested to uncover de novo locus-disease discoveries in genome-wide association studies. It was suggested as a potential alternative to multiple testing correction approaches that could result in an over-conservative correction and a reduced power to detect true associations [41, 42].
We demonstrated that the smoothing method effectively uncovers epigenomic signatures in monogenic epigenetic disorders with small sample size datasets but also in large sample size EWAS on complex phenotypes. Importantly, in the case of large datasets, the smoothing method provided similar results even after applying an 80% reduction of the original sample size. Our results suggest revisiting EWAS by applying the smoothing method to already available datasets to re-analyze and potentially identify highly accurate epigenomic signatures that could translate into biologically meaningful results.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and material
Competing interests
The authors declare no conflict of interest.
Funding
INSERM UMR_S 1256, Nutrition, Genetics, and Environmental Risk Exposure.
Authors’ contributions
Conceptualization: AO, DAT, JLG, Formal Analysis: AO, Methodology: AO, DAT, JLG, Supervision: AO, Validation: AO, DAT, JLG, Writing – original draft: AO, DAT, JLG, Writing – review & editing: AO, DAT, JLG.
Acknowledgements