2.1. Example #1: epimutation of the MMACHC promoter CpG island (CpG:33)
2.1.1. DNA methylome analysis and quality controls
We carried out bisulfite conversion of 600 ng of DNA extracted from whole blood using the EZ DNA Methylation kit (Zymo Research, Proteigene, Saint-Marcel, France). The genome-wide profiling of DNA methylome was determined using the Infinium MethylationEPIC BeadChip array (Illumina, Paris, France), according to the manufacturer’s instructions. The Infinium MethylationEPIC BeadChip provides coverage of 850,000 CpG probes in enhancer regions, gene bodies, promoters, and CpG islands. The arrays were scanned on an Illumina iScan® system, and raw methylation data were extracted using Illumina’s Genome Studio methylation module. For each CpG probe, the methylation level was described as a β value, ranging between 0 (fully unmethylated CpG probe) and 1 (fully methylated CpG probe). Background correction and normalization were implemented using the SWAN method (R Package Minfi) [6]. Probe annotation information, including sequence and chromosome location for the Infinium MethylationEPIC BeadChip array, was retrieved from the Infinium MethylationEPIC v1.0 B5 Manifest File.
2.1.2. Description of cases and controls and EWAS design
We studied 16 subjects with proven MMACHC epimutation. All exhibited aberrant methylation on the CpG:33 on the MMACHC gene promoter (Chr1: 45,965,587–45,966,049, GRCh37). Of the 16 patients with MMACHC epimutation, five were previously reported [7]. For the control group, we used in silico data generated using the Infinium Human Methylation 450k BeadChip array from the MARTHA cohort, which included 350 unrelated European-ancestry patients [8]. In the EWAS, we compared the mean β value of each CpG probe between cases and controls using a t-test.
2.4. Quality controls of methylome array data
We visually inspected the genome-wide distribution of the CpG probes according to their β value. We assessed population stratification according to the whole methylome landscape by performing numeric principal component analysis on normalized beta values of each CpG probe across the methylation array. The top ten principal components (eigenvectors) were calculated with their respective eigenvalue.
2.5. Description of the smoothing method for calculating and visualizing data from EWAS results using nominal –Log10( P -values)
The smoothing method relies on the DNA co-methylation pattern, which is the correlation between adjacent CpG sites. The first description of the co-methylation pattern of CpG sites was reported by Eckhardt et al. using bisulfite DNA sequencing to generate high-resolution methylation profiles of human chromosomes 6, 20, and 22, providing a resource of about 1.9 million CpG methylation values derived from 12 different tissues [3]. In this landmark study, over 90% of CpG sites within 50 bases had the same methylation status [3]. Taking into account this assumption, we proposed the conversion of the nominal –Log10(P-value) obtained for each CpG probe at the genome-wide level to smoothed –Log10(P-values) using a pre-specified window width (w) value ranging between 1 and 3 (Smoothed[Pk,w]) (Figure S1). We used a symmetric smoothing method so that the Smoothed(Pk,w) of each CpG probe corresponds to the mean of the –Log10(P-values) of the central CpG probe (k) and those located on either side of the central CpG probe within the limit of the pre-defined window width, through a sliding window approach, according to the following formula:
, where k corresponds to the central CpG probe and w to the number of CpG probes on either side of the central CpG probe (window width). The Infinium Human Methylation 450k BeadChip array encompasses 482,421 CpG probes and 26,658 CpG islands with an average number of 5.63 CpG probes per CpG island [5]. Considering this resolution, we estimated the optimal smoothing width at 2 since this value corresponds to a total number of 5 CpG sites (2w + 1) to calculate the Smoothed(Pk,w) value.
To assess the magnitude of signal enhancement at a given top significant locus, we used several metrics, including the signal-to-noise ratio (SNR), variance reduction, and variance ratio (VR). For a given top significant locus, the signal-to-noise ratio (SNR) was defined as the ratio of the power of a signal (meaningful information at the top locus) and the power of background noise (unwanted signal in the remaining genomic regions at the genome-wide level) according to the following formula [12]:
, where ‘µ (signal)’ corresponds to the mean of Smoothed(Pk,w) of the CpG probes located in the analyzed top significant locus and ‘σ (noise)’ to the standard deviation of Smoothed(Pk,w) of all the CpG probes located outside the analyzed top significant locus at the genome-wide level, using the same window width value (w).
We calculated the variance ratio according to the following formula:
, where ‘VAR (Smoothed[Pk,w]) (signal)’ corresponds to the variance calculated from the Smoothed[Pk,w] of the CpG probes located in the studied genomic region and ‘VAR (Smoothed[Pk,w]) (noise)’ to the variance calculated from the Smoothed[Pk,w] of all the CpG probes located outside the studied genomic region at the genome-wide level, using the same window width value (w).
We reported the results of epigenome-wide association studies using epi-Manhattan plots for nominal and smoothing transformed –Log10(P-values). In the epi-Manhattan illustrating the smoothing transformation of –Log10(P-values), we used (2×σ2) as a significance threshold value, where σ2 corresponds to ‘VAR (Smoothed[Pk,w]) (noise)’, as previously described. The smoothing method and all statistical analyses were performed using the SNP & Variation Suite (v8.8.1; Golden Helix, Inc., Bozeman, MT, USA) and MedCalc, version 19.5.3 (MedCalc Software, Ostend, Belgium).