Identification of differentially methylated regions in rare diseases from a single patient perspective

doi:10.21203/rs.3.rs-2084072/v1

Research Article

This work is licensed under a CC BY 4.0 License

Background:

DNA methylation (5-mC) is increasingly recognized as an alternative to the detection of sequence variants in the diagnosis of some rare neurodevelopmental and imprinting disorders. Identification of alterations in DNA methylation plays an important role in the diagnosis and understanding of the etiology of those disorders. Canonical pipelines for the detection of differentially methylated regions (DMRs) usually rely on inter-group (e.g. case versus control) comparisons. However, in the context of rare diseases and multi-locus imprinting disturbances, these tools might perform suboptimally due to small cohort sizes and inter-patient heterogeneity. Therefore, there is a need for a simple but statistically robust pipeline that allows scientists and clinicians to perform differential methylation analyses at the single patient level and to evaluate how parameter fine-tuning may affect the detection of differentially methylated regions.

Results:

In this paper, we describe an improved statistical method to detect differentially methylated regions in correlated datasets based on the Z-score and empirical Brown aggregation methods from a single-patient perspective. To accurately assess the predictive power of our method, we generated semi-simulated data using a public control population of 521 samples and assessed how the size of the control population, the effect size and region size affect DMRs detection. In addition, we have validated the detection of methylation events in patients suffering from rare multi-locus imprinting disturbance and discuss how this method could complement existing tools in the context of clinical diagnosis.

Conclusion:

We present a robust statistical method to perform differential methylation analysis at the single patient level, evaluate its optimal parameters to increase DMR identification performance, and show its diagnostic utility when applied to rare disorders.

DNA methylation

differentially methylated regions

rare diseases

imprinting

multilocus imprinting disturbance

statistical method

epivariation

Beckwith–Wiedemann syndrome

neurodevelopmental disorders

congenital disease

single patient

optimization

DNA methylation (DNAm) of cytosines (5-mC) plays an important role in cell biology, most notably in tissue-specific regulation of gene expression. Other roles include X-chromosome inactivation, regulation of splice junctions and genomic imprinting [1], [2]. Differential methylation of cytosines, or epivariation, has been linked to a wide array of conditions such as cancer, ageing, and metabolic, cardiovascular, neurodevelopmental and autoimmune disorders [3]–[7]. Differential methylation can occur either at single cytosines (DMCs) or affect several loci within a region, resulting in differentially methylated regions (DMRs). Depending on their origin, primary and secondary epivariations can be distinguished. Primary epivariations arise from stochastic errors in the establishment or maintenance of a methylation state by the DNA methyltransferase protein family. Secondary epivariations, by contrast, derive from genetic alterations such as copy number variations (CNVs) or single nucleotide variations (SNVs) at the differentially methylated locus, or from inactivating variants in trans-acting factors with a key role in the establishment or maintenance of the methylation state of that locus [8]. Both primary and secondary epivariations are found in patients suffering from rare diseases, a worldwide public health issue estimated to affect between 260 and 445 million people [9]. On the one hand, primary epivariations are the main molecular event causing some imprinting disorders [10], rare cases of cancer [11], [12] and neurodevelopmental diseases [13]. On the other hand, secondary epivariations are a known alternative mechanism in rare diseases, and the detection of these sequence variants has gained popularity in the diagnostic process. That is the case in the group of neurodevelopmental disorders known as the Mendelian disorders of the epigenetic machinery (MDEMs), for which detection of episignatures (i.e. groups of DMCs acting as a blueprint for the disease) has been shown to enable patient diagnosis [14]–[22], or in imprinting disorders [23]–[28], where DMRs are localized at imprinting control centers. Episignatures and DMRs at imprinting loci are usually linked to a single disease. However, it has been shown that MDEMs’ episignatures sometimes share overlapping DMCs [29], and there have been increasing reports of patients showing multi-locus imprinting disturbances (MLIDs). MLIDs represent rare cases of imprinting disorders characterized at the molecular level by several defects at imprinting regions [30]. Patients suffering from MLIDs often share overlapping phenotypes based on the imprinted regions showing defects [24], [28], [31]–[35]. As a consequence of this molecular and phenotypic heterogeneity, aggregating patients in groups is not always trivial.

Classical methods to identify differentially methylated regions and episignatures are usually based on inter-group comparisons, requiring a large number of samples in each group to reach statistically significant results [36], [37]. Those methods cannot be systematically applied in the context of rare diseases due to either the cohort size or the intra-group heterogeneity. It is especially the case when the disease affects only a handful of patients, hence making it difficult to gather cohorts large enough to satisfy canonical group-comparison method assumptions. In addition, group-comparison loses the ability to capture inter-patients’ heterogeneity, such as in MLIDs. Therefore, single patient-based analyses could alternatively be used to address those issues as well as support personalization of diagnosis.

In the literature, only two methods have been described for single case-control DNAm analysis. The first method is divided in two steps. First, the Crawford-Howell (C-H) adaptation of the t-test is used to detect differential methylation at individual CpGs. Then, individual scores are aggregated in a DMR score using the Fisher aggregation method [38]. The second method used in [13] has been developed following two empirical rules: (i) at least 3 probes that each have methylation levels above the 99.9th percentile of the control distribution for that probe and are ≥ 0.15 above the control mean; (ii) at least 1 probe with a methylation level ≥ 0.1 above the maximum observed in controls for that probe.

Although both methods allow the detection of biologically relevant DMRs, they present some limitations. In the first method, the Crawford-Howell test used for individual probes is recommended only when the normative sample size (i.e. the size of the control population) is less than 50 [39]. Above that threshold, the Z-score is preferred. In addition, the Fisher aggregation method used to combine individual scores (i.e. p-values) assumes independence between variables. This assumption does not hold in most large high-throughput biology datasets, in which variables are correlated. Indeed, it has been shown that closely located CpGs tend to be co-methylated [40]–[42]. In the second method, the empirical rules, while relevant, do not allow the ranking of candidate regions by a confidence score such as a p-value [44] and therefore lose the flexibility of applying a threshold for DMR calling. Finally, there is no evaluation of how the choice of parameters (e.g. number of probes, difference in methylation, cohort size) may affect DMR calling.
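The difference between the two single-case tests can be illustrated with a minimal sketch. It assumes the standard Crawford-Howell statistic, t = (x - mean) / (sd * sqrt((n + 1) / n)) with n - 1 degrees of freedom, and a plain Z-score against the control mean and sample standard deviation. The function names are hypothetical and are not taken from the authors' scripts.

```python
import numpy as np
from scipy import stats


def z_score_test(case_value, controls):
    """Two-tailed Z-score of one case against a control sample."""
    controls = np.asarray(controls, dtype=float)
    z = (case_value - controls.mean()) / controls.std(ddof=1)
    return z, 2.0 * stats.norm.sf(abs(z))


def crawford_howell_test(case_value, controls):
    """Two-tailed Crawford-Howell t-test of one case against a small control sample."""
    controls = np.asarray(controls, dtype=float)
    n = controls.size
    # The sqrt((n + 1) / n) term accounts for the uncertainty of the control mean.
    t = (case_value - controls.mean()) / (controls.std(ddof=1) * np.sqrt((n + 1) / n))
    return t, 2.0 * stats.t.sf(abs(t), df=n - 1)
```

With a handful of controls the Crawford-Howell p-value is noticeably larger than the Z-score p-value for the same case; as n grows the two statistics converge.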

Therefore, in this paper, we propose a statistical method based on the Z-score followed by the Empirical Brown method that takes into account covariance between variables [43] to identify DMRs in a single-patient setting. First, we characterize the behavior of CpGs methylation status in various regions of biological interest and show that CpGs display a high correlation in those regions, thus justifying the use of Brown’s aggregation method to assign a DMR score. Second, we investigate how different parameters such as the size of the control population, the amplitude of the methylation difference and the size of the regions affect the performance of this method for DMRs identification. In addition, we show the diagnostic utility of this method in the context of MLIDs and other neurodevelopmental disorders and congenital anomalies (ND-CAs), as well as its potential to identify new epivariants in existing datasets from a single-patient perspective.
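As an illustration of the aggregation step, the following is a minimal sketch of the Empirical Brown method as we understand it from [43]: control values of each CpG are mapped to -2·log of their empirical CDF, the covariance of these transformed values rescales Fisher's chi-square statistic, and the result falls back to Fisher's method when no positive covariance is detected. The function names and the input layout (one row per CpG, one column per control) are assumptions made for this example.

```python
import numpy as np
from scipy import stats


def _neg2_log_ecdf(values):
    """Map each value to -2 * log(ECDF(value)), with a right-inclusive ECDF."""
    values = np.asarray(values, dtype=float)
    ecdf = np.searchsorted(np.sort(values), values, side="right") / values.size
    return -2.0 * np.log(ecdf)


def empirical_brown(control_matrix, p_values):
    """Combine per-CpG p-values while accounting for correlation among CpGs.

    control_matrix: rows are CpGs of one region, columns are control samples.
    p_values: one p-value per CpG (same order as the rows).
    """
    control_matrix = np.asarray(control_matrix, dtype=float)
    p_values = np.asarray(p_values, dtype=float)
    k = p_values.size
    w = np.vstack([_neg2_log_ecdf(row) for row in control_matrix])
    cov = np.atleast_2d(np.cov(w))
    cov_sum = np.triu(cov, k=1).sum()      # sum of covariances over CpG pairs
    expected = 2.0 * k
    variance = 4.0 * k + 2.0 * cov_sum
    scale = variance / (2.0 * expected)
    df = 2.0 * expected ** 2 / variance
    if df > 2.0 * k:                       # no positive covariance: Fisher's method
        df, scale = 2.0 * k, 1.0
    statistic = -2.0 * np.log(p_values).sum()
    return stats.chi2.sf(statistic / scale, df)
```

For strongly co-methylated CpGs the combined p-value is far less extreme than the one Fisher's method would return for the same inputs.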

Characterizing CpGs methylation within normal population

DNA methylation analyses are highly dependent on the control population used. Therefore, we decided to characterize the behavior of CpG methylation within our control population of 521 unaffected individuals (GEO accession number: GSE152026). In the same way as DNA sequence variants, it is easier to infer the significance of an epivariant when it is located in a region with a known function [44]. Thus, we focused on several regions known for their biological functions typically investigated in DNAm analysis: predicted CpG islands (CGIs); known imprinted regions; Fantom 5 enhancers; cis-regulating elements from the Encode project; genes associated to rare diseases from the Orphanet database (see Methods).

First, we wanted to assess whether methylation between pairs of CpGs is correlated within those regions. Indeed, the canonical way to identify DMRs aims at aggregating p-values of CpGs tested for differential methylation individually. As discussed previously, Fisher’s aggregation method has been the method of choice. However, this method assumes independence between variables. Thus, we computed the Pearson correlation across all the samples for CpG pairs in regions of biological interest as a function of the distance between the two CpGs forming the pair (Fig. 1a). We confirmed previous findings [40]–[42] showing that closely located CpGs showed co-methylation and that this correlation declines with distance in all of the regions of interest. In all regions, correlation continuously decreased until a distance of 800bp with the sharpest drop occurring in imprinted regions, and the lowest in CGIs. Furthermore, we showed that mean correlation levels were higher in imprinted (Mean r = 0.33) and enhancer regions (Mean r = 0.31), lower in CGIs (Mean r = 0.19) and close to zero in Orphanet genes (Mean r = 0.05) (Supp. Figure 1a).

In a second step, we investigated how methylation levels vary at the single-CpG level.

In order to do this, we calculated the standard deviation and Shannon’s entropy of the CpGs beta values (i.e. methylation percentage) within the same loci of interest (see Methods). Those two measures are complementary and give an indication about the stability of a given CpG’s methylation state within the normal population. The higher the entropy, the less stable is the methylation level for the tested element. Our results indicate that CpGs methylation levels are stable within the tested regions. Indeed, mean standard deviation of the beta value is under 4% in all groups (Supp. Figure 1b). We also show that overall mean entropy levels are low with CGIs having the lowest mean entropy (Mean entropy < 0.2), which highlights a high level of consistency in methylation (Fig. 1b). At the epigenome level, mean entropy was 0.16 and mean standard deviation 3%. We could not detect any significant changes in those parameters at the chromosome level (Supp. Figure 2a and 2b). This highlights a stable distribution of methylation levels at the CpG level within our control population.
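The entropy measure can be sketched as follows, using the 10 equal-width bins over [0, 1] and the scipy and numpy functions named in the Methods. Dividing by log(10) to bring the value into the 0 to 1 range is our assumption about how the normalization was done, and the function name is hypothetical.

```python
import numpy as np
from scipy import stats


def normalized_entropy(betas, n_bins=10):
    """Shannon entropy of one CpG's beta values across controls, scaled to [0, 1]."""
    counts, _ = np.histogram(betas, bins=n_bins, range=(0.0, 1.0))
    # scipy normalizes the counts to probabilities and uses the natural log.
    return stats.entropy(counts) / np.log(n_bins)
```

A CpG whose beta values all fall in one bin gets an entropy of 0, while a CpG spread evenly over the ten bins gets an entropy of 1.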

Optimizing parameters for DMR identification in single patient

As mentioned in the introduction, our method to assign a confidence score to differentially methylated regions (DMRs) combines two steps: the Z-score, chosen over the Crawford-Howell method because our normative population was large (N = 521), and Brown’s aggregation method, which takes into account the interdependence of methylation levels between adjacent CpGs. After defining the statistical bases of our method, we sought to assess how different parameters associated with DMR detection would influence the score of a region. Because it is difficult to establish which signal is false in real data, we decided to use a semi-simulated approach based on a population of unaffected individuals (see Methods). This strategy enabled us to define true DMRs and false signal, which we considered as background noise, and allowed the use of standard performance metrics such as the area under the precision-recall curve (AUC) to evaluate the influence of several parameters. First, we tested how the difference in methylation levels between a sample of interest and the control population would affect the outcome of the scoring method. We performed this analysis on two datasets in which we introduced either a low noise level (5%, Fig. 2a) or a high noise level (10%, Fig. 2b). Then, in those noisy datasets, we assessed how the method performed in detecting increasing true methylation differences. As expected, performance was poor when trying to detect a small methylation effect of only 5% relative to the noise (signal of 10%, low noise: mean AUC = 0.77; signal of 15%, high noise: mean AUC = 0.69). The method performed better as the methylation effect increased. Indeed, at 10% of relative methylation difference, mean AUC for the low noise data and the high noise data were 0.89 and 0.83 respectively, and a mean AUC over 0.95 was obtained with a methylation defect of 15% for the data with low noise against 20% for the noisier one. 
In the subsequent analyses we decided to use the low noise setting (5% of noise level) and introduced a 30% shift in methylation as true signal to evaluate the influence of other parameters. Precision/Recall curves as well as the AUC of the true and false positive rate are available in the supplementary data (Supp. Figures 3–4).

Next, we assessed how the number of modified CpGs within a window would affect its score. In the literature, it is commonly accepted to use windows of 1000bp containing a minimum of 3 CpGs when looking for DMRs [13], [38], [45], [46]. However, strong arguments for the choice of this parameter are lacking. Therefore, we evaluated the detection of DMRs using windows of increasing size, from 1 to 7 CpGs (Fig. 2c, Supp. Figure S5). While the AUC for precision and recall was high for all window categories (> 0.96), performance tended to increase with the number of CpGs and approached a plateau around 0.995 when the number of CpGs per window was ≥ 4.

Finally, we tested how the size of the control population would influence performance by comparing the semi-simulated data against control populations of increasing size, from 5 to 509 samples (Fig. 2d, Supp. Figure S6). Although overall performance was good (AUC > 0.89), we observed significant improvements as the size of the control population increased up to 30 controls. The largest differences were seen from 5 to 10 controls (mean AUC from 0.904 to 0.980) and from 10 to 20 controls (mean AUC from 0.980 to 0.988). Larger control populations yielded smaller gains in performance under this high signal-to-noise setting (30% signal, 5% noise).

Identification of DMRs in Beckwith-Wiedemann patients

After defining optimum parameters, we sought to evaluate the performance of the method for DMR identification on real patient data. We performed the methylome analysis on 5 patients suffering from Beckwith-Wiedemann syndrome (BWS) who also showed multilocus imprinting disturbances (BWS-MLID, GEO accession numbers GSE133774 and GSE153211). Because controls (N = 27) from the same batch were available, we decided to compare the scoring of DMRs using the Crawford-Howell method with batch-matched controls and the Z-score with a larger population of controls (N = 521) from another batch (GEO accession number: GSE152026). This allowed us to evaluate whether, in the context of single-patient analysis, one should gather a small control cohort (N < 50) obtained at the same facility or whether using a larger cohort of publicly available controls yields better results. To reduce batch effects, we used a modified version of BMIQ [47] described in [48] to rescale the distribution of methylation values between patients and controls (see Methods). Rescaling efficiency was evaluated by inspecting the distribution of probe methylation levels (Supp. Figure S7). In order to compare the two tests, we investigated the aggregated p-value of known imprinted regions and checked whether the regions detected with our method were also retrieved in the original papers [24], [34] (Supp. File 1, Table 2). Across the 5 samples and out of the 48 imprinting loci tested in the original paper, we found 21 to be under the 0.05 corrected p-value significance threshold with the Crawford-Howell method, versus 49 with the Z-score. Of these, 19 DMRs identified with one method were also significant using the other (representing 90% of the 21 found with C-H and 38% of the 49 found with the Z-score). Those numbers were lower at the 0.01 threshold (C-H = 17 regions, Z-score = 46 regions), with 15 DMRs deemed significant by both methods. 
One of the typical molecular defects of BWS involves loss of methylation at the KCNQ1OT1:TSS-DMR locus and normal methylation at H19/IGF2 IG-DMR; this pattern was identified in the patients in the original papers through molecular testing. Using our single-patient approach, the KCNQ1OT1:TSS-DMR locus was considered significantly differentially methylated (p-value < 0.01) in all patients only when using the Z-score, suggesting a higher sensitivity in comparison to the Crawford-Howell test. Visualization of the methylation profile in that region showed that this effect is due to the high variability of the small control population used for the C-H test (Fig. 3). In addition to this locus, several other known imprinting regions were found to be significantly differentially methylated in the patients using the Z-score, thus confirming the previously established MLID diagnosis and the capacity of the method to identify regions of interest (Supp. File 2).

Then, we performed a whole-epigenome DMR scan of the BWS patients in order to identify new DMRs outside of canonical regions. We analyzed only windows containing at least 4 CpGs (see Methods), as this provided the optimal performance on simulated data, and applied a strict threshold of 10% on the median difference between the patient CpG methylation levels and the mean methylation in the control population. Except for one patient (GEO accession number: GSM4635795), in whom we found 111 DMRs, we identified fewer than 10 DMRs in each of the other patients, for a total of 148 additional significant DMRs (Supp. Figure S8 a-f, Supp. File 3 Table 3). Out of those, 2 regions were found hypermethylated in all patients, encompassing the promoters of the ABCD1P4 (NCBI entry: 26957) pseudogene and of the AC093787.2 long non-coding RNA. Additionally, one DMR in the CpG island within the protein-coding gene TNNT3 (NCBI entry: 7140) was found in 4 of the patients, and 4 DMRs were found in two patients, two of those in the protein-coding genes COL18A1 (NCBI entry: 80781) and ANK1 (NCBI entry: 286). Interestingly, among the DMRs identified outside canonical imprinting regions, some were located in genes with known relationships to congenital and neurodevelopmental diseases, and further investigation would help to better characterize the impact of DMRs in such genes.

Identification of DMRs in ND-CAs patients

To further evaluate our method, we applied the same analysis procedure to methylation data of 489 individuals suffering from neurodevelopmental disorders (NDs) and congenital anomalies (CAs) described in [49]. DMRs were originally identified in that cohort using the empirical method described in the introduction. Using our method, we found a total of 4292 DMRs in 296 patients (i.e. 60% of the patients tested), with most patients having fewer than 3 DMRs (75th percentile) (Fig. 4a). Similarly to the original paper, we decided to remove samples having more than 10 DMRs. Doing so yielded 519 identified DMRs in 270 patients (i.e. 55% of the patients tested); 55 of those DMRs were already described in the original paper (i.e. 38% of the DMRs identified in [49]) (Supp. File 4). Among the 519 identified DMRs, 269 were present in at least two samples (i.e. 51% of the total in patients with fewer than 10 DMRs), mapping to 76 genes. At the gene level, we identified 31 DMRs in genes affected in more than two samples (representing 41% of all the affected genes). The most affected gene was GSDMD, which showed significant hypermethylation in 17 patients (Fig. 4b); the second was ECEL1P2, a pseudogene hypomethylated in 15 patients. GSDMD has been linked to neonatal-onset multisystem inflammatory disease (NOMID) in mice [50]. According to the rare disease database, NOMID symptoms include cognitive disabilities. No existing data point to a disease association in the case of ECEL1P2. Interestingly, we could detect a DMR in PRDM16 (Fig. 4c), a gene shown to be involved in cardiomyopathy [51], which was consistent with the symptoms experienced by the patients (GEO IDs: GSM2366439, GSM2366759, GSM2366459, GSM2366724).

In the context of rare disorders affecting the epigenome, classical case-control studies are not always applicable. In addition, it has been shown that individuals with overlapping phenotypes suffering from multilocus methylation disturbances (MLMDs) show unique methylation patterns that could be used to further refine the clinical diagnosis [10], [15], [17], [52]–[54]. Previously, two methods have been proposed to detect aberrant methylation in cases using a single-patient approach, one of them based on statistical testing [13], [38]. In this paper, we built on those previous methodologies to propose a comprehensive single-patient based method for DNA methylation analyses. First, we confirmed previous findings that methylation levels of CpGs within close distance are correlated [40]–[42] and observed a constant decrease in correlation until 800bp (Fig. 1a). We further showed that there were positive mean correlation levels between CpGs located within CpG islands, known imprinted regions, Fantom 5 enhancers, cis-regulating elements from the Encode project and, to a lesser extent, in genes associated with rare diseases from the Orphanet database (Supp. Figure 1a). Those results indicate that the assumption of independence made by Fisher’s method does not hold and that it should be replaced by a method taking this interdependence into account when aggregating scores of individual CpGs into DMRs. We suggest the use of Brown’s aggregation method implemented in [43]. We further showed that CpGs in a population of unaffected individuals have a high stability, as illustrated by the low entropy and standard deviation observed (Fig. 1b, Supp. Figure 1b). That stability was seen throughout the entire epigenome (Supp. Figure 2a and 2b). This low variability can have an impact on DMR identification capacity. 
Indeed, extremely low standard deviation values can be caused by poor sampling of the normative population, which can lead to extremely significant scores for a CpG even if the difference in methylation is low. Thus, we used semi-simulated data to quantify how methylation difference, amongst other parameters, would influence the performance of DMR identification. The method showed satisfactory performance when the methylation difference in the DMRs was at least 10% for both low and high noise data, but better performance was achieved at 15% difference and above (Fig. 2a and 2b). Therefore, we advise applying a regional threshold of at least 10% on the median difference in methylation between the controls and the case. In addition to the effect size, we investigated the influence of the number of CpGs per window. Common DMR identification methods use windows of at least 3 CpGs [13], [38]. Precision-recall AUC values starting at one CpG were already in a very good range, and we observed increasing performance that plateaued at 4 CpGs (Fig. 2c). We concluded that every window size tested (≥ 1 CpG) is acceptable in terms of performance but warn about the analysis burden that smaller window sizes generate. We thus decided to use ≥ 4 CpGs for the subsequent analyses. Then, we tested for the minimum number of samples that should be included in the control population (Fig. 2d). Based on our results, we suggest using a control population of at least 30 samples. However, due to sampling bias, we believe that a larger control cohort will generally yield fewer false positives. Nevertheless, our semi-simulated data present some limitations. Indeed, we could not account for batch effects that are present when using a different cohort as controls, and the way we modelled DMRs may not reflect the full range of biological variations occurring in various syndromes. 
In addition, our analysis was based on a strong signal of 30% to find the best value for the size of the control population and the number of CpGs per window. We seldom encountered DMRs with a signal as strong as 30% in patient data and thus speculate that the performance measured using semi-simulated data is probably overestimated. However, trends in performance are still a good indication that a larger control population and a larger number of CpGs per window will yield better results, hence our suggestion to use a control population as large as possible and to focus on windows containing at least 4 CpGs. We also compared the use of a batch-matched cohort (N = 27) against a larger cohort (N = 521) from another batch, using data of 5 patients diagnosed with Beckwith-Wiedemann syndrome and MLIDs. In the context of the two control populations used here, we showed that using the Z-score with a larger cohort outperformed the Crawford-Howell t-test with a smaller, although batch-matched, control population in the ability to retrieve the undermethylation of the KCNQ1OT1 region (Fig. 3). However, we want to underline the necessity to correct the batch effect prior to this comparison. We rescaled the global distribution of methylation levels using an adapted version of the BMIQ software [47], [48] (Supp. Figure 7). This method allows the use of a gold standard to rescale new samples and is thus very well suited to single-patient analysis, where individual samples can all be normalized against the same standard. Using this method allowed us to greatly improve the outcome of DMR analysis for the non-matching batch cohort. Therefore, with the greater availability of DNAm data, we suggest gathering a meta-cohort, if possible complemented with batch-matched controls. Finally, we applied this method to detect additional DMRs in the same patients at the whole-epigenome level (Supp. Figure 8). We were able to identify DMRs that were not reported previously. 
Among those new differentially methylated regions, two were present in all BWS-MLID patients and implicated genes that should be studied further in the context of BWS-MLID. In addition, we applied our method to a previously described cohort of 489 undiagnosed ND-CAs patients. Similarly, we identified new DMRs of interest in several patients. Although additional research would be needed to assign any role to those DMRs in the symptoms experienced by patients, we believe that our method of analysis allowed a greater characterization of their DNA methylation landscape and showed promising results for understanding the molecular mechanisms at play.

In conclusion, we described the optimal parameters of an improved single-patient based method to detect differentially methylated regions to increase its utility and reliability in a diagnostic setting.

Cohorts

Illumina EPIC data were retrieved for GSE152026, GSE133774 and GSE153211. IDAT files were available for GSE133774 and GSE153211 and were preprocessed with the R minfi package. Cross-reacting probes, probes containing SNPs and probes with a detection p-value > 0.01 were removed using minfi functions, and samples were normalized using minfi’s quantile normalization. Probes from sex chromosomes were removed from the analysis, leaving 830257 probes. Beta values from Illumina 450k data of GSE89353, GSE36064, GSE40279, GSE42861, and GSE53045 described in [49] were retrieved. Only the 370065 overlapping probes were used for the analysis. Beta values were rounded to 3 digits. Rounded Beta values were used for batch correction (see Batch correction). Logit-transformed Beta values (= M values) were used for all statistical analyses. Genome versions hg38 and hg19 were used for the annotation of Illumina EPIC and 450k data, respectively.
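The Beta-to-M conversion can be sketched as below. The base-2 logit is the usual convention for M values and is assumed here, as is the clipping constant `eps`, which keeps Beta values rounded to exactly 0 or 1 from producing infinite M values. Neither detail is specified in the text, and the function names are hypothetical.

```python
import math


def beta_to_m(beta, eps=1e-3):
    """Convert a Beta value in [0, 1] to an M value (base-2 logit)."""
    b = min(max(beta, eps), 1.0 - eps)  # clip so that 0 and 1 stay finite
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transformation, from M value back to Beta value."""
    return 2.0 ** m / (2.0 ** m + 1.0)
```

A Beta value of 0.5 maps to M = 0, and the M scale is symmetric around that point, which is one reason it is generally preferred for statistical testing.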

Characterizing CpGs within normal population

The 521 control samples from GEO dataset GSE152026 were used to characterize probes present on the Illumina EPIC array. Several annotation files in bed format were retrieved from the UCSC table browser using the hg38 version of the genome. Those annotations included Orphanet genes [9], [55], CpG Islands (this track was generated using a modification of a program developed by G. Miklem and L. Hillier (unpublished)) and Encode candidate cis-regulatory elements (based on ENCODE data released on or before September 14, 2018) [56]. CpGs within imprinted regions were selected based on the research in [57]. Fantom5 enhancers were downloaded from the Zenodo database [58].

Correlation between pairs of CpGs was calculated using Pearson’s correlation. We used Fisher’s z-transformation to calculate mean correlation: individual correlation coefficients were transformed in Z-scores prior to mean calculation, then mean Z-scores were transformed back into mean correlation. Shannon’s entropy was calculated using the entropy function in Python scipy.stats package, and the histogram function from numpy package, by binning CpGs beta values in 10 bins from 0 to 1, and default parameters. Entropy was normalized to vary between 0 and 1.
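The averaging of correlation coefficients through Fisher's z-transformation can be sketched in a few lines. The function name and the `clip` guard, which avoids an infinite value when a coefficient is exactly ±1, are assumptions added for this example.

```python
import math


def mean_correlation(correlations, clip=0.999999):
    """Average Pearson coefficients via Fisher's z-transformation."""
    # atanh is Fisher's z-transformation; tanh maps the mean z back to r.
    z_values = [math.atanh(max(-clip, min(clip, r))) for r in correlations]
    return math.tanh(sum(z_values) / len(z_values))
```

Because the transformation stretches values near ±1, the result differs from the arithmetic mean: averaging r = 0.1 and r = 0.9 this way gives about 0.66 rather than 0.5.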

Semi-simulated data

Generation of control population datasets, windows, beta-value shifts, p-values per window and performance parameters was performed using in-house Python3 scripts and the libraries numpy, scipy, pandas and statsmodels, as well as in-house R 4.1.1 scripts with the packages reshape2 and data.table. We selected 10 random samples from the control population (GEO accession number: GSE152026) to be modified and compared to the rest of the controls. To evaluate the influence of the size of the control population, the remaining control population was progressively divided into smaller datasets (N = 100, 50, 40, 30, 20 and 10). We defined windows of CpGs using the Illumina v1.0 B5 annotation from the Illumina website (https://emea.support.illumina.com/downloads/infinium-methylationepic-v1-0-product-files.html). ChrX, ChrY, ChrM and individual probes with missing information about chromosome, position or strand (hg38 version) were removed. Adjacent probes were aggregated into non-overlapping windows using a fixed number of CpGs and a maximum window size of 1000bp. The number of CpGs per window ranged from 1 to 7 to assess the influence of window size; otherwise it was 4. To avoid any effect due to genomic location, chromosomes were segmented into 1000 regions of equal size, and 1/10 of these regions were selected for modification. Windows overlapping these regions were selected for modification. Selected windows with missing beta-values (for at least one probe and at least one control) were removed (Supp. Table S1). To mimic noise in our data, a shift in beta value was applied to the probes in all regions. The shift applied (x%) per probe was sampled from a Gaussian distribution with mean = x and standard deviation = 0.5%, and was either low (x = 5%) or high (x = 10%). The same principle was applied when signal (5, 10, 20, 30 or 40% of methylation) was inserted for DMR identification and evaluation of the effect size. 
To avoid negatives beta value and beta value over 1, we added and subtracted signal when the beta value was over and under 50% respectively.
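The window construction and the beta-value shift can be sketched in Python as follows. This is an illustration under stated assumptions rather than the in-house scripts: the function names are hypothetical, the greedy left-to-right grouping is one possible way to obtain non-overlapping windows, the final clipping is a safeguard added here, and the direction of the shift (subtracting above 0.5, adding below) is chosen so that values stay within [0, 1], which is the stated aim.

```python
import numpy as np


def fixed_size_windows(positions, n_cpgs=4, max_span=1000):
    """Group probe positions of one chromosome into non-overlapping windows.

    Each window holds exactly n_cpgs adjacent probes spanning at most
    max_span bp (greedy grouping, assumed for this sketch).
    """
    positions = sorted(positions)
    windows, i = [], 0
    while i + n_cpgs <= len(positions):
        chunk = positions[i:i + n_cpgs]
        if chunk[-1] - chunk[0] <= max_span:
            windows.append(chunk)
            i += n_cpgs  # next window starts after this one
        else:
            i += 1       # slide by one probe and try again
    return windows


def shift_beta(beta, mean_shift, sd=0.005, rng=None):
    """Apply a Gaussian-distributed shift (in beta units) to each probe."""
    rng = np.random.default_rng() if rng is None else rng
    beta = np.asarray(beta, dtype=float)
    shift = rng.normal(loc=mean_shift, scale=sd, size=beta.shape)
    # Move values toward the centre of the [0, 1] interval.
    shifted = np.where(beta > 0.5, beta - shift, beta + shift)
    return np.clip(shifted, 0.0, 1.0)  # safeguard, not described in the text
```

For example, `shift_beta(window_betas, 0.05)` would mimic the low noise level and `shift_beta(window_betas, 0.20)` a 20% signal, where `window_betas` is a hypothetical array of beta values for the probes of one window.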

Batch correction

Batch correction was applied by rescaling the distribution of beta values using the adapted BMIQ [47] function described in [48], with default parameters except nfit = 820000 (BWS analysis) or nfit = 415000 (ND-CAs analysis) and th1.v = c(0.10, 0.60). Rescaling is performed relative to a reference sample. For the BWS analysis, the reference was defined as the mean beta value of samples from either GSE152026 (when testing with the Z-score) or GSE153211 (when testing with C-H). For the ND-CAs samples, the reference was the mean beta value of samples from GSE42861. All samples were normalized to the corresponding reference in their respective analysis.

DMR identification

CpGs in BWS and ND-CAs samples were tested individually for differential methylation against a control population, using either a two-tailed Z-score or a two-tailed Crawford-Howell t-test [39] with the Python scipy.stats library. P-values obtained from the Z-score were adjusted for multiple testing by the Bonferroni method (using the array size as the number of tested CpGs). DMRs were defined by a rolling window approach of 1000 bp containing at least 4 CpGs; the resulting overlapping windows were then merged. P-values for CpGs within the same window were aggregated using Brown’s aggregation method described in [43]. Significant DMRs were defined as having an aggregated P-value < 0.01 and a median difference in methylation of at least 10% with respect to the controls. Statistical testing was always performed on M-values rather than beta values because of their statistical properties.
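The per-CpG tests and the aggregation step can be outlined with the following Python sketch. It is an illustration, not the pipeline itself: the function names are hypothetical, the clipping of beta values before the M-value conversion is an assumption, and the aggregation follows the empirical Brown procedure as we understand it from [43] (covariances estimated from the controls after a −2·log(ECDF) transformation, with a fall-back to Fisher's method).

```python
import numpy as np
from scipy import stats


def beta_to_m(beta, eps=1e-6):
    """Convert beta values to M-values; eps clipping is an assumption."""
    b = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(b / (1.0 - b))


def single_sample_pvalues(patient_m, controls_m, method="z"):
    """Two-tailed p-value per CpG; controls_m has shape (n_controls, n_cpgs)."""
    patient_m = np.asarray(patient_m, dtype=float)
    controls_m = np.asarray(controls_m, dtype=float)
    n = controls_m.shape[0]
    mean = controls_m.mean(axis=0)
    sd = controls_m.std(axis=0, ddof=1)
    if method == "z":
        z = (patient_m - mean) / sd
        return 2.0 * stats.norm.sf(np.abs(z))
    # Crawford-Howell: t statistic with n - 1 degrees of freedom
    t = (patient_m - mean) / (sd * np.sqrt((n + 1.0) / n))
    return 2.0 * stats.t.sf(np.abs(t), df=n - 1)


def empirical_brown(p_values, controls_m):
    """Combine dependent p-values of one window using control covariances."""
    controls_m = np.asarray(controls_m, dtype=float)
    n, n_cpgs = controls_m.shape
    # Transform each CpG column: w = -2 * log(ECDF(x))
    ecdf = stats.rankdata(controls_m, method="max", axis=0) / n
    w = -2.0 * np.log(ecdf)
    cov = np.atleast_2d(np.cov(w, rowvar=False))
    expected = 2.0 * n_cpgs
    variance = 4.0 * n_cpgs + 2.0 * np.sum(np.triu(cov, k=1))
    scale = variance / (2.0 * expected)
    df = 2.0 * expected ** 2 / variance
    if df > 2.0 * n_cpgs:  # no positive dependence: use Fisher's method
        df, scale = 2.0 * n_cpgs, 1.0
    x = -2.0 * np.sum(np.log(p_values))
    return float(stats.chi2.sf(x / scale, df))
```

In this sketch, strongly correlated CpGs reduce the effective degrees of freedom, so the combined P-value is less extreme than it would be under Fisher's method, which assumes independence.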

Ethics approval

Not applicable.

Competing interests

The authors declare no competing interests.

Authors’ contributions

RG and AH developed the method, and analysed and interpreted the data. RG, AH and MD supervised the project and contributed to writing the manuscript. GS provided ideas, support and in-depth review of the manuscript. CO and SVD regularly discussed the results obtained with the developed tool with RG, AH and MD. All authors read and approved the final manuscript.

Funding

Part of this research was financed by Innoviris in the context of the BRIDGE IGencare project 2017-PFS-11e IgenCare (ULB) and BRGIMP12 (VUB). RG is a PhD fellow of the Fonds pour la Formation à la Recherche dans l’Industrie et l’Agronomie (F.N.R.S-FRIA, Belgium) (1.E.013.20F).

Availability of data and materials

The data supporting the findings are available within the article and its supplementary materials. Public datasets used in the study are described in the Methods section.

Acknowledgments

We thank the Fonds de la Recherche Scientifique, Innoviris and the Fonds David et Alice Van Buuren for their financial support and Laurence Desmyter, Bruno Pichon, Claire Detry for their contribution.

M. V. C. Greenberg and D. Bourc’his, “The diverse roles of DNA methylation in mammalian development and disease,” Nat Rev Mol Cell Biol, vol. 20, no. 10, pp. 590–607, 2019, doi: 10.1038/s41580-019-0159-6.
G. Lev Maor, A. Yearim, and G. Ast, “The alternative role of DNA methylation in splicing regulation,” Trends in Genetics, vol. 31, no. 5, pp. 274–280, 2015, doi: https://doi.org/10.1016/j.tig.2015.03.002.
M. T. Mc Auley, “DNA methylation in genes associated with the evolution of ageing and disease: A critical review,” Ageing Res Rev, vol. 72, p. 101488, 2021, doi: https://doi.org/10.1016/j.arr.2021.101488.
R. A. Kowluru and G. Mohammad, “Epigenetic modifications in diabetes,” Metabolism, vol. 126, p. 154920, 2022, doi: https://doi.org/10.1016/j.metabol.2021.154920.
J. Reichard and G. Zimmer-Bensch, “The Epigenome in Neurodevelopmental Disorders,” Front Neurosci, vol. 15, p. 776809, 2021, doi: 10.3389/fnins.2021.776809.
J. Li et al., “Insights Into the Role of DNA Methylation in Immune Cell Development and Autoimmune Disease,” Front Cell Dev Biol, vol. 9, p. 3025, 2021, doi: 10.3389/fcell.2021.757318.
Y. Xia, A. Brewer, and J. T. Bell, “DNA methylation signatures of incident coronary heart disease: findings from epigenome-wide association studies,” Clin Epigenetics, vol. 13, no. 1, p. 186, 2021, doi: 10.1186/s13148-021-01175-6.
B. Horsthemke, “Epimutations in Human Disease,” in DNA Methylation: Development, Genetic Disease and Cancer, W. Doerfler and P. Böhm, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 45–59. doi: 10.1007/3-540-31181-5_4.
S. Nguengang Wakap et al., “Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database,” Eur J Hum Genet, vol. 28, no. 2, pp. 165–173, Apr. 2020.
T. Eggermann et al., “Imprinting disorders: a group of congenital disorders with overlapping patterns of molecular changes affecting imprinted loci,” Clin Epigenetics, vol. 7, no. 1, p. 123, 2015, doi: 10.1186/s13148-015-0143-8.
M. P. Hitchins and R. L. Ward, “Constitutional (germline) MLH1 epimutation as an aetiological mechanism for hereditary non-polyposis colorectal cancer,” J Med Genet, vol. 46, no. 12, pp. 793–802, 2009, doi: 10.1136/jmg.2009.068122.
E. Dámaso et al., “Primary constitutional MLH1 epimutations: a focal epigenetic event,” Br J Cancer, vol. 119, no. 8, pp. 978–987, 2018, doi: 10.1038/s41416-018-0019-8.
M. Barbosa et al., “Identification of rare de novo epigenetic variations in congenital disorders,” Nat Commun, vol. 9, no. 1, p. 2064, 2018, doi: 10.1038/s41467-018-04540-x.
D. Grafodatskaya et al., “Multilocus loss of DNA methylation in individuals with mutations in the histone H3 Lysine 4 Demethylase KDM5C,” BMC Med Genomics, vol. 6, no. 1, p. 1, 2013, doi: 10.1186/1755-8794-6-1.
J. A. Fahrner and H. T. Bjornsson, “Mendelian disorders of the epigenetic machinery: tipping the balance of chromatin states,” Annu Rev Genomics Hum Genet, vol. 15, pp. 269–293, 2014, doi: 10.1146/annurev-genom-090613-094245.
S. Choufani et al., “NSD1 mutations generate a genome-wide DNA methylation signature,” Nat Commun, vol. 6, no. 1, p. 10207, 2015, doi: 10.1038/ncomms10207.
D. T. Butcher et al., “CHARGE and Kabuki Syndromes: Gene-Specific DNA Methylation Signatures Identify Epigenetic Mechanisms Linking These Clinically Overlapping Conditions,” The American Journal of Human Genetics, vol. 100, no. 5, pp. 773–788, 2017, doi: https://doi.org/10.1016/j.ajhg.2017.04.004.
E. Chater-Diehl et al., “New insights into DNA methylation signatures: SMARCA2 variants in Nicolaides-Baraitser syndrome,” BMC Med Genomics, vol. 12, no. 1, p. 105, 2019, doi: 10.1186/s12920-019-0555-y.
S. Choufani et al., “DNA Methylation Signature for EZH2 Functionally Classifies Sequence Variants in Three PRC2 Complex Genes,” The American Journal of Human Genetics, vol. 106, no. 5, pp. 596–610, 2020, doi: https://doi.org/10.1016/j.ajhg.2020.03.008.
E. Chater-Diehl, S. J. Goodman, C. Cytrynbaum, A. L. Turinsky, S. Choufani, and R. Weksberg, “Anatomy of DNA methylation signatures: Emerging insights and applications,” The American Journal of Human Genetics, vol. 108, no. 8, pp. 1359–1366, 2021, doi: https://doi.org/10.1016/j.ajhg.2021.06.015.
M. T. Siu et al., “Functional DNA methylation signatures for autism spectrum disorder genomic risk loci: 16p11.2 deletions and CHD8 variants,” Clin Epigenetics, vol. 11, no. 1, p. 103, 2019, doi: 10.1186/s13148-019-0684-3.
J. A. Fahrner and H. T. Bjornsson, “Mendelian disorders of the epigenetic machinery: postnatal malleability and therapeutic prospects,” Hum Mol Genet, vol. 28, no. R2, pp. R254–R264, Apr. 2019, doi: 10.1093/hmg/ddz174.
V. Dagar et al., “Genetic variation affecting DNA methylation and the human imprinting disorder, Beckwith-Wiedemann syndrome,” Clin Epigenetics, vol. 10, no. 1, p. 114, 2018, doi: 10.1186/s13148-018-0546-4.
A. Sparago et al., “The phenotypic variations of multi-locus imprinting disturbances associated with maternal-effect variants of NLRP5 range from overt imprinting disorder to apparently healthy phenotype,” Clin Epigenetics, vol. 11, no. 1, p. 190, 2019, doi: 10.1186/s13148-019-0760-8.
J. Beygo, C. Grosser, S. Kaya, C. Mertel, K. Buiting, and B. Horsthemke, “Common genetic variation in the Angelman syndrome imprinting centre affects the imprinting of chromosome 15,” European Journal of Human Genetics, vol. 28, no. 6, pp. 835–839, 2020, doi: 10.1038/s41431-020-0595-y.
M. Kagami et al., “ZNF445: a homozygous truncating variant in a patient with Temple syndrome and multilocus imprinting disturbance,” Clin Epigenetics, vol. 13, no. 1, p. 119, 2021, doi: 10.1186/s13148-021-01106-5.
T. Eggermann, M. Begemann, and L. Pfeiffer, “Unusual deletion of the maternal 11p15 allele in Beckwith-Wiedemann syndrome with an impact on both imprinting domains,” Clin Epigenetics, vol. 13, no. 1, p. 30, 2021, doi: 10.1186/s13148-021-01020-w.
T. Eggermann et al., “Trans-acting genetic variants causing multilocus imprinting disturbance (MLID): common mechanisms and consequences,” Clin Epigenetics, vol. 14, no. 1, p. 41, 2022, doi: 10.1186/s13148-022-01259-x.
E. Aref-Eshghi et al., “Evaluation of DNA Methylation Episignatures for Diagnosis and Phenotype Correlations in 42 Mendelian Neurodevelopmental Disorders,” Am J Hum Genet, vol. 106, no. 3, pp. 356–370, 2020, doi: 10.1016/j.ajhg.2020.01.019.
M. Sanchez-Delgado et al., “Causes and Consequences of Multi-Locus Imprinting Disturbances in Humans,” Trends in Genetics, vol. 32, no. 7, pp. 444–455, Jul. 2016, doi: 10.1016/J.TIG.2016.05.001.
S. Azzi et al., “A prospective study validating a clinical scoring system and demonstrating phenotypical-genotypical correlations in Silver-Russell syndrome,” J Med Genet, vol. 52, no. 7, pp. 446–453, 2015, doi: 10.1136/jmedgenet-2014-102979.
M. Begemann et al., “Maternal variants in NLRP and other maternal effect proteins are associated with multilocus imprinting disturbance in offspring,” J Med Genet, vol. 55, no. 7, pp. 497–504, Jul. 2018, doi: 10.1136/jmedgenet-2017-105190.
D. Monk, D. J. G. Mackay, T. Eggermann, E. R. Maher, and A. Riccio, “Genomic imprinting disorders: lessons on how genome, epigenome and environment interact,” Nat Rev Genet, vol. 20, no. 4, pp. 235–248, 2019, doi: 10.1038/s41576-018-0092-0.
M. V. Cubellis et al., “Loss-of-function maternal-effect mutations of PADI6 are associated with familial and sporadic Beckwith-Wiedemann syndrome with multi-locus imprinting disturbance,” Clin Epigenetics, vol. 12, no. 1, p. 139, Sep. 2020, doi: 10.1186/s13148-020-00925-2.
L. E. Docherty et al., “Mutations in NLRP5 are associated with reproductive wastage and multilocus imprinting disorders in humans,” Nat Commun, vol. 6, p. 8086, 2015, doi: 10.1038/ncomms9086.
A. E. Jaffe et al., “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies,” Int J Epidemiol, vol. 41, no. 1, pp. 200–209, Apr. 2012, doi: 10.1093/ije/dyr238.
M. E. Ritchie et al., “limma powers differential expression analyses for RNA-sequencing and microarray studies,” Nucleic Acids Res, vol. 43, no. 7, pp. e47–e47, Apr. 2015, doi: 10.1093/nar/gkv007.
F. I. Rezwan et al., “A statistical method for single sample analysis of HumanMethylation450 array data: genome-wide methylation analysis of patients with imprinting disorders,” Clin Epigenetics, vol. 7, p. 48, 2015, doi: 10.1186/s13148-015-0081-5.
J. R. Crawford and D. C. Howell, “Comparing an Individual’s Test Score Against Norms Derived from Small Samples,” Clin Neuropsychol, vol. 12, no. 4, pp. 482–486, 1998, doi: 10.1076/clin.12.4.482.7241.
F. Eckhardt et al., “DNA methylation profiling of human chromosomes 6, 20 and 22,” Nat Genet, vol. 38, no. 12, pp. 1378–1385, 2006, doi: 10.1038/ng1909.
D. Saito and M. Suyama, “Linkage disequilibrium analysis of allelic heterogeneity in DNA methylation,” Epigenetics, vol. 10, no. 12, pp. 1093–1098, 2015, doi: 10.1080/15592294.2015.1115176.
O. Affinito et al., “Nucleotide distance influences co-methylation between nearby CpG sites,” Genomics, vol. 112, no. 1, pp. 144–150, 2020, doi: https://doi.org/10.1016/j.ygeno.2019.05.007.
W. Poole, D. L. Gibbs, I. Shmulevich, B. Bernard, and T. A. Knijnenburg, “Combining dependent P-values with an empirical adaptation of Brown’s method,” Bioinformatics, vol. 32, no. 17, pp. i430–i436, Apr. 2016, doi: 10.1093/bioinformatics/btw438.
E. Rojano, P. Seoane, J. A. G. Ranea, and J. R. Perkins, “Regulatory variants: from detection to predicting impact,” Brief Bioinform, vol. 20, no. 5, pp. 1639–1654, Apr. 2018, doi: 10.1093/bib/bby039.
P. Garg and A. J. Sharp, “Screening for rare epigenetic variations in autism and schizophrenia,” Hum Mutat, vol. 40, no. 7, pp. 952–961, 2019, doi: https://doi.org/10.1002/humu.23740.
P. Garg et al., “A Survey of Rare Epigenetic Variation in 23,116 Human Genomes Identifies Disease-Relevant Epivariations and CGG Expansions,” The American Journal of Human Genetics, vol. 107, no. 4, pp. 654–669, Apr. 2020, doi: 10.1016/j.ajhg.2020.08.019.
A. E. Teschendorff et al., “A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data,” Bioinformatics, vol. 29, no. 2, pp. 189–196, 2013, doi: 10.1093/bioinformatics/bts680.
S. Horvath, “DNA methylation age of human tissues and cell types,” Genome Biol, vol. 14, no. 10, p. R115, 2013, doi: 10.1186/gb-2013-14-10-r115.
M. Barbosa et al., “Identification of rare de novo epigenetic variations in congenital disorders,” Nat Commun, vol. 9, no. 1, p. 2064, 2018, doi: 10.1038/s41467-018-04540-x.
J. Xiao et al., “Gasdermin D mediates the pathogenesis of neonatal-onset multisystem inflammatory disease in mice,” PLoS Biol, vol. 16, no. 11, pp. 1–13, Sep. 2018, doi: 10.1371/journal.pbio.3000047.
A.-K. Arndt et al., “Fine Mapping of the 1p36 Deletion Syndrome Identifies Mutation of PRDM16 as a Cause of Cardiomyopathy,” The American Journal of Human Genetics, vol. 93, no. 1, pp. 67–77, 2013, doi: https://doi.org/10.1016/j.ajhg.2013.05.015.
D. J. G. Mackay et al., “Multilocus methylation defects in imprinting disorders,” Biomol Concepts, vol. 6, no. 1, pp. 47–57, Mar. 2015, doi: 10.1515/bmc-2014-0037.
A. Rochtus et al., “Genome-wide DNA methylation analysis of pseudohypoparathyroidism patients with GNAS imprinting defects,” Clin Epigenetics, vol. 8, no. 1, p. 10, 2016, doi: 10.1186/s13148-016-0175-8.
E. G. Bend et al., “Gene domain-specific DNA methylation episignatures highlight distinct molecular entities of ADNP syndrome,” Clin Epigenetics, vol. 11, no. 1, p. 64, 2019, doi: 10.1186/s13148-019-0658-5.
S. Pavan, K. Rommel, M. E. Mateo Marquina, S. Höhn, V. Lanneau, and A. Rath, “Clinical Practice Guidelines for Rare Diseases: The Orphanet Database,” PLoS One, vol. 12, no. 1, p. e0170365, 2017, doi: 10.1371/journal.pone.0170365.
J. E. Moore et al., “Expanded encyclopaedias of DNA elements in the human and mouse genomes,” Nature, vol. 583, no. 7818, pp. 699–710, 2020, doi: 10.1038/s41586-020-2493-4.
J. R. Hernandez Mora et al., “Characterization of parent-of-origin methylation using the Illumina Infinium MethylationEPIC array platform,” Epigenomics, vol. 10, no. 7, pp. 941–954, Jul. 2018, doi: 10.2217/epi-2017-0172.
M. Dalby, S. Rennie, and R. Andersson, “FANTOM5 transcribed enhancers in hg38,” Zenodo. Apr. 2017.


Editorial decision: Major revision
25 Oct, 2022
Reviews received at journal
09 Oct, 2022
Reviewers agreed at journal
30 Sep, 2022
Reviewers invited by journal
29 Sep, 2022
Editor assigned by journal
22 Sep, 2022
Submission checks completed at journal
22 Sep, 2022
First submitted to journal
20 Sep, 2022
