DNA methylation (DNAm) of cytosines (5-mC) plays an important role in cell biology, most notably in tissue specific regulation of gene expression. Other roles include X-chromosome inactivation, regulation of splice-junctions and genomic imprinting [1], [2]. Differential methylation of cytosines, or epivariations, have been linked to a wide array of diseases such as cancer, ageing, metabolic, cardiovascular, neurodevelopmental and autoimmune disorders [3]–[7]. Differential methylation can occur either at single cytosines (DMCs), or affect several loci within a region, resulting in differentially methylated regions (DMRs). Depending on their origin, primary and secondary epivariations can be differentiated. Primary epivariations arise from stochastic errors in the establishment or maintenance of a methylation state by the DNA Methyl Transferase proteins family. Secondary epivariations, by contrast, derive from genetic alterations such as copy number variations (CNVs) or single nucleotide variations (SNVs) at the differentially methylated locus or inactivating variants in trans-acting factors with a key role in the establishment or maintenance of methylation state of that locus [8]. Both primary and secondary epivariations are found in patients suffering from rare diseases, a worldwide public health issue estimated to affect between 260 and 445 million people [9]. On one hand, primary epivariations are the main molecular event causing some imprinting disorders [10], rare cases of cancer[11], [12] and neurodevelopmental diseases [13]. On the other hand, secondary epivariations are a known alternative mechanism in rare diseases and the detection of these sequence variants have gained popularity in the diagnostic process. That is the case in the group of neurodevelopmental disorders known as the Mendelian disorders of the epigenetic machinery (MDEMs), for which detection of episignatures (i.e. group of DMCs acting as a blueprint for the disease) has been shown to enable patient diagnosis [14]–[22], or in imprinting disorders [23]–[28], where DMRs are localized at imprinting control centers. Episignatures and DMRs at imprinting loci are usually linked to a single disease. However, it has been shown that MDEMs’ episignatures sometimes share overlapping DMCs[29] and there has been increasing reports of patients showing multi-locus imprinting disturbances (MLIDs). MLIDs represent rare cases of imprinting disorders characterized at the molecular level by several defects at imprinting regions [30]. Patients suffering from MLIDs often share overlapping phenotypes based on the imprinted regions showing defects [24], [28], [31]–[35]. As a consequence of this molecular and phenotypic heterogeneity, aggregating patients in groups is not always trivial.
Classical methods to identify differentially methylated regions and episignatures are usually based on inter-group comparisons, requiring a large number of samples in each group to reach statistically significant results [36], [37]. Those methods cannot be systematically applied in the context of rare diseases due to either the cohort size or the intra-group heterogeneity. It is especially the case when the disease affects only a handful of patients, hence making it difficult to gather cohorts large enough to satisfy canonical group-comparison method assumptions. In addition, group-comparison loses the ability to capture inter-patients’ heterogeneity, such as in MLIDs. Therefore, single patient-based analyses could alternatively be used to address those issues as well as support personalization of diagnosis.
In the literature, only two methods have been described for single case-control DNAm analysis. The first method is divided in two steps. First, the Crawford-Howell (C-H) adaptation of the t-test is used to detect differential methylation at individual CpGs. Then, individual scores are aggregated in a DMR score using the Fisher aggregation method [38]. The second method used in [13] has been developed following two empirical rules: (i) at least 3 probes that each have methylation levels above the 99.9th percentile of the control distribution for that probe and are ≥ 0.15 above the control mean; (ii) at least 1 probe with a methylation level ≥ 0.1 above the maximum observed in controls for that probe.
Although both methods allow the detection of biologically relevant DMRs, they present some limitations. In the first method, the statistical method for individual probe testing described by Crawford-Howell is suggested to be used when the normative sample size (i.e. the size of the control population) is less than 50 [39]. Above that threshold, the Z-score is preferred. In addition to this limitation, the Fisher aggregation method used to combine individual scores (i.e. p-values) assumes independence between variables. However, this assumption does not hold in most large high-throughput biology datasets that show a correlation between variables. Indeed, it has been shown that closely located CpGs tend to be co-methylated [40]–[42]. In the second method, the empirical rules, while relevant, do not allow the ranking of candidate regions by a confidence score such as a p-value [44] and therefore loses the flexibility of applying a threshold for DMR calling. Finally, there is no evaluation of how the choice of the used parameters (e.g. number of probes, difference in methylation, cohort size) may affect DMRs calling.
Therefore, in this paper, we propose a statistical method based on the Z-score followed by the Empirical Brown method that takes into account covariance between variables [43] to identify DMRs in a single-patient setting. First, we characterize the behavior of CpGs methylation status in various regions of biological interest and show that CpGs display a high correlation in those regions, thus justifying the use of Brown’s aggregation method to assign a DMR score. Second, we investigate how different parameters such as the size of the control population, the amplitude of the methylation difference and the size of the regions affect the performance of this method for DMRs identification. In addition, we show the diagnostic utility of this method in the context of MLIDs and other neurodevelopmental disorders and congenital anomalies (ND-CAs), as well as its potential to identify new epivariants in existing datasets from a single-patient perspective.