For each of the processing Steps 1–4, we compared different options with a variety of evaluation diagnostics and leveraged the technical replicates (Fig. 1).
Normalization Evaluation
First, we explored normalizing both data sets together using ssNoob (coupled with the minfi probe QC), as recommended by Fortin et al. (11). After testing the first 10 PCs for association with platform, we found that the first and second PCs were very strongly associated with platform and sex, respectively (Fig. 2). Sex differences are expected to contribute substantially to methylation profiles, as methylation is known to play a large role in X-chromosome inactivation in females. Subsequent batch adjustment did not reduce the strong platform effect (Supplemental Table S1), regardless of the batch adjustment method applied. Therefore, we normalized each platform separately.
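As a minimal sketch of how a PC can be checked for association with a categorical variable such as platform or sex, the example below computes the proportion of variance in a PC's scores explained by group membership (eta squared). The choice of eta squared, the function name, and the toy scores are assumptions for illustration; the association measure actually used for Fig. 2 is not specified here.

```python
from statistics import fmean


def variance_explained_by_group(scores, labels):
    """Proportion of variance in `scores` explained by `labels` (eta squared).

    Assumed association measure for illustration only.
    """
    grand_mean = fmean(scores)
    total_ss = sum((s - grand_mean) ** 2 for s in scores)

    # Collect the scores belonging to each group label.
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)

    # Between-group sum of squares, weighted by group size.
    between_ss = sum(
        len(values) * (fmean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return between_ss / total_ss if total_ss > 0 else 0.0


# Hypothetical PC1 scores for six samples (not real data).
pc1 = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]
platform = ["450K", "450K", "450K", "EPIC", "EPIC", "EPIC"]
sex = ["F", "M", "F", "M", "F", "M"]

platform_r2 = variance_explained_by_group(pc1, platform)
sex_r2 = variance_explained_by_group(pc1, sex)
print(f"PC1 ~ platform: {platform_r2:.3f}, PC1 ~ sex: {sex_r2:.3f}")
```

In this toy example, PC1 separates almost entirely by platform and only weakly by sex, which is the kind of pattern that would motivate normalizing each platform separately.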
To explore the effect of SWAN and SeSAMe on harmonization of the platforms, we examined technical replicates across platforms (see Methods). The correlation across probes for each pair of technical replicates (Eqn S2) was extremely high (> 0.98) for both methods. This is not surprising given the large number of data points used to calculate each correlation, and it is similar to the high correlation between random pairs of samples (> 0.97). Individual probe correlations (Eqn S3) were deemed much more informative. We generated the densities of probe-level correlations across the technical replicate pairs as well as across random sample pairs (Fig. 3). For both SeSAMe and SWAN, the distribution of the random sample correlations is centered around 0 and looks approximately Gaussian, whereas the distribution of the technical replicate correlations is centered around a higher correlation coefficient and looks like a mixture of two or more distributions.
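The difference between the two kinds of correlation can be illustrated with a small sketch. It assumes that the across-probe correlation is a Pearson correlation over all probes within one replicate pair, and that the probe-level correlation is a Pearson correlation for one probe over all replicate pairs; the exact forms of Eqn S2 and Eqn S3 are not reproduced here, and all names and values are hypothetical.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if undefined)."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    denom = sqrt(ss_x * ss_y)
    return cov / denom if denom > 0 else float("nan")


# Hypothetical Beta values: rows are replicate pairs, columns are probes.
betas_450k = [
    [0.05, 0.50, 0.95],
    [0.06, 0.55, 0.94],
    [0.04, 0.45, 0.96],
    [0.05, 0.52, 0.93],
]
betas_epic = [
    [0.06, 0.51, 0.94],
    [0.05, 0.56, 0.95],
    [0.05, 0.44, 0.95],
    [0.04, 0.53, 0.96],
]

# Across-probe correlation: one value per replicate pair.
across_probe = [pearson(a, b) for a, b in zip(betas_450k, betas_epic)]

# Probe-level correlation: one value per probe, computed across pairs.
n_probes = len(betas_450k[0])
probe_level = [
    pearson([row[j] for row in betas_450k], [row[j] for row in betas_epic])
    for j in range(n_probes)
]

print("across-probe:", [round(r, 3) for r in across_probe])
print("probe-level: ", [round(r, 3) for r in probe_level])
```

Because probes differ widely in mean methylation, the across-probe correlation is high for every pair in this toy data, while the probe-level correlation distinguishes the probe that agrees between platforms from those that do not.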
We also examined the probe-level absolute differences (Eqn S1) in Beta value (% methylation) for each technical replicate pair (Supplemental Figure S1). In all technical replicate pairs, the SeSAMe differences have a tighter distribution closer to 0, whereas SWAN has higher absolute differences, indicating that SeSAMe harmonizes the data better. Given these results on the technical replicates, we moved forward with the SeSAMe normalization (see Supplement for probe QC filtering numbers).
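A minimal sketch of this comparison, assuming the probe-level difference is the absolute difference in Beta value between the 450K and EPIC measurements of the same sample (the exact form of Eqn S1 is not reproduced here). The two "methods" and all values are hypothetical and do not represent SeSAMe or SWAN output.

```python
from statistics import median


def abs_beta_differences(betas_a, betas_b):
    """Per-probe absolute difference between two Beta value vectors."""
    return [abs(a - b) for a, b in zip(betas_a, betas_b)]


# One hypothetical technical replicate pair under two normalization methods.
method1_450k = [0.10, 0.50, 0.90, 0.30]
method1_epic = [0.11, 0.52, 0.89, 0.31]
method2_450k = [0.10, 0.50, 0.90, 0.30]
method2_epic = [0.15, 0.58, 0.84, 0.36]

diff_method1 = abs_beta_differences(method1_450k, method1_epic)
diff_method2 = abs_beta_differences(method2_450k, method2_epic)

# A smaller median difference suggests better cross-platform agreement.
median_method1 = median(diff_method1)
median_method2 = median(diff_method2)
print(f"method 1: {median_method1:.3f}, method 2: {median_method2:.3f}")
```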
Batch Effect Adjustment
Even after normalizing within each platform, technical batch effects remain to be considered, since a variety of factors can add unwanted technical variation (18). Therefore, we examined within-platform batch effects, with batch defined as the combination of plate and row location. We applied ComBat and RUVm to adjust for within-platform batch effects on the SeSAMe dataset. Based on the PC analysis, ComBat performed slightly better, as the top PCs were less associated with batches defined as plate by row (Supplemental Figure S2, (18)). Our finding that ComBat outperformed other methods is consistent with Jiao and colleagues (19).
Probe Filtering
Applying the Logue beta range filter (8), which removes probes with a Beta methylation range < 5%, removed 15.8% (59,397) and 33.9% (225,342) of probes in the 450K and EPIC platforms, respectively. The mean beta values of the removed probes fell only at the extremes for both platforms (Supplemental Figure S3), while the probes that passed this criterion had mean beta values spanning the full 0-100% range of possible methylation values. Additional considerations are summarized in the Supplement.
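The range filter can be sketched as follows, assuming the range is taken as the maximum minus the minimum Beta value across all samples for a probe and that Beta values are on a 0 to 1 scale. The function name, probe identifiers, and values are hypothetical.

```python
def passes_range_filter(betas, min_range=0.05):
    """Keep a probe only if its Beta values span at least `min_range`."""
    return (max(betas) - min(betas)) >= min_range


# Hypothetical Beta values per probe across four samples.
probe_betas = {
    "cg_hypothetical_1": [0.01, 0.02, 0.03, 0.02],  # near 0, range 0.02
    "cg_hypothetical_2": [0.35, 0.50, 0.62, 0.44],  # intermediate, range 0.27
    "cg_hypothetical_3": [0.97, 0.98, 0.99, 0.98],  # near 1, range 0.02
}

kept = {
    probe: betas
    for probe, betas in probe_betas.items()
    if passes_range_filter(betas)
}
removed_fraction = 1 - len(kept) / len(probe_betas)

print("kept:", list(kept))
print(f"removed: {removed_fraction:.1%}")
```

In this toy example the two low-range probes sit at the extremes of the methylation scale, mirroring the pattern described for the removed probes.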
Genomic Inflation in Statistical Analysis
After statistical analysis, it is important to consider the genomic inflation factor lambda (i.e., general inflation of test statistics due to population structure), which is the ratio of the median of the observed distribution of the test statistics to the expected median and should be close to 1. With SWAN normalization, the 450K dataset showed extreme genomic inflation, whereas the EPIC dataset was deflated (Supplemental Figure S6). However, genomic inflation was very comparable between the platforms for the SeSAMe normalized datasets. To account for any additional genomic bias and inflation, we applied BACON (16), which was developed to control for genomic inflation specifically in epigenomic data. After this adjustment, the genomic inflation factors for the SeSAMe 450K and EPIC datasets were 1.03 and 1.08, respectively.
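The median-ratio definition of lambda can be sketched as below. This assumes the test statistics are converted from two-sided p-values to 1-degree-of-freedom chi-square statistics, which is a common convention but is an assumption here; the function name and the simulated p-values are hypothetical, and the sketch does not reproduce the BACON adjustment.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 df (about 0.455), obtained as
# the squared 75th percentile of the standard normal distribution.
EXPECTED_MEDIAN_CHISQ_1DF = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation_factor(p_values):
    """Lambda: observed median chi-square (1 df) over its expected median.

    Assumes two-sided p-values strictly greater than 0 and at most 1.
    """
    # A two-sided p-value p corresponds to z = inv_cdf(p / 2), chi-square z**2.
    chisq = [_STD_NORMAL.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chisq) / EXPECTED_MEDIAN_CHISQ_1DF


# Evenly spaced p-values mimic a well-calibrated null distribution.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
# Squaring shifts the p-values toward 0, mimicking inflated test statistics.
inflated_p = [p ** 2 for p in null_p]

lambda_null = genomic_inflation_factor(null_p)
lambda_inflated = genomic_inflation_factor(inflated_p)
print(f"null: {lambda_null:.2f}, inflated: {lambda_inflated:.2f}")
```

Values of lambda well above 1 indicate inflated test statistics, and values below 1 indicate deflation.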
To perform the meta-analysis, we kept only probes present in both SeSAMe datasets (199,243 probes). Final results are reported by R. K. Johnson and colleagues (4). We selected this final pipeline as it gives candidate CpG sites comparable to those in other DNA methylation papers in T1D (20–23).