I – Selection of candidate pop-CpGs
The Illumina Infinium HumanMethylation450 BeadChip array (HM450K array), previously applied to characterize methylation levels in B-lymphocyte cell lines representing CEU (n=18) and CHB (n=18), revealed a set of 96 CpGs that differentiated the two populations at the significance level p<0.05 and showed the highest inter-population differences in average methylation level (|Mav_diff| > 1; q<0.05; see (40)). From these differentially methylated CpGs, a small set of 14, characterized by the absence of confounding features (no SNPs in the studied CpG, no frequent SNPs under the Illumina probe, and no multi-site mapping of the probe), was selected as candidate pop-CpGs (Table 1).
Eleven of the 14 best-differentiating CpGs were located outside CpG islands (in shore or shelf regions, gene body, transcription start site or 5'UTR regions). Three CpG sites, cg04036182 (chr15:45458818), cg07207043 (chr6:7051497) and cg00031303 (chr3:195681400), were located in CpG islands of the SHF, RREB1 and SDHAP1 genes, respectively. The highest inter-population differences in methylation level (~40% difference) were observed for cg18136963 (chr6:139013146) and cg26367031 (chr3:178984747) (Mav_diff ≥ 2.7).
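The selection criteria described above can be sketched as a simple filter. This is a minimal illustration, not the pipeline used in the study: the record layout, the field names and the demo values are hypothetical, and Mav_diff is treated simply as a per-CpG numeric score whose exact computation is not restated here.

```python
from dataclasses import dataclass


@dataclass
class CpGRecord:
    """Hypothetical per-probe summary; all field names are assumptions."""
    probe_id: str
    mav_diff: float                 # inter-population difference score (Mav_diff)
    q_value: float                  # multiple-testing-adjusted significance
    snp_in_cpg: bool                # SNP located in the studied CpG
    frequent_snp_under_probe: bool  # frequent SNP under the array probe
    multi_mapping: bool             # probe maps to more than one genomic site


def select_candidates(records, min_abs_diff=1.0, max_q=0.05):
    """Return probe IDs that pass the thresholds and carry no confounding feature."""
    selected = []
    for r in records:
        passes_stats = abs(r.mav_diff) > min_abs_diff and r.q_value < max_q
        confounded = r.snp_in_cpg or r.frequent_snp_under_probe or r.multi_mapping
        if passes_stats and not confounded:
            selected.append(r.probe_id)
    return selected


# Invented demo records (not data from the study).
demo_records = [
    CpGRecord("cg_demo_A", 2.7, 0.01, False, False, False),   # kept
    CpGRecord("cg_demo_B", -1.5, 0.02, True, False, False),   # SNP in CpG
    CpGRecord("cg_demo_C", 0.6, 0.01, False, False, False),   # difference too small
    CpGRecord("cg_demo_D", 1.8, 0.20, False, False, False),   # q too high
]
```

Calling `select_candidates(demo_records)` keeps only the first record; raising `min_abs_diff` to 1.2 mirrors the stricter threshold applied before technical validation.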
DNA methylation and gene expression correlation analysis
Thirty-six B-lymphocyte cell lines from both populations (CEU and CHB) were analyzed on the HM450K array (Illumina) and the HumanHT-12v4 Expression BeadChip Kit expression array (Illumina). Based on the results from both Illumina platforms, a t-test was performed to identify CpG loci and genes showing statistically significant inter-population differences in DNA methylation level and in gene expression, respectively. Subsequently, to identify relationships between gene expression and the corresponding methylation status, a Pearson correlation analysis was performed.
Based on this two-step statistical analysis, a group of genes and CpG loci meeting the statistical criteria (p<0.01 in both the t-tests and the Pearson correlation analysis) was identified. None of the pop-CpGs, except for cg24861686 (1_CpG1, chr8:11418058), met the above-mentioned criteria. This CpG site showed a positive correlation with the BLK gene (Pearson coefficient 0.63).
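The two statistics used in this analysis can be sketched as follows. The text does not state which t-test variant was applied, so Welch's unequal-variance form is an assumption here; the function names are hypothetical. Converting either statistic to a p-value requires a t-distribution CDF, which the Python standard library does not provide, so that step is left out.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def welch_t(a, b):
    """Welch's t statistic for the difference between two group means."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each group needs at least two values")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)  # sample variance
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))
```

In this sketch, `welch_t` would be applied per CpG (or per gene) to the CEU and CHB values, and `pearson_r` to the paired methylation and expression values across cell lines.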
II – Technical validation
A subset of ten pop-CpG candidates meeting even more stringent statistical criteria (|Mav_diff| ≥ 1.2 at q<0.05), together with ten additional CpGs located in their close proximity, was analyzed by pyrosequencing (Table 2).
For technical reasons (see Additional file 1 for details), some CpGs were excluded, and a subset of 17 CpGs was analyzed in further experiments.
Pyrosequencing results were collected as proportional values, separately for each analyzed CpG site (Table 2, Fig. 2). The average difference in methylation level between the studied populations ranged from 0.119 (PyroAssay 6_CpG1 chr15:45458826) to 0.387 (PyroAssay 2_CpG1 chr1:37939320). Statistically significant population differences (p<0.05) were obtained for most of the CpG sites. The pyrosequencing results were concordant with the results from the HM450K array. The only exception was PyroAssay 5, where no statistically significant population differences in methylation level were noted for two of the three examined CpGs (5_CpG2 chr5:132113755 and 5_CpG3 chr5:132113777); nevertheless, this PyroAssay was not excluded from further analyses.
Figure 2 shows the distribution of methylation levels in individual B-lymphocyte cell lines used in the technical validation phase. Eight PyroAssays (1, 2, 3, 5, 6, 8, 9 and 10) passed the technical validation and were used in the further step of biological validation.
III – Biological validation of population differences in methylation level
Independent B-lymphocyte cell lines
To test the biological validity of the population-differentiating methylation status of the 17 CpG sites, eight PyroAssays were performed in an independent set of B-lymphocyte cell lines. Statistically significant (p<0.05) population differences in the mean methylation level were observed for 6 out of 8 tested PyroAssays (covering 12 CpG sites, see Table 3).
In the majority of PyroAssays, the level of methylation was similar across neighboring CpG sites (Table 3). Only two CpGs (5_CpG3 chr5:132113777 and 9_CpG1 chr6:7051497) had a methylation level distinct from the other positions targeted by the respective PyroAssay, with no statistically significant differences between the two populations (Table 3). The highest inter-population differences in methylation level were noted for CpGs covered by PyroAssays 8 and 10 (Table 3, CEUmean-CHBmean column). PyroAssays 2 and 3 did not reveal any statistically significant population differences in CpG methylation.
Peripheral blood samples
To test whether the population differences in CpG methylation levels observed in CEU and CHB cell lines reflected real differences between the two populations (and were not due to peculiarities of the cell lines), a second step of biological validation was performed using primary biological material, i.e. peripheral blood samples from individuals representing the two analyzed populations (n=40 from both CEU and CHB).
Overall, PyroAssays revealed similar inter-population differences in the level of CpG methylation in B-lymphocyte cell lines and in blood samples. Furthermore, as in B-lymphocyte cell lines, a high consistency in methylation level among the individual CpG sites examined within a given PyroAssay was also observed in blood samples (Fig. 3). The greatest inter-population differences in the level of CpG methylation were observed in PyroAssays 8 and 5. Only a few inconsistencies were observed between B-lymphocyte cell lines and blood samples. Population differences in the methylation of the 5_CpG3 (chr5:132113777) and 9_CpG1 (chr6:7051497) sites, which did not reach statistical significance in B-cell lines, were statistically significant in blood samples, whereas the inter-population differences in 1_CpG1 (chr8:11418058) were not significant in blood samples. On the other hand, the CpG sites targeted by PyroAssay 10, which were classified as strongly population-differentiating in B-cell lines, showed the lowest average differences in methylation values in blood samples.
For the majority of PyroAssays, methylation readouts in individual blood samples were tightly clustered, in contrast to those observed in B-lymphocyte cell lines. The only exception was PyroAssay 8, where the spread of the readouts from blood samples was much larger and showed a clear tri-modal methylation distribution (see Discussion).
IV – Discriminating potential of the selected pop-CpGs
Identification of a composite pop(CEU-CHB)-CpG marker
Pearson correlation analysis was performed using data from the B-lymphocyte cell lines (n=10 CEU; n=10 CHB) obtained during the technical validation step. The analysis showed high correlation coefficients (0.8-1) among the CpG sites within each PyroAssay and, at the same time, low correlation (<0.5) between individual PyroAssays (see Fig. 4 below).
To select a non-redundant set of validated pop-CpGs, correlated sites identified by the Pearson correlation analysis within each PyroAssay were removed. Based on the p-value after Benjamini-Hochberg correction (the sites with the lowest padj_beta values were selected, see Table 3), a set of eight CpG sites (1_CpG1 chr8:11418058, 2_CpG1 chr1:37939320, 3_CpG2 chr3:178984959, 5_CpG1 chr5:132113734, 6_CpG2 chr15:45458818, 8_CpG1 chr6:139013142, 9_CpG3 chr6:7051504, 10_CpG1 chr1:36489272) was selected. This set of eight non-redundant, validated pop-CpGs formed a composite pop(CEU-CHB)-CpG marker with the potential to discriminate between the CEU and CHB populations based on differences in methylation level.
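The selection step can be illustrated with a short sketch of the Benjamini-Hochberg adjustment followed by choosing one representative site per PyroAssay. The function names and the tuple layout are hypothetical, and whether the correction in the study was applied across all sites at once is an assumption of this example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def pick_representatives(sites):
    """Keep, for each assay, the site with the lowest adjusted p-value.

    `sites` is a list of (assay, site_name, adjusted_p) tuples.
    """
    best = {}
    for assay, name, padj in sites:
        if assay not in best or padj < best[assay][1]:
            best[assay] = (name, padj)
    return {assay: name for assay, (name, _) in best.items()}
```

For example, given three sites of one assay with adjusted p-values 0.03, 0.001 and 0.02, `pick_representatives` retains only the second site, so that strongly correlated neighbors do not enter the composite marker twice.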
Testing of the composite pop(CEU-CHB)-CpG marker
To assess the population-discriminating potential of the 8-site composite pop(CEU-CHB)-CpG marker, three different classification methods were used: support vector machines (SVM) with linear kernel, linear discriminant analysis (LDA) and random forest (RF). The predictive ability of each method was assessed using 10-fold cross-validation, which was repeated 1000 times due to the moderate number of available cases.
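The repeated cross-validation scheme can be sketched with the standard library alone. Two simplifications are assumptions of this example and not part of the study: the folds are stratified by population, and a nearest-centroid rule stands in for the SVM, LDA and RF classifiers, which would normally come from a dedicated machine-learning library.

```python
import random


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds with roughly equal class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    offset = 0
    for lab in sorted(by_class):
        idx = by_class[lab]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[(j + offset) % k].append(i)
        offset += len(idx)
    return folds


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier: assign each sample to the closest class centroid."""
    cents = {lab: centroid([x for x, y in zip(train_x, train_y) if y == lab])
             for lab in set(train_y)}
    def sq_dist(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return [min(sorted(cents), key=lambda lab: sq_dist(x, cents[lab]))
            for x in test_x]


def repeated_cv_accuracy(x, y, k=10, repeats=1000, seed=0):
    """Mean accuracy over `repeats` rounds of k-fold cross-validation."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        correct = 0
        for fold in stratified_folds(y, k, rng):
            held_out = set(fold)
            train = [i for i in range(len(y)) if i not in held_out]
            preds = nearest_centroid_predict([x[i] for i in train],
                                             [y[i] for i in train],
                                             [x[i] for i in fold])
            correct += sum(p == y[i] for p, i in zip(preds, fold))
        accuracies.append(correct / len(y))
    return sum(accuracies) / len(accuracies)


# Synthetic, perfectly separable demo data (not data from the study).
demo_x = [[0.1, 0.2]] * 10 + [[0.9, 0.8]] * 10
demo_y = ["CEU"] * 10 + ["CHB"] * 10
```

Reshuffling the folds in every repeat averages out the dependence of the estimate on any single partition, which matters most when the number of samples is moderate.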
The results obtained using each of the classification algorithms (SVM, LDA and RF) were compared in terms of AUC parameter (area under ROC curve) (see Fig. 5).
All presented curves followed the left-hand border and the top border of the plot, indicating a high accuracy of the 8-site composite pop(CEU-CHB)-CpG marker, with a high rate of true positives relative to false positives. Similar results were obtained with all three tested classification methods (AUC>0.9), of which SVM was the most reliable (AUC=0.996). The SVM validation performed on two independent datasets, B-lymphocyte cell lines (n=48) and blood samples (n=40), showed high classification accuracy in both sets (>85%) (see Additional file 2).
Principal component analysis (PCA) was used to assess the potential of the 8-site composite pop(CEU-CHB)-CpG marker to separate samples from the two analyzed populations. While the vast majority of samples clustered according to their population affiliation, the two population-specific clusters were located in close vicinity to each other. A more accurate separation was obtained for blood samples (the population-specific clusters were further apart than in B-cell samples) (Fig. 6A,B).
The first and second dimensions accounted for ~30% and ~17% of the variance, respectively, in both B-lymphocyte cell lines and blood samples. In both PC plots, markers 2_CpG1 (chr1:37939320), 6_CpG2 (chr15:45458818), 9_CpG3 (chr6:7051504) and 10_CpG1 (chr1:36489272) correlated with each other and showed a higher methylation level in the CHB population, whereas markers 1_CpG1 (chr8:11418058), 3_CpG2 (chr3:178984959), 8_CpG1 (chr6:139013142) and 5_CpG1 (chr5:132113734) showed a higher methylation level in the CEU population. The weight of individual CpG markers on the principal components varied, as indicated by the vector lengths. Interestingly, most CpG markers had similar weights in the PCA of B-lymphocyte cell lines (Fig. 6A), while in blood samples the impact of one marker, 1_CpG1 (chr8:11418058), was distinctly smaller (Fig. 6B).
An additional test was performed to assess the minimal number of pop-CpGs that would classify European and Chinese samples with high accuracy. A minimal set of seven unlinked pop-CpGs (10_CpG1 chr1:36489272, 6_CpG2 chr15:45458818, 1_CpG1 chr8:11418058, 2_CpG1 chr1:37939320, 9_CpG3 chr6:7051504, 8_CpG1 chr6:139013142, 3_CpG2 chr3:178984959) had a high classification accuracy (AUC~1 and precision>0.8) (Fig. 7, lower panel) in both B-lymphocyte cell lines and blood samples; the discrimination potential obtained in peripheral blood samples (precision=0.925) was higher than in B-lymphocyte cell lines (precision=0.854). To obtain similar discrimination power in both B-lymphocyte cell lines and peripheral blood samples, we decided to retain the 8-site composite pop(CEU-CHB)-CpG marker for methylation-based classification of the CEU and CHB populations (see Fig. 7, lower panel).
To assess the population-discriminating potential of the 8-site composite pop(CEU-CHB)-CpG marker in individuals of both genders, an in silico analysis was performed using additional DNA methylation data for B-lymphocyte cell lines investigated on the Illumina Infinium HumanMethylation450 BeadChip array platform, obtained from the GEO database (GSE36369). The SVM validation performed on two independent datasets, 93 males (CEU=47; CHB=46) and 99 females (CEU=49; CHB=50), showed high classification accuracy in both genders (>89%) (see Additional file 3).
Furthermore, a biological validation of the 8-site composite pop(CEU-CHB)-CpG marker was performed. Male and female blood samples from the CEU (n=96) and CHB (n=96) populations were obtained from the same Illumina microarray experiment as before (GSE36369). The results, similar to those from B-lymphocyte cell lines, indicated a high population discrimination potential of our 8-site marker, regardless of gender (see Additional file 4).
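For readers less familiar with the AUC, the quantity can be computed directly from classifier scores: it equals the probability that a randomly chosen sample of the positive class receives a higher score than a randomly chosen sample of the other class, with ties counted as one half. The sketch below uses a hypothetical function name and invented scores; which population is treated as "positive" is an arbitrary choice made for the example.

```python
def roc_auc(scores, labels, positive):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, lab in zip(scores, labels) if lab == positive]
    neg = [s for s, lab in zip(scores, labels) if lab != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive sample ranked above negative sample
            elif p == n:
                wins += 0.5   # ties contribute one half
    return wins / (len(pos) * len(neg))


# Invented scores for illustration (not data from the study).
demo_labels = ["CHB", "CHB", "CEU", "CEU"]
perfect = roc_auc([0.9, 0.8, 0.3, 0.1], demo_labels, positive="CHB")
imperfect = roc_auc([0.9, 0.4, 0.6, 0.1], demo_labels, positive="CHB")
```

Here `perfect` is 1.0 because every CHB score exceeds every CEU score, while `imperfect` is 0.75 because one of the four CHB/CEU pairs is ranked in the wrong order.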