Samples and DNA extraction
Wild-type and double-knockout HCT116 cells (respectively called HCT116 WT and HCT116 DKO) were cultured in triplicate in McCoy’s 5A medium supplemented with 10% foetal calf serum at 37°C under 5% CO2. Genomic DNA was extracted with the QIAamp DNA Mini Kit (Qiagen, Hilden, Germany) and with the recommended proteinase K and RNase A digestions. Ten fresh-frozen breast tumour samples and four normal breast tissue samples, previously profiled with the 450k array(69), were obtained from patients diagnosed with breast cancer at the Jules Bordet Institute between 1995 and 2003, following approval by the Medical Ethics Committee of Institute Jules Bordet, Brussels, Belgium. All patients gave written informed consent before their participation in the study. Genomic DNA from frozen samples was extracted from 10-μm sections with the Qiagen DNeasy Blood and Tissue Kit according to the supplier’s instructions (Qiagen). The procedure included proteinase K digestion at 55°C overnight. DNA was quantified with the NanoDrop® ND-1000 UV–Vis Spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA).
Bisulfite conversion and DNA methylation profiling with the 450k and 850k arrays
Genomic DNA (800 ng) treatment with sodium bisulfite was done with the Zymo EZ DNA Methylation KitTM (Zymo Research, Orange, CA, USA) according to the manufacturer’s procedure, with the alternative incubation conditions recommended when using the Illumina Infinium Methylation Assay. The methylation assay was performed on 4 μl bisulfite-converted genomic DNA at 50 ng/μl according to the Infinium HD Methylation Assay protocol. Bead chip arrays were scanned on Illumina Scan and raw .idat files were generated.
Array data analysis
All analyses were done with R (v3.4.4) except CpG annotation, which was done with python (v2.7.15).
1. Data loading
Depending on the normalisation method, two different loading methods were used. The read.metharray function of the minfi bioconductor R package (v1.24.0) was used to load data into a minfi-specific RGChannelSet object when normalisation methods requiring this object were used. For the other normalisation methods, raw .idat files of 450k data were loaded with the methylumi bioconductor R package (v2.24.1), while 850k data were loaded with the illuminaio package (v0.20.0). The detection p-values were obtained with Genome Studio® software (v1.6) and probes with detection p-values ≥ 0.05 were filtered out. Data quality was checked visually with control probes and all samples passed this quality control.
2. Data normalisation
The following five methods were used to normalise the data:
i) Peak-based correction (PBC)(11), the preprocessing pipeline developed by Touleimat and Tost (Tost)(61), and the background correction and quantile normalisation method(62) treating types I and II separately (Dasen) were run with the wateRmelon package and array-specific annotation files provided by Illumina (‘MethylationEPIC_v-1-0_B4.csv’: http://emea.support.illumina.com/downloads/infinium-methylationepic-v1-0-product-files.html and ‘HumanMethylation450_15017482_v.1.2.csv’: http://emea.support.illumina.com/array/array_kits/infinium_humanmethylation450_beadchip_kit/downloads.html).
ii) The normal exponential convolution model using out-of-band intensities (NOOB)(59), the subset quantile for within-array normalisation (SWAN)(63), and the NOOB followed by functional normalisation pipeline (NOOB+Fun)(65) were run with the appropriate functions of the minfi package (i.e. preprocessSWAN, preprocessNOOB and preprocessFunnorm, respectively).
iii) The regression on correlated probes method (RCP)(57) was run with the method of the ENmix bioconductor R package (v1.14) after applying the preprocessRaw method of the minfi package.
iv) The pipeline proposed by Marabita et al.(60), consisting in quantile normalisation on the intensity signal followed by beta-mixture quantile normalisation (QN+BMIQ), was run with the lumiMethyN function of the lumi package and the BMIQ function (v 1.1) (code available at https://code.google.com/archive/p/bmiq/).
v) LOESS between-array normalisation from Heiss et al.(70) was adapted from the R code provided with the associated paper.
3. Probe filtering
After normalisation, ambiguous probes from both sex chromosomes were discarded. Cross-reactive probes identified by Price et al.(71) were filtered out of the 450k data and the annotation of McCartney et al.(72) was used to filter out cross-reactive 850k probes. Breast cancer data were also filtered against probes targeting SNPs identified by Price et al. (450k) or McCartney et al. (850k). Raw data (.idat) and preprocessed data were submitted to the Gene Expression Omnibus public database (GEO) (www.ncbi.nlm.nih.gov/geo/) (token will be provided once GEO approved data).
Reduced-representation bisulfite sequencing (RRBS) data
Two already-processed public RRBS datasets were used with different aims:
i) For CpG coverage comparisons, processed RRBS data for three Ewing sarcoma samples were downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89026)(73). CpGs located in both sex chromosomes or covered by less than 10 reads were filtered out. Only CpGs whose methylation status was available for all three samples were analysed.
ii) For the comparison of methylation levels, RRBS data on HCT116 cell samples from two independent experiments were downloaded from ENCODE (https://www.encodeproject.org/experiments/ENCSR000DFS/). To simplify the comparison with bead-array technology, strand specificity was not taken into account. When methylation values were available for both strands of the same CpG site, the values were averaged if the two values were similar (delta < 10%) and discarded otherwise. Only cytosines common to the 450k and 850k arrays and covered by at least 10 reads were kept for subsequent analysis. In order to maximise the number of methylation-level comparisons between 850k and RRBS, cytosines whose methylation levels were available for only one sample were kept in this dataset.
CpG annotation
CpGs were annotated with human genome build hg19. Ensembl positions were lifted from hg38 to hg19 with the ‘LiftOver’ tool from UCSC (https://genome.ucsc.edu/cgi-bin/hgLiftOver).
1. CpG islands
CpG island (CGI) positions were retrieved from the UCSC database (https://genome.ucsc.edu/cgi-bin/hgTables). CGI shores were generated by adding 2 kb at both ends of each CGI. 850k, 450k, and RRBS data were overlapped with CGIs and Shores, using a custom python script. CpGs not located in a CGI or a Shore region were annotated as ‘Open-sea’.
2. Regulatory regions
Regulatory genomic regions were retrieved with the ‘UCSC ENCODE Experiment Matrix tool’ (https://genome.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html) for all available cell lines (HMEC, HSMM, K562, NHEK, NHLF, HEPG2, HUVEC, HESC, and GM12878). For each cell line, the hidden Markov model provided by ENCODE classifies the whole genome into 15 chromatin states. In order to simplify this annotation, we grouped the ‘1_Active_Promoter’, ‘2_Weak_Promoter’, and ‘3_Poised_Promoter’ states into a ‘Promoter’ super-region while ‘4_Strong_Enhancer’, ‘5_Strong_Enhancer’, ‘6_Weak_Enhancer’ and ‘7_Weak_Enhancer’ states were assigned to the ‘Enhancer’ super-region. The other states were ignored. States from the same super-region succeeding each other in the genome were fused. RRBS, 850k and 450k data were then overlapped with the ‘Promoter’ and ‘Enhancer’ super-regions in each cell line. As a CpG can be associated with a ‘Promoter’ super-region in one cell line and an ‘Enhancer’ in another, we assigned such CpGs to a third category called ‘Dual’ to reflect their dual role.
3. Transcripts
Coding and noncoding transcript positions were retrieved from the LNCipedia high-confidence set version 5.2 and from Ensembl v93. Duplicates between the two databases were identified with the ‘lncipedia_5_2_hc_hg19.gff’ file from LNCipedia and only the LNCipedia version of each duplicated lncRNA was kept. Several CpG-to-transcript association types were investigated:
i) Association with the transcription start site (TSS): each transcript whose TSS was located in a ‘Promoter’ super-region was assigned to that promoter and associated with the CpGs of that promoter. If a TSS for an lncRNA transcript fell into an ‘Enhancer’ super-region, the transcript was associated with that region and defined as an eRNA (enhancer RNA). The list of ‘Promoter’/‘Enhancer’-transcript associations is available in Table S1 (Additional File 2).
ii) Enhancer targets: To identify targets of the ‘Enhancer’ super-regions we used the EnhancerAtlas database (data downloaded in June 2016). This database associates its own set of enhancer positions with Ensembl transcripts for a set of 73 cell lines. For each ‘Enhancer’ super-region position from ENCODE overlapping a position from EnhancerAtlas in the same cell line, we associated the target provided by EnhancerAtlas with this ‘Enhancer’. Remaining ‘Enhancers’ were characterised as having an unknown target.
iii) Gene body association: Each CpG assessed by RRBS, 850k or 450k was also intersected with LNCipedia and Ensembl transcript positions. The transcription start site (TSS) and transcription termination site (TTS) corresponding to each transcript were located on the genome. CpGs falling between a transcript-associated TSS and the corresponding TTS were described as ‘Gene body associated’. If such a CpG was not already classified as ‘Promoter’, ‘Enhancer’, or ‘Dual’, it was classified as ‘Gene body’, while the remaining CpGs were assigned to the ‘Intergenic’ category. This annotation is available as alternative platform annotation on GEO (token will be provided once GEO approved data). The list of data source URLs is available in Table S2 (Additional File 3).
4. Promoter and non-promoter regions
Analyses using only the ‘Promoter’ and ‘Non-promoter’ categories were carried out with the following grouping: ‘Promoter’ and ‘Dual’ cytosines were classified as ‘Promoter’ cytosines while ‘Enhancer’, ‘Gene body’, and ‘Intergenic’ cytosines were defined as ‘Non-promoter’ cytosines.
5. Illumina default annotation
To compare the Illumina default annotation with our annotation, we collapsed the different levels of information provided by the Illumina 850k annotation file into the five categories used in our custom annotation. Associations with known transcripts were extracted from the ‘UCSC_RefGene_Group’ and ‘GencodeCompV12_Group’ columns. CpGs associated with ‘TSS 1500’, ‘TSS 200’, ‘1st Exon’ and ‘5' UTR’ locations were categorised as ‘Promoter’ CpGs, while the remaining CpGs, associated with ‘Body’ and ‘3' UTR’, were categorised as ‘Gene body’ CpGs. Enhancer-associated CpGs were retrieved from the ‘Phantom4_Enhancers’, ‘Phantom5_Enhancers’, and ‘450k_Enhancer’ columns. These CpGs were categorised as ‘Enhancer’ or ‘Dual’ CpGs, depending on their association with the ‘Promoter’ category. The remaining CpGs were categorised as ‘Intergenic’.
Variance heterogeneity evaluation
Variance heterogeneity was assessed with a metric we called ‘variance heterogeneity measurement’ (h), which can be described as the distance between the mean standard deviation (representing the expected profile in a context of homogeneous variance) and a local regression model of the standard deviation along the methylation profile (reflecting the observed situation). It was computed as follows:
- The M value was computed for each probe with the formula described elsewhere(11).
- The standard deviation (sp) and the mean (mp) of the M-value of each probe across the HCT116 WT triplicates were computed so as to generate the vectors s and m.
- The observed profile was modelled with a local regression (loess) model fitting s as a function of m and a loess-smoothed value of the standard deviation was produced for each probe (sp*).
- In a context of homogeneous variance, s should be independent of m, so the expected profile should be modelled as a flat line at s0, the mean of s.
- A measurement of the variance heterogeneity (hp) was computed for each probe as the absolute difference between sp* (value from the observed model) and s0 (expected value in a homogeneous context).
- Finally, h was defined as the mean of all hp values.
Differential methylation analysis
Differential methylation was assessed with a t-test applied to the M-values. In parallel, a delta beta value was computed as the absolute difference between the median beta value within each category. CpGs showing an adjusted p-value (Benjamini-Hochberg correction) < 0.05 together with an absolute delta beta > 0.2 were reported as differentially methylated.