Study subjects and ethical considerations
Nine SSc patients and nine control subjects gave informed consent and were recruited from an ongoing SSc research cohort based at McGill University, Montreal, Canada. Of the nine SSc patients, none were on immunosuppressive drugs at the time of sampling (three were previously on methotrexate and mycophenolate but those medications had been discontinued for > 1 year).
Cell purification and whole-genome bisulfite sequencing (WGBS)
Forty milliliters of blood were obtained from each study subject and processed fresh within 4 hours of being drawn. CD4+ T cells were positively selected [anti-CD4 microbeads (Miltenyi Biotec) and auto-MACS] and their purity assessed with flow cytometry. Only samples with a purity >95% were used for genomic DNA extraction and sequencing. The samples were processed using the in-house DNA isolation and Illumina HiSeq 4000 PE 100 WGBS workflows at the McGill University and Genome Quebec Innovation Centre. Quality control of the genetic materials was performed using fluorescence assay quantification, agarose gel electrophoresis and NanoDrop nucleic acid quantification to ensure sufficient quantity, quality and purity.
Data processing and filtering
The WGBS data were aligned to the human genome GRCh37 (hg19) using the NovoAlignTM pipeline (http://www.novocraft.com/). For each cytosine retained for further analysis, coverage by both strands in the paired-end sequencing library was required. To ensure accuracy in estimation of methylation level, valid cytosines with good read depth were extracted for CpG, CHG, and CHH motifs respectively. For each valid CpG dinucleotide, the estimated methylation level was obtained after merging methylated and unmethylated read counts for the forward and reverse cytosines. Read depth was required to be deeper than 3 at both C/G or C/H sites and the between-site difference in empirical methylation b values was required to be less than 0.2. For each valid CHG/CHH, minimum read depth required for further analysis was 6. Genome-wide SNPs were identified from the same dataset using the Bis-SNP pipeline [20].
Identification of differentially methylated regions
We used bumphunter version 3.3 [21] to identify DMRs in five sets of comparisons:
(i) SSc cases (N = 9) versus female controls (N = 4);
(ii) SSc cases (N = 9) versus all controls (N = 9);
(iii) Diffuse SSc cases (N = 6) versus female controls (N = 4);
(iv) Limited SSc cases (N = 3) versus female controls (N = 4);
(v) Diffuse SSc cases (N = 6) versus limited SSc cases (N = 3).
In all sets of comparisons, we adjusted for the additive effect of age within bumphunter. Moreover, in (ii), we additionally adjusted for the effect of gender.
To maintain statistical power, we restricted the analysis to regions with consistently high coverage across samples and imposed a minimum coverage rate on a per-CpG dinucleotide or per-CHG/CHH site basis. We imposed a minimum sample-level coverage rate for each site analyzed; minimum sample sizes for the analyses presented in (i)-(v) were 6/9, 4/6, 3/4 and 3/3, respectively. Cytosines were then clustered into regions with a maximum 200 bp gap between two cytosines in the same region. The total number of tests done for each of (i)-(v) was 392,810, 383,235, and 380,756 for CpG, CHG and CHH, respectively.
Regions with an adjusted p value (q-value) <0.05 and an average methylation level difference > 0.2 reported by bumphunter were considered to be DMRs. A Bonferroni corrected p-value threshold of significance for five genome-wide analyses each comprising more than comparisons would require a significance threshold below , which would be very unlikely to achieve here given the sample size. For example, for a simple t-test we would require a standardized difference of 9.2 to obtain 90% power at this significance level; that is, the mean difference in methylation between SSc and controls would have to be nine times larger than the standard deviation. Thus, our results should be considered as preliminary and therefore we have placed most emphasis on the results of analysis (i) as they included all the female patients and avoided the potential confounding effect of sex.
Multiple testing
We used the false discovery rate estimates from bumphunter to select DMRs, and added a filter requiring that the difference in methylation be at least 0.2. We also performed a permutation test of the primary analysis comparison between SSc cases (N = 9) versus female controls (N = 4). We randomly relabelled samples as SSc cases or controls, and then repeated the genome-wide identification of DMRs using bumphunter. We repeated the permutation and genome-wide analysis 40 times, and then counted in how many permutations a previously identified DMR was still identified as a DMR with an identical or smaller p value. We also compared the number of identified DMRs between the original data and the permutations.
Annotation of DMR and functional analysis
Genomic context of each DMR was annotated by annotatr [22] based on the most recent annotations of human genome downloaded from the UCSC genome browser (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/. Accessed 5 March 2019). All genes overlapping with DMR were regarded as DMGs. We performed functional analysis using Ingenuity Pathway Analysis [23] to investigate potential biological impacts through epigenetic alterations in these DMGs. Adjusted p values and averaged methylation level difference of DMRs were used to indicate the degree of discrepancy. For single-DMR genes, the averaged difference of the corresponding DMR represents the gene-level difference. For genes associated with more than one DMR, we calculated the average of the averaged difference of each DMR to represent overall methylation level difference. Genes with both hypermethylated and hypomethylated DMRs may therefore have had the differences neutralized. Identification of DMR and functional analysis were conducted on CpG, CHG and CHH separately.
Detection of SNP-CpG associations
We explored short-range SNP-CpG associations around a selected subset of CpG-based DMR identified in the comparison between SSc cases (N = 9) versus female controls (N = 4) that could be deemed consequential for SSc. Within a window of ±5 kb around each CpG-based DMR (inclusive), we extracted all SNPs as candidates for cis-interaction with the methylation pattern of the DMR. Simultaneously, within each CpG-based DMR, we first regressed out the effect of age and obtained residual methylation level on each CpG dinucleotide. We then used a multivariate method (PCEV [24]) to test associations between residual methylation levels and the binary disease status. We followed this by examining the PCEV-derived variable importance measures to identify the dinucleotide most strongly associated with disease status in the DMR. We extracted the residual methylation level on this specific CpG dinucleotide as a representative of the DMR and performed linear regression with all candidate SNPs of the corresponding DMR. In total, linear associations between 599 DMR and 36,838 candidate SNPs were tested. Unadjusted p values were reported with adjusted for each of the 36,838 SNP-CpG pairs. To adjust for multiple testing on this analysis would require appropriate adjustments for linkage disequilibrium and were not undertaken here.