Data
We used clinical, biospecimen, gene expression (RNAseq V2) and DNA methylation (Illumina Human Methylation 450K) data of 1,098 patients with breast invasive carcinoma from TCGA (cancergenome.nih.gov). Samples represented in TCGA were all collected prior to adjuvant therapy (21). TCGA also recorded patient follow-up information describing clinical events such as type of treatment, the number of days from the date of initial pathological diagnosis to a new tumour event, death, and date of last contact. Since clinical and biospecimen data are scattered over multiple files in the TCGA repository, we first merged all information in a single table with one row per patient using the patient identifiers provided in the clinical and biospecimen data. Subsequently, we corrected drug names for tamoxifen and AIs (anastrozole, exemestane and letrozole) for spelling variants and mapped synonyms to their generic drug names (Additional File 1). We selected the subset of patients (samples) that were treated with AI or tamoxifen.
Patient cohorts
For all patients with DNA methylation data available we selected data from primary tumours (indicated with “01” in the patient barcode) of female ER+/HER2- BRCA patients (Figure 1). The molecular subtype was determined using TCGA gene expression data for these samples (see below). The ER+/HER2- cohort was further subdivided according to the endocrine treatment (AI or tamoxifen) that patients received during follow-up. Patients who received both drugs were included in both sub-cohorts. Consequently, we considered three patient cohorts, i.e., ER+/HER2-, AI, and TAM.
Subtype determination
Information for BRCA subtyping by immunohistochemistry of ER or HER2 is missing for 192 out of 1,098 patients. Therefore, we used TCGA BRCA RNAseq V2 gene expression data to determine molecular subtypes (Additional File 2). To this end, gene expression data from primary tumours were retrieved from the Genomic Data Commons legacy archive using the R package TCGAbiolinks (22). RSEM estimated abundances were normalised using the upper quartile method from the R package edgeR (23) and subsequently log2-transformed with an offset of one. BRCA subtypes ER-/HER2-, HER2+, and the lowly proliferative ER+/HER2- (luminal A) and highly proliferative ER+/HER2- (luminal B) subtypes were determined using the SCMOD2 model from the R package genefu (24).
DNA methylation data and pre-processing
Illumina Human Methylation 450K raw data (IDAT files) for the patients in the cohorts defined above were retrieved from TCGA. Pre-processing was performed using the R package minfi (25). Detection p-values were calculated for each methylation probe. 82,150 probes showed an unreliable signal (p>0.01) in one or more samples and were removed. Data were normalized using functional normalization (26). Probes corresponding to loci that contain a SNP in the CpG site or in the single-base extension site were removed. We also removed probes that have been shown to cross-hybridize to multiple genomic positions (27). Finally, M-values were calculated and probes with low variation across samples (standard deviation of M-values 0.4) were removed. The final data set comprised 322,426 CpG loci. Probes were annotated to genes and enhancer regions using the R package IlluminaHumanMethylation450kanno.ilmn12.hg19.
Survival analysis
Clinical variables
Based on literature (28-30) we selected menopause status (pre/post, after merging pre- and peri-menopausal; values ‘[Unknown]’ and ‘Indeterminate’ were considered missing), AI treatment (yes, no), tamoxifen treatment (yes, no), tumour stage (I-IV, after merging subcategories; stage X was considered missing), and age at diagnosis as candidate variables predictive of survival. We tested association with survival using the Cox proportional hazards model (R package survival). We defined an event as the first occurrence of a new tumour event or death. For patients without an event we used the latest contact date as provided by the clinical data (right censoring). To account for missing values for the variables menopause status and stage we used multiple imputation (R package mice) to generate 50 datasets and perform survival analysis on each dataset separately (31). The input data used for multiple imputation is available in Additional File 3. Rubin’s rule was applied to combine individual estimates and standard errors (SEs) of the model coefficients from each of the imputed datasets into an overall estimate and SE resulting in a single p-value for each variable. Clinical variables with a p-value <0.10 in a univariable survival model were selected for inclusion in the multivariable survival model. Variables in the final multivariable model were determined using backward selection by iteratively removing variables with the highest p-value until all variables had a p-value <0.05.
Single-locus survival analysis
Next we performed survival analysis to identify single methylation loci associated with patient survival using the methylation M-values in a Cox proportional hazards model. The models for each locus were adjusted for significant clinical variables from the multivariable model. To account for missing values for clinical variables, multiple imputation was used as described above. Resulting p-values were corrected for multiple testing using the Benjamini-Hochberg false discovery rate (FDR). Adjusted p-values <0.05 were considered statistically significant. Subsequently, single-locus survival models were also adjusted for ER+/HER2- subtypes (luminal A/luminal B) in addition to the clinical variables selected above. Kaplan-Meier curves for individual loci were determined by calculating the median of the methylation levels over all patients in a cohort and then assigning a patient to a low (methylation level < median) and a high (methylation level ≥ median) group.
Multi-locus survival analysis
We used the Cox proportional hazards model with elastic net regularization (function cv.glmnet, R package glmnet) (32) to identify a signature of multiple methylation loci associated with survival. We followed a two-stage approach. First, the CpG signature was determined without including clinical variables using Cox regression with elastic net penalty. Secondly, from the resulting model the risk score (see below) was calculated and used in a new model that includes the clinical variables selected above in order to establish whether the methylation signature provided additional information compared to merely using clinical variables. Optimal values, minimizing the partial likelihood deviance, for the elastic net mixing parameter (α) and tuning parameter (λ) were determined by stratified (for event status) 10-fold cross-validation using a grid search varying α from 0 to 1 in steps of 0.1 and using 100 values for λ that were automatically generated for each α. We constructed one model for each of the three cohorts (ER+/HER2-, AI, TAM). Subsequently, for each cohort we used the identified signature to calculate a risk score for each patient:
[Please see supplementary files to access the equation.]
where for CpG locus i, ci denotes the corresponding coefficient in the Cox model and Mi the methylation M-value. Next, multivariable Cox proportional hazards regression was performed using the risk score as a variable and adjusting for significant clinical variables from the multivariable model. Missing values for the clinical variables were imputed as described above. Finally, the risk-score-based models were also adjusted for ER+/HER2- subtypes (luminal A/luminal B) in addition to the selected clinical variables. Kaplan-Meier curves were determined for two groups of patients by calculating the median of the risk scores over all patients in a cohort and then assigning a patient to a good (risk score < median) and a bad prognosis group (risk score ≥ median).
Stability of multi-locus signatures
To assess the stability of the multi-locus signatures 50 regularized Cox models were fitted using a stratified (for event status) selection of 90% of the samples for each cohort. We counted the number of times each CpG locus was included in the 50 signatures and then selected those CpGs that occurred in at least 10 or at least 35 signatures. We refer to the resulting signatures as stability signatures. Fisher’s exact test was used to determine the significance of the overlap between the original multi-locus signature and the stability signatures.
Correlation between DNA methylation and gene expression
CpGs in single-locus and multi-locus signatures were annotated to their nearest gene(s) (package IlluminaHumanMethylation450kanno.ilmn12.hg19). For each signature Pearson correlation coefficients (and corresponding p-values) were calculated between the methylation and gene expression profiles of each CpG-gene pair. Resulting p-values were corrected for multiple testing in each signature using the Benjamini-Hochberg FDR.
Methylation profiling of resistance acquisition in an ER+ breast cancer cell line
T47D cells were either treated with 100 nM 4-hydroxytamoxifen (TMX), long-term estrogen deprived (LTED; modelling AI treatment) (33) or not treated (wild type (WT)) in two biological replicates cultured for 7 and 5 months, respectively. DNA was extracted after 0, 1, 2, 5 and 7 (only one replicate) months. Methylation profiling was performed using the Illumina MethylationEPIC BeadChip platform at the Genomic and Proteomic Core Facility (DKFZ, Germany). For each sample two technical replicates were measured. Pre-processing was performed as described above. 8,682 probes showed an unreliable signal (detection p-value >0.01) in one or more samples and were removed. Probes that cross-hybridized to multiple genomic positions as listed by Pidsley et al. (34) were removed. No filtering based on M-values was performed. The final data set contains 786,500 CpG loci. Using the resulting M-values CpG-wise linear models were fitted with coefficients for each treatment (TMX, LTED, WT) and time point combination. In addition, we included a coefficient to correct for systematic differences between the two biological replicates (R package limma). For both LTED and TMX treated cells, contrasts were made between each individual time point t and the WT cell line at baseline, that is, LTEDt – WT0 and TMXt – WT0, respectively. The comparison of the average of TMX and LTED treated cells versus WT baseline was estimated via the contrast (LTEDt + TMXt)/2 – WT0. Differential methylation was assessed using empirical Bayes moderated statistics while also including the consensus correlation within pairs of technical replicates in the linear model fit (function duplicateCorrelation, limma package). The resulting signatures are referred to as the LTED, TMX and TMX/LTED signatures.
Enrichment analysis
We performed generalized gene set testing using the gsameth function (R package missMethyl) to test if significant CpG sites are enriched in selected pathways. Gsameth uses a hypergeometric test and corrects for the highly variable number of CpG sites per gene on 450K and EPIC arrays (35). For the single-locus survival analysis signatures were defined as those CpGs with p-value <0.006 (TAM, AI) and p-value<0.002 (ER+/HER2-) corresponding to signatures of ~2,500 CpGs. For the T47D RA experiment signatures were defined as the top 10,000 CpGs ranked on p-value as determined using a moderated F-test (limma package), which tests whether a CpG is differentially methylated at any timepoint versus WT, for the three sets of contrasts (TMX, LTED, TMX/LTED) described above. We used a combination of Hallmark gene sets (collection H) and a subset of 16 curated gene sets (collection C2; gene set name contained either “tamoxifen” or “endocrine_therapy”) from the Molecular Signatures Database (MSigDB) v7.0 (Entrez Gene ID version) (36). Resulting p-values were corrected for multiple testing using the Benjamini-Hochberg FDR.
We also tested whether the methylation loci identified from the TCGA BRCA single-locus and multi-locus signatures (based on Illumina 450k arrays) and also represented on the Illumina EPIC array were enriched in the T47D RA experiment using ROAST rotation-based gene set tests (limma package) (37). Enrichment of TAM and AI survival signatures was assessed using the comparisons of respectively TMX and LTED treated cells to WT baseline described above. Enrichment of the ER+/HER2- survival signature was assessed using the comparison of the average of TMX and LTED treated cells versus WT baseline described above. ROAST p-values were calculated, for two alternative hypotheses denoted as ‘up’ and ‘down’ using 9999 rotations. In the ROAST analyses directional contribution weights of 1 or -1 were used depending on whether a CpG of the signature under consideration had a positive (corresponding to increased risk of an event) or negative (corresponding to decreased risk of an event) coefficient in the corresponding Cox model. In this case, the alternative hypothesis ‘up’ corresponds to methylation levels changing in the same direction in the TCGA BRCA survival signature and in the T47D RA experiment, whereas the alternative hypothesis ‘down’ corresponds to a change in the opposite direction (Figure 2). The two-sided directional p-value is reported.
Quantitative real-time PCR
Total RNA was isolated from WT and T47D cells treated with tamoxifen or deprived from estrogen with RNeasy Mini kit (Qiagen, Hilden, Germany) according to the manufacturer’s instructions and treated with DNase Max Kit (Qiagen). cDNA was synthesized with the Revert Aid H Minus First Strand cDNA Synthesis Kit (Fermentas, Waltham, MA, USA). Quantitative real-time PCR (qRT-PCR) reactions for target genes were performed with the Applied Biosystems QuantStudio™ 3 Real-Time PCR System, using probes from the Universal Probe Library, UPL (Roche Diagnostics, Mannheim, Germany). The data were analyzed using the SDS software with the ΔΔCt method. The Ct values were normalized to the housekeeping gene ACTB.