Data and preprocessing
TCGA breast cancer data: We downloaded TCGA breast cancer (BRCA) ATAC-seq raw bam files (n=75) and RNA-seq raw fastq.gz files from NCI Genomic Data Commons (GDC) data portal (https://portal.gdc.cancer.gov) (54). Breast cancer peak calls and bigwig files of ATAC-seq profiles were downloaded from https://gdc.cancer.gov/about-data/publications/ATACseq-AWG. For the hormone receptor subtypes of TCGA BRCA tumors, we followed the annotation data provided by the TCGA ATAC-seq data publication, Supplementary Data 1 (21). The RNA-seq read count and reverse phase protein array (RPPA, replicate-base normalization) data were downloaded from Xena Functional Genomics Explorer (https://xenabrowser.net/hub/) GDC and TCGA hub, respectively.
METABRIC breast cancer data: We downloaded METABRIC microarray data from cBioPortal (https://www.cbioportal.org/study/summary?id=brca_metabric) (55).
Human Protein Atlas: The Human Protein Atlas (https://www.proteinatlas.org) is a public resource that extracts and analyzes information, including images of immunohistochemistry (IHC), protein profiling, and pathologic information, from specimens and clinical material from cancer patients to determine global protein expression (56). Here, we compared the protein expression of available TFs in ILC and IDC tissues by IHC image.
Breast cancer cell lines shRNA screen: To identify and validate the TFs essential in ER+ ILC and ER+ IDC cell lines, we accessed the data for whole-genome small hairpin RNA (shRNA) ‘‘dropout screens’’ on three ER+ ILC and 11 ER+ IDC breast cancer cell lines (45) (GSE74702).
Differential peak accessibility
Reads aligning to atlas peak regions (hg19) were counted using the countOverlaps function of the R packages, GenomicAlignments (v1.30) (57) and GenomicRanges (v1.46.1) (57). To remove the bias created by low count peaks, we filtered 19,364 peaks with mean counts of less than 30 across all samples. Differential accessibility of peaks was calculated using DESeq2 (v1.34.0) (58). DA peaks were defined as significant if they had an adjusted p-value < 0.05 and the magnitude of the DESeq-normalized counts changed by a stringent factor of one or more between ER+HER2- ILC and ER+/HER2- IDC. The significant DA peaks were aggregated and represented in the hierarchical clustering heatmap using the DESeq size-factor normalized read counts and the ‘complete’ distance metric for the clustering algorithm. We used ChiPseeker (59) and clusterProfiler (60) R packages for peak region annotation and visualization of peak coverage over chromosomes.
ATAC-seq peak clustering
For visualization of ER+/HER2- ILC, ER+/HER2- IDC, and ER+/HER2+ IDC tumors by PCA, we used DESeq2 (v1.34.0) (58) to fit multi-factorial models to ATAC-seq read counts in peaks. We used all peak counts and generated DESeq2 models including factors for hormone receptor subtypes (ER +/- and HER2 +/-) and histological classes (ILC vs IDC). Then, we calculated a variance stabilizing transformation from the DESeq2 model and performed PCA.
De novo TF motif analysis
The HOMER v.4.11.1 utility findMotifsGenome.pl (22) was used to identify the top 10 TF motifs enriched in differential accessible peaks. We set 100-bp-wide regions around the DA peak summits with hg19 as the reference genome. We generated custom background regions with a 150-bp-wide range around the peak summits. The top motifs were reported and compared to the HOMER database of known motifs and then manually curated to restrict them to TFs that are expressed based on RNA-seq data and to similar motifs from TFs belonging to the same family.
Pathway enrichment analysis
We used GREAT (Genomic Regions Enrichment of Annotations Tool, v1.26) to associate the subcluster of the DA peaks to genes and used pathway analysis to identify the functional significance of the DA peaks (37).
TF essentiality analyses in ER+ ILC and ER+ IDC cell lines
We used small interfering RNA (siRNA)/shRNA mixed-effect model (siMEM) (45) to score the screening results of the TFs and identify their significant context-specific essentiality between ER+ ILC and ER+ IDC from the shRNA screening data. The significantly essential TFs had an FDR q-value < 0.2 in the siMEM results. The annotation data for ER subtype and histological types in the breast cancer cell lines are available at https://github.com/neellab/simem.
Cis-Regulatory Element Motif Activity analysis
We used the CREMA (Cis-Regulatory Element Motif Activities, https://crema.unibas.ch/) to analyze genome-wide DNA-accessibility, calculate TF motif activities, and identify active cis-regulatory elements (CREs). CREMA first identifies all CREs genome-wide that are accessible in at least one sample, quantifies the accessibility of each CRE in each sample, predicts TF binding sites (TFBSs) for hundreds of TFs in all CREs, and then models the observed accessibilities across samples in terms of these TFBS, inferring the activity of each TF in each sample. A Wilcoxon rank sum test was used to compare TF activities and assess the association between TF and histologic subtypes. Then, the resulting p-values were adjusted for multiple hypothesis testing (across TFs). This analysis was visualized with a scatterplot where the x-axis represents mean TF activity difference, and the y-axis represents FDR-corrected p-value. The significant TF motifs were selected by absolute mean TF activity difference > 0.035 and FDR-corrected p-value < 0.05.
The TF targets identified by CREMA are CREs, not genes directly. After identifying TF target CREs, the gene-CRE association probabilities are calculated on the basis of distance to transcription start sites (TSSs) of gene within +/- 1,000,000 bp, using a weighing function. The weighing function quantifies the prior probability that a CRE will regulate a TSS at distance d and is a mixture of two Lorentzian distributions with length-scales 150 bp (corresponding to promoter regions) and 50Kb (corresponding to enhancer regions). This weighing function is used to weigh log-likelihood score per possible CRE-TSS interaction. The target gene score is a sum of the log-likelihood scores of all CREs associated with the gene weighted with the association probability. Then, the scores were used to predict over-represented canonical pathways in the TF’s target genes.
Differential gene expression analysis
We ran DESeq2 on the TCGA RNA-seq read count data between ER+/HER2- ILCs (n=100) and ER+/HER2- IDCs (n=297), which include all available tumors for hormone receptor and histological subtypes. We used the Limma (v3.48) (61) package to calculate the log2 fold change of differentially expressed genes between ER+/HER2- ILCs (n=121) and ER+/HER2- IDCs (n=1,030) for the METABRIC dataset.
We calculated the cumulative distribution of expression changes for the target genes and background genes and ran the Kolmogorov-Smirnov (K-S) statistic to quantify the distance between empirical cumulative distribution function (eCDF) of target genes and cumulative distribution function (CDF) of background genes and determine its significance. We used all 16,537 genes as background genes after removing genes with low mean counts across samples.
Statistical analysis and data visualization
All statistical analyses were performed using R version 4.1.1 (R Foundation for Statistical Computing, Vienna, Austria) (62). Heatmaps were generated using the R package ComplexHeatmap v2.10.0 (63). Graphs were generated using the R package ggplot2 v3.3.5 (64). Genome track images were generated using the IGV (v2.11.1) (65). P-values in multiple comparisons were adjusted using the Benjamini–Hochberg (BH) method.