Materials or datasets
A chemorefractory cohort from the Mayo Clinic breast cancer study
The Breast Cancer Genome Guided Therapy Study (BEAUTY) is a prospective institutional review board–approved NAC clinical trial (NCT02022202) from March 2012 to May 2014. As previously described[3], the BEAUTY study enrolled 140 patients with invasive breast cancer across Mayo Clinic Rochester, Mayo Clinic Florida, and Mayo Clinic Arizona. All patients provided written informed consent. The analysis in this manuscript includes the subcohort of 42 patients with TNBC who received 12 weeks of weekly paclitaxel, followed by four cycles of an anthracycline-based regimen. The methods were performed in accordance with relevant guidelines and regulations and approved by the Mayo Clinic Institutional Review Board (IRB).
Eighteen of the 42 patients with TNBC had the residual disease at surgery. We obtained pre-NAC and post-NAC transcriptomics data for these 18 cases (36 tumors) and identified luminal androgen positive (LAR) cases using our in-house algorithm (LAR-Sig)[13]. Due to significant clinical, molecular, and pathological differences we previously observed between the LAR and non-LAR tumors, we focused our analysis on the non-LAR (AR negative) TNBC patients[14-18]. We also removed three samples as the post-NAC biopsies contained minimal epithelial tissue (see below). The final cohort for paired visit analysis included 24 tumors with four early recurrent (ERC, <24 months post-NAC) and eight non-recurrent samples (NRC, >= 48 months post-NAC). In addition to gene expression data from the BEAUTY study, we also investigated copy number alteration and somatic mutation data from whole-exome sequencing (WES, pre-NAC due to availability). Here we summarize our findings of interrogating paired gene expression data, paired protein expression data (described below), and pre-NAC WES data from the BEAUTY cohort.
Reverse Phase Protein Array data from the BEAUTY breast cancer study
Lysates from paired tissue preparations of the paired breast tumor tissue samples at pre-NAC and post-NAC time points from the BEAUTY study were provided for evaluation by Reverse Phase Protein Array (RPPA). Relative protein levels were determined by interpolation using a super curve provided by MD-Anderson Cancer Center for functional proteomics[19]. Of the twelve patients in the ERC and NRC groups, eight had RPPA data at pre-NAC, and six had RPPA data at post-NAC visits. We performed two-sample t-tests with pooled variances on the log2 transformed protein levels for 295 antibodies between the pre-NAC and post-NAC data.
Tumor bed assessment using digital deconvolution methods
A digital deconvolution was performed on the post-NAC samples using gene expression data to ensure sufficient epithelial cells were obtained rather than tumor beds. Twenty cell types were previously observed in a single-cell RNA sequencing experiment, originating from five patients with primary TNBC[20]. We obtained the single-cell gene expression counts and t-SNE clustering scheme and constructed a balanced dataset with prevalent and low abundance cell types. Our balanced dataset consisted of eighty nearest neighbor cells to each of the twenty centroids to adequately represent the least prevalent cell type (i.e., 106 immature perivascular-like fibroblasts, imPVL). The mean-dropout relationship (high zero counts while maintaining a high mean expression level) was evaluated, and the features were reduced to 3,205 genes (curve fitting parameters of a=1.5 and b=1.1)[21]. A deconvolution model was then constructed using the balanced TNBC single-cell data and CIBERSORTx [22, 23] method. The deconvolution was implemented with the twenty cell types merged into nine lineages. Each BEAUTY study sample with bulk RNA-Seq and phenotype was individually evaluated with pseudoreplication to achieve a sampling set larger than the nine cell types. Three paired RNA-Seq samples (two non-recurrent and one early recurrent) were removed as tumor bed biopsies.
I-SPY1 trial gene expression data
The clinical and molecular data from the I-SPY1 breast cancer trial have been described in detail previously [24, 25]. The normalized expression data were downloaded from the gene expression omnibus (GEO) database with a GEO ID: GSE32603[9]. From the I-SPY1 GEO data set (N=141, with baseline data), we selected patients with TNBC (N=39, 27.7%) that were resistant to NAC (N=23, 59.0%). We also considered only paired TNBC samples (N=9, 39.1%) having gene expression data at both pre-NAC treatment (T1) and post-NAC or surgery (TS) time points. Further, like the BEAUTY breast cancer study, we also considered the paired TNBC samples in the I-SPY cohort, with N=5 patients who had an early recurrence (in less than two years after NAC) and N=4 patients who remained non-recurrent for more than four years for gene expression data analysis.
Gene set analysis
Sample-wise pathway analysis of the RNA-Seq data was carried out using the gene set variation analysis (GSVA) method [26] for the curated (C2, n=2,232) and hallmark (H, n=50) gene sets from MSigDB version 7.1[27]. GSVA scores were calculated for the pre-NAC and post-NAC gene expression datasets. The scores were then evaluated using linear regression with empirical Bayes statistics with a limma package[28]. Gene sets were determined to be significant with a p-value < 0.05. Gene sets and individual genes were filtered based on the scores, fold-changes, and p-values observed in the I-SPY1 and BEAUTY trials.
Statistics and Bioinformatics analysis
Differential expression (DE) analysis was performed using the empirical Bayes quasi-likelihood F-test (QLF) to identify genes associated with recurrence for both the pre-NAC and post-NAC data with edgeR[29]. Over-representation analysis (ORA) was performed to identify cytobands significantly enriched in the DE genes using WebGestalt[30]. Copy number alterations associated with recurrence were evaluated with Fisher's exact test. Circos plots were generated with Rcircos[31]. Survival association analysis for the selected biomarkers were independently investigated using the online tool Kaplan-Meier (KM) plotter[32]. In addition, 1) The Cancer Genome Atlas (TCGA) paired tumor and normal-adjacent TNBC samples[33], 2) non-LAR vs. LAR TNBC gene expression cohorts 3) non-LAR gene expression cohorts from Thompson et al. [13] with respect to pathologic complete response phenotype were also surveyed using publically available gene expression data[3, 34-38].
Assessment of biomarkers using machine learning methods
The I-SPY1 and BEAUTY studies were pooled together and scaled with the combat algorithm from the sva package (v 3.14.0)[39] to remove the batch effect. Given the lack of sampling power, we implemented a three-fold cross-validation strategy with the super learner package[40]. We examined six classification algorithms, including a generalized linear model (glmnet), k nearest neighbors (k-NN), neural networks, random forests, support vector machine (SVM), and kernel SVM. The models were implemented using down selection methods (Pearson's correlation, Spearman's rank correlation, and random forest). Similarly, we evaluated clinical features (age, menopause, TNM stages, and pre-NAC and post-NAC Ki67 scores) for the BEAUTY cohort, using two-fold cross-validation with the super learner, as well as combined with the biomarkers. We monitored the area under the receiver operating curve (AUC) to indicate the model's classification performance. Further, we evaluated the Cox proportional hazards of the univariates in both the I-SPY1 and BEAUTY cohorts. In addition to the classification performance of random forests, kNN, and neural networks, we evaluated non-negative least squares (nnls), bayesian general linear model (glm), and rpart algorithms.