In the current study we developed an automated digital analysis pipeline (TMArQ) for IHC-based cell count scoring to investigate the TME in TNBC using image data from single-plex IHC stains. Using automated cell counts on individual TMA cores facilitates a more reproducible, quantitative, and accessible way to generate and test hypotheses and perform integrative multi-omics analyses. Although not the principal aim of this study, we believe our approach should be amenable to other cancer types as well and we have therefore made the underlying code available as a publicly accessible stand-alone pipeline.
We base our analyses on multiple immune marker stainings from a TMA constructed from a molecularly very well-characterised cohort of population-representative early-stage TNBC from South Sweden 2010–2015 (3, 11). While this cohort allows for in-depth comparisons versus molecular data from both RNA-sequencing and WGS, we note that the cores in the used TMA were aimed to target tumour-rich areas in the corresponding FFPE tumour tissue blocks. As such, the TMA cores may not necessarily capture the full immune phenotype of the TME in their respective tumour. This could in part explain some of the observed differences between pipeline and pathology estimates for TILs. Moreover, for each tumour case in the TMA there were two 1mm cores and the averages of these have been used in most analyses. As demonstrated in Fig. 4, core-to-core variability exists and may in part account for discrepancies compared with estimates obtained using other methods. Interestingly, we observed a general trend towards higher marker-to-marker correlation within a core for the analysed immune markers (Fig. 4). A general trend towards higher co-occurrence of all immune cell types within a given tumour was also observed (Fig. 4E). This is not surprising considering that many of the analysed immune cell types may act together in mounting a tumour immune response. Our combined analysis of different stains from the same core (stack) for a tumour emphasizes the value of a pre-planned staining order of the TMA to facilitate in silico merging of single-plex IHC stains, to avoid issues arising from stainings to be merged being several micrometres physically apart in a tissue core.
To evaluate the automated IHC-based cell counts we compared pipeline quantifications to several pathologist-derived estimates of the same stainings (CD20 and PD-L1 sp142) as well as TIL-estimates from whole slide H&E-stains taken before TMA construction. For the latter, we observed a clear correlation between automated cell counts for the CD3 pan lymphocyte marker and pathologist estimates of TILs. As a caveat we note that IHC staining was performed on cores from tumour-rich areas that may not capture the full range of the TME immune response. In comparison, for TIL-estimations, full sections are typically recommended over biopsies in guidelines (35), and TIL-estimates from whole slide evaluations have been reported to be higher than in matched TMA core analysis (36). Moreover, TIL scoring based on H&E-stained specimens does not consider the various origins and functions of immune cells (37), and like other pathology-estimated scores human TIL-scoring is associated with inter- and intra-observer variability (36–38). In support of the validity of our approach, a notable finding is that the correlation of our automated CD3 cell counts derived from TMA cores to TIL estimates generated on the corresponding whole-slide sections by an expert pathologist was close in strength to the correlation obtained when an AI-trained TIL predictor based on DL methodology was applied to the same whole section H&E-stains used by the pathologist (Spearman’s rho of 0.55 versus best AI predictor correlation of 0.63) in the study by Bai et al. (17). Moreover, based on data from the same study by Bai et al. (17) we note that for the same set of chemotherapy-treated patients as in Fig. 6A, our automated cell counts outperformed the AI-trained eTILs estimates in survival association. The latter was not significantly associated with IDFS when patients were split into low/high groups based on the median eTIL (log-rank p = 0.24, Supplementary Fig. 4C). This may indicate that simple estimates such as automated leukocyte counts on dual 1mm TMA cores can be used to probe relevant clinical correlations in a high-throughput fashion.
Estimation of PD-L1 status has become important in cancer therapy following the introduction of immune checkpoint inhibitors (ICIs) in patient care. A problematic issue in this context is that usage of an ICI is often associated with the requirement to use a specific PD-L1 antibody as an accompanying diagnostic assay to guide treatment (39). The SP142 PD-L1 antibody is the accompanying diagnostic assay for the PD-L1 inhibitor atezolizumab, which was first reported effective in combination with nab-paclitaxel in patients with unresectable, locally advanced, or metastatic triple-negative breast cancer in the phase 3 IMpassion130 trial (40). This trial invoked an accelerated FDA approval for atezolizumab that however was later retracted based on negative results from the IMpassion131 trial (41). Irrespectively, for pipeline assessment, the SP142 assay serves as an excellent example, with a positive staining call of PD-L1-positivity set as low as 1% staining of tumour-infiltrating immune cells (ICs), compared to the scoring of PD-L1 expressing tumour cells (TCs). Figures 2B-C demonstrate that the automated pipeline counts correspond well with pathology-derived binary classes of PD-L1-low/PD-L1-high, but also that there is a strong correlation to actual SP142 grading in percent. Here, the full dynamic range of the pipeline counts could enhance scoring and reproducibility at low expression levels compared to the challenging assessment of whether more or less than 1% staining of IC exists by the human eye. PD-L1 IC scoring is challenging for several reasons as discussed in (39) that together may explain discrepancies observed both for the discrete class analysis and the overall correlation analysis in this study. In the near future, it is conceivable that computer image analysis algorithms could function as an orthogonal method for PD-L1 IC scoring particularly for borderline and challenging cases (39). Finally, we also compared pipeline output to pathology scores for CD20, classified into four expression bins. Here we saw a good agreement between cell counts and score groups, but also a particularly sharp increase for the highest score bin (Fig. 2D). This sharp increase may indicate either a potential manual bias in the scoring of the highest CD20-expressing tumours, or a biological aspect of large numbers of infiltrating CD20-positive cells (e.g., as tertiary lymphoid structures) connected to some specific tumour feature. Regarding the latter hypothesis, the highest CD20 pathology group (group 3) contained 80% HRD-positive tumours, compared to 66% for group 2, 60% for group 1, and 44% for group 0. Whether this observation represents a direct consequence of a specifically rearranged genome, or merely a correlative association remains to be determined through deeper molecular analyses.
In addition to comparisons to matched pathology estimates we also compared the automated IHC-based cell counts to different outputs from matched bulk tissue RNA-sequencing. For the latter comparisons these are: I) made on different tumour pieces, and II) made on selected tumour rich areas for the IHC stainings versus RNA extracted from bulk tissue without specific regard for tumour cell content. Still, we observed strong correlations between automated cell counts and matched TPM expression for particularly CD3 and CD8 (Fig. 3A). Notably, for these markers, Locy et al. (42) also reported the highest agreement between TIL pathology scores and mRNA expression estimates based on Nanostring analysis. Overall, a good correlation between estimates by IHC or flow cytometry and bulk mRNA expression levels (in similar ranges that we observe) has been reported for certain cell types (including T-cells, cytotoxic cells, mast cells, and macrophages) in different malignancies (43). Bulk mRNA expression however cannot provide insights into the spatial distribution of cell types as in situ analyses can.
Several transcriptome-based cell-type quantification methods for immuno-oncology have to date been reported, of which CIBERSORT is one of the most commonly used (44). When CIBERSORT estimated cell type proportions were compared to automated cell counts we found an intermediate correlation for macrophages, B-cells, and T-cells between the methods for analysed antibodies (Figs. 3B-C). The correlation increased somewhat when CIBERSORT cell type fractions were summed into a more general immune cell score (Fig. 3C). Collectively, these findings imply that CIBERSORT deconvolution predominantly identifies a general trend of immune cells rather than specific cell subsets, except for outlier cases that score highly using any method. Notably, when comparing CIBERSORT B-cell proportions versus the CD20 pathology classifications we saw increasing fractions for CIBERSORT up until the highest score category, which had a lower average CIBERSORT B-cell fraction than the second highest group (Supplementary Fig. 3). This contrasts with the pattern seen for the cell counts (Fig. 2D), suggesting here that the in situ derived counts are in better agreement with manual scoring than the deconvolution method. Moreover, it should be noted that observed correlations between automated cell counts and bulk mRNA deconvolution cell estimates are likely in line with what might be expected in heterogeneous breast cancer tissue. This assertion is in line with observations using xCell (45) similar to CIBERSORT, which showed correlations between 0.5–0.7 when estimated cell proportions were compared to cell sorted blood proportions. This illustrates the difficulties in using in silico deconvolution methods for precise assessments of the TME in individual samples.
When we compared automated CD3 cell counts to a more general immune classification in TNBC, IM-class (13), we observed a consistent pattern overall with higher CD3 quantifications in IM + tumours. The IM classification is based on a general set of immune response associated genes considering the general pathway enrichments reported by Lehmann et al. (46). As such, the IM-signature may be less associated with the expression of a specific cell type, and more with a general immune cell load. While patients with LAR-classified tumours have been associated with both a generally immune cold TME (together with the TNBCtype M class (12)) and a generally poorer response to conventional systemic therapy (13, 47), it remains to be determined if the observation of a subset of TNBC LAR patients with high lymphocyte infiltration (reflected by automated CD3 cell counts, Fig. 3D) translates to a difference in patient outcome. Considering the high marker-to-marker cell count co-expression we observed for individual marker pairs (Figs. 4C-D) we also clustered pipeline cell counts for all analysed immune markers (Fig. 4E). The latter analysis revealed that most markers appeared correlated with regard to cell count levels, defining at one end an immune cold subset of TNBC tumours and at the other end an immune hot tumour subset, as reported previously (10, 12). Consistent with the established prognostic role of immune response in TNBC and the general immune marker-to-marker co-expression in TMA cores, patient stratification based on automated cell counts subdivided patients in our cohort with adjuvant systemic therapy into groups with differing prognosis in terms of IDFS (Fig. 6). This association was also observed for pathologist-derived TIL scores, but not for AI-estimated eTILs (17). As for PD-L1 scoring, the continuous counts produced by our automated pipeline could be useful for considering threshold selection for the definition of prognosis or risk groups in a discovery-validation cohort approach.
Another observation made in our cohort based on linking the cell count co-expression analysis with genomic data was that a highly rearranged tumour genome is not equivalent to an immune hot phenotype in situ. This is exemplified by WGS-classified HRD-positive tumours in our cohort, that bear highly rearranged genomes and a high mutational load (3). In our analyses, these tumours with similarly complex genomes display variable automated immune cell counts across all investigated immune markers (Fig. 4). Similar findings, albeit using more limited sequencing data, have also been noted by other studies (10). Therefore, one can infer a complex interplay between DNA repair deficiency status and the immune phenotype of the TME that may have considerable prognostic relevance. Indeed, a simple 4-tier grouping of HRD-status and dichotomized CD3 cell counts showed that patients with HRD-positive tumours with higher CD3 counts have a better IDFS after adjuvant chemotherapy compared to the other HRD/CD3 groups (Fig. 6). Moreover, we demonstrate that while p53 staining in general is associated with the presence of pathogenic TP53 mutations, this is not always the case, as highlighted by our integration of WGS-derived variant consequence and position data (Fig. 5). Together, these examples illustrate how somatic tumour features from other platforms can be combined with in situ TME estimates in breast cancer to add nuance to analyses and explore novel hypotheses.
The proposed pipeline has its limitations. We did not validate the pipeline for the use on individual patients. Curation of input images still requires a human inspection to avoid technical artefacts that may affect cores with tissue folding, considerable gaps, poor staining quality or other artefacts. However, the pipeline can be easily adapted to flag potential outliers and spot cores that appear different in quality. Moreover, the current cut-off in the pipeline is based on the CD3 marker and set to be a general-purpose cut-off. Thus, we acknowledge that it may not be the optimal cut-off for all individual markers. It should therefore, ideally, be individually examined by a user in new analyses. Regarding in silico merging of different stains we note that combining stains from a stack of sections (the z-axis of a core) may result in positional shifts when transitioning across different cell layers. To some extent, careful planning of the IHC staining order in single-plex studies may mitigate this issue. Future pipeline development could include the integration of deep learning frameworks for positive cell detection, detection of partial staining, and alternative segmentation methods for general cell segmentation. While currently applied exclusively to TMA cores the pipeline could also be expanded to whole slide analysis, including scoring of virtual TMA cores from whole-slide sections. Additionally, applying this framework to multiplex stainings on e.g., the PhenoImager-platform could allow for an even greater potential for analysing cell-cell interactions.