Associations of breast cancer etiologic factors with stromal microenvironment of primary invasive breast cancers in the Ghana Breast Health Study

Background: Emerging data suggest that beyond the neoplastic parenchyma, the stromal microenvironment (SME) impacts tumor biology, including aggressiveness, metastatic potential, and response to treatment. However, the epidemiological determinants of SME biology remain poorly understood, more so among women of African ancestry who are disproportionately affected by aggressive breast cancer phenotypes. Methods: Within the Ghana Breast Health Study, a population-based case-control study in Ghana, we applied high-accuracy machine-learning algorithms to characterize biologically-relevant SME phenotypes, including tumor-stroma ratio (TSR (%); a metric of connective tissue stroma to tumor ratio) and tumor-associated stromal cellular density (Ta-SCD (%); a tissue biomarker that is reminiscent of chronic inflammation and wound repair response in breast cancer), on digitized H&E-stained sections from 792 breast cancer patients aged 17–84 years. Kruskal-Wallis tests and multivariable linear regression models were used to test associations between established breast cancer risk factors, tumor characteristics, and SME phenotypes. Results: Decreasing TSR and increasing Ta-SCD were strongly associated with aggressive, mostly high grade tumors (p-value < 0.001). Several etiologic factors were associated with Ta-SCD, but not TSR. Compared with nulliparous women [mean (standard deviation) = 28.9% (7.1%)], parous women [mean (standard deviation) = 31.3% (7.6%)] had statistically significantly higher levels of Ta-SCD (p-value = 0.01). Similarly, women with a positive family history of breast cancer [FHBC; mean (standard deviation) = 33.0% (7.5%)] had higher levels of Ta-SCD than those with no FHBC [mean (standard deviation) = 30.9% (7.6%); p-value = 0.01]. Conversely, increasing body size was associated with decreasing Ta-SCD [mean (standard deviation) = 32.0% (7.4%), 31.3% (7.3%), and 29.0% (8.0%) for slight, moderate, and large body sizes, respectively, p-value = 0.005]. 
These associations persisted and remained statistically significantly associated with Ta-SCD in mutually-adjusted multivariable linear regression models (p-value < 0.05). With the exception of body size, which was differentially associated with Ta-SCD by grade levels (p-heterogeneity = 0.04), associations between risk factors and Ta-SCD were not modified by tumor characteristics. Conclusions: Our findings raise the possibility that epidemiological factors may act via the SME to impact both risk and biology of breast cancers in this population, underscoring the need for more population-based research into the role of SME in multi-state breast carcinogenesis.


Introduction
In addition to the neoplastic parenchymal cells, tumors comprise a complex mixture of cellular and non-cellular (extracellular matrix (ECM)) elements that collectively constitute the stromal microenvironment (SME) [1][2][3][4]. The SME plays a critical role in tumor initiation, progression, and response to treatment [1][5][6][7][8]. Although the SME is widely believed to evolve in parallel with the neoplastic parenchyma, accumulating data suggest that changes in the SME may precede breast cancer development and that the premalignant/pre-invasive SME phenotype could influence the biologic phenotype of ensuing tumors [9][10][11].
Emerging lines of evidence suggest that host factors, such as parity, body mass index (BMI), and race/ethnicity, may impact the SME [12][13][14][15]. This is of particular relevance given the observed heterogeneity in the incidence of breast cancer subtypes according to host factors [16,17]. For instance, parity has been shown in several epidemiological studies to be more strongly associated with risks of triple negative breast cancer (TNBC), an aggressive form of breast cancer that is characterized by lack of expression of estrogen receptors (ER), progesterone receptors (PR), and human epidermal growth factor receptor 2 (HER2), while postmenopausal obesity has been shown to more strongly predispose to risks of high grade/ER+ tumors [18][19][20][21]. Other risk factors, such as early menarche, nulliparity, lack of breastfeeding, postmenopausal status, and use of menopausal hormone therapy (MHT), have been shown to be more consistently associated with elevated risks of ER+ than ER- breast cancer subtypes [16,18,22]. Furthermore, compared with women of European ancestry, women of African ancestry tend to have a higher prevalence of aggressive (TNBC or high grade) breast cancer phenotypes [23][24][25]. Similar patterns of higher rates of aggressive breast cancer phenotypes are seen among women in sub-Saharan Africa [26,27].
Although most previous studies of breast cancer etiologic heterogeneity have focused on tumor parenchymal characteristics such as ER, PR, and HER2, distinct SME phenotypes may contribute to observed differences in tumor biology by host factors [12,13,15,28]. In general, SME phenotypes reminiscent of chronic inflammation and wound repair have been shown to predominate among parous rather than nulliparous, obese rather than normal weight, and African American women [15,28]. In particular, and with respect to the latter, the densities of tumor-associated macrophages, endothelial cells, and micro-vessels in the SME, in addition to the expression of an interferon signature, have been shown to be higher among women of African than European ancestry after accounting for age, tumor stage, ER subtype, and grade [15]. To date, however, there have been no studies investigating the impact of host factors on the SME of breast cancer among women from an indigenous sub-Saharan African population, who are more likely to develop aggressive breast cancer subtypes and less likely to undergo screening.
Our main aim in this study was to investigate the association between host factors and SME phenotypes among women with breast cancer from a sub-Saharan African population. To address this aim, we utilized high-accuracy machine learning algorithms to characterize the SME of primary invasive breast cancer using digitized hematoxylin and eosin (H&E)-stained sections of tumor tissues. Two biologically-relevant H&E-based SME phenotypes were considered: the tumor-stroma ratio (TSR), which is a metric of the proportion of tumor tissue that is stroma relative to the neoplastic parenchyma [29,30]; and the tumor-associated stromal cellular density (Ta-SCD), which is a metric of the proportion of the tumor-stroma that is occupied by nucleated cells, including immune cells (e.g., tumor infiltrating lymphocytes (TILs), macrophages, and natural killer (NK) cells) and other cellular components such as fibroblasts, endothelial cells, and pericytes [31]. Relationships between breast cancer risk factors and SME phenotypes were assessed overall, and according to relevant clinicopathologic characteristics, including age, ER status, and histologic grade.

Study population
The Ghana Breast Health Study (GBHS) is a multidisciplinary, population-based, case-control study in Ghana, details of which have been previously described [32]. In brief, cases were women who presented with lesions suspicious for breast cancer in three hospitals in Kumasi (Komfo Anokye Teaching Hospital and Peace and Love Hospital) and Accra (Korle Bu Teaching Hospital), Ghana, between 2013 and 2015. These hospitals represent the primary hospitals that provide surgical and oncological care for breast cancer patients in Ghana [32]. A total of 1,126 breast cancer patients were recruited as part of the GBHS [33]. Of these, 792 had available digitized H&E-stained images that were confirmed by study pathologists to contain invasive breast cancer. Accordingly, the current analysis comprises 792 patients aged 17-84 years with histologically confirmed invasive breast cancer and for whom we had digitized H&E-stained images. The study was approved by institutional review boards at the National Institutes of Health (NIH; Institutional Review Board of the National Cancer Institute), Kwame Nkrumah University of Science and Technology (Kumasi, Ghana), Noguchi Memorial Institute for Medical Research (Accra, Ghana), School of Medical Sciences, Komfo Anokye Teaching Hospital (Kumasi, Ghana) and Westat (Rockville, MD, USA). All participants provided written informed consent.

Risk factor information
Data on established breast cancer risk factors, including demographic factors, menstrual and reproductive characteristics, family history of breast cancer (FHBC), medical history, occupational history, anthropometric, and physical activity variables, were obtained using a detailed questionnaire that was administered to each participant in the hospital by trained personnel. The women were asked specifically about pregnancy and breastfeeding practices, including number of childbirths and total duration of breastfeeding for each birth. Data on average lifetime body size were obtained using pictograms that were classified as slight, average, slightly heavy, and heavy [32]. This study relied on pictograms due to lack of scales in some places and to overcome the challenge of weight loss due to disease. For this analysis, we used risk factor variables that have been previously described and shown to be associated with breast cancer risk in this population [32,33]. Participants' ages (years) were categorized as < 35, 35-44, 45-54, and ≥ 55 years. Age at menarche (years) was categorized as < 15, 15, 16, and ≥ 17. Parity was classified as nulliparous, 1-2, 3-4, and ≥ 5 children. Age at first birth (years) was categorized as < 19, 19-21, 22-25, and ≥ 26. Breastfeeding (median, per birth, months) was categorized as < 13, 13-18, and ≥ 19. Body size phenotypes were classified as slight, moderate, or heavy. FHBC was considered positive if a patient had a history of breast cancer in a first-degree relative and negative if there was no such history.
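The categorizations above can be illustrated with a short Python sketch. The cut points are taken from the text; the function name, constant names, and label strings are hypothetical and are not part of the GBHS analysis code, which the text does not describe.

```python
from bisect import bisect_right

# Cut points copied from the text; each tuple holds the lower bounds of
# every category after the first. All names here are illustrative only.
AGE_CUTS, AGE_LABELS = (35, 45, 55), ("<35", "35-44", "45-54", ">=55")
MENARCHE_CUTS, MENARCHE_LABELS = (15, 16, 17), ("<15", "15", "16", ">=17")
PARITY_CUTS, PARITY_LABELS = (1, 3, 5), ("nulliparous", "1-2", "3-4", ">=5")
FIRST_BIRTH_CUTS, FIRST_BIRTH_LABELS = (19, 22, 26), ("<19", "19-21", "22-25", ">=26")
BREASTFEEDING_CUTS, BREASTFEEDING_LABELS = (13, 19), ("<13", "13-18", ">=19")


def categorize(value, cuts, labels):
    """Return the label of the half-open bin [lower, next_lower) holding value."""
    # bisect_right counts how many cut points are <= value, which is
    # exactly the index of the category the value falls into.
    return labels[bisect_right(cuts, value)]


if __name__ == "__main__":
    print(categorize(47, AGE_CUTS, AGE_LABELS))
    print(categorize(0, PARITY_CUTS, PARITY_LABELS))
```

Using lower bounds as cut points keeps boundary values (for example, age 35 or 5 children) in the higher category, matching the "≥" conventions in the text.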

Tumor pathology data
Pre-treatment core-needle biopsies (n = 4-8) from each patient were fixed in 10% neutral buffered formalin for 24-72 hours and subsequently processed into paraffin-embedded tissue blocks [32]. Data on tumor size were based on clinical palpation of the mass at the time of diagnosis [34]. Information on histologic grade was obtained through centralized pathologist (MAD) review [35]. Immunohistochemical (IHC) staining for ER, PR, and HER2 was performed using standard laboratory protocols [33]. Tumors were considered to be ER+ and/or PR+ if ≥ 10% of the tumor cells demonstrated positive staining. Staining of 3+ for HER2 was considered positive, while borderline and negative cases were considered HER2 negative. As previously reported [33], good agreement (79%, 65%, and 78% for ER, PR, and HER2, respectively; P < 0.01) was observed between scores that were determined by pathologists in Ghana and at the NCI. Breast cancer subtypes were defined using data on ER, PR, and HER2 as follows: luminal A-like (ER+/PR+/HER2-); luminal B-like (ER+ and/or PR+/HER2+ or ER+/PR-/HER2-); HER2-enriched (ER-/PR-/HER2+) and TNBC (ER-/PR-/HER2-).
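The subtype definitions above amount to a small decision rule, sketched here in Python. The sketch rests on two assumptions: the luminal B-like definition is read as "(ER+ and/or PR+) with HER2+, or ER+/PR-/HER2-", and the HER2 score is encoded as an integer from 0 to 3. All function names are hypothetical. The ER-/PR+/HER2- combination is not covered by the stated definitions, so the sketch returns None for it rather than guessing.

```python
def hormone_receptor_positive(percent_stained, threshold=10.0):
    """ER or PR is positive if >= 10% of tumor cells stain positive."""
    return percent_stained >= threshold


def her2_positive(ihc_score):
    """Only 3+ staining counts as positive; borderline and negative do not."""
    return ihc_score == 3


def ihc_subtype(er, pr, her2):
    """Map boolean ER/PR/HER2 status to the IHC-based subtype in the text."""
    if er and pr and not her2:
        return "luminal A-like"
    if (er or pr) and her2:
        return "luminal B-like"
    if er and not pr and not her2:
        return "luminal B-like"
    if not er and not pr and her2:
        return "HER2-enriched"
    if not er and not pr and not her2:
        return "TNBC"
    # ER-/PR+/HER2- is not assigned a subtype by the definitions above.
    return None


if __name__ == "__main__":
    er = hormone_receptor_positive(40.0)
    pr = hormone_receptor_positive(5.0)
    print(ihc_subtype(er, pr, her2_positive(2)))
```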

Machine learning characterization of stromal microenvironment using hematoxylin and eosin (H&E) stained images
H&E-stained sections were digitized at the NCI and archived using the Halo Link digital image repository (Indica Labs, Albuquerque, NM). Imaging quality control (QC) was performed by pathologists (NT, LE, MA) using a standard operating protocol designed to identify and positively annotate tumor regions on the slide (including intra-tumoral and peri-tumoral stroma), and to negatively annotate regions with substantial crushing artifacts, damaged tissue, or widespread necrosis. In addition, pathologists used a computer-assisted visual counting tool to identify and count nucleated cells within well-defined regions (500 × 500 µm²) of stroma.
Digitized images were analyzed using optimized machine-learning scripts based on the random forest algorithm. First, a 160-datapoint random forest tissue-classifier script was trained and optimized to identify, segment, and quantify (in mm²) areas on each image consisting of tumor (94-datapoints) and stroma (67-datapoints) (Fig. 1). The performance of the tissue classifier script in distinguishing between tumor and stroma was confirmed by visual inspection of random images by study pathologists. As previously reported in an independent study population [36], the random forest tissue-classifier demonstrates excellent reproducibility when comparing scripts that were independently trained by two pathologists (Spearman's rho = 0.95 and 0.97 for epithelium and stroma, respectively). Stromal regions were digitally annotated to allow subsequent cell detection to be confined to the stromal compartment. Next, a previously validated [31] cell-detection script was reparametrized (Additional file 1) to identify and count nucleated cells in the stroma, including lymphocytes, macrophages, fibroblasts, endothelial cells, etc. (Fig. 2). In validation analysis, the cell detection script showed strong correlation with two pathologists' manual cell counts within the intra-tumoral (R = 0.93 and 0.75 vs P1 and P2, respectively) and peri-tumoral (R = 0.74 and 0.73 vs P1 and P2, respectively) stromal compartments. Optimized scripts were used for centralized image analysis, which was blinded to patients' demographic, epidemiological, and clinical characteristics.
Percent TSR was calculated by dividing the stromal area (mm²) by the total fibroglandular tissue area on the slide and multiplying by 100. Ta-SCD was calculated by dividing the total number of cells in the stroma by the total stromal area (mm²). This was converted to a percentage by multiplying the total number of nucleated cells by the average area (mm²) of a single nucleus (2.0 × 10⁻⁴), dividing this by the total stromal area (mm²), and multiplying by 100. There was near-perfect positive correlation (r = 0.99; Additional file 2) between standard and percent Ta-SCD. The latter was used for all further analysis to facilitate interpretability.
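The two metrics described above can be written out as a minimal Python sketch. The formulas and the nuclear area constant (2.0 × 10⁻⁴ mm²) come from the text; the function and parameter names are hypothetical, and the input values in the usage lines are made-up illustrations rather than study data.

```python
MEAN_NUCLEUS_AREA_MM2 = 2.0e-4  # average area of a single nucleus, from the text


def percent_tsr(stromal_area_mm2, total_tissue_area_mm2):
    """Tumor-stroma ratio (%): stromal area over total fibroglandular tissue area."""
    return stromal_area_mm2 / total_tissue_area_mm2 * 100.0


def standard_ta_scd(n_stromal_cells, stromal_area_mm2):
    """Standard Ta-SCD: nucleated stromal cells per mm^2 of stroma."""
    return n_stromal_cells / stromal_area_mm2


def percent_ta_scd(n_stromal_cells, stromal_area_mm2,
                   nucleus_area_mm2=MEAN_NUCLEUS_AREA_MM2):
    """Percent Ta-SCD: share of the stromal area occupied by nuclei."""
    nuclear_area_mm2 = n_stromal_cells * nucleus_area_mm2
    return nuclear_area_mm2 / stromal_area_mm2 * 100.0


if __name__ == "__main__":
    # Illustrative inputs only: 6 mm^2 of stroma on a slide with 10 mm^2
    # of tissue, containing 9,000 detected stromal cells.
    print(round(percent_tsr(6.0, 10.0), 1))
    print(round(standard_ta_scd(9000, 6.0), 1))
    print(round(percent_ta_scd(9000, 6.0), 1))
```

With these illustrative inputs the stroma makes up 60% of the tissue, the standard density is 1,500 cells per mm², and nuclei occupy about 30% of the stromal area.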
Digitized hematoxylin and eosin (H&E)-stained images (A) were analyzed using optimized machine-learning scripts based on the random forest algorithm. A 160-datapoint random forest tissue-classifier was trained and optimized to identify, segment, and quantify (in mm²) areas on the H&E image consisting of tumor (94-datapoints; red regions in B) and stroma (67-datapoints; green regions in B).
Stromal regions on digitized hematoxylin and eosin (H&E)-stained images (A) were digitally annotated to allow subsequent cell detection to be confined to the stromal compartment. The cell detection script was parameterized to detect non-malignant nucleated cells in the stroma (green dots; B and C) based on size, shape, nuclear detection weight, nuclear contrast threshold, and optical density (Additional file 1). The algorithm was trained to exclude infiltrating nests of malignant epithelial cells from the tumor-associated stromal cellular density metric (C). The cell detection script showed strong correlation with two pathologists' manual cell counts within the intra-tumoral (R = 0.93 and 0.75 vs P1 and P2, respectively) and peri-tumoral (R = 0.74 and 0.73 vs P1 and P2, respectively) stromal compartments.

Statistical analysis
The Kruskal-Wallis test was used to evaluate differences in the distributions of TSR and Ta-SCD by patients' characteristics. Multivariable linear regression models were used to test associations between risk factors, patients' clinicopathological characteristics, and SME characteristics. TSR and Ta-SCD were normally distributed; hence, generalized linear regression was used to test associations with risk factors and tumor characteristics. In separate linear regression models, TSR and Ta-SCD were modelled as outcomes while the risk factors and patients' characteristics were modelled as predictors. Partially-adjusted models contained the relevant risk factor in addition to age, study site, and tissue area. The multivariable model was mutually adjusted for the individual risk/clinicopathological factors in addition to age, study site, and tissue area. Multiplicative interaction terms were included in full models to test for evidence of effect modification between host factors and tumor characteristics in relation to the SME characteristics.
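For readers unfamiliar with the Kruskal-Wallis test, the following pure-Python sketch computes its H statistic with the usual correction for ties. This is a generic textbook formulation, not the study's analysis code (the text does not name the software used), and the function name and sample values are hypothetical. Under the null hypothesis, H is compared against a chi-square distribution with k - 1 degrees of freedom, where k is the number of groups.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = [x for group in groups for x in group]
    n = len(pooled)
    counts = Counter(pooled)

    # Assign each distinct value the average (mid) rank of its tied block.
    rank_of = {}
    ranks_used = 0
    for value in sorted(counts):
        ties = counts[value]
        rank_of[value] = ranks_used + (ties + 1) / 2.0
        ranks_used += ties

    # H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1)
    rank_sum_term = sum(
        sum(rank_of[x] for x in group) ** 2 / len(group) for group in groups
    )
    h = 12.0 / (n * (n + 1)) * rank_sum_term - 3.0 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    correction = 1.0 - sum(t**3 - t for t in counts.values()) / float(n**3 - n)
    if correction == 0.0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


if __name__ == "__main__":
    # Made-up Ta-SCD (%) values for three hypothetical body-size groups.
    slight = [33.1, 30.4, 35.2, 31.8]
    moderate = [31.0, 29.5, 32.7, 30.1]
    heavy = [27.9, 28.8, 30.0, 26.5]
    print(round(kruskal_wallis_h([slight, moderate, heavy]), 3))
```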
Missing covariate values on tumor characteristics were imputed using the multiple (×5) imputation by chained equations (MICE) approach [37].
The median age at menarche was 15 (range = 9-22) years. About 91% of the population had at least one child at the time of breast cancer diagnosis, with the majority having 3 or more children. 32% of the population breastfed for < 13 months/pregnancy while 13% breastfed for at least 19 months/birth. The distributions of body size phenotypes were 25%, 40%, and 35% for slight, moderate, and heavy, respectively. 55 patients (~7%) reported having a positive FHBC (Table 1).
We did not find evidence to suggest modification of the associations between parity, parity/breastfeeding (BF), or family history and Ta-SCD by age (< 50 vs ≥ 50 years; Additional file 3), ER status (ER+ vs ER-; Additional file 4), or histologic grade (Additional file 5). Similarly, the association between body size and Ta-SCD did not differ according to age or ER status. The directionality of the association between body size and Ta-SCD, however, differed according to levels of histologic grade (p-value for heterogeneity = 0.04). Among patients with low grade tumors, increasing body size was positively associated with increasing Ta-SCD.
Partially-adjusted linear regression models were adjusted for the relevant clinicopathological characteristic in addition to study site and tissue area. Multivariable linear regression models were mutually adjusted for all clinicopathological characteristics in addition to study site and tissue area.
Table 4 Beta (β) coefficients and 95% confidence intervals (CI) for the associations between breast cancer risk factors and tumor microenvironment phenotypes (i.e., Tumor-Stroma Ratio (TSR) and Tumor-associated Stromal Cellular Density (Ta-SCD)) among patients participating in the Ghana Breast Health Study
Partially-adjusted linear regression models were adjusted for the relevant risk factor in addition to age, study site, and tissue area. The primary multivariable linear regression model was mutually adjusted for age at menarche, parity, body size, family history of breast cancer in a first-degree relative, age, study site, and tissue area. In secondary models, parity was substituted for number of children among parous women, or age at first birth, or breastfeeding, or parity/breastfeeding.

Discussion
In analyses of 792 breast cancer patients from the GBHS, a population-based case-control study in Ghana, we utilized high-accuracy machine learning algorithms to characterize H&E-based SME phenotypes, including TSR and Ta-SCD, and investigated relationships with breast cancer etiologic and tumor factors. We found TSR and Ta-SCD to be associated with tumor characteristics, particularly histologic grade. Several risk factors were associated with Ta-SCD, but not TSR. In particular, parity, FHBC, and body size were statistically significantly associated with Ta-SCD, independently of relevant tumor characteristics. The observed associations between individual risk factors and Ta-SCD were mostly consistent with their breast cancer risk relationships, suggesting that SME phenotype may reflect cumulative exposure to etiologic factors.
To date, theories of breast carcinogenesis have focused largely on sequences of epithelial abnormalities in tissues [5], but accumulating data suggest that the SME might play a crucial role in tumor initiation as well as tumor biology [1,39]. For instance, chronic inflammation has been shown in experimental studies to contribute to tumor initiation by inducing malignant transformation of epithelial cells via inactivating mutations in tumor suppressor genes or through the posttranslational modification of proteins involved in apoptosis and DNA repair [10,40]. In addition to experimental evidence, the potential role of SME in breast cancer initiation is supported by data from population-based studies suggesting that factors that decrease chronic inflammation, such as aspirin and other non-steroidal anti-inflammatory drugs (NSAIDs), also reduce breast cancer risk [41][42][43][44]. Comprising nucleated stromal cells, including lymphocytes, macrophages, fibroblasts, endothelial cells, etc., Ta-SCD is reflective of morphological changes that are reminiscent of chronic inflammation and wound healing response [31]. Accordingly, our findings of strong associations of FHBC, parity, and body size with Ta-SCD, which were consistent with their breast cancer risk associations, support the idea that epidemiological exposures may impact breast cancer risk by acting on the SME.
Our finding of higher Ta-SCD among parous than nulliparous women is consistent with experimental data showing that parity might influence breast tumor biology by inducing SME changes [45,46]. In particular, genes associated with immune, inflammation, and wound response pathways have consistently been shown to be significantly upregulated among parous compared with nulliparous women [14,47]. Together with our finding of associations between increasing Ta-SCD and aggressive, high grade breast tumors in this and other populations [31], our results add to the growing body of literature indicating that parity might predispose to the development of aggressive breast cancer phenotypes through SME-related pathways [45]. This notion is particularly relevant among women in Ghana [33] and other sub-Saharan African populations, who tend to have higher parity rates in parallel with higher incidence rates of aggressive, early-onset breast cancer [25,27]. Although breastfeeding is thought to attenuate parity-related risk of aggressive breast cancer [20], we found the association between parity and Ta-SCD to persist irrespective of breastfeeding duration. This finding could be explained in light of studies showing that post-lactational changes in mammary tissue, such as upregulation of genes related to immune response and development, do not revert to nulliparous levels several years after lactation [47].
Our findings of higher Ta-SCD levels in relation to positive FHBC and, to a lesser degree, earlier age at menarche are consistent with their well-established associations with elevated breast cancer risk [48]. The precise mechanisms by which a positive family history of breast cancer increases breast cancer risk are yet to be fully defined but are closely linked to a higher probability of carrying pathogenic variants in breast cancer predisposition genes. Mutations in breast cancer predisposition genes, such as BRCA1, BRCA2, and TP53, lead to uncontrolled proliferation via perturbations in DNA repair mechanisms or loss of cell cycle control [49,50].
Because these perturbations are not limited to epithelial cells, our SME-related findings might suggest that the associations between FHBC and breast cancer incidence and/or tumor biology may be mediated, at least in part, through SME disruption. Recent data from a large-scale international study showed that protein-truncating variants and pathogenic missense mutations in BRCA1, RAD51C, RAD51D, and BARD1 more strongly predisposed to TNBC than other breast cancer subtypes [51]. However, this study mostly comprised European ancestry populations, and the mechanism by which genetic perturbations influenced TNBC incidence was not studied. Data from other studies showed that differences in ancestry-specific immunologic landscapes correlated with differences in TNBC biology and clinical outcomes [52]. Our data are consistent with the notion that SME biology (which encompasses immunologic response) might be a pathway by which familial factors may impact breast cancer incidence and tumor biology. Further studies are needed to characterize the relationships between specific germline pathogenic variants and tumor SME phenotype among women from sub-Saharan African populations.
We found larger body size to be inversely associated with Ta-SCD. Given that larger body size is regarded as a chronic inflammatory state [53], it is not clear why higher body size was associated with lower Ta-SCD in this population. A possible explanation may have to do with the complex relationship between BMI and tumor biology with respect to age, menopause, and immune response. For instance, higher BMI is more strongly associated with ER- tumors among younger/premenopausal women [54] and with ER+ tumors among older/postmenopausal women [55]. However, within ER+ tumors, higher BMI is reportedly associated with more aggressive high-grade subtypes [21]. Accordingly, our observed inverse association between body size and Ta-SCD in the overall analysis may be reflective of the predominant tumor pathology in this population. About 95% of the tumors in this study were intermediate- (~25%) or high- (~70%) grade, and the inverse association between increasing body size and decreasing Ta-SCD was limited to these aggressive tumor subtypes. However, and in keeping with the notion that higher body size may be a chronic inflammatory state, we found increasing body size to be positively associated with higher Ta-SCD among patients with low-grade disease. These findings could also be explained by the well-recognized paradoxical effect of obesity to cause suppression or activation of inflammatory response pathways in breast cancer [56,57]. Immune exhaustion is another plausible explanation for the inverse association between larger body size and Ta-SCD [58]. T-cells mediate several aspects of tumor-associated inflammation and immune response, but they are themselves regulated by a complex crosstalk involving cancer cells, inflammatory cells, stromal cells, and cytokines. Persistent inflammation can overwhelm the T-cell response and trigger their terminal differentiation, leading to the emergence of 'exhausted' T-cell phenotypes with diminished capacity to mount immune or inflammatory responses [58].
Although speculative, the combination of obesity, a proinflammatory state, and prolonged and episodic exposure to malaria may predispose to immune exhaustion among women in sub-Saharan Africa. Immune exhaustion may, in turn, pave the way for the development of more aggressive breast cancer subtypes via immunoediting [59]. Further studies involving detailed exposure assessment, including past medical and infectious disease history, in conjunction with detailed molecular characterization of the SME, will be required to lend additional insights into the relationships between body size and SME biology among women from sub-Saharan Africa.
Important strengths of this study include the relatively large sample size in an underrepresented population, detailed risk factor information, availability of standard H&E-stained sections, and the innovative application of digital pathology and high-accuracy machine learning algorithms to characterize SME phenotypes (TSR and Ta-SCD) using H&E-stained images. In terms of limitations, lack of follow-up data on clinical outcomes precluded our ability to investigate the prognostic relevance of TSR and Ta-SCD.
Nevertheless, these SME features demonstrated associations with other clinicopathologic characteristics similar to what has been reported in other populations [29,31], suggesting that they may be similarly associated with clinical outcomes among patients from Sub-Saharan Africa.
In conclusion, results from our study indicate that factors that impact the incidence of breast cancer in the general population might do so by influencing the SME in breast tissues. In particular, parity, FHBC, and body size demonstrated associations with stromal cellularity, captured using Ta-SCD. Notably, the observed associations between risk factors and Ta-SCD were consistent with their associations with breast cancer risk and/or tumor heterogeneity. Thus, these findings raise the possibility that some breast cancer etiologic factors may act via the SME to impact both risk and molecular phenotype of breast cancer.

Figure 1
Machine-learning classification of stromal and tumor tissue compartments.

Figure 2
Machine-learning classification of tumor-associated stromal cellular density.