Machine learning radiomics for predicting recurrence risk in patients with early-stage invasive breast cancer

Current clinical practice lacks a satisfactory approach for identifying patients with early-stage breast cancer who are at high or low risk of recurrence, so patients may be overtreated or undertreated because recurrence risk is predicted inaccurately. Herein, a machine learning magnetic resonance imaging radiomic-based signature that integrates intratumoral and peritumoral radiomic signatures with clinicopathological characteristics was developed to classify patients into high- and low-risk recurrence groups and to predict recurrence within multicentre cohorts. The radiomic-clinical signature could also discriminate high- from low-risk recurrence patients across breast cancer molecular subtypes and among HR+/Her2-, T1N0M0 stage patients. Furthermore, neoadjuvant chemotherapy was observed to improve survival in high-risk Luminal subtype patients compared with adjuvant chemotherapy. The survival-associated radiomic features also correlated with the immune microenvironment. The radiomic-clinical signature demonstrated the feasibility of predicting recurrence risk and assisting clinical decision-making in patients with early-stage invasive breast cancer.


Introduction
Breast cancer is the leading cause of cancer death among women globally, and approximately 10-15% of patients experience a recurrence in the first 5 years from diagnosis 1,2 . The St Gallen consensus 3 proposed clinicopathological risk categories to identify patients with a high or low likelihood of recurrence, and suggested endocrine monotherapy without adjuvant chemotherapy for the clinical low-risk group. Nevertheless, some patients remain misclassified and might be overtreated or undertreated. The 70-gene expression profile 4 and the 21-gene recurrence score assay 5 are currently recommended in clinical practice to predict recurrence risk and the benefit of adjuvant chemotherapy 6 , but these assays remain prohibitively expensive and are an appropriate option only for Luminal subtype patients. Therefore, a more widely applicable and accurate method to pinpoint patients who are at high or low risk of recurrence is needed.
Several studies have indicated that radiomic features are significantly associated with the tumor microenvironment and prognosis 7 , and a previous study established a radiomic signature based on 294 invasive breast cancer patients to predict disease-free survival. However, that model was difficult to apply in clinical practice because it used a small, single-centre dataset, was not validated in different molecular subtypes, and lacked high-level evidence 8 . Recently, some studies showed that peritumoral radiomic features, and not only features from the tumor region, were also predictive of prognosis 9,10 . This multicentre study aimed to construct a magnetic resonance imaging (MRI) radiomic-clinical signature that integrates intratumoral and peritumoral radiomic signatures with clinicopathological characteristics for predicting high and low recurrence risk in patients with early-stage invasive breast cancer.

Patient characteristics
This study enrolled 1,084 eligible patients from four academic institutions in China (Supplementary Table 1). The study workflow is shown in Fig. 1. Table 1 shows the clinicopathological characteristics of patients in the training cohort (n=799), the prospective-retrospective validation cohort (n=105), and the external validation cohort (n=180). Adjuvant chemotherapy was administered to 709 (89%) of 799 patients in the training cohort, 57 (54%) of 105 patients in the prospective-retrospective validation cohort, and 156 (87%) of 180 patients in the external validation cohort. The 105 patients in the prospective-retrospective validation cohort underwent neoadjuvant chemotherapy. Median follow-up was 22.8 months (IQR 15.5-35.4) for patients in the training cohort, 24.2 months (IQR 14.3-34.9) for those in the prospective-retrospective validation cohort, and 23.0 months (IQR 9.7-48.8) for those in the external validation cohort. Detailed information regarding patient recruitment is described in Supplementary Fig. 1.

Intratumoral and peritumoral signatures for predicting recurrence risk
The key radiomic features were selected by the random forest algorithm to construct the T1+C, T2WI, or DWI-ADC single-sequence signatures by Cox regression; the detailed results are summarized in Supplementary Tables 3. The intratumoral radiomic signature, which incorporated the T1+C, T2WI, and DWI-ADC single-sequence signatures, was then constructed and could assign patients to high- and low-risk groups. Low-risk patients had better recurrence-free survival (RFS) in the training cohort (HR 0.06, 95% CI 0.02-0.15; P < .001), the prospective-retrospective validation cohort (P = .039), and the external validation cohort (P < .001) (Supplementary Fig. 2a-c). In addition, the intratumoral radiomic signature showed areas under the curve (AUCs) of 0.86, 0.88, and 0.92 for 1-, 2-, and 3-year RFS prediction in the training cohort, 0.86, 0.88, and 0.86 in the prospective-retrospective validation cohort, and 0.92, 0.94, and 0.91 in the external validation cohort, respectively (Supplementary Fig. 2d-f).
Combined intratumoral and peritumoral signatures for predicting recurrence risk
In addition, the intratumoral-peritumoral radiomic signature was used to classify patients into high- and low-risk recurrence groups within each molecular subtype. Encouragingly, the radiomic signature could distinguish high- from low-risk patients in the subgroups of Luminal A (P < .001), Luminal B (P < .001), human epidermal growth factor receptor 2 (Her-2) positive (P = .007), and triple-negative breast cancer (TNBC) (P < .001) patients (Supplementary Fig. 4).

Radiomic-clinical signature for predicting recurrence risk
To develop a more precise and clinically applicable method for predicting an individual's recurrence, we considered the clinicopathologic characteristics that were associated with RFS in the univariate analysis. Multivariable analysis indicated that the intratumoral-peritumoral radiomic signature, number of tumors, histological grade, pTNM stage, and Ki-67 status were independent factors of RFS (Supplementary Table 4), and these factors were used to construct the radiomic-clinical signature.
Based on the radiomic-clinical signature, an optimal cutoff value (281) was generated to classify patients into high- and low-risk groups in the training cohort. A radiomic-clinical signature-print is shown to illustrate the association of these factors with recurrence risk. The intratumoral-peritumoral radiomic signature accounted for the largest proportion in both the high-risk (83%) and low-risk (45%) recurrence groups, followed by histological grade (high-risk, 68%; low-risk, 44%) (Fig. 3).
In addition, the radiomic-clinical signature precisely predicted recurrence risk and could identify high- and low-risk patients among the different molecular subtypes (P < .001 for Luminal A, P < .001 for Luminal B, P = .007 for Her2-positive, P < .001 for TNBC) (Fig. 5a-d). Luminal subtype patients in the high-risk group who received neoadjuvant chemotherapy showed significantly prolonged RFS (P = .048) compared with those who received adjuvant chemotherapy, whereas neoadjuvant chemotherapy provided no added benefit for patients in the low-risk group (Supplementary Fig. 5). Moreover, among Luminal subtype (T1N0M0 stage, HR-positive and Her2-negative status) patients, the radiomic-clinical signature could distinguish high- and low-risk patients (P < .001; Fig. 5e), including in the subgroup analysis of patients who received adjuvant chemotherapy (P < .001; Fig. 5f).

Radiomic features associated with tumor immune microenvironment and genomics
The key radiomic features from intratumoral T1+C and T2WI sequences of patients in The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA) were found to be linearly correlated with immune cells (Fig. 6a). Activated natural killer cells were positively correlated with most of the radiomic features. M0 macrophages, regulatory T cells (Tregs), and follicular helper T cells also showed strong correlations. In addition, we had previously identified 29 lncRNAs that were associated with survival and immune response 11 . In this study, most of these lncRNAs were markedly correlated with the radiomic features (Fig. 6a), including NKILA, which a previous study showed to play an important role in the immune microenvironment 12 . These results illustrate that radiomic features could provide important information about the tumor immune microenvironment.
Different classes of the radiomic features were identified using unsupervised consensus clustering analysis in patients from TCGA and TCIA. A total of 536 differentially expressed lncRNAs and 835 differentially expressed genes were identified as associated with the radiomic features. Unsupervised consensus clustering analysis was then performed with these lncRNAs and genes in 1,082 breast cancer patients. Two main radiomic-based lncRNA subtypes were identified and were associated with a significant difference in overall survival (HR 0.71, 95% CI 0.51-0.97; P = .031) (Fig. 6b). Next, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were conducted to evaluate the enrichment of the radiomic-based genes. The GO enrichment analysis indicated that the radiomic-based genes were enriched in various physiological and metabolic processes, such as oxidoreductase activity, lipid metabolism, and the potassium channel complex; details are illustrated in Fig. 6c. The KEGG pathway enrichment analysis found that these genes were involved in vitamin digestion and absorption and in the peroxisome proliferator-activated receptor signaling pathway.

Discussion
In this multicentre cohort study, the intratumoral-peritumoral radiomic signature based on a machine learning algorithm discriminated high- from low-risk recurrence patients and performed well in predicting RFS. The radiomic-clinical signature, which comprised the intratumoral-peritumoral radiomic signature and clinicopathological characteristics, was significantly associated with RFS and showed higher predictive value for RFS. In addition, the radiomic-clinical signature successfully classified high and low recurrence risk among patients with different breast cancer molecular subtypes and among HR+/Her2- (T1N0M0 stage) patients. The key radiomic features were also found to be associated with the immune microenvironment. Therefore, this study developed and validated a prognostic radiomic-clinical signature for individualized prediction of high and low recurrence risk, which provides an effective tool for survival prediction and clinical decision-making in patients with early-stage invasive breast cancer.
While previous studies 13,14 showed the potential of MRI-based radiomics for predicting recurrence in breast cancer, their clinical value was limited by small single-centre samples and by radiomic features extracted only from the tumor region. Our study built a radiomic-clinical signature that integrates the intratumoral-peritumoral radiomic signature with clinicopathological characteristics, based on more than 1,000 patients from multiple centres and an independent external validation cohort. Our results showed that the radiomic-clinical signature played an important role in predicting recurrence, and the signature-print indicated that radiomic features were more strongly associated with recurrence risk than clinicopathological characteristics. We therefore propose combining radiomic features with clinicopathological characteristics, which could better predict recurrence and be used in clinical practice.
In the past few decades, high and low recurrence risk has mainly been evaluated by the clinicopathological characteristics of the patients. A study retrospectively reviewed 1,500 patients with node-negative breast cancer and found that using the 2007 St Gallen risk categories resulted in different outcomes 3,15 . The St Gallen criteria assign patients with node-negative disease, pathological tumor size ≤2 cm, histological grade 1, absence of extensive peritumoral vascular invasion, HR+, HER2-, and age ≥35 years to the low-risk group. In our study, the St Gallen categories assigned only 13 (11%) of 118 patients with Luminal subtype (HR+/HER2-, and T1N0M0 stage) to the low-risk group, and were unable to further distinguish high from low recurrence risk (Supplementary Fig. 6). Additionally, the intermediate- and high-risk groups defined according to the St Gallen categories included many patients who had a good outcome, which indicates that the St Gallen criteria misclassified patients who might be overtreated with adjuvant chemotherapy. In contrast, our radiomic-clinical signature assigned 110 (93%) of 118 HR+/Her2-, T1N0M0 stage patients to the low-risk group, which could minimize the probability of overtreatment.
Multigene profiles have been constructed for risk stratification and guidance of therapy strategies in breast cancer 16,17 . The randomized trial TAILORx 5,18 enrolled patients with HR+, HER2-, and axillary node-negative breast cancer, and demonstrated that patients with a low-range 21-gene recurrence score could avoid adjuvant chemotherapy. Another randomized trial, MINDACT 4 , allowed enrollment of patients with up to three positive axillary nodes, and showed that testing the 70-gene signature could identify patients with high clinical risk who can avoid chemotherapy. In this study, the radiomic-clinical signature discriminated high- from low-risk patients in different breast cancer molecular subtypes, which indicates that combining multigene profiles with the radiomic-clinical signature might distinguish high and low recurrence risk more precisely and reduce the rates of undertreatment and overtreatment in future clinical practice.
In current clinical practice, patients with large tumor size or positive axillary nodes are considered for neoadjuvant chemotherapy. The results of the randomized trials NSABP B-18 and NSABP B-27 19 showed that although neoadjuvant chemotherapy was equivalent to adjuvant chemotherapy, patients who achieved a pathologic complete response after neoadjuvant chemotherapy had significantly prolonged survival and a lower risk of recurrence compared with patients who did not. Although pathologic complete response has been shown to be predictive of benefit from neoadjuvant chemotherapy, it can only be evaluated after surgery; a preoperative approach is therefore urgently needed to identify patients who could benefit from neoadjuvant chemotherapy.
In this study, the radiomic-clinical signature could predict RFS and identify high- and low-risk recurrence in the prospective-retrospective validation cohort, in which all patients underwent neoadjuvant chemotherapy. It is worth noting that neoadjuvant chemotherapy improved RFS in high-risk Luminal subtype patients compared with adjuvant chemotherapy. However, larger sample sizes and longer follow-up are needed to further validate the potential of the radiomic-clinical signature to recognize patients who could obtain more benefit from neoadjuvant chemotherapy.
This study has several limitations. Heterogeneity among MRI scans from multiple centres was inevitable, and because the median follow-up was about 24 months, the signatures could not be applied to predict overall survival. Previous studies have shown an association between radiomic features and immune response 8,20 . In this study, we analyzed the correlation of radiomic features with immune cells and lncRNAs and found a significant linear correlation, which indicates that radiomic features can provide information about the tumor immune microenvironment. We also evaluated the radiomic-based genes and related pathways. However, because of the retrospective approach taken in this study and the lack of available gene expression data, we were unable to further analyze the association between radiomic features and the tumor microenvironment; in particular, the mechanisms by which radiomic features predict recurrence need to be further explored. It may be beneficial to combine radiomic signatures with genetic signatures such as genomics and transcriptomics, which might offer better prediction of recurrence and greater clinical application value.
In conclusion, this study presented a radiomic-clinical signature that incorporated the MRI intratumoral-peritumoral radiomic signature and clinicopathological characteristics, which could identify high- and low-risk recurrence among different molecular subtypes and be conveniently used for individualized prediction of RFS in patients with early-stage invasive breast cancer.

Study design and patients
This study was conducted in accordance with the STROBE guideline checklist 21 . The primary outcome was recurrence-free survival (RFS), calculated from the surgery date to the date of the most recent medical review or diagnosis of recurrence. The inclusion criteria were female patients aged at least 18 years with histologically confirmed stage I-III invasive breast cancer 22 who underwent breast tumor and axillary MRI scans before surgery and axillary lymph node dissection.
Patients were excluded if they had a previous or concurrent other tumor, incomplete pathological information, or unavailable standard MRI scans with or without contrast enhancement.

Radiomic feature extraction
The multiparametric MRI (contrast-enhanced T1-weighted imaging [T1+C], T2-weighted imaging [T2WI], and diffusion-weighted imaging with quantitatively measured apparent diffusion coefficient [DWI-ADC]) acquisition protocol across all institutions and the MR scanner parameters for patients are described in Supplementary 1 and Supplementary Table 5. All MRIs were normalized to obtain a standard normal distribution of image intensities using the N4ITK Bias Correction code. 3D regions of interest (ROIs) of the breast intratumoral area and peritumoral area (the parenchyma within a 10-mm outward extension of the tumor) were semi-automatically segmented using 3D Slicer software (https://www.slicer.org/, version 4.10.2) 23 . The 3D intratumoral and peritumoral regions (DICOM format) were transferred to the SlicerRadiomics code, the in-house texture extraction platform developed based on the Python package "PyRadiomics". A total of 5,178 quantitative radiomic features in six groups were extracted separately: shape, first-order, the gray-level co-occurrence matrix (GLCM), the gray-level size zone matrix (GLSZM), the gray-level dependence matrix (GLDM), and the neighbouring gray tone difference matrix (NGTDM). More details regarding the radiomic feature extraction are described in Supplementary 2.
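The geometric idea behind the peritumoral region can be illustrated with a minimal sketch. It is not the segmentation procedure used in this study (which was semi-automatic in 3D Slicer); it works on a 2D binary mask for brevity, and the function name, the mask, and the pixel spacing are hypothetical. It collects the pixels lying within a given physical margin of the tumor but outside it.

```python
from itertools import product


def peritumoral_ring(mask, spacing_mm, margin_mm=10.0):
    """Return the set of (row, col) pixels within margin_mm of the tumor but outside it.

    mask: 2D list of 0/1 values (1 = tumor); spacing_mm: (row, col) pixel size in mm.
    Illustrative 2D simplification; real ROIs in the study are 3D.
    """
    rows, cols = len(mask), len(mask[0])
    tumor = {(r, c) for r, c in product(range(rows), range(cols)) if mask[r][c]}
    # Largest pixel offsets along each axis that can still lie within the margin.
    max_dr = int(margin_mm // spacing_mm[0])
    max_dc = int(margin_mm // spacing_mm[1])
    ring = set()
    for r, c in tumor:
        for dr in range(-max_dr, max_dr + 1):
            for dc in range(-max_dc, max_dc + 1):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < rows and 0 <= cc < cols):
                    continue  # outside the image
                # Physical (Euclidean) distance from this tumor pixel, in mm.
                dist = ((dr * spacing_mm[0]) ** 2 + (dc * spacing_mm[1]) ** 2) ** 0.5
                if dist <= margin_mm and (rr, cc) not in tumor:
                    ring.add((rr, cc))
    return ring


# Hypothetical example: one tumor pixel at the centre of a 5 x 5 image, 5-mm pixels.
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1
ring = peritumoral_ring(mask, spacing_mm=(5.0, 5.0), margin_mm=10.0)
```

Working in millimetres rather than pixel counts matters in a multicentre setting, because scanners with different voxel spacing would otherwise yield peritumoral regions of different physical size.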

Radiomic signature building and validation
The random forest algorithm 24 was applied to select the most predictive candidate radiomic features in the training cohort. The key features in each sequence were combined to construct the intratumoral or peritumoral T1+C, T2WI, and DWI-ADC single-sequence signatures. The intratumoral or peritumoral radiomic signature, which incorporated the T1+C, T2WI, and DWI-ADC single-sequence signatures, was calculated by Cox regression. The intratumoral-peritumoral radiomic signature, which combined the intratumoral and peritumoral radiomic signatures, was calculated by a Cox regression model with the radiomic scores in the training cohort. The radiomic score was calculated for each patient as a combination of the selected features weighted by their respective coefficients. The ability of the intratumoral-peritumoral radiomic signature to predict RFS was then assessed in the prospective-retrospective validation cohort and the external validation cohort. The essential radiomic features and the composition of the formula are presented in Supplementary Table 6.
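The weighted combination described above can be sketched as a linear score. The snippet below is illustrative only: the feature names and coefficient values are hypothetical placeholders, not the fitted values reported in Supplementary Table 6.

```python
def radiomic_score(features, coefficients):
    """Linear radiomic score: sum of selected feature values times their coefficients.

    features: dict mapping feature name to value for one patient.
    coefficients: dict mapping each selected feature name to its regression coefficient.
    """
    missing = set(coefficients) - set(features)
    if missing:
        raise KeyError(f"missing feature values: {sorted(missing)}")
    return sum(coefficients[name] * features[name] for name in coefficients)


# Hypothetical coefficients and feature values, for illustration only.
example_coefficients = {"glcm_contrast": 0.5, "firstorder_mean": 0.25}
example_patient = {"glcm_contrast": 2.0, "firstorder_mean": -1.0, "shape_volume": 3.1}
score = radiomic_score(example_patient, example_coefficients)
```

Features that were not selected (here `shape_volume`) are simply ignored, which mirrors the idea that only the key features enter the signature.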

Radiomic-clinical signature building and validation
Univariate analysis was used to assess the association between clinicopathological characteristics and RFS in the training cohort. A multivariate regression analysis was used to test the independent significance of the intratumoral-peritumoral radiomic signature and the significant clinical variables in relation to RFS. Finally, to provide clinicians with a quantitative tool to predict the individual probability of recurrence, a radiomic-clinical signature that combined the intratumoral-peritumoral radiomic signature and the significant clinical variables was constructed by a Cox regression model. The performance of the radiomic-clinical signature was validated in the prospective-retrospective and external validation cohorts.
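As a minimal sketch of how a Cox model yields an individual probability, the standard proportional hazards relation S(t | x) = S0(t)^exp(lp) can be evaluated directly, where S0(t) is the baseline recurrence-free probability at time t and lp is the patient's linear predictor. The function name and the numbers below are hypothetical and are not estimates from this study.

```python
import math


def recurrence_free_probability(baseline_survival, linear_predictor):
    """Cox proportional hazards relation: S(t | x) = S0(t) ** exp(linear predictor)."""
    if not 0.0 <= baseline_survival <= 1.0:
        raise ValueError("baseline_survival must be a probability in [0, 1]")
    return baseline_survival ** math.exp(linear_predictor)


# Hypothetical values: baseline 2-year recurrence-free probability of 0.90,
# and a patient whose hazard is twice the baseline (linear predictor = ln 2).
p_free = recurrence_free_probability(0.90, math.log(2))
p_recurrence = 1.0 - p_free
```

A linear predictor of zero returns the baseline probability unchanged, and larger values lower the predicted recurrence-free probability.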

Radiomic features associated with tumor immune microenvironment and genomics
To quantify the proportions of tumor immune microenvironment components in the 90 breast cancer patients from TCGA and TCIA, the CIBERSORT algorithm 25 and the LM22 gene signature were used for highly sensitive and specific discrimination of 22 human immune cell phenotypes, including B cells, T cells, natural killer cells, macrophages, dendritic cells, and myeloid subsets. CIBERSORT is a deconvolution algorithm that uses a set of reference gene expression values (a signature with 547 genes) that is considered a minimal representation for each cell type. Based on those values, CIBERSORT infers cell type proportions in data from bulk tumor samples with mixed cell types using support vector regression. Gene expression profiles were prepared using standard annotation files, the data were uploaded to the CIBERSORT web portal (http://cibersort.stanford.edu/), and the algorithm was run using the LM22 signature at 1,000 permutations. The 29 lncRNAs were selected using the univariable Cox proportional hazards regression model and the LASSO algorithm in patients treated with immunotherapy from the IMvigor210 trial 11 .
The unsupervised hierarchical clustering methods (K-means) 26 were used with the ConsensusClusterPlus R package 27 to identify different classes of the radiomic features in 71 patients from TCGA and TCIA who had both T1+C and T2WI sequences. The t-test and the R package limma 28 were used to identify differentially expressed lncRNAs and genes associated with the key radiomic features, respectively. Differentially expressed radiomic-based lncRNAs and genes were divided into different clusters in 1,082 breast cancer patients with transcriptome RNA sequencing data from TCGA using the ConsensusClusterPlus R package 27 . The GO and KEGG analyses were performed using the clusterProfiler R package 29 . GO terms and KEGG pathways were considered statistically significant when P values and false discovery rates were less than .05.
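The false discovery rate threshold can be illustrated with a short sketch. The text does not state which FDR procedure was used; the Benjamini-Hochberg adjustment is assumed here because it is a common choice, and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (FDR), in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical enrichment p-values for four terms.
raw = [0.01, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(raw)
# Keep terms that pass both thresholds named in the text.
significant = [i for i, (p, q) in enumerate(zip(raw, fdr)) if p < 0.05 and q < 0.05]
```

In this made-up example three terms have raw P values below .05, but only the first also has an adjusted value below .05, which shows why the two thresholds are applied together.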

Statistical analysis
Fisher's exact test was performed to examine differences in the occurrence of categorical variables, while the independent t-test was used to compare differences in continuous variables between two groups. Survival was estimated using the Kaplan-Meier method and compared with the log-rank test, and hazard ratios (HRs) and 95% confidence intervals (CIs) were calculated using Cox regression analysis. Patients were categorized into high- and low-risk groups with the optimal cutoff values defined by the R package survminer. The prognostic or predictive accuracy of the signatures was assessed using receiver operating characteristic (ROC) curve analysis. The area under the ROC curve (AUC) was used to summarize sensitivity and specificity and to evaluate the performance of the signatures for predicting RFS. For all analyses, two-sided P values less than 0.05 were considered statistically significant. Statistical analyses were performed using R software (version 4.0.0). This study is registered with ClinicalTrials.gov, number NCT04003558.
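For readers unfamiliar with the Kaplan-Meier method, a minimal sketch of the product-limit estimator follows. The analyses in this study were run in R; this standalone Python version and its follow-up data are hypothetical and serve only to show the calculation.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival) at each distinct event time.

    times: follow-up times (e.g. months); events: 1 = recurrence, 0 = censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        recurrences = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            recurrences += data[i][1]
            removed += 1
            i += 1
        if recurrences:
            survival *= 1.0 - recurrences / at_risk
            curve.append((t, survival))
        # Patients with an event or censored at time t leave the risk set.
        at_risk -= removed
    return curve


# Hypothetical follow-up (months) for five patients.
curve = kaplan_meier([6, 12, 12, 20, 30], [1, 1, 0, 1, 0])
```

Censored patients contribute to the number at risk up to their last follow-up but never trigger a drop in the curve, which is how the estimator handles patients who have not recurred by their most recent review.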

Declarations
Ethics: This multicentre study was conducted in accordance with the Declaration of Helsinki. The study protocol was approved by the ethics committee of each participating hospital (Sun Yat-sen Memorial Hospital of Sun Yat-sen University, SYSEC-KY-KS-2019-054-001; Sun Yat-sen University Cancer Center, B2020-114-01; Shunde Hospital of Southern Medical University, KYLS-20190579; Tungwah Hospital of Sun Yat-sen University, 2020DHLL018). The requirement for informed consent in the retrospective cohorts was waived. Participants from the prospective phase III clinical trial NCT01503905 signed informed consent forms; the trial was also approved by the ethics committee with number 2011 EC # (12).

Article information
Author Contributions: All authors had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Herui Yao, Yunfang Yu, Wei Ren, Zifan He, Yongjian Chen, and Yujie Tan are co-first authors. Erwei Song, Herui Yao, Qiugen Hu, and Chuanmiao Xie are co-corresponding authors.
Drafting of the manuscript: All authors.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: All authors.