Genomic Machine Learning Model Predicts Radiation Therapy Benefit in Early-Stage Breast Cancer Patients with High Accuracy


Background
Radiation therapy (RT) is frequently recommended for post-surgery treatment of early-stage breast cancer (BC) patients, though not all benefit. Clinical factors currently guide RT treatment decisions. At present, models to predict RT-benefit predominantly use statistical methods with modest performance. In this paper we present a high-accuracy genomic Machine Learning (ML) model to predict RT-benefit in early-stage BC patients. We also present a novel method for selecting genomic features for training ML algorithms.
Methods
Gene expression data from 463 early-stage BC patients treated with surgery and RT from the METABRIC cohort were obtained. Wilcoxon Rank Sum (Wilcoxon RS) test and Cox Proportional Hazards (Cox PH) were used to reduce the number of genes used to train eight ML algorithms. ML algorithms were trained on 80% of data using 10-fold cross validation and tested on 20% of data to assess performance in predicting relapse status.
Results
Genome-wide gene expression data were reduced by 96% using Wilcoxon RS and Cox PH to a 1,596 gene set and a 977 gene set. These gene sets were used to train eight ML algorithms, resulting in models that ranged in accuracy from 54.01% to 95.6%. Highest accuracies were obtained using Support Vector Machine (SVM977: 93.41%, SVM1596: 95.6%) and Neural Networks algorithms (NN977: 92.31%, NN1596: 93.41%). In RT-untreated patients, accuracies of all models were 30% to 40% lower compared to RT-treated patients. SVM977 had the highest sensitivity of 91.09%. Members of the 977 set were enriched with genes involved in cell cycle and differentiation as well as genes associated with radiosensitivity and radioresistance.
Conclusion
This study presents a novel genomic feature selection approach that used Wilcoxon RS followed by Cox PH to reduce the number of genes from genome-wide gene expression data used for training ML algorithms by 96%. This approach led to an SVM model that used the expression values of 977 genes to predict RT-benefit in early-stage BC patients with 93.41% accuracy. This work demonstrates that ML models can be clinically useful for predicting cancer patient outcomes.


Introduction
Breast cancer (BC) is a major global public health concern. It represents 11.6% of all cancer cases and 6.6% of all cancer deaths globally, second after lung cancer (1). Radiation therapy (RT) has been shown to reduce 10-year recurrence risk for early-stage BC patients by 15% (2). RT also offers improvements in overall and disease-free survival in patients presenting with distant metastases at diagnosis (3).
Presently, RT recommendations are based on clinicopathological characteristics and patient age only (4).
According to the 2019 European Society for Medical Oncology Clinical Practice Guidelines, RT is strongly recommended for all early-stage BC patients following breast-conserving surgery (BCS) to reduce locoregional recurrence (LRR) risk (4). An RT boost is recommended for patients who are at a high risk of recurrence such as those who were diagnosed under the age of 50 years, with grade three tumours, and with vascular invasion (4). Despite these recommendations 10% of node-negative BC patients will still have a recurrence after BCS and RT (2). Therefore, a more personalized approach to determine which patients should and should not have RT is warranted.
The role of RT is to remove occult cancer deposits in the tumour area after resection. RT destroys cancer cells directly, by causing DNA crosslinks and single and double strand breaks, and indirectly, through free radical production which in turn damages DNA, leading to apoptosis (5,6). Rapidly proliferating cancer cells have been shown to be more sensitive to DNA damage, repair more slowly, produce more double strand breaks in their own proliferation, and typically have mutations that cause a loss in DNA repair pathway redundancies that are seen in normal cells (5,7). Tumours also have mechanisms of acquired radioresistance which include enhanced migration and invasion (8), subtype switching (8), repopulation during gaps in RT (9,10), and redistribution of cells to more radioresistant G1 and S phases of the cell cycle (11).
Some studies have suggested that BC subtype may be predictive of recurrence after postmastectomy RT, as basal-like and luminal tumours show the greatest reduction in locoregional recurrence (LRR) rate after RT (12,13). However, clinical trials have confirmed that subtype is prognostic of recurrence rather than predictive of RT response (14-16). The 2021 St. Gallen Breast Cancer Conference also concluded that genomic signatures of intrinsic subtype currently approved for clinical practice, such as Oncotype DX® and MammaPrint®, cannot guide RT treatment decisions (17). Therefore, RT recommendations are currently based on clinicopathological characteristics and patient age only.
Advances in molecular medicine and precision oncology offer the opportunity to include genetic information in addition to clinicopathological and demographic information in RT-benefit prediction models. Clinicogenomic models predicting RT-benefit as percentage risk reduction in either LRR, disease-free or overall survival have been previously proposed (18-23). These models generally utilize gene sets selected from the literature shown to be associated with radiosensitivity or radioresistance and use traditional statistical approaches such as Cox Proportional Hazard (Cox PH) and linear regression to predict RT response. A recent study by Zhang et al. used clinical factors to develop a nomogram that predicted 5- and 10-year survival benefit from post-mastectomy RT with 60% to 80% accuracy (18). In 2019, Sjostrom et al. developed a clinicogenomic classifier called the Adjuvant Radiotherapy Intensification Classifier (ARTIC), which used expression values of 27 genes and patient age to predict the need for RT intensification in early-stage BC patients and was validated on clinical trial data (19). Patients with low ARTIC scores had a statistically significant 70% risk reduction in LRR with RT compared to no RT, while patients with a high ARTIC score had a 30% risk reduction in LRR with RT compared to no RT, which was not statistically significant (19).
Earlier work includes a 2012 study by Torres-Roca et al. where a radiosensitivity index (RSI) was developed which used the expression values of 10 genes and a linear regression model (20). This molecular signature was validated in BC cohorts where it stratified patients as radioresistant or radiosensitive, and where radiosensitive patients had an improved 5-year relapse-free survival compared to radioresistant patients (95% vs. 75%) that was not observed in RT-untreated patients (21). In 2014, Tramm et al. used expression values of seven genes to predict LRR post-mastectomy in high-risk BC patients treated with systemic therapy from the Danish Breast Cancer Cooperative Group randomized controlled trial. They classified patients as low- or high-risk of LRR, and RT benefit was observed in high-risk patients (22). In 2018, Cui et al. developed a 34-gene radiosensitivity signature that also stratified patients into radiosensitive and radioresistant groups (23). Notably, there is no overlap between the genes used in some of the models proposed (21,22,24), possibly reflecting the complexity of the biological networks that govern RT response. A more comprehensive, hypothesis-independent approach to gene selection that considers the expression values of all genes in the human genome is yet to be published.
ML algorithms are advantageous as they can process large amounts of data and build complex models to make predictions (25). As such, they have been successfully applied to a range of cancer prediction problems, with accuracies as high as 100% (25-28). Statistical tests such as Wilcoxon Rank Sum test (Wilcoxon RS) have been used to select genomic features for ML algorithms (29,30). For example, Niméus-Malmström et al. selected 5,237 genes using Wilcoxon RS, which were used as features in a Support Vector Machine (SVM) algorithm that was able to predict recurrence in oestrogen receptor-positive patients with an Area Under Receiver Operating Characteristics curve (AUROC) of 0.91 (30). ML algorithms have yet to be applied to the prediction of RT-benefit and therefore present a unique opportunity to create an improved and novel model. In this paper we present a high-accuracy genomic ML model to predict RT-benefit in early-stage BC patients. We also present a novel method for selecting genomic features for training ML algorithms that incorporates Wilcoxon RS.

Description of datasets
Clinical and gene expression data from the METABRIC study (31) were downloaded from cBioPortal (32).
Clinical data included BC type, stage, surgery, chemotherapy and recurrence status. Gene expression data were in the form of log-transformed Z-scores relative to the expression distribution of all samples for 24,368 genes, obtained using an Illumina HumanHT-12 v3 Expression BeadChip microarray. Gene expression and clinical data were merged using patient ID. The cohort was limited to patients with stage one or two breast invasive ductal carcinoma who were treated with surgery and RT, but not chemotherapy. The outcome to be predicted was relapse status, where time to relapse was the time from the date of diagnosis to the date of the first report of a new tumour event, which included LRR, distant metastasis or death with tumour (33). In this dataset, referred to as dataset one, the event status was assumed to be known for all patients; that is, all patients who were coded as "DiseaseFree" were assumed not to have had a relapse and all patients who were coded as "Recurred/Progressed" were assumed to have had a relapse during follow-up. Therefore, patients whose recurrence status was unknown because they were lost to follow-up or died of other causes before a relapse, i.e. right-censored patients, were coded as "DiseaseFree". To address the issue of patients with unknown relapse status due to right censoring, dataset one was further limited to patients who had complete follow-up for at least 15 years or had a recurrence or death within 15 years, whichever came first. A 15-year period was chosen as the majority of patients (73.58%) had a follow-up period of 15 years or less. This dataset is referred to as dataset two. Patients who had a recurrence within 15 years were coded as "Recurred/Progressed", while patients who did not have a recurrence within 15 years were coded as "DiseaseFree".
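For illustration only, the two labelling schemes described above could be sketched as follows. The analyses in this study were performed in R; this Python sketch is not the authors' code, and the names `Patient`, `label_dataset_one` and `label_dataset_two` are hypothetical. It assumes that a relapse recorded after the 15-year horizon counts as relapse-free at 15 years, and that a death with tumour is already recorded as a relapse event.

```python
from dataclasses import dataclass
from typing import Optional

HORIZON_YEARS = 15.0  # follow-up window used to define dataset two


@dataclass
class Patient:
    patient_id: str        # hypothetical identifier
    followup_years: float  # diagnosis to relapse, or to last contact if no relapse
    relapsed: bool         # True if a new tumour event was recorded


def label_dataset_one(p: Patient) -> str:
    """Dataset one: right-censored patients are assumed to be relapse-free."""
    return "Recurred/Progressed" if p.relapsed else "DiseaseFree"


def label_dataset_two(p: Patient, horizon: float = HORIZON_YEARS) -> Optional[str]:
    """Dataset two: keep only patients whose status at the horizon is known.

    Returns None for patients censored before the horizon (excluded).
    """
    if p.relapsed and p.followup_years <= horizon:
        return "Recurred/Progressed"
    if p.followup_years >= horizon:
        # Reached the horizon without a relapse inside the window.
        return "DiseaseFree"
    return None  # lost to follow-up before the horizon
```

Under this sketch, a patient last seen at 5 years without relapse is kept as "DiseaseFree" in dataset one but dropped from dataset two.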

Technical specifications
All analyses were performed in RStudio Version 1.3.959 using a computer with the following specifications: MacOS 10.14.6, 1.6 GHz Intel Core i5 CPU, 8 GB RAM. All ML models were developed using the Classification and Regression Training (caret) package (34). Sample code used for analyses is included in Additional file 1.

Gene set selection and machine learning algorithm training
The set of gene expression values for 24,368 genes was reduced using Cox PH and Wilcoxon RS to form three gene sets. First, Wilcoxon RS was used to determine which genes were differentially expressed between patients with and without a relapse at a significance level of p < 0.05. This led to the first gene set, referred to as the Wilcoxon set. Second, Cox PH was used to determine which genes affected recurrence risk at a significance level of p < 0.05. This led to a second gene set, referred to as the Cox PH set. Lastly, Wilcoxon RS followed by Cox PH were used sequentially to reduce the number of genes, which led to the third gene set, referred to as the Wilcoxon-Cox set.
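The sequential filter can be sketched in a few lines. The following Python example is an illustrative approximation, not the R code used in this study. It makes two assumptions that should be read as substitutions: the Wilcoxon RS p value uses a normal approximation without tie or continuity correction, and the Cox PH step uses the score test of the partial likelihood at beta = 0 for a single gene rather than the test reported by a fitted Cox model. All function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_p(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided Wilcoxon rank-sum p value (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 2 * (1 - NormalDist().cdf(abs((w - mean) / sd)))


def cox_score_p(x: Sequence[float], time: Sequence[float],
                event: Sequence[bool]) -> float:
    """Two-sided p value from the univariate Cox score test at beta = 0."""
    u = v = 0.0
    for i in range(len(x)):
        if not event[i]:
            continue
        # Risk set: everyone still under observation at this event time.
        risk = [x[j] for j in range(len(x)) if time[j] >= time[i]]
        m = sum(risk) / len(risk)
        u += x[i] - m
        v += sum((r - m) ** 2 for r in risk) / len(risk)
    if v == 0:
        return 1.0
    return 2 * (1 - NormalDist().cdf(abs(u / math.sqrt(v))))


def wilcoxon_cox_filter(expr: Dict[str, Sequence[float]],
                        relapsed: Sequence[bool], time: Sequence[float],
                        alpha: float = 0.05) -> List[str]:
    """Keep genes that pass the rank-sum test and then the Cox test."""
    selected = []
    for gene, values in expr.items():
        pos = [v for v, r in zip(values, relapsed) if r]
        neg = [v for v, r in zip(values, relapsed) if not r]
        if rank_sum_p(pos, neg) >= alpha:
            continue  # stage 1: not differentially expressed
        if cox_score_p(values, time, relapsed) < alpha:
            selected.append(gene)  # stage 2: associated with recurrence risk
    return selected
```

Running the rank-sum stage first means the more expensive survival test is only evaluated for genes that are already differentially expressed between the two outcome groups.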
Eight machine learning (ML) models were chosen based on their extensive use in cancer prediction research (25,27). These models were: Artificial Neural Networks (NN) (35), Linear Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), XGBoost, K-Nearest Neighbours (KNN), Naive Bayes (NB) and Logistic Regression (LR). Each model was trained on a randomly selected 80% of the data with 10-fold cross validation, using the gene sets selected with Wilcoxon RS and/or Cox PH as features, and then tested on the remaining 20% of the data.
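The data partitioning can be illustrated with a minimal Python sketch. It is not the caret workflow used in this study: it assumes a plain random 80/20 split without stratification by outcome, and the function names and the seed are hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def train_test_split_indices(n: int, test_fraction: float = 0.2,
                             seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly assign patient indices to a training and a held-out test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold(indices: List[int], k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (fit, validate) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, folds[i]
```

The held-out test indices are never passed to `k_fold`, so cross validation is used only to fit and compare models within the training portion.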
Other techniques were used to further reduce the three gene sets to determine the effect of smaller sets on prediction accuracy. First, the Wilcoxon set and the Cox PH set were reduced to the top 1000, 500, 100 and 50 genes with the lowest p values. Second, genes in the Wilcoxon-Cox set were reduced using hazard ratios (HR) to form three sets: (i) genes with a HR>1, (ii) genes with a HR<1, and (iii) genes with HRs in the first and third tertiles. The subtype variable was also added to the Wilcoxon-Cox set to determine if it improved model performance. Each of these reduced sets was also used to train the ML algorithms. Third, recursive feature elimination (RFE) was used to investigate whether smaller subsets of genes would improve classification accuracies when used for algorithm training.
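The first two reduction steps amount to simple ranking and thresholding, as the following hedged Python sketch shows. The function names are hypothetical, and the tertile rule (lowest and highest third of genes ranked by HR, with the group size rounded down) is an assumption, since the exact cut-offs are not stated here.

```python
from typing import Dict, List, Tuple


def top_n_by_p(p_values: Dict[str, float], n: int) -> List[str]:
    """Return the n genes with the lowest p values."""
    return sorted(p_values, key=p_values.get)[:n]


def split_by_hazard_ratio(hr: Dict[str, float]) -> Tuple[List[str], List[str], List[str]]:
    """Return (HR > 1, HR < 1, genes in the first and third HR tertiles)."""
    above = [g for g, h in hr.items() if h > 1]
    below = [g for g, h in hr.items() if h < 1]
    ranked = sorted(hr, key=hr.get)
    third = len(ranked) // 3
    outer = ranked[:third] + ranked[len(ranked) - third:]
    return above, below, outer
```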
Five sets of genes selected from the set of genes with a Wilcoxon RS p value greater than 0.05 were also used for training and their accuracies compared. The gene set was ordered by increasing Wilcoxon RS p value and two subsets of 1,596 genes with a p value greater than 0.05, corresponding to genes ranked 1597-3192 and 3193-4788, were selected. Another three sets of 1,596 genes chosen at random from genes that had a Wilcoxon RS p value greater than 0.05 were selected. Each of these five sets of non-significant genes was used to train SVM and NN algorithms.
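A minimal Python sketch of how such control gene sets might be drawn is given below. It is illustrative only: the function names and the random seed are hypothetical, and ranks are taken to be 1-based positions in the list ordered by increasing p value.

```python
import random
from typing import Dict, List


def rank_window(p_values: Dict[str, float], first_rank: int, last_rank: int) -> List[str]:
    """Genes whose 1-based rank by increasing p value lies in [first_rank, last_rank]."""
    ranked = sorted(p_values, key=p_values.get)
    return ranked[first_rank - 1:last_rank]


def random_nonsignificant(p_values: Dict[str, float], n: int,
                          alpha: float = 0.05, seed: int = 0) -> List[str]:
    """Draw n genes at random from those with a p value greater than alpha."""
    pool = sorted(g for g, p in p_values.items() if p > alpha)
    return random.Random(seed).sample(pool, n)
```

With the ranks used in this study, the two ordered control sets would correspond to `rank_window(p, 1597, 3192)` and `rank_window(p, 3193, 4788)`.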
A curated list of 64 radiogenes (Supplementary Table S2) from three publications on RT-benefit (21,22,24) was also used to train an SVM algorithm to determine whether genes of biological relevance selected from the literature would be valuable for training ML algorithms.

Model comparison and testing in other clinical populations
The four ML models with the highest accuracies were tested on patients who did not have RT to determine whether they were specific to RT-treated patients. These four models were further compared on computing time on a test set of 90 patients, AUROC, sensitivity and specificity. Of the four models, the best performing model was further tested on ER+, ER-, and chemotherapy-treated patients.
Hyperparameter tuning and gene set enrichment analysis
The hyperparameter of the best performing model was tuned using a manual grid search. The genes in the final model were characterized using Gene Set Enrichment Analysis (GSEA) with Gene Ontology terms for "biological process", "cellular component" and "molecular function". The overlaps of this model's gene set with a list of 64 genes previously used in RT-benefit predictive models (22,24,43) and with 723 cancer driver genes from the Catalogue for Somatic Mutations in Cancer (COSMIC) v93 (44) were calculated. The stringApp (45) in Cytoscape (46) was used to visualize protein-protein interactions in the final gene set using an evidence threshold of 0.6.

Results
Demographic and clinicopathological characteristics of training cohort
After limiting the METABRIC cohort to patients who were stage one or two, who had surgery and RT but no chemotherapy, and who had gene expression data, the cohort had 463 patients (Figure 1A), of which 36.5% (n=169) had a recurrence (Figure 1B). The median follow-up time was 10.68 years (range 0.21-29.25 years).

The clinical profile of patients in the RT-treated cohort was similar to that of the entire METABRIC cohort as previously described (33). In RT-treated patients the average age at diagnosis was 63.17 years ( [...] Figure 1B). Each of these gene sets was used to train ML algorithms. These gene sets were further reduced in a second stage using p values and recursive feature elimination as shown in Figure 1B. [...] 2).
When the 1,000 genes with the lowest p values from the Wilcoxon set were used to train each of the eight ML algorithms, a decrease in classification accuracy was observed across all eight models compared to when the full Wilcoxon set was used for training (Figure 2 [...] Figure S1). When the top 100 and top 50 genes were used for training, all of the eight models had lower classification accuracies ranging from 58.24% to 72.53% (Supplementary Figure S1).
The two algorithms that gave the models with the highest accuracies, SVM and NN, were chosen for further training. Two subsets of 1,596 genes with a p value greater than 0.05, corresponding to differentially expressed genes (DEGs) ranked 1597-3192 and 3193-4788 in the Wilcoxon RS test results, were used for training. When these gene sets were used there was a decrease in accuracy of approximately 10% to 40% for both the SVM and NN models (Figure 4) [...] Figure 5). Thus, accuracies were lower by approximately 30% to 40% for RT-untreated compared to RT-treated patients for all models (Figure 5).

Computational Time, Sensitivity and Specificity of ML Models
The four models with the highest accuracies (SVM1596, SVM977, NN1596 and NN977) were compared on computing time, sensitivity and specificity. Each model took less than one second to predict relapse status on a test set of 90 patients (Supplementary Figure S3). All models had an AUROC value greater than 0.94 (Supplementary Figure S4). SVM977 had the highest sensitivity (91.09%) but the lowest specificity (78.95%) compared to all other models, while SVM1596 had the highest specificity (92.79%) but the lowest sensitivity (81.55%) (Supplementary Table S1). The NN models had sensitivities and specificities between 85.29% and 89.23% (Supplementary Table S1). SVM977 was chosen for further analysis using GSEA and hyperparameter tuning as it had the highest sensitivity. Varying the values of the cost hyperparameter of the SVM algorithm with manual grid search showed no change in accuracy (Figure S5). Therefore, a cost value of one was chosen.

SVM model trained with the Wilcoxon-Cox set shows high performance independent of ER status
When the SVM model trained with the Wilcoxon-Cox set was tested on ER+ or ER- patients only, the accuracy, sensitivity and specificity were high, ranging between 95.45% and 99.32% (Table 2). When this model was tested on chemotherapy-treated patients, the accuracy decreased to 64.02% (Table 2). The sensitivity and specificity also decreased to 52.27% and 74.26% respectively.

Wilcoxon-Cox gene set is enriched for known radiogenes and cancer driver genes
GSEA using Gene Ontology terms revealed that the most significant terms (p<0.01) in the 977 gene set mapped to biological processes related to multicellular organism development (n = 204 genes), mitotic cell cycle (n = 114 genes), cell cycle (n = 153 genes) and cell differentiation (n = 153 genes). When GSEA was performed for terms related to which cellular compartment the proteins of these genes operate in, it was found that the most highly represented cellular regions were the nucleoplasm (n = 194 genes), chromosome centromeric region (n = 39 genes) and centrosome (n = 22 genes). Lastly, the most significantly annotated terms related to molecular function were microtubule binding (n = 29 genes), cell adhesion molecule binding (n = 22 genes) and ATP binding (n = 72 genes).
Sixty-four of the 68 radiogenes (Supplementary Table S2) had mRNA expression data in the METABRIC dataset, representing 0.26% of the dataset. Sixteen of the 64 curated radiogenes (Supplementary Table S2) were in the Wilcoxon-Cox set of 977 genes, representing 1.6% of the set. This means that there was an approximately 6.2 times enrichment of known radiogenes in the Wilcoxon-Cox set. Most of these 16 radiogenes are involved in processes related to cell division, e.g. MKI67, SPC25, and PRC1 (Supplementary Table S3).
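The radiogene enrichment figure follows directly from the counts quoted above: the proportion of radiogenes in the 977 gene set divided by their proportion among all 24,368 genes in the dataset. A short Python check, in which the function name `fold_enrichment` is hypothetical:

```python
def fold_enrichment(hits_in_set: int, set_size: int,
                    hits_in_background: int, background_size: int) -> float:
    """Ratio of the hit proportion in a gene set to that in the background."""
    return (hits_in_set / set_size) / (hits_in_background / background_size)


# 16 of 977 genes in the Wilcoxon-Cox set vs. 64 of 24,368 genes overall.
radiogene_enrichment = fold_enrichment(16, 977, 64, 24368)
print(f"{radiogene_enrichment:.1f}")
```

This ratio describes the size of the enrichment only; it does not by itself test whether the enrichment is statistically significant.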
Fifty-five of the 723 genes from the COSMIC database overlapped with the Wilcoxon-Cox set. Assuming that all COSMIC genes are represented in the METABRIC dataset, this corresponds to an approximately 1.4 times enrichment of cancer driver genes in the Wilcoxon-Cox set. Cytoscape protein-protein network analysis revealed a highly interconnected network with interactions between almost all genes in the Wilcoxon-Cox set (Supplementary Figure S5).

SVM algorithm trained using 61 known radiogenes shows decreased performance
The SVM algorithm trained with 61 known radiogenes present in the METABRIC dataset (Appendix Table E [...] (Table 3). There were 316 genes that overlapped between the sets of 977 and 1,044 genes.

Discussion
This study demonstrates that ML can be used to develop highly accurate, sensitive, and specific models to predict RT-benefit in early-stage BC patients. We present a high-performance SVM model (93.41% accuracy, 91.09% sensitivity, and 78.95% specificity) that can predict RT-benefit in early-stage BC patients independent of subtype. Here, RT-benefit was defined as relapse-free status following surgery and RT. The accuracy of this model (93.41%) represents an improvement on that of the best previously reported model (80%) in predicting RT benefit (18). This model used an SVM algorithm and expression values of a set of 977 genes, referred to as the Wilcoxon-Cox set, to predict RT-benefit. This study also presents a novel genomic feature selection approach that reduced the number of genes from genome-wide gene expression data by 96% using Wilcoxon RS followed by Cox PH. This feature selection method contrasts with previous studies that selected genes with known functions from the literature or in vitro experiments to build RT-benefit models (18-23). To our knowledge, this is the first study to apply ML algorithms to predicting RT-benefit with consideration of all genes in the human genome.
The preliminary challenge in model development was finding a publicly available dataset with complete clinical and gene expression data. The dataset also needed to be balanced in the outcome variable, as unbalanced datasets can lead to a misleadingly high prediction accuracy. The dataset also needed to be sufficiently large, as small datasets can lead to model overfitting (47) and lack of precision (48) for some ML algorithms. The METABRIC dataset was chosen because of its large cohort size (2,509 patients), balance in outcome (approx. 40% of patients had a recurrence), extensive longitudinal follow-up data (approx. 30 years) and availability of genome-wide gene expression data (24,368 genes).
There was significant variability in the follow-up time for patients (range 0.21-29.25 years). This meant that some patients were right censored, i.e. the patient's follow-up ended before relapse occurred. While Cox PH models account for right censoring, classification ML algorithms do not. Two strategies were used to address this issue of right censoring in the data: 1) patients who were lost to follow-up were assumed to have had no relapse in dataset one, 2) the cohort was limited to patients who were followed for at least 15 years, or had a recurrence or death within 15 years, whichever came first, in dataset two.
The assumption in the first method allowed all the data for the 463 patients to be retained for training. However, it had a disadvantage in that it treated patients with an unknown relapse status as having no relapse, which may or may not be true for each patient. The second method made no assumption about the patient's relapse status, as it limited the cohort to only patients who had complete follow-up. The disadvantage of this method is that it led to loss of 46% of the data, which would have introduced some degree of exclusion bias into the dataset. The majority of the analysis here used dataset one, due to its larger sample size. Dataset two was used for comparison in order to determine whether or not the ML methodology used would apply across datasets with different limitations.
Controls on BC type, stage, chemotherapy, surgery and RT status were implemented to define the clinical population, which was early-stage BC patients who were treated with BCS and RT. Previous work modelling RT-benefit also controlled for BC subtype by building separate models for ER+ and ER- patients (30). The rationale was that ER+ and ER- tumours are distinct in their gene expression profiles, which are associated with the differences in outcomes observed by subtype (30). This study did not control for BC subtype for two reasons: first, such controls would create a more homogeneous patient group, eliminating key differences in expression profiles that the ML algorithm can utilize to make a classification, and second, a model that works irrespective of BC subtype would be easier to implement than two distinct models. For these same reasons, the cohort was not limited to patients who had hormone therapy (HT). HT is recommended for BC patients who are HR+; therefore, limiting to patients who either did or did not have HT would also limit the dataset to patients who were either HR+ or HR- subtype. The SVM model with the Wilcoxon-Cox set of 977 genes was shown to have high accuracy in ER+ and ER- patients, demonstrating that the model is independent of subtype.
A key challenge in this study was to reduce the number of genes, as the use of too many features, such as the entire set of 24,368 genes, can result in an overfitted model for some ML algorithms (49). To achieve this, a novel filter feature selection approach was developed that used Wilcoxon RS followed by Cox PH, which reduced the number of genes by 96% to the Wilcoxon-Cox set of 977 genes. Wilcoxon RS was previously used to determine differentially expressed genes (DEGs) in BC datasets (29,30). The application of Wilcoxon RS followed by Cox PH for genomic feature selection has not been reported. This novel approach reduced the dataset substantially by selecting a set of DEGs that also affected recurrence risk. This approach was also better than selecting known genes of biological relevance for training, as when 64 radiogenes were used for training, the resulting SVM model had poor accuracy of 54.61%. Therefore, considering all genes in a hypothesis-independent manner appears to be a better approach than selecting known genes for training.
A clear relationship between model accuracy and the number of genes selected using Wilcoxon RS and Cox PH was observed. When smaller gene sets of the top 1000, 500, 100 and 50 genes with the lowest p values were used for training, there was an overall decline in accuracy across all eight ML algorithms. Therefore, a larger number of genes was needed for higher accuracy. There was also a relationship between the significance threshold of 0.05 for gene selection and model accuracy. When gene sets with a p value greater than 0.05 were selected there was also a decline in accuracy for both the SVM and NN algorithms. The lowest performance (~55%) was seen when non-significant genes with a p value greater than 0.05 were randomly selected. These results demonstrate the importance of considering the significance threshold in genomic feature selection using Wilcoxon RS and Cox PH.
The top four models presented use SVM or NN with either the Wilcoxon set of 1,596 genes or the Wilcoxon-Cox set of 977 genes. Given that SVM and NN are the most consistently used algorithms in BC prediction research, this result further corroborates the utility and consistent performance of these models (25). The consistently high accuracy of SVM suggests that the genomic features selected by Wilcoxon RS and Cox PH are sufficiently separated in high dimensional space to determine an optimal hyperplane with a large margin. It also suggests that this feature selection approach was able to reduce noise in the feature space and overlap between classes. SVM with a radial or polynomial kernel function was also investigated; however, this did not improve accuracy (data not shown), and therefore a linear hyperplane was sufficient for this problem.
The lower performance of the majority of ML algorithms chosen (RF, DT, XGBoost, KNN, NB and LR) may be attributed to their underlying assumptions or their inability to model complex relationships. For example, NB and LR assume independence among predictors. This assumption would not hold with gene expression data, where the expression pattern of one gene is often directly or indirectly dependent on the expression of another. LR is also generally not able to model complex relationships and is traditionally used to model a linearly separable classification problem. KNN is known to underperform with high dimensional data, where all the vectors are almost equidistant, making it difficult to determine clusters using distance metrics. DTs are also known to underperform, as single trees are unstable and tend to overfit the data.
Addition of the subtype variable to the Wilcoxon-Cox set did not improve accuracy of the SVM model and decreased accuracy of the NN model. For the SVM model, the subtype variable was not a support vector and therefore did not influence the position of the linear hyperplane separating those who did have a recurrence from those who did not. In summary, subtype was an unnecessary feature for the models presented.
It is significant that the ML models demonstrated better prediction accuracies for RT-treated patients compared to untreated patients. The top four models (SVM977, SVM1596, NN977, NN1596) all performed poorly when applied to RT-untreated patients, with prediction accuracies of 50% to 60%. Notably, patients in the RT-untreated cohort had larger tumours, were more likely to have a mastectomy, and to have no lymph nodes examined as positive. Therefore, biological differences between the tumours of patients in the RT-treated and untreated cohorts likely resulted in differences in gene expression profiles between the cohorts, which subsequently impacted the SVM model performance. A similar trend of poor accuracy (64.02%) was also observed when the SVM977 model was tested on data for chemotherapy-treated patients. Taken together, these results are promising in supporting the validity of the SVM977 model in predicting relapse in early-stage, surgery and RT-treated, chemotherapy-untreated BC patients. Future work would involve further controlling for treatment factors such as the type of surgery, and controlling for the extent of disease progression by selecting patients with no lymph node metastasis in the training cohort.
Comparison of the four models with the highest accuracy (SVM977, SVM1596, NN977 and NN1596) revealed small differences in AUROC values (1-2%), and even smaller differences in computational time (<1 second) that would not be noticeable to the end-user. Therefore, a model was not chosen based on these characteristics. Sensitivity, the proportion of patients with a recurrence who are correctly identified, was considered more important than specificity, the proportion of patients without a recurrence who are correctly identified. That is, it is more important to correctly predict recurrence in RT-treated patients, as they can be given the opportunity for RT-intensification or sensitization as a clinical intervention to reduce the risk of recurrence. An RT boost has been shown to significantly reduce the risk of LRR but with an increased risk of moderate to severe fibrosis (50). Patients who are correctly identified as having no recurrence (specificity) can continue with standard of care or have RT omission. Careful consideration of false positives is needed as these patients would be overtreated. Thus, the RT treatment course would require a risk-benefit discussion between the treating radiation oncologist and the patient. In summary, SVM977 is the best model because it had the highest sensitivity among all models.
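For readers less familiar with these metrics, the following Python sketch shows how accuracy, sensitivity and specificity relate to the confusion matrix when recurrence is treated as the positive class. The function name is hypothetical and the labels in the usage example are toy values chosen for illustration; they are not data from this study.

```python
from typing import Dict, Sequence


def classification_metrics(actual: Sequence[bool],
                           predicted: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity with True meaning recurrence."""
    tp = sum(a and p for a, p in zip(actual, predicted))          # recurrence, flagged
    tn = sum(not a and not p for a, p in zip(actual, predicted))  # no recurrence, cleared
    fp = sum(not a and p for a, p in zip(actual, predicted))      # would be overtreated
    fn = sum(a and not p for a, p in zip(actual, predicted))      # missed recurrence
    return {
        "accuracy": (tp + tn) / len(actual),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy example: 4 patients with a recurrence, 6 without.
toy_actual = [True, True, True, True, False, False, False, False, False, False]
toy_predicted = [True, True, True, False, False, False, False, False, True, False]
toy_metrics = classification_metrics(toy_actual, toy_predicted)
```

In this toy example one recurrence is missed and one relapse-free patient is flagged, so sensitivity is 3/4 and specificity is 5/6 even though overall accuracy is 8/10.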
Characterization of the Wilcoxon-Cox 977 gene set using GSEA revealed that many of these genes are involved in cell cycle and division and operate in the nucleoplasm. This was expected as it is well known that uncontrolled cell division is a hallmark of cancer (51). Further, previous work in BC cell lines found that the expression levels of 51 genes that were correlated with radiosensitivity were enriched for genes involved in cell cycle arrest (24). This is also consistent with research that has shown that RT-resistance mechanisms are involved in repopulation and redistribution of cells to more radioresistant G1 and S phases of the cell cycle (10,11). The 977 gene set was also enriched (6.2 times) with radiogenes, which further demonstrated that the feature selection approach was able to select for known genes of biological relevance. These results suggest that it is likely the compounded effect of several hundred genes in highly interconnected networks involved in cell division and redistribution of cells in the cell cycle that drives recurrence after RT.
When dataset two was used to develop a model to predict RT-benefit, the SVM model had approximately 7-9% lower accuracy and 20% lower sensitivity, but 15-18% higher specificity, than when dataset one was used for training. This change in performance is likely due to the smaller training dataset used (limited to patients who had complete 15-year follow-up), which is also reflected in the wider confidence interval. However, the overall performance profile of this model was good, demonstrating that the methodology used for ML model development was valid using both datasets. Wilcoxon RS followed by Cox PH selected a set of 1,044 genes in dataset two, of which 316 genes overlapped with the 977 gene set. Therefore, the genes selected for training using the proposed feature selection methodology are not fixed and depend on the patients in the cohort used. Given the genomic heterogeneity that has been shown to occur between and within BC subtypes (31,52), and between different ancestral populations (53), it would be expected that the gene sets selected would vary with the cohort used.
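The degree of overlap between the two selected gene sets can be quantified directly from the reported counts (977 genes, 1,044 genes, 316 shared). The sketch below is illustrative only; the function name is hypothetical and the Jaccard index is one of several possible overlap measures, not one reported by the study.

```python
def overlap_summary(size_a, size_b, shared):
    """Summarize the overlap between two sets given their sizes and intersection size."""
    union = size_a + size_b - shared
    return {
        "fraction_of_a": shared / size_a,  # share of set A also found in set B
        "fraction_of_b": shared / size_b,  # share of set B also found in set A
        "jaccard": shared / union,         # intersection over union
    }


# Counts reported in the text: 977-gene set, 1,044-gene set, 316 genes in common.
summary = overlap_summary(size_a=977, size_b=1044, shared=316)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Roughly one third of each gene set is shared with the other, which is consistent with the observation that the selected genes depend on the cohort.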
This study had some limitations. First, the outcome used was relapse-free status. A more direct outcome for measuring RT-benefit would be ipsilateral LRR, which was unavailable in the METABRIC dataset. Therefore, this study could not differentiate patients who had a recurrence of the same primary BC from those who had a new primary. No information was available on the status of resection margins, which is a known factor affecting recurrence risk. Further, information on RT fields and dosages was absent, so it could not be determined whether the RT given was at a commonly used dosage. This study also could not limit the cohort to patients who had a lumpectomy, as the majority of patients had a mastectomy (~80%) while few had a lumpectomy (~20%). This is likely because the METABRIC cohort consists of patients who were diagnosed between 1977 and 2005, and since then there has been a shift toward breast conservation for early-stage BC patients (4). This study was also unable to test the SVM977 model in another BC cohort, as the BC datasets available for public use were not adequately clinically annotated or sufficiently large for ML training. Further, inconsistent gene naming conventions made it impossible to select the Wilcoxon-Cox set of 977 genes in other datasets. Future work would involve applying the methodology used here to another BC cohort, preferably in the setting of a prospective randomized controlled trial as the gold standard (54).

Conclusion
We presented a methodology that can be used to develop ML models to predict RT-benefit in early-stage BC patients. The methodology incorporates a novel genomic filter feature selection approach that uses Wilcoxon RS followed by Cox PH to reduce the set of genes from genome-wide gene expression data by 96%. This methodology resulted in a high-performance SVM model (93.41% accuracy, 91.09% sensitivity, and 78.95% specificity) that predicted RT-benefit in early-stage BC patients, independent of subtype, using the expression values for a set of 977 genes. The achievement of high accuracy demonstrates the potential of ML to address important problems of clinical interest. Our methodology can be applied to develop ML models that differentiate those patients who will go on to have a recurrence despite RT.

Declarations
Ethical Approval and Consent to participate
Not applicable

Consent for publication
Not applicable

Availability of supporting data
All data used in this study were accessed through open access online repositories.

Competing interests
The authors have no competing interests to disclose.

Funding
This work was not funded by any grant.

Authors' Contributions
KB did the analysis and wrote the manuscript. MJ, RH and JF were major contributors in conceptual development and in writing the manuscript. All authors read and approved the final manuscript.

Flowcharts of A) patient selection criteria for the METABRIC cohort, and B) the feature selection methods used to select the genes to train ML algorithms to predict relapse status.