In this study, we investigated the feasibility of predicting the NAR score and survival outcomes for LARC patients using deep machine learning and radiomics models constructed from contrast-enhanced radiotherapy planning CT images. The results indicate that radiomics features can augment the predictive power of clinical models for OS, DMFS and LRFFS. The model was able to predict these outcomes with moderate accuracy.
The challenge with LARC is that most validated predictive and prognostic models are based on post-operative parameters, which limits their usefulness for pre-operative treatment decisions. There are emerging data that the response to standard NACRT is heterogeneous [4]. With about 20% of patients achieving pCR after standard NACRT, the indications for life-changing surgery in this group of patients require justification, especially given data supporting good outcomes when a watch-and-wait strategy is adopted [12, 13]. On the other hand, some patients do not respond adequately to standard NACRT. For these patients, an intensified strategy such as that described in the RAPIDO trial, or adjuvant chemotherapy, may be more appropriate [35]. This is where the radiomics prediction model can be used for personalized, patient-centred pre-operative treatment decision making.
This study also showed that the radiomics model predicted the NAR score with moderate accuracy. Furthermore, we correlated the NAR score with survival outcomes, and this consistently indicated that the higher the NAR score, the poorer the outcome. Most radiomics studies in LARC predict pCR, which is a dichotomous histopathologic variable achieved in only about 20% of patients post-NACRT [11, 20]. In comparison, the NAR score, which is derived from more variables, may provide more information. The NAR score is a widely used surrogate in clinical trials [17]. It was developed and widely validated as a short-term endpoint to act as a surrogate for DFS and OS in rectal cancer, allowing more rapid determination of the success or failure of an experimental intervention in LARC [17, 36–39]. The NAR score has a greater predictive ability than pCR for OS [17, 37]. From the NSABP R-04 randomised phase 3 trial patient dataset, the authors concluded that the 5-year OS for NAR < 8 (low), NAR 8–16 (intermediate) and NAR > 16 (high) was 92%, 89% and 68% respectively [37]. In the German CAO/ARO/AIO-04 randomised phase 3 trial patient dataset, the 3-year DFS was 91.7%, 81.8% and 58.1% for low, intermediate and high NAR scores respectively [39]. However, the NAR score can only be calculated after neoadjuvant treatment and resection, and is therefore not available to clinicians when deciding whether to offer neoadjuvant treatment at the outset. Again, this is where a radiomics model for predicting the NAR score can be useful in guiding pretreatment counselling, and it may also be useful in clinical trials.
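For readers unfamiliar with how the NAR score combines its variables, the following is a minimal sketch. It assumes the commonly published NAR formula, NAR = [5·pN − 3·(cT − pT) + 12]² / 9.61, with cT in 1 to 4, pT in 0 to 4 and pN in 0 to 2; this section does not state the exact computation used in our study, so the function names and input checks here are illustrative only. The low, intermediate and high cut-offs are those quoted above.

```python
def nar_score(cT: int, pT: int, pN: int) -> float:
    """Neoadjuvant rectal (NAR) score on a 0-100 scale.

    Assumes the commonly published formula:
        NAR = [5*pN - 3*(cT - pT) + 12]^2 / 9.61
    cT: clinical T stage (1-4), pT: pathological T stage (0-4),
    pN: pathological N stage (0-2).
    """
    if cT not in (1, 2, 3, 4):
        raise ValueError("cT must be 1, 2, 3 or 4")
    if pT not in (0, 1, 2, 3, 4):
        raise ValueError("pT must be 0, 1, 2, 3 or 4")
    if pN not in (0, 1, 2):
        raise ValueError("pN must be 0, 1 or 2")
    return (5 * pN - 3 * (cT - pT) + 12) ** 2 / 9.61


def nar_category(score: float) -> str:
    """Map a NAR score to the groups quoted in the text:
    low (< 8), intermediate (8-16), high (> 16)."""
    if score < 8:
        return "low"
    if score <= 16:
        return "intermediate"
    return "high"


if __name__ == "__main__":
    # Hypothetical patient: cT3 downstaged to pT1, node negative.
    s = nar_score(cT=3, pT=1, pN=0)
    print(round(s, 2), nar_category(s))
```

Because the score depends on pT and pN, it can only be computed after resection, which is why a pre-treatment prediction of the NAR group is of interest.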
Our results show that the model has a relatively good discriminatory ability when predicting high NAR (> 16), with an AUC of 0.68 ± 0.11. We applied the NAR < 8 model to a contemporaneous cohort of patients (N = 31) who declined surgery and found that the majority of the patients (N = 29) were predicted to have NAR > 8 and had poorer overall survival (Figure S4). Here, the NAR model can be used as an added layer of assessment in deciding on neoadjuvant treatment strategies, as discussed. Barring contraindications, surgical intervention may have benefited this group of patients. The difference in OS between patients with NAR < 8 and NAR > 8 was not statistically significant, but this comparison was limited by the small sample size and the imbalance between groups (NAR > 8, n = 29; NAR < 8, n = 2), which prevents it from being meaningful and representative. Testing this model on a larger sample is required but was beyond the scope of this study.
In this study, we demonstrated that the CT-based mesorectal (CTV) imaging features contribute significantly to the accuracy of the final model compared with the intratumoral features (Fig. 4). The distinction between intratumoral and peritumoral radiomics has been studied in different cancers [40–44]. Like Shaish et al., we also found value in the mesorectal compartment for predicting response and prognosis [21]. Most other radiomics studies in LARC looked only at the gross tumour, whilst the mesorectum, which contains important information, has often been overlooked. The information contained in the peritumoral region may inform on immune response, angiogenesis and invasion beyond the usual radiotherapy or surgical fields, which in turn can be analysed to additionally predict survival outcomes [40–44]. This supports the inclusion of the mesorectum in future rectal radiomics studies, with consideration given to further investigation of regions beyond it, such as the pelvic side wall. The latter may serve as a predictive tool in guiding the need for pelvic lymph node dissection.
We undertook several rigorous approaches to ensure the quality of the study. For example, the whole tumour volume and the surrounding mesorectum were analysed individually, instead of working with a single segmentation. A robust procedure was designed to select a subset of features from the original 1130 radiomics features to account for CT scanner variation and inter-rater variation in CTV and GTV contouring. A further feature reduction technique based on retaining uncorrelated features was then performed. The resulting final sets of 404 and 254 robust radiomics features for the GTV and CTV respectively increase the credibility of the study and reduce overfitting of the model. For the model performance, a nested 10-fold cross-validation was used. Feature reductions were applied strictly to the training fold to ensure no data leakage. The IBSI guidelines were followed in the construction of the model [28]. The overall radiomics quality score (RQS) for our model was 38.89% (Figure S5), a higher score than that of most CT-based radiomics studies, where scores range from 0 to 47% with the majority falling below 20% [45, 46].
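The principle of restricting feature reduction to the training fold can be illustrated with a minimal sketch. It is not the pipeline used in this study: the greedy Pearson filter, the 0.9 threshold, the helper names and the toy data are all assumptions made for illustration, and only the outer cross-validation loop is shown (in a nested scheme, hyperparameter tuning would run in an inner loop on the training indices only).

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def select_uncorrelated(rows, threshold=0.9):
    """Greedy filter: keep a feature column only if its |r| with every
    already-kept column is below the (assumed) threshold."""
    n_cols = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(n_cols)]
    kept = []
    for j in range(n_cols):
        if all(abs(pearson(cols[j], cols[k])) < threshold for k in kept):
            kept.append(j)
    return kept


def kfold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for a shuffled k-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        yield train, folds[i]


def leakage_free_selection(rows, k=10, threshold=0.9):
    """Run the feature filter separately inside each fold, using training rows only.
    The held-out rows never influence which features are kept."""
    results = []
    for train, test in kfold_splits(len(rows), k):
        kept = select_uncorrelated([rows[i] for i in train], threshold)
        results.append((kept, train, test))
    return results


if __name__ == "__main__":
    # Toy data: feature 1 duplicates feature 0 (scaled), feature 2 is unrelated.
    rows = [[i, 2 * i, (i * 7) % 5] for i in range(20)]
    for kept, train, test in leakage_free_selection(rows):
        print("kept features:", kept, "| held-out samples:", sorted(test))
```

Selecting features once on the full dataset and then cross-validating would let the held-out samples shape the feature set, which tends to inflate the estimated performance.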
There are several additional strengths to our study. To our knowledge, this is the first machine learning study using contrast-enhanced CT-based radiomics of the rectum and mesorectum for the prediction of NAR score and survival outcomes in LARC. We created two radiomics models, the NAR score model and the survival model, and compared the contributions of clinical, radiomics and combined features to model performance. The NAR score was also correlated with survival outcomes. Most other CT-based radiomics studies looked at pCR, and some of these could not show the added value of radiomics data in predicting pCR or did not additionally predict survival outcomes [20, 47–51]. The international multicentre MRI-based radiomics study by Shaish et al. is the only other radiomics study in LARC predicting the NAR score [21]. Their model had a similar performance (AUC of 0.66) and the study also evaluated the mesorectal compartment. Nevertheless, the methodology was heterogeneous, with the MRI scanner, MRI protocol and neoadjuvant chemotherapy varying over the accrual period and between institutions. The authors, however, felt that the heterogeneous image data was a strength of their study, as they showed that, after controlling for imaging parameters in multivariate analysis, the radiomics features bore most of the predictive strength, driving the outcome-response R² and improving the generalizability of the model. The results of that study may nevertheless be too optimistic because of data leakage: feature selection was performed in a single fold, while performance was evaluated using a random train-test split.
Most radiomics studies for predicting treatment response and survival in LARC have been MRI-based [26]. Translation of MRI-based radiomics into real-world practice is often limited by cost, lack of resources, difficulty with reproducibility and lack of multicentre validation. This is less of an issue with CT imaging and CT-based radiomics, as the voxel value (expressed in Hounsfield Units) has an actual physical interpretation relating to the X-ray attenuation coefficient. This ensures a certain degree of consistency between CT images acquired on different scanners and gives CT-based radiomics an advantage. Furthermore, our model is more readily deployable because it uses routinely performed pre-radiation therapy CT scans. The use of contrast-enhanced scans in our study may also provide additional textural features [52].
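For reference, the physical interpretation mentioned above follows from the standard definition of the Hounsfield scale, in which the linear attenuation coefficient of a voxel, written here as \mu, is expressed relative to those of water and air:

```latex
\mathrm{HU} = 1000 \times \frac{\mu - \mu_{\mathrm{water}}}{\mu_{\mathrm{water}} - \mu_{\mathrm{air}}}
```

By construction, water maps to 0 HU and air to -1000 HU on any calibrated scanner, which underlies the cross-scanner consistency of CT voxel values.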
There are several limitations to this study. All segmentations were performed by a single radiation oncologist, which may introduce bias, although they were performed without knowledge of the pathologic outcome of the patient. To account for intra-rater variation in contouring, we mimicked contouring variation by dilating and eroding the contours from the single radiation oncologist. This is described in detail in the Supplementary Method. As this was an exploratory study, a retrospective methodology was used, the sample size was small and the study was conducted in a single centre with no external validation cohort. Although we used nested cross-validation, which is more rigorous than a single hold-out internal test set, this is not as rigorous as external validation. Finally, we recognize that different institutions may use different software platforms, making it difficult to compare or reproduce results. We recommend standardizing the radiomics workflow across institutions, using commercially available software and avoiding in-house applications.
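The idea of perturbing a contour by dilation and erosion can be sketched on a small 2D binary mask. This is an illustration only: the actual procedure is the one described in the Supplementary Method, and the one-pixel 4-neighbourhood structuring element, the 2D grid and the function names used here are assumptions.

```python
# 4-connected neighbourhood (up, down, left, right); an assumed structuring element.
NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def dilate(mask):
    """Grow the mask by one pixel: a pixel is set if it or any 4-neighbour is set."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] or any(
                0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                for dy, dx in NEIGHBOURS
            ):
                out[y][x] = 1
    return out


def erode(mask):
    """Shrink the mask by one pixel: a pixel survives only if it and all of its
    4-neighbours are set (pixels outside the image count as background)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] and all(
                0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                for dy, dx in NEIGHBOURS
            ):
                out[y][x] = 1
    return out


def area(mask):
    """Number of foreground pixels."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    # Toy contour: a 3x3 square inside a 5x5 image.
    contour = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    print(area(erode(contour)), area(contour), area(dilate(contour)))
```

Features that change little between the eroded, original and dilated versions of a contour can be regarded as robust to modest contouring differences; features that change markedly are candidates for exclusion.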
There are two main recommended ways our model can be used in the real-world setting. At the outset, discussions about more intensified neoadjuvant treatment or the possibility of a ‘watch-and-wait’ approach post-neoadjuvant treatment can be better guided using both the NAR and survival models. The NAR model can additionally be used in prospective studies or trials investigating a new neoadjuvant treatment, especially when there is difficulty in recruiting participants. The model can be used to predict the NAR score for the included patients, forming the control arm. The same cohort of patients will then undergo the experimental treatment and obtain a final NAR score, forming the experimental arm. Comparisons of the NAR score can then be made between the two groups. Future work also calls for external validation and collaboration among institutions to create a large annotated dataset to facilitate the establishment of reliable radiomics models. Further evaluation in randomized clinical trials, followed by implementation within treatment planning systems in radiation oncology to better personalize treatments, should be considered.