As previously stated, manual segmentation is currently the gold standard for volumetric assessment. Numerous efforts have been made to improve volumetric assessment and segmentation of the tumoral lesion in the preoperative context (Baid et al., 2021; Eijgelaar et al., 2020; Fyllingen et al., 2016; Kommers et al., 2021).
Available automatic algorithms were developed mainly for preoperative images; this results in low reliability for postoperative assessment (Zeppa et al., 2020).
The practical reason behind this is the intrinsic difficulty of postoperative MRI segmentation (Baid et al., 2021; Cordova, Schreibmann, Hadjipanayis, Guo, Shu, et al., 2014). In fact, the RC is frequently a source of artifacts in the MRI because of blood residuals and air bubbles (Ermiş et al., 2020b; Visser et al., 2019). In addition, brain anatomy may be partly altered as a consequence of surgery, postsurgical edema, and the tumor itself (Visser et al., 2019). These problems, in addition to logistical issues concerning regular post-surgical follow-up, lower the accuracy of available algorithms in postoperative MRI evaluation (Chang et al., 2019). Nevertheless, some recent studies reported good accuracy in postoperative MRI segmentation, although it is still far from the level of accuracy achieved in preoperative evaluation (Chang et al., 2019; Cordova, Schreibmann, Hadjipanayis, Guo, Shu, et al., 2014).
Another limitation of the available algorithms is that they are often trained on curated and standardized datasets that do not include low-quality images. Although this selection makes the training process easier, the resulting models do not transfer as easily to real-world clinical practice. In fact, suboptimal data quality is very common in clinical practice, including non-volumetric scans, missing sequences, and artifacts (Ermiş et al., 2020a).
In this study, we aimed to train an AI algorithm for the postoperative MRI evaluation of glioblastoma in order to prospectively introduce this tool into clinical practice as support for the decision-making process. For this reason, the MRI database used for training is representative of the real-world clinical scenario and frequently includes heterogeneous and incomplete data. We did not apply restrictive inclusion criteria concerning the quality of the available data, so as not to introduce selection bias into the results.
Low-quality images were also included, especially non-volumetric MRIs. Non-volumetric images are related to old acquisition protocols, but they are still relevant in the clinical scenario, accounting for almost 25% of the MRIs collected in the Molinette database. As a consequence, these data were not excluded from the study, as excluding them would limit the prospective application of the algorithm in clinical practice.
Moreover, postoperative images have different acquisition times given the time-course of the disease and the treatment schedule. This means that the postoperative MRI database contains images from different points in time: immediate postoperative, before and after adjuvant treatment, and regular follow-up. As a result, the algorithm is exposed to different biological entities such as post-surgical residual, RC, progressively growing lesion, and edema.
The results achieved in this study are similar to those reported in other studies, for both preoperative (mean DS: 91.09 ± 0.60) and postoperative (mean DS: 72.31 ± 2.88) evaluation. These results show that accuracy in the postoperative setting is still far from that in the preoperative scenario. The gap is especially marked for RC segmentation, with a mean DS of 63.52 ± 8.90. The RC gives rise both to over-segmentation, in which adjacent regions are included, and to under-segmentation, in which some parts of the cavity are excluded. Nevertheless, evaluation of the RC is complex and yields less accurate results even for expert human operators performing manual segmentation.
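To make the reported metric concrete, the following is a minimal sketch of how a Dice score between a predicted and a reference mask can be computed. It assumes that DS denotes the Dice similarity score and that masks are flattened to sequences of 0/1 values; the function name and the toy masks are illustrative only and are not taken from the study. The function returns a value in [0, 1], which would be multiplied by 100 to match the scale used above.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Dice similarity between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled as foreground in both masks.
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * overlap / total


# Hypothetical toy masks (not study data).
reference = [0, 1, 1, 1, 1, 0, 0, 0]
over_seg = [1, 1, 1, 1, 1, 1, 0, 0]   # also includes adjacent voxels
under_seg = [0, 1, 1, 0, 0, 0, 0, 0]  # misses part of the structure

print(dice_score(over_seg, reference))
print(dice_score(under_seg, reference))
```

In this toy case both kinds of error lower the score, which is why a structure with irregular borders such as the RC is penalized whether the prediction spills into adjacent regions or omits part of the cavity.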
The level of accuracy reached in this study was improved by the application of various computational strategies. In particular, TL, data augmentation, cross-validation, and an ensemble of models aggregated through the STAPLE algorithm compensate for the limited amount of data.
Another challenge in applying automatic segmentation in clinical practice is the variable number of sequences available. IMT is a technique that uses information from existing sequences to create the missing ones, but it is still at an experimental level. In this study, the IMT architecture of Osman and Tamam (2022a) was applied to T1ce sequences to create T1 and T2 whenever they were not available in the Molinette database. In preoperative segmentation, even though these sequences are nonessential, they improve the performance of the algorithm. In our study, we did not observe any benefit associated with IMT, contrary to what previous literature suggests (Yang et al., 2020). However, it is possible that larger or more diverse datasets could improve the quality of the synthesized images, especially in the postoperative setting.
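The ensemble aggregation mentioned above can be illustrated with a simplified sketch of binary STAPLE, which uses expectation-maximization to estimate each model's sensitivity and specificity and weights its vote accordingly. This is not the implementation used in the study: it assumes binary masks flattened to 0/1 sequences, a single global prior, a fixed number of iterations, and hypothetical names (`staple_binary`, `model_a`, and so on).

```python
from typing import List, Sequence


def staple_binary(masks: Sequence[Sequence[int]], iterations: int = 20) -> List[float]:
    """Simplified binary STAPLE: returns per-voxel foreground probabilities."""
    n_raters, n_voxels = len(masks), len(masks[0])
    # Global prior: overall fraction of foreground votes (an assumption of this sketch).
    prior = sum(sum(m) for m in masks) / (n_raters * n_voxels)
    sens = [0.99] * n_raters  # initial sensitivity per model
    spec = [0.99] * n_raters  # initial specificity per model
    weights = [prior] * n_voxels
    eps = 1e-6  # keeps estimates away from exactly 0 or 1
    for _ in range(iterations):
        # E-step: probability that each voxel is truly foreground.
        for i in range(n_voxels):
            a, b = prior, 1.0 - prior
            for j in range(n_raters):
                if masks[j][i]:
                    a *= sens[j]
                    b *= 1.0 - spec[j]
                else:
                    a *= 1.0 - sens[j]
                    b *= spec[j]
            weights[i] = a / (a + b) if a + b > 0 else 0.0
        # M-step: re-estimate each model's sensitivity and specificity.
        fg = sum(weights)
        bg = n_voxels - fg
        for j in range(n_raters):
            if fg > 0:
                tp = sum(w for w, d in zip(weights, masks[j]) if d)
                sens[j] = min(max(tp / fg, eps), 1.0 - eps)
            if bg > 0:
                tn = sum(1.0 - w for w, d in zip(weights, masks[j]) if not d)
                spec[j] = min(max(tn / bg, eps), 1.0 - eps)
    return weights


# Hypothetical outputs of three ensemble members on eight voxels.
model_a = [1, 1, 1, 0, 0, 0, 0, 0]
model_b = [1, 1, 1, 0, 0, 0, 0, 0]
model_c = [1, 1, 0, 1, 0, 0, 0, 0]  # disagrees on two voxels

probabilities = staple_binary([model_a, model_b, model_c])
consensus = [1 if p >= 0.5 else 0 for p in probabilities]
print(consensus)
```

Unlike plain majority voting, this scheme down-weights a member whose output repeatedly disagrees with the estimated consensus, which may matter when the ensemble is small.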
LIMITATIONS OF THE STUDY
Institutional studies with private datasets are essential to scientific and computational research, but they have some limitations (Petrick et al., 2013). The literature reports that models developed and tested with data from one collection rarely achieve similar results when applied to data from a different institution (Wei et al., 2019). It is therefore advisable to corroborate the results of this study with multi-institutional data, thereby increasing their reliability.
In addition, several studies highlight that reference standards based on the expertise of radiologists are not completely objective (Revesz et al., 1983). It is reported that at least three operators should perform the segmentation (Petrick et al., 2013). In this study, the manual segmentation was performed by four neurosurgeons and one medical student, and was revised by a senior neurosurgeon and a neuroradiologist in order to mitigate interobserver variability.
A further limitation of this work is the final post-processing pipeline used to restore the labels of the postoperative tumoral segments (edema, enhancing tumor, and resection cavity). Even though the parameters are obtained by averaging grid-search outputs, the limited amount of data decreases the reliability of these values. Indeed, since the network is less confident in its postoperative predictions than in the preoperative cases, it is plausible that the prediction confidence does not exceed the chosen threshold for some MRI scans, leading to imprecise segmentations.
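The failure mode described above can be illustrated with a purely hypothetical sketch. The actual post-processing pipeline and its parameters are not detailed in this section, so the label names, the per-fold threshold values, and the rule "keep the most probable label only if its probability reaches the threshold" are all assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, List, Sequence

# Hypothetical label names for the postoperative segments.
LABELS = ("edema", "enhancing_tumor", "resection_cavity")


def average_threshold(per_fold_best: Sequence[float]) -> float:
    """Average the best threshold found by grid search on each fold."""
    return mean(per_fold_best)


def assign_labels(probs: Sequence[Dict[str, float]], threshold: float) -> List[str]:
    """Keep the most probable label per voxel only if it reaches the threshold."""
    labels = []
    for voxel in probs:
        best = max(LABELS, key=lambda name: voxel.get(name, 0.0))
        labels.append(best if voxel.get(best, 0.0) >= threshold else "background")
    return labels


# Hypothetical grid-search results from three folds (not study values).
threshold = average_threshold([0.55, 0.60, 0.65])

voxels = [
    {"edema": 0.05, "enhancing_tumor": 0.05, "resection_cavity": 0.90},  # confident
    {"edema": 0.30, "enhancing_tumor": 0.25, "resection_cavity": 0.45},  # uncertain
]
print(assign_labels(voxels, threshold))
```

In this toy case the second voxel is most likely resection cavity but is discarded because its confidence falls below the averaged threshold. With few scans available, the averaged threshold itself is an unstable estimate, so such omissions become more likely.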
As the training phase influences the outcomes of the algorithm, quality assessment of the MRIs used in this step would be helpful. Moreover, results would be more accurate if the T2-FLAIR sequence were always volumetric. However, the purpose of this work was to avoid selection bias in the data in order to obtain an algorithm that is reliable in clinical practice. For this reason, possible improvement depends on more accurate MRI acquisition protocols in common clinical practice and not on image selection for research studies.
FUTURE PERSPECTIVES
Thanks to the computational tools and strategies adopted, our results are in line with the existing literature on this topic. Unlike previous studies, this work is not biased by restrictive inclusion/exclusion criteria for MRI scans. Therefore, we present this work as a starting point for applying AI to clinical practice for glioblastoma with remarkable reliability in both the preoperative and postoperative contexts.
Future studies should involve multiple institutions, allowing for an increase in the overall sample size of the database and in the number of glioblastoma postoperative MRIs acquired with different protocols and scanners. Moreover, experimental techniques such as IMT could be refined, adding greater support to the algorithm. The elimination of non-volumetric scans and low-quality imaging from clinical practice would be essential not only for research purposes but also for future clinical application of AI technologies. All of these initiatives may improve the performance of the AI algorithm and lead to clinically reliable use of AI in glioblastoma evaluation.
Finally, working with AI requires both specialized technical competence and a comprehensive view of the clinical scenario. Thus, it is advisable to address the current biological, clinical, logistical, and technical limitations from a multidisciplinary point of view. This outlook highlights the importance of clear communication between the neurosurgical team and the engineers in searching for appropriate solutions.