The effect of delineation on radiomic analyses in PET has been extensively investigated. To date, reports regarding reproducibility of predictive performance across various delineation approaches have been inconclusive and often cohort-specific (36). This study investigated the effect of conventional binary as well as fuzzy radiomics in both lesions and their surroundings by predicting clinically relevant end-points in PET and hybrid imaging cancer cohorts. Across three different cohorts, no significant changes were identified either between binary and fuzzy radiomics or between lesion and extended radiomics. Nevertheless, cohort-specific predictive performances demonstrated a diverse pattern.
Specifically, in predicting 3-year survival, the glioma MET-PET cohort achieved its highest AUC (0.77) with extended binary radiomics (Ext-B), which also yielded the highest sensitivity (SNS 0.79 vs. SNS 0.76 for all other delineations). We hypothesize that this phenomenon is due to the infiltrating behavior of glioma (37, 38), which can be better characterized by extended radiomics, resulting in the 3% performance increase.
In the lung 18F-FDG PET/CT cohort, extended binary and fuzzy delineations slightly decreased the predictive performance for 2-year survival. Since the PET acquisitions utilized no motion compensation, the reconstructed PET images were subject to motion artefacts (39–42). Therefore, we argue that the reference delineations in this cohort already overestimated the true metabolic tumor volumes. Consistently, further extending the reference lesions slightly decreased predictive performance. In this cohort, fuzzy radiomics in particular provided no benefit in handling uncertainties at the edges of delineated lesions. This implies that fuzzy radiomics may not be able to counterbalance motion artefact-related smoothing effects, which is plausible, as motion may significantly alter the heterogeneity pattern within tumors, not only at the tumor boundaries (42).
Contrary to the above, the 68Ga PSMA-11 PET/MRI study yielded the highest AUC of 0.84 with the reference fuzzy (Ref-F) delineation against all other delineations (AUC 0.79–0.80), and this delineation also resulted in the most balanced high-ranking feature distributions. We consider the following reasons for this phenomenon. First, this cohort utilized a relatively new hybrid camera system with a high PET target resolution (2.08 x 2.08 x 2.03 mm), and its reference binary delineations relied on full-mount histopathology slices (43, 44). However, delineation was still performed on the PET images. This means that in this cohort, the partial volume effect (PVE) had the most significant contribution to the delineation of prostate lesions (45). Second, cohorts operating with relatively small lesions are more prone to delineation effects than to, e.g., binning effects (46). This is plausible, given that small lesions are also more prone to the PVE (45, 47) and more sensitive to the absence of point-spread function (PSF) modelling (48). The PVE was most prominent in our prostate cohort, as it also had the smallest lesions (average lesion volume: 10.9 cm3 in prostate vs. 113 cm3 in lung and 93 cm3 in glioma), where the Ref-F delineation resulted in a +4% cross-validation AUC. This finding is in line with that of Cysouw et al. (48), who investigated the predictive performance of various delineations in [18F]DCFPyL PET-CT prostate patients in combination with analyzing the effect of partial volume correction. The above findings imply that fuzzy radiomics can be an ideal tool not only to handle delineation uncertainties at lesion edges, but also to model partial volume effects directly in the radiomic calculations themselves. Regardless of lesion size, following EARL guidelines and relying on imaging systems operating with PSF modelling has been shown to generally increase radiomic predictive performance in the context of delineation variations (49, 50).
The aggregate performance analysis across the four delineation methods and cohorts within a common lesion volume range revealed that reference fuzzy (Ref-F) delineations systematically outperformed reference binary (Ref-B) delineations in lesions < 35000 mm3 in all cohorts. However, the glioma cohort already demonstrated superior predictive performance with the extended binary (Ext-B) delineation, even in this small lesion volume range. While disease-specific imaging characteristics (e.g., infiltrating behavior) may influence these results, it is important to emphasize that the three cohorts were delineated by different clinicians; thus, our findings may also be subject to interobserver variability bias. This implies that while fuzzy radiomics on its own has added value compared to conventional binary radiomics, especially in small lesions, future studies should not exclude the analysis of extended fuzzy or binary regions around lesions.
While fuzzy radiomics could naturally model a weighted average of multiple clinician-defined delineations, automated approaches have been repeatedly presented as more robust than manually defined delineations, which are prone to multi-observer variability (12, 36, 51, 52). Recently, novel deep learning (DL) approaches have been reported to provide highly accurate and automated delineation in a wide range of lesion types (53–57). In the context of automated, especially DL, approaches, we wish to emphasize that this study does not promote a particular fuzzy delineation approach, only the concept of incorporating probability weights into standard radiomics calculations. Deep learning is a naturally probabilistic approach; however, its output delineation is routinely dichotomized by a threshold so that the lesions can afterwards be analyzed by conventional radiomics (17, 18, 56). This step introduces uncertainty into the dichotomized delineation mask (58–60) and, overall, results in information loss. Dichotomization not only influences the analyzed lesion boundaries, but may also exclude lesions with relatively lower DL probabilities that may otherwise be important for predicting the given clinical end-point. Fuzzy radiomics, on the other hand, can organically fit the naturally probabilistic output of DL delineation approaches and can minimize the above uncertainties introduced by thresholding.
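As a minimal sketch of the concept of incorporating probability weights into a radiomic calculation, the example below contrasts a membership-weighted first-order mean and variance with their dichotomized counterparts. The voxel values, membership probabilities, the 0.5 threshold, and the helper names are hypothetical and not taken from this study; the fuzzy feature definitions actually used may differ.

```python
from typing import List, Sequence, Tuple


def weighted_mean_var(values: Sequence[float],
                      weights: Sequence[float]) -> Tuple[float, float]:
    """Weighted mean and (population) variance of voxel values.

    With weights restricted to 0/1 this reduces to the conventional
    binary-mask statistics; with weights in [0, 1] each voxel contributes
    in proportion to its membership probability.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var


def dichotomize(probs: Sequence[float], threshold: float = 0.5) -> List[float]:
    """Turn membership probabilities into a binary mask (assumed threshold)."""
    return [1.0 if p >= threshold else 0.0 for p in probs]


# Hypothetical uptake values and lesion-membership probabilities per voxel.
uptake = [8.0, 6.0, 4.0, 2.0]
membership = [1.0, 0.8, 0.4, 0.1]

fuzzy_mean, fuzzy_var = weighted_mean_var(uptake, membership)
binary_mean, binary_var = weighted_mean_var(uptake, dichotomize(membership))

print(f"fuzzy:  mean={fuzzy_mean:.3f}, var={fuzzy_var:.3f}")
print(f"binary: mean={binary_mean:.3f}, var={binary_var:.3f}")
```

In this toy case the two low-probability voxels are discarded entirely by the threshold, whereas the weighted variant still lets them contribute in proportion to their membership, which illustrates the information loss discussed above.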
Further to the above, fuzzy radiomics systematically decreased redundancy across radiomic features in all three involved cohorts by approximately 20%. Due to the naturally high redundancy of various radiomic features (61, 62), they need to undergo redundancy reduction prior to building machine learning models. However, redundancy reduction approaches routinely select, from each cluster of redundant features, the one with the highest variance (2). This does not guarantee that the selected feature is the most predictive. Since fuzzy radiomics decreases redundancy, it may support the identification of precise imaging biomarkers in the future by better discriminating features that are otherwise prone to be redundant. Nevertheless, feature redundancy is a phenomenon that is affected not only by inherently similar radiomic calculations, but also by the volume effect (58, 63), which is feature-specific (15, 29). In this regard, future studies should investigate how fuzzy radiomics contributes to volume effects, given that its contribution to decreasing feature redundancy is significant.
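The variance-based selection described above can be sketched as follows. This is an illustrative greedy variant, not the procedure used in this study: the Pearson correlation criterion, the 0.9 threshold, the feature names, and the toy values are all assumptions, and features are presumed to be on comparable scales so that their variances can be compared.

```python
import math
from typing import Dict, List, Sequence


def variance(x: Sequence[float]) -> float:
    """Population variance of a feature across patients."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either feature is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def reduce_redundancy(features: Dict[str, List[float]],
                      r_threshold: float = 0.9) -> List[str]:
    """Visit features by descending variance and keep a feature only if it
    is not strongly correlated with any feature already kept."""
    ranked = sorted(features, key=lambda n: variance(features[n]), reverse=True)
    kept: List[str] = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) < r_threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature values for four patients.
features = {
    "feat_a": [1.0, 2.0, 3.0, 4.0],
    "feat_b": [2.0, 4.0, 6.0, 8.0],    # redundant with feat_a, higher variance
    "feat_c": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with feat_a and feat_b
}
print(reduce_redundancy(features))
```

Note that the selection never consults the clinical end-point: feat_b is retained over feat_a purely because of its variance, which is why the retained feature is not guaranteed to be the most predictive one.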
The aggregated feature ranking analysis revealed that highest-ranking features coming from only one delineation type do not guarantee an increased predictive performance. This was particularly true for the prostate cohort. This is plausible, since per-delineation feature ranking systematically demonstrated a less balanced feature ranking pattern, which tends to decrease predictive performance (24, 64, 65). In addition, we argue that a feature being identified as high-ranking, and therefore predictive in the context of the given clinical end-point, across multiple delineation types is a clear indication that the given feature is robust against delineation variations.
This study had limitations, most notably that it utilized only single-center cohorts. Nevertheless, the collected cohorts were acquired on different camera systems and relied on various tracers. In addition, this study relied on Monte Carlo cross-validation to estimate the predictive performance of the models built on its delineations and radiomics evaluations, in order to minimize the chance of false discoveries.
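For readers unfamiliar with the scheme, Monte Carlo cross-validation repeatedly draws random training and test subsets and summarizes performance over the repeats. The sketch below only generates such splits; the number of repeats, the test fraction, the seed, and the function name are illustrative assumptions, and any stratification or other settings used in this study are not reproduced here.

```python
import random
from typing import Iterator, List, Tuple


def monte_carlo_splits(n_samples: int,
                       n_repeats: int = 100,
                       test_fraction: float = 0.2,
                       seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for repeated random splits.

    Unlike k-fold cross-validation, every repeat is drawn independently,
    so a sample may appear in the test subset of several repeats.
    """
    rng = random.Random(seed)  # fixed seed for reproducible splits
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


# Hypothetical cohort of 10 patients, 5 repeats, 20% held out per repeat.
splits = list(monte_carlo_splits(10, n_repeats=5, test_fraction=0.2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

A model would be fitted on each training subset and scored on the matching test subset, with the reported performance (e.g., AUC) taken as a summary across repeats.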