Comparison of radiomic pre-processing steps in the reproducible prediction of disease free survival across multi-scanners/centers

Feature reproducibility and model generalizability are currently among the most important limitations when integrating radiomics into the clinic. Radiomic features are sensitive to imaging acquisition protocols, to reconstruction algorithms and parameters, and to the different steps of the usual radiomics workflow. We propose a framework for comparing the reproducibility of different pre-processing steps in PET/CT radiomic analysis in the prediction of disease free survival (DFS) across multi-scanners/centers. We evaluated and compared the prediction performance of several models that differ in i) the type of intensity discretization, ii) the feature selection method, and iii) the feature type, i.e., original or tumour-to-liver ratio radiomic features (OR or TLR). We trained our models using data from one scanner/center and tested them on two external scanners/centers. Our results show low reproducibility of predictions across scanners and discretization methods. Despite this, TLR-based models were generally more robust than OR-based models. Maximum relevance minimum redundancy (MRMR) forward feature selection with Pearson correlation had the best mean area under the precision-recall curve for two of the four classifiers when applied to the features from all fixed-bin-number discretizations combined (D_All_FBN) with TLR features. We compared predictions

the features type, i.e., original radiomics (OR) or TLR radiomics. Models were trained on the data from scanner A and then evaluated independently on data from the two remaining external scanners. Model performance was evaluated using the area under the precision-recall curve (AUCpr) (Fig. 1).
Statistical and ML analyses were performed using R software, version 4.0.1.
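As an illustration of the AUCpr metric, the following minimal Python sketch (not the R code used in the study) computes the step-wise area under the precision-recall curve, often called average precision. The choice of this particular estimator is an assumption, since the text does not state how the area was computed, and the function name `average_precision` is hypothetical.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    y_true: list of 0/1 labels (1 = event, e.g. recurrence).
    scores: list of predicted scores, higher = more likely positive.
    Tied scores are ranked in input order (a simplification).
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive case is required")
    # Rank cases from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_pos += 1
            # Each positive raises recall by 1/n_pos; the step is
            # weighted by the precision reached at that rank.
            area += (true_pos / rank) / n_pos
    return area
```

A model that ranks every positive case above every negative case obtains a value of 1.0, whereas a random ranking tends towards the prevalence of the positive class, which is why this metric is informative for imbalanced outcomes.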

Segmentation and interpolation
PET images from scanners B and C were interpolated using the research toolbox (Oncoradiomics SA, Liège, Belgium), up-sampling or down-sampling the images using a linear method, so that all datasets had isotropic voxels of 4×4×4 mm³ (i.e., the voxel size in images of scanner A). The 3D primary tumour volumes were segmented in the [18F]FDG PET images using the 2-class semi-automatic Fuzzy Local Adaptive Bayesian algorithm [24]. Volumes of 20 cm³ in the liver were manually drawn in order to investigate the predictive value of TLR radiomic features, as explained below. All segmentations were reviewed and edited if needed by one nuclear medicine physician with 9 years of experience in clinical PET/CT.

Radiomic features
We extracted two hundred and fifteen features from the segmented volumes, which included first-order grey level statistics, geometry, fractals, texture matrix based features and others. Features were extracted using the Oncoradiomics research toolbox, and their detailed description can be found in the supplementary data of our previous study [23]. All features were calculated according to the Imaging Biomarker Standardization Initiative (IBSI) [25]. We also studied the ratio of the feature values calculated in the tumour and in the liver (the TLR versions of the features), except for the shape features, as done in our previous study [23]. We hypothesized that TLR features may reduce the variability of radioactive dose uptake between patients and across centers by normalizing the radiomic features using the liver, which is an organ with a homogeneous and reproducible uptake. There were no missing data for any patient.
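The TLR transformation can be sketched as a feature-by-feature ratio. The Python snippet below is a minimal illustration under stated assumptions, not the toolbox implementation: the function name, the `TLR_` prefix and the feature names in the usage note are hypothetical, and shape features are simply skipped because the text does not say how they were handled afterwards.

```python
def tlr_features(tumour, liver, shape_features=()):
    """Return tumour-to-liver ratios for every non-shape feature.

    tumour, liver: dicts mapping feature name -> value, computed in the
    tumour volume and in the liver reference volume respectively.
    shape_features: names excluded from the ratio (assumption: skipped).
    """
    ratios = {}
    for name, value in tumour.items():
        if name in shape_features:
            continue  # shape features are not ratioed
        denominator = liver[name]
        # Guard against division by zero in the reference organ.
        if denominator == 0:
            ratios["TLR_" + name] = float("nan")
        else:
            ratios["TLR_" + name] = value / denominator
    return ratios
```

For example, with hypothetical values `tumour = {"suv_mean": 8.0, "volume": 30.0}` and `liver = {"suv_mean": 2.0, "volume": 20.0}`, calling `tlr_features(tumour, liver, shape_features={"volume"})` returns `{"TLR_suv_mean": 4.0}`.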

Radiomic features intensity discretization
Image intensities were discretized using the two schemes currently standardized by the IBSI: fixed bin number (FBN, with 32 and 64 bins) and fixed bin width (FBW, with 4 different widths of 0.05, 0.1, 0.2 and 0.5 Standardized Uptake Values) [25]. These two sets of features were considered either alone or by: 1) joining the features from all bin widths or from all bin numbers (D_All_FBW, D_All_FBN); 2) combining the four bin widths of the FBW method, or the two bin numbers of the FBN method, by taking the median value of each feature (D_Med_FBW, D_Med_FBN).
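The two discretization schemes and the two combination strategies can be sketched as follows. This is a minimal Python illustration, not the toolbox code: the bin formulas follow the usual IBSI definitions, the FBW lower bound of 0 SUV is an assumption, and all function names and the setting labels used as name suffixes are hypothetical.

```python
import math
from statistics import median


def discretize_fbn(values, n_bins):
    """Fixed bin number: rescale intensities into n_bins bins (1..n_bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1] * len(values)
    bins = []
    for x in values:
        b = math.floor(n_bins * (x - lo) / (hi - lo)) + 1
        bins.append(min(b, n_bins))  # the maximum intensity falls in the last bin
    return bins


def discretize_fbw(values, bin_width, lower_bound=0.0):
    """Fixed bin width: bins of constant width starting at lower_bound.

    lower_bound = 0 SUV is an assumption made for this sketch.
    """
    return [math.floor((x - lower_bound) / bin_width) + 1 for x in values]


def combine_all(feature_sets):
    """D_All: join features from every setting, tagging each name."""
    joined = {}
    for setting, features in feature_sets.items():
        for name, value in features.items():
            joined[f"{name}_{setting}"] = value
    return joined


def combine_median(feature_sets):
    """D_Med: per-feature median across the discretization settings."""
    names = next(iter(feature_sets.values())).keys()
    return {n: median(f[n] for f in feature_sets.values()) for n in names}
```

Note that D_All multiplies the number of candidate features by the number of settings, whereas D_Med keeps the feature count unchanged; both avoid having to pick a single bin width or bin number in advance.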

Feature selection, classifiers and model selection
We applied 7 different FS methods to identify the 5 most relevant features: 1) accuracy decrease obtained from the embedded FS of the random forest (RF) classifier; 2) Gini impurity decrease obtained from the embedded FS of the RF classifier; 3) forward FS using the maximum relevance minimum redundancy (MRMR) method with Pearson correlation; 4) backward FS using MRMR with Pearson correlation; 5) forward FS using MRMR with Spearman correlation; 6) backward FS using MRMR with Spearman correlation; 7) forward MRMR based on mutual information (MI). We also considered 4 ML classifiers: RF, support vector machine (SVM) with radial kernel, Naïve Bayes (NB) and logistic regression (LR) [26-28]. For each classifier, we used the default hyperparameter values of its respective R package. We used 5-fold cross-validation in the training data to internally validate the models and to select those with the best predictions for each classifier independently. Additionally, models were trained using all the training data and then tested on the two external data sets.
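To make the forward MRMR idea concrete, the following Python sketch greedily selects features by maximizing relevance (absolute Pearson correlation with the outcome) minus redundancy (mean absolute Pearson correlation with the features already selected). This difference criterion is one common MRMR variant and is an assumption here; the exact criterion and the R package used in the study are not specified in the text, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no linear information
    return sxy / math.sqrt(sxx * syy)


def mrmr_forward(features, target, k=5):
    """Greedy forward MRMR selection (difference criterion).

    features: dict mapping feature name -> list of values per patient.
    target: list of outcome values per patient (e.g. 0/1).
    Returns the names of up to k selected features, in selection order.
    """
    relevance = {n: abs(pearson(v, target)) for n, v in features.items()}
    remaining = list(features)
    selected = []
    while remaining and len(selected) < k:

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                abs(pearson(features[name], features[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this scheme a feature that is almost a copy of an already selected one is penalized even when it is individually predictive, which is the intended behaviour when many radiomic features are strongly correlated. Swapping `pearson` for a rank-based correlation would give the Spearman variant.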
A paired Wilcoxon Rank Sum test was used to test whether the predictions for each discretization scheme were statistically significantly different from each other in the two external validation schemes. Wilcoxon Rank Sum tests were considered significant if p < 0.05. The Holm-Bonferroni method was used to correct for multiple hypothesis testing.
Table 1 depicts the mean AUCpr across the three validation schemes, i.e., i) internal validation using 5-fold cross-validation on scanner A, ii) external validation using scanner B, and iii) external validation using scanner C. The table shows the mean AUCpr of the models using the RF, SVM, LR and NB classifiers with the different FS methods applied to the OR and TLR features. Additionally, the results shown for the standard FBW and FBN discretization schemes correspond to the model with the discretization width/bin number that had the best AUCpr in the internal validation scheme. The AUCpr values of the three validation schemes individually are shown in the supplementary material.
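The Holm-Bonferroni step-down correction can be written in a few lines. The Python sketch below is illustrative only (the study used R) and takes the raw p-values as given; the function name and the example p-values in the usage note are hypothetical and are not results from the study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni correction.

    Returns (adjusted_p_values, significant_flags) in the input order.
    The smallest p-value is multiplied by m, the next by m - 1, and so
    on; a running maximum keeps the adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    # Significance threshold follows the text: adjusted p < alpha.
    return adjusted, [p < alpha for p in adjusted]
```

With hypothetical raw p-values `[0.01, 0.04, 0.03]`, the adjusted values are approximately `[0.03, 0.06, 0.06]`, so only the first comparison remains significant at the 0.05 level.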

Results
Our results showed low reproducibility between scanners. The discretization scheme with the highest AUCpr in the internal validation scheme was D_Med_FBN combined with TLR features; this was not the case for the two external scanners (Supplementary material).
Mean AUCpr between the three validation schemes using the four classifiers (RF, SVM, LR and NB) and the seven FS methods represented in the columns and discretization schemes in the rows. The feature discretizations were applied to the OR and TLR features.
Despite the models' low reproducibility, D_All_FBN with TLR features was the model with the best mean AUCpr for the LR and NB classifiers (0.57 and 0.58, respectively). It was also the model with the second-highest overall AUCpr in the two independent scanners, with AUCpr of 0.45 and 0.7 in scanners B and C, respectively (Fig. 2). For the RF classifier, the model with the highest AUCpr was D_Med_FBN TLR, whereas for SVM it was FBW TLR. Regarding the FS method, when combined with D_All_FBN TLR, MRMR forward with Pearson correlation was the optimal FS method for at least one of the four classifiers in all validation schemes. MRMR forward with Pearson correlation was also the FS method with the best mean AUCpr for 4 of the 6 discretization schemes when using TLR features. The discretization schemes with the highest mean AUCpr across classifiers were D_Med_FBN, D_All_FBW and D_All_FBN, all of them used with TLR features (Fig. 3). Despite the good performance of D_All_FBN with TLR, the only discretization schemes whose predictions were statistically significantly different from those of D_All_FBN TLR were FBW with OR features, and FBW, FBN, D_Med_FBW, D_Med_FBN and D_All_FBW with TLR features. FBW combined with OR features was the only discretization scheme statistically different from all the others (Table 2). TLR-based models had higher mean AUCpr values than OR-based models for all the classifiers and discretization schemes, except when using D_Med_FBW with LR or FBW and D_Med_FBN with NB (Fig. 3). The predictions of TLR-based models in the two external validation schemes were statistically significantly different from those of all the OR-based models (p-value << 0.05).
We compared the prediction performance of numerous models that differed in their discretization and FS methods. This comparison was carried out within the context of predicting DFS from [18F]FDG PET images in a multi-scanner/center [18F]FDG PET cohort of LACC patients.
Additionally, we investigated the effect of feature transformation using feature ratios with a reference organ, and we combined it with the previous pre-processing steps. We also compared the models' performances using four classifiers. Multi-classifier radiomics predictive models, ensemble classifiers, or the combination of different classifiers' performances to measure feature importance consistently tend to outperform traditional single-classifier approaches [29][30][31]. Moreover, the choice of classification method is one of the most dominant sources of performance variation in radiomics studies [12]. For this reason, we believe that comparing the results of multiple classifiers is needed when evaluating the robustness of the workflow. The discretization scheme is one of the factors that affect radiomic feature reproducibility. FBW discretization in PET has been recommended [18] [25], although some studies have also reported more favourable properties using FBN [32]. This is because FBN and FBW have different drawbacks and advantages. FBW preserves the relationship between PET units and the corresponding physical substrate, in contrast to images with arbitrary units (such as in some non-quantitative magnetic resonance imaging sequences). FBN, on the other hand, does not preserve this relationship but introduces a normalization effect that can be favourable when contrast is considered important or when the original image intensity value does not have a 'meaning'. In our study, D_All_FBN and FBW, both combined with TLR features, were the discretization schemes that showed the best AUCpr in the two external scanners. As shown in our study, combining features discretized with different widths/bin numbers can introduce complementary information and be a simpler and more reproducible strategy, as it also avoids the uncertain assumption of, or the extensive search for, the optimal discretization width/bin number.
Combining feature discretization schemes has also been done in previous studies [33][34].
Furthermore, our results show that, when using D_All_FBN with TLR, the FS scheme with the highest mean AUCpr in the three validation schemes is MRMR forward with Pearson correlation for 2 of the 4 classifiers. FS is an effective strategy to improve radiomics-based predictive studies. Different FS strategies are used in radiomics studies, each with its pros and cons, and some are known to work better with certain types of features or classifiers [35].
Finally, we also evaluated the feature type, i.e., OR and TLR radiomics. We have shown in our previous study that using the ratio of the tumour features with a reference organ (TLR radiomics) improves the predictive performance of radiomics models in LACC. In contrast to our previous study, we trained our models using only data from one clinical center/scanner and evaluated them in two external scanners. We reinforce the conclusions of our previous study by observing that all of the most robust models used TLR features instead of OR. This may be due to a normalizing effect on the SUVs of each patient. The importance of data normalization/transformation has also been assessed in other radiomic studies and shown to improve model performance [36-38].
Moreover, feature transformation using an organ of reference has also been investigated by other authors, leading to normalized images and increased reproducibility of radiomic features [39][40].
The results of our study are encouraging and can potentially be used as a first recommendation to improve the reproducibility of radiomics studies across multi-scanners/centers in LACC. However, this is a preliminary study and we still observed performance differences between the 3 scanners: some models work quite well for some scanners, while other models work better for other scanners. This could be caused by the variation in scanner properties, the different number of patients in each scanner, or the variation in tumour recurrence rate for each population.
Bar plot showing the models with the FS that showed the best performance for all intensity discretization schemes and for the four classifiers. The AUCpr corresponds to the mean value of the 3 validation schemes.

Supplementary Files
This is a list of supplementary files associated with this preprint. Supplementarymaterial.pdf