An overview of the workflow used to demonstrate our model reliability assessment method is illustrated in Figure 3. First, we collected pre-treatment CT images and clinical outcomes from a publicly available head-and-neck cancer (HNC) dataset and randomly split the data into training (70%) and testing (30%) cohorts with similar outcome ratios. Second, a radiomic survival model was built to assess distant metastasis-free survival. Third, internal validation datasets (Perturbed-Train and Perturbed-Test) were simulated with perturbations 12. The simulated perturbation datasets were used to extract perturbed radiomic features and to validate the survival model's reliability against randomizations, as shown in Figure 3(b). Finally, the ICC was used to quantify the model's reliability, reflecting the consistency of its predictions on the perturbed data. The experiment was approved by the Department of Health Technology and Informatics, The Hong Kong Polytechnic University. The radiomic survival model is reported in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement 22.
The dataset, Head-Neck-PET-CT 14, was obtained from The Cancer Imaging Archive 23. It consists of 298 patients with head-and-neck squamous cell carcinoma (HNSCC) and a median follow-up of 43 months. The patients were treated at four different centers and received radiation alone (n = 48, 16%) or chemo-radiation (n = 250, 84%) with curative intent. The patients' characteristics and image reconstruction parameters are summarized in Supplementary Tables 1 and 2. Owing to the retrospective nature of the study and the public availability of the dataset, the requirement for informed consent was waived.
The region of interest for feature extraction was the primary gross tumor volume (GTV), which was the primary treatment target of radiation therapy. The GTV is the most reliable region for predictive feature extraction 24 and has been used in several predictive radiomics studies of HNSCC 1,25,26.
Distant metastasis-free survival, defined as the interval from the first day of treatment to the date of the event, was the clinical endpoint in this study to demonstrate the reliability assessment of the radiomic model 27. Previous binary classification models of HNC 25,28 have achieved good prediction results but were limited because time-to-event information was neglected during model development.
Image Preprocessing and Radiomic Feature Extraction
The CT images and their GTV contours were preprocessed before feature extraction to maintain the features' reproducibility and consistency 29,30. First, the GTV contours were interpolated to voxel-based segmentation masks. Second, an isotropic resampler (1 mm × 1 mm × 1 mm) was applied to the images and masks, with B-spline interpolation for the images and nearest-neighbor interpolation for the masks, to enhance the reproducibility of the radiomic features 31. The preprocessing steps were implemented in Python v3.8 using the SimpleITK v1.2.4 32 and OpenCV 33 packages.
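A minimal sketch of the resampling step is shown below, assuming the CT image and voxelized GTV mask are already loaded as SimpleITK images; the helper function and variable names are illustrative and not taken from the study code.

```python
import SimpleITK as sitk

def resample_to_isotropic(image, mask, spacing=(1.0, 1.0, 1.0)):
    # compute the output grid size for the requested isotropic spacing
    new_size = [int(round(sz * sp / ns))
                for sz, sp, ns in zip(image.GetSize(), image.GetSpacing(), spacing)]

    resampler = sitk.ResampleImageFilter()
    resampler.SetOutputSpacing(spacing)
    resampler.SetSize(new_size)
    resampler.SetOutputOrigin(image.GetOrigin())
    resampler.SetOutputDirection(image.GetDirection())

    # B-spline interpolation for the CT image ...
    resampler.SetInterpolator(sitk.sitkBSpline)
    image_iso = resampler.Execute(image)

    # ... and nearest-neighbor interpolation for the binary mask
    resampler.SetInterpolator(sitk.sitkNearestNeighbor)
    mask_iso = resampler.Execute(mask)
    return image_iso, mask_iso
```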
The radiomic features were then extracted using the Pyradiomics v2.2.0 34 package, which is compliant with the Image Biomarker Standardization Initiative 35,36. A total of 5,486 radiomic features were extracted from the GTV of each patient's CT scan. Twelve images were included in the feature extraction: one unfiltered image, three Laplacian-of-Gaussian filtered images (with sigma values of 1 mm, 3 mm, and 6 mm), and eight Coiflet1 wavelet filtered images (LLL, HLL, LHL, LLH, LHH, HLH, HHL, HHH). In addition to the 14 shape features from the GTV segmentation, 18 first-order and 73 second-order features were extracted from the region of interest of each of the 12 images. Re-segmentation to the soft-tissue range (−150 to 180 HU) 12 and discretization with fixed bin counts of 4, 8, 16, 32, 64, and 128 were specified for the texture feature extraction. The detailed feature extraction parameters can be found in Supplementary 3.
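A minimal PyRadiomics configuration restating the settings named above is sketched below; the exact parameter file is given in Supplementary 3, the file paths are hypothetical, and only one bin count is shown (extraction was repeated for bin counts 4 to 128).

```python
from radiomics import featureextractor

settings = {
    'binCount': 32,                  # one of the fixed bin counts 4, 8, ..., 128
    'resegmentRange': [-150, 180],   # soft-tissue HU range
}
extractor = featureextractor.RadiomicsFeatureExtractor(**settings)
extractor.enableAllFeatures()        # shape, first-order, and texture classes
extractor.enableImageTypeByName('LoG', customArgs={'sigma': [1.0, 3.0, 6.0]})
extractor.enableImageTypeByName('Wavelet', customArgs={'wavelet': 'coif1'})

# hypothetical file paths for one patient's resampled image and GTV mask
features = extractor.execute('ct_image.nii.gz', 'gtv_mask.nii.gz')
```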
Patients were randomly assigned to the training and testing cohorts (70/30 split) with stratification by distant metastasis status 6,37. The data in the training cohort were used for feature selection and subsequent model training, while the data in the testing cohort were used to evaluate the model’s performance.
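A minimal sketch of the stratified split, assuming a pandas DataFrame df with a hypothetical binary distant-metastasis column dm_status:

```python
from sklearn.model_selection import train_test_split

# 70/30 split stratified by distant metastasis status
train_df, test_df = train_test_split(
    df, test_size=0.30, stratify=df['dm_status'], random_state=42)
```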
A filter-based feature selection method was adopted in our analysis 38. This process has two steps: feature–outcome relevance filtering and feature–feature redundancy filtering. Identifying the most relevant and least redundant features is a common practice in radiomics studies, regardless of the evaluation metric 39.
Relevance filtering. Relevance filtering aims to identify the radiomic features that are correlated with the outcomes 25. First, the outcome relevance of each feature was repeatedly evaluated by log-rank test p-values under downsample bootstrapping (imbalanced-learn 0.8.0 40) without replacement over 100 iterations on the training dataset. Downsampling can be used to capture useful information in an imbalanced dataset 41. Second, features with p-values below 0.1 were recorded in each iteration and ranked by their selection frequencies, and the top 10% most frequently selected features were retained.
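A sketch of this loop is shown below, using imbalanced-learn's RandomUnderSampler as the downsampler and the lifelines log-rank test. df_train is an assumed DataFrame holding the radiomic features plus time and event columns, y_train is the binary distant-metastasis label, and dichotomizing each feature at its median is an illustrative choice, since the study only states that per-feature log-rank p-values were computed.

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from lifelines.statistics import logrank_test

feature_cols = [c for c in df_train.columns if c not in ('time', 'event')]
counts = Counter()

for seed in range(100):  # 100 downsampling iterations without replacement
    df_sub, _ = RandomUnderSampler(random_state=seed).fit_resample(df_train, y_train)
    for feat in feature_cols:
        high = df_sub[feat] > df_sub[feat].median()
        p = logrank_test(df_sub.loc[high, 'time'], df_sub.loc[~high, 'time'],
                         df_sub.loc[high, 'event'], df_sub.loc[~high, 'event']).p_value
        if p < 0.1:
            counts[feat] += 1                     # feature passes the 0.1 threshold

n_keep = int(0.10 * len(feature_cols))            # keep the top 10% most frequent features
selected = [f for f, _ in counts.most_common(n_keep)]
```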
Redundancy filtering. Redundancy filtering aims to remove features that are correlated with each other 42. First, feature pairs with Pearson correlation coefficients higher than 0.6 were identified. Then, within each such pair, the feature with the higher mean correlation coefficient with the remaining features was removed. The removal of these redundant features should improve the predictive ability of the classifiers 43.
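This rule can be sketched as follows, where X_sel is an assumed DataFrame holding only the relevance-filtered features of the training cohort:

```python
# absolute Pearson correlations between the remaining features
corr = X_sel.corr(method='pearson').abs()
mean_corr = corr.mean()          # mean correlation of each feature with the others
to_drop = set()

for i, fi in enumerate(corr.columns):
    for fj in corr.columns[i + 1:]:
        if corr.loc[fi, fj] > 0.6:
            # within a highly correlated pair, drop the feature with the
            # higher mean correlation coefficient
            to_drop.add(fi if mean_corr[fi] > mean_corr[fj] else fj)

X_final = X_sel.drop(columns=list(to_drop))
```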
To build the survival model, the optimal features were identified using backward recursive feature elimination based on the penalized Cox proportional hazards model 44. The feature subset that maximized the validation concordance index (C-index) under repeated three-fold cross-validation in the training set was retained. After identifying the optimal features, a penalized Cox proportional hazards survival model was built for distant metastasis-free survival. The model's hyperparameter was fine-tuned with five-fold cross-validation to maximize the C-index. Finally, the model's performance was evaluated on the training and testing cohorts.
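A sketch of the hyperparameter tuning step is given below, using lifelines as an assumed stand-in (the study does not name the survival modeling package); df_model is the training DataFrame with the selected features plus time and event columns, and the penalty grid is hypothetical.

```python
import numpy as np
from lifelines import CoxPHFitter
from lifelines.utils import k_fold_cross_validation

best_penalizer, best_cindex = None, -np.inf
for penalizer in [0.01, 0.1, 1.0, 10.0]:          # hypothetical search grid
    cph = CoxPHFitter(penalizer=penalizer)
    scores = k_fold_cross_validation(
        cph, df_model, duration_col='time', event_col='event',
        k=5, scoring_method='concordance_index')
    if np.mean(scores) > best_cindex:
        best_penalizer, best_cindex = penalizer, float(np.mean(scores))

# refit on the full training cohort with the tuned penalty
final_model = CoxPHFitter(penalizer=best_penalizer)
final_model.fit(df_model, duration_col='time', event_col='event')
print(final_model.concordance_index_)             # training C-index
```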
This section describes the method to evaluate the model reliability using perturbations and the workflow shown in Figure 3(b). First, the internal validation datasets were simulated with the perturbations by adding plausible randomizations to the original images and segmentations. Second, the survival model was evaluated using both the perturbed training and testing data. Third, the model reliability against simulated randomization was quantified using the reliability index ICC.
Validation Data Simulation
The internal validation datasets were simulated using the perturbation method 12,45. For each perturbation, the image and mask were translated and rotated simultaneously by a random amount to mimic variations in patient positioning during imaging. Then, a random Gaussian noise field was added to the image to mimic noise-level variations between different image acquisitions 46. The detailed perturbation parameters are presented in Supplementary Table 4. Next, the GTV mask was also perturbed by a randomly generated deformable vector field to simulate uncertainties in inter-observer delineations of the same target 47. In total, 60 sets of perturbed images and contours were simulated, and the corresponding radiomic features were extracted as the internal validation sets used to evaluate the model reliability under randomization.
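A sketch of a single perturbation is given below, covering the random rigid transform and Gaussian noise; the translation, rotation, and noise magnitudes are placeholders (the study's actual ranges are in Supplementary Table 4), and the random deformable vector field applied to the contour is omitted for brevity.

```python
import numpy as np
import SimpleITK as sitk

def perturb_once(image, mask, rng):
    # random rigid transform about the image centre, applied to image and mask
    transform = sitk.Euler3DTransform()
    transform.SetCenter(image.TransformContinuousIndexToPhysicalPoint(
        [s / 2.0 for s in image.GetSize()]))
    angles = rng.uniform(-np.pi / 36, np.pi / 36, 3)               # placeholder: up to ±5 degrees
    transform.SetRotation(float(angles[0]), float(angles[1]), float(angles[2]))
    transform.SetTranslation([float(v) for v in rng.uniform(-2.0, 2.0, 3)])  # placeholder: ±2 mm

    image_p = sitk.Resample(image, image, transform, sitk.sitkBSpline, -1000.0)
    mask_p = sitk.Resample(mask, mask, transform, sitk.sitkNearestNeighbor, 0)

    # random Gaussian noise to mimic acquisition noise differences (placeholder SD)
    image_p = sitk.AdditiveGaussianNoise(image_p, standardDeviation=10.0)
    return image_p, mask_p
```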
The model performance was validated and reported on the original and perturbed datasets using the C-index as the evaluation metric. Two observations may warrant attention. First, the model performance consistency between the original and perturbed datasets might be a qualitative indicator of model performance reliability against the simulated randomizations. Second, the model performance variance with perturbed datasets may reflect the model’s sensitivity to slight fluctuations. A qualitative assessment of model reliability could be performed by comparing the model performance on the original and perturbed data.
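To illustrate this qualitative check, the C-index can be computed on the original test cohort and on each perturbed test set and the spread compared; the snippet below reuses the fitted final_model from the earlier sketch, and perturbed_test_sets is an assumed list of 60 DataFrames holding the perturbed features plus time and event columns.

```python
import numpy as np
from lifelines.utils import concordance_index

def cindex(df):
    # higher partial hazard means higher risk, so negate it for the C-index
    risk = final_model.predict_partial_hazard(df)
    return concordance_index(df['time'], -risk, df['event'])

c_original = cindex(df_test)
c_perturbed = np.array([cindex(d) for d in perturbed_test_sets])
print(c_original, c_perturbed.mean(), c_perturbed.std())   # consistency and spread
```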
Model Reliability Quantification
In addition to the qualitative analysis of model reliability, a quantification metric, the ICC, was proposed to evaluate model reliability under randomization. The ICC is often used as a reliability index for inter-rater reliability analysis 48, and several radiomic studies have used this measure to quantify feature reproducibility 13,49,50.
The model reliability ICC reflects the extent to which the model's predictions can be replicated. We aimed to determine whether the model predictions could be reproduced after adding plausible randomizations to the images and segmentations, both for the same patient and across the entire dataset. As each perturbed dataset was simulated randomly and the model was expected to yield an identical outcome, a one-way random-effects, absolute-agreement, single-rater ICC, ICC(1, 1), was calculated to quantify the model's reliability, with patients as the subjects and perturbations as the raters 48. ICC values range between 0 and 1, with values closer to 1 representing more robust reliability. Typically, ICC values less than 0.5, between 0.5 and 0.75, between 0.75 and 0.9, and greater than 0.9 indicate poor, moderate, good, and excellent reliability, respectively 48.
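A short sketch of the computation with pingouin (an assumed package choice; the study does not name its implementation), where long_df is a long-format table with one row per patient–perturbation pair and the model's predicted risk score:

```python
import pingouin as pg

icc_table = pg.intraclass_corr(data=long_df, targets='patient',
                               raters='perturbation', ratings='risk_score')
# 'ICC1' corresponds to the one-way random-effects, single-rater ICC(1, 1)
icc_1_1 = icc_table.loc[icc_table['Type'] == 'ICC1', 'ICC'].item()
print(icc_1_1)
```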
Model Reliability Validation
To validate the model reliability quantification, the same experiment was repeated using only highly reliable features (ICC > 0.75). This validation aimed to verify the sensitivity of the ICC to changes in the reliability of the model inputs. An increase in feature reliability was expected to increase the model ICC.