Subjects
This retrospective study was approved by the Institutional Review Board, which waived the requirement for informed consent.
Two binary classification tasks of different difficulty were performed using radiomics features extracted from brain magnetic resonance imaging (MRI) as input data. The first was a relatively ‘simple’ task of differentiating between glioblastoma (GBM) and single metastasis, with reported accuracies of up to 89%.11,12 The first dataset consisted of 167 adult patients with a pathologically confirmed single GBM (n=109) or brain metastasis (n=58) following brain MRI from January 2014 to December 2017. The second was a ‘difficult’ task of differentiating between low- and high-grade meningioma, with reported accuracies below 76% on conventional MRI.13,14 The second dataset consisted of 258 adult patients with a low-grade (n=163) or high-grade (n=95) meningioma diagnosed from February 2008 to September 2018. Both datasets came from the same tertiary academic hospital, and subsets of these patients were used in our previous reports.11,15 MRI acquisition, image preprocessing, and radiomics feature extraction are described in the Supplementary appendix.
Random split of datasets into training and test sets
The dataset was randomly split into training and test sets at a ratio of 7:3 while maintaining the proportions of the two classes (a stratified split). To examine how the results change with the data composition produced by a random training-test split, the split was repeated 1,000 times, varying the random state number from 0 to 999. Although the split is called ‘random’, fixing the random state number makes the results exactly reproducible (see Discussion for a more detailed explanation).
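A minimal sketch of this splitting procedure with scikit-learn, assuming a radiomics feature matrix X and a binary label vector y (both names are ours, not from the original pipeline):

```python
from sklearn.model_selection import train_test_split

# X: radiomics feature matrix, y: binary class labels (assumed variables)
splits = []
for seed in range(1000):  # random state numbers 0 to 999
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.3,      # 7:3 training-test ratio
        stratify=y,         # preserve class proportions in both sets
        random_state=seed,  # fixing the seed makes the 'random' split reproducible
    )
    splits.append((X_train, X_test, y_train, y_test))
```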
Feature selection
In each of the 1,000 training sets, radiomics features were selected based on the coefficients or feature importances of four machine learning models: least absolute shrinkage and selection operator (LASSO), linear support vector machine (SVM), adaptive boosting, and random forest. For each model, the frequency with which each feature was selected across the 1,000 trials was calculated, and these frequencies were averaged over the four models. The top k features were then selected in descending order of the average frequency, where k was a hyperparameter.
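A hedged sketch of the frequency-based ranking for one of the four models (LASSO); the regularization strength and the use of nonzero coefficients as the selection criterion are our assumptions, and the tree-based models would rank by feature_importances_ instead:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

counts = np.zeros(X.shape[1])
for X_train, X_test, y_train, y_test in splits:
    X_scaled = StandardScaler().fit_transform(X_train)
    model = Lasso(alpha=0.01).fit(X_scaled, y_train)  # alpha=0.01 is an assumed value
    counts += (model.coef_ != 0)  # a feature counts as 'selected' if its coefficient is nonzero

frequency = counts / len(splits)  # per-feature selection frequency for this model
# In the study, the frequencies from all four models are averaged before ranking
k = 10  # example value; k is a tuned hyperparameter
top_k_idx = np.argsort(frequency)[::-1][:k]  # top k features by average frequency
```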
Model stability and performance
We investigated how model stability, that is, the degree to which results change across random training-test splits, as well as model performance, is affected by the number of input features, sample size, and task difficulty. We used LASSO, one of the least flexible machine learning algorithms, to minimize the effect of model selection on the results. For each of the 1,000 training-test splits, a LASSO model was trained on the training set with hyperparameters optimized by 5-fold cross-validation (CV) without repetition and then evaluated on the test set, and the mean cross-validated area under the receiver operating characteristic curve (AUC) and the test AUC were calculated. In this part of the experiments, a model was considered more stable when the difference between the mean CV AUC and the test AUC was smaller.
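A sketch of this train-and-test loop; we read ‘LASSO’ for binary classification as L1-penalized logistic regression in scikit-learn, and the hyperparameter grid is an assumption:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

cv_aucs, test_aucs = [], []
for X_train, X_test, y_train, y_test in splits:
    search = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear"),  # LASSO-type classifier
        param_grid={"C": np.logspace(-3, 2, 20)},              # assumed grid
        scoring="roc_auc",
        cv=5,                                                  # 5-fold CV without repetition
    )
    search.fit(X_train[:, top_k_idx], y_train)
    cv_aucs.append(search.best_score_)  # mean CV AUC of the selected hyperparameters
    scores = search.predict_proba(X_test[:, top_k_idx])[:, 1]
    test_aucs.append(roc_auc_score(y_test, scores))

# Smaller gap between mean CV AUC and test AUC = more stable model
gaps = np.abs(np.array(cv_aucs) - np.array(test_aucs))
```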
Number of input features
The process of 1,000 repetitions of training and testing was itself repeated while increasing k (i.e., the number of input features) from 1 to 50 in steps of 1. Based on the results of this experiment, the number of features that achieved the best performance and stability was determined for each of the two tasks and was used in the subsequent analyses.
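Schematically, this sweep simply wraps the loop above; run_trials is a hypothetical helper that would return the mean CV AUC, mean test AUC, and mean gap over the 1,000 splits for a given feature subset:

```python
# Sweep the number of input features k from 1 to 50
results = {}
for k in range(1, 51):
    top_k_idx = np.argsort(frequency)[::-1][:k]
    results[k] = run_trials(splits, top_k_idx)  # hypothetical helper; see the sketch above
```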
Sample size and task difficulty
For each training-test split, the average of the mean CV AUC and the test AUC, and the difference between them, were calculated for the simple (GBM) task and the difficult (meningioma) task. In addition, to determine the effect of sample size, 50% of the GBM and meningioma datasets were randomly sampled (with a random state number of 2020), and all the above processes were repeated on these half-sized datasets.
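One plausible way to draw the stratified 50% subsample (using train_test_split as a subsampler is our choice, not stated in the original):

```python
from sklearn.model_selection import train_test_split

# Keep a stratified 50% subsample of the dataset, discarding the other half
X_half, _, y_half, _ = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=2020
)
```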
Visualization of the effect of randomly splitting training-test sets
We attempted to visualize how the composition of the training and test sets determined by a random split can affect the fitting and evaluation of machine learning models. Of the 1,000 random splits, three trials from the meningioma task were selected as representative cases: two trials in which the training and test sets were markedly mismatched and one trial in which their compositions were similar. In the two-dimensional feature space spanned by the two most robust radiomics features (k = 2), each case was plotted in a different color according to its class, and the decision boundary was drawn, along with the mean CV AUC in the training set and the AUC in the test set.
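A generic sketch of such a decision-boundary plot, where X2_train holds the training cases restricted to the two selected features, y_train the classes, and model is a fitted classifier (all names assumed):

```python
import matplotlib.pyplot as plt
import numpy as np

# Evaluate the fitted classifier on a dense grid covering the feature space
xx, yy = np.meshgrid(
    np.linspace(X2_train[:, 0].min(), X2_train[:, 0].max(), 200),
    np.linspace(X2_train[:, 1].min(), X2_train[:, 1].max(), 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]
zz = model.predict_proba(grid)[:, 1].reshape(xx.shape)

plt.contour(xx, yy, zz, levels=[0.5])                   # decision boundary at p = 0.5
plt.scatter(X2_train[:, 0], X2_train[:, 1], c=y_train)  # cases colored by class
plt.xlabel("feature 1"); plt.ylabel("feature 2")
plt.show()
```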
Comparison of CV methods
In addition to 5-fold CV without repetition, three other CV methods were performed and compared: 5-fold CV with 100 repetitions, nested CV, and nested CV with 100 repetitions (Fig. 1). In CV with n repetitions, the CV is rerun n times with the data reshuffled each time, for every combination of hyperparameters. Nested CV places an inner CV loop within an outer CV loop; the inner loop is responsible for model selection and hyperparameter tuning, while the outer loop estimates the generalization error.16 In addition to assessing model performance by AUC, model stability was compared using the relative standard deviation (RSD), calculated by dividing the standard deviation (SD) of a group of values by their mean.
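A sketch of how these CV schemes compose in scikit-learn, reusing the assumed estimator and grid from above; wrapping the tuning loop in cross_val_score yields the nested structure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     StratifiedKFold, cross_val_score)

# 5-fold CV with 100 repetitions: the CV is rerun with reshuffled folds
# for every hyperparameter combination
repeated_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=100, random_state=0)
inner = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": np.logspace(-3, 2, 20)},  # assumed grid, as above
    scoring="roc_auc",
    cv=repeated_cv,
)

# Nested CV: the tuning loop above becomes the inner loop, and an outer
# loop estimates the generalization error of the whole procedure
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(inner, X_train, y_train, cv=outer, scoring="roc_auc")

rsd = aucs.std() / aucs.mean()  # relative standard deviation (SD divided by mean)
```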
All analyses were performed using Python 3 with scikit-learn 0.23.2 or R 4.0.2. The 95% confidence interval (CI) of the AUC was estimated by the DeLong method. A difference between two groups was considered statistically significant when the two-sided p value from a t-test was less than 0.05, and the Benjamini-Hochberg procedure was used to correct for multiple comparisons.
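A sketch of the group comparison and multiple-comparison correction (groups_a and groups_b are assumed containers of the paired value sets being compared); the DeLong CI itself is typically obtained in R, e.g., with the pROC package, and is not shown:

```python
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Two-sided independent-samples t-test for each pairwise comparison
p_values = [ttest_ind(a, b).pvalue for a, b in zip(groups_a, groups_b)]

# Benjamini-Hochberg (FDR) correction at a significance level of 0.05
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```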