Radiomics machine learning study with a small sample size: single random training-test set split may result in unreliable results

Objective: To determine how the estimated performance of a machine learning model varies according to how a dataset is split into training and test sets, using brain tumor radiomics data under different conditions. Materials and Methods: Two binary tasks with different levels of difficulty ('simple' task, glioblastoma [GBM, n=109] vs. brain metastasis [n=58]; 'difficult' task, low-grade [n=163] vs. high-grade [n=95] meningiomas) were performed using radiomics features from magnetic resonance imaging (MRI). For each of the 1,000 different training-test set splits with a ratio of 7:3, a least absolute shrinkage and selection operator (LASSO) model was trained by 5-fold cross-validation (CV) in the training set and tested in the test set. Model stability and performance were evaluated according to the number of input features (from 1 to 50), the sample size (full vs. undersampled), and the level of difficulty. In addition to 5-fold CV without repetition, three other CV methods were compared: 5-fold CV with 100 repetitions, nested CV, and nested CV with 100 repetitions. Results: The highest mean cross-validated area under the receiver operating characteristic curve (AUC) and the highest stability (lowest AUC difference between training and testing) were achieved with 6 and 13 features for the GBM and meningioma tasks, respectively. For the simple task, simple task with undersampling, difficult task, and difficult task with undersampling, average mean AUCs were 0.947, 0.923, 0.795, and 0.764, and average AUC differences between training and testing were 0.029, 0.054, 0.053, and 0.108, respectively. Among the four CV methods, the most conservative (i.e., lowest AUC and highest relative standard deviation [RSD]) was nested CV with 100 repetitions. Conclusions: A single random split of a dataset into training and test sets may lead to an unreliable report of model performance in radiomics machine learning studies; reporting the mean and standard deviation of model performance metrics by performing nested and/or repeated CV on the entire dataset is suggested.


Introduction
Since the advent of precision or personalized medicine, machine learning has attracted great interest as a framework that could revolutionize how we identify the best diagnosis and treatment for an individual patient. Machine learning research has expanded rapidly in many fields including radiomics, a method that extracts and uses a large number of features from medical images to uncover disease characteristics that fail to be appreciated by the naked eye. 1 According to the PubMed database with the search term "(machine learning OR deep learning) AND radiomics", the number of published papers per year increased from 2 to 308 between 2015 and 2019. 2 Machine learning is highly 'data hungry'; for a machine learning model to be robust, it may require millions of observations to reach acceptable performance levels. 3 Radiomics-based machine learning studies, however, are often conducted with a limited number of samples, especially when dealing with rare diseases.
Nonetheless, many published papers have reported fair or good performance. However, the reproducibility of machine learning, which has been an intensely debated topic in science for the past few decades, is a major concern: machine learning algorithms have a large number of parameters to train or set manually, and their training typically involves a great deal of randomness, all of which pose unique challenges to reproducibility. 4,5 Thus, caution should be taken when interpreting a reported model performance, as it may be overoptimistic, especially in the absence of external validation. 6,7 As preliminary experiments, we performed two binary classification tasks, one of which was to differentiate meningioma grades using radiomics features. Single-institutional data were randomly split into training and test sets. In the training set, 5-fold cross-validation (CV) was performed for hyperparameter optimization. The optimized model showed a cross-validated area under the receiver operating characteristic curve (AUC) of 0.820 in the training set and 0.804 in the test set. However, in another trial with the same data and methods, we failed to reproduce the results; the cross-validated AUC in the training set was 0.840, but the AUC in the test set was far worse, 0.650.
In fact, not every condition was identical in the above two trials; the compositions of the training and test sets were different because the dataset was split 'randomly'. In our experience, the results often varied greatly depending on how a dataset was split into training and test sets, especially with a small sample size or a difficult task with suboptimal model performance. Previous studies have also suggested that model performance measured with a single split of training and test sets may be over-optimistic. 8-10 However, to the best of our knowledge, the impact of random dataset splitting on the reliability of model performance in radiomics machine learning studies has not been investigated.
Therefore, the purpose of this study was to determine how the estimated performance of a machine learning model varies according to how a dataset is split into training and test sets, using real-world brain tumor radiomics data under different conditions: the number of input features (from 1 to 50), the sample size (full vs. undersampled), and the level of difficulty of the machine learning task (simple vs. difficult).

Subjects
This retrospective study was approved by the Institutional Review Board, and the requirement for informed consent was waived.
Two binary classification tasks with different levels of difficulty were performed using radiomics features extracted from brain magnetic resonance imaging (MRI) as input data. The first task was a relatively 'simple' task of differentiating between glioblastoma (GBM) and single metastasis, with reported accuracies of up to 89%. 11,12 The first dataset consisted of 167 adult patients with a single GBM (n=109) or brain metastasis (n=58) that was pathologically confirmed following brain MRI from January 2014 to December 2017. The second task was a 'difficult' task of differentiating between low- and high-grade meningioma, with reported accuracies of less than 76% by conventional MRI. 13,14 The second dataset consisted of 258 adult patients with a low-grade (n=163) or high-grade (n=95) meningioma diagnosed from February 2008 to September 2018.
Both datasets were from the same tertiary academic hospital, and subsets of these patients were used in our previous reports. 11,15 MRI acquisition, image preprocessing, and radiomics feature extraction are described in the Supplementary Appendix.

Random split of datasets into training and test sets
The dataset was randomly split into training and test sets with a ratio of 7:3, while maintaining the proportions of the two classes. To examine how the results change according to the data composition produced by a random training-test set split, the split was repeated 1,000 times by changing the random state number from 0 to 999. By setting a random state number, the same results can be reproduced even though the split is called 'random' (see the Discussion for a more detailed explanation).
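The splitting procedure can be sketched as follows. This is a minimal illustration of the stated design (stratified 7:3 splits, random states 0 to 999), not the exact study code; the synthetic data stand in for the radiomics features and class labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the radiomics feature matrix and binary tumor labels
X, y = make_classification(n_samples=167, n_features=50, weights=[0.65, 0.35],
                           random_state=0)

# 1,000 stratified 7:3 splits, one per random state number
splits = [train_test_split(X, y, test_size=0.3, stratify=y, random_state=seed)
          for seed in range(1000)]  # each item: X_train, X_test, y_train, y_test
```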

Feature selection
In the 1,000 different training sets, radiomics features were repeatedly selected based on the coefficients or feature importances of four machine learning models: least absolute shrinkage and selection operator (LASSO), linear support vector machine (SVM), adaptive boosting, and random forest. The frequency with which each feature was selected across the 1,000 trials was calculated for each model, and the frequencies were averaged over the four models. The top k features were then selected in descending order of average frequency, where k was a hyperparameter.
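A minimal sketch of this frequency-based ranking is given below; it is not the authors' code. It assumes that 'selected' means being among the top-ranked features by absolute coefficient or impurity-based importance in each trial, and that the LASSO classifier is implemented as L1-penalized logistic regression. It reuses the hypothetical `X` and `splits` from the previous sketch.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

factories = [
    lambda: LogisticRegression(penalty="l1", solver="liblinear"),  # LASSO-type classifier
    lambda: LinearSVC(C=1.0, max_iter=5000),                       # linear SVM
    lambda: AdaBoostClassifier(random_state=0),                    # adaptive boosting
    lambda: RandomForestClassifier(random_state=0),                # random forest
]

def selection_frequency(factory, splits, n_features, top=20):
    """Fraction of the 1,000 training sets in which each feature is 'selected'."""
    counts = np.zeros(n_features)
    for X_train, _, y_train, _ in splits:
        model = factory().fit(X_train, y_train)
        scores = (np.abs(model.coef_).ravel() if hasattr(model, "coef_")
                  else model.feature_importances_)
        counts[np.argsort(scores)[::-1][:top]] += 1
    return counts / len(splits)

# Average the per-model frequencies, then rank features by average frequency
avg_freq = np.mean([selection_frequency(f, splits, X.shape[1]) for f in factories], axis=0)
top_features = np.argsort(avg_freq)[::-1]  # take the first k entries for a given k
```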

Model stability and performance
We investigated how model stability (i.e., the degree to which results change with the random training-test set split) and model performance are affected by the number of input features, the sample size, and the task difficulty. In this study, we used LASSO, one of the least flexible algorithms, for machine learning in order to minimize the effect of model selection on the results. For each of the 1,000 different training-test set splits, a LASSO model was trained with hyperparameters optimized by 5-fold CV without repetition in the training set and then tested in the test set, and the mean cross-validated AUC and the test AUC were calculated. In this part of our experiments, a model was considered more stable when the difference between the mean CV AUC and the test AUC was smaller.
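The per-split training and evaluation step could look like the following sketch; the hyperparameter grid, the use of L1-penalized logistic regression as the LASSO classifier, and the helper name `fit_and_test` are our assumptions for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV

def fit_and_test(X_train, y_train, X_test, y_test):
    """Tune by 5-fold CV on the training set, then evaluate once on the test set."""
    search = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1, 10, 100]},
        scoring="roc_auc", cv=5)
    search.fit(X_train, y_train)
    cv_auc = search.best_score_                                   # mean cross-validated AUC
    test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
    return cv_auc, test_auc, test_auc - cv_auc                    # small gap = stable model
```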

Number of input features
The entire process of 1,000 training-test trials was repeated while increasing k (i.e., the number of input features) from 1 to 50 in increments of 1. Based on the results of this experiment, the optimal number of features that achieved the best performance and stability was determined for each of the two tasks and was used for the subsequent analyses.
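Combining the previous sketches, the sweep over k might look as follows; this again is a hedged illustration reusing the hypothetical `splits`, `top_features`, and `fit_and_test` defined above.

```python
import numpy as np

results = {}
for k in range(1, 51):
    cols = top_features[:k]                      # k most frequently selected features
    cv_aucs, gaps = [], []
    for X_train, X_test, y_train, y_test in splits:
        cv_auc, test_auc, gap = fit_and_test(X_train[:, cols], y_train,
                                             X_test[:, cols], y_test)
        cv_aucs.append(cv_auc)
        gaps.append(abs(gap))
    # mean cross-validated AUC (performance) and mean CV-test gap (stability)
    results[k] = (np.mean(cv_aucs), np.mean(gaps))
```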

Sample size and task difficulty
For each trial of the training-test set splits, the average of, and the difference between, the mean CV AUC and the test AUC were calculated for the simple (GBM) task and the difficult (meningioma) task. In addition, to determine the effect of sample size, 50% of the GBM and meningioma datasets were randomly sampled (with a random state number of 2020), and all the processes were repeated on the undersampled datasets.
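The 50% undersampling could be reproduced with a stratified subsample, for example as below; the assumption that the subsampling preserved the class proportions is ours.

```python
from sklearn.model_selection import train_test_split

# Keep a stratified 50% subsample of the full dataset, using the stated random state
X_half, _, y_half, _ = train_test_split(X, y, train_size=0.5, stratify=y,
                                        random_state=2020)
# The splitting, feature selection, and LASSO pipeline is then rerun on X_half, y_half
```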

Visualization of the effect of randomly splitting training-test sets
We attempted to visualize how the composition of the training and test sets determined by a random split can affect the fitting and evaluation of machine learning models. Of the 1,000 random dataset splits, three trials from the meningioma task were selected as representative cases: two trials in which the training and test sets showed significant mismatch and one trial in which the two sets showed similar compositions. On a two-dimensional feature space defined by the two most robust radiomics features (k = 2), each case was plotted in a different color according to its class, and a decision boundary was drawn, along with the cross-validated mean AUC in the training set and the AUC in the test set.
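A plot of this kind could be generated roughly as follows; the helper name and plotting details are ours, and the linear decision boundary comes from an L1-penalized logistic regression fitted on the two selected features.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

def plot_split(X2_train, y_train, X2_test, y_test, title=""):
    """Fit a linear classifier on two features and plot both sets with its boundary."""
    clf = LogisticRegression(penalty="l1", solver="liblinear").fit(X2_train, y_train)
    x_min, x_max = X2_train[:, 0].min() - 1, X2_train[:, 0].max() + 1
    y_min, y_max = X2_train[:, 1].min() - 1, X2_train[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    fig, axes = plt.subplots(1, 2, figsize=(9, 4), sharex=True, sharey=True)
    for ax, (X2, y, name) in zip(axes, [(X2_train, y_train, "training set"),
                                        (X2_test, y_test, "test set")]):
        ax.contourf(xx, yy, zz, alpha=0.2)                   # decision regions
        ax.scatter(X2[:, 0], X2[:, 1], c=y, edgecolor="k")   # samples colored by class
        ax.set_title(name)
    fig.suptitle(title)
    plt.show()
```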

Comparison of CV methods
In addition to 5-fold CV without repetition, three other CV methods were conducted and compared: 5-fold CV with 100 repetitions, nested CV, and nested CV with 100 repetitions (Fig. 1). In CV with n repetitions, the CV is repeated n times, with the data reshuffled each time, for each combination of hyperparameters. Nested CV has an inner CV loop nested within an outer CV loop; the inner loop is responsible for model selection and hyperparameter tuning, while the outer loop is used for error estimation. 16 In addition to assessing model performance by AUC, model stability was compared using the relative standard deviation (RSD), which is calculated by dividing the standard deviation (SD) of a group of values by their average.
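As an illustration, nested CV with 100 repetitions and the RSD metric could be implemented along these lines in scikit-learn; the fold/repetition setup and hyperparameter grid are assumptions, and the non-nested variants simply report the inner grid search's best score instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     StratifiedKFold, cross_val_score)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)           # tuning
outer_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=100, random_state=0)  # error estimation

model = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    scoring="roc_auc", cv=inner_cv)

aucs = cross_val_score(model, X, y, scoring="roc_auc", cv=outer_cv)
rsd = 100 * aucs.std() / aucs.mean()
print(f"nested CV with 100 repetitions: AUC {aucs.mean():.3f}, RSD {rsd:.1f}%")
```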
All analyses were performed using Python 3 with scikit-learn 0.23.2 or R 4.0.2. The 95% confidence interval (CI) of the AUC was estimated by the DeLong method. A difference between two groups was considered statistically significant when the two-sided p-value from a t-test was less than 0.05. The Benjamini-Hochberg procedure was used to correct for multiple comparisons.
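For example, the multiple-comparison step could be carried out as below; the pairwise grouping of AUC distributions and the synthetic values are purely illustrative assumptions, not study data.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Hypothetical AUC distributions for the four CV methods (illustrative values only)
rng = np.random.default_rng(0)
auc_by_method = [rng.normal(m, 0.05, 100) for m in (0.80, 0.79, 0.78, 0.77)]

# Two-sided t-tests for each pair of methods, then Benjamini-Hochberg correction
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
p_values = [ttest_ind(auc_by_method[i], auc_by_method[j]).pvalue for i, j in pairs]
rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```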

Results
Model stability and performance

Number of input features
Averaging the results from the 1,000 different training sets revealed that, as the number of input features increased, the mean cross-validated AUC increased at first but began to decrease at some point, more distinctly for the simple GBM task than for the difficult meningioma task (Fig. 2). Without averaging, however, in some trials the mean AUC did not decrease despite the increase in the number of features but rather plateaued, whereas in other trials the mean AUC decreased more steeply and plateaued at a lower level (Fig. 2, upper panels). Consequently, the deviation of mean AUCs across the different training-test set splits showed the reverse pattern: as the number of features increased, the average difference between the mean CV AUC in the training set and the AUC in the test set decreased until it reached its lowest point around the optimal feature number, and then increased again (Fig. 2, lower panels). In our datasets, the curves reached the highest AUC and the greatest stability when the number of features was 6 for the GBM task and 13 for the meningioma task, and these were used as the optimal feature numbers for the subsequent experiments.

Sample size and task difficulty
For the simple task, the overall model performance was better and more stable than for the difficult task, as indicated by higher AUCs and smaller AUC differences between CV and testing. Within the same task, undersampling the dataset diminished both overall model performance and stability (Fig. 3). For the simple task, simple task with undersampling, difficult task, and difficult task with undersampling, average mean AUCs were 0.947, 0.923, 0.795, and 0.764, and average AUC differences between training and testing were 0.029, 0.054, 0.053, and 0.108, respectively.

Visualization of the effect of randomly splitting training-test sets
Depending on which samples comprised the training or test set, AUCs in CV and in testing varied widely.
Consequently, in some trials there were significant discrepancies between the expected and actual model performance; three representative trials for each task are summarized in Table 1.
The mechanism behind the discrepancy is demonstrated in Figure 4, which depicts three representative results chosen from the 1,000 trials of classifying low- vs. high-grade meningioma using two radiomics features. The first one (Trial no. 602) is a case where the mean AUC in CV was low (0.612 [SD, 0.044]) but the AUC in testing was high (0.893 [95% CI, 0.814-0.972]); the fitted linear decision boundary did not separate the two classes well in the training set, but when applied to the test set it performed better because samples from each class were located more often on the correct sides of the boundary in the feature space (Fig. 4, right panel). The second one (Trial no.

Comparison of CV methods
The estimated AUC was highest with CV without repetition, followed by CV with 100 repetitions, nested CV without repetition, and nested CV with 100 repetitions, in decreasing order. The RSD followed the reverse order. Therefore, the most conservative method (i.e., lowest AUC and highest RSD) was nested CV with 100 repetitions, although the differences were not statistically significant (p > 0.326) among the latter three methods, namely CV with repetitions and nested CV with or without repetitions (Table 2). However, in cases of severe mismatch in performance between training and testing, none of the CV methods reduced the gap between the AUC estimated by CV and the test AUC; the results of example trials are summarized in Table 3.

Discussion
The results of this study demonstrate that the model performance and optimal hyperparameters estimated by cross-validation in a training set may vary considerably according to the composition of the dataset, especially when the sample size is too small or the task is too difficult to tackle with the available data. In other words, if we were allowed to choose the distribution of samples in the feature spaces of the training and test sets, we could obtain what we think would be ideal results, a practice also known as 'cherry picking'.
A single random training-test set split allows for such cherry picking. Researchers may first find a random state number that produces the best possible training-test set split and then fix that number to reproduce the same results. Although we call it 'random', the split we perform on computers is not truly random. In fact, nothing a correctly functioning computer does is random, because computers are deterministic devices that always behave predictably by design. Instead, to produce results that are 'random enough', computers use a 'pseudorandom' number generator, a mathematical algorithm that simulates randomness. 17 Thus, given the same seed, a computer can reproduce the exact same results, which is essential to many computing and academic applications, including reproducible research. Therefore, the real-world performance of a machine learning model should be validated using an independent external validation set. However, despite the importance of external validation for generalizing study results, it is often infeasible in radiomics machine learning studies; a recent report showed that external validation was missing in 81.8% of radiomics studies published in high-impact journals. 18 Thus, for a limited single-institutional dataset that allows only internal validation, an appropriate strategy is necessary to produce reliable and stable machine learning results.
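A toy example of this determinism: with the same seed, a pseudorandom split is reproduced exactly (the names and data here are purely illustrative).

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)
split_a = train_test_split(data, test_size=0.3, random_state=42)
split_b = train_test_split(data, test_size=0.3, random_state=42)
# The same seed yields byte-for-byte identical 'random' splits
assert all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
```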
Several methods have been proposed to obtain more generalizable and reliable estimates of model performance. 8,16,19,20 However, a significant mismatch between the training and test sets is difficult to overcome by any method, because computer algorithms, unlike humans, cannot make a correct inference when a new datapoint is very different from, or far away in the feature space from, the datapoints used for training. Thus, to reduce the risk of reporting over-optimistic results in machine learning studies without an external validation set, we suggest that a metric of model stability such as the RSD also be reported following nested and/or repeated CV on the entire dataset, instead of a single random training-test set split.
Our results regarding the optimal number of input features, task difficulty, and sample size can be better understood by imagining an n-dimensional feature space, although we presented only two-dimensional spaces here for the sake of visualization (Fig. 4). If we pick two random points in the two-dimensional unit square, the Euclidean distance between them is roughly 0.52 on average. In the three-dimensional unit cube, the average distance is about 0.66, and in the 10-dimensional unit hypercube it increases to approximately 1.27 (while the longest possible distance, the diagonal, grows to √10 ≈ 3.16). 21 Therefore, in a high-dimensional space, datapoints lie much farther apart than our intuition from real-world space suggests. A larger number of input features means a higher-dimensional feature space. Similarly, a classification task is harder when the datapoints belonging to the same class are not close together in the feature space but are instead scattered and mixed with datapoints of the other class. Lastly, the distances between datapoints tend to be greater with a smaller sample size. Thus, the impact of which datapoints are sampled (i.e., the result of random training-test set splitting) on the overall results can be larger with a larger number of input features, a harder task, and a smaller sample size, because under these circumstances the feature space is more sparsely populated.
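These average distances can be checked quickly by Monte Carlo simulation (a small illustrative script, not part of the study analysis):

```python
import numpy as np

# Average Euclidean distance between two uniformly random points in the unit hypercube
rng = np.random.default_rng(0)
for dim in (2, 3, 10):
    a = rng.random((100_000, dim))
    b = rng.random((100_000, dim))
    print(dim, round(np.linalg.norm(a - b, axis=1).mean(), 3))  # ~0.521, ~0.662, ~1.268
```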
In conclusion, a single random split of a dataset into training and test sets may lead to an unreliable report of model performance in radiomics machine learning studies, especially with a small sample size or a task that is difficult for the available data. Thus, in a radiomics machine learning study with a limited number of samples or without an independent external validation set, we suggest reporting the mean and standard deviation of model performance metrics obtained by performing nested and/or repeated CV on the entire dataset, rather than simply reporting the results of a single random training-test set split.

Abbreviations
AUC = area under the receiver operating characteristic curve; CI = confidence interval; LASSO = least absolute shrinkage and selection operator; MRI = magnetic resonance imaging; RSD = relative standard deviation; SVM = support vector machine

Declarations
This study was approved by the Yonsei University Health System Institutional Review Board.

RSD = relative standard deviation; CV = cross-validation; GBM = glioblastoma.
Table 3. Comparison of the four cross-validation methods in discrepant training/test sets for each task