Following the retrieval strategy and criteria described above, 218 articles were retrieved, and 68 were included in this study after screening. The inclusion and exclusion flowchart is shown in Figure 1. The area under the curve (AUC) values of the included studies ranged from 0.71 to 0.95, implying moderate to good predictive performance of the radiomics models. The key information of the included studies is summarized in Additional Table 1.
We give an overview of the included articles from four aspects (number of articles, research tasks, research types, and RQS scores) to analyze the current state of radiomics applications in colorectal cancer.
In the field of colorectal cancer, radiomics was mainly used for prognosis (51%), while few studies addressed staging and grading (9%) (Figure 3A). The imaging methods studied can be roughly divided into CT, MRI, PET, PET-CT, UA, and combinations of two imaging methods (Figure 3B). Most studies were based on CT (45%) or MRI (41%), while only a few used PET-CT (3%).
Prospective research may provide stronger clinical evidence for evidence-based medicine and compensate for the shortcomings of retrospective research. However, prospective studies require long follow-up and are subject to many restrictions, so they are difficult to carry out. Most radiomics studies of colorectal cancer were based on retrospective data sets (94%), and only a few were prospective[12–15] (6%).
Multicenter research can better reflect the overall population and allows the generalization ability of a model to be evaluated. Of all the studies, only 7[12, 16–21] (10%) were multicenter studies and 5[22–26] (7%) were dual-center studies.
3.1 RQS score
The included articles were scored using the RQS, and the specific scores are shown in Additional Table 2. The scores ranged from -7 to 28 (-19.44% to 77.78%), with a median of 12 (33.33%); 13 studies scored less than 10%.
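The percentages in parentheses are consistent with dividing each raw score by a maximum RQS total of 36 points. A minimal sketch of that conversion follows; the function name `rqs_percent` is hypothetical, and the 36-point maximum is an assumption inferred from the reported values.

```python
RQS_MAX = 36  # assumed maximum RQS total, consistent with 28 points -> 77.78%


def rqs_percent(score, max_score=RQS_MAX):
    """Express a raw RQS total as a percentage of the maximum score."""
    return round(100.0 * score / max_score, 2)


# Minimum, median, and maximum raw scores reported for the included studies.
for raw in (-7, 12, 28):
    print(raw, rqs_percent(raw))
```

Negative totals are possible because some RQS items subtract points, which is why the lowest percentage is below zero.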
Figure 4 shows the score of each item in the RQS. Sixty-six studies (97%) reduced the dimension of features to lower the risk of overfitting, 53 studies (78%) adopted multiple segmentation, and 65 studies reported the discrimination of the model. However, only 24 reported the calibration of the model. Scores were low for the following six items: (1) phantom studies, (2) multiple time points, (3) prospective studies, (4) open science and data, (5) cut-off analysis, and (6) cost-effectiveness analysis.
Figure 4. Completion rate of 68 studies in RQS.
Only two studies (3%) assessed test-retest reliability. Only five studies disclosed their code and data, and none made ROI-related data available. Most studies selected the cut-off with an optimal-threshold method; only six (9%) used the median as the threshold. Four studies (6%) were prospective, and none conducted a cost-effectiveness analysis of the clinical application.
The repeatability of radiomics features is directly related to the accuracy of the model. Many factors in the radiomics workflow affect the repeatability of the features, such as the scanner[9, 27, 28], acquisition parameters[27, 29-33], preprocessing method[34, 35], segmentation method[36-39], inter/intra-observer variability[33, 34, 36], feature selection method, and modeling method. The factors that may affect repeatability in the radiomics workflow, and possible solutions, are shown in Figure 5.
Figure 5. Radiomics workflow and repeatability. Each step has associated factors that may influence the repeatability of the study. Although modelling affects reproducibility, no solution is available yet.
3.2.1 Intra-individual repeatability
JE van Timmeren, et al. scanned forty patients with rectal cancer twice with the same scanning protocol at a 15-minute interval and used the concordance correlation coefficient (CCC) to assess the agreement between the features of the two scans; 7 of 542 features had a CCC > 0.9, and 9 features had a CCC > 0.85. Therefore, only some features are repeatable over time for the same individual. The “shape” features had the highest repeatability, and the “wavelet” features appeared to be the least reproducible. Certain features are sensitive to organ motion or to expansion or shrinkage of the target volume caused by physiological factors such as respiration, bowel peristalsis, and cardiac activity, so these features show low reproducibility. However, a set of highly reproducible radiomic features can be obtained using test-retest analysis based on phantoms or patients [42, 43].
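The CCC-based screening described above can be sketched as follows. The code computes Lin's concordance correlation coefficient for paired test and retest values and keeps the features that exceed a threshold. The function names, the feature names, and the example values are hypothetical and are not taken from the cited study; the 0.85 threshold mirrors one of the cut-offs mentioned above.

```python
from statistics import fmean


def concordance_cc(x, y):
    """Lin's concordance correlation coefficient for two paired measurement series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    n = len(x)
    mx, my = fmean(x), fmean(y)
    var_x = sum((a - mx) ** 2 for a in x) / n  # population variance of x
    var_y = sum((b - my) ** 2 for b in y) / n  # population variance of y
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mx - my) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both series are constant and equal")
    return 2 * cov_xy / denom


def repeatable_features(scan1, scan2, threshold=0.85):
    """Names of features whose test-retest CCC exceeds `threshold`."""
    return [name for name in scan1
            if concordance_cc(scan1[name], scan2[name]) > threshold]


# Hypothetical feature values for four patients, scanned twice.
scan1 = {"volume": [10.0, 12.0, 15.0, 20.0], "wavelet_x": [0.2, 0.5, 0.1, 0.9]}
scan2 = {"volume": [10.1, 11.9, 15.2, 19.8], "wavelet_x": [0.6, 0.1, 0.8, 0.3]}
print(repeatable_features(scan1, scan2))
```

Unlike the Pearson correlation, the CCC penalizes a systematic shift between the two scans through the squared mean difference in the denominator, so a feature that is perfectly correlated but consistently offset does not reach a CCC of 1.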
However, only 2[44, 45] of the 68 articles carried out test-retest experiments. X Ma, et al. set a minimum intra-class correlation coefficient of 0.6 in the retest analysis to ensure the robustness of the features. J Wang, et al. selected 40 patients with stage II rectal cancer, scanned them twice before treatment using the same scanner and imaging protocol, and then used the Spearman correlation coefficient to select repeatable features. Only these two studies took measures to control for intra-individual repeatability.
3.2.2 Acquisition parameters
There may be differences between different scanners. D Mackin, et al. used a phantom to compare the radiomics features obtained from four CT scanners (GE, Philips, Siemens, and Toshiba) and found differences between scanners. R Berenguer, et al. then used two phantoms (a pelvic phantom and a phantom of different materials) to examine feature differences in intra-CT analysis (differences between CT acquisition parameters) and inter-CT analysis (differences between five scanners), showing that only 71 of 177 features were reproducible. Using hierarchical cluster analysis, the 10 most representative features were selected: “60 Percentile”, “Global Median”, “Global Minimum”, “Kurtosis”, “Mass”, “Volume”, “Roundness”, “Surface Area Density”, “4-Inverse Difference Normalized” and “4-Auto Correlation”. In addition, R Berenguer, et al. reported that the impact of different scanners could be reduced by standardizing the acquisition parameters.
Of the 31 CT-based imaging studies, 5 [25, 45–48] did not provide scanner parameters, and the others neither used consistent scanning parameters nor assessed the impact of scanner differences on feature repeatability. Therefore, it cannot be ruled out that scanner differences affected the results of these studies. Among the 28 MRI-based studies, SP Shayesteh, et al. considered the influence of the scanner and scanning parameters on feature repeatability and used image preprocessing (noise reduction, intensity normalization, and discretization) to reduce the differences. Among the PET-CT-based studies, J Kang, et al. reduced the SUV measurement difference between two scanners to less than 10 percent through regular standardization and quality assurance.
L He, et al. demonstrated that acquisition parameters (slice thickness, convolution kernel, and enhancement) affected the diagnostic performance of radiomics, and that radiomics features based on thin slices (1.25 mm) performed better in differential diagnosis than features based on thick slices (5 mm). A possible reason is that thick slices introduce larger partial volume artifacts. Similarly, L Lu, et al. demonstrated that the values of radiomics features extracted from CT images differ with slice thickness and reconstruction method. Features associated with tumor size, border morphology, low-order density statistics, and coarse texture were more sensitive to variations in acquisition parameters. Subsequently, a more rigorous experiment showed that 63 of 213 features were affected by voxel size; after resampling, 42 of these features improved significantly, while 21 features still changed greatly. Therefore, for image data with different slice thicknesses, resampling may effectively reduce the influence of slice thickness on the repeatability of the study. Of the 68 studies, 9 [16, 21, 23, 51–56] (13%) reduced the effect of slice thickness by resampling.
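The idea of resampling can be illustrated on a one-dimensional intensity profile along the slice axis. The sketch below linearly interpolates samples acquired at one slice spacing onto a finer grid. The function name and the example intensities are hypothetical, and this is a simplified illustration: real studies typically resample full 3D volumes with dedicated imaging libraries, possibly with other interpolators.

```python
def resample_linear(values, spacing, new_spacing):
    """Linearly interpolate samples taken every `spacing` mm onto a `new_spacing` mm grid."""
    if len(values) < 2:
        raise ValueError("at least two samples are needed for interpolation")
    extent = (len(values) - 1) * spacing           # physical length covered, in mm
    n_new = int(extent / new_spacing + 1e-9) + 1   # new samples that fit in that length
    out = []
    for k in range(n_new):
        pos = k * new_spacing / spacing            # position in original index units
        i = min(int(pos), len(values) - 2)         # left neighbour, clamped at the end
        frac = pos - i                             # weight of the right neighbour
        out.append(values[i] * (1 - frac) + values[i + 1] * frac)
    return out


# Hypothetical intensities from three 5 mm slices, resampled to 1.25 mm spacing.
thick_profile = [40.0, 60.0, 50.0]
print(resample_linear(thick_profile, 5.0, 1.25))
```

Interpolation places scans with different slice thicknesses on a common grid; it does not recover detail that was lost to partial volume averaging in the thick-slice acquisition.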
Accurate and efficient segmentation of regions of interest helps to extract robust quantitative imaging features. Segmentation methods can be roughly divided into manual, semi-automatic, and automatic segmentation. Manual segmentation is usually regarded as the gold standard, but it has two problems. First, it is time-consuming: according to statistics, delineating the region of interest of one tumor took an average of 18 minutes, so it is unlikely to be implemented in the clinic. Second, there were large subjective differences among observers, which may affect the repeatability of the target area[57, 58]. Existing studies [36–38, 59, 60] showed that semi-automatic segmentation had better stability and higher efficiency than manual segmentation (the average segmentation time was reduced by 4 minutes). Although automatic segmentation based on deep learning may further improve segmentation accuracy, it is not yet mature and needs further research before it can be used in the clinic. Of the 68 studies, 13 [12, 26, 61–71] (19%) used semi-automatic segmentation and 3 [52, 72, 73] (4%) used automatic segmentation. Twelve studies [21, 25, 44, 45, 49, 53, 74–79] did not describe the segmentation method used, and the remaining 40 [13–20, 22–24, 46–48, 50, 51, 54–56, 80–100] adopted manual segmentation.
Manual and semi-automatic segmentation may cause features to deviate from their true values because of variability in the segmentation process. This variability includes subjective differences among multiple observers (inter-observer variation) and differences for the same observer at different times (intra-observer variation), so segmentation by multiple observers or with multiple methods is needed to reduce the deviation. Of all the articles, 15 did not use multiple segmentation. Another 19 did not analyze the differences between observers or segmentation methods and did not exclude unstable features. Only 34 articles evaluated the variation, and the evaluation indicators were not consistent: 24 articles used the intraclass correlation coefficient (ICC), 3 articles[16, 20, 21] used the Dice similarity coefficient and/or Jaccard similarity coefficient, 1 article used Bland-Altman plots, 1 article used the Spearman correlation coefficient, 1 article used automatic segmentation whose repeatability had been verified, and 4 articles [12, 48, 52, 54] did not describe the evaluation index. In short, most of the studies that evaluated variation (71%) used the ICC, while I Fotina, et al. preferred the Jaccard similarity coefficient, conformal number, or generalized conformability index for evaluating inter-observer variability.
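For the overlap-based indicators mentioned above, the following minimal sketch computes the Dice and Jaccard similarity coefficients for two binary segmentation masks. The masks are represented as flat sequences of 0/1 values; the function name, the example masks, and the convention for two empty masks are assumptions made for illustration.

```python
def dice_and_jaccard(mask_a, mask_b):
    """Dice and Jaccard coefficients for two equal-length binary masks (flat 0/1 sequences)."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)  # voxels in both masks
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)
    union = size_a + size_b - inter
    if union == 0:
        # Both masks are empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2 * inter / (size_a + size_b)
    jaccard = inter / union
    return dice, jaccard


# Two hypothetical observers delineating the same six-voxel strip.
observer_1 = [1, 1, 1, 0, 0, 0]
observer_2 = [0, 1, 1, 1, 0, 0]
print(dice_and_jaccard(observer_1, observer_2))
```

The two coefficients are monotonically related (Dice = 2J / (1 + J)), so they rank pairs of segmentations identically, but their numerical values differ and thresholds are not interchangeable.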
3.2.4 Feature selection
Radiomics studies typically extract a large number of features, whereas the number of samples is often very small; this easily leads to the curse of dimensionality, so that the model is over-fitted and lacks generalization ability[8, 103]. To ensure that the model has statistical and clinical significance and to reduce the false positive rate, A Chalkidou, et al. proposed the following measures: (1) assessing the repeatability of features, (2) cross-correlation analysis, (3) inclusion of clinically important features, (4) at least 10-15 patients per feature, and (5) external validation.
The main purposes of feature selection are (1) to select features that are repeatable across institutions, (2) to remove redundant features (features highly correlated with one another), and (3) to select features that are strongly related to the outcome variable. Feature selection can effectively reduce the number of features, but the method should be chosen according to the needs of the research. Across all studies, the least absolute shrinkage and selection operator (LASSO) (46%) was the most commonly used feature selection method, followed by correlation analysis (33%). No single feature selection method fits every study; the choice needs to be adjusted to the number of features and the sample size, and the most suitable method should be selected by comparing several methods.
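As an illustration of the second purpose (removing redundant features), the sketch below applies a simple greedy correlation filter: a feature is kept only if its absolute Pearson correlation with every feature already kept does not exceed a threshold. This is one possible form of correlation analysis, not the procedure of any particular included study; the function names, the 0.9 threshold, and the example values are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def drop_redundant(features, threshold=0.9):
    """Greedy filter: keep a feature only if |r| with every kept feature is <= threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature values for five patients.
features = {
    "volume": [1.0, 2.0, 3.0, 4.0, 5.0],
    "surface_area": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly proportional to volume
    "kurtosis": [3.0, 1.0, 4.0, 1.5, 2.5],
}
print(drop_redundant(features))
```

The result of a greedy filter depends on the order in which features are visited, so in practice the features are often ordered first, for example by repeatability or by association with the outcome.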
The sample size of the studies ranged from 15 to 701, with a median of 111, and 78% of the studies had a sample size of 200 or fewer. To assess the adequacy of the sample size, MA Babyak suggested that at least 10-15 patients are needed for each feature. Excluding 5 studies[73, 74, 79, 98, 106] that did not establish a model and 4 studies[26, 46, 78, 87] that did not report the number of features, 17 (25%) of the included studies did not meet this standard (Figure 6).
Figure 6. Sample size of included studies. Adequate sample means the ratio of the sample size to the feature number of the study is more than 10, inadequate sample means the ratio is less than 10, unclear means the study did not establish a model or did not specify the number of features.
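The classification used in Figure 6 can be written as a small check on the ratio of sample size to feature number. The function name is hypothetical, the feature counts in the example are invented for illustration, and treating a ratio of exactly 10 as adequate is an assumption, because the caption does not define that boundary case.

```python
def sample_adequacy(n_patients, n_features, min_ratio=10):
    """Classify a study by its ratio of patients to model features."""
    if not n_features:
        # No model was established or the number of features was not reported.
        return "unclear"
    ratio = n_patients / n_features
    return "adequate" if ratio >= min_ratio else "inadequate"


# Median sample size of the included studies (111) with hypothetical feature counts.
print(sample_adequacy(111, 8))
print(sample_adequacy(111, 20))
print(sample_adequacy(111, None))
```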
3.2.5 Modelling methodology
C Parmar, et al. evaluated the performance and stability of 12 classification methods in predicting overall survival and found that random forest classification had the highest prediction performance and stability. However, it is still not clear whether statistical methods or machine learning methods are generally better. Models generated by simple modeling methods are easy to interpret, whereas complex models may improve performance but need further validation.
The generalization ability of the model can be evaluated using validation data. According to the principle of confirmatory analysis, independent data sets are needed to validate the results obtained on the training set. Only 51 articles (75%) used independent data sets for validation, including 5[22–26] articles using dual-center validation sets and 7[12, 16–21] articles using multicenter validation sets.
3.3 How to increase repeatability
3.3.1 Standardization protocol
Standardizing the radiomics process is the most reliable way to increase repeatability. To this end, the Image Biomarker Standardization Initiative (IBSI) standardizes definitions, nomenclature, and software. The Quantitative Imaging Network (QIN) project initiated by the NCI (National Cancer Institute) has also promoted the standardization of imaging methods and imaging protocols. In addition, the Quantitative Imaging Biomarkers Alliance (QIBA), sponsored by the Radiological Society of North America (RSNA), has developed standardized quantitative imaging documents, "Profiles", to promote clinical trials and practices of quantitative imaging biomarkers. At present, only 4[26, 53, 54, 61] (6%) of the radiomics studies of colorectal cancer comply with the IBSI standard.
3.3.2 Test-retest reliability
Before a unified standard is established, test-retest analysis can be used to increase the repeatability of a study. The same patient is scanned twice on the same scanner at an interval of 15 minutes to identify the features with high repeatability. Moreover, JE van Timmeren, et al. indicated that appropriate test-retest analysis should be carried out at each step. In addition, the effects of hardware, acquisition, reconstruction, tumor segmentation, and feature extraction should be strictly controlled. However, only 2 articles[44, 45] (3%) conducted test-retest analysis.
For retrospective data that were not acquired with the same scanning protocol, a phantom study can be used to remove features that are unstable because of differences in scanners, scanning parameters, and reconstruction parameters. However, none of the included studies used a phantom to analyze repeatability.
As new features continue to emerge, test-retest and phantom studies become less efficient. The following post-processing methods can also reduce the variability of features.
Resampling and normalization: Some studies[29, 30, 113] showed that resampling could effectively reduce the feature variation caused by voxel size differences. However, resampling alone might not reduce the variation of all features, and some features need to be normalized according to voxel size. Among the 68 studies, resampling was used in 9 articles [16, 21, 23, 51–56] (13%). In other studies, normalization was used to reduce the influence of different gray-level ranges or the effects of low-frequency intensity inhomogeneity. Normalization also has some disadvantages, such as introducing noise, blurring the image, and causing the loss of image details. However, these shortcomings of normalization can be avoided by using ComBat.
ComBat: Genomics has long been affected by batch effects, that is, systematic technical biases, unrelated to biological status, that are introduced when samples are processed and measured in different batches. WE Johnson, et al. developed and validated ComBat, a method for dealing with the batch effect. In radiomics, the impact of different scanners or scanning protocols is similar to that of batches. Studies[117–119] showed that ComBat could reduce the feature differences caused by different scanners or scanning protocols while retaining the feature differences due to biological variation. Although ComBat is practical, convenient, and fast, it is affected by the distribution of the validation data sets and cannot be applied directly to imaging data. Therefore, Y Li, et al. developed a normalization method based on deep learning, which may effectively avoid the above problems.
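The location and scale adjustment that underlies batch harmonization can be illustrated with a deliberately simplified sketch. The code below is not ComBat: it omits the empirical Bayes shrinkage of batch parameters and the covariate terms that allow ComBat to preserve biological variation. It only shifts and scales the values of one feature so that each scanner batch has the pooled mean and the pooled within-batch standard deviation. The function name, the scanner labels, and the example values are hypothetical.

```python
from math import sqrt
from statistics import fmean


def align_batches(values, batches):
    """Simplified per-batch location/scale alignment of one feature (not full ComBat)."""
    grand_mean = fmean(values)
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)

    stats = {}
    within_ss = 0.0
    for b, vals in groups.items():
        m = fmean(vals)
        ss = sum((v - m) ** 2 for v in vals)   # within-batch sum of squares
        stats[b] = (m, sqrt(ss / len(vals)))   # batch mean and standard deviation
        within_ss += ss
    pooled_sd = sqrt(within_ss / len(values))  # pooled within-batch spread

    aligned = []
    for v, b in zip(values, batches):
        m, s = stats[b]
        z = (v - m) / s if s > 0 else 0.0      # standardize within the batch
        aligned.append(grand_mean + z * pooled_sd)
    return aligned


# One hypothetical feature measured on two scanners with a systematic offset.
feature = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
scanner = ["A", "A", "A", "B", "B", "B"]
print(align_batches(feature, scanner))
```

Because this sketch has no covariate model, any true biological difference that coincides with the scanner grouping is removed together with the technical offset, which is one of the reasons the full ComBat model is preferred in practice.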