Application of CT Radiomics Combined with Machine Learning Methods in Predicting the Recurrence or Metastasis of Gastrointestinal Stromal Tumors

The study aimed to evaluate the diagnostic performance of machine learning-based CT radiomics models for the preoperative prediction of recurrence and metastasis of gastrointestinal stromal tumors (GISTs). A total of 382 patients with histopathologically confirmed GISTs were retrospectively included. According to postoperative follow-up, patients were classified into a non-recurrence and metastasis group (NRM) and a recurrence or metastasis group (RM). Radiomics features were extracted from arterial and portal venous phase CT images. Four feature selection methods and ten machine learning techniques were used to train prediction models on the training cohort, with internal validation by 10-fold cross-validation. The F1 score was used to evaluate the performance of the classification models. The best models of the two phases were stacked to build an ensemble model. The area under the curve (AUC), recall, precision, accuracy, and F1 score were used to evaluate the performance of the models and to compare them with clinical criteria based on diameter. The findings highlight the potential of machine learning techniques based on CT radiomics for the preoperative prediction of recurrence and metastasis of GISTs.


Introduction
Gastrointestinal stromal tumors (GISTs) are the most common mesenchymal tumors of the gastrointestinal tract and have varying malignant potential [1,2]. There are many criteria for evaluating the postoperative biological behavior of primary resectable GISTs. The most widely recognized standards include the National Institutes of Health (NIH) classification, the Armed Forces Institute of Pathology (AFIP) classification, and the Memorial Sloan Kettering Cancer Center (MSKCC) prognostic nomogram. All of them assign GISTs to different recurrence risk groups by mitotic index and tumor size, and AFIP and MSKCC also highlight the importance of tumor site [3]. The biological behavior of GISTs ranges from very low risk to malignant, and different treatments are designed for GISTs with different risk stratification [3,4].
Fortunately, molecular-targeted therapy using the tyrosine kinase inhibitor imatinib mesylate is effective for high-risk GISTs [5,6]. Neoadjuvant therapy is reported to have the potential to increase the complete resection rate and to avoid surgical rupture by decreasing tumor size. A favorable non-recurrence and metastasis (NRM) survival rate can be achieved with neoadjuvant therapy of high-risk GISTs [7].
Accurate preoperative assessment of the recurrence and metastatic risk of GISTs therefore has high clinical value, because it can provide important clues for personalized treatment, such as the use of neoadjuvant chemotherapy. The NIH classification, the AFIP classification, and the MSKCC nomogram are all based on the mitotic index and are appropriate for classifying GISTs after surgery. A higher mitotic index means a higher risk of recurrence, and a mitotic count >5/50 HPF is considered a warning sign of a high recurrence rate [8]. However, the mitotic index can only be measured by anatomic pathology. Although tissue from gastric stromal tumors can be obtained by endoscopy, the mitotic index is determined over the full volume of the tumor. Additionally, tissue from small intestinal stromal tumors is not easy to obtain by endoscopy preoperatively. Therefore, clinicians often consider a GIST diameter of more than 5 cm to be a marker of poor prognosis in preoperative decision making [9,10].
Machine learning (ML) has a variety of applications in healthcare and medicine. Multiple algorithms have been employed in cancer diagnosis and prognosis prediction, for example in breast cancer, oral cancer, cervical cancer, colon cancer, gastric cancer, and multiple myeloma [11,12]. These ML methods have been regarded as more powerful prognostic biomarkers and diagnostic tools, which can add information to clinical data because of their ability to detect key features in complex datasets [13].
However, few published studies address the performance of ML methods in predicting recurrence or metastasis (RM) of GISTs, or whether that performance rivals or even surpasses that of clinical criteria based on diameter. Thus, we aim to provide a general overview of various ML tools for predicting RM of GISTs and to compare their predictive performance with the clinical criteria.

Research dataset
Ethical approval for this retrospective study was obtained from our institutional review board and informed consent was waived. By searching electronic medical records, a total of 382 patients treated from January 2010 to December 2019 were included in the study (215 men and 167 women). The inclusion criteria were as follows: (1) primary GIST confirmed by postoperative histopathological examination; (2) availability of standard contrast-enhanced CT images before surgery; (3) complete clinicopathologic data. The exclusion criteria were as follows: (1) other concurrent primary malignant tumors; (2) distant metastasis confirmed by preoperative images; (3) preoperative targeted therapy, such as imatinib; (4) tumor rupture during or before the operation; (5) unclear lesion on the CT images. Baseline clinical data included age, clinical symptoms, tumor site, size, mitotic rate, Ki67 index, and risk stratification (according to the modified National Institutes of Health criteria). After radical surgery, all patients were routinely followed up annually with abdominal CT examinations or telephone calls. The last follow-up date was June 2020. The endpoint was time to recurrence or metastasis (RM).
The CT images of all patients in the arterial phase and portal venous phase were used for tumor analysis and segmentation. The CT image acquisition procedure is described in the supplementary material (Text S1). Tumor size (maximal diameter) was measured on CT by one radiologist (QXF, who had abdominal radiological experience of 5 years), who was unaware of the clinical and pathological data. Lesion segmentation was performed semi-automatically with a dedicated commercial software package (Frontier, syngo.via, Siemens Healthcare) by one radiologist (BT) and reconfirmed one month later by another radiologist (QXF who had abdominal radiological experience of 3 years). After lesion segmentation, imaging features were computed from the target volumes using an open-source Python package for the extraction of radiomics features (https://pyradiomics.readthedocs.io/en/latest/index.html). Image normalization was performed using a method that remaps the histogram to fit within µ ± 3σ (µ: gray-level mean within the VOI; σ: gray-level standard deviation). In total, 851 radiomic imaging features were automatically extracted from the target volume based on the 7 feature classes available in the software package: first-order statistics; shape features; features of the gray level co-occurrence matrix (GLCM); features of the gray level run-length matrix (GLRLM); features of the gray level size zone matrix (GLSZM); features of the neighboring gray tone difference matrix (NGTDM); and features of the gray level dependence matrix (GLDM).
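The µ ± 3σ normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `normalize_mu_3sigma` is hypothetical, the VOI is represented as a flat list of gray levels, and the remapping is assumed to be a clip to [µ − 3σ, µ + 3σ] followed by a linear rescale to [0, 1].

```python
import statistics


def normalize_mu_3sigma(voxels, n_sigma=3.0):
    """Clip gray levels to [mu - n_sigma*sd, mu + n_sigma*sd], then rescale to [0, 1].

    Assumption: clipping plus linear rescaling is one plausible reading of
    "remaps the histogram to fit within mu +/- 3 sigma".
    """
    mu = statistics.mean(voxels)
    sd = statistics.pstdev(voxels)  # population standard deviation of the VOI
    if sd == 0:
        # A constant VOI carries no contrast; map every voxel to mid-range.
        return [0.5 for _ in voxels]
    lo, hi = mu - n_sigma * sd, mu + n_sigma * sd
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in voxels]


# Hypothetical gray levels: 40 ordinary voxels and one extreme outlier.
voi = [100] * 20 + [110] * 20 + [1000]
normalized = normalize_mu_3sigma(voi)
```

The outlier is clipped to the upper bound, so a single extreme voxel cannot stretch the gray-level range on which the texture matrices are later computed.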

Machine learning
All patients were randomly divided into a training data set (n=267; positive/negative = 16/251) and a testing data set (n=115; positive/negative = 7/108). To correct the class imbalance of the training data set, we used the Synthetic Minority Oversampling Technique (SMOTE) to balance the positive and negative samples. Normalization was then applied to the feature matrix: each feature vector had its mean subtracted and was divided by its standard deviation, so that after normalization each vector had a mean of 0 and a standard deviation of 1. The similarity of each feature pair was then compared using the Pearson correlation coefficient (PCC). If the PCC of a feature pair was larger than 0.99, one of the two features was removed. After this process, the dimension of the feature space was reduced and the remaining features were no longer near-duplicates of one another. Before building the model, we evaluated four feature selection methods: analysis of variance (ANOVA), Kruskal-Wallis (KW), recursive feature elimination (RFE), and Relief.
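The three preprocessing steps above (oversampling, z-score normalization, and PCC-based redundancy removal) can be sketched in plain Python. This is an illustrative sketch under assumptions, not the FAEPro implementation: the function names, the neighbor count `k`, and the toy data are hypothetical, and the SMOTE step shows only the core interpolation idea. In keeping with the text, oversampling would be applied to the training set only.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((r for r in minority if r is not base),
                            key=lambda r: math.dist(base, r))[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic


def zscore_columns(rows):
    """Standardise each feature column to mean 0 and standard deviation 1."""
    out_cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        out_cols.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [list(r) for r in zip(*out_cols)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def drop_redundant(rows, threshold=0.99):
    """Return indices of columns kept after removing one feature of every
    pair whose absolute PCC exceeds the threshold (the first one is kept)."""
    cols = list(zip(*rows))
    kept = []
    for j, col in enumerate(cols):
        if all(abs(pearson(col, cols[i])) <= threshold for i in kept):
            kept.append(j)
    return kept


# Hypothetical toy data: column 1 is exactly twice column 0.
features = [[1.0, 2.0, 5.0], [2.0, 4.0, 1.0], [3.0, 6.0, 4.0], [4.0, 8.0, 2.0]]
kept_columns = drop_redundant(zscore_columns(features))
```

Which member of a correlated pair is dropped is a design choice; this sketch keeps the feature that appears first.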
ANOVA and KW selected features according to the corresponding test statistic (the F-value for ANOVA). RFE selected features based on a classifier by recursively considering smaller and smaller sets of features. Relief repeatedly sampled subsets of the data and identified the features most relevant to the label. All selected radiomics features were then passed to the classifiers to establish prediction models of recurrence or metastasis of GISTs for each combination of algorithms. The ten machine learning classifiers were: linear discriminant analysis (LDA), Support Vector Machines (SVM), Random Forest (RF), Adaptive Boosting (AdaBoost), Auto-Encoder (AE), sometimes called multi-layer perceptron (MLP), Gaussian process (GP), Naive Bayes (NB), Logistic Regression (LR), Least Absolute Shrinkage and Selection Operator (LASSO), and Decision Tree (DT).
The ten ML tools are briefly described as follows. LDA is a linear classifier that fits class-conditional densities to the data and applies Bayes' rule. SVM is an effective and robust classifier. Its kernel function can map the features into a higher-dimensional space in which a hyper-plane separating cases with different labels is sought, and the coefficients of the features in the final model are comparatively easy to interpret. RF is an ensemble learning method that combines multiple decision trees trained on different subsets of the training data set, and it is an effective way to limit over-fitting. AdaBoost is a meta-algorithm that combines other learning algorithms into a final boosted classifier. It is sensitive to noise and outliers, but it can also help to avoid over-fitting. Here we used a decision tree as the base classifier for AdaBoost. MLP is a neural network with one or more hidden layers that learns the mapping from the input features to the label. Here we used 1 hidden layer with 100 hidden units; the non-linear activation function was the rectified linear unit and the optimizer was Adam with a step size of 0.001. GP combines the features to build a joint distribution from which the classification probability is estimated. NB is a probabilistic classifier based on Bayes' theorem, and it requires a number of parameters that is linear in the number of features. Logistic regression is a linear classifier that combines all the features; a hyper-plane is sought in the high-dimensional space to separate the samples. Logistic regression with a LASSO constraint is also a linear classifier based on logistic regression: an L1-norm penalty is added to the final loss function, which constrains the weights and makes the selected features sparse. DT is a non-parametric supervised learning method that can be used for classification with high interpretability.
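ANOVA-based feature ranking, one of the four selection methods described above, can be illustrated with a short sketch. It computes the one-way ANOVA F-value of each feature between the positive (RM) and negative (NRM) groups and keeps the highest-scoring features. The function names and the toy data are hypothetical, and this is not the FAEPro code.

```python
def anova_f(pos, neg):
    """One-way ANOVA F-value for a single feature measured in two groups."""
    n1, n2 = len(pos), len(neg)
    m1, m2 = sum(pos) / n1, sum(neg) / n2
    grand = (sum(pos) + sum(neg)) / (n1 + n2)
    ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
    ss_within = (sum((v - m1) ** 2 for v in pos)
                 + sum((v - m2) ** 2 for v in neg))
    df_between, df_within = 1, n1 + n2 - 2
    if ss_within == 0:
        return float("inf")  # groups are perfectly separated with no spread
    return (ss_between / df_between) / (ss_within / df_within)


def rank_features(X, y, top_k):
    """Return the indices of the top_k features with the largest F-values."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((anova_f(pos, neg), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:top_k]]


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.5], [3.1, 3.5], [2.9, 4.0]]
y = [0, 0, 0, 1, 1, 1]
selected = rank_features(X, y, top_k=1)
```

A large F-value means the group means differ strongly relative to the within-group spread, which is why such features are retained first.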
To determine the hyper-parameters of each model, 10-fold cross-validation was applied to the training data set. The hyper-parameters were set according to model performance on the validation folds using the one-standard-error rule. All of the above processes were implemented with FeAture Explorer Pro (FAEPro, V 0.3.3) on Python (3.7.6) [14].
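The one-standard-error rule mentioned above can be sketched as follows. Among all candidate models, it picks the simplest one whose mean cross-validation score lies within one standard error of the best mean score. The sketch assumes that model size is measured by the number of selected features; the function name and the fold scores are hypothetical and do not come from the study.

```python
import math
import statistics


def one_standard_error_choice(cv_scores):
    """cv_scores maps a model size (number of features) to its per-fold scores.

    Returns the smallest size whose mean score is at least
    (best mean score - standard error of the best model).
    """
    summary = {}
    for size, scores in cv_scores.items():
        mean = statistics.mean(scores)
        se = statistics.stdev(scores) / math.sqrt(len(scores))
        summary[size] = (mean, se)
    best_size = max(summary, key=lambda s: summary[s][0])
    best_mean, best_se = summary[best_size]
    threshold = best_mean - best_se
    return min(s for s, (mean, _) in summary.items() if mean >= threshold)


# Hypothetical fold scores for models with 2, 4, and 8 features.
cv_scores = {
    2: [0.70, 0.72, 0.68, 0.71, 0.69],
    4: [0.80, 0.84, 0.79, 0.83, 0.815],
    8: [0.82, 0.85, 0.79, 0.83, 0.81],
}
chosen_size = one_standard_error_choice(cv_scores)
```

Here the 8-feature model has the best mean score, but the 4-feature model is within one standard error of it, so the smaller model is preferred.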
The study process diagram is shown in Figure 1.

Statistical analysis and predictive performance of models
To analyze baseline clinical data, categorical variables were compared using the χ² test or Fisher's exact test, and continuous variables were compared using Student's t test. Each machine learning model was evaluated by calculating accuracy, the area under the ROC curve (AUC), recall, precision, and F1 score. With TP, FP, TN, and FN denoting the numbers of true positives, false positives, true negatives, and false negatives, these indicators follow their standard definitions: accuracy = (TP + TN) / (TP + FP + TN + FN); precision = TP / (TP + FP); recall = TP / (TP + FN); F1 score = 2 × precision × recall / (precision + recall). A two-sided P value < 0.05 was considered to indicate statistical significance. All regular statistical analyses were performed using MedCalc software (Version 19.1.6). The machine learning algorithms were implemented in Python (3.7.6) [14].

Results
Baseline clinical characteristics
Among the 382 patients (mean age, 58.6 years; range: 20-90 years) included in the present study, 215 were male and 167 were female. Recurrence or metastasis was found in 23 (6%) patients during follow-up imaging. The mean time to RM was 1.93 ± 1.21 years (range: 0.30 to 4.50 years). Age, gender, and clinical symptoms had no significant relationship with RM of GISTs, whereas tumor size, mitotic rate, Ki67 index, and risk stratification differed between the RM and NRM groups. The baseline characteristics of all patients are summarized in Table 1.

Diagnostic Performance Of Ten ML Techniques
Cross-validation with 10 folds was applied to the training data set to determine the hyper-parameters of each model. The hyper-parameters were set according to the one-standard-error rule to select the smallest model. The performance on the testing data set of the forty smallest ML models in the arterial phase is shown in Table 2. Among these models, ANOVA_NB, which selected only 5 features, provided the highest F1 score (0.560). Its precision and recall were 0.389 and 1.000, its accuracy was 0.904, and its AUC was 0.942 (95% CI: 0.8891-0.9851). The performance on the testing data set of the forty smallest ML models in the venous phase is shown in Table 3. Among the forty models, KW_AdaBoost, which selected only 4 features, provided the highest F1 score (0.500). Its precision and recall were 0.600 and 0.429, its accuracy was 0.948, and its AUC was 0.644 (95% CI: 0.3509-0.9782). The selected features of each model are provided in the supplementary material (Table S1).
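The reported test-set metrics can be checked against the standard confusion-matrix definitions. The sketch below uses counts that are not reported in the text; they are inferred from the stated metrics and the test-set composition (7 positive and 108 negative cases) and should be read as an assumption. The function name `classification_metrics` is likewise hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Inferred (not reported) counts for the arterial-phase ANOVA_NB model:
# all 7 positives detected, 11 of the 108 negatives flagged as positive.
anova_nb = classification_metrics(tp=7, fp=11, fn=0, tn=97)

# Inferred (not reported) counts for the venous-phase KW_AdaBoost model:
# 3 of the 7 positives detected, 2 of the 108 negatives flagged as positive.
kw_adaboost = classification_metrics(tp=3, fp=2, fn=4, tn=106)

for name, m in (("ANOVA_NB", anova_nb), ("KW_AdaBoost", kw_adaboost)):
    print(name, {key: round(value, 3) for key, value in m.items()})
```

Under these inferred counts the rounded values match the reported ones. They also show why accuracy alone is uninformative at a 6% event rate: KW_AdaBoost reaches an accuracy of 0.948 while missing 4 of the 7 RM cases.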

Comparison Of The Ensemble Model And The Clinical Criteria
We combined the ANOVA_NB model and the KW_AdaBoost model into an ensemble model for RM prediction of GISTs. The performance of the ensemble model and of the clinical criteria based on diameter is shown in Figure 2 and Table 4.
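The Discussion states that the two best single-phase models were stacked by logistic regression. A minimal sketch of that idea follows, assuming the meta-model takes the predicted RM probabilities of the arterial-phase and venous-phase models as its two inputs. The function names, the learning rate, the number of epochs, and the toy probabilities are all hypothetical, and this is not the FAEPro implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stacker(p_arterial, p_venous, y, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-model on two base-model probabilities
    with plain batch gradient descent on the log-loss."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        gw0 = gw1 = gb = 0.0
        for a, v, t in zip(p_arterial, p_venous, y):
            err = sigmoid(w[0] * a + w[1] * v + b) - t  # gradient of log-loss
            gw0 += err * a
            gw1 += err * v
            gb += err
        w[0] -= lr * gw0 / n
        w[1] -= lr * gw1 / n
        b -= lr * gb / n
    return w, b


def predict_stacked(w, b, p_arterial, p_venous):
    """Ensemble probability of RM for one patient."""
    return sigmoid(w[0] * p_arterial + w[1] * p_venous + b)


# Hypothetical base-model outputs for eight training patients (1 = RM).
p_art = [0.9, 0.8, 0.7, 0.2, 0.3, 0.1, 0.6, 0.4]
p_ven = [0.7, 0.9, 0.6, 0.3, 0.1, 0.2, 0.8, 0.3]
labels = [1, 1, 1, 0, 0, 0, 1, 0]
weights, bias = fit_stacker(p_art, p_ven, labels)
```

In practice the meta-model is usually fitted on out-of-fold predictions of the base models so that it does not simply learn their training-set over-confidence; whether that was done here is not stated in the text.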

Discussion
Our study combined four feature selection methods and ten machine learning techniques to predict RM of GISTs based on contrast-enhanced CT radiomics, and then stacked the best arterial-phase and venous-phase models into an ensemble model by logistic regression. The ANOVA_NB and KW_AdaBoost models were found to outperform the other combinations of feature selection methods and ML techniques. Finally, we compared the diagnostic performance of the ensemble model with that of clinical criteria based on diameter. The ensemble model provided superior F1 score, accuracy, recall, and precision compared with the clinical criteria.
Patients with GISTs who had recurrence or metastasis had a higher mortality rate than those who did not [15]. Consistent with previous studies, pathological features including Ki67 and mitotic index tended to be related to the prognosis of GISTs in our research, and tumors in the recurrence or metastasis group tended to be larger in diameter (mean size: 9 cm). Clinically, a GIST larger than 10 cm with any mitotic rate is considered to carry a higher risk of recurrence and subsequently to require targeted drug therapy [16]. There are many nomograms and models for predicting the biological behavior of GISTs. Moreover, clinicopathological classifications of GISTs, such as the modified NIH and AFIP criteria, aim to predict the risk or probability of metastasis [3]. Although approximately 50% of patients with primary GIST who undergo complete tumor resection survive more than 5 years, it may take more than ten years for a patient with GIST to develop metastasis [17], so surgery alone may not be sufficient treatment [18]. However, few studies have obtained long-term follow-up of stromal tumors and directly predicted the recurrence or metastasis of GISTs.
In our study, RM was found in 23 cases, and in one of them metastasis was found at nearly 5 years. The proportion of recurrent cases in our study is low, which is inconsistent with previous reports by Ronald P [18] and Chairat [15] but similar to the recent report by Ao [19]. We speculate that this may result from the deepening understanding of GISTs among clinicians in our country in recent years, combined with the detection of KIT and PDGFR gene mutations and the application of targeted medicine in high-risk stromal tumors, all of which may have lowered the incidence of recurrence and metastasis in our study. Because the data were imbalanced, and in contrast to Ao's evaluation of model performance using the ROC curve alone, we used the SMOTE technique to balance the data and also used the F1 score and the PR curve to evaluate the performance of our model. The results of our study showed that even though both the ensemble model and the clinical criteria had a high AUC, there were still large differences between the two on the PR curve, and the ensemble model built with ML techniques was better at predicting RM of GISTs.
As seen in the literature, radiomics based on CT or MRI has been applied to the differential diagnosis of GISTs, to risk stratification and prediction of prognosis after surgical resection, and to the evaluation of mutational status in GISTs [20]. Feature selection methods such as the KW and Relief algorithms and ML techniques including SVM, LASSO, LR, DT, and RF have been used in GIST diagnosis and prognosis prediction [19,21-23]. All of these methods and techniques were reported to have achieved promising results in GIST differential diagnosis and in predicting biological behavior and prognosis, with AUCs ranging from 0.754 to 0.962 [20].
In addition to the two feature selection methods mentioned above, we also used RFE and ANOVA to select features after data cleansing, because the number of features is high and identifying the influential features is of paramount importance in clinical disease diagnosis and prognosis prediction. Moreover, besides the SVM, LASSO, LR, DT, and RF classifiers, other techniques have been applied in oncology. AdaBoost has been used in the classification of breast tumors and brain tumors [24,25]. MLP has been applied in bladder cancer diagnosis [26], breast cancer detection and diagnosis [27], classification of skin cancer [28], and specific Borrmann classification in advanced gastric cancer [12]. GP regression was shown to be of some value for predicting the survival time of a cancer patient based on genome-wide gene expression. LDA could be considered an appropriate tool for classifying bladder cancer cases and searching for important biomarkers [29], and NB was reported to achieve 98% accuracy in predicting breast cancer and 90% in predicting lung cancer [30]. However, a single model may not make the best predictions and may be subject to errors such as variance and bias; combining several models into a single model can reduce these errors and improve the predictions.
Therefore, all ten ML techniques were evaluated in our study for their ability to predict the recurrence and metastasis of GISTs. They were compared with each other, and the best model in each of the two phases was chosen to build a stacked ensemble model. The ensemble model showed good performance in RM prediction of GISTs, with high accuracy, recall, precision, and F1 score in our study.
However, several shortcomings of our research should be noted. First, the positive group with recurrence or metastasis was relatively small, which may affect statistical stability; we plan to increase the sample size. Second, in the ensemble model we only integrated the two best models rather than all models; other ensemble learning methods remain to be investigated and compared with each other. Third, because this was a retrospective single-center study, the patient population and imaging methods were heterogeneous and selection bias may exist. The study also lacks validation on external data, which is essential in future research.

Conclusion
Our study indicates that CT radiomics combined with machine learning methods has considerable preoperative value for predicting recurrence or metastasis of GISTs, and may help clinicians to better stratify patients, select optimal treatment strategies, and provide individualized management to improve clinical outcomes.

Declarations
Ethics approval and consent to participate: Ethical approval for this retrospective study was obtained from the institutional review board of the First Affiliated Hospital of Nanjing Medical University and informed consent was waived. All procedures were performed in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments.

Consent for publication: Not applicable.
Availability of data and materials: All data generated or analysed during this study are included in this published article and its supplementary information files.
Conflicts of interest: The authors declare that they have no conflict of interest.

Figure 1. Study-process diagram.