Research dataset
Ethical approval for this retrospective study was obtained from our institutional review board, and the requirement for informed consent was waived. By searching electronic medical records, a total of 382 patients (215 men and 167 women) treated from January 2010 to December 2019 were included in the study. The inclusion criteria were as follows: (1) primary GISTs confirmed by postoperative histopathological examination; (2) availability of standard contrast-enhanced CT images before surgery; (3) complete clinicopathologic data. The exclusion criteria were as follows: (1) other concurrent primary malignant tumors; (2) distant metastasis confirmed on preoperative images; (3) preoperative targeted therapy, such as imatinib; (4) tumor rupture during or before the operation; (5) unclear lesions on the CT images. Baseline clinical data included age, clinical symptoms, tumor site, size, mitotic rate, Ki-67 index, and risk stratification (according to the modified National Institutes of Health criteria). After radical surgery, all patients were routinely followed up annually with abdominal CT examinations or telephone calls. The last follow-up date was June 2020. The endpoint was time to recurrence or metastasis (RM).
The CT images of all patients in the arterial phase and portal venous phase were used for tumor analysis and segmentation. The CT image acquisition procedure is described in the supplementary material (Text S1). Tumor size (maximal diameter) was measured on CT by one radiologist (QXF, who had 5 years of abdominal radiological experience), who was unaware of the clinical and pathological data. Lesion segmentation was performed semi-automatically with a dedicated commercial software package (Frontier, syngo.via, Siemens Healthcare) by one radiologist (BT) and reconfirmed one month later by another radiologist (QXF, who had 3 years of abdominal radiological experience). After lesion segmentation, imaging features were extracted from the target volumes using an open-source Python package for radiomics feature extraction (https://pyradiomics.readthedocs.io/en/latest/index.html). Image normalization was performed with a method that remaps the histogram to fit within µ ± 3σ (µ: mean gray level within the VOI; σ: gray-level standard deviation). In total, 851 radiomic features were automatically extracted from each target volume based on the 7 feature classes available in the package: first-order statistics; shape features; and features of the gray level co-occurrence matrix (GLCM), the gray level run-length matrix (GLRLM), the gray level size zone matrix (GLSZM), the neighboring gray tone difference matrix (NGTDM), and the gray level dependence matrix (GLDM).
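For illustration, the µ ± 3σ normalization described above (read here as clipping intensities beyond 3σ of the VOI mean, one simple interpretation of "remapping the histogram") and a small subset of first-order features can be sketched in plain NumPy. This is an illustrative sketch, not the pyradiomics implementation; the bin count is an arbitrary choice.

```python
import numpy as np

def normalize_voi(voi):
    """Normalization as described in the text: remap intensities so the
    histogram fits within mu +/- 3*sigma of the VOI (here, by clipping)."""
    x = np.asarray(voi, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.clip(x, mu - 3 * sigma, mu + 3 * sigma)

def first_order_features(voi, n_bins=32):
    """A few first-order statistics of the kind pyradiomics extracts
    (illustrative subset, not the pyradiomics implementation)."""
    x = np.asarray(voi, dtype=float).ravel()
    hist, _ = np.histogram(x, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins before taking the log
    return {
        "mean": float(x.mean()),
        "variance": float(x.var()),
        "skewness": float(((x - x.mean()) ** 3).mean() / x.std() ** 3),
        "entropy": float(-(p * np.log2(p)).sum()),
    }

# Toy VOI with CT-like intensities (hypothetical data, for demonstration only)
rng = np.random.default_rng(0)
voi = rng.normal(60.0, 15.0, size=(8, 8, 8))
feats = first_order_features(normalize_voi(voi))
```

In the actual pipeline, the same idea is repeated over all 7 feature classes and several image filters to yield the 851 features.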
Machine learning
All patients were randomly split into a training set (n=267; 16/251 positive/negative) and a testing set (n=115; 7/108 positive/negative). To address the class imbalance of the training set, the Synthetic Minority Oversampling Technique (SMOTE) was used to balance the positive and negative samples. Normalization was then applied to the feature matrix: the mean value was subtracted from each feature vector, which was then divided by the standard deviation, so that after normalization each feature had a mean of 0 and a standard deviation of 1. Next, the similarity of each feature pair was compared, and if the Pearson correlation coefficient (PCC) of a pair was larger than 0.99, one feature of the pair was removed. This process reduced the dimension of the feature space and left features that were largely independent of each other.
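The three preprocessing steps above can be sketched as follows. The SMOTE function is a minimal from-scratch illustration of the interpolation idea (the study used the standard SMOTE implementation), and the threshold and neighbor count are the values named in the text or common defaults.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: synthesize minority samples by interpolating
    between a minority sample and one of its k nearest minority neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]   # k nearest neighbors, excluding self
        j = rng.choice(nbrs)
        lam = rng.random()              # interpolation factor in [0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth)

def zscore(X):
    """Subtract the per-feature mean and divide by the per-feature standard
    deviation (assumes no constant feature columns)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def drop_correlated(X, threshold=0.99):
    """Keep only features whose Pearson correlation coefficient (PCC) with
    every already-kept feature is at or below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep
```

SMOTE is applied only to the training set; the testing set keeps its original class distribution.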
Before building the model, we evaluated four feature selection methods: analysis of variance (ANOVA), Kruskal–Wallis (KW), recursive feature elimination (RFE), and Relief.
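Of the four methods just listed, the univariate ones are the simplest to illustrate: rank each feature by its ANOVA F-value between the two outcome groups and keep the top-scoring ones (KW is analogous via scipy.stats.kruskal). This is an illustrative sketch; the study used the implementations bundled in FAEPro.

```python
import numpy as np
from scipy import stats

def rank_features_anova(X, y, top_k=10):
    """Rank features by the one-way ANOVA F-value between the two classes
    (y in {0, 1}) and return the indices of the top_k features."""
    scores = []
    for j in range(X.shape[1]):
        f, _ = stats.f_oneway(X[y == 0, j], X[y == 1, j])
        scores.append(f)
    order = np.argsort(scores)[::-1]  # highest F-value first
    return order[:top_k]
```

RFE, by contrast, is wrapper-based: it repeatedly fits a classifier and drops the weakest features, so its ranking depends on the chosen classifier rather than on a univariate statistic.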
ANOVA and KW select features according to their corresponding test statistics. RFE selects features with respect to a classifier by recursively considering smaller and smaller sets of features, and Relief recursively samples subsets of the data and identifies the features most relevant to the label. The selected radiomics features were then fed to the classifiers to build predictive models for the recurrence or metastasis of GISTs under different algorithm combinations. The ten machine learning classifiers were: linear discriminant analysis (LDA), support vector machine (SVM), random forest (RF), adaptive boosting (AdaBoost), auto-encoder (AE), here equivalent to a multi-layer perceptron (MLP), Gaussian process (GP), naive Bayes (NB), logistic regression (LR), least absolute shrinkage and selection operator (LASSO), and decision tree (DT). The ten classifiers are briefly described as follows. LDA is a linear classifier that fits class-conditional densities to the data and applies Bayes' rule. SVM is an effective and robust classifier: its kernel function can map the features into a higher dimension to search for a hyper-plane separating the cases with different labels, and the coefficients of the features in the final model are relatively easy to interpret. RF is an ensemble learning method that combines multiple decision trees trained on different subsets of the training set; it is an effective way to avoid over-fitting. AdaBoost is a meta-algorithm that combines several instances of a base classifier into a final boosted classifier; it is sensitive to noise and outliers, but it can also mitigate over-fitting. Here we used a decision tree as the base classifier for AdaBoost. MLP is a neural network with hidden layers that learns the mapping from the input features to the label; here we used 1 hidden layer with 100 hidden units.
The non-linear activation function was the rectified linear unit (ReLU), and the optimizer was Adam with a learning rate of 0.001. GP combines the features into a joint distribution to estimate the probability of each class. NB is a family of probabilistic classifiers based on Bayes' theorem; it requires a number of parameters linear in the number of features. Logistic regression is a linear classifier that combines all the features, searching for a hyper-plane in the high-dimensional space to separate the samples. Logistic regression with a LASSO constraint is also a linear classifier based on logistic regression: an L1-norm penalty is added to the loss function to constrain the weights, which makes the set of selected features sparse. DT is a non-parametric supervised learning method that can be used for classification with high interpretability.
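As one concrete example from the list above, the statement that NB needs a number of parameters linear in the number of features is easy to see in a minimal from-scratch Gaussian naive Bayes: each class stores one mean and one variance per feature, plus a prior. This is a didactic sketch, not the library implementation used in the study.

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian naive Bayes: each feature is modeled as an
    independent Gaussian per class, so parameters grow linearly with
    the number of features; prediction applies Bayes' rule."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.theta_, self.var_, self.prior_ = [], [], []
        for c in self.classes_:
            Xc = X[y == c]
            self.theta_.append(Xc.mean(axis=0))       # per-feature means
            self.var_.append(Xc.var(axis=0) + 1e-9)   # variance smoothing
            self.prior_.append(len(Xc) / len(X))
        return self

    def predict(self, X):
        log_post = []
        for m, v, p in zip(self.theta_, self.var_, self.prior_):
            # Gaussian log-likelihood summed over (assumed independent) features
            ll = -0.5 * (np.log(2 * np.pi * v) + (X - m) ** 2 / v).sum(axis=1)
            log_post.append(ll + np.log(p))
        return self.classes_[np.argmax(log_post, axis=0)]
```

The independence assumption is what keeps the model this small; the other nine classifiers trade that simplicity for more flexible decision boundaries.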
To determine the hyper-parameters of each model, 10-fold cross-validation was applied on the training set. The hyper-parameters were set according to the model performance on the validation folds using the one-standard-error rule. All of the above processes were implemented with FeAture Explorer Pro (FAEPro, V 0.3.3) in Python (3.7.6) [14].
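The one-standard-error rule can be stated compactly in code: among all candidate hyper-parameter settings, pick the simplest one whose mean cross-validation score is within one standard error of the best mean score. The complexity measure below is a placeholder for whatever ordering applies (e.g. number of selected features).

```python
import numpy as np

def one_standard_error_choice(cv_scores, complexity):
    """cv_scores: rows = hyper-parameter settings, cols = fold scores.
    complexity: one value per setting (lower = simpler model).
    Returns the index of the simplest setting whose mean score is within
    one standard error of the best mean score."""
    cv_scores = np.asarray(cv_scores, dtype=float)
    mean = cv_scores.mean(axis=1)
    se = cv_scores.std(axis=1, ddof=1) / np.sqrt(cv_scores.shape[1])
    best = np.argmax(mean)
    threshold = mean[best] - se[best]
    candidates = np.where(mean >= threshold)[0]
    # among the qualifying settings, prefer the least complex one
    return candidates[np.argmin(np.asarray(complexity)[candidates])]
```

Compared with simply taking the best mean score, this rule deliberately trades a statistically negligible amount of validation performance for a simpler, less over-fit model.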
The study process diagram is shown in Figure 1.
Statistical analysis and predictive performance of models
To analyze baseline clinical data, categorical variables were compared using the χ2 test or Fisher's exact test, and continuous variables were compared using the Student t test. Each machine learning model was evaluated by calculating the accuracy, the area under the ROC curve (AUC), recall, precision, and F1 score. These performance indicators were defined and computed as follows:
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
TP denotes true positives, TN true negatives, FP false positives, and FN false negatives; together, the four counts form the confusion matrix. Because our data were imbalanced, we used the F1 score as the main performance indicator. Precision-recall (PR) curves were drawn to compare the performance of the machine learning models with that of the clinical criteria, and the AUC of each PR curve was calculated.
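The four formulas above translate directly into code once the confusion-matrix counts are tallied (positive class coded as 1):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Compute the performance indicators defined above from the
    confusion-matrix counts (positive class = 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "recall": recall,
        "precision": precision,
        "f1": 2 * recall * precision / (recall + precision),
    }
```

Unlike accuracy, precision and recall (and hence F1 and the PR curve) ignore true negatives entirely, which is why they are more informative than accuracy on a data set with only 16/251 positives.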
A two-sided P value < 0.05 was considered to indicate statistical significance. All regular statistical analyses were performed using MedCalc software (Version 19.1.6). The machine learning algorithms were implemented in Python (3.7.6) [14].