Patient Selection
This retrospective study was approved by the Institutional Ethics Committee of our hospital, and the requirement for informed consent was waived. Patients with gastric GISTs from two centers were enrolled between January 2012 and September 2022. The inclusion criteria were as follows: (a) complete CT images (including unenhanced, arterial phase and portal venous phase images) acquired within 15 days before surgery; (b) a solitary, primary lesion; (c) no neoadjuvant treatment; (d) a long diameter larger than 1 cm and smaller than 5 cm; and (e) detailed clinical data (including age, gender, clinical symptoms and tumor markers). The inclusion and exclusion of patients are shown in Fig. 1. Finally, 231 patients (109 men and 122 women; mean age, 59.47 ± 10.13 years) from our hospital and 78 patients (41 men and 37 women; mean age, 62.69 ± 10.78 years) from the other hospital were included. The 231 patients from our hospital were assigned to a training cohort (n = 161) and an internal validation cohort (n = 70) at a ratio of 7:3, and the 78 patients from the other hospital served as the external validation cohort. The clinical characteristics collected for each patient included age, gender, symptoms and tumor markers. All surgically resected GISTs were classified into a low-grade malignancy group and a high-grade malignancy group: the low-grade group comprised GISTs with very low or low risk, and the high-grade group comprised GISTs with intermediate or high risk. Risk was stratified according to the 2022 NCCN Guidelines [5].
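The cohort construction above can be sketched in Python with scikit-learn, the library the study used for modeling. This is a minimal sketch under stated assumptions: the `RISK_TO_GRADE` mapping only restates the grouping described above, while stratifying the 7:3 split by label, the seed, and the synthetic risk categories are hypothetical, since the text reports only the ratio. With `test_size=0.3`, scikit-learn rounds the test set size up (ceil(0.3 × 231) = 70), which reproduces the reported 161/70 partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# NCCN risk categories collapsed into the two study groups (restates the text).
RISK_TO_GRADE = {
    "very low": 0,      # low-grade malignancy group
    "low": 0,           # low-grade malignancy group
    "intermediate": 1,  # high-grade malignancy group
    "high": 1,          # high-grade malignancy group
}


def to_binary_label(risk_category: str) -> int:
    """Map an NCCN risk category to 0 (low-grade) or 1 (high-grade)."""
    return RISK_TO_GRADE[risk_category.strip().lower()]


def split_internal_cohort(patient_ids, labels, seed=42):
    """Split single-center patients 7:3 into training and internal validation.

    Assumption: stratification by label and the seed are illustrative only;
    the text reports just the 7:3 ratio.
    """
    return train_test_split(
        patient_ids, labels, test_size=0.3, stratify=labels, random_state=seed
    )


# Hypothetical example: 231 patients with synthetic risk categories.
rng = np.random.default_rng(0)
categories = rng.choice(["very low", "low", "intermediate", "high"], size=231)
y = np.array([to_binary_label(c) for c in categories])
ids = np.arange(231)
train_ids, val_ids, y_train, y_val = split_internal_cohort(ids, y)
print(len(train_ids), len(val_ids))  # 161 70
```

The external validation cohort is not part of this split: it comes from the other center and is kept aside entirely.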
CT Examination
All patients underwent abdominal contrast-enhanced CT (CE-CT) on a 64-slice spiral CT scanner (Siemens, Forchheim, Germany, or Philips Medical Systems, Cleveland, OH, USA). The scanning parameters were: tube voltage, 120 kV; tube current, 150–250 mA; tube rotation time, 0.5 s; detector collimation, 64 × 0.625 mm; field of view, 350 × 350 mm; section thickness, 5 mm; and reconstruction interval, 1–1.5 mm. After the unenhanced scan, arterial phase (30–40 s delay) and portal venous phase (60–70 s delay) images were acquired following intravenous injection of 2 mL/kg of iodinated contrast medium at a rate of 3 mL/s.
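As a worked example of the weight-based contrast protocol, the snippet below computes the injected volume and the injection duration. The 70 kg body weight is hypothetical, and `contrast_protocol` is an illustrative helper, not part of any scanner software.

```python
def contrast_protocol(weight_kg, dose_ml_per_kg=2.0, rate_ml_per_s=3.0):
    """Return (volume in mL, injection duration in s) for a weight-based dose."""
    volume_ml = weight_kg * dose_ml_per_kg  # 2 mL/kg, as reported
    duration_s = volume_ml / rate_ml_per_s  # injected at 3 mL/s, as reported
    return volume_ml, duration_s


volume, duration = contrast_protocol(70)  # hypothetical 70 kg patient
print(f"{volume:.0f} mL over {duration:.1f} s")  # 140 mL over 46.7 s
```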
Image Analysis
Two radiologists (reviewer 1 and reviewer 2, with 6 and 13 years of experience in abdominal imaging, respectively) independently reviewed the CT images without knowledge of the surgical and pathological findings, and final decisions were reached by consensus. The CT features evaluated were: (a) the CT attenuation value (in Hounsfield units, HU) of the tumor in the unenhanced phase (CTU), (b) arterial phase (CTA) and (c) portal venous phase (CTV); (d) the degree of enhancement in the arterial phase (DEAP) and (e) in the portal venous phase (DEPP); (f) the enhancement potential in the arterial phase (EPa) and (g) in the portal venous phase (EPv); (h) long diameter (LD); (i) short diameter (SD); (j) the ratio of long diameter to short diameter (LD/SD); (k) contour (round, oval or irregular); (l) necrosis (yes or no); (m) calcification (yes or no); (n) surface ulceration (yes or no); (o) intratumoral angiogenesis (yes or no); and (p) enlarged peripheral lymph nodes (LN) (yes or no). CT attenuation was measured with a region of interest (ROI) drawn on the tumor on the same axial image, avoiding vessels, calcification and necrotic regions. DEAP and DEPP were obtained by subtracting CTU from CTA and CTV, respectively (DEAP = CTA − CTU; DEPP = CTV − CTU), and EPa and EPv were calculated as DEAP/CTU and DEPP/CTU, respectively. A lymph node was considered enlarged if its short-axis diameter exceeded 10 mm. Some of these CT features were defined as in our previous report [26].
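The derived enhancement features, the LD/SD ratio and the lymph node rule follow directly from the definitions above; a minimal sketch is shown below. The `TumorAttenuation` type, the function names and the example HU values are hypothetical, and the guard against CTU = 0 HU is an added assumption, because EPa and EPv divide by CTU.

```python
from dataclasses import dataclass


@dataclass
class TumorAttenuation:
    """ROI attenuation of one tumor, in Hounsfield units (HU)."""
    ctu: float  # unenhanced phase
    cta: float  # arterial phase
    ctv: float  # portal venous phase


def enhancement_features(m: TumorAttenuation) -> dict:
    """DEAP/DEPP are HU differences from CTU; EPa/EPv divide them by CTU."""
    if m.ctu == 0:
        raise ValueError("CTU is 0 HU, so EPa and EPv are undefined")
    deap = m.cta - m.ctu  # degree of enhancement, arterial phase
    depp = m.ctv - m.ctu  # degree of enhancement, portal venous phase
    return {"DEAP": deap, "DEPP": depp, "EPa": deap / m.ctu, "EPv": depp / m.ctu}


def ld_sd_ratio(ld_mm: float, sd_mm: float) -> float:
    """Ratio of long diameter to short diameter (LD/SD)."""
    return ld_mm / sd_mm


def enlarged_lymph_node(short_axis_mm: float) -> bool:
    """A lymph node counts as enlarged when its short-axis diameter exceeds 10 mm."""
    return short_axis_mm > 10.0


features = enhancement_features(TumorAttenuation(ctu=40.0, cta=80.0, ctv=100.0))
print(features)  # {'DEAP': 40.0, 'DEPP': 60.0, 'EPa': 1.0, 'EPv': 1.5}
```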
Machine Learning
Three classifiers, DT, GBDT and LR, were built with the scikit-learn library. Details of the methods are given in the official documentation (https://scikit-learn.org/), and the same approach was applied in our previous research [25]. The three cohorts (training, internal validation and external validation) did not overlap. The training cohort was used to train the models, the internal validation cohort to tune parameters, and the external validation cohort to evaluate model performance. For each classifier, sensitivity, specificity, accuracy and the area under the receiver operating characteristic curve (AUC), with 95% confidence intervals (95% CI), were calculated.
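The performance metrics listed above can be computed as in the sketch below. Two choices are assumptions not stated in the text: predicted probabilities are dichotomized at 0.5, and the 95% CI for the AUC is obtained with a percentile bootstrap; other CI methods (for example, DeLong's method) may have been used instead. `classification_metrics` and `bootstrap_auc_ci` are hypothetical helper names.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score


def classification_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity and accuracy at a threshold, plus AUC."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),  # high-grade GISTs correctly detected
        "specificity": tn / (tn + fp),  # low-grade GISTs correctly detected
        "accuracy": (tp + tn) / len(y_true),
        "auc": roc_auc_score(y_true, y_score),
    }


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (assumed method)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if np.unique(y_true[idx]).size < 2:
            continue  # AUC is undefined when a resample contains one class only
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


metrics = classification_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
ci = bootstrap_auc_ci([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(metrics)  # sensitivity 0.5, specificity 1.0, accuracy 0.75, auc 0.75
```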
Grid Search Strategy For Selecting Optimal Parameters
To find the optimal parameters of the three models, the grid search strategy in scikit-learn was used. Details of the grid search method are described in the model selection section of the official documentation (https://scikit-learn.org/stable/model_selection.html#model-selection).
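One way to run this grid search while respecting the cohort roles described in the Machine Learning section (training cohort for fitting, internal validation cohort for tuning) is scikit-learn's `GridSearchCV` with a `PredefinedSplit`. This is a sketch under assumptions: the text does not say whether the search used the internal validation cohort or cross-validation within the training cohort, the candidate grid is hypothetical (it merely contains the reported final LR values), AUC as the selection score is assumed, and the data are synthetic. Setting `refit=False` prevents a final refit on the training and validation data combined; the best parameters are then used to refit on the training cohort alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic stand-ins for the training (161) and internal validation (70) cohorts.
X, y = make_classification(n_samples=231, n_features=16, random_state=0)
X_train, y_train, X_val, y_val = X[:161], y[:161], X[161:], y[161:]

# -1 marks rows always used for fitting; 0 marks the single held-out fold.
test_fold = np.r_[np.full(len(X_train), -1), np.zeros(len(X_val), dtype=int)]

search = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=12),
    param_grid={"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]},  # hypothetical grid
    scoring="roc_auc",
    cv=PredefinedSplit(test_fold),
    refit=False,  # do not refit on training + validation combined
)
search.fit(np.vstack([X_train, X_val]), np.r_[y_train, y_val])

# Refit the selected configuration on the training cohort only.
best_model = LogisticRegression(solver="liblinear", random_state=12, **search.best_params_)
best_model.fit(X_train, y_train)
print(search.best_params_)
```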
Logistic Regression (LR)
LR is a conventional approach for modeling the relationship between a binary response variable and several covariates by estimating probabilities. It can be written as $p = 1/(1 + e^{-z})$, where $z = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$ is the linear combination of the covariates estimated by the model. The response variable takes one of two values (0, no; 1, yes): it is predicted as 0 if $p < 0.5$ and as 1 otherwise.
The final optimal parameters of LR were: C = 100, random_state = 12, penalty = 'l1', solver = 'liblinear'. All other parameters were left at their scikit-learn defaults.
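The sketch below fits an LR with the reported parameters on synthetic data and checks that scikit-learn's predicted probability equals the logistic formula, with z computed from the fitted coefficients and intercept. The data are synthetic stand-ins for the CT features. With `penalty='l1'`, some coefficients may be shrunk exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=161, n_features=16, random_state=0)  # synthetic
lr = LogisticRegression(C=100, random_state=12, penalty="l1", solver="liblinear")
lr.fit(X, y)

# z = b0 + b1*x1 + ... + bn*xn; p = 1 / (1 + exp(-z)).
z = X @ lr.coef_.ravel() + lr.intercept_[0]
p = 1.0 / (1.0 + np.exp(-z))
assert np.allclose(p, lr.predict_proba(X)[:, 1])

# Class 1 (yes) when p >= 0.5, otherwise class 0 (no).
pred = (p >= 0.5).astype(int)
print(pred[:10])
```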
Decision Tree (DT)
DT is a classification method that assigns samples to classes through a sequence of binary splits on their features. Decision nodes, branches and leaves are its three main components: the tree starts from a root node and extends through branches and child nodes to the leaves. The splitting criterion used in our model was Gini's diversity index, a measure of node impurity. The decision tree was built with the CART algorithm as implemented in the scikit-learn library for Python.
The parameters of the DT were: random_state = 0, max_features = 6, max_depth = 6, criterion = 'gini'. All other parameters were left at their scikit-learn defaults.
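Gini's diversity index of a node is $1 - \sum_k p_k^2$, where $p_k$ is the proportion of class k in the node. The sketch below computes it directly and fits a DT with the reported parameters on synthetic data; the impurity stored for the root node should equal the Gini index of the whole training set. With max_features = 6, scikit-learn evaluates only 6 randomly selected features at each split, which is one reason the random_state setting matters for reproducibility. `gini_impurity` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def gini_impurity(labels) -> float:
    """Gini's diversity index: 1 - sum_k p_k**2 over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p**2))


X, y = make_classification(n_samples=161, n_features=16, random_state=0)  # synthetic
dt = DecisionTreeClassifier(random_state=0, max_features=6, max_depth=6, criterion="gini")
dt.fit(X, y)

# The root node's stored impurity equals the Gini index of the full training set.
assert np.isclose(dt.tree_.impurity[0], gini_impurity(y))
print(gini_impurity([0, 0, 1, 1]))  # 0.5
```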
Gradient Boosting Decision Tree (GBDT)
GBDT is an ensemble classifier based on boosting that aims to improve generalization ability and robustness by combining the predictions of multiple base learners (i.e., weak decision trees). The trees are built sequentially, and each new tree is fitted to the errors (the negative gradient of the loss) of the current ensemble, so that samples that are still poorly classified receive more emphasis in subsequent iterations. A total of 15 weak decision trees were built in the GBDT model in this study (an example tree is shown in Fig. S1).
The parameters of the GBDT were: learning_rate = 0.1, max_depth = 8, random_state = 0, min_samples_leaf = 2. All other parameters were left at their scikit-learn defaults.
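The sketch below configures scikit-learn's `GradientBoostingClassifier` with the reported parameters on synthetic data. Setting n_estimators = 15 is an assumption drawn from the 15 trees reported above, since the parameter list does not include it and the scikit-learn default is 100. The loop checks the additive structure of boosting: each stage adds learning_rate times the output of one new regression tree to the model's log-odds score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=161, n_features=16, random_state=0)  # synthetic
gbdt = GradientBoostingClassifier(
    n_estimators=15,  # assumed from the 15 reported trees (library default is 100)
    learning_rate=0.1,
    max_depth=8,
    random_state=0,
    min_samples_leaf=2,
)
gbdt.fit(X, y)

# Binary classification: one regression tree per boosting stage.
print(gbdt.estimators_.shape)  # (15, 1)

# Each stage adds learning_rate * (new tree's output) to the log-odds score.
scores = [s.ravel() for s in gbdt.staged_decision_function(X)]
for i in range(1, len(scores)):
    tree = gbdt.estimators_[i, 0]
    assert np.allclose(scores[i] - scores[i - 1], gbdt.learning_rate * tree.predict(X))
```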
Performance Comparison Between Radiologists And Models
The diagnostic performance of the three ML models was compared with that of the two radiologists in the internal validation cohort.
Feature Variable Analysis
GBDT and LR showed excellent diagnostic efficiency in predicting the risk classification of gastric GISTs, owing to their high accuracy and robustness. LR is widely used to identify informative features that support decision-making, because its linear form makes the results easy to interpret. First, significant CT features were identified by univariate analysis. Second, variables with P < 0.05 were entered into the multivariate LR to identify independent risk factors for high-grade malignant GISTs. To identify the five most important features for high-grade malignant GISTs in the GBDT model, the feature_importances_ attribute of the fitted model was used (described at https://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting). According to the official documentation, individual decision trees in a GBDT model intrinsically perform feature selection by selecting appropriate split points, and this information can be used to measure the importance of each feature: the more often a feature is used in the split points of a tree, the more important it is. In scikit-learn, this importance is computed from the impurity reduction contributed by each feature and averaged over all trees in the ensemble. Finally, the features identified by LR and GBDT were compared.
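Ranking features by GBDT importance can be sketched as follows. The feature names are the CT abbreviations from the Image Analysis section, but the data, and therefore the printed ranking, are synthetic; in practice, a categorical feature such as contour would first need a numeric encoding. n_estimators = 15 is again assumed from the reported tree count.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

feature_names = ["CTU", "CTA", "CTV", "DEAP", "DEPP", "EPa", "EPv", "LD", "SD", "LD/SD",
                 "contour", "necrosis", "calcification", "ulceration", "angiogenesis", "LN"]
X, y = make_classification(n_samples=161, n_features=len(feature_names), random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=15, learning_rate=0.1, max_depth=8,
                                  random_state=0, min_samples_leaf=2).fit(X, y)

# Impurity-based importances, averaged over trees and normalized to sum to 1.
importances = gbdt.feature_importances_
top5 = np.argsort(importances)[::-1][:5]  # indices of the five largest values
for rank, idx in enumerate(top5, start=1):
    print(f"{rank}. {feature_names[idx]}: {importances[idx]:.3f}")
```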
Statistical Analysis
Continuous data are presented as mean ± SD, and categorical variables as n (%). Univariate analysis, using the t test or Mann-Whitney U test for continuous variables and Fisher's exact test for categorical variables, was performed to compare CT features between the low-grade and high-grade malignancy groups. Variables with P < 0.05 were considered significant and were included in the multivariate LR analysis. Features with P < 0.05 in the multivariate logistic regression model were considered significant predictors of high-grade malignant GISTs. Statistical analyses were performed using SPSS version 22.0 (SPSS Inc., Chicago, IL, USA). A two-sided P < 0.05 was considered statistically significant.
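The univariate screening step can be sketched with SciPy (the study itself used SPSS). The rule for choosing between the t test and the Mann-Whitney U test is not stated in the text; here a Shapiro-Wilk normality check in both groups is assumed. `scipy.stats.fisher_exact` is used on 2 × 2 tables, so this sketch handles only binary (yes/no) features; a three-level feature such as contour would need a different test. The data and `univariate_p` are hypothetical. Features passing P < 0.05 would then enter the multivariate LR.

```python
import numpy as np
from scipy import stats


def univariate_p(values, groups, categorical=False, alpha=0.05):
    """P value comparing low-grade (0) vs high-grade (1) groups for one feature."""
    values, groups = np.asarray(values), np.asarray(groups)
    low, high = values[groups == 0], values[groups == 1]
    if categorical:
        # 2x2 table for a binary feature: rows = group, columns = feature absent/present.
        table = [[np.sum(low == 0), np.sum(low == 1)],
                 [np.sum(high == 0), np.sum(high == 1)]]
        _, p = stats.fisher_exact(table)
        return p
    # Assumption: t test if both groups look normal (Shapiro-Wilk), else Mann-Whitney U.
    normal = stats.shapiro(low).pvalue > alpha and stats.shapiro(high).pvalue > alpha
    if normal:
        return stats.ttest_ind(low, high).pvalue
    return stats.mannwhitneyu(low, high, alternative="two-sided").pvalue


rng = np.random.default_rng(0)
groups = np.r_[np.zeros(50, dtype=int), np.ones(50, dtype=int)]
ld = np.r_[rng.normal(25, 5, 50), rng.normal(40, 5, 50)]  # hypothetical LD, mm
necrosis = np.r_[np.zeros(45, dtype=int), np.ones(5, dtype=int),
                 np.zeros(5, dtype=int), np.ones(45, dtype=int)]
p_values = {"LD": univariate_p(ld, groups),
            "necrosis": univariate_p(necrosis, groups, categorical=True)}
selected = [name for name, p in p_values.items() if p < 0.05]
print(selected)
```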