Case Structures and Clinical Baselines
LM data were first recorded in the SEER database in 2010, and the most recent update was in 2016. The current study therefore drew on 262,285 CRC patients diagnosed from 2010 to 2016. After applying the inclusion and exclusion criteria described above, 16,785 patients were ultimately enrolled in the internal dataset, while 326 of 8,226 CRC patients from Xijing Hospital were recruited for external validation. The data of these 326 patients were further normalized to the SEER database standard. Baseline characteristics of the internal training set, internal testing set, and external validation set are shown in Table 1.
Eleven independent clinical factors were included in the model: age at diagnosis, gender, marital status at diagnosis, primary site, tumor size, tumor grade, tumor type, N stage, CEA level, tumor deposits, and PNI (Table 2). Patients from the SEER database were categorized into an LM (-) group (16,023 patients without LM, 95.5%) and an LM (+) group (762 patients with LM, 4.5%). In the LM (+) group, age at diagnosis mostly ranged from 40 to 90 years (721/762, 94.6%). The proportion of patients diagnosed before the age of 60 was significantly higher in the LM (+) group (333/762, 43.7%) than in the LM (-) group (6,553/16,023, 40.9%; P < 0.001). The proportion of males with T1 CRC was significantly higher in the LM (+) group than in the LM (-) group (P = 0.001), whereas race showed no statistically significant difference between the two groups. Notably, LM occurred more often in single patients (167/2,611, 6.4%) than in married patients (376/8,918, 4.2%; P < 0.001). The rectum was the most common primary site in both groups, and its proportion was higher in the T1 stage than in other T stages among all CRC patients (P < 0.001). The mean tumor size of the LM (+) group (52.1 mm) was considerably larger than that of the LM (-) group (17.5 mm; P < 0.001). The LM (+) group had a markedly higher proportion of Grade II-IV tumors than the LM (-) group (92.8% vs 68%; P < 0.001). Similarly, T1 CRC patients with LM tended to have a more advanced N stage (P < 0.001). Adenocarcinoma (adenocarcinoma, NOS; adenocarcinoma in tubulovillous adenoma; and adenocarcinoma in adenomatous polyp; 12,714/16,785, 75.7%) was the most common histological category among all patients. Furthermore, positive CEA, tumor deposits, and PNI were all significantly more frequent in the LM (+) group than in the LM (-) group (P < 0.001). Additionally, the baselines of the SEER training, SEER testing, and external validation sets are shown in Table 2.
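As a rough check on how a group comparison such as the marital-status one above can be tested, the sketch below runs a Pearson chi-square test on the 2x2 table built from the reported counts (167/2,611 single and 376/8,918 married patients with LM). The source does not state which test produced its P values, so the choice of test and the helper name `chi_square_2x2` are assumptions made for illustration.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Counts taken from the text: patients with LM out of all single / married patients.
single_lm, single_total = 167, 2611
married_lm, married_total = 376, 8918

stat, p = chi_square_2x2(
    single_lm, single_total - single_lm,
    married_lm, married_total - married_lm,
)
print(f"chi2 = {stat:.2f}, p = {p:.2e}")
```

With these counts the statistic lies well above 10.83, the critical value for P = 0.001 at one degree of freedom, which is in line with the reported P < 0.001.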
Parameter Tuning in Our Models
We trained the LGBM with a maximum depth of 5, a learning rate of 0.01, 240 base learners, 16 leaves, and 128 max bins. For RF and CART, we also selected 5 as the maximum depth of the base trees. For KNN, 200 neighbors performed best. For MLP, we ultimately selected a learning rate of 0.01, 300 epochs, and one hidden layer, with the Adam optimizer and the ReLU activation function. For SVM, we settled on a C value of 0.01 combined with a kernel smoothing parameter of 0.0001. Lastly, every bagging model comprised 10 base models trained with the same algorithm but on different data. The final stacking model consists of seven bagging models, which output probabilities, and a GNB as the meta-classifier.
Evaluation of Models
To evaluate the performance of the constructed models, we plotted ROC and PR curves during model training. On internal validation, all models showed strong predictive ability (AUC values > 0.94). By incorporating the seven single models, the stacking model reached an AUC of 0.9631 (Figure 2A). Except for the GNB model, nearly all models attained relatively favorable AP values; notably, the AP of the stacking model reached 0.693 (Figure 2B). Interestingly, performance was even better on the external validation set: all models except the MLP showed very high predictive value, and the stacking model achieved an AUC of 0.992 and an AP of 0.811 (Figure 2C, D).
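For readers less familiar with the two summary scores, the sketch below computes AUC and AP from labels and predicted probabilities in plain Python. The function names and the four-sample toy data are illustrative assumptions, and the AP function assumes no tied scores. A model that ranks patients at random has an expected AP close to the prevalence of the positive class, so AP values should be read against the LM rate of the cohort rather than against 0.5.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """AP as the mean of the precision values at the rank of each positive sample.

    Assumes all scores are distinct.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / true_positives


# Toy data (hypothetical): 1 = LM (+), 0 = LM (-).
y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]

print(f"AUC = {roc_auc(y_true, scores):.3f}")
print(f"AP  = {average_precision(y_true, scores):.3f}")
```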
Additionally, we used confusion matrices to evaluate the models; the predictive outcomes for both the internal testing set and the external validation set are shown in Table 3. LGBM produced fewer false negatives (FN) and false positives (FP) than the other models in both sets. The stacking model was able to identify nearly all LM (+) patients in both sets. Detailed values of AUC, sensitivity, specificity, precision, NPV, FDR, accuracy, AP, F1 score, and Matthews correlation coefficient of each model for the internal testing set and the external validation set are listed in Table 4 and Table 5, respectively. Five single models reached an accuracy of 0.95, among which LGBM displayed the highest accuracy (0.9657). The specificity of MLP and the sensitivity of GNB were the highest among the seven single models. Overall, the stacking model demonstrated the best AUC and sensitivity, together with excellent precision, NPV, FDR, accuracy, AP score, F1 score, and Matthews correlation coefficient, indicating that it has clinical value for early screening of LM in CRC patients.
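All threshold-based metrics listed above follow from the four confusion-matrix counts. The sketch below shows the standard definitions; the function name and the example counts are hypothetical and are not taken from Table 3, and the code assumes every denominator is non-zero.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Derive the reported screening metrics from the four confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),   # share of LM (+) patients detected
        "specificity": tn / (tn + fp),   # share of LM (-) patients cleared
        "precision": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "fdr": fp / (tp + fp),           # false discovery rate = 1 - precision
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Hypothetical counts for illustration only.
metrics = confusion_metrics(tp=8, fp=2, tn=85, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a cohort where only about 4.5% of patients have LM, labelling every patient LM (-) would already give an accuracy of about 0.955, which is why sensitivity, F1 score, and the Matthews correlation coefficient are worth reading alongside accuracy.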
Furthermore, using survival status and survival time from the SEER database, we plotted Kaplan-Meier (K-M) curves for the testing set. As is widely acknowledged, LM is an unfavorable prognostic indicator for CRC patients (Figure 3A). Likewise, LM status predicted by the stacking model stratified the outcomes of T1 CRC patients in a manner similar to actual LM status (Figure 3B).
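The K-M curves mentioned above are built from the product-limit estimator. The sketch below is a minimal plain-Python version, assuming follow-up times in months and an event flag of 1 for death and 0 for censoring; the function name and the sample data are hypothetical. To compare groups as in Figure 3, the same function would be applied separately to patients with and without (actual or predicted) LM.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time of each patient
    events : 1 if the patient died at that time, 0 if censored
    Returns a list of (time, survival_probability) at each distinct death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= leaving
    return curve


# Hypothetical follow-up data (months) for one group of patients.
months = [5, 8, 8, 12, 15]
died = [1, 1, 0, 1, 0]

for t, s in kaplan_meier(months, died):
    print(f"t = {t:>2} months, S(t) = {s:.2f}")
```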
Comparison of Significance of Each Factor
In all single models, tumor size, preoperative CEA level, tumor deposits, N stage, histology, and PNI played a vital role in predicting LM in T1 CRC. Although the AI models showed desirable performance, the individual influence of each factor on the result and the underlying relationships between these factors remained unknown. Hence, we calculated and quantified the importance of each factor used in the AI models (Figure 4). Tumor size, preoperative CEA level, tumor deposits, and N stage were the top four predictors across all models. Notably, tumor size was the most critical factor in nearly all models.
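The text does not say how factor importance was computed. One common model-agnostic option is permutation importance, sketched below in plain Python: each feature column is shuffled in turn, and the resulting drop in accuracy is taken as that feature's importance. The technique, the function name, the toy threshold model, and the data are all assumptions for illustration and may differ from what was used for Figure 4.

```python
import random


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled.

    predict : function mapping one feature row (a list) to a predicted label
    X       : list of feature rows (lists of equal length)
    y       : list of true labels
    """
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(row) == label for row, label in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values.
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(shuffled))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy setup (hypothetical): column 0 is tumor size in mm, column 1 is pure noise.
feature_names = ["tumor_size_mm", "noise"]
X = [[12, 1], [18, 0], [25, 1], [28, 0], [35, 1], [41, 0], [52, 1], [60, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def toy_model(row):
    # Predicts LM (+) when tumor size exceeds a made-up 30 mm threshold.
    return int(row[0] > 30)


scores = permutation_importance(toy_model, X, y)
for name, score in zip(feature_names, scores):
    print(f"{name}: {score:.3f}")
```

Because the toy model ignores the noise column, shuffling it leaves accuracy unchanged and its importance is zero, while shuffling tumor size degrades the predictions.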