In this study, we developed and validated a promising ML architecture for predicting the three-class occurrence time (with 3- and 5-year cutoffs) of four oncological outcomes, and we screened the important variables for each ML model, categorized by outcome. The four outcomes were patient death, tumor recurrence/distant metastasis, tumor recurrence, and tumor distant metastasis. This architecture represents a simple, practical, and easily accessible tool that clinicians can refer to when selecting treatments. In addition, our architecture tolerates heterogeneous patients well and does not require clear patient medical histories, which lowers the threshold for use. Our work differs from previous studies: we discretized the survival times, predicted them as multiclass endpoints, and interpreted patients' prognoses longitudinally through our results. Our ML models were designed around specific oncological outcomes to predict the likely (multiclass) occurrence time. In addition, we screened important indicators for each survival time, some of which are not commonly taken as leading predictors of CRC patients' prognoses; this provides new insights for pathologists, basic science researchers, and even pharmacologists. Moreover, the adaptability and interpretability of our architecture promote its application in hospitals at different levels. Our work also demonstrated the feasibility of applying ML models to a large number of heterogeneous CRC patients.
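The three-class discretization of survival time described above can be sketched as follows. The 36- and 60-month cutoffs map to the 3- and 5-year thresholds from the text; the function name and class labels are our illustration, not the paper's exact encoding.

```python
def bin_survival_time(months):
    """Discretize a survival time (in months) into the three classes
    used by the models: under 3 years, 3-5 years, over 5 years.
    Cutoffs correspond to the 3- and 5-year thresholds in the text;
    label strings are illustrative."""
    if months < 36:
        return "<3y"
    elif months <= 60:
        return "3-5y"
    else:
        return ">5y"
```

Each outcome's observed event time would be passed through such a binning step before model training, turning the survival prediction into a three-class classification problem.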
Although several genetic and molecular markers have been proven to correlate with patient prognoses [25, 29], we did not select them as potential variables, because we aimed to build a clinically generalizable architecture that could be used in data-poor situations; the indicators we selected were clinically accessible and showed small heterogeneity among patients. In addition, to avoid selection bias and limitations, we sidestepped the TNM stage-centric impasse by inputting a large number of potential variables, allowing the ML algorithms to screen for the variables that performed best. Furthermore, there were no missing values in our database, which prevented bias caused by improper imputation of missing values. We controlled potential confounding factors as much as possible: we randomly split the patients into training and testing sets to avoid selection bias, and we compared the baseline characteristics between the two sets and found almost no significant differences. Therefore, we did not further explore the potential confounding factors, although related investigations are valuable and should be conducted in the future.
We deliberately chose patients with a clear preoperative medical history who underwent curative initial treatment, so we excluded stage I, II, and III CRC patients who had received neoadjuvant chemoradiotherapy (which may obscure the history). Patients with stage IV CRC who were not eligible for radical surgery were also excluded. However, the history of postoperative adjuvant chemoradiotherapy in our patients was unclear. Consequently, the model may serve as a rough reference for physicians at higher-level hospitals who are redesigning treatment strategies for patients referred from lower-level hospitals with unclear postoperative radiotherapy and chemotherapy histories. Treatment options for these patients are difficult to determine, and oncologists can rarely obtain references from previous studies that stratify patients by chemotherapy regimen. Moreover, to avoid bias, we excluded patients without endpoint data. When applying our architecture, patients predicted not to have outcomes would be classified as greater than 5 years (i.e., any oncological recurrence occurring beyond 5 years). To further avoid bias caused by using time durations as endpoints, we cautiously excluded patients with long follow-up intervals and patients with confirmed noncancer-specific deaths. In addition, the number of patients with tumor recurrence was the smallest; therefore, after the strict 9:1 split, the sample size for testing the RFS model remained small.
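The random 9:1 train/test split described above can be sketched in a few lines. This is a minimal illustration; the function name, seed, and exact splitting code are our assumptions, not the study's implementation.

```python
import random

def split_patients(patients, test_frac=0.1, seed=42):
    """Randomly split a patient list into training and testing sets.
    The 9:1 ratio mirrors the text; the seed is illustrative."""
    shuffled = patients[:]                      # copy, keep input intact
    random.Random(seed).shuffle(shuffled)       # random assignment
    n_test = max(1, round(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```

After such a split, baseline characteristics of the two sets would be compared (as in the previous paragraph) to confirm the absence of significant differences.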
To better predict the three-class survival time for the various oncological outcomes, we configured and optimized hyperparameters wherever possible. To evaluate model fit, we chose classification metrics suited to the nonparametric ML algorithms: the C statistic (we used the AUC to avoid the influence of the threshold value) and AP. In the variable-screening process, given the sample size, the number of variables (more than 20), and other factors, we chose methods based on multiple regression models. Evaluation indicators in the model selection process included the error sum of squares (SSE), MSE, and so on. However, the SSE value was not meaningful in this experiment (SSE inevitably increases as the sample size increases), so we chose the MSE as the evaluation indicator. Since this experiment involved regression prediction based on small-sample data, the models did not overfit. Consequently, considering development costs, we kept the system's default configuration (treating cases with no solution or only a local optimum as convergence failures). We consider the lower AUC for OS a reasonable result given the biological complexity and the small sample size. Furthermore, to reduce bias, the computer experts were blinded to the meaning of each indicator when building the ML models.
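The preference for MSE over SSE stated above can be demonstrated directly: SSE scales with the number of samples, while MSE is normalized by it and so stays comparable across datasets of different sizes. A minimal sketch (function names are ours):

```python
def sse(y_true, y_pred):
    """Error sum of squares: grows as the sample size grows."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def mse(y_true, y_pred):
    """Mean squared error: SSE divided by n, so it is comparable
    across datasets of different sizes."""
    return sse(y_true, y_pred) / len(y_true)
```

Doubling a dataset doubles its SSE but leaves its MSE unchanged, which is why SSE is meaningless as a model-selection criterion when sample sizes differ.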
There were also performance differences between training and testing for each of the four classifiers. When the computational load was small, the advantages of LR were more obvious; when the sample size and the number of variables were large, the performance of LR still degraded due to overfitting even with regularization. Therefore, in this experiment, LR performed best on the RFS training set, while its OS prediction performance was relatively reduced. In the DFS prediction in our experiments, because the DFS sample was not Gaussian distributed, the predictive performance of LDA [30] was the best, exceeding the other three ML algorithms [31]. When the distance metric was configured as the Euclidean distance, the KNN training-set performance was the best of the four ML algorithms for DFS prediction in our work. When SVM predicted the DFS training set, the algorithm was very robust, and since the DFS dataset is of small-to-medium size, SVM performed better than the other algorithms with the kernel function at its default value. The results obtained by the four ML models on the training set were approximately 5% higher than those on the test set, and the overall gap was within a reasonable range [16].
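The four algorithm families compared above (LR, LDA, KNN, SVM) can be fitted side by side with scikit-learn. This is only a sketch: the hyperparameters shown (`max_iter`, the neighbor count, the default RBF kernel) are illustrative defaults, not the paper's tuned settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def fit_four_classifiers(X, y):
    """Fit the four algorithm families compared in the text on one
    dataset. Hyperparameters here are defaults/illustrative."""
    models = {
        "LR": LogisticRegression(max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "KNN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
        "SVM": SVC(),  # default (RBF) kernel, as mentioned in the text
    }
    return {name: m.fit(X, y) for name, m in models.items()}
```

Comparing each fitted model's score on the training and test sets then exposes the training/testing gap discussed above.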
Taking survival time as a categorical variable and making more precise predictions to obtain an approximate time for the occurrence of oncological outcomes also indirectly reflects the potential application of our models in precision medicine. A more accurate prediction of possible prognoses would translate into more precise formulations of treatment therapies and patient management strategies. Extending the survival time is the shared goal of clinicians and oncology patients. Quantifying patient outcomes aids in shared decision making [32]. Because of the heterogeneity of CRCs, physicians and patients must seriously consider the tradeoffs between adverse effects and benefits [33] when choosing a treatment strategy. It is possible to improve outcomes by closer follow-up or the administration of additional chemoradiotherapy to patients who are predicted to have poorer prognoses. Consequently, we suggest that patients who tend to have a shorter DMFS receive prophylactic chemotherapy or regional radiotherapy for the common metastatic sites of CRC described by Jiang B et al. [34]. Moreover, the identification of patients with better prognoses could reduce the cost of medical care and improve the level of humanistic care by reducing the psychological burden on patients and their families. Therefore, predictive tools such as our architecture are urgently needed in the clinic.
However, referring to results output by models whose vital parameters are unclear when managing patients is not always acceptable to clinicians and patients [35, 36]. The interpretability of models is vital, especially in biomedicine [37, 38]. To turn models with opaque important parameters into models with clear ones, we screened out the corresponding predictors and showed their importance order for the different outcomes. TNM stage, the primary indicator for chemoradiotherapy decisions, was selected in the OS, DFS, and DMFS models, which made our models more credible. Moreover, indicators widely found to correlate with prognoses, such as PNI [39–41], pathological type [42–45], and tumor differentiation grade [46], were also selected, further supporting the credibility of our architecture. One potential benefit of using ML models is that important variables are identified while less critical parameters are ignored. Several predictors not widely used as important predictors of CRC patients' prognoses were additionally included in our models and provide new insights into predicting prognoses. The levels of CRP and Ki-67 were shown to affect prognoses, consistent with studies showing that high serum CRP levels are associated with higher postoperative complication rates [47, 48] and that Ki-67 levels reflect the proliferative capacity of cells [49], especially tumor cells [50]. Surprisingly, LVI was only modestly predictive. Our models also identified some predictors not previously considered to be directly linked to poor prognosis (unifocal vs. multifocal lesions and surgery vs. laparoscopy) [51–54]; these factors are more likely directly related to surgical trauma than to survival time. For the RFS model, two chronic diseases, DM and CHD, were selected together with age, possibly due to the insufficient sample size.
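Screening and ranking variables by importance can be illustrated with a generic, model-agnostic permutation check: shuffle one feature and measure how much the model's score drops. This is only an illustrative stand-in; the paper's actual screening relied on regression-based methods, and all names below are ours.

```python
import random

def permutation_importance(predict, score, X, y, col, n_repeats=10, seed=0):
    """Estimate one feature's importance as the mean drop in score
    when that column is shuffled. Model-agnostic sketch only."""
    rng = random.Random(seed)
    base = score(predict(X), y)
    drops = []
    for _ in range(n_repeats):
        Xp = [row[:] for row in X]              # copy the rows
        col_vals = [row[col] for row in Xp]
        rng.shuffle(col_vals)                   # permute just this feature
        for row, v in zip(Xp, col_vals):
            row[col] = v
        drops.append(base - score(predict(Xp), y))
    return sum(drops) / n_repeats
```

A feature the model actually depends on (e.g., TNM stage in the OS model) would show a large drop, while an ignored feature shows a drop near zero, yielding the kind of importance ordering discussed above.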
In addition, our results indirectly draw attention to elderly oncological patients with baseline diseases.
Our findings showed the formidable predictive power of ML methods, particularly for heterogeneous diseases stratified by outcome. ML has unique value in clinical applications: it can guide patient management, improve patient outcomes, and tailor treatment regimens, especially when resources are scarce (when only clinicopathological and surgical variables are available for analysis). The Cox proportional-hazards model is often considered similar to the multivariable linear models used in ML: computationally, the functional form used is that of multiple LR, and the optimized parameter vector β in the Cox proportional-hazards model is consistent with that of multiple LR. However, compared with the Cox proportional-hazards model [55], ML still has advantages. When selecting vital parameters, the Cox proportional-hazards model typically reports independent prognostic factors and compares their predictive value only indirectly, whereas ML identifies important factors and compares their importance more reliably and directly. When building models and determining their performance, the Cox proportional-hazards model typically fixes a specific time point and builds binary/multiclass risk stratification around it, whereas our ML models predicted the three-class occurrence time of oncological outcomes, that is, patients' survival times from a longitudinal perspective. Moreover, the number of variables at which the Cox proportional-hazards model performs best is smaller than for ML, which can accommodate very large numbers of variables [56, 57]. In addition, the ML and Cox proportional-hazards models also differ in the meaning of the AUC: in this article, the AUC is an evaluation index for the ML models (like AP) and differs from the AUC obtained from the traditional Cox proportional-hazards model.
In our article, AP = (TP + TN)/(TP + TN + FP + FN) refers to the percentage of correct predictions in the total sample, specifically the ratio of correctly predicted three-class survival times (for OS, DFS, RFS, and DMFS) to the corresponding sample size of each dataset. In the ML models, we binarized the three classes one-vs-rest, and the AUC was then computed in the conventional way. The final AUC value was obtained at the end of the ML pipeline [58] as the average of the three per-class AUCs for each dataset [59]. However, although we obtained encouragingly high prediction accuracy and transparent important variables, more progress is needed before ML can be fully relied upon. In addition, in clinical practice, traditional performance measures such as the AUC must be translated into medically relevant measures to elucidate the patient-centric value of ML models. ML still has a way to go.
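The AP formula and the averaged one-vs-rest AUC described here can be sketched in a few lines; the helper names and toy class labels below are our own illustration.

```python
def ap(y_true, y_pred):
    """AP as defined in the text: correct predictions over the total
    sample (i.e., overall accuracy across the three classes)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc_binary(scores, labels):
    """Threshold-free AUC for one binarized (one-vs-rest) class,
    via the Mann-Whitney rank formulation."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(prob_rows, y_true, classes):
    """Binarize each class one-vs-rest, then average the per-class
    AUCs, matching the averaging described in the text."""
    aucs = []
    for j, c in enumerate(classes):
        scores = [row[j] for row in prob_rows]
        labels = [t == c for t in y_true]
        aucs.append(auc_binary(scores, labels))
    return sum(aucs) / len(aucs)
```

`prob_rows` holds one row of per-class predicted probabilities per patient; the final reported AUC is the mean of the three one-vs-rest values.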
Moreover, the limitations of our study must be noted. First, the sample size used for the ML models was not large (especially for RFS). When the data were input into the ML models for parameter optimization, the sensitivity of parameter adjustment could not be estimated because of the small sample size. Second, as a general dilemma of ML [37], the uniformity of the samples in the data could not be assessed, which could affect the final results. Furthermore, this study was based on a retrospective analysis. However, the patient data came from a well-conceived and well-characterized cohort, which adds to the credibility of our results; thus, this study can serve as the basis for subsequent prospective studies. In addition, postoperative treatment information, such as specific radiotherapy and chemotherapy regimens as well as detailed surgical methods, was not available in our database.
We believe that subsequent research could improve the accuracy of individual survival time prediction by employing other techniques, obtaining larger sample sizes, improving follow-up accuracy, and so on. Furthermore, detailed studies on genomics and chemoradiotherapy regimens should be added. Our next goal is to develop the predictive system as an app and deploy it in hospital systems.