In this study, we developed an end-to-end deep learning proportional hazards regression model (Deep-Surv) that predicts survival after surgical resection of stage II-III CRC from CT images. The training and validation datasets were constructed using data from two centers. We quantitatively evaluated the ability of clinical features, radiomics features, and deep learning features to predict DFS (Table 2). Univariate analysis (Table 2) suggested that age, T stage, N stage, differentiation, PNI, LNR, and the outputs of Radiomics-Surv and Deep-Surv were all candidate independent factors, and this was further confirmed in the multivariate analysis. Survival models built from these independent prognostic factors and assessed with multiple evaluation criteria on the training and validation datasets confirmed that Deep-Surv improved prognostic prediction compared with Radiomics-Surv and CS (C-index: training, 0.84 vs 0.70 vs 0.63; validation, 0.76 vs 0.67 vs 0.62; AUC: training, 0.82 vs 0.69 vs 0.61; validation, 0.77 vs 0.62 vs 0.56). This result also illustrates that the deep learning method can extract more prognostic information than semantic phenotypic features and can model more complex relationships between features. In the secondary analysis, survival prediction was recast as a classification problem that partitioned the training and validation datasets into high- and low-risk groups to increase the general validity of the study results. The deep learning-based classifier performed better on both the training and validation datasets (training: HR, 5.83 [95% CI, 3.532–9.692], P < 0.0001; validation: HR, 3.63 [95% CI, 2.302–5.709], P < 0.0001). Moreover, the separation between high- and low-risk groups was retained in subgroups defined by clinical characteristics; as shown in Fig. 5, significant stratification was observed in the T stage (T1-T3: HR, 3.106 [95% CI, 1.37–7.039], P < 0.05; T4a: HR, 2.309 [95% CI, 1.052–5.069], P < 0.05) and LNR (< 50%: HR, 2.24 [95% CI, 1.227–4.087], P < 0.05) subgroups. The effectiveness and robustness of Deep-Surv's output as an independent prognostic factor were thus further validated.
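As a generic illustration of the C-index used above (not the study's actual evaluation code), Harrell's concordance index can be sketched as the fraction of comparable patient pairs in which the higher predicted risk corresponds to the shorter observed survival time; all variable names here are hypothetical:

```python
import numpy as np

def concordance_index(times, events, risk_scores):
    """Harrell's C-index: fraction of comparable pairs in which the
    patient with the higher risk score has the shorter survival time."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # a pair (i, j) is comparable only if patient i had an
            # observed event strictly before patient j's follow-up time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties count half
    return concordant / comparable

# toy example: risk scores perfectly anti-ordered with survival time
times = np.array([5.0, 10.0, 15.0, 20.0])
events = np.array([1, 1, 1, 0])   # last patient censored
risks = np.array([4.0, 3.0, 2.0, 1.0])
print(concordance_index(times, events, risks))  # → 1.0 (perfect ranking)
```

A C-index of 0.5 corresponds to random ranking, which is why values such as 0.84 versus 0.63 indicate a substantively better risk ordering.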
TNM is the most commonly used staging system for CRC and is the current benchmark for guiding treatment of patients with CRC. For personalized treatment, however, studies have shown that TNM has drawbacks: it is based mainly on specialist opinion and draws on a narrow selection of features, which makes its staging performance controversial (Li et al. 2018). In particular, stage IIB/C (T4a/b N0) CRC has a significantly worse prognosis than stage IIIA (T1-2 N1) disease, which reduces the accuracy and reliability of TNM in clinical application (Li et al. 2014). Our study demonstrated that Deep-Surv was an independent prognostic factor with the ability to stratify risk within staging subgroups (T and N stage). Deep-Surv could therefore serve as a reference indicator to complement TNM: first determine the T and N stage, and then use Deep-Surv to further risk-stratify within each stage to improve clinical decision making.
Previous radiomics studies have shown that CT imaging features can predict disease survival (Ji et al. 2019; Dong et al. 2019). However, radiomics features are extracted with human-defined quantitative formulas and are therefore susceptible to human bias and information redundancy (Aerts et al. 2014; Berenguer et al. 2018). Their acquisition also relies on precise delineation of the lesion by a radiologist, which is costly in practice. Our Deep-Surv requires only simple interactions rather than a precise outline of the tumor, and the neural network learns the prognosis-related features on its own, saving labor costs. Survival analyses of radiomics features commonly rely on the Kaplan-Meier (KM) method and Cox regression. These methods have two drawbacks: prognostic factors may have time-dependent effects on tumor prognosis, and the linear Cox model cannot capture nonlinear relationships among features. As a result, such survival analyses may lose some prognostic information and reduce accuracy.
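For reference, the KM estimator mentioned above can be sketched in a few lines; this is a generic textbook illustration with hypothetical variable names, not code from this study:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function S(t).

    Returns (time, survival) pairs at each distinct event time.
    events: 1 = event observed, 0 = censored.
    """
    surv, curve = 1.0, []
    for t in np.unique(times):
        d = int(np.sum((times == t) & (events == 1)))  # events at t
        n = int(np.sum(times >= t))                    # at risk just before t
        if d > 0:
            surv = float(surv * (1 - d / n))  # multiply in the step at t
            curve.append((float(t), surv))
    return curve

times = np.array([1.0, 2.0, 3.0, 4.0])
events = np.array([1, 1, 0, 1])  # third patient censored at t = 3
print(kaplan_meier(times, events))  # survival steps down to 0.75, 0.5, 0.0
```

Note how the censored patient at t = 3 contributes to the risk set at earlier times but produces no step in the curve, which is the key difference from naively discarding censored cases.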
Our model has several strengths. First, it is based on deep learning and makes no assumptions about temporal dependencies: time is input to the model as a one-dimensional vector, and Deep-Surv learns the complex relationships between features autonomously through 3D convolutions and its loss function. Many studies have validated the role of deep learning in survival prediction. Kather et al. (2019) illustrated that a CNN can assess the human tumor microenvironment and predict prognosis directly from CRC histopathological images. Kim et al. (2020) presented a deep learning model for chest CT that predicted disease-free survival in patients undergoing lung surgery. Zhang et al. (2020) developed a deep learning risk prediction model for overall survival in patients with gastric cancer. Our study further validates the effectiveness of deep learning on CRC CT images. Second, our model used two different classes of CT input: non-enhanced CT and contrast-enhanced CT. Enhanced CT requires contrast injection and visualizes blood flow in the diseased tissue; combined with non-enhanced CT, it provides more accurate information about the lesion. Because the two classes of CT emphasize different aspects of the disease, clinicians combine them in practice when diagnosing patients. Third, CT images contain a large amount of information, not all of which is relevant to survival prediction. The attention mechanism automatically selects the features in CT that are related to prognosis, and attention mechanisms have been shown to be effective at selecting relevant features in previous studies (Saillard et al. 2020).
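The loss that DeepSurv-style networks typically minimize is the negative Cox partial log-likelihood; the following is a minimal NumPy sketch under the assumption of no tied event times (Breslow/Efron corrections omitted), and it is an illustration of the general technique rather than this study's implementation:

```python
import numpy as np

def cox_neg_log_partial_likelihood(risk, times, events):
    """Negative Cox partial log-likelihood, averaged over uncensored patients.

    risk   : model output (log hazard) per patient
    times  : observed follow-up times
    events : 1 = event observed, 0 = censored
    Assumes no tied event times.
    """
    order = np.argsort(-times)                    # sort by descending time
    risk, events = risk[order], events[order]
    # the risk set for patient i is everyone still under observation at
    # time_i, i.e. a cumulative sum over the descending-time ordering
    log_risk_set = np.log(np.cumsum(np.exp(risk)))
    uncensored = events == 1
    return float(-np.mean((risk - log_risk_set)[uncensored]))

risk_flat = np.array([0.0, 0.0, 0.0])     # uninformative predictions
risk_good = np.array([-1.0, 0.0, 1.0])    # higher risk for shorter survival
times = np.array([3.0, 2.0, 1.0])
events = np.array([1, 1, 1])
loss_flat = cox_neg_log_partial_likelihood(risk_flat, times, events)
loss_good = cox_neg_log_partial_likelihood(risk_good, times, events)
```

Because the loss only compares each patient's risk against their risk set, the network can be trained end-to-end with gradient descent without specifying a baseline hazard, which is what allows nonlinear 3D-convolutional features to replace the linear predictor of classical Cox regression.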
Our study also had limitations. First, although the model accounts for the differences between the two classes of CT images, it could be further combined with other radiological modalities, such as MRI. Second, our model was validated in a dataset of patients with similar characteristics, and the small data volume may limit the statistical power to distinguish performance between models. Third, the 3D convolutional network used in our model is often referred to as a “black box” (Nicholson Price 2018); lack of interpretability has long been a drawback of deep learning models.