The study included 1,360 patients who underwent oncological resection between 2001 and 2020, with 117 variables recorded per patient. Table 2 summarizes all patient characteristics; with the exception of obvious outcome parameters, these characteristics were evaluated as potentially relevant predictors for the machine learning models.
First, we tested the machine learning algorithms introduced above; the corresponding test statistics are shown in Figure 1. The standard Cox proportional hazards (CPH) model (Figure 1a) reached a c-index of 0.645 and an integrated Brier score (IBS) of 0.221. The prediction error curve is depicted in the last column of Figure 1, with the IBS corresponding to the integral of the Brier score over time. The non-linear CPH model (see Figure 1b) improved on these statistics with a c-index of 0.681 and an IBS of 0.194. Note that although both CPH models initially achieve comparably good scores, the prediction error reaches the critical limit of 0.25 as the timespan extends to 10 years after cancer diagnosis.
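As an illustration only, the following minimal sketch shows how such a baseline CPH model and its c-index and IBS could be computed with the scikit-survival package; the data objects (X_train, X_test, y_train, y_test), the structured-array field names and the evaluation time grid are assumptions and do not reproduce the actual analysis pipeline of the study.

```python
# Minimal sketch (assumed setup): baseline Cox PH model evaluated by c-index and
# integrated Brier score with scikit-survival. y_train/y_test are structured
# arrays with the fields "event" and "time" (e.g. built via sksurv.util.Surv).
import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored, integrated_brier_score

cph = CoxPHSurvivalAnalysis().fit(X_train, y_train)

# Harrell's concordance index on the held-out test set
c_index = concordance_index_censored(
    y_test["event"], y_test["time"], cph.predict(X_test))[0]

# Integrated Brier score on an illustrative follow-up grid (months, up to ~10 years)
times = np.arange(6, 120, 6)
surv_funcs = cph.predict_survival_function(X_test)
preds = np.asarray([[fn(t) for t in times] for fn in surv_funcs])
ibs = integrated_brier_score(y_train, y_test, preds, times)
print(f"c-index = {c_index:.3f}, IBS = {ibs:.3f}")
```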
Compared to the first two models, the c-index could not be improved by linear multi-task logistic regression (see Figure 1c, c-index = 0.673, IBS = 0.229). However, the neural multi-task logistic regression achieved a clearly better reconstruction of the actual survival function, with a root mean squared error (RMSE) between the actual and predicted survival curves of 9.188 (see Figure 1d, c-index = 0.672, IBS = 0.254). Note that the prediction errors of both regression methods likewise exceed the cut-off value of 0.25 that is generally considered an acceptable limit. The Gompertz model, as an example of a parametric model, was not able to markedly improve the test statistics and showed relatively weak results (see Figure 1e, c-index = 0.677, IBS = 0.194).
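The exact definition behind the reported RMSE values is not restated here; one plausible reading, sketched below under that assumption, compares the cohort-level Kaplan-Meier curve with the cohort-averaged predicted survival curve on a common time grid, with probabilities scaled to percent.

```python
# Minimal sketch (assumed definition): RMSE between the observed Kaplan-Meier
# survival curve and the cohort-averaged predicted survival curve, in percent.
import numpy as np
from sksurv.nonparametric import kaplan_meier_estimator

def survival_curve_rmse(model, X_test, y_test, times):
    km_times, km_surv = kaplan_meier_estimator(y_test["event"], y_test["time"])
    # evaluate the Kaplan-Meier step function on the common time grid
    km_on_grid = np.asarray(
        [km_surv[km_times <= t][-1] if np.any(km_times <= t) else 1.0 for t in times])
    # cohort-averaged predicted survival curve on the same grid
    surv_funcs = model.predict_survival_function(X_test)
    pred_mean = np.mean([[fn(t) for t in times] for fn in surv_funcs], axis=0)
    return np.sqrt(np.mean((100.0 * km_on_grid - 100.0 * pred_mean) ** 2))
```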
In contrast, all three survival forest methods outperformed the CPH models, with the random survival forest (RSF) being the strongest prediction model (see Figure 1h, c-index = 0.736, IBS = 0.166). The Extra Survival Trees (see Figure 1g, c-index = 0.736, IBS = 0.167) and the Conditional Survival Trees (see Figure 1f, c-index = 0.726, IBS = 0.166) showed similarly strong results but did not outperform the RSF, even after hyperparameter optimization. The accuracy of the RSF prediction is further underlined by the direct comparison between the actual and predicted survival functions shown in the second column, where the RMSE of the RSF was 6.224.
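A minimal sketch of how the two forest variants available in scikit-survival could be fitted is given below; the hyperparameters are illustrative placeholders rather than the tuned values of the study, and conditional survival trees (implemented in other packages) are not covered by this sketch.

```python
# Minimal sketch (assumed setup): survival forests with scikit-survival.
# Hyperparameters are illustrative placeholders, not the tuned study values.
from sksurv.ensemble import RandomSurvivalForest, ExtraSurvivalTrees
from sksurv.metrics import concordance_index_censored

rsf = RandomSurvivalForest(n_estimators=500, min_samples_leaf=10,
                           n_jobs=-1, random_state=0).fit(X_train, y_train)
xst = ExtraSurvivalTrees(n_estimators=500, min_samples_leaf=10,
                         n_jobs=-1, random_state=0).fit(X_train, y_train)

# The same c-index / IBS evaluation as in the CPH sketch applies, e.g.:
c_index_rsf = concordance_index_censored(
    y_test["event"], y_test["time"], rsf.predict(X_test))[0]
```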
To establish a more practical and compact approach, we selected the most important features identified by the RSF. Table 3 lists the permutation-based importance of the 20 most important parameters identified by the RSF model. Although the individual weights of the predictors are not remarkably high (with the exception of lymph node ratio), the RSF model still builds its prediction on all variables of the dataset. The weights are to be interpreted as follows: removing the information carried by lymph node ratio from the model, for instance, would change the resulting c-index by 0.118, within the specified 95% confidence range.
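Assuming the predictors are held in a pandas DataFrame, a permutation-importance ranking of this kind could be obtained as sketched below; scikit-survival estimators expose a score() method returning Harrell's c-index, which sklearn's permutation_importance uses by default.

```python
# Minimal sketch (assumed setup): permutation-based importance of the fitted RSF.
# The drop in the c-index when a column is shuffled is that column's importance.
import pandas as pd
from sklearn.inspection import permutation_importance

result = permutation_importance(rsf, X_test, y_test, n_repeats=15, random_state=0)
importance = (pd.DataFrame({"importance": result.importances_mean,
                            "std": result.importances_std},
                           index=X_test.columns)
              .sort_values("importance", ascending=False))
print(importance.head(20))  # the 20 highest-ranked predictors
```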
The time-dependent area under the curve (AUC) is shown separately for the six most important predictors. It is remarkable that factors such as duration of intensive care, postoperative complications and intraoperative blood loss lose their predictive value rapidly as time passes. In contrast, the significance of lymph node ratio increases postoperatively and remains stable at an AUC above 0.65 for more than 5 years after cancer diagnosis (see Figure 2a).
Just as the predictors show a time-dependent importance, the prediction models show a time-dependent accuracy. Of note, the RSF algorithm outperforms the CPH model on the time-dependent scale with an AUC of 0.821, as opposed to 0.720 for the CPH model (see Figure 2b). The remaining machine learning models also showed better time-dependent performance than the CPH model but were not able to outperform the RSF algorithm. While all models predicted survival with a very high AUC above 0.9 in the first months, the predictions generally decreased in accuracy as time progressed. Long-term survival, however, was predicted most successfully by the RSF model.
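Time-dependent AUC curves of this kind can be computed for individual predictors as well as for whole models; a minimal sketch using scikit-survival's cumulative_dynamic_auc is given below, where the column name lymph_node_ratio and the evaluation grid are hypothetical placeholders.

```python
# Minimal sketch (assumed setup): time-dependent AUC with scikit-survival.
import numpy as np
from sksurv.metrics import cumulative_dynamic_auc

eval_times = np.arange(6, 120, 6)  # illustrative grid in months

# (a) a single predictor used as the risk estimate (hypothetical column name)
auc_lnr, mean_auc_lnr = cumulative_dynamic_auc(
    y_train, y_test, X_test["lymph_node_ratio"], eval_times)

# (b) the risk score of a fitted model, e.g. the RSF
auc_rsf, mean_auc_rsf = cumulative_dynamic_auc(
    y_train, y_test, rsf.predict(X_test), eval_times)
```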
Figure 3a shows a risk scoring model based on the test statistic calculated by the RSF algorithm. Each patient is assigned a numeric risk score ranging from 4.7 to 7.1. Three colors are used to depict the low-, medium- and high-risk groups; the separation into three groups was performed manually based on the distribution of risk scores. Figure 3b shows the survival curves of all individuals of the same predefined test group classified by the scoring system as low, medium or high risk. The three survival curves differ significantly according to the log-rank test (p < 0.0001), with a 5-year survival rate of 73.18% in the low-risk group, 45.39% in the medium-risk group and 14.87% in the high-risk group. Median survival time was 18.754 months in the high-risk group and 44.557 months in the medium-risk group; it could not be calculated for the low-risk group because the survival rate did not fall below 50% within the observed long-term interval of 10 years. The survival curves can also be separated into more groups to enable a finer stratification of risk (not shown).
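Such a stratification and its log-rank comparison could be carried out along the following lines; the cut-offs used here are arbitrary placeholders chosen only to illustrate the mechanics, not the thresholds derived in the study.

```python
# Minimal sketch (assumed setup): grouping test patients by RSF risk score and
# comparing the resulting survival curves with a log-rank test (scikit-survival).
import numpy as np
from sksurv.compare import compare_survival

risk = rsf.predict(X_test)
# placeholder cut-offs; the study derived its thresholds from the score distribution
groups = np.where(risk < 5.6, "low", np.where(risk < 6.2, "medium", "high"))

chisq, p_value = compare_survival(y_test, groups)  # log-rank test across groups
print(f"log-rank chi2 = {chisq:.1f}, p = {p_value:.2e}")
```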
Finally, the 20 most relevant predictors from the permutation-based importance scoring (see Table 3) were selected to establish a more compact RSF model. Supplementary Figure 4 shows the distribution of the risk scores resulting from the compact RSF model. A risk score between 0 and 5.3 was considered low risk, a score between 5.3 and 6.1 medium risk and a score above 6.1 high risk for death after resection. The corresponding survival curves of the three groups from the test cohort are shown in Supplementary Figure 5. Median survival could not be calculated for the low-risk group, as it exceeded the 10-year observation period, whereas the medium-risk group had a median survival of 85.639 months and the high-risk group of 20.721 months.
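A compact model of this kind could be refitted on the top-ranked features roughly as follows; `importance` refers to the permutation-importance table from the sketch above, and the hyperparameters remain illustrative placeholders.

```python
# Minimal sketch (assumed setup): compact RSF restricted to the 20 highest-ranked
# predictors from the permutation-importance ranking computed above.
from sksurv.ensemble import RandomSurvivalForest

top20 = importance.head(20).index
rsf_compact = RandomSurvivalForest(n_estimators=500, min_samples_leaf=10,
                                   n_jobs=-1, random_state=0)
rsf_compact.fit(X_train[top20], y_train)
compact_risk = rsf_compact.predict(X_test[top20])
```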