Baseline characteristics and pathological results
Table 1 lists patient characteristics and pathological results in the total population, training dataset and testing dataset. The median patient age of the overall cohort was 69 (interquartile range [IQR]: 63–75) years. The median TPSA value was 21.0 (IQR: 10.9–42.2) ng/ml. The median D-max on MRI was 1.9 (IQR: 1.3–2.7) cm. Most patients had clinical stage T1-2 (57.9%, n = 307) and PI-RADS score 5 (58.6%, n = 310).
Table 2 details the concordance between the biopsy global GG and the final RP GG, and the corresponding downgrades and upgrades for GG 1-4. The most prevalent GGs assigned on biopsy were GG 1 (31.3%, n = 166) and GG 2 (26.0%, n = 138). Overall, 262 patients (49.4%) were upgraded at final pathology. Of the 166 patients with biopsy GG 1, 120 (72.3%) were upgraded, most to GG 2 (50.0%, n = 83), followed by GG 3 (12.7%, n = 21), GG 4 (6.6%, n = 11) and GG 5 (3.0%, n = 5). Biopsy GG 3 (47.7%) and GG 4 (44.3%) showed the highest agreement with RP GG. Patients with lower biopsy GG were more likely to be upgraded at RP.
In multivariable analysis, %fPSA (>0.16 versus ≤0.16) (OR 0.52; 95%CI: 0.27–0.995; P = 0.048), apical involvement (No versus Yes) (OR 1.80; 95%CI: 1.02–3.19; P = 0.042) on MRI and biopsy GG 1 (P < 0.001) were significantly associated with upgrading at RP (Table 3). According to their respective coefficients, the LR model was constructed using the following formula: = 2.29 0.65 (%fPSA) 0.59 (apical involvement) 0.97 (biopsy GG) (where is the output value of predictive models).
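In logistic regression, a coefficient is the natural logarithm of the corresponding odds ratio, so the reported ORs can be checked against the coefficient magnitudes quoted in the LR formula. The short sketch below performs that check for the two predictors whose ORs are stated in the text; the variable and function names are illustrative and not taken from the study.

```python
import math

# Odds ratios reported in the multivariable analysis (taken from the text).
reported_or = {"%fPSA": 0.52, "apical involvement": 1.80}

# Coefficient magnitudes quoted in the LR formula (taken from the text).
quoted_magnitude = {"%fPSA": 0.65, "apical involvement": 0.59}


def log_odds_coefficient(odds_ratio):
    """Return the logistic-regression coefficient implied by an odds ratio."""
    return math.log(odds_ratio)


coefficients = {name: log_odds_coefficient(value) for name, value in reported_or.items()}

for name, beta in coefficients.items():
    print(f"{name}: ln(OR) = {beta:+.2f}, quoted magnitude = {quoted_magnitude[name]:.2f}")
```

An OR below 1 gives a negative coefficient and an OR above 1 gives a positive one, so under the coding used for Table 3 the %fPSA term lowers the log-odds of upgrading and the apical involvement term raises it.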
Based on the results of Lasso analysis, clinical features with coefficients > 0.1 were selected for the construction of the Lasso-LR model. The selected features were %fPSA, apical involvement, PI-RADS score, clinical T stage and biopsy GG (Fig. 1). The Lasso-LR was constructed by using the following formula: = 1.81 0.41 (%fPSA) 0.25 (apical involvement) 0.81 (biopsy GG) 0.15 (clinical T stage) 0.11 (PI-RADS score) (where is the output value of predictive model).
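The thresholding step can be illustrated with a minimal sketch. It assumes that the 0.1 cutoff is applied to the absolute value of each Lasso coefficient, which the text does not state explicitly. The five magnitudes are those quoted in the Lasso-LR formula; the two `hypothetical_feature_*` entries are invented placeholders for features that fall below the cutoff.

```python
def select_features(coefficients, threshold=0.1):
    """Keep features whose absolute coefficient exceeds the threshold."""
    return [name for name, coef in coefficients.items() if abs(coef) > threshold]


# Magnitudes quoted in the Lasso-LR formula, plus two hypothetical
# low-weight features added only to show the filter removing something.
lasso_magnitudes = {
    "%fPSA": 0.41,
    "apical involvement": 0.25,
    "biopsy GG": 0.81,
    "clinical T stage": 0.15,
    "PI-RADS score": 0.11,
    "hypothetical_feature_a": 0.04,  # assumption, not from the study
    "hypothetical_feature_b": 0.00,  # assumption, not from the study
}

selected = select_features(lasso_magnitudes)
print(selected)
```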
In the RFE-SVM analysis, 10 clinical parameters were selected as the final candidates for constructing the predictive model without impacting the prediction accuracy of the model: biopsy GG, apical involvement, maximum tumor length in single core, %fPSA, PSAD, presence of core with tumor length > 0.6 cm, presence of csPCa at core, PI-RADS score, D-max and clinical T stage (Fig. 2a). As depicted in Fig. 2b, the AUC of the model increased gradually as the selected features were added to the SVM model one by one.
The process of feature selection by the RF model and the importance of features are illustrated in Fig. 3. Based on different combinations of clinical parameters, each tree in the forest votes for a class, and the final classification of the RF model is the class that receives the majority of these votes (Fig. 3a). The best number of trees and the best number of variables tried at each split were 131 and 4, respectively. The out-of-bag (OOB) estimate of the error rate was 33.42%, suggesting that the generalization performance of the RF model was unsatisfactory.
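The voting rule and the OOB error estimate can be sketched in a few lines. This is a toy illustration with invented votes, not the study's implementation: it assumes that each patient is scored only by the trees whose bootstrap sample did not contain that patient, and that patients with no such trees are skipped.

```python
from collections import Counter


def majority_vote(tree_votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def oob_error_rate(oob_votes_per_patient, true_labels):
    """Fraction of patients misclassified by the trees that never saw them."""
    scored = 0
    wrong = 0
    for votes, truth in zip(oob_votes_per_patient, true_labels):
        if not votes:  # patient was in every bootstrap sample: no OOB vote
            continue
        scored += 1
        if majority_vote(votes) != truth:
            wrong += 1
    return wrong / scored


# Invented example: four patients, votes from their out-of-bag trees only.
toy_votes = [
    ["upgrade", "upgrade", "no upgrade"],
    ["no upgrade"],
    [],
    ["upgrade", "no upgrade", "no upgrade"],
]
toy_truth = ["upgrade", "upgrade", "upgrade", "no upgrade"]

toy_error = oob_error_rate(toy_votes, toy_truth)
print(f"OOB error rate: {toy_error:.2%}")
```

An OOB error rate of 33.42% corresponds to roughly two of every three patients being classified correctly by trees that were not trained on them.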
Comparison between ML-based models
Among these models, the Lasso-LR model had the highest AUC (0.776, 95% confidence interval [CI]: 0.729–0.822), followed by SVM (AUC 0.740, 95% CI: 0.690–0.790), LR (AUC 0.725, 95% CI: 0.674–0.776) and RF (AUC 0.666, 95% CI: 0.618–0.714) (Fig. 4a). Similarly, in the testing dataset, the Lasso-LR model had the highest AUC (0.735, 95% CI: 0.656–0.813), followed by SVM (AUC 0.723, 95% CI: 0.644–0.802), LR (AUC 0.697, 95% CI: 0.615–0.778) and RF (AUC 0.607, 95% CI: 0.531–0.684) (Fig. 4b). The Lasso-LR model achieved an accuracy of 0.712, a sensitivity of 0.679 and a specificity of 0.745, indicating that this model correctly identified 67.9% of PCa patients who experienced upgrading at RP and 74.5% of PCa patients who did not (Table 4). In addition, the Lasso-LR model had the highest YI (0.424) among the models. Because the YI is calculated as the sum of sensitivity and specificity minus 1, the highest YI indicates that the Lasso-LR model offered the best combined sensitivity and specificity among the models. Pairwise comparison of ROC curves showed that the AUC of the Lasso-LR model was significantly higher than that of LR (P = 0.002), while the AUCs of SVM and RF were not significantly different from that of LR (P > 0.05) (Fig. 4a).
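The YI reported for the Lasso-LR model can be reproduced directly from its sensitivity and specificity. The sketch below also recomputes accuracy from the same two values, under the assumption that the proportion of upgraded patients in the evaluated dataset equals the overall cohort figure of 49.4%; the text does not give the dataset-specific prevalence, so that part is only a plausibility check.

```python
def youden_index(sensitivity, specificity):
    """Youden index: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


def accuracy_from_rates(sensitivity, specificity, prevalence):
    """Accuracy as the prevalence-weighted mean of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1 - prevalence)


# Values reported for the Lasso-LR model (taken from the text).
sensitivity = 0.679
specificity = 0.745

# Assumption: prevalence of upgrading equals the overall cohort value (49.4%).
assumed_prevalence = 0.494

yi = youden_index(sensitivity, specificity)
acc = accuracy_from_rates(sensitivity, specificity, assumed_prevalence)

print(f"Youden index: {yi:.3f}")
print(f"Accuracy under the assumed prevalence: {acc:.3f}")
```

Under that assumption, both values agree with the reported YI of 0.424 and accuracy of 0.712.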
The calibration of the ML-based models was evaluated graphically using calibration curves (Fig. 5). The green line represented the fit of the model, and deviations from the 45° line indicated miscalibration. Segments of the green line below the 45° line indicated that the predicted probabilities overestimated the true probability of upgrading, and segments above the 45° line indicated that the predicted probabilities underestimated it. The SVM model was the best calibrated (Fig. 5d), followed by Lasso-LR (Fig. 5b), RF (Fig. 5c) and LR (Fig. 5a).
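How points on a calibration curve are read can be shown with a minimal binning sketch. It assumes equal-width probability bins, which may differ from the smoothing used to draw Fig. 5, and the input data are invented for illustration.

```python
def calibration_points(predicted, observed, n_bins=5):
    """Return (mean predicted probability, observed event rate) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for prob, outcome in zip(predicted, observed):
        index = min(int(prob * n_bins), n_bins - 1)  # a probability of 1.0 goes in the last bin
        bins[index].append((prob, outcome))

    points = []
    for members in bins:
        if not members:
            continue
        mean_predicted = sum(prob for prob, _ in members) / len(members)
        observed_rate = sum(outcome for _, outcome in members) / len(members)
        points.append((mean_predicted, observed_rate))
    return points


# Invented example: predicted probabilities of upgrading and outcomes (1 = upgraded).
toy_predicted = [0.1, 0.1, 0.9, 0.9]
toy_observed = [0, 0, 1, 0]

toy_points = calibration_points(toy_predicted, toy_observed, n_bins=2)
for mean_predicted, observed_rate in toy_points:
    side = "below" if observed_rate < mean_predicted else "on or above"
    print(f"predicted {mean_predicted:.2f}, observed {observed_rate:.2f}: {side} the 45° line")
```

In this toy data the high-probability bin has an observed rate lower than its mean predicted probability, so its point lies below the 45° line, which is the overestimation pattern described above.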