Among MLMs, deep learning approaches based on convolutional neural networks have previously been reported for medical images [20]. One such architecture, ResNet50, has emerged as superior in image classification tasks [21, 22], while SVM classification has been judged useful for discriminating between two classes by constructing a decision boundary from one or more feature vectors [23]. We therefore applied both approaches in this study.
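As a conceptual illustration of the second approach, an SVM assigns a case to one of two classes according to which side of a learned decision boundary its feature vector falls on. The following minimal sketch uses purely hypothetical weights, not values learned in this study:

```python
# Hypothetical illustration (not this study's actual pipeline): a linear
# SVM classifies a sample by the sign of the score w.x + b, i.e. by which
# side of the decision boundary the feature vector lies on.
def svm_predict(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "teratoma" if score >= 0 else "necrosis"

# toy weights and bias, chosen for illustration only
w, b = [0.8, -0.5], -0.1
print(svm_predict(w, b, [1.0, 0.2]))  # → teratoma
```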
Here, we evaluated the efficacy of MLMs in distinguishing between residual teratoma and necrosis in clinical PC-RPLND specimens. The radiomics-based ResNet50 algorithm achieved a diagnostic accuracy of 80.0%, corresponding to a sensitivity of 67.3%, a specificity of 90.5%, and an AUC of 0.84, while SVM classification (using six clinical variables) achieved a diagnostic accuracy of 74.8%, corresponding to a sensitivity of 59.0%, a specificity of 88.1%, and an AUC of 0.84.
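For reference, sensitivity, specificity, and accuracy of the kind reported above follow directly from a 2×2 confusion matrix. The helper below uses illustrative counts only, not the study's raw data:

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                # true-positive rate
    specificity = tn / (tn + fp)                # true-negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # overall correct fraction
    return sensitivity, specificity, accuracy

# illustrative counts only, not taken from this study
print(diagnostic_metrics(8, 2, 9, 1))  # → (0.8, 0.9, 0.85)
```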
There are two previous studies of radiomics-based MLMs that likewise discriminate between necrosis (benign) and teratoma/viable GCT (malignant) in PC-RPLND specimens [24, 25]. A study by Lewin et al. used SVM classification and reported a diagnostic accuracy of 71.7%, a sensitivity of 56.2%, a specificity of 81.9%, and an AUC of 0.74 [24]. Another study, by Baessler et al., used a random forest model and obtained a diagnostic accuracy of 81%, a sensitivity of 88%, and a specificity of 72% [25]. The performance of our model is superior to the Lewin model; compared with Baessler's report, our model is superior in specificity (90.5% versus 72%) but slightly inferior in accuracy (80.0% versus 81%). This might be due to the larger study population in the Baessler report and the heterogeneity introduced by multiple CT scanners/vendors. Furthermore, our model was developed to distinguish residual teratoma from necrosis in clinical PC-RPLND specimens, not benign from malignant histology.
Several nomograms based on logistic regression analysis have been reported to distinguish benign from malignant histology in PC-RPLND specimens, with favorable AUCs ranging from 0.77 to 0.84 [8, 9, 18]. Among these, the most promising nomogram includes the following clinical variables: presence of PPT components in the primary site, post-chemotherapy LNS, percentage of lymph node shrinkage after chemotherapy, and pre-chemotherapy serum tumor marker levels (AFP, HCG, and LDH) [8]; we therefore applied these clinical variables to our study. To the best of our knowledge, the only study featuring a machine learning model and these six clinical variables for predicting the histology of PC-RPLND specimens reported an AUC of 0.76 [24]. We selected two clinical variables from the initial six and achieved an AUC of 0.84, similar to previously reported results. Although external validation is needed, our model achieved a similarly favorable AUC using fewer clinical variables than previously reported nomograms and MLMs. Superfluous variables can apparently be excluded safely: Lewin et al. incorporated extensive clinical variables into their model with no improvement in AUC [24], a result similar to ours.
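Such nomograms are essentially logistic regression models: each clinical variable contributes a weighted term to a linear score, which a sigmoid maps to a predicted probability. A minimal sketch follows, with purely hypothetical coefficients rather than those of the cited nomograms:

```python
import math

def nomogram_probability(intercept, coeffs, values):
    """Logistic-regression-style probability from weighted clinical variables.
    Intercept and coefficients here are hypothetical, for illustration only."""
    z = intercept + sum(c * v for c, v in zip(coeffs, values))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps the score to (0, 1)

# a zero linear score maps to a probability of exactly 0.5
print(nomogram_probability(0.0, [0.7, -0.3], [0.0, 0.0]))  # → 0.5
```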
Since the ensemble learning paradigm widely used in medical science strengthens models by integrating multiple MLMs into a single output, we next attempted this tactic [26–28]. However, predictive performance was not improved by ensemble learning, probably because of performance limitations within the individual models comprising the ensemble. Reports on ensemble learning have yet to establish whether the number of integrated models adversely or beneficially affects performance [27]. Accounting for this in the initial design of any MLM integration for ensemble learning will be a requirement for future studies.
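One common way to integrate multiple MLMs into a single output, shown here as a generic sketch rather than our exact implementation, is majority voting over the base models' predictions:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most base models for a single case."""
    return Counter(predictions).most_common(1)[0][0]

# hypothetical outputs from three base models for one case
print(majority_vote(["necrosis", "teratoma", "necrosis"]))  # → necrosis
```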
There are several limitations to this study. First, owing to the retrospective design and single-institution analysis, no external cohort validation was performed. Although we used cross-validation to mitigate this limitation, overfitting of the trained model could have occurred. Second, the study population was relatively small. Third, it might be impossible to detect microscopic residual teratoma within large necrotic lymph nodes by a radiomics approach, which limits this tactic.
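Cross-validation of the kind used here partitions the available cases into k folds, each serving once as a held-out validation set. A minimal index-splitting sketch (illustrative, not our exact implementation):

```python
def k_fold_indices(n_samples, k):
    """Split sample indices into k folds; each fold is held out once as the
    validation set while the remaining indices form the training set."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    return [(sorted(set(range(n_samples)) - set(fold)), fold) for fold in folds]

# e.g. 10 samples, 5 folds: the first split holds out indices 0 and 5
train, val = k_fold_indices(10, 5)[0]
print(val)  # → [0, 5]
```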
Despite these limitations, we developed predictive models for PC-RPLND histology using MLMs with performance equal to or better than previously reported models. Additional validation is required to evaluate whether our models can discriminate between residual teratoma and necrosis reliably enough to spare patients unnecessary PC-RPLND. Furthermore, it would be interesting to compare our machine learning method with assessments by experienced urologists and radiologists in predicting the histology of PC-RPLND specimens.