In this study, we validated the EORTC, CUETO and EAU risk stratification algorithms for the prediction of recurrence, progression and death in patients with newly diagnosed NMIBC. Our analysis included 322 patients and, in terms of intragroup associations, confirmed observations from previous studies [11, 12], which allows us to consider our study group representative. In our study group, the EORTC model showed superior performance, although this performance is generally moderate and the difference, despite its statistical significance, may not be clinically significant. The systematic review allowed us to summarize and confirm that state-of-the-art risk stratification tools validate insufficiently in real clinical scenarios, which emphasizes the need for the development of new models. For completeness, we comprehensively assessed not only the simplified risk groups presented in the EAU, EORTC and CUETO publications but also the scores used to derive these risk groups.
In NMIBC, risk stratification algorithms are in great demand because progression to MIBC is associated with poor prognosis, as shown not only by our analysis but also by several others. Despite known risk factors and repeated TURBT procedures, the accuracy of predicting recurrence and progression to MIBC is still unsatisfactory. As shown by our results and the systematic review, the poor discriminative ability of state-of-the-art risk stratification tools is problematic in forecasting both recurrence and progression. The latter, however, seems to be predicted more accurately by those tools.
We are the first to report a statistically significant advantage of EORTC over EAU and CUETO in recurrence prediction. No previous paper compared the c-indices directly together with their 95% CI. This superiority, however, may not be relevant from the clinical point of view because the c-index values are generally low. It may, however, be relevant to progression prediction. Although we did not find a difference between EAU and EORTC for this problem, EORTC proved superior to CUETO. This should, however, be considered in light of the fact that CUETO was initially developed for BCG-treated patients.
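To illustrate what a direct comparison of c-indices with a 95% CI can look like, the sketch below computes Harrell's c-index for two scores on the same patients and derives a percentile bootstrap interval for their difference. It is a minimal illustration under stated assumptions and not necessarily the procedure used in this study: the data are synthetic, and the names `harrell_c`, `bootstrap_c_difference`, `score_a` and `score_b` are hypothetical.

```python
import random

def harrell_c(times, events, scores):
    """Harrell's c-index: share of comparable pairs in which the patient
    with the earlier observed event also has the higher risk score."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # each pair is anchored on a patient with an observed event
        for j in range(n):
            if times[i] < times[j]:  # patient j was still event-free at times[i]
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5  # tied scores count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

def bootstrap_c_difference(times, events, score_a, score_b, n_boot=200, seed=0):
    """Percentile bootstrap 95% CI for c(score_a) - c(score_b)."""
    rng = random.Random(seed)
    n = len(times)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        t = [times[k] for k in idx]
        e = [events[k] for k in idx]
        a = [score_a[k] for k in idx]
        b = [score_b[k] for k in idx]
        diffs.append(harrell_c(t, e, a) - harrell_c(t, e, b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Synthetic cohort (assumption): score_a carries signal, score_b is pure noise.
_rng = random.Random(1)
risk = [_rng.random() for _ in range(80)]
event_time = [_rng.expovariate(0.1 + r) for r in risk]
censor_time = [_rng.expovariate(0.3) for _ in risk]
times = [min(t, c) for t, c in zip(event_time, censor_time)]
events = [t <= c for t, c in zip(event_time, censor_time)]
score_a = risk
score_b = [_rng.random() for _ in risk]

low, high = bootstrap_c_difference(times, events, score_a, score_b)
print("c(A) =", round(harrell_c(times, events, score_a), 3))
print("c(B) =", round(harrell_c(times, events, score_b), 3))
print("95% CI of difference:", round(low, 3), "to", round(high, 3))
```

An interval for the difference that excludes zero indicates a statistically significant difference between the two scores, but, as noted above, it says nothing about whether that difference is clinically meaningful.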
Although we acknowledge that the discussed systems could be used to assess prognosis in recurrent cases, our study importantly analyzed only the survival to the first recurrence. The rationale for this was the significant impact of the individual surgeon on the risk of recurrence after curative treatment of patients with NMIBC, as described earlier by others, which could aggravate the lead-time bias. This approach was adopted in several similar studies, e.g. by Shen et al.
Our systematic analysis has also shown inconsistency in how the validity of the stratification approaches is reported. For example, the very recent analysis of 301 patients by Wang et al. and the analysis of 1436 patients by Rieken et al. (e.g. in the group without immediate postoperative instillation of chemotherapy) could not be included in the review because they lack a c-index or AUCROC analysis. Even when c-index values are reported, they are usually provided without a 95% confidence interval and are therefore unfit for meta-analysis. Nevertheless, most of the authors of the cited papers indirectly confirm our observations. The accuracy of predictions was consistently lower in patients treated with BCG in all included publications.
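The reason a c-index without a confidence interval cannot be pooled is that inverse-variance meta-analysis needs a standard error for each estimate. The sketch below recovers it from a symmetric 95% CI as (upper - lower) / (2 × 1.96) and computes a fixed-effect pooled value. It is only an illustration: the three input triples are hypothetical and are not taken from the reviewed studies, the normal approximation is an assumption, and `pool_fixed_effect` is a name introduced here.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval

def se_from_ci(lower, upper):
    """Standard error implied by a symmetric 95% CI (normal approximation)."""
    return (upper - lower) / (2 * Z_95)

def pool_fixed_effect(estimates):
    """Fixed-effect inverse-variance pooling of (value, lower, upper) triples."""
    weights = [1.0 / se_from_ci(lo, hi) ** 2 for _, lo, hi in estimates]
    pooled = sum(w * v for w, (v, _, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled - Z_95 * pooled_se, pooled + Z_95 * pooled_se

# Hypothetical c-indices with 95% CIs, for illustration only.
studies = [(0.62, 0.57, 0.67), (0.58, 0.50, 0.66), (0.65, 0.61, 0.69)]
pooled, pooled_low, pooled_high = pool_fixed_effect(studies)
print(round(pooled, 3), round(pooled_low, 3), round(pooled_high, 3))
```

A study that reports only the point estimate contributes no weight to such a calculation, which is why it has to be left out of the pooled analysis.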
When our analysis was finished, at the end of 2018, a critical assessment from the European Association of Urology Non-muscle-invasive Bladder Cancer Guidelines Panel was published. In that paper, the experts concluded that none of the available risk stratification and prognostic models reflects current standards of treatment. In their opinion, the EORTC risk tables and the CUETO scoring model should be updated with previously unavailable data and recalculated. Our data support this conclusion.
Multiple discrepancies between the original publications and the validation studies are reported in the reviewed material. For example, patients requiring a second TURBT were excluded from the analysis in the original CUETO and EORTC publications, while multiple recent papers did not apply this criterion.
Numerous attempts have been made to develop new models. In a recent publication by Kim et al., the authors achieved c-indices of 0.65 and 0.70 for 5-year recurrence and progression, respectively. Considering possible overfitting of the model (the c-index was provided without external validation) and the fact that our validation showed EORTC to provide similar c-indices, its utility requires further extensive validation. Similarly, in the paper by Hong et al., the AUCROC of the proposed nomogram was 0.604 for the 5-year prediction of recurrence. In our study, without the proposed nomograms, better validation AUCROC metrics were achieved by the EAU risk groups and by the EORTC score and risk groups. Moreover, the recently proposed model for patients treated with 1–3 years of maintenance BCG, based only on grading and age, was reported with a c-index for recurrence of 0.59 in the training set and 0.56 in the validation set. Those values fall within the 95% CIs of the c-indices we report in this study, which were obtained for a mixed population of patients treated with and without BCG. A similar situation was noted for progression, where the authors provided c-indices of 0.72 and 0.64 for the training and validation sets, respectively.
Lastly, it is worth mentioning that current risk stratification tools are hard to apply in the field of personalized medicine. For example, applying the standard cutoff of 50% probability, one can conclude that classification into an EORTC group with a 38% probability of recurrence at 1 year and a 62% probability at 5 years would, by assumption, yield 38% incorrect predictions for both timeframes. Currently available nomograms do not predict the expected time of recurrence or progression and hence cannot be treated as predictive tests for particular patients. This means that, although their general predictive potential can be described using AUCROC or c-index parameters, analyzing these tools as predictive models in terms of accuracy, sensitivity or specificity is futile.
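The arithmetic behind the 38% figure can be written out explicitly. The sketch below assumes that every patient in a risk group receives the same dichotomous prediction, determined by comparing the group's recurrence probability with the 50% cutoff, and that the group probability equals the true event rate; `cutoff_error_rate` is a hypothetical helper introduced for illustration.

```python
def cutoff_error_rate(event_probability, cutoff=0.5):
    """Return (predicts_event, expected share of wrong predictions) when the
    whole risk group is classified by comparing its probability to the cutoff."""
    predicts_event = event_probability >= cutoff
    if predicts_event:
        # Everyone is predicted to recur; those who do not recur are misclassified.
        error = 1.0 - event_probability
    else:
        # Everyone is predicted to stay recurrence-free; those who recur are misclassified.
        error = event_probability
    return predicts_event, error

for label, probability in (("1-year", 0.38), ("5-year", 0.62)):
    predicts_event, error = cutoff_error_rate(probability)
    outcome = "recurrence" if predicts_event else "no recurrence"
    print(f"{label}: predict {outcome}, expected error {error:.0%}")
```

At 1 year the group is labelled "no recurrence" and the 38% who recur are misclassified; at 5 years the group is labelled "recurrence" and the 38% who remain recurrence-free are misclassified.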
Our study is not devoid of limitations associated with its design. As this is a retrospective analysis, possible recall and selection bias should be considered. This was partially addressed by integrating the results with data received from the central registry, which allowed us to double-check our records. However, only overall survival was analyzed using the information from the central governmental registry. The data on recurrence and progression were obtained from a single institution, and the associated bias is further aggravated by the fact that this facility is not the only one in the region performing TURBT procedures. Patients who chose a different facility for further treatment were necessarily lost to follow-up. Additionally, the procedures were performed by multiple surgeons and assessed by multiple pathologists. Comorbidities might also have had an uncontrolled influence on treatment and decision making; however, based on the findings presented above, we consider our sample representative of the population. Additionally, none of the patients was treated with an immediate single intravesical instillation of gemcitabine. However, recent evidence suggests that such instillation further decreases the predictive performance of the studied systems.
Notwithstanding this, our study provides additional evidence on the validity of state-of-the-art risk stratification methods in a fairly large sample and is the very first to attempt to summarize current research and compare all three currently recommended methods of risk assessment with each other.