Several studies have demonstrated that neural networks can accurately distinguish glaucomatous from healthy eyes using retinal fundus images, typically yielding AUC values of around 99% across a number of publicly available databases16–19. Deep learning techniques have even proven adept at detecting glaucoma from retinal images that exclude the optic nerve, achieving an AUC of 88%20.
Fumero et al. achieved a 99% AUC with the RIM-ONE14 database, while Phasuk et al.21 attained 94%. The 96% AUC obtained in this study is in line with the published literature.
To our knowledge, no published study to date has considered the prediction probability produced by neural networks and its potential use in clinical practice. This study focuses not only on a trained neural network's classification of glaucomatous versus healthy discs in the RIM-ONE test data but also on the certainty probability of each prediction. A confidence level of 95% was selected because it is a standard threshold in scientific research. If the probability exceeds 95%, the prediction of the artificial intelligence can be considered accurate. This use of AI could be adopted in clinical practice to identify those patients who require further investigation for possible glaucoma. It mirrors the current clinical practice of initial disc examination by a clinician, with further investigation only if there is clinical suspicion of glaucoma.
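The thresholding described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: it assumes the network outputs a single probability of glaucoma per image, and it takes the certainty of a prediction to be the larger of that probability and its complement. The function name `triage_by_confidence` and the example probabilities are hypothetical.

```python
from typing import List, Tuple


def triage_by_confidence(
    p_glaucoma: List[float], threshold: float = 0.95
) -> Tuple[List[Tuple[int, str, float]], List[int]]:
    """Split predictions into high-confidence calls and cases to refer.

    Assumption: each entry of p_glaucoma is the model's probability that
    the image shows a glaucomatous disc.
    """
    confident = []  # (image index, predicted label, certainty)
    uncertain = []  # image indices that need further investigation
    for i, p in enumerate(p_glaucoma):
        # Certainty of the predicted class, whichever class that is.
        certainty = max(p, 1.0 - p)
        if certainty > threshold:
            label = "glaucoma" if p >= 0.5 else "healthy"
            confident.append((i, label, certainty))
        else:
            uncertain.append(i)
    return confident, uncertain


# Made-up probabilities for four images.
confident, uncertain = triage_by_confidence([0.99, 0.60, 0.02, 0.95])
print(confident)
print(uncertain)
```

With these example values, the first and third images are accepted as high-confidence predictions, while the second and fourth (which does not strictly exceed 95%) would be flagged for further investigation.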
While other neural networks may be mathematically superior to this model, extending it to calculate the certainty probability of each prediction gives it a clinical relevance lacking in other models. In this study, the AUC in the high-confidence test set was 100%, whereas the original test set, which included predictions with probabilities under 95%, gave an AUC of 96%. Comparison using the DeLong test showed that this difference was statistically significant (p < 0.05).
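The comparison between the full test set and the high-confidence subset can be illustrated as follows. This is a sketch under stated assumptions: the AUC is computed through its pairwise (Mann-Whitney) definition, the labels and scores are made-up toy values rather than the RIM-ONE results, and the DeLong significance test itself is not implemented here. The helper names `auc` and `high_confidence_subset` are hypothetical.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def high_confidence_subset(labels, scores, threshold=0.95):
    """Keep only cases whose predicted-class probability exceeds the threshold."""
    kept = [(y, s) for y, s in zip(labels, scores) if max(s, 1.0 - s) > threshold]
    return [y for y, _ in kept], [s for _, s in kept]


# Toy data (1 = glaucoma, 0 = healthy); scores are P(glaucoma).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.99, 0.97, 0.40, 0.60, 0.03, 0.01]

full_auc = auc(labels, scores)
hc_labels, hc_scores = high_confidence_subset(labels, scores)
hc_auc = auc(hc_labels, hc_scores)
print(full_auc, hc_auc)
```

In this toy example the two misranked, low-certainty cases are excluded from the subset, so the subset AUC is higher than the full-set AUC. Whether such a difference is statistically significant would still need a dedicated test such as DeLong's.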
In our analysis, not only was the increase in area under the curve (AUC) statistically significant, indicating improved overall performance of the model, but sensitivity and negative predictive value also showed statistically significant improvements.
Specificity and positive predictive value did not improve to a statistically significant degree, despite reaching 100%, because their baseline values in the full test set of 172 images were already high, at 98% and 95% respectively.
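For reference, the four metrics discussed above follow directly from the confusion matrix. The sketch below uses purely illustrative counts, not the counts from this study, and the function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute standard diagnostic metrics from confusion-matrix counts.

    tp/fn: glaucomatous cases predicted as glaucoma / healthy
    tn/fp: healthy cases predicted as healthy / glaucoma
    """
    return {
        "sensitivity": tp / (tp + fn),  # glaucoma cases correctly detected
        "specificity": tn / (tn + fp),  # healthy cases correctly cleared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Illustrative counts only (assumed for this example).
metrics = diagnostic_metrics(tp=45, fn=5, tn=98, fp=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A metric that already starts close to 100% has little room to rise, so even reaching 100% represents a small absolute change that is harder to distinguish from chance.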
Previous studies have primarily focused on achieving the highest possible AUC without considering individual prediction probabilities. Assigning a predictive probability to each output of the AI model takes into consideration the clinical relevance of neural networks in diagnosing glaucoma.
This study is, to our knowledge, the first to demonstrate the potential clinical relevance of using high-confidence predictions from artificial intelligence models to assess glaucoma from fundus images.