As a non-medical AI, ChatGPT demonstrated modest performance compared with platforms specifically designed for DR detection. Given non-blur, non-mtmDR images, the AI was surprisingly unable to read or correctly interpret the majority. This suggests that ChatGPT’s fund of knowledge lacks sufficient training on normal and mild disease, limiting its current utility as a screening tool in low-risk populations. In comparison, FDA-approved devices have demonstrated higher “imageability,” or the percentage of disease detection results given among all images determined gradable by a reading center. For example, EyeArt has reported imageability of 87.4–97.4%, albeit using multiple images per eye captured on select camera makes/models and applying patient exclusion criteria (e.g. persistent visual impairment) (16). Also, among non-referrable DR images that could be read, ChatGPT lagged behind other models, which have demonstrated superior performance with sensitivities of 91.9% and specificities of 99.7% (14). The AI did perform significantly better in the moderate NPDR to PDR range. Future versions could be trained to interpret normal and mild disease with improved accuracy.
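For reference, imageability as defined above can be expressed as a simple proportion (the notation is ours, added only for illustration):

\[
\text{Imageability} = \frac{\text{images for which a disease detection result was returned}}{\text{images deemed gradable by the reading center}} \times 100\%
\]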
For cases of mtmDR, the AI demonstrated the ability to correctly diagnose 93.3% of moderate NPDR, 93.3% of severe NPDR or PDR, and 66.7% of blur fundus with suspected PDR images. In the moderate NPDR subset, the retina specialists were additionally able to individually interpret 73.3% (22/30) as VTDR, whereas ChatGPT did so for only 43.3% (13/30). This suggests that while other platforms may benefit from deep learning, ChatGPT’s training may include predefined database categories such as moderate NPDR, but not detailed features such as DME. Therefore, even if ChatGPT can define DME, mtmDR, and an image of moderate NPDR, it does not yet demonstrate the ability to consistently identify DME within an image of moderate NPDR. This limits its utility in detecting subtle disease and likely requires dedicated training, as completed by other AI platforms. ChatGPT did perform better at diagnosing VTDR in cases of severe NPDR and PDR without blur (83.3%), which may simply be due to more apparent image details. However, for blur fundus images with suspected PDR, although readability remained noteworthy, accuracy decreased to 36.7% compared with 100.0% among the retina specialist panel. As such, future versions of the AI may be most promising for use in high-risk populations in which clear fundus photos are possible.
For all images, ChatGPT stated it is not a medical professional and recommended referral, which is essential for any non-FDA-approved software. Even using mtmDR or VTDR alone as the criteria for referral, ChatGPT still exhibited an excessively high referral rate, as it considered 90.9% of all images to have mtmDR (mislabeling 81.0% of non-mtmDR images) and 58.7% of all images to have VTDR (mislabeling 49.2% of non-VTDR images). This propensity for overcalling mtmDR or VTDR may be an inherent limitation or secondary to a cautious approach to avoid missing disease. Other AI applications for DR identification have found similar limitations, but for ChatGPT to be useful, its specificity for mtmDR (19.1%) and VTDR (50.8%) must improve (17–20). For comparison, a systematic review of smartphone-based AIs for DR detection found pooled sensitivity and specificity of 88.0% and 91.5%, respectively (20), and models utilizing a similar EyePACS dataset have achieved sensitivities of 80.0% and specificities of 96.9% (21). Again, this weakness was largely attributable to normal and mild NPDR images; excluding those, the AI would have correctly referred 97.5% (117/120) of images even when its classification was not entirely correct.
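These specificity values follow directly from the mislabeling rates above, since specificity is the complement of the false-positive rate among truly negative images (allowing for rounding):

\[
\text{Specificity}_{\text{mtmDR}} \approx 1 - 0.810 = 19.0\% \ (\text{reported } 19.1\%), \qquad
\text{Specificity}_{\text{VTDR}} = 1 - 0.492 = 50.8\%
\]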
Ideally, an AI software strives to be both time- and cost-efficient with adequate sensitivity and specificity. While processing speed as a function of internet connectivity and server demand was not formally evaluated in this study, the software did perform relatively quickly and inexpensively. However, performance metrics lagged, and, occasionally, when ChatGPT gave a possible interpretation, the results were inconsistent with any possible fundus finding. For example, 3/60 blurred fundus images that ChatGPT considered unreadable were otherwise described as “celestial bodies,” although no images involved asteroid hyalosis. This is in part due to the nature of ChatGPT as a large language model (LLM) with generative capabilities, which differs from other available AI models that classify images into predefined categories of mtmDR. While this weakness exists, the LLM offers the unique potential to provide conversational counseling at the point of capture if improved or paired with an FDA-approved platform or human reader. Also, the ideal AI with global reach would not be limited to photos from a specific camera, while still addressing wide variance in image quality due to patient- and user-dependent factors such as blur and exposure.
Ideal technology should also be accessible in areas without internet access (22) and provide ease of use and interpretation. ChatGPT remains limited to online use, unlike others such as Medios AI (8). However, one of the versatile facets of ChatGPT is that it can be integrated into websites or web-based mobile applications through application programming interfaces (APIs), which can link users to patient resources, appropriate physician offices, or personal health records (23). Moreover, improved versions are foreseeable, with a low barrier to updating as they become easily accessible online and the ability to connect to real-time human graders, in line with other AI screening tools (23). Aside from evaluating retinopathy, ChatGPT-4 has additional limitations, including an inability to learn from previous experience, reliability issues (“hallucinations”), disinformation, privacy concerns, and cybersecurity risks given protected health information (12, 24). Other limitations of this study include the use of a single macula-centered fundus photograph for interpretation, rather than standard 4-widefield stereoscopic dilated fundus photographs or optical coherence tomography (OCT). Furthermore, additional definitions regarding mtmDR and VTDR to align with reading center guidelines were given to the retina specialists to avoid ambiguity but were not given to the AI, as they were not necessary to elicit a response.
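As a purely illustrative sketch of the API integration noted above, the following shows how a web-based screening front end might submit a fundus photograph to a vision-capable GPT-4 model through the OpenAI Python SDK; the model name, prompt wording, and file path are hypothetical and were not part of this study's protocol.

    # Hypothetical sketch: submitting a fundus photo for a preliminary DR impression
    # via the OpenAI chat-completions API (model, prompt, and path are illustrative).
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("fundus_macula_centered.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; illustrative choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe any findings suggestive of more-than-mild diabetic "
                         "retinopathy in this fundus photograph, and recommend referral "
                         "if uncertain. This is not a medical device."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )

    print(response.choices[0].message.content)  # free-text impression to pair with a human grader

Such a wrapper could, for example, forward the free-text impression to a human grader or link the user to a physician's office, consistent with the API-based connections described above.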
As AI platforms continue to improve, their screening potential in areas with limited numbers of eyecare providers or a disproportionate disease burden remains promising, especially with the integration of machine learning. Future studies should assess the ability of newer versions of ChatGPT to evaluate patient images from non-publicly available databases and in a point-of-care prospective trial setting. In addition to fundus images, providing patient characteristic inputs, including any history of diabetes and comorbidities, patient skin type or ethnicity, serum creatinine levels, or hemoglobin A1c, may improve AI-predicted outcomes (10, 25). Use of multiple images, widefield images, and OCT of each fundus could also provide additional insight into the AI’s capability and improve performance. In fact, ChatGPT noted in several instances that it had difficulty making a determination of DME without OCT. Comparative studies with other AI platforms, such as BlenderBot 3 (Meta) and Bard (Google), should also be performed. Lastly, the AI demonstrated sufficient performance in DR to warrant investigation into other disease states such as ERM, macular hole, and glaucoma.