This is a systematic review and meta-analysis. This protocol is registered in PROSPERO (CRD42021274441).
This project aims to determine the diagnostic accuracy of AI in ophthalmology clinical settings, with results stratified and presented by ophthalmic condition. Where sufficient information is available, patients will also be grouped by age as pediatric (under 18 years of age) or adult (18 years or older). Some ophthalmic conditions, such as retinopathy of prematurity, occur exclusively in the pediatric population, whereas others, such as age-related macular degeneration, occur most commonly in older adults. Both of these conditions provide potential for AI-assisted screening through automated grading of various diagnostic imaging modalities. Studies with a mix of patient ages will be characterized based on the proportion of adult and pediatric patients.
The present study will further subgroup ophthalmic conditions by their anatomic location. Anterior segment conditions include cataract, keratoconus, and dry eye disease; common forms of imaging include anterior segment optical coherence tomography (AS-OCT), keratometry, and slit lamp photography. Posterior segment conditions such as diabetic retinopathy, age-related macular degeneration, and open angle glaucoma can be visualized via imaging modalities such as OCT of the macula and optic nerve, fundus photography, and visual field testing.
Additional subgroups will be based on patient setting: in-clinic or remote via teleophthalmology. This will allow the authors to discern whether patient setting is related to the diagnostic accuracy of AI.
For all analyses, human graders will serve as the reference standard: the diagnostic accuracy of the AI screening results will be assessed relative to the same images graded by humans. Human grading was chosen as our reference standard because it remains the predominant and best-established method of providing a diagnosis. As diagnoses can differ between the two eyes of an individual, this study will use the eye as the unit of analysis.
We will undertake a literature search of relevant articles using a comprehensive search strategy developed in consultation with experienced librarians. The search will be conducted in Ovid Medline, Ovid EMBASE, and Wiley Cochrane CENTRAL for articles from January 1, 2000 to December 20, 2021. A start date of 2000 was chosen to reflect the recency of AI development and application, including one of the first studies using AI in ophthalmology, published in 2004 (15). The search will include a group of terms related to artificial intelligence and ophthalmology, comprising subject headings as well as key terms. The search was first developed in Ovid Medline, then translated to Ovid EMBASE and Wiley Cochrane CENTRAL. The search will not be restricted based on language or patient population. Supplementary Data 1 includes the complete search strategy for all three databases.
Inclusion and Exclusion Criteria
Peer-reviewed scientific articles found in the chosen databases that compare the results of AI-graded ophthalmic images with results from human graders will be included. The ophthalmic conditions in scope will include, but are not limited to, keratoconus, cataract, angle-closure glaucoma, dry eye disease, posterior capsule opacification, diabetic retinopathy, age-related macular degeneration, retinopathy of prematurity, open-angle glaucoma, epiretinal membrane, and macular hole. Patients of any age or comorbidity status will be included.
Review papers, case reports, conference abstracts, guidelines, editorials, commentaries, and opinion pieces will be excluded. Papers not in English will be excluded.
Due to the large number of anticipated studies from the search, the systematic review software DistillerSR (Evidence Partners) was chosen to assist with de-duplication of citations and screening of articles (16). DistillerSR uses machine learning to automate part of the screening process as an adjunct to human graders (17). After reviewers manually provide screening results for a training set, the DistillerSR software will recognize the patterns and keywords used for screening and apply them to the remaining articles. A relevance threshold can be set to control the strictness of screening, and manual checks are available at various steps to ensure the desired screening result. Using this software will allow a much broader scope than was feasible in previous systematic reviews on the topic.
All statistical analysis for the meta-analysis will be completed with R.
Screening of Studies
Retrieved studies from the searched databases will be imported into the systematic review software DistillerSR and deduplicated. Studies will be selected via a two-stage screening process, first by screening titles and abstracts, followed by full-text screening. The screening process will be supplemented with DistillerSR using a stepwise approach: 1) After training on the inclusion and exclusion criteria, two independent reviewers will screen papers until a minimum of 10 relevant articles have been selected for inclusion and a total of 500 articles have been screened. This set will serve as the training set for the automated DistillerSR screening software. 2) For the next set of 500 articles, one reviewer will screen titles and abstracts, with DistillerSR serving as the second reviewer. These thresholds were chosen as a conservative approach to screening based on the manufacturer's recommendations for optimal performance of the software. The relevance threshold will be set at 0.1, the most conservative level, to ensure high sensitivity for inclusion of studies and prevent exclusion of any relevant articles. 3) If an acceptable level of agreement (>90%) between the reviewer and DistillerSR is achieved, the remaining articles will be screened by DistillerSR alone (18). In this case, a reviewer will perform a quality check on a random selection of 10% of the articles screened by DistillerSR alone to ensure no relevant studies are excluded. 4) If an acceptable level of agreement is not achieved (<90%), the algorithm will be re-run with the newly screened articles included to increase the training set size, and DistillerSR alone will again be used for screening at the 0.1 relevance threshold. If the level of agreement is again <90%, one reviewer will screen the remaining papers, with DistillerSR serving as the second screener. In all steps, any disagreements will be reviewed by a third senior adjudicator.
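To make the agreement check in the stepwise screening concrete, percent agreement can be computed from the paired include/exclude decisions of the human reviewer and DistillerSR. The sketch below is illustrative Python with hypothetical decision data; the function name and counts are our own, and the protocol's actual analyses will be performed in R.

```python
# Illustrative sketch: percent agreement between a human reviewer and
# DistillerSR on paired include/exclude screening decisions.
# All data below are hypothetical example values.

def percent_agreement(reviewer, distiller):
    """Proportion of articles on which both screeners made the same decision."""
    if len(reviewer) != len(distiller):
        raise ValueError("decision lists must be the same length")
    matches = sum(r == d for r, d in zip(reviewer, distiller))
    return matches / len(reviewer)

# Example: 500 paired decisions with 9 disagreements.
reviewer_calls  = ["include"] * 30 + ["exclude"] * 470
distiller_calls = (["include"] * 25 + ["exclude"] * 5
                   + ["include"] * 4 + ["exclude"] * 466)

agreement = percent_agreement(reviewer_calls, distiller_calls)
print(f"Agreement: {agreement:.1%}")  # 98.2%, clearing the >90% threshold
```

A result above the protocol's 90% threshold would permit DistillerSR to screen the remaining articles alone, subject to the 10% quality check.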
After full text screening, data will be extracted from each study via the categories of study characteristics, patient information, AI methods, and outcomes (e.g. sensitivity and specificity). The full list of data categories to be extracted is presented in Table 1.
Table 1. Data to be extracted from each study
- Primary author
- Publication year
- Recruitment period/study duration
- Study purpose
- Study type (e.g. RCT, prospective cohort study)
- Sample size
- Clinical setting (academic/community)
- Reference standard description (e.g. human graders – retina specialists)
- Ophthalmic condition screened for
- Funding sources
- Follow-up period
- Patient sociodemographic data (including age (mean/median and categorization of pediatric and adult), sex, comorbidities, eye conditions, race/ethnicity, income status, education)
- Inclusion and exclusion criteria
- Imaging modalities used for screening (e.g. fundus photographs, optical coherence tomography)
- Automated algorithms or tools used (e.g. boosted tree, random forest)
- Role of AI in screening
- Number of human graders
- Number of ungradable images
- Identified pathologies (types and proportions)
- Positive predictive value
- Negative predictive value
- % correct as analyzed by artificial intelligence
- Diagnostic accuracy (if stated)
Assessment of Study Quality
The Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool will be used by two independent reviewers to assess the quality of included studies across its 4 domains: patient selection, index test, reference standard, and flow/timing (19). Multiple signalling questions within each domain guide the bias review, and risk of bias is graded as high, low, or unclear. If at least one signalling question is answered "no", there is potential for bias, and the reviewers will independently judge the risk of bias. A grading of unclear is given only when there is insufficient information to make a judgement.
In cases where studies exclude patients from the comparative analysis, we established a low risk-of-bias cut-off of 10% for ophthalmic images deemed ungradable by the human graders. This cut-off was informed by a selection of review papers, which labelled a 5–10% ungradable rate as low (20, 21).
Any disagreements in grading will be reviewed by a third adjudicator. A summary and graphic representation of the QUADAS-2 gradings for all studies will be presented in the final review. A sensitivity analysis will be conducted by removing studies with a high risk of bias.
Where data are missing, we will attempt to contact the corresponding author of each study through the email address listed on the publication. A total of three attempts will be made. If no response is received, the authors will perform the analysis as best as possible on the available data and code any unavailable data as missing. The missing data will be noted as a limitation in the discussion section of the manuscript.
For each study, screening outcomes via artificial intelligence will be entered into a two-by-two table (true positive, false positive, true negative, false negative). The data from the two-by-two tables will be used to calculate sensitivity and specificity for each study (Table 2). We will present individual study results graphically by plotting the estimates of sensitivity and specificity in both forest plots and summary receiver operating characteristic (sROC) curve plots. The predictive accuracy will be quantified using the area under the receiver operating characteristic curve (AUROC).
Table 2. Sample Two-by-Two Contingency Table Used for Analysis

| | Reference positive (human graders) | Reference negative (human graders) |
|---|---|---|
| Test positive (AI) | True positive (TP) | False positive (FP) |
| Test negative (AI) | False negative (FN) | True negative (TN) |
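The per-study accuracy measures, including the extracted predictive values, follow directly from the four cell counts of the contingency table. A minimal illustrative sketch in Python with hypothetical counts (the protocol's statistical analyses themselves will be run in R):

```python
# Illustrative sketch: diagnostic accuracy measures from a 2x2 table.
# Cell counts (tp, fp, fn, tn) are hypothetical example values.

def diagnostic_measures(tp, fp, fn, tn):
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

m = diagnostic_measures(tp=90, fp=20, fn=10, tn=180)
print(m)  # sensitivity 0.900, specificity 0.900, ppv ~0.818, npv ~0.947
```

Each included study contributes one such set of estimates, which are then plotted in the forest and sROC displays described above.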
We will also conduct a subgroup analysis on the diagnostic accuracy of artificial intelligence when used specifically in teleophthalmology programs.
Our unit of analysis is the eye, given that each eye may have a separate diagnosis and therefore affect accuracy in different ways. Some studies may only report results per patient instead of per eye. As such, a sensitivity analysis will be conducted with the unit of analysis as each patient to ensure consistency of results.
Pooled sensitivity and specificity of artificial intelligence for the detection of any ophthalmic condition based on imaging modalities, compared to the reference standard (i.e. human graders), will be reported. The findings will be stratified by ophthalmic condition (anterior vs posterior segment disease entities, when sufficient data are available), as well as by demographics (pediatric <18 years vs adult ≥18 years). Pooled estimates of sensitivity and specificity will be obtained with random-effects models, using the DerSimonian-Laird method to incorporate variation among studies (22).
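To make the pooling step concrete, the DerSimonian-Laird estimator can be sketched on logit-transformed sensitivities. This is an illustrative Python sketch with made-up study values; the actual pooling will be performed in R under the direction of the study biostatistician.

```python
import math

# Illustrative DerSimonian-Laird random-effects pooling on the logit scale.
# Study-level effects and variances below are hypothetical.

def dersimonian_laird(effects, variances):
    """Pool study effects using the DL estimate of between-study variance."""
    w = [1 / v for v in variances]                       # fixed-effect weights
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                        # DL between-study variance
    w_star = [1 / (v + tau2) for v in variances]         # random-effects weights
    pooled = sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)
    return pooled, tau2

# Hypothetical per-study logit sensitivities and their variances
# (the variance of a logit proportion is approximately 1/tp + 1/fn).
logits = [2.2, 1.8, 2.6, 2.0]
variances = [0.10, 0.08, 0.15, 0.12]
pooled_logit, tau2 = dersimonian_laird(logits, variances)
pooled_sens = 1 / (1 + math.exp(-pooled_logit))          # back-transform
print(f"pooled sensitivity ~ {pooled_sens:.3f}, tau^2 = {tau2:.3f}")
```

When the between-study variance estimate is zero, the random-effects result reduces to the fixed-effect pooled estimate.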
We will investigate heterogeneity first through visual examination of forest plots of sensitivities and specificities, as well as the sROC plot of the raw data. We will then use Cochran's Q test to evaluate homogeneity and the Higgins I² statistic to quantify the amount of heterogeneity. I² ranges from 0 to 100%, with values of 25%, 50%, and 75% considered low, moderate, and high heterogeneity, respectively. All statistical analyses will be completed by a qualified biostatistician.
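The heterogeneity statistics above can be sketched in the same illustrative way (hypothetical study values in Python; the protocol's analyses will use R):

```python
# Illustrative sketch: Cochran's Q and the Higgins I^2 statistic for a
# set of study effects (hypothetical logit sensitivities and variances).

def cochran_q(effects, variances):
    """Weighted sum of squared deviations from the fixed-effect pooled value."""
    w = [1 / v for v in variances]
    pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    return sum(wi * (e - pooled) ** 2 for wi, e in zip(w, effects))

def i_squared(q, n_studies):
    """Percentage of total variation attributable to between-study heterogeneity."""
    df = n_studies - 1
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

effects = [2.5, 1.2, 3.0, 1.6]     # hypothetical logit sensitivities
variances = [0.10, 0.08, 0.15, 0.12]
q = cochran_q(effects, variances)
i2 = i_squared(q, len(effects))
# Values near 25%, 50%, and 75% are read as low, moderate, and high heterogeneity.
print(f"Q = {q:.2f}, I^2 = {i2:.1f}%")
```

Note that I² is truncated at zero when Q falls below its degrees of freedom, i.e. when observed variation is no more than expected by chance.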