Objectives
The primary objective of this review is to assess the diagnostic accuracy of AI algorithms (the index test), compared with gold-standard screening by human investigators (the reference standard), in identifying relevant records among those retrieved by electronic searches for systematic reviews. The secondary objective is to describe the time and work saved by AI algorithms in literature screening. In addition, we plan to conduct subgroup analyses to explore factors potentially associated with the accuracy of the AI algorithms.
Study registration
We prepared this protocol following the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P).19 This systematic review has been registered on PROSPERO (Registration number: CRD42020170815, 28 April 2020).
Review question
Our review question was refined using the PRISMA-DTA framework, as detailed in Table 1. In this systematic review, “records” refers to the subjects of the diagnostic test (the “participants” in Table 1), and “studies” refers to the studies included in our review.
Table 1
The review question framed using the PRISMA-DTA framework.

| Item | Description |
| --- | --- |
| “Participants”* | Original publications and records identified by the electronic literature search |
| Index test | Automatic literature-screening models using artificial intelligence algorithms |
| Reference standard | Traditional literature screening by human investigators |
| Outcome | Primary outcome: diagnostic accuracy, measured by the sensitivity, specificity, precision, NPV, PPV, NLR, PLR, DOR, F-measure, accuracy, and AUC of the automatic literature-screening models. Secondary outcome: labour and time saved, mainly evaluated as the percentage of retrieved records that reviewers do not have to read (because they have been screened out by the automatic literature-screening models) |

*The “participants” in our review are the original publications and records identified by a systematic literature search, rather than the human participants or patients of traditional systematic reviews.
Abbreviations: AUC, area under the curve; DOR, diagnostic odds ratio; NLR, negative likelihood ratio; NPV, negative predictive value; PLR, positive likelihood ratio; PPV, positive predictive value.
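To make the secondary outcome concrete, it can be written as a simple proportion; this formalisation is ours, not a formula reported by the included studies:

```latex
% Work saved: the share of retrieved records reviewers never have to read
\[
\text{Work saved} = \frac{N_{\text{screened out by the model}}}{N_{\text{retrieved by the search}}} \times 100\%
\]
```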
Inclusion and exclusion criteria
We will include studies in medical research that report a structured study question, describe the source of the training or validation sets, develop or employ AI models for automatic literature screening, and use the screening results of human investigators as the reference standard.
We will exclude traditional clinical studies in human participants, as well as editorials, commentaries, and other non-original reports. Purely methodological studies of AI algorithms without application to evidence synthesis will also be excluded.
Information source and search strategy
An experienced methodologist will search three major public electronic medical and computer science databases: PubMed, Embase, and the IEEE Xplore Digital Library, for publications from January 2000 to the present. We set this time range because, to the best of our knowledge, AI algorithms developed before 2000 are unlikely to be applicable to evidence synthesis.20 In addition to the electronic search, we will identify further relevant studies by checking the reference lists of the studies it retrieves. Related abstracts and preprints will be searched in Google Scholar. No language restrictions will be applied. We will use both free-text words and MeSH/EMTREE terms to develop strategies covering three major concepts: systematic review, literature screening, and AI. Multiple synonyms for each concept will be incorporated into the search. Details of the search strategies are shown in Table 2.
Table 2
Search terms for the three major concepts.

| Concept | Search terms |
| --- | --- |
| Systematic review | #1 ("medical evidence" OR PICO OR PECODR OR "intervention arms" OR "experimental methods" OR "study design parameters" OR "Patient oriented Evidence" OR "eligibility criteria" OR "evidence based medicine" OR "clinically important elements" OR "evidence based practice" OR "results from clinical trials" OR "research results" OR "clinical evidence" OR "Meta Analysis" OR "Clinical Research" OR "medical abstracts" OR "clinical trial literature" OR "clinical trial characteristics" OR "clinical trial protocols" OR "clinical practice guidelines" OR "systematic review") |
| Literature screening | #2 (extract* OR classif* OR identif* OR retriev* OR detect* OR judg* OR determin* OR decid* OR sort* OR infer* OR interpret* OR includ* OR exclud* OR filter OR filtering OR select*) |
| Artificial intelligence | #3 ("Artificial Intelligence" OR "natural language" OR "language processing" OR "Knowledge Acquisition" OR "Knowledge Representation" OR "Support Vector Machine" OR svm OR Gaussian OR Bayes OR Bayesian OR "Cluster" OR Clustering OR "Hidden Markov" OR "conditional random field" OR "Random Forest" OR (Graphical AND model) OR Regression OR "feature engineering" OR "zero-shot learning" OR "few-shot learning" OR "reinforcement learning" OR "transfer learning" OR (unsupervised OR supervised OR semi-supervised OR distant-supervised OR self-supervised) OR "neural network" OR "neural networks" OR (neural AND algorithm*) OR (neural AND machine) OR (network AND algorithm*) OR (network AND machine) OR (automatic AND network) OR (automatic AND networks) OR (automatic AND algorithm*) OR (automatic AND model) OR (automatic AND models) OR (automatic AND machine) OR (automatic AND learning) OR (automatic AND method) OR (learning AND network) OR (learning AND networks) OR (learning AND algorithm*) OR (learning AND machine) OR (learning AND method) OR (deep AND network) OR (deep AND networks) OR (deep AND algorithm*) OR (deep AND model) OR (deep AND models) OR (deep AND machine) OR (deep AND learning)) |
| Combined concepts | #1 AND #2 AND #3 |

Abbreviations: SVM, support vector machine.
Study selection
Records with their titles and abstracts will be downloaded from the online electronic databases and, after removal of duplicates, imported into EndNote X9.3.2 (Thomson Reuters, Toronto, Ontario, Canada) for further processing.
All records will be screened independently by two authors based on titles and abstracts. Those that do not meet the inclusion criteria will be excluded, with the specific reasons recorded. Disagreements will be resolved by discussion, involving a methodologist if necessary. After this initial screening, the full texts of the potentially relevant studies will be reviewed independently by the same two authors to make the final inclusion decisions, with conflicts resolved in the same way as at the initial screening. Excluded studies will be listed, with reasons, in a PRISMA-DTA flow diagram.
Data collection
A data collection form will be used for data extraction. Data from the eligible studies will be extracted independently and verified by two investigators. Disagreements will be resolved through discussion and by consulting the original publication. We will also try to contact the authors to obtain missing data. If a study reports no detailed accuracy data and provides insufficient data to calculate them, it will be omitted from the quantitative synthesis.
The following data will be extracted from the original studies: study characteristics, information on the training and validation sets, and the function and performance of the AI algorithms. The definitions of the variables for data extraction are shown in Table 3 (a minimal data-structure sketch follows the table).
Table 3
Definitions of variables in data extraction.

| Variable | Definition |
| --- | --- |
| Study characteristics | |
| Year | Year of publication |
| Authors | Last names of the authors |
| Study type | Article, abstract, or systematic review |
| Journal, conference | Name of the journal or conference |
| Training set information | |
| Training set | Name of the dataset used for training |
| Area | General medicine, specific disease, or specific intervention |
| Source | Electronic databases searched to build the training set |
| Time range | Time range of the training set |
| Type of publication | Abstract or full text |
| Total number of records | Number of all records in the training set |
| Number of included records | Number of records included at the screening step in the training set |
| Training method | Supervised, semi-supervised, or unsupervised |
| Validation set information | |
| Validation set | Name of the dataset used for validation |
| Area | General medicine, specific disease, or specific intervention |
| Source | Electronic databases searched to build the validation set |
| Time range | Time range of the validation set |
| Type of publication | Abstract or full text |
| Total number of records | Number of all records in the validation set |
| Number of included records | Number of records included at the screening step in the validation set |
| Gold standard | Process of screening by human investigators |
| AI algorithm information | |
| Model name | Name of the model |
| Model type | Classification, regression, ranking, or other |
| Model performance | Including but not limited to sensitivity, specificity, precision, NPV, PPV, NLR, PLR, DOR, F-measure, accuracy, and AUC |
| Cost saving | Reduction in the number of records that human investigators must screen |

Abbreviations: AUC, area under the curve; DOR, diagnostic odds ratio; NLR, negative likelihood ratio; NPV, negative predictive value; PLR, positive likelihood ratio; PPV, positive predictive value.
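For illustration only, the variables in Table 3 could be organised as a flat extraction record; the field names below are our own shorthand, not names taken from any included study:

```r
# Hypothetical skeleton of the data collection form (columns mirror the
# Table 3 variables; all columns start empty and are filled per study).
extraction_form <- data.frame(
  year            = integer(),   # year of publication
  authors         = character(), # last names of the authors
  study_type      = character(), # article, abstract, or systematic review
  venue           = character(), # journal or conference name
  train_set       = character(), train_area    = character(),
  train_source    = character(), train_range   = character(),
  train_n_total   = integer(),   train_n_incl  = integer(),
  training_method = character(), # supervised / semi-supervised / unsupervised
  valid_set       = character(), valid_n_total = integer(),
  valid_n_incl    = integer(),
  model_name      = character(), model_type    = character(),
  stringsAsFactors = FALSE
)
```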
Risk of bias assessment, applicability, and levels of evidence
Two authors will independently assess risk of bias and applicability with a checklist based on the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool.21 QUADAS-2 contains four domains, covering risk of bias in patient selection, the index test, the reference standard, and flow and timing. Risk of bias is classified as “low”, “high”, or “unclear”. Studies at high risk of bias will be excluded in a sensitivity analysis.
In this systematic review, the “participants” are records rather than human subjects, and the index test is an AI model used for automatic literature screening. We will therefore slightly revise QUADAS-2 to fit our research context (Table 4). We removed one signalling question, “Was there an appropriate interval between index test and reference standard?”. In the original QUADAS-2, this question addresses bias caused by a change in disease status between the index test and the reference standard. The “disease status”, i.e. the final inclusion status of a record in our context, cannot change, so this concern does not apply.
Table 4
The revised QUADAS-2 tool for risk of bias assessment.

| Domain | Signalling questions | Answers |
| --- | --- | --- |
| “Patient” (record) selection: risk of bias | Was a consecutive or random sample of records enrolled? | Yes/No/Unclear |
| | Was a case-control design avoided? | Yes/No/Unclear |
| | Did the study avoid inappropriate exclusions? | Yes/No/Unclear |
| | Could the selection of records have introduced bias? | Low/High/Unclear risk |
| “Patient” (record) selection: concerns regarding applicability | Is there concern that the included records do not match the review question? | Low/High/Unclear risk |
| Index test (AI algorithms in literature screening): risk of bias | Were the index test results interpreted without knowledge of the results of the reference standard? | Yes/No/Unclear |
| | If a threshold was used, was it pre-specified? | Yes/No/Unclear |
| | Could the conduct or interpretation of the index test have introduced bias? | Low/High/Unclear risk |
| Index test: concerns regarding applicability | Is there concern that the index test, its conduct, or its interpretation differ from the review question? | Low/High/Unclear risk |
| Reference standard (results of screening by human investigators): risk of bias | Is the reference standard likely to correctly classify the target condition? | Yes/No/Unclear |
| | Were the reference standard results interpreted without knowledge of the results of the index test? | Yes/No/Unclear |
| | Could the reference standard, its conduct, or its interpretation have introduced bias? | Low/High/Unclear risk |
| Reference standard: concerns regarding applicability | Is there concern that the target condition as defined by the reference standard does not match the review question? | Low/High/Unclear risk |
| Flow and timing: risk of bias | Did all records receive a reference standard? | Yes/No/Unclear |
| | Did records receive the same reference standard? | Yes/No/Unclear |
| | Were all records included in the analysis? | Yes/No/Unclear |
| | Could the record flow have introduced bias? | Low/High/Unclear risk |
The certainty of the body of evidence will be evaluated with the Grading of Recommendations, Assessment, Development and Evaluations (GRADE) framework.22
Diagnostic accuracy measures
For each study, we will extract the data for a two-by-two contingency table from the main text, the appendices, or by contacting the corresponding authors, and use them to calculate sensitivity, specificity, precision, negative predictive value (NPV), positive predictive value (PPV), negative likelihood ratio (NLR), positive likelihood ratio (PLR), diagnostic odds ratio (DOR), F-measure, and accuracy with 95% CIs. If the outcomes cannot be formulated as a two-by-two contingency table, we will extract the reported performance data instead. Where possible, we will also assess the area under the curve (AUC), since a two-by-two contingency table may not be available in some scenarios.
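As a minimal sketch of how these measures follow from a two-by-two table, the base-R snippet below uses fabricated counts (tp, fn, fp, tn are placeholders) and a logit-scale Wald interval as one common choice for the 95% CI:

```r
# Hypothetical two-by-two table: rows are the AI model's decision,
# columns are the human investigators' (reference standard) decision.
tp <- 120; fn <- 15    # records included by humans: found / missed by the model
fp <- 300; tn <- 2000  # records excluded by humans: flagged / screened out by the model

sens <- tp / (tp + fn)                   # sensitivity (recall)
spec <- tn / (tn + fp)                   # specificity
prec <- tp / (tp + fp)                   # precision (= PPV)
npv  <- tn / (tn + fn)                   # negative predictive value
plr  <- sens / (1 - spec)                # positive likelihood ratio
nlr  <- (1 - sens) / spec                # negative likelihood ratio
dor  <- plr / nlr                        # diagnostic odds ratio
f1   <- 2 * prec * sens / (prec + sens)  # F-measure (F1)
acc  <- (tp + tn) / (tp + fn + fp + tn)  # overall accuracy

# Wald 95% CI on the logit scale for a proportion (assumes non-zero cells)
logit_ci <- function(x, n) {
  p  <- x / n
  se <- sqrt(1 / x + 1 / (n - x))        # standard error of the log-odds
  c(estimate = p,
    lower = plogis(qlogis(p) - 1.96 * se),
    upper = plogis(qlogis(p) + 1.96 * se))
}
logit_ci(tp, tp + fn)  # sensitivity with 95% CI
logit_ci(tn, tn + fp)  # specificity with 95% CI
```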
Qualitative and quantitative synthesis of results
We will qualitatively describe the application of AI to literature screening. If there are adequate details and sufficiently homogeneous data for quantitative meta-analysis, we will pool the accuracy of AI algorithms in literature screening using the random-effects Rutter-Gatsonis hierarchical summary receiver operating characteristic (HSROC) model, which the Cochrane Collaboration recommends for synthesising diagnostic accuracy evidence.23 The model incorporates the effect of the decision threshold, allowing thresholds to differ across studies. The pooled point estimates of accuracy will be derived from the summary receiver operating characteristic (SROC) curve.
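A sketch of how such a pooling could be run in R, assuming the mada package: mada's reitsma() fits the bivariate Reitsma model, which is equivalent to the Rutter-Gatsonis HSROC model when no covariates are included. The counts below are fabricated placeholders; a real analysis would use the extracted two-by-two tables.

```r
library(mada)  # CRAN package for meta-analysis of diagnostic accuracy

# One row per study: TP/FN/FP/TN from the extracted two-by-two tables
d <- data.frame(
  TP = c(120,  85,  230),
  FN = c( 15,  10,   40),
  FP = c(300, 150,  500),
  TN = c(2000, 900, 3500)
)

fit <- reitsma(d)  # random-effects bivariate model (HSROC-equivalent)
summary(fit)       # pooled estimates on the logit-sensitivity / logit-FPR scale
plot(fit)          # SROC curve with the summary point and confidence region
```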
Subgroup analyses and meta-regression will be used to explore between-study heterogeneity. We will examine the following predefined sources of heterogeneity: (1) type of AI algorithm; (2) study area of the validation set (specific diseases, specific interventions, or a general area); (3) electronic databases searched (PubMed, Embase, or others); and (4) the proportion of eligible to retrieved records (the number of records included at the screening step divided by the number of records identified by the electronic search). Furthermore, we will analyse possible sources of heterogeneity from both dataset and methodological perspectives by entering them into the HSROC model as covariates, following the recommendations of the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy.23 A factor will be regarded as a source of heterogeneity if the coefficient of the corresponding covariate in the HSROC model is statistically significant. We will not evaluate reporting bias (e.g. publication bias), because the assumptions underlying the commonly used methods, such as funnel plots or Egger's test, may not hold in our research context. Data will be analysed in R version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria) with a two-tailed type I error rate of 0.05 (α = 0.05).
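Continuing the sketch above, a study-level covariate can be added to the bivariate/HSROC fit through reitsma()'s formula interface; the covariate here is invented purely for illustration, and a real meta-regression would need far more than three studies.

```r
# Hypothetical study-level covariate: whether the model is deep-learning based
d$deep <- factor(c("no", "yes", "yes"))

# Regress the transformed sensitivity and false-positive rate on the covariate
fit_cov <- reitsma(d, formula = cbind(tsens, tfpr) ~ deep)
summary(fit_cov)  # a statistically significant covariate coefficient would
                  # flag the factor as a source of between-study heterogeneity
```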