Study conduct
We conducted this study according to an a priori protocol, available at https://doi.org/10.7939/DVN/S0UTUF, and have reported it in accordance with recommended reporting standards [20].
Sample of reviews
Senior research staff (AG, MG, SAE, JP, and LH) selected a convenience sample of 16 reviews (11 systematic reviews and 5 rapid reviews) either completed or underway at our centre. We selected the reviews based on the availability of adequate screening and/or study characteristics data to contribute to our objectives, in a format efficiently amenable to analysis. Table 1 shows the review-level characteristics for each, including the review type (systematic or rapid), research question type (single or multiple), intervention or exposure type (simple vs. complex), and included study designs (single vs. multiple). We considered complex interventions to be those that could include multiple components, as opposed to a single treatment (e.g., drug, diagnostic test); typically, these were behavioural interventions. Of the reviews, 11 (69%) were systematic reviews, 10 (63%) investigated a single research question, nine (56%) investigated simple interventions or exposures, and four (25%) included only single study designs. The sources searched for each review are listed in Additional file 1. All systematic reviews used comprehensive searches of electronic databases and grey literature sources, supplemented by reference list scanning. For the rapid reviews, only electronic databases were searched.
Although many modifications to standard systematic review methods may be applied when completing rapid reviews [21], for the purpose of this study we considered only the screening method. For the systematic reviews, title-abstract screening was completed by two independent reviewers who reached consensus on the studies to include in the review. The review team typically comprised a senior reviewer (the reviewer who oversaw all aspects of the review and who had the most methodological and/or clinical expertise) and a second reviewer (who was involved in screening and often other review processes, such as data extraction). For the rapid reviews, screening was completed by one highly experienced reviewer (the senior reviewer). This approach is considered acceptable when evidence is needed for pressing policy and health system decisions [22].
Machine learning tool: Abstrackr
We used Abstrackr (http://abstrackr.cebm.brown.edu) [23], an online ML tool for title-abstract screening, for this study. Among the many available tools, we chose Abstrackr because it is freely available and because testing at our centre found it to be more reliable and user-friendly than other available tools [10]. Experienced reviewers at our centre (n = 8) completed standard review tasks in Abstrackr and rated it, on average, 79/100 on the System Usability Scale [10] (a standard survey commonly used to subjectively appraise the usability of a product or service) [24]. In our analysis of qualitative comments, reviewers described the tool as easy to use and trustworthy, and appreciated the simple and uncluttered user interface [10]. When used to assist the second reviewer in a pair (a semi-automated approach to screening), across three systematic reviews on average only 1% (range, 0 to 2%) of relevant studies (i.e., those included in the final reviews) were missed [10].
To screen in Abstrackr, all records retrieved by the searches must first be uploaded to the system. Once the records are uploaded, titles and abstracts appear one at a time on the user interface, and the reviewer is prompted to label each as ‘relevant’, ‘irrelevant’, or ‘borderline’. While screening, Abstrackr learns from the reviewer’s labels and other data via active learning and dual supervision [23]. In active learning, the reviewer must first screen a ‘training set’ of records, from which the model learns to distinguish relevant from irrelevant records based on common features (i.e., words or combinations of words) [23]. In dual supervision, reviewers can communicate their knowledge of the review task to the model by tagging terms that are indicative of relevance or irrelevance (e.g., the term ‘trial’ may be tagged as relevant in systematic reviews that seek to include only trials) [23]. After screening a training set, the review team can view and download Abstrackr’s relevance predictions for records that have not yet been screened. The predictions are presented to reviewers in two ways: a numeric value representing the probability of relevance (0 to 1) and a binary ‘hard’ screening prediction (true or false, i.e., relevant or irrelevant).
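As a rough illustration of the active-learning step only (this is not Abstrackr’s implementation; the feature extraction, classifier, and cut-off shown are assumptions), a text classifier fit to the labeled training set can score the unscreened records in both of these forms:

```python
# Illustrative sketch of an active-learning prediction step (not Abstrackr's actual model).
# Tagged terms (dual supervision) are not modelled here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def predict_relevance(labeled_texts, labels, unscreened_texts):
    """labeled_texts/unscreened_texts: title + abstract strings;
    labels: 1 = relevant, 0 = irrelevant (the reviewer's training-set decisions)."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    model = LogisticRegression(max_iter=1000)
    model.fit(vectorizer.fit_transform(labeled_texts), labels)

    probabilities = model.predict_proba(vectorizer.transform(unscreened_texts))[:, 1]
    hard_predictions = probabilities >= 0.5   # assumed cut-off for the binary prediction
    return probabilities, hard_predictions
```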
Data collection
Screening simulation. For each review, we uploaded all records retrieved by the searches to Abstrackr for screening. We used the single-reviewer and random citation order settings, and screened a 200-record training set for each review by retrospectively replicating the senior reviewer’s original screening decisions. Abstrackr’s ability to learn and accurately predict the relevance of candidate records depends on the correct identification and labeling of relevant and irrelevant records in the training set; replicating the senior reviewer’s decisions maximized the likelihood of a good-quality training set. Although the optimal training set size is not known [7], the developers of a similar tool recommend a training set that includes at least 40 excluded and 10 included records, up to a maximum of 300 records [25].
For systematic reviews completed at our centre, any record marked as ‘include’ (i.e., relevant) or ‘unsure’ (i.e., borderline) by either of two independent reviewers at the title-abstract screening stage is eligible for full-text scrutiny. For this reason, our screening files typically include one of two screening decisions per record: ‘include/unsure’ (relevant) or ‘exclude’ (irrelevant). Because we could not ascertain retrospectively whether the decision for each record was ‘include’ or ‘unsure’, we entered all ‘include/unsure’ decisions as ‘relevant’ in Abstrackr. We did not use the ‘borderline’ decision.
After screening the training set, we downloaded the predicted relevance of the remaining records. Typically, the predictions became available within 24 hours. When they did not become available within 48 hours, we continued to screen in batches of 100 records until they did. We used the hard screening predictions rather than applying custom thresholds to the relevance probabilities for each remaining record. In the absence of guidance on the optimal threshold, using the hard screening predictions likely reflects how review teams use the tool in practice.
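For illustration only, the alternative we did not pursue would amount to choosing a cut-off on the downloaded probabilities; the values below are placeholders, and the assumption that the hard predictions correspond to a fixed cut-off is ours:

```python
# Hypothetical sketch: treating records as relevant at a custom probability threshold
# instead of using the default hard predictions (assumed here to reflect a fixed cut-off).
def predicted_relevant(probabilities, threshold=0.5):
    """Return the indices of records treated as relevant at the given threshold."""
    return [i for i, p in enumerate(probabilities) if p >= threshold]

probs = [0.92, 0.61, 0.48, 0.05]                 # placeholder probabilities
default_set = predicted_relevant(probs)          # mirrors a 0.5 cut-off
cautious_set = predicted_relevant(probs, 0.2)    # lower cut-off: more manual screening, fewer missed studies
```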
Although potentially prone to bias, the liberal-accelerated screening approach [18, 19] saves time in traditional systematic reviews even without the use of ML. In this approach, any record marked as ‘include’ or ‘unsure’ by either of two independent reviewers automatically moves forward to full-text screening. Only records marked as ‘exclude’ by one reviewer are screened by a second reviewer to confirm or refute their exclusion. The time-consuming step of achieving consensus at the title-abstract level is omitted.
Building on earlier findings from a similar sample of reviews [16], we devised a retrospective screening simulation to investigate the benefits and risks of using ML in combination with the liberal-accelerated screening approach, compared with traditional dual independent screening. In this simulation, after screening a training set of 200 records, the senior reviewer would download the predictions and continue screening only the records predicted to be relevant. The second reviewer would screen only the records excluded by the senior reviewer or predicted to be irrelevant by Abstrackr, to confirm or refute their exclusion. This simulation was relevant only to the systematic reviews, for which dual independent screening had been undertaken. Because a single reviewer completed study selection for the rapid reviews, retrospectively simulating liberal-accelerated screening for these reviews was not possible.
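The decision logic of the simulated workflow can be sketched as follows (a minimal sketch; the record structure and field names are hypothetical, and the recorded decisions come from the original dual independent screening):

```python
# Minimal sketch of the simulated semi-automated liberal-accelerated workflow.
# Each record is a dict with hypothetical fields:
#   'predicted_relevant' - Abstrackr's hard prediction (True/False)
#   'senior_includes'    - senior reviewer's original title-abstract decision
#   'second_includes'    - second reviewer's original title-abstract decision
def simulate_liberal_accelerated(records):
    senior_screens, second_screens, to_full_text = [], [], []
    for r in records:
        if r['predicted_relevant']:
            senior_screens.append(r)        # senior reviewer screens predicted-relevant records
            if r['senior_includes']:
                to_full_text.append(r)      # included by the senior reviewer; no second opinion needed
            else:
                second_screens.append(r)    # second reviewer verifies the senior reviewer's exclusion
                if r['second_includes']:
                    to_full_text.append(r)
        else:
            second_screens.append(r)        # second reviewer verifies the predicted exclusion
            if r['second_includes']:
                to_full_text.append(r)
    return senior_screens, second_screens, to_full_text
```

A relevant study would be missed whenever both the prediction (or the senior reviewer) and the second reviewer would have excluded it at the title-abstract stage.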
Differences in review results. To investigate differences in the results of systematic reviews when relevant studies are omitted, for each systematic review with a pairwise meta-analysis we re-ran the analyses for the primary effectiveness outcomes, omitting the studies that would have been erroneously excluded from the final reports via the semi-automated liberal-accelerated simulation. We restricted this analysis to systematic reviews with pairwise meta-analyses because appraising this outcome among reviews with qualitative or quantitative narrative syntheses was not feasible within the available time and resources. When the primary outcomes were not explicitly reported, we considered any outcome for which certainty of evidence appraisals were reported to be a primary outcome. Otherwise, we considered the first reported outcome to be the primary outcome.
Characteristics of missed studies. We pooled the data for the studies included in the final reports of all reviews to explore which characteristics might be associated with correctly or incorrectly labeling relevant studies. From the final report for each review, we extracted the risk of bias (low, unclear, or high) and design (trial, observational, mixed methods, qualitative, or review) of each included study. For reviews that included study designs other than randomized trials, we treated methodological quality as the inverse of risk of bias. We categorized the risk of bias retrospectively, based on the quality scores derived from various appraisal tools (Additional file 2). We also documented the year of publication and the impact factor of the journal in which each included study was published, based on 2018 data reported on the Journal Citation Reports website (Clarivate Analytics, Philadelphia, Pennsylvania). A second investigator verified all extracted data prior to analysis.
Data analysis
We exported the data to SPSS Statistics (v.25, IBM Corporation, Armonk, New York) or StatXact (v.12, Cytel Inc., Cambridge, Massachusetts) for analysis. To evaluate the benefits and risks of using Abstrackr’s predictions in the context of liberal-accelerated screening in systematic reviews, we used data from 2 × 2 cross-tabulations to calculate standard metrics [8], as follows (expressed formally after the list):
- Proportion missed (error): of the studies included in the final report, the proportion erroneously excluded during title and abstract screening.
- Workload savings (absolute screening reduction): of the records that need to be screened at the title and abstract stage, the proportion that would not need to be screened manually.
- Estimated time savings: the estimated time saved by not screening records manually. We assumed a screening rate of 0.5 minutes per record and an 8-hour work day [26].
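Expressed formally (the notation is ours and is not drawn from [8]), with n denoting counts of included studies and N denoting counts of records:

\[
\text{Proportion missed} = \frac{n_{\text{erroneously excluded}}}{n_{\text{included in final report}}}, \qquad
\text{Workload savings} = \frac{N_{\text{not screened manually}}}{N_{\text{to screen}}}
\]
\[
\text{Estimated time savings (workdays)} = \frac{N_{\text{not screened manually}} \times 0.5\ \text{min}}{60\ \text{min/h} \times 8\ \text{h/day}}
\]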
To determine the effect of missed studies on the results of systematic reviews with pairwise meta-analyses, we compared the pooled effect estimates, 95% confidence intervals, and statistical significance from the meta-analyses with the missed studies removed to those from the original reviews. We did not appraise changes in clinical significance.
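As a generic illustration only (the original analyses used the models reported in each review; the fixed-effect inverse-variance pooling and the placeholder values below are assumptions), the comparison amounts to re-pooling the retained studies and contrasting the estimate and confidence interval with the original:

```python
import math

def inverse_variance_pool(log_estimates, standard_errors):
    """Fixed-effect inverse-variance pooling on the log scale.
    Returns the pooled estimate and its 95% confidence interval (log scale)."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, log_estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

# Placeholder data: log risk ratios and standard errors for four included studies,
# of which the last would have been missed in the simulation.
log_rr = [-0.22, -0.10, -0.35, -0.05]
se = [0.10, 0.15, 0.20, 0.12]
original = inverse_variance_pool(log_rr, se)
reduced = inverse_variance_pool(log_rr[:-1], se[:-1])   # missed study removed
```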
To explore which review, study, and publication characteristics might affect the correctness of Abstrackr’s predictions, we first compared the proportion of studies incorrectly predicted as irrelevant by Abstrackr across review-level characteristics (inclusion of only trials vs. multiple study designs; single vs. multiple research questions; systematic review vs. rapid review; complex vs. simple interventions) and study-level characteristics (study design [trial, observational, mixed methods, qualitative, or review] and risk of bias [low vs. unclear/high]) via Fisher’s exact tests. We compared the mean (SD) year of publication and the impact factor of the journals in which the studies were published between correctly and incorrectly labeled studies via unpaired t-tests.
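For illustration (the counts and values below are placeholders, not study data, and the sketch shows only the 2 × 2 case of the exact test):

```python
# Illustrative sketch of the comparisons described above; all numbers are placeholders.
from scipy.stats import fisher_exact, ttest_ind

# Fisher's exact test: relevant studies missed (predicted irrelevant) vs. not missed,
# split by a review-level characteristic (e.g., systematic vs. rapid review).
table = [[4, 2],     # missed:     systematic reviews, rapid reviews
         [60, 30]]   # not missed: systematic reviews, rapid reviews
odds_ratio, p_fisher = fisher_exact(table)

# Unpaired t-test: mean publication year of correctly vs. incorrectly labeled studies.
years_correct = [2012, 2015, 2016, 2018]
years_incorrect = [2008, 2010, 2013]
t_statistic, p_ttest = ttest_ind(years_correct, years_incorrect)
```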