Compared with dual independent screening, leveraging Abstrackr’s predictions in combination with a liberal-accelerated screening approach resulted in few (≤ 3), if any, missed records. The missed records would not have changed the conclusions for the main effectiveness outcomes in the impacted reviews; moreover, as we have previously shown, it is likely that in the context of a comprehensive search, missed studies would be identified by other means (e.g., reference list scans) [16]. The workload savings were substantial; although not quite as efficient as screening by a single reviewer, the approach missed considerably fewer studies in many (60%) of the reviews. Included studies were correctly identified more frequently among reviews that included multiple research questions (vs. a single question) and those that included only randomized trials (vs. only reviews, or multiple study designs). Correctly identified studies were more likely to be randomized trials, mixed methods studies, and qualitative studies (vs. observational studies and systematic reviews).
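To make the comparison concrete, below is a minimal sketch of how we conceptualize the Abstrackr-assisted liberal-accelerated workflow. The decision rule shown, whereby Abstrackr’s prediction acts as a safety net for records the single human reviewer excludes, is an illustrative assumption based on our earlier simulation [16], not a definitive implementation; the function names are hypothetical.

```python
# Illustrative sketch (assumed logic, not Abstrackr's actual API): a single
# human screens all records liberally; Abstrackr's relevance prediction acts
# as a safety net for records the human excludes. A record advances to
# full-text review if either the human or Abstrackr flags it as relevant.

def liberal_accelerated(records, human_includes, abstrackr_predicts_relevant):
    """Return the records eligible for full-text scrutiny."""
    eligible = []
    for record in records:
        if human_includes(record) or abstrackr_predicts_relevant(record):
            eligible.append(record)
    return eligible
```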
As part of our previous work, we simulated four additional methods whereby we could leverage Abstrackr’s predictions to expedite screening, including fully automated and semi-automated approaches [16]. The simulation that provided the best balance of reliability and workload savings was a semi-automated second screener approach, based on an algorithm first reported by Wallace et al. in 2010 [24]. In this approach, the senior reviewer would screen a 200-record training set and thereafter screen only those records that Abstrackr predicted to be relevant. The second reviewer would screen all records as usual. The second reviewer’s decisions and those of the senior reviewer and Abstrackr would then be compared to determine which records would be eligible for full-text scrutiny. Among the same sample of reviews, the records that were missed were identical to those in the liberal-accelerated simulation. The median (IQR) workload savings was 2409 (3616) records, equivalent to an estimated time savings of 20 (31) hours or 3 (4) working days. Thus, compared to the semi-automated second screener approach [24], the liberal accelerated approach resulted in marginally greater workload and time savings without compromising reliability.
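A minimal sketch of the semi-automated second screener logic follows. The rule for combining decisions (a record advances if either screening stream includes it) and the function names are our illustrative assumptions; the original algorithm is described by Wallace et al. [24].

```python
# Sketch of the semi-automated second screener workflow (assumed combining
# rule: a record advances to full text if either stream includes it).

TRAINING_SET_SIZE = 200  # records the senior reviewer screens up front

def semi_automated_second_screener(records, senior_includes, second_includes,
                                   abstrackr_predicts_relevant):
    """Return IDs of records eligible for full-text scrutiny.

    The three callables simulate, respectively, the senior reviewer's and
    second reviewer's decisions and Abstrackr's prediction after training
    on the senior reviewer's first 200 decisions (True = include/relevant).
    """
    eligible = set()
    for i, record in enumerate(records):
        # Stream 1: the senior reviewer screens the training set, then only
        # the records Abstrackr predicts to be relevant.
        if i < TRAINING_SET_SIZE or abstrackr_predicts_relevant(record):
            if senior_includes(record):
                eligible.add(record["id"])
        # Stream 2: the second reviewer screens every record as usual.
        if second_includes(record):
            eligible.add(record["id"])
    return eligible
```

Under this assumed rule, the workload savings accrue entirely to the senior reviewer, whose burden shrinks from the full set of records to the training set plus those predicted to be relevant.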
In exploring the screening tasks for which ML-assisted screening might be best suited, some of our findings were paradoxical. For example, studies were more often correctly identified as relevant in systematic reviews with multiple research questions (vs. a single question). There was no difference in the proportion of studies correctly identified as relevant between systematic reviews that investigated complex vs. simple interventions. A multitude of interacting factors likely affects Abstrackr’s predictions, including the size and composition of the training sets. More research is needed to inform a framework to assist review teams in deciding when or when not to use ML-assisted methods. Our findings are consistent with those of other studies suggesting that ML may be particularly useful for expediting simpler review tasks (e.g., differentiating trials from studies of other designs) [25], leaving more complex decisions to human experts. Cochrane’s RCT Classifier, which essentially automates the identification of trials, is one example of such an approach [25]. By automatically excluding ‘obviously irrelevant’ studies, it leaves human reviewers to screen only those records where the screening decisions are more ambiguous.
Our data suggest that combining Abstrackr’s early predictions with the liberal-accelerated screening method may be a suitable approach for reviews where the limited risk of missing a small number of records is acceptable (e.g., some rapid reviews) or where the outcomes are not critical. This may be true for some scoping reviews, where the general purpose is to identify and map the available evidence [26] rather than to synthesize data on the effect of an intervention on one or more outcomes. When conceptualizing the relative advantages of semi-automated title-abstract screening, it will be important to look beyond study selection to other tasks that may benefit from the associated gains in efficiency. For example, published systematic reviews frequently report limits to the searches (e.g., limited databases, published literature only) and eligibility criteria (e.g., trials only, English language only) [27], both of which can have implications for the conclusions of the review. If studies can be selected more efficiently, review teams may choose to broaden their searches or eligibility criteria, potentially missing fewer studies overall even if a small proportion is erroneously omitted through semi-automation.
Given the retrospective nature of most studies, the semi-automation of different review tasks has largely been studied as a set of isolated processes. Prospective studies are needed to bridge the gap between hypothetical opportunities and concrete demonstrations of the risks and benefits of various novel approaches. For example, a research team in Australia recently completed a full systematic review in two weeks using a series of semi-automated and manual processes, and reported on the facilitators of and barriers to their approach [28]. To build trust, beyond replicating existing studies, it will be important for review teams to be able to conceptualize, step by step, how ML can be integrated into their standard procedures [13] and under what circumstances the benefits of different approaches outweigh the inherent risks. As a starting point, prospective direct comparisons of systematic reviews completed with and without ML-assisted methods would be helpful to encourage adoption. There may be ways to incorporate such evaluations into traditional systematic review processes without substantially increasing reviewer burden.
Strengths and limitations
This is one of the few studies to report on the potential impact of ML-assisted title-abstract screening on the results and conclusions of systematic reviews, and to explore the correctness of predictions by review-, study-, and publication-level characteristics. Although many tools and methods are available to semi-automate title-abstract screening, we used only Abstrackr and simulated only a liberal-accelerated approach; the findings should not be generalized to other tools or approaches. Moreover, we used relatively small training sets in an attempt to maximize efficiency, and it is possible that different training sets would have yielded more or less favourable results. Because our evaluation was retrospective, we estimated time savings based on a screening rate of two records per minute. Although ambitious, assuming a fast rate yields conservative estimates of time savings and allows comparisons with previous studies that have used the same rate [10, 15, 16].
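As a worked example of this conversion, applied to the median workload savings reported above (the roughly 7-hour working day is our assumption for illustration):

$$
\frac{2409\ \text{records}}{2\ \text{records/min}} = 1204.5\ \text{min} \approx 20\ \text{h} \approx 3\ \text{working days}\ (\sim 7\ \text{h/day}).
$$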