Screening characteristics (Table 2) for the included reviews have been reported in a separate study investigating additional unique simulations [15]. The retrospective screening workload varied by review (median (IQR, given as the 25th and 75th percentiles), 2123 (1321, 5961) records). The workload tended to be larger for the systematic reviews (5092 (2078, 8746) records) than for the rapid reviews (964 (767, 1536) records). Across reviews, a median (IQR) of 9 (4, 14)% of candidate records were included following title and abstract screening (8 (3, 9)% for the systematic reviews and 18 (9, 21)% for the rapid reviews). A median (IQR) of 2 (0.4, 4)% of candidate records were included in the final reports (0.6 (0.4, 2)% in the systematic reviews and 8 (2, 8)% in the rapid reviews). After the training sets had been screened, Abstrackr predicted that, across reviews, a median (IQR) of 32 (13, 41)% of the remaining records were relevant (25 (12, 34)% for the systematic reviews and 38 (37, 59)% for the rapid reviews).
Table 2. Screening characteristics of the included reviews
Review | Human reviewers a: screening workload, n | Human reviewers a: included, title and abstract, n (%) | Human reviewers a: included, final report, n (%) | Abstrackr: training set, n includes/excludes (% includes) b | Abstrackr: predicted relevant, n (%)
Systematic reviews
Activity and pregnancy | 2928 | 236 (8) | 98 (3) | 10/190 (5) | 319 (12)
Antipsychotics | 12156 | 1177 (10) | 127 (1) | 15/185 (8) | 2117 (18)
Brain injury | 6262 | 518 (8) | 40 (1) | 11/189 (6) | 2126 (35)
Concussion | 1439 | 46 (3) | 5 (<1) | 3/197 (2) | 638 (51)
Diabetes | 47141 | 698 (1) | 205 (<1) | 104/196 (53) | 5187 (11)
Digital technologies for pain | 2662 | 207 (8) | 64 (2) | 15/185 (8) | 321 (13)
Experiences of bronchiolitis | 651 | 88 (14) | 28 (4) | 13/187 (7) | 111 (25)
Experiences of UTIs | 1493 | 25 (2) | 4 (<1) | 3/197 (2) | 864 (67)
Treatments for bronchiolitis | 5861 | 518 (9) | 137 (2) | 12/188 (6) | 656 (12)
VBAC | 5092 | 807 (16) | 21 (<1) | 25/175 (14) | 1490 (30)
Visual acuity | 11229 | 224 (2) | 1 (<1) | 4/296 (1) | 3639 (33)
Rapid reviews
Community gardening | 1536 | 153 (10) | 32 (2) | 55/145 (28) | 139 (10)
Depression safety | 964 | 44 (5) | 8 (1) | 7/193 (4) | 449 (59)
Depression treatments | 1583 | 418 (26) | 179 (11) | 43/157 (22) | 904 (65)
Preterm delivery | 451 | 96 (21) | 34 (8) | 47/153 (24) | 95 (38)
Workplace stress | 767 | 141 (18) | 59 (8) | 36/164 (18) | 210 (37)
UTI = urinary tract infection; VBAC = vaginal birth after cesarean.
a Retrospective screening data.
b The training set was 200 records for all reviews except Diabetes and Visual Acuity, for which it was 300.
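The medians and IQRs quoted for the screening workloads can be recomputed from the values in Table 2. The following is a minimal sketch rather than the authors' analysis code: the variable and function names are illustrative, and the quartile rule (linear interpolation between order statistics, Python's "inclusive" method) is an assumption, since the quartile method is not stated here.

```python
from statistics import median, quantiles

# Screening workloads (records) transcribed from Table 2.
systematic = [2928, 12156, 6262, 1439, 47141, 2662, 651, 1493, 5861, 5092, 11229]
rapid = [1536, 964, 1583, 451, 767]


def median_iqr(values):
    """Return (median, 25th percentile, 75th percentile).

    Assumption: quartiles use linear interpolation between order
    statistics (the "inclusive" method); the source does not say
    which quartile rule was used.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


for label, values in [("systematic", systematic), ("rapid", rapid)]:
    med, q1, q3 = median_iqr(values)
    print(f"{label}: median {med}, IQR ({q1}, {q3})")
```

Under this assumption the sketch gives 5092 (2077.5, 8745.5) for the systematic reviews and 964 (767, 1536) for the rapid reviews, in line with the rounded values reported in the text. Other quartile conventions can shift the IQR bounds slightly.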
Liberal accelerated screening simulation
Table 3 shows the proportion missed, workload savings, and estimated time savings had the reviewers leveraged Abstrackr’s predictions and the liberal-accelerated screening approach in each systematic review. Records missed are those that are included in the final report, but were excluded via the simulated approach at the title-abstract screening stage. To ascertain whether the simulated approach provided any advantage over screening by a single reviewer, we have also included a column showing the number and proportion of studies that the second reviewer would have missed had they screened the records in isolation.
Relative to dual independent screening, the simulated approach erroneously excluded no studies in five (50%) of the systematic reviews. In two (20%) systematic reviews, one record was erroneously excluded, equivalent to 1% of the included records in both reviews. In the remaining three (30%) reviews, three records were erroneously excluded, equivalent to 2 to 14% of the included studies. The simulated approach was advantageous (i.e., fewer records were missed) relative to screening by a single reviewer in six (60%) of the systematic reviews; in many cases, the difference was substantial (e.g., 11% vs. 43% in the Experiences of bronchiolitis review; 1% vs. 11% in the Activity and pregnancy review; 1% vs. 7% in the Treatments for bronchiolitis review; 14% vs. 24% in the VBAC review; 0% vs. 5% in the Brain injury review).
The median (IQR) workload savings across reviews was 3143 (1044, 5023) records (35 (30, 43)%) compared to dual independent screening. This equated to a median (IQR) estimated time saving of 26 (9, 42) hours, or 3 (1, 5) working days of uninterrupted screening.
Table 3. Proportion missed, workload savings, and estimated time savings for each systematic review a
Systematic review | Records missed, single reviewer, n (%) | Records missed, simulation, n (%) | Workload savings, n (%) | Estimated time savings, h (d)
Activity and pregnancy | 11 (11) | 1 (1) | 2536 (43) | 21 h (3 d)
Antipsychotics | 4 (3) | 3 (2) | 10508 (43) | 88 h (11 d)
Brain injury | 2 (5) | 0 (0) | 4193 (33) | 35 h (4 d)
Concussion | 0 (0) | 0 (0) | 635 (22) | 5 h (<1 d)
Digital technologies for pain | 0 (0) | 0 (0) | 2271 (43) | 19 h (2 d)
Experiences of bronchiolitis | 12 (43) | 3 (11) | 389 (30) | 3 h (<1 d)
Experiences of UTIs | 0 (0) | 0 (0) | 448 (15) | 4 h (<1 d)
Treatments for bronchiolitis | 10 (7) | 1 (1) | 5300 (45) | 44 h (6 d)
VBAC | 5 (24) | 3 (14) | 3750 (37) | 31 h (4 d)
Visual acuity | 0 (0) | 0 (0) | 7418 (33) | 62 h (8 d)
d = days; h = hours; UTI = urinary tract infection; VBAC = vaginal birth after cesarean.
a The Diabetes review was excluded because the screening data were not in a format amenable to analysis.
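The workload and time savings in Table 3 follow from simple arithmetic. The sketch below is an illustration under assumptions that are not stated in this section but are consistent with the tabulated values: the percentage is taken relative to a dual independent screening workload of two screening decisions per record, screening takes 0.5 minutes per record, and a working day is 8 hours. The function and parameter names are hypothetical.

```python
def savings_summary(records_saved, workload,
                    minutes_per_record=0.5, hours_per_day=8):
    """Summarise workload and time savings for one review.

    Assumptions (inferred from Table 3, not stated by the authors here):
    - dual independent screening means 2 * workload screening decisions;
    - each record takes `minutes_per_record` minutes to screen;
    - a working day is `hours_per_day` hours of uninterrupted screening.
    """
    percent_saved = 100 * records_saved / (2 * workload)
    hours_saved = records_saved * minutes_per_record / 60
    days_saved = hours_saved / hours_per_day
    return percent_saved, hours_saved, days_saved


# Activity and pregnancy: 2536 records saved, workload of 2928 records.
pct, hours, days = savings_summary(2536, 2928)
print(f"{pct:.0f}% of screening decisions, {hours:.0f} h, {days:.0f} d")
```

For the Activity and pregnancy review this reproduces the tabulated 43%, 21 h, and 3 d after rounding. Changing `minutes_per_record` shows how sensitive the time estimate is to the assumed screening speed.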
Impact of missed studies on the results
Among the five systematic reviews where studies were missed, three included pairwise meta-analyses (Activity and pregnancy, Antipsychotics, and Treatments for bronchiolitis) (Additional file 4). The single missed study in each of the Activity and pregnancy and Treatments for bronchiolitis reviews was not included in any of the meta-analyses. It is notable that the missed study in the Activity and pregnancy review was written in Chinese, although it did include an English abstract. Neither study reported on the primary outcomes of its respective systematic review.
For Antipsychotics, there were three missed studies. Of the 49 pairwise comparisons for which there was at least low strength of evidence in the final report, one of the missed studies (McCracken et al., 2002) was included in 8 (16%) comparisons. The 8 meta-analyses compared second-generation antipsychotics (SGAs) to placebo for the following outcomes for autism spectrum disorder: irritability, lethargy/social withdrawal, stereotypy, inappropriate speech, compulsions, response rate, discontinuations due to lack of efficacy, and appetite increase. Additional file 5 shows the pooled estimate of effect (95% CI) and statistical significance for the 8 relevant meta-analyses in the original report and following the removal of the study by McCracken et al. The statistical significance of the pooled estimate of effect changed in one of the meta-analyses (i.e., 2% of all comparisons for which there was at least low strength of evidence included in the report). For children with autism spectrum disorder, the original meta-analysis found a statistically significant reduction in compulsions in favor of SGAs (mean difference (MD) (95% CI), -1.53 (-2.92, -0.15), p=0.03). The effect was no longer statistically significant following the removal of McCracken et al. from the analysis (MD (95% CI), -1.17 (-2.70, 0.36), p=0.14). Otherwise, removing McCracken et al. from relevant meta-analyses did not result in changes in point estimates or confidence intervals that impacted the statistical significance of the findings.
Although not included in any of the meta-analyses, the large retrospective cohort study by Bobo et al. (2013) contributed low certainty evidence of an increased risk for type 2 diabetes among patients taking SGAs. No other studies contributed data for this outcome. Although the prospective study by Correll et al. (2009) contributed to the network meta-analysis for harms, it did not report on any of the intermediate or effectiveness outcomes.
Association of study, review, and publication characteristics with predictions
The pooled dataset for the studies included in the 16 final reports contained 802 records for which Abstrackr had made a prediction (excluding those in the training sets). Among these, Abstrackr correctly predicted that 696 (87%) were relevant, and incorrectly predicted that 106 (13%) were irrelevant, after the training set (200 records, or 300 for the Diabetes and Visual acuity reviews).
Review characteristics. Table 4 shows the characteristics of the reviews, stratified by the correctness of Abstrackr’s relevance predictions. Six-hundred-eighty-nine (86%) studies were included across the systematic reviews and 113 (14%) across the rapid reviews. There was no statistically significant difference (P=0.37) in Abstrackr’s ability to correctly predict the relevance of studies by review type (n = 601 (87%) of studies in systematic reviews and 95 (84%) of those in rapid reviews were correctly identified).
Two-hundred-ninety-seven (37%) studies were included in reviews that answered a single research question, and 505 (63%) were included in reviews that answered multiple questions. There was a statistically significant difference (P=0.01) in Abstrackr’s ability to correctly predict the relevance of studies by research question type. Four-hundred-fifty (89%) studies in reviews with multiple research questions were correctly identified. The proportion of correctly identified studies was lower (n=246, 83%) in reviews with a single research question.
Four-hundred-three (50%) studies were included in reviews that tested a simple intervention/exposure, and 399 (50%) were included in reviews that tested complex interventions. There was no statistically significant difference (P=0.47) in Abstrackr’s ability to correctly predict the relevance of studies by intervention or exposure type (n=346 (86%) studies in reviews of simple interventions and 350 (88%) studies in reviews of complex interventions were correctly identified).
Two-hundred-one (25%) studies were included in reviews that included only one study design (trials or systematic reviews), while the remaining 601 (75%) were included in reviews that included multiple designs (including observational studies). There was a statistically significant difference (P=0.003) in Abstrackr’s ability to correctly predict the relevance of studies by included study designs. Abstrackr correctly predicted the relevance of 122 (95%) studies in reviews that included only randomized trials, compared with 57 (79%) in reviews that included only systematic reviews and 517 (86%) in reviews that included multiple study designs.
Table 4. Select review characteristics, stratified by Abstrackr’s relevance predictions
Review characteristic | n studies | Correctly predicted as relevant, n (%) | Incorrectly predicted as irrelevant, n (%) | p-value a
Review type
Systematic | 689 | 601 (87) | 88 (13) | 0.37
Rapid | 113 | 95 (84) | 18 (16) |
Research question
Single | 297 | 246 (83) | 51 (17) | 0.01
Multiple | 505 | 450 (89) | 55 (11) |
Intervention/exposure
Simple | 403 | 346 (86) | 57 (14) | 0.47
Complex | 399 | 350 (88) | 49 (12) |
Included study designs
Single – only randomized trials | 129 | 122 (95) | 7 (5) | 0.003
Single – only systematic reviews | 72 | 57 (79) | 15 (21) |
Multiple | 601 | 517 (86) | 84 (14) |
a Fisher’s Exact test.
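For the two-category comparisons in Table 4, Fisher's exact test can be illustrated directly from the cell counts. The sketch below is a standard-library illustration, not the authors' code: the function name is hypothetical, and the two-sided convention used (summing the probabilities of all tables no more likely than the observed one) is an assumption, because the convention is not stated here.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Assumption: the two-sided p-value sums the probabilities of all
    tables (with the same margins) that are no more likely than the
    observed one; the source does not state which convention was used.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def weight(k):
        # Unnormalised hypergeometric probability of k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    extreme = sum(w for w in map(weight, range(k_min, k_max + 1)) if w <= observed)
    return extreme / comb(total, col1)


# Review type (Table 4): correct vs. incorrect predictions,
# systematic reviews in the first row and rapid reviews in the second.
p_review_type = fisher_exact_2x2(601, 88, 95, 18)
print(f"Review type: p = {p_review_type:.2f}")
```

Integer weights are compared exactly, so no floating-point tolerance is needed when deciding which tables count as at least as extreme. The three-category comparison of included study designs would need the generalisation of the test to larger tables, which this sketch does not cover.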
Study characteristics. Table 5 shows the characteristics of the studies, stratified by Abstrackr’s relevance predictions. Of the included studies, 483 (60%) were trials, 214 (27%) were observational, 2 (0.2%) were mixed methods, 15 (2%) were qualitative, and 88 (11%) were reviews. There was a statistically significant difference (P=0.0006) in Abstrackr’s ability to correctly predict the relevance of included studies by study design. Abstrackr correctly predicted the relevance of 438 (91%) of the trials, 2 (100%) of the mixed methods studies, and 14 (93%) of the qualitative studies. By comparison, the proportion of correct predictions was lower for observational studies (n=169, 79%) and reviews (n=73, 83%).
Of the 620 studies for which we had risk of bias details, 120 (19%) were at low and 500 (81%) were at unclear or high overall risk of bias. There was a statistically significant difference (P=0.039) in Abstrackr’s ability to correctly predict the relevance of included studies by risk of bias. Abstrackr correctly predicted the relevance of 438 (88%) of studies at unclear or high risk of bias as compared to 96 (80%) of those at low risk of bias.
Table 5. Study design and study-level risk of bias, stratified by Abstrackr’s relevance predictions
Study characteristic | n studies | Correctly predicted as relevant, n (%) | Incorrectly predicted as irrelevant, n (%) | p-value a
Design
Trial | 483 | 438 (91) | 45 (9) | 0.0006
Observational | 214 | 169 (79) | 45 (21) |
Mixed methods | 2 | 2 (100) | 0 (0) |
Qualitative | 15 | 14 (93) | 1 (7) |
Review | 88 | 73 (83) | 15 (17) |
Risk of bias
Low | 120 | 96 (80) | 24 (20) | 0.039
High or unclear | 500 | 438 (88) | 62 (12) |
a Fisher’s exact test.
Publication characteristics. Table 6 shows the characteristics of the publications, stratified by Abstrackr’s relevance predictions. Across all studies, the mean (SD) publication year was 2008 (7). There was a statistically significant difference (P=0.02) in Abstrackr’s ability to correctly identify relevant studies by publication year. The mean (SD) year of publication was 2008 (7) for studies correctly identified compared to 2006 (10) for those erroneously excluded (mean difference (95% CI), 1.77 (0.27, 3.26)). This difference is not considered practically significant.
The mean (SD) impact factor for the journals in which the studies were published was 4.87 (8.49). There was no statistically significant difference (P=0.74) in Abstrackr’s ability to correctly identify relevant studies by the impact factor for the journal in which they were published. The mean (SD) impact factor was 4.91 (8.39) for studies correctly identified as relevant and 4.61 (9.14) for those erroneously excluded (mean difference (95% CI), 0.30 (-1.44, 2.03)).
Table 6. Publication year and journal impact factor, stratified by Abstrackr’s relevance predictions
Study characteristic | All studies | Correctly predicted as relevant | Incorrectly predicted as irrelevant | Mean difference (95% CI) a | p-value b
Publication year, mean (SD) | 2008 (7) | 2008 (7) | 2006 (10) | 1.77 (0.27, 3.26) | 0.02
Impact factor, mean (SD) | 4.87 (8.49) | 4.91 (8.39) | 4.61 (9.14) | 0.30 (-1.44, 2.03) | 0.74
a Mean difference between correctly identified studies and those erroneously excluded.
b Unpaired t-test.
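The relationship between the group summaries and the mean difference with its 95% CI in Table 6 can be sketched from summary statistics alone. This is an approximation under several assumptions, not a reproduction of the authors' analysis: it uses an equal-variance (pooled) standard error, substitutes a normal critical value for the t quantile, and takes group sizes of 696 and 106 from the pooled dataset, which may differ if some journals had no impact factor. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_difference_ci(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """Mean difference and confidence interval from summary statistics.

    Assumptions: equal-variance (pooled) standard error, and a normal
    critical value in place of the t quantile, which is a close
    approximation for samples this large.
    """
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = mean1 - mean2
    return diff, diff - z * se, diff + z * se


# Impact factor row of Table 6; group sizes of 696 and 106 are assumed
# from the pooled dataset and are not confirmed for this variable.
diff, low, high = mean_difference_ci(4.91, 8.39, 696, 4.61, 9.14, 106)
print(f"MD {diff:.2f} (95% CI {low:.2f}, {high:.2f})")
```

With these inputs the interval lands within rounding distance of the reported 0.30 (-1.44, 2.03). The publication year row cannot be checked the same way, because the rounded means (2008 and 2006) do not carry enough precision to recover the reported difference of 1.77.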