The REPRISE (REProducibility and Replicability In Syntheses of Evidence) project consists of a suite of studies to address our four objectives (Fig. 1). We will evaluate the completeness of reporting and sharing of review materials in a random sample of 300 systematic reviews with meta-analysis published in 2020 (Study 1). We will survey authors of systematic reviews to explore their views on sharing review data, analytic code and other materials and their understanding of and opinions about replication of systematic reviews (Study 2). We will then evaluate the extent of variation in results when we (a) independently reproduce meta-analyses using the same computational steps and analytic code (if available) as used in the original review (Study 3), and (b) crowdsource teams of systematic reviewers to independently replicate a subset of methods (searches for studies, selection of studies for inclusion, collection of outcome data, and synthesis of results) in a sample of the original reviews; 30 reviews will be replicated by one team each and two reviews will be replicated by 15 teams (Study 4).
We will focus on systematic reviews of the effects of health, social, behavioural and educational interventions. Eligible interventions will include any intervention designed to improve health (defined according to the World Health Organisation as “a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity” (37)), promote social welfare and justice, change behaviour, or improve educational outcomes. Examples of eligible interventions include inhaled corticosteroids to alleviate symptoms of asthma, provision of charity or welfare to alleviate social or economic problems, provision of regulations to improve safety in workplaces, use of bystander programs to prevent violence or harassment, or reduction of class size to enhance educational outcomes for high school students.
Study 1: Evaluation of the transparency of systematic reviews with meta-analysis
The objective of this study is to evaluate the completeness of reporting and sharing of review data, analytic code and other materials (e.g. list of all citations screened, data collection forms) in systematic reviews with meta-analysis. We will do this by conducting a cross-sectional evaluation of systematic reviews published in 2020.
Identification of systematic reviews
We will include a random sample of completed systematic reviews with meta-analysis in our evaluation. To be considered a “systematic review”, authors will need to have, at a minimum, clearly stated their review objective(s) or question(s), reported the source(s) (e.g. bibliographic databases) used to identify studies meeting the eligibility criteria, and reported conducting an assessment of the validity of the findings of the included studies, for example via an assessment of risk of bias or methodological quality. We will not exclude articles providing limited detail about the methods used (e.g. articles will be considered eligible if they provide only a list of the key words used in bibliographic databases rather than a line-by-line Boolean search strategy). Systematic reviews with meta-analysis meeting the following additional criteria will be eligible for inclusion in the study:
written in English, indexed in the most recent one-month period closest to when the search for the present study is run, in one of the following bibliographic databases in the health and social sciences: PubMed, Education Collection via ProQuest, Scopus via Elsevier, and Social Sciences Citation Index and Science Citation Index Expanded via Web of Science;
includes randomized or non-randomized studies (or both) evaluating the effects of a health, social, behavioural or educational intervention on humans;
lists the references for all studies included in the review; and
presents at least one pairwise meta-analysis of aggregate data, including at least two studies, using any effect measure (e.g. mean difference, risk ratio).
Using search strategies created by an information specialist (SM), we will systematically search each database listed above for systematic reviews with meta-analysis meeting the eligibility criteria. For example, we will run the following search in PubMed: (meta-analysis[PT] OR meta-analysis[TI] OR systematic[sb]) AND 2020/11/02:2020/12/02[EDAT]. Search strategies for other databases are available in additional file 1. We will download all records and remove duplicates using EndNote software, export the unique records to Microsoft Excel and randomly sort them using the RAND() function, then import the first 500 randomly sorted records into Covidence software (38) for screening. Two investigators will independently screen the titles and abstracts in batches of 500, retrieve any potentially eligible reports and independently assess each report against the eligibility criteria. This process will be repeated until we reach a target of 300 eligible systematic reviews. In the unlikely event that we do not reach our target of 300 reviews after screening all records yielded from the search, we will rerun the search to identify records published in the subsequent one-month period and repeat the screening steps described above. We will include systematic reviews regardless of whether the question(s) they address are also addressed by another systematic review in the sample. Including 300 systematic reviews will allow us to estimate the percentage of reviews reporting a particular practice (e.g. reporting the full line-by-line search strategy) to within a maximum margin of error of 6% assuming a prevalence of 50% (i.e. 1.96 × √(0.5 × (1 − 0.5)/300) ≈ 0.057); for a prevalence of less (or greater) than 50%, the margin of error will be smaller.
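As an illustrative check of the precision calculation above, the following minimal Python sketch evaluates the normal-approximation margin of error for a sample of 300 reviews at several assumed prevalences. The function name and the list of prevalences are ours, chosen for illustration only; they are not part of the protocol.

```python
import math


def margin_of_error(prevalence: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% CI for a proportion."""
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(prevalence * (1.0 - prevalence) / n)


if __name__ == "__main__":
    # Illustrative prevalences only; 0.5 gives the widest margin.
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"prevalence={p:.1f}  margin={margin_of_error(p, 300):.3f}")
```

At a prevalence of 0.5 the function returns roughly 0.057, which is below the stated maximum of 6%. Any other prevalence gives a smaller value.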
Two investigators will independently and in duplicate collect information about each systematic review using a standardized data collection form. Any discrepancies in the data collected will be resolved via discussion or adjudication by another investigator. Prior to data collection, both investigators will pilot test the data collection form on a random sample of 10 systematic reviews within the set of 300, discuss any discrepancies and adjust the form where necessary. The form will include items capturing general characteristics of the review (e.g. journal it is published in, country of corresponding author, intervention under investigation, number of included studies) and items that characterise, for example, whether systematic reviewers:
reported the complete line-by-line search strategy used for at least one (or for each) electronic database searched;
reported the process for selecting studies, collecting data and assessing risk of bias/quality in the included studies;
presented the summary statistics, effect estimates and measures of precision (e.g. confidence intervals) for each study included in the first meta-analysis reported in the review; and
made data files and analytic code publicly available, and if so, specified how to access them.
We will evaluate the main systematic review report, any supplementary files provided on the journal server or in a public or institutional repository, or the review protocol if the authors specify that the relevant information is contained therein. The wording of items will be identical to that used in previous evaluations of this nature (25, 39) to allow investigation of improvements in transparency over time. We will record whether authors specified using a guideline such as the 2009 PRISMA statement (40) to guide reporting. We will also search the websites of the journals that published the included systematic reviews, and record whether or not they have a data or code sharing policy, or both, that is, a request that authors of all research articles or systematic reviews in particular share their data (or analytic code) when they submit an article (16). If such a policy exists, we will extract it verbatim and classify it as “mandatory” (i.e. sharing data or analytic code, or both, is a condition of publication of systematic reviews by the journal) or “desirable” (i.e. authors are advised to share data or analytic code, but failing to do so will not preclude publication of systematic reviews).
We will characterise indicators of transparency in the systematic reviews using descriptive statistics (e.g. frequency and percentage for categorical items and mean and standard deviation for continuous items). We will use risk ratios with 95% confidence intervals to examine differences in percentages of each indicator between reviews published in a journal that publishes evidence syntheses only (e.g. Cochrane Database of Systematic Reviews, Campbell Systematic Reviews, JBI Evidence Synthesis) versus published elsewhere; between the 2020 reviews of health interventions and a previously examined sample of 110 systematic reviews of health interventions indexed in MEDLINE in February 2014 (25, 39); between reviews with self-reported use of a reporting guideline versus without; and between reviews published in journals with versus without any data or code sharing policy, and with versus without a mandatory data or code sharing policy.
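The comparison of percentages described above can be sketched as follows. This is a minimal illustration that assumes the usual large-sample confidence interval for a risk ratio, computed on the log scale. The function name and the counts in the usage example are hypothetical and do not come from the study.

```python
import math


def risk_ratio(events_a: int, total_a: int, events_b: int, total_b: int,
               z: float = 1.96) -> tuple[float, float, float]:
    """Risk ratio (group A versus group B) with a log-scale Wald 95% CI."""
    if events_a <= 0 or events_b <= 0:
        # The log-scale standard error is undefined with zero events.
        raise ValueError("both groups need at least one event")
    rr = (events_a / total_a) / (events_b / total_b)
    se_log_rr = math.sqrt(1 / events_a - 1 / total_a
                          + 1 / events_b - 1 / total_b)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


if __name__ == "__main__":
    # Hypothetical counts: 60 of 100 reviews in one group report an
    # indicator, compared with 30 of 100 reviews in the other group.
    rr, lower, upper = risk_ratio(60, 100, 30, 100)
    print(f"RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The sketch deliberately raises an error when either group has zero events. How such cases would be handled in the actual analysis is not specified in this section.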
Study 2: Evaluation of systematic reviewers’ views on sharing review data, analytic code and other materials and on the replication of reviews
The objective of this study is to explore systematic reviewers’ views on sharing review data, analytic code and other materials (e.g. list of all citations screened, data collection forms) and their understanding of and opinions about replication of systematic reviews. We will gather systematic reviewers’ views via an online survey.
Recruitment of systematic reviewers
Two investigators will independently screen the remaining titles and abstracts identified in Study 1 for inclusion in Study 2. We will invite all corresponding authors of systematic reviews of the effects of health, social, behavioural and educational interventions that meet the inclusion criteria for Study 1 to complete the survey, excluding authors of the randomly selected subsample of 300. The reason for excluding this subsample is to reduce author burden, since these authors will be contacted in Study 3. We will contact authors via email and send up to three reminders separated by three weeks in case of non-response. The survey will be administered via Qualtrics (Qualtrics, Provo, UT, USA).
The survey will capture demographic characteristics of the systematic reviewers (e.g. country of residence, career stage, number of systematic reviews conducted, areas of expertise) and their views on the open science movement. We will include questions asking authors to indicate the extent to which they agree that:
systematic reviewers should share review data, analytic code and other materials routinely;
potential facilitators or barriers apply to the sharing of systematic review material, which we will draw from previous studies examining facilitators and barriers to adopting open science practices (41–47).
Responses will be collected via a 7-point Likert scale ranging from ‘Strongly disagree’ to ‘Strongly agree’. Authors will be given an opportunity to suggest other materials they believe systematic reviewers should share, and facilitators or barriers not listed in the survey. Finally, we will gauge authors’ understanding of and opinions about replication of systematic reviews, adapting the questions asked in previous studies evaluating researchers’ views on replication studies (48–50).
We will analyse quantitative survey data by calculating the frequency and percentage for each response option for each question. We will use a deductive approach to coding of free-text responses to survey questions (51). First, we will read each line of text and label it with a code that represents the meaning of the text; the initial codes will be informed by our prior work conducted to draft survey items. As each subsequent line of text is read, existing codes will be revised, where necessary, and new codes added, to ensure consistency of coding across responses. We will then organise codes into overarching themes. One investigator will perform the coding of text and categorisation of codes into themes, which will be verified by another investigator.
Study 3: Evaluation of the reproducibility of meta-analyses
The objective of this study is to evaluate the reproducibility of meta-analyses. We will do this by evaluating the extent of variation in results when we independently reproduce meta-analyses using the same computational steps and analytic code (if available) as used in the original review.
We will include in this study all systematic reviews identified in Study 1 where the reviewers made available the data necessary to reproduce the first meta-analysis reported in the review, henceforth referred to as the “index meta-analysis”. If the systematic reviewers uploaded relevant data files (e.g. a Microsoft Excel or CSV spreadsheet, or a RevMan file containing all study effect estimates included in the index meta-analysis) or analytic code as a supplement to the paper, or reported a link to a publicly accessible repository (e.g. Open Science Framework, Dryad, figshare) or a website containing relevant files, we will download the files deposited. If no such files are referred to within the review report, or if the links to files are broken, we will invite the corresponding author of the review to share their systematic review data file(s) and analytic code for the purposes of reproduction, record what materials were shared, and request reasons for non-availability if materials were not shared. If no data file is made publicly accessible, one investigator will extract the study effect estimates included in the index meta-analysis from the relevant table or figure (e.g. forest plot) reported in the review. We will not extract or check for accuracy the data in reports of the included studies.
We anticipate obtaining data files (as a supplementary file to the paper, or from a public repository or the systematic reviewers) for at least 90 (30%) of the 300 reviews. In 2018, colleagues performing another study sought the aggregated data from 200 interrupted time series studies published from 2013–2017; they obtained the necessary data for 30% of studies from the study authors or as a supplementary file to the paper (52). It is possible that the response rate will be higher in our study given the shorter time span between publication and us making the request (which will be less than one year post-publication).
Reproduction of meta-analyses
Two investigators will independently carry out a reanalysis of the index meta-analysis of the review using the data available. For each meta-analysis, we will conduct the reanalysis according to the methods described in the published report of the review, regardless of how appropriate we consider the methods. If unable to conduct the reanalysis based on the data available and methods described, we will seek clarification from the systematic reviewers, and attempt to reanalyse the data based on the additional information provided. Each reanalysis will be conducted using the same statistical software package and version used by the original systematic reviewers, where possible. If systematic reviewers shared their analytic code, we will use it without modification. If no analytic code was shared, or if we are unable to access the software package and version used by the original systematic reviewers, we will write the code necessary to reanalyse the data ourselves, using the metafor package (53) in the open source statistical software R (R Development Core Team), based on the statistical methods described by the reviewers. We will record for each reanalysis the meta-analytic estimate of effect, its corresponding 95% confidence interval, and inferences about heterogeneity (Cochran’s Q and its P value, I², τ², and a prediction interval).
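To make the recorded quantities concrete, the sketch below computes a random-effects pooled estimate, its 95% confidence interval, Cochran's Q, I² and τ² from study effect estimates and their variances. It is an illustration only and makes several assumptions: it is written in Python rather than R, it does not use or reproduce the metafor package, and it assumes the DerSimonian-Laird estimator of τ², whereas each actual reanalysis would follow whatever method the original reviewers described. The P value for Q and the prediction interval are omitted because they require chi-square and t quantiles that the Python standard library does not provide.

```python
import math


def dersimonian_laird(effects: list[float], variances: list[float],
                      z: float = 1.96) -> dict[str, float]:
    """Random-effects meta-analysis, assuming the DerSimonian-Laird estimator."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Inverse-variance (fixed-effect) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q and the moment-based between-study variance (tau^2).
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights, pooled estimate and Wald-type 95% CI.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return {
        "estimate": pooled,
        "ci_lower": pooled - z * se,
        "ci_upper": pooled + z * se,
        "q": q,
        "i2": i2,
        "tau2": tau2,
    }


if __name__ == "__main__":
    # Hypothetical study effects (e.g. mean differences) and variances.
    result = dersimonian_laird([0.1, 0.3, 0.5], [0.04, 0.04, 0.04])
    for name, value in result.items():
        print(f"{name}: {value:.4f}")
```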
Two investigators will independently classify each index meta-analysis into one of three reproducibility categories:
‘results fully reproducible’ (i.e. no difference [with allowance for trivial discrepancies such as those due to computational algorithms] is observed between the original and recalculated meta-analytic effect estimate, its 95% confidence interval, and inferences about heterogeneity reported in the original review);
‘results not fully reproducible’ (i.e. a difference [even after allowance for trivial discrepancies] is observed between the original and recalculated meta-analytic effect estimate, its 95% confidence interval, or inferences about heterogeneity reported in the original review); or
‘results not able to be reproduced because of missing information’.
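The three categories above could be pre-screened programmatically before the two investigators make their independent judgements. The following sketch is hypothetical: the field names, the absolute tolerance of 0.01 standing in for "trivial discrepancies", and the use of one tolerance for every statistic are our assumptions. The protocol leaves these judgements to the investigators.

```python
from typing import Mapping, Optional

# Hypothetical field names for the statistics that are compared.
FIELDS = ("estimate", "ci_lower", "ci_upper", "i2", "tau2")


def classify(original: Mapping[str, Optional[float]],
             recalculated: Optional[Mapping[str, float]],
             tol: float = 0.01) -> str:
    """Assign a reproducibility category under an assumed tolerance."""
    if recalculated is None:
        # The reanalysis could not be run at all.
        return "results not able to be reproduced because of missing information"
    for field in FIELDS:
        reported = original.get(field)
        if reported is None:
            continue  # not reported in the original review, nothing to compare
        value = recalculated.get(field)
        if value is None or abs(reported - value) > tol:
            return "results not fully reproducible"
    return "results fully reproducible"


if __name__ == "__main__":
    # Hypothetical original and recalculated results.
    original = {"estimate": 0.50, "ci_lower": 0.20, "ci_upper": 0.80, "i2": 40.0}
    recalculated = {"estimate": 0.50, "ci_lower": 0.20, "ci_upper": 0.80,
                    "i2": 40.0, "tau2": 0.02}
    print(classify(original, recalculated))
```

A single absolute tolerance is a simplification, because a difference of 0.01 means something different for an I² expressed as a percentage than for a log risk ratio. A real pre-screen would probably need a tolerance per statistic.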
Two investigators will also independently specify whether they believe the observed difference between the original and recalculated summary estimate and its precision was meaningful, that is, would lead to a change in the interpretation of the results (classified as ‘difference meaningful’ or ‘difference not meaningful’). Any discrepancies in the classifications assigned by the two investigators will be resolved via discussion or adjudication by another investigator on the project. We will also record any difficulties in obtaining and using the author-supplied data or analytic code.
We will calculate the frequency and percentage (with 95% confidence interval) of (i) systematic reviews for which a data file was made publicly accessible, (ii) systematic reviews for which analytic code was made publicly accessible, (iii) meta-analyses classified as having fully reproducible results without involvement of the original reviewer, (iv) meta-analyses classified as having fully reproducible results with involvement of the original reviewer, and (v) differences between the original and recalculated summary estimate and its precision that were classified as meaningful. We will calculate agreement between the original and recalculated meta-analytic effects, displayed using Bland-Altman plots (54), and tabulate discordance between P values for the meta-analytic effects, by categorising the P values based on commonly used levels of statistical significance, namely P < 0.01; 0.01 ≤ P < 0.05; 0.05 ≤ P < 0.1; P ≥ 0.1. We will also classify any difficulties in obtaining and using the author-supplied data or analytic code into conceptually related themes, to generate a list of common challenges experienced.
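Two of the analyses described above lend themselves to a short sketch: banding P values into the four stated categories and cross-tabulating original against recalculated bands, and computing the bias and 95% limits of agreement that underlie a Bland-Altman plot. The function names and example values are hypothetical, and the limits of agreement assume the conventional mean difference ± 1.96 standard deviations.

```python
import statistics
from collections import Counter


def p_band(p: float) -> str:
    """Map a P value to one of the four bands named in the protocol."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("P value must lie between 0 and 1")
    if p < 0.01:
        return "P < 0.01"
    if p < 0.05:
        return "0.01 <= P < 0.05"
    if p < 0.1:
        return "0.05 <= P < 0.1"
    return "P >= 0.1"


def discordance_table(pairs: list[tuple[float, float]]) -> Counter:
    """Count (original band, recalculated band) combinations."""
    return Counter((p_band(orig), p_band(recalc)) for orig, recalc in pairs)


def bland_altman_limits(original: list[float], recalculated: list[float],
                        z: float = 1.96) -> tuple[float, float, float]:
    """Mean difference (bias) and assumed 95% limits of agreement."""
    diffs = [r - o for o, r in zip(original, recalculated)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return bias, bias - z * sd, bias + z * sd


if __name__ == "__main__":
    # Hypothetical (original, recalculated) P values for three meta-analyses.
    table = discordance_table([(0.03, 0.04), (0.04, 0.06), (0.20, 0.25)])
    for (orig_band, recalc_band), count in table.items():
        print(f"{orig_band} -> {recalc_band}: {count}")
    # Hypothetical original and recalculated effect estimates.
    print(bland_altman_limits([1.0, 2.0, 3.0], [1.1, 1.9, 3.0]))
```

Off-diagonal cells of the table, where the original and recalculated bands differ, are the discordant pairs.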
Study 4: Evaluation of the replicability of systematic reviews
The objective of this study is to evaluate the replicability of systematic reviews. We will do this by evaluating the extent of variation in results when we crowdsource teams of systematic reviewers to independently replicate the searches for studies, selection of studies for inclusion in the review, collection of outcome data from studies, and synthesis of results in a sample of the original reviews. By ‘crowdsource’ we mean recruiting a large group of individuals to complete the systematic review tasks (55–57).
We recognise that the terminology for ‘replication’ is not standardised within and across disciplines (7, 58). In this study, we will adopt the non-procedural definitions of replication advocated by Nosek and Errington (5) and Machery (6); that is, replicators will not need to follow every single step exactly as reported in the original systematic review, but they will be constrained by the original review question and must avoid making changes to the methods and concepts that might be reasonably judged to violate an attempt to answer that question.
We will use as our initial sampling frame the systematic reviews identified in Study 1 where the index (first reported) meta-analysis was reported completely. Specifically, we will include meta-analyses in which the summary statistics (e.g. means and standard deviations per group) or an effect estimate (e.g. mean difference) and measure of precision (e.g. 95% confidence interval) were presented numerically for each study in a table or figure. We anticipate that such details will be available in at least 225 (75%) of the 300 systematic reviews, based on observations in previous evaluations of systematic reviews in medicine (25, 26) and psychology (30, 32). For reasons of feasibility, we will then restrict the sampling frame to those reviews that included 5–10 studies in the index meta-analysis (which is likely to be the case in half of the reviews (39)), and in which searches of databases, registers or websites were carried out in English only. From this subset we will draw a random sample of 32 reviews for replication by crowdsourced systematic reviewers (whom we refer to as ‘replicators’).
Crowdsourcing of reviewers
We will recruit replicators using approaches that have been used successfully in previous crowdsourced replication projects (10, 59–61). We will send email invitations to our existing networks of systematic reviewer and methodologist collaborators (internal and external to the institutions we are affiliated with). These include investigators who contributed to the 2020 update of the PRISMA reporting guideline for systematic reviews (62); members of the Society for Research Synthesis Methodology; and co-convenors of Cochrane Methods Groups, Campbell Methods Coordinating Group, and Joanna Briggs Institute (JBI) Methodology Groups, all of whom have extensive experience conducting systematic reviews and meta-analyses or developing methodology for evidence synthesis. Finally, we will advertise the project via our own social media channels and invite evidence synthesis organisations (e.g. Cochrane, Campbell Collaboration, JBI, Agency for Healthcare Research and Quality Evidence-based Practice Center Program, Collaboration for Environmental Evidence, 3ie, Global Evidence Synthesis Initiative, SPOR Evidence Alliance, Evidence Synthesis International), and journals that predominantly publish systematic reviews (e.g. Cochrane Database of Systematic Reviews, The Campbell Library, BMC Systematic Reviews, JBI Evidence Synthesis) to do so too. Via all these avenues, we will invite individuals to participate as a crowdsourced reviewer or recommend the opportunity to colleagues who may be suitable. As an incentive for participation, all replicators will be invited to contribute as authors on the main paper resulting from this study.
To be eligible to participate, replicators will need to have had experience with running a systematic search in a bibliographic database, collecting outcome data from studies or undertaking meta-analysis in at least one systematic review of any intervention within the previous three years. To confirm that researchers unknown to us have the necessary skills to participate, we will ask each to describe any relevant publications and training in systematic review methods, and to specify what tasks they contributed to in previous reviews. Three members of the research team will preside over an accreditation committee to ensure all replicators meet a similar standard. We aim to recruit at least 60 replicators who will form 30 teams, with each team including at least two members.
One of the REPRISE investigators will assemble all the information and files necessary for the replication of the 32 systematic reviews. This will involve extracting from each systematic review report (or review protocol or supplementary files, where necessary):
the inclusion and exclusion criteria for the review, such as eligible participants, interventions, outcomes, study designs, languages of publication;
full details of the search methods, including the line-by-line search strategies for each database consulted, and dates when databases were last searched;
results of each study included in the index meta-analysis (i.e. summary statistics and effect estimate with a measure of precision, such as a confidence interval or standard error); and
results of the meta-analysis (i.e. meta-analytic effect estimate with a measure of precision and inferences about heterogeneity).
We will prepare forms for replicators to record all screening decisions, data collected, and synthesis results generated in the replication attempts, which they will be permitted to modify however they see fit. We will share all instructions and PDF copies of full text reports with replicators via the Open Science Framework repository.
We will also invite the authors of the original reviews to provide us with a file containing all their screening decisions, if not already made publicly accessible. Examples of such files include a Microsoft Excel or EndNote file listing all titles and abstracts screened, and a list of citations for all full text reports retrieved, together with the eligibility decision for each.
Of the 32 reviews included, 30 will be replicated by one team each, who will be asked to address the question addressed by the index meta-analysis in the original review and will be provided with detailed information (where available) about the methods used in the original review. We will strive to match the topic of the review to the expertise of the replicators. The remaining two reviews will be replicated by 15 teams each, who will be asked to address the question addressed by the index meta-analysis in the original review but will be provided with minimal information about the methods used in the original review. The purpose of the latter replication is to explore how different teams approach the synthesis of the same set of studies (using a “many analysts” approach (20, 60, 63)). For the two reviews replicated by 15 teams, we will strive to select reviews with multiplicity of results in the included studies, diversity in the study characteristics, diversity in the risk of bias in the included studies, and missing evidence (e.g. unpublished studies), and which match the content expertise of the majority of replicators. Each team will conduct two replications (one of each type). We will ensure that none of the replicators are involved in the replication of an original review that they co-authored, of an original review that addresses the same or a similar question as another systematic review they have conducted, or of an original review which includes a study that they conducted. Replicators will be asked to sign an electronic contract confirming that they will not access the original review assigned to them (to prevent it from influencing their replication attempt).
Replications by one team each
For the 30 systematic reviews replicated by one team each, replicators will initially undertake three tasks. First, one team member will re-run the search strategies for all bibliographic databases searched in the original review that they (or the REPRISE investigators) have an institutional subscription to; searches of other sources, such as trials registers and websites, will not be replicated. Searches will be run using the exact same search string and date limits as reported in the original review. Where possible, searches will be run so as not to retrieve records that were published within the date limits but indexed in the database after the search was originally run. If full search strategies are not reported, replicators will be permitted to request the strategies from the authors of the original systematic review (with the contact made via one of the REPRISE investigators). If authors of the original review reported or provided only the search terms used, not the full line-by-line search strategy, replicators will need to attempt to reconstruct a search strategy from the terms provided (and will be permitted to seek assistance from an information specialist if they do not have one on their team). If authors of the original review reported or provided a search strategy for one database only, replicators will need to attempt to reconstruct search strategies for the other databases by adapting the one provided. Replicators will document which original search strategies they attempted to rerun, which search strategies they needed to reconstruct or adapt based on the partial information available, any errors detected when rerunning each original search strategy, the number of citations that each database yielded, the total number of citations yielded across all databases consulted, and the number of unique citations after duplicates were removed.
Second, two team members will screen independently a random sample of a maximum of 100 titles and abstracts yielded from the searches against the inclusion criteria reported in the original review, and record their screening decisions (‘include’, ‘exclude’, or ‘unsure’). We have restricted this step to 100 records to minimize burden on replicators. Any discrepancies between the team members will be resolved via discussion or adjudication by another team member (or, if the team comprises two members only, by one of the REPRISE investigators who will be unaware of the studies included in the original review).
Third, we will retrieve and send to team members full text reports of all the studies included in the original systematic review (not just those included in the index meta-analysis), and full text reports of studies excluded from the original review (if cited, listed in a ‘table of excluded studies’, listed in a supplementary file or shared by the original authors on request). We will cap the number of excluded full text reports to screen at 50, drawing a random sample if more than 50 were listed by the original reviewers. If information about full text reports that were excluded is not available, we will send only full text reports of the included studies to team members. Two team members will screen independently all study reports written in English that were assigned to them against the inclusion criteria reported in the original review, and record screening decisions (‘include’, ‘exclude’, or ‘unsure’). Any discrepancies between the team members will be resolved via discussion or adjudication by another team member (or, if the team comprises two members only, by one of the REPRISE investigators who will be unaware of the studies included in the original review).
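The two capped random draws described above (at most 100 titles and abstracts, and at most 50 excluded full text reports) could be implemented along the following lines. This is a sketch under our own assumptions: the function name, the record identifiers and the use of a fixed seed to make the draw repeatable are illustrative and are not specified in the protocol.

```python
import random
from typing import Sequence


def capped_random_sample(records: Sequence[str], cap: int, seed: int) -> list[str]:
    """Return all records if there are at most `cap`, else a random sample of `cap`."""
    if cap < 0:
        raise ValueError("cap must be non-negative")
    if len(records) <= cap:
        return list(records)
    rng = random.Random(seed)  # local generator so the draw is repeatable
    return rng.sample(list(records), cap)


if __name__ == "__main__":
    # Hypothetical identifiers standing in for deduplicated search results.
    titles = [f"record-{i}" for i in range(250)]
    to_screen = capped_random_sample(titles, cap=100, seed=2020)
    print(len(to_screen))

    # Hypothetical list of excluded full text reports, below the cap of 50.
    excluded = [f"excluded-{i}" for i in range(12)]
    print(len(capped_random_sample(excluded, cap=50, seed=2020)))
```

Recording the seed alongside the screening decisions would allow the same sample to be regenerated later.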
For these three tasks, we will only provide teams with the background, rationale, objectives, search strategies, eligibility criteria and the screening form/checklist from the original review (where available), so that their screening decisions are not influenced by knowledge of which studies were included in the original review. Replicators will be permitted to contact the authors of the original systematic review for clarification if any of the study eligibility criteria are unclear to them (with the contact made via one of the REPRISE investigators). If replicators feel unable to rerun or reconstruct any search strategies because the methods reported are too limited, or if the reported search strategies do not rerun properly, replicators will note this and skip the second step (screening of titles and abstracts), given that no citations will be available to screen.
Next, we will provide teams with reports of all studies included in the index meta-analysis and information about what to collect from each study. Such information will include the outcome domain (e.g. social isolation), experimental and comparator intervention (e.g. visiting by a volunteer versus no visiting), meta-analytic effect measure (e.g. mean difference) and any decision rules reported by the original systematic reviewers regarding which results to select from studies if multiple results were available (e.g. which measurement scale, time point or analysis sample was selected). Two replicators in the team will collect independently the outcome data from each study report that they deem most compatible with the index meta-analysis and the methods stipulated by the original reviewers. Any discrepancies in the data collected by team members will be resolved via discussion or adjudication by another team member or one of the REPRISE investigators. Where necessary, replicators will calculate a study effect estimate and its 95% confidence interval based on the summary statistics collected.
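As an example of the final step above, the sketch below derives a mean difference and its 95% confidence interval from per-group summary statistics. It assumes a normal-approximation interval with unpooled variances, which is one common choice. Replicators would use whatever effect measure and method the original reviewers stipulated, and the function name and numbers here are hypothetical.

```python
import math


def mean_difference(mean_1: float, sd_1: float, n_1: int,
                    mean_2: float, sd_2: float, n_2: int,
                    z: float = 1.96) -> tuple[float, float, float]:
    """Mean difference (group 1 minus group 2) with an assumed Wald 95% CI."""
    if n_1 <= 0 or n_2 <= 0:
        raise ValueError("group sizes must be positive")
    md = mean_1 - mean_2
    # Standard error from the two group variances (no pooling assumed).
    se = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return md, md - z * se, md + z * se


if __name__ == "__main__":
    # Hypothetical summary statistics for intervention and comparator groups.
    md, lower, upper = mean_difference(10.0, 2.0, 50, 8.0, 2.0, 50)
    print(f"MD = {md:.2f} (95% CI {lower:.3f} to {upper:.3f})")
```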
Using the data collected, replicators will then calculate a meta-analytic effect estimate, its 95% confidence interval and P value, and inferences about heterogeneity, according to the meta-analysis methods described in the report, using the statistical package they are most familiar with (e.g. R, Stata, RevMan, Comprehensive Meta-Analysis). Replicators will be permitted to contact the authors of the original systematic review to clarify missing or unclear details about how they did their analysis. Replicators will also be permitted to contact the authors of studies included in the index meta-analysis when essential data in the study reports were missing or ambiguous, regardless of whether the authors of the original systematic review did so. In both scenarios, one of the REPRISE investigators will make the contact. We will provide fields in the data collection form for replicators to record any challenges they experienced with recalculating study or meta-analytic results.
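As a rough sketch of the kind of recalculation involved, the code below pools study effect estimates using inverse-variance weights with a DerSimonian and Laird estimate of between-study variance, and reports Cochran's Q and I² as heterogeneity summaries. This is one widely used approach and is assumed here for illustration only; replicators would apply whichever model the original review describes, and the function and variable names are hypothetical.

```python
import math


def inverse_variance_meta(effects, ses, z=1.959964):
    """Random-effects meta-analysis (DerSimonian-Laird) from study effects
    and their standard errors. Falls back to the fixed-effect result when
    the estimated between-study variance (tau2) is zero."""
    w = [1.0 / se ** 2 for se in ses]                  # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]      # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    # Two-sided P value from the standard normal distribution.
    p_value = math.erfc(abs(pooled / se_pooled) / math.sqrt(2.0))
    return {
        "estimate": pooled,
        "ci": (pooled - z * se_pooled, pooled + z * se_pooled),
        "p_value": p_value,
        "q": q,
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical study results (e.g. mean differences and standard errors).
result = inverse_variance_meta([0.0, 2.0], [1.0, 1.0])
```

Dedicated packages implement further options (e.g. alternative variance estimators or confidence interval adjustments), so small numerical differences between software are to be expected.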
Once they have completed the replication attempt, we will provide teams with a table or figure (e.g. forest plot) indicating the study results included in the original meta-analysis. Each replicator will independently record, for each study included in the meta-analysis, whether the results they extracted from the study report (or via contacting the study authors) matched the results included in the meta-analysis. Also, one REPRISE investigator and the team who attempted to replicate the review will independently classify the meta-analysis into one of three replication categories:
‘results fully replicable’ (i.e. no difference [with allowance for trivial discrepancies such as those due to computational algorithms] is observed between the original and recalculated meta-analytic effect estimate, its 95% confidence interval, and inferences about heterogeneity reported in the original review);
‘results not fully replicable’ (i.e. a difference [even after allowance for trivial discrepancies] is observed between the original and recalculated meta-analytic effect estimate, its 95% confidence interval, or inferences about heterogeneity reported in the original review); or
‘results not able to be replicated because of missing information’.
A REPRISE investigator and the team will also independently specify whether they believe the observed difference between the original and recalculated summary estimate and its precision was meaningful, that is, would lead to a change in the interpretation of the results (classified as ‘difference meaningful’ or ‘difference not meaningful’). Any discrepancies in the classifications assigned by the REPRISE investigator and team will be resolved via discussion or adjudication by another REPRISE investigator.
Replications by 15 teams
For the two systematic reviews replicated by 15 teams, we will provide teams with reports of all studies included in the review (not just those included in the index meta-analysis). For these reviews, we will instruct teams to attempt a synthesis of a specific outcome domain (e.g. delinquent behaviour) using the data available in the study reports; however, we will not tell them which studies or results were included in the original meta-analysis, or any decision rules the original systematic reviewers used to select results from studies when multiple were available. Each team will decide for themselves which data to select from the study reports (e.g. which time point or scale to select if multiple were available for the outcome domain), whether to synthesise the data, which studies and data to include in the synthesis, and how to synthesise the data (e.g. which meta-analysis model to use). Teams will be permitted, but not expected, to assess the risk of bias in the studies using any tool they deem suitable, to help inform which studies to include in the synthesis. Teams will then calculate study and synthesised effects using the statistical package they are most familiar with. Teams will need to submit to the REPRISE investigators a report specifying the analysis methods they used and the results they found, which we will de-identify to prevent identification of team members. Teams will not be permitted to publicise or discuss their methods and results with others until advised by the REPRISE investigators, to ensure independent analyses across teams.
Once teams complete both replications, investigators will conduct semi-structured interviews (of approximately 30 minutes' duration) with replicators to discuss the analytical steps they took and to understand the decision-making processes used when synthesising the data. The interviews will be conducted by two investigators, one who will lead the discussion and the other who will listen and consider questions that may need to be asked for clarification or further exploration. We will focus particularly on decisions or steps for which replicators found there was insufficient (or no) operational detail to decide what to do. We will also gauge their views on the appropriateness of the methods used by the original reviewers, alternative statistical synthesis approaches that they could have undertaken (particularly those conducted by other teams), and any challenges experienced with conducting both replications. We will analyse the data after each interview and adjust our questioning in subsequent interviews where necessary, to explore important leads that we uncover. We will conduct the interviews using videoconferencing software, and will audio-record and transcribe them verbatim.
For the 30 systematic reviews replicated by one team each, we will assess agreement between the original and replicated review in the number of citations yielded from each database, in total and once duplicates were removed, by calculating the weighted Kappa statistic and percentage agreement (both metrics will be presented with 95% confidence intervals). We will use the same metrics to assess agreement between the original and replicated review in screening decisions (where available) for the subset of titles and abstracts and full text reports screened by replicators. We will interpret Kappa values as poor (≤ 0.00), slight (0.01–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), or almost perfect (0.81–1.00) (64). We will calculate the frequency and percentage (with 95% confidence intervals) of (i) original search strategies that the replicators attempted to rerun; (ii) search strategies that replicators needed to reconstruct or adapt based on the partial information available; (iii) original search strategies with errors detected when rerun; (iv) studies where the effect estimate and its precision collected or calculated (from summary statistics) matched the effect estimate and its precision included in the meta-analysis; (v) meta-analyses in each of the three replication categories; and (vi) differences between the original and recalculated summary estimate and its precision that were classified as meaningful. We will assess agreement between the original and replicated meta-analytic effects, displayed using Bland-Altman plots (54), and tabulate discordance between P values for the meta-analytic effects by categorising the P values based on commonly used levels of statistical significance, namely P < 0.01; 0.01 ≤ P < 0.05; 0.05 ≤ P < 0.1; P ≥ 0.1.
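The interpretation bands and P value categories described above lend themselves to a short worked sketch. The code below assumes that the Kappa bands are applied as continuous intervals (so a value between 0.00 and 0.01 is labelled slight), that Bland-Altman differences are taken as replicated minus original, and that limits of agreement are the mean difference plus or minus 1.96 standard deviations. These choices, and all function names, are illustrative assumptions rather than details stated in the protocol.

```python
import statistics
from collections import Counter


def interpret_kappa(kappa):
    """Map a Kappa value to the descriptive bands listed in the text."""
    if kappa <= 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


def p_category(p):
    """Assign a P value to one of the four significance categories."""
    if p < 0.01:
        return "P < 0.01"
    if p < 0.05:
        return "0.01 <= P < 0.05"
    if p < 0.1:
        return "0.05 <= P < 0.1"
    return "P >= 0.1"


def tabulate_discordance(original_ps, replicated_ps):
    """Count (original category, replicated category) pairs across meta-analyses."""
    return Counter((p_category(o), p_category(r))
                   for o, r in zip(original_ps, replicated_ps))


def bland_altman(original, replicated, z=1.96):
    """Mean difference (bias) and limits of agreement for paired estimates.

    Requires at least two pairs, because a sample standard deviation is used.
    Returns (bias, lower_limit, upper_limit).
    """
    diffs = [r - o for o, r in zip(original, replicated)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - z * sd, bias + z * sd
```

Pairs that fall off the diagonal of the resulting cross-tabulation (i.e. original and replicated P values in different categories) are the discordant cases.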
For each of the two systematic reviews that were replicated by 15 teams, we will visually display the meta-analytic effect estimate and 95% confidence interval generated by each team that chose to meta-analyse the data, and calculate the median, interquartile range and range of each value across the teams. We will also present the findings of teams who used an alternative statistical synthesis method or who chose not to use any statistical synthesis method (e.g. because they believed the studies were too clinically diverse to combine).
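By way of example, the across-team summary could be computed as sketched below. Quartiles can be defined in several ways; the sketch assumes the "inclusive" method of Python's standard `statistics.quantiles`, which is an illustrative choice not specified in the protocol, and the team estimates shown are invented.

```python
import statistics


def summarise_across_teams(values):
    """Median, interquartile range and range of one quantity across teams
    (e.g. the pooled estimate, or a confidence interval bound).

    Requires at least two values.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "median": statistics.median(values),
        "iqr": (q1, q3),
        "range": (min(values), max(values)),
    }


# Hypothetical pooled estimates from five teams that chose to meta-analyse.
team_estimates = [1, 2, 3, 4, 5]
summary = summarise_across_teams(team_estimates)
```

The same function would be applied separately to the effect estimates and to each confidence interval limit.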
For the interview data, we will adopt the same approach used to analyse free text responses to survey questions in Study 2. That is, one investigator will code the transcripts deductively, and codes and themes generated will be verified by another. We will also classify any difficulties in replicating any of the review methods or recalculating results into conceptually related themes, to generate a list of common challenges experienced.