Duplicated Network Meta-analysis: a Case Study and Recommendations for Change

Research overlap and duplication is a recognised problem in the context of both pairwise and network systematic reviews. We carried out a systematic review to identify and examine duplicated Network Meta-Analyses (NMAs) in a specific setting where several novel treatments have emerged: metastatic hormone-sensitive prostate cancer (mHSPC).

We recommend greater clarity regarding the unique aims or scope of each project. Source data and results should be clearly and completely presented to allow unbiased interpretation. Review authors should be fully knowledgeable of their subject, both in terms of relevant studies and of other reviews with potential for overlap or duplication; and should re-evaluate their knowledge throughout the research process, particularly in fast-moving fields.

Background
Research overlap and duplication is a recognised problem in the context of both pairwise and network systematic reviews. It has been estimated that two-thirds of pairwise meta-analyses [1] and over three-quarters of published network meta-analyses (NMAs) have overlap with at least one other review, often to a high degree [2]. Commentators have noted that whilst some duplication is justifiable in terms of independent replication, larger-scale duplication brings a risk of confusion and wasted effort [1], which may be heightened in the context of rapidly evolving fields such as COVID-19 [3].
Network meta-analysis is an increasingly influential tool for evidence synthesis, which has particular worth in situations where multiple treatments are to be ranked with respect to a common standard-of-care, or where no head-to-head comparison data exist. However, methods for NMA are numerous, and continue to evolve. Hence, research duplication may partly be explained by an ongoing lack of consensus regarding the conduct of NMAs, particularly choices as to which interventions, trials and data items should be included and compared [2]. This situation persists despite efforts such as the network extension to the PRISMA statement [4] and the emergence of the concept of Living Systematic Reviews [5,6].
For decades, androgen deprivation therapy (ADT) had been the established standard-of-care for hormone-sensitive metastatic prostate cancer (mHSPC). However, recent trials and pairwise meta-analyses [7,8] have demonstrated improved survival from adding docetaxel or abiraterone acetate to ADT, sparking debate regarding their relative merits [9][10][11]. Furthermore, there has been a suggestion that response to these treatments may be influenced by a patient subgroup defined as "high-volume" or "high-risk" metastatic disease (HVD [12] or HRD [13]). As a result of our own research in this area, we became aware of multiple NMAs with similar scope but apparently heterogeneous methods and conclusions. Hence, we carried out a systematic review of reviews to evaluate research duplication in this setting.

Systematic literature review of network meta-analyses
We searched systematically (see Additional file 1) for indirect treatment comparisons (ITC), mixed treatment comparisons (MTC) and network meta-analyses (NMA) of systemic treatments in the mHSPC setting. Eligible reviews must have presented at least one evidence-based inference on an indirect treatment comparison with a time-to-event outcome reported on the hazard-ratio scale. To avoid confusion with more recent therapeutic developments [14,15], we specifically targeted meta-analyses referencing both "docetaxel" and "abiraterone", but excluded analyses of "enzalutamide" or "apalutamide". Searching was performed originally in May 2019, and updated in January 2020, within the MEDLINE and EMBASE databases with no restrictions on year of publication or language. Abstracts from the proceedings of the American Society of Clinical Oncology (ASCO) and European Society of Medical Oncology (ESMO) were potentially eligible. If a report was accepted as a conference abstract but subsequently published as a peer-reviewed article, we included both, but extracted data from the article.

Data extraction
Two independent reviewers (DF and SB) extracted data concerning the timing of completion of the review, estimated by the date submitted for peer review or to conference committee; and of the results entering the public domain, estimated by the date of publication of a peer-reviewed article or conference abstract book. Also extracted were data on inclusion and exclusion criteria for included trials and patients, definitions of intermediate survival endpoints and of important patient subgroups, details of statistical methodology and software, and the network HRs themselves. We obtained the original source publications for all trials included in eligible reviews, and extracted the reported HRs for overall survival and intermediate survival endpoints (see Additional files 4 and 5), together with details of statistical methodology used to obtain them. Finally, we assessed each review against the PRISMA-NMA checklist ([4]; see Additional file 6).

Data analysis
Our primary synthesis was a comparison, across reviews, of reported hazard ratios for the effect of abiraterone acetate versus docetaxel on overall survival, and on intermediate survival endpoints based on disease progression or treatment failure. We attempted to explain observed differences in effect size and precision based on differences in characteristics of the reviews, particularly their included data and statistical methodology. We also aimed to recreate the results of each NMA from reported trial results using Stata v15.1 (StataCorp LP, College Station, TX).

Description of relevant reviews
All trials included in eligible reviews investigated the addition of one or more treatments, such as abiraterone, celecoxib, docetaxel and zoledronic acid, to the standard-of-care of androgen deprivation therapy (ADT) compared to ADT alone, including two combination treatments (zoledronic acid plus ADT in combination with each of docetaxel and celecoxib [35,36]). One relevant trial [37] compared multiple research treatments under the same protocol, such that data from 14 randomised comparisons were represented across the reviews from within nine trial protocols. Each review used data from between three and twelve randomised comparisons (Figure 1), comprising between 1,773 and 7,844 patients. The source data from each relevant trial are given in Additional files 4 and 5. The theoretical network resulting from analysis of all such data simultaneously is shown in Figure 2.

Sources of variation
We observed considerable variation between the included reviews in terms of review aims, eligibility criteria and included data, statistical methodology, reporting and inference.

Review aims
All thirteen eligible reviews either stated or implied an aim to synthesize data on optimal treatments for hormone-sensitive prostate cancer. Two reviews stated the additional aim of including updated results [21] and/or improved methodology [20,21]. Four others specifically aimed to evaluate efficacy within pre-defined patient subgroups [19,[22][23][24], and four stated the aim of incorporating health economic considerations [24] or adverse effects [17,22,34].

Included trials
Ten of the 13 reviews described themselves as "systematic", and all but one [16] reported that a formal search strategy had been used. All reviews specified a disease setting of hormone-sensitive prostate cancer (HSPC). Eight reviews [16,18,20,21,25,[32][33][34] only included trials in metastatic disease (M1). One of the largest relevant trials (STAMPEDE [35,36,38,39]) randomised both M1 and high-risk non-metastatic (M0) patients; but M1-specific results were reported, making it eligible for most M1-only reviews. Three other reviews explicitly included trials in the high-risk [19] or locally-advanced [17,22] non-metastatic setting, although one [17] ultimately limited their analysis to M1 due to lack of data. Only one review [23] included the STAMPEDE direct comparison of abiraterone vs docetaxel [39], published online in February 2018.

Included treatments
The set of included treatments varied depending upon the aims of the review. Ten reviews only included data comparing docetaxel or abiraterone plus ADT to ADT alone - reflecting the focus of clinical interest - although two such reviews [18,19] also included data from the zoledronic acid plus docetaxel combination comparison of STAMPEDE [35], treating this as an additional docetaxel trial. The three remaining reviews permitted a wider, but varied, range of treatments (Figure 1). Although this presumably reflects deliberate choices made by review authors, only one review [21] gave an explicit justification, referring to earlier work [7] where the treatment (sodium clodronate) was considered separately due to "differences in mechanisms of action" and because it "is not commonly used in practice". By contrast, two other treatments rarely used in recent times (estramustine phosphate and flutamide [40,41]) were included in one review [25].

Included participants
Patient inclusions were necessarily governed by the reported data. The vast majority of included trials conformed to the intention-to-treat principle; the exceptions were two small, older trials [25,40,41] in which small numbers of patients were not analysed due to protocol deviation or non-eligibility. Two reviews [23,24] restricted their analyses to patients with "high volume metastatic disease" (HVD), of which one [23] additionally restricted to newly-diagnosed mHSPC; that is, patients who had not received prior therapy for prostate cancer. As STAMPEDE was considered highly clinically relevant but did not have published HVD-specific results at the time of publication, it was instead included in a sensitivity analysis. Only two reviews [19,22] investigated patient subgroups other than M0/M1 or HVD, looking at age, performance status, Gleason score and presence of visceral metastases. Neither used the "deft" approach to testing for subgroup interactions in the meta-analytic context as recommended by Fisher et al [42].
Despite the availability of STAMPEDE results for M0 and M1 patients separately, it was not always clear that review authors extracted or analysed data consistently. For example, one review [16] specified that only M1 patients were eligible, but reported figures suggest that M0 patients were sometimes also included.

Included outcomes
Eleven of the 13 reviews reported overall survival (OS) results, and ten reported results on an intermediate survival outcome. Definitions of intermediate outcomes varied between trials, and were handled differently between reviews. One review [19] considered that "data on secondary outcomes … were not reported consistently enough between trials to allow for pooling of data", while most other reviews did attempt such analysis.
Three reviews [20,23,25] imposed a specific definition of the intermediate outcome, resulting in fewer but possibly more comparable included trials. Another [21] specified a list of desired elements, but argued in favour of including two trials omitting one such element [12,13] on the basis that definitions were similar enough overall to allow a clinical interpretation from the pooled result. One further review [22] appeared similar but was unclear; the others did not provide sufficient information.

Included results
Although three reviews explicitly stated that the most recent available trial report would be used [17,19,21], many reviews were inconsistent or unclear. For example, one review [18] referenced updated results for an included trial [43] but apparently used an older set of results [44] in their analysis. Updated OS results from another trial were published in a conference abstract [45], with intermediate outcome results presented at the conference itself; but only a single review [21] incorporated them. Particularly in a time-to-event context, updated results can increase power and precision by capturing additional events [46].
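The precision gain from capturing additional events can be illustrated with the standard approximation SE(log HR) ≈ 2/√d for a 1:1 randomised trial with d events in total. The event counts below are hypothetical, not taken from any included trial:

```python
import math

def approx_se_log_hr(events):
    """Approximate standard error of the log hazard ratio for a
    1:1 randomised trial with the given total number of events,
    using the common approximation SE ~ 2 / sqrt(d)."""
    return 2.0 / math.sqrt(events)

# Hypothetical early report vs updated report with longer follow-up.
se_early = approx_se_log_hr(400)
se_update = approx_se_log_hr(625)
# The updated analysis yields a smaller SE, i.e. a narrower
# confidence interval, purely from the additional events.
```

With these illustrative counts, the SE falls from 0.10 to 0.08, a 20% narrowing of the confidence interval on the log scale.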

Statistical methods
A wide range of statistical methods were used. Three reviews [16,32,33] simply carried out pairwise meta-analyses of included treatments versus standard-of-care, with inference for indirect comparisons based upon a test of subgroup difference [47]. A more common approach, used in five reviews [17-19, 22, 24], was the "Bucher method" [48], applicable to three-treatment triangular networks and which has been criticised for estimating a separate heterogeneity variance for each comparison [47]. Two reviews [18,19] accommodated the "docetaxel plus zoledronic acid" comparison from STAMPEDE within this framework by treating it as an additional docetaxel comparison, reflecting a similar approach sometimes used in pairwise meta-analysis [49]. Four other reviews analysed networks of four or more treatments using multiple treatment comparison (MTC) methods, either using frequentist multivariate analysis [21] or a Bayesian framework [20,23,25]. Of the nine frequentist reviews, six used random-effects modelling, one [17] used common-effect modelling, one [18] used a hybrid method (see Additional file 3), and one [24] was unclear.
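In the three-treatment triangular case, the Bucher method amounts to differencing the two log hazard ratios against the common comparator and summing their variances. A minimal sketch, using hypothetical hazard ratios and confidence intervals rather than values from any of the included reviews:

```python
import math

def bucher_indirect(hr_ac, ci_ac, hr_bc, ci_bc):
    """Bucher indirect comparison of A vs B via a common comparator C.

    hr_ac, hr_bc: reported hazard ratios for A vs C and B vs C.
    ci_ac, ci_bc: their (lower, upper) 95% confidence intervals.
    Returns the indirect HR for A vs B with a 95% CI, assuming the
    two estimates are independent.
    """
    z = 1.959964  # 97.5th percentile of the standard normal
    # Recover standard errors of the log HRs from the reported CIs.
    se_ac = (math.log(ci_ac[1]) - math.log(ci_ac[0])) / (2 * z)
    se_bc = (math.log(ci_bc[1]) - math.log(ci_bc[0])) / (2 * z)
    log_hr = math.log(hr_ac) - math.log(hr_bc)
    se = math.sqrt(se_ac**2 + se_bc**2)
    return (math.exp(log_hr),
            (math.exp(log_hr - z * se), math.exp(log_hr + z * se)))

# Hypothetical inputs: "abiraterone vs ADT" and "docetaxel vs ADT".
hr, ci = bucher_indirect(0.62, (0.53, 0.73), 0.77, (0.68, 0.87))
```

With these illustrative numbers the indirect HR is about 0.81 with a 95% CI of roughly 0.66 to 0.99, showing how borderline significance can arise even when both direct comparisons are clearly significant.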
Due to its adaptive multi-arm design [37], multiple treatment comparisons from the STAMPEDE trial may be correlated. If a review includes such comparisons as though they were independent trials, double-counting of control arm observations may lead to inflated variances. However, only three reviews [20,21,23] explicitly discussed this issue, despite it being indicated in the PRISMA-NMA statement [4]. One such review [20] stated that "treatment comparisons… from the same study were modelled… with a [Bayesian] correlation prior distributed uniformly on 0-0.95". Another [21] sought to estimate the correlations themselves using event counts by treatment arm. Both also included zoledronic acid combination arms separately from docetaxel and celecoxib alone, which added strength to the docetaxel network comparison. The remaining review [23] was alone in including direct comparison data from STAMPEDE of abiraterone vs docetaxel [39]. Despite correctly noting "differences in the period of enrolment" between the direct comparison and the original comparisons against ADT, and "uncertainty in the extent of overlap of populations for each of the comparisons" [23], they did not attempt to formally account for this, choosing instead to perform sensitivity analyses.
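The effect of a shared control arm can be sketched numerically: with a positive correlation rho between the two against-ADT log hazard ratios, the variance of the indirect contrast is smaller than the naive independent-trials calculation suggests. All numbers below, including rho, are hypothetical illustrations, not estimates from STAMPEDE or any review:

```python
import math

# Hypothetical within-trial estimates sharing one control arm,
# as in an adaptive multi-arm design: log HRs and their variances.
log_hr_a, var_a = math.log(0.62), 0.0067   # treatment A vs control
log_hr_b, var_b = math.log(0.77), 0.0040   # treatment B vs control
rho = 0.4  # assumed correlation induced by the shared control arm

contrast = log_hr_a - log_hr_b  # indirect A vs B on the log scale

# Treating the two comparisons as independent trials ignores the
# covariance and so overstates the variance of the A-vs-B contrast.
var_naive = var_a + var_b
var_correlated = var_a + var_b - 2 * rho * math.sqrt(var_a * var_b)
```

Under these assumptions the naive variance (0.0107) is roughly 60% larger than the correlation-adjusted one (about 0.0066), illustrating why ignoring the shared control arm can materially widen reported confidence intervals.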

Reporting
Three reviews were reported in conference proceedings only [32][33][34], and a further two [16,25] took the form of "letters to the editor"; understandably these six reviews conformed poorly to PRISMA-NMA guidelines [4]. The eight peer-reviewed articles conformed better, to varying degrees (see Additional file 6).
Risk-of-bias assessment and handling of multi-arm trials were common omissions, and in particular only two reviews [21,22] published their protocol in advance. There was also some evidence of outcome reporting bias; for example, one review [25] presented an indirect estimate for the intermediate outcome but not for overall survival, despite evidence that both outcomes were analysed. Reporting of source data and description of statistical methodology was often poor, making it difficult to recreate the reported indirect treatment comparisons. Inconsistencies in use of source data, and minor reporting errors such as inconsistent patient or event counts, further hindered attempts to make reasonable judgments as to how such analyses might be recreated.

Comparison of primary results and of reviewers' interpretations
Twelve of the 13 reviews analysed overall survival (OS), of which nine explicitly reported an indirect estimate of abiraterone versus docetaxel. Despite the dissimilarities described above, results were fairly similar, with HRs of around 0.80 and of borderline significance at the 5% level (Figure 3). Eight reviews drew tentative conclusions regarding an OS advantage for abiraterone over docetaxel. By contrast, three reviews [19,24,34] stated categorically that there was no difference in OS; the conclusions of the final review [25] were unclear. Notably, conclusions differed among three reviews including an identical set of trials: two [17,19] stated explicitly that their analysis did not demonstrate statistical significance, whilst the third [18] stated that "despite several limitations stemming from the paucity of comparative evidence, our results favour [abiraterone] over [docetaxel]".

Of ten reviews which analysed an intermediate outcome, seven reported indirect estimates. Due to the variations in intermediate outcome definition, we took the results most prominently presented or described in each review (see Additional file 3). The estimates here were more varied, with HRs ranging from 0.50 to 0.84 (Figure 3). In four reviews [17,[20][21][22] the estimates were strongly significant at conventional levels, and this was reflected in the reviewers' conclusions. Two reviews [23,24] concentrated on the high-volume disease (HVD) sub-population and as such differ noticeably in terms of available power and estimated effects, appearing as outliers in Figure 3. One [23] concluded that a "positive trend" was seen both in overall survival and in the intermediate outcome, whilst the other [24] stated that "no statistically significant difference" was seen. The remaining outlying result is taken from a review [25] for which descriptions of methodology and source data were particularly limited, and we were unable to recreate their analysis.

Discussion
Our case study of reviews analysing treatments for metastatic hormone-sensitive prostate cancer identified thirteen eligible reviews. Of these, all but one [25] reported within a year of the publication of two major abiraterone results in June 2017 [13,38]. The first six months alone saw four articles submitted for peer review [16][17][18][19] and six accepted conference abstracts [26][27][28][32][33][34]. Among the former, statistical methodology was relatively simple and there was no evidence of a priori intent, so it may be surmised that speed of dissemination was at least a partial motivation. As such, the scientific contribution of such reviews is debatable, and they may cause confusion if later, more carefully conducted work suggests a conflicting interpretation. By contrast, conference abstracts allow preliminary results to be presented for immediate discussion within the research community; three of the six reviews disseminated in this way [26][27][28] (and two others published later [20,24,29,30]) were ultimately published as fully peer-reviewed articles [21][22][23]. Also of note, albeit not within the scope of this "meta-review", are narrative reviews - of which several also appeared at this time [9][10][11] - which aim simply to summarise a fast-moving field and give clinicians a brief, clear description of the current body of evidence, without attempting statistical inference.
Of four reviews published as peer-reviewed articles [20][21][22][23] in the following calendar year (2018), two [21,22] registered protocols in advance with the PROSPERO international prospective register of systematic reviews (CRD42017071811 and CRD42017071268, available from https://www.crd.york.ac.uk/prospero). Prospective planning allows issues of methodology and scope to be discussed and optimised with input from a collaborative research team with expertise in systematic review, statistical methodology and clinical decision-making, and hence permits reliable results to be reported in a reasonable time-frame. Furthermore, communication with trial investigators may highlight updated results or enable collection of additional relevant data [21]. In particular, two reviews [20,21] included a wider range of treatment comparisons, and utilised fully network-based approaches whilst accounting for correlations induced by multi-arm trial designs; in principle, these reviews should provide the most definitive evidence within the limitations of aggregate data.
Only three of the nine peer-reviewed articles made reference to any other reviews in the same field. One [21] did so to highlight the methodological advantages of their own review, whilst the other two [22,23] aimed to build on previous work by clarifying particular aspects such as the HVD patient subgroup, or safety. Notably, one review [25] claimed originality despite being published some time after the others. Overall, many of the eligible reviews had similar objectives and made little attempt to explain why their approach offered a beneficial or differing perspective.
Although surveys of duplication and overlap in meta-analysis and NMA have been published [1,2], this is to our knowledge the first detailed case study of the phenomenon in the NMA setting, answering a previous call for further exploration [2]. In doing so, it also offers a differing perspective on NMA: that of academic trials and reviews, and in prostate cancer rather than more commonly-evaluated settings such as cardiac disease. Our literature review was comprehensive, and we are not aware of any substantial relevant unpublished data. As the majority of data were from large-scale randomised controlled trials, we did not attempt to draw any conclusions about the possible effects of trial-level bias on review results.
Superficially, the network (Figure 2) is fairly small and simple, with only one multi-arm trial and no indirect treatment loops. The most inclusive reviews [20,21] included over 75% of relevant studies, or arguably 100% if older treatments are disregarded (Figure 1). This compares to a maximum of just 55% from a previous study of much larger networks of biologics for rheumatoid arthritis [2]. This has both advantages and disadvantages: it allowed us to make cleaner and more granular comparisons between reviews, but we were unable to fully examine issues such as network geometry and "lumping" or "splitting" of nodes [2]. On the other hand, the inclusion of the adaptive multi-arm STAMPEDE trial [37] introduces very specific complexities which were rarely acknowledged or tackled. Since novel trial designs are on the increase [50], it is important to identify such gaps in review methodology in order to avoid biased or inefficient results. To broaden understanding of such issues, we encourage researchers in other clinical settings to undertake similar case studies of duplicated NMAs, where appropriate.
An obvious limitation of all included reviews is their use of aggregate data. Individual participant data (IPD) is generally seen as the "gold standard" [51], and would allow many of the issues discussed here to be resolved. No IPD review in this setting yet exists, but the STOPCAP M1 programme [52] aims to develop an IPD repository to allow this work to be done [53] and to allow other researchers to tackle new questions as they arise going forward.
In this case study, duplication resulted primarily from the situation of multiple new mHSPC treatments emerging within a relatively short period of time, raising unanswered questions on multiple fronts. New therapeutic advances continue to emerge in this setting, and indeed a similar situation may already be developing [54][55][56]. Likewise, respected commentators have noted the risks of duplication in the context of COVID-19 [3], where for example the antiviral drug remdesivir has recently been the focus of multiple reviews [57][58][59]. Ongoing rapid research into prognosis and treatment of COVID-19 will likely continue to raise similar issues. Many of our identified reviews differed in terms of included trials and comparisons; which, as previously suggested [2], may have a substantial impact upon estimated effects and ranking statistics. While some focussed solely on docetaxel and abiraterone, others included a variety of other treatments, including some no longer in common use [40,41,60]. No single network encompassed all therapies included by the others.
Living cumulative NMAs [6] - an extension to the general concept of living systematic reviews [5] - have been proposed as a solution to issues both of research wastage and of network fragmentation (lack of overlap). The "living" paradigm may well represent the best opportunity for a review team to maintain knowledge of relevant trials and data. However, for a review to have clinical utility and impact, it must undergo analysis, dissemination and appraisal. These processes introduce additional considerations, such as the desired clinical scope and the appropriateness of pooling together certain results with others [6]. Therefore, even the "living" paradigm may not avoid the obligation for review authors to make pragmatic decisions regarding inclusion - with consequent risk of some degree of duplication or conflict - in order to optimise clinical utility.

Conclusion
Increased availability of published results, such as via open-access policies and online supplementary data, has many advantages in terms of reducing risk of bias and decentralising research efforts away from the most developed parts of the world. However, building upon what is already known about the incidence of duplication [1,2], we have provided the first empirical evidence of the consequent risks for aggregate-data NMA. As well as cost and resourcing implications, conflicting results and confused interpretation may result. To avoid this, we recommend that researchers take a more prospective approach [61] to the review process. Firstly, we would add to a previous entreaty that existing reviews be identified "as a compulsory first step" [62], adding that review authors should identify potentially eligible studies, and familiarise themselves with their design, progress and dissemination, throughout the research process. This may be particularly important in fast-moving fields such as COVID-19, where updated results may become available prior to review publication. Secondly, communication with study investigators may help to identify new or updated data or other planned reviews which might lead to duplication, and may also enable cross-study consensus to be reached on aspects such as inclusion criteria, outcome definitions or patient subgroups. Finally, we strongly recommend, following the PRISMA-NMA statement [4], that detailed review protocols be published in advance in order to facilitate awareness; to clarify objectives and methods; and to specify the research gap that the review uniquely aims to meet.

Declarations
Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
All data generated or analysed during this study are included in this published article and its supplementary information files.