DOI: https://doi.org/10.21203/rs.3.rs-1030290/v1
Background: Different network meta-analyses (NMAs) on the same topic can report different findings. In this review we investigated NMAs comparing ranibizumab with aflibercept for diabetic macular edema in the hope of illuminating why such differences occurred.
Findings: For the binary outcome of best corrected visual acuity, the reviews all agreed on there being no clear difference between the two treatments, whereas the continuous outcomes all favoured aflibercept over ranibizumab. We discuss four points of particular concern that are illustrated by five similar NMAs: network differences, PICO differences, different data from the same measures of effect, and differences in what is truly significant.
Conclusions: Closer inspection of each of these reviews shows that the methods, including the searches and analyses, all differ but the findings, although presented differently and sometimes interpreted differently, were similar.
With the rapid increase in biomedical evidence, systematic reviews are an opportunity to take healthcare decisions based on comprehensive summaries of the best available evidence on a topic (1–3). Current knowledge may be imperfect, but decisions should be better informed when taken in the light of the best, most up-to-date knowledge. It is crucially important that the systematic review itself is both clear and accurate for local interpretation by healthcare decision makers (healthcare practitioners and policy makers)(3, 4). When similar approaches are taken in summarizing evidence by different research teams but contradictory findings are reported, there is a problem. In this review we investigate one such example in the hope of illuminating why the differences in findings occurred.
Our example is taken from network meta-analyses (NMAs) comparing ranibizumab (a monoclonal antibody fragment) with aflibercept (a recombinant fusion protein), both of which inhibit vascular endothelial growth factor (VEGF), for diabetic macular edema (DME). DME leads to impaired vision-related functioning and quality of life (QoL) (7) and is the main cause of moderate to severe vision impairment in people with diabetes. DME constitutes a substantial economic burden for patients and public health systems(8, 9). The therapeutic goal for people with DME is to improve visual function and vision-related QoL(10). Anti-VEGF therapy is recommended by several clinical guidelines as first-line treatment(11, 12). Ranibizumab and aflibercept are commonly used in clinical practice, but there are few direct comparisons of these two drugs. Network meta-analysis then becomes an attractive option, as NMAs use past studies in which each of the two drugs was compared directly with other controls to create statistical indirect comparisons of the two medications of current interest.
We identified five NMAs (13–17) by searching electronic databases (PubMed, Embase, Cochrane Library, Web of Science, CNKI, Wanfang, VIP) and screening the results. The AMSTAR-2 tool (3) was used to rate the quality of each review.
Almost every aspect of the methods varied between the reviews (Table 1). Although the question under investigation was consistent, the searches, the numbers of studies used, the definitions of eligible participants, the comparisons from which data were sourced and the acceptable outcomes were mostly not consistent.
Study IDa |
||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Korobelnik 2015(13) |
Régnier 2014(14) |
Zhang 2016(15) |
Muston 2018(16) |
Virgili 2018(17) |
||||||
Protocol identified |
✖ |
✖ |
✖ |
✖ |
✓ |
|||||
Search |
Cochrane |
✓ |
✓ |
✓ |
✓ |
✓ |
||||
EMBASE |
✓ |
✓ |
✓ |
✓ |
✓ |
|||||
MEDLINE |
✓ b |
✓ b |
✓ c |
✓ b |
✓ |
|||||
Others |
✓ d |
|||||||||
Date |
01/2013 |
02/2014 |
08/2015 |
12/2016 |
04/2017 |
|||||
Number of studies |
11 |
8 |
21 |
13 |
24 |
|||||
Statistics |
Network model |
Bayesian |
Bayesian |
Bayesian |
Bayesian |
Frequentist |
||||
Sensitivity analysis |
Heterogeneous studies; ethnic group |
Ethnic group |
Heterogeneous studies; ethnic group |
Studies at higher risk of biasf |
||||||
Covariates |
baseline BCVA and/or CRT |
baseline BCVA and/or CRT |
baseline BCVA and/or CRT |
|||||||
Other |
Some IPD |
|||||||||
Participants |
Diabetic macular edema |
✓ |
✓ |
✓ |
✓ |
✓ |
||||
Significant, focal or diffuse , |
Baseline BCVA & CRT varied - 24-78 letters |
As for Korobelnik 2015 |
Baseline visual acuity between 20/200 and 20/40 |
|||||||
DME secondary to diabetes involving the center of the macula |
||||||||||
retinal thickening due to DME/clinically significant macula edema with DR |
previously received central/peripheral laser or treatment naïve included |
|||||||||
Interventionsh |
aflibercept |
2q4 or 2q8 |
2 mg; bimonthly |
intravitreal |
2q8 |
2 mgg |
||||
+ laser |
✓ |
✓ |
||||||||
ranibizumab |
0.5 mg, PRN |
0.5 mg, PRN |
intravitreal |
0.5 mg, PRN or 0.5 mg T&E or 0.3 mg, q4 |
0.5 mg or 0.3 mg |
|||||
+ laser |
✓ |
✓ |
✓ |
deferred |
||||||
prompt |
||||||||||
dexamethasone |
implants |
implants |
||||||||
bevacizumab |
intravitreal |
1.25 mg |
1.25 mg |
|||||||
+ laser |
✓ |
✓ |
✓ |
|||||||
triamcinolone acetonide |
intravitreal |
4 mg, q4/PRN or 4 mg, q4 |
||||||||
+ laser |
✓ |
✓ |
✓ |
|||||||
pegaptanib |
0.3 mg |
|||||||||
Laser |
✓ |
✓ |
✓ |
✓ |
||||||
+ sham injection |
✓ |
|||||||||
Sham |
✓ |
✓ |
||||||||
Outcomes |
Binary |
ETDRS letterse |
>10 and >15 gain; >10 and >15 loss |
>10 gain |
>10 and >15 gain; >10 and >15 loss |
>15 gain |
||||
AEs |
✓ |
✓ |
✓ |
|||||||
Continuous (average change) |
in BCVA using ETDRS charts |
✓ |
✓ |
✓ |
✓ |
|||||
CMT |
CRT measured using OCT |
|||||||||
Quality rating (AMSTAR-2) |
Low |
Low |
Low |
Low |
High |
|||||
a Sorted by search date
b Including In-Process Citations and Daily Update
c PubMed
d International Clinical Trials Registry Platform; ISRCTN registry; LILACS; Novartis Clinical Trials database; US National Institutes of Health Ongoing Trials Register ClinicalTrials.gov; World Health Organization
e in BCVA
f Post hoc
g Regarding drug dose and monitoring/retreatment regimen, Virgili 2018 included schemes that are either on-label or commonly used in clinical practice (such as monthly, bimonthly, PRN, T&E)
h None of the included NMAs included RCTs that investigated conbercept.
AE: adverse event; BCVA: best corrected visual acuity; CI: credible/confidence interval; CMT: central macular thickness; CRT: central retinal thickness; DME: diabetic macular edema; DR: diabetic retinopathy; ETDRS: Early Treatment Diabetic Retinopathy Study; IPD: individual patient data; NI: no information; NMA: network meta-analysis; OCT: optical coherence tomography; PRN: pro re nata; q4: every 4 weeks; T&E: treat-and-extend; 2q8: 2 mg every 8 weeks
Table 2 summarises three outcomes in the NMAs. The first outcome shows how, for the identical binary outcome, different reviews gathered data from different studies and, partly because of this, arrived at slightly different point estimates, although all agreed on there being no clear difference between the two treatments (all 95% credible intervals for the odds ratio straddled 1). The other two outcomes reproduce the results from each review and illustrate how the same measure is reported in different ways and in different combinations across the five reviews.
| | Régnier 2014(14) | Korobelnik 2015(13) | Zhang 2016(15) | Muston 2018(16) | Virgili 2018(17) |
|---|---|---|---|---|---|
| Gain ≥ 10 ETDRS letters at 12 months (three reviews^a) | | | | | |
| OR [95% CrI] | 0.63 [0.19 to 1.63] | 1.59 [0.75 to 3.35] | NR | 1.79 [0.63 to 4.06] | NR |
| Studies reporting these data^b in each NMA | | | | | |
| Elman 2010(18) [DRCR.net Protocol I] | Included | Included | NR | Included | NR |
| Mitchell 2011(19) [RESTORE] | Included | Included | NR | Included | NR |
| Korobelnik 2014(20) [VIVID; VISTA] | Included | Included | NR | Included | NR |
| Massin 2010(21) [RESOLVE] | Included | Not included^c | NR | Not included^c | NR |
| Googe 2011(22) [DRCR.net Protocol J] | Not included^d | Included | NR | Included | NR |
| Do DV 2012(23) [Da VINCI] | Included | Not included^d | NR | Not included^d | NR |
| Ishibashi 2015(24) [REVEAL] | Not included (focus on Asian population)^e | Included^e | NR | Included | NR |
| RESPOND(25) [NCT01135914] | Included | Not included^f | NR | Included | NR |
| Nguyen 2009(26) [READ-2] | Included^e | Not included^g | NR | Not included^g | NR |
| Average change in BCVA^h at 12 months, MD [95% CrI] (five reviews) | | | | | |
| | 4.5 [1.5 to 7]^i | 4.67 [2.45 to 6.87] | 2.07 [-0.97 to 5.33] | 5.20 [1.90 to 8.52] | 4 [2.5 to 5.5] |
| Gain in ETDRS letters^h at 12 months, OR [95% CrI] (five reviews) | | | | | |
| ≥ 10 | 0.63 [0.19 to 1.63] | 1.59 [0.75 to 3.35] | NR | 1.79 [0.63 to 4.06] | NR |
| ≥ 15 | NR | NR | NR | 2.30 [1.12 to 4.20] | 1.33 [1.06 to 1.67]^j |

a Zhang 2016 and Virgili 2018 did not report this outcome.
b The additional reasons presented for ‘included’ or ‘not included’ were identified by the author team of this review, not in the original texts of the NMAs.
c Data unavailable on ranibizumab 0.5 mg.
d Unclear reason for exclusion.
e Included in sensitivity analysis.
f Unpublished when the NMA was conducted.
g Data only reported at 6 months.
h Higher values represent better visual acuity measured using ETDRS letters.
i Data were analysed by the author team of this review (Bayesian network model/random effects, using ADDIS software), not reported in the original text of Régnier 2014.
j Data were risk ratio (RR) and its 95% CrI.
CrI: credible interval; BCVA: best corrected visual acuity; ETDRS: Early Treatment Diabetic Retinopathy Study; MD: mean difference; NMA: network meta-analysis; NR: not reported; OR: odds ratio.
Generally speaking, it is clear that the reader must continue to think ‘cleanly’ amidst the data which may not be quite so clean. Below we discuss some points of particular concern that are illustrated by these five similar NMAs.
Network meta-analysis employs data from [in these cases] randomised trials in ways by which comparisons of interest can be constructed using somewhat assumption-heavy observational methods. For example, aflibercept versus ranibizumab is the comparison of interest (referred to as the decision set). Aflibercept or ranibizumab, however, may only have been compared with sham injection in randomised trials. We use the term ‘supplementary set’ to refer to interventions, such as sham injection, that are included in the network meta-analysis for the purpose of improving inference among interventions in the decision set(4). Because different reviews select different supplementary sets, different network structures are constructed for the same clinical problem. In selecting which competing interventions to include in the decision set, researchers should ensure that the transitivity assumption is likely to hold, mostly based on clinical considerations(4, 27).
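The idea of an indirect comparison through a common comparator can be illustrated with the simple adjusted indirect comparison (often attributed to Bucher), which is much simpler than the Bayesian and frequentist network models used in the five reviews. The sketch below is illustrative only: the function name and all numbers are hypothetical and are not taken from any of the trials or reviews.

```python
import math

def indirect_comparison(d_ac, se_ac, d_bc, se_bc, z=1.96):
    """Adjusted indirect comparison of A versus B through a common comparator C.

    d_ac and d_bc are effect estimates on an additive scale
    (a mean difference, or a log odds ratio), each versus C.
    Returns the indirect estimate of A versus B and its 95% interval.
    """
    d_ab = d_ac - d_bc                       # the common comparator cancels out
    se_ab = math.sqrt(se_ac**2 + se_bc**2)   # uncertainties add, so the interval widens
    return d_ab, d_ab - z * se_ab, d_ab + z * se_ab

# Hypothetical mean differences in letters gained, each drug versus sham
estimate, lower, upper = indirect_comparison(d_ac=10.0, se_ac=1.5, d_bc=6.0, se_bc=2.0)
print(f"Indirect MD: {estimate:.1f} [{lower:.1f} to {upper:.1f}]")
```

The indirect estimate is only as trustworthy as the assumption that the two sets of trials are similar enough for the comparator to cancel out, which is the transitivity assumption described above.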
When the theoretical assumptions (transitivity and consistency) hold, there is no absolute right or wrong in the construction of a network structure. The reader of the review should carefully consider whether the indirect comparisons in the network are sensible and make best use of the available data.
Mostly, the different PICO choices made in the different NMAs can all be considered to make clinical sense. These differences do, however, lead to results that are not identical. For example, one review team may feel that a systematic difference in the participants of a particular trial makes it inappropriate to network its data with those of other studies (e.g., Ishibashi 2015, only included in sensitivity analysis in Régnier 2014, included in main analysis of Korobelnik 2015). Variations in dosage of treatments may, in the view of one review team, make a study ineligible but be acceptable to other researchers (e.g., Massin 2010, included in Régnier 2014, excluded from Korobelnik 2015/Muston 2018). The time-point of outcome assessment can add further differences (e.g., Nguyen 2009, included in Régnier 2014, excluded from Korobelnik 2015/Muston 2018).
These decisions, all done with the best of intentions, lead to inclusion of different studies contributing to the final – slightly different - results as illustrated in Table 2. It should not be a surprise that clinicians and researchers evolve their ideas and differ even at the same time. Readers need to consider and understand what participants are included – and excluded – in the review, what treatments are its focus and if there are omissions, and what outcomes are being reported and why those choices were taken.
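The overlap between reviews can be made explicit with a few set operations. The sketch below transcribes the ‘Included’ entries of Table 2 for the three reviews that reported the ≥10-letter gain outcome. As a simplifying assumption, Nguyen 2009 is treated as included in Régnier 2014 (it carries a sensitivity-analysis footnote in the table), and Ishibashi 2015 is treated as not included in the Régnier 2014 main analysis.

```python
# Trials contributing to the >=10-letter gain outcome, transcribed from Table 2
included = {
    "Régnier 2014": {"Elman 2010", "Mitchell 2011", "Korobelnik 2014",
                     "Massin 2010", "Do 2012", "RESPOND", "Nguyen 2009"},
    "Korobelnik 2015": {"Elman 2010", "Mitchell 2011", "Korobelnik 2014",
                        "Googe 2011", "Ishibashi 2015"},
    "Muston 2018": {"Elman 2010", "Mitchell 2011", "Korobelnik 2014",
                    "Googe 2011", "Ishibashi 2015", "RESPOND"},
}

# Trials used by every review
shared = set.intersection(*included.values())
print("Used by all three reviews:", sorted(shared))

# Trials used by one review only
unique = {}
for review, trials in included.items():
    others = set().union(*(t for r, t in included.items() if r != review))
    unique[review] = trials - others
    print(f"Only in {review}:", sorted(unique[review]))
```

Under these assumptions only three trials are common to all three networks, and three trials appear in Régnier 2014 alone.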
Clinicians and patients first tend to ask whether the treatment will help them, for example, ‘get better’ (a question that merits a binary answer) and then, only as second preference, seek more detailed information on the degree of improvement (meriting an answer on a continuous scale). The dichotomous or binary outcome, though, is often a crude and even arbitrary cut-off within an ostensibly continuous measure. Continuous measures are, however, often a research fabrication and not truly continuous.
In the examples in Table 2 the average improvements seem relatively consistently to be around 4 points. It is difficult to understand what this may mean for any one patient’s life. In averaging across the groups something may be lost, however, that is revealed in the binary outcomes, and Table 2 gives good grounds for speculation. The reviews that report a ≥10-point gain consistently show no clear difference between aflibercept and ranibizumab, but the two latest reviews report a new binary outcome (≥15-point gain) and both show an advantage for those allocated to aflibercept. Perhaps the averaging across all people in the trials has masked an important group of people who respond better to aflibercept. But these are clinical and research points of debate.
Overall, the five reviews have reported results that are complicated, thought provoking, but not truly inconsistent with each other. The reader needs to consider the value of the outcome for their need. The researchers may favour the continuous measure of function, the clinician or patient the binary cut-off for better/not better and the policy maker the economics.
Even when the synthesis of data reaches a pre-stated level of statistical significance, the findings for that outcome measure may not have great clinical impact. It is easy for confusion to arise when the same data are commented on from either the statistical or the clinical perspective. Careful consideration is required from the reader to understand the assessment of the reviewers: are they reporting the clinical perspective, the statistical perspective, or a mixture of both?
A further danger of differing interpretations of the same findings arises when confidence intervals straddle zero (for continuous data) or 1 (for binary data, as for all the ≥10 point gain findings in Table 2). It is easy for reviewers and readers of the reviews to confuse ‘no evidence of an effect’ with ‘evidence of no effect’. When confidence intervals are wide and straddle 1 (unity), for example the 0.63 to 4.06 of Muston 2018 in Table 2, it is wrong to claim that aflibercept has ‘no effect’ or is ‘no different’ from ranibizumab; both statements carry too much certainty. It is true that there is no clear difference, but neither has one drug been shown to be equivalent to the other. If a true beneficial effect is mentioned in the conclusion, a true harmful effect should also be mentioned and discussed.
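The check described here is mechanical and can be sketched in a few lines. The interval limits below are transcribed from Table 2; the function name is our own, and the point is only that the null value differs by scale (1 for ratios, 0 for differences).

```python
def straddles_null(lower, upper, null_value):
    """True when the interval contains the value that means 'no difference'."""
    return lower <= null_value <= upper

# Odds ratios for a gain of >=10 ETDRS letters at 12 months (null value 1)
gain_10_or = {
    "Régnier 2014": (0.19, 1.63),
    "Korobelnik 2015": (0.75, 3.35),
    "Muston 2018": (0.63, 4.06),
}
# Mean differences in BCVA change at 12 months (null value 0)
bcva_md = {
    "Zhang 2016": (-0.97, 5.33),
    "Virgili 2018": (2.5, 5.5),
}

for review, (lower, upper) in gain_10_or.items():
    print(review, "OR interval includes 1:", straddles_null(lower, upper, 1.0))
for review, (lower, upper) in bcva_md.items():
    print(review, "MD interval includes 0:", straddles_null(lower, upper, 0.0))
```

An interval that includes the null value supports ‘no clear difference’; it does not by itself support ‘no difference’.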
As always, really thinking about the meaning of findings is key. Together, the point estimate and confidence interval provide information to assess the effects of the intervention on the outcome. For example, in the evaluation of these drugs on BCVA it could have been decided that it would be clinically useful if the medication increased BCVA from baseline by 5 letters, and at the very least by 2 letters. Virgili 2018 reports an effect estimate of an increase from baseline of 4 letters with a 95% confidence interval from 2.5 to 5.5 letters. This allows the conclusion that aflibercept was useful, since both the point estimate and the entire range of the interval exceed the criterion of an increase of 2 letters. For Régnier 2014, our re-analysis of the review’s data (Table 2) gave a similar point estimate (4.5 letters) but with a wider interval, from 1.5 to 7 letters. In this case, although it could still be concluded that the best estimate of the aflibercept effect is that it provides net benefit, the reader could not be so confident, as the possibility still has to be entertained that the effect could be between 1.5 and 2 letters, a low range that had been pre-specified to be of little clinical value. The contrast of Régnier 2014 and Virgili 2018 serves well to illustrate how very similar findings may justify subtly different implications. The reviewers carry a responsibility to help the reader by clear reporting and thoughtful inclusive explanations, but where this has not happened the readers may have to do this for themselves.
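The reasoning in this example can be written down as a small decision rule. The sketch below uses the illustrative 2-letter minimum from the text and the two estimates from Table 2; the function name and the wording of the categories are our own and are not a validated classification.

```python
def interpret(point, lower, upper, minimum_useful):
    """Compare an estimate and its interval with a pre-specified minimum useful effect."""
    if lower >= minimum_useful:
        return "whole interval at or above the minimum useful effect"
    if upper < minimum_useful:
        return "whole interval below the minimum useful effect"
    if point >= minimum_useful:
        return "point estimate above the minimum, but smaller effects not excluded"
    return "point estimate below the minimum, but useful effects not excluded"

MINIMUM_USEFUL_LETTERS = 2.0  # illustrative threshold used in the text

virgili = interpret(4.0, 2.5, 5.5, MINIMUM_USEFUL_LETTERS)
regnier = interpret(4.5, 1.5, 7.0, MINIMUM_USEFUL_LETTERS)
print("Virgili 2018:", virgili)
print("Régnier 2014:", regnier)
```

The two reviews land in different categories even though their point estimates differ by only half a letter, which is the contrast drawn above.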
We have summarised the methods and findings of five NMAs of the same topic which produced what seemed like somewhat different findings from similar data sets. Closer inspection of each of these reviews shows that the methods, including the searches and analyses, all differ but the findings, although presented differently and sometimes interpreted differently, were similar.
As always, the critical reader of a review should think about the review in detail. This is helped by long-established checklists (28). Furthermore, Grading of Recommendations Assessment, Development, and Evaluation (GRADE) offers a transparent and structured process for developing and presenting summaries of evidence, including its quality, for systematic reviews and recommendations in health care (29).
As is common in different trials and reviews, outcomes – even the same measures – can be legitimately reported in several different ways. There is no avoiding the need to think through what the numbers really mean in terms of people, services and policies. This may necessitate careful, subtle, humane, and expert consideration.
NMA: network meta-analysis
DME: diabetic macular edema
BCVA: best corrected visual acuity
QoL: quality of life
VEGF: vascular endothelial growth factor
GRADE: grading of recommendations assessment, development, and evaluation
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and material
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Competing interests
None.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Authors’ contributions:
Jing Wu: study design, draft, review, revision;
Clive Adams: draft, review, revision;
Xiaoning He: screening of articles, data extraction;
Fang Qi: screening of articles, data extraction, statistical analysis;
Jun Xia: draft, review, revision.
All authors read and approved the final manuscript.
Acknowledgements
We thank everyone who kindly provided assistance during our preparation of this manuscript.