Generally speaking, the reader must continue to think ‘cleanly’ amidst data that may not be quite so clean. Below we discuss some points of particular concern that are illustrated by these five similar NMAs.
3.1 Network differences
Network meta-analysis employs data from (in these cases) randomised trials in ways that allow comparisons of interest to be constructed using somewhat assumption-heavy observational methods. For example, aflibercept versus ranibizumab is the comparison of interest (referred to as the decision set). Aflibercept or ranibizumab, however, may only have been compared with sham injection in randomised trials. We use the term ‘supplementary set’ to refer to interventions, such as sham injection, that are included in the network meta-analysis for the purpose of improving inference among interventions in the decision set(4). Different selections for the supplementary set produce different network structures for the same clinical problem. In selecting which competing interventions to include in the decision set, researchers should ensure that the transitivity assumption is likely to hold, based mostly on clinical considerations(4, 27).
Provided the theoretical assumptions (transitivity and consistency) hold, there is no absolute right or wrong in the construction of a network structure. The reader of the review should carefully consider whether the indirect comparisons in the network seem sensible and make best use of the available data.
3.2 PICO differences
In most cases, the differing PICO choices made in different NMAs can each be considered to make clinical sense. These differences do, however, lead to results that are not identical. For example, one review team may feel that a systematic difference in the participants of a particular trial makes it inappropriate to network its data with those of other studies (e.g., Ishibashi 2015, only included in a sensitivity analysis in Régnier 2014 but in the main analysis of Korobelnik 2015). Variations in treatment dosage may, in the view of one review team, make a study ineligible yet be acceptable to other researchers (e.g., Massin 2010, included in Régnier 2014, excluded from Korobelnik 2015/Muston 2018). The time-point of outcome assessment can add further differences (e.g., Nguyen 2009, included in Régnier 2014, excluded from Korobelnik 2015/Muston 2018).
These decisions, all made with the best of intentions, lead to different studies contributing to the final – slightly different – results, as illustrated in Table 2. It should not be a surprise that clinicians and researchers evolve their ideas and differ even at the same point in time. Readers need to consider and understand which participants are included – and excluded – in the review, which treatments are its focus and whether there are omissions, and which outcomes are being reported and why those choices were made.
3.3 Different data from the same measures of effect
Clinicians and patients tend first to ask whether the treatment will help them, for example, ‘get better’ (a question that merits a binary answer) and only then, as a second preference, seek more detailed information on the degree of improvement (meriting an answer on a continuous scale). The dichotomous or binary answer, though, is often a crude and even arbitrary cut-off within an ostensibly continuous measure. Continuous measures are, in turn, often a research construct and not truly continuous.
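The relationship between the continuous summary and its binary cut-offs can be made concrete with a small sketch. The letter-change scores below are entirely hypothetical, invented only to show how the same data yield both a mean change and responder proportions at different cut-offs:

```python
# Hypothetical change-from-baseline scores (letters); NOT trial data.
bcva_change = [12, 3, -1, 15, 7, 2, 11, 4, 16, 0]

# Continuous summary: mean improvement across all participants.
mean_change = sum(bcva_change) / len(bcva_change)

def responder_rate(changes, cutoff):
    """Proportion of participants gaining at least `cutoff` letters."""
    return sum(c >= cutoff for c in changes) / len(changes)

print(round(mean_change, 1))            # → 6.9 (average improvement)
print(responder_rate(bcva_change, 10))  # → 0.4 (≥10-letter gain)
print(responder_rate(bcva_change, 15))  # → 0.2 (≥15-letter gain)
```

The choice of cut-off (10 versus 15 letters) changes the apparent ‘response’ considerably, which is one reason the reviews’ binary outcomes can tell a different story from their continuous ones.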
In the examples in Table 2 the average improvements are fairly consistently around 4 points. It is difficult to grasp what this may mean for any one patient’s life. Something may be lost in averaging across the groups, however, that is revealed in the binary outcomes, and Table 2 gives good grounds for speculation. The trials reporting a ≥10-point gain are consistently reviewed as showing no clear difference between aflibercept and ranibizumab, but the two latest reviews report a new binary outcome (≥15-point gain) and both show an advantage for those allocated to aflibercept. Perhaps the averaging across all people in the trials has masked an important group of people who respond better to aflibercept. But these are clinical and research points of debate.
Overall, the five reviews have reported results that are complicated and thought-provoking, but not truly inconsistent with each other. The reader needs to consider the value of each outcome for their own needs. Researchers may favour the continuous measure of function, the clinician or patient the binary cut-off for better/not better, and the policy maker the economics.
3.4 Differences in what is truly significant
Even when the synthesis of data reaches a pre-stated level of statistical significance, the findings for the outcome measure may not have great clinical impact. Confusion easily arises when the same data are commented on from a statistical perspective in one place and in terms of clinical meaning in another. Careful consideration is required from the reader to understand the assessment of the reviewers – are they reporting the clinical or the statistical perspective, or a mixture of both?
A further danger of differing interpretations of the same findings arises when confidence intervals straddle zero (for continuous data) or 1 (for binary data – as for all the ≥10-point gain findings in Table 2). It is easy for reviewers and readers of the reviews to confuse ‘no evidence of an effect’ with ‘evidence of no effect’. When confidence intervals are wide, for example the 0.63 to 4.06 of Muston 2018 in Table 2, they straddle unity. In this case it is wrong to claim that aflibercept has ‘no effect’ or is ‘no different’ from ranibizumab – both statements carry too much certainty. It is true that no clear difference was found, but that is not the same as showing one drug is no different from the other. If the possibility of a true beneficial effect is entertained in the conclusions, the possibility of a true harmful effect should also be mentioned and discussed.
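The distinction can be sketched in a few lines of code. This is not the reviews’ own analysis, only a minimal illustration of how a ratio-scale confidence interval relates to the null value of 1; the 0.63 to 4.06 interval is the Muston 2018 example from Table 2:

```python
def interpret_ratio_ci(lower, upper, null=1.0):
    """Distinguish 'no clear difference' from 'evidence of no effect'
    for a ratio-scale (e.g. odds ratio) confidence interval."""
    if lower > null:
        return "favours intervention"
    if upper < null:
        return "favours comparator"
    # The interval straddles unity: the data are compatible with benefit,
    # harm, or no difference - this is NOT evidence of 'no effect'.
    return "no clear difference (not evidence of no effect)"

print(interpret_ratio_ci(0.63, 4.06))  # Muston 2018, ≥10-point gain
```

The width of such an interval (0.63 to 4.06 spans possible harm through substantial benefit) is precisely why the cautious reading, rather than a claim of ‘no difference’, is warranted.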
As always, really thinking about the meaning of findings is key. Together, the point estimate and confidence interval provide the information needed to assess the effects of the intervention on the outcome. For example, in the evaluation of these drugs on BCVA it could have been decided that it would be clinically useful if the medication increased BCVA from baseline by 5 letters – and by at the very least 2 letters. Virgili 2018 reports an effect estimate of an increase from baseline of 4 letters with a 95% confidence interval from 2.5 to 5.5 letters. This allows the conclusion that aflibercept was useful, since both the point estimate and the entire range of the interval exceed the criterion of an increase of 2 letters. The Régnier 2014 review reported a similar point estimate (4.5 letters) but with a wider interval, from 1.5 to 7 letters. In this case, although it could still be concluded that the best estimate of the aflibercept effect is that it provides net benefit, the reader could not be so confident, as the possibility still has to be entertained that the effect could be between 1.5 and 2 letters – a range that had been pre-specified to be of little clinical value. The contrast of Régnier 2014 and Virgili 2018 serves well to illustrate how very similar findings may justify subtly different implications. The reviewers carry a responsibility to help the reader through clear reporting and thoughtful, inclusive explanations – but where this has not happened the readers may have to do this for themselves.
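This reading of a confidence interval against a pre-specified minimal threshold can be sketched as follows. The 2-letter minimal threshold and the two intervals are the worked numbers from the text; the classification logic is an illustrative simplification, not the reviews’ method:

```python
def clinical_reading(lower, upper, minimal_threshold=2.0):
    """Relate a mean-difference 95% CI (in letters) to a pre-specified
    minimal clinically worthwhile improvement."""
    if lower >= minimal_threshold:
        return "whole interval exceeds threshold: confident of net benefit"
    if upper < minimal_threshold:
        return "whole interval below threshold: unlikely to be clinically useful"
    # Point estimate may still suggest benefit, but values of little
    # clinical value cannot be ruled out.
    return "interval includes sub-threshold values: benefit plausible but uncertain"

print(clinical_reading(2.5, 5.5))  # Virgili 2018 interval
print(clinical_reading(1.5, 7.0))  # Régnier 2014 interval
```

The two calls return different readings despite near-identical point estimates (4 versus 4.5 letters), mirroring the contrast drawn in the text.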