On the appropriate interpretation of evidence: the example of anti-vascular endothelial growth factor for diabetic macular edema

Background: Different network meta-analyses (NMAs) on the same topic can reach different findings. In this review we investigated network meta-analyses comparing ranibizumab with aflibercept for diabetic macular edema in the hope of illuminating why the differences in findings occurred. Findings: For the binary outcome of best corrected visual acuity, the different reviews all agreed on there being no clear difference between the two treatments, while for the continuous outcomes all reviews favoured aflibercept over ranibizumab. We discuss four points of particular concern that are illustrated by five similar NMAs: network differences, PICO differences, different data from the same measures of effect, and differences in what is truly significant. Conclusions: Closer inspection of each of these reviews shows how the methods, including the searches and analyses, all differ, but the findings, although presented differently and sometimes interpreted differently, were similar.


Background
With the rapid increase in biomedical evidence, systematic reviews offer an opportunity to make healthcare decisions based on comprehensive summaries of the best available evidence on a topic (1)(2)(3). Current knowledge may be imperfect, but decisions should be better informed when taken in the light of the best, most up-to-date knowledge. It is crucially important that the systematic review itself is both clear and accurate for local interpretation by healthcare decision makers (healthcare practitioners and policy makers) (3,4). When different research teams take similar approaches to summarizing evidence but report contradictory findings, there is a problem. In this review we investigate one such example in the hope of illuminating why the differences in findings occurred.
Our example is taken from network meta-analyses (NMAs) comparing ranibizumab (an anti-VEGF monoclonal antibody fragment) with aflibercept (an anti-VEGF recombinant fusion protein) for diabetic macular edema (DME). DME leads to impaired vision-related functioning and quality of life (QoL) (7) and is the main cause of moderate to severe vision impairment in people with diabetes. DME constitutes a substantial economic burden for patients and public health systems (8, 9). The therapeutic goal for people with DME is to improve visual function and vision-related QoL (10). Anti-vascular endothelial growth factor (VEGF) therapy is recommended by several clinical guidelines as first-line treatment (11,12). Ranibizumab and aflibercept are commonly used in clinical practice, but there are limited direct comparisons of these two drugs. Network meta-analysis then becomes an attractive option, as NMAs use past studies in which each of the two drugs was compared directly with other controls to construct statistical indirect comparisons of the two medications of current interest.

Findings
We identified five NMAs (13-17) through searches of electronic databases (PubMed, Embase, Cochrane Library, Web of Science, CNKI, Wanfang, VIP) and subsequent screening. The AMSTAR-2 tool (3) was employed to provide a rating of the quality of each of the reviews.
Almost everything varied across the reviews (Table 1). Although the question under investigation was consistent, the searches, the numbers of studies used, the definitions of eligible participants, the comparisons from which to source data and the acceptable outcomes mostly lacked rigid consistency. Table 2 summarises three outcomes in the NMAs. The first outcome shows how, for the identical binary outcome, different reviews gathered data from different studies and, partly due to this, arrived at slightly different point estimates, although all agreed on there being no clear difference between the two treatments (all 95% confidence intervals straddled unity); the other two outcomes reproduce the results from each review and illustrate how the same measure is reported in different ways and different combinations across the five reviews.

Discussion
Generally speaking, it is clear that the reader must continue to think 'cleanly' amidst data that may not be quite so clean. Below we discuss some points of particular concern that are illustrated by these five similar NMAs.

Network differences
Network meta-analysis employs data from [in these cases] randomised trials in ways that allow comparisons of interest to be constructed using somewhat assumption-heavy observational methods. For example, aflibercept versus ranibizumab is the comparison of interest (referred to as the decision set). Aflibercept or ranibizumab, however, may only have been compared with sham injection in randomised trials. We use the term 'supplementary set' to refer to interventions, such as sham injection, that are included in the network meta-analysis for the purpose of improving inference among interventions in the decision set (4). Different selections for the supplementary set produce different network structures for the same clinical problem. In selecting which competing interventions to include in the decision set, researchers should ensure that the transitivity assumption is likely to hold, mostly based on clinical considerations (4,27).
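To make the mechanics of such an indirect comparison concrete, the sketch below (Python, with purely hypothetical numbers that belong to none of the five reviews) applies the simplest adjusted indirect comparison, the Bucher method: the relative effect of drug A versus drug B is derived from trials of A versus a common comparator C and of B versus C. Note that the standard error of the indirect estimate is larger than either input, one reason wide confidence intervals are common in NMA.

```python
import math

def bucher_indirect(d_ac, se_ac, d_bc, se_bc):
    """Adjusted indirect comparison (Bucher method) of A vs B via a
    common comparator C, on an additive scale such as the log odds
    ratio. Assumes transitivity: the A-C and B-C trials are similar
    enough for their relative effects to be combined."""
    d_ab = d_ac - d_bc                      # indirect effect, A vs B
    se_ab = math.sqrt(se_ac**2 + se_bc**2)  # variances add
    ci = (d_ab - 1.96 * se_ab, d_ab + 1.96 * se_ab)
    return d_ab, ci

# Hypothetical log odds ratios versus sham injection (illustrative
# values only, not estimates from the reviews discussed here).
d_ab, (lo, hi) = bucher_indirect(d_ac=0.60, se_ac=0.20, d_bc=0.45, se_bc=0.25)
print(f"aflibercept vs ranibizumab OR: {math.exp(d_ab):.2f} "
      f"(95% CI {math.exp(lo):.2f} to {math.exp(hi):.2f})")
```

Even with these invented inputs the indirect interval straddles unity, which previews the interpretive problems discussed below.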
Provided the theoretical assumptions (transitivity and consistency) hold, there is no absolute right or wrong in the construction of a network structure. The reader of the review should carefully consider whether the network's indirect comparisons are sensible and make best use of the available data.

PICO differences
Mostly, the different decisions for the PICO choices are all considered to make clinical sense in the different NMAs. These differences do then lead to results that are not identical. For example, one review may feel that a systematic difference in the participants of a particular trial makes it inappropriate to network with the data of other studies. These decisions, all made with the best of intentions, lead to different studies contributing to the final, slightly different, results as illustrated in Table 2. It should not be a surprise that clinicians and researchers evolve their ideas and differ even at the level of these fundamental choices.

Different data from the same measures of effect
Readers of trials may first want to know simply whether a person improved or not (meriting a binary yes/no answer) and then, only as a second preference, seek more detailed information on the degree of improvement (meriting an answer on a continuous scale). The dichotomous or binary outcome, though, is often a crude and even arbitrary cut-off within an ostensibly continuous measure. Continuous measures are, however, often a research fabrication and not truly continuous.
In the examples in Table 2 the average improvements seem fairly consistently to be a matter of around 4 points. It is difficult to really understand what this may mean for any one patient's life. Something may be lost in averaging across the groups, however, that is revealed in the binary outcomes, and Table 2 gives good grounds for speculation. Trials that report a ≥10-point gain are consistently reviewed as showing no clear difference between aflibercept and ranibizumab, but the two latest reviews have a new binary outcome to report (≥15-point gain) and both show an advantage for those allocated to aflibercept. Perhaps the averaging across all people in the trials has masked an important group of people who respond better to aflibercept. But these are clinical and research points of debate.
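This masking effect is easy to demonstrate in a toy simulation. The sketch below (purely illustrative Python; the distributions are invented and are not fitted to any trial data) draws two arms with the same mean letter gain but different spreads: the further the responder threshold moves into the tail, the more the responder proportions diverge, even though the means are indistinguishable.

```python
import random

random.seed(1)

# Purely illustrative: two arms with the same mean gain (4 letters)
# but different spreads, i.e. one arm contains stronger responders.
arm_a = [random.gauss(4.0, 6.0) for _ in range(100_000)]
arm_b = [random.gauss(4.0, 9.0) for _ in range(100_000)]

def responder_rate(arm, threshold):
    """Proportion of patients gaining at least `threshold` letters."""
    return sum(x >= threshold for x in arm) / len(arm)

print(f"mean gain: A {sum(arm_a)/len(arm_a):.1f}, B {sum(arm_b)/len(arm_b):.1f}")
for t in (10, 15):
    print(f">={t}-letter gain: A {responder_rate(arm_a, t):.1%}, "
          f"B {responder_rate(arm_b, t):.1%}")
```

With these invented parameters the ≥15-letter responder proportions differ roughly threefold while the means agree, which is the kind of pattern the ≥15-point results in Table 2 hint at.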
Overall, the five reviews have reported results that are complicated and thought provoking, but not truly inconsistent with each other. The reader needs to consider which outcome is of most value for their own needs. The researcher may favour the continuous measure of function, the clinician or patient the binary cut-off for better/not better, and the policy maker the economics.

Differences in what is truly significant
Even when the synthesis of data produces a result at a pre-stated level of statistical significance, the findings for that outcome measure may not have great clinical impact. It is easy for confusion to arise when the same data are commented on from the statistical perspective or in terms of clinical meaning. Careful consideration is required from the reader to understand the assessment of the reviewers: are they reporting the clinical or the statistical perspective, or a mixture of both?
Further danger of differing interpretations of the same findings lies in confidence intervals that straddle zero (for continuous data) or 1 (for binary data, as for all the ≥10-point gain findings in Table 2). It is easy for reviewers and readers of the reviews to confuse 'no evidence of an effect' with 'evidence of no effect'. When confidence intervals are wide, for example the 0.63 to 4.06 of Muston 2018 in Table 2, they straddle 1, or unity. In this case it is wrong to claim that aflibercept has 'no effect' or is 'no different' from ranibizumab; both statements carry too much certainty. It is true that no clear difference was found, but that is not the same as the drugs being shown to be equivalent. If a possible true beneficial effect is mentioned in the conclusion, the possible true harmful effect should also be mentioned and discussed.
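The distinction can be encoded mechanically. In the sketch below (Python; the 0.8 to 1.25 equivalence margin is our assumption for illustration, borrowed from bioequivalence convention, and is not a margin used by any of the five reviews), a wide interval such as Muston 2018's earns the cautious wording, not a claim of 'no effect'.

```python
def interpret_or_ci(lo, hi, equiv=(0.8, 1.25)):
    """Cautious wording for a 95% CI around an odds ratio.
    `equiv` is a hypothetical equivalence margin, assumed purely for
    illustration (not taken from the reviews under discussion)."""
    if lo > 1.0 or hi < 1.0:
        return "evidence of a difference: the CI excludes 1"
    if equiv[0] < lo and hi < equiv[1]:
        return "evidence of no important difference: the CI sits within the margin"
    return ("no evidence of a difference: the CI is compatible with both "
            "benefit and harm, so 'no effect' cannot be claimed")

# The >=10-point gain interval reported by Muston 2018 (Table 2):
print(interpret_or_ci(0.63, 4.06))
```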
As always, really thinking about the meaning of findings is key. Together, the point estimate and confidence interval provide the information needed to assess the effects of the intervention on the outcome. For example, in the evaluation of these drugs on BCVA it could have been decided that it would be clinically useful if the medication increased BCVA from baseline by 5 letters, and at the very least by 2 letters. Virgili 2018 reports an effect estimate of an increase from baseline of 4 letters with a 95% confidence interval from 2.5 to 5.5 letters. This allows the conclusion that aflibercept was useful, since both the point estimate and the entire range of the interval exceed the criterion of an increase of 2 letters. The Régnier 2014 review reported a similar point estimate (4.5 letters) but with a wider interval, from 1.5 to 7 letters. In this case, although it could still be concluded that the best estimate of the aflibercept effect is that it provides net benefit, the reader could not be so confident, as the possibility still has to be entertained that the effect could be between 1.5 and 2 letters, a range that had been pre-specified to be of little clinical value. The contrast of Régnier 2014 and Virgili 2018 serves well to illustrate how very similar findings may justify subtly different implications. The reviewers carry a responsibility to help the reader by clear reporting and thoughtful, inclusive explanations, but where this has not happened the readers may have to do this for themselves.
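The same reasoning can be written down as a simple decision rule. The sketch below (Python; the function and its structure are ours, encoding only the 2-letter criterion from the worked example above) reproduces the contrast between the two reviews.

```python
def interpret_gain(lo, hi, mcid=2.0):
    """Judge a continuous effect (letters gained) against a pre-specified
    minimum clinically important difference (MCID), here the 2-letter
    criterion from the worked example in the text."""
    if lo >= mcid:
        return "useful: even the lower CI bound exceeds the MCID"
    if hi < mcid:
        return "unlikely to be clinically useful: the whole CI falls below the MCID"
    return ("best estimate suggests net benefit, but the CI still includes "
            "effects below the MCID, so confidence is limited")

# 95% CIs for letters gained, as reported in the text:
print("Virgili 2018:", interpret_gain(2.5, 5.5))
print("Régnier 2014:", interpret_gain(1.5, 7.0))
```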

Conclusions
We have summarised the methods and findings of five NMAs on the same topic which produced what seemed like somewhat different findings from similar data sets. Closer inspection of each of these reviews shows how the methods, including the searches and analyses, all differ, but the findings, although presented differently and sometimes interpreted differently, were similar.
As always, the critical reader of a review should think about the review in detail. This is helped by long-established checklists (28). Furthermore, Grading of Recommendations Assessment, Development, and Evaluation (GRADE) offers a transparent and structured approach to rating the certainty of the evidence.