Our findings suggest that evidence of inconsistency was at least twice as frequent as would be expected by chance if all networks were truly consistent (under the null hypothesis of no inconsistent networks in our sample, we would expect one in 20 networks to test positive at the 0.05 level and one in 10 at the 0.10 level). Overall, evidence against the hypothesis of consistency (as defined by the p-value of the DBT test) in NMAs with dichotomous data was found in one in seven networks at the 0.05 threshold. Taking into consideration the low power of the inconsistency test, and in particular of the DBT model, which has more degrees of freedom than other inconsistency tests[16, 17], we also used the threshold of 0.10, at which inconsistency was detected in one in five networks. Given that 14% of the networks were observed to be inconsistent, we expect the proportion of truly inconsistent networks to range between 12% and 20%, assuming the test has a type I error rate of exactly 0.05 and power between 50% and 80%.
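This range follows from a simple back-calculation; as a sketch under those stated assumptions, with \(\pi\) denoting the true proportion of inconsistent networks, the expected detection rate satisfies

\[ 0.14 = \alpha\,(1-\pi) + \text{power}\times\pi \quad\Rightarrow\quad \pi = \frac{0.14-\alpha}{\text{power}-\alpha}, \]

which with \(\alpha = 0.05\) gives \(\pi = 0.09/0.75 = 12\%\) for power of 0.80 and \(\pi = 0.09/0.45 = 20\%\) for power of 0.50.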
Our study showed that structural network characteristics only weakly affect the detection of inconsistency. In particular, we found a mild association between lower p-values of the DBT test and networks combining a large number of studies with a small number of interventions or loops, probably because the power to detect inconsistency is greater in such networks. Another key finding was that a substantial drop in heterogeneity when moving from the consistency to the inconsistency model is associated with evidence of inconsistency, suggesting that heterogeneity estimated under the consistency model may be absorbing discrepancies between direct and indirect evidence in the network. Results were overall consistent between the DL and REML heterogeneity estimators.
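Heuristically, this is to be expected: the DBT (inconsistency) model includes design-by-treatment interaction terms, so between-design disagreement that the consistency model can only absorb into the common heterogeneity is modelled explicitly, and in expectation

\[ \hat\tau^2_{\text{consistency}} \ge \hat\tau^2_{\text{DBT}}, \]

so that a large drop \(\Delta\hat\tau^2 = \hat\tau^2_{\text{consistency}} - \hat\tau^2_{\text{DBT}}\) flags inconsistency masquerading as heterogeneity.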
To the best of our knowledge, this is the largest empirical study to evaluate the prevalence of inconsistency in networks of trial evidence. Overall, our findings align with our previous study[10], in which we evaluated 40 networks of interventions. The present study includes five times as many networks as our previous review and additionally explores multiple structural network characteristics. We found slightly higher empirical rates of inconsistency than in our previous study (14% of 201 networks vs. 13% of 40 networks), suggesting that researchers should devote more resources to exploring how to mitigate inconsistency.
Our study has a few limitations worth noting. First, for the empirical assessment of consistency, we evaluated articles with dichotomous outcome data, restricting attention to the odds ratio effect measure. We expect our findings to be generalisable to other effect measures: although our previous empirical study showed that in some cases inconsistency was reduced when moving from one effect measure to another, the detected inconsistency rates were overall similar across effect measures[10]. For completeness, it would be interesting to carry out an empirical study of continuous outcomes to examine possible differences in inconsistency between mean differences, standardized mean differences, and ratios of means. Second, in the present study we assumed a common within-network heterogeneity. This is often clinically reasonable and statistically convenient: since most direct intervention comparisons in networks comprise only a few studies, sharing a single heterogeneity parameter allows such comparisons to borrow strength from the entire network. However, under this assumption, comparisons whose true heterogeneity is smaller than that of the rest of the network will be reported with more uncertainty around their summary effect than is accurate, and the chances of detecting inconsistency then decrease. Although assuming a common within-network heterogeneity can thus lead to underestimating inconsistency, it better reflects how summary effects are combined in an NMA in practice. Alternatively, when heterogeneity is believed to vary across comparisons, different heterogeneity parameters can be built into the model, but these must be constrained to satisfy relationships implied by the consistency assumption[28]. Third, we assessed detection of inconsistency using a threshold on the DBT p-value, which reflects common practice, and ignored the magnitude of the differences between designs and between the direct and indirect estimates. However, to avoid “vote-counting” of strong evidence against the consistency hypothesis, we also explored the distribution of the DBT p-values according to several structural network characteristics. Fourth, we did not exclude potential outlier networks, as this was outside the scope of the study.
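As one illustration of such a restriction (a sketch only; the exact constraints depend on the parameterisation adopted in [28]): if the study-level relative effects in a loop satisfy consistency, \(\delta_{AC} = \delta_{AB} + \delta_{BC}\), the comparison-specific heterogeneity standard deviations cannot be chosen freely but must obey a triangle-type inequality,

\[ \lvert \tau_{AB} - \tau_{BC} \rvert \le \tau_{AC} \le \tau_{AB} + \tau_{BC}, \]

because \(\operatorname{Var}(\delta_{AC}) = \tau_{AB}^2 + \tau_{BC}^2 + 2\rho\,\tau_{AB}\tau_{BC}\) for some correlation \(\rho \in [-1, 1]\).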
In a systematic review and NMA, investigators should interpret strong evidence against the consistency hypothesis very carefully and be aware that inconsistency in a network can be absorbed into estimates of heterogeneity. Given that inconsistency is frequent in published NMAs, authors should be careful in the interpretation of their results. Confidence in NMA findings should always be evaluated, for example using CINeMA[29] (confidence in network meta-analysis) or the GRADE (Grading of Recommendations Assessment, Development, and Evaluation) approaches for NMA[30]. Since inconsistency tests may lack the power to identify true inconsistency, we recommend against interpreting ‘no evidence for inconsistency’ as ‘no inconsistency’. We also recommend using both a global approach (e.g., the DBT model) and a local approach (e.g., the loop-specific approach[10] or the node-splitting method[7]) to assess inconsistency in a network before drawing conclusions about its presence or absence. However, detection of inconsistency often prompts authors to rely only on direct evidence, which is often perceived as less prone to bias, and to disregard the indirect information[23]. Instead of selecting between the two sources of evidence, it is advisable to try to understand and explore possible sources of inconsistency, and to refrain from publishing results based on inconsistent evidence[5, 31].
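To make the local approach concrete, below is a minimal sketch of a loop-specific (Bucher-type) check, comparing the direct estimate for a comparison with the indirect estimate obtained via a common comparator. The function name and the numerical inputs are hypothetical; both estimates are assumed to be on the log odds ratio scale and statistically independent.

```python
import math

def loop_inconsistency(d_direct, se_direct, d_indirect, se_indirect):
    """Loop-specific check of consistency: z-test on the difference between
    independent direct and indirect log odds ratio estimates."""
    diff = d_direct - d_indirect                      # inconsistency factor (log OR scale)
    se_diff = math.sqrt(se_direct**2 + se_indirect**2)
    z = diff / se_diff
    p = math.erfc(abs(z) / math.sqrt(2))              # two-sided normal p-value
    return diff, se_diff, p

# Hypothetical example: a direct A-vs-B estimate, and an indirect one formed
# via comparator C as d_AC - d_BC (its SE combines the two variances).
diff, se, p = loop_inconsistency(d_direct=0.35, se_direct=0.15,
                                 d_indirect=0.05, se_indirect=0.20)
print(f"inconsistency factor = {diff:.2f} (SE {se:.2f}), p = {p:.3f}")
```

A p-value near the chosen threshold for a specific loop, alongside the global DBT result, helps localise where in the network direct and indirect evidence disagree.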
NMA is increasingly used and, although assessment of its underlying assumptions has improved in recent years, there is room for further improvement[1, 2]. Systematic review and NMA protocols should prespecify methods for the evaluation of inconsistency and define the strategies to be followed when inconsistency is present. The studies included in an NMA should also be compared with respect to the distribution of effect modifiers across intervention comparisons. Authors should follow the PRISMA-NMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses for Network Meta-Analyses) guidelines[32] and report the results of their inconsistency assessment, as well as the potential impact of inconsistency on their NMA findings.