This is the first study to apply AMSTAR 2 to highly cited HF-related SRs and MAs. Its sixteen domains evaluate each step of the conduct of SRs and MAs. AMSTAR 2 has previously been used in various fields, including psychiatry, surgery, pediatrics, endocrinology, rheumatology, and cardiovascular disease, and many publications have reported substantial numbers of SRs with low to critically low quality [28–33]. Some authors have called this a “floor effect,” attributing it to the tool’s lack of discriminative capacity and raising questions about its high standard and practical value. In this study, we observed similar findings for high-impact SRs and MAs related to HF. This raises a dilemma: either the AMSTAR 2 standard is unreasonable, or these high-impact SRs and MAs are of low quality. We can break this question down further: are AMSTAR 2’s 16 domains and its backbone Cochrane guidelines impractical; are AMSTAR 2’s overall rating rules unreasonable; were there actual defects in these SRs and MAs; and was there under-reporting that should be formalized to cover the information gap and facilitate readers’ understanding of these studies?
Guidelines and evaluation systems for SRs and MAs form a rapidly evolving field, with new or updated guides released every few years [1–7, 10, 20, 35–39]. AMSTAR 2 was published in 2017, when most SRs were using the Cochrane, QUOROM, PRISMA, and MOOSE guidelines. Until 2020, fewer than 200 publications in PubMed had used AMSTAR 2. In our review, none of the 81 studies used AMSTAR 2 as a guide. Therefore, some noncompliance can be attributed to the fact that these studies did not follow AMSTAR 2 from the beginning in conducting and reporting their work. Additionally, the strict “yes or no” rule in AMSTAR 2 precludes responses such as “not applicable,” “cannot answer,” or “not reported,” forcing evaluators to choose “no” for potentially under-reported studies and to categorize the corresponding domains as weak. This probably explains why quality assessments performed with AMSTAR 2 generally appear worse than those performed with the original AMSTAR in previous reviews.
The 95% “no” rate for the critical domains Q2 and Q7 played a major role in the overall low rating of these 81 studies, since failing even one critical domain results in an overall low quality rating. These two critical domains have been included since the original AMSTAR in 2007. The Cochrane Handbook states that “All Cochrane reviews must have a written protocol, specifying in advance the scope and methods to be used by the review, to assist in planning and reduce the risk of bias in the review process.” However, none of the nine Cochrane SRs included here reported pre-established protocols. The PRISMA checklist also states, “If registered, provide the name of the registry (such as PROSPERO) and registration number,” suggesting that registration is not compulsory. Unsurprisingly, none of the 19 studies using the PRISMA checklist reported pre-established protocols. The same is true for MOOSE and QUOROM. Apparently, pre-registration or publishing a pre-established protocol is not yet a common practice for SRs, although more studies have been following this step in recent years [40–43]. Q7 is based on the same content as Cochrane Handbook Chapter 4, with the further instruction that “the list of excluded studies should be as brief as possible. It should not list all of the reports that were identified by an extensive search.” However, a complete list of exclusions is not mandated by the PROSPERO guidelines, nor by PRISMA or QUOROM. Although exclusion is an essential step during study selection in SRs and MAs, the various guidelines have not reached a consensus on reporting a complete list of exclusions, and such a list may be difficult to accommodate within publication constraints. Thus, the designation of Q2 and Q7 as critical domains and the decision to assign an overall poor quality to studies failing them need to be justified.
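The overall rating rule discussed above can be made concrete. The following minimal Python sketch follows the published AMSTAR 2 guidance for the default critical domains and the four confidence levels; the function and variable names are our own illustration, not part of the tool.

```python
# Sketch of AMSTAR 2's overall confidence rating.
# Default critical domains and thresholds follow the published
# guidance (Shea et al., 2017); names here are illustrative only.

CRITICAL_DOMAINS = {2, 4, 7, 9, 11, 13, 15}

def overall_rating(weak_domains):
    """weak_domains: set of domain numbers (1-16) answered 'no'."""
    critical = len(set(weak_domains) & CRITICAL_DOMAINS)
    noncritical = len(set(weak_domains) - CRITICAL_DOMAINS)
    if critical > 1:
        return "critically low"   # more than one critical flaw
    if critical == 1:
        return "low"              # a single critical flaw caps the rating
    if noncritical > 1:
        return "moderate"
    return "high"                 # no or one non-critical weakness
```

Under this rule, a study failing only Q2 and Q7, both critical domains, is rated critically low regardless of its performance on the other 14 domains, which is exactly the pattern observed in our sample.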
Almost half of the SRs are now based on NRSIs, and an increasing number of NRSIs draw on large databases that provide a better real-world picture. Although such studies can be more precise, they are also more easily confounded. Thus, one of the major advances of AMSTAR 2 over AMSTAR is the inclusion of separate RoB assessments and data-pooling methods for NRSIs. The 31% “no” answers for Q9 resulted from failure to report the RoB tools used or from incorrect use of tools. Choosing the right RoB tool for RCTs and NRSIs remains a major challenge for SR authors because of the large number of tools available. The lack of homogeneity among tools also makes comparisons difficult. The RoB assessment tool recommended by AMSTAR 2 for RCTs is the Cochrane Collaboration tool, and for NRSIs it is ROBINS-I [1, 23]. In our study, different tools were eligible for the Q9 assessment. However, the same RCT or NRSI rated as showing a low RoB by one tool may be rated as showing an elevated RoB by another, more comprehensive tool. This discrepancy affected the RoB-related domains Q12 and Q13 and eventually led to different overall quality ratings, raising questions about AMSTAR 2’s reliability and consistency. The noncritical domain Q12 is also mentioned in the Cochrane Handbook and in PROSPERO registration; however, there is no consensus regarding analysis of the influence of RoB in other checklists [1–4, 44]. Q11 further distinguishes RCTs from NRSIs in terms of data-pooling methodology. The rationale for requiring the pooling of fully adjusted estimates from NRSIs is that data adjusted for confounders may generate very different results from raw data. Deficiencies in heterogeneity investigation were another finding for Q11, detailed in a recent paper reviewing HF-related MAs. Nonetheless, the multiple domains assessing RoB and NRSIs reflect their overall importance in SRs and MAs.
Future researchers should better understand the differences in RoB between RCTs and NRSIs, choose RoB tools wisely, address their influence objectively, and avoid using raw data from NRSIs for meta-analysis.
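The Q11 recommendation above can be illustrated: with NRSIs, what should enter the meta-analysis are the fully adjusted (log-scale) effect estimates and their standard errors, not raw event counts. The following is a minimal sketch of standard DerSimonian–Laird random-effects pooling applied to such adjusted estimates; the function name and inputs are illustrative, not drawn from any of the reviewed studies.

```python
import math

def dersimonian_laird(effects, ses):
    """Random-effects pooling of adjusted (log-scale) effect estimates.
    effects: adjusted log hazard/odds ratios; ses: their standard errors."""
    w = [1.0 / se**2 for se in ses]                  # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sw
    q = sum(wi * (e - fixed)**2 for wi, e in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sw - sum(wi**2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)                    # between-study variance
    w_re = [1.0 / (se**2 + tau2) for se in ses]      # random-effects weights
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return pooled, se_pooled, tau2
```

When the studies agree, the between-study variance collapses to zero and the result matches a fixed-effect pooled estimate; when they conflict, the random-effects weights down-weight the most precise studies less aggressively, reflecting genuine heterogeneity.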
The final underperforming critical domain was Q15, which covers the investigation of publication bias. Funnel plots are the best known and most commonly used method to assess publication bias or small-study effects, but at least 10 studies are needed to reliably show funnel plot asymmetry [1, 49]. Most of the SRs answered “no” for Q15 because they included fewer than 10 studies and did not report graphical or statistical tests of publication bias. This issue in AMSTAR 2 may need to be addressed in a future study.
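The ten-study condition above can be illustrated with Egger’s regression test, the most common statistical companion to the funnel plot: the standardized effect is regressed on precision, and a non-zero intercept suggests asymmetry. This is a minimal sketch, not the method used by the reviewed studies; the normal approximation for the p-value is a simplification (a t-distribution is standard), and all names are illustrative.

```python
import math
from statistics import NormalDist

def eggers_test(effects, ses):
    """Egger's regression intercept test for funnel-plot asymmetry.
    Regresses effect/SE on 1/SE; a non-zero intercept suggests asymmetry.
    Guarded at >= 10 studies, per the reliability threshold noted above."""
    n = len(effects)
    if n < 10:
        raise ValueError("fewer than 10 studies: funnel-based tests unreliable")
    y = [e / s for e, s in zip(effects, ses)]   # standardized effects
    x = [1.0 / s for s in ses]                  # precision
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx)**2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r**2 for r in resid) / (n - 2)
    se_int = math.sqrt(s2 * (1.0 / n + mx**2 / sxx))
    t = intercept / se_int
    # two-sided p via normal approximation (simplification)
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return intercept, p
```

The guard clause mirrors the situation observed here: for most of the 81 SRs, such a test could not legitimately have been run at all, so a “no” for Q15 partly reflects small evidence bases rather than author negligence.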
In summary, AMSTAR 2 is an appraisal tool, based on the Cochrane Handbook, that focuses on the conduct of SRs and MAs of healthcare interventions. Our AMSTAR 2 evaluation of highly cited HF-related SRs and MAs identified areas of insufficiency and highlighted scope for improvement in future studies, including an a priori protocol or pre-registration, a full exclusion list with justifications, appropriate RoB assessment, and caution when combining NRSI data. These findings reflect the core value of the AMSTAR 2 and Cochrane guidelines in avoiding bias. However, compared with the most commonly used guidelines mentioned above, AMSTAR 2 is relatively new and advanced; consensus among the various guidance documents and assessment tools is therefore essential before it can be considered the standard. The conclusion of this study is not that a “new” tool should be used to judge older SRs and label them “low quality.” Moreover, no perfect tool exists: AMSTAR 2 does not justify its designation of certain domains as critical and others as non-critical, nor does it explain the underlying rules used to categorize studies as “high” versus “low” quality. Considering these aspects, the interaction between AMSTAR 2 and these high-impact SRs yielded findings of interest that can be expected to facilitate the validation of AMSTAR 2 and provide feedback for its future development.
We thoroughly searched all published materials related to these 81 studies, but we did not contact the review authors to clarify the “cannot answer” domains or “not reported” content. Had we done so, some answers might have changed because of potential underreporting in some domains, particularly Q7 and Q10. We also did not reclassify noncritical domains as critical (or vice versa) for individual studies, although AMSTAR 2 allows this. Finally, we did not include MAs that merely summarize the known literature base or SRs without MAs, which may call into question the critical nature of Q4, Q7, Q11, and Q15.