Methodological Approaches for Assessing Certainty of the Evidence in Umbrella Reviews: a Systematic Review


Objective: To identify and describe the methodological approaches for assessing the certainty of the evidence in umbrella reviews (URs) of meta-analyses (MAs). Study Design and Setting: We included URs of SR-MAs of interventions and non-interventions. We searched three databases, PubMed, Embase, and The Cochrane Library, from 2010 to 2020. Results: A total of 138 URs were included, consisting of 96 URs of interventions and 42 URs of non-interventions. Only 31 (32.3%) of the URs of interventions assessed the certainty of evidence, among which the GRADE approach was the most frequently used method (N=20, 64.5%), followed by credibility assessments (N=6, 12.9%). Conversely, 30 (71.4%) of the URs of non-interventions assessed the certainty of evidence, among which criteria for credibility assessment were mainly used (N=28; 93%). URs published in journals with a high journal impact factor (JIF) were more likely to assess the certainty of evidence than URs published in low-JIF journals. Conclusions: Only one-third of URs that included MAs of experimental designs assessed the certainty of the evidence, in contrast to the majority of URs of observational studies. Therefore, guidance and standards are required to ensure the methodological rigor and consistency of certainty of evidence assessment for URs.


Introduction
The number of systematic reviews and meta-analyses (SR-MAs) of interventions has increased dramatically over recent years [1]. Owing to this growth, the need to compile and update the evidence from these SR-MAs into one accessible and usable document, an umbrella review (UR), has substantially increased over the last decade [2]. While individual SR-MAs mostly focus on a single treatment comparison and outcome, a UR, also known as an overview of reviews, can summarize and even synthesize the findings across all existing treatment regimens and outcomes [1,2]. As a result, URs are considered the source of the highest level of evidence in the biomedical literature [2].
Although some methodological standards for SR-MAs can also be applied to URs (e.g., search strategy, study selection, data extraction), the methods used to assess the certainty of the evidence in URs are different. In addition, URs are more heterogeneous than original SR-MAs because they not only address a wider range of questions but also use a variety of assessment methods to establish the evidence. For instance, some URs use the Grading of Recommendations, Assessment, Development and Evaluations (GRADE) approach, which was originally designed for assessing the certainty of evidence of the primary studies included in SRs, not URs [3]. Under this approach, the certainty is rated down for limitations in several domains, e.g., risk of bias, indirectness of evidence, inconsistency (heterogeneity), imprecision, and publication bias. Some URs used the criteria from the Physical Activity Guidelines Advisory Committee (PAGAC) to grade the evidence based on applicability, generalizability, risk of bias or study limitations, quantity, consistency (of the results across the available studies), and the magnitude and precision of effect sizes [4-6]. Additionally, some URs reported the certainty of the included SRs and MAs as originally reported in each study, without further assessment [7-9].
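As a rough illustration of the GRADE logic just described, the sketch below starts from a rating determined by study design and rates down one level per serious concern in each domain. This is a simplified, hypothetical rendering, not the official algorithm: real GRADE judgments also allow two-level downgrades for very serious concerns and rating up observational evidence.

```python
LEVELS = ["very low", "low", "moderate", "high"]

def grade_certainty(randomized: bool, serious_concerns: dict) -> str:
    """Simplified GRADE sketch: start at 'high' for randomized evidence
    ('low' for observational) and rate down one level for each domain
    flagged as a serious concern. Omits two-level downgrades and
    rating-up factors."""
    level = 3 if randomized else 1
    for domain in ("risk_of_bias", "inconsistency", "indirectness",
                   "imprecision", "publication_bias"):
        if serious_concerns.get(domain, False):
            level -= 1
    return LEVELS[max(level, 0)]
```

For example, a body of randomized evidence with a serious imprecision concern would be rated "moderate" under this sketch.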
Aromataris et al. [10] developed and published methodological guidance on conducting and reporting an umbrella review, but it does not address assessing the certainty of evidence in a UR. Furthermore, certainty assessment in a UR of SR-MAs is more challenging because such reviews usually rely on summary statistical data as objective criteria to grade the certainty of evidence. Recently, relatively strict criteria for stratifying evidence using several statistical parameters (i.e., p-value, prediction interval, small-study effects, and excess significance bias) have been used and suggested as practical tips for conducting good URs [2]. To our knowledge, no formal guidance for assessing the certainty of evidence in URs exists. Therefore, this study aims to identify and describe the methodological approaches for assessing the certainty of the evidence in published URs.
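Criteria of this kind are typically operationalized as threshold rules over the meta-analytic statistics just listed. The sketch below uses one commonly cited set of cut-offs (convincing / highly suggestive / suggestive / weak); as noted later in this review, published URs vary in their exact cut-points, so these thresholds are illustrative, not definitive.

```python
def credibility_class(p_value, n_cases, i2, pi_excludes_null,
                      small_study_effects, excess_significance,
                      largest_study_significant):
    """Classify an association using commonly cited credibility
    thresholds. Illustrative only: published URs differ in the exact
    cut-offs for each component."""
    if (p_value < 1e-6 and n_cases > 1000 and i2 < 50
            and pi_excludes_null and not small_study_effects
            and not excess_significance):
        return "convincing"          # Class I
    if p_value < 1e-6 and n_cases > 1000 and largest_study_significant:
        return "highly suggestive"   # Class II
    if p_value < 1e-3 and n_cases > 1000:
        return "suggestive"          # Class III
    if p_value < 0.05:
        return "weak"                # Class IV
    return "not significant"
```

The rules are checked from strictest to weakest, so an association is assigned the highest class whose conditions it satisfies.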

Methods
This systematic review was conducted according to the methods pre-specified in a registered protocol (PROSPERO registration: CRD42020203273).

Search strategy and selection criteria
We searched three databases, PubMed, Embase, and The Cochrane Library, from May 2010 to May 2020. The keyword 'umbrella review' was used. The full search strategies, applied without language restriction, are described in Supplement 1. Manual searches of the reference lists of the eligible articles were also performed. We defined a UR as a review designed to summarize the evidence from multiple SR-MAs and labeled as an 'umbrella review' in the title or abstract of the article.
At least two reviewers (SS, KT, NP, SN, and SP) independently reviewed the titles, abstracts, and full texts for potential inclusion against the eligibility criteria. Any disagreement was resolved by consensus with a third reviewer (NC). We broadly categorized the retrieved URs (including overviews of SR-MAs and reviews of SR-MAs) into two separate categories: (a) URs of MAs of interventions or therapies and (b) URs of MAs of non-interventions. For the first category, we included URs if they included MAs of experimental designs (i.e., randomized controlled trials (RCTs) and non-RCTs) that pooled the effect sizes of interventions for prevention or treatment purposes. These interventions could be drugs, surgical techniques, changes in treatment/diet/policy, counseling, or modifiable risk factors (for example, smoking, alcoholism, substance abuse, etc.). The second category included URs of MAs of non-intervention studies involving diagnostic/risk/prognostic factors of diseases or health conditions, disease etiology, or prevalence or incidence, in which most studies were observational, e.g., cohort, case-control, and cross-sectional studies [11].
Other types of studies or reviews (e.g., handbooks, guidelines, commentaries, editorials, and methodological studies), materials for poster presentations, URs with network MAs, UR of SRs without MAs, and protocols of URs were excluded.

Data extraction
At least two of five reviewers (SS, NP, SN, KT, and SP) independently extracted the data from each UR into a customized data extraction table. Any disagreement was resolved by consensus with a third reviewer (NC).
Details of data extraction are described in Supplement 2.
The assessment of the certainty of evidence was defined as any evaluation of the totality or strength of the evidence, such as the GRADE approach, criteria for credibility assessment, the Agency for Healthcare Research and Quality (AHRQ) methods for systematic reviews, and other approaches used to grade the overall body of UR evidence.

Data synthesis and analysis
A descriptive analysis of the methodological approaches for assessing the certainty of evidence in the URs was performed using frequencies and percentages, stratified by intervention and non-intervention URs. The included URs were classified into high and low impact sources based on the journal impact factors (JIF) reported in the Institute of Scientific Information's Journal Citation Reports in 2019 [12]. Journals ranked among the top 100 in the therapeutic and non-therapeutic fields were defined as high impact; otherwise they were classified as low impact. In addition, we also classified URs based on the median JIF, i.e., as high if published in a journal with a JIF at or above the median, and as low otherwise. When feasible, we further compared URs published between 2010 and 2015 with those published from 2016 to 2020. The Chi-square test or Fisher's exact test, as appropriate, was applied to compare characteristics of URs between groups. All analyses were performed using STATA version 15.0 (College Station, TX); a p-value ≤ 0.05 was considered statistically significant. We analyzed data and presented the results separately for the two aforementioned categories of URs (URs of MAs of intervention studies and URs of MAs of non-intervention studies).
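For a 2x2 comparison of UR characteristics between groups, Fisher's exact test can be computed directly from the hypergeometric distribution. The sketch below is a stdlib-only Python illustration of that test (not the STATA code actually used), applied to the certainty-assessment counts reported in this review (31 of 96 intervention URs vs. 30 of 42 non-intervention URs).

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table
    [[a, b], [c, d]]: sum the probabilities of all tables with the
    same margins that are no more probable than the observed one."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)

    def pmf(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = pmf(a)
    lo = max(0, col1 - (n - row1))   # smallest feasible top-left count
    hi = min(row1, col1)             # largest feasible top-left count
    probs = [pmf(x) for x in range(lo, hi + 1)]
    return sum(q for q in probs if q <= p_obs * (1 + 1e-9))

# Counts reported in this review: 31/96 intervention URs vs.
# 30/42 non-intervention URs assessed the certainty of evidence.
p = fisher_exact_2x2(31, 96 - 31, 30, 42 - 30)
```

For this table the p-value falls well below 0.05, consistent with the large difference in proportions between the two groups.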

Results

Search results
We identified 1,767 articles, of which 174 were excluded as duplicates and 1,282 during title/abstract screening, leaving 311 studies for full-text review. A total of 138 URs met the eligibility criteria and were finally included in our study (Fig. 1). Among them, 96 were URs of interventions and 42 were URs of non-interventions [5,6,8,9,...]. The reasons for exclusion of the 173 articles after full-text review are described in detail in Supplement 3.

Methodological approaches for assessing certainty of the evidence

Of the 96 URs of SR-MAs of interventions, only 31 (32.3%) assessed the certainty of the evidence, see Table 1. The GRADE approach was the most frequently used method for assessing the certainty of the evidence (N = 20, 64.5%), followed by credibility assessments (N = 6, 12.9%), see Fig. 2. The criteria for credibility assessment varied across studies, as shown in Table 3. Of the 42 URs of non-interventions, 30 (71.4%) assessed the certainty of the evidence, most of which (28; 93%) utilized the epidemiological credibility assessment tools. These criteria were used to assess the evidence base in all but one [44] of the six articles in the high impact factor group [15,24,39,44,46,47]. The remaining two URs assessed the certainty of evidence using the GRADE criteria.

Methodological quality assessment
Almost all of the included URs in the intervention group performed a methodological quality assessment of the included MAs (n = 84, 87.5%). Of these, the most frequently used tool was AMSTAR (n = 40, 47.6%), followed by its revised version, AMSTAR 2 (n = 13, 15.5%), and the Joanna Briggs Institute (JBI) critical appraisal checklist for SRs (n = 6, 7.1%), as shown in Table 1 and Fig. 2. For more details of the methodological quality assessment of the included MAs, see Supplements 5-6.
Twenty-six (61.9%) URs of non-intervention studies assessed methodological quality using a specified tool; half of these used AMSTAR while 6 utilized AMSTAR 2, as shown in Table 1. Six of the URs used the critical appraisal checklist from the JBI. Only one of the URs that assessed methodological quality used the Newcastle-Ottawa Scale.

Discussion
To the best of our knowledge, this is the first study to identify the methodological approaches for assessing the certainty of evidence in URs that included SR-MAs. Overall, 138 URs were included, consisting of 96 URs of interventions and 42 URs of non-interventions. Only one-third of the URs of interventions assessed the certainty of evidence, among which the GRADE approach was mainly used. URs published in high-JIF journals were more likely to assess the certainty of evidence than URs published in low-JIF journals. About two-thirds of the URs of non-interventions assessed the certainty of evidence, among which criteria for credibility assessment were mainly used. Nearly 90% of the URs performed a methodological quality assessment, and AMSTAR was the most frequently used tool for this process.
The certainty of the evidence is the extent of confidence to support a decision or recommendation. High certainty means that the investigators are very confident that the effect they found across studies is close to the true effect, and vice versa [147]. Concerning the benefits and harms of a treatment or intervention, assessment of the certainty of the evidence is essential [148]. Moreover, the certainty of the evidence can be used to develop clinical practice guidelines and recommendations. Likewise, epidemiological investigations can help establish evidence linking exposures to the incidence of certain health conditions in a population. These studies are expected to play a key role in gauging the burden of diseases, delineating guidelines for prevention, and streamlining the treatment development process. URs of both interventional and observational studies should aim to provide the highest certainty of evidence to facilitate better health outcomes. Despite the necessity of assessing the certainty of the evidence in URs, there is no consensus on which approach should be the method of choice.
A previous study by Hartling et al. [1] indicated that only 16% of the overviews of reviews published between 2000 and 2011 assessed the certainty of the evidence; by comparison, our study found that one-third of the included URs of interventional studies did so. In line with that study [1], the most frequently used method for assessing the certainty of the evidence in URs was the GRADE approach. One reason could be that the GRADE approach is a well-established tool developed to determine the certainty of evidence based on several factors, namely risk of bias, imprecision, indirectness, inconsistency, and publication bias [147]. However, this tool was primarily designed for assessing the quality of the evidence from primary studies. Thus, further guidance is needed to ensure appropriate use and interpretation of the GRADE tool when it is applied to assess the quality of evidence of SRs instead of primary studies [1].
Furthermore, this study demonstrated that several methodological approaches for assessing the certainty of evidence were used in the URs. We found that the criteria for credibility assessment, which were released recently [10,148], were the second most frequently used method in URs of interventional studies. In contrast, nearly all of the URs of observational studies in our review utilized these epidemiological credibility assessment criteria. Our study likely differs from the previous studies [1,3] because we specifically considered URs that included MAs. The criteria for credibility assessment classify the certainty of the evidence according to several statistical criteria, which are usually reported in MAs. However, this approach uses arbitrary cut-off values, and the cut-point for each component of these criteria varied among previously published URs, reflecting the need for guidance. Although Aromataris et al., a methodology working group formed by the JBI (formerly named the URs Working Group), published guidance on how to conduct and report a UR [10], the methodology for the certainty assessment was not provided.
This study demonstrated that a higher number of URs with a certainty assessment were published in higher impact journals and that more recent URs tended to assess the certainty of the evidence. One reason could be that the assessment helps to reflect the certainty of the results and facilitates the translation of the evidence into guideline recommendations. Therefore, our findings highlight the importance of guidance for assessing the certainty of the evidence in URs, to recommend the most appropriate tools and to provide standards for those conducting URs.
This study also demonstrated that the majority of the included URs performed a methodological quality assessment. This was more frequent than in a previous study [1], which reported the assessment of methodological quality in only 37% of overviews of reviews. One reason could be that this process has been strongly recommended in the methodological guidance for producing URs [2] and has been implemented for longer than the certainty assessment. This process is essential to ensure that the methodological quality of the SR-MAs included in URs is adequately assessed and incorporated into the results and conclusions. In addition, we found that the most often used tool for methodological quality assessment changed from the Oxman and Guyatt Overview Quality Assessment Questionnaire (OQAQ) to AMSTAR. The AMSTAR tool has been recommended since 2007, and its revised version, AMSTAR 2, was released in 2016. Given that the revised tool was introduced recently, the methods advocated in published guidance have evolved over time, and the variation in tools used for methodological quality assessment reported in this study confirms the need for updated guidance for conducting URs. Furthermore, researchers conducting URs should incorporate the certainty of evidence and methodological quality assessments and report the results in their URs, which could in turn enable translation into guideline recommendations; otherwise, they should present valid reasons for not assessing them.
Our study has some limitations. First, the definition of included studies was restricted to URs. This might not cover other kinds of reviews, for example, overviews of reviews and reviews of reviews. Therefore, our findings with regard to the terminology used to describe "umbrella reviews" and the methods used might not be comprehensive or wholly representative. However, there is no universally accepted technical term for this type of review that summarizes or synthesizes findings from systematic reviews. The term UR has been used increasingly, and studies that describe the methodological approaches of URs are sparse to date. Second, our study focused on describing the methods used in previously published URs, and most of them did not provide reasons for their choice of method. Thus, we could not assess why each UR used a different approach for assessing the certainty of evidence and methodological quality. However, a major strength of our study is that it provides a broad picture of the certainty assessment methods used in URs of both interventional and observational studies. Clearly, authors of URs of observational studies prefer the criteria for credibility assessment, while for URs of experimental studies the GRADE tool is mostly favored. This highlights an unmet need for a suitable tool for URs of experimental studies. Nevertheless, reviews of the methods used for assessing the certainty of evidence and methodological quality in URs containing other study designs could be undertaken in future research.

Conclusions
This study revealed that only one-third of URs that included MAs of experimental designs assessed the certainty of the evidence, in contrast to the majority of URs of observational studies. While the most frequently used methodological approach for assessing the certainty of the evidence in the first group was the GRADE approach, the epidemiological credibility assessment tool was the dominant method in the latter. Therefore, guidance and standards are required to ensure the methodological rigor and consistency of certainty of evidence assessment for URs.

Availability of data and materials: The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations
Ethics approval and consent to participate: Not required. Consent for publication: Not applicable.
Competing interests: The authors declare that they have no competing interests.