Slight discrepancies in recommendations between CPGs are commonplace across many clinical specialities and topics29-31, owing to differences in the interpretation of evidence. However, if the underlying evidence used in CPG production is the same as in the Cochrane reviews, then it is necessary to understand why recommendations differ considerably. This exploratory study found systematic discordance in a considerable majority of statements (85.5%), either in the recommendation itself (4/62) or in the strength of recommendation (49/62) given in the CPG compared with those derived from the Cochrane reviews. The discordance between the CPG recommendation and the Cochrane review was therefore mainly related to the strength of recommendation, which was consistently overstated by the CPG developers.
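As a check on the arithmetic, and assuming that the two categories of discordance are mutually exclusive (each statement counted once), the overall proportion follows from the two counts reported above:

```latex
\[
\frac{4 + 49}{62} \;=\; \frac{53}{62} \;\approx\; 0.855 \;=\; 85.5\%
\]
```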
This is an exploratory study, and we focused only on guideline statements related to recent Cochrane reviews because we were familiar with the evidence available on these topics. These Cochrane reviews were peer reviewed by experts and were considered to be at low risk of bias according to ROBIS, a tool for assessing the risk of bias in systematic reviews32. We limited inclusion to the major organisations that develop guidelines related to liver disease, as we felt that these guidelines were the most likely to be in clinical use at present. Nevertheless, we know that there are many other guideline-producing parties (e.g. the BAVENO meetings), and we do not know whether, and to what extent, their recommendations are concordant with the evidence.
In this study, we observed that even when the same organisation used the same system to develop recommendations, the level of discordance between guideline authors and Cochrane reviews differed between guidelines. For example, BSG 2019 and BSG 2020 used the same system for grading recommendations, yet their discordance proportions were different. This might be because of differences in the number of guideline statements that were eligible for the exploratory study, or because of differences in guideline panel membership.
Additionally, there were several instances where study authors DR and KG noted a mismatch between the language used by the guideline developers and the assigned strength of recommendation. For some guideline statements, there was also disagreement as to whether the recommendation was strong or weak. For example, our review panel had different views on whether the statement “Current evidence on the safety of subcutaneous automated low-flow pump implantation for refractory ascites shows there are serious but well-recognised safety concerns, including device failure and acute kidney injury. Evidence on efficacy is limited in quantity. Therefore, this procedure should only be used with special arrangements for clinical governance, consent, and audit or research” indicates a strong or a weak recommendation by the guideline authors. We considered the recommendation to be weak for the purpose of calculating agreement, as the guideline authors acknowledged the uncertainty in the evidence. In other instances, the guideline authors used wording associated with a strong recommendation in their CPG statement, although they rated the recommendation as weak. For example, the wording of “Patients who have recovered from an episode of spontaneous bacterial peritonitis (SBP) should be considered for treatment with norfloxacin (400 mg once daily), ciprofloxacin (500 mg once daily, orally) or cotrimoxazole (800 mg sulfamethoxazole and 160 mg trimethoprim daily, orally) to prevent further episode of SBP” suggests a strong recommendation according to GRADE guidance, but the guideline authors indicated that this is a weak recommendation. In our view, the appropriate wording for a weak recommendation would have been “Patients who have recovered from an episode of SBP might be considered for treatment with norfloxacin…”.
There may be several potential reasons for discordance between the CPG statements and our Cochrane reviews. One possible reason is a difference in search dates. Only two guidelines, AASLD 2021 and BSG 2020, had potential search dates later than the Cochrane review21,24. In these two guidelines, none of the discordances was due to additional RCTs published since the search date of the Cochrane reviews on the topics covered by the relevant statements. It is essential for systematic reviews, as well as clinical guidelines and thus recommendations, to be regularly updated to incorporate new evidence. We note that the NICE 2020 surveillance report indicates that NICE guideline NG50 is currently being updated for this reason28. We do not know when the remaining guidelines will be updated or whether the discordance would be less after these updates.
Another reason for discordance could be the type and quality of the evidence considered in the review process that formed the basis for the guideline recommendations. For example, the Cochrane reviews in our sample included only RCTs that reported results as the source of evidence underlying the conclusions. However, we noted that most of the clinical guidelines also included non-randomised studies, such as retrospective cohort studies, case series, and other observational studies, or consensus opinions from expert panels (which would have included personal clinical experience). Although RCTs are regarded as the primary research study type at the top of the hierarchy of evidence1,2, the rigour of the design of an RCT is just as important as the study type: a poorly conducted RCT cannot be considered better than a well-conducted cohort study. The Cochrane reviews contained an assessment of the risk of bias of each included RCT and took these assessments into consideration when drawing conclusions. The CPGs did not report such a systematic assessment of the studies supporting each guideline statement. Agrawal and colleagues33 investigated the robustness to random errors of the RCTs included in the previous AASLD 2012 clinical guideline, and found that 62% of the RCTs were fragile (i.e. only a few ‘non-events’ needed to be changed to ‘events’ to result in a change in the recommendations). The Cochrane reviews that we used for this study applied the GRADE system, which includes consideration of random errors in determining uncertainty. Therefore, it is possible that the strength of the recommendations in the CPGs was influenced by non-randomised studies or by small RCTs with relatively small treatment effects, which would have resulted in greater uncertainty in the evidence had GRADE guidance been followed.
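The notion of fragility described above can be made concrete with a short calculation. The following is only a minimal sketch of the commonly described fragility index for a two-arm trial with a binary outcome, not the method of Agrawal and colleagues, which is not reproduced here. It makes several labelled assumptions: significance is judged with a two-sided Fisher's exact test at a threshold of 0.05, events are added to the arm with the lower event rate (this usually coincides with the arm with fewer events when the arms are of similar size), and the function names and example numbers are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are trial arms; columns are events and non-events.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in arm 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Number of non-events that must become events to lose significance.

    Returns 0 when the trial result is not significant to begin with.
    Assumption: events are added to the arm with the lower event rate.
    """
    # Order the arms so that arm 1 has the lower event rate.
    if events_1 * n_2 > events_2 * n_1:
        events_1, n_1, events_2, n_2 = events_2, n_2, events_1, n_1
    flips = 0
    # The bound on events_1 is only a guard against an endless loop.
    while events_1 < n_1 and fisher_exact_two_sided(
        events_1, n_1 - events_1, events_2, n_2 - events_2
    ) < alpha:
        events_1 += 1
        flips += 1
    return flips


if __name__ == "__main__":
    # Hypothetical trial: 0/20 events in one arm versus 10/20 in the other.
    print(fragility_index(0, 20, 10, 20))
```

A small index means that reclassifying only a handful of participants would remove statistical significance, which is the sense in which a trial is called fragile.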
It is important to recognise that there are situations where strong recommendations can be made even when there is low certainty evidence. These are clearly outlined in Table 6.3 of the GRADE Handbook17, for example when low quality evidence suggests benefit in a life-threatening situation, or when high-quality evidence suggests modest benefits and low quality evidence suggests the possibility of infrequent but catastrophic harms. However, if such a scenario exists, it should be highlighted in the guideline. Through independent assessment and arbitration, we did not find that any of these exception criteria applied to the statements included in this study in a way that would have warranted an increase in the strength of recommendation.
Another possible reason for the discordance could be personal conflicts of interest (COI)34,35. As may be expected, Nejstgaard and colleagues have reported that financial COI are associated with favourable recommendations in clinical guidelines, advisory committee reports, opinion pieces, and narrative reviews35. Although more difficult to characterise and evaluate, non-financial conflicts of interest are also likely to introduce bias in CPG recommendations, e.g. a recommendation for a specific named drug or formulation rather than a generic group, without evidence to justify the choice.
The systematic overrating of the strength of recommendations in CPGs highlighted by this work has major implications for healthcare: the strength of recommendation is often the basis upon which clinicians pursue particular treatment choices and advise patients on them, and it therefore has a substantial impact on shared clinical decision-making. Although treatment decisions are not bound or dictated by clinical guidelines, healthcare professionals may fear medical negligence lawsuits if they decide to practise outside the recommendations contained within clinical guidelines36, and they may not have the time or opportunity to investigate the quality of evidence behind the guideline recommendations. Furthermore, one would expect the use of resources associated with strong recommendation statements in CPGs to increase as healthcare professionals implement these recommendations more frequently in practice. The reverse is true for research: further research may be limited in areas with strong recommendations if it is perceived that there is a lack of clinical equipoise or that no uncertainty needs to be addressed37. Therefore, in the endeavour of evidence-based health care, with best patient treatment and care as the fundamental cornerstones, it is imperative that clinical practice guidelines are not flawed by systematic overrating of the strength of recommendation or by poor rigour in guideline development38.
Our findings corroborate those of several previous studies, which point to discordance between the strength of recommendations and the evidence supporting them for a high proportion of guideline statements15,16,39. This is similar to the worst form of spin found in systematic reviews, i.e. recommendations for clinical practice that are not supported by the study findings40.
The first and major weakness of our study is that there was no published protocol or statistical analysis plan. Therefore, this study should be considered only exploratory. Second, we did not assess whether the disagreements could be explained by the methodological quality of the clinical practice guidelines, for example, as assessed with AGREE II41. Third, as we are authors or editors of the Cochrane Hepato-Biliary Group reviews, we cannot exclude the possibility that we are biased towards the conclusions of the reviews. However, we have reported our findings transparently, and they are open to international scrutiny and challenge.
In conclusion, our findings suggest that there may be a systematic issue with guideline authors overstating the strength of recommendations in clinical practice guidelines on advanced liver cirrhosis. This could not be explained by the search dates or by scenarios that warrant strong recommendations despite low certainty evidence. Therefore, we propose that future work should be conducted to understand why evidence is interpreted differently despite the existence of an accepted methodology for developing clinical practice guidelines. Further research is also needed to understand the clinician and patient perspective on guidelines that provide definitive guidance even when the quality of evidence is low. A definitive study should have a protocol and statistical analysis plan registered before the start of the study, and should investigate whether AGREE II domains can explain the discordance between evidence and CPG recommendations. Research is also needed on whether automated checks of the wording of a guideline statement against its assigned strength of recommendation can decrease the mismatch between the two.
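To illustrate what an automated check of wording against the assigned strength of recommendation might look like, the following is a minimal sketch only. The cue lists are hypothetical and chosen for illustration: they follow the reading used in this study, in which “should” signals a strong recommendation and “might” a weak one, and they are not a validated lexicon. The function names are likewise assumptions, and any real tool would need to be evaluated against expert judgement.

```python
import re

# Hypothetical cue lists, for illustration only (not a validated lexicon).
STRONG_CUES = ("should", "must", "we recommend", "is recommended")
WEAK_CUES = ("might", "may", "could", "we suggest", "is suggested")


def _contains_any(text, cues):
    """True if any cue occurs in the text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(cue) + r"\b", text) for cue in cues)


def implied_strength(statement):
    """Return 'strong', 'weak', or None when the wording is not decisive."""
    text = statement.lower()
    strong = _contains_any(text, STRONG_CUES)
    weak = _contains_any(text, WEAK_CUES)
    if strong and not weak:
        return "strong"
    if weak and not strong:
        return "weak"
    return None  # no cue found, or conflicting cues


def wording_mismatch(statement, assigned_strength):
    """Flag a statement whose wording implies a different strength."""
    implied = implied_strength(statement)
    return implied is not None and implied != assigned_strength


if __name__ == "__main__":
    statement = (
        "Patients who have recovered from an episode of SBP "
        "should be considered for treatment with norfloxacin"
    )
    # The guideline authors rated this recommendation as weak.
    print(wording_mismatch(statement, "weak"))
```

Statements with no cue words or with conflicting cues are deliberately left unflagged in this sketch, since those are the cases in which our review panel also disagreed and human judgement is needed.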