After deduplication of identified records, we screened 17,240 titles and abstracts and 54 full-text articles. We identified 22 articles for inclusion, all of which underwent backward citation screening. The screening process is shown in the PRISMA flow diagram (Figure 2). The search strategy output and the reasons for inclusion/exclusion are available on OSF. Of note, many studies had multiple phases or participant groups. We included such studies only if we could clearly separate the methods and results for the relevant phase and/or group, and where possible we extracted information only from the eligible phase/group.
Characteristics of included studies
Our final sample included 22 full-text articles representing 20 unique studies: 16 qualitative studies, 4 RCTs, and 1 mixed-methods RCT and qualitative study (Tables 1 and 2), involving 908 total participants from a variety of stakeholder groups (Table 1). Many studies involved a multidisciplinary mix of participants such as researchers, health professionals, and policymakers [38–52], although some had homogeneous groups of clinicians [53–55] or decision-makers [56–58]. Most of the evidence syntheses were systematic reviews, but one study related specifically to network meta-analyses (NMA), one to diagnostic test accuracy (DTA) reviews, and one to updating reviews. Seven studies involved an international mix of participants [44,47,51,52,56,58,59]; five were from Canada [40,43,54,55,57], three from the United States of America [41,42,45,48,49], two from Croatia [39,50], two from England [38,53], and one from Kenya. Most were funded by national agencies [39,40,42,43,45,48–50,54–57] such as the Canadian Institutes of Health Research [40,43,54,55] or the Agency for Healthcare Research and Quality [42,45,48,49,57].
The TIDieR checklist was used to gather intervention data, detailed in Tables 1, 2 and 3. The majority of included qualitative studies conducted either focus groups [39,40,45,54,55] or one-on-one semi-structured interviews [38,41–43,46–49,53,56–60] (Table 1). RCTs were conducted either with an online survey [50,51] or through in-person workshops [46,52] (Tables 2 and 3). A wide variety of summary formats were tested, including de novo summary prototypes [40,43,45,46,53–55,57,58], Grading of Recommendations Assessment, Development and Evaluation (GRADE) Summary of Findings (SoF) evidence tables [44,46,47,56,59], MAGICapp [48,49], Tableau [48,49], evidence flowers, plain language summaries, and infographics. Summary formats covered a wide variety of clinical topics (Tables 1 and 2).
We found the quality of reporting for the qualitative studies to be quite poor. The main weaknesses across these studies included not providing information on philosophical perspectives (11/17) [38–43,45,46,48,49,55,57,58], not locating the researcher culturally or theoretically (15/17) [38,39,43–47,49,53–59], and not addressing the influence of the researcher on the research (15/17) [38,39,41–49,54–59]. Several interview or focus group studies also did not provide clear direct quotes from participants (6/17) [40,43,45,48,49,55,60]. In contrast, the four quantitative studies were mostly reported clearly with low risk of bias [46,50–52]. The main weaknesses related to descriptions of the blinding of treatment assignment for outcome assessors and those delivering treatment (2/4) [46,52]. Completed JBI critical appraisal checklists can be found in Appendix 3.
The summary formats tested in the five included RCTs (reported across four papers) are described in detail in Table 3. Four RCTs compared alternative versions of SoF tables against a format in current practice and/or a standard systematic review [46,51,52]. One study compared an infographic to a plain language summary (PLS) and scientific abstract (SA). Studies were largely multidisciplinary, and results were not presented by stakeholder group. An exception was the study by Buljan et al. 2018, which conducted separate trials with patient representatives (‘consumers’) and doctors. There were no differences between the groups in knowledge scores for either the PLS or the infographic format. However, patient representatives reported lower satisfaction (user-friendliness) and poorer reading experience with both formats compared to doctors. As the quantitative studies used a variety of scales and summary formats, we could only summarise results narratively.
In preparation for the mixed methods synthesis, we identified 74 individual findings from the quantitative studies (Appendix 4) and synthesised these into four main areas relating to the review outcomes of Knowledge/Understanding, Satisfaction/Reading Experience, Accessibility/Ease of Use, and Preference (Figure 1). These individual findings helped identify areas of convergence, inconsistency, or contradiction with the qualitative findings and recommendations described later.
Knowledge or Understanding
All five RCTs assessed knowledge or understanding as an outcome (Table 4). No studies employed standardised measures, instead using study-specific questions. Two articles, reporting the results of three studies, found that the new format improved knowledge or understanding [51,52]. Carrasco-Labra et al. reported that, compared to a standard SoF table, a new format of SoF table with seven alternative items improved understanding. Of the seven items testing understanding, three showed similar results, two showed small differences favouring the new format, and two (understanding risk difference and the quality of the evidence associated with a treatment effect) showed large differences favouring the new format [63% (95% CI: 55, 71) and 62% (95% CI: 52, 71) more correct answers, respectively]. In two small RCTs, Rosenbaum et al. found that the inclusion of an SoF table in a review improved understanding and rapid retrieval of key findings compared to reviews with no SoF table. In the second RCT, there were large differences in the proportion that correctly answered questions about risk in the control group (44% vs. 93%, P=0.003) and risk in the intervention group (11% vs. 87%, P<0.001). Two studies reported no significant differences between formats in knowledge or understanding [46,50].
Ease of use/Accessibility
All five RCTs provided some assessment of ease of use and accessibility, measured in a variety of ways (Table 4). Buljan et al. reported that user-friendliness was higher for an infographic compared to a PLS for both doctors and patient representatives [patients’ median infographic score: 30.0 (95% CI: 25.5–34.5) vs. PLS: 21.0 (19.0–25.0); doctors’ median infographic score: 36.0 (30.9–40.0) vs. PLS: 29.0 (26.8–36.2)], while Carrasco-Labra et al. reported that in six out of seven domains, participants rated information in the alternative SoF table as more accessible overall (MD 0.3, SE 0.11, P=0.001). Opiyo et al.’s graded-entry SoF formats were associated with a higher mean composite score for clarity and accessibility of information about the quality of evidence (adjusted mean difference 0.52, 95% CI 0.06 to 0.99). In two small RCTs, Rosenbaum et al. found that participants given the SoF format were more likely to respond that the main findings were accessible. The second RCT demonstrated that, in general, participants with the SoF format spent less time finding answers to key questions than those without.
Satisfaction/Reading Experience
Two studies assessed satisfaction (Table 4). Buljan et al. reported that both patients and doctors rated an infographic better for reading experience than a PLS, even though it did not improve knowledge [patients’ median infographic score: 33.0 (95% CI: 28.0–36.0) vs. PLS: 22.5 (19.0–27.4); doctors’ median infographic score: 37.0 (26.8–41.3) vs. PLS: 24.0 (21.3–27.2)]. Carrasco-Labra et al. reported that participants were more satisfied with the new format of SoF tables (the largest proportion of participants favoured the alternative SoF table for 5 of 6 questions).
Preference
Two studies assessed user preference (Table 4). Carrasco-Labra et al. reported that participants consistently preferred the new format of SoF tables (MD 2.8, SD 1.6). Similarly, Rosenbaum et al. reported that overall participants preferred the alternative (or new) format of SoF tables compared to the current formats (MD 2.8, SD 1.6).
From the 16 qualitative studies and 1 RCT with a supplemental qualitative component, line-by-line coding identified 542 equivocal and unequivocal findings within the results sections of the articles; no unsupported findings were identified (Figure 1). We then synthesised these 542 initial findings into 393 findings across 6 categories, defined as follows:
- Presenting information (comments on the content, structure, and style of the summary format);
- Tailoring information for end users (inherently linked to the presentation of information but more focused on accommodating end users’ different learning styles, backgrounds, and needs and appropriately tailoring content);
- Contextualising findings (properly framing the findings themselves within the relevant context by providing information such as setting, cost constraints, and ability to implement findings);
- Trust in the summary and its producers (end users’ perceptions of credibility markers of the work as a whole, such as transparency, funding sources, and clear references, i.e., that the work was rigorously done by qualified individuals);
- Quality of evidence (focused on the assessment of study quality and the totality of the evidence including how assessments were reached and information about rating); and
- Knowledge required to understand findings (educational information that should be added to summaries due to comprehension difficulties or gaps in end users’ knowledge base).
These 393 synthesised findings were then reviewed again by two authors (MKS and BC) to produce 130 recommendations for practice which, where possible, are presented according to the targeted GDG members. Several recommendations also refer to particular scenarios or types of evidence syntheses such as NMA (n = 22), DTA reviews (n = 2), and updating reviews (n = 8). As previously mentioned, most studies contained diverse multidisciplinary participants. When quotes from participants were reported, they were often not attributed to a specific stakeholder, and several studies included no direct quotes from participants at all. However, where possible, recommendations are presented according to group membership. The 130 recommendations from the qualitative synthesis are available in Figures 3, 4, and 5; citations for the recommendations can be found in Appendix 5.
A majority of recommendations related to presenting information (n = 68) or tailoring the information for the end user (n = 24). For example, items under the ‘presenting information’ category include ‘use bullet points’, ‘flag important information by bolding/highlighting’, use ‘greyscale-friendly colours’, and ‘avoid abbreviations.’ ‘Tailoring information’ included guidance on how to create bespoke, customised documents with ‘easily extractable information to forward to colleagues’ and the importance of ‘clarifying the audience’ that the report is for and about.
Items regarding the presentation of numerical and statistical findings were identified across several themes. For example, under ‘presenting information’, it was suggested to ‘use absolute numbers, not probabilities’ and to ‘decrease numeric/statistical data’, whereas the ‘contextualising findings’ category suggested ‘interpretation aids for statistics’ and noted that policy/decision makers are ‘not interested in methodology.’ The ‘knowledge required’ category highlighted the lack of awareness of abbreviations, recommending to ‘avoid abbreviations (e.g., RR for relative risk, CI for confidence intervals)’ altogether. Some of these items are intrinsically linked: the ‘knowledge required’ recommendations highlighted that for readers, certain items like ‘forest plots are difficult to understand’, so providing ‘interpretation of statistical results’ and ‘defining statistical terms’ can be helpful.
Mixed methods synthesis
The four outcome areas for the quantitative evidence (e.g., knowledge, satisfaction) were also covered by the qualitative evidence. However, due to the large heterogeneity in stakeholders, formats, and assessment methods, it was difficult to determine whether the qualitative evidence helped explain differences in the size or direction of effects in the quantitative studies.
From the 74 individual quantitative findings (Appendix 4), we identified 17 that converged with at least one of the 130 qualitative recommendations (Appendix 5). Some of these 17 items supported the same recommendation (e.g., several findings supported the use of summary of findings tables), so in total the 17 quantitative findings supported 9 qualitative recommendations. Some of these recommendations are inherently linked, as SoF tables (item 4) typically use the GRADE rating scale (item 8); similarly, the items about assessments of quality (items 7 and 9) are likely to refer to GRADE as well. The 9 recommendations with mixed-methods support are marked with an asterisk in Figures 3, 4, and 5 and include providing a clear summary report that:
1. is structured,
2. is brief,
3. provides information on the standard steps and nature of the review,
4. presents results in summary of findings (SoF) tables,
5. defines statistical terms,
6. provides interpretations of statistical results,
7. includes assessments of quality,
8. describes the rating scale (GRADE), and
9. describes how authors arrived at their assessments of quality.
Throughout our recommendations, there are items which may appear at face value to be contradictory. However, they simply accommodate different learning styles (e.g., ‘use summary of findings tables’ and ‘use narrative summaries’) and are thus considered complementary. Relatedly, some items expressed by different groups echoed end users’ differing needs. For example, the ‘Abstract Methods Results and Discussion (AMRaD) format’ was advocated by clinicians, whereas ‘avoid academic formatting’ was expressed by policy/decision makers. Additionally, some items are similar but were expressed for very different purposes: for example, ‘including author’s names’ appears in both the ‘presenting information’ and ‘trust in producers and summary’ themes, as some participants flagged this as a clear indicator of their trust in the quality of the work, whereas others simply wanted the information for general factual transparency purposes (Appendix 5; Figures 3, 4, and 5).