Evaluating quality improvement at scale: design considerations for routine reporting to executive board of a healthcare organisation in the UK National Health Service



Abstract
Background Quality improvement (QI) in healthcare is a cultural transformation process that requires long-term commitment from the executive board, a critical theme in emerging accounts of QI success in the UK National Health Service (NHS). To help sustain long-term commitment from the executive board, an organisation-wide picture of QI applications and their impact needs to be made routinely visible.
Method We developed a retrospective evaluation drawing inputs from the resident QI team of a healthcare organisation and academic colleagues in the field of implementation and improvement science, as well as peer-reviewed and grey literature on what constitutes success for QI in healthcare.
Formative feedback on content relevance, acceptability, and feasibility issues was used to guide evaluation design. The evaluation was conducted as an online survey so that the data accrual process resembled routine reporting, to help surface implementation challenges. A purposive sample of QI projects was identified to maximise contrast between projects that were or were not successful as determined by the resident QI team. To hone strategic focus in what should be reported, we also compared factors that might affect project outcomes. To understand implementation issues, we reviewed data quality to surface challenges in the design and sustainability of routine reporting for the executive board.
Results Out of 52 QI projects, 10 led to a change in routine practice (henceforth referred to as adoption).
Details of project outcomes were limited. Project team outcomes, indicative of capacity building, were not systematically documented. Service user involvement, quality of measurement plan, fidelity and documentation of plan-do-study-act (PDSA) cycles had a major impact on adoption. The proximal impact of these process factors on adoption was consistently more apparent than the distal impact of input and contextual factors.
Conclusions Designing a routine reporting framework is an iterative process involving continual dialogue with frontline staff and improvement specialists to navigate data accrual demands. A retrospective evaluation, as in this study, can yield empirical insights for dialogue about the routine visibility of QI applications and their organisation-wide impact, thereby honing the implementation science of QI in a healthcare organisation.

Background

A growing number of health care provider organisations in the UK National Health Service (NHS) are adopting quality improvement (QI) strategies across their organisations [1]. The Care Quality Commission (CQC), an independent regulator of all health and social care services in England, noted a trend after recent inspections that NHS provider organisations with outstanding CQC ratings have applied QI at scale [2,3]. The systematic application of QI at scale in NHS provider organisations has required concerted investment to sustain a process of cultural transformation. To monitor the developmental stages of this process, an organisation-wide picture of QI applications and their impact is needed. This will help the organisation's executive board commit to a long-term perspective of investments made in QI infrastructure, a critical theme in emerging accounts of QI success in NHS provider organisations for mental health [1,4].
With prevailing pressures on NHS resources [5][6][7], exacerbated by the COVID-19 pandemic, sustaining long-term commitment of the executive board would require routine reports to increase visibility of the stage and progress in cultural transformation and QI success. This routine reporting needs systematic design to address a staff capacity building issue, particularly in developmental stages of embedding a culture of continuous improvement within an organisation. The level of technical skills needed for measurement, data collection and analysis is commonly underestimated in QI, and generally not sufficient in frontline NHS staff despite some training [8]. Furthermore, routinely collected data are often not as clean or well set up as originally anticipated, often requiring extensive effort to bring them up to a standard suitable for use [8]. This means that an organisation-wide picture of QI applications and their impact cannot be readily constructed by aggregating project reports of impact or success. Even case studies with proper design and adequate analysis would not offer an organisation-wide picture of cultural transformation and QI success.
There is a need to construct a routine reporting framework that can help inform the executive board and sustain long-term commitment to QI. In this study we carried out a pilot evaluation to develop the content of routine reporting. To hone strategic focus in what should be reported, we also compared factors that might affect project outcomes. To understand implementation issues, we reviewed data quality to surface challenges in the design and sustainability of routine reporting for the executive board.

Setting
An NHS organisation that provides specialist care for mental health in South London established a resident QI team in 2016 with a mandate to foster a continuous improvement culture. The resident team supports QI projects led by frontline staff through training and coaching on QI methodology (e.g., Model for Improvement, driver diagrams, Plan-Do-Study-Act (PDSA) cycles). In partnership with an academic institution, the NHS provider organisation also adopted a researcher-in-residence model that embeds an academic faculty member in the resident QI team to support data and evaluation needs. With a steady rise of QI activities within the organisation between 2016 and 2018, the executive board requested an evaluation of these nascent developments.

Design
To explore what should be included for an executive board report on organisation-wide QI programme, we developed a pilot retrospective evaluation such that the data accrual process resembled routine reporting.
Under these conditions, challenges in routine reporting might be made apparent to guide design considerations. A scoping exercise was first initiated by the researcher-in-residence to develop a proposal of evaluation content. This involved individual consultations with three improvement specialist colleagues in the resident QI team, coupled with a rapid evidence scan of peer-reviewed and grey literature on what constitutes success for QI in healthcare. Key concerns were raised about time resources, staff turnover, service user involvement, and outcomes such as whether project ideas were adopted and spread.
The literature also drew attention to the fidelity of PDSA applications and the implementation of data or measurement plans. A list of items was generated and circulated among the entire team of six improvement specialists in the resident QI team for formative feedback on content relevance, acceptability and feasibility issues. This feedback was combined with inputs from two academic colleagues with expertise in implementation and improvement science. A revised version was then piloted with two colleagues back in the resident QI team before it was finalised as an online survey (see supplementary material).

Project outcomes
To offer an overview of a diverse range of QI projects, we chose to look at whether projects: achieved their aims; led to a change in routine practice (adoption); and triggered similar projects beyond the site of their original conception (spread). These outcomes were core themes that are relevant for all projects despite heterogeneous and highly localised aims. Such a focus enables regular reporting at scale and acts as an early signal for areas that need attention.

Costs and benefits
To attribute costs and benefits of QI projects within the organisation, the evaluation captured resource use in terms of amount of contact with the resident QI team. Considerable variation in documentation precluded formal cost-benefit analysis in a retrospective evaluation, but a question was added on whether it was possible to quantify improvement in terms of cost savings and whether this had been attempted. QI is underpinned by an organisational management philosophy that recognises the critical need to empower frontline staff to learn and participate in continuous improvement in the face of escalating complexity and change [9]. On this basis, we also considered aspects of skill development and capacity building which may be attributed to each quality improvement project. Such a focus aligns with the philosophy of QI to move beyond performance assurance and cultivate an organisational culture of learning regardless of project outcomes. As a measure of skill development, we inquired about the extent and form of dissemination efforts. We also tracked capacity building by inquiring whether project team members went on to develop new QI projects.

Contextual, input and process factors
To explore what may facilitate or impede QI project success, we inquired about contextual, input and process factors. Contextual factors refer to organisational conditions that are not within the influence of project teams. Besides setting (in-patient / community care), the resident QI team drew particular attention to contextual aspects like whether it was the team's first ever QI project, and whether protected time for QI activities was officially sanctioned. Input factors comprised team characteristics and staff turnover. Process factors refer to actions or decisions of the project teams. They comprised stakeholder engagement, PDSA cycles and measurement plans implemented.

Sampling
With approximately 200 QI projects initiated between 2016-18, retrospective information retrieval was undertaken with a purposive sample for feasibility reasons. Each improvement specialist of the trust's resident QI team was requested to identify up to five QI projects that they considered as 'successful' and up to another five that they considered as 'unsuccessful'. Specifically, they were asked to rely on their own assessment of what did and did not work to maximise the gradient of contrast across the selection of QI projects. This in turn would also help surface insights on "work-as-done" rather than imposing a "work-as-imagined" criterion [10] in the absence of an established definition of 'success' in QI [11].

Analysis
We first enumerated project outcomes in terms of whether they achieved their aims, introduced change ideas that were adopted in routine practice (adoption), and triggered similar projects at other sites (spread). We then compared projects that did and did not lead to a change in routine practice to see if they differed in terms of contextual, input and process factors. To compare the associations between these factors and adoption, we compared effect sizes based on odds ratios (with confidence intervals from logistic regression). An odds ratio (OR) smaller than 1.5 was considered to be a small effect size, whereas OR > 5.0 was considered to be a large effect size [12].
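The effect-size logic described above can be sketched in a few lines of Python. This is a minimal illustration using the simpler Woolf (log) method on a 2x2 table rather than the logistic regression used in the study, and the counts below are invented for illustration, not the study's data.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """2x2 table: a/b = adopted/not adopted among exposed projects,
    c/d = adopted/not adopted among unexposed projects.
    Returns (OR, lower, upper) via the Woolf (log) method."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

def effect_size_label(or_):
    """Classify per the thresholds used in the text [12]."""
    if or_ < 1.5:
        return "small"
    if or_ > 5.0:
        return "large"
    return "moderate"

# Hypothetical counts: 8 of 12 projects with a given factor adopted,
# versus 2 of 40 without it.
or_, lo, hi = odds_ratio_ci(8, 4, 2, 38)
print(f"OR = {or_:.1f}, 95% CI: {lo:.1f} to {hi:.1f} ({effect_size_label(or_)})")
```

Note that the conservative reading used later in the paper applies `effect_size_label` to the lower bound of the CI rather than to the point estimate.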

Results
The study sample included 52 QI projects across five boroughs of London, UK served by the NHS provider organisation (Table 1). Thirty projects were conceived by community mental health teams and the remaining (n=22) by inpatient care teams. Of the three themes of Quality Priorities [13] (patient safety, clinical effectiveness, and patient experience), improving clinical effectiveness was the most common focus in project aims (29 / 52). A small handful focused on multiple priorities at once (6 community mental health and 7 inpatient care projects).

Care planning: 10 (5) community, 6 (3) inpatient
Developing electronic systems to improve care delivery: 3 (1) community, 2 (1) inpatient
Patient Experience
Reducing number of acute out-of-area treatments: 3 (2) community, 0 (2) inpatient
Carer's assessment and associated care plan: 1 (3) community, 0 (1) inpatient
Quality of environment and food: 1 (1) community, 1 (2) inpatient
* Numbers in parentheses refer to the number of projects that included multiple Quality Priorities in project aims

Project outcomes
In terms of project outcomes (Table 2), 18 of 52 projects (35%) reported a change in routine practice (adoption). However, only 10 of them reported formal project closure with aims achieved. Out of 7 of 52 projects (13%) that triggered similar projects in other sites (spread), only 3 reported formal project closure with aims achieved. A plausible explanation for these findings could be that some projects were organisation-wide initiatives that were adopted or spread across service sites regardless of the project outcome at specific sites. In light of this divergence between "work-as-imagined" and "work-as-done" [10], we decided to retain a more stringent definition by labelling projects as successful (n=10) only if they led to adoption after achieving their aims at formal closure. This offered a more interpretable benchmark for making comparisons with the remaining 42 projects.
Adopted = Change idea adopted, Spread = Triggered similar projects

Costs and benefits
Among the 10 successful projects, half required six or more months (median = 6.0) for completion. Those that were not successful after formal closure (n = 13) showed large variation: half were completed in under three months (median = 2.8), but some took up to 12 months (Figure 1). Among projects that did not reach formal closure, those that terminated at the Planning stage of PDSA (n = 18, median = 1.8 months) showed a shorter life span than those that terminated at more advanced stages (n = 11, median = 6.0 months). Meetings with the resident QI team staff took place typically on a monthly basis for successful projects (median = 1.1 meetings monthly), a slightly higher rate than for all others (median = 0.6 - 0.9 meetings monthly). Monthly correspondence between project and resident QI team staff (email/phone) showed a similar picture, with slightly higher activity levels in successful projects (median = 7.5 vs 3.8 - 4.8 monthly emails/phone calls).
Retrospective estimates were requested for the number of service users and staff who directly benefitted from the undertaken QI projects. This proved problematic as indications were available for only a handful of projects (8 / 10 successful projects vs 11 / 42 for all others). Similarly, when asked whether it was possible to quantify improvement in terms of cost savings, more than half of the projects reported "not known".
Among the 10 successful projects, seven disseminated publications of their findings (2 locally, 4 beyond local site/service, 1 not known). Among the remaining 42, five did so (2 locally, 2 beyond local site/service, 1 not known). For most projects, survey responses indicated "not known" in both respects. Two of the successful project teams went on to develop two new projects, whereas six in the latter group developed 10 new projects.

Contextual factors
We examined a range of factors that might be associated with QI project outcomes.
n1: total number of projects that satisfy the condition described by the independent variable
n2: number of projects (within n1) that led to a change in routine practice (adoption)
# independent variables for which odds of project outcome could not be calculated
+ independent variables that show a statistically significant odds ratio
++ independent variables for which conservative estimates (lower bound of 95% CI) show at least a moderate effect size (OR > 1.5, or in the opposite direction, OR < 0.7)

Process factors
The odds of adoption were higher if the project team engaged their team leader, stakeholders (e.g., staff members not in project team), and service users. Only service user engagement showed a statistically reliable impact, with a large effect size (OR = 7.4, 95% CI: 1.6 -34.9).
The odds of adoption increased moderately with the number of outcome measures attached to the aim statement of the driver diagram produced by the project team (OR = 3.7, 95% CI: 1.1 -9.8). This was also the case for the number of primary (OR = 2.7, 95% CI: 1.3 -5.9) and secondary drivers (OR = 1.5, 95% CI: 1.1 -1.9) in these diagrams.
Among 35 projects that quantified their target outcomes in the aims statement, 10 led to adoption. In the remaining 17 that did not quantify their target outcomes in the aims statement, none achieved adoption (consequently, we could not calculate an OR for comparing odds). The odds of adoption were much higher if measures were tagged to the primary and secondary drivers (OR = 7.5, 95% CI: 1.7 - 33.7 and OR = 6.0, 95% CI: 1.3 - 27.2 respectively). Projects that included balancing measures also had much higher odds of adoption (OR = 7.4, 95% CI: 1.6 - 34.9).
The odds of adoption increased slightly with the number of PDSAs carried out (OR = 1.5, 95% CI: 1.1 - 2.2). However, when the comparisons focused simply on whether projects did or did not report PDSA with more than one cycle, the odds were much higher (OR = 7.5, 95% CI: 1.7 - 33.7) if they did. The odds were even higher for projects that had PDSA documentation (OR = 85.5, 95% CI: 8.5 - 860.2). Despite the wide confidence interval, the lower bound interval estimate suggests that this latter aspect of PDSA had a major impact on project outcome.
Projects that collected data before implementing change ideas showed much higher odds of adoption (OR = 9.5, 95% CI: 1.9 - 47.6). This was also the case for projects that established the median value of random variation in outcome measures (OR = 5.0, 95% CI: 1.1 - 22.0). We could not calculate an OR for projects that established the median value of random variation in process and balancing measures, because all that did so achieved adoption. Projects that collected data after implementing a change idea also showed much higher odds of adoption (OR = 7.5, 95% CI: 1.7 - 33.7).
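The incomputable odds ratios above arise from a zero cell in the 2x2 table (e.g., no non-adopting projects among those with a given factor). One common workaround, not used in this study but worth noting for a routine reporting framework, is the Haldane-Anscombe continuity correction, which adds 0.5 to every cell. A sketch, with invented counts for illustration:

```python
import math

def odds_ratio_haldane(a, b, c, d, z=1.96):
    """Odds ratio with the Haldane-Anscombe correction: add 0.5
    to every cell of the 2x2 table so that zero counts no longer
    make the OR (or its log-based CI) incomputable."""
    a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    return (or_,
            math.exp(math.log(or_) - z * se),
            math.exp(math.log(or_) + z * se))

# Hypothetical: all 5 projects with the factor adopted (a zero cell),
# versus 5 of 47 without it.
or_, lo, hi = odds_ratio_haldane(5, 0, 5, 42)
```

The corrected estimate is conservative and its CI is wide, which is appropriate given how little information a zero cell carries.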

Discussion
This study developed a pilot evaluation to explore what should be reported to give an organisation-wide picture and sustain long-term commitments of the executive board in embedding a culture of continuous improvement. The evaluation yielded exploratory insights and lessons were learnt from the practical challenges that emerged. We reflect on these with the view of informing what is needed for a fuller development of routine reporting on cultural transformation and QI success in an NHS organisation providing specialist care in mental health.
We found that time invested in meetings, virtual correspondence, and overall project duration can be captured with minimal information retrieval burden. Across a diverse range of projects, time invested by staff is an opportunity cost that could serve as a common denominator for return on investment.
In contrast, we encountered major challenges in measuring benefits. Cost savings are often not an immediate focus of QI in mental healthcare. Consequently, they are usually not part of the measurement plan. Improvement in patient safety and experience needs a reporting framework that translates gains into cost savings. The onus of developing structured reports is likely to fall on the resident QI team to enable consistent measurement and coherent iterations. The level of skill development and capacity building in QI was not known for most projects in this retrospective evaluation. As with organisational benefits, monitoring project team outcomes (as opposed to project outcomes) is best carried out by the resident QI team.
A fundamental goal of QI is to bring about a change in routine practice [14]. Of the 52 QI projects led by frontline NHS staff, 10 achieved this goal. Among factors that reliably increased the odds of effecting a change in routine practice, service user involvement showed at least a moderate to large impact. This finding is consistent with the wider literature on QI in mental health [15]. It also highlights the need for more granular data on the level of involvement of this stakeholder group, to guide efforts in optimising their potential contributions [16].
With the Model for Improvement as the dominant QI activity paradigm, the driver diagram is a common tool for problem recognition and theory of change analysis. If measures were tagged to primary drivers and balancing measures in the driver diagram, the odds of effecting a change in routine practice increased by at least a moderate to large effect size. This was also the case if data were collected before and after implementing a change. Our findings concur with the existing improvement science literature. Measurement and the use of data are at the heart of the Model for Improvement [17]. A transparent, data-driven approach is paramount [18]; otherwise project teams may get stuck in the 'Do' phase of a PDSA, or reach no actionable insight in the 'Study' phase.
As structured interventional experiments for testing changes, iterative cycles in PDSA are key for learning [19]. Improvement work is less likely to succeed if iterative cycles are too few [20]. In our evaluation, we found that the odds of effecting a change in routine practice increased by at least a moderate to large effect size if projects reported PDSA that completed more than one cycle. PDSA documentation showed the strongest impact by far, even by conservative estimates. Documentation of each stage of the PDSA cycle supports scientific quality, learning and reflection; even if PDSA cycles are well-executed, poor documentation would impede organisational memory and transferability of learning [20,21]. Documentation is a critical part of fidelity in PDSA. It is not a simple task. Training should not overemphasise the conceptual simplicity of PDSA [19]. Achieving high fidelity in PDSA application as a QI method will require a gradual and negotiated process to explore different perspectives and encourage new ways of working [22]. Collecting routine data on PDSA fidelity can aid conversations in this effort. When PDSA cycles are performed and reported appropriately, knowledge is accumulated and can be shared readily [20]. Of note, QI is not synonymous with improving quality [1]. Even if there was no improvement, PDSA cycles with rigorous measurement plans would still generate learning. This too would be considered a QI success [19].

Study limitations
This study set out to explore an organisation-wide picture of QI projects and their impact. While we defined adoption as bringing about a change in routine practice, the process of retrospective information retrieval posed major challenges in obtaining a wider range of indicators of that impact. Furthermore, the absence of an established definition of 'success' in QI [11] meant that the resident QI team had to rely on their own judgement when selecting a purposive sample. Coupled with retrospective recall difficulties, sampling bias is likely to have an impact on study findings. Notably, however, our survey-based findings resonate with those from in-depth qualitative studies in terms of the process factors that matter for project outcomes. This suggests that the skilful application of QI can be routinely monitored as an indication of cultural transformation and QI success.
While several process factors showed an apparent impact on project outcomes, contextual and input factors generally showed little impact. This is likely to be due to limitations in the scope of our survey rather than evidence that these aspects are unimportant. It is well-established in the improvement science literature that contextual factors play a prominent role [23]. However, the sheer number of variables and the unpredictability of their interactions make it hard to predict the distal impact of contextual factors on project outcomes [24]. Consequently, the need to minimise data accrual burden led to a narrow scope of contextual and input factors. Instead, we prioritised process factors (e.g., fidelity of data and measurement plans) to illuminate proximal influences that are potentially amenable to staff training and interventions.

Conclusion
To advance with a long-term perspective of cultural transformation and QI success, the executive board needs an organisation-wide picture that can be routinely made visible with minimal technical expertise. Setting up a routine monitoring framework can offer more timely vigilance than designing an elaborate retrospective evaluation, which, some have argued, is akin to attempting to "drive by looking in the rear-view mirror" [25]. Practically, routine monitoring can also improve data quality by alleviating information retrieval difficulties and ensuring systematic data accrual. Operationally, respondent burden can be alleviated by spreading data collection over multiple occasions. Short and well-timed surveys can optimise relevance and thus the acceptability of data collection. Data can also be collected from different types of respondents (e.g., project team lead, sponsor, resident QI team). Well-targeted respondents will provide the most accurate data if only the most relevant questions are asked, thereby also minimising respondent burden.
Designing a routine reporting framework is an iterative process involving continual dialogue with frontline staff and improvement specialists to navigate data accrual demands [26]. Across diverse project aims, there is a need to identify themes that hold core relevance in QI. For an executive board, this framework should generate strategic insights that can inform resource commitments in QI. For staff and improvement specialists, routine reporting should offer a feedback loop to support discussion and engagement around finding the right focus [27] and applying QI with fidelity [19,28]. Developing routine reporting can be an asset for improving practice [29], and honing the implementation science of QI in healthcare.

Declarations
Ethics approval and Consent to participate

Availability of data and materials
The data that support the findings of this study are available from Dr. Kia-Chong Chua but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Dr. Barbara Grey.

Authors' contributions
The evaluation was designed by KC, NS and BG. The analysis was conducted by KC. The manuscript was drafted by KC with critical revisions from CH, BG, MH and NS. All authors read and approved the final manuscript.