Multi-method appraisal of clinical quality indicators for the Emergency Medical Services in the Low- and Middle-Income Setting: The South African Perspective

Background: Quality Indicator (QI) appraisal protocols are a novel methodology that combines multiple appraisal methods in order to comprehensively assess the "appropriateness" of QIs for a particular healthcare setting. However, they remain inadequately explored compared to the single appraisal method approach. The aim of this paper was to describe and test a QI appraisal protocol against the single method approach, using a series of QIs potentially relevant to the South African Prehospital Emergency Care setting. Methods: An appraisal protocol was developed consisting of two categorical-based appraisal methods, the Qualify tool and the Rand/UCLA Appropriateness Method, combined with qualitative analysis of the discussion generated during the consensus application of each method by a QI Appraisal Working Group. Inter-rater reliability of each individual method was assessed prior to group consensus rating. Variation in the number of non-valid QIs identified by each method, and the proportion of non-valid QIs identified by each method and by the protocol, were compared and assessed. Results: There was mixed inter-rater reliability of the individual methods prior to the group consensus. There was similarly poor to moderate correlation of the results obtained between the individual methods (Spearman's rank correlation = 0.42, p<0.001). From a series of 104 QIs, 11 were identified that were shared between the appraisal methods. A further 19 QIs were identified but not shared by each method, highlighting the benefits of a multi-method approach. There was little evidence to support a difference in the proportion of non-valid QIs identified between the individual methods (difference = -0.03); between the Qualify tool and the protocol (difference = -0.05); or between the Rand method and the protocol (difference = -0.02).
The outcomes were additionally evident in the group discussion analysis, which in and of itself added further input towards understanding and appraising the appropriateness of the QIs that would not have otherwise been captured or understood by the individual methods alone. Conclusion: The utilisation of a multi-method appraisal protocol offers multiple benefits when compared to the single appraisal approach, and can provide confidence that the outcomes of the appraisal will ensure a strong foundation on which the measurement framework can be successfully implemented and employed.

For part 2, the Rand/UCLA Appropriateness Method was used to further rate the indicators by testing the definitions, data components and criteria for use developed for each QI against several clinical vignettes. Four categories (Clarity, Necessity, Acceptability and Technical Feasibility) were rated using a 9-point visual analogue scale, and data extraction was assessed using a mock-up of a generic patient report form for the clinical vignettes9,13. Two separate vignettes were developed for each of the QI categories included in the data extraction, and a "low quality documentation" and a "high quality documentation" version were developed for each vignette for use during the assessment. The Rand method has previously been utilised to assess QIs and was additionally included based on its practical focus (i.e., the data extraction)14-17. Both methods included an evidence evaluation component as part of the appraisal process. To achieve this, the QIs were assessed for inclusion within local clinical practice guidelines (CPGs), and against the results of a literature review of the evidence base utilised for the development of PEC-focused QIs.
The results of the review were assessed and presented using the Oxford Centre for Evidence-based Medicine Levels of Evidence21.
Data for parts 1 and 2 were collected over three rounds of group discussion of a QI Appraisal Working Group, facilitated by the principal investigator (IH). An initial introductory round was conducted to familiarise the Working Group with the appraisal tool, the Rand methodology, and the results of the literature review, and to provide the data dictionary for the QI set. Prior to Round 1, the appraisal tool was independently applied by each member of the Working Group, who then met to discuss their individual scoring and apply a final consensus summary score for Round 1. Prior to Round 2, the Working Group similarly independently assessed the results of the literature review, and then met to apply a final consensus rating of the evidence. Round 2 was further utilised to introduce the clinical vignettes for each category which would be utilised for the data extraction. For Round 3, the Working Group met to compare their individual data extraction results and rate the QIs against the categories of the Rand method. The Working Group meetings were recorded and later transcribed for the final part of data collection: content analysis of the discussion generated during the consensus appraisal process for Rounds 1 to 3.

Setting and Population
Traditionally, quality in the PEC setting in SA has been reported exclusively against response-time targets22-25. Utilisation and reporting of clinically focused QIs by the Emergency Medical Services (EMS) in SA is wholly lacking. To address this, several clinically focused PEC QIs have recently been identified as potentially relevant to the SA PEC setting26. As a result, these QIs were used to test the appraisal protocol, with the secondary aim of identifying those QIs appropriate for use in the SA PEC setting.
The QI Appraisal Working Group consisted of nine preselected experts chosen for their intricate knowledge of the South African PEC setting and to align with minimum panel size recommendations for each methodology10,13. All participants were South African trained, postgraduate-educated Emergency Care Practitioners (ECPs), each with more than 10 years' operational experience. Six of the participants' primary experience and occupation were in quality governance and improvement within PEC in general, and the remaining three were primarily involved in clinical operations. The Working Group were given one month between rounds in which to work through the information and data collection required for each subsequent round.

Data Analysis
Descriptive statistics were utilised to describe and summarise the categorical appraisal data. For the appraisal tool, mean scores per category, and the number of criteria scoring either 3 (Rather applies) or 4 (Applies), were calculated and presented for each QI. For the Rand method rating, consensus scores per category, and the proportion of categories scoring 7 or more, were calculated and presented for each QI. Despite the face-to-face consensus process, inter-rater reliability for each criterion of both the appraisal tool and the Rand method was calculated using percentage agreement and Gwet's AC1 and presented for reporting purposes.
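To illustrate the agreement statistics used here, the following is a minimal Python sketch of percentage agreement and Gwet's AC1 for the simplified case of two raters on a categorical scale; the multi-rater form used for a nine-member panel generalises this, and the rating values shown are illustrative, not study data.

```python
from collections import Counter

def gwets_ac1(ratings_a, ratings_b, categories):
    """Percentage agreement and Gwet's AC1 for two raters on a
    categorical scale. pi_k is the mean proportion of all ratings
    falling in category k; chance agreement is
    Pe = (1 / (K - 1)) * sum_k pi_k * (1 - pi_k)."""
    n = len(ratings_a)
    # observed percentage agreement
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # pooled category proportions across both raters
    counts = Counter(ratings_a) + Counter(ratings_b)
    pe = sum((counts[k] / (2 * n)) * (1 - counts[k] / (2 * n))
             for k in categories) / (len(categories) - 1)
    return pa, (pa - pe) / (1 - pe)

# Two raters scoring four QIs on the 4-point appraisal scale
pa, ac1 = gwets_ac1([3, 4, 3, 2], [3, 4, 2, 2], categories=[1, 2, 3, 4])
print(pa, round(ac1, 2))  # 0.75 0.68
```

Unlike Cohen's kappa, AC1 remains stable when ratings cluster heavily in one category, which is why it is often preferred alongside raw percentage agreement.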
A final composite score was calculated for each QI, for each method. For the appraisal tool, this was calculated using a weighted mean of the appraisal categories after consensus, due to the differences in the number of criteria per category. To be considered a valid indicator, a QI had to score ≥ 3 on the final composite score. For the Rand method, the unweighted mean of the appraisal categories after consensus was used. To be considered a valid indicator, a QI had to score ≥ 7 on the final composite score. A second group of QIs was identified, consisting of those scoring on the validity threshold (3.0-3.1 for the Qualify tool; 7.0-7.1 for the Rand method), for which caution was recommended prior to full implementation.
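The composite-score logic above can be sketched as follows. The category names, criterion scores, and dictionary structure are illustrative assumptions; the thresholds (valid at ≥ 3 with a 3.0-3.1 caution band for the Qualify tool, and ≥ 7 with a 7.0-7.1 band for the Rand method) follow the text.

```python
def composite_score(category_scores):
    """Mean of category means weighted by the number of criteria per
    category (arithmetically equivalent to a plain mean over all
    criterion scores). category_scores: category -> criterion scores."""
    weighted = sum(len(s) * (sum(s) / len(s)) for s in category_scores.values())
    n_criteria = sum(len(s) for s in category_scores.values())
    return weighted / n_criteria

def classify(score, valid_cutoff, caution_band):
    """Apply the validity cutoff and the on-threshold caution band."""
    if score < valid_cutoff:
        return "non-valid"
    lo, hi = caution_band
    if lo <= score <= hi:
        return "caution"
    return "valid"

# Hypothetical Qualify-tool scores for one QI (illustrative only)
qi = {"Relevance": [4, 4],
      "Scientific soundness": [3, 3, 3],
      "Feasibility": [2, 2, 2, 2]}
score = composite_score(qi)              # (8 + 9 + 8) / 9 ≈ 2.78
print(classify(score, 3.0, (3.0, 3.1)))  # non-valid
```

A Rand-method score would use the same `classify` call with `valid_cutoff=7.0` and `caution_band=(7.0, 7.1)`.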
Correlation between the final composite scores of each method for each QI was calculated and presented using Spearman's rank correlation. The consensus-derived proportions of non-valid QIs, and of QIs for which caution was recommended, identified by each individual method and by the protocol, were compared using the z test. 95% confidence intervals were calculated where necessary, and a p-value of 0.05 was used as a cutoff for strength of evidence. All data were entered and analysed using a combination of Microsoft Excel 2010 (Microsoft Corp., Redmond, WA, USA) and Stata version 16 (StataCorp LLC, College Station, TX, USA).
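The proportion comparison can be sketched as a pooled two-proportion z test with a normal-approximation two-sided p value (a standard form of the test; the paper does not specify its exact variant). Using the counts reported in the Results (8 of 104 non-valid QIs by the Qualify tool versus 11 of 104 by the Rand method) reproduces the reported difference of -0.03 and p = 0.47.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z test for a difference in proportions,
    with a normal-approximation two-sided p value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p via the standard normal CDF (math.erf)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p1 - p2, z, p

# Non-valid QIs: 8/104 (Qualify tool) vs 11/104 (Rand method)
diff, z, p = two_proportion_z(8, 104, 11, 104)
print(round(diff, 2), round(p, 2))  # -0.03 0.47
```

With counts this small the normal approximation is rough, which is consistent with the paper's cautious "strength of evidence" framing rather than strict significance testing.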
Conventional content analysis, as described by Hsieh and Shannon, was utilised to sort and analyse the group discussions generated during the three rounds27. Recordings and transcripts were created for each round, and each transcript was reread for content familiarisation. First-level coding was conducted through the extraction of meaning units from each transcript, summarised into codes using open coding. Once completed, similar codes were combined and organised to develop clustered sub-categories pertaining to each appraisal tool. Transcriptions were analysed using MAXQDA software for data storage, extraction of meaning units, and sub-category development (MAXQDA, 2016; Sozialforschung GmbH, Berlin, Germany).

Results
The Working Group appraised a total of 90 clinical and 14 non-clinical (n=104) QIs using each method, over the three rounds. There was a high level of validity of the QIs assessed across the majority of the appraisal criteria for both methods, the results of which were poor to moderately correlated between the two methods.

Round 1 - QI Appraisal Tool
There was mixed inter-rater reliability of the criteria prior to the group consensus discussion. General Validity and Understandability and interpretability for medical and nursing personnel scored perfect agreement within the group, while Data Collection Effort (% agreement = 22%, IRR = 0.01) and Understandability and interpretability for patients and interested public (% agreement = 28%, IRR = 0.09) scored the lowest (Table 2). Of the 104 QIs assessed, eight (7.7%) scored below the validity threshold (≥ 3) on the final composite score. Four of these were in the Acute Coronary Syndromes (ACS) clinical category, and six were QIs associated with or influenced by a receiving facility or location. All eight of these QIs scored relatively high for Relevance and Scientific Soundness yet scored poorly for Feasibility. On average, overall scores were generally higher for criteria within Relevance and Scientific Soundness and lower for those within Feasibility (Table 3). Another 15 QIs scored on the validity threshold (3.0-3.1) and were generally associated with resources/equipment or with the identification and reporting of sentinel events. These QIs similarly had their overall scores reduced due to a reduced perception of potential feasibility.
For the purposes of appraising the Indicator Evidence criterion within the Scientific Soundness category, the QIs were evaluated for inclusion within local clinical practice guidelines (CPGs). There was considerable representation of the QIs amongst the SA national EMS CPGs (Table 3). Seventy-nine QIs (76%) were accounted for in the CPGs, of which 76 (73%) had evidence directly supporting their use. Those QIs not represented were found to be either structure-based QIs, clinical bundle-based QIs, or QIs focusing on sentinel events and patient safety.

Round 2 - Literature Review
The literature search identified a total of 1624 potential articles for review (Figure 2). Following the title and abstract review, 1528 articles did not meet the inclusion criteria and were excluded, leaving 96 articles for full-text review. An additional 15 articles were included following a review of the reference lists of the 96 articles identified. Following the removal of duplicate texts and research not meeting the inclusion criteria (n=57), 31 articles remained for the full-text review. The literature review found an evidence base for 11 of the 15 Clinical subcategories and the 2 Non-clinical subcategories, plus an additional 4 subcategories not included in the QI appraisal, covering 311 indicators (Table 4). More than half (59%) were developed through a consensus/expert opinion-based approach, with fewer developed via more robust and higher quality levels of evidence such as systematic reviews and/or cohort and case-control-based studies (10% each).

Round 3 - Rand Method
As with the appraisal tool, there was mixed inter-rater reliability in the individual ratings prior to the consensus rating, with Acceptability scoring the highest (% agreement = 90%, IRR = 0.9) and Technical Feasibility the lowest (% agreement = 47%, IRR = 0.32). Eleven QIs (10.6%) scored below the validity threshold, six of which were within the ACS clinical category, including the four identified using the appraisal tool. Similarly, the six QIs associated with a receiving facility or location that scored below the validity threshold with the appraisal tool also scored below the validity threshold using the Rand method. Another eight QIs scored on the validity threshold (7.0-7.1) and were generally associated with resources/equipment. Only two of these QIs matched those scoring on the threshold with the appraisal tool. Again, as with the appraisal tool, scores were lower within the Technical Feasibility category compared to the other three.

Comparison of Categorical Appraisal Methods
When final consensus validity scores were compared, there was poor to moderate correlation between the results obtained with the appraisal tool and the Rand method (Spearman's rank correlation = 0.42, p<0.001). Ninety-two of the 104 QIs (88%) (78 clinical and 14 non-clinical) were appraised to be valid and feasible for the SA PEC setting, based on the results of this study. Within this group, 21 QIs (13 clinical and eight non-clinical) were assessed to be on the threshold of validity, for which caution is recommended until a pilot study of their use can be conducted prior to full implementation. There was little evidence to support a statistical difference in the proportion of non-valid QIs identified between the Qualify tool and the Rand method [difference = -0.03; 95% CI -0.12 to 0.05, p=0.47]; between the Qualify tool and the protocol [difference = -0.05; 95% CI -0.13 to 0.03, p=0.25]; or between the Rand method and the protocol [difference = -0.02; 95% CI -0.11 to 0.07, p=0.66]. There was likewise little evidence to support a statistical difference in the proportion of QIs for which caution is recommended, identified between the Qualify tool and the Rand method [difference = 0.07; 95% CI -0.02 to 0.15, p=0.12]; or between the Qualify tool and the protocol [difference = -0.06; 95% CI -0.16 to 0.04, p=0.27]. There was, however, strong evidence of a statistical difference in the proportion of QIs for which caution is recommended between the Rand method and the protocol [difference = -0.13; 95% CI -0.22 to -0.03, p=0.009].

Discussion Group Content Analysis
Several observations highlighted during the group discussions were found to be important considerations regarding the appraisal protocol and its ability to assess the appropriateness of the QIs for the SA PEC setting. For the appraisal tool, Relevance and Scientific Soundness were perceived to be characteristics inherent to the QIs (and supporting data components) themselves, and as a result were generally appraised to be highly applicable across all QIs and criteria (Table 5). In contrast, Feasibility was judged to be more a gauge of the system in which the QIs would be implemented and, as such, scores were on average lower amongst these criteria [1.1, 1.2]. Somewhat related to this was the broader issue of context and the importance of selecting those indicators that best suited the local setting prior to full implementation [1.3, 1.4]. Despite the focus on the appraisal of the QIs, on several occasions the discussion steered towards the need for EMS organisations in SA to improve their quality systems in general if such measures are to be implemented [1.5, 1.6]. For the Rand method, the importance of having completed the practical data extraction using the case vignettes made a difference in the QI rating [2.1, 2.2]. This expanded further into a general conversation about applying the QI framework, the quality system in which the QIs would be applied, and documentation in general [2.3-2.6].

Discussion
The simplicity and practicality of QIs as a system of quality measurement has led to their widespread adoption in healthcare4,14,28-34. Importantly, they align with Donabedian's conceptual framework for healthcare evaluation, predicated on the belief that an effective structure gives rise to effective processes of care, which in turn result in improved outcomes8. Within the PEC setting, patient exposure times are generally limited, and the delivery of care is based largely on processes as opposed to outcomes. The utilisation of QIs as a measure of quality is therefore ideally suited to this environment.
Despite these advantages, the implementation of inappropriate or poorly tested QIs, even in well-established quality systems, has been reported to be both time-consuming and costly to correct9,14. Consequently, QI appraisal has been identified as an essential step toward understanding the appropriateness of these measures for a particular healthcare field or setting, prior to full implementation. The results of this study support these notions through the application of a QI appraisal protocol against a series of QIs. Further, the results support the value of adopting a multi-method approach to QI appraisal, compared to the single method approach.
From a series of 104 QIs identified for potential use in the SA PEC context, eight were identified as non-valid and three were identified for which caution was recommended prior to full implementation, shared between the appraisal methods. A further 19 QIs were identified in the above categories but not shared by each method, highlighting the pragmatic advantages of a multi-method approach versus the single method approach. Our observations found the multi-method approach to be advantageous in that the methods complemented each other's strengths and compensated for each other's weaknesses. While the Qualify tool appraised the QIs from a greater number of viewpoints, the Rand approach offered insight into the practical application of the QIs not available with the Qualify tool. This was additionally evident in the group discussion analysis, which in and of itself added further input towards understanding and appraising the appropriateness of the QIs that would not have otherwise been captured or understood by the categorical methods alone18,35.
Despite these advantages, the application of the protocol required a significant investment of time and staff resources. The overall benefits of such an approach are therefore heavily dependent on the availability of these resources, which will likely vary significantly depending on the quality system setting within which the protocol is applied. These "system-focused" factors therefore have the potential to exert as much influence on the validity of the QIs as the setting in which the QIs will be implemented36,37.
The outcomes of the appraisal identified a significant number of QIs assessed to be valid and feasible for the SA PEC setting. The majority are centred on clinically focused processes of care, measures that are lacking from current performance assessment in EMS in SA. The importance and potential influence of the quality system in which the QIs will be implemented was further highlighted across all the methodologies. Quality system-focused assessment criteria scored lower, on average, than those criteria assessed to be characteristics inherent to the QIs themselves. This was reaffirmed during the qualitative discussion analysis, where system-focused factors were a regular discussion point.

Conclusion
Measurement forms a central part of every healthcare quality system. Regardless of the measurement approach used, it is essential that the framework be comprehensively assessed for appropriateness to the setting in which it will be employed. Understanding and accounting for this is key to ensuring both successful implementation and ongoing utilisation of such a system in this setting. The utilisation of a multi-method appraisal protocol offers significant benefit towards achieving this when compared to the single appraisal approach, and can provide confidence that the outcomes of the appraisal will ensure a strong foundation on which the measurement framework can be successfully implemented and employed.

Declarations
Ethics approval and consent to participate
Ethical approval for the study was granted by the Stellenbosch University Health Research Ethics Committee (HREC) (Ref no. S15/09/193). Written consent for participation was provided by each of the participants prior to data collection. The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Consent for publication
Not applicable/required

Competing interests
The authors declare that they have no competing interests.

Funding
Nil
Authors' contributions
IH, PC, MC, LW and VL conceived the study. IH conducted the data collection and analysis. IH drafted the manuscript, and all authors contributed to its revision. All authors have read and approve of the final manuscript, and consent to its publication. IH takes responsibility for the paper.

Quality Indicator Evidence Base Review

Appraisal tool criteria (excerpt):
Potential risks/side effects: "No risks are known/assumed which may result from the use of the indicator."
Scientific soundness
S1 Unambiguity of definitions: "The indicator is defined clearly and unambiguously."
S2 Reliability: "It is a reliable measurement."
S3 Risk adjustment: "The indicator is sufficiently adjusted to risk" (Are all factors that are not caused by the user taken into due account?)
S4 Sensitivity: "The indicator provides sufficient sensitivity."
S5 Specificity: "The indicator provides sufficient specificity."
S6 Validity: "The indicator provides sufficient validity."
Feasibility
F1 Understandability and interpretability for patients and interested public
F2 Understandability and interpretability for medical and nursing personnel
F3 Possibility to influence the indicator manifestation: "The quality indicator refers to an aspect of care which can be influenced by the actors to be assessed."
F4 Availability of data: "The data are documented by the service provider as a routine or can be collected with acceptable effort."
F5 Data collection effort: "There is no data collection method available that provides at least equivalent results with less effort."
F6 Implementation barriers: "Implementation barriers are unknown or covered by adequate measures."
F7 Accuracy: "The correctness of the data can be verified."
F8 Data integrity: "Is the individual data set intact?"
F9 Completeness of the data: "Is it possible to verify that all occurring cases were recorded?"
Figure 1 Quality Indicator Appraisal Protocol

Table 5 Illustrative quotes from the Working Group discussion content analysis

Appraisal tool
1.1 "...relevance and significance and benefit was naturally going to be scored high"
1.2 Usability: "Whenever I was rating a category that I used or drew information from the data dictionary, there was always sufficient information that left no doubt that it was well planned for or accounted for. The difficult part was knowing how much variation there would be in different EMS organizations in South Africa in how they would be able to extract this information and put it to use"
1.3 Context: "Whatever indicators are used by a service, it's important that they do a feasibility assessment of what's possible for them to achieve. We may be able to say overall, like these will work for South Africa in general, but when it comes to actual implementation, a service is going to have to understand its surroundings and the types of patients it sees"
1.4 "Like, the indicators involving direct transport to a CT [Computed Tomography] scanner for Stroke patients, or to PCI [Percutaneous Coronary Intervention] facilities for STEMI [ST Elevation Myocardial Infarction], those will only be applicable to certain metropolitan areas, and probably only for certain private services as well. It won't be a general indicator for everyone to use"
1.5 Quality system: "This is a complete mind shift from what we currently know and how we measure quality in South Africa. If a service is serious about implementing these, even if it's just a few, they're going to have to admit that it's going to take an overhaul in their quality system, and that it's likely going to need more resources than what they dedicate to measuring response times at the moment"
1.6 "Outside of a few of the large private services, the provincial services are going to have to ramp up the effort around measuring quality. As simple and as easy a system that these indicators are, there's probably not many of the provincial services that are ready to implement them"
RAND method
2.1 Methodology: "You really get to see how these will be used from a practical point of view. I can see the benefit of how a simple system that's objective can make the world of difference. It's not like how I used to remember it when we checked the case sheets, and it depended on how you felt at the time"
2.2 "Doing the data extraction made a big difference, because I remember, especially for the sentinel event indicators, I scored them quite low with the appraisal tool, but when we went through them and applied them to actual cases, it was much simpler than I thought it would be and so I scored them higher after being able to actually do the extraction"
2.3 Technology: "I think applying these indicators would be way easier with an electronic patient report form. It's going to take way more effort in doing it manually, but I can still see the benefits even if it's done this way"
2.4 Quality system: "I think when you're sitting down and applying the indicators to case sheets, the system does seem simple and straightforward enough to use. But what do you do from there? It's going to be a logistical challenge to get the paperwork together to do the assessment, but I feel like the bigger challenge is using the information we learn, it's just as important as getting the information"
2.5 Transparency: "It seems like it's going to be easy to game the system. Like how do I know the guys have done the things that they've written down. What sort of mechanism is there to check that they've been truthful in their notes, especially if they now know they're being watched"
2.6 Technology: "I think [participant] was right about the electronic record, because we can build checks and balances into that sort of thing to monitor truthfulness I suppose, also like [respondent] mentioned. That also solves the legibility issue and whether or not enough information has been written. Look at when we used the poor documentation examples, it was difficult to apply the indicators to those just because you didn't always have the right information to go on"