Reliability and validity of a four-level severity score based triage scale

Background: Triage scales are essential tools for an early and rapid assessment of patients, by classifying them according to their degree of urgency. The objective of this study was to measure the reliability and validity of a four-level severity score based scale. Methods: To evaluate reliability, 250 triage forms were triaged by 6 triage nurses twice, 15 days apart. Intra- and inter-rater reproducibility were measured using a weighted Cohen’s Kappa. For the validity study, 485 charts were evaluated. The relationship between triage level and emergency department resource consumption was measured using Spearman’s correlation coefficient. Prediction of severity was measured by the correlation between triage level and hospitalization in any ward, and between triage level and death in the emergency department or hospitalization in an intensive care or high dependency unit. Areas under the ROC curves were measured for these results. Results: For inter-rater reproducibility, the weighted Kappa was measured at 0.51 (95% CI 0.30 – 0.70), and 0.67 (95% CI 0.42 – 0.91) for intra-rater reproducibility. 474 patients were recruited in the validity study, respectively 26 (5%), 50 (11%), 384 (81%) and 14 (3%) for triage levels 1 through 4. Spearman’s correlation coefficient between triage level and emergency department resource consumption was calculated at R = -0.41 (p < 0.001). 111 patients were admitted to the hospital with an area under the ROC curve measured at 0.72. 8 patients were admitted to an intensive care or high dependency unit with an area under the ROC curve measured at 0.73. Conclusions: Our study of a four-level severity score based triage scale finds intermediate results in terms of reliability and validity. Further improvements of the scale have to focus on better discriminating between moderately urgent and non-urgent patients.

Triage scales are essential tools in emergency departments (ED) for an early and rapid assessment of patients. These scales aim to classify patients according to severity and resources necessary for their management. They are an essential part of patient flow management by optimizing waiting times and patient care. Thus, a triage scale must be adapted to the typology of admitted patients and to the internal organization of the department. The expected characteristics of a good triage scale are to be valid, reliable and efficient.
Validity measures the degree to which the triage scale reflects the severity of the patient at the time of triage. As there is no "gold standard" with absolute accuracy for severity, surrogate markers such as expert opinion, resource consumption, hospitalization or mortality have been used. (1) Reliability refers to the degree to which repeated assessments of the same patient will yield the same triage level, either by the same assessor or by several different assessors. Efficiency is defined as the use of available resources in a timely manner. Measures of efficiency include the length of time necessary for triage, the time to first medical contact or length of stay in the ED. (2) To meet these requirements, many local, national or international triage scales have been developed. [...] Thus it is closer to a severity or early warning score and plausibly less complex than the five-level complaint-based or algorithmic scales more commonly used (such as the Manchester Triage Scale, MTS, or the Emergency Severity Index, ESI).
The LMTS had never been formally evaluated. The aim of this study was to measure the performance of this scale regarding reliability and validity.

Study setting
Our ED is located in a large urban teaching hospital and admitted over 63 000 patients in 2017. Pediatric and gynecological emergencies are managed in separate departments, with the exception of major traumas which are managed in the adult ED. Patient data is collected in a standardized manner in an electronic health record (EHR).

Reliability study
We conducted a prospective, observational, single-center study whose objective was to measure the inter- and intra-individual reliability of the LMTS.
The primary endpoint was the measurement of inter-rater reliability. Secondary endpoints were the level of complete concordance between raters and intra-rater reliability.
The study population comprised six experienced triage nurses, who classified patients according to the information contained in admission records. The minimum number of records necessary was calculated a priori at 249, using the confidence interval method described by Rotondi. 250 records were chosen among patients 18 years or older who were admitted to the ED during the study period. Records of patients who had already been triaged by a medicalized mobile intensive care unit (SAMU) or by police, who were transferred from another ED or from prison, or who were redirected to a pediatric or obstetrical ED at triage were excluded. From each of the selected records, a "triage form" was created containing the information available to the triage nurse before a triage decision was made. The six study nurses classified each form once, then a second time in a different order fifteen days later.
For inter-rater reliability we used the methodology described by Light (9): a quadratic weighted Cohen's Kappa was calculated between each possible pair of assessors (resulting in fifteen weighted Kappas), then an arithmetic mean of these fifteen Kappas was calculated. For intra-rater reliability, a similar methodology was used: a quadratic weighted Cohen's Kappa was calculated between the two evaluations of each triage nurse.
An arithmetic mean of these six Kappas was then calculated. All analyses were performed using R (R development core team, 2018).
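The averaging procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the R code used for the study: the function names (`quadratic_weighted_kappa`, `light_kappa`, `mean_intra_rater_kappa`) and the data layout (one list of triage levels per nurse, coded 1 to 4) are hypothetical, and the toy ratings are invented for the example.

```python
from itertools import combinations


def quadratic_weighted_kappa(a, b, n_levels=4):
    """Quadratic weighted Cohen's kappa for two lists of levels coded 1..n_levels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must classify the same non-empty set of forms")
    n = len(a)
    # Observed joint proportions: p[i][j] = rater A chose level i+1, rater B level j+1.
    p = [[0.0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        p[x - 1][y - 1] += 1.0 / n
    row = [sum(p[i]) for i in range(n_levels)]
    col = [sum(p[i][j] for i in range(n_levels)) for j in range(n_levels)]
    observed = expected = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = ((i - j) / (n_levels - 1)) ** 2  # quadratic disagreement weight
            observed += w * p[i][j]
            expected += w * row[i] * col[j]
    if expected == 0.0:
        raise ValueError("kappa is undefined when both raters use a single level")
    return 1.0 - observed / expected


def light_kappa(ratings, n_levels=4):
    """Arithmetic mean of the pairwise kappas over every pair of raters."""
    pairs = list(combinations(ratings, 2))  # six raters give fifteen pairs
    return sum(quadratic_weighted_kappa(a, b, n_levels) for a, b in pairs) / len(pairs)


def mean_intra_rater_kappa(first_pass, second_pass, n_levels=4):
    """Arithmetic mean of each rater's kappa between the two classification rounds."""
    kappas = [quadratic_weighted_kappa(a, b, n_levels)
              for a, b in zip(first_pass, second_pass)]
    return sum(kappas) / len(kappas)


# Toy data: three hypothetical raters, five forms (not study data).
toy = [[1, 2, 3, 3, 4], [1, 2, 3, 4, 4], [2, 2, 3, 3, 4]]
print(round(light_kappa(toy), 3))
```

Averaging pairwise kappas keeps the two-rater statistic intact while extending it to six raters; the confidence intervals reported in the study are not reproduced by this sketch.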

Validity study
We conducted a retrospective observational single center study. The primary objective was to measure the correlation between triage level and resource consumption in the ED.
Secondary objectives were to measure the predictive value of the triage scale for severity, and to measure undertriage and overtriage.
Resource consumption was measured according to the same methodology used by Tanabe for the Emergency Severity Index (10) and Taboulet for the FRENCH (8), resulting in a resource consumption rate. Investigators were blinded to triage levels when collecting the data. Examples of resources collected are presented in Table 1. Correlation between triage level and resource consumption was measured using Spearman's correlation coefficient (R). Severity was measured by the correlation between triage level and hospitalization in any ward, and between triage level and hospitalization in an intensive care (ICU) or high dependency unit (HDU). We measured this relationship by calculating the area under the receiver operating characteristic (ROC) curve for these outcomes.
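The two statistics named above can be illustrated with a short, dependency-free Python sketch. It is not the study's R analysis: the helper names (`average_ranks`, `spearman`, `auc`) and the toy values are assumptions made for the example. Because level 1 is the most urgent, the negated triage level is used as the score when computing the area under the ROC curve, so that a higher score means a more urgent patient.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def auc(scores, outcomes):
    """Area under the ROC curve: probability that a patient with the outcome
    has a higher score than a patient without it (ties count for one half)."""
    pos = [s for s, o in zip(scores, outcomes) if o]
    neg = [s for s, o in zip(scores, outcomes) if not o]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data for six hypothetical patients (not study data).
levels = [1, 2, 3, 3, 3, 4]
resources = [6, 4, 2, 1, 3, 0]
hospitalized = [1, 1, 0, 1, 0, 0]

print(spearman(levels, resources))                  # negative: urgent patients use more
print(auc([-lvl for lvl in levels], hospitalized))  # above 0.5: levels discriminate
```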
Overtriage rate was defined as the proportion of patients classified as level 1 or 2 who were not hospitalized following ED management. Undertriage rate was defined as the proportion of patients classified as level 3 or 4 who were ultimately hospitalized in an intensive care or high dependency unit following ED management, or died in the ED.
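These two definitions translate directly into code. The sketch below is illustrative only: the record layout (`level`, `hospitalized`, `icu_hdu_or_ed_death`) and the function names are hypothetical. The only study figure used is the count of 44 non-hospitalized patients among the 76 classified as level 1 or 2; the individual records built around that count are placeholders.

```python
def overtriage_rate(patients):
    """Share of level 1-2 patients who were not hospitalized after ED management."""
    urgent = [p for p in patients if p["level"] in (1, 2)]
    return sum(not p["hospitalized"] for p in urgent) / len(urgent)


def undertriage_rate(patients):
    """Share of level 3-4 patients admitted to an ICU/HDU or who died in the ED."""
    non_urgent = [p for p in patients if p["level"] in (3, 4)]
    return sum(p["icu_hdu_or_ed_death"] for p in non_urgent) / len(non_urgent)


# Placeholder records reproducing the reported count: 44 of 76 not hospitalized.
# The level assigned to each record is arbitrary; only the 44/76 split is from the text.
urgent = [
    {"level": 1, "hospitalized": i >= 44, "icu_hdu_or_ed_death": False}
    for i in range(76)
]
print(f"{overtriage_rate(urgent):.0%}")  # 58%
```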
Inclusion and exclusion criteria were the same as for the reliability study. Using the method described by Moinester (11), with an a priori correlation coefficient estimated at R = -0.55 (95% CI 0.45-0.65) based on current literature, the minimum number of patients required for analysis was 190. We estimated that 10% of patients would be excluded. In order for the sample to be representative of the variations in patient flow and ED staffing, all patients presenting to the ED were screened for inclusion during three 24-hour periods in the week ranging from the 30th of May 2017 to the 40th of June 2017: the days with the most and the least admissions (Monday and Wednesday from 00h to 00h), and the 24 hours ranging from Saturday at noon to Sunday at noon. Triage nurses were blinded with respect to the days of evaluation. Every patient file for the three selected periods was retrospectively studied for exclusion criteria and data collection. Statistical analysis was performed using R (R development core team, 2018).
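The figure of 190 can be reproduced with a confidence-interval-width formula for correlations, n = 4(1 - r²)²(z/w)² + 3, often attributed to Bonett and Wright. Whether this is exactly the computation the authors performed is an assumption; the sketch below only shows that it is consistent with the reported number when the expected coefficient is 0.55 in absolute value and the total interval width is 0.20 (0.45 to 0.65). The function name `n_for_correlation_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def n_for_correlation_ci(r, width, confidence=0.95):
    """Approximate sample size so that the confidence interval around an
    expected correlation r has the requested total width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return math.ceil(4 * (1 - r ** 2) ** 2 * (z / width) ** 2 + 3)


print(n_for_correlation_ci(-0.55, 0.20))  # 190
```

The sign of the coefficient does not matter here, since only r² enters the formula.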

Reliability Study
The mean age of all nurses was 34.5 (standard deviation 10.6). They had a mean of 5.4 years' experience as triage nurses (standard deviation 2.7).
Calculation of inter-rater reliability yielded a quadratic weighted Kappa = 0.51 (95% Confidence Interval (CI): 0.30-0.70). Complete agreement between all six raters was reached for 30% of classifications, and 60% of discordances were for a difference of one level. Discordances according to triage level are summarized in Table 2.
Calculation of intra-individual reliability 15 days apart yielded a quadratic weighted Kappa = 0.67 (95% CI: 0.42-0.91).

[...] Table 3. [...] Table 4. 44 patients out of the 76 patients classified as either level 1 or 2 were not hospitalized following ED management (overtriage rate 58%). 5 patients out of the 398 classified as either level 3 or 4 were hospitalized in an ICU or HDU or died in the ED (undertriage rate 1.2%).

Discussion

Reliability study
According to the Landis and Koch classification (12), our study found a moderate level of inter-rater reliability and a substantial level of concordance when measuring intra-rater reliability with the LMTS.
Our methodology ensures the robustness of our results. The triage forms are based on real ED cases and representative of the case mix and acuity of the patients presenting to our department.
However, our study suffers from several limitations. Triage level was ascertained using forms extracted from patient records. Any information relevant to the original triage decision not present in the original file would thus be lacking in the study triage form.
This could influence a triage decision, especially if this information related to one of the emergency signs and symptoms. The absence of nonverbal cues and the possibility of asking any follow-up questions could have impaired the decision of the triage nurses participating in the study, especially when hesitating between two triage levels. Despite this, we chose not to add additional information when creating the forms in order not to bias the experiment in favor of the scale. Furthermore, like any other clinical task, triage is affected by extraneous factors such as the total workload, language barriers or patient aggressiveness, which can induce variability in the collection of relevant clinical data. Our "ex-vivo" methodology could not correct for these factors, but we nonetheless believe that this methodology is well suited for measuring the consistency of triage assessment, as the exact same information is used for each triage decision.
Moreover, no direct comparison was made between the LMTS and other more widely used triage scales. Such a comparison would have required training the triage nurses in the two scales, which was not technically possible.
There are several methods for calculating a kappa statistic. We chose the quadratic weighted kappa as a measure of reliability as it emphasizes the amount of disagreement between observations. However, this method, although the most common in the literature, is debated because it could reward mistriage if a majority of classifications fall in the "middle category". (13) This could be less of an issue in our study as the LMTS has four categories, and thus no "middle category". Also, it has been shown that quadratic weighted kappa yields systematically higher measurements than linear or unweighted kappa measurements, and this should be taken into account when comparing triage scales. There are fewer studies evaluating intra-rater reliability for the various triage scales, and they show higher intra-rater reliability than the one measured in this study. A study by Storm-Versloot comparing the MTS and the ESI found quadratic weighted kappas of 0.90 and 0.85 respectively. (14) The lower inter-rater and intra-rater reliability of the LMTS could be explained by the more open structure of the scale compared to the complaint-based and/or algorithmic structure common in the more widely used scales. An algorithmic presentation forces a decision, potentially decreasing the variability between assessments. The open structure of the LMTS, while easily accessible, could either foster non-adherence or lack precision (i.e. the same patient might justifiably be classified in two categories). This is supported by the result that only 30% of classifications showed complete concordance between the six triage nurses. Although most of these discordances were for one triage level, in an emergency setting this could have important consequences, as the triage level determines the level of care and waiting time.
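The difference between the three kappa variants mentioned above lies only in the disagreement weights. As a sketch in our own notation (not taken from the source), let p_ij be the observed proportion of forms rated i by one assessor and j by the other, p_i. and p_.j the corresponding marginal proportions, and k the number of triage levels:

```latex
\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, p_{ij}}{\sum_{i,j} w_{ij}\, p_{i\cdot}\, p_{\cdot j}},
\qquad
w_{ij}^{\text{unweighted}} = \mathbf{1}[i \neq j], \quad
w_{ij}^{\text{linear}} = \frac{|i-j|}{k-1}, \quad
w_{ij}^{\text{quadratic}} = \frac{(i-j)^2}{(k-1)^2}
```

With k = 4, a one-level disagreement carries a weight of 1/9 under quadratic weighting, against 1/3 under linear weighting and 1 without weighting. This is consistent with quadratic weighted kappa tending to be the highest of the three when most discordances are one level apart.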

Validity Study
Our study showed a fair performance of the LMTS in predicting resource consumption in the ED and patient severity in a high volume urban ED. Undertriage rate was low, while the overtriage rate was quite high.
Our study suffers from several limitations. Our cohort was retrospective and subject to selection bias and confounding. However, we believe that our method of screening patients reduces the two most important confounders, the variability in workload and staff. The retrospective nature of the study also increases the risk of missing data.
However, as our EHR also integrates prescriptions, this risk is diminished concerning resource consumption. Furthermore, it is possible that in a certain number of cases the triage scale was not applied correctly, and our study was not designed to measure this. No measure of efficiency, such as length of stay in the ED, was collected for this study.
Finally, no direct comparison was made between the LMTS and other more widely used triage scales.
More than 80% of patients in our study were categorized as level 3, with only 3% categorized as level 4. Level 3 seems to be a "catch all" category, into which fall most patients not presenting with explicit severity symptoms. This could be a consequence of basing this scale primarily on a severity score instead of chief complaints. For example, a patient with a benign condition might still present with a slightly elevated pain score or respiratory rate, and thus be classified as level 3 without being at risk for deterioration. Equally, some patients at risk for deterioration could have been triaged in the same category. The nine categories of severity signs and symptoms are there to mitigate this risk, but they might need to be expanded or modified after an analysis of outcomes per chief complaint.
As there is no "gold standard" for severity, the validity of a triage scale can be measured in multiple ways, each reflecting a different facet of severity. (20) Resource consumption is a measure of the strain put on an ED by an individual patient; patients requiring a greater number of ED resources are often the most acute. It might vary according to local resources, prescription practices and case mix. Also, the method of resource consumption calculation varies across studies. Using a statistical measure of correlation such as Spearman's coefficient mitigates these concerns as it measures the relationship between the rankings of two variables. The correlation coefficient measured in our study (R = -0.41) falls in the lower range of previous studies. The ESI is the scale with the most studies measuring its correlation with resource consumption, possibly because predicted resource consumption is part of its decision algorithm. These studies showed coefficients ranging from -0.53 (21) to -0.71 (22). The MTS was evaluated at R = -0.37 (21), the CTAS at R = -0.48 (23) and the FRENCH version 2 at R = -0.64 (8).
Hospitalization rates are a surrogate marker for severity frequently used in studies of triage scale validity. They are also dependent on local admission policies and case mix.
Despite this, validity studies using this criterion as an endpoint have found a correlation between triage level and hospitalization rate for the ESI, MTS, CTAS, and FRENCH. (8,21,23,24,25) In our study, triage level is correlated with hospitalization rate.
However, our hospitalization rate for the most urgent patients (level 1 and 2) is much lower than those recorded in previous studies. This corresponds to our measured overtriage rate of 58%. Some overtriage is expected, for instance patients with severe pain will be triaged as level 1 or 2 and might later be discharged home. Nonetheless, a high overtriage rate can be detrimental as valuable resources will be diverted from the care of more severe patients.
Undertriage is an important marker when measuring the validity of a triage scale, as it assesses the degree to which a triage scale puts patients at risk by under-evaluating their severity. There are two ways of measuring undertriage: against outcome or against a reference standard (usually determined retrospectively case by case by an expert panel, a method subject to disagreement between experts). (26,27) Most studies use the reference standard method; however, in our opinion, this method lacks precision as it measures the rate of patients whose triage category was less than the reference category regardless of the potential severity of the mistriage. We chose hospitalization in an ICU or HDU or death in the ED for patients triaged level 3 or 4 as an outcome measure in order to specifically measure the rate of patients whose survival was potentially put at risk by the triage scale.
In this regard the scale performed well with a low undertriage rate of 1.2%, although the low rate for this outcome in our population (1.7%) underscores the fragility of this measure. Using the same definition as ours, a study by Cooke in 1999 found an undertriage rate of 33% for the first version of the MTS. However, a study by Steiner found a rate of 1.6% for the German adaptation of the second version of the MTS and a study by Grossmann in 2010 found a rate of 3.3% for the ESI, in the same range as our result. (28)(29)(30)

Conclusion
The LMTS, a four-level severity score based triage scale, showed a moderate level of inter-rater reliability and a substantial level of intra-rater reliability, although usually lower than published results for more structured algorithmic scales. Triage level was correlated with resource consumption and severity. Further improvements of the scale have to focus on better discriminating between moderately urgent and non-urgent patients.

Ethics approval and consent to participate
The study protocol was reviewed and accepted by the Le Mans General Hospital Ethics Board. All data was anonymized before analysis. Patient consent was waived considering the observational nature of the study. The study database was declared to the Commission Nationale Informatique et Libertés (French data protection regulatory body).

Availability of data and materials
The datasets used and/or analyzed are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Funding:
No funding was obtained for this study.

Consent for publication:
Not applicable.

Author contributions:
CJ is the principal investigator and was responsible for data collection. The study is based on an original idea by JCC. CJ, SB and JCC were jointly responsible for study conception and design. CJ and JCC were responsible for data analysis and interpretation. CJ, SB and JCC were responsible for writing the manuscript. All authors read and approved the final manuscript.