According to the Landis and Koch classification (12), our study found moderate inter-rater reliability and substantial intra-rater reliability for the LMTS.
Our methodology supports the robustness of these results: the triage forms are based on real ED cases and are representative of the case mix and acuity of the patients presenting to our department.
However, our study suffers from several limitations. Triage level was ascertained using forms extracted from patient records. Any information relevant to the original triage decision that was not recorded in the original file is therefore missing from the study triage form. This could influence a triage decision, especially if the missing information related to one of the emergency signs and symptoms. The absence of nonverbal cues and the impossibility of asking follow-up questions could have impaired the decisions of the triage nurses participating in the study, especially when they hesitated between two triage levels. Despite this, we chose not to add information when creating the forms in order not to bias the experiment in favor of the scale. Furthermore, like any other clinical task, triage is affected by extraneous factors such as total workload, language barriers or patient aggressiveness, which can induce variability in the collection of relevant clinical data. Our “ex-vivo” methodology could not correct for these factors, but we nonetheless believe that it is well suited for measuring the consistency of triage assessment, as exactly the same information is used for each triage decision.
Moreover, no direct comparison was made between the LMTS and other more widely used triage scales. Such a comparison would have required training the triage nurses in the two scales, which was not technically possible.
There are several methods for calculating a kappa statistic. We chose the quadratic weighted kappa as a measure of reliability as it emphasizes the amount of disagreement between observations. However, this method, although the most common in the literature, is debated because it could reward mistriage if a majority of classifications fall in the “middle category”. (13) This could be less of an issue in our study as the LMTS has four categories, and thus no “middle category”. Also, it has been shown that quadratic weighted kappa yields systematically higher measurements than linear or un-weighted kappa measurements and this should be taken into account when comparing triage scales. (14–16)
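As an illustration of how the choice of weights changes the statistic, the following minimal sketch computes Cohen's kappa for two raters under unweighted, linear and quadratic weighting. It is not the code used in our analysis: the function name `weighted_kappa` and the two rating lists are hypothetical, and the ratings are invented toy data on a four-level scale.

```python
def weighted_kappa(rater_a, rater_b, n_categories, weighting="quadratic"):
    """Cohen's kappa for two raters on an ordinal scale coded 1..n_categories."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(rater_a)
    k = n_categories

    # Observed joint proportions: observed[i][j] is the share of cases
    # rated i+1 by the first rater and j+1 by the second.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - 1][b - 1] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 at the maximum distance.
        distance = abs(i - j) / (k - 1)
        if weighting == "quadratic":
            return distance ** 2
        if weighting == "linear":
            return distance
        if weighting == "unweighted":
            return 0.0 if i == j else 1.0
        raise ValueError("unknown weighting: " + weighting)

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed_disagreement = sum(weight(i, j) * observed[i][j] for i, j in cells)
    expected_disagreement = sum(weight(i, j) * row[i] * col[j] for i, j in cells)
    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical toy ratings (not study data): eight forms, two nurses.
nurse_a = [1, 2, 2, 3, 3, 3, 4, 4]
nurse_b = [1, 2, 3, 3, 3, 4, 4, 3]

for scheme in ("unweighted", "linear", "quadratic"):
    print(scheme, round(weighted_kappa(nurse_a, nurse_b, 4, scheme), 2))
```

On this toy data every disagreement is by a single level, so the quadratic weights penalize it least and the quadratic kappa is the highest of the three, followed by the linear and then the unweighted value. This is the ordering that should be kept in mind when comparing kappa values across studies.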
Many studies have evaluated the inter-rater reliability of the major triage scales. A review and meta-analysis by Pourasghar et al found 50 articles evaluating 15 different triage scales, with kappa measures ranging from 0.32 (a local Taipei scale using an un-weighted kappa) to 0.94 (South African Triage Scale using a quadratic weighted kappa). (16) There is also wide variability in inter-rater agreement within each individual scale. For instance, the most evaluated scale, the Canadian five-level CTAS, showed quadratic weighted kappa measures ranging from 0.44 (17) to 0.84 (18). Overall, our study showed a lower inter-rater reliability measure for the LMTS compared to other triage scales.
There are fewer studies evaluating intra-rater reliability for the various triage scales, and they show higher intra-rater reliability than that measured in this study. A study by Storm-Versloot comparing the MTS and the ESI found quadratic weighted kappa values of 0.90 and 0.85, respectively. (14)
The lower inter-rater and intra-rater reliability of the LMTS could be explained by the more open structure of the scale compared to the complaint-based and/or algorithmic structure common in the more widely used scales. An algorithmic presentation forces a decision, potentially decreasing the variability between assessments. The open structure of the LMTS, while easily accessible, could either foster non-adherence or lack precision (i.e. the same patient might justifiably be classified in two categories). This is supported by the finding that only 30% of classifications showed complete concordance between the six triage nurses. Although most of these discordances were by one triage level, in an emergency setting this could have important consequences, as the triage level determines the level of care and waiting time.
Another point to consider when interpreting these results is the effect of the number of levels on the measured kappa. Brenner et al have demonstrated that a quadratic weighted kappa increases with the number of categories in an ordinal scale. (19) This could partially account for a lower kappa value when evaluating this four-level scale as compared to the more common five-level scales.
Our study showed a fair performance of the LMTS in predicting resource consumption in the ED and patient severity in a high volume urban ED. Undertriage rate was low, while the overtriage rate was quite high.
Our study suffers from several limitations. Our cohort was retrospective and subject to selection bias and confounding. However, we believe that our method of screening patients reduces the two most important confounders, the variability in workload and staff. The retrospective nature of the study also increases the risk of missing data. However, as our EHR also integrates prescriptions, this risk is reduced for resource consumption. Furthermore, it is possible that in a certain number of cases the triage scale was not applied correctly, and our study was not designed to measure this. No measure of efficiency, such as length of stay in the ED, was collected for this study. Finally, no direct comparison was made between the LMTS and other more widely used triage scales.
More than 80% of patients in our study were categorized as level 3, with only 3% categorized as level 4. Level 3 seems to be a “catch all” category, into which fall most patients not presenting with explicit severity symptoms. This could be a consequence of basing this scale primarily on a severity score instead of chief complaints. For example, a patient with a benign condition might still present with a slightly elevated pain score or respiratory rate, be classified as level 3, and not be at risk for deterioration. Equally, some patients at risk for deterioration could have been triaged in the same category. The nine categories of severity signs and symptoms are there to mitigate this risk, but they might need to be expanded or modified after an analysis of outcomes per chief complaint.
As there is no “gold standard” for severity, the validity of a triage scale can be measured in multiple ways, each reflecting a different facet of severity. (20)
Resource consumption is a measure of the strain put on an ED by an individual patient; patients requiring a greater number of ED resources are often the most acute. It might vary according to local resources, prescription practices and case mix. Also, the method of resource consumption calculation varies across studies. Using a statistical measure of correlation such as Spearman’s coefficient mitigates these concerns as it measures the relationship between the rankings of two variables. The correlation coefficient measured in our study (R = –0.41) falls in the lower range of previous studies. The ESI is the scale with the most studies measuring its correlation with resource consumption, possibly because predicted resource consumption is part of its decision algorithm. These studies showed coefficients ranging from –0.53 (21) to –0.71 (22). The MTS was evaluated at R = –0.37 (21), the CTAS at R = –0.48 (23) and the FRENCH version 2 at R = –0.64 (8).
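To make the rank-based argument concrete, the sketch below computes Spearman's coefficient as the Pearson correlation of average ranks and shows that rescaling the resource count leaves the coefficient unchanged. It is an illustration only, not the code used in our analysis: the function names and the `levels` and `resources` lists are hypothetical, and the values are invented toy data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's coefficient: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("coefficient is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data (not study data): triage level and resources used.
levels = [1, 1, 2, 2, 3, 3, 3, 4]
resources = [5, 3, 4, 2, 2, 1, 0, 0]

rho = spearman_rho(levels, resources)
# Counting resources on a different scale preserves the ranking, hence rho.
rho_rescaled = spearman_rho(levels, [10 * r for r in resources])
print(rho, rho_rescaled)
```

A negative coefficient is expected here because level 1 denotes the most urgent patients, so a lower triage level number is associated with a higher resource count.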
Hospitalization rates are a surrogate marker for severity frequently used in studies of triage scale validity. They are also dependent on local admission policies and case mix. Despite this, validity studies using this criterion as an endpoint have found a correlation between triage level and hospitalization rate for the ESI, MTS, CTAS, and FRENCH. (8,21,23,24,25) In our study, triage level is correlated with hospitalization rate. However, our hospitalization rate for the most urgent patients (levels 1 and 2) is much lower than those recorded in previous studies. This is consistent with our measured overtriage rate of 58%. Some overtriage is expected: for instance, patients with severe pain will be triaged as level 1 or 2 and might later be discharged home. Nonetheless, a high overtriage rate can be detrimental, as valuable resources will be diverted from the care of more severe patients.
Undertriage is an important marker when measuring the validity of a triage scale, as it assesses the degree to which a triage scale puts patients at risk by under-evaluating their severity. There are two ways of measuring undertriage: against outcome or against a reference standard (usually determined retrospectively case by case by an expert panel, a method subject to disagreement between experts).(26,27) Most studies use the reference standard method; however, in our opinion, this method lacks precision as it measures the rate of patients whose triage category was less than the reference category regardless of the potential severity of the mistriage. We chose hospitalization in an ICU or HDU or death in the ED for patients triaged level 3 or 4 as an outcome measure in order to specifically measure the rate of patients whose survival was potentially put at risk by the triage scale. In this regard the scale performed well with a low undertriage rate of 1.2%, although the low rate for this outcome in our population (1.7%) underscores the fragility of this measure. Using the same definition as ours, a study by Cooke in 1999 found an undertriage rate of 33% for the first version of the MTS. However, a study by Steiner found a rate of 1.6% for the German adaptation of the second version of the MTS and a study by Grossmann in 2010 found a rate of 3.3% for the ESI, in the same range as our result.(28–30)
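The outcome-based definition above can be sketched as follows. This is an illustration under stated assumptions rather than our analysis code: the `Visit` record and the `undertriage_rate` function are hypothetical names, the example visits are invented, and the denominator is assumed here to be all patients triaged level 3 or 4.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    triage_level: int     # 1 (most urgent) to 4 (least urgent)
    severe_outcome: bool  # ICU or HDU admission, or death in the ED


def undertriage_rate(visits):
    """Share of level 3 or 4 patients who had a severe outcome.

    Assumption: the denominator is every patient triaged level 3 or 4.
    """
    low_acuity = [v for v in visits if v.triage_level in (3, 4)]
    if not low_acuity:
        raise ValueError("no patients triaged level 3 or 4")
    return sum(v.severe_outcome for v in low_acuity) / len(low_acuity)


# Hypothetical toy visits (not study data).
example_visits = [
    Visit(1, True),
    Visit(2, False),
    Visit(3, False),
    Visit(3, False),
    Visit(3, True),
    Visit(4, False),
]
print(undertriage_rate(example_visits))
```

Because the numerator counts only severe outcomes, a rare outcome yields few events, and the estimated rate is sensitive to small changes in their number.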