For the human text classification eye tracking user study, the participant-selected cause-of-injury codes for each narrative in the 204-case dataset were recorded. As explained in the study design, each case in the dataset was analyzed by three participants (17 unique sets × 12 cases per set = 204 cases, with each set classified once in each of the three batches). ET1, ET2, and ET3 refer to the eye tracking data recorded for the three batches of cases across the 51 participants. For the ML and LLM approaches, the cause-of-injury codes predicted by the LR and ChatGPT models for each narrative in the 204-case dataset were recorded.
Comparing Text Classification Performance between Humans, ChatGPT, and ML
The text classification performances of the three approaches (human, ML, and ChatGPT) were evaluated on the set of 204 injury cases by comparing the cause-of-injury codes predicted by the eye-tracking study participants, the ML model, and ChatGPT with the original codes assigned by the QISU professional coders. Recall was used as the primary measure of performance, both for each of the six cause-of-injury codes and for overall performance on the whole dataset, as described in Equation (1):
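(Equation (1) is reconstructed here from the True Positive and denominator definitions given in the following paragraph.)

\[ \text{Recall} = \frac{\text{Number of True Positive cases}}{\text{Total number of cases considered}} \quad (1) \]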
For the Recall calculations, a case was counted as a True Positive when the predicted cause-of-injury code agreed with the originally assigned QISU code. The total number of cases (the denominator of Equation (1)) was 34 for each individual cause-of-injury code and 204 (the size of the dataset) for the overall Recall. Due to space considerations, other commonly used performance measures such as Precision and F1-score are not included in the paper. Figure 2 presents the Recall for each injury code and for the overall dataset for the Human study (three sets: ET1, ET2, and ET3), ChatGPT, and ML.
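As an illustration, the per-code and overall Recall computation described above can be sketched as follows; the variable names (qisu_codes, predicted_codes) are hypothetical placeholders rather than the study's actual analysis scripts.

```python
# Sketch of the Recall computation: a case is a True Positive when the
# predicted cause-of-injury code matches the QISU-assigned code.
from collections import defaultdict

def recall_by_code(qisu_codes, predicted_codes):
    true_pos = defaultdict(int)   # matches per QISU-assigned code
    totals = defaultdict(int)     # total cases per code (34 each in this study)
    for gold, pred in zip(qisu_codes, predicted_codes):
        totals[gold] += 1
        if pred == gold:
            true_pos[gold] += 1
    per_code = {code: true_pos[code] / totals[code] for code in totals}
    overall = sum(true_pos.values()) / len(qisu_codes)   # 204 cases overall
    return per_code, overall
```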
As shown in Figure 2, among the three text classification approaches, the ML model reported the highest overall Recall (84%), compared to Humans and ChatGPT. A possible underlying reason is that the ML model was trained exclusively on a large dataset of injury narratives. In contrast, while the Human study participants and ChatGPT could draw on a general understanding of the narrative text and of the cause-of-injury category definitions, they had no specific training or experience in the injury narrative classification task that would expose its nuances. The text classification study participants had no significant prior knowledge of injury data analysis, and ChatGPT, although trained on large textual corpora from many domains, has not been trained exclusively on injury-code-related datasets. Figure 2 also shows the variation in Humans' text classification performance across the three sets ET1, ET2, and ET3 among the 51 study participants. Note that multiple participants were involved in each of the user study batches ET1, ET2, and ET3; therefore, a comparative analysis between the batches was not performed.
Variation in Performance based on the Distinct Nature of the Code
Among the different cause-of-injury codes, the classification performance varied with the uniqueness of the code, i.e., how distinct a code was relative to the other codes and therefore how unlikely it was to be confused with another code. This was particularly important for the Human and ChatGPT approaches, where no specialized training on injury coding was provided. As the ML model was trained on thousands of cases of each injury code, it was expected to learn the classification rules in more detail. The cause-of-injury codes BURN and MOTORVEHICLE were relatively unique in nature, codes FALL, STRUCK, and CUT were moderately unique, and the code OTHER was not very unique. The injury code descriptions are provided in Table 2 in the Methods section. The distinct nature of codes BURN and MOTORVEHICLE meant that their narratives typically included a unique set of words, for example, describing the type of injury (e.g., burned or scalding for BURN) or the product involved (e.g., car or pedestrian for MOTORVEHICLE), that did not overlap with narratives of other injury codes. On the other hand, the injury codes that were not very unique contained elements overlapping with other injury codes or lacked clear, definitive classification rules. For example, the codes CUT and STRUCK shared overlapping elements, such as an interaction between a tool and a person, and the code OTHER had a relatively fuzzy definition, namely that the narrative does not belong to any of the other cause-of-injury codes.
As shown in Figure 2, codes BURN and MOTORVEHICLE, which had relatively unique definitions, reported higher Recall for Humans and ChatGPT than the other injury codes. There was also less variation in performance across the three Human sets ET1, ET2, and ET3 for BURN and MOTORVEHICLE than for the other injury codes. This may indicate that the text classification process for these codes was less confusing for non-experts because of their unique nature. For BURN, ChatGPT (97%) and Humans (~91% for all three sets) reported their highest Recall. It is also interesting to note that while ML reported its highest Recall for MOTORVEHICLE (97%), its Recall for BURN was lower than that of both Humans and ChatGPT.
Humans and ChatGPT reported relatively low Recall for the cause-of-injury codes OTHER, CUT, and STRUCK, while the ML model performed relatively better for these categories. The variation across the human sets ET1, ET2, and ET3 was also relatively large for these categories. For the cause-of-injury code OTHER, ML reported the highest Recall (79%), followed by Humans (ET1-38%, ET2-71%, ET3-59%) and ChatGPT (53%). For STRUCK, the Recall for ML (79%) was considerably higher than for Humans (ET2-53%, ET3-32%) and ChatGPT (62%). All three approaches, Humans, ChatGPT, and ML, reported their lowest individual Recall for STRUCK. For CUT, the Recall was close for ML (82%) and ChatGPT (79%), followed by Humans (ET1-59%, ET2-76%, ET3-68%). It is also worth noting that the ML model reported better Recall than Humans and ChatGPT for all cause-of-injury codes except FALL and BURN. For FALL, the ML Recall was marginally lower than that of ChatGPT and ET3. For the FALL cause-of-injury code, Humans reported both the highest Recall (ET2-88%) and the lowest Recall (ET3-59%).
Variation in Prediction Performance based on Complexity of Narratives
Each of the 204 narratives used as prompts for the classification task was internally classified into one of three levels of complexity (Low, Medium, and High) based on two factors: (a) the level of difficulty in comprehending the narrative text, and (b) the lack of clarity or obviousness in selecting the most appropriate cause-of-injury code, either because multiple codes were plausible or because the narrative lacked information. The eye-tracking study participants were not aware of these narrative complexity levels during their text classification task. Overall, of the 204 narratives, 74 were categorized as "Low", 82 as "Medium", and 48 as "High" complexity. We analyzed the variation in Recall based on narrative complexity for the predictions made by Humans (ET1, ET2, ET3), ChatGPT, and ML, as presented in Figure 3.
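A minimal sketch of this Recall-by-complexity breakdown, assuming hypothetical parallel lists of QISU codes, predicted codes, and complexity labels per narrative:

```python
# Sketch: Recall per complexity level ("Low", "Medium", "High").
from collections import defaultdict

def recall_by_complexity(qisu_codes, predicted_codes, complexity_labels):
    correct = defaultdict(int)
    totals = defaultdict(int)
    for gold, pred, level in zip(qisu_codes, predicted_codes, complexity_labels):
        totals[level] += 1          # 74 Low, 82 Medium, 48 High in this study
        if pred == gold:
            correct[level] += 1
    return {level: correct[level] / totals[level] for level in totals}
```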
As shown in Figure 3, the highest Recall values were observed for narratives with low complexity for ML (89%), ChatGPT (91%), and Humans ET1 (62%) and ET3 (62%). However, ET2 reported its highest Recall (68%) for narratives with "Medium" complexity. Overall, considering the variation in human performance between groups ET1, ET2, and ET3 shown in Figure 3, we can observe that (a) for low-complexity narratives, ML and ChatGPT performed considerably better than humans, and (b) for medium- and high-complexity narratives, ML performed considerably better than ChatGPT and humans, while ChatGPT performed marginally better than humans.
Explainability: Comparing Top Words between Humans, ChatGPT, and ML
We also studied the reasoning behind the text classification choices made by humans, the traditional ML model, and the LLM (ChatGPT) by examining the words in the narrative text that each relied on when making the classification decision. For human text classification (ET1, ET2, ET3), the word-level eye tracking parameters fixation count (FC) and fixation duration (FD) were recorded and analyzed for each participant to identify the top-10 words that participants focused on while comprehending the narrative and selecting the cause-of-injury code. For the ML model, the top-5 words in the narrative used by the LR model to make the prediction were determined using a LIME-based ELI5 explainability analysis. The top-5 ML words are referred to later in the paper as ML1 (top 1st word), ML2 (2nd word), ML3 (3rd word), ML4 (4th word), and ML5 (5th word). For the LLM, the top-10 words used by the ChatGPT model were obtained through prompts asking it to list, in decreasing order of importance, the top words it used for the text classification decision.
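The sketch below illustrates the general idea of extracting per-narrative top words. It approximates the LIME-based ELI5 ranking for a TF-IDF + LR model with a simple contribution score (TF-IDF value × class coefficient), and ranks gaze words by fixation count; the function and variable names (top_ml_words, top_gaze_words, word_fixation_counts) are hypothetical and not the study's actual code.

```python
import numpy as np

def top_ml_words(narrative, vectorizer, lr_model, k=5):
    # Approximate per-narrative predictor words for a TF-IDF + Logistic Regression
    # model by ranking each word's contribution to the predicted class.
    x = vectorizer.transform([narrative])                 # 1 x vocabulary sparse row
    pred_class = lr_model.predict(x)[0]
    class_idx = list(lr_model.classes_).index(pred_class)
    contributions = x.toarray()[0] * lr_model.coef_[class_idx]
    vocab = np.array(vectorizer.get_feature_names_out())
    top_idx = np.argsort(contributions)[::-1][:k]
    return [vocab[i] for i in top_idx if contributions[i] > 0]

def top_gaze_words(word_fixation_counts, k=10):
    # word_fixation_counts: hypothetical {word: fixation count} dict from the
    # word-level eye tracking export; top-10 by FC (FD handled analogously).
    return sorted(word_fixation_counts, key=word_fixation_counts.get, reverse=True)[:k]
```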
Next, we compared the level of agreement between the top predictor words of the ML model and those of humans and ChatGPT. The level of agreement between the top words of humans and the ML model was calculated by comparing the overlap between the top-5 words of the ML model and the top-10 predictor words of humans for each of the 204 cases in the dataset. Similarly, the level of agreement between the top words of ML and ChatGPT was calculated by comparing the overlap between the top-5 words of ML and the top-10 words used by ChatGPT for each case. Figure 4 shows the agreement level for each of the top-5 ML words (ML1-ML5) with humans (ET1, ET2, and ET3, based on FC and FD) and ChatGPT. The Y-axis represents the number of cases, out of the 204 in the dataset, in which the top-nth word of the ML model was present in the top-10 word lists of ChatGPT and humans (ET-FCs and ET-FDs). The X-axis shows the top ML words (ML1-ML5).
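A sketch of the agreement count underlying Figure 4, assuming hypothetical per-case lists of the top-5 ML words and the comparison top-10 words (human ET-FC/FD or ChatGPT):

```python
def rank_agreement(ml_top5, other_top10):
    # For each ML rank (ML1-ML5), count the cases where that word also appears
    # in the comparison top-10 list; counts[0] -> ML1, ..., counts[4] -> ML5.
    counts = [0] * 5
    for ml_words, other_words in zip(ml_top5, other_top10):
        other_set = {w.lower() for w in other_words}
        for rank, word in enumerate(ml_words[:5]):
            if word.lower() in other_set:
                counts[rank] += 1
    return counts
```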
As shown in Figure 4, the topmost ML word, ML1, overlapped with the ET and ChatGPT top-10 words for the largest number of cases in the dataset, followed by the next top words ML2, ML3, ML4, and ML5. This behavior was somewhat expected, as the top predictor words were provided by ELI5 in decreasing order of importance. Similarly, the list of top-10 predictor words from ChatGPT was also organized in decreasing order of importance, as it was instructed to do in the prompts. One interesting trend in Figure 4 is that the decline in the number of cases with matching top words between ML and the other approaches was substantially larger after ML2 than the decline from ML1 to ML2. This indicates that the top-2 ML predictor words were considerably more predictive than ML3, ML4, and ML5.
Another noticeable trend in Figure 4 was that the decline in agreement between the top words of ML and ChatGPT, going from ML1 to ML5, was considerably steeper than the decline in agreement between the top words of ML and humans (ET-FCs and ET-FDs). This relatively higher overlap between the top words of ML and humans indicates better alignment in the reasoning behind the classification choices of ML and humans than of ML and ChatGPT. We can also observe in Figure 4 a slightly higher agreement level with humans for ML5 than for ML4, which may indicate that the difference in importance scores between ML4 and ML5 was not as pronounced as that between the topmost words ML1 and ML2.
Next, we compared the overlap between the top predictor words used by ChatGPT and humans, as shown in Figure 5. The axes in Figure 5 are organized similarly to Figure 4, with the Y-axis representing the number of cases from the 204-case dataset in which the top words of humans and ChatGPT agreed, and the X-axis representing the top-10 ChatGPT words (ChatGPT1-ChatGPT10).
As shown in Figure 5, the agreement between the top words of ChatGPT and humans followed an overall trend similar to the agreement between ML and humans shown in Figure 4, with a higher level of agreement for the first two to three words followed by a steady decline for later words. However, one noticeable difference between Figures 4 and 5 is that the decline in agreement between the top words of ChatGPT and humans was relatively steeper than that between ML and humans. Between humans and ML, the number of overlapping cases was in the range of 80-105 for ML4 and 80-110 for ML5, as shown in Figure 4. For the agreement between ChatGPT and humans, the number of overlapping cases was in the range of 60-90 for ChatGPT4 and 35-45 for ChatGPT5, as shown in Figure 5. This indicates that while the top-3 ChatGPT predictor words aligned well with human reasoning, the later words had relatively lower agreement with humans.
Next, we studied how the agreement of top predictor words between the different approaches varied across cause-of-injury codes, as shown in Figure 6. The intuition was that the level of uniqueness of the words used in the narratives varied considerably across cause-of-injury codes; therefore, the top predictor words used by the different approaches may vary by code. As mentioned earlier, codes BURN and MOTORVEHICLE were relatively unique in nature, codes FALL, STRUCK, and CUT were moderately unique, and the code OTHER was not very unique. To examine the overlap of top predictor words between the different approaches, we compared the following combinations: (a) ML and ChatGPT (top-5 of ML and top-10 of ChatGPT), (b) ML and Eye Tracking based on Fixation Count (ET(FC)), (c) ChatGPT and ET(FC), (d) ML and Eye Tracking based on Fixation Duration (ET(FD)), and (e) ChatGPT and ET(FD). In Figure 6, the X-axis shows the six cause-of-injury codes, and the Y-axis represents the percentage of the total number of cases for each category (34 cases per category in the 204-case dataset) in which the top-5 predictor words agreed between the different text classification approaches. For this analysis, the data from the three sets of humans (ET1, ET2, and ET3) were combined and are represented as ET in Figure 6.
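A sketch of the per-code overlap percentage shown in Figure 6, under the assumption that "agreement" means the two approaches share at least one top predictor word for a case (the paper's exact matching criterion may differ); the data structure shown is a hypothetical list of (qisu_code, words_a, words_b) tuples:

```python
from collections import defaultdict

def overlap_percentage_by_code(cases):
    # Percentage of cases per cause-of-injury code where the top-word lists
    # of two approaches (e.g., ML vs. ET(FC)) share at least one word.
    agree = defaultdict(int)
    totals = defaultdict(int)
    for code, words_a, words_b in cases:
        totals[code] += 1                  # 34 cases per code in the 204-case set
        if {w.lower() for w in words_a} & {w.lower() for w in words_b}:
            agree[code] += 1
    return {code: 100.0 * agree[code] / totals[code] for code in totals}
```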
Examining the overlap of the top predictor words of ML and ChatGPT in Figure 6, the highest level of agreement was observed for the code BURN, followed by CUT, FALL, STRUCK, MOTORVEHICLE, and OTHER. As shown in Figure 2, the prediction performances (Recall) of ML and ChatGPT were relatively close for the codes CUT, MOTORVEHICLE, and FALL, and there was some difference in Recall for the codes OTHER, STRUCK, and BURN. Two interesting trends for the relatively unique codes BURN and MOTORVEHICLE were: (a) while the Recall values for ML and ChatGPT differed for BURN, the overlap between their top words was high, and (b) while the Recall values of ML and ChatGPT were close for MOTORVEHICLE, the overlap of their top predictor words was low. One possible reason is the difference in how the two approaches interpret the narrative words. The ML model was trained on thousands of injury narratives, so it referred to word weights derived from the training set. The ChatGPT LLM was not specifically trained on injury narratives, so it derived the relative importance of words from a more general vocabulary.
Between ML and humans, the highest overlap among top predictor words was observed for the code CUT, followed by STRUCK, MOTORVEHICLE, OTHER, FALL, and BURN, as shown in Figure 6. Comparing the Recall values of ML and humans (ET1, ET2, and ET3), a relatively wide gap was observed for the codes CUT, OTHER, and STRUCK. It is interesting to note that while there was a considerable difference in the Recall values of ML and humans for the codes CUT and STRUCK, the overlap between their top predictor words was relatively high. A possible reason is that, since both of these codes were only moderately unique, humans and the ML model could attend to the same words yet reach different conclusions, i.e., select different codes.
Between ChatGPT and humans (ET(FC) and ET(FD)), the highest overlap between top predictor words was observed for the code FALL, followed by the codes CUT, BURN, STRUCK, MOTORVEHICLE, and OTHER. It can also be observed from Figure 6 that the difference in the level of overlap between ChatGPT and humans based on fixation count (ChatGPT-ET(FC)) and fixation duration (ChatGPT-ET(FD)) was considerably larger for the codes STRUCK and FALL than for the other codes. No consistent trend was observed in the variation of the Recall values of ChatGPT and humans (ET1, ET2, and ET3) across the cause-of-injury codes.