This evaluation provided a case study investigating the ability of GAI tools to process and analyse large healthcare datasets. To the authors’ knowledge, this paper is the first to challenge GAI tools to complete a clinical diagnostic coding conversion task and to compare the results against those of a manual rater. Converting clinical diagnostic codes to other coding systems, such as the task presented in this study, is a complex and time-consuming task commonly undertaken within healthcare data processing. This study therefore highlights an example of a potential use for GAI within health data analytics.
The analysis in this study examined levels of agreement between the two GAI tools and the manual rater. Whilst Claude Sonnet 3.5 outperformed ChatGPT 4o under both sets of assigned weights, there are several caveats to consider. For instance, the clinical validity of the ICD codes, particularly those identified as partial or incorrect matches, was not assessed. This may have resulted in several potentially valid codes being scored as incorrect. For example, the SNOMED code “314041007 Abdominal pain in early pregnancy” was manually converted to “R10.9 Unspecified abdominal pain”. As this formed the benchmark for comparison between the GAI tools, the conversions made by ChatGPT 4o (“O26.83 Pregnancy related abdominal pain”) and Claude Sonnet 3.5 (“O26.892 Other specified pregnancy related conditions, first trimester”) were considered incorrect matches and assigned zero points (a simplified illustration of such a weighted scoring scheme is given below). During analysis, the GAI tools identified additional, and arguably better, matches between SNOMED-CT and ICD-10-CM. There were also several cases where the I-MAGIC tool was unable to produce a match for a SNOMED-CT code (e.g., 102508009 “Well female child”), whereas ChatGPT 4o and Claude Sonnet 3.5 both produced the same alternative ICD-10-CM code (“Z00.129 Encounter for routine child health examination without abnormal findings”). This suggests that further formal analysis may demonstrate that GAI tools outperform human raters. It is therefore likely that the results of this study significantly underestimate the accuracy and clinical validity of the matches produced by the GAI tools.
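To make the weighted scoring concrete, the sketch below illustrates how a set of match weights might be applied. The weight values shown (exact = 1.0, partial = 0.5, incorrect = 0.0) are illustrative assumptions, not the weights used in this study.

```python
# Illustrative weighted-agreement calculation for GAI versus manual conversions.
# The weights below are hypothetical examples, not the weights used in this study.
WEIGHTS = {"exact": 1.0, "partial": 0.5, "incorrect": 0.0}

def agreement_score(match_ratings):
    """Mean weighted agreement across all converted codes."""
    return sum(WEIGHTS[r] for r in match_ratings) / len(match_ratings)

# Hypothetical ratings for five SNOMED-CT to ICD-10-CM conversions
ratings = ["exact", "exact", "partial", "incorrect", "exact"]
print(f"Weighted agreement: {agreement_score(ratings):.2f}")  # 0.70
```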
Despite the GAI tools demonstrating significant time and cost savings, several challenges were noted throughout the conversion process. With regards to ChatGPT 4o, the process of performing the SNOMED-CT-AU to ICD-10-CM conversion was not fully automated, nor would it be straightforward for someone inexperienced in writing GAI prompts to perform. When piloting the prompt, ChatGPT 4o had a tendency to skip lines or chunks of data, or to hallucinate (i.e., produce new input data that was not present in the provided dataset). It was therefore necessary to explicitly instruct ChatGPT 4o to “manually and sequentially” convert the provided codes, to “…not hallucinate, and only convert codes which have been provided…”, and to “…not create new codes to convert”. When completing the final batch of conversions, the output needed to be monitored for accuracy: although ChatGPT 4o did not hallucinate whilst converting the provided codes, it began producing new input data once it had run out of codes to convert.
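A minimal sketch of how these guard-rail instructions might be combined into a single conversion prompt is shown below; the wording is paraphrased from the quoted fragments above and is not the verbatim prompt used in this study.

```python
# Sketch of a conversion prompt assembling the guard-rail instructions described
# above. Paraphrased for illustration; not the verbatim prompt used in this study.
def build_prompt(snomed_codes):
    code_list = "\n".join(snomed_codes)
    return (
        "Manually and sequentially convert each of the following SNOMED-CT-AU "
        "codes to its closest ICD-10-CM equivalent. Do not hallucinate, and "
        "only convert codes which have been provided. Do not create new codes "
        "to convert. Output one conversion per line.\n\n" + code_list
    )

print(build_prompt(["314041007 Abdominal pain in early pregnancy"]))
```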
When providing additional prompts after the model had performed well, it was beneficial to offer positive reinforcement confirming that ChatGPT 4o had completed the task correctly; this prevented ChatGPT 4o from changing its original output. There were also instances where ChatGPT 4o would attempt to terminate the task (“Unfortunately I have run out of time to process additional conversions”) but could be prompted to continue without further issue. These nuances required some level of skill and familiarity with ChatGPT 4o and GAI prompting.
In terms of the time and labour required, ChatGPT 4o was not simply a ‘set and forget’ solution to a large data task. Due to limitations on the volume of codes it could process before sometimes hallucinating, a manual ‘nudge’ (e.g., “Please continue with the next batch”) was required after every 25 lines had been converted. This required continual monitoring of ChatGPT 4o whilst it was processing, both to prompt it to continue and to ensure that lines of data had not been skipped. Importantly, this rendered the task impractical to complete in the background whilst undertaking other work. ChatGPT 4o also imposes limits on the number of messages permitted within a certain timeframe (40 messages per three hours). Given the number of nudges required to process this data, in addition to further messages to adapt and rectify the prompt when it was not processing correctly, the message limit was quickly reached, requiring a wait until the window had lapsed before proceeding with the rest of the task. This dramatically inflates the timeframe in which the task can be completed. In principle, this batch-and-nudge workflow could be automated through an API rather than the chat interface, as sketched below.
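The sketch below illustrates how such automation could look using the OpenAI API. This is not what was done in this study, which used the ChatGPT web interface; the model name, prompt wording, and batch handling are assumptions for illustration only.

```python
# Hypothetical automation of the per-batch workflow via the OpenAI API.
# NOT what was done in this study (which used the ChatGPT web interface);
# model name, prompt wording, and batch handling are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
BATCH_SIZE = 25    # larger batches increased the risk of skipped or hallucinated lines

def convert_batch(snomed_batch):
    """Send one batch of SNOMED-CT-AU codes for ICD-10-CM conversion."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": ("Manually and sequentially convert these SNOMED-CT-AU codes "
                        "to ICD-10-CM. Only convert the codes provided; do not create "
                        "new codes.\n\n" + "\n".join(snomed_batch)),
        }],
    )
    return response.choices[0].message.content

codes = ["102508009 Well female child", "314041007 Abdominal pain in early pregnancy"]
results = [convert_batch(codes[i:i + BATCH_SIZE]) for i in range(0, len(codes), BATCH_SIZE)]
```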
Claude Sonnet 3.5 proved a more streamlined tool, requiring less skill and time to produce a working prompt. One of its key limitations was the process of importing and exporting data. Unlike ChatGPT 4o, Claude Sonnet 3.5 does not yet have the functionality to directly import or export Excel spreadsheet files. As a result, it was necessary to copy and paste lines of data from the spreadsheet into Claude Sonnet 3.5. This led to a further limitation: the restrictions on both message length and the number of messages permitted. As the amount of data exceeded what could be accepted into the input field, it was necessary to break the prompt into smaller, more manageable chunks (in this study, 500 lines at a time). Although Claude Sonnet 3.5 did not appear to hallucinate with larger numbers of conversions, only 50 were requested at a time due to limits on the maximum length of the output message. This, however, meant that the message limit (approximately 45 messages every 5 hours, dependent on message length) was quickly consumed. Because Claude Sonnet 3.5 processed batches significantly faster than ChatGPT 4o, the message limit was reached sooner, resulting in a longer waiting period between exceeding the limit and its renewal. As Claude Sonnet 3.5 was also unable to export directly to Excel at the end of the task, it was asked to produce R code which could be run in RStudio to generate the final output dataset, significantly increasing the time burden of the task. In addition to requiring the worker to have some knowledge of how to run code in RStudio, this step in fact accounted for the majority of the time taken: the code conversion itself took 1 hour and 15 minutes, whilst writing and running the R code accounted for the remaining 1 hour and 55 minutes (an illustrative sketch of this export step is given below). The ability for Claude Sonnet 3.5 to produce a downloadable Excel file would remove this obstacle, significantly reducing the time and cost of completing such data analysis.
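For context, the export step resembled the following. The study ran R code generated by Claude Sonnet 3.5, so the Python/pandas version below is an illustrative assumption about the shape of that script rather than the code actually used.

```python
# Illustrative reconstruction of the export step. The study used R code generated
# by Claude Sonnet 3.5; this Python/pandas equivalent is an assumption, shown for
# illustration only.
import pandas as pd

# Conversions copied out of Claude Sonnet 3.5, one tab-separated pair per line
pasted_output = (
    "314041007 Abdominal pain in early pregnancy\tO26.892 Other specified "
    "pregnancy related conditions, first trimester\n"
    "102508009 Well female child\tZ00.129 Encounter for routine child health "
    "examination without abnormal findings"
)

rows = [line.split("\t") for line in pasted_output.splitlines()]
df = pd.DataFrame(rows, columns=["SNOMED-CT-AU", "ICD-10-CM"])
df.to_excel("converted_codes.xlsx", index=False)  # requires the openpyxl package
```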
Despite the limitations of GAI, there are clear benefits to its use in completing large data analysis tasks. When completing this task manually, the human raters found it to be mentally and emotionally draining, physically fatiguing, and to carry a high risk of repetitive strain injury. They also found the task boring, tedious, and unstimulating, which, over a long period of time, is likely to decrease both staff morale and mental wellbeing. Placing employees at high risk of repetitive strain injury also risks higher costs and project delays resulting from researchers taking leave to recover from injury and stress [22, 23].
4.1 Study Limitations
Whilst this case study provides valuable insights into the use of GAI to complete a large health data analysis task, there are several limitations which should be noted. Firstly, given that this is an Australian dataset, the SNOMED-CT codes came from the Australian edition (SNOMED-CT-AU), whilst the I-MAGIC tool only caters for the international edition. This may account for why some codes could not be manually converted using the I-MAGIC tool. Additionally, multiple raters completed the manual coding task, introducing potential issues around inter-rater reliability, particularly where coders were less familiar with the task. Further, the I-MAGIC tool uses ICD-10-CM and has not yet been updated for the 11th edition of the ICD; to date, there is no mapping tool which enables SNOMED-CT to be converted to this newer version. Finally, this study only considered ICD-10-CM codes to be ‘correct’ if they perfectly or partially matched the manual code. Given that the aim of this study was to examine whether the task could be completed using GAI, it was outside the scope of the study to manually examine each ‘incorrect’ match to determine whether it was clinically valid. However, this is likely to underestimate the level of agreement between the GAI tools and the manual ratings, and hence the true performance of the GAI tools.
A further limitation of this study is the speed at which GAI tools are being developed and improved. It is likely that, in the period following this study, newer tools will be released which may yield different results in terms of accuracy and processing speed. However, such developments are only likely to improve the efficiency and accuracy of GAI tools.
4.2 Recommendations for Future Research
There is significant scope for future research within this field. Firstly, further analysis of the data produced in this study is planned to examine the clinical validity of partial or incorrect matches, which will strengthen the results of this study by producing more accurate agreement ratings between the GAI and manual coding output.
This study used the paid versions of both ChatGPT 4o and Claude Sonnet 3.5, which offer additional functionality and greater processing speed than the free versions. This study could be replicated using the free versions to determine whether the paid upgrade yields any difference in terms of level of agreement and processing time.
It is also yet to be determined whether the time and cost savings observed on this task would translate to other data conversion tasks. Further studies using GAI tools on other tasks are needed to determine whether the time and cost differences are consistent.
Additionally, as new GAI tools are released with improvements to speed and functionality, it is recommended that this study be repeated to examine how these improvements impact the speed and accuracy with which this task can be completed. Asking GAI to complete other, similar data analysis tasks should also be considered, to further explore the capabilities of GAI on healthcare data.
4.3 Conclusions
This study provides a case study for using GAI to complete manual data processing tasks which would otherwise be tedious, time-consuming, costly, and both mentally and physically fatiguing to complete. The results highlight that manual processing is prohibitive in terms of time and cost, and that alternative methods, such as the use of GAI, should be explored. GAI provides a potential gateway to explore and make use of the significant quantities of unanalysed health data, to assist in improving outcomes for healthcare staff, researchers, systems, and, importantly, patients.