Assessing the Effectiveness of Automatic Speech Recognition Technology in Emergency Medicine Settings: A Comparative Study of Four AI-powered Engines

Purpose: Cutting-edge automatic speech recognition (ASR) technology holds significant promise for transcribing and recognizing medical information during patient encounters, thereby enabling automatic, real-time clinical documentation that could significantly alleviate clinicians' documentation burden. Nevertheless, the performance of current-generation ASR technology in analyzing conversations in noisy and dynamic medical settings, such as prehospital or Emergency Medical Services (EMS) care, lacks sufficient validation. This study explores the current technological limitations and future potential of deploying ASR technology for clinical documentation in fast-paced and noisy medical settings such as EMS.

Methods: We evaluated four ASR engines: Google Speech-to-Text Clinical Conversation, OpenAI Speech-to-Text, Amazon Transcribe Medical, and Azure Speech-to-Text. The empirical data used for evaluation were 40 EMS simulation recordings. The transcribed texts were analyzed for accuracy against 23 Electronic Health Record (EHR) categories of EMS. The common types of transcription errors were also analyzed.

Results: Among the four ASR engines, Google Speech-to-Text Clinical Conversation performed best. Across EHR categories, better performance was observed for "mental state" (F1 = 1.0), "allergies" (F1 = 0.917), "past medical history" (F1 = 0.804), "electrolytes" (F1 = 1.0), and "blood glucose level" (F1 = 0.813). However, all four ASR engines demonstrated low performance in transcribing certain critical categories, such as "treatment" (F1 = 0.650) and "medication" (F1 = 0.577).

Conclusion: Current ASR solutions fall short of fully automating clinical documentation in the EMS setting. Our findings highlight the need for further improvement and development of automated clinical documentation technology to improve recognition accuracy in time-critical and dynamic medical settings.


Introduction
Clinical documentation is a time-consuming and challenging task. Electronic health records (EHRs) have been increasingly adopted to streamline this process. However, research has shown that the introduction of EHR systems has led to increased clinician burnout [1], decreased time spent on patient care [2], and lower patient satisfaction [3]. The unintended consequences of using EHRs are exacerbated in more time-critical and dynamic settings, such as emergency medicine, because the documentation task in such settings is performed during a short and highly intense process, posing significant challenges to the timely and accurate creation of patient medical records and thus leading to incomplete, delayed, or erroneous documentation [4][5][6].
Given these challenges in the effective use of EHRs in various medical settings, research has proposed using automatic speech recognition (ASR) technologies, empowered by natural language processing (NLP) techniques, to automate (part of) clinical documentation [7]. These technologies record and process conversations during patient-provider encounters to extract and summarize relevant clinical information. The processed information could, for instance, be used to populate some EHR fields, create billing codes, and generate decision and diagnostic support. Some notable and advanced ASR tools are offered by companies such as Google [8], Nuance [9], and Amazon [10]. A few studies have assessed their performance in recognizing clinically meaningful information from patient-provider conversations [7,11]. However, these prior studies were mainly conducted in quieter medical settings, such as patient exam rooms, and the performance of ASR technology is understudied in more dynamic and noisier clinical settings. Additionally, prior work mainly investigated one-on-one interactions, such as those between physician and patient during clinical encounters. The ability of these commercially available ASR tools to recognize medical information in non-linear, fragmented conversations involving multiple people, such as occur in emergency care, has not been investigated.
To address these remaining questions, our work uses Emergency Medical Services (EMS) or prehospital care as a study context, as it presents a prime example of a dynamic and noisy medical setting involving many people (e.g., multiple care clinicians, patients, bystanders, and family members). During prehospital care, EMS clinicians are dispatched to the field to provide emergency care, with the goal of quickly stabilizing a patient's urgent health condition and transporting the patient to the nearest or most appropriate care facility. Given the dynamic, hands-on, and time-critical nature of prehospital care, real-time clinical documentation using EHRs is often impossible [12,13]. These characteristics present a great opportunity for ASR technology to facilitate and automate EHR documentation in prehospital care.
In this study, we report an empirical assessment of how four commercially available ASR engines perform in transcribing and recognizing medical information from 40 audio recordings of high-fidelity simulations of prehospital care with teams of EMS clinicians. We aim to address the following research questions through this work: 1) How effectively do commercially available ASR engines transcribe and recognize clinically relevant information in dynamic and noisy emergency care environments such as the prehospital setting? 2) What are the common types of transcription errors made by ASR engines in prehospital settings? Answering these research questions can inform the further development and improvement of ASR tools in supporting real-time clinical documentation in dynamic and noisy medical settings.

Automatic speech recognition (ASR) engines evaluated
This study focuses on evaluating the potential of leveraging ASR engines to facilitate clinical documentation in emergency care settings, such as EMS, which are often noisy and dynamic and involve multiple parties in the care process, as opposed to quiet care settings such as medical exam rooms. Based on our research and literature review, Google Cloud Speech-to-Text [14] with the medical conversations model, the OpenAI automatic speech recognition model Whisper [15], Amazon Transcribe Medical with the conversation model [10], and Microsoft Azure Speech-to-Text [16] are the four most advanced ASR engines that are publicly available for evaluation. We refer to these four ASR engines in the following as "Google ASR", "OpenAI ASR", "Amazon ASR", and "Azure ASR", respectively.

Evaluation dataset
The assessment data were generated from 40 audio recordings of high-fidelity EMS simulations, which were conducted in a mobile simulation lab resembling the back of an ambulance. The simulation scenarios varied by clinical conditions and required treatments, including a 15-month-old with seizure, a 1-month-old with hypoglycemia, and a 4-year-old with clonidine ingestion. These simulations replicate the dynamic prehospital environment, including background noise and overlapping speech from multiple speakers. The participants in the simulations were EMS clinicians who staff either an ambulance or a fire truck in their daily roles. The simulation recordings were transcribed by professional transcriptionists as part of a project examining EMS workflow [17,18]. These transcripts were used as ground truth for the ASR engine evaluation. The Institutional Review Board of Pace University thoroughly reviewed the study protocol and determined it to fall under the category of non-human subjects research.

Annotation process for classifying transcription into EMS EHR fields
Given that the ultimate goal of our work is to determine how to leverage ASR engines to automate clinical documentation in EMS, the main aim of this assessment is to compare the ability of different AI-powered ASR engines to accurately transcribe content associated with specific structured fields within the EMS EHR. Examples of such structured fields include "age," "gender," "past medical history," "allergies," and others. We used a multi-step process to annotate the transcripts.
First, based on an examination of two EMS EHR systems' structures as well as the NEMSIS data structure [19], we listed all the typical structured fields in EMS EHR systems and categorized them into five high-level categories: "ePatient", "eHistory", "eVital", "eSituation", and "eMedication" (Table 1). Then, two annotators (ZZ and LZ) independently annotated six transcripts from three different scenarios (two transcripts per scenario) by marking content that belongs to these structured fields. In this step, we calculated the inter-rater agreement using Cohen's Kappa coefficient; the two annotators achieved an "almost perfect agreement" on the mapping between content and EHR fields (kappa = 0.80). All disagreements between the annotators were resolved in meetings that involved a third annotator (XL). The remaining transcripts were then annotated based on the data dictionary, while another annotator (ZZ) reviewed all the annotations to ensure correctness and consistency of annotation across all 40 transcripts. It is worth noting that the annotation process was iterative and collaborative, involving back-and-forth correction of previous annotations.
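For illustration, the agreement statistic can be computed with scikit-learn's implementation of Cohen's kappa. The following is a minimal sketch, assuming each annotator's field assignments have been aligned span by span; the label lists are invented for illustration and are not the study's data.

```python
# Minimal sketch: inter-rater agreement via Cohen's kappa.
# The aligned label lists below are illustrative, not the study's actual annotations.
from sklearn.metrics import cohen_kappa_score

# One EHR-field label per annotated span, for each annotator.
annotator_zz = ["age", "allergies", "medication", "treatment", "BP", "age"]
annotator_lz = ["age", "allergies", "treatment", "treatment", "BP", "age"]

kappa = cohen_kappa_score(annotator_zz, annotator_lz)
print(f"Cohen's kappa: {kappa:.2f}")  # values of about 0.8 and above indicate very strong agreement
```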

Automated content extraction from ASR transcribed text for performance evaluation
To determine the ASR engines' ability to transcribe clinically relevant information associated with structured fields in the EMS EHR, an automated content extraction framework was designed, shown in Fig. 1. This extraction framework aims to identify transcribed text that aligns with the annotated ground truth transcription. First, the timestamp of the ground truth transcription containing the annotated content belonging to the EHR fields is used to locate the search span for the text analysis. For example, if the ground truth transcription contains annotated age information, based on the timestamp of the transcription, we located the corresponding initial timestamp in the ASR transcribed data. From that initial timestamp, we included five ASR transcribed sentences before and after for the text analysis. Then, a sliding window whose length n matches the number of words in the annotated content is used to generate candidate content from the ASR transcribed sentences for similarity measurement. The sliding window does not cross punctuation tokens. For example, to match the annotated age "one month old" in the ground truth transcription, a sliding window of length 3 applied to the text "All right, so this is Susie, one month old." generates the candidates "so this is", "this is Susie", and "one month old".
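A minimal sketch of this candidate-generation step is shown below, assuming the neighboring ASR sentences have already been collected; the function name is ours, not part of the described framework.

```python
# Sketch of sliding-window candidate generation over one ASR-transcribed sentence.
import re

def generate_candidates(sentence: str, window_len: int) -> list[str]:
    """Yield every contiguous n-gram of window_len words from a sentence.

    Splitting on punctuation first keeps the window from crossing
    punctuation tokens, as described in the text.
    """
    candidates = []
    for clause in re.split(r"[.,;:!?]", sentence):
        words = clause.split()
        for i in range(len(words) - window_len + 1):
            candidates.append(" ".join(words[i:i + window_len]))
    return candidates

# Reproducing the example from the text:
print(generate_candidates("All right, so this is Susie, one month old.", 3))
# -> ['so this is', 'this is Susie', 'one month old']
```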
Two NLP similarity metrics were employed in this study: (1) Levenshtein distance similarity [20], which is based on fuzzy string matching [21], and (2) embedding-based semantic similarity, which is based on cosine similarity calculated over large language model-based embeddings. These metrics were utilized to identify the most suitable matches in the ASR transcribed data compared to the annotated ground truth transcriptions. The Levenshtein distance serves as a metric for quantifying the dissimilarity between two sequences of words. It calculates the minimum number of edits required to transform one word sequence into another, where the permissible edits are inserting, deleting, or substituting letters or words. The FuzzyWuzzy ratio function [22] was used to estimate the Levenshtein distance similarity ratio between a candidate and the annotated ground truth. To calculate the semantic similarity between a candidate and the annotated ground truth, we first used a BERT-based sentence transformer [23] to convert the candidate and ground truth into embeddings; cosine similarity was then calculated to measure the similarity between them. The candidate demonstrating the highest Levenshtein distance similarity and cosine similarity was chosen as the best match and utilized for the quantitative evaluation.
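The two measures can be sketched as follows, assuming the fuzzywuzzy and sentence-transformers packages are installed; the embedding model named here is a common default and may differ from the one used in the study.

```python
# Illustrative sketch of the two similarity measures used for candidate matching.
from fuzzywuzzy import fuzz
from sentence_transformers import SentenceTransformer, util

ground_truth = "one month old"
candidate = "1 month old"

# (1) Levenshtein distance similarity ratio (0-100 scale) via FuzzyWuzzy.
lev_ratio = fuzz.ratio(candidate, ground_truth)

# (2) Embedding-based semantic similarity via a BERT-based sentence transformer.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
emb = model.encode([candidate, ground_truth], convert_to_tensor=True)
cos_sim = util.cos_sim(emb[0], emb[1]).item()

print(f"Levenshtein ratio: {lev_ratio}, cosine similarity: {cos_sim:.3f}")
```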

ASR performance evaluation methods
Two performance evaluation methods were employed to evaluate the extracted content against the annotated ground truth.

Named Entity Recognition (NER)-based evaluation metrics
After the matching content is extracted, traditional evaluation metrics for NER tasks were used for evaluation [24][25][26]. All extracted content is tokenized into words. Then, word-based precision (P), recall (R), and F1 are calculated for each category of the structured fields.
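As an illustration of the word-based scoring, a minimal sketch follows; the multiset-overlap formulation is our assumed interpretation of the word-level precision/recall computation.

```python
# Sketch of word-based precision, recall, and F1 for one EHR category,
# treating extracted and ground-truth content as multisets of words.
from collections import Counter

def word_prf(extracted: list[str], ground_truth: list[str]) -> tuple[float, float, float]:
    """Compute word-level precision, recall, and F1 for one category."""
    ext, gt = Counter(extracted), Counter(ground_truth)
    overlap = sum((ext & gt).values())           # words correctly transcribed
    precision = overlap / max(sum(ext.values()), 1)
    recall = overlap / max(sum(gt.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

p, r, f1 = word_prf("one month old".split(), "one month old".split())
print(f"P={p:.2f}, R={r:.2f}, F1={f1:.2f}")  # -> P=1.00, R=1.00, F1=1.00
```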

Semantic similarity-based accuracy measurement
For the semantic evaluation, cosine similarity between the embeddings of the annotated ground truth and the best matching candidate is calculated. If the cosine similarity is above the threshold of 0.8, the match is counted as correct. The accuracy for each category of the structured fields is then calculated as the proportion of correct matches.
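A minimal sketch of this accuracy computation, with illustrative similarity scores:

```python
# Sketch of semantic similarity-based accuracy: a best match counts as correct
# when its cosine similarity exceeds the 0.8 threshold; accuracy per category
# is the fraction of correct matches. The scores below are illustrative.
THRESHOLD = 0.8

def category_accuracy(similarities: list[float]) -> float:
    """Fraction of best-match candidates whose cosine similarity exceeds the threshold."""
    correct = sum(1 for s in similarities if s > THRESHOLD)
    return correct / len(similarities) if similarities else 0.0

# e.g., five annotated instances of one EHR field and their best-match scores
print(category_accuracy([0.95, 0.82, 0.61, 0.99, 0.78]))  # -> 0.6
```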

Descriptive statistics of the evaluation data
The ground truth transcriptions of the 40 audio recordings each contained 744 to 1957 words (mean: 1340.5, SD: 297.57). The number of speakers in each recording ranged from 4 to 8 (mean: 6, SD: 1.02). The duration of the reenacted audio recordings was between 528 and 928 seconds (mean: 652.5, SD: 86.58).
Detailed statistics of the annotated data can be found in Table 2. One interesting observation is that the total number of occurrences of each EHR field varies. The most frequent are those belonging to the EHR fields "complaint/symptoms", "medication", and "treatment", whereas the less frequent ones are "mental state" and "electrolytes". Some categories occur in more than 87.5% of the recordings, such as "age", "complaint/symptoms", "past medical history", "BP", "ECG", "RESP", "medication", and "medication equipment". Another interesting observation is that the average length of the annotated content in most fields is no more than three words. The category "trauma" has the longest average description, at 3.89 words, possibly because it often contains detailed descriptions of the patient's situation after examination.

ASR performance evaluation and comparison
The performance of the four ASR engines in recognizing content in the EMS EHR categories is reported in Table 3. The results show that Google ASR has a higher overall average performance than the other three ASR engines. However, compared to Google ASR, the other three ASR engines perform slightly better in transcribing information for certain categories, such as "gender", "age", "family information", "lung sounds", "capillary refill time", "complaint/symptoms", "trauma", "airway", and "pupils". Amazon ASR and Azure ASR performed significantly better than Google ASR and OpenAI ASR only in transcribing "gender" information. Upon investigation, we found that "male" was often transcribed as "mail" by Google ASR and OpenAI ASR. All ASR engines performed well in transcribing information about "mental state" and "electrolytes". This could be because there are fewer instances of these categories in the evaluation data, and their descriptions are relatively short. OpenAI ASR demonstrates consistently higher performance in categories such as "family information," "trauma," and "airway," which typically have an average length of more than three words.

Table 4 shows the performance comparison based on the semantic similarity evaluation. Google ASR again attained the highest average performance. The performance of Google ASR in categories such as "pupils", "past medical history", "ECG", "medication", and "treatment" showed greater improvement compared to the other categories when the semantic similarity evaluation was used. Through our investigation, we discovered that the semantic measurement could identify cases where transcribed words have similar meanings to the ground truth, even if the words are not an exact match. For example, if the ground truth is "diarrhea for the past two days" and the transcribed text is "diarrhea for the past few days", the semantic similarity between these two phrases is more than 0.9, and it is counted as a correct case. On the other hand, performance in the "temperature" category dropped slightly when the semantic similarity measurement was used, because we observed that the transcribed temperature sometimes omitted the decimal point in the value.

Types of errors in transcribing
Given that the ASR engines did not perform well on some categories, we analyzed the errors and grouped them into three types: substitution (the transcribed word is different from the ground truth), approximation (the transcribed word differs by a few characters from the ground truth but has a similar pronunciation), and truncation (the transcribed word omits certain characters from the ground truth). Table 5 exhibits some representative examples of substitution, approximation, and truncation errors. The first five cases show substitution errors, where the transcribed word(s) differ from the ground truth conversation; most of these substitutions change the meaning of the original words, which could result in incorrect documentation in the EHR. The next four cases are examples of approximation errors, in which the ASR transcribed the words into ones that sound similar but have entirely different meanings. In the medical domain, transcribing "hypertension" as "hypotension" or "hypoglycemia" as "hyperglycemia" would raise critical concerns.
In the truncation error type, as demonstrated in the last three cases, non-exact matches, even when only one character is omitted in the transcription, could potentially lead to a completely different representation of the patient's situation. For instance, the acronym "OPA" stands for oropharyngeal airway, but the transcribed phrase "OP" omits the character "A", making it an inaccurate representation of the patient's status.

ASR efficiency comparison
The average length of the audio files is 660 seconds. We assessed the efficiency of the different ASR engines in transcribing these audio files. The average processing times for individual audio files using Google ASR, OpenAI ASR, Amazon ASR, and Azure ASR were 211.39, 265.05, 32.38, and 331.21 seconds, respectively. Notably, Amazon ASR demonstrated significantly faster performance compared to the other three services. This could potentially be attributed to its more efficient batch processing capabilities.

Discussion
With the growing interest in leveraging ASR engines to facilitate and automate clinical documentation, a few studies have examined and evaluated their performance in recognizing medical information and automating (part of) clinical documentation during patient-provider encounters. For example, Tran et al. [11] assessed the performance of two ASR engines (Google Speech-to-Text Clinical Conversation and Amazon Transcribe Medical) in recognizing non-lexical conversational sounds (e.g., "Mm-hm," "Uh-uh") during primary care encounters. In a similar vein, a few other studies [7] have used and evaluated ASR engines for transcribing clinical conversations in domains such as primary care [8], orthopedic encounters [27], home hemodialysis [28], and telemedicine [29].
However, these prior works were primarily conducted in quieter medical settings, such as patient exam rooms, and focused on transcribing and summarizing one-on-one patient-clinician interactions. The performance of ASR engines in more dynamic, noisy, and fast-paced environments involving conversations among several people has been understudied. To the best of our knowledge, this study is the first to examine commercially available ASR engines in transcribing clinical conversations in dynamic emergency medicine settings.
Our evaluation results show that the overall quality of the transcripts generated by current state-of-the-art ASR engines falls short of accurately transcribing and recognizing medical information in the busy and noisy EMS environment. Two of them, OpenAI Speech-to-Text and Amazon Transcribe Medical with the conversation model, performed worse than the other two in recognizing information in some critical EMS EHR fields, such as "airway" and "pupils". For all four ASR engines, many pieces of content containing clinically relevant information were truncated, with many being replaced by irrelevant words, particularly within categories such as "complaint/symptoms", "treatment", and "medication". Our examination of the 40 EMS audio recordings indicates that over 87.5% of communications during EMS scenes contain clinical information within these categories. Accurate documentation in these fields is therefore essential, especially for downstream tasks such as generating real-time decision support to enhance EMS care. Failure to capture such information correctly could lead to inaccuracies in clinical documentation and potentially adverse patient safety incidents.
The results of this study indicate that there is substantial potential for improving the accuracy of state-of-the-art ASR engines in recognizing clinical information in EMS communication, especially with the advance of AI and large language models (LLMs), such as GPT-4 [30], Me-Llama [31], and others. By investigating the recordings, we are confident that some of the mis-transcribed words can be reliably determined by a human listener based on the context of the communication. This suggests that training LLMs to correct the transcription errors introduced by ASR engines should be technically possible. Our future strategy involves creating a training dataset that encompasses the common errors made by ASR engines in the EMS context. This will enable us to fine-tune LLMs to identify errors based on context, ensuring not only accurate transcription but also capturing the intended words and conveyed meaning effectively. Additionally, our future work includes field testing to assess the performance of fine-tuned LLMs in transcribing real-time clinical conversations in dynamic and noisy environments such as EMS.
However, despite the great potential of ASR engines and LLMs in transcribing clinical conversations and automating EHR documentation, there are several considerations for fully implementing such technology in real clinical settings. For example, the accuracy of current ASR engines in transcribing spontaneous speech and extracting relevant clinical information from fragmented conversations has been a concern for many healthcare clinicians [32,33]. Therefore, human validation is still required in the process of automated EMS clinical documentation to prevent any errors or biases that may arise from the system or LLMs [34][35][36]. Another consideration is related to user acceptance and usability of such tools; that is, whether these tools are perceived to be easy to use and useful, and how adopting them can impact clinicians' workflow. Limited work has engaged end-users in evaluating ASR engines, with only two exceptions where a qualitative evaluation with end-users was conducted to elicit clinicians' experience with using such tools [28,29]. Future work should take a human-centered approach to systematically evaluate the tools' usability, clinical validity, and impact on clinical workflow. Finally, as some medical settings (e.g., EMS) do not have a dedicated role for documentation and require clinicians to be hands-on almost all the time [12], it is critical to consider how to facilitate the use of ASR tools in such dynamic and hands-busy environments. Recent work has proposed using wearable devices (e.g., smart glasses [37]) to enable hands-free clinical documentation. Future work can explore the potential of combining ASR engines with hands-free, wearable devices integrated with EHR systems to better support real-time transcription and documentation in dynamic and hands-busy medical settings.
There are several limitations to this study. First, only 40 audio recordings from EMS training simulations were utilized to assess the ASR engines. The relatively limited sample size might affect the generalizability of our findings. Second, our results might not accurately reflect the actual performance of ASR systems when applied in real EMS environments, which are subject to various complicating factors, including possibly more interruptions, variations in communication between patients and clinicians, potentially increased outdoor noise levels, and operation in a moving vehicle. These conditions are expected to present greater challenges and may lead to lower ASR performance. Third, our study evaluated only four ASR engines. We acknowledge the existence of other initiatives led by academic institutions and companies (such as tecdoc.ai) aimed at developing next-generation ASR engines tailored for clinical documentation. However, some ASR engines did not provide an open API for evaluation. Finally, our assessment focused on evaluating the transcription of conversations only in the EMS context. More study is needed in other noisy and dynamic settings, such as emergency departments and trauma resuscitations, where more complex teams are present.

Conclusion
We assessed the performance of four contemporary ASR engines in analyzing conversations within EMS or prehospital care to determine the potential of using such technology to facilitate and even automate real-time clinical documentation. Our results indicate that the Google ASR engine outperformed the other three ASR engines in recognizing clinical information across most categories in the EMS EHR. However, the overall performance of all engines remains suboptimal, with significant clinical information often omitted or replaced with irrelevant words. Such errors can result in missing or incorrectly interpreted patient information and treatment during prehospital care. Therefore, there is a need to improve the performance of ASR engines to ensure accurate recognition of all critical clinical information, enabling effective automated documentation in the EMS setting. Future work should focus on enhancing ASR accuracy to minimize recognition errors and reduce patient safety risks associated with EMS clinical documentation technology.

Competing Interests
Dr. Zhan Zhang is an Associate Editor of the Journal of Health Informatics Research. The other authors declare that they have no conflicts of interest.

Figure 1. The automated content extraction framework.

Table 2
Statistics of the annotated transcription data.

Table 3
Performance comparison of the ASR engines using NER-based metrics. Note: numbers in bold highlight the best performance in each category.

Table 4
Performance comparison of the ASR engines using semantic similarity-based accuracy (similarity ≥ 0.8). Note: numbers in bold highlight the best performance in each category.

Table 5
Examples of transcription errors of each type.