Overall clinical performance
All 5 performance tests were conducted between February 3rd and April 4th, 2023, each taking around 180 minutes. Overall, GT performed strongly, with the final proportion of correctly interpreted paragraphs ranging from 83.5% (Urdu) to 95.4% (French). The proportion of paragraphs interpreted correctly on the first attempt ranged from 76.1% (Spanish) to 91.7% (Arabic; Table 1, Fig. 1).
Table 1
Clinical performance summary of Google Translate
Language | Spanish | French | Urdu | Arabic | Mandarin |
Omission of word(s) | 2 (1.8) | 2 (1.8) | 1 (0.9) | 0 (0.0) | 0 (0.0) |
Addition of word(s) | 1 (0.9) | 8 (7.3) | 5 (4.6) | 0 (0.0) | 1 (0.9) |
Word(s) left in source language | 0 (0.0) | 0 (0.0) | 2 (1.8) | 0 (0.0) | 0 (0.0) |
Registration of different word(s) | 25 (22.9) | 16 (14.7) | 16 (14.7) | 5 (4.6) | 7 (6.4) |
Partial misinterpretation of paragraph | 13 (11.9) | 2 (1.8) | 6 (5.5) | 1 (0.9) | 9 (8.3) |
Complete misinterpretation of paragraph | 12 (11.0) | 11 (10.1) | 15 (13.8) | 8 (7.3) | 7 (6.4) |
Wrong punctuation leading to misinterpretation | 1 (0.9) | 1 (0.9) | 1 (0.9) | 0 (0.0) | 1 (0.9) |
Total of misinterpretations | 26 (23.9) | 15 (13.8) | 22 (20.2) | 9 (8.3) | 17 (15.6) |
Identification and correction of misinterpretation | | | | | |
Misinterpretation caught and corrected | 19 (17.4) | 10 (9.2) | 4 (3.7) | 3 (2.8) | 3 (2.8) |
Misinterpretation caught but not corrected | 3 (2.8) | 3 (2.8) | 7 (6.4) | 1 (0.9) | 2 (1.8) |
Misinterpretation not caught | 4 (3.7) | 2 (1.8) | 11 (10.1) | 5 (4.6) | 12 (11.0) |
n: number of paragraphs with mistakes; %: proportion of all 109 paragraphs; *performance was tested for the official languages: mainland Chinese (written back-translation in simplified Chinese) and Fussah (formal Arabic).
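The first-attempt and final proportions reported above follow directly from the counts in Table 1. The short sketch below re-derives them; it is an illustration rather than the authors' analysis code, and the names (`TABLE_1`, `accuracy_summary`, `SUMMARY`) are hypothetical. It assumes that a paragraph counts as correctly interpreted in the end unless its misinterpretation was either not caught or caught but not corrected.

```python
# Counts transcribed from Table 1; each language was tested on 109 paragraphs.
TOTAL_PARAGRAPHS = 109

# language: (total misinterpretations, caught and corrected,
#            caught but not corrected, not caught)
TABLE_1 = {
    "Spanish": (26, 19, 3, 4),
    "French": (15, 10, 3, 2),
    "Urdu": (22, 4, 7, 11),
    "Arabic": (9, 3, 1, 5),
    "Mandarin": (17, 3, 2, 12),
}


def accuracy_summary(total_mis, corrected, uncorrected, not_caught,
                     n=TOTAL_PARAGRAPHS):
    """Return (first-attempt %, final %) of correctly interpreted paragraphs."""
    # Sanity check: the three outcome rows must add up to the total row.
    assert corrected + uncorrected + not_caught == total_mis
    first_attempt = 100 * (n - total_mis) / n
    # Assumption: only uncorrected and uncaught misinterpretations remain wrong.
    final = 100 * (n - uncorrected - not_caught) / n
    return round(first_attempt, 1), round(final, 1)


SUMMARY = {lang: accuracy_summary(*counts) for lang, counts in TABLE_1.items()}

for lang, (first, final) in SUMMARY.items():
    print(f"{lang:<9} first attempt {first:5.1f}%   final {final:5.1f}%")
```

Under these assumptions the sketch reproduces the reported ranges: 76.1% (Spanish) to 91.7% (Arabic) on the first attempt, and 83.5% (Urdu) to 95.4% (French) after repetition.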
Interpretations were more accurate from English to the target languages than from non-English languages to English. In all languages, GT was highly sensitive to dialects, and its performance depended on the ability of speakers to adhere to the pronunciation of the official language. This was particularly important in Mandarin and Arabic: mainland Chinese and Fussah (formal Arabic) were well interpreted, whereas performance was lower when a Chinese dialect or one of the 14 Arabic dialects was spoken. Interpretations were also better in paragraphs that were neither too long nor too short, allowing the application to digest the input while still putting single words into context. In all languages, but most often in Urdu, Spanish and Arabic, pronouns (he/she and it) were often misinterpreted. Frequently, the male gender was used as the default. Interpretations also consistently used informal language and addressed subjects on first-name terms.
Types of mistakes
Omissions, additions, and words left in the source language were mistakes the application rarely made. Registrations of wrong words were more common, particularly in the Romance languages (French and Spanish). All types of mistakes were more likely to lead to misinterpretations if they affected words central to the meaning of the paragraph, completely changed a word’s meaning, or affected multiple words in a row. For example, the French to English interpretation of “this should calm him down” remained the same even if the tense of the verb was registered incorrectly in French (cela “devait” instead of “devrait” le calmer). However, it led to a complete misinterpretation of a paragraph in French when the English words “little one” were registered as “Italy won”.
Identified mistakes leading to misinterpretations
The continuous real-time check of the back-translation by both speakers (clinician and patient) allowed for the detection of all misinterpretations that were based on registrations of wrong words or misunderstood syntax in the language of the speaker. Frequently, these mistakes were due to the dialect of the speaker or a slightly unclear pronunciation, and they became less frequent over the course of the performance test. The real-time check, with repetition of the paragraph if a mistake was detected, increased the proportion of correctly interpreted paragraphs per language by 2.8 (Mandarin) to 17.4 (Spanish) percentage points (Table 1, Fig. 1). Once a mistake leading to a misinterpretation was detected, only one (0.9%; Arabic) to 7 paragraphs (6.4%; Urdu) remained uncorrected after the maximum of 2 additional attempts, highlighting the overall strong performance of GT.
Unidentified mistakes leading to misinterpretations
In Spanish, most of the misinterpretations that were not caught by the back-translation originated from the homonymous use of “si” (meaning “yes” or “if” depending on the context). If the application chose the wrong meaning in the target language, the message was frequently not conveyed. In French, unidentified misinterpretations were very rare but important. The sentence “il est vraiment enrhumé” (he’s pretty congested) was interpreted as “he is really pissed off”. The interpretation was correct if the adverb (vraiment) was left out. In Arabic, Urdu and Mandarin, unidentified misinterpretations were more common. In all 3 languages, “congestion” (runny nose) was interpreted as “traffic jam” and the height of the fever was interpreted as “tall”. In Urdu and Mandarin, the verb “drinking” was misinterpreted as “drinking alcohol”, whereas the interpretation was correct if the verb “feeding” was used. Potentially dangerous misinterpretations included misinterpretations of numbers in Urdu. While the Urdu back-translation was correct, the Urdu sentence “normally he would have had like 2 or 3 extra (diapers)” was interpreted into English as “normally he would have between one and 2000 diapers.” (Fig. 2a). The height of the fever was also only interpreted correctly if English numbers and decimals were used by the Urdu-speaking caregiver, a practice that, according to the Pakistani interpreter and MD, is used by some but not all families. Numbers were also difficult to interpret correctly in Mandarin. Particularly if a Taiwanese dialect was spoken, 4 and 10 were at risk of being misinterpreted. As Mandarin has a very different structure for describing the past, present and future, mistakes in tense happened occasionally, potentially leading to misinterpretations. The application made relatively few misinterpretations in Arabic if Fussah was spoken and common language was used.
As in all languages, interpretations were very literal and sometimes led to misinterpretations: the specific meaning of “term” in “term baby” was lost when “term” was interpreted literally as “word”, yielding “word baby”. However, some sentences where mistakes were common in other languages were well interpreted in Arabic (Fig. 2b).
Misinterpretations by language category
While slang words were well interpreted in most languages, the interpretation of medical terms was more prone to mistakes and varied between languages (Table 2).
Table 2
Misinterpretation by language category (lay/medical/slang)
Language | Spanish | French | Urdu | Arabic | Mandarin |
Paragraphs including slang (n = 8) | n (%) | n (%) | n (%) | n (%) | n (%) |
Registration of different word | 2 (25.0) | 0 (0.0) | 1 (12.5) | 0 (0.0) | 0 (0.0) |
Misinterpretation | 2 (25.0) | 0 (0.0) | 1 (12.5) | 0 (0.0) | 0 (0.0) |
Misinterpretations caught | 1 (12.5) | 0 (0.0) | 1 (12.5) | 0 (0.0) | 0 (0.0) |
Paragraphs using common language (n = 81) | | | | | |
Registration of different word | 17 (21.0) | 12 (14.8) | 8 (9.9) | 4 (4.9) | 6 (7.4) |
Misinterpretation | 18 (22.2) | 12 (14.8) | 11 (13.6) | 5 (6.2) | 11 (13.6) |
Misinterpretations caught | 16 (19.8) | 11 (13.6) | 6 (7.4) | 3 (3.7) | 4 (4.9) |
Paragraphs including medical terms (n = 20) | | | | | |
Registration of different word | 6 (30.0) | 4 (20.0) | 7 (35.0) | 1 (5.0) | 1 (5.0) |
Misinterpretation | 6 (30.0) | 3 (15.0) | 10 (50.0) | 4 (20.0) | 6 (30.0) |
Misinterpretations caught | 5 (25.0) | 2 (10.0) | 4 (20.0) | 1 (5.0) | 1 (5.0) |
n: number of paragraphs wrongly interpreted; %: proportion of language subcategory that was wrongly interpreted
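The percentages in Table 2 are simply each count divided by the size of its language category (8 slang, 81 common-language and 20 medical paragraphs, 109 in total). The sketch below, again with hypothetical names (`CATEGORY_SIZES`, `MISINTERPRETED`, `rate`, `totals_by_language`), recomputes the misinterpretation rates and sums the per-category counts for each language so they can be compared with the totals in Table 1.

```python
# Category sizes and "Misinterpretation" rows transcribed from Table 2.
CATEGORY_SIZES = {"slang": 8, "common": 81, "medical": 20}

MISINTERPRETED = {
    "slang": {"Spanish": 2, "French": 0, "Urdu": 1, "Arabic": 0, "Mandarin": 0},
    "common": {"Spanish": 18, "French": 12, "Urdu": 11, "Arabic": 5, "Mandarin": 11},
    "medical": {"Spanish": 6, "French": 3, "Urdu": 10, "Arabic": 4, "Mandarin": 6},
}


def rate(category, language):
    """Percentage of paragraphs in a category that were misinterpreted."""
    count = MISINTERPRETED[category][language]
    return round(100 * count / CATEGORY_SIZES[category], 1)


def totals_by_language():
    """Sum misinterpretations over all categories for each language."""
    languages = MISINTERPRETED["slang"].keys()
    return {
        lang: sum(MISINTERPRETED[cat][lang] for cat in CATEGORY_SIZES)
        for lang in languages
    }


for category in CATEGORY_SIZES:
    rates = ", ".join(
        f"{lang} {rate(category, lang):.1f}%" for lang in MISINTERPRETED[category]
    )
    print(f"{category:<8} {rates}")

print(totals_by_language())
```

The per-language sums (26, 15, 22, 9 and 17) match the "Total of misinterpretations" row of Table 1, and the rates for medical terms range from 15.0% (French) to 50.0% (Urdu).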
As there is no word for crackles or wheezes in Mandarin or Urdu, the application mimicked the sounds instead, potentially confusing or startling parents with the sudden sound output from GT. In Mandarin, Spanish, French and Arabic, some other common medical terms like COVID or bronchiolitis were well interpreted. In Urdu, many medical terms do not exist and English terms are used instead, which complicated their interpretation because the application attempted to interpret every word. The professional interpreter and clinician also described a cultural reluctance to speak about sex/reproductive organs in Urdu, which created an additional challenge. As there is no commonly used term for “vagina” in Urdu, the application used a very poetic term (“the channel where the menstrual blood flows”), which many parents would hardly understand.
Legal/policy evaluation
From a policy perspective, the jurisdictional survey revealed that access to language interpretation in healthcare was inconsistent across sites and jurisdictions and was under-funded by both public and private payment mechanisms. The content analysis uncovered various legal concerns that accompany the use of AI-based interpretation tools, including the paucity of legal mechanisms (e.g., human rights and other laws) available for patients and families to assert a right to access adequate and timely interpretation. Data privacy and security risks were also identified, as well as the risk of medical malpractice for physicians due to over-reliance on poorly validated translations (Table 3).
Table 3
Legal considerations regarding the use of AI-based language interpretation
| Data Privacy | Data Security | Consent for use of GT | Avoid causing harm due to use of GT |
Patient or substitute decision maker | Ask provider whether, where and how your personal health information (PHI) is being transferred and/or shared. | Ask provider whether and how your PHI is protected/secured. | Make sure you understand the risks and benefits of using GT prior to providing consent. | Request an alternative form of translation services if you have discomfort or concerns about using GT. |
Healthcare (HC) facility | Unapproved / unsupported use of GT could indicate that providers are insufficiently resourced, possibly giving rise to vicarious liability for privacy breach. If this tool is being used unofficially, HC facility should revisit its language translation policies and, to mitigate risk, educate providers about how to protect privacy of PHI and develop pathways for safe utilization of GT as appropriate. | If ongoing use of tool is anticipated, HC facility should ensure providers have access to a private version of GT that is housed in architecture meeting or exceeding data security standards. | Providers should be trained to understand the risks and benefits of using GT and be able to explain these to a patient in order to obtain informed consent. Alternative translation options should be made available for patients who decline use of GT and require language translation in order to receive healthcare services. | Ensure that providers are not forced to use GT due to inadequate access to other translation services. Alternative language translation services should be made available for scenarios where patients cannot adequately use GT safely, decline to consent to GT use, or it is not clinically appropriate. |
HC provider | Minimize risk by ensuring GT does not collect PHI. If PHI will be collected and/or transferred, understand how, why and for what downstream use, if any. Seek advice to ensure compliance with relevant privacy laws and local best practice guidelines. | First, minimize risk by ensuring GT does not collect PHI. If PHI will be transferred, understand whether, how and where it will be stored and how it is protected (e.g., data encryption). Seek advice to ensure data is adequately secured. | Make sure patients understand any risks and benefits related to data privacy, security, and possible risks associated with inaccurate translation. Explain alternatives to use of GT, even if none are available. | Ideal if tool has received regulatory approval for intended use as required by local laws. Accuracy of tool for intended use case(s) must be adequately validated to the standard that a reasonable physician in similar circumstances with similar resources would agree that its use is appropriate. |
No gold standard for language interpretation in health care settings and underfunding
The Canadian Charter of Rights and Freedoms establishes a right to an interpreter during legal proceedings, but not in healthcare encounters [25]. In fact, the Supreme Court of Canada has explained that having public healthcare does not mean all products and services must be covered [26]. This leaves provinces/territories, municipalities or facilities to take on the “budgetary burden” [27] of implementing translation policies [28]. Similar approaches have been taken across many jurisdictions around the world [30].
If using an AI-based tool, healthcare professionals generally need to ensure the translation provided is accurate and unlikely to cause harm to patients [29]. Using a professional language interpretation service can help limit the risk of liability for harm caused. If, however, a patient requires medical care and no formal language interpretation service is available, using language interpretation devices might be the only way to understand and address the patient’s needs. To minimize the risk of legal liability for harm caused, providers should make the kinds of decisions that another reasonable provider might make in similar circumstances and with similar resources [30]. Knowledge about the limitations of an AI-based application’s performance in the language used is therefore key to enabling physicians to make well-informed decisions.
Data privacy and security considerations
As long as the personal health information (PHI) collected is only that which is necessary to provide and support care, privacy legislation generally allows providers to collect and share PHI among treating healthcare workers (i.e., “the circle of care”) without the patient’s express consent. This implied consent exception also allows PHI to be used for particular tasks related to quality improvement (QI) at healthcare facilities [31]. However, third-party translation tools are commonly hosted on servers that are owned or leased by the company that created the tool. In the case of GT, this means that words are spoken into the app, transmitted to a server where the language interpretation model resides, interpreted on that server, and then transmitted back to the app interface. Unless the GT tool runs entirely locally (i.e., is stored on the user’s device), questions about data privacy, security and stewardship need to be addressed. Users who download these tools will be prompted via a user agreement to provide consent for the tool to process their data/inputs, but what the consent entails will have privacy implications that require assessment by legal experts. Differences in model architecture need to be accounted for in the specific legal analysis. However, for the aforementioned reasons, physicians are sometimes forced to use language interpretation tools in the absence of comprehensive legal evaluations by their facility’s legal/privacy department. Data and security risks could be reduced by obtaining informed consent from the patient to use the device. Personal identifiers should still be avoided while using language interpretation tools. Many AI-based language interpretation tools like GT can, for a cost, be hosted on private cloud servers dedicated specifically to a healthcare institution, allowing for compliance with local privacy laws and enabling compliant use of these tools.