The expert ratings for both GPT-3.5 and DeepL indicated, on average, favorable assessments of translation quality. However, the subjectivity of the evaluation and the lack of explicit evaluation criteria, such as completeness, comprehensibility, or technicality, pose challenges. Despite these limitations, the ratings were in line with those from a pre-study, suggesting consistent expert evaluations and a common understanding of what constitutes an accurate translation.
Statistical analysis via the Mann‒Whitney U test revealed no significant differences in the rating distributions between GPT-3.5 and DeepL, either for the 100 randomly selected terms or for the 20 common terms. This suggests that both machine translators performed comparably in terms of translation quality.
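The comparison can be reproduced with a standard implementation of the test; the following is a minimal sketch using scipy, with hypothetical rating vectors that do not correspond to the study data.

```python
# Minimal sketch of the two-sided Mann-Whitney U test used to compare expert
# ratings of the two translators; the rating vectors are hypothetical.
from scipy.stats import mannwhitneyu

ratings_gpt35 = [5, 4, 5, 3, 4, 5, 4, 5, 5, 4]  # hypothetical 1-5 expert ratings
ratings_deepl = [4, 5, 5, 4, 4, 5, 3, 5, 4, 5]  # hypothetical 1-5 expert ratings

u_stat, p_value = mannwhitneyu(ratings_gpt35, ratings_deepl, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
# A p-value above the chosen significance level (e.g., 0.05) indicates no
# significant difference between the two rating distributions.
```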
High ratings for terms with more than seven words, such as "Elevated proportion of CD4-negative, CD8-negative, alpha-beta regulatory T cells" or "Anomalous insertion of papillary muscle directly into anterior mitral leaflet", show promising results even for more complex terms, with DeepL performing slightly better. That multi-word terms were translated comparably well to their shorter counterparts may be attributed to the additional contextual information they provide.
In assessing interrater reliability, the study revealed a high degree of homogeneity among the ratings for the 20 common terms. This resulted in low values for both the ICC and Fleiss's kappa, indicating that traditional measures of interrater reliability may not be suitable in such cases of minimal variance and uniform ratings. In addition, our analysis revealed instances where the same expert rated identical German translations produced by both translators differently, indicating some inconsistency in rating assignment (i.e., limited intrarater reliability).
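To illustrate this effect, the following sketch computes both measures for hypothetical, nearly uniform ratings; it assumes the pingouin and statsmodels packages and does not use the study data.

```python
# Minimal sketch: ICC and Fleiss's kappa on nearly uniform (hypothetical) ratings.
import numpy as np
import pandas as pd
import pingouin as pg
from statsmodels.stats import inter_rater as irr

# Rows: terms, columns: raters (hypothetical 1-5 ratings with little variance).
ratings = np.array([
    [5, 5, 5, 4],
    [5, 5, 5, 5],
    [4, 5, 5, 5],
    [5, 5, 4, 5],
])

# ICC expects long format: one row per (term, rater) pair.
long_format = pd.DataFrame(
    [(t, r, ratings[t, r])
     for t in range(ratings.shape[0])
     for r in range(ratings.shape[1])],
    columns=["term", "rater", "rating"],
)
icc = pg.intraclass_corr(data=long_format, targets="term",
                         raters="rater", ratings="rating")

# Fleiss's kappa expects a terms x categories count table.
counts, _ = irr.aggregate_raters(ratings)
kappa = irr.fleiss_kappa(counts, method="fleiss")

print(icc[["Type", "ICC"]])
print(f"Fleiss's kappa: {kappa:.2f}")
# With near-uniform ratings, both measures come out low (or undefined) even
# though the raters largely agree, illustrating the limitation discussed above.
```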
To validate translation quality, an independent reference translation from the HeTOP database was employed for 15 of the 20 common terms. The Jaro–Winkler similarity metric revealed high similarity between the machine-generated translations and the HeTOP reference translations. However, potential nuances must be acknowledged, as the threshold may exclude moderately similar yet semantically relevant translations. In cases where the similarity only slightly exceeds the predefined threshold, as in the comparison between "Bauchschmerzen" and "Abdominalschmerzen" with a similarity value of 0.62, the degree of similarity requires careful examination, since the overlap in this case lies solely at the end of the term.
There are various metrics for measuring text similarity, including the Levenshtein distance, cosine similarity, and Jaccard similarity. However, Jaro‒Winkler stands out because it weights matching prefixes (the beginning of words) more heavily, which is useful for capturing similarities related to singular/plural differences and thus improves the detection of semantically equivalent terms. Metrics such as BLEU (bilingual evaluation understudy), which is used in many translation studies [2], are designed primarily for evaluating machine translations via n-gram decomposition against reference texts and are not necessarily suitable for direct 1-to-1 string comparisons, such as our comparison of the 20 common terms with the HeTOP database.
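The prefix weighting can be illustrated with a few term pairs; the sketch below assumes the jellyfish package, and both the pairs and the threshold are illustrative rather than taken from the study.

```python
# Minimal sketch of Jaro-Winkler comparisons against reference translations;
# term pairs and threshold are illustrative, not the study data.
import jellyfish

THRESHOLD = 0.6  # illustrative cut-off, not necessarily the one used in the study

pairs = [
    ("Bauchschmerzen", "Abdominalschmerzen"),  # overlap only at the end of the term
    ("Krampfanfall", "Krampfanfälle"),         # singular/plural, shared prefix
]

for machine, reference in pairs:
    sim = jellyfish.jaro_winkler_similarity(machine, reference)
    status = "above threshold" if sim >= THRESHOLD else "below threshold"
    print(f"{machine} vs. {reference}: {sim:.2f} ({status})")
# Because Jaro-Winkler boosts matching prefixes, pairs that differ only in
# their endings (e.g., plural forms) score high, whereas pairs whose overlap
# lies at the end of the term score close to the threshold and warrant review.
```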
When comparing a machine translation with a reference translation, several limitations and challenges can affect the evaluation process. These include subjectivity: different medical experts may have different interpretations of, and preferences for, how a particular text should be translated.
It was challenging for some experts to evaluate the quality of the translated synonyms in comparison to their English counterparts. There was a tendency to evaluate the synonyms in relation to the main term. This intricacy is attributable to the specific study design and could be mitigated through the adoption of a randomized presentation format for the translations under evaluation.
Challenges in interpretation arose in cases of spelling errors in translations, such as "Hypoesthesie" instead of "Hypoästhesie" in DeepL. Experts also observed instances where English synonyms were inaccurately associated with specific terms. For example, for the term "fractured facial bone", one of the 100 randomly selected terms, an English synonym stored as "bone facial bone" appeared to be mislabeled and might more appropriately read "broken facial bone". Since this made it difficult to evaluate the quality of the translated synonym, the rating for this synonym was removed from the overall rating.
GPT-3.5 has several limitations, such as the risk of generating incorrect or biased translations. Providing additional details and context in the prompt could improve the accuracy and quality of the translation, especially with respect to medical terminology, e.g., noting that many terms have their roots in Latin or Greek. However, we acknowledge that optimizing language models such as GPT-3.5 falls under the domain of prompt engineering and that simply adding more information does not guarantee improved results.
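As an example of what such a context-enriched prompt might look like, the following sketch uses the OpenAI Python client (openai >= 1.0); the prompt wording is illustrative and is not the prompt used in this study.

```python
# Minimal sketch of a context-enriched translation prompt; the wording is
# illustrative and not the prompt used in this study.
from openai import OpenAI

client = OpenAI()  # expects the OPENAI_API_KEY environment variable to be set

system_prompt = (
    "You are a medical translator. Translate HPO terms from English into "
    "German. Note that many terms have Latin or Greek roots and that "
    "established German clinical terminology should be preferred."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,  # deterministic output is preferable for terminology work
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Hypoesthesia"},
    ],
)
print(response.choices[0].message.content)
```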
One possible approach to improving translation quality is to combine translations from multiple translation engines and select among the resulting translation candidates. This can even be done on the basis of different input languages and support languages [18].
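One conceivable selection strategy is to keep the candidate that agrees most closely with the other engines' outputs; the sketch below illustrates this idea only and is not the method of [18]. The commented wrapper calls producing the candidate list are hypothetical, and the jellyfish package is assumed for the similarity computation.

```python
# Minimal sketch of a consensus-based candidate selection across several
# translation engines; the candidate list is hypothetical.
import jellyfish

def select_candidate(candidates: list[str]) -> str:
    """Return the candidate with the highest mean similarity to all others."""
    def mean_similarity(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(jellyfish.jaro_winkler_similarity(candidates[i], c)
                   for c in others) / len(others)
    best = max(range(len(candidates)), key=mean_similarity)
    return candidates[best]

# candidates could be gathered as [translate_deepl(term), translate_gpt(term), ...]
candidates = ["Bauchschmerzen", "Abdominalschmerzen", "Bauchschmerz"]
print(select_candidate(candidates))  # picks the "consensus" translation
```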
In addition to analyzing the translation quality of commonly used HPO terms and the influence of term length, an alternative approach could involve a range of medical experts in the selection of HPO terms on the basis of their significance, difficulty of translation, phenotype, and frequency of use. However, this approach is not without limitations. It is conceivable that medical experts without expertise in translation may subjectively assess the complexity of terms, which could lead to inconsistencies in the selection process.
The generalizability of our results to other languages must be viewed critically. In this study, the focus is clearly on translating terminology into German and investigating how well an automated process performs. For validation purposes, it was important to us that medical experts were fluent in both the source language and the target language. However, DeepL has more than 30 source and target languages and can therefore be used for many languages, including French, Korean and Spanish [3]. The GPT models also include various languages in their training data.
The accelerated development of large language models has led to the introduction of newer GPT models during and after the course of this study; these models are anticipated to bring further innovations and enhancements [9, 19]. To assess whether the study findings transfer to the translation performance of current models, the 20 common terms were retranslated with the GPT-4o model. This resulted in translations that were identical except for a few singular/plural differences and minor adjustments to the translations of "Diminished physical functioning" and "Hypoesthesia." The Jaro–Winkler similarity between the GPT-3.5 and GPT-4o translations was 0.99, whereas the similarity between DeepL and GPT-4o was identical to that between DeepL and GPT-3.5, namely 0.76. These values indicate comparable results and suggest that the findings remain valid for the newer model.
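The model-to-model comparison reduces to averaging pairwise Jaro–Winkler similarities over the aligned translation lists; a minimal sketch, again with hypothetical term pairs rather than the study data, is shown below.

```python
# Minimal sketch of comparing two translation sets via mean pairwise
# Jaro-Winkler similarity; the term lists are hypothetical.
import jellyfish

translations_gpt35 = ["Krampfanfall", "Kleinwuchs", "Hypoästhesie"]
translations_gpt4o = ["Krampfanfälle", "Kleinwuchs", "Hypästhesie"]

similarities = [jellyfish.jaro_winkler_similarity(a, b)
                for a, b in zip(translations_gpt35, translations_gpt4o)]
print(f"mean similarity: {sum(similarities) / len(similarities):.2f}")
```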
Notably, our study revealed that 75% of the common HPO terms had German reference translations in the HeTOP database. Given the limited sample size, these results do not yet reach statistical significance. However, considering the paucity of studies on extensive translations, these findings underscore the incomplete coverage of translated medical terminology and highlight the importance of our study, particularly for the documentation and diagnosis of rare diseases, where precise distinctions in disease characteristics are vital.