In this study, a new evaluation method was proposed that objectively considers both the content and the structure of AI-generated text when evaluating models on medical questions. The method aligns with commonly used AI evaluation metrics (precision and recall), thereby minimizing the subjectivity of expert evaluations across different large models. This study then applied the method to assess the performance of several general-purpose large models on dental implant questions of different dimensions, filling a research gap in implantology regarding the performance of these models.
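To illustrate how such scoring aligns with these metrics, expert-identified key points can be treated as the unit of evaluation; the formulation below is a sketch of this assumed mapping rather than the exact rubric applied in this study:

\[
\text{Precision} = \frac{\text{key points correctly covered in the model's answer}}{\text{total points stated in the model's answer}}, \qquad
\text{Recall} = \frac{\text{key points correctly covered in the model's answer}}{\text{total key points in the expert reference}}
\]

Under this mapping, precision penalizes irrelevant or incorrect content, while recall penalizes omission of required content, which together constrain the subjectivity of purely holistic expert ratings.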
When handling simple medical questions, the Gemini Advanced model performed best, with the highest accuracy (0.80) and the smallest standard error (0.057). The p-value for the comparison with ChatGPT-4 was 0.3158, indicating no statistically significant difference. Research by Masalkhi et al. showed that Gemini AI and ChatGPT-4 performed comparably on simple ophthalmology questions(21). Additionally, studies by Andrew Mihalache et al. reported accuracy rates of 74% for Gemini and 79% for ChatGPT on simple medical questions(22, 23). In the present analysis, Claude-3, ChatGPT-4, and Gemini performed roughly comparably, with accuracy rates of 0.74, 0.72, and 0.80, respectively, and no statistically significant differences among them (p > 0.05). These findings indicate that the three models have similar capabilities in handling simple medical questions. In contrast, Qwen performed the worst, with an accuracy of only 0.60, showing statistically significant differences compared with all other models.
For complex medical questions, ChatGPT-4 achieved the highest average evaluation score (7.99 ± 1.95), indicating strong performance on difficult issues. Its median score of 8.3 suggests that a few low-scoring answers pulled down the average, but overall ChatGPT-4 showed good stability and consistency on complex questions. This result is consistent with the findings of Harriet Louise Walker et al., who observed similar reliability in ChatGPT's responses to questions on hepatopancreatobiliary (HPB) diseases(24). Compared with the other models, ChatGPT-4 performed significantly better than Claude-3 on complex questions (p = 0.001), with no significant difference compared with Gemini or Qwen (p > 0.05). Qwen's average score (7.05 ± 2.88) was slightly lower than ChatGPT-4's, but its median score was the highest (8.5), indicating that Qwen provided highly appropriate answers in certain cases, though with lower consistency. Claude-3 performed the worst on complex questions, with the lowest average score (5.64 ± 2.28) and significant differences compared with both ChatGPT-4 and Gemini (p = 0.001 and p = 0.033, respectively).
When handling specific medical case questions, the average scores of ChatGPT-4, Qwen, and Claude-3 were similar, with no significant differences (p > 0.05). Standard deviation analysis showed that ChatGPT-4 and Claude-3 had relatively consistent performance, while Qwen's performance was more variable.
In terms of diagnosis, Qwen achieved the highest and most consistent scores (10.9 ± 0.68), whereas ChatGPT-4 and Claude-3 showed greater variability (9.83 ± 2.10 and 9.88 ± 1.56, respectively). Regarding treatment plans, Qwen was the most stable (9.07 ± 0.29), while ChatGPT-4 and Claude-3 showed a wider spread of scores. For treatment planning, Qwen's score distribution was the widest (7.89 ± 4.46), indicating greater personalization and adaptability but also greater uncertainty.
Overall, for medical case questions, ChatGPT-4 performed the best, followed by Gemini. Qwen provided high-quality answers in certain cases but with less stability, while Claude-3 performed relatively poorly across all aspects. Qwen excelled in diagnosis, possibly owing to efficient feature extraction, but its performance in treatment decision-making and planning did not show the same advantage, with a large standard deviation reflecting instability in the design of specific plans. Notably, several comparisons showed significant differences: the p-values for Qwen versus Claude-3 and versus ChatGPT-4 in diagnosis were 0.001 and 0.002, respectively, indicating statistically significant differences in diagnostic performance. In contrast, the p-values for treatment plans and treatment planning were all greater than 0.05, indicating no significant differences in these areas. This suggests that all of the large models possess some degree of clinical decision-making ability in the field of dental implants. Additionally, inspection of the models' scores revealed that Qwen produced more completely incorrect responses (scores of 0, including hallucinations and misunderstandings of the cases) in treatment planning than the other two models.
This study has several limitations. The field of dental implants is complex and diverse; because this study extracted guidelines from the ITI (International Team for Implantology) and randomly selected real cases, it did not encompass all aspects of the implant field. Due to space constraints, the study did not examine the specific score differences in the evaluation criteria for questions on which significant differences between models were observed. Furthermore, insights from multidisciplinary perspectives on pre-implant periodontal treatment and post-implant restoration could provide more comprehensive treatment recommendations; such interdisciplinary integration is crucial for dental implant practice, especially in complex situations such as insufficient bone volume or occlusal disorders. Lastly, none of the large models adequately explained the key technical points and difficulties in treatment planning, which may be because their training data come from online sources that lack deep, specialized information from the field of dental implants.