The present study explored the potential application of LLMs to AIDs. A total of 46 questions covering the concept, clinical features, report interpretation, diagnosis, prevention and treatment, and prognosis of AIDs were entered into ChatGPT 3.5, ChatGPT 4.0, and Gemini independently. The replies generated by the three chatbots were collected and evaluated independently by experienced laboratory specialists across five quality dimensions: relevance, completeness, correctness, helpfulness, and safety.
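The five-dimension rating scheme described above can be summarized per reply by averaging across raters. The sketch below uses hypothetical 1-5 Likert scores from three raters for a single reply; the dimension names come from the study, but the scores and rater count are illustrative, not study data.

```python
from statistics import mean

# Hypothetical 1-5 Likert scores from three independent raters for one chatbot
# reply, across the five quality dimensions used in the study. The numbers are
# illustrative only.
ratings = {
    "relevance":    [5, 4, 5],
    "completeness": [3, 4, 3],
    "correctness":  [4, 4, 5],
    "helpfulness":  [4, 3, 4],
    "safety":       [5, 5, 5],
}

def dimension_means(ratings):
    """Average each dimension's scores across raters, rounded to 2 decimals."""
    return {dim: round(mean(scores), 2) for dim, scores in ratings.items()}

print(dimension_means(ratings))
```

Per-reply means can then be averaged again over all 46 questions to give the per-dimension scores compared between chatbots.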
Our study demonstrated that ChatGPT 3.5 and Gemini can provide limited help in healthcare, whereas with the continued advancement of LLMs, ChatGPT 4.0 might be applied to the clinical practice of AIDs. Specifically, ChatGPT 4.0 performed best, providing replies to AID-related questions with good relevance, correctness, completeness, helpfulness, and safety; its replies were also the longest. ChatGPT 3.5 and Gemini provided relevant and safe responses to questions related to AIDs while performing only moderately in completeness, correctness, and helpfulness. Indeed, compared with ChatGPT 3.5, ChatGPT 4.0 has improved semantic understanding and can process longer conversational contexts, which enables it to generate more correct and helpful responses. Consistent with our findings, the safety of ChatGPT 4.0's responses has also been reported to be improved[22]. These performance improvements, together with algorithmic differences among the chatbots, may explain the differences in their replies.
Overall, our data showed that ChatGPT 3.5, ChatGPT 4.0, and Gemini performed well in relevance, correctness, and safety when answering conceptual questions. Nevertheless, ChatGPT 3.5 performed less satisfactorily than ChatGPT 4.0 in completeness and helpfulness for these questions. For instance, when responding to the inquiry "What is an autoimmune disease?", ChatGPT 4.0 went beyond the mere definition provided by ChatGPT 3.5: it delved deeper into the intricacies of the condition, giving a detailed breakdown of the characteristics unique to each type of autoimmune disease. Thus, the replies of ChatGPT 4.0 were more comprehensive and helpful than those of ChatGPT 3.5. Consistent with our results, when ChatGPT was used to answer frequently asked questions about urinary tract infection, 92.6% of questions were answered correctly and adequately[23]. ChatGPT 3.5 turbo has likewise shown lower accuracy on SLE-related clinical questions[16].
Interpretation of laboratory reports may require strong semantic comprehension, logical reasoning, and the integration of results from multiple tests. Indeed, as the number of parameters increases, ChatGPT 4.0 significantly surpasses its predecessor ChatGPT 3.5 in semantic understanding and logical reasoning[22]. When solving clinical laboratory problems, ChatGPT 4.0 showed considerable performance in identifying cases and answering questions, with an accuracy of 88.9%, whereas ChatGPT 3.5 and Copy AI achieved 54.4% and 86.7%, respectively[21]. In our study, ChatGPT 4.0 scored higher than ChatGPT 3.5 and Gemini on all quality dimensions when answering report-interpretation questions. We speculate that ChatGPT 3.5 and Gemini consider only the situation in which the pattern of change in the laboratory results matches exactly, whereas ChatGPT 4.0 also accounts for circumstances in which only some indicators in the report change, and accordingly identifies several possible AIDs. Therefore, ChatGPT 4.0 may reduce the probability of misdiagnosis and provide safer and more helpful replies to patients and clinicians.
ChatGPT 3.5, ChatGPT 4.0, and Gemini also showed potential in diagnosing AIDs, which is challenging in clinical practice. In our study, when answering diagnosis-related questions, all three chatbots performed well in relevance, correctness, helpfulness, and safety, with scores greater than 4, and ChatGPT 4.0 and Gemini outperformed ChatGPT 3.5 in completeness. Similarly, ChatGPT has been shown to effectively highlight key immunopathological and histopathological characteristics of Sjögren's syndrome and to identify potential etiological factors[18].
Our study also revealed the potential of LLMs to assist in the prevention and treatment of certain diseases. In our study, ChatGPT 3.5, ChatGPT 4.0, and Gemini showed good relevance, correctness, and safety in answering prevention- and treatment-related questions, but ChatGPT 4.0 outperformed ChatGPT 3.5 and Gemini in completeness and helpfulness. In an assessment of LLMs providing information on methotrexate administration to patients with rheumatoid arthritis, the accuracy of ChatGPT 4.0's outputs reached 100%, ChatGPT 3.5 scored 86.96%, and BARD and Bing each scored 60.87%. In addition, ChatGPT 4.0 achieved 100% comprehensiveness, followed by ChatGPT 3.5 at 86.96%, BARD at 60.86%, and Bing at 0%[17].
The performance of LLMs, particularly ChatGPT 4.0, in answering AID-related questions is still far from perfect, and much work remains before LLMs can be applied in clinical practice. First, the replies of these LLMs need to be more comprehensive; continuously updating the medical knowledge in the training data and improving the algorithms will enable LLMs to give more comprehensive replies. Second, the accuracy of LLMs' replies to AID-related questions also needs further improvement. Retrieval-augmented generation, which grounds the model in embeddings of customized domain data, can make LLM outputs more specific and reduce hallucinations.
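The retrieval step at the heart of retrieval-augmented generation can be illustrated with a minimal sketch: rank a small knowledge base by term overlap with the question and prepend the best passage to the prompt. The corpus entries below are hypothetical examples; a real system would use dense embeddings, a vector store, and an actual LLM call, all of which are omitted here.

```python
# Toy AID knowledge base (illustrative passages, not a real curated corpus).
corpus = [
    "Antinuclear antibodies (ANA) are a common screening test for systemic lupus erythematosus.",
    "Methotrexate is a first-line disease-modifying drug for rheumatoid arthritis.",
    "Anti-SSA/Ro and anti-SSB/La antibodies are associated with Sjogren's syndrome.",
]

def retrieve(question, corpus, k=1):
    """Return the k passages sharing the most lowercase words with the question."""
    q_terms = set(question.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q_terms & set(p.lower().split())))
    return scored[:k]

def build_prompt(question, corpus):
    """Assemble a grounded prompt: retrieved context first, then the question."""
    context = "\n".join(retrieve(question, corpus))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context."

print(build_prompt("Which antibodies suggest Sjogren's syndrome?", corpus))
```

Because the model is instructed to answer only from the retrieved context, unsupported claims become easier to detect, which is the mechanism by which this approach reduces hallucinations.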
Our study has some limitations. First, a human comparison arm was not included: having clinical specialists answer the same questions and comparing their responses with those of the LLMs would give a clearer picture of the gap between LLMs and clinical practice and point to directions for subsequent improvement. Second, the repeatability of the LLMs' replies, which is important in clinical practice, was not evaluated. In addition, we entered the AID-related questions in Chinese, and the performance of LLMs may be influenced by the input language[22]. Finally, the subjectivity of the graders when scoring the quality of the replies may have affected the results to some extent.
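The grader-subjectivity limitation noted above is conventionally quantified with an inter-rater agreement statistic such as Cohen's kappa, which corrects raw percent agreement for agreement expected by chance. The sketch below computes it for two raters; the score vectors are illustrative, not study data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical scores to the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of exact agreements.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 1-5 quality scores from two graders on ten replies.
a = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
b = [5, 4, 3, 3, 5, 2, 4, 4, 3, 4]
print(round(cohens_kappa(a, b), 3))  # → 0.718
```

Reporting kappa alongside the mean quality scores would make the influence of grader subjectivity explicit rather than leaving it as an unquantified caveat.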