Among the Chatbots evaluated, ChatGPT3 demonstrated the best performance, identifying 15 (71%) of the research priorities. Despite not being connected to the internet and having a knowledge cut-off of November 2021, ChatGPT3 showcased its ability to leverage its deep understanding of human language to provide insightful responses. These findings suggest that ChatGPT3 may be a valuable tool for researchers and healthcare professionals, providing a framework for generating important research questions. Nevertheless, it is crucial to acknowledge that this iteration of ChatGPT has constraints in terms of its knowledge base. It is also worth highlighting that when asked whether it had access to the JLA priorities, it explicitly denied having such access. Furthermore, more recent iterations such as ChatGPT4, equipped with internet connectivity, might provide a broader range of information; however, ChatGPT4 is a subscription-based, paid-for service.
In contrast, Bard, developed by Google, had access to the internet but identified only nine (43%) research priorities. Its performance was stronger for TKA, with poorer responses regarding THA. This discrepancy highlights the variation in Chatbot performance depending on the specific context or query, emphasising the need for further investigation to understand and address these limitations. Additionally, although Bard is capable of providing sources for the information it presents, it did not produce any. Interestingly, when asked whether it had access to the JLA priorities, it produced a response listing the 10 priorities together with links to the website where they can be accessed.
It is important to note that Bard may have a limited ability to recall previous conversations, as mentioned in the Bard FAQ.[27] The extent of Bard's memory capabilities remains uncertain; this could explain why, despite acknowledging access to the JLA priorities, the responses Bard gave after initial prompting were poorly aligned with our research priorities.
While Bing's answers were not as comprehensive as those of ChatGPT3, they offered credible and relevant references to our ‘prompts’. Examples include a prospective study on enhanced recovery programmes,[28] the impact of COVID-19 on lower limb arthroplasty mortality and morbidity,[29] a review of the lower limb arthroplasty literature assessing publication bias,[30] and a review of the literature on rheumatoid arthritis and lower limb arthroplasty.[31] However, the same references were reused in response to our follow-up ‘prompts’. These findings indicate that Bing's responses may be limited when generating novel information and offering new references. Nevertheless, the references provided by Bing offer researchers valuable additional resources, enabling them to augment their search and gain increased confidence in the obtained results. In addition, questions remain as to why there is a significant gap in the detail provided by Bing compared with ChatGPT3, despite Bing being powered by the more recent LLM, GPT4.
All three Chatbots generated responses related to research questions that were not mentioned in our comparator (JLA). ChatGPT3 specifically emphasised the need for further investigation in areas such as implant infection prevention and the utilisation of emerging technologies such as AI. Bard also identified research questions, including the need to mitigate complications following arthroplasty; however, Bard's responses were often generic and not directly relevant to the provided prompt or our research question. Similarly, Bing presented various responses that did not align with our identified research gaps. Although these responses were sourced from credible references, they failed to address the specific research priorities we were focusing on. However, it is crucial to acknowledge that JLA priorities undergo extensive filtering in the early stages before publication. Therefore, while the responses generated by these Chatbots may not align with the final published versions, they might still contain valuable content that could have been utilised during the development stages.[13]
Despite the potential benefits of Chatbots, our analysis revealed that none of the Chatbots addressed five key priorities pertaining to hip and knee arthroplasty. The absence of attention to these key priorities raises concerns about the effectiveness and comprehensiveness of utilising Chatbots in the context of hip and knee arthroplasty research. It suggests that the current implementation of Chatbots may not fully realise their potential to bridge gaps in knowledge or incorporate expert opinion in this specialised area. Despite this, Chatbots do demonstrate promising potential as a valuable adjunct to expert opinion and could be highly beneficial in facilitating the generation of a comprehensive list of ideas.
To the best of our knowledge, no previous research has explored the use of Chatbots for generating research ideas in the field of orthopaedic surgery. A recent study assessed the ability of a Chatbot LLM to identify research questions within the domain of gastroenterology.[32] This study prompted ChatGPT3 to generate responses related to four key research areas in gastroenterology. The responses were subsequently evaluated by a panel of five individuals, comprising three gastroenterology consultants and two AI experts. By comparison, our comparator (JLA) represents a rigorous, externally recognised, multi-stakeholder process. Nevertheless, the authors identified the potential of using ChatGPT to identify research gaps in other specialities.
In another study, researchers explored the capability of ChatGPT3 to simulate a Google search that a potential patient might perform.[19] They then compared the answers generated by ChatGPT3 with the results obtained from an actual Google web search. By evaluating the responses produced by ChatGPT3 against those obtained from conventional search engines such as Google, researchers have begun assessing the potential of AI Chatbots to deliver pertinent and precise information in clinical contexts.
The JLA PSP requires significant resources, including the review of systematic reviews and other studies, with experts dedicating considerable time to the process. By incorporating Chatbots, the process could potentially be accelerated and resource utilisation reduced. A substantial portion of time is devoted to the initial scoping exercise, systematic reviews, and scoping surveys. This research suggests that automating this stage could generate essential themes that could then be thoroughly discussed by an expert panel, ultimately leading to the formulation of specific research questions.
This study pioneers the utilisation of three Chatbots to identify research gaps within the realm of lower limb arthroplasty. While previous literature has explored the potential of Chatbots in scientific research, specifically in the context of paper writing and draft editing,[10] our study uniquely explores their capacity as research idea generators. Other studies have examined the use of ChatGPT3 in medical school exams in the USA[33] and in general surgery board exams in Korea.[34] Further studies have shed light on the reliability of content generated by Chatbots when summarising medical literature reviews, highlighting the concerning possibility of Chatbots producing unfounded statements or even fabricated information.[35] Similar instances have been observed outside healthcare, as exemplified by a recent court case in which a legal representative submitted fabricated citations generated by ChatGPT3 in support of their argument.[36]
Strengths and Limitations
One of the key strengths of our paper is the utilisation of JLA priorities as a comparator, which distinguishes it from other studies that rely solely on expert opinion. By incorporating JLA priorities, we established research priorities prior to conducting our study.
The dynamic and iteratively changing nature of Chatbots, together with the potential for different answers to be generated by other users or at different times, makes the methodology presented in this study challenging to reproduce; our results may therefore not be reproducible.
We must also acknowledge the limited number of ‘prompts’ used to generate responses for each Chatbot, and that the ‘prompts’ used in this study may not have been the most efficient or effective for obtaining a comprehensive picture of the gaps in research for hip and knee arthroplasty. Chatbot LLMs draw from vast amounts of information, making consistent output difficult to ensure. Future studies or processes utilising Chatbot LLMs for similar purposes should carefully consider these limitations and potential variations in results.
It is prudent to recognise the potential drawbacks of employing Chatbots to generate research ideas. Chatbots rely on LLMs trained on extensive data drawn from both reliable and unreliable sources, and LLMs may lack the ability to differentiate between the reliability of these sources. Consequently, there is a risk of incorporating inaccurate or misleading information into the research ideas generated by Chatbots. Moreover, LLMs have the potential to amplify existing biases present in the literature: since they learn from the data they are trained on, which might contain inherent biases, the output produced by Chatbots could inadvertently reinforce or perpetuate these biases.[37] Additionally, the reasoning behind the outputs generated by Chatbots remains opaque. While Chatbots can generate responses and ideas, the internal workings and decision-making processes behind those outputs are not readily understandable.
This lack of transparency makes it challenging to critically evaluate or validate the ideas suggested by Chatbots. Enhancing the transparency and explainability of Chatbot LLMs would enable researchers to better understand the reasoning behind the generated ideas and evaluate their suitability for further investigation.[38]