In this study we explored the applicability of LLMs in generating medical questions, specifically multiple-choice questions (MCQs) for medical examinations.
MCQs are an essential component of medical exams, used in almost every aspect of medical education [8, 9]. Yet they are time-consuming and expensive to create [28]. AI-generated questions could provide an important opportunity for the medical community and transform the way written tests are produced. Using LLMs to support these tasks can save time and money and reduce burnout, especially in a system already sustaining itself on limited resources [29].
AI benefits
Physician burnout, poor mental health and growing personal distress have been studied extensively [30]. Academic physicians, however, face a unique set of additional challenges, such as increased administrative work, less time with patients and growing clinical responsibilities. As a result, they have less time for traditional academic pursuits such as research and education [31, 32, 33]. In the famous words of Albert Einstein: “Bureaucracy is the death of any achievement”.
AI can potentially relieve medical educators of tiresome bureaucracy and administrative work, allowing them to focus on the areas they view as most personally meaningful and to avoid career dissatisfaction [32, 34].
According to Bond et al., another possible application of AI in medical education is grading patient notes, which can provide additional formative feedback for students in the face of limited faculty availability [35]. Moreover, AI can assist medical students by creating personalized learning experiences while providing access to current, up-to-date information [36]. These are only a few examples, as new AI-assisted applications are discovered every day.
AI drawbacks
AI continues to evolve and is becoming more integrated into various medical fields [37]. A revolution in healthcare seems inevitable: AI is fast and efficient, drawing on what seem like endless data resources [38]. In almost every study we reviewed, the LLMs' performance was more than satisfactory, with a consensus that AI is capable of producing valid questions for medical exams. However, while these models show promise as an educational tool, their limitations must be acknowledged.
One notable limitation is a phenomenon known as “hallucination” [39]. This occurs when generative AI produces outputs that are fluent and plausible but factually incorrect or unsupported by the given prompt. When relying on AI to write high-quality MCQs, such errors are unacceptable. Furthermore, AI's ability to integrate contextual and sensory information is still not fully developed: currently, AI cannot understand non-verbal cues or body language. Bias in training data and resulting inaccuracies are also troubling [40, 41].
Another consideration is the logistics necessary to implement AI in healthcare and education. New technologies require training, commitment and investment in order to be maintained and managed sustainably, and such a process takes time and energy [42]. Moreover, implementing new technology raises concerns about privacy and data security [43, 44]. Patient data are sensitive and a frequent target for cyber-attacks [45].
An equally important limitation of AI integration in healthcare is accountability. The AI “black box” refers to the “knowledge within the machine”: the internal workings of the system are invisible to the user. Healthcare staff use the AI, write the input and receive the output, but the system's code and logic cannot be questioned or explained [46].
An additional aspect to consider is the longstanding concern that AI will replace human jobs [47], which could foster dislike of and resistance to AI integration. Although such replacement is unlikely in the near future, distrust in AI technology is yet another challenge to its implementation [48].
Perhaps the biggest concern about AI in medical education is its potential to impair students' critical thinking. According to Van de Ridder et al., self-reflection and criticism are crucial for a medical student's learning process and professional growth. In a reality where students can delegate to ChatGPT tasks such as writing personal reflections on learning experiences, they deny themselves the opportunity to self-reflect and grow as physicians [49].
It is imperative to take these significant shortcomings and challenges into consideration. AI should be integrated into medical education wisely and responsibly.
MCQs creation
For each study we examined the process of crafting the MCQs and noticed a wide range of approaches to writing the prompts. In some studies, additional modifications were made in order to improve the validity of the questions. This emphasizes the importance and sensitivity of prompts: prompt engineering may be a task that requires specific training, so that prompts are phrased correctly and MCQ quality is not impaired.
Limitations
Our review has several limitations. Most of the studies are retrospective in nature. Due to heterogeneity in study design and data, a meta-analysis was not performed. None of the questions were image- or graph-based, although such questions are an integral part of medical exams. Three studies did not base their prompts on a valid medical reference, such as previous exams or an approved syllabus; these studies also did not evaluate the questions after they were generated. Two studies were at high risk of bias.
Additional studies are needed to further establish the usefulness of AI tools, particularly in generating high-quality medical exam questions.
Lastly, we limited our search to PubMed/MEDLINE because of its relevance to biomedical research. We recognize that this choice narrows the review's scope and may exclude studies indexed in other databases, limiting the diversity of insights.