Artificial intelligence (AI) systems are revolutionizing various domains of medicine by providing advanced tools for differential diagnosis generation, clinical decision support, and the analysis of imaging, physiologic, and genomic data [1]. Among these systems, large language models (LLMs) such as ChatGPT, developed by OpenAI, have shown significant potential in medical education and professional examinations.
ChatGPT, particularly its latest iteration GPT-4, represents a substantial advancement over its predecessor GPT-3.5. It was trained on a diverse corpus of text through both supervised and unsupervised learning techniques, followed by reinforcement learning with human feedback [2]. While earlier versions of ChatGPT demonstrated near-passing performance on general medical examinations, GPT-4 has exhibited notable improvements across various specialized medical exams, including the United States Medical Licensing Examination (USMLE) and several board certifications in fields such as neurosurgery, dermatology, orthopedic surgery, and radiology [3–6].
The performance of GPT-4 on these exams underscores its enhanced capability to understand and process complex medical information. For instance, GPT-4 scored significantly better than GPT-3.5 on the USMLE, achieving a 20% improvement in accuracy [3]. This suggests that GPT-4 can not only recall factual information but also perform higher-order reasoning, a critical aspect of medical problem-solving.
GPT-4o introduces several features that distinguish it from previous models. Notably, it integrates multimodal input capability, allowing it to process and generate text based on both textual and visual inputs, though this functionality was not publicly available at the time of this study [2]. Additionally, GPT-4o offers enhanced contextual understanding and problem-solving abilities, achieved through a more extensive and nuanced training dataset. The model also incorporates improved fine-tuning processes, leveraging human feedback to refine its responses more effectively. These advancements are designed to enhance GPT-4o's accuracy, coherence, and relevance, making it a powerful tool for complex and specialized applications, such as those required in medical board examinations.
Despite these promising results in other medical specialties, the performance of GPT-4 on oral and maxillofacial surgery (OMFS) board exam questions has not yet been examined. OMFS is a highly specialized surgical field encompassing a wide range of complex clinical scenarios, from craniofacial trauma to oncological procedures, and requiring a deep understanding of both surgical principles and clinical management. Given the unique challenges and specificity of OMFS, it is uncertain how well GPT-4 will perform in this domain, despite its success in other fields.
This study aims to evaluate GPT-4's performance on OMFS board exam questions, addressing this gap in the current research. Using a 250-question mock board examination, we assess GPT-4's proficiency in this specialized area. The objectives are to determine the accuracy of GPT-4's responses, compare its performance with that of human examinees, and identify specific strengths and limitations in its handling of specialized medical content. In doing so, we aim to elucidate the potential role of advanced LLMs in medical education and their possible future applications in clinical practice.