Artificial intelligence (AI) is revolutionizing our world. As an emerging technology, it has shown enormous potential in numerous fields, as evidenced by the arrival of self-driving vehicles, smart devices, and automated personal assistants [1]. This progress has sparked conversation about its possible implications in other domains, particularly education.
The basis for AI’s use in education lies in large language models (LLMs): computational tools that can comprehend, learn from, and generate human-like text. These models are trained on extensive data sets, using algorithms that discern relationships between words and grammatical structures in order to establish language patterns [2]. From these patterns, LLMs derive rules by which they can produce their own coherent and contextually relevant sentences.
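As a toy illustration of this pattern-then-generate idea, the sketch below learns word-to-word transition statistics from a tiny corpus and samples new text from them. This is not the transformer architecture that GPT models actually use; the corpus and code are illustrative only, but the loop of learning statistical patterns and then generating text from them is the same in spirit.

```python
import random
from collections import defaultdict

# Tiny illustrative corpus; real LLMs train on billions of words.
corpus = ("the patient has a fever . the patient has a cough . "
          "a fever is common .").split()

# "Training": count which words follow which other words.
transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

# "Generation": repeatedly sample a plausible next word.
word = "the"
output = [word]
for _ in range(8):
    candidates = transitions.get(word)
    if not candidates:
        break
    word = random.choice(candidates)
    output.append(word)

print(" ".join(output))
```

Each generated word is drawn from the distribution of words observed after the current one, so the output is locally coherent even though the model has no understanding of meaning; transformer-based LLMs extend this idea by conditioning on far longer contexts with vastly more parameters.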
A prime example is OpenAI’s ChatGPT, which in its fourth iteration builds upon a transformer-based model and introduces several pivotal improvements over prior versions and other models [3]. The most notable is its scale: the parameter count grew from 175 billion in GPT-3.5 to a reported 1 trillion in GPT-4 [4]. By comparison, Bing’s current AI model has 175 billion parameters and Google’s Bard has 540 billion [5, 6]. GPT-4 therefore has a substantially larger knowledge base, which enables it to produce more nuanced and accurate responses than its competitors.
Owing to ChatGPT’s widespread acclaim as the fastest-growing consumer application, GPT-4 has since been subjected to numerous testing environments, passing the Uniform Bar Exam, the Law School Admission Test, the Scholastic Aptitude Test, the Graduate Record Examination, and Advanced Placement exams with high percentiles [7, 8]. In this paper, we focus on GPT-4’s performance on the United States Medical Licensing Examination (USMLE) Step 1, the first of three board exams required of medical students. It is administered over 8 hours and consists of seven blocks of 40 questions each, covering the basic sciences underlying the practice of medicine [9]. In 2022, 29,039 examinees from US and Canadian schools sat the exam, and 91% passed [10].
Earlier studies demonstrated the success of the prior models, GPT-3 and GPT-3.5, which scored above the passing threshold on USMLE Step 1 questions, provoking discussion of their use as question analyzers and educational resources [11, 12]. All of these studies, however, reported only GPT’s overall performance; none assessed performance within the individual subjects and disciplines represented on the USMLE Step 1 exam. This information is needed to identify potential content weaknesses before the model can be recommended as a learning tool.
In this study, we expand upon the initial findings with GPT-3 and GPT-3.5 by using the newer GPT-4 model and reporting performance by subject and discipline. The primary objective was to determine GPT-4’s raw performance on the question sets; the secondary objective was to identify potential implications of GPT-4 for medical education.
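This paper does not detail its grading pipeline at this point; purely as a hedged illustration of the primary objective, the sketch below shows one way to pose multiple-choice items to GPT-4 through the openai Python package (v1.x interface) and tally accuracy per subject. The question records, field names, and prompt wording are hypothetical placeholders, not the study’s actual materials.

```python
from collections import defaultdict
from openai import OpenAI  # assumes the openai package, v1.x interface

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical question records; fields are illustrative only.
questions = [
    {"subject": "Biochemistry",
     "stem": "A 24-year-old presents with ...",
     "choices": "A) ... B) ... C) ... D) ...",
     "answer": "B"},
    # ... remaining items
]

correct = defaultdict(int)
total = defaultdict(int)
for q in questions:
    prompt = (f"{q['stem']}\n{q['choices']}\n"
              "Answer with the single letter of the best choice.")
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # minimize sampling variability when grading
    )
    guess = resp.choices[0].message.content.strip()[0].upper()
    total[q["subject"]] += 1
    if guess == q["answer"]:
        correct[q["subject"]] += 1

# Per-subject accuracy, mirroring the subject-level reporting above.
for subject in total:
    print(f"{subject}: {correct[subject]}/{total[subject]}")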