As noted, generative AI technologies, particularly ChatGPT, have recently seen rapid growth and widespread adoption, mirroring the transformative effects that past innovations such as the internet, Google, Wikipedia, and calculators had on the educational sector (López Martín & Martín Gutiérrez, 2023, p. 4). These technologies promise to revolutionize teaching and learning by reducing the time educators spend developing and revising instructional materials, thereby allowing more focus on direct instruction and on creating engaging, multimodal learning experiences (Shah, 2023).
However, the response from higher education institutions and educators to generative AI has been mixed. Some perceive these technologies as tools to alleviate mundane tasks and enhance focus on critical issues, while others raise concerns about threats to academic integrity when used unethically (see, e.g., Ribera & Díaz Montesdeoca, 2024; Shah, 2023). Despite these differing views, the permanence of generative AI in the educational landscape is becoming clear, shifting the debate towards how these tools can be integrated effectively, responsibly, and ethically to augment the learning experience (López Martín & Martín Gutiérrez, 2023, p. 4).
The discussion extends to the use of generative AI tools to create diverse types of instructional activities, including single-choice, multiple-choice, true/false, fill-in-the-blank, short-answer, and essay questions. Multiple-choice questions (MCQs) are a staple of online assessment in higher education, serving both formative and summative functions (Gonsalves, 2023, pp. 1-2; see also Beerepoot, 2023). Recent studies have explored ChatGPT's effectiveness in answering online quiz questions across various academic fields. Raftery (2023) examined ChatGPT versions 3.5 and 4 answering quizzes for first-year quantitative techniques modules at an Irish technological university. Accuracy increased across versions: ChatGPT-3.5 scored 35%, ChatGPT-4 achieved 47%, and ChatGPT-4 with the Wolfram plugin reached 78%. Once calculation errors were corrected, the scores improved significantly, indicating that online quizzes can be completed with high success rates with ChatGPT's assistance.
In the medical field, Gilson et al. (2022) found that ChatGPT could correctly answer over 60% of multiple-choice questions from a test bank for the United States Medical Licensing Examination (USMLE), performing comparably to a third-year medical student. Hoch et al. (2023) aimed to determine ChatGPT's accuracy on otolaryngology board certification quizzes and to investigate its performance across subspecialties. Using a question dataset from the German Society of Oto-Rhino-Laryngology, Head and Neck Surgery, they found that ChatGPT answered 57% of questions correctly, with a higher success rate on single-choice questions and variable performance across subspecialties.
Newton's (2023a) research in early 2023 found that ChatGPT performed only moderately on MCQs, particularly those requiring complex problem-solving, calculations, or the interpretation of images. By the end of March 2023, however, the introduction of ChatGPT-4 marked a significant improvement (Newton, 2023b): its performance on MCQs was described as "really good" (as cited in Raftery, 2023), especially in handling the complexities previously found challenging. This shift within a few months underscores the pace of advancement in ChatGPT's capabilities.
Notwithstanding ChatGPT's advanced capabilities, MCQs remain highly valuable for formative assessment and for facilitating self-directed learning. Furthermore, LLMs like ChatGPT can be instrumental in streamlining MCQ creation, a traditionally labor-intensive process of manually developing questions and answers. This automation can simplify and expedite test evaluation, showcasing the dual role of MCQs as both educational tools and beneficiaries of AI-driven efficiencies. Research in this area is ongoing: Tu et al. (2023) explored the potential of LLMs for learning assessment, particularly in data science education, and demonstrated that LLMs can generate questions (e.g., 10 questions on hypothesis testing). However, their study did not include answer generation or an evaluation of the questions' usability, highlighting the need for further research into the quality and effectiveness of LLM-generated assessment items.
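To make this workflow concrete, the following minimal sketch shows how such question generation might be scripted against an LLM API. It is an illustration under assumptions, not the setup used by Tu et al. (2023): the model identifier, prompt wording, and requested output format are all hypothetical choices.

```python
# Minimal illustrative sketch: prompting an LLM to draft MCQs with an answer key.
# The model name, prompt wording, and output format are assumptions for
# illustration, not the configuration used in any study cited in this section.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

prompt = (
    "Write 10 multiple-choice questions on hypothesis testing for an "
    "introductory data science course. For each question, give four options "
    "labeled A-D, exactly one of which is correct, and state the correct "
    "option on a separate line as 'Answer: <letter>'."
)

response = client.chat.completions.create(
    model="gpt-4",  # assumed model identifier; any chat-capable model would do
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)

print(response.choices[0].message.content)
```

Even with such a script, the generated items would still require human review for correctness and usability, which is precisely the evaluation step Tu et al. (2023) note is missing.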
Dijkstra et al. (2022) and Ionescu and Enescu (2023) showcase the potential of different GPT models for educational assessment. Dijkstra et al. used GPT-3 to generate reading comprehension quizzes from educational texts. Their system, EduQuiz, generates complete MCQs but has difficulty creating high-quality distractor options (cf. Section 3.2). Ionescu and Enescu, in turn, explored ChatGPT-3's capabilities for creating online MCQs and for automating the grading of essay-type answers. Their implementation successfully generated quiz questions, but automatic essay evaluation proved less reliable, with issues such as shifting grading styles and inconsistent answer formats. Together, these contrasting findings highlight the need for further development of LLM-based assessment tools, particularly in areas like distractor generation and consistent answer evaluation.
Tlili et al. (2023) report the results of interviews with educators regarding their perceptions of using ChatGPT in learning environments. The concerns raised about ChatGPT-generated quizzes were that (a) for some questions the right answer was easy to identify; (b) in one quiz, the wrong answer was always placed at the end; (c) in another quiz, the correct answer was not given; and (d) there were formatting inconsistencies. The prompts that generated the quizzes shared in that study, however, are very broad, with no detailed guidelines about expected difficulty, format, or the need to provide an answer, which likely led to unsatisfactory questions and answers.
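By way of contrast, a more constrained prompt might spell out difficulty, format, answer placement, and the answer requirement explicitly. The following is a hypothetical illustration of such a prompt, in our own wording rather than a prompt from Tlili et al. (2023):

```python
# Hypothetical example of a more constrained quiz-generation prompt.
# The wording below is our illustration, not a prompt from Tlili et al. (2023).
prompt = """
Generate 5 multiple-choice questions on hypothesis testing for a first-year
undergraduate statistics course.

Requirements:
- Intermediate difficulty: each question should require applying a concept,
  not merely recalling a definition.
- Exactly four options per question, labeled A-D.
- Vary the position of the correct option across questions.
- Make every distractor plausible (e.g., based on common misconceptions).
- After each question, state the correct option as 'Answer: <letter>' and
  give a one-sentence justification.
- Use identical formatting for all questions.
"""
```

Whether such constraints actually eliminate the problems reported by Tlili et al. is an empirical question, but each of the concerns (a)-(d) is addressed directly by one of the listed requirements.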
Fleming et al. (2023) generated 50 questions with GPT-4 in the style of the USMLE and asked physician reviewers to label them as correct or incorrect. Only 32 questions (64%) were deemed correct by all reviewers; the remaining 18 were deemed incorrect by at least one reviewer for having multiple correct answer choices (n=9), an incorrect AI-chosen answer (n=6), or no correct answer choice (n=3).
A significant limitation of all studies involving commercial LLMs like GPT-4 lies in the proprietary nature of these models. Since the companies behind them do not disclose their training data or architectures, and the models are constantly updated, research results are not reproducible. This limitation applies to the findings we present in this study as well. We nevertheless expect that, by providing a full description of our experience of creating subject competence tests, we can offer insights useful to similar future studies involving commercial or non-commercial LLMs.