This mixed-methods study was performed to critically evaluate the structural quality of AI-generated MCQs in comparison to human-generated items. AI items were prospectively generated, while human-authored items were retrospectively sourced from an existing content bank, as further detailed below.
Item structure and test blueprint construction
A single-best-answer MCQ format was employed. A complete item included a detailed contextual stem, a question, five options (with the correct answer indicated), and explanatory text articulating the logic of the correct versus incorrect options. As explained below, the explanatory text was evaluated separately from the other elements because it is not universally included in standard MCQs.
A test blueprint was constructed to emulate a standard medical school examination at the level of the graduating student. Content areas included Medicine, Surgery, Paediatrics, Obstetrics, Gynaecology, Psychiatry, Population Health, and General Practice. A total of 125 items were included, comprising 40 from each of three sources: Novices, Experts, and AI. A surplus of five Expert human-generated items was included in the scoring process, as these items were intended for future use in a mock examination in which only satisfactory items would be retained, and an element of redundancy was required for that purpose. An excerpt of the test blueprint is given in Appendix 1.
Development of a standardised scoring system
A standardised scoring rubric was developed to facilitate consistent evaluation of human- and AI-generated MCQs. It incorporated content validity (encompassing factual accuracy, fidelity, and realism), scope, correct item anatomy, specific item-writing flaws, and cognitive skill level. The rubric drew on established frameworks, including a modified Bloom's taxonomy and item-writing guidelines (Haladyna et al., 2002), and is presented in Tables 1-3. A global impression criterion was included as a proxy for whether the item was considered fit for use in a summative examination for graduating medical students. A separate secondary evaluation assessed the quality of each item's explanatory feedback text for comprehensiveness, veracity, and articulation of clinical reasoning.
Table 1. Standardised scoring rubric of all MCQs

CORE ITEM ELEMENT | Score key
Content validity: The item has content validity, being factually accurate and realistic to clinical practice | 1 = entirely does not meet criteria; 2 = mostly does not meet criteria; 3 = mostly meets criteria; 4 = entirely meets criteria
Within scope: The item tests concepts that are within scope for the target audience of a graduating medical student | 1 = entirely does not meet criteria; 2 = mostly does not meet criteria; 3 = mostly meets criteria; 4 = entirely meets criteria
Item anatomy: The anatomy of the item is correct and complete | 1 = entirely does not meet criteria; 2 = mostly does not meet criteria; 3 = mostly meets criteria; 4 = entirely meets criteria
Item-writing flaws (IWF): How many item-writing flaws are present, and of what type (content, style, formatting, stem, options)?* | Numeric count; type of IWF also documented
Cognitive skill level: What is the cognitive skill level of the item? | Modified Bloom's taxonomy: Level I = remembering; Level II = understanding; Level III = applying, analysing, evaluating, and creating
Global impression (structural): Global impression of the stem, question, and options: this item is fit for use in a summative examination for graduating medical students | No (unsalvageable); No (major further editing); Yes (minor further editing); Yes (no further editing)

SCORING OF EXPLANATORY TEXT | Score key
Feedback comprehensiveness: The feedback was appropriately comprehensive, addressing the correct option and distractors | 1 = entirely does not meet criteria; 2 = mostly does not meet criteria; 3 = mostly meets criteria; 4 = entirely meets criteria
Feedback veracity and clinical reasoning: The science and clinical reasoning in the feedback was satisfactory | 1 = entirely does not meet criteria; 2 = mostly does not meet criteria; 3 = mostly meets criteria; 4 = entirely meets criteria
Global impression (overall): This item, including its written feedback, is fit for use in a summative examination for graduating medical students | No (unsalvageable); No (major further editing); Yes (minor further editing); Yes (no further editing)

* Referencing item-writing guidelines as laid out by Haladyna et al. (2002).
Table 2. Modified Bloom's taxonomy

Level | Cognitive domains
Level I | Remember (identifying and retrieving information)
Level II | Understand (interpreting and summarising information)
Level III | Apply, analyse, evaluate, and create (implementing, organising, and critiquing information)
Table 3. Examples of item-writing guidelines (adapted from Haladyna et al., 2002)
● Content concerns: Use novel material to test higher-level learning; paraphrase textbook language, or language used during instruction, when it is used in a test item, to avoid testing for recall.
● Formatting concerns: Format the item vertically instead of horizontally.
● Style concerns: Use correct grammar, punctuation, capitalisation, and spelling.
● Writing the stem: Include the central idea in the stem instead of the choices; word the stem positively and avoid negatives such as NOT or EXCEPT.
● Writing the choices: Place choices in a logical or numerical order; keep the length of choices about equal; avoid "all of the above"; avoid giving clues to the right answer, such as pairs or triplets of options that point the test-taker to the correct choice; make all distractors plausible.
Human-generated MCQs – Novice and Expert
A total of 85 human-generated MCQs were sourced from an existing Australian commercial medical education provider (eMedici Pty Ltd, Adelaide, Australia; https://emedici.com). This content bank is derived from submissions by medical students and junior doctors, which pass through a pipeline of peer review, expert clinician review, and editorial approval prior to acceptance. Human authors are provided with detailed written item-writing guidelines covering style and item anatomy at the time of item submission. Only items tagged as testing higher-order cognitive skills were included in this study. Items were randomly selected from the content bank based on their recorded topic by two authors who were otherwise blinded to item content. Of the 85 human items, 40 were written by a non-expert and had not passed through peer review or any other editorial process, and were therefore deemed to be at 'Novice' level of authorship, while 45 had been edited and/or approved by subject matter experts and were deemed 'Expert' level. Subject areas were matched between groups.
AI-generated MCQs
GPT-4 was used in this study (model version gpt-4-0125-preview) based on its favourable reported performance against the Massive Multitask Language Understanding benchmark (OpenAI, 2023). A programmed script submitted the prompt and key learning points (created as described below) to GPT-4 to generate outputs in an unsupervised fashion, such that all authors remained blinded to the GPT-4 outputs.
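For illustration, the sketch below shows how such a programmed script might call the OpenAI Chat Completions API with this model version; the helper function, prompt assembly, and file names are hypothetical and do not reproduce the authors' actual script.

```python
# Illustrative sketch only: the helper function, prompt assembly, and file
# names are assumptions, not the authors' actual script.
import json

from openai import OpenAI  # openai Python library, v1.x interface

client = OpenAI()  # API key is read from the OPENAI_API_KEY environment variable


def generate_item(prompt_text: str) -> str:
    """Submit one fully assembled prompt and return the raw GPT-4 output."""
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",  # model version reported in this study
        temperature=0.0,             # fixed for reproducibility (see below)
        messages=[{"role": "user", "content": prompt_text}],
    )
    return response.choices[0].message.content


# Outputs are written straight to disk without inspection, so the authors
# remain blinded to the generated items until panel scoring.
with open("assembled_prompts.json") as f:        # hypothetical input file
    prompts = json.load(f)

with open("ai_generated_items.json", "w") as f:  # hypothetical output file
    json.dump([generate_item(p) for p in prompts], f, indent=2)
```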
Prompt engineering
Construction of a tailored prompt for GPT-4 took place across three reference group meetings of the six-member author panel, which has broad educational, item-writing, clinical, and technical expertise. The aim was to develop a generic prompt template that maximised the potential of GPT-4 to produce structurally sound items testing higher-order cognitive skills, and that could be easily adapted to a wide range of learning points or item topics with minimal subsequent human effort. The prompt was engineered incrementally, with each output assessed subjectively until quality was deemed to have reached a ceiling. The prompt template included the following elements (an illustrative sketch of such a template follows the list):
- Information on the setting and the target audience of the MCQ;
- The inclusions and exclusions in the clinical stem to meet basic item anatomy requirements;
- Advice on avoidance of specific item-writing flaws, with instructions sourced from a full taxonomy of item-writing guidelines by Haladyna et al. (2002), as outlined in Table 3;
- Instruction on the number of question options and distractors;
- Instruction to produce explanatory feedback including clinical reasoning for the answer and distractors of the MCQ;
- Instruction to include references to recent peer-reviewed articles;
- Five examples of peer-reviewed, high-quality MCQs covering a range of medical topics; and
- A key learning point of the intended MCQ in the form of a factual statement, which included the question topic (in accordance with the test blueprint – Appendix 1).
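For illustration only, a hypothetical template incorporating these elements might be assembled per learning point as sketched below; the wording and structure are assumptions and do not reproduce the engineered prompt used in this study.

```python
# Hypothetical prompt template reflecting the listed elements; the wording is
# illustrative only. `exemplar_mcqs` would hold the five peer-reviewed example
# items, and `learning_point` the key learning point from the test blueprint.
PROMPT_TEMPLATE = """\
You are writing a single-best-answer MCQ for a summative examination for
graduating medical students.

Requirements:
- Provide a detailed, realistic clinical stem, a lead-in question, and five
  options (one correct answer and four distractors), with the correct answer
  indicated.
- Follow the item-writing guidelines of Haladyna et al. (2002); avoid flaws of
  content, style, formatting, stem, and options.
- Test higher-order cognitive skills (application, analysis, evaluation).
- Provide explanatory feedback with clinical reasoning for the correct answer
  and each distractor.
- Include references to recent peer-reviewed articles.

Example items:
{exemplar_mcqs}

Key learning point for this item:
{learning_point}
"""


def build_prompt(learning_point: str, exemplar_mcqs: str) -> str:
    """Fill the generic template for one learning point."""
    return PROMPT_TEMPLATE.format(
        exemplar_mcqs=exemplar_mcqs, learning_point=learning_point
    )
```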
Variability of GPT-4 outputs
Among the input variables of the GPT-4 interface is ‘temperature’, which broadly determines the level of variety in subsequently generated text. This parameter ranges from 0 to 2, with a lower value resulting in more consistent outputs. Preliminary investigations have yet to identify the ideal temperature for medical MCQ generation (Agarwal et al., 2024), and it is likely to vary in different settings. To maximise reproducibility, we used a temperature of 0.0.
To confirm the predictability of outputs at temperature 0.0, six learning points were each used to generate three consecutive outputs without any interim prompt modification. These 18 items were evaluated by a consensus panel of five authors against the scoring rubric, then independently reviewed by another author. The results of this variability testing are given in Appendix 2.
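A sketch of this variability check, reusing the hypothetical build_prompt and generate_item helpers from the earlier sketches, is shown below; the learning-point and exemplar variables are placeholders.

```python
# Sketch of the variability check: three consecutive generations per learning
# point at temperature 0.0, with no prompt modification between calls.
# SIX_LEARNING_POINTS and EXEMPLAR_MCQS are placeholders; build_prompt and
# generate_item are the hypothetical helpers sketched above.
variability_items = []
for learning_point in SIX_LEARNING_POINTS:
    prompt_text = build_prompt(learning_point, EXEMPLAR_MCQS)
    for repeat in range(1, 4):
        variability_items.append({
            "learning_point": learning_point,
            "repeat": repeat,
            "item": generate_item(prompt_text),
        })

# len(variability_items) == 18, matching the 6 x 3 design described above.
```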
References
The veracity of the references generated by GPT-4 to support the explanatory feedback was also evaluated using the 18 'variability testing' items. References were assessed against the criterion: 'The references included were real, relevant to the MCQ, formatted, and peer-reviewed.' Items were scored on a scale of 1 (entirely does not meet criteria), 2 (mostly does not meet criteria), 3 (mostly meets criteria), or 4 (entirely meets criteria). Specific inaccuracies in the references were documented.
Item appraisal - consensus panel scoring
All items, AI- and human-generated, were pooled and then evaluated in random order, using the prespecified scoring rubric, by a consensus panel of five authors blinded to the origin of each item. One duplicate item was identified and excluded.
Examples of a novice, expert, and AI-generated item used in this study are presented in Appendix 3.
The panel, by majority vote, also recorded its prediction of whether each item was authored by a novice, an expert, or GPT-4.
Ethics approval
This project received approval from the University of Adelaide Human Research Ethics Committee (HREC-2023-285).
Data analysis
No identifiable data were involved in this study. Mean scores for measures of item quality were compared between author types using ANOVA with post-hoc Bonferroni or Tamhane tests as appropriate (the latter was used where the largest group variance was at least double the smallest). The distributions of the global impression scores (including and excluding feedback) were tallied and presented as percentages of items. The distributions of item-writing flaws and cognitive skill levels were tallied by author type. Summary descriptions are provided for the frequency of correct answer identification and placement, and for assessments of referencing quality. A p value <0.05 was considered significant.
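As an illustration of the comparison described above, the sketch below assumes per-item scores are held in a pandas DataFrame with 'author_type' and 'score' columns; the file name, column names, and choice of scipy, statsmodels, and scikit-posthocs are assumptions rather than the analysis software actually used.

```python
# Illustrative analysis sketch only; the data file, column names, and
# libraries are assumptions, not the authors' actual analysis pipeline.
from itertools import combinations

import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

df = pd.read_csv("item_scores.csv")  # hypothetical file: one row per item score
groups = {name: g["score"].to_numpy() for name, g in df.groupby("author_type")}

# One-way ANOVA comparing mean scores across Novice, Expert, and AI items.
f_stat, p_anova = stats.f_oneway(*groups.values())

# Variance-ratio rule: Tamhane post-hoc tests when the largest group variance
# is at least double the smallest, otherwise Bonferroni-corrected pairwise tests.
variances = [g.var(ddof=1) for g in groups.values()]
if max(variances) >= 2 * min(variances):
    import scikit_posthocs as sp  # provides Tamhane's post-hoc test
    posthoc_p = sp.posthoc_tamhane(df, val_col="score", group_col="author_type")
else:
    pairs = list(combinations(groups, 2))
    raw_p = [stats.ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]
    reject, p_bonferroni, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")
```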