All data generated or analyzed during this study are included in this published article. Overall, GPT-4 performance was extremely rapid and efficient. All questions were generated according to the secondary prompt and were presented to five blinded specialists, who were unaware of the research question and of the possibility that the questions had been written by an artificial intelligence algorithm. Of note, there is currently no option to write image-based questions, which is a major limitation in the field of medicine. In addition, the algorithm had difficulty differentiating between closely related disciplines, e.g., distinguishing general surgery from gynecological pathologies requiring surgical treatment in clinical scenarios involving the lower abdomen.
Only one question (0.5%) out of 210 required replacement due to a completely mistaken answer. This question was in the domain of surgery. A total of 13 questions had more than one possible matching answer, due either to incomplete clinical information in the question stem or to answer options that could not be definitively differentiated from each other. These questions and answers were not replaced but required correction: the question stem or the answer options were rewritten with greater precision.
In addition, 3 questions presented a patient age that was discordant with the clinical description (categorized above as either wrong questions or wrong answers). Such mistakes occurred in disciplines whose questions can be “age sensitive”: gynecology and pediatrics. For example, one question presented a 38-year-old woman with irregular menses as postmenopausal. There was also one mistake that could be classified as “gender sensitive”, once again in gynecology: the question described a male patient with an abdominal complaint, and the answer options included ectopic pregnancy and ovarian cyst rupture.
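The age-sensitive and gender-sensitive mistakes described above lend themselves to a simple automated pre-screen before specialist review. The following is a minimal sketch of such a screen, not a method used in this study. The function name `screen_question`, the keyword list, and the age threshold of 45 are all illustrative assumptions, and a keyword heuristic of this kind would miss many real inconsistencies.

```python
import re

# Hypothetical keyword list for illustration only; not part of the study.
FEMALE_ONLY_TERMS = ("ectopic pregnancy", "ovarian cyst")
# Assumed screening threshold for flagging, not a clinical rule.
MIN_PLAUSIBLE_MENOPAUSE_AGE = 45


def screen_question(stem, options):
    """Return warnings for possible age- or sex-sensitive mismatches."""
    warnings = []
    text = stem.lower()

    # Age check: a stated age below the threshold combined with "postmenopausal".
    age_match = re.search(r"(\d+)-year-old", text)
    if age_match and "postmenopausal" in text:
        age = int(age_match.group(1))
        if age < MIN_PLAUSIBLE_MENOPAUSE_AGE:
            warnings.append(f"age {age} described as postmenopausal")

    # Sex check: a male patient combined with female-only answer options.
    # The word boundary keeps "female" from matching.
    if re.search(r"\bmale\b", text):
        for option in options:
            if any(term in option.lower() for term in FEMALE_ONLY_TERMS):
                warnings.append(f"male patient with option '{option}'")

    return warnings
```

Any flagged question would still need to be judged by a specialist; the screen only narrows down which questions to look at first.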
In the chapter of internal medicine, two questions were judged by the specialist physicians to be too easy and were replaced, although they were otherwise considered qualifying.
In 2 cases, our specialists judged terminology used by GPT-4 to be outdated or inaccurate: the term SIRS (Systemic Inflammatory Response Syndrome) in the field of surgery, and the term amenorrhea used instead of irregular menses. Both questions required correction but were otherwise judged as qualified.
Overall, 3 questions were duplicates (they appeared twice in the same test) and had to be replaced with newly written ones. These questions were in the field of surgery. Additionally, three questions needed replacement because they repeated a topic beyond its appropriate weight within the exam syllabus. These questions were in the domain of internal medicine. Two questions were in the elimination format, which is considered methodologically flawed, even though no questions of this type appeared in the provided example questions.
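Duplicated stems and elimination-format questions are mechanical properties of the text, so they can be checked automatically. The sketch below is one possible approach under stated assumptions, not the procedure used in this study: the helper names `normalize`, `find_duplicates`, and `is_elimination_format` are hypothetical, only stems that match after normalization are treated as duplicates (paraphrased repeats of a topic are not detected), and the elimination format is recognized only by a capitalized "NOT" or "EXCEPT".

```python
import re
from collections import Counter


def normalize(stem):
    """Lowercase a stem and collapse punctuation and whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", stem.lower()).strip()


def find_duplicates(stems):
    """Return the normalized stems that occur more than once."""
    counts = Counter(normalize(s) for s in stems)
    return [stem for stem, n in counts.items() if n > 1]


def is_elimination_format(stem):
    """Flag stems that ask for the exception (capitalized NOT or EXCEPT)."""
    return bool(re.search(r"\b(NOT|EXCEPT)\b", stem))
```

Detecting a topic that is over-represented relative to the syllabus would require topic labels for each question and is outside the scope of this sketch.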
One of the questions contained a spelling error, writing “GI track” instead of “GI tract”.
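To put the counts reported above on a common scale, each can be expressed as a share of the 210 generated questions. The snippet below is a worked calculation only. The counts are taken from the text, but only the 0.5% figure is stated in the source; the other percentages are derived here. Because the text notes that some categories overlap, the counts are deliberately not summed. The dictionary labels and the helper `share_of_total` are illustrative names.

```python
TOTAL_QUESTIONS = 210

# Counts as reported in the text; categories may overlap, so they are not summed.
reported_counts = {
    "replaced for a wrong answer": 1,
    "more than one possible answer": 13,
    "age discordant with description": 3,
    "gender-sensitive mistake": 1,
    "too easy (internal medicine)": 2,
    "outdated or inaccurate terminology": 2,
    "duplicated within a test": 3,
    "over-represented topic": 3,
    "elimination format": 2,
    "spelling error": 1,
}


def share_of_total(count, total=TOTAL_QUESTIONS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


for label, count in reported_counts.items():
    print(f"{label}: {count}/{TOTAL_QUESTIONS} = {share_of_total(count)}%")
```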
Overall, the majority of problematic questions were in the field of surgery, reaching up to 30% of the questions in this subject. It is worth mentioning that this is the only subject in which the algorithm provided an incorrect answer. In the chapter of gynecology, 20% of the questions had inaccuracies, most commonly due to a lack of relevant clinical description. In pediatrics and internal medicine, only 10% of the questions needed some kind of correction. All MCQs written by GPT-4 in psychiatry qualified and did not require corrective measures. It should be emphasized that the original examination, which served as an example for GPT-4, had no such inaccuracies.
Presented hereafter are examples of questions necessitating replacement or correction:
Example 01
A). A question necessitating a change of the preliminary prompt due to the lack of a clinical case description as the basis of the question
Which of the following is a negative symptom of schizophrenia?
a. Hallucinations
b. Delusions
c. Anhedonia
d. disorganized speech
Example 01
B). A question generated according to the second prompt, relying on a clinical case presentation as a basis for the knowledge question
A 22-year-old male presents with disorganized speech, delusions, and hallucinations. These symptoms have been present for the past 3 months. What is the most likely diagnosis?
a. schizoaffective disorder
b. Schizophrenia
c. Major depressive disorder with psychotic features
d. bipolar disorder with psychotic features
Example 02
A). A question presented without a correct answer (content mistake)
A 45-year-old woman presents with a painful, red, swollen left leg. Duplex ultrasound shows an occlusive thrombus in the left superficial femoral vein. What is the most appropriate initial treatment?
a. Surgical thrombectomy
b. Anticoagulation therapy
c. Compression stockings
d. Elevation of the affected limb
Example 02
B). Corrected question (option b added manually as the correct answer).
A 45-year-old woman presents with a painful, red, swollen left leg. Duplex ultrasound shows a small, non-occlusive thrombus in the left superficial femoral vein. What is the most appropriate initial treatment?
a. Surgical thrombectomy
b. Non-Steroidal Anti-Inflammatory Drug
c. Compression stockings
d. Elevation of the affected limb
Example 02 shows a question with an incorrect answer regarding the treatment of superficial vein thrombosis. The answer provided by the algorithm was anticoagulation therapy, which is the correct answer for deep vein thrombosis. The above question describes a patient with superficial vein thrombosis, which is treated using non-steroidal anti-inflammatory drugs.
Example 03
A). A question using out-of-date terminology.
A 71-year-old woman is hospitalized with an Inflammatory Bowel Disease exacerbation. During her hospitalization she has fever with leukocytosis and is diagnosed with SIRS -Systemic Inflammatory Response Syndrome. Which of the following statements is correct regarding the pathophysiology of her condition?
a. It is activated mainly by the Innate Immune system
b. Vasoconstriction is caused mostly by Histamine and Bradykinin
c. Anti-cytokine medications such as Anti-TNF were proven effective in decreasing mortality
d. The Cardiac Index is decreased due to an increase in the Afterload
Example 03 shows a question using the term SIRS, which is no longer part of the sepsis diagnostic criteria; those criteria were replaced two years ago.
Example 04
A). A question using inaccurate terminology
A 34-year-old woman presents with amenorrhea for three months. She has a history of irregular menstrual cycles and hirsutism. Her BMI is 32. What is the most likely diagnosis?
a. Premature ovarian failure
b. Polycystic ovary syndrome (PCOS)
c. Hyperprolactinemia
d. Hypothyroidism
Example 04
B). A correction of the question, replacing the incorrect professional term
A 34-year-old woman presents with irregular menses (last was three months ago). She has a history of irregular menstrual cycles and hirsutism. Her BMI is 32. What is the most likely diagnosis?
a. Premature ovarian failure
b. Polycystic ovary syndrome (PCOS)
c. Hyperprolactinemia
d. Hypothyroidism
Example 04 shows a question mistakenly using the term amenorrhea when the correct description was irregular menses.
Example 05
A). A question presenting a 38-year-old woman as postmenopausal (age-sensitive question)
A 38-year-old woman presents with postmenopausal bleeding. Endometrial biopsy reveals atypical endometrial hyperplasia. What is the most appropriate management for this patient?
a. Progesterone therapy
b. Hysterectomy
c. Endometrial ablation
d. Observation and repeat biopsy in 3 months
Example 05
B). A correction of the question, changing the patient's age to 58 years
A 58-year-old woman presents with postmenopausal bleeding. Endometrial biopsy reveals atypical endometrial hyperplasia. What is the most appropriate management for this patient?
a. Progesterone therapy
b. Hysterectomy
c. Endometrial ablation
d. Observation and repeat biopsy in 3 months
Example 06
A). A question presenting a “pediatric patient” without specifying that the patient should be an infant (age-sensitive question)
A pediatric patient presents with a "blueberry muffin" rash, hepatosplenomegaly, and jaundice. Which of the following is the most likely cause?
a. Cytomegalovirus (CMV) infection
b. Congenital rubella infection
c. Congenital syphilis
d. Congenital toxoplasmosis
Example 06
B). The age of the patient was added so that the answer options are appropriate.
A 4-month-old infant presents with a "blueberry muffin" rash, hepatosplenomegaly, and jaundice. Which of the following is the most likely cause?
a. Cytomegalovirus (CMV) infection
b. Congenital rubella infection
c. Congenital syphilis
d. Congenital toxoplasmosis
Example 07
A question deemed wrong, necessitating replacement
Which of the following pediatric conditions is characterized by recurrent episodes of paroxysmal vertigo, tinnitus, and hearing loss?
a. Meniere's disease
b. Benign paroxysmal positional vertigo
c. Migraine-associated vertigo
d. Acoustic neuroma
Example 07 shows a question in which the algorithm presented Meniere’s disease, a disease classically appearing between the ages of 20 and 60, as a viable option (intended to be the correct answer) in children. This mistake should be classified as age-sensitive.
Example 08
A). A question presenting a male patient, with two of the provided answers describing gynecological pathologies (gender-sensitive mistake)
A 56-year-old male presents with acute onset of severe left lower quadrant pain, fever, and nausea. Upon examination, there is tenderness and guarding in the left lower quadrant. What is the most likely diagnosis?
a. Acute appendicitis
b. Diverticulitis
c. Ovarian cyst rupture
d. Ectopic pregnancy
Example 08
B). The gynecological pathologies were replaced
A 56-year-old male presents with acute onset of severe left lower quadrant pain, fever, and nausea. Upon examination, there is tenderness and guarding in the left lower quadrant. What is the most likely diagnosis?
a. Acute appendicitis
b. Diverticulitis
c. Acute cholecystitis
d. Liver abscess
Example 09
A question presenting a clinical case of Lyme disease without mentioning travel to high-incidence countries. This mistake could potentially be classified as geographically sensitive.
A child presents with a "bull's-eye" rash, fever, and joint pain. Which of the following is the most likely diagnosis?
a. Rocky Mountain spotted fever
b. Lyme disease
c. Erythema multiforme
d. Stevens-Johnson syndrome
Example 09 shows a question targeting a disease that has a specific geographical distribution, without mentioning travel to endemic areas.
Example 10
A). A question lacking information, which made the marked answer incorrect. While the question asks about the most common cause of anemia in children, the marked answer was physiologic anemia, which is true for neonates but not for children. Once again, this is an age-sensitive question.
Which of the following is the most common cause of anemia in children?
a. Iron deficiency
b. Sickle cell disease
c. Thalassemia
d. Physiologic anemia
Example 10
B). The question was changed to “neonates” instead of “children” to match the correct answer
Which of the following is the most common cause of anemia in neonates?
a. Iron deficiency
b. Sickle cell disease
c. Thalassemia
d. Physiologic anemia
Example 11
An elimination-format request in an overly short question (methodological mistake).
Which of the following is NOT a core symptom of autism spectrum disorder (ASD)?
a. Deficits in social communication
b. Restricted, repetitive patterns of behavior
c. Sensory sensitivities
d. Excessive worry and anxiety
Example 12
Two questions showing the tendency of the algorithm to form overly simplistic questions that require no clinical integration.
Which of the following is the primary neurotransmitter implicated in the pathophysiology of schizophrenia?
a. Serotonin
b. Norepinephrine
c. Dopamine
What is the most common cause of bronchiolitis in infants and young children?
a. Respiratory syncytial virus (RSV)
b. Influenza virus
c. Parainfluenza virus
d. Adenovirus