An overview of the workflow we applied to select clinical case reports, query GPT-3·5 and GPT-4, and evaluate the results is illustrated by the flowchart in Figure 1.
Selecting clinical case reports
In order to base our study on a representative, unbiased selection of realistic cases, we screened all case reports published in German casebooks by Thieme and Elsevier. The main clinical specialties internal medicine, surgery, neurology, gynecology and pediatrics were selected. A total of five casebooks from Thieme and seven from Elsevier were utilized. Of these, five books from each publisher were explicitly associated with the aforementioned medical specialties. Additionally, two supplementary casebooks from Elsevier focusing on “Rare Diseases” and “General Medicine” were analyzed in order to also include disease cases of very low incidence and cases from outpatient settings. This selection generated a pool of 1,020 cases.
To study the performance of ChatGPT in relation to a disease’s incidence, cases were categorized into three frequency groups as follows: a disease was considered “frequent” if its annual incidence was higher than 1:1,000, “less frequent” if its annual incidence was between 1:10,000 and 1:1,000, and “rare” if its incidence was lower than 1:10,000. Only if no incidence information was available was a disease classified as “rare” when its prevalence was lower than 1:2,000. Power calculations were conducted to determine a sample size of 33-38 cases per group in order to achieve a total power of 0·9 (details in Supplementary Information, section 1·1 Power calculation).
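For illustration only, this categorization rule can be expressed as a short R function; the argument names below are ours and do not correspond to any data structure used in the study.

```r
# Sketch of the frequency-group rule described above; argument names are hypothetical.
classify_frequency <- function(incidence_per_year, prevalence = NA) {
  if (!is.na(incidence_per_year)) {
    if (incidence_per_year > 1 / 1000)  return("frequent")
    if (incidence_per_year > 1 / 10000) return("less frequent")
    return("rare")
  }
  # Fallback when no incidence information is available: classify by prevalence
  if (!is.na(prevalence) && prevalence < 1 / 2000) return("rare")
  NA_character_
}

classify_frequency(1 / 500)       # "frequent"
classify_frequency(1 / 5000)      # "less frequent"
classify_frequency(NA, 1 / 5000)  # "rare" (prevalence fallback)
```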
To limit the scope of the subsequent analysis, a random sample of 40% of all cases was drawn, taking the distribution of sources into account, resulting in 408 cases for examination.
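A minimal R sketch of this sampling step, assuming a hypothetical data frame `cases` with a `source` column identifying the casebook, might read as follows.

```r
# Draw a 40% random sample within each source (illustrative; `cases` is hypothetical)
library(dplyr)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
sampled_cases <- cases %>%
  group_by(source) %>%
  slice_sample(prop = 0.4) %>%
  ungroup()
```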
The majority of cases followed a consistent structure of sections comprising medical history, current symptoms, examination options and findings, actual diagnosis, differential diagnoses and treatment recommendations. For each medical subspecialty, a physician reviewed all cases and included a case for further analysis when the following criteria were met: (1) the patient is able to provide the medical history him-/herself (e.g. excluding severely traumatized patients), (2) images are not required for diagnostic purposes, (3) the diagnosis does not overly rely on laboratory values, and (4) the case is not a duplicate. A total of 153 cases were identified as meeting the established inclusion criteria.
Incidence rates for the selected case studies were researched and the cases were sorted into the three previously mentioned frequency groups. To ensure a balanced representation, we aimed to include an equal number of cases from each medical specialty while considering both the incidence rates and the publishing source. Sampling from each book and within each frequency group continued until predefined limits were reached: a maximum of seven cases per book or up to twenty cases per medical specialty.
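This capped selection can be illustrated by the simplified greedy R sketch below; the `book` and `specialty` column names are assumed, and the actual selection was performed manually and additionally balanced the frequency groups.

```r
# Illustrative greedy sampling with per-book and per-specialty caps (hypothetical columns)
sample_with_caps <- function(cases, max_per_book = 7, max_per_specialty = 20) {
  cases <- cases[sample(nrow(cases)), ]  # shuffle cases into random order
  keep  <- logical(nrow(cases))
  book_counts <- setNames(rep(0, length(unique(cases$book))), unique(cases$book))
  spec_counts <- setNames(rep(0, length(unique(cases$specialty))), unique(cases$specialty))
  for (i in seq_len(nrow(cases))) {
    b <- as.character(cases$book[i])
    s <- as.character(cases$specialty[i])
    if (book_counts[b] < max_per_book && spec_counts[s] < max_per_specialty) {
      keep[i] <- TRUE
      book_counts[b] <- book_counts[b] + 1
      spec_counts[s] <- spec_counts[s] + 1
    }
  }
  cases[keep, ]
}
```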
Included cases were translated into English using the translation tool Deepl.com, followed by a human review to ensure both linguistic accuracy and quality. To mimic a real patient encounter, the cases were rephrased in lay patient language: the medical history was given in first-person perspective, contained only general information, and avoided clinical expert terminology. Lay readability was independently checked by two non-medical researchers, who provided alternative lay terms wherever expert terminology had been used. In cases of disagreement on the lay wording, consensus was achieved by a third non-medical researcher.
An overview of the included cases, broken down by publisher, clinical specialty and disease frequency, is provided in Table 1. Detailed information, including the exact patient medical history, is provided in Supplementary Data S1.
Table 1: Number of cases per publisher and clinical specialty, considering all, rare, less frequent and frequent diseases.
Querying GPT-3·5 and GPT-4
For generating the patient queries, the analysis plan was as follows (a sketch of the resulting prompt strings is given after the list):
- Open a new ChatGPT conversation.
- Suspected diagnosis: Write patient medical history and current symptoms, add “What are most likely diagnoses? Name up to five.”
- Examination options: “What are the most important examinations that should be considered in my case? Name up to five.”
- Open a new ChatGPT conversation.
- Treatment options: Write patient medical history and current symptoms and add: “My doctor has diagnosed me with (specific diagnosis X). What are the most appropriate therapies in my case? Name up to five.”
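Purely for illustration, the R sketch below assembles the three prompt strings from placeholders for a case’s lay-language history and its casebook diagnosis; the study itself issued these prompts through the ChatGPT website as described below.

```r
# Illustration only: the study entered these prompts manually in the ChatGPT web interface.
history     <- "<lay-language medical history and current symptoms of one case>"
diagnosis_x <- "<actual diagnosis provided by the casebook>"

prompt_diagnosis   <- paste(history, "What are most likely diagnoses? Name up to five.")
prompt_examination <- "What are the most important examinations that should be considered in my case? Name up to five."
prompt_treatment   <- paste0(history, " My doctor has diagnosed me with ", diagnosis_x,
                             ". What are the most appropriate therapies in my case? Name up to five.")
```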
All prompts were systematically executed between 3 April 2023 and 19 May 2023 through the website https://www.chat.openai.com. Output generated with GPT-3·5 and GPT-4 for each of the three queries is provided in Supplementary Data S1.
Querying Google
Symptoms were searched and the most likely diagnosis was determined based on the first 10 hits returned by Google. Search, extraction and interpretation were performed by a non-medical expert, mimicking the situation of a patient.
Based on the medical history of each case, search strings were defined according to the following template: “baby(opt.) child(opt.) diagnosis <symptoms> previous <previous illness>(opt.) known <known disease> <additional information>(opt.)”
Symptoms were extracted from the information provided in the medical history. If a previously resolved illness existed and was considered relevant by the non-medical expert, the word “previous” was added, followed by information on that illness. If a known condition, e.g. hay fever, existed and was considered relevant, the word “known” was added, followed by information on that condition.
The search was performed in incognito mode using https://www.google.com. For every case, the first 10 websites were evaluated and scanned for possible diagnoses. The non-medical expert compared the symptoms characterizing each diagnosis in the search results to those provided in the medical history. Information available in the medical history but not included in the search string, e.g. age or sex, was additionally taken into account; for example, an ectopic pregnancy was not considered possible for a male patient with abdominal pain. For cases concerning children below the age of 1, “baby” was added to the search string; for cases concerning children below the age of 16, “child” was added.
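For illustration, the assembly of such a search string can be sketched in R as follows; all argument names and example values are ours, and the exact handling of the “baby”/“child” terms for infants is an assumption.

```r
# Illustrative assembly of the Google search string template described above
build_search_string <- function(symptoms, age_years = NA,
                                previous_illness = NULL, known_disease = NULL,
                                additional_info = NULL) {
  parts <- c(
    if (!is.na(age_years) && age_years < 1) "baby",
    if (!is.na(age_years) && age_years >= 1 && age_years < 16) "child",  # assumption: not combined with "baby"
    "diagnosis", symptoms,
    if (!is.null(previous_illness)) paste("previous", previous_illness),
    if (!is.null(known_disease))    paste("known", known_disease),
    additional_info
  )
  paste(parts, collapse = " ")
}

build_search_string("fever rash joint pain", age_years = 6, known_disease = "hay fever")
# "child diagnosis fever rash joint pain known hay fever"
```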
Only information available on the websites themselves was evaluated; no further detailed search on a specific diagnosis was performed if, for example, a website provided only limited information on a disease’s characteristics.
A maximum of five most likely diagnoses were determined. If more than five diagnoses appeared equally likely, the most frequently reported diagnoses were selected. The Google search strings as well as the five identified diagnoses are available in Supplementary Data S1.
Performance Evaluation
Assessment of the answers generated with GPT-3·5, GPT-4 and Google was conducted independently by two physicians. The systems’ output was reviewed in relation to the solutions provided by the casebooks; in cases of uncertainty, further literature was consulted. Each physician scored clinical accuracy on the 5-point Likert scale defined in Table 2. The final score was calculated as the mean of the two individual scores.
Table 2. Assessment scheme for diagnosis, examination and treatment recommendations.
Score 1: Most or all relevant options were not mentioned. All or most of the system’s generated options were redundant or unjustified.
Score 2: Some or many relevant options were not mentioned. Some of the system’s generated options were redundant or unjustified.
Score 3: Most or all relevant options were mentioned. Some of the system’s generated options were redundant or unjustified.
Score 4: Most or all relevant options were mentioned. Few of the system’s generated options were redundant or unjustified.
Score 5: All of the relevant options were mentioned. No redundant or unjustified option was mentioned.
All statistical analyses were conducted using R 4.3.1 [19]. To assess inter-rater reliability, weighted Cohen’s kappa and 95% confidence intervals (CIs) were calculated for each of the three tasks using the R package “DescTools” [20]. To explore the performance of the systems, diagnosis, examination and treatment were evaluated independently. The performance of GPT-3·5 vs GPT-4 was compared using a paired one-sided Wilcoxon signed-rank test (R package “stats”, function wilcox.test() with paired = TRUE). For diagnosis, paired one-sided tests comparing GPT-3·5 vs Google and GPT-4 vs Google were additionally performed. A possible influence of the diseases’ incidence (rare vs less frequent vs frequent) was investigated using unpaired one-sided Mann-Whitney tests. A conservative Bonferroni correction [21] was applied to adjust for multiple testing (for examination and treatment: 1 test comparing GPT-3·5 to GPT-4, 3 tests within group GPT-3·5 and 3 tests within group GPT-4, resulting in n=7; for diagnosis: 3 tests comparing GPT-3·5, GPT-4 and Google, 3 tests within group GPT-3·5, 3 tests within group GPT-4 and 3 tests within group Google, resulting in n=12).
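A condensed R sketch of these analyses is given below; the score vectors are hypothetical, and both the weighting scheme for Cohen’s kappa and the direction of the one-sided alternatives are assumptions.

```r
library(DescTools)

# Hypothetical Likert scores (1-5) from the two raters for a handful of cases
scores_rater1 <- c(5, 4, 3, 5, 2, 4)
scores_rater2 <- c(5, 3, 3, 4, 2, 4)

# Inter-rater reliability: weighted Cohen's kappa with 95% CI (weighting scheme assumed)
CohenKappa(scores_rater1, scores_rater2, weights = "Equal-Spacing", conf.level = 0.95)

# Hypothetical mean scores per case for GPT-3.5 and GPT-4 on the same cases
score_gpt35 <- c(3.0, 2.5, 4.0, 3.5, 2.0, 4.5)
score_gpt4  <- c(4.0, 3.0, 4.5, 4.0, 3.5, 4.5)

# Paired one-sided test GPT-3.5 vs GPT-4 (direction of the alternative assumed)
p_pair <- wilcox.test(score_gpt35, score_gpt4, paired = TRUE, alternative = "less")$p.value

# Unpaired one-sided Mann-Whitney test, e.g. rare vs frequent cases within one model
freq_group <- c("rare", "rare", "frequent", "frequent", "rare", "frequent")
p_freq <- wilcox.test(score_gpt4[freq_group == "rare"],
                      score_gpt4[freq_group == "frequent"],
                      paired = FALSE, alternative = "less")$p.value

# Bonferroni adjustment over the n tests per task (n = 7 or n = 12 as described above)
p.adjust(c(p_pair, p_freq), method = "bonferroni")
```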