Table 1 gives an overview of ChatGPT's performance across the four consecutive stages of this study.
The percentages of comprehensive answers (totally correct, totally complete, no unnecessary costs, no harm) for physical examinations, workups, DDx, and treatment were 65.6%, 50.0%, 40.6%, and 40.6%, respectively. Although ChatGPT received additional information at each new stage, the percentage of comprehensive answers fell from 65.6% to 40.6% as the study proceeded.
In the third stage (giving DDx based on history, physical examinations, and workup results), ChatGPT placed the correct diagnosis first in the list of differential diagnoses in 65.6% of the cases.
Across the 4 questions per case (32 * 4 = 128 answers in total), none of ChatGPT's answers included suggestions that created unnecessary costs or harm to the patient.
Table 1
The overview of recommendations given by ChatGPT to rheumatology cases
| Measure | Count (total = 32) | Percentage |
| --- | --- | --- |
| Comprehensive* physical examinations | 21 | 65.6% |
| Comprehensive workups | 16 | 50.0% |
| Comprehensive DDx | 13 | 40.6% |
| Comprehensive treatment | 13 | 40.6% |
| First DDx is correct** | 21 | 65.6% |
| Suggestions with unnecessary costs | 0 | 0% |
| Suggestions with harm to the patient | 0 | 0% |

*Comprehensive: a comprehensive answer is an answer that is “totally correct”, “totally complete” and avoids unnecessary costs and harms to the patient.
**First DDx is correct: the prevalence of answers that include the correct diagnosis as the first diagnosis in the list of differentials.
DDx: differential diagnosis.
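The percentages in Table 1 follow directly from the counts and the 32 cases. The short Python sketch below reproduces them; the names `counts` and `to_percentage` are illustrative only and are not taken from the study.

```python
# Counts as reported in Table 1; the denominator is the 32 cases in the study.
TOTAL_CASES = 32

counts = {
    "Comprehensive physical examinations": 21,
    "Comprehensive workups": 16,
    "Comprehensive DDx": 13,
    "Comprehensive treatment": 13,
    "First DDx is correct": 21,
    "Suggestions with unnecessary costs": 0,
    "Suggestions with harm to the patient": 0,
}


def to_percentage(count, total=TOTAL_CASES):
    """Return count / total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


percentages = {label: to_percentage(n) for label, n in counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```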
Stage 1: What physical examinations should be done by the physician?
At this stage, the correctness and completeness of ChatGPT's recommendations for the necessary physical examinations were assessed (Fig. 2). The information given about the case included the initial presentation, history, and ROS.
Regarding correctness, ChatGPT's recommendations were totally correct in 75% of the cases, mostly correct in 18.75%, and uncertain in 6.25% (meaning that the evaluator could not decide).
In none of the cases were the answers about physical examination mostly or totally incorrect.
Regarding completeness, ChatGPT's recommendations were totally complete in 81.25% of the cases, almost complete in 15.63%, and included only basic information in 3.13%.
In none of the cases were the answers poor or very poor in terms of completeness.
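Because every percentage refers to the same 32 cases, the reported values can be converted back to case counts as a quick consistency check. The following is a minimal sketch under that assumption; `to_count` and the dictionary names are hypothetical helpers, not part of the study.

```python
# All stage-level percentages are assumed to refer to the same 32 cases.
TOTAL_CASES = 32


def to_count(percentage, total=TOTAL_CASES):
    """Convert a reported percentage back to the nearest whole number of cases."""
    return round(percentage * total / 100)


# Stage 1 percentages as reported in the text.
stage1_correctness = {
    "totally correct": 75.0,
    "mostly correct": 18.75,
    "uncertain": 6.25,
}
stage1_completeness = {
    "totally complete": 81.25,
    "almost complete": 15.63,
    "only basics": 3.13,
}

correctness_counts = {k: to_count(v) for k, v in stage1_correctness.items()}
completeness_counts = {k: to_count(v) for k, v in stage1_completeness.items()}

print(correctness_counts, sum(correctness_counts.values()))
print(completeness_counts, sum(completeness_counts.values()))
```

If the categories are exhaustive, each set of counts should sum to 32.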
Stage 2: What further workups should be ordered by physician?
At this stage, the correctness and completeness of ChatGPT's recommendations for further necessary workups were assessed (Fig. 3). The information given about the case included the initial presentation, history, ROS, and physical examinations.
As shown in Fig. 3, regarding correctness of workup recommendations, ChatGPT was totally correct in 65.63% of the cases, mostly correct in 28.13%, and uncertain in 6.25%. In none of the cases were the answers totally or mostly incorrect.
Regarding completeness of workup recommendations, ChatGPT's answers were totally complete in 62.50% of the cases, mostly complete in 21.88%, and included only basics in 15.63%. In none of the cases were the answers poor or very poor in representing the necessary workup information.
Stage 3: What are the differential diagnoses for this patient?
At this stage, the correctness and completeness of ChatGPT's recommendations for the differential diagnoses were assessed (Fig. 4). The information given about the case included the initial presentation, history, ROS, physical examinations, and workup results (imaging, laboratory results, biopsies, and consultations).
As shown in Fig. 4, regarding correctness of differential diagnoses, ChatGPT was totally correct in 53.13% of the cases, mostly correct in 40.63%, and uncertain in 6.25%. In none of the cases were the answers totally or mostly incorrect.
Regarding completeness of differential diagnoses, ChatGPT's answers were totally complete in 62.50% of the cases, mostly complete in 28.13%, and included only basics in 9.38%. In none of the cases were the answers poor or very poor.
Stage 4: What is the correct treatment for this patient?
At this stage, the correctness and completeness of ChatGPT's recommendations for treatment were assessed (Fig. 5). The information given about the case included the initial presentation, history, ROS, physical examinations, workup results (imaging, laboratory results, biopsies, and consultations), and the final diagnosis.
As shown in Fig. 5, regarding correctness of treatment recommendations, ChatGPT's answers were totally correct in 50% of the cases, mostly correct in 31.25%, and uncertain in 18.75%. In none of the cases were the answers totally or mostly incorrect.
Regarding completeness of treatment recommendations, ChatGPT's answers were totally complete in 59.38% of the cases, mostly complete in 31.25%, and included only basics in 9.38%. In none of the cases were the answers poor or very poor.
Trends in appropriateness of ChatGPT answers in different stages of the study
In terms of correctness (Figs. 2 to 5), the percentage of totally correct answers given by ChatGPT decreased from stage to stage: 75% (for physical examinations), 65.63% (for workups), 53.13% (for DDx), and 50% (for treatment). This runs contrary to the increase in the information given to ChatGPT at each stage.
As the study proceeded, the percentage of totally correct answers decreased, with a corresponding increase in the percentage of mostly correct and uncertain answers.
The average correctness score (quantitative scores from 1 to 5) is shown in Fig. 6. The mean correctness scores show a slight decrease at each stage as the study proceeds.
In terms of completeness (Figs. 2 to 5), the percentage of totally complete answers given by ChatGPT also decreased: 81.25% (for physical examinations), 62.50% (for workups), 62.50% (for DDx), and 59.38% (for treatment). This runs contrary to the increase in the information given to ChatGPT at each stage.
As the study proceeded, the percentage of totally complete answers decreased, while the percentage of almost complete and only-basics answers increased.
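The stage-to-stage changes behind these trends can be tabulated with a few lines of Python. This is only an illustrative sketch using the percentages reported above; `stage_changes` and `is_non_increasing` are hypothetical helper names.

```python
# Percentages of "totally correct" and "totally complete" answers,
# listed in order for Stage 1 to Stage 4 as reported in the text.
totally_correct = [75.0, 65.63, 53.13, 50.0]
totally_complete = [81.25, 62.50, 62.50, 59.38]


def stage_changes(values):
    """Return the change in percentage points between consecutive stages."""
    return [round(later - earlier, 2) for earlier, later in zip(values, values[1:])]


def is_non_increasing(values):
    """True if no stage has a higher value than the stage before it."""
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


print("Correctness changes:", stage_changes(totally_correct))
print("Completeness changes:", stage_changes(totally_complete))
print("Correctness non-increasing:", is_non_increasing(totally_correct))
print("Completeness non-increasing:", is_non_increasing(totally_complete))
```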
Figure 7 shows the average completeness scores (1 to 5) at each stage of the study. The mean completeness scores were 4.81, 4.47, 4.53, and 4.50, respectively.
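As an illustration of how such a mean score could be obtained, the sketch below computes a weighted average for Stage 2 only. It assumes that the 1-to-5 scale maps to the completeness categories as 5 = totally complete, 4 = mostly complete, 3 = only basics, 2 = poor, 1 = very poor; this mapping is an assumption made for the example and is not stated in this section. The case counts are derived from the reported Stage 2 percentages of 32 cases, and `mean_score` is a hypothetical helper.

```python
# ASSUMPTION: category-to-score mapping chosen for illustration only.
ASSUMED_SCORES = {
    "totally complete": 5,
    "mostly complete": 4,
    "only basics": 3,
    "poor": 2,
    "very poor": 1,
}

# Stage 2 completeness counts derived from the reported percentages
# (62.50%, 21.88%, 15.63% of 32 cases); no poor or very poor answers.
stage2_counts = {
    "totally complete": 20,
    "mostly complete": 7,
    "only basics": 5,
    "poor": 0,
    "very poor": 0,
}


def mean_score(counts, scores=ASSUMED_SCORES):
    """Return the count-weighted mean score across all categories."""
    total = sum(counts.values())
    return sum(scores[category] * n for category, n in counts.items()) / total


stage2_mean = mean_score(stage2_counts)
print(round(stage2_mean, 2))
```

Under this assumed mapping, the Stage 2 result rounds to 4.47.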