The Myth of Diagnosis as Classification: Examining the Effect of Explanation on Patient Satisfaction and Trust in AI Diagnostic Systems

Background: Artificial Intelligence has the potential to revolutionize healthcare, and it is increasingly being deployed to support and assist medical diagnosis. One potential application of AI is as the first point of contact for patients, providing initial diagnoses before a patient is sent to a specialist and allowing health care professionals to focus on more challenging and critical aspects of treatment. But for AI systems to succeed in this role, it will not be enough for them to merely provide accurate diagnoses and predictions. They will also need to provide explanations (both to physicians and patients) about why the diagnoses are made; without this, accurate and correct diagnoses and treatments might be ignored or rejected.

Method: It is important to evaluate the effectiveness of these explanations and understand the relative effectiveness of different kinds of explanations. In this paper, we examine this problem across two simulation experiments. In the first experiment, we tested a re-diagnosis scenario to understand the effect of local and global explanations. In the second experiment, we implemented different forms of explanation in a similar diagnosis scenario.

Results: Results show that explanation helped improve satisfaction measures during the critical re-diagnosis period but had little effect before re-diagnosis (when initial treatment was taking place) or after (when an alternate diagnosis resolved the case successfully). Furthermore, initial "global" explanations about the process had no impact on immediate satisfaction but improved later judgments of understanding about the AI. Results of the second experiment show that visual and example-based explanations integrated with rationales had a significantly better impact on patient satisfaction and trust than no explanation or text-based rationales alone. As in Experiment 1, these explanations had their effect primarily on immediate measures of satisfaction during the re-diagnosis crisis, with little advantage prior to re-diagnosis or once the diagnosis was successfully resolved.

Conclusion: These two studies help us draw several conclusions about how patient-facing explanatory diagnostic systems may succeed or fail. Based on these studies and a review of the literature, we provide design recommendations for the explanations offered by AI systems in the healthcare domain.


Introduction
Background

AI systems are increasingly being fielded to support diagnoses and healthcare advice for patients [1]. Although these systems are still in their infancy, they have the potential to serve as a first point of contact for patients, and eventually may produce diagnoses and predictions about patients' health, perform routine tasks, and provide non-emergency medical advice. This has the potential to provide innovative solutions for improved healthcare outcomes at a reduced cost.
However, in order to replace or supplement human diagnosis from physicians and health care professionals, it may not be enough for the AI diagnosis system to just be accurate. An accurate diagnosis without justification or explanation might be ignored, even from a competent physician.
This was perhaps first noted in the early days of medical diagnosis systems, when Teach and Shortliffe [2] found that, when considering AI diagnostic systems, the most important desire of both physicians and non-physicians was that the system be able to explain its diagnostic decisions. In contrast, avoiding incorrect diagnoses and erroneous treatments was rated among the least important properties. More recently, Holzinger et al. [3] argued that Explainable AI (XAI) may help facilitate transparency and trust in the implementation of AI in the medical domain. We therefore expect that any successful patient-focused AI diagnosis system will also provide explanations and justifications of its diagnoses, so that the patient can understand why a diagnosis is made or a treatment plan is recommended. It is even possible that an average diagnosis system with better explanation will lead to better healthcare outcomes than a perfect diagnosis system without explanation.
A variety of algorithms have been identified for providing explanations of AI diagnostic systems, both within and outside the field of healthcare. For example, early expert systems provided rule-based logical explanations that were tightly coupled to the knowledge the systems used to make diagnoses [4]-[8]. More recently, researchers have focused on visualizing elements of the classification algorithms being used to make a diagnosis (e.g., heat-map image analysis) and on visualizing decision trees or complex additive models [9]-[13]. Other researchers have explored case-based explanations, which provide examples as compelling support for a system's conclusions [14]-[20]. Consequently, several algorithmic approaches to both diagnosis and explanation of diagnoses have been explored in medical AI. However, it is not clear which methods are effective, whether a single method is sufficient, or whether the explanations need to be tailored to individual patients, situations, or different timepoints during diagnosis.
Furthermore, the literature on XAI in healthcare focuses primarily on algorithms, rather than the impact the algorithms have on patients (such as their satisfaction, understanding, or willingness to use the system in the future).
One approach we have pursued to study explainable AI (XAI) in healthcare is to understand the types of explanations real physicians offer when they interact with patients.
For example, Alam [21] conducted an interview study with physicians to document how they explained diagnoses to their patients. The results suggest that physicians use a variety of explanation methods, which depend on context, including time (i.e., early or late in diagnosis) and the identity of the patient or the patient's advocates (including culture, education, age, and other concerns). The explanations identified included the use of logical arguments, examples, test results, imagery, analogies, and emotional appeals. The results of this study also suggest that physicians tend to provide different types of explanations at different points of diagnosis. Although many of these explanation types have been explored in the XAI literature [22], few systems have acknowledged the variety and contextual nature of the different explanation types.

Methods for Providing Explanations
In the present paper, we will report on two experiments we conducted that explore how different types of explanations may impact satisfaction and trust in a simulated AI diagnostic system. In these studies, we will examine several different types of explanations that have been proposed and explored in the XAI community.
One such distinction is whether the goal of the explanation is to inform about the diagnostic process or to justify why a particular diagnosis was made. These are referred to as "global" and "local" explanations, respectively [23]-[26]. In general, Alam [21] found that physicians report using both methods: sometimes they explain how a particular disease or diagnostic process works; other times they justify why a particular diagnosis is given based on evidence (symptoms, test results, history, etc.).
Another important distinction is the means by which an explanation is provided. Alam [21] also found that physicians' explanations mapped onto many of the explanation types studied in the XAI literature, including case-based information and examples [27], [28], analogies [29], [30], logical arguments [31], [32], and imagery with important aspects highlighted [33], [34]. For imagery, AI healthcare systems may use graphs to show the relative probability of different outcomes or the relative importance of different symptoms for those outcomes, which is akin to how the LIME algorithm [35] presents diagnostic features.
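To make this concrete, the sketch below (in R, with entirely hypothetical symptom weights; it is not the implementation of any particular system) shows how per-symptom contributions to a diagnosis might be computed and displayed as a bar chart, in the spirit of the feature-contribution bars that a LIME-style explanation presents.

```r
# Illustrative sketch only: hypothetical log-odds weights for an
# "IBS vs. not IBS" prediction over binary symptom indicators.
symptoms <- c(abdominal_pain = 1, cramps = 1, diarrhea = 1,
              fatigue = 1, joint_pain = 1, fever = 0)
weights  <- c(abdominal_pain = 0.9, cramps = 0.7, diarrhea = 0.8,
              fatigue = 0.2, joint_pain = -0.3, fever = -1.1)
intercept <- -0.5

contributions <- weights * symptoms                 # per-symptom contribution
p_ibs <- plogis(intercept + sum(contributions))     # predicted probability

# Horizontal bars showing which symptoms push the prediction up or down,
# analogous to the feature-weight display of a LIME-style explanation.
barplot(sort(contributions), horiz = TRUE, las = 1,
        main = sprintf("P(IBS) = %.2f", p_ibs),
        xlab = "Contribution to log-odds")
```

A patient-facing system would pair such a display with a short verbal rationale; the chart alone conveys which reported symptoms most influenced the diagnosis.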
Physicians may present visualizations differently from how AI systems offer visual explanations, but even the use of X-rays and other test reports is generally accompanied by explanations highlighting the location of critical signs indicating a diagnosis, with a similar goal to the gradient-based heatmaps [36]-[39] used in XAI systems.
Next, we will report the results of two studies in which we tested a variety of explanation methods and approaches in a simulated diagnostic situation. Rather than testing a single explanation of an isolated case, we designed a garden-path scenario in which symptoms initially pointed to one diagnosis, but it later became clear that another diagnosis was correct. This produced an evolving diagnosis, which we believe is particularly well suited to understanding how patients both trust and understand an AI diagnostic system.

Experiment 1
Hoffman et al. [40] argued that elements of satisfaction and trust follow from an improved understanding of an AI system that might be gathered from different kinds of explanations. Consequently, we hypothesized that explanations would induce greater satisfaction, trust, understanding, and perceptions of accuracy. To investigate this, we tested participants interacting with a simulated AI system that initially gives the most likely but incorrect diagnosis, but later changes the diagnosis to the correct disease once further testing is complete. This provides an important case for understanding explanation because, at all times, the AI can be judged to be behaving optimally given its information, even when its diagnosis is incorrect.

Method

Participants
Eighty undergraduate students at Michigan Technological University took part in the study in exchange for partial course credit.

Procedure
We created a diagnosis scenario in which a simulated AI system gives a most-likely but incorrect diagnosis and later changes the diagnosis to the correct disease. The scenario involved gastrointestinal disorders and symptoms, which are often difficult to diagnose in real-world situations. Participants played the role of patients in the scenario and were instructed to report that they were suffering from specific symptoms (abdominal pain, cramps, diarrhea, fatigue, and joint pain). A simulated AI system (called MediBot.ai) provided diagnostic information about the scenario, initially concluding that the patient was suffering from Irritable Bowel Syndrome (IBS), and advised the patient to follow a specific diet chart and come back for a follow-up the next week. After one week, participants were told that they had begun to feel better, but that the symptoms then started getting worse.
When the patient did not feel better even after three consecutive weeks, MediBot determined that the patient might not be suffering from the "most likely" condition, IBS. It changed its diagnosis, ordered additional diagnostic tests, and determined that the patient was suffering from Celiac disease, which is caused by an immune reaction to gluten (the 'ground truth' of the scenario). Participants communicated with MediBot across six simulated weeks, though the study took around 20 minutes to complete. All participants experienced the same basic scenario with identical symptoms and diagnoses.
To maintain intervals between the simulated weeks, participants were given brief crosswords to solve during the intervals. After solving a crossword, they were asked to follow up with MediBot and continue playing their role as patients. The Appendix lists the entire scenario for a patient across six weeks of diagnosis. After each simulated week, participants were asked to rate the satisfaction, trust, perception of accuracy, sufficiency, usefulness, and completeness of the explanations; these ratings are referred to as "Explanation Satisfaction Scale" attributes [40]. At the end of the study, participants also rated their agreement with four 5-point Likert-scale statements about their understanding of the system (see Table 2).
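As a rough illustration of how such a scripted scenario and its weekly ratings can be organized, the sketch below (in R) uses abbreviated, hypothetical week descriptions; the full patient script appears in the Appendix.

```r
# Abbreviated, hypothetical version of the six-week garden-path script;
# the actual wording given to participants is in the Appendix.
scenario <- data.frame(
  week  = 1:6,
  state = c("initial diagnosis: IBS, diet plan", "mild improvement",
            "symptoms worsen", "symptoms persist, further tests ordered",
            "re-diagnosis: Celiac disease", "resolution and recovery"),
  stringsAsFactors = FALSE
)

# Explanation Satisfaction Scale dimensions rated after each simulated week
ess_dims <- c("satisfaction", "trust", "accuracy",
              "sufficiency", "usefulness", "completeness")

# One row per participant x week x dimension; ratings are filled in as
# participants respond on the Likert scales.
ratings <- expand.grid(participant = 1:80, week = 1:6, dimension = ess_dims)
ratings$rating <- NA_integer_
```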

Results
Both the control and the global explanation groups expressed less satisfaction, trust, perception of accuracy, sufficiency, usefulness, and completeness than the local explanation group, as shown in Figure 2.
The control and global explanation groups received the same scenario with no local explanations and differed only in whether they saw an initial global explanation of the AI; the fact that they did not differ from one another on these ratings suggests that the satisfaction ratings were driven primarily by the local explanations. We examined the rating for each dimension of the Explanation Satisfaction Scale with a Type-III factorial ANOVA examining the main effects of time and explanation condition (local, global, and control), and their interaction, using the R package 'ez' [41]. The Type-III ANOVA examines the main effects after the interaction has been accounted for, allowing us to identify residual effects of explanation type across all time points. The results are shown in Table 1. We used a posthoc Tukey test at a p < .05 significance level on the three groups to examine pairwise differences (see Table 2). The global explanation condition produced ratings that were significantly better than the local explanation condition for statements 1 and 2, and both the local and global conditions were rated better than the control for statements 2 and 4.
There were no differences between groups on statements 1 and 3. Thus, although the initial global explanation did not improve satisfaction during the scenario, it provided a better overall understanding of the AI system's general method of diagnosis.
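For readers interested in the form of this analysis, the following R sketch shows how a Type-III factorial ANOVA with the 'ez' package and a follow-up Tukey test might be run. The data frame and ratings below are simulated placeholders, not the study data, and the column names are illustrative.

```r
library(ez)
set.seed(1)

# Simulated placeholder data: 80 participants x 6 weeks, three conditions
cond_of_id <- rep(c("control", "global", "local"), length.out = 80)
d <- expand.grid(id = 1:80, week = 1:6)
d$condition <- factor(cond_of_id[d$id])
d$rating <- sample(1:5, nrow(d), replace = TRUE)
d$id <- factor(d$id)
d$week <- factor(d$week)

# Type-III factorial ANOVA: time (within), explanation condition (between),
# and their interaction.
anova_out <- ezANOVA(data = d, dv = .(rating), wid = .(id),
                     within = .(week), between = .(condition), type = 3)
print(anova_out)

# Pairwise Tukey comparison of the three groups, collapsing over weeks
subj_means <- aggregate(rating ~ id + condition, data = d, FUN = mean)
TukeyHSD(aov(rating ~ condition, data = subj_means))
```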

Impact of Local Explanation/Justification
In this study, we examined how a re-diagnosis event impacted satisfaction and trust, and how different kinds of explanations impacted satisfaction, trust, and understanding of an AI system. Overall, the study showed that satisfaction and trust suffer at the critical points during re-diagnosis, even when the system is making the best diagnosis it can based on the available information. We also found that local justifications were effective, but their effect was time-sensitive: during the critical period, when the AI appeared to be making errors, local justifications were very effective and powerful explanations for the patients.

Impact of global explanations
In contrast, pre-test global explanations using example diagnoses had little effect on satisfaction during the scenario but improved later judgments of understanding of the system.

There are a number of alternative methods that have been explored for the explanation of classification and diagnosis.
One approach attempts to focus attention on important causal factors in a classification decision or diagnosis [42].
Although, as in our study, the relative likelihood of different outcomes is typically shown, algorithms also often try to highlight the features most responsible for the decision. Examples and case-based explanations have likewise long been argued to be important methods for reasoning and persuasion [14], [15], [43] and have been extensively explored in the XAI literature [22].
To understand how more complex explanations might impact satisfaction and trust, we conducted a second study, reported next, that used a similar diagnosis scenario to investigate how different forms of local explanation affect patient satisfaction, trust, and perception of accuracy during diagnosis. In this study, we examine and compare feature-highlighting approaches with case-based approaches.

Experiment 2
Explanations in AI diagnostic systems may come in different forms such as text-based rationales, visualizations, examples, or contrasts. The goal of this study was to investigate whether different forms of explanation in an AI diagnostic system affect patient satisfaction, trust, and perception of accuracy. We implemented three forms of explanation: written rationales, visuals + rationales, and examples + rationales, in a diagnosis scenario similar to the one in Experiment 1. Again, a simulated AI system gave a most likely but incorrect diagnosis, but later it changed the diagnosis to the correct disease.

Method

Participants
One hundred and thirteen undergraduate students at Michigan Technological University took part in the study in exchange for partial course credit.

Procedure
The study was conducted online and took 15-20 minutes to complete. Participants gave their consent online before taking part in the study. They played the role of a patient suffering from a gastrointestinal disorder interacting with the simulated AI system, in a scenario slightly modified from Experiment 1.
This time, the patient suffered from abdominal pain, cramps, bloating, diarrhea, fatigue, and joint pain, had no family history of gastrointestinal diseases, but had recently been exposed to a natural water source, making an initial diagnosis of Giardia likely. When tests for this came back negative, MediBot predicted that it might be IBS and asked the patient to follow an IBS diet. The patient's condition was inconsistent for a few weeks while following the diet; eventually, MediBot resolved the diagnosis as Celiac disease and confirmed it with tests.
Participants were randomly assigned to one of four groups, each receiving a different form of explanation: 1) text rationales; 2) visuals + rationales; 3) examples + rationales; and 4) a control group with no explanations. After each simulated week, participants were asked to rate the satisfaction, trust, perception of accuracy, sufficiency, usefulness, and completeness of the explanations, as in Experiment 1.

Results
To simplify the presentation of the results, we organized the ratings for all six weeks into three sets: Week 1 (Time 1), the re-diagnosis crisis weeks (Time 2), and the final resolution weeks (Time 3). Ratings on each of the explanation satisfaction scales were examined with a Type-III factorial ANOVA (see Table 3). To understand the differences between the explanation conditions at each time set, we conducted Tukey posthoc tests for each of the six scales using the R package agricolae [45]; the results are shown in Table 4. At Time 2, the visuals + rationales and examples + rationales conditions did not differ from one another, but both were better than the control and rationales-alone conditions for satisfaction, sufficiency, completeness, and accuracy. At Time 3, there were no differences between any of the explanation types, indicating that the resolution of the scenario produced uniformly high satisfaction. Only during the re-diagnosis crisis weeks, when the system was noticeably wrong, were there statistically significant differences between explanation conditions.
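As an illustration of the posthoc procedure, the sketch below applies agricolae's HSD.test to simulated placeholder ratings for one time set; the condition labels mirror the four groups, but the numbers are not the study data.

```r
library(agricolae)
set.seed(2)

# Simulated placeholder ratings for one time set (e.g., the crisis weeks)
d2 <- data.frame(
  condition = factor(rep(c("control", "rationales", "visuals+rationales",
                           "examples+rationales"), each = 28)),
  rating    = sample(1:5, 4 * 28, replace = TRUE)
)

fit <- aov(rating ~ condition, data = d2)

# Tukey HSD groupings: conditions sharing a letter do not differ reliably
tukey_out <- HSD.test(fit, trt = "condition", group = TRUE)
print(tukey_out$groups)
```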

Discussion and Summary
This study demonstrated several important results. First, as in Experiment 1, explanations only appear to matter substantially during the crisis weeks. It must be noted that this crisis was not due to a specific mistake or error on the part of the AI, but was a consequence of making a most-likely diagnosis based primarily on the relative base rates of two diseases that have similar symptomology. Second, we found that richer explanations (visuals + rationales and examples + rationales) were the most effective at these critical points but otherwise did not differ substantially from the control group. Third, for the majority of measures, rationales alone were no better than the control group. Additionally, although the visualizations were substantially different from the example-based explanations, we found no evidence that one method was more effective than the other. Finally, once the system came to a resolution, the explanation no longer mattered and participants gave high satisfaction ratings.
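To see why a most-likely diagnosis can be reasonable yet wrong, consider a purely illustrative calculation; the base rates and likelihoods below are invented for illustration and are not taken from the scenario or any clinical source.

```r
# Invented numbers: two conditions with similar symptoms but different
# base rates. Normalizing over just these two candidates for simplicity.
prior      <- c(IBS = 0.10, Celiac = 0.01)   # assumed population base rates
likelihood <- c(IBS = 0.60, Celiac = 0.70)   # assumed P(symptoms | disease)

posterior <- prior * likelihood
posterior <- posterior / sum(posterior)
round(posterior, 2)
# IBS ~ 0.90, Celiac ~ 0.10: IBS is the "optimal" first call even when the
# patient actually has Celiac disease.
```

Under these invented numbers, recommending the IBS diet first is the best available decision, even though it turns out to be wrong for this particular patient.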
Notably, this experiment did not test several conditions that might also be interesting. First, because of the lack of impact of global explanation on satisfaction measures, we did not compare global explanations in this study, either alone or accompanying the local explanations. We have no data on whether the global explanation would improve local justifications in this scenario but suspect that they would have little impact here as well. We also did not examine

Discussion
The two studies reported here allow us to draw several conclusions about how patient-facing explanatory diagnostic systems may succeed or fail. Overall, they show the importance of context for explanations: justifying a decision is important for maintaining satisfaction with the system; different kinds of explanations impact the patient differently; and the timing of explanations is also critical.
We will examine the main lessons from these studies next.

The impact of explanations at the critical Time 2 is important because this is the point at which real patients might start abandoning the system, seeking second opinions, or failing to adhere to recommendations. The type of error seen in this scenario is especially pernicious because the diagnosis was in some sense optimal, even though it was wrong.
The study shows that, under the right circumstances, an explanation may mean the difference between seeing this as a reasonable best guess given the available evidence and concluding that the diagnostic system is fundamentally unreliable or inaccurate.

Lesson: Significance of global explanation
Global explanations were not as effective as local justifications for immediate measures of patient satisfaction and trust. Nevertheless, they produced significant improvements in some post-scenario measures, namely those related to the perception of overall understanding of how the AI system makes its diagnoses. This should not be ignored by developers trying to build an XAI system for patients. Thus, not only are different explanations effective at different times, but they also impact different aspects of patients' assessment of the system.

Lesson: Effectiveness of local justification
These studies showed the power of local justification/explanation.

Lesson: The format of explanation matters
Across the two studies, we examined several different formats of explanation. We found that even a simple visualization showing the likelihood of different outcomes was effective (Exp. 1), as were the more complex visualizations (Exp. 2).

The myth of diagnosis as classification
One final observation we make is that it is a mistake to think about AI diagnosis as merely a classification problem: determining what diseases or conditions patients are suffering from, given their symptoms and signs. Alam [21] identified several ways this is true for physicians diagnosing a variety of disorders, and it is also true for AI systems.
This problem involves diagnosing, but it also requires explaining why and how the AI is making the diagnosis. In many cases, a diagnostic error is not necessarily an actual mistake, as it may follow from the most likely outcome, which happens to be wrong for an individual. In other cases, the course of treatment may not simply follow the most likely option. Instead, a treatment (e.g., antibiotics) may be pursued even when its target condition is not the most likely, if the treatment carries little risk but the consequences of not treating are large. Still other cases involve several possibilities, each of which could be the source of the symptoms.
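The treatment point can be made concrete with a small, purely hypothetical expected-cost comparison (all numbers are invented for illustration):

```r
# Invented numbers: a low-risk treatment can be worth giving even when the
# condition it targets is not the most likely explanation of the symptoms.
p_bacterial    <- 0.25   # assumed probability the cause is bacterial
cost_untreated <- 100    # assumed harm if bacterial and left untreated
cost_treatment <- 5      # assumed harm/burden of giving antibiotics

expected_cost_treat    <- cost_treatment               # treat in every case
expected_cost_no_treat <- p_bacterial * cost_untreated # expected harm if not

expected_cost_treat < expected_cost_no_treat   # TRUE: treating is preferred
```

With these invented values, treating is preferred even though a bacterial cause is not the most likely possibility; a classification-only view of diagnosis misses this kind of reasoning, and so would an explanation built solely around the most probable label.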

Conclusion
To improve patient satisfaction and trust at such points, building AI systems with higher accuracy might not be enough, and may not even be possible. In critical situations, AI systems may offer an erroneous diagnosis in the process of determining the most likely disease or condition, and patients will not understand the reason behind this if they are not exposed to explanations and justifications. Incorporating appropriate explanations into AI systems may help patients better understand the diagnosis in these situations and leave them more satisfied with the diagnosis as well.

Declarations

Ethics approval and consent to participate
This study was approved by the MTU Institutional Review Board Human Subjects Committee (M1966), and all methods in the study were carried out following the relevant guidelines and regulations. Participants were prompted with an informed consent form at the beginning of the study, which also outlined that by participating they acknowledged that the data obtained from this study would be used for publication.
We sought explicit consent. For the online version, participants had to click a specific button, and for the in-person version they had to sign the consent form, to indicate that they consented after reading the aforementioned form.

Consent for publication
Not applicable.