Impact of Artificial Intelligence, With and Without Information, on Pathologists’ Decisions: An Experiment


Background: Artificial intelligence (AI) is rapidly gaining attention in medicine, and in pathology in particular. While much progress has been made in refining the accuracy of algorithms, thereby increasing their potential use, we need to better understand how these algorithms will be used by pathologists, who will remain the decision-makers for the foreseeable future. The objective of this paper is to determine the propensity of pathologists to rely on AI decision aids and to investigate whether providing information on the algorithm affects this reliance.

Methods: To test our hypotheses, we conducted a within-subjects experiment using an online survey. 116 pathologists and pathology students participated. Each participant assessed the Gleason grade for a series of 12 prostate cancer samples under three conditions: without advice, with advice from an AI decision aid, and with advice from an AI decision aid accompanied by information on the algorithm, namely its accuracy rate and model. Scores were computed by comparing respondents' answers with the "true" score at the individual-question level. A mixed-effects logistic regression was used to analyze the difference in scores between the conditions, controlling for the random effects of participants and images, and to assess interactions with Experience, Gender and beliefs towards AI.

Results: Participant responses with AI decision aids were significantly more accurate than in the control condition without aid. However, no significant difference was found when additional accuracy rate and model information accompanied the AI advice. Moreover, the propensity to rely on AI was related to general beliefs about AI but not to assessments of the particular AI tool offered. Males also performed better in the No-aid condition but not in the AI-aid conditions.

Conclusions: AI can significantly influence pathologists, and general beliefs about AI could be major predictors of pathologists' future reliance on AI.


Background
Decision support systems have been used for some time now in clinical pathology practice to support image analysis (1) for quantifying tasks such as oestrogen receptor, progesterone receptor (2) and HER2/neu assessments in breast cancer (3), but more sophisticated "artificial intelligence"-driven software tools are still rarely commercially available for pathology (4,5).
The limited use of AI in the field of pathology is changing quickly. In the last few years, artificial intelligence has made great strides, and multiple papers have tested algorithms that achieved significant accuracy (6). IDx-DR, an AI diagnostic system able to detect signs of retinopathy in retinal images, was approved in 2018 by the Food and Drug Administration (FDA) for use in the USA (7). And eighty percent of pathologists anticipate that AI will be introduced into their pathology laboratories within the coming decade (8).
To turn AI from promising algorithms to a practical reality, pathologists, solution providers and healthcare organizations need to overcome significant implementation challenges (7,9), which have thwarted many promising medical technological innovations in the past (10). Crucially, research within the field needs to establish whether the alleged performance of algorithms will translate into better clinical decisions and better health outcomes for patients and, furthermore, under what conditions these changes would be most effective (7).
Pathologists will play a key role in this translation into health outcomes. Many AI studies compare AI algorithms to pathologists who are blinded to the algorithm's results (11), often reaching accuracy rates similar to or better than pathologists'. But when these systems are implemented, pathologists and AI systems are unlikely to be competing or blinded to each other. Pathologists are expected to retain the final say as to whether to rely on algorithms' conclusions or not (12). As a result, algorithms are likely to be used as tools that augment pathologists' skills rather than substitute for them (7,13), which corresponds to a level of only 3 or 4 on Parasuraman et al.'s scale of autonomy of human interaction with automation (10 being a fully autonomous system) (5). For the foreseeable future, the impact of AI on clinical decisions and health outcomes will be mediated by the behavior of pathologists and their reliance on AI expertise.

Reliance on Artificial Intelligence
Experiments suggest that AI does indeed influence pathologists' decisions and improves clinical outcomes over either pathologists or AI making decisions separately (14). A recent survey suggests that 73.3% of pathologists are interested in or excited about integrating AI in their practice (8). However, there is also societal skepticism towards AI for vital decisions related to healthcare (10,15). Pathologists may distrust AI expert advice and choose not to rely on it, which could slow down the adoption of these tools (10,16,17). Moreover, surveys and opinion papers reflect an abstract opinion towards "AI", which is often a loaded and fantasized term, and it is yet to be determined whether such beliefs translate into corresponding behaviors. Pathologists may express optimism towards the role and impact of AI, but the question remains whether they will rely on it to make decisions that are critical to their work and the welfare of their patients.
The algorithmic nature of a recommendation may influence decisions in different ways (18). Disclosure of the artificial nature of a chatbot, for instance, can negatively impact its effectiveness (19). As Feldman et al. note: "individuals may be more likely to trust those who are similar to them and nothing seems more different from a human being than an algorithm" (20). Pathologists may thus experience what has been referred to as algorithm aversion and be less likely to rely on an algorithm when making decisions (21), especially when they have expertise (22). Simply referring to these technologies as "artificial intelligence" may also generate misconceptions and pre-existing fears that impact reliance on the technology. Alternatively, people commonly form the opinion that machines never make mistakes (23), which can lead to uncritical trust in the technology, and such complacency may induce excessive reliance on decision support systems (20). Experiments suggest that faulty AI can mislead pathologists (14). Overall, the presence of AI is likely to influence the decision-making process in at least some cases (24) and we posit that:

H1: AI decision aids will influence pathologists' diagnoses.

Information on Artificial Intelligence

Researchers and surveys of decision-makers suggest that even highly accurate algorithms could face adoption challenges if AI remains a black box (25). The opacity of AI tools and their "black box" nature may thus impede adoption (17). Model interpretability means, for the user, the ability to understand how the algorithm works and how it reaches its conclusions (7). Multiple studies have confirmed that transparent automation systems are associated with greater trust (23,26), and lack of interpretability in AI conclusions is a primary concern when implementing AI technology (20). It has been argued that machines should be as comprehensible and transparent as possible (27), to the extent that simpler systems may lead to more trust and reliance than more efficient but more complex systems (26).
Research suggests that information about performance is the key factor in generating trust in machines (28). Specifically, providing a confidence level helps users adjust their level of trust towards a decision aid system (23). Disclosing accuracy rate estimates alongside the predictions of AI tools may thus help pathologists build trust in the tools' results (17). As a result, we posit that:

H2: Pathologists will rely more on AI if information on the AI algorithm is disclosed.

Trust and reliance on AI

Outside of pathology, a mature stream of research has investigated the antecedents of reliance on automated systems, to which artificial intelligence systems belong (23,26). This research highlights that trust in the automated system is a key determinant of reliance on and use of the system: "people tend to rely on automation they trust and tend to reject automation they do not" (26). We posit that:

H3: The more pathologists trust an AI system, the more they will rely on it.
In this study, we test these hypotheses and investigate the following research questions: to what extent do pathologists rely on AI? To what extent does information on AI influence reliance on AI by pathologists? To answer these questions, we conducted a survey experiment completed by 116 pathologists who assessed the Gleason grade for a series of prostate images.

Experimental design
The research design was a 1x3 within-subjects online experiment. In consultation with an expert pathologist, the Gleason grading system was identified as an appropriate task for this experiment for the following reasons: 1) a large proportion of pathologists are familiar with it; 2) it is performed relatively quickly; 3) it can lead to important clinical decisions for patients with prostate cancer diagnoses; and 4) there is significant ambiguity and heterogeneity in results, with low-to-moderate agreement scores ranging from 47%-70% (29), suggesting an opportunity for the implementation of machine learning tools that show comparable accuracy.
Digitized prostate biopsy samples were provided by the Radboud University Medical Center and the Karolinska Institute as part of the 2020 Prostate cANcer graDe Assessment (PANDA) Challenge, which made public around 11,000 whole-slide images (WSI) to explore the potential of automated deep learning systems in pathology (30). The WSI are colour images measuring between 60,000 and 100,000 pixels in each dimension. The images were anonymized and uploaded to the cloud using PathcoreFlow™ (31), a web-based commercial image management solution and viewer for digital pathology (see Appendices A and B).
The online experiment was developed on the online survey platform Qualtrics™. Each participant assessed the 12 WSI through a secure link to PathcoreFlow, with four images presented in each of the following three conditions:

A. No-aid: no expert advice provided.
B. AI-aid, no information (AI-aid): expert advice that respondents were told came from an algorithm, with no further information.
C. AI-aid with information (AI-aid+): expert advice that respondents were told came from an algorithm, with accuracy rate and model information (see Appendix A for a full description).
For condition C, transparency was operationalized as a brief summary, in lay terms, of the steps typically followed by an algorithm to reach its conclusion (6,32), revised by a team of researchers to improve readability and accessibility. The algorithm's accuracy was expressed relative to typical pathologist agreement from previous studies of deep learning system (DLS) AI for whole-slide Gleason scoring (29,33) (see Appendix C). Participants were instructed to review the prostate sample image and read any additional information before providing their clinical decision.
The AI advice was presented as the result of algorithmic processes; however, these scores were actually based on the predetermined ground truth established in the PANDA challenge by three experienced pathologists with a subspecialty in urological pathology (34).
To control for order/carryover effects, conditions were presented in random order, except that the "AI with information" condition was always presented after the "AI, no information" condition to avoid a prior-opinion bias. The more ambiguous Gleason Grades (2 and 3) were overrepresented to minimize agreement with the AI by chance (35). The conditions under which each batch of 4 images was presented were randomized so that each image appeared equally often under each of the three conditions. The order of the items within each batch was also randomized.
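To make this counterbalancing concrete, the sketch below shows one way to implement it in R (the language used for our analysis); the function name assign_trials and the data layout are hypothetical, since the actual randomization was implemented in Qualtrics.

    # Hypothetical sketch of the trial assignment (the real randomization ran
    # in Qualtrics). Batches of 4 images rotate across conditions by
    # participant so that each image appears equally often under each
    # condition; the condition order is drawn only from orders in which
    # AI-aid+ follows AI-aid; item order within each batch is shuffled.
    assign_trials <- function(participant_id, images = 1:12) {
      batches   <- split(images, rep(1:3, each = 4))
      rotation  <- (participant_id - 1) %% 3
      batch_for <- function(k) batches[[(k + rotation - 1) %% 3 + 1]]
      admissible <- list(c("No-aid", "AI-aid", "AI-aid+"),
                         c("AI-aid", "No-aid", "AI-aid+"),
                         c("AI-aid", "AI-aid+", "No-aid"))
      cond_order <- admissible[[sample(3, 1)]]
      do.call(rbind, lapply(1:3, function(k)
        data.frame(condition = cond_order[k], image = sample(batch_for(k)))))
    }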
After the task, participants were asked questions about the perceived trustworthiness and helpfulness of the AI tool, and general trust towards AI (see Appendix D). These questions were answered on a 5-point Likert scale. Finally, participants answered socio-demographic questions about age, gender, employment status, workplace, years of experience and specialization, and were offered a $20 Amazon gift card. Upon completing the survey, all participants received a debrief form explaining the use of deception and a summary of their answers to allow for constructive reflection.

Participants
Participants were recruited from medical associations and social media groups for pathologists to review a series of biopsy sample images and assess the Gleason Grade for each sample. Eligible participants were practising, student or retired pathologists with adequate knowledge to assign a Gleason Grade. All participants provided their informed consent to the researchers and were subsequently provided with a link to practise using the PathcoreFlow™ digital pathology software before starting the survey. 460 people completed the survey. However, 59 were excluded from the analysis for incomplete responses, 101 for repeat email addresses or IP addresses and 45 for potentially using spam email addresses. An additional 139 respondents were excluded for selecting a Gleason Grade without viewing the image on more than one of the images that had no decision aid associated with them, or for answering in less than 10 seconds, which the pathology expert deemed insufficient for proper analysis of the image. The final sample was 116 respondents. Information on the sample is provided in Table 1.

Statistical analysis

To test our hypotheses, we performed a mixed-effects logistic regression using the lme4 package in R. Mixed-effects logistic regression is an extension of the generalized logistic model that accounts for fixed and random effects. These models are particularly useful in settings where repeated measurements are made on the same statistical units.
In the first model, the condition (No-aid, AI-aid and AI-aid+), our variable of interest, was specified as the fixed effect, while participants and images were specified as the random effects. This approach allows us to control for potential biases introduced by individual-level variance, such as ability, and image-level variance, such as grading difficulty. The dependent variable of the model is the score, computed as "1" if the respondent assigned the "true" Gleason Grade to the image, and as "0" otherwise. The accuracy of the "true" score is not essential to our experiment, since our interest lies in how much respondents are influenced by the algorithm.
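As an illustration, the first model can be specified in lme4 along the following lines. This is a minimal sketch assuming a long-format data frame, here given the hypothetical name responses, with one row per participant-image response.

    library(lme4)

    # Model 1: condition is the fixed effect of interest; random intercepts
    # for participants (e.g., ability) and images (e.g., grading difficulty).
    # 'score' is 1 when the assigned Gleason Grade matches the ground truth,
    # 0 otherwise.
    m1 <- glmer(score ~ condition + (1 | participant) + (1 | image),
                data = responses, family = binomial)
    summary(m1)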
In the second model, to identify the factors contributing to reliance on AI, we added interactions between the condition and Experience, Gender and the answers to the five survey questions.
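The second model extends the first with these interaction terms; again a sketch, with hypothetical column names for the demographic variables and the five Likert items.

    # Model 2: interactions between condition and the explanatory variables.
    # Q1-Q5 hold the 5-point Likert responses described in Appendix D.
    m2 <- glmer(score ~ condition * (experience + gender +
                                     Q1 + Q2 + Q3 + Q4 + Q5) +
                  (1 | participant) + (1 | image),
                data = responses, family = binomial)
    summary(m2)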

Results
The average accuracy without aid was 75.3%, against 81.4% for both the AI-aid and AI-aid+ conditions. Timing data indicate that questions in the AI-aid+ condition (M = 21.25 s) required less time to complete than those in the AI-aid condition (M = 27.08 s) and the no-aid condition (M = 44.2 s).
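For reference, these descriptive figures correspond to a simple per-condition aggregation, sketched below using the same hypothetical responses data frame and an assumed time_sec column for response times.

    # Mean accuracy and mean response time (in seconds) per condition.
    aggregate(cbind(score, time_sec) ~ condition, data = responses, FUN = mean)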
The first mixed-effects logistic regression model showed, at the 0.05 significance level, that the AI-aid and AI-aid+ conditions improved scores compared to the no-aid condition (see Table 2). The results thus suggest that, after controlling for image- and individual-level variation, pathologists are more likely (92% versus 87%) to provide accurate answers (i.e. align with the AI recommendations) when AI recommendations are present, supporting hypothesis 1. Since the AI recommendations are the true Gleason Grades, the coefficients of the conditions in the logistic mixed-effects model represent the extra effect of the recommendation on accuracy over the respondent's skill and the image's complexity, in other words the reliance on AI recommendations: how much more likely a participant is to give the right answer when an AI recommendation is presented than when there is none. However, no significant difference was found between scores in the AI-aid and AI-aid+ conditions, providing no support for hypothesis 2. In the second model, we introduced the explanatory variables Experience, Gender and the five survey questions in Appendix D. Since there were no differences between the AI-aid and AI-aid+ conditions, we only distinguished between the "AI-aid" (including both AI-aid and AI-aid+) and "No-aid" conditions. Table 3 presents the model results.
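To illustrate how fixed-effect coefficients on the logit scale map onto the probabilities reported above, the following back-of-the-envelope conversion uses made-up values (not our actual estimates), chosen only to reproduce the 87% versus 92% contrast.

    # Illustrative only: a no-aid log-odds of about 1.90 corresponds to 87%
    # accuracy; adding a condition coefficient of about 0.54 yields 92%.
    plogis(1.90)          # ~0.87, predicted accuracy without aid
    plogis(1.90 + 0.54)   # ~0.92, predicted accuracy with an AI recommendation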
First, there are three main effects. AI-aid and Gender (Male) are significantly associated with higher scores; in addition, the less pathologists believe that AI can help pathologists be more efficient (1 means "strongly agree" and 5 means "strongly disagree"), the higher their accuracy. Second, the interaction effects add detail to these effects. The interaction between Male and AI-aid (-1.35) was significant and almost entirely countered the impact of Male as a standalone factor (1.65). In other words, the higher accuracy of males almost entirely disappeared when AI advice was offered. Likewise, the effect of the two questions Q4 ("Artificial Intelligence can help pathologists make better decisions") and Q5 ("Artificial Intelligence can help pathologists be more efficient") was entirely offset in the AI-aid condition. In other words, participants who agreed with these statements were less accurate in the No-aid condition but scored the same in the AI-aid condition (the main effect of Q4 was insignificant). Q1, Q2 and Q3 had no significant direct or interaction effects.
Finally, the interaction between AI-aid and Experience was positive (the more experienced respondents were, the more likely they were to rely on the AI advice), but fell just short of the significance level (p = 0.064).

Discussion
This study has two key findings. First, hypothesis H1 is supported. The difference in correct answers between the no-aid condition and the two AI conditions indicates a statistically significant influence of AI on pathologists' decisions when assessing the Gleason grade. Therefore, the presence of an AI recommendation influences pathologists. This result confirms previous findings on the impact of AI on the decision-making process (14) and suggests that AI influences physicians in making expert decisions. Considering that such AI systems are not yet widely operational for assessing the Gleason grade, this suggests a significant openness to relying on AI among pathologists, in line with positive opinions expressed by pathologists (4,8). It may also point to a risk of over-reliance on computer-aided diagnostic systems, which has been identified in radiology (37), of confirmatory bias and of alert fatigue (38).
The second key finding is that there is no difference in reliance on AI between AI without information and AI with information. This result contradicts previous findings that explicit transparency in automated tools is essential to develop trust and reliance (23). We offer two explanations. The first is that pathologists may have preconceived opinions on the reliability of AI systems. Reliance on AI would thus depend not so much on the worthiness of the specific AI system as on their general beliefs towards AI. This is confirmed by the partial support for hypothesis H3. Reliance on AI was significantly related to general beliefs in the potential of AI to help pathologists be more efficient and make better decisions (Q4 and Q5) but not to beliefs in the trustworthiness and usefulness of the AI advice provided in the experiment (Q2 and Q3). This is a surprising result, suggesting that general beliefs towards AI are more decisive than beliefs towards the specific AI advice being used. One explanation is that pathologists may find it difficult to evaluate specific algorithms. This echoes studies showing a generally positive attitude towards AI alongside a difficulty in having confidence in its results (39).
The second explanation is that the disclosure of a less-than-perfect accuracy rate (70%), even though it was better than the average accuracy of pathologists, nullified the impact of information on AI reliance. Medical professionals tend to have high expectations of AI accuracy (over 80% of clinicians expect AI systems to be superior to the average performing specialist) (40). There is an expectation that AI will produce perfect answers (21), and people lose confidence in algorithmic forecasters more quickly than in human ones, even after being shown that the AI is more accurate on average (21). High-profile instances of harmful or inadequate performance have thus produced reactions that hindered the development of AI (36). However, the lack of correlation between Q2 and Q3 and scores again contradicts this explanation. This result might also highlight the specificity of pathologists in their decision process.
Finally, the advice may have been used to shorten the decision process rather than to improve its quality through double-checking, as suggested by the timing data, which show that decisions were made faster in the two AI-aid conditions than in the no-aid condition. This confirms previous findings that algorithms reduce the time spent developing a diagnosis (36).

Limitations
First, despite significant efforts to eliminate suspicious responses, recruitment through social media carries some risk of participation by non-pathologists. However, no respondent answered all four questions in the No-aid condition incorrectly, and only 7 out of 116 respondents had only one correct answer. Scoring the Gleason grade is a complex and non-obvious task, and it is unlikely that a non-pathologist would be able to achieve such results. However, the sample may have included respondents from related professions not captured in the demographics survey, such as urologists or laboratory technicians.
Second, while the within-subjects experimental design can help reduce errors associated with individual differences by explicitly controlling for individual variances, and significantly increases statistical power compared to a between-subjects design, such a design may have biased the data by anchoring reliance behaviors in the first AI condition or by inducing a fatigue effect by the time participants reached the "AI with information" task. As all participants were first exposed to the "AI, no information" condition before the "AI with information" one, they may have formed and anchored their opinion on whether to rely on the AI advice by the time they reached the "AI with information" questions. Further replications of this experiment may benefit from using a between-groups design to isolate each condition and from requiring participants to complete their assessment before viewing the AI response (21).

Implications
This study provides encouragement to implementers of AI solutions. It suggests that pathologists (and possibly other professionals) are open to using AI recommendations for critical and ambiguous decisions. Future research should investigate further how physicians and other professionals with high responsibility rely on AI recommendations, in particular by improving our understanding of the extent to which reliance on AI decision aids differs from reliance on second opinions from other physicians, the extent to which pathologists over-rely on AI decision aids, and the effects of task ambiguity on AI reliance.
It also suggests that information on the algorithms, at least in such applications, may not be as decisive as previously thought. Some of the most complex and opaque AI systems, such as deep learning ones, may thus not be at a disadvantage. This also suggests that pathologists may not feel confident in assessing algorithms. Further research should investigate what factors or information can enable pathologists to confidently assess specific algorithms or the decisions they produce. Practitioners and managers may also consider that the decision (and responsibility) to rely on algorithms should be delegated to institutions, which comes with its own set of challenges.
Methodologically, it remains a challenge to recruit doctors as participants in empirical research studies (41). Our approach succeeded in recruiting a substantial number of physicians as respondents by enabling participation from all over the world despite the restrictions caused by the COVID-19 global pandemic.

Conclusions
The results of this experiment suggest that pathologists are open to relying on AI recommendations for critical and ambiguous decisions. They also suggest that information on the algorithms, at least in such applications, may not be as decisive as previously thought. AI algorithms need to be seen as tools that realize their value when physicians decide to rely on them. Taking physicians into account and ensuring that they are in a position to make the best decisions, rather than placing them in competition with AI, should be a priority for practitioners, developers, implementers and regulators alike.
Abbreviations

AI: artificial intelligence