Agents that Argue and Explain Classifications of Retinal Conditions

Expertise for auditing AI systems in the medical domain is only now being accumulated. Conformity assessment procedures will require AI systems: (1) to be transparent, (2) not to base decisions solely on algorithms, and (3) to include safety assurance cases in the documentation to facilitate technical audit. We are interested here in obtaining transparency for machine learning (ML) applied to the classification of retina conditions. Achieving high performance metrics with ML has become common practice. However, in the medical domain, algorithmic decisions need to be sustained by explanations. We aim at building a support tool for ophthalmologists able to: (i) explain algorithmic decisions to the human agent by automatically extracting rules from the ML learned models; (ii) include the ophthalmologist in the loop by formalising expert rules and including the expert knowledge in the argumentation machinery; (iii) build safety cases by creating assurance argument patterns for each diagnosis. For the learning task, we used a dataset consisting of 699 OCT images: 126 in the Normal class, 210 with Diabetic Retinopathy (DR), and 363 with Age Related Macular Degeneration (AMD). The dataset contains patients from the Ophthalmology Department of the County Emergency Hospital of Cluj-Napoca. All ethical norms and procedures, including anonymisation, have been followed. We applied three machine learning algorithms: decision tree (DT), support vector machine (SVM), and artificial neural network (ANN). For each algorithm, we automatically extract diagnosis rules. For formalising expert knowledge, we relied on the normative dataset (Invernizzi et al. in Ophthalmol Retina 2(8):808–815, 2018). For arguing between agents, we used the Jason multi-agent platform. We assume different knowledge bases and reasoning capabilities for each agent. The agents have their own optical coherence tomography (OCT) images on which they apply a distinct machine learning algorithm.
The learned model is used to extract diagnosis rules. With distinct learned rules, the agents engage in an argumentative process. The resolution of the debate outputs a diagnosis that is then explained to the ophthalmologist by means of assurance cases. For diagnosing the retina condition, our AI solution deals with the following three issues. First, the learned models are automatically translated into rules. These rules are then used to build an explanation by tracing the reasoning chain supporting the diagnosis. Hence, the proposed AI solution complies with the requirement that "algorithmic decisions should be explained to the human agent". Second, the decision is not based solely on ML algorithms. The proposed architecture includes expert knowledge: the diagnosis is reached by exchanging arguments between ML-based algorithms and expert knowledge. The conflict resolution among arguments is verbalised, so that the ophthalmologist can supervise the diagnosis. Third, assurance cases are generated to facilitate technical audit. The assurance cases structure the evidence along various safety goals such as machine learning methodology, transparency, or data quality. For each dimension, the auditor can check the provided evidence against current best practices or safety standards. We developed a multi-agent system for retina conditions in which algorithmic decisions are sustained by explanations. The proposed tool goes beyond most medical software, which focuses only on performance metrics. Our approach helps the technical auditor to approve software in the medical domain. Interleaving knowledge extracted from ML models with expert knowledge is a step towards balancing the benefits of ML with explainability, aiming at engineering reliable medical applications.


Introduction
Patients expect physicians to comprehensibly explain decisions that have an impact on them. The same holds for physicians interacting with decision support systems, which need to explain their recommendations [1]. Explaining algorithmic decisions is not only a trust related issue, but also a legal one. For instance, article 22 of GDPR empowers human agents with the "right to demand an explanation of how AI system made a decision that affects them".
With the current surge of interest in machine learning (ML)-based medical software, explaining decisions made by black-box ML models remains challenging [2]. For many domains [3,4], especially the medical one [5,6], algorithmic decisions need to be sustained by explanations [7,8] and assurance cases. It has been argued that machine learning equals human-level capacity in medical diagnosis [9], and achieving high performance metrics has indeed become common practice [10]. In our view, this high performance obtained through deep learning models, which are black-box models, raises at least three practical challenges: (1) How can explanations be provided to the users? (2) How can assurance cases be built for audit and safety approval? (3) How can expert knowledge be included in the loop? These challenges are not distinct, but rather interleave each other: if one extracts knowledge from the learned models, the reasoning can be easily traced to generate explanations and to build safety cases.
Machine learning uses various algorithms to analyse data. The problem is that, if data is analysed with a decision tree (DT) algorithm, the data reveals a particular pattern. If the same data is analysed with support vector machine (SVM) or artificial neural network (ANN) algorithms, different patterns will be revealed. Deciding which pattern is the correct one becomes a challenging task. In our approach, each ML algorithm learns a distinct set of rules from the same data. Each rule represents an argument, and the agents exchange part of these arguments. The resolution of the debate forms a diagnosis that is then explained to the stakeholder, by showing the chain of supporting arguments and by building the assurance case [11].
We developed a diagnosis system for retinal conditions. We designed the system to meet the following requirements: 1. To explain algorithmic decisions to the human agent by automatically extracting rules from the black-box ML algorithms. 2. To include the ophthalmologist in the loop by formalising expert rules and including the expert knowledge in the argumentation machinery. 3. To build safety cases by creating assurance argument patterns for each diagnosis.
This design goes beyond most medical software, which focuses only on performance metrics. Our approach helps the technical auditor to approve software in the medical domain.
We propose an argumentation-based decision function that combines results from three ML classifiers and one expert agent. This expert agent contains rules manually formalised from normative data for the retina, such as the data found by Invernizzi et al. [12]. Interleaving knowledge extracted from ML models with expert knowledge is a step towards balancing the benefits of ML with explainability, aiming at engineering reliable medical applications.
We applied our machinery on distinguishing between three classes: normal retina, Diabetic Retinopathy (DR), and Age Related Macular Degeneration (AMD). We chose DR and AMD because both diseases: (i) are highly prevalent; (ii) produce changes in the macular area; (iii) benefit from an early diagnosis and treatment: if detected early, we could prevent severe vision loss; (iv) modify the volume and the thickness of the macular retina.

Diagnosing by Machine Learning and Expert Knowledge
Retina conditions can be identified using Optical Coherence Tomography (OCT). From the OCT images, we extract the volume and thickness of the retina, defined as the space between the internal limiting membrane and Bruch's membrane. The segmentation of this space is made by the OCT software installed on our Heidelberg Spectralis device. We use an Early Treatment Diabetic Retinopathy Study (ETDRS) grid, which is built inside the OCT software [13].
The grid is made of three concentric circles with 1, 3 and 5 mm radius. The grid is further divided into nine zones: one central zone c0 in the 1 mm circle; four zones in the 3 mm circle (n1, s1, t1, i1); and four zones in the 5 mm circle (n2, s2, t2, i2). These nine zones are depicted in the top left of Fig. 1. Here n stands for the nasal, s for the superior, t for the temporal, and i for the inferior zone of the retina. The thickness and volume values inside the grid are automatically calculated, and the grid's position on the macula is manually adjusted by the OCT operator, so that the center of the three concentric circles overlaps the center of the fovea. Patients are classified into three classes: Age Related Macular Degeneration (AMD), Diabetic Retinopathy (DR), and Normal. For the experiments we used a dataset consisting of 699 images: 126 samples from the Normal class, 210 from DR, and 363 from AMD. Each sample has 18 attributes: 9 thickness values and 9 volume values, one for each zone c0, n1, s1, t1, i1, n2, s2, t2, i2. The dataset contains patients from the Ophthalmology Department of the County Emergency Hospital of Cluj-Napoca, Romania. We included the following patients who visited our clinic and underwent OCT examination: 50 patients with AMD having at least one eye with stage 4 AREDS exudative type; 5 patients with DR having at least one eye with macular edema; and 3 patients with normal retinal architecture in both eyes. All ethical norms and procedures, including anonymisation, have been followed.
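To fix ideas, a patient sample can be encoded as the 18 attributes described above. A minimal Python sketch (the predicate-style key names are our own illustration, not the OCT software's output format):

```python
# Illustrative encoding of one OCT sample: thickness t(zone) and volume v(zone)
# for the nine ETDRS zones. Key names mirror the paper's predicates.
ZONES = ["c0", "n1", "s1", "t1", "i1", "n2", "s2", "t2", "i2"]

def make_sample(thickness, volume):
    """Build the 18-attribute feature dict from per-zone measurements."""
    assert len(thickness) == len(volume) == len(ZONES)
    sample = {}
    for z, t in zip(ZONES, thickness):
        sample[f"t({z})"] = t   # average thickness of zone z
    for z, v in zip(ZONES, volume):
        sample[f"v({z})"] = v   # volume of zone z
    return sample

sample = make_sample([0.30] * 9, [0.50] * 9)
```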

Augmenting Data and Training
In the first step, we start with an unbalanced dataset of patients with multiple visits: 50 with AMD, 5 with DR and 3 with normal retina. Since this dataset is small for machine learning, we perform data augmentation. Data augmentation [14,15] is performed by changing the thickness and volume of each zone with a random value between −0.001 and 0.001, so that the sample still belongs to the same class. For each sample from the DR and Normal classes, we generate 10 more samples with small variations. After augmentation, the resulting dataset of 699 images is more balanced: 126 Normal, 210 DR, and 363 AMD, with the class distribution 18% Normal, 30% DR, and 52% AMD.
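The augmentation step described above can be sketched as follows (a minimal illustration; the study's actual perturbation code is not reproduced here):

```python
import random

def augment(sample, n_copies=10, eps=0.001, seed=0):
    """Generate n_copies perturbed samples: each thickness/volume value is
    shifted by a random amount in [-eps, eps], keeping the class label."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_copies):
        out.append({k: v + rng.uniform(-eps, eps) for k, v in sample.items()})
    return out

orig = {"t(c0)": 0.27, "v(c0)": 0.21}   # illustrative values
copies = augment(orig)
```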
The values for the volume and average thickness of each zone are directly extracted from the OCT images. The dataset was divided into training, validation and test sets. The training set contains 60% of the input data and is used to learn the model. The validation set contains 20% of the input data and is used to tune the parameters of the model. The test set contains the remaining 20% and is used to check the learned model against data never seen before.
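The 60/20/20 split can be sketched as follows (illustrative; any seeded shuffle would do):

```python
import random

def split_dataset(samples, seed=42):
    """Shuffle and split into 60% training, 20% validation, 20% test."""
    rng = random.Random(seed)
    data = list(samples)
    rng.shuffle(data)
    n = len(data)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

# With the 699 samples of the study, this yields 419/139/141 samples.
train, val, test = split_dataset(range(699))
```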
We experimented with three algorithms: decision tree (DT), Support Vector Machine (SVM) and Artificial Neural Network (ANN) [16]. These algorithms illustrate three distinct approaches in the world of machine learning: DT is the master algorithm for the symbolists, SVM is the master algorithm for analogizers, while ANN is the favored approach for connectionists [17]. Since these algorithms follow different learning strategies, it fits our aim to show different perspectives on the given retinal case. For each algorithm there is an agent with the role of arguing for its classification decision.
Each internal node in a DT contains a test on the value of one of the input features. The branches of a node correspond to the possible values of that feature. The leaf nodes specify the class value returned by the decision tree. The DT algorithm is a greedy one, since at each step it picks the feature with the highest importance. The importance of a feature can be measured in terms of information gain (i.e., the expected reduction in entropy).

(Fig. 1: Extracting rules from learned models)

SVM has a different perspective on the task. It is a risk-averse algorithm, since it finds linear separators that stay as far as possible from both classes. It therefore needs to compute distances between the feature vectors corresponding to the training instances. Since SVM works well when there is a clear margin of separation between classes, it will bring to the table strong arguments for cases in which the diagnosis is clear.
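The entropy-based feature importance used by the DT algorithm can be made concrete with a short sketch (standard definitions, not code from the paper):

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Expected reduction in entropy after a split.
    parent: class counts before the split; children: counts per branch."""
    total = sum(parent)
    weighted = sum(sum(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# A pure split of a 50/50 node recovers the full 1 bit of information.
gain = information_gain([10, 10], [[10, 0], [0, 10]])
```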
ANN has the ability to detect complex nonlinear relationships. Learning occurs by gradually adjusting the weights between neurons. During training, the error computed for each instance is used to adjust the weights: the error is distributed by the backpropagation mechanism proportionally to the neurons that have contributed to it. Deep learning architectures applied to large datasets would boost performance. Here, we kept the architecture simple (e.g. only two hidden layers), since we want to be able to break the black-box model of the ANN and to extract rules that will be presented to the human agent.
For DT, we used Gini impurity to measure the quality of a split. For SVM, we used a 3-bin discretiser with an ordinal encoding and a uniform discretisation strategy. For SVM training [18], we used a linear SVM with squared hinge loss, a tolerance of 1e−5 for the stopping criterion, and the regularisation parameter C = 1.0. For ANN [19], we used 2 hidden layers of 18 and 3 neurons, with the ReLU activation function, ridge regression to avoid overfitting (i.e. L2 regularisation, since we have only 18 features to learn from), a penalty parameter of 1e−5, a learning rate of 0.0001, and a tolerance of 1e−4. Also, the Limited-memory BFGS optimiser [19] is appropriate for our small dataset of 699 images. The accuracy of the decision tree is 97%, of the neural network 75%, and of the SVM 70%. Note that our focus is not on improving performance, but on explainability: how we can explain the algorithm-based diagnosis to the human agent.
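Assuming a scikit-learn-style implementation (the paper does not name its ML library), the reported hyperparameters map onto estimator arguments roughly as follows; the SVM penalty kind is our assumption:

```python
# Hypothetical mapping of the reported hyperparameters onto scikit-learn
# estimator keyword arguments (a sketch, not the authors' actual code).
DT_PARAMS = {"criterion": "gini"}                 # Gini impurity for splits
SVM_PARAMS = {
    "penalty": "l2",                              # assumption: L2 penalty
    "loss": "squared_hinge",
    "tol": 1e-5,                                  # stopping-criterion tolerance
    "C": 1.0,                                     # regularisation parameter
}
ANN_PARAMS = {
    "hidden_layer_sizes": (18, 3),                # two hidden layers: 18 and 3
    "activation": "relu",
    "alpha": 1e-5,                                # L2 (ridge) penalty parameter
    "learning_rate_init": 1e-4,
    "tol": 1e-4,
    "solver": "lbfgs",                            # limited-memory BFGS
}
```

These dictionaries could then be passed as, e.g., `DecisionTreeClassifier(**DT_PARAMS)`, `LinearSVC(**SVM_PARAMS)`, and `MLPClassifier(**ANN_PARAMS)`.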

Extracting Rules
In the second step, we automatically extract rules from the learned models. For each of the nine zones, we use two predicates: t(zone) for thickness and v(zone) for volume. The rules learned by the DT algorithm are easily extracted by traversing the learned tree. The challenge remains to extract rules from black-box models such as SVM and ANN.
To be relevant for the ophthalmologists, the rules should contain only features from the input space, and not features from the learned model. With SVM and ANN, the learned parameters do not correspond to our predicates of interest t(zone) and v(zone). To transform the algorithm's parameters space into the input space, we use the Rule Matrix tool [20].
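Extracting rules by traversing a learned tree can be illustrated on a toy tree (only the thresholds of rule R_DT1 come from Table 1; the remaining branches and class distributions are invented for illustration):

```python
def extract_rules(node, path=()):
    """Walk a fitted decision tree (toy dict form) and emit one rule per leaf:
    a conjunction of threshold conditions plus the leaf's class distribution."""
    if "leaf" in node:                     # leaf: emit the accumulated conditions
        return [(list(path), node["leaf"])]
    feat, thr = node["feature"], node["threshold"]
    rules = []
    rules += extract_rules(node["left"], path + ((feat, "<=", thr),))
    rules += extract_rules(node["right"], path + ((feat, ">", thr),))
    return rules

# Toy tree mirroring rule R_DT1: t(s1) <= 0.35 and v(s1) <= 0.51 -> AMD.
# The other leaves are invented for the sake of the example.
tree = {
    "feature": "t(s1)", "threshold": 0.35,
    "left": {
        "feature": "v(s1)", "threshold": 0.51,
        "left": {"leaf": {"AMD": 1.0, "DR": 0.0, "Normal": 0.0}},
        "right": {"leaf": {"AMD": 0.2, "DR": 0.8, "Normal": 0.0}},
    },
    "right": {"leaf": {"AMD": 0.0, "DR": 0.1, "Normal": 0.9}},
}
rules = extract_rules(tree)
```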

Arguing, Conflict Resolution and Explaining Decision
Third, the multi-agent system consists of six agents (see Fig. 3): three agents have rules learned from machine learning (the DT agent, the ANN agent, and the SVM agent). The expert agent has rules manually formalised from normative datasets on retina. These four agents are called argumentative agents, since they exchange arguments based on their own perspective on the current patient. The master agent handles the conflict resolution among arguments and generates explanations. The ophthalmologist agent provides new cases for diagnosis. The agents were developed with the Jason platform and its AgentSpeak programming language [21].

We use three classes for retinal disease ⟨AMD, DR, Normal⟩ and three learning algorithms (LAs): a decision tree (DT), a support vector machine (SVM), and an artificial neural network (ANN). We also use expert knowledge E, hence LA = {DT, SVM, ANN, E}. From the learned models, we extract rules R_i with the structure:

Cond_1 ⊗ Cond_2 ⊗ ... ⊗ Cond_n --(support, m_j)--> ⟨AMD, DR, Normal⟩

Here m_j represents the performance metric of the learning algorithm LA, e.g. accuracy (a), precision (p), recall (r), F-measure (f). Each condition Cond has the form

t(zone) ⊕ value | v(zone) ⊕ value | feature(patient)

where the operator ⊕ ∈ {<, ≤, >, ≥, =}, while t is the thickness and v the volume of one retinal zone c0, n1, s1, t1, i1, n2, s2, t2, i2. With feature(patient), the rules can include any additional information on the current case (e.g. male(patient)). Several conditions can be linked by a logical operator ⊗ ∈ {∨, ∧}. The output of each rule is a probability distribution over the three classes; hence AMD + DR + Normal = 1. The number of cases supporting the rule is given by the support value above the implication operator.
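The rule structure above can be represented as a small record type. A Python sketch (the system itself encodes rules in AgentSpeak, so this is only an illustration):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Rule:
    """One extracted rule: conditions -> class distribution, annotated with
    the support (cases matching the rule) and the source algorithm's metric."""
    algorithm: str                              # one of "DT", "SVM", "ANN", "E"
    conditions: List[Tuple[str, str, float]]    # e.g. ("t(s1)", "<=", 0.35)
    distribution: Dict[str, float]              # probabilities over AMD/DR/Normal
    support: int                                # number of supporting cases
    metric: float                               # e.g. learning accuracy m_j

# Rule R_DT1 from Table 1, in this representation.
r_dt1 = Rule("DT",
             [("t(s1)", "<=", 0.35), ("v(s1)", "<", 0.51)],
             {"AMD": 1.0, "DR": 0.0, "Normal": 0.0},
             support=69, metric=0.97)
```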

Obtained Rules
We obtained three sets of rules automatically extracted from the machine learning algorithms, and one set of rules manually formalised from normative datasets on retina.

Rules Extracted from Machine Learning Models

Table 1 lists part of the rules learned with the decision tree algorithm. For instance, the rule R_DT1 states that if the thickness of the zone s1 is at most 0.35 (i.e. t(s1) ≤ 0.35) and the volume of the same zone s1 is less than 0.51 (i.e. v(s1) < 0.51), then the given retina has the AMD condition with 100% probability. Since the rule is valid for 69 images in our dataset, its support is 69. The confidence of the rule is given here by the accuracy metric: during learning, the obtained accuracy of the decision tree algorithm was a = 0.97. Apart from the extracted rules, the DT algorithm also computes the most informative feature, which in our case was the s1 zone. This means that the DT algorithm looks first at the s1 zone to diagnose an OCT image.
We obtained six rules from the SVM model (see Table 2). Since the thickness of the n2 zone appears in all six rules learned by SVM, this zone is considered by SVM as the most informative parameter. Note also that some retinal zones (e.g. i2) do not appear in these rules, meaning that the SVM algorithm found no information within the zone i2. The rules R_SVM4 and R_SVM5 are the most certain with respect to the Normal class (i.e. the probability is about 97% in both cases). The availability of such information is in line with the goal of Explainable AI (XAI), since the system is able to answer questions such as: when can I trust you? For instance, the SVM agent will be reliable for patients whose OCT images activate the rules R_SVM4 and R_SVM5. Note also that the rule R_SVM6 is a default rule, which is activated when the conditions of the other rules do not hold for the current OCT image.
The rules extracted from the ANN are listed in Table 3. For instance, the rule R_ANN1 states that if the volume of the t2 zone is less than 1.28, then the retina is most probably normal (with probability 0.9099). The confidence of all ANN rules is given by the learning accuracy a = 0.75 of the ANN classifier.

Rules Formalised from Expert Knowledge
To formalise the expert knowledge, we start from normative data for retinal thickness. Invernizzi et al. found that the thickness of retinal zones is influenced by age, sex, and axial length [12]. In Table 4, we formalised their results as rules to infer normal retina. Retinal thickness was significantly higher in males, and t(c0) is positively correlated with age. The mean retinal volume was 8.58 ± 0.36 mm³. The central thickness thresholds are 320 μm for males (rule R_E11) and 305 μm for females (rule R_E12), some 2 standard deviations above the average for this normative cohort. These values were proposed [23] as gender-specific thickness levels that give reasonable certainty that diabetic macular edema involving the central zone c0 is present using Spectralis measurements.
For instance, the expert rules R_E1 to R_E10 in Table 4, formalised from [24], refer to thickness values that indicate a normal retina, while the rules R_E11 and R_E12 support the DR diagnosis. There are now four perspectives when making the decision: the view of the ophthalmologist (Table 4), the view of the DT algorithm (Table 1), the view of the SVM algorithm (Table 2), and the view of the ANN (Table 3). For most OCT images, these perspectives agree. But for some OCTs they diverge, which triggers an argumentative process.

Argumentative Patterns
We use two argumentative patterns: argument for diagnosis (AD) used by the four arguing agents (i.e. DT, SVM, ANN and expert) to support their diagnosis, and the assurance pattern (AP) to explain the reliability of the results to the ophthalmologist.
First, the argument for diagnosis considers the number of arguments supporting and attacking a diagnosis (Algorithm 1). Each arguing agent conveys arguments to the master agent based on its learned model. For instance, two alternative arguments for achieving the goal diagnose are shown in Listing 1, as formalised in the Jason programming language. Note that these parameters correspond to rule R_DT1 in Table 1.

(Table 4: Sample of expert rules extracted from normative datasets)

Second, the assurance argumentative pattern [25] is adapted here for a multi-agent system (Fig. 4). We represent the pattern with the Goal Structuring Notation (GSN) [26]. GSN is a description language consisting of six node types: goals (what to assure), contexts (state, environment or conditions of the system), strategies (describing how to break down a goal into subgoals), evidence (assuring that the goal can be reached), monitoring (representing evidence available at runtime), and undeveloped nodes (signalling that no evidence or monitoring supports the goal). In Fig. 4, the goals are depicted as yellow rectangles, the contexts as rounded rectangles, the solutions as the blue circle, and the strategies as the parallelogram. The assurance argument points out pieces of evidence (the green shapes) for goals related to the machine learning methodology, the extracted rules, the conflict resolution, or the type of data used. One can see that the subgoal "Dataset is balanced" is an undeveloped node, signalled by the empty diamond. This empty diamond indicates to the ophthalmologist that the result is biased towards the most represented class in the initial dataset (here, AMD). Alternatively, the auditor can signal that the 699 instances used here do not meet the current threshold for data quantity for the given task. Such safety cases are therefore helpful to structure evidence and thereby facilitate the technical audit of AI systems in the medical domain.
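Counting supporting arguments per diagnosis, as in the AD pattern, can be sketched as follows (a Python simplification of Algorithm 1, which we do not reproduce; ties are broken alphabetically here):

```python
def argument_for_diagnosis(arguments):
    """Tally the diagnoses supported by the conveyed arguments and return
    the diagnosis with the most supporters (alphabetical tie-break)."""
    tally = {}
    for diag in arguments:
        tally[diag] = tally.get(diag, 0) + 1
    return max(sorted(tally), key=tally.get)

# Three arguments support DR, one supports AMD: DR wins.
winner = argument_for_diagnosis(["DR", "DR", "AMD", "DR"])
```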

Conflict Resolution
The master agent employs different conflict resolution strategies: voting, weighted voting, and strongest-argument. These strategies take into consideration three components: the number of supporting arguments (i.e. majority vote), the confidence in the argument, and the performance metric of the learning process.
With majority vote, if three agents reached the same diagnosis and only the fourth one has a different vote, the diagnosis is given by the most common view on the case. The votes can be weighted by the confidence in the conveyed arguments. The weight of an argument is given by: (1) its probability distribution over the diagnosis variable ⟨AMD, DR, Normal⟩ and (2) the performance metric achieved during learning. This probability distribution for each rule is extracted automatically from the learned model. For instance, in the case of the decision tree algorithm, the nodes in the tree contain the number of samples in each class that satisfy all the conditions from the root to that node. The values in the leaves of the tree are used to compute the probability that the diagnosis is correct. If all instances belong to one class, then the probability is 100%. If there are instances that belong to another class, then the probability decreases with each such sample. In the case of the SVM and ANN algorithms, each classifier provides the probability for a sample to belong to each class.
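A weighted-vote sketch, combining each argument's probability distribution with its agent's performance metric (the metrics and distributions below are illustrative, not taken from the tables):

```python
def weighted_vote(arguments):
    """Sum votes per diagnosis, weighting each argument's class probability
    by the performance metric of the agent that conveyed it."""
    scores = {"AMD": 0.0, "DR": 0.0, "Normal": 0.0}
    for metric, dist in arguments:
        for diag, p in dist.items():
            scores[diag] += metric * p
    return max(scores, key=scores.get), scores

args = [
    (0.97, {"AMD": 1.0, "DR": 0.00, "Normal": 0.00}),  # DT-style argument
    (0.70, {"AMD": 0.0, "DR": 0.54, "Normal": 0.46}),  # SVM-style argument
    (0.75, {"AMD": 0.0, "DR": 0.09, "Normal": 0.91}),  # ANN-style argument
]
diagnosis, scores = weighted_vote(args)
```

Here the combined weight for Normal (about 1.00) exceeds the single strong AMD vote (0.97), so the weighted vote returns Normal.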
As an example, assume that the following four arguments have all their premises satisfied. Here, the DT agent argues in favor of AMD (recall the order of the diagnoses in the probability distribution vector ⟨AMD, DR, Normal⟩). The strongest-argument strategy selects the diagnosis that is supported by the argument with the highest certainty weighted by the agent's performance metric. For instance, consider one argument conveyed by the SVM agent and one argument by the ANN agent. Here, the SVM agent supports the diagnosis DR with 0.7 × 0.5429 = 0.38. Differently, the ANN agent supports the diagnosis Normal with 0.75 × 0.9099 = 0.68. Assuming these are the only two arguments, the ANN's argument wins under the strongest-argument strategy. The ophthalmologist is the one responsible for selecting the conflict resolution strategy.
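The strongest-argument computation from the worked example above can be replayed directly:

```python
def strongest_argument(arguments):
    """Pick the diagnosis whose argument has the highest certainty
    weighted by its agent's performance metric."""
    best = None
    for agent, metric, diag, prob in arguments:
        weight = metric * prob
        if best is None or weight > best[1]:
            best = (diag, weight, agent)
    return best

# SVM supports DR with 0.7 * 0.5429; ANN supports Normal with 0.75 * 0.9099.
diag, weight, agent = strongest_argument([
    ("SVM", 0.70, "DR", 0.5429),
    ("ANN", 0.75, "Normal", 0.9099),
])
```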

Explaining Diagnosis
The master agent explains the diagnosis to the ophthalmologist by means of the arguments obtained from the arguing agents. These rule-based arguments are verbalised, as Table 5 shows. In our approach, verbalising rules is based on linguistic patterns such as: The patient has the diagnosis D because the volume/avg thickness value for retina zone Zn is larger/smaller than value X.
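The verbalisation pattern can be sketched as a template function (a Python illustration; the actual system verbalises rules inside the Jason agents):

```python
def verbalise(diagnosis, conditions):
    """Render a rule with the linguistic pattern from the text:
    'The patient has the diagnosis D because the volume/avg thickness value
    for retina zone Zn is larger/smaller than value X.'"""
    kind = {"t": "avg thickness", "v": "volume"}
    clauses = []
    for cond, op, value in conditions:
        pred, zone = cond[0], cond[2:-1]          # "t(s1)" -> "t", "s1"
        direction = "smaller" if op in ("<", "<=") else "larger"
        clauses.append(f"the {kind[pred]} value for retina zone {zone} "
                       f"is {direction} than {value}")
    return (f"The patient has the diagnosis {diagnosis} because "
            + " and ".join(clauses) + ".")

# Verbalising rule R_DT1 from Table 1.
text = verbalise("AMD", [("t(s1)", "<=", 0.35), ("v(s1)", "<", 0.51)])
```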
The master agent also displays the confidence of the supporting arguments conveyed by each agent. For instance, "DT agent is 100% sure" of the diagnosis Diabetic Retinopathy means that the DT agent conveyed a supporting rule with the diagnosis vector ⟨0, 1, 0⟩. Similarly, "SVM agent is 97% sure" means that its vector can be ⟨0.02, 0.97, 0.01⟩, where the probability 97% is attached to the DR diagnosis. Additionally, the master agent displays the global performance metrics of the learned models, e.g. an accuracy of 0.97 in the case of decision tree learning. In case of conflicting arguments, the master agent presents the supporting arguments for a diagnosis and generates the assurance argument template.

Introducing the Ophthalmologist in the Loop
The opinion of the ophthalmologist is presented here along with the results obtained from the agents for a patient diagnosed with Diabetic Retinopathy (Fig. 5, left). Taking into account the last two rules from Table 4, we consider that a thickness of c0 above 320 μm for men and 305 μm for women indicates the presence of increased macular thickness related to DR. The average retinal thickness in the central zone for this patient is 420 μm, therefore this patient might be affected by diabetic macular edema (Fig. 5, right). Analysing the cross-sectional central image, the normal retinal morphology appears modified, with increased thickness, cystoid fluid-filled spaces in the outer retinal zones (red arrowheads), hard exudates (green arrowheads) and disruption of the external limiting membrane and the photoreceptor layers (yellow arrowhead), all consistent with diabetic macular edema.

Discussions and Related Work
Expertise for auditing AI systems in the medical domain is only now being accumulated. Through the New Legislative Framework (NLF), the EU recommends that technical auditors of AI systems use expertise from established organisations such as agencies for medical devices. According to the new proposal of the EU [27], a regulatory agency for artificial intelligence is planned in each member state. Such an AI agency is envisaged to cooperate with and benefit from the experience of existing bodies, for instance when deploying AI software on medical devices. Conformity assessment procedures will require AI systems: (1) to be transparent, (2) not to base decisions solely on algorithms, and (3) to include safety assurance cases in the documentation to facilitate technical audit.
For diagnosing the retina condition, our AI solution deals with the above three aspects as follows. First, the learned models are automatically translated into rules. These rules are then used to build an explanation by tracing the reasoning chain supporting the diagnosis. Hence, the proposed AI solution complies with the requirement that "algorithmic decisions should be explained to the human agent". Second, the decision is not based solely on ML algorithms. The proposed architecture includes expert knowledge: the diagnosis is reached by exchanging arguments between ML-based algorithms and expert knowledge. The conflict resolution among arguments is verbalised, so that the ophthalmologist can supervise the diagnosis. Third, assurance cases are generated to facilitate technical audit. The assurance cases structure the evidence along various safety goals such as machine learning methodology, transparency, or data quality. For each dimension, the auditor can check the provided evidence against current best practices or safety standards. For example, a verbalised explanation produced by the system reads: "Diagnosis Diabetic Retinopathy was chosen because: the thickness value in zone t1 is smaller than 0.34, and the thickness value in zone t2 is smaller than 0.3, and the thickness value in zone s2 is greater than 0.3, and the volume value in zone s1 is smaller than 0.58, and the thickness value in zone s1 is greater than 0.35."
We proposed here an architecture for explaining the classification of retina conditions. Each module of this architecture can be further improved. First, the set of learning algorithms that we used (DT, SVM, or ANN) can be extended. For each new learning algorithm, an arguing agent will be enacted; its arguments are processed by the argumentation machinery, which takes a decision and explains it. This flexibility of the multi-agent system to accommodate new agents without architectural alteration is important, given that new ML-based algorithms are being developed fast. Note that, for each new learned model, one also needs to include a mechanism to extract rules. Second, the quantity of data used can be improved. For this aim, one can benefit from (i) the European Health Data Space initiative, which aims to overcome the current underuse of available medical data [27], or (ii) more advanced techniques for augmenting medical data [15]. In this line, it has been argued that data augmentation is more important than the learning algorithm for retinal vessel segmentation [14]. Third, the safety cases should be generated based on a certification procedure. The challenge is that these certification processes, including the corresponding certification centers and regulatory sandboxes, are only now developing [27].
The advantage of argumentative agents is that they can easily interact with each other and with an external user. They react to changes in the environment or to new knowledge and they are more suitable for an interactive system.
Hernandez et al. have used Jason Induction of Logical Decision Trees (JILDT) [28] to create logical decision trees using AgentSpeak plans. The focus is on developing agents that learn while they are executing plans. We are interested here in the reverse process. Instead of building decision trees based on plans, we train first various classifiers with available data and then we generate plans from the learned models.
Vadet et al. have developed a multi-agent military health system in which the diagnosis is distributed using a Belief Desire Intention agent-based model [29]. The system has an autonomous sensor network controlled by agents. The agents have common knowledge that includes diagnosis thresholds and costs for each rescue mission.
A diagnosis system used in a clinical environment should have assurance for its performance. Picardi et al. have designed an assurance argument pattern [25] that can be used for ML decision making. After the models are created using the training and validation datasets, their performance is measured on the test dataset. This performance gives the level of confidence in the ML model.
Our conflict resolution strategy resembles voting methods from ensemble learning. The advantage of our approach over ensemble learning is that the agents use strategies to accept arguments. Plans for accepting an argument change depending on the context and on the knowledge of the agent. The behavior of agents is guided by changes in their plans and belief base; an agent can use different plans according to its own knowledge. Also, we include agents with expert rules that are not learned from models.

Conclusion
We developed here a multi-agent system for the diagnosis of retinal conditions. The system diagnoses and explains three retina conditions (AMD, DR, and Normal) using 18 parameters extracted from OCT images. One advantage of our approach is the interleaving of arguments from expert knowledge with arguments extracted from machine learning algorithms. The learned arguments were automatically formalised as rules, in order to avoid black-box decisions and to ensure transparency. Moreover, the multi-agent system proposed here automatically generates assurance cases. These assurance cases, combined with the transparency of our multi-agent system, are a step towards balancing the benefits of machine learning with explainability for engineering reliable medical applications.