Explainable Neural Network Ensembles

doi:10.21203/rs.3.rs-863687/v1


Research Article


This work is licensed under a CC BY 4.0 License

Journal publication: published 14 Jul, 2022 in Cognitive Systems Research

Version 1


Neural networks are known for providing impressive classification performance, and the ensemble learning technique further enhances this performance by integrating multiple networks. But like neural networks, neural network ensembles are also considered black boxes because they cannot explain their decision-making process. So, despite their high classification performance, neural networks and their ensembles are not suited to applications that require explainable decisions. However, the rule extraction technique can overcome this drawback by representing the knowledge learned by a neural network in the form of interpretable decision rules. A rule extraction algorithm gives neural networks the power to justify their classification responses through explainable classification rules. Several rule extraction algorithms exist to extract classification rules from neural networks, but only a few of them generate rules using neural network ensembles. So this paper proposes an algorithm named Rule Extraction using Ensemble of Neural Network Ensembles (RE-E-NNES) to demonstrate the high performance of neural network ensembles through rule extraction. RE-E-NNES extracts classification rules by ensembling several neural network ensembles. Results show the efficacy of the proposed RE-E-NNES algorithm compared to different existing rule extraction algorithms.

Geometry

Topology

Theoretical Computer Science

Neural Networks

Ensemble

Boosting

Rule Extraction

Classification.

A Neural Network (NN) is one of the promising machine learning algorithms for solving tasks such as classification, prediction, and clustering. In particular, the NN is a prominent classification tool because of its impressive performance as a classifier. Its use for classification has grown further with performance-raising machine learning techniques such as ensembling. The ensemble technique can transform a classifier into a powerful classification model by combining the strength of several individual classifiers [1]. Research shows that the generalization capability of an NN increases through ensembling ([2], [3]). Neural Network Ensembles (NNEs) have been successfully used in various applications such as hand-written character recognition, face recognition, digital speech recognition, optical character recognition, medical diagnosis, and seismic signal classification ([4], [5]).

Although the ensemble technique has broadened the applicability of Neural Networks (NNs), their inability to explain their decision-making process still limits their use in applications that require transparent decisions along with high accuracy. NNs are considered black boxes because they cannot justify their decisions. However, researchers found a solution to this problem in the form of rule extraction. Rule extraction is a process that represents the knowledge learned by NNs in the form of symbolic classification rules ([6], [7]). Many works extract classification rules from NNs, but most of them extract rules from individual NNs; very few generate classification rules using NNEs. This work tries to fill this gap by generating classification rules using NNEs. NNEs are considered better and more powerful predictive models than individual NNs, so they will probably help to generate better classification rules.

This paper proposes an algorithm named "Rule Extraction using Ensemble of Neural Network Ensembles (RE-E-NNES)", which extracts symbolic classification rules using NNEs. The fundamental idea of the algorithm lies in its name: it employs several NNEs in the process of rule extraction. The algorithm builds a Neural Network Ensemble (NNE) model and extracts a rule set from it, then uses the obtained rules to build the succeeding NNE model and extracts rules from it, and so on. The process continues as long as a new rule set is generated and there are patterns misclassified by the rule set. The RE-E-NNES algorithm employs the boosting algorithm called Neural Network Boosting (NNBOOST) [8] for building NNEs and a decision tree for extracting rules from NNEs. Details of the proposed algorithm are given in Sect. 3. The algorithm is validated with twelve real-life datasets collected from the UCI and KEEL repositories and compared with a few existing rule extraction algorithms.

The paper is organized as follows: Sect. 2 presents the works related to rule extraction using NNEs, Sect. 3 explains the proposed RE-E-NNES algorithm in detail, Sect. 4 illustrates the RE-E-NNES algorithm with an example, Sect. 5 presents some significant experimental results along with a discussion of them, and finally Sect. 6 concludes the paper.

Rule extraction from NNs is an active topic that started decades ago with the aim of describing a network's predictions in a human-understandable form. The first work on rule extraction from NNs was done by Gallant [9] in 1988. Since then, many works have addressed rule extraction. Based on the way rules are extracted, Andrews et al. [10] proposed a taxonomy of rule extraction with three approaches. The first is the decompositional approach, which generates rules by analyzing the structure of the NN (each hidden node and the weights). The second is the pedagogical approach, which treats the NN as a black box and generates rules from its inputs and outputs as a whole, without analyzing hidden nodes and weights. The third is the eclectic approach, which combines the first two. Many algorithms have been proposed to extract rules based on these three basic techniques. Most of them extract rules from individual networks; only a few extract rules using ensembles, as described below:

Bologna proposed a special network called Discretized Interpretable Multilayer Perceptron (DIMLP) to generate symbolic rules from both single networks and ensembles ([11], [12]). For ensembling, he used the bagging technique. Zhou et al. introduced the REFNE (Rule Extraction from Neural Network Ensemble) algorithm [13], which mainly focuses on the comprehensibility of the generated rules and uses a specific fidelity evaluation mechanism. They used trained ensembles to generate instances and then extracted symbolic rules from those instances. They also used a specific discretization scheme to discretize the attributes before rule extraction and adopted a specific rule form. Bologna et al. [14] used an ensemble of DIMLP networks to recognise images via rule extraction. They used ICA filter energies to train neural network ensembles and extracted rules from the ensembles. They showed that a small number of filters is sufficient to reach a good recognition rate, whereas a large number of filters makes the rules complex. Johansson et al. [15] extracted rules from Neural Network Ensembles to argue that inconsistency of rule extraction algorithms is sometimes better than consistency. They used the Genetic Rule Extraction (G-REX) algorithm to extract rules from 20 independently trained NNs.

Hartono et al. [16] constructed a neural network ensemble model whose function can be easily interpreted by generating logical rules. Ao et al. [17] used ensembles of Elman neural networks and support vector machines to extract rules for temporal modelling of microarray continuous data. Hara et al. [18] proposed an Ensemble-Recursive-Rule extraction (E-Re-RX) algorithm based on the Re-RX (Recursive Rule Extraction) algorithm [19], which uses two artificial neural networks to achieve high recognition rates. Hayashi et al. ([20], [21]) proposed a three ensemble neural network rule extraction algorithm based on the Re-RX algorithm [19], which uses three artificial neural networks to generate symbolic rules. Hayashi et al. [22] applied the three ensemble neural network algorithm to credit risk evaluation. Bologna et al. [23] performed experiments based on the Discretized Interpretable Multilayer Perceptron (DIMLP) architecture for 10 independent runs of 10-fold cross-validation trials over 25 binary classification problems. Rules were extracted from DIMLP ensembles, boosted shallow trees (BSTs), and Support Vector Machines (SVMs) based on the DIMLP architecture.

Gu and Angelov [24] proposed a new deep rule-based approach using a high-level ensemble feature descriptor for aerial scene classification. They ensembled three pre-trained deep convolutional neural networks for feature extraction and extracted more discriminative representations from the local regions of aerial images. The proposed approach produced impressive classification results on unlabeled images. Hartono [25] proposed an ensemble of several perceptrons with a simple competitive learning mechanism to decompose a non-linear classification problem into several more manageable linear problems, thus realizing a piecewise-linear classifier. During the competitive learning process, each member of the ensemble competes to learn from one linear subproblem in a reinforcement learning-like mechanism. The linearity of the ensemble members simplifies the task of interpreting the rules captured by the ensemble. Yuan et al. [26] proposed a novel rule-learning algorithm where the neural network ensemble acts as a front-end processor to generate data for learning rules.

The Rule Extraction using Ensemble of Neural Network Ensembles (RE-E-NNES) algorithm extracts symbolic rules using Neural Network Ensembles (NNEs). As the name implies, the algorithm extracts rules by ensembling NNEs. The underlying functioning of the algorithm is depicted in Fig. 1. The algorithm builds the initial (first) NNE model and extracts rule set R₁ from it, subsequently uses R₁ to build the second NNE model and extracts rule set R₂ from it, and so on.

Several questions now arise, such as how many NNEs are ensembled and how the rule sets are combined. These questions are answered by the flowchart of the algorithm shown in Fig. 2. Notation used in the flowchart: P represents the training patterns, NNE_J represents the J^th Neural Network Ensemble, R_J represents the J^th rule set, and M represents the misclassified patterns. As the flowchart shows, the algorithm comprises two phases:

Phase I: In this phase, the algorithm builds the first (J = 1) ensemble model NNE_J, extracts a rule set from it, and then refines that rule set to obtain the first rule set R_J. The algorithm then computes M, the set of training patterns misclassified by R_J. If M = null, the algorithm stops with R_J as the final rule set; otherwise it goes to the second phase with M and R_J.

Phase II: In this phase, the algorithm builds the next NNE model NNE_(J+1) with M. It extracts rules from NNE_(J+1), merges them with R_J, and refines the merged rules to obtain R_(J+1). If R_(J+1) and R_J are equivalent, the algorithm stops with R_J as the final rule set; otherwise it computes M for R_(J+1). If M = null, the algorithm stops with R_(J+1) as the final rule set; otherwise it sets J to (J + 1) and repeats the whole phase with the M generated by R_(J+1).

Both the phases of the RE-E-NNES algorithm are explained in detail below:

3.1. Phase I

This phase comprises three important steps: creating the NNE, rule extraction from the NNE, and rule refinement.

Creating NNE

The RE-E-NNES algorithm ensembles NNs, specifically Multilayer Perceptrons (MLPs), using the popular boosting method. Boosting builds models sequentially by changing the weights of the patterns and combines them using a weighted majority voting scheme. Schapire [27] first introduced the concept of boosting, which was further modified by Freund et al. ([28], [29], [30]). Many works exist on boosting, but very few use NNs as the base classifier; most use other classifiers such as decision trees, SVM, or KNN. The reason might be the uniform initialization of pattern weights used by existing boosting algorithms: ensembling NNs with equal initial pattern weights may sometimes reduce accuracy. With this in view, Chakraborty et al. proposed the Neural Network Boosting (NNBOOST) algorithm [8] to improve the classification performance of NNs. NNBOOST uses a different weight initialization strategy: instead of assigning the same weight to all patterns, it uses the Euclidean distances of the patterns from the centroids of their respective classes to assign weights to training patterns. Weights are reassigned in every iteration by updating the class centroids. The RE-E-NNES algorithm uses NNBOOST to ensemble MLPs. The process of creating an ensemble is depicted in Fig. 3.

NNBOOST starts by calculating the centroids of all classes in the dataset. The centroid of a class is the mean of all training patterns in that class. Weights are assigned as the inverse of the Euclidean distances of the patterns from their respective centroids; that is, patterns close to a centroid receive more weight than distant patterns. Consider a two-class classification problem in which the classes are labeled {-1, 1} and a centroid is calculated for each class. For a pattern n belonging to class C_a, where a = 1 to 2, the weight is given by (1), where mean_C_a is the centroid of class C_a.

\(\text{weight}(n) = \dfrac{1}{\text{Euclidean distance}(n,\ \text{mean\_C}_a)}\) (1)
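The weight initialization in (1) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, and the `eps` guard for a pattern that coincides with its class centroid is an assumption, since the source does not say how that case is handled.

```python
import math

def class_centroids(patterns, labels):
    """Mean feature vector of each class (the class centroid)."""
    sums, counts = {}, {}
    for x, y in zip(patterns, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def initial_weights(patterns, labels, eps=1e-12):
    """weight(n) = 1 / distance(n, centroid of n's class), as in Eq. (1).

    eps is an assumption of this sketch: it avoids a division by zero when a
    pattern lies exactly on its class centroid.
    """
    centroids = class_centroids(patterns, labels)
    return [1.0 / max(math.dist(x, centroids[y]), eps)
            for x, y in zip(patterns, labels)]

# Toy data (invented for illustration): two patterns of class 1, three of class -1.
X = [[0.0, 0.0], [4.0, 0.0], [1.0, 1.0], [1.0, 5.0], [1.0, 0.0]]
y = [1, 1, -1, -1, -1]
weights = initial_weights(X, y)
# Class 1 centroid is (2, 0), class -1 centroid is (1, 2), so the weights
# are approximately [0.5, 0.5, 1.0, 0.333, 0.5].
```

Patterns near their class centroid get the largest weights, which is the opposite of a uniform start.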

The NNBOOST algorithm trains the first NN at iteration i = 1 with the weighted patterns and classifies the training patterns with the trained NN to obtain the misclassified patterns, the training error, and the confidence of the trained NN. The training error is given by (2), where \(\text{misclassification\_cost}(n)\) is the misclassification cost of pattern n (n = 1 to N): if n is misclassified by the trained NN, \(\text{misclassification\_cost}(n) = 1\); otherwise \(\text{misclassification\_cost}(n) = 0\). The confidence of the trained NN, given by (3), is the ratio of correct classification to misclassification.

\(\text{Error}(i) = \dfrac{1}{N}\sum_{n=1}^{N}\text{misclassification\_cost}(n)\) (2)

\(\text{Conf}(i) = \dfrac{1-\text{Error}(i)}{\text{Error}(i)}\) (3)
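As a small worked example of (2) and (3), the sketch below computes the training error and the confidence for a handful of invented predictions. The function names are hypothetical. Raising an exception for a zero error is a choice made here because (3) is undefined in that case; in the algorithm itself the boosting loop stops when the error is zero.

```python
def training_error(targets, predictions):
    """Eq. (2): mean misclassification cost (1 if wrong, 0 if right)."""
    costs = [0 if t == p else 1 for t, p in zip(targets, predictions)]
    return sum(costs) / len(costs)

def confidence(error):
    """Eq. (3): ratio of correct classification to misclassification."""
    if error == 0:
        # Eq. (3) divides by Error(i); this sketch refuses instead of guessing.
        raise ValueError("Conf(i) is undefined when Error(i) = 0")
    return (1.0 - error) / error

# Invented example: one of five patterns is misclassified.
targets = [1, 1, -1, -1, 1]
predictions = [1, -1, -1, -1, 1]
err = training_error(targets, predictions)   # 1/5 = 0.2
conf = confidence(err)                       # 0.8 / 0.2, i.e. about 4
```

A network that is right four times out of five therefore votes with about four times the weight of one that is right only half the time (whose confidence would be 1).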

If Error (i) = 0, the algorithm stops the process and sets (T = i) where T is the number of iterations or number of NNs to be ensembled, else the algorithm sets (i = i + 1) and checks whether all the NNs to be ensembled are created or not. If (i > T) the process is stopped, else the algorithm updates the centroid of classes for building the next NN model. Centroids are updated by excluding those training patterns which are misclassified by the NN model learned at iteration (i-1) and weights for training patterns are updated (using (1)) using the updated centroids.

For testing a pattern q, predictions are made by all T trained networks and combined using (4), where \(\text{classification}(i_q)\) is the prediction made by the i^th network for the pattern q. If the Output is positive, then the class for pattern q is assigned as 1, else −1.

\(\text{Output} = \operatorname{sign}\left(\sum_{i=1}^{T}\text{Conf}(i)\cdot\text{classification}(i_q)\right)\) (4)

The summarized form of the NNBOOST algorithm described above is given below:

1. Set i = 1; (i = 1 to T, T = number of neural networks to be ensembled)

2. For each class C_a,

• Calculate mean_C_a (centroid of training patterns in class C_a, a = 1 to number of classes)

3. Calculate the weight of n^th pattern as,

• weight(n) = (1/Euclidean distance(n, mean_C_a)); where n ∈ C_a

4. Train the i^th network with the weighted patterns.

5. Calculate the confidence of i^th network as:

• Conf(i) = ((1 − Error(i))/Error(i));

Where, Error(i) = \(\frac{1}{N}\sum_{n=1}^{N}\text{misclassification\_cost}(n)\); N = number of patterns,

If (target(n) == predict(n)), then misclassification_cost(n) = 0;

Else misclassification_cost(n) = 1;

6. If (Error(i) ≠ 0), set i = i + 1,

• If (i <= T)

◦ Update mean_C_a excluding the patterns misclassified by the (i−1)^th network and GOTO step 3.

• Else stop the process.

Else set T = i, and stop the process.

Testing: For a pattern q,

◦ Output = \(\operatorname{sign}\left(\sum_{i=1}^{T}\text{Conf}(i)\cdot\text{classification}(i_q)\right)\);

If (target(q) == predict(\(i_q\))), then \(\text{classification}(i_q)\) = 1;

Else \(\text{classification}(i_q)\) = −1;

◦ If Output = +ve, then class = 1;

Else class = −1;
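The steps above can be put together in a compact Python sketch. It is an illustration under several stated assumptions and not the NNBOOST implementation from [8]: the base learner is passed in as a hypothetical `train_fn` (the toy `weighted_centroid_learner` below is only a stand-in for an MLP), a large finite confidence replaces the undefined Conf(i) when Error(i) = 0, a class whose patterns are all misclassified keeps its previous centroid, and the vote follows the prose reading of (4), in which each network contributes its own ±1 prediction.

```python
import math

def centroids_of(patterns, labels, keep):
    """Class centroids computed only from the patterns flagged in keep."""
    out = {}
    for c in set(labels):
        members = [x for x, y, k in zip(patterns, labels, keep) if y == c and k]
        if members:
            out[c] = [sum(col) / len(members) for col in zip(*members)]
    return out

def weights_from(patterns, labels, cents, eps=1e-12):
    """Step 3: inverse Euclidean distance to the centroid of the pattern's class."""
    return [1.0 / max(math.dist(x, cents[y]), eps) for x, y in zip(patterns, labels)]

def nnboost(patterns, labels, train_fn, T):
    """Steps 1-6; train_fn(patterns, labels, weights) must return a predictor."""
    cents = centroids_of(patterns, labels, [True] * len(patterns))
    members = []                                  # list of (Conf(i), predictor)
    for _ in range(T):
        w = weights_from(patterns, labels, cents)
        predict = train_fn(patterns, labels, w)
        correct = [predict(x) == y for x, y in zip(patterns, labels)]
        error = correct.count(False) / len(patterns)
        if error == 0:
            # Conf is undefined here; the large finite value is an assumption.
            members.append((1e6, predict))
            break
        members.append(((1.0 - error) / error, predict))
        # Step 6: centroids without the misclassified patterns; classes left
        # empty keep their previous centroid (assumption of this sketch).
        cents = {**cents, **centroids_of(patterns, labels, correct)}
    return members

def ensemble_predict(members, x):
    """Eq. (4): sign of the confidence-weighted sum of +1/-1 predictions."""
    total = sum(conf * predict(x) for conf, predict in members)
    return 1 if total > 0 else -1

def weighted_centroid_learner(patterns, labels, weights):
    """Toy stand-in for an MLP: predicts the class of the nearest weighted centroid."""
    cents = {}
    for c in set(labels):
        rows = [(x, w) for x, y, w in zip(patterns, labels, weights) if y == c]
        total = sum(w for _, w in rows)
        cents[c] = [sum(w * x[k] for x, w in rows) / total
                    for k in range(len(rows[0][0]))]
    return lambda x: min(cents, key=lambda c: math.dist(x, cents[c]))

# Invented, well-separated toy data: the first learner already has zero error.
X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = [-1, -1, -1, 1, 1, 1]
ensemble = nnboost(X, y, weighted_centroid_learner, T=10)
```

Passing the learner in as a function keeps the boosting logic separate from the network training, so an actual MLP trainer that accepts pattern weights could be substituted without touching the loop.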

Rule extraction from NNE

In this step, the RE-E-NNES algorithm extracts a rule set from the NNE model created in the previous step. The algorithm uses the C4.5 decision tree to extract the rule set from the NNE model. It queries the NNE model by presenting a training pattern and uses the response (the prediction made by the NNE model) to build a decision tree. Consider the example shown in Fig. 4, where N represents the number of patterns, X₁, X₂, and X₃ represent features, and Y₁ and Y₂ represent classes. If (P₁ = [X₁, X₂, X₃, ?], P₂ = [X₁, X₂, X₃, ?], …, P_N = [X₁, X₂, X₃, ?]) represents a set of queries to the NNE model and ([Y₁], [Y₂], …, [Y₁]) represents the corresponding responses made by the NNE model, then the algorithm uses the completed patterns (P₁ = [X₁, X₂, X₃, Y₁], P₂ = [X₁, X₂, X₃, Y₂], …, P_N = [X₁, X₂, X₃, Y₁]) to build a decision tree for generating classification rules. The example in Fig. 4 shows that 5 classification rules are generated.
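The two mechanical parts of this step, relabeling the queried patterns with the model's responses and reading one rule off each root-to-leaf path of a tree, can be sketched as follows. Tree induction itself (C4.5 in the source) is not shown; the tree here is written by hand, and the tuple encoding of nodes and all function names are assumptions made for illustration.

```python
def relabel_with_oracle(patterns, oracle_predict):
    """Pair every queried pattern with the model's response for it."""
    return [(x, oracle_predict(x)) for x in patterns]

def tree_to_rules(node, path=()):
    """Turn each root-to-leaf path of a binary threshold tree into one rule.

    Internal node: (feature_index, threshold, left, right), where the left
    branch means feature < threshold. Leaf: a class label (not a tuple).
    Returns a list of (conditions, label) pairs.
    """
    if not isinstance(node, tuple):
        return [(list(path), node)]
    feature, threshold, left, right = node
    return (tree_to_rules(left, path + ((feature, "<", threshold),))
            + tree_to_rules(right, path + ((feature, ">=", threshold),)))

# Hypothetical oracle standing in for a trained NNE.
oracle = lambda x: "Y1" if x[0] < 0.5 else "Y2"
labelled = relabel_with_oracle([[0.1], [0.9]], oracle)

# Hand-written tree with three leaves, hence three rules.
tree = (0, 0.5, "Y2", (1, 0.3, "Y1", "Y2"))
rules = tree_to_rules(tree)
```

Because the tree is trained on the ensemble's responses instead of the original labels, the resulting rules describe what the ensemble predicts, which is the point of the pedagogical approach.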

Rule refinement

In this step, the algorithm refines the rules obtained from the NNE model to obtain a minimized rule set. It removes irrelevant rules from the set based on sensitivity analysis. Refinement is an iterative process that removes one rule per iteration. In each iteration, the algorithm calculates the accuracy of the rule set on the training patterns with each rule removed in turn and discards the rule that matters least for the decision, i.e., the rule whose absence improves the accuracy of the rule set the most. The iteration ends when no rule remains in the set whose absence can increase the accuracy of the rule set. The rule refinement step is depicted in Fig. 5. Rules denotes the rule set obtained from the preceding step, R_acc denotes the accuracy of Rules on the training set, r_i denotes the i^th rule of the set Rules, and h_i denotes the training accuracy of Rules after removing r_i.
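The refinement loop can be sketched as a greedy search, using the same names R_acc and h_i as above. This is a minimal illustration, not the authors' code: the rule representation (a list of condition functions plus a class label), the first-match evaluation order, and the default class are assumptions, and a rule is removed only when its absence strictly increases the training accuracy, as the text states.

```python
def classify(rules, default, x):
    """First matching rule decides the class; otherwise the default class."""
    for conditions, label in rules:
        if all(cond(x) for cond in conditions):
            return label
    return default

def accuracy(rules, default, patterns, labels):
    hits = sum(classify(rules, default, x) == y for x, y in zip(patterns, labels))
    return hits / len(patterns)

def refine(rules, default, patterns, labels):
    """Drop, one per iteration, the rule whose absence raises accuracy the most."""
    rules = list(rules)
    r_acc = accuracy(rules, default, patterns, labels)
    while rules:
        # h[i]: training accuracy of the rule set with rule i removed
        h = [accuracy(rules[:i] + rules[i + 1:], default, patterns, labels)
             for i in range(len(rules))]
        best = max(range(len(rules)), key=h.__getitem__)
        if h[best] <= r_acc:
            break                      # no removal increases the accuracy
        r_acc = h[best]
        del rules[best]
    return rules, r_acc

# Invented one-feature example with a deliberately harmful rule.
patterns = [[0.1], [0.2], [0.6], [0.7], [0.9]]
labels = [0, 0, 1, 1, 0]
bad_rule = ([lambda x: x[0] < 0.15], 1)
good_rule = ([lambda x: x[0] >= 0.5], 1)
refined, refined_acc = refine([bad_rule, good_rule], 0, patterns, labels)
# The full set scores 3/5; dropping bad_rule gives 4/5, and dropping
# good_rule as well would fall back to 3/5, so only good_rule remains.
```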

After completion of phase I, the algorithm classifies the training patterns P with the obtained rule set R_J (J = 1). If no pattern is misclassified, then the algorithm stops with R_J as the final rule set R; otherwise it goes to phase II with the misclassified patterns M and the rule set R_J.

3.2. Phase II

This phase of the algorithm ensembles a series of NNEs sequentially by extracting the respective rule sets from them until no more NNEs can be created. The algorithm initiates this phase by building a new NNE model (NNE_(J+1)) using the NNBOOST algorithm with the misclassified patterns (M) obtained by the rule set (R_J) in phase I. Subsequently, the algorithm extracts rules from NNE_(J+1) in the same way as explained in the Rule extraction from NNE step of phase I. Next, the rule merging step starts. As the flowchart in Fig. 2 shows, in this step the algorithm merges the rule set of the current NNE (NNE_(J+1)) with the rule set of the preceding NNE (NNE_J). Merging is done by simply taking the union of the two rule sets. The algorithm then refines the merged rule set in the same way as explained in the Rule refinement step of phase I to generate a rule set R_(J+1).

The rule set R_(J+1) is compared with the rule set R_J. If they are equivalent, then the algorithm stops with R_J as the final rule set; otherwise the training patterns P are classified by the rule set R_(J+1). If no pattern is misclassified (M is empty), then the algorithm stops with R_(J+1) as the final rule set R; otherwise it sets J = J + 1 and repeats phase II with the misclassified patterns M.
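The control flow of the two phases can be summarized in a short Python sketch. Every callable passed in (`build_nne`, `extract_rules`, `refine`, `misclassified`, `equivalent`) is a hypothetical placeholder for the corresponding step described above, and querying each new NNE with the patterns in M is an assumption, since the source only says the extraction works as in phase I.

```python
def re_e_nnes(P, labels, build_nne, extract_rules, refine, misclassified, equivalent):
    """Phase I followed by repeated phase II; returns the final rule set R."""
    # Phase I: first ensemble, its rules, refinement, misclassified patterns.
    nne = build_nne(P, labels)
    R = refine(extract_rules(nne, P), P, labels)
    M = misclassified(R, P, labels)            # indices into P

    # Phase II: repeat while some training patterns are still misclassified.
    while M:
        m_patterns = [P[i] for i in M]
        m_labels = [labels[i] for i in M]
        nne = build_nne(m_patterns, m_labels)
        new_rules = extract_rules(nne, m_patterns)
        merged = R + [r for r in new_rules if r not in R]   # union of rule sets
        R_next = refine(merged, P, labels)
        if equivalent(R_next, R):
            break                              # stop with R_J as the final set
        R = R_next
        M = misclassified(R, P, labels)
    return R

# Stub components (invented) that only exercise the control flow.
P = [[0], [1], [2]]
labels = [0, 1, 0]
final_rules = re_e_nnes(
    P, labels,
    build_nne=lambda X, y: len(X),
    extract_rules=lambda nne, X: ["rule_for_%d" % nne],
    refine=lambda rules, X, y: rules,
    misclassified=lambda R, X, y: {1: [0, 1], 2: [0]}.get(len(R), []),
    equivalent=lambda a, b: set(a) == set(b),
)
```

With these stubs the loop runs phase II twice and stops because no misclassified patterns remain; replacing `extract_rules` with one that always returns the same rule makes it stop through the equivalence test instead.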

This section illustrates the algorithm with the Australian Credit Approval dataset, collected from the University of California, Irvine (UCI). It is a standard dataset used for comparing different rule extraction algorithms. The dataset contains a mix of attributes with six continuous attributes, eight categorical attributes, and one binary attribute representing the class. It contains 690 patterns, with 307 patterns representing class one (+) and 383 patterns representing class zero (-). For this illustration, sets of 621 and 69 patterns are taken for training and testing, respectively.

RE-E-NNES creates NNEs with ten Feed Forward Neural Networks (FFNNs), each having a single hidden layer with 16 hidden nodes. The algorithm determines the number of hidden nodes by varying it from (L + 1) to 2L (where L is the number of input nodes) [8] and selecting the architecture with the lowest MSE on the training patterns.
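The search over hidden-layer sizes amounts to a simple minimum over the range (L + 1) to 2L. In the sketch below, `training_mse` is a hypothetical callback that would train an FFNN with the given number of hidden nodes and return its MSE on the training patterns; the toy function used in the example is invented and does not reproduce any result from the paper.

```python
def select_hidden_nodes(num_inputs, training_mse):
    """Try L+1 .. 2L hidden nodes and keep the count with the lowest training MSE."""
    candidates = range(num_inputs + 1, 2 * num_inputs + 1)
    return min(candidates, key=training_mse)

# For L = 14 inputs the candidates are 15, 16, ..., 28.
toy_mse = lambda h: abs(h - 16)      # invented curve with its minimum at 16
chosen = select_hidden_nodes(14, toy_mse)
```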

4.1 Phase I

Table 1 shows the results for this phase with the illustrative dataset. The first ensemble model NNE₁ produces an accuracy of 86.63% on the training patterns. The algorithm extracts eleven rules from the NNE₁ model, among which it selects four rules to form the first rule-set R₁. The rule-set R₁ produces an accuracy of 86.63% on the training patterns and misclassifies 83 training patterns. The algorithm feeds these 83 misclassified patterns, together with R₁, to phase II.

Table 1

Results for phase I with Australian Credit Approval dataset
Creating NNE	Training accuracy of NNE₁	86.63%
Rule extraction from NNE	Number of rules extracted	11
Rule refinement	Number of refined rules	4
Rule refinement	Rule-set R₁	Rule1: If (attribute3 >= 0.0491) ^ (attribute4 >= 0.25) ^ (attribute5 < 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule2: If (attribute5 >= 0.5) ^ (attribute10 < 0.47) ^ (attribute5 >= 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule3: If (attribute2 >= 0.1936) ^ (attribute4 < 0.25) ^ (attribute5 < 0.5) ^ (attribute10 < 0.47) ^ (attribute5 >= 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule4: If (attribute13 < 0.2708) ^ (attribute2 >= 0.0476) ^ (attribute4 >= 0.25) ^ (attribute5 < 0.5) ^ (attribute10 < 0.47) ^ (attribute5 >= 0.1154) ^ (attribute8 >= 0.5), then class = 1; Default rule: Else, class = 0;
Classification of training patterns by R₁	Accuracy of R₁ on training patterns	86.63%
Classification of training patterns by R₁	Number of misclassified patterns	83
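To show how a rule set such as R₁ classifies a pattern, the sketch below encodes the four rules of Table 1 as data and applies them with the default rule. The encoding (a pattern as a dictionary keyed by attribute number, values assumed to be normalized as the thresholds suggest) and the two example patterns are assumptions for illustration and are not taken from the dataset.

```python
import operator

OPS = {">=": operator.ge, "<": operator.lt}

# Rule-set R1 from Table 1; each rule is a conjunction of
# (attribute number, operator, threshold) and concludes class 1.
R1 = [
    [(3, ">=", 0.0491), (4, ">=", 0.25), (5, "<", 0.1154), (8, ">=", 0.5)],
    [(5, ">=", 0.5), (10, "<", 0.47), (5, ">=", 0.1154), (8, ">=", 0.5)],
    [(2, ">=", 0.1936), (4, "<", 0.25), (5, "<", 0.5), (10, "<", 0.47),
     (5, ">=", 0.1154), (8, ">=", 0.5)],
    [(13, "<", 0.2708), (2, ">=", 0.0476), (4, ">=", 0.25), (5, "<", 0.5),
     (10, "<", 0.47), (5, ">=", 0.1154), (8, ">=", 0.5)],
]

def classify_r1(pattern):
    """Return 1 if any rule of R1 fires, otherwise the default class 0."""
    for rule in R1:
        if all(OPS[op](pattern[attr], threshold) for attr, op, threshold in rule):
            return 1
    return 0

# Two invented patterns with 14 attributes each.
base = {a: 0.0 for a in range(1, 15)}
fires_rule1 = {**base, 3: 0.1, 4: 0.5, 5: 0.05, 8: 1.0}
low_attr8 = {**fires_rule1, 8: 0.0}

class_a = classify_r1(fires_rule1)   # Rule1 fires, so class 1
class_b = classify_r1(low_attr8)     # every rule needs attribute8 >= 0.5, so class 0
```

Every rule in R₁ includes the condition attribute8 >= 0.5, so any pattern below that threshold falls through to the default class.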

4.2 Phase II

Table 2 shows all the intermediate results obtained in the process of extracting rules from the second ensemble model NNE₂. The algorithm creates the NNE₂ model with the 83 patterns misclassified by R₁, and NNE₂ produces an accuracy of 93.98% on these 83 patterns. The table shows that the algorithm extracts three rules from the NNE₂ model and merges them with R₁ to form a rule-set with seven rules. The algorithm then refines these seven rules to create the second rule-set R₂ with five rules. R₂ is shown in Table 2. R₂ produces a training accuracy of 86.79%, which is higher than the training accuracy of R₁ (given in Table 1). As R₁ and R₂ are not equivalent, RE-E-NNES repeats phase II with R₂ and the 82 patterns misclassified by it.

Table 2

Results for rule extraction from the second NNE model with Australian Credit Approval dataset
Creating NNE	Accuracy of the NNE₂ with the 83 misclassified patterns used to create it	93.98%
Rule extraction from NNE	Number of rules extracted	3
Rule merging	Number of rules after merging	7
Rule refinement	Number of refined rules	5
Rule refinement	Rule-set R₂	Rule1: If (attribute3 >= 0.0491) ^ (attribute4 >= 0.25) ^ (attribute5 < 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule2: If (attribute5 >= 0.5) ^ (attribute10 < 0.47) ^ (attribute5 >= 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule3: If (attribute2 >= 0.1936) ^ (attribute4 < 0.25) ^ (attribute5 < 0.5) ^ (attribute10 < 0.47) ^ (attribute5 >= 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule4: If (attribute13 < 0.2708) ^ (attribute2 >= 0.0476) ^ (attribute4 >= 0.25) ^ (attribute5 < 0.5) ^ (attribute10 < 0.47) ^ (attribute5 >= 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule5: If (attribute10 >= 0.7985) ^ (attribute8 >= 0.5), then class = 1; Default rule: Else, class = 0;
Classification of training patterns by R₂	Accuracy	86.79%
Classification of training patterns by R₂	Number of misclassified patterns	82

Similarly, RE-E-NNES creates the third ensemble model NNE₃ with the 82 patterns misclassified by R₂. Table 3 shows all the results obtained in the process of rule extraction from NNE₃. The algorithm extracts six rules from the NNE₃ model and merges these six rules with R₂ to form a set of eleven rules. It then refines these eleven rules to create the third rule-set R₃, which comprises six rules. Results show that R₃ produces a training accuracy of 86.95%, which is higher than the training accuracy of R₂ (given in Table 2). R₂ and R₃ are not equivalent, so the algorithm repeats phase II with R₃ and the 81 patterns misclassified by it.

Table 3

Results for rule extraction from the third NNE model with Australian Credit Approval dataset
Creating NNE	Accuracy of the NNE₃ with the 82 misclassified patterns used to create it	96.34%
Rule extraction from NNE	Number of rules extracted	6
Rule merging	Number of rules after merging	11
Rule refinement	Number of refined rules	6
Rule refinement	Rule-set R₃	Rule1: If (attribute3 >= 0.0491) ^ (attribute4 >= 0.25) ^ (attribute5 < 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule2: If (attribute5 >= 0.5) ^ (attribute10 < 0.47) ^ (attribute5 >= 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule3: If (attribute2 >= 0.1936) ^ (attribute4 < 0.25) ^ (attribute5 < 0.5) ^ (attribute10 < 0.47) ^ (attribute5 >= 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule4: If (attribute13 < 0.2708) ^ (attribute2 >= 0.0476) ^ (attribute4 >= 0.25) ^ (attribute5 < 0.5) ^ (attribute10 < 0.47) ^ (attribute5 >= 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule5: If (attribute10 >= 0.7985) ^ (attribute8 >= 0.5), then class = 1; Rule6: If (attribute3 >= 0.9502) ^ (attribute8 >= 0.5), then class = 1; Default rule: Else, class = 0;
Classification of training patterns by R₃	Accuracy	86.95%
Classification of training patterns by R₃	Number of misclassified patterns	81

The algorithm creates the fourth ensemble model NNE₄ with the 81 patterns misclassified by R₃. Table 4 shows all the results for rule extraction from NNE₄. RE-E-NNES extracts two rules from the NNE₄ model and obtains eight rules after merging these two rules with R₃. It then refines these eight rules to form the fourth rule-set R₄, which comprises six rules. Results show that R₄ is equivalent to R₃. So, phase II ends with R₃ as the final rule-set, R.

The final rule-set produces an accuracy of 86.96% on the testing patterns.

Table 4

Results for rule extraction from the fourth NNE model with Australian Credit Approval dataset
Creating NNE	Accuracy of the NNE₄ with the 81 misclassified patterns used to create it	93.83%
Rule extraction from NNE	Number of rules extracted	2
Rule merging	Number of rules after merging	8
Rule refinement	Number of refined rules	6
Rule refinement	Rule-set R₄	Rule1: If (attribute3 >= 0.0491) ^ (attribute4 >= 0.25) ^ (attribute5 < 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule2: If (attribute5 >= 0.5) ^ (attribute10 < 0.47) ^ (attribute5 >= 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule3: If (attribute2 >= 0.1936) ^ (attribute4 < 0.25) ^ (attribute5 < 0.5) ^ (attribute10 < 0.47) ^ (attribute5 >= 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule4: If (attribute13 < 0.2708) ^ (attribute2 >= 0.0476) ^ (attribute4 >= 0.25) ^ (attribute5 < 0.5) ^ (attribute10 < 0.47) ^ (attribute5 >= 0.1154) ^ (attribute8 >= 0.5), then class = 1; Rule5: If (attribute10 >= 0.7985) ^ (attribute8 >= 0.5), then class = 1; Rule6: If (attribute3 >= 0.9502) ^ (attribute8 >= 0.5), then class = 1; Default rule: Else, class = 0;
Classification of training patterns by R₄	Accuracy	86.95%
Classification of training patterns by R₄	Number of misclassified patterns	81

This section presents results that analyse the performance of the proposed RE-E-NNES algorithm. The algorithm is validated with twelve datasets taken from the UCI and KEEL repositories. A detailed description of the datasets is given in Table 5. All experiments were run in MATLAB in a Windows environment. Ten-fold cross-validation is used to validate the experimental results.
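A ten-fold split can be sketched with plain index arithmetic. The sketch below uses contiguous folds; whether the original experiments shuffled or stratified the patterns is not stated, so that is an assumption, and the function name is hypothetical. For a dataset with 690 patterns, each fold holds out 69 patterns and trains on 621, the same sizes used in the illustration in Sect. 4.

```python
def k_fold_indices(n_patterns, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_patterns // k + (1 if f < n_patterns % k else 0)
                  for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_patterns))
        yield train, test
        start += size

folds = list(k_fold_indices(690, k=10))
```

The reported accuracy for a dataset would then be the average of the accuracies measured on the ten held-out folds.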

Table 6 shows the optimal architectures of the FFNNs used in the experiments, and the average testing accuracies of the FFNNs and the NNEs over ten folds. The testing accuracies for NNEs are those obtained after ensembling FFNNs with the NNBOOST algorithm (a part of the RE-E-NNES algorithm). NNBOOST used a maximum of ten FFNNs to create each NNE. The table shows that the average testing accuracy of the NNEs is higher than that of the FFNNs on all the datasets.

Table 5

Datasets used in the experiments
Datasets	Attributes	Total patterns	Patterns with missing values	Number of classes	Patterns per class (after removing patterns with missing values)
Credit Approval	6 continuous, 9 categorical	690	36	2	Class1 = 296 (+), Class0 = 358 (-)
Australian Credit Approval	6 continuous, 8 categorical	690	-	2	Class1 = 307 (+), Class0 = 383 (-)
Echocardiogram	10 useful (2 binary, 8 real), 2 useless (one denoting name and the other group)	132	71	2	Class1 = 17 (patient survived for at least 1 year after a heart attack), Class0 = 44 (patient did not survive for at least 1 year after a heart attack, or 1 year has not yet elapsed since the attack)
Statlog (Heart)	6 real, 3 binary, 3 nominal, 1 ordered	270	-	2	Class1 = 150 (labeled as 2 in the dataset representing ‘presence’), Class0 = 120 (labeled as 1 in the dataset representing ‘absence’)
Breast Cancer	9 useful (real), 1 useless (denoting ID)	699	16	2	Class1 = 239 (labeled as 4 in the dataset representing ‘malignant’), Class0 = 444 (labeled as 2 in the dataset representing ‘benign’)
German	13 categorical, 7 numerical	1000	-	2	Class1 = 700 (labeled as 1 in the dataset representing ‘good’), Class0 = 300 (labeled as 2 in the dataset representing ‘bad’)
Eye	14 continuous	14980		2	Class1 = 6723 (eye-closed), Class0 = 8257 (eye-open)
Pima Indians Diabetes	8 real	768	-	2	Class1 = 268 (presence), Class0 = 500 (absence)
Census Income	6 continuous, 8 categorical	48842	3620	2	Class1 = 7236 (> 50k), Class0 = 37986 (<= 50k)
Thyroid	6 continuous, 15 binary	7200	-	3	Class1 = 166 (normal), Class2 = 368 (hyper), Class3 = 6666 (hypo)
Wine	13 real	178	-	3	Class1 = 59 (type1), Class2 = 48 (type2), Class3 = 71 (type3)
Car Evaluation	6 categorical	1728	-	4	Class1 = 1210, Class2 = 384, Class3 = 69, Class4 = 65

Table 6

Optimal architectures, and accuracies for FFNNs and NNEs
Datasets	Architectures of FFNNs	Average testing accuracies of FFNNs for 10-folds (in %)	Average testing accuracies of NNEs for 10-folds (in %)
Credit Approval	15-26-1	84.92	85.69
Australian Credit Approval	14-16-1	85.36	86.23
Echocardiogram	10-17-1	88.33	91.67
Statlog (Heart)	13-24-1	76.29	80.74
Breast Cancer	9-11-1	95.74	96.03
German	20-27-1	71.9	74.2
Eye	14-25-1	54.89	55.13
Pima Indians Diabetes	8-14-1	73.38	74.42
Census Income	14-24-1	83.75	83.8
Thyroid	21-30-3	92.56	94.13
Wine	13-16-3	94.44	95.18
Car Evaluation	6-10-4	87.2	91.65

5.1 Results and Comparisons

The proposed RE-E-NNES algorithm is compared with four established rule extraction algorithms: Recursive Rule Extraction (Re-RX) [19], Reverse Engineering Recursive Rule Extraction (RE-Re-RX) [6], Eclectic Rule Extraction from Neural Network Recursively (ERENNR) [7], and Eclectic Rule Extraction from Neural Network with Multi-Hidden Layer (ERENN_MHL) [32].

Table 7 compares RE-E-NNES with the four algorithms on the average testing accuracy over 10-fold cross validation. The bold values represent the highest accuracies in the respective datasets. RE-E-NNES is the most accurate algorithm in nine of the twelve datasets. It ties with ERENN_MHL in the Echocardiogram dataset (98.33%) and is marginally outperformed in the Eye and Wine datasets. The largest gains in accuracy are observed in the German, Pima Indians Diabetes, and Car Evaluation datasets. Figure 6 presents the same comparison graphically.

The rules are also evaluated on fidelity. Table 8 compares RE-E-NNES with Re-RX, RE-Re-RX, ERENNR, and ERENN_MHL on the average testing fidelity over 10-fold cross validation. The bold values represent the highest fidelity in the respective datasets. In five of the twelve datasets, the rule-sets generated by RE-E-NNES have higher fidelity than those of all four other algorithms. Compared individually, the fidelity of the rules constructed by RE-E-NNES is better than Re-RX in all the datasets, better than RE-Re-RX in all the datasets except Pima Indians Diabetes, better than ERENNR in nine datasets, and better than ERENN_MHL in five datasets and equal to it in the Echocardiogram dataset.
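
The difference between the two scores can be made concrete with a short Python sketch. This section does not define fidelity formally, so the sketch assumes the usual meaning in the rule extraction literature: accuracy compares the rule-set's labels with the true labels, whereas fidelity compares them with the labels assigned by the NNE from which the rules were extracted. The function names and the toy labels are hypothetical.

```python
def agreement(predicted, reference):
    """Percentage of positions at which two label sequences agree."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)


def accuracy(rule_labels, true_labels):
    """Agreement between the rule-set and the ground truth."""
    return agreement(rule_labels, true_labels)


def fidelity(rule_labels, network_labels):
    """Agreement between the rule-set and the network ensemble it was extracted from."""
    return agreement(rule_labels, network_labels)


# Toy example (labels invented for illustration).
true_labels = [1, 0, 1, 1]
nne_labels = [1, 0, 0, 1]
rule_labels = [1, 0, 0, 0]
print(accuracy(rule_labels, true_labels))   # 50.0
print(fidelity(rule_labels, nne_labels))    # 75.0
```

A rule-set can therefore have high fidelity and low accuracy (it copies the network's mistakes) or the reverse, which is why both are reported.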

Table 7

Comparison of accuracies with the average of 10-fold cross validation results (in %)
Datasets	Re-RX	RE-Re-RX	ERENNR	ERENN_MHL	RE-E-NNES
Credit Approval	81.39	84.31	86.62	86.77	87.39
Australian Credit Approval	73.48	74.49	85.51	85.65	86.52
Echocardiogram	93.33	96.67	96.67	98.33	98.33
Statlog (Heart)	72.59	75.56	80.37	82.59	83.70
Breast Cancer	90.59	94.85	96.47	96.76	97.06
German	71.4	72.4	72.9	74.6	79.2
Eye	53.98	55.15	55.23	55.36	55.13
Pima Indians Diabetes	68.57	74.68	76.88	76.23	79.22
Census Income	82.49	83.59	84.29	84.39	85.74
Thyroid	94.07	94.48	94.87	95.08	95.67
Wine	63.70	81.85	98.15	98.89	97.78
Car Evaluation	88.87	88.87	89.86	90.12	94.82

Accuracy alone is not enough to conclude that a classification model is good. So, the algorithms are also compared on other performance measures, as shown in Table 9: precision, recall, f-measure, False Positive Rate (FPR), specificity, Balanced Accuracy (BA), and Matthews Correlation Coefficient (MCC). The bold values represent the best measures in the respective datasets.

• All the measures for RE-E-NNES are better than those for Re-RX, RE-Re-RX, ERENNR, and ERENN_MHL in the Thyroid and Car Evaluation datasets.

• In some datasets, RE-E-NNES produces a higher average number of False Positives (FP) than one or more of the other algorithms, and a few of its measures are consequently worse in those datasets. In the Credit Approval and Australian Credit Approval datasets, its rules have higher average FPR and lower average specificity only in comparison with Re-RX. In the Echocardiogram dataset, its rules have lower average precision, lower average specificity, and higher average FPR than those of RE-Re-RX and ERENNR. In the Census Income dataset, its rules have lower average precision, lower average specificity, and higher average FPR than those of all the other algorithms. In the Pima Indians Diabetes dataset, its rules have lower average precision (except against RE-Re-RX), lower average specificity (except against ERENN_MHL), and higher average FPR (except against ERENN_MHL). However, most of the measures are still better for RE-E-NNES in these datasets.

Table 8

Comparison of fidelity with the average of 10-fold cross validation results (in %)
Datasets	Re-RX	RE-Re-RX	ERENNR	ERENN_MHL	RE-E-NNES
Credit Approval	86.92	91.08	96.62	96.92	96.77
Australian Credit Approval	70.58	74.49	93.19	95.07	96.52
Echocardiogram	91.67	85	88.33	95	95
Statlog (Heart)	68.52	70.74	85.19	86.29	83.70
Breast Cancer	90.74	95	96.91	96.47	98.38
German	64.89	65.22	69.56	70	68.89
Eye	81.98	81.72	77.90	98.33	100
Pima Indians Diabetes	77.92	84.03	86.75	81.69	78.31
Census Income	93.46	94.23	94.42	99.12	96.27
Thyroid	90.14	90.4	90.82	92.56	96.81
Wine	46.67	73.33	87.78	95	91.67
Car Evaluation	49.13	49.13	59.94	54.39	62.95

• In some datasets, RE-E-NNES classifies patterns with a higher average number of False Negatives (FN), and a few of its measures are consequently worse in those datasets. In the Credit Approval and Australian Credit Approval datasets, RE-E-NNES has lower average recall than ERENNR and ERENN_MHL. In the Statlog (Heart) dataset, its rule-sets have lower average recall and lower average f-measure than those of ERENNR. In the Breast Cancer dataset, it has lower average recall than Re-RX and ERENNR. In the German dataset, its rule-sets have lower average recall than those of RE-Re-RX and ERENN_MHL. However, most of the measures are better for RE-E-NNES than for Re-RX, RE-Re-RX, ERENNR, and ERENN_MHL in these datasets.

• For the Eye dataset, RE-E-NNES produces rules with better average recall and f-measure than the four algorithms. Compared individually, RE-E-NNES is better in 5, 4, 4, and 2 measures than the Re-RX, RE-Re-RX, ERENNR, and ERENN_MHL algorithms, respectively. In the Wine dataset, RE-E-NNES is better than two of the algorithms (Re-RX and RE-Re-RX) on all the measures.

Taken together, the results show that RE-E-NNES performs better than the four algorithms on most of the measures in most of the datasets. So it can be concluded that the average performance of RE-E-NNES is better than that of the other algorithms.
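
For reference, the measures discussed above can be computed from a binary confusion matrix as in the following Python sketch. It uses the standard textbook definitions and returns fractions rather than percentages. The function name is hypothetical, every denominator is assumed to be non-zero, and the sketch does not cover whatever averaging scheme is applied to the multi-class datasets.

```python
import math


def binary_measures(tp, fp, tn, fn):
    """Standard measures from true/false positive and negative counts (as fractions)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)  # equals 1 - specificity
    f_measure = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "fpr": fpr,
        "specificity": specificity,
        "ba": balanced_accuracy,
        "mcc": mcc,
    }


# Invented counts for illustration: 45 TP, 5 FP, 45 TN, 5 FN.
print(binary_measures(45, 5, 45, 5))
```

Because FPR and specificity sum to one, a rise in false positives lowers precision and specificity together, while a rise in false negatives lowers recall, f-measure, and BA.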

Table 9

Comparison of performance measures with the average of 10-fold cross validation results (all in % except MCC)
Datasets	Algorithms	Precision	Recall	F-measure	FPR	Specificity	BA	MCC
Credit Approval	Re-RX	79.07	78.62	78.59	17.12	82.88	80.75	0.6177
	RE-Re-RX	79.42	86.87	82.81	18.33	81.67	84.27	0.6843
	ERENNR	79.51	94.4	86.17	20.1	79.89	87.15	0.7429
	ERENN_MHL	80.05	93.76	86.2	19.23	80.77	87.26	0.7436
	RE-E-NNES	81.62	92.83	86.73	17.31	82.68	87.76	0.7518
Australian Credit Approval	Re-RX	78.02	57.69	65.65	14.19	85.8	71.75	0.4658
	RE-Re-RX	73.53	66.45	69.54	19.39	80.61	73.53	0.4793
	ERENNR	79.13	92.48	85.07	19.93	80.07	86.28	0.7235
	ERENN_MHL	79.19	92.79	85.23	19.93	80.07	86.43	0.7267
	RE-E-NNES	82.57	90.54	85.97	16.28	83.72	87.13	0.7404
Echocardiogram	Re-RX	86.667	100	92	10	90	95	0.8828
	RE-Re-RX	100	96	97.778	0	100	98	0.9264
	ERENNR	100	96	97.778	0	100	98	0.9264
	ERENN_MHL	96.67	100	98	2.5	97.5	98.75	0.9707
	RE-E-NNES	96.67	100	98	2.5	97.5	98.75	0.9707
Statlog (Heart)	Re-RX	74.59	67.97	67.84	21.50	78.49	73.23	0.4857
	RE-Re-RX	78.05	69.45	70.54	17.68	82.32	75.89	0.5374
	ERENNR	77.92	91.64	83.74	32.78	67.22	79.43	0.6162
	ERENN_MHL	81.72	80.12	79.21	14.93	85.07	82.59	0.6633
	RE-E-NNES	82.42	80.29	80.95	14.04	85.96	83.13	0.6686
Breast Cancer	Re-RX	80.84	95.44	86.93	11.96	88.04	91.74	0.8016
	RE-Re-RX	90.95	95.26	92.85	5.62	94.38	94.82	0.8848
	ERENNR	92.46	98.28	95.27	5.3	94.69	96.49	0.9216
	ERENN_MHL	97.12	94.96	95.93	2.97	97.03	95.99	0.9306
	RE-E-NNES	97.57	95.33	96.37	1.98	98.02	96.67	0.9383
German	Re-RX	74.00	91.34	81.63	75.00	24.99	58.17	0.2163
	RE-Re-RX	74.42	92.31	82.29	74.07	25.93	59.12	0.2469
	ERENNR	75.95	90.54	82.18	71.09	28.90	59.72	0.2407
	ERENN_MHL	76.09	92.88	83.51	68.44	31.56	62.22	0.3231
	RE-E-NNES	81.44	91.35	85.95	48.04	51.96	71.66	0.4765
Eye	Re-RX	37.29	3.74	6.79	5.12	94.89	49.31	-0.033
	RE-Re-RX	51.39	0.75	1.47	0.56	99.44	50.09	0.011
	ERENNR	52.11	3.27	6.14	2.46	97.54	50.40	0.0244
	ERENN_MHL	86.55	1.03	1.99	0.441	99.56	50.30	0.0439
	RE-E-NNES	55.53	20	71.39	19.98	80.02	50.01	0.0299
Pima Indians Diabetes	Re-RX	89	13.26	22.22	1.38	98.62	55.94	0.2505
	RE-Re-RX	70.29	48.03	56.29	10.98	89.02	68.53	0.4134
	ERENNR	81.75	45.30	57.21	6.12	93.89	69.59	0.4736
	ERENN_MHL	74.30	52.22	58.72	12.39	87.61	69.91	0.4523
	RE-E-NNES	73.79	63.39	67.85	11.97	88.04	75.71	0.5369
Census Income	Re-RX	75	0.04	0.18	0.90	99.09	50.02	0.0257
	RE-Re-RX	87.89	1.35	3.09	0.053	99.947	50.64	0.0958
	ERENNR	100	1.79	4.10	0.00	100	50.89	0.1265
	ERENN_MHL	90.45	2.63	5.10	0.049	99.951	51.28	0.1319
	RE-E-NNES	62.73	18.91	29.49	1.34	98.66	59.03	0.3332
Thyroid	Re-RX	91.09	91.09	91.09	4.45	95.55	93.32	0.8665
	RE-Re-RX	91.72	91.72	91.72	4.14	95.86	93.79	0.8758
	ERENNR	92.31	92.31	92.31	3.85	96.15	94.23	0.8846
	ERENN_MHL	92.63	92.63	92.63	3.69	96.31	94.47	0.8894
	RE-E-NNES	93.51	93.51	93.51	3.24	96.76	95.14	0.9027
Wine	Re-RX	45.56	45.56	45.56	27.22	72.78	59.17	0.1833
	RE-Re-RX	72.78	72.78	72.78	13.61	86.39	79.58	0.5917
	ERENNR	97.22	97.22	97.22	1.39	98.61	97.92	0.9583
	ERENN_MHL	98.33	98.33	98.33	0.83	99.17	98.75	0.975
	RE-E-NNES	96.67	96.67	96.67	1.67	98.33	97.5	0.95
Car Evaluation	Re-RX	77.75	77.75	77.75	7.42	92.58	85.16	0.7033
	RE-Re-RX	77.75	77.75	77.75	7.42	92.58	85.16	0.7033
	ERENNR	79.71	79.71	79.71	6.76	93.24	86.47	0.7295
	ERENN_MHL	80.23	80.23	80.23	6.59	93.41	86.82	0.7364
	RE-E-NNES	89.65	89.65	89.65	3.45	96.55	93.10	0.8620

The algorithm is also validated with the Friedman non-parametric test and the Least Significant Difference (LSD) post-hoc test. The null hypothesis for the test is “There is no significant difference between Re-RX, RE-Re-RX, ERENNR, ERENN_MHL, and RE-E-NNES”. The ten-fold cross validation results are used for the tests. Table 10 shows the results of the Friedman test: the mean rank of each algorithm, the Friedman statistic (chi-square value), and the p-value. The null hypothesis is rejected at p < 0.01 in all the datasets except Echocardiogram (p = 0.7505).
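
The quantities reported in Table 10 can be sketched in Python as follows. The sketch ranks the algorithms within each fold (rank 1 for the lowest score, mean ranks for ties), computes the textbook Friedman statistic, and converts it to a p-value with the closed-form chi-square tail that holds for an even number of degrees of freedom (here 5 - 1 = 4). All function names are hypothetical. No tie correction is applied, so the statistic can differ slightly from that of a statistical package when folds contain tied scores.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties receive their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def friedman(scores):
    """scores[f][a] is the score of algorithm a on fold f.

    Returns (mean_ranks, statistic, p_value) without tie correction.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for a, r in enumerate(average_ranks(row)):
            rank_sums[a] += r
    statistic = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    mean_ranks = [r / n for r in rank_sums]
    return mean_ranks, statistic, chi2_sf_even_df(statistic, k - 1)


# Ten folds in which five algorithms are always ordered the same way (invented scores).
mean_ranks, statistic, p_value = friedman([[0.70, 0.72, 0.80, 0.82, 0.85]] * 10)
print(mean_ranks, statistic, p_value)
```

As a consistency check, `chi2_sf_even_df(39.8191, 4)` gives approximately 4.7178e-08, which agrees with the p-value listed for Australian Credit Approval in Table 10.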

The Friedman test indicates that there is a significant difference among the Re-RX, RE-Re-RX, ERENNR, ERENN_MHL, and RE-E-NNES algorithms, but it does not identify which pairs of algorithms differ. So, an LSD post-hoc analysis is performed on the results of the Friedman test. LSD performs all possible pairwise comparisons of the group mean ranks obtained from the Friedman test. Table 11 shows the results of the LSD test; only the pairs of interest are listed. The p-value for the pair (Re-RX, RE-E-NNES) is significant at p < 0.01 in all datasets except Echocardiogram. The p-value for the pair (RE-Re-RX, RE-E-NNES) is significant at p < 0.01 in all datasets except Echocardiogram, Eye, and Wine. The p-value for the pair (ERENNR, RE-E-NNES) is significant at p < 0.01 or p < 0.05 in all datasets except Echocardiogram, Eye, and Wine. The p-value for the pair (ERENN_MHL, RE-E-NNES) is significant at p < 0.01, p < 0.05, or p < 0.1 in the Credit Approval, German, Eye, Pima Indians Diabetes, and Car Evaluation datasets. So, after analysing the mean ranks, pairwise comparisons, and p-values in Tables 10 and 11, it can be concluded that in most cases the results for the RE-E-NNES algorithm are significant enough to reject the null hypothesis. The overall results for RE-E-NNES are therefore statistically significant compared to the Re-RX, RE-Re-RX, ERENNR, and ERENN_MHL algorithms, with the weakest evidence against ERENN_MHL.
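
A pairwise comparison of mean ranks can be approximated as in the Python sketch below. It is a normal approximation based on the standard error of a difference between two Friedman mean ranks, sqrt(k(k+1)/(6n)) for k algorithms and n folds, with no correction for ties or for multiple comparisons. It is an assumption made for illustration, not the exact procedure behind Table 11, so its p-values land near the tabulated ones rather than on them. The function names are hypothetical.

```python
import math


def pairwise_rank_p_value(mean_rank_a, mean_rank_b, n_folds, n_algorithms):
    """Two-sided p-value for the difference between two Friedman mean ranks."""
    k, n = n_algorithms, n_folds
    standard_error = math.sqrt(k * (k + 1) / (6.0 * n))
    z = abs(mean_rank_a - mean_rank_b) / standard_error
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


def significance_stars(p_value):
    """Star notation used in the tables: *** p < 0.01, ** p < 0.05, * p < 0.1."""
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.1:
        return "*"
    return ""


# Mean ranks 1.05 and 5 over ten folds and five algorithms (Australian Credit Approval, Table 10).
p = pairwise_rank_p_value(1.05, 5.0, n_folds=10, n_algorithms=5)
print(p, significance_stars(p))
```

Two algorithms with identical mean ranks give a p-value of exactly 1, which matches the entries of 1 in Table 11 for pairs whose mean ranks coincide in Table 10.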

Table 10

Multiple comparisons using Friedman statistical test
Datasets	Algorithms	Mean Ranks	Friedman statistic	p-value
Credit Approval dataset	Re-RX	1.5	36.0412	0.000020277***
	RE-Re-RX	2.05
	ERENNR	3.3
	ERENN_MHL	3.45
	RE-E-NNES	4.7
Australian Credit Approval	Re-RX	1.05	39.8191	0.000000047178***
	RE-Re-RX	1.95
	ERENNR	3
	ERENN_MHL	4
	RE-E-NNES	5
Echocardiogram	Re-RX	2.7	1.92	0.7505
	RE-Re-RX	3
	ERENNR	2.9
	ERENN_MHL	3.2
	RE-E-NNES	3.2
Statlog (Heart)	Re-RX	1.35	32.5155	0.0000015009***
	RE-Re-RX	2.05
	ERENNR	2.9
	ERENN_MHL	3.8
	RE-E-NNES	4.9
Breast Cancer	Re-RX	1.15	38.7208	0.000000079555***
	RE-Re-RX	1.95
	ERENNR	2.9
	ERENN_MHL	4
	RE-E-NNES	5
German	Re-RX	1.45	29.1224	0.0000073822***
	RE-Re-RX	2.3
	ERENNR	2.8
	ERENN_MHL	3.45
	RE-E-NNES	5
Eye	Re-RX	1	26.2769	0.000027825***
	RE-Re-RX	2.95
	ERENNR	3.25
	ERENN_MHL	4.5
	RE-E-NNES	3.3
Pima Indians Diabetes	Re-RX	1.1	29.0103	0.0000077798***
	RE-Re-RX	2.75
	ERENNR	3.35
	ERENN_MHL	3
	RE-E-NNES	4.8
Census Income	Re-RX	1.45	29.8523	0.0000052451***
	RE-Re-RX	2.3
	ERENNR	3.05
	ERENN_MHL	3.75
	RE-E-NNES	4.45
Thyroid	Re-RX	1.05	39.0553	0.000000067857***
	RE-Re-RX	2.05
	ERENNR	2.9
	ERENN_MHL	4
	RE-E-NNES	5
Wine	Re-RX	1.1	32.2553	0.0000016965***
	RE-Re-RX	2
	ERENNR	3.8
	ERENN_MHL	4.3
	RE-E-NNES	3.8
Car Evaluation	Re-RX	1.7	33.93684	0.00000076775***
	RE-Re-RX	1.7
	ERENNR	2.8
	ERENN_MHL	3.8
	RE-E-NNES	5
*** means (p < 0.01), ** means (p < 0.05), * means (p < 0.1). Bold values in the Mean Ranks column represent the highest mean ranks.

Table 11

Pairwise comparisons using Least Significant Difference (LSD) post-hoc test
Datasets	Pairwise comparison		p-value
Credit Approval	Re-RX	RE-E-NNES	0.0000030462***
	RE-Re-RX	RE-E-NNES	0.0001109***
	ERENNR	RE-E-NNES	0.04114**
	ERENN_MHL	RE-E-NNES	0.06825609*
Australian Credit Approval	Re-RX	RE-E-NNES	0.000000021415***
	RE-Re-RX	RE-E-NNES	0.00001531***
	ERENNR	RE-E-NNES	0.0046***
	ERENN_MHL	RE-E-NNES	0.1563
Echocardiogram	Re-RX	RE-E-NNES	0.2482
	RE-Re-RX	RE-E-NNES	0.6442
	ERENNR	RE-E-NNES	0.4884
	ERENN_MHL	RE-E-NNES	1
Statlog (Heart)	Re-RX	RE-E-NNES	0.00000034416***
	RE-Re-RX	RE-E-NNES	0.0000427***
	ERENNR	RE-E-NNES	0.0041***
	ERENN_MHL	RE-E-NNES	0.1142
Breast Cancer	Re-RX	RE-E-NNES	0.000000041108***
	RE-Re-RX	RE-E-NNES	0.00001386***
	ERENNR	RE-E-NNES	0.0028***
	ERENN_MHL	RE-E-NNES	0.1542
German	Re-RX	RE-E-NNES	0.00000039484***
	RE-Re-RX	RE-E-NNES	0.00011472***
	ERENNR	RE-E-NNES	0.0017***
	ERENN_MHL	RE-E-NNES	0.0268**
Eye	Re-RX	RE-E-NNES	0.00098727***
	RE-Re-RX	RE-E-NNES	0.6162
	ERENNR	RE-E-NNES	0.9429
	ERENN_MHL	RE-E-NNES	0.0857*
Pima Indians Diabetes	Re-RX	RE-E-NNES	0.0000001079***
	RE-Re-RX	RE-E-NNES	0.0032***
	ERENNR	RE-E-NNES	0.0373**
	ERENN_MHL	RE-E-NNES	0.0097***
Census Income	Re-RX	RE-E-NNES	0.00000088605***
	RE-Re-RX	RE-E-NNES	0.00042718***
	ERENNR	RE-E-NNES	0.0218**
	ERENN_MHL	RE-E-NNES	0.2514
Thyroid	Re-RX	RE-E-NNES	0.000000021415***
	RE-Re-RX	RE-E-NNES	0.000028845***
	ERENNR	RE-E-NNES	0.0029***
	ERENN_MHL	RE-E-NNES	0.1563
Wine	Re-RX	RE-E-NNES	0.000082042***
	RE-Re-RX	RE-E-NNES	0.4658
	ERENNR	RE-E-NNES	1
	ERENN_MHL	RE-E-NNES	0.4658
Car Evaluation	Re-RX	RE-E-NNES	0.0000016833***
	RE-Re-RX	RE-E-NNES	0.0000016833***
	ERENNR	RE-E-NNES	0.0014***
	ERENN_MHL	RE-E-NNES	0.0817*
*** means (p < 0.01), ** means (p < 0.05), * means (p < 0.1)

5.2 Comparison of RE-E-NNES with Three-MLP Ensemble Re-RX

The proposed RE-E-NNES algorithm is also compared with Three-MLP Ensemble Re-RX [21], a popular rule extraction algorithm that also extracts rules by ensembling FFNNs. Table 12 compares the two algorithms on the average testing accuracy over 10-fold cross validation. The bold values represent the highest accuracies. RE-E-NNES performs better than Three-MLP Ensemble Re-RX in all the datasets. A large increase in accuracy is seen in the Breast Cancer, Pima Indians Diabetes, and Wine datasets. These datasets have continuous attributes, and Three-MLP Ensemble Re-RX extracts rules with the Recursive Rule Extraction (Re-RX) algorithm, which has a limitation associated with continuous attributes [19]. The proposed RE-E-NNES algorithm places no constraints on the type of attributes.

Table 13 compares RE-E-NNES and Three-MLP Ensemble Re-RX on the average testing fidelity over 10-fold cross validation. The bold values represent the highest fidelity. Except in the Pima Indians Diabetes, Census Income, and Thyroid datasets, the rule-sets generated by RE-E-NNES have better fidelity than those of Three-MLP Ensemble Re-RX. In most of the datasets, therefore, the rule-sets generated by RE-E-NNES describe the decisions made by the NNEs more accurately. The two algorithms are also compared on further performance measures, as shown in Table 14. In most of the datasets, the average performance of RE-E-NNES is better than that of Three-MLP Ensemble Re-RX.

Table 12

Comparison of accuracies with the average of 10-fold cross validation results (in %)
Datasets	Three-MLP Ensemble Re-RX	RE-E-NNES
Credit Approval	85.85	87.39
Australian Credit Approval	85.51	86.52
Echocardiogram	93.33	98.33
Statlog (Heart)	78.15	83.70
Breast Cancer	90.29	97.06
German	71.3	79.2
Eye	54.39	55.13
Pima Indians Diabetes	70.13	79.22
Census Income	83.37	85.74
Thyroid	94.94	95.67
Wine	71.85	97.78
Car Evaluation	91.94	94.82

Table 13

Comparison of fidelity with the average of 10-fold cross validation results (in %)
Datasets	Three-MLP Ensemble Re-RX	RE-E-NNES
Credit Approval	91.54	96.77
Australian Credit Approval	89.42	96.52
Echocardiogram	91.67	95
Statlog (Heart)	74.81	83.70
Breast Cancer	90.44	98.38
German	66.11	68.89
Eye	98.88	100
Pima Indians Diabetes	84.29	78.31
Census Income	98.89	96.27
Thyroid	99.5	96.81
Wine	57.78	91.67
Car Evaluation	53.76	62.95

Table 14

Comparison of performance measures with the average of 10-fold cross validation results (all in % except MCC)
Datasets	Algorithms	Precision	Recall	F-measure	FPR	Specificity	BA	MCC
Credit Approval	Three-MLP Ensemble Re-RX	78.35	94.38	85.46	21.47	78.53	86.46	0.7295
Credit Approval	RE-E-NNES	81.62	92.83	86.73	17.31	82.68	87.76	0.7518
Australian Credit Approval	Three-MLP Ensemble Re-RX	79.13	92.48	85.07	19.93	80.07	86.28	0.7235
Australian Credit Approval	RE-E-NNES	82.57	90.54	85.97	16.28	83.72	87.13	0.7404
Echocardiogram	Three-MLP Ensemble Re-RX	72.22	100	95.84	15.83	84.17	91.07	0.9512
Echocardiogram	RE-E-NNES	96.67	100	98	2.5	97.5	98.75	0.9707
Statlog (Heart)	Three-MLP Ensemble Re-RX	79.63	69.74	73.49	14.07	85.93	77.84	0.5664
Statlog (Heart)	RE-E-NNES	82.42	80.29	80.95	14.04	85.96	83.13	0.6686
Breast Cancer	Three-MLP Ensemble Re-RX	81.2	94.42	86.69	11.78	88.22	91.32	0.7955
Breast Cancer	RE-E-NNES	97.57	95.33	96.37	1.98	98.02	96.67	0.9383
German	Three-MLP Ensemble Re-RX	71.41	98.31	82.66	92.18	7.82	53.07	0.1487
German	RE-E-NNES	81.44	91.35	85.95	48.04	51.96	71.66	0.4765
Eye	Three-MLP Ensemble Re-RX	23.19	10	63.37	9.94	90.06	50.03	0.0174
Eye	RE-E-NNES	55.53	20	71.39	19.98	80.02	50.01	0.0299
Pima Indians Diabetes	Three-MLP Ensemble Re-RX	60.42	42.34	49.52	14.86	85.14	63.74	0.3030
Pima Indians Diabetes	RE-E-NNES	73.79	63.39	67.85	11.97	88.04	75.71	0.5369
Census Income	Three-MLP Ensemble Re-RX	31.67	0.06	0.18	0.04	99.95	50.00	0.0061
Census Income	RE-E-NNES	62.73	18.91	29.49	1.34	98.66	59.03	0.3332
Thyroid	Three-MLP Ensemble Re-RX	92.40	92.40	92.40	3.79	96.20	94.30	0.8860
Thyroid	RE-E-NNES	93.51	93.51	93.51	3.24	96.76	95.14	0.9027
Wine	Three-MLP Ensemble Re-RX	57.78	57.78	57.78	21.11	78.89	68.33	0.3667
Wine	RE-E-NNES	96.67	96.67	96.67	1.67	98.33	97.5	0.95
Car Evaluation	Three-MLP Ensemble Re-RX	83.87	83.87	83.87	5.38	94.62	89.25	0.7849
Car Evaluation	RE-E-NNES	89.65	89.65	89.65	3.45	96.55	93.10	0.8620

5.3 Discussion

The proposed RE-E-NNES algorithm aims to demonstrate the high performance of NNEs through rule extraction, and the results presented in the preceding subsections show that the algorithm can generate productive rules from NNEs. RE-E-NNES is evaluated on twelve datasets collected from the UCI and KEEL repositories. The results (Table 7) show that the rule-sets generated by RE-E-NNES are more accurate than those of the Re-RX, RE-Re-RX, ERENNR, and ERENN_MHL algorithms in most of the datasets. The average fidelity of the algorithm is better than that of Re-RX, RE-Re-RX, and ERENNR and almost the same as that of ERENN_MHL (Table 8). The performance of the algorithm is also validated with precision, recall, f-measure, False Positive Rate (FPR), specificity, Balanced Accuracy (BA), and Matthews Correlation Coefficient (MCC) (Table 9). The statistical significance of the results is demonstrated with two well-known statistical tests: the Friedman test and the LSD test (Table 10 and Table 11). The proposed algorithm is also compared with a popular rule extraction algorithm named Three-MLP Ensemble Re-RX, which also uses a decision tree for extracting rules from NNEs, and the results in Tables 12, 13, and 14 show that RE-E-NNES outperforms Three-MLP Ensemble Re-RX in most respects.

Where comprehensibility is concerned, RE-E-NNES may generate more rules than the other rule extraction algorithms, because it uses a decision tree to extract rules from an NNE and also merges the rule-sets extracted from many NNEs to obtain the final rule-set. RE-E-NNES may also construct rules with many conditions because, unlike the other algorithms, it does not use network pruning or attribute pruning to discard irrelevant attributes. Table 15 compares the algorithms on the average comprehensibility over ten folds for a mixed-mode dataset. The rule-sets generated by RE-E-NNES are not the most comprehensible among the six algorithms, but they are not the least comprehensible either. For this dataset, the comprehensibility of RE-E-NNES is better than that of the Re-RX, RE-Re-RX, and Three-MLP Ensemble Re-RX algorithms.

Table 15

Comparison of comprehensibility of rules for a mixed mode dataset
Dataset	Comprehensibility	Re-RX	RE-Re-RX	ERENNR	ERENN_MHL	Three-MLP Ensemble Re-RX	RE-E-NNES
Australian Credit Approval	Global	11.8	11.8	3.4	3.7	12.6	4.1
Australian Credit Approval	Local	34.2	33.6	2.4	5.5	36.8	11.9

Although the rule-sets of RE-E-NNES are not the most comprehensible, they are adequate for explaining the knowledge sealed in NNEs with very high accuracy. Moreover, accuracy and comprehensibility tend to trade off against each other. The rule-sets classify unknown patterns more accurately than the individual NNs in all the datasets and more accurately than the NNEs in all the datasets except Eye, where the two accuracies are equal (Table 6 and Table 7). Overall, the algorithm is competent enough to mimic the high performance of NNEs through the extraction of high-performance rules.

6. Conclusion

NNEs are known for raising the classification performance of NNs. So, this paper proposes the RE-E-NNES algorithm to demonstrate the high performance of NNEs. RE-E-NNES extracts classification rules by employing several NNEs in the process of rule extraction. The algorithm builds a Neural Network Ensemble (NNE) model and extracts a rule-set from it. It then uses the obtained rule-set to create the succeeding NNE model, extracts a rule-set from that model, and so on, until the stopping criteria are satisfied. The algorithm is validated with different real-life datasets taken from the UCI and KEEL repositories. The results suggest that RE-E-NNES can exhibit the high performance of NNEs through the extraction of highly accurate rules. RE-E-NNES extracts rules that are more accurate than individual NNs and individual NNEs. In most of the datasets, its rules are also more accurate than those of the Re-RX, RE-Re-RX, ERENNR, ERENN_MHL, and Three-MLP Ensemble Re-RX algorithms. The performance of the algorithm is also evaluated on different performance measures and with two statistical tests. The results indicate that RE-E-NNES is an effective algorithm for exploiting the knowledge learned by NNEs and representing it in a human-understandable form with high generalization capability. It is suited to real-life classification applications that require transparent and accurate decisions.

Declarations

Conflict of interest: No conflict of interest.

Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.

References

[1] L. K. Hansen, P. Salamon: Neural network ensembles, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 12, issue 10, pp. 993–1001, 1990.
[2] E. Alfaro, N. Garcia, M. Gamez, D. Elizondo: Bankruptcy forecasting: an empirical comparison of Adaboost and neural networks, Decision Support Systems, vol. 45, issue 1, pp. 110–122, 2008.
[3] L. Zhou, K. K. Lai: Adaboosting Neural Networks for Credit Scoring, The Sixth International Symposium on Neural Networks (ISNN 2009), Advances in Intelligent and Soft Computing, vol. 56, Springer, Berlin, Heidelberg.
[4] H. Li, X. Wang, S. Ding: Research and development of neural network ensembles: a survey, Artificial Intelligence Review, vol. 49, issue 4, pp. 455–479, 2017.
[5] Z. Zhou, J. Wu, W. Tang: Ensembling neural networks: Many could be better than all, Artificial Intelligence, vol. 137, issues 1–2, pp. 239–263, 2002.
[6] M. Chakraborty, S. K. Biswas, B. Purkayastha: Recursive Rule Extraction from NN using Reverse Engineering Technique, New Generation Computing, vol. 36, issue 2, pp. 119–142, 2018.
[7] M. Chakraborty, S. K. Biswas, B. Purkayastha: Rule Extraction from Neural Network Using Input Data Ranges Recursively, New Generation Computing, 2018, doi: 10.1007/s00354-018-0048-0.
[8] M. Chakraborty, S. K. Biswas, B. Purkayastha: A Novel Ensembling Method to Boost Performance of Neural Networks, Journal of Experimental & Theoretical Artificial Intelligence, Taylor & Francis, doi: 10.1080/0952813X.2019.1610799.
[9] S. I. Gallant: Connectionist expert systems, Communications of the ACM, vol. 31, issue 2, pp. 152–169, 1988.
[10] R. Andrews, J. Diederich, A. B. Tickle: Survey and critique of techniques for extracting rules from trained artificial neural networks, Knowledge-based Syst., vol. 8, issue 6, pp. 373–389, 1995.
[11] G. Bologna: A study on rule extraction from several combined neural networks, International Journal of Neural Systems, vol. 11, issue 3, pp. 247–255, 2001.
[12] G. Bologna: Is it worth generating rules from neural network ensembles?, Journal of Applied Logic, vol. 2, issue 3, pp. 325–348, 2004.
[13] Z. Zhou, Y. Jiang, S. Chen: Extracting symbolic rules from trained neural network ensembles, Artificial Intelligence Communications, vol. 16, issue 1, pp. 3–15, 2003.
[14] G. Bologna, C. Pellegrini: Recognizing Images from ICA Filters and Neural Network Ensembles with Rule Extraction, in J. Mira, J. R. Álvarez (eds) Artificial Neural Nets Problem Solving Methods, IWANN 2003, Lecture Notes in Computer Science, vol. 2687, Springer, Berlin, Heidelberg.
[15] U. Johansson, R. Konig, L. Niklasson: Inconsistency - Friend or Foe, 2007 International Joint Conference on Neural Networks, Orlando, FL, USA, 2007, pp. 1383–1388.
[16] P. Hartono, S. Hashimoto: An Interpretable Neural Network Ensemble, IECON 2007 - 33rd Annual Conference of the IEEE Industrial Electronics Society, Taipei, 2007, pp. 228–232.
[17] S. I. Ao, V. Palade: Ensemble of Elman neural networks and support vector machines for reverse engineering of gene regulatory networks, Applied Soft Computing, vol. 11, issue 2, pp. 1718–1726, 2011.
[18] A. Hara, Y. Hayashi: Ensemble neural network rule extraction using Re-RX algorithm, in Proceedings of the 2012 Annual International Joint Conference on Neural Networks, IJCNN 2012, Part of the 2012 IEEE World Congress on Computational Intelligence, WCCI 2012, Australia, June 2012.
[19] R. Setiono, B. Baesens, C. Mues: Recursive neural network rule extraction for data with mixed attributes, IEEE Transactions on Neural Networks and Learning Systems, vol. 19, issue 2, pp. 299–307, 2008.
[20] Y. Hayashi: Neural Data Analysis: Ensemble Neural Network Rule Extraction Approach and Its Theoretical and Historical Backgrounds, in L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. A. Zadeh, J. M. Zurada (eds) Artificial Intelligence and Soft Computing, ICAISC 2013, Lecture Notes in Computer Science, vol. 7894, Springer, Berlin, Heidelberg.
[21] Y. Hayashi, R. Sato, S. Mitra: A new approach to three ensemble neural network rule extraction using recursive-rule extraction algorithm, in Proceedings of the 2013 International Joint Conference on Neural Networks, IJCNN 2013, USA, August 2013.
[22] Y. Hayashi, Y. Tanaka, S. Yukita, S. Nakano, G. Bologna: Three-MLP Ensemble Re-RX algorithm and recent classifiers for credit-risk evaluation, 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 2015, pp. 1–8.
[23] G. Bologna, Y. Hayashi: A Comparison Study on Rule Extraction from Neural Network Ensembles, Boosted Shallow Trees, and SVMs, Applied Computational Intelligence and Soft Computing, vol. 2018, Article ID 4084850, pp. 1–20, 2018.
[24] X. Gu, P. P. Angelov: Deep Rule-Based Aerial Scene Classifier using High-Level Ensemble Feature Descriptor, 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 2019, pp. 1–7.
[25] P. Hartono: Ensemble of perceptrons with confidence measure for piecewise linear decomposition, The 2011 International Joint Conference on Neural Networks, San Jose, CA, USA, 2011, pp. 648–653, doi: 10.1109/IJCNN.2011.6033282.
[26] Y. Jiang, Z.-H. Zhou, Z.-Q. Chen: Rule learning based on neural network ensemble, Proceedings of the 2002 International Joint Conference on Neural Networks, IJCNN'02 (Cat. No.02CH37290), Honolulu, HI, USA, 2002, vol. 2, pp. 1416–1420.
[27] R. E. Schapire: The strength of weak learnability, Machine Learning, vol. 5, issue 2, pp. 197–227, 1990.
[28] Y. Freund: Boosting a weak algorithm by majority, Information and Computation, vol. 121, issue 2, pp. 256–285, 1995.
[29] Y. Freund, R. E. Schapire: A decision-theoretic generalization of on-line learning and an application to boosting, in Proc. EuroCOLT-94, Barcelona, Spain, Springer, Berlin, pp. 23–37, 1995.
[30] Y. Freund, R. E. Schapire: Experiments with a new boosting algorithm, in Proceedings of the Thirteenth International Conference on Machine Learning, July 1996.
[31] Z.-H. Zhou: Rule Extraction: Using Neural Networks or For Neural Networks?, Journal of Computer Science and Technology, vol. 19, issue 2, pp. 249–253, 2004.
[32] M. Chakraborty, S. K. Biswas, B. Purkayastha: Rule Extraction from Neural Network trained using Deep Belief Network and Back Propagation, Knowledge and Information Systems, vol. 62, issue 9, pp. 3753–3781, 2020.
