Weighted heterogeneous ensemble for the classification of intrusion detection using ant colony optimization for continuous search spaces

This paper proposes a heterogeneous ensemble classifier configuration for a multiclass intrusion detection problem. The ensemble is composed of k-nearest neighbors, artificial neural networks, and naïve Bayes classifiers. The decisions of these classifiers are combined with weighted majority voting, where optimal weights are generated by ant colony optimization for continuous search spaces. As a comparison basis, we have also implemented the ensemble configuration with unweighted majority voting, i.e., the Winner Takes All strategy. To ensure the maximum variety of classifiers, we have implemented three versions of each classification algorithm by varying each classifier's parameters, making a total of nine diverse experts for the ensemble. For our empirical study, we used the full NSL-KDD dataset to classify network traffic into one of five different classes. Our results indicate that the ensemble configuration using ACOR-optimized weights is capable of resolving the conflicts between multiple classifiers and improving the overall classification accuracy of the ensemble.


Introduction
Intrusion detection is a network security mechanism for detecting unauthorized access to communication networks. Intrusion detection systems (IDS) are crucial to maintaining safe, secure, and uncompromised networks by identifying atypical and suspicious activities. In other words, intrusion detection is a pattern recognition problem where the objective is to classify inbound network traffic as either normal or anomalous. Ensemble methods combine a number of individual classifiers to create a composite classifier, with the goal of outperforming any of the singular classifiers. There are two types of ensemble methods: homogeneous, where different datasets are used to train instances of one classifier, and heterogeneous, where one dataset is used to train many different classifiers (Nguyen et al. 2020; Kassaymeh et al. 2021a).
The heterogeneous ensemble approach has the benefit of increased classification reliability. By weighting and combining the predictions of several diverse classifiers, we are able to reach the final decision with greater confidence. The diversity of the classifiers in a heterogeneous ensemble is of great importance, since classifiers that make the same classification errors are redundant and reduce the overall accuracy of the ensemble classifier. There are several ways to achieve classifier diversity, including the use of different training datasets to train the base classifiers, choosing a variety of base classifiers, and applying varied parameters to the base classifiers (Nguyen et al. 2020; Kassaymeh et al. 2021b). Similarly, there are several ways to define an ensemble of classifiers, including voting, bagging, boosting, stacking, cascading, and delegation. One of the simplest and most popular methods is voting, in which the decision of the ensemble is dominated by the decisions of the majority. The simplest voting technique is uniform voting, also known as the Winner Takes All (WTA) strategy, where every classifier is assigned equal importance. However, with the introduction of diversity among classifiers, we expect various degrees of success from the classifiers. The weighted voting technique assigns a weight coefficient to each classifier, representing its confidence score.
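To make the contrast concrete, a minimal Python sketch (not from the paper; the labels and weights are purely illustrative) shows how weighted voting can overturn a uniform-vote majority when one expert is trusted more than the others:

```python
from collections import Counter

def unweighted_vote(predictions):
    """Winner Takes All: every classifier has equal importance."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Weighted majority voting: each vote counts according to the
    classifier's confidence weight."""
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# Three hypothetical experts; the two weaker ones agree with each other.
preds = ["DoS", "Normal", "Normal"]
weights = [0.9, 0.3, 0.3]                # illustrative confidence scores

print(unweighted_vote(preds))            # majority wins: "Normal"
print(weighted_vote(preds, weights))     # strong expert wins: "DoS"
```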
In our earlier work (Aburomman and Reaz 2016), we introduced an ensemble design based on binary classification methods. The resulting ensemble decision was a set of posterior probabilities for each of the five classes in the dataset: "Normal," "Probe," "DoS," "U2R," and "R2L." However, simply selecting the class with the highest posterior probability as the final decision leads to poor generalization ability of the ensemble. This is partially due to the inherent imbalance in the number of training and testing examples for each class. Moreover, we recognize that the utilization of two-class experts leads to some unclassifiable regions in the ensemble decision space.
In this paper, we introduce a novel approach based on multiclass classification algorithms, with the aim of further improving the generalization of the generated ensemble. The transition from binary to multiclass classification is not a straightforward process, and the chief contribution of this paper is the development of a voting scheme which simultaneously maximizes the accuracy of the final classification decision while retaining the class-wise posterior probabilities for each class and each classifier. We achieve this by combining the Recall Combiner (REC) method proposed by Kuncheva and Rodríguez (2014) with the adaptability of meta-heuristic optimization. A set of classifiers based on k-nearest neighbors (kNN), artificial neural networks (ANN), and naïve Bayes (NB) is combined in a Weighted Majority Voting (WMV) strategy, with weights generated by Ant Colony Optimization for continuous search spaces (ACOR). As a reference point, we also examine the effectiveness of the same set of experts combined with the Winner Takes All (WTA) strategy.
The remainder of this paper is organized as follows. Section 2 summarizes the related work. Section 3 introduces our methodology. Section 4 presents our experimental results and discussions. Finally, Sect. 5 presents the concluding remarks and the direction of future work.

Related work
In this section, we summarize the most relevant studies related to the construction of heterogeneous ensembles. In particular, we are interested in research studies concerned with weighted voting ensembles and methods for generating weight coefficients. To avoid redundancy, we dedicate this section only to the most prominent techniques found in the reviewed papers. In Bhati and Khari (2022), the authors proposed an ensemble IDS based on a voting mechanism using Support Vector Machine (SVM) and ExtraTree classifiers; experiments conducted on the KDDCup99 dataset show higher accuracy than the individual approaches.
To deal with the accurate diagnosis of Parkinson's disease (PD) in its early stages, the authors in Nilashi et al. (2022) conducted a comparative study in which clustering and prediction ensemble methods were applied and compared with other prediction learning methods. The results conclude that ensembles can provide improved prediction accuracy, as evidenced by the analysis of a real-world Parkinson's disease dataset.
In their research (Rashid et al. 2022), SET, a tree-based stacking ensemble technique, was introduced and tested on two intrusion datasets (NSL-KDD and UNSW-NB15). Further, they used feature selection techniques to select the most relevant features for the proposed SET. According to their comprehensive performance analysis, the proposed model distinguishes between normal and anomalous network traffic better than other existing IDS models.
In Gupta and Rani (2020), the authors implement and study a new approach for detecting malware. The proposed approach uses an interconnected collection of features obtained by large-scale static and dynamic malware analysis and uses ensemble learning and big data technologies to detect malware in a distributed environment. The method constructs a number of lower-level classification models. A series of robust algorithms is then deployed to assign a rank and weight to each of the simple classifiers. The weights are then used in majority voting and in the selection of an optimal stacking classifier. The proposed solution is implemented on top of Apache Spark, which, due to its ease of use and performance, is already a standard for the distributed computation of big data. The tests show that the proposed approach enhances the efficiency and generalization of large-scale malware detection.
In Kaur (2020), the authors implemented KMeans and Gaussian Mixture Model (GMM) clustering for the reduction of the data set and the preservation of traffic diversity. The aggregated data serves as input to a Random Forest Classifier (RFC), and RF classification was also performed for the class-wise detection of attacks. The results of the KMeans-RF, GMM, and RF classifications served as inputs to the base learners of the ensemble methods. The authors researched and compared two ensemble approaches, namely an AdaBoost ensemble based on weighted votes and an ensemble based on stacking. The analysis was performed on two datasets, NSL-KDD and UNSW-NB15. The AdaBoost ensemble with weighted voting obtained accuracies of 90.46% and 83.32% on KDDTest+ and KDDTest-21, respectively, and 91.31% on the UNSW-NB15 test data. The stacking-based ensemble achieved accuracies of 85.24% and 78.20% on KDDTest+ and KDDTest-21, respectively, and 89.57% on the UNSW-NB15 test dataset. In general, through ensemble methods, the authors were able to achieve better detection rates and accuracies with reduced false alarm rates. Latency was tested in the distributed Spark system on different machines by adjusting the number of executing cores. An online classification system for intrusion detection data, based on the use of an ensemble classification model, was proposed in Folino et al. (2020); it defines the ensemble function as a non-trainable combining function and discovers it in a data-driven way via Genetic Programming (GP) methods. Their approach is supported by a system architecture that integrates various types of functionality, including drift detection mechanisms, simple model induction/replacement, and the efficient GP evolution of the combiner.
A series of experiments on artificial and real data sets enabled a comparison of the approach with several competitors and an investigation of the effects of drift-detection strategies and base-classifier replacement for different window sizes. The outcome of these experiments indicated that the proposed architecture can deal effectively with non-stationary data streams and is therefore a useful solution for real-life intrusion detection scenarios. Kausar et al. (2010) argue that the confidence in the final decision can be greatly increased by combining the weighted opinions of several experts in a weighted voting ensemble. Under the assumption that the classifier output represents the posterior probability of a given example being an instance of some class, the authors propose to generate a single weight coefficient for each classifier in the ensemble. Particle Swarm Optimization (PSO) (Kumar and Jaiswal 2020) is used to generate the weights, and the final decision is reached with weighted voting. The meta-heuristic approach is, therefore, used to find a near-optimal set of weights for which the classification error of the ensemble is minimized. The authors deploy a set of base classifiers with binary output, where the posterior probability equals 1 for the class predicted by the base classifier and 0 for any other class. The idea presented in Kausar et al. (2010) is very similar to the approach we adopted. However, instead of rating the overall performance of a classifier with a single weight, we introduce a method where the classifier's proficiency with each class in the dataset is rated individually. This approach enables us to combine different subspaces of decision regions from multiple base classifiers. The other major difference is the extension of the meta-heuristic weight optimization approach to the multiclass classification domain.
In Borji (2007), the authors studied three methods for creating a heterogeneous ensemble: majority voting, Bayesian averaging, and belief measure. Four base classifiers, artificial neural networks (ANN), support vector machines (SVM), k-nearest neighbor (kNN), and decision trees (DT), are trained with the same training set. With the introduction of belief measurement, the authors estimate the probability that the value predicted by the classifier is the actual class label of the given observation. These probabilities can be easily computed from validation results, and the final decision is determined by the classifier with the highest belief value. In essence, this approach closely resembles the approach we adopted in this paper. However, we replace the belief measurement with a set of meta-optimized weight coefficients. The other important difference is that we recognize the possibility that a particular classifier can be proficient at classifying samples from a certain class and, at the same time, completely unreliable for some other class. To rectify this problem, we propose to construct a set of weight coefficients for each classifier and each class, effectively decomposing each classifier's decision function into a set of c decision sub-functions for a c-class problem.
As an alternative to voting, Folino et al. (2016) proposed an ensemble based on a fusion of multiple classifiers with Genetic Programming (GP) (Acosta-Mendoza et al. 2014). The basic principle behind this approach is to utilize the GP tool to generate a fusion function which combines the classifiers into an ensemble. The validation subset is used to evolve a fusion function which maximizes the classification accuracy of the resulting ensemble. To generate the ensemble classifier, the optimized fusion function is applied to the classification results of the test subset. The main advantage of this approach is the elimination of an additional training phase for the ensemble. However, the approach based on weighted voting retains this advantage as well. With the ACOR-WMV approach, we propose a similar utilization of the validation set, but with one key difference: instead of evolving a fusion function, we generate the set of optimal weight coefficients for the voting procedure. Although both approaches are based on the same assumptions, the voting-based fusion offers the additional advantage of relatively low computational cost.
In general, methods for combining heterogeneous classifiers can be divided into two groups: classifier selection, where the best classifier is selected for each subspace defined by the decision function; and classifier fusion or aggregation, where classifiers' decisions are merged with some fusion function (Kassaymeh et al. 2022). The related literature reveals several attempts to combine both approaches (some of which are reviewed in this section), with varying degrees of success. The main contribution of this paper is the utilization of meta-heuristic optimization algorithms to find the set of weights which will:

• Emphasize the decision of a classifier, but only in those sub-regions of the decision space where the classifier shows proficiency (strictly under the assumption that this proficiency can be determined with validation), and
• Use weighted voting to aggregate the decisions of several classifiers into a final decision.
To accomplish these goals, we propose a weight generation model which performs both tasks simultaneously.
A brief summary of related work is presented in Table 1.

Methodology
The proposed classification methodology is based on a heterogeneous ensemble which labels patterns in user activities either as normal user behavior or as a network intrusion, in which case we are also interested in identifying the type of intrusion. The ensemble combines nine experts, based on the kNN, ANN, and NB algorithms. Weighted Majority Voting (WMV), based on the Recall Combiner (REC) and weight coefficients obtained with Ant Colony Optimization for continuous search spaces (ACOR), is used to aggregate the decisions of the nine experts into a final decision. In addition, we also combine the experts' decisions into an unweighted ensemble with the Winner Takes All (WTA) strategy. The methodology framework can be decomposed into five steps:

1. Training and validation of experts,
2. Classification of the test set with trained experts,
3. Optimization of weights with ACOR, based on validation results,
4. Combining the experts' test results with the ACOR-WMV ensemble,
5. Combining the experts' test results with the WTA ensemble.
The experimental framework is depicted in Fig. 1.

Experts
In general, the classification task can be defined as the process of finding a classification hypothesis h (also called the decision rule or approximation function) which approximates the target function f. We can formally define the target function as a function that maps an input vector x ∈ X to a discrete value in the output domain y ∈ Y. This is expressed as y = f(x). Therefore, the decision function h will also map the input vector x to the output domain y. Consequently, there are two possible outcomes:

• The input vector x is mapped accurately, i.e., h(x) = y, in which case the classification is successful,
• The input vector x is not mapped correctly, i.e., h(x) ≠ y, in which case we have a misclassification.

Classification models based on supervised learning rely on a training phase to generate the classification hypothesis h. The classifier h is constructed by finding the decision function which successfully classifies most of the pairs in the training set S = {(x_1, y_1), ..., (x_n, y_n)}. The set S consists of n training vectors x_i and matching targets y_i. Each input point or observation x_i is a p-dimensional row vector, defined in the feature space X ∈ R^p. The corresponding target y_i is a class label defined in the output domain Y.
The simplest way to measure the reliability of a classifier h is to compute its misclassification rate ε(h), i.e., the fraction of training set pairs which were not classified correctly. We can express this concept with Eq. 1:

ε(h) = (1/n) Σ_{i=1}^{n} 1(h(x_i) ≠ y_i).    (1)

Through the training phase, the misclassification rate is, directly or indirectly, minimized. Therefore, even for low values of the misclassification rate ε(h), the classifier h may not generalize well to a new set of previously unseen examples. In order to evaluate the generalization ability of the classifier, we introduce a validation set S_M with m new pairs. Over the past decades, research in the field of Machine Learning has produced many classification algorithms capable of multiclass classification.
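The misclassification rate of Eq. 1 can be sketched in a few lines of Python; the hypothesis h and the data here are toy values chosen only for illustration:

```python
def misclassification_rate(h, X, y):
    """Eq. 1: fraction of pairs (x_i, y_i) for which h(x_i) != y_i."""
    errors = sum(1 for x_i, y_i in zip(X, y) if h(x_i) != y_i)
    return errors / len(y)

# Toy hypothesis: a threshold on a single feature (illustrative only).
h = lambda x: 1 if x >= 0.5 else 0
X = [0.1, 0.4, 0.6, 0.9]
y = [0, 1, 1, 1]
print(misclassification_rate(h, X, y))  # 0.25: one of four pairs is wrong
```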
Subsections 3.1.1, 3.1.2, and 3.1.3 will outline several of the most prominent classifiers.

K-nearest neighbor
The kNN-based algorithms for data classification are widely known for their relative simplicity and effectiveness. The basic algorithm rests on the assumption that all input row vectors can be represented as points in the p-dimensional space R^p (Michalski et al. 2013). By evaluating the distances between the known points x_i in the training set, for i = 1, ..., n, and some new point x_N, we can find the k points in the training set which are nearest to the new point. Accordingly, we will also have a set of k class labels which match the k nearest points in feature space. The decision of the kNN classifier h(x_N) is based on a majority voting procedure, where the class with the most support (among the classes of the k nearest neighbors) is selected as the final decision.
We can represent the kNN decision making for a two-dimensional feature space in Fig. 2. Instances of three classes are marked with blue, green, and red circles. The distances between the new example (marked with a white circle) and its k = 5 nearest neighbors are denoted d_1, ..., d_5. The final decision is based on the class with the most support, in this case, the class marked in red.
In general, there are many methods to determine the distance between two points in feature space. The use of different distance functions will increase the diversity among kNN-based classifiers. Since the diversity is important for a good ensemble, we will define three kNN classifiers, denoted as h 1 , h 2 and h 3 , with Euclidean, Jaccard, and Spearman distance function, respectively.
Let us denote the distance between a training set point x_i and a new point x_N by d. Then we can define the Euclidean distance with Eq. 2, the Jaccard distance with Eq. 3, and the Spearman distance with Eq. 4:

d = sqrt( Σ_{j=1}^{p} (x_{N,j} − x_{i,j})² ),    (2)

d = #{ j : x_{N,j} ≠ x_{i,j}, (x_{N,j} ≠ 0 or x_{i,j} ≠ 0) } / #{ j : x_{N,j} ≠ 0 or x_{i,j} ≠ 0 },    (3)

d = 1 − ( (R(x_N) − R̄(x_N)) (R(x_i) − R̄(x_i))ᵀ ) / ( ‖R(x_N) − R̄(x_N)‖ ‖R(x_i) − R̄(x_i)‖ ),    (4)

where R(x_N) and R(x_i) are the coordinate-wise Spearman rank vectors of the given data points, and R̄(x_N) and R̄(x_i) are their mean ranks.
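The three distance functions and the kNN vote can be sketched in Python. This is an illustrative re-implementation, not the paper's Matlab code; the Jaccard and Spearman definitions below follow the conventions of Matlab's `pdist`, which we assume the authors used, and the data points are hypothetical:

```python
import math

def ranks(v):
    # Coordinate-wise ranks (average ranks for ties), as used by Spearman.
    order = sorted(range(len(v)), key=lambda j: v[j])
    r = [0.0] * len(v)
    j = 0
    while j < len(order):
        k = j
        while k + 1 < len(order) and v[order[k + 1]] == v[order[j]]:
            k += 1
        avg = (j + k) / 2 + 1          # average of tied positions, 1-based
        for t in range(j, k + 1):
            r[order[t]] = avg
        j = k + 1
    return r

def euclidean(a, b):                   # straight-line distance
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def jaccard(a, b):                     # fraction of differing nonzero coords
    nz = [(ai, bi) for ai, bi in zip(a, b) if ai != 0 or bi != 0]
    return sum(ai != bi for ai, bi in nz) / len(nz) if nz else 0.0

def spearman(a, b):                    # one minus rank correlation
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = math.sqrt(sum((x - ma) ** 2 for x in ra)) * \
          math.sqrt(sum((y - mb) ** 2 for y in rb))
    return 1 - num / den

def knn_predict(X_train, y_train, x_new, k, dist):
    # Majority vote among the k nearest training points.
    near = sorted(zip(X_train, y_train), key=lambda p: dist(p[0], x_new))[:k]
    labels = [lbl for _, lbl in near]
    return max(set(labels), key=labels.count)

# Hypothetical two-feature training data: "Normal" near the origin,
# "DoS" near (5, 5).
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]
y_train = ["Normal", "Normal", "Normal", "DoS", "DoS"]
print(knn_predict(X_train, y_train, [4.5, 5.0], 3, euclidean))  # "DoS"
```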

Artificial neural networks
Artificial neural networks (ANN) are very effective at learning real-valued, discrete-valued, or vector-valued functions and have been favorably applied to learning complex real-world sensor data (Michalski et al. 2013). Given a set of training data points and target labels S, the ANN constructs a classification hypothesis h so that the number of misclassified instances is minimized. The ANN classifier is a complex mapping function composed of a number of relatively simple mathematical functions called neurons (Bishop 1995). In order to solve the multiclass classification problem, we deployed a three-layer ANN classifier, where each layer consists of a certain number of neurons. The size of the first layer, also called the input layer, is determined by the dimension of the input point x_i. Consequently, the input layer is comprised of p neurons, one for each feature of the input vector x_N. The size of the second (hidden) layer is a user-defined parameter H. The input to each neuron in the hidden layer L_1, ..., L_H is a weighted sum of the input-layer neurons, with the input-layer weights β. The third, or output, layer is defined by the output domain Y, i.e., the output layer has c neurons, one for each class. The input to each neuron in the output layer is a weighted sum of the hidden-layer neurons, with the hidden-layer weights γ. The output-layer neuron R_r, for r = 1, ..., c, with the greatest value defines the predicted class h(x_N) for a new input point x_N. In general, this concept is represented in Fig. 3.
We define the final decision h(x_N) of the ANN classifier with Eq. 5:

h(x_N) = argmax_{r=1,...,c} R_r.    (5)
The weight coefficients β_s and γ_r are generated during the training phase, where the selected training algorithm is used to minimize the classification error ε(h) for a given training set S. The selection of different training algorithms can promote diversity among the ANN classifiers. Accordingly, we define three ANN classifiers, denoted h_4, h_5, and h_6, based on Scaled Conjugate Gradient back-propagation (SCG), One Step Secant (OSS), and Fletcher-Powell Conjugate Gradient (CGF), respectively.
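The forward pass of such a three-layer network, ending in the argmax of Eq. 5, can be sketched in Python. The sigmoid hidden activation and the hand-picked weights below are illustrative assumptions; in the paper the weights would be learned with SCG, OSS, or CGF back-propagation:

```python
import math

def forward(x, beta, gamma):
    """Three-layer feed-forward pass: p inputs -> H hidden -> c outputs.
    beta:  H x (p+1) input-layer weights (last column is the bias),
    gamma: c x (H+1) hidden-layer weights (last column is the bias)."""
    hidden = [1 / (1 + math.exp(-(sum(w * xi for w, xi in zip(row[:-1], x))
                                  + row[-1])))
              for row in beta]
    output = [sum(w * hj for w, hj in zip(row[:-1], hidden)) + row[-1]
              for row in gamma]
    # Eq. 5: the output neuron with the greatest value defines the class.
    return max(range(len(output)), key=lambda r: output[r]) + 1

# Hand-picked illustrative weights (a trained network would learn these).
beta  = [[1.0, 0.0, 0.0],   # hidden neuron 1 reacts to feature 1
         [0.0, 1.0, 0.0]]   # hidden neuron 2 reacts to feature 2
gamma = [[1.0, 0.0, 0.0],   # output neuron for class 1
         [0.0, 1.0, 0.0]]   # output neuron for class 2

print(forward([5.0, -5.0], beta, gamma))  # class 1
print(forward([-5.0, 5.0], beta, gamma))  # class 2
```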

Naïve Bayes
Generally, Naïve Bayes classifiers are algorithms based on Bayes' theorem, where an assumption of conditional independence between features is made. From a practical standpoint, this assumption is frequently violated, and the term "naïve" is used because this violation is intentionally overlooked. Nevertheless, the practical application of the NB classifier has often shown favorable results.
From the probabilistic aspect, the classification problem can be defined as the estimation of the posterior probability P(h(x) | S), with which we examine the likelihood of some classification hypothesis h making the prediction h(x), based on the evidence found in the training set S = {(x_1, y_1), ..., (x_n, y_n)}. Bayes' theorem states that this probability will, in general, depend on three values:

• P(h(x)), the prior probability of the class prediction h(x) occurring. This probability is estimated independently of the evidence found in the training set S, and it may be viewed as the frequency of the h(x) class label.
• P(S), the prior probability of the set S. This value is a constant for any hypothesis h, and it bears no reflection on the validity of the given hypothesis.
• P(S | h(x)), the posterior probability of having the set S, under the assumption that hypothesis h holds true for every instance (x_i, y_i) in the training set.
Based on the defined values, Bayes' theorem is given in Eq. 6:

P(h(x) | S) = P(S | h(x)) P(h(x)) / P(S).    (6)
Since the prior probability P(x_N) is a constant value and does not reflect on the classification hypothesis h, we can define the decision-making process of the NB classifier with Eq. 7:

h(x_N) = argmax_{r=1,...,c} P_r,    (7)

where we use P_r to denote the probability that h(x_N) = r, for r = 1, ..., c.
Depending on the type of classification problem, different probability distributions may be used with the NB algorithm. With the aim of promoting diversity among classifiers, we implemented three different kernel distribution functions. A kernel distribution is appropriate when a continuous distribution is required, and it may be used in cases where the distribution of a predictor is skewed or multimodal (i.e., has multiple peaks). Accordingly, the diversity of the NB classifiers is promoted with the implementation of three kernel distribution functions:

1. The Normal (or Gaussian) kernel K(x), defined by Eq. 8,
2. The Epanechnikov kernel K(x), defined by Eq. 9,
3. The Triangle kernel K(x), defined by Eq. 10,

K(x) = (1/√(2π)) e^(−x²/2),    (8)

K(x) = (3/4)(1 − x²) 1(|x|≤1),    (9)

K(x) = (1 − |x|) 1(|x|≤1),    (10)

where we use 1(|x|≤1) to denote the indicator function, such that 1(|x|≤1) = 1 if |x| ≤ 1, and 1(|x|≤1) = 0 otherwise. In addition to the kernel distribution function, we also define the scale or kernel width factor κ, which modifies the feature values. Finally, the probability estimation for the NB classifier is defined by Eq. 11.
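A minimal Python sketch of the three kernels and a kernel-density naïve Bayes decision is shown below. The class samples, priors, and width κ are hypothetical illustrations, and the per-feature density product is our reading of the standard kernel NB formulation, not the paper's exact Eq. 11:

```python
import math

def normal_kernel(x):                       # Gaussian kernel
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def epanechnikov_kernel(x):                 # Epanechnikov kernel
    return 0.75 * (1 - x * x) if abs(x) <= 1 else 0.0

def triangle_kernel(x):                     # Triangle kernel
    return (1 - abs(x)) if abs(x) <= 1 else 0.0

def kernel_density(x, samples, kappa, kernel):
    # Smoothed estimate of p(x) from one class's samples, width kappa.
    return sum(kernel((x - s) / kappa) for s in samples) / (len(samples) * kappa)

def nb_predict(x, class_samples, priors, kappa, kernel):
    """Pick the class r maximizing P(y=r) times the product of
    per-feature kernel density estimates (naive independence)."""
    best, best_score = None, -1.0
    for r, samples in class_samples.items():
        likelihood = 1.0
        for j, xj in enumerate(x):
            likelihood *= kernel_density(xj, [s[j] for s in samples],
                                         kappa, kernel)
        score = priors[r] * likelihood
        if score > best_score:
            best, best_score = r, score
    return best

# Hypothetical one-feature example: "Normal" traffic clusters near 0.15,
# "DoS" near 0.85 (illustrative values, not NSL-KDD features).
class_samples = {"Normal": [[0.10], [0.20], [0.15]],
                 "DoS":    [[0.90], [0.80], [0.85]]}
priors = {"Normal": 0.5, "DoS": 0.5}
print(nb_predict([0.18], class_samples, priors, 0.2, normal_kernel))  # "Normal"
```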

Ensemble
Given a set of l experts h_j, where j = 1, ..., l, we generate an ensemble E as a weighted combination of the predictions h_j(x_N) from each expert. Several considerations must be made before the ensemble is formally defined:

1. There is no formal training procedure for the ensemble, i.e., the ensemble classifier is not directly related to the training sample S.
2. The weights for the voting procedure are optimized with regard to the validation results. The optimized weights are then used with all future (test) samples.
3. The output of each expert h_j is a single discrete value defined in the output domain Y. This enables us to handle the results of fundamentally different experts (kNN, ANN, and NB) in the same way.
For the proposed ensemble, the weight optimization task can be regarded as a "black box" mechanism, i.e., the optimization algorithm is not an integral part of the ensemble. This modular approach to ensemble design allows us to use any optimization algorithm capable of solving multimodal problems in a continuous search space. Formally, the set of optimal weights W* is the result of the optimization, i.e., W* = argmax_W z(W), where z(W) is the fitness function defined by Eq. 14. Given the diversity of the classifiers, it is safe to assume that not all of the classifiers in the ensemble will be equally reliable at labeling given observations. The idea of introducing weight coefficients to quantify the confidence in the opinion of a classifier is a widely accepted approach. In this paper, we implement a weighting scheme based on class recall. This bears some resemblance to the Recall Combiner (REC) proposed in Kuncheva and Rodríguez (2014). However, there are two key differences:

• As considered earlier, the classification output h_j(x_N) is a single integer defined in the output domain Y. Therefore, we will not have a set of posterior probabilities over all classes in the dataset for a new sample x_N. Instead, we propose to retain this information by substituting the role of the posterior probabilities with the weight matrix W.
• The REC ensemble is based on a direct measurement of class recall (the accuracy of each expert with each class) from validation results. In our work, we propose a different way to measure the class recall, i.e., we utilize meta-heuristic optimization to generate the set of optimal weight coefficients.
Both aspects of the problem, the utilization of posterior probabilities and the measurement of class recall, are solved simultaneously with the implementation of the weight matrix W. To fully utilize the potential of each expert, we generate c different weights for each expert in the ensemble. Depending on the output of the classifier h_j(x_{N,i}) for some new observation x_{N,i}, we use a different weight coefficient. For an ensemble of l classifiers, where the output of each classifier is represented by an integer r = 1, ..., c, we can define the set of weights with a weight matrix W, generally defined by Eq. 13:

W = (w_{r,j}), for r = 1, ..., c and j = 1, ..., l.    (13)
For simplicity's sake, let us denote the ensemble created with the weight matrix W and the validation results M_i for the i-th observation as E_W(M_i), for i = 1, ..., m. We will then consider a validation sample to be correctly classified by the ensemble if E_W(M_i) = y_{M,i}, where y_{M,i} is the i-th target in the validation set. Looking back at the weight optimization problem, it is clear that the generated weights should maximize the classification accuracy of the resulting ensemble. Accordingly, we define the objective, or fitness, function z(W) as the classification accuracy of the ensemble classifier E_W constructed with the weight matrix W. This is represented with Eq. 14:

z(W) = (1/m) Σ_{i=1}^{m} 1(E_W(M_i) = y_{M,i}).    (14)

At this point, we insert the assumption that there exists an optimization algorithm O which will maximize the objective function z(W), given the validation results M and the validation targets {y_{M,1}, ..., y_{M,m}}. The implementation details of such an algorithm are considered in Sect. 3.3.
Each element w_{r,j} of the weight matrix W is to be used with classifier h_j if and only if that classifier predicts the label h_j(x_N) = r. The selected weights w_{r,j} are summed for each class in the ensemble, and the label with the highest score is adopted as the prediction of the ensemble E(x_{N,i}). In other words, the weight matrix W represents a set of posterior probabilities that classifier h_j has made a correct classification for each class in the dataset. With the aggregation of these probabilities, we are able to find the most likely class label of the example. The Weighted Majority Voting ensemble, constructed in the described way, is defined by Eq. 15:

E(x_{N,i}) = argmax_{r=1,...,c} Σ_{j=1}^{l} w_{r,j} · 1(h_j(x_{N,i}) = r),    (15)
where 1(h_j(x_{N,i}) = r) is an indicator function, equal to 1 if h_j(x_{N,i}) = r and 0 otherwise. For a c = 5 class problem with l = 9 experts, an example of the voting procedure for the proposed ensemble is illustrated in Fig. 4. In this example, the class label "1" has the most votes; however, this does not necessarily mean that this class will also have the highest support.
The proposed WMV ensemble is defined by Algorithm 1, for a test set S_N with s examples. It should be noted that, with the omission of the weight coefficients w_{r,j} from Eq. 15 (i.e., setting all weights to one), this ensemble becomes the "unweighted" majority voting, or Winner Takes All (WTA), ensemble. Therefore, both WMV and WTA ensembles can be realized within the same framework.
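The voting rule of Eq. 15 and the fitness of Eq. 14 can be sketched in Python. This is an illustrative re-implementation with hypothetical weights and 0-based class indices, not the paper's Algorithm 1:

```python
def wmv_predict(expert_labels, W):
    """Eq. 15: each expert j that predicts class r contributes its
    weight W[r][j] to class r's score; the top score wins.
    expert_labels: list of l predicted class indices (0-based).
    W: c x l weight matrix; an all-ones W gives the WTA ensemble."""
    c = len(W)
    scores = [0.0] * c
    for j, r in enumerate(expert_labels):
        scores[r] += W[r][j]
    return max(range(c), key=lambda r: scores[r])

def ensemble_accuracy(all_expert_labels, targets, W):
    """Eq. 14: fitness z(W) is the fraction of validation samples the
    weighted ensemble labels correctly."""
    hits = sum(wmv_predict(labels, W) == y
               for labels, y in zip(all_expert_labels, targets))
    return hits / len(targets)

# c = 3 classes, l = 4 experts. With all-ones weights (WTA), the three
# experts voting for class 1 win; with weights favoring expert 0 on
# class 0, its lone vote prevails.
W_wta = [[1.0] * 4 for _ in range(3)]
W_opt = [[2.0, 1.0, 1.0, 1.0],
         [1.0, 0.5, 0.5, 0.5],
         [1.0, 1.0, 1.0, 1.0]]
print(wmv_predict([0, 1, 1, 1], W_wta))  # 1
print(wmv_predict([0, 1, 1, 1], W_opt))  # 0
```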

Ant colony optimization for continuous search spaces
The ant colony optimization method (ACO) was originally developed as a meta-heuristic approach for combinatorial optimization (Socha and Dorigo 2008). However, it is possible to extend ACO to solve problems in the continuous domain without any major conceptual change to its structure. The transition from combinatorial to continuous optimization is achieved by replacing the discrete probability function, used to simulate the pheromone model, with a probability density function (PDF). The PDF is obtained by sampling a set of solutions called the solution archive T. This modification of ACO is called ACOR, where the R suffix denotes the real domain R.
In this paper, we implement the ACOR algorithm to generate the matrix of optimal weight coefficients W*. The fitness of each solution W_i, for i = 1, ..., b, is evaluated by Eq. 14 and represents the classification accuracy of the ACOR-WMV ensemble generated with W_i, based on the validation results M and the validation targets {y_{M,1}, ..., y_{M,m}}. A potential solution W_i is a c × l matrix, as represented in Eq. 12. The ACOR algorithm starts by randomly generating a set of b solutions. The fitness function is evaluated for each solution, and the solutions are sorted from highest to lowest fitness in a solution archive T. The solution archive T = {W_1, ..., W_b} contains all initial solutions, with the best solution in the first position and the worst solution in the last position. The ACOR algorithm iteratively searches the feasible space for a fixed number of iterations, t = 1, ..., t_max. Accordingly, we use the notation T(t) to denote the solution archive in iteration t.
A set of new solutions, analogous to ants, is generated in each iteration. The new solutions W_j, for j = 1, ..., a, are sampled with a Gaussian kernel K(W_j) from the archive T^(t), as represented by Eq. 16,

where λ_i is a weight associated with the ranking of the i-th solution, σ_i is the standard deviation of the i-th solution, and μ_i is the mean value of the i-th solution.
where q is a user-defined parameter that controls the locality of the search. Each ant selects one of the existing solutions from the archive, with a probability proportional to its weight λ_i. The selected solution is then modified by Eq. 15, with the mean μ_i = W_i and the standard deviation σ_i defined by Eq. 18,

where ξ is another user-defined parameter, called the speed of convergence.
The newly generated solutions are then appended to the archive T^(t). The fitness is evaluated again and the archive is re-sorted according to the calculated values. The size of the archive is kept constant by removing the last a solutions in each iteration. The process is repeated until the maximal number of iterations t_max is reached. The optimal solution is the first solution in the archive T at the last iteration, i.e., W* = T_1^(t_max). The implementation of the ACOR method is presented in Algorithm 2.
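The loop described above can be sketched as follows. This is an illustrative ACOR implementation on a toy objective, not the paper's code; the parameter names b, a, q, ξ (xi), and t_max follow the text, while the initialization range and the fitness function are assumptions made for the example:

```python
import numpy as np

def acor(fitness, dim, b=30, a=20, q=0.1, xi=0.85, t_max=100, rng=None):
    """Minimal ACOR sketch for maximizing `fitness` over R^dim.

    b : archive size, a : ants per iteration,
    q : locality of the search, xi : speed of convergence.
    """
    rng = np.random.default_rng(rng)
    archive = rng.uniform(-5, 5, size=(b, dim))      # assumed search range
    fit = np.array([fitness(x) for x in archive])
    order = np.argsort(-fit)                         # best solution first
    archive, fit = archive[order], fit[order]
    # Gaussian ranking weights lambda_i, used to select a guiding solution
    ranks = np.arange(b)
    lam = np.exp(-ranks**2 / (2 * (q * b) ** 2)) / (q * b * np.sqrt(2 * np.pi))
    p = lam / lam.sum()
    for _ in range(t_max):
        ants = np.empty((a, dim))
        for k in range(a):
            i = rng.choice(b, p=p)                   # select by ranking weight
            mu = archive[i]
            # sigma: mean distance to the other archive members, scaled by xi
            sigma = xi * np.abs(archive - mu).sum(axis=0) / (b - 1)
            ants[k] = rng.normal(mu, sigma + 1e-12)  # Gaussian kernel sampling
        new_fit = np.array([fitness(x) for x in ants])
        archive = np.vstack([archive, ants])
        fit = np.concatenate([fit, new_fit])
        order = np.argsort(-fit)[:b]                 # keep b best, drop last a
        archive, fit = archive[order], fit[order]
    return archive[0], fit[0]
```

In the paper's setting, each solution would be a flattened c × l weight matrix and the fitness would be the validation accuracy of the resulting ensemble; here a one-dimensional quadratic stands in for that objective.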

Experimental procedure
This experiment was conducted on a desktop computer with an Intel Core i7 3.40 GHz processor and 16 GB RAM, running 64-bit Windows 10 Professional Edition. All code used in the experiments was implemented in Matlab 2015a. The experimental procedure analyzes 11 classification models, 9 of which are single-stage classifiers and 2 of which are ensemble classifiers. A summary of the implemented approaches is presented in Table 2. The parameters for the ACOR optimization of weight coefficients are presented in Table 3.
The basic metric for examining the performance of each classifier is its confusion matrix and the values derived from it. We use 4 attributes derived from the confusion matrix to evaluate the effectiveness of a classifier:
• TP, the number of true positives: instances that were correctly classified as class r, i.e., h_j(x_{N,i}) = y_{N,i} = r.
• TN, the number of true negatives: instances that were correctly classified as some class other than r, i.e., h_j(x_{N,i}) = y_{N,i} ≠ r.
• FP, the number of false positives: instances that were incorrectly classified as class r, i.e., h_j(x_{N,i}) = r while y_{N,i} ≠ r.
• FN, the number of false negatives: instances of class r that were incorrectly classified as some other class, i.e., h_j(x_{N,i}) ≠ r while y_{N,i} = r.
These 4 values are computed for each class r = 1, ..., c in the dataset, and several classification properties are calculated from them, as presented in Table 4.
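For a multiclass problem, the four one-vs-rest counts can be derived directly from the c × c confusion matrix. A sketch (precision, recall, and F1 are among the derived properties, though the exact set in Table 4 may differ):

```python
import numpy as np

def per_class_metrics(cm):
    """Derive one-vs-rest TP, FP, FN, TN for each class r from a c x c
    confusion matrix cm, where cm[i, j] counts examples of true class i
    predicted as class j."""
    cm = np.asarray(cm)
    total = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp      # predicted as r, but truly another class
    fn = cm.sum(axis=1) - tp      # truly r, but predicted as another class
    tn = total - tp - fp - fn     # everything else
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return tp, fp, fn, tn, precision, recall, f1
```

The `np.maximum(..., 1)` guards avoid division by zero for classes that are never predicted or never occur, which can happen for rare classes such as "U2R".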
In addition, we will also consider the overall accuracy ACC, for both the test and validation sets. The comparison of computational efficiency will be based on the total running time τ of each classifier.

Dataset
The experimental procedure is based on a simulation of network communication, with recorded instances of network traffic and the matching class labels provided by the NSL-KDD dataset, proposed by Tavallaee et al. The data selection process involves splitting the data into three groups:
• Training subset: the set of input data points and matching targets used for training the base classifiers.
• Validation subset: a smaller set of input points and matching target labels that is drawn from the (KDDTest+.TXT) file and removed from it, to ensure its independence. It is used to generate the weight coefficients for the ensemble classifiers.
• Test subset: the set of input points and matching target labels, disjoint from both the training and validation sets, used exclusively to evaluate the performance of the trained and tuned classifiers.
The numbers of training pairs in these sets are presented in Table 5.
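The carving of a disjoint validation subset out of the full test file can be sketched as follows; the subset size n_val is a placeholder for the actual counts of Table 5, and the loading of the NSL-KDD files is omitted:

```python
import numpy as np

def split_validation(X_test_full, y_test_full, n_val, rng=None):
    """Draw a validation subset from the full test data and remove it,
    so that the validation and test sets stay disjoint."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X_test_full))
    val_idx, test_idx = idx[:n_val], idx[n_val:]
    return (X_test_full[val_idx], y_test_full[val_idx],
            X_test_full[test_idx], y_test_full[test_idx])
```

Because the validation indices are removed before the test subset is formed, the weights tuned on the validation data cannot leak into the final test evaluation.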

Results and discussion
The weight optimization with ACOR is performed iteratively, for a population of potential solutions. With each new population, the algorithm searches for new solutions with greater fitness values. Consequently, the best fitness and the average fitness of the population steadily converge toward the maximum. The convergence of the ACOR algorithm for the weight optimization problem is presented in Fig. 5. The performance of all implemented classifiers is summarized in Table 6. The overall accuracy represents the percentage of correctly classified examples in the test and validation subsets. We establish the effectiveness of the proposed classifiers based on the accuracy measurement paired with the total running time of the algorithm. Figure 6 presents a graphical summary of the test and validation accuracies for all implemented classifiers.
A closer analysis of the performance of the ensemble classifiers is presented in Tables 7 and 8, for the ACOR-WMV and WTA classifiers, respectively. Six calculated parameters are compared for each class in the NSL-KDD dataset.
The F1 score was established as the most balanced metric for analyzing the reliability of a classifier on each class in the dataset. Accordingly, we present the F1 score for each class and each classifier in Fig. 7.
The best performance, in terms of overall classification accuracy, is achieved with the ACOR-WMV ensemble. The analysis of F1 scores suggests that ACOR-WMV performs best on every class in the dataset except "U2R", where the kNN-based classifier (h_1) performs better. However, this can be disregarded, since class "U2R" is represented by significantly fewer instances than any other class in the dataset. With the integration of the REC-based voting scheme and ACOR-optimized weights, we were able to:
• Approximate the posterior probability that classifier h_j has made a correct prediction when predicting class r, for j = 1, ..., l and r = 1, ..., c.
• Use the approximated probabilities as weights in the majority voting procedure to generate an ensemble classifier.
• Reinforce the decisions of each expert only for those classes where that expert shows proficiency, while simultaneously diminishing its influence for the classes on which it performs poorly.
The other ensemble approach, based on the WTA strategy, shows exceptionally poor performance. The unweighted voting procedure is strongly biased toward the classes with more instances in the training dataset. Based on the measured running times of both ensembles and the obtained accuracies, we can establish that the ACOR-WMV approach is far superior to the WTA-based ensemble.
In this study, Friedman's statistical test (Friedman 1937) was conducted to verify whether there is a statistically significant difference between the results of the proposed ACOR-WMV and each of the comparative methods. The nonparametric Friedman test with a confidence level of 95% (alpha = 5%) was used. The Friedman test first employs a ranking method to determine the control method.
Then, Holm's test is used to find the p values between the control method and the other comparison methods (Albashish et al. 2021). Table 9 reports the rankings of ACOR-WMV and the competing classification methods. As illustrated in Table 9, the proposed ACOR-WMV method is ranked first with an average rank equal to one, KNN3 is ranked second with an average rank of 2.0, and WTA is ranked fifth, with an average of around 7.5. In contrast, KNN2 is ranked last, with an average of approximately 11.0.
Following Friedman's test, Holm's method is applied as a post hoc statistical procedure to compare the control method (i.e., the proposed ACOR-WMV) with the remaining methods, with a confidence level of 0.05 in all test cases. Table 10 shows the results of Holm's procedure. As reported in Table 10, Holm's method demonstrates that ACOR-WMV significantly outperforms KNN2, KNN1, NB2, and NB3. Furthermore, there is only a marginal difference between the accuracy of ACOR-WMV and that of ANN1, WTA, and NB1, as the adjusted p values of these methods are very near the significance threshold (0.05). Consequently, the average classification accuracies of ACOR-WMV are statistically superior to those of most of the classification methods listed in Table 6, and ACOR-WMV can be regarded as a real alternative for many classification applications in the field of machine learning.
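The Friedman/Holm procedure can be sketched on hypothetical per-fold accuracies (the actual ranks and p values are those of Tables 9 and 10); the post hoc z statistic and the Holm step-down adjustment are standard formulations, not taken from the paper:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata, norm

# Hypothetical per-fold accuracies (rows: folds, columns: classifiers).
acc = np.array([[0.91, 0.85, 0.78],
                [0.93, 0.84, 0.80],
                [0.90, 0.86, 0.79],
                [0.92, 0.83, 0.77]])

stat, p = friedmanchisquare(*acc.T)              # omnibus test over all classifiers
ranks = np.apply_along_axis(rankdata, 1, -acc)   # rank 1 = best, per fold
avg_rank = ranks.mean(axis=0)                    # the ranking of Table 9

def holm(pvals):
    """Holm step-down adjustment of raw p-values."""
    pvals = np.asarray(pvals, dtype=float)
    order = np.argsort(pvals)
    m = len(pvals)
    adj = np.empty(m)
    running = 0.0
    for k, i in enumerate(order):
        running = max(running, (m - k) * pvals[i])
        adj[i] = min(running, 1.0)
    return adj

# Post hoc comparison of each method against the control (best average rank).
n, l = acc.shape
control = int(np.argmin(avg_rank))
se = np.sqrt(l * (l + 1) / (6.0 * n))
z = (avg_rank - avg_rank[control]) / se
raw = 2 * norm.sf(z)                             # two-sided raw p-values
mask = np.arange(l) != control
adj = np.full(l, 1.0)
adj[mask] = holm(raw[mask])                      # adjusted p-values vs. control
```

An adjusted p value below 0.05 means the control method is significantly better than that competitor, which mirrors the reading of Table 10.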

Conclusion and future work
This paper presents a framework for generating an ensemble of 9 experts for the classification of network traffic in an Intrusion Detection System. Each expert is trained and validated with subsets drawn from the NSL-KDD dataset. We present a novel method for generating weights for the Weighted Majority Voting (WMV) procedure, using ACOR-optimized linear coefficients to approximate the class recalls of each expert in the ensemble. The class recall combiner (REC) scheme, powered by ACOR weights, presents a viable solution to ensemble classification, as shown by the experimental procedure. The ensemble approach offers the additional advantage of not being directly trained. In addition, we have also implemented an unweighted version of the ensemble, based on the Winner Takes All (WTA) strategy.
Test results indicate that the best performance is achieved with the ACOR-WMV ensemble. Although much faster, the ensemble based on the WTA strategy was not able to combine the decisions of its experts effectively. The second-best overall accuracy was achieved by a classifier based on artificial neural networks (ANN); however, ACOR-WMV shows significantly better results, with an improvement of 4.4% in overall classification accuracy. In conclusion, the relatively low computational cost and high classification accuracy put forward the WMV-based ensemble as a viable solution to the NSL-KDD classification problem. The logical extension of the presented research is twofold:
• the implementation of different expert classifiers for the ensemble, and
• the implementation of different algorithms for weight optimization.
The expansion of the set of expert sub-systems should further improve the diversity of the ensemble and lower the chance of two experts making the same classification errors. With the addition of other popular single-stage classification algorithms (e.g., support vector machines, decision trees), we may expect an increase in the performance of the multiple-classifier system based on the WMV procedure. Similarly, we may expect some improvement if the ACOR algorithm were substituted with another meta-heuristic optimization algorithm (e.g., particle swarm optimization, genetic algorithms, artificial immune systems). The modular design of the proposed ensemble makes these tasks simple, requiring only minimal modifications to the existing ensemble algorithm.