Ensemble classification for intrusion detection via feature extraction based on deep learning

An intrusion detection system is a security system that aims to detect sabotage and intrusions on networks in order to inform experts of attacks and abuse of the network. Different classification methods have been used in intrusion detection systems, such as fuzzy systems, genetic algorithms, decision trees, artificial neural networks, and support vector machines. Moreover, ensemble classifiers have shown more robust and effective performance for various tasks in the field. In this paper, we adopt ensemble models in order to improve the performance of intrusion detection and, at the same time, decrease the false alarm rate. We use kNN for multi-class classification, as well as SVM, to approach the classification problem in anomaly-based detection. To combine the multiple outputs, we use the Dempster–Shafer method, which allows explicit modeling of uncertainty. Moreover, we utilize deep learning to extract features and train on the samples selected by a sample selection algorithm based on the ensemble margin. We compare our results with state-of-the-art methods on benchmark datasets such as UNSW-NB15, CICIDS2017, and NSL-KDD. Our proposed method shows superiority in terms of the prominent metrics Accuracy, Precision, Recall, and F-measure.


Introduction
Nowadays, most security systems mainly rely on encryption, firewalls, and access control (Zabihi et al. 2014). However, these methods alone cannot guarantee system and network security. Intrusion Detection Systems (IDS) aim to improve systems' security and play an important role in cyber security (Moustafa et al. 2019). An intrusion is defined as a set of actions that attempt to compromise the integrity, confidentiality, or availability of a resource. Events that enter the network or host are monitored by the intrusion detection system, and the necessary measures are taken depending on whether these events are a sign of an attack or of proper use of the system (Naphade et al. 2016; Zarpelão et al. 2017). Generally, intrusion detection methods fall into two categories: (i) anomaly-based and (ii) signature-based. In the following, we describe both categories.
- Anomaly-based detection method: In this method, normal patterns and behavior are first identified, and specific patterns and rules are derived from them. Behaviors that follow these patterns are treated as normal, and behaviors that deviate significantly from them are considered anomalous (Ahmed et al. 2016; Al-Enezi et al. 2014). An anomaly-based intrusion detection system is shown in Fig. 1.
- Signature-based detection method: In this method, the patterns of pre-determined attacks and intrusions are kept as rules inside a database, and each pattern represents an intrusion. Network traffic is examined, and the occurrence of an intrusion is announced if such a pattern is found in the system (Ahmed et al. 2016). A signature-based intrusion detection system is shown in Fig. 2.
Accuracy and precise detection are of the utmost importance in an intrusion detection system. In this regard, we provide an intrusion detection system that can classify attacks. Such a system, with a higher degree of accuracy, has a significant impact on overall system performance, and machine learning is a reliable tool in this field.
According to the results of various studies on the use of learning methods, there is no single training algorithm that works better and more effectively for all applications (Zhang et al. 2017). In fact, each algorithm is a specific model formed based on certain assumptions. Sometimes these assumptions hold, and sometimes they are violated. Therefore, no algorithm alone can work successfully in all conditions and for all problems. Ensemble methods have been introduced to overcome this problem (Zhang et al. 2017; Park and Chang 2018) and have been widely used for classification in the recent decade (Ludwig 2019; Keramati et al. 2014); they generally perform better than single methods. In this study, we propose to use evidence theory, a mathematical theory based on posterior probabilities, to combine the evidence from kNN and SVM classifiers for the final decision-making, improving the capability and increasing the accuracy of intrusion detection. The contributions of the article can be summarized as follows:
1. Using the ensemble margin for better sample selection;
2. Using deep learning for feature extraction;
3. Using the Dempster-Shafer ensemble method to combine classifiers;
4. Conducting extensive experiments to evaluate the performance of the proposed method on the KDD-Cup and NSL-KDD data sets.
In order to ensure diversity, we trained four support vector machine classifiers and four probabilistic K-nearest neighbor classifiers. The Hermite kernel function is used to reduce the number of support vectors and enhance the accuracy of data classification, and the reduction in support vectors can also improve the speed of data classification (Moghaddam and Hamidzadeh 2016). We used the sigmoid function to make the support vector machine probabilistic, deep learning to extract key, high-quality features from the 41 features in the KDD99 data set, and a sample selection algorithm, which is a crucial step in sample-based learning, to select better samples based on the ensemble margin. Finally, we used Dempster-Shafer evidence theory for data fusion. This theory strengthens correct decisions and weakens incorrect decisions based on probability. We perform an extensive set of experiments in which we show that our proposed method can outperform state-of-the-art approaches for detecting intrusion attacks.
The rest of this paper is structured as follows. In Sect. 2, we review the literature in the field of intrusion detection. We explain the background in Sect. 3. Section 4 describes our methodology, followed by Sect. 5 explaining the experiments on the data sets. Finally, the conclusion and suggestions for future work are presented in Sect. 6.

Literature review
In the 1970s, the need for security systems was felt more than ever due to the increasing speed, efficiency, and number of computers. In 1977 and 1978, the International Standard Organization held meetings between governments and inspection bodies of Electronic Data Processing, the outcome of which was a report on the status of security, inspection, and control of systems at that time. At the same time, the US Department of Energy began very detailed studies on the inspection and security of computer systems due to concerns about the security of its own systems. This study was carried out by James P. Anderson, and the report he presented in 1980 can be regarded as the core of the concepts of intrusion detection (Anderson 1980).

Singh et al. (2015) presented a system based on the Extreme Learning Machine, which solves the speed problem of neural networks. The system aimed to reduce computational memory and time by creating a profile of network traffic, using two profiles, alpha and beta, that reduce the effect of unaligned data. The beta profile reduces the size of the test data set while its features are maintained in practice, and the alpha profile is used to reduce the detection time. Folino et al. (2016) used an intrusion detection system based on ensemble classification, aimed at increasing group accuracy. The ensemble structure of the NIDS makes possible the detection of sophisticated attacks and raises alarms in a proper manner; the advantages of this ensemble classifier include reduced error variance and bias, and it is appropriate for imbalanced classification. The proposed method works well in identifying attacks and minimizing alarms but needs to be improved for specific attacks. Aburomman and Reaz (2016) presented a new method for intrusion detection based on the support vector machine, K-nearest neighbor, particle swarm optimization, and the weighted majority algorithm (WMA) classifier. Six support vector machine classifiers and six K-nearest neighbor classifiers with different parameter values were used, and WMA was then applied as the classifier combiner. The local uni-modal sampling (LUS) algorithm was used to select high-quality parameters. The proposed LUS-WMA method has better accuracy than the plain WMA combiner, although WMA alone performs better in some cases. Gautam and Om (2016) proposed an algorithm based on information theory and entropy, in which the entropy is obtained after the classification of features, and classification is based on bias and features. The results show that the detection rate and accuracy of their algorithm are better than those of the Fast Feature Reduction in Intrusion Detection Data sets (FFRIDD) and Multi-Level Dimensionality Reduction Method (MLDRM) selection algorithms. A hybrid semi-supervised learning technique combining an active support vector learning machine (ASVM) and Fuzzy C-Means (FCM) was introduced in the design of an intrusion detection system with excellent performance; this system performs binary classification and hence works faster than multi-class classifiers (Kumari and Varma 2017). Li et al. (2018) presented a new hybrid method based on density peaks clustering and k-nearest neighbors in order to increase the accuracy rate, where DPNN was used for training and kNN was used for classification.
Finally, the proposed DPNN method has better accuracy than the support vector machine, and there are many other methods in the field of machine learning. Vinayakumar et al. presented a hybrid intrusion detection system (Scale-Hybrid-IDS-AlertNet) based on a highly scalable framework on a hardware server, with the capability to classify unpredictable cyber-attacks and to monitor network- and host-level events. The distributed framework is based on a deep learning model using DNNs for analyzing big data in real time, with optimal network parameters and network topologies for the DNNs. Based on the tests, the performance of the DNNs is higher than that of the classical methods (Vinayakumar et al. 2019). El-Sappagh et al. (2019) examined different data mining classification methods for correct detection, low false alarms, and high accuracy, evaluating many data mining methods on KDD CUP99 that cover all attack classes; in that paper, the best accuracy, 92%, is obtained by the multilayer perceptron, and the best training time, 4 seconds, by the rule-based model. Elmasry et al. (2020) proposed a method using an ensemble weighted majority algorithm to increase accuracy and a feature selection method to decrease the number of features for attack detection; this method increases the detection accuracy by 10% and reduces the false-positive rate to 0.05%. Zhang et al. (2020) proposed a class imbalance processing technology for IDS data sets, which combines the Synthetic Minority Over-Sampling Technique (SMOTE) and clustering-based under-sampling using a Gaussian Mixture Model (GMM). The advantage of their method is verified using the UNSW-NB15 and CICIDS2017 data sets, and the model provides an effective solution to imbalanced data in intrusion detection systems.
Given the literature mentioned above, most of the methods employed have focused on increasing accuracy and precision and on reducing false alerts. In the present paper, we use an ensemble method in order to increase the accuracy of the intrusion detection system for the classification of multiple attack classes. We also use deep learning to reduce time and extract better features.

Background
In Sect. 3.1, we discuss the Dempster-Shafer theory of evidence and its parameters. Feature extraction using an autoencoder is illustrated in Sect. 3.2.

Dempster-Shafer (DS) theory
Evidence Theory (ET) originated with the theory of upper and lower probabilities and was formalized by Shafer in 1976 (Shafer 1976), who developed the theory and fixed its deficiencies. It is used as a tool to analyze uncertainty with imprecise probabilities (Zaman et al. 2011) and later became known as the theory of belief functions. The Dempster-Shafer theory is important because it reasons directly about existing beliefs in a situation and is considered one of the most effective methods for integrating data at a general level (Hamidzadeh and Moslemnejad 2019). The basic functions in the Dempster-Shafer theory are: 1. the probability mass function, 2. the belief function, and 3. the plausibility function. Here, we consider a finite set φ = {φ_1, φ_2, ..., φ_n} called the frame of discernment (FOD). Its power set, denoted 2^φ, contains all subsets of φ, and the mass function is defined as m : 2^φ → [0, 1].
The probability mass function, also simply called the mass function, is the essential function of the theory and is known by the symbol m or as the basic probability assignment (BPA). The mass m(A) expresses the evidence for the existence of state A as a real number between zero and one, with m(∅) = 0 and the masses over all subsets summing to one, Σ_{A⊆φ} m(A) = 1.
The belief function gives the lower bound on the probability that a state may occur, and the plausibility function gives the corresponding upper bound. A function Bel : 2^φ → [0, 1], defined by Bel(A) = Σ_{B⊆A} m(B), is called the belief function over the frame of discernment φ, and a function Pl : 2^φ → [0, 1], defined by Pl(A) = Σ_{B∩A≠∅} m(B), is called the plausibility function over the frame of discernment (FOD) φ. The plausibility function is related to the belief function through the doubt of the complement: Pl(A) = 1 − Bel(Ā).
The belief interval [Bel(A), Pl(A)] reflects the uncertainty about the probability of occurrence of A, denoted P(A). Dempster's rule of combination: suppose m_1 and m_2 are basic probability assignments with focal elements B_1, B_2, ..., B_i and C_1, C_2, ..., C_j, respectively. The combined mass used to fuse the outputs of the classifiers is m_{1,2}(A) = (1 / (1 − K)) Σ_{B_p ∩ C_q = A} m_1(B_p) m_2(C_q) for A ≠ ∅, with m_{1,2}(∅) = 0, where the conflict is K = Σ_{B_p ∩ C_q = ∅} m_1(B_p) m_2(C_q).
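To make the combination rule concrete, the sketch below implements Dempster's rule in Python. The representation of focal elements as frozensets and the example BPAs are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of Dempster's rule of combination.
# A BPA is a dict mapping focal elements (frozensets of class labels) to masses.
def combine(m1, m2):
    combined = {}
    conflict = 0.0
    for b, mb in m1.items():
        for c, mc in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mb * mc
            else:
                conflict += mb * mc          # mass falling on the empty set (K)
    if conflict >= 1.0:
        raise ValueError("Total conflict: the two sources cannot be combined.")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}   # normalise by 1 - K

# Example: two classifiers producing singleton BPAs over three classes (hypothetical values).
m_svm = {frozenset({"DOS"}): 0.7, frozenset({"Probe"}): 0.2, frozenset({"Normal"}): 0.1}
m_knn = {frozenset({"DOS"}): 0.6, frozenset({"Probe"}): 0.3, frozenset({"Normal"}): 0.1}
print(combine(m_svm, m_knn))    # the mass on DOS is reinforced, the others are weakened
```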

Feature extraction using deep learning
A deep neural network is a generalization of multi-layered neural network architectures and shows how a network with a large number of layers can successfully build the structures required for deep learning. Nowadays, deep artificial neural networks are highly competitive in pattern recognition and machine learning. Feature learning algorithms are used to find and extract common patterns automatically so that the extracted features can be used in regression and classification tasks (Schmidhuber 2015). They are also considered a new generation of artificial neural network methods that exploit large-scale, cost-effective computing. These methods are used in visual face recognition, dimensionality reduction, network intrusion detection, and many other domains (LeCun et al. 2015). With the right architecture, this learning technique can not only generate significant indicators for the data set but also compress it into compact features by reducing the number of neurons in the intermediate layer; a network used in this way is called an autoencoder.
In its simplest form, an autoencoder has three layers: an input layer, a hidden layer, and an output layer. Suppose there are p neurons in the input and output layers and q neurons in the hidden layer, F : R → R is a transfer function such as the sigmoid function, and x ∈ R^p is an input vector of features (Günther et al. 2016). The value computed by neuron i of the hidden layer is h_i = F(W_i · x + b_i), i = 1, ..., q, where W_i ∈ R^p and b_i ∈ R are the corresponding weight and bias parameters of neuron i. The representation h = (h_1, ..., h_q) created in this layer is used as the input of the output layer, and the output of the last layer of the network is x̂_j = F′(W′_j · h + b′_j), j = 1, ..., p, where W′_j ∈ R^q and b′_j ∈ R are the corresponding weight and bias parameters of neuron j in the output layer.
The weight matrices are learned over the entire network, and the transfer functions F and F′ are not necessarily the same (LeCun et al. 2015). For such inputs, the most common cost function is the mean-square error, J = (1/N) Σ_{n=1}^{N} ‖x_n − x̂_n‖² (Vincent et al. 2010), to which a regularization term on the weights is typically added, so that the cost function is composed of two distinct parts. An autoencoder network is shown in Fig. 3. This structure is expanded in order to extend the autoencoder to deep learning: the hidden representation of the first autoencoder is used as the input of a second autoencoder, and this process can continue to increase the number of layers of the deep network. The same procedure is used when training the final network: the first hidden layer is trained as described above, its input weights are then fixed, the next hidden layer is trained like an autoencoder in the same way, and this process continues until the entire network architecture is completed. Finally, a fine-tuning step is performed on the entire network structure in an integrated manner. After training the network, the encoder output is taken as the extracted feature vector.
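As an illustration of this extraction process, the sketch below builds a single-hidden-layer autoencoder in Keras and uses the encoder output as the extracted features. The 41→20→41 layer sizes, activation, and training settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal autoencoder feature-extraction sketch (layer sizes are assumptions).
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(input_dim=41, encoding_dim=20):
    inputs = keras.Input(shape=(input_dim,))
    hidden = layers.Dense(encoding_dim, activation="sigmoid")(inputs)   # h = F(W.x + b)
    outputs = layers.Dense(input_dim, activation="sigmoid")(hidden)     # reconstruction x_hat
    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, hidden)
    autoencoder.compile(optimizer="adam", loss="mse")                   # mean-square error cost
    return autoencoder, encoder

# Usage (X is the preprocessed, scaled feature matrix):
# autoencoder, encoder = build_autoencoder()
# autoencoder.fit(X, X, epochs=50, batch_size=256, validation_split=0.1)
# extracted_features = encoder.predict(X)    # features used to train the base classifiers
```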

Methodology
In this paper, as stated before, we present an intrusion detection system that uses belief function theory to achieve a higher accuracy rate in the classification and detection of attacks. The block diagram of the proposed method is given in Fig. 4. The framework of the proposed system is provided in Sect. 4.1. The ensemble margin, used to remove noise and redundancy, is introduced in Sect. 4.2. The probabilistic kNN and its parameters are presented in Sect. 4.4, and finally, the probabilistic SVM is described in Sect. 4.5. We now explain the proposed method in detail.

Framework of the proposed system
In the framework of the proposed system, the aim is to develop ensemble-based classifiers that enhance the accuracy of attack classification in the intrusion detection system. For this purpose, we trained and tested eight classifiers, in which the results of four kNN classifiers with nearest-neighbor values of k=3, k=5, k=8, and k=10, and four SVM classifiers, with RBF kernel parameter values of 1 and 3 and Hermite kernels of degree 8 and 10, are combined into one group using a combining classifier. Using four SVM and four kNN experts ensures diversity, which leads to a higher performance of the proposed system.
In order to integrate the outputs, we used Dempster-Shafer ensemble combination because it allows explicit modeling of uncertainty. This method can integrate numerical, signal, and multidimensional data and is considered one of the most powerful data-fusion methods. We used a heuristic function in kNN to convert its output into probabilistic values, and the sigmoid function to convert the SVM output into probabilistic values; the sigmoid function performs better than linear and polynomial functions for the SVM experts. The outputs of the classifiers are converted into probabilities because Dempster-Shafer theory is based on probabilities and uncertainty, and its inputs must be probability values.
In Fig. 4, the framework of the proposed method is divided into nine stages (a high-level sketch of this pipeline is given after the list):
1. Data pre-processing
2. Sample selection using the ensemble margin
3. Feature extraction from the data set using deep learning
4. Data classification with four SVMs with RBF and Hermite kernels
5. Making the output of the SVMs probabilistic using the sigmoid function
6. Making the kNN output probabilistic using the heuristic function
7. Data classification with four probabilistic kNNs with different values of k
8. Using the Dempster-Shafer rule to combine the BPA values
9. Analyzing and determining the final classification results
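A high-level sketch of this pipeline is shown below, assuming the features have already been preprocessed and extracted (stages 1-3) and using scikit-learn estimators. For singleton BPAs, Dempster's rule reduces to a normalised product of the experts' class probabilities; the polynomial kernel is only a stand-in for the Hermite kernel, and all names and parameter values are illustrative.

```python
# Illustrative sketch of the ensemble pipeline (stages 4-9); not the paper's exact code.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def build_experts():
    svms = [SVC(kernel="rbf", gamma=g, probability=True) for g in (1, 3)]              # stages 4-5
    svms += [SVC(kernel="poly", degree=d, probability=True) for d in (8, 10)]           # stand-in for Hermite kernels
    knns = [KNeighborsClassifier(n_neighbors=k, weights="distance") for k in (3, 5, 8, 10)]  # stages 6-7
    return svms + knns

def dempster_fuse(prob_matrices):
    """Stage 8: fuse per-expert class-probability matrices (n_samples x n_classes)."""
    fused = np.ones_like(prob_matrices[0])
    for probs in prob_matrices:
        fused *= probs                                   # unnormalised combination of singleton BPAs
    return fused / fused.sum(axis=1, keepdims=True)      # renormalise by 1 - conflict

def predict(experts, X_train, y_train, X_test):
    prob_matrices = []
    for clf in experts:
        clf.fit(X_train, y_train)
        prob_matrices.append(clf.predict_proba(X_test))  # sklearn orders columns by sorted class labels
    fused = dempster_fuse(prob_matrices)
    return fused.argmax(axis=1)                          # stage 9: index of the winning class
```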

Ensemble margin
The ensemble margin was first presented by Schapire et al. in order to explain the success of the boosting algorithm (Schapire et al. 1998). It is an important concept in ensemble learning, is closely related to the accuracy of an ensemble, and its values vary between zero and one. The ensemble margin of a sample x_i can be computed as the normalized difference between votes, margin(x_i) = (m_{c1} − m_{c2}) / L, where c1 is the class that receives the most votes and m_{c1} is its number of votes, c2 is the class with the second most votes and m_{c2} is its number of votes, and L is the number of classifiers (Saidi et al. 2018).
The mean value of the ensemble margin over N samples is obtained as mean margin = (1/N) Σ_{i=1}^{N} margin(x_i).
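The following small helper mirrors this definition, assuming per-class vote counts are available for each sample; it is an illustrative sketch rather than the paper's code.

```python
# Unsupervised ensemble margin: (votes of top class - votes of runner-up) / L.
import numpy as np

def ensemble_margin(votes, L):
    """votes: 1-D array of vote counts per class for one sample; L: number of classifiers."""
    top_two = np.sort(votes)[::-1][:2]
    return (top_two[0] - top_two[1]) / float(L)

def mean_ensemble_margin(vote_matrix, L):
    """vote_matrix: (n_samples x n_classes) array of per-class vote counts."""
    return float(np.mean([ensemble_margin(v, L) for v in vote_matrix]))
```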

Sample selection algorithm based on ensemble margin
In the proposed method, we used a sample selection algorithm, based on the Ensemble Margin (EM), a fundamental concept in ensemble learning, to select the training samples from the KDD99 data set. The steps are as follows:
1. Given the selected data set, initialize the number of classes and classifiers used in the problem and the threshold γ, γ = 0.633.
2. Run the classifiers and calculate the ensemble margin for each training sample.
3. Select the better samples using the threshold γ.
4. Put the selected samples into the data set S.
We used the resulting data set in the proposed algorithm after selecting the best samples with this ensemble-margin-based selection algorithm. The pseudo-code of the proposed algorithm is shown below.
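Since the pseudo-code itself is not reproduced here, the following is a minimal sketch of the selection loop under the margin definition above. It assumes that "better" samples are those whose margin exceeds the threshold γ = 0.633 (i.e., samples on which the committee largely agrees); the exact selection criterion of the paper's algorithm may differ.

```python
# Sketch of margin-based sample selection: keep training samples whose ensemble
# margin exceeds gamma. The committee is any list of fitted classifiers.
import numpy as np

def select_samples(X, committee, gamma=0.633):
    L = len(committee)
    predictions = np.array([clf.predict(X) for clf in committee])   # shape (L, n_samples)
    keep = []
    for i in range(X.shape[0]):
        _, counts = np.unique(predictions[:, i], return_counts=True)
        counts = np.sort(counts)[::-1]
        runner_up = counts[1] if len(counts) > 1 else 0
        margin = (counts[0] - runner_up) / float(L)                 # same definition as above
        if margin > gamma:
            keep.append(i)                                          # step 3: high-margin samples
    return X[keep]                                                  # step 4: the selected set S
```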

Making probability of the nearest neighbor algorithm
The K-nearest neighbor (kNN) classifier is one of the most widely used algorithms for supervised learning; it is a practical algorithm for classification and is popular due to the simplicity of the concept and its rapid implementation. This classifier was introduced by T. M. Cover and P. E. Hart (Cover and Hart 1967). We use the kNN classifier to solve multi-class problems; in the present study, four probabilistic kNN classifiers with different k values are used for attack classification.
The different values of k are chosen for diversity and better performance of the ensemble. The probabilistic kNN method is based on a posterior probability estimate, and we use a heuristic function to convert the classifier output into probability values. The proposed probabilistic kNN algorithm is described below.
We assume the probabilistic method operates on a set of points with different features in the feature space; three different classes are shown in Fig. 5. In this method, the Euclidean distance of the k nearest neighbors from the test datum E_x is obtained using Eq. 19, d(x, y) = ( Σ_j (x_j − y_j)² )^{1/2}.
The steps to obtain the probabilistic kNN values are as follows: 1. In the first step, we compute the Euclidean distance from E_x to the N input data; 2. In the second step, we obtain the k nearest neighbors of the test datum E_x; 3. In the third step, we obtain the distance d_m of each neighbor m belonging to class i from E_x, where d_m is the Euclidean distance of the m-th datum from the test datum; 4. In the fourth step, we compute the probability p_i for each class i using Eq. 21; 5. In the fifth step, we select the maximum p_i, so that the datum belongs to class i with probability p_i.
As shown in Fig. 5, for which k = 5 (the nearest-neighbor value), two of the neighbors belong to class i, two belong to class i+1, and one belongs to class i+2. To obtain the probabilistic output of kNN, we proceed as follows. First, we compute the probability of every class. To compute the probability that the test datum belongs to class i, we obtain the Euclidean distances of the first and second neighbors belonging to class i and sum their inverse distances; we then compute the sum of the inverse distances of all k neighbors. Finally, the probability of belonging to class i is calculated using Eq. 21, p_i = ( Σ_{m ∈ class i} 1/d_m ) / ( Σ_{m=1}^{k} 1/d_m ). We repeat this for all classes. After calculating the probability values, if p(i) is greater than p(i+1) and p(i+2), the test datum is assigned to class i with probability p(i).
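A minimal sketch of this probabilistic kNN output is given below, assuming the inverse-distance weighting written above for Eq. 21; the exact weighting in the paper's heuristic function may differ.

```python
# Probabilistic kNN output via inverse-distance weighting over the k nearest neighbours.
import numpy as np

def knn_class_probabilities(X_train, y_train, x_test, k=5):
    """Return a dict mapping each class label to its estimated probability for x_test."""
    dists = np.linalg.norm(X_train - x_test, axis=1)      # Euclidean distances (Eq. 19)
    nn = np.argsort(dists)[:k]                            # indices of the k nearest neighbours
    inv = 1.0 / (dists[nn] + 1e-12)                       # inverse distances, avoiding division by zero
    total = inv.sum()
    return {cls: float(inv[y_train[nn] == cls].sum() / total)   # Eq. 21-style normalisation
            for cls in np.unique(y_train[nn])}

# Usage: the predicted class is the one with the highest probability.
# probs = knn_class_probabilities(X_train, y_train, x_test, k=5)
# predicted = max(probs, key=probs.get)
```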

Probabilistic support vector machine (PSVM)
The support vector machine (SVM) is one of the best methods for solving classification and regression problems. It can be used well for both two-class and multi-class classification problems, has a high training rate and decision-making speed, and is appropriate for both regression and classification. Of the two schemes used to generalize support vector machines to the multi-class case, One-Against-All (OAA) and One-Against-One, the One-Against-All method is used in our implementation (Javid and Hamidzadeh 2019). Different kernels, such as linear, polynomial, sigmoid, and RBF, have been introduced for use in the feature space.
Experimental results have shown that the SVM classifier performs better when using the RBF and Hermite kernels; the Hermite kernel is used to increase the classification accuracy and speed. The kernel functions are: the linear kernel, K(x, y) = xᵀy; the sigmoid kernel function (SKF), K(x, y) = tanh(γ xᵀy + r); the radial basis function (RBF), K(x, y) = exp(−γ ‖x − y‖²); and the Hermite kernel, which is built from the Hermite polynomials defined by He_0(x) = 1, He_1(x) = x, and He_{n+1}(x) = x He_n(x) − n He_{n−1}(x), up to the chosen degree. In this paper, we experimented with the different kernels, and the best results were obtained with the RBF kernel function. We trained four different SVM experts with these kernel functions; this approach increases the diversity of the experts in the ensemble. We selected RBF kernel parameter values of [1, 3] and Hermite kernel degrees of [8, 10].
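The sketch below implements these kernel functions. The RBF and sigmoid forms are the standard definitions; the Hermite-based kernel shown is only one plausible construction from the He_n recurrence, and the exact kernel used in the cited work (Moghaddam and Hamidzadeh 2016) may differ.

```python
# Illustrative kernel functions; the Hermite kernel construction is an assumption.
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, gamma=0.1, r=0.0):
    return np.tanh(gamma * np.dot(x, y) + r)

def hermite_polynomials(t, degree):
    """Probabilists' Hermite polynomials He_0..He_degree, evaluated element-wise."""
    H = [np.ones_like(t), t]
    for n in range(1, degree):
        H.append(t * H[n] - n * H[n - 1])   # He_{n+1}(t) = t*He_n(t) - n*He_{n-1}(t)
    return H[: degree + 1]

def hermite_kernel(x, y, degree=8):
    # One simple construction: sum of products of the Hermite expansions of the
    # two inputs over all features (an assumption, not necessarily the paper's form).
    return sum(np.sum(hx * hy) for hx, hy in zip(hermite_polynomials(x, degree),
                                                 hermite_polynomials(y, degree)))
```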
The SVM output is a distance that can be used to compare classifiers; however, posterior probabilities are needed for most applications. Platt introduced a method for converting the SVM output into probability values by fitting a sigmoid with maximum likelihood (Platt 1999): P(y = 1 | f) = 1 / (1 + exp(A f + B)) (Eq. 28), where f is the SVM output; the parameters A and B are fitted with a model-trust algorithm in the style of Levenberg-Marquardt.
Linear and polynomial functions can also be used to make the support vector machine probabilistic, but according to our experiments the sigmoid function gives better results. Fig. 6 shows the probabilistic output for two thousand test samples. In the present study, the sigmoid function is used to obtain the probability of the SVM output.
The sigmoid training algorithm, a model-trust algorithm based on the Levenberg-Marquardt method, is shown below as the PSVM algorithm. It takes as input the vector of SVM outputs on a data set, the data set labels, the number of negative points, and the number of positive points, and returns the coefficients A and B such that the posterior probability is P(y = 1 | x) = 1 / (1 + exp(A f(x) + B)), where f(x) is the output of the SVM.

Input:
  f: vector of outputs of the SVM on the data set
  y: data set labels
  prior0: number of negative points
  prior1: number of positive points
Output: coefficients A and B, and the probabilities p_i

hiTarget = (prior1 + 1) / (prior1 + 2)
loTarget = 1 / (prior0 + 2)
t_i = hiTarget if y_i is positive, otherwise loTarget
A = 0
B = log((prior0 + 1) / (prior1 + 1))
while the error function F(A, B) = − Σ_i [ t_i log(p_i) + (1 − t_i) log(1 − p_i) ] has not reached its minimum do
  compute p_i = 1 / (1 + exp(A f_i + B))                (Eq. 28)
  compute the gradient and Hessian of the error function and update A and B (use H' = H + σI for stability)
end while
return the coefficients A and B such that the posterior probability is p_i
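The following is a minimal Platt-scaling sketch in Python. It substitutes a simple gradient-descent fit for the model-trust (Levenberg-Marquardt-style) update described above, so it is only an illustration of the idea, not the algorithm of Platt (1999) itself.

```python
# Fit A, B so that P(y=1|f) = 1 / (1 + exp(A*f + B)), following the target
# regularisation of Platt scaling but using plain gradient descent (assumption).
import numpy as np

def fit_platt_sigmoid(f, y, lr=0.05, iters=10000):
    prior1 = np.sum(y == 1)
    prior0 = np.sum(y != 1)
    hi, lo = (prior1 + 1.0) / (prior1 + 2.0), 1.0 / (prior0 + 2.0)
    t = np.where(y == 1, hi, lo)                      # regularised targets t_i
    A, B = 0.0, np.log((prior0 + 1.0) / (prior1 + 1.0))
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(A * f + B))
        dz = t - p                                    # gradient of the error w.r.t. z = A*f + B
        A -= lr * np.mean(dz * f)
        B -= lr * np.mean(dz)
    return A, B

# Usage: convert raw SVM decision values into probabilities.
# A, B = fit_platt_sigmoid(svm_scores_train, y_train)
# probs_test = 1.0 / (1.0 + np.exp(A * svm_scores_test + B))
```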

Data set KDD
A network traffic data set is used to evaluate the proposed method. The Cyber Systems and Technology Group of the MIT Lincoln Laboratory collected this data set by simulating the United States Air Force (USAF) LAN over 9 weeks of continuous TCP-dump data containing attacks. The data set simulates normal traffic and four classes of attacks (Kaushik and Deshmukh 2011; Tavallaee et al. 2009): - DOS: the attacker sends a large number of requests to a host. - R2L: the attacker tries to gain unauthorized local access to a system from a remote machine. - U2R: the attacker, as a local user, attempts to exploit vulnerable points of the system to gain root privileges. - Probe: the attacker tries to obtain information about machines and network services.
This work was carried out on various platforms such as Windows and Unix. The KDD99 classes used are listed in Table 4, and the numbers of training and test records are shown below. The NSL-KDD is a refined version of KDD99 in which redundant and duplicate records are eliminated to reduce classifier bias; it retains all the features of KDD99. The KDD data set, the benchmark and most popular data set in the field of intrusion detection, has 41 features, comprising 38 numerical features and 3 symbolic features.

Data set UNSW-NB15
The cybersecurity research team of the Australian Centre for Cyber Security (ACCS) introduced a new data set called UNSW-NB15 [45] to resolve the issues found in the KDD-Cup 99 and NSL-KDD data sets. UNSW-NB15 contains 42 features, of which 3 are nominal and 39 are numeric. It is divided into two main parts: UNSW-NB15-TRAIN, used for training, and UNSW-NB15-TEST, employed for testing the trained models. In this research, we further split UNSW-NB15-TRAIN into two sets: UNSW-NB15-TRAIN-1 (75% of the full training set) for training and UNSW-NB15-VAL (25% of the full training set) for validation before testing. UNSW-NB15 contains samples of nine categories of network attacks: Backdoor, Shellcode, Reconnaissance, Worms, Fuzzers, DOS, Generic, Analysis, and Exploits. Table 2 shows the details and the distribution of each attack class within the data subsets.

Data set CICIDS 2017
The CICIDS 2017 data set [6] was developed by the Faculty of Computer Science at the University of New Brunswick in 2017. The data set comprises both benign traffic and up-to-date common attacks such as Brute Force FTP, Brute Force SSH, DoS, Heartbleed, Web Attack, Infiltration, Botnet, and DDoS (Swami et al. 2020). It reflects realistic traffic in real networks as well as newer means of attack. We chose CICIDS 2017 because it consists of 5 days of data collection; the CSV version contains 2,830,743 rows divided into 8 files, each row having 79 features. Each row is labeled as Benign or as one of fourteen types of attack, which is used to create the training and test subsets. In CICIDS 2017, the attack simulation is divided into seven categories: Brute Force Attack, Heartbleed Attack, Botnet, DoS Attack, DDoS Attack, Web Attack, and Infiltration Attack. CICIDS 2017 contains more complex types of attacks, as presented in Table 3. The rationale for selecting CICIDS 2017 is to have a data set that clearly reflects current real-world network traffic in the experiments.

Data preprocessing
Raw data often have problems such as noise, bias, and sharp changes in dynamic range and sampling, and using them as they are would weaken the subsequent design. Data preprocessing involves all conversions applied to the raw data, such as reducing its size, that turn it into a form usable for later processing such as classification, making it simpler and more effective to work with. Since the features in the data set are both discrete and continuous, we must ensure that each observation is represented by numeric values; KDD includes three symbolic features, namely protocol type, service, and flag. We perform data preprocessing in three steps:
1. Data mapping: symbolic feature values are mapped to numeric labels manually for each record in the training, test, and validation sets. The values of these features are mapped to numeric values ranging from 1 to M, where M is the total number of symbols of each feature.
2. Eliminating duplicate packets: the data set may include duplicate packets that contain the same samples. To avoid this overhead, duplicate packets, which have no effect on model training, are eliminated.
3. Identification of the class: the data set includes a class for each record, which is either a normal connection or a type of attack. Each record belongs to one of five major classes: normal, Probe, DOS, U2R, and R2L. The value of each class is mapped to a numeric value: normal to 5, Probe to 4, DOS to 3, U2R to 2, and R2L to 1. The labeling method and numerical range of the labels are described in Fig. 7.
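A minimal preprocessing sketch for these three steps is shown below, assuming the data are loaded into a pandas DataFrame with NSL-KDD-style column names and that the attack labels have already been grouped into the five major classes; the actual column names and file names may differ.

```python
# Sketch of the three preprocessing steps (column names are assumptions).
import pandas as pd

def preprocess(df):
    df = df.drop_duplicates()                                    # step 2: eliminate duplicate packets
    for col in ["protocol_type", "service", "flag"]:             # step 1: symbolic -> numeric (1..M)
        categories = sorted(df[col].unique())
        df[col] = df[col].map({v: i + 1 for i, v in enumerate(categories)})
    class_map = {"r2l": 1, "u2r": 2, "dos": 3, "probe": 4, "normal": 5}   # step 3: class labels
    df["label"] = df["label"].str.lower().map(class_map)
    return df

# Usage: df = preprocess(pd.read_csv("kdd_train.csv"))   # hypothetical file name
```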

Experimental setup
The experiments were carried out on a personal computer with a 2.40 GHz Intel Core i7 CPU and 16 GB of RAM, using the MATLAB 2016a environment for the experiments and Python for the figures.
In this paper, we selected two data sets, namely NSL-KDD and KDD99, for the experiments. We used five data subsets, drawn from the KDD99 training and test sets, which are the same size in all experiments. The number of selected records for each class is shown in Table 4; these values are randomly selected from the test and training data listed in Table 1. Finally, the samples for the experiments are selected using the sample selection algorithm based on the ensemble margin.
Some experiments are based on k-fold cross-validation to evaluate the test accuracy; this removes the dependency between samples and gives assurance that the results are not due to chance. Each training data set is therefore divided into five subsets: in each round, four of the subsets are put together as the training set, and the remaining subset is used as the test set. Finally, the samples for the experiments are selected using the sample selection algorithm based on the ensemble margin. The confusion matrix is one of the ways to evaluate classifiers; it summarizes the relevant information, and Table 5 shows this matrix. Several evaluation criteria can be defined with the help of the confusion matrix, the most important being Accuracy, Recall, F-measure, and Precision.
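For reference, the sketch below computes these criteria from the confusion-matrix counts; it assumes the standard binary definitions, which are applied per class and averaged in the multi-class case.

```python
# Evaluation metrics from confusion-matrix counts (true/false positives and negatives).
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f_measure

# Example: metrics(tp=90, tn=95, fp=5, fn=10)
```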

Analysis of results
We list the results of the experiments on the five data sets for each base classifier and for the ensemble in Table 6. The outputs of the probabilistic nearest-neighbor classifiers and the probabilistic support vector machines are passed to the Dempster-Shafer combiner. Overall, it can be seen that the accuracy of the proposed method is higher than the accuracy of the other methods for most classes. For U2R attacks, whose number of training and test samples is low, the kNN classifier alone has a lower accuracy rate, but the proposed method achieves a dramatically higher accuracy than the prior method by combining classifiers. We also compared the accuracy rate of the proposed method with the other classifiers on each data set, and the results are presented in Table 6. The experimental results are shown separately in Figs. 8, 9, 10, 11, and 12, which illustrate the accuracy of the SVM, kNN, and ensemble experts, respectively; in these figures, we show the accuracy of each expert for all data sets. Table 7 presents statistical information about the average results of the different experts on the data sets used in the proposed method. In general, it can be inferred that the accuracy on attacks is higher than that of the other methods; in particular, the value obtained for U2R, 99.84%, is higher than for the other attacks. The aim of this experiment is to illustrate the ability of the proposed method to increase the accuracy of attack detection. Tables 9 and 11 compare the accuracy of the proposed method with other intrusion detection methods on the KDDCUP and NSL-KDD data sets. We used the nonparametric Wilcoxon signed-rank test for statistical analysis and comparison of the results; the two rows of Table 10 show the results of the Wilcoxon test (Demšar 2006). The results show the superiority of the classification accuracy of the proposed method in contrast to the other methods.
As can be seen, the accuracy obtained by the proposed method is especially superior on the Normal, Probe, DOS, and R2L attacks compared with the PSO-, LUS-, and WMA-based methods, although this method does not perform well enough on the Normal and R2L attacks. While the proposed method and the WMA-based methods significantly increase the accuracy on U2R attacks, the CANN, TNN, and DPNN methods (Li et al. 2018; Lin et al. 2015) perform poorly on R2L and U2R. As can be seen in the accuracy diagrams of Fig. 13, the proposed method is in the best position compared with the other methods mentioned above. A somewhat poorer result for the proposed method was obtained for Normal, at 98.23%, which is nevertheless higher than that of the other methods.
Next, we compare the performance of our method with feature selection methods. To do this, we take all the features of the data set, apply different feature selection techniques, and compare their performance with our autoencoder-based method and with the model that uses all the features. Table 8 shows the results of the kNN classifier using all the features, the features generated by the autoencoder, and the other feature selection techniques: Principal Component Analysis (PCA), variance threshold, and tree-based feature selection. The table shows that our method outperforms all other feature selection techniques as well as the model that utilizes all the features, suggesting the effectiveness of the autoencoder feature extraction technique for this task. According to Table 11, the proposed approach is superior to the listed previous studies in terms of accuracy for IDS on the NSL-KDD data set. In Table 12, the proposed method achieves 90.98% accuracy, whereas CART, MLP, NB, and CMN achieve 88.67%, 89.09%, 83.22%, and 89.95%, respectively; furthermore, for the other metrics, the proposed method shows higher results than the other classifiers on the UNSW-NB15 data set, and NB gives the worst performance in terms of accuracy, precision, and F-measure. As shown in Table 13, our study achieves an accuracy of 98.97% on the CICIDS 2017 data set, while the RF, kNN, MLP, and RKM experts give accuracies of 97.89%, 98.01%, 97.73%, and 98.04%, respectively; on this data set MLP achieves the worst result, and our study achieves a better result for all metrics than the other methods. As illustrated in Table 14, the proposed ensemble model shows the highest results in terms of Accuracy, Precision, Recall, and F-measure on the NSL-KDD data set, with better performance than the other classifiers. We also show the experimental results in Figs. 14, 15, and 16. This is because, in the ensemble method, correct decisions are strengthened and incorrect decisions are cancelled or weakened, so these methods perform better than single methods in attack detection.

Conclusions and future Work
In this paper, we proposed a novel method to improve performance and obtain high accuracy in detecting network attacks via an ensemble method, in which correct decisions are strengthened and incorrect decisions are weakened. In this work, we first used SVM and kNN experts, with a heuristic function in kNN and a sigmoid function in SVM to convert their outputs into probabilistic values; finally, Dempster-Shafer theory was used to combine the base experts into an ensemble expert. Deep learning was used for feature extraction, and the ensemble margin was used to select better samples from the data set. In the detection of R2L and Probe attacks in particular, the method has been able to achieve the highest accuracy among the common methods in this field. We performed different experiments on our proposed method and observed the performance of all the classifiers. The results of the experiments on the UNSW-NB15, CICIDS 2017, and NSL-KDD data sets showed the superiority of the proposed method in comparison with the other methods in terms of Accuracy, Precision, F-measure, and Recall. In the future, we plan to study how to compute the probabilities of the outputs of SVM and kNN using other methods in order to reduce time and improve performance. We also intend to use other classifiers, such as the Bayesian network, whose output is inherently probabilistic, as base classifiers, which may perform better in terms of accuracy. Another interesting direction is to combine the advantages of SVMs and Convolutional Neural Networks for improved classification effectiveness.

Declarations
Conflict of Interest All Authors declare that they have no conflict of interest.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors. Informed consent was obtained from all individual participants included in the study.