A self-adaptive level-based learning artificial bee colony algorithm for feature selection on high-dimensional classification

Feature selection is an important data preprocessing method in data mining and machine learning, yet it faces the “curse of dimensionality” when dealing with high-dimensional data. In this paper, a self-adaptive level-based learning artificial bee colony (SLLABC) algorithm is proposed for the high-dimensional feature selection problem. The SLLABC algorithm includes three new mechanisms: (1) A novel level-based learning mechanism is introduced to accelerate the convergence of the basic artificial bee colony algorithm. It divides the population into several levels; the individuals on each level learn from the individuals on higher levels, and in particular the individuals on the highest level learn from each other. (2) A self-adaptive method is proposed to keep the balance between exploration and exploitation, which takes the diversity of the population into account to determine the number of levels: the lower the diversity, the fewer levels the population is divided into. (3) A new update mechanism is proposed to reduce the number of selected features. In this mechanism, if the error rate of an offspring is higher than that of its parent, or is equal to it while the offspring selects more features, the offspring is discarded and the parent is retained; otherwise, the offspring replaces its parent. Further, we discuss and analyze the contribution of these novelties to the diversity of the population and the performance of classification. Finally, the results, compared with 8 state-of-the-art algorithms on 12 high-dimensional datasets, confirm the competitive performance of the proposed SLLABC in terms of both classification accuracy and the size of the feature subset.


Introduction
Feature selection (FS) is an important data preprocessing method in data mining. With the rapid development of data acquisition technology, the number of features collected in many applications has increased greatly. However, not all of them are related to the given target, and irrelevant and redundant features may even reduce the performance of classification. This curse of dimensionality is a major obstacle in machine learning and data mining (Gheyas and Smith 2010). Hence, FS aims to choose a small number of relevant and non-redundant features that achieve similar or even better classification performance than the full feature set. FS methods can be categorized into three types: wrapper, filter and embedded. Filter methods analyze intrinsic properties of features and rank their importance while ignoring the objective function (Rakshit et al. 2013). Wrapper methods use an objective function to evaluate the quality of different feature subsets and select the best one (Grande et al. 2007). Compared with filter methods, wrapper methods usually obtain higher accuracy; however, they have a high computational cost, especially for high-dimensional, large datasets. Embedded methods attempt to reduce the cost of repeatedly evaluating different subsets, as wrapper methods do, by performing feature selection within the training process. Therefore, the performance of an embedded method is usually specific to the given learning algorithm, which results in poor robustness with respect to changes of the learning algorithm (Rao et al. 2019). In this study, we focus on the wrapper method.
In principle, FS is an NP-hard combinatorial problem. Given a feature set with n features, the number of all possible feature subsets is 2^n, which makes it too costly and practically restrictive to identify the best feature subset with traditional exhaustive search approaches, since the size of the search space increases exponentially with the number of features. Therefore, evolutionary computation (EC) techniques attract attention due to their global search strategies and have been proven effective in dealing with FS problems (Oh et al. 2004). There are many typical EC techniques, such as the genetic algorithm (GA), particle swarm optimization (PSO), ant colony optimization (ACO) and differential evolution (DE). All of these EC techniques have shown effectiveness in dealing with feature selection. Details will be described in the next section.
In this paper, we concentrate on a recent EC technique, the artificial bee colony (ABC) algorithm, which was proposed by Karaboga in 2005 (Karaboga 2005) and simulates the intelligent foraging behavior of a honey bee swarm. Compared with other evolutionary algorithms, the ABC algorithm has many advantages, such as strong robustness, high flexibility, a simple structure, few control parameters and a strong ability to explore a wide space. Accordingly, ABC is chosen for feature selection in this paper. However, when dealing with the FS problem, the basic ABC algorithm has some drawbacks, including slow convergence, insufficient exploitation, and the fact that the best solution in the bee population is memorized but not used to guide the population update.
To address the above issues, many studies have been conducted; however, few of them target high-dimensional FS, and most of them aim to improve classification performance by combining the ABC algorithm with other algorithms, which introduces more complex parameters to control than a single algorithm (reviewed at length in Sect. 2). Therefore, we propose a self-adaptive level-based learning artificial bee colony algorithm (SLLABC) for feature selection with the following contributions: (1) Introducing a novel level-based learning mechanism into the ABC algorithm. The bees are divided into levels from high to low according to their fitness. Each bee ''emulates those better than itself'' from higher levels, except that the bees in the highest level learn from each other. This mechanism effectively improves the exploitation ability of the ABC algorithm and accelerates its convergence. (2) Proposing a self-adaptive method for determining the number of levels. The number of levels is dynamically and automatically adjusted by the diversity of individuals during evolution. The higher the diversity of individuals, the more levels there are and the lower the probability of dominant individuals being learned from, and vice versa. This method preserves high exploration in early evolution and promotes high exploitation in later evolution. (3) Proposing a new update strategy for individuals. Different from single-objective methods that only consider the accuracy, as well as multi-objective methods that adopt multi-objective fitness functions, this strategy takes both error rate and subset size into consideration but gives priority to the lower error rate, which impels the population to approach the optimal subset with both the minimum error rate and the smallest size.
The rest of this paper is organized as follows. Section 2 reviews the relevant literature on EC-based feature selection algorithms. Section 3 introduces the basic artificial bee colony algorithm. In Sect. 4, the SLLABC algorithm is proposed based on three new strategies. The experimental results are shown in Sect. 5. Finally, conclusions are given in Sect. 6.

Related works
To eliminate the negative impact of irrelevant and redundant features, a variety of feature selection methods have been proposed. However, due to the inefficiency of traditional search approaches on feature selection, which is a complex combinatorial optimization problem, various EC-based feature selection algorithms have been developed. The forerunners are Siedlecki and Sklansky, who proved that the genetic algorithm (GA) is a powerful tool for feature selection when the dimensionality of the given feature set is greater than 20 (Siedlecki and Sklansky 1993). After that, to further improve the performance of GA for feature selection, many improvements have been proposed in search mechanisms (Demirekler and Haydar 1999; Jeong et al. 2015; Wang et al. 2020) and fitness functions (Canuto and Nascimento 2012; Sousa et al. 2013). A GA-based feature selection method (Derrac et al. 2009) utilizes the cooperative coevolution (CC) framework (Potter and Jong 2000) but was not investigated on large datasets. With the same idea of ''divide and conquer'', particle swarm optimization (PSO) achieves better performance than GA (Song et al. 2020; Van den Bergh and Engelbrecht 2004). However, the traditional personal-best and global-best updating mechanism in PSO limits its performance for feature selection, and the potential of PSO for feature selection has not been fully investigated. Therefore, Xue et al. proposed new initialization strategies and new personal-best and global-best updating mechanisms (Xue et al. 2014b). As the dimension of the problem increases, most PSO-based FS methods consume a significant amount of memory and incur a high computational cost. A variable-length PSO (Tran et al. 2018) has been developed to deal with high-dimensional data. Compared with fixed-length PSO methods, the variable-length PSO can achieve much smaller feature subsets with significantly higher classification performance in a much shorter time.
A PSO variant, namely the competitive swarm optimizer (CSO), directly adopts predominant particles in the current swarm to guide the update of particles (Cheng and Jin 2015); the diversity of this optimizer is greatly promoted, and it thus shows good performance on large-scale optimization. In Ref. (Gu et al. 2018), the CSO is applied to the high-dimensional FS problem. Recently, a hybrid CSO algorithm (Ding et al. 2020) has been proposed to overcome the drawbacks of the original CSO, namely its low computational efficiency and tendency to fall into local optima. The CC framework has been applied to differential evolution (DE) as well, namely DECC (Shi et al. 2005). DE is also a very popular evolutionary algorithm for FS problems. Xue et al. designed a DE-based multi-objective feature selection algorithm (Xue et al. 2014a). After that, a binary DE variant with self-learning (MOFS-BDE) was employed to improve the classification performance of DE. Since FS is a discrete combinatorial problem, it has been widely tackled by ant colony optimization (ACO), which was originally designed to solve discrete optimization problems. Interestingly, the CC framework has also been applied to the ACO algorithm (Vieira et al. 2010). However, ACO usually represents FS as a graph problem, which restricts the scalability of the algorithm and the interaction between features. Hence, some hybrids of ACO and other EC algorithms have been proposed to improve its scalability, such as a hybrid algorithm of ACO and GAs (Hamamoto et al. 2015) and three mechanisms for hybrid ACO-PSO approaches to feature selection (Menghour and Souici-Meslati 2016). Recently, Meenachi and Ramakrishnan hybridized ACO and fuzzy rough sets to select globally optimal features (Meenachi and Ramakrishnan 2020). Besides the above algorithms, many other EC-based feature selection algorithms have also been developed. Since too many studies exist, we cannot review them all.
Here, to save space, we only list the above typical works. For a comprehensive review of EC-based algorithms, readers can refer to Ref. (Xue et al. 2016) and Ref. (Nguyen et al. 2020).
This paper concentrates on ABC-based FS algorithms. More and more papers have developed ABC variants for FS in the past decade. Gao et al. claimed that ABC is good at exploration but poor at exploitation. To overcome this issue, an opposition-based learning method and chaotic maps are used in the initialization of the algorithm, and the update rule of the basic algorithm is replaced with a new one that considers the best solution in the population (Gao et al. 2011). A discrete binary ABC variant (DisABC) substantially modifies the search mechanism by using the Jaccard similarity coefficient for feature selection problems (Hancer et al. 2015). An ABC variant named CLABC (cooperative learning ABC) utilizes a cooperative learning strategy with modified search mechanisms and multiple search equations (Harfouchi and Habbi 2015). In another study, the population of ABC is divided into two groups, called the diversity population (DP) and the convergence population (CP). In CP, promising food sources are stored to improve the convergence ability of the algorithm, while historically unpromising food sources are stored in DP to maintain the diversity of the population (Cui et al. 2018). To improve the local search capability of the basic algorithm, a quick ABC algorithm (Karaboga and Gorkemli 2014) and some update strategies based on the normal distribution (Babaoglu 2015) have been proposed. To overcome the slow convergence and tendency toward local optima of the basic ABC, opposition-based learning and generalized opposition-based learning have been integrated with it (Zhou et al. 2015).
Although many ABC variants have been proposed, few of them are developed for high-dimensional FS. Moreover, most existing ABC variants come at the cost of higher computational complexity and more complex algorithm implementations.
Compared with other swarm intelligence algorithms, one advantage of the ABC algorithm is that exploration and exploitation are clearly separated. In particular, the employed bees and onlooker bees, which focus on neighboring solutions, perform exploitation, while the scout bees, which focus on generating random solutions, perform exploration. Nevertheless, many studies combine ABC with other optimization algorithms to keep the balance between exploration and exploitation. ABC has been combined with DE to perform FS for classification in ABC-DE (Zorarpacı and Özel 2016). This algorithm contains a new binary neighborhood search mechanism and a modified onlooker bee process for the ABC algorithm; it also has a new binary mutation phase for the DE algorithm. In another study, a hybrid of ABC and a quantum evolutionary algorithm was proposed, in which ABC improves the local search capability of the hybrid algorithm (Duan et al. 2010). By combining the characteristics of the ACO and ABC algorithms, a novel hybrid algorithm, AC-ABC, has been proposed to optimize feature selection (Shunmugapriya and Kanmani 2017). In this variant, ants are committed to exploiting the optimal feature subset, and the bees use the feature subsets generated by the ants as their food sources.
Although the above hybrid algorithms achieve promising results, they also have more parameters than a single algorithm, including the parameters of both underlying algorithms and the parameters controlling the hybridization. Given ABC's clear separation between exploration and exploitation, this study introduces a novel search mechanism into the ABC algorithm to control this balance instead of hybridizing ABC with other algorithms.

Artificial bee colony algorithm
In the basic ABC algorithm, the colony of artificial bees is divided into three types: employed bees, onlooker bees and scout bees. The employed bees search for food sources and share their information with the onlooker bees. The onlooker bees select one of the food sources according to the information from the employed bees, and exploit the neighborhood of that food source to produce a new one. An employed bee whose food source has not been improved within a predetermined number of trials becomes a scout bee and starts to search randomly for a new food source. The basic implementation of ABC comprises four phases:

(1) Initialization phase: The algorithm randomly produces food sources. Each food source is described as a vector in the search space, x_i = (x_{i,1}, x_{i,2}, ..., x_{i,D}), and is generated by Eq. (1):

x_{i,j} = x_j^{min} + U(0,1) * (x_j^{max} - x_j^{min})    (1)

where i ∈ {1, 2, ..., SN} and SN is the number of food sources, which is equal to the number of employed bees or onlooker bees; j ∈ {1, 2, ..., D} and D is the dimensionality of the search space; x_{i,j} is the j-th dimension of x_i; U(0,1) is a uniformly distributed random variable; and x_j^{min} and x_j^{max} are the minimum and maximum boundary values, respectively.
(2) Employed bee phase: Each employed bee is associated with a food source. Employed bees modify the position of their food source to find new and better ones. To this end, they learn from a neighbor source x_k, which is selected randomly among all sources except their own. The new food source is produced by Eq. (2):

x'_{i,j} = x_{i,j} + φ_{i,j} * (x_{i,j} - x_{k,j})    (2)

where x_i is the current food source and φ_{i,j} is a uniformly distributed random value within [-1, 1]. After x'_i is produced, its fitness value is evaluated and compared with that of x_i. If x'_i is better than x_i, x'_i replaces x_i in the next iteration and the counter holding its number of trials is reset to 0. Otherwise, the current source x_i is kept for the next iteration and its trial counter is increased by 1.
(3) Onlooker bee phase: After each iteration, all employed bees pass information about their sources to the onlooker bees. Each onlooker bee selects a food source according to the fitness values using a roulette-wheel scheme, where the better the fitness value of a source, the higher its probability of being selected. The probability value is calculated by Eq. (3):

p_i = fit_i / Σ_{n=1}^{SN} fit_n    (3)

where fit_i is the fitness value of source x_i, which is calculated by Eq. (10). After calculating the probability value of each source, a random number rand(0,1) is generated to determine which source to choose. If p_i > rand(0,1), x_i is chosen to be updated just as in the employed bee phase.

(4) Scout bee phase: Each source has a counter, which is zero at the beginning. If the counter holding the number of trials exceeds the predefined threshold value, the corresponding food source is abandoned and replaced by a new food source generated by Eq. (1).
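As an illustration, the three search operations above can be sketched in Python (a minimal sketch only; the paper's implementation is in MATLAB, and all function names here are our own):

```python
import random

def init_source(D, x_min, x_max):
    # Eq. (1): a random food source, one value per dimension within the bounds
    return [x_min + random.random() * (x_max - x_min) for _ in range(D)]

def neighbor_source(population, i):
    # Eq. (2): perturb one random dimension j of x_i toward or away from
    # a randomly chosen neighbor x_k (k != i), with phi drawn from [-1, 1]
    x_i = population[i]
    k = random.choice([n for n in range(len(population)) if n != i])
    j = random.randrange(len(x_i))
    phi = random.uniform(-1.0, 1.0)
    x_new = list(x_i)
    x_new[j] = x_i[j] + phi * (x_i[j] - population[k][j])
    return x_new

def selection_probabilities(fit):
    # Eq. (3): roulette-wheel probability proportional to fitness
    total = sum(fit)
    return [f / total for f in fit]
```

Note that a candidate produced by `neighbor_source` differs from its parent in at most one dimension, which is exactly the single-dimension perturbation of Eq. (2).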
When the ABC algorithm is applied to feature selection, we abstract each food source into a feature subset, and the quality of the food source is the fitness value of the feature subset. Each individual is represented with a binary string. ''1'' in the string means the feature is selected and ''0'' means the feature is not selected.
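For instance, decoding such a binary individual into the indices of the selected features might look like this (the helper name is ours):

```python
def decode_subset(bits):
    # positions holding 1 are the selected features
    return [d for d, b in enumerate(bits) if b == 1]

# the individual [1, 0, 0, 1, 1] selects the features at positions 0, 3 and 4
```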
The proposed method

A novel level-based learning strategy

Yang et al. (Yang et al. 2017) proposed a level-based learning (LL) mechanism for PSO, namely the level-based learning swarm optimizer (LLSO), and applied it to continuous-space optimization problems. In LLSO, two particles are selected from two different levels to instruct the current particle to update its position, and the particles in the highest level directly enter the next iteration without update. Due to the difference between ABC and PSO, a novel LL mechanism is proposed here for ABC-based feature selection to solve the problem in discrete space.
The novel LL mechanism is described as follows: (1) First, the individuals in the population are sorted in ascending order of their fitness values.
(2) The sorted population is divided into NL levels: the individuals with good fitness values are assigned to high levels, and the individuals with poor fitness values are assigned to low levels. The number of individuals in each level is the same. Note that the whole swarm may not be equally partitioned by NP/NL; in this situation, we assign the last NP%NL individuals to the lowest level. (3) Each individual ''emulates those better than itself'' by learning from a random individual in a higher level, while the individuals in the highest level learn from a random individual in the same level to update their positions.
To better understand the novel LL mechanism, we give an example in Fig. 1, in which the population is divided into four levels. Level 1 is the highest level and Level 4 is the lowest. According to the above process, each individual in Level 4 learns from an individual randomly selected from Level 3, Level 2 or Level 1; each individual in Level 3 learns from one randomly selected from Level 2 or Level 1; each individual in Level 2 learns from one randomly selected from Level 1; and each individual in Level 1 learns from another individual randomly selected from the same level. In short, x_k in Eq. (2) is selected randomly from the individuals in the selected level, which is one of the higher levels.
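The partitioning and exemplar selection described above can be sketched as follows (a Python sketch with our own naming; level index 0 plays the role of Level 1 in Fig. 1):

```python
import random

def assign_levels(sorted_pop, NL):
    # sorted_pop: individuals (or their indices) ordered from best to worst.
    # Each level gets NP // NL members; the remaining NP % NL individuals
    # are appended to the lowest level, as described in the text.
    size = len(sorted_pop) // NL
    levels = [list(sorted_pop[l * size:(l + 1) * size]) for l in range(NL)]
    levels[-1].extend(sorted_pop[NL * size:])
    return levels

def pick_exemplar(levels, my_level):
    # highest level (index 0) learns from a peer in the same level;
    # any other level learns from a random member of a random higher level
    src = 0 if my_level == 0 else random.randrange(my_level)
    return random.choice(levels[src])
```

With NP = 10 and NL = 4, the levels get 2, 2, 2 and 4 members, the remainder going to the lowest level.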
Obviously, the number of candidate exemplars for individuals in different levels is different. In other words, as the level that an individual belongs to goes higher, this individual has fewer exemplars in the higher levels in total to learn from. This level-based learning strategy encourages more exploration among individuals in lower levels and more exploitation among those in higher levels.

A self-adaptive method for determining the number of levels
The number of levels NL has an important impact on the performance of feature selection. Yang et al. (Yang et al. 2017) argue that different datasets call for different numbers of levels, and they proposed a dynamic version of LLSO (DLLSO) to determine the number of levels. DLLSO works as follows. First, a set of distinct integers representing the candidate numbers of levels is given as input. Then, at each generation, LLSO selects an integer from this set according to their probabilities, and the performance of LLSO with this level number is recorded at the end of the generation. The greater the improvement in the fitness value of the current optimal solution, the more likely this integer is chosen as NL in the next iteration.
Determining the number of levels as DLLSO does requires a predefined set of candidate integers for the given problem, which is hard to specify successfully in advance. Therefore, a self-adaptive method to determine NL is proposed in this study. The basic idea of this method is to use the diversity of the population, i.e., the similarity of individuals within the population, to derive NL. The details are as follows: (1) The distance between individuals is used to measure their similarity. In feature selection, the value of each dimension of an individual is 0 or 1, so the Hamming distance is used, which is defined as follows:

H_{i,j} = Σ_{d=1}^{D} |x_{i,d} - x_{j,d}|    (4)

where H_{i,j} is the Hamming distance between x_i and x_j, x_i and x_j are two different individuals in the population, and D is the dimensionality of the search space. Then the average diversity Ave_t of the population in the t-th iteration is calculated as Eq. (5):

Ave_t = mean(H_{i,j}),  1 ≤ i < j ≤ NP    (5)

where mean() denotes the mean function and NP is the population size.
(2) Due to mutual learning, the individuals in the population tend to become similar during evolution, which decreases the average diversity of the population. To avoid a sharp decline in exploration performance in early iterations, we divide the population into more levels: this enhances the diversity of level selection and, since each level then contains only a few individuals, reduces the probability of the individuals in dominant levels being learned from, which prevents a sharp decline in population diversity. In later iterations, by contrast, exploitation becomes more important, and increasing the probability of dominant individuals being learned from helps the colony exploit the search space more intensively. Therefore, the decline rate of the average diversity is used to determine NL in the current iteration:

NL = min(NP, max(2, fix(temp)))

where the lower and upper bounds of NL are 2 and NP, respectively, and fix(temp) returns the largest integer not greater than temp. temp is calculated from the ratio of the current to the initial average diversity:

temp = NP × (Ave_t / Ave_0)

where Ave_t is the average diversity of the population at the t-th iteration, and Ave_0 is the average diversity after initialization. Compared with DLLSO, which requires a predefined set, our method of determining NL from the average diversity of the population is simple and easy to implement. It not only saves a great deal of parameter-tuning work but also balances exploration and exploitation well. The experiments in Sect. 5 will show the comparative results of the two methods.
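A sketch of this diversity-driven rule in Python (the exact scaling used to obtain temp is our reading of the rule described above, not a verbatim reproduction of the paper's formula):

```python
def hamming(x_i, x_j):
    # Eq. (4): number of positions at which two binary individuals differ
    return sum(a != b for a, b in zip(x_i, x_j))

def average_diversity(population):
    # Eq. (5): mean pairwise Hamming distance over the whole population
    NP = len(population)
    pairs = [hamming(population[i], population[j])
             for i in range(NP) for j in range(i + 1, NP)]
    return sum(pairs) / len(pairs)

def number_of_levels(ave_t, ave_0, NP):
    # NL shrinks as diversity declines, clipped to the bounds [2, NP];
    # scaling temp as NP * (ave_t / ave_0) is our assumption
    temp = NP * (ave_t / ave_0)
    return max(2, min(NP, int(temp)))
```

At initialization (Ave_t = Ave_0) this yields the maximum NP levels; as the population converges and diversity falls, NL drops toward its lower bound of 2.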

A new mechanism for updating the individuals
In the ABC algorithm, the update mechanism is very important to the performance of the algorithm. If a food source is not updated for a long time in the employed bee or onlooker bee phase, it is abandoned and reinitialized to produce a new food source, which reduces the convergence speed of the algorithm. Meanwhile, FS mainly involves two objectives simultaneously: minimizing the classification error rate and minimizing the number of selected features. However, most feature selection algorithms do not take into consideration that individuals with the same error rate may select different features. On the one hand, two individuals with the same fitness value may select different numbers of features. For example, given two individuals x_1 = [1 0 0 1 1] and x_2 = [1 1 1 1 1] with the same fitness value, x_2 selects five features, more than x_1. If x_2 is chosen to instruct the current individual to update, it increases the dimension of the solution, which is not conducive to minimizing the number of selected features.
On the other hand, two individuals with both the same error rate and the same number of selected features may select different features. For example, given two individuals x_3 = [0 1 0 1 1] and x_4 = [1 1 1 0 0] with the same fitness value and the same number of selected features, they nevertheless select different features: x_3 selects the second, fourth and fifth features, while x_4 selects the first, second and third features.
Therefore, a new update mechanism is proposed to increase the update frequency of food sources and minimize the number of selected features. Let sum(x) denote the number of features selected by individual x, and fitness(x) the fitness value of the individual (the smaller the better). Individuals are then updated in the following way:

x_{t+1} = x'_t, if fitness(x'_t) < fitness(x_t), or fitness(x'_t) = fitness(x_t) and sum(x'_t) ≤ sum(x_t); otherwise x_{t+1} = x_t

where x_t and x'_t denote the current individual and its offspring (i.e., the candidate individual), respectively, and x_{t+1} denotes the individual entering the next iteration. To give readers a clearer understanding of this mechanism, we give an example in Fig. 2, which shows nine cases of whether an individual is updated or not. The first three cases indicate that if the fitness value of x_t is larger than that of x'_t, x_t is replaced by x'_t regardless of the number of features it selects. Cases 4-6 show that if x_t has the same fitness value as x'_t and selects the same number of features as, or more features than, x'_t, it is also replaced by x'_t; otherwise it enters the next iteration directly. In the last three cases, x_t, whose fitness value is smaller than that of x'_t, enters the next iteration without update. The impact of the new update mechanism on the performance of the algorithm will be analyzed in the section on experimental studies.
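In code, the survivor selection between a parent and its offspring can be written as follows (a sketch with our own names; `fitness` and `num_selected` stand for fitness() and sum() in the text):

```python
def survivor(parent, offspring, fitness, num_selected):
    # the offspring replaces the parent when its fitness (error rate) is
    # strictly better, or when fitness ties and it selects no more features
    if fitness(offspring) < fitness(parent):
        return offspring
    if fitness(offspring) == fitness(parent) and \
            num_selected(offspring) <= num_selected(parent):
        return offspring
    return parent
```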

A self-adaptive level-based learning artificial bee colony algorithm
Notably, the ABC algorithm was originally proposed for continuous optimization. To make it suitable for feature selection, we introduce a transfer function to convert the population from continuous values into discrete values. This idea has been shown to be effective in several feature selection methods (Ghamisi et al. 2014; Mafarja et al. 2018; Mohammadi and Abadeh 2014; Zhang et al. 2015). The transfer function is given as follows:

x_{i,d} = 1 if rand < sigmoid(x_{i,d}), and x_{i,d} = 0 otherwise

where x_{i,d} is the position of the i-th bee in dimension d and rand is a random number in [0, 1]. The function sigmoid() is formulated as

sigmoid(x_{i,d}) = 1 / (1 + e^{-x_{i,d}})

We combine the novel level-based learning mechanism, the self-adaptive method for determining the number of levels and the new update mechanism to develop an improved artificial bee colony algorithm, namely the self-adaptive level-based learning artificial bee colony (SLLABC) algorithm for feature selection. The pseudo-code of the SLLABC algorithm is shown in Algorithm 1.
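The transfer step can be sketched as follows (Python, function names ours):

```python
import math
import random

def sigmoid(v):
    # S-shaped transfer function
    return 1.0 / (1.0 + math.exp(-v))

def binarize(x_continuous):
    # dimension d becomes 1 with probability sigmoid(x_d), else 0
    return [1 if random.random() < sigmoid(v) else 0 for v in x_continuous]
```

Large positive positions are thus mapped to 1 with high probability, large negative positions to 0, and positions near zero to either value with roughly equal chance.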

Complexity analysis
According to Algorithm 1, SLLABC can be divided into four phases: initialization, employed bee phase, onlooker bee phase and scout bee phase. The time complexity of SLLABC is analyzed as follows.
(4) The scout bee phase includes two parts: initialization and evaluation. The computational complexities of initialization and evaluation are both O(NP) at worst.
Note that, for evaluating the individuals, the above analysis only counts the number of individuals evaluated in each iteration. Overall, SLLABC takes only an extra O(NP × log(NP) + NP) per iteration compared with the basic ABC algorithm, which takes O(NP × D) per iteration.
As for the space complexity, SLLABC requires the same space as the basic ABC, i.e., O(NP × D). Therefore, the computational complexity of SLLABC is competitive with that of the basic ABC algorithm.

Experimental design
To verify the performance of the proposed feature selection algorithm, a series of experiments is conducted on a total of 12 standard datasets. These datasets, together with their numbers of features, instances and classes, are listed in Table 1. They were obtained from http://featureselection.asu.edu/datasets.php and http://archive.ics.uci.edu/ml/datasets.php. Among them are microarray gene expression data, image (face) detection data and email text data, etc. These datasets were processed by the providers in advance. In addition, not only do they come from different application fields, but the number of features varies from 310 to 22,283, the number of instances from 62 to 165, and the number of classes from 2 to 15, which makes our experiments broadly representative.
A suitable classifier is important for evaluating the feature subsets. Because of its effectiveness, the well-known K-Nearest Neighbor (KNN) classifier (Liao and Vemuri 2002) is adopted to evaluate the performance of all algorithms in our experimental studies, with K = 5. To reduce the risk of overfitting, the average classification error rate under tenfold cross-validation is taken as the fitness value. The fitness function is as follows:

fitness = (1/10) Σ_{i=1}^{10} error_i    (10)

where error_i is calculated as

error_i = (number of misclassified instances in the i-th fold) / (number of test instances in the i-th fold)

The proposed algorithm and the compared algorithms are implemented in MATLAB. The population size of all algorithms is 50, and the maximum number of iterations is 100. For fair comparison, each algorithm runs 10 times independently. All experiments are executed on a computer with an Intel(R) Core (TM) i5-7500 CPU and 16 GB RAM.
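For illustration, this fitness evaluation (k-fold cross-validated KNN error rate) can be sketched in Python with a plain nearest-neighbor vote; the paper's experiments use MATLAB with K = 5 and tenfold cross-validation, and the round-robin fold assignment below is a simplification of our own:

```python
def knn_predict(train_X, train_y, x, k=5):
    # majority vote among the k nearest training points (squared Euclidean)
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def cv_error(X, y, folds=10, k=5):
    # Eq. (10)-style fitness: mean classification error rate over the folds
    n = len(X)
    fold_errors = []
    for f in range(folds):
        test_idx = set(range(f, n, folds))    # simplified round-robin folds
        tr_X = [X[i] for i in range(n) if i not in test_idx]
        tr_y = [y[i] for i in range(n) if i not in test_idx]
        wrong = sum(knn_predict(tr_X, tr_y, X[i], k) != y[i]
                    for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / folds
```

In a wrapper FS loop, `X` would be restricted to the columns selected by the candidate individual before calling `cv_error`.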

Experimental results and discussions
As previously discussed, this paper employed three new strategies. This section performs an extensive analysis on these three new strategies. After that, the proposed SLLABC algorithm is compared with the other eight algorithms for feature selection to evaluate its comprehensive performance.

Discussion on the diversity of two level-based learning mechanisms and its influence
First of all, we investigate the efficacy of letting the individuals in level 1 learn from each other, comparing the diversity curves, convergence curves and average feature subset sizes of the two mechanisms, denoted SLLABC and SLLABC-1, respectively. In SLLABC-1, the individuals in level 1 are passed directly to the next iteration without updating, while in SLLABC they learn from each other within level 1 to update. Figure 3 shows the diversity curves of SLLABC and SLLABC-1. To avoid contingency, we run these two algorithms 10 times independently and report the mean diversity in each iteration.
In terms of the value of diversity, the diversity of SLLABC is higher than that of SLLABC-1 on the LSVT, Yale and Colon datasets in the later iterations, whereas it is lower on the Leukemia1 dataset. On most datasets, the diversity of SLLABC is not significantly different from that of SLLABC-1. In terms of the coincidence of the curves, the diversity curves of SLLABC and SLLABC-1 almost coincide on all datasets in the first 10 iterations. In addition, for datasets with a small number of features, such as LSVT, Colon and Leukemia1, the curves of the two algorithms differ significantly, while there is almost no difference on higher-dimensional datasets such as ALLAML, GLI_85 and Pixraw10P. The reason for this phenomenon may be that the individuals in level 1 form only a very small part of the whole population, especially in high-dimensional feature selection, and in late iterations, when the diversity of the population is very low, individuals learning from each other does not play an important role in diversity. Figure 4 shows the feature subset sizes of SLLABC and SLLABC-1. We can see that SLLABC selected fewer features than SLLABC-1 on 6 datasets; on the other 6 datasets, i.e., LSVT, SRBCT, Leukemia1, ALLAML, Pixraw10P and GLI_85, SLLABC selected more features than SLLABC-1.
The convergence curves of the two algorithms are plotted in Fig. 5, which shows the decline of the error rate. On most datasets, the effect of SLLABC on reducing the error rate in the early iterations is not obvious, but it always achieves a lower error rate than SLLABC-1 in the middle and later iterations. Notably, although SLLABC-1 obtained smaller feature subsets on the six datasets mentioned above in Fig. 4, its corresponding error rates in Fig. 5 are no better than those of SLLABC. This may be because the individuals in level 1 account for only a small proportion of the whole population at the early stage of the evolution: learning from each other cannot enhance the diversity significantly, but it can help the individuals in level 1 jump out of local optima, thereby improving the possibility of achieving a lower error rate.
It is generally accepted that the exploration and exploitation abilities of EAs are reflected by the diversity of the population: the lower the diversity, the less the exploration potential of the individuals and the greater their exploitation potential, and vice versa. The error rate convergence curves verify this view as well. Therefore, according to the discussion above, we conclude that the mechanism of individuals learning from each other in the first level does not play an important role in improving the exploration ability, but it can enhance the exploitation performance of the population in most cases.
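The diversity measure discussed above can be sketched as follows. The paper's exact formula is not reproduced in this section, so the definition below (mean Euclidean distance of individuals from the population centroid) is a common stand-in, shown purely for illustration.

```python
import numpy as np

def population_diversity(population):
    """Diversity as the mean Euclidean distance of individuals from the
    population centroid -- one common definition; the paper's exact
    formula may differ, so treat this as an illustrative sketch."""
    pop = np.asarray(population, dtype=float)
    centroid = pop.mean(axis=0)  # component-wise mean position
    return float(np.mean(np.linalg.norm(pop - centroid, axis=1)))
```

Under this definition, a fully converged population has diversity 0, and the value grows as individuals spread over the search space, which is the behavior the curves in Figs. 3 and 6 plot over iterations.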

Discussion on the diversity of two methods of determining level number and its influence
Secondly, in order to study the effect of the self-adaptive method for determining the number of levels, Figs. 6, 7, 8 compare SLLABC with SLLABC-2, where the former uses the self-adaptive level-based learning method and the latter uses the original dynamic level-based learning (DLL) method. The DLL method requires a predefined set of level numbers S. To study the influence of different S on classification performance, this study uses two sets: the empirical values S1 = [4, 6, 8, 10, 20, 50] obtained by Yang, and S2 = [5, 10, 20, 30, 40, 50], which is listed at random. The algorithms using S1 and S2 are referred to here as SLLABC-2-1 and SLLABC-2-2, respectively; note that SLLABC-2 covers both SLLABC-2-1 and SLLABC-2-2. As shown in Fig. 6, in the first 10 iterations the diversity curves of the three algorithms do not differ significantly on any dataset. However, in the middle and at the end of the run, the diversity curves of SLLABC are lower than those of SLLABC-2 on Yale, SRBCT, DLBCL and GLI_85. In addition, there are significant differences between the diversity curves of SLLABC-2-1 and SLLABC-2-2, especially on the Yale, ALLAML and Prostate datasets. Figure 7 shows the number of features selected by the three algorithms. The feature subset of SLLABC is smaller than that of SLLABC-2-1 on six datasets, and smaller than that of SLLABC-2-2 on five datasets. Only on a few datasets, i.e., Leukemia1 and ALLAML, is the feature subset of SLLABC larger than that of both SLLABC-2-1 and SLLABC-2-2. Figure 8 plots the decline of the error rate for the three algorithms. The decline rates of the three methods are similar in the early stage; however, the convergence of SLLABC-2-1 and SLLABC-2-2 is slower than that of SLLABC in the middle and at the end of the run on most datasets, which indicates that the self-adaptive strategy for NL has a better exploitation ability.
It is also worth mentioning that, on the LSVT, ALLAML, Pixraw10P and Prostate datasets, the final error rate of SLLABC is lower than that of SLLABC-2-1 but higher than that of SLLABC-2-2, and on Leukemia2 the final error rate of SLLABC-2-2 is the lowest of the three. Comparing the curves in Figs. 6 and 8, we find that a lower diversity does not necessarily result in a higher error rate. On the contrary, SLLABC achieves both a lower diversity in Fig. 6 and a lower error rate in Fig. 8, which demonstrates that the exploration of SLLABC-2-1 and SLLABC-2-2 is overemphasized, resulting in poor exploitation, while SLLABC compromises exploration and exploitation better than either. In addition, SLLABC-2-1 and SLLABC-2-2, which use different S, also differ greatly in the number of features and the error rate. This indicates that determining S in the DLL strategy is very important and may require separate experiments for different problems. Fortunately, the SLLABC algorithm using the adaptive strategy proposed in this paper outperforms the two algorithms using the original dynamic strategy in most cases. All in all, the self-adaptive strategy for NL is promising for the proposed algorithm.
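As a minimal sketch of how a self-adaptive strategy can map diversity to the number of levels NL, the following assumes a diversity value normalized to [0, 1] and a linear mapping between hypothetical bounds `nl_min` and `nl_max`; the paper's exact rule may differ, but the direction matches the design described above (lower diversity, fewer levels).

```python
def adaptive_level_number(diversity, nl_min=4, nl_max=50):
    """Map population diversity (assumed normalized to [0, 1]) to a level
    count NL: the lower the diversity, the fewer the levels. The linear
    mapping and the bounds nl_min/nl_max are illustrative assumptions,
    not the paper's exact rule."""
    d = min(max(diversity, 0.0), 1.0)  # clamp to [0, 1]
    return nl_min + round(d * (nl_max - nl_min))
```

This removes the need to hand-pick a set S such as S1 or S2: NL follows the measured state of the population instead of a fixed schedule.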

Discussion on the diversity of two update mechanisms and its influence
To investigate the influence of the new update mechanism on the performance of feature selection, SLLABC is compared with SLLABC-3, where the former adopts the new update mechanism and the latter adopts the original update method, i.e., the size of the feature subset is not taken into account. As can be seen in Fig. 9, SLLABC-3 maintains a higher diversity than SLLABC on each dataset, and its diversity value remains high. It is worth mentioning that at the end of the run, the diversity values of SLLABC-3 on the first three datasets are no less than 0.4, which is not conducive to the fast convergence of the algorithm. This is verified by the convergence curves in Fig. 10, where SLLABC achieves a lower or equal error rate at the end of the run on most datasets. The only exception is the Pixraw10P dataset, which might be due to the internal attributes and samples of that dataset; we cannot confirm the cause at present.
After that, to compare the two update mechanisms further, we draw a histogram of the average feature subset size, shown in Fig. 11. SLLABC-3 selects fewer features than SLLABC on only two datasets, i.e., DBWorld and ALLAML. Therefore, the new update mechanism is promising for dimensionality reduction.
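The new update mechanism contrasted here can be sketched as below; the `Bee` container is an illustrative stand-in for the paper's solution representation, but the replacement rule itself follows the mechanism as stated (an offspring is discarded only when its error rate is no lower than the parent's and it selects more features).

```python
from collections import namedtuple

# A solution summarized by its error rate and selected-feature count;
# `Bee` is an illustrative container, not the paper's data structure.
Bee = namedtuple('Bee', ['error', 'n_features'])

def select_survivor(parent, offspring):
    """New update rule: discard the offspring only when its error rate
    is no lower than the parent's AND it selects more features;
    otherwise the offspring replaces its parent."""
    if offspring.error >= parent.error and offspring.n_features > parent.n_features:
        return parent
    return offspring
```

Compared with a purely error-driven greedy selection (as in SLLABC-3), ties in error rate are broken in favor of smaller feature subsets, which is what drives the reduction seen in Fig. 11.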

Comparison of the performance between SLLABC and other algorithms
In this section, we further investigate the performance of the SLLABC algorithm by comparing it with eight feature selection methods, including CSO (Gu et al. 2018) and AC_ABC (Shunmugapriya and Kanmani 2017), among others. In particular, CSO, VS_CCPSO and ALO_GWO have performed well on high-dimensional problems. For the compared algorithms, the parameters are set as recommended in the corresponding papers; Table 2 gives the parameter settings. First of all, we plot the convergence curves of the nine algorithms, as shown in Fig. 12. Evidently, in terms of convergence behavior, AC_ABC, MbGWO-SFS, CSO, HLBDA and FRGA show unsatisfactory performance, while the SLLABC algorithm quickly detects the global best subset over the complex feature space. Even though the ALO_GWO algorithm shows a very fast convergence speed in the early iterations, it starts to decelerate after 20 iterations. In contrast, SLLABC maintains a fast convergence over the course of the iterations and always converges to a smaller error rate than any other algorithm at the end of each run. Moreover, on the LSVT, Leukemia1, DLBCL, ALLAML, Prostate, Leukemia2 and GLI-85 datasets, SLLABC still shows a downward trend at the 100th iteration. In other words, the SLLABC algorithm not only preserves good exploration and exploitation abilities but also compromises the two well while searching the space during the evolution. Table 3 shows the worst, best, mean and standard deviation of the error rate obtained by SLLABC and the compared algorithms, with the best result on each dataset shown in bold.
Subsequently, the Wilcoxon rank sum test (Wilcoxon 1992) is adopted to compare the results obtained by the SLLABC algorithm and the compared algorithms at a significance level of 0.05. The result is given in Table 4, wherein the symbol '+' denotes that the SLLABC algorithm significantly outperforms the compared algorithm, while '−' indicates the opposite. On some datasets, the compared algorithms perform similarly to SLLABC, which is marked as '='.
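For reference, the per-dataset comparison behind Table 4 can be sketched as follows, using a stdlib-only rank-sum test with the normal approximation (and without the tie correction to the variance). This is an illustrative re-implementation, not the authors' statistics code.

```python
import math

def wilcoxon_rank_sum(x, y, alpha=0.05):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.
    Returns '+', '-', or '=' following the Table 4 convention: '+' means
    the first sample (e.g. SLLABC error rates) is significantly lower,
    '-' significantly higher, '=' no significant difference at alpha.
    Stdlib-only sketch; omits the tie correction to the variance."""
    n1, n2 = len(x), len(y)
    data = sorted(list(x) + list(y))
    rank = {}
    i = 0
    while i < len(data):                 # average 1-based ranks for ties
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        rank[data[i]] = (i + j + 1) / 2.0
        i = j
    w = sum(rank[v] for v in x)          # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2.0        # mean of W under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2)) # two-sided p-value
    if p >= alpha:
        return '='
    return '+' if z < 0 else '-'
```

With a mature statistics stack available, `scipy.stats.ranksums` computes the same statistic directly.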
The experimental results in Table 3 show that the VS_CCPSO algorithm achieves a lower error rate than the other algorithms on datasets with a small number of features, such as LSVT and Yale, while SLLABC outperforms the other algorithms on datasets with a large number of features: it achieves the minimum mean, minimum best and minimum worst error rate on ten, ten and eight datasets, respectively. In addition, SLLABC performs more stably than the state-of-the-art algorithms, as supported by its smaller standard deviation values.
The result of the Wilcoxon rank sum test indicates that the error rate of the SLLABC algorithm is significantly lower than that of most algorithms on most datasets. Specifically, SLLABC underperforms VS_CCPSO on datasets with a small number of features, such as LSVT and Yale, but is superior or comparable to VS_CCPSO on higher-dimensional datasets. The difference in error rate between SLLABC and ALO_GWO is insignificant on all datasets except DLBCL. In summary, the SLLABC algorithm achieves high classification accuracy for feature selection, especially on very high-dimensional datasets.
Since the number of selected features determines the computational cost of a classification algorithm, it is also a key performance index of feature selection. We compare the feature subset size of the SLLABC algorithm with that of the other algorithms. Table 5 shows the maximum (worst), mean, minimum (best) and standard deviation (std) of the number of features selected by SLLABC and the other algorithms. Inspecting the results, VS_CCPSO is outstanding on LSVT, Yale and SRBCT: combined with Table 3, it obtains lower error rates and fewer features on these datasets, but its performance on very high-dimensional datasets is not as good as SLLABC's. 2D_UPSO obtains the smallest feature subsets on higher-dimensional datasets, but its standard deviation on each dataset is large, which indicates that 2D_UPSO is unstable in reducing the feature size. The SLLABC algorithm does not obtain the smallest feature subset on every dataset; however, it retains fewer features on most datasets, and its standard deviation is not large. On the whole, SLLABC performs well on dimensionality reduction for very high-dimensional feature selection. Table 6 shows the results of the Wilcoxon rank sum test on feature subset size. The dimensionality reduction achieved by the SLLABC algorithm is substantially better than that of the other algorithms on most datasets. However, when a dataset contains only the key features that must be used by classifiers, removing any one of them may increase the error rate. Although 2D_UPSO, MbGWO-SFS and HLBDA can reduce the subset size, their classification accuracy is not improved. In contrast, the features selected by SLLABC usually provide key information, which enhances the classification performance.
In summary, these results confirm that the proposed SLLABC is more capable of selecting significant features than the state-of-the-art algorithms on high-dimensional datasets.
For a thorough evaluation of the proposed SLLABC algorithm, not only the accuracy and the size of the feature subsets but also the computational complexity needs to be investigated. To provide a more intuitive representation of time consumption, this paper measures the CPU execution times of SLLABC and the other algorithms under the same physical conditions. Table 7 summarizes the worst (longest), mean and best (shortest) CPU execution times (in seconds) of SLLABC and the compared algorithms, and Table 8 shows the result of the Wilcoxon rank sum test on execution time.
As can be seen from Table 7, the SLLABC algorithm performs well on very high-dimensional datasets, taking less execution time, but is inferior to CSO on datasets with a small number of features. Interestingly, as the dimension of the datasets increases, the advantage of SLLABC in time consumption becomes more and more obvious. Combining Tables 7 and 8, the SLLABC algorithm consumes less time than most algorithms while obtaining higher accuracy and fewer features on the very high-dimensional datasets. Therefore, our proposed SLLABC algorithm is a valuable feature selection tool that can be applied to other real-world applications.

Conclusions
This paper proposed a Self-adaptive Level-based Learning Artificial Bee Colony (SLLABC) algorithm to deal with the feature selection problem in high-dimensional classification.
First of all, we described a novel level-based learning (LL) mechanism for the ABC algorithm in detail. In the basic ABC algorithm, the current individual learns from an individual selected randomly from the whole population; after introducing the level-based learning mechanism, it instead learns from a better individual, and the individuals in the first level learn from each other. This mechanism enhances the exploitation of the ABC algorithm and enables it to reach the optimal solution more quickly.
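The exemplar-selection part of this mechanism can be sketched as below. The uniformly random choice among higher levels is an assumption of this sketch, and the actual ABC position-update equation (how the bee moves toward its exemplar) is defined in the paper and not reproduced here.

```python
import random

def pick_exemplar(levels, k, current):
    """Choose a learning exemplar for a bee in level k (1-based; level 1
    holds the best individuals). Bees in levels k > 1 learn from a bee
    in a randomly chosen higher level; bees in level 1 learn from each
    other. An illustrative sketch of the selection rule only."""
    if k > 1:
        higher = levels[random.randrange(k - 1)]  # one of levels 1..k-1
        return random.choice(higher)
    peers = [b for b in levels[0] if b != current]  # level 1: a peer, not itself
    return random.choice(peers)
```

This contrasts with the basic ABC neighbor choice, which draws uniformly from the whole population regardless of quality.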
Secondly, a self-adaptive method was proposed to determine the number of levels. Compared with the dynamic method, our new method adaptively adjusts the level number according to the average diversity of the population instead of relying on artificial empirical values. The experimental results show that the self-adaptive method improves the exploitation ability and the dimensionality reduction of the ABC algorithm. Furthermore, to improve the performance of the ABC algorithm, we proposed a new update mechanism: if a candidate individual has the same accuracy as the current individual and its number of selected features is the same or smaller, then it replaces the current individual in the next iteration. This strategy not only effectively reduces the number of selected features but also increases the update frequency of individuals, enhancing the exploration ability of the ABC algorithm.
Finally, the proposed SLLABC algorithm was compared with SLLABC-1, SLLABC-2 and SLLABC-3, and the results show that SLLABC can effectively balance exploration and exploitation during the evolution. We further compared SLLABC with eight state-of-the-art algorithms in terms of classification error rate, subset size and execution time. The results corroborate that the proposed SLLABC is indeed a competitive algorithm for feature selection problems, especially on high-dimensional data.
Moreover, it is worth mentioning that the novel level-based learning mechanism and the new update mechanism proposed in this paper are applicable to other NP-hard problems. Like most EAs, however, SLLABC has a high computational cost due to its random search and repeated evaluations. Therefore, how to reduce the execution time by using sample reduction strategies or parallelization is one direction for future study. Additionally, since minimizing the size of the feature subset and maximizing the classification accuracy are both important indicators in feature selection, formulating feature selection as a multi-objective combinatorial optimization problem to meet the various requirements of decision-makers is another direction for our future work.