Discrete Equilibrium Optimizer Combined with Simulated Annealing for Feature Selection

This paper proposes a binary adaptation of the recently proposed meta-heuristic Equilibrium Optimizer (EO), called Discrete EO (DEO), to solve binary optimization problems. A U-shaped transfer function is used to map the continuous values of EO into the binary domain. To further improve the exploitation capability of DEO, Simulated Annealing (SA) is used as a local search procedure, and the combination is named DEOSA. The proposed DEOSA algorithm has been applied over 18 well-known UCI datasets and compared with a wide range of algorithms. The results have been statistically validated using the Wilcoxon rank-sum test. In order to test the scalability and robustness of DEOSA, it has additionally been tested over 7 high-dimensional Microarray datasets and 25 binary Knapsack problems. The results clearly demonstrate the superiority and merits of DEOSA when solving binary optimization problems.


Introduction
With the invention of modern hi-tech sophisticated devices, it has become possible to generate tons of data within a very short interval. However, there is a two-fold question: do we need the entire set of data to retrieve some useful information, and is it always feasible to analyze such a huge quantity of data?
Statisticians may help answer these questions. According to Petrov (2020), there will be around 180 zettabytes of data by the end of 2025, which is a staggering number, and 90% of today's data has been generated in the last two years.
As a matter of fact, only 0.5% of data had been analyzed as of 2012, and this percentage has likely decreased further in recent times. One of the reasons for this small percentage is the lack of adequate tools to analyze such an enormous quantity of data. This has increased the need for data processing, and one important technique in this regard is dimension reduction Chizi & Maimon (2005). The goal of dimension reduction is to extract meaningful information from a large dataset while reducing the number of dimensions of the dataset under consideration.
One of the most sought-after dimension reduction procedures is feature selection (FS) Ghosh et al. (2018a). The objective of FS is to search for the most useful and relevant subset of features from the entire feature vector. It is a difficult research problem whose difficulty grows exponentially with the number of features in the feature set. For a feature set with n features, a total of $2^n$ feature combinations are possible, and selecting one combination out of these $2^n$ becomes extremely complex as n increases. FS algorithms therefore try to find a sufficiently good and acceptable solution within a suitable time bound. FS procedures can be categorized into two variants: wrapper and filter. Filter methods rely on statistical properties of the data to find the most informative and appropriate subset of features. Wrapper methods, on the other hand, use learning algorithms (e.g. classifiers Allan (1977)) to evaluate candidate feature subsets and guide the next round of search according to the evaluation outcomes. Although wrapper methods are more resource intensive than filter methods, since they repeatedly invoke a learning algorithm, they are able to find superior solutions in comparison to filter methods. To improve the performance of such algorithms, a recent trend is to blend multiple algorithms so as to combine the advantages of the individual algorithms. Such hybrid algorithms tend to perform better by improving their exploration and exploitation ability (through the inclusion of local or global search techniques). Some researchers also combine filter and wrapper approaches to reach a better solution by exploiting the advantages of both models. These models are known as embedded models Ghosh et al. (2019a); Guyon et al. (2005).
The quality of such models depends heavily on how the wrapper and filter parts interact with each other, which requires an extra level of tuning and leads to higher computational complexity.
The wide use and superiority of hybrid methods in the domain of FS have motivated us to develop a new hybrid procedure that can efficiently search the binary search space. In this paper, we address the FS problem using a hybrid variant of a recently proposed optimization algorithm known as the Equilibrium Optimizer (EO) Faramarzi et al. (2020). EO mimics the control volume mass balance process used to predict equilibrium and dynamic states, where the equilibrium state is treated as the optimal solution to the optimization problem. FS can also be considered an optimization problem, where the goal is to find an optimal feature subset subject to constraints such as high classification accuracy and a low number of features.
This intuitive similarity between FS and optimization problems has motivated us to modify EO and apply it to solve FS problems. The highlights of this paper are listed below:
1. Modification of EO by mapping continuous values into the binary domain using a U-shaped transfer function, thereby making it applicable to binary optimization problems.
2. Enhancement of EO's exploitative abilities through the use of Simulated Annealing (SA). The hybrid version is named DEOSA.
3. Application of DEOSA over 18 well-known UCI datasets, comparison with a wide range of algorithms, and statistical validation of the results using the Wilcoxon rank-sum test.
4. Additional experimentation with DEOSA over 7 high-dimensional Microarray datasets and 25 binary Knapsack datasets in order to test the robustness and scalability of the proposed model.
The remainder of the paper is structured as follows: Section 2 provides a detailed study of similar kinds of work carried out by different researchers across the globe. Section 3 provides a comprehensive illustration of the proposed FS model. The experimental outcomes of DEOSA over the UCI datasets are explained in Section 4, while the outcomes of the additional experiments are reported in Section 5. Finally, Section 6 concludes our work and provides directions for future extension of this work.

Literature Study
It is quite interesting how solutions to many complex optimization problems lie hidden in nature, waiting to be discovered. Throughout the years, researchers have proposed various algorithms to solve FS problems, inspired by natural phenomena. From the concept of chromosome formation in the Genetic Algorithm (GA) to the concept of control volume mass balance in EO, a large number of algorithms are derived from simple natural occurrences.
The journey of nature-inspired FS algorithms started with GA Leardi (1996).
Leardi et al. introduced GA to the domain of FS by using 0 and 1 to denote the absence and presence of a feature (variable) in a chromosome (agent). GA is one of the most popular evolutionary algorithms; it mimics the procedure of child chromosome formation using the concepts of chromosome crossover and mutation. While crossover in GA aims to improve exploration, mutation perturbs the solution to bring exploitation into the search space.
Due to its simplicity, GA has been widely used in various optimization and FS problems. In search of a proper trade-off between exploration and exploitation, researchers have proposed different variants of GA. In Ma & Xia (2017), Ma et al. divided the chromosome population into different tribes and introduced tribe competition to support the evolution process. A modified GA (MGA) has been proposed in Jiang et al. (2017), where the authors updated certain operations of GA to provide guidance to the chromosomes; for example, in place of the typical crossover, they proposed a crossover procedure guided by the objective measures of the parent chromosomes. MGA has been used to perform FS before demand forecasting in an outpatient department. Apart from these recent additions, there are numerous other works on GA, which can be found in Yang & Honavar (1998); Huang et al. (2007); Oh et al.; Moradi & Gholampour (2016). Other popular nature-inspired approaches include Ant Colony Optimization (ACO) and the Gravitational Search Algorithm (GSA). ACO is based on the food-searching process followed by ants in nature and was introduced by Dorigo et al. in Dorigo & Di Caro (1999). Rashedi et al. developed GSA Rashedi et al. (2009), which is inspired by mass interactions and the law of gravity.
All these algorithms are very popular and highly used in the FS domain.
As per the No-Free-Lunch theorem of optimization, however, all optimization procedures behave equally well when they are averaged over all possible problems Wolpert et al. (1995). Intuitively, this means that there is no single algorithm that can solve all optimization problems perfectly. If algorithm A outperforms another algorithm B in some areas, there must be other areas where algorithm B outperforms algorithm A. That is why researchers continue to propose new algorithms to solve different optimization problems, including the FS problem.
One of the main challenges in any FS problem is to find a proper balance between exploitation and exploration, and many FS algorithms are being formulated on a regular basis to address this balance.
Even in recent times, researchers have found enormous inspiration in nature to formulate various other meta-heuristic algorithms for the FS problem. This has motivated us to modify EO and make it applicable to FS problems. In this paper, we map EO into the binary space of FS using a U-shaped transfer function and hybridize it with the exploitative capabilities of SA. The final hybrid model, termed DEOSA, has been applied over 18 well-known datasets taken from the UCI machine learning repository to prove its usefulness in the FS domain. In addition, it has been tested over 7 Microarray and 25 binary Knapsack datasets to prove its applicability to other binary optimization problems as well.

Equilibrium Optimization: An Overview
EO, first proposed in Faramarzi et al. (2020), is a physics-based optimization algorithm inspired by dynamic source and sink models used to estimate equilibrium states. The method is built on a simple dynamic mass balance over a control volume, in which a mass-balance equation describes the concentration of a nonreactive component in the control volume as a function of its various source and sink processes. The generic mass-balance equation is a first-order differential equation Nazaroff & Alvarez-Cohen (2001) given by Equation (1):

$V \frac{dC}{dt} = Q\,C_{eq} - Q\,C + G \qquad (1)$
where V is the control volume, C is the concentration inside the control volume, $C_{eq}$ is the concentration at the equilibrium state without generation inside the control volume, G is the mass generation rate inside the control volume, $V\frac{dC}{dt}$ is the rate of change of mass in the control volume, and Q is the volumetric flow rate into and out of the control volume. Equation (1) states that the change of mass over time equals the mass entering the system, plus the mass generated inside it, minus the mass leaving it. An equilibrium state is reached when $V\frac{dC}{dt}$ becomes zero. Rearranging Equation (1) gives Equation (2):

$\frac{dC}{\lambda C_{eq} - \lambda C + \frac{G}{V}} = dt \qquad (2)$

where $\lambda = \frac{Q}{V}$ denotes the turnover rate.
Integrating Equation (2) between an initial state $(t_0, C_0)$ and the current state $(t, C)$ yields:

$\int_{C_0}^{C} \frac{dC}{\lambda C_{eq} - \lambda C + \frac{G}{V}} = \int_{t_0}^{t} dt \qquad (3)$

where $t_0$ and $C_0$ are the initial time and the initial concentration, respectively. Equation (4) shows the result of solving Equation (3):

$C = C_{eq} + (C_0 - C_{eq})\,F + \frac{G}{\lambda V}(1 - F), \qquad F = e^{-\lambda (t - t_0)} \qquad (4)$
As shown in Equation (4), three terms constitute the updating rule for the concentration of a solution. The first term is the equilibrium concentration: one of the best solutions found up to the current iteration, selected at random from a pool called the equilibrium pool. The second term captures the concentration difference between a solution and the equilibrium state. The third term involves the generation rate. Mathematical expressions and explanations of these components are given in the following subsections.

Equilibrium Pool and Candidates
The equilibrium state is the state to which the algorithm converges, and it is expected to be the global optimum of the problem under consideration. At the beginning, the equilibrium concentration is not known to the algorithm; instead, the equilibrium candidates provide the particles with a search pattern. The four best-so-far particles, together with their arithmetic mean, constitute the equilibrium pool:

$\vec{C}_{eq,pool} = \{\vec{C}_{eq(1)}, \vec{C}_{eq(2)}, \vec{C}_{eq(3)}, \vec{C}_{eq(4)}, \vec{C}_{eq(avg)}\}$
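As a rough illustration, the pool construction can be sketched in Python as follows (a minimal sketch, not the authors' reference implementation; the function name and array shapes are assumptions, and in the full algorithm the four candidates are the best-so-far solutions tracked across iterations rather than the best of the current population only):

```python
import numpy as np

def build_equilibrium_pool(population, fitness):
    """Return the equilibrium pool: the 4 best particles plus their mean.

    population : (n_particles, n_dims) array of concentrations
    fitness    : (n_particles,) array of fitness values (lower is better)
    """
    best_idx = np.argsort(fitness)[:4]           # indices of the 4 best particles
    c_eq = [population[i].copy() for i in best_idx]
    c_eq.append(np.mean(c_eq, axis=0))           # fifth candidate: average of the four
    return c_eq                                  # [C_eq1, C_eq2, C_eq3, C_eq4, C_eq_avg]
```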

Exponential Term, F
The term F in Equation (4) helps EO balance exploration and exploitation. The turnover rate $\vec{\lambda}$ is taken as a random vector in $[0, 1]$, since the turnover rate generally varies with time in a real control volume, while t is a function of the iteration count:

$F = e^{-\vec{\lambda}(t - t_0)} \qquad (8)$

$t = \left(1 - \frac{iter}{maxIter}\right)^{\left(a_2 \frac{iter}{maxIter}\right)}$

where iter and maxIter are the current and maximum number of iterations, respectively, and $a_2$ is a constant; a higher $a_2$ implies lower exploration and better exploitation ability.
For guaranteed convergence, the search speed needs to be slowed down; to achieve this, $t_0$ is defined as:

$\vec{t_0} = \frac{1}{\vec{\lambda}} \ln\left(-a_1\, \mathrm{sgn}(\vec{r} - 0.5)\left[1 - e^{-\vec{\lambda} t}\right]\right) + t \qquad (9)$

where $a_1$ is a constant, and a higher $a_1$ value implies higher exploration ability; $\vec{r}$ is a random vector in $[0, 1]$, and the sgn function determines the direction of the search. As per Faramarzi et al. (2020), we set $a_1 = 2$ and $a_2 = 1$. By substituting Equation (9) into Equation (8), we obtain:

$\vec{F} = a_1\, \mathrm{sgn}(\vec{r} - 0.5)\left[e^{-\vec{\lambda} t} - 1\right]$
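A minimal Python sketch of the exponential term under these settings (the function and variable names are ours, not the authors'):

```python
import numpy as np

def exponential_term(iter_no, max_iter, dim, a1=2.0, a2=1.0):
    """Compute the exponential term F of EO for one particle (vectorised over dimensions)."""
    lam = np.random.rand(dim)                     # turnover rate lambda, random in [0, 1]
    r = np.random.rand(dim)                       # random vector in [0, 1]
    t = (1.0 - iter_no / max_iter) ** (a2 * iter_no / max_iter)
    # F = a1 * sgn(r - 0.5) * (exp(-lambda * t) - 1), the result of substituting t0
    return a1 * np.sign(r - 0.5) * (np.exp(-lam * t) - 1.0)
```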

Generation Rate, G
In EO, the generation rate G plays an important role in providing the optimum solution. G is calculated as:

$\vec{G} = \vec{G}_0\, \vec{F}$

$\vec{G}_0 = \overrightarrow{GCP}\,\left(\vec{C}_{eq} - \vec{\lambda}\,\vec{C}\right)$

$\overrightarrow{GCP} = \begin{cases} 0.5\, r_1, & r_2 \geq GP \\ 0, & r_2 < GP \end{cases} \qquad (13)$

where $\overrightarrow{GCP}$ is obtained by repeating the value given by Equation (13) for all dimensions, and $r_1, r_2$ are random numbers in $[0, 1]$. GCP is the Generation rate Control Parameter, which governs the probability of the generation term contributing to the concentration-updating process. GP is the Generation Probability, which determines how many particles use the last term of Equation (4) to update their states. GP = 0.5 is set to balance exploration and exploitation Faramarzi et al. (2020). Finally, the concentration-updating rule of the algorithm is given by:

$\vec{C} = \vec{C}_{eq} + \left(\vec{C} - \vec{C}_{eq}\right)\vec{F} + \frac{\vec{G}}{\vec{\lambda}\,V}\left(1 - \vec{F}\right) \qquad (14)$

where V is taken as unit, i.e. V = 1.
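Putting the pieces together, a single EO concentration update could look roughly like the following sketch (illustrative only; it assumes V = 1 and draws one equilibrium candidate per call):

```python
import numpy as np

def eo_update(C, C_eq, iter_no, max_iter, a1=2.0, a2=1.0, GP=0.5, V=1.0):
    """One EO concentration update (Equation (14)) for a single particle.

    C    : current concentration vector
    C_eq : equilibrium candidate drawn at random from the equilibrium pool
    """
    dim = C.shape[0]
    lam = np.random.rand(dim)                     # turnover rate lambda
    r = np.random.rand(dim)
    t = (1.0 - iter_no / max_iter) ** (a2 * iter_no / max_iter)
    F = a1 * np.sign(r - 0.5) * (np.exp(-lam * t) - 1.0)

    r1, r2 = np.random.rand(), np.random.rand()
    GCP = 0.5 * r1 if r2 >= GP else 0.0           # generation rate control parameter
    G0 = GCP * (C_eq - lam * C)
    G = G0 * F                                    # generation rate

    return C_eq + (C - C_eq) * F + (G / (lam * V)) * (1.0 - F)
```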

Simulated Annealing
In metallurgy and materials science, annealing van Laarhoven & Aarts (1987) is a heat treatment in which a solid is heated up to a maximum temperature, at which it melts, and is then cooled by slowly lowering the temperature. SA Kirkpatrick et al. (1983), inspired by this annealing process, is a single-solution based meta-heuristic algorithm and an enhanced version of hill climbing Ackley (1987). SA accepts a bad 'move' with a certain probability in order to avoid getting trapped in locally optimal solutions.
For a given solution (agent), a neighboring solution is generated Mafarja & Mirjalili (2017) and evaluated using an objective function. If the objective value of the neighbor is better than that of the current solution, the current solution is replaced with the neighbor. If the objective value of the neighbor is worse, the neighbor is accepted with a probability generated by the Boltzmann equation, $p = e^{-\theta/T_k}$. The acceptance probability function is therefore:

$p = \begin{cases} 1, & \text{if } \theta \leq 0 \\ e^{-\theta/T_k}, & \text{otherwise} \end{cases}$

Here, $\theta$ is the difference between the fitness of the neighboring solution and that of the current solution, and $T_k$ is the temperature at step k.

Proposed DEOSA Approach

In the continuous version of EO, the concentration of a solution is updated according to Equation (14). To obtain a binary version of EO, a transfer function Mirjalili & Lewis (2013) is required. Numerous transfer functions have been used in the FS domain to convert continuous optimization algorithms into binary ones. In this work, a U-shaped transfer function is used, whose applicability has been demonstrated in Mirjalili et al. (2020).
The U-shaped transfer function is provided in Equation (16):

$T(x) = \alpha\,|x|^{\beta} \qquad (16)$
where α and β are the two controlling parameters. α defines the slope and β controls the basin width of the transfer function. The utilized transfer function is depicted in Figure 1.
The concentration in the real domain is converted into a binary vector as per Equation (17), using the probability value generated by Equation (16):

$X_{t+1}^{d} = \begin{cases} \neg X_{t}^{d}, & \text{if } rno < T\!\left(X_{t}^{d}\right) \\ X_{t}^{d}, & \text{otherwise} \end{cases} \qquad (17)$
where $X_{t}^{d}$ is the $d$-th dimension of the concentration at iteration t, and rno is a random number in $[0, 1]$.
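A hedged sketch of the binarization step, assuming the complement-based rule of Equation (17) and illustrative values for α and β (the paper's exact parameter settings may differ):

```python
import numpy as np

def u_transfer(x, alpha=2.0, beta=1.5):
    """U-shaped transfer function T(x) = alpha * |x|**beta (alpha, beta values are illustrative)."""
    return alpha * np.abs(x) ** beta

def binarize(X_cont, X_bin_prev):
    """Map a continuous concentration vector to a binary feature mask (complement-based rule)."""
    prob = np.clip(u_transfer(X_cont), 0.0, 1.0)       # probability of flipping each bit
    rno = np.random.rand(X_cont.shape[0])
    flip = rno < prob
    return np.where(flip, 1 - X_bin_prev, X_bin_prev)  # complement the bit where rno < T(x)
```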
As described in Faramarzi et al. (2020), the equilibrium pool controls both exploration and exploitation. During the initial iterations, the distance among the equilibrium candidates is large, and using these candidates to update the concentrations amounts to a global search. As the iterations progress, the equilibrium candidates come closer to each other, and using them to update the concentrations performs a local search around the candidates, which results in exploitation. Exploration is additionally taken care of by the Generation Probability (GP, Equation (13)). However, there is no component dedicated specifically to exploitation. Hence, we have incorporated SA to perform a local search, i.e., to take care of exploitation explicitly. The hybrid version of EO is labeled DEOSA (Discrete EO with SA).
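The SA-based local search can be sketched as below; the initial temperature, cooling factor, number of steps, and the single-bit-flip neighborhood are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

def sa_local_search(solution, fitness_fn, T0=1.0, cooling=0.95, steps=50):
    """Simulated-annealing refinement of a binary solution (exploitation step of DEOSA).

    solution   : binary numpy array (selected features marked 1)
    fitness_fn : callable returning a value to be minimised
    """
    current, cur_fit = solution.copy(), fitness_fn(solution)
    best, best_fit = current.copy(), cur_fit
    T = T0
    for _ in range(steps):
        neighbour = current.copy()
        idx = np.random.randint(len(neighbour))
        neighbour[idx] = 1 - neighbour[idx]            # flip one random bit
        nb_fit = fitness_fn(neighbour)
        delta = nb_fit - cur_fit                       # theta in the Boltzmann criterion
        if delta <= 0 or np.random.rand() < np.exp(-delta / T):
            current, cur_fit = neighbour, nb_fit       # accept better, or worse with prob e^(-theta/T)
            if cur_fit < best_fit:
                best, best_fit = current.copy(), cur_fit
        T *= cooling                                   # lower the temperature
    return best, best_fit
```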
FS is considered a multi-objective optimization problem with two different criteria for evaluating the feature subset under consideration: classification accuracy and the number of selected features. To be specific, the objective of FS is to achieve maximum classification accuracy with the minimum number of features.
These two criteria are contradictory in nature Emary et al. (2016b), so we consider the classification error rate instead of accuracy. Equation (18) aggregates both objectives and converts the multi-objective FS problem into a single-objective one:

$fitness = \eta\,\epsilon + \mu\,\frac{|\upsilon|}{|D|} \qquad (18)$

where $\epsilon$ denotes the classification error rate, $|\upsilon|$ represents the number of features in the subset being evaluated, and $|D|$ represents the total number of features in the dataset. $\mu$ and $\eta$ respectively represent the weight (importance) of the subset length and the classification error, with $\eta + \mu = 1$. We have used the K-Nearest Neighbor (KNN) classifier Altman (1992) to compute the classification error ($\epsilon$).
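A possible wrapper fitness evaluation, sketched with scikit-learn's KNN; the weights η and µ, the value of K, and the train/test split below are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def fitness(mask, X, y, eta=0.99, mu=0.01, k=5):
    """Fitness of a binary feature mask: eta * error + mu * |subset| / |D|.

    eta, mu (eta + mu = 1) and k are illustrative, not necessarily the paper's values.
    """
    if mask.sum() == 0:                               # penalise empty feature subsets
        return 1.0
    Xs = X[:, mask.astype(bool)]
    X_tr, X_te, y_tr, y_te = train_test_split(Xs, y, test_size=0.2, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    error = 1.0 - knn.score(X_te, y_te)               # classification error rate
    return eta * error + mu * mask.sum() / X.shape[1]
```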

Experimental Results
This section reports the experiments carried out to demonstrate the applicability of EO in the FS domain. We have used the KNN classifier Altman (1992) to evaluate the candidate feature subsets.

Dataset Description
To evaluate the performance of DEO and DEOSA, 18 popular UCI datasets have been used from the repository Blake (1998). The datasets are adopted from diversified backgrounds to ensure the effectiveness of the proposed model.
A brief overview of these datasets is outlined in Table 1.

Parameter Tuning
For any population-based evolutionary algorithm, the population size and the maximum number of allowable iterations significantly affect the searching procedure. Hence, it is very important to tune these parameters to retrieve optimal results from the algorithm. The population size determines the extent of interaction among the agents in a population, and the maximum number of iterations guides the convergence of the algorithm. In an attempt to find a proper combination of values for these parameters, DEO and DEOSA are simulated for different combinations of these parameters, and the outcomes are shown in Figure 2. It is evident from the graphs in Figure 2 that the combination of (50, 30) for the population size and the maximum number of iterations works best in the given scenario. Therefore, these values are fixed for the rest of the experimentation.
Another very important aspect of any evolutionary algorithm is the iteration-wise improvement in performance. Convergence graphs help visualize the evolutionary strength of an algorithm. For this analysis, the convergence graphs of DEO and DEOSA are plotted in Figure 3. The graphs indicate that both DEO and DEOSA provide steady and fast convergence over the iterations. As the best value for the maximum number of iterations is found to be 30, the convergence graphs are plotted for 30 iterations.

Results and Discussion
This section reports the results obtained by DEO and DEOSA over the 18 UCI datasets outlined in Table 1. To get a generalized overview of the algorithms, both have been run 10 times, and the best and average classification accuracies obtained in these runs are provided in Table 2. Inspecting this table, it is evident that DEOSA achieves better accuracy than DEO in most cases. Moreover, both algorithms provide steady results, which is evident from the low standard deviation values. To compare these results with the results obtained before FS, a simple comparison is provided in Table 3.
For all the datasets, classification accuracy has improved after FS (with huge margins in multiple cases).
Both algorithms provide ≥ 90% accuracy on 15 datasets. From this discussion, we can safely conclude that DEO and DEOSA are able to successfully select optimal sets of features from the datasets. The accuracy achieved by both models using very few features makes both algorithms very competitive in the field of FS.
Overall, we can say that although DEO and DEOSA perform really well in FS, the results obtained by DEOSA are better than DEO. Hence, it can be concluded that SA plays a significant role in improving the performance of DEO.
Basically, in DEO, exploration of the search space is guided by the four particles present in the equilibrium pool, while the fifth particle in the pool, the average of the other four, mainly helps in exploitation. The balance between exploitation and exploration is maintained by the exponential term described in Section 3.1.2. However, there may be situations in which the particles in the equilibrium pool become similar in nature, which indicates that they belong to the same part of the search space. As a consequence, that specific part is exploited heavily, while the algorithm is prevented from exploring the entire search space in the quest for the global optimum. This is where SA comes in: it assists DEOSA in exploring the search space by circumventing local optima, hence the superior performance of DEOSA. In this context, it is worth noting that even though SA increases the exploratory ability of DEO, the exploration-exploitation trade-off remains stable due to the use of the exponential term.

Comparison
In this sub-section, we compare the results of the proposed DEOSA with those of several state-of-the-art FS methods. Figure 4 shows the performance of DEOSA in terms of classification accuracy over the UCI datasets. Inspecting Figure 4, it can be observed that DEOSA performs best in 7 cases. The algorithm is second-best on the HeartEW and Sonar datasets, where HSGW and RSGW respectively perform the best. For SpectEW and Tic-tac-toe, it secures the third rank. In Figure 5, the average classification accuracy achieved by each method over all 18 UCI datasets is reported. In terms of average accuracy, DEOSA provides 94.35612% classification accuracy, which is the highest among all the algorithms.
HSGW and ASGW hold the second and third ranks in terms of average classification accuracy. Figure 6 shows the performance of DEOSA in terms of another important aspect of FS, and the very purpose of this research field, namely the number of selected features. We can observe that DEOSA has selected the lowest number of features in 8 cases. On average over all 18 datasets (Figure 7), DEOSA selects 13.90555556 features, which is the second best, following HGSA (10.8).
In order to determine the significance of the obtained results, we have performed the Wilcoxon rank-sum test Wilcoxon (1992) at the 5% significance level for each pair of methods used in this section. Table 4 shows the obtained p-values; values with p ≤ 0.05 are shown in bold.
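For reference, such a pairwise test can be run with SciPy as sketched below (the accuracy values are purely illustrative, not results from the paper):

```python
from scipy.stats import ranksums

# accuracies of DEOSA and a competitor over repeated runs (illustrative numbers only)
deosa_acc = [0.952, 0.948, 0.955, 0.950, 0.949]
other_acc = [0.931, 0.935, 0.928, 0.933, 0.930]

stat, p_value = ranksums(deosa_acc, other_acc)
print(f"p-value = {p_value:.4f}")                     # p <= 0.05 => significant difference
```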

Additional Testing
As evident from the previous discussion, DEOSA is an effective algorithm for FS. In this section, additional experiments are performed to check how DEOSA can scale to bigger datasets and other challenging binary optimization problems.

Application on Microarray Data
Microarray datasets Guha et al. (2020b) are high-dimensional in nature and pose a greater challenge to FS models because of the extensive search required. Therefore, such datasets are very effective for assessing the robustness of any FS model. For this experimentation, 7 publicly available Microarray datasets have been considered, the details of which are presented in Table 5. DEOSA has been compared with the algorithms reported in Guha et al. (2020b); Table 6 contains the classification accuracy obtained by these algorithms over the 7 Microarray datasets. The number of features used to achieve each accuracy is provided in parentheses next to the accuracy. For 6 out of 7 Microarray datasets, DEOSA has obtained 100% classification accuracy, which is quite remarkable. Another really intriguing point is the number of features selected by DEOSA: these counts are extremely low compared to the total number of features mentioned in Table 5, and are still lower than those of most of the algorithms used here for comparison.

Application on 0/1 Knapsack Problem
In order to check the applicability of DEOSA to other binary optimization problems, it has been further applied to the 0/1 knapsack problem, a popular combinatorial optimization problem with various applications. The problem states that there is a set of items, each with a value and a weight. The objective is to select a subset of items that maximizes the accumulated value subject to a constraint on the total weight.
The mathematical formulation of the 0/1 knapsack problem can be stated as:

$\text{maximize} \; \sum_{i=1}^{n} v_i\,x_i$

$\text{subject to:} \; \sum_{i=1}^{n} w_i\,x_i \leq W, \qquad x_i \in \{0, 1\}$

where $x_i$ is the state of the $i$-th item, which is 1 if the item is selected and 0 otherwise, $v_i$ and $w_i$ are the value and weight of the $i$-th item respectively, and W is the maximum allowable weight.
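A simple sketch of how a candidate binary solution can be evaluated under this formulation (the zero-value penalty for infeasible solutions is our assumption, not necessarily the paper's constraint handling):

```python
import numpy as np

def knapsack_value(x, values, weights, W):
    """Objective of a binary knapsack solution; infeasible solutions receive a simple penalty.

    x       : binary vector, x[i] = 1 if item i is selected
    values  : item values v_i
    weights : item weights w_i
    W       : maximum allowable weight
    """
    total_value = np.dot(x, values)
    total_weight = np.dot(x, weights)
    if total_weight > W:                               # penalise constraint violation
        return 0.0                                     # a simple penalty choice (assumed)
    return total_value
```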
For the experimentation with the binary knapsack problem, 25 popular knapsack datasets have been selected from Kreher. The results of 30 independent runs are tabulated in Table 7, which reports the mean and standard deviation over these runs. The results of some other U-shaped transfer function based models have been taken from Mirjalili et al. (2020) for comparison. To maintain a neutral computational environment, the population size and the number of iterations are set to 20 and 500 respectively, as mentioned in that paper. Inspecting Table 7, it is clearly visible that DEOSA provides results comparable with the other U-shaped transfer function based models. The knapsack problem is a maximization problem, so the higher the mean, the better the result. DEOSA has obtained very high mean values over all 25 datasets. One important point to observe here is that the standard deviation of the proposed method is significantly low, which indicates the stability of the algorithm in providing such good solutions.
After analyzing the results obtained by DEOSA over the UCI, Microarray, and Knapsack datasets, the advantages of DEOSA can be stated as follows:
• DEOSA is able to provide very high classification accuracy for all 18 UCI datasets while utilizing a very low percentage of features from the entire feature set. This proves the applicability of the proposed model in the domain of FS.
• The successful application over Microarray datasets proves the scalability of DEOSA, because Microarray datasets are high-dimensional in nature and contain > 2000 features. In 6 out of 7 datasets, DEOSA has achieved a perfect accuracy of 100%, which is very impressive. Moreover, the number of features needed to obtain these accuracies is lower than that of most of the algorithms used for comparison in Table 6.
• The 0/1 Knapsack problems are used to prove the robustness and applicability of DEOSA to other binary optimization problems. The high values obtained for the 25 binary Knapsack problems demonstrate that DEOSA is suitable for other binary optimization problems as well. Further, the low standard deviations for all these datasets indicate that DEOSA is very stable while achieving these results and is very likely to provide good results in every run of the algorithm.

Conclusion
In this work, we have proposed a binary variant of EO, called DEO, to make it applicable to the field of FS. A U-shaped transfer function has been used to map the continuous values of EO into the binary domain, and SA has been embedded as a local search procedure, yielding the hybrid DEOSA. Experiments over 18 UCI datasets, 7 high-dimensional Microarray datasets, and 25 binary Knapsack problems, validated with the Wilcoxon rank-sum test, demonstrate the effectiveness, scalability, and robustness of the proposed method. In the future, DEOSA can be applied to other binary optimization problems and combined with other local search strategies or transfer functions.