S-shaped versus V-shaped transfer functions for binary Manta ray foraging optimization in feature selection problem

Feature selection (FS) is considered one of the core concepts in the areas of machine learning and data mining, and it immensely impacts the performance of classification models. Through FS, irrelevant or partially relevant features can be eliminated, which in turn helps in enhancing the performance of the model. Over the years, researchers have applied different meta-heuristic optimization techniques for the purpose of FS, as these overcome the limitations of traditional optimization approaches. Following this trend, we introduce a new FS approach based on a recently proposed meta-heuristic algorithm called Manta ray foraging optimization (MRFO), which is developed following the food-foraging nature of manta rays, one of the largest known marine creatures. As MRFO is apposite for continuous search space problems, we have adapted a binary version of MRFO to fit the problem of FS by applying eight different transfer functions belonging to two different families: S-shaped and V-shaped. We have evaluated the eight binary versions of MRFO on 18 standard UCI datasets. Of these, the best one is considered for comparison with 16 recently proposed meta-heuristic FS approaches. The results show that MRFO outperforms the state-of-the-art methods in terms of both classification accuracy and number of features selected. The source code of this work is available at https://github.com/Rangerix/MetaheuristicOptimization.


Introduction
In recent times, due to the huge amount of data collected every minute and the need to convert such data into useful information, data mining is considered one of the fastest growing fields of Information Technology [1][2][3][4]. With increasing applications of traditional machine learning [5] as well as deep learning [6][7][8], data mining and knowledge extraction techniques have gained huge popularity over the years, as most of these learning techniques are hungry for informative data. Data mining involves preprocessing, knowledge representation and pattern evaluation [9]. One of the main data pre-processing steps is feature selection (FS). FS can be defined as the process of finding the optimal subset of features that can retain suitably high classification accuracy in representing the original feature set [9]. Through FS, we can remove irrelevant and redundant features from the datasets. In the learning process, irrelevant and redundant features not only mislead the learning algorithm and reduce its performance but also result in increased computational complexity and high storage requirements [10]. FS techniques are broadly classified into two categories [11][12][13][14][15]: filter and wrapper. Filter methods do not require any learning algorithm; rather, they use statistical methods to evaluate a feature subset or determine the correlation between variables. On the other hand, wrapper methods use a learning algorithm, such as a classifier, for the evaluation of feature subsets. Filters are much faster than wrappers, but wrappers are capable of achieving higher classification accuracies than filters [16].
Over the last two decades, several meta-heuristic optimization techniques have been proposed in the field of FS to overcome the limitations of traditional optimization approaches [17]. Nature-inspired algorithms have shown high performance in solving search problems in general [17]. In the literature, many methods have been proposed to mimic the behaviours of animals, birds, fish, wolves, etc. The first proposed FS technique using a meta-heuristic approach is the Genetic Algorithm (GA) [18]. A chaotic genetic feature selection optimization method (CGFSO) is proposed by the authors in [19]. Particle Swarm Optimization (PSO) is proposed in [20] by mimicking the social synergy of a flock of birds, and many FS approaches based on PSO are proposed in [21][22][23]. Ant Colony Optimization (ACO) [24] is another meta-heuristic algorithm, one that mimics the behaviour of real ants when searching for the shortest path to a food source; ACO-based FS approaches are proposed in [25,26]. Following the biological behaviours of bees, the Artificial Bee Colony (ABC) optimization algorithm [27] and its FS counterpart [28] have also been proposed. The Social Mimic Optimization (SMO) algorithm [13] is proposed by following the human behaviour of mimicking someone who is more intelligent, more esteemed and more powerful, and its FS counterpart is proposed in [29]. Some recent optimization algorithms and their FS counterparts are the Grey Wolf Optimizer (GWO) [16], Whale Optimization Algorithm (WOA) [30], Ant Lion Optimizer (ALO) [17], Gravitational Search Algorithm (GSA) [31], Social Ski-driver (SSD) Algorithm [32], Sailfish Optimizer (SFO) [33] and Ring Theory-based Evolutionary Algorithm (RTEA) [34].
The authors of the work reported in [35] have proposed two binary variants of ALO using S-shaped and V-shaped transfer functions. In [36], the authors have proposed six variants of ALO using three S-shaped and three V-shaped transfer functions. In [37], the authors have proposed a binary version of the Grasshopper Optimization Algorithm (BGOA) using S-shaped and V-shaped functions. In [38], the authors proposed six transfer functions and eight binary versions of PSO.
Since there already exist so many optimization algorithms in the literature that perform quite well, the question is whether we need another new optimization approach. According to the No Free Lunch theorem for optimization [39], there cannot exist a single algorithm that solves all optimization problems. This, in turn, implies that the currently proposed algorithms for FS are not able to solve all FS problems. This motivated us to propose a new FS approach based on a recently proposed meta-heuristic algorithm, Manta ray foraging optimization (MRFO) [40]. We have proposed a binary version of MRFO for FS problems, with eight different versions of the FS technique based on four S-shaped and four V-shaped transfer functions. The methods are evaluated on 18 standard datasets and compared with different recently proposed meta-heuristic FS techniques to validate their performance.

Manta Ray Foraging Optimization: a brief overview
MRFO was first proposed in [40], inspired by manta rays, one of the largest known marine creatures. Manta rays feed on plankton, mostly made up of microscopic animals in the water. They have three different foraging strategies: 1. Chain foraging [41]: Manta rays observe the plankton position and swim towards it by forming an orderly line, so the plankton missed by one manta ray is devoured by the following ones. The higher the concentration of plankton at a position, the better that position is. All manta rays except the first move towards both the best position and the individual in front of them. The mathematical model of chain foraging is given by Eqs. (1)-(2):

x_i^d(t+1) = x_i^d(t) + r (x_best^d(t) - x_i^d(t)) + α (x_best^d(t) - x_i^d(t)),  i = 1
x_i^d(t+1) = x_i^d(t) + r (x_{i-1}^d(t) - x_i^d(t)) + α (x_best^d(t) - x_i^d(t)),  i = 2, ..., N   (1)

α = 2 r √(|log(r)|)   (2)
where x_i^d(t) is the position of the i-th individual at the t-th iteration in the d-th dimension, r is a random number in [0, 1], α is a weight coefficient, and x_best^d is the position with the highest plankton concentration. Although the actual best solution is not known, MRFO takes the position with the highest concentration found so far as the best solution. 2. Cyclone foraging [42]: When a school of manta rays finds plankton with a high concentration in deep water, they form a spiral by creating head-to-tail links. In this formation, each manta ray moves towards the plankton position as well as towards the one in front of it. The mathematical model of cyclone foraging is given by Eqs. (3)-(4):

x_i^d(t+1) = x_best^d + r (x_best^d(t) - x_i^d(t)) + β (x_best^d(t) - x_i^d(t)),  i = 1
x_i^d(t+1) = x_best^d + r (x_{i-1}^d(t) - x_i^d(t)) + β (x_best^d(t) - x_i^d(t)),  i = 2, ..., N   (3)

β = 2 e^{r_1 (T - t + 1)/T} sin(2π r_1)   (4)
where β is a weight coefficient, T is the maximum number of iterations, and r_1 is a random number in [0, 1]. All individuals perform a random search taking the best-so-far plankton position as the reference position, so cyclone foraging provides good exploitation of the best region found so far. It can also be modified to improve the exploration capability of MRFO: by assigning a new random position as the reference position, the manta rays are forced to look for new positions, which results in an extensive global search. The corresponding mathematical model is:

x_rand^d = Lb^d + r (Ub^d - Lb^d)   (5)
x_i^d(t+1) = x_rand^d + r (x_rand^d - x_i^d(t)) + β (x_rand^d - x_i^d(t)),  i = 1
x_i^d(t+1) = x_rand^d + r (x_{i-1}^d(t) - x_i^d(t)) + β (x_rand^d - x_i^d(t)),  i = 2, ..., N   (6)

where x_rand^d is a random position in the search space, and Lb^d and Ub^d are the lower and upper limits of the d-th dimension, respectively. 3. Somersault foraging [43]: This is a random, frequent, local and cyclical movement that helps manta rays optimize their plankton intake. The best-so-far plankton position is used as a pivot; each individual swims to and fro around the pivot and somersaults to a new position. The mathematical model is:

x_i^d(t+1) = x_i^d(t) + S (r_2 x_best^d - r_3 x_i^d(t)),  i = 1, ..., N   (7)

where S is the somersault factor, and r_2 and r_3 are random numbers in [0, 1]. As per [40], we have fixed S = 2.
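As an illustration, one iteration of the continuous MRFO update can be sketched in Python. This is a minimal sketch of the three foraging rules described above, not the authors' released implementation; all function and variable names are ours.

```python
import math
import random

def mrfo_step(pop, best, t, T, lb, ub, S=2.0):
    # One iteration of continuous MRFO: each manta ray updates by chain or
    # cyclone foraging, then the whole school performs somersault foraging.
    # pop: list of position vectors; best: best-so-far position;
    # lb/ub: per-dimension lower/upper bounds; S: somersault factor.
    dim = len(best)
    new_pop = []
    for i, x in enumerate(pop):
        prev = best if i == 0 else pop[i - 1]   # individual swimming in front
        r = random.random()
        if random.random() < 0.5:               # cyclone foraging
            r1 = random.random()
            beta = 2.0 * math.exp(r1 * (T - t + 1) / T) * math.sin(2.0 * math.pi * r1)
            if t / T < random.random():         # explore around a random reference point
                ref = [lb[d] + random.random() * (ub[d] - lb[d]) for d in range(dim)]
            else:                               # exploit the best-so-far position
                ref = best
            nxt = [ref[d] + r * (prev[d] - x[d]) + beta * (ref[d] - x[d])
                   for d in range(dim)]
        else:                                   # chain foraging
            alpha = 2.0 * r * math.sqrt(abs(math.log(r))) if r > 0 else 0.0
            nxt = [x[d] + r * (prev[d] - x[d]) + alpha * (best[d] - x[d])
                   for d in range(dim)]
        new_pop.append(nxt)
    # Somersault foraging: swim to and fro around the best-so-far pivot.
    out = []
    for x in new_pop:
        r2, r3 = random.random(), random.random()
        out.append([x[d] + S * (r2 * best[d] - r3 * x[d]) for d in range(dim)])
    return out
```

Coupled with a fitness evaluation and a best-so-far tracker, repeating this step for T iterations yields the full continuous optimizer.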

Manta ray representation
FS is a binary optimization problem [44]: while selecting a feature subset, each feature belonging to the original feature vector has two possibilities, either to be included in the subset or to be discarded. Hence, a vector of 0's and 1's is needed to represent any solution of a FS problem, where 0 means the corresponding feature is not selected and 1 means it is selected. The length of the vector is equal to the feature dimension of the dataset under consideration. Accordingly, a manta ray is represented as a binary vector. Problems with a continuous real search space can be converted to binary problems by converting their variables to binary variables. In the continuous version of MRFO, manta rays move around the search space using position vectors within the continuous real domain; in a discrete binary search space, updating positions implies switching between 0 and 1. Therefore, we convert the real positions of the manta rays to binary values by applying a transfer function to the real positions [22].
A transfer function defines the probability of updating the values of a binary solution from 0 to 1 and vice versa; it forces the manta rays to move in a binary space. There are two families of transfer functions [36]: S-shaped and V-shaped. With S-shaped functions, the positions are updated based on Eq. (8):

x_i^d(t+1) = 1 if rand < S(x̃_i^d(t+1)), and 0 otherwise   (8)

where x̃_i^d(t+1) is the continuous position produced by the MRFO update rules and S(·) is the S-shaped transfer function.
where rand is a random number in [0, 1]. With V-shaped functions, the positions are updated based on Eq. (9):

x_i^d(t+1) = complement of x_i^d(t) if rand < V(x̃_i^d(t+1)), and x_i^d(t) otherwise   (9)

where V(·) is the V-shaped transfer function.

In this work, we have studied the impact of different transfer functions on the performance of MRFO. We have used eight transfer functions [38]: four S-shaped and four V-shaped. Table 1 shows the mathematical formulas of the eight transfer functions used in this work, and Fig. 1 shows their corresponding graphs.
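To make the two update rules concrete, the sketch below binarizes a continuous position using one representative function from each family (the sigmoid and the |tanh| form, both among the common S-shaped/V-shaped choices catalogued in [38]); the helper names are ours.

```python
import math
import random

def s1(v):
    # Representative S-shaped transfer function: the standard sigmoid.
    return 1.0 / (1.0 + math.exp(-v))

def v1(v):
    # Representative V-shaped transfer function: |tanh(v)|.
    return abs(math.tanh(v))

def binarize_s(cont_pos):
    # S-shaped rule: S(x) is the probability that the corresponding bit is 1.
    return [1 if random.random() < s1(v) else 0 for v in cont_pos]

def binarize_v(cont_pos, bits):
    # V-shaped rule: V(x) is the probability of *flipping* the current bit,
    # so it depends on the previous binary position as well.
    return [1 - b if random.random() < v1(v) else b
            for v, b in zip(cont_pos, bits)]
```

Note the qualitative difference: an S-shaped function resets each bit from scratch, while a V-shaped function only flips bits when the continuous value moves far from zero, which tends to preserve more of the current solution.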

Fitness function
One important aspect of any FS technique is the evaluation of the selected feature subset. Since the proposed method is a wrapper-based FS technique, a learning algorithm (classifier) is involved in the evaluation process; we have used the well-known K-Nearest Neighbour (KNN) classifier [45] for this purpose. FS, in general, has two objectives [29]: achieving higher classification accuracy and selecting a lower number of features. Basically, an FS method removes the irrelevant and redundant features; higher accuracy is achieved by eliminating the irrelevant features, which otherwise mislead the classification model and thereby lower the classification accuracy. Both a higher classification accuracy and a lower number of selected features indicate that the selected subset is better. However, these two objectives are opposing in nature, since the number of features needs to be minimized while the classification accuracy needs to be maximized. Hence, the classification error rate is used here instead of the classification accuracy, and the two objectives are combined into a single objective by Eq. (10), which is used as the fitness function in the present work:

Fitness = ω c + (1 − ω) |R|/|D|   (10)
where c is the classification error of the classifier used, |R| is the number of selected features, |D| is the total number of features (the dimension), and ω ∈ [0, 1] is a weight [35].
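This weighted combination can be sketched as a one-line Python function. The default weight of 0.99 shown here is a common choice in the cited wrapper-FS literature, not a value specified above, and the function name is ours.

```python
def fitness(error_rate, n_selected, n_total, omega=0.99):
    # Single-objective fitness: omega * classification error
    # + (1 - omega) * fraction of features kept. Lower is better.
    return omega * error_rate + (1.0 - omega) * (n_selected / n_total)
```

Because both terms are minimized jointly, a subset with marginally worse accuracy can still win if it is much smaller, and a large weight on the error term keeps accuracy dominant, matching the stated priority of accuracy over subset size.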
Experimental results and discussion

Datasets
The proposed eight binary MRFO algorithms are evaluated on 18 benchmark UCI datasets [46].The datasets are described in terms of number of attributes, number of samples, number of classes and dataset domain in Table 2.

Parameter tuning
The convergence graphs for the 18 datasets with all eight binary MRFO versions are given in Fig. 2. These graphs depict the value of the fitness function versus the number of iterations for all eight versions of the proposed method. It can be noticed that all the methods converge within 30 iterations. Although the ultimate motive of any FS procedure is to increase the classification accuracy while using a minimum number of features, increasing the classification accuracy is more important than reducing the number of features; hence, the convergence of the algorithm is guided by the changes in classification accuracy rather than the number of features. To run the algorithm, the maximum number of iterations needs to be fixed beforehand. Through experimental observation, we have seen that, for every dataset used in the experimentation, there is little to no improvement in classification accuracy after 30 iterations, so the algorithm is considered to have converged at that point. We have therefore set the number of iterations to 30 for all subsequent experiments. We have also observed the effect of population size on the performance of MRFO in terms of classification accuracy. For this, the proposed approaches are evaluated with population sizes of 5, 10, 20, 30 and 50. Figure 3 depicts the impact of the different population sizes on the attained classification accuracy for all the said datasets. It can be noticed that increasing the population size does not always increase the classification accuracy; besides, an increased population size means a higher run time. So, for all subsequent experiments, the population size is set to 20 as a trade-off between the classification accuracy and the computational cost of the algorithm. All the proposed approaches are implemented in Python3 [47] and run on an Intel Core-i3 CPU with 4 GB of RAM.
Table 3 presents the classification accuracies (in percent) achieved by the proposed eight FS methods. The "Without FS" column indicates the classification accuracy on each dataset without applying any FS method.
In Table 3, we have ranked the proposed methods for each dataset according to the achieved classification accuracy and, from this, obtained the average rank of each proposed method. MRFOv3 achieves rank one, i.e., it achieves higher accuracy than the other seven proposed methods for most of the datasets used in this work, and it does so consistently, without large fluctuations across datasets.
Table 4 shows the number of features selected by the proposed eight FS methods. The "Entire dimension" column indicates the total number of features in each dataset.
In Table 4, we have ranked the proposed methods for each dataset according to the number of selected features and, from this, obtained the average rank of each method. MRFOv3 achieves rank one, i.e., it selects a lower number of features than the other seven methods for most of the datasets used in this work, and it does so consistently.
Observing both tables, on average MRFOv3 gives the highest classification accuracy, selects the least number of features, and performs consistently. So, amongst the proposed eight methods, MRFOv3 is considered the best, and we have used it for comparison with different state-of-the-art methods.
To justify the choice of the KNN classifier, its performance is compared with that of two other popular classifiers: Naive Bayes and Random Forest. The results are generated by using each classifier in turn to compute the accuracy used in the fitness function (Eq. (10)). Table 5 shows the classification accuracy achieved and the number of features selected when each of the three classifiers is placed in the fitness function of the proposed FS approach. The results show that KNN works best w.r.t. the datasets considered here; hence, only the KNN classifier is used in further experiments.

Comparison
In this section, we have compared the classification accuracy achieved and the number of features selected using the proposed MRFOv3 approach with some methods reported recently in the literature. We have compared our method with the ALO-based methods bALO-QR, bALOS-QR, bALOV-QR, ALO, bALO-1, bALO-S, bALO-V, bALO-CE, bALOS-CE and bALOV-CE reported in [17]; the GSA-based methods BGSA and HGSA reported in [48]; and the hybrid methods using GWO and WOA, namely HSGW, RSGW and ASGW, reported in [49]. In this work, we have used 18 UCI datasets, all of which were also used in the works reported in [17,48,49].
Figure 4 shows the classification accuracy achieved by the proposed MRFOv3 method and other methods for each dataset.
From Fig. 4, it is clear that MRFOv3 achieves the highest accuracy on 14 out of 18 datasets. It is also worth mentioning that MRFOv3 not only outperforms the other methods on 77.8% of the datasets, but also achieves 100% accuracy in 7 cases: CongressEW, Exactly, M-of-n, PenglungEW, Vote, WineEW and Zoo. The four cases where MRFOv3 fails to achieve the highest accuracy are BreastEW, Exactly2, IonosphereEW and SpectEW; for SpectEW, MRFOv3 achieves the second highest accuracy. Figure 5 shows the number of features selected by the proposed MRFOv3 and the other methods for each dataset used in this work; from it, we can see that for BreastEW, Exactly2 and IonosphereEW the number of features selected by MRFOv3 is significantly low.
From Fig. 5, it is clear that MRFOv3 selects the lowest number of features on 15 out of 18 datasets, i.e., in 83.3% of cases. It can also be noted that for the BreastEW, IonosphereEW, PenglungEW, SpectEW and WineEW datasets, the number of features selected by MRFOv3 is quite low compared to the other methods. The three cases where MRFOv3 fails to select the lowest number of features are Exactly, M-of-n and WaveformEW; but for all these datasets, MRFOv3 achieves the highest accuracy, and for Exactly and M-of-n it achieves 100% accuracy. So, observing Figs. 4 and 5, we can conclude that no other method outperforms MRFOv3 in terms of both classification accuracy and number of features, even on a single dataset.
Taken together, this comparative study establishes that the proposed method outperforms the state-of-the-art methods considered here for comparison. It also illustrates the No Free Lunch theorem, in that there can always be a better algorithm for an optimization problem such as FS, and it shows the impact of transfer functions on the meta-heuristic algorithm used for FS. In Table 6, we report the Wilcoxon signed rank test [50] for all possible pairs of the methods. Table 6 shows whether the difference between the results (in terms of classification accuracy) of two algorithms is statistically significant according to the Wilcoxon signed rank test: a p value less than 0.05 is denoted by 1, which implies that the difference between the results is statistically significant, and 0 means it is not.
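Significance flags of this kind can be computed with a Wilcoxon signed rank test over per-dataset accuracies. The sketch below uses the normal approximation to the test statistic (a standard choice for sample sizes like the 18 datasets here) rather than exact tables, and is an illustration rather than a reconstruction of how Table 6 was produced; the function names are ours.

```python
import math

def wilcoxon_signed_rank_p(a, b):
    # Two-sided Wilcoxon signed rank test via the normal approximation.
    # a, b: paired samples (e.g. per-dataset accuracies of two methods).
    diffs = [x - y for x, y in zip(a, b) if x != y]   # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2.0 + 1.0                     # average of 1-based ranks
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    t_stat = min(w_plus, w_minus)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (t_stat - mean) / sd                          # z <= 0 by construction
    return 1.0 + math.erf(z / math.sqrt(2.0))         # = 2 * Phi(z)

def significant(a, b, alpha=0.05):
    # 1 if the paired difference is statistically significant, else 0.
    return 1 if wilcoxon_signed_rank_p(a, b) < alpha else 0
```

For production use, `scipy.stats.wilcoxon` offers exact p values and several policies for handling zero differences.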

Conclusion and future work
FS is considered an important pre-processing step in the domains of machine learning and data mining. In this article, we have proposed a binary version of MRFO for the selection of an optimal subset of features. Since MRFO is reported to be suitable for continuous search space problems, we have adapted it to the binary search space of FS using eight different transfer functions belonging to two different families: S-shaped and V-shaped. We have applied the eight binary versions of MRFO to 18 standard UCI datasets and shown the impact of the transfer functions on the classification accuracy obtained and the number of features selected.
We have considered both the classification accuracy and number of features selected while designing the fitness function.To measure the effectiveness of the proposed approaches, we have considered the best among the eight versions of MRFO and compared it with 16 recently proposed meta-heuristic FS approaches.The results show that MRFO outperforms the state-of-the-art methods in terms of both classification accuracy and number of selected features.
In future, we plan to propose an embedded FS method in which binary MRFO is united with a compatible filter method. We also aim to hybridize MRFO with other meta-heuristic FS methods. Besides, as the proposed FS version of MRFO is a general approach, it can easily be applied to other standard pattern classification problems where large feature vectors are typically used.

Compliance with ethical standards
Conflict of interest We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Fig. 1 a S-Shaped and b V-Shaped family of transfer functions

Fig. 3 Accuracy versus Population size for 18 UCI datasets used in present work

Fig. 4 Comparison between MRFO and state-of-the-art methods based on accuracy

Table 1 S-Shaped and V-Shaped families of transfer functions

Table 2 Description of the datasets used in the present work
Fig. 2 Convergence graphs for 18 UCI datasets used in present work

Table 3 Classification accuracies (in percentage) of the proposed FS method using eight different transfer functions

Table 4 No. of selected features with the proposed FS method using eight different transfer functions (the lowest number of selected features for each dataset is highlighted in bold)