A New Evolutionary Ensemble Learning of Multimodal Feature Selection from Microarray Data

In recent decades, data has grown exponentially in both the number of samples and the number of features, which makes feature selection (FS) more challenging. In this paper, a multimodal optimization (MMO) technique is employed to find multiple optimal solutions instead of a single one. Using the hidden information in the data and creating an ensemble of classifiers, the potential of the multiple answers provided by MMO is exploited to address FS from microarray data. After pre-processing the data, the optimal feature subsets are obtained by a firefly-based MMO algorithm. The mutual information criterion is used as the fitness function to evaluate each proposed feature subset. Then, each feature subset is used to train a classifier, and these classifiers together form an ensemble. To select a proper combination, a particle swarm optimization algorithm is used. Finally, the algorithm is evaluated on microarray datasets in terms of cancer diagnosis. The efficiency of the proposed method is evaluated by applying it to 11 datasets. The results indicate the superiority and proper performance of the multimodal FS method compared to other methods.


Introduction
In the last decade, the growth of medical information has led to the emergence of microarray data. Microarray data contain information extracted from tissue samples and cells to analyze genetic differences, and they are used for the diagnosis and treatment of tumors. This type of data has many features and few samples; therefore, it is challenging to process for classification purposes [1]-[4].
Noise in the features makes the analysis of such data more complicated. A large number of genes may be uninformative for classification because they are irrelevant or redundant [5]. Irrelevant features do not affect the output, and the values they produce for each sample are random. Redundant features contain information that is completely or partly repeated in one or more other features.
If all genes are used in tumor classification, efficiency may suffer due to redundant genes. Therefore, eliminating redundancy and selecting relevant genes from microarray data may improve model learning efficiency and classification accuracy; this contributes significantly to the diagnosis, prediction, and treatment of cancer [6].
Based on the above, it is evident that using feature selection techniques in bioinformatics, instead of an arbitrary selection, has become an important preprocessing step for building a classification model. Moreover, since most model diagnosis techniques are not designed to cope with a large number of irrelevant features, combining them with feature selection techniques leads to more effective solutions [7]. Feature selection methods are divided into four general groups: filter, wrapper, hybrid, and embedded models [7], [8]. In filter-based models, no learning algorithm is used to rank the features; instead, the statistical properties of the training data are considered [9]-[11]. In wrapper models, subsets of features are evaluated by a learning algorithm, often guided by evolutionary algorithms. In embedded feature selection, the classifier learning process is concurrent with feature selection: features with high discriminative power are selected, and weak ones are eliminated by analyzing and measuring the effect of features while building the classifier [12]. Hybrid methods are created by integrating the above approaches. Heuristic searches can be used to find optimal subsets of features [13]-[18].
For some optimization problems there is not a single global optimum, and one can use several local optima and their combination rather than searching for one global optimum. For example, medical specialists may have different opinions on the diagnosis of a cancer, and gathering all their opinions is more reliable than relying on a single one. In feature selection with very high dimensionality, there is sometimes not just one best subset; different subsets can be found that enable classification from different aspects. In traditional methods, only one subset is used as the classification input, while the potential of other feature subsets could also be exploited: each feature subset produced by a multimodal algorithm can be used to train a classifier. When several feature subsets are present, several different classifiers can be combined into one classifier that considers all feature subsets and thus looks at the dataset from different aspects.
Finding several optimal solutions is the goal of a distinct type of optimization method called multimodal optimization (MMO). The main objective of multimodal optimization techniques is to provide several optimal solutions for the user to select from [19]. Multimodal optimization methods not only explore an extensive area of the search space and conduct a proper local search to fine-tune the obtained solutions, but also present alternative solutions for a given problem. In the real world, even when an optimal solution is found, using it is sometimes not feasible due to the high cost of deploying a model built on that subset. If favorable alternative solutions are provided, the user can review them and select the best one according to their priorities [19]. In this paper, the objectives of the multimodal feature selection technique are to select several feature subsets and to exploit the information hidden in the data simultaneously. Instead of discarding the answers obtained from the multimodal algorithm, this paper lets each answer find its proper classifier, and the final classification is conducted collectively. To this end, the firefly-based improved multimodal optimization algorithm [20], with a complete exploration of the search space, is used to find several optimal subsets of features.
Then, a classifier is trained with each of the feature subsets so as to use the potential and information of several feature subsets, and the combination of these classifiers allows using the information hidden in all feature subsets simultaneously.
To select the proper combination, the particle swarm optimization method is adopted as the search strategy, which determines the classifiers and modals that will be present in the combination.
Reducing the dimensionality of microarray data has many advantages, such as enabling the classification of genetic data and the diagnosis of cancer. Therefore, in the last step, cancer is diagnosed using the developed model.
To demonstrate the efficiency of the proposed method in multimodal feature selection, its performance is compared with a single-modal method and with a multimodal method in which only the best modal is selected rather than all the modals. The results show the superiority and proper performance of the multimodal feature selection method compared with the other methods. The rest of the paper is organized as follows: Section II reviews the related works. The proposed method is introduced in Section III. Section IV discusses the experimental results on the datasets. Finally, Section V concludes the paper.

Related Works
For feature selection problems, meta-heuristic optimization methods have recently been adopted more frequently. Figure 1 illustrates a classification of these methods. It shows that feature selection methods can be categorized into single- and multi-objective, heuristic and meta-heuristic, and single-modal and multi-modal. The applied meta-heuristic algorithms include the genetic algorithm, harmony search algorithm, black hole algorithm, particle swarm optimization algorithm, artificial bee colony algorithm, and biogeography-based algorithm. The studies that have used these algorithms are discussed below.
The genetic algorithm, as an evolutionary algorithm, has been used in many papers on feature selection. [21] suggested a feature selection method called IG/SGA which combines Information Gain (IG) and a Standard Genetic Algorithm (SGA). [22] introduced a hybrid feature selection method called MIMAGA-Selection, a combination of Mutual Information Maximization (MIM) and an Adaptive Genetic Algorithm (AGA). [12] suggested an evolutionary method called the Intelligent Dynamic Genetic Algorithm (IDGA) for gene selection and cancer classification in microarray data; this method combines genetic algorithms and artificial intelligence. [23] introduced a method which is a combination of Correlation-based Feature Selection (CFS) and the Taguchi-Genetic Algorithm (TGA). [24] suggested a hybrid method for feature selection in microarray data analysis; the method uses a Genetic Algorithm with Dynamic Parameter setting (GADP) with the χ2-test for homogeneity. For feature selection, [25] introduced a genetic algorithm based on community detection, which works in three steps. The feature similarities are calculated in the first step. In the second step, the features are grouped into clusters by community detection algorithms. In the third step, features are selected by a genetic algorithm with a new community-based repair operation.
The harmony search algorithm is a meta-heuristic method inspired by musical improvisation, and it has been used for solving the feature selection problem.
[26] introduced a combined filter-wrapper method called SU-HSA, a combination of Symmetrical Uncertainty (SU) with the Harmony Search Algorithm (HSA), used for gene selection in microarray data.
The black hole algorithm is an evolutionary algorithm which has gained the attention of researchers due to its simplicity and proper efficiency. [27] suggested a feature selection method for microarray data based on the Binary Black Hole Algorithm (BBHA) and Random Forest Ranking (RFR).
A combined approach composed of filter and wrapper techniques for gene selection from microarray data was introduced by [28]. [29] suggested a new fuzzy system based on a combination of the stem cell algorithm and the ant colony algorithm; they intended to create a classification strategy for analyzing gene expression data.
Since particle swarm optimization does not have complicated evolutionary operators and fewer parameters need to be set, it is used as an effective method for gene selection [21]. In [30] a combined method for cancer classification called CFS-iBPSO is suggested, which combines Correlation-based Feature Selection (CFS) with an improved binary particle swarm optimization algorithm (iBPSO). [31] introduced a combined method called HPSO-LS (hybrid particle swarm optimization with local search strategy), based on particle swarm optimization and a local search strategy; embedded in PSO, this strategy is used to select a subset of less correlated and salient features. [32] introduced a PSO-based feature selection method which had the advantages of both filter and wrapper methods and could enhance classification accuracy while reducing computational complexity.
[33] suggested a novel feature selection algorithm based on bare bones PSO (BBPSO) with mutual information. First, an effective swarm initialization strategy based on label correlation is developed, which takes full advantage of the correlation between features and class labels to accelerate swarm convergence. Then, to enhance the exploitation performance of the algorithm, two local search operators (i.e., the supplementary operator and the deletion operator) are developed based on feature relevance-redundancy. Furthermore, an adaptive flip mutation operator is designed to help particles jump out of locally optimal solutions. The Artificial Bee Colony is an optimization algorithm based on swarm intelligence and the intelligent behavior of bee populations, which has also been used for gene selection because of its proper performance in many papers. [34] suggested a method for the selection of informative genes from microarray data; the method is a combination of minimum redundancy-maximum relevance (mRMR) with an artificial bee colony algorithm, called mRMR-ABC. [35] introduced a combined method for the selection of informative genes in microarray data using Independent Component Analysis (ICA) and an Artificial Bee Colony (ABC), called ICA+ABC.
Multi-objective approaches are also used in the gene selection process. The structure and basics of multi-objective optimization methods are the same as those of single-objective methods; however, the number of variables and objective functions is larger, and they are used to find a set of optimal answers rather than one optimal answer. In [36] a multi-objective algorithm is suggested for gene selection in microarray data classification. The suggested algorithm is a multi-objective version of the bat algorithm with a refined formulation; with multi-objective operators and a robust local search strategy, it is called MOBBA-LS (Multi-Objective Binary Bat Algorithm with Local Search). The primary filter method used is the Fisher criterion, which leads to the selection of genes with higher ranks; the subset filtered in this way is then used by the MOBBA-LS method.
The evolutionary algorithms reviewed above for feature selection aim to reach one and only one feature subset for classification. However, the multimodal optimization approach can present several feature subsets as optimal answers [37]. In this approach, each evolutionary algorithm can be used as an independent search unit. For example, [20] used the firefly algorithm as search units. In [19] multimodal optimization is used for solving feature selection problems that need higher exploration and exploitation power.
Apart from feature reduction, there are different approaches to classification on the reduced features, one of which is the ensemble method. Ensemble approaches are based on the fact that the output of several experts is better than the output of a single expert [38].
[39] suggested a hybrid method of Binary Particle Swarm Optimization (BPSO) and a Combat Genetic Algorithm (CGA) for gene selection based on microarray data. [40] compared two hybrid methods: the first was a Fast Correlation-Based Filter (FCBF) with a genetic algorithm (FCBF-GA), and the second was FCBF with Particle Swarm Optimization (FCBF-PSO). [41] introduced a new hybrid gene selection method called Genetic Bee Colony (GBC) that combined the genetic algorithm (GA) with the Artificial Bee Colony (ABC) algorithm. [42] introduced a metaheuristic framework for gene selection using Harmony Search (HS) with a Genetic Algorithm (GA).
[43] suggested a search method for finding the optimal ensemble of classifiers with an evolutionary approach. In that paper, the useful genes are selected by M different feature selection methods which are connectable to N classifiers. Since it is impossible to find the optimal ensemble of classifier-feature selection connections by searching the whole space, the paper uses genetic algorithms for this search. Finally, after finding the combination, a majority vote is used to gather the opinions. [42] introduced a method for identifying important gene subsets which is a combination of Ensemble Gene Selection (EGS) and an Adaptive Genetic Algorithm (AGA). First, the EGS method, using a multi-layer approach and the F-score, is applied to filter out noisy and redundant genes. The suggested method uses EGS to select the genes with the highest ranks, which are transferred to the AGA. Then, the AGA is used to identify the subset of important genes which helps with cancer and tumor diagnosis. [44] suggested an ensemble method based on the t-test and a Nested Genetic Algorithm (inner and outer) for selecting optimal gene subsets from microarray data. Following data preprocessing through the t-test, the nested genetic algorithm, composed of two genetic algorithms, is used to obtain the optimal subset of features by combining the information of two different datasets. The Outer Genetic Algorithm (OGA-SVM) is applied to the microarray gene expression dataset, and the Inner Genetic Algorithm (IGA-NNW) is applied to the DNA methylation dataset. The final solution is the outcome of combining information from gene expression data and DNA methylation. Eventually, Incremental Feature Selection (IFS) is used as an ensemble approach to obtain the smallest subset of optimal genes.

Proposed Method
This paper details a firefly-based multimodal optimization algorithm for multimodal feature selection and the combination of the resulting subsets on high-dimensional microarray datasets for the diagnosis of cancer. The proposed method is composed of four phases, which are shown briefly in Figure 2.
As shown in Figure 2, in the first phase, multimodal feature selection is performed using the firefly-based multimodal optimization algorithm. The outcome of this phase is several subsets of features. In the second phase, a classifier is trained with each of the feature subsets obtained from the first phase so as to use the potential of all answers. In the third phase, PSO is used to find the optimal ensemble. In the last phase, the model is evaluated.

Phase 1: Proposed multimodal feature selection and optimization
In phase 1, feature selection on microarray data is performed using the firefly-based multimodal optimization algorithm. The advantage of a multimodal algorithm is that its output is several feature subsets instead of one; therefore, the potential of several answers and the information hidden in the data are used.
In this phase, the dimension of each particle equals the number of features. The particles are initialized using a uniform distribution between 0 and 1, and then a threshold is applied. As illustrated in Table 1 (selection of the initial features using the threshold value), the aim is to find the subset of features whose corresponding values in the particle are higher than the threshold. After initializing the primary parameters and forming the initial population, the population is divided into several subpopulations using fuzzy clustering.
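This thresholding step can be sketched as follows (an illustrative Python sketch, not the MATLAB code used in the experiments; the threshold value 0.95 follows the parameter settings reported later):

```python
import numpy as np

# One firefly (particle): a real-valued position in [0, 1],
# one entry per feature (a small illustrative vector).
particle = np.array([0.97, 0.12, 0.99, 0.50, 0.96])
threshold = 0.95  # the parameter Tr from the parameter settings

# A feature is selected when its corresponding value exceeds the threshold.
selected = np.where(particle > threshold)[0]
print(selected.tolist())  # → [0, 2, 4]
```

In a real run, the particle would have one entry per gene, so the selected subset is typically a small fraction of thousands of features.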
Following the application of the threshold, the fireflies present in each subpopulation are evaluated using the mutual information criterion, which serves as the fitness function. In each generation, it is used to check the relevance and redundancy of the feature subsets with respect to the class label. In equation 4, the variable w is intended for adjusting the number of selected features by imposing a penalty on their increase; hence, according to equation 3, if the model selects many features, the fitness is penalized so that the model learns not to select too many features.
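The ingredients of this fitness (mutual information with the class label, plus a size penalty weighted by w) can be sketched as follows; the exact form in which they are combined is an assumption for illustration, not the paper's equation:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

def fitness(subset, features, labels, w=0.01):
    """Hypothetical fitness: mean relevance of the subset minus w times its size.

    `features` maps a feature index to its (discretised) column of values;
    `labels` holds the class label of each sample.
    """
    if not subset:
        return 0.0
    relevance = sum(mutual_information(features[f], labels) for f in subset)
    return relevance / len(subset) - w * len(subset)

# A perfectly informative feature scores high; the w term grows with subset size.
features = {0: [0, 0, 1, 1], 1: [0, 1, 0, 1]}
labels = [0, 0, 1, 1]
print(round(fitness([0], features, labels), 2))  # → 0.99
```

With w = 0.01 (the value reported in the parameter settings), each extra feature costs 0.01 fitness, so large subsets are discouraged unless they add real relevance.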
Then the firefly algorithm is applied to each subpopulation, and an updated subset is produced for each one. Again, to evaluate the candidate answers, the feature subset proposed by each answer is extracted, and the goodness of the features in the updated subpopulations is calculated based on the mutual information between the given features and the data class. After the generations finish, the growth and stability of the subpopulations are calculated to determine the optimal feature subset in each subpopulation. If the growth of a subpopulation is zero, its optimal feature subset is saved in the memory. Among the produced subpopulations, some may lack the stability criterion, which is employed to check the presence of a potential optimal answer; a stable subpopulation has at least one optimal point. When an unstable subpopulation is detected, it is destroyed, and a new subpopulation is created in a part of the search space that is less explored.
In [20], when a subpopulation is unstable, finding a proper answer in that part of the search space is considered impossible; therefore, the search sometimes needs to move to new areas. If, at the end of a step, a subpopulation is unstable, it is destroyed and a new subpopulation is created.
However, the new subpopulation may be created in a part of the search space which has previously been explored or is being explored by other particles. To illustrate, in Figure 3 the subpopulations are within the HOT area, and when developing a new subpopulation, the goal is to identify and explore cold points which have not been explored yet.
To prevent this problem, new subpopulations need to be created in a part of the search space which has not been explored by any group. In this paper, a memory called History is used to save the points which have been explored so far; each subpopulation places its current point coordinates in History. History is thus composed of the points which have been explored or are being explored.
When a subpopulation is destroyed due to instability, a number of candidate points are first generated in the search space. Then the distance of each candidate to the nearest point in History is measured. The calculated distances are sorted, and the candidate with the largest distance to the previously explored points is selected for the new subpopulation. A new population is then created around this point by sampling from a normal distribution centered on it. This strategy increases the chance of exploring more of the search space. For each of the modals, the firefly evolutionary algorithm is used. Although this algorithm favors exploitation over exploration, extra exploration is not needed: thanks to the multimodal search, the other modals explore different points of the search space.
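The respawn step can be sketched as follows (an illustrative Python sketch; the number of candidates and the spread sigma of the normal distribution are assumed values, not taken from the paper):

```python
import numpy as np

def spawn_subpopulation(history, n_candidates=50, pop_size=10, dim=2,
                        sigma=0.05, rng=None):
    """Create a new subpopulation in an under-explored region.

    history: array of points explored so far (the 'History' memory).
    Candidate centres are sampled uniformly in [0, 1]^dim; the candidate
    farthest from its nearest History point is chosen, and the new
    subpopulation is drawn from a normal distribution around it.
    """
    rng = rng or np.random.default_rng()
    candidates = rng.uniform(0.0, 1.0, size=(n_candidates, dim))
    # Distance from each candidate to its nearest explored point.
    dists = np.linalg.norm(candidates[:, None, :] - history[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    centre = candidates[nearest.argmax()]
    # Sample the new fireflies around the chosen centre.
    return np.clip(rng.normal(centre, sigma, size=(pop_size, dim)), 0.0, 1.0)

# If everything explored so far sits near the origin, the new
# subpopulation lands far from it.
history = np.zeros((5, 2))
new_pop = spawn_subpopulation(history, rng=np.random.default_rng(0))
print(new_pop.shape)  # → (10, 2)
```

Maximising the distance to the nearest History point (rather than the average distance) is what pushes the new centre into genuinely unvisited territory.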
After re-forming unstable clusters using a Gaussian distribution at unexplored points, the new optima found at these points are saved in the memory. Finally, after the optimization finishes, one feature subset is obtained per modal. The structure of the multimodal optimization algorithm is shown in Figure 4.

Phase 2: Learning the classifier on each feature subset
In the majority of multimodal optimization methods, only the best answer is used after several answers have been found; the other semi-optimal answers, which may contain useful information, are ignored. Nevertheless, in some problems, such as feature selection, the combination of the best features may not be attainable, and an ensemble of different feature subsets, instead of only the best one, may be more useful.
The aim of this phase is to use the potential of all the answers obtained from the multimodal algorithm of the previous phase, not only one. Therefore, a classifier is trained with each of the feature subsets obtained from the first phase, and these classifiers are combined into an ensemble. In this research, the particle swarm optimization method is used to decide which feature subsets and classifiers will be present in the ensemble.
As mentioned before, several feature subsets are proposed in the first phase. These feature subsets are not identical, but they may share some common features. Suppose the number of classifiers is N and the number of feature subsets is M; then there are N × M possible "classifier-feature subset" pairs.
In other words, there is a vector of length M × N, and each element of the vector corresponds to a given "classifier-feature subset" pair. In fact, each feature subset is linked to a series of classifiers, and each link forms a pair.
For example, Table 2 shows an ensemble including 8 classifier-feature subset pairs. In this table, the SVM classifier is paired with feature subsets 3 and 4, the KNN classifier with feature subsets 1, 3 and 5, and the MLP classifier with feature subsets 2, 4 and 5. These 8 pairs take part in the ensemble process. The evolutionary algorithm aims to find a vector of 15 zeros and ones: bit 1 indicates that the corresponding "classifier-feature subset" pair will be present in the ensemble, and bit 0 that it will not. The optimal ensemble could in principle be found by enumerating and comparing all combinations, but this takes too much time because the number of cases is 2^(M×N); this number cannot be searched exhaustively even with few features and modals. Since a more efficient method is needed, this paper uses the particle swarm optimization algorithm in Phase 3 to address the issue.
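The pair encoding above can be sketched in a few lines of Python (an illustrative sketch reproducing the Table 2 example with N = 3 classifiers and M = 5 feature subsets):

```python
from itertools import product

classifiers = ["SVM", "KNN", "MLP"]   # N = 3 classifiers
subsets = [1, 2, 3, 4, 5]             # M = 5 feature subsets

# Every classifier-feature subset pair gets one position, so a candidate
# ensemble is encoded by a binary vector of length N * M = 15.
pairs = list(product(classifiers, subsets))

# A vector reproducing the 8 pairs of the Table 2 example:
vector = [0, 0, 1, 1, 0,   # SVM with subsets 3 and 4
          1, 0, 1, 0, 1,   # KNN with subsets 1, 3 and 5
          0, 1, 0, 1, 1]   # MLP with subsets 2, 4 and 5

ensemble = [p for p, bit in zip(pairs, vector) if bit == 1]
print(len(pairs), len(ensemble), 2 ** len(pairs))  # → 15 8 32768
```

Even for this toy size the search space has 2^15 = 32768 candidate ensembles, which is why exhaustive enumeration does not scale.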
Since the ensemble problem involves few classifiers and modals, the search for the best ensemble has few dimensions compared with other searches. Therefore, considering the low dimension of the search space, adopting a classic algorithm like PSO is a good choice. Table 2: structure of a vector of length 15; each bit shows whether the corresponding classifier-feature subset pair is part of the ensemble.

Phase 3: Finding the optimum ensemble using PSO
In the PSO algorithm, after forming the initial population of zero-one vectors, the training data are divided into training and testing groups in order to evaluate each vector; for each element of the vector with value 1, the corresponding classifier is combined with the corresponding feature subset. The classifiers are trained on the training data. To reduce the runtime, the classifiers are trained and stored in advance and then reused in the PSO evaluation function. For example, if the number of classifiers is 3 and the number of modals is 5, then, as in Table 3 (structure of a combination using the particle swarm optimization algorithm), 3 classifiers are trained for each of the modals, i.e., feature subsets. The PSO evaluation function receives a vector and evaluates it: the selected classifiers are tested on the testing data, and the whole vector yields a label which is the result of the majority vote of all participants. The higher the accuracy of this majority vote, the better the combination. The aim of this stage is to find the best combination for the ensemble.
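The PSO evaluation function described above can be sketched as follows (an illustrative Python sketch; the pairs' predictions are assumed to be pre-computed and cached, as the text indicates):

```python
from collections import Counter

def ensemble_accuracy(vector, pair_predictions, true_labels):
    """Fitness of a candidate ensemble vector.

    vector: one bit per classifier-feature subset pair.
    pair_predictions: for each pair, its cached predictions on the
    test split (the classifiers are trained once in advance).
    """
    active = [preds for bit, preds in zip(vector, pair_predictions) if bit]
    if not active:
        return 0.0
    correct = 0
    for i, truth in enumerate(true_labels):
        votes = Counter(preds[i] for preds in active)
        if votes.most_common(1)[0][0] == truth:
            correct += 1
    return correct / len(true_labels)

# Three cached pairs, three test samples.
preds = [[1, 0, 1], [1, 1, 1], [0, 0, 1]]
print(ensemble_accuracy([1, 1, 1], preds, [1, 0, 1]))  # → 1.0
```

PSO then searches over the binary vectors, using this accuracy as the fitness to maximise; caching the per-pair predictions makes each evaluation a cheap vote count rather than a training run.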

Phase 4: Model evaluation
After the best combination for the ensemble has been found, a weighted majority vote is conducted among these classifiers for classification. Based on the validation data, classifiers that gain higher weight have more influence on the final decision. In this phase, the model is evaluated and verified, and finally cancer is diagnosed using the proposed model.
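The weighted vote can be sketched as follows (using validation accuracy as the weight is an assumption for illustration; the text only states that more accurate classifiers get more influence):

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Weighted majority vote: each classifier's vote counts with its weight
    (e.g. its accuracy on the validation data)."""
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w
    return max(scores, key=scores.get)

# Two weak classifiers are outvoted by one strong classifier.
print(weighted_vote([1, 0, 0], [0.9, 0.3, 0.3]))  # → 1
```

Unlike a plain majority vote, a single high-weight classifier can override several low-weight ones, which is exactly the intended effect of the weighting.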

Experimentation and Comparisons
In this section, the results of the experiments conducted on the datasets are analyzed. All experiments are conducted 10 times and the average result is reported. The source codes of the proposed method, the Firefly Single Modal Feature Selection method and the Firefly Multimodal Feature Selection method are available [1]. MATLAB R2017b is used as the programming language. The experiments were run on a computer with an Intel Core i7 CPU at 2.20 GHz and 8.00 GB of RAM.

Datasets
To evaluate the proposed method and compare it with other methods, seven high-dimensional datasets are used. The details of these datasets, including the name, number of features, number of classes, number of samples and imbalance ratio, are shown in Table 4. The imbalance ratio (IR) of the datasets is given in the last column of Table 4. Considering the IR values, most of the datasets are balanced except for the Lung and Lung Cancer datasets; therefore, the accuracy measure is used for the evaluations. In all experiments, we randomly used 70% of the data for training and the rest for testing. For reliability, we ran each experiment 10 times and report the average results.
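This evaluation protocol (random 70/30 split, repeated 10 times, average accuracy reported) can be sketched as follows; the `train_and_score` callback is a hypothetical stand-in for training and testing any of the compared models:

```python
import random

def repeated_holdout(n_samples, train_and_score, runs=10, train_frac=0.7,
                     seed=0):
    """Average accuracy over repeated random 70/30 holdout splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        cut = int(train_frac * n_samples)
        accuracies.append(train_and_score(idx[:cut], idx[cut:]))
    return sum(accuracies) / runs

# Dummy scorer that just reports the training fraction used.
print(round(repeated_holdout(10, lambda tr, te: len(tr) / (len(tr) + len(te))), 3))
# → 0.7
```

Averaging over independent random splits reduces the variance that a single lucky or unlucky split would introduce, which matters on microarray data with very few samples.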

Evaluation Criteria
The evaluation criterion used in the result tables for evaluating the quality of the proposed method is the average classification accuracy of the system.

Parameter Setting
In every experiment, the firefly algorithm parameters α, β and σ were set to 0.4, 0.2, and 1, respectively. The threshold parameter Tr equals 0.95; the threshold is tuned using the validation set. In the fitness function, the parameter w is 0.01. The parameter of the fuzzy clustering algorithm, which determines the number of modals, is considered with values 5, 8, 10 and 15. The number of independent runs for the evaluation of the proposed algorithm is 10, the maximum number of iterations is 1000, and the initial population size is 1000.

Evaluation of the results and comparison of performance
The performance of the proposed method is compared with the following feature selection methods: Firefly Single Modal Feature Selection [45]: this model, like a simple evolutionary method, selects the features using the firefly algorithm and then diagnoses the cancer. In this method, the population is not divided, so there is a single modal. This method is called FSMFS.
Firefly Multimodal Feature Selection: this method selects the features using the multimodal optimization method, and each modal suggests one feature subset. Finally, it selects the best modal (rather than using all the modals) and uses it for diagnosing the cancer. This method is called FMFS.
In order to evaluate these two methods, the experiments are conducted with the K-nearest neighbor algorithm (KNN, K=1), an ensemble of classifiers (random forest (RF)), the C4.5 decision tree (DT), the Multiclass Support Vector Machine (ECOC) and the naïve Bayes algorithm (NB). In the proposed method, the classifiers KNN (K=1), KNN (K=3), KNN (K=5), linear SVM, naïve Bayes, C4.5 tree and random forest are used in the ensemble phase. We do not report a per-classifier comparison for the proposed method, since its classifiers together constitute an ensemble classifier.
Table 5 shows the results of the proposed method and the other methods on the datasets with 5 modals. As mentioned above, the proposed method uses 7 classifiers. In 5 datasets, the proposed method has the highest accuracy. The FSMFS method differs significantly from the proposed method on most datasets. The FMFS method differs significantly from the proposed method for some classifiers, and all results indicate the better performance of the proposed method.
The proposed method, in which the potential and information of all modals are used, performs better and more promisingly on all datasets except Brain Cancer and DLBCL, compared with the single-modal and multimodal modes in which only the best modal is selected. On the Brain Cancer dataset, the FMFS+NB method performs best; on the DLBCL dataset, the FSMFS+NB method performs best.
Table 6 also shows the results of the proposed method and other feature-selection methods for the 7 datasets.

Percentage of selected features
The percentages of features selected by the proposed method and by other feature-selection methods (GATFRO [50], ACTFRO [50], GSFR [51], GLO [52], DEGR [53], FS-JMIE [54] and PSO) for the MLL and SRBCT datasets are shown in Table 7. As can be observed, the proposed method selects fewer features than the other methods. If the proposed method, like the other methods, used a higher percentage of features, it could achieve higher accuracy.

The effect of increasing the number of modals
This section analyzes the effect of increasing the number of modals. Considering the results shown in Figure 5, a larger number of modals is followed by lower accuracy for some datasets, and in other cases by only an insignificant improvement in accuracy. If a large number of modals are selected, the dimension of the ensemble search problem increases significantly, which can prevent the optimal ensemble search process from converging. Also, labeling unseen data takes longer with a large number of modals, which is not suitable for real-world applications. For our suggested method, we prefer a small number of modals (i.e., fewer than 10).

Receiver Operating Characteristics (ROC)
It is used to rate the quality of the classifier's predictions. The ROC curve plots the false positive rate on the x-axis against the true positive rate on the y-axis. If the curve lies in the upper-left corner of the ROC space, the prediction is accurate. The ROC output shown in figure 6 implies that the curves of the proposed method lie close to the top-left corner of the ROC space, which indicates that the classifier predictions are accurate.
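As a minimal sketch of this evaluation, the (FPR, TPR) points and the area under the curve can be computed by sweeping the decision threshold. The labels and scores below are invented for illustration, not the paper's microarray results:

```python
# Illustrative ROC computation; y_true/y_score are made-up data.

def roc_points(y_true, y_score):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the (FPR, TPR) curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3]
pts = roc_points(y_true, y_score)
print(auc(pts))  # a curve hugging the top-left corner yields an AUC near 1
```

A curve lying on the diagonal would instead yield an AUC of 0.5, the expected value for random guessing.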

Effect of the parameter w on the number of features and modal commonality
In this section, we calculate the effect of the parameter w on the mean number of features and on modal commonality. The variable w adjusts the number of selected features by imposing a penalty on an increase in the number of features. To do so, ten different values were assumed for the parameter so as to show its effect on all datasets. As shown in Table 8, an increased value of the parameter is, as expected, followed by a reduced mean number of modal features, and vice versa.
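The paper does not spell out the exact fitness formula here, so the role of w can only be sketched. Assuming (for illustration) a relevance term, such as a mutual information score, with a linear size penalty:

```python
def fitness(relevance, n_selected, n_total, w):
    # Assumed form for illustration only: reward feature relevance and
    # penalize larger subsets in proportion to the parameter w.
    return relevance - w * (n_selected / n_total)

# With a larger w, the smaller of two equally relevant subsets scores
# strictly better, so the search is pushed toward fewer features.
small = fitness(relevance=0.80, n_selected=10, n_total=1000, w=2.0)
large = fitness(relevance=0.80, n_selected=50, n_total=1000, w=2.0)
print(small > large)  # True
```

With w = 0 the penalty vanishes and subset size no longer affects the score, matching the observation that smaller w values allow larger mean numbers of features.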
In the following, the effects of different values of the parameter w on feature commonality between different modals are addressed. As shown in Table 6, a larger value of the parameter is followed by a reduced number of features in the modals, and the inverse is also true. The suggested method is intended to reduce the commonalities between different modals, and this is verified experimentally below.
To do so, suppose that for the lung dataset we look for 3 modals (i.e., feature subsets). To attain a clearer visual presentation, the extent of commonality between the modals is visualized through a Venn diagram. Based on part a of figure 7, the percentage of feature commonality between the modals is insignificant. Also, out of the total number of features for the three modals, 30.9 percent were in modal 1, 31.5 percent in modal 2, and 33.5 percent in modal 3.
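Such commonality percentages can be checked directly from the selected feature-index sets. The three index sets below are hypothetical, chosen only to illustrate the computation, not the actual lung-dataset modals:

```python
# Hypothetical feature-index sets for three modals.
modal_1 = {0, 1, 2, 3, 4, 5, 6}
modal_2 = {6, 7, 8, 9, 10, 11}
modal_3 = {12, 13, 14, 15, 16, 17}

total = len(modal_1 | modal_2 | modal_3)       # size of the combined pool
shared_12 = modal_1 & modal_2                  # features common to modals 1 and 2
share_1 = len(modal_1 - modal_2 - modal_3) / total  # fraction unique to modal 1

print(sorted(shared_12), round(share_1, 3))
```

A low value for each pairwise intersection, as in part a of figure 7, indicates that the modals are largely disjoint answers rather than copies of one solution.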
Based on the previous reviews and as shown in figure 7, for the multimodal method the percentage of feature commonality between different modals is low, and multimodal feature selection does not lead to finding similar answers. Even with the largest number of features in the modals (figure 7.a), their commonality percentage is very low, and there is even a case in which no commonality between modals is found (figure 7.d).

Result Discussion
In the suggested method, a firefly may not belong to a single cluster or modal because there are multiple sub-populations in the search space. In other words, fireflies lying between two clusters can concurrently belong to both of them. Non-fuzzy (crisp) clusterers associate each datum (a candidate solution modeled by a firefly) with exactly one cluster. To handle the problem of border fireflies and their possible partial membership in both modals, a fuzzy clustering model was used.
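A minimal sketch of such graded membership, in the style of fuzzy c-means with fuzzifier m = 2; the one-dimensional firefly positions and cluster centres below are invented for illustration:

```python
def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy-c-means-style membership of `point` in each cluster centre."""
    d = [abs(point - c) for c in centers]
    if 0.0 in d:  # point coincides with a centre: crisp membership
        return [1.0 if x == 0.0 else 0.0 for x in d]
    inv = [x ** (-2.0 / (m - 1.0)) for x in d]
    s = sum(inv)
    return [v / s for v in inv]

centers = [0.0, 10.0]                    # two modal (cluster) centres
print(fuzzy_memberships(5.0, centers))   # border firefly: equal membership
print(fuzzy_memberships(1.0, centers))   # mostly belongs to the first modal
```

A border firefly halfway between the two centres receives membership 0.5 in each, which is exactly the case a crisp clusterer cannot express.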
To review the performance of the suggested method, KNN (K=1), KNN (K=3), KNN (K=5), linear SVM, naïve Bayes, C4.5 decision tree, and random forest classifiers were used in the ensemble phase.
This paper primarily emphasizes the ensemble method, so weak base classifiers were used rather than relying on the results of strong classifiers. The performance of FSMFS and FMFS was also reviewed with the k-nearest neighbors algorithm, ensemble classifiers, and the C4.5 decision tree, along with a support vector machine, the naïve Bayes algorithm, and random forest.
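The paper does not restate the exact combination rule at this point, so as a hedged sketch, plain majority voting over the base classifiers' predictions is assumed:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists by majority vote per sample."""
    # predictions[i][j] = label assigned to sample j by classifier i
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]

# Three hypothetical base classifiers labelling four samples.
preds = [
    ["cancer", "normal", "cancer", "normal"],
    ["cancer", "cancer", "cancer", "normal"],
    ["normal", "normal", "cancer", "normal"],
]
print(majority_vote(preds))  # ['cancer', 'normal', 'cancer', 'normal']
```

Because each base classifier is trained on a different modal's feature subset, their errors are partly independent, which is what lets the vote outperform any single member.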
The experiments were carried out in four cases, including the single-modal case, the multimodal case in which the best modal is selected, and a simple multimodal case. In order to perform the tests, 7 microarray datasets were used.
The effect of increasing the number of modals was also investigated. The results suggest that when a large number of modals is selected, the dimension of the ensemble-search problem increases significantly, which prevents the search for the optimal ensemble from converging. In addition, the process of labeling unseen data becomes longer due to the numerous modals, which is not suitable for real applications. In the final section, the effect of the parameter w on the mean number of features and the commonality of modals was shown: if the parameter increases, the mean number of modal features declines, and if it decreases, the mean number of modal features increases. In addition, the percentage of feature commonality between different modals is low for the multimodal model, and the provided multimodal feature selection does not lead to finding similar answers. The conducted tests point to the superiority and better performance of the suggested method compared with the single-modal method and the multimodal method in which the best modal is selected.

Conclusion
This paper proposed a method for multimodal feature selection based on a firefly-based multimodal optimization algorithm and the combination of its solutions on microarray datasets. The proposed method was composed of three phases. In the first phase, the firefly-based improved multimodal optimization algorithm completely explored the search space to find several optimal feature subsets. In the second phase, a classifier was trained with each of the feature subsets, and the ensemble of these classifiers exploited the potential and hidden information of all feature subsets simultaneously. For selecting the proper ensemble, the particle swarm optimization method was employed as the explorer. The efficiency of the algorithm was evaluated by applying it to 11 datasets. The aim of the present article was to use multimodal feature selection and to exploit the potential and information of several feature subsets. The results showed the superiority and proper function of the multimodal feature selection method compared with other methods. The effect of increasing the number of modals in the proposed method was also examined on different datasets.

Figures

Figure 1 Optimization Algorithms to Solve Feature Selection Problem

Figure 3 Identification of the non-explored points

Table 4 .
Characteristics of the used datasets.

Table 5 .
Comparison of the accuracy of the proposed method with other methods

Table 6 .
Comparison of the results of the proposed method with literature methods for seven data sets

Table 8 .
The Mean Number of the Selected Features for Each Dataset Based on Different Values of the Parameter w.