Robust Feature Selection by Filled Function and Fisher Score

Feature selection is essential in high-dimensional data analysis. Retaining all features in machine learning tasks is not only inefficient, but irrelevant and redundant features may also have an adverse impact on the classification accuracy rate. Feature selection is an optimization problem that aims to transform a dataset's high-dimensional space into a lower-dimensional one by keeping only the relevant and suitable features. Although feature selection is itself time-consuming, it is very effective in reducing the time devoted to the subsequent learning algorithm. Among feature selection algorithms, filter algorithms are increasingly attractive due to their simplicity and speed. In this paper, we introduce a supervised filter feature selection algorithm based on the filled function and the Fisher score (FFFS). Based on this criterion, we seek a feature subset that yields the least classification error rate. To demonstrate the effectiveness of the proposed algorithm, extensive experiments have been conducted on 20 high-dimensional real-world datasets. Experimental results reveal the superiority of the proposed algorithm over state-of-the-art algorithms in terms of minimum classification error rate. Results validated through statistical analysis indicate that the proposed algorithm outperforms the reference algorithms by minimizing the redundancy of the selected features, so the selected feature subset can avoid serious negative impacts on the classification process in real-world datasets. In addition, this paper demonstrates the ability of the proposed algorithm to select the most relevant features for classification tasks by applying different noise rates to the datasets. According to the experiments, FFFS is less affected by noisy attributes than the other algorithms. Thus, it is a reasonable solution for handling noise and avoiding serious negative impacts on the classification error rate in real-world datasets.

Feature selection has become an important task in machine learning, and a central question is how to assess the quality of a selected subset for the subsequent steps. Searching for the best feature set therefore requires a criterion to compare subsets and find the best one, since this choice strongly affects the performance of high-dimensional data classification [10,11].
With the advancement of information technology over the last decade, many new applications have emerged as hubs for information gathering and dissemination. As a result, data velocity and variety have brought many challenging issues to the digital processing of raw data [12,13].
In recent years, with the rapid advances of computer and database technologies, there are many sources of data that can be used to solve various problems. Optimal use of these big and complex data sources has therefore become the focus of many research areas. One potential way to mitigate this problem is to reduce the amount of data with feature selection techniques [14,15].
Such enormous data can contain a lot of irrelevant, noisy, and redundant information, which may degrade the performance of learning algorithms. Hence, large quantities of high-dimensional data are not only unprofitable, but they also bring great challenges [16,17,18].
Feature selection is a very important preprocessing and knowledge discovery tool for data analysis. It has been applied in many domains as a key technique for high-dimensional data, and it has shown a promising ability to enhance a variety of other learning tasks, such as data classification, document analysis, image processing, and video processing, to name a few [22,32,33].
Generally, features can be categorized as relevant, irrelevant, or redundant. An optimal feature set is a description of the original data with relevant and non-redundant features that contribute most to learning accuracy [19,20].
One of the most important challenges in high-dimensional data analysis is the limitation in time, memory, and computational cost [18,21]. It is desirable to reduce high data dimensionality for many machine learning tasks due to the curse of dimensionality [12,15,22].
Using a proper set of features reduces the model complexity of classifiers and leads to faster convergence to optimal results, thereby improving classifier performance [17,23,24]. On the other hand, failing to remove redundant features does not cause a loss of information, but it raises the computational cost and increases the response time of the system. In addition, due to storage space constraints, it results in storing a lot of useless information along with useful data [25,26].
According to the accessibility of class labels, feature selection can be partitioned into three categories [27]. When class labels are accessible, it is called supervised feature selection [28]; when the data consist of samples and features without any information about the natural grouping of the data, it is called unsupervised feature selection [9,29,30,31]. Finally, integrating a small amount of labeled data into unlabeled data as additional information, i.e., running an unsupervised-style algorithm with extra supervision, is called semi-supervised feature selection [32,33,34].
The most basic technique searches all possible subsets and selects the subset that minimizes the model error. However, this exhaustive search is computationally impractical, particularly when a large number of features exist. Therefore, several feature selection algorithms have been developed [19,35].
Depending on how feature subsets are generated, feature selection algorithms can be divided into three categories: wrapper, filter, and hybrid or embedded algorithms [10,11,25]. Filter algorithms are independent of any classification scheme, so when the feature set changes, the learning model does not need to be rebuilt. Filter algorithms select key features using intrinsic properties of the data and, under particular conditions, they can give the optimal set of features for a given classifier [23,36]. A big challenge in filter algorithms is the difficulty of designing a fitness function, so there is less work on filter algorithms than on wrapper algorithms [1,35].
In this work, we focus on a filter model based on the filled function with a Fisher-score fitness function, and we aim to develop a new feature selection algorithm that can effectively remove irrelevant, noisy, and redundant features. To this end, the proposed algorithm first computes the relevancy of each feature and then computes the correlation values between features. This allows highly correlated features to be avoided.
For clarity, the main contributions of this paper are summarized as follows:
1- It is the first time that the efficient filled function algorithm is applied to a feature selection problem.
2- The Fisher score is used as a fitness function to rank the features.
3- The algorithm efficiently selects informative features and converges in a reasonable time.
4- The filled function algorithm for feature selection is defined as a supervised feature selection problem that maximizes the performance of the learning algorithm while minimizing the feature set size.
5- Experimental results on a wide variety of high-dimensional datasets show the superiority of the proposed algorithm in achieving the minimum classification error rate.
6- The efficiency and effectiveness of the filled function algorithm are demonstrated through extensive comparisons with other algorithms on 20 real-world high-dimensional datasets.
The rest of this paper is organized as follows. In Section 2, the related literature on filter feature selection algorithms is reviewed. The proposed algorithm is explained thoroughly in Section 3. Section 4 reports the experimental results and analyzes the performance of different feature selection algorithms. Finally, Section 5 contains the conclusions and future work.

Related Works
Filter algorithms select the most discriminative features based on the characteristics of the data; that is, the intrinsic properties of the data are used for feature selection without the need for any classification algorithm. The main advantages of filter algorithms are their speed and scalability [1,37].
Within the filter model, each feature can be evaluated individually or through feature subsets, so these algorithms can be further categorized into two groups: univariate and multivariate [38].
Univariate algorithms evaluate each feature without regard to the other features and build a ranked list based on the efficacy of each feature according to the criterion; the features whose scores exceed a threshold value are then selected. The simplicity and light computational cost of feature ranking make it a widespread approach to feature selection [39].
However, such algorithms may not be effective at selecting the best features. One point of contention is that a seemingly redundant subset might in fact be optimal on certain datasets, because little attention has been devoted to the impact of correlation between features [40]: a feature may not be rated highly on its own, yet it can play an effective role in the selected feature set due to its correlation with other features [39,41].
Meanwhile, multivariate algorithms evaluate features together, estimating the relevance of each feature in order to obtain a feature subset. As an example of a multivariate criterion, linear discriminant analysis maximizes the ratio of between-class scatter to within-class scatter, but it has issues in calculating the inverse of the within-class scatter matrix [42,43].
The graph-based feature selection algorithm is another multivariate filter algorithm, which constructs a neighborhood graph that preserves the geometrical structure of the data and then selects the relevant features [44].
None of the above works achieves a balance between the classification accuracy rate and the feature reduction rate. To tackle this problem, this paper proposes a novel filter feature selection algorithm in which the relevant features are selected using the filled function. The main aim of this paper is therefore to provide an algorithm with a reduced classification error rate and an optimal reduction in the number of features.

Background
In this section, the filled function algorithm is described first, and the definition of the Fisher score is then briefly reviewed.

Filled Function Algorithm
For high-volume data, due to the curse of dimensionality, it is important to find an optimal subspace of the feature set. Studies on global optimization for feature selection problems have accelerated significantly in recent years. An optimal subspace yields a lower classification error rate and a lower computing time.
The filled function algorithm was introduced by Ge in 1990 [51]. The general form of the filled function is given in Eq. (1), where the functions involved map R^n to R and Ω is a box. Let x* be a local minimizer of problem (F). The filled function is an auxiliary function whose properties make it suitable for global optimization. To optimize the problem, the filled function converts a local minimum point of the original problem into a local maximum point of the auxiliary function; this maximum point is a better initial point for the next iteration of problem solving. In other words, when x* is not a global minimizer, the filled function constructed at x* can be further minimized to reach a point, say x', at which f(x') is lower than f(x*). The minimization of f(x) is then restarted from x', and the process is repeated until a global minimizer of f(x) is obtained. The filled function algorithm used here has been introduced in [51]. The global optimization problem (F) is considered in the following form in Eq. (2): min f(x), s.t. x ∈ Ω, where Ω = [l, u] ⊂ R^n is a closed bounded set and f(x) is a continuous and differentiable function with a finite number of local minima in Ω [52]. First, a flatten function is applied to the objective function to eliminate numerous local minima, just as Sui and Liu do in [53] and [54]. The flatten function was introduced by Wang [55] and takes the form given in Eq. (3) and Eq. (4),
where f(x) is the original function and x* is its current local minimum. By using it, all local minima equal to or worse than x* are eliminated. Thus, many locally optimal solutions are removed, which makes the algorithm focus its attention only on local optimal solutions better than the current one x*, and thus makes finding the global minimum much easier. For the properties and detailed information, please refer to [55]. Let Gm and Lm represent the sets of global minima and local minima of the original function f(x), respectively. Definition 1. Assume x* is a local minimum of f(x); if p(x, x*) satisfies the filled-function conditions given in [52], then p(x, x*) is referred to as a filled function of the original function f(x) at x*.
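As an illustrative sketch only, one common flatten construction is s(x, x*) = min{f(x), f(x*)}, which keeps f wherever it is below f(x*) and flattens everything at or above that level; whether this is exactly the form of Eqs. (3)-(4) used in [55] is an assumption here.

```python
import numpy as np

def flatten(f, x_star):
    """Flattened objective s(x, x*) = min(f(x), f(x*)).

    Assumption: this is a standard 'flatten' construction; the exact form
    used by Wang [55] may differ in its details.
    """
    f_star = f(x_star)
    return lambda x: min(f(x), f_star)

# Toy usage on a 1-D function with several local minima.
f = lambda x: np.sin(3 * x) + 0.1 * x ** 2
s = flatten(f, 2.0)          # flatten around a (non-global) local point
print(s(2.0), s(-0.5))       # flat at f(2.0) for worse points, unchanged below it
```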
The filled function without any adjustable parameter at the current minimum x* is constructed by the formula in Eq. (5) [52],
where α is a constant whose value is the reciprocal of the order of magnitude of s(x*, x*). Note that after the current local minimum x* has been found, the search focuses on Ω2; accordingly, p(x, x*) is a continuously differentiable function in Ω2. In fact, in the regions Ω1 and Ω2, p(x, x*) is continuous and differentiable, respectively. We now show that the proposed p(x, x*) in Eq. (5) is a filled function satisfying the conditions of Definition 1.
Thus, in region Ω1, no point is a stationary point of p(x, x*).

Theorem 3
Suppose that x* ∈ Lm but x* ∉ Gm; then there exists a local minimum of p(x, x*) in Ω2. Since there is at least one point in Ω2, the closure of Ω2 is a nonempty bounded closed set.
This minimum is at least a local minimum of p(x, x*). Assume that x* is the current local minimum of the original function f(x) and that the filled function constructed at this point is p(x, x*). In existing filled function algorithms, a widely used scheme to generate the initial points is given in Eq. (6) [52]: x_i = x* + δe_i and x_i = x* - δe_i for i = 1, ..., dim, where δ is a small step size and e_i represents the i-th coordinate axis direction.
However, if the local minima of p(x, x*) do not lie near the coordinate directions, p(x, x*) may not reach a good local minimum from these initial points, and the local minimum found may not enter a deeper valley of f(x). This may result in the failure of the algorithm [52]. We take the two-dimensional problem f(x) = 2x1^2 - 1.05x1^4 + (1/6)x1^6 - x1x2 + x2^2 as an example to illustrate this. Assume x* is its current minimum; the landscape of the proposed filled function p(x, x*) is displayed in figure 1a [52]. Let the blue solid point denote x* in figure 1b, where figure 1b is a top view of figure 1a. We can see from figure 1b that when the points x1 to x4 generated by Eq. (6), shown as red circles, are used as initial points, the local minimum y* is not located near them. It is very difficult, or requires much computational cost, to find the local minimum y* from these initial points. That is to say, the initial points generated by Eq. (6) may not be good initial points. Instead, if we choose x5 as an initial point, we can easily find y*; thus, x5 may be a good initial point. Based on this observation, it is necessary to design a scheme that adaptively selects the initial points instead of choosing them in a fixed manner. To this end, a new algorithm is developed that generates the initial points from x* while adaptively changing both the directions (not fixed to the axis directions) and the number of directions. The details are as follows [52]: Step 1. The coordinate axis directions e_i and -e_i are taken as the initial directions to generate initial points according to Eq. (6), where δ = 0.001. Let N = 2. The number of initial points is P_num = N * dim, where dim is the number of dimensions of the problem.
Step 2. If we cannot find a local minimum of p(x, x*) by using these initial points, increase N, set P_num = N * dim, and generate P_num new initial points x_i = x* + δd_i for i = 1, 2, ..., P_num along adaptively chosen, uniformly distributed directions d_i.
Step 3. Repeat Step 2 until we find a local minimum or N reaches a pre-assigned threshold.
To elaborate on this strategy, we use a two-dimensional problem as an example to illustrate the idea of generating initial points in Fig. 2. The blue point x* is the current minimum of f(x) and y* is a local minimum of the filled function [52]. First, we take the coordinate axis directions d1 to d4, shown by the red dotted arrows, to generate the initial points x1 to x4, as shown in Fig. 2a. We can see from figure 2a that none of the four initial points is near y*, so it is difficult to find y* from these initial points.
In this case, we adaptively choose the uniformly distributed directions d1 to d6 and generate new initial points x1 to x6, as shown in figure 2b. Now x5 is near y*, and by using it as the initial point we can easily find y*, as shown in figure 2b [52].
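Below is a minimal sketch of this adaptive initial-point scheme for the two-dimensional case, assuming the P_num directions are taken as uniformly spaced unit vectors on the circle; the exact direction-generation rule of [52] may differ, and the step size follows Eq. (6).

```python
import numpy as np

def initial_points_2d(x_star, N, delta=0.001):
    """Generate P_num = N * dim initial points around x_star (dim = 2).

    Directions are P_num uniformly spaced unit vectors; each point is
    x_i = x_star + delta * d_i, matching Eq. (6) with adaptive directions.
    """
    dim = 2
    p_num = N * dim
    angles = 2 * np.pi * np.arange(p_num) / p_num
    directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return x_star + delta * directions

x_star = np.array([0.5, -0.2])
print(initial_points_2d(x_star, N=2))   # 4 points along the +/- axis directions
print(initial_points_2d(x_star, N=3))   # 6 uniformly spread points
```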

Fisher Score
Some supervised feature selection algorithms evaluate features according to quality criteria that quantify the relevance of each feature individually. The selection task can be efficiently accomplished using the Fisher score for feature ranking, which leads to a suboptimal subset of features [56].
Consider the classification of C classes. Given n_i training samples {x_1, x_2, ..., x_{n_i}} for each class i (i = 1, 2, ..., C), the a priori probability of class i is estimated by Eq. (7): P_i = n_i / n, where n is the total number of samples. The class means are estimated by Eq. (8): m_i = (1/n_i) Σ_{j=1}^{n_i} x_j, and the gross mean is estimated by Eq. (9): m = Σ_{i=1}^{C} P_i m_i. The sample covariance matrix Cov_i of class i is estimated by Eq. (10): Cov_i = (1/n_i) Σ_{j=1}^{n_i} (x_j - m_i)(x_j - m_i)^T. The within-class scatter matrix and the between-class scatter matrix are estimated by Eq. (11) and Eq. (12): S_w = Σ_{i=1}^{C} P_i Cov_i and S_b = Σ_{i=1}^{C} P_i (m_i - m)(m_i - m)^T. The class separability of a feature set can then be measured by Eq. (13): J = tr(S_w^{-1} S_b). This measure serves as a good criterion for feature subset selection and has shown superior performance in many practical problems. However, its calculation for a large number of features is computationally expensive. Instead, the Fisher criterion for a single feature has been prevalently used to select discriminant features. For the k-th feature, it is calculated by Eq. (14): J_k = S_b(k) / S_w(k), where S_b(k) and S_w(k) are the k-th diagonal elements of S_b and S_w, respectively, and can also be calculated from the data of a single feature.
For feature pre-selection, we calculate the Fisher criterion of each feature, order the features in decreasing order of criterion value, and simply select the features with the largest values, while the features with very small Fisher values are discarded.
Though the single-feature Fisher criterion does not consider the joint separability of multiple features, it is able to retain all discriminant features by removing only irrelevant or noisy features, for which the Fisher criterion is nearly zero.
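The following is a minimal sketch of the single-feature Fisher criterion of Eq. (14); the array names and the random toy data are illustrative only, not part of the original experiments.

```python
import numpy as np

def fisher_scores(X, y, eps=1e-12):
    """Single-feature Fisher criterion: between-class over within-class
    variance of each feature, with classes weighted by their priors."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    overall_mean = X.mean(axis=0)          # equals the prior-weighted gross mean
    s_b = np.zeros(X.shape[1])
    s_w = np.zeros(X.shape[1])
    for c, p in zip(classes, priors):
        Xc = X[y == c]
        s_b += p * (Xc.mean(axis=0) - overall_mean) ** 2
        s_w += p * Xc.var(axis=0)
    return s_b / (s_w + eps)

# Rank features by decreasing Fisher score (pre-selection step).
X = np.random.randn(100, 6)                # toy data: (n_samples, n_features)
y = np.random.randint(0, 2, 100)
scores = fisher_scores(X, y)
ranking = np.argsort(scores)[::-1]         # feature indices, best first
```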

Proposed Algorithm for Feature Selection
In this section, we propose a multivariate filter-based feature selection algorithm using the filled function and the Fisher score. Owing to the nature of the feature selection problem, the solutions are limited to binary values, whereas real-world datasets have a continuous space. Therefore, to solve the feature selection problem, the continuous (free) positions must be transformed into their corresponding binary solutions.
For example, if the value of a bit equals 1, its corresponding feature is selected into the feature subset; a value of 0 indicates the opposite. Figure 3 shows this clearly: the second and fourth features are selected for the feature subset. The framework of the FFFS algorithm is shown in Algorithm 1. In iteration t of the proposed algorithm, there are three phases. First, to reach the best feature set, the problem is minimized to obtain a local minimum x_t* by using an existing local optimization algorithm. Next, we adopt a flatten function s(x, x_t*) that eliminates the local minima equal to or worse than the current minimum x_t*; this makes the search for a global minimum much easier, and the Fisher score function helps rank the results to make the selection easier. Finally, we construct a filled function p(x, x_t*) at x_t* and optimize it from the adaptively selected initial points. If we cannot find a feature set better than x_t*, we use the widening strategy; otherwise, we use this better feature set as the new initial point x_{t+1}^0 in the (t+1)-th iteration to minimize the original function. These phases are repeated until a global minimum of the original function is found.
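As a small illustration of this mapping, the sketch below thresholds a continuous position vector into a bit string and extracts the corresponding feature subset; the 0.5 threshold and the toy arrays are assumptions for illustration only.

```python
import numpy as np

# Continuous 'position' of a candidate solution, one value per feature.
position = np.array([0.12, 0.87, 0.33, 0.95, 0.41])

# Threshold into a bit vector: 1 = feature selected, 0 = discarded.
bits = (position > 0.5).astype(int)        # -> [0, 1, 0, 1, 0]
selected_idx = np.flatnonzero(bits)        # -> [1, 3], i.e. the 2nd and 4th features

# Reduce a dataset X of shape (n_samples, n_features) to the chosen subset.
X = np.random.randn(10, 5)
X_subset = X[:, selected_idx]
```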

Initialization steps:
Step 1: Choose an initial point x_1^0 ∈ Ω for the original problem.
Step 3: Set the initial value N = 2; the number of initial directions is then P_num = 2 * dim, where dim is the number of dimensions of the problem.
Step 4: Set the threshold of N, N_max.

Main steps:
Step 1: Minimize the original problem (F) from x_t^0 by applying an existing local search algorithm to get a local minimum x_t*.
Step 2: Eliminate the local minima inferior to x_t* by using the flatten function.
Step 3: Sort the results by Fisher score.
Step 4: Construct the filled function p(x, x_t*) at x_t*.
Step 5: Generate the search directions d_i, i = 1, 2, ..., P_num, where P_num = N * dim, and set j = 1.
Step 6: If j < P_num, go to Step 7; otherwise, go to Step 9.
Step 7: Generate the initial point for p(x, x_t*) as x_j = x_t* + δd_j. Minimize p(x, x_t*) from x_j to get a local minimum p* by adopting a local search algorithm. If p* is not in region Ω, let j = j + 1 and go back to Step 6; otherwise, go to Step 8.
Step 8: If f(p*) < f(x_t*), set t = t + 1, x_t^0 = p*, and N = 2, then go back to Step 1; otherwise, let j = j + 1 and go back to Step 6.
Step 9: Set N = N + 1.
Step 10: If N < N_max, go back to Step 5; otherwise, go to Step 11.
Step 11: Implement the narrow-valley widening strategy on s(x, x_t*) to get a better minimum x_t* in this valley.
Step 12: If the local minimum cannot be improved, terminate the process and take x_t* as the global minimum of the original problem; otherwise, let N = 2 and go back to Step 2.

Algorithm 1: Pseudocode of the FFFS algorithm
In Algorithm 1, δ is a small positive number in [0, 1], B is a large positive number, and dim is the number of dataset features.
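To make the control flow of Algorithm 1 concrete, a simplified sketch of its continuous optimization loop is given below. This is not the authors' implementation: the flatten function and the filled function here are crude stand-ins with the same qualitative behaviour (assumptions), the Fisher-score ranking, binary feature mapping, and valley-widening steps are omitted, and a SciPy local optimizer plays the role of the "existing local search algorithm".

```python
import numpy as np
from scipy.optimize import minimize

def fffs_search(f, x0, bounds, n_max=6, delta=0.001):
    """Skeleton of the filled-function loop of Algorithm 1 (continuous part only)."""
    dim = len(x0)
    x_star = minimize(f, x0, bounds=bounds).x        # Main Step 1: local search
    f_star = f(x_star)
    N = 2
    while N < n_max:
        s = lambda x: min(f(x), f_star)              # flatten function (assumed form)
        p = lambda x: s(x) - f_star - np.linalg.norm(x - x_star)  # placeholder filled fn
        p_num = N * dim                              # Main Step 5: number of directions
        dirs = np.random.randn(p_num, dim)
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        improved = False
        for d in dirs:                               # Main Steps 6-8
            y = minimize(p, x_star + delta * d, bounds=bounds).x
            if f(y) < f_star:                        # entered a deeper valley of f
                x_star = minimize(f, y, bounds=bounds).x
                f_star = f(x_star)
                N = 2
                improved = True
                break
        if not improved:
            N += 1                                   # Main Step 9: more directions
    return x_star, f_star

# Toy usage on the two-dimensional test function quoted in the text.
f = lambda x: 2*x[0]**2 - 1.05*x[0]**4 + x[0]**6/6 - x[0]*x[1] + x[1]**2
print(fffs_search(f, np.array([2.0, 2.0]), bounds=[(-5, 5), (-5, 5)]))
```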
To summarize, the labeled data are first mapped into two hypersphere regions for target and non-target objects. This mapping process is formulated as a nonlinear programming problem, which is solved by employing the Fisher score to find a global minimizer. The global minimizer is taken as a boundary that fits the target class, and in the end a one-class classifier for detecting target-class members is obtained. Experimental results are explained in the next section.

Experimental Results
In this section, in order to measure the efficiency of the proposed algorithm, the experimental results are discussed. The experiments were performed on various high-dimensional datasets, comparing the proposed FFFS algorithm against three other state-of-the-art feature selection algorithms. First, the datasets used are introduced concisely. Next, the experimental and parameter setups are discussed. Finally, the performance of the proposed algorithm is evaluated against the other relevant algorithms.

Datasets
The results presented in the experiments are based on 15 real-world datasets from the UCI machine learning repository and 5 medical datasets [57,58]. These 20 datasets have different numbers of attributes and instances, as shown in Table 1, which lists the number of samples, features, and classes for each dataset. In all datasets, missing values are replaced with the median of the corresponding feature.
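As a small sketch of this preprocessing step, missing entries can be replaced by the per-feature median as shown below; representing missing values as NaN is an assumption made here for illustration.

```python
import numpy as np

# X: (n_samples, n_features) array in which missing values are stored as NaN.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

col_medians = np.nanmedian(X, axis=0)   # per-feature median, ignoring NaNs
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_medians[cols]       # replace each missing value with its feature median
```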

Compared Algorithms
The proposed algorithm is compared on classification problems with a feature selection algorithm based on Chebyshev distance outlier detection called NFR-Relief [59], a feature selection algorithm combining the maximal information entropy and the maximal information coefficient called mMIE-mMIC [60], and a fully-connected network called NeuralFS [61]. All algorithms have been tested on the 20 datasets in terms of the classification error rate and F-MEASURE performance measures.

Experimental and Parameters Setup
The experiments were conducted on an Intel Core i3 platform with 4 GB of RAM and a 2.13 GHz clock frequency. The programming environment is MATLAB R2014a on the Windows 10 operating system.
Several experiments were carried out to demonstrate the efficiency and effectiveness of the FFFS algorithm. All experiments use the same parameters for the initialization and evolution of FFFS. To be fair, all comparison algorithms use the parameters reported in their corresponding original works.
To verify the performance of the proposed algorithm, 3-fold cross-validation is used to partition each dataset into three equal portions. Within each training partition, 70% is used for training and the remaining 30% is used for fitness-function validation.
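A minimal sketch of this evaluation protocol using scikit-learn is given below; the exact splitting procedure of the original experiments (which were run in MATLAB) may differ.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# X: (n_samples, n_features), y: labels (toy data here).
X = np.random.randn(90, 20)
y = np.random.randint(0, 2, 90)

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_train_full, y_train_full = X[train_idx], y[train_idx]
    # 70% of the training partition for training, 30% for fitness validation.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train_full, y_train_full, test_size=0.3, random_state=0)
    # ... run feature selection on (X_tr, y_tr) and evaluate its fitness on (X_val, y_val)
```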

Evaluation Metrics
The error rate (ERR), the F-MEASURE, and the computational time are used as criteria for comparing the proposed algorithm with the three other state-of-the-art algorithms.

ERR
The classification process should assign the correct label to each sample. To evaluate the performance of the data classification, the error rate is calculated by Eq. (15): ERR = (1/n) Σ_{i=1}^{n} δ(y_i, ŷ_i), where n is the total number of samples, y is the true label, ŷ is the label predicted by the classifier, and δ(y, ŷ) is the function that returns zero if y equals ŷ and one otherwise. Clearly, smaller values of ERR mean better classification results.
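The error rate of Eq. (15) can be computed directly from the true and predicted labels, as in the short sketch below; the k-NN classifier and the random toy data are illustrative stand-ins.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def error_rate(y_true, y_pred):
    """ERR = (1/n) * sum of delta(y_i, y_hat_i); delta is 0 on a match, 1 otherwise."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

# Illustrative usage with k-NN, as in the experiments.
X_tr, y_tr = np.random.randn(50, 10), np.random.randint(0, 2, 50)
X_te, y_te = np.random.randn(20, 10), np.random.randint(0, 2, 20)
clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print(error_rate(y_te, clf.predict(X_te)))
```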

F-MEASURE
In order to analyze the quality of the classification results obtained by the algorithms, the F-MEASURE is used as the weighted harmonic mean of the precision (positive predictive value) and recall (sensitivity) of the test. Precision and recall are defined in Eq. (16) and Eq. (17): Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. The F-MEASURE is then defined in Eq. (18): F-Measure = 2 * Precision * Recall / (Precision + Recall). The F-MEASURE is a good parameter for evaluating the quality of classification, describing the weighted harmonic mean of precision and recall. For an ideal classification algorithm this value is 1, and in the worst case it is 0.
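The sketch below computes precision, recall, and the F-measure of Eqs. (16)-(18) for a binary problem; the counts TP, FP, and FN follow the usual confusion-matrix convention.

```python
import numpy as np

def f_measure(y_true, y_pred, positive=1):
    """F-measure as the harmonic mean of precision and recall."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```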

Wilcoxon Rank Test
The Wilcoxon statistical test is also used to analyze the proposed algorithm more precisely. It is a non-parametric statistical test implemented to determine the statistical significance of differences between algorithms [62,63]. The Wilcoxon test is conducted at the 5% significance level to verify whether there is a statistically significant difference between the obtained error rates. In other words, if the p-value is less than 0.05, the proposed approach shows a significant difference [64].
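A minimal sketch of the paired Wilcoxon signed-rank test on per-dataset error rates is shown below; the numbers are illustrative, not results from the paper.

```python
from scipy.stats import wilcoxon

# Per-dataset error rates of two algorithms (illustrative values only).
err_fffs = [0.12, 0.08, 0.21, 0.05, 0.17, 0.09, 0.14, 0.11]
err_other = [0.15, 0.10, 0.25, 0.07, 0.16, 0.13, 0.18, 0.12]

stat, p_value = wilcoxon(err_fffs, err_other)
print(p_value < 0.05)   # True -> statistically significant difference at the 5% level
```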

Robustness Test
In this paper, we intentionally generate artificial Gaussian noise and add it to the original datasets. In this way, it is possible to analyze the capability of FFFS to select the most relevant features instead of noisy ones. The Gaussian noise is generated from the density in Eq. (19) [65,66]: f(x) = (1 / (σ√(2π))) exp(-(x - μ)^2 / (2σ^2)), where μ represents the mean and σ the standard deviation. In this paper, we use Gaussian noise with μ = 0 and large values of the standard deviation.
The features of the original datasets are transformed into noisy ones by adding or subtracting the generated noise to or from the original features [66]. In the first step, 10% noise was added; it was then increased to 40%. After that, the performance of the proposed algorithm was compared with the three other algorithms.
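The sketch below illustrates one way to realize this protocol, assuming that an "x% noise" level means perturbing a randomly chosen x% of the feature columns with zero-mean Gaussian noise added or subtracted; this interpretation and the parameter values are assumptions.

```python
import numpy as np

def add_feature_noise(X, noise_rate=0.10, sigma=1.0, seed=None):
    """Add or subtract zero-mean Gaussian noise on a random fraction of the features."""
    rng = np.random.default_rng(seed)
    X_noisy = X.copy()
    n_features = X.shape[1]
    n_noisy = max(1, int(round(noise_rate * n_features)))
    cols = rng.choice(n_features, size=n_noisy, replace=False)
    signs = rng.choice([-1.0, 1.0], size=(X.shape[0], n_noisy))
    X_noisy[:, cols] += signs * rng.normal(0.0, sigma, size=(X.shape[0], n_noisy))
    return X_noisy

X = np.random.randn(100, 30)
X_10 = add_feature_noise(X, noise_rate=0.10, sigma=2.0, seed=0)   # 10% noisy features
X_40 = add_feature_noise(X, noise_rate=0.40, sigma=2.0, seed=0)   # 40% noisy features
```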

Result Interpretation
In all reported tables, the third column shows the results on the original data, and in all figures the results of this column are marked in yellow. The last column belongs to the proposed algorithm, and the red marks in all figures belong to the FFFS algorithm. The best results on each dataset are marked in bold.

Error Rate
This experiment is presented here to intuitively show the classification error rate. Table 3 shows the results obtained by applying the above algorithms: for each algorithm, the average classification error rate (ERR) obtained by k-NN and its standard deviation are reported. Table 3 shows that the proposed algorithm outperforms the NFR-Relief, mMIE-mMIC, and NeuralFS algorithms in terms of accuracy, with the best results indicated by bold values. In particular, the ERR values show that FFFS outperforms the other algorithms on 10 datasets: D2, D3, D6, D8, D9, D11, D12, D13, D15, and D20. According to the average ERR values given in the last row, the best result is obtained by the proposed algorithm (1.85) and the worst by NFR-Relief (3.10). From Table 4, it can be observed that the FFFS algorithm attains very low p-values on most datasets when compared with the other algorithms.

Figure 4: Classification error rate

According to Table 3, considering the average scores over the datasets, the FFFS algorithm shows the best performance in terms of minimum error rate. Comparing the error rates of the different algorithms, the proposed algorithm shows a lower error rate than the others on most datasets; in terms of ERR, it achieves the best performance on the D2, D3, D6, D8, D9, D11, D12, D13, D15, and D20 datasets.
After the proposed algorithm, the NeuralFS algorithm ranks second in terms of error rate, achieving the best performance on the D1, D7, D10, D17, and D18 datasets.
The summary of the results presented in Tables 3 and 4 and Figure 4 is as follows:
1. The results obtained by almost all the algorithms are better than those of the original dataset for classification.
2. The ERR values illustrate that FFFS performs better than the other feature selection algorithms on most of the datasets.
3. By using the feature selection algorithms, the ERR on all datasets was reduced while using fewer than half of the original feature columns.
4. Concerning standard deviations, the FFFS algorithm achieved more robust results than the comparison algorithms on most of the datasets.
5. A higher standard deviation indicates lower robustness of an algorithm and therefore reduces its reliability.
6. Although the mMIE-mMIC algorithm has a complex formulation and heavy computations, it fails to reach a significant reduction in error rates and robustness.
7. While the proposed algorithm shows an appropriate and acceptable performance on all the datasets, the mMIE-mMIC algorithm performs better on large datasets, whereas the NeuralFS algorithm has the lowest error rate on small datasets.
8. Table 4 compares the results of FFFS with the NFR-Relief, mMIE-mMIC, and NeuralFS algorithms, where bold values represent p-values >= 0.05.
9. According to the analysis of the Wilcoxon test results, the outcome of the proposed algorithm is meaningfully distinguishable from the others; considering the error rate, the FFFS algorithm performs significantly better than the competing algorithms.

F-MEASURE
Table 5 displays the average F-MEASURE obtained by the different feature selection algorithms on all the datasets. For better comparison, the numbers are multiplied by 100. Table 5 shows that the proposed algorithm ranks second on most datasets in terms of F-MEASURE, with a negligible difference from the best of the NFR-Relief, mMIE-mMIC, and NeuralFS algorithms; the best results are indicated by bold values. According to Table 5, considering the average scores over the datasets, the NeuralFS algorithm has the best performance in terms of maximum F-MEASURE, while the proposed algorithm ranks second with a negligible difference. In terms of F-MEASURE, the proposed FFFS algorithm achieves the best performance on the D1, D8, D12, and D15 datasets, and on the remaining datasets it usually ranks second.
The summary of the results presented in Table 5, Table 6, and Figure 5 is as follows:
1. The results obtained by almost all the algorithms are better than those of the original dataset for classification.
2. The F-MEASURE values illustrate that the proposed algorithm achieves largely satisfactory results compared with the NFR-Relief and mMIE-mMIC algorithms.
3. The accuracy of the experiment is significant for the NeuralFS and FFFS algorithms, which means that the features selected by these algorithms are more suitable for classification.
4. The low standard deviation of the results of the proposed algorithm shows that it is a robust and reliable algorithm for feature selection.
5. The compared algorithms usually have a higher standard deviation than the proposed algorithm, which reduces their robustness.
6. The Wilcoxon test shows the superiority of the proposed algorithm over the NFR-Relief and mMIE-mMIC algorithms, whereas the FFFS algorithm has performance similar to the NeuralFS algorithm.

Time Consumption
Table 7 shows the average CPU time per run, measured in seconds, taken by the FFFS algorithm and the three other algorithms. This measure indicates the speed of the algorithms in selecting features from a given dataset. The results are graphically shown in Figure 6. The summary of the results presented in Table 7 and Figure 6 is as follows:
1. The time consumed by the NFR-Relief algorithm is the lowest, and the proposed algorithm achieves second place in terms of the least time consumed.
2. The processing time of the FFFS algorithm, given its first rank in terms of the lowest error rate, is quite reasonable and acceptable.
3. The mMIE-mMIC and NeuralFS algorithms, due to their complexity, do not achieve good execution times.

Noise Robustness
In order to show the robustness of the proposed algorithm, noise was added to all datasets, since real-world data always contain noise. Figure 7 depicts the performance of the proposed algorithm in comparison with the other three algorithms under different noise rates. The horizontal axis (x-axis) indicates the noise level, while the vertical axis (y-axis) represents the error rate percentage. The results shown in Figure 7 indicate that the proposed algorithm outperforms the other algorithms: the most resistant algorithm (the green line) is FFFS, which shows the smallest increase in error rate as the noise rate increases. This means that the proposed algorithm has selected better features for classification. The summary of the results presented in Figure 7 is as follows: 1-Experimental results on the datasets demonstrate that FFFS is more robust on datasets with noisy features.
2-The mMIE-mMIC algorithm, due to its complexity, does not show good robustness against noise on small datasets and is more suitable for larger datasets. In most cases, however, the proposed algorithm performs better against noise, and its increase in error rate is smaller than that of the mMIE-mMIC algorithm.
3-The slope of the NeuralFS line in the graphs indicates that it is suitable for small datasets; however, it still does not have as much noise robustness as the FFFS algorithm.
4-The NFR-Relief algorithm shows a steady upward trend on most datasets. This algorithm works regardless of the dataset size and can be used when we are unaware of the nature of the dataset.

Conclusion and Future Works
Ever-evolving frontier technologies produce huge and complex datasets. Feature selection algorithms are therefore necessary to refine the information and restrict it to only the relevant and useful features that are important for machine learning tasks.
In this study, a new feature selection algorithm based on the filled function and the Fisher score is presented. The performance of the proposed algorithm was assessed on 15 datasets from the UCI data repository and 5 medical datasets. Furthermore, the proposed algorithm was compared against state-of-the-art supervised filter feature selection algorithms, namely the NFR-Relief, mMIE-mMIC, and NeuralFS algorithms.
The use of a filled function alongside the Fisher score to design a filter feature selection algorithm is a promising direction and, as is evident from this work, it brings fruitful results.
The experiments revealed that, although most of the results obtained by the proposed algorithm do not dramatically improve upon those of the other supervised feature selection algorithms, the FFFS algorithm in general obtains better results than the other compared algorithms and is able to select feature subsets with low redundancy. The better feature sets selected by the FFFS algorithm improve the classification error rate because some redundant features are eliminated. It should also be noted that the proposed algorithm is more robust to increasing noise than the comparison algorithms, owing to its selection of better features.
Future work includes speeding up the proposed algorithm through different alternatives, such as the use of randomized algorithms, proposing a new search algorithm to tune the thresholds in different filter algorithms, and performing extensive experimentation using sensitivity analysis.