Feature and instance selection through discriminant analysis criteria

Feature selection and instance selection are two data preprocessing methods widely used in data mining and pattern recognition. Their main goal is to reduce the computational cost of many learning tasks. Recently, joint feature and instance selection has been approached by solving global optimization problems with meta-heuristics. This approach is not only computationally expensive, but also fails to exploit the structured manifold implicitly hidden in the data and its labels. In this paper, we address joint feature and instance selection using scores derived from discriminant analysis theory. We present three approaches for joint feature and instance selection. The first scheme is a wrapper technique, while the other two schemes are filtering techniques. In the filtering approaches, the search process uses a genetic algorithm where the evaluation criterion is mainly given by the discriminant analysis score. This score depends simultaneously on the candidate feature subset and the best corresponding subset of instances. Thus, the best feature subset and the best instances are determined by finding the best score. The performance of the proposed approaches is quantified and studied using image classification with Nearest Neighbor and Support Vector Machine classifiers. Experiments are conducted on five public image datasets, and we compare the performance of our proposed methods with several state-of-the-art methods. The experiments show the superiority of the proposed methods over several baseline methods.


Introduction
Feature selection (FS) and instance selection (IS) have been active research topics in the preprocessing of datasets of images, videos, and texts, as they promise fast performance and reduced complexity in classification problems. The importance of these tasks is increasingly recognized in most real-world dataset problems addressed in machine learning and autonomous systems. Feature selection methods reduce the dimensionality of datasets by removing features that are considered irrelevant or noisy for the learning task. This topic has received a lot of attention in the machine learning and pattern recognition communities (Aghazadeh et al. 2018; Angulo and Shin 2018; Lim et al. 2017; Mohamed et al. 2018; Roffo et al. 2017; Staczyk et al. 2018; Zhu et al. 2017). In any dataset, the data can be seen as a collection of data points called instances. Each instance describes a particular object or situation and is defined by a set of independent variables called features or attributes. Instance selection refers to the task of selecting a useful subset of data, called representatives, that can summarize the whole dataset (images, videos, or texts) (Bien and Tibshirani 2011; Dornaika and Kamal Aldine 2015; Elhamifar and Vidal 2011; Li and Maguire 2011). On the other hand, feature selection selects a subset of features that represents patterns from a larger set of features that are often mutually redundant and possibly irrelevant (Du et al. 2017; Yang et al. 2011). As a result, the ability to select features from a large feature set can be critical for many computer vision tasks, since the task will then focus only on the important attributes of the dataset. Instance selection and feature selection are used by many machine learning algorithms for tasks such as classification and clustering.
In this regard, many methods for instance selection and feature selection have been proposed and have shown promising results (Gunal and Edizkan 2008;Kuri-Morales and Rodríguez-Erazo 2009;Tsai and Wu 2008;Yin and Chiang 2008).
Accordingly, other approaches rely on the well-known statistical analysis technique of Least Squares Regression (LSR) to build various discriminative learning methods. In Xiang et al. (2012), Yuan et al. (2018), the authors propose a multiclass feature selection algorithm that embeds class label information into the LSR formulation so that the distances between classes can be increased. This was achieved by introducing an ε-dragging technique that forces the regression targets of different classes to move in opposite directions. In a related work, the authors introduce a relaxed label regression method that aims to preserve the row sparsity consistency property of samples from the same class, so that the distances of regression responses between samples from the same class can be greatly reduced, imposing a novel inter-class sparsity regularization term on the transformation. In contrast, other authors do not introduce an additional regularization term to capture the local structure of the data, but instead constrain the reconstruction error of the data by using a low-rank preserving projection over a graph-regularized reconstruction (Wen et al. 2019).
In this paper, we address the problem of joint feature and instance selection. Generally speaking, it has been shown that the best way to perform feature and instance selection is either to perform feature selection followed by instance selection, or to perform both tasks simultaneously. Both schemes have their own drawbacks and limitations. In the first scheme (feature selection followed by instance selection), instance selection can be affected by the subset of features found in a separate stage in advance. The simultaneous solution can be suboptimal and lacks efficiency, since very often the solution is obtained by a wrapper technique. We propose a trade-off solution that can be seen as an in-between solution that is both efficient and accurate. The proposed method exploits the strengths of feature subset selection and of instance selection based on data self-representativeness. A genetic algorithm-based search scheme is deployed in order to optimize a criterion that characterizes the subset of features and the associated selected instances. The proposed schemes are tested on the task of classification over different datasets of small and high dimension. In addition, the proposed method is compared with the main trends used for combining feature and instance selection. Classification performance is quantified on several public image datasets in two scenarios: (i) the original training dataset and (ii) the pre-processed training dataset. Here, pre-processed means that the training datasets undergo feature and instance selection according to a given method.
The remainder of this paper is structured as follows. Section 2 reviews the concept of feature and instance selection. It briefly reviews the instance selection method (Sparse Modeling Representative Selection (SMRS)). Section 3 introduces our proposed schemes for the joint feature and instance selection. Section 4 presents our experiments and results. Section 5 concludes the paper. Table 1 presents the main acronyms and notations used in the paper.

Related work
This section briefly reviews feature and instance selection. It also reviews the instance selection method based on Sparse Modeling Representative Selection.

Feature selection
Feature selection is one of the stages of data preprocessing, consisting of the identification and selection of a subset of F features from the original set of D features (F < D) without any transformation (Tsai 2011). In the domain of supervised learning, feature selection attempts to maximize the accuracy of the classifier while minimizing the related measurement costs by removing irrelevant and possibly redundant features (Aghazadeh et al. 2018; Apolloni et al. 2016; Hou et al. 2014; Lim et al. 2017; Mohamed et al. 2018; Paul and Das 2015; Raza and Qamar 2016; Roffo et al. 2017; Staczyk et al. 2018; Zhu et al. 2017). Feature selection reduces the complexity and the associated computational cost, and improves the probability that a solution will be comprehensible and realistic.

Instance selection
As data volumes continue to grow, many traditional methods and systems struggle to process such volumes of data to obtain actionable knowledge (García-Pedrajas et al. 2010; Wilson and Martinez 2000). Current classification algorithms suffer from memory requirements and excessive processing times that make them impractical when dealing with huge datasets (Hernandez-Leal et al. 2013). These issues can be critical when using a lazy learning algorithm such as the nearest neighbor rule, and can prevent results from being obtained at all. Reducing the size of the dataset by selecting a representative subset has two main advantages: (1) it reduces the storage space required to store the data, and (2) it reduces the time required to train and apply the learning algorithm.

Combining feature and instance selection
Instead of treating the instance and feature selection problems separately, some research efforts have been devoted to the study of the dual IS and FS problem, which we will refer to as feature and instance selection (FIS). Since the usefulness of a feature subset depends on which instances are retained (and vice versa), and since classification accuracy is determined by the selected data as a whole, there can be an advantage in addressing both problems jointly rather than selecting features and instances independently.
Many algorithms have been developed to tackle either instance or feature selection. However, few works have addressed the joint instance and feature selection problem (Fernández et al. 2018;Suganthi and Karunakaran 2018). The simplest way to combine instance and feature selection is to perform one process at a time. When the first process is instance selection and the second is feature selection, the combination is called ISFS. The reverse case is referred to as FSIS.
In Kirkpatrick et al. (1983), the authors solve these two problems using a heuristic technique. They run two simultaneous Simulated Annealing algorithms, each solving one problem separately, and then compute the quality of the joint solution from the current solution of each process.
Much work has been done with genetic and evolutionary algorithms (Ahn and Kim 2009; Teixeira et al. 2008). In Teixeira et al. (2008), the authors use a chromosome that is a binary vector whose elements are associated with all instances and all features. The search for optimal subsets of features and instances is transferred to finding the optimal chromosome. Very often genetic algorithms are invoked to solve these problems. In Ramirez-Cruz et al. (2006) the authors use the chromosome in a different way; they split it into two fields. The first field corresponds to the feature weights and the second field has binary values corresponding to the instances. In Kuncheva and Jain (1999), the authors use boolean value coding to select features and instances. The fitness function used is set to the combination of the accuracy of the Nearest Neighbor classifier and a value that penalizes the cardinality of each set.
In another work, the authors use an intelligent multi-objective evolutionary design to solve the instance and feature selection problems. The goal is to search for a minimal subset of features and instances that maximizes the classification accuracy of the Nearest Neighbor classifier. In Ros et al. (2008), the authors use a hybrid genetic algorithm with the dual goal of reducing instances and selecting features while achieving the highest classification score. In Ishibuchi and Nakashima (2000), the authors use a genetic algorithm to select a small number of instances along with only the significant features, by giving mutations that exclude features a greater probability, thus biasing the search toward solutions with fewer features. In Sierra et al. (2001), the authors present a combination of instance and feature selection based on an adaptation of the Estimation of Distribution Algorithm (EDA) (Pelikan and Mühlenbein 1998). In Derrac et al. (2010), the authors propose an instance and feature selection method based on an evolutionary model to perform feature and instance selection in Nearest Neighbor classification. It is based on the use of three population types, where these types represent the FS problem, the IS problem, and the joint feature and instance problem. Each chromosome in each population has a fitness function formed by the accuracy of the NN classifier (obtained from the three NN classifiers) and the reduction ratio in features and instances.

Review of instance selection based on SMRS
In this section, we briefly describe the efficient instance selection method described in Elhamifar et al. (2012), namely Sparse Modeling Representative Selection (SMRS). In what follows, large bold letters denote matrices and small bold letters denote vectors. The problem formulation can be stated as follows. Consider a set of data samples T = {x_1, ..., x_N} in R^d arranged as columns of the data matrix X = [x_1, ..., x_N], where d denotes the sample dimension. The goal is to select the most representative samples within the sample set T. SMRS is a filtering method that uses the concept of relevance ranking. In other words, the relevance score of each sample x_i, i = 1, ..., N is first estimated. Based on the sorted relevance scores, the most relevant prototypes are then selected using a predefined threshold or a fixed number of prototypes. The basic idea of Elhamifar et al. (2012) is to estimate the coefficients for the self-representativeness of the data, from which the relevance score of each sample can be derived. The basic assumption is that each data sample is equal or close to a linear combination of some samples in the original dataset. Mathematically, this assumption can be written as

x_i ≈ X b_i,  i = 1, ..., N,

where b_i is a vector of coefficients. The above N equations can be encapsulated in one single matrix equation

X ≈ X B,

where B = [b_1, ..., b_N] is the unknown coefficient matrix. This matrix can be estimated by minimizing the reconstruction error ||X − X B||_F^2 together with some regularization on the unknown coefficients, where ||A||_F denotes the Frobenius norm of the matrix A, i.e., ||A||_F = (Σ_i Σ_j A_ij^2)^{1/2}. The regularization plays an important role in obtaining useful coding coefficients. In particular, the regularization becomes crucial whenever the system is underdetermined (the number of samples is greater than the dimensionality of the samples). The SMRS method (Elhamifar et al. 2012) estimates the matrix B by adding a regularization term given by the L_{1,2} norm of B.
Thus, the matrix B is estimated by minimizing the following criterion:

min_B (1/2) ||X − X B||_F^2 + λ ||B||_{1,2}   s.t.  1^T B = 1^T,    (1)

where λ is a positive regularization parameter and ||B||_{1,2} = Σ_{i=1}^{N} ||b^i||_2 is the L_{1,2} norm of B, i.e., the sum of the L_2 norms of the rows b^i of B. The above criterion has two terms. The first is the least square error associated with the self-representativeness of the data. The second is the regularization term, which imposes that the coefficient matrix B is block-sparse (row-sparse); it can be considered an approximation to the L_{0,2} norm of a matrix (the number of non-null rows). Thus, in essence, the optimization problem (1) attempts to compute the unknown matrix that simultaneously provides the least reconstruction error and has the least number of non-null rows (a low-rank matrix). The affine constraint 1^T B = 1^T makes the selection of representatives invariant with respect to a global translation of the data. The optimization of (1) can be carried out using the Alternating Direction Method of Multipliers (ADMM) (Boyd et al. 2011).
Thus, the relevance score of the i-th sample is set to the L_2 norm of the i-th row of B. It should be noted that in a supervised learning context, the same SMRS framework can be applied to each class separately to find the most representative examples in each class.
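The self-representation scoring above can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: it minimizes the L_{1,2}-regularized reconstruction error by proximal gradient descent, omits the affine constraint 1^T B = 1^T for simplicity, and scores each instance by the L2 norm of its row of B. The function name `smrs_scores`, the step-size choice, and the parameter values are assumptions.

```python
import numpy as np

def smrs_scores(X, lam=1.0, n_iter=200):
    """Sketch of SMRS-style self-representation scoring.

    Minimizes 0.5*||X - X B||_F^2 + lam * sum_i ||row_i(B)||_2
    by proximal gradient descent (the affine constraint of the
    original SMRS formulation is omitted here for simplicity).
    Returns the L2 norm of each row of B as a relevance score.
    """
    N = X.shape[1]
    B = np.zeros((N, N))
    G = X.T @ X                          # Gram matrix; gradient is G B - G
    step = 1.0 / np.linalg.norm(G, 2)    # inverse Lipschitz constant
    for _ in range(n_iter):
        grad = G @ B - G
        Z = B - step * grad
        # row-wise group soft-thresholding (prox of the L_{1,2} norm)
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        shrink = np.maximum(1.0 - step * lam / np.maximum(norms, 1e-12), 0.0)
        B = shrink * Z
    return np.linalg.norm(B, axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 30))         # 5-dim samples, 30 instances
scores = smrs_scores(X, lam=0.5)
representatives = np.argsort(-scores)[:10]   # top-10 ranked instances
```

With an underdetermined system such as this one (30 samples in 5 dimensions), many rows of B shrink to zero and the surviving rows mark candidate representatives.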

Motivation
As can be seen, the state-of-the-art methods for combining feature selection and instance selection can be divided mainly into two groups. In the first group, referred to as FSIS, the proposed algorithms perform feature selection followed by instance selection. In the second group, the two processes are performed simultaneously; these are referred to as FIS. A close look at related work on FIS methods shows that the majority of these approaches, which attempt to solve the problems in a broad generic formulation, rely on specialized versions of genetic algorithms that divide the chromosome into two distinct regions, one for features and one for instances, and apply separate operators to each region. The main solution to FIS problems is thus given by evolutionary algorithms, whose fitness function depends mainly on the performance of a given classifier and on the reduction of features and instances. However, the use of criteria derived from manifold learning paradigms has received much less attention. In this work, the joint selection of features and instances is solved by using manifold learning criteria for the FS part and an efficient self-representation of the data for the IS part. Motivated by the fact that the FSIS combination can generally be more interesting than the IS + FS combination when dealing with real-world problems (Tsai et al. 2013), we propose three schemes for combining feature selection and instance selection. All three schemes use an evolutionary scheme for searching the FS solution part, but the IS part is solved and integrated within this evolutionary algorithm using a self-representation scheme for the data, thus avoiding the use of a specific chromosome for the instance selection part.
The three proposed schemes have some interesting properties in terms of effectiveness and efficiency. This is discussed in more detail in the next section. The first scheme is a wrapper technique. The second and third schemes are filtering techniques (i.e., their fitness function does not depend on a particular classifier). All the proposed schemes use a similar search strategy given by a genetic algorithm (GA). For the three proposed schemes, we use only one chromosome for feature selection and the final selection of both features and instances is guided by a fitness score that considers the candidate feature subset and the associated selected instances. We emphasize that there is no chromosome for the instances, since the selection of the instances is performed on the candidate feature subset using the efficient SMRS method.
The proposed methods are based on some collaboration between feature selection and instance selection. Moreover, the number of features selected is not fixed in the sense that the search process tries to find the best features from the data. In practice, the size of the selected features is generally between 40% and 60% of the original features.

Outline of the proposed schemes
In this section, we describe our three proposed methods for feature selection and instance selection. In all of them, a genetic algorithm is used to perform the search for the optimal feature chromosome. The first scheme is a wrapper technique, while the other two schemes do not depend on the classifier. Suppose that we have a training dataset represented by the data matrix X ∈ R^{D×N}, where D is the number of features and N is the number of training samples. Our goal is to use a genetic algorithm that maximizes a fitness function to select the features of the data; we then select the instances using the SMRS method. First, we build a binary chromosome whose size is given by D. In this chromosome, a bit value of one means that the corresponding feature is included in the given subset, and a value of zero means that it is not. Second, we create a fitness function that focuses on maximizing the performance for a given set of features and the corresponding selected instances. The relationship between features and instances is treated through their joint quality for a supervised learning task. A textual description of the framework is as follows. At the beginning, the complete sets of features and instances are set as the initial solution. There are two separate processes for selecting features and instances. Starting with the initial data, a new solution is generated using the feature selection meta-heuristic, where the features of the data are reduced; then, the algorithm performs instance selection based on the feature subset, which results in a candidate solution for feature and instance selection. The goal is to find the optimal feature chromosome by maximizing the fitness function, which quantifies the goodness of both the feature subset and the instance subset.
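As a small illustration of the chromosome encoding described above, the hypothetical helper below keeps exactly the features whose chromosome bit is one (the function name is illustrative, not from the paper):

```python
import numpy as np

def decode_chromosome(X, H):
    """Keep only the features whose chromosome bit is 1.

    X is the D x N data matrix (features in rows, instances in
    columns) and H is a length-D binary vector, as in the text.
    Returns X_F in R^{F x N} with F = sum(H).
    """
    H = np.asarray(H, dtype=bool)
    return X[H, :]

X = np.arange(12.0).reshape(4, 3)   # D = 4 features, N = 3 instances
H = [1, 0, 1, 1]                    # chromosome: drop the second feature
X_F = decode_chromosome(X, H)       # shape (3, 3)
```

The SMRS instance selection step would then operate on X_F, so the selected instances always depend on the current feature subset.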
In this work, we introduce three fitness functions that produce three different schemes. The first one (wrapper FSIS) is based on classification accuracy, while the second and third fitness functions are based on LDA and LDE criteria, respectively. The classifiers used in our work are: Nearest Neighbor (NN) and Support Vector Machines (SVM).

Wrapper approach: W-FSIS
First, we build a binary chromosome H for feature selection; the size of H is D. Our goal is to use a genetic algorithm that optimizes a fitness function scoring both the subset of features associated with the current chromosome and the selected instances provided by the SMRS method. Second, we build a fitness function focused on maximizing the classification rate.
The flowchart of the proposed fitness function associated with the wrapper FSIS is illustrated in Fig. 1. Since it is a wrapper technique, the fitness is mainly set to the rate of correct classification of the chosen classifier over some data with known labels. Due to the cooperative nature of the feature selection and instance selection problem, this rate should actually score the goodness of both the subset of features and the subset of associated selected instances. Let X_F ∈ R^{F×N} denote the data matrix X after feature selection; that is, X_F is X from which all features whose chromosome bit equals zero have been eliminated. Let X_I ∈ R^{F×M} denote the output of instance selection given by the SMRS method when applied to X_F. Here M denotes the number of selected instances; in practice, M is user defined.
Thus, based on the content of the current chromosome and the associated output of the SMRS method, we are able to generate from the training set the following data matrices: X_F, X_I, and X_N (the non-selected instances). We then randomly split the training matrix X_F into two equal parts: X_F1 and X_F2. Based on the four parts X_F1, X_F2, X_I, and X_N, we can estimate three classification rates for a given classifier, as shown in Fig. 1. The first rate, denoted R_F1, is the classifier success rate when X_F1 is used as a training set and X_F2 as a test set. The second rate, denoted R_F2, is the success rate when X_F2 is used as a training set and X_F1 as a test set. The third rate, denoted R_I, is the success rate when X_I is used as a training set and X_N as a test set. By merging all these classification rates, the accuracy of the current chromosome H is given by the following formula:

Acc(H) = λ (R_F1 + R_F2)/2 + (1 − λ) R_I,

where λ is a positive number between 0 and 1. The reduction rate of the chromosome H is:

Red(H) = (D − F)/D.

By merging the classification rate and the reduction rate, the fitness of the current chromosome H can be evaluated by the following formula:

Fitness(H) = α Acc(H) + (1 − α) Red(H),

where α is a positive number between zero and one. For tasks where accuracy is the primary objective, this parameter can be set to a value very close to one. As can be seen, the above fitness function takes into account the quality of the selected features as well as that of the selected instances.

Fig. 1 Wrapper FSIS algorithm. This figure summarizes the design of the fitness function associated with the proposed wrapper FSIS algorithm. The fitness function of each feature selection chromosome is set to a blend of three classification errors obtained after carrying out the feature selection and the instance selection processes. In the proposed algorithm, only the features have a chromosome, which is optimized using a genetic algorithm.
Based on the fitness function, a genetic algorithm is invoked in order to estimate the optimal chromosome. The main steps of the proposed Wrapper FSIS method are given in Algorithm 1.
It is worth noting that for a given number of selected instances, the selected instances obtained by the SMRS method are linked to the content of the feature chromosome and thus can be considered as deterministic samples.
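The wrapper fitness computation can be sketched as follows. This is a hedged illustration, not the paper's code: it uses a plain 1-NN classifier, assumes an equal-weight blend of the two cross-split rates via λ, and (as a simplification) tests the selected instances X_I against the whole feature-selected set rather than only the non-selected part X_N. All function names and default parameter values are hypothetical.

```python
import numpy as np

def nn_rate(X_train, y_train, X_test, y_test):
    """Success rate of a 1-NN classifier (instances in columns)."""
    correct = 0
    for j in range(X_test.shape[1]):
        d = np.linalg.norm(X_train - X_test[:, [j]], axis=0)
        correct += y_train[int(np.argmin(d))] == y_test[j]
    return correct / X_test.shape[1]

def wrapper_fitness(X_F, y, X_I, y_I, lam=0.5, alpha=0.9, D=None):
    """Sketch of the W-FSIS fitness: blends the two cross-split rates
    with the instance-based rate, then mixes accuracy with the
    feature-reduction rate (the blending weights are assumptions)."""
    N = X_F.shape[1]
    idx = np.random.permutation(N)
    a, b = idx[: N // 2], idx[N // 2:]
    R_F1 = nn_rate(X_F[:, a], y[a], X_F[:, b], y[b])
    R_F2 = nn_rate(X_F[:, b], y[b], X_F[:, a], y[a])
    # simplification: test selected instances against the whole set
    R_I = nn_rate(X_I, y_I, X_F, y)
    acc = lam * (R_F1 + R_F2) / 2 + (1 - lam) * R_I
    red = 1 - X_F.shape[0] / D if D else 0.0   # feature reduction rate
    return alpha * acc + (1 - alpha) * red

rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 0.5, (2, 10))      # class 0 around the origin
X1 = rng.normal(5.0, 0.5, (2, 10))      # class 1 well separated
X_F = np.hstack([X0, X1])
y = np.array([0] * 10 + [1] * 10)
X_I = X_F[:, [0, 10]]                   # pretend SMRS picked one per class
f = wrapper_fitness(X_F, y, X_I, y[[0, 10]], D=4)
```

In the actual genetic algorithm, this score would be evaluated once per chromosome per generation, which is why the filtering variants of the next sections avoid training a classifier altogether.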

Background: Feature subset selection using LDA criterion
Appropriate classification of data points requires that data points belonging to different classes be ideally far apart from each other, while those belonging to the same class should be as near as possible. The extent of such inter-class separability and intra-class nearness depends on the features of the data points. Linear Discriminant Analysis (LDA) (Keinosuke 1990; Nie et al. 2008) seeks a linear combination of features that characterizes or separates two or more classes of objects or events. Initially, LDA was used to estimate a linear projection of data. It turned out that the same criterion can be used to select a subset of features by replacing the projection matrix with a selection matrix (Nie et al. 2008). Based on the LDA criterion, the authors of Nie et al. (2008) propose a trace ratio for feature subset selection. The LDA criterion (after projection or feature selection) favors a high within-class compactness and a large between-class separation. In Fisher discriminant analysis, the corresponding D × D scatter matrices S_w and S_b are defined as follows:

S_w = Σ_{j=1}^{C} Σ_{i: G_ij = 1} (x_i − μ_j)(x_i − μ_j)^T,

S_b = Σ_{j=1}^{C} N_j (μ_j − μ)(μ_j − μ)^T,

where μ_j and N_j denote the mean and the number of samples of the j-th class, μ denotes the global mean, G_ij = 1 if x_i is labeled as the j-th class and G_ij = 0 otherwise, and C is the number of classes. By fixing the size of the feature subset to F, the problem stated in Nie et al. (2008) is to find the selection matrix W ∈ B^{D×F} (each column of W is formed by zeros except one element equal to one) that maximizes the following trace ratio:

max_W tr(W^T X L_b X^T W) / tr(W^T X L_w X^T W),

where tr(·) denotes the trace of a matrix, X is the original data matrix, W ∈ R^{D×F} is the selection matrix, L_b is the between-class Laplacian matrix associated with the data X, and L_w is the within-class Laplacian matrix. Given the number of selected features F, the authors devise an iterative algorithm to obtain the optimal selected features by optimizing the above trace ratio. In Liu et al. (2013), the authors showed that by normalizing the rows of the data matrix (i.e., the features over the instances) to zero mean and unit variance, the trace ratio optimization becomes simpler: the method reduces to scoring individual features and taking the subset of F features with the highest scores. Furthermore, it was shown that the normalization process may lead to better performance.
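Under the row normalization of Liu et al. (2013), trace-ratio feature selection reduces to scoring features individually. The sketch below implements one plausible per-feature Fisher-style score (the exact score used in the cited work may differ; the function name is illustrative):

```python
import numpy as np

def fisher_scores(X, y):
    """Score each feature (row of X) by a Fisher-style ratio of
    between-class to within-class variance, after zero-mean
    unit-variance normalization of the rows."""
    Xn = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    classes = np.unique(y)
    between = np.zeros(X.shape[0])
    within = np.zeros(X.shape[0])
    for c in classes:
        Xc = Xn[:, y == c]
        nc = Xc.shape[1]
        mc = Xc.mean(axis=1)
        between += nc * mc ** 2          # global mean is 0 after normalization
        within += ((Xc - mc[:, None]) ** 2).sum(axis=1)
    return between / np.maximum(within, 1e-12)

# feature 0 discriminates the two classes; feature 1 is pure noise
rng = np.random.default_rng(2)
X = np.vstack([np.r_[rng.normal(0, 1, 50), rng.normal(6, 1, 50)],
               rng.normal(0, 1, 100)])
y = np.array([0] * 50 + [1] * 50)
s = fisher_scores(X, y)
top = np.argsort(-s)[:1]    # indices of the best features
```

Selecting the F highest-scoring features then amounts to a single sort, which is what makes this normalized variant so much cheaper than the iterative trace-ratio algorithm.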

Proposed LDA-FSIS
The wrapper technique described in the previous section must train and test a classifier for each chromosome in order to compute the rates involved in the fitness. This process can be computationally intensive for large datasets. In this section, we propose a filtering approach that is able to provide a solution for feature and instance selection without the use of a classifier. Our filtering approach follows the main steps described in the previous section. However, the main difference is that the fitness function does not depend on a particular classifier. Therefore, the fitness computation is much more efficient, since no classifier needs to be trained or tested. The flowchart of the proposed fitness function in conjunction with the filter FSIS method is shown in Fig. 2.
Similar to the wrapper method, for a given chromosome configuration, we can generate the data matrix X_F ∈ R^{F×N}. Then, the SMRS method selects the most relevant instances in X_F. This provides the data matrix X_I ∈ R^{F×M}. It should be noted that F is imposed by the content of the current feature chromosome (it is equal to the number of bits set to one), while M is imposed by the SMRS method or by the user.
Let X'_I denote the row-wise normalized version of X_I; each row of X'_I is normalized to have zero mean and unit variance. From the current data matrix X'_I, one can generate the M × M Laplacian matrices L_b and L_w. The fitness function of the current feature chromosome is then given by:

Fitness(H) = tr(X'_I L_b X'_I^T) / tr(X'_I L_w X'_I^T).    (4)

A high value of the score given in Eq. (4) means that the subset of F features and M instances is good with respect to the class discrimination of the data matrix X'_I. We emphasize that the selected M instances represented by X'_I are already assumed to be the most relevant representatives of the set of instances X_F. Therefore, the fitness function defined by (4) is a good indicator of the goodness of the solution (features and instances) associated with the current chromosome.
A genetic algorithm is invoked in order to estimate the optimal solution where the fitness function is given by (4). The resulting scheme is coined LDA-FSIS.
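The LDA-based fitness can be illustrated by computing the ratio of between-class to within-class scatter traces directly, using the identities tr(X L_b X^T) = tr(S_b) and tr(X L_w X^T) = tr(S_w). This is a sketch under those assumptions, not the authors' code:

```python
import numpy as np

def lda_trace_ratio(X, y):
    """Ratio of between-class to within-class scatter traces,
    computed on the row-normalized selected-instance matrix X
    (features in rows, instances in columns)."""
    Xn = (X - X.mean(axis=1, keepdims=True)) / np.maximum(
        X.std(axis=1, keepdims=True), 1e-12)
    mu = Xn.mean(axis=1)
    tr_b, tr_w = 0.0, 0.0
    for c in np.unique(y):
        Xc = Xn[:, y == c]
        mc = Xc.mean(axis=1)
        tr_b += Xc.shape[1] * np.sum((mc - mu) ** 2)   # trace of S_b term
        tr_w += np.sum((Xc - mc[:, None]) ** 2)        # trace of S_w term
    return tr_b / max(tr_w, 1e-12)

# a well-separated two-class set should score far higher than noise
rng = np.random.default_rng(3)
X_good = np.hstack([rng.normal(0, 1, (3, 40)), rng.normal(8, 1, (3, 40))])
X_bad = rng.standard_normal((3, 80))
y = np.array([0] * 40 + [1] * 40)
assert lda_trace_ratio(X_good, y) > lda_trace_ratio(X_bad, y)
```

A genetic algorithm would simply maximize this scalar over feature chromosomes, with the instance subset supplied by SMRS for each candidate.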

Background: LDE criterion
Local Discriminant Embedding (LDE) attempts to separate the submanifold of each class and specifically derives the embedding for classification in a low-dimensional Euclidean space. LDE has a similar expression to LDA; however, the Laplacian matrices are computed differently. In fact, LDE is based on two graphs: the intra-class graph G_w (intrinsic graph) and the inter-class graph G_b (penalty graph). Let l(x_i) be the class label of x_i. For each data sample x_i, two subsets, N_w(x_i) and N_b(x_i), are computed. N_w(x_i) contains the neighbors that share the same label as x_i, while N_b(x_i) contains the neighbors with different labels. A simple way to compute these two sets of neighbors is to use two nearest neighbor graphs: one for homogeneous samples (parameterized by K_1) and one for heterogeneous samples (parameterized by K_2). Note that K_1 and K_2 can be different and are chosen empirically. Each of the graphs G_w and G_b is represented by its corresponding affinity (weight) matrix, W_w and W_b, respectively. The entries of these symmetric matrices are defined by the following formulas:

W_w,ik = sim(x_i, x_k) if x_k ∈ N_w(x_i) or x_i ∈ N_w(x_k), and 0 otherwise;    (5)

W_b,ik = sim(x_i, x_k) if x_k ∈ N_b(x_i) or x_i ∈ N_b(x_k), and 0 otherwise;    (6)

where sim(x_i, x_k) is a real value that encodes the similarity between x_i and x_k. Without loss of generality, we assume that sim(x_i, x_k) belongs to the interval [0, 1].

Fig. 2 Proposed manifold-based fitness function for FSIS. This figure summarizes the design of the fitness function associated with the two proposed FSIS algorithms. The score of each feature selection chromosome quantifies the goodness of the selected features and instances in terms of class separability. In our work, we derive two fitness functions yielding two manifold-based schemes: LDA-FSIS and LDE-FSIS. In the proposed algorithm, only the features have a chromosome, which is optimized using a genetic algorithm.
Simple choices for this similarity function are the heat kernel and the cosine similarity. The Laplacian matrices associated with the graphs G_w and G_b are given by:

L_w = D_w − W_w,    (7)

L_b = D_b − W_b,    (8)

where D_w (respectively, D_b) denotes the diagonal degree matrix whose entries are the column (or row, since W_w is symmetric) sums of W_w (respectively, W_b). The projection matrix W can be estimated by maximizing the following trace ratio:

max_W tr(W^T X L_b X^T W) / tr(W^T X L_w X^T W).
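A small sketch of the LDE graph construction described above, assuming a heat-kernel similarity sim(x_i, x_k) = exp(−||x_i − x_k||² / t) and symmetric K-NN edges (the function name and parameter defaults are illustrative):

```python
import numpy as np

def lde_laplacians(X, y, K1=3, K2=3, t=1.0):
    """Build the LDE intra-class and inter-class graph Laplacians.
    X has instances in columns; edges connect each sample to its K1
    nearest same-label and K2 nearest different-label neighbors."""
    N = X.shape[1]
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise sq. dists
    sim = np.exp(-d2 / t)                                    # heat kernel
    W_w = np.zeros((N, N))
    W_b = np.zeros((N, N))
    for i in range(N):
        same = np.where((y == y[i]) & (np.arange(N) != i))[0]
        diff = np.where(y != y[i])[0]
        for k in same[np.argsort(d2[i, same])][:K1]:
            W_w[i, k] = W_w[k, i] = sim[i, k]   # symmetric intra-class edge
        for k in diff[np.argsort(d2[i, diff])][:K2]:
            W_b[i, k] = W_b[k, i] = sim[i, k]   # symmetric inter-class edge
    L_w = np.diag(W_w.sum(axis=1)) - W_w        # L = D - W
    L_b = np.diag(W_b.sum(axis=1)) - W_b
    return L_w, L_b

rng = np.random.default_rng(4)
X = np.hstack([rng.normal(0, 1, (2, 6)), rng.normal(4, 1, (2, 6))])
y = np.array([0] * 6 + [1] * 6)
L_w, L_b = lde_laplacians(X, y)
```

The resulting Laplacians are symmetric with zero row sums, as expected of graph Laplacians, and plug directly into the trace-ratio criterion above.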

Proposed LDE-FSIS
Similar to LDA-FSIS, we can generate a fitness function based on the LDE criterion. For a given chromosome configuration, we can generate the data matrix X_I ∈ R^{F×M}. The fitness of a given feature chromosome becomes:

Fitness(H) = tr(X_I L_b X_I^T) / tr(X_I L_w X_I^T),

where the Laplacian matrices L_w and L_b are given by (5), (6), (7), and (8).

Proposed W-FSIS scheme vs. simultaneous FIS
The majority of simultaneous FIS schemes are based on the use of a single chromosome that encodes both the features and the instances. The optimal chromosome (with binary or real values) is estimated using a wrapper method. Therefore, when the number of features and instances increases, these methods may face problems in terms of effectiveness and efficiency. Our proposed wrapper method has two main differences. First, it uses only one chromosome, devoted to feature selection; instance selection is performed using a data self-representation method, which provides the best subset of relevant instances given the subset of features. Second, simultaneous FIS algorithms are very often implemented as wrapper techniques based on a genetic algorithm that reduce the number of features and instances by penalizing low feature and instance reduction rates, which introduces a trade-off between achieving high accuracy and high data reduction (Perez-Rodriguez et al. 2015). Their global fitness uses equilibrium parameters that can be difficult to tune. In contrast, our scheme uses a different type of instance compression: with SMRS instance selection, the number of selected instances can be either explicitly specified by the user or automatically estimated from the associated coding matrix, since this matrix is row-sparse.
We emphasize that whenever the number of original features becomes very large, the use of dimensionality reduction techniques such as Principal Component Analysis (PCA) can reduce the size of the feature chromosome from thousands of bits to just a few tens. This property can benefit both our proposed schemes and the simultaneous FIS schemes. However, the simultaneous schemes cannot benefit from such dimension reduction for the instance chromosome, because the size of that chromosome remains equal to the number of instances in the original dataset. In our proposed scheme, there is no chromosome for the instances.

Proposed LDA-FSIS and LDE-FSIS schemes vs. sequential FSIS techniques
All filter methods that adopt FS as a first step have two main shortcomings. First, the set of selected features is estimated independently of the instance selection step; thus, the instance selection process depends entirely on the solution found on the whole set of instances. Second, these filter methods require a priori knowledge of the number of features to be selected. In contrast, our proposed schemes (LDA-FSIS and LDE-FSIS) explore feature subset selection in such a way that instance selection intervenes in the scoring of each candidate feature subset; by doing so, there is an explicit interaction between the FS and IS steps. Furthermore, owing to the adopted search mechanism, our proposed schemes do not require prior knowledge of the number of selected features.

Datasets
Our experiments were conducted on five public image datasets, including handwritten digits, face images in which the pose of the faces ranges from profile to frontal view, object images with a variety of complex geometry and reflectance properties, and image datasets described by high-level features. In the following, we give a brief description of these datasets.
• USPS 1 : The Handwritten Digit dataset consists of 11,000 grayscale images of the ten digits "0" to "9"; each class (digit) has 1100 images.
• Sheffield 2 : This dataset contains 575 facial images of 20 people. The poses of the faces range from profile to frontal views. The original image size is 112×92 pixels with 8-bit gray levels; in our experiments, the images are rescaled to 56×46 pixels.
• COIL-20 3 : This dataset (Columbia Object Image Library) consists of 1440 images of 20 objects. Each object was captured under 72 rotations (each object has 72 images). The objects exhibit a wide variety of complex geometry and reflectance properties.

Experimental setup
In this section, we compare the performance of the proposed methods (LDA-FSIS, LDE-FSIS, and W-FSIS) with that of state-of-the-art selection strategies. The performance is quantified by the classification accuracy (recognition rate) obtained when the selected features and instances are used as training data. We adopted a commonly used protocol to evaluate the classification accuracy after selecting features and instances. First, each dataset is randomly divided into two parts: a training part and a test part. Then, the training part is processed by a given selection method, which selects a subset of features and a subset of training instances. The test part is then recognized either with all training data and all features (denoted "All Data" in the tables reporting the recognition results) or with the selected features and instances (each method corresponds to a row in these tables). This recognition is performed by two classifiers: Nearest Neighbor (NN) and Support Vector Machines (SVM). The recognition performance of each selection method or scheme can then be quantified using the ground-truth labels of the test set. Since the same training and test data are used, any difference in classification performance is almost certainly due to the selection algorithm alone. For a fair comparison, all instance selection algorithms are set to provide the same number of selected instances. We conduct three groups of experiments. In the first group, we compare the proposed methods with methods that use sequential processing, where feature selection is applied first and instance selection is then applied to the training sets with the selected features. In the second group, we compare the proposed methods with methods that apply joint feature and instance selection. In the third group, we compare our proposed methods with several instance selection methods.
We used the MATLAB genetic algorithm with the following main parameters: The number of generations is set to 20 and the population size is set to 150. For the first two sets of experiments, the data partitioning into training and test sets is given in Table 2. This table also illustrates the percentage of selected samples used in the training set.
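The paper uses MATLAB's genetic algorithm; a minimal binary GA with the same reported settings (20 generations, population 150) can be sketched as follows. The crossover and mutation rates are illustrative defaults, not values taken from the paper.

```python
import numpy as np

def genetic_search(fitness, n_bits, pop_size=150, n_gen=20,
                   p_cross=0.8, p_mut=0.01, seed=0):
    """Minimal binary GA: fitness maps a 0/1 chromosome to a score to
    maximize. Returns the best chromosome of the final population."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    for _ in range(n_gen):
        scores = np.array([fitness(c) for c in pop])
        # binary tournament selection of parents
        idx = rng.integers(0, pop_size, size=(pop_size, 2))
        parents = pop[np.where(scores[idx[:, 0]] >= scores[idx[:, 1]],
                               idx[:, 0], idx[:, 1])]
        # one-point crossover on consecutive pairs
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                cut = rng.integers(1, n_bits)
                children[i, cut:], children[i + 1, cut:] = (
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy())
        # bit-flip mutation
        flip = rng.random(children.shape) < p_mut
        pop = np.where(flip, 1 - children, children)
    scores = np.array([fitness(c) for c in pop])
    return pop[int(np.argmax(scores))]
```

For the filtering schemes, `fitness` would be the LDA or LDE trace-ratio score of the feature subset encoded by the chromosome; for W-FSIS it would be the classifier accuracy.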

Experimental results
First group: Proposed schemes vs. FSIS schemes For feature selection, we consider the following methods. The Fisher method (Gu et al. 2012) is a well-known feature selection method. It computes a score for each feature as the ratio of inter-class separation to intra-class variance; features are evaluated independently, and the final selection retains the m top-ranked ones. Feature Selection via SVM (FSV) (Bradley and Mangasarian 1998) is a wrapper method in which the feature selection process is injected into the training of an SVM by a linear programming technique. The mutual information (MutInf) method (Zaffalon and Hutter 2002) is a filter method in which the features are first considered individually and ranked, and a subset is then extracted. The Maximum Relevance Minimum Redundancy (mRMR) algorithm (Peng et al. 2005) selects a subset of features having the most correlation with the class (relevance) and the least correlation among themselves (redundancy). The Relief algorithm (Sun et al. 2010) performs feature ranking based on a margin computed for each feature.
We evaluate the classification performance of the methods in two scenarios. In the first scenario, we only apply feature selection using the whole set of training samples (FS); there is no instance selection. In the second scenario, we apply feature selection followed by instance selection, where a given number of the most relevant samples are retained; in this scenario, all compared algorithms use the SMRS instance selection method. We proceed as follows. The whole dataset is randomly split into two parts: a training part and a test part. For the first scenario, the training part is processed by a given feature selection method, and the test part is then reduced to the features selected from the training data. For the second scenario, feature selection is applied to the training data; the resulting training data (with the selected features) then undergo instance selection using the SMRS method.
Finally, the recognition rate is quantified on the test data. The classification is carried out by two classifiers: Nearest Neighbor (NN) and Support Vector Machine (SVM). For the second scenario, and for a fair comparison, all compared algorithms are used in such a way that they select the same number of instances.
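The evaluation protocol of the second scenario can be sketched end to end as below. The helper names are hypothetical, the 1-NN classifier is implemented directly for self-containment, and `keep_idx` stands in for the indices that SMRS would return.

```python
import numpy as np

def nn_classify(X_tr, y_tr, X_te):
    """1-NN prediction with Euclidean distance (samples as rows)."""
    d2 = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[np.argmin(d2, axis=1)]

def evaluate_selection(X_tr, y_tr, X_te, y_te, feat_mask, keep_idx):
    """Protocol sketch: restrict both parts to the selected features,
    compress the training part to the selected instances, then report
    the NN recognition rate on the untouched test part."""
    cols = np.flatnonzero(feat_mask)
    X_tr_sel = X_tr[np.ix_(keep_idx, cols)]
    y_tr_sel = y_tr[keep_idx]
    X_te_sel = X_te[:, cols]
    return float((nn_classify(X_tr_sel, y_tr_sel, X_te_sel) == y_te).mean())
```

The important point reflected here is that the test part never influences the selection: both the feature mask and the retained instance indices are computed from the training part only.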
Tables 3, 4, 5, 6 and 7 illustrate the obtained performance on the five image datasets. In these tables, the number of selected representatives per class by the IS method (SMRS) is depicted at the top of each table. The best results are shown in bold.
For the classic schemes (FS+IS) (shown in the first six rows of Tables 3, 4, 5, 6 and 7), we used the 40% top-ranked features. The last row in each table illustrates the performance using all training data (i.e., all features and all samples). From these tables, we can observe the following.
• The FS methods that work on all samples can improve the classification performance, whereas for the FS+IS paradigms the performance decreases. This is consistent with the objectives sought by FS and IS: FS aims to reduce the dimensionality of the data while retaining its discriminative power, whereas IS discards training samples and may therefore remove useful information.

We emphasize that our three proposed methods automatically compute the number of selected features. We used a genetic algorithm that can automatically determine a suboptimal chromosome: the number of bits set to one is not predetermined, but is found by the genetic algorithm, which performs a full feature subset selection (number and positions of the features). We also report the performance of the competing method (FS+IS) on the Extended Yale dataset; in this case, the feature selection method is Fisher's method and the instance selection method is SMRS. Figure 3 shows the color map of SVM performance as a function of the number of selected attributes and selected representatives per class.
Second group: Proposed schemes vs. simultaneous FIS schemes For the simultaneous feature and instance selection, we consider the following methods: (i) a wrapper method that follows a common trend in simultaneous feature and instance selection and is based on optimizing classifier performance; (ii) Instance and Feature Selection based on cooperative coevolution (IFS-CoCo) (Derrac et al. 2010); and (iii) Feature Weighting and Instance Weighting (FWIW) (Perez-Rodriguez et al. 2015). According to the work described in Perez-Rodriguez et al. (2015), the simultaneous FWIW variant gave superior results to other schemes involving hard selection; thus, our comparison uses this variant. Table 8 compares our proposed schemes with the above-mentioned methods on the five datasets. The number of selected instances is the same for all compared methods that allow this number to be fixed; this number is shown in the previous tables. As can be seen, the proposed LDA-FSIS and LDE-FSIS schemes generally outperform the competing simultaneous feature and instance selection methods, and this holds for both classifiers. Our proposed methods LDA-FSIS and LDE-FSIS do not implement a simultaneous selection scheme; instead, they optimize a score that depends on the within-class and between-class graphs as well as on the most relevant instances for a given feature subset. Thus, our proposed schemes are also more efficient than the simultaneous feature and instance selection paradigms, since the computational cost of the manifold-based fitness function is much lower than that of a full classifier training and testing cycle. Regarding the compared wrapper methods (our proposed W-FSIS method and the competing method shown in the first row), the proposed W-FSIS outperformed the competing method in five configurations out of ten.
Third group: Proposed schemes vs. IS methods From the previous experiments, we can observe that the LDA-FSIS and LDE-FSIS schemes give the best performance. In this group of experiments, we therefore focus on the LDA-FSIS scheme. The number of selected instances is the same for all compared methods that allow this number to be fixed. The experimental setup adopted in the first two groups of experiments is slightly changed: we set the number of representatives per class for the Extended Yale, COIL-20, USPS, and Segmentation datasets to 12, 10, 50, and 60, respectively. Table 9 compares our proposed LDA-FSIS with the mentioned instance selection methods on these datasets. We can observe that the proposed manifold-based scheme LDA-FSIS outperforms all competing selection methods. This suggests that performing feature selection and instance selection jointly with the proposed scheme can be better than applying instance selection on the full original set of features.
Influence of the representative subset size. Figure 4 illustrates the recognition rate of the proposed method LDA-FSIS as a function of the number of representatives. The SVM classifier is used. We can observe that by increasing the number of representatives the classification performance improves. Note that the classification rate obtained with the whole training set (1000 examples per class) was 98.8%.

Conclusion
This paper addresses the problem of joint feature and instance selection. We propose three schemes for combining feature selection and instance selection. All three schemes use an evolutionary search to find the feature selection solution; the instance selection part is integrated into this evolutionary algorithm, which searches for the best subset of features. Instance selection is performed on the candidate features using an efficient sparse-modeling data self-representation scheme, avoiding the use of a dedicated chromosome for the instance selection part. The first scheme is a wrapper technique that depends on the classifier used. The second and third schemes are filtering techniques whose fitness functions are based on the Linear Discriminant Analysis (LDA) and Local Discriminant Embedding (LDE) criteria, respectively.
The three proposed schemes have some interesting properties in terms of effectiveness and efficiency. The classifiers used in our work are Nearest Neighbor (NN) and Support Vector Machines (SVM). The proposed schemes were compared with three categories of algorithms: (i) schemes that perform feature selection followed by instance selection, (ii) schemes that perform feature and instance selection simultaneously, and (iii) schemes that perform instance selection only. The experiments involved image classification using the selected features and instances. The results obtained by the proposed schemes seem to be more accurate than those obtained by the competing schemes. In particular, the two filtering schemes that adopt the manifold structure can be very appealing.

Table 9 Comparison with IS methods. The used datasets are Extended Yale, COIL-20, USPS, and Segmentation. Recognition rates (%).

Fig. 4 Effect of the number of selected instances (per class) on the final classification using the proposed LDA-FSIS with the USPS dataset.
Funding The authors have not disclosed any funding.
Data availability Enquiries about data availability should be directed to the authors.

Conflict of Interest
The authors declare that they have no conflict of interest.

Ethical standards This article complies with ethical standards.
Ethical approval This article does not include studies with human participants or animals conducted by any of the authors.