Robust multi-label feature selection with shared label enhancement

Feature selection has attracted considerable attention due to the wide application of multi-label learning. However, previous methods do not fully consider the relationship between feature sets and label sets, devoting attention to only one of them. Furthermore, conventional multi-label learning uses logical labels to estimate the relevance between feature sets and label sets, so the importance of the corresponding labels cannot be well reflected. Additionally, numerous irrelevant and redundant labels degrade the classification performance of models. To this end, we propose a multi-label feature selection method named Robust multi-label Feature Selection with shared Label Enhancement (RLEFS). First, we obtain a robust label enhancement term by reconstructing logical labels into numerical labels and imposing the $l_{2,1}$-norm on the label enhancement term. Second, RLEFS utilizes the robust label enhancement term to share the similar latent semantic structure between the feature matrix and the label matrix. Third, local structure is considered to ensure the consistency of label information during the feature selection process. Finally, we integrate the above terms into one joint learning framework, and a simple but effective optimization method with provable convergence is proposed to solve RLEFS. Experimental results demonstrate the classification superiority of RLEFS in comparison with seven state-of-the-art methods.


Introduction
In real-world applications, data mining faces a great challenge from the increase in high-dimensional data, because high dimensionality introduces considerable noise and degrades the classification performance of learning models [1]. To address this issue, numerous feature selection methods have been designed to reduce the number of irrelevant and redundant features, to mine useful information, and to improve classification performance [2][3][4][5]. Feature selection is widely used in many domains, such as communications electronics, biomedicine, computational chemistry and virus detection.
In recent years, a large number of feature selection methods have been proposed. Generally, they are categorized into three types based on the selection strategy [6]: filter models [7], wrapper models [8] and embedded models [4,9]. Filter models are independent of the subsequent learning approach, while wrapper models depend on a specific learning approach. Embedded models integrate feature selection and the learning approach into one unified framework, which has become a hot topic in feature selection research [10][11][12]. This paper focuses on embedded feature selection.
In the early stage, researchers used such methods to deal with binary and/or multi-class data. With the explosive growth of multi-label data, numerous multi-label learning methods have been proposed. However, most existing multi-label learning methods do not consider the relationships within the feature set and within the label set simultaneously, but devote attention to only one of them; that is, these methods handle either the relationships among features or the relationships among labels. For instance, Jian et al. propose a multi-label method named MIFS [13], which focuses on obtaining a low-dimensional form of the label matrix by matrix decomposition. Cai and Zhu consider the feature manifold to conduct multi-label feature selection but ignore the label correlations that are important for multi-label learning [14].

Moreover, there exists a close relationship between the feature set and the label set: it has been shown that the two share a similar latent semantic structure [15]. Therefore, we propose a new term that shares this latent semantic structure between the feature set and the corresponding label set and combines local and global structures of labels to provide complementary information, thereby improving multi-label feature selection; that is, we simultaneously exploit local and global structures from the label-level perspective. The works [16][17][18] are closely related. In [16,17], researchers demonstrate that local and global geometry structures among instances provide complementary information to each other, strengthening unsupervised feature selection. However, unsupervised methods ignore the relationships between labels and the significance of label enhancement. In addition, local and global structures may occur in any real-world application, i.e., some structures provide global information while other information is shared in local structures. Therefore, Zhu et al. [18] propose a global and local label correlation method for multi-label learning. However, this method not only ignores label enhancement but is also unable to conduct feature selection. Consequently, these approaches cannot be applied directly to joint feature selection and multi-label learning with label enhancement, whereas the proposed method solves this intractable problem.

Additionally, traditional multi-label feature selection methods use logical labels to estimate the relevance between the feature set and the label set [15,19,20]. For example, consider a picture that includes sunshine, a flying bird, the sky and a white cloud. Usually, we use 1 and 0 (or -1) to denote whether an object is relevant to the picture, so the importance of each relevant label is identical for each instance. However, the importance of labels cannot be well reflected by logical values [21,22]. To this end, we design a robust label enhancement term that reconstructs logical labels into numerical labels. Furthermore, as mentioned above, labels contain a great deal of irrelevant and redundant information, and such low-quality labels cannot offer accurate guidance for feature selection. We therefore impose the $l_{2,1}$-norm on the label enhancement term to obtain robust labels that guide the feature selection process. Finally, the consistency of label information is taken into account in the design of the proposed method.
In light of the above analysis, we propose a novel multi-label feature selection method that integrates the above terms into one joint learning framework, named Robust multi-label Feature Selection with shared Label Enhancement (RLEFS). An effective optimization method with provable convergence is then proposed to solve RLEFS. In summary, the novelties and contributions of this paper are highlighted as follows:
• We design a robust label enhancement term by reconstructing logical labels into numerical labels and imposing the $l_{2,1}$-norm on the label enhancement term.
• The proposed method shares the new robust labels between the feature set and the corresponding label set to capture their similar latent semantic structure. At the same time, the consistency of label information is taken into account in the design of the proposed method.
• We design a joint learning framework named Robust multi-label Feature Selection with shared Label Enhancement (RLEFS) and develop an optimization method with provable convergence to solve it.
• We conduct comprehensive evaluations on multiple benchmark data sets to demonstrate the effectiveness of the proposed framework.
The remainder of this paper is organized as follows. Sect. 2 reviews related work, including relevant notations and representative studies. Sects. 3 and 4 present the joint learning framework RLEFS and an optimization method with provable convergence, respectively. Sect. 5 presents and discusses comprehensive experimental results on multiple multi-label data sets. Finally, concluding remarks are given in Sect. 6.

Preliminaries
In this section, we define the notations used throughout this paper. Matrices are denoted by italicized uppercase letters, such as A. For a matrix A ∈ R^{n×m}, A_{i·} and A_{·j} denote the ith row vector and the jth column vector of A, respectively. Scalars are denoted by lowercase letters, such as a. Functions are represented by calligraphic letters, such as L. A^T and Tr(A) denote the transpose and the trace of A, respectively, where A in Tr(A) is a square matrix. The Frobenius norm of A is defined as
$$\|A\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m} A_{ij}^2},$$
where A_{ij} denotes the (i, j)th entry of matrix A. In this paper, we suppose that the feature matrix X ∈ R^{n×d} has n instances in a d-dimensional space. The corresponding label matrix Y ∈ R^{n×c} has c class labels. Generally, Y_{ij} = 1 if the ith instance is related to the jth label, and Y_{ij} = 0 otherwise.
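For concreteness, the following minimal NumPy sketch (our own illustration, not part of the original paper; the function names are ours) shows how the Frobenius norm and the $l_{2,1}$-norm used throughout this paper can be computed:

```python
import numpy as np

def frobenius_norm(A: np.ndarray) -> float:
    # ||A||_F = square root of the sum of squared entries
    return np.sqrt((A ** 2).sum())

def l21_norm(A: np.ndarray) -> float:
    # ||A||_{2,1} = sum over rows of the l2-norm of each row;
    # used as a regularizer, it encourages row sparsity.
    return np.linalg.norm(A, axis=1).sum()

A = np.random.rand(6, 4)  # toy matrix with n=6 rows, m=4 columns
print(frobenius_norm(A), l21_norm(A))
```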

Related work
In the past decade, many multi-label learning methods have been developed and widely applied in different fields. Generally, these methods are categorized into three groups from the perspective of label correlations, i.e., first-order, second-order and high-order strategies [23]. The first-order strategy, such as Binary Relevance (BR) [24], applies single-label learning to multi-label data, which is not only time-consuming but also ignores label correlations. In the second-order strategy, pairwise label correlations attract extensive concern; for instance, Global Relevance and Redundancy Optimization (GRRO) is a state-of-the-art second-order method [25]. In the high-order strategy, the correlations among multiple labels or all labels are taken into account. For example, Huang et al. propose a label-specific-features-based multi-label learning method named LLSF-DL [3], which uses sparse stacking to learn high-order label correlations. However, methods based on these three strategies assume that all labels are equally important: they use logical labels to estimate the relevance between the feature set and the label set, so the importance of labels cannot be well reflected. To this end, some researchers transform logical labels into numerical labels. For instance, Hou et al. propose a manifold-based multi-label learning method that translates logical labels into numerical labels [21]; however, it focuses only on manifold learning over the label set. Huang et al. design a feature selection method, MCLS, that considers the Laplacian score based on manifold learning [26]; inspired by [21], MCLS uses manifold learning to transform binary labels into numerical labels. However, these methods only focus on handling label sets.
Based on whether multi-label data are handled directly, multi-label learning methods are mainly divided into two types: problem transformation methods and algorithm adaptation methods [23]. The former transform multi-label classification into several single-label classification problems, such as Pruned Problem Transformation (PPT) [27]. Doquire et al. propose a PPT-based multi-label feature selection method, PPT+MI, which employs mutual information to evaluate features [28]. Similarly, PPT+CHI employs the χ² statistic to select important features after PPT preprocessing [27]. However, these problem transformation methods have two shortcomings: first, the geometric structure of the labels is not considered; second, they rely on logical labels. Algorithm adaptation methods instead adapt algorithms to fit multi-label data and handle such data directly. Generally, algorithm adaptation multi-label feature selection methods adopt different design criteria, such as mutual information-based and sparse learning-based criteria [1]. Several representative methods are reviewed below.
MIFS [13] uses matrix factorization to obtain a low-rank latent label subspace that preserves the local geometric structure of labels via an instance-level regularizer. The obtained latent label subspace eliminates irrelevant and redundant information in the original label matrix. However, the label matrix of MIFS is decomposed into two factor matrices, and the latent label matrix may lose effective information when its rank k is set inaccurately. MIFS is formulated as follows:
$$\min_{W,V,B} \|XW - V\|_F^2 + \alpha \|Y - VB\|_F^2 + \beta\, \mathrm{Tr}(V^T L V) + \gamma \|W\|_{2,1},$$
where W ∈ R^{d×k}, V ∈ R^{n×k} and B ∈ R^{k×c} are the weight matrix, the latent label matrix and the basis matrix, respectively. L ∈ R^{n×n} denotes the graph Laplacian matrix, and α, β and γ are the three regularization parameters of MIFS.
Besides, Cai et al. design a feature selection method based on sparse learning theory, named RALM-FS, which imposes the $l_{2,0}$-norm on the weight matrix and uses the Augmented Lagrangian Multiplier method to optimize the following objective function [29]:
$$\min_{W,b} \|XW + \mathbf{1}b^T - Y\|_F^2 \quad \text{s.t. } \|W\|_{2,0} = k,$$
where b and 1 denote the bias term and the all-one vector, respectively, and k denotes the number of selected features. However, this method ignores the relationships between labels. Alternatively, Lin et al. design a multi-label feature selection method, MDMR, which maximizes feature dependency while minimizing feature redundancy [30]. Besides, Zhang et al. propose a new feature relevance term based on label redundancy and combine it with a feature redundancy term to design Label Redundancy-based multi-label Feature Selection (LRFS) [2]. LRFS has the following form:
$$J(f_k) = LR(f_k; Y) - \frac{1}{|S|}\sum_{f_j \in S} I(f_k; f_j),$$
where LR(f_k; Y) and I(f_k; f_j) are the feature relevance term and the feature redundancy term, respectively. f_k denotes a candidate feature from the feature set F, and y_i and y_j are two labels from the label set Y. To balance the magnitudes of the two terms, the feature redundancy term is divided by the number of features in the already-selected subset S. Both MDMR and LRFS are multi-label feature selection methods based on information theory.

In addition, we analyze the advantages and disadvantages of these existing state-of-the-art methods. Jian et al. [13] recognize that invalid labels can restrain the prediction performance of feature selection and therefore propose the Multi-label Informed Feature Selection method (MIFS). Its advantages are that it uses non-negative matrix factorization to reduce invalid original label information and maintains structural consistency by constructing a conventional Laplacian matrix. Its disadvantage is that the original labels may be decomposed into two factors with mixed values [31]. Moreover, MIFS uses the input instance space to generate a local manifold graph structure, whereas local and global structures have been demonstrated to provide complementary information to each other [16,17]. Furthermore, MIFS ignores the importance of label enhancement for feature selection and thus suffers from serious classification error. To this end, we exploit label enhancement with a label-level manifold regularization term to guide feature selection. RALM-FS holds that the $l_{2,0}$-norm yields better sparsity than the $l_{2,1}$-norm and therefore adopts the objective above. However, RALM-FS selects only a fixed number of features and forces the weights of all other features to zero; at the same time, the importance of labels is ignored due to the nature of the $l_{2,0}$-norm, and RALM-FS is unable to achieve label enhancement. Gao et al. [32] design a feature selection method based on shared patterns, named SSFS. Its advantage is that it uses a constrained latent structure to achieve a shared regularization term so that multi-label instances with high-dimensional features and labels can be handled; this shared space encapsulates important information between the features and the labels simultaneously. Its disadvantage is that it constructs a fixed local instance-level manifold graph regularizer and ignores label enhancement.

In addition, a local learning regularizer is adopted in this paper. It rests on an intuitive assumption: if X_{i·} and X_{j·} are close in the high-dimensional space, then Y_{i·} and Y_{j·} should be similar in the low-dimensional space [33]. The graph Laplacian is used to realize this assumption, yielding the following regularizer:
$$\frac{1}{2}\sum_{i,j} S_{X,ij}\,\|Y_{i\cdot} - Y_{j\cdot}\|_2^2 = \mathrm{Tr}(Y^T L_X Y),$$
where L_X = A_X − S_X denotes the graph Laplacian matrix of X, and S_X and A_X denote the symmetric affinity matrix and the degree matrix, respectively.

Finally, we review recent advanced work. Hashemi et al. [34] find that the ParetoCluster method, which uses Pareto dominance and cluster analysis for multi-label feature selection, lacks discrimination in high-dimensional cases. To deal with this problem, they propose a Pareto-based multi-label feature selection method named PMFS, which transforms the multi-label problem into a bi-objective optimization problem by considering feature relevance and feature redundancy simultaneously. Paniri et al. [35] propose an Ant Colony Optimization-based feature selection method for multi-label learning, which replaces the static heuristic function with a heuristic learning scheme based on Temporal Difference (TD) reinforcement learning. Hashemi et al. [36] cast joint feature selection and multi-label learning as a bipartite graph (feature graph and label graph) matching process, achieving superior classification performance. Moreover, Hashemi et al. [37] propose, for the first time, a multi-label feature selection method within a multi-criteria decision-making process. These methods belong to the state-of-the-art in multi-label feature selection. In addition, Fan et al. [38] propose a structured subspace multi-label feature selection method that exploits global and local label correlations via manifold learning to uncover a compact latent subspace; however, it ignores the importance of label enhancement. In [39], an adaptive spectral graph with information entropy is designed for its simplicity and effectiveness; however, the information entropy-based strategy is time-consuming. The method in [39] consists of two parts: integrating local cliques into a global discriminant model and exploring clustering results through high-order label correlations. Similar to the aforementioned literature, it ignores the importance of label enhancement. Liu et al. [40] propose an Online Multi-label Group Feature Selection (OMGFS) method that considers online intra-group and online inter-group selection; OMGFS uses an information-theoretic strategy, which is time-consuming, and label enhancement is again ignored. To this end, we propose the following method.
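As an illustration of this local learning regularizer (our own sketch, not code from any cited paper; the heat-kernel parameter choice is illustrative), the snippet below builds an affinity matrix, forms the graph Laplacian L_X = A_X − S_X, and evaluates Tr(Y^T L_X Y):

```python
import numpy as np

def heat_kernel_affinity(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    # S_ij = exp(-||X_i. - X_j.||^2 / sigma^2); a dense variant for clarity
    # (kNN sparsification is commonly applied in practice).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq_dists / sigma ** 2)
    np.fill_diagonal(S, 0.0)
    return S

def laplacian_regularizer(X: np.ndarray, Y: np.ndarray) -> float:
    S = heat_kernel_affinity(X)
    A = np.diag(S.sum(axis=1))        # degree matrix A_X
    L = A - S                         # graph Laplacian L_X
    return float(np.trace(Y.T @ L @ Y))  # Tr(Y^T L_X Y)

X = np.random.rand(8, 5)              # toy instances
Y = np.random.randint(0, 2, (8, 3))   # toy logical labels
print(laplacian_regularizer(X, Y))
```

A small value of the regularizer indicates that instances close in the feature space also carry similar label vectors, which is exactly the consistency assumption stated above.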

The proposed framework
In this section, we propose a novel robust multi-label feature selection method named Robust multi-label Feature Selection with shared Label Enhancement (RLEFS). The core contributions of the proposed method can be summarized as follows: RLEFS utilizes a robust label enhancement term to share the similar latent semantic structure between the feature matrix and the label matrix, and the local structure is considered in the design of RLEFS to ensure the consistency of label information as much as possible during the feature selection process. The details of the proposed method are presented in the following.
Generally, the least squares regression model is used to learn the weight matrix W. However, this model is very sensitive to noise. To make the model more robust, we impose the $l_{2,1}$-norm on the learning model, since the $l_{2,1}$-norm has been confirmed to be robust to noise [20]. This gives the following learning model:
$$\min_{W} \|XW - Y\|_{2,1}, \qquad (5)$$
where W ∈ R^{d×c} denotes the feature weight matrix that measures the importance of each feature: the greater the value of ||W_{i·}||_2, the larger the contribution of the ith feature of X.

Furthermore, motivated by [18,41], it is vital to preserve the local structure of labels, which means that if Y_{·i} and Y_{·j} are close, then W_{·i} and W_{·j} should be close. The corresponding graph Laplacian term is
$$\frac{1}{2}\sum_{i,j} S_{ij}\,\|W_{\cdot i} - W_{\cdot j}\|_2^2 = \mathrm{Tr}(W L W^T), \qquad (6)$$
which uses the weight matrix to preserve the relationships between labels from the label-level perspective. Incorporating Formula (6) into Formula (5) yields
$$\min_{W} \|XW - Y\|_{2,1} + \alpha\, \mathrm{Tr}(W L W^T), \qquad (7)$$
where L = A − S denotes the graph Laplacian matrix of the label set, and α is a trade-off parameter between the loss function and the label graph regularization term.

However, traditional multi-label feature selection methods use logical labels to estimate the relevance between the feature set and the label set, so the importance of labels cannot be well reflected. Therefore, we design a robust label enhancement term by reconstructing logical labels into numerical labels and imposing the $l_{2,1}$-norm on it, obtaining
$$\min_{W,F,B} \|XW - F\|_{2,1} + \alpha\, \mathrm{Tr}(W L W^T) + \beta\, \|F - YB\|_{2,1}, \qquad (8)$$
where β denotes a regularization parameter that adjusts the contribution of the robust label enhancement term. F ∈ R^{n×c} denotes the enhanced label space, which shares the similar latent semantic structure between the feature space and the label space. It is worth pointing out that F is not a low-dimensional embedding subspace of the label matrix Y but a homomorphic space; compared with methods that directly reduce a high-dimensional space to a low-dimensional one, this homomorphic label space does not lose information from the original logical label space. Besides, we impose the $l_{2,1}$-norm on the label enhancement regularization term to enhance the robustness of the enhanced label space F against noise interference. B ∈ R^{c×c} denotes a self-representation mapping that adjusts the correlation between the enhanced label space F and the original logical label space Y.

Finally, we impose the $l_{2,1}$-norm on the feature weight matrix W, which not only ensures the row sparsity of W but also automatically selects the most discriminative and effective features during multi-label learning. We thus reformulate the function as
$$\min_{W,F,B} \|XW - F\|_{2,1} + \alpha\, \mathrm{Tr}(W L W^T) + \beta\, \|F - YB\|_{2,1} + \gamma\, \|W\|_{2,1}, \qquad (9)$$
where γ denotes a regularization parameter that controls the sparsity of the objective function. However, the row-sparse property of W is not always guaranteed by the $l_{2,1}$-norm [42]. Consequently, we impose a non-negative constraint on W to further enhance row sparsity. We also apply non-negative constraints to F and B, which ensures the consistency of F and Y, because Y contains only non-negative logical values (0 and 1) in the used data sets. Note that if all elements of F are zero, the above function has a trivial solution; to this end, we impose an orthogonality constraint on F. The final objective function is
$$\min_{W,F,B} \|XW - F\|_{2,1} + \alpha\, \mathrm{Tr}(W L W^T) + \beta\, \|F - YB\|_{2,1} + \gamma\, \|W\|_{2,1} \quad \text{s.t. } W \ge 0,\; F \ge 0,\; B \ge 0,\; F^T F = I. \qquad (10)$$
Next, we develop a simple yet effective optimization method that guarantees the convergence of function (10); it is described in detail in the next section.
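To make the final objective concrete, here is a minimal NumPy sketch (our own illustration, under the assumption that function (10) has the form reconstructed above; the function names are ours) that evaluates the unconstrained part of the RLEFS objective for given W, F and B:

```python
import numpy as np

def l21(A: np.ndarray) -> float:
    # ||A||_{2,1}: sum of row-wise l2-norms
    return np.linalg.norm(A, axis=1).sum()

def rlefs_objective(X, Y, W, F, B, L, alpha, beta, gamma):
    # Function (10) without its constraints:
    # ||XW - F||_{2,1} + alpha * Tr(W L W^T)
    #   + beta * ||F - Y B||_{2,1} + gamma * ||W||_{2,1}
    return (l21(X @ W - F)
            + alpha * np.trace(W @ L @ W.T)
            + beta * l21(F - Y @ B)
            + gamma * l21(W))
```

Tracking this value across iterations is a simple way to check that an optimizer for (10) is behaving monotonically, as the convergence analysis below requires.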

Optimization schemes
In this section, we propose a simple but efficient optimization method to solve the objective function (10). We observe that function (10) is jointly non-convex, because the Hessian matrix composed of the second partial derivatives of the multivariate function is not positive semi-definite. In addition, the objective function is non-smooth due to the $l_{2,1}$-norm. To solve these problems, we decompose the objective function into several sub-problems, that is, one variable is updated while the other variables are fixed. At the same time, a relaxation approach is introduced to handle the non-smoothness. The objective function (10) is equivalent to the following objective function:
$$\mathcal{L}(W, F, B) = \mathrm{Tr}\big((XW - F)^T D_1 (XW - F)\big) + \alpha\, \mathrm{Tr}(W L W^T) + \beta\, \mathrm{Tr}\big((F - YB)^T D_2 (F - YB)\big) + \gamma\, \mathrm{Tr}(W^T D_3 W), \qquad (11)$$
where L(W, F, B) denotes the objective function (11) with respect to the variables W, F and B. The diagonal matrices D_1, D_2 and D_3 are defined as
$$D_1^{ii} = \frac{1}{2\|(XW - F)_{i\cdot}\|_2 + \epsilon}, \quad D_2^{ii} = \frac{1}{2\|(F - YB)_{i\cdot}\|_2 + \epsilon}, \quad D_3^{ii} = \frac{1}{2\|W_{i\cdot}\|_2 + \epsilon}, \qquad (12)$$
where ||·||_2 denotes the $l_2$-norm of a vector, D_1^{ii}, D_2^{ii} and D_3^{ii} denote the ith diagonal elements of D_1, D_2 and D_3, respectively, and ε is a non-negative small constant. By integrating the non-negative and orthogonal constraints into function (11), we obtain the following Lagrangian function:
$$\mathcal{L}(W, F, B, \Phi, \Psi, \Omega) = \mathcal{L}(W, F, B) + \frac{\lambda}{2}\|F^T F - I\|_F^2 + \mathrm{Tr}(\Phi W^T) + \mathrm{Tr}(\Psi F^T) + \mathrm{Tr}(\Omega B^T), \qquad (13)$$
where Φ ∈ R^{d×c}_+, Ψ ∈ R^{n×c}_+ and Ω ∈ R^{c×c}_+ represent the Lagrangian multipliers, and λ denotes the regularization parameter of the orthogonality constraint. Taking the derivatives of function (13) w.r.t. W, F and B, respectively, we obtain
$$\frac{\partial \mathcal{L}}{\partial W} = 2X^T D_1 (XW - F) + 2\alpha W L + 2\gamma D_3 W + \Phi, \quad \frac{\partial \mathcal{L}}{\partial F} = -2D_1 (XW - F) + 2\beta D_2 (F - YB) + 2\lambda F(F^T F - I) + \Psi, \quad \frac{\partial \mathcal{L}}{\partial B} = -2\beta Y^T D_2 (F - YB) + \Omega. \qquad (14)$$
Since Φ_{ij} W_{ij} = 0, Ψ_{ij} F_{ij} = 0 and Ω_{ij} B_{ij} = 0 according to the Karush-Kuhn-Tucker conditions, we get the following formulas:
$$\big[X^T D_1 (XW - F) + \alpha W L + \gamma D_3 W\big]_{ij} W_{ij} = 0, \quad \big[-D_1 (XW - F) + \beta D_2 (F - YB) + \lambda F(F^T F - I)\big]_{ij} F_{ij} = 0, \quad \big[-Y^T D_2 (F - YB)\big]_{ij} B_{ij} = 0. \qquad (15)$$
According to Formula (15), we obtain the following update rules:
$$W^{t+1}_{ij} \leftarrow W^{t}_{ij}\, \frac{[X^T D_1 F + \alpha W S]_{ij}}{[X^T D_1 X W + \alpha W A + \gamma D_3 W]_{ij}}, \quad F^{t+1}_{ij} \leftarrow F^{t}_{ij}\, \frac{[D_1 X W + \beta D_2 Y B + \lambda F]_{ij}}{[D_1 F + \beta D_2 F + \lambda F F^T F]_{ij}}, \quad B^{t+1}_{ij} \leftarrow B^{t}_{ij}\, \frac{[Y^T D_2 F]_{ij}}{[Y^T D_2 Y B]_{ij}}, \qquad (16)$$
where t denotes the iteration counter. L is a graph Laplacian matrix with mixed signs, so we decompose it into two non-negative parts, i.e., L = A − S. Besides, some elements in the denominators could be zero during the update process; to solve this issue, we add a very small constant to each denominator. Finally, we obtain the top-k features during the feature selection process.

The pseudo-code of the proposed method is described in Algorithm 1, which has three phases. Phase 1 (lines 1, 3, 4 and 5) provides the input terms: line 1 gives the input data set and the regularization parameters, i.e., the input feature matrix X ∈ R^{n×d}, the output label matrix Y ∈ R^{n×c} and the regularization parameters α, β, γ and λ; lines 3-5 initialize all variables, e.g., W, F and B are randomly initialized in line 3. Phase 2 (lines 6-10) shows the update process using the aforementioned update rules. Phase 3 (lines 11 and 12) sorts all features by ||W_{i·}||_2, where i = 1, 2, 3, . . . , d, and selects the top-k features, which are then used to train the subsequent classifiers.

Input:
1: The input feature matrix X ∈ R^{n×d} and the output label matrix Y ∈ R^{n×c}; the regularization parameters α, β, γ and λ.
Output:
2: Return the top-k selected features index set.
3: Initialize W ∈ R^{d×c}_+, F ∈ R^{n×c}_+ and B ∈ R^{c×c}_+ randomly;
4: t = 0;
5: Compute the degree matrix A and the affinity matrix S of the label matrix Y;
6: Repeat:
7:   Update the diagonal matrices D_1, D_2 and D_3 by Formula (12);
8:   Update W, F and B by the update rules in Formula (16);
9:   t = t + 1;
10: Until convergence;
11: Sort all features by ||W_{i·}||_2 in descending order;
12: Return the index set of the top-k selected features.
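The following NumPy sketch is our own illustration of Algorithm 1, written under the assumption that the multiplicative update rules have the form reconstructed in Formula (16); the function names, the fixed iteration budget and the toy label affinity are ours, not the authors':

```python
import numpy as np

def row_norm_diag(M: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Diagonal matrix with D_ii = 1 / (2 ||M_i.||_2 + eps), cf. Formula (12)
    return np.diag(1.0 / (2.0 * np.linalg.norm(M, axis=1) + eps))

def rlefs(X, Y, S, alpha, beta, gamma, lam, n_iter=200, tiny=1e-12):
    # S: non-negative c x c label affinity; A = diag(row sums); L = A - S
    n, d = X.shape
    c = Y.shape[1]
    A = np.diag(S.sum(axis=1))
    rng = np.random.default_rng(0)
    W, F, B = rng.random((d, c)), rng.random((n, c)), rng.random((c, c))
    for _ in range(n_iter):
        D1 = row_norm_diag(X @ W - F)
        D2 = row_norm_diag(F - Y @ B)
        D3 = row_norm_diag(W)
        # Multiplicative updates (cf. Formula (16)); they keep W, F, B
        # non-negative because every factor involved is non-negative.
        W *= (X.T @ D1 @ F + alpha * W @ S) / (
             X.T @ D1 @ X @ W + alpha * W @ A + gamma * D3 @ W + tiny)
        F *= (D1 @ X @ W + beta * D2 @ Y @ B + lam * F) / (
             D1 @ F + beta * D2 @ F + lam * F @ F.T @ F + tiny)
        B *= (Y.T @ D2 @ F) / (Y.T @ D2 @ Y @ B + tiny)
    # Rank features by ||W_i.||_2 in descending order
    return np.argsort(-np.linalg.norm(W, axis=1)), W, F, B

# Toy usage with a heat-kernel affinity over label columns
X = np.random.rand(20, 10)
Y = np.random.randint(0, 2, (20, 4)).astype(float)
sq = ((Y.T[:, None, :] - Y.T[None, :, :]) ** 2).sum(-1)
S = np.exp(-sq); np.fill_diagonal(S, 0.0)
idx, W, F, B = rlefs(X, Y, S, alpha=0.5, beta=0.5, gamma=0.5, lam=0.5)
print("top-3 selected features:", idx[:3])
```

In practice, the loop would terminate on a convergence test of the objective rather than a fixed iteration count.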

Proof of convergence
In this subsection, we prove the convergence of the proposed optimization method. Taking the variable W as an example, the gradient descent method gives
$$W_{ij}^{t+1} = W_{ij}^{t} - \eta_{ij} \left[\frac{\partial \mathcal{L}}{\partial W}\right]_{ij}, \qquad (17)$$
where the learning rate η is a small positive constant. To ensure the non-negative constraint and obtain a data-adaptive learning rate, we set
$$\eta_{ij} = \frac{W_{ij}^{t}}{2\big[X^T D_1 X W + \alpha W A + \gamma D_3 W\big]_{ij}}. \qquad (18)$$
Substituting Formula (18) into Formula (17) yields
$$W_{ij}^{t+1} = W_{ij}^{t}\, \frac{\big[X^T D_1 F + \alpha W S\big]_{ij}}{\big[X^T D_1 X W + \alpha W A + \gamma D_3 W\big]_{ij}}. \qquad (19)$$
Accordingly, the above update rule is a special case of the gradient descent method. The convergence of the optimization method is proved based on [43,44].

Definition 1
If G(w, w') ≥ F(w) and G(w, w) = F(w) are satisfied, then G(w, w') is considered to be an auxiliary function of F(w). Under the update rule
$$w^{t+1} = \arg\min_{w} G(w, w^{t}), \qquad (20)$$
the following lemma holds.

Lemma 1 F(w) is non-increasing under the update rule (20).

Proof of Lemma 1 According to Definition 1 and Formula (20), we can deduce that
$$\mathcal{F}(w^{t+1}) \le G(w^{t+1}, w^{t}) \le G(w^{t}, w^{t}) = \mathcal{F}(w^{t}),$$
where w denotes any element of W. Next, we prove the convergence of RLEFS w.r.t. W by constructing a proper auxiliary function G(w, w'). Since the update rules operate on an element-by-element basis, we use W_{ij} to denote the (i, j)th element of W and F_{ij}(W_{ij}) to denote the part of L(W) that is relevant to W_{ij}. The first-order and second-order partial derivatives of F_{ij}(W_{ij}) are
$$\mathcal{F}'_{ij}(W_{ij}) = \big[2X^T D_1 (XW - F) + 2\alpha W L + 2\gamma D_3 W\big]_{ij}, \quad \mathcal{F}''_{ij}(W_{ij}) = 2\big[X^T D_1 X\big]_{ii} + 2\alpha L_{jj} + 2\gamma [D_3]_{ii}.$$
Afterward, we obtain the second-order Taylor expansion of F_{ij}(W_{ij}):
$$\mathcal{F}_{ij}(W_{ij}) = \mathcal{F}_{ij}(W^{t}_{ij}) + \mathcal{F}'_{ij}(W^{t}_{ij})\,(W_{ij} - W^{t}_{ij}) + \frac{1}{2}\mathcal{F}''_{ij}(W^{t}_{ij})\,(W_{ij} - W^{t}_{ij})^2.$$
Inspired by [44], we set the following auxiliary function for F_{ij}(W_{ij}):
$$G(W_{ij}, W^{t}_{ij}) = \mathcal{F}_{ij}(W^{t}_{ij}) + \mathcal{F}'_{ij}(W^{t}_{ij})\,(W_{ij} - W^{t}_{ij}) + \frac{\big[X^T D_1 X W + \alpha W A + \gamma D_3 W\big]_{ij}}{W^{t}_{ij}}\,(W_{ij} - W^{t}_{ij})^2.$$
Obviously, G(W_{ij}, W_{ij}) = F_{ij}(W_{ij}). As for the other condition G(W_{ij}, W^{t}_{ij}) ≥ F_{ij}(W_{ij}) in Definition 1, we need to prove the following inequality:
$$\frac{\big[X^T D_1 X W + \alpha W A + \gamma D_3 W\big]_{ij}}{W^{t}_{ij}} \ge \frac{1}{2}\mathcal{F}''_{ij}(W^{t}_{ij}). \qquad (26)$$
It is obvious that Formula (26) is equivalent to the following form:
$$\big[X^T D_1 X W\big]_{ij} \ge \big[X^T D_1 X\big]_{ii} W^{t}_{ij}, \quad \big[W A\big]_{ij} \ge L_{jj} W^{t}_{ij}, \quad \big[D_3 W\big]_{ij} \ge [D_3]_{ii} W^{t}_{ij}, \qquad (27)$$
which holds due to the non-negativity of the involved matrices. Therefore, G(W_{ij}, W^{t}_{ij}) is an auxiliary function of F_{ij}(W_{ij}). Bringing this auxiliary function into Formula (20), we obtain the update rule of W:
$$W^{t+1}_{ij} = W^{t}_{ij}\, \frac{\big[X^T D_1 F + \alpha W S\big]_{ij}}{\big[X^T D_1 X W + \alpha W A + \gamma D_3 W\big]_{ij}}. \qquad (28)$$
As a consequence, the objective function (10) is non-increasing under (28) according to the above proof. The convergence proofs for the other two variables (F and B) are similar to that for W.

Experimental settings and results
In this section, we conduct experiments on eleven multi-label benchmark data sets in comparison with seven state-of-the-art multi-label feature selection methods. All experiments are performed on a 3.4 GHz i7-6700 machine with 16 GB RAM.

Experimental data sets
In our experiments, all data sets are fetched from the Mulan Library [45]; these data sets are also adopted in numerous works on multi-label learning [4,12,[46][47][48]]. They are collected from different fields. For example, the Enron data set is a subset of the Enron e-mail corpus [49] and comes from the text domain. The Flags data set is collected from the image field; it has 194 instances and 7 labels, including the colors red, green and blue. Several data sets come from the Yahoo collection, which belongs to multi-label text (web page) categorization. A detailed description of all the used data sets is summarized in Table 1.

Experimental settings
To demonstrate the effectiveness of the proposed method RLEFS, the following classic and state-of-the-art methods are used for comparison: PPT+MI [28], PPT+CHI [27], MIFS [13], MDMR [30], LRFS [2], RALM-FS [29] and SSFS [32]. A detailed analysis of these methods has been given in the related work. We first introduce some experimental parameters. The heat-kernel function is adopted for the graph Laplacian matrix:
$$S_{ij} = \begin{cases} e^{-\frac{\|Y_{\cdot i} - Y_{\cdot j}\|_2^2}{\sigma^2}}, & \text{if } Y_{\cdot i} \text{ and } Y_{\cdot j} \text{ are among each other's } p\text{-nearest neighbors},\\ 0, & \text{otherwise}, \end{cases}$$
where the parameters p and σ are set to 5 and 1, respectively. To ensure fairness, the hyperparameters of all methods are tuned over the same grid, i.e., {0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0}.
We utilize the optimal parameters with respect to classification performance during the training process, where fivefold cross-validation is adopted. Besides, BR [24] is used to transform multi-label data into binary classification data so that a linear Support Vector Machine (SVM) and K-Nearest Neighbors (KNN, K=3) can be used in our experiments, where the parameter C of the SVM is tuned in {10^{-4}, 10^{-3}, . . . , 10^{3}, 10^{4}}. At the same time, ML-kNN (k=10), a common multi-label learning classifier, is adopted as well [50]. Next, we adopt Micro-F1 and Macro-F1 (a.k.a. micro-average and macro-average, based on the F1-measure) and Hamming Loss as evaluation criteria for the proposed method and the compared methods [51]. Micro-F1 and Macro-F1 are two label-based evaluation metrics of the following forms:
$$\text{Micro-}F_1 = \frac{2\sum_{i=1}^{m} TP_i}{2\sum_{i=1}^{m} TP_i + \sum_{i=1}^{m} FP_i + \sum_{i=1}^{m} FN_i}, \qquad \text{Macro-}F_1 = \frac{1}{m}\sum_{i=1}^{m} \frac{2\,TP_i}{2\,TP_i + FP_i + FN_i},$$
where m and i denote the number of class labels and the ith label, respectively, and TP, FP and FN denote True Positives, False Positives and False Negatives. A larger value of Micro-F1 (or Macro-F1) indicates better classification performance. Besides, Hamming Loss (HL) is an example-based evaluation metric of the following form:
$$HL = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} \big[\hat{Y}_{ij} \oplus Y_{ij}\big],$$
where n and m denote the numbers of instances and labels, respectively, and Ŷ_{ij} and Y_{ij} denote the (i, j)th predicted label and the (i, j)th label of the original label set. In addition, we suppose D = {(X_{i·}, Y_{i·}) | 1 ≤ i ≤ n} denotes a test data set, Ŷ denotes the label set predicted by the learned multi-label classifier, and ⊕ denotes the XOR operator. Hamming Loss measures the fraction of misclassified instance-label pairs; a smaller value indicates better classification performance. These different types of evaluation metrics measure multi-label methods from multiple aspects.
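As a self-contained illustration (our own sketch, not the evaluation code used in the paper), these metrics can be computed from binary label matrices as follows:

```python
import numpy as np

def micro_macro_f1(Y_true: np.ndarray, Y_pred: np.ndarray):
    # Per-label counts of true positives, false positives, false negatives
    tp = ((Y_true == 1) & (Y_pred == 1)).sum(axis=0).astype(float)
    fp = ((Y_true == 0) & (Y_pred == 1)).sum(axis=0).astype(float)
    fn = ((Y_true == 1) & (Y_pred == 0)).sum(axis=0).astype(float)
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    macro = np.mean(2 * tp / np.maximum(2 * tp + fp + fn, 1e-12))
    return micro, macro

def hamming_loss(Y_true: np.ndarray, Y_pred: np.ndarray) -> float:
    # Fraction of misclassified instance-label pairs (the XOR of the matrices)
    return float((Y_true != Y_pred).mean())

Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 1, 1], [0, 1, 0]])
print(micro_macro_f1(Y_true, Y_pred), hamming_loss(Y_true, Y_pred))
```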

Experimental results
To evaluate the classification performance of the RLEFS method, we conduct numerous experiments on eleven different multi-label data sets. The experimental results are presented in tables and figures. First, we use the top 20% of the total features in each data set to calculate the average results and standard deviations of the different methods (all 19 features of the Flags data set are adopted). Tables 2, 3, 4 and 5 record the Macro-F1 and Micro-F1 results of all methods with the SVM and KNN (K = 3) classifiers, and Table 6 records the Hamming Loss results of all methods with the ML-kNN classifier. In these tables, all methods are ranked on each data set (the optimal method ranks first, the suboptimal second), average ranks are shown when several methods tie, and the average result over all data sets by each method is shown in the final row of each table. From these tables, we conclude that the proposed method RLEFS obtains the best average results: 0.161, 0.355, 0.183, 0.375 and 0.0722 in Tables 2, 3, 4, 5 and 6, respectively. The results of all methods using the SVM classifier in terms of Macro-F1 are described in Table 2, from which we find that RLEFS outperforms PPT+MI and the other compared methods in most cases.
[Table 2: Results of all methods using the SVM classifier in terms of Macro-F1. Table 3: Results of all methods using the SVM classifier in terms of Micro-F1. Table 4: Results of all methods using the 3NN classifier in terms of Macro-F1. Table 5: Results of all methods using the 3NN classifier in terms of Micro-F1.]
Furthermore, the statistical significance results are summarized in Table 7. As can be seen, the null hypothesis, which implies that all used methods have the same performance, is rejected at the significance level α = 0.05. As a result, the Bonferroni-Dunn test is used as the post hoc test. The significant difference between RLEFS and the other compared methods is estimated by checking whether the average ranks of two methods differ by at least the critical distance (CD), where
$$CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}},$$
with q_α = 2.69, α = 0.05, k = 8 (the number of algorithms) and N = 11 (the number of data sets); thus, CD = 2.8096. Figure 1 shows the CD diagrams on all used evaluation metrics. In these CD diagrams, we use a red line to denote the distance between the RLEFS method and the other compared methods within one CD. As we can see in these CD diagrams, RLEFS differs significantly from the other methods and holds a competitive advantage over them. Moreover, the proposed method ranks first under all evaluation metrics.

Fig. 2
Eight methods on six data sets using the SVM classifier in terms of Macro-F1
Besides, we present the experimental results on six representative data sets, including Arts, Education, Enron, Flags, Reference and Science, in Figs. 2, 3, 4, 5 and 6. The X-axis and Y-axis indicate the number of already-selected features and the classification performance under the corresponding evaluation criterion, respectively. The number of selected features varies from the top 1% to the top 20% of all features, with a step size of 1%. As shown in Figs. 2, 3, 4, 5 and 6, RLEFS achieves the best classification performance. In most cases, the classification performance of RLEFS increases at first and then stabilizes (Figs. 2, 3, 4 and 5), whereas it decreases at first and then stabilizes in Fig. 6. Overall, the proposed RLEFS method outperforms the seven compared methods in the experiments.

Sensitivity analysis of parameters
We present and discuss the sensitivity of the parameters of the proposed method RLEFS, namely α, β, γ and λ. In this section, we take the Arts data set as an example.

Fig. 4
Eight methods on six data sets using the 3NN classifier in terms of Macro-F1

First, these parameters are tuned in {0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0}. One parameter is adjusted while the other three are fixed; in this paper, the fixed parameters are set to 0.5. We show the analysis results with the SVM classifier. In Fig. 7a-d, we observe that the classification performance is sensitive to the values of these parameters; however, RLEFS achieves better results in the range 0.3-0.7 in most cases. Therefore, a larger and finer grid should be used in practical applications.
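A minimal sketch of this one-at-a-time sensitivity protocol follows (our own illustration; evaluate_macro_f1 is a hypothetical stand-in for training RLEFS, selecting the top-k features and scoring a classifier under fivefold cross-validation):

```python
import numpy as np

GRID = [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

def evaluate_macro_f1(alpha, beta, gamma, lam):
    # Hypothetical placeholder: the real protocol would run RLEFS with these
    # parameters and return the cross-validated Macro-F1 on the data set.
    return np.random.rand()

fixed = {"alpha": 0.5, "beta": 0.5, "gamma": 0.5, "lam": 0.5}
for name in fixed:
    for value in GRID:
        params = dict(fixed, **{name: value})  # vary one, fix the rest at 0.5
        score = evaluate_macro_f1(**params)
        print(f"{name}={value}: Macro-F1={score:.3f}")
```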

Time complexity analysis
Next, we display the execution time of RLEFS and the compared methods on all data sets; the execution times are recorded in Table 8. By analyzing them, we find that RLEFS costs the least execution time on most data sets. RALM-FS, PPT+MI and SSFS obtain similar execution times, and the running time of PPT+CHI approaches that of MIFS. MDMR and LRFS cost more running time than the other embedded methods due to the interactive correlations among features and each label. Overall, RLEFS has an acceptable execution time and classification performance compared to the other methods.

Conclusions
Most existing feature selection methods adopt the following scheme: extracting logical label correlations and maintaining the consistency between features and labels. However, label information described by logical values cannot reflect the importance of labels. In this paper, we propose a joint multi-label feature selection framework, RLEFS, which has several appealing characteristics. First, RLEFS transforms logical labels into numerical labels and imposes the $l_{2,1}$-norm on the new numerical label matrix; that is, we extract numerical label information through label correlations between the feature space and the reconstructed label space under three $l_{2,1}$-norm constraints to guide feature selection. The proposed method then shares the new numerical label matrix between the feature set and the label set. Furthermore, RLEFS utilizes the graph Laplacian matrix to ensure the consistency of label information. To verify the effectiveness of the proposed method, we conduct numerous experiments on multiple multi-label data sets in comparison with multiple state-of-the-art multi-label feature selection methods. The experimental results show that RLEFS outperforms the compared methods in terms of different evaluation criteria, i.e., Micro-F1, Macro-F1 and Hamming Loss. In other words, RLEFS can make full use of effective label information to select the most discriminative features. In addition, RLEFS has some limitations. For instance, it relies on a fixed Laplacian graph, which is not ideal for the joint multi-label learning and feature selection process; how to effectively combine the advantages of joint Laplacian graph learning and label enhancement learning remains an open problem.
In future work, multi-label feature selection remains our research focus. We will further study feature selection under non-convex optimization and causal mechanisms, motivated by interpretability concerns: causal relationships can explain the internal mechanism of a model at a deeper level, helping people understand the black-box problem.