Parallel dual-channel multi-label feature selection

In multi-label learning, feature selection methods are often adopted to address the high dimensionality of feature spaces. Most existing multi-label feature selection algorithms focus on exploring the correlation between features and labels and then obtain the target feature subset by importance ranking. These algorithms commonly use a single-channel structure to obtain important features, which induces excessive reliance on the ranking results and causes the loss of important features; moreover, the correlation between label-specific features and label-instances is ignored. Therefore, this paper proposes the Parallel Dual-channel Multi-label Feature Selection algorithm (PDMFS). We first introduce the concept of the dual channel and design the algorithm model as two independent modules, which yield different feature correlation sequences and thus avoid the loss of relevant features. The proposed algorithm then uses the subspace model to select, for each sequence, the feature subset with maximum correlation and minimum redundancy, obtaining feature subsets under the respective correlations. Finally, the subsets are cross-merged to reduce the important-feature loss caused by a serial structure processing a single feature correlation. Experimental results on eight datasets and statistical hypothesis testing indicate that the proposed algorithm is effective.


Introduction
As a research hotspot in the fields of machine learning, pattern recognition, and data mining, multi-label learning (Zhang and Zhou 2013) has attracted much attention. In contrast to traditional single-label learning, multi-label learning can handle classification tasks with multiple targets, which is more relevant to realistic classification tasks. With the advent of the 5G information era, multi-label data has gradually become huge and high-dimensional. The high-dimensional features of multi-label datasets inevitably include irrelevant or redundant features, which can lead to the curse of dimensionality and reduce the efficiency of multi-label classification tasks. Feature selection is an effective technique for addressing the high dimensionality of data (Zhang et al. 2021a, 2014). It reduces feature dimensionality by selecting feature subsets with the fewest irrelevant or redundant features under a certain feature-specific metric (Spolaôr et al. 2013; Fan et al. 2021a; Huang and Wu 2021).
In recent years, various multi-label feature selection algorithms (Fan et al. 2021b; Li and Cheng 2019; Hu et al. 2020; Jiang et al. 2020) have been proposed. These algorithms are usually divided into three main categories: filter methods, wrapper methods, and embedded methods. Filter methods perform classifier-independent feature selection: they measure properties of the data, such as mutual information, and produce a feature ranking before classification. Wrapper methods use the accuracy of a particular classifier to judge the quality of selected features, relying on the learning performance of an off-the-shelf classifier for their evaluation. Embedded methods use feature selection as a functional component of a predefined classifier and achieve model fitting and feature selection simultaneously. In this paper, we consider only filter methods in order to design efficient evaluation metrics and evaluation procedures for feature selection.
Current multi-label feature selection algorithms differ in how they select feature subsets, but their goals are extremely similar: first perform relevance analysis on the features and then select features according to established requirements (usually, selecting low-redundancy features). If the relevance analysis and the selection requirement are modularized, the general structure of current algorithms is serially connected modules, which this paper calls a single-channel structure. The single-channel structure has achieved remarkable results in feature selection. However, the relevance analysis of features is not unique. In recent years, Ping et al. found a high level of label association between the label set and the feature set; moreover, labels naturally cluster into several groups, with similar labels tending to fall into the same group and different labels into different groups. Based on this, they proposed the Multi-label Feature Selection Considering the Max-Correlation in high-order labels (MCMFS) (Zhang et al. 2020) algorithm, in which relevance analysis between features and label groups replaces relevance analysis between features and labels. Although MCMFS only performs the new relevance analysis without combining it with the traditional one, this work provides a new idea: learning new correlations of features for relevance analysis. For example, the treatment of dual correlation is recognized in causal feature selection (Guo et al. 2022).
In subsequent studies, it was found that the features obtained from an instance are an external manifestation of the instance. When partial label information becomes prior experience, there should also be a correlation between features and instances. For example, when people are ready to buy a vehicle, they first investigate the vehicle features within the label range after learning label information such as the brand, model, and applicable population. After obtaining enough information about the vehicle features within that label range, they consider the vehicle's sample information to determine their purchasing needs. Therefore, learning from the data labels to obtain important label-specific feature information can effectively help classify a sample, and label-instance information can help locate these important features. It is thus necessary to learn the correlation between label-specific features and label-instances.
A large number of experimental studies have shown that there is a connection between features and labels, and each label has corresponding features (Wang et al. 2020, 2021a; Cheng et al. 2022; Huang et al. 2015; Zhang et al. 2021b). Based on the correlation of labels, Huang et al. proposed the Learning Label-Specific Features for multi-label classification algorithm (LLSF), which incorporates the correlation between labels (Huang et al. 2015). Based on the label-specific features produced by LLSF, a generic relationship matrix W that intuitively reflects each feature's role in each label can be obtained. Aiming at this generic relationship of features in labels, our study obtains the mutual information and label distribution information for relevance analysis; a metric for a new relevance analysis is then obtained using the label-specific features extracted by the LLSF algorithm. In this process, the correlation between features and labels is also involved, so both correlations must be considered. This means that the proposed algorithm needs two different relevance analysis modules, and placing the second relevance analysis module in series within a single-channel structure would destroy the integrity of the first module and thereby lose relevant features. Hence, a dual-channel feature selection method is proposed.
The dual-channel feature selection method is a parallel structure that independently solves the feature redundancy problem under the two different feature relevance analyses, thereby alleviating the feature loss problem of the single channel. Meanwhile, related dual-channel algorithms have been noted (Cui et al. 2021). Dual-channel structures are extremely rare in machine learning but are commonly used in deep learning to form dual-channel convolutional neural networks (Zhou et al. 2021). In recent years, dual-channel CNNs have been applied to research fields closely related to features, such as population feature engineering and protein sequence feature fusion (Xu et al. 2019), indicating that the dual-channel structure can promote feature learning. These research results provide theoretical support for the idea of this paper. Specifically, a dual-channel structure is applied to subspace-based feature selection, and a Parallel Dual-channel Multi-label Feature Selection (PDMFS) algorithm is proposed. The algorithm independently solves feature redundancy under different feature relevance analyses by establishing parallel information channels and then merging the resulting subsets to obtain the target feature set. This method attempts to select low-redundancy features under different feature correlations to obtain the best target feature set.
The main contributions of this paper are as follows: (1) Different from current multi-label feature selection methods, this paper utilizes a parallel dual-channel structure to solve the problem of feature redundancy under different correlations as well as the problem of redundancy within the selected subsets.
(2) Compared with traditional multi-label feature selection, PDMFS considers both the correlation between features and labels and the correlation between label-specific features and label-instances. PDMFS preserves the original ordering of the pre-merged subsets to the greatest extent through cross-merging, avoiding feature confusion within the target feature set.
(3) The experiment on eight benchmark multi-label datasets shows better performance of PDMFS against state-of-the-art multi-label feature selection algorithms in multi-label classification.
The rest of this paper is organized as follows. The second section introduces related work. The next section presents the necessary theoretical background. The fourth section then presents the details of the proposed method. The comparison between the proposed algorithm and other advanced algorithms follows, and the last section concludes the paper.

Related work
As a common measurement metric, information entropy (Kim 2015a, 2015b; Lin et al. 2014; Estrela et al. 2020) has been widely used in feature selection. Lee et al. proposed a multi-label feature selection algorithm based on multi-variate mutual information to maximize the relevance between features and labels (PMU) (Lee and Kim 2013). Zhang et al. used two projection strategies to project the original data into a lower-dimensional feature space based on the maximum relevance between the original features and the label space, and proposed attribute reduction algorithms based on maximum relevance (MDDMspc, MDDMproj) (Zhang and Zhou 2010). Recently, Amin et al. modeled the feature selection process as a multi-criteria decision-making process for the first time and proposed multi-label feature selection with multi-criteria decision making (MFS-MCDM) (Amin et al. 2020). However, the above methods do not consider the redundancy between features. According to the mutual information between features and labels and the principle of maximizing relevance and minimizing redundancy (Lin et al. 2015), Liu et al. divided the feature space into local subspaces (Zeng et al. 2017) and performed feature selection at a fixed ratio, proposing a multi-label feature selection algorithm based on local subspaces (MFSLS). Building on this research, Lin et al. proposed a multi-label feature selection algorithm based on neighborhood mutual information (MFNMIpes); they defined the neighborhood concept from different cognitive viewpoints and extended neighborhood information entropy to multi-label learning.
In subsequent studies, MCMFS (Zhang et al. 2020) replaced feature-label correlation analysis with feature-label-group correlation analysis through spectral clustering. Similarly, Aim et al. implemented bi-objective optimal feature selection using Pareto clustering for feature relevance and redundancy, proposing an efficient Pareto-based feature selection algorithm for multi-label classification (PMFS) (Aim et al. 2021). Regarding feature selection algorithms that combine label-specific features (Zhang et al. 2023; Yu et al. 2021; Wu et al. 2020), Zhang et al. achieved full utilization of label information by learning label-specific features and proposed the Fast multi-label feature selection via global relevance and redundancy optimization (GRROfast) algorithm. In addition, Zhang et al. further utilized label-specific features to present a new group-preserving label-specific feature selection (GLFS) (Zhang et al. 2023) algorithm for multi-label learning, which simultaneously considers the features shared by labels in the same group and the specific features owned by each label when executing feature selection. It can be seen that multi-label feature selection algorithms using label-specific features for relevance analysis are gradually being developed.
Although the above algorithms differ in how they select feature subsets, they all follow the same rule: first determine the features relevant to the learning task, and then conduct feature selection according to the requirements. For example, the MDDM, PMU, and MFS-MCDM algorithms all first obtain a feature ranking that satisfies the maximum relevance between features and labels and then select an appropriate feature subset according to the requirements. These algorithms differ from the MFSLS and MFNMIpes algorithms in the feature selection process, where it was found that features with less redundancy better meet the needs of the learning task; a feature subset with maximum relevance and minimum redundancy is finally obtained (Lin et al. 2015). On this basis, new feature relevance analyses such as label clustering and label-specific features have been incorporated, and multi-label feature selection algorithms such as MCMFS, PMFS, GRROfast, and GLFS have been proposed successively. However, the relevance analysis remains singular.

Information entropy
By borrowing the concept of thermal entropy from thermodynamics, Shannon proposed information entropy in 1948. Information entropy describes the degree of uncertainty of information and defines the amount of information in mathematical language. Let $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_m\}$ be two discrete random variables, and let $P(a_i)$ be the probability of $a_i$. The information entropy (Li and Cheng 2019; Lin et al. 2014, 2016) of $A$ is

$$H(A) = -\sum_{i=1}^{n} P(a_i)\log_2 P(a_i).$$

The joint entropy (Lin et al. 2014) of $A$ and $B$ is

$$H(A, B) = -\sum_{i=1}^{n}\sum_{j=1}^{m} P(a_i, b_j)\log_2 P(a_i, b_j).$$

Given $B$, the conditional entropy (Lin et al. 2014) of $A$ is

$$H(A \mid B) = H(A, B) - H(B).$$

When the information of random variable $B$ is obtained, the entropy of random variable $A$ is reduced, and the amount of reduction is the mutual information (Lin et al. 2014) of $A$ and $B$:

$$I(A; B) = H(A) - H(A \mid B).$$

$I(A; B)$ measures the statistical correlation between $A$ and $B$, and $I(A; B) \geq 0$. When $I(A; B) = 0$, $A$ and $B$ are independent of each other, and no information is provided between the two variables. In multi-label feature selection, given two features $x_1$ and $x_2$ describing an instance, $I(x_1; x_2)$ effectively describes the redundancy between $x_1$ and $x_2$. Given a feature $x$ describing the object and a label set $Y = \{y_1, y_2, \ldots, y_l\}$, $\forall y_i \in Y$, $i = 1, 2, \ldots, l$, $I(x; y_i)$ effectively describes the degree of correlation between the feature and the label. In this case, the mutual information between feature $x$ and label set $Y$ is

$$I(x, Y) = \sum_{i=1}^{l} I(x; y_i).$$
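As an illustration, the entropy and mutual information quantities above can be computed for small discrete samples as follows (a minimal sketch using plug-in probability estimates; the function names are ours, not from the paper):

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Plug-in estimate of H(X) = -sum_i P(a_i) log2 P(a_i)."""
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for discrete samples."""
    joint = list(zip(x, y))
    return entropy(x) + entropy(y) - entropy(joint)

def feature_label_relevance(x, label_columns):
    """I(x, Y): sum of I(x; y_i) over all label columns y_i."""
    return sum(mutual_information(x, y) for y in label_columns)
```

For a binary feature identical to a label, `mutual_information` returns 1 bit; for a feature independent of the label, it returns 0.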

Subspace feature selection method
In multi-label feature selection, features with high redundancy are mostly irrelevant, but features with a strong correlation may also be highly redundant. This contradiction can be resolved by establishing a subspace model (Zeng et al. 2017; Liu et al. 2016). Based on this, the feature sequence sorted in descending order of mutual information, either between feature and label or between label-specific feature and label-instance, can be divided into $k$ subspaces, and feature selection can then be performed under the sampling ratio $P$. The specific process is as follows. Given a feature space of dimension $m$, the space is divided into $k$ subspaces of dimension $m/k$, where the $i$-th subspace is $f_i = \{x_{i1}, x_{i2}, \ldots, x_{i[m/k]}\}$, $\forall x_{ij} \in f_i$, $j = 1, 2, \ldots, [m/k]$. The total mutual information between $x_{ij}$ and the other features in the subspace is

$$R(x_{ij}) = \sum_{s \neq j} I(x_{ij}; x_{is}).$$

The smaller $R(x_{ij})$, the lower the redundancy between $x_{ij}$ and the other features (Lin et al. 2015; Liu et al. 2016). By sorting the values obtained from formula (6) in ascending order, a redundancy ranking of features $f'_i = \{x'_{i1}, x'_{i2}, \ldots, x'_{i[m/k]}\}$ in the $i$-th subspace can be obtained. Through the sampling ratio $P$, the features with less redundancy in the $i$-th subspace are selected as a subset, and the subsets are then merged into the final subset.
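The subspace procedure can be sketched as follows (our own illustrative implementation; the discrete `mi` helper and the round-based subset sizing are assumptions, not the authors' exact code):

```python
import numpy as np
from collections import Counter

def mi(a, b):
    # Discrete mutual information I(a; b) = H(a) + H(b) - H(a, b).
    def H(v):
        n = len(v)
        return -sum((c / n) * np.log2(c / n) for c in Counter(v).values())
    return H(tuple(a)) + H(tuple(b)) - H(tuple(zip(a, b)))

def subspace_select(ranked, X, k=3, P=(0.6, 0.3, 0.1)):
    """Split a relevance-ranked feature index list into k subspaces,
    rank each subspace by redundancy R(x_ij), and keep a P[i] fraction."""
    m = len(ranked)
    size = m // k
    selected = []
    for i in range(k):
        sub = ranked[i * size:] if i == k - 1 else ranked[i * size:(i + 1) * size]
        # R(x_ij): total mutual information with the other subspace features
        red = [sum(mi(X[:, a], X[:, b]) for b in sub if b != a) for a in sub]
        order = np.argsort(red)                  # ascending: least redundant first
        keep = max(1, int(round(P[i] * len(sub))))
        selected.append([sub[j] for j in order[:keep]])
    return selected
```

With six features and k = 3, each subspace holds two features and one feature per subspace survives under the ratios above.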

Learning label-specific feature
In LLSF, proposed by Huang et al., it is assumed that each class label is associated with a feature subset of the original feature set; that is, the features relevant to each label are sparse relative to the full feature set (Huang et al. 2015). LLSF models the discriminative properties of label-specific features through linear regression, using $\ell_1$-norm regularization on the regression parameters to model the sparsity of the label-specific features.
Due to the non-smoothness of the $\ell_1$-norm regularization term, the objective function of the minimization problem in formula (7) is also non-smooth. Since the problem is convex, LLSF solves it with an accelerated proximal gradient method, where $\mathrm{Tr}(RW^{\mathsf{T}}W)$ encodes the label-correlation regularizer and $L_f$ is the Lipschitz constant of the smooth part of the objective. From the optimized matrix $W$, the label-specific feature set $W = \{w_1, w_2, \ldots, w_m\}$, $\forall w_t \in W$, $t = 1, 2, \ldots, m$, can be obtained.
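The per-label $\ell_1$-regularized regression at the heart of LLSF can be approximated with plain (non-accelerated) iterative soft-thresholding. The sketch below is our simplification under that assumption; it omits the label-correlation term $\mathrm{Tr}(RW^{\mathsf{T}}W)$ and is not the authors' accelerated solver:

```python
import numpy as np

def llsf_sketch(X, Y, beta=0.01, lr=0.01, iters=2000):
    """Minimal ISTA sketch: learn sparse label-specific weights W (m x l)
    by minimizing 0.5 * ||XW - Y||_F^2 / n + beta * ||W||_1."""
    n, m = X.shape
    l = Y.shape[1]
    W = np.zeros((m, l))
    for _ in range(iters):
        grad = X.T @ (X @ W - Y) / n                              # smooth part
        W = W - lr * grad                                         # gradient step
        W = np.sign(W) * np.maximum(np.abs(W) - lr * beta, 0.0)   # soft-threshold
    return W
```

Column $j$ of the returned W then plays the role of the label-specific feature weights for label $y_j$: large entries mark the features specific to that label, while the soft-threshold drives irrelevant weights to exactly zero.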

Mutual information between label-specific feature and label-instance
Denote the mutual information between feature $x_t$ and instance $d_j$ as $I(x_t; d_j)$, where, in the label space, $d_j = \{y_{1j}, y_{2j}, \ldots, y_{lj}\}$. Under the premise of knowing $W$, the correlation between features and instances is described by the mutual information between the label-specific feature and the label-instance, so the correlation between feature $x_t$ and the instance set $D$ can be defined as

$$I(x_t, D) = \sum_{j=1}^{n} I(x_t; d_j).$$

Similar to the mutual information between feature and label, the greater the mutual information between the feature and the instance, the more important the feature is.

PDMFS: parallel dual-channel multi-label feature selection
To obtain features that maximize relevance and minimize redundancy, both between features and labels and between label-specific features and label-instances, this paper proposes the Parallel Dual-channel Multi-label Feature Selection algorithm (PDMFS). PDMFS designs the algorithm model as two independent modules, called Channel A and Channel B, which together form a dual-channel model. PDMFS performs subspace feature selection independently in the two channels, one for the correlation analysis between features and labels and the other for that between features and instances. The specific process is shown in Fig. 1. From Eqs. (5) and (10), it can be seen that the greater the mutual information between the feature and the label, or between the label-specific feature and the label-instance, the more important the feature is. By sorting the mutual information $I(x, Y)$ and $I(x_t, D)$ in descending order, two feature importance rankings $F_Y$ and $F_D$ are obtained.
The two rankings are processed by the subspace model in their respective channels, and the union of the resulting feature subsets is the target feature subset

$$R = f''_y \cup f''_d,$$

where $f''_y$ and $f''_d$ denote the unions of the selected subspace feature subsets under the two rankings.
As shown in Fig. 1, Channel A calculates the mutual information $I(x, Y)$ between features and labels as one feature relevance analysis, and Channel B calculates the mutual information $I(x_t, D)$ between the label-specific features obtained by LLSF and the label-instances as the other. The feature importance rankings $F_Y$ and $F_D$ then independently undergo maximum-relevance minimum-redundancy feature selection with the local subspace model. Finally, the channel feature subsets are merged to obtain the target feature subset.
According to the above description, the pseudocode of PDMFS is as follows.

(Fig. 1: Flowchart of PDMFS)

If PDMFS simply concatenated the feature subsets, the feature sequences of the original sets would become disordered. In this case, the target feature subset is internally disordered, resulting in poor performance stability of the algorithm. Therefore, this paper preserves the original set order to the maximum extent with a cross-merging method. Given the subsets $f''_y$ and $f''_d$ obtained through the parallel dual channels, each itself the union of per-subspace selections $f''_{y,i}$ and $f''_{d,i}$ ($i = 1, \ldots, k$), the acquisition of the target feature set $R$ is modified to

$$R = \bigcup_{i=1}^{k} \left( f''_{y,i} \cup f''_{d,i} \right).$$

The specific process is shown in Fig. 2. Cross-merging first returns the feature subsets to their state before the local subspaces of each channel were merged. The subspace feature sets with the same relevance level in the two channels are then merged, and finally the merged subsets are concatenated to obtain the final subset, yielding a more stable feature ranking within the feature subset.
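The cross-merging step can be sketched as follows, assuming each channel supplies its per-subspace subsets ordered from the most to the least relevant subspace (function and variable names are ours):

```python
def cross_merge(channel_a, channel_b):
    """Interleave the two channels' per-subspace subsets level by level,
    so features selected at the same relevance level stay adjacent and
    features picked by both channels appear only once."""
    merged, seen = [], set()
    for sub_a, sub_b in zip(channel_a, channel_b):
        for f in list(sub_a) + list(sub_b):
            if f not in seen:
                seen.add(f)
                merged.append(f)
    return merged
```

For example, with per-subspace picks [[1, 2], [3]] from Channel A and [[2, 4], [5]] from Channel B, the result keeps all level-one features first: [1, 2, 4, 3, 5].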

Complexity analysis
In PDMFS, suppose that the number of instances is $n$, the number of features is $m$, the number of labels is $l$, and the number of subspaces is $k$. The time complexity of the mutual information between features and labels is $O(ml)$, so the time complexity of Channel A is $O(ml)$. The time complexity of computing the label-specific feature matrix is $O(m^2 + ml + l^2 + nd + nl)$, and that of the mutual information between label-specific features and label-instances is $O(mn)$; thus, the time complexity of Channel B is $O(m^2 + ml + l^2 + nd + nl + mn)$. The subspace redundancy feature selection has time complexity $O(m(m-k)/k)$. Suppose the number of selected features is $b$; since the features chosen in this paper are the merger of the two channel sets, $b$ ranges over $[m/k, 2m/k]$. The time complexities of PMU, MFNMIpes, MCMFS, and GRROfast are $O(mnl + bnml + nml^2)$, $O(n^2ml + (nm)^2 + bmn^2)$, $O(nml + bnl)$, and $O(Tvl + m^2 + ml)$ respectively, where $T$ denotes the number of clustering iterations and $v$ is the number of groups (or cluster centers). Although PDMFS requires the calculation of two correlations, the monomial exponents of its time complexity are much lower than those of PMU, MFNMIpes, and MCMFS. Due to the group-label search method of GRROfast, the time complexity of PDMFS is higher than that of GRROfast. Since the dual channels run in parallel, the total time complexity of Channels A and B does not grow exponentially. Compared with the high-exponent time complexities of PMU, MFNMIpes, and MCMFS, the time complexity of PDMFS remains acceptable.

Experimental data
To verify the effectiveness of the proposed algorithm, eight datasets (Business, Computers, Education, Health, Recreation, Reference, Scene, and Science) are used in the experiments. Table 1 shows detailed information about the eight multi-label datasets, such as the domain from which each dataset was acquired and the number of samples.

Experimental environment and evaluation index
The experimental code is implemented in Matlab 2016a. The hardware platform is a computer with an Intel Core i5-9600K CPU (3.70 GHz) and 16 GB of memory, running the Windows 10 operating system. Six common multi-label evaluation indicators are used to comprehensively evaluate algorithm performance: Average Precision (AP), Coverage (CV), Hamming Loss (HL), Ranking Loss (RL), Macro F1-score (Macro-F1), and Micro F1-score (Micro-F1) (Schapire and Singer 2000). For convenience, the indicators are abbreviated as AP↑, CV↓, HL↓, RL↓, Macro-F1↑, and Micro-F1↑, where ↑ means the higher the value the better the performance, and ↓ means the lower the value the better the performance. Given the multi-label classifier $h(\cdot)$, the prediction function $f(\cdot,\cdot)$, the ranking function $\mathrm{rank}_f$, and the multi-label dataset $D = \{(x_i, Y_i) \mid 1 \leq i \leq n\}$, the six evaluation indicators are as follows: 1. AP: It evaluates the average score of the correct label permutation of a specific label. The value of this indicator is between 0 and 1, and the higher the value, the better the performance.
2. CV: It measures the average number of steps needed to traverse all relevant labels of a sample. The value of this indicator is greater than 0, and the lower the value, the better the performance.
3. HL: It measures the mismatch between the true label and the predicted label of the sample in the case of a single label. The value of this indicator is between 0 and 1, and the lower the value, the better the performance.
4. RL: It considers the situation in which the ranking of the unrelated labels of the sample is lower than the ranking of the related labels. The value of this indicator is between 0 and 1, and the lower the value, the better the performance.
F1-score: It measures the accuracy of a binary classification model in statistics, taking into account both the precision and the recall of the classification model. Its maximum value is 1 and its minimum value is 0. For the $j$-th label $y_j$ ($1 \leq j \leq l$), the binary classification performance of the classifier $h(\cdot)$ on this label can be described by four statistics:
• $TP_j$ (the number of true positive examples)
• $FP_j$ (the number of false positive examples)
• $TN_j$ (the number of true negative examples)
• $FN_j$ (the number of false negative examples)
For multi-label problems, Micro-F1 and Macro-F1 can be used for performance evaluation, each with a maximum value of 1 and a minimum value of 0. Based on the above statistics: 5. Macro-F1: It is the arithmetic mean of the F1-scores of all labels. The value of this indicator is between 0 and 1, and the higher the value, the better the performance.
6. Micro-F1: It can be seen as a weighted average of the F1 scores of all labels. The value of this indicator is between 0 and 1, and the higher the value, the better the performance.
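Hamming Loss and the two F1 variants can be computed directly from binary indicator matrices; a minimal numpy sketch (the function names are ours):

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of instance-label pairs that are misclassified (lower is better)."""
    return float(np.mean(Y_true != Y_pred))

def macro_micro_f1(Y_true, Y_pred):
    """Macro-F1: mean of per-label F1. Micro-F1: F1 over pooled counts."""
    tp = ((Y_true == 1) & (Y_pred == 1)).sum(axis=0)
    fp = ((Y_true == 0) & (Y_pred == 1)).sum(axis=0)
    fn = ((Y_true == 1) & (Y_pred == 0)).sum(axis=0)
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1)   # F1 for each label
    macro = float(per_label.mean())
    micro = float(2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1))
    return macro, micro
```

Macro-F1 weights every label equally, while Micro-F1 pools the counts and therefore weights labels by their frequency, which is why the two can differ on imbalanced label sets.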

Parameter settings and experimental results
The comparison algorithms include PMU, MDDMspc, MDDMproj, MFNMIpes, MFS-MCDM, Multi-label Feature Selection Considering the Max-Correlation (MCMFS) (Zhang et al. 2020), and Fast multi-label feature selection via Global Relevance and Redundancy Optimization (GRROfast). PMU, MDDMspc, MDDMproj, and MFNMIpes, as classical information-metric feature selection algorithms, are used to assess the effectiveness of PDMFS; MFS-MCDM, MCMFS, and GRROfast, as state-of-the-art feature selection algorithms of recent years, are used to assess its advancement. In the experiments, the d parameter of MDDMspc is set to 0.5. The label-specific feature coefficient matrix W of PDMFS is extracted by the LLSF algorithm; the matrix W is the result after five-fold cross-validation, and the α, β, and γ parameters of LLSF range over $[2^{-10}, 2^{10}]$, $[2^{-10}, 2^{10}]$, and $\{0.1, 1, 10\}$ respectively. As the number of features in the experimental datasets lies in $[294, 793]$, which is not too large, and the literature has experimentally demonstrated that the best prediction is achieved by dividing the feature space into three subspaces when the feature dimension is not too high, the number of subspaces k is set to 3 in PDMFS. For the sampling ratio P of the three subspaces, the literature has experimentally demonstrated that the best prediction is achieved with $\{0.6, 0.3, 0.1\}$. For the other comparison algorithms, default parameter settings are used. Unlike wrapper and embedded methods, the filter-based multi-label feature selection process is independent of the classifier; therefore, the feature selection time loss and feature set performance evaluation of PDMFS are not constrained by the classifier. In the experiments, to compare the performance of the feature subsets selected by each algorithm effectively, ML-KNN, a fair and effective classifier familiar in the field of multi-label feature selection, is used to evaluate feature subset performance.
ML-KNN is a multi-label version of the KNN algorithm; its number of nearest neighbors k is set to 10 and its smoothing coefficient to 1 (Zhang and Zhou 2007). The number of features selected by PDMFS is fixed, while the number obtained by the other comparison algorithms is random; to better observe the changes in each algorithm's indicators, all other algorithms use the same number of features as PDMFS. Tables 2, 3, 4, 5, 6 and 7 show the prediction performance of the eight algorithms MDDMspc, MDDMproj, PMU, MFNMIpes, MFS-MCDM, MCMFS, GRROfast, and PDMFS, where Average indicates the average ranking of each algorithm and the best experimental results are shown in bold. From Tables 2, 3, 4, 5, 6 and 7, we can observe that PDMFS obtains better or comparable performance than any of the chosen comparison methods in the average ranking of all metrics. Although PDMFS still does not perform well on particular data, this does not affect the overall evaluation of the algorithm. For example, although individual evaluation metrics are not optimal on the Reference dataset, most metrics are second only to state-of-the-art multi-label feature selection algorithms such as GRROfast and still significantly outperform classical algorithms such as PMU. In addition, PDMFS and the other state-of-the-art algorithms performed worse than the classical algorithms in AP, CV, and RL on the Education dataset. We realize that specific properties of the data can affect algorithm performance. For example, when the truly relevant features in the feature distribution are very sparse, adding other features can hardly offset the effect of redundancy even if those features are still extremely relevant. Comparing the Reference and Education datasets with the others, it can be found that PDMFS performs well on most datasets, but considering the specific properties of the data, there is still room for improvement of PDMFS.
Furthermore, the comparison algorithms are all multi-label feature selection algorithms that search features globally, while PDMFS performs an overall division followed by local search, so the feature distribution of a dataset affects the results of both the global and the local approaches. Compared with the GRROfast algorithm, PDMFS performs better in average ranking, which indicates that PDMFS obtains more correlated features using the dual-channel structure and proves the effectiveness of parallel dual relevance analysis.
To verify whether the cross-merged subset in the model improves the stability of the target feature set, we compare PDMFS with cross-merging against PDMFSN without cross-merging. Some results are shown in Table 8, where Metric means the indicator type. The variation of each indicator with the number of selected features is shown in Figs. 3, 4, 5, 6, 7, 8, 9 and 10, with an interval of five features. After cross-merging, the target feature set improved in 70.83% of the results over the eight datasets and three experimental metrics. This demonstrates that without cross-merging, the target feature set is cluttered internally and its performance is not fully exploited. In the other 29.16% of results, the change was not significant, with most differences between 0.0001 and 0.0006; only the AP and CV metrics of the Reference dataset and the HL metric of the Scene dataset showed relatively large performance changes of 0.0787, 0.0042, and 0.0084, respectively. Based on Figs. 3 and 9, the convergence of the metric curves with cross-merging has improved compared to without it, and the overall performance tends to be stable. This indicates that the variation of feature sequences within the target feature set after cross-merging tends to be more convergent and stable in general, and the small performance loss is worth the cost.
Most of the comparisons in Figs. 3, 4, 5, 6, 7, 8, 9 and 10 support the conclusion that cross-merging enhances the stability of the target feature set, but in some cases the CV and HL metrics change abruptly and sharply, as shown in Fig. 6. This raises a question: does an abrupt variation in feature selection performance indicate unstable selection results? In most cases, yes, because abrupt performance changes tend to cause nonconvergence of the results, which is why this paper introduces cross-merging to obtain the target feature set. However, taking the CV metric on the Education dataset as an example, the convergence in the final part of the curve is clearly better than without cross-merging, even though sharper performance fluctuations occur after cross-merging. Recall that the underlying structure of PDMFS is the subspace feature selection model and that the final target feature set is the union of the subspace selection subsets; abrupt performance changes are therefore inevitable at the points where subsets are joined. Consequently, the stability of the subspace feature selection result can still be judged by its final convergence, and abrupt changes that do not affect the final convergence can be ignored.
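The cross-merging step described above can be sketched as alternately drawing features from the two channel subsets while skipping duplicates. This is a minimal illustration of the idea only; the exact interleaving rule used by PDMFS may differ, and the feature names below are hypothetical.

```python
from itertools import chain, zip_longest

def cross_merge(seq_a, seq_b, target_size=None):
    """Alternately take features from the two channel subsets,
    skipping duplicates, until the (optional) target size is reached.
    A sketch of the cross-merging idea, not the paper's exact rule."""
    merged, seen = [], set()
    for feat in chain.from_iterable(zip_longest(seq_a, seq_b)):
        if feat is None or feat in seen:
            continue
        seen.add(feat)
        merged.append(feat)
        if target_size is not None and len(merged) == target_size:
            break
    return merged

# Two hypothetical channel subsets, each ranked by its own relevance measure.
channel_1 = ["f3", "f1", "f7"]
channel_2 = ["f1", "f5", "f2"]
print(cross_merge(channel_1, channel_2))  # -> ['f3', 'f1', 'f5', 'f7', 'f2']
```

Interleaving rather than concatenating keeps highly ranked features from both channels near the front of the merged sequence, which is consistent with the stability improvements reported above.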

Statistical hypothesis testing
This paper adopts the Nemenyi test at a significance level of $\alpha = 0.05$ (Janez and Dale 2006) to evaluate the comprehensive performance of PDMFS and the other algorithms. If the difference in average ranking between two algorithms over all datasets is greater than the critical difference (CD), the two algorithms are considered significantly different; otherwise, there is no significant difference. The CD value is calculated as

$$CD = q_{\alpha}\sqrt{\frac{K(K+1)}{6N}},$$

where $K = 8$, $N = 8$, and $q_{\alpha} = 3.0310$, giving $CD = 3.7122$. Figure 11 shows the comparison between the algorithms on the six indicators AP, CV, HL, RL, Macro-F1, and Micro-F1. Algorithms with no significant difference are connected by a solid line, and for each indicator the algorithms are ordered from left to right by decreasing performance.
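The CD computation above is a one-line formula; the following snippet reproduces the value reported in the text from the stated constants ($K = 8$ algorithms, $N = 8$ datasets, $q_{0.05} = 3.0310$).

```python
import math

def nemenyi_cd(q_alpha, k, n):
    """Critical difference for the Nemenyi post-hoc test:
    CD = q_alpha * sqrt(k * (k + 1) / (6 * n)),
    with k algorithms compared over n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

# Constants used in this paper: K = 8, N = 8, q_0.05 = 3.0310.
cd = nemenyi_cd(3.0310, 8, 8)
print(round(cd, 4))  # -> 3.7122
```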
For each algorithm, there are 42 experimental comparison results (7 comparison algorithms × 6 evaluation indicators). From the results shown in Fig. 11, we can conclude that PDMFS ranks first among all methods and is significantly different from the other algorithms in 54.77% of cases. Compared with the GRROfast algorithm, PDMFS does not pull decisively ahead, but under the condition of a specific feature subset size it maintains the leading position. The results show that PDMFS is superior to the classical algorithms and competitive with the state-of-the-art comparison algorithms.
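The per-pair significance decisions behind Fig. 11 reduce to comparing average-rank gaps against the CD of 3.7122 reported above. The average ranks below are hypothetical, for illustration only, and the name "BaselineA" is not one of the paper's algorithms.

```python
CD = 3.7122  # Nemenyi critical difference reported in the text (alpha = 0.05)

def significantly_different(rank_a, rank_b, cd=CD):
    """Nemenyi decision rule: two algorithms differ significantly
    if their average-rank gap exceeds the critical difference."""
    return abs(rank_a - rank_b) > cd

# Hypothetical average ranks for illustration only (not the paper's values).
avg_rank = {"PDMFS": 1.5, "GRROfast": 2.8, "BaselineA": 5.9}
for name in ("GRROfast", "BaselineA"):
    gap = abs(avg_rank[name] - avg_rank["PDMFS"])
    verdict = "significant" if significantly_different(avg_rank[name], avg_rank["PDMFS"]) else "not significant"
    print(f"PDMFS vs {name}: gap = {gap:.2f} -> {verdict}")
```

In a CD diagram such as Fig. 11, pairs for which this rule returns `False` are the ones joined by a solid line.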

Conclusions
In this paper, a Parallel Dual-channel Multi-label Feature Selection algorithm (PDMFS) is proposed, which explores a parallel structure that considers both the correlation between features and labels and the correlation between label-specific features and label instances. PDMFS adopts a minimum-redundancy feature selection method under this dual correlation condition, and thereby avoids the loss of important relevant features that occurs in single-channel feature selection. The experimental results show that PDMFS retains more important, relevant features, and that cross-merging makes its target feature set more stable. However, PDMFS obtains the final target feature set simply as the union of subsets, ignoring the connections between feature subsets. From the subset perspective, each channel generates a subspace feature subset, and the minimum redundancy of each subset is inevitably affected when the channel subsets are merged; that is, the redundancy of the target feature set obtained by cross-merging is higher than that of the pre-merge subsets. This suggests that the selection model of PDMFS needs to consider the uniformity of properties between sets, especially between the subsets and the merged set. This will be a major direction of our future research.