Neural Networks with the Correlative Layer for Multi-label Classification

Multi-label classification is a significant but challenging task. Correlation between labels often exists, so recent works have paid much attention to exploiting label correlations to improve classification performance. However, how to effectively learn the correlations remains an open problem. In this paper, a general framework, i.e., the neural network with the correlative layer (CLNN), is proposed, where the correlative layer is used to express the dependencies between labels. Different from existing work, CLNN first trains a neural network without the correlative layer to obtain rough classification results and then trains the whole neural network to adaptively adjust all the weights, including those of the correlative layer. Thus CLNN can learn both positive/negative and strong/weak relationships between labels. We test CLNN with three typical neural networks, and experimental results show that each network achieves better performance after adding the correlative layer, which demonstrates that the CLNN framework is effective.


Introduction
In real-world applications, many samples (or objects) have multiple labels, and classification tasks on such samples are called multi-label classification [1]. For example, one text may belong to multiple categories, and a picture often contains multiple objects. Multi-label classification has been studied in many fields over the past decades [2], including functional genomics classification [3], image classification [4,5], text classification [6], etc. The requirements of multi-label classification are becoming more and more sophisticated, which makes its study of great significance [7].
Mathematically, multi-label classification can be described as follows [7]: the d-dimensional feature space $\mathcal{X} \subseteq \mathbb{R}^d$ is mapped to the l-dimensional output space $\mathcal{Y} = \{y_1, y_2, \cdots, y_l\}$, which means that the label space contains l possible labels. Each feature vector $x_i$ has a corresponding label vector $y_i \in \{-1, 1\}^l$ or $y_i \in \{0, 1\}^l$. Generally, $y_{j} = 1$ $(1 \le j \le l)$ means that $x_i$ owns label $y_j$, and otherwise not.
So far, many multi-label classification algorithms have been proposed. Typically, binary relevance (BR) [8] decomposes the multi-label classification problem into multiple independent binary classification problems. This algorithm is simple and straightforward, but it requires training l classifiers. A series of algorithms based on classifier chains (CC) [9][10][11][12][13] have also been proposed. Such algorithms do not carefully consider the correlation between labels. Because label correlations are often useful for improving classification performance, some algorithms have been developed to learn the label relationships from the training data [14][15][16][17][18]. Typically, Huang et al. [14] designed an algorithm named LLSF-DL that makes use of label-specific features and class-dependent labels to obtain better performance. Furthermore, the same authors [17] developed an improved algorithm named joint feature selection and classification (JFSC), where label correlations are used to select shared features and label-specific features. Collaboration based multi-label learning (CAMEL), which is based on the sparse reconstruction method, was proposed by Feng et al. [18]. However, although these algorithms can obtain better performance, they are not scalable because of heavy matrix computation, and many parameters have to be tuned manually.
Neural networks, as the popular technique in deep learning, have also been used to solve multi-label classification problems. Typically, backpropagation for multi-label learning (BPMLL) [3] is based on the traditional backpropagation neural network and relies on its own structure to mine label correlations. NN AD [6] redesigned the network structure of BPMLL by adopting popular techniques from deep learning, including the rectified linear unit (ReLU) activation function [19] and the Dropout technique [20]. Canonical correlated autoencoder (C2AE) [21] adopted an end-to-end training method based on deep canonical correlation analysis (DCCA) [22], which achieves feature-aware label embedding and label-correlation aware prediction. All of the above methods model label correlations only implicitly through the network structure, which is not highly interpretable. Their performance is also not competitive, as shown in Section 4.
In this paper, a general framework, i.e., the neural network with the correlative layer (CLNN), is proposed, where the correlative layer is added at the end of an existing neural network. In CLNN, the weights of the correlative layer are trained automatically, and finally these weights explicitly express the label correlations, including both positive/negative and strong/weak correlations. Through such a correlative layer, CLNN can improve the predictive performance on labels. Compared with existing methods, CLNN adaptively corrects the label correlation weights during the training process, where the label correlation weights are expressed by the matrix of the correlative layer.
The main contributions of this paper are summarized as follows.
- We propose a general framework of CLNN, where the correlative layer is attached to the output layer of an existing neural network. Firstly, CLNN trains the neural network without the correlative layer to obtain rough classification results. Secondly, CLNN trains the whole neural network with the correlative layer. Thus it can learn the label correlations automatically, i.e., the weights of the correlative layer are trained by standard backpropagation techniques.
- Because the correlative layer is attached to the output layer of an existing neural network, it can easily be used to extend existing algorithms. In this paper, we extend three representative multi-label classification algorithms, i.e., BPMLL [3], NN AD [6] and C2AE [21], by adding the correlative layer, and experimental results show that the CLNN framework achieves better performance.
The rest of this paper is organized as follows. Section 2 introduces the related work in the field of multi-label classification and three representative multi-label classification algorithms based on neural networks. Section 3 introduces the idea of the correlative layer and proposes the CLNN framework as well as its training process. Experimental results and analysis are given in Section 4. Section 5 summarizes the whole paper.
Related Work

In recent years, label correlations and feature-label correlations have received much attention in developing multi-label classification algorithms. Typically, label-specific features (LIFT) [25] extracts the feature information represented by each label through clustering, adds the new feature information to the original dataset to construct a new dataset, and constructs l classifiers using the BR algorithm on the new dataset. Multi-label learning by exploiting label dependency (LEAD) [26] constructs a Bayesian network of labels, which explicitly expresses the dependencies between labels through a directed acyclic graph. Since traditional Bayesian network structure learning is an NP-hard problem, conditional dependency network (CDN) [27] constructs a fully connected dependency network over the class labels without learning the structure and outputs the probabilities through Gibbs sampling during prediction.
It is not surprising that neural networks are a popular solution to multi-label classification problems [28][29][30]. BPMLL [3] is an early but typical multi-label classification algorithm based on neural networks. The algorithm exploits its loss function to mine label dependencies. Many algorithms are extended or enhanced from BPMLL. Zhang et al. [31] replaced the hidden layer of BPMLL with a radial basis function (RBF) [32] network. Rafał et al. [33] modified the BPMLL loss function, while NN AD proposed by Nam et al. [6] modified the activation function, the loss function and the training method. Convolutional neural network-weighted approximate rank pairwise (CNN-WARP) [4] extended the convolutional neural network (CNN) to multi-label image classification for the first time, modifying the loss function to make the algorithm more suitable for this task. Based on CNN, convolutional neural network-recurrent neural network (CNN-RNN) [5] attempted to use the long short-term memory (LSTM) [34] network to encode label information and thereby capture the dependencies between labels. C2AE [21] integrated a deep canonical correlation analysis network and autoencoder technology to make features and labels better correlated while using a label-related loss function in the prediction layer.
In the following parts, three representative algorithms based on neural networks are introduced, and all of them are adopted for experimental comparisons in this paper. These are the early but typical algorithm BPMLL [3] and two state-of-the-art algorithms, NN AD [6] and C2AE [21].

BPMLL
BPMLL was first proposed by Zhang et al. [3] for multi-label classification. BPMLL adopts a simple three-layer neural network and updates the weights by backpropagation of errors. The major advantage of BPMLL is its loss function, which attempts to make the outputs of the labels belonging to a sample larger than the outputs of the labels not belonging to that sample. Specifically, given a sample (x_i, y_i), the loss function is as follows.
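A plausible reconstruction of this loss, assuming the standard BPMLL pairwise exponential loss that the description below matches, is:

$$E_i = \frac{1}{|y_i|\,|\bar{y}_i|} \sum_{(k,\,l) \in y_i \times \bar{y}_i} \exp\!\left(-\left(o^i_k - o^i_l\right)\right)$$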
Here y_i represents the collection of labels belonging to the sample x_i, ȳ_i represents the collection of labels that do not belong to the sample, |·| represents the size of a collection, and o^i_k and o^i_l indicate the neural network outputs corresponding to the labels y^i_k ∈ y_i and y^i_l ∈ ȳ_i, respectively. The weights are continually updated by back-propagation of the error gradient until convergence. After obtaining the output scores of the labels, a threshold has to be selected. For a training sample (x_i, y_i), the threshold is obtained by the following formula.
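The formula referenced here (formula (2)) can plausibly be reconstructed, assuming the misclassification-count threshold rule used in [3], as:

$$t(x_i) = \arg\min_t \left( \left|\{\, k \mid y_k \in y_i,\ o^i_k \le t \,\}\right| + \left|\{\, l \mid y_l \in \bar{y}_i,\ o^i_l \ge t \,\}\right| \right)$$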
Therefore, according to the trained neural network, for the training dataset D = {(x_i, y_i) | 1 ≤ i ≤ n}, the threshold training data {(o(x_i), t(x_i)) | 1 ≤ i ≤ n} can be obtained. BPMLL assumes that the threshold value is a linear function of the network output values. Given a test sample x, we can get the output o(x) through the neural network, and then the threshold for x is obtained as follows. The weights and offsets in (3) can be obtained by the least-squares method on the threshold training data {(o(x_i), t(x_i)) | 1 ≤ i ≤ n}.
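Formula (3) is not preserved in the text; under the linear-threshold assumption stated above, it can be written as (with w and b the weights and offset to be fitted):

$$t(x) = w^\top o(x) + b$$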

NN AD
Compared with BPMLL [3], NN AD [6] introduces more popular techniques from deep learning and has very competitive performance. In NN AD [6], different loss functions are analyzed, and the cross-entropy loss function is considered to be easier to converge with than the ranking-based loss function. In its loss function, n is the size of the dataset and l is the number of labels (a reconstruction of the loss is sketched after this paragraph). Meanwhile, NN AD adopts the ReLU function [19] as the hidden-layer activation function and uses the Dropout technique [20] instead of regularization to prevent over-fitting. Since the learning rate does not change during training in the simple gradient descent method, it is often difficult to converge in the later stages of training. Therefore, NN AD uses the Adagrad gradient descent method [35]; thus, as the number of training steps increases, the effective step sizes of the gradient updates are continuously reduced.
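A plausible reconstruction of this loss, assuming the standard binary cross-entropy form over n samples and l labels with y_{ij} ∈ {0, 1} and sigmoid outputs ŷ_{ij}, is:

$$L = -\sum_{i=1}^{n} \sum_{j=1}^{l} \left[\, y_{ij} \log \hat{y}_{ij} + (1 - y_{ij}) \log\left(1 - \hat{y}_{ij}\right) \right]$$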
In the threshold training phase, different from [3], over-fitting is prevented by adding a regularization term.
Here n is the size of the dataset, w denotes the weight parameters, o_i is the output vector of the neural network for a sample, and t_i is the optimal threshold in terms of formula (2).
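The regularized threshold-training objective itself is not shown; a plausible form, assuming a squared-error fit of the linear threshold with an L2 penalty (the regularization weight λ is our notation, not from the original text), is:

$$\min_{w,\,b} \; \sum_{i=1}^{n} \left( w^\top o_i + b - t_i \right)^2 + \lambda \left\| w \right\|_2^2$$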

C2AE
Canonical correlated autoencoder (C2AE) was proposed by Yeh et al. in 2017 [21] and adopts an end-to-end learning architecture. C2AE integrates a deep canonical correlation analysis network Φ(F_x, F_e) and an autoencoding network Γ(F_e, F_d); the purpose is to make the features and labels better correlated while using a label-related loss function in the prediction layer. The global loss function is defined as follows.
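The global loss (formula (6)) can plausibly be reconstructed, following the description below, as a weighted combination of the two error terms:

$$\min_{F_x, F_e, F_d} \; \Phi(F_x, F_e) + \alpha\, \Gamma(F_e, F_d)$$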
Here the hyperparameter α balances the two types of errors. Φ consists of two parts: one maps x_i to the latent feature mapping F_x, and the other maps y_i to the latent label mapping F_e. The obtained latent representation is fed into F_d to get the label output. In formula (6), the global loss function obviously consists of two terms. For the first term, by using deep canonical correlation analysis (DCCA) [22], the feature vector x_i and the label vector y_i are projected to a lower-dimensional latent space L using two deep neural networks. Here the latent vector obtained from x_i is encoded as F_x(x_i), and the latent vector obtained from y_i is encoded as F_e(y_i). The goal is to match the directions of F_x(x_i) and F_e(y_i) as much as possible, rather than expecting F_x(x_i) and F_e(y_i) to be small. The resulting objective function is given below, where I is the identity matrix.
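Formula (7) is not preserved; a reconstruction in the usual DCCA-style form with orthogonality constraints, consistent with the description above, is:

$$\min_{F_x, F_e} \left\| F_x(X) - F_e(Y) \right\|_F^2 \quad \text{s.t.} \quad F_x(X) F_x(X)^\top = F_e(Y) F_e(Y)^\top = I$$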
For the second term, to learn and recover the label-related output, an autoencoder is introduced. First, y_i is recovered by feeding the latent representation F_e(y_i) into F_d, which is expressed as F_d(F_e(y_i)). The output then adopts the loss function of BPMLL [3], and Γ(F_e, F_d) is defined as follows.
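Formula (8) is missing from the text; following the BPMLL-style pairwise loss described next, a plausible reconstruction is:

$$\Gamma(F_e, F_d) = \sum_{i=1}^{n} \frac{1}{|y_i|\,|\bar{y}_i|} \sum_{(k,\,l) \in y_i \times \bar{y}_i} \exp\!\left( -\left( \bigl(F_d(F_e(y_i))\bigr)_k - \bigl(F_d(F_e(y_i))\bigr)_l \right) \right)$$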
Here F_d(F_e(y_i)) is the recovered label output, y_i represents the collection of labels belonging to the sample x_i, ȳ_i represents the collection of labels that do not belong to the sample, |·| represents the size of a collection, and (F_d(F_e(y_i)))_k and (F_d(F_e(y_i)))_l indicate the outputs corresponding to the labels y^i_k ∈ y_i and y^i_l ∈ ȳ_i, respectively. In the prediction phase, the function F_d(F_x(x_i)) is used for prediction.

Proposed Method

Primary Idea
In real-world applications, a sample often has multiple labels, and these labels often have some correlations. In a multi-label classification task, it is reasonable to use the label correlations to improve the performance of the classifier.
However, most existing neural networks for multi-label classification do not explicitly express the correlations between labels and do not learn both positive/negative and strong/weak correlations during training. To account for the relationships between labels in both training and prediction, we explicitly define a correlative layer after the original label output layer of an existing neural network. We call a neural network with the correlative layer a CLNN.
The framework of CLNN is shown in Fig. 1, where pre N N is an existing neural network whose outputs are fed to the correlative layer. The outputs of the correlative layer are the final outputs of CLNN. It can be seen that CLNN can readily be used to extend existing neural networks for multi-label classification.

Correlative Layer
As shown in Fig. 1, we define l vectors of l dimensions in the correlative layer. These vectors explicitly encode the correlation weights between labels, and the correlation weights are used to correct the outputs of the pre-network pre N N .
To encode the correlations between labels, we define C = [c_{i,j}]_{l×l} as the label correlation matrix, where c_{i,j} represents the correlation weight between label Y_i and label Y_j. The label correlation matrix is the key to the correlative layer.

Fig. 1 The framework of CLNN. The feature x gets a temporary output o_pre ∈ R^l through the existing network pre N N. For each label, the dot product between the temporary output o_pre and the corresponding l-dimensional correlation vector is computed, and the resulting sum is the final output.
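To make the layer concrete, the following is a minimal sketch in TensorFlow/Keras (the paper's implementation uses TensorFlow [37], but this class, its name CorrelativeLayer, and the init_matrix argument are our own illustrative assumptions rather than the authors' code). The matrix would be initialized with one of the label correlation matrices described below and then trained together with the rest of the network.

import tensorflow as tf

class CorrelativeLayer(tf.keras.layers.Layer):
    """Applies a trainable l x l label correlation matrix C to the
    pre-network output: final_output[j] = dot(o_pre, C[:, j])."""

    def __init__(self, init_matrix, **kwargs):
        super().__init__(**kwargs)
        self.init_matrix = init_matrix  # l x l initial correlation matrix

    def build(self, input_shape):
        l = int(input_shape[-1])
        self.C = self.add_weight(
            name="label_correlation",
            shape=(l, l),
            initializer="zeros",
            trainable=True,
        )
        # Start from the chosen initial correlation matrix.
        self.C.assign(tf.cast(self.init_matrix, self.C.dtype))

    def call(self, o_pre):
        # Each final label score is a weighted sum of all pre-network scores,
        # i.e. the dot product of o_pre with that label's correlation vector.
        return tf.matmul(o_pre, self.C)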
In this paper, we present three different label correlation matrices. The first one is implemented by directly using the Pearson product-moment correlation coefficient, the second one is a weighted version of the first, and the third one is implemented by weights and signs from the Pearson correlation coefficient. We call CLNN with the first label correlation matrix CLNN-PC, with the second CLNN-WPC, and with the third simply CLNN, because it performs the best. We now explain the three label correlation matrices in detail.

1) Label Correlation Matrix by the Pearson Product-moment Correlation Coefficient
First, we consider a simple method that directly uses the Pearson product-moment correlation coefficient r to represent C; the Pearson correlation coefficient and the resulting matrix entries are given below. Here Cov(Y_i, Y_j) represents the covariance of the labels Y_i and Y_j, and Var[Y_i] and Var[Y_j] indicate the variances of labels Y_i and Y_j, respectively. Their definitions are also given below for better understanding.
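The corresponding formulas are missing from the extracted text; standard definitions consistent with the description (assuming normalization by n, with ȳ_i denoting here the mean of label Y_i over the dataset) are:

$$c_{i,j} = r(Y_i, Y_j) = \frac{\mathrm{Cov}(Y_i, Y_j)}{\sqrt{\mathrm{Var}[Y_i]\,\mathrm{Var}[Y_j]}}$$

$$\mathrm{Cov}(Y_i, Y_j) = \frac{1}{n} \sum_{k=1}^{n} \left( y_{k,i} - \bar{Y}_i \right)\left( y_{k,j} - \bar{Y}_j \right), \qquad \mathrm{Var}[Y_i] = \frac{1}{n} \sum_{k=1}^{n} \left( y_{k,i} - \bar{Y}_i \right)^2$$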
Here n represents the size of the dataset, and k represents the index of each sample in the dataset.

2) Label Correlation Matrix by Weights from the Pearson Product-moment Correlation Coefficient
Wilcox et al. [36] pointed out that the Pearson correlation coefficient r is not robust. Meanwhile, the correlations between labels are often asymmetrical. Therefore, the correlations between the labels cannot be fully reflected by r.
Here, we introduce a weight matrix W = [w_{i,j}]_{l×l}, where w_{i,j} represents the weight associated with the labels Y_i and Y_j. Then, the label correlation matrix C = [c_{i,j}]_{l×l} is redefined as shown below. Because w_{i,j} is adaptively adjusted during training of the network, we call this method an adaptive label correlation matrix.
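The redefinition itself is not preserved; a reconstruction consistent with the weighted-Pearson description is:

$$c_{i,j} = w_{i,j} \cdot r(Y_i, Y_j)$$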
W should be initialized before training the whole network and after training the pre-network pre N N. In order to fully account for the contribution of the label itself, we initialize the diagonal elements of W to 1 and draw the non-diagonal elements uniformly from the interval [0, 1], as shown below.
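The initialization formula is missing; based on the description it can plausibly be written as:

$$w_{i,j} = \begin{cases} 1, & i = j \\ u_{i,j} \sim U(0, 1), & i \neq j \end{cases}$$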
Because W is randomly initialized except for the diagonal elements, the weights in W are dynamically adjusted during the training process by back-propagation of the loss gradient in order to obtain a more accurate correlation matrix C between the labels. Because the weight matrix W is obtained by training the network, the correlation matrix C = [c_{i,j}]_{l×l} tends to be more accurate.

3) Label Correlation Matrix by Weights and Signs from the Pearson Correlation Coefficient
Since the Pearson correlation coefficient itself is often inaccurate, we can completely ignore its specific value and only use it to extract the initial positive or negative relationship between the labels Y_i and Y_j, denoted s_{i,j} and expressed below. Furthermore, initializing the weight matrix W by a uniform distribution over [0, 1] may not be a good choice, because it may increase the contributions from other labels and reduce the contribution of the label itself. Therefore, we define the weight matrix as shown below.
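The two formulas referenced here are missing from the text; reconstructions that follow the description (the second appears to be the formula (14) referenced in Algorithm 1; how r = 0 is handled is not specified, so a non-negative r is assumed to give +1) are:

$$s_{i,j} = \begin{cases} +1, & r(Y_i, Y_j) \ge 0 \\ -1, & r(Y_i, Y_j) < 0 \end{cases} \qquad\qquad w_{i,j} = \begin{cases} 0, & i = j \\ u_{i,j} \sim U(0, p), & i \neq j \end{cases} \quad (p = 0.1)$$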
That is to say, we first initialize the diagonal elements of W to 0 and then set the non-diagonal elements to random numbers evenly distributed over the interval [0, p], where p is set to 0.1.
Finally, the label correlation matrix is expressed as follows.
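The expression itself is not preserved; a plausible reconstruction, assuming the element-wise product of W and the sign matrix S = [s_{i,j}] is added to the identity (which matches the surrounding description), is:

$$C = I + W \odot S, \qquad \text{i.e.} \quad c_{i,j} = \begin{cases} 1, & i = j \\ w_{i,j}\, s_{i,j}, & i \neq j \end{cases}$$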
Here I is the identity matrix, which indicates that the information of the label itself contributes the most in the initial state. Similar to the previous method, because the correlative layer is added after the output layer, the final output is obtained through the correlative layer, and the prediction error is calculated from the final output. In order to limit w_{i,j} to a small range, we add an L1 regularization term to the original loss function, as follows.
$$loss_{new} = loss_{pre} + \|W\|_1$$

Based on the above loss function, in order to get a more accurate correlation matrix, the back-propagation technique is used to dynamically adjust W.

Training CLNN
Now we explain how to train CLNN in detail. Since the correlative layer is an extension of an existing network, the training process of CLNN has two phases, as shown in Algorithm 1.
Phase 1: Training the pre-network. Preliminary training is carried out according to the pre-network architecture, and the network obtained after this training is usually already reasonably good.
Phase 2: Training the whole network. After adding the correlative layer, we use the same training method (including parameter settings, the loss function, and the gradient descent method) to adjust the weights of both the pre-network and the correlative layer, in an attempt to obtain a more accurate network.
After training the network, the label output scores corresponding to each sample in the training dataset can be obtained. Similar to [3], the threshold over the output scores is obtained by minimizing the ranking loss, as in formula (2).
In this way, the outputs and thresholds {(o(x_i), t(x_i)) | 1 ≤ i ≤ n} of the training set are constructed. We use the same method as in [3] to select the threshold of the label output, assuming that the threshold is a linear function of the network output values; for a test sample, the threshold is obtained from this linear function, whose weights and offsets are fitted by the least-squares method on the threshold training data.

Algorithm 1 Training CLNN
Input: untrained model pre N N, initial weights θ0 of the pre N N network, training epochs m, learning rate lr, initial weight matrix W0[w_{i,j}], the training dataset D = {(x_i, y_i)}, i ∈ [n]
Output: trained model
1: Initialize W0 by (14);
2: D = random_shuffle(D);
3: for i = 1 to m do
4:     for each training sample (x_i, y_i) in D do
5:         ...
       end for
9: end for
10: for i = 1 to m do
11:     for each training sample (x_i, y_i) in D do
12:         ...
        end for
18: end for
19: return pre N N and C
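To illustrate the two-phase procedure in code, below is a minimal TensorFlow sketch under several assumptions: pre_nn is any Keras model playing the role of pre N N, corr_layer is the CorrelativeLayer sketched earlier, loss_fn is whatever loss the pre-network uses, and dataset yields (features, labels) batches. It is a simplified outline, not the authors' implementation.

import tensorflow as tf

def train_clnn(pre_nn, corr_layer, loss_fn, dataset, epochs, lr):
    """Phase 1: train the pre-network alone.
    Phase 2: train the whole network, including the correlative layer."""

    # Phase 1: preliminary training of the pre-network only.
    opt1 = tf.keras.optimizers.Adagrad(learning_rate=lr)
    for _ in range(epochs):
        for x, y in dataset:
            with tf.GradientTape() as tape:
                loss = loss_fn(y, pre_nn(x, training=True))
            grads = tape.gradient(loss, pre_nn.trainable_variables)
            opt1.apply_gradients(zip(grads, pre_nn.trainable_variables))

    # Phase 2: joint training with the same settings and loss.
    opt2 = tf.keras.optimizers.Adagrad(learning_rate=lr)
    for _ in range(epochs):
        for x, y in dataset:
            with tf.GradientTape() as tape:
                out = corr_layer(pre_nn(x, training=True))
                loss = loss_fn(y, out)
                # With the third correlation matrix, an L1 penalty on the
                # off-diagonal weights W would be added to the loss here.
            variables = (pre_nn.trainable_variables
                         + corr_layer.trainable_variables)
            grads = tape.gradient(loss, variables)
            opt2.apply_gradients(zip(grads, variables))

    return pre_nn, corr_layer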

Prediction
For a test sample x, the pre-network output gives the scores corresponding to each label, which can be expressed as o_pre = pre N N(x), where o_pre is the output vector of pre N N and pre N N is a specific network. Because the correlative layer is added after the output of the pre-network, the output of the whole network is now obtained by passing o_pre through the correlative layer. In the prediction phase, for a test sample x, the outputs for all labels are obtained through the network; an output greater than or equal to the threshold means the corresponding label is positive, and an output less than the threshold means the label is negative, where 1 ≤ j ≤ l indexes the labels.
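The prediction formulas referenced above are not preserved; reconstructions consistent with the correlative-layer definition (using our notation ŷ_j for the predicted value of label j and c_j for the j-th correlation vector of C) are:

$$o_j = o_{pre} \cdot c_j, \qquad \hat{y}_j = \begin{cases} 1, & o_j \ge t(x) \\ 0, & o_j < t(x) \end{cases}, \qquad 1 \le j \le l$$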

Experiments

Datasets
We use ten benchmark datasets to test the performance of CLNN, including four small datasets (with fewer than 5000 samples) and six large datasets (with the sample number greater than or equal to 5000). These datasets are often used to test multi-label classification algorithms, which are from various fields, including images, genomics, and texts. The details of the datasets are shown in Table 1.

Evaluation
Given a test dataset, we use seven indicators to evaluate the performance of the algorithms.
In the following formulas, we use f(·) to represent the prediction function that returns the set of positive labels, g(x_i, y) returns the confidence that x_i has label y, y_i is the collection of labels belonging to a sample, ȳ_i is the collection of labels not belonging to it, and |·| represents the size of a collection.
Hamming Loss: Hamming loss describes the quality of the classifier by counting the number of label classification errors. In its formula, ∆ denotes the symmetric difference of two sets and l is the number of labels; the smaller the value, the smaller the number of classification errors. One-error: calculates the frequency with which the top-ranked label is not a correct label. In its formula, [·] is the indicator function, which is 1 if the event occurs and 0 otherwise.
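The formulas for these two metrics are missing from the extracted text; the standard definitions, written in the notation above, are:

$$\text{Hamming Loss} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{l} \left| f(x_i) \,\Delta\, y_i \right|, \qquad \text{One-error} = \frac{1}{n} \sum_{i=1}^{n} \left[\, \arg\max_{y}\, g(x_i, y) \notin y_i \,\right]$$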
Coverage: indicates the maximum position, in the label sequence sorted from high to low score, that is needed to cover all labels belonging to the sample.
Here rank_g(x_i, y) denotes the rank of label y when the labels of x_i are sorted in descending order of their scores. Ranking loss: evaluates the number of improperly ordered label pairs, i.e., pairs in which a negative label has a higher output than a positive label.
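The corresponding formulas are also missing; the standard definitions, in the same notation, are:

$$\text{Coverage} = \frac{1}{n} \sum_{i=1}^{n} \max_{y \in y_i} rank_g(x_i, y) - 1$$

$$\text{Ranking Loss} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|y_i|\,|\bar{y}_i|} \left| \left\{ (y', y'') \mid g(x_i, y') \le g(x_i, y''),\ (y', y'') \in y_i \times \bar{y}_i \right\} \right|$$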
Average precision: This is a common indicator in information retrieval.
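Its formula is not reproduced either; the standard multi-label average precision, assuming the same ranking notation as above, is:

$$\text{Average Precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|y_i|} \sum_{y \in y_i} \frac{\left| \left\{ y' \in y_i \mid rank_g(x_i, y') \le rank_g(x_i, y) \right\} \right|}{rank_g(x_i, y)}$$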
Macro-F1: calculate the precision and recall corresponding to each label separately, average them to obtain macro-P and macro-R, and combine these into the macro-F1 score.

Algorithm Configuration
Three typical algorithms, BPMLL [3], NN AD [6] and C2AE [21], are adopted for experimental comparisons. Their parameter settings are given as follows. For BPMLL [3], the number of hidden units is set according to the feature dimension and chosen from {0.2, 0.4, ..., 1} times the feature dimension; the learning rate is chosen from {0.0005, 0.001, 0.005, 0.01}; the regularization parameter α is chosen from {0.001, 0.005, 0.01, 0.05}; the number of training epochs is 100. For NN AD [6], the number of hidden units is 1000, the dropout rate is 0.5, the learning rate is chosen from {0.001, 0.01, 0.1}, and the number of training epochs is 100. It is noted that, in this paper, the regularization in the threshold training process is omitted.
For C2AE [21], the dimension of the latent space is set according to the label dimension, specifically to 0.8 times the label dimension; the number of training epochs is 50, the error balance parameter α is 0.5, and the regularization parameter β is 0.001.
The batch size of all algorithms during training is 32, and all parameters are determined by 10-fold cross-validation on the training data. For the extended algorithms, i.e., BPMLL-CLNN, NN AD-CLNN, and C2AE-CLNN, all parameters, the gradient descent methods and the loss functions are the same as in the original algorithms. All algorithms are implemented on the TensorFlow [37] framework.

Tables 2 and 3 show the performance of all algorithms on the four small datasets and the six large datasets, respectively. The bold font represents the better one between each original algorithm and its CLNN version.
Overall, Tables 2 and 3 show that the algorithms achieve better performance after adding the correlative layer. The font with a gray background in the tables indicates the best performance among the six algorithms; among them, NN AD-CLNN performs the best.

Label Correlation Matrix Visualization
To clearly observe the label correlation information mined by the correlative layer, we visualize the label correlation matrix. We select the scene dataset, which contains six labels, namely 'Beach', 'Sunset', 'Fall Foliage', 'Field', 'Mountain' and 'Urban', and obtain the label correlation visualization through the BPMLL-CLNN algorithm. The results are shown in Fig. 3. Since each label has a dominant influence on itself, the diagonal elements of the label correlation matrix are omitted in Fig. 3, so that the influence of the other labels can be observed.
In Fig. 3, a color tending toward blue indicates that the influence of the corresponding label tends to be negative, while a color tending toward red indicates that the influence tends to be positive. For example, in column 5, the appearance of 'Fall Foliage' enhances the appearance of 'Mountain'. As another example, in column 6, the appearance of 'Field' reduces the appearance of 'Urban'.

Correlative Layer Analysis
In Section 3.2, we present three different implementations of the label correlation matrix. The corresponding networks are named CLNN-PC, CLNN-WPC and CLNN.
In this section, we compare three different implementations, and we adopt BPMLL as the pre N N . The corresponding networks are BPMLL-CLNN-PC, BPMLL-CLNN-WPC and BPMLL-CLNN. We adopt four small datasets to conduct experiments. The experimental results are shown in Table 4.
From Table 4, we can see that CLNN performs the best, achieving the best results on 53.6% (15/28) of all the evaluation indicators. The second is CLNN-PC, which achieves the best results on 35.7% (10/28), and the third is CLNN-WPC with 10.7% (3/28).

Conclusion
In this paper, we propose a general framework, i.e., the neural network with the correlative layer (CLNN), and present the detailed implementation of the correlative layer. The weights of the correlative layer are trained after the training of the pre-neural network. The purpose of the correlative layer is to make the relatively correct outputs of the pre-neural network more accurate. Because the correlative layer is added after the pre-neural network, our framework is highly scalable. We adopt three typical neural networks, including BPMLL [3], NN AD [6], and C2AE [21], as the pre-neural networks, and experimental results show that the CLNN framework has competitive performance. In the future, for multi-label image classification, the convolutional neural network with the correlative layer is worth studying.