Feature Fusion Based Parallel Graph Convolutional Neural Network for Image Annotation

The application of graph neural networks to automatic image annotation is becoming more mature, but several problems remain. First, the feature data extracted from the original image by a single feature extraction algorithm, such as color features or gradient features, suffer from large intra-class variance and small inter-class variance, which makes similar classes hard to separate. Second, most methods merely use a graph convolutional network over a sample graph or a label graph, which limits multimodal fusion and extension. This paper proposes a parallel graph convolutional network based on feature fusion for automatic image annotation. By fusing sample features, the inherent defects of features extracted by a single model are reduced, and annotation performance under semi-supervised learning is improved. Experiments on three benchmark image annotation datasets show that this method is superior to existing methods.


Introduction
Image annotation labels received visual data through deep learning and is a primary and critical task in computer vision. It refers to the process in which a computer system automatically assigns keywords or a description to a digital image. It has been widely used in various computer vision tasks, such as target detection, line/edge detection, and image classification, and its essence is a classification task. Because images in real life often contain many kinds of objects, traditional single-label classification cannot meet practical needs. With the continuous development of machine learning, multi-label classification has been studied more and more deeply.

Fig. 1 An example of interdependence between tags: a sample containing the "ship" tag probably also contains the tag "sea", which also reflects the similarity between samples.
Like other machine learning tasks, automatic image annotation methods can be divided into two categories: discriminant models and generative models. A generative model [22,26,29] obtains a prior probability distribution as a prediction model based on the joint probability distribution of inputs and outputs. A discriminant model [10,21,36] uses a decision function or conditional probability distribution as the prediction model. Although its prediction time is short, it cannot reflect the characteristics of the training data itself, and the relationships between variables are not explicit.
In recent years, due to the excellent performance of graph neural networks (GNNs) in processing non-Euclidean data, graph-based methods [3,14,18,23,35] have gradually been favored by researchers. Chen et al. [3] first constructed a directed graph over the labels, taking the word vectors of all labels as graph nodes and the statistical co-occurrence frequency of labels as the adjacency matrix, and used a GCN to model the correlation between labels. This method is clearly motivated and leaves ample room for improvement, so much follow-up work has sought to improve it, and GCN-based methods have become the mainstream for multi-label image recognition in recent years. The common approach is to construct a sample graph or label graph and then judge the samples through a graph convolutional network, where the graph nodes are feature vectors extracted from images. Therefore, image feature extraction is crucial for image classification.
In traditional feature extraction algorithms (such as SIFT and HOG), the detection operator is designed by hand, distilled from a great deal of prior knowledge. The extracted features cannot be adjusted according to the image and its labels, which limits annotation performance. To address these challenges, we propose a parallel graph convolutional network image automatic annotation model based on feature fusion (PGCF). The framework is shown in Fig. 3. First, we use two types of networks to extract features from images to compensate for the single focus of any one kind of network. Second, we perform channel fusion on the features extracted by the two networks, enabling more comprehensive image annotation without adding computational parameters. Third, we construct the sample graph on the fused features and input the sample graph and the label graph into two parallel GCN networks to increase computational efficiency.
In summary, the main contributions are: (1) we use two classical feature extraction networks (VGG19 and ResNet50) to extract the features of samples respectively; (2) we fuse the sample features to make up for the inherent defects of single features, so that similar items are classified more accurately; (3) we conduct experiments on three benchmark datasets and verify the effectiveness of the proposed method through control experiments.
Fig. 3 The framework of PGCF. For each image in the datasets, we use the VGG19 and ResNet50 networks to extract its features, obtaining a 1,000-dimensional feature vector from each; we then add the features on the corresponding channels to obtain the fused sample features. Finally, we input the fused sample feature graph and the label graph into two GCN networks respectively, and their product is input into the loss function for the final prediction.

Related Work

Feature Fusion
Automatic image annotation is a cross-disciplinary research field combining natural language processing and computer vision; its essence is a classification task. Currently, most work obtains the feature data of the original image through a feature extraction algorithm and then trains a classifier on those features to output the final result, so the extraction of image features is crucial. However, because any one type of feature network has a single focus on the image, a network model is usually sensitive to changes in some features of the image but insensitive to changes in others. When two images differ only slightly in the features a model is sensitive to, a classifier trained on that single feature has difficulty producing the correct classification. In addition, complex background noise in the image degrades the quality of the feature data, which increases the difficulty of classifier training and reduces classification accuracy.
Feature fusion is an effective compensation method, as shown in Figure . It uses multiple models to extract and fuse the features of the data, realizing feature complementarity and reducing the influence of the inherent defects of any single feature. The idea of feature fusion comes from the early field of information fusion, whose fundamental theory mainly includes fuzzy sets, evidence theory, etc. Taking Taobao shopping as an example, when deciding whether to buy an item, users consider the item's attributes, the item's pictures, other users' comments, and even the item's introduction video. In other words, this multi-modal information (text, image, video) affects user behavior. Therefore, modeling with this multi-modal information is one way to improve the classification accuracy of recommendation models, and how to integrate these features is a crucial problem.
With the development of artificial intelligence technology, feature fusion has become a research hotspot [4,16,32,33], especially in image recognition. Its idea is to jointly model different features so as to make better use of features with different characteristics.
There are many ways to classify feature fusion techniques [17]. From the perspective of processing time, feature fusion can be divided into three classes: early fusion, intermediate fusion, and late fusion. Early fusion [7] fuses on the input layer: multiple features are fused first and the predictor is then trained on the fused features. Intermediate fusion [6] transforms the features from different data sources into intermediate high-dimensional feature expressions, fuses them, and finally trains the predictor. Late fusion [25] fuses on the prediction layer: predictions are made on the different features separately and the prediction results are then fused. From the perspective of model structure, feature fusion can be divided into a serial strategy and a parallel strategy [32]: in the serial strategy the whole model has only one branch, while in the parallel strategy the model has multiple branches, each handling different features.
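The difference between early and late fusion can be sketched with toy linear predictors; everything here (dimensions, weights, features) is illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
f_color = rng.standard_normal(16)  # hypothetical color feature of one image
f_grad = rng.standard_normal(16)   # hypothetical gradient feature of the same image

# Early fusion: fuse the features first, then apply one predictor to the result.
w_early = rng.standard_normal(32)                     # toy linear predictor
early_score = np.concatenate([f_color, f_grad]) @ w_early

# Late fusion: predict on each feature separately, then fuse the predictions.
w_color = rng.standard_normal(16)
w_grad = rng.standard_normal(16)
late_score = 0.5 * (f_color @ w_color + f_grad @ w_grad)

print(early_score, late_score)  # two scalar scores
```

Intermediate fusion would sit between the two: each feature is first mapped to an intermediate representation, and fusion happens before the final predictor.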

Graph Convolution Neural Network
Graph embedding allows convolutional networks to process graphs by mapping the graph structure from high-dimensional data to a low-dimensional form; this geometric structure information can be represented concretely by the graph's structure matrices. Although convolutional neural networks have been effective, they cannot handle data with a non-Euclidean structure, because conventional convolution cannot deal with information whose node relationships vary (for example, it is impossible to set a convolution kernel of fixed size). To extract features effectively from such data structures, GCNs have become a research hotspot [5,8,12,13,20,34].
As a kind of discriminant model, graph networks have recently become popular because of their ability to capture the correlation between labels (as shown in Fig. 1). Many works combining graph structure with convolutional neural networks have been applied to multi-label image classification [2,3]. Chen et al. [3] first constructed a directed graph on object labels, modeled the correlation between labels using a GCN, and mapped the label representations to interdependent object classifiers. ML-GCN [3] takes the word vectors of all tags as graph nodes and the statistical co-occurrence frequency of labels as the adjacency matrix, uses a GCN to model the correlation between tags, and weights the features of the classification network to obtain the final classification result. This method has a simple, straightforward structure, clear motivation, and ample room for improvement, so much follow-up work has sought to improve it. Similarly, Chen et al. [2] proposed semantic-specific graph representation learning (SSGRL), which includes semantic decoupling and interaction modules to learn and associate semantic-specific representations, respectively. As concurrent work, SSGRL [2] and ML-GCN [3] both pioneered the use of GCNs to handle the co-occurrence dependency problem in multi-label recognition, and they have much in common in the construction of GCN nodes and adjacency matrices. In addition, Tang et al. [27] proposed a new GCN-based deep learning model to obtain rich semantic information, and Xu et al. [31] built on the ML-GCN structure by introducing the detection boxes produced by a detection model, using a GCN to model the positions of different objects to assist the classification network's prediction. GCN-based approaches have thus become the mainstream direction of multi-label image recognition in recent years.
A graph is a structure that describes the relationships between objects: nodes represent objects, edges define relationships between objects, and each edge can be weighted. Just as a convolutional neural network uses a convolution kernel to extract information, a graph convolution layer defines the convolution operation over the neighbors of each graph node. The core idea of graph convolution is to aggregate node information along the edges to generate a new node representation: in a graph convolution layer, a learnable weight is multiplied by the features of all neighbors of a node (including the node itself), and an activation function is then applied to the result.
Graphs can model very complex relationships, such as social relationships in the real world, commodity relationships in recommendation systems, and intersection connections in traffic networks. Learning to use graph neural networks is therefore an invaluable skill in machine learning.

Problem Set-Up
We use G = {V, E} to represent an undirected graph, where V is the set of nodes, |V| = n is the total number of nodes on the graph, and E is the set of edges. A denotes the adjacency matrix and defines the interconnections between nodes (in an undirected graph, A_{i,j} = A_{j,i}). L = D − A is the Laplacian matrix of the graph, where D is a diagonal matrix, D_{i,i} is the degree of node i, and D_{i,i} = Σ_j A_{i,j}. The normalized Laplacian matrix is defined as L_sym = I_n − D^{−1/2} A D^{−1/2}, where I_n ∈ R^{n×n} is the identity matrix. Thus, the propagation between GCN layers can be expressed by Formula (1):

H^{(l+1)} = σ( D̃^{−1/2} Ã D̃^{−1/2} H^{(l)} W^{(l)} )    (1)

In Formula (1), H^{(l)} is the feature matrix of layer l, W^{(l)} is the learnable weight matrix of that layer, Ã = A + I_n, and D̃_{i,i} = Σ_{j=1}^{n} Ã_{i,j}. σ is a nonlinear activation function.
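The layer-wise propagation rule of Formula (1) can be sketched in numpy as follows; the tiny path graph, the features, and the tanh activation are placeholders for illustration only:

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One GCN propagation step: H' = sigma(D~^{-1/2} A~ D~^{-1/2} H W).

    A : (n, n) adjacency matrix of an undirected graph (no self-loops).
    H : (n, d) node feature matrix.
    W : (d, q) learnable weight matrix.
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                     # add self-loops: A~ = A + I_n
    d_tilde = A_tilde.sum(axis=1)               # D~_{i,i} = sum_j A~_{i,j}
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return activation(A_hat @ H @ W)

# Toy 3-node path graph with 2-dimensional features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.eye(2)  # identity weights, so only the neighborhood averaging acts
H_new = gcn_layer(A, H, W)
print(H_new.shape)  # (3, 2)
```

Each output row mixes a node's own features with those of its neighbors, weighted by the normalized adjacency, which is exactly the aggregation described above.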

Methodology
In this section, we propose a parallel graph convolutional network image automatic annotation model based on feature fusion (PGCF). Its basic framework is shown in Fig. 3. We first describe the feature extraction for each sample: we use two well-performing networks to extract features from the samples, avoiding the shortcomings and one-sidedness caused by any single network.
Then we fuse the features extracted by the two networks along the channel dimension to construct a more comprehensive sample feature graph, and we multiply the results of the two GCN branches for joint prediction to improve prediction performance.

The Construction of Sample Feature Graphs
In deep learning, features are extracted automatically by an artificial neural network. Compared with hand-crafted extraction, the deep learning approach makes fewer demands on feature design, does not need expert participation, involves less human intervention, and extracts more comprehensive features. This is one reason deep learning has become increasingly popular in practical applications in recent years. With the development of deep learning, classic neural networks such as AlexNet, VGGNet, and ResNet have been widely used in image feature extraction because of their excellent performance. Existing multi-label classification work mainly adopts ResNet or VGG networks. Given the one-sidedness and limitations inevitably caused by extracting features with a single network, this paper uses both networks to extract features from samples at the same time. In image feature fusion, the two classic early-fusion methods are [32]:
• Concat: series feature fusion, which directly connects the two features. The Concat operation splices the features along the channel (or num) dimension, without an element-wise (eltwise) operation.
• Add: parallel strategy, in which the two feature vectors are combined into a complex vector.
To reduce the training parameters, we fuse the sample features along the channel dimension. For each sample image, we use two classical models (VGG19 and ResNet50) to extract its features. Through the VGG19 and ResNet50 network structures, we obtain two 1,000-dimensional feature vectors respectively, and we add the corresponding values of the two vectors. The specific process is shown in Fig. 2. From this, we obtain the final sample feature graphs.
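The Add fusion used here (with Concat shown for contrast) can be sketched as follows; the two 1,000-dimensional vectors are random stand-ins for the VGG19 and ResNet50 outputs:

```python
import numpy as np

def fuse_add(f_vgg, f_resnet):
    """Channel-wise (element-wise) addition of two same-length feature vectors."""
    assert f_vgg.shape == f_resnet.shape
    return f_vgg + f_resnet

def fuse_concat(f_vgg, f_resnet):
    """Series fusion: concatenate along the channel dimension (doubles the length)."""
    return np.concatenate([f_vgg, f_resnet])

rng = np.random.default_rng(0)
f_vgg = rng.standard_normal(1000)     # stand-in for a VGG19 1,000-dim output
f_resnet = rng.standard_normal(1000)  # stand-in for a ResNet50 1,000-dim output

fused = fuse_add(f_vgg, f_resnet)
print(fused.shape)                         # (1000,) -- Add keeps the dimension
print(fuse_concat(f_vgg, f_resnet).shape)  # (2000,) -- Concat doubles it
```

Because Add keeps the channel dimension at 1,000, the downstream GCN needs no extra parameters, which is the motivation stated above.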

GCN With Fused Sample Graph
We have obtained the sample feature graph data: there are N nodes, each node has its own feature, and the node features form an N × D matrix X. The relationships between the nodes form an N × N matrix A, i.e., the adjacency matrix. X and A are the inputs of the sample GCN model. In our work, we use only one layer of GCN, so the operation can be defined by Formula (2), following the propagation rule of Formula (1):

X′ = σ( D̃^{−1/2} Ã D̃^{−1/2} X W )    (2)

where W ∈ R^{D×q} is the learnable weight matrix. The output is a new feature matrix X′ ∈ R^{N×q}, a more refined representation of the samples obtained after the graph convolution.

GCN With Label Graph
In general GNNs, an edge between nodes usually represents one or a few specific relationships; for example, in a social network graph, a node is a user and an edge represents a friendship between users. Our label relationships, however, are diverse, and a single edge cannot encode multiple relationships, so we abstract these complex relationships and reduce them to a single relationship that a GNN can model. For the multi-label image classification problem, we can observe that labels with strong dependencies are likely to be identified in the same picture. In other words, label objects with strong dependencies (such as ship and sea) are more likely to appear in the same picture, regardless of the nature of their relationship. Therefore, we abstract the complex relationships between tags into a single "co-occurrence" relationship: when two tag objects often appear together, we consider them to have a strong dependency. Generally, we measure co-occurrence through conditional probability: P(labelB | labelA) is the probability of labelB appearing given that labelA appears. When this probability exceeds a certain threshold, we consider a co-occurrence relationship to exist between labelA and labelB; P(labelB | labelA) can generally be estimated by statistics over the datasets.
We directly use 300-dimensional GloVe vectors trained on the Wikipedia dataset as the tags' feature vectors. The label feature matrix W ∈ R^{m×k} and the label adjacency matrix A ∈ R^{m×m} are input into the GCN together. Unlike the visual features, this feature matrix is formed by stacking the word vectors that represent the labels.
First, the labels are converted into word vectors, and then the word vectors and the adjacency matrix representing label correlation are input into the GCN to form the classifier. Unlike the adjacency matrix of the samples, the adjacency matrix of the labels is computed from prior knowledge, i.e., from the co-occurrence probability of pairs of labels. First, we obtain the probability matrix by Formula (3):

P_{i,j} = M(i, j) / M(i)    (3)
where M(i, j) denotes the co-occurrence count of tag i and tag j, and M(i) denotes the occurrence count of tag i. Note that although the count matrix M(i, j) is symmetric, the conditional probability matrix P is generally asymmetric, since M(i) and M(j) differ. A drawback of building the adjacency matrix A this way is that the co-occurrence probabilities between one label and the other labels may exhibit a long-tailed distribution. More precisely, a co-occurrence probability between tag A and tag B of 0.001 is so rare that it is likely noise. To this end, we set a threshold ε to filter the noisy edges, as in Formula (4):

A_{i,j} = P_{i,j} if P_{i,j} ≥ ε, and A_{i,j} = 0 otherwise    (4)
Values of the adjacency matrix that reach the threshold are retained; otherwise, they are set to zero. At the same time, since the graph contains no self-connections, the diagonal elements of matrix A are 0.
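Formulas (3) and (4) can be sketched as follows. The toy count matrix is hypothetical; its diagonal stores the per-tag occurrence counts M(i):

```python
import numpy as np

def label_adjacency(M, eps=0.1):
    """Build the label-graph adjacency from co-occurrence counts.

    M : (m, m) matrix where M[i, j] is the number of images containing both
        tag i and tag j, and M[i, i] is the number of images containing tag i.
    Formula (3): P[i, j] = M[i, j] / M[i, i]   (conditional probability)
    Formula (4): keep P[i, j] only when it reaches the threshold eps;
                 the diagonal is zeroed (no self-connections).
    """
    counts = np.diag(M).astype(float)
    P = M / counts[:, None]          # row i divided by the count of tag i
    A = np.where(P >= eps, P, 0.0)   # filter noisy (rare) edges
    np.fill_diagonal(A, 0.0)
    return A

# Toy counts for three tags, say 'sea', 'ship', 'car'.
M = np.array([[100, 60,  1],
              [ 60, 80,  2],
              [  1,  2, 50]], dtype=float)
A = label_adjacency(M, eps=0.1)
print(A)
```

On this toy data, A[0, 1] = 0.6 while A[1, 0] = 0.75, illustrating that the conditional-probability adjacency is asymmetric even though the raw counts are symmetric, and the rare 'sea'/'car' edge is filtered out.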

Parallel Graph Convolutional Neural Network
Having obtained the fused sample graph and the label graph, we input them into two parallel graph convolutional networks to improve computational efficiency and obtain their refined feature matrices. To improve performance, we multiply the two feature matrices produced by the convolutions and input the result into the final loss function for joint prediction.
In the experiment, for a fair comparison, the softmax function of the last layer is replaced by the sigmoid function. Figure 3 describes the processing flow of our algorithm.
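The joint prediction step (multiply the two branch outputs, then apply the sigmoid) can be sketched as follows, with random stand-ins for the two GCN outputs and a toy binary cross-entropy loss:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, q = 5, 4, 8  # n images, m labels, shared refined-feature size q

X_out = rng.standard_normal((n, q))  # stand-in for the sample-branch GCN output
W_out = rng.standard_normal((m, q))  # stand-in for the label-branch GCN output,
                                     # one row per label (label classifier)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# (n, m): per-image, per-label probability from the product of the two branches.
scores = sigmoid(X_out @ W_out.T)

# Toy multi-label targets and the binary cross-entropy used for joint training.
Y = (rng.random((n, m)) > 0.5).astype(float)
bce = -np.mean(Y * np.log(scores) + (1 - Y) * np.log(1 - scores))
print(scores.shape, float(bce))
```

The sigmoid (rather than softmax) lets each label be predicted independently, which is what a multi-label task requires.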

Experimental Setups
We conducted experiments on three standard benchmark datasets, randomly selecting 10% of each as the training set and using the rest as the test set. We take average precision (AP), average recall (AR), and the F1 score as evaluation indicators, and the top five tags are taken as the predicted tags. We trained our model with a learning rate of 0.02 over 450 epochs. In addition, the GCN values for the sample graph and the label graph were set to 0.1 and 1, respectively. For network optimization, we use the Adam optimizer.
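The top-5 evaluation protocol described above can be sketched as follows (the scores and labels are random toy data, not the paper's results):

```python
import numpy as np

def top5_metrics(scores, Y):
    """Per-label average precision (AP), recall (AR), and F1 when the five
    highest-scoring tags of each image are taken as its predicted annotation.

    scores : (n, m) predicted tag scores;  Y : (n, m) binary ground truth.
    """
    n, m = scores.shape
    pred = np.zeros_like(Y, dtype=float)
    top5 = np.argsort(-scores, axis=1)[:, :5]     # indices of the 5 best tags
    pred[np.arange(n)[:, None], top5] = 1.0       # binary top-5 prediction

    tp = (pred * Y).sum(axis=0)                   # true positives per tag
    prec = tp / np.maximum(pred.sum(axis=0), 1.0) # per-tag precision
    rec = tp / np.maximum(Y.sum(axis=0), 1.0)     # per-tag recall
    AP, AR = prec.mean(), rec.mean()
    F1 = 2 * AP * AR / (AP + AR) if (AP + AR) > 0 else 0.0
    return AP, AR, F1

rng = np.random.default_rng(2)
scores = rng.random((20, 10))                     # 20 toy images, 10 toy tags
Y = (rng.random((20, 10)) > 0.7).astype(float)
AP, AR, F1 = top5_metrics(scores, Y)
print(AP, AR, F1)
```

Averaging precision and recall per label (rather than per image) is the convention used in the Corel5k-style annotation literature that the reported AP/AR/F1 numbers follow.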
To show that feature fusion is effective, we set up a control experiment with SPGCN [24]. SPGCN also uses a parallel graph convolutional network; the only difference is that its sample features are extracted only by the ResNet50 model, without using the VGG network to extract or fuse features. Apart from the feature processing, our network structure is identical to it.

Experimental Results
To verify the importance of graph structure in the model, we conducted an NMF-KNN [9] experiment without graph structure as a comparison. Furthermore, to verify the importance of modeling different graph structures, FastTag [1], RPLRF [15], and Laplacian [19,30] serve as comparison methods that use graph regularization instead of a GCN. SPGCN [24] is the control for our feature fusion method. The specific experimental comparisons are shown in Tables 2, 3, and 4, respectively. The results show that, compared with the method without feature fusion, our method achieves the most significant improvement on the Corel5k dataset: AP increased by 0.71%, AR by 0.35%, and the F1 score by 0.72%; the method also achieved improvements of different degrees on the ESP Game and IAPRTC-12 datasets. It can be seen that feature fusion has a clear effect on experimental performance. The same model has different sensitivities to different features depending on how the features are extracted; therefore, when two kinds of images differ only slightly in the features a model is sensitive to, a classifier trained on a single feature has difficulty producing the correct classification. In addition, complex background noise in the image degrades feature quality, which both increases the difficulty of classifier training and reduces classification accuracy. We use two different models, VGG19 and ResNet50, to extract features respectively, which overcomes the inherent defects of single image features and improves the final classification performance. It is worth mentioning that our feature fusion method acts directly on features; its advantage is that it can directly use existing feature extraction algorithms. Compared with redesigning features and feature extraction algorithms, the cost of our method is lower.

Conclusion
In this paper, we discussed the impact of feature fusion on multi-label classification performance from the perspective of data processing. To make up for the inherent defects of feature extraction by a single model, we extracted sample features with two classical models and fused them. We conducted experiments on three benchmark datasets and verified the effectiveness of the feature fusion method through control experiments. More feature fusion methods and the influence of different models on the fusion effect are left for future work.

Fig. 2 Details of sample feature fusion. For the sample image, we set the size to 224×224×3. Through the VGG19 and ResNet50 network structures, we obtain two 1,000-dimensional feature vectors respectively. We add the corresponding values of the two vectors to obtain the final fused feature vector.