Triple-level relationship enhanced transformer for image captioning

Region features and grid features are often used in the field of image captioning. As they are usually extracted by different networks, fusing them for image captioning requires connections between them. However, these connections often rely on simple coordinates, which can lead to captions that lack precise expression of visual relationships. Meanwhile, scene graph features contain object relationship information that, through multi-layer computation, is of a higher level and more complete, and can compensate for the shortcomings of region and grid features to a certain extent. Therefore, a Triple-Level Relationship Enhanced Transformer (TRET) is proposed in this paper, which processes the three features in parallel. TRET obtains and combines object relationship features at different levels so that the different features complement each other. Specifically, we obtain high-level object relationship information with the help of Graph Based Attention, and fuse low-level relational information with high-level object relationship information with the help of Cross Relationship Enhanced Attention, so as to better align the information of the visual and textual modalities. To validate our model, we conduct comprehensive experiments on the MS-COCO dataset. The results indicate that our method achieves better performance than existing state-of-the-art methods and effectively enhances the ability to describe object relationships in the generated captions.


Introduction
Image captioning aims to automatically generate a textual description for a given image using computer vision and natural language processing techniques. For humans, describing a picture is a simple task; for computers, however, it requires understanding the content of the image while generating coherent and semantically correct sentences. Therefore, this task requires an algorithm that models and outputs appropriate and reasonable word sequences based on an understanding of both visual images and text sentences.
In recent years, with the development of deep learning techniques, image captioning has achieved very high evaluation results. Following machine translation, these methods basically adopt the "Encoder-Decoder" architecture [1], in which CNNs [2] and RNNs [3] are widely used. Generally speaking, image features are extracted by a CNN, and a word is generated at each time step by an RNN; the words are combined to obtain an image description sentence [4][5][6][7]. However, because image and text are different modalities of information, processing them independently prevents better interaction between the modalities. Subsequently, attention mechanisms were applied to this task to compensate for this deficiency. An attention mechanism selectively guides the encoding and decoding process through attention weights, allowing better interaction of information between different modalities [8][9][10].
Along with improving the model architecture, some work has focused on refining the feature representation of the images. Vinyals et al. [11] introduced grid features. Multiple grids can cover the whole picture and capture detailed information about the target, but they are criticized for their low semantic level. Anderson et al. [10] introduced region features to this area, greatly reducing the gap between vision and semantics and causing a performance leap in 2018. In 2019, Yang et al. [12] constructed scene graph features. The objects in the image, their attributes, and their relationships can be connected through the scene graph, a unified representation. Each of these features solves some problems while creating new ones. Despite the significant achievements that have been made, previous approaches have not considered combining these three features, which could exploit their complementary advantages to achieve unprecedented results in image captioning.
In this paper, we propose a Triple-Level Relationship Enhanced Transformer (TRET) network to obtain object relationship information at different levels and to fuse scene graph features with region features and grid features separately for subsequent caption generation. Specifically, the region features and grid features are processed by two main branches, relying on coordinate information to construct a geometric map and extract the low-level object relationship information between the two feature types, while the scene graph features are processed by a third main branch to extract high-level object relationship information. The region features and grid features are then each fused with the obtained low-level relationship information, and the scene graph features are fused with the obtained high-level relationship information, yielding three relationship-enhanced features. Next, the region features and grid features are each fused with the scene graph features before being concatenated together. Finally, the concatenated feature is used as the input to the decoder for caption generation. After incorporating multi-level object relationship information, the model can accurately point out the relationships between different objects in a given image.
The contributions of this paper are summarized as follows: 1. A Triple-Level Relationship Enhanced Transformer network is proposed for image captioning. Different levels of object relationship information are obtained by training a triple-way encoder to enhance the description of object relationships in image captions.
2. Graph Based Attention (GBA) is proposed for fusing the object relationship information contained in scene graph features into the object information. Using GBA, the object-related scene graph features are fused with high-level relationship information.
3. Cross Relationship Enhanced Attention (CREA) is proposed for fusing different modalities of visual features. With the help of CREA, scene graph features are fused with region features and grid features, respectively. This way of feature fusion achieves better relationship information enhancement than directly stitching features together.
4. Experimental results on the standard MS-COCO benchmark [13] show that TRET outperforms previous image captioning proposals and that the ability to point out the relationships between image objects in the generated captions is also significantly enhanced.

Related work
Earlier methods of image captioning can be classified into two categories: approaches based on templates [14][15][16][17][18] and approaches based on retrieval [19][20][21][22]. With the rapid development of deep learning techniques, most captioning research is now implemented by constructing an Encoder-Decoder architecture [23, 24]. To improve the performance of image captioning, researchers mostly focus on improving the image features and the model architecture.

Image features
Anderson et al. [10] proposed to use the region features obtained after the RoIPooling layer of Faster R-CNN [25] as the input. They demonstrated experimentally that region features are more effective than image features extracted by pre-trained CNNs and express more complete information, enabling a milestone leap in image captioning. Yang et al. [26] used scene graph features to integrate the semantic and spatial relationships between objects. They fused the information of the detected objects with the attributes of the objects and the relationships between them as the input to the network. Grid features were introduced by Vinyals et al. [11] in 2015, and Jiang et al. [27] proved the effectiveness of grid features in image captioning by comparing their performance with region features, showing that grid features can also improve model accuracy while reducing the cost and time of training. There are still some shortcomings in realizing the crossing from the image modality to the text modality: a single visual feature cannot fully express the information in an image. Luo et al. [28] achieved the complementarity of region and grid features in 2021, constructing a relationship matrix between them with the help of bounding-box coordinates. However, the simple positional relationship constructed from the intersection of region and grid coordinates has certain shortcomings in describing the object relationships in images.

Model architecture
The Transformer has recently been widely used in image captioning. In 2019, Huang et al. [29] proposed an extended form of attention to complete captioning by constructing a Transformer-like encoder combined with an LSTM decoder. In 2020, Cornia et al. [30] proposed a Meshed Transformer with Memory, constructing a Transformer-based encoder-decoder model to achieve feature fusion between different levels. Luo et al. [28] proposed a two-layer collaborative Transformer network to fuse two features. These studies stack self-attention modules according to different needs, thus enabling architectural innovation and performance improvements.

Methods
We construct a triple-way encoder structure for processing multiple image features, fusing and interacting features through multiple attention mechanisms and enhancing the information about the objects and the relationships between them in the given image. The decoder takes the output of the encoder as input to generate captions word by word. The overall model structure is shown in Fig. 1.

Fundamental of attention
We use scaled dot-product attention as the scoring function, which is computationally more efficient than using recursion. Attention is applied to three sets of vectors, a set of queries Q, keys K, and values V, and the values are weighted and summed according to the similarity distribution of the queries and keys. The scoring function for the scaled dot-product attention can be expressed as

$$f(Q, K) = \frac{QK^{T}}{\sqrt{d}}$$

The attention using this scoring function can be denoted as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

where Q is a matrix consisting of $n_q$ query vectors, both K and V contain $n_k$ pairs of keys and values with the same dimensionality, and d is a scaling factor.
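A minimal sketch of the scaled dot-product attention described above, written in PyTorch; the tensor shapes (50 regions, 49 grids, dimension 512) are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d), K: (n_k, d), V: (n_k, d)
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # similarity of queries and keys
    weights = F.softmax(scores, dim=-1)           # similarity distribution
    return weights @ V                            # weighted sum of the values

Q = torch.randn(50, 512)   # e.g. 50 region queries
K = torch.randn(49, 512)   # e.g. 7x7 = 49 grid keys
V = torch.randn(49, 512)
out = scaled_dot_product_attention(Q, K, V)       # (50, 512)
```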

Triple-level relationship enhanced encoder
Our encoder consists of five sub-modules: Position Encoding, Self Attention, Locality-Constrained Cross Attention, Graph Based Attention and Cross Relationship Enhanced Attention. For one input image I, multiple features are obtained by multiple networks: region features $V_R$ are extracted by BUTD [10], grid features $V_G$ are extracted by IDGF [31], and scene graph features G are extracted by a multi-modal graph convolution network (MGCN) [26]. The attention mechanisms process these three extracted features for information interaction and feature fusion.

Fig. 1 Architecture of the proposed Triple-Level Relationship Enhanced Transformer network. TRET consists of stacked attention mechanisms. "SA" indicates a self-attention layer and "CA" indicates a cross-attention layer. The triple-way encoder structure is applied to fuse scene graph, region and grid features and to mine the low-level object relationship information between region and grid features. Graph Based Attention processes the object information and object relationship information in the scene graph features and fuses high-level object relationship information into the visual features. Feature fusion is then completed with the help of Cross Relationship Enhanced Attention. Finally, the concatenated feature output is used as the input of the decoder to complete image captioning.

Position encoding
Since the Transformer itself does not have the ability to learn order information as an RNN does, it is necessary to give the model this ability by adding positional encoding. In our proposal, we fuse absolute location information, which tells the model about the location of each feature, and relative location information, which tells the model about the locations of features relative to each other. Absolute position encoding can distinguish items with the same appearance. For regions, with the help of the four-dimensional bounding-box coordinates $(x_{min}, y_{min}, x_{max}, y_{max})$, the region positional encoding $AP_{RPE}$ can be defined as

$$AP_{RPE}^{i} = W_R \, [x_{min}^{i}, y_{min}^{i}, x_{max}^{i}, y_{max}^{i}]^{T}$$

where i is the index of the bounding box, $(x_{min}, y_{min})$ and $(x_{max}, y_{max})$ represent the coordinates of the upper-left and lower-right corner points, respectively, and $W_R \in \mathbb{R}^{d \times 4}$ denotes an embedding parameter matrix.
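A small sketch of the region absolute position encoding, under the assumption that it is a single learned linear projection of the four bounding-box coordinates, matching the definition of $W_R \in \mathbb{R}^{d \times 4}$ above; the box values are toy data.

```python
import torch
import torch.nn as nn

d_model = 512
W_R = nn.Linear(4, d_model, bias=False)            # embedding parameter matrix W_R in R^{d x 4}

boxes = torch.tensor([[0.10, 0.20, 0.55, 0.80],    # [x_min, y_min, x_max, y_max] per region
                      [0.30, 0.05, 0.90, 0.60]])
AP_RPE = W_R(boxes)                                 # (num_regions, d_model)
```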
For grids, we obtain their absolute position information by concatenating two one-dimensional sine and cosine embeddings. The grid positional encoding $AP_{GPE}$ can be defined as

$$AP_{GPE}(x, y) = [GPE_x(x);\; GPE_y(y)]$$

where x and y are the row and column index, and $GPE_x, GPE_y \in \mathbb{R}^{\frac{d}{2}}$ follow the sinusoidal form

$$GPE(i, 2j) = \sin\!\left(\frac{i}{10000^{2j/d}}\right), \quad GPE(i, 2j+1) = \cos\!\left(\frac{i}{10000^{2j/d}}\right)$$

where i represents the position and j represents the dimension.
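A hedged sketch of the grid absolute position encoding: standard one-dimensional sine/cosine embeddings of the row and column indices are concatenated. The base constant 10000 follows the original Transformer and is an assumption here.

```python
import torch

def sincos_1d(pos, dim):
    # standard sinusoidal embedding of a scalar position into `dim` channels
    i = torch.arange(dim // 2, dtype=torch.float32)
    freq = torch.pow(10000.0, -2 * i / dim)
    angles = pos * freq
    return torch.cat([torch.sin(angles), torch.cos(angles)])

def grid_position_encoding(x, y, d_model=512):
    # concatenate the row embedding and the column embedding, each of size d/2
    return torch.cat([sincos_1d(float(x), d_model // 2),
                      sincos_1d(float(y), d_model // 2)])

AP_GPE = torch.stack([grid_position_encoding(x, y)
                      for x in range(7) for y in range(7)])   # (49, 512) for a 7x7 grid
```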
Relative position encoding adds relative position information based on the geometric structure of the bounding boxes. A grid, as a special kind of bounding box, can be represented as (x, y, w, h), where (x, y) is the coordinate of the center point, w is the width and h is the height. The geometric relationship between $Box_i$ and $Box_j$ can be represented as a four-dimensional vector

$$RP(i, j) = \left(\log\frac{|x_i - x_j|}{w_i},\; \log\frac{|y_i - y_j|}{h_i},\; \log\frac{w_i}{w_j},\; \log\frac{h_i}{h_j}\right)^{T}$$

Then, RP(i, j) is encoded by the Emb method [32] and is mapped to a scalar

$$RP_{i,j} = \mathrm{Emb}(RP(i, j))\, W_{box}$$

where $W_{box}$ is a learned parameter matrix.
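A sketch of how the relative position scalar between two boxes might be computed. The four-dimensional log-ratio vector is the common formulation in the object-relation literature, and the `emb` layer below is a simple stand-in for the Emb method of [32]; both are assumptions rather than the paper's exact definitions.

```python
import torch
import torch.nn as nn

def box_relation(box_i, box_j, eps=1e-3):
    # each box is (x, y, w, h) in center form; returns the 4-D geometric vector
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    return torch.stack([
        torch.log(torch.abs(xi - xj).clamp(min=eps) / wi),
        torch.log(torch.abs(yi - yj).clamp(min=eps) / hi),
        torch.log(wi / wj),
        torch.log(hi / hj),
    ])

d_g = 64
emb = nn.Linear(4, d_g)            # stand-in for the sinusoidal Emb step of [32]
W_box = nn.Linear(d_g, 1)          # learned projection of the embedding to a scalar bias

box_i = torch.tensor([0.4, 0.5, 0.3, 0.2])   # toy (x, y, w, h)
box_j = torch.tensor([0.6, 0.5, 0.2, 0.4])
rp_scalar = W_box(torch.relu(emb(box_relation(box_i, box_j))))   # scalar RP(i, j)
```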

Self-attention
Self-attention is used to model the region features and the grid features, respectively. The absolute and relative position information is integrated with the region or grid features with the help of the self-attention modules. The hidden states of the region or grid features of the previous layer are used as the input of the current layer. Specifically, taking the region features as an example, K and Q are modified by adding position information at every attention layer. Therefore, the attention weights of the first layer can be defined as

$$W = \frac{(Q + AP_q)(K + AP_k)^{T}}{\sqrt{d}}$$

where $AP_q$ and $AP_k$ are the absolute position encodings of Q and K. Q, K and V are all obtained from the input $V_R$ by linear transformation. Then, we use relative position encoding to adjust W:

$$W' = W + RP$$

Finally, softmax is performed on $W'$ and the output of this attention mechanism is calculated as $\mathrm{softmax}(W')V$. The multi-head attention (MHA) can be formalized as

$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(head_1, head_2, ..., head_h)\, W^{O}$$

The hidden states of the regions $H_R^{l}$ and those of the grids $H_G^{l}$ in the previous layer are fed into the $(l+1)$-th layer:

$$\tilde{H}_R^{l+1} = \mathrm{MHA}(H_R^{l}, H_R^{l}, H_R^{l};\, AP_R, RP_{RR}), \quad \tilde{H}_G^{l+1} = \mathrm{MHA}(H_G^{l}, H_G^{l}, H_G^{l};\, AP_G, RP_{GG})$$

where $RP_{RR}$ and $RP_{GG}$ are the relative position encodings of regions to regions and grids to grids, respectively. Then, the two outputs are passed through two independent position-wise feed-forward layers (FFN):

$$H_R^{l+1} = \mathrm{FFN}(\tilde{H}_R^{l+1}), \quad H_G^{l+1} = \mathrm{FFN}(\tilde{H}_G^{l+1})$$

After the FFN, these outputs are fed into the next module.
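A sketch of one position-aware self-attention head under the formulation above: absolute encodings are added to the queries and keys, and the relative-position matrix biases the score map before the softmax. Linear projections and multi-head splitting are omitted for brevity, so this is an assumption-laden simplification rather than the exact module.

```python
import torch
import torch.nn.functional as F

def position_aware_attention(X, AP, RP):
    # X: (n, d) features, AP: (n, d) absolute encodings, RP: (n, n) relative position bias
    d = X.size(-1)
    Q = K = V = X                                   # per-head linear projections omitted
    W = (Q + AP) @ (K + AP).transpose(-2, -1) / d ** 0.5
    W_prime = W + RP                                # inject low-level geometric relations
    return F.softmax(W_prime, dim=-1) @ V

n, d = 50, 512
H_R = torch.randn(n, d)            # region hidden states of the previous layer
AP_R = torch.randn(n, d)           # absolute position encodings of the regions
RP_RR = torch.randn(n, n)          # region-to-region relative position scalars
H_next = position_aware_attention(H_R, AP_R, RP_RR)
```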

Locality-constrained cross attention
Locality-Constrained Cross Attention (LCCA) is used to let the region features and the grid features interact. The hidden states of the previous layer of region features and of grid features are used as the subjects of inter-layer fusion with the corresponding other feature. Semantic noise is avoided by constructing an alignment graph AG = (V, E) between the region blocks and the grid blocks. AG is an alignment map that records whether a region block is associated with a grid block, determined by whether the bounding boxes of the image blocks intersect. The image blocks of the region features and of the grid features are regarded as nodes to construct the node set V. When the boundaries of two image blocks overlap, an edge is created between the two nodes, and the edge set E is constructed in this way. On the basis of the constructed geometric alignment map, the input region features and grid features are processed to compute the attention between the source and target visual feature domains. In addition, the absolute position information and relative position information are integrated to obtain the weight matrix W. Then, we normalize it over the neighbourhood of each node:

$$\alpha_{ij} = \frac{\exp(W_{ij})}{\sum_{j \in A(v_i)} \exp(W_{ij})}$$

where $v_i$ is a node and $A(v_i)$ is the set of neighbouring nodes of $v_i$. The weighted sum is expressed as

$$O_i = \sum_{j \in A(v_i)} \alpha_{ij} V_j$$

where $V_j$ is the j-th visual node value. This phase assigns zero weight to non-adjacent visual nodes. The multi-head Locality-Constrained Cross Attention (MHLCCA) module can be represented as

$$\mathrm{MHLCCA}(Q, K, V) = \mathrm{Concat}(head_1, head_2, ..., head_h)\, W^{O}$$

with each head computed by $\mathrm{LCCA}(Q, K, V, AP_q, AP_k, RP, AG)$. In this phase, region features and grid features are alternately used as the source and target fields. For the l-th layer, the output can be represented as

$$\tilde{H}_R^{l+1} = \mathrm{MHLCCA}(H_R^{l}, H_G^{l}, H_G^{l};\, RP_{RG}, AG), \quad \tilde{H}_G^{l+1} = \mathrm{MHLCCA}(H_G^{l}, H_R^{l}, H_R^{l};\, RP_{GR}, AG)$$

where $RP_{RG}$ and $RP_{GR}$ are the relative position encodings of regions to grids and grids to regions, respectively. The two outputs are passed through FFN:

$$H_R^{l+1} = \mathrm{FFN}(\tilde{H}_R^{l+1}), \quad H_G^{l+1} = \mathrm{FFN}(\tilde{H}_G^{l+1})$$

These outputs serve as the input of the next self-attention module.
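A hedged sketch of the locality constraint: attention scores between regions and grids are masked by the alignment graph AG so that non-adjacent (non-overlapping) nodes receive zero weight after the softmax. Position terms and multi-head logic are omitted, and the overlap indicator is toy data.

```python
import torch
import torch.nn.functional as F

def lcca(H_src, H_tgt, AG_mask):
    # H_src: (n_src, d) queries of the target field, H_tgt: (n_tgt, d) keys/values of the source field
    # AG_mask: (n_src, n_tgt) boolean, True where bounding boxes overlap (an edge in AG)
    d = H_src.size(-1)
    scores = H_src @ H_tgt.transpose(-2, -1) / d ** 0.5
    scores = scores.masked_fill(~AG_mask, float('-inf'))   # drop non-adjacent nodes
    weights = F.softmax(scores, dim=-1)
    return weights @ H_tgt

n_regions, n_grids, d = 50, 49, 512
H_R, H_G = torch.randn(n_regions, d), torch.randn(n_grids, d)
AG = torch.rand(n_regions, n_grids) > 0.5       # toy overlap indicator
AG[:, 0] = True                                 # toy guard: every region has at least one neighbour
AG[0, :] = True                                 # toy guard: every grid has at least one neighbour
H_R_fused = lcca(H_R, H_G, AG)                  # regions attend to overlapping grids
H_G_fused = lcca(H_G, H_R, AG.T)                # grids attend to overlapping regions
```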

Graph based attention
Both the self-attention and the LCCA mentioned above deal with the region features and the grid features, and their information interaction is done through the geometric interaction graph constructed from these two features. However, the information contained in the geometric interaction graph, built on the simple bounding-box intersection principle, is low-level relational information. The scene graph feature is a unified representation that connects the objects in the image, the attributes of the objects, and the relationships between objects; it is obtained with the help of the MGCN network, and the relationship information it implies is high-level information.
Due to the complex structure of the scene graph features themselves, it is hard to represent and compute them directly with a feature vector like ordinary visual features. GBA can better handle the problems caused by the structured nature of the scene graph features: it takes the object features as input, treats the relationship features between objects as high-level relationship information, adds them to the calculated attention weights as a bias, and completes the subsequent calculations on the biased weight matrix. That is, GBA can combine the properties of graphs to perform attention calculations. Figure 2b shows the internal structure of the GBA. We process the scene graph features G with the GBA module. In the process, we likewise construct REG, which records the relational weights of the associations that exist between pairs of objects in the scene graph features.
The object features $G_{obj}$ in the extracted scene graph features are taken as the input and processed by the attention layer to obtain the weights W:

$$W = \frac{QK^{T}}{\sqrt{d}}$$

where Q and K are obtained from $G_{obj}$ by linear transformation. Then, the information about object relationships $G_{rela}$ in the scene graph features is processed to construct the relationship weight graph REG, which biases the weights:

$$W' = W + REG$$

The calculated weight values are normalized by softmax to obtain the output $\mathrm{softmax}(W')V$. The multi-head Graph Based Attention can be defined as

$$\mathrm{MHGBA}(Q, K, V) = \mathrm{Concat}(head_1, head_2, ..., head_h)\, W^{O}$$

The hidden state output of the scene graph features $H_{SG}^{l}$ in the previous layer is fed into the $(l+1)$-th layer:

$$\tilde{H}_{SG}^{l+1} = \mathrm{MHGBA}(H_{SG}^{l}, H_{SG}^{l}, H_{SG}^{l};\, REG)$$

The output is then passed through an FFN:

$$H_{SG}^{l+1} = \mathrm{FFN}(\tilde{H}_{SG}^{l+1})$$

The output of the l-th GBA serves as the input of the next GBA module.
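A hedged sketch of Graph Based Attention: object embeddings attend to each other, and a relationship weight graph REG biases the score map before the softmax. The reduction of the pairwise relation embeddings G_rela to scalar weights shown here is a toy stand-in for the paper's construction of REG, and the projections are omitted.

```python
import torch
import torch.nn.functional as F

def graph_based_attention(G_obj, REG):
    # G_obj: (n_obj, d) object node embeddings, REG: (n_obj, n_obj) relationship weight graph
    d = G_obj.size(-1)
    Q = K = V = G_obj                             # linear projections omitted for brevity
    W = Q @ K.transpose(-2, -1) / d ** 0.5
    W_rel = W + REG                               # add high-level relationship information as a bias
    return F.softmax(W_rel, dim=-1) @ V

n_obj, d = 12, 512
G_obj = torch.randn(n_obj, d)                     # object embeddings of the scene graph
G_rela = torch.randn(n_obj, n_obj, d)             # pairwise relation embeddings (toy data)
REG = G_rela.mean(dim=-1)                         # toy reduction of relations to scalar weights
H_SG = graph_based_attention(G_obj, REG)          # relation-enhanced object representations
```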

Cross relationship enhanced attention
For the information-enhanced region features and grid features as well as the scene graph features, we perform pairwise feature fusion with the help of two Cross Relationship Enhanced Attention (CREA) modules. Figure 2a shows the internal structure of CREA. CREA enables relational enhancement of the visual information, so that the finally generated image captions contain enhanced descriptions of the relationships between objects in the given image. The output obtained by the stacked GBA modules is computed separately with the other two visual outputs. Taking the region features and scene graph features as an example, the region features and scene graph features output by the previous attention layer are used as the input to CREA and are computed in the attention layer to obtain the weights W:

$$W = \frac{QK^{T}}{\sqrt{d}}$$

where Q is obtained from the region feature output $H_R$ and K is obtained from the scene graph feature output $H_{SG}$. The softmax operation is then performed on the weights W, followed by the dot product with V, where V is also obtained from $H_{SG}$:

$$O_R = \mathrm{softmax}(W)\,V$$

The final output $O_R$ is the region feature after relationship enhancement. The grid feature after relationship enhancement, $O_G$, is obtained by performing the same operation with Q taken from $H_G$. The two outputs are passed through FFN:

$$O'_R = \mathrm{FFN}(O_R), \quad O'_G = \mathrm{FFN}(O_G)$$

Finally, a feature concatenation operation is performed on the two outputs $O'_R$ and $O'_G$, which gives the final visual output of the encoder:

$$O = \mathrm{Concat}(O'_R, O'_G)$$
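A minimal sketch of CREA and the final concatenation, assuming single-head attention without the linear projections: region (or grid) hidden states query the scene graph hidden states, and the two enhanced outputs are concatenated as the encoder output. The FFN step is omitted.

```python
import torch
import torch.nn.functional as F

def crea(H_vis, H_SG):
    # H_vis: (n_vis, d) region or grid hidden states, H_SG: (n_obj, d) scene graph hidden states
    d = H_vis.size(-1)
    W = H_vis @ H_SG.transpose(-2, -1) / d ** 0.5   # Q from H_vis, K from H_SG
    return F.softmax(W, dim=-1) @ H_SG              # V also taken from H_SG

d = 512
H_R, H_G, H_SG = torch.randn(50, d), torch.randn(49, d), torch.randn(12, d)
O_R = crea(H_R, H_SG)                 # relationship-enhanced region features
O_G = crea(H_G, H_SG)                 # relationship-enhanced grid features
O = torch.cat([O_R, O_G], dim=0)      # (99, d) final visual output of the encoder
```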

Triple-level relationship enhanced decoder
The decoder is constructed from a multi-layer attention mechanism, in which the words generated previously and the visual information output from the encoder are used as input to generate the next words that make up the image caption. Before decoding starts, the position of each word in the sentence sequence is represented by a sinusoidal positional encoding [32], which is summed with the one-hot word vector after a linear projection and used as the initial input.
For the obtained vector D, which contains the embedding information of the words generated so far, the weights W are first obtained in the attention layer through a self-attention mechanism:

$$W = \frac{QK^{T}}{\sqrt{d}}$$

where Q, K and V are all obtained from the vector D by linear transformation. Then, the softmax operation is performed on the calculated weights, followed by the dot-product operation:

$$D' = \mathrm{softmax}(W)\,V$$

The output sequence $D'$ is then computed with the visual output of the last layer of the encoder O:

$$D'' = \mathrm{softmax}\!\left(\frac{D'O^{T}}{\sqrt{d}}\right)O$$

where $D''$ is the input of the next layer. After calculating the correlation between the output sequence and the visual output, a linear projection and a softmax operation are performed to obtain the probabilities of the words in the dictionary.
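A sketch of one decoder layer under the usual Transformer assumptions: causally masked self-attention over the generated word embeddings D, cross-attention to the encoder output O, and a linear projection followed by softmax to obtain word probabilities. The projection weights here are random placeholders and the per-head projections are omitted.

```python
import torch
import torch.nn.functional as F

def attend(Q, K, V, mask=None):
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))
    return F.softmax(scores, dim=-1) @ V

t, d, n_vis, vocab = 8, 512, 99, 10000
D = torch.randn(t, d)                          # embeddings of the words generated so far
O = torch.randn(n_vis, d)                      # visual output of the encoder
causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)   # mask future positions

D_prime = attend(D, D, D, mask=causal)         # masked self-attention over the word sequence
D_dprime = attend(D_prime, O, O)               # cross-attention with the visual output
W_vocab = torch.randn(d, vocab)                # linear projection (toy weights)
probs = F.softmax(D_dprime @ W_vocab, dim=-1)  # word probabilities per time step
```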

Objectives
Following the standard practice in this field [6, 10, 33], we pre-train TRET using a word-level cross-entropy loss (XE) and fine-tune the caption generation with reinforcement learning. During XE training, the model is trained to predict the next token given the previous ground-truth tokens, so the whole output sequence can be computed in a single forward pass and processed in parallel.
When we train the model with reinforcement learning, we use a variant of the self-critical sequence training method for sequences: at each decoding time step we always keep the top-k ranked words in the word probability distribution and perform sequence decoding iteratively.
Given a ground-truth sequence S and a model with parameters $\theta$, we use the cross-entropy loss (XE):

$$L_{XE}(\theta) = -\sum_{t} \log p_{\theta}(S_t \mid S_{1:t-1})$$

The CIDEr score is then optimized by the self-critical sequence training method:

$$\nabla_{\theta} L_{RL}(\theta) = -\frac{1}{k} \sum_{i=1}^{k} \left(r(s^{i}) - b\right) \nabla_{\theta} \log p_{\theta}(s^{i}), \quad b = \frac{1}{k}\sum_{i=1}^{k} r(s^{i})$$

where k is the beam size, $s^{i}$ is the i-th sentence in the beam, r is the reward function, and the baseline b is calculated as the average of the rewards received over the sampled sequences. We use beam search for decoding and keep the sequence with the highest prediction probability in the final beam as the output of the decoder.
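A hedged sketch of the self-critical objective described above: the reward of each beam sentence is baselined by the mean reward of the beam and used to weight its log-probability. The rewards here are placeholder numbers; a real setup would score the sampled sentences with CIDEr.

```python
import torch

def scst_loss(log_probs, rewards):
    # log_probs: (k,) summed log-probabilities of the k beam sentences
    # rewards:   (k,) CIDEr scores of the same sentences
    baseline = rewards.mean()                      # average reward over the beam
    advantage = rewards - baseline
    return -(advantage * log_probs).mean()         # REINFORCE with a self-critical baseline

k = 5                                              # beam size
log_probs = torch.randn(k, requires_grad=True)
rewards = torch.tensor([0.9, 1.1, 1.3, 0.8, 1.0])  # placeholder CIDEr rewards
loss = scst_loss(log_probs, rewards)
loss.backward()                                    # gradients favour sentences above the baseline
```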
Finally, our decoder stacks multiple decoding modules as above to get the image captions.

Dataset
We evaluate TRET on the MS-COCO dataset, which is one of the most commonly used datasets in the field of image captioning. The MS-COCO dataset contains 123,287 images, each of which corresponds to 5 different captions. There are two common ways to split this dataset: the official online test split and the third-party Karpathy split for offline testing [34]. We use the latter to assign the data to the training, validation and test sets.

Metrics
Following the criteria for evaluation, we use BLEU [35], METEOR [36], ROUGE [37], and CIDEr [38] as the evaluation metrics to evaluate our proposal. For these metrics, higher values indicate better performance.

Implementation details
In order to obtain different visual features, we extract them with the help of three networks. First, the region features are obtained using a Faster R-CNN pre-trained on the Visual Genome dataset; 50 regions are extracted for each image and each is represented by a 2048-dimensional feature vector. Then, we obtain the grid features by segmenting each image into 7 × 7 image blocks, which are also represented by 2048-dimensional feature vectors. Finally, the scene graph features are extracted with the help of the MGCN, consisting of an object detector, an attribute classifier and a relationship classifier, which yields three feature embeddings: the relationship embeddings of relationship nodes, the object embeddings of object nodes and the attribute embeddings of object nodes. Because the number of objects extracted per image varies, the dimensionality of the scene graph vector varies as well; it is adjusted to a unified dimensionality within each batch for subsequent calculations.
In the experimental implementation, we set the dimension of each layer to 512 and the number of heads to 8. The outputs of each attention and feed-forward layer are subject to dropout with probability 0.9. The learning rate is adjusted in four stages during training: it increases linearly to 1 × 10⁻⁴ over the first 4 epochs, is kept at that value for epochs 5-10, changes to 2 × 10⁻⁶ for epochs 11-12, and to 4 × 10⁻⁷ after epoch 12. Model optimization is performed with Adam [39], and the beam size is set to 5. After 18 epochs, reward optimization is performed using the CIDEr score [6].
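A small sketch of the staged learning-rate schedule as described in the text; the epoch boundaries and values come from the paragraph above, while the linear warm-up shape within the first 4 epochs is an assumption.

```python
def learning_rate(epoch):
    # staged schedule: warm-up, plateau, then two decay stages
    if epoch <= 4:
        return 1e-4 * epoch / 4          # linear warm-up to 1e-4 (assumed shape)
    if epoch <= 10:
        return 1e-4                      # kept constant for epochs 5-10
    if epoch <= 12:
        return 2e-6                      # epochs 11-12
    return 4e-7                          # after epoch 12

for epoch in range(1, 15):
    print(epoch, learning_rate(epoch))
```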

Ablation study
Several ablation experiments are conducted to validate the proposed contributions.

Relationship information
Our approach incorporates two different levels of relational information: a low-level image-block location relationship (L) obtained with the help of region features and grid features, and a high-level object relationship (H) represented with the help of scene graph features. To better understand the impact of these two kinds of relational information, we conduct two ablation experiments, one incorporating only the low-level relationship information and one incorporating both the low-level and high-level relationship information; the results are shown in Table 1. The results demonstrate that the model fusing both kinds of relationship information evaluates better than the model fusing only a single kind. This indicates that fusing different levels of relationship information helps the model better discover the relationships between objects and thus generate more accurate and appropriate description sentences.

Graph based attention
To demonstrate the effectiveness of Graph Based Attention, we perform the ablation experiments shown in Table 2. The results indicate that GBA can effectively help fuse the visual information in scene graph features and can enhance the ability of the model to mine deep object relationships. By fusing the visual information of objects with the visual information of object relationships in the scene graph features, it improves the BLEU-4 score from 37.2% to 37.5%. The high-level relationship information obtained after multi-layer computation is better integrated into the scene graph features by GBA, which makes the model-generated captions concentrate better on the relationships between objects in the images.
We notice that there is no significant improvement in terms of CIDEr. The reason is that CIDEr mainly evaluates whether the output sentence captures the key information of the image, so the weight of non-keywords is reduced. However, GBA mainly enhances the relational information, which tends to appear as non-key words in the sentence. Therefore, this enhancement has less impact on CIDEr.

Cross relationship enhanced attention
Cross Relationship Enhanced Attention is the module used to fuse the three features. To demonstrate its effectiveness, we conduct the following two ablation experiments: one without CREA, achieving feature fusion by simple feature summing and splicing operations, and one using CREA for feature fusion. The results presented in Table 3 show that the performance of the model is significantly improved when using CREA for feature fusion compared with concatenating the three features directly. We can say that CREA achieves an effect of 1 + 1 + 1 > 3.

Performance comparison
We reproduce several recently proposed Transformer-based image captioning models and compare their performance with our approach. The compared models include the x-transformer [40], M2 [30], and DLCT [28]. For the x-transformer and M2, we apply region features and grid features, respectively, as the visual features for subsequent processing. As shown in Table 4, our TRET shows better performance than the other models on BLEU-1 to BLEU-4, ROUGE, and CIDEr, and is on a par with DLCT on METEOR. The score improvements on ROUGE and BLEU-3/4 prove the advantage of TRET, which enhances the ability of relationship representation in image captioning by combining the low-level relationship information represented by region and grid features with the high-level relationship information represented by scene graph features. Figure 3 shows examples of images and their corresponding captions generated by the original Transformer and TRET, together with the ground-truth sentences. From these examples, it is obvious that TRET performs better in mining object relationships in images.

Conclusion
In this paper, we propose TRET, a Triple-Level Relationship Enhanced Transformer for image captioning. We achieve the complementarity of scene graph features, region features, and grid features with multiple main branches trained in parallel, and fuse the visual information of object relationships at different levels. By leveraging two modules, GBA and CREA, our model obtains information about high-level object relationships and achieves feature fusion, which compensates to a certain extent for the weakness of image captioning models in describing the relationships between objects. We conduct offline training and testing on the MS-COCO dataset, and the achieved performance is significantly improved. Although the method proposed in this paper enhances the ability of the model to describe object relationships, there are still some drawbacks, such as the long training time caused by the introduction of graph convolutional networks. In future work, we hope to continue to dig deeper into the visual information represented by image features in the hope of gaining new advances.