RNIC: A retrospect network for image captioning

As cross-domain research combining computer vision and natural language processing, current image captioning research mainly considers how to improve visual features; less attention has been paid to exploiting the inherent properties of language to boost captioning performance. To address this challenge, we propose a textual attention mechanism, which obtains the semantic relevance between words by scanning all previously generated words. The retrospect network for image captioning (RNIC) proposed in this paper aims to improve the input and prediction processes by using textual attention. Concretely, the textual attention mechanism is applied to the model simultaneously with the visual attention mechanism, providing the input of the model with the maximum information required for generating captions. In this way, our model learns to collaboratively attend to both visual and textual features. Moreover, the semantic relevance between words obtained by retrospection is used as the basis for prediction, so that the decoder can simulate the human language system and make better predictions based on the already generated content. We evaluate the effectiveness of our model on the COCO image captioning dataset and achieve superior performance over previous methods.


Introduction
Image captioning is a fundamental problem in computer vision, which aims to identify the objects within an image, understand the relationships between them, and describe the image in a natural language that humans can understand. The difficulty of image captioning research is to make the computer "see" the visible objects and "understand" the invisible relationships between objects, which is much more difficult than image classification and object detection. Because of its remarkable role in image/video retrieval and in assisting visually impaired groups to perceive their environment, image captioning has attracted wide interest from academia (Liu et al. 2021) and industry (Cornia et al. 2020; Ji et al. 2021). In recent years, the attention mechanism has been widely used in various tasks (Fan et al. 2020), focusing only on selective parts of the whole visual space when and where needed. However, as cross-domain research combining computer vision and natural language processing, relying on visual features alone is still not sufficient to generate high-quality captions; textual information is also crucial for improving model performance. State-of-the-art image captioning models with long short-term memory (LSTM; Hochreiter and Schmidhuber 1997) as the decoder are too simple in their utilization of textual information. This has two manifestations. First, the decoder only uses adjacent textual information as input, and earlier textual information is passed only through the memory unit of the LSTM, which is not effective in dealing with long-term dependency problems. As shown in Fig. 1, when "paddle" is to be predicted, information about "surfing" is not well transferred to that moment because the interval is too long. Second, the semantic correlation between words is ignored in the final prediction process, so the inherent properties of language cannot be exploited to improve the performance of the model.
In this paper, following the conventional encoder-decoder framework, we propose the RNIC model, which can improve the input and prediction processes. Different from previous methods, which boost captioning performance by improving the visual attention mechanism, our RNIC applies attention in both the visual and textual domains.

Fig. 1 The attention weight distribution over the past generated words ("A man surfing a small wave with a paddle") when predicting the word "paddle." The thicker line indicates a relatively larger weight
The main contributions of this paper are as follows:

1. In response to the problem that current mainstream models overemphasize how to improve visual features, we propose a textual attention mechanism for image captioning. The textual attention mechanism allows the model to trace back to the textual information that is most relevant to the prediction at the current moment.
2. We explore the role of the textual attention mechanism: it can effectively improve both model input and prediction.
3. The RNIC model applies both textual attention and visual attention, so that the model is able to make predictions based on what has already been generated, and its ability to handle long-term dependencies is significantly enhanced.


Related work

Visual attention

Xu et al. (2015) first introduced the attention mechanism into image captioning, generating a matrix to weight each receptive field in the encoded feature map. Instead of only attending to the receptive fields in the encoded feature map, Chen et al. (2017) added a feature channel attention module. Lu et al. (2017) proposed adaptive attention, which adaptively decides when and where to rely on visual information. To solve the problem that the above models lack accurate positioning of informative regions in the original image, Anderson et al. (2018) proposed the bottom-up and top-down attention mechanism, where bottom-up attention first uses object detection models to detect multiple informative regions in the image, and top-down attention then attends to the most relevant detected regions when generating a word. Yao et al. (2018) injected a graph convolutional neural network to relate the detected informative regions and therefore refine their features before feeding them into the decoder.

Textual attention
Though no prior work has explored textual attention in image captioning, there are some related works in natural language processing. In Bahdanau et al. (2014), the authors propose RNNSearch to learn an alignment over the input sentences. Rocktäschel et al. (2015) propose a more fine-grained attention mechanism to reason about the entailment of two sentences. Yin et al. (2016) propose an attention-based bigram CNN that jointly performs attention between two CNN hierarchies.

Visual attention with textual attention in visual question answering (VQA).
In the VQA task, Nam et al. (2017) combined visual attention with textual attention to capture the fine-grained interactions between vision and text, focusing on specific regions in images and segments of text to gather the necessary information. Lu et al. (2016) proposed co-attention to make the model focus on different regions of the image as well as different segments of the text (questions), and modeled the text at three levels to capture different granularities of information.
Unlike the VQA task, image captioning is a language-generating process, and only the already generated textual information is known at prediction time. Ke et al. (2019) proposed reflective attention, which combines textual attention with visual attention for the first time and applies it to image captioning. Reflective attention calculates the attention of the hidden units over all moments and uses the results as a basis for prediction. In this paper, we propose a more direct textual attention mechanism that calculates the similarity between the hidden units and the generated words to obtain the textual information required at each time step, and uses it to improve the input and prediction of the model.

Method
We adopt the popular encoder-decoder framework for image caption generation. Our model (see Fig. 2 for the model structure) takes a single raw image and generates a caption S encoded as a sequence of 1-of-k encoded words:

$$S = \{S_1, S_2, \ldots, S_n\},\quad S_i \in \mathbb{R}^{k},$$

where k is the size of the dictionary and n is the length of the caption.
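As a minimal illustration of the 1-of-k encoding described above, the sketch below builds a caption as a stack of one-hot vectors; the word indices are placeholders, and the vocabulary size 9945 is taken from the implementation details later in the paper:

```python
import numpy as np

def one_hot(index, k):
    """1-of-k encoding: a length-k vector with a single 1 at the word's index."""
    v = np.zeros(k)
    v[index] = 1.0
    return v

# A caption is then a sequence of n such vectors (n = 3 here, indices arbitrary).
caption_indices = [3, 17, 42]
S = np.stack([one_hot(i, k=9945) for i in caption_indices])
print(S.shape)  # (3, 9945)
```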
As shown in Fig. 2, the RNIC model proposed in this paper is implemented based on a two-layer LSTM, with the first LSTM as the attention model and the second LSTM as the language model. The dual attention module is used to generate a joint context vector that maximizes the visual and textual information needed to generate the caption. The semantic relevance retrospect module allows the model to better predict based on the generated content, which better simulates the human language system. The two modules make the model significantly more capable of handling long-term dependencies.

Fig. 2 The overall architecture of RNIC

Object-level encoder
To generate captions, the first step is to extract visual features from images. In this paper, the visual features V of an image are extracted by a pre-trained Faster R-CNN. The extractor generates L region vectors $V_i$; each region vector is a D-dimensional representation corresponding to a part of the image:

$$V = \{V_1, V_2, \ldots, V_L\},\quad V_i \in \mathbb{R}^{D}.$$

Compared to the conventional uniform meshing of CNN features, the object-level encoder focuses more on the objects in an image, which is closely related to the perception mechanism of the human visual system.
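The encoder output can be sketched as follows. The 2048-to-1024 projection matches the dimensions given later in the implementation details, while the number of regions (36) and the random weights are purely illustrative stand-ins for Faster R-CNN features:

```python
import numpy as np

def project_regions(V, W, b):
    """Linearly project L region vectors (L x D) into the decoder space (L x H)."""
    return V @ W + b

rng = np.random.default_rng(0)
L, D, H = 36, 2048, 1024               # 36 regions is an assumed detector setting
V_raw = rng.standard_normal((L, D))    # stand-in for Faster R-CNN region features
W = rng.standard_normal((D, H)) * 0.01
b = np.zeros(H)

V = project_regions(V_raw, W, b)       # V = {V_1, ..., V_L}, each V_i now in R^H
print(V.shape)  # (36, 1024)
```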

Retrospect decoder
Given a set of region image features V proposed by encoder, the goal for the retrospect decoder is to generate the caption S. The generated caption should not only capture the content information from the image but also be meaningful and coherent. Similar toAnderson et al. (2018), retrospect decoder contains two-layer LSTM, with the first LSTM as the attention model and the second LSTM as the language model.
The input vector to the attention LSTM at each time step consists of the previous output of the language LSTM, concatenated with the mean-pooled image feature $\bar{V} = \frac{1}{L}\sum_{i=1}^{L} V_i$ and the embedding of the previously generated word:

$$x_t^1 = [h_{t-1}^2;\ \bar{V};\ W_e S_{t-1}],$$

where $W_e \in \mathbb{R}^{E\times|Z|}$ represents the word embedding matrix and $S_{t-1}$ is the word output at time step t-1, represented by a one-hot vector.
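Assuming the input described above is a plain concatenation of the three parts (the formulation this paper says it follows from Anderson et al. 2018), the construction can be sketched as below; all dimensions and random tensors are illustrative:

```python
import numpy as np

def attention_lstm_input(h2_prev, V, W_e, s_prev):
    """x_t^1 = [previous language-LSTM output; mean-pooled image feature;
    embedding of the previous one-hot word]."""
    v_bar = V.mean(axis=0)        # mean-pooled image feature, shape (D,)
    w_emb = W_e @ s_prev          # embed the one-hot previous word, shape (E,)
    return np.concatenate([h2_prev, v_bar, w_emb])

M, D, E, Z = 1024, 1024, 1024, 9945
rng = np.random.default_rng(0)
h2_prev = rng.standard_normal(M)
V = rng.standard_normal((36, D))
W_e = rng.standard_normal((E, Z)) * 0.01
s_prev = np.zeros(Z)
s_prev[7] = 1.0                   # one-hot previous word (index 7 is arbitrary)

x1 = attention_lstm_input(h2_prev, V, W_e, s_prev)
print(x1.shape)  # (3072,)
```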
The input vector to the language LSTM at each time step consists of the output of the dual attention module concatenated with the output of the attention LSTM, given by:

$$x_t^2 = [ctx^{(t)};\ h_t^1],$$

where the joint context vector $ctx^{(t)}$ is generated by the proposed dual attention module. At each time step t, the conditional distribution over possible output words is given by:

$$p(S_t \mid S_{1:t-1}) = \mathrm{softmax}(W_p H_t + b_p),$$

where $W_p \in \mathbb{R}^{|Z|\times M}$ and $b_p \in \mathbb{R}^{|Z|}$ are parameters that need to be learned, and $H_t$ is generated by the semantic relevance retrospect module.

Dual attention module
Previous works have shown that visual attention alone can perform fairly well for localizing objects and aiding caption generation. However, as cross-domain research combining computer vision and natural language processing, relying on visual features alone is still not sufficient to generate high-quality captions; textual information is also crucial for improving model performance. To this end, we propose the dual attention module, which can simultaneously attend to the visual and textual modalities. The structure is illustrated in Fig. 3. As shown in Fig. 3, the dual attention module allows the model to learn to collaboratively attend to visual and textual features by generating a joint context vector $ctx^{(t)}$, where $V^{(t)}$ and $Tex^{(t)}$ represent the results of visual attention and textual attention, respectively.
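If the joint context vector is formed by concatenating the two attention results (one natural choice; the text does not pin down the exact combination operator), the module reduces to the following sketch:

```python
import numpy as np

def dual_attention_context(v_t, tex_t):
    """Join the visual and textual attention results into ctx^(t).
    Concatenation is an assumption here; summation or gating would also fit."""
    return np.concatenate([v_t, tex_t])

v_t = np.ones(1024)     # stand-in visual attention result V^(t)
tex_t = np.zeros(1024)  # stand-in textual attention result Tex^(t)
ctx = dual_attention_context(v_t, tex_t)
print(ctx.shape)  # (2048,)
```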

Visual attention
Visual attention aims to generate a vector by attending to certain parts of the input image. At time step t, given the output of the attention LSTM $h_t^1$, the visual context vector $V^{(t)}$ is generated by:

$$\alpha_t = \mathrm{softmax}\big(W_\alpha^{\top} \tanh(W_{V\alpha} V + W_{h\alpha} h_t^1)\big),\qquad V^{(t)} = \sum_{i=1}^{L} \alpha_{t,i} V_i,$$

where $W_\alpha \in \mathbb{R}^{H}$, $W_{V\alpha} \in \mathbb{R}^{H\times V}$, and $W_{h\alpha} \in \mathbb{R}^{H\times M}$ are the parameters to be learned.
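A NumPy sketch of this standard additive (soft) attention, with the parameter shapes taken from the text; the concrete dimensions and random values are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def visual_attention(V, h1, w_a, W_Va, W_ha):
    """Additive attention over L region vectors:
    score_i = w_a^T tanh(W_Va V_i + W_ha h1);  V^(t) = sum_i alpha_i V_i."""
    scores = np.tanh(V @ W_Va.T + W_ha @ h1) @ w_a   # shape (L,)
    alpha = softmax(scores)
    return alpha @ V, alpha

rng = np.random.default_rng(0)
L, Dv, M, H = 36, 1024, 1024, 512    # H is an assumed attention hidden size
V = rng.standard_normal((L, Dv))
h1 = rng.standard_normal(M)
w_a = rng.standard_normal(H)
W_Va = rng.standard_normal((H, Dv)) * 0.01
W_ha = rng.standard_normal((H, M)) * 0.01

v_t, alpha = visual_attention(V, h1, w_a, W_Va, W_ha)
print(v_t.shape, round(alpha.sum(), 6))  # (1024,) 1.0
```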

Textual attention
To make better use of the inherent properties of language, we propose the textual attention mechanism. To our knowledge, this is the first work exploring textual attention in image captioning. The textual attention mechanism can review all the generated words at each time step and extract important information to guide the prediction process. At time step t, given the output of the attention LSTM $h_t^1$, the textual context vector $Tex^{(t)}$ is generated by computing the similarity between $h_t^1$ and the embedding of each previously generated word, normalizing the similarities with a softmax, and taking the resulting weighted sum of the word embeddings.
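The text leaves the exact similarity function open; the sketch below assumes a bilinear similarity between $h_t^1$ and each generated word embedding, so the matrix W_b, like all dimensions here, is a hypothetical choice:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def textual_attention(E_gen, h1, W_b):
    """Attend over the embeddings of the already generated words:
    beta_i = softmax_i(e_i^T W_b h1);  Tex^(t) = sum_i beta_i e_i.
    The bilinear form W_b is an assumed similarity function."""
    scores = E_gen @ (W_b @ h1)    # shape (t-1,)
    beta = softmax(scores)
    return beta @ E_gen, beta

rng = np.random.default_rng(0)
t_minus_1, E, M = 5, 1024, 1024
E_gen = rng.standard_normal((t_minus_1, E))   # embeddings of 5 generated words
h1 = rng.standard_normal(M)
W_b = rng.standard_normal((E, M)) * 0.01

tex_t, beta = textual_attention(E_gen, h1, W_b)
print(tex_t.shape, round(beta.sum(), 6))  # (1024,) 1.0
```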

Semantic relevance retrospect module
State-of-the-art image captioning methods mostly use the hidden state alone to generate captions. In this way, the historical sequence information cannot be used well. Our semantic relevance retrospect (SRR) module models the dependencies between pairs of words at different time steps. The structure of the semantic relevance retrospect module is illustrated in Fig. 4.
As shown in Fig. 4, the SRR module passes the result of the textual attention mechanism through a multilayer perceptron before summing it with the hidden state to form the basis for the model's prediction. The role of the multilayer perceptron is twofold: first, to match the result of the textual attention mechanism to the hidden state dimension; second, to enable the model to further explore the semantic relatedness between words.
In this way, the probability of the output word at time step t is calculated as follows:

$$H_t = h_t^2 + \mathrm{MLP}(Tex^{(t)}),\qquad p(S_t \mid S_{1:t-1}) = \mathrm{softmax}(W_p H_t + b_p),$$

where $W_p \in \mathbb{R}^{|Z|\times M}$ and $b_p \in \mathbb{R}^{|Z|}$ are parameters that need to be learned. The SRR module uses the hidden state combined with the semantic correlation as the basis for prediction, which enables the decoder to better reason and predict based on the already generated content.
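Under the reading above (the hidden state plus an MLP applied to the textual attention result), prediction looks like the following sketch; the one-hidden-layer ReLU MLP and all dimensions are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def srr_predict(h2, tex, W1, b1, W2, b2, W_p, b_p):
    """H_t = h2 + MLP(Tex^(t));  p = softmax(W_p H_t + b_p)."""
    mlp_out = np.maximum(0.0, W1 @ tex + b1)   # ReLU hidden layer (assumed)
    mlp_out = W2 @ mlp_out + b2                # project to the hidden-state size M
    H = h2 + mlp_out
    return softmax(W_p @ H + b_p)

rng = np.random.default_rng(0)
M, T, Z = 1024, 1024, 9945   # hidden size, Tex size, vocabulary size
h2 = rng.standard_normal(M)
tex = rng.standard_normal(T)
W1 = rng.standard_normal((512, T)) * 0.01
b1 = np.zeros(512)
W2 = rng.standard_normal((M, 512)) * 0.01
b2 = np.zeros(M)
W_p = rng.standard_normal((Z, M)) * 0.01
b_p = np.zeros(Z)

p = srr_predict(h2, tex, W1, b1, W2, b2, W_p, b_p)
print(p.shape, round(p.sum(), 6))  # (9945,) 1.0
```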

Datasets and evaluation metrics
We evaluate our model on the MS-COCO dataset (Chen et al. 2015). The MS-COCO dataset contains 123,287 images, each labeled with 5 captions. We follow the splits provided by Karpathy and Fei-Fei (2015), where 5000 images are used for validation, 5000 for testing, and the rest for training. Following most image captioning research, we use several metrics, including BLEU (Papineni et al. 2002), METEOR (Banerjee and Lavie 2005), ROUGE-L (Lin 2004), and CIDEr (Vedantam et al. 2015), to evaluate the proposed method and compare it with other methods. For simplicity, B-n is used to denote the n-gram BLEU score, and M, R, and C are used to represent METEOR, ROUGE-L, and CIDEr, respectively. All of the above evaluation metrics evaluate the performance of the model by measuring the similarity between the generated and labeled sentences.

Implementation details
To represent image regions, we employ a Faster R-CNN (Ren et al. 2015) model pre-trained on ImageNet (Deng et al. 2009) and Visual Genome (Krishna et al. 2016). The dimension of the original vectors is 2048, and we project them to a new space with a dimension of 1024, which is also the hidden size of the LSTM in the decoder. To represent words, we drop the words that occur fewer than 5 times and end up with a vocabulary of 9945 words. We use one-hot vectors and linearly project them to a dimension of 1024. As for the training process, we first train the RNIC model under the cross-entropy (XE) loss for 25 epochs with the learning rate set to 5e-4, and then we optimize the CIDEr-D score with SCST (Rennie et al. 2017) for another 15 epochs with the learning rate set to 5e-5.
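The two-stage schedule can be summarized as a small lookup. The epoch counts and learning rates come from the text; representing them as a table with a helper function is purely illustrative:

```python
# Two-stage training schedule described above (values from the text).
SCHEDULE = [
    {"stage": "XE",   "objective": "cross-entropy",  "epochs": 25, "lr": 5e-4},
    {"stage": "SCST", "objective": "CIDEr-D (SCST)", "epochs": 15, "lr": 5e-5},
]

def lr_at_epoch(epoch):
    """Return the learning rate for a given 0-based global epoch."""
    start = 0
    for stage in SCHEDULE:
        if epoch < start + stage["epochs"]:
            return stage["lr"]
        start += stage["epochs"]
    raise ValueError("epoch beyond the 40-epoch schedule")

print(lr_at_epoch(0), lr_at_epoch(24), lr_at_epoch(25))  # 0.0005 0.0005 5e-05
```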

Experiment results
We report the performance of our model, as well as the compared models, on the MS-COCO Karpathy test split in Table 1. The compared models include Stack-VS Attention (Cheng et al. 2020), which proposes a visual-semantic attention-based multi-stage framework; GCN-LSTM (Yao et al. 2018), which explores visual relationships to boost image captioning; LBPF (Qin et al. 2019), which can embed previous visual information and look into the future; SGAE (Yang et al. 2019), which introduces auto-encoding scene graphs into its model; ORT (Herdade et al. 2019), which takes geometric information into account in the encoder phase; MAD+SAP, which demonstrates that selecting appropriate subsequent attributes to attend to is beneficial for image captioning models; AoANet, which extends the conventional attention mechanisms to determine the relevance between attention results and queries; ETA, which extends the transformer model to exploit complementary information of visual regions and semantic attributes simultaneously; and X-Transformer (Pan et al. 2020), which employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning.
As can be seen from Table 1, the RNIC model shows a significant improvement over the baseline, because the baseline mainly considers how to improve visual features and does not make sufficient use of textual information. Visual attention-based models rely only on the memory units of the LSTM to utilize the generated textual information, which is not satisfactory when the sentence is too long. Our model uses the textual attention mechanism to improve the input and prediction of the model, so that the model can learn to collaboratively attend to visual and textual features, and uses the semantic correlation between words for prediction. The use of the textual attention mechanism enables our model to make better predictions based on the generated textual information.

Ablation study and analysis
To verify the effects of the two modules, DA and SRR, the ablation experiments are designed as follows: (1) Baseline represents the model without both the DA and SRR modules; (2) DA represents removing the SRR module and keeping only the DA module; (3) SRR represents removing the DA module and keeping only the SRR module; (4) RNIC represents applying both the DA and SRR modules to the model at the same time. The experimental results are shown in Table 2. As can be seen from Table 2, both the DA module and the SRR module are important for improving model performance, and each module improves all metrics compared to the baseline. This proves that textual information is crucial for the model's predictions and indeed alleviates, to some extent, the problem that relying only on the memory units of the LSTM to use textual information is insufficient to generate high-quality captions.
To better show the specific differences in the captions predicted by the models in the ablation study, we visualize some of the generated captions on the COCO dataset in Fig. 5. From Fig. 5 we can see that, on average, our model is able to generate more accurate and descriptive captions.

Textual attention weight visualization and analysis
To better understand and illustrate our model, we visualize how the RNIC model makes inferences and predictions based on the words that have already been generated, as shown in Fig. 6. Taking the first case in Fig. 6 as an example, when it comes to predicting "water," "boat" plays a very important role in the prediction. The textual attention mechanism allows our model to trace back to the textual information most relevant to the prediction and to act on both the input and prediction aspects of the model, thus improving its performance.

Complexity and efficiency analysis
Compared to visual attention-based models, RNIC adds the computation of the textual attention mechanism. Because there is no dependency between visual attention and textual attention, the two can be computed in parallel. This allows the RNIC model to achieve better performance without reducing computational efficiency.

Fig. 6 Examples of captions ("A boat that is decorated with flags on the water," "A large white airplane parked at an airport," "A baseball player throwing a baseball") and textual attention weight visualizations generated by RNIC. The thicker line indicates a relatively larger weight, and the word to be predicted is highlighted in green (color figure online)

Conclusion
In this paper, we devise the RNIC model for image captioning. By introducing textual attention, the original visual attention-based model is extended to learn from both visual and textual information, maximizing the information needed for generating captions. Our model can better mimic the human language system by making predictions based on what has already been generated. Moreover, comprehensive comparisons with state-of-the-art methods and adequate ablation studies demonstrate the effectiveness of our framework. In future work, we intend to apply the textual attention in RNIC to video captioning. We will also explore how to incorporate the textual attention mechanism into the transformer framework.
Author Contributions Xiulong Yi proposed the method and conducted the experiments, analyzed the data, and wrote the manuscript. Rong Hua supervised the project and participated in manuscript revisions. You Fu, Dulei Zheng, and Zhiyu Wang provided critical reviews that helped improve the manuscript.

Declarations
Conflict of interest Authors Xiulong Yi, Rong Hua, You Fu, Dulei Zheng, and Zhiyu Wang declare that they have no conflict of interest.
Funding This study was funded by the National Key Research and Development Project (2017YFB0202002).

Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent Informed consent was obtained from all individual participants included in the study.