Multiple Context Learning Networks for Visual Question Answering

In recent years, several visual question answering (VQA) methods that emphasize the simultaneous understanding of the context of both the image and the question have been proposed. Despite their effectiveness, these methods fail to explore a more comprehensive and generalized context learning tactic. To address this issue, we propose a novel Multiple Context Learning Network (MCLN) to model multiple contexts for VQA. Three kinds of contexts are investigated: the visual context, the textual context, and a special visual-textual context that is ignored by previous methods. Accordingly, three corresponding context learning modules are proposed. These modules endow image and text representations with context-aware information based on a uniform context learning strategy, and together they form a multiple context learning (MCL) layer. MCL layers can be stacked in depth to describe high-level context information by associating the intra-modal contexts with the inter-modal context. On the VQA v2.0 dataset, the proposed model achieves 71.05% and 71.48% accuracy on the test-dev and test-std sets respectively, outperforming previous state-of-the-art methods. In addition, extensive ablation studies are carried out to examine the effectiveness of the proposed method.


Introduction
An artificial intelligence agent must be able to understand not just the semantics of words, but also the content of images. This capability is required by many multimodal tasks involving image and text modalities [1]-[5]. In this paper, we focus on visual question answering (VQA) [4]-[5], which aims to identify the correct answer for a given image-question pair from a set of candidate answers. Compared to other multimodal tasks, VQA is more difficult: it requires associating visual content in the image with the semantic meaning of the question, together with visual reasoning to arrive at the correct answer.
To process the multimodal features of image and question, most VQA methods adopt a combination of Convolutional Neural Networks (CNN) [6] and Recurrent Neural Networks (RNN) [7]. In the early stages, some works [8]-[9] utilized CNNs to extract visual features of the global image content and RNNs to encode the global language content. However, the extracted global features are not fine-grained and contain noisy information. To address this issue, and with the development of attention mechanisms in deep learning, VQA approaches that employ attention mechanisms to further process multimodal features have become increasingly popular. Such methods can locate the objects related to the answer in an image or the keywords in a question. For example, question-guided visual attention on image regions was first proposed by [10]. Some works have also shown that learning textual attention over question words can improve VQA performance [11]. Subsequently, a large variety of attention-based variants [12]-[14], including co-attention and stacked attention, have been proposed.
Although the aforementioned attention-based approaches have dramatically improved VQA performance, challenges remain, one of which is multimodal context learning. Since context reflects the high-order interactive information between entities (image objects or question words) and helps to distinguish targets from other entities, several attention-based techniques have recently begun to emphasize multimodal context learning. One strategy is to model the image as a relational contextualized graph. For example, ReGAT [15] and v-AGCN [16] build a contextualized relation graph over the image and use a graph attention network [17] to generate relation-aware visual representations that capture visual context information. These methods are capable of performing complex relational inference, but the contextualized graph is built upon relations with a language prior, which does not deliver complete contextual information for the image modality. In addition, both models consider context only for the image modality. Against this disadvantage, some works have more recently attempted to learn extra textual context. DFAF [18] uses inter-modality attention flow to complement the intra-modal information and intra-modality attention flow to capture the intra-modal contexts. Two modular co-attention models, MCAN [19] and MEDAN [20], were introduced to the VQA task to capture more complete context information based on the encoder-decoder framework. However, these methods only learn the intra-modal context within each modality. Despite their effectiveness, we find that a more comprehensive and generalized context learning tactic is rarely explored. On the one hand, a special inter-modal context, the visual-textual context, which expresses context dependence across the two modalities, is almost entirely ignored by previous methods.
On the other hand, how to devise a consistent context learning scheme that models multiple contexts and integrates multi-context learning into a unified VQA framework still leaves room for further research. From these insights, multiple contexts are explored in this paper for VQA, including two intra-modal contexts (i.e., the visual context in the image modality and the textual context in the text modality) and an inter-modal context (i.e., the visual-textual context between the modalities). Meanwhile, a novel Multiple Context Learning Network (MCLN) is proposed for the corresponding context learning. Figure 1 shows the proposed multiple context learning strategy. The visual context learning module (VCL), textual context learning module (TCL) and visual-textual context learning module (VTCL) are uniformly modeled based on key-query attention. The two intra-modal contexts are learned through the VCL and TCL, respectively. Then, the processed image features and question features are concatenated to form the visual-textual features, which are fed into the VTCL to extract the inter-modal context information. By modular composition of the three modules, the multiple context learning (MCL) layer is constructed. By cascading MCL layers in depth, the interaction between intra- and inter-modal contexts is reached handily, allowing deeper and more complex context learning. Finally, the deep Multiple Context Learning Network (MCLN), which consists of cascaded MCL layers, is proposed. In addition, we introduce the BERT [21] model to further enhance the context learning ability of MCLN for the text modality.
The main contributions of this study can be summarized as follows.
(1) We explore multiple contexts for the VQA task and propose a uniform context learning strategy that performs multi-context learning over the detected objects in an image and the words in a question. The proposed MCLN is able to capture both intra- and inter-modal context information.
(2) An advanced text encoder based on pre-trained BERT [21] is also introduced to further facilitate textual context learning, which boosts the performance of MCLN.
(3) Evaluation results on the benchmark VQA v2.0 dataset [5] demonstrate that MCLN achieves state-of-the-art performance. In addition, comprehensive ablation studies are conducted to quantitatively and qualitatively prove the roles of different context learning modules in the proposed model, verifying the effectiveness of the proposed MCLN architecture.

Multiple Context Learning Networks for VQA
As common practice, a typical formulation of VQA is to identify the correct answer for a given image-question pair from a set of candidate answers: â = arg max_{a∈A} p(a | I, Q; θ), where the model predicts the correct answer â from the candidate answers A for a given image I and question Q, and θ denotes the learned model parameters. Without loss of generality, the proposed method follows this convention. Figure 2 illustrates the overall pipeline of the proposed Multiple Context Learning Network (MCLN). The proposed MCLN consists of a series of subnetworks: a) Image and Question Representation, which includes an image encoder and a text encoder; the image encoder detects objects in an image and extracts the object features, while the text encoder embeds the question words and encodes the word features; b) Multiple Context Learning, which performs multi-context learning and is composed of cascaded multiple context learning (MCL) layers; and c) Multimodal Fusion and Answer Prediction, which fuses the image and question representations to predict the correct answer.

Image encoder
Inspired by humans using bottom-up attention to process visual information [13], we extract the features of salient objects instead of the global convolutional features of the image. The proposed image encoder encapsulates three main steps: (i) a Faster R-CNN [22] model in conjunction with ResNet-101 [23], pre-trained on the Visual Genome dataset [24], selects the K most salient objects within a given image I, thereby achieving bottom-up attention; (ii) for each selected object o_i, r_i is defined as the mean-pooled convolutional feature of this object region, so the dimension of each object feature vector is 2048; and (iii) a linear transformation layer projects the object feature r_i into a 768-dimensional vector, where the weight and bias are the parameters of the linear layer. Finally, image I is represented as the visual object feature set V ∈ R^{K×768}.
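As a concrete illustration, step (iii) above can be sketched as follows. The shapes follow the paper (2048-d mean-pooled region features projected to the 768-d model dimension); the region features and projection weights themselves are random placeholders, not the learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes follow the paper: K detected objects, each a 2048-d mean-pooled
# Faster R-CNN region feature r_i, projected to the 768-d model dimension.
# The features and weights below are random placeholders.
K, D_IN, D_MODEL = 36, 2048, 768

R = rng.standard_normal((K, D_IN))               # object features from the detector
W = rng.standard_normal((D_IN, D_MODEL)) * 0.02  # learned projection weight (placeholder)
b = np.zeros(D_MODEL)                            # learned projection bias (placeholder)

V = R @ W + b                                    # visual object feature set V
print(V.shape)                                   # (36, 768)
```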

Text encoder
Given a question Q, the text encoder first tokenizes it into words and trims the question to a maximum of L words. The l-th word in the question is then embedded into a 300-dimensional vector e_l ∈ R^300 using pre-trained GloVe word embeddings [25]. Questions shorter than L words are padded at the end with zero vectors. Thus, Q is initialized as a word embedding sequence E = (e_1, ..., e_L). Here, a one-layer, 768-dimensional long short-term memory network (LSTM) [7] is utilized as the text feature extractor: it scans the word embedding sequence E from e_1 to e_L, feeding the current e_l into its unit to iteratively update the hidden state t_l.
In contrast to [13], which only uses the final hidden state as the question feature, the hidden states of all words are preserved and regarded as word features. Finally, the question Q is represented as the word feature set T ∈ R^{L×768}. In addition, motivated by the success of BERT in multiple natural language tasks [21], we also introduce the BERT model to replace the LSTM and fine-tune it to enhance the context learning of textual features.
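The tokenize-trim-embed-pad pipeline can be sketched as below. Only the shapes (at most L = 14 words, 300-d GloVe-style vectors) follow the paper; the tiny vocabulary and embedding values are hypothetical stand-ins for a real pre-trained GloVe table, and the downstream LSTM/BERT encoder is omitted:

```python
import numpy as np

L_MAX, EMB = 14, 300  # paper: at most L = 14 words, 300-d GloVe vectors

# Hypothetical toy lookup table standing in for pre-trained GloVe embeddings.
glove = {w: np.full(EMB, i + 1.0) for i, w in enumerate(
    ["how", "many", "people", "are", "on", "the", "board"])}

def embed_question(question):
    """Tokenize, trim to L_MAX words, embed each word, zero-pad at the end."""
    tokens = question.lower().split()[:L_MAX]
    E = np.zeros((L_MAX, EMB))
    for l, tok in enumerate(tokens):
        E[l] = glove.get(tok, np.zeros(EMB))  # out-of-vocabulary -> zero vector
    return E

E = embed_question("How many people are on the board")
print(E.shape)  # (14, 300)
```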

Multiple context learning layers
Given that the goal of the proposed model is to perform multi-context learning with a uniform strategy, how to capture the context of an entity within an entity set (i.e., the image object feature set V or the question word feature set T) is essential. Such a method should be capable of modeling the context dependence of all entities across the whole entity set, which requires that the contextual path dependency distance between any two entities is 1 and is not affected by the data pattern. Bearing these requirements in mind, this paper proposes a unified multi-context learning strategy based on the key-query attention mechanism [26]. Taking visual context learning as an example, Figure 3 shows the design of the visual context learning module (VCL); its implementation is discussed in detail below. Let V ∈ R^{K×768} denote the object feature matrix, whose i-th row represents the feature of the i-th object. To learn the visual context features, the matrix V is converted into a key matrix K_V, a query matrix Q_V and a value matrix V_V by linear transformations: K_V = V W_K, Q_V = V W_Q, V_V = V W_V, where W_K, W_Q and W_V are the linear transformation parameters that calculate the key, query and value matrices. Then, the scaled dot product between the query and key matrices yields the attention weight matrix A_V = softmax(Q_V K_V^T / √d), where d is a normalization constant. With this dot-product formulation, the contextual path dependency distance between any two objects is 1.
A_V reflects the context dependence of all objects and stores the attention weights between them. Its i-th row, A_{V,i}, represents the context dependence of the i-th object in the image, so the visual context feature for the i-th object can be obtained as a weighted combination of A_{V,i} and all object features. In matrix form, the visual context features of all objects are F_V = A_V V_V. In addition, multi-head attention based on key-query attention is applied to V.
MultiHead(V) = (head_1 || ... || head_H) W_O, where head_h is the h-th key-query attention head, W_K^h, W_Q^h and W_V^h are the parameter matrices of head_h, W_O is the projection matrix over all heads, and || denotes the concatenation of all heads. d_h is the dimensionality of the output features from each head, usually set so that d_h = 768/H. Multi-head attention allows the model to jointly attend to context information from different representation subspaces, improving the representation capacity of the features.
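The key-query attention underlying each head can be sketched as follows. The entity set, dimensions, and weights are arbitrary placeholders; the point is that the attention weight matrix is a full pairwise map, so every entity reaches every other in one step:

```python
import numpy as np

def key_query_attention(X, W_K, W_Q, W_V):
    """Single-head key-query attention over an entity set X (one entity per row).
    Every entity attends to every other entity in a single step, so the
    contextual path dependency distance between any two entities is 1."""
    Km, Qm, Vm = X @ W_K, X @ W_Q, X @ W_V
    d = Km.shape[-1]
    scores = Qm @ Km.T / np.sqrt(d)                  # pairwise context dependence
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ Vm, A                                 # context features F, weights A

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                      # 5 entities, 8-d features
W_K, W_Q, W_V = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
F, A = key_query_attention(X, W_K, W_Q, W_V)
print(F.shape, A.shape)  # (5, 8) (5, 5)
```

Each row of A is a probability distribution over all entities, matching the row-wise reading of A_{V,i} above.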
After acquiring the visual context features of all objects with multi-head attention, a residual connection followed by layer normalization is applied to integrate the visual context into the object features.
where LN(·) denotes layer normalization. To further adjust the object representations, a position-wise feed-forward network (FFN) transforms the object features V with two fully-connected layers: FFN(V) = ReLU(V W_1 + b_1) W_2 + b_2, where W_1, b_1, W_2 and b_2 are the parameters of the FFN, whose hidden and output dimensions are 768×4 and 768 respectively, and ReLU is the rectified linear unit. The residual connection and layer normalization are also applied after the FFN to facilitate optimization. Through the above procedure, visual contexts representing the high-order interactions and context dependence among all objects in an image can be captured. This procedure is abbreviated as the visual context learning module (VCL). The textual context learning module (TCL) works in the same manner on the word features T, but with a separate set of learned parameters, so its details are omitted. Since the contextual path dependency distance between all entities is 1 in the proposed key-query-attention-based context learning method, the inter-modal visual-textual context can be modeled simply by feeding the concatenated visual-textual matrix M = (V || T) ∈ R^{(K+L)×768} into the visual-textual context learning module (VTCL) with the same context learning mechanism. Then, the three context learning modules are combined to form a multiple context learning (MCL) layer. Finally, MCL layers are stacked in depth, with the output of each layer serving as the input of the next. Within an MCL layer, the intra-modal contexts provide complementary information for inter-modal context learning, while the inter-modal context information facilitates the intra-modal context learning of the next layer. In this way, deeper and more high-level context information is modeled by associating the intra-modal contexts with the inter-modal context, allowing a more comprehensive understanding of the image and text contents.
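A minimal sketch of one MCL layer follows, composing the three context learning modules (VCL, TCL, VTCL) and stacking the layer three times. For brevity it uses single-head attention, a shrunken feature dimension, and reuses the same module parameters at every stacked layer, whereas the paper uses multi-head attention, 768-d features, and separate parameters per layer; all weights here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # model dimension (768 in the paper; shrunk for this sketch)

def layer_norm(X, eps=1e-6):
    return (X - X.mean(-1, keepdims=True)) / (X.std(-1, keepdims=True) + eps)

def make_context_module(d):
    """One context learning module: key-query attention and an FFN, each
    followed by a residual connection and layer normalization. VCL, TCL
    and VTCL share this structure but use separate parameters."""
    p = {k: rng.standard_normal((d, d)) * 0.1 for k in ("WK", "WQ", "WV")}
    p["W1"] = rng.standard_normal((d, 4 * d)) * 0.1  # FFN hidden dim = 4d
    p["W2"] = rng.standard_normal((4 * d, d)) * 0.1  # FFN output dim = d
    def module(X):
        Km, Qm, Vm = X @ p["WK"], X @ p["WQ"], X @ p["WV"]
        S = Qm @ Km.T / np.sqrt(d)
        A = np.exp(S - S.max(-1, keepdims=True))
        A /= A.sum(-1, keepdims=True)                # row-wise softmax
        X = layer_norm(X + A @ Vm)                   # context + residual + LN
        F = np.maximum(X @ p["W1"], 0) @ p["W2"]     # ReLU FFN
        return layer_norm(X + F)
    return module

vcl, tcl, vtcl = (make_context_module(D) for _ in range(3))

def mcl_layer(V, T):
    V, T = vcl(V), tcl(T)                  # intra-modal contexts
    M = vtcl(np.concatenate([V, T]))       # inter-modal visual-textual context
    return M[: len(V)], M[len(V):]         # split back into image / text parts

V, T = rng.standard_normal((6, D)), rng.standard_normal((4, D))
for _ in range(3):                         # stack N = 3 MCL layers
    V, T = mcl_layer(V, T)
print(V.shape, T.shape)  # (6, 16) (4, 16)
```

Note how the VTCL consumes the concatenated matrix (V || T) and its output is split back, so the inter-modal context flows into both modalities before the next layer.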

Multimodal fusion and answer prediction
After N MCL layers, the output image object features V^N and question word features T^N contain not only the intra-modal contexts but also the inter-modal context dependencies. To distinguish important entities from the context information and fuse the multimodal features, an attention model with an FFN (whose hidden and output dimensions are 768 and 1) is designed for V^N and T^N to obtain the attended image feature v and question feature q. Taking the image object features V^N as an example, the attended image feature is calculated as v = Σ_i α_i V_i^N, where α_i is the learned attention weight for the i-th object. The learned word attention weights α_l and the attended question feature q are obtained analogously with an independent attention model.
After obtaining the image feature v and question feature q, a linear transformation is applied to each, and addition fusion produces the joint representation z. Finally, z is fed into a fully-connected layer followed by a sigmoid function to generate the answer vector p.
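The fusion and prediction stage can be sketched as below. For brevity the attention scorer is a single linear layer rather than the paper's two-layer FFN, and the dimensions and weights are placeholders (the paper uses 768-d features and |A| = 3129 answers):

```python
import numpy as np

rng = np.random.default_rng(2)
D, N_ANS = 16, 10  # shrunk dims; the paper uses 768 and |A| = 3129

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_pool(X, w_score):
    """Score each entity, softmax over the set, return the weighted sum.
    (The paper scores with a small two-layer FFN; a single linear scorer
    is used here for brevity.)"""
    alpha = softmax(X @ w_score)  # one attention weight per object/word
    return alpha @ X

V_N = rng.standard_normal((6, D))  # final image object features
T_N = rng.standard_normal((4, D))  # final question word features
v = attend_pool(V_N, rng.standard_normal(D))
q = attend_pool(T_N, rng.standard_normal(D))

W_v, W_q = rng.standard_normal((D, D)), rng.standard_normal((D, D))
z = v @ W_v + q @ W_q                    # addition fusion
W_a = rng.standard_normal((D, N_ANS))
p = 1.0 / (1.0 + np.exp(-(z @ W_a)))     # sigmoid answer scores
print(p.shape)  # (10,)
```

The sigmoid (rather than a softmax) matches the multi-label view of VQA answers used by the BCE loss described later in the paper.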

Datasets
The VQA v2.0 dataset [5] contains 1.1M questions asked by humans, and 10 answers are collected for each image-question pair from human annotators. The answer with the highest number of occurrences is regarded as the correct answer. The dataset is divided into three parts: a training set containing 80 thousand images and 444 thousand questions, a validation set containing 40 thousand images and 214 thousand questions, and a test set containing 80 thousand images and 448 thousand questions. Additionally, based on the answer category, all questions are divided into three types: Yes/No, Number and Other. Following the official recommendation, we use the soft accuracy Accuracy(a) = min(1, (number of humans that provided answer a) / 3) as the evaluation metric for answer quality.
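The soft accuracy above can be computed as follows; this is the commonly used form of the official metric (the official evaluation additionally averages over subsets of 9 annotators), and the example annotations are made up:

```python
def vqa_accuracy(predicted, human_answers):
    """Soft VQA accuracy: an answer counts as fully correct when at least
    3 of the 10 annotators gave it, and partially correct otherwise."""
    return min(human_answers.count(predicted) / 3.0, 1.0)

annotations = ["yes"] * 8 + ["no"] * 2   # 10 annotator answers for one question
print(vqa_accuracy("yes", annotations))  # 1.0
print(round(vqa_accuracy("no", annotations), 3))  # 0.667
```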

Implementation Details
The proposed MCLN is implemented in PyTorch and all experiments are conducted on a workstation with an NVIDIA 2080 Ti GPU. The hyper-parameters of the MCLN model used in the experiments are as follows. For the pre-trained Faster R-CNN, a confidence threshold is set on the probabilities of detected objects, yielding a dynamic number of objects K ∈ [10, 100]. The maximum length of tokenized words is L = 14. For textual features, we use either an LSTM with 300-dimensional GloVe word embeddings or a pre-trained BERT [21] model with 768-dimensional embeddings. In all context learning modules, the number of heads H is set to 12, so the latent dimensionality for each head is d_h = 768/12 = 64. Moreover, the number of MCL layers is N ∈ {1, 2, 3, 4}. In this work, answers that occur fewer than 8 times in the training and validation sets are discarded, which produces |A| = 3129 candidate answers. Our model is optimized by the Adam optimizer [27] with a batch size of 64. For the learning rate lr, a warm-up strategy is employed. Specifically, the initial learning rate is 2.5e-5 and grows by 2.5e-5 at each epoch until it reaches 1e-4 at epoch 4. After 10 epochs, the learning rate is multiplied by 1/5 every epoch up to 12 epochs. Since there may be multiple correct answers for a question in the VQA v2.0 dataset, we use the binary cross-entropy (BCE) loss to optimize MCLN: L = −Σ_i [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ], where y_i is the provided soft score and p_i is the score predicted by the MCLN model, corresponding to the i-th element of the answer vector p.
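The warm-up / plateau / decay schedule described above can be sketched as a small function of the (1-indexed) epoch; the function name is ours, the values follow the text:

```python
def learning_rate(epoch, base=2.5e-5):
    """Warm-up / plateau / decay schedule described above (1-indexed epochs)."""
    if epoch <= 4:                       # warm-up: grow by 2.5e-5 per epoch
        return base * epoch
    if epoch <= 10:                      # plateau at the peak of 1e-4
        return 1e-4
    return 1e-4 * (0.2 ** (epoch - 10))  # decay by 1/5 per epoch up to epoch 12

for ep in (1, 4, 10, 11, 12):
    print(ep, learning_rate(ep))
```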

Ablation Studies
On the VQA v2.0 validation set, the proposed MCLN is evaluated from three aspects: 1) the effectiveness of the three context learning modules in the proposed MCLN architecture; 2) the impact of the number of MCL layers on the performance of MCLN; 3) the effect of BERT on MCLN.

Effect of the Context learning module
As mentioned, we design three modules to perform the corresponding context learning. To demonstrate their effectiveness, we use the MCLN model with LSTM and quantitatively ablate the context modules. The results of the different ablated versions of the model are reported in Table 1. Adding any module helps to improve performance, while ablating any module hurts it, and the best performance occurs only when the three modules are combined. The result for VTCL is particularly noteworthy. Model 4, with only the VTCL module, obtains 62.07% overall accuracy, outperforming model 1 by more than 7.47% and also surpassing models 2 and 3. Furthermore, a dramatic drop of 9.92% emerges when the VTCL module is ablated in model 7 compared with model 8, validating that the visual-textual context ignored by previous methods particularly matters for VQA. From these results, every context learning module plays a significant role in improving VQA performance, proving the effectiveness of our context learning modules.

Effect of MCL layers
Next, we stack the full-module MCL layer in depth to evaluate the effect of the number of MCL layers. Figure 4 shows the performance of the MCLN models with different numbers of MCL layers N ∈ {1, 2, 3, 4}.
Regarding the performance, we observe two phenomena on the overall accuracy: 1) as N increases, the performance of the MCLN models steadily improves and finally saturates at N = 3; 2) the one-layer model does not perform as well as the deeper models: there is a considerable improvement from N = 1 to N = 3, but a slight decrease at N = 4. Deeper MCL layers capture more complex context information, enabling a more comprehensive understanding of the image and question contents, which improves the performance. It is worth asking why the performance starts to degrade with four layers. As the number of layers grows, the number of parameters rises as well. As a result, the model with four MCL layers is harder to optimize and runs a larger risk of over-fitting the training set, thereby harming the performance.

Effect of BERT model
Since BERT [21] is trained on a large text corpus and models textual context by stacking 12 Transformer blocks, it provides better generalization and representation for textual context features. A contextualized text encoder based on BERT is therefore introduced to enhance textual context learning, replacing the LSTM when processing the question. The results of fine-tuning the BERT weights with different learning rates are shown in Table 2. With one MCL layer, using BERT with a learning rate of lr × 0.001 (N=1, BERT) yields 65.21% overall accuracy, and increasing the learning rate of BERT achieves the best performance at lr × 0.1. With this learning rate, increasing the number of MCL layers improves the performance, which again saturates at three MCL layers, indicating that BERT is compatible with MCLN and conducive to enhancing textual context learning.

Visualize Analysis
We assume that, after the object and word features are processed by multiple MCL layers, the keywords in a question and the image objects relevant to the answer can be well distinguished by the multimodal fusion and answer prediction networks according to the multiple context information. Thus, to intuitively illustrate the effectiveness of the MCL layers, we selected three models and visualized the attention weights learned by Eq. (13). Due to the lack of MCL, MCLN-w/o is unable to accurately locate keywords and relevant objects according to the context information, resulting in wrong answers. For example, MCLN-w/o mainly pays attention to "people" but ignores other keywords (e.g., "many" and "board"), which means that the context dependency of "people on board" cannot be captured. In addition, compared with MCLN-1, MCLN-3 usually learns more reasonable attention weights. As shown in the last two questions, MCLN-3 mainly attends to the keywords "people", "on" and "board" in the second question, and "person", "white" and "holding" in the third question. In other words, deep-level textual contexts like "people on board" and "white person holding" are captured by MCLN-3.

Comparative Experiments with State-of-the-art Models
The proposed model is compared with several state-of-the-art (SOTA) models to verify its advantages. The compared methods include three models without context learning (BUTD [13], MFH [28], Counter [29]), two models with visual context learning (ReGAT [15], v-AGCN [16]) and three models that only learn intra-modal contexts (DFAF [18], MCAN [19], MEDAN [20]). Table 3 shows the comparison results on the VQA v2.0 test-dev and test-std sets. The proposed MCLN obtains higher performance than the state-of-the-art VQA methods. Compared with the three methods that ignore context learning, MCLN-LSTM improves overall accuracy on the test-dev set by 4.94%, 1.50% and 2.17%, respectively. ReGAT [15] explores two types of visual object relations on the image and models only the visual context in the image modality, combining various relation learning models with the Counter model. By comparison, the single MCLN-LSTM not only surpasses ReGAT on test-std but also has an advantage in the number of models. Compared with DFAF [18], which separately learns the inter-modal attention features of vision and text, the proposed MCLN-LSTM captures intra- and inter-modal interactions simultaneously for vision and text features through the visual-textual context learning module. As shown in Table 3, MCLN-LSTM is 0.04 and 0.29 points higher than DFAF on test-dev and test-std. Using the encoder-decoder architecture, the recent MCAN [19] and MEDAN [20] achieve state-of-the-art performance. Although the overall accuracy of the proposed MCLN-LSTM model is lower than that of MCAN and MEDAN, the advantage of the MCLN model is still evident: when BERT is introduced as the text encoder, the MCLN-BERT model outperforms both on overall accuracy. This suggests that MCLN is more general and can naturally incorporate advanced context learning methods.
Conclusion

This paper presents the Multiple Context Learning Network (MCLN), a novel framework that models multiple contexts for visual question answering. MCLN exploits three types of contexts, the visual and textual contexts within each modality and the visual-textual context across the modalities, to learn context-aware representations through the corresponding learning modules. The core idea of the context learning module is to establish the contextual dependency of all entities in an entity set using the key-query attention mechanism. The proposed approach is simple but extremely effective. MCLN learns complex contexts and improves VQA performance by composing the multiple context learning modules into an MCL layer and stacking such layers in depth. Furthermore, the experimental results show that the BERT model is compatible with the proposed MCLN. Extensive ablation studies confirm the effectiveness of the proposed method.