Visual question answering (VQA) unifies multiple modern techniques, such as natural language processing (NLP) and computer vision, to advance research and obtain better results in both fields. Computer vision, or machine vision, is concerned with collecting, processing, and analyzing the images used in a system [7] [8]. In a nutshell, it seeks to teach machines how to see. NLP, on the other hand, is a field focused on facilitating natural language exchanges between computers and people, which includes teaching machines to read. Artificial intelligence encompasses both NLP and computer vision, and both use machine learning to achieve their goals [8].
According to the survey of recent trends in visual question answering, the system is given an image and a question in natural language as inputs. The relationship between the image and the question is then processed thoroughly, and the system analyzes potential answers, which may be of different types, such as multiple choice (A/B/C) or yes/no questions. Fill-in-the-blank is a similar task in which one or more missing words must be supplied in a statement describing the scene; such statements are essentially declarative forms of the corresponding questions. Unlike traditional methods, where the questions are fixed in advance, VQA must answer arbitrary questions posed only at execution time, which sets it apart from that category. Visual question answering may look simple, but it is highly complex: unlike image captioning, the information needed to answer the question is not necessarily present in the given image. The required information can be anything, from common-sense facts to complex relations among visual elements. This makes visual question answering an AI-complete problem, as it requires multimodal information rather than knowledge from a single field. It should be noted that image captioning might theoretically be used to evaluate image understanding just as effectively. In practice, however, VQA has the advantage of simpler evaluation criteria, since answers are usually limited to a few words [9]. Longer real-world image captions are more difficult to match against predicted captions, and although sophisticated assessment measures have been investigated, this remains an unresolved research topic.
Since the emergence of the VQA dataset towards the end of 2014, there has been continuous active research on the topic. According to Singh et al. (2019) [10], VQA methods are categorized into joint embedding methods, attention mechanisms, and compositional models. This study categorizes VQA approaches into joint embedding, attention-based, compositional, external-knowledge, and graph-based approaches. In joint embedding approaches, text features can be extracted using bag-of-words (BOW) or long short-term memory (LSTM) models (Ben-Younes et al. 2017; Fukui et al. 2016; Shih et al. 2016) [11–13], while a convolutional neural network (CNN) is used to extract image features. The respective features are then combined into a common feature space using either concatenation or element-wise multiplication. Finally, the combined feature vector is passed to a classifier that predicts an answer to the input question. Joint embedding methods attend to the entire image, which makes it difficult to capture question-specific semantic information. Much of the early research in VQA concentrates on joint embedding approaches, and they are considered common practice in the vision and language research communities (Ben-Younes et al. 2017; Fukui et al. 2016; Shih et al. 2016). Although these approaches handle open-ended questions and multiple-choice answers, they can only produce answers observed during training.
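To make this pipeline concrete, the following is a minimal sketch of a joint embedding model, assuming PyTorch; the class name, layer sizes, and the use of a pooled ResNet-style image vector are illustrative assumptions rather than the exact configurations of the cited works.

```python
# Minimal joint-embedding VQA sketch (PyTorch assumed; names and
# dimensions are illustrative, not taken from the cited papers).
import torch
import torch.nn as nn

class JointEmbeddingVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300,
                 hidden_dim=1024, img_feat_dim=2048):
        super().__init__()
        # Question branch: word embeddings followed by an LSTM encoder.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project pre-extracted CNN image features into the question space.
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Classifier over a fixed answer vocabulary.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, question_tokens, image_features):
        # question_tokens: (batch, seq_len) token ids
        # image_features: (batch, img_feat_dim) pooled CNN features
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q = h_n[-1]                          # (batch, hidden_dim) question code
        v = torch.relu(self.img_proj(image_features))
        fused = q * v                        # element-wise multiplication fusion
        return self.classifier(fused)        # logits over candidate answers
```

The element-wise product in the forward pass is one of the fusion choices mentioned above; replacing it with concatenation followed by a linear layer would give the concatenation variant.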
Attention-based approaches improve on this by considering only part of the input space (Wang et al. [14]; Shah et al. [15]). Global image features may confuse a VQA system when a question concerns only a specific region, so these methods focus on the salient part of the image. For instance, given the question 'what size is the book', the salient part of the image contains the book; likewise, 'size' and 'book' are the most relevant words in the sentence. The attention mechanism is guided by an algorithm that represents feature vectors for each image region at a more local level; the regions are then ranked by their similarity to the features of the question asked (Kallooriyakath et al. [16]). Global image features (such as the last hidden layer of a CNN) and global text features (such as bag-of-words) may not be capable of addressing region-specific questions. Attention-based VQA methods have produced more promising results on benchmark datasets, but they still struggle with complex questions that involve reasoning and counting.
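The following is a minimal sketch of question-guided attention over pre-extracted region features, assuming PyTorch; the additive scoring function and all names and dimensions are illustrative assumptions rather than the mechanisms of the cited papers.

```python
# Sketch of question-guided attention over image regions (PyTorch assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttention(nn.Module):
    def __init__(self, q_dim=1024, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.r_proj = nn.Linear(region_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, question_vec, region_feats):
        # question_vec: (batch, q_dim) encoded question
        # region_feats: (batch, num_regions, region_dim) per-region CNN features
        q = self.q_proj(question_vec).unsqueeze(1)           # (batch, 1, hidden)
        r = self.r_proj(region_feats)                        # (batch, R, hidden)
        scores = self.score(torch.tanh(q + r)).squeeze(-1)   # (batch, R)
        weights = F.softmax(scores, dim=-1)                  # region relevance
        attended = (weights.unsqueeze(-1) * region_feats).sum(dim=1)
        return attended, weights                             # weighted image summary
```

The softmax weights play the role of the region ranking described above, and the weighted sum replaces the single global image vector used in joint embedding models.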
In compositional models, questions are answered through a series of reasoning steps. For instance, a question like 'what is beside the cup' entails finding the cup and naming the object beside it. These approaches are mostly restricted to visual reasoning. Two compositional systems have been proposed to solve VQA as a series of sub-steps: the first framework is the Neural Module Network (NMN), and the second is Recurrent Answering Units (RAU). The NMN structure relies on an external question parser to identify the sub-tasks in the question, whereas RAU is trained end-to-end and the sub-tasks can be learned implicitly (Andreas et al. [17]; Noh et al. [18]).
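The toy sketch below illustrates only the modular idea: small networks for sub-tasks are chained according to a layout derived from the question. The module names and scoring functions are hypothetical simplifications, not the modules of Andreas et al. [17].

```python
# Toy illustration of NMN-style composition (PyTorch assumed; hypothetical modules).
import torch
import torch.nn as nn

class FindModule(nn.Module):
    """Produces an attention distribution over regions for a target concept."""
    def __init__(self, region_dim=2048, concept_dim=300):
        super().__init__()
        self.proj = nn.Linear(region_dim + concept_dim, 1)

    def forward(self, region_feats, concept_vec):
        # region_feats: (R, region_dim), concept_vec: (concept_dim,)
        expanded = concept_vec.expand(region_feats.size(0), -1)
        scores = self.proj(torch.cat([region_feats, expanded], dim=-1)).squeeze(-1)
        return torch.softmax(scores, dim=-1)

class DescribeModule(nn.Module):
    """Maps an attended region summary to answer logits."""
    def __init__(self, region_dim=2048, num_answers=1000):
        super().__init__()
        self.classifier = nn.Linear(region_dim, num_answers)

    def forward(self, region_feats, attention):
        return self.classifier((attention.unsqueeze(-1) * region_feats).sum(0))

# Example layout for 'what is beside the cup':
regions = torch.randn(36, 2048)          # e.g., 36 detected region features
cup_vec = torch.randn(300)               # word embedding of 'cup'
att = FindModule()(regions, cup_vec)     # locate the cup
logits = DescribeModule()(regions, att)  # a full model would insert a
                                         # relate/transform step between these
```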
More recently, the above-mentioned VQA approaches have been shown to underperform on more complex questions such as 'why is the baby smiling' and 'can we get water here'. These questions require common-sense reasoning and knowledge about the spatial relationships among objects. Consequently, there is a need for robust VQA systems capable of answering more natural questions. Wu et al. [19] proposed an external knowledge base VQA approach, which is useful when common sense or additional background knowledge is required to answer a question correctly. The advantage of this method is that it answers more general questions and exposes the reasoning behind an answer through the supporting facts generated in the process. However, the lack of more precise semantic or visual attention applied to the question and image before querying the knowledge bases is a disadvantage.
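As a highly simplified illustration of this idea, the sketch below retrieves supporting facts for visual concepts detected in the image and appends them to the question before answering. The in-memory dictionary and helper functions are hypothetical stand-ins for a real knowledge base query such as the structured queries used by Wu et al. [19].

```python
# Toy external-knowledge sketch: detected visual concepts are used to look up
# supporting facts, which are appended to the question for the answer model.
TOY_KNOWLEDGE_BASE = {
    "umbrella": "An umbrella is used for protection against rain or sun.",
    "baby": "Babies often smile in response to attention or play.",
}

def retrieve_supporting_facts(detected_concepts):
    """Return knowledge-base facts for every concept detected in the image."""
    return [TOY_KNOWLEDGE_BASE[c] for c in detected_concepts
            if c in TOY_KNOWLEDGE_BASE]

def build_augmented_query(question, detected_concepts):
    """Concatenate retrieved facts with the question before answering."""
    facts = retrieve_supporting_facts(detected_concepts)
    return question + " [facts] " + " ".join(facts)

print(build_augmented_query("why is the baby smiling", ["baby"]))
```

The retained facts also serve as the supporting evidence that makes the system's reasoning inspectable, which is the advantage noted above.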
Most recent external-knowledge or attention-based VQA approaches still do not adequately examine or exploit how the words in the question interact with each other; hence, issues remain in answering complex or reasoning questions. The evolution of graph-based approaches (Teney et al. [20]; Kipf and Welling [21]; Narasimhan et al. [22]; Zhu et al. [23]) creates new opportunities to overcome these remaining challenges in VQA, with better performance on standard datasets.
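As an illustration of the basic building block used by such approaches, the following is a minimal sketch of a single graph convolution layer in the spirit of Kipf and Welling [21], assuming PyTorch and a dense adjacency matrix over object or word nodes; it is not the exact implementation used in the cited VQA works.

```python
# Single graph convolution layer: H' = sigma(D^{-1/2} (A + I) D^{-1/2} H W)
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, node_feats, adjacency):
        # node_feats: (N, in_dim) features of object-region or word nodes
        # adjacency: (N, N) graph over those nodes, without self-loops
        a_hat = adjacency + torch.eye(adjacency.size(0))      # add self-loops
        deg = a_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt            # symmetric normalization
        return torch.relu(norm_adj @ self.weight(node_feats)) # updated node features
```

Stacking such layers lets information propagate along the edges of the question or scene graph, which is how these models capture the word-word and object-object interactions discussed above.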