a. The User Interface of the System
The interface of the EDUVI system is shown in Fig. 5. It is an interactive UI, developed as a Gradio app, in which students can easily upload an input image and frame a query related to that image. The answer is produced after clicking the Submit button.
For example, an image of fruits and the question “Is papaya present in the basket?” are given as input. After clicking the Submit button, the answer is generated and displayed as shown in Fig. 6.
An interface for the image captioning system is also shown in Fig. 7 for the input image. The caption is produced after clicking the Submit button.
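A minimal sketch of such a Gradio interface is given below. The function answer_question is a hypothetical stand-in for the EDUVI pipeline, whose actual wiring is not published in code form here.

```python
import gradio as gr

def answer_question(image, question):
    # Hypothetical stand-in for the EDUVI pipeline, which routes the image
    # and question through the system's CV and NLP modules.
    return "Yes, papaya is present in the basket."  # placeholder answer

demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="EDUVI Visual Question Answering",
)

if __name__ == "__main__":
    demo.launch()  # Gradio renders the image/text inputs and a Submit button
```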
b. Experimentation
The image dataset of EDUVI contains 100 images. A few images in .jpg format, together with questions in text form, are taken as input to demonstrate the applicability of the developed system.
Questions from the question categories can be answered using the CV and NLP tasks given in Table 3:
Table 3: Mapping of question categories to CV and NLP tasks

| S. No. | Question Category | CV Tasks | NLP Tasks |
|---|---|---|---|
| 1. | Verification | Object recognition, stuff image segmentation, scene classification | Tokenization, part-of-speech tagging, named entity recognition |
| 2. | Disjunctive | Object detection, panoptic segmentation | Tokenization, part-of-speech tagging, stop word removal, named entity recognition |
| 3. | Concept Completion | Activity recognition | Tokenization, part-of-speech tagging, named entity recognition |
| 4. | Feature Specification | Attribute classification | Tokenization, part-of-speech tagging, named entity recognition |
| 5. | Quantification | Counting | Tokenization, part-of-speech tagging, named entity recognition |
| 6. | Definition | Spatial relationship, scene classification | Tokenization, part-of-speech tagging, named entity recognition |
| 7. | One-Word | Common sense reasoning | Tokenization, part-of-speech tagging, named entity recognition |
| 8. | Sentiment-Based | Emotion recognition | Tokenization, part-of-speech tagging, named entity recognition |
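The mapping in Table 3 can be carried directly into code as a lookup table. The sketch below merely restates the table; the structure and names are hypothetical, not taken from the EDUVI implementation.

```python
# Table 3 as a lookup: question category -> (CV tasks, NLP tasks).
COMMON_NLP = ["tokenization", "part-of-speech tagging", "named entity recognition"]

CATEGORY_TASKS = {
    "Verification": (["object recognition", "stuff image segmentation",
                      "scene classification"], COMMON_NLP),
    "Disjunctive": (["object detection", "panoptic segmentation"],
                    COMMON_NLP + ["stop word removal"]),
    "Concept Completion": (["activity recognition"], COMMON_NLP),
    "Feature Specification": (["attribute classification"], COMMON_NLP),
    "Quantification": (["counting"], COMMON_NLP),
    "Definition": (["spatial relationship", "scene classification"], COMMON_NLP),
    "One-Word": (["common sense reasoning"], COMMON_NLP),
    "Sentiment-Based": (["emotion recognition"], COMMON_NLP),
}

cv_tasks, nlp_tasks = CATEGORY_TASKS["Verification"]
print(cv_tasks)  # tasks invoked for verification-type questions
```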
1. Answering Questions in Different Categories: A question from each question category listed in Table 3 is answered by the system (a sketch of how a question might be routed to its category follows the list below).
I. Verification
II. Disjunctive
III. Concept Completion
IV. Feature Specification
V. Quantification
VI. Definition
VII. One-Word
VIII. Sentiment-Based
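As an illustration of the NLP side of this routing, the sketch below tokenizes and part-of-speech tags a question with NLTK (two of the NLP tasks in Table 3) and applies hand-written keyword rules to guess the category. The rules are assumptions made for this example; the paper does not publish its routing logic.

```python
import nltk  # requires the 'punkt' tokenizer and POS tagger data packages

def classify_question(question: str) -> str:
    # Tokenization and part-of-speech tagging, as listed in Table 3.
    tokens = nltk.word_tokenize(question.lower())
    tagged = nltk.pos_tag(tokens)

    # Hand-written heuristics for illustration only.
    if "or" in tokens:
        return "Disjunctive"
    if tokens[:2] == ["how", "many"]:
        return "Quantification"
    if tokens[0] in {"is", "are", "does", "do", "can"}:
        return "Verification"
    if "feel" in tokens or "feeling" in tokens:
        return "Sentiment-Based"
    if tokens[0] == "what" and any(tag.startswith("JJ") for _, tag in tagged):
        return "Feature Specification"
    return "Concept Completion"

print(classify_question("Is papaya present in the basket?"))   # Verification
print(classify_question("How many apples are in the basket?")) # Quantification
```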
2. Generating a Description of the Image
For the images where no questions were framed, captions were generated, as shown in Fig. 9.
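The paper does not name the captioning backbone; as one concrete possibility, a pretrained BLIP model from the Hugging Face transformers library can generate such captions. The model name and image path below are assumptions for illustration.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative only: EDUVI's actual captioning model is not specified here.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")  # hypothetical image path
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```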
3. Knowledge Development
This feature helps provide a better understanding of the image. For example, for the given image, the system is able to answer which animal classes are shown in the picture (Fig. 10).
4. Keyword-Based Answer Generation
In this category, an image is taken as input and keywords related to the image are given as the question. The system provides a caption describing the image (Fig. 11).
5. Knowledge Correction
Here, a question containing wrong information about the image was framed, and the system provided the correct answer (Fig. 12).
6. Knowledge of Unknown and Complex Images for Students:
This feature of the system provides a description of unknown and complex images (Fig. 13).
7. Knowledge of an Arbitrary Tested Image:
The given image is not stored in the EDUVI dataset, but the system is still able to answer questions related to the image (Fig. 14).
c. Evaluation of the System
The developed system was evaluated using various metrics. First, the numbers of correct answers, incorrect answers, and Not Available (NA) responses were counted for the questions under each category. It is observed from Fig. 15(a) that the number of correct answers for verification-based questions is higher than for the other categories, which reveals that the model works well on this category. On the other hand, quantification-based questions have the highest number of incorrect answers.
From Fig. 15(b) it can be seen that the numbers of correct answers for the generating-description, keyword-based, and knowledge-of-arbitrary-images categories are greater than for the other categories.
The response time of each question category is shown in Fig. 16(a). It is observed that each type of question category has a different response time (in seconds). The response time for sentiment-based questions was the highest, while that for verification-based questions was the lowest. The response time of each knowledge category is shown in Fig. 16(b); it can be interpreted from the graph that the keyword-based category took the longest time because it automatically generates the answer from the given keywords related to the image using NLP tasks.
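Per-category response times of this kind can be collected with a monotonic clock; the sketch below assumes a hypothetical callable pipeline standing in for EDUVI and is not taken from the paper.

```python
import time
from collections import defaultdict
from statistics import mean

def measure_response_times(pipeline, examples):
    # `pipeline` is a hypothetical callable (image, question) -> answer;
    # `examples` yields (category, image, question) triples.
    times = defaultdict(list)
    for category, image, question in examples:
        start = time.perf_counter()
        pipeline(image, question)
        times[category].append(time.perf_counter() - start)
    # Average response time (in seconds) per question category.
    return {category: mean(values) for category, values in times.items()}
```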
After testing the model on 800 questions of various categories, the accuracy was computed for each category of questions using equation (1). The computed accuracy is shown in Fig. 17(a) and (b).
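Equation (1) is the usual per-category accuracy, i.e., the share of correctly answered questions expressed as a percentage. A minimal sketch is shown below; treating NA responses as part of the denominator is an assumption of this sketch, and the example numbers are illustrative (800 questions split evenly over eight categories would give 100 per category).

```python
def category_accuracy(correct: int, incorrect: int, na: int) -> float:
    # Accuracy in the sense of equation (1): correct answers over all
    # questions in the category, as a percentage. Counting NA responses
    # in the denominator is an assumption made for this sketch.
    total = correct + incorrect + na
    return 100.0 * correct / total

# Illustrative numbers: 91 correct out of 100 verification questions -> 91.0
print(category_accuracy(correct=91, incorrect=7, na=2))
```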
From Fig. 17(a) it can be seen that the accuracy for verification-based questions is the highest, i.e., 91%, whereas Fig. 17(b) shows that the generating-description, keyword-based, and knowledge-of-arbitrary-images categories achieve the highest accuracy among all the categories.