VMSG: a video caption network based on multimodal semantic grouping and semantic attention

Network video typically contains a variety of information that a video caption model can use to generate captions. Video caption generation involves two steps: video information extraction and natural language generation. Existing models suffer from redundant information across continuous frames when generating natural language, which harms caption accuracy. This paper therefore proposes a video caption model based on multimodal semantic grouping and semantic attention (VMSG). Unlike frame-by-frame decoding, VMSG uses a novel semantic grouping method that gathers video frames sharing the same semantics into a semantic group for decoding and predicting the next word, reducing the redundant information of continuous video frames. Because the importance of each semantic group varies, we investigate a semantic attention mechanism to weight the semantic groups, and we use a single-layer LSTM to simplify the model. Experiments show that VMSG outperforms some state-of-the-art models in caption generation performance and alleviates the problem of redundant information in continuous video frames.


Introduction
Human intelligence manifests itself in two ways: visual perception and language expression. Video caption generation [1, 2, 5] is a common application of artificial intelligence that combines visual data and natural language, and it can be used in many relevant applications such as video retrieval. Understanding a visual scene and describing it naturally is referred to as video captioning, a hot topic in computer vision. There are two important aspects to video captioning. The first is to extract distinguishable feature information from the video. The second is to convert the extracted features into natural language [3, 4]. In computer vision and natural language processing, data-driven deep learning is the primary processing method. Figure 1 depicts an example of video caption generation.
The most widely used video caption algorithms are encoder-decoder frameworks based on convolutional neural networks and recurrent neural networks [2, 24, 28, 33, 41]. The convolutional neural network encoder takes a set of continuous frames from the input video and generates the corresponding video features. The recurrent neural network-based decoder then takes the visual features and previously predicted words as input and generates one word at each step. An image caption model only needs to understand the static content of a single image; in contrast, a video caption model must fully comprehend the video context.
Network video contains information that is similar between adjacent frames. As a result, there is considerable redundant information between adjacent video frames, which usually cannot provide enough unique information [5]. Humans naturally comprehend video by segmenting it into information units using semantics; therefore, viewing each frame as a separate information unit is ineffective for understanding video.
The complementarity of visual and textual information is critical for video caption algorithms. However, when encoding video, previous methods primarily focus on the visual aspect (i.e., video frames) while paying less attention to the textual aspect (i.e., the partially decoded caption). The video caption consists of text predicted by the decoder that summarizes the visual content. As a result, the phrases formed from the partially decoded caption can group semantically related frames into information units, forming semantic groups.
For example, in a video scene where a boy meets a girl and talks to her, suppose the decoder has partially generated "a boy talking with". The phrase "a boy" can form a semantic group with the scene of the boy standing alone, and "talking with" can form a semantic group with the following frames of the two people talking. The next word, "girl", can then be predicted from this semantic group.
To serve as information units for understanding video, semantic groups must have the following three characteristics. First, the meaning of a semantic group should be concrete and observable. Second, a semantic group should have a meaning distinct from other semantic groups, allowing it to be treated as a standalone information unit with no redundancy. Third, all video frames in a semantic group should be closely related to the phrase it represents.
This paper proposes a video caption model based on multimodal semantic grouping and semantic attention (VMSG) to address the problems of continuous-frame redundancy and insufficient feature extraction. To extract 3D and 2D features, the model employs the 3D ResNet [6] neural network and the residual neural network [7]. The classification and audio information of the video is then added to the multimodal framework for encoding. Once the multimodal features have been obtained, they must be decoded. Unlike the previous frame-by-frame decoding mode, VMSG decodes using semantic grouping. Because the importance of different semantic groups varies, this paper investigates a semantic attention algorithm to give semantic groups appropriate weight. Video frames with the same semantics are grouped together for decoding and predicting the next word. The work and contributions of this paper are summarized as follows:

1. This paper proposes a video caption model based on multimodal semantic grouping, which fully utilizes multimodal video information. Considering the vanishing gradient problem of deep networks, our model uses 3D ResNet instead of the C3D [8] network to extract video feature information.

2. To address the problem of redundant information, this paper constructs a phrase filter to remove excess phrases and builds a semantic aligner to group adjacent video frames into semantic groups.

3. In the decoding module, we use a single-layer LSTM structure to replace the original double-layer network, simplifying the model and reducing computational cost while preserving effectiveness. A semantic attention mechanism is added to attach weights to semantic groups, enabling the model to selectively focus on important information and improving the quality of the generated captions.

4. We add a contrastive attention loss to the cross-entropy loss for training and evaluate the proposed method on a public dataset. The experimental results show that our method outperforms advanced algorithms, and further analysis shows that the model can extract video features effectively.

Video captioning
Early video caption techniques primarily relied on template-based methods [32]. These methods defined specific sentence rules that broke the entire sentence into several semantic parts. Once the template was defined, a visual detector extracted video elements from the video and matched them with the template one by one. While this approach was easy to implement and produced grammatically correct results, its accuracy and the richness of its sentence structure were limited. As a result, template-based methods were soon replaced by sequence models based on recurrent neural networks.
The principle behind sequence-based models is to map visual features and textual information into the same vector space, ultimately achieving a sequence-to-sequence mapping [2, 24, 28, 33, 41]. Venugopalan S et al. [33] proposed a method that directly averages the features extracted by a convolutional neural network to obtain the encoded representation of the video. This approach has a simple structure and reduces the number of parameters, but it disregards the order of video frames and simply treats the video as multiple images. Moreover, because the model cannot recall all the visual content and syntactic structure in the video, such methods often cannot generate naturally complex sentences like human language. Chen J et al. [28] introduced a retrieval-enhanced mechanism (RAM) that can explicitly refer to existing video-language pairs in any encoder-decoder caption model. They also integrated RAM into a convolutional encoder-decoder structure to enhance video caption generation.
There is a great deal of redundant information between adjacent frames in typical videos. To address this, researchers have tried to mimic how humans understand videos: collecting semantically related information into units and then decoding those information units into video captions. Some methods divide videos into a fixed or adaptive number of segments composed of consecutive frames. For example, Baraldi L et al. [34] treat a video as a set of temporally contiguous segments and improve the performance of video caption models by discovering the hierarchical structure of videos.
For natural language decoding, the main method currently used is autoregressive (AR) decoding, i.e., predicting each word conditioned on the previously generated output. This prevents decoding from being computed in parallel and causes delays in caption generation. In recent years, non-autoregressive (NA) decoding has become a popular research direction in natural language generation because it generates all words at once through parallel computation. However, non-autoregressive decoding cannot fully utilize contextual information, resulting in a significant performance gap compared with autoregressive models [35], manifested as repetition of generated words. To improve performance, Yang B et al. [26] proposed a non-autoregressive model with a coarse-to-fine captioning process that alleviates these problems. In addition, they used a bidirectional self-attention network as the language model to accelerate inference.

Attention mechanism
The memory capacity of recurrent neural networks is limited, so a sequence-based encoder-decoder framework cannot effectively capture the temporal information in videos and may suffer from error accumulation. For instance, in the encoder-decoder structure, the encoder treats every video frame equally, regardless of their different levels of importance. To address this problem, researchers have introduced attention mechanisms into the framework to weight the temporal information in the video, allowing the model to selectively focus on significant frames. These mechanisms can also help the model select the most appropriate fusion strategy during decoding, which ultimately improves the accuracy of video captioning. In recent years, attention mechanisms have been developed further, including temporal and spatial attention. For example, Yao L et al. [20] incorporated temporal attention into the model, enabling it to automatically select the most relevant time segments. Similarly, Deng J et al. [29] proposed a syntax-guided hierarchical content attention and syntax attention network (SHAN), which adaptively integrates features from both temporal and feature aspects, enhancing the interpretability of the model for captions. However, a simple temporal attention mechanism may not effectively capture the internal relationships among crucial information. To overcome these challenges, Ji W et al. [19] proposed an attention-based dual learning method for video captioning (ADL). ADL employs a multi-head attention mechanism to capture key information from both the raw video and the generated captions, thereby reducing the semantic gap between the two and improving the quality of the generated captions.

Multimodal fusion
Videos contain a wealth of information, including visual, audio, and classification information, and effectively utilizing this complex information is a crucial area of research. One approach, proposed by Ramanishka V et al. [39], is feature-level fusion, which fully exploits multimodal information. Building on this, researchers have proposed decision-level, attention-based, and model-level fusion [3, 4, 17, 36, 38, 40, 41]. Multimodal video caption models use fully connected layers to fuse various features, enriching the video information in the encoding vector and enhancing the expressiveness of the video features, thus strengthening the correlation between the description and the video.
Jin Q et al. [3] proposed a multimodal fusion method for describing video content. The encoder integrates video, image, audio, dialog, speech, and classification information; fusion networks then fuse these modalities, and the result is fed to the decoder to generate sentences. Xu N et al. [17] use two interactive memory networks to learn feature information for a single modality and multiple interactive attentions to learn the interactions between modalities. However, multimodal fusion methods often have high computational costs and complexity. To address this, Nagrani A et al. [40] use an attention bottleneck (AB) module to encode attention-based video and audio information. This method performs fusion between different modalities only in the middle stage of the decoder, greatly reducing the computational cost while still accounting for the influence of each single modality.

Overall structure of the model
VMSG comprises four modules: a video encoding module (i.e., the multimodal feature fusion module), a phrase encoding module, a semantic grouping module, and a decoding module. Figure 2 depicts the overall structure of VMSG, where GloVe (Global Vectors for Word Representation) [37] is a word representation tool based on global word-frequency statistics.
To enhance the input information, this paper adopts a multimodal input approach in the video encoding module, incorporating 2D features, 3D features, audio features, and classification features. Building upon previous work, the dynamic feature extractor C3D is replaced with a 3D ResNet network to alleviate the degradation problem of deep networks and improve the model's ability to extract dynamic features.

Fig. 2 Overall architecture of VMSG
Once the multimodal features are obtained, the phrase encoding module forms phrases from the words generated so far, which are then grouped by the semantic grouping module to produce the final video representation. The semantic grouping diagram in Fig. 3 illustrates this process, where "w" denotes the generated words, "p" the generated semantic phrases, "f" the multimodal video frames, and "x" the video representations. As each semantic group holds varying importance, this paper introduces an attention mechanism in the decoder to assign weights to each semantic group.
To simplify the model, the double-layer LSTM is replaced with a single layer for decoding. Additionally, the model incorporates a contrastive attention loss in addition to the cross-entropy loss.

Multimodal input
VMSG develops a multimodal video caption architecture that incorporates various modalities as input. This significantly enriches the types of features used and has a positive impact on video caption generation. The multimodal feature fusion method in our study employs early fusion (i.e., features are fused after extraction, and the model is trained on the fused features), concatenating the multimodal features frame by frame.
The multimodal inputs utilized in this paper include 2D features, dynamic features, video classification features, and audio features, as detailed below:

2D Features
2D features are widely used in image detection and classification tasks; they provide specific information about objects and scenes. In this paper, ResNet-152, a residual network, is used for 2D feature extraction. The model is pre-trained on over 1.2 million images belonging to 1000 categories. A pooling layer is added to ResNet to generate 2048-dimensional 2D features.
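As an illustration of this step, the sketch below extracts pooled 2048-dimensional frame features with a pre-trained ResNet-152; the use of torchvision and the ImageNet preprocessing values are assumptions, since the paper does not specify its implementation.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Sketch: 2048-d frame features from a pre-trained ResNet-152 (classifier removed).
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()   # keep the globally pooled 2048-d vector
resnet.eval()

# Standard ImageNet preprocessing (assumed; not stated in the paper).
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_2d_features(frames):
    """frames: list of PIL images (e.g., the 30 uniformly sampled frames)."""
    batch = torch.stack([preprocess(f) for f in frames])  # (N, 3, 224, 224)
    return resnet(batch)                                  # (N, 2048)
```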

Dynamic features
Dynamic features are critical for describing the motion information of each object. While ResNet is effective at generating visual features for static images, its ability to extract dynamic features is limited. To better capture dynamic features, we extend the 2D neural network to a 3D convolutional neural network.
Compared with the C3D network, 3D ResNet can better alleviate the degradation problem of deep network models. Given that 3D ResNet has multiple depth configurations, such as 18, 34, and 101 layers, we experiment with various depths and eventually select 3D ResNet-101 to extract dynamic features for VMSG. The 3D ResNet is trained on the Kinetics dataset to improve dynamic feature extraction.

Classification features
In the ablation experiments on video features, we find that the classification information of the video is helpful for video captioning. For example, if the object is a music video, the weight of the audio should be appropriately increased; if it is a sports video, the visual weight should be increased. Therefore, this study uses a 3D ResNet network with a fully connected layer to extract the classification information of the video. The 3D ResNet is trained on the Kinetics dataset, which covers 400 categories spanning subcategories in the fields of sports, movies, food, and so on, greatly improving the level of detail. Furthermore, because labels can be generated autonomously, generalization performance is also improved.
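To make the role of the shared 3D backbone concrete, the sketch below extracts a pooled clip feature and Kinetics-400 class scores from one network; torchvision's r3d_18 is used here only as a stand-in for the 3D ResNet-101 described above, and the omitted clip preprocessing is an assumption.

```python
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

# Stand-in backbone: torchvision's Kinetics-400 pre-trained 3D ResNet-18.
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1).eval()
backbone = torch.nn.Sequential(*list(model.children())[:-1])  # everything up to the pooling

@torch.no_grad()
def extract_dynamic_and_class(clip):
    """clip: tensor (3, T, H, W), a normalized short clip around one sampled frame."""
    x = clip.unsqueeze(0)                    # (1, 3, T, H, W)
    feats = backbone(x).flatten(1)           # (1, 512) pooled dynamic feature (2048 for ResNet-101)
    logits = model.fc(feats)                 # (1, 400) Kinetics class scores
    return feats.squeeze(0), logits.softmax(dim=-1).squeeze(0)
```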

Audio features
To better utilize the available audio information, we use a widely used audio feature, Mel-frequency cepstral coefficients (MFCC). pyAudioAnalysis [9] is used to extract features from uniformly sampled 1-s audio clips, yielding the audio feature representation $\{v^{a}_i\}_{i=1}^{N}$ of the video. The model then concatenates the video features of the different categories frame by frame to form a frame representation.
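A minimal sketch of this frame-by-frame early fusion is given below; the individual feature dimensions, the single linear projection, and all tensor names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameFusion(nn.Module):
    """Early fusion: concatenate per-frame 2D, dynamic, classification, and audio
    features, then project them with a fully connected layer."""
    def __init__(self, d_2d=2048, d_3d=2048, d_cls=400, d_audio=13, d_out=512):
        super().__init__()
        self.proj = nn.Linear(d_2d + d_3d + d_cls + d_audio, d_out)

    def forward(self, f2d, f3d, fcls, faud):
        # Each input has shape (num_frames, d_*), aligned frame by frame.
        fused = torch.cat([f2d, f3d, fcls, faud], dim=-1)
        return self.proj(fused)   # (num_frames, d_out) frame representations

# Usage sketch with 30 sampled frames and illustrative dimensions.
fusion = FrameFusion()
frames = fusion(torch.randn(30, 2048), torch.randn(30, 2048),
                torch.randn(30, 400), torch.randn(30, 13))
```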

Phrase coding module
Some words have no meaning when used alone, such as "is" and "the", while other words have unclear meanings when used alone. For example, connecting "woman" and "cap" to form "woman with cap" makes the meaning clearer. Therefore, when performing semantic grouping in this paper, phrases are used instead of single words.
To build semantic phrases for the model, we must generate appropriate words and phrases from the partially generated caption, which requires finding the dependency relationships between words. When generating the $t$-th word $w_t$ of the caption, there is a word representation matrix $W_t \in \mathbb{R}^{(t-1)\times d_w}$ built from the embeddings of the previously generated words, where $E$ represents the word embedding matrix and $d_w$ is the dimension of $w_t$. We use a phrase encoder $\phi_p$ to generate the phrase representation matrix $P_t = [p_{1,t} \cdots p_{t-1,t}]^T \in \mathbb{R}^{(t-1)\times d_w}$ from the word representation matrix $W_t$, which is given as

$P_t, A_t = \phi_p(W_t)$,  (1)

where $A_t = [a_{1,t} \cdots a_{t-1,t}]^T \in \mathbb{R}^{(t-1)\times(t-1)}$ is the word attention matrix and $a_{j,t} \in \mathbb{R}^{t-1}$ contains the attention weights over the words $\{w_i\}_{i=1}^{t-1}$. For the encoder $\phi_p$, we use the self-attention module [10] proposed by Vaswani et al., which models the dependencies between words in a sentence well.
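The single-head sketch below illustrates how a self-attention phrase encoder can return both the phrase representations $P_t$ and the word attention matrix $A_t$; the paper uses the full Transformer-style module of [10], so this simplification and the default dimension are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseEncoder(nn.Module):
    """Simplified single-head self-attention phrase encoder:
    returns phrase representations P_t and the word attention matrix A_t."""
    def __init__(self, d_w=300):
        super().__init__()
        self.q = nn.Linear(d_w, d_w)
        self.k = nn.Linear(d_w, d_w)
        self.v = nn.Linear(d_w, d_w)

    def forward(self, W_t):
        # W_t: (t-1, d_w) embeddings of the previously generated words.
        scores = self.q(W_t) @ self.k(W_t).transpose(0, 1) / W_t.size(-1) ** 0.5
        A_t = F.softmax(scores, dim=-1)   # (t-1, t-1) word attention matrix
        P_t = A_t @ self.v(W_t)           # (t-1, d_w) phrase representations
        return P_t, A_t
```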

Semantic grouping module
A phrase serves as the foundation for a semantic group, which is made up of phrases and all semantically related video features.The number of candidate phrases equals the number of words, and many of them are very similar.As a result, phrase suppressors are used by the model to filter out these redundant phrases.After obtaining all available phrases, the semantic aligner combines these phrases with video frame features that are semantically related to them to form semantic groups.

Phrase filter
To keep phrases that are meaningful and have low coupling, the model uses a phrase filter to decide which phrases to discard based on their similarity. In this paper, similarity is computed from the phrase attention matrix obtained in Eq. (1), where $r_{i,j,t}$ denotes the similarity between $p_{i,t}$ and $p_{j,t}$. We set a threshold: if $r_{i,j,t}$ is greater than this threshold, the two phrases are considered related. For two related phrases, we compare the similarity of each with all other phrases and discard the one with the larger total: if $\sum_k r_{i,k,t} > \sum_k r_{j,k,t}$, then $p_{i,t}$ is discarded. Table 1 shows the detailed phrase filter mechanism.
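The sketch below implements the filtering rule just described; because the paper does not state how $r_{i,j,t}$ is computed from the attention matrix, the cosine similarity between rows of $A_t$ and the threshold value are assumptions.

```python
import torch
import torch.nn.functional as F

def phrase_filter(A_t, tau=0.2):
    """Drop one of each pair of phrases whose similarity exceeds tau,
    keeping the phrase that overlaps less with all the others."""
    n = A_t.size(0)
    # Similarity between phrases, here cosine similarity of attention rows (assumed form).
    r = F.cosine_similarity(A_t.unsqueeze(1), A_t.unsqueeze(0), dim=-1)  # (n, n)
    total = r.sum(dim=1)                        # sum_k r_{i,k,t}
    keep = torch.ones(n, dtype=torch.bool)
    for i in range(n):
        for j in range(i + 1, n):
            if keep[i] and keep[j] and r[i, j] > tau:
                # Discard the phrase with the larger total similarity.
                keep[i if total[i] > total[j] else j] = False
    return keep   # boolean mask over the candidate phrases
```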

Semantic aligner
For each phrase $p_{i,t}$ and each frame $f_j$, the model computes a relevance score $\alpha_{i,j,t}$ based on the correlation between their representation vectors:

$\alpha_{i,j,t} = u_s^{\top}\tanh(U_s\, p_{i,t} + H_s\, v_j + b_s)$,  (3)

where $u_s$, $U_s$, $H_s$, and $b_s$ are learnable parameters, $v_j$ is the video frame feature, and $\tanh$ is the hyperbolic tangent activation function. The scores are normalized with softmax, $\beta_{i,j,t} = \exp(\alpha_{i,j,t}) / \sum_{k}\exp(\alpha_{i,k,t})$, and the frame representation is then computed as the weighted sum $v^{p}_{i,t} = \sum_{j}\beta_{i,j,t}\, v_j$. Finally, the semantic group representation $s_{i,t}$ is obtained by combining the phrase representation $p_{i,t}$ with the frame representation $v^{p}_{i,t}$. Thus $s_{i,t}$ represents a semantic group composed of frame features and the related phrase, avoiding redundant information between adjacent frames.
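A compact sketch of the semantic aligner follows; the additive-attention form mirrors Eq. (3), while the concatenation used to combine the phrase and frame representations is an assumption, since the paper only says the two are combined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAligner(nn.Module):
    """Scores every frame against a phrase, pools the frames, and builds
    the semantic group representation s_{i,t}."""
    def __init__(self, d_p, d_v, d_h):
        super().__init__()
        self.U_s = nn.Linear(d_p, d_h, bias=False)
        self.H_s = nn.Linear(d_v, d_h, bias=False)
        self.b_s = nn.Parameter(torch.zeros(d_h))
        self.u_s = nn.Linear(d_h, 1, bias=False)

    def forward(self, p_it, V):
        # p_it: (d_p,) one phrase representation; V: (N, d_v) frame features.
        scores = self.u_s(torch.tanh(self.U_s(p_it) + self.H_s(V) + self.b_s)).squeeze(-1)
        beta = F.softmax(scores, dim=0)          # relevance of each frame to the phrase
        v_p = beta @ V                           # (d_v,) weighted frame representation
        s_it = torch.cat([p_it, v_p], dim=-1)    # semantic group (concatenation assumed)
        return s_it, beta
```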

Single layer LSTM decoding module based on semantic attention
Traditional video caption models typically employ a two-layer LSTM during decoding. However, the double-layer LSTM increases the number of parameters, making the model harder to train. Therefore, VMSG adopts a single-layer LSTM and uses semantic groups as its input. To simplify the model, the network consists of one embedding layer, one bidirectional layer, and two fully connected layers. Figure 4 shows the working mechanism of semantic attention. The contribution of each semantic group $s_{i,t}$ to generating the $t$-th word $w_t$ differs. For example, when generating "people" in "a man is talking to a group of people", two semantic groups, "a man is talking" and "a group of", will have been formed; of these, "a group of" is more important for predicting "people".
The traditional attention mechanism has three inputs Q, K, and V. In this paper, Q corresponds to the semantic groups $s_{i,t}$, while K and V both correspond to the previous hidden state $h_{t-1}$. Considering the limited expressiveness of linear models, a hyperbolic tangent function is added to the attention module. The final form is given as

$\beta_{i,t} = \mathrm{softmax}_i\!\left(u_d^{\top}\tanh(U_d\, s_{i,t} + H_d\, h_{t-1} + b_d)\right)$,

where $u_d$, $U_d$, $H_d$, and $b_d$ are learnable parameters, and $\beta_{i,t}$ is the weight of each semantic group.
After adding semantic attention, the decoder assigns a weight to each semantic group according to the previous decoder state $h_{t-1}$. With these weights, the weighted average result $x_t$ is obtained as

$x_t = \sum_{i=1}^{M_t} \beta_{i,t}\, s_{i,t}$,

where $M_t$ is the number of semantic groups $s_{i,t}$.
Then $x_t$ is fed to the LSTM as shown in Eq. (8):

$h_t = \mathrm{LSTM}(x_t, h_{t-1})$.  (8)

The probability distribution $P$ over the next word is computed by a fully connected layer followed by a softmax layer:

$P(w_t \mid w_{<t}, V) = \mathrm{softmax}(U_h\, h_t + b_h)$,

where $V$ is the given video, and $U_h$ and $b_h$ are learnable parameters of the fully connected layer. Table 2 shows the detailed semantic attention algorithm.
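One decoding step can be sketched as follows: the semantic groups are weighted by additive attention against the previous hidden state, averaged into $x_t$, and passed through a single LSTM cell and a fully connected softmax layer. The layer sizes and the exact LSTM input are assumptions, since the paper only states that $x_t$ is fed to the LSTM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAttentionDecoder(nn.Module):
    """Single-layer LSTM decoding step with semantic attention over the groups."""
    def __init__(self, d_s, d_h, vocab_size):
        super().__init__()
        self.U_d = nn.Linear(d_s, d_h, bias=False)
        self.H_d = nn.Linear(d_h, d_h, bias=False)
        self.b_d = nn.Parameter(torch.zeros(d_h))
        self.u_d = nn.Linear(d_h, 1, bias=False)
        self.lstm = nn.LSTMCell(d_s, d_h)
        self.U_h = nn.Linear(d_h, vocab_size)

    def step(self, S_t, h_prev, c_prev):
        # S_t: (M_t, d_s) semantic groups; h_prev, c_prev: (d_h,) previous LSTM state.
        e = self.u_d(torch.tanh(self.U_d(S_t) + self.H_d(h_prev) + self.b_d)).squeeze(-1)
        beta = F.softmax(e, dim=0)                # weight of each semantic group
        x_t = beta @ S_t                          # weighted average of the groups
        h_t, c_t = self.lstm(x_t.unsqueeze(0),
                             (h_prev.unsqueeze(0), c_prev.unsqueeze(0)))
        probs = F.softmax(self.U_h(h_t), dim=-1)  # distribution over the next word
        return probs.squeeze(0), h_t.squeeze(0), c_t.squeeze(0)
```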

Loss function of the model
The key to training the model is to generate distinct and coherent semantic groups. To ensure this, the phrase filter removes redundant phrases. This paper employs the typical cross-entropy loss function $L_{ce}$ and the contrastive attention loss function $L_{ca}$ to train the model. Given a video $V$ and its reference caption $Y = [y_1, \cdots, y_T]$, the overall training loss is the combination of these two terms.

Cross-entropy loss function
Cross-entropy loss is defined as the negative log-probability of producing the correct caption:

$L_{ce} = -\sum_{t=1}^{T}\log P\!\left(y_t \mid y_{<t}, V\right)$.

Contrastive attention loss function
To ensure that the members of a semantic group have a consistent meaning, a semantic group should only contain frame information that is highly related to its semantics. In this paper, we use another set of videos with low correlation as incorrect candidates for the semantic alignment module. To ensure low correlation, we choose a control video at random from a group of videos with completely different captions. From Eq. (3), the positive correlation coefficient $pos_{i,j,t}$ and the negative correlation coefficient $neg_{i,j,t}$ of an input video frame $f_j$ and the phrase $p_{i,t}$ can be obtained.

After obtaining the positive and negative correlation coefficients, we normalize them jointly with softmax. Then $p_{ca}(s_{i,t}) = \sum_{j=1}^{N} pos_{i,j,t}$ represents the probability that no control frame belongs to the semantic group; $p_{ca}(s_{i,t})$ increases as the positive correlation coefficients grow relative to the negative ones. With $M_t$ denoting the number of semantic groups, the contrastive attention loss is

$L_{ca} = -\frac{1}{M_t}\sum_{i=1}^{M_t}\log p_{ca}(s_{i,t})$.
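The two loss terms can be sketched as below; the joint softmax over positive and control scores follows the description above, while the weighting factor between the two losses is an assumption.

```python
import torch
import torch.nn.functional as F

def caption_losses(logits, targets, pos_scores, neg_scores, lam=1.0):
    """logits: (T, vocab) word scores; targets: (T,) reference word indices;
    pos_scores / neg_scores: (M, N) aligner scores of each semantic group against
    the positive frames and the control (negative) frames."""
    # Cross-entropy: negative log-probability of the reference words.
    l_ce = F.cross_entropy(logits, targets)

    # Contrastive attention: normalize positive and control scores jointly, then
    # maximize the probability mass assigned to the positive frames.
    joint = torch.cat([pos_scores, neg_scores], dim=1)   # (M, 2N)
    probs = F.softmax(joint, dim=1)
    p_ca = probs[:, :pos_scores.size(1)].sum(dim=1)      # prob. that no control frame is chosen
    l_ca = -torch.log(p_ca + 1e-8).mean()

    return l_ce + lam * l_ca   # lam balances the two terms (assumed weighting)
```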

Experimental results and analysis
This section first describes the dataset's characteristics, then provides the evaluation criteria and thoroughly analyzes the experimental results.

Dataset
We train and test VMSG on MSR-VTT [11] throughout the experiments. In the field of video captioning, MSR-VTT is a critical dataset. It specifies the video category as well as the video's audio characteristics. MSR-VTT has 10,000 online videos totaling 41.2 h in 20 different classes, and AMT workers created 20 captions for each online video.
During the experiments, we find some issues with the dataset's videos, such as spelling errors and unusable audio information. Although the total vocabulary of the video captions contains 23,667 words, 10,040 of them appear only once. Furthermore, when all words are compared with Wikipedia's vocabulary, we find that 836 words do not exist, owing to spelling errors. This complicates the model's training and testing.
The dataset's videos include audio that can be used for caption generation. However, approximately 13% of the videos lack audio information, which complicates the experiments.
More than 90% of the videos are under 30 s long, and 90% of the captions are under 16 words long. As a result, we sample 30 frames evenly, which characterizes the video features well while keeping the data size manageable.

Evaluating metrics
We use four metrics to evaluate the model: BLEU [12], METEOR [13], ROUGE-L [14], and CIDEr [15]. Because these metrics are widely used, their detailed formulas are not listed here.
BLEU: BLEU (bilingual evaluation understudy) is an overall caption evaluation criterion. It measures the overlap of n-grams between the generated caption and the reference text.
METEOR: METEOR (metric for evaluation of translation with explicit ordering) is based on a weighted combination of unigram precision and recall.
ROUGE-L: ROUGE (recall-oriented understudy for gisting evaluation) is a common video caption criterion that is primarily based on recall rate.
CIDEr: CIDEr (consensus-based image description evaluation) treats each sentence as a document and computes the cosine similarity of TF-IDF vectors, where TF is the number of times a specific phrase appears in a sentence and IDF expresses the importance of the phrase.
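These four metrics can be computed with the Microsoft COCO caption evaluation toolkit (pycocoevalcap), as sketched below; in practice captions are usually first tokenized with the toolkit's PTBTokenizer and METEOR additionally requires Java, both of which are omitted here, and the example captions are hypothetical.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

def evaluate(gts, res):
    """gts: {video_id: [reference captions]}, res: {video_id: [generated caption]}."""
    results = {}
    for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                         ("ROUGE_L", Rouge()), ("CIDEr", Cider())]:
        score, _ = scorer.compute_score(gts, res)
        results[name] = score   # Bleu(4) returns [BLEU-1, ..., BLEU-4]
    return results

# Example with one hypothetical video.
print(evaluate({"vid1": ["a man is cooking in the kitchen"]},
               {"vid1": ["a man is making a dish"]}))
```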

Experimental setup
First, we uniformly sample each input video, taking 30 frames and 30 video clips per video; each clip consists of the video frames surrounding a sampled frame. Using these 30 frames and clips as the input of ResNet and 3D ResNet, we extract the 2D and dynamic features of the video, while the audio features are extracted from the 1-s audio clip beginning at each sampled frame.
Because VMSG uses multimodal input, the input dimension inevitably increases, which raises hardware requirements considerably. We therefore use a fully connected layer to reduce the feature dimension. The initial dimensions of the 2D and 3D features are 2048, while the dimensions of the audio features and classification labels are 1.
We feed the sampled video frames into the 3D ResNet network after it has been trained on the Kinetics dataset. To encode the label and feed it into the LSTM, we use one-hot encoding. We initialize the word embedding matrix with GloVe, set embedding_size to 300, and train it with the entire model. Before producing the first word, we use <SOS> as the beginning of the caption and then ignore it; the two corresponding hyperparameters are set to 0.2 and 0.16, respectively. A vocabulary is required to generate sentences and words; the model vocabulary contains 23,667 words and is derived entirely from the video captions of the training and test sets of MSR-VTT. We set dropout to 0.5 during training to reduce overfitting. The Adam optimizer is used to optimize the model, with an initial learning rate of 0.0005. The 6513, 497, and 2990 videos of the MSR-VTT dataset are used to train, validate, and test the model, respectively. The batch size is set to 100, and the number of training epochs is set to 50. To evaluate the model, we use Microsoft COCO's official evaluation code. The maximum caption length is set to 15 words.
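The sampling and optimization settings above can be summarized in the short sketch below; the dummy module stands in for the actual VMSG model only to keep the snippet self-contained.

```python
import numpy as np
import torch
import torch.nn as nn

def uniform_sample_indices(num_total_frames, num_samples=30):
    """Uniformly pick frame indices, matching the 30-frame sampling described above."""
    return np.linspace(0, num_total_frames - 1, num_samples).round().astype(int)

print(uniform_sample_indices(900))   # e.g., a 30-s video at 30 fps

# Reported training settings, applied to a dummy module in place of the real VMSG model.
dummy_model = nn.Sequential(nn.Linear(2048, 512), nn.Dropout(p=0.5))
optimizer = torch.optim.Adam(dummy_model.parameters(), lr=0.0005)
batch_size, num_epochs, max_caption_len = 100, 50, 15
```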

C3D and 3DResNet
Considering the problems of vanishing and exploding gradients in deep networks, 3D ResNet handles them better than the C3D network, so deep 3D ResNet extracts features more effectively than C3D. On the ActivityNet dataset, 3D ResNet-18 outperforms C3D, and Table 3 shows that 3D ResNet-34 outperforms C3D on various datasets.
We use C3D, 3D ResNet-34, and 3D ResNet-101 to extract dynamic features and form multimodal features, with a single-layer LSTM for decoding. Table 4 displays the results, where C stands for C3D, R34 for 3D ResNet-34, and R101 for 3D ResNet-101. The table shows that 3D ResNet-101 performs best, outperforming C3D on BLEU4, METEOR, and CIDEr. After these experimental comparisons, we use 3D ResNet-101 as the dynamic feature extraction model in this paper. The experimental results are depicted in Fig. 5; the outputs of 3D ResNet-101 are more accurate and better predict the "person".

Multimodal and semantic grouping module
We conduct ablation experiments on each module to evaluate its effectiveness within the multimodal semantic grouping, and the results are shown in Table 5. Multi denotes the multimodal features, which enrich the features extracted by the model. SA is the semantic aligner (including phrase encoding), which allows video frames with similar semantics to be combined into a semantic group. PS is the phrase filter, which keeps semantically relevant phrases. CA denotes the contrastive attention loss, which promotes accurate alignment of semantic phrases and video features and improves the model's ability to form semantic groups. According to the table, the improvement from SA is the most pronounced, while that from PS is the smallest. SA and CA do a better job of combining adjacent frames into a semantic group, and PS generates the semantic phrases corresponding to the semantic group. Compared with forming adjacent features into a semantic group, the effect of generating semantic phrases within the group is subtle, because SA and CA promote the formation of semantic groups directly, whereas PS does so only indirectly. Multimodal video features can significantly improve performance by enriching the video information available to the encoder. Figure 6 depicts the experimental results; compared with the other variants, VMSG's output is more precise and correctly generates "apply makeup".

Semantic attention mechanism
To investigate the effect of the semantic attention mechanism, we conduct experiments with and without it, using 50 epochs of training. Table 6 displays the results: VMSG-WA denotes the model without semantic attention, whereas VMSG denotes the model with it. As shown in the table, VMSG outperforms VMSG-WA on all four criteria.
The improvement is slight on BLEU4, METEOR, and ROUGE-L, and reaches 2% on CIDEr. CIDEr reflects the model's ability to grasp key points. The attention mechanism assigns a weight to each semantic group, so a semantic group with high importance receives a high weight, which improves the model's ability to grasp the key points. The experimental results are depicted in Fig. 7; with the attention mechanism, the caption generated by VMSG is more focused, replacing "cooking" with "making a dish".

Performance comparison
The performance comparison between VMSG and state-of-the-art models is shown in Table 7. On the MSR-VTT dataset, VMSG reaches an advanced level. VMSG ranks first in CIDEr, ahead of the second-best model by 2%. In METEOR, VMSG ranks second, only 0.1 below first place. In the other two metrics, BLEU4 and ROUGE-L, VMSG also reaches the current advanced level. Therefore, VMSG is a state-of-the-art video caption generation model. METEOR reflects the semantic correctness of the generated caption, while CIDEr reflects the ability to extract key information. We believe that semantic groups eliminate redundant information, resulting in less interference. In addition, the semantic attention mechanism increases the weight of important semantic groups, highlights key points, and improves the model's ability to extract key information and produce semantically correct results.
BLEU4 and ROUGE-L reflect the model's accuracy and recall rate, respectively. VMSG makes use of multimodal input: we use ResNet-152 to extract 2D features and replace the C3D network with 3D ResNet-101, which extracts dynamic video features better. We also include audio features and classification features, which allow the model to extract sufficient information. Meanwhile, the semantic grouping and attention techniques used in this paper reduce redundant information from adjacent frames. As a result of these factors, the accuracy and recall rate of the captions produced by VMSG reach an advanced level.

Qualitative evaluation
Figures 8 and 9 show examples of captions generated by SA-LSTM and VMSG, with VMSG outperforming SA-LSTM in terms of accuracy. As shown in Fig. 8, VMSG can identify the subject who is acting in the video scene; it predicts a group of cartoon characters rather than just one, so the content is more accurate. Its ability to extract critical information has also improved. In general, VMSG outperforms SA-LSTM. Figure 10 shows the phrases "a man is talking" and "a group of" constructed from the words in the partially decoded caption "a man is talking to a group of". Collecting the frames of the man speaking and of the group of people forms the corresponding semantic groups, and the latter semantic group contributes more to predicting the following word, "people". The findings show that VMSG can form semantic phrases and associate video frames with them.

Conclusion
In this paper, a multimodal video caption method is proposed that employs semantically grouped feature fusion based on 2D and 3D features as well as classification and audio features. VMSG uses 3D ResNet instead of C3D to extract dynamic features from videos. To address the problem of redundant information in adjacent video frames, frames with the same semantics are grouped into a semantic group for decoding. As the importance of different semantic groups varies, we investigate a semantic attention algorithm that assigns weights to semantic groups. To simplify the model, we employ a single-layer LSTM. Finally, we add a contrastive attention loss to the cross-entropy loss for training and evaluate the proposed method on MSR-VTT. The experimental results show that VMSG achieves good results.

Fig. 1 Example of video caption generation

Fig. 4 Working mechanism of semantic attention. The orange shades represent the magnitude of the semantic weights

Benchmarks

1 MGSA [42]: A video captioning framework utilizing motion-guided spatial attention.
2 OA-BTG [43]: A video captioning approach based on object-aware aggregation with a bidirectional temporal graph.
3 Two-stream [21]: Deep convolutional network models, including the inflated 3D ConvNet and temporal segment networks.
[31]: A video captioning method that combines a vision transformer and reinforcement learning.
17 R-ConvED (IRV2) [28]: A retrieval-augmented convolutional encoder-decoder network with Inception-ResNet-V2.
18 R-ConvED (R) [28]: A retrieval-augmented convolutional encoder-decoder network with ResNet.

Fig. 6 Ablation experiment of each module of the semantic group

Table 3 Accuracy of C3D and 3D ResNet-34 in extracting features on each dataset (red indicates the best performance)

Table 4 Comparison of experimental criteria between C3D and 3D ResNet (red indicates the best performance)

Table 6 Ablation experiment of semantic attention (red indicates the best performance)