MuMUR: Multilingual Multimodal Universal Retrieval

Multi-modal retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models require additional labelled data which is a huge manual effort. In this paper, we propose a framework MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval. We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs. We then use this data to learn a joint vision-text representation where English and non-English text queries are represented in a common embedding space based on pretrained multilingual models. We evaluate our proposed approach on a diverse set of retrieval datasets: five video retrieval datasets such as MSRVTT, MSVD, DiDeMo, Charades and MSRVTT multilingual, two image retrieval datasets such as Flickr30k and Multi30k. Experimental results demonstrate that our approach achieves state-of-the-art results on all video retrieval datasets outperforming previous models. Additionally, our framework MuMUR significantly beats other multilingual video retrieval dataset. We also observe that MuMUR exhibits strong performance on image retrieval. This demonstrates the universal ability of MuMUR to perform retrieval across all visual inputs (image and video) and text inputs (monolingual and multilingual).


Introduction
The task of multimodal retrieval aims to retrieve images or videos that are semantically similar to a given text query and vice-versa.In the world of computer vision, image retrieval and video retrieval have been treated as separate tasks.For example, transformer models pretrained on large amounts of image-text pairs are then fine tuned for the task of image retrieval [10,24,[31][32][33].In contrast, video retrieval models are developed in two parallel directions.The first line of work [4,19,31] focused on video-text pretraining on large-scale datasets like Howto100M [44] and WebVid-2M [4].While the second line of work [41] focused on using pretrained image features like CLIP [48] for video retrieval often surpassing the models pretrained on video datasets.
Most of the current multimodal retrieval datasets typically contain around 10k visual inputs and the corresponding captions of maximum lengths ranging from 30 to 60.In multimodal retrieval, the text encoder projects the input text caption and visual inputs (image or video) into a common embedding space.With longer captions, the text embeddings might lose required contextual information resulting in incorrect retrievals.One could address this by incorporating more structured knowledge (e.g., parts of speech, dependency graphs) in the text encoder [7].However, a drop in performance is observed [7] with the addition of structural knowledge in a text-tovideo retrieval setting.One could argue that the reason might be the smaller retrieval datasets and creating meaningful structural knowledge becomes a challenging task.
There are over 6000 languages in the world each having its own vocabulary of words, grammar and morphology.However, there exists an overlap of knowledge among these languages [3,22,59].In Natural Language Processing, some works [12] explored this idea of using multilingual data to improve the performance on monolingual English datasets.Conneau et al. introduced a new pretraining objective: Translation Language Modeling (TLM) in which random words were masked in the concatenated sentences of English and multilingual data and the model predicts the masked words.There, the objective was to use multilingual context to predict masked English words if the English context was not sufficient and vice-versa.
Most of the current works in multimodal retrieval are primarily focused on monolingual English datasets.Recently, few works have proposed models that can perform retrieval in a multilingual setting [6,23,45,66].However, these works use large amounts of multilingual data for pretraining and are tuned for retrieval on each language.Moreover, these models are also limited to performing either image retrieval or video retrieval.In addition, these models show a drop in performance on English retrieval datasets even though there is a boost in performance on multilingual data.
In this work, we are interested in two objectives (i) design a model capable of performing retrieval on both images and videos (ii) design a model capable of performing retrieval on multiple languages including English.Multilingual data serves as a powerful knowledge augmentation for monolingual models [12].Nevertheless, creating multilingual data requires huge human effort.To overcome this, we use state-of-the-art machine translation models [52] to convert English text captions into other languages.Specifically, we choose languages whose performance on XNLI benchmark [13] is comparable to that of English (i.e., French, German, Spanish).With this, we create high quality multilingual data without requiring human labelling.To the best of our knowledge, this is the first work that uses multilingual knowledge transfer to improve multimodal retrieval.
We propose a new framework MuMUR : Multilingual Multimodal Universal Retrieval that is capable of performing both image and video retrieval on multilingual datasets.This framework is based on CLIP [48] to effectively utilize and adapt the multilingual knowledge transfer.Our model takes a visual input, English text caption and multilingual text caption as inputs and extracts joint visual-text representations.
The multilingual text representations should act as a knowledge augmentation to the English text representations aiding in retrieval.For this purpose, we introduce a dual cross-modal (DCM) encoder block which learns the similarity between English text representations and visual representations.In addition, the DCM encoder block also associates the visual representations with the multilingual text representations.In the common embedding space, our model learns the important contextual information from multilingual representations which is otherwise missing from the English text representations effectively serving as knowledge transfer.
We validate our proposed model on a comprehensive set of image retrieval datasets: Flickr30k [47] and video retrieval datasets: MSRVTT-9k [61], MSRVTT-7k [61], MSVD [9], DiDeMo [2] and Charades [50].We show that our approach achieves stateof-the-art results, outperforming previous models on most datasets.In addition to the evaluation on monolingual retrieval datasets, we also compare the performance of our model on multilingual datasets Multi30k [16] and MSRVTT multilingual.[23].Experimental results demonstrate that MuMUR achieves state-of-the-art results on all English video retrieval datasets and significantly outperforms previous models on multilingual video retrieval in a zero-shot setting.Furthermore, our model MuMUR establishes new benchmark on multilingual image retrieval while achieving strong performance on English image retrieval datasets.These results demonstrate the universal capability of MuMUR to perform all types of multimodal and multilingual retrieval.
To summarize, our contributions are as follows: (i) We generate multilingual data using external state-of-the-art machine translation models.(ii) We propose a model that is capable of knowledge transfer from multilingual data to improve the performance of multimodal retrieval.(iii) We evaluate the proposed framework on six English retrieval benchmarks and achieve state-of-the-art results in both text-to-visual and visual-to-text retrieval settings.(iv) Finally, we demonstrate that our model significantly outperforms previous approaches on multilingual retrieval datasets.
2 Related work

Multimodal retrieval
Pre-train and then fine-tune is the most popular paradigm involving image retrieval [10,[32][33][34].These models are pre-trained on huge amounts of image text pairs such as Conceptual Captions [8], Visual Genome [28] and SBU [40] and tested on image retrieval datasets such as Flickr30k [47] and COCO [25].
The task of video retrieval has seen tremendous progress in the recent years.This is partly due to the availability of large-scale video datasets like HowTo100M [44] and WebVid-2M [4].Besides the adaption of transformers to image tasks like image classification [14] spurred the development of models based on transformers.However, videos require computationally more memory and compute power and can be infeasible to compute self-attention matrices.With the introduction of more efficient architectures [5] large-scale pretraining on videos became a possibility.In this direction, several transformer based architectures [4,19,42,43] were proposed and pretrained on large video datasets which achieved state-of-the-art results on downstream video retrieval datasets in both zero-shot and fine-tuning settings.
In a parallel direction, a few works [41] have adopted image level features pretrained on large scale image-text pairs to perform video retrieval.Surprisingly, these works have performed significantly better than the models that are pretrained from scratch on large scale video datasets.Compared to these models, our approach completely differs in the architecture and the training methodology.

Multilingual training
The recent success of multimodal image-text models on a variety of tasks, such as retrieval and question-answering, has been mostly limited to monolingual models trained on English text.This is mainly due to the availability and high-quality of English-based multimodal datasets.Recent work indicates that incorporating a second language or a multilingual encoder, thus creating a shared multilingual token embedding space, can improve monolingual pure-NLP downstream tasks [12].This concept was rapidly embraced for training multimodal models.Previous works had used images as a bridge for translating between two languages, without using a language-to-language shared dataset for training [9,49,51].
Recent work has focused on multimodal tasks, such as image retrieval, aiming to add multilingual capabilities to multimodal models [6,21].The work often indicates that incorporating a second language during training of multimodal models, improves performance on single-language multimodal tasks such as image retrieval, compared to multimodal models that were trained on a single language [21,26,57].MULE [26],

CLIP Image Encoder
"smoke rises" "de la fumée s'élève" which is a multilingual universal language encoder trained on image-multilingual text pairs, showed an improvement on image-sentence retrieval tasks of up to 20% compared to monolingual models.Nevertheless, all these previous works focus on designing models separately for image and video retrieval.Our objective is to use multilingual knowledge transfer to improve the performance on current image and video retrieval datasets.In addition, our model is capable of performing multilingual retrieval on more than 10 languages.

MuMUR : Multilingual Multimodal Universal Retrieval
In this section, we introduce our framework MuMUR : Multilingual Multimodal Universal Retrieval.We first describe the problem statement, then the multilingual data augmentation strategy and finally go over the proposed approach that enables knowledge transfer from multilingual data for video retrieval.

Problem statement
Given a set of visual data V , their corresponding English text captions E and related multilingual text captions M , our goal is to learn similarity functions s 1 (v i , e i ) and In other words, we propose a framework MuMUR that enables end-to-end learning on a tuple of visual input, English text caption and multilingual text caption by bringing closer the joint representations of those three elements.Specifically, for each visual input V and the English text captions E, we generate the multilingual translations M using external state-of-the-art machine translation models [52].Next, we present the proposed approach that facilitates the end-to-end learning using multilingual data.

Approach
Our model, illustrated in Figure 2, is comprised of three components: (i) visual encoder (ii) text encoder (iii) dual cross modal encoder.Next, we describe the framework in detail.

Visual Encoder
Given a visual input V , we consider uniformly sampled clips C ∈ R Nv×H×W ×3 where N v is the number of frames (1 for images), H and W are the spatial dimensions of a RGB frame.We then use a pretrained CLIP-ViT image encoder [48] to extract the frame embeddings F v ∈ R Nv×Dv where D v denotes the dimensions of the frame embeddings.The frame embeddings are concatenated to obtain the final representation for the video V .

Text Encoder
Let the inputs English text caption be E and multilingual text caption M of lengths p and q respectively.We use a pretrained CLIP-ViT text encoder to convert the English text caption into a sequence of embeddings R E = R Ep×D E where D E denote the embedding dimensions.We consider the representation of the token [EOS] as the final representation of English text caption.To encode multilingual text caption M , we use a M-CLIP1 model which is a multilingual clip model pretrained on multilingual text and image pairs.Specifically, the multilingual text caption is converted into a sequence of embeddings R M = R Mq×D M where D M denote the embedding dimensions.Similar to the CLIP model, we consider the [EOS] representation as the final representation of M-CLIP model.

Dual cross-modal encoder (DCM)
Our goal is to closely associate the visual embeddings R v , English text embeddings R E and multilingual text embeddings R M in a common embedding space.For this purpose, we propose a dual cross-modal encoder (DCM).To incorporate textual information into visual features and to learn visual features that are semantically most similar to text features, we use multi-head attention.The text features are used as the queries whereas the visual features are used as the keys and values.
where multi-head attention (Attention) is defined as: Here Q, K and V are same as the original multi-head attention matrices in the transformer encoder.We then apply a fully connected layer on the attention outputs and finally layer normalization to obtain R vE and R vM .
where F C is the fully connected layer and LN is the layer normalization layer.

Loss
We use the standard image-text or video-text matching loss [58] to train the model.It is measured as the dot product similarity between matching text embeddings and visual embeddings in a batch.First, we compute the loss L E between R vE and R E and then compute the loss L M between R vM and R M .The final loss is the sum of losses L E and L M . where

Inference
During inference for English retrieval datasets, we freeze the multilingual text encoder and measure the retrieval performance only using R E and R vE .Similarly, for multilingual datasets, we freeze the English text encoder and calculate the retrieval score using R M and R vM .

Datasets
Our goal is to design a universal multimodal multilingual retrieval model.Therefore, our experiments focus on evaluating on both image and video retrieval datasets comprising of monolingual and multilingual captions.
MSRVTT contains 10K videos with each video ranging from 10 to 32 seconds and 200K captions.We report the results both on MSRVTT-9k and MSRVTT-7k datasets following [41].
MSVD consists of 1970 videos and 80K descriptions.We use the standard training, validation and testing splits following [41].In this dataset, each video has multiple captions and are treated as independent samples during testing.
DiDeMo is made up of 10K videos and 40K localized descriptions of the videos.We concatenate all the sentences for each video and evaluate the paragraph-to-video retrieval following [35,41].
Charades contains of 9848 videos and each video is associated with a caption.We use the standard training and test splits following [35].
MSRVTT multilingual is a multilingual version of MSRVTT in which the English captions are translated into nine different languages.We use the standard splits following [23].

Image Retrieval
We evaluate the proposed approach MuMUR on the following image retrieval datasets: Flickr30k contains 31000 images with each image containing single caption for training and validation data and 5 captions for testing data.We follow the standard splits of 29k/1k/1k [32].
Multi30K [16] is a multilingual version of Flickr30k [47] in which the English text captions are translated into German (de), French (fr) and Czech (cs) languages.We use the training, validation and testing splits of 29k/1k/1k following [45].

Metrics
For evaluating the performance of models, we use recall at rank K (R@1, R@5, R@10), median rank (MedR) and mean rank (MnR).Unless specified, the values reported are the mean of three runs with different seeds.

Text-to-Video Retrieval
Video-to-Text Retrieval Type Model R@1 R@5 R@10 MdR MnR R@1 R@5 R@10 MdR MnR  1 Text-to-video and video-to-text retrieval results on MSR-VTT dataset 9k split.Recall at rank 1 (R@1)↑, rank 5 (R@5)↑, rank 10 (R@10)↑, Median Rank (MdR)↓ and Mean Rank (MnR)↓ are reported.Results of other methods taken from mentioned references.Our model surpasses previous state-of-the-art performance.In video-to-text retrieval, our model achieved 1.6 points boost in performance.

Hardware
Training was conducted on compute nodes that consist of Intel Xeon processors equipped with Intel®Gaudi®2 AI accelerators with 96GB of vRAM

Implementation Details
We use translations of French for the multilingual inputs to train the MuMUR model.The visual encoder and the English text encoder are initialized with CLIP-ViT-B-32.
The multilingual text encoder is initialized with M-CLIP-ViT-B-32.The dimension size of the video, English caption and multilingual caption representations is 512.The dual cross-model encoder is initialized randomly and trained from scratch.The dimension size of the key, query and value projection layers is 512.The fully connected layer in the transformer has a size of 512 and a dropout of 0.4 is applied on this layer.We use 16 frames for MSRVTT-9k, MSRVTT-7k and MSVD datasets, 42 frames for DiDeMo and Charades datasets.The maximum sequence length is set to 32 for MSRVTT-9k and MSRVTT-7k, 64 for DiDeMo and 30 for charades dataset.The model is trained using AdamW [39] a learning rate of 1e-4 and a cosine decay of 1e-6.The MSRVTT-9k and MSRVTT-7k datasets are trained with a batch size of 32 and for 15 epochs.The MSVD dataset is trained with a batch size of 32 and for 5 epochs.The DiDeMo and charades datasets are trained with a batch size of 16 for 12 and 15 epochs respectively.

Evaluation on English video retrieval datasets
In Table 1 we report the results of our proposed approach on MSVRTT-9k dataset.It can be observed that the difference between CLIP based models and other models is very significant (> 5%).Therefore, it explains the incentive to build our model using CLIP features.On MSRVTT-9k split, our model significantly outperforms CLIP4Clip model on all the metrics in both text-to-video and video-to-text retrieval settings.
VCM employs a knowledge graph between video and text modalities making its performance superior to other models in a video-to-text retrieval task.Our model surpasses VCM significantly in all the metrics elucidating that the multilingual representations serve as a powerful knowledge transfer.Moreover, our approach outperforms MCQ and MILES which are pretrained on WebVid-2M data, initialized with CLIP features, employing additional semantic information like parts-of-speech.This validates that our model doesn't require any pretraining on videos and structural knowledge injection.
The multilingual text representations in our model effectively serves this purpose.In Tables 2, 3, 4 and 5 we report the results on MSRVTT-7k, MSVD, DiDeMo and Charades datasets respectively.Our model outperforms all the previous approaches across all the metrics on all the datasets.For the MSRVTT-7k split, our model achieves a significant boost of 2.1%, 2.5% and 4.4% in R@1, R@5 and R@10 respectively compared to the previous baselines.For the MSVD dataset, we notice an improvement of 0.2%, 0.5% and 0.3% in R@1, R@5 and R@10 respectively.MSVD is a relatively smaller dataset with test size of 670 videos and hence, the improvements are relatively marginal.
For the DiDeMo dataset, our model showed a marginal boost of 0.2% in R@1 but a significant boost of 4.3% and 2.9% in R@5 and R@10 respectively compared to the previous approaches.For the Charades dataset, our model outperformed previous
approaches by 0.9% in R@1 and by a significant margin of 4.6%, 7.6% and 6.0% in R@5, R@10 and MedianR respectively.ECLIPSE uses audio as additional information for video retrieval.We showed that multilingual text acts as a better knowledge transfer input.

Evaluation on English image retrieval datasets
Next, we measure the performance of MuMUR on English image retrieval dataset Flickr30k.The results are reported in the table 6 in both image-to-text and text-toimage settings.As shown in the table, MuMUR achieves comparable performance to previous approaches on image-to-text retrieval.Moreover, we observe that MuMUR significantly outperforms these models by 5.5% in image-to-text retrieval.Note that, these models are pretrained on large amount of image-text pairs and fine-tuned on Flickr30k dataset.In contrast, our model uses a small amount of multilingual data and achieves remarkable results in both the settings.This validates that multilingual data acts as superior knowledge transfer even for image retrieval.Table 8 Text-to-image retrieval (R@1 metric) results on Multi30K [16] dataset (de -German, fr -French and cs -Czech).Results of other methods are taken from [45].Our model is fine-tuned for each language and evaluated on the corresponding test data.

Evaluation on multilingual video retrieval datasets
In addition to the monolingual datasets, we also evaluate the proposed approach on multilingual video retrieval datasets.Specifically, we use the model trained only using French captions and test on 6 languages such as German (de), Czech (cs), Chinese (zh), Swahili (sw), Russian (ru) and Spanish (es) in a zero-shot setting.Table 7 shows the results on MSRVTT-multilingual dataset.Our model achieved a significant boost of 8.2% (average) in R@1 in a zero-shot setting.It is worth noting that our model in a zero-short evaluation outperformed the previous approaches fine-tuned on these languages by a huge margin of 6.1% (average).MMP [23] is pretrained on the large scale multilingual dataset HowTo100M on 9 languages.However, our model trained on just 1 language outperformed MMP.This shows that our dual cross-modal (DCM) encoder block can effectively learn the association among video, English and multilingual representations even when large video pretraining is not involved.

Evaluation on multilingual image retrieval datasets
We also evaluate the proposed model MuMUR on the multilingual image retrieval dataset Multi30K.Table 8 show the results on 3 languages German (de), French (fr) and Czech (cs) fine-tuned for these languages.As shown in the table, MuMUR significantly outperforms previous models by 3.1% for German (de), 3.2% for French (fr) and 4.3% for Czech (cs) languages.

Effect of multilingual knowledge transfer
We investigate the effect of multilingual knowledge transfer on the video-retrieval performance.Precisely, we train a model without the multilingual text encoder keeping the rest of the architecture intact.As shown in Figure 3, using multilingual data as knowledge transfer significantly improved the performance on DiDeMo and Charades datasets.The improvement is 2.4% for DiDeMo and 3.62% for Charades datasets.

Using only multilingual text encoder
Next, we ablate the choice of using an English text encoder.We validated previously that multilingual data improves the performance of video retrieval.This raises the question: Why a separate English text encoder is required if multilingual text encoder can be used for both English text and multilingual text representations?In Figure 4,    English data.Hence, leveraging a separate English text encoder showed much superior performance to using multilingual text encoder for English text.

Effectiveness of dual cross encoder block
Next, we ablate the effectiveness of dual cross encoder block.We train a model without the DCM block and directly compute the loss between video representations and English text representations and video representations and multilingual text representations.From Figure 5, we can see that the model using DCM block achieves better performance than the model without the encoder block.This justifies our motivation to use DCM block in our model.
Fig. 9 We ablate the sampling strategy used for selecting video frames on MSRVTT-9k dataset.We observe that uniform and random sampling techniques achieve similar performance for all the video frames.

Training with more languages
Next, we ablate training our model with more than one language.Concretely we train our model with German (de) and Spanish (es) captions.These languages are chosen because their performance on XNLI dataset [13] is comparable to English.The results are shown in Figure 6 and it is seen that training with more languages improved the performance on video retrieval.These results validate that multilingual data act as an effective knowledge transfer mechanism for improving video retrieval.

Effect of video frames
We investigate the impact of the frequency of video frames on the retrieval performance.We train the MuMUR model with increasing number of video frames in intervals of 4. Figures 7 and 8 demonstrate the results of this ablation study on MSRVTT-9k and MSVD datasets respectively.As shown in the figure, the R@1 score is the highest when the model is trained with 16 video frames in both text-to-video and video-to-text retrieval settings.Therefore, these results motivate the use of 16 frames for training MuMUR on video retrieval datasets.

Effect of video frame sampling
Following, we study the impact of choosing video frames at random vs with an uniform manner.Figures 9 and 10 illustrate the results for varying video frames on MSRVTT-9k and MSVD datasets respectively.It is clear from the figures that uniform and random sampling strategies achieve nearly the same performance.However in case of MSVD, we observe that random sampling performed much better compared to uniform sampling.Nevertheless we see that random sampling fails for video frames greater than 20.

Conclusion
In this paper we introduced MuMUR, a multilingual knowledge transfer framework to improve the performance of multimodal retrieval.We constructed multilingual captions using off-the-shelf state-of-the-art machine translation models.We then proposed a CLIP-based model that enables multilingual knowledge transfer using a dual crossmodal encoder block.Experiment results on six standard multimodal retrieval datasets showed that our framework achieved state-of-the-art results on all the datasets.Finally, our model also showed superior performance to previous approaches on multilingual retrieval datasets in a zero-shot and fine-tune settings.In the future, we will focus on more efficient ways of multilingual knowledge transfer for multimodal retrieval.

Authors' contributions
A.M designed the model, performed major experiments and wrote the first draft.E.A helped in experiments and manuscript writing.G.B.M.S helped in experiments and manuscript writing.S.R helped in experiments and manuscript writing.S-Y.T helped manuscript writing and supervised the project.G.B supervised the project and reviewed the manuscript.V.L supervised the project and reviewed the manuscript.

Funding
This declaration is not applicable.

Availability of data and materials
The datasets used in this work are publicly accessible and code will be made public after paper acceptance.

Fig. 2
Fig. 2 Illustration of the proposed MuMUR model.The model takes as visual input (image or video), a corresponding English text query and a translated multilingual query.The multilingual text query is obtained using the off-the-shelf machine translation model.It is used only for the inference and is not part of the architecture.The video and English text features are extracted using CLIP model whereas multilingual text features are extracted using M-CLIP model.The features are then passed onto a cross-model encoder to learn the association in a common embedding space.Crossentropy loss is then applied to measure the similarity between text features R E and R vE , R M and R vM .The final loss is the sum of both the losses.

Fig. 3 Fig. 4
Fig.3Comparison of models with and without using multilingual data as input.The first model takes as input only video and English text captions whereas the second model takes video, English text and multilingual text captions as input.As shown in the figure, using multilingual text data as knowledge transfer significantly improved the performance.

Fig. 5 Fig. 6
Fig.5Comparison of models with and without DCM block in the architecture.Using DCM block in the architecture showed superior performance to models without the DCM block.

Fig. 7
Fig. 7 Figure shows the effect of the number of video frames used to train MuMUR model on MSRVTT-9k dataset.Results demonstrate that R@1 score is the highest when 16 frames are used in text-to-video and video-to-text settings.

Fig. 8
Fig. 8 Figure demonstrates the performance of MuMUR for varying number of video frames on MSVD dataset (single caption evaluation).It is evident that the R@1 score is maximum at 16 frames when evaluated in both text-to-video and video-to-text settings.

Fig. 10
Fig. 10 Figure illustrates the comparative performance of random and uniform video sampling techniques on MSVD dataset.It is evident that random significantly outperforms uniform sampling but fails for larger number of video frames.
we show the results of two different model variants.The first model uses a separate English text encoder to encode English text captions whereas in the second model, both the English text and multilingual text are encoded using the same multilingual text encoder.Results show that encoding English text captions using a separate English text encoder surpasses the model using multilingual text encoder to encode both English text and multilingual text.Multilingual pretraining employs a part of English data whereas the English text encoder is pretrained on a comparatively larger