Visual and Semantic Guided Scene Text Retrieval

In this paper, we introduce a novel end-to-end trainable network for scene text retrieval. Diverging from state-of-the-art methods that match the visual features of individual character images, our network transforms the entire query text into a single query image. By integrating visual and language modules, the network extracts rich visual and semantic features from the query image, enabling efficient similarity modeling and query matching. This hybrid embedding of visual-semantic query-image features is highly robust to complex text styles and layouts. Experimental results on multiple benchmark datasets validate the superiority of our framework, especially on multilingual retrieval, where it achieves a 20.15% increase in mAP over the current state of the art. This significant performance boost highlights the potential of our network for multilingual scene text retrieval.


Introduction
Images are one of the primary carriers of information. With the rapid development of the internet, a vast number of images are disseminated online every day, and a significant portion of them contain text. The automated understanding of the text in these images has garnered widespread attention, for example in scene text detection [1-3] and scene text recognition [4-9]. Differing from the aforementioned studies, the task of scene text retrieval was first introduced by [5]. This task aims to search a collection of natural images for all images containing text that is identical or closely related to a given query text, making it a cross-modal matching task. As illustrated in Fig. 1, given a collection of natural images, we retrieve all images containing the query text "galaxy" and locate the position of the query text within each image. This task has practical applications in various fields, such as intelligent transportation, product search, book search [10], frame retrieval [11], and more.

Fig. 1 Given a query word 'galaxy', the goal of scene text retrieval is to find all images containing the query text and to determine its position in each image.
For scene text retrieval, given a query text and a collection of natural images, the retrieval system must return all images containing the queried text. Traditional retrieval methods [5, 12-15] design generic handcrafted features or dedicated encoding schemes for images and strings, creating a common feature space in which the distance between text images and the query text can be measured. However, since these methods operate on cropped text images, they cannot be directly transferred to the retrieval of natural scene images. Some end-to-end scene text recognition models [4,5,7,8,16] can use their recognition results directly for retrieval, but the retrieval outcome then relies heavily on recognition accuracy, making these methods less suitable for scene text retrieval. To address these issues, cross-modal models [17-19] achieve higher accuracy by directly measuring the distance between text instances and the query text. In this paper, we build upon and optimize the work in [17] and [20]. Firstly, we adjust the input on the query side. We observe that characters in natural scene text are mostly proportionally spaced, as in Fig. 2(a), because this is easier to read and aesthetically pleasing, whereas characters are equidistantly spaced in only a minority of texts, as in Fig. 2(b). However, the inputs in [17] and [20] are text strings and groups of character images, respectively, from which fixed-size features are obtained through interpolation, implicitly assuming that the characters in the text are equidistantly distributed. Based on this observation, we render the entire input query text as a query image, hypothesizing that a more realistic character distribution may enhance the model's understanding of scene text. Then, we downsample the query image with convolutional blocks and obtain fused features of the query image through visual and semantic modules. Finally, matching is performed by computing the cosine distance between the query features and the instance features.
In summary, the main contributions of this paper are as follows: (1) We propose a new model for scene text retrieval; unlike previous work that inputs query text [17] or query characters [20], our approach takes a query image as input on the query side, which better models the visual distribution of scene text. (2) We obtain hybrid features through visual and semantic modules and impose alignment constraints on the visual and semantic features before calculating similarity, which makes it easier to model the similarity between the hybrid features of the two branches. (3) Experimental results demonstrate that our method achieves the highest accuracy on multiple benchmark datasets and is naturally suited to multilingual retrieval tasks.

Related Work
In early research [12-15, 21, 22], the task was addressed by designing encoding methods for text and text images, most of which were obtained by cropping text segments from document images. For instance, [12] proposed the use of a Pyramidal Histogram of Characters (PHOC) to encode text. In [21], text image features could be represented by both visual and textual representations, and retrieval was achieved by projecting these two types of feature representations into a common subspace to minimize the embedding distance; only visual features were used during retrieval. [15] employed random search and the MSER [23] algorithm to obtain text instances. These instances were embedded using a per-n-gram inverted index generation [24] algorithm, while the query text was encoded using PHOC; matching was performed by comparing the similarity of boundaries and gradients across the two modalities. Additionally, some methods utilize Convolutional Neural Networks (CNNs) for feature extraction. In [14], the Discrete Cosine Transform of Words (DCToW) for strings was introduced, with matching achieved by comparing the embedded features of the two modalities. In [13], an enhanced feature representation strategy named Levenshtein Space Deep Embedding (LSDE) was proposed. This strategy projects text strings and word images jointly into a shared space, where the Euclidean distance between each pair of points equals the edit distance of their corresponding words [16]. This model then serves as a teacher model to train a new model, and retrieval is accomplished by matching strings and word images. In [22], the query text is encoded using PHOC, and the PHOC encoding of text images is predicted with a convolutional neural network; comparing the two encodings enables query matching. However, these methods are designed for cropped text images and struggle to retrieve natural scene images.
Recent research has primarily focused on end-to-end scene text retrieval. [15] introduced the first end-to-end scene text retrieval model, utilizing YOLO [25] to detect text instances and simultaneously predict their corresponding PHOC encodings, with matching achieved by comparing against the PHOC encoding of the query text. [19] represents an improvement upon the work in [18]. [17] achieved satisfactory retrieval results by optimizing the embedding distance between query text and text instances: a text localization network provides several text proposals, and cross-modal matching between the query text and the text proposals is achieved through similarity learning. Additionally, scene text recognition models [4,5,7,8,16] can also be used for retrieval by directly using their recognition results, which places high demands on recognition accuracy and speed.
The most recent work [20] simplifies the task to a single modality by converting characters into images. Specifically, the query side takes character images as input, which are then assembled into the full query through bilinear interpolation. A text localization network provides several text proposals, which are then cropped for visual embedding, and visual matching is subsequently performed through similarity learning.

Methodology
As shown in Fig. 3, our network framework consists of a text instance embedding branch (upper branch), which locates text instances and embeds them into feature vectors of a specified size, and a query text embedding branch (lower branch), which transforms the query text into a query image, following the conversion method of [20]. After down-sampling and passing through an embedding module, the query branch computes similarity scores against the features produced by the instance branch, and images are returned ranked by these scores.
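To make the final matching step concrete, the sketch below ranks gallery images by the maximum cosine similarity between a query embedding and the embeddings of the text proposals detected in each image. This is a minimal illustration under the assumption of pooled, fixed-size embeddings; all function and variable names are ours, not taken from the paper's code.

```python
# Minimal sketch: rank gallery images by the best cosine match between the
# query embedding and each image's text-proposal embeddings (assumed pooled).
import torch
import torch.nn.functional as F

def rank_images(query_feat: torch.Tensor, instance_feats: list) -> list:
    """query_feat: (D,) embedding of the rendered query image.
    instance_feats: one (K_i, D) tensor of proposal embeddings per gallery image.
    Returns image indices sorted by descending similarity score."""
    q = F.normalize(query_feat, dim=-1)
    scores = []
    for feats in instance_feats:
        if feats.numel() == 0:                      # image with no detected text
            scores.append(float("-inf"))
            continue
        sims = F.normalize(feats, dim=-1) @ q       # cosine similarity per proposal
        scores.append(sims.max().item())            # best-matching proposal wins
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```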

Query Embedding Branch
This branch aims to capture standardized visual and semantic features of text. Firstly, we utilize a fine-grained transformation method to convert the query text into a word image. Compared with the character-level conversion in [20], text-level conversion makes the distribution of characters in the text closer to that in real scenes. In our experiments, Arial-Unicode is used as the default font with a font size of 32×32; the text is rendered in black on a white background, yielding a query image of size 32×120.
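A minimal sketch of this text-to-image conversion using Pillow is shown below; the font file path, the point size of 32 (our reading of the 32×32 character size), and the centering logic are assumptions rather than details from the paper.

```python
# Minimal sketch: render a whole query word in black on a white 32x120 canvas.
from PIL import Image, ImageDraw, ImageFont

def render_query(text: str, font_path: str = "ArialUnicode.ttf",
                 size=(120, 32)) -> Image.Image:
    canvas = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.truetype(font_path, 32)
    # measure the rendered string and center it on the canvas
    left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
    w, h = right - left, bottom - top
    draw.text(((size[0] - w) // 2 - left, (size[1] - h) // 2 - top),
              text, fill="black", font=font)
    return canvas
```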
Specifically, we have a collection of query texts $T = \{t_i\}_{i=1}^{n}$, where $n$ is the number of query texts in the current batch. Each $t_i$ is an individual query text composed from a character set $A = \{a_i\}_{i=1}^{r}$, where $r$ is the number of characters in the set and $a_i$ is an individual character. In English retrieval, this set comprises the 26 English letters plus 10 digits; in multilingual and Chinese retrieval, it includes characters from multiple languages. Firstly, we build the transformation from text to image, $t_i \rightarrow M^{H \times W \times 3}$, thus obtaining a collection of images $G = \{g_i\}_{i=1}^{n}$ as the query input. Then we use convolutions for down-sampling while doubling the number of channels, and employ depth-wise separable convolutions [26] and channel attention blocks [27] to form residual blocks, concatenated with convolutions, to enhance the features. In this way, we obtain shallow features of the query images $\hat{G} = \{\hat{g}_i\}_{i=1}^{n}$, $\hat{g}_i \in \mathbb{R}^{2C \times H' \times T}$, where $H' = H/8$, $T = W/8$, and $C$ is the dimension of the output feature vector. Through the Hybrid Embedding Module (HEM) (see Sec. 3.2 for details), we obtain query-side visual features $V^t = \{v^t_i\}_{i=1}^{n}$ and semantic features $X^t = \{x^t_i\}_{i=1}^{n}$. These are fused to produce the output $F^t = \{f^t_i\}_{i=1}^{n}$, where $T$ is the length of the embedded character sequence.
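The following PyTorch sketch illustrates the kind of query-side stem described above: three stride-2 convolutions reduce the rendered image by a factor of 8 while widening the channels, and residual blocks built from depth-wise separable convolutions with squeeze-and-excitation-style channel attention refine the features. Channel widths and block counts are our assumptions, not the paper's values.

```python
# Rough sketch of a query-image stem: stride-2 convs for /8 down-sampling plus
# residual blocks of depthwise-separable conv + channel attention.
import torch
import torch.nn as nn

class DWSepBlock(nn.Module):
    def __init__(self, ch: int, reduction: int = 4):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False)  # depthwise
        self.pw = nn.Conv2d(ch, ch, 1, bias=False)                        # pointwise
        self.bn = nn.BatchNorm2d(ch)
        self.se = nn.Sequential(                                          # channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):
        y = self.bn(self.pw(self.dw(x)))
        return torch.relu(x + y * self.se(y))          # residual connection

class QueryStem(nn.Module):
    def __init__(self, c: int = 64):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (c, 2 * c, 4 * c):               # three stride-2 stages -> /8
            layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                       nn.ReLU(inplace=True), DWSepBlock(out_ch)]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, img):                             # img: (B, 3, 32, 120)
        return self.net(img)                            # -> (B, 4c, 4, 15)
```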
During the training phase, we apply two types of data augmentation to enhance the generalization and robustness of the rendered query images: random vertical flipping and partial random occlusion of the word images, where the partial occlusion is achieved by superimposing rectangular noise masks. The generated query images are then subjected to the same image normalization as the images in the image library. Additionally, similar to [20], we use multiple fonts to render the query images; however, in the multilingual and Chinese experiments we only use the default font, because the character sets for multilingual and Chinese retrieval contain no fewer than 1,000 characters.
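A minimal sketch of these two query-image augmentations, random vertical flipping and partial occlusion with rectangular noise masks, is given below; the mask sizes and probabilities are illustrative assumptions.

```python
# Minimal sketch: vertical flip and random rectangular noise occlusion.
import random
import numpy as np

def augment_query(img: np.ndarray, p: float = 0.5) -> np.ndarray:
    """img: HxWx3 uint8 rendered query image."""
    out = img.copy()
    if random.random() < p:                       # random vertical flip
        out = out[::-1].copy()
    if random.random() < p:                       # occlude a random rectangle with noise
        h, w = out.shape[:2]
        mh = random.randint(h // 8, h // 3)
        mw = random.randint(w // 8, w // 3)
        y, x = random.randint(0, h - mh), random.randint(0, w - mw)
        out[y:y + mh, x:x + mw] = np.random.randint(0, 256, (mh, mw, 3), dtype=np.uint8)
    return out
```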

Instance Embedding Branch
This branch aims to locate all potential text-containing regions in natural images and then obtain the hybrid features of text instances through the Hybrid Embedding Module (HEM), whose structure is depicted in Fig. 4. For the text localization network, similar to [17] and [20], we use the single-stage anchor-free detector FCOS [28] to locate text, employing ResNet50 [29] as the feature extraction network and the Feature Pyramid Network (FPN) [30] to fuse features. The HEM contains a visual branch (upper branch) and a semantic branch (lower branch). The visual branch (VB) consists of a Visual Transformer (VT) [31] and two convolutions: we stack the basic VT structure in three layers and obtain visual features after down-sampling through the two convolutions. The semantic branch (SB) consists of two convolutions and a bidirectional GRU (Bi-GRU) [32]; the GRU is a recurrent neural network with two gating structures. The semantic branch first converts the image features into a sequence through the convolutions and then outputs semantic features through the Bi-GRU. Finally, the two feature vectors are concatenated and fused using an MLP.
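A rough PyTorch sketch of an HEM-style module as described above is given below: a visual branch with three transformer encoder layers followed by two down-sampling convolutions, a semantic branch with two convolutions feeding a bidirectional GRU, and an MLP that fuses the concatenated outputs. Layer hyperparameters and tensor shapes are assumptions, not the paper's configuration.

```python
# Rough sketch of a hybrid embedding module with visual and semantic branches.
import torch
import torch.nn as nn

class HEM(nn.Module):
    def __init__(self, c: int = 256, heads: int = 8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=c, nhead=heads,
                                               batch_first=True)
        self.vt = nn.TransformerEncoder(enc_layer, num_layers=3)       # visual branch
        self.v_conv = nn.Sequential(
            nn.Conv2d(c, c, 3, stride=(2, 1), padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, stride=(2, 1), padding=1), nn.ReLU(inplace=True))
        self.s_conv = nn.Sequential(                                   # semantic branch
            nn.Conv2d(c, c, 3, stride=(2, 1), padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, stride=(2, 1), padding=1), nn.ReLU(inplace=True))
        self.gru = nn.GRU(c, c // 2, bidirectional=True, batch_first=True)
        self.fuse = nn.Sequential(nn.Linear(2 * c, 2 * c), nn.ReLU(inplace=True),
                                  nn.Linear(2 * c, c))

    def forward(self, x):                       # x: (B, C, H', T) RoI / query features
        b, c, h, t = x.shape
        seq = x.flatten(2).transpose(1, 2)      # (B, H'*T, C) tokens for the transformer
        vis = self.vt(seq).transpose(1, 2).reshape(b, c, h, t)
        vis = self.v_conv(vis).mean(dim=2).transpose(1, 2)   # (B, T, C) visual features
        sem = self.s_conv(x).mean(dim=2).transpose(1, 2)     # (B, T, C) sequence
        sem, _ = self.gru(sem)                                # (B, T, C) semantic features
        fused = self.fuse(torch.cat([vis, sem], dim=-1))      # (B, T, C) hybrid features
        return vis, sem, fused
```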
Specifically, for an input image, the text localization network produces $k$ text instance proposals, and their Region of Interest (RoI) features $P = \{p_i\}_{i=1}^{k}$ are extracted through RoI-Align [33]. The HEM module produces visual features $V^e = \{v^e_i\}_{i=1}^{k}$ and semantic features $X^e = \{x^e_i\}_{i=1}^{k}$, as well as outputs $F^e = \{f^e_i\}_{i=1}^{k}$, where $f^e_i \in \mathbb{R}^{T \times C}$. Before computing similarity and matching, text recognition is first performed using the CTC loss ($L_c$). We then measure the distance between the visual and semantic features obtained from the two branches: for the visual features $V^t$ and $V^e$, we use the Mean Squared Error (MSE) as the visual loss $L_v$; for the semantic features $X^t$ and $X^e$, we use the smooth-$L_1$ loss as the semantic loss $L_s$, where its argument $x$ is the difference between $x^t_i$ and $x^e_i$.
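Under the assumption that query-side and instance-side features are paired one to one, the three auxiliary losses named above could be computed as in the sketch below (CTC for recognition, MSE for visual alignment, smooth-L1 for semantic alignment); the function signature is illustrative.

```python
# Minimal sketch of the auxiliary losses: CTC recognition loss plus
# MSE / smooth-L1 alignment between paired query and instance features.
import torch.nn.functional as F

def hem_losses(logits, targets, logit_lens, target_lens, v_t, v_e, x_t, x_e):
    # logits: (T, B, num_classes) raw per-frame class scores for CTC recognition
    l_c = F.ctc_loss(logits.log_softmax(-1), targets, logit_lens, target_lens)
    l_v = F.mse_loss(v_t, v_e)           # visual alignment loss
    l_s = F.smooth_l1_loss(x_t, x_e)     # semantic alignment loss
    return l_c, l_v, l_s
```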

Loss Function
The overall loss consists of five components: $L_d$ is the loss of the text localization network; $L_v$, $L_s$, and $L_c$ are the visual loss, semantic loss, and CTC loss introduced in Sec. 3.2; and $L_m$ is the image-query matching loss [17, 20]. To balance the multiple losses, after several experiments we set $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, and $\lambda_5$ to 1, 1, 1, 0.1, and 2, respectively.
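Assuming the weights pair with the losses in the order they are listed (the explicit formula is not reproduced here), the overall objective takes the usual weighted-sum form:

$$L = \lambda_1 L_d + \lambda_2 L_v + \lambda_3 L_s + \lambda_4 L_c + \lambda_5 L_m$$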

Experiment
First, we describe the datasets used in the experiments. Second, we detail the specifics of the experimental setup. Then, the framework we propose is evaluated and compared with current state-of-the-art models. Finally, to further validate the effectiveness of our proposed model, we conduct an ablation study.

Datasets
(1) IIIT Scene Text Retrieval (STR) [5]. The dataset contains 50 query texts and 10,000 images. Its scene texts vary in wording, style, and scene, making it a challenging dataset. (2) Street View Text (SVT) [34]. With 100 images in the training set and 249 images in the test set, we use all non-repeated words in the test set for testing. (3) SynthText-900K [18]. Contains about 925,000 synthesized text images, generated by an image synthesis engine that embeds text into real scene images. (4) Coco-Text Retrieval (CTR). CTR is a subset of Coco-Text [35]; we follow the experimental protocol of [17]. (5) Total-Text [36]. The dataset has a total of 1,555 images, with 1,255 in the training set and 300 in the test set. Its text layouts are diverse, containing horizontal, slanted, and curved text.
(6) Multi-lingual Scene Text (MLT) [37]. It contains scene images in 9 different languages; the training set contains 7,200 images and the test set 1,800 images. About 5,000 images that contain at least one English text instance were obtained by filtering, and we name this subset MLT-ENG. As in [20], we perform the multilingual retrieval experiments on MLT. (7) Chinese Street View Text Retrieval (CSVTR) [17]. The dataset contains 1,667 images covering 82 unique Chinese characters; most of the query terms come from street-view images containing stores, hotels, banks, etc.
For evaluating performance on the English datasets, we pre-train our proposed model on SynthText-900K and fine-tune it on MLT-ENG; all text instances used for training are in English. The datasets used for evaluating the model are SVT, STR, CTR, and Total-Text. Performance is measured using mean Average Precision (mAP), and inference speed is measured as the number of images the model can process per second (FPS).
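As a reference for this evaluation protocol, the sketch below computes average precision per query from similarity scores and binary relevance labels, and mAP as their mean; variable names are illustrative, not from the paper's evaluation code.

```python
# Minimal sketch of mAP: rank gallery images per query, compute AP, average over queries.
import numpy as np

def average_precision(scores: np.ndarray, relevant: np.ndarray) -> float:
    """scores: (N,) similarity per gallery image; relevant: (N,) 0/1 ground-truth labels."""
    order = np.argsort(-scores)
    rel = relevant[order].astype(float)
    hits = np.cumsum(rel)
    precision_at_k = hits / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / max(rel.sum(), 1))

def mean_average_precision(all_scores, all_relevance) -> float:
    return float(np.mean([average_precision(s, r)
                          for s, r in zip(all_scores, all_relevance)]))
```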
For evaluating performance on multilingual datasets, we train on the MLT training set and test on the test set.
For evaluating performance on Chinese datasets, we directly train on the multilingual dataset MLT, but replace the Chinese character set in the original MLT with an authoritatively published common Chinese character set, and evaluate on CSVTR.

Experimental Details
Our implementation is based on the previous works [17] (TDSL) and [20] (VSTR). We use VSTR's method for text-to-image conversion, but change the character-level conversion to text-level conversion. For image augmentation, the input image is resized to 640×640 during training, and augmentation strategies such as aspect-preserving scaling by a factor in [1, 2] (followed by randomly cropping a 640×640 region), aspect-preserving scaling by a factor in [0.5, 1] (padding the image border with black to 640×640), and random rotation in [-15°, 15°] are applied. For evaluation, the long side of the input image is resized to 1280 and the short side is scaled proportionally. After converting the query text into a query image, vertical flipping, small rotations in [-10°, 10°], and occlusion of rectangular blocks with Gaussian noise are applied to the query image; each of the three augmentation strategies is applied independently with probability 0.5.
The training phase consists of two parts. We first pre-train using the synthetic dataset SynthText-900K, followed by fine-tuning on the real-scene text dataset MLT-ENG. During pre-training, the batch size is set to 48; in the fine-tuning phase, the batch size is set to 32. We employ the same data augmentation and learning rate decay as in [17], and use AdamW [38] to optimize our model. The pre-training phase involves 60,000 iterations, and fine-tuning involves 20,000 iterations. For the Chinese and multilingual experiments, the batch size is set to 32 with 40,000 iterations. We train on 4 NVIDIA TITAN XP GPUs and evaluate on a single GPU.

Performance on the English Datasets
We evaluated the performance of our proposed framework on the English datasets; the results are shown in Table 1. Our approach achieves the best performance on the three benchmark datasets, obtaining a 2.96% improvement on Total-Text relative to the current state-of-the-art VSTR [20] and a 2.47% improvement on CTR relative to TDSL [17]. In multi-scale testing with input scales of 960, 1280, and 1600, the mAP scores increased by 1.8%, 4.14%, 3.35%, and 1.89%, respectively, also reaching the state-of-the-art level. As the source code of VSTR [20] is not publicly available, we also compared our results with the highly competitive TDSL [17]. As shown in Fig. 5, our method can correctly match curved texts.

Fig. 5 Input the query words 'house' and 'grand'. We show the top 5 retrieved images; in each group, the first row shows the retrieval results of TDSL [17] and the second row shows the results of our proposed framework. Green borders indicate correct retrievals, and errors are highlighted with red borders.
In terms of retrieval efficiency, the length of our model's embedded features equals the length of the output features, so the processing speed does not decrease and reaches the same level as VSTR [20]. Additionally, we use t-SNE [39] to reduce the dimensionality of the feature representations predicted by TDSL [17] and by our method. We queried 10 words, selected the top 60 images for each, and identified the positive samples based on the labels. Positive samples are displayed in different colors, negative samples are shown in grey, and the majority of positive samples are enclosed in circles; the center of each circle is marked with a star, and its radius is determined by the distance from the center to the second-farthest positive sample, as shown in Fig. 6. It can be observed that the positive samples predicted by our method are more concentrated than those predicted by TDSL [17].
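A minimal sketch of this kind of t-SNE visualization is shown below, assuming pooled image-level embeddings; positives are colored per query and negatives are drawn in grey. The names and plotting choices are ours, not the paper's.

```python
# Minimal sketch: project retrieved-image embeddings to 2D with t-SNE and
# color positive samples per query while drawing negatives in grey.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats: np.ndarray, query_ids: np.ndarray, is_positive: np.ndarray):
    """feats: (N, D) embeddings; query_ids: (N,) query of each sample; is_positive: (N,) bool."""
    xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    plt.scatter(xy[~is_positive, 0], xy[~is_positive, 1], c="lightgrey", s=8)
    for q in np.unique(query_ids[is_positive]):
        m = is_positive & (query_ids == q)
        plt.scatter(xy[m, 0], xy[m, 1], s=12, label=str(q))
    plt.legend(fontsize=6)
    plt.show()
```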

Performance on Multilingual Dataset
We further evaluated the performance of our proposed framework on the MLT test set, which contains 1,800 images with 7,000 unique characters and 8,342 unique texts. After filtering out some meaningless texts, 5,318 query texts were obtained. As shown in Table 2, our results in the multilingual experiments show a 20.15% improvement over the current state-of-the-art VSTR [20]. This is attributed to the fact that Arabic and Bengali text is written as connected wholes, and segmenting it into characters causes severe visual information loss, which our approach effectively avoids. As shown in Fig. 7, our method can correctly retrieve target images in different languages.

Performance on Chinese Dataset
To verify the generality of our method in retrieving non-Arabic languages, we used an authoritatively published common Chinese character set together with the 82 unique Chinese characters from the CSVTR dataset, obtaining Chinese character sets containing 1,019 and 3,755 characters, respectively. In the MLT dataset, we obtained a multilingual character set of 7,000 characters, including 4,896 unique Chinese characters and 2,104 characters of other languages. We conducted three experiments, using the two aforementioned Chinese character sets to replace the original Chinese character set in MLT, and also using the original MLT character set. The retrieval results are shown in Fig. 8. Note that, unlike [17] and [20], we did not use a synthetic Chinese dataset for these experiments but instead used MLT.
As shown in Table 3, our method achieves the highest accuracy when trained with either the 1,019-character or the 3,755-character Chinese set, improving the mAP scores by 6.58% and 9.3%, respectively.

Ablation Study
In this section, we perform ablation experiments on the main components of the proposed framework. Specifically, in Section 3.2 we introduced the HEM, which contains a visual branch (VB) and a semantic branch (SB). First, we run experiments using only the VB or only the SB; the results are shown in Table 4. Experiments employing a single visual or semantic module already yield highly competitive mAP scores. Notably, when solely utilizing the VB, a remarkable mAP score of 82.09% is attained on the Total-Text dataset, which is characterized by challenging tilted and curved text instances. This surpasses the state-of-the-art VSTR [20], underscoring the effectiveness of employing the entire query image for retrieval. Second, we explore text-collection input as in TDSL [17] and character-image-collection input as in VSTR [20]. It is evident from Table 5 that the query input format employing the textual holistic transformation markedly outperforms the alternative approaches, substantiating the facilitative role of holistic text visualization in multilingual retrieval.

Conclusion
In this paper, we present an end-to-end trainable scene text retrieval framework that considers both visual and character-level semantic information. Through the text-to-image conversion method, entire words are transformed into images, converting the cross-modal retrieval task into a problem of matching visual and semantic features. The experimental results demonstrate that converting entire texts into images and considering both visual and semantic aspects leads to state-of-the-art performance. Additionally, our proposed method supports an unlimited number of characters and styles. However, the method has a drawback: it is not very effective for retrieving long texts, due to the limited length of the converted query image. As a next step, we could extend the method by allowing a text input to correspond to two or more query images; these images would be obtained by uniformly splitting the input text and rendering each part, which might improve the model's performance on long-text retrieval.

Fig. 2 Examples of text in natural scene images. (a) Text with proportionally spaced characters; (b) text with equally spaced characters.

Fig. 3 The main framework of the network. It contains a query embedding branch (lower branch) and an instance embedding branch (upper branch).

Fig. 4 HEM structure. S denotes splitting the feature vector into two; C denotes concatenating the feature vectors along the channel dimension.

Fig. 6 t-SNE visualization of queries and gallery candidates.

Fig. 7 Query terms entered in multiple languages, with the top 3 highest-scoring images shown for each.

Fig. 8 Input the query words 'Yanghe Blue Classic' and 'Dai Mei Hot Pot', and select the top 8 images with the highest scores.

Table 1 Results on SVT, STR, CTR, and Total-Text. MS indicates multi-scale testing, and * denotes results obtained using the code officially published by the authors.

Table 2 Results on the MLT test set.

Table 3 Evaluation results on CSVTR.

Table 4 Ablation experiments on VB and SB.

Table 5 Ablation experiments with different forms of query input.