Bagging-based cross-media retrieval algorithm

It is hard to come up with a strong learning algorithm with high cross-media retrieval accuracy, but finding a weak learning algorithm with accuracy slightly higher than random prediction is simple. Based on this idea, this paper proposes an innovative Bagging-based cross-media retrieval algorithm (called BCMR). First, we use bootstrap sampling to draw random samples from the original set; the number of samples drawn by bootstrapping is set to be the same as the size of the original dataset. Secondly, the 50 bootstrap replicates are used to train 50 weak classifiers. In our experiments, we used homogeneous individual classifiers and eight different baseline methods. Last but not least, we generate the final strong classifier from the 50 weak classifiers using the sample voting integration strategy. By exploiting collective wisdom, we can get rid of bad decisions, giving the integrated model much better generalization ability. Extensive experiments on three datasets show that BCMR can significantly improve cross-media retrieval accuracy.


Introduction
With the advent of the Internet era, data representation has become more flexible and versatile. Multimedia data are therefore emerging on live platforms and the Internet, such as social networking and gaming sites. A variety of media expressions can convey the same information from different perspectives, forming complex relationships and organizational structures. The proliferation of various types of data information encourages the emergence of cross-media retrieval (Liu et al. 2017). Cross-media retrieval overcomes the limitations of traditional single-media retrieval methods (Xu et al. 2019; Sun et al. 2016), and it can retrieve the specific types of information we require from large amounts of data. Compared to single-media retrieval, it presents the data in a more colorful and detailed way, and it can meet customers' growing demands for data retrieval. Figure 1 depicts the retrieval of various types of data. When we search for a GPU on Amazon, we can see text, images, audio, and video about the GPU. We can find out the GPU's performance and price in the text or voice messages left by customers. The GPU's appearance is depicted by images from many different angles. Videos explain how to use it, how to install it, and more. When a user types in a media query, like ''pictures,'' the system automatically retrieves all of the media content on the query subject. Cross-media retrieval is intended for scenes where the query and the retrieval results involve various media types. When it comes to online searches, people have different requirements, and cross-media retrieval was created because a single type of search result could not meet them.
Cross-media retrieval is hard and challenging because the data are generally characterized as follows. Hybrid data of different modalities have complex organizational structures and are mostly unstructured or semi-structured, so they are hard to store and retrieve. The relationships between different modalities are usually implicit and hard to figure out. Noise in the different data modalities can easily destroy the one-to-one correspondence between cross-media descriptions and distort the semantics expressed in the data itself. The same object may be described more than once in the same modality, so we need the descriptions of other modalities to eliminate redundant descriptions. Due to the aforementioned characteristics, cross-media data can be hard to handle, and it is expressed with underlying features of different dimensions and attributes. Additionally, since different media data are incommensurable and not isomorphic, it is hard to measure cross-media correlations; the semantic gap is unavoidable in cross-media retrieval. Researchers worldwide have proposed several new cross-media retrieval methods to address these issues. Even neural networks (Vinyals et al. 2015; Srivastava and Salakhutdinov 2012) are used to train on the data in order to obtain a sophisticated method with high accuracy.
These days, ensemble learning is gaining popularity, and its benefits include the following: (1) It can obviously improve the classification accuracy of weak classifiers. (2) It is highly robust and can maintain high classification accuracy across a wide range of datasets. (3) Parameter selection is simplified: weak classifiers that have not been tuned to the best parameters can be combined to produce a better classification effect than a single classifier.
In this paper, we propose a Bagging-based cross-media retrieval algorithm called BCMR, because integrating weak learning algorithms is easier than finding a strong learning algorithm. We use bootstrap sampling (Zhang and Oliver 2010) to process a single training set and create 50 bootstrap replicates, since a single training set is insufficient to provide satisfactory and effective information; this greatly increases the variety of the training data. In cases where a single learner does not produce optimal results, integrating multiple learners takes advantage of the wisdom of the group to obtain the best value. In this paper, we aim to apply ensemble learning to cross-media retrieval.
The remainder of the paper is organized as follows. Section 2 introduces two major approaches to cross-media retrieval: similarity measurement and common space learning. Bootstrap sampling and Bagging are described in Sect. 3. The results of ensemble learning for cross-media retrieval are presented in Sect. 4. Lastly, we summarize the paper in Sect. 5.

Related work
Online multimedia data are evolving fast and getting a lot of attention these days, and cross-media retrieval research is in full swing. This retrieval mode has been improved by scholars in pattern recognition, probability statistics, and graph theory. Retrieval methods come in binary and real-value formats (Xie et al. 2016; Xu et al. 2020a). Methods like deep learning (Ngiam et al. 2011), dictionary learning (Zhu et al. 2014; Li et al. 2018), and graph correlation (Xu et al. 2021, 2020b) are popular real-value representation methods. These methods are widely used and have a high degree of accuracy. Deep learning has played a big role in cross-media retrieval, and deep neural networks (Zhang et al. 2022) have gained popularity in this field recently. With deep neural networks, the relationships between different types of data can be mined in great depth. Andrew et al. (2010) proposed deep canonical correlation analysis (DCCA), a parametric model that does not need to reference the training data when computing feature descriptions. DCCA is scalable in terms of computational complexity because it uses two deep neural networks, learning multiple nonlinear transformations that describe the isomorphic features of the various modal data. Because generative adversarial networks can mine the distributions of the various modes, Peng et al. (2016a) utilized a generator and a discriminator to capture information between heterogeneous data and model semantic conformance between modes. Wang et al. (2017) applied game theory to cross-modal retrieval, proposing adversarial cross-modal retrieval (ACMR). As adversaries, it uses a feature projector and a modal classifier, which can effectively smooth out the gap between image and text.
Dictionary learning methods typically take sparse representation into account when retrieving data in different modes. New dictionary learning methods have emerged recently. A novel method was developed by Shang et al. (2018); by using representation coefficients, it projects heterogeneous data into an isomorphic subspace. There are also methods (Zhuang et al. 2013) that extend single-type media to cross-media scenes, known as coupled dictionary learning: by using a sparse coefficient map, the query information is mapped into another kind of space. Bahrampour et al. (2015) proposed a modified dictionary learning method for task-driven learning; with it, dictionaries and classifiers for the different modes can be obtained simultaneously.
With graphs, there are two types of methods: graph regularization and graph-based methods. The first is frequently used in semi-supervised learning and can express intra- and inter-pair similarity. Zhai et al. (2014) developed a new method called joint representation learning (JRL); in addition to the original label information, it can fully mine the pairwise correlation data. By putting both the different modalities of samples and patches together, Peng et al. (2016b) built a graph that can exploit local information. Graph-based methods generate independent graphs for each modality. Tong et al. (2005) learned the task with graphs based on supervision information and graph constraints.
This paper makes three contributions: (1) We propose BCMR, the first method to combine Bagging with cross-media retrieval. (2) We investigate the effect of Bagging on a variety of datasets and find it promising for cross-media retrieval. (3) We evaluate the stability of different algorithms by looking at the accuracy improvement when Bagging is combined with the different methods.

A. Bootstrap sampling
Bootstrap (Efron 1979) simulates statistical inference by resampling the original data. With it, one can study the distribution characteristics of a set of data statistics, especially for problems such as interval estimation and hypothesis testing that are hard to derive using traditional methods. The basic idea is to resample within the scope of the original data. Because each element of the original data has the same chance of being drawn every time, such a sample is called a bootstrap sample (Parke et al. 1999). After $K$ samplings, the probability that a given sample is never drawn is $(1 - \frac{1}{K})^K$. Taking the limit, we obtain

$$\lim_{K \to \infty} \left(1 - \frac{1}{K}\right)^K = \frac{1}{e} \approx 0.368.$$
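The left-out probability can be checked empirically with a short simulation (a minimal sketch using only the Python standard library; the dataset size and seed are illustrative):

```python
import random

def bootstrap_replicate(data, rng):
    """Draw a bootstrap replicate: K samples drawn with replacement
    from an original dataset of size K."""
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(0)
K = 10000
data = list(range(K))

replicate = bootstrap_replicate(data, rng)
# Fraction of original samples that were never drawn into the replicate
left_out = 1 - len(set(replicate)) / K

# Theory: (1 - 1/K)^K -> 1/e ~= 0.368 as K grows
print(f"left-out fraction: {left_out:.3f}")  # close to 0.368
```

With a large enough original set, roughly 63.2% of the samples appear in each replicate and about 36.8% are left out, which is what gives each bootstrap replicate its diversity.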

B. Formal description of bagging
For the classification problem, assume the predictor $\varphi(x, L)$ predicts a class label $j \in \{1, \cdots, S\}$, where the training set $L$ is drawn from the distribution $P$. With $(Y, X)$ drawn from $P$ independently of $L$, the probability of correct classification for fixed $L$ is (Breiman 1996; Liang et al. 2011):

$$r(L) = \sum_{j=1}^{S} P(\varphi(X, L) = j \mid Y = j)\, P(Y = j).$$

We denote $Q(j \mid x) = P_L(\varphi(x, L) = j)$; averaging over $L$, the overall probability of correct classification is

$$r = \int \sum_{j} Q(j \mid x)\, P(j \mid x)\, P_X(dx),$$

where $P(j \mid x) = P(Y = j \mid X = x)$. The aggregated (voted) predictor is $\varphi_A(x) = \arg\max_j Q(j \mid x)$, and its probability of correct classification is

$$r_A = \int \sum_{j} K\big(\arg\max_i Q(i \mid x) = j\big)\, P(j \mid x)\, P_X(dx),$$

where $K(\cdot)$ is the indicator function. Denote by $C$ the order-correct set

$$C = \{x : \arg\max_j Q(j \mid x) = \arg\max_j P(j \mid x)\};$$

then we can obtain

$$r_A = \int_C \max_j P(j \mid x)\, P_X(dx) + \int_{C'} \Big[\sum_j K(\varphi_A(x) = j)\, P(j \mid x)\Big]\, P_X(dx),$$

where $C'$ is the complementary set of $C$. The highest attainable correct classification rate is

$$r^* = \int \max_j P(j \mid x)\, P_X(dx),$$

and if the predictor is order-correct at every $x$, the aggregated predictor $\varphi_A$ has the correct classification rate $r^*$.

C. The process of integration

We use $R^T$ and $R^I$ to represent the feature spaces corresponding to the text and image modalities, respectively, and $S^T$ and $S^I$ to denote the isomorphic semantic subspaces. For any baseline model, the following mapping relationships can be obtained:

$$f_T : R^T \rightarrow S^T, \qquad f_I : R^I \rightarrow S^I.$$

Thus, we can get a representation in the isomorphic semantic subspace for any text/image sample. To determine how similar the query and the retrieved data are, we use the following formula:

$$D(T, I) = \big[\mathrm{dist}\big(f_T(t_p), f_I(i_q)\big)\big]_{p, q = 1}^{d},$$

The task of cross-media retrieval is completed by the semantic similarity matching of the different modal data.
where $D(T, I)$ represents the $d \times d$ distance matrix of the text data querying the image data, and $d$ represents the number of samples.
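The distance matrix can be sketched as follows (a minimal illustration; the Euclidean metric and the toy already-projected features are assumptions, since the distance function depends on the baseline method):

```python
import math

def distance_matrix(text_feats, image_feats):
    """D(T, I): matrix of Euclidean distances between text samples (rows)
    and image samples (columns) in the shared semantic subspace.
    The projections f_T, f_I are assumed to have been applied already."""
    return [[math.dist(t, i) for i in image_feats] for t in text_feats]

# Toy 3-sample example in a 2-D semantic subspace (illustrative values)
T = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
I = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
D = distance_matrix(T, I)
print(D[0])  # distances from the first text query to all image samples
```

Each row of $D$ then ranks the retrieved samples for one query, which is the input to the integration stage described next.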
Next, we introduce the integration strategy and integration process of BCMR in detail. As shown in Fig. 2, we assume that the testing set contains m image-text pairs. For the 50 different bootstrap replicates, we use the same baseline model to train the data and obtain 50 weak classifiers. When calculating the similarity between the texts and images in the testing set, 50 m × m dimensional distance matrices, called Distance_1 to Distance_50, are obtained.
We sort each row of the distance matrices from nearest to farthest; that is to say, from left to right in every row, the similarity between samples decreases. We transform the distances into the indices corresponding to the samples and obtain the matrices Sort_1 to Sort_50 in Fig. 2b. Suppose that we extract the n-th rows of all weak learners and arrange them in sequence into Reset_n in Fig. 2c, where n ∈ [1, m]; C2:1 represents the first row of the second weak classifier. For every Reset_n, we take the sample voting strategy for each column, that is, the final result is the majority of votes. If there are equal votes, the first sample is selected. We mark the integration result of each Reset_n as Bn. Arranging all Bn in order gives the final integration result.
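The per-position sample voting described above can be sketched as follows (a simplified illustration for one query row; the rankings are invented, and the tie-breaking follows the stated rule that on equal votes the first sample is selected):

```python
from collections import Counter

def integrate_rankings(rankings):
    """Fuse ranked index lists from several weak classifiers by
    per-position majority voting; ties go to the earliest-voted sample.
    rankings: one ranked list of sample indices per weak classifier."""
    n_positions = len(rankings[0])
    fused = []
    for pos in range(n_positions):
        votes = [r[pos] for r in rankings]          # column of Reset_n
        counts = Counter(votes)
        # Highest vote count wins; on a tie, the sample voted first wins
        best = max(counts, key=lambda s: (counts[s], -votes.index(s)))
        fused.append(best)
    return fused

# Rankings of 4 retrieval candidates from 3 hypothetical weak classifiers
rankings = [
    [2, 0, 1, 3],
    [2, 1, 0, 3],
    [2, 1, 0, 3],
]
print(integrate_rankings(rankings))  # [2, 1, 0, 3]
```

This sketch fuses a single Reset_n; the full BCMR procedure repeats it for every query row and stacks the results Bn into the final integrated matrix. (As a simplification, the sketch does not suppress an index that wins at more than one position.)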

Experiments
In this section, we refer to Bagging combined with a baseline method as a BCMR method. A series of experiments and eight baseline methods were used to evaluate the performance of BCMR. The experiments included image-retrieving-text (I2T) and text-retrieving-image (T2I) tasks, as well as their average retrieval scores, on three popular benchmark datasets: Labelme, Wikipedia-CNN, and Pascal Sentence.

1) Datasets and evaluation metrics
The Labelme dataset (Russell et al. 2008) is an online image dataset for cross-media retrieval created by MIT's Computer Science and Artificial Intelligence Laboratory. It contains 2688 images and the tags associated with them; 2016/672 image-text pairs are used for training and testing, respectively. The dataset is divided into eight distinct categories (corresponding to eight distinct outdoor scenes), and each image is assigned to one of them. Images are represented by 512-dimensional GIST features, and texts are represented by an index vector of selected tags.
The Wikipedia-CNN dataset is selected from Wikipedia features and contains 2866 image-text pairs that are randomly divided into two sets, 2173 for training and 693 for testing. All of the samples are classified into ten semantic categories. The image features are 4096-dimensional CNN features, while the text features are 100-dimensional Latent Dirichlet Allocation features.
The Pascal Sentence dataset contains 1000 image-text pairs divided into 20 semantic categories, so each category contains 50 pairs, with 30 pairs for training and 20 pairs randomly chosen for testing. As the image feature representation, 4096-dimensional CNN features are extracted. The BoW representation for text is obtained using 300 word roots, and its probability distribution over 100 topics, computed with Latent Dirichlet Allocation, serves as the text feature representation.

Fig. 2 The integration process during the retrieval stage. a The distance between each query sample and the retrieval sample is calculated for each of the 50 weak classifiers. b Distances are sorted from shortest to longest. c The sample voting strategy is used to integrate multiple weak classifiers. d The matrix of integrated similarity
The sample number of the above three datasets is listed in Table 1.
To validate the efficacy of our method, we use three commonly used evaluation metrics in the cross-media field to assess performance. The mean average precision (MAP) is the mean of the average precision (AP) over all queries. Precision-recall (PR) curves indicate a method's ability to learn latent concepts. Per-class MAP displays a more detailed MAP score for each category.
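As an illustration, MAP can be computed from ranked relevance judgments as follows (a minimal sketch; the toy relevance lists are invented for demonstration):

```python
def average_precision(relevance):
    """AP for one query: relevance is the ranked list of 0/1 flags,
    ordered from the top-ranked retrieved item downward."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each hit
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_relevance):
    """MAP: mean of AP over all queries."""
    return sum(average_precision(r) for r in all_relevance) / len(all_relevance)

queries = [
    [1, 0, 1, 0],  # AP = (1/1 + 2/3) / 2
    [0, 1, 0, 0],  # AP = 1/2
]
print(round(mean_average_precision(queries), 4))  # 0.6667
```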
2) Baseline methods

Partial least squares (PLS) (Rosipal and Kramer 2006) is a statistical data analysis model. Partial least squares regression is especially useful when the prediction matrix has more variables than observations. Using the prediction and observation variables, a linear regression model is created in a new space. PLS combines the characteristics of PCA, CCA, and LRA during the modeling process.
Semantic matching (SM) (Rasiwasia et al. 2010) is a method for representing texts and images at a higher level. Texts and images are projected into the same semantic subspace so that the different modalities can be compared directly.
The most traditional task-driven method is MDCR (Wei et al. 2015). It learns separate mapping matrices for each subtask (I2T and T2I), ensuring that both I2T and T2I results are optimized.
JFSSL (Wang et al. 2016) solves the similarity measurement problem of different modal data using feature projection, with ℓ2,1-norm constraints for coupled feature selection. Graph regularization is applied to the data projection in order to maintain the intra-modality and inter-modality relationships of the multi-modal data.
In this field, CR-CMR is a very effective method. While other methods use sparse representation, it uses collaborative representation. This method combines elements of both the dictionary learning method and the semantic mapping method.
Another modality-dependent, high-accuracy method is JGRMDCR (Yan et al. 2017). It fully exploits one-to-one correspondence between image-text pairs before exploring intra-modal and inter-modal similarities.
The goal of PL-ranking is to optimize the ranked list. It combines a pairwise ranking loss constraint, a listwise constraint, and regularization to improve querying accuracy.
The GSS-SL method is semi-supervised. First, semantic labels are generated using label graph constraints, which is what makes the method semi-supervised. Then, labels containing semantic information are used to connect the different data modes. Following JFSSL, we divide labeled and unlabeled samples into different proportions and validate the method's effectiveness in various ways.
3) Evaluation metric and parameter tuning

In the following content, we use MAP and PR curves as common evaluation criteria for cross-media retrieval tasks.
As shown in Table 2, we used the SM algorithm as the baseline algorithm to perform the retrieval tasks on the Pascal Sentence dataset in order to investigate the effect of the number of samplings on the experimental results. The values in the first row are the experimental results of the original SM algorithm. We can clearly see that as the number of samplings increases from 10 to 100, the MAP scores for I2T, T2I, and their average value increase, rising very quickly from 10 to 50 and growing slowly from 50 to 100, while the running time increases sharply with the number of samplings. In this paper, we chose 50 samplings based on the MAP scores and running time.
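The diminishing returns beyond roughly 50 replicates are consistent with a simple majority-vote model. Assuming, purely for illustration, independent weak classifiers that are each correct with probability 0.6 (an assumption that real bootstrap classifiers only approximate), the probability that a strict majority is correct grows quickly at first and then flattens:

```python
from math import comb

def majority_accuracy(n, p):
    """Probability that a strict majority of n independent voters,
    each correct with probability p, is correct (use odd n: no ties)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# Accuracy rises steeply up to a few dozen voters, then levels off
for n in (11, 51, 101):
    print(n, round(majority_accuracy(n, 0.6), 3))
```

The gain from 11 to 51 voters is much larger than the gain from 51 to 101, mirroring the MAP-versus-sampling-times trend in Table 2.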

Accuracy
(1) Results on the Labelme dataset

Table 3 shows the MAP scores obtained by the eight baseline methods and the corresponding methods after integration on the Labelme dataset. MAP scores after integration are obviously much higher than those of the original methods in various cross-media retrieval tasks. Bagging+PL-ranking achieved the most noticeable improvement for the I2T task, the T2I task, and their average retrieval scores. Based on the fact that the Bagging algorithm can improve the retrieval performance of unstable learning algorithms, while its effect on stable learning algorithms is unremarkable, we can conclude that the GSS-SL algorithm is more stable than the other algorithms, and the PL-ranking algorithm is the most unstable on the Labelme dataset. Figure 3 depicts the PR curves of the eight baseline methods and their corresponding methods after integration on the Labelme dataset. It is clear that the integrated methods are more accurate at the highest levels of recall. Figure 4a, c, e depicts additional analysis of MAP performance on Labelme. BCMR is found to produce better results than the baseline methods in the majority of classes. For the I2T task, BCMR achieves higher MAP scores than the corresponding baseline methods on the 'street' and 'highway' classes, and it achieves good results on the 'street,' 'coast,' and 'open country' classes for the T2I task, as well as on the average retrieval scores.
(2) Results on the Wikipedia-CNN dataset

Table 4 shows the MAP scores obtained by the eight baseline methods as well as their corresponding methods after integration on the Wikipedia-CNN dataset. The higher the accuracy, the more effective BCMR is. Although the PLS baseline method has the lowest accuracy of the baseline methods, Bagging+PLS significantly improves the I2T and T2I tasks and their average retrieval scores, approximately 9.9 percent, 17.2 percent, and 14 percent higher, respectively, than the baseline PLS method. Figure 5 depicts the PR curves on the Wikipedia-CNN dataset. Further analysis of MAP performance on Wikipedia-CNN is given in Fig. 4b, d, f. Although BCMR has slightly lower precision on one class, it has higher MAP scores on the majority of classes in the Wikipedia-CNN dataset.
(3) Results on the Pascal Sentence dataset

Although the number of categories in the Pascal Sentence dataset is greater than in the previous two datasets and the number of samples is much smaller, it can be seen in Table 5 that BCMR achieves good performance. BCMR's improvement over the JGRMDCR baseline method is marginal: because the Bagging algorithm has limitations, it is difficult for it to improve accuracy for the relatively stable JGRMDCR algorithm.
As illustrated in Figs. 6 and 7, BCMR performs well on both indicators, PR curves and MAP scores, on the Pascal Sentence dataset when using bootstrap sampling. BCMR's success is due to its ability to integrate multiple weak classifiers, obtain a more reasonable classification boundary than a single classifier, reduce the overall error rate, and significantly improve classification performance. At the same time, our method has the advantages of high robustness and simple parameter selection, so it outperforms the baseline methods.

Conclusion
In this paper, an innovative approach known as BCMR is proposed, which is the first cross-media retrieval algorithm to use the integration concept in the retrieval stage. The algorithm is split into two stages: (1) Sample collection. This paper uses the bootstrap method, i.e., random sampling with replacement. (2) Integration in the retrieval stage. The sample voting method is used to integrate the decisions of the 50 weak classifiers. We can conclude from the above extensive experiments that BCMR can greatly improve cross-media retrieval accuracy.

Data availability Enquiries about data availability should be directed to the authors.

Declarations
Conflict of interest The authors declared that they have no conflict of interest in this work.
Ethical approval There are no animal or human experiments in this paper.
Informed consent None.