Multi-instance embedding learning with deconfounded instance-level prediction

Confounded information is an objective fact when using multi-instance learning (MIL) to classify bags of instances; it may be inherited by MIL embedding methods and lead to questionable bag label prediction. To address this problem, we propose the multi-instance embedding learning with deconfounded instance-level prediction algorithm. Unlike traditional embedding-based strategies, we design a deconfounded optimization goal to maximize the distinction between instances in positive and negative bags. In addition, we present bag-level embedding with feature distillation to reduce the MIL classification task to a single-instance learning problem. Through theoretical analysis, the embedding cohesiveness and feature magnitude metrics are developed to explain the benefits of the proposed deconfounded technique in MIL settings. Extensive experiments on thirty-four data sets demonstrate that our proposed method achieves the best overall performance among state-of-the-art MIL methods. In particular, this strategy has a substantial advantage on web data sets. Source codes are available at https://github.com/InkiInki/MEDI.


Introduction
Multi-instance learning (MIL) was originally designed for drug activity prediction [3]. In contrast to traditional single-instance learning (SIL), each object in MIL is a bag containing a variable number of instances. A label is assigned to the bag, but not to the individual instances. To date, MIL has been frequently utilized in a variety of applications, such as image classification [19,23], text categorization [9,18], sentiment analysis [1], web index recommendation [12,17], whole slide images [8,24], and video anomaly detection [7]. Among them, the embedding-based approaches are one of the representative research directions in MIL, with the primary notion of transforming bags into new feature vectors and establishing the learning process using SIL methods [18,20,21,25].
However, the majority of MIL embedding methods usually ignore the confounded information among bags, as shown in Fig. 1. When we are interested in retrieving an image with wolves from images containing different objects (such as wolves, snow, and forests), confounded information that indicates the difference in color or contrast among the images may lead the MIL embedding learner astray. For example, when utilizing a bag generator [14] to obtain a bag representation of an image, features such as image color are heavily weighted, so the processed bag inevitably contains confounded information. Physical intervention on the training images and an associated optimization goal [8] are the most recent approaches to this challenge. Unfortunately, the input to the MIL embedding learner is typically a preprocessed bag that is generated while ignoring confounded information, making physical intervention methods impossible.

Fig. 1 (a) Image A and its histogram; (b) Image B and its histogram; (c) Image C and its histogram
In this paper, we propose the multi-instance embedding learning with deconfounded instance-level prediction (MEDI) algorithm to address the above challenges. As depicted in Fig. 2, we design a deconfounded instance-level prediction tactic to mitigate the impact of confounded information among bags. Its core is using the designed deconfounded optimization goal to maximize the distinction between instances in positive and negative bags. To better acquire each bag's embedding vector in the new feature space, we additionally present bag-level feature distillation based on traditional embedding methods to eliminate some instance-level interference information.
The contributions of this work are summarized as follows:
- The deconfounded instance-level prediction is designed to handle the challenge of confounded information among bags, which is usually ignored by MIL embedding methods.
- Feature distillation is introduced into the bag-level embedding to provide better embedding results of bags for their label prediction.
- The embedding cohesiveness and feature magnitude metrics are designed and used in a series of associated experiments to assess the quality of the embedding results and verify the efficacy of our method.
Experiments were undertaken on thirty-four MIL classification data sets to quantify the performance of MEDI. These data sets come from a variety of fields, including drug activity prediction, text classification, image classification, and web index recommendation. The experimental results show that, in most cases, MEDI outperforms state-of-the-art algorithms, with especially significant benefits on the web data sets.

Related work
Multi-instance embedding learning originated from the research of [2] on drug activity prediction and image classification, which has the core idea of embedding bags into a new feature space and training a model using SIL methods. Since then, many excellent algorithms of this type have been proposed.
MILIS [4] provides an alternating optimization framework by combining the phases of instance selection and classification in an iterative convergent fashion. MILFM [5] addresses the problem that most feature mappings overlook the discriminative power and noise of the generated features by using discriminative feature mapping and feature selection. miFV [15] extracts information from the instance space using the Gaussian mixture model and derives the embedded vector using the Fisher vector, although the time complexity of this technique increases as the dimensionality of the data set grows. miVLAD [16] generates a codebook using instance-level kMeans to provide a novel embedding function. MILDM [17] designs an instance evaluation function to select the most discriminative instances and builds a mapping pool to embed bags. StableMIL [27] builds upon a novel connection between MIL and the potential outcome framework in causal effect estimation. ELDB [19] introduces discriminative analysis and self-reinforcement mechanisms under the concept of continual learning to maximize the designed discriminative optimization goal. Others include PL [11], AEMI [22], MIHI [21], and so on. Our work differs from most existing MIL embedding work in that we take into account the confounded information among bags as well as the interference instances within each bag, potentially improving the algorithm's performance.

Methodology
Denote $\mathcal{B} = \{(B_i, y_i)\}_{i=1}^{N}$ as a MIL data set of $N$ bags. The bag $B_i = \{x_{ij}\}_{j=1}^{n_i}$ contains $n_i$ unlabeled instances $x_{ij} \in \mathbb{R}^d$, i.e., each instance-level label $y_{ij} \in \mathcal{Y} = \{+1, -1\}$ for $x_{ij}$ is unavailable. The bag label $y_i \in \mathcal{Y}$ is the sole known supervision information in MIL, and it can be computed using the basic MIL assumption [3]:
$$ y_i = \begin{cases} +1, & \text{if } \exists\, j \text{ such that } y_{ij} = +1, \\ -1, & \text{otherwise.} \end{cases} \qquad (1) $$
Our goal is to convert the MIL classification task into an SIL one using two main tactics, given that there is confounded information among bags: (a) deconfounded instance-level prediction and (b) bag-level embedding with feature distillation.
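As a concrete illustration, the bag structure and label rule above can be sketched in a few lines of numpy. The bag sizes, dimensionality, and instance labels below are toy values of our own, not drawn from any benchmark data set; in practice the instance labels are unobservable and only the bag labels are given.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bag(n_instances, d=166):
    """A bag is simply a set (here, a matrix) of d-dimensional instances."""
    return rng.normal(size=(n_instances, d))

def bag_label(instance_labels):
    """Basic MIL assumption: a bag is positive iff any instance label is +1."""
    return +1 if any(y == +1 for y in instance_labels) else -1

# Three toy bags with (normally hidden) instance labels.
bags = [make_bag(4), make_bag(2), make_bag(5)]
instance_labels = [[-1, +1, -1, -1], [-1, -1], [+1, +1, -1, -1, -1]]
labels = [bag_label(ys) for ys in instance_labels]
print(labels)  # [1, -1, 1]
```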

Deconfounded instance-level prediction
In the MIL scenario, the confounded information among bags is a fundamental reality that can be handled through physical intervention and backdoor adjustment in neural network algorithms [8]. However, in MIL embedding approaches, this strategy is not viable because the learner's input is a preprocessed bag that is constructed without considering confounded information. To address this issue, we give deconfounded instance-level prediction with an optimization goal that maximizes the margin between instances from bags with distinct labels and minimizes the margin between instances from bags with the same label. By introducing cross-entropy [22] and the $l_2$-norm, the above optimization objectives can be formulated as
$$ \min_{\Theta}\ -\sum_{i=1}^{N} \sum_{j=1}^{n_i} \sum_{c \in \mathcal{Y}} \mathbb{1}[y_{ij} = c] \log p^{c}_{ij} \; + \; \lambda \lVert \Theta \rVert_2^2, \qquad (2) $$
where $p^c_{ij}$ is the prediction probability of the instance label $y_{ij}$ based on a MIL learner $\mathcal{M}(\cdot)$ (such as a neural network), $\Theta$ is the parameter set of $\mathcal{M}$, and $\lambda > 0$ is a user-defined trade-off parameter.
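For intuition, a minimal numpy sketch of such a cross-entropy-plus-$l_2$ objective follows. The function name, the use of bag-inherited pseudo instance labels, and the flat parameter vector `theta` are illustrative assumptions of ours, not the paper's implementation.

```python
import numpy as np

def deconfounded_loss(probs, labels, params, lam=1e-3):
    """Cross-entropy over instance predictions plus an l2 penalty.

    probs  : (m, 2) array, rows are [p(y=-1), p(y=+1)] per instance
    labels : (m,) instance labels in {-1, +1} (here inherited from the
             bag label, a common MIL surrogate -- an assumption)
    params : flat array standing in for the network parameters Theta
    lam    : trade-off parameter lambda > 0
    """
    idx = (labels == 1).astype(int)                 # map {-1,+1} -> column {0,1}
    p_true = probs[np.arange(len(labels)), idx]     # probability of the true label
    ce = -np.mean(np.log(np.clip(p_true, 1e-12, None)))
    return ce + lam * np.sum(params ** 2)

probs = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = np.array([+1, -1, +1])
theta = np.zeros(10)
loss = deconfounded_loss(probs, labels, theta)      # pure cross-entropy here
```

Minimizing the cross-entropy term pushes the predicted probabilities of instances in positive and negative bags apart, while the $l_2$ term regularizes the learner's parameters.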
To achieve the instance-level prediction, we borrow the core idea of the attention mechanism [6] as follows. Let $\mathcal{M}(\cdot)$ be a neural network. The attentional representation of the instance $x_{ij} \in B_i$ is generated through two branches, where $W_t, W_s \in \mathbb{R}^{L \times D}$, $W_o \in \mathbb{R}^{L \times D}$, and $W_r \in \mathbb{R}^{d \times L}$ are parameters belonging to $\Theta$, $D$ is the number of nodes, and $d$ is the dimension. In addition, the branch outputs $h^a_{ij}$ and $h^r_{ij}$ are merged via $\oplus$, where $\oplus$ is element-wise addition. To improve the information extraction capability of the network and obtain the new representation and class probability vector of each instance, we add fully connected layers with $W_h \in \mathbb{R}^{L \times H}$, $W_e \in \mathbb{R}^{H \times E}$, and $W_c \in \mathbb{R}^{H \times 2}$, where $H$ and $E$ are the numbers of nodes; their outputs are the instance representation $l^e_{ij}$ and the class probability vector $p_{ij}$ (Eq. (6)).
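The per-instance forward pass can be sketched as below. This is a plausible gated-attention network in the spirit of ABMIL, written with a right-multiplication convention; the exact layer shapes, the ReLU nonlinearity, and the precise wiring of the two branches are our assumptions, not the paper's network.

```python
import numpy as np

rng = np.random.default_rng(1)
d, L, H, E = 166, 64, 32, 16   # illustrative sizes; L, H, E follow the text's naming

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Parameter matrices standing in for Theta (random here, trained in practice).
W_t = rng.normal(scale=0.1, size=(d, L))   # tanh branch
W_s = rng.normal(scale=0.1, size=(d, L))   # sigmoid (gate) branch
W_r = rng.normal(scale=0.1, size=(d, L))   # plain branch h^r
W_h = rng.normal(scale=0.1, size=(L, H))   # fully connected layer
W_e = rng.normal(scale=0.1, size=(H, E))   # representation head
W_c = rng.normal(scale=0.1, size=(H, 2))   # class probability head

def instance_forward(X):
    """X: (n_i, d) instances of one bag -> (l_e, p)."""
    h_a = np.tanh(X @ W_t) * sigmoid(X @ W_s)  # gated attentional branch h^a
    h_r = X @ W_r                              # plain branch h^r
    h = h_a + h_r                              # element-wise merge (the "+" above)
    l_h = np.maximum(h @ W_h, 0.0)             # hidden layer with ReLU
    l_e = l_h @ W_e                            # new instance representation
    p = softmax(l_h @ W_c)                     # per-instance class probabilities
    return l_e, p

X = rng.normal(size=(5, d))                    # a toy bag of 5 instances
l_e, p = instance_forward(X)
```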

Bag-level embedding with feature distillation
The embedding function is used to transform a bag into a new feature space, and its general definition is as follows:
$$ b_i = \left[\, d(B_i, \mathcal{K}_1),\ d(B_i, \mathcal{K}_2),\ \dots,\ d(B_i, \mathcal{K}_{|\mathcal{K}|}) \,\right], \qquad (7) $$
where $\mathcal{K}$ is a key sample set derived from the bag space $\mathbb{B}$, $\mathcal{K}_i$ is the $i$th sample of $\mathcal{K}$, and $d(\cdot, \cdot)$ is the distance between the bag and the key sample. By specifying the size of $\mathcal{K}$, the bag $B_i$ can be embedded as a vector $b_i$ in the new feature space.
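A minimal sketch of such a distance-based embedding is shown below, assuming a particular choice for $d(\cdot, \cdot)$ (average minimum instance distance), since the general definition leaves the distance open.

```python
import numpy as np

def bag_distance(B, K):
    """Average minimum distance from instances of bag B to instances of key
    sample K. The definition of d(., .) is left open in the text, so this
    particular choice is an assumption for illustration."""
    diff = B[:, None, :] - K[None, :, :]          # (|B|, |K|, d) pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distances
    return dist.min(axis=1).mean()                # nearest key instance, averaged

def embed(B, keys):
    """Eq.-(7)-style embedding: one coordinate per key sample."""
    return np.array([bag_distance(B, K) for K in keys])

rng = np.random.default_rng(2)
bag = rng.normal(size=(4, 3))                     # a bag of four 3-d instances
keys = [rng.normal(size=(2, 3)), rng.normal(size=(5, 3)), rng.normal(size=(3, 3))]
b = embed(bag, keys)                              # a |K| = 3 dimensional vector
```

Embedding a bag against itself yields a zero coordinate, since every instance's nearest neighbor is then itself.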
One disadvantage of Eq. (7) is that the chosen $d(\cdot, \cdot)$ has a significant impact on the embedding results. Therefore, by considering the probability distribution of instances in the bag, we design a new bag-level embedding with feature distillation [24], with three strategies:
1. Aggregated embedding (AE);
2. Maximum embedding (ME);
3. Maximum minimum embedding (MME);
where $p^c_{ij}$ is the $c$th element of $p_{ij}$, and both $p_{ij}$ and $l^e_{ij}$ are computed according to Eq. (6). The embedding function, when combined with a specific distillation strategy, can lessen the effect of some interference information in the bag and thus potentially improve classification performance, which will be investigated in the parameter analysis of the experiments. Algorithm 1 presents the pseudo code of the MEDI algorithm.
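Since the exact AE/ME/MME formulas are not reproduced above, the numpy sketch below encodes one plausible reading of the three strategies: aggregation as a probability-weighted mean, maximum as the representation of the most-positive instance, and maximum minimum as the contrast between the highest- and lowest-scoring instances. All three are our illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def distill(l_e, p, mode="AE"):
    """Bag-level embedding with feature distillation.

    l_e : (n_i, E) per-instance representations from Eq. (6)
    p   : (n_i,) positive-class probabilities p^{+1}_{ij}
    The three strategies below are illustrative readings of AE/ME/MME.
    """
    if mode == "AE":                        # aggregated: probability-weighted mean
        w = p / p.sum()
        return w @ l_e
    if mode == "ME":                        # maximum: most-positive instance
        return l_e[np.argmax(p)]
    if mode == "MME":                       # maximum minimum: contrast of extremes
        return l_e[np.argmax(p)] - l_e[np.argmin(p)]
    raise ValueError(mode)

l_e = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
p = np.array([0.1, 0.6, 0.3])
print(distill(l_e, p, "ME"))   # [0. 1.]
```

Each strategy collapses the variable-size bag into a single fixed-length vector, which is what allows an SIL classifier to be trained afterwards.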

Algorithm 1 The MEDI algorithm
Input: The MIL data set $\{B_i\}_{i=1}^{N}$ with each bag's label $y_i$;
Output: The trained neural network $\mathcal{M}$; the single-instance classifier $\mathcal{C}$;

Discussion
In the inevitable scenarios where physical intervention is not applicable, we devised the MEDI algorithm to optimize the prediction of positive and negative bags by adapting the deconfounded goals in [8]. MEDI includes two crucial parts, namely maximizing the distinction between positive and negative bags and embedding bags into the new feature space. Therefore, the instance-level deconfounded degree and the quality of the bag embedding results have a significant impact on the final classification performance. Equation (2) guarantees the former, and the embedding cohesiveness is defined to reflect the latter:
$$ \varepsilon_1 = \frac{1}{N} \sum_{i=1}^{N} \lVert b_i - \bar{b}_{y_i} \rVert_2, \qquad \varepsilon_2 = \lVert \bar{b}_{+1} - \bar{b}_{-1} \rVert_2, \qquad (13) $$
where $\bar{b}_{+1}$ and $\bar{b}_{-1}$ are the center vectors of the embedding vectors of the positive and negative bags, respectively. The average distance between the center vectors of all positive and negative bags and their related embedding vectors makes up $\varepsilon_1$, which is the embodiment of intraclass cohesion. Additionally, we want $\varepsilon_2$, the distance between the center vectors of bags with different labels, to be as large as possible; this is the embodiment of interclass coupling. Equation (13) is primarily used to explain the advantages of our design in MIL settings against rival methods. Furthermore, we design the bag's feature magnitude calculation with reference to [13] to highlight the benefits of the deconfounded technique to the algorithm itself:
$$ m_{i1} = \lVert b_i \rVert_2, \qquad m_{i2} = \lVert \tilde{b}_i \rVert_2, \qquad (14) $$
where $m_{i1}$ and $m_{i2}$ reflect the bag feature magnitude before and after being deconfounded, respectively, and $\tilde{b}_i$ is the embedding vector of the deconfounded bag. Related experiments will be covered in Sect. 4.4.
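The two metrics can be sketched directly from their verbal definitions: the cohesiveness ratio averages each bag embedding's distance to its class center and divides by the distance between the two class centers, and the feature magnitude is simply the $l_2$ norm of each bag embedding. The helper names and the normalization $\varepsilon = \varepsilon_1 / \varepsilon_2$ follow the paper's description; the toy data are our own.

```python
import numpy as np

def cohesiveness(b, y):
    """Normalized embedding cohesiveness eps = eps1 / eps2 (smaller is better).

    b : (N, E) bag embedding vectors; y : (N,) bag labels in {-1, +1}.
    eps1 averages the distance of each bag embedding to its class center
    (intraclass cohesion); eps2 is the distance between the two class
    centers (interclass coupling).
    """
    c_pos = b[y == +1].mean(axis=0)
    c_neg = b[y == -1].mean(axis=0)
    centers = np.where((y == +1)[:, None], c_pos, c_neg)  # per-bag class center
    eps1 = np.linalg.norm(b - centers, axis=1).mean()
    eps2 = np.linalg.norm(c_pos - c_neg)
    return eps1 / eps2

def feature_magnitude(b):
    """l2 feature magnitude of each bag embedding (m_i1 before deconfounding,
    m_i2 after, depending on which embeddings are passed in)."""
    return np.linalg.norm(b, axis=1)

b = np.array([[0.0, 1.0], [0.0, 3.0], [4.0, 0.0], [6.0, 0.0]])
y = np.array([+1, +1, -1, -1])
eps = cohesiveness(b, y)       # eps1 = 1, eps2 = sqrt(29)
```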

Experiments
In this section, we first describe the data sets used and the comparison algorithms. Then, after a parameter analysis, the MEDI algorithm is put to the test against six state-of-the-art approaches in a series of experiments. For each data set, the average accuracy of 5 times fivefold cross-validation (5CV) and its standard deviation (the value following "±") are reported.

Data sets
We conducted experiments on four types of MIL data sets: Drug activity prediction, text classification, image classification, and web index recommendation data sets. All of these data sets are available via [22].

Drug activity prediction
The benchmark data sets musk1 and musk2 are commonly used in drug activity prediction tasks [3]. Their goal is to predict whether a new molecule can be used to make a drug. In the MIL domain, a musk molecule is represented as a bag with a variable number of 166-dimensional instances. According to the basic MIL assumption, a molecule is positive if it possesses at least one instance that can be used to make a drug; otherwise, it is negative.

Text categorization
To conduct experiments, we employed ten text data sets derived from the Newsgroups corpus [29]. Each data set contains 50 positive and 50 negative bags. Each positive bag contains 3% of posts from the specified positive class and the rest from other classes, whereas the instances of negative bags are randomly drawn from the non-positive classes. Each instance is represented by a 200-dimensional TF-IDF feature vector.

Image classification
Corel, with 100 categories, is a famous database for the image classification task [2]. Each category contains 100 images in JPG format with a shape of 187 × 126 or 126 × 187. Elephant and tiger come from the Corel database, and both have been preprocessed by the Blobworld bag generator. To consider a more challenging scenario, we built ten mnist-bag data sets from the mnist classification data set by following the setting of [22].

Web index recommendation
The purpose of web index recommendation is to recommend interesting web page indexes to particular users. Each of the nine sub data sets in the web data set corresponds to a user's evaluation of web pages [28]. Each web page serves as a bag, and the links on the page serve as instances. Since web page processing is connected to word frequency, the web data sets have high dimensionality and sparsity.

Comparative algorithms
As comparisons, we employed six state-of-the-art MIL classification algorithms listed below:
1. miVLAD [16] uses the clustered centers of instance-level kMeans as key samples;
2. ABMIL [6] and LAMIL [10] are two popular MIL networks with well-known attention mechanisms;
3. ELDB [19] designs discriminative analysis and self-reinforcement mechanisms to optimize the distinguishability of bags' embedding vectors;
4. IMIL [8] takes into account the problem of confounded information in bags by deploying physical intervention and backdoor adjustment;
5. AEMI [22] generates embedding vectors by combining the attention mechanism with traditional MIL methods and employing a specially developed embedding function.
Table 4 shows the parameter settings for MEDI and the rival algorithms.

Parameter analysis
We set the number of nodes E of Eqs. (8), (9), and (11) to the number of bags in the data set to improve MEDI's adaptability to data sets from various fields. We therefore concentrate on two factors that influence classification performance: the number of training epochs and the mapping functions (AE, ME, and MME) that combine distinct feature distillation strategies. For each subfigure in Figs. 3, 4, and 5, the abscissa represents the number of training epochs and the ordinate represents the classification accuracy (except for Fig. 3a). Figure 3 shows the instance-level training loss and accuracy of MEDI on two exemplary data sets, musk1 and elephant, from the drug activity prediction and image classification tasks, respectively. Note that the figure only shows the results of one round of the experiment in one run of 5CV. The experimental results demonstrate that MEDI has good convergence and high instance-level deconfounding ability, which means that instances in the positive and negative bags can be well distinguished. Furthermore, the general trend of training loss is similar to that of training accuracy, except that the former converges faster. Notably, under the three distillation strategies on the musk1 data, MEDI's training accuracy approaches 1 in roughly 150 epochs. This is also why we do not pick a larger number of training epochs later, as doing so could result in severe overfitting.

Figures 4 and 5 show further experiments on the influence of MEDI's parameters. The experimental results show that on most data sets, such as text, mnist, and web, only a few epochs of training are required to achieve good results; the three distillation strategies have little effect on classification performance on the musk1 and elephant data sets, while significant disparities in classification performance are caused by changes in distillation mechanisms on the text, mnist, and web data sets.
Specifically, AE practically fails on the text data sets, which could be because these data sets are unusually sparse and have small feature values, making it difficult to exclude the confounded information in the bag. Based on the aforementioned parameter analysis, we make some recommendations for MEDI parameter settings, which are listed in Table 4.

Benefits of deconfounded
To facilitate a theoretical presentation of the benefits of the deconfounded process in MIL settings, we developed two metrics, Eqs. (13) and (14) (i.e., embedding cohesiveness and feature magnitude), in the Discussion section. Equation (13) reflects the deconfounded technique's pursuit of high cohesion and low coupling in the bag embedding results. However, not all embedding methods yield embedding vectors with the same dimensionality, which would cause evaluations to occur under unfair circumstances. The final evaluation results, displayed in Table 2, are therefore normalized as ε = ε1/ε2. The smaller the value of ε, the higher the quality of the embedding result after deconfounding. In addition, when combined with the classification performance of ELDB on musk1 (Table 3), it is clear that this value should not be made as small as possible, as doing so could cause the algorithm to overfit; this is a point we will need to work on in the future.

Table 4 (excerpt): embedding function, same as [16]; instance selection mode, maximum scores; other settings refer to [22]; the instance-level classifier only uses SVM

Fig. 4 Parameter analysis of MEDI on two representative data sets: a the musk1 from the drug activity prediction task and b the elephant from the image classification task

Figure 6 shows how the deconfounded mechanism affects the algorithm itself. Each subfigure contains two parts: the normed feature magnitudes $m_{i1}$ of the original bags and $m_{i2}$ (denoted by +) of the deconfounded bags, respectively. Further, positive and negative bags are represented by red and blue, respectively. The findings demonstrate that, while preserving some edge samples, the deconfounded technique can increase the difference in magnitude between the positive and negative bags, which may be advantageous in reducing overfitting.

Performance comparison
Table 3 shows the experimental results of MEDI and the six rival algorithms. The best accuracy value for each data set is highlighted in bold, and Average denotes the average classification performance across data sets. The results reveal that the MEDI algorithm outperforms the other algorithms in terms of overall classification performance, especially on the web recommendation and mnist data sets. The following reasons may apply: (a) the most intuitive goal of MEDI is to eliminate the confounded information in the bag as much as possible, making it easier to separate instances in bags with different labels and thus preparing for the later embedding; and (b) the designed embedding function with three feature distillation strategies is highly adaptable to data sets from various domains, which is particularly useful for data sets with distinctive characteristics, such as text data sets.

Fig. 5 Parameter analysis of MEDI on nine representative data sets: a the news.aa, news.rsh, and news.se from the text classification task; b the mnist0, mnist3, and mnist6 from the image classification task; and c the web1, web4, and web7 from the web recommendation task

Furthermore, some results merit closer attention: (a) on the text categorization data sets, MEDI achieves a moderate outcome, while miVLAD, LAMIL, and ELDB obtain relatively large advantages; for example, LAMIL has a considerable edge on the news.rsb data set, possibly because its novel network with the coupled loss function can effectively mine the information in this type of data set; (b) another factor is the lack of a distillation strategy, which results in the inability to effectively remove interference instances in text data sets; and (c) the overall categorization performance of IMIL is the worst, because its deconfounded technique is primarily intended for whole slide image classification and may not work in traditional MIL. All of the aforementioned comparison algorithms rely on embedding techniques.

Fig. 6 Evaluation results of feature magnitude on the two representative data sets, musk1 and mnist0

Experiments were run 5 times 10CV, and the average of the classification accuracy (± the standard deviation) is reported.
To develop a broader grasp of the MIL field, it is necessary to compare against other kinds of approaches, such as methods based on the maximum margin principle. Table 1 shows the comparison results under the experimental setup of [26]. The results show that our algorithm is superior on some data sets. However, there is still considerable room for improvement on the other data sets, and combining the margins and instance weights of the comparison methods may be a good direction.

Conclusion and further work
We proposed the MEDI algorithm to deal with confounded information in the bag by setting an optimization objective that fuses cross-entropy and the l2-norm. Furthermore, to obtain better bag embedding results in the new feature space, three feature distillation strategies are designed and coupled into a new embedding function. The experimental results prove that MEDI is superior to state-of-the-art MIL classification methods, with significant advantages especially on the web and mnist data sets. The following topics deserve further investigation:
- Adaptive feature distillation. The three distillation strategies adapt differently to different data sets. In the future, we will consider adaptively selecting an appropriate mechanism according to the characteristics of the data set.
- More efficient deconfounding. MEDI performs poorly on some data sets such as musk and fox, probably because the existing strategies do not handle the confounded information in these data sets well.