Few-shot object detection via message transfer mechanism

Abstract. Few-shot object detection aims to localize and recognize objects of novel classes with limited training instances. Due to the constraints of the two-stage fine-tuning mechanism, existing models lack the ability to reason over knowledge. When transferring the base model to novel class detection, we add a region-of-interest feature transfer branch, which establishes a message transfer mechanism between complex instances, ensuring mutual attraction between instances of the same category while allowing for association across different categories. Specifically, a self-attention message transfer graph is first constructed to facilitate the propagation of attribute information among target instances. Second, a box transfer loss function is proposed that combines the semantic relationships among instances to promote mutual exclusion among instances with significant category attribute bias, thereby constructing better category feature representations. Finally, we demonstrate the effectiveness of our proposed framework compared with other state-of-the-art methods on two popular datasets: PASCAL VOC and MS-COCO.


Introduction
Object detection, as a crucial research area in computer vision, is widely applied in fields such as autonomous driving, video surveillance, and medical diagnosis. 1,2 Traditional object detection based on deep learning heavily relies on large amounts of annotated data. However, in real-life scenarios, it is difficult to collect images in some fields (e.g., rare species and rare medical cases); the manual annotation of images is time-consuming and laborious, and annotation accuracy can significantly impact a model's performance. 3,4 Furthermore, a model trained on existing classes has difficulty coping with emerging classes, leading to a lack of rapid adaptability. Therefore, few-shot object detection (FSOD) has been proposed to reduce reliance on large numbers of annotated samples and to construct object detection models with strong generalization capabilities using only a minimal amount of data from novel classes. 5,6

The key to FSOD is that the knowledge learned from the base classes can be generalized to the novel classes with less supervision. 6 In the early years, many scholars 7,8 tried meta-learning strategies that use multiple independent tasks to match base class instances with novel class targets. In recent years, the two-stage fine-tuning approach TFA 9 has been proposed to improve the detection performance of less-supervised models. The overall framework of the basic TFA is shown in Fig. 1. In the first stage, a two-stage detector (mostly Faster R-CNN) completes the initial training of the model using a large amount of base class data. In the second stage, the detector freezes the backbone parameters for feature extraction and refines only the regression and classification branches using a few sample instances. MPSR 10 proposed a multi-scale positive sample refinement solution for the unbalanced scale distribution problem in FSOD. However, the feature maps of the corresponding scales were manually selected for input to the region proposal network (RPN), lacking the autonomy of network learning. To enhance the generation of foreground proposals for novel instances, the few-shot contrastive encoder (FSCE) 11 starts by unfreezing the RPN and region of interest (RoI) modules. The maximum number of proposals retained after non-maximum suppression (NMS) is doubled, while during loss calculation in the RoI head the number of proposals is halved (with half of the proposals in the fine-tuning phase being designated as background). FSCE then adds a contrastive loss function to increase the similarity of proposals of the same category and the discrimination between different categories. These existing structures fail to exploit the information transfer capability between instances and largely ignore the correlation between classes, which limits both the ability to distinguish related classes (e.g., distinguishing cats from dogs) and generalization across related classes (e.g., detecting dogs by generalizing from detecting cats).
To alleviate the above limitations, we design a novel few-shot object detector, MTM-FSOD, which implements a box message transfer branch that builds a relationship graph between instances and generalizes attribute knowledge of associated classes based on the self-attention relationships between instances. To better distinguish associated classes, we propose a box transfer loss (BTL) function, which employs the similarity of semantic embeddings and the consistency of classes to control the difference between classes, ensuring that instances of the same class attract each other and that instances of different classes repel each other.
The main contributions of this work are summarized as follows: (a) we propose MTM-FSOD, a novel method for FSOD, which adds a box message transfer branch to the fine-tuning phase on few-shot instances; (b) to improve the generalization of associated classes, we establish a graph of self-attention relationships between instances; (c) under the conditions of semantic embedding similarity and proposal class consistency, we propose a BTL function to discriminate between category differences; and (d) we demonstrate the effectiveness of MTM-FSOD compared with other advanced methods on two datasets: PASCAL VOC and MS-COCO.

Related Work

The dense relation distillation model was designed to cover all spatial locations by densely matching support features and query features exclusively in a forward propagation manner. The main transfer methods included LSTD, 15 TFA, 9 MPSR, 10 and FSCE, 11 in which new concepts were learned through fine-tuning. DeFRCN 16 introduced a gradient decoupling layer for multi-level decoupling and a prototypical calibration block for multi-task decoupling. The former was a new deep operation that redefined feature forward operations and gradient backward operations to decouple each layer from its predecessor; the latter was an offline prototype-based classification model that took the detector's proposals as input and calibrated the original classification scores with additional pairwise scores. FADI 17 expressed the feature space of each novel class in terms of the feature space of the base class with which it has the highest similarity, separating the novel class from the other base classes. Due to their complex network structures, the above algorithms are prone to overfitting. In contrast, we employ a BTL function to discriminate learned proposal representations instead of complicating the model.

Graph Attention Network
The remarkable ability of graph neural networks (GNNs) to handle unstructured data has enabled new breakthroughs in network data analysis, recommender systems, physical modeling, natural language processing, and combinatorial optimization problems on graphs. 18 Popular GNN variants include GCN, 19 GraphSAGE, 20 GAT, 21 and GAE. 22 Traditional GNNs require a node feature matrix and an adjacency matrix as input so that the aggregation operation over nodes can be performed. GCN 19 additionally introduces a degree matrix, which indicates the number of nodes each node is connected to; it applied the convolution operation from image processing to graph-structured data for the first time in a simple way. GraphSAGE 20 maintains the edge relationships between training samples in the training phase and consists of two main steps, named "sample" and "aggregate." "Sample" refers to how to sample the neighbors, and "aggregate" refers to how neighboring nodes update their own embeddings after aggregating the embedding information. To address the problem that GNNs aggregate neighbor nodes without considering the different importance of different neighbors, GAT 21 drew on the transformer's idea and introduced a masked self-attention mechanism, which assigns different weights to each node in the graph according to its characteristics when computing its representation. The GAE 22 is fed the adjacency matrix of the graph and the feature matrix of the nodes; the mean and variance of the low-dimensional node representations are learned by the encoder (a graph convolution network), and the graph is then reconstructed by the decoder. Given that GCN can effectively construct association relations between instances, we adopt a graph structure to improve the generalization of category associations.
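To make the role of the degree matrix concrete, here is a minimal sketch of a single GCN propagation step; this is our own illustrative code following the standard formulation, not taken from any cited implementation:

```python
import torch

def gcn_layer(x, adj, weight):
    """One GCN propagation step: relu(D^{-1/2} (A + I) D^{-1/2} X W).

    x: (N, d_in) node features; adj: (N, N) binary adjacency; weight: (d_in, d_out).
    """
    a_hat = adj + torch.eye(adj.size(0), device=adj.device)  # add self-loops
    d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))      # D^{-1/2} from the degree matrix
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt                 # symmetric normalization
    return torch.relu(a_norm @ x @ weight)                   # aggregate neighbors, then project
```

The symmetric normalization by the degree matrix keeps high-degree nodes from dominating the aggregation.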

Model Architecture
Given two sets of classes, C_base and C_novel, where C_base is rich in labeled data and C_novel has only few-shot instances, we require C_base ∩ C_novel = ∅. MTM-FSOD belongs to the two-stage fine-tuning framework. First, a standard Faster R-CNN is trained under the condition of sufficient base class data (D_train = D_base). Second, we transfer the base class knowledge to the novel classes by a fine-tuning strategy, given balanced few-sample random data (D_train = D_base ∪ D_novel). In the fine-tuning phase, we draw on the advantages of the FSCE 11 model: the backbone feature extractor is frozen while the RPN and RoI modules are unfrozen, and a novel box transfer branch is added to supervise the RoI feature extractor. Finally, we constrain the network with the box transfer, object classification, and box regression loss functions simultaneously. An overview of the MTM-FSOD method is shown in Fig. 2.

The proposed MTM-FSOD architecture is shown in Fig. 2, which uses Faster R-CNN as the base detection framework. FSCE observed that the scores of novel objects in the RPN are often low, and when lower-scoring objects are fed into the RoI head for learning novel class targets after NMS, the model can perform poorly on novel classes. Therefore, we draw inspiration from the FSCE structure, where the RPN and RoI are unfrozen to generate more foreground proposals for novel instances. When the proposals pass through the RoI feature extractor, the candidate boxes (p_i, y_i) are simultaneously fed into three branches, sketched below. The regression and classification branches follow the traditional detection model structure. To establish association transfer relationships between categories, the box transfer branch is proposed as a third constraint on the network. Given N candidate boxes, we first define them as N nodes to construct a self-attention transfer graph. Then, to improve the generalization of associated classes, we detail the message transfer mechanism (MTM) built from the transfer graph later. Next, semantic embeddings of the categories (each category corresponds to a one-dimensional word vector) are introduced and assigned to each node as prior knowledge of attribute associations. With node category consistency, we define a BTL function to guarantee the discriminability of the categories.
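As an illustration of this routing, the sketch below shows one plausible way the pooled RoI features could feed the three branches; ThreeBranchRoIHead and its layer shapes are our own placeholders under the stated dimensions, not the released MTM-FSOD code:

```python
import torch.nn as nn

class ThreeBranchRoIHead(nn.Module):
    """Illustrative RoI head routing pooled features to three branches."""

    def __init__(self, in_dim=1024, num_classes=21, proj_dim=256):
        super().__init__()
        # num_classes = 21 assumes 20 PASCAL VOC classes plus background
        self.cls_head = nn.Linear(in_dim, num_classes)      # classification branch
        self.reg_head = nn.Linear(in_dim, 4 * num_classes)  # box regression branch
        self.transfer_proj = nn.Linear(in_dim, proj_dim)    # entry to the box transfer branch

    def forward(self, roi_feats):
        # roi_feats: (N, in_dim) pooled features for N candidate boxes (p_i, y_i)
        cls_scores = self.cls_head(roi_feats)
        box_deltas = self.reg_head(roi_feats)
        transfer_feats = self.transfer_proj(roi_feats)      # later refined by the MTM graph
        return cls_scores, box_deltas, transfer_feats
```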

Message Transfer Mechanism
MTM is proposed to perform associative class generalization learning; it aggregates base class and novel class instance features to construct a self-attention transfer graph. The innovation of MTM is to capture the associations between instances to reduce misclassification and enhance model generalization. As shown in Fig. 3, the candidate box features z = f_θ(X) ∈ R^{N×d} generated by the RoI extractor are fed into the attention transfer graph (ATG) module after linear mapping and encoded in the graph structure, where N is the number of candidate boxes and d denotes the feature dimension. It is worth mentioning that, to improve the associated class features and handle the instance relationships within a batch, we build G = graph(z, e, u).
Here, z represents the initialized feature nodes, e ∈ R^{N×N} is the metric correlation matrix between every two nodes, and u indicates the existence of connections between the nodes. The ATG structure is elaborated in the Attention Transfer Graph section below. Class prototypes z' with attribute associations are then obtained from the contained correlation instances by applying the ATG structure M times. After that, z' is fed into a feedforward neural network to produce the final feature output. To improve the consistency of same-category proposals and the transferability of different-category attributes, we introduce a BTL function to push the instances of each category to form a closer cluster. Inspired by the supervised contrast in FSCE, our BTL function is defined as follows, taking the associated category attribute transfer into account. Specifically, for a batch of N proposal features {z'_i, u_i, y_i, a_i}_{i=1}^N, where y_i represents the category label, u_i denotes the IoU score with the matched ground-truth box, and a_i is the category semantic vector:

$$\mathcal{L}_{\mathrm{BTL}} = \frac{1}{N}\sum_{i=1}^{N} f(u_i)\,\mathcal{L}_{z_i'}, \tag{2}$$

$$\mathcal{L}_{z_i'} = -\frac{1}{N_{y_i}}\sum_{j\ne i}\mathbb{1}\{y_j=y_i\}\log\frac{\exp(z_i'\cdot z_j'/\tau_y)}{\sum_{k\ne i}\exp(z_i'\cdot z_k'/\tau_y)} -\frac{1}{N_{a_i}}\sum_{j\ne i}\mathbb{1}\{\cos(a_i,a_j)>\alpha\}\log\frac{\exp(z_i'\cdot z_j'/\tau_a)}{\sum_{k\ne i}\exp(z_i'\cdot z_k'/\tau_a)}, \tag{3}$$

where N_{y_i} represents the number of proposals consistent with y_i. Given a threshold value α, N_{a_i} is the number of proposals whose cosine similarity with a_i is greater than α. τ_y and τ_a denote the balancing hyperparameters of category consistency and semantic attribute transferability, respectively. The consistency of the categories is described by the FSCE contrastive loss function, where f(u_i) = 1{u_i ≥ φ}·g(u_i) considers only proposals sufficiently close to the regression target, and g(u_i) assigns a different weighting factor to proposals with IoU values greater than 0.8. Therefore, optimizing the BTL function improves the label consistency of same-class instances and the semantic attribute transferability of different-class instances.
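For readers who prefer code, the following PyTorch sketch mirrors our reading of Eqs. (2) and (3); the simplified gate w = 1{u_i ≥ φ}·u_i stands in for f(u_i) = 1{u_i ≥ φ}·g(u_i), and all default hyperparameter values are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def box_transfer_loss(z, y, u, a, tau_y=0.2, tau_a=0.2, alpha=0.8, phi=0.7):
    """Sketch of the BTL in Eqs. (2)-(3); hyperparameter defaults are assumptions.

    z: (N, d) projected proposal features; y: (N,) integer class labels;
    u: (N,) IoU with the matched ground truth; a: (N, k) semantic vectors.
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.t()                                     # pairwise feature similarity
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)

    same_y = ((y[:, None] == y[None, :]) & ~eye).float()   # category-consistency positives
    a_n = F.normalize(a, dim=1)
    sem_pos = (((a_n @ a_n.t()) > alpha) & ~eye).float()   # semantic-attribute positives

    def contrast(temp, pos):
        logits = sim / temp
        log_prob = logits - torch.logsumexp(
            logits.masked_fill(eye, float('-inf')), dim=1, keepdim=True)
        return -(pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)

    loss_i = contrast(tau_y, same_y) + contrast(tau_a, sem_pos)
    w = (u >= phi).float() * u                          # simplified f(u_i): gate and weight by IoU
    return (w * loss_i).mean()
```

In the fine-tuning stage, this term is added to the detection losses as in Eq. (4) below.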
In the first stage, we adopt the standard Faster R-CNN losses, including a binary cross-entropy loss L_rpn to extract foreground boxes, a cross-entropy loss L_cls as the classification branch constraint, and a box regression loss L_reg employing the smooth-L1 function. When migrating to novel class learning, the box transfer loss L_transfer is added to train the fine-tuned network in a multi-task manner:

$$\mathcal{L} = \mathcal{L}_{\mathrm{rpn}} + \mathcal{L}_{\mathrm{cls}} + \mathcal{L}_{\mathrm{reg}} + \lambda\,\mathcal{L}_{\mathrm{transfer}}, \tag{4}$$

where λ = 0.5 is set as the loss balancing value.
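At training time, Eq. (4) reduces to a one-line combination of the four partial losses; a minimal sketch, assuming the individual losses are already computed as tensors:

```python
def fine_tune_objective(l_rpn, l_cls, l_reg, l_transfer, lam=0.5):
    """Multi-task objective of Eq. (4); lam = 0.5 balances the box transfer loss."""
    return l_rpn + l_cls + l_reg + lam * l_transfer
```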

Attention Transfer Graph
To effectively transfer associated category information, we refine the initial node features z by constructing an ATG. The connection relationships in the graph are set using dot-product attention between nodes, 23 rather than the traditional graph formulation that uses a weight matrix W as a linear transformation between nodes. 24 This approach is a transfer model built on the relationships within the feature data itself. Specifically, after the RoI feature extractor, the output feature z ∈ R^{N×d} is reduced in dimensionality by a 1 × 1 convolution to avoid the large computational cost that a global self-attention analysis of the feature map would otherwise incur. Self-attention is then used to extract association relationships between proposals. We perform consecutive message passes to update the node features; in each pass, z'_i is determined by the previous node features:

$$z_i' = \sum_{j=1}^{N} e_{i,j}\, z_j, \tag{5}$$

$$e_{i,j} = e_{j,i} = \operatorname{softmax}\big((w_q z_i)(w_k z_j)^{T}\big), \tag{6}$$

where w_q and w_k are learnable weights and e_{i,j} represents the self-attention score between nodes i and j. When e_{i,j} ≥ β, an association exists between the two nodes; otherwise, the connection between the two nodes is broken by setting e_{i,j} = 0. The node association threshold β avoids information interference from weakly associated nodes, ensuring that the received associated information has sufficient feature similarity.
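A compact PyTorch sketch of one ATG message pass, following our reading of Eqs. (5) and (6); the module name, the scaling by the square root of the dimension, and the explicit symmetrization step are our assumptions:

```python
import torch
import torch.nn as nn

class AttentionTransferGraph(nn.Module):
    """One message pass over the attention transfer graph (illustrative sketch)."""

    def __init__(self, dim=256, beta=0.5):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # learnable query weights w_q
        self.w_k = nn.Linear(dim, dim, bias=False)  # learnable key weights w_k
        self.beta = beta                            # node association threshold

    def forward(self, z):
        # z: (N, dim) node features after the 1x1 dimensionality reduction
        scores = self.w_q(z) @ self.w_k(z).t() / z.size(1) ** 0.5
        e = torch.softmax(scores, dim=1)            # self-attention scores e_{i,j}
        e = 0.5 * (e + e.t())                       # enforce the symmetry e_{i,j} = e_{j,i}
        e = e.masked_fill(e < self.beta, 0.0)       # break weakly associated connections
        return e @ z                                # aggregate features from associated nodes
```

Stacking M such passes yields the class prototypes z' described above.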
Experimental Results

Datasets
We follow the data settings of the FSOD criteria; 5,6,9,11 specifically, PASCAL VOC and MS-COCO are used to verify our model against the state of the art. PASCAL VOC 25 includes 20 classes of benchmark data with well-defined labels. We employ trainval 07+12 in the training phase and perform evaluation on test 07. For the few-shot data categories, we introduce three novel/base class splits, where the three sets of novel classes are {"bird," "bus," "cow," "motorbike," "sofa"}, {"aeroplane," "bottle," "cow," "horse," "sofa"}, and {"boat," "cat," "motorbike," "sheep," "sofa"}. The number of shots is set to 1, 2, 3, 5, and 10. The novel average precision (nAP50) at IoU = 0.5 is adopted as the evaluation criterion.
MS-COCO 26 is a challenging dataset consisting of 80 classes, including the 20 PASCAL VOC classes. We use the 20 shared classes as novel classes and the remaining 60 classes as base classes. train2014 is used for training, and val2014 is used for evaluation. The number of shots is set to 10 and 30, respectively. We report three evaluation metrics (AP, AP50, and AP75) under different IoU thresholds.

Implementation Details
In the standard two-stage object detection framework, ResNet-101 is employed as the backbone network, combined with a feature pyramid network (FPN), to form Faster R-CNN. The experimental hardware platform uses Nvidia Tesla P100 GPUs running Ubuntu 20.04, with a software stack of Python 3.7 and PyTorch 1.8.1 + CUDA 11.1. Standard stochastic gradient descent (SGD) with weight decay is used as the optimizer. During the base class training stage, the SGD learning rate is set to 0.02, with a momentum of 0.9. The maximum number of iterations is set to 18,000, and the batch size is set to 16.
In the few-shot fine-tuning stage, the SGD learning rate is adjusted to 0.001, with a momentum of 0.5. The maximum number of iterations is set to 5000, and the batch size remains 16.
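The two training stages above translate into the following optimizer setup; the detector constructor is a stand-in (torchvision's ResNet-50 FPN variant) since the paper's ResNet-101 model definition is not shown, and the weight decay value is our assumption because it is not recoverable from the text:

```python
import torch
import torchvision

# Stand-in detector; the paper uses Faster R-CNN with a ResNet-101 FPN backbone.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn()

# Base training stage: SGD, lr 0.02, momentum 0.9, 18,000 iterations, batch size 16.
base_optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9,
                                 weight_decay=1e-4)  # weight decay value assumed

# Few-shot fine-tuning stage: SGD, lr 0.001, momentum 0.5, 5000 iterations, batch size 16.
ft_optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.5,
                               weight_decay=1e-4)    # weight decay value assumed
```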

PASCAL VOC
Table 1 shows the novel-class few-shot detection performance on the PASCAL VOC dataset. It can be seen from Table 1 that our method consistently outperforms existing methods under different settings. With multiple runs on randomly selected few-shot data to reduce randomness, MTM-FSOD achieves the best performance under the different base/novel class split conditions, and our accuracy improves for shot = {3, 5, 10} compared with the second-ranked Meta-DETR. 13 The strong performance in Table 1 demonstrates the superiority and robustness of our proposed MTM-FSOD algorithm. We visualize some of the novel class detection results in Fig. 4.
We show the joint performance of the base/novel classes in Table 2. With the PASCAL VOC split1 setting, MTM-FSOD obtains good performance on both novel and base classes in scenarios with limited training samples. TFA 9 fine-tunes only the last layers of the detector; it therefore has a relatively limited ability to generalize to novel classes but performs well on base classes. FSCE 11 unfreezes the RPN and RoI structures on top of TFA so that it generates proposals suitable for novel classes, and thus it achieves good performance on both novel and base classes.

Table 1 The performance evaluation of existing few-shot detection methods under the three novel class segmentation settings of PASCAL VOC (nAP50). Γ denotes a meta-learning method. Δ represents the average over 10 random seeds.
Meta-learning approaches (i.e., Meta-YOLO, 5 FsDetView, 8 and Meta-DETR) have a relative advantage in detecting novel classes because the network can match a novel class with the base classes, but this matching process leads to forgetting of the base classes while detecting novel classes.

MS-COCO
Table 3 shows the novel-class detection results on MS-COCO at shot settings of 10 and 30, respectively. Compared with PASCAL VOC, MS-COCO is more challenging due to occlusion and large appearance variations in complex scenes. As shown in Table 3, our method performs better under different settings. Notably, the fine-tuning approaches (TFA w/cos 9 and FSCE 11 ) perform worse on novel classes than the meta-learning structures (FsDetView 8 and Meta-DETR). However, our approach breaks this barrier, and its novel-class detection performance is comparable with advanced meta-learning methods.
Table 2 The performance evaluation of base/novel classes on PASCAL VOC split1 (nAP50).

Our novel-class results approach those of Meta-DETR 13 when shot = 30, which is a significant improvement over previous fine-tuning-based detection performance. To explicitly show the effectiveness of our MTM-FSOD method, Fig. 5 shows the t-SNE feature map for the 20 novel classes, where we randomly select 80 samples per class on the COCO test set with shot = 30.

Ablation Studies
We perform comprehensive ablation studies to verify the validity of our MTM design. The experimental results are averaged over 10 runs on the PASCAL VOC split1 base/novel classes with different randomly sampled few-shot data sets. We report the results for the three variants of our model in Table 4. The performance of the model on novel classes drops significantly when the MTM structure in the box transfer branch is removed, indicating that knowledge transfer between categories is effective for novel class objects. In the absence of BTL, this variant also reduces performance in the FSOD setting. Combining MTM and BTL to constitute MTM-FSOD obtains the best results, demonstrating the effectiveness of the component composition.

Influence of Message Transfer Mechanism
As shown in Table 4, when MTM is incorporated into our model, it retains the ability to detect novel classes even when the number of novel class training samples is 1. This indicates that MTM completes the update of novel class features by virtue of the inference role of the self-attention graph. To demonstrate the advantages of the MTM structure, the t-SNE 31 visualization of proposal embedding learning is shown in Fig. 6. The proposal features generated with only the MTM structure (Fig. 6(b), w/MTM) show advantages in intra-class similarity and cross-class distance compared with those produced without the MTM structure (Fig. 6(a), o/MTM). We emphasize that the t-SNE 31 plot encodes proposals from all novel class images in test 07 (PASCAL VOC split1).

Role of Box Transfer Loss
We compare our box transfer branch with the FSCE 11 contrastive branch. In the contrastive branch, FSCE sets a category consistency loss (CPE) to ensure the aggregation of same-class features. In contrast, we fuse the visual feature vectors of the categories, and when the semantic attributes of two instances are similar, we perform feature passing between different categories based on their semantic similarity scores. In the fine-tuning phase, the samples involved in training are few and randomly chosen. The RoI head outputs features of dimension N × d with d = 1024.
In other words, a few images are not sufficient to build a strong feature representation. Therefore, we force the feature representation to absorb similar attributes with the help of external attribute information, with the dimension d reduced to 256 by the MTM mechanism. To verify that our 256-dimensional feature representation retains category distinguishability, we compare it with the contrastive branch of FSCE in Table 5, which fully shows that BTL enhances the instance feature representation without affecting category separability. Table 5 reports results under two feature encoding forms: the original contrastive branch employs a linear transformation to reduce the feature vector from d = 1024 to d = 128, whereas the MTM structure adopts a graph convolution structure to update the feature vector at d = 256. We also replace the CPE constraint with BTL, and Table 5 shows that, under different shot settings, BTL performs better in both FSCE and our MTM-FSOD.

Node Association Threshold β
In the MTM structure, we use instance features to represent nodes, and the existence of a connection between nodes is set by the association threshold β. We experimentally set the candidate association thresholds to β = {0.1, 0.3, 0.5, 0.7, 0.9}. In Fig. 7, we report the AP50 of shot = {1, 2, 3, 5, 10} under the different threshold conditions. As shown in Fig. 7, the value of β at which AP50 peaks varies across shots. With the number of shots set to 1 and 2, AP50 reaches its best performance at β = 0.3; when β > 0.3, the accuracy decreases as the threshold increases. At shot = {3, 5, 10}, β = 0.5 is the optimal choice in all three cases. Therefore, the appropriate β varies across shot settings. When the number of shots is small, the few sample features available at inference are not rich, implying the need to increase the information flow between nodes; a relatively small β therefore helps nodes absorb the feature attributes of other nodes to compensate for their own incomplete features. When the number of shots is relatively large, we increase β to avoid interference from the attributes of dissimilar nodes.

Semantic Similarity Threshold α
In BTL, each proposal corresponds to a one-dimensional semantic vector, and we perform feature aggregation operations on similar proposals based on the cosine similarity of the semantic vectors.In other words, the semantic vectors guide the classification role of visual features.
To avoid pairing categories that are semantically similar but have large differences in visual features, we add the category consistency operation at the same time. We set α = {0.2, 0.4, 0.6, 0.8, 1.0} for shot = {1, 2, 3, 5, 10}. From Fig. 8, it is clear that the best AP50 is obtained at α = 0.8 under all shot settings. To analyze this result, we first visualize the heat map between the semantic vectors of each category in Fig. 9(a). From Fig. 9(a), we look for similar classes between the novel classes and the base classes, whose cosine similarities are roughly 0.6 to 0.9, which is consistent with our choice of threshold. To check the consistency of attributes across the visual and semantic heterogeneous spaces, we also visualize the heat map of the category visual features in Fig. 9(b). We randomly select 100 samples of each class on test 07 and input them into the network to generate visual features, whose mean represents the category's visual features. From Fig. 9(b), we observe that the category similarities in the visual space and the semantic space are mostly consistent; therefore, combined with the above analysis, we set α = 0.8.
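The role of α can be summarized in a few lines; the helper below (our own illustrative code) computes, for each proposal, which other proposals count toward N_{a_i} in Eq. (3):

```python
import torch
import torch.nn.functional as F

def semantic_neighbors(a, alpha=0.8):
    """Mask of proposal pairs whose semantic embeddings exceed the cosine threshold alpha.

    a: (N, k) per-proposal semantic word vectors. The row sums give N_{a_i} in Eq. (3).
    """
    a = F.normalize(a, dim=1)
    cos = a @ a.t()                     # pairwise cosine similarities
    mask = cos > alpha                  # semantically similar pairs
    mask.fill_diagonal_(False)          # a proposal is not its own neighbor
    return mask
```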

Conclusion
In response to the misclassification issue arising from high feature similarity among objects under the two-stage fine-tuning strategy, this work introduces a novel method for exploring the knowledge association between novel and base class instances in FSOD. We propose a framework tailored for object detection with insufficient supervised samples, termed MTM-FSOD. The MTM-FSOD algorithm establishes an information propagation mechanism among instances to enhance the reasoning capability for related classes. In addition, a box transfer branch is introduced, guided by semantic embeddings, to enforce a BTL that constrains category consistency, promoting the aggregation of same-class targets and the separation of different-class targets for better discrimination of related classes. We validate the superiority of MTM-FSOD on few-shot samples from the PASCAL VOC and MS-COCO datasets, demonstrating its adaptability to other instance-level few-shot learning tasks. While the MTM-FSOD algorithm alleviates the misclassification caused by high feature similarity between novel and base classes to some extent, the inherent limitation of few novel class samples hinders the diversity of extracted features. Furthermore, the initial model trained on base classes exhibits low sensitivity to novel classes, resulting in a noticeable gap between novel and base class detection performance. Therefore, our future work will focus on enriching novel class features and ensuring that the feature extractor generates candidate boxes suitable for novel classes.

Disclosures
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Our code can be found in our GitHub repository: https://github.com/lvwengmx/MTM_FSOD.


Fig. 3 MTM is employed for associated category generalization.

Fig. 5 t-SNE visualization of objects learned in the feature space for the 20 novel classes under COCO test with shot = 30.

Fig. 6 t-SNE 31 visualization of objects learned in the feature space with and without our designed MTM. Results are obtained on PASCAL VOC class split 1. (a) o/MTM; (b) w/MTM.

Fig. 9 Visual feature and semantic vector heat maps on the categories of PASCAL VOC. The categories in red boxes indicate novel classes, and the yellow box corresponds to the maximum similarity value. (a) Semantic space; (b) visual space.

Table 3 The performance of few-shot novel class detection on the MS-COCO dataset.

Table 4 Ablation of the key components proposed in MTM-FSOD.

Table 5 MTM-FSOD detection results compared under different branch settings.
Note: bold values indicate the maximum value.