Zero-Shot Object Detection with Partitioned Contrastive Feature Alignment

How to properly align the extracted visual features with the semantic embeddings of unseen objects is crucial to Zero-Shot Object Detection (ZSD). To better infer those unseen visual features, a partitioned contrast strategy is proposed in this paper to train the visual and attribute feature alignment networks. To be specific, four types of contrast are considered: visual-to-visual, visual-to-attribute, attribute-to-visual and attribute-to-attribute. Combined with two cross-batch memory banks of visual features and unseen attribute features, the strategy effectively adjusts the alignment rules for unseen visual features. Experimental results on the MS-COCO dataset show the superiority of the proposed model. Our code is available at: https://github.com/lihh1023/PCFA-ZSD.


INTRODUCTION
With the development of deep learning, object detection has made great progress in recent years [1-4]. However, in order to achieve the best performance, generic object detection models require massive, well-labeled image datasets. Obviously, it is difficult to collect enough images with bounding box annotations for rare target classes. Therefore, some scholars proposed zero-shot learning [5-8] to address the extreme situation where there are no images for training. Naturally, it was soon extended to zero-shot object detection (ZSD) [9-19, 27]. The characteristic of ZSD lies in how to guide the model to learn visual features and detect objects without training samples.
Most ZSD models rely on various semantic embeddings to build the connection between seen and unseen classes, such as word vectors [9, 12-18], textual descriptions [10, 11] and attributes [19]. Many works [9, 12, 13] focus on learning the projection from visual features to the semantic space. Rahman et al. [9] designed a semantic alignment network with a semantic clustering loss and a max-margin loss. Later, they proposed a polarity loss in [12] to address the class-imbalance issue. Mao et al. [19] built an attribute table to connect the seen and unseen classes at the semantic level. The reverse direction, projecting semantic features into the visual space, is exploited in [10] using textual descriptions; similarly, visual and semantic features are projected into a common space in [11]. In addition, Rahman et al. [15] proposed a self-monitoring mechanism that uses pseudo-labeling techniques in a transductive way. Using generative networks, e.g., SU-ZSD [16], GT-Net [17] and DELO [18], to generate unseen class features has become very active recently.
Obviously, the key to ZSD is the projection between visual and semantic features. Thus, the goal of this work is to achieve better alignment of both features in a mutual space based on the idea of contrastive learning [20, 23], i.e., simultaneously maximizing the consistency between similar instances and encouraging differences between dissimilar instances. Unlike previous contrastive-learning-based works [13, 20, 23], different types of contrast, including visual-to-visual, visual-to-attribute, attribute-to-visual and attribute-to-attribute for either seen or unseen classes, are considered separately to constrain the feature projection, as shown in Fig. 1, while attributes are selected as the semantic descriptions of the classes. Moreover, to enrich the variety of available features, especially those of unseen classes, a cross-batch memory bank mechanism is employed to collect contrastive features. Although the visual features of the unseen classes are missing, the common projection rules for visual and attribute features are learned by emphasizing the contrasts involving unseen attribute features.

METHODOLOGY
As shown in Fig. 2, RetinaNet [1] is chosen as the base model for object detection. Inspired by the structure in [5, 8], the projected semantic attribute vectors from ASC-ZSD [19] are chosen as the category centroids, while their distances to the extracted visual features are used for classification. In this section, a new partitioned contrastive feature alignment (PCFA) strategy, including visual feature sampling, a cross-batch memory bank and partitioned contrastive learning, is designed to simultaneously enlarge the inter-class distance and reduce the intra-class distance for both the attribute and visual features.

General Framework
The structure of the box regression subnet is the same as that in [12], as shown in Fig. 2. The semantic embeddings of the seen and unseen classes are introduced into the classification subnet. However, the gap between the semantic embeddings (attribute vectors $a$ in this work) and the visual features $v$ of the prediction boxes is huge. Thus, it is natural to project both into a mutual space using a linear mapping $W$ and a convolution operation $\phi$ with parameters $\theta_a$ and $\theta_v$,

$\tilde{a} = W(a; \theta_a),$  (1)

$\tilde{v} = \phi(v; \theta_v),$  (2)

where $\tilde{a} = [\tilde{a}^s; \tilde{a}^u] \in \mathbb{R}^{(C_s + C_u) \times d}$ represents the attribute features of both seen and unseen classes and $\tilde{v} \in \mathbb{R}^{K \times d}$ represents the visual features in the common space; here $C_s$, $C_u$, $K$ and $d$ denote the numbers of seen categories, unseen categories, prediction boxes and feature dimensions, respectively. Then the cosine similarities between all pairs of $\tilde{a}$ and $\tilde{v}$ are calculated to find the seen or unseen objects with a score threshold.
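As a rough illustration, the two projections in Eq. (1) and (2) can be sketched in PyTorch as below. The class name, the attribute dimension (300) and the backbone channel width (256) are hypothetical choices, not values from the paper; only the mutual-space dimension $d = 128$ and the cosine-similarity classification follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureProjector(nn.Module):
    def __init__(self, attr_dim=300, vis_channels=256, d=128):
        super().__init__()
        # Eq. (1): linear mapping W projecting attribute vectors into the mutual space
        self.W = nn.Linear(attr_dim, d, bias=False)
        # Eq. (2): 1x1 convolution phi projecting visual feature maps into the mutual space
        self.phi = nn.Conv2d(vis_channels, d, kernel_size=1)

    def forward(self, a, v):
        # a: (C_s + C_u, attr_dim) class attribute vectors
        # v: (B, vis_channels, H, W) visual features of the prediction boxes
        a_tilde = self.W(a)                               # (C_s + C_u, d)
        v_tilde = self.phi(v).flatten(2).transpose(1, 2)  # (B, H*W, d)
        v_tilde = v_tilde.reshape(-1, a_tilde.size(1))    # (K, d), one row per box
        # cosine similarity between every box feature and every class attribute
        scores = F.normalize(v_tilde, dim=-1) @ F.normalize(a_tilde, dim=-1).T
        return scores                                     # (K, C_s + C_u)
```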
However, such simple operations in Eq. (1) and (2) do not perform well in ZSD due to the lack of constraints between the seen and unseen parts of $\tilde{a}$ during the training stage. Hence, it is important to construct a more intuitive relationship between them.

Partitioned Contrastive Feature Alignment
To find the implicit links between the seen and unseen objects in semantic attributes and visual features, inspired by contrastive learning [8, 20, 23], the contrast between them is utilized to guide the learning of $\theta_a$ and $\theta_v$ for better feature alignment. Moreover, a cross-batch memory bank mechanism is introduced to enrich the contrast.

Visual Feature Sampling
It should be noted that most of the candidate boxes detected by RetinaNet [1] are easy negatives, i.e., various backgrounds. In order to filter out these background boxes, the Intersection-over-Union (IoU) scores $\{o_k\}_{k=1}^{K}$ with the matched ground truth are used as the criterion for visual feature sampling. Only those visual features $\tilde{v}_k \in \tilde{v}$ whose corresponding $o_k$ are greater than a given threshold $\varepsilon$ (0.7 in this paper) are retained for later feature alignment. Then, the new set of visual features $\tilde{v}^{(t)}$ at the $t$-th batch can be defined as

$\tilde{v}^{(t)} = \{\tilde{v}_k \mid o_k > \varepsilon\}, \quad k = 1, 2, \cdots, K,$  (3)

where $K$ is the total number of detected boxes and $N_t$ is the length of $\tilde{v}^{(t)}$.
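A minimal sketch of this sampling step, assuming the per-box IoU scores with the matched ground truth are already computed (the function name is ours):

```python
import torch

def sample_visual_features(v_tilde, ious, eps=0.7):
    """Keep only the box features whose IoU with the matched GT exceeds eps.

    v_tilde: (K, d) projected visual features of all K predicted boxes
    ious:    (K,)   IoU of each box with its matched ground-truth box
    """
    keep = ious > eps      # filters out the easy background boxes
    return v_tilde[keep]   # the set v^(t) of the current batch, length N_t
```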

Cross-batch Memory Bank
During the training stage, the available visual features and their corresponding seen categories are limited within one batch, and only one attribute feature is available for each seen or unseen class. Such an imbalance in the categories and types of features hurts the learning effectiveness. Inspired by [24, 25], two additional memory banks are introduced to retain the visual and unseen attribute features of previous batches. At the $t$-th batch, the visual and attribute memory banks, denoted as $\mathcal{M}_v^{(t)}$ and $\mathcal{M}_a^{(t)}$, contain the features of the current batch and the previous $(b-1)$ batches, respectively. They are updated at the beginning of each batch as

$\mathcal{M}_v^{(t)} = \bigcup_{i=0}^{b-1} \tilde{v}^{(t-i)},$  (4)

$\mathcal{M}_a^{(t)} = \bigcup_{i=0}^{b-1} \tilde{a}^{u,(t-i)},$  (5)

where $\tilde{a}^{u,(t)}$ denotes the unseen attribute features at the $t$-th batch. In this work, the value of $b$ is set to 1, since a larger value only brings a slight boost with huge memory and time costs.
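The bank update in Eq. (4) and (5) amounts to a FIFO queue over batches. The sketch below is one plausible realization, assuming features are detached from the computation graph before storage (a common practice, though not stated in the paper):

```python
from collections import deque
import torch

class CrossBatchMemory:
    """FIFO bank keeping the current batch plus the previous (b - 1) batches."""

    def __init__(self, b=1):
        self.queue = deque(maxlen=b)

    def update(self, feats, labels):
        # store a snapshot of this batch; detach to avoid backprop through history
        self.queue.append((feats.detach(), labels.detach()))

    def get(self):
        feats = torch.cat([f for f, _ in self.queue], dim=0)
        labels = torch.cat([l for _, l in self.queue], dim=0)
        return feats, labels

# one bank for sampled visual features, one for unseen attribute features
visual_bank, attribute_bank = CrossBatchMemory(b=1), CrossBatchMemory(b=1)
```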

Partitioned Contrastive Feature Alignment
As one of the key contributions of this work, the idea of contrastive learning is adopted in the ZSD framework. Unlike previous works, the partitioned contrast between visual and attribute features is performed in the mutual space to train the operations in Eq. (1) and (2) for better feature alignment. To be specific, four types of contrast are performed: visual-to-visual, visual-to-attribute, attribute-to-visual and attribute-to-attribute. They are then utilized to formulate new losses to ensure that the projected features of the same category, either visual or attribute ones, lie close to each other. Furthermore, the unseen attribute features should lie far from the seen ones, i.e., exhibit a large contrast with other features.
For each element $\tilde{v}_{i'}$ in the whole visual memory bank of size $N_v$, a positive bag $V_1 = \{\tilde{v}_1^+, \tilde{v}_2^+, \cdots, \tilde{v}_P^+\}$ is constructed to contain all visual features with the same label $y_i$ in $\mathcal{M}_v^{(t)}$. Besides that, $\tilde{v}_{i'}$ also has a corresponding attribute feature $\tilde{a}_{y_i}^+$ of the same class. Then, the visual-to-visual and visual-to-attribute contrastive losses $\mathcal{L}_{v2v}(\tilde{v}_{i'})$ and $\mathcal{L}_{v2a}(\tilde{v}_{i'})$ can be defined as

$\mathcal{L}_{v2v}(\tilde{v}_{i'}) = -\frac{1}{|V_1|} \sum_{\tilde{v}_p^+ \in V_1} \log \frac{\exp(s(\tilde{v}_{i'}, \tilde{v}_p^+)/\tau)}{Z_{v2v}(\tilde{v}_{i'})},$  (6)

$\mathcal{L}_{v2a}(\tilde{v}_{i'}) = -\log \frac{\exp(s(\tilde{v}_{i'}, \tilde{a}_{y_i}^+)/\tau)}{Z_{v2a}(\tilde{v}_{i'})},$  (7)

where $s(\cdot, \cdot)$ is the cosine similarity between two features, and $Z_{v2v}(\tilde{v}_{i'})$ and $Z_{v2a}(\tilde{v}_{i'})$ aggregate the visual-to-visual and visual-to-attribute similarities over all features. They are calculated as

$Z_{v2v}(\tilde{v}_{i'}) = \sum_{j=1}^{N_v} \exp(s(\tilde{v}_{i'}, \tilde{v}_j)/\tau),$  (8)

$Z_{v2a}(\tilde{v}_{i'}) = \sum_{c=1}^{C_s + C_u} \exp(s(\tilde{v}_{i'}, \tilde{a}_c)/\tau),$  (9)

where the temperature $\tau$ is set to 0.2 following [23].
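For concreteness, the visual-to-visual term in Eq. (6) and (8) can be sketched as below for one anchor feature. This is our InfoNCE-style reading of the reconstructed equations, not code from the released repository:

```python
import torch
import torch.nn.functional as F

def v2v_loss(v_anchor, bank_feats, bank_labels, anchor_label, tau=0.2):
    """Visual-to-visual contrast for one anchor, cf. Eq. (6) and (8).

    v_anchor:    (d,)      one projected visual feature
    bank_feats:  (N_v, d)  all features in the visual memory bank
    bank_labels: (N_v,)    their class labels
    """
    sims = F.cosine_similarity(v_anchor.unsqueeze(0), bank_feats, dim=-1) / tau
    log_prob = sims - torch.logsumexp(sims, dim=0)  # log-softmax over the whole bank
    positives = bank_labels == anchor_label         # the positive bag V_1
    return -log_prob[positives].mean()
```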
Similarly, for each seen attribute feature $\tilde{a}_{j'}$ in the attribute feature set, all visual features belonging to the $j$-th class are extracted from $\mathcal{M}_v^{(t)}$ to form another positive bag $V_2 = \{\tilde{v}_1^+, \tilde{v}_2^+, \cdots, \tilde{v}_{P_j}^+\}$, $j = 1, 2, \cdots, C_s$. Then, the attribute-to-visual contrastive loss $\mathcal{L}_{a2v}(\tilde{a}_{j'})$ can be formulated as

$\mathcal{L}_{a2v}(\tilde{a}_{j'}) = -\frac{1}{|V_2|} \sum_{\tilde{v}_p^+ \in V_2} \log \frac{\exp(s(\tilde{a}_{j'}, \tilde{v}_p^+)/\tau)}{Z_{a2v}(\tilde{a}_{j'})},$  (10)

$Z_{a2v}(\tilde{a}_{j'}) = \sum_{k=1}^{N_v} \exp(s(\tilde{a}_{j'}, \tilde{v}_k)/\tau).$  (11)

Considering the contrast between seen and unseen attribute features, each unseen attribute feature $\tilde{a}_{u'}$ from $\mathcal{M}_a^{(t)}$ corresponds to $b$ attribute features with the same label, i.e., $A_2 = \{\tilde{a}_1^+, \tilde{a}_2^+, \cdots, \tilde{a}_b^+\}$. Subsequently, the attribute-to-attribute contrastive loss $\mathcal{L}_{a2a}(\tilde{a}_{u'})$ can be formulated as

$\mathcal{L}_{a2a}(\tilde{a}_{u'}) = -\frac{1}{b} \sum_{\tilde{a}_p^+ \in A_2} \log \frac{\exp(s(\tilde{a}_{u'}, \tilde{a}_p^+)/\tau)}{\sum_{c=1}^{C_s + C_u} \exp(s(\tilde{a}_{u'}, \tilde{a}_c)/\tau)}.$  (12)

It is worth noting that the information of unseen categories is introduced by the similarity between attribute features in Eq. (12). Although the visual features of unseen classes are not available in ZSD, such contrast on the similarity may give a hint of where the unseen visual features will be projected. Finally, the total loss of the proposed PCFA is

$\mathcal{L}_{PCFA} = \mathcal{L}_{v2v} + \alpha \mathcal{L}_{v2a} + \beta \mathcal{L}_{a2v} + \gamma \mathcal{L}_{a2a},$  (13)

where $\alpha$, $\beta$ and $\gamma$ are the weights for the different types of contrast.
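Putting the four terms together, the total objective in Eq. (13) is a plain weighted sum; a trivial sketch, assuming each partial loss has already been averaged over its anchors:

```python
def pcfa_loss(l_v2v, l_v2a, l_a2v, l_a2a, alpha=1.0, beta=1.0, gamma=1.0):
    # Eq. (13): weights alpha, beta, gamma balance the four types of contrast
    return l_v2v + alpha * l_v2a + beta * l_a2v + gamma * l_a2a
```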

EXPERIMENTS

Dataset and Implementation Details
The proposed PCFA-ZSD model is evaluated on the MS-COCO (2014) dataset [26], which contains 80 categories. Following the protocol in [12, 19], the 80 categories are divided into 65 seen classes and 15 unseen classes. There are 62,300 images for training, with no examples of unseen classes. For ZSD validation, there are 10,098 images containing 16,388 bounding boxes of unseen classes; for generalized ZSD (GZSD) verification, the seen images and bounding boxes are also included. The dimension $d$ of the mutual space is set to 128 in the experiments. In the inference stage, the predicted bounding boxes with IoU > 0.5 and similarity scores greater than 0.3 for seen classes and 0.1 for unseen classes are selected. The weight of $\mathcal{L}_{PCFA}$ is set to 0.5, while $\alpha$, $\beta$ and $\gamma$ are all 1 in our work.
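The class-dependent score thresholding at inference can be sketched as follows; the function and argument names are hypothetical, while the thresholds (0.3 for seen, 0.1 for unseen) and the 65/15 split follow the paper:

```python
import torch

def select_detections(scores, num_seen=65):
    """scores: (K, C_s + C_u) cosine similarities between boxes and class attributes."""
    conf, cls = scores.max(dim=1)   # best class and its score per box
    seen_mask = cls < num_seen      # assumes seen classes are listed first
    thresh = torch.where(seen_mask,
                         torch.full_like(conf, 0.3),   # seen-class threshold
                         torch.full_like(conf, 0.1))   # unseen-class threshold
    keep = conf > thresh
    return cls[keep], conf[keep]
```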

Experimental Results
The comparison with other state-of-the-art models on the MS-COCO dataset is shown in Table 1. All reported results are the average of five runs. The mAP is given for both ZSD and GZSD, while the Harmonic Mean (HM) of the mAP is provided for GZSD. It can be seen that the proposed PCFA-ZSD achieves the highest mAP for unseen objects in both the ZSD and GZSD settings. It is worth noting that the mAP of PCFA-ZSD is higher than the baseline by about 3.6-4.6%, not only for unseen classes but also for seen ones. Here, the baseline refers to the base model without the PCFA and memory bank.
For the GZSD task, our method still surpasses the compared state-of-the-art models in both the unseen and HM mAP. The HM mAP of the proposed PCFA-ZSD is 3.23% higher than that of the second place (SU-ZSD [16]) in Table 1.
Although ContrastZSD [13] also adopts the idea of contrastive learning and achieves the highest mAP on seen objects, its performance on unseen objects, which matters more in the ZSD problem, is worse than that of our method. This shows the effectiveness of the proposed partitioned contrast strategy.
The per-class results on MS-COCO are given in Table 2 to show more details. It can be observed that our work achieves the highest AP in about half of the classes, and the second highest AP in another four classes. Since "mouse", "toaster" and "hair drier" share very limited visual similarity with the seen classes, they are challenging for the ZSD task. Thus, the AP of these classes is low for all methods.
A set of qualitative comparisons of the GZSD results is presented in Fig. 3. It can be noted that our model is more accurate in detecting the "airplane", "frisbee" and "parking meter". The "airplane" is wrongly detected as a snowboard by ASC-ZSD [19], while the "frisbee" is mistakenly detected as a donut by PL-ZSD [12]. Moreover, both PL-ZSD [12] and ASC-ZSD [19] miss some unseen objects in the images.

Ablation Studies
An ablation study on the PCFA and the cross-batch memory bank (MB) is presented in Table 3. It is worth mentioning that the mAP and Recall@100 receive a significant boost from the PCFA for both ZSD and GZSD, which means the partitioned contrast strategy is useful. Furthermore, the additional features introduced by the MB can further elevate the detection performance: to a certain extent, the visual features available for contrastive alignment are enriched, while the unseen attribute features from previous batches also remain available for a more robust contrast.

CONCLUSION
In this work, a new framework based on partitioned contrastive feature alignment is proposed for zero-shot object detection. Constrained by four types of contrast, the visual and attribute features can be better aligned in the common space. As a result, the detector gains an enhanced ability to detect unseen classes more accurately in both the ZSD and GZSD tasks on the MS-COCO dataset.

Fig. 2 .
Fig. 2. The overall architecture of our model. The hollow circles, solid circles, and solid squares represent visual features, seen attribute features, and unseen attribute features, respectively.
This work was supported by the Ningbo Municipal Natural Science Foundation of China (No. 2022J114) and Innovation Challenge Project of China (Ningbo) (No. 2022T001).

Fig. 3 .
Fig. 3. Qualitative comparisons. Yellow and purple bounding boxes represent seen and unseen classes, respectively.

Table 2 .
Average Precision (%) of each unseen class on MS-COCO dataset with 65/15 data split