Multi-view contextual adaptation network for weakly supervised object detection in remote sensing images

ABSTRACT Weakly supervised learning plays a pivotal role in object detection: weakly supervised object detection (WSOD) significantly reduces annotation costs by relying only on image-level labels. However, WSOD methods exhibit certain limitations. Typically, they tend to identify the most easily recognizable local regions within targets, making it challenging to accurately delineate target boundaries. Moreover, the presence of multiple instances of the same class in adjacent locations complicates the effective distinction between objects of the same category. The complex backgrounds and dense target distributions of remote sensing images (RSI) further exacerbate the difficulty of weakly supervised detection. To address these issues, we propose the Multi-View Contextual Adaptation Network (VCANet). Building on the classic Online Instance Classifier Refinement (OICR) framework, we incorporate a contextual adaptation perception within a multi-view learning framework and integrate a pseudo-label filtering process. The contextual adaptation perception exploits surrounding environment information to enhance localization, guiding the model to prioritize target objects by referring to their spatially neighbouring pixels. Multi-view learning constructs additional constraints from diverse perspectives, revealing objects that might be overlooked under weak supervision in a single view. The pseudo-label filtering process eliminates inaccurate pseudo-labels by identifying reliable foregrounds, mitigating overlapping proposals during label propagation. On the challenging NWPU VHR-10.v2 and DIOR datasets, we achieve promising results with mAP of 62.3% and 28.2%, respectively, surpassing existing benchmarks.


Introduction
With the continuous development of satellites and sensors, remote sensing images (RSI) possess higher spatial resolution, thereby providing richer details. The significance of RSI is increasingly evident in various practical applications [1][2][3][4]. Currently, remote sensing object detection predominantly relies on fully supervised training. However, annotation costs for RSI are high. The emergence of Weakly Supervised Object Detection (WSOD) in RSI has significantly alleviated the burden on annotators [5,6], which holds considerable significance and research value for handling RSI data.
Currently, the predominant methods for addressing WSOD center around multiple instance learning. WSDDN [7] is acknowledged as the pioneering end-to-end WSOD network, providing vital inspiration for subsequent research. Moreover, Online Instance Classifier Refinement (OICR) [8] decomposes images into candidate boxes using methods such as Selective Search [9] or Edge Boxes [10], followed by an online instance refinement mechanism. This mechanism iteratively selects candidate boxes with the highest confidence scores as pseudo-labels to train the detector within the multiple instance learning framework. Recently, several research endeavors have advanced WSOD by integrating background information, cascading networks, detection box regression, and eliminating discriminative local regions [11][12][13][14]. Despite this progress, challenges persist due to the lack of instance-level annotations. Firstly, WSOD often employs non-convex loss functions, making it vulnerable to local optima and thereby excessively biased towards discriminative local regions. Additionally, as illustrated in Figure 1, RSIs exhibit intricate backgrounds and dense object distributions. Hence, comprehensively detecting all objects becomes arduous, potentially leading to cases where one detection box encompasses multiple objects.
To address the above challenges, we propose a multi-view contextual adaptation network. While fully supervised training often utilizes contextual information from the surrounding regions of objects or the entire image to enhance performance, weakly supervised settings lack supervision for both object locations and contextual regions. Motivated by ContextLocNet [15], we introduce a contextual adaptation perception to tackle this issue. This perception incorporates three distinct pooling approaches: Region of Interest (ROI) pooling, context pooling, and frame pooling. Context pooling extracts features of the region outside the ROI, while frame pooling extracts features of the region inside it. These pooling methods are seamlessly integrated into the conventional OICR pipeline. By quantifying the disparity between the external and internal region features of the ROI, we emphasize the object itself, thereby reducing erroneous detections. Guided by the contextual adaptation perception, the model can accurately locate objects even with limited supervision. This perception also emphasizes the object within its contextual region by amplifying the disparity between the category scores of the predicted object and the surrounding context, enhancing the model's capability to locate objects in large-scale cluttered backgrounds. Moreover, the inherent limitation of weakly supervised learning, namely the lack of instance-level annotation, often results in an insufficient number of positive instance examples during training. This scarcity makes it challenging to comprehensively detect all objects. To tackle this issue, we incorporate a multi-view learning approach along with random erasing and HSV color augmentation techniques. Training the weakly supervised network is facilitated by feeding the color-enhanced and randomly erased images, along with the original images, into
the network simultaneously. This approach aids the weakly supervised network in comprehensively identifying all instances. Random erasing and HSV data augmentation are powerful techniques that enhance the generalization capability of deep models. Random erasing removes pixels in specific image regions, reducing over-reliance on easily recognizable local object regions. HSV augmentation adjusts brightness, color, and saturation to enrich data samples, and is particularly effective at improving target contrast for easier detection and localization against the complex backgrounds of RSIs. To address the challenge of the detector incorrectly identifying multiple objects as a single entity, we propose a pseudo-label filtering process. This process selects appropriate labels as supervision signals from the generated pseudo-labels. Specifically, we traverse the pseudo-labels generated during network training and employ a filtering module to optimize them, ensuring the network utilizes accurate pseudo-labels and thus minimizing ambiguity.
The Multi-view Contextual Adaptation Network consists of three key components: a contextual adaptation perception, multi-view learning, and a pseudo-label filtering process. In this model, multi-view learning captures complementary visual patterns to enrich the information for object detection. The contextual adaptation perception is tailored to delineate object boundaries, thereby improving localization accuracy. Additionally, the pseudo-label filtering process is crucial for filtering out inaccurate pseudo-labels, ensuring precise supervision signals during training and consequently enhancing the detector's performance. The main contributions are outlined as follows:
• We introduce a contextual adaptation perception that encourages the model to predict target objects more prominently against their background regions by capturing contextual information.
• We devise a multi-view learning framework to capture information from diverse perspectives, while the pseudo-label filtering process sifts out incorrect pseudo-labels.
• Comprehensive experimental results demonstrate the promising performance of our model against trending methods on the NWPU VHR-10.v2 and DIOR datasets.
2 Related Work

Weakly Supervised Object Detection
Weakly Supervised Object Detection (WSOD) stands as a pivotal paradigm, given its low annotation requirements. Presently, WSOD predominantly relies on multiple instance learning frameworks in scenarios with multiple instances of the same class within an image [16][17][18]. For instance, Tang et al. [8] segment features into distinct streams, where the initial stream trains the basic instance classifier, while subsequent streams refine it. The output of each preceding stream serves as the supervisory signal for the subsequent one, iteratively enhancing detection performance. Additionally, Tang et al. [19] introduce the concept of proposal clusters as an enhancement to OICR. They propose grouping candidate boxes into spatial clusters, where each object is associated with an independent spatial cluster and the candidate boxes within each cluster are spatially adjacent. However, the complex backgrounds and dense arrangements characteristic of RSIs render the direct application of the above methods to WSOD in RSI impractical [20][21][22][23]. Wang et al. [24] tackle the challenge posed by the presence of multiple instances in RSI by proposing a spatial-map-based voting mechanism to identify high-quality target objects. Recognizing that heuristic strategies for generating candidate boxes often fail to adequately cover the entire target object, significantly impacting detector performance, Cheng et al. [25] employ a random walk to generate confidence maps for the target and partition candidate boxes based on a threshold. Furthermore, Wang et al. [26] introduce a time-consistent instance selection strategy to identify foreground objects and mitigate the risk of background interference arising from a lack of accurate supervisory information.

Contextual Information in Object Detection
In object detection, the extraction of contextual information aims to capture the environment and relevant details surrounding the target object, thereby augmenting detection performance [27][28][29][30][31]. This can be achieved in several ways. One common approach is the use of attention mechanisms, which assign weights to features from different regions based on the surrounding content to emphasize contextually relevant information. Additionally, spatial pyramid pooling can decompose the image into distinct spatial scales and aggregate features at each scale, acquiring context at varying spatial hierarchies. Another technique involves dilated convolutions, employing convolution operations at diverse spatial scales to encompass a broader range of context. For example, Kantorov et al. [15] introduce two guided models, namely additive and contrastive models, for context awareness to assist in accurately delineating object boundaries. The Inside-Outside Net combines skip pooling and recurrent neural networks, where skip pooling captures multi-scale context at different levels, and recurrent neural networks model the relationship between the target and its context as a sequence. Feng et al. [32] propose a Triple Context Aware Network, enhancing detection through a global context-aware enhancement module and a dual local context residual module. The global module captures contextual information for the entire visual scene to activate features for the whole object. In our model, through ROI pooling, context pooling, and frame pooling, we integrate features from both outside and inside the ROI region to locate the target more precisely amidst complex backgrounds.

Multi-View Learning
Multi-view learning is an emerging approach aimed at improving model generalization by incorporating data transformations from different perspectives [33][34][35][36]. The fundamental concept involves leveraging information from diverse viewpoints, amalgamating it, and generating more comprehensive and robust data representations. Multi-view learning operates on the premise that data sources from different perspectives offer complementary yet related information. By integrating these perspectives, the model can better comprehend and describe the data's characteristics.

Fig. 2 The architecture of the Multi-view Contextual Adaptation Network, designed to tackle the challenge of inadequate instance supervision. Multi-view learning facilitates the extraction of additional positive sample instances, while the contextual adaptation perception delineates object boundaries. Additionally, the pseudo-label filtering process is employed to filter out erroneous pseudo-labels.
Multi-view learning methods can be roughly grouped into co-training algorithms, co-regularization algorithms, and boundary consistency algorithms, based on comprehensive surveys of the field [37,38]. Co-training methods involve training two independent classifiers concurrently, each utilizing a distinct dataset. Over time, these classifiers exchange information, leveraging unlabeled data to augment label information and thereby enhancing model performance through complementarity. Canonical Correlation Analysis (CCA) [39] is a multivariate statistical technique used to explore linear relationships between two or more datasets, aiming to identify the maximum correlation or common structure among them. Sun et al. [40] address discriminative challenges in multi-view learning, focusing on how to effectively integrate information from different perspectives. Our approach employs data augmentation techniques and utilizes multiple detectors to extract features from images at diverse viewpoints. This facilitates the extraction of complementary information, enabling a global detection of all objects in the image.

Overall Architecture
The network architecture is depicted in Figure 2. We select OICR as the base network and VGGNet [41] as the backbone to construct a multi-branch network model. Initially, we apply random erasing and HSV color enhancement to the input images, resulting in three versions that are fed into the shared backbone. Three distinct pooling methods, ROI pooling, context pooling, and frame pooling, are employed to extract features. In the classification branch, features from ROI pooling pass through fully connected layers and a softmax to obtain classification results. In the localization branch, features from context pooling and frame pooling are processed by fully connected layers, followed by a softmax, to generate the outputs. The integration of these three sets of results leads to an overall improvement in detection performance. To further improve the accuracy of the supervision signal, we introduce a pseudo-label filtering process to refine the initial predictions. Finally, these optimized pseudo-labels are incorporated into the online instance refinement process to obtain the final results.

Contextual Adaptation Perception
To accurately locate object boundaries, we introduce the contextual adaptation perception module, integrated into the ROI-based network. This module assesses the contrast of the maximum matching score between a rectangular bounding box and its neighboring boxes, thereby enhancing localization accuracy, particularly under limited supervision. As illustrated in Figure 3, the proposed module incorporates three types of pooling, ROI pooling, context pooling, and frame pooling, for context-aware localization. Here, "context" refers to the region outside the specified ROI, while "frame" refers to the region inside it. Notably, the outputs of context pooling and frame pooling share the same feature map shape, with the central regions of these feature maps set to zero.
The network process of the contextual adaptation perception module is illustrated in Figure 2. We denote the feature map of the final layer of the VGG backbone as X. Through ROI pooling, context pooling, and frame pooling, we obtain the mapping of each ROI onto the feature map. In our model, Y_ROI, Y_context, and Y_frame denote the outputs of ROI pooling, context pooling, and frame pooling, respectively, each constituting a fixed-size feature tensor. The contextual adaptation perception is integrated into a two-stream structure. Y_ROI serves as the input to the classification stream, passing through fully connected layers to produce classification scores S_cls ∈ R^{C×N}, for C classes and N proposals. Meanwhile, the discrepancy between Y_context and Y_frame is used as the input to the detection stream, generating localization scores S_loc through the same fully connected layer process.
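The three pooling paths can be sketched numerically. The following is a minimal NumPy illustration, not the paper's implementation: `pool_region` stands in for ROI pooling, `frame_like` zeroes the centre of a pooled map (as in Figure 3), the 1.8 enlargement ratio follows Figure 3, and the 7×7 grid and inner-region size are arbitrary assumptions:

```python
import numpy as np

def pool_region(fmap, box, out=7):
    """Naive max pooling of the box region of a 2-D feature map
    onto an out x out grid (a stand-in for ROI pooling)."""
    x0, y0, x1, y1 = box
    region = fmap[y0:y1, x0:x1]
    ys = np.linspace(0, region.shape[0], out + 1).astype(int)
    xs = np.linspace(0, region.shape[1], out + 1).astype(int)
    pooled = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[i, j] = cell.max()
    return pooled

def frame_like(pooled, inner=3):
    """Zero the central region so only a frame-shaped map remains,
    mimicking the frame/context pooling outputs in Figure 3."""
    out = pooled.copy()
    k = (pooled.shape[0] - inner) // 2
    out[k:k + inner, k:k + inner] = 0.0
    return out

def enlarge(box, ratio, h, w):
    """Scale a box about its centre by `ratio`, clipped to the image."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    bw, bh = (x1 - x0) * ratio / 2.0, (y1 - y0) * ratio / 2.0
    return (max(0, int(cx - bw)), max(0, int(cy - bh)),
            min(w, int(cx + bw)), min(h, int(cy + bh)))

# Toy feature map and ROI.
fmap = np.arange(400, dtype=float).reshape(20, 20)
box = (4, 4, 14, 14)

y_roi = pool_region(fmap, box)                 # classification stream input
y_frame = frame_like(y_roi)                    # inner region, centre zeroed
y_context = frame_like(pool_region(fmap, enlarge(box, 1.8, 20, 20)))
loc_input = y_context - y_frame                # detection (localization) stream input
```

The difference `loc_input` is what the detection stream consumes; it is large only where the surround differs strongly from the object's own extent.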
Additionally, OICR incorporates an online instance classifier refinement process, which uses a self-supervised approach to enhance the initial high-score region outputs in the early stages of detector training. Throughout training, we employ a loss function that treats each annotated candidate box as instance-level supervision. The binary variable y^k_cn indicates whether the n-th candidate box belongs to the c-th category, where k indexes the k-th refinement and K is the total number of refinements. S^{N_k}_cn represents the significance score of the n-th candidate box for category c, reflecting the model's confidence in associating category c. The parameter λ represents the loss weight. The network's final loss function comprises two components, L_MIL and L_REF, where α is the weight balancing these two components.
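The refinement-loss equation referenced here did not survive extraction. A plausible reconstruction, following the standard OICR formulation that this paragraph describes (the placement of the weight λ and the α-weighted combination are assumptions), is:

```latex
% Refinement loss over K refinement steps, N proposals, C classes plus background
L_{REF} = -\frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\sum_{n=1}^{N}
          \sum_{c=1}^{C+1} \lambda \, y^{k}_{cn} \log S^{N_k}_{cn}

% Overall objective combining the MIL loss and the refinement loss
L = L_{MIL} + \alpha \, L_{REF}
```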

Multi-View Learning
Multi-view learning aims to improve the identification of positive instance samples by integrating data from various perspectives, resulting in a more comprehensive and accurate data representation.
We begin by inputting the image and applying random erasing and HSV color augmentation to generate the corresponding augmented versions. Random erasing is governed by two probability parameters: the erasing probability (P = 0.5, the likelihood of erasing a region in the image) and the preservation probability (1 - P). During random erasing, a rectangular region of the image is erased, with the pixel values within the erased region randomized. Specifically, for an original image of size W × H with total area S, the area of the erased region S_e must satisfy an S_e/S ratio between S_l = 0.02 and S_h = 0.4, while the aspect ratio of the erased region r_e is kept between r_1 = 0.3 and 1/r_1. The size of the erased region is then H_e = √(S_e × r_e) and W_e = √(S_e / r_e). The HSV representation of a color image comprises three fundamental components. Hue adjusts the color tone and ranges from 0 to 360 degrees, representing different color categories; changes in hue yield distinct color presentations. Saturation adjusts color purity, ranging from 0 to 100, where higher values indicate greater saturation and more vivid colors. Value modulates the brightness of the image, also ranging from 0 to 100, with higher values indicating brighter colors.
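The random-erasing procedure above can be sketched as follows; this is a minimal NumPy version with the stated hyperparameters (the retry loop and random-fill range are implementation assumptions, not from the paper):

```python
import numpy as np

def random_erase(img, p=0.5, s_l=0.02, s_h=0.4, r1=0.3, rng=None):
    """With probability p, erase a rectangle whose area ratio lies in
    [s_l, s_h] and whose aspect ratio lies in [r1, 1/r1], filling it
    with random pixel values (H_e = sqrt(S_e*r_e), W_e = sqrt(S_e/r_e))."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() > p:              # keep the image with probability 1 - p
        return img
    h, w = img.shape[:2]
    area = h * w
    for _ in range(100):              # retry until a sampled box fits
        s_e = rng.uniform(s_l, s_h) * area
        r_e = rng.uniform(r1, 1.0 / r1)
        h_e = int(round(np.sqrt(s_e * r_e)))
        w_e = int(round(np.sqrt(s_e / r_e)))
        if 0 < h_e < h and 0 < w_e < w:
            y = rng.integers(0, h - h_e)
            x = rng.integers(0, w - w_e)
            out = img.copy()
            out[y:y + h_e, x:x + w_e] = rng.integers(
                0, 256, (h_e, w_e) + img.shape[2:])
            return out
    return img
```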
Following this, the three images undergo feature extraction within a shared network architecture, traversing the VGG network, the ROI pooling, context pooling, and frame pooling layers, and two fully connected layers. Each image yields a proposal feature vector. These three feature vectors are then input into their respective dual-stream networks, where they are processed through classification and localization branches, generating positive sample pseudo-labels. The loss for the dual-stream networks is defined as follows: L_all represents the sum of the losses from the three dual-stream branches, where L_orign, L_erase, and L_hsv denote the losses of the branches corresponding to the original image, random erasing, and HSV color enhancement, respectively. Their formulations follow the same pattern as Equation (3). At this stage, the total loss function for network training is defined accordingly. Following this, the pseudo-labels produced by the three dual-stream networks undergo a filtering process (elaborated in the subsequent section). The filtered pseudo-labels from the three branches are then amalgamated to augment the pool of positive sample pseudo-labels. The aggregated score set U_all is formed by combining the scores of pseudo-labels from the three branches, where U_orign, U_erase, and U_hsv represent the scores of pseudo-labels generated by the corresponding branches. The function F(·, ·, ·) selects the maximum value among U_orign, U_erase, and U_hsv. Subsequently, U_all serves as the supervision signal for the online instance classifier refinement mechanism, facilitating the extraction of additional positive sample pseudo-labels.
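The two equations referenced in this paragraph were lost in extraction. Based on the surrounding description, plausible reconstructions (the exact notation is an assumption) are:

```latex
% Sum of the three dual-stream losses, each following the pattern of Eq. (3)
L_{all} = L_{orign} + L_{erase} + L_{hsv}

% Pseudo-label score merging: element-wise maximum across the three views
U_{all} = F(U_{orign}, U_{erase}, U_{hsv})
        = \max\left(U_{orign}, U_{erase}, U_{hsv}\right)
```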

Pseudo-Label Filtering Process
The dual-stream network generates pseudo-labels for classification and localization during training. To refine this process, we propose a module denoted FPL (Filtering of Pseudo Labels), which selects suitable labels from the generated pseudo-labels to serve as supervision signals during training.
The FPL module comprises three components:
1) Pseudo-label Mining: We iterate through the classes, identifying instances where the class in a pseudo-label matches the current class. For matching classes, we extract their classification probabilities and select the box with the highest probability. We then compute the Intersection over Union (IoU) between the selected box and the others, retaining as positive samples those that meet the IoU threshold condition.
2) Pseudo-label Filtering: Recognizing that large bounding boxes in the pseudo-labels may encompass multiple targets and thereby lead to false detections, we introduce a filtering module. First, we determine the class of the box with the highest probability. Certain classes that naturally contain multiple objects (e.g., harbor) are excluded. We then iterate over the positive samples obtained in the first step, examining the spatial relationships between the current positive sample and the other boxes. If a box is found to contain multiple other boxes, we remove it as a pseudo-label.
3) Iterative Refinement: To further enhance performance, we repeat the second step during the instance refinement phase to reduce the number of false positive samples during training. These steps are performed in each iteration.
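Steps 1 and 2 can be sketched as plain Python; this is an illustrative reading of the FPL module, not the authors' code, and the thresholds (`iou_thr=0.5`, the containment threshold `contain_thr=0.9`, and the "at least two contained boxes" rule) are assumptions:

```python
def iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def contains(outer, inner, thr=0.9):
    """True if `inner` lies (almost) entirely inside `outer`."""
    ix0, iy0 = max(outer[0], inner[0]), max(outer[1], inner[1])
    ix1, iy1 = min(outer[2], inner[2]), min(outer[3], inner[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return area > 0 and inter / area > thr

def filter_pseudo_labels(boxes, scores, labels,
                         iou_thr=0.5, contain_thr=0.9, skip_classes=()):
    """Per class: take the top-scoring box and its high-IoU neighbours
    as positives, then drop any positive that swallows two or more
    other boxes of that class (skipped for classes like harbor)."""
    keep = []
    for c in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == c]
        top = max(idx, key=lambda i: scores[i])
        pos = [i for i in idx if iou(boxes[top], boxes[i]) >= iou_thr]
        if c in skip_classes:
            keep.extend(pos)
            continue
        for i in pos:
            inside = sum(1 for j in idx
                         if j != i and contains(boxes[i], boxes[j], contain_thr))
            if inside < 2:
                keep.append(i)
    return keep
```

For example, a large top-scoring box that fully covers two smaller same-class boxes is rejected, whereas two heavily overlapping boxes of the same instance both survive.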

Datasets & Evaluation Metrics
We assessed the efficacy of our method on two widely used remote sensing object detection datasets: NWPU VHR-10.v2 [42] and DIOR [43]. The NWPU VHR-10.v2 dataset comprises 1,172 images, each measuring 400 × 400 pixels. Compared to NWPU VHR-10.v2, the DIOR dataset is larger in scale, featuring more images and object categories and thereby richer image diversity and object variability. The DIOR dataset comprises 192,472 instance objects and 23,463 images, each of 800 × 800 pixels. It encompasses 20 object categories: Airplane (PL), Airport (AP), Baseball field (BF), Basketball court (BC), Bridge (BR), Chimney (CM), Dam (DA), Expressway service area (ES), Expressway toll station (ET), Golf field (GF), Ground track field (GTF), Harbor (HB), Overpass (OP), Ship (SH), Stadium (SD), Storage tank (ST), Tennis court (TC), Train station (TS), Vehicle (VH), and Windmill (WM). This places DIOR at a large scale in terms of category count, image quantity, and instance count.
WSOD in RSIs falls under the broader category of object detection tasks. Consequently, we adopt common object detection metrics, namely mean average precision (mAP) and correct localization (CorLoc), to assess model performance. We follow the standardized evaluation protocol used by established object detection models: the accuracy of predicted boxes is evaluated by computing the IoU between predicted and ground-truth bounding boxes, and a predicted box is deemed correct if its IoU with the ground-truth box exceeds the standard threshold of 0.5.

Implementation Details
We adopt the pre-trained VGG16 [41] as the backbone network and OICR as the base detection pipeline. To generate candidate proposals, we employ Selective Search, yielding approximately 2,000 potential target boxes per image. During training, we ran 30,000 iterations on NWPU VHR-10.v2 and 200,000 iterations on DIOR. The learning rate started at 0.005 and was decreased to 0.0005 after 20,000 iterations for NWPU VHR-10.v2 and after 120,000 iterations for DIOR. Stochastic Gradient Descent (SGD) was used as the optimizer, with weight decay 0.0005, momentum 0.9, and batch size 1. To eliminate redundant candidate boxes, Non-Maximum Suppression (NMS) [44] was employed with an IoU threshold of 0.3. In the OICR training stage, the parameter α in the loss function was set to 0.1. All experiments were conducted on an NVIDIA GeForce RTX 3090 GPU with CUDA 11.1.
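The greedy NMS step with the 0.3 IoU threshold can be sketched as follows; this is a generic textbook implementation for illustration, not taken from the paper's code:

```python
import numpy as np

def iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thr=0.3):
    """Greedy NMS: keep the highest-scoring box, drop every remaining
    box whose IoU with it reaches `thr`, and repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = int(order[0])
        keep.append(i)
        rest = order[1:]
        if rest.size == 0:
            break
        ious = np.array([iou(boxes[i], boxes[j]) for j in rest])
        order = rest[ious < thr]
    return keep
```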
Comparison with State-of-the-Art Methods
Table 3 demonstrates that our method achieves respective improvements of 5.0%, 6.7%, 4.5%, 3.2%, and 3.4% in CorLoc over PCIR, MIG, CDN, MSFE, and SAE. Additionally, our method significantly reduces the performance gap between weakly supervised and fully supervised methods. Notably, our approach excels at detecting targets such as storage tanks, harbors, and vehicles. Storage tanks and vehicles belong to densely arranged target categories, and our contextual adaptation perception effectively delineates their boundaries. Moreover, through multi-view learning, our method achieves more precise detection of these targets.
The DIOR dataset exhibits similar trends. Table 2 presents the mAP comparison results of various methods. Given the complex backgrounds and numerous small objects in the DIOR dataset, it poses more challenging scenarios. In such cases, our proposed method excels, achieving the highest mAP among competing models. Specifically, compared to WSDDN, OICR, PCL, MELM, and MIST, our method improves by 14.9%, 11.7%, 10.0%, 9.5%, and 6.0%, respectively. Moreover, our method demonstrates strong performance in complex scenes, with improvements of 8.0%, 3.3%, 3.1%, 1.8%, and 1.1% over DCL, PCIR, MIG, MSFE, and SAE.

Ablation Study
We conducted ablation experiments on the NWPU VHR-10.v2 dataset to further validate the effectiveness of the Contextual Adaptation Perception (CAP), Multi-View Learning (MVL), and Pseudo-Label Filtering Process (FPL). In these experiments, we established OICR with only the pre-trained weights of the VGG backbone as the baseline. We then performed comparative experiments by integrating the relevant modules to demonstrate their potential benefits.
Initially, we integrated the contextual adaptation perception into OICR. The results in Table 5 clearly demonstrate a significant performance improvement from this integration: the mAP score increased substantially from 34.5% to 58.1%, indicating a notable enhancement in detection accuracy. This enhancement can be attributed to ROI pooling, context pooling, and frame pooling, which emphasize the target's position by focusing on the external and internal regions around the ROI. The effectiveness is further augmented by multi-view learning, boosting the mAP from 58.1% to 61.5% through the fusion of data from different perspectives. Moreover, the FPL reduces errors in background region candidate boxes, showcasing more precise localization; by introducing more accurate pseudo-labels into the network, it raises the mAP from 61.5% to 62.3%.

Qualitative Results
We present qualitative results for each category on the NWPU VHR-10.v2 and DIOR datasets in Figure 4 to visually showcase the model's performance. The figure illustrates detection outcomes from two perspectives: the first and third rows show the detection results of OICR on NWPU VHR-10.v2 and DIOR, respectively, while the second and fourth rows present the detection results of our method on the same datasets. Accurately detected objects are marked with green boxes, missed detections are highlighted in red, and incorrect detections in blue. Despite the diverse and complex environments in the majority of images, our model adeptly detects potential objects with high precision and recall. In the case of aircraft detection, illustrated in column (a), our model efficiently filters overlapping boxes, yielding more precise detections. In column (b), particularly for categories such as bridges and chimneys situated in complex backgrounds, our model demonstrates increased accuracy. Columns (c) and (d) showcase our model's ability to precisely identify all targets. Furthermore, in columns (e) and (f), when multiple instances of the same class densely populate the image, our model accurately detects the targets, reducing overlapping boxes. These findings affirm the efficacy of the contextual adaptation perception, multi-view learning, and the pseudo-label filtering process.
Images of objects such as airplanes, baseball diamonds, basketball courts, and ships after data augmentation are presented in Figure 5. In WSOD, color enhancement aids the detection of potentially overlooked objects. Furthermore, random erasing, which removes parts of the target or the entire target, forces the network to focus on other salient regions, thereby aiding the detection of potentially missed objects. The effect of the FPL is visualized in Figure 6. During WSOD training, the generated pseudo-labels may affect detection results; due to the absence of instance supervision, a pseudo-label may encompass multiple targets. Hence, FPL is used to filter out incorrect pseudo-labels and mitigate their negative influence on network training. Additionally, Figure 7 presents precision-recall curves for each class, highlighting the model's strong performance in terms of both precision and recall.

Conclusion
This paper introduces the Multi-View Contextual Adaptation Network to enhance weakly supervised remote sensing object detection. By incorporating three distinct pooling methods, ROI pooling, context pooling, and frame pooling, precise object boundaries can be determined, leading to improved performance in weakly supervised learning. This contextual extraction method not only improves the accuracy of positional information but also elevates overall performance. Multi-view learning allows a more comprehensive exploration of data from various perspectives, enabling the network to learn richer visual features during training. Additionally, the pseudo-label filtering process eliminates erroneous pseudo-labels that may adversely affect network learning, thereby enhancing precision. Experimental validation on the NWPU VHR-10.v2 and DIOR benchmarks consistently demonstrates significant improvements over existing methods.

Declarations
• Conflict of interest: The authors declare that they have no conflict of interest.

Fig. 1
Fig. 1 Challenging detection examples in RSIs. Figures (a) and (b) represent densely packed scenes, while (c) and (d) depict scenes with complex backgrounds.

Fig. 3
Fig. 3 Three different types of pooling methods for context-aware localization: ROI pooling, context pooling, and frame pooling. For the context and frame, the fixed ratio between the sides of the outer rectangle and the inner rectangle is 1.8. It is important to emphasize that the pooling types for context and frame are tailored to generate feature maps with identical shapes, specifically frame-shaped feature maps with zero values at the center.
Subsequently, the classification score and localization score are fed into softmax layers. The outputs are [σ(S_cls)]_cn = exp(S_cls,cn) / Σ_{k=1}^{C} exp(S_cls,kn), normalized over classes, and [σ(S_loc)]_cn = exp(S_loc,cn) / Σ_{k=1}^{N} exp(S_loc,ck), normalized over proposals. The final proposal scores are computed by element-wise multiplication of σ(S_loc) and σ(S_cls), yielding S_N = σ(S_loc) ⊙ σ(S_cls). The image score for class c is obtained by summing over all proposals, Z_c = Σ_{n=1}^{N} S_N,cn. Here, [σ(S_cls)]_cn is the probability of proposal n belonging to class c, and [σ(S_loc)]_cn is the normalized weight indicating the contribution of proposal n to the image being classified as class c. During training, the standard multi-class cross-entropy loss is used. For the image-level label vector Y = [y_1, y_2, ..., y_C]^T ∈ R^{C×1}, each y_c indicates the presence (y_c = 1) or absence (y_c = 0) of a target of category c in the image. The loss function is depicted below.
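The two-stream fusion above can be sketched as a minimal NumPy snippet; the C × N matrix layout follows the text, and everything else is an illustrative assumption rather than the paper's code:

```python
import numpy as np

def fuse_scores(S_cls, S_loc):
    """WSDDN-style fusion: softmax S_cls over classes (axis 0) and
    S_loc over proposals (axis 1), multiply element-wise, then sum
    over proposals for the image-level class scores."""
    e_cls = np.exp(S_cls - S_cls.max(axis=0, keepdims=True))
    sig_cls = e_cls / e_cls.sum(axis=0, keepdims=True)   # P(class | proposal)
    e_loc = np.exp(S_loc - S_loc.max(axis=1, keepdims=True))
    sig_loc = e_loc / e_loc.sum(axis=1, keepdims=True)   # proposal weight per class
    S_N = sig_cls * sig_loc          # final proposal scores, S_N = sig_cls ⊙ sig_loc
    Z = S_N.sum(axis=1)              # image score Z_c per class
    return S_N, Z
```

With uniform (all-zero) raw scores for C = 3 classes and N = 4 proposals, every proposal score is 1/12 and every image score is 1/3, as the two softmaxes would predict.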
MSFE [52], and SAE [53]. To ensure experimental fairness, all compared methods use open-source implementations, and the candidate proposals employed in the weakly supervised models are generated through Selective Search. The quantitative evaluation metrics used in the experiments are CorLoc and mAP.

Fig. 4
Fig. 4 The first and third rows present the detection results of OICR on NWPU VHR-10.v2 and DIOR, respectively, while the second and fourth rows showcase the detection results of our method on the same datasets. Green boxes indicate accurately detected items, red boxes signify missed detections, and blue boxes denote erroneous detections.

Fig. 5
Fig. 5 Visualisation of airplanes, baseball diamond, basketball court, and ship after colour enhancement and random erasure.

Fig. 6
Fig. 6 Visualisation of pseudo-labels before and after screening by the pseudo-label filtering process.

Table 1
Comparisons with state-of-the-art methods in terms of AP on the NWPU VHR-10.v2 test set. Best and second best results are noted in bold and italic.

Table 2
Comparisons with state-of-the-art methods in terms of AP on the DIOR test set. Best and second best results are noted in bold and italic.

Table 3
Comparisons in terms of CorLoc for different methods on the NWPU VHR-10.v2 trainval set. Best and second best results are noted in bold and italic.

Table 4
Comparisons in terms of CorLoc for different methods on the DIOR trainval set. Best and second best results are noted in bold and italic.

Table 5
Ablation studies on NWPU VHR-10.v2. CAP stands for Contextual Adaptation Perception, MVL represents Multi-View Learning, and FPL denotes the Pseudo-Label Filtering Process.