Mask-guided infrared small multi-target detection via coarse-to-fine candidate selection

Infrared acquisition technology is often used for target detection in military fields, since it is not easily disturbed by diverse environmental factors. However, the low image contrast, coupled with complex imaging backgrounds, long imaging distances and the lack of visual features, makes infrared small target detection a challenging task. Generic deep neural network-based models struggle with infrared small targets, since the increased number of network layers may lead to the gradual loss of target features and positional information. To address the above issues, we propose a mask-guided detection model via coarse-to-fine candidate selection for infrared small multi-target detection. More specifically, to enhance target features and guide the localization process in the neural network, we propose to utilize a foreground mask generated by referring to the non-local self-correlation property of the infrared background and the sparse property of the target distribution. The obtained mask is treated as a prior to re-weight the convolutional feature maps. Under complex backgrounds, multi-target detection is prone to mis-detections. Therefore, we propose a coarse-to-fine candidate selection method on top of the initial detection results. A shallow network is constructed to extract more nuanced visual features from the candidate positions for a binary classification, in which false positive candidates are ruled out free from the disruptions of other background features. Moreover, given the lack of multi-target infrared datasets, we propose two synthetic datasets based on publicly available and our own collected infrared data. Extensive experimental results verify the effectiveness and advantages of our model compared to state-of-the-art methods.


Introduction
Given that infrared acquisition is mainly determined by the temperature of the target, infrared detection technology has the advantage of not being easily interfered with by environmental factors, and has been widely used in infrared guidance, early warning and other military fields. Apart from its advantages over generic imaging, infrared target detection poses the following unique challenges. First of all, the visual features of infrared targets are deficient. On the one hand, with the long imaging distance, the radiation energy of the infrared target is often weaker than that of the background, and the target only occupies a few pixels in the entire image, lacking texture or shape features (Wei et al. 2003). On the other hand, due to the nature of infrared imaging, the most commonly used color features are unavailable for these small-sized targets. Secondly, the signal-to-clutter ratio (SCR) of the target is low. Affected by the long imaging distance and diverse backgrounds, small targets often possess similar characteristics to the background clutter and noise, leading to low intensity and SCR, so that targets are easily submerged in unpredictable signals. As shown in Fig. 1, we can observe that infrared targets are small-sized with low SCR in diverse application scenarios, and the above phenomena are much more serious in multi-target cases, making infrared small multi-target detection a challenging task.

Fig. 1 Examples of infrared small targets. Part of the target area is enlarged in the top-left corner for better visualization, and the SCR is marked next to each bounding box
Single-frame infrared target detection, as the foundation of its multi-frame counterpart and various infrared vision applications, has been extensively studied for decades (Li et al. 2020; Rawat et al. 2020). Existing methods can be roughly grouped into traditional and deep neural network-based methods. Among traditional ones, filter-based methods (Rivest and Fortin 1996; Wang et al. 2017; Ren et al. 2020) are some of the earliest used for small target detection. They utilize the frequency difference between the target and background to distinguish the target. For example, Seyed et al. (2018) proposed to utilize structural elements based on the genetic algorithm to suppress background clutter and noise. A non-subsampled contourlet transform model combined with SVD was proposed in Tianai et al. (2016) to adjust coefficients through singular values to make the infrared target protuberant. Different from them, local contrast-based methods (Han et al. 2014; Jinhui et al. 2019; Deng et al. 2016; Ma et al. 2021; Qian et al. 2020) use the brightness difference between the target and neighboring areas to detect small targets. Chen et al. (2013) proposed to enhance the target signal and suppress the background simultaneously by referring to the local contrast. Han et al. (2019) designed a matched filter to purposefully enhance the true target before the computation of local contrast to improve detection. Given that infrared targets often present a sparse distribution while the background has low-rank characteristics, low-rank sparse recovery-based methods (Zhao et al. 2011; He et al. 2015) have been proposed, of which the infrared patch-image (IPI) model (Gao et al. 2013) and its variations (Kong et al. 2021; Xiong et al. 2020; Rawat et al. 2021) are the most widely applied. The IPI model focuses on modeling the non-local self-correlation property of the background, and separating the features of sparse targets via a sliding window. Through refactoring the image, the properties of each local patch are enhanced. Motivated by the IPI, the weighted infrared patch-image model (WIPI) (Dai et al. 2016), the re-weighted infrared patch-tensor model (RIPT) (Yimian 2017), and improved optimizations of the IPI were consecutively proposed from different perspectives to enhance the original IPI.
Since traditional methods generally rely on hand-crafted features and have relatively high computational complexity, various neural network-based models (Du et al. 2021; Hou et al. 2021), backed by the advanced representation learning ability of deep learning, have been proposed for infrared target detection. For example, Liu et al. (2017) proposed the first CNN-based model using a five-layer multi-layer perceptron (MLP) for infrared small target detection. Mcintosh et al. (2020) employed the eigen-vectors as the input, and further fine-tuned multiple generic object detection networks. To address the lack of visual features and class information during training, Zhao et al. (2019) proposed TBC-Net, which contains a semantic constraint module that counts the target number as an auxiliary task to assist detection. Fan et al. (2021) proposed to enhance the target intensity by its local intensity characteristics, and combined corner detection with a CNN to locate the infrared target. A multi-patch attention network (MANet) was proposed in Chen et al. (2021), where the global and local properties of infrared small targets are jointly considered to suppress background pixels and capture the target locality. Different from prior works, Shi and Wang (2020) proposed an end-to-end detection framework based on a denoising autoencoder, where small targets are treated as noise. Similarly, Ju et al. (2021) proposed ISTDet, which uses image filtering modules to enhance the response of targets. Dai et al. (2022) proposed a segmentation-based detection framework with an asymmetric contextual module to improve the local contrast of targets at multiple scales. A dense nested attention network (DNANet) was proposed to handle the loss of targets in deep layers via densely connected interactive modules between low-level and high-level features. Moreover, Generative Adversarial Networks (GANs) have also been applied to this task. Zhao et al. (2020) proposed to focus on the essential features of infrared small targets in an adversarial learning manner. Wang et al. (2019) proposed a deep adversarial learning framework by training two adversarial models to reduce miss-detections and false alarms simultaneously.
Though existing infrared target detection models, especially deep neural network-based ones, have achieved promising results on publicly available infrared datasets, the increased number of layers in a deep neural network may lead to the gradual loss of target features and positional information, making smaller targets difficult to detect. Moreover, these methods often focus on detecting a single target or a few targets in each frame (normally two targets); the more challenging case of multi-target detection is yet to be explored. With complex backgrounds, the probability of false alarms increases greatly as the number of targets grows. Therefore, in this paper, we focus on tackling infrared small multi-target detection, where each frame presents more, smaller, and dimmer targets compared to existing cases.
To address the above issues, we propose a mask-guided detection model via coarse-to-fine candidate selection for infrared small multi-target detection. More specifically, considering the non-local self-correlation property of the infrared background and the sparsity of targets, we obtain a rough foreground mask by recovering the low-rank and sparse matrices of the infrared image. The foreground mask is then used to re-weight the convolutional feature map, which guides the network to focus more on the small targets. To further eliminate the increased falsely detected regions, we treat them as candidates, and design a shallow network to extract nuanced features from them. A binary classifier is applied as a coarse-to-fine candidate selection process to obtain the final results. Overall, the main contributions of this paper can be summarized as follows:

1. We propose a mask-guided neural network for infrared small multi-target detection. By recovering the low-rank and sparse matrices of the infrared background and targets, the obtained foreground mask guides the network to focus more on the extraction of small target features;
2. We design a coarse-to-fine candidate selection process to reduce false alarms in multi-target detection, where the initially detected regions are treated as candidates, and nuanced visual features are extracted from them for further classification;
3. We propose two synthetic datasets based on publicly available and our own collected infrared data. Extensive experiments on these datasets demonstrate the advantages of our method compared against state-of-the-art models.
The remainder of the paper is organized as follows: in Sect. 2, we describe the details of our model and conduct thorough analysis. In Sect. 3, we present the experimental comparisons on the proposed datasets. Last, conclusions are given in Sect. 4.

Overall architecture
The overall architecture of the proposed model is shown in Fig. 2. The model takes an infrared image as input and sequentially performs feature extraction, initial detection, and candidate selection. For the feature extraction, a foreground mask generated by recovering the low-rank and sparse matrices of the infrared image is utilized to guide the extraction process to emphasize small targets. With the initial detection results as candidates, we design a coarse-to-fine candidate selection process: the candidates are screened by the shallow visual features of the corresponding original image regions to reduce the false alarm rate of the network. A light-weight CNN classifier determines whether each candidate is a real target or not.

Fig. 2 The overall architecture of the proposed model. The feature extraction of the infrared image is guided by the generated mask to concentrate more on the nuanced small target areas. The initial candidate regions are further screened by the proposed coarse-to-fine selection process to obtain the final detection results

Mask-guided feature extraction
Mask-guided feature extraction aims to generate a foreground mask that is instructive for the detection of small targets in infrared images by enhancing target features. Considering the sparsity of targets, which generally occupy only a few pixels with a scattered distribution, we adopt the infrared patch-image (IPI) model (Gao et al. 2013) to generate the foreground mask. Given the sparsity of the target distribution and the low-rank property of infrared backgrounds, the problem of detecting small targets is converted to the optimization problem of recovering the low-rank and sparse matrices of the original image. The accelerated proximal gradient (APG) algorithm (Toh and Yun 2010) is employed for the optimization.

Fig. 3 The illustration of the mask-guided feature extraction module. The input infrared image $I$ is constructed and reconstructed to generate the initial mask $M_i$, and the final mask $M$ is obtained by normalization and thresholding sequentially

More specifically, as shown in Fig. 3, we obtain image patches through a series of sliding windows from the top-left to the bottom-right, and reformulate them as a new patch-image by vectorizing each patch into a column vector. After the construction, we can treat each patch-image as:

$$O = B + T \tag{1}$$

where $O$ is the constructed patch-image, and $B$ and $T$ are the background and target patch-images to be estimated, respectively. Since the proportion of targets is relatively small for the whole image, the target patch-image $T$ can be regarded as a sparse matrix. The background pixels of an infrared image are often correlated with each other even at distant positions, i.e., they possess the non-local self-correlation property. Based on this characteristic, the background patch-image $B$ can be treated as a low-rank matrix. The target patch-image $T$ and background patch-image $B$ are simultaneously estimated by the APG algorithm, and the sparse target patch-image $T$ is reconstructed as the initial foreground mask.
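As a reference, the following is a minimal sketch of the patch-image construction step; the window size and sliding step are illustrative defaults (the values used in this paper are not specified), and the low-rank/sparse decomposition itself would be delegated to an APG-based solver.

```python
import numpy as np

def build_patch_image(image, patch=50, stride=10):
    """Slide a window from the top-left to the bottom-right, vectorize each
    patch into a column, and stack the columns into the patch-image O."""
    h, w = image.shape
    cols = [image[y:y + patch, x:x + patch].reshape(-1)
            for y in range(0, h - patch + 1, stride)
            for x in range(0, w - patch + 1, stride)]
    return np.stack(cols, axis=1)  # O = B + T, with B low-rank and T sparse
```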
Given that the initial mask highlights the sparse component of the original image, to further improve its accuracy, we obtain the final mask by first normalizing the initial mask and multiplying it with the original input image:

$$M_{norm}(x, y) = N(M_i(x, y)) \times I(x, y) \tag{2}$$

where $x$ and $y$ represent the spatial indices of a position, $M_i$ and $I$ indicate the initial mask and the input image, and $N(\cdot)$ stands for min-max normalization. Then, we threshold the normalized output $M_{norm}$. The threshold $t$ is determined by:

$$t = \mu + k\sigma \tag{3}$$

where $\mu$ and $\sigma$ are the mean value and standard deviation of the target image, respectively, and $k$ is the weight balancing the two values; by referring to Gao et al. (2013), we empirically set it to three. A pixel is considered to belong to the target if its value is above the threshold; otherwise, it is a background pixel. We set background pixels to zero and keep target pixels as they are. After the thresholding, we obtain the final mask $M$.
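A minimal sketch of this normalization-and-thresholding step follows; it assumes the statistics in Eq. (3) are computed over the normalized output $M_{norm}$, which is one possible reading of "the target image".

```python
import numpy as np

def finalize_mask(initial_mask, image, k=3.0):
    """Normalize the initial mask and modulate by the input image (Eq. 2),
    then threshold with t = mu + k*sigma (Eq. 3), zeroing background pixels
    and keeping target pixels as they are."""
    lo, hi = initial_mask.min(), initial_mask.max()
    m_norm = (initial_mask - lo) / (hi - lo + 1e-8) * image.astype(np.float32)
    t = m_norm.mean() + k * m_norm.std()
    return np.where(m_norm > t, m_norm, 0.0)  # the final mask M
```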
The obtained mask $M$ contains the positions of targets and other target-like background pixels. We use this mask to guide the feature extraction process, namely enhancing the masked part of the convolutional feature map to guide the network to focus on the positions of potential targets as below:

$$F' = F \otimes M_r + F \tag{4}$$

where $F$ represents the extracted convolutional feature map, $M_r$ is the mask resized to the same spatial size as $F$, and $\otimes$ denotes element-wise multiplication. The mask first multiplies the feature map and is then added to it, which highlights the features of potential target areas. The processed feature map $F'$ is utilized for subsequent network operations.
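In PyTorch, this re-weighting can be sketched as below; the bilinear resizing mode is an assumption of this example.

```python
import torch
import torch.nn.functional as nnf

def mask_guide(feat, mask):
    """Apply Eq. (4): resize the 2-D mask M to the spatial size of the
    feature map F (shape B x C x H x W), then compute F' = F * M_r + F."""
    m_r = nnf.interpolate(mask[None, None], size=feat.shape[-2:],
                          mode='bilinear', align_corners=False)
    return feat * m_r + feat
```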

Coarse-to-fine candidate selection
False alarms are inevitable in the initial detection results for multi-target cases, especially when the foreground masks are utilized to re-weight the convolutional feature maps. To further refine the initial results and make use of the visual features of small targets, we propose a coarse-to-fine candidate selection process to screen out the true predictions.
The initially detected regions are treated as candidates, and we locate the corresponding regions in the original image to crop out image patches. With these patches, we can utilize a shallow convolutional neural network to extract the visual features of targets without interference from the extra background. Therefore, a simple yet effective binary classification network can be trained to classify a given candidate as a true target or not. The selection process is shown in Fig. 4.

Fig. 4 The architecture of the candidate selection process, where $w$ and $h$ represent the size of the candidate region
More specifically, the selection network takes the bounding boxes of true targets and surrounding regions as positive and negative samples, feeds them through two 3 × 3 convolutional layers, each followed by a ReLU layer, to extract features, and further passes them through fully-connected layers to compute the classification confidence. During inference, we feed the initial candidates to the trained classifier, and remove the ones identified as background.
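The following PyTorch sketch illustrates such a selection network; the channel widths, crop size, and hidden dimension are illustrative assumptions, since the text only fixes the two 3 × 3 convolution + ReLU layers and the fully-connected head.

```python
import torch.nn as nn

class CandidateClassifier(nn.Module):
    """Minimal sketch of the candidate selection network: two 3x3 conv +
    ReLU layers, then fully-connected layers producing a single logit for
    binary target/background classification. Assumes 16x16 gray crops."""
    def __init__(self, crop=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * crop * crop, 64), nn.ReLU(),
            nn.Linear(64, 1),  # confidence that the candidate is a true target
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

At inference time, for example, candidates whose sigmoid confidence falls below 0.5 would be discarded as background.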

Loss function
The loss function of the proposed model contains two parts. One is the loss of detection network, the other is for the candidate selection network.
The loss function of the detection network mainly consists of three different losses, i.e., the bounding box regression loss, the confidence loss, and the category loss (Redmon et al. 2016). The regression loss includes the loss of the target position $x$ and $y$ and the loss of the target size $w$ and $h$. The loss for the horizontal position $x$ is defined as follows, and the losses for $y$, $w$ and $h$ are of the same form:

$$\mathrm{Loss}_x = \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left( x_i^j - \hat{x}_i^j \right)^2 \tag{5}$$

where $i$ is the index of the image cell, and each image is separated into $S^2$ cells; $B$ indicates the number of predictions from one cell; $I_{ij}^{obj}$ denotes that the $j$-th bounding box predictor in cell $i$ is responsible for that prediction; and $x$ and $\hat{x}$ represent the prediction and the ground-truth, respectively.
The confidence loss and the category loss are defined as follows, which are the same as in the original Yolov3 (Redmon and Farhadi 2018) framework:

$$\mathrm{Loss}_{conf} = -\sum_{i=0}^{S^2} \sum_{j=0}^{B} \left[ \hat{C}_i^j \log C_i^j + \left(1 - \hat{C}_i^j\right) \log \left(1 - C_i^j\right) \right] \tag{6}$$

$$\mathrm{Loss}_{cls} = -\sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ \hat{P}_i^j \log P_i^j + \left(1 - \hat{P}_i^j\right) \log \left(1 - P_i^j\right) \right] \tag{7}$$

where $C_i^j$ represents the confidence of each cell, $\hat{C}_i^j$ is a binary value of zero or one determined by whether the bounding box of the cell is responsible for a prediction, and $P_i^j$ is the classification confidence. The final loss takes the form of the sum of the above losses. Since the detection network performs single-category detection and the size of the targets is relatively small, we re-weight the bounding box loss to achieve a better balance among losses as:

$$\mathrm{Loss} = \lambda_{loc} \left(\mathrm{Loss}_x + \mathrm{Loss}_y\right) + \lambda_{size} \left(\mathrm{Loss}_w + \mathrm{Loss}_h\right) + \mathrm{Loss}_{cls} + \mathrm{Loss}_{conf} \tag{8}$$

where $\lambda_{loc}$ and $\lambda_{size}$ stand for the weights of the regression losses for the target location and size, respectively. For infrared small multi-target detection, the imaging sizes of targets are distributed in a certain range, which is relatively easy for the detection model to fit during training. However, capturing target locations is non-trivial, given the flexible spatial distributions of multiple targets in each image. Therefore, it is intuitive to set a larger weight for the regression loss of the location than for the loss of the height and width of targets. More details regarding the weight selection can be found in Sect. 3.4.
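As a sketch, the re-weighted box terms of Eq. (8) can be written as follows; the tensor layout (last dimension ordered x, y, w, h) is an assumption of this example.

```python
import torch

def weighted_box_loss(pred, target, obj_mask, lam_loc=50.0, lam_size=1.0):
    """Box terms of Eq. (8): squared errors on (x, y) weighted by lam_loc
    and on (w, h) by lam_size, summed over responsible predictors."""
    se = (pred - target) ** 2 * obj_mask.unsqueeze(-1)
    return lam_loc * se[..., :2].sum() + lam_size * se[..., 2:4].sum()
```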
Moreover, for the loss function of the candidate selection network, we employ the binary cross-entropy loss:

$$\mathrm{Loss}_{cs} = -\left( \hat{y} \log(y) + (1 - \hat{y}) \log(1 - y) \right) \tag{9}$$

where $y$ is the predicted confidence of the candidate and $\hat{y}$ is the binary ground-truth label.
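For reference, a minimal training step for the selection network under this objective could look as follows; the use of logits and the optimizer handling are illustrative choices of this sketch.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()  # Eq. (9), applied to raw classifier logits

def selection_step(model, optimizer, crops, labels):
    """One illustrative step: crops are candidate image patches, labels are
    1 (true target) / 0 (background)."""
    optimizer.zero_grad()
    loss = criterion(model(crops).squeeze(1), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```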

Implementation details
Our model was implemented with the PyTorch framework on a workstation with an Intel Xeon W-2123 @ 3.60 GHz CPU and two NVIDIA RTX 2080 Ti GPUs. During training, we set the batch size to 8 and the initial learning rate to 0.01. The Adam algorithm (Kingma and Ba 2014) was used as the optimizer, with the momentum and weight decay set to 0.9 and 0.0005, respectively. The weights $\lambda_{loc}$ and $\lambda_{size}$ of the loss were empirically set to 50 and 1. We adopted an anchor-based detection framework and set nine anchor sizes, ranging from 2 × 2 to 9 × 9 with an additional 15 × 15, by referring to the k-means algorithm and manual adjustments. We trained the detection model with the generated masks for 350 epochs, which took around 45 hours. For inference, a single image was processed in about 4.05 × 10⁻² s (24.7 FPS) with the given mask.
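As an illustration of the anchor selection, a plain Euclidean k-means over ground-truth box sizes might look as below; the distance metric and iteration count are assumptions of this sketch, and the resulting anchors are then adjusted by hand (e.g., adding the 15 × 15 anchor) as described above.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=50, seed=0):
    """Cluster ground-truth (width, height) pairs into k anchor sizes."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(np.float64)
    for _ in range(iters):
        labels = np.linalg.norm(wh[:, None] - centers[None], axis=2).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]  # sorted by area
```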

Experiments
In this section, we introduce the details of constructing the synthetic multi-target datasets and the evaluation metrics employed for multi-target detection. Next, we compare the proposed method against state-of-the-art models and conduct a thorough analysis. Finally, we present ablation studies to investigate the effectiveness of the proposed components.

Datasets
For the experimental evaluation, we constructed synthetic datasets based on the publicly available IDST dataset (Hui et al. 2020) and our own collected infrared data. These two datasets contain different types of background, which demonstrates the robust detection ability of our model. By referring to the general setting of object detection (Redmon and Farhadi 2018), we did not explicitly distinguish the background type during training. As for the targets, each dataset contains a single target type, namely UAV and airplane, respectively. Given that the original datasets generally contain only one or two targets in each frame, to mimic realistic scenarios, we converted them into multi-target counterparts. The adopted simulation strategy mainly consists of two steps.

Target template design: First, we selected appropriate targets from the original dataset and extracted their corresponding areas on the original image as the initial templates by referring to their ground-truth bounding boxes. Next, considering that an actual target often has diverse brightness in different scenes, we adjusted the brightness of the template to fit the brightness changes of real scenes. Specifically, we took the mean value of the template as the threshold to separate the foreground and background. The part of the area above the threshold was considered as the foreground, and its value can be enlarged or reduced by a certain proportion to mimic brightness enhancement and weakening. Finally, given the moving characteristics of infrared targets, we applied a Gaussian blur to fit the blurry appearance of moving targets.
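A minimal sketch of this template design step is given below; the brightness scale and blur sigma are illustrative values, not the ones used to build the datasets.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_template(crop, brightness=1.2, sigma=0.8):
    """Scale mean-thresholded foreground pixels to mimic brightness changes
    (>1 enhances, <1 weakens), then blur to imitate target motion."""
    tpl = crop.astype(np.float32)
    tpl[tpl > tpl.mean()] *= brightness  # foreground = pixels above the mean
    return gaussian_filter(tpl, sigma=sigma)
```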
Target insertion: After obtaining the target templates, we manually selected a reasonable initial area and randomly selected the coordinates of a start point in this area as the position of the target center. Then, we inferred the tracks of the inserted targets based on the motion tracks of the original target among consecutive frames, and added certain random displacements to the tracks. Finally, we transformed the sizes and brightness of the targets into one of the pre-defined settings, and fused them with the original image using Poisson integration. The pre-defined settings of size and SCR, as well as the number of targets for each mode, can be found in Table 1.
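Poisson fusion of a template into a frame can be sketched with OpenCV's seamlessClone, for example; the gray-to-BGR round trip is an assumption of this sketch, since seamlessClone expects 3-channel images.

```python
import cv2
import numpy as np

def insert_target(background, template, center):
    """Fuse a (uint8, single-channel) target template into the background
    at `center` via Poisson blending."""
    bg3 = cv2.cvtColor(background, cv2.COLOR_GRAY2BGR)
    tpl3 = cv2.cvtColor(template, cv2.COLOR_GRAY2BGR)
    mask = np.full(template.shape[:2], 255, np.uint8)  # blend the whole patch
    fused = cv2.seamlessClone(tpl3, bg3, mask, center, cv2.NORMAL_CLONE)
    return cv2.cvtColor(fused, cv2.COLOR_BGR2GRAY)
```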
For the IDST dataset, we processed it with the above method and obtained its multi-target counterpart, denoted as IDSMT, which includes 16177 images with about 18 targets in each frame and three types of diverse backgrounds. By referring to Zhang et al. (2003), a small target is defined as one occupying less than 0.15% of the pixels in an entire image; the constructed IDSMT meets this standard. The image size of IDSMT is 512 × 512, and the target sizes vary from 2 × 2 to 10 × 10 pixels. Apart from the original targets, the entire dataset contains about 97062 objects of each type. The backgrounds include 1397 images of pure sky, 901 images of combined sky and ground field, and 13879 images of pure ground field. The total 22 groups of data are divided into training and test sets at a ratio of 7:3. All types of target size and backgrounds are evenly distributed between the training and test sets. Examples of infrared multi-target images from IDSMT are shown in Fig. 5.
Moreover, we also collaborated with the Shanghai Institute of Technical Physics of the Chinese Academy of Sciences to collect a second infrared dataset, denoted as SITP. These images were captured to record planes from a distance of 10 km, and each frame contains one or two planes as targets. In order to model the multi-target scenario, we extended this dataset by following the same synthetic protocol, and denote the result as SITPMT. After removing invalid frames, SITPMT contains 15101 images in total, and the background is mainly composed of the sky and buildings. Each frame includes about 12 targets on average, and there are 53796 Type3 objects, 75784 Type2 objects, and 53985 Type1 objects, respectively. The size of each image is 320 × 256. Example images of SITPMT are shown in Fig. 6.

Fig. 6 Infrared images of SITPMT with three consecutive frames

Evaluation metrics
To quantitatively evaluate the performance of the proposed method on the infrared multi-target detection task, by referring to Ju et al. (2021), we select precision (P), recall (R), average precision (AP) and the F1 score. Precision is the ability of a model to identify only the relevant objects, i.e., the percentage of correct positive predictions, and is given by:

$$P = \frac{TP}{TP + FP} \tag{10}$$

Recall is the ratio of true positive instances against the sum of true positives and false negatives, based on the ground-truth, and is given by:

$$R = \frac{TP}{TP + FN} \tag{11}$$

The F1 score is a measurement that leverages both precision and recall, and is given by:

$$F1 = \frac{2 \times P \times R}{P + R} \tag{12}$$

True positive (TP) represents the number of correct detections. False positive (FP) counts the number of incorrect predictions. False negative (FN) indicates the number of ground-truth boxes missing from the results. We consider a prediction as TP or FP by calculating the IoU and the distance of center points between the ground-truth bounding box and the prediction. An additional metric that computes the area under the precision-recall curve, i.e., average precision (AP), is also employed for the evaluation.
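These scores follow directly from the TP/FP/FN counts; a small helper, for instance:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from Eqs. (10)-(12), guarding empty cases."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```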

Results and analysis
In this section, we demonstrate the effectiveness of the proposed method by comparing it against classic and state-of-the-art methods, including the traditional IPI method (Gao et al. 2013) and deep neural network-based models. For the neural network-based models, we compared against the generic object detection model Yolov3 (Redmon and Farhadi 2018), as well as models tailored for infrared target detection: ACM, ALCNet (Dai et al. 2022), DNANet, and AGPCNet. For fair comparisons, the compared models were implemented by referring to their publicly released code and re-trained on our synthetic datasets with the same experimental settings.
Table 2 shows the results of the different methods on IDSMT, with the best results shown in bold. The traditional IPI method behaves worse than the deep neural network-based methods, illustrating the advanced learning ability of network models. Among the network-based methods, ACM and ALCNet rely on a bottom-up attention mechanism to preserve the spatial information of small targets, while AGPCNet uses a relatively complicated pyramid structure to capture the correlation among adjacent pixels, which achieves better performance. Compared to these methods, DNANet utilizes a nested network architecture combined with an attention mechanism, which obtains a high recall. However, due to the interference of complex backgrounds, the precision of these methods is somewhat limited. It is worth noting that these methods treat small target detection as a segmentation task, and further ground the foreground pixels with bounding boxes. Different from them, we utilized the Yolov3-based framework and adjusted the anchor sizes to suit the identification of small targets. As we can see, by introducing the mask-guided feature enhancement tailored for detecting small targets, and employing the coarse-to-fine candidate selection process, our proposed model obtains significant improvements in AP and F1 scores over the plain Yolov3, and outperforms state-of-the-art models.

Fig. 7 shows the visualized results of the different methods on IDSMT. We can observe in (b) that the false alarms generated by IPI are mainly background areas that possess similar brightness to the targets: a certain number of bright targets are omitted, while a small number of background areas close to the targets are incorrectly detected. As shown in the red boxes of (d) and (e), for targets with similarly high brightness to buildings, ACM and ALCNet tend to miss them. Besides, ALCNet also misses the dimmer targets. Meanwhile, in the blue boxes of (c) and (d), ACM and ALCNet also produce false alarms. DNANet is largely affected by the backgrounds, with false alarms occurring in complex areas, as illustrated in the blue box of (g). In (f), we can see that the results of AGPCNet basically fit the ground-truth; however, multiple redundant predictions occur for the same target.
The results of our model are the closest to the ground-truth, demonstrating the advantage of our model in detecting multiple targets under complex backgrounds.
The experimental results on SITPMT in Table 3, in which the best results are shown in bold, and the visualized results in Fig. 8 are consistent with those on IDSMT, demonstrating the robustness of the proposed model across different scenes. Specifically, as shown in the boxes in (c), (d) and (e), for targets appearing around buildings, our model performs significantly better than plain Yolov3, ACM and ALCNet. For dim targets inside clouds, the proposed model can successfully locate them, while AGPCNet and DNANet often fail, as illustrated in the boxes of (f) and (g).

Ablation study
To verify the effectiveness of each proposed component, we conducted a thorough ablation analysis. The plain Yolov3 was adopted as our backbone network, and we added each component on top of the backbone to carry out the experiments shown in Table 4, with the best results in bold. It can be seen that the mask-guided feature extraction achieves a 3.55% improvement over the baseline on AP, while the candidate selection obtains 2.76%, and the final model performs best with the two components together. We also observe that the mask-guided feature extraction mainly improves the recall, while the candidate selection process screens out false alarms to further increase the precision. We visualize the predicted results of the baseline, its mask-guided version, and the model after candidate selection. As can be observed from the red boxes in the second column of Fig. 9, the dimmer targets can be located with the mask-guided feature extraction. Though introducing the mask increases the false alarms, they can be effectively removed by the candidate selection, as shown in the blue boxes.

Fig. 9 The visualized results of ablation studies on IDSMT. Part of the areas is enlarged in corners for better visualization. The pink boxes are the detected boxes. a Baseline, b Baseline + mask-guided feature extraction, c proposed model

We also investigated the mask-guided position, i.e., which convolutional feature map should be guided in order to obtain the optimal effect. As shown in Table 5, we selected six positions for the experiments, with the best results shown in bold. The positions are denoted as follows:

1. Start: the output of the first convolutional layer of Yolov3.
2. Middle: the output of the convolutional layer after two down-sampling layers.
3. Before Branch: the output of the convolutional layer before the first branch.
4. Branch Start: the outputs of the first convolutional layers of the three branches.
5. Branch Middle: the outputs of the convolutional layers after four layers in the three branches.
6. Branch End: the outputs of the last convolutional layers before the detection heads in the three branches.

Table 5 shows that Before Branch is the best position to apply the mask, since the feature map at this position possesses both spatial and semantic features, and adding the mask-guided information here helps the subsequent network pay more attention to the features of the highlighted areas. It can also be seen that insertion at the Branch End greatly improves the recall, yet has a negative impact on the precision. The reason is that the feature extraction of targets has already been completed at the Branch End; adding the guided information here might introduce interference from the generated masks into the extracted features, and since there are no subsequent convolutional layers to further adjust the features, such interference could lead to a decrease in precision. Therefore, we adopted the Before Branch position in our final model.

Moreover, we also studied the sensitivity of the different weight ratios $\lambda_{loc} : \lambda_{size}$ among the bounding box losses in Eq. (8). To verify the rationality of setting $\lambda_{loc}$ larger than $\lambda_{size}$, we conducted a sensitivity analysis over weight ratios ranging from 100:1 to 1:10. Table 6 shows the results of the different weight ratios, with the best results in bold. As we can see, $\lambda_{loc} : \lambda_{size} = 50 : 1$ reaches the best overall performance. The reason is that, in our experiments, the detection errors were mainly caused by inaccurate localization rather than by the size of the predicted bounding box. Therefore, it is reasonable to assign relatively large weights to $\mathrm{Loss}_x$ and $\mathrm{Loss}_y$.

Conclusions
In this paper, we propose a mask-guided detection model via coarse-to-fine candidate selection for infrared small multi-target detection. The mask-guided feature extraction utilizes the sparse and low-rank properties of small targets and infrared backgrounds to generate the foreground mask, and further re-weights the convolutional feature maps to guide the network to focus on small targets, while the candidate selection process is conducted in a coarse-to-fine fashion to rule out false alarms caused by complex backgrounds, using nuanced visual features obtained from the original image areas. Extensive experiments demonstrate that the proposed method achieves the best performance against state-of-the-art methods under infrared small multi-target detection scenarios.
Funding This work was supported in part by the National Natural Science Foundation of China (No.61572307).

Data availability statement
The datasets analysed during the current study are available from the corresponding author on reasonable request.

Conflict of interest
The authors declare that they have no conflict of interest.
Consent to participate

All authors agreed to participate in this research work.

Consent for publication
All authors agreed to the publication of this research paper.