MS-AFF: a novel semantic segmentation approach for buried objects based on multi-scale attentional feature fusion

Infrared technology is widely used in buried object detection since it captures the heat radiated outward by target objects. Compared with visible images, infrared images present poor resolution, low contrast, and fuzzy visual effects, making it challenging to detect target objects, especially in complex backgrounds. In recent years, deep learning-based methods have brought significant improvements to detection tasks. However, infrared images of buried objects are difficult to obtain, and with fewer training samples, deep learning-based detectors perform worse. To solve this problem, we propose a multi-scale attentional feature fusion module for infrared image semantic segmentation. Specifically, we integrate a series of feature maps from different levels through an atrous spatial pyramid structure, so that the model obtains rich representation ability on infrared images. Besides, a global spatial information attention module is employed to let the model focus on the target region and reduce disturbance from the background of infrared images. In addition, we build an infrared segmentation dataset based on an infrared thermal imaging system. Finally, we train state-of-the-art object detection methods on the buried object dataset and compare them with our proposed method. Extensive experiments on the infrared segmentation dataset demonstrate the superiority of our method.


Introduction
Infrared technology is one of the most powerful technologies used in precision guidance and mine detection. Much research has aimed at improving infrared imaging performance, such as Lang et al. (2013) and Wen et al. (2018). Image segmentation is the process of extracting objects or regions of interest from an input image. It is an essential step in object detection, recognition, tracking, and related technologies, and its primary function is to separate the object information in the image from the background. However, due to external environmental influences such as temperature, airflow, and radiation, infrared images tend to have low resolution and high noise. This makes it challenging to extract target objects from infrared images with complex backgrounds.
Traditional image segmentation methods include thresholding based on the image histogram (Otsu 2007), the region merging of Nock and Nielsen (2004), and the K-means clustering of Dhanachandra et al. (2015), whose result is shown in Fig. 1. Since infrared images are generally obtained by measuring the heat radiated outward by the object, they have poor resolution, low contrast, and fuzzy visual effects, and there is no linear relationship between the gray-level distribution and the target's reflection. Therefore, the K-means method cannot achieve an excellent result. AlSaeed et al. (2012) improved Otsu's method. However, these traditional algorithms only consider the gray-level information between image pixels. When the gray-level information, affected by noise, has no apparent linear relationship with the object, the segmentation effect is poor.
Deep convolutional neural networks (LeCun et al. 1989; Krizhevsky et al. 2012) have achieved a collection of advances in computer vision tasks. CNNs are gradually being adopted to extract image features instead of hand-crafted features, and using a CNN to solve traditional image problems has become a trend; the field of image segmentation is no exception. FCN (Long et al. 2015) is a milestone of image segmentation: it employs a fully convolutional network to extract image features, realizing end-to-end training on images of different sizes, but ignoring high-resolution feature maps causes a loss of edge information. Moreover, Chen et al. (2014) proposed a combination of CNN and CRF to overcome the relatively low localization accuracy of deep convolutional neural networks: combining the last layer of the neural network with a fully connected CRF yields more accurate boundary information. Another commonly used segmentation architecture based on deep learning is the encoder-decoder structure. DeconvNet (Noh et al. 2015) is one of the earlier semantic segmentation networks using deconvolution. The model contains two parts: an encoder composed of VGG16, and a deconvolution network that takes the encoder's output as input and finally generates a pixel-level prediction probability map. The core of SegNet (Badrinarayanan et al. 2017) consists of an encoder network, a corresponding decoder structure, and a final pixel-level classification layer. Its main contribution is that the decoder performs non-linear up-sampling on its input features; up-sampling maps and filters are trained together to produce dense feature maps. However, SegNet cannot capture global semantic information well, and misjudgments often occur. U-net, proposed by Ronneberger et al. (2015), is likewise a classic encoder-decoder network for medical image segmentation. Zhao et al. (2017) proposed the Pyramid Scene Parsing Network (PSPNet), which aggregates context information over various regions and then exploits the relative information. Dilated convolution (also called atrous convolution) can enlarge the receptive field and reduce the computational cost without losing spatial resolution.

Fig. 1 The image is segmented by the clustering method: a original image, b result of the clustering method
When different dilation rates are set, multi-scale contextual information can be captured. DeepLabv1 (Chen et al. 2014) and DeepLabv2 (Chen et al. 2018a) are among the most popular segmentation methods. The latter uses dilated convolution to counter resolution reduction and uses atrous spatial pyramid pooling (ASPP) to capture objects and context information at multiple scales. Chen et al. then introduced DeepLabv3 (Chen et al. 2017) and DeepLabv3+ (Chen et al. 2018b), which use parallel modules of dilated convolution and improve the ASPP structure, adding a 1 × 1 convolution kernel and batch normalization to fuse features. Xu et al. (2018) employed semantic segmentation to process aluminum electrolyte images. Furthermore, ResNet (He et al. 2016) was designed to solve the network degradation problem caused by deepening networks. Although the above techniques work well on public datasets, they have limitations on infrared images. Infrared images, obtained by measuring the heat radiated from the object, have poor resolution, low contrast, and fuzzy visual effects, and there is no linear relationship between the gray-level distribution and the target's reflection. Semantic information in infrared images is scarce, so the above methods cannot segment them well.
To solve these problems, we make improvements based on PSPNet (Zhao et al. 2017). PSPNet uses a pre-trained ResNet model with dilated convolution to extract image features and then applies a pyramid pooling module (He et al. 2014) to obtain semantic information. Finally, these features are merged, and a convolutional layer generates the prediction result. However, directly up-sampling after the pyramid pooling module loses part of the spatial information. We therefore design a feature fusion structure based on ASPP, in which the global attention upsample (GAU) module (Li et al. 2018) better fuses shallow and deep information and avoids the information loss caused by brute-force up-sampling.
Furthermore, we propose a new type of attention module. It integrates spatial and channel information, making the network focus on the target, thereby suppressing the background and obtaining more accurate segmentation. We also build an infrared segmentation dataset based on an infrared thermal imaging system. The main contributions of this paper are as follows:

(1) A multi-scale attentional feature fusion (MS-AFF) method for infrared image semantic segmentation is proposed, integrating a series of feature maps from different levels through an atrous spatial pyramid structure. Our model obtains rich representation ability on infrared images.
(2) We propose a global spatial information attention module to let the model focus on the target region and reduce disturbance from the infrared background.

(3) We build a buried object infrared segmentation dataset based on an infrared thermal imaging system. Extensive experiments on this dataset show the advantages of our method.
The remainder of the paper is organized as follows. Section 2 presents the architecture of our method and its analysis. Section 3 evaluates our network's performance on the infrared segmentation dataset and gives the implementation details. Finally, we draw conclusions in Sect. 4.

Methods and materials
In this section, building on the pyramid scene parsing network, we use an atrous convolutional ResNet to extract image features, so that more features are obtained without losing image resolution. In the subsequent pyramid pooling, we use an improved ASPP structure to replace PSPNet's pyramid pooling module and select a GAU module to fuse multi-scale information within the ASPP. This structure aggregates semantic information from multiple scales and better integrates shallow and deep semantic details, thereby providing more accurate localization. We also design a global spatial information attention module to focus on the target and ignore the background, making segmentation more accurate and improving the network's recognition ability in complex scenes. The main structure of the network, shown in Fig. 2, contains three sections: a feature extraction part, a multi-scale attentional feature fusion (MS-AFF) module, and a global spatial information attention module.

Atrous convolution for infrared image feature extraction
The convolutional neural network's main job is to learn infrared image features, and the extracted features have a decisive effect on the subsequent segmentation. Xia et al. (2019) likewise noted that feature extraction accuracy directly affects the final classification accuracy. Feature extraction networks mainly include AlexNet (Krizhevsky et al. 2012), VGG (Simonyan and Zisserman 2014), etc. The more layers a network has, the richer and more abstract the features it can learn, but this brings the problem of degraded learning ability. The residual structure in ResNet proposed by He et al. (2016) solves this problem, so we use ResNet-101 as the backbone network. Direct down-sampling in ResNet loses part of the spatial information, which atrous convolution can avoid. Atrous convolution can enlarge the receptive field arbitrarily and reduce the computational cost without losing spatial resolution, so it is widely used in object detection and segmentation. Wu et al. (2016) applied atrous convolution to capture larger regional information, and Wang et al. (2018a) selected atrous convolution to extract features in ResNet. In the last two stages of the backbone network, we cancel the down-sampling operation and use atrous convolution instead to make up for the lost receptive field, extracting semantic information from different levels to enrich the representation. Finally, we obtain the feature maps R1, R2, R3, R4, R5.
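As a sketch of why atrous convolution preserves resolution, the following minimal numpy example (not the paper's code; the real backbone uses learned ResNet-101 layers) compares a 3 × 3 convolution at dilation rates 1 and 2: the output size stays equal to the input size while the receptive field grows from 3 × 3 to 5 × 5.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """'Same'-padded 2-D convolution with a dilation (atrous) rate.

    Inserting rate-1 zeros between kernel taps enlarges the receptive
    field without down-sampling the feature map.
    """
    kh, kw = kernel.shape
    # Effective kernel extent after dilation.
    eh, ew = (kh - 1) * rate + 1, (kw - 1) * rate + 1
    ph, pw = eh // 2, ew // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            # Strided slice picks the dilated kernel taps.
            patch = xp[i:i + eh:rate, j:j + ew:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3)) / 9.0
y1 = dilated_conv2d(x, k, rate=1)   # ordinary 3x3, receptive field 3x3
y2 = dilated_conv2d(x, k, rate=2)   # same output size, receptive field 5x5
```

Both outputs have the same spatial size as the input, which is exactly the property exploited in the last two backbone stages.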

A multi-scale attentional feature fusion (MS-AFF)
The ASPP structure was first proposed in DeepLabv2 (Chen et al. 2018a); it is composed of four atrous convolutions with different atrous rates and exploits atrous convolution to extract multi-scale context information in parallel. As shown in Fig. 2, we use a 1 × 1 convolution to replace the atrous convolution with rate 24, and use global average pooling (Lin et al. 2013) to obtain global feature information and reduce information loss. Of course, atrous convolution also has problems: because the result of the current layer's atrous convolution comes from independent combinations of the previous layer's outputs, there is no mutual dependence, which loses local information, and information obtained over long distances is not correlated. To solve this problem, we design a novel feature fusion structure based on ASPP, applying GAU (Li et al. 2018) to fuse information between contexts, as shown in Fig. 3. GAU is a new type of pyramid attention model. To reduce the number of channels in the CNN feature map, a 3 × 3 convolution is performed on the low-level features. The global context information generated from the high-level features undergoes a 1 × 1 convolution, then batch normalization and a nonlinear transformation in turn, and is subsequently multiplied with the low-level features. Ultimately, the high-level features are added to the weighted low-level features, and a gradual up-sampling process is performed. The GAU structure is shown in Fig. 4.
According to Fig. 3, h and w denote the feature maps' height and width, respectively. R′4 is obtained from R4 and R5 through GAU, i.e., R′4 = GAU(R4, R5); the lower levels are fused in the same way, giving R′3 = GAU(R3, R′4) and R′2 = GAU(R2, R′3). Ultimately, R′2 and R1 are fed into GAU to get the final feature F. It combines all the information and has richer semantic features, which helps mine the information of low-resolution infrared images.
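The fusion chain above can be sketched as follows. This is a minimal numpy illustration of the GAU data flow only: the learned 3 × 3 and 1 × 1 convolutions are replaced by identity maps, and the channel counts and resolutions are invented for the example.

```python
import numpy as np

def gau(low, high):
    """Global Attention Upsample (after Li et al. 2018), minimal sketch.

    low:  (C, H, W) low-level map; high: (C, H/2, W/2) high-level map.
    """
    # Global average pooling of the high-level map -> per-channel weights.
    w = high.mean(axis=(1, 2), keepdims=True)      # (C, 1, 1)
    w = 1.0 / (1.0 + np.exp(-w))                   # nonlinear transformation
    weighted_low = low * w                         # reweight low-level features
    # Nearest-neighbour up-sampling of the high-level map by 2x.
    up = high.repeat(2, axis=1).repeat(2, axis=2)
    return up + weighted_low                       # high + weighted low

rng = np.random.default_rng(0)
# R1..R5: a pyramid halving in resolution each stage (equal channel
# counts here for simplicity; the paper uses convolutions to match them).
R = {s: rng.random((8, 64 >> (s - 1), 64 >> (s - 1))) for s in range(1, 6)}

F = gau(R[4], R[5])        # R'4 = GAU(R4, R5)
F = gau(R[3], F)           # R'3
F = gau(R[2], F)           # R'2
F = gau(R[1], F)           # final fused feature F at R1 resolution
```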

Global spatial information attention module
Unlike RGB images, infrared images have limited semantic and channel information. We design a global spatial information attention module to focus on the target area and ignore the background, so that the feature map's essential features are enhanced while useless features are suppressed. SENet (Hu et al. 2018) recalibrates features channel-wise, and Wang et al. (2018b) proposed non-local modules to capture long-range dependencies: the response at each position is computed as a weighted sum of the features at all positions. Inspired by Wang et al. (2018b), we put forward the global spatial information attention (GSA) module to learn discriminative spatial features. The structure of the GSA module is shown in Fig. 5.
Following the correlation function in Wang et al. (2018b), we use dot-product similarity, f(F_i, F_j) = θ(F_i)·φ(F_j), where i and j are position indices and θ and φ are feature embeddings obtained with 1 × 1 convolution layers in our attention module. The attention map N(F) for each position is then computed with the softmax function, N(F)_{i,j} = exp(f(F_i, F_j)) / Σ_j exp(f(F_i, F_j)).

Then we take the features of all relevant positions, weighted by the attention map, and compute X_i = Σ_j N(F)_{i,j} g(F_j). We thus obtain a feature map X of the same size as F; the function g, which consists of one 1 × 1 convolution layer, computes the input's representation. A cross-channel transform W applied to X then gives the output M = W(X).

Feature M is sent into GATE_CONV, which consists of two 1 × 1 convolution layers and a sigmoid layer and produces spatial weights correlated with the shape, where h and w are the spatial dimensions of F. Finally, we obtain the final output Y, which focuses on the image's target and ignores the background.
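A minimal numpy sketch of the data flow just described. The 1 × 1 convolution embeddings (θ, φ, g), the cross-channel transform, and GATE_CONV are replaced by identity maps here, and the residual connection back to F is our assumption; only the dot-product attention and the sigmoid gate are shown.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gsa(F):
    """Global spatial information attention, minimal sketch. F: (C, H, W)."""
    C, H, W = F.shape
    flat = F.reshape(C, H * W)             # each column is one position
    sim = flat.T @ flat                    # dot-product similarity f(Fi, Fj)
    att = softmax(sim, axis=1)             # attention map N(F), rows sum to 1
    X = (att @ flat.T).T.reshape(C, H, W)  # weighted sum over all positions
    M = X                                  # cross-channel transform (identity here)
    gate = 1.0 / (1.0 + np.exp(-M))        # GATE_CONV: sigmoid spatial weights
    Y = F + M * gate                       # gated output (residual assumed)
    return Y, att

F = np.random.default_rng(1).random((4, 8, 8))
Y, att = gsa(F)
```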

Loss function
The loss function is commonly used to evaluate the degree of inconsistency between the model's predictions and the ground truth: the smaller the loss value, the closer the predictions are to the ground truth and the more robust the model. We use the softmax loss to train the model.
In the loss formula, the softmax (cross-entropy) loss takes the standard form L = −(1/N) Σ_n log p_{n,y_n}, where N represents the number of pixels and p_{n,y_n} is the predicted probability of pixel n's true class. Like PSPNet (Zhao et al. 2017), we add an auxiliary loss function to the fourth stage of ResNet-101; a good auxiliary loss setting helps the network learn better. In our experiments, setting the auxiliary weight to 0.25 gives a good result.
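A sketch of the combined objective, assuming the standard per-pixel cross-entropy form of the softmax loss and the 0.25 auxiliary weight stated above; the toy logits are invented for illustration.

```python
import numpy as np

def softmax_ce(logits, labels):
    """Per-pixel softmax cross-entropy: L = -(1/N) * sum_n log p_{n, y_n}.
    logits: (N, K) class scores for N pixels; labels: (N,) class ids."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def total_loss(main_logits, aux_logits, labels, aux_weight=0.25):
    """Main loss plus the stage-4 auxiliary loss weighted by 0.25."""
    return (softmax_ce(main_logits, labels)
            + aux_weight * softmax_ce(aux_logits, labels))

logits = np.array([[10.0, 0.0], [0.0, 10.0]])
labels = np.array([0, 1])
loss = total_loss(logits, logits, labels)  # aux head reuses the same logits here
```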

Description of our dataset
Our dataset simulates the process of temperature rising and cooling in an outdoor environment. Buried or scattered targets are detected using the difference in soil surface temperature caused by the different thermal conductivity characteristics of objects, or by the temperature difference between the object to be detected and the soil. The object to be detected here is called a melon. As shown in Fig. 6, experiments were carried out in four soils: mineral soil, organic soil, fine sand, and yellow soil. Different pictures were collected by varying the soil, voltage, heating/cooling time, humidity, and burial depth.
In the infrared imaging experiment, the infrared data format followed that of the Chinese Academy of Sciences, and data were collected with an infrared camera made by the Chinese Academy of Sciences. The black-body calibration data collected at the beginning of the experiment are converted into raw-format infrared data through image processing, with a size of 320 × 256. Because the raw-format image has low contrast and a concentrated pixel distribution, it usually displays as pure white or pure black, so it needs numerical processing. By taking the maximum and minimum values of the raw data and mapping them to the range 0-255, the 14-bit raw image is converted into an 8-bit high-contrast BMP image. After this preprocessing, the infrared image is shown in Fig. 7.
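The min-max stretch described above can be sketched as follows. The synthetic frame is invented for illustration; the real pipeline reads the camera's raw files instead.

```python
import numpy as np

def raw14_to_bmp8(raw):
    """Map a 14-bit raw frame to an 8-bit image by min-max stretching:
    take the frame's min and max and linearly rescale to 0-255."""
    lo, hi = float(raw.min()), float(raw.max())
    scaled = (raw.astype(float) - lo) / max(hi - lo, 1.0) * 255.0
    return np.round(scaled).astype(np.uint8)

# A synthetic 320x256 14-bit frame spanning the full 0..16383 range.
frame = (np.arange(256 * 320, dtype=np.int32) % 16384).reshape(256, 320)
img = raw14_to_bmp8(frame)
```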
There are 66 batches of experiments in our dataset, of which 54 collections are used as the training set and 12 as the test set. The training set contains 4379 BMP-format infrared images, and the test set contains 742 images. As depicted in Fig. 1, the ground-truth label is an original BMP-format image; the label is a 320 × 256 grayscale image. The training and test sets are independent, ensuring the validity of the experimental results.

Data augmentation
A deep model requires a large amount of training data to perform well, so we adopt multiple data augmentation strategies (MSDA) in all experiments. For all training data we use random mirroring, random resizing between 0.5 and 2, vertical flipping (with probability 50%), rotation between −10° and 10°, and so on. These methods were fine-tuned for infrared images, e.g., rotating the image to change its angle and randomly cropping at 0.5-2 times scale, so that the infrared images cover more conditions for training. Given the limitations of the experiments, these augmentations improve the network's robustness and the diversity of the training data (Krizhevsky et al. 2017). We train our model using stochastic gradient descent (SGD) with a batch size of 11 per GPU card. We adopt the "poly" learning rate strategy, in which the learning rate is computed as lr = base_lr × (1 − iter/max_iter)^power.
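A sketch of the joint image/label augmentation described above, assuming nearest-neighbour rescaling so that label ids stay valid, and omitting the ±10° rotation for brevity; the shapes are invented, while the flip probabilities and 0.5-2 scale range follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, label):
    """Random mirror, vertical flip (50%), and 0.5-2x random rescaling,
    applied identically to the infrared image and its label mask."""
    if rng.random() < 0.5:                        # horizontal mirror
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:                        # vertical flip
        image, label = image[::-1, :], label[::-1, :]
    scale = rng.uniform(0.5, 2.0)                 # random resize factor
    h, w = image.shape
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    rows = np.arange(nh) * h // nh                # nearest-neighbour indices
    cols = np.arange(nw) * w // nw
    return image[np.ix_(rows, cols)], label[np.ix_(rows, cols)]

img = np.random.rand(64, 80)
msk = (img > 0.5).astype(np.uint8)
ai, am = augment(img, msk)
```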

Implementation details
We set the base learning rate to 0.002, max_iter to 120,000, the weight decay to 0.0001, and the power to 0.9.
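With these settings, the "poly" schedule evaluates as follows; this is a direct transcription of the formula, not the training code.

```python
def poly_lr(base_lr, it, max_iter, power):
    """'Poly' schedule: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - it / max_iter) ** power

# The paper's settings: base lr 0.002, 120,000 iterations, power 0.9.
lrs = [poly_lr(0.002, it, 120_000, 0.9) for it in (0, 60_000, 119_999)]
```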

Evaluation metrics
To quantitatively evaluate the performance of our method on infrared image segmentation tasks, we select recall (R), precision (P), overall accuracy (OA), and F1 score to assess the segmentation results. These four evaluation indicators apply only to two categories, so we regard melons and non-melons as one pair of categories, and interference and non-interference as the other pair, for accuracy evaluation.
True positive (TP) is the number of melon pixels predicted correctly, and false positive (FP) is the number of pixels incorrectly predicted as melon. True negative (TN) is the number of non-melon pixels classified correctly, and false negative (FN) is the number of melon pixels classified incorrectly. Besides, since this is a segmentation task, we also choose pixel accuracy (PA) and mean pixel accuracy (MPA) to measure the segmentation of melons and interference objects.
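From these counts the four indicators follow directly; the pixel counts below are hypothetical, only the formulas are standard.

```python
def binary_metrics(tp, fp, tn, fn):
    """Recall, precision, overall accuracy and F1 from pixel counts:
    R = TP/(TP+FN), P = TP/(TP+FP), OA = (TP+TN)/total, F1 = 2PR/(P+R)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    oa = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, oa, f1

# Hypothetical pixel counts for the melon / non-melon split.
r, p, oa, f1 = binary_metrics(tp=90, fp=10, tn=880, fn=20)
```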
Pixel accuracy (PA) is the ratio of the number of correctly classified pixels to the number of all pixels: PA = Σ_i p_ii / Σ_i Σ_j p_ij, where p_ij denotes the number of pixels of class i predicted as class j. Mean pixel accuracy (MPA) computes, for each class, the ratio of correctly classified pixels to all pixels of that class, and then averages over classes: MPA = (1/(k+1)) Σ_i (p_ii / Σ_j p_ij).

Mean intersection over union (MIoU)
Calculate the IoU of each category and average. The IoU of one class is computed as follows: for example, for i = 1, P_11 is the number of true positives, i.e., pixels that belong to category 1 and are also predicted as category 1, while Σ_{j=0}^{k} P_j1 is the number of pixels of any class predicted as class 1 (containing P_11). The specific formula is IoU_i = p_ii / (Σ_j p_ij + Σ_j p_ji − p_ii), and MIoU = (1/(k+1)) Σ_{i=0}^{k} IoU_i.
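All three segmentation metrics can be computed from one confusion matrix; the 3 × 3 matrix below is a made-up example, and p_ij follows the convention above (pixels of class i predicted as class j).

```python
import numpy as np

def seg_metrics(conf):
    """PA, MPA and MIoU from a (k+1)x(k+1) confusion matrix where
    conf[i, j] counts pixels of class i predicted as class j."""
    diag = np.diag(conf).astype(float)
    pa = diag.sum() / conf.sum()                     # overall pixel accuracy
    mpa = np.mean(diag / conf.sum(axis=1))           # mean per-class accuracy
    iou = diag / (conf.sum(axis=1) + conf.sum(axis=0) - diag)
    return pa, mpa, iou.mean()                       # MIoU = mean of class IoUs

conf = np.array([[50, 2, 3],
                 [4, 40, 1],
                 [0, 5, 45]])
pa, mpa, miou = seg_metrics(conf)
```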

Results and analysis
In this part, we verify our proposed segmentation method. Our method offers accurate segmentation of melons and interference in complex infrared scenes; all experimental data are collected from our experiments and form the buried object dataset. To prove the advantage of our algorithm, we compare it with several state-of-the-art segmentation methods: FCN-32s (Long et al. 2015), SegNet (Badrinarayanan et al. 2017), U-net (Ronneberger et al. 2015), and DeepLabv3 (Chen et al. 2017), all introduced in Sect. 2. All experiments were performed under the same environment and parameters. Our method achieves excellent segmentation results.
Table 1 shows the results of the various methods on the buried object dataset; Mean IoU is the most common evaluation index in segmentation. PSPNet (Zhao et al. 2017) shows the worst performance among all networks, while DeepLabv3 (Chen et al. 2017) and SegNet (Badrinarayanan et al. 2017) perform better than the other networks. Our method achieves the best result of all, proving the validity of our approach.
Table 2 shows the comparison of the different methods under the same experimental setting. The precision and F1 score of our proposed method on melons are significantly higher than those of the other deep learning models: our melon detection precision reaches 89.65%, the F1 score of our melon detection is nearly 92.39%, and the other evaluation indicators also achieve better results than most competitors.
Table 3 shows the comparison of the three primary segmentation metrics under different methods. It can be seen from Table 3 that the FCN-32s model performs worst on these metrics among all networks, while the U-net model improves slightly over FCN-32s. Our model achieves a PA of interference detection of nearly 99.84%, the best performance, and our method also beats most of the models on the other evaluation metrics. (For reference, the Mean IoU values in Table 1 include U-net (Ronneberger et al. 2015) 89.1, PSPNet (Zhao et al. 2017) 87.2, DeepLabv3 (Chen et al. 2017) 89.5, and ours 90.54.) Table 4 also compares the three primary segmentation metrics under different methods. The FCN-32s model again has the worst performance among all networks, while SegNet and DeepLabv3 improve on the other methods. Our proposed method achieves the best results on all three primary segmentation metrics: our PA of melon detection is nearly 99.51%, our MPA is almost 94.75%, and our MIoU reaches 92.67%, 0.54% higher than the second-best model. Tables 3 and 4 show that our method realizes accurate segmentation on the infrared segmentation dataset.
Figure 8 also shows the segmentation results of the different methods on the buried object dataset; green denotes melons and red denotes interference. In the figure, (a) is the test image and (g) is the ground truth. From the results, we can see that FCN-32s, DeepLabv3, and SegNet produce false detections and missed detections, and U-net also misses or falsely detects some of the melons. The bottom image result likewise shows that our method performs better.
From Fig. 10 we can find that, apart from ours, the other methods predict melons as interference, making the pixel classification incorrect, while (f) shows that our infrared image segmentation result is superior. Our model provides the best results on melons: its PA, MPA, and MIoU are 99.51%, 94.75%, and 92.67%, respectively. These indicators are a good measure of segmentation quality.
We also train state-of-the-art object detection methods on the buried object dataset and compare them with our proposed method. The segmentation results on the infrared images are converted to detections by taking the smallest bounding rectangle of each mask as the final detection result, and we set the IoU threshold to 0.5 for all segmentation experiments here. Table 5 compares MS-AFF with state-of-the-art object detection approaches on the buried object dataset. Our model achieves an mAP of 92.14%, outperforming all reported segmentation results and object detectors, such as FCOS (Tian et al. 2019) and PAA (Kim and Lee 2020). Since the infrared images lack texture information of buried objects and the training data are small, convolutional neural networks cannot learn object characteristics well; segmentation, however, is pixel-based and needs less training data. On interference AP, our performance is 0.48% below SegNet (Badrinarayanan et al. 2017), but our method has a higher AP for melon and 92.14% mAP. Compared with the baseline (PSPNet, Zhao et al. 2017), we achieve a large improvement of 5.36% mAP, proving that the proposed approach is very effective.
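The mask-to-detection conversion can be sketched as follows, assuming axis-aligned bounding rectangles (the paper says "smallest bounding rectangles"; a rotated minimum-area rectangle would also fit that description) and the stated IoU threshold of 0.5 for matching.

```python
import numpy as np

def mask_to_bbox(mask):
    """Smallest axis-aligned bounding rectangle (x0, y0, x1, y1, inclusive)
    of the nonzero pixels in a binary segmentation mask."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def iou(a, b):
    """IoU of two inclusive-coordinate boxes, for matching at threshold 0.5."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0 + 1) * max(0, iy1 - iy0 + 1)
    area = lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1)
    return inter / (area(a) + area(b) - inter)

m = np.zeros((10, 10), dtype=np.uint8)
m[2:5, 3:8] = 1                  # a toy melon mask
box = mask_to_bbox(m)
```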

Ablation study
In this part, we explore the contribution of each unit, using the mean intersection over union (Mean IoU) to estimate model performance. Table 6 lists the different configurations, with PSPNet as our baseline. The MS-AFF module achieves nearly a 2.62% improvement over the baseline, the non-local module gains 0.25% over the baseline, and the GSA module improves it by 1.57%. Our full method improves the baseline from 87.21 to 90.54%. Table 6 shows that the model containing all components (i.e., MS-AFF and GSA) achieves the best performance.

Conclusions
This paper focuses on buried object infrared image segmentation. A multi-scale attentional feature fusion (MS-AFF) method is proposed for infrared image semantic segmentation: we integrate a series of feature maps from different levels through an atrous spatial pyramid structure to obtain rich representation ability on infrared images. We also propose a global spatial information attention module to let the model focus on the target region and reduce disturbance. The infrared images lack texture information of buried objects, and the training data are small, so convolutional neural networks cannot learn the characteristics of objects well; segmentation, however, is pixel-based and needs less training data. Extensive experiments show that our model achieves better segmentation results than other segmentation methods, and demonstrates excellent performance compared with state-of-the-art object detection methods on buried object infrared datasets.

Fig. 2
Fig. 2 The main pipeline of infrared image segmentation. The infrared image is fed to the network to extract features; then MS-AFF integrates a series of feature maps, and the GSA module lets the model focus on the target region. Finally, the network generates the segmentation result

Fig. 3
Fig. 3 A multi-scale attentional feature fusion. The features of adjacent feature maps are fused by the GAU

Fig. 6
Fig. 6 Four experimental soil environments: a mineral soil, b organic soil, c fine sand soil, d yellow soil

Figure 9
also shows the segmentation performance of the different methods on two test images. The image in the top row directly indicates that our result is more precise than the others.

Fig. 8
Fig. 8 The comparison result of different methods on the buried object dataset. a Test image, b Segmentation performance on FCN-32s, c Segmentation performance on U-net, d Segmentation performance on DeepLabv3, e Segmentation performance on SegNet, f Segmentation performance on ours, g ground truth

Fig. 10
Fig. 10 The comparison result of different methods on the buried object dataset. a Test image, b Segmentation performance on FCN-32s, c Segmentation performance on U-net, d Segmentation performance on DeepLabv3, e Segmentation performance on SegNet, f Segmentation performance on ours

Table 2
Other evaluation metrics on the buried object test set. Our method's result can be seen in (f), which detects all the melons and interference. As shown in Table 1, our method achieved an MIoU of 90.54%, the best accuracy among all models in our experiments.

Table 5
Comparisons with state-of-the-art methods on buried object dataset

Table 6
Detailed Mean IoU comparison of our proposed method