In order to maintain the independence of the extracted features, we feed the smoke sequence frames into three subnetworks that capture the potential mappings of different features independently. Based on this design, we first describe the framework of smoke recognition, then introduce the Multi-Scale Convolution Module (MSCM) and the Cross Attention Module (CAM) in detail in the following sections.
3.1 Smoke recognition
Given the previous frames X1:Xt, our goal is to detect whether there is smoke in frame Xt. To achieve this goal, we take the smoke video as input and capture the potential mappings of smoke features in frames X1:Xt. We use an optical flow model to calculate the motion trail of all objects in the frame sequence and use it as the input of the smoke motion encoder, and we extract the color information in the three-channel RGB image of each frame as the input of the smoke color encoder.
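To make the two encoder inputs concrete, the following NumPy sketch builds them from a frame stack. The paper uses an optical flow model for motion; as a simplified stand-in (an assumption, not the authors' method) the sketch uses absolute grayscale frame differences as a motion proxy, while the color input is just the raw three-channel RGB values:

```python
import numpy as np

def motion_proxy(frames):
    """Simplified stand-in for the optical flow motion trail.

    The paper computes optical flow; here (as an assumption) we use
    absolute frame differences of grayscale intensities instead.

    frames: array of shape (T, H, W, 3), RGB values in [0, 255].
    Returns an array of shape (T-1, H, W) of per-pixel motion magnitudes.
    """
    gray = frames.mean(axis=-1)            # (T, H, W) grayscale intensities
    return np.abs(np.diff(gray, axis=0))   # temporal change between frames

def color_input(frames):
    """Color encoder input: the three-channel RGB values of each frame."""
    return frames.astype(np.float32) / 255.0
```

Any real optical flow estimator can replace `motion_proxy` without changing the rest of the pipeline, since the motion encoder only consumes a per-pixel motion field.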
From each time frame, we calculate the spatial potential mapping St, the motion potential mapping Mt, and the color potential mapping Ct. Each mapping represents a different feature distribution of smoke. We design a Cross Attention Module (CAM) to establish the relationship between the different mappings (presented in Section 3.2), so that we can capture the feature mapping of smoke in each time frame under the guidance of the other mappings. We then fuse the captured feature mappings and use the Spatio-Temporal Perceptron (STP) to establish the feature mask Ft:
$$\begin{array}{c}{F}_{t}=\sum_{i}\left({\omega}_{i}*ConvLSTM\left(concat\left({S}_{t},{M}_{t},{C}_{t}\right)\right)\right)\#\left(1\right)\end{array}$$
Where i denotes the different kernel sizes of the STP. St, Mt, and Ct contain T×H×W×C pixels; ConvLSTM establishes the connection between pixels across the T frames while maintaining the mapping size. Figure 1 shows the pipeline of our work. The feature mapping produced by ConvLSTM is of size H×W×C. Then, the multi-scale convolution kernels (presented in Section 3.4) use this feature mapping to establish the smoke feature mask Ft. Ft contains all the feature mapping information of the smoke. We use Ft to calculate the posterior distribution psmoke(Xt | Ft) to detect whether there is smoke in the current scene.
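The fusion of Eq. (1) can be sketched as follows, with the ConvLSTM recurrence and the per-scale kernels abstracted as callables (their internals are given in Section 3.4); the stand-ins used in the sketch are assumptions for illustration only:

```python
import numpy as np

def feature_mask(S, M, C, conv_lstm, multiscale_convs, weights):
    """Sketch of Eq. (1): F_t = sum_i w_i * Conv_i(ConvLSTM(concat(S, M, C))).

    S, M, C: potential mappings of shape (T, H, W, C).
    conv_lstm: callable reducing (T, H, W, 3C) -> (H, W, C'); any
        stand-in works here, since the recurrence is defined in Sec. 3.4.
    multiscale_convs: list of per-scale kernel callables (the STP kernels).
    weights: fusion weights w_i, one per scale.
    """
    fused = np.concatenate([S, M, C], axis=-1)   # channel-wise concat
    h = conv_lstm(fused)                         # spatio-temporal encoding
    return sum(w * conv(h) for w, conv in zip(weights, multiscale_convs))
```

The point of the sketch is the data flow: one shared spatio-temporal encoding feeds all scale branches, and the mask is their weighted sum.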
3.2 Cross Attention Module
The attention mechanism scans the global image quickly to locate the target area that needs to be focused on, and then pays more attention to this area. Figure 2 shows the combination of the attention mechanism and the spatial encoder, which better captures the spatial potential mapping St in our work.
Key and Query are calculated through two separate 1×1 convolution layers, and the Attention Map is then obtained by applying a softmax operation to the product of Query and Key. For each feature mapping of smoke, we use the Attention Map to strengthen the expressive ability of the feature mapping. It can be expressed as:\(\begin{array}{c}\widehat{{S}_{t}}={A}_{S} * {S}_{t} \#\left(2\right)\end{array}\)
Where St denotes the spatial potential mapping, \(\widehat{{S}_{t}}\) denotes the self-attention spatial potential mapping, and \({A}_{s}\) denotes the spatial attention map, which can be expressed as:
$$\begin{array}{c}{A}_{s}= Softmax\left(\frac{{Query }^{T} Key}{\sqrt{d}}\right)\#\left(3\right)\end{array}$$
Where d is the dimension of Key and Query. The CAM is designed on top of the attention mechanism so that one feature mapping of smoke can guide the capture of another. Figure 3 shows the mutual guidance process of St and Mt: St is multiplied by the motion attention map and then fused with \(\widehat{{S}_{t}}\) to generate the cross-attention spatial potential mapping. It can be formulated as follows:
$$\begin{array}{c}\tilde{{S}_{t}}={ S}_{t} \times {A}_{M}+\widehat{{S}_{t}}\#\left(4\right)\end{array}$$
Where \(\tilde{{S}_{t}}\) denotes the cross-attention spatial potential mapping and \({A}_{M}\) denotes the motion attention map, which is obtained from the motion potential mapping.
St contains the static feature information of smoke in the frame sequence, while Mt contains the dynamic feature information of smoke. Through the CAM, the spatial encoder can also obtain the dynamic feature information of smoke, so it can focus on the smoke area faster and obtain richer static feature information of smoke.
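Under one consistent shape convention (channels-by-flattened-positions; an assumption, since the paper does not fix matrix shapes), Eqs. (2)-(4) can be sketched in NumPy as:

```python
import numpy as np

def attention_map(query, key):
    """Eq. (3): A = softmax(Query^T Key / sqrt(d)).

    query, key: (d, N) arrays from the 1x1 convolutions,
    with N = H*W flattened spatial positions.
    Returns an (N, N) attention map whose rows sum to 1.
    """
    d = query.shape[0]
    logits = query.T @ key / np.sqrt(d)            # (N, N) similarity scores
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)   # row-wise softmax

def cross_attention(S, A_S, A_M):
    """Eqs. (2) and (4): motion-guided spatial cross attention.

    S:   spatial potential mapping, flattened to (C, N).
    A_S: spatial self-attention map (N, N) from Eq. (3).
    A_M: motion attention map (N, N), computed the same way from M_t.
    """
    S_hat = S @ A_S            # Eq. (2): self-attention spatial mapping
    return S @ A_M + S_hat     # Eq. (4): cross-attention spatial mapping
```

Swapping the roles of S and M in `cross_attention` gives the symmetric motion-side update, which is the mutual guidance shown in Figure 3.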
3.3 Multi-Scale Cross Attention Network
Fusing features of different scales is an important means to improve the performance of a network model. Low-scale features have higher resolution and contain more detail, which is conducive to capturing the spatial texture of the smoke edge and the outward-divergence movement trend at the smoke edge. High-scale features focus on semantic information, which is conducive to capturing the spatial integrity of the smoke and the upward movement trend of the whole smoke plume.
Traditional deep learning networks usually use a down-sampling operation to increase the receptive field and thereby obtain feature information from a large-scale perspective, but down-sampling deletes some feature information in the image. Dilated convolution instead inserts zero values between the elements of the convolution kernel to expand it. If the expansion coefficient of the dilated convolution is a, the relationship between the effective kernel size after dilation and the original kernel size K can be expressed as:
$$\begin{array}{c}{K}{'}=K+\left(K-1\right)\times \left(a-1\right)\#\left(5\right)\end{array}$$
When a > 1, the kernel of a conventional convolution expands through the inserted zero values, yielding a dilated convolution. The enlarged kernel increases the receptive field of the convolution layer, so feature information can be obtained from a large-scale perspective. Since only zero values are added, the computational complexity of dilated convolution does not grow excessively, and image information is not lost as it is in the down-sampling operation. Therefore, we propose a multi-scale feature fusion network based on dilated convolution and combine it with the cross attention module to realize multi-scale, multi-feature fusion of smoke. Figure 4 shows the cross-attention multi-scale feature fusion network we designed. We change the stride and padding of the dilated convolutions to construct the multi-scale feature fusion network:
$$\begin{array}{c}{F}_{i}=D\left(d,s,p\right)\#\left(6\right)\end{array}$$
Where D denotes dilated convolution, d denotes the dilation of the convolution, s denotes the stride of the convolution, p denotes the padding of the convolution, and i denotes the order of the dilated convolution.
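Equation (5) and the scale behavior of each branch D(d, s, p) can be checked with two small helper functions. The output-size relation is the standard one for dilated convolution (an assumption, since the paper does not state it explicitly):

```python
def effective_kernel(k, a):
    """Eq. (5): effective kernel size of a dilated convolution.

    k: original kernel size; a: expansion (dilation) coefficient.
    a = 1 recovers the conventional convolution.
    """
    return k + (k - 1) * (a - 1)

def conv_output_size(n, k, d, s, p):
    """Spatial output size of a dilated convolution D(d, s, p) on an
    input of size n with kernel k, using the standard relation:
    out = floor((n + 2p - d*(k-1) - 1) / s) + 1.
    """
    return (n + 2 * p - d * (k - 1) - 1) // s + 1
```

Choosing (d, s, p) per branch i then fixes the scale of each feature map F_i: increasing d with matching p keeps the resolution while enlarging the receptive field, whereas increasing s reduces the resolution.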
3.4 Spatio-Temporal Perceptron
Figure 5 shows the Spatio-Temporal Perceptron. The fused potential mapping, which contains the three kinds of feature information of the smoke frame sequence, is input to ConvLSTM. The STP establishes the feature mask Ft from this information through a recurrent network and uses a small multi-scale network to enrich the mask information.
ConvLSTM captures the temporal and spatial features of the smoke sequence at the same time: the forget gate determines which features of the spatio-temporal context should be discarded, and the update gate determines which features should be retained. Then, we use a small multi-scale convolution network to calculate the smoke feature mask Ft at different scales. This multi-scale network is also designed based on dilated convolution (as described in Section 3.3).
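The gate structure above can be sketched with one ConvLSTM step. As a simplification (an assumption for brevity), the sketch uses 1×1 kernels, which reduce each gate convolution to a per-pixel matrix product; a real ConvLSTM uses spatial kernels, but the forget/update/output gating is identical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def convlstm_step(x, h, c, W):
    """One ConvLSTM step with 1x1 kernels (simplified sketch).

    x: input of shape (H, W, Cin); h, c: hidden and cell state (H, W, Ch).
    W: dict of per-gate weight matrices of shape (Cin + Ch, Ch).
    """
    z = np.concatenate([x, h], axis=-1)   # gate input: [x_t, h_{t-1}]
    f = sigmoid(z @ W["f"])               # forget gate: what to discard
    i = sigmoid(z @ W["i"])               # update gate: what to retain
    o = sigmoid(z @ W["o"])               # output gate
    g = np.tanh(z @ W["g"])               # candidate features
    c_new = f * c + i * g                 # cell state update
    h_new = o * np.tanh(c_new)            # hidden state, shape (H, W, Ch)
    return h_new, c_new
```

Iterating this step over the T fused frames yields the spatio-temporal encoding that the small multi-scale dilated network then turns into Ft.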