In order to maintain the independence of the extracted features, we feed the smoke sequence frames into three subnetworks that capture the potential mappings of different features independently. Based on this design, we first describe the framework of smoke recognition, then introduce the Multi-Scale Convolution Module (MSCM) and the Cross Attention Module (CAM) in detail in the following sections.
3.1 Smoke recognition
Given the previous frames X1:Xt, our goal is to detect whether there is smoke in frame Xt. To achieve this goal, we take the smoke video as input and capture the potential mappings of smoke features in frames X1:Xt. We use an optical flow model to calculate the motion trail of all objects in the frame sequence and use it as the input of the smoke motion encoder, and we extract the color information in the three-channel RGB image of each frame as the input of the smoke color encoder.
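To make the two encoder inputs concrete, the following NumPy sketch builds them from a frame stack. The paper uses an optical flow model for motion; as a simplified stand-in (an assumption, not the authors' method) the sketch uses absolute grayscale frame differences as a motion proxy, while the color input is just the raw three-channel RGB values:

```python
import numpy as np

def motion_proxy(frames):
    """Simplified stand-in for the optical flow motion trail.

    The paper computes optical flow; here (as an assumption) we use
    absolute frame differences of grayscale intensities instead.

    frames: array of shape (T, H, W, 3), RGB values in [0, 255].
    Returns an array of shape (T-1, H, W) of per-pixel motion magnitudes.
    """
    gray = frames.mean(axis=-1)            # (T, H, W) grayscale intensities
    return np.abs(np.diff(gray, axis=0))   # temporal change between frames

def color_input(frames):
    """Color encoder input: the three-channel RGB values of each frame."""
    return frames.astype(np.float32) / 255.0
```

Any real optical flow estimator can replace `motion_proxy` without changing the rest of the pipeline, since the motion encoder only consumes a per-pixel motion field.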
From each time frame, we calculate the spatial potential mapping St, the motion potential mapping Mt, and the color potential mapping Ct. Each mapping represents a different feature distribution of smoke. We design a Cross Attention Module (CAM) to establish the relationship between the different mappings (presented in Section 3.2), so that we can capture the feature mapping of smoke in each time frame under the guidance of the other mappings. We then fuse the captured feature mappings and use the Spatio-Temporal Perceptron (STP) to establish the feature mask Ft:
$$\begin{array}{c}{F}_{t}=\sum_{i}\left({\omega}_{i}*ConvLSTM\left(concat\left({S}_{t},{M}_{t},{C}_{t}\right)\right)\right)\#\left(1\right)\end{array}$$
Where i denotes the different kernel sizes of the STP. St, Mt, and Ct contain T×H×W×C pixels; ConvLSTM establishes the connection between pixels across the T frames while maintaining the mapping size. Figure 1 shows the pipeline of our work. The feature mapping produced by ConvLSTM is of size H×W×C. Then, the multi-scale convolution kernels (presented in Section 3.4) use this feature mapping to establish the smoke feature mask Ft. Ft contains all the feature mapping information of the smoke. We use Ft to calculate the posterior distribution psmoke(Xt | Ft) to detect whether there is smoke in the current scene.
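The fusion of Eq. (1) can be sketched as follows, with the ConvLSTM recurrence and the per-scale kernels abstracted as callables (their internals are given in Section 3.4); the stand-ins used in the sketch are assumptions for illustration only:

```python
import numpy as np

def feature_mask(S, M, C, conv_lstm, multiscale_convs, weights):
    """Sketch of Eq. (1): F_t = sum_i w_i * Conv_i(ConvLSTM(concat(S, M, C))).

    S, M, C: potential mappings of shape (T, H, W, C).
    conv_lstm: callable reducing (T, H, W, 3C) -> (H, W, C'); any
        stand-in works here, since the recurrence is defined in Sec. 3.4.
    multiscale_convs: list of per-scale kernel callables (the STP kernels).
    weights: fusion weights w_i, one per scale.
    """
    fused = np.concatenate([S, M, C], axis=-1)   # channel-wise concat
    h = conv_lstm(fused)                         # spatio-temporal encoding
    return sum(w * conv(h) for w, conv in zip(weights, multiscale_convs))
```

The point of the sketch is the data flow: one shared spatio-temporal encoding feeds all scale branches, and the mask is their weighted sum.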
3.2 Cross Attention Module
The attention mechanism scans the global image quickly to locate the target area that needs to be focused on, and then pays more attention to this area. Figure 2 shows the combination of the attention mechanism and the spatial encoder, which better captures the spatial potential mapping St in our work.
Key and Query are calculated through two separate 1×1 convolution layers, and the Attention Map is then obtained by applying a softmax operation to the product of Query and Key. For each feature mapping of smoke, we use the Attention Map to strengthen the expressive ability of the feature mapping. It can be expressed as:\(\begin{array}{c}\widehat{{S}_{t}}={A}_{S} * {S}_{t} \#\left(2\right)\end{array}\)
Where St denotes the spatial potential mapping, \(\widehat{{S}_{t}}\) denotes the self-attention spatial potential mapping, and \({A}_{s}\) denotes the spatial attention map, which can be expressed as:
$$\begin{array}{c}{A}_{s}= Softmax\left(\frac{{Query }^{T} Key}{\sqrt{d}}\right)\#\left(3\right)\end{array}$$
Where d is the dimension of Key and Query. The CAM is designed on top of the attention mechanism so that one feature mapping of smoke can guide the capture of another. Figure 3 shows the mutual guidance process of St and Mt: St is multiplied by the motion attention map and then fused with \(\widehat{{S}_{t}}\) to generate the cross-attention spatial potential mapping. It can be formulated as follows:
$$\begin{array}{c}\tilde{{S}_{t}}={ S}_{t} \times {A}_{M}+\widehat{{S}_{t}}\#\left(4\right)\end{array}$$
Where \(\tilde{{S}_{t}}\) denotes the cross-attention spatial potential mapping and \({A}_{M}\) denotes the motion attention map, which is obtained from the motion potential mapping.
St contains the static feature information of smoke in the frame sequence, while Mt contains the dynamic feature information of smoke. Through the CAM, the spatial encoder can also obtain the dynamic feature information of smoke, so it can focus on the smoke area faster and obtain richer static feature information of smoke.
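Under one consistent shape convention (channels-by-flattened-positions; an assumption, since the paper does not fix matrix shapes), Eqs. (2)-(4) can be sketched in NumPy as:

```python
import numpy as np

def attention_map(query, key):
    """Eq. (3): A = softmax(Query^T Key / sqrt(d)).

    query, key: (d, N) arrays from the 1x1 convolutions,
    with N = H*W flattened spatial positions.
    Returns an (N, N) attention map whose rows sum to 1.
    """
    d = query.shape[0]
    logits = query.T @ key / np.sqrt(d)            # (N, N) similarity scores
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)   # row-wise softmax

def cross_attention(S, A_S, A_M):
    """Eqs. (2) and (4): motion-guided spatial cross attention.

    S:   spatial potential mapping, flattened to (C, N).
    A_S: spatial self-attention map (N, N) from Eq. (3).
    A_M: motion attention map (N, N), computed the same way from M_t.
    """
    S_hat = S @ A_S            # Eq. (2): self-attention spatial mapping
    return S @ A_M + S_hat     # Eq. (4): cross-attention spatial mapping
```

Swapping the roles of S and M in `cross_attention` gives the symmetric motion-side update, which is the mutual guidance shown in Figure 3.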
3.3 Multi-Scale Cross Attention Network
Fusing features of different scales is an important means to improve the performance of a network model. Low-scale features have higher resolution and contain more detail, which is conducive to capturing the spatial texture of the smoke edge and the outward-divergence movement trend at the smoke edge. High-scale features focus on semantic information, which is conducive to capturing the spatial integrity of the smoke and the upward movement trend of the whole smoke plume.
Traditional deep learning networks usually use a down-sampling operation to increase the receptive field and thereby obtain feature information from a large-scale perspective, but down-sampling deletes some feature information in the image. Dilated convolution instead inserts zero values between the elements of the convolution kernel to expand it. If the expansion coefficient of the dilated convolution is a, the relationship between the effective kernel size after dilation and the original kernel size K can be expressed as:
$$\begin{array}{c}{K}{'}=K+\left(K-1\right)\times \left(a-1\right)\#\left(5\right)\end{array}$$
When a > 1, the kernel of a conventional convolution expands through the inserted zero values, yielding a dilated convolution. The enlarged kernel increases the receptive field of the convolution layer, so feature information can be obtained from a large-scale perspective. Since only zero values are added, the computational complexity of dilated convolution does not grow excessively, and image information is not lost as it is in the down-sampling operation. Therefore, we propose a multi-scale feature fusion network based on dilated convolution and combine it with the cross attention module to realize multi-scale, multi-feature fusion of smoke. Figure 4 shows the cross-attention multi-scale feature fusion network we designed. We change the stride and padding of the dilated convolutions to construct the multi-scale feature fusion network:
$$\begin{array}{c}{F}_{i}=D\left(d,s,p\right)\#\left(6\right)\end{array}$$
Where D denotes dilated convolution, d denotes the dilation of the convolution, s denotes the stride of the convolution, p denotes the padding of the convolution, and i denotes the order of the dilated convolution.
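Equation (5) and the scale behavior of each branch D(d, s, p) can be checked with two small helper functions. The output-size relation is the standard one for dilated convolution (an assumption, since the paper does not state it explicitly):

```python
def effective_kernel(k, a):
    """Eq. (5): effective kernel size of a dilated convolution.

    k: original kernel size; a: expansion (dilation) coefficient.
    a = 1 recovers the conventional convolution.
    """
    return k + (k - 1) * (a - 1)

def conv_output_size(n, k, d, s, p):
    """Spatial output size of a dilated convolution D(d, s, p) on an
    input of size n with kernel k, using the standard relation:
    out = floor((n + 2p - d*(k-1) - 1) / s) + 1.
    """
    return (n + 2 * p - d * (k - 1) - 1) // s + 1
```

Choosing (d, s, p) per branch i then fixes the scale of each feature map F_i: increasing d with matching p keeps the resolution while enlarging the receptive field, whereas increasing s reduces the resolution.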
3.4 Spatio-Temporal Perceptron
Figure 5 shows the Spatio-Temporal Perceptron. The fused potential mapping, which contains the three kinds of feature information of the smoke frame sequence, is input to ConvLSTM. The STP establishes the feature mask Ft from this information through a recurrent network and uses a small multi-scale network to enrich the mask information.
ConvLSTM captures the temporal and spatial features of the smoke sequence at the same time: the forget gate determines which features of the spatio-temporal context should be discarded, and the update gate determines which features should be retained. Then, we use a small multi-scale convolution network to calculate the smoke feature mask Ft at different scales. This multi-scale network is also designed based on dilated convolution (as described in Section 3.3).
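The gate structure above can be sketched with one ConvLSTM step. As a simplification (an assumption for brevity), the sketch uses 1×1 kernels, which reduce each gate convolution to a per-pixel matrix product; a real ConvLSTM uses spatial kernels, but the forget/update/output gating is identical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def convlstm_step(x, h, c, W):
    """One ConvLSTM step with 1x1 kernels (simplified sketch).

    x: input of shape (H, W, Cin); h, c: hidden and cell state (H, W, Ch).
    W: dict of per-gate weight matrices of shape (Cin + Ch, Ch).
    """
    z = np.concatenate([x, h], axis=-1)   # gate input: [x_t, h_{t-1}]
    f = sigmoid(z @ W["f"])               # forget gate: what to discard
    i = sigmoid(z @ W["i"])               # update gate: what to retain
    o = sigmoid(z @ W["o"])               # output gate
    g = np.tanh(z @ W["g"])               # candidate features
    c_new = f * c + i * g                 # cell state update
    h_new = o * np.tanh(c_new)            # hidden state, shape (H, W, Ch)
    return h_new, c_new
```

Iterating this step over the T fused frames yields the spatio-temporal encoding that the small multi-scale dilated network then turns into Ft.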