DS-BEV focuses on improving multi-modal fusion, and the overall framework is shown in Fig. 1. Given the different inputs, we first apply modality-specific encoders to extract their features separately, convert them to a unified BEV representation space, and generate preliminary BEV fusion features through the FSM. Combined with the original features from the CNN network, the refined BEV fusion features are then obtained through the FRM.
3.1 Encoder
Given an input, a modality-specific backbone network is used to extract features. We build our multi-modal feature encoder on the state-of-the-art perception method BEVFusion: it takes multi-view images and LiDAR point clouds as inputs and converts the camera features into BEV space via depth prediction and geometric projection. In addition, we add a CNN network to extract local image information through convolution. Specifically, the image input is encoded into N 2D image feature maps, which are lifted into 3D space to generate 3D voxel features and finally compressed into BEV features. The LiDAR points are voxelized into 3D voxel features and likewise flattened into BEV features.
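The camera-to-BEV conversion described above can be illustrated with a minimal PyTorch sketch of the lift step (depth prediction followed by lifting pixel features along depth bins), in the style of BEVFusion's view transform. The module name, heads, and tensor shapes are our own illustrative assumptions; the subsequent geometric projection and BEV pooling are omitted.

```python
import torch
import torch.nn as nn

class CameraToBEVLift(nn.Module):
    """Schematic lift step: predict a per-pixel depth distribution,
    then lift 2D image features into a 3D frustum (a sketch, not the
    paper's exact implementation)."""
    def __init__(self, in_ch, out_ch, n_depth_bins):
        super().__init__()
        self.depth_head = nn.Conv2d(in_ch, n_depth_bins, 1)  # depth prediction
        self.feat_head = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, img_feats):
        # img_feats: (B*N, C, H, W) features from the N multi-view images
        depth = self.depth_head(img_feats).softmax(dim=1)    # (B*N, D, H, W)
        feats = self.feat_head(img_feats)                    # (B*N, C', H, W)
        # Outer product spreads each pixel feature over its depth bins,
        # producing frustum features to be splatted into BEV voxels.
        frustum = depth.unsqueeze(1) * feats.unsqueeze(2)    # (B*N, C', D, H, W)
        return frustum  # geometric projection / BEV pooling follows
```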
3.2 FSM
The FSM consists of three parts: a spatial attention module, a channel attention module, and a feature selection module. Attention modules: After conversion to the unified BEV space, the image features and LiDAR features are position-aligned in BEV space. We sample the feature maps to the same scale and use spatial attention to compensate for information missing from either single modality; channel attention is applied analogously. We then combine the channel attention weights and spatial attention weights computed on the fused modality with the pre-fusion BEV features, and integrate the resulting attention features into the fusion features through attention selection. The channel attention module helps the network focus on important channel features by learning inter-channel correlations, thereby improving feature representation and network performance. The spatial attention module helps the network focus on locally important regions and improves its understanding of the image's spatial structure by learning the importance of each spatial location.
$$F_{\mathrm{fuse}}=\mathrm{Concat}(F_{CB},F_{LB})\tag{1}$$

$$F_{\mathrm{fuse}}^{\prime}=F_{\mathrm{fuse}}+F_{\mathrm{fuse}}\otimes M_{c}\left(F_{\mathrm{fuse}}\right)\tag{2}$$

$$F_{\mathrm{fuse}}^{\prime\prime}=F_{\mathrm{fuse}}^{\prime}+F_{\mathrm{fuse}}^{\prime}\otimes M_{s}\left(F_{\mathrm{fuse}}^{\prime}\right)\tag{3}$$
\(M_{c}\) is the channel attention weight, and \(M_{s}\) is the spatial attention weight.
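A minimal PyTorch sketch of Eqs. 1-3 follows, assuming CBAM-style channel and spatial attention (a common choice; the paper does not specify the internals of \(M_c\) and \(M_s\)). Module names and the reduction ratio are illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c: per-channel weights from globally pooled BEV features (Eq. 2)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))      # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))       # global max pooling
        return torch.sigmoid(avg + mx)[..., None, None]  # (B, C, 1, 1)

class SpatialAttention(nn.Module):
    """M_s: per-location weights over the BEV plane (Eq. 3)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (B, C, H, W)
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(stats))  # (B, 1, H, W)

def attend_fused(f_cb, f_lb, m_c, m_s):
    """Eqs. 1-3: concatenate the camera/LiDAR BEV features, then apply
    residual channel and spatial attention to the fused map."""
    f_fuse = torch.cat([f_cb, f_lb], dim=1)          # Eq. 1
    f_fuse1 = f_fuse + f_fuse * m_c(f_fuse)          # Eq. 2
    f_fuse2 = f_fuse1 + f_fuse1 * m_s(f_fuse1)       # Eq. 3
    return f_fuse, f_fuse1, f_fuse2
```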
Feature selection module: The previous step yields the channel attention features, the spatial attention features, and the original features. We propose a feature selection module that assigns a weight to each of these features; in this paper, the weights differ across features.
$$F_{CB}^{\prime}=F_{CB}+F_{CB}\otimes M_{c}\left(F_{\mathrm{fuse}}\right)\tag{4}$$

$$F_{CB}^{\prime\prime}=F_{CB}^{\prime}+F_{CB}^{\prime}\otimes M_{s}\left(F_{\mathrm{fuse}}^{\prime}\right)\tag{5}$$

$$F_{\mathrm{Fuse}}=\phi_{1}F_{CB}^{\prime\prime}+\phi_{2}F_{LB}^{\prime\prime}+\phi_{3}F_{\mathrm{fuse}}^{\prime\prime}\tag{6}$$
Here, \(\phi_{1},\phi_{2},\phi_{3}\) are the weighting coefficients. After the three features are fused, the final fused BEV feature map is obtained.
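Eqs. 4-6 can be sketched as below. We assume the attention weights computed on the fused map are broadcast-compatible with the single-modality features (e.g. via a 1x1 conv that reduces the concatenated map back to C channels; the paper does not state this detail), and that the LiDAR branch \(F_{LB}^{\prime\prime}\) is obtained symmetrically to Eqs. 4-5.

```python
def feature_select(f_cb, f_lb, f_fuse2, w_c, w_s, phi):
    """Eqs. 4-6. w_c = M_c(F_fuse) and w_s = M_s(F'_fuse) are the weights
    computed on the fused map (assumed broadcastable to each branch).
    phi holds the three coefficients phi_1, phi_2, phi_3."""
    f_cb2 = f_cb + f_cb * w_c                # Eq. 4
    f_cb2 = f_cb2 + f_cb2 * w_s              # Eq. 5
    f_lb2 = f_lb + f_lb * w_c                # LiDAR branch, by symmetry
    f_lb2 = f_lb2 + f_lb2 * w_s
    return phi[0] * f_cb2 + phi[1] * f_lb2 + phi[2] * f_fuse2   # Eq. 6
```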
3.3 DS-BEV Decoder
After obtaining the preliminary fused BEV feature map, we further refine the fused features through the FRM. The FRM consists of two parts: a deformable attention module and a self-attention module. In BEVFormer[32], image features and BEV queries are fed into spatial cross-attention. Inspired by BEVFormer[32] and DAT, we design a deformable attention layer for our network. Specifically, we locate the points on the image features relevant to each BEV query, and the BEV query generates an attention matrix and offset information. The relevant points are shifted by the learned offsets to obtain the sampled points (used as Values). The attention matrix is then applied to these Values, yielding the image features relevant to the BEV query. Since the image is formed by projecting the 3D scene through the camera's viewing angle, we can use the camera parameters to map points in 3D space onto the image. We first estimate several possible heights for each query on the BEV plane and then project these 3D points onto the 2D views. For a given BEV query, the projected 2D points fall only on some of the views, while the others are not hit; we call the hit views \(V_{\mathrm{hit}}\). We then treat these 2D points as the reference points of the query and sample features from the hit views around them. Finally, we take the weighted sum of the sampled features as the output of deformable cross-attention.
$$\mathrm{DCA}(Q_{p},F_{t})=\frac{1}{\left|V_{\mathrm{hit}}\right|}\sum_{i\in V_{\mathrm{hit}}}\sum_{j=1}^{N_{\mathrm{ref}}}\mathrm{DeformAttn}\left(Q_{p},\rho(p,i,j),F_{t}^{i}\right)\tag{7}$$
Here \(V_{\mathrm{hit}}\) denotes the set of hit views, \(N_{\mathrm{ref}}\) is the number of reference points per query, \(\rho(p,i,j)\) is the 3D-to-2D projection that maps the j-th reference point of query p onto view i, and \(F_{t}^{i}\) is the feature map of the i-th camera view.
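A simplified single-view, single-head PyTorch sketch of the sampling step inside Eq. 7 is given below; the full DCA additionally averages the outputs over the hit views \(V_{\mathrm{hit}}\) and the \(N_{\mathrm{ref}}\) reference points. Class name, head layout, and the number of sampling points are our illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttention(nn.Module):
    """Each BEV query predicts offsets and attention weights, samples
    one hit view's feature map around its projected reference point,
    and returns the weighted sum (a sketch of DeformAttn in Eq. 7)."""
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offsets = nn.Linear(dim, n_points * 2)   # learned 2D offsets
        self.weights = nn.Linear(dim, n_points)       # attention matrix
        self.proj = nn.Linear(dim, dim)

    def forward(self, query, ref_pts, feat):
        # query:   (B, Nq, C)   BEV queries Q_p
        # ref_pts: (B, Nq, 2)   rho(p, i, j), normalized to [-1, 1]
        # feat:    (B, C, H, W) image features F_t^i of one hit view
        B, Nq, C = query.shape
        offs = self.offsets(query).view(B, Nq, self.n_points, 2)
        attn = self.weights(query).softmax(dim=-1)          # (B, Nq, P)
        loc = (ref_pts[:, :, None, :] + offs).clamp(-1, 1)  # sampled points
        # grid_sample treats (Nq, P) as an output grid: -> (B, C, Nq, P)
        sampled = F.grid_sample(feat, loc, align_corners=False)
        out = (sampled * attn[:, None]).sum(-1)             # weighted sum
        return self.proj(out.permute(0, 2, 1))              # (B, Nq, C)
```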
In MetaBEV[8], a self-attention layer is added after the cross-attention layer, and its effectiveness is verified experimentally. Unlike the traditional design, we place the self-attention layer after the deformable attention layer, so that the DS-BEV decoder can simultaneously capture the relationships between external queries and among internal queries.
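The resulting layer ordering can be sketched as follows, reusing the DeformableCrossAttention sketch above; normalization placement and hyper-parameters are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class DSBEVDecoderLayer(nn.Module):
    """One decoder layer: deformable cross-attention to image features,
    followed by self-attention among the BEV queries."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.cross_attn = DeformableCrossAttention(dim)  # sketch above
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries, ref_pts, feat):
        # Deformable cross-attention first, then query self-attention,
        # each with a residual connection and layer norm.
        q = self.norm1(queries + self.cross_attn(queries, ref_pts, feat))
        q = self.norm2(q + self.self_attn(q, q, q, need_weights=False)[0])
        return q
```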