DS-BEV focuses on improving multi-modal fusion, and the overall framework is shown in Fig. 1. Given the different inputs, we first apply modality-specific encoders to extract their features separately, convert them to a unified BEV representation space, and generate preliminary BEV fusion features through the FSM. Combined with the original features from the CNN network, the refined BEV fusion features are then obtained through the FRM.
3.1 Encoder
Given an input, a modality-specific backbone network is used to extract features. We build our multi-modal feature encoder on the state-of-the-art perception method BEVFusion: it takes multi-view images and LiDAR point clouds as inputs and converts the camera features into BEV space via depth prediction and geometric projection. In addition, we add a CNN network to extract local image information through convolution. Specifically, the image input is encoded into N 2D image feature maps, which are lifted into 3D space to generate 3D voxel features and finally compressed into BEV features. The LiDAR points are voxelized into 3D voxel features and likewise flattened into BEV features.
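The camera-to-BEV conversion described above can be illustrated with a minimal PyTorch sketch of the lift step (depth prediction followed by lifting pixel features along depth bins), in the style of BEVFusion's view transform. The module name, heads, and tensor shapes are our own illustrative assumptions; the subsequent geometric projection and BEV pooling are omitted.

```python
import torch
import torch.nn as nn

class CameraToBEVLift(nn.Module):
    """Schematic lift step: predict a per-pixel depth distribution,
    then lift 2D image features into a 3D frustum (a sketch, not the
    paper's exact implementation)."""
    def __init__(self, in_ch, out_ch, n_depth_bins):
        super().__init__()
        self.depth_head = nn.Conv2d(in_ch, n_depth_bins, 1)  # depth prediction
        self.feat_head = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, img_feats):
        # img_feats: (B*N, C, H, W) features from the N multi-view images
        depth = self.depth_head(img_feats).softmax(dim=1)    # (B*N, D, H, W)
        feats = self.feat_head(img_feats)                    # (B*N, C', H, W)
        # Outer product spreads each pixel feature over its depth bins,
        # producing frustum features to be splatted into BEV voxels.
        frustum = depth.unsqueeze(1) * feats.unsqueeze(2)    # (B*N, C', D, H, W)
        return frustum  # geometric projection / BEV pooling follows
```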
3.2 FSM
The FSM consists of three parts: a spatial attention module, a channel attention module, and a feature selection module. Attention modules: After conversion to the unified BEV space, the image features and LiDAR features are position-aligned in BEV space. We sample the feature maps to the same scale and use spatial attention to compensate for information missing from either single modality; channel attention is applied analogously. We then combine the channel attention weights and spatial attention weights computed on the fused modality with the pre-fusion BEV features, and integrate the resulting attention features into the fusion features through attention selection. The channel attention module helps the network focus on important channel features by learning inter-channel correlations, thereby improving feature representation and network performance. The spatial attention module helps the network focus on locally important regions and improves its understanding of the image's spatial structure by learning the importance of each spatial location.
$$F_{\mathrm{fuse}}=\mathrm{Concat}(F_{CB},F_{LB})\tag{1}$$

$$F_{\mathrm{fuse}}^{\prime}=F_{\mathrm{fuse}}+F_{\mathrm{fuse}}\otimes M_{c}\left(F_{\mathrm{fuse}}\right)\tag{2}$$

$$F_{\mathrm{fuse}}^{\prime\prime}=F_{\mathrm{fuse}}^{\prime}+F_{\mathrm{fuse}}^{\prime}\otimes M_{s}\left(F_{\mathrm{fuse}}^{\prime}\right)\tag{3}$$
\(M_{c}\) is the channel attention weight, and \(M_{s}\) is the spatial attention weight.
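A minimal PyTorch sketch of Eqs. 1-3 follows, assuming CBAM-style channel and spatial attention (a common choice; the paper does not specify the internals of \(M_c\) and \(M_s\)). Module names and the reduction ratio are illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c: per-channel weights from globally pooled BEV features (Eq. 2)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))      # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))       # global max pooling
        return torch.sigmoid(avg + mx)[..., None, None]  # (B, C, 1, 1)

class SpatialAttention(nn.Module):
    """M_s: per-location weights over the BEV plane (Eq. 3)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (B, C, H, W)
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(stats))  # (B, 1, H, W)

def attend_fused(f_cb, f_lb, m_c, m_s):
    """Eqs. 1-3: concatenate the camera/LiDAR BEV features, then apply
    residual channel and spatial attention to the fused map."""
    f_fuse = torch.cat([f_cb, f_lb], dim=1)          # Eq. 1
    f_fuse1 = f_fuse + f_fuse * m_c(f_fuse)          # Eq. 2
    f_fuse2 = f_fuse1 + f_fuse1 * m_s(f_fuse1)       # Eq. 3
    return f_fuse, f_fuse1, f_fuse2
```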
Feature selection module: The previous step yields the channel attention features, the spatial attention features, and the original features. We propose a feature selection module that assigns a weight to each of these features; in this paper, the weights differ across features.
$$F_{CB}^{\prime}=F_{CB}+F_{CB}\otimes M_{c}\left(F_{\mathrm{fuse}}\right)\tag{4}$$

$$F_{CB}^{\prime\prime}=F_{CB}^{\prime}+F_{CB}^{\prime}\otimes M_{s}\left(F_{\mathrm{fuse}}^{\prime}\right)\tag{5}$$

$$F_{\mathrm{Fuse}}=\phi_{1}F_{CB}^{\prime\prime}+\phi_{2}F_{LB}^{\prime\prime}+\phi_{3}F_{\mathrm{fuse}}^{\prime\prime}\tag{6}$$
Here, \(\phi_{1},\phi_{2},\phi_{3}\) are the weighting coefficients. After the three features are fused, the final fused BEV feature map is obtained.
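Eqs. 4-6 can be sketched as below. We assume the attention weights computed on the fused map are broadcast-compatible with the single-modality features (e.g. via a 1x1 conv that reduces the concatenated map back to C channels; the paper does not state this detail), and that the LiDAR branch \(F_{LB}^{\prime\prime}\) is obtained symmetrically to Eqs. 4-5.

```python
def feature_select(f_cb, f_lb, f_fuse2, w_c, w_s, phi):
    """Eqs. 4-6. w_c = M_c(F_fuse) and w_s = M_s(F'_fuse) are the weights
    computed on the fused map (assumed broadcastable to each branch).
    phi holds the three coefficients phi_1, phi_2, phi_3."""
    f_cb2 = f_cb + f_cb * w_c                # Eq. 4
    f_cb2 = f_cb2 + f_cb2 * w_s              # Eq. 5
    f_lb2 = f_lb + f_lb * w_c                # LiDAR branch, by symmetry
    f_lb2 = f_lb2 + f_lb2 * w_s
    return phi[0] * f_cb2 + phi[1] * f_lb2 + phi[2] * f_fuse2   # Eq. 6
```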
3.3 DS-BEV Decoder
After obtaining the preliminary fused BEV feature map, we further refine the fused features through the FRM. The FRM consists of two parts: a deformable attention module and a self-attention module. In BEVFormer[32], image features and BEV queries are fed into spatial cross-attention. Inspired by BEVFormer[32] and DAT, we design a deformable attention layer for our network. Specifically, we locate the points on the image features relevant to each BEV query, and the BEV query generates an attention matrix and offset information. The relevant points are shifted by the learned offsets to obtain the sampled points (used as Values). The attention matrix is then applied to these Values, yielding the image features relevant to the BEV query. Since the image is formed by projecting the 3D scene through the camera's viewing angle, we can use the camera parameters to map points in 3D space onto the image. We first estimate several possible heights for each query on the BEV plane and then project these 3D points onto the 2D views. For a given BEV query, the projected 2D points fall only on some of the views, while the others are not hit; we call the hit views \(V_{\mathrm{hit}}\). We then treat these 2D points as the reference points of the query and sample features from the hit views around them. Finally, we take the weighted sum of the sampled features as the output of deformable cross-attention.
$$\mathrm{DCA}(Q_{p},F_{t})=\frac{1}{\left|V_{\mathrm{hit}}\right|}\sum_{i\in V_{\mathrm{hit}}}\sum_{j=1}^{N_{\mathrm{ref}}}\mathrm{DeformAttn}\left(Q_{p},\rho(p,i,j),F_{t}^{i}\right)\tag{7}$$
Here \(V_{\mathrm{hit}}\) denotes the set of hit views, \(N_{\mathrm{ref}}\) is the number of reference points per query, \(\rho(p,i,j)\) is the 3D-to-2D projection that maps the j-th reference point of query p onto view i, and \(F_{t}^{i}\) is the feature map of the i-th camera view.
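A simplified single-view, single-head PyTorch sketch of the sampling step inside Eq. 7 is given below; the full DCA additionally averages the outputs over the hit views \(V_{\mathrm{hit}}\) and the \(N_{\mathrm{ref}}\) reference points. Class name, head layout, and the number of sampling points are our illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttention(nn.Module):
    """Each BEV query predicts offsets and attention weights, samples
    one hit view's feature map around its projected reference point,
    and returns the weighted sum (a sketch of DeformAttn in Eq. 7)."""
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offsets = nn.Linear(dim, n_points * 2)   # learned 2D offsets
        self.weights = nn.Linear(dim, n_points)       # attention matrix
        self.proj = nn.Linear(dim, dim)

    def forward(self, query, ref_pts, feat):
        # query:   (B, Nq, C)   BEV queries Q_p
        # ref_pts: (B, Nq, 2)   rho(p, i, j), normalized to [-1, 1]
        # feat:    (B, C, H, W) image features F_t^i of one hit view
        B, Nq, C = query.shape
        offs = self.offsets(query).view(B, Nq, self.n_points, 2)
        attn = self.weights(query).softmax(dim=-1)          # (B, Nq, P)
        loc = (ref_pts[:, :, None, :] + offs).clamp(-1, 1)  # sampled points
        # grid_sample treats (Nq, P) as an output grid: -> (B, C, Nq, P)
        sampled = F.grid_sample(feat, loc, align_corners=False)
        out = (sampled * attn[:, None]).sum(-1)             # weighted sum
        return self.proj(out.permute(0, 2, 1))              # (B, Nq, C)
```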
In MetaBEV[8], a self-attention layer is added after the cross-attention layer, and its effectiveness is verified experimentally. Unlike the traditional design, we place the self-attention layer after the deformable attention layer, so that the DS-BEV decoder can simultaneously capture the relationships between external queries and among internal queries.
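The resulting layer ordering can be sketched as follows, reusing the DeformableCrossAttention sketch above; normalization placement and hyper-parameters are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn

class DSBEVDecoderLayer(nn.Module):
    """One decoder layer: deformable cross-attention to image features,
    followed by self-attention among the BEV queries."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.cross_attn = DeformableCrossAttention(dim)  # sketch above
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries, ref_pts, feat):
        # Deformable cross-attention first, then query self-attention,
        # each with a residual connection and layer norm.
        q = self.norm1(queries + self.cross_attn(queries, ref_pts, feat))
        q = self.norm2(q + self.self_attn(q, q, q, need_weights=False)[0])
        return q
```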