Predicting 3D object detection boxes directly from image and point cloud data using multi-modal features with CMT

Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

... propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection.

Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes.

The spatial alignment of multi-modal tokens is performed by encoding the 3D points into multi-modal features.
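
To make the idea of "encoding the 3D points into multi-modal features" more concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that each image token is associated with a few 3D points sampled along its camera ray, that each BEV point cloud token is associated with a few points sampled along the height axis, and that a single linear layer (a stand-in for whatever learned encoder is actually used) maps those coordinates to a position embedding that is added to the token. All function names, shapes, and parameters (`image_token_points`, `bev_token_points`, `cam_to_ego`, `cell`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def image_token_points(h, w, depths, K_inv, cam_to_ego=np.eye(4)):
    """Back-project each feature-map cell to 3D points sampled along its camera ray."""
    us, vs = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)  # cell centres, (h, w)
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1)           # homogeneous pixels
    rays = pix @ K_inv.T                                          # ray directions with z = 1
    pts = rays[:, :, None, :] * depths[None, None, :, None]       # (h, w, d, 3), camera frame
    ones = np.ones(pts.shape[:-1] + (1,))
    pts = (np.concatenate([pts, ones], axis=-1) @ cam_to_ego.T)[..., :3]  # shared 3D frame
    return pts.reshape(h * w, -1)                                 # (h*w, d*3)

def bev_token_points(nx, ny, heights, cell=1.0):
    """Attach points sampled along the height axis to every BEV grid cell."""
    xs, ys = np.meshgrid((np.arange(nx) + 0.5) * cell, (np.arange(ny) + 0.5) * cell)
    xy = np.stack([xs, ys], axis=-1)                              # (ny, nx, 2)
    xy = np.repeat(xy[:, :, None, :], len(heights), axis=2)       # (ny, nx, nz, 2)
    z = np.broadcast_to(heights[None, None, :, None], xy.shape[:3] + (1,))
    return np.concatenate([xy, z], axis=-1).reshape(nx * ny, -1)  # (nx*ny, nz*3)

def embed(points, weight):
    """Assumed stand-in for a learned encoder: one linear map to the token dimension."""
    return points @ weight

# Toy setup: a 2x4 image feature map, a 3x2 BEV grid, 16-dim tokens.
rng = np.random.default_rng(0)
dim = 16
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]])  # toy intrinsics
depths = np.array([1.0, 5.0])
heights = np.array([-1.0, 0.0, 1.0])

img_pts = image_token_points(2, 4, depths, np.linalg.inv(K))
bev_pts = bev_token_points(3, 2, heights)

img_tokens = rng.normal(size=(8, dim)) + embed(img_pts, rng.normal(size=(6, dim)))
bev_tokens = rng.normal(size=(6, dim)) + embed(bev_pts, rng.normal(size=(9, dim)))

# Both modalities now carry position information from the same 3D space
# and can be concatenated into one token sequence.
tokens = np.concatenate([img_tokens, bev_tokens], axis=0)
```

Because both sets of tokens are tagged with coordinates from the same 3D frame, a downstream transformer can relate them without an explicit view transformation of the image features. The camera extrinsics default to the identity here purely to keep the example short.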

The core design of CMT is quite simple while its performance is impressive.

It achieves 74.1% NDS (state-of-the-art with single model) on nuScenes test set while maintaining faster inference speed.

Moreover, CMT has a strong robustness even if the LiDAR is missing ...

