
Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders

Repository for our arXiv paper "Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders".

Introduction

Mask-based pre-training has achieved great success in self-supervised learning for images and language without manually annotated supervision. However, it has not yet been studied for large-scale point clouds, which carry highly redundant spatial information. In this work, we propose a masked voxel autoencoder network for pre-training on large-scale point clouds, dubbed Voxel-MAE. Our key idea is to transform the point cloud into a voxel representation and classify whether each voxel contains points. This simple but effective strategy makes the network aware of object shape at the voxel level, which improves the performance of downstream tasks such as 3D object detection. Even with a 90% masking ratio, Voxel-MAE can still learn representative features, owing to the high spatial redundancy of large-scale point clouds. We also validate Voxel-MAE on unsupervised domain adaptation tasks, which demonstrates its generalization ability. Voxel-MAE shows that it is feasible to pre-train on large-scale point clouds without annotations to enhance the perception ability of autonomous vehicles. Extensive experiments with 3D object detectors (SECOND, CenterPoint, and PV-RCNN) on three popular datasets (KITTI, Waymo, and nuScenes) demonstrate the effectiveness of our pre-training method.

Flowchart of Voxel-MAE
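The core idea can be illustrated with a short, self-contained sketch (this is not the repository's implementation; the voxel size, point-cloud range, and 90% masking ratio below are illustrative choices): the point cloud is voxelised into a binary occupancy grid, most occupied voxels are masked out, and a network is trained to classify the occupancy of the masked positions.

```python
# Illustrative sketch of masked occupancy pre-training data preparation
# (not the repository's code; parameters are example values).
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.2),
             pc_range=(-50.0, -50.0, -3.0, 50.0, 50.0, 1.0)):
    """Binary occupancy grid: 1 where at least one LiDAR point falls in a voxel."""
    mins, maxs = np.array(pc_range[:3]), np.array(pc_range[3:])
    grid_shape = np.floor((maxs - mins) / np.array(voxel_size)).astype(int)
    idx = np.floor((points[:, :3] - mins) / np.array(voxel_size)).astype(int)
    valid = np.all((idx >= 0) & (idx < grid_shape), axis=1)
    occupancy = np.zeros(grid_shape, dtype=np.float32)
    occupancy[tuple(idx[valid].T)] = 1.0
    return occupancy

def mask_occupied(occupancy, mask_ratio=0.9, seed=0):
    """Drop a fraction of occupied voxels; the network must recover them."""
    rng = np.random.default_rng(seed)
    occupied = np.argwhere(occupancy > 0)
    drop = occupied[rng.choice(len(occupied),
                               int(mask_ratio * len(occupied)), replace=False)]
    masked = occupancy.copy()
    masked[tuple(drop.T)] = 0.0
    # (network input, binary occupancy target for the reconstruction loss)
    return masked, occupancy
```

An autoencoder then classifies whether each masked voxel contains points, so the supervision comes entirely from the raw LiDAR data.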

Installation

Please refer to INSTALL.md for the installation of OpenPCDet (v0.5).

Getting Started

Please refer to GETTING_STARTED.md.

Usage

First, pre-train Voxel-MAE

KITTI:

Train with multiple GPUs:
bash ./scripts/dist_train_voxel_mae.sh ${NUM_GPUS}  --cfg_file cfgs/kitti_models/voxel_mae_kitti.yaml --batch_size ${BATCH_SIZE}

Train with a single GPU:
python3 train_voxel_mae.py  --cfg_file cfgs/kitti_models/voxel_mae_kitti.yaml --batch_size ${BATCH_SIZE}

Waymo:

python3 train_voxel_mae.py  --cfg_file cfgs/waymo_models/voxel_mae_waymo.yaml --batch_size ${BATCH_SIZE}

nuScenes:

python3 train_voxel_mae.py  --cfg_file cfgs/nuscenes_models/voxel_mae_nuscenes.yaml --batch_size ${BATCH_SIZE}

Then train a detector with OpenPCDet

Training is the same as in standard OpenPCDet, except that the detector is initialized from the pre-trained Voxel-MAE checkpoint via --pretrained_model.

bash ./scripts/dist_train.sh ${NUM_GPUS}  --cfg_file cfgs/kitti_models/second.yaml --batch_size ${BATCH_SIZE} --pretrained_model ../output/kitti/voxel_mae/ckpt/check_point_10.pth
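If you need to inspect the checkpoint or port the weights manually, the sketch below shows one way partial initialisation can be done with PyTorch. It is not OpenPCDet's loader: the checkpoint key name ("model_state") and the copy-only-matching-tensors strategy are assumptions for illustration.

```python
# Hedged sketch: initialise a detector from a pre-training checkpoint by
# copying only the tensors whose names and shapes match (illustrative only).
import torch

def load_pretrained_backbone(model, ckpt_path):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    pretrained = ckpt.get("model_state", ckpt)  # key name is an assumption
    state = model.state_dict()
    matched = {k: v for k, v in pretrained.items()
               if k in state and v.shape == state[k].shape}
    state.update(matched)
    model.load_state_dict(state)
    print(f"Initialised {len(matched)}/{len(state)} tensors from {ckpt_path}")
    return model
```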

Performance

KITTI Dataset

The results are the 3D detection performance at moderate difficulty on the KITTI val set. The baseline results are taken from the official OpenPCDet repository.

| Method | Car@R11 | Pedestrian@R11 | Cyclist@R11 |
|---|---|---|---|
| SECOND | 78.62 | 52.98 | 67.15 |
| Voxel-MAE+SECOND | 78.90 | 53.14 | 68.08 |
| SECOND-IoU | 79.09 | 55.74 | 71.31 |
| Voxel-MAE+SECOND-IoU | 79.22 | 55.79 | 72.22 |
| PV-RCNN | 83.61 | 57.90 | 70.47 |
| Voxel-MAE+PV-RCNN | 83.82 | 59.37 | 71.99 |
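For reference, the @R11 columns are KITTI's 11-point interpolated average precision: precision is taken at 11 equally spaced recall levels and averaged (a standard definition, restated here for convenience):

```latex
% KITTI 11-point interpolated AP (AP@R11); p(r) is the precision at recall r
\mathrm{AP}|_{R_{11}} = \frac{1}{11} \sum_{r \in \{0,\, 0.1,\, \ldots,\, 1\}} \max_{\tilde{r} \ge r} p(\tilde{r})
```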

Waymo Open Dataset

Similar to OpenPCDet, all models are trained in a single-frame setting on 20% of the training samples (~32k frames), and each cell reports mAP/mAPH computed with the official Waymo evaluation metrics on the whole validation set (version 1.2).

| Performance@(train with 20% data) | Vec_L1 | Vec_L2 | Ped_L1 | Ped_L2 | Cyc_L1 | Cyc_L2 |
|---|---|---|---|---|---|---|
| SECOND | 70.96/70.34 | 62.58/62.02 | 65.23/54.24 | 57.22/47.49 | 57.13/55.62 | 54.97/53.53 |
| Voxel-MAE+SECOND | 71.12/70.58 | 62.67/62.34 | 67.21/55.68 | 59.03/48.79 | 57.73/56.18 | 55.62/54.17 |
| CenterPoint | 71.33/70.76 | 63.16/62.65 | 72.09/65.49 | 64.27/58.23 | 68.68/67.39 | 66.11/64.87 |
| Voxel-MAE+CenterPoint | 71.89/71.33 | 64.05/63.53 | 73.85/67.12 | 65.78/59.62 | 70.29/69.03 | 67.76/66.53 |
| PV-RCNN (AnchorHead) | 75.41/74.74 | 67.44/66.80 | 71.98/61.24 | 63.70/53.95 | 65.88/64.25 | 63.39/61.82 |
| Voxel-MAE+PV-RCNN (AnchorHead) | 75.94/75.28 | 67.94/67.34 | 74.02/63.48 | 64.91/55.57 | 67.21/65.49 | 64.62/63.02 |
| PV-RCNN (CenterHead) | 75.95/75.43 | 68.02/67.54 | 75.94/69.40 | 67.66/61.62 | 70.18/68.98 | 67.73/66.57 |
| Voxel-MAE+PV-RCNN (CenterHead) | 77.29/76.81 | 68.71/68.21 | 77.70/71.13 | 69.53/63.46 | 70.55/69.39 | 68.11/66.95 |
| PV-RCNN++ | 77.82/77.32 | 69.07/68.62 | 77.99/71.36 | 69.92/63.74 | 71.80/70.71 | 69.31/68.26 |
| Voxel-MAE+PV-RCNN++ | 78.23/77.72 | 69.54/69.12 | 79.85/73.23 | 71.07/64.96 | 71.80/70.64 | 69.31/68.26 |

nuScenes Dataset

| Method | mAP | NDS | mATE | mASE | mAOE | mAVE | mAAE |
|---|---|---|---|---|---|---|---|
| SECOND-MultiHead (CBGS) | 50.59 | 62.29 | 31.15 | 25.51 | 26.64 | 26.26 | 20.46 |
| Voxel-MAE+SECOND-MultiHead | 50.82 | 62.45 | 31.02 | 25.23 | 26.12 | 26.11 | 20.04 |
| CenterPoint (voxel_size=0.1) | 56.03 | 64.54 | 30.11 | 25.55 | 38.28 | 21.94 | 18.87 |
| Voxel-MAE+CenterPoint | 56.45 | 65.02 | 29.73 | 25.17 | 38.38 | 21.47 | 18.65 |
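For context, NDS (nuScenes Detection Score) aggregates mAP with the five true-positive error metrics in the remaining columns (lower is better for those); the table values appear to be scaled by 100. The benchmark defines NDS as:

```latex
% nuScenes Detection Score: each TP error metric is clipped to [0, 1]
% before being turned into a score and combined with mAP
\mathrm{NDS} = \frac{1}{10}\Bigl[\, 5\,\mathrm{mAP} \;+\; \sum_{\mathrm{mTP}\in\mathbb{TP}} \bigl(1 - \min(1, \mathrm{mTP})\bigr) \Bigr]
```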

License

Our code is released under the Apache 2.0 license.

Acknowledgement

This repository is based on OpenPCDet.

Citation

If you find this project useful in your research, please consider citing:

@ARTICLE{Occupancy-MAE,
    title = {Voxel-MAE: Masked Autoencoders for Pre-training Large-scale Point Clouds},
    author = {Chen Min and Xinli Xu and Dawei Zhao and Liang Xiao and Yiming Nie and Bin Dai},
    journal = {arXiv e-prints},
    year = {2022}
}
