
minklocmultimodal's Introduction

MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition

Paper: MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition, accepted to the International Joint Conference on Neural Networks (IJCNN) 2021 (ArXiv)

Jacek Komorowski, Monika Wysoczańska, Tomasz Trzciński

Warsaw University of Technology

Our other projects

  • MinkLoc3D: Point Cloud Based Large-Scale Place Recognition (WACV 2021): MinkLoc3D
  • Large-Scale Topological Radar Localization Using Learned Descriptors (ICONIP 2021): RadarLoc
  • EgoNN: Egocentric Neural Network for Point Cloud Based 6DoF Relocalization at the City Scale (IEEE Robotics and Automation Letters April 2022): EgoNN
  • Improving Point Cloud Based Place Recognition with Ranking-based Loss and Large Batch Training (2022): MinkLoc3Dv2

Introduction

We present a discriminative multimodal descriptor based on a pair of sensor readings: a point cloud from a LiDAR and an image from an RGB camera. Our descriptor, named MinkLoc++, can be used for place recognition, re-localization and loop closure in robotics or autonomous vehicle applications. We use a late fusion approach, where each modality is processed separately and the results are fused in the final part of the processing pipeline. The proposed method achieves state-of-the-art performance on standard place recognition benchmarks. We also identify the dominating modality problem that arises when training a multimodal descriptor: the network focuses on the modality that overfits the training data more strongly, which drives the training loss down but leads to suboptimal performance on the evaluation set. In this work we describe how to detect and mitigate this risk when using a deep metric learning approach to train a multimodal neural network.
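
The sketch below illustrates, in a simplified form, one way such a risk can be handled with per-modality loss weights: a triplet margin loss (via the pytorch_metric_learning package listed in the dependencies) is computed on the fused descriptor and, as auxiliary terms, on each modality separately. This is only an illustrative sketch under assumptions, not the actual training code; the weights alpha and beta and the mining strategy are hypothetical.

from pytorch_metric_learning import losses, miners

# Triplet margin loss with hard-triplet mining; the margin value is illustrative.
triplet_loss = losses.TripletMarginLoss(margin=0.2)
miner = miners.TripletMarginMiner(margin=0.2, type_of_triplets="hard")

def multimodal_loss(fused_emb, cloud_emb, image_emb, labels, alpha=1.0, beta=1.0):
    # Loss on the fused descriptor drives the final representation.
    loss_fused = triplet_loss(fused_emb, labels, miner(fused_emb, labels))
    # Auxiliary per-modality losses make it possible to monitor each modality
    # separately; lowering a weight (even to 0) is one way to keep a modality
    # that overfits from dominating the training signal.
    loss_cloud = triplet_loss(cloud_emb, labels, miner(cloud_emb, labels))
    loss_image = triplet_loss(image_emb, labels, miner(image_emb, labels))
    return loss_fused + alpha * loss_cloud + beta * loss_image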

Overview

Citation

If you find this work useful, please consider citing:

@INPROCEEDINGS{9533373,  
   author={Komorowski, Jacek and Wysoczańska, Monika and Trzcinski, Tomasz},  
   booktitle={2021 International Joint Conference on Neural Networks (IJCNN)},   
   title={MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition},   
   year={2021},  
   doi={10.1109/IJCNN52387.2021.9533373}
}

Environment and Dependencies

Code was tested using Python 3.8 with PyTorch 1.9.1 and MinkowskiEngine 0.5.4 on Ubuntu 20.04 with CUDA 10.2.

The following Python packages are required:

  • PyTorch (version 1.9.1)
  • MinkowskiEngine (version 0.5.4)
  • pytorch_metric_learning (version 1.0 or above)
  • tensorboard
  • colour_demosaicing

Modify the PYTHONPATH environment variable to include the absolute path to the project root folder:

export PYTHONPATH=$PYTHONPATH:/home/.../MinkLocMultimodal

Datasets

MinkLoc++ is a multimodal descriptor based on a pair of inputs:

  • a 3D point cloud constructed by aggregating multiple 2D LiDAR scans from the Oxford RobotCar dataset,
  • a corresponding RGB image from the stereo-center camera.

We use the 3D point clouds built by the authors of the PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition paper (link). Each point cloud is built by aggregating 2D LiDAR scans gathered during a 20 metre traversal of the vehicle. For details see the PointNetVLAD paper or their GitHub repository (link). You can download the training and evaluation point clouds from here (alternative link).

After downloading the dataset, edit the config_baseline_multimodal.txt configuration file (in the config folder). Set the dataset_folder parameter so that it points to the root folder of the PointNetVLAD dataset with 3D point clouds. The image_path parameter must point to a folder where downsampled RGB images from the Oxford RobotCar dataset will be saved; this folder will be created by the generate_rgb_for_lidar.py script.
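
For reference, the two relevant lines might look as follows after editing; the paths below are placeholders for your own locations:

dataset_folder = /data/pointnetvlad/benchmark_datasets
image_path = /data/oxford_robotcar_rgb_downsampled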

Generate training and evaluation tuples

Run the commands below to generate training pickles (with positive and negative point clouds for each anchor point cloud) and evaluation pickles. The training pickle format is optimized and differs from the format used in the PointNetVLAD code.

cd generating_queries/ 

# Generate training tuples for the Baseline Dataset
python generate_training_tuples_baseline.py --dataset_root <dataset_root_path>

# Generate training tuples for the Refined Dataset
python generate_training_tuples_refine.py --dataset_root <dataset_root_path>

# Generate evaluation tuples
python generate_test_sets.py --dataset_root <dataset_root_path>

<dataset_root_path> is the path to the dataset root folder, e.g. /data/pointnetvlad/benchmark_datasets/. Before running the code, ensure you have read/write rights to <dataset_root_path>, as training and evaluation pickles are saved there.
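
If you want to sanity-check the generated pickles, a minimal sketch is shown below. The file name training_queries_baseline.pickle is an assumed default output name, and the project root must be on PYTHONPATH so that the custom query classes can be unpickled:

import pickle

# The pickles are saved in the dataset root folder (example path from above).
with open("/data/pointnetvlad/benchmark_datasets/training_queries_baseline.pickle", "rb") as f:
    training_queries = pickle.load(f)

print(len(training_queries), "training elements")
print(training_queries[0])  # one anchor with its positives and negatives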

Downsample and index RGB images linked with each point cloud

RGB images are taken directly from the Oxford RobotCar dataset. First, download the stereo camera images from the Oxford RobotCar dataset; see the dataset website for details (link). After downloading the Oxford RobotCar dataset, run the generate_rgb_for_lidar.py script. For each 3D point cloud, the script finds the 20 closest RGB images in the RobotCar dataset, downsamples them and saves them in the target directory (the image_path parameter in config_baseline_multimodal.txt). During training, the network input consists of a 3D point cloud and one RGB image randomly chosen from these 20 corresponding images. During evaluation, the input consists of a 3D point cloud and the RGB image with the closest timestamp.

cd scripts/ 

# Downsample and index RGB images linked with each point cloud
python generate_rgb_for_lidar.py --config ../config/config_baseline_multimodal.txt --oxford_root <path_to_Oxford_RobotCar_dataset>

Alternatively you can download pre-processed and downsampled RobotCar images from this link.
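
As described above, each point cloud is linked with its 20 nearest RGB images, and one of them is picked per sample. A rough sketch of that selection logic is shown below; the index file name lidar2image_ndx.pickle and its structure (a mapping from LiDAR timestamp to a list of image timestamps) are assumptions made for illustration:

import pickle
import random

with open("lidar2image_ndx.pickle", "rb") as f:
    lidar2image_ndx = pickle.load(f)  # assumed: {lidar_ts: [image_ts, ...]}

def pick_image_ts(lidar_ts, training=True):
    candidates = lidar2image_ndx[lidar_ts]
    if training:
        # Training: a random image out of the 20 nearest ones.
        return random.choice(candidates)
    # Evaluation: the image with the closest timestamp.
    return min(candidates, key=lambda ts: abs(ts - lidar_ts))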

Training

MinkLoc++ can be used in a unimodal scenario (3D point cloud input only) and a multimodal scenario (3D point cloud + RGB image input). To train the MinkLoc++ network, download and decompress the 3D point cloud dataset and generate training pickles as described above. To train the multimodal model (3D+RGB), also download the original Oxford RobotCar dataset and extract the RGB images corresponding to the 3D point clouds as described above. Edit the configuration files:

  • config_baseline_multimodal.txt when training a multimodal (3D+RGB) model
  • config_baseline.txt and config_refined.txt when training a unimodal (3D only) model

Set the dataset_folder parameter to the dataset root folder, where the 3D point clouds are located. Set the image_path parameter to the path with RGB images corresponding to the 3D point clouds, extracted from the Oxford RobotCar dataset using the generate_rgb_for_lidar.py script (only when training a multimodal model). Modify the batch_size_limit parameter depending on the available GPU memory; the default limit requires 11GB of GPU RAM.

To train the multimodal model (3D+RGB), run:

cd training

python train.py --config ../config/config_baseline_multimodal.txt --model_config ../models/minklocmultimodal.txt

To train a unimodal (3D only) model, run:

cd training

# Train unimodal (3D only) model on the Baseline Dataset
python train.py --config ../config/config_baseline.txt --model_config ../models/minkloc3d.txt

# Train unimodal (3D only) model on the Refined Dataset
python train.py --config ../config/config_refined.txt --model_config ../models/minkloc3d.txt

Pre-trained Models

Pretrained models are available in the weights directory:

  • minkloc_multimodal.pth: multimodal model (3D+RGB) trained on the Baseline Dataset with corresponding RGB images
  • minkloc3d_baseline.pth: unimodal model (3D only) trained on the Baseline Dataset
  • minkloc3d_refined.pth: unimodal model (3D only) trained on the Refined Dataset

Evaluation

To evaluate pretrained models run the following commands:

cd eval

# To evaluate the multimodal model (3D+RGB) trained on the Baseline Dataset
python evaluate.py --config ../config/config_baseline_multimodal.txt --model_config ../models/minklocmultimodal.txt --weights ../weights/minklocmultimodal_baseline.pth

# To evaluate the unimodal model (3D only) trained on the Baseline Dataset
python evaluate.py --config ../config/config_baseline.txt --model_config ../models/minkloc3d.txt --weights ../weights/minkloc3d_baseline.pth

# To evaluate the unimodal model (3D only) trained on the Refined Dataset
python evaluate.py --config ../config/config_refined.txt --model_config ../models/minkloc3d.txt --weights ../weights/minkloc3d_refined.pth

Results

MinkLoc++ performance compared with the state of the art, measured by Average Recall at top 1 (AR@1) and at top 1% (AR@1%).
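
For reference, the sketch below shows how such recall metrics are commonly computed under the PointNetVLAD evaluation protocol, where a retrieved database element counts as correct if it lies within 25 m of the query. It is an illustration, not the code from the eval folder:

import numpy as np

def average_recall_at_1_percent(query_desc, db_desc, query_xy, db_xy, dist_thresh=25.0):
    # Top-k retrieval with k equal to 1% of the database size (at least 1).
    k = max(1, int(round(0.01 * len(db_desc))))
    hits = 0
    for q, q_xy in zip(query_desc, query_xy):
        nearest = np.argsort(np.linalg.norm(db_desc - q, axis=1))[:k]
        # Correct if any retrieved place lies within dist_thresh metres of the query.
        if np.any(np.linalg.norm(db_xy[nearest] - q_xy, axis=1) <= dist_thresh):
            hits += 1
    return 100.0 * hits / len(query_desc)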

Multimodal model (3D+RGB) trained on the Baseline Dataset extended with RGB images

Method                Oxford (AR@1)   Oxford (AR@1%)
CORAL [1]             88.9            96.1
PIC-Net [2]           -               98.2
MinkLoc++ (3D+RGB)    96.7            99.1

Unimodal model (3D only) trained on the Baseline Dataset

Method                 Oxford (AR@1%)   U.S. (AR@1%)   R.A. (AR@1%)   B.D. (AR@1%)
PointNetVLAD [3]       80.3             72.6           60.3           65.3
PCAN [4]               83.8             79.1           71.2           66.8
DAGC [5]               87.5             83.5           75.7           71.2
LPD-Net [6]            94.9             96.0           90.5           89.1
EPC-Net [7]            94.7             96.5           88.6           84.9
SOE-Net [8]            96.4             93.2           91.5           88.5
NDT-Transformer [10]   97.7             -              -              -
MinkLoc3D [9]          97.9             95.0           91.2           88.5
MinkLoc++ (3D-only)    98.2             94.5           92.1           88.4

Unimodal model (3D only) trained on the Refined Dataset

Method                 Oxford (AR@1%)   U.S. (AR@1%)   R.A. (AR@1%)   B.D. (AR@1%)
PointNetVLAD [3]       80.1             94.5           93.1           86.5
PCAN [4]               86.4             94.1           92.3           87.0
DAGC [5]               87.8             94.3           93.4           88.5
LPD-Net [6]            94.9             98.9           96.4           94.4
SOE-Net [8]            96.4             97.7           95.9           92.6
MinkLoc3D [9]          98.5             99.7           99.3           96.7
MinkLoc++ (3D-only)    98.4             99.7           99.3           97.4

  1. Y. Pan et al., "CORAL: Colored structural representation for bi-modal place recognition", preprint arXiv:2011.10934 (2020)
  2. Y. Lu et al., "PIC-Net: Point Cloud and Image Collaboration Network for Large-Scale Place Recognition", preprint arXiv:2008.00658 (2020)
  3. M. A. Uy and G. H. Lee, "PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  4. W. Zhang and C. Xiao, "PCAN: 3D Attention Map Learning Using Contextual Information for Point Cloud Based Retrieval", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  5. Q. Sun et al., "DAGC: Employing Dual Attention and Graph Convolution for Point Cloud based Place Recognition", Proceedings of the 2020 International Conference on Multimedia Retrieval
  6. Z. Liu et al., "LPD-Net: 3D Point Cloud Learning for Large-Scale Place Recognition and Environment Analysis", 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  7. L. Hui et al., "Efficient 3D Point Cloud Feature Learning for Large-Scale Place Recognition", preprint arXiv:2101.02374 (2021)
  8. Y. Xia et al., "SOE-Net: A Self-Attention and Orientation Encoding Network for Point Cloud based Place Recognition", 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  9. J. Komorowski, "MinkLoc3D: Point Cloud Based Large-Scale Place Recognition", Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021
  10. Z. Zhou et al., "NDT-Transformer: Large-scale 3D Point Cloud Localisation Using the Normal Distribution Transform Representation", 2021 IEEE International Conference on Robotics and Automation (ICRA)
  • J. Komorowski, M. Wysoczanska, T. Trzcinski, "MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition", accepted for the International Joint Conference on Neural Networks (IJCNN), 2021

License

Our code is released under the MIT License (see LICENSE file for details).

minklocmultimodal's People

Contributors

alexmelekhin, jac99


minklocmultimodal's Issues

KITTI dataset

According to your paper, the database and query split of the KITTI dataset was as follows:

We take Sequence 00 which visits the same places repeatedly and construct the reference database using the data gathered during the first 170 seconds. The rest is used as localization queries.

However, with this split most queries did not have a nearby database entry when I tried it (sequence 00 -> database: first 170 s, query: the rest (170-470 s)). Could you please explain more about this setting or share the code?

bin file data format

Hello, I have another question. When I read the point cloud .bin files, I found that the values inside are very small, all less than 1. What preprocessing has been done? I look forward to your reply. Thank you!

How to normalize the RobotCar pointcloud data to [-1,1]?

Hi, your paper says: "As point coordinates in the Oxford RobotCar dataset are normalized to be within [−1,1] range, this gives up to 200 voxels in each spatial direction." May I know what algorithm was used to normalize the point clouds to [−1, 1]?

KITTI dataset details

I have a few questions regarding the KITTI dataset.

  1. There are a total of 4 images (2 grayscale, 2 colour) in the KITTI dataset. I wonder which one was used.
  2. When using the point cloud, did you remove the ground and downsample it to 4096 points, as in the Oxford and in-house datasets? If so, KITTI's point clouds seem to cover a wider area. Did you crop them?
  3. How did you handle the northing and easting information? In the Oxford and in-house datasets, the distance is computed only from northing and easting, without considering altitude. Did you save and use it in the same way for KITTI (for example, easting: x, northing: z), or did you use the 3D distance?

I look forward to your reply. Thank you!

Oxford Dataset RGB Image Process

Hi, thanks for your great work.
I have one question.
The stereo/center camera data downloaded from the official website is single-channel, but I haven't seen any code in your implementation for handling single-channel data. Did I overlook something?
Thanks in advance.

MinkLoc++ (RGB-only) generation

Hello~ Thanks for your work!
I saw that your paper reports results using only RGB images, but I could not find the corresponding weight file or the steps to run it. Could you tell me how to reproduce this?

Bug in lidar2image_ndx generation for val queries

I've tried to implement validation during training and ran into a problem: in the validation phase the script returns an error:

AssertionError: Unknown lidar timestamp: 1435937763823973

I've checked the index generation script and found that there is a bug in the code:

ts, traversal = get_ts_traversal(train_queries[e].rel_scan_filepath)

I am pretty sure that it should be ts, traversal = get_ts_traversal(val_queries[e].rel_scan_filepath)

How to get the images corresponding to the lidar point clouds?

Hi,
Thanks for your nice work and shared code.
When I need to get the images corresponding to the lidar point clouds, I follow your instructions and run the generate_rgb_for_lidar.py script first. But the code shown below confuses me:
(screenshot of the code in question)
May I know what lidar2image_ndx_path, lidar2image_ndx.pickle, etc. are? Do I need to run other scripts to generate these files?

RGB generation

Thanks for the work!
I'm a beginner with point clouds. When I run generate_rgb_for_lidar.py, it seems that lidar2image_ndx.pickle is required, but I cannot figure out where it is first generated. The error is raised before the function create_lidar2img_ndx has been run. Could you please give me a hint about that?

How to evaluate on KITTI dataset?

Hi, I wanted to recreate the generalisation results on the KITTI dataset that you mention in the paper. I would appreciate any advice on how to run the model on the KITTI dataset.

RobotCar Seasons

Hello~ Thanks for your work!
How can I reproduce the result of Table 3 in your paper? Could you provide the script that finds the LiDAR readings with corresponding timestamps in the original RobotCar dataset for each image in the RobotCar Seasons dataset?

Question about the model structure

Hi,
I recently implemented your work, but I am confused about the model structure, shown below:
(screenshot of the model structure)

Question 1: I think the convs(1) module is not consistent with the description in Fig. 3 of your paper?
Question 2: What is the downsample module for?

RobotCar images

Hi, thanks for your great work.
The pre-processed and downsampled RobotCar images are unavailable here. Could you please re-upload the pre-processed images via Google Drive?
Thanks in advance.

Question about pointcloud data

Hi, in your work you used the 2D LiDARs (from Oxford RobotCar: 2 x SICK LMS-151 2D LiDAR, 270° FoV, 50 Hz, 50 m range, 0.5° resolution). According to the documentation: each 2D scan consists of 541 triplets of (x, y, R), where x, y are the 2D Cartesian coordinates of the LiDAR return relative to the sensor (in metres) and R is the measured infrared reflectance value. Is the point cloud data you used therefore not truly 3-dimensional, with the Z coordinate representing reflectance?

questions about training

Hello, thanks for your great work.
I have run your training code and found a couple of things confusing.

  1. The image loss is much larger than the point cloud loss (around 100 times), which is also mentioned in the paper as the "overfitting problem". Could you explain this in more detail? Why are there fewer active triplets for the RGB image modality than for the 3D modality, and does an active triplet correspond to num_non_zero_triplets in the code?
  2. The adopted ResNet-FPN is already pretrained. Is the large training loss mainly caused by the large differences in illumination across traversals of the Oxford RobotCar dataset? It is also a bit puzzling that the image loss goes down even when its weight (beta) is set to 0.0.

Looking forward to your reply. Thanks.

loss becomes NaN during training

Hello!
I ran your code and found that the loss becomes NaN after about 40 epochs. It is caused by all embedding values becoming NaN. Did you encounter this during your research? Is there a bug, or is it just by chance?

The Oxford RobotCar dataset is unavailable

Hi, thanks for your great work on multimodal fusion for place recognition.
However, the Oxford RobotCar dataset has been unavailable since 2022-11-08, and I'm a bit frustrated that there is no sign of the download re-opening. Could you please provide me with the centre RGB images via Google Drive?
Thanks in advance.
