
dust3r's Introduction

demo

Official implementation of DUSt3R: Geometric 3D Vision Made Easy
[Project page], [DUSt3R arxiv]

Example of reconstruction from two images

High level overview of DUSt3R capabilities

@inproceedings{dust3r_cvpr24,
      title={DUSt3R: Geometric 3D Vision Made Easy}, 
      author={Shuzhe Wang and Vincent Leroy and Yohann Cabon and Boris Chidlovskii and Jerome Revaud},
      booktitle = {CVPR},
      year = {2024}
}

@misc{dust3r_arxiv23,
      title={DUSt3R: Geometric 3D Vision Made Easy}, 
      author={Shuzhe Wang and Vincent Leroy and Yohann Cabon and Boris Chidlovskii and Jerome Revaud},
      year={2023},
      eprint={2312.14132},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}


License

The code is distributed under the CC BY-NC-SA 4.0 License. See LICENSE for more information.

# Copyright (C) 2024-present Naver Corporation. All rights reserved.
# Licensed under CC BY-NC-SA 4.0 (non-commercial use only).

Get Started

Installation

  1. Clone DUSt3R.
git clone --recursive https://github.com/naver/dust3r
cd dust3r
# if you have already cloned dust3r:
# git submodule update --init --recursive
  2. Create the environment. Here we show an example using conda.
conda create -n dust3r python=3.11 cmake=3.14.0
conda activate dust3r 
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia  # use the correct version of cuda for your system
pip install -r requirements.txt
# Optional: you can also install additional packages to:
# - add support for HEIC images
pip install -r requirements_optional.txt
  3. Optional: compile the CUDA kernels for RoPE (as in CroCo v2).
# DUST3R relies on RoPE positional embeddings for which you can compile some cuda kernels for faster runtime.
cd croco/models/curope/
python setup.py build_ext --inplace
cd ../../../
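
You can optionally verify that the compiled extension is importable. A minimal sketch, assuming the built module is named curope and that you run it from croco/models/curope/ (check that directory for the actual name):

try:
    import curope  # module built by `python setup.py build_ext --inplace` above
    print("cuRoPE CUDA kernels available")
except ImportError:
    print("cuRoPE kernels not found; DUSt3R falls back to the pure-PyTorch RoPE")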

Checkpoints

You can obtain the checkpoints in two ways:

  1. You can use our huggingface_hub integration: the models will be downloaded automatically.

  2. Otherwise, we provide several pre-trained models:

Modelname                                  | Training resolutions                        | Head   | Encoder | Decoder
DUSt3R_ViTLarge_BaseDecoder_224_linear.pth | 224x224                                     | Linear | ViT-L   | ViT-B
DUSt3R_ViTLarge_BaseDecoder_512_linear.pth | 512x384, 512x336, 512x288, 512x256, 512x160 | Linear | ViT-L   | ViT-B
DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth    | 512x384, 512x336, 512x288, 512x256, 512x160 | DPT    | ViT-L   | ViT-B

You can check the hyperparameters we used to train these models in the section: Our Hyperparameters

To download a specific model, for example DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth:

mkdir -p checkpoints/
wget https://download.europe.naverlabs.com/ComputerVision/DUSt3R/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth -P checkpoints/
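
Either way, the checkpoint can then be loaded with the same call used in the Usage section below; a minimal sketch:

from dust3r.model import AsymmetricCroCo3DStereo

# via the huggingface_hub integration (downloads the weights automatically) ...
model = AsymmetricCroCo3DStereo.from_pretrained("naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt")
# ... or from the checkpoint file downloaded above
model = AsymmetricCroCo3DStereo.from_pretrained("checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth")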

Interactive demo

In this demo, you should be able to run DUSt3R on your machine to reconstruct a scene. First, select images that depict the same scene.

You can adjust the global alignment schedule and its number of iterations.

Note

If you selected one or two images, the global alignment procedure will be skipped (mode=GlobalAlignerMode.PairViewer)
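
A minimal sketch of that mode choice, mirroring what the demo does (see also the Usage section below):

from dust3r.cloud_opt import GlobalAlignerMode

def pick_aligner_mode(n_images):
    # with one or two images, PairViewer simply converts the raw predictions;
    # with three or more, run the full global optimization
    if n_images < 3:
        return GlobalAlignerMode.PairViewer
    return GlobalAlignerMode.PointCloudOptimizer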

Hit "Run" and wait. When the global alignment ends, the reconstruction appears. Use the slider "min_conf_thr" to show or remove low confidence areas.

python3 demo.py --model_name DUSt3R_ViTLarge_BaseDecoder_512_dpt

# Use --weights to load a checkpoint from a local file, eg --weights checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth
# Use --image_size to select the correct resolution for the selected checkpoint. 512 (default) or 224
# Use --local_network to make it accessible on the local network, or --server_name to specify the url manually
# Use --server_port to change the port, by default it will search for an available port starting at 7860
# Use --device to use a different device, by default it's "cuda"

Interactive demo with docker

To run DUSt3R using Docker, including with NVIDIA CUDA support, follow these instructions:

  1. Install Docker: If not already installed, download and install docker and docker compose from the Docker website.

  2. Install NVIDIA Docker Toolkit: For GPU support, install the NVIDIA Docker toolkit from the Nvidia website.

  3. Build the Docker image and run it: cd into the ./docker directory and run the following commands:

cd docker
bash run.sh --with-cuda --model_name="DUSt3R_ViTLarge_BaseDecoder_512_dpt"

Or if you want to run the demo without CUDA support, run the following command:

cd docker
bash run.sh --model_name="DUSt3R_ViTLarge_BaseDecoder_512_dpt"

By default, demo.py is launched with the option --local_network.
Visit http://localhost:7860/ to access the web UI (or replace localhost with the machine's name to access it from the network).

run.sh will launch docker-compose using either the docker-compose-cuda.yml or docker-compose-cpu.yml config file, then it starts the demo using entrypoint.sh.

demo

Usage

from dust3r.inference import inference
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs
from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

if __name__ == '__main__':
    device = 'cuda'
    batch_size = 1
    schedule = 'cosine'
    lr = 0.01
    niter = 300

    model_name = "naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt"
    # you can put the path to a local checkpoint in model_name if needed
    model = AsymmetricCroCo3DStereo.from_pretrained(model_name).to(device)
    # load_images can take a list of images or a directory
    images = load_images(['croco/assets/Chateau1.png', 'croco/assets/Chateau2.png'], size=512)
    pairs = make_pairs(images, scene_graph='complete', prefilter=None, symmetrize=True)
    output = inference(pairs, model, device, batch_size=batch_size)

    # at this stage, you have the raw dust3r predictions
    view1, pred1 = output['view1'], output['pred1']
    view2, pred2 = output['view2'], output['pred2']
    # here, view1, pred1, view2, pred2 are dicts of lists of len(2)
    #  -> because we symmetrize we have (im1, im2) and (im2, im1) pairs
    # in each view you have:
    # an integer image identifier: view1['idx'] and view2['idx']
    # the img: view1['img'] and view2['img']
    # the image shape: view1['true_shape'] and view2['true_shape']
    # an instance string output by the dataloader: view1['instance'] and view2['instance']
    # pred1 and pred2 contains the confidence values: pred1['conf'] and pred2['conf']
    # pred1 contains 3D points for view1['img'] in view1['img'] space: pred1['pts3d']
    # pred2 contains 3D points for view2['img'] in view1['img'] space: pred2['pts3d_in_other_view']

    # next we'll use the global_aligner to align the predictions
    # depending on your task, you may be fine with the raw output and not need it
    # with only two input images, you could use GlobalAlignerMode.PairViewer: it would just convert the output
    # if using GlobalAlignerMode.PairViewer, no need to run compute_global_alignment
    scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
    loss = scene.compute_global_alignment(init="mst", niter=niter, schedule=schedule, lr=lr)

    # retrieve useful values from scene:
    imgs = scene.imgs
    focals = scene.get_focals()
    poses = scene.get_im_poses()
    pts3d = scene.get_pts3d()
    confidence_masks = scene.get_masks()

    # visualize reconstruction
    scene.show()

    # find 2D-2D matches between the two images
    from dust3r.utils.geometry import find_reciprocal_matches, xy_grid
    pts2d_list, pts3d_list = [], []
    for i in range(2):
        conf_i = confidence_masks[i].cpu().numpy()
        pts2d_list.append(xy_grid(*imgs[i].shape[:2][::-1])[conf_i])  # imgs[i].shape[:2] = (H, W)
        pts3d_list.append(pts3d[i].detach().cpu().numpy()[conf_i])
    reciprocal_in_P2, nn2_in_P1, num_matches = find_reciprocal_matches(*pts3d_list)
    print(f'found {num_matches} matches')
    matches_im1 = pts2d_list[1][reciprocal_in_P2]
    matches_im0 = pts2d_list[0][nn2_in_P1][reciprocal_in_P2]

    # visualize a few matches
    import numpy as np
    from matplotlib import pyplot as pl
    n_viz = 10
    match_idx_to_viz = np.round(np.linspace(0, num_matches-1, n_viz)).astype(int)
    viz_matches_im0, viz_matches_im1 = matches_im0[match_idx_to_viz], matches_im1[match_idx_to_viz]

    H0, W0, H1, W1 = *imgs[0].shape[:2], *imgs[1].shape[:2]
    img0 = np.pad(imgs[0], ((0, max(H1 - H0, 0)), (0, 0), (0, 0)), 'constant', constant_values=0)
    img1 = np.pad(imgs[1], ((0, max(H0 - H1, 0)), (0, 0), (0, 0)), 'constant', constant_values=0)
    img = np.concatenate((img0, img1), axis=1)
    pl.figure()
    pl.imshow(img)
    cmap = pl.get_cmap('jet')
    for i in range(n_viz):
        (x0, y0), (x1, y1) = viz_matches_im0[i].T, viz_matches_im1[i].T
        pl.plot([x0, x1 + W0], [y0, y1], '-+', color=cmap(i / (n_viz - 1)), scalex=False, scaley=False)
    pl.show(block=True)

matching example on croco pair
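
As a follow-up to the example above, here is a hedged sketch (not part of the original script) that exports the aligned, confidence-filtered point cloud to a PLY file with trimesh; it assumes scene.imgs returns float RGB images in [0, 1]:

import numpy as np
import trimesh

# keep only points above the confidence threshold, in the globally aligned frame
pts = np.concatenate([p[m].detach().cpu().numpy()
                      for p, m in zip(pts3d, confidence_masks)])
col = np.concatenate([im[m.cpu().numpy()]
                      for im, m in zip(imgs, confidence_masks)])  # assumed float RGB in [0, 1]
trimesh.PointCloud(pts, colors=(col * 255).astype(np.uint8)).export('scene.ply')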

Training

In this section, we present a short demonstration to get started with training DUSt3R. We have not yet released the training datasets, so we will download and prepare a subset of CO3Dv2 (Creative Commons Attribution-NonCommercial 4.0 International) and launch the training code on it. The demo model will be trained for a few epochs on a very small dataset, so it will not be very good.

Demo

# download and prepare the co3d subset
mkdir -p data/co3d_subset
cd data/co3d_subset
git clone https://github.com/facebookresearch/co3d
cd co3d
python3 ./co3d/download_dataset.py --download_folder ../ --single_sequence_subset
rm ../*.zip
cd ../../..

python3 datasets_preprocess/preprocess_co3d.py --co3d_dir data/co3d_subset --output_dir data/co3d_subset_processed  --single_sequence_subset

# download the pretrained croco v2 checkpoint
mkdir -p checkpoints/
wget https://download.europe.naverlabs.com/ComputerVision/CroCo/CroCo_V2_ViTLarge_BaseDecoder.pth -P checkpoints/

# the training of dust3r is done in 3 steps.
# for this example we'll do fewer epochs, for the actual hyperparameters we used in the paper, see the next section: "Our Hyperparameters"
# step 1 - train dust3r for 224 resolution
torchrun --nproc_per_node=4 train.py \
    --train_dataset "1000 @ Co3d(split='train', ROOT='data/co3d_subset_processed', aug_crop=16, mask_bg='rand', resolution=224, transform=ColorJitter)" \
    --test_dataset "100 @ Co3d(split='test', ROOT='data/co3d_subset_processed', resolution=224, seed=777)" \
    --model "AsymmetricCroCo3DStereo(pos_embed='RoPE100', img_size=(224, 224), head_type='linear', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
    --train_criterion "ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
    --test_criterion "Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
    --pretrained "checkpoints/CroCo_V2_ViTLarge_BaseDecoder.pth" \
    --lr 0.0001 --min_lr 1e-06 --warmup_epochs 1 --epochs 10 --batch_size 16 --accum_iter 1 \
    --save_freq 1 --keep_freq 5 --eval_freq 1 \
    --output_dir "checkpoints/dust3r_demo_224"	  

# step 2 - train dust3r for 512 resolution
torchrun --nproc_per_node=4 train.py \
    --train_dataset "1000 @ Co3d(split='train', ROOT='data/co3d_subset_processed', aug_crop=16, mask_bg='rand', resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter)" \
    --test_dataset "100 @ Co3d(split='test', ROOT='data/co3d_subset_processed', resolution=(512,384), seed=777)" \
    --model "AsymmetricCroCo3DStereo(pos_embed='RoPE100', patch_embed_cls='ManyAR_PatchEmbed', img_size=(512, 512), head_type='linear', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
    --train_criterion "ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
    --test_criterion "Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
    --pretrained "checkpoints/dust3r_demo_224/checkpoint-best.pth" \
    --lr 0.0001 --min_lr 1e-06 --warmup_epochs 1 --epochs 10 --batch_size 4 --accum_iter 4 \
    --save_freq 1 --keep_freq 5 --eval_freq 1 \
    --output_dir "checkpoints/dust3r_demo_512"

# step 3 - train dust3r for 512 resolution with dpt
torchrun --nproc_per_node=4 train.py \
    --train_dataset "1000 @ Co3d(split='train', ROOT='data/co3d_subset_processed', aug_crop=16, mask_bg='rand', resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter)" \
    --test_dataset "100 @ Co3d(split='test', ROOT='data/co3d_subset_processed', resolution=(512,384), seed=777)" \
    --model "AsymmetricCroCo3DStereo(pos_embed='RoPE100', patch_embed_cls='ManyAR_PatchEmbed', img_size=(512, 512), head_type='dpt', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
    --train_criterion "ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
    --test_criterion "Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
    --pretrained "checkpoints/dust3r_demo_512/checkpoint-best.pth" \
    --lr 0.0001 --min_lr 1e-06 --warmup_epochs 1 --epochs 10 --batch_size 2 --accum_iter 8 \
    --save_freq 1 --keep_freq 5 --eval_freq 1 \
    --output_dir "checkpoints/dust3r_demo_512dpt"

Our Hyperparameters

We didn't release the training datasets, but here are the commands we used for training our models:

# NOTE: ROOT path omitted for datasets
# 224 linear
torchrun --nproc_per_node 4 train.py \
    --train_dataset=" + 100_000 @ Habitat512(1_000_000, split='train', aug_crop=16, resolution=224, transform=ColorJitter) + 100_000 @ BlendedMVS(split='train', aug_crop=16, resolution=224, transform=ColorJitter) + 100_000 @ MegaDepthDense(split='train', aug_crop=16, resolution=224, transform=ColorJitter) + 100_000 @ ARKitScenes(aug_crop=256, resolution=224, transform=ColorJitter) + 100_000 @ Co3d_v3(split='train', aug_crop=16, mask_bg='rand', resolution=224, transform=ColorJitter) + 100_000 @ StaticThings3D(aug_crop=256, mask_bg='rand', resolution=224, transform=ColorJitter) + 100_000 @ ScanNetpp(split='train', aug_crop=256, resolution=224, transform=ColorJitter) + 100_000 @ Waymo(aug_crop=128, resolution=224, transform=ColorJitter) " \
    --test_dataset=" Habitat512(1_000, split='val', resolution=224, seed=777) + 1_000 @ BlendedMVS(split='val', resolution=224, seed=777) + 1_000 @ MegaDepthDense(split='val', resolution=224, seed=777) + 1_000 @ Co3d_v3(split='test', mask_bg='rand', resolution=224, seed=777) " \
    --train_criterion="ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
    --test_criterion="Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
    --model="AsymmetricCroCo3DStereo(pos_embed='RoPE100', img_size=(224, 224), head_type='linear', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
    --pretrained="checkpoints/CroCo_V2_ViTLarge_BaseDecoder.pth" \
    --lr=0.0001 --min_lr=1e-06 --warmup_epochs=10 --epochs=100 --batch_size=16 --accum_iter=1 \
    --save_freq=5 --keep_freq=10 --eval_freq=1 \
    --output_dir="checkpoints/dust3r_224"

# 512 linear
torchrun --nproc_per_node 8 train.py \
    --train_dataset=" + 10_000 @ Habitat512(1_000_000, split='train', aug_crop=16, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ BlendedMVS(split='train', aug_crop=16, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ MegaDepthDense(split='train', aug_crop=16, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ ARKitScenes(aug_crop=256, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ Co3d_v3(split='train', aug_crop=16, mask_bg='rand', resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ StaticThings3D(aug_crop=256, mask_bg='rand', resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ ScanNetpp(split='train', aug_crop=256, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ Waymo(aug_crop=128, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) " \
    --test_dataset=" Habitat512(1_000, split='val', resolution=(512,384), seed=777) + 1_000 @ BlendedMVS(split='val', resolution=(512,384), seed=777) + 1_000 @ MegaDepthDense(split='val', resolution=(512,336), seed=777) + 1_000 @ Co3d_v3(split='test', resolution=(512,384), seed=777) " \
    --train_criterion="ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
    --test_criterion="Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
    --model="AsymmetricCroCo3DStereo(pos_embed='RoPE100', patch_embed_cls='ManyAR_PatchEmbed', img_size=(512, 512), head_type='linear', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
    --pretrained="checkpoints/dust3r_224/checkpoint-best.pth" \
    --lr=0.0001 --min_lr=1e-06 --warmup_epochs=20 --epochs=200 --batch_size=4 --accum_iter=2 \
    --save_freq=10 --keep_freq=10 --eval_freq=1 --print_freq=10 \
    --output_dir="checkpoints/dust3r_512"

# 512 dpt
torchrun --nproc_per_node 8 train.py \
    --train_dataset=" + 10_000 @ Habitat512(1_000_000, split='train', aug_crop=16, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ BlendedMVS(split='train', aug_crop=16, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ MegaDepthDense(split='train', aug_crop=16, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ ARKitScenes(aug_crop=256, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ Co3d_v3(split='train', aug_crop=16, mask_bg='rand', resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ StaticThings3D(aug_crop=256, mask_bg='rand', resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ ScanNetpp(split='train', aug_crop=256, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) + 10_000 @ Waymo(aug_crop=128, resolution=[(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)], transform=ColorJitter) " \
    --test_dataset=" Habitat512(1_000, split='val', resolution=(512,384), seed=777) + 1_000 @ BlendedMVS(split='val', resolution=(512,384), seed=777) + 1_000 @ MegaDepthDense(split='val', resolution=(512,336), seed=777) + 1_000 @ Co3d_v3(split='test', resolution=(512,384), seed=777) " \
    --train_criterion="ConfLoss(Regr3D(L21, norm_mode='avg_dis'), alpha=0.2)" \
    --test_criterion="Regr3D_ScaleShiftInv(L21, gt_scale=True)" \
    --model="AsymmetricCroCo3DStereo(pos_embed='RoPE100', patch_embed_cls='ManyAR_PatchEmbed', img_size=(512, 512), head_type='dpt', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12)" \
    --pretrained="checkpoints/dust3r_512/checkpoint-best.pth" \
    --lr=0.0001 --min_lr=1e-06 --warmup_epochs=15 --epochs=90 --batch_size=2 --accum_iter=4 \
    --save_freq=5 --keep_freq=10 --eval_freq=1 --print_freq=10 \
    --output_dir="checkpoints/dust3r_512dpt"

dust3r's Issues

What if a partial set of poses is known?

Hey Naver,

First of all great work, it is very interesting to play around with!

I'm curious: if one knows a partial set of poses and focal lengths beforehand, how should one initialize the pose graph?

Best regards
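
For illustration, a hedged sketch of the kind of API this would need; the preset_pose / preset_focal calls and their signatures are assumptions to be checked against dust3r/cloud_opt:

import torch
from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

# `output` is the dict returned by inference(), as in the Usage section
scene = global_aligner(output, device='cuda', mode=GlobalAlignerMode.PointCloudOptimizer)

# hypothetical: pin the cameras whose parameters are known before optimizing
# (method names and signatures are assumptions, not a confirmed API)
scene.preset_pose([torch.eye(4)], [0])   # cam-to-world pose of camera 0
scene.preset_focal([600.0], [0])         # known focal (in pixels) of camera 0

loss = scene.compute_global_alignment(init="mst", niter=300, schedule='cosine', lr=0.01)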

Pretrained Croco

Hi,

Thanks a lot for the amazing work. Did you use the pretrained Croco model when training Dust3r? If so, could you please point out where you load the model in the training code?

Thanks in advance!

CUDA OOM

Hi,

The performance is really amazing on the few image pairs I have tried.
However, when I moved to a bigger scene (29 images), it crashes with CUDA OOM on a 16 GB V100.
Any recommendations on how I can run it?

  File "/home/old-ufo/dev/dust3r/dust3r/cloud_opt/optimizer.py", line 176, in forward
    aligned_pred_i = geotrf(pw_poses, pw_adapt * self._stacked_pred_i)
  File "/home/old-ufo/dev/dust3r/dust3r/utils/geometry.py", line 86, in geotrf
    pts = pts @ Trf[..., :-1, :] + Trf[..., -1:, :]

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.38 GiB. GPU 0 has a total capacity of 15.77 GiB of which 775.88 MiB is free. Including non-PyTorch memory, this process has 15.01 GiB memory in use. Of the allocated memory 13.70 GiB is allocated by PyTorch, and 922.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

A small suggestion to author

Assuming we have 1296 images (36x36) around the object in a 360x360 setup, or perhaps slightly fewer, say 100 images.
The current algorithm generates pairs for all possible combinations, resulting in a substantial number of pairs (100*99).
Without a proper sampling policy, this can be challenging to handle.
A small recommendation would be to initiate and maintain a pair-loss table from scratch, gradually increasing the number of pairs and sampling based on the convergence of loss.
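
A simpler mitigation (not the loss-driven scheme suggested above) is to pair each image only with a few neighbours instead of building the complete graph; a minimal, self-contained sketch:

def window_pairs(n_images, window=3):
    # pair each image only with its `window` nearest neighbours in capture order,
    # giving O(n * window) pairs instead of n * (n - 1)
    return [(i, j)
            for i in range(n_images)
            for j in range(i + 1, min(i + 1 + window, n_images))]

print(len(window_pairs(100)))  # 294 pairs instead of 9900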

Manually adjustable camera positions after prediction

Manually adjustable camera positions after prediction within the GUI software.

Have you considered adding this? It would be a great way to correct any cameras that the algorithm may not have predicted correctly.

Love this btw. Great work so far.

Model deployment issues

Thanks for the engineering code. I want to deploy the model to embedded devices. Is this feasible, and do you know which devices support this kind of model? Looking forward to your reply.

HEIC images are ignored

Tried to demo it yesterday to a few novice users using share=True in launch.
Encountered a few problems along the way.

It seems to ignore heic files (which is often the default on mobile devices).

The problematic line, where you need to add ".heic":

if not path.endswith(('.jpg', '.jpeg', '.png', '.JPG')):

You also need to add a pillow-heif dependency:
pip install pillow-heif

And add the following lines somewhere in image.py:

from pillow_heif import register_heif_opener
register_heif_opener()
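
Putting the pieces together, a hedged sketch of the suggested patch (the exact file and check may differ in the current code):

# in dust3r/utils/image.py (sketch)
from pillow_heif import register_heif_opener
register_heif_opener()  # lets PIL open .heic files

SUPPORTED_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.JPG', '.heic', '.HEIC')

def is_supported_image(path):
    # hypothetical helper mirroring the endswith() check quoted above
    return path.endswith(SUPPORTED_EXTENSIONS)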

Other problems encountered on mobile devices (but probably due to gradio):

  • The 3d viewer can't easily translate the scene (no right click + drag gesture available).
  • Possibility to add one more image without re-uploading everything.
  • ImagePicker instead of FilePicker

Image Mask

I would like to ask if it is possible to add a mask to the images currently being fed to the network.
If so, how should I add it?

Core dump

image

This error comes from the Python code in the README file.

Dataset details

First of all: very cool work!

I have two questions regarding reproducing pairs from the datasets for training.

Habitat

Are the scenes the pairs are generated from the same as in CroCo Habitat README?
Specifically, do the 1M pairs relate in some way to the ~1.8M pairs used in Croco?

Real datasets

For CroCoV2 you provide metadata to re-generate the crops for ARKitScenes and MegaDepth.
Specifically the CroCoV2 paper mentions,

1,070,414 pairs from ARKitScenes [8], 2,014,789 pairs from MegaDepth

Do these relate to the pairs you obtained for dust3r training in some way?

If there is some metadata similar to the one for CroCo to generate these pairs that would be greatly appreciated!

Thank you!

Allow to export scene data as JSON for general use outside of Python

First, congrats to the developers. A powerful tool; I found immediate use for it in my projects.

Because most of our tools are written in MATLAB/Octave, I found that the generated .glb file is difficult to convert.

I just added a little bit of code to allow the demo.py GUI to export data to a JSON/binary JSON construct (with the JMesh mesh data annotation), which can potentially be parsed/shared in other environments (like JavaScript/Node/C++/MATLAB). I also added a drop-down menu in the demo to let the user choose the output format.

Here are my commits

NeuroJSON@a13b18c
NeuroJSON@935ace7

To export to JSON, only one extra dependency, jdata (16 kB), is needed. To export to a binary JSON format (for smaller file sizes), another small package, bjdata (65 kB), is also required.

loading the data in MATLAB/Octave

>> dat=loadjd('/tmp/scene.jmsh');
>> dat
dat = 
  struct with fields:

       images: [1×10 struct]
      cameras: [1×10 struct]
       meshes: [1×10 struct]
    transform: [4×4 double]

>> dat.meshes(1)
ans = 
  struct with fields:

    MeshVertex3: [196608×3 single]
       MeshTri3: [1×1 struct]

loading the data back to Python

import jdata as jd
dat=jd.load('/tmp/scene.jmsh');
>>> dat.keys()
dict_keys(['images', 'cameras', 'meshes', 'transform'])
>>> dat['meshes'][0].keys()
dict_keys(['MeshVertex3', 'MeshTri3'])
>>> dat['meshes'][0]['MeshTri3'].keys()
dict_keys(['Data', 'Properties'])
>>> dat['meshes'][0]['MeshTri3']['Data'].shape
(189024, 3)
>>> dat['meshes'][0]['MeshTri3']['Data'].dtype
dtype('int64')

In my test, the scene generated from 10 images took 43 MB in glb, 44 MB in binary JSON (.bmsh), and 59 MB in JSON (due to base64). There was no noticeable difference in loading/saving speed. For both the JSON and binary JSON files, changing the compressor to 'lzma' could lead to much smaller file sizes.

Just wanted to share this in case others have similar needs. I am also happy to create a PR if the developers are interested in adding this feature.

image

Wrong Intrinsics

There seems to be an issue with the camera intrinsics: some focals are 0.

Can I fix the camera pose? Or provide some restrictive priors?

A very good job! I have a fixed scene setup: 4 cameras fixed to shoot an object. But I find that each time, the output cameras are not in the same positions. Can you give some prior conditions, or modify the code to fix the camera poses?

FYI: Missing dependencies required to run main.py

Hi, and thank you for sharing this amazing work.

Just a heads up:

I had to manually install some missing deps to make main.py run. I followed the readme and installed with conda, skipping both the optional step 3 and the installation of the new requirements_optional.txt.

einops (conda)
tqdm (via pip)
scipy (conda)
opencv-python (pip)
trimesh (pip)

pip install "pyglet<2" # use version <2

Cheers

Could I get the real depth from your 2D-2D matching?

Hi, thank you for your amazing job!
Could I get the real depth from your 2D-2D matching?
If I input 2 images from a stereo camera (like in self-driving), then according to 'real_depth = (baseline × focal) / disparity', does that mean the answer to the title question is yes?
And if I input 5 images, for some reason I want to see all matchings between each image pair (for example, img2 and img5). Could you tell me how to modify your code to achieve that? (I modified the code from the "Usage" section of the README; however, the matchings are the same for all image pairs...)
Thank you!
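
For reference, a quick numeric check of the quoted formula (all values below are made up, not from DUSt3R):

baseline_m = 0.54     # distance between the two cameras, in metres (made up)
focal_px = 720.0      # focal length in pixels (made up)
disparity_px = 40.0   # disparity of a matched point in pixels (made up)

real_depth_m = baseline_m * focal_px / disparity_px
print(real_depth_m)   # 9.72 m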

Scale Invariant Depth, Metric Depth, and Surface Normals

As I understand from #44 (comment), y'all are in the process of training a metric version of dust3r, so the current version outputs scale-invariant depth. Was the depth normalized before training? I see here

def rescale_image_depthmap(image, depthmap, camera_intrinsics, output_resolution):
that the resolution is being changed. But I also notice here
with PIL.Image.open(depth_path) as depth_pil:
that it seems the metric depth map is being loaded?

Is there any normalization being performed on the depth/pointmap to make it scale invariant?

Also, I've been following MetricV2; have you all looked at including surface normals as supervision, based on a pointmap -> depthmap -> surface normal conversion, such that the network can also produce surface normals?

training costs

Thanks for your great work! It's amazing! It can be applied to many real-world scenarios. I wonder if this is a paper submitted to CVPR 2024? The format of the paper looks like it is.

Besides, could you please tell me how many GPUs it takes to train this model? Thanks very much!

Memory errors

Hi, thank you for your great work!
I'm currently working on testing this on larger datasets (5-10k images) and notice that a very large amount of (V)RAM would be required to make it work. I've already generated a pairs file to reduce the number of pairs from 50M to 3M, but this still seems to be way too large. Do you have any pointers/suggestions I could try out to make it scale better?

I'm using cloud compute with 80GB VRAM and 220GB RAM so that shouldn't be an issue btw.

Try this to increase resolution w/o finetuning (Instruction)

Using the default setup, large input images were resized to 512 x 384 (using DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth), but I wanted results at a higher resolution (1024 x 768). So, following "Extending Context Window of Large Language Models via Position Interpolation" by Meta, I changed only the default image_size value from 512 to 1024 inside demo.py, and multiplied the variable t inside the get_cos_sin method of RoPE2D in croco/models/pos_embed.py by (512/1024). This gave pretty good results, though finetuning is most likely required for better results.
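
A hedged, self-contained sketch of the position-interpolation idea described above (this is not the actual croco/models/pos_embed.py code; in the repo the change amounts to scaling t by 512/1024 inside RoPE2D.get_cos_sin):

import torch

TRAIN_RES, TARGET_RES = 512, 1024   # resolution seen at training time vs. at inference

def interpolated_rope_angles(seq_len, dim, base=100.0):
    # standard RoPE frequencies
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # rescale positions so the larger input reuses the positional range
    # the model was trained on (position interpolation)
    t = torch.arange(seq_len).float() * (TRAIN_RES / TARGET_RES)
    freqs = torch.outer(t, inv_freq)
    return freqs.cos(), freqs.sin()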

something wrong in gradio import

(dust3r) H:\qing\AIproect\dust3r>python demo.py --weights checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth
Traceback (most recent call last):
File "H:\qing\AIproect\dust3r\demo.py", line 9, in
import gradio
File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio_init_.py", line 3, in
import gradio.simple_templates
File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio_simple_templates_init
.py", line 1, in
from .simpledropdown import SimpleDropdown
File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio_simple_templates\simpledropdown.py", line 6, in
from gradio.components.base import FormComponent
File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio\components_init_.py", line 40, in
from gradio.components.multimodal_textbox import MultimodalTextbox
File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio\components\multimodal_textbox.py", line 28, in
class MultimodalTextbox(FormComponent):
File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio\component_meta.py", line 198, in new
create_or_modify_pyi(component_class, name, events)
File "D:\py\anaconda3\envs\dust3r\Lib\site-packages\gradio\component_meta.py", line 92, in create_or_modify_pyi
source_code = source_file.read_text()
^^^^^^^^^^^^^^^^^^^^^^^
File "D:\py\anaconda3\envs\dust3r\Lib\pathlib.py", line 1059, in read_text
return f.read()
^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0xb2 in position 1972: illegal multibyte sequence

Unexpected `force` option for `print()`

print() is called with force=True (at losses.py, L222).
However, the built-in print() function can't accept this parameter:

>>> print("something", force=True)
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm 2022.3.2\plugins\python\helpers\pydev\pydevconsole.py", line 364, in runcode
    coro = func()
           ^^^^^^
  File "<input>", line 1, in <module>
TypeError: 'force' is an invalid keyword argument for print()

I believe the author intended that parameter to flush the stdout stream, so it should have been flush=True:

>>> print("something", flush=True)
something

Options to do SLAM for video? Get poses and camera intrinsics?

Hi there, congrats on the fantastic work! These are amazing results.

I'm working on 3D mapping systems for robotics, and was wondering:

Given a video, can this method help with obtaining the camera parameters, and poses for each frame?

Do you guys have any scripts already for this? I see that in the example usage you have:

    # retrieve useful values from scene:
    imgs = scene.imgs
    focals = scene.get_focals()
    poses = scene.get_im_poses()

And you can do scene.get_intrinsics() which is great, but when I run this on 12 images from the replica dataset, scene.get_intrinsics() outputs 12 different intrinsic matrices, none of which really match the original camera intrinsics of the replica dataset.

Am I doing something wrong? Should I specify the scale or resolution or something else about the images at some point? The replica images are 1200x600 (w,h) but they get resized to 512 I'm assuming.

Just wondering how I should go about getting the camera parameters for a monocular rgb video, or if that's not really possible to do super accurately yet with this method.

For extra detail, I'm using the following frames from the replica dataset

    image_filenames = [
        'frame000000.jpg', 'frame000023.jpg', 'frame000190.jpg', 'frame000502.jpg',
        'frame000606.jpg', 'frame000988.jpg', 'frame001181.jpg', 'frame001374.jpg',
        'frame001587.jpg', 'frame001786.jpg', 'frame001845.jpg', 'frame001928.jpg'
    ]
...
    images = load_images(images_path_list, size=512)
    pairs = make_pairs(images, scene_graph='complete', prefilter=None, symmetrize=True)
...
    scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
    loss = scene.compute_global_alignment(init="mst", niter=niter, schedule=schedule, lr=lr)
...

and the output of the scene.get_intrinsics() is as follows, I'm only showing two of the matrices here, not all 12:

print(scene.get_intrinsics())
tensor([[[250.8425,   0.0000, 256.0000],
         [  0.0000, 250.8425, 144.0000],
         [  0.0000,   0.0000,   1.0000]],
...
        [[250.7383,   0.0000, 256.0000],
         [  0.0000, 250.7383, 144.0000],
         [  0.0000,   0.0000,   1.0000]]], device='cuda:0',
       grad_fn=<CopySlices>)

compared to the ground truth camera params of the replica dataset from the camera_params.json

K_given = [
    [600.0, 0, 599.5],
    [0, 600.0, 339.5],
    [0, 0, 1]
]

here is the actual camera_params.json file in case it helps

{
    "camera": {
        "w": 1200,
        "h": 680,
        "fx": 600.0,
        "fy": 600.0,
        "cx": 599.5,
        "cy": 339.5,
        "scale": 6553.5
    }
}
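
As a rough sanity check (ignoring whatever cropping the loader applies), the predicted focal is expressed in resized pixels, so rescaling it by the original-to-resized width ratio brings it close to the ground truth:

import numpy as np

K_pred = np.array([[250.84, 0.0, 256.0],
                   [0.0, 250.84, 144.0],
                   [0.0, 0.0, 1.0]])   # one of the matrices printed above

scale = 1200 / 512                     # original width / resized width
print(K_pred[0, 0] * scale)            # ~588 px vs. the 600 px ground truth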

Also, just curious, how would I go about running this on long videos? Or is that not possible yet?

My apologies if these are too many questions! This method is really awesome, and I'm having a lot of fun using it. Thanks again for the wonderful work!

Dear authors: a confusion about network architecture.

Dear authors, if there are no inherent constraints on the input images (such as B always being to the left of A, or even stricter constraints), what is the reason for TransformerDecoder_1/2 and Header_1/2 being required to have different weights, instead of sharing weights for later information sharing? To my limited understanding, after 'perfect' or 'sufficient' training, Decoder_1/2 and Header_1/2 should be nearly identical. In this case, what is the significance of not sharing weights?

In short:
If there are no inherent differences, then after sufficient training, decoder/header 1 and 2 should be nearly identical.
Consider this:
Input a single image I to this two-path network; it outputs different point-maps and camera poses.
Is that reasonable? Is this what we wanted?

A thought experiment:
If we swap the input image pair A and B, maybe we will get different output performance, worse or better.
Right?

image

Some f-string error

Hi, I get an f-string error, like:

dust3r/heads/__init__.py line:19
raise NotImplementedError(f"unexpected {head_type=} and {output_mode=}")

 File "<fstring>", line 1
    (head_type=)
              ^
SyntaxError: invalid syntax

Am I using the wrong approach?

Running on Apple Mac M2

Good job guys! Very impressive results.

I confirm that it works on Apple Mac (with Apple Silicon); I tried with more than 8 images with no error.
3 images, 6 image pairs: runs in 6 sec + 15 sec.
8 images, 56 image pairs: runs in 70 sec + 90 sec.

PYTORCH_ENABLE_MPS_FALLBACK=1 python3 demo.py --weights checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth --device mps
image

Do you plan to release some samples with a larger number of images?

In the demo training code, there is a typo in step 1.

There is a typo in step 1 of the demo training.
The path next to --pretrained is missing a quotation mark, so training does not work when the command is executed.
Adding the quotation marks around the path was confirmed to work.

Thanks 😀

Mesh view is washed out in visualizer, texture not embedded

Fantastic work guys! Quality and performance are impressive, though the visualizer seems to apply some transparency in mesh mode compared to point cloud mode.

Here is a comparison of the two using a single image as source:

image
image

Also, when saving the mesh, there is no texture embedded, and there is no color information on the point cloud when I import it into Blender.

image

PS : I installed dust3r using Pinokio.

Getting torch.cuda.OutOfMemoryError using more than 16 images

Firstly, congrats to all the folks at Naver for their awesome accomplishments with Dust3r. It was very straightforward getting dust3r up and running, but I discovered I run into torch.cuda.OutOfMemoryError(s) when I try to process more than 16 images at once. I am running an RTX 3060 12GB, and was wondering if anyone may know what I can do to resolve or debug an issue like this. I am running dust3r via docker-compose with PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128. Here is the full error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.11 GiB (GPU 0; 11.76 GiB total capacity; 9.84 GiB already allocated; 932.69 MiB free; 10.32 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any help or insight would be much appreciated. Again, thanks to the folks at Naver for their awesome work and releasing it!

Getting the point clouds out at the right scale

Would it be possible to use sparse or dense matchers to get the scale out? Getting all the matched points, then using the known intrinsics to project them into 3D, and then comparing the same points to the pointmap in dust3r?

Waymo dataset processing

Hi authors, thanks for sharing the amazing work. I was wondering how you use the Waymo dataset for training, as the most accurate depth comes from lidar, which is sparse.

Scannet Data Prerocess

Could you please release the code for preprocessing ScanNet data for training and inference, so we can better understand the whole dust3r pipeline? Thanks.

Problems with camera pose application

Thank you for your excellent work. But when I try to use cam_pose, some problems arise. Specifically, for the four pictures, the following output can be obtained from the demo,
image
and you can see that there are good camera poses.
But when I apply the pose parameters directly to the point cloud locally, I get the following results.
image
There is an unreasonable gap between each perspective.
I want to know if the way I get the poses is wrong (poses = scene.get_im_poses())? Or do the point cloud results displayed on the web page not completely correspond to the poses obtained from the model? Looking forward to your reply.

No module named 'models.dpt_block'

(venv) D:\dust3r>python demo.py --weights checkpoints/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth
Traceback (most recent call last):
  File "D:\dust3r\demo.py", line 19, in <module>
    from dust3r.inference import inference, load_model
  File "D:\dust3r\dust3r\inference.py", line 10, in <module>
    from dust3r.model import AsymmetricCroCo3DStereo, inf  # noqa: F401, needed when loading the model
  File "D:\dust3r\dust3r\model.py", line 11, in <module>
    from .heads import head_factory
  File "D:\dust3r\dust3r\heads\__init__.py", line 8, in <module>
    from .dpt_head import create_dpt_head
  File "D:\dust3r\dust3r\heads\dpt_head.py", line 17, in <module>
    from models.dpt_block import DPTOutputAdapter  # noqa
ModuleNotFoundError: No module named 'models.dpt_block'

(venv) D:\dust3r\checkpoints>dir
 Volume in drive D is Data
 Volume Serial Number is 3C50-8BA1

 Directory of D:\dust3r\checkpoints

2024/03/04  21:16    <DIR>          .
2024/03/04  21:16    <DIR>          ..
2024/03/04  20:35     2,129,660,080 DUSt3R_ViTLarge_BaseDecoder_224_linear.pth
2024/03/04  20:40     2,285,019,929 DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth
2024/03/04  20:41     2,129,656,556 DUSt3R_ViTLarge_BaseDecoder_512_linear.pth
