
emernerf's Introduction

EmerNeRF

PyTorch implementation of:

EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision,
Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, Yue Wang

EmerNeRF overview

News

  • [2023/11/05] We've released an initial version of EmerNeRF, which supports the NeRF On-The-Road (NOTR) dataset sourced from the Waymo Open Dataset. NuScenes support is also available.


Introduction

We introduce EmerNeRF, a self-supervised approach that utilizes neural fields for spatial-temporal decomposition, motion estimation, and the lifting of foundation features. EmerNeRF can decompose a scene into dynamic objects and a static background and estimate their motion in a self-supervised way. Enriched with lifted and "denoised" 2D features in 4D space-time, EmerNeRF unveils new potentials for scene understanding. Additionally, we release the NeRF On-The-Road (NOTR) dataset split to support future research.

Installation

Our code is developed on Ubuntu 22.04 using Python 3.9 and PyTorch 2.0. Please note that the code has only been tested with these specified versions. We recommend using conda for the installation of dependencies. The installation process might take more than 30 minutes.

  1. Create the emernerf conda environment and install all dependencies:
conda create -n emernerf python=3.9 -y
conda activate emernerf
# this will take a while: more than 10 minutes
pip install -r requirements.txt
  2. Install nerfacc and tiny-cuda-nn manually:
pip install git+https://github.com/nerfstudio-project/nerfacc.git@8340e19daad4bafe24125150a8c56161838086fa
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
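
As an optional sanity check (assuming both packages expose their default Python modules, tinycudann and nerfacc), you can verify that the CUDA extensions import correctly:

# Optional: confirm the CUDA extensions import and that a GPU is visible
python -c "import tinycudann, nerfacc, torch; print('CUDA available:', torch.cuda.is_available())"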

Troubleshooting:

nvcc fatal : Unsupported gpu architecture 'compute_89'

If you encounter the error nvcc fatal : Unsupported gpu architecture 'compute_89', try the following command:

TCNN_CUDA_ARCHITECTURES=86 pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
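
The value of TCNN_CUDA_ARCHITECTURES should match your GPU's compute capability with the dot removed (e.g., 8.6 becomes 86). If you are unsure of your GPU's compute capability, one way to check it is:

# Prints a (major, minor) tuple, e.g. (8, 6) for an RTX 30-series GPU
python -c "import torch; print(torch.cuda.get_device_capability())"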
error: parameter packs not expanded with ‘...’

If you encounter this error:

error: parameter packs not expanded with ‘...’

Refer to this solution on GitHub.

Dataset preparation

If you want to set up a custom dataset, please use the two supported datasets (Waymo/NOTR and NuScenes) as templates. Also take a look at the datasets/base directory to familiarize yourself with the dataset preparation process in our codebase.
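
As a rough, hedged sketch of what training on a custom dataset might look like once you have implemented a loader analogous to the Waymo/NuScenes ones: the data.data_root and data.dataset override keys below follow the custom Carla config shown in the issues section further down, and my_dataset / $MY_DATA_ROOT are placeholder names for your own loader and data path.

# Hypothetical example; "my_dataset" must correspond to a loader you have added under datasets/
python train_emernerf.py \
    --config_file configs/default_config.yaml \
    --output_root $output_root \
    --project my_custom_project \
    --run_name my_scene_0 \
    data.dataset=my_dataset \
    data.data_root=$MY_DATA_ROOT \
    data.scene_idx=0 \
    data.start_timestep=0 \
    data.end_timestep=-1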

Run the code

Configs

We have provided detailed comments in configs/default_config.yaml for each configuration option; these comments ship alongside the released code.
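
Options can either be edited directly in the YAML or overridden on the command line with the dot notation used throughout the training commands below, for example:

# Override individual config values without editing the YAML
python train_emernerf.py \
    --config_file configs/default_config.yaml \
    --output_root $output_root \
    --project $project \
    --run_name example_run \
    data.scene_idx=23 \
    optim.num_iters=25000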

Training

Sample training scripts can be found in sample_scripts/.

  1. Data inspection. Before initiating the training, you might want to inspect the data. We've included a script for visualizing the data. To visualize the NOTR dataset, execute the following:
# Adjust hyper-parameters as needed.
# --render_data_video_only renders a video of the data instead of training.
# load_size=[160,240] downsamples the images to enhance the visibility of LiDAR points.
# num_cams can be 1, 3, or 5.
python train_emernerf.py \
    --config_file configs/default_config.yaml \
    --output_root $output_root \
    --project $project \
    --run_name ${scene_idx} \
    --render_data_video_only \
    data.scene_idx=$scene_idx \
    data.pixel_source.load_size=[160,240] \
    data.pixel_source.num_cams=3 \
    data.start_timestep=0 \
    data.end_timestep=-1

This script produces a video similar to the one below, showing the LiDAR points colored by their range values, the 3D scene flows, and their feature maps (if load_features=True):

  2. Training. For the most comprehensive EmerNeRF training (incorporating dynamic encoder, flow encoder, and feature lifting), use:
  python train_emernerf.py \
    --config_file configs/default_flow.yaml \
    --output_root $output_root \
    --project $project \
    --run_name ${scene_idx}_flow \
    data.scene_idx=$scene_idx \
    data.start_timestep=$start_timestep \
    data.end_timestep=$end_timestep \
    data.pixel_source.load_features=True \
    data.pixel_source.feature_model_type=dinov2_vitb14 \
    nerf.model.head.enable_feature_head=True \
    nerf.model.head.enable_learnable_pe=True \
    logging.saveckpt_freq=$num_iters \
    optim.num_iters=$num_iters

For more examples, refer to the sample_scripts/ folder.
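
If you do not need feature lifting, a lighter run can skip the DINOv2 features; the sketch below simply drops the feature-related overrides from the command above (whether the dynamic and flow branches are enabled then depends on the defaults in the config file you pass):

# Sketch: same entry point, without DINOv2 feature lifting
python train_emernerf.py \
    --config_file configs/default_config.yaml \
    --output_root $output_root \
    --project $project \
    --run_name ${scene_idx}_no_feats \
    data.scene_idx=$scene_idx \
    data.start_timestep=$start_timestep \
    data.end_timestep=$end_timestep \
    data.pixel_source.load_features=False \
    logging.saveckpt_freq=$num_iters \
    optim.num_iters=$num_iters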

  3. Voxel Inspection. We've provided visualization code to display spatial-temporal features as shown on our homepage. To visualize voxel features, simply add --visualize_voxel and specify resume_from=$YOUR_PRETRAINED_MODEL. This will produce an HTML file that you can open in a browser for voxel feature visualization; a sketch of the full command follows the example below:

Example Voxel Visualization
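
A sketch of the full invocation, assuming you keep the same data and model overrides you trained with and point resume_from at the checkpoint saved above:

# Sketch: reuse the training command, add --visualize_voxel, and resume from a checkpoint
python train_emernerf.py \
    --config_file configs/default_flow.yaml \
    --output_root $output_root \
    --project $project \
    --run_name ${scene_idx}_flow \
    --visualize_voxel \
    resume_from=$YOUR_PRETRAINED_MODEL \
    data.scene_idx=$scene_idx \
    data.pixel_source.load_features=True \
    data.pixel_source.feature_model_type=dinov2_vitb14 \
    nerf.model.head.enable_feature_head=True \
    nerf.model.head.enable_learnable_pe=True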

Pretrained models

  • We will release all pretrained models soon. Do note that the distribution of pretrained models will be in accordance with Waymo's data sharing policy. Hence, we will only release pretrained models for registered users upon request. More details will be provided soon.

Citation

Please consider citing our paper if you find this code or our paper useful for your research:

@article{yang2023emernerf,
    title={EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision},
    author={Jiawei Yang and Boris Ivanovic and Or Litany and Xinshuo Weng and Seung Wook Kim and Boyi Li and Tong Che and Danfei Xu and Sanja Fidler and Marco Pavone and Yue Wang},
    journal={arXiv preprint arXiv:2311.02077},
    year={2023}
}

emernerf's People

Contributors

borisivanovic, jiawei-yang


emernerf's Issues

Training results are blurry with poor depth maps using custom dataset with transformed camera intrinsics and extrinsics

Hello,

I have a question regarding the use of a custom dataset. I've transformed the camera intrinsics and extrinsics as instructed. Additionally, I've computed the vehicle pose using Euler angles. The dataset is initialized with the first frame's pose as the origin of the world coordinate frame.

During training, I only utilize 2D images, camera intrinsics and extrinsics, and the vehicle pose. However, I'm encountering an issue where the trained model produces blurry results, and the depth maps are of poor quality. I've already adjusted the ORIGINAL_SIZE in the datasets/waymo.py file, and the custom dataset follows the same coordinate system as Waymo.

I'm trying to understand why the training results are blurry and why the depth maps are not satisfactory. Any insights or suggestions on how to improve the clarity and quality of the depth maps would be greatly appreciated.

Thank you!
(Attached renders: depth (step_24000_depths), GT RGB (step_24000_gt_rgbs), inference RGB (step_24000_rgbs).)

Rendering question

Hi, is there any script or function specifically for rendering? Or should I just use the eval function in train_emernerf.py to render images? Thanks!

Edit the dynamic objects

Hi, thanks for your great work. I wonder whether the objects in the scene can be edited, such as changing their position or inserting a car. If not, do you have any plans for this? Thanks!

Inquiry About Training on NVIDIA 3090 - Compatibility and Support

Hi Jiawei,

First and foremost, I'd like to express my admiration for the incredible work you've done with EmerNeRF.

I am reaching out with a query regarding training the model on an NVIDIA 3090 GPU. While attempting to train the model, I encountered several issues, particularly when all features were enabled. The system suggests that training requires an A100 GPU, which I currently do not have access to. Below are the specific error messages received:

- `cutlassF` is not supported because:
  - xFormers wasn't built with CUDA support
  - Operator wasn't built - see `python -m xformers.info` for more info

- `flshattF` is not supported because:
  - xFormers wasn't built with CUDA support
  - dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
  - Operator wasn't built - see `python -m xformers.info` for more info

- `tritonflashattF` is not supported because:
  - xFormers wasn't built with CUDA support
  - dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
  - Requires A100 GPU

- `smallkF` is not supported because:
  - xFormers wasn't built with CUDA support
  - max(query.shape[-1] != value.shape[-1]) > 32
  - Operator wasn't built - see `python -m xformers.info` for more info
  - Unsupported embed per head: 64

I am wondering if it's possible to train this model on a single NVIDIA 3090 GPU and if so, what steps I should take to resolve these issues. I am also particularly interested in any pre-trained models. Do you have a plan to release them?

Any guidance or suggestions would be immensely appreciated.

Thank you for your time and for the groundbreaking work you are doing.

Best regards,
Ruihan

How to optimize EmerNeRF

(Attached image: frame_019)

Thanks for your wonderful work!

I'm doing a project related to AD simulation. I ran EmerNeRF on scenes in which the ego car moves at medium or high speed, like Waymo scene id 2, but the decomposition result is not good.

Following the approach of DynIBAR, I tried to improve EmerNeRF by adding 2D optical flow supervision to the training process, merging the flow loss into the total pixel loss. However, the flow loss did not converge during training, and the overall training results became worse.

So the questions are:

  1. How can I optimize EmerNeRF, for example, to improve its decomposition ability?
  2. If I use additional 2D optical flow supervision, besides merging the flow loss into the total_pixel_loss, what else should I do? Will this approach improve the decomposition results?

Looking forward to your advice. Thank you once again!

For training-validation question

Thanks for your wonderful work!

Could you please take a little time to answer my question about training and validation? Looking forward to hearing from you!

To the best of my knowledge, NeRF typically requires retraining for new data in order to encode the specific characteristics of that data into the network parameters. That is why there is a parameter in the code called scene_idx, which specifies which scene to use.

May I ask whether my understanding above is correct?

However, what confuses me is that the code comment says "# which scene to use, [0, 798] for waymo's training set and [0, 849] for nuscenes's train/val sets, inclusive". It implies that the nuScenes validation set can also be used; does this mean that your model can generalize to new data?

Regarding the mismatch between saved checkpoint state_dict and newly loaded model for different scenes

I'm using the EmerNeRF architecture as a backbone for my maneuver detection project, using the decomposed fields as input features. I trained it for 1000 epochs on nuScenes scene 0, but when I ran it in eval mode on a different scene, the state dict of the saved checkpoint and the new model (which is supposed to adjust to the new scene) had different sizes.
Do we need to train the model for each scene separately? And what about inference on custom data using just 6 cameras?

Changing perspectives

Thank you for your work. May I ask what I should do if I want to change the viewing perspective after training on the nuScenes dataset I am using?

How much does this work depend on LiDAR data?

Hello!

I would like to thank you for this very interesting work; I'm reading the paper and code, and they contain very interesting ideas. I saw that there is an option to avoid loading LiDAR data during training on the Waymo dataset. I was able to do so (in flow mode), but the results for the scene were poor, both for detecting dynamic objects and for rendering.

My question is simple: does this method depend heavily on a multi-sensor configuration, or can it be visual-only? Can the LiDAR data be replaced with ground-truth depth from RGB-D cameras, for example?

Could you please guide me on how to get the best out of this work using visual-only data?

Thanks

Inquiry about Minimum GPU Requirements for Training

Dear author,

I hope this message finds you well. First and foremost, I would like to express my appreciation for the work you've done on EmerNeRF. It's a fantastic project, and I'm eager to explore it further.

I am currently interested in training models using your project and I was wondering if you could provide some guidance regarding the minimum GPU requirements for the training process. Specifically, I would like to know:

1. The minimum number of GPUs recommended for efficient training.
2. Any specific GPU models or specifications that you have found to work well during your development.

Understanding these details will greatly assist me in planning the hardware resources for my own experiments. I understand that hardware requirements can vary based on the dataset and model complexity, but having a general idea would be incredibly helpful.

Thank you for your time, and I look forward to hearing from you.

Request for Dockerfile

Is there a Dockerfile available for EmerNeRF? Or has anyone built one?

Thanks in advance.

How to train on multiple GPUs?

I don't have a GPU that satisfies the memory requirements for EmerNeRF training; I only have 7 GTX 3080s, each with 10 GB of memory. How can I train on multiple GPUs?

Correct Procedure for Voxel Visualization?

I am struggling to create a feature field which looks like the one provided in the README.

I get the following visualisation, which looks nothing like the example image:

(Attached image: 1_feature_field)

These are the commands I ran to produce this feature field using scene 1:

Step 1: Train the model according to step 2 in the documentation:

python train_emernerf.py \
--config_file configs/default_config.yaml \
--output_root output \
--project test_dino \
--run_name 1_flow \
data.pixel_source.load_features=True \
data.pixel_source.feature_model_type=dinov2_vitb14 \
data.pixel_source.skip_feature_extraction=False \
nerf.model.head.enable_feature_head=True \
nerf.model.head.enable_learnable_pe=True \
data.scene_idx=1 \
data.start_timestep=2 \
data.end_timestep=3 \
logging.saveckpt_freq=8000 \
optim.num_iters=8000

This gives the following metrics:

I20240327 12:56:14 root video_utils.py:93] Eval over 6 images:
I20240327 12:56:14 root video_utils.py:94]      PSNR: 36.4330
I20240327 12:56:14 root video_utils.py:95]      SSIM: 0.9412
I20240327 12:56:14 root video_utils.py:96]      Feature PSNR: 24.9425
I20240327 12:56:14 root video_utils.py:97]      Masked PSNR: 34.0290
I20240327 12:56:14 root video_utils.py:98]      Masked SSIM: 0.9251
I20240327 12:56:14 root video_utils.py:99]      Masked Feature PSNR: 23.2345

The generated RGBs and DINO features are the following (attached: step_8000_rgbs, step_8000_dino_feats).

Step 2: Add the --visualize_voxel and resume_from flags and remove the iterations and save checkpoint frequency (since I don't want to train again):

python train_emernerf.py \
--config_file configs/default_config.yaml \
--output_root output \
--project test_dino \
--run_name 1_flow \
--visualize_voxel \
resume_from=output/test_dino/1_flow/checkpoint_08000.pth \
data.pixel_source.load_features=True \
data.pixel_source.feature_model_type=dinov2_vitb14 \
data.pixel_source.skip_feature_extraction=False \
nerf.model.head.enable_feature_head=True \
nerf.model.head.enable_learnable_pe=True \
data.scene_idx=1 \
data.start_timestep=2 \
data.end_timestep=3

This produces the feature field shown above.

Does anyone know what could be causing this issue? I am currently attempting to train with a much longer timestep range, but other than that I do not see any glaring errors in my procedure.

Any help would be greatly appreciated. I am happy to provide further information.

Novel view synthesis on nuScenes

Hello, I'm trying to add support for dynamic object masks and novel view synthesis on nuScenes. I didn't fully understand what you meant by "it requires splitting asynchronous LiDAR and camera data." Could you give me more tips?

thanks!

Training on custom dataset


Hi!

Thank you for your excellent work. I really appreciate it.

I am currently training EmerNeRF on a custom dataset collected from the Carla simulator. I've slightly modified the code to include depth supervision. While the overall reconstructed geometry appears quite good, the decomposed dynamic field exhibits significant artifacts, and the cars are noticeably distorted. Additionally, the voxelization results seem scattered.

Static RGB field:
https://github.com/NVlabs/EmerNeRF/assets/79851538/c02a167a-16dc-4d79-bee7-63ce910f117f

Static depth:
https://github.com/NVlabs/EmerNeRF/assets/79851538/4d7a603c-0d6b-4d2c-949b-e631c9343e24

Dynamic RGB field:
https://github.com/NVlabs/EmerNeRF/assets/79851538/42c6047a-9a75-4e4a-a9d3-d2117521201b

Dynamic depth:
https://github.com/NVlabs/EmerNeRF/assets/79851538/9f67c6b5-3b30-4828-9d02-6ad9784886cf

Voxelization:
(attached image)

Here is my config:
data:
  data_root: ../data/carla_v2
  dataset: carla_depth
  scene_idx: 0
  start_timestep: 0
  end_timestep: -1
  ray_batch_size: 8192
  preload_device: cuda
  pixel_source:
    load_size:
      - 644
      - 966
    downscale: 1
    num_cams: 6
    test_image_stride: 0
    load_rgb: true
    load_sky_mask: true
    load_dynamic_mask: false
    load_features: true
    skip_feature_extraction: false
    target_feature_dim: 64
    feature_model_type: dinov2_vitb14
    feature_extraction_stride: 7
    feature_extraction_size:
      - 644
      - 966
    delete_features_after_run: false
    sampler:
      buffer_downscale: 16
      buffer_ratio: 0.25
    depth_truncate: 70
  lidar_source:
    load_lidar: true
    only_use_top_lidar: false
    truncated_max_range: 80
    truncated_min_range: -2
    lidar_downsample_factor: 4
    lidar_percentile: 0.02
  occ_source:
    voxel_size: 0.1
nerf:
  aabb:
    - -20.0
    - -40.0
    - 0
    - 80.0
    - 40.0
    - 20.0
  unbounded: true
  propnet:
    num_samples_per_prop:
      - 128
      - 64
    near_plane: 0.1
    far_plane: 1000.0
    sampling_type: uniform_lindisp
    enable_anti_aliasing_level_loss: true
    anti_aliasing_pulse_width:
      - 0.03
      - 0.003
    xyz_encoder:
      type: HashEncoder
      n_input_dims: 3
      n_levels_per_prop:
        - 8
        - 8
      base_resolutions_per_prop:
        - 16
        - 16
      max_resolution_per_prop:
        - 512
        - 2048
      lgo2_hashmap_size_per_prop:
        - 20
        - 20
      n_features_per_level: 1
    unbounded: true
  sampling:
    num_samples: 64
  model:
    xyz_encoder:
      type: HashEncoder
      n_input_dims: 3
      n_levels: 10
      n_features_per_level: 4
      base_resolution: 16
      max_resolution: 8192
      log2_hashmap_size: 20
    dynamic_xyz_encoder:
      type: HashEncoder
      n_input_dims: 4
      n_levels: 10
      n_features_per_level: 4
      base_resolution: 32
      max_resolution: 8192
      log2_hashmap_size: 18
    neck:
      base_mlp_layer_width: 64
      geometry_feature_dim: 64
      semantic_feature_dim: 64
    head:
      head_mlp_layer_width: 64
      enable_cam_embedding: false
      enable_img_embedding: true
      appearance_embedding_dim: 16
      enable_sky_head: true
      enable_feature_head: true
      feature_embedding_dim: 64
      feature_mlp_layer_width: 64
      enable_learnable_pe: true
      enable_dynamic_branch: true
      enable_shadow_head: true
      interpolate_xyz_encoding: true
      enable_temporal_interpolation: false
      enable_flow_branch: true
    num_cams: 6
    unbounded: true
    resume_from: null
render:
  render_chunk_size: 16384
  render_novel_trajectory: false
  fps: 24
  render_low_res: true
  render_full: true
  render_test: true
  low_res_downscale: 4
  save_html: false
  vis_voxel_size: 0.3
supervision:
  rgb:
    loss_type: l2
    loss_coef: 1.0
  depth:
    loss_type: l2
    enable: true
    loss_coef: 1.0
    depth_error_percentile: null
  line_of_sight:
    enable: true
    loss_type: my
    loss_coef: 0.1
    start_iter: 2000
    start_epsilon: 6.0
    end_epsilon: 2.5
    decay_steps: 5000
    decay_rate: 0.5
  sky:
    loss_type: opacity_based
    loss_coef: 0.001
  feature:
    loss_type: l2
    loss_coef: 0.5
  dynamic:
    loss_type: sparsity
    loss_coef: 0.01
    entropy_loss_skewness: 1.1
  shadow:
    loss_type: sparsity
    loss_coef: 0.01
optim:
  num_iters: 100000
  weight_decay: 1.0e-05
  lr: 0.01
  seed: 0
  check_nan: false
  cache_rgb_freq: 2000
logging:
  vis_freq: 2000
  print_freq: 200
  saveckpt_freq: 20000
  save_seperate_video: true
resume_from: null
eval:
  eval_lidar_flow: false
  remove_ground_when_eval_lidar_flow: true
  eval_occ: false
  occ_annotation_stride: 10

Could you kindly provide some suggestions on how I can improve the quality of the dynamic field?

Upgrade to numpy==1.23 to avoid typing errors.

Adding this here in case other people run into the same problem.

It seems that following the instructions, numpy==1.21 is installed, as this is a requirement for WOMD. However, on my machine this led to a typing error. Upgrading to numpy==1.23 fixed the issue. I haven't tried running WOMD yet, but nuScenes seems to work.
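
For anyone who hits the same error, the upgrade is just (any 1.23.x patch release should work):

pip install "numpy==1.23.*"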

Decomposed Static RGB and Decomposed Dynamic RGB

Great job! I'm really interested in knowing how to decompose a scene into dynamic and static components and render them separately to achieve the effects of "Decomposed Static RGB" and "Decomposed Dynamic RGB" mentioned in the paper. Is there any related code or functionality available for this?

Image size and feature_extraction_size

Hi, my image data is of size 1920x1080, and I don't want to resize it. When I change the parameter load_size to [1080,1920], should I also change feature_extraction_size or keep it unchanged? Thanks.

Question about the indoor dataset

Thanks for sharing this awesome work!

May I know if this method adapts to the indoor scenes dataset (such as 7Scenes and ScanNet) for NeRF rendering?

CUDA out of memory error

Thanks a lot for your interesting work!!

I am trying to run the code on Ubuntu 22.04 using a GeForce GTX TITAN X (12 GB), and I am getting an out-of-memory error. Could someone suggest how I can solve this?

Also, if the batch size needs to be changed, in which file can I find the variable?

(Screenshot from 2024-04-10 14-15-45)

I got this error for scene id 114 with no other changes.

(Screenshot from 2024-04-10 14-14-44)

I got this error after I changed the ray_batch_size from 8192 to 4096.

I am new to this and would really appreciate some guidance!

Missing packages in requirements.txt

Thank you for your amazing work!

It seems that a few packages are missing from the requirements.txt file, namely tiny-cuda-nn and nerfacc. They can be installed with:

pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
pip install nerfacc
