

Monocular Dynamic View Synthesis: A Reality Check


This repo contains training, evaluation, data processing and visualization code in JAX for our reality check on the recent advances in Dynamic View Synthesis (DVS) from monocular video. Please refer to our project page for more visualizations and qualitative results.

Monocular Dynamic View Synthesis: A Reality Check
Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, Angjoo Kanazawa
UC Berkeley, CMU, Adobe Research
NeurIPS 2022

We find that, although existing works show impressive results, there is a discrepancy between the practical captures we hope to handle and the common experimental protocols, which are effectively multi-view. We also find that existing evaluation protocols are limited, as they consider neither the co-visibility of pixels during training and testing nor inferred correspondence. We benchmark existing works under the practical setting using our improved evaluation scheme and show that there is large room for improvement. We hope that our work can provide a solid baseline for future works in this domain.


In this repository, you can find instructions for setup, a quick-start demo, dataset downloads and processing, benchmark reproduction, EMF computation, and our improved evaluation metrics, as detailed below.

Setup

Please refer to SETUP.md for instructions on setting up a work environment.

By default, our code runs on 4 NVIDIA RTX A5000 GPUs (24 GB memory). Please try decreasing the chunk size if you have fewer resources. You can do so with the following syntax, which applies to all demo/evaluation/training commands:

# Append the rendering chunk size binding at the end of your command. Set it to
# something smaller than the default 8192 in case of OOM.
... --gin_bindings="get_prender_image.chunk=<CHUNK_SIZE>"
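
For example, the demo command from the Quick start section below could be run with a smaller chunk size like so (4096 is just an illustrative value; pick whatever fits your GPU memory):

python demo/launch.py --task novel_view --gin_bindings="get_prender_image.chunk=4096"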

Quick start

Here is a demo to get you started: re-rendering a paper-windmill video from a pre-trained T-NeRF model. Under the demo folder, we include a minimal dataset and a model checkpoint for this purpose.

# Launch a demo task.
python demo/launch.py --task <TASK>

You should be able to get the results below by specifying <TASK> as one of "novel_view", "stabilized_view", or "bullet_time"; each result is rendered alongside the training video.
Additional details:
  • The minimal dataset only contains the camera and meta information, without the actual video frames.
  • The model is our baseline T-NeRF with additional regularizations (config), which we find competitive compared to SOTA methods.
  • It takes roughly 3 minutes to render the novel-view task and 12 minutes for the others, at 0.3 FPS.

Datasets

Please refer to DATASETS.md for instructions on downloading processed datasets used in our paper, including:

  1. Additional co-visibility masks and keypoint annotations for Nerfies-HyperNeRF dataset.
  2. Our accompanying iPhone dataset with more diverse and complex real-life motions.

For processing your own captures following our procedure, please see RECORD3D_CAPTURE.md.

Benchmark

Please refer to BENCHMARK.md for our main results and instructions on reproducibility, including:

  1. How to evaluate our released checkpoints.
  2. How to train from scratch using our configurations.

Effective Multi-view Factors (EMFs)

For better transparency of experiments, we recommend that future works report EMFs on their new sequences.

We propose two EMFs: Angular EMF and Full EMF. The first is easy to compute but assumes a single look-at point for the sequence. The second is generally applicable but relies on optical flow and monocular depth prediction for 3D scene flow estimation, and is thus usually noisy. We recommend trying the Angular EMF first whenever possible.

(1) From our python APIs

from dycheck import processors

# Angular EMF (omega): camera angular speed. We recommend trying it out first whenever possible.
angular_emf = processors.compute_angular_emf(
    orientations,   # (N, 3, 3) np.ndarray for world-to-camera transforms in the OpenCV format.
    positions,      # (N, 3) np.ndarray for camera positions in world space.
    fps,            # Video FPS.
    lookat=lookat,  # Optional camera lookat point. If None, will be computed by triangulating camera optical axes.
)

# Full EMF (Omega): relative camera-scene motion ratio.
full_emf = processors.compute_full_emf(
    rgbs,           # (N, H, W, 3) np.ndarray video frames in either uint8 or float32.
    cameras,        # A sequence of N camera objects.
    bkgd_points,    # (P, 3) np.ndarray for background points.
)

Please see the camera definition in our repo, which follows the one in Nerfies. Note that additional care needs to be taken due to camera distortion, e.g. during camera projection.
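
For intuition, the Angular EMF is roughly the camera's mean angular speed about the look-at point. Below is a minimal numpy sketch of this idea, for illustration only; angular_emf_sketch is a hypothetical helper, and processors.compute_angular_emf is the actual implementation to use.

import numpy as np

def angular_emf_sketch(positions, lookat, fps):
    # Unit directions from the look-at point to each camera position.
    dirs = positions - lookat[None]
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Angle (in degrees) swept between consecutive frames.
    cos = np.clip((dirs[:-1] * dirs[1:]).sum(axis=-1), -1.0, 1.0)
    angles = np.degrees(np.arccos(cos))
    # Mean angular speed, scaled by FPS to get degrees per second.
    return angles.mean() * fps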

(2) From our script

You can also use our script, which takes a video as input, provided that the dataset is preprocessed in Nerfies' data format. Note that you will also need to write a process config at configs/<DATASET>/process_emf.gin. Take a look at our example.

python tools/process_emf.py \
    --gin_configs 'configs/<DATASET>/process_emf.gin' \
    --gin_bindings 'SEQUENCE="<SEQUENCE>"'

Better evaluation metrics

(1) Co-visibility masked image metrics

Have a look at our masked image metrics, which, besides taking a pair of predicted and ground-truth images (img0, img1) as input, also consider an optional co-visibility mask.

# Consider computing SSIM and LPIPS for example.
from dycheck.core import metrics

# Masked SSIM using partial conv.
mssim = metrics.compute_ssim(
    img0,           # (H, W, 3) jnp.ndarray image in float32.
    img1,           # (H, W, 3) jnp.ndarray image in float32.
    mask,           # (H, W, 1) optional jnp.ndarray in float32 {0, 1}. The metric is computed only on the pixels with mask == 1.
)

# Masked LPIPS.
compute_lpips = metrics.get_compute_lpips()  # Create LPIPS model on CPU. We find it is fast enough for all of our experiments.
mlpips = compute_lpips(
    img0,           # (H, W, 3) jnp.ndarray image in float32.
    img1,           # (H, W, 3) jnp.ndarray image in float32.
    mask,           # (H, W, 1) optional jnp.ndarray in float32 {0, 1}. The metric is computed only on the pixels with mask == 1.
)
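
The same masking idea applies to pixel-wise metrics such as PSNR. Below is a minimal sketch of masked PSNR, assuming float32 images in [0, 1]; masked_psnr_sketch is a hypothetical helper, so check dycheck.core.metrics for the metric implementations we actually use.

import jax.numpy as jnp

def masked_psnr_sketch(img0, img1, mask):
    # Mean squared error restricted to pixels with mask == 1.
    mse = ((img0 - img1) ** 2 * mask).sum() / (mask.sum() * img0.shape[-1])
    return -10.0 * jnp.log10(mse)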

You can use our pre-computed co-visibility masks or process them yourself. We provide a processing script for your reference. Note that you will need a process config at configs/<DATASET>/process_covisible.gin. Take a look at our example.

python tools/process_covisible.py \
    --gin_configs 'configs/<DATASET>/process_covisible.gin' \
    --gin_bindings 'SEQUENCE="<SEQUENCE>"'

(2) Correspondence metrics

from dycheck.core import metrics

pck = metrics.compute_pck(
    kps0,           # (J, 2) jnp.ndarray keypoints in float32.
    kps1,           # (J, 2) jnp.ndarray keypoints in float32.
    img_wh,         # (Tuple[int, int]) image width and height.
    ratio,          # (float) threshold ratio.
)
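
For reference, PCK measures the fraction of keypoints that land within a threshold distance of the ground truth. Below is a minimal sketch of this idea, assuming the threshold is a ratio of the larger image dimension; pck_sketch is a hypothetical helper, and metrics.compute_pck defines the exact convention.

import jax.numpy as jnp

def pck_sketch(kps0, kps1, img_wh, ratio):
    # Per-keypoint distance between the two keypoint sets.
    dists = jnp.linalg.norm(kps0 - kps1, axis=-1)
    # A keypoint is "correct" if it lies within ratio * max(W, H) pixels.
    thresh = ratio * max(img_wh)
    return (dists < thresh).mean()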

In this repo, we use root-finding to determine long-term correspondences given a Nerfies or HyperNeRF checkpoint. See our canonical-to-view snippet for reference.

Citation

If you find this repository useful for your research, please use the following:

@inproceedings{gao2022dynamic,
    title={Monocular Dynamic View Synthesis: A Reality Check},
    author={Gao, Hang and Li, Ruilong and Tulsiani, Shubham and Russell, Bryan and Kanazawa, Angjoo},
    booktitle={NeurIPS},
    year={2022},
}

Acknowledgement

This repository is built on top of Keunhong's hypernerf and nerfies codebases. We also thank jaxnerf for its fast data-loading reference and deq-jax for the Broyden root-finding solver.

We would like to thank Zhengqi Li and Keunhong Park for valuable feedback and discussions; Matthew Tancik and Ethan Weber for proofreading. We are also grateful to our pets: Sriracha, Haru, and Mochi, for being good during capture. This project is generously supported in part by the CONIX Research Center, sponsored by DARPA, as well as the BDD and BAIR sponsors.
