
VolRecon

Code for the paper 'VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction' (CVPR 2023)

(Teaser figure)

Abstract: The success of the Neural Radiance Fields (NeRF) in novel view synthesis has inspired researchers to propose neural implicit scene reconstruction. However, most existing neural implicit reconstruction methods optimize per-scene parameters and therefore lack generalizability to new scenes. We introduce VolRecon, a novel generalizable implicit reconstruction method with Signed Ray Distance Function (SRDF). To reconstruct the scene with fine details and little noise, VolRecon combines projection features aggregated from multi-view features, and volume features interpolated from a coarse global feature volume. Using a ray transformer, we compute SRDF values of sampled points on a ray and then render color and depth. On the DTU dataset, VolRecon outperforms SparseNeuS by about 30% in sparse view reconstruction and achieves accuracy comparable to MVSNet in full view reconstruction. Furthermore, our approach exhibits good generalization performance on the large-scale ETH3D benchmark.

If you find this project useful for your research, please cite:

@misc{ren2022volrecon,
      title={VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction}, 
      author={Yufan Ren and Fangjinhua Wang and Tong Zhang and Marc Pollefeys and Sabine Süsstrunk},
      journal={CVPR},
      year={2023}
}

Installation

Requirements

  • python 3.8
  • CUDA 10.2
conda create --name volrecon python=3.8 pip
conda activate volrecon

pip install -r requirements.txt

Reproducing Sparse View Reconstruction on DTU

  • Download the pre-processed DTU dataset. The dataset is organized as follows:
root_directory
├── cameras
│   ├── 00000000_cam.txt
│   ├── 00000001_cam.txt
│   └── ...
├── pair.txt
├── scan24
├── scan37
│   ├── image
│   │   ├── 000000.png
│   │   ├── 000001.png
│   │   └── ...
│   └── mask
│       ├── 000.png
│       ├── 001.png
│       └── ...

The camera file cam.txt stores the camera parameters, which include the extrinsics, intrinsics, minimum depth, and depth range interval:

extrinsic
E00 E01 E02 E03
E10 E11 E12 E13
E20 E21 E22 E23
E30 E31 E32 E33

intrinsic
K00 K01 K02
K10 K11 K12
K20 K21 K22

DEPTH_MIN DEPTH_INTERVAL
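
For reference, below is a minimal Python sketch of how such a camera file could be parsed. The function name read_cam_file and the whitespace-based parsing are assumptions for illustration, not the repository's own data loader.

import numpy as np

def read_cam_file(path):
    """Parse a DTU-style cam.txt: a 4x4 world-to-camera extrinsic matrix,
    a 3x3 intrinsic matrix, and the minimum depth / depth interval pair."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    # lines[0] == 'extrinsic', lines[1:5] hold the 4x4 matrix rows
    extrinsic = np.array([list(map(float, lines[i].split())) for i in range(1, 5)])
    # lines[5] == 'intrinsic', lines[6:9] hold the 3x3 matrix rows
    intrinsic = np.array([list(map(float, lines[i].split())) for i in range(6, 9)])
    # last line: DEPTH_MIN DEPTH_INTERVAL (any extra values are ignored)
    depth_min, depth_interval = map(float, lines[9].split()[:2])
    return extrinsic, intrinsic, depth_min, depth_interval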

pair.txt stores the view selection result. For each reference image, the 10 best source views are stored in the file:

TOTAL_IMAGE_NUM
IMAGE_ID0                       # index of reference image 0 
10 ID0 SCORE0 ID1 SCORE1 ...    # 10 best source images for reference image 0 
IMAGE_ID1                       # index of reference image 1
10 ID0 SCORE0 ID1 SCORE1 ...    # 10 best source images for reference image 1 
...
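
A small Python sketch for reading this file is shown below; the function name read_pair_file and the returned dictionary layout are assumptions, not the repository's loader.

def read_pair_file(path):
    # Returns {reference_id: [(source_id, score), ...]} following the layout above.
    pairs = {}
    with open(path) as f:
        num_images = int(f.readline())
        for _ in range(num_images):
            ref_id = int(f.readline())
            tokens = f.readline().split()
            num_src = int(tokens[0])
            pairs[ref_id] = [(int(tokens[1 + 2 * i]), float(tokens[2 + 2 * i]))
                             for i in range(num_src)]
    return pairs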
  • In script/eval_dtu.sh, set DATASET as the root directory of the dataset and OUT_DIR as the directory to store the rendered depth maps. CKPT_FILE is the path of the checkpoint file (it defaults to our model pretrained on DTU). Run bash eval_dtu.sh on a GPU. By default, 3 images (--test_n_view 3) in image set 0 (--set 0) are used for testing.
  • In tsdf_fusion.sh, set ROOT_DIR as the directory that stores the rendered depth maps. Run bash tsdf_fusion.sh on a GPU to obtain the reconstructed meshes in the mesh directory (a conceptual TSDF-fusion sketch is given after this list).
  • For quantitative evaluation, download SampleSet and Points from DTU's website. Unzip them and place the Points folder in SampleSet/MVS Data/. The structure looks like:
SampleSet
└── MVS Data
    └── Points
  • Following SparseNeuS, we clean the raw mesh with object masks by running:
python evaluation/clean_mesh.py --root_dir "PATH_TO_DTU_TEST" --n_view 3 --set 0
  • Get the quantitative results by running evaluation code:
python evaluation/dtu_eval.py --dataset_dir "PATH_TO_SampleSet_MVS_Data"
  • Note that you can change --set in eval_dtu.sh and --set during mesh cleaning to use a different image set (0 or 1). By default, image set 0 is used. The average chamfer distance over sets 0 and 1 is what we report in Table 1.
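
The sketch below illustrates the kind of TSDF integration performed by the fusion step, using Open3D. The voxel size, truncation distance, depth units, and the frames iterable are illustrative assumptions; tsdf_fusion.sh in the repository remains the authoritative implementation.

import numpy as np
import open3d as o3d

def fuse_depth_maps(frames, voxel_length=1.5, sdf_trunc=6.0, out_path="mesh/fused.ply"):
    """Integrate per-view depth maps into a TSDF volume and extract a mesh.

    frames: iterable of (color, depth, K, w2c) per view, where color is HxWx3 uint8,
    depth is HxW float32 in the same units as the cameras, K is the 3x3 intrinsic
    matrix, and w2c is the 4x4 world-to-camera extrinsic matrix."""
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=voxel_length, sdf_trunc=sdf_trunc,
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)
    for color, depth, K, w2c in frames:
        h, w = depth.shape
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(np.ascontiguousarray(color, dtype=np.uint8)),
            o3d.geometry.Image(np.ascontiguousarray(depth, dtype=np.float32)),
            depth_scale=1.0, depth_trunc=1e4, convert_rgb_to_intensity=False)
        intrinsic = o3d.camera.PinholeCameraIntrinsic(
            w, h, K[0, 0], K[1, 1], K[0, 2], K[1, 2])
        volume.integrate(rgbd, intrinsic, w2c)
    mesh = volume.extract_triangle_mesh()
    o3d.io.write_triangle_mesh(out_path, mesh)
    return mesh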

Evaluation on Custom Dataset

We provide some helpful scripts for evaluation on custom datasets, each of which consists of a set of images. As discussed in the limitation section, our method is not suitable for very large-scale scenes because of the coarse global feature volume. The main steps are as follows:

  • Run COLMAP for sparse reconstruction.
  • Use colmap_input.py to convert COLMAP's sparse reconstruction results into a format similar to that of the datasets we use. The dataset should be organized as:
root_directory
├── scene_name1
├── scene_name2
│   ├── images
│   │   ├── 00000000.jpg
│   │   ├── 00000001.jpg
│   │   └── ...
│   ├── cams
│   │   ├── 00000000_cam.txt
│   │   ├── 00000001_cam.txt
│   │   └── ...
│   └── pair.txt

This step is mainly to obtain the camera files and the view selection (pair.txt). As discussed previously, the view selection picks out the best source views for each reference view, which also helps to further reduce the volume size. The camera file stores the camera parameters, which include the extrinsics, intrinsics, minimum depth, and maximum depth:

extrinsic
E00 E01 E02 E03
E10 E11 E12 E13
E20 E21 E22 E23
E30 E31 E32 E33

intrinsic
K00 K01 K02
K10 K11 K12
K20 K21 K22

DEPTH_MIN DEPTH_MAX
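
As a complement, here is a minimal Python sketch that writes a camera file in the layout above from a 4x4 world-to-camera matrix, a 3x3 intrinsic matrix, and the depth bounds. The function name write_cam_file and the number formatting are assumptions; colmap_input.py is the repository's actual converter.

import numpy as np

def write_cam_file(path, extrinsic, intrinsic, depth_min, depth_max):
    # Write the extrinsic (4x4), intrinsic (3x3) and depth bounds in the layout above.
    with open(path, "w") as f:
        f.write("extrinsic\n")
        for row in np.asarray(extrinsic).reshape(4, 4):
            f.write(" ".join(f"{v:.6f}" for v in row) + "\n")
        f.write("\nintrinsic\n")
        for row in np.asarray(intrinsic).reshape(3, 3):
            f.write(" ".join(f"{v:.6f}" for v in row) + "\n")
        f.write(f"\n{depth_min} {depth_max}\n")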
  • The file code/dataset/general_fit.py is the dataset loader. The parameter self.offset_dist is the distance offset w.r.t. the reference view used to generate a virtual viewpoint for rendering; it can be adjusted (25 mm by default). See the sketch after this list.
  • Use script/eval_general.sh for image and depth rendering.
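
The snippet below is only a conceptual sketch of the virtual-viewpoint idea behind self.offset_dist (the function name, the axis choice, and the camera-axis translation are assumptions), not the code in general_fit.py.

import numpy as np

def make_virtual_pose(w2c_ref, offset_dist=25.0, axis=0):
    # Shift the reference camera centre by offset_dist (scene units, e.g. mm)
    # along one of its own axes and return the new world-to-camera matrix.
    c2w = np.linalg.inv(w2c_ref)                        # camera-to-world pose
    c2w_virtual = c2w.copy()
    c2w_virtual[:3, 3] += offset_dist * c2w[:3, axis]   # translate along a camera axis
    return np.linalg.inv(c2w_virtual)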

Training on DTU

  • Download the pre-processed DTU training data (camera files, rectified images, and raw depth maps) and organize it as follows:
root_directory
├── Cameras
├── Rectified
└── Depths_raw
  • In train_dtu.sh, set DATASET as the root directory of the dataset and LOG_DIR as the directory to store the checkpoints.

  • Train the model by running bash train_dtu.sh on a GPU.

Acknowledgement

Part of the code is based on SparseNeuS and IBRNet.


volrecon's Issues

depth_fusion

Thank you for your outstanding job. I have a question regarding the depth fusion and depth evaluation source code. Will you be releasing this code in the future?

I noticed that there is a depth_fusion.sh file, but I could not find the depth_fusion.py file. Could you please let me know where I can find the related source code? Additionally, it would be wonderful if you could provide a link to the depth evaluation source code as well.

Thank you in advance.

Visualization results

Hi, thanks for your amazing work. In the results section of the paper, a comparison with other methods such as SparseNeuS and MVSNeRF is presented. I wanted to enquire about the results for MVSNeRF and how you acquired them. I tried running marching cubes on MVSNeRF with a density threshold of 10 and am not getting satisfactory results. If possible, I'd also be grateful if you would be willing to share the code to do the same.

Thank you in advance

Reproduction problem regarding the edges of the image and the generated mesh

Thanks for sharing the great work!
When I tried to reproduce your work, I found some blurring at the edges of the predicted image, like the following:
(screenshot attached)
This image was generated using the checkpoint you provided on GitHub.
When I used a checkpoint trained on the DTU dataset with the bash script script/train_dtu.sh, I still got blurred image predictions.
(screenshot attached)
I also observed some inconsistency in the mesh (between the checkpoint you provided on GitHub and the checkpoint I trained locally).
(screenshot attached)
The left mesh is generated by the checkpoint you provided on GitHub, and the right one is generated by the model I trained locally.

Right now, I am wondering whether there is a problem with my environment, or whether there is some variation when running the code several times.

Question about the quantitative evaluation results on DTU dataset

Thanks for publishing this great work! However, we found some mismatches between our reproduced results and your paper. We used your released checkpoint to generate the meshes and tested the chamfer distance on the two image sets. We can only get an average score of 1.48 over these two sets (1.38 for the first set and 1.58 for the second set), while the result reported in your paper is 1.38. Is this result reasonable? Is the result in your paper that of the first set only, or are we misunderstanding something?

Error when trying to run pip install -r requirements.txt

My environment is Ubuntu 20.04, Python 3.8. It gives the error below when it tries to install pyembree:

Collecting pyembree (from -r requirements.txt (line 25))
Using cached pyembree-0.2.10-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Using cached pyembree-0.2.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Using cached pyembree-0.2.8-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Using cached pyembree-0.2.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Using cached pyembree-0.2.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Using cached pyembree-0.2.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Using cached pyembree-0.1.12-py3-none-any.whl (12.9 MB)
INFO: pip is still looking at multiple versions of pyembree to determine which version is compatible with other requirements. This could take a while.
Using cached pyembree-0.1.10.tar.gz (12.8 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [15 lines of output]
Traceback (most recent call last):
File "/home/voxar/anaconda3/envs/volrecon/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in
main()
File "/home/voxar/anaconda3/envs/volrecon/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/voxar/anaconda3/envs/volrecon/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-nt016rlp/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/tmp/pip-build-env-nt016rlp/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 325, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-nt016rlp/overlay/lib/python3.8/site-packages/setuptools/build_meta.py", line 341, in run_setup
exec(code, locals())
File "", line 33, in
ModuleNotFoundError: No module named 'build'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Volume Rendering of SRDF

Thanks for the great work! I notice that, as stated in your paper, you adopt the volume rendering formulation for SDF (NeuS) to volume-render the SRDF. How does this guarantee that the network learns the SRDF of the sampled point instead of the SDF?

camera coordinate in data preprocessing (dtu_train.py)

Hi! First of all, thanks for sharing this great work!
I assume that you are busy, but I have trivial questions about the camera matrix in dtu_train.py.

When you define the w2c matrix in Line 270 of dtu_train.py, why do we need to multiply all extrinsics by the w2c_ref_inv matrix, as in
w2cs.append(self.all_extrinsics[index_mat] @ w2c_ref_inv)
instead of simply using self.all_extrinsics[index_mat]?

I think this is because we need sample['ray_o'] to be defined in world coordinates, as below:
sample['ref_pose'] = (intrinsics_pad @ sample['w2cs'])[0]
sample['ray_o'] = sample['ref_pose_inv'][:3,-1]

Am I correct here?

In addition, what is self.homo_pixel for? I assume it is used to compute ray_d in camera coordinates, but I am confused here.

Again, thanks for sharing the work and answering questions.

Single view reconstruction

Hi,

Thanks for your excellent work.

I'm curious about reconstructing objects from a single input view, like pixelNeRF. I know it's a highly ill-posed problem, but I still wonder about the performance in this setting (I noticed the ablation on source views in the appendix, which shows results for two views at minimum).

Do you have any suggestions on how to improve the performance, such as fine-tuning under the single-view setting?

Extract the custom mesh

Hi, thank you for your great work! On the DTU dataset it looks good. I tried to extract meshes on a custom dataset. First, I used script/eval_general.sh to render images and depths. Then I ran tsdf_fusion.sh to get the reconstructed meshes. Finally, I changed the image shape and view list in evaluation/clean_mesh.py to match my own dataset. But the results are bad. Here are some of my results; I hope you can help me, thank you!
(screenshots attached: rendered image, TSDF mesh, and two cleaned meshes)

Issue with tsdf_fusion.sh on custom data

I ran the colmap_input.py command after sparse reconstruction on custom data, resulting in camera files and pair.txt. Subsequently, I executed the script/eval_general.sh command, which created three folders, each containing data. However, when I tried to execute tsdf_fusion.sh, I encountered 'Found Scans: []'. Can you please advise me on the necessary steps, or tell me if I missed something?
(screenshot attached)

Why is there a difference between VolRecon and SparseNeuS when computing "iter_cos" from "true_cos" in renderer.py?

Hi, thanks for your great work!
I understand why true_cos in your implementation is -1 because of the SRDF.
But I notice that, in your implementation, iter_cos in renderer.py equals 1.5 * true_cos when cos_anneal_ratio is equal to 1, and this factor of 1.5 exceeds the upper bound of the cosine function, which differs from the implementations in NeuS and SparseNeuS. I wonder if there is a reason why you designed it this way.
Looking forward to your reply!

Note:
In VolRecon:

iter_cos = -(-true_cos * 0.5 + 0.5 * (1.0 - cos_anneal_ratio) - true_cos * cos_anneal_ratio)

In SparseNeuS & NeuS:

iter_cos = -(F.relu(-true_cos * 0.5 + 0.5) * (1.0 - cos_anneal_ratio) + F.relu(-true_cos) * cos_anneal_ratio)

Question about Fine-tuning

Hello, thank you for sharing the code of your great work.
I was wondering if it is possible to fine-tune the pretrained model on a new scene, rather than only running inference, similar to how it is done in SparseNeuS. The paper does not mention this, so I assumed it is not possible, but I might be wrong.

Also, I was trying to run the model on a scene from BlendedMVS, but I could not get any meaningful reconstruction, as seen here:
(screenshot attached)

When looking at the rendered depth maps, I get something, but the mesh generation does not give a good result:
(screenshot attached)
I tried tuning the value of self.offset_dist, but still without success. Do you have an idea of what could be wrong here?

Thanks

Questions about source views

Hi, thanks for the excellent work! Here I have a small question regarding the term 'source views' used in this project.

According to the code, the image to be rendered (denoted 'gt-color' for brevity) is also included in the 'source views', which means that the model aggregates the 'gt-color' itself along with some neighboring views to render the 'gt-color'. If this is the case, I wonder how the efficacy of the RGB rendering loss can be guaranteed, since the view transformer could simply pick the gt-color and give zero weight to all other views. However, as suggested in Tab. 4 of the original paper, the pure RGB rendering loss still delivers a reasonable result, with a chamfer distance of 2.04.

Could you please clarify more on this point? Thanks in advance!
