ngp_pl's Issues

Predicting SDF instead of density?

Hi, it's me again with another question. I'm thinking of extending this framework to support SDF prediction instead of density. How hard would this be? I mean something like https://lioryariv.github.io/volsdf/ (or NeuS) where you still train from posed images with volume rendering. My naive guess is that you would need to convert the geometry MLP output from SDF to density before rendering and (maybe) change some hyperparameters, but I might be overlooking something. From an implementation point of view, is this something that can be quickly done or does it require major changes to the code? Thank you in advance!
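
For reference, the SDF-to-density conversion mentioned above can be sketched as follows. This follows the VolSDF formulation, where density is a scaled Laplace CDF of the negated signed distance. The function name sdf_to_density, the default beta, and the sign convention are illustrative assumptions and not part of ngp_pl.

```python
import math

def sdf_to_density(sdf: float, beta: float = 0.1) -> float:
    """VolSDF-style density: sigma = (1 / beta) * Psi_beta(-sdf).

    Psi_beta is the CDF of a zero-mean Laplace distribution with scale beta.
    Assumed convention: sdf < 0 inside the surface, sdf > 0 outside.
    """
    alpha = 1.0 / beta
    s = -sdf
    if s <= 0:
        cdf = 0.5 * math.exp(s / beta)
    else:
        cdf = 1.0 - 0.5 * math.exp(-s / beta)
    return alpha * cdf

# Density is high inside the surface, alpha / 2 on it, and decays outside.
for d in (-0.5, 0.0, 0.5):
    print(f"sdf={d:+.1f} -> density={sdf_to_density(d):.4f}")
```

In a sketch like this, beta would typically be a learnable parameter; a smaller beta gives a sharper transition at the surface.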

hparams is a global variable in NeRFSystem class

Hello!
First of all, congrats on the great work you are doing!
I noticed that the hparams variable in the NeRFSystem class inside train.py is not an attribute of the class but a global variable defined in the main section of the script, so the class methods cannot be called from outside the train script.
It might be better to use self.hparams, since the class is a LightningModule.

Maybe something like this would do the trick:
hparams -> self.hparams
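
A framework-free sketch of the suggested change is shown below. The class and field names are hypothetical stand-ins and not the repository's code. In a real LightningModule, calling self.save_hyperparameters(hparams) in __init__ should make the values available as self.hparams.

```python
from argparse import Namespace

class SystemSketch:
    """Hypothetical stand-in for NeRFSystem; not the repository's class."""

    def __init__(self, hparams: Namespace):
        # Keep the configuration on the instance instead of relying on a
        # module-level `hparams` that only exists when train.py runs as a script.
        self.hparams = hparams

    def describe(self) -> str:
        # Methods read self.hparams, so they also work when imported elsewhere.
        return f"{self.hparams.exp_name} (scale={self.hparams.scale})"

system = SystemSketch(Namespace(exp_name="Lego", scale=0.5))
print(system.describe())
```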

Multi GPU Training Support

Hi, thanks for the amazing code!
The original tcnn codebase states that it does not support multi-GPU training:
NVlabs/tiny-cuda-nn#63

But I found that as long as you set

torch.cuda.set_device(rank)

before calling anything like tcnn.Network, it actually works. All the tensors are on the correct GPUs and behave normally as far as I can tell.
It is also common to call set_device in distributed training before calling dist.init_process_group.

However, I am completely new to PyTorch Lightning and not sure how DDP is implemented there.
I tried to do it naively, but the code raises "illegal memory access" errors.
I am not sure whether this is due to improper DDP settings or to your custom CUDA code such as "aabb_intersection" or the "density_grid update".

Personally, I think it would be very useful if the code supported multi-GPU training, which could be used in multi-object or video data scenarios.

Demo using Lightning App ⚡️

Hi @kwea123, thanks for providing the pre-trained checkpoints. I loved this project 😄!

Do you have any plan to add a demo app for this? You can create a Lightning App and submit it to the app gallery.

We've created the Research Template App for this purpose. This app lets you connect a blog post, an arXiv paper, and a Jupyter notebook, and even host an interactive demo for people to play with the model. It also allows industry practitioners to reproduce your work.


PS: I'd be interested in contributing by creating an app for this project.

Asking for comments to ray-marcher implementation

Hi, thank you very much again for this nice PyTorch re-implementation of instant-ngp.
I have been reading it and playing with it in another codebase.
However, I found it very difficult to understand the ray marcher implementation correctly.
https://github.com/kwea123/ngp_pl/blob/master/models/csrc/raymarching.cu
For example, I could not understand terms like cascades, scale, and exp_step_factor, which are atypical compared to the original NeRF implementations. It would be nice to have more comments explaining how the ray marcher works.
Also, is there no hierarchical (importance) sampling in this case?
Thanks very much!

Exporting model as ONNX for inference using barracuda.

I wonder whether, once you are able to create the custom dataset pipeline, it would be a problem to deploy the resulting NGP into Unity via Barracuda. I presume exporting the model to ONNX and running it through Barracuda wouldn't be much of an issue? I foresee using a pipeline closely related to the Style Transfer one.

About the inference speed

Congratulations on the great work!
I just have a small note on the reported FPS of torch-ngp. The reported 7.8 actually includes the cost of saving rgb and depth to disk, and I estimated it to be ~20 on lego from the GUI on a V100.
Since the ray marching CUDA implementation is similar, I really wonder where the speed-up comes from, and would be grateful if you could share any ideas on how to improve it!

Samples Per Ray

Hi,

thank you for providing this amazing codebase!
I wonder if you have an intuition about the samples/ray in your implementation vs instant-ngp?
Yours drops to around ~25, while theirs drops to around ~6 on the lego scene after 5K steps.

Are you actually taking smaller steps along a ray, or something else?

Thank you!

Problem when using large scale

Hi, thanks a lot for your repo, it's very useful!

I'm having a problem when training a NeRF with a large scale factor. For example, training on the Lego dataset with --scale 8 breaks with the current repo and returns this error:

Got cutlass error: Error Internal at: 346

However, this problem didn't appear with a previous version of your repo. It seems to trace back to the new calc_dt function in raymarching.cu, introduced in the "fix volume rendering condition bug" commit.

I see that the old version used the scale whereas the new one doesn't. Could you help me understand the difference between these two?

Thanks!

Qualitative results

  • Chair
rgb.mp4
depth.mp4
  • Drums
rgb.mp4
depth.mp4
  • Ficus
rgb.mp4
depth.mp4
  • Hotdog
rgb.mp4
depth.mp4
  • Lego
rgb.mp4
depth.mp4
  • Materials
rgb.mp4
depth.mp4
  • Mic
rgb.mp4
depth.mp4
  • Ship
rgb.mp4
depth.mp4

Is there support for 3d texture export?

Thank you for your implementation and Great Work! I looked through the code but I was not able to find a .vol export similar to your previous nerf implementation here. Would the same code work on this model? If not what suggestions do you have to export the volume to a 3d texture?

gradient computation

Hi, thanks for the great work. I have a question regarding the gradient computation in

// compute gradients by math...
dL_drgbs[s][0] = dL_drgb[ray_idx][0]*w;
dL_drgbs[s][1] = dL_drgb[ray_idx][1]*w;
dL_drgbs[s][2] = dL_drgb[ray_idx][2]*w;
dL_dsigmas[s] = deltas[s] * (
dL_drgb[ray_idx][0]*(rgbs[s][0]*T-(R-r)) +
dL_drgb[ray_idx][1]*(rgbs[s][1]*T-(G-g)) +
dL_drgb[ray_idx][2]*(rgbs[s][2]*T-(B-b)) +
dL_dopacity[ray_idx]*(1-O) +
dL_ddepth[ray_idx]*(t*T-(D-d))
);
dL_drgb_bg[0] = dL_drgb[ray_idx][0]*(1-O);
dL_drgb_bg[1] = dL_drgb[ray_idx][1]*(1-O);
dL_drgb_bg[2] = dL_drgb[ray_idx][2]*(1-O);

Can you briefly explain how you get the formula for dL_dsigmas[s]? That is, why does
dL_dsigmas[s] = deltas[s] * (
dL_drgb[ray_idx][0]*(rgbs[s][0]*T-(R-r)) +
dL_drgb[ray_idx][1]*(rgbs[s][1]*T-(G-g)) +
dL_drgb[ray_idx][2]*(rgbs[s][2]*T-(B-b)) +
dL_dopacity[ray_idx]*(1-O) +
dL_ddepth[ray_idx]*(t*T-(D-d))
)
hold?
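
One way to sanity-check the colour terms is to derive them for a single channel and compare against finite differences. With weights w_j = T_j * (1 - exp(-sigma_j * delta_j)), increasing sigma_s adds delta_s * T_after * c_s from sample s itself and scales every later weight by -delta_s, which gives delta_s * (c_s * T_after - sum of w_j * c_j over samples behind s). The sketch below assumes that T in the kernel is the transmittance after sample s (already multiplied by 1 - alpha_s) and that R - r is the colour contributed by the samples behind s. Both are assumptions about the surrounding kernel code, which is not quoted in the issue, and all function names are illustrative.

```python
import math

def composite(sigmas, deltas, colors):
    """Front-to-back compositing of one colour channel; returns the total C."""
    T, C = 1.0, 0.0
    for sigma, delta, c in zip(sigmas, deltas, colors):
        a = 1.0 - math.exp(-sigma * delta)
        C += T * a * c
        T *= 1.0 - a
    return C

def analytic_grad(sigmas, deltas, colors):
    """dC/dsigma_s = delta_s * (c_s * T_after - (C - partial_s))."""
    C = composite(sigmas, deltas, colors)
    T, partial, grads = 1.0, 0.0, []
    for sigma, delta, c in zip(sigmas, deltas, colors):
        a = 1.0 - math.exp(-sigma * delta)
        partial += T * a * c          # colour accumulated up to and including s
        T *= 1.0 - a                  # transmittance after sample s
        grads.append(delta * (c * T - (C - partial)))
    return grads

def numeric_grad(sigmas, deltas, colors, eps=1e-6):
    """Central finite differences of composite() with respect to each sigma."""
    grads = []
    for s in range(len(sigmas)):
        hi, lo = list(sigmas), list(sigmas)
        hi[s] += eps
        lo[s] -= eps
        diff = composite(hi, deltas, colors) - composite(lo, deltas, colors)
        grads.append(diff / (2 * eps))
    return grads

sigmas = [0.5, 2.0, 1.0, 3.0]
deltas = [0.1, 0.2, 0.15, 0.3]
colors = [0.9, 0.1, 0.5, 0.7]
print(analytic_grad(sigmas, deltas, colors))
print(numeric_grad(sigmas, deltas, colors))
```

The opacity term follows the same pattern with c = 1 for every sample: the bracket becomes T_after - (O - o), and since the opacity accumulated so far is o = 1 - T_after, this simplifies to 1 - O.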

about the texture

Can we extract texture using use_vertext_normals?
I tried, but the program died.

Gradients can be INF

Hi,

sometimes it happens that gradients for the rgb network are INF.
It can be reproduced by adding this hook to the NeRFSystem:

    def on_before_optimizer_step(self, optimizer, optimizer_idx: int) -> None:
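        def check_nan(x, name):
            # `name` labels the checked tensor in the error message.
            if torch.any(torch.isnan(x)):
                raise ValueError(f"{name} is NAN, {x.isnan().sum()}")

            if torch.any(torch.isinf(x)):
                raise ValueError(f"{name} is INF, {x.isinf().sum()}")

            # torch.isinf is also True for -inf, so the check above already
            # fires for negative infinities before this one is reached.
            if torch.any(torch.isneginf(x)):
                raise ValueError(f"{name} is NEG INF, {x.isneginf().sum()}")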

        check_nan(self.model.rgb_net.params, 'rgb network')
        check_nan(self.model.xyz_encoder.params, 'xyz_encoder')

        check_nan(self.model.xyz_encoder.params.grad, 'xyz_encoder grad')
        check_nan(self.model.rgb_net.params.grad, 'rgb network grad')

I tried different versions of the repository, dating back until the first release at 95c573dd494fb1bfa9b1afffb4c48ef2d9dcec22.

For all, after some time, I get errors like this (tested on lego):

Epoch 6:   8%|▊         | 76/1000 [00:02<00:25, 36.41it/s, loss=0.000466, v_num=2, train/psnr=32.80]
Traceback (most recent call last):
...
ValueError: rgb network grad is INF, 1

This does not seem to affect the optimization at all (i.e. the INF value might just be ignored and training still converges), but it's still strange to me that it happens.
Just wanted to put this here in case you have any idea what might cause this :)

About "test_traj.txt"

I want to ask an additional question: there is no test_traj.txt in the datasets downloaded from NSVF. How can I obtain it?
Thank you very much!

Differentiate through the rendering process

Hi, thanks for the great work! I have a (maybe stupid) question regarding the implementation. I'm not a NeRF researcher, but there are many works in my field adopting the NeRF representation for different applications. When using NeRF, we mostly care about the gradients of rendered RGB values (or depth) w.r.t. input rays (or query points etc.). I looked at the official instant-ngp repo before, and I found that we cannot easily differentiate through their CUDA code.

Since this is a PyTorch-based implementation, I'm wondering if we can easily get the gradient. I haven't thoroughly read the code, but as far as I read the answer seems to be yes. After all you will need PyTorch auto-grad mechanism to "train" the network. Is that correct?

About background colors?

torch-ngp uses a random background color,
NeRF uses a fixed background,
and ngp_pl seems to use a fixed background as well.

Q: Why are background colors randomized during NeRF training?
A: Transparency in the training data indicates a desire for transparency in the learned model. Using a solid background color, the model can minimize its loss by simply predicting that background color, rather than transparency (zero density). By randomizing the background colors, the model is forced to learn zero density to let the randomized colors "shine through".

As mentioned in instant-ngp, would it be better to use a random background?
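
The idea in the answer above can be sketched in a few lines. The helper names are hypothetical and this is not ngp_pl's or instant-ngp's code. It assumes the rendered colour is already weighted by the accumulated opacity, and that the same random background is also composited onto the RGBA ground-truth pixel before the loss is computed.

```python
import random

def composite_over_background(rgb, opacity, background):
    """Blend an accumulated ray colour over a background colour.

    rgb is assumed to be already weighted by the per-sample opacities,
    so the background only fills the remaining (1 - opacity) fraction.
    """
    return tuple(c + (1.0 - opacity) * b for c, b in zip(rgb, background))

def random_background(rng: random.Random):
    """Draw a fresh uniform RGB background, e.g. once per training ray."""
    return tuple(rng.random() for _ in range(3))

rng = random.Random(0)
bg = random_background(rng)
# A ray through empty space shows the background unchanged ...
empty_ray = composite_over_background((0.0, 0.0, 0.0), 0.0, bg)
# ... while a fully opaque ray hides it completely.
solid_ray = composite_over_background((0.2, 0.4, 0.6), 1.0, bg)
```

Because the background changes for every draw, a model that paints a fixed colour where the data is transparent can no longer match the target; predicting zero density there is the only consistent solution.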

How to extract mesh

Great work!
Instant_ngp can extract a mesh from images. How can that be done with your re-implementation?

Synthetic NeRF Dataset creation tool

Not really an issue but just a small PSA to contribute to your effort, if allowed. I just uploaded a Synthetic NeRF dataset creation tool here. Please feel free to critique and enhance it and, if deemed usable, incorporate it into your effort. I am amazed at your work and felt the need to humbly contribute. I'm working on a GUI as well. Regards

Planning to support datasets with background?

Hi, thanks for open sourcing this amazing work! I was wondering if there is any plan to support more realistic datasets without object masks (e.g. the plain and simple T&T instead of the preprocessed from NSVF)? I guess so because supporting custom datasets (with COLMAP?) is on the TODO list, but asking just in case :) thank you again!

Multi GPU Support

It would be awesome if it could support multiple GPUs. Higher batch sizes, more images, and a higher aabb scale would all become possible!

thanks for your amazing work :)

Results reproduction

Hi @kwea123! Thanks a lot for your work!
I've tried to reproduce your results but I did not manage to achieve high quality. The only difference is that I used AdamW rather than the apex fused optimizer. I tried the Robot dataset with simply:
python train.py --root_dir C:\Users\Admin\Desktop\Robot --exp_name Lego --scale 1

And I received the following results:
image
003_d
003

Is this a problem with the different optimizer, or should I change some parameters?

Some question about the benchmark

Thanks for your great work! :)
The code is really easy to follow and understand. I love it!

Just some questions.

  1. What does the FPS in the table mean? Is it the inference speed for 800x800 images?
  2. Is the reported 60 FPS of the official instant-ngp measured with its C++/CUDA code on an RTX 2080Ti as well?
  3. All 3 reported implementations are said to be trained for 5 minutes. Does this mean that the faster one is actually trained for more iterations?

Lego checkpoint cannot be loaded into model

Hello,

Could you please check whether the Lego checkpoint of the trained NeRF that you provide in the release is compatible with the model implementation?

I double-checked it with the version of NGP from master as well as from the release; neither seems to work:

RuntimeError: Error(s) in loading state_dict for NGP: size mismatch for density_bitfield: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([524288]). size mismatch for xyz_encoder.params: copying a param with shape torch.Size([11448112]) from checkpoint, the shape in current model is torch.Size([12199312]).

Cannot run inference with test.ipynb

Hi, excellent work! I trained a nvsf model on the lego dataset using the steps in the README. During inference time in test.ipynb, I get the error in the picture below. It seems that there is an issue with the size of density_bitfield and xyz_encoder.params in the checkpoint.

I trained my model following the README, so I'm not sure what's wrong with my approach.

error-with-test

[Potential Bug] Use mip level to query density grid.

Thanks for this great work!

After going through the code, I noticed that these two lines behave weirdly:

const float mip_bound = fminf(1<<mip, scale);
const float mip_bound_inv = 1.0f/mip_bound;

The original Instant-NGP code is here:

float mip_scale = scalbnf(1.0f, -mip);

Which should be equivalent to

const float mip_bound = 1 << mip; 
const float mip_bound_inv = 1.0f/mip_bound; 

I'm not sure why the fminf operation was added here.
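
The difference between the two variants can be tabulated with a small script. This is only an illustration of the arithmetic in the lines quoted above: math.ldexp(1.0, -mip) plays the role of scalbnf(1.0f, -mip), and the function names and the example scale values are assumptions.

```python
import math

def mip_bound_original(mip: int) -> float:
    """Reciprocal of the quoted Instant-NGP mip_scale, i.e. 2**mip."""
    mip_scale = math.ldexp(1.0, -mip)  # same value as scalbnf(1.0f, -mip)
    return 1.0 / mip_scale

def mip_bound_clamped(mip: int, scale: float) -> float:
    """The ngp_pl variant quoted above: fminf(1 << mip, scale)."""
    return min(float(1 << mip), scale)

# The variants agree while 1 << mip <= scale and differ once the clamp applies.
for scale in (0.5, 4.0):
    for mip in range(4):
        print(scale, mip, mip_bound_original(mip), mip_bound_clamped(mip, scale))
```

With scale = 0.5 the clamped bound stays at 0.5 for every level, while with scale = 4.0 the two only diverge at mip 3.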

About the performance of NeRF

Thanks for your work, it's great! But I am puzzled about custom datasets: I estimate the poses with COLMAP, then train the model with your code (for 10K~100K iterations).
However, the result is not great; it looks like this:

blender_data247_spiral_200000_rgb.mp4

It can't reach performance similar to NeRF's demos (like lego and others...). So I think it may be related to the accuracy of the poses.
Can you give me some suggestions to improve it?

problem encountered at the beginning of the training: ValueError

Hello. It's an amazing job!

cuda11.7
wsl+ubuntu20.04
gtx1080

The problem is described as follows:

~/code/ngp_pl$ python train.py --root_dir /home/gzx/datasets/Lego/ --exp_name Lego
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: logs/Lego
Traceback (most recent call last):
File "train.py", line 184, in <module>
trainer.fit(system, ckpt_path=hparams.ckpt_path)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1174, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_setup_hook
self._call_lightning_module_hook("setup", stage=fn)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "train.py", line 67, in setup
self.train_dataset = dataset(split=hparams.split, **kwargs)
File "/home/gzx/code/ngp_pl/datasets/nsvf.py", line 43, in __init__
K = np.loadtxt(os.path.join(root_dir, 'intrinsics.txt'),
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/numpy/lib/npyio.py", line 1301, in loadtxt
arr = _read(fname, dtype=dtype, comment=comment, delimiter=delimiter,
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/numpy/lib/npyio.py", line 979, in _read
arr = _load_from_filelike(
ValueError: the number of columns changed from 4 to 3 at row 2; use usecols to select a subset and avoid this error

I don't know why this is happening and would like to get your help.

Did something break?

Loading Synthetic Nerf Lego breaks with:

ValueError: the number of columns changed from 4 to 3 at row 2; use usecols to select a subset and avoid this error

when loading the intrinsics.txt file

The error originates here: datasets/nsvf.py, line 44
K = np.loadtxt(os.path.join(root_dir, 'intrinsics.txt'),

Seems to be the hard coding on line 27 of datasets/nsvf.py
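
The message means that np.loadtxt found rows with different numbers of values, which happens when the first line of intrinsics.txt holds four numbers and later lines hold fewer. A minimal sketch of reading only the first row is shown below. The sample contents, the fx cx cy interpretation of that row, and the helper names are assumptions for illustration, not the repository's loader.

```python
import io

# Made-up sample with ragged rows, similar in shape to the failing file.
SAMPLE = """1111.1 400.0 400.0 0.
0. 0. 0.
1.
800 800
"""

def read_first_row_intrinsics(fileobj):
    """Return (fx, cx, cy) from the first row of a ragged intrinsics file."""
    first = fileobj.readline().split()
    if len(first) < 3:
        raise ValueError(
            f"expected at least 3 values on the first row, got {len(first)}")
    fx, cx, cy = (float(v) for v in first[:3])
    return fx, cx, cy

def build_K(fx, cx, cy):
    """Assemble a 3x3 pinhole matrix with square pixels (fy = fx assumed)."""
    return [[fx, 0.0, cx],
            [0.0, fx, cy],
            [0.0, 0.0, 1.0]]

K = build_K(*read_first_row_intrinsics(io.StringIO(SAMPLE)))
print(K)
```

Whether a file should be parsed this way or as a full 4x4 matrix depends on the dataset variant, so the choice should come from the file contents or an explicit option rather than from the folder name.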

Error when installing the CUDA extension!

Processing e:\codes\ngp_pl-master\ngp_pl-master\models\csrc
Preparing metadata (setup.py) ... done
Building wheels for collected packages: vren
Building wheel for vren (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [915 lines of output]
running bdist_wheel
.....

  2 errors detected in the compilation of "intersection.cu".
  intersection.cu
  error: command 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.6\\bin\\nvcc.exe' failed with exit code 1
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for vren
Running setup.py clean for vren
Failed to build vren
Installing collected packages: vren
Running setup.py install for vren ... error
error: subprocess-exited-with-error

× Running setup.py install for vren did not run successfully.
│ exit code: 1
╰─> [917 lines of output]
running install
C:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_ext
C:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\utils\cpp_extension.py:411: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend. warnings.warn(msg.format('we could not find ninja.'))
C:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\utils\cpp_extension.py:346: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified.
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
building 'vren' extension
creating build
creating build\temp.win-amd64-3.9
creating build\temp.win-amd64-3.9\Release
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IE:\codes\ngp_pl-master\ngp_pl-master\models\csrc\include -IC:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include -IC:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include\torch\csrc\api\include -IC:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include\TH -IC:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\include" -IC:\ProgramData\Anaconda3\envs\ngp-pl\include -IC:\ProgramData\Anaconda3\envs\ngp-pl\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tpbinding.cpp /Fobuild\temp.win-amd64-3.9\Release\binding.obj /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /EHsc -O2 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=vren -D_GLIBCXX_USE_CXX11_ABI=0
binding.cpp
C:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include\c10/macros/Macros.h(143): warning C4067: unexpected tokens following preprocessor directive - expected a newline
C:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include\c10/core/TensorImpl.h(2214): warning C4805: '|': unsafe mix of type 'uintptr_t' and type 'bool' in operation
C:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include\c10/util/Optional.h(198): warning C4624: 'c10::constexpr_storage_t': destructor was implicitly defined as deleted

what is Tensor().mT operation?

Hello, thank you so much for the awesome YouTube videos and live streams! I have learnt a lot from them.
I tried to train on my own data using a NeRF-style transform.json (generated with instant-ngp's colmap2nerf.py), and
there is an operation .mT in dataset/ray_utils.py, in the line "rays_d = rearrange(directions, 'n c -> n 1 c') @ c2w[..., :3].mT".

Can you give me some info about this .mT operation? Somehow I cannot run this line because the operation is missing.
How is it different from a normal transpose?
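
For context, Tensor.mT transposes only the last two dimensions and leaves any leading batch dimensions alone. It appears to require PyTorch 1.11 or newer, and x.transpose(-2, -1) gives the same result on older versions. The pure-Python sketch below only illustrates those semantics on nested lists; batched_transpose is a hypothetical helper, not a PyTorch function.

```python
def batched_transpose(batch):
    """Transpose each matrix in a list of matrices (swap the last two axes)."""
    return [[list(col) for col in zip(*matrix)] for matrix in batch]

# Two 2x3 matrices: shape (2, 2, 3) -> (2, 3, 2); the batch axis stays first.
poses = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
swapped = batched_transpose(poses)
print(swapped)
```

For a plain 2-D matrix this matches the usual transpose; the difference only shows up once there are batch dimensions in front.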

Thanks in advance!!

NaN in density grid

Hi author, thx for your amazing repo!
When I train on an indoor scene I find nan values appear in the density grid which leads to errors like:

in sample_uniform_and_occupied_cells
cells += [(torch.cat([indices1, indices2]), torch.cat([coords1, coords2]))]
RuntimeError: CUDA error: invalid configuration argument

Could you pls tell me how to fix this like setting an upper bound value for the density grid during training?

CUDA OOM error when validating

Hi kwea123, thanks for your great work on NGP.
I have tried the mipnerf360 datasets on an NVIDIA V100 16G.
During training, GPU memory usage is ~6G, but during validation I get an OOM error. The batch sizes I tried include 4096 and 8192.

Optimize extrinsics error

The experimental training option optimize_ext crashed in my project. When optimizing extrinsics, the shape of self.dR[batch['img_idxs']] is [batchsize, 3] in the training step and [3] in the test step. However, the code that computes dR = axisangle_to_R(self.dR[batch['img_idxs']]) is the same for both steps. I'm trying to solve this problem, but haven't managed to yet. Have you ever tested optimize_ext?
The detailed error info is as follows:
Traceback (most recent call last):
File "train.py", line 280, in <module>
trainer.fit(system, ckpt_path=hparams.ckpt_path)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
self.on_advance_end()
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 255, in on_advance_end
self._run_validation()
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 311, in _run_validation
self.val_loop.run()
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 128, in advance
output = self._evaluation_step(**kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 226, in _evaluation_step
output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 344, in validation_step
return self.model.validation_step(*args, **kwargs)
File "train.py", line 199, in validation_step
results = self(batch, split='test')
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "train.py", line 90, in forward
dR = axisangle_to_R(self.dR[batch['img_idxs']])
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
return func(*args, **kwargs)
File "/home/x00586938/ngp_pl-master/datasets/ray_utils.py", line 84, in axisangle_to_R
zero = torch.zeros_like(v[:, :1]) # (B, 1)
IndexError: too many indices for tensor of dimension 1

CUDA error when training on custom dataset

Hi, thank you again for this great work. I encountered a CUDA error RuntimeError: CUDA error: an illegal memory access was encountered when training on my custom data rendered from Blender (Kubric, to be more precise). The strange thing is that this bug is not reproducible -- sometimes it happens, sometimes not (using different seeds), and the traceback points to different lines of code each time. One example traceback looks like:

... skip all the PyTorch-Lightning related lines
File ".../ngp_pl/finetune.py", line 78, in validation_step                                                                                                             
    results = self(batch, split='test')                                                              
  File ".../anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)                                                            
  File ".../ngp_pl/train.py", line 79, in forward                                                                                                                        
    return render(self.model, batch_data['rays'], **kwargs)                                                                                                                                                
  File ".../ngp_pl/models/rendering.py", line 37, in render                                                                                                              
    results = render_func(model, rays_o, rays_d, hits_t, **kwargs)                                                                                                                                         
  File ".../anaconda3/envs/py38/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)                                                                     
  File ".../ngp_pl/models/rendering.py", line 88, in __render_rays_test
    valid_mask = ~torch.all(dirs == 0, dim=1)                                                        
RuntimeError: CUDA error: an illegal memory access was encountered

I know it's very hard to debug this given only the error message, but it's a bit tricky to share the code because I modify some parts of it. So I just want to ask if, by any chance, anyone has encountered this error, and what could be the reason for it?

train colmap data

Hey! Thanks for your excellent work!
I have three questions:

  1. The intrinsics usually have a format like this:

image

so can you tell me what the intrinsics mean?
  2. Have you ever met this error? I don't know which step is wrong.

image

  3. How can I use COLMAP data to train directly? In other words, how can I get bounds and intrinsics for my custom photos? Thanks a lot!!!

Gradio Demo?

Can we have a Gradio demo of the results (some place where we can upload a bunch of images and it produces the 2d model of them)?

A problem occurred before the training was completed

Thank you so much for such a great job.

My system environment is as follows:

OS: win11+wsl2+ubuntu20.04
Graphic Card: 1660ti 6G
RAM: 40G

I had no problems with my training until you updated the gui code. Today I updated the latest repository and had the following problem.

(ngp_pl) gzx@HelloWorld:~/code/ngp_pl$ /usr/bin/env /home/gzx/anaconda3/envs/ngp_pl/bin/python /home/gzx/.vscode-server/extensions/ms-python.python-2022.10.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 37035 -- /home/gzx/code/ngp_pl/train.py --root_dir ../../datasets/nerf/Synthetic_NeRF/Lego/ --exp_name Lego --num_epochs 1
GridEncoding: Nmin=16 b=1.31951 F=2 T=2^19 L=16
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Loading 100 train images ...
100it [00:01, 50.29it/s]
Loading 200 test images ...
200it [00:03, 51.64it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 0: 83%|████████████████ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).y=30.00, train/psnr=29.70]
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1314) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/gzx/code/ngp_pl/train.py", line 264, in <module>
trainer.fit(system, ckpt_path=hparams.ckpt_path)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
self.on_advance_end()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 255, in on_advance_end
self._run_validation()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 309, in _run_validation
self.val_loop.run()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 154, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 112, in advance
batch = next(data_fetcher)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
return self.fetching_function()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 259, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 273, in _fetch_next_batch
batch = next(iterator)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
data = self._next_data()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data
idx, data = self._get_data()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1286, in _get_data
success, data = self._try_get_data()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1147, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1314) exited unexpectedly
Epoch 0: 83%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 1000/1200 [01:12<00:14, 13.70it/s, loss=0.00104, train/s_per_ray=30.00, train/psnr=29.70]

I don't know what happened. If you know the cause of this problem, please let me know.
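
The traceback itself points at shared memory: PyTorch DataLoader workers hand batches to the main process through shared memory, which on Linux is normally backed by `/dev/shm`. A minimal sketch for checking how much is available inside WSL2 is below. The helper name is made up for illustration, and whether this is the actual cause here is an assumption based only on the error text.

```python
import os
import shutil


def shm_usage_gib(path="/dev/shm"):
    """Return total/used/free space of the filesystem at `path`, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total": usage.total / gib,
        "used": usage.used / gib,
        "free": usage.free / gib,
    }


if __name__ == "__main__" and os.path.isdir("/dev/shm"):
    # Print the current shared-memory budget before starting training.
    for key, value in shm_usage_gib().items():
        print(f"{key}: {value:.2f} GiB")
```

If the free amount is small, two common workarounds are to raise the shared memory limit, as the message suggests, or to build the DataLoader with fewer workers (`num_workers=0` loads data in the main process and avoids shared memory entirely, at some cost in speed).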

Training and Rendering Bounding Box

Hello!
Thanks for your work.
I just wanted to know if you plan to add the rendering bounding box, just like in Torch-NGP or Instant-NGP, in addition to the training bounding box.
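
To make the request concrete, here is a minimal sketch of what a render-time bounding box typically does: each ray is clipped to an axis-aligned box so that only the segment inside it is sampled. It uses the standard slab method in plain Python. It is not how ngp_pl, Torch-NGP or Instant-NGP implement it, and the function name is hypothetical.

```python
def ray_aabb(origin, direction, box_min, box_max, eps=1e-12):
    """Return (t_near, t_far) where the ray is inside the box, or None on a miss.

    The ray is origin + t * direction with t >= 0.
    """
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < eps:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near, t_far


# A ray travelling along +z from z = -2 crosses the unit-ish box [-1, 1]^3.
print(ray_aabb((0.0, 0.0, -2.0), (0.0, 0.0, 1.0), (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)))
```

With a separate render box, the training box can stay large while rendering samples only between `t_near` and `t_far` of a smaller box, which crops out everything outside it.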
