kwea123 / ngp_pl
Instant-ngp in pytorch+cuda trained with pytorch-lightning (high quality with high speed, with only few lines of legible code)
License: MIT License
Hi, it's me again with another question. I'm thinking of extending this framework to support SDF prediction instead of density. How hard would this be? I mean something like https://lioryariv.github.io/volsdf/ (or NeuS) where you still train from posed images with volume rendering. My naive guess is that you would need to convert the geometry MLP output from SDF to density before rendering and (maybe) change some hyperparameters, but I might be overlooking something. From an implementation point of view, is this something that can be quickly done or does it require major changes to the code? Thank you in advance!
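One way to picture the "convert SDF to density before rendering" step is the Laplace-CDF mapping used by VolSDF, where the density is `alpha * Psi_beta(-sdf)` with `alpha = 1 / beta`. The snippet below is a minimal sketch of that mapping in plain Python so it runs without PyTorch; the function name and the default `beta` are assumptions for illustration and are not part of ngp_pl.

```python
import math


def sdf_to_density(sdf: float, beta: float = 0.1) -> float:
    """VolSDF-style density: (1 / beta) * LaplaceCDF(-sdf, scale=beta).

    `beta` is a hypothetical default; in VolSDF it is a learned parameter.
    """
    alpha = 1.0 / beta
    s = -sdf  # inside the surface sdf < 0, so s > 0 and the density is high
    if s <= 0:
        cdf = 0.5 * math.exp(s / beta)
    else:
        cdf = 1.0 - 0.5 * math.exp(-s / beta)
    return alpha * cdf


inside = sdf_to_density(-1.0)   # deep inside: close to alpha
surface = sdf_to_density(0.0)   # on the surface: alpha / 2
outside = sdf_to_density(1.0)   # far outside: close to 0
```

In a PyTorch model the same arithmetic would be applied elementwise to the geometry MLP output; how well the occupancy grid and the step sizes cope with such a density is a separate question that this sketch does not address.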
Hello!
First of all congrats for the great work you are doing!
I noticed that the hparams variable in the NeRFSystem class inside train.py is not an attribute of the class but a global variable defined inside main, so it is not possible to call the class methods outside the train script.
It might be better to just use self.hparams, since it is a Lightning module.
Maybe something like this would do the trick: hparams -> self.hparams
Hi, thanks for the amazing code!
The original tcnn codebase states that it does not support multi-GPU (NVlabs/tiny-cuda-nn#63).
But I found that as long as you call torch.cuda.set_device(rank) before calling anything like tcnn.Network, it actually works. All the tensors are on the correct GPUs and behave normally for me.
It is also common to call set_device in distributed training before dist.init_process_group.
However, I am completely new to pytorch-lightning and not sure how DDP is implemented there.
I tried to do it naively, but the code raises errors about "illegal memory access".
I am not sure if it is due to improper DDP settings, or your custom cuda code like "aabb_intersection" or "density_grid update".
Personally I think it will be very useful if the code supports multi-gpu which can be potentially used in multi-object/video data scenarios.
Hi @kwea123, thanks for providing the pre-trained checkpoints. I loved this project!
Do you have any plan to add a demo app for this? You can create a Lightning App and submit it to the app gallery.
We've created the Research Template App for this purpose. This app lets you connect a blog post, an arXiv paper, and a Jupyter notebook, and even have an interactive demo for people to play with the model. It also allows industry practitioners to reproduce your work.
PS: I'd be interested in contributing by creating an app for this project.
Hi, thanks for the code !
I have one question about the update_density_grid function. Can you explain why we need to do this update with density_grid and density_bitfield?
Hi author, thx for your amazing repo!
How can we use this repo to load training results (nn weights, density grid...) from the original instant-ngp?
Hi, thank you very much again for this nice re-implementation of pytorch instant-ngp.
I have been reading and playing with it in another codebase.
However, I found it very difficult to correctly understand the ray marcher implementation:
https://github.com/kwea123/ngp_pl/blob/master/models/csrc/raymarching.cu
For example, I could not understand terms like cascades, scale, and exp_step_factor, which are quite atypical compared to the original NeRF implementations. It would be nice to have more comments explaining how the ray marcher works.
Also, there is no hierarchical (importance) sampling in this case?
Thanks very much!
I wonder whether, once you are able to create the custom dataset pipeline, it would be an issue to try to deploy the resulting NGP into Unity via Barracuda. I presume exporting the model to ONNX and running it through Barracuda wouldn't be much of an issue? I foresee using a pipeline closely related to the style transfer one.
Congratulations on the great work!
I just have a small issue on the FPS of torch-ngp. The reported 7.8 actually includes the cost of saving rgb and depth to disk, and I estimated it to be ~20 on lego from the GUI on a V100.
Since the ray marching CUDA implementation is similar, I really wonder about the speed up, and would be grateful if you could share any ideas on how to improve it!
Wonderful project! Is it possible to get the gradient of the density with respect to xyz, just like the normal of an SDF? I guess the normal would help to get smoother meshes.
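To make the idea concrete: a normal can be estimated as the negative, normalized gradient of the density with respect to position. In a PyTorch pipeline this would normally come from autograd; the sketch below uses central finite differences in plain Python so that it runs on its own. The `density` function is a hypothetical stand-in (a soft ball of radius 0.5), not the trained field.

```python
import math


def density(x: float, y: float, z: float) -> float:
    # Hypothetical stand-in for a trained density field: a soft ball of radius 0.5.
    r = math.sqrt(x * x + y * y + z * z)
    return 1.0 / (1.0 + math.exp((r - 0.5) / 0.05))


def normal_from_density(p, eps: float = 1e-4):
    """Estimate the outward normal at p as -grad(density) / |grad(density)|."""
    grad = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += eps
        lo[axis] -= eps
        grad.append((density(*hi) - density(*lo)) / (2.0 * eps))
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0  # avoid dividing by zero
    return [-g / norm for g in grad]


n = normal_from_density((0.5, 0.0, 0.0))  # point on the ball's surface, on the +x axis
```

Density decreases when moving out of the object, so the negated gradient points outward, which matches the usual SDF normal convention.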
Hi,
thank you for providing this amazing codebase!
I wonder if you have an intuition about the samples/ray in your implementation vs instant-ngp?
Yours drops to around ~25, while theirs drops to around ~6 on the lego scene after 5K steps.
Are you actually taking smaller steps along a ray or sth?
Thank you!
Hi, Thanks a lot for your repo it's very useful!
I'm having a problem when training a nerf with a large scale factor. For example, training on the Lego dataset with --scale 8 breaks with the current repo and returns this error:
Got cutlass error: Error Internal at: 346
However, this problem didn't appear with a previous version of your repo. It seems to trace back to the new calc_dt function in raymarching.cu, introduced in "fix volume rendering condition bug".
I see that the old version used the scale whereas the new one doesn't. Could you help me understand the difference between these two?
Thanks!
Thank you for your implementation and great work! I looked through the code but I was not able to find a .vol export similar to the one in your previous nerf implementation here. Would the same code work on this model? If not, what suggestions do you have for exporting the volume to a 3D texture?
Hi, thanks for the great work. I have a question regarding the gradient computation in ngp_pl/models/csrc/volumerendering.cu, lines 137 to 152 in e9d9c37.
Can you briefly explain how you get the formula for dL_dsigmas[s]? I.e., why is
dL_dsigmas[s] = deltas[s] * (
    dL_drgb[ray_idx][0]*(rgbs[s][0]*T-(R-r)) +
    dL_drgb[ray_idx][1]*(rgbs[s][1]*T-(G-g)) +
    dL_drgb[ray_idx][2]*(rgbs[s][2]*T-(B-b)) +
    dL_dopacity[ray_idx]*(1-O) +
    dL_ddepth[ray_idx]*(t*T-(D-d))
)
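An expression of this shape follows from the standard discrete volume rendering sum. The derivation below is a sketch under two assumptions that the issue does not state: the `T` in the snippet is the transmittance after sample `s` (written $T_{s+1}$ here), and `r`, `g`, `b`, `d` are running sums up to and including sample `s` (written $r_s$ for one color channel).

```latex
\begin{aligned}
T_i &= \exp\Big(-\sum_{j<i}\sigma_j\delta_j\Big), \qquad
\alpha_i = 1 - e^{-\sigma_i\delta_i}, \qquad
w_i = T_i\,\alpha_i, \qquad
R = \sum_i w_i\,c_i, \qquad
r_s = \sum_{j\le s} w_j\,c_j \\[4pt]
\frac{\partial w_j}{\partial\sigma_s} &=
\begin{cases}
0 & j < s \\
\delta_s\,T_s\,e^{-\sigma_s\delta_s} = \delta_s\,T_{s+1} & j = s \\
-\delta_s\,w_j & j > s
\end{cases} \\[4pt]
\frac{\partial R}{\partial\sigma_s}
&= \delta_s\Big(c_s\,T_{s+1} - \sum_{j>s} w_j\,c_j\Big)
 = \delta_s\big(c_s\,T_{s+1} - (R - r_s)\big) \\[4pt]
\frac{\partial O}{\partial\sigma_s}
&= \delta_s\big(T_{s+1} - (O - o_s)\big) = \delta_s\,(1 - O)
\quad\text{with } O = \sum_i w_i,\; o_s = \sum_{j\le s} w_j = 1 - T_{s+1} \\[4pt]
\frac{\partial D}{\partial\sigma_s}
&= \delta_s\big(t_s\,T_{s+1} - (D - d_s)\big)
\quad\text{with } D = \sum_i w_i\,t_i,\; d_s = \sum_{j\le s} w_j\,t_j
\end{aligned}
```

Summing these three contributions with the chain rule, weighted by the incoming gradients for color, opacity and depth, gives the bracketed sum multiplied by `deltas[s]`.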
Can we extract texture by use_vertext_normals?
I tried, but the program is dead.
Hi,
sometimes it happens that gradients for the rgb network are INF.
It can be reproduced by adding this hook to the NeRFSystem:
def on_before_optimizer_step(self, optimizer, optimizer_idx: int) -> None:
    # Raise as soon as a parameter or gradient tensor contains NaN or +/-Inf.
    def check_nan(x, name):
        if torch.any(torch.isnan(x)):
            raise ValueError(f"{name} is NAN, {x.isnan().sum()}")
        # torch.isinf is true for both +inf and -inf, so this branch catches either sign.
        if torch.any(torch.isinf(x)):
            raise ValueError(f"{name} is INF, {x.isinf().sum()}")
        # Kept from the original report; unreachable after the isinf check above.
        if torch.any(torch.isneginf(x)):
            raise ValueError(f"{name} is NEG INF, {x.isneginf().sum()}")
    check_nan(self.model.rgb_net.params, 'rgb network')
    check_nan(self.model.xyz_encoder.params, 'xyz_encoder')
    check_nan(self.model.xyz_encoder.params.grad, 'xyz_encoder grad')
    check_nan(self.model.rgb_net.params.grad, 'rgb network grad')
I tried different versions of the repository, dating back to the first release at 95c573dd494fb1bfa9b1afffb4c48ef2d9dcec22.
For all, after some time, I get errors like this (tested on lego):
Epoch 6: 8%| | 76/1000 [00:02<00:25, 36.41it/s, loss=0.000466, v_num=2, train/psnr=32.80]
Traceback (most recent call last):
...
ValueError: rgb network grad is INF, 1
This seems not to affect the optimization at all (i.e. INF value might just be ignored and still converges), but still it's strange to me why it is happening.
Just wanted to put this here in case you got any idea what might cause this :)
I want to ask an additional question: there is no test_traj.txt in the datasets downloaded from NSVF. How can I obtain it?
Thank you very much!
Hi, thanks for the great work! I have a (maybe stupid) question regarding the implementation. I'm not a NeRF researcher, but many works in my field adopt the NeRF representation for different applications. When using NeRF, we mostly care about the gradients of the rendered RGB values (or depth) w.r.t. the input rays (or query points etc.). I looked at the official instant-ngp repo before, and I found that we cannot easily differentiate through their CUDA code.
Since this is a PyTorch-based implementation, I'm wondering if we can easily get the gradient. I haven't thoroughly read the code, but as far as I read the answer seems to be yes. After all you will need PyTorch auto-grad mechanism to "train" the network. Is that correct?
In line 210:
w2c_R = poses[:, :3, :3].mT # (N_cams, 3, 3) batch transpose
'mT' causes an error here. Did you mean T, or something else?
Among them, torch-ngp uses a random background color,
nerf uses a fixed background,
and ngp_pl seems to use a fixed background.
Q: Why are background colors randomized during NeRF training?
A: Transparency in the training data indicates a desire for transparency in the learned model. Using a solid background color, the model can minimize its loss by simply predicting that background color, rather than transparency (zero density). By randomizing the background colors, the model is forced to learn zero density to let the randomized colors "shine through".
As mentioned in instant-ngp, would it be better to use a random background?
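The argument in the quoted answer can be checked with a tiny numeric example. The sketch below compares two hypothetical predictions for a fully transparent ground-truth pixel: one with zero opacity and one that paints an opaque white surface. All names and values are illustrative assumptions and are not taken from any of the three codebases.

```python
import random


def composite(rgb, opacity, bg):
    """Blend a premultiplied color over a background: rgb + (1 - opacity) * bg."""
    return [c + (1.0 - opacity) * b for c, b in zip(rgb, bg)]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


WHITE = [1.0, 1.0, 1.0]
# Ground-truth pixel is fully transparent (premultiplied rgb = 0, alpha = 0).
gt_rgb, gt_alpha = [0.0, 0.0, 0.0], 0.0

# Two hypothetical model outputs for that pixel: (premultiplied color, opacity).
empty = ([0.0, 0.0, 0.0], 0.0)  # correct: zero density along the ray
painted = (WHITE, 1.0)          # wrong: an opaque white surface


def loss(pred, bg):
    target = composite(gt_rgb, gt_alpha, bg)
    return mse(composite(pred[0], pred[1], bg), target)


# With a fixed white background both predictions are indistinguishable.
fixed_losses = (loss(empty, WHITE), loss(painted, WHITE))

# With a random background only the zero-density prediction keeps a zero loss.
rng = random.Random(0)
random_bg = [rng.random() for _ in range(3)]
random_losses = (loss(empty, random_bg), loss(painted, random_bg))
```

With the fixed background the loss gives no signal to remove the opaque white surface; with a randomized background it is penalized on every step.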
Hi,
in models/networks.py on line 160, shouldn't it say s = min(2**(c - 1), self.scale) instead of s = min(2**c, self.scale), in accordance with the comment on line 20?
If I'm wrong could you explain why?
Great work!!! I am wondering how you are handling 360 scenes. Do you adopt a contraction space similar to mip-nerf 360, or just follow nerf++?
ngp_pl/models/csrc/volumerendering.cu
Line 20 in 0813fad
Hi, I noticed a tiny thing in this line. I am not sure if I understood it correctly, and I am not sure why this check is there in the first place. Doesn't it cause the "last ray" to always return zero RGB?
Thanks
Great work!
Instant_ngp can extract a mesh from images. How can that be done with your re-implementation?
Not really an issue, just a small PSA to contribute to your effort, if allowed. I just uploaded a Synthetic NeRF dataset creation tool here. Please feel free to critique and enhance it and, if deemed usable, incorporate it into your effort. I am amazed at your work and felt the need to humbly contribute. I'm working on a GUI as well. Regards
Hi, thanks for open sourcing this amazing work! I was wondering if there is any plan to support more realistic datasets without object masks (e.g. the plain and simple T&T instead of the preprocessed from NSVF)? I guess so because supporting custom datasets (with COLMAP?) is on the TODO list, but asking just in case :) thank you again!
It would be awesome if it could support multi gpu. Higher batch size support, more images, higher aabb scale, all this would be possible !
thanks for your amazing work :)
Hi @kwea123! Thanks a lot for your work!
I've tried to reproduce your results but I did not manage to achieve high quality. The only difference is that I used AdamW rather than the apex fused optimizer. I tried the Robot dataset with simply:
python train.py --root_dir C:\Users\Admin\Desktop\Robot --exp_name Lego --scale 1
And I received the following results:
Is the problem caused by the different optimizer, or should I change some parameters?
Thanks for your great work! :)
The code is really easy to follow and understand. I love it!
Just some questions.
800x800 images?
Hello,
Could you please check whether the Lego checkpoint of the trained NeRF that you provide in the release is compatible with the model implementation?
I double-checked it with the version of NGP from master as well as from the release, and neither seems to work:
RuntimeError: Error(s) in loading state_dict for NGP: size mismatch for density_bitfield: copying a param with shape torch.Size([262144]) from checkpoint, the shape in current model is torch.Size([524288]). size mismatch for xyz_encoder.params: copying a param with shape torch.Size([11448112]) from checkpoint, the shape in current model is torch.Size([12199312]).
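The two density_bitfield sizes in this error differ by exactly a factor of two, which is what one would see if the checkpoint had one occupancy cascade and the freshly built model had two. The arithmetic below is a sketch that assumes a 128^3 grid per cascade packed at one bit per cell; both numbers are assumptions, chosen because they reproduce the sizes in the message.

```python
GRID_SIZE = 128  # assumed occupancy grid resolution per cascade


def cascades_from_bitfield(num_bytes: int, grid_size: int = GRID_SIZE) -> float:
    """Number of cascades implied by a bit-packed occupancy grid of num_bytes bytes."""
    cells_per_cascade = grid_size ** 3
    return num_bytes * 8 / cells_per_cascade


ckpt_cascades = cascades_from_bitfield(262144)   # size stored in the checkpoint
model_cascades = cascades_from_bitfield(524288)  # size expected by the current model
```

If that reading is right, the model was constructed with a different scene scale than the one used for the checkpoint, so passing the same scale at load time as at training time would be the first thing to check.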
Hi, excellent work! I trained a nsvf model on the lego dataset using the steps in the README. At inference time in test.ipynb, I get the error in the picture below. It seems that there is an issue with the size of density_bitfield and xyz_encoder.params in the checkpoint.
I trained my model following the README, so I'm not sure what's wrong with my approach.
Thanks for this great work!
After going through the code, I noticed that these two lines behave weirdly:
ngp_pl/models/csrc/raymarching.cu
Lines 215 to 216 in 091c2bc
The original Instant-NGP code is here:
float mip_scale = scalbnf(1.0f, -mip);
Which should be equivalent to
const float mip_bound = 1 << mip;
const float mip_bound_inv = 1.0f/mip_bound;
I'm not sure what the reason is for adding an fminf operation here.
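The claimed equivalence between the `scalbnf` form and the shift-and-divide form is easy to confirm numerically. The sketch below uses Python's `math.ldexp`, which computes `x * 2**e` like C's `scalbnf`; the helper names are made up for this example, and it covers only the two forms quoted above, not the `fminf` variant.

```python
import math


def mip_scale_scalbn(mip: int) -> float:
    # math.ldexp(x, e) computes x * 2**e, like C's scalbnf(x, e).
    return math.ldexp(1.0, -mip)


def mip_scale_shift(mip: int) -> float:
    mip_bound = 1 << mip
    return 1.0 / mip_bound


pairs = [(mip_scale_scalbn(m), mip_scale_shift(m)) for m in range(8)]
```

Both forms produce exact powers of two, so they agree bit for bit over this range.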
Thanks for your work, it's great! But I have a question about custom datasets: I get the estimated poses through COLMAP, then train a model with your repo's code (for 10K~100K iterations).
However, the result is not great, like this:
It can't reach performance similar to NeRF's demos (like lego and others). So I think it may be related to the accuracy of the poses.
Can you give me some suggestions to improve it?
Hello. It's an amazing job!
cuda11.7
wsl+ubuntu20.04
gtx1080
The problem is described as follows:
~/code/ngp_pl$ python train.py --root_dir /home/gzx/datasets/Lego/ --exp_name Lego
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Warning: FullyFusedMLP is not supported for the selected architecture 61. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: logs/Lego
Traceback (most recent call last):
File "train.py", line 184, in <module>
trainer.fit(system, ckpt_path=hparams.ckpt_path)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1174, in _run
self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_setup_hook
self._call_lightning_module_hook("setup", stage=fn)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "train.py", line 67, in setup
self.train_dataset = dataset(split=hparams.split, **kwargs)
File "/home/gzx/code/ngp_pl/datasets/nsvf.py", line 43, in __init__
K = np.loadtxt(os.path.join(root_dir, 'intrinsics.txt'),
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/numpy/lib/npyio.py", line 1301, in loadtxt
arr = _read(fname, dtype=dtype, comment=comment, delimiter=delimiter,
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/numpy/lib/npyio.py", line 979, in _read
arr = _load_from_filelike(
ValueError: the number of columns changed from 4 to 3 at row 2; use usecols
to select a subset and avoid this error
I don't know why this is happening and would like to get your help.
Loading Synthetic Nerf Lego breaks with:
ValueError: the number of columns changed from 4 to 3 at row 2; use usecols
to select a subset and avoid this error
when loading the intrinsics.txt file
The error originates here: datasets/nsvf.py, line 44
K = np.loadtxt(os.path.join(root_dir, 'intrinsics.txt'),
Seems to be the hard coding on line 27 of datasets/nsvf.py
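The message "the number of columns changed from 4 to 3 at row 2" suggests that this intrinsics.txt is not a plain matrix but a compact layout whose first row holds the focal length and principal point. The parser below is a sketch that accepts both layouts without relying on the dataset name; the two layouts and all sample values are assumptions inferred from the error message, not taken from the repository.

```python
def parse_intrinsics(text: str):
    """Return (fx, fy, cx, cy) from either assumed intrinsics.txt layout."""
    rows = [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]
    if len(rows) >= 3 and all(len(r) == 4 for r in rows[:3]):
        # Layout A (assumed): a matrix with K in its top-left 3x3 block.
        return rows[0][0], rows[1][1], rows[0][2], rows[1][2]
    # Layout B (assumed): first row is "focal cx cy 0." followed by shorter rows.
    focal, cx, cy = rows[0][:3]
    return focal, focal, cx, cy


# Hypothetical sample files, one per layout.
matrix_style = """1111.1 0.0 400.0 0.0
0.0 1111.1 400.0 0.0
0.0 0.0 1.0 0.0
0.0 0.0 0.0 1.0"""
compact_style = """875.0 400.0 400.0 0.
0. 0. 0.
1.
800 800"""

k_matrix = parse_intrinsics(matrix_style)
k_compact = parse_intrinsics(compact_style)
```

Feeding the compact layout to `np.loadtxt` as a whole would fail in the way the traceback shows, because its rows have different lengths.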
Processing e:\codes\ngp_pl-master\ngp_pl-master\models\csrc
Preparing metadata (setup.py) ... done
Building wheels for collected packages: vren
Building wheel for vren (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [915 lines of output]
running bdist_wheel
.....
2 errors detected in the compilation of "intersection.cu".
intersection.cu
error: command 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.6\\bin\\nvcc.exe' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for vren
Running setup.py clean for vren
Failed to build vren
Installing collected packages: vren
Running setup.py install for vren ... error
error: subprocess-exited-with-error
× Running setup.py install for vren did not run successfully.
│ exit code: 1
╰─> [917 lines of output]
running install
C:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_ext
C:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\utils\cpp_extension.py:411: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend. warnings.warn(msg.format('we could not find ninja.'))
C:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\utils\cpp_extension.py:346: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified.
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
building 'vren' extension
creating build
creating build\temp.win-amd64-3.9
creating build\temp.win-amd64-3.9\Release
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IE:\codes\ngp_pl-master\ngp_pl-master\models\csrc\include -IC:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include -IC:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include\torch\csrc\api\include -IC:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include\TH -IC:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\include" -IC:\ProgramData\Anaconda3\envs\ngp-pl\include -IC:\ProgramData\Anaconda3\envs\ngp-pl\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tpbinding.cpp /Fobuild\temp.win-amd64-3.9\Release\binding.obj /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /EHsc -O2 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=vren -D_GLIBCXX_USE_CXX11_ABI=0
binding.cpp
C:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include\c10/macros/Macros.h(143): warning C4067: unexpected tokens following preprocessor directive - expected a newline
C:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include\c10/core/TensorImpl.h(2214): warning C4805: '|': unsafe mix of type 'uintptr_t' and type 'bool' in operation
C:\ProgramData\Anaconda3\envs\ngp-pl\lib\site-packages\torch\include\c10/util/Optional.h(198): warning C4624: 'c10::constexpr_storage_t': destructor was implicitly defined as deleted
Hello, Thank you so much for the awesome youtube video and lives as well!! I have learnt a lot from them.
I tried to train on my own data using a nerf-style transform.json (generated with instant-ngp's colmap2nerf.py).
There is an operation .mT in dataset/ray_utils.py, in the line "rays_d = rearrange(directions, 'n c -> n 1 c') @ c2w[..., :3].mT".
Can you give me some info about this .mT operation? I cannot run this line because the operation is missing.
How is this different from normal transpose?
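For context, `.mT` transposes only the last two dimensions of a tensor. To my knowledge it was added in PyTorch 1.11, and on older versions `x.transpose(-2, -1)` gives the same result. The sketch below mimics the quoted line with plain Python lists for a single direction and a single rotation, showing that multiplying a row vector by the transposed rotation equals applying the rotation to the vector. The rotation and direction values are hypothetical.

```python
def transpose(m):
    """Swap rows and columns of one matrix (what .mT does to the last two dims)."""
    return [list(col) for col in zip(*m)]


def row_times_matrix(v, m):
    return [sum(v[k] * m[k][j] for k in range(len(v))) for j in range(len(m[0]))]


def matrix_times_col(m, v):
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]


# Hypothetical camera-to-world rotation: 90 degrees about the z axis.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
d_cam = [1.0, 0.0, 0.0]  # ray direction in camera coordinates

via_transpose = row_times_matrix(d_cam, transpose(R))  # d @ R^T, as in the quoted line
direct = matrix_times_col(R, d_cam)                    # R @ d
```

The difference from a plain `.T` shows up for batched inputs: a batch transpose leaves the leading batch dimension in place, whereas reversing all dimensions would also move the batch axis.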
Thanks in advance!!
Hi author, thx for your amazing repo!
When I train on an indoor scene I find nan values appear in the density grid which leads to errors like:
in sample_uniform_and_occupied_cells
cells += [(torch.cat([indices1, indices2]), torch.cat([coords1, coords2]))]
RuntimeError: CUDA error: invalid configuration argument
Could you please tell me how to fix this, for example by setting an upper bound on the density grid values during training?
Hi kwea123, thanks for your great work on NGP.
I have tried the mipnerf360 datasets, and the GPU I used is an NVIDIA V100 16G.
In the training phase, GPU memory usage is ~6G, but in the validation phase I get an OOM error. The batch sizes I tried include 4096 and 8192.
The experimental training option optimize_ext crashed in my project. When optimizing extrinsics, the shape of self.dR[batch['img_idxs']] is [batchsize, 3] in the training step and [3] in the test step. However, the code that computes dR = axisangle_to_R(self.dR[batch['img_idxs']]) is the same for the training and test steps. I'm trying to solve this problem, but nothing has worked so far. Have you ever tested optimize_ext?
The detailed error info is as follows:
Traceback (most recent call last):
File "train.py", line 280, in <module>
trainer.fit(system, ckpt_path=hparams.ckpt_path)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
self.on_advance_end()
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 255, in on_advance_end
self._run_validation()
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 311, in _run_validation
self.val_loop.run()
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 128, in advance
output = self._evaluation_step(**kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 226, in _evaluation_step
output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 344, in validation_step
return self.model.validation_step(*args, **kwargs)
File "train.py", line 199, in validation_step
results = self(batch, split='test')
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "train.py", line 90, in forward
dR = axisangle_to_R(self.dR[batch['img_idxs']])
File "/home/user/miniconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
return func(*args, **kwargs)
File "/home/x00586938/ngp_pl-master/datasets/ray_utils.py", line 84, in axisangle_to_R
zero = torch.zeros_like(v[:, :1]) # (B, 1)
IndexError: too many indices for tensor of dimension 1
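The IndexError comes from indexing a 1-D input as if it had a batch dimension. One possible guard is to promote a single axis-angle vector to a batch of one before applying Rodrigues' formula. The sketch below shows that idea in plain Python; `as_batch` and `axisangle_to_R_sketch` are hypothetical helpers written for this example, not the repository's implementation. In PyTorch the analogous guard would be reshaping the input to (-1, 3).

```python
import math


def as_batch(v):
    """Wrap a single 3-vector into a batch of one; copy batches unchanged."""
    if isinstance(v[0], (int, float)):
        return [list(v)]
    return [list(row) for row in v]


def axisangle_to_R_sketch(v):
    """Rodrigues' formula for one or many axis-angle vectors -> list of 3x3 matrices."""
    out = []
    for x, y, z in as_batch(v):
        theta = math.sqrt(x * x + y * y + z * z)
        K = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]  # skew-symmetric matrix of v
        # sin(theta)/theta and (1 - cos(theta))/theta^2, with their limits near zero.
        a = math.sin(theta) / theta if theta > 1e-8 else 1.0
        b = (1.0 - math.cos(theta)) / theta ** 2 if theta > 1e-8 else 0.5
        K2 = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)]
              for i in range(3)]
        out.append([[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
                     for j in range(3)] for i in range(3)])
    return out


single = axisangle_to_R_sketch([0.0, 0.0, math.pi / 2])  # shape (3,), as in the test step
batch = axisangle_to_R_sketch([[0.0, 0.0, 0.0],
                               [0.0, 0.0, math.pi / 2]])  # shape (B, 3), as in training
```

Both call styles return a list of matrices, so the caller can treat the test-time case as a batch of size one.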
Line 114 in 05b4c3e
What is the function of doing this?
Hi @kwea123 ,
Is the 'vren.py' file missing, or should I install the 'vren' package?
Hi, thank you again for this great work. I encountered a CUDA error, RuntimeError: CUDA error: an illegal memory access was encountered, when training on my custom data rendered from Blender (Kubric, to be more precise). The strange thing is that this bug is not reproducible: sometimes it happens, sometimes not (using different seeds), and the traceback points to different lines of code each time. One example traceback looks like:
... skip all the PyTorch-Lightning related lines
File ".../ngp_pl/finetune.py", line 78, in validation_step
results = self(batch, split='test')
File ".../anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File ".../ngp_pl/train.py", line 79, in forward
return render(self.model, batch_data['rays'], **kwargs)
File ".../ngp_pl/models/rendering.py", line 37, in render
results = render_func(model, rays_o, rays_d, hits_t, **kwargs)
File ".../anaconda3/envs/py38/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File ".../ngp_pl/models/rendering.py", line 88, in __render_rays_test
valid_mask = ~torch.all(dirs == 0, dim=1)
RuntimeError: CUDA error: an illegal memory access was encountered
I know it's very hard to debug this given only the error message, but it's a bit tricky to share the code because I modify some parts of it. So I just want to ask if, by any chance, anyone has encountered this error, and what could be the reason for it?
Hey! Thanks for your excellent work!
I have three questions:
Can we have a Gradio demo of the results (someplace where we can upload a bunch of images and it produces the 2d model of it)?
Thank you so much for such a great job.
My system environment is as follows:
OS: win11+wsl2+ubuntu20.04
Graphic Card: 1660ti 6G
RAM: 40G
I had no problems with my training until you updated the gui code. Today I updated the latest repository and had the following problem.
(ngp_pl) gzx@HelloWorld:~/code/ngp_pl$ /usr/bin/env /home/gzx/anaconda3/envs/ngp_pl/bin/python /home/gzx/.vscode-server/extensions/ms-python.python-2022.10.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 37035 -- /home/gzx/code/ngp_pl/train.py --root_dir ../../datasets/nerf/Synthetic_NeRF/Lego/ --exp_name Lego --num_epochs 1
GridEncoding: Nmin=16 b=1.31951 F=2 T=2^19 L=16
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Loading 100 train images ...
100it [00:01, 50.29it/s]
Loading 200 test images ...
200it [00:03, 51.64it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 0: 83%|████████████████ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).y=30.00, train/psnr=29.70]
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/threading.py", line 306, in wait
gotit = waiter.acquire(True, timeout)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1314) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/gzx/code/ngp_pl/train.py", line 264, in <module>
trainer.fit(system, ckpt_path=hparams.ckpt_path)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
self.on_advance_end()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 255, in on_advance_end
self._run_validation()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 309, in _run_validation
self.val_loop.run()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 154, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 112, in advance
batch = next(data_fetcher)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
return self.fetching_function()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 259, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 273, in _fetch_next_batch
batch = next(iterator)
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
data = self._next_data()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data
idx, data = self._get_data()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1286, in _get_data
success, data = self._try_get_data()
File "/home/gzx/anaconda3/envs/ngp_pl/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1147, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1314) exited unexpectedly
Epoch 0: 83%|████████▎ | 1000/1200 [01:12<00:14, 13.70it/s, loss=0.00104, train/s_per_ray=30.00, train/psnr=29.70]
I don't know what happened. If you know the cause of this problem, please let me know.
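Not a confirmed diagnosis, but the first `RuntimeError` above points at shared memory: the validation DataLoader worker was killed with a bus error, which PyTorch itself attributes to the workers possibly running out of shared memory. A minimal sketch for checking this, assuming a Linux machine where shared memory is mounted at `/dev/shm`. The helper names `shm_free_bytes` and `pick_num_workers` and the 512 MiB per-worker budget are hypothetical, chosen only for illustration; they are not part of ngp_pl or PyTorch.

```python
import shutil


def shm_free_bytes(path="/dev/shm"):
    """Return the free bytes on the shared-memory mount, or None if it cannot be read."""
    try:
        return shutil.disk_usage(path).free
    except OSError:
        # The mount does not exist (e.g. non-Linux system) or is not accessible.
        return None


def pick_num_workers(requested, free_bytes, per_worker_bytes=512 * 1024**2):
    """Cap the DataLoader worker count by the available shared memory.

    per_worker_bytes is a guessed budget (assumption), not a measured value.
    Returning 0 means "load data in the main process".
    """
    if free_bytes is None:
        return requested  # cannot measure, so leave the request unchanged
    return max(0, min(requested, free_bytes // per_worker_bytes))


if __name__ == "__main__":
    free = shm_free_bytes()
    print("free shared memory (bytes):", free)
    print("suggested num_workers:", pick_num_workers(16, free))
```

If the reported free space is small, two things may be worth trying: pass `num_workers=0` to the affected `DataLoader` so batches are produced in the main process, or raise the limit itself (inside a Docker container, for example, with the `--shm-size` option of `docker run`).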
Hello!
Thanks for your work.
I just wanted to know if you plan to add the rendering bounding box, just like in Torch-NGP or Instant-NGP, in addition to the training bounding box.