
Comments (6)

kwea123 commented on June 15, 2024

Hi, indeed, the 7.8 includes saving, so it's not accurate.

After commenting out these lines, I ran again with --test on my 2080 Ti:
https://github.com/ashawkey/torch-ngp/blob/main/nerf/utils.py#L848-L864

And this is the result:
[screenshot: timing log showing 200 frames rendered in 11 s]

So on average it's 200/11 ≈ 18.18 FPS, which is close to your number.

As for why the speed differs, I noticed a few possible improvements during my implementation:

  1. (I'm not so familiar with CUDA, so this might be wrong.) I pass the PackedTensor directly into the kernel, so I don't need to do the locate like here. This at least improves readability, and possibly some speed, though I'm not sure. I also use in-place addition here instead of passing additional variables.
  2. I keep track of the alive ray indices here, instead of allocating additional buffers and using atomicAdd to track them.
  3. Probably the most effective: when marching rays at test time, we allocate a small number of samples per ray each iteration and march the rays only that far (the variable step in your code). However, not all rays need that many samples. Take step=1 (the initial step) for example: we march one sample further for each ray, but many rays don't hit anything, so we can exclude the samples that hit nothing from the network evaluation, like:
    valid_mask = ~torch.all(dirs == 0, dim=1)   # all-zero direction means the sample hit nothing
    if valid_mask.sum() == 0: break             # every ray has terminated
    sigmas = torch.zeros(len(xyzs), device=device)
    rgbs = torch.zeros(len(xyzs), 3, device=device)
    _sigmas, _rgbs = model(xyzs[valid_mask], dirs[valid_mask])  # evaluate only valid samples
    sigmas[valid_mask] = _sigmas.float()
    rgbs[valid_mask] = _rgbs.float()

    In your code, you evaluate every sample, many of which are useless. Taking the first step as an example, there are 800x800x1 samples, but if we mask out the samples that hit nothing, only ~200k samples need to be evaluated in my experiment, which leads to a speedup.

These are the only small differences I can think of, but I'm not sure whether they really account for the gap.
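The counting argument behind point 3 can be sanity-checked with a minimal numpy sketch (numpy stands in for the CUDA/PyTorch code here, and the ~25% hit rate is made up; the real ratio depends on the scene):

```python
import numpy as np

# Toy version of the point-3 mask: samples whose direction is all-zero
# hit nothing and can be skipped before the network evaluation.
rng = np.random.default_rng(0)
n_samples = 1000
dirs = np.zeros((n_samples, 3), dtype=np.float32)
hit = rng.random(n_samples) < 0.25               # pretend ~25% of samples hit occupied space
dirs[hit] = rng.standard_normal((int(hit.sum()), 3))

valid_mask = ~np.all(dirs == 0, axis=1)          # same test as the torch code above
evaluated_with_mask = int(valid_mask.sum())      # only these go through the network
evaluated_without_mask = n_samples               # baseline: evaluate every marched sample
```

Only the masked fraction is fed to the (expensive) network; the rest keep their zero-initialized sigma/rgb.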

from ngp_pl.

ashawkey commented on June 15, 2024

Yes, I'm doing so! It seems the T_threshold also matters:

  • 1e-2: 33.59 FPS
  • 1e-4: 28.41 FPS
  • 1e-2 w/o masking (point 3): 33.30 FPS
  • 1e-4 w/o masking (point 3): 28.10 FPS

(The results are from a TITAN RTX)

By "w/o masking" I mean I modified the code to:

        sigmas, rgbs = model(xyzs, dirs)  # evaluate every sample, no valid_mask
        sigmas = sigmas.contiguous().float()
        rgbs = rgbs.contiguous().float()
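For intuition on why the looser threshold is faster: a ray terminates once its accumulated transmittance falls below T_threshold, so 1e-2 stops rays earlier than 1e-4. A toy sketch (the constant per-sample alpha is made up for illustration, not taken from either repo):

```python
# A ray stops marching once its transmittance T drops below T_threshold.
def steps_until_stop(alpha, T_threshold, max_steps=1024):
    T = 1.0
    for step in range(1, max_steps + 1):
        T *= 1.0 - alpha        # standard volume-rendering transmittance update
        if T < T_threshold:
            return step         # ray terminates here
    return max_steps
```

With alpha = 0.1, a 1e-2 threshold stops a ray after 44 samples while 1e-4 needs 88, which matches the direction of the FPS gap above.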

Some quick experiments on torch-ngp also show that points 2 and 3 may not be very effective.
I'm not very sure about the difference between PackedTensor and data_ptr for point 1, but the indexing overhead should be small.

Also, this notebook only measures rendering time, but the data-loading strategy also matters:
ngp_pl preprocesses images into rays, while torch-ngp generates rays on the fly (a time-memory trade-off).
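That trade-off can be sketched in a few lines (make_ray_dirs is illustrative, not either repo's actual API; it assumes a pinhole camera with focal length 1):

```python
import numpy as np

# Precompute rays once and keep them resident (ngp_pl style) vs.
# regenerate only the sampled rays each batch (torch-ngp style).
def make_ray_dirs(H, W):
    i, j = np.meshgrid(np.arange(W), np.arange(H), indexing="xy")
    dirs = np.stack([i - W / 2, j - H / 2, np.ones_like(i)], axis=-1)
    return dirs.reshape(-1, 3).astype(np.float32)

H = W = 4
all_dirs = make_ray_dirs(H, W)            # precomputed: H*W rays stay in memory
batch = np.array([0, 5, 10])              # pixels sampled this iteration
precomputed = all_dirs[batch]             # cheap indexing at train time
on_the_fly = make_ray_dirs(H, W)[batch]   # recomputed per batch, no big buffer
```

Same rays either way; one spends memory to save per-batch compute, the other the reverse.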

Finally, thanks for the information and help! It still seems difficult to match native instant-ngp's speed while keeping enough flexibility.

P.S. There is a small naming inconsistency in the released ckpts; it should be epoch=19_slim.ckpt to match the notebook.


kwea123 commented on June 15, 2024

I just noticed a big bottleneck in data loading. I fixed it, and now training is 1.5x faster. I will re-benchmark everything.


ashawkey commented on June 15, 2024

Thanks for the detailed explanation! I'll experiment on them later.


kwea123 commented on June 15, 2024

A notebook (test.ipynb) and a pretrained model are uploaded; you can use them to evaluate and measure the FPS.


SSground commented on June 15, 2024

Have you considered using volrend to improve inference speed?
https://github.com/sxyu/volrend
That is, would the output need to be converted into the PlenOctree format?
https://github.com/sxyu/plenoctree

