facebookresearch / nsvf
Open source code for the paper of Neural Sparse Voxel Fields.
License: MIT License
Describe the bug
I am trying to run the NeRF implementation using the same code base. However, I am getting a memory overrun issue. What should the configuration settings be?
To Reproduce
Steps to reproduce the behavior:
Configuration used for training:
export DATASET="/nitthilan/data/NSVF/Synthetic_NSVF/Spaceship/"
export SAVE="./spaceship_nerf_ckpt/"
export TRAIN_DIM="50x50"
export TRAIN_VIEWS="0..100"
export VALID_DIM="1x1"
export VALID_VIEWS="0..100"
export PRUNE_EVERY=2500
export VIEW_PER_BATCH=1
export PIXEL_PER_VIEW=2048
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -u train.py ${DATASET}
--user-dir fairnr
--task single_object_rendering
--train-views ${TRAIN_VIEWS} --view-resolution ${TRAIN_DIM}
--max-sentences 1 --view-per-batch ${VIEW_PER_BATCH} --pixel-per-view ${PIXEL_PER_VIEW}
--no-preload
--sampling-on-mask 1.0 --no-sampling-at-reader
--valid-views ${VALID_VIEWS} --valid-view-resolution ${VALID_DIM}
--valid-view-per-batch 1
--transparent-background "1.0,1.0,1.0" --background-stop-gradient
--arch nerf_base
--color-weight 128.0 --alpha-weight 1.0
--optimizer "adam" --adam-betas "(0.9, 0.999)"
--lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000
--criterion "srn_loss" --clip-norm 0.0
--num-workers 0
--seed 2
--save-interval-updates 500 --max-update 150000
--virtual-epoch-steps 500 --save-interval 1
--half-voxel-size-at "5000,25000,75000"
--reduce-step-size-at "5000,25000,75000"
--pruning-every-steps ${PRUNE_EVERY}
--keep-interval-updates 5 --keep-last-epochs 5
--log-format simple --log-interval 1
--save-dir ${SAVE}
--raymarching-tolerance 0.01
--tensorboard-logdir ${SAVE}/tensorboard
| tee -a $SAVE/train.log
Expected behavior
If I increase TRAIN_DIM="50x50" to anything larger, the memory consumed exceeds 25 GB. I am running it on a 4-GPU system. Ideally, the same configuration runs with the nsvf_base arch at a resolution of 800x800. Am I missing something?
Running script:
python -u /home/stud/kghasemi/.pycharm_helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 0.0.0.0 --port 41219 --file /tmp/kghasemi/GRF/train.py /tmp/kghasemi/data/Synthetic_NSVF/Wineholder --user-dir fairnr --task single_object_rendering --train-views 0..100 --view-resolution 200x200 --max-sentences 1 --view-per-batch 2 --pixel-per-view 64 --no-preload --sampling-on-mask 1.0 --no-sampling-at-reader --disable-validation --valid-view-resolution 200x200 --valid-views 100..200 --valid-view-per-batch 2 --transparent-background 1.0,1.0,1.0 --background-stop-gradient --arch nsvf_base --initial-boundingbox /tmp/kghasemi/data/Synthetic_NSVF/Wineholder/bbox.txt --raymarching-stepsize-ratio 0.125 --use-octree --discrete-regularization --color-weight 128.0 --alpha-weight 1.0 --optimizer adam --adam-betas "(0.9, 0.999)" --lr-scheduler polynomial_decay --total-num-update 150000 --lr 0.001 --clip-norm 0.0 --criterion srn_loss --num-workers 0 --seed 2 --save-interval-updates 500 --max-update 150000 --virtual-epoch-steps 50 --save-interval 1 --half-voxel-size-at 5000,25000,75000 --reduce-step-size-at 5000,25000,75000 --pruning-every-steps 2500 --keep-interval-updates 5 --log-format simple --log-interval 1 --tensorboard-logdir checkpoint/Wineholder/tensorboard/nsvf_basev2 --save-dir checkpoint/Wineholder/nsvf_basev2
When I choose --pixel-per-view 64, I get this error after a few epochs of training on Wineholder:
Traceback (most recent call last):
data_utils.py: get_uv File "/tmp/kghasemi/conda/lib/python3.8/contextlib.py", line 131, in __exit__
self.gen.throw(type, value, traceback)
File "/tmp/kghasemi/conda/lib/python3.8/site-packages/fairseq/logging/metrics.py", line 95, in aggregate
yield agg
File "/tmp/kghasemi/conda/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/tmp/kghasemi/GRF/fairnr_cli/train.py", line 184, in train
log_output = trainer.train_step(samples)
File "/tmp/kghasemi/conda/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/tmp/kghasemi/conda/lib/python3.8/site-packages/fairseq/trainer.py", line 457, in train_step
raise e
File "/tmp/kghasemi/conda/lib/python3.8/site-packages/fairseq/trainer.py", line 425, in train_step
loss, sample_size_i, logging_output = self.task.train_step(
File "/tmp/kghasemi/GRF/fairnr/tasks/neural_rendering.py", line 300, in train_step
return super().train_step(sample, model, criterion, optimizer, update_num, ignore_grad)
File "/tmp/kghasemi/conda/lib/python3.8/site-packages/fairseq/tasks/fairseq_task.py", line 351, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/tmp/kghasemi/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/tmp/kghasemi/GRF/fairnr/criterions/rendering_loss.py", line 42, in forward
net_output = model(**sample)
File "/tmp/kghasemi/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/tmp/kghasemi/GRF/fairnr/models/fairnr_model.py", line 77, in forward
results = self._forward(ray_start, ray_dir, **kwargs)
File "/tmp/kghasemi/GRF/fairnr/models/nsvf.py", line 79, in _forward
samples = self.encoder.ray_sample(intersection_outputs)
File "/tmp/kghasemi/GRF/fairnr/modules/encoder.py", line 355, in ray_sample
sampled_idx, sampled_depth, sampled_dists = uniform_ray_sampling(
File "/tmp/kghasemi/GRF/fairnr/clib/__init__.py", line 192, in forward
pts_idx = pts_idx.reshape(G, -1, P)
RuntimeError: shape '[256, -1, 60]' is invalid for input of size 15240
Process finished with exit code 1
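For reference, the reshape can only succeed when the total number of intersected points is divisible by G * P; here 256 * 60 = 15360, which does not divide 15240. A minimal sketch that reproduces just the shape arithmetic from the error message (nothing NSVF-specific):
import torch

G, P = 256, 60                 # values taken from the error message above
pts_idx = torch.zeros(15240)   # total element count reported in the error

# -1 cannot be inferred because 15240 is not a multiple of 256 * 60 = 15360,
# so this raises the same RuntimeError as in the traceback.
try:
    pts_idx.reshape(G, -1, P)
except RuntimeError as e:
    print(e)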
Hello! Thank you for open-sourcing this amazing work.
I have attempted to reproduce your results on a real, inward-facing dataset and struggled to achieve photo-realistic results. After a thorough process of elimination, my educated guess is that the problem lies in NSVF's robustness to slightly perturbed camera poses. Running on the same COLMAP camera poses, we observed that the original NeRF model is significantly more robust to camera pose inaccuracies.
To reproduce: Run colmap on synthetic Lego to get non-ground-truth camera poses. Train model with original configurations.
Here is an example:
Run colmap on synthetic Lego dataset, with exhaustive matching.
Now we compare training results on three sets of camera poses: 1) ground truth, 2) ground truth + additive Gaussian noise (std=0.01), 3) COLMAP camera poses in NSVF's coordinate-frame convention. We can see that the COLMAP camera poses do not produce a desirable result.
Here is a snapshot of the loss function (pink = ground-truth poses, blue = COLMAP). Note that training on COLMAP camera poses does not converge to the same photorealistic results regardless of training time.
Here is another example on a real, captured, inward-facing dataset:
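For concreteness, condition (2) above (ground truth plus additive Gaussian noise) can be produced with a small sketch like the following (my own code, assuming 4x4 camera-to-world poses stored as numpy arrays):
import numpy as np

def perturb_pose(c2w, std=0.01, seed=0):
    # additive Gaussian noise on the rotation block and translation column
    rng = np.random.default_rng(seed)
    noisy = c2w.copy()
    noisy[:3, :4] += rng.normal(scale=std, size=(3, 4))
    return noisy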
Replace _ext_src_root = "fairnr/clib"
by _ext_src_root = os.path.abspath("fairnr/clib")
in setup.py
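For clarity, the suggested change amounts to the following (a sketch; it assumes os is imported near the top of setup.py):
import os

# use an absolute path so the compiled CUDA extension sources are found
# regardless of the directory the editable install is invoked from
_ext_src_root = os.path.abspath("fairnr/clib")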
Is your feature request related to a problem? Please describe.
Would it be possible to share the code used to generate the pose.txt files for the different datasets?
Also if possible the script to generate the rendered images from 3D models?
I reviewed the related issues here, but no correct way has been found yet.
Can you please provide your code to get the values?
I'm struggling to reproduce your results with the provided images (especially for real, large objects using COLMAP, not the small, synthetic ones from Blender).
I have always failed to get reasonable rendering results; I only see very blurry images.
For example, in the Wineholder case,
COLMAP produces a reasonable result, so I can trust it.
But the bbox.txt computed using "https://github.com/yxie20/lego_nsvf/tree/master/poses" is:
-1.7405309358237018 0.8937903721375522 0.8837398050743352 2.0868807956960875 3.9037096850320445 3.3750371294069716 0.3239107072481569
and your provided bbox.txt is
-0.5884649 -0.44142154 -0.12279411 0.78653513 0.43357844 0.8772059 0.125.
It's totally different!
The camera intrinsics are somewhat similar:
888.0 0.0 400.0 0.0
0.0 888.0 400.0 0.0
0.0 0.0 1.0 0.0
0.0 0.0 0.0 1.0
VS
875.000000 400.000000 400.000000 0.
0. 0. 0.
0.
1.
800 800
Thank you!
Is your feature request related to a problem? Please describe.
I'm always frustrated when I can't try PyTorch/CUDA/Python things under Windows.
Describe the solution you'd like
What I want to happen: a smooth install on Windows.
Describe alternatives you've considered
Any alternative solution: a working way via conda or WSL Ubuntu.
Additional context
WSL Ubuntu on Windows 10 has problems with CUDA.
conda on Windows 10 has problems with VC++ 14 compilation, mostly on torchsearchsorted.
Did you compare the depth accuracy with NeRF? In my opinion your sampling strategy samples points closer to where the object is located, so intuitively your depth should be better. Is this correct?
I have tried to run this project's code on 4x RTX 2080 Ti (11 GB).
The original args, "--view-per-batch 4 --pixel-per-view 2048", cause a CUDA OOM error within just 2 iterations,
so I reduced the batch size to "--view-per-batch 4 --pixel-per-view 128", which works well for the first 5000 iterations,
and "--view-per-batch 2 --pixel-per-view 128" works well for the first 25000 iterations.
Both eventually cause an OOM error at the voxel-split step (just a guess). So I checked the memory-management code and did not find anything that releases PyTorch's unused cache, i.e. something like:
torch.cuda.empty_cache()
So I added this call to NSVFModel.clean_caches in fairnr/models/nsvf.py:
def clean_caches(self, reset=False):
    self.encoder.clean_runtime_caches()
    if reset:
        self.encoder.reset_runtime_caches()
    torch.cuda.empty_cache()  # release PyTorch's cached memory after the model has finished
This really helps me get through more split steps (but still not past the later splits, e.g. after 75000 iterations).
Before adding this line:
CUDA memory use: 4000 MB -> (voxel split) 8000 MB -> (voxel split) OOM error
After adding this line:
CUDA memory use: 4000 MB -> (voxel split) 6800 MB -> (voxel split) 9900 MB -> (voxel split) OOM error
I have not found any negative effect on the results yet.
I also tried other ways to solve the OOM problem, such as adding "--fp16" to turn on fp16 mode via the apex module (which is said to reduce memory use thanks to float16),
but this just causes the error I posted in issue #33.
If you are interested in running this code on other CUDA devices (especially ones without as much GPU memory as a 32 GB V100), this line and the bug report may be useful to you.
Thanks for replying.
Hi,
I'm trying to run the train_wineholder.sh
script on my machine. It works fine for the first 500 iterations, but immediately after the 500th iteration, it pauses, eventually throwing the following error related to tensors existing on different devices:
Start of training:
2020-11-16 10:31:25 | INFO | fairseq.distributed_utils | distributed init (rank 1): tcp://localhost:14275
2020-11-16 10:31:25 | INFO | fairseq.distributed_utils | distributed init (rank 0): tcp://localhost:14275
2020-11-16 10:31:27 | INFO | fairseq.distributed_utils | initialized host lightbox-desktop as rank 0
2020-11-16 10:31:27 | INFO | fairseq.distributed_utils | initialized host lightbox-desktop as rank 1
2020-11-16 10:31:27 | INFO | fairnr_cli.train | Namespace(L1=False, adam_betas='(0.9, 0.999)', adam_eps=1e-08, all_gather_list_size=16384, alpha_weight=1.0, arch='nsvf_base', background_depth=5.0, background_stop_gradient=True, best_checkpoint_metric='loss', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_suffix='', chunk_size=64, clip_norm=0.0, color_weight=128.0, cpu=False, criterion='srn_loss', curriculum=0, data='/code/nsvf/datasets/Synthetic_NSVF/Wineholder', data_buffer_size=10, dataset_impl=None, ddp_backend='c10d', density_embed_dim=128, depth_weight=0.0, depth_weight_decay=None, deterministic_step=False, device_id=0, disable_validation=False, discrete_regularization=True, distributed_backend='nccl', distributed_init_method='tcp://localhost:14275', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=2, distributed_wrapper='DDP', empty_cache_freq=0, end_learning_rate=0.0, eval_lpips=False, fast_stat_sync=False, feature_embed_dim=256, feature_layers=1, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, half_voxel_size_at='5000,25000,75000', initial_boundingbox='/code/nsvf/datasets/Synthetic_NSVF/Wineholder/bbox.txt', inputs_to_density='emb:6:32', inputs_to_texture='feat:0:256, ray:4', keep_best_checkpoints=-1, keep_interval_updates=5, keep_last_epochs=-1, load_depth=False, load_mask=False, localsgd_frequency=3, log_format='simple', log_interval=1, lr=[0.001], lr_scheduler='polynomial_decay', max_epoch=0, max_hits=60, max_sentences=1, max_sentences_valid=1, max_tokens=None, max_tokens_valid=None, max_update=150000, maximize_best_checkpoint_metric=False, memory_efficient_bf16=False, memory_efficient_fp16=False, min_color=-1, min_loss_scale=0.0001, min_lr=-1, model_parallel_size=1, no_background_loss=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_load_binary=False, no_preload=True, no_progress_bar=False, no_sampling_at_reader=True, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, nprocs_per_node=2, num_workers=0, object_id_path=None, optimizer='adam', optimizer_overrides='{}', output_valid=None, patience=-1, pixel_per_view=2048.0, power=1.0, profile=False, pruning_every_steps=2500, pruning_rerun_train_set=False, pruning_th=0.5, pruning_with_train_stats=False, quantization_config_path=None, raymarching_stepsize=0.01, raymarching_stepsize_ratio=0.125, raymarching_tolerance=0, reduce_step_size_at='5000,25000,75000', rendering_args=None, rendering_every_steps=None, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sampling_at_center=1.0, sampling_on_bbox=False, sampling_on_mask=1.0, sampling_patch_size=1, sampling_skipping_size=1, save_dir='/code/nsvf/checkpoints/Wineholder/nsvf_basev1', save_interval=1, save_interval_updates=500, scoring='bleu', seed=2, sentence_avg=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, stop_time_hours=0, subsample_valid=-1, task='single_object_rendering', tensorboard_logdir='/code/nsvf/checkpoints/Wineholder/tensorboard/nsvf_basev1', test_views='0', texture_embed_dim=256, texture_layers=3, threshold_loss_scale=None, tokenizer=None, total_num_update=150000, tpu=False, train_subset='train', train_views='0..100', 
transparent_background='1.0,1.0,1.0', update_freq=[1], use_bmuf=False, use_octree=True, use_old_adam=False, user_dir='fairnr', valid_chunk_size=64, valid_subset='valid', valid_view_per_batch=1, valid_view_resolution='800x800', valid_views='100..200', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, vgg_level=2, vgg_weight=0.0, view_per_batch=2, view_resolution='800x800', virtual_epoch_steps=5000, voxel_embed_dim=32, voxel_path=None, voxel_size=0.25, warmup_updates=0, weight_decay=0.0)
2020-11-16 10:31:27 | INFO | fairnr_cli.train | NSVFModel(
(reader): ImageReader()
(encoder): SparseVoxelEncoder(
(values): Embedding(1170, 32)
)
(field): RaidanceField(
(bg_color): BackgroundField()
(den_filters): ModuleDict(
(emb): NeRFPosEmbLinear(Cat(32, Sinusoidal (in=32, out=384, angular=False)))
)
(tex_filters): ModuleDict(
(feat): Identity()
(ray): NeRFPosEmbLinear(Sinusoidal (in=3, out=24, angular=True))
)
(feature_field): ImplicitField(
(net): Sequential(
(0): FCLayer(
(net): Sequential(
(0): Linear(in_features=416, out_features=256, bias=True)
(1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(2): ReLU(inplace=True)
)
)
(1): FCLayer(
(net): Sequential(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(2): ReLU(inplace=True)
)
)
(2): FCLayer(
(net): Sequential(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(2): ReLU(inplace=True)
)
)
)
)
(predictor): SignedDistanceField(
(hidden_layer): FCLayer(
(net): Sequential(
(0): Linear(in_features=256, out_features=128, bias=True)
(1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
(2): ReLU(inplace=True)
)
)
(output_layer): Linear(in_features=128, out_features=1, bias=True)
)
(renderer): TextureField(
(net): Sequential(
(0): FCLayer(
(net): Sequential(
(0): Linear(in_features=280, out_features=256, bias=True)
(1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(2): ReLU(inplace=True)
)
)
(1): FCLayer(
(net): Sequential(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(2): ReLU(inplace=True)
)
)
(2): FCLayer(
(net): Sequential(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(2): ReLU(inplace=True)
)
)
(3): FCLayer(
(net): Sequential(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(2): ReLU(inplace=True)
)
)
(4): Linear(in_features=256, out_features=3, bias=True)
)
)
)
(raymarcher): VolumeRenderer()
)
2020-11-16 10:31:27 | INFO | fairnr_cli.train | model nsvf_base, criterion SRNLossCriterion
2020-11-16 10:31:27 | INFO | fairnr_cli.train | num. model params: 582737 (num. trained: 582724)
2020-11-16 10:31:27 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2020-11-16 10:31:27 | INFO | fairseq.utils | rank 0: capabilities = 6.1 ; total memory = 7.929 GB ; name = GeForce GTX 1080
2020-11-16 10:31:27 | INFO | fairseq.utils | rank 1: capabilities = 6.1 ; total memory = 7.921 GB ; name = GeForce GTX 1080
2020-11-16 10:31:27 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2020-11-16 10:31:27 | INFO | fairnr_cli.train | training on 2 GPUs
2020-11-16 10:31:27 | INFO | fairnr_cli.train | max tokens per GPU = None and max sentences per GPU = 1
2020-11-16 10:31:27 | INFO | fairseq.trainer | no existing checkpoint found /code/nsvf/checkpoints/Wineholder/nsvf_basev1/checkpoint_last.pt
2020-11-16 10:31:27 | INFO | fairseq.trainer | loading train data for epoch 1
/code/nsvf/env/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:397: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
warnings.warn(
/code/nsvf/env/lib/python3.8/site-packages/torch/nn/parallel/distributed.py:397: UserWarning: The `check_reduction` argument in `DistributedDataParallel` module is deprecated. Please avoid using it.
warnings.warn(
Building EasyOctree done. total #nodes = 1881, terminal #nodes = 864 (time taken 0.254881 s)
Building EasyOctree done. total #nodes = 1881, terminal #nodes = 864 (time taken 0.260331 s)
/code/nsvf/env/lib/python3.8/site-packages/fairseq/utils.py:304: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
warnings.warn(
/code/nsvf/env/lib/python3.8/site-packages/fairseq/utils.py:304: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
Then at iter 500:
2020-11-16 10:41:12 | INFO | train_inner | epoch 001: 500 / 5000 loss=23.507, vox=0.125, stp=0.016, tvo=0.127, asf=68.201, ash=68.201, nvo=864, color=0.182, alpha=0.228, wps=2.1, ups=1.07, wpb=2, bsz=2, num_updates=500, lr=0.000996667, gnorm=171.876, train_wall=1, wall=464
Traceback (most recent call last):
File "train.py", line 20, in <module>
cli_main()
File "/code/nsvf/fairnr_cli/train.py", line 353, in cli_main
torch.multiprocessing.spawn(
File "/code/nsvf/env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/code/nsvf/env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/code/nsvf/env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/code/nsvf/env/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/code/nsvf/fairnr_cli/train.py", line 338, in distributed_main
main(args, init_distributed=True)
File "/code/nsvf/fairnr_cli/train.py", line 104, in main
should_end_training = train(args, trainer, task, epoch_itr)
File "/media/lightbox/Extra/anaconda/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/code/nsvf/fairnr_cli/train.py", line 204, in train
valid_losses = validate_and_save(args, trainer, task, epoch_itr, valid_subsets)
File "/code/nsvf/fairnr_cli/train.py", line 245, in validate_and_save
valid_losses = validate(args, trainer, task, epoch_itr, valid_subsets)
File "/code/nsvf/fairnr_cli/train.py", line 302, in validate
trainer.valid_step(sample)
File "/media/lightbox/Extra/anaconda/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/code/nsvf/env/lib/python3.8/site-packages/fairseq/trainer.py", line 631, in valid_step
raise e
File "/code/nsvf/env/lib/python3.8/site-packages/fairseq/trainer.py", line 615, in valid_step
_loss, sample_size, logging_output = self.task.valid_step(
File "/code/nsvf/fairnr/tasks/neural_rendering.py", line 306, in valid_step
images = model.visualize(sample, shape=0, view=0)
File "/code/nsvf/env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/code/nsvf/fairnr/models/fairnr_model.py", line 126, in visualize
images = {
File "/code/nsvf/fairnr/models/fairnr_model.py", line 127, in <dictcomp>
tag: recover_image(width=width, **images[tag])
File "/code/nsvf/fairnr/data/data_utils.py", line 264, in recover_image
img = ((img - min_val) / (max_val - min_val)).clamp(min=0, max=1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
The only change I've made to the training script is reducing --view-per-batch to 1. Do you have any idea what the issue might be?
I'm running this on Ubuntu 20.04 with two GTX GeForce 1080 GPUs, CUDA version 10.1. Let me know if I can provide any further info at this time! Thanks so much!
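For what it's worth, this kind of mismatch is usually fixed by moving the offending scalars onto the image tensor's device before the arithmetic. A hypothetical patch in the spirit of recover_image (my sketch, not the authors' fix):
import torch

def normalize_image(img, min_val, max_val):
    # keep the bounds on the same device/dtype as the image tensor
    min_val = torch.as_tensor(min_val, device=img.device, dtype=img.dtype)
    max_val = torch.as_tensor(max_val, device=img.device, dtype=img.dtype)
    return ((img - min_val) / (max_val - min_val)).clamp(min=0, max=1)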
Hi:
The script reported "CUDA kernel failed : invalid device function
void svo_intersect_point_kernel_wrapper(int, int, int, float, int, const float*, const float*, const float*, const int*, int*, float*, float*) at L:384 in fairnr/clib/src/intersect_gpu.cu" when I try to train a network. The detailed information is shown below:
2020-10-21 12:31:48 | INFO | fairnr_cli.train | model nsvf_base, criterion SRNLossCriterion
2020-10-21 12:31:48 | INFO | fairnr_cli.train | num. model params: 574865 (num. trained: 574852)
2020-10-21 12:31:50 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
2020-10-21 12:31:50 | INFO | fairseq.utils | rank 0: capabilities = 7.5 ; total memory = 10.761 GB ; name = GeForce RTX 2080 Ti
2020-10-21 12:31:50 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
2020-10-21 12:31:50 | INFO | fairnr_cli.train | training on 1 GPUs
2020-10-21 12:31:50 | INFO | fairnr_cli.train | max tokens per GPU = None and max sentences per GPU = 1
2020-10-21 12:31:50 | INFO | fairseq.trainer | no existing checkpoint found checkpoints/robot/checkpoint_last.pt
2020-10-21 12:31:50 | INFO | fairseq.trainer | loading train data for epoch 1
2020-10-21 12:31:50 | INFO | fairseq.trainer | NOTE: your device may support faster training with --fp16
Building EasyOctree done. total #nodes = 1505, terminal #nodes = 660 (time taken 0.214959 s)
CUDA kernel failed : invalid device function
void svo_intersect_point_kernel_wrapper(int, int, int, float, int, const float*, const float*, const float*, const int*, int*, float*, float*) at L:384 in fairnr/clib/src/intersect_gpu.cu
Do you have any idea how to solve such a problem?
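"invalid device function" usually means the fairnr/clib CUDA extension was compiled for a different GPU architecture than the one it runs on (here an RTX 2080 Ti, compute capability 7.5). A quick diagnostic using only standard PyTorch calls (not part of NSVF):
import torch

# print each visible GPU and its compute capability; compare this against
# the architectures the extension was actually compiled for
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"cuda:{i} {torch.cuda.get_device_name(i)} sm_{major}{minor}")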
I would like to run NSVF as a container with a dockerfile.
I started building my own but am having trouble building the project inside an nvidia docker container as I'm using WSL2 and the current build doesn't come with NVCC or debug tools to check the output.
I think the build scripts and Dockerfiles should match the build workflow for Alicevision Meshroom. https://github.com/alicevision/meshroom/tree/develop/docker
I still need to write the build-ubuntu.sh script, but any help or info on building this cleanly would be appreciated.
ARG UBUNTU_VERSION
ARG CUDA_VERSION
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
# args declared before FROM must be re-declared to be visible after it
ARG COMMAND
COPY . /NVSF/
WORKDIR /NVSF
RUN apt-get update \
&& apt-get install -y python3-pip python3 git \
&& cd /usr/local/bin \
&& ln -s /usr/bin/python3 python \
&& pip3 install --upgrade pip \
&& pip3 install torch \
&& pip3 install -r /NVSF/requirements.txt
CMD /bin/bash ${COMMAND}
The results are impressive.
How can I get camera poses and intrinsics for my custom photos?
I have already tried exporting the poses from Blender myself over the last two weeks.
But I found some issues in the process:
First, when I used code similar to the script shared by the NeRF team to export the camera poses, I found that all rays miss the voxels in the NSVF training code (causing an error), which means all rays go in the wrong direction. I dug into the code to figure out why, and found that the default camera direction in Blender points along the negative z-axis, which conflicts with the NSVF code: the NSVF training code sets the default depth to 1 when computing ray directions (in Blender the imaging depth is -1 because of the negative z-axis), so all ray directions point the opposite way and miss all the voxels.
Second, I compared the pose data for the same model between NeRF and NSVF, and found that the second and third columns of the cam2world matrices differ (multiplying those two columns of the NSVF poses by -1 turns them into the NeRF poses); this may be further evidence of the depth-setting issue mentioned above.
After I fixed the depth-setting bug, the training code ran normally, but compared with the official data shared by NSVF my results are not good enough, and during validation all images are somehow misaligned (leading to a very high validation loss). I have spent days trying to debug and understand this, and I still do not get good results in any of my attempts.
So, could you share the Blender export scripts (or those for another 3D application)? A rough sketch of the column flip I mean is given below.
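As a rough sketch of the column flip described above (my own reading of the convention difference, assuming 4x4 camera-to-world matrices; not code from this repo):
import numpy as np

def flip_pose_convention(c2w):
    # multiply the 2nd/3rd columns (camera y/z axes) of a cam2world matrix by -1,
    # converting between the two camera conventions described above
    out = c2w.copy()
    out[:3, 1] *= -1  # flip camera y axis
    out[:3, 2] *= -1  # flip camera z axis (view direction)
    return out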
Hi, can you please add a Google Colab? Thanks.
Describe the bug
I have tried your example script for evaluation:
DATA="Wineholder"
RES="800x800"
ARCH="nsvf_base"
SUFFIX="v1"
DATASET=/private/home/jgu/data/shapenet/release/Synthetic_NSVF/${DATA}
SAVE=/checkpoint/jgu/space/neuralrendering/new_release/$DATA
MODEL=$ARCH$SUFFIX
MODEL_PATH=$SAVE/$MODEL/checkpoint_last.pt
# start validating a trained model with target images.
# CUDA_VISIBLE_DEVICES=0 \
python validate.py ${DATASET} \
--user-dir fairnr \
--valid-views "200..400" \
--valid-view-resolution "800x800" \
--no-preload \
--task single_object_rendering \
--max-sentences 1 \
--valid-view-per-batch 1 \
--path ${MODEL_PATH} \
--model-overrides '{"chunk_size":1024,"raymarching_tolerance":0.01,"tensorboard_logdir":"","eval_lpips":True}' \
Using this script I could run an evaluation with my trained model. However, the reconstructed images are missing from TensorBoard. I noticed that you are passing tensorboard_logdir as an empty string in the --model-overrides option. I tried populating it, and I also added the --results-path option. Here you can find my script:
MODEL="nsvf_basev1"
MODEL_PATH=checkpoint/${DATA}/${MODEL}/checkpoint_best.pt
SAVE=eval/${DATA}/${MODEL}
NAME="${DATA}_${MODEL}"
TB_LOG=checkpoint/${DATA}/tensorboard/${MODEL}
python validate.py ${DATASET} \
--user-dir fairnr \
--valid-views "200..205" \
--valid-view-resolution "800x800" \
--no-preload \
--task single_object_rendering \
--max-sentences 1 \
--valid-view-per-batch 1 \
--path ${MODEL_PATH} \
--model-overrides "{'chunk_size' : 512,'raymarching_tolerance' : 0.01, 'tensorboard_logdir' : '${TB_LOG}', 'eval_lpips' : True}" \
--results-path ${SAVE} \
However, when running the script, I get this error:
| valid on 'valid' subset: 0%| | 0/200 [00:00<?, ?it/s]Building EasyOctree done. total #nodes = 11371, terminal #nodes = 5026 (time taken 1.524386 s)
Traceback (most recent call last):
File "validate.py", line 11, in <module>
cli_main()
File "/tmp/kghasemi/NSVF/fairnr_cli/validate.py", line 153, in cli_main
distributed_utils.call_main(args, main, override_args=override_args)
File "/tmp/kghasemi/conda/lib/python3.8/site-packages/fairseq/distributed_utils.py", line 189, in call_main
main(args, **kwargs)
File "/tmp/kghasemi/NSVF/fairnr_cli/validate.py", line 114, in main
_loss, _sample_size, log_output = task.valid_step(sample, model, criterion)
File "/tmp/kghasemi/NSVF/fairnr/tasks/neural_rendering.py", line 308, in valid_step
write_images(self.writer, images, self._num_updates['step'])
KeyError: 'step'
I searched this part of the code but could not figure out what the issue was:
def valid_step(self, sample, model, criterion):
    loss, sample_size, logging_output = super().valid_step(sample, model, criterion)
    model.add_eval_scores(logging_output, sample, model.cache, criterion, outdir=self.output_valid)
    if self.writer is not None:
        images = model.visualize(sample, shape=0, view=0)
        if images is not None:
            write_images(self.writer, images, self._num_updates['step'])
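One possible workaround, assuming the 'step' counter simply is not populated when validate.py is run standalone, is to fall back to 0 (a hypothetical patch, not the authors' fix):
if self.writer is not None:
    images = model.visualize(sample, shape=0, view=0)
    if images is not None:
        # fall back to step 0 if the update counter was never set during standalone validation
        write_images(self.writer, images, self._num_updates.get('step', 0))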
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The evaluation should not produce any errors, and the reconstructed images should appear in TensorBoard.
Desktop (please complete the following information):
Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-123-generic x86_64)
Describe the bug
I get an exception when trying to run render.py on the synthetic wine-holder example.
To Reproduce
Run the wine-holder example using the command in the README:
DATASET=/opt/Synthetic_NSVF/Wineholder
SAVE=/opt/output
python -u train.py ${DATASET} \
--user-dir fairnr \
--task single_object_rendering \
--train-views "0..100" --view-resolution "800x800" \
--max-sentences 1 --view-per-batch 4 --pixel-per-view 2048 \
--no-preload \
--sampling-on-mask 1.0 --no-sampling-at-reader \
--valid-views "100..200" --valid-view-resolution "400x400" \
--valid-view-per-batch 1 \
--transparent-background "1.0,1.0,1.0" --background-stop-gradient \
--arch nsvf_base \
--initial-boundingbox ${DATASET}/bbox.txt \
--use-octree \
--raymarching-stepsize-ratio 0.125 \
--discrete-regularization \
--color-weight 128.0 --alpha-weight 1.0 \
--optimizer "adam" --adam-betas "(0.9, 0.999)" \
--lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 \
--criterion "srn_loss" --clip-norm 0.0 \
--num-workers 0 \
--seed 2 \
--save-interval-updates 500 --max-update 150000 \
--virtual-epoch-steps 5000 --save-interval 1 \
--half-voxel-size-at "5000,25000,75000" \
--reduce-step-size-at "5000,25000,75000" \
--pruning-every-steps 2500 \
--keep-interval-updates 5 --keep-last-epochs 5 \
--log-format simple --log-interval 1 \
--save-dir ${SAVE} \
--tensorboard-logdir ${SAVE}/tensorboard \
| tee -a $SAVE/train.log
I was able to run for 15 checkpoints before it hung due to memory errors. I am able to export a ply with extract.py and it looks good.
Try to render from given camera poses with the command in the README:
MODEL_PATH=${SAVE}/checkpoint_best.pt
python render.py ${DATASET} \
--user-dir fairnr \
--task single_object_rendering \
--path ${MODEL_PATH} \
--model-overrides '{"chunk_size":512,"raymarching_tolerance":0.01}' \
--render-save-fps 24 \
--render-resolution "800x800" \
--render-camera-poses ${DATASET}/pose \
--render-views "200..400" \
--render-output ${SAVE}/output \
--render-output-types "color" "depth" "voxel" "normal" --render-combine-output \
--log-format "simple"
Results in this exception:
2021-01-08 21:46:18 | INFO | fairnr.renderer | rendering starts. fairnr BaseModel
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Building EasyOctree done. total #nodes = 10097, terminal #nodes = 4457 (time taken 2.363832 s)
Building EasyOctree done. total #nodes = 10097, terminal #nodes = 4457 (time taken 2.387693 s)
Building EasyOctree done. total #nodes = 10097, terminal #nodes = 4457 (time taken 2.516335 s)
Traceback (most recent call last):
File "render.py", line 11, in <module>
cli_main()
File "/src/NSVF/fairnr_cli/render_multigpu.py", line 137, in cli_main
distributed_utils.call_main(args, main)
File "/venv/lib/python3.7/site-packages/fairseq/distributed_utils.py", line 174, in call_main
args.distributed_world_size,
File "/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.7/multiprocessing/queues.py", line 105, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/venv/lib/python3.7/site-packages/fairseq/distributed_utils.py", line 156, in distributed_main
main(args, **kwargs)
File "/src/NSVF/fairnr_cli/render_multigpu.py", line 35, in main
return _main(args, sys.stdout)
File "/src/NSVF/fairnr_cli/render_multigpu.py", line 108, in _main
for i, sample in enumerate(t):
File "/venv/lib/python3.7/site-packages/fairseq/logging/progress_bar.py", line 245, in __iter__
for i, obj in enumerate(self.iterable, start=self.n):
File "/venv/lib/python3.7/site-packages/fairseq/data/iterators.py", line 61, in __iter__
for x in self.iterable:
File "/venv/lib/python3.7/site-packages/fairseq/data/iterators.py", line 517, in __next__
raise item
File "/venv/lib/python3.7/site-packages/fairseq/data/iterators.py", line 454, in run
item = next(self._source_iter)
File "/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
idx, data = self._get_data()
File "/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
success, data = self._try_get_data()
File "/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 152108) exited unexpectedly
I tried running with the other checkpoints but got the same thing. Is there a parameter I can tweak to avoid these memory issues?
Desktop (please complete the following information):
Hi Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt,
As discussed over email, we converted NeRF's NDC coordinates to work with your code and ran it on a scene from NeRF, "trex", with minor changes to the launch script. Is this the result that you expect?
Here is trex on NSVF.
https://drive.google.com/file/d/1EoUnrAEbN6vnnXy7WpWCLqhjex8AByrL/view?fbclid=IwAR27zUKFwvS3gATUMUfQH8Ij_n4EaYJpep94ejFRr2aLbIB9lBFL_FssUaE
Here is our trex dataset.
https://drive.google.com/drive/folders/1bKfaNuafR46qfdcxSiAlFyi3Ox7yIkxO?usp=sharing
Here is bash script to train.
https://gist.github.com/pureexe/52486aa8f62dff0a9a8770b5861e7deb
Here is bash script to render.
https://gist.github.com/pureexe/e9cda0b0adf671a84a1711f6e3e19c4a
Kind regards,
Pakkapon
Is your feature request related to a problem? Please describe.
We want to have a qualitative comparison in our work.
Describe the solution you'd like
We hope the official test-set results on Synthetic-NeRF, Synthetic-NSVF, BlendedMVS, and Tanks and Temples can be released for qualitative comparisons in future studies.
I have run the model on the NSVF datasets. The rendered views look good. But why is the training loss always far larger than the validation loss? Here is the parameter script.
python -u train.py ${Dataset}
--user-dir fairnr
--task single_object_rendering
--train-views "0..100" --view-resolution "800x800"
--max-sentences 1 --view-per-batch 1 --pixel-per-view 512
--no-preload
--sampling-on-mask 1.0 --no-sampling-at-reader
--valid-views "100..200" --valid-view-resolution "800x800"
--valid-view-per-batch 1
--transparent-background "1.0,1.0,1.0" --background-stop-gradient
--arch nsvf_base
--initial-boundingbox ${SOURCEDIR}/nsvf/Spaceship/bbox.txt
--use-octree
--raymarching-stepsize-ratio 0.125
--discrete-regularization
--color-weight 128.0 --alpha-weight 1.0
--optimizer "adam" --adam-betas "(0.9, 0.999)"
--lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000
--criterion "srn_loss" --clip-norm 0.0
--num-workers 0
--seed 2
--save-interval-updates 500 --max-update 150000
--virtual-epoch-steps 5000 --save-interval 1
--half-voxel-size-at "5000,25000,75000"
--reduce-step-size-at "5000,25000,75000"
--pruning-every-steps 2500
--keep-interval-updates 5 --keep-last-epochs 5
--log-format simple --log-interval 1
--save-dir ${Result}/${MODEL}
--tensorboard-logdir ${Result}/tensorboard/${MODEL}
| tee -a ${Result}/train.log
I set up the whole virtual environment on a cluster; the GPU is an NVIDIA V100. But I receive this error:
CUDA kernel failed : no kernel image is available for execution on the device
void aabb_intersect_point_kernel_wrapper(int, int, int, float, int, const float*, const float*, const float*, int*, float*, float*) at L:371 in fairnr/clib/src/intersect_gpu.cu
CUDA kernel failed : no kernel image is available for execution on the device
void aabb_intersect_point_kernel_wrapper(int, int, int, float, int, const float*, const float*, const float*, int*, float*, float*) at L:371 in fairnr/clib/src/intersect_gpu.cu
Traceback (most recent call last):
File "train.py", line 20, in <module>
cli_main()
File "/ibex/scratch/lir0b/NSVF/fairnr_cli/train.py", line 356, in cli_main
nprocs=torch.cuda.device_count(),
File "/home/lir0b/.conda/envs/nsvf/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/lir0b/.conda/envs/nsvf/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 112, in join
(error_index, exitcode)
Exception: process 0 terminated with exit code 255
scripts are:
#!/bin/bash
#SBATCH -N 1
#SBATCH --partition=batch
#SBATCH -J table11_d4
#SBATCH -o table11_d4.%J.out
#SBATCH -e table11_d4.%J.err
#SBATCH [email protected]
#SBATCH --mail-type=ALL
#SBATCH --time=23:00:00
#SBATCH --mem=40G
#SBATCH --gres=gpu:v100:2
#SBATCH --cpus-per-task=6
#SBATCH --constraint=[gpu]
#run the application:
#DATASET=./Synthetic_NSVF/gun
#SAVE=./exp/gun
#export PATH=$PATH:/ibex/scratch/lir0b/NSVF/fairnr/clib
#export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/ibex/scratch/lir0b/NSVF/fairnr/clib
#export LD_LIBRARY_PATH=/ibex/scratch/lir0b/NSVF:$LD_LIBRARY_PATH
#conda activate nsvf
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/ibex/scratch/lir0b/NSVF/fairnr/clib:/ibex/scratch/lir0b/NSVF
export PATH=$PATH:/ibex/scratch/lir0b/NSVF/fairnr/clib:/ibex/scratch/lir0b/NSVF
pip install --editable /ibex/scratch/lir0b/NSVF
python setup.py build_ext --inplace
python -u train.py ./Synthetic_NSVF/table11_d4 \
--user-dir fairnr \
--task single_object_rendering \
--train-views "1..35" --view-resolution "384x512" \
--max-sentences 1 --view-per-batch 4 --pixel-per-view 2048 \
--no-preload \
--sampling-on-mask 1.0 --no-sampling-at-reader \
--valid-views "35..44" --valid-view-resolution "384x512" \
--valid-view-per-batch 1 \
--transparent-background "1.0,1.0,1.0" --background-stop-gradient \
--arch nsvf_base \
--initial-boundingbox ./Synthetic_NSVF/table11_d4/bbox.txt \
--raymarching-stepsize-ratio 0.125 \
--discrete-regularization \
--color-weight 128.0 --alpha-weight 1.0 \
--optimizer "adam" --adam-betas "(0.9, 0.999)" \
--lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 \
--criterion "srn_loss" --clip-norm 0.0 \
--num-workers 0 \
--seed 2 \
--save-interval-updates 500 --max-update 150000 \
--virtual-epoch-steps 5000 --save-interval 1 \
--half-voxel-size-at "5000,25000,75000" \
--reduce-step-size-at "5000,25000,75000" \
--pruning-every-steps 2500 \
--keep-interval-updates 5 --keep-last-epochs 5 \
--log-format simple --log-interval 1 \
--save-dir ./exp/table11_d4 \
--tensorboard-logdir ./exp/table11_d4/tensorboard \
| tee -a ./exp/table11_d4/train.log
To Reproduce
Steps to reproduce the behavior:
Expected behavior
normal training
Screenshots
CUDA kernel failed : no kernel image is available for execution on the device
void svo_intersect_point_kernel_wrapper(int, int, int, float, int, const float*, const float*, const float*, const int*, int*, float*, float*) at L:384 in fairnr/clib/src/intersect_gpu.cu
CUDA kernel failed : no kernel image is available for execution on the device
void svo_intersect_point_kernel_wrapper(int, int, int, float, int, const float*, const float*, const float*, const int*, int*, float*, float*) at L:384 in fairnr/clib/src/intersect_gpu.cu
CUDA kernel failed : no kernel image is available for execution on the device
void svo_intersect_point_kernel_wrapper(int, int, int, float, int, const float*, const float*, const float*, const int*, int*, float*, float*) at L:384 in fairnr/clib/src/intersect_gpu.cu
CUDA kernel failed : no kernel image is available for execution on the device
void svo_intersect_point_kernel_wrapper(int, int, int, float, int, const float*, const float*, const float*, const int*, int*, float*, float*) at L:384 in fairnr/clib/src/intersect_gpu.cu
Traceback (most recent call last):
File "train.py", line 20, in <module>
cli_main()
File "/ibex/scratch/lir0b/NSVF/fairnr_cli/train.py", line 356, in cli_main
nprocs=torch.cuda.device_count(),
File "/home/lir0b/.conda/envs/nsvf/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/lir0b/.conda/envs/nsvf/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 112, in join
(error_index, exitcode)
Exception: process 1 terminated with exit code 255
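In case it helps: "no kernel image is available" typically means the extension binaries were not built for the V100's sm_70 architecture. One thing to try (my suggestion, not an official fix) is pinning the architecture list when rebuilding the extension:
import os
import subprocess

# TORCH_CUDA_ARCH_LIST is read by torch.utils.cpp_extension at build time;
# 7.0 is the compute capability of the V100
env = dict(os.environ, TORCH_CUDA_ARCH_LIST="7.0")
subprocess.run(["python", "setup.py", "build_ext", "--inplace"], check=True, env=env)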
Describe the bug
When we run the code on the Bike dataset, it exits unexpectedly: the DataLoader process gets terminated.
To Reproduce
Command I used for running:
python -u train.py ${DATASET} --user-dir fairnr --task single_object_rendering --train-views "0..100" --view-resolution "800x800" --max-sentences 1 --view-per-batch 2 --pixel-per-view 2048 --no-preload --sampling-on-mask 1.0 --no-sampling-at-reader --valid-views "100..200" --valid-view-resolution "400x400" --valid-view-per-batch 1 --transparent-background "1.0,1.0,1.0" --background-stop-gradient --arch nsvf_base --initial-boundingbox ${DATASET}/bbox.txt --raymarching-stepsize-ratio 0.125 --discrete-regularization --color-weight 128.0 --alpha-weight 1.0 --optimizer "adam" --adam-betas "(0.9, 0.999)" --lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 --criterion "srn_loss" --clip-norm 0.0 --num-workers 2 --seed 2 --save-interval-updates 500 --max-update 150000 --virtual-epoch-steps 5000 --save-interval 1 --half-voxel-size-at "5000,25000,75000" --reduce-step-size-at "5000,25000,75000" --pruning-every-steps 2500 --keep-interval-updates 5 --keep-last-epochs 5 --log-format simple --log-interval 1 --save-dir ${SAVE} --tensorboard-logdir ${SAVE}/tensorboard | tee -a $SAVE/train.log
Probably this is something minor.
Regards,
K. J. Nitthilan
I'm having trouble running the training scripts after installing the repo.
I've followed the steps:
pip install -r requirements.txt
pip install --editable ./
And then tried running train_wineholder.sh (I've also tried python train.py without arguments).
I get the following error:
Traceback (most recent call last):
File "train.py", line 7, in <module>
from fairnr_cli.train import cli_main
File "/home/david/repos/NSVF/fairnr_cli/train.py", line 26, in <module>
from fairnr import ResetTrainerException
File "/home/david/repos/NSVF/fairnr/__init__.py", line 11, in <module>
from . import data, tasks, models, modules, criterions
File "/home/david/repos/NSVF/fairnr/data/__init__.py", line 6, in <module>
from .shape_dataset import (
File "/home/david/repos/NSVF/fairnr/data/shape_dataset.py", line 14, in <module>
from . import data_utils, geometry, trajectory
File "/home/david/repos/NSVF/fairnr/data/geometry.py", line 12, in <module>
from fairnr.clib._ext import build_octree
File "/home/david/repos/NSVF/fairnr/clib/__init__.py", line 28, in <module>
import fairnr.clib._ext as _ext
AttributeError: module 'fairnr' has no attribute 'clib'
I didn't have any trouble running pip install --editable ./ since I get the following output:
Obtaining file:///home/david/repos/NSVF
Installing collected packages: fairnr
Running setup.py develop for fairnr
Successfully installed fairnr-0.0.0
But looking at the output from running the training scripts, it seems like there's a build issue with fairnr.clib; does anyone know why? I'm running this on Ubuntu 18.04 with CUDA 10.2. (I also tried CUDA 10.0 and get the same error.)
Can you please add sample script examples of these to your docs:
Hi, I am trying to train on the ScanNet dataset with a voxel map as input, but I encounter an "out of bound" error when running the build_octree function. I was wondering, do you have an example of how we should train with a voxel map as input? Thanks.
raymarching_tolerance - Is this field used only during rendering? Can it be used during training too, i.e., to reduce the number of steps over which the sigma value is accumulated?
Is the BackgroundField used during training? I do not see it being used. How is the background color taken care of during the training process?
Hi, thanks for sharing the great work! I wonder if you can provide a rough estimate of how many samples are used to evaluate the MLP inside each sparse voxel, and how many sparse voxels a ray typically intersect with? Thank you!
Hi,
I'm trying to run NSVF on my own training set but the end result (even after 2500 iterations) is just blank, white images. I am assuming there is something wrong with either the way I've formatted my camera extrinsics or the training parameters. I was wondering if anyone had any insights here. For context, the scene is inward-facing (taken from our scanner, which has 27 DSLR cameras positioned around the object in a hemispherical arrangement). Here is what one of the images from our dataset looks like:
Our extrinsics (and intrinsics) are taken directly from OpenCV's camera calibration routine, so they follow "OpenCV-style matrix conventions," which I believe is what is used by NSVF. I invert each extrinsic matrix to convert it to a "camera-to-world" matrix. I wasn't able to find any additional information about matrix formatting in the README, but does this sound correct? An example of one of our poses is:
-5.104647941864430827e-01 -8.207758808463482686e-03 -8.598594807243422622e-01 7.951983207748324345e-01
6.956707447815645429e-01 -5.917033288830939597e-01 -4.073443082255215897e-01 1.741095097774456368e+00
-5.054383332823660924e-01 -8.061140138243517717e-01 3.077536156810128376e-01 6.316381418571048556e+01
0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 1.000000000000000000e+00
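For reference, the inversion described above, assuming a standard OpenCV world-to-camera extrinsic (x_cam = R @ x_world + t), would look like this (my own sketch, not NSVF code):
import numpy as np

def extrinsic_to_c2w(R, t):
    # invert [R | t] analytically: the rotation transposes, the camera center is -R^T t
    c2w = np.eye(4)
    c2w[:3, :3] = R.T
    c2w[:3, 3] = -R.T @ t
    return c2w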
And we use a bounding box with the following dimensions:
bbox = np.array([[
    -15, 4.0, -15,  # min corner
    15, 25, 15,     # max corner
    1.0             # initial voxel size
]])
Our training script is basically copy+pasted from the train_jade.sh example, but I will paste it below for reference.
I'm wondering if there is something obvious that I've missed in pre-processing our dataset and/or the training script that would cause NSVF to produce completely blank frames. I've tried a number of adjustments, but I'd love to hear any suggestions anyone might have.
Thanks so much for making this project available!
DATA="clamp_alpha"
RES="500x750"
ARCH="nsvf_base"
SUFFIX="v1"
DATASET=/home/ec2-user/SageMaker/${DATA}
SAVE=/home/ec2-user/SageMaker/checkpoints/$DATA
MODEL=$ARCH$SUFFIX
mkdir -p $SAVE/$MODEL
# start training locally
python train.py ${DATASET} \
--user-dir fairnr \
--task single_object_rendering \
--train-views "0..100" \
--view-resolution $RES \
--max-sentences 1 \
--view-per-batch 1 \
--pixel-per-view 128 \
--no-preload \
--sampling-on-mask 1.0 --no-sampling-at-reader \
--valid-view-resolution $RES \
--valid-views "100..200" \
--valid-view-per-batch 1 \
--transparent-background "1.0,1.0,1.0" \
--background-stop-gradient \
--arch $ARCH \
--initial-boundingbox ${DATASET}/bbox.txt \
--raymarching-stepsize-ratio 0.125 \
--use-octree \
--discrete-regularization \
--color-weight 128.0 \
--alpha-weight 1.0 \
--optimizer "adam" \
--adam-betas "(0.9, 0.999)" \
--lr-scheduler "polynomial_decay" \
--total-num-update 150000 \
--lr 0.001 \
--clip-norm 0.0 \
--criterion "srn_loss" \
--num-workers 0 \
--seed 2 \
--save-interval-updates 500 --max-update 150000 \
--virtual-epoch-steps 5000 --save-interval 1 \
--half-voxel-size-at "5000,25000,75000" \
--reduce-step-size-at "5000,25000,75000" \
--pruning-every-steps 2500 \
--keep-interval-updates 5 \
--log-format simple --log-interval 1 \
--tensorboard-logdir ${SAVE}/tensorboard/${MODEL} \
--save-dir ${SAVE}/${MODEL}
I've failed to train large objects like Barn. Can you share the training .sh file with us?
Hi, thanks for sharing this great work.
The code for the Maria sequence is missing from the repo. Could you please release the code and the data for the Maria sequence?
Thank you very much.
First of all, thanks for your excellent work.
I'm working on my own dataset for free-viewpoint rendering and am interested in your work. Since the RGB, pose, and intrinsics data are prepared, how can I get the bbox data in the same style as your datasets? (A guessed sketch of how such a file could be generated follows below.)
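I am not aware of an official script, but judging from the bbox.txt files shipped with the datasets (min corner, max corner, and initial voxel size on a single line), one plausible way to generate it from a rough point cloud of the scene is the following (my guess, not the authors' tool; the margin and voxel count are arbitrary choices):
import numpy as np

def write_bbox(points, path, margin=0.05, n_voxels=64):
    # points: (N, 3) rough scene geometry, e.g. a filtered COLMAP sparse cloud
    lo = points.min(axis=0) - margin
    hi = points.max(axis=0) + margin
    voxel_size = (hi - lo).max() / n_voxels  # last value on the bbox.txt line
    row = np.concatenate([lo, hi, [voxel_size]])
    np.savetxt(path, row[None], fmt="%.6f")  # xmin ymin zmin xmax ymax zmax voxel_size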
Describe the bug
I am trying to run your project. When running pip install --editable ./ or python setup.py build_ext --inplace I am getting the following output:
Obtaining file:///tmp/kghasemi/NSVF
Installing collected packages: fairnr
Running setup.py develop for fairnr
ERROR: Command errored out with exit status 1:
command: /tmp/kghasemi/conda/envs/nsvf-env/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/kghasemi/NSVF/setup.py'"'"'; __file__='"'"'/tmp/kghasemi/NSVF/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
cwd: /tmp/kghasemi/NSVF/
Complete output (23 lines):
running develop
running egg_info
writing fairnr.egg-info/PKG-INFO
writing dependency_links to fairnr.egg-info/dependency_links.txt
writing entry points to fairnr.egg-info/entry_points.txt
writing top-level names to fairnr.egg-info/top_level.txt
reading manifest file 'fairnr.egg-info/SOURCES.txt'
writing manifest file 'fairnr.egg-info/SOURCES.txt'
running build_ext
building 'fairnr.clib._ext' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/fairnr
creating build/temp.linux-x86_64-3.7/fairnr/clib
creating build/temp.linux-x86_64-3.7/fairnr/clib/src
gcc -pthread -B /tmp/kghasemi/conda/envs/nsvf-env/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/tmp/kghasemi/conda/envs/nsvf-env/lib/python3.7/site-packages/torch/include -I/tmp/kghasemi/conda/envs/nsvf-env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/tmp/kghasemi/conda/envs/nsvf-env/lib/python3.7/site-packages/torch/include/TH -I/tmp/kghasemi/conda/envs/nsvf-env/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-10.1/include -I/tmp/kghasemi/conda/envs/nsvf-env/include/python3.7m -c fairnr/clib/src/octree.cpp -o build/temp.linux-x86_64-3.7/fairnr/clib/src/octree.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from fairnr/clib/src/../include/utils.h:7:0,
from fairnr/clib/src/octree.cpp:7:
/tmp/kghasemi/conda/envs/nsvf-env/lib/python3.7/site-packages/torch/include/ATen/cuda/CUDAContext.h:7:10: fatal error: cublas_v2.h: No such file or directory
#include <cublas_v2.h>
^~~~~~~~~~~~~
compilation terminated.
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /tmp/kghasemi/conda/envs/nsvf-env/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/kghasemi/NSVF/setup.py'"'"'; __file__='"'"'/tmp/kghasemi/NSVF/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.
To Reproduce
Steps to reproduce the behavior:
1. git clone https://github.com/facebookresearch/NSVF.git
2. cd NSVF
3. conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch
4. pip install -r requirements.txt
5. pip install --editable ./
If you don't execute step 3, you will get this error:
ERROR: Command errored out with exit status 1:
command: /tmp/kghasemi/conda/envs/nsvf-env/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-09c9r6sr/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-09c9r6sr/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-cn8dm7pt
cwd: /tmp/pip-req-build-09c9r6sr/
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-09c9r6sr/setup.py", line 2, in <module>
from torch.utils.cpp_extension import BuildExtension, CUDA_HOME
ModuleNotFoundError: No module named 'torch'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Expected behavior
When I execute find /usr/local/ -name cublas_v2.h, I get /usr/local/cuda-10.2/targets/x86_64-linux/include/cublas_v2.h, which means the header exists.
Therefore I should not get this error.
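One thing worth checking (my suggestion): the compile command above uses -I/usr/local/cuda-10.1/include, while the header you found lives under cuda-10.2. The include path comes from the CUDA_HOME that torch.utils.cpp_extension detects, so verifying it, and exporting CUDA_HOME=/usr/local/cuda-10.2 before rebuilding if it points elsewhere, may help:
# quick check of which CUDA installation the build will pick up
from torch.utils.cpp_extension import CUDA_HOME
print(CUDA_HOME)  # if this is /usr/local/cuda-10.1 while cublas_v2.h lives under cuda-10.2,
                  # export CUDA_HOME=/usr/local/cuda-10.2 and re-run `pip install --editable ./`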
Desktop (please complete the following information):
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic
Env
cudatoolkit 10.1.243 h6bb024c_0
pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
lpips-pytorch latest pypi_0 pypi
pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
torchvision 0.5.0 py37_cu101 pytorch
ipython 7.19.0 pypi_0 pypi
ipython-genutils 0.2.0 pypi_0 pypi
opencv-python 4.2.0.32 pypi_0 pypi
python 3.7.9 h7579374_0
python-dateutil 2.8.1 pypi_0 pypi
Could you please provide the conda environment file you are using?
Describe the bug
Hi! I noticed you have implemented a VGG loss and patch sampling for NeRF; could you elaborate on how to use it?
Also, did you adopt this loss in your paper?
To Reproduce
add
--vgg-weight 1e-4 \
--sampling-patch-size 48
in the jade training config.
Expected behavior
This doesn't work right now; all_results is None after setting this.
I have been thinking about this method of lightfield synthesis and came to the same conclusion about sparse voxel grids with Taichi (though I haven't managed to implement it).
My original thought was to take the outermost voxels of each object, unwrap the faces, and differentiably render with a GAN the vector displacement / depth / raw light and PBR textures for each voxel. You could tag voxels with object detection (PV-RCNN or similar).
Also, being able to define +inf as a skydome to paint on with spherical-harmonics differentiable rendering would be required for breaking down the lighting components, along with a store of light sources past a threshold for lighting estimation.
Additionally a global frame or time step would be handy for making a camera rig that can record voxel video with n cameras and global shutter sync. Being able to update only specific voxels as needed if changed beyond a threshold will help with sparse computation.
Ideally the output of this pipeline could be a voxelized lightfield projection with material onto a standard cubified obj or alembic sequence.
I have already installed the nvidia/apex module in my environment (which your project README says is optional).
When I try to add the arg "--fp16" to the train script:
python -u train.py ${DATASET} \
... \
--fp16 \
... \
--tensorboard-logdir ${SAVE}/tensorboard \
| tee -a $SAVE/train.log
Some errors occur; the main error report is about c10::Error:
...
terminate called after throwing an instance of 'c10::Error'
...
Something similar to fairseq issue #1683 (closed with no response).
I tried to find ways to solve this, such as adding the arg "--ddp-backend=no_c10d", but this just causes the same error.
I haven't read all of the project's main code, but I guess you are more familiar with this problem, so I am posting this issue.
Thanks for replying.
By the way: training without "--fp16" always works fine, and the environment almost exactly matches the requirements file in the README.
Describe the bug
There are build failures:
$ pip install --editable ./
Obtaining file:///home/user/dev/NSVF
Installing collected packages: fairnr
Running setup.py develop for fairnr
ERROR: Command errored out with exit status 1:
command: /home/user/miniconda3/envs/python37/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/user/dev/NSVF/setup.py'"'"'; __file__='"'"'/home/user/dev/NSVF/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
cwd: /home/user/dev/NSVF/
Complete output (120 lines):
running develop
running egg_info
writing fairnr.egg-info/PKG-INFO
writing dependency_links to fairnr.egg-info/dependency_links.txt
writing entry points to fairnr.egg-info/entry_points.txt
writing top-level names to fairnr.egg-info/top_level.txt
reading manifest file 'fairnr.egg-info/SOURCES.txt'
writing manifest file 'fairnr.egg-info/SOURCES.txt'
running build_ext
building 'fairnr.clib._ext' extension
Emitting ninja build file /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[1/6] c++ -MMD -MF /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect.o.d -pthread -B /home/user/miniconda3/envs/python37/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/intersect.cpp -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect.o
c++ -MMD -MF /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect.o.d -pthread -B /home/user/miniconda3/envs/python37/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/intersect.cpp -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/home/user/dev/NSVF/fairnr/clib/src/intersect.cpp:6:10: fatal error: intersect.h: No such file or directory
6 | #include "intersect.h"
| ^~~~~~~~~~~~~
compilation terminated.
[2/6] c++ -MMD -MF /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/octree.o.d -pthread -B /home/user/miniconda3/envs/python37/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/octree.cpp -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/octree.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/octree.o
c++ -MMD -MF /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/octree.o.d -pthread -B /home/user/miniconda3/envs/python37/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/octree.cpp -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/octree.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/home/user/dev/NSVF/fairnr/clib/src/octree.cpp:6:10: fatal error: octree.h: No such file or directory
6 | #include "octree.h"
| ^~~~~~~~~~
compilation terminated.
[3/6] c++ -MMD -MF /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/binding.o.d -pthread -B /home/user/miniconda3/envs/python37/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/binding.cpp -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/binding.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/binding.o
c++ -MMD -MF /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/binding.o.d -pthread -B /home/user/miniconda3/envs/python37/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/binding.cpp -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/binding.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/home/user/dev/NSVF/fairnr/clib/src/binding.cpp:6:10: fatal error: intersect.h: No such file or directory
6 | #include "intersect.h"
| ^~~~~~~~~~~~~
compilation terminated.
[4/6] c++ -MMD -MF /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/sample.o.d -pthread -B /home/user/miniconda3/envs/python37/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/sample.cpp -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/sample.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/sample.o
c++ -MMD -MF /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/sample.o.d -pthread -B /home/user/miniconda3/envs/python37/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/sample.cpp -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/sample.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/home/user/dev/NSVF/fairnr/clib/src/sample.cpp:6:10: fatal error: sample.h: No such file or directory
6 | #include "sample.h"
| ^~~~~~~~~~
compilation terminated.
[5/6] /usr/local/cuda-11.0/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/sample_gpu.o.d -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/sample_gpu.cu -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/sample_gpu.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 -std=c++14
FAILED: /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/sample_gpu.o
/usr/local/cuda-11.0/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/sample_gpu.o.d -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/sample_gpu.cu -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/sample_gpu.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 -std=c++14
/home/user/dev/NSVF/fairnr/clib/src/sample_gpu.cu:11:10: fatal error: cuda_utils.h: No such file or directory
11 | #include "cuda_utils.h"
| ^~~~~~~~~~~~~~
compilation terminated.
[6/6] /usr/local/cuda-11.0/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect_gpu.o.d -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/intersect_gpu.cu -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect_gpu.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 -std=c++14
FAILED: /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect_gpu.o
/usr/local/cuda-11.0/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect_gpu.o.d -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/intersect_gpu.cu -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect_gpu.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 -std=c++14
/home/user/dev/NSVF/fairnr/clib/src/intersect_gpu.cu:11:10: fatal error: cuda_utils.h: No such file or directory
11 | #include "cuda_utils.h"
| ^~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1673, in _run_ninja_build
env=env)
File "/home/user/miniconda3/envs/python37/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/user/dev/NSVF/setup.py", line 35, in <module>
'fairnr-train = fairseq_cli.train:cli_main'
File "/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/home/user/miniconda3/envs/python37/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/home/user/miniconda3/envs/python37/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/home/user/miniconda3/envs/python37/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/setuptools/command/develop.py", line 34, in run
self.install_for_development()
File "/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/setuptools/command/develop.py", line 136, in install_for_development
self.run_command('build_ext')
File "/home/user/miniconda3/envs/python37/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/user/miniconda3/envs/python37/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
_build_ext.build_ext.run(self)
File "/home/user/miniconda3/envs/python37/lib/python3.7/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 708, in build_extensions
build_ext.build_extensions(self)
File "/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
_build_ext.build_ext.build_extensions(self)
File "/home/user/miniconda3/envs/python37/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
self._build_extensions_serial()
File "/home/user/miniconda3/envs/python37/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
self.build_extension(ext)
File "/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
_build_ext.build_extension(self, ext)
File "/home/user/miniconda3/envs/python37/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension
depends=ext.depends)
File "/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 538, in unix_wrap_ninja_compile
with_cuda=with_cuda)
File "/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1359, in _write_ninja_file_and_compile_objects
error_prefix='Error compiling objects for extension')
File "/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1683, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
----------------------------------------
ERROR: Command errored out with exit status 1: /home/user/miniconda3/envs/python37/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/user/dev/NSVF/setup.py'"'"'; __file__='"'"'/home/user/dev/NSVF/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.
What is peculiar is that copy-pasting individual build commands, such as the following one, compiles correctly on its own.
$ c++ -MMD -MF /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect.o.d -pthread -B /home/user/miniconda3/envs/python37/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/TH -I/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda-11.0/include -I/home/user/miniconda3/envs/python37/include/python3.7m -c -c /home/user/dev/NSVF/fairnr/clib/src/intersect.cpp -o /home/user/dev/NSVF/build/temp.linux-x86_64-3.7/fairnr/clib/src/intersect.o -O2 -Ifairnr/clib/include -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/ATen/Parallel.h:140,
[...]
from /home/user/dev/NSVF/fairnr/clib/src/intersect.cpp:6:
/home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/ATen/ParallelOpenMP.h:83: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
83 | #pragma omp parallel for if ((end - begin) >= grain_size)
|
In file included from /home/user/miniconda3/envs/python37/lib/python3.7/site-packages/torch/include/c10/core/DeviceType.h:8,
[...]
from /home/user/dev/NSVF/fairnr/clib/src/intersect.cpp:6:
/home/user/dev/NSVF/fairnr/clib/src/intersect.cpp: In function ‘std::tuple<at::Tensor, at::Tensor, at::Tensor> ball_intersect(at::Tensor, at::Tensor, at::Tensor, float, int)’:
fairnr/clib/include/utils.h:12:24: warning: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Wdeprecated-declarations]
12 | TORCH_CHECK(x.type().is_cuda(), #x " must be a CUDA tensor"); \
[...]
The individual build command finishes with warnings but without errors.
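One possible explanation (my assumption, not confirmed by the maintainers): torch's ninja backend compiles from the build/temp.* directory, so the relative flag -Ifairnr/clib/include no longer resolves there, while the same command pasted into a shell at the repository root does find the headers. A rough sketch of how the extension declaration in setup.py could point at an absolute include directory instead (names and structure assumed, not verified against the repo):
import glob
import os
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

# absolute path so the headers are found regardless of ninja's working directory
_include = os.path.join(os.path.dirname(os.path.abspath(__file__)), "fairnr", "clib", "include")

setup(
    name="fairnr",
    ext_modules=[
        CUDAExtension(
            name="fairnr.clib._ext",
            sources=glob.glob("fairnr/clib/src/*.cpp") + glob.glob("fairnr/clib/src/*.cu"),
            extra_compile_args={
                "cxx": ["-O2", "-I" + _include],
                "nvcc": ["-O2", "-I" + _include],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)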
To Reproduce
Following the README, at the installation step, running either pip install --editable ./ or python setup.py build_ext --inplace results in the build failures detailed above.
Expected behavior
Build should pass.
Desktop (please complete the following information):
Ubuntu 18, CUDA 11.3, g++ version 9.3.0.
Hello! Can you provide the .blend files for Synthetic-NSVF? Thank you and stay safe!
Describe the bug
I've been able to get NSVF to successfully run locally on the supplied datasets, but I'm running into errors when using my own custom dataset. I converted my dataset to match the format described in the README and set the bounding box dimensions to the minima and maxima of the translation coordinates among all extrinsic camera matrices (pose matrices). I calculated the voxel size according to the volume of these boundaries.
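For reference, here is a minimal sketch of the bounding-box setup described above (hypothetical helper code; it assumes 4x4 camera-to-world pose files and the bbox.txt layout of min corner, max corner, and initial voxel size described in the README):
import glob
import numpy as np

# load 4x4 camera-to-world pose matrices and collect the camera centers
poses = [np.loadtxt(f) for f in sorted(glob.glob("pose/*.txt"))]
centers = np.stack([p[:3, 3] for p in poses])

box_min = centers.min(axis=0)
box_max = centers.max(axis=0)

# derive an initial voxel size from the box volume (assumption: roughly 64 voxels per axis)
voxel_size = np.cbrt(np.prod(box_max - box_min)) / 64.0

row = np.concatenate([box_min, box_max, [voxel_size]])
np.savetxt("bbox.txt", row[None], fmt="%.6f")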
Running train.py with the supplied arguments results in the following error:
File "train.py", line 20, in <module>
cli_main()
File "/home/ubuntu/CSM/NSVF/fairnr_cli/train.py", line 373, in cli_main
main(args)
File "/home/ubuntu/CSM/NSVF/fairnr_cli/train.py", line 104, in main
should_end_training = train(args, trainer, task, epoch_itr)
File "/usr/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/ubuntu/CSM/NSVF/fairnr_cli/train.py", line 181, in train
log_output = trainer.train_step(samples)
File "/usr/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/ubuntu/.local/lib/python3.7/site-packages/fairseq/trainer.py", line 431, in train_step
ignore_grad=is_dummy_batch,
File "/home/ubuntu/CSM/NSVF/fairnr/tasks/neural_rendering.py", line 300, in train_step
return super().train_step(sample, model, criterion, optimizer, update_num, ignore_grad)
File "/home/ubuntu/.local/lib/python3.7/site-packages/fairseq/tasks/fairseq_task.py", line 351, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/CSM/NSVF/fairnr/criterions/rendering_loss.py", line 42, in forward
net_output = model(**sample)
File "/home/ubuntu/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/ubuntu/CSM/NSVF/fairnr/models/fairnr_model.py", line 77, in forward
results = self._forward(ray_start, ray_dir, **kwargs)
File "/home/ubuntu/CSM/NSVF/fairnr/models/nsvf.py", line 91, in _forward
all_results['missed'] = fill_in((fullsize, ), hits, all_results['missed'], 1.0).view(S, V, P)
File "/home/ubuntu/CSM/NSVF/fairnr/data/geometry.py", line 306, in fill_in
output = input.new_ones(*shape) * initial
AttributeError: 'NoneType' object has no attribute 'new_ones'
I investigated further and determined that this means no rays are intersecting with the voxels during training. I assume that this is due to an issue with my bounding box setup. To recreate this, here is my dataset.
Is there an issue with the way I'm using my camera pose matrices to set my bounding box dimensions? Are there other steps I could take to ensure that at least some intersections are occurring between the rays and voxels?
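One way I would sanity-check this (a hedged debugging sketch, not part of fairnr) is to cast each camera's central viewing ray and test it against the bounding box from bbox.txt with a standard slab test; if none of these rays hit the box, the bbox or the pose convention is probably off:
import glob
import numpy as np

def ray_aabb_hit(origin, direction, box_min, box_max, eps=1e-9):
    # slab test: returns True if the ray intersects the axis-aligned box
    inv = 1.0 / np.where(np.abs(direction) < eps, eps, direction)
    t0 = (box_min - origin) * inv
    t1 = (box_max - origin) * inv
    t_near = np.minimum(t0, t1).max()
    t_far = np.maximum(t0, t1).min()
    return t_far >= max(t_near, 0.0)

bbox = np.loadtxt("bbox.txt").ravel()
box_min, box_max = bbox[:3], bbox[3:6]
for f in sorted(glob.glob("pose/*.txt")):
    pose = np.loadtxt(f)        # 4x4 camera-to-world matrix
    origin = pose[:3, 3]
    forward = pose[:3, 2]       # assumption: camera looks along +z of its own frame; flip the sign for a -z convention
    print(f, "hits bbox:", ray_aabb_hit(origin, forward, box_min, box_max))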
Hello authors, may I know where to download pre-trained models?
(I see that README.md mentions that the pre-trained models are MIT-licensed.)
I am using --valid-view-per-batch 2 for the validation phase (to match the training phase's --view-per-batch 2). Debugging your code, I can confirm that this value (--valid-view-per-batch) is received and set correctly, but it still processes only 1 view in the validation phase:
The training phase works fine:
Is this a bug in the code, or did I misunderstand?
I don't really understand how to get the results of a dynamic scene.
What's the meaning of 'hypernetwork to encode all the 200 frames'?
![Screenshot from 2021-06-04 19-21-10](https://user-images.githubusercontent.com/41947948/120793925-1980aa80-c56a-11eb-8da8-6ab349152f03.png)
I would like to know the specific procedure. Thanks a lot!
Describe the bug
I am trying to run this code for a custom dataset. The dataset is as follows.
https://drive.google.com/drive/folders/1p6L5FVMrGzz3wdOBZwlo7ZCjS-AfsDfY?usp=sharing
And I use the following command to run it:
export DATASET="../neural_sparse_voxel_field/data/pixologic/brian_pape_1/nsvf/"
export SAVE="./brian_pape_1_ckpt/"
mkdir brian_pape_1_ckpt
rm -rf brian_pape_1_ckpt/*
CUDA_VISIBLE_DEVICES=1,2,4,5,6,7,8 python -u train.py ${DATASET} \
--user-dir fairnr \
--task single_object_rendering \
--train-views "0..15" --view-resolution "562x750" \
--max-sentences 1 --view-per-batch 1 --pixel-per-view 2048 \
--no-preload \
--sampling-on-mask 1.0 --no-sampling-at-reader \
--valid-views "0..8" --valid-view-resolution "281x375" \
--valid-view-per-batch 1 \
--transparent-background "0.0,0.0,0.0" --background-stop-gradient \
--arch nsvf_base \
--initial-boundingbox ${DATASET}/bbox.txt \
--use-octree \
--raymarching-stepsize-ratio 0.125 \
--discrete-regularization \
--color-weight 128.0 --alpha-weight 1.0 \
--optimizer "adam" --adam-betas "(0.9, 0.999)" \
--lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 \
--criterion "srn_loss" --clip-norm 0.0 \
--num-workers 0 \
--seed 2 \
--save-interval-updates 200 --max-update 150000 \
--virtual-epoch-steps 500 --save-interval 1 \
--half-voxel-size-at "500,2500,7500" \
--reduce-step-size-at "500,2500,7500" \
--pruning-every-steps 250 \
--keep-interval-updates 5 --keep-last-epochs 5 \
--log-format simple --log-interval 1 \
--save-dir ${SAVE} \
--tensorboard-logdir ${SAVE}/tensorboard \
| tee -a $SAVE/train.log
The above error occurs when the first stage of pruning happens. It looks like the number of nodes after pruning is zero. I am not sure how this can happen. Any ideas where I should look for debugging?
The loss gradually decreases to around 10-30, the validation PSNR is around 19, and SSIM is 0.73, so I assumed at least some voxels should have good values. This was not the case with
--transparent-background "1.0,1.0,1.0": there the SSIM is stuck around 0.73 and rises only slowly, while the training loss stays around 40-50.
Any ideas or directions for debugging this?
The system easily runs out of memory during the validation phase, since it uses all the pixels of each view. Is there a way to limit the number of sampled pixels during validation (the same as training does with --pixel-per-view)?
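As a generic illustration of the idea (not fairnr's actual API), evaluating a validation view in fixed-size pixel chunks keeps peak memory bounded:
import torch

def render_in_chunks(model_fn, rays, chunk=2048):
    # rays: (P, ...) per-pixel inputs for one view; returns concatenated per-pixel outputs
    outputs = []
    with torch.no_grad():
        for i in range(0, rays.shape[0], chunk):
            outputs.append(model_fn(rays[i:i + chunk]))
    return torch.cat(outputs, dim=0)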
I tried to train with the provided data (Wineholder) on 8 V100 GPUs. It takes around 10GB per GPU, and with the default training command/configuration it took almost 6-7 days to finish training. Is this normal, or is there something I can do to speed it up? I also tried fp16 training with apex, but it is not easy to run (many errors).
Please help me.
Thank you
Describe the bug
When training with the arg --distributed-no-spawn, the program gets stuck.
Without --distributed-no-spawn, the program produces a strange "OUT OF MEMORY" error, even though there is free GPU space.
To Reproduce
Append the line '--distributed-no-spawn' in train_wineholder.sh as follows:
# just for debugging
DATA="Wineholder"
RES="800x800"
ARCH="nsvf_base"
SUFFIX="v1"
DATASET=/xxx/NSVF/data/Synthetic_NSVF/${DATA}
SAVE=/xxx/NSVF/$DATA
MODEL=$ARCH$SUFFIX
mkdir -p $SAVE/$MODEL
CUDA_VISIBLE_DEVICES="4,7"
# start training locally
python train.py ${DATASET} \
--user-dir fairnr \
--task single_object_rendering \
--train-views "0..100" \
--view-resolution $RES \
--max-sentences 1 \
--view-per-batch 2 \
--pixel-per-view 2048 \
--no-preload \
--sampling-on-mask 1.0 --no-sampling-at-reader \
--valid-view-resolution $RES \
--valid-views "100..200" \
--valid-view-per-batch 1 \
--transparent-background "1.0,1.0,1.0" \
--background-stop-gradient \
--arch $ARCH \
--initial-boundingbox ${DATASET}/bbox.txt \
--raymarching-stepsize-ratio 0.125 \
--use-octree \
--discrete-regularization \
--color-weight 128.0 \
--alpha-weight 1.0 \
--optimizer "adam" \
--adam-betas "(0.9, 0.999)" \
--lr-scheduler "polynomial_decay" \
--total-num-update 150000 \
--lr 0.001 \
--clip-norm 0.0 \
--criterion "srn_loss" \
--seed 2 \
--save-interval-updates 500 --max-update 150000 \
--virtual-epoch-steps 5000 --save-interval 1 \
--half-voxel-size-at "5000,25000,75000" \
--reduce-step-size-at "5000,25000,75000" \
--pruning-every-steps 2500 \
--keep-interval-updates 5 \
--log-format simple --log-interval 1 \
--tensorboard-logdir ${SAVE}/tensorboard/${MODEL} \
--save-dir ${SAVE}/${MODEL} \
--device-id 4 \
--distributed-no-spawn
When running it, the program gets stuck. Without --distributed-no-spawn, it logs the following:
2021-03-23 00:46:08 | INFO | fairseq.distributed_utils | distributed init (rank 7): tcp://localhost:14705
2021-03-23 00:46:08 | INFO | fairseq.distributed_utils | distributed init (rank 4): tcp://localhost:14705
2021-03-23 00:46:08 | INFO | fairseq.distributed_utils | distributed init (rank 8): tcp://localhost:14705
2021-03-23 00:46:08 | INFO | fairseq.distributed_utils | distributed init (rank 3): tcp://localhost:14705
2021-03-23 00:46:08 | INFO | fairseq.distributed_utils | distributed init (rank 2): tcp://localhost:14705
2021-03-23 00:46:08 | INFO | fairseq.distributed_utils | distributed init (rank 5): tcp://localhost:14705
2021-03-23 00:46:08 | INFO | fairseq.distributed_utils | distributed init (rank 1): tcp://localhost:14705
2021-03-23 00:46:08 | INFO | fairseq.distributed_utils | distributed init (rank 0): tcp://localhost:14705
2021-03-23 00:46:08 | INFO | fairseq.distributed_utils | distributed init (rank 6): tcp://localhost:14705
2021-03-23 00:46:08 | INFO | fairseq.distributed_utils | initialized host ubuntu as rank 6
2021-03-23 00:46:08 | INFO | fairseq.distributed_utils | distributed init (rank 9): tcp://localhost:14705
2021-03-23 00:46:08 | INFO | fairseq.distributed_utils | initialized host ubuntu as rank 9
2021-03-23 00:46:09 | INFO | fairseq.distributed_utils | initialized host ubuntu as rank 7
2021-03-23 00:46:09 | INFO | fairseq.distributed_utils | initialized host ubuntu as rank 4
2021-03-23 00:46:09 | INFO | fairseq.distributed_utils | initialized host ubuntu as rank 8
2021-03-23 00:46:09 | INFO | fairseq.distributed_utils | initialized host ubuntu as rank 3
2021-03-23 00:46:09 | INFO | fairseq.distributed_utils | initialized host ubuntu as rank 2
2021-03-23 00:46:09 | INFO | fairseq.distributed_utils | initialized host ubuntu as rank 5
2021-03-23 00:46:09 | INFO | fairseq.distributed_utils | initialized host ubuntu as rank 1
2021-03-23 00:46:09 | INFO | fairseq.distributed_utils | initialized host ubuntu as rank 0
Traceback (most recent call last):
File "train.py", line 20, in <module>
cli_main()
File "/home/lsy/NSVF/fairnr_cli/train.py", line 356, in cli_main
nprocs=torch.cuda.device_count(),
File "/home/lsy/anaconda3/envs/NSVF/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/lsy/anaconda3/envs/NSVF/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/home/lsy/anaconda3/envs/NSVF/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/lsy/anaconda3/envs/NSVF/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/lsy/NSVF/fairnr_cli/train.py", line 338, in distributed_main
main(args, init_distributed=True)
File "/home/lsy/NSVF/fairnr_cli/train.py", line 50, in main
args.distributed_rank = distributed_utils.distributed_init(args)
File "/home/lsy/NSVF/3rd/fairseq-stable/fairseq/distributed_utils.py", line 107, in distributed_init
dist.all_reduce(torch.zeros(1).cuda())
RuntimeError: CUDA error: out of memory
This is strange since there is no message like "Tried to allocate 2.0 GiB". Moreover, nvidia-smi shows there is free space on GPUs 4 and 7:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26 Driver Version: 430.26 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:04:00.0 Off | N/A |
| 26% 45C P2 79W / 250W | 8119MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:05:00.0 Off | N/A |
| 30% 50C P2 72W / 250W | 8119MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:06:00.0 Off | N/A |
| 28% 47C P2 73W / 250W | 8103MiB / 12196MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:07:00.0 Off | N/A |
| 27% 47C P2 78W / 250W | 8135MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 TITAN Xp Off | 00000000:08:00.0 Off | N/A |
| 23% 28C P8 8W / 250W | 10MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 TITAN Xp Off | 00000000:0B:00.0 Off | N/A |
| 32% 53C P2 118W / 250W | 8334MiB / 12196MiB | 83% Default |
+-------------------------------+----------------------+----------------------+
| 6 TITAN Xp Off | 00000000:0C:00.0 Off | N/A |
| 36% 60C P2 146W / 250W | 8846MiB / 12196MiB | 81% Default |
+-------------------------------+----------------------+----------------------+
| 7 TITAN Xp Off | 00000000:0D:00.0 Off | N/A |
| 23% 27C P8 8W / 250W | 10MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 8 TITAN Xp Off | 00000000:0E:00.0 Off | N/A |
| 39% 63C P2 194W / 250W | 12080MiB / 12196MiB | 49% Default |
+-------------------------------+----------------------+----------------------+
| 9 TITAN Xp Off | 00000000:0F:00.0 Off | N/A |
| 29% 49C P2 77W / 250W | 8105MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
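One thing worth double-checking (my guess from the log above, where ten ranks 0-9 are initialized even though CUDA_VISIBLE_DEVICES="4,7" appears in the script without export) is whether the variable actually reaches the training process; a tiny diagnostic sketch:
import os
import torch

# if CUDA_VISIBLE_DEVICES="4,7" is assigned in the shell script without `export`
# (or without prefixing the python command), torch still sees all GPUs and
# torch.multiprocessing spawns one worker per device.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible device count =", torch.cuda.device_count())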
Desktop (please complete the following information):
THANKS!
Hi
When I am training the network I get the following import error. It is probably a minor issue. Am I missing some step?
root@684ba01c705d:/nitthilan/source_code/multiview_rendering/NSVF# python -u train.py ${DATASET} --user-dir fairnr --task single_object_rendering --train-views "0..100" --view-resolution "800x800" --max-sentences 1 --view-per-batch 4 --pixel-per-view 2048 --no-preload --sampling-on-mask 1.0 --no-sampling-at-reader --valid-views "100..200" --valid-view-resolution "400x400" --valid-view-per-batch 1 --transparent-background "1.0,1.0,1.0" --background-stop-gradient --arch nsvf_base --initial-boundingbox ${DATASET}/bbox.txt --use-octree --raymarching-stepsize-ratio 0.125 --discrete-regularization --color-weight 128.0 --alpha-weight 1.0 --optimizer "adam" --adam-betas "(0.9, 0.999)" --lr 0.001 --lr-scheduler "polynomial_decay" --total-num-update 150000 --criterion "srn_loss" --clip-norm 0.0 --num-workers 0 --seed 2 --save-interval-updates 500 --max-update 150000 --virtual-epoch-steps 5000 --save-interval 1 --half-voxel-size-at "5000,25000,75000" --reduce-step-size-at "5000,25000,75000" --pruning-every-steps 2500 --keep-interval-updates 5 --keep-last-epochs 5 --log-format simple --log-interval 1 --save-dir ${SAVE} --tensorboard-logdir ${SAVE}/tensorboard | tee -a $SAVE/train.log
Traceback (most recent call last):
File "train.py", line 7, in <module>
from fairnr_cli.train import cli_main
File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr_cli/train.py", line 26, in <module>
from fairnr import ResetTrainerException
File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/__init__.py", line 11, in <module>
from . import data, tasks, models, modules, criterions, clib
File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/models/__init__.py", line 15, in <module>
module = importlib.import_module('fairnr.models.' + model_name)
File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/models/multi_nsvf.py", line 15, in <module>
from fairnr.models.nsvf import NSVFModel, base_architecture
File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/models/nsvf.py", line 24, in <module>
from fairnr.models.fairnr_model import BaseModel
File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/models/fairnr_model.py", line 22, in <module>
from fairnr.modules.encoder import get_encoder
File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/modules/__init__.py", line 15, in <module>
module = importlib.import_module('fairnr.modules.' + model_name)
File "/opt/conda/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/modules/encoder.py", line 23, in <module>
from fairnr.clib import (
File "/nitthilan/source_code/multiview_rendering/NSVF/fairnr/clib/__init__.py", line 28, in <module>
import fairnr.clib._ext as _ext
AttributeError: module 'fairnr' has no attribute 'clib'
root@684ba01c705d:/nitthilan/source_code/multiview_rendering/NSVF#
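In my experience this AttributeError usually surfaces when the compiled extension is missing; a hedged check (hypothetical debugging step, assuming the install commands from the README) is to import the extension module directly and look at the underlying error:
import importlib

# fairnr.clib requires the C++/CUDA extension built by
# `pip install --editable ./` or `python setup.py build_ext --inplace`
try:
    importlib.import_module("fairnr.clib._ext")
    print("fairnr.clib._ext is available")
except Exception as e:  # ImportError / AttributeError if the extension was never built
    print("extension not built:", e)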
Describe the bug
If using PyTorch 1.4 and CUDA 10.2, the RTX 3090 is not supported (the RTX 3090 uses compute capability 8.6).
If trying PyTorch 1.7.1 and CUDA 11.0, fairnr does not compile
(ValueError: Unknown CUDA arch (8.6) or GPU not supported).
To Reproduce
Steps to reproduce the behavior:
1 - Install all recommended dependencies as explained in README.md
Expected behavior
As NSVF is about fast rendering, it would be great to have it working on Nvidia's latest flagship GPU.
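A small diagnostic sketch (assuming a recent PyTorch where these helpers exist) to compare the GPU's compute capability with the architectures the installed PyTorch binaries were built for:
import torch

print("device capability:", torch.cuda.get_device_capability(0))  # RTX 3090 reports (8, 6)
print("compiled arch list:", torch.cuda.get_arch_list())          # needs an sm_86 / compute_86 entry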
Desktop (please complete the following information):