Hi, I am running stylegan_v on a custom dataset and everything seems to work fine unti

Error encountered while running multi-gpu training (stylegan-v issue, closed)

universome commented on June 14, 2024

Error encountered while running multi-gpu training

from stylegan-v.

Comments (6)

universome commented on June 14, 2024

Hi! Could you please tell me whether the original StyleGAN2-ADA runs on your system? It looks like the issue is with process launching, which we inherited from there. In that case, it might be helpful to check their troubleshooting guide.

universome commented on June 14, 2024

Also, as far as I remember, SIGKILL is sometimes sent by Slurm when a job exceeds a resource limit such as memory. Do you run it with Slurm? Could you try reducing the training resolution and batch size to minimal values, like 32x32 and 16? Note that SIGKILL can also be sent by other system managers.

skymanaditya1 commented on June 14, 2024

Thanks for the reply! I was able to train the PyTorch implementation of StyleGAN2 from https://github.com/lucidrains/stylegan2-pytorch. I do run this on a Slurm server. In my setup, a different process runs on GPU 0 while I run StyleGAN-V on GPUs 1 and 2. I don't see this error when I train StyleGAN-V in isolation using all 4 GPUs with a small enough batch size (32).
However, I now see a different error (stack trace below):

Setting up augmentation...
Distributing across 4 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...

tick 0 kimg 0.1 time 23s sec/tick 2.6 sec/kimg 26.76 maintenance 20.4 cpumem 2.70 gpumem 9.26 augment 0.000
Evaluating metrics for how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5 ...
{"results": {"fvd2048_16f": 5910.987411352878}, "metric": "fvd2048_16f", "total_time": 51.558385610580444, "total_time_str": "52s", "num_gpus": 4, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1652602566.066554}
Traceback (most recent call last):
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 451, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 446, in main
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 375, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/training/training_loop.py", line 508, in training_loop
    result_dict = metric_main.calc_metric(
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 49, in calc_metric
    all_runs_results = [_metric_dict[metric](opts) for _ in range(num_runs)]
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 49, in <listcomp>
    all_runs_results = [_metric_dict[metric](opts) for _ in range(num_runs)]
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 123, in fvd2048_128f
    fvd = frechet_video_distance.compute_fvd(opts, max_real=2048, num_gen=2048, num_frames=128)
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/frechet_video_distance.py", line 31, in compute_fvd
    mu_real, sigma_real = metric_utils.compute_feature_stats_for_dataset(
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_utils.py", line 195, in compute_feature_stats_for_dataset
    dataset = dnnlib.util.construct_class_by_name(**dataset_kwargs)
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/dnnlib/util.py", line 292, in construct_class_by_name
    return call_func_by_name(*args, func_name=class_name, **kwargs)
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/dnnlib/util.py", line 287, in call_func_by_name
    return func_obj(*args, **kwargs)
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/training/dataset.py", line 335, in __init__
    raise IOError('No videos found in the specified archive')
OSError: No videos found in the specified archive

I am trying to debug this as well.

skymanaditya1 commented on June 14, 2024

FYI, my data directory has 10,000 videos, each with exactly 25 frames. I am trying to compare the performance of StyleGAN-V against our method for unconditional video generation with fewer videos and fewer frames per video.
From the code, it appears that the dataset class rejects video directories that do not contain a minimum number of frames, which is why it raises the error "No videos found in the specified archive".
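To check this outside of training, here is a minimal sketch that counts frames per video and lists which videos would survive a given minimum length. It assumes one subdirectory per video with frames stored as image files; the function names and the set of extensions are hypothetical and are not taken from the stylegan-v code.

```python
from pathlib import Path

# Assumed frame file extensions; adjust to match the actual dataset.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def count_frames_per_video(root):
    """Map each video subdirectory name to the number of frame images in it."""
    counts = {}
    for video_dir in sorted(Path(root).iterdir()):
        if video_dir.is_dir():
            counts[video_dir.name] = sum(
                1
                for f in video_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTS
            )
    return counts


def videos_long_enough(counts, min_frames):
    """Return the names of videos that have at least min_frames frames."""
    return [name for name, n in counts.items() if n >= min_frames]
```

With 25 frames per video, calling videos_long_enough(counts, 128) would return an empty list, which matches the "No videos found" error above.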

universome commented on June 14, 2024

Yes, I think the issue is that it is trying to compute two metrics which require 128-frame videos, and the dataset class rejects all videos shorter than that.
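As a rough illustration, metrics can be filtered by the clip length they need before training starts. This sketch assumes the frame count is encoded as a trailing "_<N>f" in the metric name, as in fvd2048_16f and fvd2048_128f from the log above; the helper names are hypothetical, and how the resulting list is passed to training depends on the repository's configuration and is not shown here.

```python
import re


def required_frames(metric_name):
    """Return the clip length encoded in names like 'fvd2048_16f', else None."""
    match = re.search(r"_(\d+)f$", metric_name)
    return int(match.group(1)) if match else None


def feasible_metrics(metric_names, frames_per_video):
    """Keep metrics whose required clip length fits the shortest video."""
    keep = []
    for name in metric_names:
        need = required_frames(name)
        # Metrics without a frame suffix are assumed to have no length requirement.
        if need is None or need <= frames_per_video:
            keep.append(name)
    return keep


print(feasible_metrics(["fvd2048_16f", "fvd2048_128f"], 25))  # ['fvd2048_16f']
```

For a dataset with 25 frames per video, only the 16-frame metric remains, which is consistent with fvd2048_16f completing in the log before fvd2048_128f failed.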

skymanaditya1 commented on June 14, 2024

Yes, thank you! Closing.
