
Comments (6)

universome commented on June 14, 2024

Hi! Could you please tell us whether the original StyleGAN2-ADA runs on your system? It looks like the issue is in the process launching, which we inherited from there. If so, it might help to check their troubleshooting guide.

from stylegan-v.

universome commented on June 14, 2024

Also, as far as I remember, SIGKILL is sometimes sent by slurm when a job exceeds a limit such as memory. Do you run it with slurm? Could you try reducing the training resolution and batch size to some minimal values, like 32x32 and 16? SIGKILL can also be sent by other system managers.
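
To tell a SIGKILL apart from an ordinary crash, it can help to decode the child's exit code. The snippet below is only a rough sketch, not code from this repo: `describe_exit` is a hypothetical helper, and it assumes a POSIX system where Python's `multiprocessing` and `subprocess` report a signal-terminated child with a negative exit code.

```python
import signal


def describe_exit(exitcode):
    """Explain a child exit code (hypothetical helper, POSIX only)."""
    if exitcode is None:
        return "still running"
    if exitcode == 0:
        return "exited cleanly"
    if exitcode < 0:
        # A negative code means the child was terminated by signal number -exitcode.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = f"signal {-exitcode}"
        hint = ""
        if -exitcode == signal.SIGKILL:
            # SIGKILL cannot be caught, so the child leaves no traceback of its own.
            hint = " (possibly a scheduler or OOM killer enforcing a limit)"
        return f"killed by {name}{hint}"
    return f"exited with error code {exitcode}"


print(describe_exit(-signal.SIGKILL))
```

If the code does turn out to be SIGKILL, the scheduler's accounting for the job is probably the next place to look, since the Python side has nothing more to report.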


skymanaditya1 commented on June 14, 2024

Thanks for the reply! I was able to train the PyTorch implementation of StyleGAN2 from https://github.com/lucidrains/stylegan2-pytorch. I do run this on a slurm server. In my setup, a different process runs on GPU 0 while I run StyleGAN-V on GPUs 1 and 2. The error does not appear when I train StyleGAN-V in isolation on all 4 GPUs with a small enough batch size (32).
However, I now see a different error (stack trace below):

Setting up augmentation...
Distributing across 4 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...

tick 0 kimg 0.1 time 23s sec/tick 2.6 sec/kimg 26.76 maintenance 20.4 cpumem 2.70 gpumem 9.26 augment 0.000
Evaluating metrics for how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5 ...
{"results": {"fvd2048_16f": 5910.987411352878}, "metric": "fvd2048_16f", "total_time": 51.558385610580444, "total_time_str": "52s", "num_gpus": 4, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1652602566.066554}
Traceback (most recent call last):
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 451, in <module>
main() # pylint: disable=no-value-for-parameter
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 446, in main
torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 375, in subprocess_fn
training_loop.training_loop(rank=rank, **args)
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/training/training_loop.py", line 508, in training_loop
result_dict = metric_main.calc_metric(
File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 49, in calc_metric
all_runs_results = [_metric_dict[metric](**opts) for _ in range(num_runs)]
File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 49, in <listcomp>
all_runs_results = [_metric_dict[metric](**opts) for _ in range(num_runs)]
File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 123, in fvd2048_128f
fvd = frechet_video_distance.compute_fvd(opts, max_real=2048, num_gen=2048, num_frames=128)
File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/frechet_video_distance.py", line 31, in compute_fvd
mu_real, sigma_real = metric_utils.compute_feature_stats_for_dataset(
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_utils.py", line 195, in compute_feature_stats_for_dataset
dataset = dnnlib.util.construct_class_by_name(**dataset_kwargs)
File "/home2/ravi_mishra/aditya1/stylegan-v/src/dnnlib/util.py", line 292, in construct_class_by_name
return call_func_by_name(*args, func_name=class_name, **kwargs)
File "/home2/ravi_mishra/aditya1/stylegan-v/src/dnnlib/util.py", line 287, in call_func_by_name
return func_obj(*args, **kwargs)
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/training/dataset.py", line 335, in __init__
raise IOError('No videos found in the specified archive')
OSError: No videos found in the specified archive

I am trying to debug this as well.


skymanaditya1 commented on June 14, 2024

FYI, my data dir has 10,000 videos, each with exactly 25 frames. I am trying to compare the performance of StyleGAN-V against our method for unconditional video generation with fewer videos and fewer frames per video.
From the code, it appears to reject dirs that don't contain a minimum number of frames, which is why it throws the error "No videos found in the specified archive".
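
A quick way to check this guess is to count frames per video dir and see how many pass a given minimum. This is only a sketch and not the repo's dataset code: `find_long_enough_videos` and `make_fake_dataset` are hypothetical helpers, and the layout (one sub-directory of image frames per video) is an assumption.

```python
import os
import tempfile

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def find_long_enough_videos(root, min_frames):
    """Return {video_dir_name: frame_count} for dirs with at least min_frames frames."""
    kept = {}
    for name in sorted(os.listdir(root)):
        video_dir = os.path.join(root, name)
        if not os.path.isdir(video_dir):
            continue
        # Count only image files; anything else in the dir is ignored.
        num_frames = sum(
            1 for f in os.listdir(video_dir)
            if os.path.splitext(f)[1].lower() in IMAGE_EXTS
        )
        if num_frames >= min_frames:
            kept[name] = num_frames
    return kept


def make_fake_dataset(root, num_videos, frames_per_video):
    """Create empty placeholder frame files, just to exercise the filter."""
    for v in range(num_videos):
        video_dir = os.path.join(root, f"video_{v:05d}")
        os.makedirs(video_dir)
        for t in range(frames_per_video):
            open(os.path.join(video_dir, f"{t:06d}.jpg"), "wb").close()


with tempfile.TemporaryDirectory() as root:
    make_fake_dataset(root, num_videos=3, frames_per_video=25)
    print(len(find_long_enough_videos(root, min_frames=16)))   # all 3 dirs pass
    print(len(find_long_enough_videos(root, min_frames=128)))  # none pass
```

With 25-frame videos, a minimum of 128 leaves nothing, which would match the error in the trace.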


universome commented on June 14, 2024

Yes, I think the issue is that it is trying to compute two metrics which require 128-frame videos, and the dataset class rejects all the videos that are shorter than that.
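
One possible workaround is to request only the metrics that fit the available clip length. The sketch below is an illustration under assumptions, not the repo's API: `frames_required` and `runnable_metrics` are hypothetical helpers, and reading the frame count from the `_<N>f` suffix is inferred from the trace above, where `fvd2048_128f` calls `compute_fvd` with `num_frames=128`.

```python
import re


def frames_required(metric_name):
    """Parse the frame count from a name like 'fvd2048_128f'; None if no such suffix."""
    match = re.search(r"_(\d+)f$", metric_name)
    return int(match.group(1)) if match else None


def runnable_metrics(metric_names, max_video_len):
    """Keep metrics whose assumed frame requirement fits into max_video_len frames."""
    kept = []
    for name in metric_names:
        needed = frames_required(name)
        # Names without a frame suffix are kept, since nothing is known about them.
        if needed is None or needed <= max_video_len:
            kept.append(name)
    return kept


print(runnable_metrics(["fvd2048_16f", "fvd2048_128f"], max_video_len=25))
```

For 25-frame videos this keeps `fvd2048_16f` and drops `fvd2048_128f`, which is consistent with the log above, where the 16-frame FVD was computed before the crash.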


skymanaditya1 commented on June 14, 2024

Yes, thank you! Closing.

