Comments (6)
Hi! Could you please tell me whether the original StyleGAN2-ADA runs on your system? It looks like the issue is with process launching, which we inherited from there. In that case, it might be helpful to check their troubleshooting guide.
from stylegan-v.
Also, as far as I remember, SIGKILL is sometimes sent by Slurm when a job exceeds a resource limit such as memory. Do you run it with Slurm? Could you try reducing the training resolution and batch size to some minimal values, like 32x32 and 16? SIGKILL can also be sent by other system managers.
from stylegan-v.
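To tell a signal kill apart from an ordinary crash, it can help to inspect the worker's exit code. The snippet below is a minimal sketch, not part of StyleGAN-V: `describe_exitcode` is a hypothetical helper, and it relies only on the documented behaviour of Python's `multiprocessing.Process.exitcode`, where a negative value `-N` means the child was terminated by signal `N`.

```python
import signal
from typing import Optional


def describe_exitcode(exitcode: Optional[int]) -> str:
    """Turn a multiprocessing-style exit code into a readable description.

    Hypothetical helper for illustration only.
    """
    if exitcode is None:
        # Process.exitcode is None while the child has not terminated yet.
        return "still running"
    if exitcode < 0:
        # A negative exit code -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            # Signal number unknown on this platform.
            name = f"signal {-exitcode}"
        return f"killed by {name}"
    if exitcode == 0:
        return "exited normally"
    return f"exited with status {exitcode}"
```

If a worker reports SIGKILL here and the traceback shows no Python exception, a resource limit enforced by the scheduler is a plausible cause worth checking, though it is not the only one.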
Thanks for the reply! I was able to train the PyTorch implementation of StyleGAN2 from https://github.com/lucidrains/stylegan2-pytorch. I do indeed run this on a Slurm server. In my setup, a different process runs on GPU 0 while I run StyleGAN-V on GPUs 1 and 2. The error does not appear when I train StyleGAN-V in isolation on all 4 GPUs with a small enough batch size (32).
However, I now see a different error (stack trace below):
Setting up augmentation...
Distributing across 4 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...
tick 0 kimg 0.1 time 23s sec/tick 2.6 sec/kimg 26.76 maintenance 20.4 cpumem 2.70 gpumem 9.26 augment 0.000
Evaluating metrics for how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5 ...
{"results": {"fvd2048_16f": 5910.987411352878}, "metric": "fvd2048_16f", "total_time": 51.558385610580444, "total_time_str": "52s", "num_gpus": 4, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1652602566.066554}
Traceback (most recent call last):
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 451, in <module>
main() # pylint: disable=no-value-for-parameter
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 446, in main
torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 375, in subprocess_fn
training_loop.training_loop(rank=rank, **args)
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/training/training_loop.py", line 508, in training_loop
result_dict = metric_main.calc_metric(
File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 49, in calc_metric
all_runs_results = [_metric_dict[metric](opts) for _ in range(num_runs)]
File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 49, in <listcomp>
all_runs_results = [_metric_dict[metric](opts) for _ in range(num_runs)]
File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 123, in fvd2048_128f
fvd = frechet_video_distance.compute_fvd(opts, max_real=2048, num_gen=2048, num_frames=128)
File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/frechet_video_distance.py", line 31, in compute_fvd
mu_real, sigma_real = metric_utils.compute_feature_stats_for_dataset(
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_utils.py", line 195, in compute_feature_stats_for_dataset
dataset = dnnlib.util.construct_class_by_name(**dataset_kwargs)
File "/home2/ravi_mishra/aditya1/stylegan-v/src/dnnlib/util.py", line 292, in construct_class_by_name
return call_func_by_name(*args, func_name=class_name, **kwargs)
File "/home2/ravi_mishra/aditya1/stylegan-v/src/dnnlib/util.py", line 287, in call_func_by_name
return func_obj(*args, **kwargs)
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/training/dataset.py", line 335, in __init__
raise IOError('No videos found in the specified archive')
OSError: No videos found in the specified archive
I am trying to debug this as well.
from stylegan-v.
FYI, my data dir has 10,000 videos, each with exactly 25 frames. I am trying to compare the performance of StyleGAN-V against our method on unconditional video generation with fewer videos and fewer frames per video.
From the code, it appears that the dataset class rejects directories that do not contain a minimum number of frames, which is why it raises the error "No videos found in the specified archive".
from stylegan-v.
Yes, I think the issue is that it is trying to compute two metrics that require 128-frame videos, and the dataset class rejects all the shorter videos.
from stylegan-v.
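The diagnosis above can be checked before launching a run. The following is a minimal sketch under stated assumptions, not code from the repository: `frames_required` and `unusable_metrics` are hypothetical helpers, and the rule that a trailing `_<N>f` in a metric name encodes the required clip length is inferred only from the two names in the log (`fvd2048_16f`, and `fvd2048_128f` being called with `num_frames=128`).

```python
import re
from typing import Dict, List, Optional


def frames_required(metric_name: str) -> Optional[int]:
    """Guess the clip length a metric needs from a trailing '_<N>f' suffix.

    Assumption: the suffix encodes the frame count, as in 'fvd2048_128f'.
    Returns None when the name carries no such suffix.
    """
    match = re.search(r"_(\d+)f$", metric_name)
    return int(match.group(1)) if match else None


def unusable_metrics(metric_names: List[str],
                     frames_per_video: List[int]) -> Dict[str, int]:
    """Map each metric that no video in the dataset can satisfy
    to the number of frames it requires."""
    longest = max(frames_per_video, default=0)
    result = {}
    for name in metric_names:
        needed = frames_required(name)
        if needed is not None and needed > longest:
            result[name] = needed
    return result


# A dataset of 10,000 videos with 25 frames each, as described in the thread.
too_long = unusable_metrics(["fvd2048_16f", "fvd2048_128f"], [25] * 10_000)
print(too_long)
```

For that dataset the sketch flags only the 128-frame metric, which matches the log: `fvd2048_16f` completed and the failure came from `fvd2048_128f`. Dropping the long-clip metrics from the run configuration would be the corresponding workaround.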
Yes, thank you! Closing.
from stylegan-v.