I use tensorflow 1.11.0, CUDA 9.0.176 and cuDNN 7.3.1 on Ubuntu 16.04. My GPUs are nvi

Training hangs when training a GAN with multiple GPUs (video_prediction issue, closed)

alexlee-gk commented on July 3, 2024

Training hangs when training a GAN with multiple GPUs

from video_prediction.

Comments (14)

alexlee-gk commented on July 3, 2024

This issue happens when training a GAN variant (i.e. the GAN or VAE-GAN) with multiple GPUs.

I'll look into this. As a temporary work-around, you can train with a single GPU.


Glooow1024 commented on July 3, 2024

I set CUDA_VISIBLE_DEVICES=0 and the problem was solved. But I then ran into a Resource Exhausted error. As you said in #6, I changed the batch size to 4 and ran again with this script:

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --input_dir data/bair --dataset bair \
  --model savp --model_hparams_dict hparams/bair_action_free/ours_savp/model_hparams.json \
  --output_dir logs/bair_action_free/ours_savp \
  --gpu_mem_frac 0.7 \
  --model_hparams tv_weight=0.001,transformation=flow

Now it seems to be running correctly, although it still occasionally outputs the same message about the topological sort failing. Thanks a lot.
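For reference, the single-GPU workaround can also be applied from Python instead of on the command line. This is a minimal sketch, assuming the variable is set before TensorFlow is imported; `single_gpu_env` is a hypothetical helper and not part of video_prediction.

```python
import os


def single_gpu_env(gpu_index=0, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable when it initializes, so it has to be set
    # before TensorFlow is imported in the same process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


if __name__ == "__main__":
    os.environ.update(single_gpu_env(0))
    print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The returned dictionary can also be passed as the `env` argument of `subprocess.run` when launching scripts/train.py from another script.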


alexlee-gk commented on July 3, 2024

Great! However, be aware that that batch size might end up being too small, and the results won't be as good as in the paper. I'll be making a few changes to improve accuracy and reduce the memory footprint. I'll post an update when I do and also when I fix the multi-GPU training.


crequena commented on July 3, 2024

Dear Alex,

Thank you for sharing the code of this awesome project, and congratulations on your results and the very nice paper!

I believe my training also hangs due to this issue whenever a GAN loss is used; however, it also happens when training on a single GPU (Tesla V100, CUDA 9.0.176, tf 1.9.0 and 1.12.0, cudnn 7.1.3). I find that training runs "smoothly" if the CPU is used :)

It would be awesome if this issue is solved (I would totally help but I am not fluent enough to dig into this problem).

I am building something for climate science based on your work, it would be awesome to talk to you! Check your inbox :)

EDIT: Actually training failed on CPU too. It does so when training reaches progress_freq or summary_freq with the following error:

/Net/Groups/BGI/people/crequ/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
898 try:
899 result = self._run(None, fetches, feed_dict, options_ptr,
--> 900 run_metadata_ptr)
901 if run_metadata:
902 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/Net/Groups/BGI/people/crequ/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1133 if final_fetches or final_targets or (handle and feed_dict_tensor):
1134 results = self._do_run(handle, final_targets, final_fetches,
-> 1135 feed_dict_tensor, options, run_metadata)
1136 else:
1137 results = []

/Net/Groups/BGI/people/crequ/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1314 if handle is None:
1315 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1316 run_metadata)
1317 else:
1318 return self._do_call(_prun_fn, handle, feeds, fetches)

/Net/Groups/BGI/people/crequ/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1333 except KeyError:
1334 pass
-> 1335 raise type(e)(node_def, op, message)
1336
1337 def _extend_graph(self):

InvalidArgumentError: Retval[4] does not have value

Though the two cases are closely related, note that on the GPU training never leaves step 0, whereas on the CPU it reaches the progress_freq step. It seems that fetches['d_losses'] and fetches['g_losses'] can only be retrieved at initialization and are gone afterwards, so probably no real training is in progress on the CPU either.
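One way to read this behaviour: if the loss and summary fetches are only requested on certain steps, an error in those fetches would only surface when such a step is reached. The sketch below illustrates that kind of frequency gating. It assumes the loop gates fetches on progress_freq and summary_freq; the helper names, the fetch names other than d_losses and g_losses, and the exact step condition are illustrative and not taken from the repository.

```python
def should(step, freq):
    """True when `step` (0-based) is the last step of a window of `freq` steps."""
    return bool(freq) and (step + 1) % freq == 0


def build_fetch_names(step, progress_freq, summary_freq):
    """Hypothetical helper: list which fetches would be requested at this step."""
    names = ["train_op"]  # the optimizer step is always run
    if should(step, progress_freq):
        names += ["d_losses", "g_losses"]  # extra fetches only on progress steps
    if should(step, summary_freq):
        names.append("summary_op")
    return names


if __name__ == "__main__":
    for step in (0, 9, 99):
        print(step, build_fetch_names(step, progress_freq=10, summary_freq=100))
```

Under these assumptions, most steps fetch only the training op, so a problem with the loss tensors stays hidden until the first progress or summary step.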


alexlee-gk commented on July 3, 2024

Hi Chris, I have made several improvements in the experimental branch (will soon be merged into master), including this issue with GANs. Can you try the experimental branch and see if the problem persists?


crequena commented on July 3, 2024

Hey Alex, thanks a lot! I seem to have some dependency problem in the experimental video_prediction/metrics.py trying to import lpips_tf. Maybe I am missing a new requirement?


alexlee-gk commented on July 3, 2024

Yes, you can install it with pip install -r requirements.txt. (There is one new dependency at the end of that file).
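To confirm that the new dependency is visible to the interpreter after installing, a quick check along these lines may help. It is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and `lpips_tf` is the module name from the import error reported above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # An empty list means both modules can be located by the import system.
    print(missing_modules(["tensorflow", "lpips_tf"]))
```

This only checks that the module can be located, not that it imports without errors.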


crequena commented on July 3, 2024

Hey Alex, I get this error at build_graph. It happens with a dataset I created (shaped very similarly to kth) and also with 'bair' exactly as provided by you.

174
175 # inputs comes from the training dataset by default, unless train_handle is remapped to the val_handles
--> 176 model.build_graph(inputs)
177
178 if long_val_dataset is not None:

/video_prediction/models/base_model.py in build_graph(self, inputs)
686 self.accum_eval_metrics = OrderedDict()
687 for name, eval_metric in self.eval_metrics.items():
--> 688 _, self.accum_eval_metrics['accum_' + name] = tf.metrics.mean_tensor(eval_metric)
689 local_variables = set(tf.local_variables()) - original_local_variables
690 self.accum_eval_metrics_reset_op = tf.group([tf.assign(v, tf.zeros_like(v)) for v in local_variables])

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in mean_tensor(values, weights, metrics_collections, updates_collections, name)
1294 values = math_ops.to_float(values)
1295 total = metric_variable(
-> 1296 values.get_shape(), dtypes.float32, name='total_tensor')
1297 count = metric_variable(
1298 values.get_shape(), dtypes.float32, name='count_tensor')

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in metric_variable(shape, dtype, validate_shape, name)
49 ],
50 validate_shape=validate_shape,
---> 51 name=name)
52
53

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in variable(initial_value, trainable, collections, validate_shape, caching_device, name, dtype, constraint, use_resource)
2232 name=name, dtype=dtype,
2233 constraint=constraint,
-> 2234 use_resource=use_resource)
2235
2236

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in <lambda>(**kwargs)
2222 constraint=None,
2223 use_resource=None):
-> 2224 previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
2225 for getter in ops.get_default_graph()._variable_creator_stack: # pylint: disable=protected-access
2226 previous_getter = _make_getter(getter, previous_getter)

Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in default_variable_creator(next_creator, **kwargs)
2194 collections=collections, validate_shape=validate_shape,
2195 caching_device=caching_device, name=name, dtype=dtype,
-> 2196 constraint=constraint)
2197 elif not use_resource and context.executing_eagerly():
2198 raise RuntimeError(

Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py in __init__(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint)
310 name=name,
311 dtype=dtype,
--> 312 constraint=constraint)
313
314 # pylint: disable=unused-argument

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, constraint)
415 with ops.name_scope("Initializer"), ops.device(None):
416 initial_value = ops.convert_to_tensor(
--> 417 initial_value(), name="initial_value", dtype=dtype)
418 self._handle = _eager_safe_variable_handle(
419 shape=initial_value.get_shape(),

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in <lambda>()
43
44 return variable_scope.variable(
---> 45 lambda: array_ops.zeros(shape, dtype),
46 trainable=False,
47 collections=[

Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py in zeros(shape, dtype, name)
1545 except (TypeError, ValueError):
1546 # Happens when shape is a list with tensor elements
-> 1547 shape = ops.convert_to_tensor(shape, dtype=dtypes.int32)
1548 if not shape._shape_tuple():
1549 shape = reshape(shape, [-1]) # Ensure it's a vector

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, preferred_dtype)
1009 name=name,
1010 preferred_dtype=preferred_dtype,
-> 1011 as_ref=False)
1012
1013

Anaconda/install/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx)
1105
1106 if ret is None:
-> 1107 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
1108
1109 if ret is NotImplemented:

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _tensor_shape_tensor_conversion_function(s, dtype, name, as_ref)
236 if not s.is_fully_defined():
237 raise ValueError(
--> 238 "Cannot convert a partially known TensorShape to a Tensor: %s" % s)
239 s_list = s.as_list()
240 int64_value = 0

ValueError: Cannot convert a partially known TensorShape to a Tensor: (?, ?)


alexlee-gk commented on July 3, 2024

The problem is that one of the metrics is not returning fully defined shapes, and I suspect that it might be the new LPIPS metric causing this. If that’s the case, you can just comment this metric out:
https://github.com/alexlee-gk/video_prediction/blob/experimental/video_prediction/models/base_model.py#L149

Unlike the losses, the metrics don’t affect the training.
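To narrow down which metric has the partially known shape, one could list the static shape of each metric and flag those with unknown dimensions. The following is a toy sketch in plain Python, with None standing in for the "?" dimensions in the error message. The metric names and shapes in the usage example are made up for illustration. In TensorFlow the corresponding check is `is_fully_defined()` on a tensor's static shape, the same method that appears in the traceback.

```python
def is_fully_defined(shape):
    """True if every dimension is known; None marks an unknown ('?') dimension."""
    return shape is not None and all(dim is not None for dim in shape)


def partially_defined(metric_shapes):
    """Return the names of metrics whose static shape has unknown dimensions."""
    return [name for name, shape in metric_shapes.items()
            if not is_fully_defined(shape)]


if __name__ == "__main__":
    # Hypothetical shapes, for illustration only.
    shapes = {"psnr": (9, 16), "ssim": (9, 16), "lpips": (None, None)}
    print(partially_defined(shapes))
```

Any metric reported by such a check would trigger the same ValueError when passed to tf.metrics.mean_tensor, since that function allocates accumulators from the static shape.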


crequena commented on July 3, 2024

Unfortunately commenting out that line (or L126, or every line using LPIPS in base_model.py) still leads to the same error.


crequena commented on July 3, 2024

Hey Alex,

Commenting out anything involving accum_eval_summary in both train.py and base_model.py allows the training to proceed.
That is, commenting out L301-315 and, in L317, the clause or should_eval(step, args.accum_eval_summary_freq) in train.py, as well as L687-688, L711-713, L718-720, and L722 in base_model.py.

If no GAN loss is used, training works! However, it still seems to get stuck if I use video_image_sn_gan_weight or image_sn_gan_weight > 0 on a single GPU. I also blindly tried the new argument aggregate_nccl=1 (just in case it matters), with the same results.

This time around, training does not run on the CPU either, since max pooling on the CPU does not support the data format:

InvalidArgumentError (see above for traceback): Default MaxPoolingOp only supports NHWC on device type CPU
[[Node: metrics/import/max_pool = MaxPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Thank you for all your work! :)
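The error message says the CPU max-pooling kernel accepts only NHWC, while the node was built with data_format="NCHW". As a rough illustration of what the layout difference means for the node's attributes, the sketch below reorders the ksize and strides values from the message into NHWC order. `nchw_to_nhwc` is a hypothetical helper written for this example; it does not describe how the repository handles the layout.

```python
NCHW_TO_NHWC = (0, 2, 3, 1)  # N stays first, C moves to the last position


def nchw_to_nhwc(values):
    """Reorder a 4-element NCHW attribute (shape, ksize, strides) into NHWC order."""
    if len(values) != 4:
        raise ValueError("expected 4 values, got %d" % len(values))
    return [values[i] for i in NCHW_TO_NHWC]


if __name__ == "__main__":
    print(nchw_to_nhwc([1, 1, 3, 3]))  # ksize from the error message
    print(nchw_to_nhwc([1, 1, 2, 2]))  # strides from the error message
```

In NHWC terms the node is a 3x3 max pool with stride 2; the input tensor itself would need the same axis permutation.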


alexlee-gk commented on July 3, 2024

Thanks for the detailed reporting. All the mentioned issues should be fixed as of now:

Metrics shapes not fully defined. This occurred only on tf 1.9 because that version seems to have weaker static shape inference compared to tf >= 1.10. If you pull my changes, it should now work with tf 1.9.
LPIPS metric not supported on CPU. I updated the LPIPS models to support both GPU and CPU. Make sure to clear the cache of the old models: rm ~/.lpips/*.
Training getting stuck. The training doesn't get stuck for me, so I haven't changed anything in the repo in that regard. The SAVP (i.e. VAE-GAN) model trains for me on Titan X, P100, and V100 GPUs, with single- and multi-GPU training, python 3.5 and 3.6, tf 1.9 and 1.12, and cudnn 7.3.0.29. In my case with tf 1.12, the training script reports that images are processed at about 13.2 and 16.3 images/s when using 1 and 2 V100 GPUs, respectively. Just in case, can you make sure you pull all changes and try again? This is the command I use:

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --input_dir data/bair --dataset bair --model savp --model_hparams_dict hparams/bair_action_free/ours_savp/model_hparams.json --output_dir logs/bair_action_free/ours_savp

Also, the option aggregate_nccl only matters for multi-GPU training, and it specifies how the gradients should be aggregated. Enabling it has resulted in slower training when I have tried it, so it's better to leave the default.
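As a quick reading of the throughput figures quoted above (13.2 images/s on one V100 and 16.3 images/s on two), the speedup and parallel efficiency can be worked out as follows. The `scaling` helper is just arithmetic written for this example.

```python
def scaling(single_rate, multi_rate, num_gpus):
    """Return (speedup, parallel efficiency) for a multi-GPU throughput."""
    speedup = multi_rate / single_rate
    return speedup, speedup / num_gpus


if __name__ == "__main__":
    speedup, efficiency = scaling(13.2, 16.3, 2)
    print("speedup %.2fx, efficiency %.0f%%" % (speedup, 100 * efficiency))
```

With these numbers the second GPU gives roughly a 1.23x speedup, or about 62% parallel efficiency, so the gain from multi-GPU training is well below linear in this setup.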


crequena commented on July 3, 2024

Hi Alex, I just followed the instructions in your previous post, and training on both single and multiple GPUs is working! Thank you so much for your dedication!


alexlee-gk commented on July 3, 2024

That's great! I'll close this issue then. Feel free to re-open or open another one if another issue arises.
