Comments (14)
This issue happens when training a GAN variant (i.e. the GAN or VAE-GAN) with multiple GPUs.
I'll look into this. As a temporary work-around, you can train with a single GPU.
from video_prediction.
I set CUDA_VISIBLE_DEVICES=0 and the problem was solved, but I then ran into a Resource Exhausted error. As you suggested in #6, I changed the batch size to 4 and ran again with this script:
CUDA_VISIBLE_DEVICES=0 python scripts/train.py --input_dir data/bair --dataset bair \
--model savp --model_hparams_dict hparams/bair_action_free/ours_savp/model_hparams.json \
--output_dir logs/bair_action_free/ours_savp \
--gpu_mem_frac 0.7 \
--model_hparams tv_weight=0.001,transformation=flow
Now it seems to be running correctly, although it still occasionally prints the same message about the topological sort failing. Thanks a lot.
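The single-GPU workaround above can also be applied from inside a Python script instead of on the command line. The following is a minimal sketch, not code from the repository: `restrict_to_single_gpu` is a hypothetical helper name, and it assumes the variable is set before TensorFlow initializes CUDA (setting it before the import is the safe choice).

```python
import os


def restrict_to_single_gpu(gpu_index=0, environ=None):
    """Expose only one GPU to CUDA-based libraries.

    Hypothetical helper, not part of the video_prediction repo.
    Call it before importing TensorFlow so the setting is in place
    when CUDA is initialized.
    """
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    env = os.environ if environ is None else environ
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env["CUDA_VISIBLE_DEVICES"]


# Equivalent to prefixing the command with CUDA_VISIBLE_DEVICES=0.
print(restrict_to_single_gpu(0))
```

The optional `environ` argument is only there so the helper can be exercised on a plain dict without touching the real process environment.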
Great! However, be aware that that batch size might end up being too small, and the results won't be as good as in the paper. I'll be making a few changes to improve accuracy and reduce the memory footprint. I'll post an update when I do and also when I fix the multi-GPU training.
Dear Alex,
Thank you for sharing the code of this awesome project, and congratulations on your results and the very nice paper!
I believe my training also hangs because of this issue whenever a GAN loss is used. However, it also happens when training on a single GPU (Tesla V100, CUDA 9.0.176, tf 1.9.0 and 1.12.0, cudnn 7.1.3). I find that training runs "smoothly" if the CPU is used :)
It would be awesome if this issue were solved (I would gladly help, but I am not fluent enough to dig into this problem).
I am building something for climate science based on your work, and it would be awesome to talk to you! Check your inbox :)
EDIT: Actually, training failed on the CPU too. It fails when training reaches progress_freq or summary_freq, with the following error:
/Net/Groups/BGI/people/crequ/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
898 try:
899 result = self._run(None, fetches, feed_dict, options_ptr,
--> 900 run_metadata_ptr)
901 if run_metadata:
902 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/Net/Groups/BGI/people/crequ/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1133 if final_fetches or final_targets or (handle and feed_dict_tensor):
1134 results = self._do_run(handle, final_targets, final_fetches,
-> 1135 feed_dict_tensor, options, run_metadata)
1136 else:
1137 results = []

/Net/Groups/BGI/people/crequ/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1314 if handle is None:
1315 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1316 run_metadata)
1317 else:
1318 return self._do_call(_prun_fn, handle, feeds, fetches)

/Net/Groups/BGI/people/crequ/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1333 except KeyError:
1334 pass
-> 1335 raise type(e)(node_def, op, message)
1336
1337 def _extend_graph(self):

InvalidArgumentError: Retval[4] does not have value
Although the two failures are closely related, note that on the GPU training never leaves step 0, whereas on the CPU it reaches the progress_freq step. It seems that fetches['d_losses'] and fetches['g_losses'] can only be retrieved at initialization and are gone afterwards, so probably no real training is in progress on the CPU either.
Hi Chris, I have made several improvements in the experimental branch (which will soon be merged into master), including a fix for this issue with GANs. Can you try the experimental branch and see if the problem persists?
Hey Alex, thanks a lot! I seem to have a dependency problem on the experimental branch: video_prediction/metrics.py tries to import lpips_tf. Maybe I am missing a new requirement?
Yes, you can install it with pip install -r requirements.txt. (There is one new dependency at the end of that file.)
Hey Alex, I get this error at build_graph. It happens with a dataset I created (shaped very similarly to kth) and also with 'bair' exactly as you provide it.
174
175 # inputs comes from the training dataset by default, unless train_handle is remapped to the val_handles
--> 176 model.build_graph(inputs)
177
178 if long_val_dataset is not None:

/video_prediction/models/base_model.py in build_graph(self, inputs)
686 self.accum_eval_metrics = OrderedDict()
687 for name, eval_metric in self.eval_metrics.items():
--> 688 _, self.accum_eval_metrics['accum_' + name] = tf.metrics.mean_tensor(eval_metric)
689 local_variables = set(tf.local_variables()) - original_local_variables
690 self.accum_eval_metrics_reset_op = tf.group([tf.assign(v, tf.zeros_like(v)) for v in local_variables])

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in mean_tensor(values, weights, metrics_collections, updates_collections, name)
1294 values = math_ops.to_float(values)
1295 total = metric_variable(
-> 1296 values.get_shape(), dtypes.float32, name='total_tensor')
1297 count = metric_variable(
1298 values.get_shape(), dtypes.float32, name='count_tensor')

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in metric_variable(shape, dtype, validate_shape, name)
49 ],
50 validate_shape=validate_shape,
---> 51 name=name)
52
53

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in variable(initial_value, trainable, collections, validate_shape, caching_device, name, dtype, constraint, use_resource)
2232 name=name, dtype=dtype,
2233 constraint=constraint,
-> 2234 use_resource=use_resource)
2235
2236

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in <lambda>(**kwargs)
2222 constraint=None,
2223 use_resource=None):
-> 2224 previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
2225 for getter in ops.get_default_graph()._variable_creator_stack: # pylint: disable=protected-access
2226 previous_getter = _make_getter(getter, previous_getter)

Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in default_variable_creator(next_creator, **kwargs)
2194 collections=collections, validate_shape=validate_shape,
2195 caching_device=caching_device, name=name, dtype=dtype,
-> 2196 constraint=constraint)
2197 elif not use_resource and context.executing_eagerly():
2198 raise RuntimeError(

Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py in __init__(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint)
310 name=name,
311 dtype=dtype,
--> 312 constraint=constraint)
313
314 # pylint: disable=unused-argument

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, constraint)
415 with ops.name_scope("Initializer"), ops.device(None):
416 initial_value = ops.convert_to_tensor(
--> 417 initial_value(), name="initial_value", dtype=dtype)
418 self._handle = _eager_safe_variable_handle(
419 shape=initial_value.get_shape(),

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/metrics_impl.py in <lambda>()
43
44 return variable_scope.variable(
---> 45 lambda: array_ops.zeros(shape, dtype),
46 trainable=False,
47 collections=[

Anaconda/install/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py in zeros(shape, dtype, name)
1545 except (TypeError, ValueError):
1546 # Happens when shape is a list with tensor elements
-> 1547 shape = ops.convert_to_tensor(shape, dtype=dtypes.int32)
1548 if not shape._shape_tuple():
1549 shape = reshape(shape, [-1]) # Ensure it's a vector

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, preferred_dtype)
1009 name=name,
1010 preferred_dtype=preferred_dtype,
-> 1011 as_ref=False)
1012
1013

Anaconda/install/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx)
1105
1106 if ret is None:
-> 1107 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
1108
1109 if ret is NotImplemented:

/Anaconda/install/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _tensor_shape_tensor_conversion_function(s, dtype, name, as_ref)
236 if not s.is_fully_defined():
237 raise ValueError(
--> 238 "Cannot convert a partially known TensorShape to a Tensor: %s" % s)
239 s_list = s.as_list()
240 int64_value = 0

ValueError: Cannot convert a partially known TensorShape to a Tensor: (?, ?)
The problem is that one of the metrics is not returning fully defined shapes, and I suspect that it might be the new LPIPS metric causing this. If that’s the case, you can just comment this metric out:
https://github.com/alexlee-gk/video_prediction/blob/experimental/video_prediction/models/base_model.py#L149
Unlike the losses, the metrics don’t affect the training.
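One way to narrow down which metric is at fault is to list every metric whose static shape has an unknown dimension. The sketch below is plain Python and only illustrates the check: the metric names and shapes are made up, and it assumes each shape has been converted to a list where None marks an unknown dimension (the form TensorShape.as_list() returns in TensorFlow 1.x when the rank is known).

```python
def find_partially_defined(metric_shapes):
    """Return the names of metrics whose static shape is not fully known.

    metric_shapes maps a metric name to a list of dimensions, with None
    standing for an unknown dimension. A shape of None is treated here
    as an unknown rank (a convention of this sketch).
    """
    bad = []
    for name, shape in metric_shapes.items():
        if shape is None or any(dim is None for dim in shape):
            bad.append(name)
    return sorted(bad)


# Hypothetical shapes for illustration only; these are not taken from the repo.
shapes = {
    "psnr": [9, 16],
    "ssim": [9, 16],
    "lpips": [None, None],  # would print as (?, ?) in the ValueError
}
print(find_partially_defined(shapes))  # ['lpips']
```

Any metric this reports would trip tf.metrics.mean_tensor in the same way as the traceback above, since that function allocates its accumulators from the static shape.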
Unfortunately, commenting out that line (or L126, or every line using LPIPS in base_model.py) still leads to the same error.
Hey Alex,
Commenting out everything involving accum_eval_summary in both train.py and base_model.py allows the training to proceed. That means commenting out L301-315 and the clause or should_eval(step, args.accum_eval_summary_freq) in L317 of train.py, and L687-688, L711-713, L718-720, and L722 in base_model.py.
If no GAN loss is used, training works! However, it still seems to get stuck if I use video_image_sn_gan_weight or image_sn_gan_weight > 0 on a single GPU. I also tried the new argument aggregate_nccl=1, blindly (just in case it matters), with the same results.
This time training does not run on the CPU either, since max pooling on the CPU does not seem to support the data format:
InvalidArgumentError (see above for traceback): Default MaxPoolingOp only supports NHWC on device type CPU
[[Node: metrics/import/max_pool = MaxPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
Thank you for all your work! :)
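For context on the MaxPoolingOp error: the node uses the NCHW layout (batch, channels, height, width), while the CPU kernel only accepts NHWC (batch, height, width, channels). Converting such an op means permuting not only the tensor but also its per-dimension attributes. The sketch below shows that permutation on the ksize and strides values from the error message; it is an illustration of the layout difference, not a description of how the LPIPS models were later updated.

```python
# Index order that rearranges an NCHW-ordered list into NHWC order.
NCHW_TO_NHWC = (0, 2, 3, 1)


def nchw_to_nhwc(values):
    """Reorder a 4-element per-dimension list from NCHW to NHWC layout."""
    if len(values) != 4:
        raise ValueError("expected one value per dimension (4 in total)")
    return [values[i] for i in NCHW_TO_NHWC]


# Values copied from the max_pool node in the error message.
ksize = [1, 1, 3, 3]
strides = [1, 1, 2, 2]
print(nchw_to_nhwc(ksize))    # [1, 3, 3, 1]
print(nchw_to_nhwc(strides))  # [1, 2, 2, 1]
```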
Thanks for the detailed reporting. All the mentioned issues should be fixed as of now:
- Metrics shapes not fully defined. This occurred only on tf 1.9 because that version seems to have weaker static shape inference compared to tf >= 1.10. If you pull my changes, it should now work with tf 1.9.
- LPIPS metric not supported on CPU. I updated the LPIPS models to support both GPU and CPU. Make sure to clear the cache of the old models: rm ~/.lpips/*
- The training doesn't get stuck for me, so I haven't changed anything in the repo in that regard. The SAVP (i.e. VAE-GAN) model trains for me on Titan X, P100, and V100 GPUs, with single and multi-GPU training, python 3.5 and 3.6, tf 1.9 and 1.12, and cudnn 7.3.0.29. In my case with tf 1.12, the training script reports that images are processed at about 13.2 and 16.3 images/s when using 1 and 2 V100 GPUs, respectively. Just in case, can you make sure you pull all changes and try again? This is the command I use:
CUDA_VISIBLE_DEVICES=0 python scripts/train.py --input_dir data/bair --dataset bair --model savp --model_hparams_dict hparams/bair_action_free/ours_savp/model_hparams.json --output_dir logs/bair_action_free/ours_savp
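The cache-clearing step mentioned above (rm ~/.lpips/*) can also be done from Python, for example inside a setup script. This is a rough equivalent under stated assumptions: `clear_cache` is a hypothetical helper, and like the shell glob it only removes non-hidden regular files directly inside the directory.

```python
from pathlib import Path


def clear_cache(cache_dir):
    """Delete non-hidden regular files directly inside cache_dir.

    Returns the number of files removed. Subdirectories and dotfiles
    are left alone, roughly mirroring `rm <dir>/*`. A missing directory
    is treated as an already-empty cache.
    """
    cache_dir = Path(cache_dir).expanduser()
    if not cache_dir.is_dir():
        return 0
    removed = 0
    for path in cache_dir.iterdir():
        if path.is_file() and not path.name.startswith("."):
            path.unlink()
            removed += 1
    return removed


# Usage (left as a comment so that running this file deletes nothing):
# clear_cache("~/.lpips")
```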
Also, the option aggregate_nccl only matters for multi-GPU training, where it specifies how the gradients should be aggregated. Enabling it has resulted in slower training when I have tried it, so it's better to leave the default.
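As a quick check on the throughput figures quoted above (about 13.2 and 16.3 images/s on 1 and 2 V100 GPUs), the speedup and scaling efficiency can be worked out as follows. This is only arithmetic on the reported numbers, assuming both are total throughput rather than per-GPU rates.

```python
def scaling_summary(single_gpu_rate, multi_gpu_rate, num_gpus):
    """Return (speedup, efficiency) for a multi-GPU run.

    speedup is the multi-GPU rate relative to the single-GPU rate;
    efficiency is the speedup divided by the number of GPUs
    (1.0 would be perfect linear scaling).
    """
    if single_gpu_rate <= 0 or num_gpus <= 0:
        raise ValueError("rates and GPU count must be positive")
    speedup = multi_gpu_rate / single_gpu_rate
    efficiency = speedup / num_gpus
    return speedup, efficiency


speedup, efficiency = scaling_summary(13.2, 16.3, 2)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# speedup: 1.23x, efficiency: 62%
```

So on these numbers the second GPU adds roughly 23% throughput, well short of linear scaling.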
Hi Alex, I just followed the instructions in your previous post, and training on both single and multiple GPUs is totally working! Thank you so much for your dedication!
That's great! I'll close this issue then. Feel free to re-open or open another one if another issue arises.