Comments (14)
@mychina75 The original paper has training details for the COCO-Stuff dataset :)
from bisenetv2-tensorflow.
The paper reports 150K, 10K, and 20K iterations for the Cityscapes, CamVid, and COCO-Stuff datasets respectively.
But COCO has far more images than Cityscapes, so why is the iteration count so small?
Is something wrong?
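One way to put the question above in numbers is to convert iterations into approximate passes over the training set. The sketch below is only illustrative: the iteration counts come from the comment, while the batch size and image counts are placeholder assumptions that should be replaced with the values actually used.

```python
def epochs_seen(iterations: int, batch_size: int, num_train_images: int) -> float:
    """Approximate number of passes over the training set."""
    if iterations < 0 or batch_size <= 0 or num_train_images <= 0:
        raise ValueError("iterations must be >= 0; sizes must be positive")
    return iterations * batch_size / num_train_images


# Iteration counts are quoted in the thread. Everything else is a placeholder.
BATCH_SIZE = 16               # assumed, not stated in the thread
SMALL_DATASET_IMAGES = 3_000  # placeholder size for a small dataset
LARGE_DATASET_IMAGES = 100_000  # placeholder size for a large dataset

small_epochs = epochs_seen(150_000, BATCH_SIZE, SMALL_DATASET_IMAGES)
large_epochs = epochs_seen(20_000, BATCH_SIZE, LARGE_DATASET_IMAGES)
print(f"small dataset: {small_epochs:.1f} epochs, large dataset: {large_epochs:.1f} epochs")
```

Under these placeholder numbers the larger dataset is seen far fewer times, which is the discrepancy the question is pointing at. Whether it matters depends on the real split sizes and batch size.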
from bisenetv2-tensorflow.
@mychina75 You may get a more satisfying answer to that question at https://github.com/ycszen/BiSeNet (the original author's repo) :)
from bisenetv2-tensorflow.
Thank you, I will check.
There is also an error when resuming training. Please check.
##################
2020-05-25 16:52:38.994 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:__init__:229 - Initialize human bisenetv2 multi gpu trainner complete
2020-05-25 16:52:41.706 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:319 - => Restoring weights from: ./model/coco_human/bisenetv2/ ...
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-05-25 16:52:42.368599: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
2020-05-25 16:52:42.376 | ERROR | trainner.human_bisenetv2_multi_gpu_trainner:train:332 - Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
2 root error(s) found.
(0) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
[[node loader_and_saver/save/RestoreV2 (defined at /opt/project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
(1) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
[[node loader_and_saver/save/RestoreV2 (defined at /opt/project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
[[loader_and_saver/save/RestoreV2/_37]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'loader_and_saver/save/RestoreV2':
  File "tools/train_bisenetv2_human.py", line 40, in <module>
    train_model()
  File "tools/train_bisenetv2_human.py", line 27, in train_model
    worker = multi_gpu_trainner.BiseNetV2HumanMultiTrainer() #MultiTrainer()
  File "/opt/project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py", line 201, in __init__
    self._loader = tf.train.Saver(self._net_var)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
2020-05-25 16:52:42.377 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:333 - => Can not load pretrained model weights: ./model/coco_human/bisenetv2/
2020-05-25 16:52:42.377 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:334 - => Now it starts to train BiseNetV2 from scratch ...
from bisenetv2-tensorflow.
@mychina75 Which ckpt file did you use to resume training?
from bisenetv2-tensorflow.
I set model_checkpoint_path to "./model/coco_human/bisenetv2/"
and made some changes to the restore code:
ckpt = tf.train.get_checkpoint_state(os.path.dirname(self._initial_weight))
self._loader.restore(self._sess, ckpt.model_checkpoint_path)  # originally: self._initial_weight
The original code, 'self._loader.restore(self._sess, self._initial_weight)',
does not work with the SNAPSHOT_PATH './model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1.index'
either.
from bisenetv2-tensorflow.
@mychina75 The snapshot file path should be ./model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1 instead:)
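The point of the reply above is that a restore call expects the checkpoint prefix, not one of the files written next to it. A minimal sketch of normalizing such a path follows; the helper name `checkpoint_prefix` is hypothetical and is not part of the repository.

```python
import re

# File suffixes written alongside a TensorFlow checkpoint prefix.
_CKPT_SUFFIX = re.compile(r"\.(index|meta|data-\d{5}-of-\d{5})$")


def checkpoint_prefix(path: str) -> str:
    """Strip a trailing .index/.meta/.data-* suffix, if there is one."""
    return _CKPT_SUFFIX.sub("", path)


snapshot = "./model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1.index"
print(checkpoint_prefix(snapshot))
```

A path that is already a prefix passes through unchanged, so the helper could be applied to the configured snapshot path before it is handed to the loader.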
from bisenetv2-tensorflow.
Hmm... still the same error: Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
################
2020-05-26 09:39:59.213 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:319 - => Restoring weights from: ./model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1 ...
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-05-26 09:39:59.880135: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
2020-05-26 09:39:59.928 | ERROR | trainner.human_bisenetv2_multi_gpu_trainner:train:332 - Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
2 root error(s) found.
(0) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
[[node loader_and_saver/save/RestoreV2 (defined at /project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
(1) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
[[node loader_and_saver/save/RestoreV2 (defined at /project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
[[loader_and_saver/save/RestoreV2/_223]]
from bisenetv2-tensorflow.
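@mychina75 How was your ckpt file generated? Are you sure the ckpt file path was entered correctly? This error means the ckpt model file does not match the current computation graph :)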
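A mismatch between the checkpoint and the graph can be checked directly by comparing the variable names the loader asks for with the keys stored in the checkpoint. The following is a sketch under assumptions: both helper names are hypothetical, and `tf.train.list_variables` is the only TensorFlow call used.

```python
from typing import Iterable, List


def checkpoint_keys(ckpt_prefix: str) -> List[str]:
    """Return the variable names stored in a checkpoint.

    TensorFlow is imported lazily so the comparison helper below can be
    used without it.
    """
    import tensorflow as tf  # tf.train.list_variables yields (name, shape) pairs
    return [name for name, _shape in tf.train.list_variables(ckpt_prefix)]


def missing_from_checkpoint(graph_names: Iterable[str],
                            ckpt_keys: Iterable[str]) -> List[str]:
    """List graph variables whose key is absent from the checkpoint."""
    available = set(ckpt_keys)
    # Graph variable names end in ":0"; checkpoint keys do not.
    return sorted(n for n in graph_names if n.rsplit(":", 1)[0] not in available)
```

In the trainer from this thread, `graph_names` could be `[v.name for v in self._net_var]` and `ckpt_keys` the result of `checkpoint_keys(...)` on the snapshot prefix. Any name returned is one the restore would fail on.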
from bisenetv2-tensorflow.
I did not change how the model is saved; it is in xxx_gpu_trainner.py:
# define saver and loader
with tf.variable_scope('loader_and_saver'):
    self._net_var = [vv for vv in tf.global_variables() if 'lr' not in vv.name]
    self._loader = tf.train.Saver(self._net_var)
    self._saver = tf.train.Saver(max_to_keep=5)
The restore happens here:
if CFG.TRAIN.RESTORE_FROM_SNAPSHOT.ENABLE:
    try:
        LOG.info('=> Restoring weights from: {:s} ... '.format(self._initial_weight))
        self._loader.restore(self._sess, self._initial_weight)
        ...
Could this be related to the FREEZE_BN setting? The default is ENABLE: False.
The code checks it here:
# define moving average op
with tf.variable_scope(name_or_scope='moving_avg'):
    if CFG.TRAIN.FREEZE_BN.ENABLE:
        train_var_list = [
            v for v in tf.trainable_variables() if 'beta' not in v.name and 'gamma' not in v.name
        ]
    else:
        train_var_list = tf.trainable_variables()
Does this parameter need to be saved separately?
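The key in the error ends in `/Momentum`, which looks like an optimizer slot variable rather than a batch-norm parameter. The sketch below applies the loader's name test from the snippet above to plain strings, to show that such a name passes the `'lr'` filter and is therefore requested on restore. The second function is a hypothetical variant, not code from the repository, and `"lr:0"` is a made-up variable name.

```python
from typing import Iterable, List


def loader_filter(names: Iterable[str]) -> List[str]:
    """Same name test as the loader in the thread: drop names containing 'lr'."""
    return [n for n in names if "lr" not in n]


def loader_filter_without_slots(names: Iterable[str],
                                slot_suffix: str = "/Momentum") -> List[str]:
    """Hypothetical variant that also drops optimizer slot variables."""
    return [
        n for n in loader_filter(names)
        if not n.split(":")[0].endswith(slot_suffix)
    ]


BETA = ("BiseNetV2/aggregation_branch/guided_aggregation_block/"
        "aggregation_features/aggregation_feature_output/bn/beta:0")
names = [BETA, BETA[:-2] + "/Momentum:0", "lr:0"]  # "lr:0" is a made-up name
kept = loader_filter(names)
kept_without_slots = loader_filter_without_slots(names)
print(len(kept), len(kept_without_slots))
```

If the checkpoint being restored was written without optimizer state, a loader built from the first filter would ask for keys it cannot find. Whether that is what happened here is not confirmed in the thread.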
from bisenetv2-tensorflow.
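@mychina75 BN is not frozen by default. If you are using a ckpt file saved during training, you should not hit this problem. If you are using a ckpt file saved during inference, this problem will occur. I have tried this myself before without any problem; I will test it again when I have time :)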
from bisenetv2-tensorflow.
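@mychina75 Also, could you provide more detailed steps to reproduce your problem? For example, which parts of the code you modified, how you started training, how you saved the parameters, and how you started restoring the weights :)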
from bisenetv2-tensorflow.
Solved. I changed this part of *_gpu_trainner.py. It seems some variables were not being saved; after the change, the .meta file grew from 7.35MB to 9.09MB. It should not affect the pb file.
# define saver and loader
with tf.variable_scope('loader_and_saver'):
    self._net_var = [vv for vv in tf.global_variables() if 'lr' not in vv.name]
    self._loader = tf.train.Saver(self._net_var)
    self._saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)
---------------------- >
with tf.variable_scope('loader_and_saver'):
    self._net_var = [vv for vv in tf.global_variables() if 'lr' not in vv.name]
    self._loader = tf.train.Saver(self._net_var)
    self._saver = tf.train.Saver(max_to_keep=5)
from bisenetv2-tensorflow.
@mychina75 OK :)
from bisenetv2-tensorflow.