Hi, thank you for your great work. I'd like to train your model on COCO db, an

about coco db training about bisenetv2-tensorflow HOT 14 CLOSED

maybeshewill-cv commented on May 24, 2024

about coco db training

from bisenetv2-tensorflow.

Comments (14)

MaybeShewill-CV commented on May 24, 2024

@mychina75 The original paper gives the training details for the COCO-Stuff dataset :)

mychina75 commented on May 24, 2024

In the paper, the iteration counts are 150K, 10K, and 20K for the Cityscapes, CamVid, and COCO-Stuff datasets respectively.
But COCO has far more images than Cityscapes, so why are the iteration counts so small?
Could something be wrong?
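
One way to frame the question is to convert iterations into approximate epochs. The sketch below is only arithmetic: the batch size of 16 and the training-set sizes (2,975 images for Cityscapes, 9,000 for COCO-Stuff) are assumptions for illustration and are not stated anywhere in this thread, so they should be checked against the paper before drawing conclusions.

```python
def epochs_seen(iterations, batch_size, num_train_images):
    """Approximate number of passes over the training set."""
    return iterations * batch_size / num_train_images


# Assumed inputs (not taken from this thread): batch size and dataset sizes.
BATCH_SIZE = 16
CITYSCAPES_TRAIN_IMAGES = 2_975
COCO_STUFF_TRAIN_IMAGES = 9_000

print("Cityscapes epochs:", epochs_seen(150_000, BATCH_SIZE, CITYSCAPES_TRAIN_IMAGES))
print("COCO-Stuff epochs:", epochs_seen(20_000, BATCH_SIZE, COCO_STUFF_TRAIN_IMAGES))
```

Under these assumed inputs the two schedules differ widely in epochs, which is the gap the question points at. Why the schedules were chosen that way is a question for the paper's authors.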

MaybeShewill-CV commented on May 24, 2024

@mychina75 You may get a satisfactory answer to that question at https://github.com/ycszen/BiSeNet (the original author's repo) :)

mychina75 commented on May 24, 2024

Thank you, I will check.
There is also an error when resuming training. Please take a look.

##################
2020-05-25 16:52:38.994 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:__init__:229 - Initialize human bisenetv2 multi gpu trainner complete
2020-05-25 16:52:41.706 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:319 - => Restoring weights from: ./model/coco_human/bisenetv2/ ...
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-05-25 16:52:42.368599: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
2020-05-25 16:52:42.376 | ERROR | trainner.human_bisenetv2_multi_gpu_trainner:train:332 - Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

2 root error(s) found.
(0) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
[[node loader_and_saver/save/RestoreV2 (defined at /opt/project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
(1) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
[[node loader_and_saver/save/RestoreV2 (defined at /opt/project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
[[loader_and_saver/save/RestoreV2/_37]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'loader_and_saver/save/RestoreV2':
  File "tools/train_bisenetv2_human.py", line 40, in <module>
    train_model()
  File "tools/train_bisenetv2_human.py", line 27, in train_model
    worker = multi_gpu_trainner.BiseNetV2HumanMultiTrainer() #MultiTrainer()
  File "/opt/project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py", line 201, in __init__
    self._loader = tf.train.Saver(self._net_var)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

2020-05-25 16:52:42.377 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:333 - => Can not load pretrained model weights: ./model/coco_human/bisenetv2/
2020-05-25 16:52:42.377 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:334 - => Now it starts to train BiseNetV2 from scratch ...

MaybeShewill-CV commented on May 24, 2024

@mychina75 Which ckpt file did you use to resume training?

mychina75 commented on May 24, 2024

I set model_checkpoint_path to "./model/coco_human/bisenetv2/"
and changed the restore code as follows:
ckpt = tf.train.get_checkpoint_state(os.path.dirname(self._initial_weight))
self._loader.restore(self._sess, ckpt.model_checkpoint_path)  # was: self._initial_weight

The original code, 'self._loader.restore(self._sess, self._initial_weight)',
does not work with SNAPSHOT_PATH set to './model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1.index'
either.

MaybeShewill-CV commented on May 24, 2024

@mychina75 The snapshot file path should be ./model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1 instead:)
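
The point about the snapshot path can be sketched as a small helper: a TensorFlow 1.x checkpoint is addressed by its prefix, not by one of the files written next to it. The function name `checkpoint_prefix` is hypothetical and is not part of this repository; it assumes the usual `.index`, `.meta`, and `.data-XXXXX-of-XXXXX` file suffixes.

```python
import re

# File suffixes that TensorFlow 1.x writes next to a checkpoint prefix.
_CKPT_FILE_SUFFIX = re.compile(r"\.(index|meta|data-\d{5}-of-\d{5})$")


def checkpoint_prefix(path):
    """Return the checkpoint prefix for a path that may name one of its files.

    Hypothetical helper for illustration only.
    """
    return _CKPT_FILE_SUFFIX.sub("", path)


print(checkpoint_prefix(
    "./model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1.index"
))
```

A path that is already a prefix passes through unchanged, so the helper could be applied to SNAPSHOT_PATH before it is handed to the saver's restore call.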

mychina75 commented on May 24, 2024

Uh... still the same error: Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint

################
2020-05-26 09:39:59.213 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:319 - => Restoring weights from: ./model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1 ...
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-05-26 09:39:59.880135: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
2020-05-26 09:39:59.928 | ERROR | trainner.human_bisenetv2_multi_gpu_trainner:train:332 - Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

2 root error(s) found.
(0) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
[[node loader_and_saver/save/RestoreV2 (defined at /project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
(1) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
[[node loader_and_saver/save/RestoreV2 (defined at /project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
[[loader_and_saver/save/RestoreV2/_223]]

MaybeShewill-CV commented on May 24, 2024

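@mychina75 How was your ckpt file generated? Are you sure the ckpt file path was entered correctly? This error means the ckpt model file does not match the current computation graph :)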
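
A mismatch like this can be checked by comparing the variable names the graph wants to restore against the keys stored in the checkpoint. The sketch below works on plain name lists so it runs without TensorFlow; `missing_from_checkpoint` is a hypothetical helper, and the sample names are illustrative (only the `.../bn/beta/Momentum` key comes from the log above). In TensorFlow 1.x the two lists could be obtained from `[v.name for v in tf.global_variables()]` and `[name for name, _ in tf.train.list_variables(ckpt_prefix)]`.

```python
def missing_from_checkpoint(graph_var_names, ckpt_var_names):
    """Return graph variable names that have no matching key in the checkpoint.

    Graph names carry an output suffix such as ':0'; checkpoint keys do not,
    so the suffix is dropped before comparing.
    """
    wanted = {name.split(":")[0] for name in graph_var_names}
    return sorted(wanted - set(ckpt_var_names))


# Illustrative names only; a real run would read them from the graph and ckpt.
SCOPE = ("BiseNetV2/aggregation_branch/guided_aggregation_block/"
         "aggregation_features/aggregation_feature_output/bn/beta")
graph_names = [SCOPE + ":0", SCOPE + "/Momentum:0"]
ckpt_names = [SCOPE]

for name in missing_from_checkpoint(graph_names, ckpt_names):
    print("missing:", name)
```

If every missing key ends in an optimizer slot name such as `Momentum`, the checkpoint probably holds model weights without optimizer state.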

mychina75 commented on May 24, 2024

I did not change the model-saving code; it is in xxx_gpu_trainner.py:
# define saver and loader
with tf.variable_scope('loader_and_saver'):
    self._net_var = [vv for vv in tf.global_variables() if 'lr' not in vv.name]
    self._loader = tf.train.Saver(self._net_var)
    self._saver = tf.train.Saver(max_to_keep=5)
The restore happens here:
if CFG.TRAIN.RESTORE_FROM_SNAPSHOT.ENABLE:
    try:
        LOG.info('=> Restoring weights from: {:s} ... '.format(self._initial_weight))
        self._loader.restore(self._sess, self._initial_weight)
        ...

Could this be related to the FREEZE_BN setting? The default is ENABLE: False.
The code has this check:
# define moving average op
with tf.variable_scope(name_or_scope='moving_avg'):
    if CFG.TRAIN.FREEZE_BN.ENABLE:
        train_var_list = [
            v for v in tf.trainable_variables() if 'beta' not in v.name and 'gamma' not in v.name
        ]
    else:
        train_var_list = tf.trainable_variables()
Does this parameter need to be saved separately?

MaybeShewill-CV commented on May 24, 2024

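@mychina75 By default BN is not frozen. If you are using a ckpt file saved during training, this problem should not occur. If you are using a ckpt file saved during prediction, it will occur. I have tried this myself before without problems; I will test it again when I have time :)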

MaybeShewill-CV commented on May 24, 2024

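@mychina75 Also, could you provide more detailed steps to reproduce your problem? For example: which parts of the code you modified, how you started training, how you saved the parameters, and how you started restoring the weights :)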

mychina75 commented on May 24, 2024

Solved. I changed this part of *_gpu_trainner.py. It seems some variables were not being saved; after the change the .meta file grew from 7.35MB to 9.09MB. This should not affect the pb file.
# define saver and loader
with tf.variable_scope('loader_and_saver'):
    self._net_var = [vv for vv in tf.global_variables() if 'lr' not in vv.name]
    self._loader = tf.train.Saver(self._net_var)
    self._saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)
---------------------- >
with tf.variable_scope('loader_and_saver'):
    self._net_var = [vv for vv in tf.global_variables() if 'lr' not in vv.name]
    self._loader = tf.train.Saver(self._net_var)
    self._saver = tf.train.Saver(max_to_keep=5)
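
A different approach from the change quoted above, and not the fix used in this thread, is to leave optimizer slot variables out of the loader's list so that a weights-only checkpoint can still be restored. The sketch below filters by name only and runs without TensorFlow. The helper names are hypothetical, and the `/Momentum` suffix is an assumption taken from the missing key in the error log; other optimizers use other slot names.

```python
def is_optimizer_slot(var_name, slot_suffixes=("/Momentum",)):
    """True if the variable name looks like an optimizer slot (assumed suffixes)."""
    base = var_name.split(":")[0]  # drop the ':0' output suffix
    return base.endswith(tuple(slot_suffixes))


def model_only(var_names):
    """Keep only names that are not optimizer slots, preserving order."""
    return [name for name in var_names if not is_optimizer_slot(name)]


# Illustrative names; in TF 1.x they would come from tf.global_variables().
names = ["net/conv/w:0", "net/conv/w/Momentum:0", "net/bn/beta:0"]
print(model_only(names))
```

In TensorFlow 1.x the same predicate could be applied to the variable objects before they are passed to `tf.train.Saver(var_list)`. The optimizer state would then start fresh rather than resume, which may or may not be acceptable for a given run.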

MaybeShewill-CV commented on May 24, 2024

@mychina75 OK :)
