chenbinghui1 / dsl

CVPR2022 paper "Dense Learning based Semi-Supervised Object Detection"

License: Apache License 2.0

Python 99.46% Dockerfile 0.06% Shell 0.14% Makefile 0.02% Batchfile 0.02% C++ 0.08% Cuda 0.13% Cython 0.11%

dsl's Issues

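Integer division by zero error

Hi everyone, I have a question.
I followed the tutorial in the README step by step, but when executing this step

the following error occurred:

I don't know why it happens, and searching online turned up nothing relevant. Any help would be appreciated.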

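Supervised learning

For step 4 on a single GPU, is it enough to just add the config path to train? Does anything else need to be configured? After simply adding the [r50_caffe_mslonger_tricks_0.Xdata.py] model and training for 100 epochs, the mAP is below 1% (with both 1% and 10% of the data). Could you share the configuration parameters for single-GPU training? Thank you very much.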

Where is "Adaptive Filtering Strategy" source code?

Hi, Thank you for your great work!

In your paper, you mention that you applied an AF strategy to improve the quality of pseudo-labels, but I could not find the code for this part. Could you release it?

how to install?

After installing the four dependencies, can the code be run directly? Is a command such as pip install -v needed?

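Pseudo-labels do not seem to be updated in time?

When pseudo-labels are predicted during training, I see that the image being predicted is [self.image_list[next_iter]]. Does this mean that each iteration updates the pseudo-labels of only one image? Or am I misunderstanding? I look forward to your answer. Many thanks.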

Paper release

http://www4.comp.polyu.edu.hk/~cslzhang/paper/DSL_cvpr22.pdf was linked from Google Scholar but unfortunately returns 404 now. Would be keen to read the paper in better rendering than Google cache :)

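Some questions about lscale

Hello. Regarding the lscale part, my understanding is that the image is downsampled and an MSE loss is then computed between the same-sized score maps of adjacent levels, offset by one level. How is this score map obtained? Is it something like applying a sigmoid activation to the feature maps output by the FPN? Also, when computing the MSE loss, the channel counts of adjacent levels are different, presumably also by a factor of 2. How is this handled?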

About adathres

Hi, thanks for your great work. I am wondering how to generate adathres.json?

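How is scale-invariant learning implemented?

Thank you very much for open-sourcing your code! While reading it, there were a few places I did not fully understand, so I would like to ask about them:

Regarding Adaptive Threshold, I see that your implementation differs slightly from the paper. Could you explain this a little?
Regarding Scale Invariant Learning, I did not understand how this part is implemented. Do you rescale the unlabeled images and load them together? How is a batch organized? In your code I only see how labeled and unlabeled images are organized; how are the rescaled images loaded and organized? What is flatten_As_labels used for here?
Regarding Ignore, my understanding is that the goal is to take the Proposals/Anchor Boxes that would otherwise be assigned to background by the high-quality pseudo-labels, assign them the ignore label via the ignore gt boxes, and thus not compute a loss for them. But following your logic here, if a Proposal/Anchor Box is assigned the background label by the ignore gt boxes or the gt boxes, i.e. flatten_ig_labels - self.num_classes = 0 and flatten_labels - self.num_classes = 0, its sample-wise weight is set to 0. Is this unreasonable?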

Hello, I have a problem when running inference on VOC data. When I run the script ./tools/inference_unlabeled_coco_data.sh, no inference results appear in the specified folder. Is there a script for running inference and generating pseudo-labels for VOC data? Thanks!

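Some questions about semi-supervised training

Hello, I read your work; it proposes many new ideas and I learned a lot from it.
I would like to ask: for semi-supervised training, does the supervised baseline need to be trained to full convergence, or should some headroom be left so that it is not fully converged? The labeled data is also used in the semi-supervised stage, so this would prevent overfitting on that data from hurting the model's overall performance.
I ask because you said the baseline generally reaches good results at 55 epochs. I trained on VOC data for 60 epochs and reached 63.8 AP50, whereas the supervised AP50 reported in your paper is 69.6. So would it be better to leave some headroom in the supervised stage?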

some questions

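Hi, binghui, thank you for sharing your work. I also work in the SSOD area, and I found some differences between your work and other open-source SSOD architectures. The questions are quite specific, so they were originally written in Chinese to prevent misunderstanding:
1. I read your implementation carefully. For the 10% standard task, it seems that before the semi-supervised part starts, you use the model trained on 10% of the data to generate a set of pseudo-labels, and then during semi-supervised training the number of iterations per epoch is determined by the number of images in these pseudo-labels. Is that right?
2. Your pseudo-labels are regenerated after every epoch through the pred_hook mechanism, and your AF module is also updated every epoch. If I understand correctly, your EMA model is updated at every iteration, but pseudo-labels are generated offline by the EMA model only at the end of each epoch. This differs from our own implementation, in which the EMA model generates pseudo-labels at every iteration and is also updated at every iteration. In principle your approach seems more robust to me. Is there a reason for implementing it this way?
Thank you for your work; I look forward to your reply.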

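About Aggregated Teacher

Hello, where can I find the code that updates the parameters of the Aggregated Teacher? I found parameter-update code in SemiEpochBasedRunner.EMA in mmdet/runner/hooks/semi_epoch_based_runner.py, but it seems to be the same as the EMA strategy. I am not sure whether I am looking in the wrong place. Thanks for your help.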
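
For reference, the plain exponential moving average (EMA) teacher update that the question compares against can be sketched as follows. This is a generic illustration only, not the repository's SemiEpochBasedRunner.EMA and not the paper's Aggregated Teacher; the function name `ema_update`, the dict-of-floats parameter layout, and the momentum value are all assumptions made for the example.

```python
def ema_update(teacher, student, momentum=0.99):
    """Update teacher parameters in place as an EMA of the student parameters.

    teacher, student: dicts mapping parameter name -> float
    (a hypothetical layout chosen to keep the sketch dependency-free).
    """
    for name, student_value in student.items():
        # new_teacher = momentum * old_teacher + (1 - momentum) * student
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student_value
    return teacher


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
ema_update(teacher, student, momentum=0.9)
print(teacher)
```

An update rule that does anything beyond this per-parameter blend (for example, mixing information across layers) would show up as extra terms inside the loop, which is one way to tell the two strategies apart when reading the code.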

local variable 'save_path' referenced before assignment

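Hello, where is the code for patch shuffle? I have searched for a long time without finding it.

The submitted code appears to be built on a community open-source library and is quite cluttered. Could you point out where the key implementation of each module is located? Many thanks!

Hello, where is the code for the Aggregated Teacher?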

Question about dsl

Must the supervised model be trained in advance? Can I use DSL to train a model with both labeled and unlabeled samples from the beginning (with only a pretrained backbone)?

Results issue

Hi,
I trained the supervised model following the steps you give (steps 1-4). After training the supervised baseline model on the COCO dataset,
I ran semi_dest.sh with the corresponding file paths to measure the performance of the supervised model. The performance is
12% (I used 10% as partially labeled data), but Table 1 of your paper reports "23.70 ± 0.22". How can I solve this issue?

Secondly, I am training the model on 1 GPU. This is the only difference.

I look forward to your response and guidance. Thanks.

train error

Hello, when training the network with multi-GPU distributed training, DistSamplerSeedHook_semi raised the following error:

What could be the cause?

how to inference

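I first trained a baseline on my own dataset, reaching an mAP of about 30%, and then put the weights after pretrained. However, in the DSL Training stage the accuracy does not start from around 30% but from zero, and it rises very slowly: after 16 epochs the mAP is only about 5%. What went wrong?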

Some questions about your code

Hi, thanks for your wonderful work! Where can I find the code about Adaptive Filtering Strategy? Thanks!

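About unlabel_train.sh

A question about the unlabel_train.sh script: I want to reproduce the project, but this file is not in the DEMO directory. Which directory is it in? Could you help with this?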

how to inference

Hello, how can I use DSL to run inference on images I have prepared myself?
I found tools/semi_dist_test.sh in the README, but after looking at it, it does not seem to meet my needs.

RuntimeError: Address already in use

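Hi everyone, after a month of struggling I finally got the environment set up on Ubuntu 18.04.5 under WSL.
But when it runs

this command from the README, it runs for a while and then reports the following error: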

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
  File "./tools/train.py", line 202, in <module>
    main()
  File "./tools/train.py", line 120, in main
    init_dist(args.launcher, **cfg.dist_params)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/dist_utils.py", line 18, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/dist_utils.py", line 35, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Killing subprocess 1941
Killing subprocess 1942
Killing subprocess 1943
Killing subprocess 1944
Killing subprocess 1945
Killing subprocess 1946
Killing subprocess 1947
Killing subprocess 1948
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', './tools/train.py', '--local_rank=7', 'configs/fcos_semi/r50_caffe_mslonger_tricks_0.Xdata.py', '--launcher', 'pytorch', '--work-dir', 'workdir_coco/r50_caffe_mslonger_tricks_0.1data']' returned non-zero exit status 1.

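The main error is RuntimeError: Address already in use.
I then tried running the failing command, /usr/bin/python -u ./tools/train.py --local_rank=7 configs/fcos_semi/r50_caffe_mslonger_tricks_0.Xdata.py --launcher pytorch --work-dir workdir_coco/r50_caffe_mslonger_tricks_0.1data, by itself and found that it runs. I searched online; the cause is probably that PyTorch distributed training used the same port for multiple tasks on a single machine. So I changed every master_port parameter in the DSL project, as follows

But it still reports RuntimeError: Address already in use. This is so frustrating.
Any guidance would be appreciated; I have already starred the repository.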
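
Because the error comes from binding the rendezvous TCP port, one thing that may help is to check whether the chosen port is really free, or to let the operating system pick one. The sketch below uses only Python's standard `socket` module; the helper names `find_free_port` and `port_is_free` are made up for this example, and how the resulting port reaches the DSL scripts (for instance via the `--master_port` option of `torch.distributed.launch`) is an assumption, since the issue does not show the launch script.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a TCP port that is currently unused on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def port_is_free(port, host="127.0.0.1"):
    """Return True if binding to the port succeeds, i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


port = find_free_port()
print(port)
```

A port reported as busy by `port_is_free` is often held by a leftover process from an earlier, crashed training run, so listing and stopping stale Python processes is worth trying before changing the configuration again.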

EMAModel and student model get the same performance

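Hello, yesterday I tried running inference with the trained epochxx.pth and epochxx.pth_ema and found that the two models reach exactly the same accuracy. Is this reasonable? In theory, shouldn't the EMA model be considerably more accurate, or has something gone wrong?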
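
When two checkpoints give identical accuracy, a quick sanity check is whether their weights differ at all. The following is a minimal sketch under stated assumptions: `max_abs_difference` is a hypothetical helper, and each checkpoint is represented as a dict mapping a parameter name to a flat list of floats, so real .pth files would first have to be loaded and converted to that form.

```python
def max_abs_difference(state_a, state_b):
    """Largest absolute element-wise difference between two parameter dicts.

    Each dict maps a parameter name to a flat list of floats (hypothetical
    layout; real checkpoints would first be converted to this form).
    """
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints contain different parameter names")
    largest = 0.0
    for name in state_a:
        a, b = state_a[name], state_b[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for {name}")
        for x, y in zip(a, b):
            largest = max(largest, abs(x - y))
    return largest


# Toy values standing in for a student checkpoint and its EMA counterpart.
student = {"conv.weight": [0.50, -0.20], "conv.bias": [0.10]}
ema = {"conv.weight": [0.48, -0.21], "conv.bias": [0.10]}
diff = max_abs_difference(student, ema)
print(diff)
```

A difference of exactly zero would suggest that the same weights were saved or loaded twice, while a non-zero difference with equal accuracy would point at the evaluation step instead.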

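About the baseline

Hello, thank you for your very meaningful work. I have a question about the baseline: in the 10% labeled data setting, the 90% unlabeled data has to be paired with the 10% labeled data. If all of the unlabeled data is used within one epoch, the labeled data has to be reused 9 times to make up that epoch. For the baseline results in the paper, does one epoch likewise use the labeled data 9 times, or only once? (Assuming the semi-supervised and baseline runs use the same number of iterations.)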
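
The factor of 9 in the question follows from simple arithmetic, sketched below. The one-to-one pairing of labeled and unlabeled images and the dataset size of 100,000 images are assumptions made for illustration; they are not taken from the repository or the paper.

```python
import math


def labeled_repeats_per_epoch(num_labeled, num_unlabeled):
    """Passes over the labeled set needed to pair one labeled image with
    every unlabeled image once (1:1 pairing is an assumption)."""
    return math.ceil(num_unlabeled / num_labeled)


# 10% labeled / 90% unlabeled split of a hypothetical 100,000-image dataset.
repeats = labeled_repeats_per_epoch(num_labeled=10_000, num_unlabeled=90_000)
print(repeats)  # 9
```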

Question about semi-supervised and supervised performance

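Thank you for your work. DSL also performs very well on my own dataset!
During my experiments I noticed that a model trained with 50% of the annotations already matches a model trained with 100% of the annotations. How should this be explained?
My previous understanding was that semi-supervised learning can never exceed fully supervised performance; otherwise, what is the purpose of the extra annotated boxes that full annotation provides over partial annotation? I cannot understand how a model trained with partially wrong labels can match full annotation. I hope you can clarify. Thank you!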

The code of MetaNet Part?

Hi,
Thank you for your great work! I have a question about your code. In your paper you mention that you applied a MetaNet to improve the quality of pseudo-labels, but I could not find the code for this part. I noticed that the ReadMe.md says you removed it. Could you release this code so that I can check the effectiveness of this part?

unlabel_pred error and cannot find the images

batch_mlvl_bboxes /= batch_mlvl_bboxes.new_tensor(
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 5142/5142, 19.1 task/s, elapsed: 269s, ETA: 0s
2022-11-24 10:22:08,173 - mmdet - INFO - [INFO] Unlabel pred Done!
Traceback (most recent call last):
  File "tools/train.py", line 202, in <module>
    main()
  File "tools/train.py", line 190, in main
    train_detector(
  File "/home/hello/PycharmProjects/pythonProject/new_DSL/DSL/mmdet/apis/train.py", line 218, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/hello/PycharmProjects/pythonProject/new_DSL/DSL/mmdet/runner/hooks/semi_epoch_based_runner.py", line 345, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/hello/PycharmProjects/pythonProject/new_DSL/DSL/mmdet/runner/hooks/semi_epoch_based_runner.py", line 267, in train
    self.call_hook('after_train_iter')
  File "/home/hello/anaconda3/envs/Torch-DSL/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/hello/PycharmProjects/pythonProject/new_DSL/DSL/mmdet/runner/hooks/unlabel_pred_hook.py", line 460, in after_train_iter
    self.after_train_iter_func(runner)
  File "/home/hello/PycharmProjects/pythonProject/new_DSL/DSL/mmdet/runner/hooks/unlabel_pred_hook.py", line 517, in after_train_iter_func
    assert len(runner.imagefiles) == 2
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 16269) of binary: /home/hello/anaconda3/envs/Torch-DSL/bin/python

When I tried to debug and set preload=1, start_point=2 to reduce the training time, another error occurred.

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading annotations into memory...
Done (t=0.13s)
creating index...
index created!
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading annotations into memory...
Done (t=0.20s)
creating index...
index created!
[ERROR][ModelInfer] Found no image in /home/hello/PycharmProjects/pythonProject/new_DSL/DSL/mydata/semicoco/images/full

But this directory does contain images; I don't know what is happening.
Can anybody help me? Thanks a lot.

May I ask if you have released all the code?

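Hello, while reproducing the work I found that the file demo/model_train/unlabel.sh is missing from the code, and I also could not find the code for some of the methods described in the paper. Have you released all of the core code? When I train with demo/model_train/unlabel_dynamic.sh, the code seems to have a bug.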

train debug

An error is raised when training on my own dataset:

Traceback (most recent call last):
  File "./tools/train.py", line 202, in <module>
    main()
  File "./tools/train.py", line 190, in main
    train_detector(
  File "/secret/ZLW/Codes/SSOD/DSL/mmdet/apis/train.py", line 218, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/secret/ZLW/Codes/SSOD/DSL/mmdet/runner/hooks/semi_epoch_based_runner.py", line 344, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/secret/ZLW/Codes/SSOD/DSL/mmdet/runner/hooks/semi_epoch_based_runner.py", line 265, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/secret/ZLW/Codes/SSOD/DSL/mmdet/runner/hooks/semi_epoch_based_runner.py", line 155, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/usr/local/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/secret/ZLW/Codes/SSOD/DSL/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 97, in new_func
    return old_func(*args, **kwargs)
  File "/secret/ZLW/Codes/SSOD/DSL/mmdet/models/detectors/base.py", line 171, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/secret/ZLW/Codes/SSOD/DSL/mmdet/models/detectors/single_stage.py", line 82, in forward_train
    losses = self.bbox_head.forward_train(x, img_metas, gt_bboxes,
  File "/secret/ZLW/Codes/SSOD/DSL/mmdet/models/dense_heads/base_dense_head.py", line 54, in forward_train
    losses = self.loss(*loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
  File "/usr/local/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 185, in new_func
    return old_func(*args, **kwargs)
  File "/secret/ZLW/Codes/SSOD/DSL/mmdet/models/dense_heads/fcos_head.py", line 309, in loss
    loss_cls = self.loss_cls(
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/secret/ZLW/Codes/SSOD/DSL/mmdet/models/losses/focal_loss.py", line 170, in forward
    loss_cls = self.loss_weight * calculate_loss_func(
  File "/secret/ZLW/Codes/SSOD/DSL/mmdet/models/losses/focal_loss.py", line 85, in sigmoid_focal_loss
    loss = _sigmoid_focal_loss(pred.contiguous(), target, gamma, alpha, None,
  File "/usr/local/lib/python3.8/site-packages/mmcv/ops/focal_loss.py", line 39, in forward
    assert input.size(0) == target.size(0)
AssertionError

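With the batch size set to 8 and the input resolution set to 512x512, I debugged and found that, starting at line 186 of semi_epoch_based_runner.py,
one element is appended to each of data_batch['img_metas'], data_batch['gt_bboxes'], and data_batch['gt_labels'], while a tensor of batch-1 images is concatenated to data_batch['img']. As a result the model's input tensor has shape (15,3,512,512), whereas the label information covers 9 images, which leads to the AssertionError when the loss is computed.
Have I misunderstood the code, or is this really a bug?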
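
The mismatch described above can be reproduced with plain arithmetic. The sketch below only mirrors the counts reported in this issue (batch size 8, one extra annotation entry, batch-1 extra images); `check_batch_consistency` is a hypothetical helper and is not part of the repository or of mmcv.

```python
def check_batch_consistency(num_images, num_img_metas, num_gt_bboxes, num_gt_labels):
    """Raise if the per-image annotation lists do not match the image count."""
    counts = {
        "img_metas": num_img_metas,
        "gt_bboxes": num_gt_bboxes,
        "gt_labels": num_gt_labels,
    }
    mismatched = {key: value for key, value in counts.items() if value != num_images}
    if mismatched:
        raise ValueError(f"{num_images} images but annotation counts are {mismatched}")


batch_size = 8
num_images = batch_size + (batch_size - 1)  # image tensor gains batch-1 images -> 15
num_annotations = batch_size + 1            # each annotation list gains one entry -> 9

try:
    check_batch_consistency(num_images, num_annotations, num_annotations, num_annotations)
    consistent = True
except ValueError as err:
    consistent = False
    print(err)
```

Calling a check like this just before the forward pass would surface the inconsistency earlier than the assertion inside the focal loss.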

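Why does the program hang when I run it?

When running the script for step 4 of "For COCO Partially Labeled Data protocol",

Here is my environment configuration:

It has been stuck at index created for a long time with no other error output. What is the cause? Could you help me solve it?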

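Reproduction did not reach the expected results

These are my baseline training results, using 10% of the dataset.

These are the results of running EMA on top of the above. It appears that the unlabeled data does not improve detection performance at all.

This is the model's performance after my modifications; the unlabeled data still brings no improvement. I have four 3090 GPUs, 30 GB of total memory, and 8 CPU cores, so there should be no information loss during computation, yet I cannot reproduce the results. Could you give me some suggestions? I now suspect the dataset may be the problem because it is too large, so I will switch to a smaller, less common dataset.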