goatmessi7 / asff Goto Github PK
yolov3 with mobilenet v2 and ASFF
License: GNU General Public License v3.0
I simply build the YOLOv3 ASFF model like this:
model = build_model()
a_t = torch.Tensor(a).unsqueeze(0).to(device)
out = model(a_t)[0]
Here out is a tensor of shape (48555, 85).
I think the first 4 dims are the box location, while the other 81 are scores. However, when I filter out the locations with a score greater than 0.1 (which seems mostly reasonable), the result is bad:
I get more than 48000 results, which apparently are not all objects.
How do I correctly extract objects from these outputs?
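For reference, the usual YOLOv3 decoding (a hypothetical sketch, not this repo's actual postprocess) treats column 4 as objectness and columns 5-84 as class scores, thresholds their product, and only then applies NMS. A NumPy illustration, assuming that layout:

```python
import numpy as np

def filter_detections(out, conf_thre=0.5):
    """out: (N, 85) array laid out as [cx, cy, w, h, objectness, 80 class scores].

    Returns boxes, scores, and class ids whose combined score passes the
    threshold; NMS should still be applied afterwards.
    """
    cls_scores = out[:, 5:]
    cls_ids = cls_scores.argmax(axis=1)
    # combined score = objectness * best class probability
    scores = out[:, 4] * cls_scores[np.arange(out.shape[0]), cls_ids]
    keep = scores > conf_thre
    return out[keep, :4], scores[keep], cls_ids[keep]
```

Thresholding on the raw class score alone (without multiplying by objectness) keeps nearly every anchor, which matches the "more than 48000 results" symptom.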
Hi, when I run eval.py, a size-mismatch error always occurs.
The command in the terminal is:
python -m torch.distributed.launch --nproc_per_node=2 --master_port=${RANDOM+10000} eval.py \
--cfg config/yolov3_baseline.cfg -d COCO --distributed --ngpu 2 \
--checkpoint weights/YOLOv3-ASFF_40.6.pth --half --asff --rfb -s 608
What confuses me is which pretrained model I should choose.
The errors are as follows:
size mismatch for level_0_fusion.weight_level_0.conv.weight: copying a param with shape torch.Size([16, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([8, 512, 1, 1]).
size mismatch for level_0_fusion.weight_level_0.batch_norm.weight: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for level_0_fusion.weight_level_0.batch_norm.bias: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for level_0_fusion.weight_level_0.batch_norm.running_mean: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for level_0_fusion.weight_level_0.batch_norm.running_var: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for level_0_fusion.weight_level_1.conv.weight: copying a param with shape torch.Size([16, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([8, 512, 1, 1]).
size mismatch for level_0_fusion.weight_level_1.batch_norm.weight: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for level_0_fusion.weight_level_1.batch_norm.bias: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for level_0_fusion.weight_level_1.batch_norm.running_mean: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for level_0_fusion.weight_level_1.batch_norm.running_var: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for level_0_fusion.weight_level_2.conv.weight: copying a param with shape torch.Size([16, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([8, 512, 1, 1]).
size mismatch for level_0_fusion.weight_level_2.batch_norm.weight: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for level_0_fusion.weight_level_2.batch_norm.bias: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for level_0_fusion.weight_level_2.batch_norm.running_mean: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for level_0_fusion.weight_level_2.batch_norm.running_var: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([8]).
size mismatch for level_0_fusion.weight_levels.weight: copying a param with shape torch.Size([3, 48, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 24, 1, 1]).
size mismatch for level_1_fusion.weight_level_0.conv.weight: copying a param with shape torch.Size([16, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([8, 256, 1, 1]).
Thanks for any help.
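The pattern above (every fusion weight is 16 channels in the checkpoint but 8 in the model) suggests the checkpoint was trained with a different fusion-compression setting than the model being built. As a quick diagnostic before loading, the two state dicts can be compared; a generic sketch that works on any dicts of objects exposing .shape:

```python
def diff_shapes(ckpt_state, model_state):
    """Return {param_name: (checkpoint_shape, model_shape)} for every
    parameter present in both dicts whose shapes disagree."""
    return {
        name: (tuple(p.shape), tuple(model_state[name].shape))
        for name, p in ckpt_state.items()
        if name in model_state and tuple(p.shape) != tuple(model_state[name].shape)
    }
```

Running this on torch.load(checkpoint) versus model.state_dict() lists every mismatch at once instead of aborting on the first.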
python demo.py -i example/test.jpg --cfg config/yolov3_baseline.cfg -d COCO --checkpoint weights/YOLOv3-baseline_38.8.pth --half --rfb -s 608
Missing key(s) in state_dict: "module_list.19.Feature_adaption.rfb.branch_0.0.weight", "module_list.19.Feature_adaption.rfb.branch_0.0.bias",
Add a demo.py to show the result for just one image.
Hello, thank you for your great work.
I found an out-of-memory error at the end of epoch 7.
I use 8 2080Ti GPUs, and my training script is:
python -m torch.distributed.launch --nproc_per_node=8 --master_port=233323 main.py \
--cfg config/yolov3_baseline.cfg -d COCO --tfboard --distributed --ngpu 8 \
--checkpoint weights/darknet53_feature_mx.pth --start_epoch 0 --half --asff --rfb --dropblock \
--log_dir log/COCO_ASFF -s 608
The error log is:
[Epoch 7/300][Iter 2930/2957][lr 0.001000][Loss: anchor 9.82, iou 10.44, l1 32.70, conf 27.21, cls 79.91, imgsize 544, time: 6.14]
[Epoch 7/300][Iter 2940/2957][lr 0.001000][Loss: anchor 9.65, iou 10.57, l1 32.50, conf 26.13, cls 76.35, imgsize 512, time: 6.48]
[Epoch 7/300][Iter 2950/2957][lr 0.001000][Loss: anchor 12.56, iou 13.53, l1 41.10, conf 31.59, cls 95.52, imgsize 512, time: 6.16]
Traceback (most recent call last):
File "main.py", line 454, in <module>
main()
File "main.py", line 388, in main
optimizer.backward(loss)
File "/mnt/WXRG0353/sfchen/ASFF/utils/fp16_utils/fp16_optimizer.py", line 483, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/mnt/WXRG0353/sfchen/ASFF/utils/fp16_utils/loss_scaler.py", line 45, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/mnt/WXRG0333/sfchen/anaconda3/envs/pytorch13/lib/python3.7/site-packages/torch/tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/mnt/WXRG0333/sfchen/anaconda3/envs/pytorch13/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 8.16 GiB (GPU 5; 10.73 GiB total capacity; 1.76 GiB already allocated; 8.16 GiB free; 29.42 MiB cached)
and the nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:1A:00.0 Off | N/A |
| 34% 55C P2 100W / 250W | 2922MiB / 10989MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:1B:00.0 Off | N/A |
| 35% 58C P2 92W / 250W | 2920MiB / 10989MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... On | 00000000:3D:00.0 Off | N/A |
| 33% 52C P2 113W / 250W | 2912MiB / 10989MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... On | 00000000:3E:00.0 Off | N/A |
| 33% 51C P2 101W / 250W | 2920MiB / 10989MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208... On | 00000000:88:00.0 Off | N/A |
| 32% 49C P2 90W / 250W | 2922MiB / 10989MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208... On | 00000000:89:00.0 Off | N/A |
| 27% 29C P8 4W / 250W | 11MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce RTX 208... On | 00000000:B1:00.0 Off | N/A |
| 27% 28C P8 1W / 250W | 11MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce RTX 208... On | 00000000:B2:00.0 Off | N/A |
| 27% 28C P8 1W / 250W | 11MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 173812 C ...hen/anaconda3/envs/pytorch13/bin/python 2911MiB |
| 1 173813 C ...hen/anaconda3/envs/pytorch13/bin/python 2909MiB |
| 2 173814 C ...hen/anaconda3/envs/pytorch13/bin/python 2901MiB |
| 3 173815 C ...hen/anaconda3/envs/pytorch13/bin/python 2909MiB |
| 4 173816 C ...hen/anaconda3/envs/pytorch13/bin/python 2911MiB |
+-----------------------------------------------------------------------------+
Could the author tell us in detail what environment was used on the 2080 Ti? Thanks.
First of all, thanks to the author for the excellent code, especially the attempts at BoF and GA. I looked at the code, and I feel that your Guided Anchoring idea is not the same as the original paper. Do you have time to explain the implementation ideas of this GA part? Thank you.
ssh://[email protected]:10074/usr/local/bin/python -u /project/ASFF/main.py --cfg=config/yolov3_baseline.cfg -d=VOC --tfboard --checkpoint=weights/darknet53_feature_mx.pth --start_epoch=0 --half --log_dir log/VOC -s=240 --checkpoint=
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='', dataset='VOC', debug=False, distributed=False, dropblock=False, eval_interval=10, half=True, local_rank=0, log_dir='log/VOC', n_cpu=4, ngpu=2, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=240, testset=False, tfboard=True, use_cuda=True, vis=False)
successfully loaded config file: {'MODEL': {'TYPE': 'YOLOv3', 'BACKBONE': 'darknet53'}, 'TRAIN': {'LR': 0.001, 'MOMENTUM': 0.9, 'DECAY': 0.0005, 'BURN_IN': 5, 'MAXEPOCH': 300, 'COS': True, 'SYBN': True, 'MIX': True, 'NO_MIXUP_EPOCHS': 30, 'LABAL_SMOOTH': True, 'BATCHSIZE': 4, 'IMGSIZE': 608, 'IGNORETHRE': 0.7, 'RANDRESIZE': True}, 'TEST': {'CONFTHRE': 0.01, 'NMSTHRE': 0.6, 'IMGSIZE': 608}}
Training YOLOv3 strong baseline!
using cuda
using tfboard
Traceback (most recent call last):
File "/project/ASFF/main.py", line 455, in <module>
main()
File "/project/ASFF/main.py", line 389, in main
optimizer.backward(loss)
File "/project/ASFF/utils/fp16_utils/fp16_optimizer.py", line 483, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/project/ASFF/utils/fp16_utils/loss_scaler.py", line 45, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 118, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 4, 76, 76, 25]], which is output 0 of CloneBackward, is at version 9; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Process finished with exit code 1
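This class of error is easy to reproduce in isolation; a minimal sketch (unrelated to the ASFF code itself) that triggers the same message, useful for understanding what the hint about torch.autograd.set_detect_anomaly(True) is pointing at:

```python
import torch

# exp() saves its output for the backward pass; modifying that saved
# tensor in place bumps its version counter, so backward() raises the
# same "modified by an inplace operation" RuntimeError as above.
x = torch.ones(3, requires_grad=True)
z = x.exp()
z += 1  # the offending in-place edit
try:
    z.sum().backward()
except RuntimeError as e:
    print("caught:", type(e).__name__)
```

Wrapping the forward/backward in `with torch.autograd.set_detect_anomaly(True):` makes the traceback point at the forward-pass line that produced the tensor later modified in place.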
File "main.py", line 386, in main
loss_dict = model.forward(imgs, targets, epoch)
File "/home/workspace/git/python/detection/ASFF/models/yolov3_asff.py", line 149, in forward
x, anchor_loss, iou_loss, l1_loss, conf_loss, cls_loss = header(fused, targets)
File "/home/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/workspace/git/python/detection/ASFF/models/yolov3_head.py", line 265, in forward
loss_wh = (self.l1_loss(output[...,2:4], l1_target[...,2:4],tgt_scale)).sum() / batchsize
File "/home/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
TypeError: forward() takes 3 positional arguments but 4 were given
Hello, thank you for sharing your great idea and code.
When running training on COCO, I meet an error: TypeError: forward() takes 3 positional arguments but 4 were given,
as posted above. And in the code:
self.l1_loss = nn.L1Loss(reduction='none')
While https://github.com/ruinmessi/ASFF/blob/master/models/yolov3_head.py#L264-L266
loss_xy = (tgt_scale*self.bcewithlog_loss(output[...,:2], l1_target[...,:2])).sum() / batchsize
loss_wh = (self.l1_loss(output[...,2:4], l1_target[...,2:4],tgt_scale)).sum() / batchsize
Maybe, it should be:
loss_xy = (tgt_scale*self.bcewithlog_loss(output[...,:2], l1_target[...,:2])).sum() / batchsize
loss_wh = (tgt_scale*self.l1_loss(output[...,2:4], l1_target[...,2:4])).sum() / batchsize
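A quick sanity check of the proposed fix: with reduction='none', nn.L1Loss takes only (input, target), and the scale must be applied as an elementwise weight afterwards. The values below are made up for illustration:

```python
import torch
import torch.nn as nn

l1_loss = nn.L1Loss(reduction='none')  # elementwise loss; takes only (input, target)

output = torch.ones(2, 4)
l1_target = torch.zeros(2, 4)
tgt_scale = torch.full((2, 4), 0.5)
batchsize = 2

# apply the scale as a weight on the elementwise loss, then reduce
loss_wh = (tgt_scale * l1_loss(output, l1_target)).sum() / batchsize
print(loss_wh.item())  # 8 elements * |1 - 0| * 0.5 / 2 = 2.0
```

Passing tgt_scale as a third positional argument, as in the current line, is what produces the "takes 3 positional arguments but 4 were given" TypeError.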
@ruinmessi
Any plan to have demo.py for a single image input?
Thanks,
I tried to use 0.9.10.dev0 installed by pip, but some issues occurred.
I got a maximum recursion depth error when I ran demo.py as follows:
python3 demo.py -i test.jpg --cfg config/yolov3_baseline.cfg -d COCO --checkpoint YOLOv3-mobile-asff.pth --asff -s 416
/home/test/anaconda3/envs/detectron2/lib/python3.6/site-packages/torchvision-0.5.0a0+1e857d9-py3.6-linux-x86_64.egg/torchvision/io/_video_opt.py:17: UserWarning: video reader based on ffmpeg c++ ops not available
Setting Arguments.. : Namespace(asff=True, cfg='config/yolov3_baseline.cfg', checkpoint='YOLOv3-mobile-asff.pth', dataset='COCO', half=False, img='test.jpg', rfb=False, test_size=416, use_cuda=True)
successfully loaded config file: {'MODEL': {'TYPE': 'YOLOv3', 'BACKBONE': 'mobile'}, 'TRAIN': {'LR': 0.001, 'MOMENTUM': 0.9, 'DECAY': 0.0005, 'BURN_IN': 5, 'MAXEPOCH': 300, 'COS': True, 'SYBN': True, 'MIX': True, 'NO_MIXUP_EPOCHS': 30, 'LABAL_SMOOTH': True, 'BATCHSIZE': 5, 'IMGSIZE': 608, 'IGNORETHRE': 0.7, 'RANDRESIZE': True}, 'TEST': {'CONFTHRE': 0.01, 'NMSTHRE': 0.65, 'IMGSIZE': 608}}
For mobilenet, we currently don't support dropblock, rfb and FeatureAdaption
Training YOLOv3 with ASFF!
loading pytorch ckpt... YOLOv3-mobile-asff.pth
using cuda
Traceback (most recent call last):
File "/home/test/anaconda3/envs/detectron2/lib/python3.6/site-packages/torchvision-0.5.0a0+1e857d9-py3.6-linux-x86_64.egg/torchvision/ops/boxes.py", line 31, in nms
File "/home/test/anaconda3/envs/detectron2/lib/python3.6/site-packages/torchvision-0.5.0a0+1e857d9-py3.6-linux-x86_64.egg/torchvision/ops/boxes.py", line 31, in nms
File "/home/test/anaconda3/envs/detectron2/lib/python3.6/site-packages/torchvision-0.5.0a0+1e857d9-py3.6-linux-x86_64.egg/torchvision/ops/boxes.py", line 31, in nms
[Previous line repeated 997 more times]
RecursionError: maximum recursion depth exceeded
Does this version of the repository support running on the CPU?
thanks
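The recursion above comes from torchvision's lazy extension import; if the compiled nms op is broken in a given install, a plain NumPy NMS can stand in as a fallback. A sketch assuming boxes in [x1, y1, x2, y2] format:

```python
import numpy as np

def nms_numpy(boxes, scores, iou_thre=0.65):
    """Greedy NMS over [x1, y1, x2, y2] boxes; returns kept indices."""
    order = scores.argsort()[::-1]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the current top box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_thre]
    return keep
```

This is slower than the compiled op but removes the torchvision._C dependency entirely, which also covers CPU-only installs.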
Have you tried a bigger backbone, such as ResNet-101, ResNeXt-101, or others?
How do they perform?
Hi. Thanks for sharing the code.
I had two questions.
In section 4.4 of the paper you mentioned that:
final model is YOLOv3 with ASFF*, which is an enhanced ASFF version by integrating other lightweight modules (i.e. DropBlock [7] and RFB [23]) with 1.5× longer training time than the models in Section 4.1.
This will help us understand the performance improvement due to the ASFF block more clearly. Thanks.
Hi there, thanks for sharing your code!
I have tested your pre-trained model "YOLOv3 800+ ASFF*" with distributed testing (on 4 TITAN Xps) and single-GPU testing (on one TITAN Xp) on COCO minival. I noticed that, apart from the inference time gap between them, the single-GPU testing's performance is 1% mAP lower than the distributed testing result. Could you please explain that for me?
The single GPU result: 67.32ms, 42.6mAP
Distributed test on 4 GPUs result: 88.41ms, 43.6mAP
Thank you so much for your time!
Could you please tell me whether the inference time (24 ms) contains the time NMS costs? Thanks.
In the paper "Objects as Points", the running time is tested on a machine with an Intel Core
i7-8086K CPU, a Titan Xp GPU, PyTorch 0.4.1, CUDA 9.0, and cuDNN 7.1. It gets 39.2 mAP at 28 FPS with flip test. In your paper, you annotated V100 with the same result as the Titan Xp (28 (V100) 39.2 57.1 42.8 19.9 43.0 51.4). The performance of a Tesla V100 is much better than a Titan Xp's. I am confused about the test speed.
Could someone provide this code to me? Thank you.
Will ASFF benefit tiny YOLOv3?
1. Where is the GT assignment strategy in the code?
2. What tricks are used in this code, apart from BoF?
3. Your test size is 608 on both COCO and VOC? Do you not use multi-scale testing?
thanks
Very nice work! A basic question: under my COCO dataset directory there are only image and annotation folders. What is the cache for, and where can I download it?
@ruinmessi Hi,
Why do you use convolutional layers without batch normalization in the RFB block?
https://github.com/ruinmessi/ASFF/blob/55b6637b21d69509ae73100f9db19dc98acb419d/models/network_blocks.py#L138-L159
ImportError: /home/guobaozi/anaconda3/envs/PkuNet_apex/lib/python3.7/site-packages/torchvision/_C.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
python3.7
pytorch1.3
cuda10.1
details:
Traceback (most recent call last):
File "demo.py", line 136, in <module>
demo()
File "demo.py", line 116, in demo
outputs = postprocess(outputs, num_class, 0.01, 0.65)
File "/media/guobaozi/C022AA4B225A6D42/guobaozi_cv/ASFF-master/utils/utils.py", line 62, in postprocess
detections_class[:, :4], detections_class[:, 4]*detections_class[:, 5], nms_thre)
File "/home/guobaozi/anaconda3/envs/PkuNet_apex/lib/python3.7/site-packages/torchvision/ops/boxes.py", line 32, in nms
_C = _lazy_import()
File "/home/guobaozi/anaconda3/envs/PkuNet_apex/lib/python3.7/site-packages/torchvision/extension.py", line 12, in _lazy_import
from torchvision import _C as C
ImportError: /home/guobaozi/anaconda3/envs/PkuNet_apex/lib/python3.7/site-packages/torchvision/_C.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
Hello,
When I run the YOLOv3 baseline training script:
python -m torch.distributed.launch --nproc_per_node=10 --master_port=287343 main.py \
--cfg config/yolov3_baseline.cfg -d COCO --tfboard --distributed --ngpu 8 \
--checkpoint weights/darknet53_feature_mx.pth --start_epoch 0 --half --log_dir log/COCO -s 608
The process got stuck at:
index created!
Training YOLOv3 strong baseline!
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
index created!
Training YOLOv3 strong baseline!
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
loading pytorch ckpt... weights/darknet53_feature_mx.pth
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
using cuda
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
loading pytorch ckpt... weights/darknet53_feature_mx.pth
using cuda
I use 8 2080Ti GPUs, and the state of the GPUs is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:1A:00.0 Off | N/A |
| 27% 31C P8 20W / 250W | 1186MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:1B:00.0 Off | N/A |
| 27% 29C P8 18W / 250W | 1186MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... On | 00000000:3D:00.0 Off | N/A |
| 27% 30C P8 23W / 250W | 1186MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... On | 00000000:3E:00.0 Off | N/A |
| 27% 29C P8 13W / 250W | 1186MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208... On | 00000000:88:00.0 Off | N/A |
| 27% 28C P8 9W / 250W | 1186MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208... On | 00000000:89:00.0 Off | N/A |
| 27% 30C P8 17W / 250W | 1186MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce RTX 208... On | 00000000:B1:00.0 Off | N/A |
| 27% 29C P8 11W / 250W | 1186MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce RTX 208... On | 00000000:B2:00.0 Off | N/A |
| 27% 31C P8 25W / 250W | 1186MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 159506 C ...hen/anaconda3/envs/pytorch13/bin/python 1175MiB |
| 1 159507 C ...hen/anaconda3/envs/pytorch13/bin/python 1175MiB |
| 2 159508 C ...hen/anaconda3/envs/pytorch13/bin/python 1175MiB |
| 3 159509 C ...hen/anaconda3/envs/pytorch13/bin/python 1175MiB |
| 4 159510 C ...hen/anaconda3/envs/pytorch13/bin/python 1175MiB |
| 5 159511 C ...hen/anaconda3/envs/pytorch13/bin/python 1175MiB |
| 6 159512 C ...hen/anaconda3/envs/pytorch13/bin/python 1175MiB |
| 7 159513 C ...hen/anaconda3/envs/pytorch13/bin/python 1175MiB |
+-----------------------------------------------------------------------------+
Hello, I trained weights on the VOC2007 dataset and got an evaluation result of mAP = 40. But when I use my weights to test with demo.py, I can't get any predictions.
The training script is:
python -m torch.distributed.launch --nproc_per_node=1 --master_port=${RANDOM+10000} main.py --cfg config/yolov3_baseline.cfg -d VOC --ngpu 1 --distributed --checkpoint weights/darknet53_feature_mx.pth --start_epoch 0 --half --asff --rfb --dropblock -s 608
and the parameters for demo.py are:
--img XX --checkpoint XX --asff --rfb --half -d VOC -s 608
I also tried some of your weights, but demo.py doesn't run with the parameter -d VOC.
So should I change some code in demo.py to fit VOC?
@ruinmessi
Thanks for sharing such great work.
I am trying your work with a dataset in which every object in the image is quite small.
I am using only 1 GPU for training, and the model trained without any errors, but the loss didn't decrease.
Maybe the problem comes from the fact that I didn't use many of the methods you used for training this network, such as synchronized batch normalization, FP16 training...
But I am wondering whether that is the main problem, or whether this network struggles with detecting small objects?
When I replace the FPN with ASFF in RetinaFace, the model size doubles, but the result is inferior to FPN's.
During training the following appears:
... (omitted) ...
File "C:\Users\Admin\Anaconda3\envs\pytorch\lib\site-packages\apex\parallel\optimized_sync_batchnorm_kernel.py", line 29, in forward
if torch.distributed.is_initialized():
AttributeError: module 'torch.distributed' has no attribute 'is_initialized'
My environment is CUDA 10 with a 2080 Ti. Could you tell me how to install cocoapi? Thanks.
I can't train this on Win10.
It shows: AttributeError: module 'torch.distributed' has no attribute 'is_initialized'
How do I fix it?
utils/cocoapi_evaluator.py line 186 will change the value of i.
When using the vis(True) mode, the image obtained below does not correspond to the outputs above. It can be modified to the following:
for ind in range(bboxes.shape[0]):
    label = self.dataset.class_ids[int(cls[ind])]
    A = {"image_id": id_, "category_id": label, "bbox": bboxes[ind].numpy().tolist(),
         "score": scores[ind].numpy().item(), "segmentation": []}  # COCO json format
    data_dict.append(A)
BoF used multi-scale training; how about ASFF? Does ASFF get its high mAP with multi-scale training or not?
Traceback (most recent call last):
File "main.py", line 456, in <module>
main()
File "main.py", line 346, in main
torch.save(model.module.state_dict(), os.path.join(args.save_dir,
File "/home/lianguofei/workspace/ASFF/py35env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 539, in __getattr__
type(self).__name__, name))
AttributeError: 'YOLOv3' object has no attribute 'module'
Hello, may I ask how exactly to solve this?
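The error means the model here is a plain YOLOv3 instance, not a (Distributed)DataParallel wrapper, which is what exposes .module. A common guard is to unwrap only when the attribute exists; a sketch (the Plain/Wrapped classes below are dummy stand-ins for illustration, not the repo's classes):

```python
def unwrap_state_dict(model):
    """Return the underlying model's state_dict whether or not the model
    is wrapped in a DataParallel-style container exposing .module."""
    inner = model.module if hasattr(model, "module") else model
    return inner.state_dict()

# dummy stand-ins to show both cases
class Plain:
    def state_dict(self):
        return {"w": 1}

class Wrapped:
    def __init__(self):
        self.module = Plain()
```

With this, the save line becomes torch.save(unwrap_state_dict(model), path) and works in both single-GPU and distributed runs.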
If I have 1 GPU, I set ngpu = 1, right?
For example darknet16 or a tiny version, because it needs to be deployed now, the model should preferably be under 50 MB.
Could you release the training logs of the baseline model & ASFF model? I tried to train a baseline model with fewer epochs (30) for quick testing, but I found that the validation mAP is quite low (0.08 at epoch 16), and the conf loss / cls loss seems not to converge (near 100...). Is this normal?
RuntimeError:Error(s) in loading state_dict for YOLOv3:
size mismatch for module_list.18.conv.weight:copying a param with shape
torch.Size([1024,512,3,3]) from checkpoint,the shape in current model is torch.Size([256,512,3,3])
In the paper, you said that trained ASFF model for 300 epochs on 4 V100 GPUs. Could you share how long time you spent for 300 epochs training?
The entire network is trained with stochastic gradient descent (SGD) on 4 GPUs (NVDIA Tesla V100) with 16 images per GPU. All models are trained for 300 epochs with the first 4 epochs of warmup and the cosine learning rate schedule [26] from 0.001 to 0.00001.
Hello, I wonder why I meet this error after training the model for about 10 epochs. Does the PyTorch version cause this problem? I changed my torch version to 1.0.1 but met another problem, an import DCN error. Many thanks if you can reply. :)
torch.save(model.module.state_dict(), os.path.join(args.save_dir,
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py"
AttributeError: 'YOLOv3' object has no attribute 'module'
levels_weight_v = torch.cat((level_0_weight_v, level_1_weight_v, level_2_weight_v), 1)
levels_weight = self.weight_levels(levels_weight_v)
levels_weight = F.softmax(levels_weight, dim=1)
fused_out_reduced = level_0_resized * levels_weight[:, 0:1, :, :] + \
                    level_1_resized * levels_weight[:, 1:2, :, :] + \
                    level_2_resized * levels_weight[:, 2:, :, :]
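The fusion quoted above can be exercised in isolation as a self-contained module; a minimal sketch with illustrative sizes (16 channels, compression 8, which are assumptions, not the repo's actual values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionSketch(nn.Module):
    """Per-pixel softmax-weighted sum of three same-resolution feature maps."""
    def __init__(self, channels=16, compress=8):
        super().__init__()
        self.weight_level_0 = nn.Conv2d(channels, compress, 1)
        self.weight_level_1 = nn.Conv2d(channels, compress, 1)
        self.weight_level_2 = nn.Conv2d(channels, compress, 1)
        self.weight_levels = nn.Conv2d(compress * 3, 3, 1)

    def forward(self, level_0, level_1, level_2):
        v = torch.cat((self.weight_level_0(level_0),
                       self.weight_level_1(level_1),
                       self.weight_level_2(level_2)), 1)
        w = F.softmax(self.weight_levels(v), dim=1)  # (N, 3, H, W), sums to 1
        return (level_0 * w[:, 0:1] +
                level_1 * w[:, 1:2] +
                level_2 * w[:, 2:3])
```

Feeding three dummy feature maps through this on the CPU is a quick way to rule the fusion math out when hunting a crash like the one below.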
using cuda
using tfboard
segmentation fault (core dumped)