zgcr / simpleaicv_pytorch_training_examples
SimpleAICV: pytorch training and testing examples.
License: MIT License
Hi, amazing work! Could you please explain how to build the CelebA-HQ dataset?
Where can I download the pretrained RegNet models?
Thanks.
I'd like to know what hardware you use. I chose an i5-12400F and an RTX 4080 to train a simple conv model with just 5 layers, but training is very slow: 100 epochs would take years. I also tried training this model on AutoDL; CPU utilization is 100% but GPU utilization is only 40%, and training is still very slow.
Hello, I'd like to ask why, when I train RetinaNet, the run always stops by itself after printing a few warnings, without reporting any error. At first I thought apex was the problem, but after setting it to false the run still stopped on its own. Later I switched to multiple GPUs and the same thing happened. Does anyone know why?
`root@container-ab78119f3c-c31dcd5b:~/SimpleAICV-pytorch-ImageNet-COCO-training-master/detection_training/coco/res50_retinanet_retinaresize800# sh train.sh
======================1======================
No pretrained model file!
loading annotations into memory...
Done (t=16.43s)
creating index...
index created!
Dataset Size:117266
Dataset Class Num:80
loading annotations into memory...
Done (t=0.51s)
creating index...
index created!
Dataset Size:5000
Dataset Class Num:80
======================2======================
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.`
I wanted to read through the code, but the clone failed:
(base) PS D:\code> git clone https://github.com/zgcr/SimpleAICV_pytorch_training_examples.git
Cloning into 'SimpleAICV_pytorch_training_examples'...
remote: Enumerating objects: 2761, done.
remote: Counting objects: 100% (2181/2181), done.
remote: Compressing objects: 100% (970/970), done.
remote: Total 2761 (delta 1257), reused 2010 (delta 1129), pack-reused 580
Receiving objects: 100% (2761/2761), 35.31 MiB | 2.01 MiB/s, done.
Resolving deltas: 100% (1588/1588), done.
fatal: cannot create directory at 'simpleAICV/detection/compile_multiscale_deformable_attention/build/temp.linux-x86_64-3.8/root/code/SimpleAICV_pytorch_training_examples_on_ImageNet_COCO_ADE20K/z_dino_main/dino_multiscale_deformable_attention_compile': Filename too long
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'
(base) PS D:\code>
Hi,
Nice work. I have a question. What is the difference between resnet_imagenet_DataParallel_train_example and resnet_imagenet_DistributedDataParallel_train_example?
Thank you very much for your outstanding work. However, when I use DataParallel for training, the GPU takes more and more time as training goes on, and eventually CUDA stops the program. May I ask why?
Hello author, does your code support custom models? I would like to define a resnet110 network myself; can I do that by changing the parameter to [3,4,26,3]?
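A quick arithmetic check of the layer count in the question above, assuming the usual counting convention for bottleneck ResNets (one stem convolution, three convolutions per bottleneck block, one final fully connected layer). The function name is hypothetical and not part of this repository, and whether the repository's ResNet constructor accepts a custom block list is not confirmed in this thread.

```python
def bottleneck_resnet_depth(blocks_per_stage):
    """Count weighted layers: 1 stem conv + 3 convs per bottleneck block + 1 fc."""
    return 1 + 3 * sum(blocks_per_stage) + 1


# ResNet-101 layout, for reference.
print(bottleneck_resnet_depth([3, 4, 23, 3]))
# Layout proposed in the question.
print(bottleneck_resnet_depth([3, 4, 26, 3]))
```

Under that convention the proposed layout does come out to 110 layers; with basic (two-convolution) blocks the same list would give a different depth.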
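Hello, I only have one GPU. How should I set up train.sh for object detection?
Thanks!
Could you please share your pretrained models for the VoVNet series and the RegNet series? Thank you very much!
Thanks to the author for contributing such nice code. I would like to download a pretrained model, but none of the links work. Could you please fix them when you see this? Many thanks.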
Thanks
First, thanks to the author for the great work. It is a great tool for conducting ablation studies. (I even think you could write a paper about it after adding some more training options, e.g., few-shot and zero-shot learning.)
However, there seems to be a bug in distillation training. After I downloaded the ResNet-34 weights and loaded them into the teacher model, its accuracy was really low: it reports 0.298% top-1 accuracy on ImageNet-1K.
I have not checked the code yet, but I suspect it is because the weights you published use a different order of output classes. Could you kindly check this?
Hi, thanks for your great contributions. I have a question about the implementation of RetinaNet. In losses.py, it seems that the reg_head directly outputs the absolute positions of the bounding boxes, and the L1 loss is calculated from the difference between the ground-truth bbox positions and the reg_head output. Is my understanding correct?
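For comparison, many RetinaNet-style detectors regress offsets relative to each anchor rather than absolute coordinates. The sketch below shows that common parameterization in plain Python. It illustrates the general technique only and does not describe what this repository's losses.py does; all function names are hypothetical.

```python
import math


def _center_size(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1


def encode_box(anchor, gt):
    """Encode a ground-truth box as (tx, ty, tw, th) offsets from an anchor."""
    ax, ay, aw, ah = _center_size(anchor)
    gx, gy, gw, gh = _center_size(gt)
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))


def decode_box(anchor, deltas):
    """Invert encode_box: turn offsets back into an (x1, y1, x2, y2) box."""
    ax, ay, aw, ah = _center_size(anchor)
    tx, ty, tw, th = deltas
    cx, cy = ax + tx * aw, ay + ty * ah
    w, h = aw * math.exp(tw), ah * math.exp(th)
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```

With this scheme the regression loss is computed between predicted and encoded offsets, so one way to tell the two designs apart is to check whether the targets are passed through a step like `encode_box` before the loss.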
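Hello:
You said that results fluctuate across training runs. I trained darknet53 on the ImageNet dataset, and my best result so far is a top-1 accuracy of 76.5%. Does that fall within the fluctuation range? My training configuration is the same as yours, except that I use distributed training on four GPUs with batchsize=124×4, which I assume makes a difference.
My training script is here: https://github.com/njustczr/darknet53
Hello. I'm using four 2080 Ti GPUs and I wonder how to set num_workers properly.
Hello, I mainly want to learn how the COCO dataset is loaded, and your implementation looks very good, but I have some questions about cocodataset. For example, what does the COCODataPrefetcher() class do, and what is coco_class_color in this file used for?
That may be why your accuracy is slightly lower than theirs.
Hello, I would like to ask why the weights we train ourselves differ so much from the official PyTorch pretrained weights. With the official PyTorch pretrained parameters I can reach 0.616, but with the pretrained parameters trained from this model I only reach 0.499. Did I make a mistake at some step? I depend quite heavily on the pretrained parameters.
Hello, sorry to bother you again. During training, train.info.log stops reporting anything after iteration 8700: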
2021-12-03 15:46:39 - train: epoch 0001, iter [08200, 58633], lr: 0.000100, total_loss: 0.4340, cls_loss: 0.2691, reg_loss: 0.1649
2021-12-03 15:47:37 - train: epoch 0001, iter [08300, 58633], lr: 0.000100, total_loss: 0.6410, cls_loss: 0.4634, reg_loss: 0.1775
2021-12-03 15:48:35 - train: epoch 0001, iter [08400, 58633], lr: 0.000100, total_loss: 0.5121, cls_loss: 0.2628, reg_loss: 0.2494
2021-12-03 15:49:28 - train: epoch 0001, iter [08500, 58633], lr: 0.000100, total_loss: 0.4244, cls_loss: 0.2080, reg_loss: 0.2165
2021-12-03 15:50:28 - train: epoch 0001, iter [08600, 58633], lr: 0.000100, total_loss: 0.5233, cls_loss: 0.3370, reg_loss: 0.1864
2021-12-03 15:51:25 - train: epoch 0001, iter [08700, 58633], lr: 0.000100, total_loss: 0.9907, cls_loss: 0.6687, reg_loss: 0.3220
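No weights were generated either. Over several runs, training got stuck at this same point every time. I am not sure whether I should keep training or whether something needs to be changed.
What is going on here?
Hello, I read your CSDN article on training RetinaNet with IoU loss. It is very detailed, but I have a question:
Is the only change to replace the smooth L1 loss directly with the IoU loss? When I train it myself, the initial classification loss is 1.228 while the IoU loss reaches 11.56, which seems like a large gap. What could be the reason, and is there a good way to fix it?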
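One thing worth checking when the two losses differ by an order of magnitude is how each one is reduced. A per-box loss defined as 1 - IoU lies in [0, 1], so a value like 11.56 would point to a sum over boxes, or to a different definition such as -ln(IoU). This is a guess and is not confirmed in the thread. A minimal plain-Python sketch, with hypothetical function names:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_loss(preds, targets, reduction="mean"):
    """1 - IoU per matched box pair, reduced by mean or sum."""
    losses = [1.0 - iou(p, t) for p, t in zip(preds, targets)]
    if reduction == "sum":
        return sum(losses)
    return sum(losses) / max(len(losses), 1)
```

With a mean reduction over positive boxes the value stays comparable in scale to a normalized classification loss; with a sum it grows with the number of boxes.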
Hello author, could you provide a download source hosted in China?
hello,
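I compared the original CenterNet source code with the CenterNet in your repo and found that training CenterNet with your repo takes much longer per epoch: one iteration with batch size 64 takes about 20 s, whereas the original CenterNet code takes on the order of a second.
I compared the code and there does not seem to be any major difference. Do you know why?
Hello, I see that your code evaluates accuracy, while the table on your GitHub reports error rate. Do the two sum to 1?
When I train without apex, the output is as follows: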
loading annotations into memory...
Done (t=20.29s)
creating index...
index created!
loading annotations into memory...
Done (t=2.85s)
creating index...
index created!
With apex enabled, it additionally shows:
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
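After that nothing else is displayed. Is it still training, or is it stuck? If it is stuck, what causes it? My training environment is a 3080 Ti with the batch size set to 2.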
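The "Gradient overflow. Skipping step" lines are what dynamic loss scaling prints when it finds inf/NaN gradients: the optimizer step is skipped and the loss scale is halved. The following is a simplified plain-Python model of that behaviour, not apex's actual implementation; the class name, the initial scale of 65536, and the growth interval of 2000 are assumptions made for the illustration.

```python
class DynamicLossScaler:
    """Toy model of dynamic loss scaling (assumed defaults, not apex's code)."""

    def __init__(self, init_scale=65536.0, factor=2.0, growth_interval=2000):
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, overflow):
        """Return True if the optimizer step should be applied."""
        if overflow:
            # Overflow: skip the step and shrink the scale.
            self.scale /= self.factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps == self.growth_interval:
            # Long run without overflow: try a larger scale again.
            self.scale *= self.factor
            self.good_steps = 0
        return True


scaler = DynamicLossScaler()
for _ in range(5):
    scaler.update(overflow=True)
    print(scaler.scale)
```

Five consecutive overflows in this model produce the same sequence of scales as the log above. A handful of such messages at the start of mixed-precision training is usually expected and does not by itself indicate a hang.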
After setting pretrained to True, I get the following error. How can I fix it?
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Traceback (most recent call last):
File "../../../tools/train_detection_model.py", line 205, in <module>
main()
File "../../../tools/train_detection_model.py", line 46, in main
from train_config import config
File "./train_config.py", line 19, in <module>
class config:
File "./train_config.py", line 28, in config
'num_classes': num_classes,
File "/home/cc631/hailong/code/Dilated-FPN/simpleAICV-pytorch-ImageNet-COCO-training/simpleAICV/detection/models/retinanet.py", line 145, in resnet50_retinanet
return _retinanet('resnet50', pretrained, **kwargs)
File "/home/cc631/hailong/code/Dilated-FPN/simpleAICV-pytorch-ImageNet-COCO-training/simpleAICV/detection/models/retinanet.py", line 131, in _retinanet
map_location=torch.device('cpu')), model)
File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/torch/serialization.py", line 581, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'empty'
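The traceback ends in torch.load being called with the literal path 'empty', which looks like a placeholder for a checkpoint path that was never filled in. That is an inference from the traceback; the thread does not confirm it. A minimal sketch of a guard that fails early with a clearer message follows; the function name is hypothetical and not part of this repository.

```python
import os


def resolve_pretrained_path(path):
    """Return path if it points at an existing file, else raise with a hint."""
    if not path or not os.path.isfile(path):
        raise FileNotFoundError(
            f"pretrained weights not found at {path!r}; download the "
            "checkpoint and set its path before enabling pretrained")
    return path
```

Calling such a check before loading the weights turns the bare "No such file or directory: 'empty'" into a message that names the setting that needs to change.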
The YOLOX backbone in this codebase has no Focus operation, so the shape of the stem differs between https://github.com/Megvii-BaseDetection/YOLOX and this codebase.
The stem of yolox_m backbone in https://github.com/Megvii-BaseDetection/YOLOX:
The stem of yolox_m backbone in this codebase:
import torch
import torch.nn as nn

# BaseConv is YOLOX's conv + BN + activation block, defined elsewhere in its source.


class Focus(nn.Module):
    """Focus width and height information into channel space."""

    def __init__(self, in_channels, out_channels, ksize=1, stride=1, act="silu"):
        super().__init__()
        self.conv = BaseConv(in_channels * 4, out_channels, ksize, stride, act=act)

    def forward(self, x):
        # shape of x (b,c,w,h) -> y(b,4c,w/2,h/2)
        patch_top_left = x[..., ::2, ::2]
        patch_top_right = x[..., ::2, 1::2]
        patch_bot_left = x[..., 1::2, ::2]
        patch_bot_right = x[..., 1::2, 1::2]
        # Stack the four interleaved sub-images along the channel axis.
        x = torch.cat(
            (
                patch_top_left,
                patch_bot_left,
                patch_top_right,
                patch_bot_right,
            ),
            dim=1,
        )
        return self.conv(x)
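To make the slicing in forward concrete, here is a plain-Python sketch of the same space-to-depth rearrangement on a single 4x4 channel, using nested lists in place of tensors. The helper name is hypothetical; the patch order matches the torch.cat call above (top-left, bottom-left, top-right, bottom-right).

```python
def focus_slices(channel):
    """Split one H x W channel (list of rows) into four H/2 x W/2 patches."""
    top_left = [row[0::2] for row in channel[0::2]]
    top_right = [row[1::2] for row in channel[0::2]]
    bot_left = [row[0::2] for row in channel[1::2]]
    bot_right = [row[1::2] for row in channel[1::2]]
    # Same order as the concatenation along the channel axis.
    return [top_left, bot_left, top_right, bot_right]


channel = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
patches = focus_slices(channel)
for patch in patches:
    print(patch)
```

Each input channel becomes four half-resolution channels, which is why the convolution that follows takes in_channels * 4 inputs.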