zgcr / simpleaicv_pytorch_training_examples



SimpleAICV: PyTorch training and testing examples.

License: MIT License

Python 96.27% Shell 1.42% C++ 0.21% Cuda 2.10%

pytorch darknet fcos resnet retinanet centernet ttfnet repvgg mae dino

simpleaicv_pytorch_training_examples's Issues

About the CelebA-HQ dataset

Hi, this is amazing work! Could you please explain how to build the CelebA-HQ dataset?

Where can I download the pretrained models?

Where can I download the pretrained RegNet models?

Thanks

How to train on ImageNet (which has 300 million images) on a local machine?

I would like to know what hardware you use. I chose an i5-12400F and an RTX 4080 to train a simple convolutional model with only 5 layers, but it is very slow: 100 epochs would take years. I also tried training this model on AutoDL; CPU utilization is at 100% but GPU utilization is only 40%, and training is still very slow.

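Multi-node, multi-GPU training has many, many problems

Problem reading data when training ResNet50 on ImageNet; I would like to ask for your advice

I followed your environment setup, but this error appeared when running the code. I hope you can help.

RetinaNet training problem

Hello, I would like to ask why, when I train RetinaNet, the run always stops by itself after printing a few warnings, without reporting any error. At first I thought apex was the cause, but after setting it to False the run still stopped on its own. Switching to multiple GPUs gave the same result. Do you know what the cause might be?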
`root@container-ab78119f3c-c31dcd5b:~/SimpleAICV-pytorch-ImageNet-COCO-training-master/detection_training/coco/res50_retinanet_retinaresize800# sh train.sh
======================1======================
No pretrained model file!
loading annotations into memory...
Done (t=16.43s)
creating index...
index created!
Dataset Size:117266
Dataset Class Num:80
loading annotations into memory...
Done (t=0.51s)
creating index...
index created!
Dataset Size:5000
Dataset Class Num:80
======================2======================
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.`

Could you share the ILSVRC2012 data or give me a link for a fast download? Thanks a lot!

Cannot clone on Windows

I wanted to read through the code, but the clone failed:

(base) PS D:\code> git clone https://github.com/zgcr/SimpleAICV_pytorch_training_examples.git
Cloning into 'SimpleAICV_pytorch_training_examples'...
remote: Enumerating objects: 2761, done.
remote: Counting objects: 100% (2181/2181), done.
remote: Compressing objects: 100% (970/970), done.
remote: Total 2761 (delta 1257), reused 2010 (delta 1129), pack-reused 580
Receiving objects: 100% (2761/2761), 35.31 MiB | 2.01 MiB/s, done.
Resolving deltas: 100% (1588/1588), done.
fatal: cannot create directory at 'simpleAICV/detection/compile_multiscale_deformable_attention/build/temp.linux-x86_64-3.8/root/code/SimpleAICV_pytorch_training_examples_on_ImageNet_COCO_ADE20K/z_dino_main/dino_multiscale_deformable_attention_compile': Filename too long
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

(base) PS D:\code>

Difference between DataParallel and DistributedDataParallel?

Hi,
Nice work. I have a question. What is the difference between resnet_imagenet_DataParallel_train_example and resnet_imagenet_DistributedDataParallel_train_example?

When using DataParallel, the GPU memory footprint increases over time

Thank you very much for your outstanding work. However, when I use DataParallel for training, GPU usage keeps growing over time, and eventually CUDA stops the program. May I ask why?

Defining a model

Hello, does your code support custom models? I would like to define a ResNet-110 network myself. Can I do this by changing the parameter to [3,4,26,3]?
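As a quick sanity check on the layer count in this question: by the usual counting convention, a bottleneck ResNet has three convolutions per block plus a stem convolution and a final fully connected layer. The sketch below only does that arithmetic; the function names are hypothetical, and whether the model builders in this repo accept a custom block list is an assumption that is not verified here.

```python
def bottleneck_resnet_depth(blocks_per_stage):
    """Weighted layers in a bottleneck ResNet: 3 convs per block + stem conv + FC."""
    return 3 * sum(blocks_per_stage) + 2


def basic_resnet_depth(blocks_per_stage):
    """Weighted layers in a basic-block ResNet: 2 convs per block + stem conv + FC."""
    return 2 * sum(blocks_per_stage) + 2


print(bottleneck_resnet_depth([3, 4, 6, 3]))   # ResNet-50 layout
print(bottleneck_resnet_depth([3, 4, 23, 3]))  # ResNet-101 layout
print(bottleneck_resnet_depth([3, 4, 26, 3]))  # the layout proposed in the question
```

Under this convention, [3,4,26,3] with bottleneck blocks does come out at 110 layers.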

train.sh

Hello, I only have one GPU. How should I set up train.sh for object detection?
Thanks!

Pretrained model

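Could you please share your pretrained VoVNet-series and RegNet-series models? Thank you very much!

The Baidu Netdisk link for the pretrained models has expired, and the Google Drive link has no access permission

Thank you for contributing such nice code. I wanted to download a pretrained model, but none of the links work. Could you please fix them when you see this? Thank you very much.

Could you please upload the trained model?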

Thanks

Knowledge distillation training seems broken

First, thanks to the author for the great work. It is a great tool for running ablation studies. (I even think you could write a paper about it after adding a few more training options, e.g., few-shot and zero-shot learning.)

However, there seems to be a bug in distillation training. After I downloaded the ResNet-34 weights and loaded them into the teacher model, its accuracy was very low: it reports 0.298% top-1 accuracy on ImageNet-1K.

I have not checked the code yet, but I suspect the published weights use a different order of output classes. Could you kindly check this?

reg_head in RetinaNet

Hi, thanks for your great contributions. I have a question about the implementation of RetinaNet. In losses.py, it seems that the reg_head directly outputs the absolute positions of the bounding boxes, and the L1 loss is calculated from the difference between the ground-truth bbox positions and the reg_head output. Is my understanding correct?
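For comparison, many RetinaNet-style implementations regress offsets relative to anchors rather than absolute coordinates. The sketch below shows that common parameterisation (centre offsets scaled by the anchor size, log-scaled width and height). Whether losses.py in this repo does the same is not confirmed here; the function names are hypothetical and boxes are assumed to be (x1, y1, x2, y2).

```python
import math


def _to_center(box):
    """Convert (x1, y1, x2, y2) to (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h


def encode_deltas(anchor, gt):
    """Regression targets of a ground-truth box relative to an anchor."""
    ax, ay, aw, ah = _to_center(anchor)
    gx, gy, gw, gh = _to_center(gt)
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))


def decode_deltas(anchor, deltas):
    """Invert encode_deltas: recover an absolute box from anchor + offsets."""
    ax, ay, aw, ah = _to_center(anchor)
    tx, ty, tw, th = deltas
    cx, cy = ax + tx * aw, ay + ty * ah
    w, h = aw * math.exp(tw), ah * math.exp(th)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


anchor = (10.0, 10.0, 50.0, 50.0)
gt = (12.0, 8.0, 60.0, 40.0)
deltas = encode_deltas(anchor, gt)
recovered = decode_deltas(anchor, deltas)
print(deltas)
print(recovered)
```

With this encoding, an L1 or smooth L1 loss is computed on the offsets, and absolute boxes are only recovered at decode time.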

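Training darknet53 on the ImageNet dataset

Hello,
You mentioned that results fluctuate across training runs. I trained darknet53 on ImageNet and my best result so far is a top-1 accuracy of 76.5%. Is that within the expected fluctuation range? My training configuration is the same as yours, except that I use distributed training on four GPUs with batch size = 124×4, which I assume makes a difference.
My training scripts: https://github.com/njustczr/darknet53

Do you know why my CIFAR-100 accuracy is higher than the figures in the table?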

How to set num_workers?

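Hello. I'm using four 2080 Ti GPUs and I wonder how to set num_workers properly.

Some questions about cocodataset

Hello, I mainly want to learn how the COCO dataset is loaded, and your implementation looks very good. However, I have some questions about parts of cocodataset: what does the COCODataPrefetcher() class do, and what is coco_class_color in that file used for?

RegNet's ColorJitter differs from the PCA-ColorJitter in the original paper

This may be why your accuracy is slightly lower than theirs.

Why do the ResNet50 pretrained weights I trained myself differ so much from those in the PyTorch library?

Hello, I would like to ask why the weights trained with this repo differ so much from the official PyTorch pretrained weights. With the official PyTorch pretrained parameters I reach 0.616, but with the pretrained parameters trained from this model I only reach 0.499. Did I make a mistake somewhere? My work depends heavily on the pretrained parameters.

Training problem

Hello, sorry to bother you again. During training, train.info.log stops reporting after iteration 8700: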
2021-12-03 15:46:39 - train: epoch 0001, iter [08200, 58633], lr: 0.000100, total_loss: 0.4340, cls_loss: 0.2691, reg_loss: 0.1649
2021-12-03 15:47:37 - train: epoch 0001, iter [08300, 58633], lr: 0.000100, total_loss: 0.6410, cls_loss: 0.4634, reg_loss: 0.1775
2021-12-03 15:48:35 - train: epoch 0001, iter [08400, 58633], lr: 0.000100, total_loss: 0.5121, cls_loss: 0.2628, reg_loss: 0.2494
2021-12-03 15:49:28 - train: epoch 0001, iter [08500, 58633], lr: 0.000100, total_loss: 0.4244, cls_loss: 0.2080, reg_loss: 0.2165
2021-12-03 15:50:28 - train: epoch 0001, iter [08600, 58633], lr: 0.000100, total_loss: 0.5233, cls_loss: 0.3370, reg_loss: 0.1864
2021-12-03 15:51:25 - train: epoch 0001, iter [08700, 58633], lr: 0.000100, total_loss: 0.9907, cls_loss: 0.6687, reg_loss: 0.3220
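No weights were saved either. I have tried several times and training always gets stuck at this point. I don't know whether I should keep training or whether something needs to be changed.
What is going on here?

RetinaNet training problem

Hello, I read your CSDN article on training RetinaNet with IoU loss. It is very detailed, but I have a question:
Is the only change to replace smooth L1 loss with IoU loss? When I train it myself, the initial classification loss is 1.228 while the IoU loss reaches 11.56, which seems like a large gap. What could be the reason, and is there a good way to fix it?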
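One thing that may be worth checking here is the reduction. A plain IoU loss of the form 1 - IoU is at most 1 per box, so a value like 11.56 would be consistent with per-box losses being summed rather than averaged (or with a different form of the loss, such as -ln(IoU)). This is only a guess, not something confirmed from the repo or the article. A minimal sketch, with hypothetical helper names `box_iou` and `iou_loss` and boxes assumed to be (x1, y1, x2, y2):

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_loss(preds, targets, reduction="mean"):
    """Per-box loss is 1 - IoU, so each term lies in [0, 1]."""
    losses = [1.0 - box_iou(p, t) for p, t in zip(preds, targets)]
    total = sum(losses)
    return total / len(losses) if reduction == "mean" else total


# Twelve poorly matched boxes: the mean stays below 1, the sum does not.
preds = [(0.0, 0.0, 2.0, 2.0)] * 12
targets = [(1.0, 1.0, 3.0, 3.0)] * 12
print(iou_loss(preds, targets, reduction="mean"))
print(iou_loss(preds, targets, reduction="sum"))
```

If the classification loss is normalised by the number of positive anchors while the box loss is summed, the two terms end up on very different scales.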

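Weights

Hello, could you provide a download source that is accessible from mainland China?

Training CenterNet takes a very long time

Hello,
I compared the original CenterNet source code with the CenterNet in your repo and found that one epoch takes much longer with your repo: one iteration with batch size 64 takes about 20 s, while the original CenterNet code takes on the order of a second.
I compared the code and did not find any major differences. Do you know why?

About evaluation metrics

Hello, I see that your code evaluates accuracy, while the tables on your GitHub page report error rate. Do the two sum to 1?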
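For top-1 classification, accuracy and error rate are complementary by definition, since every sample is counted as either correct or incorrect. A minimal illustration with made-up predictions and labels (the helper name `top1_accuracy` is hypothetical and is not the repo's evaluation code):

```python
def top1_accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the label."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


predictions = [1, 2, 3, 4]
labels = [1, 2, 0, 4]

accuracy = top1_accuracy(predictions, labels)
error_rate = 1.0 - accuracy
print(accuracy, error_rate)
```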

Training RetinaNet

When I train without apex, the output is as follows:
loading annotations into memory...
Done (t=20.29s)
creating index...
index created!
loading annotations into memory...
Done (t=2.85s)
creating index...
index created!
With apex enabled, it additionally prints:
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0

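After that nothing else is printed. Is it still training, or is it stuck? If it is stuck, what could be causing it? My training environment is a 3080 Ti with the batch size set to 2.

Problem with pretrained

After setting pretrained to True, I get the following error. How can I resolve it?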
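The "Gradient overflow. Skipping step" lines in the log above are what dynamic loss scaling typically prints while it searches for a usable scale, so by themselves they do not show that training has hung. The sketch below is a simplified model of that mechanism, not apex's actual code; the class name and the default values (initial scale 2^16, factor 2, growth every 2000 clean steps) are assumptions chosen to match the numbers in the log.

```python
class DynamicLossScaler:
    """Simplified dynamic loss scaling (illustrative, not the apex implementation)."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, growth_interval=2000):
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, found_overflow):
        """Return True if the optimizer step should be applied."""
        if found_overflow:
            # Inf/NaN gradients: shrink the scale and skip this step.
            self.scale /= self.factor
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps == self.growth_interval:
            # A long run without overflow: try a larger scale again.
            self.scale *= self.factor
            self.clean_steps = 0
        return True


scaler = DynamicLossScaler()
for _ in range(5):
    scaler.update(found_overflow=True)
print(scaler.scale)  # 2048.0, the last value shown in the log
```

Since the training log is only written every 100 iterations in the other logs on this page, a few skipped steps followed by silence may simply mean the first log interval has not been reached yet.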

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

Traceback (most recent call last):
  File "../../../tools/train_detection_model.py", line 205, in <module>
    main()
  File "../../../tools/train_detection_model.py", line 46, in main
    from train_config import config
  File "./train_config.py", line 19, in <module>
    class config:
  File "./train_config.py", line 28, in config
    'num_classes': num_classes,
  File "/home/cc631/hailong/code/Dilated-FPN/simpleAICV-pytorch-ImageNet-COCO-training/simpleAICV/detection/models/retinanet.py", line 145, in resnet50_retinanet
    return _retinanet('resnet50', pretrained, **kwargs)
  File "/home/cc631/hailong/code/Dilated-FPN/simpleAICV-pytorch-ImageNet-COCO-training/simpleAICV/detection/models/retinanet.py", line 131, in _retinanet
    map_location=torch.device('cpu')), model)
  File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/torch/serialization.py", line 581, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'empty'
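The traceback shows torch.load being called with the literal path 'empty', which suggests the pretrained-weights path is still a placeholder that was never pointed at a downloaded file. That reading is an inference from the error message, not from the repo's code. A minimal sketch of a guard that could run before loading; the helper name `resolve_pretrained_path` is hypothetical:

```python
import os


def resolve_pretrained_path(path):
    """Return the path if it names an existing file, otherwise None."""
    if not path or not os.path.isfile(path):
        return None
    return path


checkpoint = resolve_pretrained_path("empty")
if checkpoint is None:
    print("No pretrained weights found; set the path to a downloaded file "
          "or train with pretrained=False.")
```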

yolox backbone error

The YOLOX backbone in this codebase has no Focus operation, so the shape of the stem differs between https://github.com/Megvii-BaseDetection/YOLOX and this codebase.
The stem of yolox_m backbone in https://github.com/Megvii-BaseDetection/YOLOX:

The stem of yolox_m backbone in this codebase:

import torch
import torch.nn as nn

# BaseConv is defined alongside Focus in the YOLOX code and is not shown here.


class Focus(nn.Module):
    """Focus width and height information into channel space."""

    def __init__(self, in_channels, out_channels, ksize=1, stride=1, act="silu"):
        super().__init__()
        self.conv = BaseConv(in_channels * 4, out_channels, ksize, stride, act=act)

    def forward(self, x):
        # shape of x (b,c,w,h) -> y(b,4c,w/2,h/2)
        patch_top_left = x[..., ::2, ::2]
        patch_top_right = x[..., ::2, 1::2]
        patch_bot_left = x[..., 1::2, ::2]
        patch_bot_right = x[..., 1::2, 1::2]
        # Stack the four interleaved sub-grids along the channel dimension.
        x = torch.cat(
            (
                patch_top_left,
                patch_bot_left,
                patch_top_right,
                patch_bot_right,
            ),
            dim=1,
        )
        return self.conv(x)
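To make the slicing in Focus concrete, the sketch below reproduces the same space-to-depth rearrangement on plain nested lists for a single image of shape (c, h, w), without PyTorch and without the convolution. The function name `space_to_depth` is hypothetical; the patch order follows the `torch.cat` call in the snippet above.

```python
def space_to_depth(x):
    """Rearrange a (c, h, w) nested list into (4c, h/2, w/2)."""

    def sub_grid(channel, row_start, col_start):
        # Every second row and column, starting at the given offsets.
        return [row[col_start::2] for row in channel[row_start::2]]

    top_left = [sub_grid(ch, 0, 0) for ch in x]
    bot_left = [sub_grid(ch, 1, 0) for ch in x]
    top_right = [sub_grid(ch, 0, 1) for ch in x]
    bot_right = [sub_grid(ch, 1, 1) for ch in x]
    # Same channel order as the torch.cat call in Focus.forward.
    return top_left + bot_left + top_right + bot_right


image = [[[0, 1, 2, 3],
          [4, 5, 6, 7],
          [8, 9, 10, 11],
          [12, 13, 14, 15]]]  # one channel, 4x4

out = space_to_depth(image)
print(len(out), len(out[0]), len(out[0][0]))  # 4 2 2
print(out[0])  # [[0, 2], [8, 10]]
```

No pixel is discarded: spatial resolution is halved while the channel count is multiplied by four, which is why the first convolution in Focus takes `in_channels * 4` inputs.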
