simpleaicv_pytorch_training_examples's Issues

How to train on ImageNet (which has 300 million images) on a local machine?

I am curious what hardware you use. I chose an i5-12400F and an RTX 4080 to train a simple conv model with just 5 layers, but it is very slow: 100 epochs would take years. I also tried AutoDL to train this model. There the CPU is at 100% utilization while the GPU is only at 40%, and training is still very slow.

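RetinaNet training problem

Hello, I would like to ask why, when I train RetinaNet, the run always stops by itself after a few warnings appear, without reporting any error. At first I thought it was an apex problem, but after setting apex to False it still stopped by itself. Later I switched to multiple GPUs and got the same result. Does anyone know why?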
`root@container-ab78119f3c-c31dcd5b:~/SimpleAICV-pytorch-ImageNet-COCO-training-master/detection_training/coco/res50_retinanet_retinaresize800# sh train.sh
======================1======================
No pretrained model file!
loading annotations into memory...
Done (t=16.43s)
creating index...
index created!
Dataset Size:117266
Dataset Class Num:80
loading annotations into memory...
Done (t=0.51s)
creating index...
index created!
Dataset Size:5000
Dataset Class Num:80
======================2======================
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.`

Cannot clone on Windows

I would like to read through the code, but cloning fails with an error:

(base) PS D:\code> git clone https://github.com/zgcr/SimpleAICV_pytorch_training_examples.git
Cloning into 'SimpleAICV_pytorch_training_examples'...
remote: Enumerating objects: 2761, done.
remote: Counting objects: 100% (2181/2181), done.
remote: Compressing objects: 100% (970/970), done.
remote: Total 2761 (delta 1257), reused 2010 (delta 1129), pack-reused 580
Receiving objects: 100% (2761/2761), 35.31 MiB | 2.01 MiB/s, done.
Resolving deltas: 100% (1588/1588), done.
fatal: cannot create directory at 'simpleAICV/detection/compile_multiscale_deformable_attention/build/temp.linux-x86_64-3.8/root/code/SimpleAICV_pytorch_training_examples_on_ImageNet_COCO_ADE20K/z_dino_main/dino_multiscale_deformable_attention_compile': Filename too long
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

(base) PS D:\code>

Defining a model

Hello author, does your code support custom models? I want to define a ResNet-110 network myself. Can I do that by changing the parameters to [3,4,26,3]?
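
The arithmetic behind the proposed block list can be checked with a minimal sketch. It assumes the conventional ResNet layout (one stem convolution, four stages of residual blocks, one final fully connected layer); `resnet_depth` is a hypothetical helper, not a function from this repo, and whether the repo's ResNet constructor accepts an arbitrary block list is not confirmed here.

```python
def resnet_depth(blocks_per_stage, convs_per_block):
    """Count weighted layers: stem conv + all block convs + final fc layer.

    Assumes the conventional ResNet layout; shortcut (downsample) convs
    are not counted, following the usual naming convention.
    """
    return 1 + convs_per_block * sum(blocks_per_stage) + 1


# Bottleneck blocks (ResNet-50/101/152 style) contain 3 convs each.
print(resnet_depth([3, 4, 6, 3], 3))   # the ResNet-50 layout
print(resnet_depth([3, 4, 26, 3], 3))  # the layout proposed in the question

# BasicBlock (ResNet-18/34 style) contains 2 convs each.
print(resnet_depth([3, 4, 26, 3], 2))
```

Under these assumptions, [3,4,26,3] gives 110 layers only with Bottleneck blocks; with BasicBlock the same list gives 74.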

train.sh

Hello, I only have one GPU. How should I set up train.sh for object detection?
Thanks!

Pretrained model

Could you please share your pretrained models for the VovNet series and the RegNet series? Thank you very much!

knowledge distillation training seems broken

First, thanks to the author for the great work. It is a great tool for conducting ablation studies. (I even think you could write a paper about it after adding some more training options, e.g., few-shot or zero-shot learning.)

However, there seems to be a bug in distillation training. After I downloaded the ResNet-34 weights and loaded them into the teacher model, its accuracy was very low: it reports 0.298% top-1 accuracy on ImageNet-1K.

I have not checked the code yet, but I suspect the weights you published use a different order of output classes. Could you kindly check this?

reg_head in RetinaNet

Hi, thanks for your great contributions. I have a question about the implementation of RetinaNet. In losses.py, it seems that the reg_head directly outputs the absolute positions of bounding boxes, and the L1 loss is calculated from the difference between the ground-truth bbox positions and the reg_head output. Is my understanding correct?
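
For comparison, here is a minimal sketch of the anchor-relative box parameterization commonly used by anchor-based detectors, where the regression target is an offset from an anchor rather than an absolute position. It is an illustration only: `encode_box` and `decode_box` are hypothetical names, boxes are assumed to be `(x1, y1, x2, y2)`, and whether losses.py in this repo encodes targets this way is not confirmed here.

```python
import math


def encode_box(anchor, gt):
    """Encode a ground-truth box as (tx, ty, tw, th) relative to an anchor."""
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    ax, ay = anchor[0] + 0.5 * aw, anchor[1] + 0.5 * ah
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    gx, gy = gt[0] + 0.5 * gw, gt[1] + 0.5 * gh
    # Center shifts are normalized by anchor size; sizes use a log ratio.
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))


def decode_box(anchor, deltas):
    """Invert encode_box: turn predicted offsets back into (x1, y1, x2, y2)."""
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    ax, ay = anchor[0] + 0.5 * aw, anchor[1] + 0.5 * ah
    tx, ty, tw, th = deltas
    cx, cy = ax + tx * aw, ay + ty * ah
    w, h = aw * math.exp(tw), ah * math.exp(th)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


anchor = (10.0, 10.0, 50.0, 50.0)
gt = (12.0, 8.0, 60.0, 40.0)
deltas = encode_box(anchor, gt)
print(deltas)
print(decode_box(anchor, deltas))  # recovers gt up to floating-point error
```

With this scheme an L1 or smooth L1 loss is taken between predicted and encoded offsets, so the head's raw outputs are not box coordinates until they are decoded against the anchors.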

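darknet53: training on the ImageNet dataset

Hello:
You mentioned that results fluctuate across training runs. I trained darknet53 on ImageNet, and the best result I have so far is a top-1 acc of 76.5%. Is that within the fluctuation range? My training configuration is the same as yours, except that I use distributed training on four GPUs with batchsize=124×4, which I assume makes a difference.
My training scripts are here: https://github.com/njustczr/darknet53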

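Some questions about cocodataset

Hello, I mainly want to learn how the COCO dataset is loaded. Your implementation is well written, but I have questions about some parts of cocodataset. For example, what does the COCODataPrefetcher() class do, and what is coco_class_color in the same file used for?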

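Why do the ResNet50 pretrained weights I trained myself differ so much from the ones in the PyTorch library?

Hello, I would like to ask why the weights trained with this repo differ so much from the official PyTorch pretrained weights. With the official PyTorch pretrained parameters I can reach 0.616, but with the pretrained parameters trained from this model I only reach 0.499. Did I make a mistake at some step? My work depends quite heavily on the pretrained parameters.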

Training problem

Hello, sorry to bother you again. During training, train.info.log stops reporting after iteration 8700:
2021-12-03 15:46:39 - train: epoch 0001, iter [08200, 58633], lr: 0.000100, total_loss: 0.4340, cls_loss: 0.2691, reg_loss: 0.1649
2021-12-03 15:47:37 - train: epoch 0001, iter [08300, 58633], lr: 0.000100, total_loss: 0.6410, cls_loss: 0.4634, reg_loss: 0.1775
2021-12-03 15:48:35 - train: epoch 0001, iter [08400, 58633], lr: 0.000100, total_loss: 0.5121, cls_loss: 0.2628, reg_loss: 0.2494
2021-12-03 15:49:28 - train: epoch 0001, iter [08500, 58633], lr: 0.000100, total_loss: 0.4244, cls_loss: 0.2080, reg_loss: 0.2165
2021-12-03 15:50:28 - train: epoch 0001, iter [08600, 58633], lr: 0.000100, total_loss: 0.5233, cls_loss: 0.3370, reg_loss: 0.1864
2021-12-03 15:51:25 - train: epoch 0001, iter [08700, 58633], lr: 0.000100, total_loss: 0.9907, cls_loss: 0.6687, reg_loss: 0.3220
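No weights were generated either. I have trained several times and it gets stuck at this same point every time. I am not sure whether I should keep training or whether something needs to be changed.
What is going on here?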

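RetinaNet training problem

Hello, I read your CSDN article on training RetinaNet with IoU loss. It is very detailed, but I have a question:
Is the only change to replace smooth L1 loss with IoU loss directly? When I train it myself, the initial classification loss is 1.228 while the IoU loss reaches 11.56, which seems like a large gap. What could be the cause, and is there a good way to fix it?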
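
As a reference point for the magnitudes in this question, here is a minimal sketch of an IoU loss in its simplest `1 - IoU` form. The function names and the `reduction` argument are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)`, and the CSDN article may use a different form (for example `-ln(IoU)`) or a different normalization.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred_boxes, gt_boxes, reduction="mean"):
    """1 - IoU per matched pair, reduced by mean or sum."""
    losses = [1.0 - box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    if reduction == "sum":
        return sum(losses)
    return sum(losses) / max(len(losses), 1)


pred = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 1.0, 1.0)]
gt = [(1.0, 1.0, 3.0, 3.0), (5.0, 5.0, 6.0, 6.0)]
print(iou_loss(pred, gt, reduction="mean"))
print(iou_loss(pred, gt, reduction="sum"))
```

In this form each pair contributes at most 1, so a mean-reduced loss cannot exceed 1. A value such as 11.56 would therefore be consistent with a sum over positive anchors, a logarithmic form, or an extra loss weight, which are worth checking first.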

Weights

Hello author, could you provide a download source that is accessible from within China?

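Training CenterNet takes a very long time

Hello,
I compared the original CenterNet source code with the CenterNet in your repo and found that an epoch takes much longer with your repo: one iteration at batch size 64 takes about 20 s, while the original CenterNet code takes on the order of a second.
Comparing the code, I did not see any major differences. Do you know why?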

About the evaluation metric

Hello, I see that your code evaluates accuracy, while the table on your GitHub page reports error rate. Do the two sum to 1?

train on RetinaNet

When I train without apex, the output is as follows:
loading annotations into memory...
Done (t=20.29s)
creating index...
index created!
loading annotations into memory...
Done (t=2.85s)
creating index...
index created!
With apex enabled, it additionally shows:
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0

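After that nothing else is displayed. Is it still training, or is it stuck? If it is stuck, what causes it? My training environment is a 3080 Ti with the batch size set to 2.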
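
The "Gradient overflow. Skipping step" lines come from dynamic loss scaling, and a toy model of that mechanism may help interpret them. The sketch below is not apex's implementation: `ToyDynamicLossScaler` is a hypothetical class, the initial scale of 2**16 is an assumption chosen because it is consistent with the first logged value of 32768.0, and the growth interval of 2000 steps is likewise an assumption.

```python
import math


class ToyDynamicLossScaler:
    """Toy dynamic loss scaler: halve on overflow, grow after a clean streak."""

    def __init__(self, init_scale=2.0 ** 16, backoff=0.5, growth=2.0,
                 growth_interval=2000):
        self.scale = init_scale
        self.backoff = backoff
        self.growth = growth
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should run, False if skipped."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            # Skip this step and retry later with a smaller scale.
            self.scale *= self.backoff
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale *= self.growth
            self.good_steps = 0
        return True


scaler = ToyDynamicLossScaler()
for _ in range(5):
    stepped = scaler.update([float("inf")])  # simulate an overflowing gradient
    print("stepped:", stepped, "new loss scale:", scaler.scale)
```

Five simulated overflows reproduce the 32768.0 to 2048.0 sequence in the log. A handful of skipped steps at the start of mixed-precision training is the scaler searching for a usable scale and does not by itself show that the run is stuck; whether it is still progressing is better judged from the training log or GPU utilization.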

pretrained problem

After I set pretrained to True, the following error occurs. How can I fix it?

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Traceback (most recent call last):
  File "../../../tools/train_detection_model.py", line 205, in <module>
    main()
  File "../../../tools/train_detection_model.py", line 46, in main
    from train_config import config
  File "./train_config.py", line 19, in <module>
    class config:
  File "./train_config.py", line 28, in config
    'num_classes': num_classes,
  File "/home/cc631/hailong/code/Dilated-FPN/simpleAICV-pytorch-ImageNet-COCO-training/simpleAICV/detection/models/retinanet.py", line 145, in resnet50_retinanet
    return _retinanet('resnet50', pretrained, **kwargs)
  File "/home/cc631/hailong/code/Dilated-FPN/simpleAICV-pytorch-ImageNet-COCO-training/simpleAICV/detection/models/retinanet.py", line 131, in _retinanet
    map_location=torch.device('cpu')), model)
  File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/torch/serialization.py", line 581, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'empty'
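
The last line of the traceback shows that the loader tried to open a file literally named 'empty', which looks like a placeholder rather than a real weights path. A minimal sketch of a guard that checks the path before loading is shown below; `resolve_pretrained_path` is a hypothetical helper, not part of this repo, and where the repo actually configures its pretrained paths is not shown in this issue.

```python
import os


def resolve_pretrained_path(path):
    """Return path if it points to an existing file, otherwise None."""
    if path and os.path.isfile(path):
        return path
    return None


# 'empty' is the value seen in the traceback; it only resolves if a file
# with that exact name exists in the current working directory.
weights = resolve_pretrained_path("empty")
if weights is None:
    print("No pretrained model file found; set a real weights path first.")
else:
    print("Would load weights from:", weights)
```

Checking the path up front turns the FileNotFoundError raised deep inside torch.load into an explicit message at the point where the path is configured.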

yolox backbone error

The YOLOX backbone in this codebase has no Focus operation, so the shape of the stem differs between https://github.com/Megvii-BaseDetection/YOLOX and this codebase.
The stem of yolox_m backbone in https://github.com/Megvii-BaseDetection/YOLOX:
image
The stem of yolox_m backbone in this codebase:
image

import torch
import torch.nn as nn

# BaseConv (conv + norm + activation) is defined alongside Focus in the YOLOX repo.


class Focus(nn.Module):
    """Focus width and height information into channel space."""

    def __init__(self, in_channels, out_channels, ksize=1, stride=1, act="silu"):
        super().__init__()
        # The four sampled patches are stacked on the channel axis, hence in_channels * 4.
        self.conv = BaseConv(in_channels * 4, out_channels, ksize, stride, act=act)

    def forward(self, x):
        # shape of x (b,c,w,h) -> y(b,4c,w/2,h/2)
        patch_top_left = x[..., ::2, ::2]
        patch_top_right = x[..., ::2, 1::2]
        patch_bot_left = x[..., 1::2, ::2]
        patch_bot_right = x[..., 1::2, 1::2]
        x = torch.cat(
            (
                patch_top_left,
                patch_bot_left,
                patch_top_right,
                patch_bot_right,
            ),
            dim=1,
        )
        return self.conv(x)
