
simpleaicv_pytorch_training_examples's Introduction

My column

https://www.zhihu.com/column/c_1692623656205897728

Introduction

This repository provides simple training and testing examples for the following tasks:

task | supported datasets | supported networks
Image classification task | CIFAR100, ImageNet1K(ILSVRC2012), ImageNet21K(Winter 2021 release), ACCV2022 | ResNet, DarkNet, RepVGG, RegNetX, ViT, VAN
Object detection task | VOC2007 and VOC2012, COCO2017, Objects365(v2,2020) | RetinaNet, FCOS, CenterNet, TTFNet, DETR, DINO-DETR
Semantic segmentation task | ADE20K | DeepLabv3+, U2Net
Instance segmentation task | COCO2017 | YOLACT, SOLOv2
Knowledge distillation task | ImageNet1K(ILSVRC2012) | KD loss(for ResNet), DML loss(for ResNet)
Contrastive learning task | ImageNet1K(ILSVRC2012) | DINO(for ResNet)
Masked image modeling task | ImageNet1K(ILSVRC2012), ACCV2022 | MAE(for ViT)
OCR text detection task | ICDAR | DBNet
OCR text recognition task | / | CTC Model
Image inpainting task | CelebA-HQ, Places365-standard, Places365-challenge | AOT-GAN, TRANSX-LKA-AOT-GAN
Diffusion model task | CIFAR10, CIFAR100, CelebA-HQ, FFHQ | DDPM, DDIM, PLMS

All task training results

See all task training results in results.md.

Environments

1. This repository only supports Ubuntu (version >= 18.04 LTS).

2. This repository only supports single-machine training with PyTorch DDP, using either one GPU or multiple GPUs.

3. Make sure your Python version is >= 3.7.

4. Make sure your PyTorch version is >= 1.10.

5. If you want to use torch.compile(), make sure your PyTorch version is >= 2.0.
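The version requirements above can be checked programmatically. The following is a minimal sketch, not part of the repository: the helper names `parse_major_minor` and `check_environment` are hypothetical. Versions are compared as integer tuples because a plain string comparison would rank "1.9" above "1.10".

```python
import re
import sys


def parse_major_minor(version):
    """Return (major, minor) from strings such as '2.0.1+cu118' or '1.10.0'."""
    match = re.match(r"(\d+)\.(\d+)", str(version))
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment(python_version, torch_version):
    """Compare versions against the minimums listed above (hypothetical helper)."""
    torch_mm = parse_major_minor(torch_version)
    return {
        "python_ok": tuple(python_version[:2]) >= (3, 7),
        "torch_ok": torch_mm >= (1, 10),
        # torch.compile() only exists from PyTorch 2.0 onwards.
        "torch_compile_ok": torch_mm >= (2, 0),
    }


if __name__ == "__main__":
    try:
        import torch
        torch_version = str(torch.__version__)
    except ImportError:
        torch_version = "0.0"  # treated as "not installed"
    print(check_environment(sys.version_info, torch_version))
```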

Use pip or conda to install the following packages in your Python environment:

torch
torchvision
pillow
numpy
Cython
colormath
pycocotools
opencv-python
scipy
einops
scikit-image
pyclipper
shapely
imagesize
nltk
tqdm
onnx
onnx-simplifier
thop==0.0.31.post2005241907
gradio==3.32.0
yapf
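To see which of these packages are still missing, a small check along the following lines can help. This is a sketch under assumptions: the helper `missing_packages` is hypothetical, and the mapping covers only the packages whose import name differs from the pip name. It does not check the pinned versions of thop and gradio.

```python
import importlib.util

# pip name -> import name, listed only where the two differ.
IMPORT_NAMES = {
    "pillow": "PIL",
    "opencv-python": "cv2",
    "scikit-image": "skimage",
    "onnx-simplifier": "onnxsim",
}

PACKAGES = [
    "torch", "torchvision", "pillow", "numpy", "Cython", "colormath",
    "pycocotools", "opencv-python", "scipy", "einops", "scikit-image",
    "pyclipper", "shapely", "imagesize", "nltk", "tqdm", "onnx",
    "onnx-simplifier", "thop", "gradio", "yapf",
]


def missing_packages(pip_names):
    """Return the pip names whose top-level module cannot be found."""
    missing = []
    for name in pip_names:
        module = IMPORT_NAMES.get(name, name)
        if importlib.util.find_spec(module) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("missing:", missing_packages(PACKAGES))
```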

If you want to use the DINO-DETR model, install the MultiScaleDeformableAttention package in your Python environment:

cd to simpleAICV/detection/compile_multiscale_deformable_attention, then run:

chmod +x make.sh
./make.sh

Download my pretrained models and experiment records

You can download all my pretrained models and all my experiment records/checkpoints from Hugging Face or Baidu Netdisk.

If you only want the pretrained models (model.state_dict()), download the pretrained_models folder.

# huggingface
https://huggingface.co/zgcr654321/classification_training/tree/main
https://huggingface.co/zgcr654321/contrastive_learning_training/tree/main
https://huggingface.co/zgcr654321/detection_training/tree/main
https://huggingface.co/zgcr654321/image_inpainting_training/tree/main
https://huggingface.co/zgcr654321/diffusion_model_training/tree/main
https://huggingface.co/zgcr654321/distillation_training/tree/main
https://huggingface.co/zgcr654321/instance_segmentation_training/tree/main
https://huggingface.co/zgcr654321/masked_image_modeling_training/tree/main
https://huggingface.co/zgcr654321/ocr_text_detection_training/tree/main
https://huggingface.co/zgcr654321/ocr_text_recognition_training/tree/main
https://huggingface.co/zgcr654321/semantic_segmentation_training/tree/main
https://huggingface.co/zgcr654321/pretrained_models/tree/main

# Baidu-Netdisk
Link: https://pan.baidu.com/s/1yhEwaZhrb2NZRpJ5eEqHBw
Extraction code: rgdo

Prepare datasets

CIFAR10

Make sure the folder structure is as follows:

CIFAR10
|
|-----batches.meta  unzip from cifar-10-python.tar.gz
|-----data_batch_1  unzip from cifar-10-python.tar.gz
|-----data_batch_2  unzip from cifar-10-python.tar.gz
|-----data_batch_3  unzip from cifar-10-python.tar.gz
|-----data_batch_4  unzip from cifar-10-python.tar.gz
|-----data_batch_5  unzip from cifar-10-python.tar.gz
|-----readme.html   unzip from cifar-10-python.tar.gz
|-----test_batch    unzip from cifar-10-python.tar.gz

CIFAR100

Make sure the folder structure is as follows:

CIFAR100
|
|-----train unzip from cifar-100-python.tar.gz
|-----test  unzip from cifar-100-python.tar.gz
|-----meta  unzip from cifar-100-python.tar.gz

ImageNet 1K(ILSVRC2012)

Make sure the folder structure is as follows:

ILSVRC2012
|
|-----train----1000 sub classes folders
|-----val------1000 sub classes folders
Make sure each class uses the same folder name in the train and val folders.
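The train/val naming requirement can be verified before training. The sketch below is an illustration only; `class_folders` and `compare_train_val` are hypothetical helpers, and the only assumption about the layout is the `train` and `val` sub-folders shown above.

```python
import os


def class_folders(root):
    """Return the set of sub-folder names directly under root."""
    return {entry.name for entry in os.scandir(root) if entry.is_dir()}


def compare_train_val(dataset_root):
    """Return (classes only in train, classes only in val), both sorted."""
    train = class_folders(os.path.join(dataset_root, "train"))
    val = class_folders(os.path.join(dataset_root, "val"))
    return sorted(train - val), sorted(val - train)
```

Both returned lists are empty when the two folders contain exactly the same class folder names.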

ImageNet 21K(Winter 2021 release)

Make sure the folder structure is as follows:

ImageNet21K
|
|-----train-----------10450 sub classes folders
|-----val-------------10450 sub classes folders
|-----small_classes---10450 sub classes folders
|-----imagenet21k_miil_tree.pth
Make sure each class uses the same folder name in the train and val folders.

ACCV2022

Make sure the folder structure is as follows:

ACCV2022
|
|-----train-------------5000 sub classes folders
|-----testa-------------60000 images
|-----accv2022_broken_list.json

VOC2007 and VOC2012

Make sure the folder structure is as follows:

VOCdataset
|                 |----Annotations
|                 |----ImageSets
|----VOC2007------|----JPEGImages
|                 |----SegmentationClass
|                 |----SegmentationObject
|        
|                 |----Annotations
|                 |----ImageSets
|----VOC2012------|----JPEGImages
|                 |----SegmentationClass
|                 |----SegmentationObject

COCO2017

Make sure the folder structure is as follows:

COCO2017
|                |----captions_train2017.json
|                |----captions_val2017.json
|--annotations---|----instances_train2017.json
|                |----instances_val2017.json
|                |----person_keypoints_train2017.json
|                |----person_keypoints_val2017.json
|                 
|                |----train2017
|----images------|----val2017

Objects365(v2,2020)

Make sure the folder structure is as follows:

objects365_2020
|
|                |----zhiyuan_objv2_train.json
|--annotations---|----zhiyuan_objv2_val.json
|                |----sample_2020.json
|                 
|                |----train all train patch folders
|----images------|----val   all val patch folders
                 |----test  all test patch folders

ADE20K

Make sure the folder structure is as follows:

ADE20K
|                 |----training
|---images--------|----validation
|                 |----testing
|        
|                 |----training
|---annotations---|----validation

CelebA-HQ

Make sure the folder structure is as follows:

CelebA-HQ
|                 |----female
|---train---------|----male
|        
|                 |----female
|---val-----------|----male

FFHQ

Make sure the folder structure is as follows:

FFHQ
|
|---images
|---ffhq-dataset-v1.json
|---ffhq-dataset-v2.json

Places365-standard/challenge

Make sure the folder structure is as follows:

Places365-standard/challenge
|
|                            |---train_large all sub folders
|---high_resolution_images---|---val_large   all images
|                            |---test_large  all images

How to train and test model

To train or test a model, enter a training experiment folder and run train.sh or test.sh.

For example, enter classification_training/imagenet/resnet50.

To retrain this model from scratch, delete the checkpoints and log folders first, then run train.sh:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node=2 --master_addr 127.0.1.0 --master_port 10000 ../../../tools/train_classification_model.py --work-dir ./

To test this model, you need a pretrained model first. Set trained_model_path in test_config.py, then run test.sh:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node=2 --master_addr 127.0.1.0 --master_port 10000 ../../../tools/test_classification_model.py --work-dir ./

CUDA_VISIBLE_DEVICES specifies the GPU ids used for the run. Make sure nproc_per_node equals the number of GPUs in use, and that the master_addr/master_port pair is unique for each training job.

All checkpoints and logs are saved in the training/testing experiment folder.

You can also modify hyperparameters in train_config.py/test_config.py.
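The rule that nproc_per_node must match the number of visible GPUs can be enforced by deriving both from one list of GPU ids. The sketch below is illustrative and not part of the repository: `build_launch` is a hypothetical helper, and its defaults simply copy the address, port, and flags from the commands shown above.

```python
def build_launch(gpu_ids, script, master_addr="127.0.1.0", master_port=10000,
                 work_dir="./"):
    """Return (env, argv) for a torch.distributed.run launch (hypothetical helper)."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in gpu_ids)}
    argv = [
        "python", "-m", "torch.distributed.run",
        # Derived from the same list, so it always matches CUDA_VISIBLE_DEVICES.
        f"--nproc_per_node={len(gpu_ids)}",
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        script,
        "--work-dir", work_dir,
    ]
    return env, argv


if __name__ == "__main__":
    # Single-GPU variant of the training command.
    env, argv = build_launch([0], "../../../tools/train_classification_model.py")
    print(env, " ".join(argv))
```

With a single GPU this yields CUDA_VISIBLE_DEVICES=0 and --nproc_per_node=1. Use a different master_port for each job that runs concurrently on the same machine.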

How to use gradio demo

cd to gradio_demo, which contains classification/detection/semantic_segmentation/instance_segmentation demos.

For example, to run the detection gradio demo:

chmod +x run_gradio_detect_single_image.sh
./run_gradio_detect_single_image.sh

Citation

If you find my work useful in your research, please consider citing:

@inproceedings{zgcr,
 title={SimpleAICV-pytorch-training-examples},
 author={zgcr},
 year={2020-2024}
}


simpleaicv_pytorch_training_examples's Issues

yolox backbone error

The YOLOX backbone in this codebase has no Focus operation, so the shape of its stem differs from the one in https://github.com/Megvii-BaseDetection/YOLOX.
The stem of the yolox_m backbone in https://github.com/Megvii-BaseDetection/YOLOX:
[image not preserved]
The stem of the yolox_m backbone in this codebase:
[image not preserved]

import torch
import torch.nn as nn

# BaseConv is defined next to Focus in YOLOX (yolox/models/network_blocks.py).


class Focus(nn.Module):
    """Focus width and height information into channel space."""

    def __init__(self, in_channels, out_channels, ksize=1, stride=1, act="silu"):
        super().__init__()
        self.conv = BaseConv(in_channels * 4, out_channels, ksize, stride, act=act)

    def forward(self, x):
        # shape of x (b,c,w,h) -> y(b,4c,w/2,h/2)
        patch_top_left = x[..., ::2, ::2]
        patch_top_right = x[..., ::2, 1::2]
        patch_bot_left = x[..., 1::2, ::2]
        patch_bot_right = x[..., 1::2, 1::2]
        x = torch.cat(
            (
                patch_top_left,
                patch_bot_left,
                patch_top_right,
                patch_bot_right,
            ),
            dim=1,
        )
        return self.conv(x)
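To make the slicing in the Focus module quoted above concrete, here is a framework-free sketch that applies the same four strided slices to a single 2D channel stored as a list of rows. The helper names are hypothetical, and the sketch assumes even height and width.

```python
def focus_patches(image):
    """Split one 2D channel (a list of rows) into the four strided patches."""
    top_left = [row[::2] for row in image[::2]]     # even rows, even columns
    top_right = [row[1::2] for row in image[::2]]   # even rows, odd columns
    bot_left = [row[::2] for row in image[1::2]]    # odd rows, even columns
    bot_right = [row[1::2] for row in image[1::2]]  # odd rows, odd columns
    # Same concatenation order as the Focus module: TL, BL, TR, BR.
    return [top_left, bot_left, top_right, bot_right]


def focus_output_shape(b, c, h, w):
    """Shape after the slicing and channel concatenation, before the conv."""
    return (b, 4 * c, h // 2, w // 2)


if __name__ == "__main__":
    image = [[r * 4 + col for col in range(4)] for r in range(4)]
    for patch in focus_patches(image):
        print(patch)
    print(focus_output_shape(1, 3, 640, 640))
```

Each patch keeps every second pixel, so spatial resolution halves while the channel count quadruples; no pixel is discarded.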

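Why do ResNet50 pretrained weights trained with this repo differ so much from the ones in the PyTorch library?

Hello, I'd like to ask why the weights we train ourselves differ so much from the official PyTorch pretrained weights. With the official PyTorch pretrained parameters I could reach 0.616, but with pretrained parameters trained from this repo's model I only reach 0.499. Did I make a mistake somewhere? My work depends heavily on the pretrained parameters.

Weights

Hello, could you provide a download source that is accessible from mainland China?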

How to train ImageNet (which has 300 million images) on a local machine?

I'd like to know what hardware you use. I chose an i5-12400F and an RTX 4080 to train a simple conv model with just 5 layers, but it is very slow: 100 epochs would take years. I also tried AutoDL to train this model; CPU utilization is 100% but GPU utilization is only 40%, and training is still very slow.

About the evaluation metric

Hello, I see that your code evaluates accuracy, while the tables on your GitHub report error rate. Do the two sum to 1?

pretrained problem

After setting pretrained to True, I get the following error. How can I fix it?

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Traceback (most recent call last):
  File "../../../tools/train_detection_model.py", line 205, in <module>
    main()
  File "../../../tools/train_detection_model.py", line 46, in main
    from train_config import config
  File "./train_config.py", line 19, in <module>
    class config:
  File "./train_config.py", line 28, in config
    'num_classes': num_classes,
  File "/home/cc631/hailong/code/Dilated-FPN/simpleAICV-pytorch-ImageNet-COCO-training/simpleAICV/detection/models/retinanet.py", line 145, in resnet50_retinanet
    return _retinanet('resnet50', pretrained, **kwargs)
  File "/home/cc631/hailong/code/Dilated-FPN/simpleAICV-pytorch-ImageNet-COCO-training/simpleAICV/detection/models/retinanet.py", line 131, in _retinanet
    map_location=torch.device('cpu')), model)
  File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/torch/serialization.py", line 581, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/cc631/anaconda3/envs/pytorch1.7/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'empty'
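The traceback shows torch.load being handed the literal path 'empty', which does not exist on disk. One way to avoid this class of failure is to validate the checkpoint path before loading. The sketch below is an assumption about how such a guard could look, not the repository's code; `usable_checkpoint_path` is a hypothetical helper.

```python
import os


def usable_checkpoint_path(path):
    """Return path if it points to an existing file, otherwise None."""
    if not path:
        return None
    return path if os.path.isfile(path) else None


# Intended use (requires torch, shown as comments to keep the sketch stdlib-only):
# path = usable_checkpoint_path(pretrained_path)
# if path is not None:
#     state_dict = torch.load(path, map_location=torch.device("cpu"))
# else:
#     print("No pretrained model file, training from scratch.")
```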

Defining a model

Hello, does your code support custom models? I'd like to define my own resnet110 network. Can I do that by changing the parameter to [3,4,26,3]?
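As a worked check of the arithmetic behind the question above, the conventional ResNet depth count is one stem convolution, plus the convolutions inside every residual block, plus the final fully connected layer. The sketch assumes that convention (shortcut convolutions are not counted); `resnet_depth` is a hypothetical helper, and whether this repository's ResNet constructor accepts a custom block list is not confirmed here.

```python
def resnet_depth(blocks_per_stage, convs_per_block):
    """Stem conv + convs inside all residual blocks + final fc layer."""
    return sum(blocks_per_stage) * convs_per_block + 2


if __name__ == "__main__":
    # Bottleneck blocks have 3 convs each, basic blocks have 2.
    print(resnet_depth([3, 4, 6, 3], 3))    # ResNet50 layout
    print(resnet_depth([3, 4, 26, 3], 3))   # layout from the question
    print(resnet_depth([18, 18, 18], 2))    # CIFAR-style 3-stage layout
```

Under this count, [3,4,26,3] with bottleneck blocks gives 36 * 3 + 2 = 110 layers. The ResNet-110 usually reported on CIFAR is a different design: three stages of 18 basic blocks each, which also gives 54 * 2 + 2 = 110.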

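RetinaNet training problem

Hello, I read your CSDN article on training RetinaNet with IoU loss. It is very detailed, but I have a question:
Is it enough to simply replace smooth L1 loss with IoU loss? When I train it myself, the initial classification loss is 1.228 while the IoU loss reaches 11.56, which seems like a large gap. What could be the cause, and is there a good way to fix it?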

Pretrained model

Could you please share your pretrained models for the VoVNet and RegNet series? Thank you very much!

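Some questions about cocodataset

Hello, I mainly want to learn how to load the COCO dataset, and your implementation looks good, but I have questions about parts of cocodataset. For example, what is the COCODataPrefetcher() class for, and what is coco_class_color in this file used for?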

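darknet53, training on the ImageNet dataset

Hello:
You said results fluctuate across training runs. I trained darknet53 on ImageNet and my best result so far is top-1 acc 76.5%. Is that within the fluctuation range? My training configuration is the same as yours, except that I use distributed training on four GPUs with batchsize=124×4, which I suppose makes a difference.
My training script: https://github.com/njustczr/darknet53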

Cannot clone on Windows

I wanted to read the code, but cloning fails:

(base) PS D:\code> git clone https://github.com/zgcr/SimpleAICV_pytorch_training_examples.git
Cloning into 'SimpleAICV_pytorch_training_examples'...
remote: Enumerating objects: 2761, done.
remote: Counting objects: 100% (2181/2181), done.
remote: Compressing objects: 100% (970/970), done.
remote: Total 2761 (delta 1257), reused 2010 (delta 1129), pack-reused 580
Receiving objects: 100% (2761/2761), 35.31 MiB | 2.01 MiB/s, done.
Resolving deltas: 100% (1588/1588), done.
fatal: cannot create directory at 'simpleAICV/detection/compile_multiscale_deformable_attention/build/temp.linux-x86_64-3.8/root/code/SimpleAICV_pytorch_training_examples_on_ImageNet_COCO_ADE20K/z_dino_main/dino_multiscale_deformable_attention_compile': Filename too long
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

(base) PS D:\code>

knowledge distillation training seems broken

First, thanks to the author for the great work. It is a great tool for running ablation studies. (I even think you could write a paper about it after adding a few more training options, e.g. few-shot or zero-shot learning.)

However, there seems to be a bug in distillation training. After downloading the ResNet-34 weights and loading them into the teacher model, its accuracy is very low: it reports 0.298% top-1 accuracy on ImageNet-1K.

I have not checked the code yet, but I suspect the published weights use a different order of output classes. Could you kindly check this?
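If the suspicion about class order were correct, the mismatch could be repaired by permuting the teacher's outputs into the order the evaluation labels use. The sketch below only illustrates that idea on plain lists; it is not a diagnosis of the reported bug, and `build_index_map` and `reorder_scores` are hypothetical helpers.

```python
def build_index_map(src_classes, dst_classes):
    """For each position in dst_classes, the index of that class in src_classes."""
    src_index = {name: i for i, name in enumerate(src_classes)}
    if len(src_index) != len(src_classes) or sorted(src_classes) != sorted(dst_classes):
        raise ValueError("class lists must contain the same unique names")
    return [src_index[name] for name in dst_classes]


def reorder_scores(scores, index_map):
    """Rearrange per-class scores from the source order into the target order."""
    return [scores[i] for i in index_map]


if __name__ == "__main__":
    teacher_order = ["cat", "dog", "bird"]   # order the weights were trained with
    label_order = ["bird", "cat", "dog"]     # order the dataset labels use
    index_map = build_index_map(teacher_order, label_order)
    print(reorder_scores([0.1, 0.7, 0.2], index_map))
```

An accuracy far below chance level after loading weights is one symptom that makes a systematic label permutation worth ruling out.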

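Training CenterNet takes a very long time

Hello,
I compared the original CenterNet code with the CenterNet in your repo and found that one epoch takes much longer with your repo: one iteration at batch size 64 takes about 20s, while the original CenterNet code takes around a second.
I compared the code and found no major differences. Do you know why?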

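retinanet training problem

Hello, when I train retinanet, the run always stops by itself after printing a few warnings, without reporting any error. At first I thought apex was the problem, but it still stops after I set it to false. Switching to multiple GPUs gives the same result. Do you know why?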
`root@container-ab78119f3c-c31dcd5b:~/SimpleAICV-pytorch-ImageNet-COCO-training-master/detection_training/coco/res50_retinanet_retinaresize800# sh train.sh
======================1======================
No pretrained model file!
loading annotations into memory...
Done (t=16.43s)
creating index...
index created!
Dataset Size:117266
Dataset Class Num:80
loading annotations into memory...
Done (t=0.51s)
creating index...
index created!
Dataset Size:5000
Dataset Class Num:80
======================2======================
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.`

Training problem

Hello, sorry to bother you again. During training, train.info.log stops reporting after iteration 8700:
2021-12-03 15:46:39 - train: epoch 0001, iter [08200, 58633], lr: 0.000100, total_loss: 0.4340, cls_loss: 0.2691, reg_loss: 0.1649
2021-12-03 15:47:37 - train: epoch 0001, iter [08300, 58633], lr: 0.000100, total_loss: 0.6410, cls_loss: 0.4634, reg_loss: 0.1775
2021-12-03 15:48:35 - train: epoch 0001, iter [08400, 58633], lr: 0.000100, total_loss: 0.5121, cls_loss: 0.2628, reg_loss: 0.2494
2021-12-03 15:49:28 - train: epoch 0001, iter [08500, 58633], lr: 0.000100, total_loss: 0.4244, cls_loss: 0.2080, reg_loss: 0.2165
2021-12-03 15:50:28 - train: epoch 0001, iter [08600, 58633], lr: 0.000100, total_loss: 0.5233, cls_loss: 0.3370, reg_loss: 0.1864
2021-12-03 15:51:25 - train: epoch 0001, iter [08700, 58633], lr: 0.000100, total_loss: 0.9907, cls_loss: 0.6687, reg_loss: 0.3220
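No weights were generated either. Several runs all got stuck at this point. I don't know whether I should keep training or change something.
What is going on here?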

train on RetinaNet

When I train without apex, I get the following:
loading annotations into memory...
Done (t=20.29s)
creating index...
index created!
loading annotations into memory...
Done (t=2.85s)
creating index...
index created!
With apex enabled, it also shows:
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0

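After that nothing else is displayed. Is it still training, or is it stuck? If it is stuck, what is causing it? My training environment is a 3080 Ti with batch size set to 2.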
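The "Gradient overflow. Skipping step" lines in the log above come from dynamic loss scaling, which halves the loss scale whenever the scaled gradients overflow and skips that optimizer step. The following is a generic sketch of that mechanism, not apex's implementation; the class name is hypothetical, and the starting scale of 2**16 and growth window of 2000 steps are assumptions chosen to match the values in the log.

```python
class DynamicLossScaler:
    """Generic dynamic loss scaling: halve on overflow, double after a clean run."""

    def __init__(self, init_scale=2.0 ** 16, scale_window=2000):
        self.scale = init_scale
        self.scale_window = scale_window
        self.clean_steps = 0

    def update(self, overflow):
        """Return True if the optimizer step should be applied."""
        if overflow:
            self.scale /= 2.0       # back off and skip this step
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps == self.scale_window:
            self.scale *= 2.0       # try a larger scale again
            self.clean_steps = 0
        return True


if __name__ == "__main__":
    scaler = DynamicLossScaler()
    for _ in range(5):
        scaler.update(overflow=True)
        print(scaler.scale)
```

Five consecutive overflows from 65536 produce 32768, 16384, 8192, 4096, 2048, the same sequence as in the log. A handful of skipped steps at the start of mixed-precision training is normal for this scheme, so these messages alone do not show that training has stalled.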

train.sh

Hello, I only have one GPU. How should I set up train.sh for object detection?
Thanks!

reg_head in RetinaNet

Hi, thanks for your great contributions. I have a question about the implementation of RetinaNet. In losses.py, it seems that the reg_head directly outputs the absolute positions of bounding boxes, and the L1 loss is calculated from the difference between the ground truth bbox positions and the reg_head output. Is my understanding correct?
