locuslab / convmixer

Implementation of ConvMixer for "Patches Are All You Need? 🤷"

License: MIT License

convmixer's Introduction

Patches Are All You Need? 🤷

This repository contains an implementation of ConvMixer for the ICLR 2022 submission "Patches Are All You Need?" by Asher Trockman and Zico Kolter.

🔎 New: Check out this repository for training ConvMixers on CIFAR-10.

Code overview

The most important code is in convmixer.py. We trained ConvMixers using the timm framework, which we copied from here.

Update: ConvMixer is now integrated into the timm framework itself. You can see the PR here.

Inside pytorch-image-models, we have made the following modifications. (Though one could look at the diff, we think it is convenient to summarize them here.)

Added ConvMixers
- added timm/models/convmixer.py
- modified timm/models/__init__.py
Added "OneCycle" LR Schedule
- added timm/scheduler/onecycle_lr.py
- modified timm/scheduler/scheduler.py
- modified timm/scheduler/scheduler_factory.py
- modified timm/scheduler/__init__.py
- modified train.py (added two lines to support this LR schedule)

We are confident that the use of the OneCycle schedule here is not critical, and one could likely just as well train ConvMixers with the built-in cosine schedule.

Evaluation

We provide some model weights below:

Model Name	Kernel Size	Patch Size	File Size
ConvMixer-1536/20	9	7	207MB
ConvMixer-768/32*	7	7	85MB
ConvMixer-1024/20	9	14	98MB

* Important: ConvMixer-768/32 here uses ReLU instead of GELU, so you would have to change convmixer.py accordingly (we will fix this later).
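
As a rough sanity check on the file sizes in the table above, the parameter count can be tallied by hand from the architecture: a patch-embedding convolution, then `depth` blocks of depthwise and pointwise convolutions (each followed by BatchNorm), then a linear classifier. The sketch below assumes a bias on every convolution and two learnable parameters per BatchNorm channel, as in the PyTorch defaults. The function name is ours, and 4 bytes per parameter (float32) is an assumption about how the checkpoints are stored.

```python
def convmixer_param_count(dim, depth, kernel_size, patch_size, n_classes=1000):
    """Count learnable parameters of a ConvMixer (assumes conv biases and affine BatchNorm)."""
    batchnorm = 2 * dim  # weight and bias per channel
    patch_embed = 3 * dim * patch_size**2 + dim + batchnorm
    depthwise = dim * kernel_size**2 + dim + batchnorm  # groups=dim: one k x k filter per channel
    pointwise = dim * dim + dim + batchnorm  # 1x1 convolution mixing channels
    classifier = dim * n_classes + n_classes
    return patch_embed + depth * (depthwise + pointwise) + classifier


for name, args in {
    "ConvMixer-1536/20": (1536, 20, 9, 7),
    "ConvMixer-768/32": (768, 32, 7, 7),
    "ConvMixer-1024/20": (1024, 20, 9, 14),
}.items():
    n = convmixer_param_count(*args)
    print(f"{name}: {n / 1e6:.1f}M parameters, about {n * 4 / 1e6:.0f} MB in float32")
```

The estimates land close to the listed file sizes; the small remainder is plausibly BatchNorm running statistics and checkpoint metadata, which this count ignores.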

You can evaluate ConvMixer-1536/20 as follows:

python validate.py --model convmixer_1536_20 --b 64 --num-classes 1000 --checkpoint [/path/to/convmixer_1536_20_ks9_p7.pth.tar] [/path/to/ImageNet1k-val]

You should get 81.37% accuracy.

Training

If you had a node with 10 GPUs, you could train a ConvMixer-1536/20 as follows (these are exactly the settings we used):

sh distributed_train.sh 10 [/path/to/ImageNet1k] \
    --train-split [your_train_dir] \
    --val-split [your_val_dir] \
    --model convmixer_1536_20 \
    -b 64 \
    -j 10 \
    --opt adamw \
    --epochs 150 \
    --sched onecycle \
    --amp \
    --input-size 3 224 224 \
    --lr 0.01 \
    --aa rand-m9-mstd0.5-inc1 \
    --cutmix 0.5 \
    --mixup 0.5 \
    --reprob 0.25 \
    --remode pixel \
    --num-classes 1000 \
    --warmup-epochs 0 \
    --opt-eps=1e-3 \
    --clip-grad 1.0

We also included a ConvMixer-768/32 in timm/models/convmixer.py (though it is simple to add more ConvMixers). We trained that one with the above settings but with 300 epochs instead of 150 epochs.

Note: If you are training on CIFAR-10 instead of ImageNet-1k, we recommend setting --scale 0.75 1.0 as well, since the default value of 0.08 1.0 does not make sense for 32x32 inputs.
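
To see why the default range is a poor fit for small images, recall that `--scale` bounds the fraction of the image area kept by the random-resized-crop augmentation. The arithmetic below is a sketch that assumes a square crop (aspect ratio 1) and ignores aspect-ratio jitter; the helper name is ours.

```python
import math


def min_crop_side(image_side, min_scale):
    """Side length in pixels of a square crop covering `min_scale` of the image area."""
    return image_side * math.sqrt(min_scale)


for side, scale in [(224, 0.08), (32, 0.08), (32, 0.75)]:
    print(f"{side}x{side} input, scale {scale}: smallest crop is about "
          f"{min_crop_side(side, scale):.1f} px per side")
```

On a 32x32 image, the default lower bound of 0.08 permits crops of roughly 9 pixels per side, which are then resized back to the full input size; with 0.75 the smallest crop is close to 28 pixels per side.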

The tweetable version of ConvMixer, which requires from torch.nn import *:

def ConvMixer(h,d,k,p,n):
 S,C,A=Sequential,Conv2d,lambda x:S(x,GELU(),BatchNorm2d(h))
 R=type('',(S,),{'forward':lambda s,x:s[0](x)+x})
 return S(A(C(3,h,p,p)),*[S(R(A(C(h,h,k,groups=h,padding=k//2))),A(C(h,h,1))) for i in range(d)],AdaptiveAvgPool2d(1),Flatten(),Linear(h,n))
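
For readers who prefer not to decode the compact form, here is an expanded sketch of the same architecture. It is our unrolled reading of the snippet above rather than a copy of convmixer.py. The names `Residual`, `dim`, `depth`, `kernel_size`, `patch_size` and `n_classes` are ours, and `padding=kernel_size // 2` assumes an odd kernel size so that the depthwise convolution preserves the spatial size.

```python
import torch
import torch.nn as nn


class Residual(nn.Module):
    """Wrap a module so that its input is added back to its output."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x


def ConvMixer(dim, depth, kernel_size=9, patch_size=7, n_classes=1000):
    def block():
        return nn.Sequential(
            # Depthwise convolution (spatial mixing) with a residual connection.
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding=kernel_size // 2),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            # Pointwise 1x1 convolution (channel mixing).
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )

    return nn.Sequential(
        # Patch embedding: kernel size and stride both equal to patch_size.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[block() for _ in range(depth)],
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )


# Illustrative small configuration (arbitrary values, not a released model).
model = ConvMixer(dim=64, depth=4, kernel_size=5, patch_size=2, n_classes=10).eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 32, 32))
print(logits.shape)  # one row of 10 class scores
```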

convmixer's Issues

Experiments with full convolutional layers instead of patch embedding?

Have the authors tried replacing the patch embedding with a plain convolution, that is, using stride 1 instead of p?

With this setting, the model becomes a standard convolutional network like MobileNet. I wonder how it would perform. Is the performance gain of ConvMixer due to the patch embedding or to the depthwise convolution layers?

Very interested in this work, thanks.
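
One way to frame the stride question: the patch embedding is a convolution whose kernel size and stride both equal p, so it fixes the resolution at which every later layer runs. The sketch below counts spatial positions after the stem for 224x224 inputs. The stride-1 case assumes padding that preserves the input size, and the comparison illustrates cost only, not accuracy. The function name is ours.

```python
def positions_after_stem(image_side, stride):
    """Number of spatial positions when the stem downsamples by `stride` (remainders dropped)."""
    side = image_side // stride
    return side * side


patch = positions_after_stem(224, 7)  # patch embedding with p = 7
dense = positions_after_stem(224, 1)  # hypothetical stride-1 stem
print(patch, dense, dense // patch)   # 1024 50176 49
```

Every depthwise and pointwise layer would then process 49 times as many positions, so a stride-1 variant would need roughly 49 times the compute and activation memory per block.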

Cifar10 baseline doesn't reach 95%

Hello,
I tried convmixer256 on CIFAR-10 with the same timm options specified for ImageNet (except num_classes), and it doesn't go beyond 90% accuracy. Could you please specify the options used for the CIFAR-10 experiments?

something about loss

CIFAR-10 training settings

First of all, thank you for the interesting work.
I was experimenting with the variant with patch size 1 and kernel size 9 on CIFAR-10, using the following training settings:

--model tiny_convmixer
 -b 64 -j 8 
--opt adamw 
--epochs 200 
--sched onecycle 
--amp 
--input-size 3 32 32 
--lr 0.01 
--aa rand-m9-mstd0.5-inc1 
--cutmix 0.5 
--mixup 0.5 
--reprob 0.25 
--remode pixel 
--num-classes 10
--warmup-epochs 0
--opt-eps 1e-3
--clip-grad 1.0
--scale 0.75 1.0
--weight-decay 0.01
--mean 0.4914 0.4822 0.4465
--std 0.2471 0.2435 0.2616

I could get only 95.89%. I am supposed to get 96.03% according to Table 4 in the paper.
Can you please let me know any setting I missed? Thank you again.

License

Hi.

Would you consider providing an open source license for this repo?

What's new about this model?

Why “patches” are all you need?
Patch embedding is Conv7x7 stem,
The body is simply repeated Conv9x9 + Conv1x1,
(Not challenging your work, it's indeed very interesting), but just kindly wondering what's new about this model?

Hello, can ConvMixer directly extract 2D features from images?

weight location

Where is the weights file saved after training?

Do you use this model on other downstream work? (like semantic segmentation )

Is single-GPU training possible?

If yes, what is the proper script command?

I changed the distributed script to use a single GPU, but it failed to train.

Segmentation ConvMixer architecture ?

I was trying to figure out what a segmentation ConvMixer would look like, and came up with this (residual connection inspired by MultiResUNet). Does it make sense to you?

Request more experiment results to compare to other architecture.

Hi!
This work is pretty interesting, but I think there should be more results, as in "Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight", where local self-attention in Swin Transformer is replaced with depth-wise convolution. Since ConvMixer is a simpler architecture than Swin Transformer, I wonder whether it can reach similar performance on object detection and semantic segmentation.

Could you please release the training logs of convmixer on imagenet and cifar?

As the title implies, it would be great to see the training log of convmixer.

padding=same?

https://github.com/tmp-iclr/convmixer/blob/1cefd860a1a6a85369887d1a633425cedc2afd0a/convmixer.py#L18 There is an error: TypeError: conv2d(): argument 'padding' (position 5) must be tuple of ints, not str.
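
String values for `padding` are only accepted by newer PyTorch releases (to our knowledge, `padding='same'` arrived in version 1.9), so an older install fails with the error above. A possible workaround, assuming an odd kernel size with stride 1 and dilation 1, is to pass the integer `kernel_size // 2` instead. The check below uses the standard convolution output-size formula to confirm that this preserves the spatial size; the helper name is ours.

```python
def conv_output_size(size, kernel_size, padding, stride=1, dilation=1):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


for k in (3, 5, 7, 9):
    assert conv_output_size(32, k, padding=k // 2) == 32  # size preserved for odd k

# With an even kernel, k // 2 no longer reproduces "same" padding.
print(conv_output_size(32, 8, padding=8 // 2))  # 33
```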

Training time

Hi, first of all thanks for a very interesting paper.

I would like to know how long it took you to train the models. I'm trying to train ConvMixer-768/32 on 2x V100, and one epoch takes ~3 hours, so I estimate that full training would take 2 * 3 * 300 = 1800 GPU hours, which is insane. Even with 10 GPUs, one experiment would take ~1 week to finish. Are my calculations correct?
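
The estimate can be checked with a few lines of arithmetic. The figures are the ones quoted above (2 GPUs, about 3 hours per epoch, 300 epochs); the 10-GPU wall-clock time assumes perfect linear scaling, which real multi-GPU training only approximates.

```python
def gpu_hours(num_gpus, hours_per_epoch, epochs):
    """Total GPU-hours: wall-clock hours times the number of GPUs in use."""
    return num_gpus * hours_per_epoch * epochs


total = gpu_hours(num_gpus=2, hours_per_epoch=3, epochs=300)
wall_clock_days_on_10 = total / 10 / 24  # assumes perfect linear scaling to 10 GPUs
print(total, wall_clock_days_on_10)      # 1800 7.5
```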

Training scheme modifications for small GPUs

Hi authors. Your paper has demonstrated a quite intriguing observation. I wish you luck with your submission.
Thanks for sharing the code of the submission. When running the code, I got an issue regarding OOM when using the default batch size of 64. In the end I can only run with 8 samples per batch per GPU as my GPUs have only 11GB. I would like to know if you have tried smaller GPUs and achieved the same results. So far, besides learning rate modified according to the linear rule, I haven't made any change yet. If you tried training using smaller GPUs before, could you please share your experience? Thank you very much!
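
For reference, the linear scaling rule mentioned here multiplies the learning rate by the ratio of the new global batch size to the reference one. The sketch below takes its reference point from the training command in the README (10 GPUs, 64 samples per GPU, `--lr 0.01`). Whether the rule transfers well to AdamW with the OneCycle schedule is not something the README states, and the GPU count in the example call is a placeholder.

```python
def scaled_lr(per_gpu_batch, num_gpus, ref_lr=0.01, ref_global_batch=10 * 64):
    """Linear scaling rule: learning rate proportional to the global batch size."""
    return ref_lr * (per_gpu_batch * num_gpus) / ref_global_batch


# Placeholder setup: 4 GPUs with 8 samples each (the issue does not give the GPU count).
print(scaled_lr(per_gpu_batch=8, num_gpus=4))
```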