
levit's Introduction

LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

This repository contains PyTorch evaluation code, training code and pretrained models for LeViT.

These models obtain competitive trade-offs in terms of speed / precision:

[Figure: speed / accuracy trade-off of the LeViT models.]

For details see LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference by Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou and Matthijs Douze.

If you use this code for a paper, please cite:

@InProceedings{Graham_2021_ICCV,
    author    = {Graham, Benjamin and El-Nouby, Alaaeldin and Touvron, Hugo and Stock, Pierre and Joulin, Armand and Jegou, Herve and Douze, Matthijs},
    title     = {LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {12259-12269}
}

Model Zoo

We provide baseline LeViT models trained with distillation on ImageNet 2012.

name        acc@1  acc@5  #FLOPs  #params  url
LeViT-128S  76.6   92.9   305M    7.8M     model
LeViT-128   78.6   94.0   406M    9.2M     model
LeViT-192   80.0   94.7   658M    11M      model
LeViT-256   81.6   95.4   1120M   19M      model
LeViT-384   82.6   96.0   2353M   39M      model

Usage

First, clone the repository locally:

git clone https://github.com/facebookresearch/levit.git

Then, install PyTorch 1.7.0+, torchvision 0.8.1+, and pytorch-image-models (timm):

conda install -c pytorch pytorch torchvision
pip install timm
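
For a quick sanity check of the installation, a pretrained model can be loaded directly from the repository's levit.py. This is only an illustrative sketch: it assumes you run it from the repository root, that the LeViT_256 factory accepts a pretrained flag (as used in the issues further down this page), and that cat.jpg is a placeholder image of your own.

import torch
from PIL import Image
from torchvision import transforms
from levit import LeViT_256  # the repository's levit.py must be on the path

model = LeViT_256(pretrained=True).eval()  # in eval mode the classifier and distillation heads are averaged

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

img = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
with torch.no_grad():
    logits = model(img)
print(logits.argmax(dim=1))  # predicted ImageNet class index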

Data preparation

Download and extract the ImageNet train and val images from http://image-net.org/. The directory structure is the standard layout expected by torchvision's datasets.ImageFolder; the training and validation data are expected to be in the train/ and val/ folders, respectively:

/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
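
With this layout, the validation split maps directly onto torchvision's datasets.ImageFolder. The sketch below is illustrative only (main.py builds its own transforms through timm); the normalization constants are the standard ImageNet statistics.

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

# Each class subfolder of val/ becomes one label index.
val_set = datasets.ImageFolder("/path/to/imagenet/val", transform=val_transform)
val_loader = DataLoader(val_set, batch_size=256, shuffle=False, num_workers=8)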

Evaluation

To evaluate a pretrained LeViT-256 model on ImageNet val with a single GPU, run:

python main.py --eval --model LeViT_256 --data-path /path/to/imagenet

This should give:

* Acc@1 81.636 Acc@5 95.424 loss 0.750
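
For reference, Acc@1 / Acc@5 are the usual top-1 and top-5 accuracies. A minimal, hypothetical helper (not part of main.py) that computes them from a batch of logits:

import torch

def topk_accuracy(logits, targets, ks=(1, 5)):
    # Fraction of samples whose true label appears among the top-k predictions.
    maxk = max(ks)
    _, pred = logits.topk(maxk, dim=1)          # (N, maxk) predicted class indices
    correct = pred.eq(targets.unsqueeze(1))     # (N, maxk) boolean hit matrix
    return [correct[:, :k].any(dim=1).float().mean().item() for k in ks]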

Training

To train LeViT-256 on ImageNet with hard distillation on a single node with 8 GPUs, run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model LeViT_256 --data-path /path/to/imagenet --output_dir /path/to/save

Multinode training

Distributed training is available via Slurm and submitit:

pip install submitit

To train the LeViT-256 model on ImageNet on one node with 8 GPUs:

python run_with_submitit.py --model LeViT_256 --data-path /path/to/imagenet

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Contributing

We actively welcome your pull requests! Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.

levit's People

Contributors

btgraham, mdouze, wkcn


levit's Issues

Exporting ONNX failed.

I used the following code to export an ONNX model:

torch.onnx.export(levit_model, dummy_input, 
                  "levit192.onnx",
                  export_params=True,
                  verbose=True, 
                  input_names=input_names, output_names=output_names)

but an error occurred:

raise RuntimeError("step!=1 is currently not supported")
RuntimeError: step!=1 is currently not supported

I tried to set opset_version=11, but another error occurred:

  File "/multimedia-nfs/liwei/model_selection/model_select_env/lib/python3.6/site-packages/torch/onnx/utils.py", line 500, in _model_to_graph
    _export_onnx_opset_version)
RuntimeError: Index is supposed to be an empty tensor or a vector

I need your help. Thank you!

attention bias problem

Thanks for sharing! I'm confused about the attention bias part, so I list a small example below.

import itertools
import torch

points = list(itertools.product(range(3), range(3)))
N = len(points)
attention_offsets = {}
idxs = []
for p1 in points:
    for p2 in points:
        offset = (abs(p1[0] - p2[0]), abs(p1[1] - p2[1]))
        if offset not in attention_offsets:
            attention_offsets[offset] = len(attention_offsets)
        idxs.append(attention_offsets[offset])
attention_biases = torch.nn.Parameter(
    torch.zeros(1, len(attention_offsets)))
attention_bias_idxs = torch.LongTensor(idxs).view(N, N)
print(attention_biases[:, attention_bias_idxs])

However, after I run this code, the result for attention_biases is all zeros!? I would expect the output to reflect the content of attention_bias_idxs for each head.
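
For context, the snippet above creates attention_biases with torch.zeros and wraps it in nn.Parameter, so it is a learnable table that only becomes non-zero after training; attention_bias_idxs merely records which bias entry belongs to each (query, key) pair. The sketch below reuses the same construction but fills the table with distinct values to make the gather visible (illustration only, num_heads = 1 as in the snippet):

import itertools
import torch

points = list(itertools.product(range(3), range(3)))
N = len(points)
attention_offsets, idxs = {}, []
for p1 in points:
    for p2 in points:
        offset = (abs(p1[0] - p2[0]), abs(p1[1] - p2[1]))
        if offset not in attention_offsets:
            attention_offsets[offset] = len(attention_offsets)
        idxs.append(attention_offsets[offset])
attention_bias_idxs = torch.LongTensor(idxs).view(N, N)

# Distinct values instead of the zero initialization; in the real model these are learned.
attention_biases = torch.arange(len(attention_offsets), dtype=torch.float).unsqueeze(0)

bias_table = attention_biases[:, attention_bias_idxs]
print(bias_table.shape)   # torch.Size([1, 9, 9]): (num_heads, N, N)
print(bias_table[0, 0])   # one bias per key position for the first query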

Can Levit work with non-square input?

It throws an error when fed an input other than 224x224. Would it be possible to use it (either levit or levit_c) as a feature extractor for non-square input?
Example code:
model = levit_c.LeViT_c_128S(num_classes=1000).to(device="cuda:0").eval()
features = model.blocks(model.patch_embed(torch.randn(4, 3, 256, 224, device='cuda:0')))  # throws an error for any resolution other than 224x224

levit_c.LeViT_c_384 not working

Hi, there seems to be a bug in the levit_c.LeViT_c_384 model. Could you please have a look at it?

To reproduce:
import torch
import levit_c
model = levit_c.LeViT_c_384()
out = model(torch.rand(4,3,224,224)) # gives following error
RuntimeError: The size of tensor a (384) must match the size of tensor b (4) at non-singleton dimension 1

Error loading model after finetuning.

Thanks for the great work. However, I am running into an issue when training a LeViT-128S model and then loading the trained model: it complains about a lot of missing and unexpected keys.

I am using the following command to evaluate the model (checkpoint.pth):

python main.py --eval --model LeViT_128S --batch-size 1

` Missing key(s) in state_dict: "patch_embed.0.weight", "patch_embed.0.bias", "patch_embed.2.weight", "patch_embed.2.bias", "patch_embed.4.weight", "patch_embed.4.bias", "patch_embed.6.weight", "patch_embed.6.bias", "blocks.0.m.qkv.weight", "blocks.0.m.qkv.bias", "blocks.0.m.proj.1.weight", "blocks.0.m.proj.1.bias", "blocks.1.m.0.weight", "blocks.1.m.0.bias", "blocks.1.m.2.weight", "blocks.1.m.2.bias", "blocks.2.m.qkv.weight", "blocks.2.m.qkv.bias", "blocks.2.m.proj.1.weight", "blocks.2.m.proj.1.bias", "blocks.3.m.0.weight", "blocks.3.m.0.bias", "blocks.3.m.2.weight", "blocks.3.m.2.bias", "blocks.4.kv.weight", "blocks.4.kv.bias", "blocks.4.q.1.weight", "blocks.4.q.1.bias", "blocks.4.proj.1.weight", "blocks.4.proj.1.bias", "blocks.5.m.0.weight", "blocks.5.m.0.bias", "blocks.5.m.2.weight", "blocks.5.m.2.bias", "blocks.6.m.qkv.weight", "blocks.6.m.qkv.bias", "blocks.6.m.proj.1.weight", "blocks.6.m.proj.1.bias", "blocks.7.m.0.weight", "blocks.7.m.0.bias", "blocks.7.m.2.weight", "blocks.7.m.2.bias", "blocks.8.m.qkv.weight", "blocks.8.m.qkv.bias", "blocks.8.m.proj.1.weight", "blocks.8.m.proj.1.bias", "blocks.9.m.0.weight", "blocks.9.m.0.bias", "blocks.9.m.2.weight", "blocks.9.m.2.bias", "blocks.10.m.qkv.weight", "blocks.10.m.qkv.bias", "blocks.10.m.proj.1.weight", "blocks.10.m.proj.1.bias", "blocks.11.m.0.weight", "blocks.11.m.0.bias", "blocks.11.m.2.weight", "blocks.11.m.2.bias", "blocks.12.kv.weight", "blocks.12.kv.bias", "blocks.12.q.1.weight", "blocks.12.q.1.bias", "blocks.12.proj.1.weight", "blocks.12.proj.1.bias", "blocks.13.m.0.weight", "blocks.13.m.0.bias", "blocks.13.m.2.weight", "blocks.13.m.2.bias", "blocks.14.m.qkv.weight", "blocks.14.m.qkv.bias", "blocks.14.m.proj.1.weight", "blocks.14.m.proj.1.bias", "blocks.15.m.0.weight", "blocks.15.m.0.bias", "blocks.15.m.2.weight", "blocks.15.m.2.bias", "blocks.16.m.qkv.weight", "blocks.16.m.qkv.bias", "blocks.16.m.proj.1.weight", "blocks.16.m.proj.1.bias", "blocks.17.m.0.weight", "blocks.17.m.0.bias", "blocks.17.m.2.weight", "blocks.17.m.2.bias", "blocks.18.m.qkv.weight", "blocks.18.m.qkv.bias", "blocks.18.m.proj.1.weight", "blocks.18.m.proj.1.bias", "blocks.19.m.0.weight", "blocks.19.m.0.bias", "blocks.19.m.2.weight", "blocks.19.m.2.bias", "blocks.20.m.qkv.weight", "blocks.20.m.qkv.bias", "blocks.20.m.proj.1.weight", "blocks.20.m.proj.1.bias", "blocks.21.m.0.weight", "blocks.21.m.0.bias", "blocks.21.m.2.weight", "blocks.21.m.2.bias", "head.weight", "head.bias".

    Unexpected key(s) in state_dict: "patch_embed.0.c.weight", "patch_embed.0.bn.weight", "patch_embed.0.bn.bias", "patch_embed.0.bn.running_mean", "patch_embed.0.bn.running_var", "patch_embed.0.bn.num_batches_tracked", "patch_embed.2.c.weight", "patch_embed.2.bn.weight", "patch_embed.2.bn.bias", "patch_embed.2.bn.running_mean", "patch_embed.2.bn.running_var", "patch_embed.2.bn.num_batches_tracked", "patch_embed.4.c.weight", "patch_embed.4.bn.weight", "patch_embed.4.bn.bias", "patch_embed.4.bn.running_mean", "patch_embed.4.bn.running_var", "patch_embed.4.bn.num_batches_tracked", "patch_embed.6.c.weight", "patch_embed.6.bn.weight", "patch_embed.6.bn.bias", "patch_embed.6.bn.running_mean", "patch_embed.6.bn.running_var", "patch_embed.6.bn.num_batches_tracked", "blocks.0.m.qkv.c.weight", "blocks.0.m.qkv.bn.weight", "blocks.0.m.qkv.bn.bias", "blocks.0.m.qkv.bn.running_mean", "blocks.0.m.qkv.bn.running_var", "blocks.0.m.qkv.bn.num_batches_tracked", "blocks.0.m.proj.1.c.weight", "blocks.0.m.proj.1.bn.weight", "blocks.0.m.proj.1.bn.bias", "blocks.0.m.proj.1.bn.running_mean", "blocks.0.m.proj.1.bn.running_var", "blocks.0.m.proj.1.bn.num_batches_tracked", "blocks.1.m.0.c.weight", "blocks.1.m.0.bn.weight", "blocks.1.m.0.bn.bias", "blocks.1.m.0.bn.running_mean", "blocks.1.m.0.bn.running_var", "blocks.1.m.0.bn.num_batches_tracked", "blocks.1.m.2.c.weight", "blocks.1.m.2.bn.weight", "blocks.1.m.2.bn.bias", "blocks.1.m.2.bn.running_mean", "blocks.1.m.2.bn.running_var", "blocks.1.m.2.bn.num_batches_tracked", "blocks.2.m.qkv.c.weight", "blocks.2.m.qkv.bn.weight", "blocks.2.m.qkv.bn.bias", "blocks.2.m.qkv.bn.running_mean", "blocks.2.m.qkv.bn.running_var", "blocks.2.m.qkv.bn.num_batches_tracked", "blocks.2.m.proj.1.c.weight", "blocks.2.m.proj.1.bn.weight", "blocks.2.m.proj.1.bn.bias", "blocks.2.m.proj.1.bn.running_mean", "blocks.2.m.proj.1.bn.running_var", "blocks.2.m.proj.1.bn.num_batches_tracked", "blocks.3.m.0.c.weight", "blocks.3.m.0.bn.weight", "blocks.3.m.0.bn.bias", "blocks.3.m.0.bn.running_mean", "blocks.3.m.0.bn.running_var", "blocks.3.m.0.bn.num_batches_tracked", "blocks.3.m.2.c.weight", "blocks.3.m.2.bn.weight", "blocks.3.m.2.bn.bias", "blocks.3.m.2.bn.running_mean", "blocks.3.m.2.bn.running_var", "blocks.3.m.2.bn.num_batches_tracked", "blocks.4.kv.c.weight", "blocks.4.kv.bn.weight", "blocks.4.kv.bn.bias", "blocks.4.kv.bn.running_mean", "blocks.4.kv.bn.running_var", "blocks.4.kv.bn.num_batches_tracked", "blocks.4.q.1.c.weight", "blocks.4.q.1.bn.weight", "blocks.4.q.1.bn.bias", "blocks.4.q.1.bn.running_mean", "blocks.4.q.1.bn.running_var", "blocks.4.q.1.bn.num_batches_tracked", "blocks.4.proj.1.c.weight", "blocks.4.proj.1.bn.weight", "blocks.4.proj.1.bn.bias", "blocks.4.proj.1.bn.running_mean", "blocks.4.proj.1.bn.running_var", "blocks.4.proj.1.bn.num_batches_tracked", "blocks.5.m.0.c.weight", "blocks.5.m.0.bn.weight", "blocks.5.m.0.bn.bias", "blocks.5.m.0.bn.running_mean", "blocks.5.m.0.bn.running_var", "blocks.5.m.0.bn.num_batches_tracked", "blocks.5.m.2.c.weight", "blocks.5.m.2.bn.weight", "blocks.5.m.2.bn.bias", "blocks.5.m.2.bn.running_mean", "blocks.5.m.2.bn.running_var", "blocks.5.m.2.bn.num_batches_tracked", "blocks.6.m.qkv.c.weight", "blocks.6.m.qkv.bn.weight", "blocks.6.m.qkv.bn.bias", "blocks.6.m.qkv.bn.running_mean", "blocks.6.m.qkv.bn.running_var", "blocks.6.m.qkv.bn.num_batches_tracked", "blocks.6.m.proj.1.c.weight", "blocks.6.m.proj.1.bn.weight", "blocks.6.m.proj.1.bn.bias", "blocks.6.m.proj.1.bn.running_mean", "blocks.6.m.proj.1.bn.running_var", 
"blocks.6.m.proj.1.bn.num_batches_tracked", "blocks.7.m.0.c.weight", "blocks.7.m.0.bn.weight", "blocks.7.m.0.bn.bias", "blocks.7.m.0.bn.running_mean", "blocks.7.m.0.bn.running_var", "blocks.7.m.0.bn.num_batches_tracked", "blocks.7.m.2.c.weight", "blocks.7.m.2.bn.weight", "blocks.7.m.2.bn.bias", "blocks.7.m.2.bn.running_mean", "blocks.7.m.2.bn.running_var", "blocks.7.m.2.bn.num_batches_tracked", "blocks.8.m.qkv.c.weight", "blocks.8.m.qkv.bn.weight", "blocks.8.m.qkv.bn.bias", "blocks.8.m.qkv.bn.running_mean", "blocks.8.m.qkv.bn.running_var", "blocks.8.m.qkv.bn.num_batches_tracked", "blocks.8.m.proj.1.c.weight", "blocks.8.m.proj.1.bn.weight", "blocks.8.m.proj.1.bn.bias", "blocks.8.m.proj.1.bn.running_mean", "blocks.8.m.proj.1.bn.running_var", "blocks.8.m.proj.1.bn.num_batches_tracked", "blocks.9.m.0.c.weight", "blocks.9.m.0.bn.weight", "blocks.9.m.0.bn.bias", "blocks.9.m.0.bn.running_mean", "blocks.9.m.0.bn.running_var", "blocks.9.m.0.bn.num_batches_tracked", "blocks.9.m.2.c.weight", "blocks.9.m.2.bn.weight", "blocks.9.m.2.bn.bias", "blocks.9.m.2.bn.running_mean", "blocks.9.m.2.bn.running_var", "blocks.9.m.2.bn.num_batches_tracked", "blocks.10.m.qkv.c.weight", "blocks.10.m.qkv.bn.weight", "blocks.10.m.qkv.bn.bias", "blocks.10.m.qkv.bn.running_mean", "blocks.10.m.qkv.bn.running_var", "blocks.10.m.qkv.bn.num_batches_tracked", "blocks.10.m.proj.1.c.weight", "blocks.10.m.proj.1.bn.weight", "blocks.10.m.proj.1.bn.bias", "blocks.10.m.proj.1.bn.running_mean", "blocks.10.m.proj.1.bn.running_var", "blocks.10.m.proj.1.bn.num_batches_tracked", "blocks.11.m.0.c.weight", "blocks.11.m.0.bn.weight", "blocks.11.m.0.bn.bias", "blocks.11.m.0.bn.running_mean", "blocks.11.m.0.bn.running_var", "blocks.11.m.0.bn.num_batches_tracked", "blocks.11.m.2.c.weight", "blocks.11.m.2.bn.weight", "blocks.11.m.2.bn.bias", "blocks.11.m.2.bn.running_mean", "blocks.11.m.2.bn.running_var", "blocks.11.m.2.bn.num_batches_tracked", "blocks.12.kv.c.weight", "blocks.12.kv.bn.weight", "blocks.12.kv.bn.bias", "blocks.12.kv.bn.running_mean", "blocks.12.kv.bn.running_var", "blocks.12.kv.bn.num_batches_tracked", "blocks.12.q.1.c.weight", "blocks.12.q.1.bn.weight", "blocks.12.q.1.bn.bias", "blocks.12.q.1.bn.running_mean", "blocks.12.q.1.bn.running_var", "blocks.12.q.1.bn.num_batches_tracked", "blocks.12.proj.1.c.weight", "blocks.12.proj.1.bn.weight", "blocks.12.proj.1.bn.bias", "blocks.12.proj.1.bn.running_mean", "blocks.12.proj.1.bn.running_var", "blocks.12.proj.1.bn.num_batches_tracked", "blocks.13.m.0.c.weight", "blocks.13.m.0.bn.weight", "blocks.13.m.0.bn.bias", "blocks.13.m.0.bn.running_mean", "blocks.13.m.0.bn.running_var", "blocks.13.m.0.bn.num_batches_tracked", "blocks.13.m.2.c.weight", "blocks.13.m.2.bn.weight", "blocks.13.m.2.bn.bias", "blocks.13.m.2.bn.running_mean", "blocks.13.m.2.bn.running_var", "blocks.13.m.2.bn.num_batches_tracked", "blocks.14.m.qkv.c.weight", "blocks.14.m.qkv.bn.weight", "blocks.14.m.qkv.bn.bias", "blocks.14.m.qkv.bn.running_mean", "blocks.14.m.qkv.bn.running_var", "blocks.14.m.qkv.bn.num_batches_tracked", "blocks.14.m.proj.1.c.weight", "blocks.14.m.proj.1.bn.weight", "blocks.14.m.proj.1.bn.bias", "blocks.14.m.proj.1.bn.running_mean", "blocks.14.m.proj.1.bn.running_var", "blocks.14.m.proj.1.bn.num_batches_tracked", "blocks.15.m.0.c.weight", "blocks.15.m.0.bn.weight", "blocks.15.m.0.bn.bias", "blocks.15.m.0.bn.running_mean", "blocks.15.m.0.bn.running_var", "blocks.15.m.0.bn.num_batches_tracked", "blocks.15.m.2.c.weight", "blocks.15.m.2.bn.weight", "blocks.15.m.2.bn.bias", "blocks.15.m.2.bn.running_mean", 
"blocks.15.m.2.bn.running_var", "blocks.15.m.2.bn.num_batches_tracked", "blocks.16.m.qkv.c.weight", "blocks.16.m.qkv.bn.weight", "blocks.16.m.qkv.bn.bias", "blocks.16.m.qkv.bn.running_mean", "blocks.16.m.qkv.bn.running_var", "blocks.16.m.qkv.bn.num_batches_tracked", "blocks.16.m.proj.1.c.weight", "blocks.16.m.proj.1.bn.weight", "blocks.16.m.proj.1.bn.bias", "blocks.16.m.proj.1.bn.running_mean", "blocks.16.m.proj.1.bn.running_var", "blocks.16.m.proj.1.bn.num_batches_tracked", "blocks.17.m.0.c.weight", "blocks.17.m.0.bn.weight", "blocks.17.m.0.bn.bias", "blocks.17.m.0.bn.running_mean", "blocks.17.m.0.bn.running_var", "blocks.17.m.0.bn.num_batches_tracked", "blocks.17.m.2.c.weight", "blocks.17.m.2.bn.weight", "blocks.17.m.2.bn.bias", "blocks.17.m.2.bn.running_mean", "blocks.17.m.2.bn.running_var", "blocks.17.m.2.bn.num_batches_tracked", "blocks.18.m.qkv.c.weight", "blocks.18.m.qkv.bn.weight", "blocks.18.m.qkv.bn.bias", "blocks.18.m.qkv.bn.running_mean", "blocks.18.m.qkv.bn.running_var", "blocks.18.m.qkv.bn.num_batches_tracked", "blocks.18.m.proj.1.c.weight", "blocks.18.m.proj.1.bn.weight", "blocks.18.m.proj.1.bn.bias", "blocks.18.m.proj.1.bn.running_mean", "blocks.18.m.proj.1.bn.running_var", "blocks.18.m.proj.1.bn.num_batches_tracked", "blocks.19.m.0.c.weight", "blocks.19.m.0.bn.weight", "blocks.19.m.0.bn.bias", "blocks.19.m.0.bn.running_mean", "blocks.19.m.0.bn.running_var", "blocks.19.m.0.bn.num_batches_tracked", "blocks.19.m.2.c.weight", "blocks.19.m.2.bn.weight", "blocks.19.m.2.bn.bias", "blocks.19.m.2.bn.running_mean", "blocks.19.m.2.bn.running_var", "blocks.19.m.2.bn.num_batches_tracked", "blocks.20.m.qkv.c.weight", "blocks.20.m.qkv.bn.weight", "blocks.20.m.qkv.bn.bias", "blocks.20.m.qkv.bn.running_mean", "blocks.20.m.qkv.bn.running_var", "blocks.20.m.qkv.bn.num_batches_tracked", "blocks.20.m.proj.1.c.weight", "blocks.20.m.proj.1.bn.weight", "blocks.20.m.proj.1.bn.bias", "blocks.20.m.proj.1.bn.running_mean", "blocks.20.m.proj.1.bn.running_var", "blocks.20.m.proj.1.bn.num_batches_tracked", "blocks.21.m.0.c.weight", "blocks.21.m.0.bn.weight", "blocks.21.m.0.bn.bias", "blocks.21.m.0.bn.running_mean", "blocks.21.m.0.bn.running_var", "blocks.21.m.0.bn.num_batches_tracked", "blocks.21.m.2.c.weight", "blocks.21.m.2.bn.weight", "blocks.21.m.2.bn.bias", "blocks.21.m.2.bn.running_mean", "blocks.21.m.2.bn.running_var", "blocks.21.m.2.bn.num_batches_tracked", "head.bn.weight", "head.bn.bias", "head.bn.running_mean", "head.bn.running_var", "head.bn.num_batches_tracked", "head.l.weight", "head.l.bias". 

What am I missing here?

Update: the create_model function from timm has a fuse argument; setting it to False loads the model correctly. Is there a function to save the fused model, though?
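
A minimal loading sketch along those lines. Assumptions to verify against the code: importing levit registers the LeViT_* entry points with timm, the factory accepts the fuse keyword mentioned above, and main.py saves its checkpoints with the weights nested under a 'model' key (DeiT-style).

import torch
from timm import create_model
import levit  # noqa: F401 -- assumed to register the LeViT_* models with timm

# Build the unfused architecture so the BatchNorm keys in the checkpoint match.
model = create_model('LeViT_128S', num_classes=1000, pretrained=False, fuse=False)

ckpt = torch.load('checkpoint.pth', map_location='cpu')
state_dict = ckpt.get('model', ckpt)  # assumption: weights stored under a 'model' key
model.load_state_dict(state_dict)
model.eval()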

Why does LeViT need 1000 training epochs?

While other ViT models are trained for only 300 epochs, LeViT needs 1000 epochs, which adds a lot of training cost. I think this is unfair for comparison. What is the accuracy of LeViT at 300 epochs?

Question about training LeViT-256

Hi,

I am training LeViT-256. I use this command:
python3 -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model LeViT_256 --data-path /scratch/pytorch-image-models-master-mingqi/imagenet --output_dir /output --seed 0 --batch-size 512
But it raises this error:
[screenshot of the error]

It seems like a package version or Python version problem? Could anyone help me solve this?

Thanks

ResNet50+DeiT

How should the activation maps produced by the cropped ResNet-50 be changed to adapt them to DeiT? Could you release the code and pretrained model for this? Thanks a lot.

LeViT model settings for Cifar10

I am interested in whether there is any LeViT model setup you have tested on CIFAR-10. I would like to know the proper configuration of the ConvNet and attention blocks.
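
There is no CIFAR-10 recipe in this repository, but a minimal starting point is to keep the ImageNet architecture, swap the head via num_classes, and resize the 32x32 images to the 224x224 resolution the patch embedding expects. The sketch below is only that: the factory keywords follow their usage elsewhere on this page, and the hyperparameters are placeholders, not tested settings.

import torch
import torchvision
import torchvision.transforms as transforms
from levit import LeViT_128S  # repository's levit.py

transform = transforms.Compose([
    transforms.Resize(224),  # CIFAR-10 images are 32x32; LeViT expects 224x224
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10('data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

# distillation=False is an assumption here: without a teacher there is no second head to train.
model = LeViT_128S(num_classes=10, distillation=False)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.025)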

How to reproduce the CPU/Arm inference speed in Table 3?

I have tested DeiT/EfficientNet/LeViT on my iPhone. Their speeds are comparable when they have similar FLOPs, but in Table 3 EfficientNet is much slower than DeiT/LeViT. How can I reproduce the CPU/Arm inference speeds in Table 3?

Inference - different output when using different batch size

When performing inference with a pretrained model (in eval() mode), the same image may produce different logits when the batch size is changed.

Example code:

with torch.no_grad():
    x = torch.stack([
        torch.zeros((3,224,224)),
        torch.ones((3,224,224)),
        torch.ones((3,224,224)),
    ])
    model = LeViT_384(pretrained=True)
    model = model.eval()
    print('batch=1', model(x[:1])[0][:2].numpy())
    print('batch=3', model(x[:3])[0][:2].numpy())

Output: (only for the 1st sample, limited to 2 classes)

batch=1 [-0.3287484  -0.11664876]
batch=3 [-0.32874817 -0.11664899]

While the argmax may not be significantly affected, this inconsistency makes it difficult to perform gradient analysis.

I suspect that this is caused by some batch-normalization layers not honoring eval() mode.
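
One way to narrow this down is to repeat the comparison in double precision: if the gap collapses by several orders of magnitude, the difference is ordinary floating-point accumulation-order noise from batched matmuls and convolutions rather than a BatchNorm layer leaking batch statistics. A diagnostic sketch reusing the setup above (the from levit import line is an assumption about how the model is obtained):

import torch
from levit import LeViT_384

with torch.no_grad():
    x = torch.cat([torch.zeros(1, 3, 224, 224), torch.ones(2, 3, 224, 224)]).double()
    model = LeViT_384(pretrained=True).double().eval()
    out1 = model(x[:1])[0][:2]   # first sample, first two logits, batch size 1
    out3 = model(x[:3])[0][:2]   # same sample inside a batch of 3
    print((out1 - out3).abs().max())  # in float64 this is expected to be far smaller than the float32 gap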

LeViT-128S without distillation 100 epoch training reproduction on 1 GPU

Hello and thanks for the great paper and codebase!
I am trying to replicate the numbers reported in Table 5 of the paper, specifically the A4 model (without distillation), which is reported to achieve 69.7% top-1 accuracy. Would you have any hints as to how to replicate these numbers with only 1 GPU? Modifying the code and using gradient-accumulation techniques to replicate the 256 * 32 = 8192 batch size only reaches 63.9% top-1 accuracy.
Are there any other steps / tricks that I might be missing? Thanks!

inference speed gets faster from paper v1 to v2 on arxiv

Hi, thank you for your great work!

I am curious why the inference speeds listed in the second version of the paper on arXiv are much faster than in the first version. What approach did you take to improve the inference speed?

About the shape of attention_biases

Thanks for your work!

When I run the code, I get an error:
too many indices for tensor of dimension 2.

The error is in "mypath/levit.py", at the line self.attention_biases[:, self.attention_bias_idxs].

attention_biases shape: [4, 196]
attention_bias_idxs shape: [196, 196]
Is there something wrong with this code?

Question about training from Checkpoint.pth

Hi,

I use this command to train:
python3 -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --epochs 500 --model LeViT_256 --data-path /scratch/pytorch-image-models-master/imagenet --output_dir /scratch/LeViT-main/output
My workstation crashed after it finished 70 epochs of training for LeViT-256, so I want to restart from epoch 70. But when I use the same command, training starts again from epoch 0. Does anyone know how to restart from epoch 70? The checkpoint file is in the output/ directory.

Thanks
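
If main.py follows the DeiT training script it is derived from, checkpoints are not picked up automatically; you pass the saved checkpoint back explicitly with a --resume flag (the flag name is an assumption here, worth confirming with python main.py --help), e.g.:

python3 -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --epochs 500 --model LeViT_256 --data-path /scratch/pytorch-image-models-master/imagenet --output_dir /scratch/LeViT-main/output --resume /scratch/LeViT-main/output/checkpoint.pth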

Using LeViT model and pretrained weights for object detection on larger resolution

Hello,

Impressive work! I notice that the LeViT models are trained with 224x224 images. Recently I tried to use LeViT-256 (with your pretrained weights) as the backbone of my own object detection model. The input resolution of my model is 448x800, which means that after patch embedding the resolution becomes 28x50; in your model the resolution after patch embedding is 14x14. Therefore, when I try to load the pretrained weights, the shape of attention_bias_idxs doesn't match (weights: 14x14 by 14x14, model: 28x50 by 28x50). I came up with two possible compromises: 1. load the model weights without the attention biases (I trained my OD model on nuScenes and the results are not good); 2. load the model weights and perform nearest-neighbor interpolation on the attention biases, but logically this doesn't make much sense to me.

I wonder whether anyone has tried to use the LeViT model and pretrained weights at a resolution larger than 224x224. How should I solve the problem mentioned above? Or do the attention bias weights not matter much for model performance on other datasets? I hope someone can provide some hints. Thanks!

'NoneType' object has no attribute 'log_softmax'

I am using the standard loss function nn.CrossEntropyLoss(). It gives the following error; please let me know whether nn.CrossEntropyLoss() can be used.

Traceback (most recent call last):

  File "/raid/khawar/PycharmProjects/thesis/train.py", line 487, in <module>
    loss = LOSS(outputs, labels)
  File "/raid/khawar/anaconda3/envs/vision-transformer-pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/raid/khawar/anaconda3/envs/vision-transformer-pytorch/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1047, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/raid/khawar/anaconda3/envs/vision-transformer-pytorch/lib/python3.8/site-packages/torch/nn/functional.py", line 2693, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/raid/khawar/anaconda3/envs/vision-transformer-pytorch/lib/python3.8/site-packages/torch/nn/functional.py", line 1672, in log_softmax
    ret = input.log_softmax(dim)
AttributeError: 'NoneType' object has no attribute 'log_softmax'

Question about running the speed_test.py

Hi~
I want to run speed_test.py, but there is an error as follows:
q, k, v = qkv.view(B, N, self.num_heads,
shape '[2048, 50176, 4, -1]' is invalid for input of size 308281344
When I check the code, I find that it removes the batch norm of the model, and also removes the patch_embed. Therefore, the transformer blocks cannot reshape the input.

My question is: how can I fix this problem?
Also, when I delete the line that removes batch norm, the result for 'levit.LeViT_128S, 2048, 224' is 20761 images/s on an RTX 3090, which is a lot higher than what you reported (12880 images/s in Table 3). Is this result reasonable?
I am looking forward to your reply, thanks.

The specific setting (e.g., batch-size) to reproduce the inference speed in Tab.3?

In Table 3 of the paper, there are values indicating the inference speed of LeViT models, such as 12880 img/s for LeViT-128S and 9266 img/s for LeViT-128.

Would you please list the specific settings (e.g., the batch size and the type of GPU)? The same architecture can run at very different inference speeds under different settings.

levit vs levit_c

Could you please tell me what the difference is between levit and levit_c? levit_c is a bit slower with speed_test.py. What is its accuracy on ImageNet?

problem of inference precision

Thank you very much for open-sourcing this. When I reproduce the inference precision with the officially provided model, it is inconsistent with the numbers given in the README. What is the reason?

With LeViT-256 I get Acc@1 81.584 Acc@5 95.464 loss 0.745.

LeViT training and bench on GTSRB dataset

Hello

I'm trying to use your SOTA LeViT for GTSRB but ran into some problems when testing. The accuracy after testing 12K images of GTSRB was only 13.4%, at 347 FPS on a 3080 Ti. I believe your model could break every record in my survey, and training may be the main cause. I have tried levit.py and levit_c.py to load the model with only num_classes=43 for training. I also use the same training and testing method for GhostNet 1.0 and MobileNetV3-Large. Could you please point out what in my training code (below) makes my testbench not work with your model? Thank you in advance.

import torch
import torchvision
from torchvision import models
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torchsummary import summary
from utils import save_plots
from levit_c import LeViT_c_128S
mean=(0.485, 0.456, 0.406)
std=(0.229, 0.224, 0.225)
transform_train = transforms.Compose([
    transforms.Resize((224,224)),
    transforms.ToTensor(),
    transforms.Normalize(mean,std),
])
trainset = torchvision.datasets.GTSRB(root='data', download=False, transform=transform_train)   # download=True if you did not download yet
trainloader = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=8) 
model = LeViT_c_128S(num_classes=43)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Config Training HyperParameter
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
# Lists to keep track of losses and accuracies.
train_loss= []
train_acc= []
# Training
epochs = 50
model.train()
epoch_acc = 0
epoch_loss = 0
for epoch in range(epochs):
    print("\n Epoch: %d"%(epoch+1))
    sum_loss = 0.0
    correct = 0.0
    total =0.0
    for i, data in enumerate(trainloader,0):
        length = len(trainloader)
        inputs,labels = data
        inputs,labels = inputs.to(device),labels.to(device)
        optimizer.zero_grad()
        # forward+backward
        outputs, x = model(inputs)
        loss = criterion(outputs,labels)
        loss.backward()
        optimizer.step()
        # print loss and accuracy within each epoch
        sum_loss += loss.item()
        _, predicted = torch.max(outputs.data,1)
        total += labels.size(0)
        correct += predicted.eq(labels.data).cpu().sum()
        print("[epoch:%d, iter:%d] Loss: %.03F | Acc: %.3f%%"
              %(epoch+1, (i+1+epoch*length), sum_loss/(i+1), 100.*correct/total))
    scheduler.step()      # Adjust Learning Rate for next epoch
    epoch_loss = sum_loss/(i+1)
    epoch_acc = 100.*correct/total
    train_loss.append(epoch_loss)
    train_acc.append(epoch_acc)
#Display Training Result
model_name = "LeViT_128s"
save_plots(model_name, train_acc, train_loss)
print("Model: LeViT_128s")
print(f"Training Hyperparameter - Epochs: %s, Batch-size: 32, Learning-rate: 0.1, Optimizer: SGD, Momentum: 0.9 " % epochs)
print("[epoch:%d, iter:%d] Loss: %.03F | Acc: %.3f%%" %(epoch+1, (i+1+epoch*length), sum_loss/(i+1), 100.*correct/total))
print(f"Model was saved as %s.pth" % model_name)   
torch.save(model.state_dict(),'LeViT_128s.pth')

[training accuracy and loss plots]
