
attention-transfer's Introduction

Attention Transfer

PyTorch code for "Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer" https://arxiv.org/abs/1612.03928
Conference paper at ICLR2017: https://openreview.net/forum?id=Sks9_ajex

What's in this repo so far:

  • Activation-based AT code for CIFAR-10 experiments
  • Code for ImageNet experiments (ResNet-18-ResNet-34 student-teacher)
  • Jupyter notebook to visualize attention maps of ResNet-34: visualize-attention.ipynb

Coming:

  • grad-based AT
  • Scenes and CUB activation-based AT code

The code uses PyTorch (https://pytorch.org). Note that the original experiments were done using torch-autograd. We have so far verified that the CIFAR-10 experiments are exactly reproducible in PyTorch, and are in the process of doing the same for ImageNet (results there are currently very slightly worse in PyTorch, due to hyperparameters).

bibtex:

@inproceedings{Zagoruyko2017AT,
    author = {Sergey Zagoruyko and Nikos Komodakis},
    title = {Paying More Attention to Attention: Improving the Performance of
             Convolutional Neural Networks via Attention Transfer},
    booktitle = {ICLR},
    url = {https://arxiv.org/abs/1612.03928},
    year = {2017}}

Requirements

First install PyTorch, then install torchnet:

pip install git+https://github.com/pytorch/tnt.git@master

then install other Python packages:

pip install -r requirements.txt

Experiments

CIFAR-10

This section describes how to reproduce the results in Table 1 of the paper.

First, train teachers:

python cifar.py --save logs/resnet_40_1_teacher --depth 40 --width 1
python cifar.py --save logs/resnet_16_2_teacher --depth 16 --width 2
python cifar.py --save logs/resnet_40_2_teacher --depth 40 --width 2

To train with activation-based AT do:

python cifar.py --save logs/at_16_1_16_2 --teacher_id resnet_16_2_teacher --beta 1e+3

To train with KD:

python cifar.py --save logs/kd_16_1_16_2 --teacher_id resnet_16_2_teacher --alpha 0.9

We plan to add AT+KD with decaying beta to get the best knowledge transfer results soon.
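
In the meantime, here is a minimal sketch of how the two terms could be combined with a linearly decaying beta. The at helper mirrors utils.py; the KD term follows the usual Hinton formulation, and the decay schedule itself is an assumption, not the authors' recipe:

import torch.nn.functional as F

def at(x):
    # attention map as in utils.py: channel-wise mean of squared activations
    return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))

def kd_loss(y_s, y_t, labels, T=4, alpha=0.9):
    # Hinton-style KD: temperature-softened KL plus hard-label cross-entropy
    soft = F.kl_div(F.log_softmax(y_s / T, 1), F.softmax(y_t / T, 1),
                    reduction='batchmean') * (T * T)
    return alpha * soft + (1 - alpha) * F.cross_entropy(y_s, labels)

def at_kd_loss(y_s, y_t, g_s, g_t, labels, epoch, epochs, beta0=1e3):
    # linearly decaying beta -- an assumed schedule, not the authors' recipe
    beta = beta0 * (1 - epoch / float(epochs))
    at_terms = sum((at(s) - at(t)).pow(2).mean() for s, t in zip(g_s, g_t))
    return kd_loss(y_s, y_t, labels) + beta * at_terms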

ImageNet

Pretrained model

We provide a ResNet-18 model pretrained with activation-based AT:

Model                   val error (top-1, top-5)
ResNet-18               30.4, 10.8
ResNet-18-ResNet-34-AT  29.3, 10.0

Download link: https://s3.amazonaws.com/modelzoo-networks/resnet-18-at-export.pth

Model definition: https://github.com/szagoruyko/functional-zoo/blob/master/resnet-18-at-export.ipynb

Convergence plot:

Train from scratch

Download pretrained weights for ResNet-34 (see also functional-zoo for more information):

wget https://s3.amazonaws.com/modelzoo-networks/resnet-34-export.pth

Prepare the data following fb.resnet.torch and run training (e.g. using 2 GPUs):

python imagenet.py --imagenetpath ~/ILSVRC2012 --depth 18 --width 1 \
                   --teacher_params resnet-34-export.pth --gpu_id 0,1 --ngpu 2 \
                   --beta 1e+3


attention-transfer's Issues

how to do the interpolation?

Thank you for the source code for attention-transfer, but I am not familiar with PyTorch and I do not understand the interpolation's implementation. How does it work, and how is the interpolation done if the two feature maps' dimensions are not the same? Can you explain it clearly?
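
For what it's worth, a minimal sketch of one way to handle mismatched spatial sizes: bilinearly resize one feature map to the other's resolution with F.interpolate before computing the attention maps. This is an illustration only, not necessarily what the repo does; the shapes below are made up, and for the CIFAR and ImageNet pairs in the paper the matched groups already share spatial dimensions:

import torch
import torch.nn.functional as F

def at(x):
    # attention map as in utils.py: mean of squared activations over channels
    return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))

def at_loss_resized(x, y):
    # x, y: feature maps (N, C, H, W) whose spatial sizes may differ
    if x.shape[2:] != y.shape[2:]:
        # resize the student features to the teacher's resolution (assumption)
        x = F.interpolate(x, size=y.shape[2:], mode='bilinear',
                          align_corners=False)
    return (at(x) - at(y)).pow(2).mean()

s = torch.randn(4, 16, 16, 16)   # hypothetical student features
t = torch.randn(4, 64, 32, 32)   # hypothetical teacher features
print(at_loss_resized(s, t))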

question about "params.itervalues()"

(py35) user@user-ASUS:~/fzz/study/attention-transfer-master$ python cifar.py --save logs/resnet_40_1_teacher --depth 40 --width 1
parsed options: {'data_root': '.', 'dataset': 'CIFAR10', 'cuda': False, 'width': 1.0, 'lr': 0.1, 'alpha': 0, 'teacher_id': '', 'gpu_id': '0', 'lr_decay_ratio': 0.2, 'epoch_step': '[60,120,160]', 'beta': 0, 'batchSize': 128, 'ngpu': 1, 'depth': 40, 'randomcrop_pad': 4, 'optim_method': 'SGD', 'nthread': 4, 'weightDecay': 0.0005, 'resume': '', 'epochs': 200, 'save': 'logs/resnet_40_1_teacher', 'temperature': 4, 'dtype': 'float'}
Files already downloaded and verified
Files already downloaded and verified
Traceback (most recent call last):
File "cifar.py", line 331, in
main()
File "cifar.py", line 212, in main
optimizable = [v for v in params.itervalues() if v.requires_grad]
AttributeError: 'collections.OrderedDict' object has no attribute 'itervalues'

When I run cifar.py I get this error, and I can't find “itervalues” defined in any file.
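
itervalues is a Python 2 dict method that no longer exists in Python 3, and the traceback shows the script running under py35. A likely one-line fix at cifar.py line 212, assuming nothing else in the script is Python-2-only:

# Python 2: params.itervalues()  ->  Python 3: params.values()
optimizable = [v for v in params.values() if v.requires_grad]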

Any experiment results updated with AT on imagenet?

How is AT performed on other models? Are there any experimental results available?
Besides, I tried reproducing the ResNet-18 experiment on ImageNet and got no improvement after 100 epochs of training.
Thanks!

Setting of β

Hi.

In the paper, the authors say: "As for parameter β in eq. 2, it usually varies about 0.1, as we set it to 10^3 divided by number of elements in attention map and batch size for each layer."

But I am still confused. What does 10^3 mean, and how was 0.1 obtained?
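
One possible reading (my interpretation, not the authors' wording): the code's AT loss takes a mean over the attention-map elements and the batch, so a nominal beta of 10^3 corresponds to a per-element weight of roughly 10^3 / (elements x batch). With hypothetical but plausible numbers:

elements = 8 * 8   # hypothetical 8x8 attention map in the last CIFAR group
batch = 128        # default batch size
print(1e3 / (elements * batch))   # ~0.12, i.e. "about 0.1" as the paper says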

Question on Code

Thank you for your code; it's very helpful for studying computer vision.
Unfortunately I can't run it correctly, and I think it may be a matter of software versions.
So I'd like to ask which versions you used, e.g. of Python and OpenCV.
I am using Python 2.7 and OpenCV 3.2.0.

Thank you very much!

Table1: Experiments on CIFAR-10

Hello, I have a question!
This sentence "Error is computed as median of 5 runs with different seed." is mentioned in table 1 of your paper, how should I get the result of a single run while running the lab? Is it the accuracy of the model one the test set after the last epoch? And I run it 5 times , and then I take the median of them, right?

How should I calculate the results of the experiment? I was first exposed to deep learning paper reproduction.
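
A minimal sketch of that protocol as I read it (the error values below are placeholders, not real results):

from statistics import median

# hypothetical last-epoch test errors (%) from five runs with different seeds
final_test_errors = [8.81, 8.77, 8.93, 8.69, 8.85]
print(median(final_test_errors))   # the single number to report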

Attention map

Hello! Thanks for the work!
I have one more question: how can I get a visualization of the attention maps?
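
The notebook visualize-attention.ipynb in this repo covers this; below is a minimal sketch of the underlying idea, using random activations as a stand-in for features captured with a forward hook:

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def at(x):
    # attention map as in utils.py
    return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))

feat = torch.randn(1, 512, 7, 7)   # stand-in for hooked ResNet-34 activations
amap = at(feat).view(1, 1, 7, 7)
# upsample to image resolution so the map can be overlaid on the input
amap = F.interpolate(amap, size=(224, 224), mode='bilinear',
                     align_corners=False)[0, 0]
plt.imshow(amap.numpy(), cmap='jet')
plt.show()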

Question about KL_loss average

Hi, thanks for sharing the code. I have a question about the KL loss implementation. PyTorch's kl_div averages over both the batch and the class dimension, but the original knowledge distillation loss does not average over the class dimension. So I assume there is a bug here?
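
For reference, a minimal illustration of the two reductions in generic PyTorch (not a claim about which variant utils.py intends): 'mean' divides by batch x classes, while 'batchmean' divides by batch size only, which matches the usual KD formulation:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
log_p = F.log_softmax(torch.randn(4, 10), 1)   # hypothetical student log-probs
q = F.softmax(torch.randn(4, 10), 1)           # hypothetical teacher probs

print(F.kl_div(log_p, q, reduction='mean'))       # sum / (4 * 10)
print(F.kl_div(log_p, q, reduction='batchmean'))  # sum / 4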

KL div vs cross-entropy

The Hinton distillation paper states:
"The first objective function is the cross entropy with the soft targets and this cross entropy is computed using the same high temperature in the softmax of the distilled model as was used for generating the soft targets from the cumbersome model. The second objective function is the cross entropy with the correct labels."

In https://github.com/szagoruyko/attention-transfer/blob/master/utils.py#L13-L15, the first objective function is computed using kl_div, which is different from cross_entropy:
kl_div computes -Σ t·log(x/t)
cross_entropy computes -Σ t·log(x)
In general, cross_entropy = kl_div + entropy(t).

Did I misunderstand something, or did you use a slightly different loss in your implementation?
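
Numerically the two differ exactly by the entropy of the soft targets, which is constant with respect to the student, so the gradients coincide. A quick check with hypothetical logits and targets:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # hypothetical student logits
t = F.softmax(torch.randn(4, 10), 1)   # hypothetical soft targets

ce = -(t * F.log_softmax(logits, 1)).sum()
kl = F.kl_div(F.log_softmax(logits, 1), t, reduction='sum')
ent = -(t * t.log()).sum()
print(torch.allclose(ce, kl + ent))    # True: CE = KL + H(t)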

invalid variables

When I run cifar.py I get the error:
new() received an invalid combination of arguments - got (Tensor, int, int, int), but expected one of:

  • (torch.device device)
  • (tuple of ints size, torch.device device)
  • (torch.Storage storage)
  • (Tensor other)
  • (object data, torch.device device)

What is the problem?
Thanks for the answer!
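
This usually means the code was written against an old PyTorch: legacy constructor-style calls such as tensor.new(...) that mix a Tensor with integer sizes are rejected by newer releases with exactly this message. A hedged sketch of the usual migration; whether this is the exact offending line in cifar.py is an assumption:

import torch

x = torch.randn(2, 3)

# legacy idiom like x.new(other_tensor, 2, 3, 4) fails on newer PyTorch;
# allocate with matching dtype/device explicitly instead:
y = x.new_zeros(2, 3, 4)
z = torch.zeros(2, 3, 4, dtype=x.dtype, device=x.device)  # equivalent
print(y.shape, z.shape)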

Crossing computer vision boundaries

Hello!

First of all thanks for sharing the code! It's amazing!

I wanted to ask a (perhaps noobish) question; apologies for my ignorance if this is something obvious.

Do you see any relevance or advantage in applying attention transfer techniques to areas other than computer vision? I was mainly thinking of NLP. Could something like this help in cases where specific important parts of a text need to be identified, or in text summarization?

Thanks in advance for the great work!

Kind regards,
Theodore.

how to resolve this error

File "cifar.py", line 126
o = block(o, params, f'{base}.block{i} ', mode, stride if i == 0 else 1)
^
SyntaxError: invalid syntax
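
That caret points at an f-string, which requires Python 3.6+, so the likely fix is running the script under a newer Python rather than editing the code. For illustration, an equivalent without f-strings (a sketch under that assumption; run it on 3.6+ to see the equivalence):

base, i = 'group0', 1

name_f = f'{base}.block{i}'               # f-string form used in cifar.py (3.6+)
name_fmt = '{}.block{}'.format(base, i)   # equivalent on older Python
assert name_f == name_fmt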

Got error when using 2 GPUs.

I got an error when using 2 GPUs to train the model on ImageNet. I followed the steps in the README, but got the following error.
Traceback (most recent call last):
File "imagenet.py", line 340, in
main()
File "imagenet.py", line 336, in main
engine.train(h, iter_train, opt.epochs, optimizer)
File "/usr/local/lib/python3.6/site-packages/torchnet/engine/engine.py", line 63, in train
state['optimizer'].step(closure)
File "/usr/local/lib/python3.6/site-packages/torch/optim/sgd.py", line 80, in step
loss = closure()
File "/usr/local/lib/python3.6/site-packages/torchnet/engine/engine.py", line 52, in closure
loss, output = state['network'](state['sample'])
File "imagenet.py", line 265, in h
y_s, y_t, loss_groups = utils.data_parallel(f, inputs, params, mode, range(opt.ngpu))
File "/opt/ml/job/utils.py", line 64, in data_parallel
return gather(outputs, output_device)
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
return gather_map(outputs)
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in
ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
RuntimeError: dimension specified as 0 but tensor has no dimensions

Thank you if you have any solution to this.
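
This is the usual multi-GPU gather-of-a-scalar failure: each replica returns a 0-dim loss tensor, and gather needs at least one dimension to concatenate along. A hedged sketch of the common workaround; where exactly to apply it inside imagenet.py's h function is an assumption:

import torch

loss = torch.tensor(0.123)   # 0-dim scalar loss returned by one replica
loss = loss.unsqueeze(0)     # give it shape (1,) so gather can concatenate
# after gathering the per-GPU losses into shape (ngpu,), reduce to a scalar:
total = loss.mean()
print(total)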

Loss function problems

Hi,

thanks for your great work. I have a question:
Why does the implementation just square the activations and take the mean, instead of using the L2 norm described in the paper?


import torch.nn.functional as F

def at(x):
    # attention map: channel-wise mean of squared activations, flattened per
    # sample; F.normalize applies the L2 normalization from the paper
    return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))


def at_loss(x, y):
    # mean squared difference between the (already L2-normalized) maps
    return (at(x) - at(y)).pow(2).mean()

My Imagenet replication results are poor

Hello, first of all, thank you very much for your great work!

When reproducing the ImageNet results of your paper, I ran into several confusing problems, and my results are much worse than those reported.

  • First, the accuracy of the ResNet-34 teacher network mentioned in the paper differs from that of the ResNet-34 pretrained model you provide. I don't know whether this is why the student results are poor. Could you provide the ResNet-34 model used in the paper?

  • Second, for the ImageNet experiment you mention that the hyperparameters are the same as those used in the transfer experiments, but no specific values are given. What is the specific beta value, please?

Here are my reproduction results ("Imagenet_AT" is the run with beta set to 1000, which is much worse than the result in the paper; "Imagenet_AT2000" is the run after adjusting beta to 2000. Since this experiment is very computationally expensive, I stopped it after observing that the early results were very poor). [plot omitted]
Result in your paper: [table omitted]

AT+KD Code

Could you put the AT+KD code in this repository?
