
attention-transfer's Introduction

Attention Transfer

PyTorch code for "Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer" https://arxiv.org/abs/1612.03928
Conference paper at ICLR2017: https://openreview.net/forum?id=Sks9_ajex

What's in this repo so far:

  • Activation-based AT code for CIFAR-10 experiments
  • Code for ImageNet experiments (ResNet-18-ResNet-34 student-teacher)
  • Jupyter notebook to visualize attention maps of ResNet-34: visualize-attention.ipynb

Coming:

  • grad-based AT
  • Scenes and CUB activation-based AT code

The code uses PyTorch (https://pytorch.org). Note that the original experiments were done using torch-autograd. We have so far verified that the CIFAR-10 experiments are exactly reproducible in PyTorch, and are in the process of doing the same for ImageNet (results there are currently very slightly worse in PyTorch, due to hyperparameters).

bibtex:

@inproceedings{Zagoruyko2017AT,
    author = {Sergey Zagoruyko and Nikos Komodakis},
    title = {Paying More Attention to Attention: Improving the Performance of
             Convolutional Neural Networks via Attention Transfer},
    booktitle = {ICLR},
    url = {https://arxiv.org/abs/1612.03928},
    year = {2017}}

Requirements

First install PyTorch, then install torchnet:

pip install git+https://github.com/pytorch/tnt.git@master

then install other Python packages:

pip install -r requirements.txt

Experiments

CIFAR-10

This section describes how to reproduce the results in Table 1 of the paper.

First, train teachers:

python cifar.py --save logs/resnet_40_1_teacher --depth 40 --width 1
python cifar.py --save logs/resnet_16_2_teacher --depth 16 --width 2
python cifar.py --save logs/resnet_40_2_teacher --depth 40 --width 2

To train with activation-based AT do:

python cifar.py --save logs/at_16_1_16_2 --teacher_id resnet_16_2_teacher --beta 1e+3

To train with KD:

python cifar.py --save logs/kd_16_1_16_2 --teacher_id resnet_16_2_teacher --alpha 0.9

We plan to add AT+KD with decaying beta to get the best knowledge transfer results soon.
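
In the meantime, here is a minimal sketch of how the two terms could be combined with a linearly decaying beta. The at helper mirrors utils.py; the KD term follows the usual Hinton formulation, and the decay schedule itself is an assumption, not the authors' recipe:

import torch.nn.functional as F

def at(x):
    # attention map as in utils.py: channel-wise mean of squared activations
    return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))

def kd_loss(y_s, y_t, labels, T=4, alpha=0.9):
    # Hinton-style KD: temperature-softened KL plus hard-label cross-entropy
    soft = F.kl_div(F.log_softmax(y_s / T, 1), F.softmax(y_t / T, 1),
                    reduction='batchmean') * (T * T)
    return alpha * soft + (1 - alpha) * F.cross_entropy(y_s, labels)

def at_kd_loss(y_s, y_t, g_s, g_t, labels, epoch, epochs, beta0=1e3):
    # linearly decaying beta -- an assumed schedule, not the authors' recipe
    beta = beta0 * (1 - epoch / float(epochs))
    at_terms = sum((at(s) - at(t)).pow(2).mean() for s, t in zip(g_s, g_t))
    return kd_loss(y_s, y_t, labels) + beta * at_terms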

ImageNet

Pretrained model

We provide a ResNet-18 model pretrained with activation-based AT:

Model                   val error (top-1, top-5)
ResNet-18               30.4, 10.8
ResNet-18-ResNet-34-AT  29.3, 10.0

Download link: https://s3.amazonaws.com/modelzoo-networks/resnet-18-at-export.pth

Model definition: https://github.com/szagoruyko/functional-zoo/blob/master/resnet-18-at-export.ipynb

Convergence plot:

Train from scratch

Download pretrained weights for ResNet-34 (see also functional-zoo for more information):

wget https://s3.amazonaws.com/modelzoo-networks/resnet-34-export.pth

Prepare the data following fb.resnet.torch and run training (e.g. using 2 GPUs):

python imagenet.py --imagenetpath ~/ILSVRC2012 --depth 18 --width 1 \
                   --teacher_params resnet-34-export.pth --gpu_id 0,1 --ngpu 2 \
                   --beta 1e+3


attention-transfer's Issues

how to do the interpolation?

Thank you for the source code for attention-transfer, but I am not familiar with PyTorch and I do not understand the interpolation's implementation. How does it work, and how is the interpolation done if the two feature maps' dimensions are not the same? Can you explain it clearly?
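
For what it's worth, a minimal sketch of one way to handle mismatched spatial sizes: bilinearly resize one feature map to the other's resolution with F.interpolate before computing the attention maps. This is an illustration only, not necessarily what the repo does; the shapes below are made up, and for the CIFAR and ImageNet pairs in the paper the matched groups already share spatial dimensions:

import torch
import torch.nn.functional as F

def at(x):
    # attention map as in utils.py: mean of squared activations over channels
    return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))

def at_loss_resized(x, y):
    # x, y: feature maps (N, C, H, W) whose spatial sizes may differ
    if x.shape[2:] != y.shape[2:]:
        # resize the student features to the teacher's resolution (assumption)
        x = F.interpolate(x, size=y.shape[2:], mode='bilinear',
                          align_corners=False)
    return (at(x) - at(y)).pow(2).mean()

s = torch.randn(4, 16, 16, 16)   # hypothetical student features
t = torch.randn(4, 64, 32, 32)   # hypothetical teacher features
print(at_loss_resized(s, t))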

question about "params.itervalues()"

(py35) user@user-ASUS:~/fzz/study/attention-transfer-master$ python cifar.py --save logs/resnet_40_1_teacher --depth 40 --width 1
parsed options: {'data_root': '.', 'dataset': 'CIFAR10', 'cuda': False, 'width': 1.0, 'lr': 0.1, 'alpha': 0, 'teacher_id': '', 'gpu_id': '0', 'lr_decay_ratio': 0.2, 'epoch_step': '[60,120,160]', 'beta': 0, 'batchSize': 128, 'ngpu': 1, 'depth': 40, 'randomcrop_pad': 4, 'optim_method': 'SGD', 'nthread': 4, 'weightDecay': 0.0005, 'resume': '', 'epochs': 200, 'save': 'logs/resnet_40_1_teacher', 'temperature': 4, 'dtype': 'float'}
Files already downloaded and verified
Files already downloaded and verified
Traceback (most recent call last):
File "cifar.py", line 331, in
main()
File "cifar.py", line 212, in main
optimizable = [v for v in params.itervalues() if v.requires_grad]
AttributeError: 'collections.OrderedDict' object has no attribute 'itervalues'

When I run cifar.py I get this error, and I can't find “itervalues” defined in any file.
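
itervalues is a Python 2 dict method that no longer exists in Python 3, and the traceback shows the script running under py35. A likely one-line fix at cifar.py line 212, assuming nothing else in the script is Python-2-only:

# Python 2: params.itervalues()  ->  Python 3: params.values()
optimizable = [v for v in params.values() if v.requires_grad]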

Any experiment results updated with AT on imagenet?

How is AT performed on other models? Are there any experimental results available?
Besides, I tried reproducing the ResNet-18 experiment on ImageNet and got no improvement after 100 epochs of training.
Thanks!

Setting of β

Hi.

In the paper, the authors say: "As for parameter β in eq. 2, it usually varies about 0.1, as we set it to 10^3 divided by number of elements in attention map and batch size for each layer."

But I am still confused. What does 10^3 mean, and how was 0.1 obtained?
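
One possible reading (my interpretation, not the authors' wording): the code's AT loss takes a mean over the attention-map elements and the batch, so a nominal beta of 10^3 corresponds to a per-element weight of roughly 10^3 / (elements x batch). With hypothetical but plausible numbers:

elements = 8 * 8   # hypothetical 8x8 attention map in the last CIFAR group
batch = 128        # default batch size
print(1e3 / (elements * batch))   # ~0.12, i.e. "about 0.1" as the paper says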

Question on Code

Thank you for your code; it's very helpful for studying computer vision.
Unfortunately I can't run it correctly, and I think it may be a matter of software versions.
So I'd like to ask which versions you used, e.g. of Python and OpenCV.
I am using Python 2.7 and OpenCV 3.2.0.

Thank you very much!

Table1: Experiments on CIFAR-10

Hello, I have a question!
This sentence "Error is computed as median of 5 runs with different seed." is mentioned in table 1 of your paper, how should I get the result of a single run while running the lab? Is it the accuracy of the model one the test set after the last epoch? And I run it 5 times , and then I take the median of them, right?

How should I calculate the results of the experiment? I was first exposed to deep learning paper reproduction.
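
A minimal sketch of that protocol as I read it (the error values below are placeholders, not real results):

from statistics import median

# hypothetical last-epoch test errors (%) from five runs with different seeds
final_test_errors = [8.81, 8.77, 8.93, 8.69, 8.85]
print(median(final_test_errors))   # the single number to report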

Attention map

Hello! Thanks for the work!
I have one more question: how can I get a visualization of the attention maps?
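
The notebook visualize-attention.ipynb in this repo covers this; below is a minimal sketch of the underlying idea, using random activations as a stand-in for features captured with a forward hook:

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def at(x):
    # attention map as in utils.py
    return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))

feat = torch.randn(1, 512, 7, 7)   # stand-in for hooked ResNet-34 activations
amap = at(feat).view(1, 1, 7, 7)
# upsample to image resolution so the map can be overlaid on the input
amap = F.interpolate(amap, size=(224, 224), mode='bilinear',
                     align_corners=False)[0, 0]
plt.imshow(amap.numpy(), cmap='jet')
plt.show()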

Question about KL_loss average

Hi, thanks for sharing the code. I have a question about the KL loss implementation. PyTorch's kl_div averages over both the batch and the class dimension, but the original knowledge distillation loss does not average over the class dimension. So I assume there is a bug here?
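
For reference, a minimal illustration of the two reductions in generic PyTorch (not a claim about which variant utils.py intends): 'mean' divides by batch x classes, while 'batchmean' divides by batch size only, which matches the usual KD formulation:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
log_p = F.log_softmax(torch.randn(4, 10), 1)   # hypothetical student log-probs
q = F.softmax(torch.randn(4, 10), 1)           # hypothetical teacher probs

print(F.kl_div(log_p, q, reduction='mean'))       # sum / (4 * 10)
print(F.kl_div(log_p, q, reduction='batchmean'))  # sum / 4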

KL div vs cross-entropy

The Hinton distillation paper states:
"The first objective function is the cross entropy with the soft targets and this cross entropy is computed using the same high temperature in the softmax of the distilled model as was used for generating the soft targets from the cumbersome model. The second objective function is the cross entropy with the correct labels."

In https://github.com/szagoruyko/attention-transfer/blob/master/utils.py#L13-L15, the first objective function is computed using kl_div, which is different from cross_entropy:
kl_div computes -Σ t·log(x/t)
cross_entropy computes -Σ t·log(x)
In general, cross_entropy = kl_div + entropy(t).

Did I misunderstand something, or did you use a slightly different loss in your implementation?
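
Numerically the two differ exactly by the entropy of the soft targets, which is constant with respect to the student, so the gradients coincide. A quick check with hypothetical logits and targets:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # hypothetical student logits
t = F.softmax(torch.randn(4, 10), 1)   # hypothetical soft targets

ce = -(t * F.log_softmax(logits, 1)).sum()
kl = F.kl_div(F.log_softmax(logits, 1), t, reduction='sum')
ent = -(t * t.log()).sum()
print(torch.allclose(ce, kl + ent))    # True: CE = KL + H(t)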

invalid variables

When I run cifar.py I get the error:
new() received an invalid combination of arguments - got (Tensor, int, int, int), but expected one of:

  • (torch.device device)
  • (tuple of ints size, torch.device device)
  • (torch.Storage storage)
  • (Tensor other)
  • (object data, torch.device device)

What is the problem?
Thanks for the answer!
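
This usually means the code was written against an old PyTorch: legacy constructor-style calls such as tensor.new(...) that mix a Tensor with integer sizes are rejected by newer releases with exactly this message. A hedged sketch of the usual migration; whether this is the exact offending line in cifar.py is an assumption:

import torch

x = torch.randn(2, 3)

# legacy idiom like x.new(other_tensor, 2, 3, 4) fails on newer PyTorch;
# allocate with matching dtype/device explicitly instead:
y = x.new_zeros(2, 3, 4)
z = torch.zeros(2, 3, 4, dtype=x.dtype, device=x.device)  # equivalent
print(y.shape, z.shape)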

Crossing computer vision boundaries

Hello!

First of all thanks for sharing the code! It's amazing!

I wanted to ask a (perhaps noobish) question; apologies for my ignorance if this is something obvious.

Do you see any relevance or advantage in applying attention transfer techniques to areas other than computer vision? I was mainly thinking of NLP. Could something like this help in cases where specific important parts of a text need to be identified, or in text summarization?

Thanks in advance for the great work!

Kind regards,
Theodore.

how to resolve this error

File "cifar.py", line 126
o = block(o, params, f'{base}.block{i} ', mode, stride if i == 0 else 1)
^
SyntaxError: invalid syntax
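
That caret points at an f-string, which requires Python 3.6+, so the likely fix is running the script under a newer Python rather than editing the code. For illustration, an equivalent without f-strings (a sketch under that assumption; run it on 3.6+ to see the equivalence):

base, i = 'group0', 1

name_f = f'{base}.block{i}'               # f-string form used in cifar.py (3.6+)
name_fmt = '{}.block{}'.format(base, i)   # equivalent on older Python
assert name_f == name_fmt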

Got error when using 2 GPUs.

I got an error when using 2 GPUs to train the model on ImageNet. I followed the steps in the README, but got the following error.
Traceback (most recent call last):
File "imagenet.py", line 340, in
main()
File "imagenet.py", line 336, in main
engine.train(h, iter_train, opt.epochs, optimizer)
File "/usr/local/lib/python3.6/site-packages/torchnet/engine/engine.py", line 63, in train
state['optimizer'].step(closure)
File "/usr/local/lib/python3.6/site-packages/torch/optim/sgd.py", line 80, in step
loss = closure()
File "/usr/local/lib/python3.6/site-packages/torchnet/engine/engine.py", line 52, in closure
loss, output = state['network'](state['sample'])
File "imagenet.py", line 265, in h
y_s, y_t, loss_groups = utils.data_parallel(f, inputs, params, mode, range(opt.ngpu))
File "/opt/ml/job/utils.py", line 64, in data_parallel
return gather(outputs, output_device)
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
return gather_map(outputs)
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in
ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
RuntimeError: dimension specified as 0 but tensor has no dimensions

Thank you if you have any solution to this.
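
This is the usual multi-GPU gather-of-a-scalar failure: each replica returns a 0-dim loss tensor, and gather needs at least one dimension to concatenate along. A hedged sketch of the common workaround; where exactly to apply it inside imagenet.py's h function is an assumption:

import torch

loss = torch.tensor(0.123)   # 0-dim scalar loss returned by one replica
loss = loss.unsqueeze(0)     # give it shape (1,) so gather can concatenate
# after gathering the per-GPU losses into shape (ngpu,), reduce to a scalar:
total = loss.mean()
print(total)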

Loss function problems

Hi,

thanks for your great work. I have a question:
Why does the implementation just square the activations and take the mean, instead of using the L2 norm described in the paper?


import torch.nn.functional as F

def at(x):
    # attention map: channel-wise mean of squared activations, flattened per
    # sample; F.normalize applies the L2 normalization from the paper
    return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))


def at_loss(x, y):
    # mean squared difference between the (already L2-normalized) maps
    return (at(x) - at(y)).pow(2).mean()

My Imagenet replication results are poor

Hello, first of all, thank you very much for your great work!

When reproducing the ImageNet results of your paper, I ran into several confusing problems, and my results are much worse than those reported.

  • First, the accuracy of the ResNet-34 teacher network mentioned in the paper differs from that of the ResNet-34 pretrained model you provide. I don't know whether this is why the student results are poor. Could you provide the ResNet-34 model used in the paper?

  • Second, for the ImageNet experiment you mention that the hyperparameters are the same as those used in the transfer experiments, but no specific values are given. What is the specific beta value, please?

Here are my reproduction results ("Imagenet_AT" is the run with beta set to 1000, which is much worse than the result in the paper; "Imagenet_AT2000" is the run after adjusting beta to 2000. Since this experiment is very computationally expensive, I stopped it after observing that the early results were very poor). [plot omitted]
Result in your paper: [table omitted]

AT+KD Code

Could you put the AT+KD code in this repository?
