haitongli / knowledge-distillation-pytorch
A PyTorch implementation for exploring deep and shallow knowledge distillation (KD) experiments with flexibility
License: MIT License
My teacher model's accuracy is 99%, but when I try to distill knowledge, my student model's accuracy is under 10%. It seems the student model didn't learn any knowledge from the teacher model.
I use LeNet-5 as my student model, with alpha = 0.9 and temperature = 1.
Thanks for your help.
The teacher model's outputs are computed only once, before the training epochs. https://github.com/peterliht/knowledge-distillation-pytorch/blob/master/train.py#L277
This assumes the inputs are identical in every epoch. But the inputs differ from epoch to epoch because of the random transform operations, e.g. random crop and random horizontal flip.
I think the right way is to recalculate the teacher's outputs in each epoch; see the sketch below.
Is it a bug?
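For illustration, a minimal sketch of that approach (hedged: the function name train_kd_epoch and the loss_fn_kd signature are assumptions, not the repository's actual code). The teacher's outputs are recomputed on every batch inside the training loop, so random augmentations and shuffling can never desynchronize them from the student's inputs.

import torch

def train_kd_epoch(student, teacher, optimizer, loss_fn_kd, dataloader, params):
    student.train()
    teacher.eval()
    for data_batch, labels_batch in dataloader:
        if torch.cuda.is_available():
            data_batch = data_batch.cuda(non_blocking=True)
            labels_batch = labels_batch.cuda(non_blocking=True)
        with torch.no_grad():                       # teacher forward pass, no gradients
            teacher_outputs = teacher(data_batch)   # recomputed on this exact batch
        student_outputs = student(data_batch)
        loss = loss_fn_kd(student_outputs, labels_batch, teacher_outputs, params)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()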
Hi,
Very useful code and instructions! If I understand correctly, the teacher model shouldn't be updated with gradients; only the student model should compute gradients during the distillation process. I noticed that in the train_and_evaluate_kd()
function, the teacher model is set to eval() mode. But I think eval() only alters the behavior of dropout and BatchNorm; it doesn't stop gradient updates when loss.backward() is called. I think the teacher model's parameters should have requires_grad set to False.
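For illustration, a short hedged sketch of both safeguards, assuming the teacher is an ordinary nn.Module named teacher_model: eval() only changes dropout/BatchNorm behavior, while requires_grad = False and torch.no_grad() are what actually keep gradients away from the teacher.

import torch

teacher_model.eval()                        # fixes dropout / BatchNorm behavior only
for p in teacher_model.parameters():
    p.requires_grad = False                 # no gradients accumulate in the teacher

# Alternatively (or additionally), run the teacher forward pass outside autograd:
with torch.no_grad():
    teacher_outputs = teacher_model(data_batch)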
Hello @peterliht, the pre-trained teacher models are available, but do you have the corresponding student models (the 5-layer CNN, with teacher model ResNet-18 and dataset CIFAR-10) uploaded somewhere? If you could provide them, it would be a great help. Thanks.
2018-03-09 20:46:06,587:INFO: Loading the datasets...
2018-03-09 20:46:10,074:INFO: - done.
2018-03-09 20:46:10,078:INFO: Starting training for 30 epoch(s)
2018-03-09 20:51:27,485:INFO: Loading the datasets...
2018-03-09 20:51:30,918:INFO: - done.
2018-03-09 20:51:30,922:INFO: Starting training for 30 epoch(s)
2018-03-09 20:54:20,870:INFO: Loading the datasets...
2018-03-09 20:54:24,364:INFO: - done.
2018-03-09 20:54:24,368:INFO: Starting training for 30 epoch(s)
2018-03-09 20:54:24,368:INFO: Epoch 1/30
In the code, the dataloader 'shuffle' switch is set to True.
So the precomputed teacher outputs do not actually correspond to the right inputs.
The reason your temperature is larger than the original paper's setting (say T = 2) may be caused by KLDivLoss. You may try setting reduction="batchmean" in KLDivLoss. Just a guess, and others are welcome to discuss.
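For reference, a small sketch of the difference (this is standard nn.KLDivLoss behavior, not repository code): the default reduction='mean' averages over every element of the output tensor, while reduction='batchmean' sums and divides by the batch size only, which matches the mathematical KL divergence and changes the scale of the distillation term by roughly a factor of the number of classes.

import torch
import torch.nn as nn
import torch.nn.functional as F

T = 2.0
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)

log_p = F.log_softmax(student_logits / T, dim=1)   # KLDivLoss input must be log-probabilities
q = F.softmax(teacher_logits / T, dim=1)           # target must be probabilities

loss_mean = nn.KLDivLoss(reduction="mean")(log_p, q)            # divides by batch * classes
loss_batchmean = nn.KLDivLoss(reduction="batchmean")(log_p, q)  # divides by batch only
print(loss_mean.item(), loss_batchmean.item())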
Hi, I want to know whether knowledge distillation can be used for regression problems.
I have downloaded the .zip file from the Box folder, but it can't be unzipped successfully: after unzipping, the .tar file becomes a .tar.cpgz file. I have also tried unzipping it with 'unzip' and 'tar xvf' in the terminal on macOS, but failed.
Could you please send the Box folder file to my email? [email protected] Thank you!
How can I download the data in the Box folder on a server using Linux commands?
Hello peterliht,
I ran your code according to the instructions and did not modify any parameters, but found that the results vary greatly.
What parameters did you modify before releasing the code?
The following are the experimental results for resnet18:
python train.py --model_dir experiments/resnet18_distill/resnext_teacher
My experimental environment is:
python 3.5.2
pytorch 0.4.0
GPU TITAN Xp
Did you use FitNets for distilling the model?
FitNets: Hints for Thin Deep Nets
This is my situation.
I trained base_cnn in advance on the CIFAR-10 dataset to compare the performance of base_cnn and cnn_distill.
I also trained base_resnet18 as a teacher on the same dataset.
Lastly, I trained cnn_distill using the resnet18 teacher.
I got two accuracies in the respective metrics_val_best_weights.json files: 0.875 from base_cnn and 0.858 from cnn_distill.
It looks like base_cnn is better than cnn_distill.
I didn't change any parameters in base_cnn or cnn_distill, except for the augmentation value in base_cnn's params.json, which I changed from 'no' to 'yes'.
I think there would be no reason to use knowledge distillation if base_cnn has higher accuracy.
Please let me know where I went wrong.
Thanks for your time.
Hi @peterliht, thanks for your great work!
I am trying to train on my own dataset, but I get RuntimeError: size mismatch, m1: [2 x 2048], m2: [512 x 2].
I guess it is because my dataloader's shapes are different from yours; I am using this repo https://github.com/cs230-stanford/cs230-code-examples/tree/master/pytorch/vision to load my own data.
Please guide me on how to fix this issue.
Thanks in advance, and I appreciate your time.
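A guess at the cause, with a hedged sketch (assuming a torchvision ResNet is used as the backbone): m1: [2 x 2048] suggests the backbone produces 2048-dimensional features (e.g. ResNet-50 / ResNeXt), while the final linear layer was built for 512 inputs (the ResNet-18 feature size). Reading in_features from the existing fc layer avoids hard-coding either size; num_classes = 2 here is only an example for a two-class custom dataset.

import torch.nn as nn
import torchvision.models as models

num_classes = 2                                            # example: two-class custom dataset
model = models.resnet50(pretrained=True)                   # backbone with 2048-dim features before fc
model.fc = nn.Linear(model.fc.in_features, num_classes)    # 2048 -> num_classes, no hard-coded sizes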
I printed the first 32 labels of the train dataloader for the teacher net and got:
14, 8, 29, 67, 59, 49, 73, 25, 4, 76, 11, 25, 82, 6, 11, 47, 28, 43, 40, 49, 27, 92, 62, 37, 64, 22, 38, 90, 14, 16, 27, 92
while the first 32 labels of the train dataloader for the student net are:
86, 40, 14, 73, 50, 43, 40, 27, 1, 51, 11, 47, 32, 76, 28, 83, 32, 4, 52, 77, 3, 64, 24, 36, 80, 93, 96, 72, 26, 75, 47, 79
So it seems that the output indices of the teacher net and the student net are not the same at each batch.
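For illustration, a hedged sketch (plain recent PyTorch, not the repository's loaders) of why this happens and one way to check it: two independently shuffled DataLoaders draw different permutations, so teacher outputs precomputed in one order no longer line up with the student's batches. Giving both loaders identically seeded generators (or setting shuffle=False) makes the orders match, although randomized transforms would still differ, as noted in the issue above.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100).float().unsqueeze(1), torch.arange(100))

def first_labels(seed):
    g = torch.Generator().manual_seed(seed)
    loader = DataLoader(dataset, batch_size=32, shuffle=True, generator=g)
    _, labels = next(iter(loader))
    return labels

print(first_labels(0))   # order seen on the teacher side
print(first_labels(0))   # same seed -> same permutation
print(first_labels(1))   # different seed -> different permutation (the mismatch above)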
I modified the code, and I get an error. Does anybody have any idea why?
I am using CPU:
I have an error in this line:
---> 10 output_teacher_batch = teacher_model(data_batch).data().numpy()
TypeError: 'Tensor' object is not callable
Does anybody have an idea how to solve this?
def fetch_teacher_outputs(teacher_model, dataloader):
    teacher_model.eval()
    teacher_outputs = []
    for i, (data_batch, labels_batch) in enumerate(dataloader):
        if torch.cuda.is_available():
            data_batch, labels_batch = data_batch.cuda(async=True), labels_batch.cuda(async=True)
        data_batch, labels_batch = Variable(data_batch), Variable(labels_batch)
        output_teacher_batch = teacher_model(data_batch).data().numpy()   # <-- the line that raises the error
        teacher_outputs.append(output_teacher_batch)
    return teacher_outputs
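A hedged guess at the fix: .data is an attribute rather than a method, so writing .data() tries to call a Tensor, which raises "'Tensor' object is not callable". A sketch of a CPU-safe rewrite using torch.no_grad() and .detach() instead of the older Variable/async style:

import torch

def fetch_teacher_outputs(teacher_model, dataloader):
    teacher_model.eval()
    teacher_outputs = []
    with torch.no_grad():                          # no autograd graph for the teacher
        for data_batch, labels_batch in dataloader:
            if torch.cuda.is_available():
                data_batch = data_batch.cuda(non_blocking=True)
            # .data (no parentheses) would also work, but .detach().cpu().numpy()
            # is safe on both CPU and GPU
            output_teacher_batch = teacher_model(data_batch).detach().cpu().numpy()
            teacher_outputs.append(output_teacher_batch)
    return teacher_outputs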
I can't open the URL to get the pretrained teacher model checkpoints. Can you offer another way?
Hi, this is the error I got while executing this command; could you please check it?
python3 train.py --model_dir experiments/resnet18_distill/resnext_teacher
Loading the datasets...
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
I suggest that both training loss functions, with and without KD, should add a softmax, because the models' outputs are raw values without softmax. Just like this:
https://github.com/peterliht/knowledge-distillation-pytorch/blob/e4c40132fed5a45e39a6ef7a77b15e5d389186f8/model/net.py#L100-L114
==>
KD_loss = nn.KLDivLoss()(F.log_softmax(outputs/T, dim=1),
                         F.softmax(teacher_outputs/T, dim=1)) * (alpha * T * T) + \
          F.cross_entropy(F.softmax(outputs, dim=1), labels) * (1. - alpha)
&
https://github.com/peterliht/knowledge-distillation-pytorch/blob/e4c40132fed5a45e39a6ef7a77b15e5d389186f8/model/net.py#L83-L97
==>
return nn.CrossEntropyLoss()(F.softmax(outputs,dim=1), labels)
Another thing: why is the first part of the KD loss function in distill_mnist.py multiplied by 2?
https://github.com/peterliht/knowledge-distillation-pytorch/blob/e4c40132fed5a45e39a6ef7a77b15e5d389186f8/mnist/distill_mnist.py#L96-L97
One more thing: it is not necessary to multiply by T*T if we distill using only soft targets.
https://github.com/peterliht/knowledge-distillation-pytorch/blob/e4c40132fed5a45e39a6ef7a77b15e5d389186f8/mnist/distill_mnist_unlabeled.py#L96-L97
Reference: Distilling the Knowledge in a Neural Network (Hinton et al.)
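For context, a minimal sketch of the loss as described in Distilling the Knowledge in a Neural Network, written with standard PyTorch calls rather than this repository's exact code: nn.KLDivLoss expects log-probabilities as its input and probabilities as its target, which is why the student side uses log_softmax and the teacher side uses softmax; F.cross_entropy takes raw logits; and the T*T factor keeps the soft-target gradients on the same scale as the hard-label term when both are combined.

import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.9, T=4.0):
    # Hinton-style distillation: weighted soft-target KL term plus hard-label CE term.
    soft_targets = F.softmax(teacher_logits / T, dim=1)       # teacher probabilities
    log_student = F.log_softmax(student_logits / T, dim=1)    # student log-probabilities
    kl = nn.KLDivLoss(reduction="batchmean")(log_student, soft_targets)
    ce = F.cross_entropy(student_logits, labels)              # cross-entropy on raw logits
    # Dividing logits by T shrinks the soft term's gradients by ~1/T^2, so multiply back by T*T.
    return alpha * (T * T) * kl + (1.0 - alpha) * ce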
I'm unable to run train.py on Python 3.9. The versions stated in the requirements are wrong, and after installing the newest libraries there are a bunch of syntax errors in the program. Is there an updated version available?
Why softmax for the teacher output, but log_softmax for the student output?
I can't download the Box folder. Could someone send these files to my mailbox? Thank you so much!
Why does
import pytorch_lightning as pl
show "No module named torch._dynamo"?
How can I solve it? Thanks.