ruizhoud / distributionloss Goto Github PK

View Code? Open in Web Editor NEW

31.0 3.0 6.0 14 KB

Source code for paper "Regularizing Activation Distribution for Training Binarized Deep Networks"

Python 100.00%

distributionloss's Introduction

Distribution_Loss

Source code for paper "Regularizing Activation Distribution for Training Binarized Deep Networks"

Code modified from the code for the original BNN paper.

Train BNN-DL for Alexnet ImageNet

CUDA_VISIBLE_DEVICES=0 python custom_main_binary_imagenet.py --seed 1 --model alexnet_binary_vs_xnor --batch-size 256 --batch_size_test 100 --infl_ratio 1 --distrloss 2. --distr_epoch 60 --epochs 65 --gpus 0

You can download checkpoint from this link. The top-1 accuracy is 47.9%, and top-5 accuracy is 71.9%.

distributionloss's People

Contributors

Stargazers

Watchers

Forkers

waterbearbee december-boy dilinwang820 lanie-yao lliai malamleh93

distributionloss's Issues

Request commit model for cifar10

Can you commit model for cifar10 described in the paper xC-xC-MP-2xC-2xC-MP-4xC-4xC-10C-GP?
I have some difficulties to reproduce result of the papers.

The link of checkpoint is missed

Hello! The link of checkpoint is missed, could you upload it again? Thanks very much！

Question about distribution loss

Hi Ruizhou,

Thanks for sharing your code!
When I read your code, there are some problems that bother me. I cannot understand the codes for distribution loss, because they are inconsistent with the description in the paper. In addition, the hyper-parameter is also inconsistent with that in the paper.

Cloud you help me and make some explanations about the distribution loss in the code?

Thanks!

Question About Gradient mismatch

Hi,
BNN_DL is very nice paper, solving the fundamental problem of BNN.
However, I have some ploblem understanding Degeneration, Saturation and Gradient mismatch.

According to what I understand,
Intuitively,

Degeneration Loss makes mean of activation to zero.
Saturation Loss reduces the number of activation with A>1 and A<-1

Then, What is the role of Gradient Mismatch Loss?
It is difficult to understand the meaning of formula [ReLU(1-|u|-k*sigma)]^2.
I think the only way avoiding gradient missmatch problem is changing activation function. (BNN+, Self-binarizing network)

Could you explain in more detail?

Thanks.

First activation function without quantization?

Hi Ruizhou,

Thanks for sharing your code!

While going through your code, I found you used LeakyReLU for the first activation function and didn't quantize its output. Therefore it seems the second convolution layer takes full precision input instead of binary input. Previous works (i.e. XNOR-Net, DoReFa-Net) quantize the first activation as well.

Have you tried to quantize the first activation layer too?

One question about Line 247 in custom_main_binary_imagenet.py

Hi Ruizhou,

This paper on weight regularization is quite interesting!

I have a question about the code here. After this line executed, all the weights are forced to be in [-1, 1]. If so, how could gradient mismatch be mitigated/solved?

Haichao

Question about the last layer

Hi Ruizhou,

Thanks for sharing your code.
I'm very impressed of your excellent work because it makes network tolerant of various types of optimizers and values of hyperparameters. I have a question about after the last layer whether there is Distrloss_layer or not.
In Cifar10 experiment, you use 7 Convolution layers(written as xC-xC-MP-2xC-2xC-MP-4xC-4xC-10C-GP) and I think the first layer is full precision convolution, whereas, the others are all binary convolutions including the last layer (10C). Is there a Distrloss_layer after the last binary convolution layer? I assume 10C-GP part as 10C-BN-Distrloss_layer-GP.

Implementation of [distrloss_layer.py]

[distrloss_layer.py]
distrloss1 = (torch.min(2 - mean - std, 2 + mean - std).clamp(min=0) ** 2).mean() + ((std - 4).clamp(min=0) ** 2).mean()
distrloss2 = (mean ** 2 - std ** 2).clamp(min=0).mean()

According to the paper, it seems to be
distrloss1:

Gradient Mismatch Loss: torch.min(2 - mean - std, 2 + mean - std).clamp(min=0) ** 2).mean()
Saturation Loss: (std - 4).clamp(min=0) ** 2

Q1. Why 2 & 4 instead of 1 which is described in the paper?

distrloss2:
Degeneration Loss: (mean ** 2 - std ** 2).clamp(min=0)

Q2. Why not (torch.abs(mean) - std).clamp(min=0) ** 2?

ruizhoud / distributionloss Goto Github PK

distributionloss's Introduction

Distribution_Loss

Train BNN-DL for Alexnet ImageNet

distributionloss's People

Contributors

Stargazers

Watchers

Forkers

distributionloss's Issues

Request commit model for cifar10

The link of checkpoint is missed

Question about distribution loss

Question About Gradient mismatch

First activation function without quantization?

One question about Line 247 in custom_main_binary_imagenet.py

Question about the last layer

Implementation of [distrloss_layer.py]

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent