eladhoffer / convnet.pytorch

ConvNet training using pytorch

License: MIT License

Python 100.00%

convnet.pytorch's Issues

AttributeError: 'Image' object has no attribute 'new'

Hi, why am I facing this error?
It's complaining about this line:
alpha = img.new().resize_(3).normal_(0, self.alphastd)

What's wrong here? I'm using PyTorch 0.4.
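
A likely cause, offered as an assumption rather than a confirmed diagnosis: `img.new()` is a tensor method, so a transform containing this line has to run after the image has been converted to a tensor (for example after `ToTensor()` in a torchvision pipeline), not on a PIL `Image`. The sketch below is a hypothetical stand-in for such a lighting transform, not the repository's code; the class name `TensorLighting` and its arguments are illustrative.

```python
import torch


class TensorLighting:
    """Hypothetical PCA-style lighting noise that only accepts tensors."""

    def __init__(self, alphastd, eigval, eigvec):
        self.alphastd = alphastd
        self.eigval = torch.as_tensor(eigval, dtype=torch.float32)  # shape (3,)
        self.eigvec = torch.as_tensor(eigvec, dtype=torch.float32)  # shape (3, 3)

    def __call__(self, img):
        # Fail early with a readable message instead of an AttributeError.
        if not torch.is_tensor(img):
            raise TypeError("expected a CxHxW tensor; apply this after ToTensor()")
        if self.alphastd == 0:
            return img
        # One random coefficient per principal component.
        alpha = torch.randn(3) * self.alphastd
        # Per-channel offset: eigvec @ (alpha * eigval), shape (3,).
        rgb = self.eigvec.matmul(alpha * self.eigval)
        return img + rgb.to(img.dtype).view(3, 1, 1)
```

With this ordering, a pipeline would convert the image to a tensor first and apply the lighting noise second.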

Stochastic quantization: difference between code and paper

In the paper, the stochastic quantization was done by rounding up with probability p=clip(0.5x, 0, 1), and rounding down with probability 1-p. However, in the code it's done by adding random uniform noise before quantization:
noise = output.new(output.shape).uniform_(-0.5, 0.5)
output.add_(noise)
This noise does not depend on the magnitude of x. What is the reasoning behind this discrepancy?
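
One observation that may be relevant: although the noise itself is independent of x, adding U(-0.5, 0.5) noise and then rounding to the nearest integer rounds up with probability equal to the fractional part of x, so the outcome still depends on x. The pure-Python sketch below compares the two formulations empirically. It assumes the quantizer rounds to the nearest integer after the noise is added, and it says nothing about whether this matches the exact probability used in the paper; the function names are illustrative.

```python
import math
import random


def noisy_round(x, rng):
    """Add U(-0.5, 0.5) noise, then round to the nearest integer."""
    return math.floor(x + rng.uniform(-0.5, 0.5) + 0.5)


def prob_round(x, rng):
    """Round up with probability equal to the fractional part of x."""
    low = math.floor(x)
    return low + (1 if rng.random() < x - low else 0)


def mean_of(fn, x, n=100_000, seed=0):
    """Empirical mean of a stochastic rounding function at x."""
    rng = random.Random(seed)
    return sum(fn(x, rng) for _ in range(n)) / n
```

Both `mean_of(noisy_round, 2.3)` and `mean_of(prob_round, 2.3)` should land close to 2.3, which is the sense in which the two schemes agree in expectation.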

RuntimeError: view size is not compatible with input tensor's size and stride

I have the following config file:

{
    "adapt_grad_norm": null,
    "autoaugment": false,
    "batch_size": 256,
    "chunk_batch": 1,
    "config_file": null,
    "cutmix": null,
    "cutout": false,
    "dataset": "imagenet",
    "datasets_dir": "~/data/",
    "device": "cuda",
    "device_ids": [
        0
    ],
    "dist_backend": "nccl",
    "dist_init": "env://",
    "distributed": false,
    "drop_optim_state": false,
    "dtype": "float",
    "duplicates": 1,
    "epochs": 90,
    "eval_batch_size": -1,
    "evaluate": null,
    "grad_clip": -1,
    "input_size": null,
    "label_smoothing": 0,
    "local_rank": -1,
    "loss_scale": 1,
    "lr": 0.1,
    "mixup": null,
    "model": "alexnet",
    "model_config": "",
    "momentum": 0.9,
    "optimizer": "SGD",
    "print_freq": 10,
    "results_dir": "./results",
    "resume": "",
    "save": "alexnet_unquant",
    "save_all": false,
    "seed": 123,
    "start_epoch": -1,
    "sync_bn": false,
    "tensorwatch": false,
    "tensorwatch_port": 0,
    "weight_decay": 0,
    "workers": 8,
    "world_size": -1
}

I get the following error:

Starting Epoch: 1

Traceback (most recent call last):
  File "main.py", line 364, in <module>
    main()
  File "main.py", line 130, in main
    main_worker(args)
  File "main.py", line 306, in main_worker
    train_results = trainer.train(train_data.get_loader(),
  File "/MyPath/convNet.pytorch/trainer.py", line 269, in train
    return self.forward(data_loader, training=True, average_output=average_output, chunk_batch=chunk_batch)
  File "/MyPath/convNet.pytorch/trainer.py", line 224, in forward
    prec1, prec5 = accuracy(output, target, topk=(1, 5))
  File "/MyPath/convNet.pytorch/utils/meters.py", line 70, in accuracy
    correct_k = correct[:k].view(-1).float().sum(0)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

I get the same error when I try the following command from the README:

python main.py --model resnet --model-config "{'depth': 18, 'quantize':True}" --save resnet18_8bit -b 64

How can I fix this?

Thanks.
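
The error message itself points at a possible fix: replacing `.view(-1)` with `.reshape(-1)`, which copies when the tensor is not contiguous. The sketch below is a top-k accuracy function of the same general shape as the one in the traceback, written under the assumption that `correct` is non-contiguous because it is derived from a transposed prediction tensor. It is not the repository's exact `utils/meters.py`.

```python
import torch


def accuracy(output, target, topk=(1,)):
    """Top-k accuracy in percent for each k in topk."""
    maxk = max(topk)
    batch_size = target.size(0)

    # Indices of the maxk largest logits per sample, shape (batch, maxk).
    _, pred = output.topk(maxk, 1, True, True)
    pred = pred.t()  # shape (maxk, batch); the transpose is not contiguous
    correct = pred.eq(target.view(1, -1).expand_as(pred))

    res = []
    for k in topk:
        # reshape works for both contiguous and non-contiguous inputs.
        correct_k = correct[:k].reshape(-1).float().sum(0)
        res.append(correct_k.mul(100.0 / batch_size))
    return res
```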

ResNext CIFAR10 crashing during layer construction

When running the command:
python3.6 main.py -b 32 --gpus 1 --model resnext --dataset cifar10

the following error occurs:

Traceback (most recent call last):
  File "main.py", line 306, in <module>
    main()
  File "main.py", line 194, in main
    train_loader, model, criterion, epoch, optimizer)
  File "main.py", line 295, in train
    training=True, optimizer=optimizer)
  File "main.py", line 254, in forward
    output=model(input_var)
  File "/usr/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/vista_fpga/pytorch_repos/elad_hofer_pytorch/convNet.pytorch/models/resnext_original.py", line 158, in forward
    x = self.layer1(x)
  File "/usr/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/lib64/python3.6/site-packages/torch/nn/modules/container.py", line 72, in forward
    input = module(input)
  File "/usr/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/vista_fpga/pytorch_repos/elad_hofer_pytorch/convNet.pytorch/models/resnext_original.py", line 56, in forward
    residual = self.downsample(x)
  File "/usr/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/vista_fpga/pytorch_repos/elad_hofer_pytorch/convNet.pytorch/models/resnext_original.py", line 119, in forward
    return torch.cat([ds, self.zero.expand(*zeros_size)], 1)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)

quantization

With torch 0.2.0, I get the following error:
Traceback (most recent call last):
  File "example/mpii.py", line 352, in <module>
    main(parser.parse_args())
  File "example/mpii.py", line 107, in main
    train_loss, train_acc = train(train_loader, model, criterion, optimizer, args.debug, args.flip)
  File "example/mpii.py", line 153, in train
    output = model(input_var)
  File "/home/wangmeng/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangmeng/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 71, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/wangmeng/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangmeng/pytorch-pose-quantized/pose/models/hourglass_quantized.py", line 172, in forward
    x = self.conv1(x)
  File "/home/wangmeng/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangmeng/pytorch-pose-quantized/pose/models/modules/quantize.py", line 188, in forward
    qinput = self.quantize_input(input)
  File "/home/wangmeng/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangmeng/pytorch-pose-quantized/pose/models/modules/quantize.py", line 165, in forward
    min_value * (1 - self.momentum))
TypeError: add_ received an invalid combination of arguments - got (Variable), but expected one of:
 * (float value)
      didn't match because some of the arguments have invalid types: (Variable)
 * (torch.cuda.FloatTensor other)
      didn't match because some of the arguments have invalid types: (Variable)
 * (torch.cuda.sparse.FloatTensor other)
      didn't match because some of the arguments have invalid types: (Variable)
 * (float value, torch.cuda.FloatTensor other)
 * (float value, torch.cuda.sparse.FloatTensor other)
Do I need to modify models/modules/quantize.py?

Have you tested your new transform function?

Nan loss for quantization

I'm running the code for 8-bit quantization, but the training loss always becomes NaN even though I made no modifications to the original code. Why could this happen? I'd appreciate any clarification.

main.py line 287 defaults = {**train_data_defaults} - SyntaxError: invalid syntax

Hi,

This is the output on command line:

python main.py --dataset cifar10 --model resnet --model-config "{'depth': 44}" --duplicates 32 --cutout -b 64 --epochs 100 --save resnet_44_cutout_m-32-new
File "main.py", line 287
defaults = {**train_data_defaults}
^
SyntaxError: invalid syntax
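
The `{**mapping}` unpacking syntax inside a dict display was added in Python 3.5 (PEP 448), so this error suggests that `python` is resolving to an older interpreter, possibly Python 2. That is an inference from the error message, not something the report confirms, and running the script with Python 3.5 or later is probably the simpler fix. For reference, the same shallow copy and merge can be written without the unpacking syntax as sketched below; the helper name and the dictionary contents are placeholders.

```python
def merge_defaults(base, overrides=None):
    """Return a shallow copy of base, optionally updated with overrides."""
    merged = dict(base)  # equivalent to {**base} on Python 3.5+
    if overrides:
        merged.update(overrides)
    return merged


# Placeholder values, not the repository's actual defaults.
train_data_defaults = {"augment": True, "input_size": 32}
defaults = merge_defaults(train_data_defaults)
```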

Is there any explanation of the scale_fix in RangeBN?

RangeBN applies a fix to the scale:
scale_fix = (0.5 * 0.35) * (1 + (math.pi * math.log(4)) ** 0.5) / ((2 * math.log(y.size(-1))) ** 0.5)

The (2 * ln(n)) ** 0.5 term is explained in the paper. However, where do the constants 0.5, 0.35, and pi * ln(4) come from?
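
The expression can be split into its factors and evaluated, which at least makes each constant's contribution visible. This sketch only restates the quoted line in plain Python; it does not explain where 0.5, 0.35, or pi * ln(4) come from, and the function name and the argument `n` (standing in for `y.size(-1)`) are illustrative.

```python
import math


def scale_fix(n):
    """Evaluate the RangeBN scale_fix expression for n values per chunk."""
    # Constant numerator: independent of the tensor size.
    constant = (0.5 * 0.35) * (1 + (math.pi * math.log(4)) ** 0.5)
    # Size-dependent denominator, the term the issue says is in the paper.
    range_term = (2 * math.log(n)) ** 0.5
    return constant / range_term
```

Calling `scale_fix(64)` and `scale_fix(256)` shows the correction shrinking as n grows, since only the denominator depends on n.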

TypeError: __init__() got an unexpected keyword argument 'reduction'

Hi,
When I try to run the script, I get this error. What's wrong?
TypeError: __init__() got an unexpected keyword argument 'reduction'
I'm calling the script like this:

IMAGENET_DIR=ImageNet_DataSet
MODEL_NAME=mobilenet
python main.py --dataset $IMAGENET_DIR --model $MODEL_NAME

I'm using Python 3.6 and PyTorch 0.4.
Thanks a lot in advance.

Why not use apex?

Why can't we use apex for FP16 training? Is there a bug with L1BN2D?

May I know the software versions

Hi, @eladhoffer

Thanks for the excellent project. I'm trying the code on my local machine, but some Python package dependencies seem to be unmet. Could you share the versions of the key software, such as PyTorch, Python, and the OS?

How to run the code in Colaboratory?

I'm trying to run main.py in Colab, and I executed the following command:

!python /content/convNet.pytorch/main.py --dataset cifar10 --model resnet --model-config "{'depth': 44}" --duplicates 40 --cutout -b 64 --epochs 100 --save resnet44_cutout_m-40

and I'm getting the following error:

Traceback (most recent call last):
  File "/content/convNet.pytorch/main.py", line 14, in <module>
    from data import DataRegime, SampledDataRegime
  File "/content/convNet.pytorch/data.py", line 9, in <module>
    from utils.dataset import IndexedFileDataset
  File "/content/convNet.pytorch/utils/dataset.py", line 6, in <module>
    from torch.utils.data.sampler import Sampler, RandomSampler, BatchSampler, _int_classes
ImportError: cannot import name '_int_classes' from 'torch.utils.data.sampler' (/usr/local/lib/python3.7/dist-packages/torch/utils/data/sampler.py)

How can I fix this issue?
What steps do I need to follow to be able to run the code in colab?
What changes do I need to make?

Quantizing Mobilenet

I am trying to quantize the MobileNet model in the same way you implemented ResNet (https://github.com/eladhoffer/convNet.pytorch). To do this, I added the following lines to models/mobilenet.py:

from .modules.quantize import QConv2d, QLinear, RangeBN
torch.nn.Linear = QLinear
torch.nn.Conv2d = QConv2d
torch.nn.BatchNorm2d = RangeBN

However, during training the loss goes to NaN. Any input on this would be a great help.

Thanks in advance

regards
Shreyas

trainer.validate will run full validation set on every GPU.

Example for efficient multi-GPU training of ResNet-50 (4 GPUs, label smoothing, fast regime by fast-ai):

python -m torch.distributed.launch --nproc_per_node=4  main.py --model resnet --model-config "{'depth': 50, 'regime': 'fast'}" --eval-batch-size 512 --save resnet50_fast --label-smoothing 0.1

I made some changes:

python -m torch.distributed.launch --nproc_per_node=8 main.py --model resnet --model-config "{'depth': 34, 'regime': 'fast'}" --batch-size 256 --eval-batch-size 512 --label-smoothing 0.1

The log shows:

TRAINING - Epoch: [15][10/625]	Time 0.810 (1.640)
EVALUATING - Epoch: [15][10/98]	Time 1.353 (3.035)

According to the following formulas:

1281167 / 256 = 5004.5, 5004.5 / 8 = 625.5
50000 / 512 = 97.6, 97.6 / 8 = 12.2

So validation steps should be 12 or 13, not 98.
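
The arithmetic above can be checked with a short sketch. It assumes, as the issue does, that `--batch-size` and `--eval-batch-size` are per-process values and that a sharded loader gives each of the 8 processes about 1/8 of the samples; the helper name `steps_per_epoch` and its parameters are illustrative, not part of the repository.

```python
import math


def steps_per_epoch(num_samples, batch_size, world_size=1, drop_last=False):
    """Number of iterations one process runs over its share of the data."""
    # Each process gets an equal share, padded up to a whole sample.
    per_rank = math.ceil(num_samples / world_size)
    if drop_last:
        return per_rank // batch_size
    return math.ceil(per_rank / batch_size)


print(steps_per_epoch(50000, 512))                        # 98: every process sees the full set
print(steps_per_epoch(50000, 512, world_size=8))          # 13: validation sharded over 8 processes
print(steps_per_epoch(1281167, 256, 8, drop_last=True))   # 625: matches the training log
```

Under these assumptions, the logged 98 evaluation steps are what an unsharded validation loader would produce, while a sharded one would run 13.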

Where is the mix&match code?

Is there any plan to release the code for the mix&match paper?

bug in vgg?

Hi,
Cool framework!
Note that the VGG class adds an AvgPool2d layer with kernel size 1.
This has essentially no effect. Perhaps you meant AdaptiveAvgPool2d?
In addition, the input size of the classification layer is usually 7 x 7 x 512, given an input of 224x224.
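
The classifier input size can be worked out directly. The sketch assumes the standard VGG layout, which is not stated in the issue: five 2x2 max-pooling stages with stride 2, padded 3x3 convolutions that preserve spatial size, and 512 channels in the last stage. The function name is illustrative.

```python
def vgg_classifier_input(input_size=224, num_pools=5, channels=512):
    """Flattened feature size entering the first fully connected layer."""
    spatial = input_size
    for _ in range(num_pools):
        spatial //= 2  # each 2x2, stride-2 pooling halves height and width
    return spatial * spatial * channels


print(vgg_classifier_input())  # 7 * 7 * 512 = 25088
```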

min/max calculation and quantization of the BN layer

I'm curious why you use chunk when calculating the min and max instead of calculating them directly. Also, can't we quantize the weights and bias of BN?