eladhoffer / convnet.pytorch
ConvNet training using pytorch
License: MIT License
Hi, why am I facing this error?
It's complaining about this line:
alpha = img.new().resize_(3).normal_(0, self.alphastd)
What's wrong here? I'm using PyTorch 0.4.
In the paper, stochastic quantization is done by rounding up with probability p=clip(0.5x, 0, 1) and rounding down with probability 1-p. In the code, however, it is done by adding uniform random noise before quantization:
noise = output.new(output.shape).uniform_(-0.5, 0.5)
output.add_(noise)
This noise does not depend on the magnitude of x. What is the reasoning behind this discrepancy?
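One way to see how the two schemes relate is a small simulation. The sketch below uses only the Python standard library rather than the repository's tensors, and `noisy_round` and `round_up_frequency` are illustrative names. It suggests that adding uniform noise in [-0.5, 0.5) before rounding to the nearest integer rounds up with a probability equal to the fractional part of x, so the noise is independent of x while the rounding probability is not. Whether this matches the paper's exact p depends on how x is scaled before quantization, which is not shown here.

```python
import math
import random


def noisy_round(x, rng):
    """Add uniform noise in [-0.5, 0.5) and round to the nearest integer."""
    noise = rng.uniform(-0.5, 0.5)
    # floor(v + 0.5) is round-half-up; it avoids Python's banker's rounding.
    return math.floor(x + noise + 0.5)


def round_up_frequency(x, trials=100_000, seed=0):
    """Empirical probability that noisy_round(x) lands on ceil(x)."""
    rng = random.Random(seed)
    ups = sum(noisy_round(x, rng) == math.ceil(x) for _ in range(trials))
    return ups / trials


if __name__ == "__main__":
    for x in (2.1, 2.3, 2.75):
        frac = x - math.floor(x)
        print(x, "fractional part:", round(frac, 2),
              "round-up frequency:", round_up_frequency(x))
```

In this simulation the expected value of the rounded result stays close to x, which is the usual motivation for stochastic rounding.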
I have the following config file:
{
"adapt_grad_norm": null,
"autoaugment": false,
"batch_size": 256,
"chunk_batch": 1,
"config_file": null,
"cutmix": null,
"cutout": false,
"dataset": "imagenet",
"datasets_dir": "~/data/",
"device": "cuda",
"device_ids": [
0
],
"dist_backend": "nccl",
"dist_init": "env://",
"distributed": false,
"drop_optim_state": false,
"dtype": "float",
"duplicates": 1,
"epochs": 90,
"eval_batch_size": -1,
"evaluate": null,
"grad_clip": -1,
"input_size": null,
"label_smoothing": 0,
"local_rank": -1,
"loss_scale": 1,
"lr": 0.1,
"mixup": null,
"model": "alexnet",
"model_config": "",
"momentum": 0.9,
"optimizer": "SGD",
"print_freq": 10,
"results_dir": "./results",
"resume": "",
"save": "alexnet_unquant",
"save_all": false,
"seed": 123,
"start_epoch": -1,
"sync_bn": false,
"tensorwatch": false,
"tensorwatch_port": 0,
"weight_decay": 0,
"workers": 8,
"world_size": -1
}
I get the following error:
Starting Epoch: 1
Traceback (most recent call last):
File "main.py", line 364, in <module>
main()
File "main.py", line 130, in main
main_worker(args)
File "main.py", line 306, in main_worker
train_results = trainer.train(train_data.get_loader(),
File "/MyPath/convNet.pytorch/trainer.py", line 269, in train
return self.forward(data_loader, training=True, average_output=average_output, chunk_batch=chunk_batch)
File "/MyPath/convNet.pytorch/trainer.py", line 224, in forward
prec1, prec5 = accuracy(output, target, topk=(1, 5))
File "/MyPath/convNet.pytorch/utils/meters.py", line 70, in accuracy
correct_k = correct[:k].view(-1).float().sum(0)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
I get the same error when I try the following command from README.md:
python main.py --model resnet --model-config "{'depth': 18, 'quantize':True}" --save resnet18_8bit -b 64
How can I fix this?
Thanks.
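The error message itself points at the usual remedy. Below is a minimal sketch of a top-k accuracy helper. It assumes the function in utils/meters.py follows the common pattern of transposing the predictions before comparing them with the targets. Only the failing line appears in the traceback, so the rest of the body is an assumption. A transposed tensor is not contiguous, so `view(-1)` on a slice of it can fail on recent PyTorch versions, whereas `reshape(-1)` copies when it has to.

```python
import torch


def accuracy(output, target, topk=(1,)):
    """Top-k accuracy in percent for each k in topk (sketch, not the repo's exact code)."""
    maxk = max(topk)
    batch_size = target.size(0)

    # Indices of the maxk highest scores per sample, shape (batch, maxk).
    _, pred = output.topk(maxk, 1, True, True)
    pred = pred.t()  # shape (maxk, batch); this transpose is non-contiguous
    correct = pred.eq(target.view(1, -1).expand_as(pred))

    results = []
    for k in topk:
        # reshape(-1) instead of view(-1): works for non-contiguous tensors too.
        correct_k = correct[:k].reshape(-1).float().sum(0)
        results.append(correct_k.mul_(100.0 / batch_size))
    return results
```

An equivalent change is `correct[:k].contiguous().view(-1)`. Either variant only touches the line named in the traceback.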
When running the command:
python3.6 main.py -b 32 --gpus 1 --model resnext --dataset cifar10
an error pops up:
Traceback (most recent call last):
File "main.py", line 306, in <module>
main()
File "main.py", line 194, in main
train_loader, model, criterion, epoch, optimizer)
File "main.py", line 295, in train
training=True, optimizer=optimizer)
File "main.py", line 254, in forward
output=model(input_var)
File "/usr/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/vista_fpga/pytorch_repos/elad_hofer_pytorch/convNet.pytorch/models/resnext_original.py", line 158, in forward
x = self.layer1(x)
File "/usr/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/usr/lib64/python3.6/site-packages/torch/nn/modules/container.py", line 72, in forward
input = module(input)
File "/usr/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/vista_fpga/pytorch_repos/elad_hofer_pytorch/convNet.pytorch/models/resnext_original.py", line 56, in forward
residual = self.downsample(x)
File "/usr/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/vista_fpga/pytorch_repos/elad_hofer_pytorch/convNet.pytorch/models/resnext_original.py", line 119, in forward
return torch.cat([ds, self.zero.expand(*zeros_size)], 1)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)
With torch 0.2.0, I get the following error:
Traceback (most recent call last):
File "example/mpii.py", line 352, in <module>
main(parser.parse_args())
File "example/mpii.py", line 107, in main
train_loss, train_acc = train(train_loader, model, criterion, optimizer, args.debug, args.flip)
File "example/mpii.py", line 153, in train
output = model(input_var)
File "/home/wangmeng/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangmeng/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 71, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/wangmeng/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangmeng/pytorch-pose-quantized/pose/models/hourglass_quantized.py", line 172, in forward
x = self.conv1(x)
File "/home/wangmeng/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangmeng/pytorch-pose-quantized/pose/models/modules/quantize.py", line 188, in forward
qinput = self.quantize_input(input)
File "/home/wangmeng/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangmeng/pytorch-pose-quantized/pose/models/modules/quantize.py", line 165, in forward
min_value * (1 - self.momentum))
TypeError: add_ received an invalid combination of arguments - got (Variable), but expected one of:
Do I need to modify models/modules/quantize.py or not?
I'm running the code for 8-bit quantization, but the training loss always becomes NaN even though I made no modifications to the original code. Why could this happen? I'd appreciate any clarification.
Hi,
This is the output on command line:
python main.py --dataset cifar10 --model resnet --model-config "{'depth': 44}" --duplicates 32 --cutout -b 64 --epochs 100 --save resnet_44_cutout_m-32-new
File "main.py", line 287
defaults = {**train_data_defaults}
^
SyntaxError: invalid syntax
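The `{**mapping}` form inside a dict display was introduced in Python 3.5 (PEP 448), so this SyntaxError usually means main.py was run by an older interpreter, for example when `python` resolves to Python 2. Running the script with a newer Python 3 interpreter is the simplest fix, and the rest of the code base may rely on other Python 3 features as well. Purely as an illustration, the same merge can be written without the unpacking syntax. `merge_defaults` and the sample keys below are hypothetical, not code from the repository.

```python
def merge_defaults(defaults, overrides=None):
    """Return a new dict with defaults updated by overrides, without {**d} syntax."""
    merged = dict(defaults)  # shallow copy, same result as {**defaults}
    if overrides:
        merged.update(overrides)
    return merged


if __name__ == "__main__":
    # Hypothetical values, used only to show the behaviour.
    train_data_defaults = {"datasets_path": "~/data/", "name": "cifar10"}
    print(merge_defaults(train_data_defaults, {"name": "imagenet"}))
```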
A fix for the scale is added in RangeBN:
scale_fix = (0.5 * 0.35) * (1 + (math.pi * math.log(4)) ** 0.5) / ((2 * math.log(y.size(-1))) ** 0.5)
The (2 * ln(n)) ** 0.5 term is explained in the paper. However, where do the 0.5, 0.35 and pi * ln(4) come from?
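To make the magnitude of the expression concrete, here is a small standard-library sketch that evaluates it for a given n (the size of the last dimension, `y.size(-1)` in the quoted line). It only reproduces the arithmetic and does not explain where the constants come from. The function and variable names are illustrative.

```python
import math


def scale_fix(n):
    """Evaluate the quoted scale_fix expression for a last-dimension size n."""
    constant = (0.5 * 0.35) * (1 + (math.pi * math.log(4)) ** 0.5)
    sqrt_2_ln_n = (2 * math.log(n)) ** 0.5  # the term explained in the paper
    return constant / sqrt_2_ln_n


if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(n, scale_fix(n))
```

The constant factor does not depend on n, so the whole expression shrinks slowly as n grows, following 1 / sqrt(2 ln n).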
Hi,
When I try to run the script, I get this error. What's wrong?
TypeError: __init__() got an unexpected keyword argument 'reduction'
I'm calling the script like this:
IMAGENET_DIR=ImageNet_DataSet
MODEL_NAME=mobilenet
python main.py --dataset $IMAGENET_DIR --model $MODEL_NAME
I'm using Python 3.6 and PyTorch 0.4.
Thanks a lot in advance.
Why can we use apex for FP16 training? Is there any bug in L1BN2D?
Hi, @eladhoffer
Thanks for the excellent project. I'm trying the code on my local machine, but some Python package dependencies do not seem to be met. Could you share the versions of the key software, such as PyTorch, Python and the OS?
I'm trying to run main.py in Colab and executed the following command:
!python /content/convNet.pytorch/main.py --dataset cifar10 --model resnet --model-config "{'depth': 44}" --duplicates 40 --cutout -b 64 --epochs 100 --save resnet44_cutout_m-40
and I'm getting the following error:
Traceback (most recent call last):
File "/content/convNet.pytorch/main.py", line 14, in <module>
from data import DataRegime, SampledDataRegime
File "/content/convNet.pytorch/data.py", line 9, in <module>
from utils.dataset import IndexedFileDataset
File "/content/convNet.pytorch/utils/dataset.py", line 6, in <module>
from torch.utils.data.sampler import Sampler, RandomSampler, BatchSampler, _int_classes
ImportError: cannot import name '_int_classes' from 'torch.utils.data.sampler' (/usr/local/lib/python3.7/dist-packages/torch/utils/data/sampler.py)
How can I fix this issue?
What steps do I need to follow to be able to run the code in colab?
What changes do I need to make?
I am trying to quantize the mobilenet model in the same way you implemented it for resnet (https://github.com/eladhoffer/convNet.pytorch). To accomplish this I added the following lines to models/mobilenet.py:
from .modules.quantize import QConv2d, QLinear, RangeBN
torch.nn.Linear = QLinear
torch.nn.Conv2d = QConv2d
torch.nn.BatchNorm2d = RangeBN
But during training the loss goes to NaN. Any input on this would be a great help.
Thanks in advance
regards
Shreyas
Example for efficient multi-GPU training of resnet50 (4 GPUs, label smoothing, fast regime by fast-ai):
python -m torch.distributed.launch --nproc_per_node=4 main.py --model resnet --model-config "{'depth': 50, 'regime': 'fast'}" --eval-batch-size 512 --save resnet50_fast --label-smoothing 0.1
I made some changes:
python -m torch.distributed.launch --nproc_per_node=8 main.py --model resnet --model-config "{'depth': 34, 'regime': 'fast'}" --batch-size 256 --eval-batch-size 512 --label-smoothing 0.1
The log shows:
TRAINING - Epoch: [15][10/625] Time 0.810 (1.640)
EVALUATING - Epoch: [15][10/98] Time 1.353 (3.035)
According to the following calculations:
1281167 / 256 = 5004.5, 5004.5 / 8 = 625.5
50000 / 512 = 97.6, 97.6 / 8 = 12.2
So the number of validation steps should be 12 or 13, not 98.
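The arithmetic can be sketched with a small helper. It assumes the batch size is per process and that a sharded loader gives each process roughly 1/world_size of the samples. `steps_per_epoch` is an illustrative function, not part of the repository.

```python
import math


def steps_per_epoch(num_samples, batch_size, world_size=1, drop_last=False):
    """Batches one process iterates over when data is split across world_size processes."""
    per_process = math.ceil(num_samples / world_size)
    if drop_last:
        return per_process // batch_size
    return math.ceil(per_process / batch_size)


if __name__ == "__main__":
    # Training set split over 8 processes, incomplete last batch dropped.
    print(steps_per_epoch(1281167, 256, world_size=8, drop_last=True))
    # Validation set split over 8 processes.
    print(steps_per_epoch(50000, 512, world_size=8))
    # Validation set not split at all.
    print(steps_per_epoch(50000, 512, world_size=1))
```

Under these assumptions the logged 625 training steps match the sharded case, while 98 evaluation steps match the unsharded case. That would be consistent with every process evaluating the full validation set, although the log alone does not confirm this.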
Is there any plan to release the code for the mix&match paper?
Hi,
Cool framework!
Note that you add an AvgPool2d layer with kernel=1 in the VGG class.
This basically has no effect. Perhaps you meant an adaptive average pool (AdaptiveAvgPool2d)?
In addition, the input to the classification layer is usually 7x7x512, given an input of 224x224.
I'm curious why you use chunks when calculating the min and max. Why not just calculate them directly? And can't we quantize the weights and bias of BN?
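The 7x7x512 figure follows from the standard VGG layout, where five 2x2 max-pooling stages each halve the spatial size and the last convolutional block has 512 channels. The sketch below assumes that layout. The repository's VGG variant may differ, and the function name is illustrative.

```python
def vgg_classifier_inputs(input_size=224, num_pools=5, channels=512):
    """Spatial side and flattened feature count entering the classifier."""
    side = input_size
    for _ in range(num_pools):
        side //= 2  # each 2x2 max-pool with stride 2 halves the resolution
    return side, side * side * channels


if __name__ == "__main__":
    print(vgg_classifier_inputs(224))  # ImageNet-sized input
    print(vgg_classifier_inputs(32))   # CIFAR-sized input
```

For a 32x32 input the same layout leaves a 1x1 feature map, so a kernel of size 1 is a no-op there, whereas an adaptive pool would also handle other input sizes.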