
mean-teacher's Introduction

Mean teachers are better role models

Paper ---- NIPS 2017 poster ---- NIPS 2017 spotlight slides ---- Blog post

By Antti Tarvainen, Harri Valpola (The Curious AI Company)

Approach

Mean Teacher is a simple method for semi-supervised learning. It consists of the following steps:

  1. Take a supervised architecture and make a copy of it. Let's call the original model the student and the new one the teacher.
  2. At each training step, use the same minibatch as inputs to both the student and the teacher but add random augmentation or noise to the inputs separately.
  3. Add an additional consistency cost between the student and teacher outputs (after softmax).
  4. Let the optimizer update the student weights normally.
  5. Let the teacher weights be an exponential moving average (EMA) of the student weights. That is, after each training step, update the teacher weights a little bit toward the student weights.
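To make step 5 concrete, here is a minimal PyTorch-style sketch of the EMA update (illustrative only; helper names such as update_teacher are ours, not the codebase's):

    import copy
    import torch
    import torch.nn as nn

    def update_teacher(student: nn.Module, teacher: nn.Module, alpha: float = 0.999) -> None:
        # After each optimizer step, nudge every teacher weight toward the
        # corresponding student weight: teacher = alpha*teacher + (1-alpha)*student.
        with torch.no_grad():
            for t_param, s_param in zip(teacher.parameters(), student.parameters()):
                t_param.mul_(alpha).add_(s_param, alpha=1 - alpha)

    student = nn.Linear(10, 2)
    teacher = copy.deepcopy(student)   # step 1: the teacher starts as a copy
    for p in teacher.parameters():
        p.requires_grad_(False)        # only the student is trained directly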

Our contribution is the last step. Laine and Aila [paper] used shared parameters between the student and the teacher, or used a temporal ensemble of teacher predictions. In comparison, Mean Teacher is more accurate and applicable to large datasets.

Mean Teacher model

Mean Teacher works well with modern architectures. Combining Mean Teacher with ResNets, we improved the state of the art in semi-supervised learning on the ImageNet and CIFAR-10 datasets.

ImageNet using 10% of the labels (top-5 validation error):

  Variational Auto-Encoder [paper]        35.42 ± 0.90
  Mean Teacher ResNet-152                  9.11 ± 0.12
  All labels, state of the art [paper]     3.79

CIFAR-10 using 4000 labels (test error):

  CT-GAN [paper]                           9.98 ± 0.21
  Mean Teacher ResNet-26                   6.28 ± 0.15
  All labels, state of the art [paper]     2.86

Implementation

There are two implementations, one for TensorFlow and one for PyTorch. The PyTorch version is probably easier to adapt to your needs, since it follows typical PyTorch idioms, and there's a natural place to add your model and dataset. Let me know if anything needs clarification.

Regarding the results in the paper, the experiments using a traditional ConvNet architecture were run with the TensorFlow version. The experiments using residual networks were run with the PyTorch version.

Tips for choosing hyperparameters and other tuning

Mean Teacher introduces two new hyperparameters: EMA decay rate and consistency cost weight. The optimal value for each of these depends on the dataset, the model, and the composition of the minibatches. You will also need to choose how to interleave unlabeled samples and labeled samples in minibatches.

Here are some rules of thumb to get you started:

  • If you are working on a new dataset, it may be easiest to start with only the labeled data and do pure supervised training. Then, when you are happy with the architecture and hyperparameters, add Mean Teacher. The same network should work well, although you may want to tone down regularization such as weight decay that you used with the small data.
  • Mean Teacher needs some noise in the model to work optimally. In practice, the best noise is probably random input augmentation. Use whatever relevant augmentations you can think of: the algorithm will train the model to be invariant to them.
  • It's useful to dedicate a portion of each minibatch to labeled examples. Then the supervised training signal is strong enough early on to train quickly and to prevent the model from getting stuck in uncertainty. In the PyTorch examples we use a quarter or a half of the minibatch for the labeled examples and the rest for the unlabeled ones. (See TwoStreamBatchSampler in the PyTorch code.)
  • For the EMA decay rate, 0.999 seems to be a good starting point.
  • You can use either MSE or KL-divergence as the consistency cost function (see the sketch after this list). For KL-divergence, a good consistency cost weight is often between 1.0 and 10.0. For MSE, it seems to be between the number of classes and the number of classes squared. On small datasets we saw MSE getting better results, but KL always worked pretty well too.
  • It may help to ramp up the consistency cost over the first few epochs, until the teacher network starts giving good predictions.
  • An additional trick we used in the PyTorch examples: have two separate logit layers at the top level, one for classification of labeled examples and one for predicting the teacher output, and then add an additional cost between the logits of these two predictions. The intent is the same as with the consistency cost ramp-up: in the beginning the teacher output may be wrong, so we loosen the link between the classification prediction and the consistency cost. (See the --logit-distance-cost argument in the PyTorch implementation, and the sketch after this list.)
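To tie the cost-related tips together, here is a minimal sketch of the consistency cost, its ramp-up, and the two-logit trick (illustrative only: sigmoid_rampup and the default weights below are assumptions in the spirit of the PyTorch code, not the exact implementation):

    import numpy as np
    import torch
    import torch.nn.functional as F

    def sigmoid_rampup(epoch: float, rampup_epochs: float) -> float:
        # Ramp the consistency weight smoothly from 0 to 1 over the first epochs.
        if rampup_epochs == 0:
            return 1.0
        phase = 1.0 - np.clip(epoch / rampup_epochs, 0.0, 1.0)
        return float(np.exp(-5.0 * phase * phase))

    def consistency_loss(student_logits, teacher_logits, kind="mse"):
        # Consistency cost between student and teacher predictions (after softmax).
        teacher_probs = F.softmax(teacher_logits.detach(), dim=1)  # no gradient into the teacher
        if kind == "mse":
            return F.mse_loss(F.softmax(student_logits, dim=1), teacher_probs)
        return F.kl_div(F.log_softmax(student_logits, dim=1), teacher_probs,
                        reduction="batchmean")

    def total_loss(class_logits, cons_logits, teacher_logits, targets, epoch,
                   consistency_weight=10.0, rampup_epochs=5, logit_distance_cost=0.01):
        # Supervised loss on labeled samples only (targets of -1 mark unlabeled ones).
        class_loss = F.cross_entropy(class_logits, targets, ignore_index=-1)
        # Ramped consistency cost between the student head and the teacher.
        cons = (sigmoid_rampup(epoch, rampup_epochs) * consistency_weight
                * consistency_loss(cons_logits, teacher_logits))
        # Loose coupling between the two logit heads (cf. --logit-distance-cost).
        res = logit_distance_cost * F.mse_loss(class_logits, cons_logits)
        return class_loss + cons + res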

mean-teacher's People

Contributors

alexvicegrab, filipgrano, perone, tarvaina, vikkamath


mean-teacher's Issues

AttributeError: 'DataFrame' object has no attribute 'to_msgpack'

Traceback (most recent call last):
File "main.py", line 423, in <module>
main(RunContext(__file__, 0))
File "main.py", line 104, in main
train(train_loader, model, ema_model, optimizer, epoch, training_log)
File "main.py", line 310, in train
**meters.sums()
File "/home/lbl/work/mean-teacher-master/pytorch/mean_teacher/run_context.py", line 34, in record
self._record(step, col_val_dict)
File "/home/lbl/work/mean-teacher-master/pytorch/mean_teacher/run_context.py", line 45, in _record
self.save()
File "/home/lbl/work/mean-teacher-master/pytorch/mean_teacher/run_context.py", line 38, in save
df.to_msgpack(self.log_file_path, compress='zlib')
File "/home/lbl/miniconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 5274, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'to_msgpack'

My pandas version is 1.0.1, and this function seems to have been removed. What can I do?
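One possible workaround (an assumption on my part, not an official fix) is to swap the removed msgpack serialization for pickle, which current pandas still supports:

    # Hedged sketch: pickle as a stand-in for the removed to_msgpack/read_msgpack.
    import pandas as pd

    df = pd.DataFrame({"step": [0, 100], "train/class_cost/1": [2.30, 1.87]})
    df.to_pickle("training.pkl")      # e.g. in run_context.py, instead of to_msgpack
    restored = pd.read_pickle("training.pkl")
    print(restored)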

ValueError: signal number 32 out of range

Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home//anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 139, in _serve
signal.pthread_sigmask(signal.SIG_BLOCK, range(1, signal.NSIG))
File "/home/anaconda3/lib/python3.6/signal.py", line 60, in pthread_sigmask
sigs_set = _signal.pthread_sigmask(how, mask)
ValueError: signal number 32 out of range

I want to know what version of python you are using.

what is the role of the export function?

def export(fn):
    # Look up the module in which the decorated function was defined.
    mod = sys.modules[fn.__module__]
    if hasattr(mod, '__all__'):
        # The module already declares a public API; append this function to it.
        mod.__all__.append(fn.__name__)
    else:
        # Otherwise create __all__ with this function as its first entry.
        mod.__all__ = [fn.__name__]
    return fn

I don't understand the purpose of this export function.

About the loss compute

Hi, I am wondering about this loss:
class_loss = class_criterion(class_logit, target_var) / minibatch_size
Since this loss ignores some samples (the unlabeled ones), why does it still divide by minibatch_size rather than by the number of labeled samples?

The accuracy of validation data is very low

Thank you for sharing this project.

I ran cifar10_test.py with the default configuration, and the accuracy on the training data is about 50% for top-1 and 90% for top-5.

However, the accuracy on the validation data is very low: 0.154% for both top-1 and top-5. Please have a look at the figure below.

[figure: training and validation accuracy curves]

Any suggestion is appreciated!

Applying the code to cifar-100

Hi,
I ran the code on cifar-10 and it worked great!
As this method is the state-of-the-art semi-supervised method for image recognition, I was curious to check its performance on CIFAR-100.
I changed the parts of the code that load CIFAR-10 to load CIFAR-100 instead (TensorFlow version). The code runs and prints training indications, but I get 98-100% error, so I guess I am doing something wrong.
An example of one of the output lines:

INFO:main:step 60: train/error/1: 100.0%, train/class_cost/1: nan, train/cons_cost/mt: nan

Do you have an idea what else I should do to apply it to CIFAR-100?

Missing data?

Hello! I tried using your code: I executed the TensorFlow-based script via python train_cifar10.py, but there's an error while looking for the data itself: FileNotFoundError: [Errno 2] No such file or directory: 'data\\images\\cifar\\cifar10\\cifar10_gcn_zca_v2.npz'. I looked through the repository and couldn't find it. Am I missing something, or is this data supposed to be created by the code?

How to unpack training.msgpack and show the training logs?

Thanks for your inspiring idea and the corresponding code.

I have run the CIFAR-10 experiments from your code on the AWS cloud.
After I trained the network, the data logs were saved in the cloud. I then downloaded the result files, such as training.msgpack, but I don't know how to unpack it to show the training logs.

I have googled and searched Stack Overflow, but I still haven't found a way to show the logs.

Would you please show me how to unpack the .msgpack file and show the logs?

Thanks.
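For reference, the logs were written with pandas' msgpack support, which existed before pandas 1.0. Assuming a matching pandas version is installed, a sketch of reading them back:

    # Hedged sketch: requires pandas < 1.0, where read_msgpack still existed
    # (msgpack support was deprecated in 0.25 and removed in 1.0).
    import pandas as pd

    df = pd.read_msgpack("training.msgpack")  # path to the downloaded log file
    print(df.head())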

Problems caused by version incompatibility

When I tried to run this code on CIFAR-10, I discovered several problems caused by version incompatibility: for example, the parameter 'async = true' cannot be set, and the function .to_msgpack() didn't work because msgpack support has been removed from recent pandas.
Could anyone please post a list of the versions of the packages used in the project (e.g. python, pandas, pytorch)? Thanks!

Gradients because of ema being dependent upon student variables

Great paper!

The TensorFlow documentation says the EMA variables are created with trainable=False and added to the GraphKeys.ALL_VARIABLES collection. As they are not trainable, they won't have gradients applied to them; I understand that. But since they depend on the current trainable variables of the graph, and hence so do the predictions of the teacher network, won't an additional gradient flow to the trainable variables because the EMA depends on them? Is this a correct understanding of the implementation?

is alpha set wrong?

In the paper, it is said that alpha should be 0.99 at the beginning (when global_step is small) and 0.999 at the end (when global_step is large). However, in the code:

alpha = min(1 - 1 / (global_step + 1), alpha)

Following this, alpha is 0 when global_step is small, and equals alpha (which is set to 0.99 by the parameters) once global_step is greater than 99. The code seems different from what the paper presented. The paper would suggest code like

alpha = max(1 - 1 / (global_step + 1), alpha)

Does anyone else see an issue here?

Pytorch run out of memory

Hi, thanks for continuing to push semi-supervised ML forward!

I have recently tried the PyTorch version, and apparently my GPU capacity is not enough (6 GB).
I had issues with TensorFlow as well, but I have seen that an older version of TensorFlow may help.

Which brings two suggestions:

  • Can you please dockerize your environment so it is easier to reproduce your results?
  • Can you add an MNIST example? I know it is boring, but it would most likely run even on a single GPU.

Questions about your code

Hello,
Why do the models have two FC layers and two outputs? I don't think it's necessary.
The consistency loss could also be calculated from class_logit and ema_logit.
What's the difference between class_logit and cons_logit?

SVHN - final accuracy

Hi, I ran your TensorFlow code (file train_svhn.py) and the final accuracy was only around 90%. I did not change anything in the code; I ran it as is! Do you have any suggestions why I do not get the expected 96%? By the way, I ran it on one GPU.

About EMA

Hi, I found that the teacher model's weights seem not to be updated, as it performs as badly as when it was first initialized.

alpha = min(1 - 1 / (global_step + 1), alpha)
for ema_param, param in zip(ema_model.parameters(), model.parameters()):
    ema_param.data.mul_(alpha).add_(1 - alpha, param.data)

Shouldn't this be ema_param.data.mul_(alpha).add_((1 - alpha) * param.data)?

Here are the parameters printed out during training:
('teacher_p: ', Parameter containing:
tensor([ 0.0007, -0.0006, 0.0046, -0.0033, 0.0004, 0.0262, 0.0153, -0.0259,
-0.0115, -0.0015, -0.0117, -0.0060, 0.0161, 0.0104, 0.0080, -0.0015,
-0.0116, -0.0160, 0.0247, -0.0227, 0.0077, 0.0052, 0.0217, 0.0111,
-0.0036, -0.0176, -0.0188, 0.0026, -0.0163, 0.0155],
device='cuda:0'))
('student_p: ', Parameter containing:
tensor([-0.0322, -0.0153, 0.0206, -0.0212, -0.0274, 0.0293, 0.0225, -0.0279,
-0.0272, -0.0282, -0.0272, -0.0261, 0.0275, 0.0261, 0.0274, -0.0251,
0.0014, -0.0285, 0.0296, -0.0296, 0.0105, -0.0209, 0.0123, 0.0227,
-0.0162, -0.0081, -0.0079, -0.0233, -0.0145, 0.0030],
device='cuda:0', requires_grad=True))
('(after) teacher_p: ', Parameter containing:
tensor([ 0.0007, -0.0006, 0.0046, -0.0033, 0.0004, 0.0262, 0.0153, -0.0259,
-0.0115, -0.0016, -0.0117, -0.0060, 0.0161, 0.0104, 0.0080, -0.0015,
-0.0116, -0.0160, 0.0247, -0.0227, 0.0077, 0.0052, 0.0217, 0.0111,
-0.0036, -0.0176, -0.0187, 0.0026, -0.0163, 0.0155],
device='cuda:0'))
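For what it's worth, the two spellings compute the same thing: in the PyTorch versions this code was written for, Tensor.add_(value, other) means self += value * other (the modern keyword form is add_(other, alpha=value)). A standalone check:

    import torch

    a = torch.ones(3)
    b = torch.full((3,), 2.0)

    x = a.clone().mul_(0.9).add_(0.1 * b)        # explicit multiply
    y = a.clone().mul_(0.9).add_(b, alpha=0.1)   # alpha form: adds alpha * b
    print(torch.allclose(x, y))                  # True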

Losses

Hi, thank you for your great project!

I’m stuck with two problems while trying to test the Mean Teacher idea as described in your NIPS 2017 presentation on the MNIST dataset, with a simple convnet from the official PyTorch examples, using your PyTorch code:

  1. Loss is defined as:

loss = class_loss + consistency_loss + res_loss

where

if args.consistency:
    (...)
    consistency_loss = consistency_weight * consistency_criterion(cons_logit, ema_logit) / minibatch_size
    (…)
else:
    consistency_loss = 0

but the default value of args.consistency is None, so consistency_loss = 0 by default.

Similarly,

if args.logit_distance_cost >= 0:
    (…)
else:
    (…)
    res_loss = 0

but args.logit_distance_cost = -1 by default.

So using the default values switches the mean teacher off, and just an ordinary supervised model remains? Should these losses be complementary or interchangeable?

  2. Training a mean teacher model on MNIST with some consistency weight, without res_loss, with fixed hyperparameters (https://github.com/rracinskij/mean_teacher/blob/master/mean_teacher.py) gives significantly lower test accuracy (~78% with 1000 labels) compared to setting the consistency weight to zero (~92%).

I’d greatly appreciate any comments or hints.

Does not work with TensorFlow versions >= 1.3

Thanks for your inspiring idea and the corresponding code.

I tried to run the TensorFlow code train_cifar10.py.
But it takes more than 2 hours to construct the computational graph, and I'm still stuck there.
The screen does not print anything.
If I replace the CNNs in the original code with a plain ResNet-32 (without Weight Normalization or other tricks), the whole thing runs fine.

Do you know what might be wrong?

Thanks.

question on your code

Hello!
There is one function that I don't understand in /mean_teacher/utils.py:

def export(fn):
    mod = sys.modules[fn.__module__]
    if hasattr(mod, '__all__'):
        mod.__all__.append(fn.__name__)
    else:
        mod.__all__ = [fn.__name__]
    return fn

Looking forward to your reply, thank you so much.

applying mean teacher to my own dataset

Hi, I have already achieved ~94% with 4000 labels on CIFAR-10. But for my own three-class classification task, where I have 160k labeled samples plus unlabeled data, I cannot get the expected results (worse than training directly on the labeled data). Is the learning-rate strategy sensitive here? Here is my setting (fine-tuning MobileNet-v1 on 4 GPUs). Thanks in advance.

defaults = {

    # Technical details
    'workers': 20,
    'checkpoint_epochs': 20,
    'evaluation_epochs': 5,

    # Data
    'dataset': 'my dataset',
    'train_subdir': 'train',
    'eval_subdir': 'test',

    # Data sampling
    'base_batch_size': 100,
    'base_labeled_batch_size': 50,

    # Architecture
    'arch': 'mnet1',

    # Costs
    'consistency_type': 'mse',
    'consistency_rampup': 5,
    'consistency': 20.0,
    'logit_distance_cost': .01,
    'weight_decay': 2e-4,

    # Optimization
    'lr_rampup': 0,
    'base_lr': 0.001,
    'nesterov': True,
}

what is ShiftConvDownsample in ResNext and shakeshake26

Hi, firstly thanks for your great work on SSL. But when I look at many ResNeXt nets in PyTorch, there is no ShiftConvDownsample layer. What is its function? And Mean Teacher didn't use this layer in the CIFAR-10 and ImageNet experiments, right? Also, do the two FC layers after average pooling correspond to the student and the teacher? Thanks in advance...

Adapting to Different Image Size

Hello,

Our team is interested in testing an implementation of the mean-teacher Resnet in PyTorch for a few image classification problems we are working on.

However, we are having difficulty adapting the network to our image dimensions.

If I resize our images to 32x32 it runs without error. But, if I change to something else, I get:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/data/nihxr/emo/mean-teacher/pytorch/experiments/rbc_test.py", line 76, in <module>
    run(**run_params)
  File "/mnt/data/nihxr/emo/mean-teacher/pytorch/experiments/rbc_test.py", line 71, in run
    main.main(context)
  File "/mnt/data/nihxr/emo/mean-teacher/pytorch/main.py", line 97, in main
    train(train_loader, model, ema_model, optimizer, epoch, training_log)
  File "/mnt/data/nihxr/emo/mean-teacher/pytorch/main.py", line 225, in train
    ema_model_out = ema_model(ema_input_var)
  File "/opt/conda/lib/python3.5/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 68, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 78, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
  File "/opt/conda/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.5/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/data/nihxr/emo/mean-teacher/pytorch/mean_teacher/architectures.py", line 158, in forward
    x = self.layer3(x)
  File "/opt/conda/lib/python3.5/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.5/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/opt/conda/lib/python3.5/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/data/nihxr/emo/mean-teacher/pytorch/mean_teacher/architectures.py", line 255, in forward
    residual = self.downsample(x)
  File "/opt/conda/lib/python3.5/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/data/nihxr/emo/mean-teacher/pytorch/mean_teacher/architectures.py", line 302, in forward
    x[:, :, 1::2, 1::2]), dim=1)
RuntimeError: inconsistent tensor sizes at /opt/conda/conda-bld/pytorch_1512382878663/work/torch/lib/THC/generic/THCTensorMath.cu:157

This makes sense. We're just a little unfamiliar with PyTorch and, speaking for myself, ResNet. So I thought I would post this question while looking into it, in case someone can offer a hint that is obvious to them but not easy for us to find.

Thank you in advance,
Tommy

Question about DataLoader

Hi! Awesome work by the way.
I am curious which part of your code produces (input, ema_input).
The TwoStreamBatchSampler seems to return a single batch (not a tuple of two batches). I am just wondering which part of the code makes the DataLoader return a tuple of (input, ema_input).
Looking forward to your reply.
Thanks in advance!
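If it helps later readers: the pairing typically comes from the dataset transform rather than the sampler. A minimal sketch in the spirit of the repository's TransformTwice wrapper (hedged; see mean_teacher/data.py for the actual implementation):

    # Used as the dataset transform, this makes each sample come out as two
    # independently augmented copies, which the DataLoader then collates
    # into the (input, ema_input) pair seen in the training loop.
    class TransformTwice:
        def __init__(self, transform):
            self.transform = transform

        def __call__(self, inp):
            # Two independent draws of the random augmentation.
            return self.transform(inp), self.transform(inp)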

Pytorch version

May I know which versions you were using?
The code is not compatible with the latest PyTorch. Thank you!

EMA of BatchNorm Layer

Hello
I have read the PyTorch code of mean-teacher, and in the update_ema_variables function only parameters are updated. But for BatchNorm layers, the running mean and variance buffers should also be tracked, in addition to the weight and bias parameters. In the current implementation, the BatchNorm layers of the EMA model may be using the default 0 mean and 1 variance... am I right?
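One common workaround (a hedged sketch, not code from the repository) is to copy the BatchNorm buffers alongside the parameter EMA:

    import torch

    def update_ema(model, ema_model, alpha):
        with torch.no_grad():
            # EMA over learnable parameters, as in update_ema_variables.
            for ema_p, p in zip(ema_model.parameters(), model.parameters()):
                ema_p.mul_(alpha).add_(p, alpha=1 - alpha)
            # Keep running_mean / running_var (and num_batches_tracked) in sync.
            for ema_b, b in zip(ema_model.buffers(), model.buffers()):
                ema_b.copy_(b)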

Unusual Test Loss of cifar_shakeshake26 trained on 1000-labeled dataset

I found that the loss plot of cifar_shakeshake26 trained on the 1000-label dataset is unusual (it is normal when trained on the 4000-label or 45000-label datasets). I made a 50-epoch training run; the training plots are below.

[figures: training and test loss/accuracy plots over 50 epochs]

Please take a look at the test loss plot: it starts from a small value but rises dramatically as the training epochs increase. This does not happen when training on the 4000-label or 45000-label datasets. You can reproduce this by running the following command:

python main.py \
    --dataset cifar10 \
    --labels data-local/labels/cifar10/1000_balanced_labels/00.txt \
    --arch cifar_shakeshake26 \
    --consistency 100.0 \
    --consistency-rampup 5 \
    --labeled-batch-size 62 \
    --epochs 50 \
    --lr-rampdown-epochs 210

I don't know how to explain this; I hope you guys can help. Thanks a lot :D

How to use pretrained ResNext152 model

Thanks for your code. I have to admit it's a wonderful strategy.
However, when I use this package on the action recognition dataset Stanford40, I encounter a loss explosion problem, so I am thinking about using a pre-trained model.
I decreased the number of classes from 40 to 10 and switched to fully supervised learning with exclude_unlabeled set to 'True'. I hope you have time to give a reply, even a little hint.
Here I print out the loss at each step until the loss explosion. The res loss increases like crazy.
AssertionError: Loss explosion: 226970.828125
0 batch
class Variable containing:
2.3374
const Variable containing:
1.00000e-02 *
2.4998
res Variable containing:
1.00000e-02 *
1.0730
1 batch
class Variable containing:
12.6847
const Variable containing:
275.5649
res Variable containing:
1.00000e+05 *
2.2668

RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor for argument #2 'other'

The results of running python main.py ...... are shown below:

Traceback (most recent call last):
File "main.py", line 424, in <module>
main(RunContext(__file__, 0))
File "main.py", line 104, in main
train(train_loader, model, ema_model, optimizer, epoch, training_log)
File "main.py", line 274, in train
meters.update('top1', prec1[0], labeled_minibatch_size)
File "/home/gzx/Meanteacher/mean-teacher/pytorch/mean_teacher/utils.py", line 53, in update
self.meters[name].update(value, n)
File "/home/gzx/Meanteacher/mean-teacher/pytorch/mean_teacher/utils.py", line 86, in update
self.sum = self.sum + (val * n)
RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor for argument #2 'other'

Is this because I installed PyTorch using conda? The PyTorch version is 0.4.1.

precision in test is zero all the time

Hello, I am reproducing your work as a baseline now.
I have come across a problem: the precision (top-1 & top-5) in the test process is always zero.
During training, the performance of the model is normal, like
"INFO:main:Epoch: [13][170/226] Time 0.329 (0.339) Data 0.000 (0.006) Class 0.2847 (0.2877) Cons 0.0917 (0.0787) Prec@1 60.000 (58.871) Prec@5 62.000 (61.994)"
But in test, it performs like
"INFO:main:Test: [10/20] Time 0.100 (0.242) Data 0.000 (0.143) Class 1.8608 (1.7321) Prec@1 0.000 (0.000) Prec@5 0.000 (0.000)"

Could you please help me solve this problem? Thank you so much.

Migrate to pytorch 1.1.0

Hi, will the code be updated for a new version of PyTorch?

I'm trying to do this on my own, but I'm new to PyTorch and am hitting some issues. After changing the .data accesses to .item(), I try to run the code but receive RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM.

ConvNet on pytorch

I am trying to reproduce the result in PyTorch with the traditional 13-layer convnet, but the result is bad.
There may be some mistakes in my settings. I would appreciate it if you could offer help.

ImageNet Training Loss Very High (Error)

File "./main.py", line 166, in main
train(train_loader, train_loader_len, model, ema_model, ema_model, optimizer, epoch, training_lo
File "./main.py", line 492, in train
assert not (np.isnan(loss.data[0]) or loss.data[0] > 1e5), 'Loss explosion: {}'.format(loss.data
AssertionError: Loss explosion: 1088561.875

I trained with 8 GPUs, which should be close enough to the 10-GPU setting in the provided configuration. Is this expected?

ZCA preprocessing

I cannot find where ZCA preprocessing is applied to the CIFAR-10 data in the PyTorch implementation. Am I missing something?

Keep Training but no output

I used the suggested command:

python main.py \
    --dataset cifar10 \
    --labels data-local/labels/cifar10/1000_balanced_labels/00.txt \
    --arch cifar_shakeshake26 \
    --consistency 100.0 \
    --consistency-rampup 5 \
    --labeled-batch-size 62 \
    --epochs 180 \
    --lr-rampdown-epochs 210

I'm using Ubuntu 18.04, Python 3.6, PyTorch 0.3.0, numpy 1.14.2, CUDA 8.0, and 2 GTX 1080 Ti GPUs.
When I run main.py, it starts training, but there is no output information (epochs, accuracy) during the training process (over a few hours); the only outputs are like this:

INFO:main:=> creating model 'cifar_shakeshake26'
INFO:main:=> creating EMA model 'cifar_shakeshake26'
INFO:main:
List of model parameters:

module.conv1.weight 16 * 3 * 3 * 3 = 432
module.layer1.0.conv_a1.weight 96 * 16 * 3 * 3 = 13,824
module.layer1.0.bn_a1.weight 96 = 96
module.layer1.0.bn_a1.bias 96 = 96
.....
module.fc2.weight 10 * 384 = 3,840
module.fc2.bias 10 = 10

all parameters sum of above = 26,197,316

I have checked the results folder and there is no checkpoint file in it.

TypeError: can't serialize tensor(62, device='cuda:0')

I do not know why it happened.

/home/lts/.conda/envs/PCL/bin/python /home/lts/PycharmProject/mean-teacher/pytorch/main.py
INFO:main:=> creating model 'cifar_shakeshake26'
INFO:main:=> creating EMA model 'cifar_shakeshake26'
INFO:main:
List of model parameters:

module.conv1.weight 16 * 3 * 3 * 3 = 432
module.layer1.0.conv_a1.weight 96 * 16 * 3 * 3 = 13,824
module.layer1.0.bn_a1.weight 96 = 96
module.layer1.0.bn_a1.bias 96 = 96
module.layer1.0.conv_a2.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.0.bn_a2.weight 96 = 96
module.layer1.0.bn_a2.bias 96 = 96
module.layer1.0.conv_b1.weight 96 * 16 * 3 * 3 = 13,824
module.layer1.0.bn_b1.weight 96 = 96
module.layer1.0.bn_b1.bias 96 = 96
module.layer1.0.conv_b2.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.0.bn_b2.weight 96 = 96
module.layer1.0.bn_b2.bias 96 = 96
module.layer1.0.downsample.0.weight 96 * 16 * 1 * 1 = 1,536
module.layer1.0.downsample.1.weight 96 = 96
module.layer1.0.downsample.1.bias 96 = 96
module.layer1.1.conv_a1.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.1.bn_a1.weight 96 = 96
module.layer1.1.bn_a1.bias 96 = 96
module.layer1.1.conv_a2.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.1.bn_a2.weight 96 = 96
module.layer1.1.bn_a2.bias 96 = 96
module.layer1.1.conv_b1.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.1.bn_b1.weight 96 = 96
module.layer1.1.bn_b1.bias 96 = 96
module.layer1.1.conv_b2.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.1.bn_b2.weight 96 = 96
module.layer1.1.bn_b2.bias 96 = 96
module.layer1.2.conv_a1.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.2.bn_a1.weight 96 = 96
module.layer1.2.bn_a1.bias 96 = 96
module.layer1.2.conv_a2.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.2.bn_a2.weight 96 = 96
module.layer1.2.bn_a2.bias 96 = 96
module.layer1.2.conv_b1.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.2.bn_b1.weight 96 = 96
module.layer1.2.bn_b1.bias 96 = 96
module.layer1.2.conv_b2.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.2.bn_b2.weight 96 = 96
module.layer1.2.bn_b2.bias 96 = 96
module.layer1.3.conv_a1.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.3.bn_a1.weight 96 = 96
module.layer1.3.bn_a1.bias 96 = 96
module.layer1.3.conv_a2.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.3.bn_a2.weight 96 = 96
module.layer1.3.bn_a2.bias 96 = 96
module.layer1.3.conv_b1.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.3.bn_b1.weight 96 = 96
module.layer1.3.bn_b1.bias 96 = 96
module.layer1.3.conv_b2.weight 96 * 96 * 3 * 3 = 82,944
module.layer1.3.bn_b2.weight 96 = 96
module.layer1.3.bn_b2.bias 96 = 96
module.layer2.0.conv_a1.weight 192 * 96 * 3 * 3 = 165,888
module.layer2.0.bn_a1.weight 192 = 192
module.layer2.0.bn_a1.bias 192 = 192
module.layer2.0.conv_a2.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.0.bn_a2.weight 192 = 192
module.layer2.0.bn_a2.bias 192 = 192
module.layer2.0.conv_b1.weight 192 * 96 * 3 * 3 = 165,888
module.layer2.0.bn_b1.weight 192 = 192
module.layer2.0.bn_b1.bias 192 = 192
module.layer2.0.conv_b2.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.0.bn_b2.weight 192 = 192
module.layer2.0.bn_b2.bias 192 = 192
module.layer2.0.downsample.conv.weight 192 * 96 * 1 * 1 = 18,432
module.layer2.0.downsample.conv.bias 192 = 192
module.layer2.0.downsample.bn.weight 192 = 192
module.layer2.0.downsample.bn.bias 192 = 192
module.layer2.1.conv_a1.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.1.bn_a1.weight 192 = 192
module.layer2.1.bn_a1.bias 192 = 192
module.layer2.1.conv_a2.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.1.bn_a2.weight 192 = 192
module.layer2.1.bn_a2.bias 192 = 192
module.layer2.1.conv_b1.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.1.bn_b1.weight 192 = 192
module.layer2.1.bn_b1.bias 192 = 192
module.layer2.1.conv_b2.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.1.bn_b2.weight 192 = 192
module.layer2.1.bn_b2.bias 192 = 192
module.layer2.2.conv_a1.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.2.bn_a1.weight 192 = 192
module.layer2.2.bn_a1.bias 192 = 192
module.layer2.2.conv_a2.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.2.bn_a2.weight 192 = 192
module.layer2.2.bn_a2.bias 192 = 192
module.layer2.2.conv_b1.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.2.bn_b1.weight 192 = 192
module.layer2.2.bn_b1.bias 192 = 192
module.layer2.2.conv_b2.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.2.bn_b2.weight 192 = 192
module.layer2.2.bn_b2.bias 192 = 192
module.layer2.3.conv_a1.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.3.bn_a1.weight 192 = 192
module.layer2.3.bn_a1.bias 192 = 192
module.layer2.3.conv_a2.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.3.bn_a2.weight 192 = 192
module.layer2.3.bn_a2.bias 192 = 192
module.layer2.3.conv_b1.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.3.bn_b1.weight 192 = 192
module.layer2.3.bn_b1.bias 192 = 192
module.layer2.3.conv_b2.weight 192 * 192 * 3 * 3 = 331,776
module.layer2.3.bn_b2.weight 192 = 192
module.layer2.3.bn_b2.bias 192 = 192
module.layer3.0.conv_a1.weight 384 * 192 * 3 * 3 = 663,552
module.layer3.0.bn_a1.weight 384 = 384
module.layer3.0.bn_a1.bias 384 = 384
module.layer3.0.conv_a2.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.0.bn_a2.weight 384 = 384
module.layer3.0.bn_a2.bias 384 = 384
module.layer3.0.conv_b1.weight 384 * 192 * 3 * 3 = 663,552
module.layer3.0.bn_b1.weight 384 = 384
module.layer3.0.bn_b1.bias 384 = 384
module.layer3.0.conv_b2.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.0.bn_b2.weight 384 = 384
module.layer3.0.bn_b2.bias 384 = 384
module.layer3.0.downsample.conv.weight 384 * 192 * 1 * 1 = 73,728
module.layer3.0.downsample.conv.bias 384 = 384
module.layer3.0.downsample.bn.weight 384 = 384
module.layer3.0.downsample.bn.bias 384 = 384
module.layer3.1.conv_a1.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.1.bn_a1.weight 384 = 384
module.layer3.1.bn_a1.bias 384 = 384
module.layer3.1.conv_a2.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.1.bn_a2.weight 384 = 384
module.layer3.1.bn_a2.bias 384 = 384
module.layer3.1.conv_b1.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.1.bn_b1.weight 384 = 384
module.layer3.1.bn_b1.bias 384 = 384
module.layer3.1.conv_b2.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.1.bn_b2.weight 384 = 384
module.layer3.1.bn_b2.bias 384 = 384
module.layer3.2.conv_a1.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.2.bn_a1.weight 384 = 384
module.layer3.2.bn_a1.bias 384 = 384
module.layer3.2.conv_a2.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.2.bn_a2.weight 384 = 384
module.layer3.2.bn_a2.bias 384 = 384
module.layer3.2.conv_b1.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.2.bn_b1.weight 384 = 384
module.layer3.2.bn_b1.bias 384 = 384
module.layer3.2.conv_b2.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.2.bn_b2.weight 384 = 384
module.layer3.2.bn_b2.bias 384 = 384
module.layer3.3.conv_a1.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.3.bn_a1.weight 384 = 384
module.layer3.3.bn_a1.bias 384 = 384
module.layer3.3.conv_a2.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.3.bn_a2.weight 384 = 384
module.layer3.3.bn_a2.bias 384 = 384
module.layer3.3.conv_b1.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.3.bn_b1.weight 384 = 384
module.layer3.3.bn_b1.bias 384 = 384
module.layer3.3.conv_b2.weight 384 * 384 * 3 * 3 = 1,327,104
module.layer3.3.bn_b2.weight 384 = 384
module.layer3.3.bn_b2.bias 384 = 384
module.fc1.weight 10 * 384 = 3,840
module.fc1.bias 10 = 10
module.fc2.weight 10 * 384 = 3,840
module.fc2.bias 10 = 10

all parameters sum of above = 26,197,316

/home/lts/.conda/envs/PCL/lib/python3.6/site-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
/home/lts/PycharmProject/mean-teacher/pytorch/main.py:224: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
ema_input_var = torch.autograd.Variable(ema_input, volatile=True)
INFO:main:Epoch: [0][0/22000] Time 10.247 (10.247) Data 3.054 (3.054) Class 2.2228 (2.2228) Cons 0.0015 (0.0015) Prec@1 3.000 (3.000) Prec@5 51.000 (51.000)
Traceback (most recent call last):
File "/home/lts/PycharmProject/mean-teacher/pytorch/main.py", line 426, in <module>
main(RunContext(__file__, 0))
File "/home/lts/PycharmProject/mean-teacher/pytorch/main.py", line 105, in main
train(train_loader, model, ema_model, optimizer, epoch, training_log)
File "/home/lts/PycharmProject/mean-teacher/pytorch/main.py", line 311, in train
**meters.sums()
File "/home/lts/PycharmProject/mean-teacher/pytorch/mean_teacher/run_context.py", line 34, in record
self._record(step, col_val_dict)
File "/home/lts/PycharmProject/mean-teacher/pytorch/mean_teacher/run_context.py", line 45, in _record
self.save()
File "/home/lts/PycharmProject/mean-teacher/pytorch/mean_teacher/run_context.py", line 38, in save
df.to_msgpack(self.log_file_path, compress='zlib')
File "/home/lts/.conda/envs/PCL/lib/python3.6/site-packages/pandas/core/generic.py", line 1320, in to_msgpack
**kwargs)
File "/home/lts/.conda/envs/PCL/lib/python3.6/site-packages/pandas/io/packers.py", line 154, in to_msgpack
writer(fh)
File "/home/lts/.conda/envs/PCL/lib/python3.6/site-packages/pandas/io/packers.py", line 150, in writer
fh.write(pack(a, **kwargs))
File "/home/lts/.conda/envs/PCL/lib/python3.6/site-packages/pandas/io/packers.py", line 691, in pack
use_bin_type=use_bin_type).pack(o)
File "pandas/io/msgpack/_packer.pyx", line 230, in pandas.io.msgpack._packer.Packer.pack (pandas/io/msgpack/_packer.cpp:3642)
File "pandas/io/msgpack/_packer.pyx", line 232, in pandas.io.msgpack._packer.Packer.pack (pandas/io/msgpack/_packer.cpp:3484)
File "pandas/io/msgpack/_packer.pyx", line 191, in pandas.io.msgpack._packer.Packer._pack (pandas/io/msgpack/_packer.cpp:2605)
File "pandas/io/msgpack/_packer.pyx", line 220, in pandas.io.msgpack._packer.Packer._pack (pandas/io/msgpack/_packer.cpp:3178)
File "pandas/io/msgpack/_packer.pyx", line 191, in pandas.io.msgpack._packer.Packer._pack (pandas/io/msgpack/_packer.cpp:2605)
File "pandas/io/msgpack/_packer.pyx", line 220, in pandas.io.msgpack._packer.Packer._pack (pandas/io/msgpack/_packer.cpp:3178)
File "pandas/io/msgpack/_packer.pyx", line 227, in pandas.io.msgpack._packer.Packer._pack (pandas/io/msgpack/_packer.cpp:3348)
TypeError: can't serialize tensor(62, device='cuda:0')

Process finished with exit code 1

Query regarding input transformation

Hey,
I guess input and ema_input are transformed versions of the same images, right? I'm referring to:

for i, ((input, ema_input), target) in enumerate(train_loader):

If so, did you experiment with using the same input for both model and ema_model? Does using the same input lead to a drop in performance?

Thanks!

Questions about the precision in validation

I trained the ResNet architecture (cifar_shakeshake26 in the PyTorch version) on the CIFAR-10 dataset with 1000 unlabeled images and 44000 labeled images (the remaining 5000 images are used for validation) for about 180 epochs, with batch size 256 and labeled batch size 62.
But I observed that the validation precision (top-1) first rises from 43% up to 50% and then falls to only 13% (it begins to fall after about 10 epochs) along the training process. I am puzzled by this phenomenon. Besides, the precision in training always rises and never falls; why would the validation precision fall?

how to train with unlabeled data

I get the idea of the paper, but how can I use the unlabeled data to train the model?
From what I read in the paper, if I use unlabeled data I just keep the consistency loss, because there is no classification loss for it. Am I right?
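If I read the PyTorch code correctly (hedged; see NO_LABEL in mean_teacher/data.py), unlabeled samples are marked with target -1 and the classification loss simply ignores them, while the consistency loss applies to every sample. A standalone sketch:

    import torch
    import torch.nn.functional as F

    NO_LABEL = -1  # assumed convention: unlabeled targets are -1

    logits = torch.randn(4, 10)
    targets = torch.tensor([3, NO_LABEL, 7, NO_LABEL])  # two labeled, two unlabeled

    # Cross-entropy skips the unlabeled rows; the consistency loss (not shown)
    # would use all four rows.
    class_loss = F.cross_entropy(logits, targets, ignore_index=NO_LABEL)
    print(class_loss)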

ImageNet instructions

Hi, thanks for the great work!
I am wondering whether you could upload instructions to reproduce the ImageNet results. It seems the data preparation for ImageNet is missing. Thanks.
