
cutmix-pytorch's Introduction

Accepted at ICCV 2019 (oral talk)!

CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

Official PyTorch implementation of the CutMix regularizer | Paper | Pretrained Models

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo.

Clova AI Research, NAVER Corp.

Our implementation is based on these repositories:

Abstract

Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers. They have proved effective for guiding the model to attend to less discriminative parts of objects (e.g. the leg rather than the head of a person), thereby letting the network generalize better and gain better object localization capabilities. On the other hand, current methods for regional dropout remove informative pixels from training images by overlaying a patch of either black pixels or random noise. Such removal is not desirable because it leads to information loss and inefficiency during training. We therefore propose the CutMix augmentation strategy: patches are cut and pasted among training images, where the ground-truth labels are also mixed proportionally to the area of the patches. By making efficient use of training pixels and retaining the regularization effect of regional dropout, CutMix consistently outperforms the state-of-the-art augmentation strategies on CIFAR and ImageNet classification tasks, as well as on the ImageNet weakly-supervised localization task. Moreover, unlike previous augmentation methods, our CutMix-trained ImageNet classifier, when used as a pretrained model, yields consistent performance gains on Pascal detection and MS-COCO image captioning benchmarks. We also show that CutMix improves model robustness against input corruptions and its out-of-distribution detection performance.

Overview of the results of Mixup, Cutout, and CutMix.

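For a quick sense of the method, here is a minimal sketch of one CutMix training step in PyTorch. It mirrors the box sampling and the mixed loss used by this repository, but treat it as an illustrative sketch rather than the exact training code:

    import numpy as np
    import torch

    def rand_bbox(size, lam):
        """Sample a box covering roughly (1 - lam) of the image area."""
        W, H = size[2], size[3]
        cut_rat = np.sqrt(1. - lam)
        cut_w, cut_h = int(W * cut_rat), int(H * cut_rat)
        # sample the box center uniformly, then clip the box to the image
        cx, cy = np.random.randint(W), np.random.randint(H)
        bbx1 = np.clip(cx - cut_w // 2, 0, W)
        bby1 = np.clip(cy - cut_h // 2, 0, H)
        bbx2 = np.clip(cx + cut_w // 2, 0, W)
        bby2 = np.clip(cy + cut_h // 2, 0, H)
        return bbx1, bby1, bbx2, bby2

    def cutmix_step(model, criterion, input, target, beta=1.0):
        """One forward pass with CutMix applied to the whole batch."""
        lam = np.random.beta(beta, beta)
        rand_index = torch.randperm(input.size(0), device=input.device)
        target_a, target_b = target, target[rand_index]
        bbx1, bby1, bbx2, bby2 = rand_bbox(input.size(), lam)
        input[:, :, bbx1:bbx2, bby1:bby2] = input[rand_index, :, bbx1:bbx2, bby1:bby2]
        # adjust lambda to exactly match the pasted pixel ratio after clipping
        lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (input.size(-1) * input.size(-2)))
        output = model(input)
        return criterion(output, target_a) * lam + criterion(output, target_b) * (1. - lam)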

Updates

23 May, 2019: Initial upload

Getting Started

Requirements

  • Python3
  • PyTorch (> 1.0)
  • torchvision (> 0.2)
  • NumPy

Train Examples

  • CIFAR-100: We used 2 GPUs to train on CIFAR-100.
python train.py \
--net_type pyramidnet \
--dataset cifar100 \
--depth 200 \
--alpha 240 \
--batch_size 64 \
--lr 0.25 \
--expname PyraNet200 \
--epochs 300 \
--beta 1.0 \
--cutmix_prob 0.5 \
--no-verbose
  • ImageNet: We used 4 GPUs to train on ImageNet.
python train.py \
--net_type resnet \
--dataset imagenet \
--batch_size 256 \
--lr 0.1 \
--depth 50 \
--epochs 300 \
--expname ResNet50 \
-j 40 \
--beta 1.0 \
--cutmix_prob 1.0 \
--no-verbose

Test Examples Using Pretrained Models

python test.py \
--net_type pyramidnet \
--dataset cifar100 \
--batch_size 64 \
--depth 200 \
--alpha 240 \
--pretrained /set/your/model/path/model_best.pth.tar
python test.py \
--net_type resnet \
--dataset imagenet \
--batch_size 64 \
--depth 50 \
--pretrained /set/your/model/path/model_best.pth.tar
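If you want to inspect a downloaded checkpoint outside of test.py, here is a hedged sketch. It assumes the common ImageNet-example checkpoint layout with a 'state_dict' key and possible nn.DataParallel 'module.' prefixes; adjust to whatever model_best.pth.tar actually contains:

    import torch

    ckpt = torch.load('model_best.pth.tar', map_location='cpu')
    print(list(ckpt.keys()))  # inspect the actual layout first

    # unwrap the weights if they are nested under 'state_dict'
    state_dict = ckpt.get('state_dict', ckpt)
    # strip the 'module.' prefix that nn.DataParallel adds, if present
    state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
                  for k, v in state_dict.items()}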

Experimental Results and Pretrained Models

  • PyramidNet-200 pretrained on CIFAR-100 dataset:
| Method | Top-1 Error (%) | Model file |
| --- | --- | --- |
| PyramidNet-200 [CVPR'17] (baseline) | 16.45 | model |
| PyramidNet-200 + CutMix | 14.23 | model |
| PyramidNet-200 + ShakeDrop [arXiv'18] + CutMix | 13.81 | - |
| PyramidNet-200 + Mixup [ICLR'18] | 15.63 | model |
| PyramidNet-200 + Manifold Mixup [ICML'19] | 16.14 | model |
| PyramidNet-200 + Cutout [arXiv'17] | 16.53 | model |
| PyramidNet-200 + DropBlock [NeurIPS'18] | 15.73 | model |
| PyramidNet-200 + Cutout + Label smoothing | 15.61 | model |
| PyramidNet-200 + DropBlock + Label smoothing | 15.16 | model |
| PyramidNet-200 + Cutout + Mixup | 15.46 | model |
  • ResNet models pretrained on ImageNet dataset:
| Method | Top-1 Error (%) | Model file |
| --- | --- | --- |
| ResNet-50 [CVPR'16] (baseline) | 23.68 | model |
| ResNet-50 + CutMix | 21.40 | model |
| ResNet-50 + Feature CutMix | 21.80 | model |
| ResNet-50 + Mixup [ICLR'18] | 22.58 | model |
| ResNet-50 + Manifold Mixup [ICML'19] | 22.50 | model |
| ResNet-50 + Cutout [arXiv'17] | 22.93 | model |
| ResNet-50 + AutoAugment [CVPR'19] | 22.40* | - |
| ResNet-50 + DropBlock [NeurIPS'18] | 21.87* | - |
| ResNet-101 + CutMix | 20.17 | model |
| ResNet-152 + CutMix | 19.20 | model |
| ResNeXt-101 (32x4d) + CutMix | 19.47 | model |

* denotes results reported in the original papers

Transfer Learning Results

| Backbone | ImageNet Cls (%) | ImageNet Loc (%) | CUB200 Loc (%) | Detection (SSD) (mAP) | Detection (Faster-RCNN) (mAP) | Image Captioning (BLEU-4) |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-50 | 23.68 | 46.3 | 49.41 | 76.7 | 75.6 | 22.9 |
| ResNet-50 + Mixup | 22.58 | 45.84 | 49.3 | 76.6 | 73.9 | 23.2 |
| ResNet-50 + Cutout | 22.93 | 46.69 | 52.78 | 76.8 | 75.0 | 24.0 |
| ResNet-50 + CutMix | 21.60 | 46.25 | 54.81 | 77.6 | 76.7 | 24.9 |

Third-party Implementations

Citation

@inproceedings{yun2019cutmix,
    title={CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features},
    author={Yun, Sangdoo and Han, Dongyoon and Oh, Seong Joon and Chun, Sanghyuk and Choe, Junsuk and Yoo, Youngjoon},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2019}
}

License

Copyright (c) 2019-present NAVER Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

cutmix-pytorch's People

Contributors

hellbell


cutmix-pytorch's Issues

True lambda

There is a potential bug in the CutMix code when applying the lambda value to the target.
Initially you randomly sample a lambda value and use it in rand_bbox to define bbx1, bby1, bbx2, bby2. But when the coordinates of the selected region fall close to an image edge, the values are clipped. That is fine for defining the image region (this way all image pixels have the same probability of being selected). However, the true lambda (the proportion of the image that is mixed) is then smaller than the originally sampled lambda.
To avoid this, the true lambda value applied to the target should be calculated this way:
lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (input.size()[-1] * input.size()[-2]))
This value is now correct and can be applied to the target, avoiding unnecessary label distortion.
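In code, the fix amounts to recomputing lam from the clipped box before mixing the targets. A short sketch (helper name is hypothetical; the box tuple comes from the repository's rand_bbox):

    def true_lambda(bbox, size):
        """Fraction of pixels NOT replaced, computed from the clipped box."""
        bbx1, bby1, bbx2, bby2 = bbox
        return 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (size[-1] * size[-2]))

    # usage: lam = true_lambda(rand_bbox(input.size(), lam), input.size())
    # then mix the targets with this corrected lam, not the sampled one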

Another reimplementation

I'm very interested in your work, so I decided to reimplement it with some fixes.

Issues #1, #3, and #4 are addressed in this reimplementation, and I am still working on improving usability and performance.

After fixing those issues, I got clearly improved results on CIFAR-100:

| | Top-1 Error (@300 epochs) | Top-1 Error (Best) |
| --- | --- | --- |
| Paper's reported result | N/A | 13.81 |
| Our re-implementation | 13.68 | 13.15 |

I will post updates as my implementation progresses and further experiments finish.

Repo : https://github.com/ildoonet/cutmix

cutmix for segmentation

In this repo, CutMix works well for classification. Can it be used for semantic segmentation too?
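The repository only covers classification, but one natural adaptation (a hedged sketch, not something the authors provide) is to cut the same box from both the images and their label masks, so the per-pixel labels stay aligned and no mixing ratio is needed. It reuses the rand_bbox sketch from the top of this page:

    import numpy as np
    import torch

    def cutmix_segmentation(images, masks, beta=1.0):
        """images: (N, C, H, W) float tensor; masks: (N, H, W) long tensor."""
        lam = np.random.beta(beta, beta)
        idx = torch.randperm(images.size(0), device=images.device)
        bbx1, bby1, bbx2, bby2 = rand_bbox(images.size(), lam)  # box sampler as above
        # paste the same region into both images and masks
        images[:, :, bbx1:bbx2, bby1:bby2] = images[idx, :, bbx1:bbx2, bby1:bby2]
        masks[:, bbx1:bbx2, bby1:bby2] = masks[idx, bbx1:bbx2, bby1:bby2]
        return images, masks  # train with the usual per-pixel loss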

CutMix does not work?

When I apply CutMix, the results get worse instead. What is important to keep in mind when using CutMix?

Question About CutMix

If, when sample pairs are combined into new data during mini-batch random sampling, the cropped regions of image A and image B are both background (and the backgrounds are nearly identical) rather than the region where the target is located, is it still reasonable to force the model to fit such training samples?

Any COCO pretrained models yet?

Hi, I was wondering whether you have any COCO pre-trained models (with FPN) for CutMix regularization under the PyTorch implementation yet? Any plans for future updates?

About object detection

For an object detection task, what is the label of the original target box after it is processed by CutMix?

Clarification about CutMix

I saw examples where the head of a dog is cut out and overlaid with the head of a cat. How do you identify these specific portions of the body, i.e. the important portions of the two objects?
Or is the above scenario purely an example, and portions of images are replaced randomly?

I hope my question is clear; otherwise I shall rephrase it.

Reproducibility on Resnet-110 on cifar-100

Hi, I ran your code on CIFAR-100 using ResNet-110, because PyramidNet-200 takes a long time to finish. I used the same default parameters as for PyramidNet-200. Here is what I got:
try1: 20.23, 4.65
try2: 21.30, 4.93

However, 20.11 and 4.43 are reported in the paper (in Table 6). May I know what settings you used for ResNet-110?

Training loss/accuracy fluctuates too much when using CutMix regularization

I'm trying to implement CutMix on CIFAR-10 dataset. Here is my implementation from the given pseudocode:

import numpy as np
import torch

# H and W are the spatial dimensions of the (N, C, H, W) batch
H, W = x_train.size(2), x_train.size(3)

cutmix_decision = np.random.rand()
if cutmix_decision > 0.60:
    # CutMix: https://arxiv.org/pdf/1905.04899.pdf
    x_train_shuffled, y_train_shuffled = shuffle_minibatch(x_train, y_train)
    lam = np.random.beta(CUTMIX_ALPHA, CUTMIX_ALPHA)
    cut_rat = np.sqrt(1. - lam)
    cut_w = int(W * cut_rat)  # np.int is deprecated; use plain int
    cut_h = int(H * cut_rat)

    # sample the box center uniformly, then clip the box to the image
    cx = np.random.randint(W)
    cy = np.random.randint(H)

    bbx1 = np.clip(cx - cut_w // 2, 0, W)
    bby1 = np.clip(cy - cut_h // 2, 0, H)
    bbx2 = np.clip(cx + cut_w // 2, 0, W)
    bby2 = np.clip(cy + cut_h // 2, 0, H)

    # paste the shuffled patch, recompute lam from the clipped box, mix targets
    x_train[:, :, bbx1:bbx2, bby1:bby2] = x_train_shuffled[:, :, bbx1:bbx2, bby1:bby2]
    lam = 1 - (bbx2 - bbx1) * (bby2 - bby1) / (W * H)
    y_train = lam * y_train + (1 - lam) * y_train_shuffled

And here is the shuffle_minibatch() function:

def shuffle_minibatch(x, y):
    """Return x and y shuffled with the same random permutation."""
    assert x.size(0) == y.size(0)
    indices = torch.randperm(x.size(0))
    return x[indices], y[indices]

I'm using PyTorch to train the model. This regularization is applied randomly with probability 40% (cutmix_decision > 0.60). Now when I train the model, the training loss/accuracy fluctuates far too much. However, validation accuracy stays stable, and given the stable validation accuracy I assume the CutMix implementation is correct.

Here is the accuracy curve for both training and validation datasets.


Is this normal behavior when using CutMix regularization, or am I missing something? Is the rate of regularization too high? Or is the image resolution too low for this type of augmentation? If you'd like to look at the full implementation, here is my notebook: https://www.kaggle.com/kaushal2896/cifar-10-simple-cnn-with-cutmix-using-pytorch

CutMix for Image Captioning

Hi, I really like your paper. But I have a question.
I understand how CutMix is applied to the image classification task, but I'm confused about how you implement CutMix for the image captioning task.
Given λ sampled from the uniform distribution (0, 1), how do you merge the two captions?

It would be great if you could provide an example, e.g.:
with λ = 0.2, how are y1 = "A man wearing blue jeans standing inside the bus" and y2 = "A bird is flying in the sky" combined?
Thanks :)

Alpha for CutMix

How do I choose the value of alpha for CutMix to get the best performance on my dataset?
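There is no universal answer, but it can help to look at what alpha does to the sampled ratio lam ~ Beta(alpha, alpha); the commands in this README use 1.0, which makes lam uniform. An illustrative sketch:

    import numpy as np

    for alpha in (0.2, 1.0, 5.0):
        lam = np.random.beta(alpha, alpha, size=100_000)
        # small alpha -> lam concentrates near 0 or 1 (mostly-one-image batches);
        # alpha = 1 -> uniform; large alpha -> lam near 0.5 (aggressive mixing)
        near_half = ((lam > 0.4) & (lam < 0.6)).mean()
        print(f"alpha={alpha}: mean lam={lam.mean():.2f}, P(0.4<lam<0.6)={near_half:.2f}")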

About the result of PyramidNet-110 baseline in the paper

In the paper (Table 6), PyramidNet-110 without CutMix is reported to achieve 19.85% top-1 error on CIFAR-100. Since the PyramidNet paper reports an error rate of 23.12% with the same setting (except that lr equals 0.5), I wonder how the 19.85% result was achieved. Could you provide example training code to reproduce it?
Thanks a lot!

cutmix as finetune technique

First of all, very well documented experiments!
The learning curve in Fig. 2 shows the model trained with CutMix outperforming the baseline only after the second LR adjustment; I wonder whether this holds across your other experiments.

It does make sense, since CutMix is indeed a regularizer that prevents the model from memorizing/overfitting the training data. But WHEN it takes effect is also an interesting question to ask.

If it is only effective very late in training, one could simply use it as a fine-tuning technique and save the time/resources of retraining from scratch.

Consistency between code and paper

Hello, I'm a student from POSTECH.

It was interesting to follow your work.
However, I think the following code is inconsistent with the paper:

loss = criterion(output, target_a) * lam + criterion(output, target_b) * (1. - lam)

The loss is interpolated in the code, but in the paper the labels are mixed into a single target, as described in Appendix A1.

Is there a reason for this change, or have I misunderstood the paper?

I look forward to your response. Thank you.
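For what it's worth, the two forms agree for cross-entropy: the loss is linear in the target distribution, so interpolating two losses equals using a single mixed target. A quick numerical check (a sketch, not code from this repository):

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 10)
    a, b = torch.randint(0, 10, (4,)), torch.randint(0, 10, (4,))
    lam = 0.7

    # (1) interpolated losses, as in train.py
    loss_interp = lam * F.cross_entropy(logits, a) + (1 - lam) * F.cross_entropy(logits, b)

    # (2) cross-entropy against a single mixed soft target, as in the paper
    mixed = lam * F.one_hot(a, 10).float() + (1 - lam) * F.one_hot(b, 10).float()
    loss_mixed = -(mixed * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

    print(torch.allclose(loss_interp, loss_mixed))  # True (up to float error)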

About the probability of applying CutMix

It seems that cutmix_prob is an important hyper-parameter, yet it is not mentioned in the paper. May I ask what value you adopted to get the results in the paper?

Can Cutmix work well with CosFace?

Hi there,

Thanks for the great paper.

I'm wondering whether CutMix can work with a large-margin loss such as CosFace; have you done similar experiments?

I think CutMix could help with some out-of-distribution image problems in my project, but I don't want to drop CosFace, and I'm having trouble figuring out how to combine the two. Could you please give me some advice?

Thanks and have a great day!

About the hyper-parameter alpha of mixup

In the paper "mixup: Beyond Empirical Risk Minimization", Mixup performs best on ImageNet when the hyper-parameter alpha is between 0.2 and 0.4. For ResNet-50 they get 77.9% accuracy on ImageNet with alpha equal to 0.2, trained for 200 epochs. But in this work, the Mixup result is reported with alpha equal to 1.0, which yields 77.42% accuracy. This might not be a fair comparison.
In fact, after running some experiments, we find that the performance of Mixup and CutMix can be close on ImageNet when each uses its preferred alpha setting (0.2 and 1.0, respectively).
Have you tried related experiments, and what do you think about this?
I hope I have expressed my opinion clearly. Looking forward to your reply!

cutmix for segmentation

I'm very interested in your work, and recently I've been trying to use it for a segmentation problem; I have a couple of questions. First, how does the formula from the paper, loss = criterion(output, target_a) * lam + criterion(output, target_b) * (1. - lam), extend to the segmentation problem? I use 3D image data, so I apply CutMix to a volume. In addition, I am not sure how to set some of the parameters in this case. Could you give me some suggestions? Here is my implementation:

volume_batch_label = volume_batch[:labeled_bs].clone()
label_batch_label = label_batch[:labeled_bs].clone()

# generate a mixed sample from the labeled data
lam_label = np.random.beta(1, 1)
rand_index_label = torch.randperm(volume_batch_label.size(0)).cuda()
bbx1, bby1, bbz1, bbx2, bby2, bbz2 = rand_bbox(volume_batch_label.size(), lam_label)

# paste the same 3D box into both the volumes and their label maps
volume_batch_label[:, :, bbx1:bbx2, bby1:bby2, bbz1:bbz2] = \
    volume_batch_label[rand_index_label, :, bbx1:bbx2, bby1:bby2, bbz1:bbz2]
label_batch_label[:, bbx1:bbx2, bby1:bby2, bbz1:bbz2] = \
    label_batch_label[rand_index_label, bbx1:bbx2, bby1:bby2, bbz1:bbz2]

# adjust lambda to exactly match the voxel ratio of the pasted box
D, H, W = volume_batch_label.size()[-3:]
lam_label = 1 - ((bbx2 - bbx1) * (bby2 - bby1) * (bbz2 - bbz1) / (D * H * W))

# compute outputs and supervised losses on the mixed labeled data
outputs_label = model(volume_batch_label)
loss_seg = F.cross_entropy(outputs_label, label_batch_label)
outputs_soft_label = F.softmax(outputs_label, dim=1)
loss_seg_dice = losses.dice_loss(outputs_soft_label[:, 1, :, :, :], label_batch_label == 1)

How to implement feature-level CutMix?

Thanks for your sharing.

But I have a question about how to implement 'ResNet-50 + Feature CutMix': if we just use the same replace operation as in image-level CutMix, the gradients will not be back-propagated.

Actually, I encountered this problem:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Could you give me some suggestions or hints for solving this feature-level CutMix problem?
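One common workaround for that autograd error (my own sketch, not the authors' Feature CutMix implementation) is to build the mixed feature map out-of-place, e.g. by cloning before pasting, so no tensor needed for backward is mutated:

    import torch

    def feature_cutmix(feat, rand_index, box):
        """feat: (N, C, H, W) intermediate activations; box: (x1, y1, x2, y2)."""
        bbx1, bby1, bbx2, bby2 = box
        mixed = feat.clone()  # fresh tensor: safe to modify in place
        mixed[:, :, bbx1:bbx2, bby1:bby2] = feat[rand_index, :, bbx1:bbx2, bby1:bby2]
        return mixed

Gradients then reach feat both through the clone (outside the box) and through the indexed read (inside the box), so nothing that backward depends on is overwritten.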

Scripts for mentioned experiments

Hi, thanks for this great repo!

I am new to this area and I have a few questions regarding the experiments:

  1. Do you still have the scripts (or hyperparameter lists) for the experiments in the paper? For example, ResNet-50 and ResNet-101 on the ImageNet dataset, and PyramidNet-200 and PyramidNet-110 for the CIFARs.
  2. When I ran ResNet-50 on CIFAR-10, the number of parameters was 0.5M, while ResNet-50 on ImageNet has 25M parameters. (This is the command I used: python train.py --net_type resnet --dataset cifar10 --depth 50 --alpha 240 --batch_size 64 --lr 0.25 --expname ResNet50 --epochs 300 --beta -1.0 --cutmix_prob 0.5 --no-verbose)
  3. How should I conduct the transfer learning experiments?

Hope you can help me with this.

Thanks a lot!

Reproducibility Issue

I have run your code 5 times in the environment below.

Two V100 GPUs
Python 3.6.7
PyTorch 1.0.0
Cuda 9.0

The command I used is this :

python train.py \
--net_type pyramidnet \
--dataset cifar100 \
--depth 200 \
--alpha 240 \
--batch_size 64 \
--lr 0.25 \
--expname PyraNet200 \
--epochs 300 \
--beta 1.0 \
--cutmix_prob 0.5 \
--no-verbose

For the baseline, I set cutmix_prob=0.0 so that CutMix is not used.

| Setting | Model & Augmentation | try1 | try2 | try3 | try4 | try5 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cutmix p=0.0 | Pyramid200 (Converged) | 17.14 | 16.32 | 16.15 | 16.29 | 16.61 | 16.502 |
| cutmix p=0.0 | Pyramid200 (Best) | 17.01 | 16.02 | 16.01 | 16.17 | 16.35 | 16.312 |
| cutmix p=0.5 | CutMix (Converged) | 16.27 | 15.55 | 16.18 | 16.19 | 15.38 | 15.914 |
| cutmix p=0.5 | CutMix (Best) | 15.29 | 14.66 | 15.28 | 15.04 | 14.52 | 14.958 |

The baseline has a top-1 error similar to the one reported in your paper (16.45), but with CutMix (p=0.5) the result is somewhat poor compared to the reported value (14.23).

Also, I conducted an experiment with ShakeDrop (using the ShakeDrop regularization code from https://github.com/owruby/shake-drop_pytorch).

| Setting | Model & Augmentation | try1 | try2 | try3 | try4 | try5 | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cutmix p=0.5 | ShakeDrop + CutMix (Converged) | 14.06 | 14.00 | 14.16 | 13.86 | 14.00 | 14.016 |
| cutmix p=0.5 | ShakeDrop + CutMix (Best) | 13.67 | 13.81 | 13.80 | 13.69 | 13.62 | 13.718 |

As you can see here, the result claimed in the paper can be achieved only by taking the maximum top-1 validation accuracy during training, not the converged top-1 validation accuracy after training.

So, here are my questions.

  1. How can I reproduce your result? In particular, with the provided code and sample commands I should be able to reproduce the 14.23% top-1 error with PyramidNet + CutMix. It would be great if you could provide the exact environment and command needed to reproduce the result, or if this report helps you find a problem in this repo.

  2. Did you use the 'last validation accuracy' after training or the 'best (peak) validation accuracy' during training? I saw code that tracks the best validation accuracy during training and prints it before terminating, so I assume you used the best (peak) validation accuracy.

Thanks. I look forward to hearing from you.

Using with DropBlock

Hello! First, I really appreciate you publishing a good paper and opening the source code!

My question is, as the title says: what can happen if I use CutMix together with DropBlock?
Do you think it would be helpful, or could there be some interference between the two techniques?

Question of why CutMix improves performance

Hi author,

I'm a student studying deep learning in South Korea.

Looking at your paper and the code, it seems that you crop a random location, mix the two images, and mix the two labels accordingly.

A question arises here: if, for example, there are no objects in the region of an image that is cropped at a random location, label noise may be generated, which could adversely affect performance.

Why does this not hurt performance?

Clarification on in-between class samples results

In the paper, Figure 5b (analysis of in-between class samples) is explained as the probability of predicting neither of the two classes as the combination ratio lambda varies. Does it mean that if a sample is created by mixing cat and dog, the model should predict labels other than cat or dog? What is the significance of this result? Maybe I am missing something here. Also, there are separate plots for Mixup and CutMix, and in each plot various methods are compared. How should I interpret the results? The question may be basic, but I am not able to understand these results and hope someone can answer my queries. I've attached the figure for reference. Thanks in advance!


Question about the reported results

Hi, thanks for the great work. I have a quick question. The paper says that the reported result is the best result over the whole training process. Is this reasonable? We are not supposed to choose the model based on the test set, and in real application scenarios we do not know which model is best because we do not have test labels. So I think the final-epoch result is more meaningful than the best result. Can you give the final-epoch performance of ResNet-50 on ImageNet? Thanks a lot.

Reproducibility Issue again

Hi.
I ran your code with the same instructions as the README, as well as #7;
however, I still couldn't reach the reported performance.
I am very curious whether there is a version dependency behind this, or whether it is a random-seed problem.
Can you share how you used cosine decay? Perhaps you could provide the function adjust_learning_rate in train.py?
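For reference, a standard cosine schedule would look like the sketch below; this is an assumption about what such an adjust_learning_rate could do, not the authors' actual function:

    import math

    def adjust_learning_rate(optimizer, epoch, total_epochs, base_lr):
        """Hypothetical cosine decay of the learning rate from base_lr to 0."""
        lr = 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr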

Your reported error is 14.23, and it is also mentioned in #7, as below.

| | Top-1 error @300 epochs | Best top-1 error |
| --- | --- | --- |
| try1 | 14.78 | 14.23 |
| try2 | 15.44 | 14.5 |
| try3 | 15.00 | 14.68 |
| average | 15.07 | 14.47 |

However, what I am getting is around 16/15 best error on CIFAR-100 using PyramidNet-200.

Thanks in advance.

CutMix on detection

Hi, thanks for your great work!
I have a question about applying CutMix to detection tasks.
Can I just use CutMix on detection datasets like COCO? If so, should I choose objects to paste from the same image, or from other images in the same batch? :)
