
ltr-weight-balancing's Introduction

Long-Tailed Recognition via Weight Balancing

[CVPR2022 paper] [poster] [slides] [video]

In the real open world, data tends to follow long-tailed class distributions, motivating the well-studied long-tailed recognition (LTR) problem. Naive training produces models that are biased toward common classes, achieving higher accuracy on them. The key to addressing LTR is to balance various aspects of learning, including the data distribution, training losses, and gradients. We explore an orthogonal direction, weight balancing, motivated by the empirical observation that a naively trained classifier has "artificially" larger weight norms for common classes (because there is abundant data to train them, unlike the rare classes). We investigate three techniques to balance weights: L2-normalization, weight decay, and MaxNorm. We first point out that L2-normalization "perfectly" balances per-class weights to be unit norm, but such a hard constraint might prevent classes from learning better classifiers. In contrast, weight decay penalizes larger weights more heavily and so learns small, balanced weights; the MaxNorm constraint encourages growing small weights within a norm ball but caps all the weights by the radius. Our extensive study shows that both help learn balanced weights and greatly improve LTR accuracy. Surprisingly, weight decay, although underexplored in LTR, significantly improves over prior work. Therefore, we adopt a two-stage training paradigm and propose a simple approach to LTR: (1) learning features using the cross-entropy loss by tuning weight decay, and (2) learning classifiers using a class-balanced loss by tuning weight decay and MaxNorm. Our approach achieves state-of-the-art accuracy on five standard benchmarks, serving as a future baseline for long-tailed recognition.
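To make the weight-balancing techniques concrete, below is a minimal PyTorch sketch of weight decay (applied through the optimizer) and the MaxNorm projection. The model shape, learning rate, weight decay value, and radius are illustrative stand-ins, not the paper's exact settings.

import torch

# Weight decay is applied through the optimizer; its value is tuned per stage.
# The classifier head and hyperparameters below are illustrative stand-ins.
model = torch.nn.Linear(64, 100)  # stand-in classifier head: 100 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-3)

def maxnorm_project_(weight, radius=1.0):
    """Project each per-class weight vector back into an L2 ball of the
    given radius (the MaxNorm constraint), in place."""
    with torch.no_grad():
        norms = weight.norm(dim=1, keepdim=True).clamp(min=1e-12)
        weight.mul_((radius / norms).clamp(max=1.0))

# In the training loop, the projection runs right after each optimizer step:
# optimizer.step(); maxnorm_project_(model.weight, radius=1.0)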

Code Description

This folder contains two executable Jupyter Notebook files demonstrating our training approach and how we will open-source our code. The Jupyter Notebook files are sufficiently self-explanatory, with detailed comments and displayed output. The files reproduce the results on CIFAR100-LT (imbalance factor 100) shown in Table 1 of the paper.

Running the files requires some common packages (e.g., PyTorch, as detailed later). Please run the first-stage training demo before the second-stage one.

  1. demo1_first-stage-training.ipynb
    Running this file will train a naive network using the cross-entropy loss and stochastic gradient descent (SGD) without weight decay; it should achieve an overall accuracy of ~39% on CIFAR100-LT (imbalance factor 100). It then trains another network with weight decay. Running this file takes ~2 hours on a GPU (e.g., the NVIDIA GeForce RTX 3090 used in our work). The runtime can be reduced by changing total_epoch_num to 100. The training results and model parameters are saved at exp/demo_1.

  2. demo2_second-stage-training.ipynb
    Running this file will compare various regularizers used in the second-stage training, such as L2-normalization, τ-normalization, and MaxNorm with weight decay (the normalizations are sketched right after this list). The latter should achieve an overall accuracy >52%. Running this file takes a few minutes on a GPU.
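For reference outside the notebook, here is a minimal PyTorch sketch of the two post-hoc weight normalizations compared in demo 2. The shapes and the τ value are illustrative stand-ins, not the notebook's exact settings.

import torch

def l2_normalize(weight):
    """Rescale every per-class weight vector to unit L2 norm."""
    return weight / weight.norm(dim=1, keepdim=True).clamp(min=1e-12)

def tau_normalize(weight, tau=1.0):
    """tau-normalization: divide each per-class weight vector by its norm
    raised to the power tau (tau=1 recovers L2-normalization, tau=0 is a
    no-op)."""
    return weight / weight.norm(dim=1, keepdim=True).clamp(min=1e-12).pow(tau)

# Usage on a (stand-in) trained classifier head:
classifier = torch.nn.Linear(64, 100)
with torch.no_grad():
    classifier.weight.copy_(tau_normalize(classifier.weight, tau=0.5))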

Why Jupyter Notebook?

We prefer to release the code as Jupyter Notebooks (https://jupyter.org) because they allow for interactive demonstration for educational purposes.

We also provide Python scripts in case readers would like to run them rather than the Jupyter Notebooks. These Python scripts were converted using the Jupyter commands below:

  • jupyter nbconvert --to script demo1_first-stage-training.ipynb

  • jupyter nbconvert --to script demo2_second-stage-training.ipynb

Requirements

We installed Python and most packages through Anaconda. Some packages might not be installed by default, such as pandas, torchvision, and PyTorch; we suggest installing them before running our code. Below are the versions of Python and PyTorch used in our work.

  • Python version: 3.7.4 [GCC 7.3.0]
  • PyTorch version: 1.7.1
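If any of these are missing, they can be installed, for example, with the commands below (the version pin is illustrative; adjust it to your CUDA setup):

  • conda install pytorch=1.7.1 torchvision -c pytorch

  • conda install pandas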

We suggest reserving ~300 MB of disk space to run all the demos, because they save model parameters.

If you find our model/method/dataset useful, please cite our work:

@inproceedings{LTRweightbalancing,
  title={Long-Tailed Recognition via Weight Balancing},
  author={Alshammari, Shaden and Wang, Yuxiong and Ramanan, Deva and Kong, Shu},
  booktitle={CVPR},
  year={2022}
}

ltr-weight-balancing's People

Contributors: shadealsha, shadennaif

ltr-weight-balancing's Issues

Some questions about comparison experiments

Hello, I read this paper and saw the comparisons with other methods. However, I could not find the source code for the DiVE paper, so I would like to ask how you obtained DiVE's results on the many, medium, and few splits. Did you write the code yourself, or did you find its source code or a trained model?
Thank you for taking time out of your busy schedule to read my questions. I look forward to your reply!

Some implementation confusion about ImageNet and iNaturalist

I really appreciate your work and am grateful for your open-sourcing of the code.
I was able to successfully reproduce the experimental results on CIFAR100-LT using your code, but I was wondering if you could provide me with some additional information on the hyperparameter settings for the larger datasets, ImageNet-LT and iNaturalist.

Specifically, I would greatly appreciate it if you could share the following:

For Stage 1:
the initial learning rate, epoch number, weight decay settings,
and any data augmentation techniques used (such as color jittering for ImageNet).

For Stage 2:
the initial learning rate, epoch number, weight decay settings,
and the hyperparameters in CBLoss (loss type, beta, and gamma; see the sketch below).

Thank you once again for your fantastic work and for your generosity in sharing your code with the community.
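For context on the beta hyperparameter mentioned above: the class-balanced (CB) loss of Cui et al. (2019) weights each class by the inverse of its "effective number" of samples. A minimal sketch, assuming the softmax variant of CBLoss (the counts and beta below are illustrative; gamma only enters the focal variant):

import torch

def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights from 'Class-Balanced Loss Based on Effective
    Number of Samples' (Cui et al., 2019): w_c = (1 - beta) / (1 - beta^n_c),
    rescaled so the weights sum to the number of classes."""
    n = torch.as_tensor(samples_per_class, dtype=torch.float)
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, n))
    return weights / weights.sum() * len(n)

# Toy 3-class long-tailed counts, passed to the softmax cross-entropy loss.
w = class_balanced_weights([500, 50, 5], beta=0.999)
loss_fn = torch.nn.CrossEntropyLoss(weight=w)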

Some questions about the experiments.

  1. In the code, you use the ResNet-34 model for CIFAR100-LT, but in the paper, you use the ResNet-32 model.

     dataset       stage   loss  base lr  scheduler  batch  epochs  WD     model      result_all
     CIFAR100-100  stage1  CE    0.01     CosLR      64     320     0.005  ResNet-32  40.1
     CIFAR100-100  stage1  CE    0.01     CosLR      64     320     0.005  ResNet-34  47.3

     I used these settings to train the model and got a bad result (7% lower than the open-source experiment in Colab); could you please point out my problem?

  2. Could you provide more details about how to choose a proper weight decay value for long-tailed recognition? It would help a lot. (A toy sweep is sketched below.)
  3. I experimented with several methods, including MiSLAS and BAMLS. I found that 5e-4 is good enough and that tuning weight decay improves performance only slightly. Maybe tuning weight decay is not the core point of imbalanced learning?
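Regarding point 2 above, one generic way to pick the weight decay value (an assumption about the protocol, not necessarily what the authors did) is a validation sweep. Here is a tiny self-contained toy version, with synthetic data standing in for a real training run:

import torch

torch.manual_seed(0)
X, y = torch.randn(600, 20), torch.randint(0, 3, (600,))
X_tr, y_tr, X_va, y_va = X[:400], y[:400], X[400:], y[400:]

def train_and_eval(wd):
    """Train a toy linear classifier with the given weight decay and
    return validation accuracy."""
    model = torch.nn.Linear(20, 3)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=wd)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(200):
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()
    return (model(X_va).argmax(1) == y_va).float().mean().item()

# Sweep a small grid of candidate values and keep the best.
best_wd = max([5e-4, 1e-3, 5e-3, 1e-2], key=train_and_eval)
print("best weight decay:", best_wd)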

Number of seeds

How many seeds were averaged to obtain the final reported results, and what are they?

Reproducibility Issues

Hello,

I am unable to reproduce the results in your paper for CIFAR-100 IF 100 with a ResNet-32 backbone. I have followed the details in the paper (cosine scheduler, lr=0.01, batch size 64, etc.) and made some minor modifications to your code to use a ResNet-32 (see resnet_cifar.py in MiSLAS). However, even with weight decay tuning via Bayesian optimization, I cannot attain a test accuracy beyond 39.5% (a small improvement over the baseline).

Would you be able to share the specific weight decay value used to attain the paper results (46.08% acc)?

Thank you for your time!

Reproducibility.. Shocking..

Also, I find that I can never achieve 46% within 200 epochs with tuned WD and cross-entropy.

You trained it for 320 epochs.

Two-stage training

Do you need to freeze the weights of the feature network during the second-stage training of the classification layer?
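For what it's worth, a common way to implement such a second stage (a sketch under the assumption that the backbone is frozen, not the authors' confirmed code) is to disable gradients for the feature extractor and optimize only the final classifier:

import torch
import torchvision

model = torchvision.models.resnet34(num_classes=100)  # stand-in backbone

# Freeze everything except the final fully connected layer ("fc").
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

# Optimize only the trainable (classifier) parameters.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.01, momentum=0.9, weight_decay=5e-3,
)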

Reproducibility issue... ResNet-32 vs. ResNet-34... Fraud?

Thank you for your efforts, but I still doubt the performance you reported, since the Colab code uses ResNet-34 but your paper reports the performance with ResNet-32.

ResNet-34 is very different from ResNet-32 in terms of internal channel widths. ResNet-32 uses 3 stages with internal channels 16, 32, and 64, and its final fully connected layer uses only a 64-dim feature.

ResNet-34 uses a 512-dim feature, with much larger internal channels starting from 64.

The difference is far more than just piling up 2 extra layers. Yet you only released ResNet-34, which has far more parameters, while you reported using ResNet-32 for CIFAR100-LT to achieve the performance.

Config for the best-accuracy stage-1 model

Can you give me the configuration of the best model in stage 1? You write (and demo 1 shows) a stage-1 accuracy of 46.7%, but you load weights with 47.9% accuracy.

About the long-tailed learning

This research field is very messy, and most people are playing around without getting to the bottom of the problem.
The performance can be greatly improved just by adjusting the hyperparameters of generic networks, including weight decay and others that you have used but not mentioned.
I think the future direction is to solve the optimization problem of long-tailed learning.

Question about Figure 3 on your paper

When you compute the filter L2 norms on a trained model, how do you calculate the variance of the filter norms (in Figure 3)?
Since the model is already trained, the filters are learned and therefore fixed.
Thus each norm is just a single value, without variance.
Am I right?

Thanks.
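A plausible reading of that figure (an assumption on my part, not the authors' confirmed answer) is that the variance is taken across the per-class filter norms of a single trained model, i.e., each class contributes one norm and the spread is measured over classes:

import torch

classifier = torch.nn.Linear(64, 100)  # stand-in trained head: 100 classes

with torch.no_grad():
    per_class_norms = classifier.weight.norm(dim=1)  # one L2 norm per class
print("mean:", per_class_norms.mean().item(),
      "variance:", per_class_norms.var().item())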

Cannot Reproduce Your Performance

Hi, thank you for your work,

The thing is, I cannot reproduce your results, even though I am using exactly the same parameters that you used.

Moreover, even when I follow the notebook cell by cell, you achieved 91% training accuracy at epoch 41, but I can only get 60~70%, no matter how many times I try.

Hyperparameters used

Hi,
Interesting work here!
Can you please provide the details on the following hyperparameters used for each of the following datasets,
iNaturalist18: the weight decay and learning rate used.
ImageNet-LT: the weight decay and learning rate used.

Thank you in advance!
