Comments (9)
@ildoonet
Thank you for your reply.
I do understand your concerns, but I don't agree that reporting the best performance is cheating. As I said, the best model can reasonably be taken to represent the performance of the method. The difference between the best and the last model comes from the step-decay learning rate. In our case, using a cosine learning rate on CIFAR100, the best and last models are almost the same (within ±0.1% accuracy).
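For readers following this point: the best-vs-last gap depends heavily on the schedule, because a step schedule drops the learning rate abruptly while cosine annealing decays it smoothly toward zero, so the model barely moves in the final epochs. A rough sketch of the two schedules (plain Python; the epoch counts and milestones are illustrative, not the repo's exact settings):

```python
import math

def cosine_lr(base_lr, epoch, total_epochs):
    """Cosine-annealed learning rate: smooth decay from base_lr toward 0."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

def step_lr(base_lr, epoch, milestones=(150, 225), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Near the end of a 300-epoch run, the cosine LR is nearly zero,
# so the network is almost frozen and best ~= last.
print(cosine_lr(0.1, 299, 300))  # tiny, close to 0
print(step_lr(0.1, 299))         # still a fixed small step size
```

With cosine annealing the final epochs take vanishingly small steps, which is consistent with the observation above that best and last checkpoints agree to within ±0.1%.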
All the experiments we re-implemented were conducted in the same experimental setting, and the best model was selected for every other method as well, so there are no cheating or fair-comparison issues.
Our best model's performance is not an instantaneous peak: we ran the experiment several times and report the mean of the best performances.
from cutmix-pytorch.
@ildoonet
For clarification and further discussion, I am re-opening this issue.
The baseline has a similar top-1 error to the one your paper reports (16.45), but with CutMix (p=0.5), the result is somewhat poor compared to the reported value (14.23).
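For anyone following along, here is a rough sketch of the per-batch logic under discussion: with probability p=0.5 a mixing ratio is drawn from a Beta distribution, a box is cut, and the label-mixing weight is corrected by the actual pasted area. This is a plain-Python illustration with assumed names (`rand_bbox`, `cutmix_lambda`), not the repo's exact code:

```python
import random

def rand_bbox(W, H, lam):
    """Sample a box whose area is roughly (1 - lam) of the image."""
    cut_rat = (1.0 - lam) ** 0.5
    cut_w, cut_h = int(W * cut_rat), int(H * cut_rat)
    cx, cy = random.randint(0, W - 1), random.randint(0, H - 1)  # box center
    x1, y1 = max(cx - cut_w // 2, 0), max(cy - cut_h // 2, 0)
    x2, y2 = min(cx + cut_w // 2, W), min(cy + cut_h // 2, H)
    return x1, y1, x2, y2

def cutmix_lambda(W=32, H=32, alpha=1.0, p=0.5):
    """Label-mixing weight for one batch; 1.0 means CutMix was skipped."""
    if random.random() >= p:
        return 1.0                      # skip CutMix for this batch
    lam = random.betavariate(alpha, alpha)
    x1, y1, x2, y2 = rand_bbox(W, H, lam)
    # Correct lam to the true pasted-box area (clipping can shrink the box).
    return 1.0 - (x2 - x1) * (y2 - y1) / (W * H)
```

The area correction matters because the sampled box is clipped at the image border, so the realized mixing ratio can differ from the Beta draw.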
I re-ran our code on CIFAR100 three times, and we got:
| run | at 300 epochs | best acc |
|---|---|---|
| try1 | 14.78 | 14.23 |
| try2 | 15.44 | 14.50 |
| try3 | 15.00 | 14.68 |
| average | 15.07 | 14.47 |
Also, for ImageNet-1K, we got:
| run | at 300 epochs | best |
|---|---|---|
| try1 | 21.20 | 21.19 |
| try2 | 21.61 | 21.61 |
| try3 | 21.40 | 21.40 |
| average | 21.403 | 21.400 |
Interestingly, we got the best performance near the last epoch of training.
I wonder whether it is right to use the best validation accuracy. As you can see, the converged model's accuracy is slightly lower than the best one, and it is hard to be sure that the best accuracy represents the model's true performance.
For the ImageNet-1K task, many methods report their best validation accuracy during training because they cannot access the test dataset. Of course, we will add a statement to our final paper: "We report the best performance during training."
Thanks for the clarification.
We evaluated on the CIFAR datasets using the same evaluation strategy, and we tried our best to reproduce the baselines (mixup, cutout, and so on) and reported their best performance for a fair comparison.
But I have a question about what the "true performance" is, as you put it. I'm not sure that the only way to represent a model's true performance is to report the last epoch's performance, because the model can fluctuate at the end of training and we cannot guarantee that it has converged at the last epoch. Therefore, researchers usually train models and pick the best one by validating on the validation set.
In short, we chose the best model to represent the performance of the method, and I think both approaches, selecting the best model or the last model, make sense for evaluating trained models.
But your comments about the best and last models are well worth considering in future work.
But your comments about the best and last models are very worth to consider for future work.
When I worked on Fast AutoAugment, I used the converged value instead of the instantaneous peak, and as far as I know, AutoAugment measures performance in the same way.
First, nice work on Fast AutoAugment!
My guess is that Fast AutoAugment and AutoAugment use cosine learning-rate decay, so they fluctuate less at the end of training and their best and last performance would be similar.
I recently found that CutMix + a cosine learning rate works well on the CIFAR datasets, so we will report both the best and the last performance when using a cosine learning rate. I hope the gap between the best and the last models will be smaller than with the current training scheme.
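Reporting both numbers boils down to tracking two statistics per run. A minimal sketch of that bookkeeping (plain Python over a per-epoch accuracy trace; the helper name `summarize_run` and the example values are illustrative):

```python
def summarize_run(val_accs):
    """Given per-epoch validation accuracies, report best and last."""
    best_epoch = max(range(len(val_accs)), key=lambda e: val_accs[e])
    return {
        "best": val_accs[best_epoch],   # peak validation accuracy
        "best_epoch": best_epoch,       # when the peak occurred
        "last": val_accs[-1],           # converged (final-epoch) accuracy
        "gap": val_accs[best_epoch] - val_accs[-1],
    }

# Example trace: accuracy climbs, peaks, then fluctuates slightly.
trace = [60.0, 70.0, 84.9, 85.3, 85.0]
print(summarize_run(trace))
```

Reporting the `gap` alongside both numbers also makes it easy to see whether a schedule (e.g. cosine vs. step) actually narrows the difference between the peak and the converged model.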
'Cheating' is a rather harsh word. However, comparing peak values does favor oscillating, risky methods.
- How can I reproduce your result? In particular, with your provided code and sample commands, I should be able to reproduce 14.23% top-1 error with PyramidNet+CutMix. It would be great if you could provide the specific environment and command to reproduce the result, or perhaps this will help you find problems in this repo.
We used PyTorch 1.0.0 and Tesla P40 GPUs. The paper's experiments were conducted on our cloud system (NSML).
I recently re-tested our code on a local machine for CIFAR100 and ImageNet using this repo, and I got slightly lower performance on CIFAR100 (top-1 error 14.5~14.6, similar to your report) but better performance on ImageNet (top-1 error 21.4). One possible reason is the difference between the cloud system and the local machines.
We note that the results (top-1 error 14.5 on CIFAR100) are still much better than the important baselines (cutout, mixup, etc.). In the camera-ready version of our paper, we may update the performance to 14.5 on CIFAR100 and 21.4 on ImageNet for better reproducibility on local machines.
- Did you use the last validation accuracy after training, or the best (peak) validation accuracy during training? I saw some code tracking the best validation accuracy during training and printing the value before terminating, so I assume you used the best (peak) validation accuracy.
As you can see in the code, we choose the best validation accuracy.
Thanks!
@hellbell Thanks, I guess that this reproducibility issue is not from the environment.
I wonder whether it is right to use the best validation accuracy. As you can see, the converged model's accuracy is slightly lower than the best one, and it is hard to be sure that the best accuracy represents the model's true performance. When I worked on Fast AutoAugment, I used the converged value instead of the instantaneous peak, and as far as I know, AutoAugment measures performance in the same way.
Anyway, thanks for the clarification.
@ildoonet
I agree with some points in your reply, and it is worth looking at final-performance (converged-performance) comparisons. But our paper also reports the best performance of the other algorithms, re-implemented for a fair comparison. Only the few methods that we could not reproduce were reported with their original papers' scores.
Anyway, thank you for the constructive comments!
I guess that if you mention both the best and the converged accuracy, it will be okay. Reporting only an instantaneous peak can be considered cheating, or validation over-fitting.
But as you say, the true performance of a model is hard to measure even if we have a held-out set used only for testing.
Also, I have trained many models with a cosine learning rate and saw similar gaps there too.
Anyway, thanks for your consideration and long explanations. This has given me a lot to think about.
If choosing the best performance is cheating, then many people are cheating, so I don't agree with your point. Rather, @hellbell is fairly correct. Thanks for the interesting work!
I deeply apologize for the misunderstanding caused by my poor English and word choice. CutMix inspired me a lot and has helped me a lot in my research.