Hi @rwightman
Thank you for the insightful comments/questions. For starters, our work is built on top of timm==0.5.4 (default settings, etc.). In addition, all experiments used 4 nodes (4 x 8 V100 = 32 GPUs). More details/answers regarding each question:
- For GC ViT Tiny, the uploaded model weights/logs use a total batch size of 32 x 128 (N_gpus * batch_size_per_gpu) = 4096. However, we have also trained with a total batch size of 32 x 32 = 1024 and achieved very similar results (please see the table below). With a local batch size of 128 (total 4096), we use a learning rate of 0.005 (as specified here); with a local batch size of 32 (total 1024), we use a learning rate of 0.001.
- The global batch size is 32 x 128 = 4096 when the local batch size is 128. We used 32 GPUs as specified above.
- We did not run into any sensitivities (epsilons, etc.) and used all the defaults from timm==0.5.4. In fact, I have uploaded the entire config file, as generated by timm, in this link for a thorough overview of all hyper-parameters.
- Yes, we achieve a slight improvement using EMA, and generally find EMA to be useful. Results for experiments with and without EMA are listed below.
- We actually did use AMP for all experiments, as indicated in the config file. But for clarity, I have also added --amp to the training commands.
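The two learning rates quoted above are roughly consistent with the common linear-scaling heuristic (LR proportional to global batch size). A minimal sketch with the thread's numbers; note the heuristic itself is an assumption for illustration, not something the authors state they applied:

```python
def scale_lr(base_lr: float, base_batch: int, global_batch: int) -> float:
    """Linear learning-rate scaling: LR grows proportionally with global batch size."""
    return base_lr * global_batch / base_batch

# Starting from the smaller setup (LR 0.001 at global batch 1024), linear
# scaling predicts 0.004 at global batch 4096 -- close to, but not exactly,
# the 0.005 actually used for the larger setup.
print(scale_lr(0.001, 1024, 4096))  # -> 0.004
```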
model | top-1 | local batch size | global batch size | EMA | AMP |
---|---|---|---|---|---|
GCViT-T | 83.40 | 128 | 4096 | Yes | Yes |
GCViT-T | 83.38 | 128 | 4096 | No | Yes |
GCViT-T | 83.39 | 32 | 1024 | Yes | Yes |
GCViT-T | 83.37 | 32 | 1024 | No | Yes |
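As background for the EMA column above, here is a minimal plain-Python sketch of how a weight EMA works in principle (the decay value is illustrative; the actual runs used timm's model-EMA implementation with its defaults, not this class):

```python
class WeightEMA:
    """Keeps an exponential moving average of model weights.

    After each optimizer step: ema = decay * ema + (1 - decay) * current.
    At evaluation time, the EMA weights are used instead of the raw weights.
    """

    def __init__(self, weights: dict, decay: float = 0.9998):
        self.decay = decay
        self.shadow = dict(weights)  # EMA copy, initialized from the model

    def update(self, weights: dict) -> None:
        d = self.decay
        for name, value in weights.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value

# Toy usage: one scalar "weight" that has jumped to 1.0; the EMA trails it.
ema = WeightEMA({"w": 0.0}, decay=0.9)
for _ in range(3):
    ema.update({"w": 1.0})
print(round(ema.shadow["w"], 3))  # 1 - 0.9**3 = 0.271
```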
In addition to the above, we also used the Swin Transformer epoch-based scheduler, by slightly modifying timm's iteration-based scheduler (link here). Our motivation was to be directly comparable with the Swin training settings. We will update the arXiv manuscript to reflect this information very soon.
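A minimal sketch of the difference between the two scheduler granularities being discussed (pure Python; the epoch/step counts and peak LR are illustrative, not the paper's exact settings):

```python
import math

def cosine_lr(progress: float, lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine annealing from lr_max down to lr_min as progress goes 0 -> 1."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

EPOCHS, STEPS_PER_EPOCH, LR_MAX = 300, 10, 5e-3  # made-up counts for illustration

def lr_epoch_based(epoch: int, step: int) -> float:
    """Swin-style: progress advances once per epoch, so LR is flat within an epoch."""
    return cosine_lr(epoch / EPOCHS, LR_MAX)

def lr_step_based(epoch: int, step: int) -> float:
    """Iteration-style: progress advances every training step."""
    total = EPOCHS * STEPS_PER_EPOCH
    return cosine_lr((epoch * STEPS_PER_EPOCH + step) / total, LR_MAX)

# Within epoch 0 the epoch-based LR does not move; the step-based LR is already decaying.
print(lr_epoch_based(0, 0) == lr_epoch_based(0, 9))  # True
print(lr_step_based(0, 0) > lr_step_based(0, 9))     # True
```

Over a long schedule the two curves are nearly identical, which matches the observation below that the change makes very little difference in practice.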
Given my previous experience, I believe the timm library is the most effective and efficient way to train on ImageNet, and an easy path to reaching or surpassing SOTA without needing to change much.
@ahatamiz thank you for the detailed response. My LR needs a bit of adjustment based on that info, so I'll try another run with that and a new seed. I noticed the scheduler change; for long training runs I've found it makes very little difference (which is why I have been slow to support per-step updates).
Hi @rwightman
Sure. I totally agree that the scheduler would not make a big difference. Looking forward to seeing the results, and I would be happy to provide more details if needed.
Thanks.