I'm finetuning SD 1.5 at a high batch size of 576. (24 gradient accumulation right now for t

Strange Results on first step (lion-pytorch, CLOSED)

lucidrains commented on May 24, 2024

Strange Results on first step

from lion-pytorch.

Comments (11)

nbardy commented on May 24, 2024

Confirmed fixed.

mitchellnw commented on May 24, 2024

@nbardy do you get this behaviour without Triton?

One thing I noticed here is that the Triton code uses auto-tune plus in-place updates, which may cause issues. On the first step, multiple different kernels are launched that all do the same thing, to see which is fastest. This is unique to the first step. Usually this is not a problem when training from scratch, because warmup is used, but it may be here.
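
To make the concern concrete, here is a minimal pure-Python sketch of how benchmarking an in-place kernel several times could apply the update more than once on the first step. It is not Triton's actual autotuner and not the change made in the library: `lion_step_`, `naive_autotune`, `restoring_autotune`, and `run` are hypothetical names, and plain lists stand in for GPU tensors. The second tuner shows one possible mitigation, assuming the state can be snapshotted and restored before each trial.

```python
def lion_step_(p, m, g, lr=1e-4, wd=0.0, beta1=0.9, beta2=0.99):
    """In-place Lion-style update on plain lists (stand-in for a GPU kernel)."""
    for i in range(len(p)):
        c = beta1 * m[i] + (1 - beta1) * g[i]      # interpolate momentum and gradient
        direction = (c > 0) - (c < 0)              # sign of the interpolation
        p[i] = p[i] * (1 - lr * wd) - lr * direction
        m[i] = beta2 * m[i] + (1 - beta2) * g[i]   # update the momentum buffer


def naive_autotune(kernel, p, m, g, n_configs):
    """Hypothetical tuner: runs the same in-place kernel once per candidate config."""
    for _ in range(n_configs):
        kernel(p, m, g)  # every trial mutates p and m again


def restoring_autotune(kernel, p, m, g, n_configs):
    """Hypothetical tuner that restores the inputs before each trial."""
    p0, m0 = list(p), list(m)  # snapshot the state the kernel will overwrite
    for _ in range(n_configs):
        p[:] = p0
        m[:] = m0
        kernel(p, m, g)  # net effect equals exactly one update


def run(tuner, n_configs=4):
    """Apply one 'first step' through the given tuner and return (params, momentum)."""
    p, m, g = [1.0, -2.0], [0.0, 0.0], [0.5, -0.5]
    tuner(lion_step_, p, m, g, n_configs)
    return p, m


naive_p, _ = run(naive_autotune)
safe_p, _ = run(restoring_autotune)
print("naive tuner moved p[0] by", 1.0 - naive_p[0])
print("restoring tuner moved p[0] by", 1.0 - safe_p[0])
```

With a full learning rate and no warmup, as in finetuning, the naive variant moves each parameter by roughly `n_configs` times the intended step, which is consistent with the first-step jump described above. With warmup the learning rate starts near zero, so the repeated updates would be negligible.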

xiangning-chen commented on May 24, 2024

Hi, thanks for the datapoint.

Do you have a comparison of the commands used for running with Lion and AdamW?

nbardy commented on May 24, 2024

@xiangning-chen same command besides the lr_opt value

xiangning-chen commented on May 24, 2024

Oh I meant the learning_rate, lr_end, and weight decay comparison for Lion and AdamW.

nbardy commented on May 24, 2024

They are in the main post.

‘Relevant Code’ is for lion
‘Relevant parameters’ is for Adam

xiangning-chen commented on May 24, 2024

@nbardy Sorry, I'm a bit confused: in Relevant parameters you set the --lion_opt flag, but this is for Adam?
Can you please just tell me the learning_rate, lr_end, and weight decay for Lion and AdamW respectively, thanks!

lucidrains commented on May 24, 2024

@mitchellnw thanks for bringing this to my attention Mitchell!

@nbardy do you want to see if 6ab873a addresses the issue?

nbardy commented on May 24, 2024

I have finally got back to training more diffusion models.

I tried upgrading to lion-pytorch==0.1.2 and it seems I'm still getting a reset on the first step.

https://wandb.ai/nbardy-facet/sd_xl_train_t2iadapter/runs/eey3bj1n?workspace=user-nbardy-facet

nbardy commented on May 24, 2024

lion-pytorch==0.1.2
pytorch-triton==2.1.0+e650d3708b
triton==2.0.0
torch==2.0.1

nbardy commented on May 24, 2024

Turned off lion and it’s still there. This is probably something else from my changes. Will test more next week.
