lucidrains / lion-pytorch
🦁 Lion, new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(w), in Pytorch
License: MIT License
Hi, thanks for your great work!
I set use_triton=True and turned on automatic mixed precision training, but inf appeared in the results.
Does lion_pytorch/triton.py need to consider bf16 or fp16?
It seems issue #20 (comment) is similar to mine.
In my case, simply defining the optimizer and running training is enough to trigger the problem.
opt = Lion(
    params,
    lr=lr,
    betas=(0.95, 0.98),
    use_triton=True
)
...
trainer.fit(model, datamodule)
There are two exceptions. The first is a KeyError in update_fn_kernel:
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-d6252949da17ceb5f3a278a70250af13-1af5134066c618146d2cd009138944a0-2d732a2488b7ed996facc3e641ee56bf-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float32, torch.float32, torch.float32, dtype('float64'), 'fp32', 'fp32', 'fp32', 'i32'), (128,), (True, True, True, (False,), (False,), (False,), (False,), (True, False)))
During handling of the exception above, a second exception occurs in triton/runtime/jit.py:
path/python3.9/site-packages/triton/runtime/jit.py:190 in _type_of
  187             return f'*{ty}'
  188         if key is None:
  189             return '*i8'
❱ 190         assert isinstance(key, str)
  191         return key
  192
  193     def _make_signature(self, sig_key):
The key above is float64.
Please give me some instruction... Thanks!
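One possible reading of the traceback: the fourth entry in the KeyError signature is dtype('float64'), which suggests that a scalar argument reached the kernel as a NumPy float64 rather than a built-in Python float. Under that assumption, a minimal workaround sketch is to coerce scalar hyperparameters to plain floats before building the optimizer. The helper name coerce_hyperparams is hypothetical and not part of lion-pytorch, and Fraction is used here only as a stand-in for a non-builtin numeric type.

```python
from fractions import Fraction


def coerce_hyperparams(**hparams):
    """Return a copy of hparams with numeric scalars cast to plain Python floats.

    Hypothetical helper (not part of lion-pytorch). Tuples and lists such as
    betas are converted element-wise; booleans and non-numeric values are
    passed through unchanged.
    """
    def cast(value):
        if isinstance(value, bool):
            return value
        if isinstance(value, (tuple, list)):
            return tuple(cast(v) for v in value)
        if hasattr(value, "__float__"):
            return float(value)
        return value

    return {name: cast(value) for name, value in hparams.items()}


# Fraction stands in for any non-builtin numeric type, e.g. numpy.float64.
clean = coerce_hyperparams(lr=Fraction(1, 10000), betas=(0.95, 0.98), use_triton=True)
```

The cleaned values could then be passed on as Lion(params, **clean). Whether this resolves the error depends on the assumption above being the actual cause.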
Thanks for implementing and testing our lion optimizer!
Just wondering did you also enlarge the decoupled weight decay to maintain the regularization strength?
best,
--xiangning
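The question refers to the guidance from the Lion paper, repeated in this repository's README: the learning rate for Lion is typically 3-10x smaller than for AdamW, and since the effective decay is lr * weight_decay, the decoupled weight decay is enlarged by the same factor. A minimal sketch of that conversion follows; adamw_to_lion is a hypothetical helper, and the factor is a tuning choice rather than a fixed rule.

```python
def adamw_to_lion(lr, weight_decay, factor=3.0):
    """Map AdamW hyperparameters to a Lion starting point.

    Hypothetical helper: divides the learning rate by `factor` and multiplies
    the decoupled weight decay by the same `factor`, so that the product
    lr * weight_decay (the effective regularization strength) is unchanged.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return lr / factor, weight_decay * factor


# Example values are illustrative only.
lion_lr, lion_wd = adamw_to_lion(lr=3e-4, weight_decay=1e-2, factor=10.0)
```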
Pytorch has AMD ROCM builds. How can lion-pytorch use those?
Update: seems to work for my local enwik8 autoregressive language modeling
Update 2: [experiments](https://api.wandb.ai/links/lucidrains/d4v6c8sl), seems much worse than Adam if learning rate held constant
Update 3: Dividing the learning rate by 3, seeing better early results than Adam. Maybe Adam has been dethroned, after nearly a decade.
Update 4: using the 10x smaller learning rate rule of thumb from the paper resulted in the worst run. so I guess it still takes a bit of tuning
One of the main benefits of LION is that it needs to store less state per parameter.
Adam needs to store both the momentum and RMSProp EMAs, while LION only needs to store the momentum EMA.
Yet when I try LION, it takes exactly the same amount of memory as AdamW.
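For reference, here is a plain-Python sketch of the Lion update rule as described in the paper, written over lists of floats rather than tensors. It is not the repository's implementation; the function name lion_step is an assumption. It illustrates that the only per-parameter state is a single momentum buffer (exp_avg), whereas Adam keeps two.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, exp_avg, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
    """Apply one Lion step in place to `params` and `exp_avg` (lists of floats)."""
    beta1, beta2 = betas
    for i, (p, g, m) in enumerate(zip(params, grads, exp_avg)):
        # Decoupled weight decay, applied directly to the parameter.
        p = p * (1 - lr * weight_decay)
        # The update direction is the sign of an interpolation between
        # the momentum and the current gradient.
        update = sign(beta1 * m + (1 - beta1) * g)
        params[i] = p - lr * update
        # The momentum is the only state carried to the next step.
        exp_avg[i] = beta2 * m + (1 - beta2) * g


params = [1.0, -2.0]
grads = [0.5, -0.25]
exp_avg = [0.0, 0.0]  # one buffer per parameter
lion_step(params, grads, exp_avg, lr=0.1)
```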
Hi, thanks for this implementation.
I was training a GPT2-medium pretraining model, but found that the loss explodes immediately when resuming from a checkpoint with use_triton=True. If use_triton=False is set, then everything works fine.
I wonder if there's any problem in the triton implementation.
This issue actually still persists. My python environment:
python 3.10
accelerate 0.25.0
bitsandbytes 0.41.3.post2
black 23.7.0
datasets 2.14.7
flash-attn 2.4.2
huggingface-hub 0.17.3
lion-pytorch 0.1.2
networkx 3.1
numpy 1.26.3
pandas 2.1.4
pip 23.3.1
safetensors 0.4.1
tokenizers 0.14.1
torch 2.1.2
transformers 4.34.1
triton 2.1.0
My training setup is single-node multi-GPU with FSDP. FSDP config is:
fsdp_config:
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
mixed_precision: bf16
i.e., using FULL_SHARD FSDP + bf16 AMP. In this case, use_triton=True results in problems when resuming training from a checkpoint: the gradient scale explodes along with the loss.
If I train with use_triton=False, save, then resume, there's no problem.
Originally posted by @syncdoth in #20 (comment)
I'm finetuning SD 1.5 at a high batch size of 576 (24 gradient accumulation steps right now for testing on 1 GPU before scaling up).
Trying to get LION working. Getting very strange results on the first step. It seems to reset the model to a weird texture-filled state.
Step 0 on the left and Step 1 on the right
Here is one I let run longer. It seems to actually be converging 🤔, but still has the same reset problem at the start.
Step 500, Step 1000, Step 1500
Relevant Code:
from lion_pytorch import Lion
optimizer = Lion(
    params_to_optimize,
    lr=args.learning_rate,
    weight_decay=1e-2,
    betas=(0.95, 0.98),
    use_triton=True  # set this to True to use cuda kernel w/ Triton lang (Tillet et al)
)
Relevant parameters:
--batch_size=24 --learning_rate=7.0e-7 --gradient_accumulation_steps=24 --lion_opt --lr_end=7.0e-9 --lr_scheduler=cosine_with_restarts --lr_num_cycles=20 --lr_warmup_steps=0 --max_train_steps=10000 --mixed_precision bf16
No apparent sharp decrease in loss?
More samples across many prompts:
Note black squares are just NSFW filter I believe
Thanks for creating this great implementation and for the helpful discussions! I was wondering if there's any plan to add this to the optimizers in the official pytorch codebase. We could refer to this repo, the author's repo, and the paper as references in the docstring, similar to the lion implementation in the official keras codebase so that more people can benefit from it.
Thank you for that implementation!
There was an older optimizer[0] that would update the learning-rate according to the following rule (basically increasing the learning-rate when things are going in the same direction and decreasing it when they are not):
epsilon = 0.01
same_sign = sign(update[weight]) == sign(previous_update[weight])
lr[weight] = torch.where(same_sign, lr[weight]*(1+epsilon), lr[weight]/(1+epsilon))
It requires an epsilon parameter. However, at that price, you can find an epsilon and starting learning rate that work well for a large range of problems and not have to think about learning rate scheduling nor optimal learning-rate for your given problem.
Given the regularity of the update step of lion, it might be worth playing with.
[0]: I just searched for an exact reference but could not find it anymore, it was before Adam's domination over the ML world.
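The rule quoted above can be sketched in plain Python over lists of per-weight values. This is only an illustration of the pseudocode in this comment, not a reconstruction of the unnamed original optimizer; the function name adapt_learning_rates is hypothetical.

```python
def adapt_learning_rates(lrs, updates, previous_updates, epsilon=0.01):
    """Return new per-weight learning rates.

    A weight's learning rate grows by a factor (1 + epsilon) when its current
    and previous updates have the same sign, and shrinks by the same factor
    otherwise.
    """
    def sign(x):
        return (x > 0) - (x < 0)

    return [
        lr * (1 + epsilon) if sign(u) == sign(prev) else lr / (1 + epsilon)
        for lr, u, prev in zip(lrs, updates, previous_updates)
    ]


# First weight keeps its direction, second weight flips direction.
new_lrs = adapt_learning_rates([0.1, 0.1], [1.0, -1.0], [0.5, 2.0])
```

Because Lion's update is already a sign vector, the comparison would amount to checking whether consecutive Lion updates agree for each weight.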
Hey mates,
thanks for your great work. I am wondering now if the Lion optimizer will work with grad accumulation in pytorch (using transformer API for NLP tasks)?
I read some work, which is built on Lion, but enabled the gradient accumulation, by setting β1=β2=β as a special case of Lion. The author named the optimizer "Tiger" (Tight-fisted Optimizer); here is the official implementation (you might need to translate the website into English with Chrome) and the rewrite in pytorch.
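One consequence of setting β1=β2=β can be checked directly from the Lion update rule as given in the paper: the interpolation whose sign gives the update becomes identical to the new momentum, so the step direction can be read off the momentum buffer itself. The sketch below verifies this numerically. It says nothing about Tiger's official implementation, and the function name is an assumption.

```python
def lion_interpolation_and_momentum(m, g, beta1, beta2):
    """Return (interpolation used for the sign update, new momentum)."""
    interpolated = beta1 * m + (1 - beta1) * g
    new_momentum = beta2 * m + (1 - beta2) * g
    return interpolated, new_momentum


samples = [(0.2, -1.0), (-0.5, 0.3), (0.0, 0.7)]  # (momentum, gradient) pairs

# With beta1 == beta2 the two quantities coincide for every sample.
beta = 0.95
tied = [lion_interpolation_and_momentum(m, g, beta, beta) for m, g in samples]
all_equal = all(c == new_m for c, new_m in tied)

# With the usual distinct betas they generally differ.
c, new_m = lion_interpolation_and_momentum(0.2, -1.0, 0.9, 0.99)
differs = c != new_m
```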
I've been experimenting with the LION optimizer in your other (great) Imagen repository. I can share my anecdotal experience and combinations:
beta1 0.95 and beta2 0.98
1e-4, 3e-5 and 1e-5.
True and False.
Training was indeed fast but unfortunately in the end always ended up yielding NaNs.
I think a potential issue could be how LION interacts with a warmup schedule; I am not sure if you're supposed to do warmup with this optimizer or not (which I always did).
Hi, I have been testing this out on some diffusion models I am training.
Convergence seems decent (somewhat faster than AdamW, using 1/10th the learning rate, 10x weight decay).
However, I recently paused a few experiments and tried to resume, and the loss explodes immediately. I do not face this issue when resuming AdamW runs.
I have also found it necessary to use an LR warm-up period in my runs (even with the 1/10th learning rate), which again is not required with AdamW. I'll try to do a bit more digging to see if I can track down the source of instability. However, for experiment resuming, surely if I load the optimizer state correctly things should resume as expected?
My only thought is whether something could be going wrong with saving of EMA / moving average statistics? If I get a chance to dig into this more I'll let you know what I find. (Possibly I am doing something wrong).
any new update for this optimizer?
Hi @lucidrains, thanks for this implementation.
I wonder if you're using distributed training for your experiments. If so, as noted in Accelerate's docs, do you scale your learning rate based on the number of processes (GPUs), on top of the downscaling for the LION optimizer (even if you're not using Accelerate)?
If you don't scale learning rate, do you recommend doing so?
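For context, the scaling being asked about is commonly the linear rule: the effective batch size grows with the number of processes, and the base learning rate is multiplied by the same count. The sketch below shows that arithmetic only. It is a heuristic, not a recommendation from this repository, and both function names are hypothetical.

```python
def effective_batch_size(per_device_batch, num_processes, grad_accum_steps=1):
    """Number of samples contributing to one optimizer step."""
    return per_device_batch * num_processes * grad_accum_steps


def linearly_scaled_lr(base_lr, num_processes):
    """Linear scaling heuristic: multiply the base learning rate by the process count."""
    return base_lr * num_processes


# One GPU, batch size 24, 24 accumulation steps (the setup from an earlier thread).
single_gpu_batch = effective_batch_size(24, num_processes=1, grad_accum_steps=24)

# Illustrative values: a base learning rate of 1e-4 spread over 8 processes.
scaled = linearly_scaled_lr(1e-4, num_processes=8)
```

Whether this rule combines well with the smaller learning rates Lion tends to need is exactly the open question in this thread.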
Thank you so much for your great implementations! My collaborators and I have recently posted a manuscript online (available at https://arxiv.org/abs/2307.10053) that provides convergence guarantees for the Lion optimizer, especially in the training of nonsmooth neural networks. Moreover, we present some of our implementations in our repo (available at https://github.com/xnchxy/GeneralSGD). We hope our results can help improve the Lion optimizer.
Best regards,
Nachuan
Hi Phil,
I have been testing some different Lion hyperparameters with PaLM at the 1B scale (Total batch size 192. ~1.6 million tokens a batch). Using a decoupled weight decay of 0.1 for all runs. So far the best configuration was:
This had about a 0.2 loss improvement over AdamW. Memory consumption was ~4% lower. Iteration time dropped by about 0.14, from 1.65 to 1.51.
Wandb logs:
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-43-11---Vmlldzo0MzE0MTcy
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-48-41---Vmlldzo0MzE0MjAz
I am going to be testing at the 2B scale next and report the results. I am going to try adjusting the learning rate and betas more as well. I was wondering if you had noticed a significant difference in performance as you increased the size of the model?
Thank you,
Enrico
How does this optimizer perform? What's your lr and batch size when compared with AdamW?
I utilize the parameters almost the same as AdamW in my NLP task, which performs well.
Tried enabling Triton and ran into the above issue. I installed the latest version of Triton on PyPI (1.1.1). Is there a particular version this library is compatible with?