lucidrains / lion-pytorch

🦁 Lion, a new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(W), in PyTorch

License: MIT License

Python 100.00%

artificial-intelligence deep-learning optimizers evolutionary-search

lion-pytorch's Issues

Using Triton with PyTorch 2.0 for AMP training results in tensors containing inf values.

Hi, thanks for your great work!

I set use_triton=True and turned on automatic mixed precision training, but inf values appeared in the results.
Does lion_pytorch/triton.py need to handle bf16 or fp16?

Issue #20 (comment) seems similar to mine.

KeyError in update_fn_kernel when use_triton=True

In my case, simply defining the optimizer and running training is enough to trigger the problem.

opt = Lion(
    params,
    lr=lr,
    betas=(0.95, 0.98),
    use_triton=True
)
...
...
...
trainer.fit(model, datamodule)

There are two exceptions. The first is a KeyError in update_fn_kernel:

KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-d6252949da17ceb5f3a278a70250af13-1af5134066c618146d2cd009138944a0-2d732a2488b7ed996facc3e641ee56bf-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float32, torch.float32, torch.float32, dtype('float64'), 'fp32', 'fp32', 'fp32', 'i32'), (128,), (True, True, True, (False,), (False,), (False,), (False,), (True, False)))

While handling the exception above, a second exception occurs in triton/runtime/jit.py:

 path/python3.9/site-packages/triton/runtime/jit.py:190 
│ in _type_of                                                                                      │
│                                                                                                  │
│   187 │   │   │   return f'*{ty}'                                                                │
│   188 │   │   if key is None:                                                                    │
│   189 │   │   │   return '*i8'                                                                   │
│ ❱ 190 │   │   assert isinstance(key, str)                                                        │
│   191 │   │   return key                                                                         │
│   192 │                                                                                          │
│   193 │   def _make_signature(self, sig_key):

The key in the assertion above is float64.

Please give me some instruction... Thanks!
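The dtype('float64') entry in the key is a NumPy dtype rather than a type string, which suggests (an assumption, not something confirmed in this thread) that one of the scalar arguments reaching the kernel, such as lr, is a NumPy scalar instead of a built-in float. A minimal sketch of a workaround under that assumption is to coerce numeric hyperparameters to built-in types before constructing the optimizer. The helper name coerce_hparams is hypothetical and not part of lion-pytorch.

```python
import numbers
from fractions import Fraction  # stand-in for a non-builtin numeric type


def coerce_hparams(**hparams):
    """Return a copy of hparams with real-number values as built-in floats.

    Hypothetical helper: bools and non-numeric values pass through unchanged,
    and tuples/lists such as betas are converted element by element.
    """
    def convert(value):
        if isinstance(value, bool):
            return value  # keep flags like use_triton as bools
        if isinstance(value, numbers.Real):
            return float(value)  # e.g. numpy.float64 -> built-in float
        if isinstance(value, (tuple, list)):
            return type(value)(convert(v) for v in value)
        return value

    return {name: convert(value) for name, value in hparams.items()}


# Fraction is used here only so the example runs without NumPy installed.
kwargs = coerce_hparams(lr=Fraction(1, 10000), betas=(0.95, 0.98), use_triton=True)
print(kwargs)
```

The resulting dictionary could then be unpacked into the optimizer call, for example Lion(params, **kwargs), assuming the scalar type really is the cause.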

Did you increase the decoupled weight decay simultaneously when decreasing the learning rate?

Thanks for implementing and testing our lion optimizer!
Just wondering did you also enlarge the decoupled weight decay to maintain the regularization strength?

best,
--xiangning
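With decoupled weight decay, each step shrinks the weights by a factor of 1 - lr * weight_decay, so lowering lr alone also weakens the regularization. A minimal sketch of the scaling the question refers to: divide the learning rate and multiply the weight decay by the same factor so their product stays constant. The function name adamw_to_lion and the default factor are illustrative assumptions, not values prescribed by this repository.

```python
def adamw_to_lion(adamw_lr, adamw_weight_decay, factor=3.0):
    """Scale AdamW hyperparameters into a Lion starting point.

    Divides the learning rate by `factor` and multiplies the decoupled weight
    decay by the same factor, so lr * weight_decay is unchanged.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return adamw_lr / factor, adamw_weight_decay * factor


# Illustrative values only: a 10x smaller learning rate with 10x the weight decay.
lion_lr, lion_wd = adamw_to_lion(3e-4, 0.01, factor=10.0)
print(lion_lr, lion_wd)
```

These are starting points for a sweep rather than guaranteed-good settings.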

AMD ROCM versions

Pytorch has AMD ROCM builds. How can lion-pytorch use those?

What is the best learning rate you have found for LoRA and DreamBooth? Thank you.

Update: seems to work for my local enwik8 autoregressive language modeling

Update 2: [experiments](https://api.wandb.ai/links/lucidrains/d4v6c8sl), seems much worse than Adam if learning rate held constant

Update 3: Dividing the learning rate by 3, seeing better early results than Adam. Maybe Adam has been dethroned, after nearly a decade.

Update 4: Using the 10x smaller learning rate rule of thumb from the paper resulted in the worst run, so I guess it still takes a bit of tuning.

Same amount of VRAM is taken as in AdamW

One of the main benefits of Lion is that it needs to store less state per parameter.
Adam needs to store both the momentum and RMSProp EMAs, while Lion only needs to store the momentum EMA.
However, when I try Lion, it takes exactly the same amount of memory as AdamW.
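As a rough sanity check on the expected saving, the optimizer state can be estimated from the parameter count. The sketch below assumes fp32 state (4 bytes per value) and counts only optimizer state; weights, gradients, and activations are unaffected by the choice of optimizer, and memory reported by external tools may include allocator caching, so the difference can be smaller than expected or hard to see in practice. The function name and the 1B parameter count are illustrative.

```python
def optimizer_state_bytes(num_params, states_per_param, bytes_per_value=4):
    """Estimate the bytes needed for per-parameter optimizer state."""
    return num_params * states_per_param * bytes_per_value


num_params = 1_000_000_000  # illustrative model size

adamw = optimizer_state_bytes(num_params, states_per_param=2)  # two EMAs per parameter
lion = optimizer_state_bytes(num_params, states_per_param=1)   # momentum EMA only

print(f"AdamW state: {adamw / 2**30:.2f} GiB")
print(f"Lion state:  {lion / 2**30:.2f} GiB")
```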

Loss explodes when resuming using Triton implementation.

Hi, thanks for this implementation.

I was pretraining a GPT2-medium model and found that the loss explodes immediately when resuming from a checkpoint with use_triton=True. With use_triton=False, everything works fine.

I wonder if there's any problem in the triton implementation.

Do you have the actual weights trained from the paper?

This issue actually still persists. My python environment:

python                    3.10
accelerate                0.25.0
bitsandbytes              0.41.3.post2
black                     23.7.0
datasets                  2.14.7
flash-attn                2.4.2
huggingface-hub           0.17.3
lion-pytorch              0.1.2
networkx                  3.1
numpy                     1.26.3
pandas                    2.1.4
pip                       23.3.1
safetensors               0.4.1
tokenizers                0.14.1
torch                     2.1.2
transformers              4.34.1
triton                    2.1.0

My training setup is single-node multi-GPU with FSDP. FSDP config is:

fsdp_config:
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
mixed_precision: bf16

That is, FULL_SHARD FSDP with bf16 AMP. In this case, use_triton=True causes problems when resuming training from a checkpoint: the gradient scale explodes along with the loss.

If I train with use_triton=False, save, then resume, there's no problem.

Originally posted by @syncdoth in #20 (comment)

Strange Results on first step

I'm fine-tuning SD 1.5 at a high effective batch size of 576 (batch size 24 with 24 gradient accumulation steps right now, for testing on 1 GPU before scaling up).

I'm trying to get Lion working and getting very strange results on the first step. It seems to reset the model to a weird, texture-filled state.

Step 0 on the left and Step 1 on the right

Here is one I let run longer. It seems to actually be converging 🤔, but it still has the same reset problem at the start.
Step 500, Step 1000, Step 1500

Relevant Code:

from lion_pytorch import Lion

optimizer = Lion(
    params_to_optimize,
    lr=args.learning_rate,
    weight_decay=1e-2,
    betas=(0.95, 0.98),
    use_triton=True  # set this to True to use cuda kernel w/ Triton lang (Tillet et al)
)

Relevant parameters:

--batch_size=24 --learning_rate=7.0e-7 --gradient_accumulation_steps=24 --lion_opt --lr_end=7.0e-9 --lr_scheduler=cosine_with_restarts --lr_num_cycles=20 --lr_warmup_steps=0  --max_train_steps=10000 --mixed_precision bf16

No apparent sharp decrease in loss?

More samples across many prompts:

Note: the black squares are just the NSFW filter, I believe.

Add the implementation to the official PyTorch repo

Thanks for creating this great implementation and for the helpful discussions! I was wondering if there's any plan to add this to the optimizers in the official PyTorch codebase. We could cite this repo, the author's repo, and the paper as references in the docstring, similar to the Lion implementation in the official Keras codebase, so that more people can benefit from it.

Adaptive learning-rate optimization

Thank you for that implementation!

There was an older optimizer[0] that would update the learning-rate according to the following rule (basically increasing the learning-rate when things are going in the same direction and decreasing it when they are not):

epsilon = 0.01
same_sign = sign(update[weight]) == sign(previous_update[weight])
lr[weight] = torch.where(same_sign, lr[weight]*(1+epsilon), lr[weight]/(1+epsilon))

It requires:

keeping track of the previous sign of the update vector,
storing one learning rate per weight,
adding an epsilon parameter.

However, at that price, you can find an epsilon and starting learning rate that work well for a large range of problems and not have to think about learning rate scheduling nor optimal learning-rate for your given problem.

Given the regularity of the update step of lion, it might be worth playing with.
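The rule above can be sketched in runnable form. This version uses plain Python lists in place of tensors so it runs without PyTorch; the function name adapt_learning_rates is hypothetical, and the sketch only covers the per-weight learning-rate adjustment, not a full optimizer.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def adapt_learning_rates(lrs, updates, previous_updates, epsilon=0.01):
    """Grow a weight's learning rate when its update keeps the same sign,
    shrink it otherwise. Returns a new list of per-weight learning rates."""
    new_lrs = []
    for lr, update, previous in zip(lrs, updates, previous_updates):
        if sign(update) == sign(previous):
            new_lrs.append(lr * (1 + epsilon))
        else:
            new_lrs.append(lr / (1 + epsilon))
    return new_lrs


# First weight keeps its direction, second weight flips.
print(adapt_learning_rates([0.1, 0.1], [1.0, -1.0], [1.0, 1.0]))
```

Because Lion's update is already a sign vector, the previous update could be stored compactly, though that is a design idea rather than something tested here.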

[0]: I just searched for an exact reference but could not find it anymore, it was before Adam's domination over the ML world.

Does the Lion optimizer work with grad accumulation?

Hey mates,

thanks for your great work. I am wondering now if the Lion optimizer will work with grad accumulation in pytorch (using transformer API for NLP tasks)?

I read about an optimizer built on Lion that enables gradient accumulation by setting β1=β2=β, a special case of Lion. The author named it “Tiger” (Tight-fisted Optimizer); here is the official implementation (you might need to translate the website into English with Chrome) and a rewrite in PyTorch.
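In the usual accumulation pattern, gradients from several micro-batches are combined and the optimizer steps once, so the optimizer only ever sees the accumulated gradient. A minimal sketch of that pattern with a Lion-style update, as I understand the rule from the paper, using plain lists instead of tensors; the names lion_step and accumulate and all numeric values are illustrative, and this is not the code in this repository.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
    """Apply one Lion-style update in place to params and momentum."""
    beta1, beta2 = betas
    for i, grad in enumerate(grads):
        params[i] *= 1 - lr * weight_decay                        # decoupled weight decay
        update = sign(beta1 * momentum[i] + (1 - beta1) * grad)   # sign of interpolation
        params[i] -= lr * update
        momentum[i] = beta2 * momentum[i] + (1 - beta2) * grad    # EMA of gradients


def accumulate(micro_batch_grads):
    """Average per-parameter gradients over micro-batches."""
    count = len(micro_batch_grads)
    return [sum(column) / count for column in zip(*micro_batch_grads)]


params = [1.0, -2.0]
momentum = [0.0, 0.0]
micro_batch_grads = [[0.5, -0.1], [0.3, -0.3], [-0.2, -0.2]]

grads = accumulate(micro_batch_grads)  # one averaged gradient per parameter
lion_step(params, grads, momentum)     # a single optimizer step for all micro-batches
print(params, momentum)
```

This pattern needs a separate buffer for the accumulated gradient in addition to the momentum.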

add an 8-bit version with bitsandbytes

https://github.com/TimDettmers/bitsandbytes/blob/main/compile_from_source.md

Always getting NaNs in long training

I've been experimenting with the LION optimizer in your other (great) Imagen repository. I can share my anecdotal experience and combinations:

Models of different sizes 0.2B, 0.7B and 1B params.
Betas such as beta1 0.95 and beta2 0.98
Learning rates 1e-4, 3e-5 and 1e-5.
Triton kernel both enabled and disabled.

Training was indeed fast, but it always ended up yielding NaNs.

I think a potential issue could be how LION interacts with a warmup schedule; I am not sure if you're supposed to do warmup with this optimizer or not (which I always did).
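For reference, a linear warmup of the kind mentioned here can be sketched as below. This only pins down what "warmup" means in the discussion; it does not settle whether warmup helps or hurts with Lion. The function name warmup_lr and the example values are assumptions.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate up to base_lr over warmup_steps steps.

    `step` is zero-based; after warmup (or with warmup_steps <= 0) the
    learning rate stays at base_lr.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


for step in (0, 49, 99, 500):
    print(step, warmup_lr(step, base_lr=1e-4, warmup_steps=100))
```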

Instability when resuming trains

Hi, I have been testing this out on some diffusion models I am training.

Convergence seems decent (somewhat faster than AdamW, using 1/10th the learning rate and 10x the weight decay). However, I recently paused a few experiments and tried to resume, and the loss explodes immediately. I do not face this issue when resuming AdamW runs.

I have also found it necessary to use an LR warm-up period in my runs (even with the 1/10th learning rate), which, again, is not required with AdamW. I'll try to do a bit more digging to see if I can track down the source of the instability. For experiment resuming, though, surely things should resume as expected if I load the optimizer state correctly?

My only thought is whether something could be going wrong with saving of EMA / moving average statistics? If I get a chance to dig into this more I'll let you know what I find. (Possibly I am doing something wrong).

any new update?

any new update for this optimizer?

Learning rate scaling for distributed training?

Hi @lucidrains, thanks for this implementation.

I wonder if you're using distributed training for your experiments. If so, as noted in Accelerate's docs, do you scale your learning rate based on the number of processes (GPUs), on top of the downscaling for the Lion optimizer? The question applies even if you're not using Accelerate.

If you don't scale learning rate, do you recommend doing so?
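When the per-process batch size is fixed, the effective batch size grows with the number of processes, which is why learning-rate scaling comes up. The sketch below shows two common heuristics, linear and square-root scaling; whether either suits Lion is not established in this thread, and the function name scale_lr and the values are illustrative.

```python
import math


def scale_lr(base_lr, num_processes, rule="linear"):
    """Scale a single-process learning rate for num_processes workers."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    if rule == "linear":
        return base_lr * num_processes
    if rule == "sqrt":
        return base_lr * math.sqrt(num_processes)
    if rule == "none":
        return base_lr
    raise ValueError(f"unknown rule: {rule!r}")


for rule in ("none", "sqrt", "linear"):
    print(rule, scale_lr(1e-4, num_processes=8, rule=rule))
```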

Convergence guarantees for Lion

Thank you so much for your great implementations! My collaborators and I recently posted a manuscript online (available at https://arxiv.org/abs/2307.10053) that provides convergence guarantees for the Lion optimizer, especially in the training of nonsmooth neural networks. We also present some of our implementations in our repo (available at https://github.com/xnchxy/GeneralSGD). We hope our results can help improve the Lion optimizer.

Best regards,

Nachuan

Performance experiments over AdamW

Hi Phil,

I have been testing some different Lion hyperparameters with PaLM at the 1B scale (Total batch size 192. ~1.6 million tokens a batch). Using a decoupled weight decay of 0.1 for all runs. So far the best configuration was:

3e-4
betas-90-98

This gave a loss about 0.2 lower than AdamW. Memory consumption was ~4% lower. Iteration time dropped by about 0.14, from 1.65 to 1.51.

Wandb logs:
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-43-11---Vmlldzo0MzE0MTcy
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-48-41---Vmlldzo0MzE0MjAz

I am going to be testing at the 2B scale next and report the results. I am going to try adjusting the learning rate and betas more as well. I was wondering if you had noticed a significant difference in performance as you increased the size of the model?

Thank you,

Enrico

How does this optimizer perform in your tasks?

How does this optimizer perform? What are your lr and batch size compared with AdamW?
I use almost the same parameters as for AdamW in my NLP task, and it performs well.

AttributeError: module 'triton.language' has no attribute 'constexpr'

I tried enabling Triton and ran into the above issue. I installed the latest version of Triton on PyPI (1.1.1). Is there a particular version this library is compatible with?
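A quick way to check which Triton version is installed and whether it exposes the attribute named in the error is sketched below. It uses only the standard library; the helper names are hypothetical, and the sketch does not say which Triton version this library requires.

```python
import importlib
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def triton_has_constexpr():
    """Report whether triton.language can be imported and defines constexpr."""
    try:
        language = importlib.import_module("triton.language")
    except ImportError:
        return False  # Triton is not installed or cannot be imported
    return hasattr(language, "constexpr")


print("triton version:", installed_version("triton"))
print("triton.language.constexpr available:", triton_has_constexpr())
```

If the second line prints False while a version is reported, the installed release predates the attribute and use_triton=True is likely to fail in the way described.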