
lion-pytorch's People

Contributors

dvruette, lucidrains, rasbt, xiangning-chen, yousufmo


lion-pytorch's Issues

KeyError in update_fn_kernel when use_triton=True

In my setup, simply defining the optimizer and running training is enough to trigger the problem.

from lion_pytorch import Lion

opt = Lion(
    params,
    lr=lr,
    betas=(0.95, 0.98),
    use_triton=True
)
...
...
...
trainer.fit(model, datamodule)

There are two exceptions. The first is a KeyError in update_fn_kernel:

KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-d6252949da17ceb5f3a278a70250af13-1af5134066c618146d2cd009138944a0-2d732a2488b7ed996facc3e641ee56bf-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float32, torch.float32, torch.float32, dtype('float64'), 'fp32', 'fp32', 'fp32', 'i32'), (128,), (True, True, True, (False,), (False,), (False,), (False,), (True, False)))

While handling the exception above, a second exception occurs in triton/runtime/jit.py:

 path/python3.9/site-packages/triton/runtime/jit.py:190 
β”‚ in _type_of                                                                                      β”‚
β”‚                                                                                                  β”‚
β”‚   187 β”‚   β”‚   β”‚   return f'*{ty}'                                                                β”‚
β”‚   188 β”‚   β”‚   if key is None:                                                                    β”‚
β”‚   189 β”‚   β”‚   β”‚   return '*i8'                                                                   β”‚
β”‚ ❱ 190 β”‚   β”‚   assert isinstance(key, str)                                                        β”‚
β”‚   191 β”‚   β”‚   return key                                                                         β”‚
β”‚   192 β”‚                                                                                          β”‚
β”‚   193 β”‚   def _make_signature(self, sig_key):

The `key` above is `float64`.

Please give me some pointers... Thanks!
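For anyone hitting this, a quick diagnostic (not a fix): the cache key in the traceback contains dtype('float64'), and the assert in _type_of fires because that dtype is not mapped to a type string, so it is worth checking whether any parameter handed to Lion is float64. `model` below stands in for whatever module owns those parameters:

import torch

# Diagnostic sketch (not a fix): a float64 parameter reaching the Triton kernel
# would explain the KeyError above, since the signature builder only maps the
# usual float16/bfloat16/float32 dtypes to type strings.
for name, p in model.named_parameters():
    if p.dtype not in (torch.float16, torch.bfloat16, torch.float32):
        print(name, p.dtype)  # any hit here is a likely culprit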

What is the best learning rate you have found? For LoRA and DreamBooth, thank you.

Update: seems to work for my local enwik8 autoregressive language modeling

Update 2: [experiments](https://api.wandb.ai/links/lucidrains/d4v6c8sl), seems much worse than Adam if learning rate held constant

Update 3: Dividing the learning rate by 3, seeing better early results than Adam. Maybe Adam has been dethroned, after nearly a decade.

Update 4: using the 10x smaller learning rate rule of thumb from the paper resulted in the worst run. So I guess it still takes a bit of tuning.
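For reference, the rule-of-thumb conversion being compared in the updates above looks roughly like this (a sketch; the AdamW values are placeholders, and whether to divide by 3 or by 10 is exactly what is being tuned):

from lion_pytorch import Lion

# Sketch of the AdamW -> Lion rule-of-thumb conversion: a 3-10x smaller learning
# rate and a correspondingly larger weight decay, so the effective decoupled decay
# strength lr * wd stays comparable. The AdamW values below are placeholders.
adam_lr, adam_wd = 3e-4, 1e-2   # hypothetical AdamW settings
lion_lr = adam_lr / 3           # Update 3 above found /3 better than the paper's /10 here
lion_wd = adam_wd * 3

opt = Lion(model.parameters(), lr=lion_lr, weight_decay=lion_wd)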


Same amount of VRAM is taken as in AdamW

One of the main benefits of Lion is that it needs to store less state per parameter.
Adam needs to keep both the momentum and the RMSProp-style EMAs, while Lion only needs the momentum EMA.
However, when I try to use Lion, it takes exactly the same amount of memory as AdamW.
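One way to check this more directly is to measure the optimizer state itself rather than total VRAM (a sketch; note the state is only allocated after the first step, and total VRAM also includes weights, gradients and activations, which can hide the difference in a nvidia-smi reading):

import torch

# Sketch: sum the bytes held in the optimizer state. Adam keeps two EMA tensors
# per parameter (exp_avg, exp_avg_sq); Lion keeps one (exp_avg), so its state
# should be roughly half. The state only exists after the first optimizer.step().
def optimizer_state_bytes(optimizer):
    total = 0
    for state in optimizer.state.values():
        for v in state.values():
            if torch.is_tensor(v):
                total += v.numel() * v.element_size()
    return total

# after at least one training step:
# print(optimizer_state_bytes(opt) / 2**20, "MiB")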

Loss explodes when resuming using the Triton implementation

Hi, thanks for this implementation.

I was pretraining a GPT2-medium model, but found that the loss explodes immediately when resuming from a checkpoint with use_triton=True. With use_triton=False, everything works fine.

I wonder if there's a problem in the Triton implementation.

This issue actually still persists. My Python environment:
python                    3.10
accelerate                0.25.0
bitsandbytes              0.41.3.post2
black                     23.7.0
datasets                  2.14.7
flash-attn                2.4.2
huggingface-hub           0.17.3
lion-pytorch              0.1.2
networkx                  3.1
numpy                     1.26.3
pandas                    2.1.4
pip                       23.3.1
safetensors               0.4.1
tokenizers                0.14.1
torch                     2.1.2
transformers              4.34.1
triton                    2.1.0

My training setup is single-node multi-GPU with FSDP. FSDP config is:

fsdp_config:
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
mixed_precision: bf16

i.e., FULL_SHARD FSDP + bf16 AMP. In this setup, use_triton=True causes problems when resuming training from a checkpoint: the gradient scale explodes along with the loss.

If I train with use_triton=False, save, then resume, there's no problem.

Originally posted by @syncdoth in #20 (comment)

Strange Results on first step

I'm finetuning SD 1.5 at a high effective batch size of 576 (batch size 24 with 24 gradient accumulation steps right now, for testing on 1 GPU before scaling up).

Trying to get Lion working. Getting very strange results on the first step; it seems to reset the model to a weird, texture-filled state.

Step 0 on the left and Step 1 on the right

image

Here is one I let run longer. It actually seems to be converging 🤔, but it still has the same reset problem at the start.
Step 500, Step 1000, Step 1500
image

Relevant Code:

from lion_pytorch import Lion

optimizer = Lion(
    params_to_optimize,
    lr=args.learning_rate,
    weight_decay=1e-2,
    betas=(0.95, 0.98),
    use_triton=True  # set this to True to use the CUDA kernel written in Triton (Tillet et al)
)

Relevant parameters:

--batch_size=24 --learning_rate=7.0e-7 --gradient_accumulation_steps=24 --lion_opt --lr_end=7.0e-9 --lr_scheduler=cosine_with_restarts --lr_num_cycles=20 --lr_warmup_steps=0  --max_train_steps=10000 --mixed_precision bf16

No apparent sharp decrease in loss?
image

More samples across many prompts:

image

Note: the black squares are just the NSFW filter, I believe.

Add the implementation to the official PyTorch repo

Thanks for creating this great implementation and for the helpful discussions! I was wondering if there's any plan to add this to the optimizers in the official PyTorch codebase. We could refer to this repo, the author's repo, and the paper as references in the docstring, similar to the Lion implementation in the official Keras codebase, so that more people can benefit from it.

Adaptive learning-rate optimization

Thank you for that implementation!

There was an older optimizer[0] that would update the learning-rate according to the following rule (basically increasing the learning-rate when things are going in the same direction and decreasing it when they are not):

epsilon = 0.01
same_sign = sign(update[weight]) == sign(previous_update[weight])
lr[weight] = torch.where(same_sign, lr[weight]*(1+epsilon), lr[weight]/(1+epsilon))

It requires:

  • keeping track of the previous sign of the update vector,
  • storing one learning rate per weight,
  • adding an epsilon parameter.

However, at that price, you can find an epsilon and starting learning rate that work well for a large range of problems, and you no longer have to think about learning-rate scheduling or the optimal learning rate for your given problem.

Given the regularity of Lion's update step, it might be worth playing with (sketched below, after the footnote).

[0]: I just searched for an exact reference but could not find it anymore; it was before Adam's domination of the ML world.
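To make the bookkeeping concrete, here is a minimal sketch of that rule applied to a sign-based (Lion-like) update; the function and its defaults are purely illustrative and not part of lion-pytorch:

import torch

def adaptive_sign_step(param, update, state, base_lr=1e-4, epsilon=0.01):
    # Illustrative sketch of the per-weight adaptive learning-rate rule described
    # above, applied to a sign update. Lazily allocate a per-weight learning rate
    # and the sign of the previous update.
    if "lr" not in state:
        state["lr"] = torch.full_like(param, base_lr)
        state["prev_sign"] = torch.zeros_like(param)

    sign = update.sign()
    same_sign = sign == state["prev_sign"]

    # grow the lr where the update direction repeats, shrink it where it flips
    state["lr"] = torch.where(same_sign,
                              state["lr"] * (1 + epsilon),
                              state["lr"] / (1 + epsilon))

    param.sub_(sign * state["lr"])  # descend along the sign of the update
    state["prev_sign"] = sign

# usage sketch:
# state = {}
# adaptive_sign_step(p.data, some_update, state)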

Does the Lion optimizer work with grad accumulation?

Hey mates,

Thanks for your great work. I am wondering whether the Lion optimizer works with gradient accumulation in PyTorch (using the Transformers API for NLP tasks)?

I read some work built on Lion that enables gradient accumulation by setting β1 = β2 = β as a special case of Lion. The author named the optimizer "Tiger" (Tight-fisted Optimizer); here is the official implementation (you might need to translate the website into English with Chrome) and the rewrite in PyTorch.
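In terms of this repo, the β1 = β2 special case just means passing equal betas (a sketch; the value below is a placeholder, and this alone does not reproduce Tiger's gradient-accumulation machinery):

from lion_pytorch import Lion

# Sketch of the beta1 == beta2 special case mentioned above (the "Tiger" setting).
# The value 0.965 is a placeholder, not a recommendation, and this only covers the
# update rule, not whatever accumulation tricks the Tiger implementation adds.
beta = 0.965
opt = Lion(model.parameters(), lr=1e-4, betas=(beta, beta), weight_decay=1e-2)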

Always getting NaNs in long training

I've been experimenting with the Lion optimizer in your other (great) Imagen repository. I can share my anecdotal experience and the combinations I tried:

  • Models of different sizes 0.2B, 0.7B and 1B params.
  • Betas such as beta1 0.95 and beta2 0.98
  • Learning rates 1e-4, 3e-5 and 1e-5.
  • Triton kernel both enabled and disabled.

Training was indeed fast, but unfortunately it always ended up yielding NaNs in the end.

I think a potential issue could be how Lion interacts with a warmup schedule; I am not sure whether you're supposed to do warmup with this optimizer or not (I always did).

image
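For context, the warmup referred to above is just a standard linear ramp on the learning rate. A generic sketch (the warmup length and hyperparameters here are arbitrary placeholders, not specific to Lion):

import torch
from lion_pytorch import Lion

# Generic linear-warmup sketch: scale the lr from ~0 up to its base value over
# `warmup_steps`, then hold it constant. Values below are placeholders.
opt = Lion(model.parameters(), lr=1e-5, betas=(0.95, 0.98))

warmup_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    opt, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

# per training step:
# loss.backward(); opt.step(); scheduler.step(); opt.zero_grad()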

Instability when resuming trains

Hi, I have been testing this out on some diffusion models I am training.

Convergence seems decent (somewhat faster than AdamW, using 1/10th the learning rate and 10x the weight decay). However, I recently paused a few experiments and tried to resume, and the loss explodes immediately. I do not face this issue when resuming AdamW trains.

I have also found it necessary to use an LR warm-up period in my trains (even with the 1/10th learning rate), which again is not required with AdamW. I'll try to do a bit more digging to see if I can track down the source of instability. However, for experiment resuming, surely if I load the optimizer state correctly things should resume as expected?

My only thought is whether something could be going wrong with the saving of the EMA / moving-average statistics. If I get a chance to dig into this more, I'll let you know what I find. (Possibly I am doing something wrong.)
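For reference, the plain single-process save/resume pattern being assumed here (a sketch; paths are placeholders, and any EMA model would need its own state saved separately):

import torch
from lion_pytorch import Lion

opt = Lion(model.parameters(), lr=1e-4)

# save: persist both the model weights and the Lion optimizer state (momentum EMA)
torch.save({"model": model.state_dict(), "opt": opt.state_dict()}, "ckpt.pt")

# resume: restore both before continuing training
ckpt = torch.load("ckpt.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
opt.load_state_dict(ckpt["opt"])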

Convergence guarantees for Lion

Thank you so much for your great implementations! My collaborators and I have recently posted a manuscript online (available at https://arxiv.org/abs/2307.10053) that provides convergence guarantees for the Lion optimizer, especially for training nonsmooth neural networks. Moreover, we present some of our implementations in our repo (available at https://github.com/xnchxy/GeneralSGD). We hope our results can help improve the Lion optimizer.

Best regards,

Nachuan

Performance experiments over AdamW

Hi Phil,

I have been testing some different Lion hyperparameters with PaLM at the 1B scale (total batch size 192, ~1.6 million tokens per batch), using a decoupled weight decay of 0.1 for all runs. So far the best configuration was:

  • lr = 3e-4
  • betas = (0.90, 0.98)

This gave about a 0.2 loss improvement over AdamW. Memory consumption was ~4% lower, and the iteration time dropped from 1.65 to 1.51 (about 0.14 faster per step).
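For concreteness, that configuration would look roughly like this with this repo's Lion class (a sketch; only the learning rate, betas and decoupled weight decay come from the report above):

from lion_pytorch import Lion

# Sketch of the best configuration reported above for the 1B PaLM runs:
# lr 3e-4, betas (0.90, 0.98), decoupled weight decay 0.1.
opt = Lion(model.parameters(), lr=3e-4, betas=(0.90, 0.98), weight_decay=0.1)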

Wandb logs:
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-43-11---Vmlldzo0MzE0MTcy
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-48-41---Vmlldzo0MzE0MjAz

I am going to test at the 2B scale next and report the results, and I will also try adjusting the learning rate and betas further. I was wondering whether you had noticed a significant difference in performance as you increased the model size?

Thank you,

Enrico
