galore's People

Contributors

awgu, darthjaja6, jiaweizzhao, kyriection, robertboy18

galore's Issues

Galore + Lora?

Hi,

Sorry if this is a stupid question, but is it possible to use the 8-bit GaLore optimiser in combination with LoRA adapters?

Thanks
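
For what it's worth, here is a rough sketch of how that combination might be wired up by hand, using the parameter-group keys GaLore's optimizers expect (rank, update_proj_gap, scale, proj_type): hand the 2D LoRA adapter matrices to GaLoreAdamW8bit and keep everything else in a plain group. model is assumed to be a PEFT-wrapped model, and the rank/scale values are placeholders, not recommendations.

from galore_torch import GaLoreAdamW8bit

# Split trainable parameters: 2D LoRA adapter matrices go through GaLore,
# everything else stays in a plain parameter group.
galore_params, regular_params = [], []
for name, p in model.named_parameters():
    if not p.requires_grad:
        continue
    if "lora_" in name and p.dim() == 2:
        galore_params.append(p)
    else:
        regular_params.append(p)

param_groups = [
    {"params": regular_params},
    {"params": galore_params, "rank": 64, "update_proj_gap": 200,
     "scale": 0.25, "proj_type": "std"},
]
optimizer = GaLoreAdamW8bit(param_groups, lr=1e-4)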

`torch_run.py` lacking autocast and scaling for Automatic Mixed Precision

Hey,

As mentioned in the title, the model is converted directly to BF16, without the torch.amp machinery (autocast and gradient scaling) needed for Automatic Mixed Precision.

This means the projected memory shown here counts only the 2 bytes per parameter for the model (BF16), but results after such training would be poor according to various sources. Beyond that, we would need AMP for it to work properly, which means roughly 6 bytes per parameter (BF16 weights plus FP32 master weights), which blows the 24GiB mentioned in the paper out of the water.

For LLaMA 3 8B, you would need 8 * 10^9 * 6 bytes ≈ 44 GiB just to load the parameters in BF16 AMP.

Just wanted to point it out, and ask why it was made this way. The paper also mentions a 58GiB minimum -- but I think you'd need much more than that.

If this is a deliberate decision, please point me to the studies that show that such training has been stabilized.

Source: https://docs.fast.ai/callback.fp16.html
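
For reference, this is roughly what a conventional mixed-precision loop looks like (not the repo's torchrun script): FP32 master weights, autocast around the forward pass, and a GradScaler that is only really needed for FP16, since BF16 autocast usually runs without loss scaling. model, dataloader, optimizer and the use_bf16/use_fp16 flags are assumed to exist elsewhere.

import torch

scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)  # effectively a no-op for BF16
for batch in dataloader:
    optimizer.zero_grad(set_to_none=True)
    amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = model(**batch).loss  # assumes an HF-style model returning .loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()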

Can't reproduce the result of "Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks"

Has anyone successfully replicated the results of the fine-tuning tasks?
I followed the hyperparameters outlined in the README and the paper, and tried the CoLA and MRPC tasks on a single GPU without gradient accumulation. However, the results I obtained differ from those reported in the paper.
Here are the best results from my runs, with the paper's numbers in parentheses:

  • MRPC: 0.8971 (92.25)
  • CoLA: 0.6274 (0.6035)

I would appreciate any assistance from someone who can provide insights on this matter.

linalg.svd: The algorithm failed to converge

I have converted the GaLore code to C++ (libtorch) and am currently running into an issue where large layers (the initial embedding layer) fail at SVD.

The layer is 31618 x 2624.
I am already running with full_matrices set to false.

[W BatchLinearAlgebraLib.cpp:703] Warning: torch.linalg.svd: During SVD computation with the selected cusolver driver, batches 0 failed to converge. A more accurate method will be used to compute the SVD as a fallback. Check doc at https://pytorch.org/docs/stable/generated/torch.linalg.svd.html (function operator ()) [08:32:34.6859554] Projection failed: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 2623).

Is this a known issue, and are there any workarounds for it?
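
One possible (unofficial) workaround is to retry the decomposition in float32 and fall back to the CPU LAPACK path when the cuSOLVER driver fails; the same idea translates directly to libtorch. A sketch in Python, with robust_svd being a made-up helper name:

import torch

def robust_svd(matrix):
    # Try the GPU path in float32 first; on failure, fall back to the CPU LAPACK
    # path, which is usually more tolerant of ill-conditioned inputs.
    try:
        return torch.linalg.svd(matrix.float(), full_matrices=False)
    except RuntimeError:
        U, s, Vh = torch.linalg.svd(matrix.float().cpu(), full_matrices=False)
        return U.to(matrix.device), s.to(matrix.device), Vh.to(matrix.device)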

Hyperparameters for SFT?

Thanks for the great work. One thing I'm curious about: does it actually work well for SFT on LLMs? It is not covered in the paper either. I tried the following parameters on a 2B-sized model, but it leads to very slow convergence. Could you please give me some advice?

lr: 5e-5
galore_rank: 64
galore_update_proj_gap: 200
scale: 0.25
proj_type: std

Confusion about the paper

Impressive and insightful work, hooray to the authors! I recently read your paper, but I'm confused about the following parts.

  1. In the abstract, you discuss how memory-reduction approaches like LoRA underperform full-rank training, for they constrain search space to a low-rank subspace. I quite agree with this perspective. But in the methodology section, the gradient is computed by projection and backprojection, which results in $\Delta W$ being in a low rank subspace as well, which seems to potentially conflict with the initial motivation. Could you please elaborate on how Galore aligns with the overarching goals of the study?
  2. Despite the reduction in optimizer state, performing SVD on gradients introduces a considerable demand on peak memory usage due to the high-dimensional nature of the singular vectors U and V. I'm not sure if I understood the method properly.

Sorry if I ask stupid questions. Thank you for your time and consideration.
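
For readers puzzling over point 1, here is a rough sketch of the per-step mechanics as described in the paper (the names are mine, not the repo's): the Adam moments live in the r-dimensional subspace, and the update is projected back to full rank before being applied; the projection matrix P is only recomputed from an SVD of the gradient every update_proj_gap steps.

import torch

def galore_like_step(W, G, P, exp_avg, exp_avg_sq, lr, scale,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    # W: m x n weight, G: m x n gradient, P: m x r projection (from SVD of G).
    R = P.T @ G                                              # r x n low-rank gradient
    exp_avg.mul_(beta1).add_(R, alpha=1 - beta1)             # Adam moments kept
    exp_avg_sq.mul_(beta2).addcmul_(R, R, value=1 - beta2)   # in the r x n subspace
    N = exp_avg / (exp_avg_sq.sqrt() + eps)                  # (bias correction omitted)
    W.add_(P @ N, alpha=-lr * scale)                         # back-project to full rank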

RuntimeError: diag(): Supports 1D or 2D tensors. Got 3D

[/usr/local/lib/python3.10/dist-packages/galore_torch/galore_projector.py](https://localhost:8080/#) in get_orthogonal_matrix(self, weights, rank, type)
     85         #make the smaller matrix always to be orthogonal matrix
     86         if type=='right':
---> 87             A = U[:, :rank] @ torch.diag(s[:rank])
     88             B = Vh[:rank, :]
     89 

RuntimeError: diag(): Supports 1D or 2D tensors. Got 3D

As I understand it, GaLore is not able to work with models that use 3D inputs/outputs?
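
A minimal reproduction of the error message, plus one possible (unofficial) workaround of flattening such tensors to 2D before projecting:

import torch

x = torch.randn(2, 3, 4)          # a 3D parameter/gradient
try:
    torch.diag(x)
except RuntimeError as e:
    print(e)                      # diag(): Supports 1D or 2D tensors. Got 3D

# Unofficial workaround: collapse the leading dimensions before projecting, e.g.
x2d = x.reshape(-1, x.shape[-1])  # now 6 x 4, which a 2D projector can handle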

Galore finetuning #stopped

import os
import datasets
from transformers import Trainer, TrainingArguments

# Configuration parameters
model_name_or_path = "mistralai/Mistral-7B-v0.1"
max_length = 128
doc_stride = 128
pad_to_max_length = True
per_device_train_batch_size = 1
per_device_eval_batch_size = 1
learning_rate = 0.0002
weight_decay = 0.0
num_train_epochs = 1
gradient_accumulation_steps = 1
output_dir = "/home/IAIS/jdatta/teacher_model"
seed = 42

# Load the datasets
squad = datasets.load_dataset("rajpurkar/squad_v2")
dataset = squad['train'].train_test_split(test_size=0.2)
train_dataset = dataset['train']
eval_dataset = dataset['test']

train_dataset = train_dataset.select(range(1000))
eval_dataset = eval_dataset.select(range(500))

training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="steps",
    warmup_ratio=0.05,
    overwrite_output_dir=True,
    gradient_accumulation_steps=gradient_accumulation_steps,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    num_train_epochs=num_train_epochs,
    fp16=True,
    eval_steps=10,
    save_strategy='steps',
    save_steps=10,
    save_total_limit=1,
    dataloader_num_workers=2,
    load_best_model_at_end=True,
    report_to="none",
    prediction_loss_only=True,
    gradient_checkpointing=True,
    optim_args="rank=64, update_proj_gap=100, scale=0.10",
    optim="galore_adafactor",
    optim_target_modules=["c_attn", "c_proj", "q_proj", "k_proj", "v_proj", "down_proj", "up_proj"],
    learning_rate=learning_rate,
    weight_decay=weight_decay,
)

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# model and data_collator are assumed to be created earlier (omitted from this snippet)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
trainer.train()

The training is not starting.
It has been showing the following messages for 2 hours:
/home/IAIS/jdatta/miniconda3/envs/myenv/lib/python3.11/site-packages/transformers/training_args.py:1474: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
Activated GaLoRE fine-tuning, depending on your model size and hardware, the training might take a while before starting. Please be patient !
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:

  • Avoid using tokenizers before the fork if possible
  • Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

(The same tokenizers warning is printed a second time.)

Should I tune any parameters?
I have also tried Mistral-7B, Phi-2, and Llama-7B.

Figure 1 clarification on batch size and sequence length

In Figure 1, what are the batch size, sequence length, and vocab size? It isn't clear from the caption. I would expect activations to take up more space. From what I can tell:

  • batch size seems to be 256 based on Fig. 1 caption
  • sequence len seems to be 2048, based on footnote 1
  • vocab size is 32000, based on config from repo
  • bf16 used so 2 bytes per float, based on footnote 2

So the logits of the Llama model alone should take up 256 * 2048 * 32000 * 2 bytes, or 31.25 GiB. Where is this required memory in Figure 1?

Thanks!
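
For reference, the arithmetic behind that estimate, using the batch size, sequence length, and vocab size listed above:

batch, seq_len, vocab = 256, 2048, 32_000
bytes_per_elem = 2                                    # bf16
logits_bytes = batch * seq_len * vocab * bytes_per_elem
print(logits_bytes / 2**30, "GiB")                    # 31.25 GiB (~33.5 GB)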

ValueError: some parameters appear in more than one parameter group

I encountered an error, how should I resolve it?

[WARNING|trainer.py:1272] 2024-04-27 12:04:25,428 >> Activated GaLoRE fine-tuning, depending on your model size and hardware, the training might take a while before starting. Please be patient !
/home/jiahui/anaconda3/envs/llm/lib/python3.10/site-packages/galore_torch/adamw.py:48: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
Traceback (most recent call last):
File "/home/jiahui/workspace/nmt/thesis_nmt/mnmt/multi/scripts/run_translation.py", line 618, in
main()
File "/home/jiahui/workspace/nmt/thesis_nmt/mnmt/multi/scripts/run_translation.py", line 534, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/jiahui/workspace/nmt/thesis_nmt/transformers/src/transformers/trainer.py", line 1848, in train
return inner_training_loop(
File "/home/jiahui/workspace/nmt/thesis_nmt/transformers/src/transformers/trainer.py", line 1949, in _inner_training_loop
self.create_optimizer_and_scheduler(num_training_steps=max_steps)
File "/home/jiahui/workspace/nmt/thesis_nmt/transformers/src/transformers/trainer.py", line 981, in create_optimizer_and_scheduler
self.create_optimizer()
File "/home/jiahui/workspace/nmt/thesis_nmt/transformers/src/transformers/trainer.py", line 1038, in create_optimizer
self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
File "/home/jiahui/anaconda3/envs/llm/lib/python3.10/site-packages/galore_torch/adamw.py", line 64, in init
super().init(params, defaults)
File "/home/jiahui/anaconda3/envs/llm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 192, in init
self.add_param_group(param_group)
File "/home/jiahui/anaconda3/envs/llm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 535, in add_param_group
raise ValueError("some parameters appear in more than one parameter group")
ValueError: some parameters appear in more than one parameter group

Where is LOMO (fused gradient update) implemented?

Hi! Congrats on the great work! I have a question regarding the gradient storage: the paper mentioned that GaLore also uses LOMO to avoid materializing the full gradient, but I couldn't find where LOMO is implemented in the code base. Can you point me to where it is implemented (or the equivalents)? Thanks!
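
Not an answer from the authors, but for context this is the usual way a fused / layer-wise update is wired up in PyTorch: one optimizer per parameter, stepped from a post-accumulate-grad hook during backward so the full-model gradient never has to be kept around at once (requires PyTorch >= 2.1; the rank/scale values are placeholders, and model is assumed to exist).

import torch
from galore_torch import GaLoreAdamW8bit

# One tiny optimizer per trainable 2D parameter (GaLore settings are placeholders).
optimizer_dict = {
    p: GaLoreAdamW8bit([{"params": [p], "rank": 128, "update_proj_gap": 200,
                         "scale": 0.25, "proj_type": "std"}], lr=1e-2)
    for p in model.parameters() if p.requires_grad and p.dim() == 2
}

def make_hook(param):
    def hook(_):
        # Runs while backward is still in flight: step and free this parameter's
        # gradient immediately instead of holding all gradients until step().
        optimizer_dict[param].step()
        optimizer_dict[param].zero_grad()
    return hook

for p in optimizer_dict:
    p.register_post_accumulate_grad_hook(make_hook(p))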

Training Time

Would it be possible for you to add how long each training run takes to the README? I think a lot of people who have heard about Galore would be interested in that.

Questions about reproducing the result of "Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks"

Thank you for your great work. I am trying to reproduce the results in "Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks"

I have gone through this issue, but I still fail to reproduce some of the results.

I conducted experiments with GaLore rank 4, using the following script:

python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name $task_name \
    --enable_galore \
    --lora_all_modules \
    --max_length 512 \
    --seed $seed \
    --lora_r 4 \
    --galore_scale 4 \
    --per_device_train_batch_size 16 \
    --update_proj_gap 500 \
    --learning_rate 1e-5 \
    --num_train_epochs 30 \
    --output_dir results/ft/roberta_base/$task_name

Following the hyperparameters in Table 7 of your paper, I changed the batch size to 32 and the learning rate to 3e-5 for the CoLA dataset. But I got the following results:

Dataset   Result in Paper   Reproduced Result
MRPC      92.25             Acc 87.74, F1 91.10
CoLA      60.35             Matthews correlation 59.56
RTE       79.42             Acc 77.25
STS-B     90.73             Pearson 0.90526, Spearman 0.90339
QQP       91.06             Pearson 0.90785, Spearman 0.90589

I am wondering if I have missed something?

When I used GaLore with ORPO, I set the learning rate to 8e-6, but the learning rate shown in the training logs was 0.001

# model, tokenizer, dataset and the hyperparameter variables below are assumed
# to be defined earlier (omitted from this snippet)
trainer = ORPOTrainer(
        model=model,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        
        #peft_config=peft_config,
        tokenizer=tokenizer,
        args= ORPOConfig(
            max_length=cutoff_len,
            max_prompt_length=cutoff_len//2,
            beta=0.1,
            per_device_train_batch_size=micro_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            warmup_steps=0,
            num_train_epochs=num_epochs,
            lr_scheduler_type="cosine",
            learning_rate=8e-6,
            bf16=True,
            logging_steps=10,
            optim = "galore_adamw_8bit_layerwise",
            optim_target_modules=[r".*attn.*", r".*mlp.*"],
            optim_args="rank=1024, update_proj_gap=500, scale=0.25",
            evaluation_strategy="steps" if val_set_size > 0 else "no",
            save_strategy="steps",
            eval_steps=100 if val_set_size > 0 else None,
            save_steps=100,
            output_dir=output_dir,
            save_total_limit=2,
            gradient_checkpointing=True, 
            gradient_checkpointing_kwargs={'use_reentrant':True},
            load_best_model_at_end=True if val_set_size > 0 else False,
            ddp_find_unused_parameters=False if ddp else None,
            report_to="wandb" if use_wandb else None,
            run_name=wandb_run_name if use_wandb else None,
            do_train=True,
            remove_unused_columns=False,
        )
    )


Activated GaLoRE fine-tuning, depending on your model size and hardware, the training might take a while before starting. Please be patient !
  0%|                                                                                                                                     | 0/495 [00:00<?, ?it/s]Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3557, 'grad_norm': 0.0, 'learning_rate': 0.001, 'rewards/chosen': -0.015678538009524345, 'rewards/rejected': -0.012379011139273643, 'rewards/accuracies': 0.19999998807907104, 'rewards/margins': -0.003299527335911989, 'logps/rejected': -0.12379010766744614, 'logps/chosen': -0.15678536891937256, 'logits/rejected': 0.7921055555343628, 'logits/chosen': 0.791210412979126, 'nll_loss': 0.2719877064228058, 'log_odds_ratio': -0.8374900817871094, 'log_odds_chosen': -0.25091928243637085, 'epoch': 0.06}
{'loss': 0.2634, 'grad_norm': 0.0, 'learning_rate': 0.001, 'rewards/chosen': -0.012010233476758003, 'rewards/rejected': -0.009977776557207108, 'rewards/accuracies': 0.29999998211860657, 'rewards/margins': -0.0020324576180428267, 'logps/rejected': -0.09977775812149048, 'logps/chosen': -0.12010233104228973, 'logits/rejected': 0.7489851713180542, 'logits/chosen': 0.7482139468193054, 'nll_loss': 0.1832979917526245, 'log_odds_ratio': -0.8010236620903015, 'log_odds_chosen': -0.16869042813777924, 'epoch': 0.12}
{'loss': 0.2482, 'grad_norm': 0.0, 'learning_rate': 0.001, 'rewards/chosen': -0.011346157640218735, 'rewards/rejected': -0.01022450439631939, 'rewards/accuracies': 0.4833333492279053, 'rewards/margins': -0.0011216530110687017, 'logps/rejected': -0.102245032787323, 'logps/chosen': -0.11346157640218735, 'logits/rejected': 0.7105721831321716, 'logits/chosen': 0.7108334898948669, 'nll_loss': 0.17242279648780823, 'log_odds_ratio': -0.7573043704032898, 'log_odds_chosen': -0.08471358567476273, 'epoch': 0.18}
{'loss': 0.2444, 'grad_norm': 0.0, 'learning_rate': 0.001, 'rewards/chosen': -0.012975988909602165, 'rewards/rejected': -0.013058923184871674, 'rewards/accuracies': 0.550000011920929, 'rewards/margins': 8.293241262435913e-05, 'logps/rejected': -0.13058921694755554, 'logps/chosen': -0.12975989282131195, 'logits/rejected': 0.6808757781982422, 'logits/chosen': 0.6832461953163147, 'nll_loss': 0.1756206750869751, 'log_odds_ratio': -0.687309741973877, 'log_odds_chosen': 0.04155167192220688, 'epoch': 0.24}

Memory issue

Hi, thanks for releasing GaLore! I'm running out of memory whenever I use a sequence length longer than 512, even with a smaller model. I can train a 7B model with a 512 sequence length on 24G of VRAM, but I can't train a 5B model with an 8192 sequence length. Thanks!

GaLore in HuggingFace

Hi team, many thanks for GaLore. I'm currently using HuggingFace for fine-tuning, and I'm curious how to integrate GaLore with it.

It's not an issue; I'm just interested in using GaLore with HuggingFace.

@jiaweizzhao

import os
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer
from galore_torch import GaLoreAdamW, GaLoreAdamW8bit, GaLoreAdafactor


lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)


model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])

# "data" is assumed to be a dataset loaded earlier (omitted from this snippet)
trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
)
trainer.train()

Should I just replace optim="paged_adamw_8bit" with optim=GaLoreAdamW8bit? Can you please provide a sample script?
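
Not an official answer, but one way to drive GaLore through the Trainer itself is to select it by name in TrainingArguments (the same optim / optim_target_modules / optim_args fields used in other issues on this page) rather than passing an optimizer class; whether this combines cleanly with the 4-bit + LoRA setup above is a separate question. The rank/scale values below are placeholders.

args = transformers.TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=10,
    learning_rate=2e-4,
    logging_steps=1,
    optim="galore_adamw_8bit",                    # or "galore_adamw_8bit_layerwise"
    optim_target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj"],
    optim_args="rank=64, update_proj_gap=200, scale=0.25",
)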

Clarifying GLUE Benchmark Accuracy: Validation or Test Set?

Hello, I enjoyed reading your excellent paper. I have one question: is the accuracy mentioned for the GLUE benchmarks based on the validation set or the test set? The paper does not specify this detail. Thank you for your clarification.

Seems not compatible with DeepSpeed (perhaps also FSDP)

Hi, thanks for your awesome work!

When I try to use the GaLore AdamW optimizer for Gemma training, it seems to be incompatible with DeepSpeed at ZeRO stages 0 and 1:
(error screenshot omitted)

I guess this is because DeepSpeed's BF16_Optimizer will flatten the parameters for memory efficiency. Perhaps this will also affect the usage of FSDP.

RuntimeError: cusolver error: CUSOLVER_STATUS_INVALID_VALUE in torch.linalg.svd

The method works great on most layers but on the final projection in my transformer (1024 x 50k) I get

RuntimeError: cusolver error: CUSOLVER_STATUS_INVALID_VALUE, when calling `cusolverDnSgesvdj_bufferSize(handle, jobz, econ, m, n, A, lda, S, U, ldu, V, ldv, lwork, params)`

when executing U, s, Vh = torch.linalg.svd(matrix).

The issue is fixed by using U, s, Vh = torch.linalg.svd(matrix, full_matrices = False)

Zero Loss: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values

Hi GaLore Team, congratulations on the interesting work!

I am trying to fine-tune the Llama-3 8B model using GaLore but am getting this error:
torch._C._LinAlgError: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values.

  • Interestingly, the first batch loss is non-zero and all subsequent losses are zero before training terminates automatically.

    Full Error Log

    Activated GaLoRE fine-tuning, depending on your model size and hardware, the training might take a while before starting. Please be patient !
    model.layers.0.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
    model.layers.0.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
    model.layers.0.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
    model.layers.0.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
    (... the same four warnings repeat for model.layers.1 through model.layers.31 ...)
      0%|                                                                                                                                                                            | 0/6719320 [00:00<?, ?it/s]You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
      0%|                                                                                                                                                            | 1/6719320 [06:15<701609:55:03, 375.90s/it][2024-07-23 07:19:54,094] [INFO] [axolotl.callbacks.on_step_end:128] [PID:148509] [RANK:0] GPU memory usage while training: 17.607GB (+15.215GB cache, +1.482GB misc)
    {'loss': 1.694, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}                                                                                                                                      
    {'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
    (... the same zero-loss line repeats for every subsequent logged step ...)
      0%|                                                                                                                                                             | 200/6719320 [08:51<1455:09:36,  1.28it/s]/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/galore_torch/galore_projector.py:83: UserWarning: torch.linalg.svd: During SVD computation with the selected cusolver driver, batches 0 failed to converge. A more accurate method will be used to compute the SVD as a fallback. Check doc at https://pytorch.org/docs/stable/generated/torch.linalg.svd.html (Triggered internally at ../aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp:697.)
      U, s, Vh = torch.linalg.svd(matrix, full_matrices = False)
    Traceback (most recent call last):
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/home/minimalist/work/projects/sota/axolotl/src/axolotl/cli/train.py", line 72, in <module>
        fire.Fire(do_cli)
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
        component_trace = _Fire(component, args, parsed_flag_args, context, name)
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
        component, remaining_args = _CallAndUpdateTrace(
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
        component = fn(*varargs, **kwargs)
      File "/home/minimalist/work/projects/sota/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
        return do_train(parsed_cfg, parsed_cli_args)
      File "/home/minimalist/work/projects/sota/axolotl/src/axolotl/cli/train.py", line 67, in do_train
        return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
      File "/home/minimalist/work/projects/sota/axolotl/src/axolotl/train.py", line 191, in train
        trainer.train(resume_from_checkpoint=resume_from_checkpoint)
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
        return inner_training_loop(
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
        tr_loss_step = self.training_step(model, inputs)
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/transformers/trainer.py", line 3324, in training_step
        self.accelerator.backward(loss, **kwargs)
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/accelerate/accelerator.py", line 2151, in backward
        loss.backward(**kwargs)
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
        torch.autograd.backward(
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
        _engine_run_backward(
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
        return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/transformers/trainer.py", line 1398, in optimizer_hook
        optimizer_dict[param].step()
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
        return wrapped(*args, **kwargs)
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
        out = func(*args, **kwargs)
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/galore_torch/adamw8bit.py", line 58, in step
        grad = state["projector"].project(p.grad, state["step"])
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/galore_torch/galore_projector.py", line 21, in project
        self.ortho_matrix = self.get_orthogonal_matrix(full_rank_grad, self.rank, type='left')
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/galore_torch/galore_projector.py", line 83, in get_orthogonal_matrix
        U, s, Vh = torch.linalg.svd(matrix, full_matrices = False)
    torch._C._LinAlgError: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 1023).
      0%|                                                                                                                                                             | 200/6719320 [13:19<7465:36:33,  4.00s/it]
    Traceback (most recent call last):
      File "/home/minimalist/miniconda3/envs/comps/bin/accelerate", line 8, in <module>
        sys.exit(main())
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
        args.func(args)
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
        simple_launcher(args)
      File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/accelerate/commands/launch.py", line 703, in simple_launcher
        raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
    subprocess.CalledProcessError: Command '['/home/minimalist/miniconda3/envs/comps/bin/python', '-m', 'axolotl.cli.train', 'examples/llama-3/qlora.yml']' returned non-zero exit status 1.
    

    Hyperparams

base_model: meta-llama/Meta-Llama-3-8B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

datasets:
  - path: <dataset>
    type: sharegpt
    conversation: llama-3
    field_human: human
    field_model: gpt

dataset_prepared_path:
val_set_size: 0.01
output_dir: ./outputs/galore-out

sequence_len: 2048
sample_packing: false
eval_sample_packing: true
pad_to_sequence_len: true

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 4
optimizer: galore_adamw_8bit_layerwise
lr_scheduler: cosine
learning_rate: 0.000001

optim_target_modules:
 - self_attn
 - mlp

train_on_inputs: false
group_by_length: false
bf16: true
tf32: false

bfloat16: true

logging_steps: 4
flash_attention: true

Third-party benchmark

Hello, thank you very much for such excellent work. We have conducted some experiments using Llama-Factory, and the results indicate that Galore can significantly reduce memory usage during full parameter fine-tuning. We utilized the 8-bit AdamW optimizer and pure bfloat16 training with gradient checkpointing. Galore requires only 18GB of VRAM to train a Llama-2 7B model, while the standard 8-bit AdamW optimizer requires at least 40GB of VRAM. We provide reproducible scripts for SFT training here: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/extras/galore/galore_adamw_8bit_bf16.sh

Optimizer       Rank   Retain grad   Memory   Token/s
8-bit AdamW     -      Yes           40GB     1434
8-bit GaLore    16     Yes           28GB     1532
8-bit GaLore    128    Yes           29GB     1532
16-bit GaLore   128    Yes           30GB     1615
16-bit GaLore   128    No            18GB     1587
8-bit GaLore    1024   Yes           36GB     1238

* We omitted the time spent computing the SVD for GaLore every update_proj_gap steps; it takes around 10 minutes for a 7B model.

  • model: LLaMA-2 7B
  • device: NVIDIA A100
  • token batch size: 512
  • activation checkpointing: enabled
  • flash attention: disabled

Experiment results last updated: Mar 9th.
todo: add loss convergence results.


The first optimizer.step() execution takes an extremely long time

Hello, thank you for providing the implementation of the paper. When I run the code, I find that the first time optimizer.step() is called, it takes an extremely long time.
In my case, pretraining the llama_1b model on one A100 with batch_size == 1, the first optimizer.step() took about 70 seconds, while subsequent steps took a normal ~30 ms. Is this caused by some one-time tensor registration step?
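A plausible explanation (my assumption, not confirmed by the authors): on the very first step the GaLore projector has to compute an SVD for every projected weight matrix, and later steps reuse the cached projection until update_proj_gap is reached, so only the first step (and every update_proj_gap-th step) pays that cost; the first CUDA SVD call also pays some one-off cuSOLVER initialization. A minimal sketch of that one-off cost, using an illustrative layer shape:

import time
import torch

# Illustrative shape only; the actual GaLore layers in the 1B config will differ.
grad = torch.randn(2048, 5504, device="cuda", dtype=torch.float32)

torch.cuda.synchronize()
t0 = time.time()
U, S, Vh = torch.linalg.svd(grad, full_matrices=False)  # what the projector computes on step 1
torch.cuda.synchronize()
print(f"SVD of one layer-sized gradient: {time.time() - t0:.2f}s")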

Galore unstable on Llama 7B beyond 20K steps

[Weights & Biases chart: training loss curve, exported 5/2/2024]

To replicate the above results, run the command from the README shown below. Machine configuration: A100 80GB, CUDA 11.8; other dependencies were installed following the recommendations in the repo.

# LLaMA-7B, 8-bit GaLore-Adam, single GPU, activation checkpointing
# bsz=16, 22.8G, 
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_7b.json \
    --lr 0.005 \
    --galore_scale 0.25 \
    --rank 1024 \
    --update_proj_gap 500 \
    --batch_size 16 \
    --total_batch_size 512 \
    --activation_checkpointing \
    --num_training_steps 150000 \
    --warmup_steps 15000 \
    --weight_decay 0 \
    --grad_clipping 1.0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --single_gpu \
    --optimizer galore_adamw8bit_per_layer

Galore is not supported for Deepspeed Zero3

Error information

/root/anaconda3/envs/new_llm/lib/python3.10/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
Traceback (most recent call last):
  File "/root/paddlejob/workspace/20240315/0_llm/new_llm/LLaMA-Factory-main/src/train_bash.py", line 14, in <module>
    main()
  File "/root/paddlejob/workspace/20240315/0_llm/new_llm/LLaMA-Factory-main/src/train_bash.py", line 5, in main
    run_exp()
  File "/root/paddlejob/workspace/20240315/0_llm/new_llm/LLaMA-Factory-main/src/llmtuner/train/tuner.py", line 32, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/root/paddlejob/workspace/20240315/0_llm/new_llm/LLaMA-Factory-main/src/llmtuner/train/sft/workflow.py", line 54, in run_sft
    trainer = CustomSeq2SeqTrainer(
  File "/root/anaconda3/envs/new_llm/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 56, in __init__
    super().__init__(
  File "/root/anaconda3/envs/new_llm/lib/python3.10/site-packages/transformers/trainer.py", line 527, in __init__
    raise RuntimeError(
RuntimeError: Passing `optimizers` is not allowed if Deepspeed or PyTorch FSDP is enabled. You should subclass `Trainer` and override the `create_optimizer_and_scheduler` method.

deepspeed zero3 config

{
    "bf16": {
        "enabled": "auto"
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
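The RuntimeError itself suggests a workaround: build the GaLore optimizer inside a Trainer subclass instead of passing optimizers= to the constructor. Below is a minimal, untested sketch of that pattern; it only removes the constructor error, and whether GaLore's per-parameter projectors behave correctly under ZeRO-3 parameter sharding is a separate, open question.

import torch.nn as nn
from transformers import Trainer
from galore_torch import GaLoreAdamW

class GaLoreTrainer(Trainer):
    def create_optimizer(self):
        if self.optimizer is None:
            # collect the 2-D weights of attention/MLP linear layers for GaLore
            galore_params = [
                module.weight
                for name, module in self.model.named_modules()
                if isinstance(module, nn.Linear) and any(k in name for k in ("attn", "mlp"))
            ]
            galore_ids = {id(p) for p in galore_params}
            regular_params = [p for p in self.model.parameters() if id(p) not in galore_ids]
            param_groups = [
                {"params": regular_params},
                {"params": galore_params, "rank": 128, "update_proj_gap": 200,
                 "scale": 0.25, "proj_type": "std"},
            ]
            self.optimizer = GaLoreAdamW(param_groups, lr=self.args.learning_rate)
        return self.optimizer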


How to get optim_target_modules=["attn", "mlp"] for other models?

Great work, and many thanks for this.

I have already fine-tuned a model and it is showing good performance.
My question is: if I want to fine-tune Llama 2, Mistral, OpenChat, etc., how do I work out the right value for the following?

optim_target_modules=["attn", "mlp"]

I ask because the suggestion is to confirm that these optim_target_modules actually exist in the model; for my model "mlp" matches but the other one does not.

Are there any docs or suggestions for this?

Thanks.
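Not an official recipe, but one generic way to pick the substrings for a new model is to print the names of its nn.Linear modules and choose the substrings that cover the attention and MLP projections (the checkpoint name below is just an example):

import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print(name)  # e.g. model.layers.0.self_attn.q_proj, model.layers.0.mlp.gate_proj, ...

For LLaMA-2 and Mistral the printed names contain self_attn and mlp, so optim_target_modules=["attn", "mlp"] matches the same layers; recent transformers versions also accept "all-linear", as used in the non-leaf-tensor issue further down this page.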

CUDA out of memory in torch.linalg.svd

I tried to use GaLore on nn.Linear(256, 267736).
Then I got the following error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 267.04 GiB. at U, s, Vh = torch.linalg.svd(matrix).
I think full_matrices=False may be required at torch.linalg.svd.
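For reference, the 267.04 GiB allocation is consistent with the full SVD materializing an n x n factor: 267736^2 float32 values come to roughly 267.04 GiB. With full_matrices=False the factors stay at the reduced size, e.g.:

import torch

# Same shape as the weight gradient of nn.Linear(256, 267736): (out_features, in_features)
grad = torch.randn(267736, 256)
U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
print(U.shape, S.shape, Vh.shape)  # (267736, 256), (256,), (256, 256)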

A few questions regarding the results and methodology.

Hi, thanks for releasing this work! it has all been very interesting to read. However, I do have a few questions regarding your results and methodology.

  1. For Table 4, it seems that you train with a batch size of 16 but report the memory of these runs divided by 16.
    There would also be a memory overhead from the model weights, which is greater than the memory reported in the table.
    Is this way of reporting memory commonly done, given that it does not capture the entire picture? The memory is also reported as identical across all sub-tasks, yet for some of the tasks you use a different batch size (e.g. 32 for CoLA).

  2. When you report the memory, do you include the overhead of allocating memory for SVD? SVD can have a large memory overhead in practice and especially considering it is only implemented in 32-bit.

  3. Figure 1 shows the impressive result of reducing the memory cost of training LLaMA 7B to within the budget of an RTX 4090. I have noticed that you also use an adaptive low-memory optimisation method (AdaLOMO). I am curious how much of the memory improvement comes from the gradient low-rank projection and how much comes just from AdaLOMO.
    https://github.com/OpenLMLab/LOMO

  4. What do you mean by "token batch size"? Is this just the number of tokens for a single iteration?

  5. The RoBERTa-base fine-tuning results also seem to be very different from the results reported in the original LoRA paper.

Thanks again!

IndexError: tuple index out of range

Hi Jiawei,

I was trying Galore on TinyLlama-1B using the codebase https://github.com/jzhang38/TinyLlama on 4* A800-80GB. I encounter the following error:

[rank1]:     optimizer.step()
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 74, in step
[rank1]:     output = self._strategy.optimizer_step(
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/strategies/strategy.py", line 207, in optimizer_step
[rank1]:     return self.precision.optimizer_step(optimizer, **kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/plugins/precision/fsdp.py", line 142, in optimizer_step
[rank1]:     return super().optimizer_step(optimizer, **kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/plugins/precision/precision.py", line 124, in optimizer_step
[rank1]:     return optimizer.step(**kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank1]:     out = func(*args, **kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/galore_torch/adamw.py", line 96, in step
[rank1]:     grad = state["projector"].project(grad, state["step"])
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/galore_torch/galore_projector.py", line 15, in project
[rank1]:     if full_rank_grad.shape[0] >= full_rank_grad.shape[1]:
[rank1]: IndexError: tuple index out of range

I use galore as you suggested in torchrun_main.py:

print('using galore')
galore_params = []
target_modules_list = [ "attn", "mlp"]
for module_name, module in model.named_modules():
    if not isinstance(module, nn.Linear):
        continue

    if not any(target_key in module_name for target_key in target_modules_list):
        continue
    
    print('enable GaLore for weights in module: ', module_name)
    galore_params.append(module.weight)

id_galore_params = [id(p) for p in galore_params]

# make parameters without "rank" to another group
regular_params = [p for p in model.parameters() if id(p) not in id_galore_params]
# then call galore_adamw
param_groups = [{'params': regular_params}, 
                {'params': galore_params, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}]
    
optimizer = GaLoreAdamW(param_groups, lr=learning_rate)

Any idea why and how to fix it?

Thanks in advance!
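One guess (based only on the trace, not a confirmed diagnosis): the run uses Lightning's FSDP strategy, and FSDP can hand the optimizer flattened 1-D parameters, so full_rank_grad.shape[1] no longer exists when the projector assumes a 2-D matrix gradient. A quick sanity check before building the param groups:

# GaLore's projector expects 2-D (matrix) gradients, so flag anything that is not.
for p in galore_params:
    if p.dim() != 2:
        print("non-matrix GaLore param:", tuple(p.shape))

If the parameters are already flat at this point, the param groups have to be built from the unwrapped module (before FSDP flattening), or GaLore has to be applied with a strategy that keeps per-parameter shapes.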

ValueError: can't optimize a non-leaf Tensor (param.is_leaf=False,param.retains_grad=False)

My model works fine with adamw_bnb_8bit.
When I switched to galore_adamw_8bit with 'all-linear',
an exception was raised: "can't optimize a non-leaf Tensor".

Seq2SeqTrainingArguments(
        output_dir = model_name_or_path,
        save_strategy = 'no',
        logging_steps = 100,
        bf16 = True if torch.cuda.is_available() else False,
        dataloader_pin_memory = True,
        dataloader_num_workers = 8,
        num_train_epochs = 1, #1, # 2,
        do_train=True,
        learning_rate = learning_rate, # 5e-5,
        # optim = 'adamw_bnb_8bit', 
        optim="galore_adamw_8bit_layerwise",
        optim_target_modules='all-linear',
        lr_scheduler_type = 'constant', # 'cosine', constant
        warmup_ratio = 0.,
        per_device_train_batch_size = batch_size, # 8,
        gradient_accumulation_steps = 1,
        report_to = 'none',
        do_eval=False,
        max_steps = max_steps,
        accelerator_config = {'dispatch_batches':False},
        **kwargs
    )

Results vs FP32

Hi, I was reading the GaLore paper and noticed that the "ground truth" baseline seems to be pure BF16 training with nearest rounding. It is generally accepted that pure BF16 training with nearest rounding does not converge to the same point as FP32 or BF16/FP32 mixed precision training -- does GaLore only match pure BF16 or does it match FP32 training as well?
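For context on why that baseline choice matters, here is a tiny illustration (not from the paper) of nearest-rounding BF16 silently dropping small updates that FP32 keeps:

import torch

x_bf16 = torch.tensor(1.0, dtype=torch.bfloat16)
x_fp32 = torch.tensor(1.0, dtype=torch.float32)
for _ in range(10_000):
    x_bf16 += 1e-4  # below bf16 resolution near 1.0, so each add rounds away
    x_fp32 += 1e-4
print(x_bf16.item(), x_fp32.item())  # ~1.0 vs ~2.0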

Double approximation of second moment in Adafactor

Adafactor already performs its own approximation of the second moment.
But when GaLore is enabled, that approximation is computed from the gradient already shrunk by GaLore rather than from the raw gradient, so the second moment ends up approximated twice.
Possibly this behavior has a slightly negative impact.
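To make the concern concrete, here is an illustrative sketch (shapes and numbers are made up, and the statistics are only Adafactor-style, not the exact update rule): without GaLore the factored second moment is estimated from the full m x n gradient, while with GaLore it is estimated from the already-projected r x n gradient, stacking one approximation on top of another.

import torch

m, n, r = 4096, 4096, 128                     # illustrative layer size and GaLore rank
grad = torch.randn(m, n)
P = torch.linalg.qr(torch.randn(m, r)).Q      # stand-in for GaLore's projection matrix

low_rank_grad = P.T @ grad                    # r x n: what the optimizer actually sees
row_stats = (low_rank_grad ** 2).mean(dim=1)  # Adafactor-style row statistics, now length r
col_stats = (low_rank_grad ** 2).mean(dim=0)  # column statistics, length n
print(row_stats.shape, col_stats.shape)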

layerwise optimizer raises TypeError about slice indices

  File "/workspace/transformers/src/transformers/trainer.py", line 1297, in optimizer_hook             
    optimizer_dict[param].step()
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs) 
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)                                      
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/galore_torch/adamw.py", line 96, in step      
    grad = state["projector"].project(grad, state["step"])                                                                                  
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/galore_torch/galore_projector.py", line 21, in project
    self.ortho_matrix = self.get_orthogonal_matrix(full_rank_grad, self.rank, type='left')                          
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/galore_torch/galore_projector.py", line 94, in get_orthogonal_matrix
    A = U[:, :rank]                                                                                                                         
TypeError: slice indices must be integers or None or have an __index__ method
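The last frame suggests that rank reaches get_orthogonal_matrix as a non-integer (for example a float parsed from a config), since U[:, :rank] only accepts integer slice bounds. A minimal reproduction of the error and the obvious cast (this is an assumption about the cause, not a confirmed fix):

import torch

U = torch.randn(512, 512)
rank = 128.0                  # a float, e.g. coming from a YAML/JSON config
try:
    A = U[:, :rank]
except TypeError as e:
    print(e)                  # slice indices must be integers or None or have an __index__ method
A = U[:, :int(rank)]          # casting to int restores the expected behavior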

Questions about glue task report scores

The following is the result of running the script you provided on the mrpc task.
{"eval_accuracy": 0.8970588235294118, "eval_f1": 0.926056338028169}

Which one is the result provided in your paper?

Support for DDP with multi-gpus

Hi, thanks for this great work!
I have a question about using GaLore with DDP. I was trying to use GaLore for training 7B with DDP (multi-gpu).
However, I noticed that when using DDP, the memory gets doubled due to the buffer for gradient synchronization in DDP, so the required memory of 7B (bf16) is around 28GB even before using GaLore. Thus, I got OOM when using GaLore to train 7B in gpus with 32GB.
I was wondering if you've encountered the same issue and any suggestion would be appreciated. Thanks!
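One general DDP mitigation worth trying (not GaLore-specific, and it does not remove the full-rank gradient all-reduce itself): construct DDP with gradient_as_bucket_view=True so the .grad tensors alias the communication buckets instead of being duplicated, saving roughly one full copy of the gradients. A sketch, assuming model and local_rank are already set up:

from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(
    model,
    device_ids=[local_rank],
    gradient_as_bucket_view=True,  # gradients share storage with the reduction buckets
)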

Does galore save gradient memory?

Dear authors, I am truly grateful for your outstanding work. Please allow me to raise a small question regarding gradient memory:
As I understand it, the LOMO method can only ensure that gradients are processed layer by layer, but the gradient memory for each weight matrix is not compressed; its shape stays consistent with the original weight.
I'm not sure whether I have misunderstood or am misusing something.
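For what it's worth, my understanding of the per-layer variant in this repo is that each weight's full-rank gradient is freed right after that layer's update, so at most one layer's gradient is alive at a time, while the optimizer states live in the projected rank-r space. A rough sketch of that layerwise pattern (hyperparameters are illustrative, and galore_params is assumed to be collected as in the README):

import torch
from galore_torch import GaLoreAdamW8bit

# one small optimizer per GaLore parameter
optimizer_dict = {
    p: GaLoreAdamW8bit(
        [{"params": [p], "rank": 128, "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"}],
        lr=0.01,
    )
    for p in galore_params
}

def optimizer_hook(p):
    optimizer_dict[p].step()
    optimizer_dict[p].zero_grad()  # this layer's full-rank gradient is released here

for p in galore_params:
    p.register_post_accumulate_grad_hook(optimizer_hook)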

(Question) About glue tasks

Hello, thanks for your inspiring and excellent work!

I want to try full fine-tuning as a comparison with GaLore, so I have disabled GaLore. However, I'm running into a problem: when I run a GLUE task (e.g. mrpc) to fully fine-tune RoBERTa, I find that the eval accuracy doesn't change at all as training progresses. I have ruled out a possible overfitting problem, and I would like to ask the authors, or anyone else, whether there is a known solution.


Please add Phi-2 Support

Attempting to use GaLore to fine-tune a Phi model yields "AttributeError: 'PhiConfig' object has no attribute 'rms_norm_eps'", which, having seen that error in other LLM tooling, typically translates to "this code doesn't support Phi models".

Fixing this would be incredibly nice, as it would allow people with more modest computers to fine-tune LLMs.

Questions about Figure 3 in the original paper

[Figure 3 from the paper]
In the figure, Rank = 1024 and Rank = 512 are very close to the baseline, and sometimes even better. I have the following two questions.

  1. Are Rank = 1024 and Rank = 512 consistently better than the baseline, or is there some randomness involved? If they are consistently better, how can this phenomenon be explained?
  2. Have you run experiments with a very small rank (e.g. n/8, n/16), and how much does that specifically affect the results?
    Looking forward to your reply. Your support has been invaluable to me.

Resume function for optimizer

Hi, thank you for generously open-sourcing your excellent work. During our experiments, we noticed that there doesn't seem to be a resume/reload function for the optimizer regarding args.continue_from. Is our understanding correct? If this feature has already been implemented, could you please let us know?

By the way, we found the resume function for the checkpoint model and logging information in these lines.
Thank you for reviewing this inquiry.
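In the meantime, a generic sketch of saving and restoring the optimizer alongside the model checkpoint (save_dir, model, optimizer and update_step are assumed to already exist; this is not code from the repo). Note that GaLore keeps its projector inside the optimizer state, so it is worth verifying that it survives a state_dict round-trip in your version:

import os
import torch

# saving
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "update_step": update_step},
    os.path.join(save_dir, "checkpoint.pt"),
)

# resuming
ckpt = torch.load(os.path.join(save_dir, "checkpoint.pt"), map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
update_step = ckpt["update_step"]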
