jiaweizzhao / galore
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
License: Apache License 2.0
Hi,
Sorry if this is a stupid question, but is it possible to use the 8-bit GaLore optimiser in combination with LoRA adapters?
Thanks
Hey,
As mentioned in the title, the model is converted directly to BF16, without using the `torch.amp` functions for `autocast` and scaling that AMP needs.
This means that the projected memory shown here is only the 2 bytes per parameter for the BF16 model, but the post-training results would be bad according to various sources. Beyond that, we would need AMP for it to work properly, which means 6 bytes per parameter, far exceeding the 24 GiB mentioned in the paper.
For LLaMa3 8B, you would need 8 * 10^9 * 6 bytes ~ 44GiB for just parameter loading in BF16 AMP.
Just wanted to point this out and ask why it is done this way. The paper also mentions a 58 GiB minimum, but I think you'd need much more than that.
If this is a deliberate decision, please point me to the studies that show that such training has been stabilized.
src: [ https://docs.fast.ai/callback.fp16.html ]
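The arithmetic in the post above can be checked with a short sketch. It assumes that "6 bytes per parameter" means a 4-byte FP32 master copy plus a 2-byte BF16 working copy, which is one common mixed-precision layout and not something the repository confirms. The helper name `param_memory_gib` is made up for this illustration.

```python
GIB = 2**30  # bytes in one GiB


def param_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the parameters alone, in GiB."""
    return num_params * bytes_per_param / GIB


num_params = 8e9  # LLaMA-3 8B, as in the post above

# Pure BF16 weights: 2 bytes per parameter.
bf16_only = param_memory_gib(num_params, 2)

# Assumed AMP layout: 4-byte FP32 master weights + 2-byte BF16 copy.
bf16_amp = param_memory_gib(num_params, 6)

print(f"BF16 only: {bf16_only:.1f} GiB")  # BF16 only: 14.9 GiB
print(f"BF16 AMP:  {bf16_amp:.1f} GiB")   # BF16 AMP:  44.7 GiB
```

This covers parameter storage only. Gradients, optimizer states, and activations come on top of it.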
Has anyone successfully replicated the results of fine-tuning tasks?
I followed the hyperparameters outlined in the README and the paper, and tried the CoLA and MRPC tasks on a single GPU without gradient accumulation. However, the results I obtained differed from those reported in the paper.
And here are the best results from my runs:
I have converted the GaLore code to C++ (libtorch) and am currently running into an issue where a large layer (the initial embedding layer) fails at SVD.
Layer is 31618x2624
I am running with full_matrices set to false already.
[W BatchLinearAlgebraLib.cpp:703] Warning: torch.linalg.svd: During SVD computation with the selected cusolver driver, batches 0 failed to converge. A more accurate method will be used to compute the SVD as a fallback. Check doc at https://pytorch.org/docs/stable/generated/torch.linalg.svd.html (function operator ()) [08:32:34.6859554] Projection failed: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 2623).
Is this a known issue, and are there any workarounds?
In single-GPU mode, I successfully ran training on an RTX 3090, but it took too long.
In DDP mode, we got an OOM error at this call:
model: LlamaForCausalLM = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    output_device=local_rank,
    broadcast_buffers=False,
)
Thanks for the great work. One thing I'm curious about is whether it actually works well for SFT of LLMs. This is not covered in the paper either. I tried the following parameters on a 2B-sized model, but convergence is very slow. Could you please give me some advice?
lr: 5e-5
galore_rank: 64
galore_update_proj_gap: 200
scale: 0.25
proj_type: std
Can GaLore support the LLaVA model?
Impressive and insightful work, hooray to the authors! I recently read your paper, but I'm confused about the following parts.
Sorry if I ask stupid questions. Thank you for your time and consideration.
/usr/local/lib/python3.10/dist-packages/galore_torch/galore_projector.py in get_orthogonal_matrix(self, weights, rank, type)
85 #make the smaller matrix always to be orthogonal matrix
86 if type=='right':
---> 87 A = U[:, :rank] @ torch.diag(s[:rank])
88 B = Vh[:rank, :]
89
RuntimeError: diag(): Supports 1D or 2D tensors. Got 3D
As I understand it, galore is not able to work with models that work with 3D inputs/outputs?
import os

import datasets
from transformers import Trainer, TrainingArguments

# Configuration parameters
model_name_or_path = "mistralai/Mistral-7B-v0.1"
max_length = 128
doc_stride = 128
pad_to_max_length = True
per_device_train_batch_size = 1
per_device_eval_batch_size = 1
learning_rate = 0.0002
weight_decay = 0.0
num_train_epochs = 1
gradient_accumulation_steps = 1
output_dir = "/home/IAIS/jdatta/teacher_model"
seed = 42
# Load the datasets
squad = datasets.load_dataset("rajpurkar/squad_v2")
dataset = squad['train'].train_test_split(test_size=0.2)
train_dataset = dataset['train']
eval_dataset = dataset['test']
train_dataset = train_dataset.select(range(1000))
eval_dataset = eval_dataset.select(range(500))
training_args = TrainingArguments(
output_dir=output_dir,
evaluation_strategy="steps",
warmup_ratio=0.05,
overwrite_output_dir=True,
gradient_accumulation_steps=gradient_accumulation_steps,
per_device_train_batch_size=per_device_train_batch_size,
per_device_eval_batch_size=per_device_eval_batch_size,
num_train_epochs=num_train_epochs,
fp16=True,
eval_steps=10,
save_strategy='steps',
save_steps=10,
save_total_limit=1,
dataloader_num_workers=2,
load_best_model_at_end=True,
report_to="none",
prediction_loss_only=True,
gradient_checkpointing=True,
optim_args="rank=64, update_proj_gap=100, scale=0.10",
optim="galore_adafactor",
optim_target_modules=["c_attn", "c_proj", "q_proj", "k_proj", "v_proj", "down_proj", "up_proj"],
learning_rate=learning_rate,
weight_decay=weight_decay,
)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=data_collator,
)
trainer.train()
Training is not starting.
It has been showing the following messages for 2 hours:
/home/IAIS/jdatta/miniconda3/envs/myenv/lib/python3.11/site-packages/transformers/training_args.py:1474: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
warnings.warn(
Activated GaLoRE fine-tuning, depending on your model size and hardware, the training might take a while before starting. Please be patient !
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
Should I tune any parameter?
I've tried with Mistral-7b, Phi-2, Llama-7b also.
In Figure 1, what is batch size, sequence len, and vocab size here? It isn't clear from the caption. I would expect activations to take up more space. From what I can tell:
So the logits of the Llama model alone should take up `256 * 2048 * 32000 * 2` bytes, or 31.25 GiB. Where is this required memory in Figure 1?
Thanks!
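The logits estimate above can be reproduced as follows. The batch size, sequence length, and vocabulary size are the values assumed in the question, not values confirmed by the paper, and `logits_bytes` is a hypothetical helper.

```python
def logits_bytes(batch_size: int, seq_len: int, vocab_size: int,
                 bytes_per_element: int = 2) -> int:
    """Size of a [batch, seq, vocab] logits tensor in bytes (2 bytes = BF16/FP16)."""
    return batch_size * seq_len * vocab_size * bytes_per_element


total = logits_bytes(batch_size=256, seq_len=2048, vocab_size=32000)
print(total)          # 33554432000
print(total / 2**30)  # 31.25
```

The figure of 31.25 is in GiB (2^30 bytes). In decimal gigabytes the same tensor is about 33.6 GB.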
I encountered an error, how should I resolve it?
[WARNING|trainer.py:1272] 2024-04-27 12:04:25,428 >> Activated GaLoRE fine-tuning, depending on your model size and hardware, the training might take a while before starting. Please be patient !
/home/jiahui/anaconda3/envs/llm/lib/python3.10/site-packages/galore_torch/adamw.py:48: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
Traceback (most recent call last):
File "/home/jiahui/workspace/nmt/thesis_nmt/mnmt/multi/scripts/run_translation.py", line 618, in <module>
main()
File "/home/jiahui/workspace/nmt/thesis_nmt/mnmt/multi/scripts/run_translation.py", line 534, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/jiahui/workspace/nmt/thesis_nmt/transformers/src/transformers/trainer.py", line 1848, in train
return inner_training_loop(
File "/home/jiahui/workspace/nmt/thesis_nmt/transformers/src/transformers/trainer.py", line 1949, in _inner_training_loop
self.create_optimizer_and_scheduler(num_training_steps=max_steps)
File "/home/jiahui/workspace/nmt/thesis_nmt/transformers/src/transformers/trainer.py", line 981, in create_optimizer_and_scheduler
self.create_optimizer()
File "/home/jiahui/workspace/nmt/thesis_nmt/transformers/src/transformers/trainer.py", line 1038, in create_optimizer
self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
File "/home/jiahui/anaconda3/envs/llm/lib/python3.10/site-packages/galore_torch/adamw.py", line 64, in __init__
super().__init__(params, defaults)
File "/home/jiahui/anaconda3/envs/llm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 192, in __init__
self.add_param_group(param_group)
File "/home/jiahui/anaconda3/envs/llm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 535, in add_param_group
raise ValueError("some parameters appear in more than one parameter group")
ValueError: some parameters appear in more than one parameter group
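The `ValueError` above is raised when the same parameter object is listed in two parameter groups. As a minimal debugging sketch, the groups can be checked for overlap before they are handed to an optimizer. `find_shared_params` is a hypothetical helper, not part of `galore_torch` or `transformers`, and plain Python objects stand in for `torch.nn.Parameter` instances so that the example runs without PyTorch.

```python
from collections import defaultdict


def find_shared_params(param_groups):
    """Map id(param) -> sorted group indices, for params found in 2+ groups."""
    seen = defaultdict(set)
    for idx, group in enumerate(param_groups):
        for p in group["params"]:
            seen[id(p)].add(idx)  # compare by identity, not by value
    return {pid: sorted(idxs) for pid, idxs in seen.items() if len(idxs) > 1}


# Stand-in objects; in practice these would be torch.nn.Parameter instances.
w_q, w_k, bias = object(), object(), object()
groups = [
    {"params": [w_q, w_k]},   # e.g. a group meant for GaLore
    {"params": [w_k, bias]},  # e.g. the regular group; w_k is listed again
]

overlap = find_shared_params(groups)
print(list(overlap.values()))  # [[0, 1]]
```

An empty result means the groups are disjoint. A non-empty result points at the group indices that need to be deduplicated.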
Hi! Congrats on the great work! I have a question regarding the gradient storage: the paper mentioned that GaLore also uses LOMO to avoid materializing the full gradient, but I couldn't find where LOMO is implemented in the code base. Can you point me to where it is implemented (or the equivalents)? Thanks!
Would it be possible for you to add how long each training run takes to the README? I think a lot of people who have heard about Galore would be interested in that.
Thank you for your great work. I am trying to reproduce the results in "Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks"
I have gone through this issue, but I still fail to reproduce some of the results.
I ran experiments with GaLore at rank 4, using the following script:
python run_glue.py \
--model_name_or_path roberta-base \
--task_name $task_name \
--enable_galore \
--lora_all_modules \
--max_length 512 \
--seed $seed \
--lora_r 4 \
--galore_scale 4 \
--per_device_train_batch_size 16 \
--update_proj_gap 500 \
--learning_rate 1e-5 \
--num_train_epochs 30 \
--output_dir results/ft/roberta_base/$task_name
Following the hyperparameters in Table 7 of your paper, I changed the batch size to 32 and the learning rate to 3e-5 for the CoLA dataset. But I got the following results:
| Dataset | Result in Paper | Reproduced Results |
|---|---|---|
| MRPC | 92.25 | Acc 87.74, F1 91.10 |
| CoLA | 60.35 | matthews_correlation: 59.56 |
| RTE | 79.42 | Acc 77.25 |
| STSB | 90.73 | pearson 0.90526; spearmanr 0.90339 |
| QQP | 91.06 | pearson 0.90785; spearmanr 0.90589 |
I am wondering if I have missed something?
trainer = ORPOTrainer(
model=model,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
#peft_config=peft_config,
tokenizer=tokenizer,
args= ORPOConfig(
max_length=cutoff_len,
max_prompt_length=cutoff_len//2,
beta=0.1,
per_device_train_batch_size=micro_batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
warmup_steps=0,
num_train_epochs=num_epochs,
lr_scheduler_type="cosine",
learning_rate=8e-6,
bf16=True,
logging_steps=10,
optim = "galore_adamw_8bit_layerwise",
optim_target_modules=[r".*attn.*", r".*mlp.*"],
optim_args="rank=1024, update_proj_gap=500, scale=0.25",
evaluation_strategy="steps" if val_set_size > 0 else "no",
save_strategy="steps",
eval_steps=100 if val_set_size > 0 else None,
save_steps=100,
output_dir=output_dir,
save_total_limit=2,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={'use_reentrant':True},
load_best_model_at_end=True if val_set_size > 0 else False,
ddp_find_unused_parameters=False if ddp else None,
report_to="wandb" if use_wandb else None,
run_name=wandb_run_name if use_wandb else None,
do_train=True,
remove_unused_columns=False,
)
)
Activated GaLoRE fine-tuning, depending on your model size and hardware, the training might take a while before starting. Please be patient !
0%| | 0/495 [00:00<?, ?it/s]Could not estimate the number of tokens of the input, floating-point operations will not be computed
{'loss': 0.3557, 'grad_norm': 0.0, 'learning_rate': 0.001, 'rewards/chosen': -0.015678538009524345, 'rewards/rejected': -0.012379011139273643, 'rewards/accuracies': 0.19999998807907104, 'rewards/margins': -0.003299527335911989, 'logps/rejected': -0.12379010766744614, 'logps/chosen': -0.15678536891937256, 'logits/rejected': 0.7921055555343628, 'logits/chosen': 0.791210412979126, 'nll_loss': 0.2719877064228058, 'log_odds_ratio': -0.8374900817871094, 'log_odds_chosen': -0.25091928243637085, 'epoch': 0.06}
{'loss': 0.2634, 'grad_norm': 0.0, 'learning_rate': 0.001, 'rewards/chosen': -0.012010233476758003, 'rewards/rejected': -0.009977776557207108, 'rewards/accuracies': 0.29999998211860657, 'rewards/margins': -0.0020324576180428267, 'logps/rejected': -0.09977775812149048, 'logps/chosen': -0.12010233104228973, 'logits/rejected': 0.7489851713180542, 'logits/chosen': 0.7482139468193054, 'nll_loss': 0.1832979917526245, 'log_odds_ratio': -0.8010236620903015, 'log_odds_chosen': -0.16869042813777924, 'epoch': 0.12}
{'loss': 0.2482, 'grad_norm': 0.0, 'learning_rate': 0.001, 'rewards/chosen': -0.011346157640218735, 'rewards/rejected': -0.01022450439631939, 'rewards/accuracies': 0.4833333492279053, 'rewards/margins': -0.0011216530110687017, 'logps/rejected': -0.102245032787323, 'logps/chosen': -0.11346157640218735, 'logits/rejected': 0.7105721831321716, 'logits/chosen': 0.7108334898948669, 'nll_loss': 0.17242279648780823, 'log_odds_ratio': -0.7573043704032898, 'log_odds_chosen': -0.08471358567476273, 'epoch': 0.18}
{'loss': 0.2444, 'grad_norm': 0.0, 'learning_rate': 0.001, 'rewards/chosen': -0.012975988909602165, 'rewards/rejected': -0.013058923184871674, 'rewards/accuracies': 0.550000011920929, 'rewards/margins': 8.293241262435913e-05, 'logps/rejected': -0.13058921694755554, 'logps/chosen': -0.12975989282131195, 'logits/rejected': 0.6808757781982422, 'logits/chosen': 0.6832461953163147, 'nll_loss': 0.1756206750869751, 'log_odds_ratio': -0.687309741973877, 'log_odds_chosen': 0.04155167192220688, 'epoch': 0.24}
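The `optim_target_modules` entries in the config above are regular expressions. The sketch below lists which module names such patterns would select, assuming each pattern must match the full module name. That matching rule is an assumption about how the trainer interprets the patterns, the module names are illustrative Llama-style names, and `matched_modules` is a hypothetical helper.

```python
import re


def matched_modules(patterns, module_names):
    """Return the names that fully match at least one regex pattern."""
    return [
        name for name in module_names
        if any(re.fullmatch(p, name) for p in patterns)
    ]


# Illustrative module names, not taken from a real model dump.
names = [
    "model.layers.0.self_attn",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.rotary_emb",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.input_layernorm",
    "lm_head",
]

hits = matched_modules([r".*attn.*", r".*mlp.*"], names)
for name in hits:
    print(name)
```

Under this assumption, broad patterns also select container modules such as `self_attn` and non-linear children such as `rotary_emb`, not only the linear projections.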
Hi, thanks for releasing GaLore! I'm running out of memory whenever I use a sequence length longer than 512, even if I use a smaller model. I can train a 7B model w/ a 512 sequence length on 24G VRAM, but I can't train a 5B model w/ a 8192 sequence length. Thanks!
Jamba is a very interesting new model and I’d love to add support for galore for finetuning it. It’s an MoE+Transformer+Mamba hybrid so I’m not sure how that would work with Galore.
thoughts/pointers? @jiaweizzhao @agnim25 @darthjaja6
Hi team, thanks very much for GaLore. I'm currently using HuggingFace for fine-tuning and would like to integrate GaLore with it.
This is not a bug report. I'm just interested in using GaLore with HuggingFace.
import os

import torch
import transformers
from galore_torch import GaLoreAdamW, GaLoreAdamW8bit, GaLoreAdafactor
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer
lora_config = LoraConfig(
r=8,
target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
task_type="CAUSAL_LM",
)
model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])
trainer = SFTTrainer(
model=model,
train_dataset=data["train"],
args=transformers.TrainingArguments(
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
warmup_steps=2,
max_steps=10,
learning_rate=2e-4,
fp16=True,
logging_steps=1,
output_dir="outputs",
optim="paged_adamw_8bit"
),
peft_config=lora_config,
)
trainer.train()
Should I just replace `optim="paged_adamw_8bit"` with `optim = GaLoreAdamW8bit`? Can you please provide a sample script?
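In the Hugging Face `Trainer` configs quoted elsewhere in this thread, GaLore is selected through a string-valued `optim` together with `optim_target_modules` and `optim_args`, not by passing an optimizer class. A minimal sketch of those arguments follows. It assumes the installed `transformers` version accepts the name `"galore_adamw_8bit"`; the rank, gap, and scale values are copied from other posts for illustration and are not recommendations. Whether this combines with the LoRA and 4-bit setup above is not established here.

```python
# Keyword arguments intended for transformers.TrainingArguments.
# The optimizer name is an assumption about the installed transformers
# version; the numeric values are illustrative only.
galore_kwargs = {
    "optim": "galore_adamw_8bit",
    "optim_target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "optim_args": "rank=64, update_proj_gap=200, scale=0.25",
}

# Usage (not executed here):
#   args = transformers.TrainingArguments(output_dir="outputs", **galore_kwargs)
print(galore_kwargs["optim"])
```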
Hello, I enjoyed reading your excellent paper. I have one question: is the accuracy mentioned for the GLUE benchmarks based on the validation set or the test set? The paper does not specify this detail. Thank you for your clarification.
Hi, I appreciate your awesome work!
When I tried to use the GaLore AdamW optimizer for Gemma training, it seemed to be incompatible with DeepSpeed at both ZeRO stage 0 and stage 1:
I guess this is because DeepSpeed's BF16_Optimizer flattens the parameters for memory efficiency. Perhaps this also affects usage with FSDP.
The method works great on most layers, but on the final projection in my transformer (1024 x 50k) I get
RuntimeError: cusolver error: CUSOLVER_STATUS_INVALID_VALUE, when calling `cusolverDnSgesvdj_bufferSize(handle, jobz, econ, m, n, A, lda, S, U, ldu, V, ldv, lwork, params)`
when executing `U, s, Vh = torch.linalg.svd(matrix)`.
The issue is fixed by using `U, s, Vh = torch.linalg.svd(matrix, full_matrices = False)`.
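To see how much `full_matrices` changes the size of the result, here is a small sketch of the output shapes that `torch.linalg.svd` documents for an `m x n` input. It reads "50k" as 50,000 and assumes FP32 outputs. It does not establish that output size is what triggers the cuSOLVER error, and the helper names are made up for this illustration.

```python
def svd_output_shapes(m: int, n: int, full_matrices: bool):
    """Shapes of (U, S, Vh) for an m x n input, per the documented SVD layout."""
    k = min(m, n)
    if full_matrices:
        return (m, m), (k,), (n, n)
    return (m, k), (k,), (k, n)


def numel(shape) -> int:
    """Number of elements in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


m, n = 1024, 50_000  # "1024 x 50k" from the post, with 50k taken as 50,000
for full in (True, False):
    u, s, vh = svd_output_shapes(m, n, full)
    fp32_gib = 4 * (numel(u) + numel(s) + numel(vh)) / 2**30
    print(f"full_matrices={full}: U={u}, Vh={vh}, about {fp32_gib:.2f} GiB in FP32")
```

With `full_matrices=True`, `Vh` alone is 50,000 x 50,000, roughly 9.3 GiB in FP32, while the reduced form keeps it at 1024 x 50,000.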
Hi GaLore Team, congratulations on the interesting work!
I am trying to fine-tune the Llama-3 8B model using GaLore but am getting this error:
`torch._C._LinAlgError: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values`.
Interestingly, the first batch loss is non-zero, and all subsequent losses are zero until training terminates automatically.
Activated GaLoRE fine-tuning, depending on your model size and hardware, the training might take a while before starting. Please be patient !
model.layers.0.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.0.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.0.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.0.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.1.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.1.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.1.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.1.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.2.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.2.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.2.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.2.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.3.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.3.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.3.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.3.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.4.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.4.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.4.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.4.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.5.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.5.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.5.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.5.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.6.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.6.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.6.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.6.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.7.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.7.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.7.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.7.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.8.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.8.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.8.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.8.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.9.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.9.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.9.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.9.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.10.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.10.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.10.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.10.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.11.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.11.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.11.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.11.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.12.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.12.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.12.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.12.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.13.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.13.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.13.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.13.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.14.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.14.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.14.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.14.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.15.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.15.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.15.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.15.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.16.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.16.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.16.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.16.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.17.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.17.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.17.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.17.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.18.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.18.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.18.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.18.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.19.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.19.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.19.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.19.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.20.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.20.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.20.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.20.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.21.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.21.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.21.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.21.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.22.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.22.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.22.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.22.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.23.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.23.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.23.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.23.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.24.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.24.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.24.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.24.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.25.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.25.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.25.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.25.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.26.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.26.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.26.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.26.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.27.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.27.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.27.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.27.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.28.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.28.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.28.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.28.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.29.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.29.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.29.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.29.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.30.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.30.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.30.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.30.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.31.self_attn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.31.self_attn.rotary_emb has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.31.mlp has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
model.layers.31.mlp.act_fn has been matched but ignored as GaLore only supports linear layers. Please double check your `optim_target_modules`!
0%| | 0/6719320 [00:00<?, ?it/s]You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
0%| | 1/6719320 [06:15<701609:55:03, 375.90s/it][2024-07-23 07:19:54,094] [INFO] [axolotl.callbacks.on_step_end:128] [PID:148509] [RANK:0] GPU memory usage while training: 17.607GB (+15.215GB cache, +1.482GB misc)
{'loss': 1.694, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0.001, 'epoch': 0.0}
0%| | 200/6719320 [08:51<1455:09:36, 1.28it/s]/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/galore_torch/galore_projector.py:83: UserWarning: torch.linalg.svd: During SVD computation with the selected cusolver driver, batches 0 failed to converge. A more accurate method will be used to compute the SVD as a fallback. Check doc at https://pytorch.org/docs/stable/generated/torch.linalg.svd.html (Triggered internally at ../aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp:697.)
U, s, Vh = torch.linalg.svd(matrix, full_matrices = False)
Traceback (most recent call last):
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/minimalist/work/projects/sota/axolotl/src/axolotl/cli/train.py", line 72, in <module>
fire.Fire(do_cli)
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/minimalist/work/projects/sota/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
return do_train(parsed_cfg, parsed_cli_args)
File "/home/minimalist/work/projects/sota/axolotl/src/axolotl/cli/train.py", line 67, in do_train
return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
File "/home/minimalist/work/projects/sota/axolotl/src/axolotl/train.py", line 191, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
return inner_training_loop(
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/transformers/trainer.py", line 3324, in training_step
self.accelerator.backward(loss, **kwargs)
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/accelerate/accelerator.py", line 2151, in backward
loss.backward(**kwargs)
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/transformers/trainer.py", line 1398, in optimizer_hook
optimizer_dict[param].step()
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
return wrapped(*args, **kwargs)
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
out = func(*args, **kwargs)
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/galore_torch/adamw8bit.py", line 58, in step
grad = state["projector"].project(p.grad, state["step"])
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/galore_torch/galore_projector.py", line 21, in project
self.ortho_matrix = self.get_orthogonal_matrix(full_rank_grad, self.rank, type='left')
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/galore_torch/galore_projector.py", line 83, in get_orthogonal_matrix
U, s, Vh = torch.linalg.svd(matrix, full_matrices = False)
torch._C._LinAlgError: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 1023).
0%| | 200/6719320 [13:19<7465:36:33, 4.00s/it]
Traceback (most recent call last):
File "/home/minimalist/miniconda3/envs/comps/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
simple_launcher(args)
File "/home/minimalist/miniconda3/envs/comps/lib/python3.10/site-packages/accelerate/commands/launch.py", line 703, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/minimalist/miniconda3/envs/comps/bin/python', '-m', 'axolotl.cli.train', 'examples/llama-3/qlora.yml']' returned non-zero exit status 1.
base_model: meta-llama/Meta-Llama-3-8B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
datasets:
- path: <dataset>
type: sharegpt
conversation: llama-3
field_human: human
field_model: gpt
dataset_prepared_path:
val_set_size: 0.01
output_dir: ./outputs/galore-out
sequence_len: 2048
sample_packing: false
eval_sample_packing: true
pad_to_sequence_len: true
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 4
optimizer: galore_adamw_8bit_layerwise
lr_scheduler: cosine
learning_rate: 0.000001
optim_target_modules:
- self_attn
- mlp
train_on_inputs: false
group_by_length: false
bf16: true
tf32: false
bfloat16: true
logging_steps: 4
flash_attention: true
Hello, thank you very much for such excellent work. We have conducted some experiments using LLaMA-Factory, and the results indicate that GaLore can significantly reduce memory usage during full-parameter fine-tuning. We used the 8-bit AdamW optimizer and pure bfloat16 training with gradient checkpointing. GaLore requires only 18GB of VRAM to train a Llama-2 7B model, while the standard 8-bit AdamW optimizer requires at least 40GB of VRAM. We provide reproducible scripts for SFT training here: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/extras/galore/galore_adamw_8bit_bf16.sh
Method | Rank | Retain grad | Memory | Token/s
---|---|---|---|---
8-bit AdamW | | Yes | 40GB | 1434
8-bit GaLore | 16 | Yes | 28GB | 1532
8-bit GaLore | 128 | Yes | 29GB | 1532
16-bit GaLore | 128 | Yes | 30GB | 1615
16-bit GaLore | 128 | No | 18GB | 1587
8-bit GaLore | 1024 | Yes | 36GB | 1238
* The time spent computing the SVD for GaLore every update_proj_gap steps is omitted; it takes around 10 minutes for a 7B model.
Experiment results last updated: Mar 9th.
todo: add loss convergence results.
Hello, thank you for providing the implementation of the paper. When I run the code, the first call to optimizer.step() takes an extremely long time.
When pretraining the llama_1b model on one A100 with batch_size == 1, the first optimizer.step() took 70 seconds, but every later step took a normal amount of time (30ms). Is this because of some tensor-registration step?
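One likely explanation, offered as a sketch rather than a confirmed diagnosis: GaLore computes an SVD of each targeted gradient the first time a projector is used and then only every update_proj_gap steps, so the first optimizer.step() pays for one SVD per targeted weight matrix. The helper below is hypothetical and only models that schedule; it is not the library's code.

```python
def svd_refresh_steps(num_steps: int, update_proj_gap: int) -> list[int]:
    """Return the step indices at which a projector would recompute its SVD.

    Assumption: the projection is refreshed when no projection matrix exists
    yet (step 0) or when the step index is a multiple of update_proj_gap.
    """
    if update_proj_gap <= 0:
        raise ValueError("update_proj_gap must be positive")
    return [step for step in range(num_steps) if step % update_proj_gap == 0]


# Steps that pay the SVD cost in a 1000-step run; all other steps skip it.
steps = svd_refresh_steps(num_steps=1000, update_proj_gap=200)
print(steps)
```

Under this assumption, a slow step should recur at each multiple of update_proj_gap, which is easy to check by timing individual steps.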
To replicate the above results, run cmd in README, machine configuration: A100 80GB, CUDA version: 11.8, other environments are installed following the recommendation in the repo
# LLaMA-7B, 8-bit GaLore-Adam, single GPU, activation checkpointing
# bsz=16, 22.8G,
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
--model_config configs/llama_7b.json \
--lr 0.005 \
--galore_scale 0.25 \
--rank 1024 \
--update_proj_gap 500 \
--batch_size 16 \
--total_batch_size 512 \
--activation_checkpointing \
--num_training_steps 150000 \
--warmup_steps 15000 \
--weight_decay 0 \
--grad_clipping 1.0 \
--dtype bfloat16 \
--eval_every 1000 \
--single_gpu \
--optimizer galore_adamw8bit_per_layer
Error information
/root/anaconda3/envs/new_llm/lib/python3.10/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
Traceback (most recent call last):
File "/root/paddlejob/workspace/20240315/0_llm/new_llm/LLaMA-Factory-main/src/train_bash.py", line 14, in <module>
main()
File "/root/paddlejob/workspace/20240315/0_llm/new_llm/LLaMA-Factory-main/src/train_bash.py", line 5, in main
run_exp()
File "/root/paddlejob/workspace/20240315/0_llm/new_llm/LLaMA-Factory-main/src/llmtuner/train/tuner.py", line 32, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/root/paddlejob/workspace/20240315/0_llm/new_llm/LLaMA-Factory-main/src/llmtuner/train/sft/workflow.py", line 54, in run_sft
trainer = CustomSeq2SeqTrainer(
File "/root/anaconda3/envs/new_llm/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 56, in __init__
super().__init__(
File "/root/anaconda3/envs/new_llm/lib/python3.10/site-packages/transformers/trainer.py", line 527, in __init__
raise RuntimeError(
RuntimeError: Passing `optimizers` is not allowed if Deepspeed or PyTorch FSDP is enabled. You should subclass `Trainer` and override the `create_optimizer_and_scheduler` method.
deepspeed zero3 config
{
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
Great work, and many thanks for this.
I have already fine-tuned a model, and it shows the best performance.
My question is: if I want to fine-tune Llama 2, Mistral, OpenChat, etc., how do I find the right values for the following?
optim_target_modules=["attn", "mlp"]
I was advised to confirm that these optim_target_modules match the model. "mlp" matches, but the other one does not.
Are there any docs available, or any suggestions?
Thanks.
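A minimal sketch of how candidate patterns could be checked before training, assuming list entries in optim_target_modules are matched as substrings of module names. The module names below are illustrative and not read from a real checkpoint; in practice they would come from model.named_modules(). `match_targets` is a hypothetical helper.

```python
def match_targets(module_names, target_patterns):
    """Map each pattern to the module names that contain it as a substring."""
    return {
        pattern: [name for name in module_names if pattern in name]
        for pattern in target_patterns
    }


# Illustrative names in the style of a Llama-like decoder (hypothetical).
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.input_layernorm",
]

for pattern, hits in match_targets(names, ["attn", "mlp", "attention"]).items():
    print(pattern, "->", hits if hits else "no match")
```

A pattern that reports no match for a given architecture is the one to rename, for example after printing the real module names of the model being fine-tuned.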
I tried to use GaLore on nn.Linear(256, 267736).
Then I got the following error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 267.04 GiB.
at U, s, Vh = torch.linalg.svd(matrix).
I think full_matrices=False may be required at torch.linalg.svd.
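The arithmetic behind that allocation can be sketched as follows, assuming the gradient has the weight's shape (267736, 256) and is decomposed in float32. With full matrices, one SVD factor is a square matrix with 267736 rows and columns; the reduced form keeps only 256 of those columns. `svd_factor_bytes` is a hypothetical helper, not part of PyTorch or GaLore.

```python
def svd_factor_bytes(m: int, n: int, full_matrices: bool, bytes_per_element: int = 4) -> int:
    """Bytes needed to hold U, S and Vh from an SVD of an m x n matrix."""
    k = min(m, n)
    if full_matrices:
        u_elems, vh_elems = m * m, n * n
    else:
        u_elems, vh_elems = m * k, k * n
    return (u_elems + k + vh_elems) * bytes_per_element


GIB = 2**30
m, n = 267736, 256  # weight shape of nn.Linear(256, 267736), float32 assumed

print(f"largest full factor: {m * m * 4 / GIB:.2f} GiB")
print(f"full_matrices=True:  {svd_factor_bytes(m, n, True) / GIB:.2f} GiB")
print(f"full_matrices=False: {svd_factor_bytes(m, n, False) / GIB:.2f} GiB")
```

The first figure comes out at 267.04 GiB, the same size as the failed allocation, while the reduced factors fit in well under 1 GiB.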
same title
Seems not compatible with DeepSpeed
Hi, thanks for releasing this work! It has all been very interesting to read. However, I do have a few questions regarding your results and methodology.
For Table 4, it seems that you train with a batch size of 16 but report the memory of these runs divided by 16.
There would be a memory overhead of the model weights which is greater than the memory reported in the table.
Is this way of reporting memory commonly done since it does not capture the entire picture? The memory is also reported as the same across all sub-tasks, yet for some of the tasks you are using a different batch size (e.g. 32 for CoLA).
When you report the memory, do you include the overhead of allocating memory for SVD? SVD can have a large memory overhead in practice and especially considering it is only implemented in 32-bit.
Figure 1 shows the impressive result of reducing the memory cost of training LLaMA 7B to within the budget of an RTX 4090. I have noticed that you also use an adaptive low-memory optimisation method (AdaLOMO). I am curious how much of the memory improvement comes from the gradient low-rank projection and how much comes from AdaLOMO alone.
https://github.com/OpenLMLab/LOMO
What do you mean by "token batch size"? Is this just the number of tokens for a single iteration?
The RoBERTa-base fine-tuning results also seem to be very different from the results reported in the original LoRA paper.
Thanks again!
Hi Jiawei,
I was trying Galore on TinyLlama-1B using the codebase https://github.com/jzhang38/TinyLlama on 4* A800-80GB. I encounter the following error:
[rank1]: optimizer.step()
[rank1]: File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 74, in step
[rank1]: output = self._strategy.optimizer_step(
[rank1]: File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/strategies/strategy.py", line 207, in optimizer_step
[rank1]: return self.precision.optimizer_step(optimizer, **kwargs)
[rank1]: File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/plugins/precision/fsdp.py", line 142, in optimizer_step
[rank1]: return super().optimizer_step(optimizer, **kwargs)
[rank1]: File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/plugins/precision/precision.py", line 124, in optimizer_step
[rank1]: return optimizer.step(**kwargs)
[rank1]: File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank1]: out = func(*args, **kwargs)
[rank1]: File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/galore_torch/adamw.py", line 96, in step
[rank1]: grad = state["projector"].project(grad, state["step"])
[rank1]: File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/galore_torch/galore_projector.py", line 15, in project
[rank1]: if full_rank_grad.shape[0] >= full_rank_grad.shape[1]:
[rank1]: IndexError: tuple index out of range
I use galore as you suggested in torchrun_main.py:
print('using galore')
galore_params = []
target_modules_list = ["attn", "mlp"]
for module_name, module in model.named_modules():
    if not isinstance(module, nn.Linear):
        continue
    if not any(target_key in module_name for target_key in target_modules_list):
        continue
    print('enable GaLore for weights in module: ', module_name)
    galore_params.append(module.weight)
id_galore_params = [id(p) for p in galore_params]
# make parameters without "rank" to another group
regular_params = [p for p in model.parameters() if id(p) not in id_galore_params]
# then call galore_adamw
param_groups = [{'params': regular_params},
                {'params': galore_params, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}]
optimizer = GaLoreAdamW(param_groups, lr=learning_rate)
Any idea why and how to fix it?
Thanks in advance!
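One plausible cause, not confirmed in this thread: the traceback passes through Lightning's FSDP precision plugin, and FSDP can expose parameters as flattened 1-D tensors, so full_rank_grad.shape has a single entry and shape[1] raises IndexError. The sketch below reproduces the failure on plain shape tuples and shows a hypothetical guard, `can_project`, that a caller could use to route only 2-D gradients to GaLore.

```python
def can_project(shape: tuple) -> bool:
    """A low-rank projection of this kind needs a 2-D (matrix) gradient."""
    return len(shape) == 2


def taller_than_wide(shape: tuple) -> bool:
    """Same comparison as the failing line in the traceback."""
    return shape[0] >= shape[1]


flattened = (4096 * 4096,)  # 1-D shape, as a flattened parameter would have
matrix = (4096, 11008)      # ordinary 2-D weight shape

for shape in (flattened, matrix):
    if can_project(shape):
        print(shape, "-> projectable, taller_than_wide =", taller_than_wide(shape))
    else:
        print(shape, "-> skipped: not a 2-D gradient")

try:
    taller_than_wide(flattened)
except IndexError as err:
    print("IndexError:", err)  # same error class as in the report
```

If this is the cause, printing p.shape for the parameters in the GaLore group right before optimizer.step() would show 1-D shapes.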
My model works fine with adamw_bnb_8bit.
When I switched to galore_adamw_8bit with 'all-linear',
an exception was raised: 'can't optimize a non-leaf'
Seq2SeqTrainingArguments(
    output_dir = model_name_or_path,
    save_strategy = 'no',
    logging_steps = 100,
    bf16 = True if torch.cuda.is_available() else False,
    dataloader_pin_memory = True,
    dataloader_num_workers = 8,
    num_train_epochs = 1, #1, # 2,
    do_train=True,
    learning_rate = learning_rate, # 5e-5,
    # optim = 'adamw_bnb_8bit',
    optim="galore_adamw_8bit_layerwise",
    optim_target_modules='all-linear',
    lr_scheduler_type = 'constant', # 'cosine', constant
    warmup_ratio = 0.,
    per_device_train_batch_size = batch_size, # 8,
    gradient_accumulation_steps = 1,
    report_to = 'none',
    do_eval=False,
    max_steps = max_steps,
    accelerator_config = {'dispatch_batches':False},
    **kwargs
)
Hi, I was reading the GaLore paper and noticed that the "ground truth" baseline seems to be pure BF16 training with nearest rounding. It is generally accepted that pure BF16 training with nearest rounding does not converge to the same point as FP32 or BF16/FP32 mixed precision training -- does GaLore only match pure BF16 or does it match FP32 training as well?
Adafactor originally computes its own approximation of the second moment.
But when GaLore is enabled, that approximation is computed from the gradient already shrunk by GaLore's projection instead of from the raw gradient.
This behavior may have a slightly negative impact.
File "/workspace/transformers/src/transformers/trainer.py", line 1297, in optimizer_hook
optimizer_dict[param].step()
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/galore_torch/adamw.py", line 96, in step
grad = state["projector"].project(grad, state["step"])
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/galore_torch/galore_projector.py", line 21, in project
self.ortho_matrix = self.get_orthogonal_matrix(full_rank_grad, self.rank, type='left')
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/galore_torch/galore_projector.py", line 94, in get_orthogonal_matrix
A = U[:, :rank]
TypeError: slice indices must be integers or None or have an __index__ method
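This message is what Python raises when a slice bound is not an integer, so one plausible cause (an assumption, not confirmed here) is that rank reached the projector as a float or a string, for example from a config file. A hypothetical `coerce_rank` helper could validate the value before it is placed in a parameter group:

```python
def coerce_rank(rank) -> int:
    """Return rank as a positive int, rejecting values that would change meaning."""
    if isinstance(rank, bool):
        raise TypeError("rank must be an integer, not a bool")
    if isinstance(rank, str):
        rank = float(rank)  # accepts "128" and "128.0"
    if isinstance(rank, float):
        if not rank.is_integer():
            raise ValueError(f"rank must be a whole number, got {rank}")
        rank = int(rank)
    if not isinstance(rank, int) or rank <= 0:
        raise ValueError(f"rank must be a positive integer, got {rank!r}")
    return rank


columns = list(range(10))
try:
    columns[:128.0]  # same failure mode as U[:, :rank] with a float rank
except TypeError as err:
    print("TypeError:", err)

print(columns[:coerce_rank("4")])  # slicing works once rank is an int
```

Checking type(rank) at the point where the parameter groups are built is a quick way to confirm or rule this out.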
The following is the result of running the script you provided on the MRPC task.
{"eval_accuracy": 0.8970588235294118, "eval_f1": 0.926056338028169}
Which one is the result provided in your paper?
Hi, thanks for this great work!
I have a question about using GaLore with DDP. I was trying to use GaLore for training 7B with DDP (multi-gpu).
However, I noticed that when using DDP, the memory gets doubled due to the buffer for gradient synchronization in DDP, so the required memory for 7B (bf16) is around 28GB even before using GaLore. Thus, I got OOM when using GaLore to train 7B on GPUs with 32GB.
I was wondering if you've encountered the same issue and any suggestion would be appreciated. Thanks!
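The 28GB figure follows from simple arithmetic, sketched here under the assumptions that the model has 7 billion parameters, weights are stored in bf16 (2 bytes each), and DDP keeps one additional buffer of the same size as the weights. Optimizer state and activations are excluded, and `weights_plus_ddp_buffer_gb` is a hypothetical helper.

```python
def weights_plus_ddp_buffer_gb(num_params: float, bytes_per_param: int = 2) -> dict:
    """Estimate weight memory and the extra DDP gradient buffer, in decimal GB."""
    weights = num_params * bytes_per_param / 1e9
    ddp_buffer = weights  # assumption: the buffer mirrors the parameters' size
    return {"weights": weights, "ddp_buffer": ddp_buffer, "total": weights + ddp_buffer}


estimate = weights_plus_ddp_buffer_gb(7e9)
for name, gb in estimate.items():
    print(f"{name}: {gb:.1f} GB")
```

On a 32GB device this leaves only about 4GB for optimizer state, activations and SVD workspace, which is consistent with the OOM described above.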
Dear Author, I am truly grateful for your outstanding work. Please allow me to raise a small question regarding the memory of gradient:
As I understand it, the LOMO method can only ensure that gradients are updated layer-by-layer, but the gradient memory for each weight matrix is not compressed. The shape size remains consistent with the original weight.
I'm not sure if I'm misusing it.
Hi,
Thanks for the good work. I'm trying to integrate this into Colossal-AI (https://github.com/hpcaitech/ColossalAI), compatible with Tensor Parallel and ZeRO.
However, I had trouble loading the dataset; seems they updated the dataset to remove the json schema.
Could you share your dataset version and how you're able to load it?
Thanks!
Hello, thanks for your inspiring and excellent work!
I want to try full fine-tuning to compare with GaLore, so I have disabled GaLore. However, when I run a GLUE task (i.e. MRPC) to fully fine-tune RoBERTa, the eval accuracy doesn't change at all as training progresses. I have ruled out overfitting as a possible cause, and I would like to ask the author or anyone else whether there is a relevant solution.
Hi, great project. After reading the paper and the implementation, I am wondering if it is considered to reproject the Adam internal states (exp_avg, exp_avg_sq) from previous subspace to the new subspace?
Attempting to use GaLore to fine-tune a Phi model yields "AttributeError: 'PhiConfig' object has no attribute 'rms_norm_eps'", which, having gotten that error with other LLM tools, typically translates to "this code doesn't support Phi models".
Fixing this would be incredibly nice, as it would allow people with cruddier computers to fine-tune LLMs.
In the figure, Rank = 1024 and Rank = 512 are very close to the baseline, and even better than the baseline. In response, I have the following 2 questions.
Hi, thanks very much for sharing your impressive work!
Would it be possible to release the trained model (e.g., using the script below)?
It would greatly facilitate reproducibility efforts.
Thank you for considering this request.
https://github.com/jiaweizzhao/GaLore/blob/1b36c33782bdd74a4d6a4f51bc626ef67f51011f/scripts/benchmark_c4/llama_7b.sh
How exactly did you measure Perplexity during pre-training with GaLore? (e.g. when creating Figure 5 in your paper https://arxiv.org/pdf/2403.03507.pdf ). Thanks.
Hi, thank you for generously open-sourcing your excellent work. During our experiments, we noticed that there doesn't seem to be a resume/reload function for the optimizer regarding args.continue_from. Is our understanding correct? If this feature has already been implemented, could you please let us know?
By the way, we found the resume function for the checkpoint model and logging information in these lines.
Thank you for reviewing this inquiry.