falcontune's People

Contributors

rmihaylov, rmmihaylov

falcontune's Issues

generate gives an error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

Run

falcontune generate \
    --interactive \
    --model falcon-7b \
    --weights tiiuae/falcon-7b \
    --lora_apply_dir falcon-7b-alpaca \
    --max_new_tokens 500 \
    --use_cache \
    --do_sample \
    --instruction "列出人工智能的五个可能应用"

Error

Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████| 2/2 [00:42<00:00, 21.04s/it]
Device map for lora: auto
falcon-7b-alpaca loaded
/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:318: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Traceback (most recent call last):
  File "/data/home/yaokj5/anaconda3/envs/falcon/bin/falcontune", line 33, in <module>
    sys.exit(load_entry_point('falcontune==0.1.0', 'console_scripts', 'falcontune')())
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/run.py", line 88, in main
    args.func(args)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/generate.py", line 71, in generate
    generated_ids = model.generate(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/generate.py", line 27, in autocast_generate
    return self.model.non_autocast_generate(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/peft/peft_model.py", line 731, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/generation/utils.py", line 1565, in generate
    return self.sample(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/generation/utils.py", line 2612, in sample
    outputs = self(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 1072, in forward
    transformer_outputs = self.transformer(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 967, in forward
    outputs = block(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 722, in forward
    mlp_output += attention_output
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
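For context, the instruction in the command above asks (in Chinese) for five possible applications of artificial intelligence. The traceback ends at a residual addition inside a transformer block whose inputs ended up on different GPUs, which is what happens when device_map='auto' shards the model across cuda:0 and cuda:1. A workaround I would try (an assumption on my part, not a maintainer-confirmed fix) is to keep everything on one device, either by exporting CUDA_VISIBLE_DEVICES=0 before running the CLI, or by loading the model and adapter through plain transformers/peft with a single-device map, as in this sketch:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# A minimal sketch, assuming the falcon-7b base model plus the LoRA adapter fits
# on a single GPU. device_map={"": 0} places every module on cuda:0 instead of
# letting accelerate split the blocks across two devices, which is what the
# failing line (mlp_output += attention_output) runs into. This uses the plain
# transformers/peft path rather than falcontune's own loader.
base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    torch_dtype=torch.float16,
    device_map={"": 0},
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "falcon-7b-alpaca")
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

ids = tokenizer("List five possible applications of artificial intelligence.",
                return_tensors="pt").input_ids.to("cuda:0")
print(tokenizer.decode(model.generate(input_ids=ids, max_new_tokens=500,
                                      do_sample=True, use_cache=True)[0]))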

ModuleNotFoundError: No module named 'torch._six'

Running into this issue. Any ideas for a clean resolution?

  File "/opt/conda/lib/python3.9/site-packages/accelerate/accelerator.py", line 33, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/opt/conda/lib/python3.9/site-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/__init__.py", line 109, in <module>
    from .launch import (
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/launch.py", line 27, in <module>
    from ..utils.other import merge_dicts
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/other.py", line 28, in <module>
    from deepspeed import DeepSpeedEngine
  File "/opt/conda/lib/python3.9/site-packages/deepspeed/__init__.py", line 16, in <module>
    from .runtime.engine import DeepSpeedEngine, DeepSpeedOptimizerCallable, DeepSpeedSchedulerCallable
  File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 25, in <module>
    from deepspeed.runtime.utils import see_memory_usage, get_ma_status, DummyOptim
  File "/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/utils.py", line 18, in <module>
    from torch._six import inf
ModuleNotFoundError: No module named 'torch._six'
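torch._six was removed in PyTorch 2.0, and the DeepSpeed release installed here still imports inf from it. Upgrading DeepSpeed to a version that no longer touches torch._six is the usual clean fix; for reference, the replacement that newer DeepSpeed releases use is just the plain torch constant, as this sketch illustrates:

# What newer DeepSpeed versions do instead of `from torch._six import inf`
# (a sketch of the replacement, assuming PyTorch >= 1.13 where torch.inf exists):
from torch import inf

print(inf == float("inf"))  # True; the same value deepspeed.runtime.utils expects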

Does fine-tuning support multi-GPU training?

When trying to fine-tune with multiple GPUs, I got the following error.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 1070, in forward
    transformer_outputs = self.transformer(
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 965, in forward
    outputs = block(
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/mpt/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 720, in forward
    mlp_output += attention_output
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

TypeError: Input tensors need to be on the same GPU, but found the following tensor and device combinations

Parameters:
-------config-------
dataset='./alpaca_data_zh_51k.json'
data_type='alpaca'
lora_out_dir='./falcon-40b-alpaca/'
lora_apply_dir=None
weights='tiiuae/falcon-40b'
target_modules=['query_key_value']

------training------
mbatch_size=1
batch_size=2
gradient_accumulation_steps=2
epochs=3
lr=0.0003
cutoff_len=256
lora_r=8
lora_alpha=16
lora_dropout=0.05
val_set_size=0.2
gradient_checkpointing=False
gradient_checkpointing_ratio=1
warmup_steps=5
save_steps=50
save_total_limit=3
logging_steps=5
checkpoint=False
skip=False
world_size=1
ddp=False
device_map='auto'
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.3
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: token_type_ids, output, input, instruction. If token_type_ids, output, input, instruction are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 40,943
  Num Epochs = 3
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 2
  Total optimization steps = 61,413
  Number of trainable parameters = 8,355,840
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
  0%|                                                                                                                                    | 0/61413 [00:00<?, ?it/s]You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:318: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /data/home/yaokj5/dl/apps/falcon/wandb/offline-run-20230602_104725-mpuuxpn6
wandb: Find logs at: ./wandb/offline-run-20230602_104725-mpuuxpn6/logs
Traceback (most recent call last):
  File "/data/home/yaokj5/anaconda3/envs/falcon/bin/falcontune", line 33, in <module>
    sys.exit(load_entry_point('falcontune==0.1.0', 'console_scripts', 'falcontune')())
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/run.py", line 87, in main
    args.func(args)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/finetune.py", line 162, in finetune
    trainer.train()
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 1076, in forward
    transformer_outputs = self.transformer(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 971, in forward
    outputs = block(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 639, in forward
    attn_outputs = self.self_attention(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 491, in forward
    fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/peft/tuners/lora.py", line 698, in forward
    result = super().forward(x)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 388, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 559, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1759, in igemmlt
    is_on_gpu([A, B, out])
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/functional.py", line 390, in is_on_gpu
    raise TypeError(f'Input tensors need to be on the same GPU, but found the following tensor and device combinations:\n {[(t.shape, t.device) for t in tensors]}')
TypeError: Input tensors need to be on the same GPU, but found the following tensor and device combinations:
 [(torch.Size([256, 8192]), device(type='cuda', index=0)), (torch.Size([9216, 8192]), device(type='cuda', index=1)), (torch.Size([256, 9216]), device(type='cuda', index=0))]
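Not a fix, but a way to see what is happening (my reading of the error, not a confirmed diagnosis): with device_map='auto' the 40B checkpoint gets sharded across both GPUs, so the int8 matmul can receive activations on cuda:0 while the layer's weights sit on cuda:1. Printing the device map makes the split visible; note also the "8-bit models" issue further down, where the library itself states that 8-bit models spread over several GPUs can only be trained with naive pipeline parallelism.

from transformers import AutoModelForCausalLM

# Load via the standard transformers 8-bit path (a sketch, not falcontune's own
# loader) and inspect where each transformer block was placed.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)

for module, device in model.hf_device_map.items():
    print(device, module)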

Using --prompt gives an error

falcontune generate \
    --interactive \
    --model falcon-7b \
    --weights tiiuae/falcon-7b \
    --lora_apply_dir falcon-7b-alpaca \
    --max_new_tokens 50 \
    --use_cache \
    --do_sample \
    --prompt "番茄炒蛋的配料"
番茄炒蛋的配料。

### Input:


### Response:
xxxxx
Took 334.041 s




Enter new prompt: Traceback (most recent call last):
  File "/data/home/yaokj5/anaconda3/envs/falcon/bin/falcontune", line 33, in <module>
    sys.exit(load_entry_point('falcontune==0.1.0', 'console_scripts', 'falcontune')())
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/run.py", line 88, in main
    args.func(args)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/generate.py", line 71, in generate
    generated_ids = model.generate(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/generate.py", line 27, in autocast_generate
    return self.model.non_autocast_generate(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/peft/peft_model.py", line 731, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/generation/utils.py", line 1312, in generate
    and torch.sum(inputs_tensor[:, -1] == generation_config.pad_token_id) > 0
IndexError: index -1 is out of bounds for dimension 1 with size 0
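For reference, the prompt here asks (in Chinese) for the ingredients of stir-fried tomato and egg. The IndexError at the end looks like it comes from an empty follow-up prompt: the interactive loop passes a zero-length input tensor to generate(), and the pad_token_id check then indexes position -1 of an empty dimension. That is my reading of the traceback rather than a confirmed bug report, but a guard along these lines avoids it:

from transformers import AutoModelForCausalLM, AutoTokenizer

# A sketch of a guarded interactive loop (plain transformers, not falcontune's
# generate.py): skip generation when the prompt tokenizes to an empty sequence.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", device_map="auto", trust_remote_code=True)

while True:
    prompt = input("Enter new prompt: ").strip()
    if not prompt:
        print("Empty prompt, nothing to generate.")
        continue
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    print(tokenizer.decode(model.generate(ids, max_new_tokens=50)[0]))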

Missing compatibility with torch 1.13

When we run the model on our servers, we encounter the following problem:

File "/home/users//.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/5b9409410d251ab8e06c48078721c8e2b71fa8a1/modelling_RW.py", line 289, in forward
attn_output = F.scaled_dot_product_attention(
AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention'

Are there any plans to make the model work on slightly older versions of PyTorch as well?

I think this would be useful, as many people are still on 1.13.
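F.scaled_dot_product_attention only exists from PyTorch 2.0 onwards, so the tiiuae modelling_RW.py remote code fails on 1.13 as shown above. A possible local patch, if upgrading PyTorch is not an option, is to fall back to the plain-math formulation of the same operation; this is a sketch of such a fallback, not something the repository ships:

import math
import torch
import torch.nn.functional as F

def sdpa_fallback(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False):
    # Plain-math equivalent of F.scaled_dot_product_attention for torch < 2.0.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if is_causal:
        causal = torch.ones(q.size(-2), k.size(-2), dtype=torch.bool,
                            device=q.device).tril()
        scores = scores.masked_fill(~causal, float("-inf"))
    if attn_mask is not None:
        scores = scores + attn_mask
    attn = torch.softmax(scores, dim=-1)
    if dropout_p:
        attn = F.dropout(attn, p=dropout_p)
    return attn @ v

# Use the fused kernel when it exists (torch >= 2.0), otherwise the fallback.
sdpa = getattr(F, "scaled_dot_product_attention", sdpa_fallback)

q = k = v = torch.randn(1, 8, 16, 64)
print(sdpa(q, k, v).shape)  # torch.Size([1, 8, 16, 64])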

You are using a model of type RefinedWeb to instantiate a model of type RefinedWebModel. This is not supported for all configurations of models and can yield errors.

When I try to run the generate command on an H100 server, I get the following error:

You are using a model of type RefinedWeb to instantiate a model of type RefinedWebModel. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
  File "/usr/local/bin/falcontune", line 11, in <module>
    load_entry_point('falcontune==0.1.0', 'console_scripts', 'falcontune')()
  File "/usr/local/lib/python3.8/dist-packages/falcontune-0.1.0-py3.8.egg/falcontune/run.py", line 87, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/falcontune-0.1.0-py3.8.egg/falcontune/generate.py", line 43, in generate
    model, tokenizer = load_model(
  File "/usr/local/lib/python3.8/dist-packages/falcontune-0.1.0-py3.8.egg/falcontune/model/__init__.py", line 33, in load_model
    model, tokenizer = load_model(model_config, weights, half=half, backend=backend)
  File "/usr/local/lib/python3.8/dist-packages/falcontune-0.1.0-py3.8.egg/falcontune/model/falcon/model.py", line 1147, in load_model
    ql = importlib.import_module(f'falcontune.backend.{backend}.quantlinear')
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.8/dist-packages/falcontune-0.1.0-py3.8.egg/falcontune/backend/cuda/quantlinear.py", line 3, in <module>
    import quant_cuda
ImportError: /usr/local/lib/python3.8/dist-packages/quant_cuda-0.0.0-py3.8-linux-x86_64.egg/quant_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

Inference not accurate

I used the same dataset provided to fine-tune, but inference seems to give general replies rather than answers from the dataset used for fine-tuning.

Any ideas?

"instruction": "What are the three primary colors?",
"input": "",
"output": "The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various proportions. In the additive color system, used for light, the primary colors are red, green, and blue (RGB)."

The above was part of the dataset, but when I run inference I get the following:

Enter new instruction: Describe the structure of an atom.

An atom consists of a nucleus, made up of positively charged protons and neutral electrons, and a number of electrons arranged in shells or layers around the nucleus. The number of electrons in the outermost shell of the nucleus determines the atom's electronic

I used the following generate script:

falcontune generate \
    --interactive \
    --model falcon-7b-instruct-4bit \
    --weights ./gptq_model-4bit-64g.safetensors \
    --lora_apply_dir falcon-7b-instruct-4bit-alpaca-selftrained/ \
    --max_new_tokens 50 \
    --use_cache \
    --do_sample \
    --instruction "What are the three primary colors?" \
    --backend triton

ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA devices

Getting the below error when trying to finetune the model.

Converted as Half.
trainable params: 8355840 || all params: 1075691520 || trainable%: 0.7767877541695225
Found cached dataset json (/home/users/users/.cache/huggingface/datasets/json/default-7089e4ef944c023b/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.48it/s]
Loading cached split indices for dataset at /home/users/users/.cache/huggingface/datasets/json/default-7089e4ef944c023b/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-a03d095090258b35.arrow and /home/users/users/.cache/huggingface/datasets/json/default-7089e4ef944c023b/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-f83f741993333274.arrow
Run eval every 6 steps                                                                                                                                                  
Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
PyTorch: setting up devices

Traceback (most recent call last):
  File "/home/users/users/falcontune/venv_falcontune/bin/falcontune", line 33, in <module>
    sys.exit(load_entry_point('falcontune==0.1.0', 'console_scripts', 'falcontune')())
  File "/home/users/users/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/run.py", line 87, in main
    args.func(args)
  File "/home/users/users/falcontune/venv_falcontune/lib/python3.8/site-packages/falcontune-0.1.0-py3.8.egg/falcontune/finetune.py", line 116, in finetune
    training_arguments = transformers.TrainingArguments(
  File "<string>", line 111, in __init__
  File "/home/users/users/falcontune/venv_falcontune/lib/python3.8/site-packages/transformers/training_args.py", line 1338, in __post_init__
    raise ValueError(
ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA devices.

Experimental setup details:
OS: Ubuntu 18.04.5 LTS
GPU: Tesla V100-SXM2-32GB
Libs:

bitsandbytes==0.39.0
transformers==4.29.2
triton==2.0.0
sentencepiece==0.1.99
datasets==2.12.0
peft==0.3.0
torch==2.0.1+cu118
accelerate==0.19.0
safetensors==0.3.1
einops==0.6.1
wandb==0.15.3
scipy==1.10.1

Finetuning command:

falcontune finetune \
    --model="falcon-40b-instruct-4bit" \
    --weights="./gptq_model-4bit--1g.safetensors" \
    --dataset="./alpaca_cleaned.json" \
    --data_type="alpaca" \
    --lora_out_dir="./falcon-40b-instruct-4bit-alpaca/" \
    --mbatch_size=1 \
    --batch_size=2 \
    --epochs=$epochs \
    --lr=3e-4 \
    --cutoff_len=256 \
    --lora_r=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --warmup_steps=5 \
    --save_steps=50 \
    --save_total_limit=3 \
    --logging_steps=5 \
    --target_modules='["query_key_value"]' \
    --backend="triton"

Any help please?
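The TrainingArguments check that raises this ValueError fires whenever fp16 is requested but PyTorch does not see a usable CUDA device, so the first thing I would verify (an assumption about the cause, given that a V100 is clearly present) is whether this environment can see the GPU at all:

import torch

# If this prints False, the V100 is not visible to the virtualenv that runs
# falcontune (driver / CUDA runtime mismatch, or an empty CUDA_VISIBLE_DEVICES),
# and fp16 training cannot start regardless of the other settings.
print("CUDA available: ", torch.cuda.is_available())
print("torch CUDA build:", torch.version.cuda)
if torch.cuda.is_available():
    print("device 0:       ", torch.cuda.get_device_name(0))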

RuntimeError: expected scalar type Half but found Char

Trying to finetune the 7b model using the sample from the documentation, but getting this error. Has anyone seen this before?
The log is here:
bin /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /home/ec2-user/anaconda3/envs/pytorch_p310/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Downloading (…)model.bin.index.json: 16.9kB [00:00, 57.6MB/s]
Downloading (…)l-00001-of-00002.bin: 100%|██| 9.95G/9.95G [01:00<00:00, 166MB/s]
Downloading (…)l-00002-of-00002.bin: 100%|██| 4.48G/4.48G [00:36<00:00, 123MB/s]
Downloading (…)okenizer_config.json: 100%|█████| 220/220 [00:00<00:00, 1.93MB/s]
Downloading (…)/main/tokenizer.json: 2.73MB [00:00, 17.2MB/s]
Downloading (…)cial_tokens_map.json: 100%|█████| 281/281 [00:00<00:00, 2.00MB/s]

Parameters:
-------config-------
dataset='sql-create-context-train.json'
data_type='alpaca'
lora_out_dir='./falcon-7b-alpaca/'
lora_apply_dir=None
weights='tiiuae/falcon-7b'
target_modules=['query_key_value']

------training------
mbatch_size=1
batch_size=2
gradient_accumulation_steps=2
epochs=3
lr=0.0003
cutoff_len=256
lora_r=8
lora_alpha=16
lora_dropout=0.05
val_set_size=0.2
gradient_checkpointing=False
gradient_checkpointing_ratio=1
warmup_steps=5
save_steps=50
save_total_limit=3
logging_steps=5
checkpoint=False
skip=False
world_size=1
ddp=False
device_map='auto'

trainable params: 2359296 || all params: 6924080000 || trainable%: 0.03407378308742822
Downloading and preparing dataset json/default to /home/ec2-user/.cache/huggingface/datasets/json/default-ad054cd2ceaeb665/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data files: 100%|██████████████████| 1/1 [00:00<00:00, 10230.01it/s]
Extracting data files: 100%|█████████████████████| 1/1 [00:00<00:00, 221.66it/s]
Dataset json downloaded and prepared to /home/ec2-user/.cache/huggingface/datasets/json/default-ad054cd2ceaeb665/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 175.08it/s]
Run eval every 7857 steps
PyTorch: setting up devices
The default value for the training argument --report_to will change in v5 (from all installed integrations to none). In v5, you will need to use --report_to all to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using cuda_amp half precision backend
wandb: Tracking run with wandb version 0.15.4
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
The following columns in the training set don't have a corresponding argument in PeftModelForCausalLM.forward and have been ignored: instruction, output, token_type_ids, input. If instruction, output, token_type_ids, input are not expected by PeftModelForCausalLM.forward, you can safely ignore this message.
***** Running training *****
Num examples = 62861
Num Epochs = 3
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 2
Gradient Accumulation steps = 2
Total optimization steps = 94290
0%| | 0/94290 [00:00<?, ?it/s]You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/ec2-user/SageMaker/wandb/offline-run-20230621_200524-jdrmalza
wandb: Find logs at: ./wandb/offline-run-20230621_200524-jdrmalza/logs
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p310/bin/falcontune", line 33, in
sys.exit(load_entry_point('falcontune==0.1.0', 'console_scripts', 'falcontune')())
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/run.py", line 88, in main
args.func(args)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/finetune.py", line 162, in finetune
trainer.train()
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 1521, in train
return inner_training_loop(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 1763, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2499, in training_step
loss = self.compute_loss(model, inputs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2531, in compute_loss
outputs = model(**inputs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
return self.base_model(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 1070, in forward
transformer_outputs = self.transformer(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 965, in forward
outputs = block(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 698, in forward
attn_outputs = self.self_attention(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 291, in forward
fused_qkv = self.query_key_value(hidden_states) # [batch_size, seq_length, 3 x hidden_size]
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/peft/tuners/lora.py", line 565, in forward
result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: expected scalar type Half but found Char
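The last frame shows peft's plain Linear LoRA wrapper calling F.linear on a weight that is still int8, which is where the Half-vs-Char mismatch comes from. I am not certain why that wrapper was selected here, but for comparison, the standard peft 0.3.x recipe for LoRA on an 8-bit model (a sketch using the public transformers/peft API, not falcontune's internal code) looks like this; with load_in_8bit=True and prepare_model_for_int8_training, the LoRA layers wrap bitsandbytes' Int8 linear modules instead:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_int8_training(model)

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM", target_modules=["query_key_value"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()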

RuntimeError: CUDA error: an illegal memory access was encountered

falcontune generate \
    --interactive \
    --model falcon-40b-instruct-4bit \
    --weights gptq_model-4bit--1g.safetensors \
    --max_new_tokens=50 \
    --use_cache \
    --do_sample \
    --instruction "Who was the first person on the moon?"

...
RuntimeError: CUDA error: an illegal memory access was encountered

While trying to generate on a multi-GPU machine, I encountered the above error.

getting error `OSError: libcurand.so.10: cannot open shared object file: No such file or directory` when finetuning

I am running finetune on falcon-7b

$ falcontune finetune \
    --model=falcon-7b \
    --weights=tiiuae/falcon-7b \
    --dataset=../data/alpaca_data_cleaned.json \
    --data_type=alpaca \
    --lora_out_dir=./falcon-7b-alpaca/ \
    --mbatch_size=1 \
    --batch_size=2 \
    --epochs=3 \
    --lr=3e-4 \
    --cutoff_len=256 \
    --lora_r=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --warmup_steps=5 \
    --save_steps=50 \
    --save_total_limit=3 \
    --logging_steps=5 \
    --target_modules='["query_key_value"]'

getting error:

miniconda3/envs/falcontune/lib/python3.9/ctypes/__init__.py", line 382, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory

GPU:
NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB] (rev a1)

It's working very well, but I have 2 issues:

  1. pyarrow gives an error if the dataset JSON file is larger than 10 MB.

  2. I really can't figure out what all the parameters are, or how to resume when training is interrupted. Is there any reference for parameters and switches such as:

    --data_type=alpaca
    --lora_out_dir=./falcon-7b-instruct-4bit-alpaca/
    --mbatch_size=1
    --batch_size=2
    --epochs=3
    --lr=3e-4
    --cutoff_len=256
    --lora_r=8
    --lora_alpha=16
    --lora_dropout=0.05
    --warmup_steps=5
    --save_steps=50
    --save_total_limit=3
    --logging_steps=5
    --target_modules='["query_key_value"]'
    ...

Thank you very much.

8-bit models

Hi all,
I am trying to finetune falcon-40b-instruct but running into the following error:

ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices in any distributed mode. In
order to use 8-bit models that have been loaded across multiple GPUs the solution is to use Naive Pipeline Parallelism. 
Therefore you should not specify that you are under any distributed regime in your accelerate config.

Any suggestion?

Thanks!

Possible to offload to CPU (RAM)?

Is it possible to reduce the required VRAM by offloading to RAM? From my experiments with LoRA, models are able to pick up logic from a custom dataset with very little data. It would be great to be able to finetune slowly but with less VRAM.
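falcontune itself does not advertise such an option as far as I can tell, but the underlying transformers/accelerate stack can offload whatever does not fit into VRAM to CPU RAM via a memory budget. A sketch of that mechanism (my suggestion, with illustrative numbers) follows; expect it to be much slower, and note the restrictions on 8-bit training across devices mentioned in the "8-bit models" issue above.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "48GiB"},  # budgets are illustrative
    trust_remote_code=True,
)
print(model.hf_device_map)  # shows which blocks were placed on "cpu"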

bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /data/home/yaokj5/anaconda3/envs/falcon did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
  warn(msg)
ERROR: /data/home/yaokj5/anaconda3/envs/falcon/bin/python: undefined symbol: cudaRuntimeGetVersion
CUDA SETUP: libcudart.so path is None
CUDA SETUP: Is seems that your cuda installation is not in your path. See https://github.com/TimDettmers/bitsandbytes/issues/85 for more information.
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
  warn(msg)
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 00
CUDA SETUP: Loading binary /data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
You are using a model of type RefinedWeb to instantiate a model of type RefinedWebModel. This is not supported for all configurations of models and can yield errors.
Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards:   0%|                                                                                       | 0/9 [00:18<?, ?it/s]
Traceback (most recent call last):
  File "/data/home/yaokj5/anaconda3/envs/falcon/bin/falcontune", line 33, in <module>
    sys.exit(load_entry_point('falcontune==0.1.0', 'console_scripts', 'falcontune')())
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/run.py", line 87, in main
    args.func(args)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/finetune.py", line 49, in finetune
    llm, tokenizer = load_model(args.model, args.weights, backend=args.backend)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/__init__.py", line 33, in load_model
    model, tokenizer = load_model(model_config, weights, half=half, backend=backend)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 1177, in load_model
    model = RWForCausalLM.from_pretrained(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2777, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3118, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/modeling_utils.py", line 710, in _load_state_dict_into_meta_model
    set_module_8bit_tensor_to_device(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/transformers/utils/bitsandbytes.py", line 78, in set_module_8bit_tensor_to_device
    new_value = bnb.nn.Int8Params(new_value, requires_grad=False, has_fp16_weights=has_fp16_weights).to(device)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 294, in to
    return self.cuda(device)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 258, in cuda
    CB, CBt, SCB, SCBt, coo_tensorB = bnb.functional.double_quant(B)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1987, in double_quant
    row_stats, col_stats, nnz_row_ptr = get_colrow_absmax(
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1876, in get_colrow_absmax
    lib.cget_col_row_stats(ptrA, ptrRowStats, ptrColStats, ptrNnzrows, ct.c_float(threshold), rows, cols)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /data/home/yaokj5/anaconda3/envs/falcon/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats
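My reading of the setup log above (not a confirmed diagnosis) is that bitsandbytes never found libcudart.so, fell back to its CPU-only library, and therefore lacks GPU symbols like cget_col_row_stats; making the CUDA runtime visible (for example via LD_LIBRARY_PATH or by installing cudatoolkit into the conda env) is the usual remedy. Two quick checks:

import torch

# The CUDA build PyTorch itself was compiled against, and whether a GPU is visible.
print("torch CUDA build:", torch.version.cuda)
print("CUDA available:  ", torch.cuda.is_available())

# The log also recommends `python -m bitsandbytes`, which prints where the
# library searched for libcudart.so and which binary it finally loaded.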

OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 79.35 GiB total capacity; 77.18 GiB already allocated; 57.19 MiB free; 77.97 GiB reserved in total by PyTorch)

Ran into a CUDA OOM issue during fine-tuning.

File "/opt/conda/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 336, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "/opt/conda/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
outputs = torch.empty_like(tensor) # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory.

Any ideas to fix this?

Metal support

What would the path to Metal support for this project look like?

RuntimeError: CUDA out of memory

I am trying to finetune the 7b-instruct-gptq model, but it gives me CUDA out of memory when I specify a cutoff length of 2048.

Parameters:
-------config-------
dataset='./Falcontune_data.json'
data_type='alpaca'
lora_out_dir='./falcon-7b-instruct-4bit-customodel/'
lora_apply_dir=None
weights='gptq_model-4bit-64g.safetensors'
target_modules=['query_key_value']

------training------
mbatch_size=1
batch_size=2
gradient_accumulation_steps=2
epochs=3
lr=0.0003
cutoff_len=2048
lora_r=8
lora_alpha=16
lora_dropout=0.05
val_set_size=0.2
gradient_checkpointing=False
gradient_checkpointing_ratio=1
warmup_steps=5
save_steps=50
save_total_limit=3
logging_steps=5
checkpoint=False
skip=False
world_size=1
ddp=False
device_map='auto'

OutOfMemoryError: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 0; 14.75
GiB total capacity; 12.29 GiB already allocated; 476.81 MiB free; 13.22 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF
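With a 14.75 GiB card, cutoff_len=2048 is simply a lot of activation memory for this setup. Two generic things I would try first (assumptions, not falcontune-specific guidance): enable the gradient_checkpointing option that the config dump above shows as False, or drop cutoff_len back toward 256; and relax allocator fragmentation, as the error message itself suggests, before CUDA is first initialized:

import os

# Must be set in the environment before the first CUDA allocation, i.e. before
# the finetuning process initializes torch.cuda.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch
if torch.cuda.is_available():
    total_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
    print(f"GPU 0: {total_gib:.2f} GiB total")  # ~14.75 GiB here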

Optimizing Inference Time for Chat Conversations on Falcon

How can I optimize the inference time of my 4-bit quantized Falcon 7B, which was trained on a chat dataset using QLoRA + PEFT? During inference, I load the model in 4 bits using the bitsandbytes library. While the model performs well, I've observed that the inference time increases significantly as the length of the chat grows. To give you an idea, here's an example of how I'm using the model:

First Message prompt:
< user>: Hi ..
< bot>:

2nd Message prompt:
< user>: Hi ..
< bot>: How are you ?
< user>: I am good thanks. I need your help !
< bot>:

As the length of the chat increases, the inference time sometimes doubles and can take up to 2-3 minutes per prompt. I'm using an NVIDIA RTX 3090 Ti for inference. Below is the code snippet I'm using for prediction:

MODEL_NAME = "tiiuae/falcon-7b"

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True,
                                bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

SAVED_MODEL = "Saved_models/model_path"

saved_model_config = PeftConfig.from_pretrained(SAVED_MODEL)
saved_model = AutoModelForCausalLM.from_pretrained(saved_model_config.base_model_name_or_path,
                                             return_dict=True,
                                             quantization_config=bnb_config,
                                             device_map="auto",
                                             trust_remote_code=True
                                            )

tokenizer = AutoTokenizer.from_pretrained(saved_model_config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

saved_model = PeftModel.from_pretrained(saved_model, SAVED_MODEL)

pipeline = transformers.pipeline(
    "text-generation",
    model=saved_model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

prompt = "< user>: Hi ..
< bot>: How are you ?
< user>: I am good thanks. I need your help ! 
< bot>: "

response = pipeline(
        prompt,
        bos_token_id=11,
        max_length=2000,
        temperature=0.7,
        top_p=0.7,
        do_sample=True,
        num_return_sequences=1,
        eos_token_id=[15564]
        )[0]['generated_text']

I would greatly appreciate any insights or suggestions on how to improve the inference time of model while dealing with longer chat interactions. Thank you in advance for your assistance!
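Two generic latency levers that might help (assumptions on my part, not a verified fix for this model): bound the number of new tokens per reply instead of the total max_length, and stop the prompt from growing without limit by dropping the oldest turns once it exceeds a token budget. The sketch below reuses the pipeline and tokenizer built in the snippet above; chat_history is a hypothetical list of alternating "< user>:" / "< bot>:" strings.

MAX_PROMPT_TOKENS = 1024

def build_prompt(history):
    # Drop the oldest user/bot pair until the prompt fits the token budget.
    prompt = "\n".join(history) + "\n< bot>: "
    while len(tokenizer(prompt).input_ids) > MAX_PROMPT_TOKENS and len(history) > 2:
        history = history[2:]
        prompt = "\n".join(history) + "\n< bot>: "
    return prompt

chat_history = ["< user>: Hi ..",
                "< bot>: How are you ?",
                "< user>: I am good thanks. I need your help !"]

response = pipeline(
    build_prompt(chat_history),
    max_new_tokens=200,      # bounded per-reply generation instead of max_length=2000
    do_sample=True,
    temperature=0.7,
    top_p=0.7,
    num_return_sequences=1,
    bos_token_id=11,
    eos_token_id=[15564],
)[0]["generated_text"]
print(response)

Even with the KV cache, per-token cost grows with the prompt length, so trimming the history is usually the bigger win.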

Does pretrained falcon-40b work on Colab?

Hi there,

I'm trying to use your repo to finetune the vanilla pretrained falcon-40b model using this command:

!falcontune finetune \
    --model=falcon-40b \
    --weights=tiiuae/falcon-40b \
    --dataset=./alpaca_data_cleaned.json \
    --data_type=alpaca \
    --lora_out_dir=./falcon-40b-alpaca/ \
    --mbatch_size=1 \
    --batch_size=2 \
    --epochs=3 \
    --lr=3e-4 \
    --cutoff_len=2048 \
    --lora_r=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --warmup_steps=5 \
    --save_steps=50 \
    --save_total_limit=3 \
    --logging_steps=5 \
    --target_modules='["query_key_value"]'

I get this error:

2023-06-22 15:49:58.475124: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

BUG REPORT
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/lib64-nvidia did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('http'), PosixPath('8013'), PosixPath('//172.28.0.1')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('--logtostderr --listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https'), PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-a100-s-3cww4v9wjy1aw --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
warn(msg)
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//ipykernel.pylab.backend_inline'), PosixPath('module')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
You are using a model of type RefinedWeb to instantiate a model of type RefinedWebModel. This is not supported for all configurations of models and can yield errors.
Overriding torch_dtype=None with torch_dtype=torch.float16 due to requirements of bitsandbytes to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/local/bin/falcontune:33 in <module> │
│ │
│ 30 │
│ 31 if __name__ == '__main__': │
│ 32 │ sys.argv[0] = re.sub(r'(-script.pyw?|.exe)?$', '', sys.argv[0]) │
│ ❱ 33 │ sys.exit(load_entry_point('falcontune==0.1.0', 'console_scripts', ' │
│ 34 │
│ │
│ /usr/local/lib/python3.10/dist-packages/falcontune-0.1.0-py3.10.egg/falcontu │
│ ne/run.py:88 in main │
│ │
│ 85 │
│ 86 def main(): │
│ 87 │ args = get_args() │
│ ❱ 88 │ args.func(args) │
│ 89 │
│ 90 │
│ 91 if __name__ == '__main__': │
│ │
│ /usr/local/lib/python3.10/dist-packages/falcontune-0.1.0-py3.10.egg/falcontu │
│ ne/finetune.py:49 in finetune │
│ │
│ 46 │
│ 47 │
│ 48 def finetune(args): │
│ ❱ 49 │ llm, tokenizer = load_model(args.model, args.weights, backend=args │
│ 50 │ tune_config = FinetuneConfig(args) │
│ 51 │ │
│ 52 │ transformers.logging.set_verbosity_info() │
│ │
│ /usr/local/lib/python3.10/dist-packages/falcontune-0.1.0-py3.10.egg/falcontu │
│ ne/model/__init__.py:33 in load_model │
│ │
│ 30 │ │
│ 31 │ if model_name in MODEL_CONFIGS: │
│ 32 │ │ from falcontune.model.falcon.model import load_model │
│ ❱ 33 │ │ model, tokenizer = load_model(model_config, weights, half=half, │
│ 34 │ │
│ 35 │ else: │
│ 36 │ │ raise ValueError(f"Invalid model name: {model_name}") │
│ │
│ /usr/local/lib/python3.10/dist-packages/falcontune-0.1.0-py3.10.egg/falcontu │
│ ne/model/falcon/model.py:1171 in load_model │
│ │
│ 1168 │ │ model.loaded_in_4bit = True │
│ 1169 │ │
│ 1170 │ elif llm_config.bits == 8: │
│ ❱ 1171 │ │ model = RWForCausalLM.from_pretrained( │
│ 1172 │ │ │ checkpoint, │
│ 1173 │ │ │ config=config, │
│ 1174 │ │ │ load_in_8bit=True, │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:2722 │
│ in from_pretrained │
│ │
│ 2719 │ │ │ │ │ key: device_map[key] for key in device_map.keys() │
│ 2720 │ │ │ │ } │
│ 2721 │ │ │ │ if "cpu" in device_map_without_lm_head.values() or "d │
│ ❱ 2722 │ │ │ │ │ raise ValueError( │
│ 2723 │ │ │ │ │ │ """ │
│ 2724 │ │ │ │ │ │ Some modules are dispatched on the CPU or the │
│ 2725 │ │ │ │ │ │ the quantized model. If you want to dispatch │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom device_map to from_pretrained. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.

Whereas with the instruct model I have no issue.

2 questions:

  1. Can this code also run the vanilla 40b model on Colab? (see the sketch below)
  2. If not, could you please upload a 4-bit version of the vanilla 40b model?

Thank you so much
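On question 1: the ValueError above spells out its own workaround, which is to let the layers that do not fit on the GPU stay on the CPU in fp32. This is only a sketch of that path (the memory budgets are illustrative, and whether a 40B model is practical this way on a Colab A100 is another matter, since offloaded layers run very slowly):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # allow overflow layers on the CPU in fp32
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    quantization_config=quant_config,
    device_map="auto",
    max_memory={0: "38GiB", "cpu": "80GiB"},  # illustrative budgets for an A100 runtime
    torch_dtype=torch.float16,
    trust_remote_code=True,
)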

RuntimeError: No available kernel. Aborting execution.

Any ideas? Full log below:

Traceback (most recent call last):
File "/home/cosmos/miniconda3/envs/ftune/bin/falcontune", line 33, in
sys.exit(load_entry_point('falcontune==0.1.0', 'console_scripts', 'falcontune')())
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/run.py", line 87, in main
args.func(args)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/finetune.py", line 162, in finetune
trainer.train()
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
loss = self.compute_loss(model, inputs)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/transformers/trainer.py", line 2767, in compute_loss
outputs = model(**inputs)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
return self.base_model(
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 1070, in forward
transformer_outputs = self.transformer(
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 965, in forward
outputs = block(
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 698, in forward
attn_outputs = self.self_attention(
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/cosmos/miniconda3/envs/ftune/lib/python3.10/site-packages/falcontune-0.1.0-py3.10.egg/falcontune/model/falcon/model.py", line 337, in forward
attn_output = F.scaled_dot_product_attention(
RuntimeError: No available kernel. Aborting execution.

EDIT: CUDA is installed in the kernel modules, on the system, and in the environment, just to rule that out. Using Python 3.10.6.
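The failing frame is F.scaled_dot_product_attention, and "No available kernel" usually means none of the fused backends (flash / memory-efficient) supports the current dtype, device, or head size; that is an assumption about the cause here, but forcing the plain math implementation is a cheap way to test it:

import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# Disable the fused kernels and keep only the math fallback; if this works, the
# fused backends were the problem (e.g. unsupported GPU architecture or dtype).
with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                    enable_mem_efficient=False,
                                    enable_math=True):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)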
