
fms-acceleration's Introduction

FMS Acceleration 🚀

FMS Acceleration is designed to accelerate the fine-tuning and training of large models. This framework comprises a collection of libraries intended to be used with the fms-hf-tuning suite.

The fms-acceleration framework includes accelerators for Full and Parameter Efficient Fine Tuning (PEFT), including

  • Low Rank Adaptation (LoRA) acceleration (coming soon)
  • Bits-and-Bytes (BNB) quantised LoRA : QLoRA acceleration
  • AutoGPTQ quantised LoRA : GPTQ-LoRA acceleration
  • Full Fine Tuning acceleration (coming soon)
  • Padding-Free Attention

Our tests show a significant increase in training token throughput using this fms-acceleration framework.

For example:

  • QLoRA: 22-43% token throughput increase on 1 GPU compared to using Hugging Face BNB QLoRA
  • QLoRA: straightforward integration with multiple GPUs compared to using Hugging Face BNB QLoRA
  • GPTQ-LoRA: 22-44% token throughput increase on 1 GPU compared to using Hugging Face BNB QLoRA
  • GPTQ-LoRA: straightforward integration with multiple GPUs compared to using Hugging Face BNB QLoRA

The numbers above include runs using fused-ops-and-kernels; the actual implementation is coming soon, see below.

This package is in BETA and is under development. Expect breaking changes!

Plugins

| Plugin | Description | Depends | License | Status |
|---|---|---|---|---|
| framework | This acceleration framework for integration with Hugging Face trainers | | | Alpha |
| accelerated-peft | For PEFT-training, e.g., 4bit QLoRA | Huggingface, AutoGPTQ | Apache 2.0, MIT | Alpha |
| fused-op-and-kernels | Fused LoRA and triton kernels (e.g., fast cross-entropy, rms, rope) | -- | Apache 2.0 (contains extracted code) | Beta |
| attention-and-distributed-packing | Padding-Free Flash Attention Computation | flash-attn | Apache 2.0 | Beta |
| MOE-training-acceleration | MegaBlocks inspired triton kernels and accelerations for Mixture-of-Expert models | | Apache 2.0 | Coming Soon |

Usage with FMS HF Tuning

Below we demonstrate how to accelerate your tuning experience with tuning/sft_trainer.py from fms-hf-tuning.

Note: New exciting plugins will be added over time, so please check here for the latest accelerations!

Integration with FMS HF Tuning

fms-acceleration is part of fms-hf-tuning, and instructions to utilize fms-acceleration for tuning are found here. In particular, fms-acceleration plugins can be accessed via command line arguments to fms-hf-tuning (e.g., --auto_gptq triton_v2); this is made available via integrated configuration dataclasses that configure the AccelerationFramework for the user.

Need for an alternative way to access features pre-integration

As new plugins become available, more command line arguments will be made available to fms-hf-tuning to enable them. However, this kind of integration takes time; plugins that are in development / research stages may not be immediately integrated.

Therefore, an intermediary step is required to access plugins in fms-acceleration before they are integrated into fms-hf-tuning. In fact, such a method is critical for the benchmarking and testing that needs to happen before the integration of any plugin into fms-hf-tuning can even be considered. Hence, we provide a method to configure the acceleration framework via a configuration YAML that is passed into AccelerationFramework via an environment variable; the instructions for this are provided below. Furthermore, experienced users can also leverage this to test plugins early, but be warned that the learning curve is steep (since it requires knowledge of how to write such a configuration). To aid with this, the following instructions describe both a basic and an advanced flow.
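For orientation, here is a minimal Python sketch of the same mechanism that the shell examples below use: the environment variable points at a framework config YAML and sft_trainer.py is launched with its usual, unchanged arguments. The path framework.yaml and the trainer arguments are placeholders.

```python
import os
import subprocess

# a minimal sketch: point ACCELERATION_FRAMEWORK_CONFIG_FILE at a framework config
# YAML and launch sft_trainer.py with its usual (unchanged) arguments
env = dict(os.environ, ACCELERATION_FRAMEWORK_CONFIG_FILE="framework.yaml")
subprocess.run(
    ["python", "sft_trainer.py"],  # append the usual fms-hf-tuning arguments here
    env=env,
    check=True,
)
```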

FMS Acceleration Via Configuration YAML

Note: As mentioned above, the recommended approach for fms-hf-tuning is to use the acceleration config dataclasses. The configuration YAML method documented here is only for testing/research purposes and is not recommended for production. For general use, please refer instead to the instructions here.

Below we illustrate the configuration YAML flow for accelerated quantised PEFT, using GPTQ-LoRA tuning with the AutoGPTQ triton_v2 kernel; this state-of-the-art kernel was contributed by jeromeku in March 2024:

There is both a basic and advanced usage for the configuration YAML flow.

Usage Flows

Basic Configuration YAML Flow 🤡

Most users of fms-hf-tuning only require the basic flow:

  • Assumption 1: the user already has a prepared configuration, say from sample-configurations.
  • Assumption 2: the user knows exactly which acceleration plugins are required (based on the configuration).
  • Assumption 3: the arguments for running sft_trainer.py are the same, save for one extra argument --acceleration_framework_config_file used to pass in the acceleration config.

In this case the basic flow comprises 3 steps:

  1. First go to fms-hf-tuning and install the framework library:

    $ pip install -e .[fms-accel]
    

    or alternatively install the framework directly:

    $ pip install git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/framework
    

    The above installs the command line utility fms_acceleration.cli, which is used to install plugins (and also other things like view sample configurations).

  2. Install the required framework plugins; we install the fms-acceleration-peft plugin for GPTQ-LoRA tuning with triton v2 as follows:

    python -m fms_acceleration.cli install fms_acceleration_peft
    

    The above is the equivalent of:

    pip install git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/accelerated-peft
    
  3. Run sft_trainer.py, providing the acceleration configuration (via the environment variable ACCELERATION_FRAMEWORK_CONFIG_FILE) together with the usual arguments; per the basic flow assumption, we simply re-use the same sft_trainer.py arguments as we would without the fms_acceleration package:

    # when using sample-configurations, arguments can be referred from
    # defaults.yaml and scenarios.yaml
    ACCELERATION_FRAMEWORK_CONFIG_FILE=framework.yaml \
    python sft_trainer.py \
        ...  # arguments
    

    The framework activates relevant plugins given the framework configuration; for more details see framework/README.md.

    Set TRANSFORMERS_VERBOSITY=info to see the Hugging Face trainer printouts and verify that AccelerationFramework is activated!

    # this printout will be seen in huggingface trainer logs if acceleration is activated
    ***** FMS AccelerationFramework *****
    Active Plugin: AutoGPTQAccelerationPlugin. Python package: fms_acceleration_peft. Version: 0.0.1.
    ***** Running training *****
    Num examples = 1,549
    Num Epochs = 1
    Instantaneous batch size per device = 4
    Total train batch size (w. parallel, distributed & accumulation) = 4
    Gradient Accumulation steps = 1
    Total optimization steps = 200
    Number of trainable parameters = 13,631,488
    

Advanced Configuration YAML Flow 🥷 🦹

The advanced flow makes further use of fms_acceleration.cli to:

  • list all available configs and the acceleration plugins they depend on.
  • list all available plugins and check which ones are installed.
  • identify critical sft_trainer arguments required for correct operation of a particular framework config.

The advanced flow comprises 5 steps:

  1. Same as Step 1 of basic flow.

  2. Use fms_acceleration.cli configs to search for sample configs:

    $ python -m fms_acceleration.cli configs
    
    1. accelerated-peft-autogptq (accelerated-peft-autogptq-sample-configuration.yaml) - plugins: ['accelerated-peft']
    2. accelerated-peft-bnb (accelerated-peft-bnb-nf4-sample-configuration.yaml) - plugins: ['accelerated-peft']
    

    This is equivalent to searching over the sample configurations directory.

  3. Install plugins as in Step 2 of the basic flow, noting that in addition we can use plugins to display all available plugins; this list updates as more plugins get developed. Recall that configs lists the required plugins for the sample configurations; make sure all of them are installed.

    $ python -m fms_acceleration.cli plugins
    
    Choose from the list of plugin shortnames, and do:
    * 'python -m fms_acceleration.cli install <pip-install-flags> PLUGIN_NAME'.
    
    List of PLUGIN_NAME [PLUGIN_SHORTNAME]:
    
    1. fms_acceleration_peft [peft]
    

    After installation, the list will update to indicate the installed plugins.

  4. Get the correct arguments for sft_trainer.py:

    • arguments required for correct operation (e.g., if using accelerated peft, then peft_method is required).

      $ python -m fms_acceleration.cli arguments accelerated-peft-autogptq
      
      Searching for configuration shortnames: ['accelerated-peft-autogptq']
      1. scenario: accelerated-peft-gptq
      configs: accelerated-peft-autogptq
      arguments:
          --learning_rate 2e-4 \
          --fp16 True \
          --torch_dtype float16 \
          --peft_method lora \
          --r 16 \
          --lora_alpha 16 \
          --lora_dropout 0.0 \
          --target_modules ['q_proj', 'k_proj', 'v_proj', 'o_proj']
      
    • More info on defaults.yaml and scenarios.yaml can be found here.

      • Arguments not critical to the plugins are found in defaults.yaml. These can be set freely.
      • Arguments critical to the plugins are found in scenarios.yaml. The relevant section of scenarios.yaml is the one whose framework_config entries match the shortname of the sample configuration of interest (see the sketch after this list).
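To make the scenarios.yaml lookup concrete, here is a small sketch. The file path and the layout (a top-level scenarios list with framework_config and arguments keys) are assumptions inferred from the description above, so adjust them to the actual file.

```python
import yaml  # PyYAML

# locate the scenarios.yaml entry whose framework_config matches a config shortname
def find_scenario(scenarios_path: str, config_shortname: str):
    with open(scenarios_path) as f:
        content = yaml.safe_load(f)
    for scenario in content.get("scenarios", []):
        if config_shortname in scenario.get("framework_config", []):
            return scenario
    return None

# e.g., the accelerated-peft-autogptq sample configuration (path is illustrative)
scenario = find_scenario("scripts/benchmarks/scenarios.yaml", "accelerated-peft-autogptq")
if scenario is not None:
    print(scenario.get("arguments"))
```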

CUDA Dependencies

This repo requires CUDA to run the kernels, and it is convenient to use the NVIDIA PyTorch containers that already come with CUDA installed. We have tested with the following versions:

  • pytorch:24.01-py3

Benchmarks

The benchmarks can be reproduced with the provided scripts.

See the CSV files in the benchmarks directory for various results.

Code Architecture

For deeper dive into details see framework/README.md.

Maintainers

IBM Research, Singapore


fms-acceleration's Issues

Group Memory Field Names with Common Prefix

Description

Comparison of the memory fields in the summary benchmarks is unintuitive.

Pandas sorts the fields in alphabetical order, and because the memory tracking fields are not named with a common prefix in the current benchmarking code, the report ends up with memory values in columns far apart from each other in the table (see the image in the original issue).

Proposed Solution

Renaming all memory values with a common mem_ prefix like the following

  • mem_nvidia_peak_reserved
  • mem_torch_peak_allocated
  • mem_torch_allocated

Update: Also remove the index column that is introduced into the benchmark file.

This way when the results are gathered and exported to csv, the columns are grouped together intuitively for comparison.
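A small sketch of the proposed renaming, assuming the benchmark results are post-processed with pandas; the pre-rename column names below are illustrative, not the actual field names in the benchmarking code.

```python
import pandas as pd

# rename memory columns with a common "mem_" prefix so that pandas' alphabetical
# column ordering groups them together, and drop the spurious index column on export
RENAME_MAP = {
    "nvidia_peak_reserved": "mem_nvidia_peak_reserved",   # illustrative old names
    "torch_peak_allocated": "mem_torch_peak_allocated",
    "torch_allocated": "mem_torch_allocated",
}

df = pd.read_csv("benchmarks.csv")
df = df.rename(columns={k: v for k, v in RENAME_MAP.items() if k in df.columns})
df.to_csv("benchmarks.csv", index=False)
```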

Extract out AutoGPTQ dependency

The AutoGPTQ repo looks like it hasn't had any releases since Mar 1, 2024. While it is still a dependency of major frameworks such as huggingface and vllm, it is concerning that it might get deprecated. Maybe we can consider extracting out the relevant code, since it is under an MIT license.

Benchmark CSV Corrupted When HF Memory Metrics Are Disabled

When the HF memory metrics are disabled, running the benches as follows:

MEMORY_LOGGING=nvidia \
bash scripts/run_benchmarks.sh \
    ...

we noticed that the benchmark.csv file only has the framework_config and torch_dtype columns; the other columns are empty.

Extract Out Model Patcher to Framework

There really should be only one model patcher; it should live in framework and coordinate all patching.

  • this should be accompanied by relevant unit tests.

Introduce a Better Dequantization Fix on Triton Function for FOAK Plugin's GPTQ Fused Operations

Description

Currently, FOAK's GPTQ Fused Operations maintain their own dequantization triton kernel. This is incompatible with the Accelerated-PEFT plugin when the local AutoGPTQ package is used, due to a line removed in the dequantization function of the local package (see here).

Without a fix, the dequantization produces wrong base outputs when the FOAK plugin is used with the local package. The current fix to the FOAK dequantization function in #48 detects whether the Accelerated-PEFT plugin uses the external AutoGPTQ library; if it does, zeros = zeros + 1 is added in the function, otherwise that line is skipped for the local library.

A better fix would be for the FOAK plugin to rely on the accelerated_peft plugin to manage which dequantization function to use (local autogptq package or official autogptq), rather than maintaining a similar set of functions itself.

Failure in FSDP Benchmark Experiment using QLoRA with Custom Fused Modules

Problem

Distributed experiments in the benchmarks fail when using BNB's nf4 QLoRA with unsloth fused module optimizations.

Cause

Distributed experiments for BNB's nf4 QLoRA without the fused modules don't throw any errors. Suspected incompatibility between FSDP, BNB kernels and Unsloth's matmul.

Stacktrace from test repo:

 File "/data/aaron/experimental/fms-acceleration/libs/unsloth/src/fms_accelerate_unsloth/kernels/fast_lora.py", line 227, in forward
    Q = matmul_lora(X, QW, QW_quant, QA, QB, QS)
  File "/data/aaron/experimental/fms-acceleration/libs/unsloth/src/fms_accelerate_unsloth/kernels/utils.py", line 235, in matmul_lora
    out = torch.matmul(X, W, out = out)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
  0%|                                                                                                                  | 0/100 [00:01<?, ?it/s]

Setting the debug environment variable CUDA_LAUNCH_BLOCKING=1 produces the error "an illegal memory access was encountered" at line 90 in file /src/csrc/ops.cu. This is traced to the dequantizeBlockwise CUDA function.

Reproduce

accelerate launch \
    --config_file ./accelerate.yaml \
    --num_processes=2 \
    --main_process_port=29500  -m tuning.sft_trainer \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v0.3 \
    --acceleration_framework_config_file ./sample-configurations/accelerated-peft-bnb-nf4-unsloth-sample-configuration.yaml \
    --packing True \
    --max_seq_len 2048 \
    --fp16 True \
    --learning_rate 2e-4 \
    --torch_dtype float16 \
    --peft_method lora \
    --r 16 \
    --lora_alpha 16 \
    --lora_dropout 0.0 \
    --target_modules q_proj k_proj v_proj o_proj \
    --use_flash_attn True \
    --response_template \n### Response: \
    --dataset_text_field output \
    --include_tokens_per_second True \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing True \
    --evaluation_strategy no \
    --save_strategy no \
    --weight_decay 0.01 \
    --warmup_steps 10 \
    --adam_epsilon 1e-4 \
    --lr_scheduler_type linear \
    --logging_strategy steps \
    --logging_steps 10 \
    --max_steps 100 \
    --training_data_path ./data/benchmark_data.json \
    --per_device_train_batch_size 2 \
    --output_dir results/exp_5/hf

Modify Existing Triton Kernels to Support Newer Triton Versions

Description:

Newer versions of torch (>= 2.4.0) require Triton >= 3.0.0, which causes an error in the JIT compilation of triton kernels that currently access global variables inside new checks here.

Example:

NameError("Cannot access global variable ROPE_GROUP_SIZE from within @jit'ed function. Triton kernels can only access global variables that are annotated as constexpr (x: triton.language.constexpr = 42 or x = triton.language.constexpr(42)).  Alternatively, set the envvar TRITON_ALLOW_NON_CONSTEXPR_GLOBALS=1, but we do not promise to support this forever.")
  • This is currently mitigated using the TRITON_ALLOW_NON_CONSTEXPR_GLOBALS variable, which temporarily allows access to globals.
  • A more permanent fix would be to change the triton kernels to access global variables annotated as constexpr (x: triton.language.constexpr = 42 or x = triton.language.constexpr(42)); see the sketch below.
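A minimal sketch of the constexpr approach; the kernel body is illustrative only and is not one of the actual FOAK kernels.

```python
import torch
import triton
import triton.language as tl

# annotate the module-level global as a triton constexpr so that @triton.jit kernels
# can reference it under Triton >= 3.0 without TRITON_ALLOW_NON_CONSTEXPR_GLOBALS
ROPE_GROUP_SIZE: tl.constexpr = 4

@triton.jit
def _double_rows(X, stride, n_cols, BLOCK_SIZE: tl.constexpr):
    # illustrative body: read the constexpr global inside the jit'ed function
    row = tl.program_id(0) * ROPE_GROUP_SIZE
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    vals = tl.load(X + row * stride + cols, mask=mask, other=0.0)
    tl.store(X + row * stride + cols, vals * 2.0, mask=mask)

x = torch.randn(8, 16, device="cuda")
_double_rows[(2,)](x, x.stride(0), x.shape[1], BLOCK_SIZE=16)
```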

Inconsistency in Padding-Free Benchmarks with Different Transformers Versions

Description

We observe no improvement with PaddingFree on QLoRA and GPTQ-LoRA when running benchmarks on OrcaMath.

However,

  • additionally applying FOAK along with PaddingFree shows significant improvement.
  • the benchmarks using FLAN also show an improvement when PaddingFree is used with QLoRA and GPTQ-LoRA.

Mistral7B (OrcaMath) with transformers==4.42.4

| Framework Type | Num Devices | Device Batch Size | Train Runtime (sec) | Throughput (toks/sec) |
|---|---|---|---|---|
| BNB | 1 | 4 | 346 | 1586 |
| BNB + PF | 1 | 4 | 340 | 1595 |
| BNB + FOAK | 1 | 4 | 314 | 1748 |
| BNB + FOAK + PF | 1 | 4 | 245 | 2229 |

Mistral7B (FLAN) with transformers==4.42.4

| Framework Type | Num Devices | Device Batch Size | Train Runtime (sec) | Throughput (toks/sec) |
|---|---|---|---|---|
| BNB | 1 | 4 | 1888 | 1500 |
| BNB + PF | 1 | 4 | 1225 | 2314 |

NOTE: There is some variability between transformers versions; when transformers is upgraded:

Mistral7B (OrcaMath) with transformers==4.44.0

| Framework Type | Num Devices | Device Batch Size | Train Runtime (sec) | Throughput (toks/sec) |
|---|---|---|---|---|
| BNB | 1 | 4 | 347 | 1586 |
| BNB + PF | 1 | 4 | 318 | 1726 |

Enable CUDA Unit Tests in GH Workflows

Currently we are skipping some FOAK tests because they require CUDA to run, but will be good to enable CUDA in the github actions so that we can run these tests.

Allow Fused Ops to Support Dropout

Right now the fused ops do not support dropout, but it could be quite trivially supported, as this is the implementation of dropout in QuantLinear in both peft.tuners.lora.bnb and peft.tuners.lora.gptq:

output = lora_B(lora_A(dropout(x)))
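A small sketch of where dropout would slot in, mirroring the QuantLinear forward above: only the input of the LoRA branch sees dropout, while the (fused) base matmul is untouched. Names are illustrative, not the fused-ops code.

```python
import torch

def lora_branch_with_dropout(x, lora_A, lora_B, p=0.1, training=True):
    # dropout only the input of the LoRA branch, as in output = lora_B(lora_A(dropout(x)))
    return lora_B(lora_A(torch.nn.functional.dropout(x, p=p, training=training)))
```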

Model Patcher To Work with Generics

Currently the model patcher triggers on instance checks of the targeted modules. This could be converted to generics, for example

  • detecting based on semantics of class name. e.g. RMSNorm is a norm module
  • based on semantics of class instance, e.g., module.norm is a norm.

Another consideration is to have a config that allows users to specify target hints, à la what is done for LoRA target_modules.

Doing this, we can have generic patchers that will work on all models

  • However, we should raise an error if patching is requested but nothing is found (see the sketch below).
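A hypothetical sketch of a class-name based trigger; the regex and the error message are illustrative.

```python
import re
import torch

NORM_NAME_PATTERN = re.compile(r"norm$", re.IGNORECASE)  # e.g., LlamaRMSNorm, LayerNorm

def find_norm_modules(model: torch.nn.Module):
    # match by class-name semantics instead of isinstance checks on a fixed list
    matched = {
        name: module
        for name, module in model.named_modules()
        if NORM_NAME_PATTERN.search(type(module).__name__)
    }
    if not matched:
        # raise if patching was requested but nothing was found
        raise ValueError("model patcher: no norm-like modules found to patch")
    return matched
```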

Support Position Ids in Rope

This could be made possible simply by providing the correct sin and cos values, adjusted according to position_ids. This can be done outside of the kernel and the adjusted values then passed in:

def _rope_embedding(
    Q,     Q_row_stride,
    cos, cos_row_stride,
    sin, sin_row_stride,
    seqlen,
    head_dim      : tl.constexpr,
    n_heads       : tl.constexpr,
    BACKWARD_PASS : tl.constexpr,
    BLOCK_SIZE    : tl.constexpr,
):
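A small sketch of the host-side adjustment, assuming cos/sin caches of shape (max_seq_len, head_dim) and position_ids of shape (batch, seq_len); the kernel above would then receive the already-gathered values.

```python
import torch

def gather_rope_cache(cos: torch.Tensor, sin: torch.Tensor, position_ids: torch.Tensor):
    # index the caches by position_ids outside the kernel: (batch, seq_len, head_dim)
    return cos[position_ids], sin[position_ids]
```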

Fix Issues With Benchmark Script

The benchmark script can still be improved in a couple of ways:

Edit: We need to triage these

  • progressively write out the results as each experiment completes (see the sketch after this list).
  • also measure the GPU mem usage and update the device hardware (e.g., A100). See #8
  • switch from accelerate.yaml to [command line args](https://huggingface.co/docs/accelerate/en/package_reference/cli) so we have one less YAML to manage.
  • print out the command line and accelerate args to experiment.save_dir so there is a record; there is already one in stderr, but something tidier, like a YAML, would be better.
  • grep the FMS accelerate printout from stdout to confirm the correct plugin is activated.
  • make sure the results directory is cleared (but maybe this can be done in run_benchmarks.sh; see #13).
  • add more logs to stdout to inform what is happening in the benchmark script (e.g., directory created, results written, etc.)
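A sketch of the first item, appending to the results file as each experiment completes so partial results survive a crashed run; the file name and field names are illustrative.

```python
import csv
import os

def append_result(file_results: str, row: dict):
    # append one row per finished experiment instead of writing everything at the end
    write_header = not os.path.exists(file_results)
    with open(file_results, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if write_header:
            writer.writeheader()
        writer.writerow(row)

append_result("results.csv", {"framework_config": "accelerated-peft-autogptq", "train_runtime": 346})
```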

Allow BNB Plugin to be Loaded Without PEFT Wrapping

This issue concerns the warnings that for QLoRA PEFT we should pass peft_config directly to SFTTrainer.

  • if no_peft_model: True then model_loader will simply load the BNB model, and some logic is needed to set requires_augmentation: False
  • peft wrapping will be delegated to trl.SFTTrainer
  • add one test for this new flag in tests/test_peft_plugin
  • in generate sample configurations, update the CONFIGURATIONS and COMBINATIONS:
    CONFIGURATIONS = {
        KEY_AUTO_GPTQ: "plugins/accelerated-peft/configs/autogptq.yaml",
        KEY_BNB_NF4: (
            "plugins/accelerated-peft/configs/bnb.yaml",
            [("peft.quantization.bitsandbytes.quant_type", "nf4")],
        ),
        KEY_BNB_NF4_BASELINE: (
            "plugins/accelerated-peft/configs/bnb.yaml",
            [
                ("peft.quantization.bitsandbytes.quant_type", "nf4"),
                ("peft.quantization.bitsandbytes.no_peft_model", True),
            ],
        ),
    }

    COMBINATIONS = [
        ("accelerated-peft-autogptq", (KEY_AUTO_GPTQ,)),
        ("accelerated-peft-bnb-nf4", (KEY_BNB_NF4,)),
        ("baseline-bnb-nf4", (KEY_BNB_NF4_BASELINE,)),
    ]
  • regenerate the benches, update the CSV and the README

configs/bnb.yaml

add a new flag no_peft_model

# PEFT-related acceleration
peft:

  # quantization-related acceleration
  # e.g., kernels for quantized base weights
  quantization: 

    # For loading BitsAndBytes quantized layers
    # to serve as 4bit base-weights for LoRA PEFT-tuning.
    # NOTE: currently AutoGPTQ is not properly integrated into huggingface /
    # bitsandbytes, thus recommended quant_type to be either "nf4"
    # or "fp4".
    # bitsandbytes:
    bitsandbytes:
      quant_type: nf4 

      # If True, then no get_peft_model and prepare_model_for_kbit_training
      # will be called. 
      no_peft_model: False
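A hypothetical sketch of how the model loading could branch on the new flag; this is not the plugin's actual code, just an illustration of the checklist above using standard transformers / bitsandbytes loading.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_bnb_base_model(model_name: str, no_peft_model: bool, quant_type: str = "nf4"):
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type=quant_type,
            bnb_4bit_compute_dtype=torch.float16,
        ),
    )
    # when no_peft_model is True, skip get_peft_model / prepare_model_for_kbit_training
    # here and delegate the PEFT wrapping to trl's SFTTrainer via peft_config
    requires_augmentation = not no_peft_model
    return model, requires_augmentation
```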

Enable packaging CUDA wheels

We are missing releases. To this end we need to package wheels:

  1. Activate a packaging GH action. We can take reference from how FMS does it.
  2. Check whether we need CUDA deps for packaging. I believe so, because our deps have CUDA deps, but check first.
  3. If so, package using a GH cuda-toolkit action.
  4. Follow the RH flow to upload the deps for build and inspect, or upload to PyPI test first: https://github.com/instructlab/GPTDolomite/blob/main/.github/workflows/pypi.yaml#L36-L51
  5. Once OK, create a PyPI project and upload.

There will be one pypi project for each plugin

Memory Consumption for GPTQ-LoRA is higher than QLoRA in Distributed Finetuning

Issue:

There seems to be a difference in how FSDP handles GPTQ-LoRA sharding compared to QLoRA. We observe this in the following benchmarks...

Observations on Llama2-70B

1. Lower Memory Consumption for GPTQ-LoRA vs QLoRA for single device finetuning

We notice for Llama2-70B that GPTQ-LoRA (59.0 GiB) consumes 6 GiB less memory than QLoRA (65 GiB) for single device finetuning.

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | NousResearch/Llama-2-70b-hf | 1 | 4 | 445 | 65.0 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 1 | 4 | 451 | 59.0 |

2. Higher Memory Consumption for GPTQ-LoRA vs QLoRA for distributed finetuning

However, GPTQ-LoRA (53 GiB/device) consumes 24.9 GiB/device more than QLoRA (29.5 GiB/device) for distributed finetuning.

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | NousResearch/Llama-2-70b-hf | 2 | 2 | 422 | 29.5 / 36.3 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 2 | 2 | 438 | 53.2 / 61.2 |

3. Minimal memory savings observed when batchsize drops and number of GPUs increases in distributed finetuning

Despite sharding the model (Ngpus=1 -> Ngpus=2) and halving the batch size (bs=4/device -> bs=2/device) for 70B models, we also noticed that the memory usage per device did not decrease (59 GiB -> 61 GiB).

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 1 | 4 | 451 | 59.0 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 2 | 2 | 438 | 53.2 / 61.2 |

Observations on Mixtral

The same issue occurs to a smaller degree with Mixtral. Comparing single device to a distributed setting, we notice a 13 GiB overhead for GPTQ-LoRA compared to a 1.6 GiB overhead for QLoRA at the same batch size.

1. GPTQ-LoRA - 13 GiB increase in memory consumption from single device to distributed finetuning

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-autogptq | TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 1 | 4 | 1854 | 23.9 |
| accelerated-peft-autogptq | TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 2 | 4 | 1821 | 37.0 |

2. QLoRA - Lower overhead (1.6 GiB) in memory consumption from single device to distributed finetuning

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 1793 | 24.6 |
| accelerated-peft-bnb | mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 1731 | 26.2 |

This suggests that there could be some memory-leak bug when sharding with the accelerated-peft-autogptq plugin.

BNB Benchmark Experiments Run Out of Memory with Non-Zero Lora Dropout

Description

BNB experiments run out of memory in new benchmarks that set lora_dropout=0.1.

| Benchmark | framework_config | peft_method | model_name_or_path | num_gpus | per_device_train_batch_size | lora dropout | Peak Memory in Bytes |
|---|---|---|---|---|---|---|---|
| Reference | accelerated-peft-bnb | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0. | 72.39 |
| New | accelerated-peft-bnb | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0.1 | 0. |

We do not notice this issue with AutoGPTQ:

| Benchmark | framework_config | peft_method | model_name_or_path | num_gpus | per_device_train_batch_size | lora dropout | Peak Memory in Bytes |
|---|---|---|---|---|---|---|---|
| Reference | accelerated-peft-autogptq | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0. | 70.14 |
| New | accelerated-peft-autogptq | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0.1 | 71.7 |

There might be a slight overhead in the dropout implementation that causes the experiment to run out of memory for large models.

Reproduce Issue

Lora Dropout=0. enters training

export CUDA_VISIBLE_DEVICES=0,1
export ACCELERATION_FRAMEWORK_CONFIG_FILE=/workspace/fms-acceleration/scripts/benchmarks/../../sample-configurations/baseline-peft-bnb-nf4-sample-configuration.yaml
accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path NousResearch/Llama-2-70b-hf --packing True --max_seq_len 4096 --fp16 True --learning_rate 2e-4 --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0. --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '
### Response:' --dataset_text_field output --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 100 --training_data_path benchmark_outputs/data/cache.json --per_device_train_batch_size 4 --output_dir benchmark_outputs/exp_35/hf --skip_memory_metrics False

Lora Dropout=0.1 runs out of memory

export CUDA_VISIBLE_DEVICES=0,1
export ACCELERATION_FRAMEWORK_CONFIG_FILE=/workspace/fms-acceleration/scripts/benchmarks/../../sample-configurations/baseline-peft-bnb-nf4-sample-configuration.yaml
accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path NousResearch/Llama-2-70b-hf --packing True --max_seq_len 4096 --fp16 True --learning_rate 2e-4 --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0.1 --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '
### Response:' --dataset_text_field output --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 100 --training_data_path benchmark_outputs/data/cache.json --per_device_train_batch_size 4 --output_dir benchmark_outputs/exp_35/hf --skip_memory_metrics False

Release Upper Limit on Torch, Transformers, Accelerate

Currently the torch dependency in framework is upper bounded at "< 2.3"; however, some accelerate versions have problems supporting torch 2.2. The latest numpy versions (>= 2.0) also have incompatibilities with the current torch version and are bounded in #42. Hence, we should consider relaxing the upper bound soon.

We can also consider relaxing the upper limits on transformers and accelerate.

Add GPU measurements to Benchmark Script

We should include the GPU memory consumption in FILE_RESULTS:

  • new column gpu_mem (this can be in MiB)

The GPU memory can be collected by:

  • starting a async process running nvidia-smi and log to a FILE_MEM
  • perform the subprocess.run
  • read FILE_MEM and process out how much memory was consumed

Points to note:

  • As part of the processing we should estimate the GPU memory already consumed by the device before the subprocess.run and account for that appropriately.
  • If multiple GPUs are used, how should we report the value? Should it be the average?
  • We pass CUDA_VISIBLE_DEVICES as an environment variable to the subprocess, but it is not clear which GPUs nvidia-smi will report on. It is best to specify the device id(s) in the nvidia-smi call.
  • We use nvidia-smi, which reports the reserved memory, not the allocated memory. This needs to be documented.
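A sketch of the async nvidia-smi logging described above; the query fields, the one-second sampling interval, and the file names are choices for illustration, not requirements.

```python
import subprocess

def start_gpu_logger(file_mem: str, device_ids: str = "0"):
    # sample used (reserved) GPU memory once per second into FILE_MEM until terminated
    cmd = [
        "nvidia-smi",
        "--query-gpu=timestamp,index,memory.used",
        "--format=csv,noheader,nounits",
        "-i", device_ids,   # restrict to the benchmarked devices
        "-l", "1",          # loop every 1 second
        "-f", file_mem,     # redirect output to FILE_MEM
    ]
    return subprocess.Popen(cmd)

# proc = start_gpu_logger("gpu_mem.csv", device_ids="0,1")
# ... subprocess.run(<benchmark command>) ...
# proc.terminate()
```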
