
fms-acceleration's Introduction

FMS Acceleration 🚀

FMS Acceleration is designed to accelerate the fine-tuning and training of large models. This framework comprises a collection of libraries intended to be used with the fms-hf-tuning suite.

The fms-acceleration framework includes accelerators for Full and Parameter Efficient Fine Tuning (PEFT), including

  • Low Rank Adaptation (LoRA) acceleration (coming soon)
  • Bits-and-Bytes (BNB) quantised LoRA : QLoRA acceleration
  • AutoGPTQ quantised LoRA : GPTQ-LoRA acceleration
  • Full Fine Tuning acceleration (coming soon)
  • Padding-Free Attention

Our tests show a significant increase in training token throughput using this fms-acceleration framework.

For example:

  • QLoRA: 22-43% token throughput increase on 1 GPU compared to using Hugging Face BNB QLoRA
  • QLoRA: straightforward integration with multiple GPUs compared to using Hugging Face BNB QLoRA
  • GPTQ-LoRA: 22-44% token throughput increase on 1 GPU compared to using Hugging Face BNB QLoRA
  • GPTQ-LoRA: straightforward integration with multiple GPUs compared to using Hugging Face BNB QLoRA

The numbers above include runs using fused-ops-and-kernels; the actual implementation is coming soon, see below.

This package is in BETA and is under development. Expect breaking changes!

Plugins

| Plugin | Description | Depends | License | Status |
|---|---|---|---|---|
| framework | This acceleration framework for integration with Hugging Face trainers | | | Alpha |
| accelerated-peft | For PEFT-training, e.g., 4bit QLoRA | Huggingface, AutoGPTQ | Apache 2.0, MIT | Alpha |
| fused-op-and-kernels | Fused LoRA and triton kernels (e.g., fast cross-entropy, rms, rope) | -- | Apache 2.0 (contains extracted code) | Beta |
| attention-and-distributed-packing | Padding-Free Flash Attention Computation | flash-attn | Apache 2.0 | Beta |
| MOE-training-acceleration | MegaBlocks inspired triton kernels and accelerations for Mixture-of-Expert models | | Apache 2.0 | Coming Soon |

Usage with FMS HF Tuning

Below we demonstrate how to accelerate your tuning experience with tuning/sft_trainer.py from fms-hf-tuning.

Note: New exciting plugins will be added over time, so please check here for the latest accelerations!

Integration with FMS HF Tuning

fms-acceleration is part of fms-hf-tuning, and instructions to utilize fms-acceleration for tuning are found here. In particular, fms-acceleration plugins can be accessed via command line arguments to fms-hf-tuning (e.g., --auto_gptq triton_v2); this is made available via integrated configuration dataclasses that configure the AccelerationFramework for the user.

Need for an alternative way to access features pre-integration

As new plugins become available, more command line arguments will be made available to fms-hf-tuning to enable them. However, this kind of integration takes time; plugins that are in development / research stages may not be immediately integrated.

Therefore, an intermediary step is required to access plugins in fms-acceleration before they are integrated into fms-hf-tuning. In fact, such a method is critical for the benchmarking and testing that needs to happen before the integration of any plugin into fms-hf-tuning can even be considered. Hence, we provide a method to configure the acceleration framework via a configuration YAML that is passed into AccelerationFramework via an environment variable; the instructions for this are provided below. Furthermore, experienced users can also leverage this to test plugins early, but be warned that the learning curve is steep (since it requires knowledge of how to write such a configuration). To aid with this, the following instructions describe both a basic and an advanced flow.
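For orientation, here is a minimal Python sketch of the same mechanism that the shell examples below use: the environment variable points at a framework config YAML and sft_trainer.py is launched with its usual, unchanged arguments. The path framework.yaml and the trainer arguments are placeholders.

```python
import os
import subprocess

# a minimal sketch: point ACCELERATION_FRAMEWORK_CONFIG_FILE at a framework config
# YAML and launch sft_trainer.py with its usual (unchanged) arguments
env = dict(os.environ, ACCELERATION_FRAMEWORK_CONFIG_FILE="framework.yaml")
subprocess.run(
    ["python", "sft_trainer.py"],  # append the usual fms-hf-tuning arguments here
    env=env,
    check=True,
)
```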

FMS Acceleration Via Configuration YAML

Note: As mentioned above, the recommended approach for fms-hf-tuning is to use the acceleration config dataclasses. The configuration YAML method documented here is only for testing/research purposes and is not recommended for production. For general use, please refer instead to the instructions here.

Below we illustrate the configuration YAML flow for accelerated quantised PEFT, using GPTQ-LoRA tuning with the AutoGPTQ triton_v2 kernel; this state-of-the-art kernel was contributed by jeromeku in March 2024:

There is both a basic and advanced usage for the configuration YAML flow.

Usage Flows

Basic Configuration YAML Flow 🤡

Most users of fms-hf-tuning only require the basic flow:

  • Assumption 1: the user already has a prepared configuration, say from sample-configurations.
  • Assumption 2: the user knows exactly which acceleration plugins are required (based on the configuration).
  • Assumption 3: the arguments for running sft_trainer.py are the same, save for one extra argument --acceleration_framework_config_file used to pass in the acceleration config.

In this case the basic flow comprises 3 steps:

  1. First go to fms-hf-tuning and install the framework library:

    $ pip install -e .[fms-accel]
    

    or alternatively install the framework directly:

    $ pip install git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/framework
    

    The above installs the command line utility fms_acceleration.cli, which is used to install plugins (and also other things like view sample configurations).

  2. Install the required framework plugins; we install the fms-acceleration-peft plugin for GPTQ-LoRA tuning with triton v2 as follows:

    python -m fms_acceleration.cli install fms_acceleration_peft
    

    The above is the equivalent of:

    pip install git+https://github.com/foundation-model-stack/fms-acceleration.git#subdirectory=plugins/accelerated-peft
    
  3. Run sft_trainer.py, providing the acceleration configuration (via the environment variable ACCELERATION_FRAMEWORK_CONFIG_FILE) together with the usual arguments; per the basic flow assumption, we simply re-use the same sft_trainer.py arguments as we would without the fms_acceleration package:

    # when using sample-configurations, arguments can be referred from
    # defaults.yaml and scenarios.yaml
    ACCELERATION_FRAMEWORK_CONFIG_FILE=framework.yaml \
    python sft_trainer.py \
        ...  # arguments
    

    The framework activates relevant plugins given the framework configuration; for more details see framework/README.md.

    Set TRANSFORMERS_VERBOSITY=info to see the Hugging Face trainer printouts and verify that AccelerationFramework is activated!

    # this printout will be seen in huggingface trainer logs if acceleration is activated
    ***** FMS AccelerationFramework *****
    Active Plugin: AutoGPTQAccelerationPlugin. Python package: fms_acceleration_peft. Version: 0.0.1.
    ***** Running training *****
    Num examples = 1,549
    Num Epochs = 1
    Instantaneous batch size per device = 4
    Total train batch size (w. parallel, distributed & accumulation) = 4
    Gradient Accumulation steps = 1
    Total optimization steps = 200
    Number of trainable parameters = 13,631,488
    

Advanced Configuration YAML Flow 🥷 🦹

The advanced flow makes further use of fms_acceleration.cli to:

  • list all available configs and the acceleration plugins they depend on.
  • list all available plugins and check which ones are installed.
  • identify critical sft_trainer arguments required for correct operation of a particular framework config.

The advanced flow comprises 5 steps:

  1. Same as Step 1 of basic flow.

  2. Use fms_acceleration.cli configs to search for sample configs:

    $ python -m fms_acceleration.cli configs
    
    1. accelerated-peft-autogptq (accelerated-peft-autogptq-sample-configuration.yaml) - plugins: ['accelerated-peft']
    2. accelerated-peft-bnb (accelerated-peft-bnb-nf4-sample-configuration.yaml) - plugins: ['accelerated-peft']
    

    This is equivalent to searching over the sample configurations directory.

  3. Install plugins as in Step 2 of the basic flow, noting that in addition we can use plugins to display all available plugins; this list updates as more plugins get developed. Recall that configs lists the required plugins for the sample configurations; make sure all of them are installed.

    $ python -m fms_acceleration.cli plugins
    
    Choose from the list of plugin shortnames, and do:
    * 'python -m fms_acceleration.cli install <pip-install-flags> PLUGIN_NAME'.
    
    List of PLUGIN_NAME [PLUGIN_SHORTNAME]:
    
    1. fms_acceleration_peft [peft]
    

    After installation, the list will update to indicate the installed plugins.

  4. Get the correct arguments for sft_trainer.py:

    • arguments required for correct operation (e.g., if using accelerated peft, then peft_method is required).

      $ python -m fms_acceleration.cli arguments accelerated-peft-autogptq
      
      Searching for configuration shortnames: ['accelerated-peft-autogptq']
      1. scenario: accelerated-peft-gptq
      configs: accelerated-peft-autogptq
      arguments:
          --learning_rate 2e-4 \
          --fp16 True \
          --torch_dtype float16 \
          --peft_method lora \
          --r 16 \
          --lora_alpha 16 \
          --lora_dropout 0.0 \
          --target_modules ['q_proj', 'k_proj', 'v_proj', 'o_proj']
      
    • More info on defaults.yaml and scenarios.yaml can be found here.

      • Arguments not critical to the plugins are found in defaults.yaml. These can be set freely.
      • Arguments critical to the plugins are found in scenarios.yaml. The relevant section of scenarios.yaml is the one whose framework_config entries match the shortname of the sample configuration of interest (see the sketch after this list).
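To make the scenarios.yaml lookup concrete, here is a small sketch. The file path and the layout (a top-level scenarios list with framework_config and arguments keys) are assumptions inferred from the description above, so adjust them to the actual file.

```python
import yaml  # PyYAML

# locate the scenarios.yaml entry whose framework_config matches a config shortname
def find_scenario(scenarios_path: str, config_shortname: str):
    with open(scenarios_path) as f:
        content = yaml.safe_load(f)
    for scenario in content.get("scenarios", []):
        if config_shortname in scenario.get("framework_config", []):
            return scenario
    return None

# e.g., the accelerated-peft-autogptq sample configuration (path is illustrative)
scenario = find_scenario("scripts/benchmarks/scenarios.yaml", "accelerated-peft-autogptq")
if scenario is not None:
    print(scenario.get("arguments"))
```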

CUDA Dependencies

This repo requires CUDA to run the kernels, and it is convenient to use the NVIDIA PyTorch containers that already come with CUDA installed. We have tested with the following versions:

  • pytorch:24.01-py3

Benchmarks

The benchmarks can be reproduced with the provided scripts.

See the CSV files in the benchmarks directory for various results.

Code Architecture

For deeper dive into details see framework/README.md.

Maintainers

IBM Research, Singapore


fms-acceleration's Issues

Group Memory Field Names with Common Prefix

Description

Comparison of the memory fields in the summary benchmarks is unintuitive.

Pandas sorts the fields in alphabetical order, and because the memory tracking fields are not named with a common prefix in the current benchmarking code, the report ends up with memory values in columns far apart from each other in the table (see the image in the original issue).

Proposed Solution

Renaming all memory values with a common mem_ prefix like the following

  • mem_nvidia_peak_reserved
  • mem_torch_peak_allocated
  • mem_torch_allocated

Update: Also remove the index column that is introduced into the benchmark file.

This way when the results are gathered and exported to csv, the columns are grouped together intuitively for comparison.
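A small sketch of the proposed renaming, assuming the benchmark results are post-processed with pandas; the pre-rename column names below are illustrative, not the actual field names in the benchmarking code.

```python
import pandas as pd

# rename memory columns with a common "mem_" prefix so that pandas' alphabetical
# column ordering groups them together, and drop the spurious index column on export
RENAME_MAP = {
    "nvidia_peak_reserved": "mem_nvidia_peak_reserved",   # illustrative old names
    "torch_peak_allocated": "mem_torch_peak_allocated",
    "torch_allocated": "mem_torch_allocated",
}

df = pd.read_csv("benchmarks.csv")
df = df.rename(columns={k: v for k, v in RENAME_MAP.items() if k in df.columns})
df.to_csv("benchmarks.csv", index=False)
```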

Extract out AutoGPTQ dependency

The AutoGPTQ repo looks like it hasn't had any releases since Mar 1, 2024. While it is still a dependency of major frameworks such as huggingface and vllm, it is concerning that it might get deprecated. Maybe we can consider extracting out the relevant code, since it is under an MIT license.

Benchmark CSV Corrupted When HF Memory Metrics Are Disabled

When the HF memory metrics are disabled, running the benches as follows:

MEMORY_LOGGING=nvidia \
bash scripts/run_benchmarks.sh \
    ...

we noticed that the benchmark.csv file only has the framework_config and torch_dtype columns; the other columns are empty.

Extract Out Model Patcher to Framework

There really should be only one model patcher; it should live in framework and coordinate all patching.

  • this should be accompanied by relevant unit tests.

Introduce a Better Dequantization Fix on Triton Function for FOAK Plugin's GPTQ Fused Operations

Description

Currently, FOAK's GPTQ Fused Operations maintain their own dequantization triton kernel. This is incompatible with the Accelerated-PEFT plugin when the local AutoGPTQ package is used, due to a line removed in the dequantization function of the local package (see here).

Without a fix, the dequantization produces wrong base outputs when the FOAK plugin is used with the local package. The current fix to the FOAK dequantization function in #48 detects whether the Accelerated-PEFT plugin uses the external AutoGPTQ library; if it does, zeros = zeros + 1 is added in the function, otherwise that line is skipped for the local library.

A better fix would be for the FOAK plugin to rely on the accelerated_peft plugin to manage which dequantization function to use (local autogptq package or official autogptq), rather than maintaining a similar set of functions itself.

Failure in FSDP Benchmark Experiment using QLoRA with Custom Fused Modules

Problem

Distributed experiments in the benchmarks fail when using BNB's nf4 QLoRA with unsloth fused module optimizations.

Cause

Distributed experiments for BNB's nf4 QLoRA without the fused modules don't throw any errors. Suspected incompatibility between FSDP, BNB kernels and Unsloth's matmul.

Stacktrace from test repo:

 File "/data/aaron/experimental/fms-acceleration/libs/unsloth/src/fms_accelerate_unsloth/kernels/fast_lora.py", line 227, in forward
    Q = matmul_lora(X, QW, QW_quant, QA, QB, QS)
  File "/data/aaron/experimental/fms-acceleration/libs/unsloth/src/fms_accelerate_unsloth/kernels/utils.py", line 235, in matmul_lora
    out = torch.matmul(X, W, out = out)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
  0%|                                                                                                                  | 0/100 [00:01<?, ?it/s]

Setting the debug environment variable CUDA_LAUNCH_BLOCKING=1 produces the error "an illegal memory access was encountered" at line 90 in file /src/csrc/ops.cu. This is traced to the dequantizeBlockwise CUDA function.

Reproduce

accelerate launch \
    --config_file ./accelerate.yaml \
    --num_processes=2 \
    --main_process_port=29500  -m tuning.sft_trainer \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v0.3 \
    --acceleration_framework_config_file ./sample-configurations/accelerated-peft-bnb-nf4-unsloth-sample-configuration.yaml \
    --packing True \
    --max_seq_len 2048 \
    --fp16 True \
    --learning_rate 2e-4 \
    --torch_dtype float16 \
    --peft_method lora \
    --r 16 \
    --lora_alpha 16 \
    --lora_dropout 0.0 \
    --target_modules q_proj k_proj v_proj o_proj \
    --use_flash_attn True \
    --response_template \n### Response: \
    --dataset_text_field output \
    --include_tokens_per_second True \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing True \
    --evaluation_strategy no \
    --save_strategy no \
    --weight_decay 0.01 \
    --warmup_steps 10 \
    --adam_epsilon 1e-4 \
    --lr_scheduler_type linear \
    --logging_strategy steps \
    --logging_steps 10 \
    --max_steps 100 \
    --training_data_path ./data/benchmark_data.json \
    --per_device_train_batch_size 2 \
    --output_dir results/exp_5/hf

Modify Existing Triton Kernels to Support Newer Triton Versions

Description:

Newer versions of torch (>= 2.4.0) require Triton >= 3.0.0, which causes an error in the JIT compilation of triton kernels that currently access global variables inside new checks here.

Example:

NameError("Cannot access global variable ROPE_GROUP_SIZE from within @jit'ed function. Triton kernels can only access global variables that are annotated as constexpr (x: triton.language.constexpr = 42 or x = triton.language.constexpr(42)).  Alternatively, set the envvar TRITON_ALLOW_NON_CONSTEXPR_GLOBALS=1, but we do not promise to support this forever.")
  • This is currently mitigated using the TRITON_ALLOW_NON_CONSTEXPR_GLOBALS variable, which temporarily allows access to globals.
  • A more permanent fix would be to change the triton kernels to access global variables annotated as constexpr (x: triton.language.constexpr = 42 or x = triton.language.constexpr(42)); see the sketch below.
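A minimal sketch of the constexpr approach; the kernel body is illustrative only and is not one of the actual FOAK kernels.

```python
import torch
import triton
import triton.language as tl

# annotate the module-level global as a triton constexpr so that @triton.jit kernels
# can reference it under Triton >= 3.0 without TRITON_ALLOW_NON_CONSTEXPR_GLOBALS
ROPE_GROUP_SIZE: tl.constexpr = 4

@triton.jit
def _double_rows(X, stride, n_cols, BLOCK_SIZE: tl.constexpr):
    # illustrative body: read the constexpr global inside the jit'ed function
    row = tl.program_id(0) * ROPE_GROUP_SIZE
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    vals = tl.load(X + row * stride + cols, mask=mask, other=0.0)
    tl.store(X + row * stride + cols, vals * 2.0, mask=mask)

x = torch.randn(8, 16, device="cuda")
_double_rows[(2,)](x, x.stride(0), x.shape[1], BLOCK_SIZE=16)
```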

Inconsistency in Padding-Free Benchmarks with Different Transformers Versions

Description

We observe no improvement with PaddingFree on QLoRA and GPTQ-LoRA when running benchmarks on OrcaMath.

However,

  • additionally applying FOAK along with PaddingFree shows significant improvement.
  • the benchmarks using FLAN also show an improvement when PaddingFree is used with QLoRA and GPTQ-LoRA.

Mistral7B (OrcaMath) with transformers==4.42.4

| Framework Type | Num Devices | Device Batch Size | Train Runtime (sec) | Throughput (toks/sec) |
|---|---|---|---|---|
| BNB | 1 | 4 | 346 | 1586 |
| BNB + PF | 1 | 4 | 340 | 1595 |
| BNB + FOAK | 1 | 4 | 314 | 1748 |
| BNB + FOAK + PF | 1 | 4 | 245 | 2229 |

Mistral7B (FLAN) with transformers==4.42.4

| Framework Type | Num Devices | Device Batch Size | Train Runtime (sec) | Throughput (toks/sec) |
|---|---|---|---|---|
| BNB | 1 | 4 | 1888 | 1500 |
| BNB + PF | 1 | 4 | 1225 | 2314 |

NOTE: There is some variability between transformers versions; when transformers is upgraded:

Mistral7B (OrcaMath) with transformers==4.44.0

| Framework Type | Num Devices | Device Batch Size | Train Runtime (sec) | Throughput (toks/sec) |
|---|---|---|---|---|
| BNB | 1 | 4 | 347 | 1586 |
| BNB + PF | 1 | 4 | 318 | 1726 |

Enable CUDA Unit Tests in GH Workflows

Currently we are skipping some FOAK tests because they require CUDA to run, but will be good to enable CUDA in the github actions so that we can run these tests.

Allow Fused Ops to Support Dropout

Right now the fused ops do not support dropout, but it could be quite trivially supported, as this is the implementation of dropout in QuantLinear in both peft.tuners.lora.bnb and peft.tuners.lora.gptq:

output = lora_B(lora_A(dropout(x)))
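A small sketch of where dropout would slot in, mirroring the QuantLinear forward above: only the input of the LoRA branch sees dropout, while the (fused) base matmul is untouched. Names are illustrative, not the fused-ops code.

```python
import torch

def lora_branch_with_dropout(x, lora_A, lora_B, p=0.1, training=True):
    # dropout only the input of the LoRA branch, as in output = lora_B(lora_A(dropout(x)))
    return lora_B(lora_A(torch.nn.functional.dropout(x, p=p, training=training)))
```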

Model Patcher To Work with Generics

Currently the model patcher triggers on instance checks of the targeted modules. This could be converted to generics, for example

  • detecting based on semantics of class name. e.g. RMSNorm is a norm module
  • based on semantics of class instance, e.g., module.norm is a norm.

Another consideration is to have a config that allows users to specify target hints, à la what is done for LoRA target_modules.

Doing this, we can have generic patchers that will work on all models

  • However, we should raise an error if patching is requested but nothing is found (see the sketch below).
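A hypothetical sketch of a class-name based trigger; the regex and the error message are illustrative.

```python
import re
import torch

NORM_NAME_PATTERN = re.compile(r"norm$", re.IGNORECASE)  # e.g., LlamaRMSNorm, LayerNorm

def find_norm_modules(model: torch.nn.Module):
    # match by class-name semantics instead of isinstance checks on a fixed list
    matched = {
        name: module
        for name, module in model.named_modules()
        if NORM_NAME_PATTERN.search(type(module).__name__)
    }
    if not matched:
        # raise if patching was requested but nothing was found
        raise ValueError("model patcher: no norm-like modules found to patch")
    return matched
```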

Support Position Ids in Rope

This could be made possible simply by providing the correct sin and cos values, adjusted according to position_ids. This can be done outside of the kernel and the adjusted values then passed in:

def _rope_embedding(
    Q,     Q_row_stride,
    cos, cos_row_stride,
    sin, sin_row_stride,
    seqlen,
    head_dim      : tl.constexpr,
    n_heads       : tl.constexpr,
    BACKWARD_PASS : tl.constexpr,
    BLOCK_SIZE    : tl.constexpr,
):
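A small sketch of the host-side adjustment, assuming cos/sin caches of shape (max_seq_len, head_dim) and position_ids of shape (batch, seq_len); the kernel above would then receive the already-gathered values.

```python
import torch

def gather_rope_cache(cos: torch.Tensor, sin: torch.Tensor, position_ids: torch.Tensor):
    # index the caches by position_ids outside the kernel: (batch, seq_len, head_dim)
    return cos[position_ids], sin[position_ids]
```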

Fix Issues With Benchmark Script

The benchmark script can still be improved in a couple of ways:

Edit: We need to triage these

  • progressively write out the results as each experiment completes (see the sketch after this list).
  • also measure the GPU mem usage and update the device hardware (e.g., A100). See #8
  • switch from accelerate.yaml to [command line args](https://huggingface.co/docs/accelerate/en/package_reference/cli) so we have one less YAML to manage.
  • print out the command line and accelerate args to experiment.save_dir so there is a record; there is already one in stderr, but something tidier, like a YAML, would be better.
  • grep the FMS accelerate printout from stdout to confirm the correct plugin is activated.
  • make sure the results directory is cleared (but maybe this can be done in run_benchmarks.sh; see #13).
  • add more logs to stdout to inform what is happening in the benchmark script (e.g., directory created, results written, etc.)
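A sketch of the first item, appending to the results file as each experiment completes so partial results survive a crashed run; the file name and field names are illustrative.

```python
import csv
import os

def append_result(file_results: str, row: dict):
    # append one row per finished experiment instead of writing everything at the end
    write_header = not os.path.exists(file_results)
    with open(file_results, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if write_header:
            writer.writeheader()
        writer.writerow(row)

append_result("results.csv", {"framework_config": "accelerated-peft-autogptq", "train_runtime": 346})
```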

Allow BNB Plugin to be Loaded Without PEFT Wrapping

This issue concerns the warnings that for QLoRA PEFT we should pass peft_config directly to SFTTrainer.

  • if no_peft_model: True then model_loader will simply load the BNB model, and some logic is needed to set requires_augmentation: False
  • peft wrapping will be delegated to trl.SFTTrainer
  • add one test for this new flag in tests/test_peft_plugin
  • in generate sample configurations, update the CONFIGURATIONS and COMBINATIONS:
    CONFIGURATIONS = {
        KEY_AUTO_GPTQ: "plugins/accelerated-peft/configs/autogptq.yaml",
        KEY_BNB_NF4: (
            "plugins/accelerated-peft/configs/bnb.yaml",
            [("peft.quantization.bitsandbytes.quant_type", "nf4")],
        ),
        KEY_BNB_NF4_BASELINE: (
            "plugins/accelerated-peft/configs/bnb.yaml",
            [
                ("peft.quantization.bitsandbytes.quant_type", "nf4"),
                ("peft.quantization.bitsandbytes.no_peft_model", True),
            ],
        ),
    }

    COMBINATIONS = [
        ("accelerated-peft-autogptq", (KEY_AUTO_GPTQ,)),
        ("accelerated-peft-bnb-nf4", (KEY_BNB_NF4,)),
        ("baseline-bnb-nf4", (KEY_BNB_NF4_BASELINE,)),
    ]
  • regenerate the benches, update the CSV and the README

configs/bnb.yaml

add a new flag no_peft_model

# PEFT-related acceleration
peft:

  # quantization-related acceleration
  # e.g., kernels for quantized base weights
  quantization: 

    # For loading BitsAndBytes quantized layers
    # to serve as 4bit base-weights for LoRA PEFT-tuning.
    # NOTE: currently AutoGPTQ is not properly integrated into huggingface /
    # bitsandbytes, thus recommended quant_type to be either "nf4"
    # or "fp4".
    # bitsandbytes:
    bitsandbytes:
      quant_type: nf4 

      # If True, then no get_peft_model and prepare_model_for_kbit_training
      # will be called. 
      no_peft_model: False
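A hypothetical sketch of how the model loading could branch on the new flag; this is not the plugin's actual code, just an illustration of the checklist above using standard transformers / bitsandbytes loading.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_bnb_base_model(model_name: str, no_peft_model: bool, quant_type: str = "nf4"):
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type=quant_type,
            bnb_4bit_compute_dtype=torch.float16,
        ),
    )
    # when no_peft_model is True, skip get_peft_model / prepare_model_for_kbit_training
    # here and delegate the PEFT wrapping to trl's SFTTrainer via peft_config
    requires_augmentation = not no_peft_model
    return model, requires_augmentation
```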

Enable packaging CUDA wheels

We are missing releases. To this end we need to package wheels:

  1. Activate a packaging GH action. We can take reference from how FMS does it.
  2. Check whether we need CUDA deps for packaging. I believe so, because our deps have CUDA deps, but check first.
  3. If so, package using a GH cuda-toolkit action.
  4. Follow the RH flow to upload the deps for build and inspect, or upload to PyPI test first: https://github.com/instructlab/GPTDolomite/blob/main/.github/workflows/pypi.yaml#L36-L51
  5. Once OK, create a PyPI project and upload.

There will be one pypi project for each plugin

Memory Consumption for GPTQ-LoRA is higher than QLoRA in Distributed Finetuning

Issue:

There seems to be a difference in how FSDP handles GPTQ-LoRA sharding compared to QLoRA. We observe this in the following benchmarks...

Observations on Llama2-70B

1. Lower Memory Consumption for GPTQ-LoRA vs QLoRA for single device finetuning

We notice for Llama2-70B that GPTQ-LoRA (59.0 GiB) consumes 6 GiB less memory than QLoRA (65 GiB) for single device finetuning.

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | NousResearch/Llama-2-70b-hf | 1 | 4 | 445 | 65.0 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 1 | 4 | 451 | 59.0 |

2. Higher Memory Consumption for GPTQ-LoRA vs QLoRA for distributed finetuning

However, GPTQ-LoRA (53 GiB/device) consumes 24.9 GiB/device more than QLoRA (29.5 GiB/device) for distributed finetuning.

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | NousResearch/Llama-2-70b-hf | 2 | 2 | 422 | 29.5 / 36.3 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 2 | 2 | 438 | 53.2 / 61.2 |

3. Minimal memory savings observed when batchsize drops and number of GPUs increases in distributed finetuning

Despite sharding the model (Ngpus=1 -> Ngpus=2) and halving the batch size (bs=4/device -> bs=2/device) for 70B models, we also noticed that the memory usage per device did not decrease (59 GiB -> 61 GiB).

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 1 | 4 | 451 | 59.0 |
| accelerated-peft-autogptq | TheBloke/Llama-2-70b-GPTQ | 2 | 2 | 438 | 53.2 / 61.2 |

Observations on Mixtral

The same issue occurs to a smaller degree with Mixtral. Comparing single device to a distributed setting, we notice a 13 GiB overhead for GPTQ-LoRA compared to a 1.6 GiB overhead for QLoRA at the same batch size.

1. GPTQ-LoRA - 13 GiB increase in memory consumption from single device to distributed finetuning

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-autogptq | TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 1 | 4 | 1854 | 23.9 |
| accelerated-peft-autogptq | TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ | 2 | 4 | 1821 | 37.0 |

2. QLoRA - Lower overhead (1.6 GiB) in memory consumption from single device to distributed finetuning

| Acceleration Type | Model Name | Num GPUs | Batch Size | Throughput (toks/sec) / Device | Avg Mem Usage (GiB) |
|---|---|---|---|---|---|
| accelerated-peft-bnb | mistralai/Mixtral-8x7B-Instruct-v0.1 | 1 | 4 | 1793 | 24.6 |
| accelerated-peft-bnb | mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 | 4 | 1731 | 26.2 |

This suggests that there could be some memory-leak bug when sharding with the accelerated-peft-autogptq plugin.

BNB Benchmark Experiments Run Out of Memory with Non-Zero Lora Dropout

Description

BNB experiments run out of memory in new benchmarks that set lora_dropout=0.1.

| Benchmark | framework_config | peft_method | model_name_or_path | num_gpus | per_device_train_batch_size | lora dropout | Peak Memory in Bytes |
|---|---|---|---|---|---|---|---|
| Reference | accelerated-peft-bnb | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0. | 72.39 |
| New | accelerated-peft-bnb | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0.1 | 0. |

We do not notice this issue with AutoGPTQ:

| Benchmark | framework_config | peft_method | model_name_or_path | num_gpus | per_device_train_batch_size | lora dropout | Peak Memory in Bytes |
|---|---|---|---|---|---|---|---|
| Reference | accelerated-peft-autogptq | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0. | 70.14 |
| New | accelerated-peft-autogptq | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0.1 | 71.7 |

There might be a slight overhead in the dropout implementation that causes the experiment to run out of memory for large models.

Reproduce Issue

Lora Dropout=0. enters training

export CUDA_VISIBLE_DEVICES=0,1
export ACCELERATION_FRAMEWORK_CONFIG_FILE=/workspace/fms-acceleration/scripts/benchmarks/../../sample-configurations/baseline-peft-bnb-nf4-sample-configuration.yaml
accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path NousResearch/Llama-2-70b-hf --packing True --max_seq_len 4096 --fp16 True --learning_rate 2e-4 --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0. --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '
### Response:' --dataset_text_field output --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 100 --training_data_path benchmark_outputs/data/cache.json --per_device_train_batch_size 4 --output_dir benchmark_outputs/exp_35/hf --skip_memory_metrics False

Lora Dropout=0.1 runs out of memory

export CUDA_VISIBLE_DEVICES=0,1
export ACCELERATION_FRAMEWORK_CONFIG_FILE=/workspace/fms-acceleration/scripts/benchmarks/../../sample-configurations/baseline-peft-bnb-nf4-sample-configuration.yaml
accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path NousResearch/Llama-2-70b-hf --packing True --max_seq_len 4096 --fp16 True --learning_rate 2e-4 --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0.1 --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '
### Response:' --dataset_text_field output --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 100 --training_data_path benchmark_outputs/data/cache.json --per_device_train_batch_size 4 --output_dir benchmark_outputs/exp_35/hf --skip_memory_metrics False

Release Upper Limit on Torch, Transformers, Accelerate

Currently the torch dependency in framework is upper bounded at "< 2.3"; however, some accelerate versions have problems supporting torch 2.2. The latest numpy versions (>= 2.0) also have incompatibilities with the current torch version and are bounded in #42. Hence, we should consider relaxing the upper bound soon.

We can also consider relaxing the upper limits on transformers and accelerate.

Add GPU measurements to Benchmark Script

We should include the GPU memory consumption in FILE_RESULTS:

  • new column gpu_mem (this can be in MiB)

The GPU memory can be collected by:

  • starting a async process running nvidia-smi and log to a FILE_MEM
  • perform the subprocess.run
  • read FILE_MEM and process out how much memory was consumed

Points to note:

  • As part of the processing we should estimate the GPU memory already consumed by the device before the subprocess.run and account for that appropriately.
  • If multiple GPUs are used, how should we report the value? Should it be the average?
  • We pass CUDA_VISIBLE_DEVICES as an environment variable to the subprocess, but it is not clear which GPUs nvidia-smi will report on. It is best to specify the device id(s) in the nvidia-smi call.
  • We use nvidia-smi, which reports the reserved memory, not the allocated memory. This needs to be documented.
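A sketch of the async nvidia-smi logging described above; the query fields, the one-second sampling interval, and the file names are choices for illustration, not requirements.

```python
import subprocess

def start_gpu_logger(file_mem: str, device_ids: str = "0"):
    # sample used (reserved) GPU memory once per second into FILE_MEM until terminated
    cmd = [
        "nvidia-smi",
        "--query-gpu=timestamp,index,memory.used",
        "--format=csv,noheader,nounits",
        "-i", device_ids,   # restrict to the benchmarked devices
        "-l", "1",          # loop every 1 second
        "-f", file_mem,     # redirect output to FILE_MEM
    ]
    return subprocess.Popen(cmd)

# proc = start_gpu_logger("gpu_mem.csv", device_ids="0,1")
# ... subprocess.run(<benchmark command>) ...
# proc.terminate()
```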
