georgian-io / llm-finetuning-hub

Toolkit for fine-tuning, ablating and unit-testing open-source LLMs.

License: Apache License 2.0

Shell 1.48% Python 97.82% Dockerfile 0.64% Makefile 0.06%
classification fine-tuning finetuning large-language-models nlp nlp-machine-learning summarization falcon flan-t5 llama2 lora qlora redpajama ablation-study llm-test mistral-7b unit-testing zephyr

llm-finetuning-hub's Introduction

LLM Finetuning Toolkit

Overview

LLM Finetuning toolkit is a config-based CLI tool for launching a series of LLM fine-tuning experiments on your data and gathering their results. From one single yaml config file, control all elements of a typical experimentation pipeline - prompts, open-source LLMs, optimization strategy and LLM testing.

Installation

pipx (recommended)

pipx installs the package and dependencies in a separate virtual environment

pipx install llm-toolkit

pip

pip install llm-toolkit

Quick Start

This guide contains 3 stages that will enable you to get the most out of this toolkit!

  • Basic: Run your first LLM fine-tuning experiment
  • Intermediate: Run a custom experiment by changing the components of the YAML configuration file
  • Advanced: Launch a series of fine-tuning experiments across different prompt templates, LLMs, and optimization techniques, all through one YAML configuration file

Basic

llmtune generate config
llmtune run ./config.yml

The first command generates a starter config.yml file and saves it in the current working directory. It is meant to help users get started quickly and to serve as a base for further modification.

The second command then starts the fine-tuning process using the settings specified in that default YAML configuration file, config.yml.

Intermediate

The configuration file is the central piece that defines the behavior of the toolkit. It is written in YAML format and consists of several sections that control different aspects of the process, such as data ingestion, model definition, training, inference, and quality assurance. We highlight some of the critical sections.

Flash Attention 2

To enable Flash Attention 2 for supported models, first install flash-attn:

pipx

pipx inject llm-toolkit flash-attn --pip-args=--no-build-isolation

pip

pip install flash-attn --no-build-isolation

Then add the following to the config file:

model:
  torch_dtype: "bfloat16" # or "float16" if using older GPU
  attn_implementation: "flash_attention_2"

Data Ingestion

An example of what the data ingestion may look like:

data:
  file_type: "huggingface"
  path: "yahma/alpaca-cleaned"
  prompt: >-
    ### Instruction: {instruction}
    ### Input: {input}
    ### Output:
  prompt_stub: >-
    {output}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42
  • While the above example illustrates using a public dataset from Hugging Face, the config file can also ingest your own data.
   file_type: "json"
   path: "<path to your data file>"

   file_type: "csv"
   path: "<path to your data file>"
  • The prompt fields help create instructions to fine-tune the LLM on. It reads data from specific columns, mentioned in {} brackets, that are present in your dataset. In the example provided, it is expected for the data file to have column names: instruction, input and output.

  • The prompt fields use both prompt and prompt_stub during fine-tuning. However, during testing, only the prompt section is used as input to the fine-tuned LLM.
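The two bullets above can be sketched in Python. This is only an illustration, not the toolkit's own code: `build_example` is a hypothetical helper, and it assumes the templates are filled with `str.format` from one dataset row and joined with a single space.

```python
# Hypothetical helper; the toolkit's real implementation may differ.
PROMPT = "### Instruction: {instruction} ### Input: {input} ### Output:"
PROMPT_STUB = "{output}"


def build_example(row: dict, training: bool) -> str:
    """Fill the templates from one dataset row.

    Training text is prompt + prompt_stub; test-time input is the prompt only.
    """
    prompt = PROMPT.format(**row)
    if not training:
        return prompt
    # Assumption: prompt and stub are joined with a single space.
    return prompt + " " + PROMPT_STUB.format(**row)


row = {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}
train_text = build_example(row, training=True)
test_text = build_example(row, training=False)
print(train_text)
print(test_text)
```

The column names in `row` have to match the names in curly brackets, which is why the example dataset needs instruction, input and output columns.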

LLM Definition

model:
  hf_model_ckpt: "NousResearch/Llama-2-7b-hf"
  quantize: true
  bitsandbytes:
    load_in_4bit: true
    bnb_4bit_compute_dtype: "bf16"
    bnb_4bit_quant_type: "nf4"

# LoRA Params -------------------
lora:
  task_type: "CAUSAL_LM"
  r: 32
  lora_dropout: 0.1
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - up_proj
    - down_proj
    - gate_proj
  • While the above example showcases using Llama2 7B, in theory, any open-source LLM supported by Hugging Face can be used in this toolkit.
hf_model_ckpt: "mistralai/Mistral-7B-v0.1"
hf_model_ckpt: "tiiuae/falcon-7b"
  • The parameters for LoRA, such as the rank r and dropout, can be altered.
lora:
  r: 64
  lora_dropout: 0.25
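To get a feel for what changing r does, the sketch below counts the parameters LoRA adds to a single weight matrix. It relies only on the standard LoRA formulation (two low-rank factors per adapted matrix); the function name and the 4096 x 4096 matrix size are illustrative assumptions, not values read from the toolkit.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two low-rank factors, A (r x d_in) and B (d_out x r).
    """
    return r * (d_in + d_out)


# Illustrative size only: a square 4096 x 4096 projection.
d = 4096
full = d * d
for r in (32, 64):
    added = lora_param_count(d, d, r)
    print(f"r={r}: {added:,} adapter params vs {full:,} frozen params")
```

Doubling r doubles the adapter size for every module listed under target_modules, so the rank is a direct trade-off between capacity and memory.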

Quality Assurance

qa:
  llm_tests:
    - length_test
    - word_overlap_test
  • To ensure that the fine-tuned LLM behaves as expected, you can add tests that check if the desired behaviour is being attained. Example: for an LLM fine-tuned for a summarization task, we may want to check if the generated summary is indeed smaller in length than the input text. We would also like to learn the overlap between words in the original text and generated summary.
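The two checks described above could look roughly like this. These functions are hypothetical stand-ins written for illustration; they are not the toolkit's length_test or word_overlap_test implementations, and they assume simple whitespace tokenization.

```python
def length_test(source: str, summary: str) -> bool:
    """Pass when the summary has fewer characters than the source text."""
    return len(summary) < len(source)


def word_overlap(source: str, summary: str) -> float:
    """Fraction of distinct summary words that also occur in the source."""
    source_words = set(source.lower().split())
    summary_words = set(summary.lower().split())
    if not summary_words:
        return 0.0
    return len(summary_words & source_words) / len(summary_words)


source = "the quick brown fox jumps over the lazy dog near the river"
summary = "a fox jumps over a dog"
print(length_test(source, summary))
print(word_overlap(source, summary))
```

A boolean test like the first one can act as a pass/fail gate, while a score like the second is more useful when compared across experiments.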

Artifact Outputs

This config will run fine-tuning and save the results under directory ./experiment/[unique_hash]. Each unique configuration will generate a unique hash, so that our tool can automatically pick up where it left off. For example, if you need to exit in the middle of the training, by relaunching the script, the program will automatically load the existing dataset that has been generated under the directory, instead of doing it all over again.
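One way such a per-configuration hash could be derived is shown below. This is a sketch under assumptions: `config_hash` and `experiment_dir` are hypothetical names, and the choice of SHA-256 over a key-sorted JSON dump is mine, not necessarily what the toolkit does.

```python
import hashlib
import json
from pathlib import Path


def config_hash(config: dict, length: int = 8) -> str:
    """Return a short, deterministic digest of a config dictionary."""
    # sort_keys makes the digest independent of key order in the file.
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def experiment_dir(config: dict, root: str = "./experiment") -> Path:
    """Directory where artifacts for this exact config would be stored."""
    return Path(root) / config_hash(config)


cfg_a = {"lora": {"r": 32, "lora_dropout": 0.1}}
cfg_b = {"lora": {"lora_dropout": 0.1, "r": 32}}  # same settings, new order
cfg_c = {"lora": {"r": 64, "lora_dropout": 0.1}}
print(experiment_dir(cfg_a))
print(experiment_dir(cfg_a) == experiment_dir(cfg_b))
print(experiment_dir(cfg_a) == experiment_dir(cfg_c))
```

Because the same settings always map to the same directory, a relaunched run can check for existing artifacts there before recomputing them.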

After the script finishes running you will see these distinct artifacts:

/dataset # generated pkl file in hf datasets format
/model # peft model weights in hf format
/results # csv of prompt, ground truth, and predicted values
/qa # csv of test results: e.g. vector similarity between ground truth and prediction

Once all the changes have been incorporated in the YAML file, you can simply use it to run a custom fine-tuning experiment!

python toolkit.py --config-path <path to custom YAML file>

Advanced

Fine-tuning workflows typically involve running ablation studies across various LLMs, prompt designs and optimization techniques. The configuration file can be altered to support running ablation studies.

  • Specify different prompt templates to experiment with while fine-tuning.
data:
  file_type: "huggingface"
  path: "yahma/alpaca-cleaned"
  prompt:
    - >-
      This is the first prompt template to iterate over
      ### Input: {input}
      ### Output:
    - >-
      This is the second prompt template
      ### Instruction: {instruction}
      ### Input: {input}
      ### Output:
  prompt_stub: >-
    {output}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42
  • Specify various LLMs that you would like to experiment with.
model:
  hf_model_ckpt:
    [
      "NousResearch/Llama-2-7b-hf",
      "mistralai/Mistral-7B-v0.1",
      "tiiuae/falcon-7b",
    ]
  quantize: true
  bitsandbytes:
    load_in_4bit: true
    bnb_4bit_compute_dtype: "bf16"
    bnb_4bit_quant_type: "nf4"
  • Specify different configurations of LoRA that you would like to ablate over.
lora:
  r: [16, 32, 64]
  lora_dropout: [0.25, 0.50]
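List-valued settings like these imply one run per combination. The sketch below shows how such a grid could be expanded with a Cartesian product; `expand_grid` is a hypothetical helper, and a real implementation would also need to tell ablation lists apart from settings that are lists by nature, such as target_modules.

```python
from itertools import product


def expand_grid(section: dict) -> list:
    """Expand list-valued settings into one dict per combination."""
    keys = list(section)
    # Wrap scalars in a list so every key contributes at least one value.
    values = [v if isinstance(v, list) else [v] for v in section.values()]
    return [dict(zip(keys, combo)) for combo in product(*values)]


lora = {"task_type": "CAUSAL_LM", "r": [16, 32, 64], "lora_dropout": [0.25, 0.50]}
runs = expand_grid(lora)
print(len(runs))
for run in runs:
    print(run)
```

Three ranks and two dropout values give six LoRA configurations. Combining them with several prompt templates and models multiplies the number of runs further, so it is worth estimating the total before launching.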

Extending

The toolkit provides a modular and extensible architecture that allows developers to customize and enhance its functionality to suit their specific needs. Each component of the toolkit, such as data ingestion, fine-tuning, inference, and quality assurance testing, is designed to be easily extendable.

Contributing

Open-source contributions to this toolkit are welcome and encouraged. If you would like to contribute, please see CONTRIBUTING.md.

llm-finetuning-hub's People

Contributors

akashsaravanan-georgian, balmasi, benjaminye, georgianpoole, mariia-georgian, rohitsaha, sinclairhudson, truskovskiyk, viveksingh-ctrl, viveksinghds

llm-finetuning-hub's Issues

[Workflow] Automatically run `black`, `flake8`, `isort` via Github Action

Is your feature request related to a problem? Please describe.

  • Automatic formatting and linting to improve code consistency

Describe the solution you'd like

  • Have pre-commit hooks that run before commits
  • Example

Describe alternatives you've considered

  • I've been manually running black from time to time on the whole repo, but not the best solution for collaboration

[RichUI] Better Dataset Generation Display

Is your feature request related to a problem? Please describe.

  • Dataset creation table display always display all columns of dataset, instead of ones needed by prompt and prompt_stub
  • Dataset creation table display highlighting uses string matching, leading to weird outputs when there are overlaps

Describe the solution you'd like

  • Fix these issues!

question about fine tuning falcon

Hello
I ran the falcon classification task using the following command:
!python falcon_classification.py --lora_r 64 --epochs 1 --dropout 0.1 # finetune Falcon-7B on newsgroup classification dataset
Upon inspecting the model, I find that many of the layers are full rank and not the lower rank

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("experiments/classification-sampleFraction-0.99_epochs-1_rank-64_dropout-0.1/assets")
 

Here is a screenshot showing this

Screenshot 2023-10-24 at 8 41 42 PM

Is this expected behavior ?

quickstart basic - missing qa/llm_tests:?

Ran:
llmtune generate config
llmtune run ./config.yml

Things worked well (once I fixed my mistake with Mistral/huggingface repo permissions). The job ran very fast and put results into the "experiment" directory. But the experiment/XXX/results/ directory only has a "results.csv" file in it. I expected there to be results from the qa/llm_tests section in the config.yml file, which looks like this:
qa:
  llm_tests:
    - jaccard_similarity
    - dot_product
    - rouge_score
    - word_overlap
    - verb_percent
    - adjective_percent
    - noun_percent
    - summary_length

Do I have to do something extra to get the qa to run?

example config file to run inference only on fine-tuned model

Is it possible to provide a config file that shows how to run inference on an already fine-tuned model?

I have run the starter config, and it looks like the final PEFT model weights are in experiment/XXX/weights/.

So how do I re-run inference only (and possibly qa checks) on that model?

[LoRA] Use Validation Set

If I have:

  • test_split: 0.1
  • train_split: 0.8

Maybe we can use the remainder, calc_val_split = 1 - 0.1 - 0.8 = 0.1, as the validation split. Maybe also apply something like max(calc_val_split, 0.05) to prevent the val split from being too big.

Inferencing script not executable due to package dependency errors

I tried to follow the README as written, and once I execute llama2_baseline_inference.py I get the error

ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or pip install bitsandbytes`

even though the packages are installed in my environment. I was able to circumvent this problem by upgrading my datasets library using pip install -U datasets and now I received one other error as given below in this link

To avoid this issue, I downgraded my transformers library to 4.3 and currently, I am unable to download some of the checkpoints. I feel the packages need to be revamped with the latest versions

Allow custom train/test datasets

Is your feature request related to a problem? Please describe.
I'm working on a problem that requires me to split my data in a specific way (based on dates). Right now the config only allows a single dataset to be provided, and it internally does a train-test split based on the values provided for the test_size and train_size parameters.

Describe the solution you'd like
Ideally, an option to specify paths to both train and test data.

Describe alternatives you've considered
The alternative would be to add in support for other types of data splitting which I don't think makes sense for this repo to include.
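For illustration, the kind of date-based split described in this request could be prepared like this before the data reaches any training tool. `split_by_date` and the row layout are hypothetical, and the sketch says nothing about how the toolkit would consume the two resulting sets.

```python
from datetime import date


def split_by_date(rows: list, cutoff: date, key: str = "date"):
    """Put rows dated before `cutoff` in train and the rest in test."""
    train = [r for r in rows if r[key] < cutoff]
    test = [r for r in rows if r[key] >= cutoff]
    return train, test


rows = [
    {"date": date(2023, 1, 5), "text": "a"},
    {"date": date(2023, 6, 1), "text": "b"},
    {"date": date(2024, 2, 9), "text": "c"},
]
train, test = split_by_date(rows, cutoff=date(2024, 1, 1))
print(len(train), len(test))
```

Unlike a random split controlled by test_size and train_size, this keeps every test example later in time than every training example.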

Additional context
None

Add comment to indicate tf32 won't be available for older GPUs

Describe the bug
I'm trying to run this toolkit on colab notebook with T4 GPU and ran into errors. In order to get it working, I needed to turn bf16 and tf32 to false, and fp16 to true. There's already a note for the bf16 and fp16, maybe we can add a note for tf32 as well.
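The flag combination described here follows from the GPU generation: bf16 and tf32 need Ampere-class hardware (CUDA compute capability 8.0 or newer), while the T4 is a 7.5 device. A small sketch of that rule, with a hypothetical function name and flag names taken from this report:

```python
def precision_flags(compute_capability: tuple) -> dict:
    """Choose mixed-precision flags from a CUDA compute capability.

    bf16 and tf32 need Ampere-class hardware (capability 8.0 or newer);
    older GPUs such as the T4 (7.5) fall back to fp16.
    """
    ampere_or_newer = compute_capability >= (8, 0)
    return {
        "bf16": ampere_or_newer,
        "tf32": ampere_or_newer,
        "fp16": not ampere_or_newer,
    }


print(precision_flags((7, 5)))  # T4
print(precision_flags((8, 0)))  # A100
```

With PyTorch installed, the tuple can be obtained from torch.cuda.get_device_capability().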

Training of FlanT5 for summarization

I tried following the same framework used for training the other LLMs (Falcon, Mistral, etc.) with SFTTrainer to train the FlanT5 model as well.
But the results are bad, as if the LLM doesn't learn anything.
Training it with the Seq2Seq method works. Why did you use this method for FlanT5 and SFTTrainer for all the other LLMs?

Allow users to set verbosity of outputs

  • Right now debug outputs and warnings are suppressed in favor of a cleaner UI
  • Users should be able to choose a more verbose output by running something like
llmtune run --verbose
llmtune run -v
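A minimal sketch of how such a flag could map to a log level is shown below. It uses argparse purely for illustration; the parser, the `parse_log_level` helper and the choice of ERROR as the quiet default are assumptions and do not reflect how the llmtune CLI is actually built.

```python
import argparse
import logging


def parse_log_level(argv: list) -> int:
    """Map an optional -v/--verbose flag to a logging level."""
    parser = argparse.ArgumentParser(prog="llmtune-sketch")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="show debug output and warnings")
    args = parser.parse_args(argv)
    # Quiet by default, matching the current cleaner UI.
    return logging.DEBUG if args.verbose else logging.ERROR


for argv in ([], ["-v"], ["--verbose"]):
    level = parse_log_level(argv)
    print(argv, logging.getLevelName(level))
```

The returned level could then be passed to logging.basicConfig(level=...) at startup.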

Trying to access gated repo error, Quickstart Basic

After installation, run:
llmtune generate config
==> works fine
llmtune run ./config.yml
==> get this error

OSError: You are trying to access a gated repo.
Make sure to request access at https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 and pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>.

So then I do:
huggingface-cli login
===> login successfully
llmtune run ./config.yml
==> get same error

Any ideas?

CLI to Generate Example `config.yml`

  • For better usability, instead of requiring users to copy config.yml out of the source repo, we can write a simple script that downloads the file and writes it to the user's current working directory

Quickstart Basic uses a very large model and is slow.

The basic "quickstart" example downloads Mistral-7B-Instruct-v0.2, which is ~15GB, taking me over 20 minutes to download. A smaller model should be used as a quickstart example.
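The reported size is in line with a back-of-the-envelope estimate of parameters times bytes per parameter. The sketch below assumes a round 7e9 parameters stored as 16-bit weights, and a hypothetical 1e9-parameter model for comparison; neither number is taken from a model card.

```python
def checkpoint_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate checkpoint size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption: roughly 7e9 parameters for a "7B" model, 2 bytes each.
print(checkpoint_size_gb(7e9, 2))
# Hypothetical smaller model with 1e9 parameters, same precision.
print(checkpoint_size_gb(1e9, 2))
```

By this estimate a model in the 1B range would cut the download to a few gigabytes.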

To Reproduce
Steps to reproduce the behavior:

  1. Follow the "basic" level of quickstart

The basic version of the quickstart should be, in my opinion, a 10 minute (max) process and not require so much disk space.

Environment:

  • OS: Ubuntu 22.04
  • Packages Installed

`pipx` installation doesn't work

Describe the bug
Having trouble installing with pipx

To Reproduce
Steps to reproduce the behavior:

  1. brew install pipx
  2. pipx install llm-toolkit

Expected behavior
Installs fine

Screenshots
Screenshot 2024-04-08 at 11 35 41 AM

Environment:

  • OS: MacOS

[Dataset] Dataset Generation Always Returns Cached Version

Describe the bug
At dataset creation, the dataset generated will always get the cached version despite change in file.

To Reproduce

  1. Run toolkit.py
  2. Ctrl-C
  3. Add a line in the dataset
  4. toolkit.py will not create a new dataset with desired changes

Expected behavior

  1. Dataset to be generated with new data

Environment:

  • OS: Ubuntu
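One possible way to get the expected behavior is to make the cache key depend on the contents of the data file, so that any edit produces a new key. The sketch below is an assumption about how that could work, not a description of the toolkit's caching; `dataset_cache_key` is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_cache_key(data_path: str, prompt: str) -> str:
    """Digest that changes when the data file or the prompt changes."""
    digest = hashlib.sha256()
    digest.update(Path(data_path).read_bytes())
    digest.update(prompt.encode("utf-8"))
    return digest.hexdigest()[:8]


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "data.csv"
    data_file.write_text("input,output\nhello,world\n")
    key_before = dataset_cache_key(str(data_file), "### Input: {input}")

    # Appending a row should invalidate the cached dataset.
    with data_file.open("a") as fh:
        fh.write("foo,bar\n")
    key_after = dataset_cache_key(str(data_file), "### Input: {input}")

print(key_before != key_after)
```

Hashing file contents costs a full read of the file on every launch; for very large datasets, the modification time and size could serve as a cheaper, less strict signal.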
