foundation-model-stack / fms-hf-tuning
Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
License: Apache License 2.0
(fms-hf-tuning/tuning/trainercontroller/callback.py, lines 198 to 200 in dd29d49)
Some examples of malicious strings that get past current checks are in the unit tests in the corresponding PR that fixes this issue:
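As background, here is a minimal sketch of the kind of whitelist check involved, assuming rules are plain comparison expressions such as loss < 1.0; this is illustrative only and is not the validation implemented in callback.py or in the referenced PR:

import ast

# Illustrative whitelist of expression node types a controller rule may contain.
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.Compare, ast.BinOp, ast.UnaryOp,
    ast.Name, ast.Load, ast.Constant, ast.And, ast.Or, ast.Not,
    ast.Lt, ast.LtE, ast.Gt, ast.GtE, ast.Eq, ast.NotEq,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub,
)

def is_rule_safe(rule: str) -> bool:
    # Reject anything that is not a simple arithmetic/comparison expression,
    # so strings smuggling attribute access or function calls are refused.
    try:
        tree = ast.parse(rule, mode="eval")
    except SyntaxError:
        return False
    return all(isinstance(node, ALLOWED_NODES) for node in ast.walk(tree))

print(is_rule_safe("loss < 1.0"))                     # True
print(is_rule_safe("__import__('os').system('ls')"))  # False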
Most of the custom arguments that are added to the tuning CLI (e.g., for peft_config, defined here) don't have descriptions in the --help output for the sft_trainer script. Given how many options may be used at tuning time, it would be ideal to add descriptions for these arguments to make the script easier to use.
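A minimal sketch of how such descriptions could be attached, assuming the custom arguments are dataclass fields parsed with transformers.HfArgumentParser (as the simplified views of sft_trainer.py later on this page suggest); the field names below are illustrative:

from dataclasses import dataclass, field
from typing import List

from transformers import HfArgumentParser

@dataclass
class LoraArguments:
    # The metadata "help" entries become the descriptions shown by --help.
    r: int = field(default=8, metadata={"help": "LoRA rank of the update matrices."})
    lora_alpha: int = field(default=16, metadata={"help": "LoRA scaling factor."})
    target_modules: List[str] = field(
        default_factory=lambda: ["q_proj", "v_proj"],
        metadata={"help": "Module names to apply the LoRA adapter to."},
    )

if __name__ == "__main__":
    parser = HfArgumentParser(LoraArguments)
    (lora_args,) = parser.parse_args_into_dataclasses()
    print(lora_args)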
Add unit tests for the tuning/aim_loader.py module.
If possible, each function should have test cases in the file tests/tuning/test_aim_loader.py and pass tests with the command pytest tests/tuning/test_aim_loader.py.
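A minimal pytest sketch of what such tests could look like; the helper name get_aimstack_callback is an assumption and may differ from the real API of tuning/aim_loader.py:

# tests/tuning/test_aim_loader.py -- hedged sketch, adjust names to the real module API.
import pytest

from tuning import aim_loader

def test_module_importable():
    # The module should import cleanly even when no AIM server is configured.
    assert aim_loader is not None

def test_get_aimstack_callback_returns_callback():
    # Assumes a get_aimstack_callback() helper exists; skip if the API differs.
    if not hasattr(aim_loader, "get_aimstack_callback"):
        pytest.skip("aim_loader does not expose get_aimstack_callback")
    assert aim_loader.get_aimstack_callback() is not None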
When the Dockerfile is updated, there is no guarantee that a tuning job will still launch with the PyTorchJobs.
For example: one might assume the packaging package is not needed and remove it from the Dockerfile. accelerate_launch.py may then fail with:
Traceback (most recent call last):
  File "/app/accelerate_launch.py", line 25, in <module>
    from accelerate.commands.launch import launch_command
  File "/usr/local/lib/python3.11/site-packages/accelerate/__init__.py", line 16, in <module>
    from .accelerator import Accelerator
  File "/usr/local/lib/python3.11/site-packages/accelerate/accelerator.py", line 35, in <module>
    from .checkpointing import load_accelerator_state, load_custom_state, save_accelerator_state, save_custom_state
  File "/usr/local/lib/python3.11/site-packages/accelerate/checkpointing.py", line 24, in <module>
    from .utils import (
  File "/usr/local/lib/python3.11/site-packages/accelerate/utils/__init__.py", line 29, in <module>
    from .dataclasses import (
  File "/usr/local/lib/python3.11/site-packages/accelerate/utils/dataclasses.py", line 34, in <module>
    from .environment import str_to_bool
  File "/usr/local/lib/python3.11/site-packages/accelerate/utils/environment.py", line 27, in <module>
    from packaging.version import parse
ModuleNotFoundError: No module named 'packaging'
Create a GH action to test SFT trainer image with every PR. For example: https://gist.github.com/tedhtchang/34335240230d64c6e2ecb5b78d8a8cb0
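A minimal sketch of a smoke test such an action could run inside the built image, to catch a dropped dependency (like the missing packaging module above) before a tuning job is launched; the module list is illustrative:

import importlib

# Illustrative list of modules the image must provide for accelerate_launch.py to start.
REQUIRED_MODULES = ["packaging", "accelerate", "transformers", "peft"]

def test_required_modules_import():
    for name in REQUIRED_MODULES:
        importlib.import_module(name)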
The build output is produced in the build/lib/ directory. There is also an auto-generated file, tuning/_version.py.
Please provide details about the environment you are using, including the following:
v0.0.2rc1
main branch commit 8548a6df86e0a3ea00ece11a47c3aac2971a512e
tox -e build
Build output should not be checked into version control.
Build output shows up when you do git status; the status is not clean.
The auto-generated file tuning/_version.py contains:
# file generated by setuptools_scm
# don't change, don't track in version control
TYPE_CHECKING = False
if TYPE_CHECKING:
    from typing import Tuple, Union

    VERSION_TUPLE = Tuple[Union[int, str], ...]
else:
    VERSION_TUPLE = object

version: str
__version__: str
__version_tuple__: VERSION_TUPLE
version_tuple: VERSION_TUPLE

__version__ = version = '0.0.2rc2.dev1+gbf0c857'
__version_tuple__ = version_tuple = (0, 0, 2, 'dev1', 'gbf0c857')
Description
Currently the repository has only train(). It would be good to add support for load and inference so tuned models can be validated with the same repository.
Acceptance criteria
Setting logging_steps greater than one results in a TypeError when evaluating a trainer controller rule. The following trainer controller configuration was used when the bug was detected:
controller-metrics:
  - name: loss
    class: Loss
controllers:
  - name: loss-controller
    triggers:
      - on_log
    rule: loss < 1.0
    operations:
      - hfcontrols.should_training_stop
Ran the ./tuning/sft_trainer.py script with the below training configuration on a CCC cluster with 1 GPU (Tesla V100-SXM2-32GB). The below configuration for fms-hf-tuning causes the crash:
python ./tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--tokenizer_name_or_path $MODEL_PATH \
--training_data_path $DATA_PATH \
--output_dir $OUTPUT_PATH \
--validation_data_path $VALIDATION_PATH \
--num_train_epochs 10 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 5 \
--logging_strategy "steps" \
--include_tokens_per_second \
--packing False \
--metric_for_best_model "loss" \
--load_best_model_at_end True \
--use_flash_attn False \
--trainer_controller_config_file "examples/trainercontroller_configs/loss.yaml" \
--response_template "\n### Response:" \
--dataset_text_field "output"
The rule should evaluate to False and should not throw any exception.
File "/seshapad/fms-hf-tuning/tuning/trainercontroller/callback.py", line 235, in _take_control_actions
raise TypeError("Rule failed due to incorrect type usage") from et
TypeError: Rule failed due to incorrect type usage
None.
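A rough sketch of a defensive guard around rule evaluation, under the assumption that the fix is to treat a not-yet-computed metric as "rule not triggered"; the function name and shape are illustrative, not the callback's actual API:

def safe_evaluate_rule(rule_evaluator, metrics: dict) -> bool:
    # If a metric referenced by the rule has not been computed yet (e.g. no loss
    # was logged yet because logging_steps > 1), do not trigger any operation.
    if any(value is None for value in metrics.values()):
        return False
    try:
        return bool(rule_evaluator(metrics))
    except TypeError:
        # Incomplete or incorrectly typed metrics are treated as "not triggered".
        return False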
There's a problem with the current requirements.txt method as it pulls in too many requirements for CI/CD. We want to enable users to install the minimal set of dependencies needed for the repository to function and to perform tuning techniques. Other dependencies should be optional.
This issue is to explore which optional-dependency framework is best to ensure the install is lightweight.
Tasks include:
This feature is an add-on to the trainer controller framework. We add a new metric to this framework to expose the trainer state of the trainer loop.
This metric now enables rules like the below example, wherein we could check the epochs elapsed in order to stop training on a loss threshold.
controller-metrics:
  - name: state
    class: StateOfTrainer
  - name: loss
    class: Loss
controllers:
  - name: loss-controller
    triggers:
      - on_log
    rule: loss < 2.2 and state['epoch'] > 5
    operations:
      - hfcontrols.should_training_stop
We checked if the trainer state is a call-by-reference object which could be obtained once and referred to many times, but this is not how the trainer loop exposes the trainer state. The only way we found the trainer state to be exposed is through the arguments of the trainer callback events.
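A rough sketch of how such a metric could snapshot the trainer state from the callback event arguments; the class name matches the example config above, but the base class and method shape are assumptions rather than the framework's actual metric API:

from transformers import TrainerCallback

class StateOfTrainer(TrainerCallback):
    """Illustrative metric that snapshots the trainer state on every log event."""

    def __init__(self, name: str = "state"):
        self.name = name
        self.value = {}

    def on_log(self, args, state, control, logs=None, **kwargs):
        # The trainer state is only reachable through event arguments like these,
        # so it is re-captured on each event rather than held by reference.
        self.value = {"epoch": state.epoch, "global_step": state.global_step}
        return control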
NA
Description
To keep the code in good health and retain styling conventions, it is good practice to integrate tools such as the Black formatter into the repo: https://black.readthedocs.io/en/stable/
Acceptance criteria
Trainer control operations have no access to trainer control metrics.
Trainer controller metrics should be passed to the operations to allow for their usage in performing operating tasks such as logging.
NA
NA
We should enable low_cpu_mem_usage to avoid loading 8 copies of the model into CPU. This should improve memory usage as well as loading time.
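A minimal sketch of the change, assuming the model is loaded with transformers' from_pretrained; the model id is illustrative:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3b-code-base",  # illustrative model id
    low_cpu_mem_usage=True,              # materialize weights lazily instead of one full CPU copy per rank
    torch_dtype=torch.bfloat16,
)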
Description
The repo is in need of unit tests so we can start enforcing tests to be contributed with every code contribution. In this issue we can add a simple unit test framework to the repository and some simple tests to get started. We also need to enable tests to run as part of every PR build.
Acceptance criteria
Is your feature request related to a problem? Please describe.
To be more in line with HF and have clearer input arguments, there are a few flags that can be updated, but doing so will cause a breaking change for users. These are listed as TODOs in fms-hf-tuning:
model_max_length should be updated to max_seq_length
data_path should be updated to training_data_path, since we also have validation_data_path

Fix bug in CLI parsing for target_modules
Running python tuning/sft_trainer.py --target_modules "c_attn" "c_proj", which accepts a string or a list and is the correct way to pass in a list in Bash/Shell, produces a parsing error even though it actually runs successfully. The value gets interpreted correctly: I printed out the LoraConfig and you can see the correct value, but at the end I get the error ERROR: Could not consume arg: c_proj.
On looking this up quickly, it appears to have to do with the argument parsing in fire.Fire(main).
Example in detail:
# command run
$ python tuning/sft_trainer.py --model_name_or_path $MODEL_PATH --data_path $DATA_PATH --output_dir $OUTPUT_PATH --num_train_epochs 80 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 4 --save_strategy "epoch" --learning_rate 1e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --include_tokens_per_second --packing False --response_template " Label:" --dataset_text_field "output" --use_flash_attn True --tokenizer_name_or_path $MODEL_PATH --torch_dtype bfloat16 --peft_method "lora" --logging_strategy "epoch" --r 16 --lora_dropout 0.05 --lora_alpha 32 --target_modules "c_attn" "c_proj"
# print statement added of LoraConfig from argument parsing
LoraConfig(r=16, lora_alpha=32, target_modules=['c_attn', 'c_proj'], lora_dropout=0.05)
# print statement added of LoraConfig interpreted by SFTTrainer
LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='CAUSAL_LM', inference_mode=False, r=16, target_modules={'c_proj', 'c_attn'}, lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})
# after tuning runs successfully, see error message
{'train_runtime': 228.1836, 'train_samples_per_second': 17.53, 'train_steps_per_second': 1.052, 'train_tokens_per_second': 760.09, 'train_loss': 0.3890522326032321, 'epoch': 73.85}
100%|██████████| 240/240 [03:47<00:00, 1.06it/s]
ERROR: Could not consume arg: c_proj
Able to pass in list via CLI and there is no error.
When running LoRA tuning the reported loss is zero; the variable max_seq_length is set to the default value of 4096.
### Label: no complaint<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|> This instance will be ignored in loss calculation. Note, if this happens often, consider increasing the `max_seq_length`.
### Label: complaint This instance will be ignored in loss calculation. Note, if this happens often, consider increasing the `max_seq_length`.
{'loss': 0.0, 'learning_rate': 6.716876096514944e-06, 'epoch': 2.04}
...
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 5.0}
Use pip install in the Dockerfile once wheel releases are set up, instead of cloning the repo while building the image.
Currently when doing full parameter tuning (peft_config=None), the model can be fine-tuned and the ckpt can be saved; however, the ckpt cannot be re-loaded back.
Root cause has been discovered with offline discussions, and most specifically:
1. PeftSavingCallback is not necessary for saving a full-parameter-tuning model. This callback is designed to separately save an adapter-only model, and the rationale behind it was that Trainer by itself will save everything in the root folder, but for PEFT we want a clean adapter-only folder, thus we duplicately save some of them in a separate place.
2. The callback should not call save_pretrained directly, as it will drop some shared tensors (e.g. lm_head) during saving. The better way to do it would be save_pretrained(..., state_dict=state_dict), explicitly passing the full state dict, the same way Trainer natively saves it.
So, due to 2, the ckpt is missing some tensors and thus cannot be loaded back, and due to 1, this callback-generated broken ckpt is either being used or overwrites the original ckpt.
The solution would be to revise the callback to skip doing anything on full parameter tuning.
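A rough sketch of that revision, assuming the callback can see whether a PEFT config is in use; the constructor and method shape are illustrative, not the repository's exact callback:

from transformers import TrainerCallback

class PeftSavingCallback(TrainerCallback):
    def __init__(self, peft_config=None):
        self.peft_config = peft_config

    def on_save(self, args, state, control, **kwargs):
        if self.peft_config is None:
            # Full-parameter tuning: Trainer already wrote a complete checkpoint,
            # so do not overwrite it with a partial state dict.
            return control
        # ... existing adapter-only saving logic for PEFT runs ...
        return control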
The current code is using an older API for the attention type. We need to move this to the newer version:
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
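A minimal sketch of the newer call, with an illustrative model id:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3b-code-base",        # illustrative model id
    attn_implementation="flash_attention_2",   # replaces use_flash_attention_2=True
    torch_dtype=torch.bfloat16,
)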
The issue is to discuss and document a design for a framework to include custom acceleration tools into sft_trainer.py that improve training metrics such as GPU memory consumption, training speed, etc.
Add unit tests for the tuning/sft_trainer.py module.
If possible, each function should have test cases in the file tests/test_sft_trainer.py and pass tests with the command pytest tests/test_sft_trainer.py.
Issue
As a user that is running a large finetuning campaign of runs I would like to annotate each individual run with metadata. This way I'll be able to query my AIM server for my finetuning experiments in the future.
Done when
an aim_metadata argument is added to the train() method for optional metadata that fms-hf-tuning propagates to AIM
a json file with aim metadata can be supplied

Evaluation-metrics-based stopping criteria for training jobs are not available yet.
The evaluation metrics can be customized in Hugging Face. Using these metrics within the stopping criteria rules of the trainer controller framework will help stop training if the training is not making progress.
Not applicable
Not applicable
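For reference, a minimal sketch of evaluation-metric-based early stopping with the stock transformers callback; this shows only the underlying HF mechanism that a trainer-controller rule would wrap, not the requested integration:

from transformers import EarlyStoppingCallback, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
)
# Stop if eval_loss fails to improve for 3 consecutive evaluations;
# pass callbacks=[early_stop] to the Trainer/SFTTrainer constructor.
early_stop = EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.0)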
This issue started happening sometime in the last couple of weeks and the observed behaviour is not deterministic.
When using 2 GPUs, sometimes both GPUs are at 100% utilization, the script runs and terminates successfully in the expected time (e.g. for an artificial dataset within about 4 minutes). Other times, one GPU is constantly at 100% utilization, the other at 0%, and the script never terminates or encounters a Segmentation Fault which kills the python interpreter. A third scenario has both GPUs at 100% utilization and the script never completes or encounters a Segmentation Fault.
Please provide details about the environment you are using, including the following:
export CUDA_VISIBLE_DEVICES=0,1
model="hf-tiny-model-private/tiny-random-BloomForCausalLM"
python tuning/sft_trainer.py --output_dir tuned --model_name_or_path ${model} --use_flash_attn=false --max_steps 100 \
--per_device_train_batch_size 1 \
--evaluation_strategy no --response_template "\n### Response" --dataset_text_field output --tokenizer_name_or_path ${model} \
--torch_dtype float32 --data_path common_en_news_combined_512-preprocessed.jsonl
The script terminates successfully with a reasonable execution time.
Using more than 1 GPU causes random stalls; there are no exceptions printed.
There's only one warning that I can see:
/tmp/ray/session_2024-03-06_12-02-23_662564_1/runtime_resources/pip/a1249035b2784b570db2ef4c0bc247b20dd7f217/virtualenv/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
I've switched on debug logs and I can see the AIMStack observer printing debug logs to the terminal but there are no printouts from HF/Torch about processing steps. The job does not terminate even after running for 17 hours for an experiment which is expected to last about 4 minutes.
Occasionally I see this exception:
RuntimeError: NCCL Error 3: internal error - please report this issue to the NCCL developers
and when I do, the script terminates.
I also sometimes see segmentation faults; next time I run into one I'll make a note and add it to this issue.
I'm launching the equivalent of the above script but as a ray remote method decorated with @ray.remote(num_gpus=2). I suspect that one or more dependency packages have changed sometime in the last couple of weeks and these changes are causing the problems I'm facing.
Add unit tests for the tuning/config/configs.py module.
If possible, each function should have test cases in the file tests/config/test_configs.py and pass tests with the command pytest tests/config/test_configs.py.
How to run, enable, and disable linting.
Document in a section in the contributors.md doc why, when, and how to do linting
We need issue templates for:
You can refer to templates here https://github.com/caikit/caikit-nlp/tree/main/.github/ISSUE_TEMPLATE and modify as needed
Currently, not setting --peft_method defaults to prompt tuning. However, intuitively, not passing --peft_method means not using any parameter-efficient method like LoRA or prompt tuning. Therefore, it is better to fall back to full parameter tuning rather than prompt tuning.
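A minimal sketch of the proposed fallback; the helper name and config objects are illustrative:

def resolve_peft_config(peft_method, lora_config=None, prompt_tuning_config=None):
    # No --peft_method given: fall back to full-parameter fine tuning (no PEFT config).
    if peft_method is None or peft_method == "":
        return None
    if peft_method == "lora":
        return lora_config
    if peft_method == "pt":
        return prompt_tuning_config
    raise ValueError(f"Unknown peft_method: {peft_method}")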
Currently, the training launch script used in the Dockerfile for integration with the rest of the stack only supports python jobs for single-GPU training. This issue is to update the script to use accelerate for launching multi-GPU tuning.
Current format
controller-metrics:
  loss:
    Loss:
  - name: sparsity
    class: Sparsity
    args:
      key3: val3
      key4: val4
New format
controller-metrics:
  - name: loss-one
    class: Loss
    arguments:
      key-num-1: val1
      key-num-2: val2
  - name: sparsity-one
    class: Sparsity
    arguments:
      key-num-3: val3
      key-num-4: val4
operations:
  - name: op1
    class: MyOperation1
controllers:
  - name: loss-controller
    triggers:
      - on_log
    rule: loss-one < 1.0
    operations:
      - hfcontrols.should_training_stop
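A small sketch of how the new list-based format could be consumed, assuming the config above is saved to a YAML file; the registry lookup is left as a comment since the actual metric classes live in the trainer controller framework:

import yaml

with open("trainer_controller_config.yaml") as f:   # illustrative file name
    config = yaml.safe_load(f)

for metric in config.get("controller-metrics", []):
    name = metric["name"]                 # e.g. "loss-one"
    cls_name = metric["class"]            # e.g. "Loss"
    kwargs = metric.get("arguments", {})  # e.g. {"key-num-1": "val1", ...}
    # look up cls_name in a metrics registry and instantiate it with **kwargs
    print(name, cls_name, kwargs)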
Need to build the python wheel so that pip install fms-hf-tuning will easily install and have the proper semver versioning.
Tasks include:
The Loss metric does not have information on the step and epoch at which it was generated.
Return entire log-line in loss.py metric
NA
NA
@hickeyma @Ssukriti @tedhtchang
The platform is built upon the torch and huggingface ecosystem, where constant fixes are pushed with every release.
Use pyproject.toml, which can be used to also address #56, #57.
For comparison, the node flow (see here):
npm install will occur from package.json with semver versioning that tells what kind of upgrades (e.g., minor, patch) are allowed.
During npm install the packages will be resolved to install the latest packages that satisfy the constraints; the resolution is recorded in package-lock.json.
package-lock.json is typically checked in to have a record of what exact package versions are working.
Testing is done with npm ci; note that this differs from npm install, in that it installs from package-lock.json. Thus there is no package resolution during testing. This ensures that if the tests pass, we know exactly what package versions they pass for.
Use dependabot to automatically check for new versions, and then raise a PR to upgrade the package version.
Two possibilities:
poetry.lock, which works together with pyproject.toml. The poetry flow as described here is completely analogous to the npm flow described above.
pip-compile to generate requirements.txt from pyproject.toml, checking that in as the lock file.

Description
Currently, train() only accepts a JSON file for the train dataset. We also need to add support for a validation dataset to enable early stopping and possibly other features in the future.
Acceptance Criteria
Issue
The __init__() method of the transformers SFTTrainer class raises an exception when using the latest commit of fms-hf-tuning (f4e8eb4) and transformers>=4.38.0.
The constraint in the requirements.txt file is transformers>=4.34.1. Either update the fms-hf-tuning source code to be compatible with transformers>=4.38.0, or pin requirements.txt so that it uses transformers<4.38.0.
Done when
commit: f4e8eb4
model="hf-tiny-model-private/tiny-random-BloomForCausalLM"
python tuning/sft_trainer.py --output_dir tuned --model_name_or_path ${model} --use_flash_attn=false --num_train_epochs 5 \
--evaluation_strategy no --response_template "\n### Response" --dataset_text_field output --tokenizer_name_or_path ${model} \
--torch_dtype float32 --data_path common_en_news_combined_512-preprocessed.jsonl \
--target_modules all-linear --peft_method lora
Exception:
Traceback (most recent call last):
File "/projects/fms-hf-tuning/tuning/sft_trainer.py", line 308, in <module>
fire.Fire(main)
File "/projects/fms-hf-tuning/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/projects/fms-hf-tuning/venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/projects/fms-hf-tuning/venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/projects/fms-hf-tuning/tuning/sft_trainer.py", line 304, in main
train(model_args, data_args, training_args, tune_config)
File "/projects/fms-hf-tuning/tuning/sft_trainer.py", line 252, in train
trainer = SFTTrainer(
File "/projects/fms-hf-tuning/venv/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 295, in __init__
super().__init__(
File "/projects/fms-hf-tuning/venv/lib/python3.10/site-packages/transformers/trainer.py", line 648, in __init__
self.is_fsdp_xla_v2_enabled = args.fsdp_config["xla_fsdp_v2"]
KeyError: 'xla_fsdp_v2'
Prompt Tuning model generates low-quality output
Please provide details about the environment you are using, including the following:
My launching script is
export MODEL_PATH=/dccstor/ai4code-ansible/shared/ckpt/granite-20b-code-all-yaml-2k/
export DATA_PATH=/dccstor/weiz/peft/train.json
export OUTPUT_PATH=pt_ckpts
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=.
python tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $DATA_PATH \
--output_dir $OUTPUT_PATH \
--peft_method "pt" \
--tokenizer_name_or_path $MODEL_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--learning_rate 3e-2 \
--weight_decay 0. \
--warmup_ratio 0.0 \
--lr_scheduler_type "linear" \
--logging_steps 1 \
--include_tokens_per_second \
--packing False \
--response_template "\n### Response:" \
--dataset_text_field "output" \
--use_flash_attn True \
--torch_dtype "bfloat16" \
--num_virtual_tokens 100 \
--max_seq_length 512 \
--prompt_tuning_init RANDOM
To reproduce, one needs to set MODEL_PATH to the yaml-2k path (which is shared via a COS bucket) and DATA_PATH to the training data.
Using the first training data item as the input to the model should show something close to the ground truth, i.e.,
community.docker.docker_login:
  registry: "{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}"
  username: "{{ registry_username }}"
  password: "{{ registry_password }}"
When running through inference, I got
.,.,.,.,.,.,.,.,.,.,........,.,.,.,.,...,.,.....,..,...,......,:.: -.:.:.:.:.:.:.............,........,..........,.:..........,.........,::::::::::::::::
::::::......:..............,.............,..............,.,..:.......:.........2.2......................
The SFT training loss has looked absolutely normal though.
I was able to adapt the HF PT tutorial at https://huggingface.co/docs/peft/main/en/task_guides/clm-prompt-tuning to run our task to get reasonable results.
Issue
I would like to use fms-hf-tuning to collect system-level metrics while finetuning models; these include model load time, as well as device-related metrics (like those that AIM collects).
One way would be to rely on just the measurements that AIM collects by pointing fms-hf-tuning to an AIM server and then contacting AIM to retrieve the data. However, this is a bit restricting in that we cannot collect custom data, collect system metrics at a period other than the one that AIM is using (30 seconds), or collect data at all if we don't spin up an AIM server.
A more convenient solution would be to allow providing an optional parameter to train() which can contain a list of callbacks
(fms-hf-tuning/tuning/sft_trainer.py, lines 29 to 34 in fc07060)
In the same spirit I'd like to get access to the TrainingOutput object that sft_trainer.train() returns here (input_tokens_per_second, train_runtime, etc.): fms-hf-tuning/tuning/sft_trainer.py, line 168 in fc07060 (just by returning the output of trainer.train() as the output of train()).
Done when
the output of trainer.train() is returned to the caller of train()
When running benchmarks using fms-hf-tuning we need to:
However, fms-hf-tuning does not support these features, hence requiring maintenance of external forks of fms-hf-tuning that provide them. Specifically:
Here is a simplified view of the current state of the sft_trainer.py script:
def train(
    model_args,
    data_args,
    training_args,
    tune_config,
    trainer_controller_args,
):
    callbacks = [SomeCallback()]
    if aim_available:
        callbacks.append(get_aimstack_callback())
    callbacks.extend(other_callbacks)
    sft_trainer.train(callbacks=callbacks)


def main():
    (
        model_args,
        data_args,
        training_args,
        tune_config,
        trainer_controller_args,
    ) = parser.parse_args_into_dataclasses(return_remaining_strings=True)
    train(
        model_args,
        data_args,
        training_args,
        tune_config,
        trainer_controller_args,
    )
We propose the following
class BenchmarkingFriendlyAIMCallback(AimCallBack):
    def on_train_begin(self, args, state, control, model=None, **kwargs):
        super().on_train_begin(args, state, control, model, **kwargs)
        # annotate self.run with custom metadata so that we can search for this run on AIM

    def on_train_end(self, args, state, control, **kwargs):
        if state.is_world_process_zero:
            # Dump aim metrics to a file such that we can collect this data without
            # actually spinning up an AIM server
            ...
        super().on_train_end(args=args, state=state, control=control, **kwargs)


def parse_args():
    return (
        model_args,
        data_args,
        training_args,
        tune_config,
        trainer_controller_args,
        benchmarking_args,
    )


def train(
    model_args,
    data_args,
    training_args,
    tune_config,
    trainer_controller_args,
    callbacks,
):
    callbacks.extend(other_callbacks)
    sft_trainer.train(callbacks=callbacks)


def main():
    try:
        (
            model_args,
            data_args,
            training_args,
            tune_config,
            trainer_controller_args,
            benchmarking_args,
        ) = parse_args()
        aim_metadata = read_metadata(benchmarking_args.aim_metadata_path)
        callbacks = []
        if aim_is_available():
            callbacks.append(
                # dumps AIM data to disk plus annotates AIM run with custom metadata
                BenchmarkingFriendlyAIMCallback(
                    custom_metadata=aim_metadata,
                    output_file=benchmarking_args.aim_output_file,
                )
            )
        train(
            model_args,
            data_args,
            training_args,
            tune_config,
            trainer_controller_args,
            callbacks=callbacks,  # new
        )
    # Catch GPU OOM exceptions, NCCL exceptions etc, and store them in a file for easy processing
    except GPUOutOfMemoryError as exc:
        report_gpu_oom(benchmarking_args.aim_output_file, exc)
        raise
    except Exception as exc:
        report_unknown_error(benchmarking_args.aim_output_file, exc)
        raise
Pros:
Cons:
As a plus, the proposed changes in train() enable us to experiment with new implementations of BenchmarkingFriendlyAIMCallback() without requiring updates to fms-hf-tuning (e.g. by using a wrapper script).
We could slightly refactor the code such that the logic in the train() method for activating the current AIMCallback (via a call to get_aimstack_callback()) is in the main() method instead of the train() method.
This would enable us to use a wrapper script which implements our earlier proposed design of sft_trainer.py. The wrapper script would have a main() function similar to the above proposal which would invoke tuning.train().
sft_trainer.py in fms-hf-tuning would look like this:
def parse_args():
    return (
        model_args,
        data_args,
        training_args,
        tune_config,
        trainer_controller_args,
    )


def train(
    model_args,
    data_args,
    training_args,
    tune_config,
    trainer_controller_args,
    callbacks,
):
    callbacks.extend(other_callbacks)
    sft_trainer.train(callbacks=callbacks)


def main():
    (
        model_args,
        data_args,
        training_args,
        tune_config,
        trainer_controller_args,
    ) = parse_args()
    # Move the insertion of the AIM Callback from train() to main()
    callbacks = []
    if aim_available:
        callbacks.append(get_aimstack_callback())
    train(
        model_args,
        data_args,
        training_args,
        tune_config,
        trainer_controller_args,
        callbacks=callbacks,  # new
    )
Pros:
Cons:
Accelerate (https://huggingface.co/docs/transformers/en/accelerate) is created by HF to help users easily train a Transformers model on any type of distributed setup, whether it is multiple GPUs on one machine or multiple GPUs across several machines. We should leverage the library for its ease of use.
torchrun, which is used currently, can be less user friendly. See #80.
(@fabianlim thanks for the suggestions)
Add unit tests for the tuning/utils/config_utils.py module.
If possible, each function should have test cases in the file tests/tuning/utils/test_config_utils.py and pass tests with the command pytest tests/tuning/utils/test_config_utils.py.
Boolean values in the fsdp config (https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/config/fsdp_config.json#L4-L6) are represented as string values. This does not translate to actual boolean values when loaded by the hf/transformers library (https://github.com/huggingface/transformers/blob/2a002d073a337051bdc3fbdc95ff1bc0399ae2bb/src/transformers/training_args.py#L1654).
Use json boolean values instead of the string representation. Meanwhile, I have also raised an issue for discussion on the transformers library to add a dataclass and argument parser, with a motivating example (huggingface/transformers#29476).
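A small illustration of why the string form is a problem; the key name is taken from the fsdp config but is otherwise illustrative:

import json

cfg = json.loads('{"fsdp_cpu_ram_efficient_loading": "false"}')
print(type(cfg["fsdp_cpu_ram_efficient_loading"]))  # <class 'str'>
print(bool(cfg["fsdp_cpu_ram_efficient_loading"]))  # True -- any non-empty string is truthy

cfg = json.loads('{"fsdp_cpu_ram_efficient_loading": false}')
print(type(cfg["fsdp_cpu_ram_efficient_loading"]))  # <class 'bool'>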
Enable lint and checks for every PR
Add unit tests for the tuning/utils/merge_model_utils.py module.
If possible, each function should have test cases in the file tests/utils/test_merge_model_utils.py and pass tests with the command pytest tests/utils/test_merge_model_utils.py.
Add unit tests for the tuning/data/tokenizer_data_utils.py module.
If possible, each function should have test cases in the file tests/data/test_tokenizer_data_utils.py and pass tests with the command pytest tests/data/test_tokenizer_data_utils.py.
If the AIM package is installed then the AIM server is also expected to be running.
File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/storage/migrations/utils.py", line 28, in upgrade_database
raise subprocess.SubprocessError(f'Database upgrade failed with exit code {exit_code}')
subprocess.SubprocessError: Database upgrade failed with exit code 1
It should allow running tests and training in environments where the AIM package is installed but there is no AIM server running.
Please provide details about the environment you are using, including the following:
main branch commit db99b280e95c9e5822ee83beee050ba5575ab3bf
Steps to reproduce:
Create a conda or venv environment with the aim package and this library installed.
Training should complete without errors. Only expect an AIM server to be running if the appropriate environment variables like AIMSTACK_DB are set.
Training fails with an error related to AIM:
File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
return func(*args, **kwargs)
File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/sdk/run.py", line 859, in __init__
super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/sdk/run.py", line 272, in __init__
super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/sdk/base_run.py", line 34, in __init__
self.repo = get_repo(repo)
File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/sdk/repo_utils.py", line 24, in get_repo
repo = Repo.from_path(repo, init=True)
File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/sdk/repo.py", line 210, in from_path
repo = Repo(path, read_only=read_only, init=init)
File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/sdk/repo.py", line 153, in __init__
self.structured_db.run_upgrades()
File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/storage/structured/db.py", line 98, in run_upgrades
upgrade_database(self.db_url)
File "/dccstor/rhassistant/seshapad/seshaenv/lib/python3.10/site-packages/aim/storage/migrations/utils.py", line 28, in upgrade_database
raise subprocess.SubprocessError(f'Database upgrade failed with exit code {exit_code}')
subprocess.SubprocessError: Database upgrade failed with exit code 1
LoraConfig can accept a List or a str for target_modules, as seen in the description below. This would be useful in order to support passing "all-linear" as an option instead of the specific attention layers.
target_modules (Optional[Union[List[str], str]]) - The names of the modules to apply the adapter to. If this is specified, only the modules with the specified names will be replaced. When passing a string, a regex match will be performed. When passing a list of strings, either an exact match will be performed or it is checked if the name of the module ends with any of the passed strings. If this is specified as 'all-linear', then all linear/Conv1D modules are chosen, excluding the output layer. If this is not specified, modules will be chosen according to the model architecture. If the architecture is not known, an error will be raised - in this case, you should specify the target modules manually.
fms-hf-tuning LoraConfig currently accepts only a List: target_modules: List[str] = field(default_factory=lambda: ["q_proj", "v_proj"]). This means that if one tries to pass all-linear, it is interpreted as a List.
Example
$ python tuning/sft_trainer.py --target_modules "all-linear"
# interpreted as
LoraConfig(r=8, lora_alpha=16, target_modules=['all-linear'], lora_dropout=0.05)
# subsequently gets used in SFTTrainer as
LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='CAUSAL_LM', inference_mode=False, r=8, target_modules='all-linear', lora_alpha=16, lora_dropout=0.05, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})
# errors with
ValueError: Target modules {'all-linear'} not found in the base model. Please check the target modules and try again.
I tried testing setting target_modules: Union[List[str], str] = field(default_factory=lambda: ["q_proj", "v_proj"]), however "all-linear" was still interpreted as a List instead of a string. This is likely due to the command-line parsing.
Note that all-linear is supported as of PEFT 0.8.0, so the PEFT dependency must be upgraded.
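One possible post-processing step, sketched as an assumption about how the parsed value could be normalized before building the LoraConfig:

def normalize_target_modules(target_modules):
    # A one-element list containing "all-linear" is collapsed to the plain string
    # that peft expects; real lists of module names are passed through unchanged.
    if isinstance(target_modules, list) and target_modules == ["all-linear"]:
        return "all-linear"
    return target_modules

print(normalize_target_modules(["all-linear"]))        # 'all-linear'
print(normalize_target_modules(["c_attn", "c_proj"]))  # ['c_attn', 'c_proj']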
Add a CONTRIBUTING.md and/or DEVELOPER.md file to help new open source members get up to speed quickly.
One suggestion was to use the caikit CONTRIBUTING.md file as a starting point.
Description
In order for the code to exit cleanly, we need to validate that parameters are in the correct range and that required parameters are passed by the users beforehand. If not, we should raise Type/Value Errors and the code should not crash.
Example: currently, num_epochs set to 0 causes a divide-by-zero error. This exception should be handled with the error message: num_epochs must be >= 1.
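A minimal sketch of such upfront validation; the exact checks and messages are illustrative:

def validate_training_args(num_train_epochs: int, learning_rate: float) -> None:
    if num_train_epochs < 1:
        raise ValueError("num_train_epochs must be >= 1")
    if learning_rate <= 0:
        raise ValueError("learning_rate must be > 0")

# validate_training_args(num_train_epochs=0, learning_rate=1e-5)  # would raise ValueError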
Acceptance criteria