
optimum-habana's Introduction

Optimum for Intel® Gaudi® Accelerators

Optimum for Intel Gaudi - a.k.a. optimum-habana - is the interface between the Transformers and Diffusers libraries and Intel Gaudi AI Accelerators (HPU). It provides a set of tools enabling easy model loading, training and inference on single- and multi-HPU settings for different downstream tasks. The list of officially validated models and tasks is available here. Users can also try any of the thousands of other Hugging Face models and tasks on Intel Gaudi accelerators with only a few changes.

What are Intel Gaudi AI Accelerators (HPUs)?

HPUs offer fast model training and inference as well as a great price-performance ratio. Check out this blog post about BLOOM inference and this post benchmarking Intel Gaudi 2 and NVIDIA A100 GPUs for BridgeTower training for concrete examples.

Gaudi Setup

Please refer to the Intel Gaudi AI Accelerator official installation guide.

Tests should be run in a Docker container based on Intel Gaudi Docker images.

The current version has been validated for SynapseAI 1.17.

Install the library and get example scripts

Option 1: Use the latest stable release

To install the latest stable release of this package

pip install --upgrade-strategy eager optimum[habana]

The --upgrade-strategy eager option is needed to ensure optimum-habana is upgraded to the latest stable release.

To use the examples associated with the latest stable release, run:

git clone https://github.com/huggingface/optimum-habana
cd optimum-habana && git checkout v1.13.1

where v1.13.1 is the version number of this release.

Option 2: Use the latest main branch under development

Optimum for Intel Gaudi is a fast-moving project, and you may want to install it from source to get the latest scripts:

pip install git+https://github.com/huggingface/optimum-habana.git
git clone https://github.com/huggingface/optimum-habana

Option 3: Use the transformers_future branch to have the latest changes from Transformers

The transformers_future branch is regularly updated with the latest changes from the main branches of Optimum Habana and Transformers. This enables you to try out new Transformers features that have not been merged into the main branch yet.

Warning

The transformers_future branch may have some regressions or bugs and may be less stable than the main branch.

pip install git+https://github.com/huggingface/optimum-habana.git@transformers_future
git clone -b transformers_future https://github.com/huggingface/optimum-habana

Install dependencies

To use DeepSpeed on HPUs, you also need to run the following command:

pip install git+https://github.com/HabanaAI/[email protected]
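
DeepSpeed is then enabled through the training arguments. Below is a minimal sketch, assuming the deepspeed argument inherited from Transformers' TrainingArguments and a hypothetical local ds_config.json file; multi-card runs are typically launched through gaudi_spawn.py --use_deepspeed as shown in the example commands later in this page.

from optimum.habana import GaudiTrainingArguments

# Minimal sketch: "ds_config.json" is a hypothetical DeepSpeed configuration file
# (e.g. ZeRO settings); the `deepspeed` argument is inherited from Transformers'
# TrainingArguments and accepts a path to such a file or a dict.
training_args = GaudiTrainingArguments(
    output_dir="./output",
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name="Habana/bert-base-uncased",  # example Hub configuration
    deepspeed="ds_config.json",
)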

To install the requirements for every example:

cd <example-folder>
pip install -r requirements.txt

How to use it?

Quick Start

Optimum for Intel Gaudi was designed with one goal in mind: to make training and inference straightforward for Transformers and Diffusers users, while fully leveraging the power of Intel Gaudi AI Accelerators.

Transformers Interface

There are two main classes one needs to know:

  • GaudiTrainer: the trainer class that takes care of compiling and distributing the model to run on HPUs, and performing training and evaluation.
  • GaudiConfig: the class that lets you configure Habana Mixed Precision and decide whether optimized operators and optimizers should be used.

The GaudiTrainer is very similar to the Transformers Trainer, and adapting a script that uses the Trainer to make it work with Intel Gaudi accelerators will mostly consist of simply swapping the Trainer class for the GaudiTrainer one. That's how most of the example scripts were adapted from their original counterparts.

Here is an example:

- from transformers import Trainer, TrainingArguments
+ from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

- training_args = TrainingArguments(
+ training_args = GaudiTrainingArguments(
  # training arguments...
+ use_habana=True,
+ use_lazy_mode=True,  # whether to use lazy or eager mode
+ gaudi_config_name=path_to_gaudi_config,
)

# A lot of code here

# Initialize our Trainer
- trainer = Trainer(
+ trainer = GaudiTrainer(
    model=model,
    args=training_args,  # Original training arguments.
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

where gaudi_config_name is the name of a model from the Hub (Intel Gaudi configurations are stored in model repositories) or a path to a local Intel Gaudi configuration file (you can see here how to write your own).
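
For reference, here is a minimal sketch of how a Gaudi configuration can be loaded or built in Python. The attribute names below (use_fused_adam, use_fused_clip_norm, use_torch_autocast) are taken from the configurations published under the Habana organization on the Hub and may differ between releases:

from optimum.habana import GaudiConfig

# Load an existing Intel Gaudi configuration from the Hub
# ("Habana/bert-base-uncased" is an example repository)...
gaudi_config = GaudiConfig.from_pretrained("Habana/bert-base-uncased")

# ...or build one programmatically (attribute names assumed from the Hub configurations).
gaudi_config = GaudiConfig(
    use_fused_adam=True,       # Habana's fused AdamW implementation
    use_fused_clip_norm=True,  # Habana's fused gradient norm clipping
    use_torch_autocast=True,   # bf16 mixed precision through torch autocast
)

A GaudiConfig object built this way should also be accepted by GaudiTrainer through its gaudi_config argument, as an alternative to setting gaudi_config_name in the training arguments.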

Diffusers Interface

You can generate images from prompts using Stable Diffusion on Intel Gaudi with the GaudiStableDiffusionPipeline class and the GaudiDDIMScheduler, which have both been optimized for HPUs. Here is how to use them and how they differ from the Diffusers library:

- from diffusers import DDIMScheduler, StableDiffusionPipeline
+ from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline


model_name = "CompVis/stable-diffusion-v1-4"

- scheduler = DDIMScheduler.from_pretrained(model_name, subfolder="scheduler")
+ scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")

- pipeline = StableDiffusionPipeline.from_pretrained(
+ pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    model_name,
    scheduler=scheduler,
+   use_habana=True,
+   use_hpu_graphs=True,
+   gaudi_config="Habana/stable-diffusion",
)

outputs = pipeline(
    ["An image of a squirrel in Picasso style"],
    num_images_per_prompt=16,
+   batch_size=4,
)
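
The call returns a standard Diffusers-style output object, so the generated images can be retrieved and saved as usual. A short follow-up sketch (the file names are arbitrary):

# outputs.images holds the generated PIL images.
for i, image in enumerate(outputs.images):
    image.save(f"squirrel_{i}.png")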

Documentation

Check out the documentation of Optimum for Intel Gaudi for more advanced usage.

Validated Models

The following model architectures, tasks and device distributions have been validated for Optimum for Intel Gaudi:

In the tables below, ✔️ means single-card, multi-card and DeepSpeed have all been validated.

  • Transformers:

    Architecture | Training | Inference | Tasks
    BERT | ✔️ | ✔️ | text classification, question answering, language modeling, text feature extraction
    RoBERTa | ✔️ | ✔️ | question answering, language modeling
    ALBERT | ✔️ | ✔️ | question answering, language modeling
    DistilBERT | ✔️ | ✔️ | question answering, language modeling
    GPT2 | ✔️ | ✔️ | language modeling, text generation
    BLOOM(Z) |  | DeepSpeed | text generation
    StarCoder / StarCoder2 | ✔️ | Single card | language modeling, text generation
    GPT-J | DeepSpeed | Single card, DeepSpeed | language modeling, text generation
    GPT-NeoX | DeepSpeed | DeepSpeed | language modeling, text generation
    OPT |  | DeepSpeed | text generation
    Llama 2 / CodeLlama / Llama 3 / Llama Guard / Granite | ✔️ | ✔️ | language modeling, text generation, question answering, text classification (Llama Guard)
    StableLM |  | Single card | text generation
    Falcon | LoRA | ✔️ | language modeling, text generation
    CodeGen |  | Single card | text generation
    MPT |  | Single card | text generation
    Mistral |  | Single card | text generation
    Phi | ✔️ | Single card | language modeling, text generation
    Mixtral |  | Single card | text generation
    Persimmon |  | Single card | text generation
    Qwen2 | Single card | Single card | language modeling, text generation
    Gemma | ✔️ | Single card | language modeling, text generation
    T5 / Flan T5 | ✔️ | ✔️ | summarization, translation, question answering
    BART |  | Single card | summarization, translation, question answering
    ViT | ✔️ | ✔️ | image classification
    Swin | ✔️ | ✔️ | image classification
    Wav2Vec2 | ✔️ | ✔️ | audio classification, speech recognition
    Whisper | ✔️ | ✔️ | speech recognition
    SpeechT5 |  | Single card | text to speech
    CLIP | ✔️ | ✔️ | contrastive image-text training
    BridgeTower | ✔️ | ✔️ | contrastive image-text training
    ESMFold |  | Single card | protein folding
    Blip |  | Single card | visual question answering, image to text
    OWLViT |  | Single card | zero shot object detection
    ClipSeg |  | Single card | object segmentation
    Llava / Llava-next |  | Single card | image to text
    Segment Anything Model |  | Single card | object segmentation
    VideoMAE |  | Single card | video classification
    TableTransformer |  | Single card | table object detection
    DETR |  | Single card | object detection

  • Diffusers:

    Architecture | Training | Inference | Tasks
    Stable Diffusion | textual inversion, ControlNet | Single card | text-to-image generation
    Stable Diffusion XL | fine-tuning | Single card | text-to-image generation
    Stable Diffusion Depth2img |  | Single card | depth-to-image generation
    LDM3D |  | Single card | text-to-image generation
    Text to Video |  | Single card | text-to-video generation

  • PyTorch Image Models/TIMM:

    Architecture | Training | Inference | Tasks
    FastViT |  | Single card | image classification

  • TRL:

    Architecture | Training | Inference | Tasks
    Llama 2 | ✔️ |  | DPO Pipeline
    Llama 2 | ✔️ |  | PPO Pipeline
    Stable Diffusion | ✔️ |  | DDPO Pipeline

Other models and tasks supported by the Transformers and Diffusers libraries may also work. You can refer to this section for using them with Optimum for Intel Gaudi. In addition, this page explains how to modify any example from the Transformers library to make it work with Optimum for Intel Gaudi.
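
As a rough illustration only, running a Transformers model that is not listed above on HPU for inference can look like the sketch below. adapt_transformers_to_gaudi is the helper used for this purpose in the Optimum for Intel Gaudi codebase, gpt2 is just an example checkpoint, and details such as lazy vs. eager mode or HPU graphs are left out:

import torch
import habana_frameworks.torch.core as htcore  # registers the HPU device with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

# Patch Transformers so that models with Gaudi-optimized implementations use them.
adapt_transformers_to_gaudi()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16).to("hpu")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("hpu")
# Generation options such as lazy_mode or HPU graphs are omitted for brevity.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))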

    If you find any issues while using those, please open an issue or a pull request.

    After training your model, feel free to submit it to the Intel leaderboard, which is designed to evaluate, score, and rank open-source LLMs that have been pre-trained or fine-tuned on Intel hardware. Models submitted to the leaderboard will be evaluated on the Intel Developer Cloud. The evaluation platform consists of Gaudi Accelerators and Xeon CPUs running benchmarks from the Eleuther AI Language Model Evaluation Harness.

    Development

    Check the contributor guide for instructions.


    optimum-habana's Issues

    Cannot run text-generation with bloom deepspeed?

    System Info

    ubuntu 20.04
    docker 1.9

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. refer https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation to setup
    2. python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path bigscience/bloom-560m --bf16 --max_new_tokens 10 --batch_size 1 --use_kv_cache --do_sample

    Expected behavior

    running correctly

    Latest version of optimum-habana does not work with transformers==4.32.1

    System Info

    optimum-habana: 1.8.0.dev0
    transformers: 4.32.1 (preinstalled)
    
    Following message seen when installing optimum-habana:
    Requirement already satisfied: transformers>=4.32.0

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    python run_qa.py
    --model_name_or_path bert-large-uncased-whole-word-masking
    --gaudi_config_name Habana/bert-large-uncased-whole-word-masking
    --dataset_name squad
    --do_train
    --do_eval
    --per_device_train_batch_size 24
    --per_device_eval_batch_size 8
    --learning_rate 3e-5
    --num_train_epochs 1
    --max_seq_length 384
    --doc_stride 128
    --output_dir /tmp/squad/
    --use_habana
    --use_lazy_mode
    --use_hpu_graphs_for_inference
    --bf16
    --throughput_warmup_steps 3

    Fails with the following error:
    ModuleNotFoundError: No module named 'transformers.integrations.deepspeed'; 'transformers.integrations' is not a package

    Expected behavior

    Since transformers 4.32.0 is no longer supported by the latest optimum-habana, it should install the minimum transformers version that is supported.

    Error in tests when test_trainer is run before test_trainer_distributed

    Unit and integration tests currently need to be run with pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py. If not, for instance with pytest tests/, test_trainer will be executed before test_trainer_distributed and the latter will fail without any error message.

    The following code snippet in training_args.py should actually not be executed in single-card mode and is responsible for this error:

    try:
      global mpi_comm
      from mpi4py import MPI
      
      mpi_comm = MPI.COMM_WORLD
      world_size = mpi_comm.Get_size()
      if world_size > 1:
          rank = mpi_comm.Get_rank()
          self.local_rank = rank
      else:
          raise ("Single MPI process")
    except Exception as e:
      logger.info("Single node run")

    However, even when this is corrected, I still get the following error:

    Traceback (most recent call last):
      File "/root/shared/optimum-habana/tests/test_trainer_distributed.py", line 117, in <module>
        trainer = GaudiTrainer(
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/trainer.py", line 118, in __init__
        super().__init__(
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 382, in __init__
        self._move_model_to_device(model, args.device)
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 548, in _move_model_to_device
        model = model.to(device)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 899, in to
        return self._apply(convert)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 570, in _apply
        module._apply(fn)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 593, in _apply
        param_applied = fn(param)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 897, in convert
        return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
    RuntimeError: Device acquire failed.

    I think this is due to the fact that one process may still be running on an HPU when Torch tries to acquire devices.

    Resume from checkpoint does not work

    Error Message:

    Traceback (most recent call last):
      File "examples/question-answering/run_qa.py", line 664, in <module>
        main()
      File "examples/question-answering/run_qa.py", line 605, in main
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/trainer.py", line 517, in train
        self._load_optimizer_and_scheduler(resume_from_checkpoint)
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1795, in _load_optimizer_and_scheduler
        torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
      File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 607, in load
        return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
      File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 882, in _load
        result = unpickler.load()
      File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 857, in persistent_load
        load_tensor(data_type, size, key, _maybe_decode_ascii(location))
      File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 846, in load_tensor
        loaded_storages[key] = restore_location(storage, location)
      File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 827, in restore_location
        return default_restore_location(storage, str(map_location))
      File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 178, in default_restore_location
        raise RuntimeError("don't know how to restore data location of "
    RuntimeError: don't know how to restore data location of torch.FloatStorage (tagged with hpu)
    

    Command used to run training :

    python examples/question-answering/run_qa.py --model_name_or_path albert-xxlarge-v1 --dataset_name squad  --do_train --do_eval --per_device_train_batch_size=12 --learning_rate=5e-06 --num_train_epochs 2 --save_steps 5000 --seed 42 --doc_stride 128 --max_seq_length 384 --per_device_eval_batch_size 2 --use_lazy_mode  --use_habana --output_dir=./albert_xxlarge_bf16_squad 2>&1 | tee albert_xxlarge_bf16_squad_continued.log
    

    Method for reproducing the issue:

    1. Use above command to run the training.
    2. Halt the training after a few steps/epochs.
    3. Resume the training using the same command with --resume_from_checkpoint flag pointing to the output directory of the above command.
    4. Above error is encountered.

    Attached Log file:
    albert_xxlarge_bf16_squad_continued.log

    AttributeError: 'GaudiStableDiffusionPipeline' object has no attribute '_internal_dict'

    System Info

    Optimum habana version: 1.5.0.dev
    Docker image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu20.04/habanalabs/pytorch-installer-1.13.1

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. Go into the examples/stable-diffusion folder.
    2. Run "python text_to_image_generation.py --model_name_or_path stabilityai/stable-diffusion-2-base --prompts "a photo of an astronaut riding a horse on mars" --num_images_per_prompt 1 --batch_size 1 --image_save_dir /tmp/stable_diffusion_images --use_habana --use_hpu_graph --gaudi_config Habana/stable-diffusion".
    3. The following error is raised: AttributeError: 'GaudiStableDiffusionPipeline' object has no attribute '_internal_dict'

    Expected behavior

    Is there any way to fix this issue?

    Enable beam_search for text-generation

    Feature request

    Currently, text-generation only supports greedy decoding with a single beam, and this is not exposed to users. Would it be possible to expose the number of beams as a command-line option?

    Motivation

    Enable the beam_search code path when the number of beams is greater than 1, while a single beam falls back to greedy_search (see the sketch below).
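
    For context, in the standard Transformers generate() API the decoding strategy is selected by the number of beams; a short sketch with a placeholder model:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # "gpt2" is only an illustrative checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The weather today is", return_tensors="pt")

    # num_beams == 1 selects greedy search, num_beams > 1 selects beam search.
    greedy_output = model.generate(**inputs, max_new_tokens=10, num_beams=1)
    beam_output = model.generate(**inputs, max_new_tokens=10, num_beams=4)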

    Your contribution

    submit a PR

    Performance is better in the 1.6.1 release than in the 1.7.4 release for many models

    System Info

    Optimum-habana - 1.7.4
    Synapse AI - 1.12.0
    Docker - 1.12.0-463
    Gaudi2 (HLS 225) - 1x and 8x.

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Steps to reproduce running SwinT on a single card (1x):

    1. Download and install optimum-habana
    2. git clone https://github.com/huggingface/optimum-habana.git
    3. cd optimum-habana
    4. git checkout v1.6-release
    5. pip install -r examples/image-classification/requirements.txt
    6. pip install optimum-habana==1.6.1
    7. python3 /root//optimum-habana/examples/image-classification/run_image_classification.py --model_name_or_path microsoft/swin-base-patch4-window7-224 --dataset_name cifar10 --output_dir /tmp/swint_hf/results/ --remove_unused_columns False --do_train --learning_rate 2e-05 --per_device_train_batch_size 64 --evaluation_strategy no --save_strategy no --load_best_model_at_end True --save_total_limit 3 --seed 1337 --use_habana --use_lazy_mode --gaudi_config_name Habana/swin --throughput_warmup_steps 3 --ignore_mismatched_sizes --bf16 --num_train_epochs 1 --logging_steps 20 --dataloader_num_workers 8

    Expected behavior

    The expected behavior is that, on 1.12.0-463, optimum-habana 1.6.1 and optimum-habana 1.7.4 deliver similar performance.

    But what is observed is that performance is better with optimum-habana 1.6.1 and comparatively lower with 1.7.4.

    This is applicable to SwinT, ViT, and BERT-Large on both 8x and 1x.
    E.g., throughput values for SwinT are given below:

    OH - 1.7.4 values
    362.524
    362.566
    360.719
    358.089

    OH - 1.6.1 values
    389.045
    390.971
    389.587

    Almost 7.5% drop

    Default value of ignore_eos

    System Info

    Currently, the default value of ignore_eos is set to lazy_mode, whereas the default value in vanilla Transformers is False.
    
    This results in different accuracy and performance values on Gaudi when compared with GPU.
    Also, in summarization tasks, what is the need for setting ignore_eos to True? Stopping when we reach EOS should be the behavior for summarization, right?
    
    Right now, the only way to set the flag is from generation_config. With the default generation config, we have to update it, and that will cause issues with CI accuracy and performance, which currently run with ignore_eos as lazy_mode (True in most cases).
    
    Shouldn't there be an extra flag that can be provided on the command line, so that each model can set it based on its needs?

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    pytest -s -v tests/test_encoder_decoder_text_summarization.py

    Expected behavior

    There should be a way to set ignore_eos value without impacting other runs and models

    Where in the directory "/tmp/tst-summarization" is the summarization output stored?

    System Info

    Optimum Habana : 1.6.0
    SynapseAI : 1.10.0
    Docker Image : Habana® Deep Learning Base AMI (Ubuntu 20.04)
    Volume : 1000 GiB

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Start an EC2 instance with a DL1 resource and this image: Habana® Deep Learning Base AMI (Ubuntu 20.04)
    Run these commands
    a. docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
    b. git clone https://github.com/huggingface/optimum-habana.git
    c. pip install optimum[habana]
    d. cd examples
    e. cd summarization
    f. pip install -r requirements.txt

    python run_summarization.py
    --model_name_or_path t5-small
    --do_eval
    --dataset_name cnn_dailymail
    --dataset_config "3.0.0"
    --source_prefix "summarize: "
    --output_dir /tmp/tst-summarization
    --per_device_train_batch_size 4
    --per_device_eval_batch_size 4
    --overwrite_output_dir
    --predict_with_generate
    --use_habana
    --use_lazy_mode
    --use_hpu_graphs_for_inference
    --gaudi_config_name Habana/t5
    --ignore_pad_token_for_loss False
    --pad_to_max_length
    --save_strategy epoch
    --throughput_warmup_steps 3

    Expected behavior

    Need a file with the summarized text and not just the evaluation metrics

    GPT-NeoX fine-tuning does not work (segmentation fault) since 1.7.0

    System Info

    optimum-habana version >1.7.0
    deepspeed 1.11.0

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    python3 /root/repos/optimum-habana/examples/gaudi_spawn.py   --hostfile /root/repos/hostsfile --world_size 8 --use_deepspeed /root/repos/optimum-habana/examples/language-modeling/run_clm.py --deepspeed /root/repos/optimum-habana/tests/configs/deepspeed_zero_2.json --model_name_or_path 'EleutherAI/gpt-neox-20b' --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --num_train_epochs 1 --do_train --output_dir ~/gpt-neox-20b --gaudi_config_name Habana/gpt2 --gradient_checkpointing --use_habana --use_lazy_mode --throughput_warmup_steps 3 --overwrite_output_dir --use_hpu_graphs_for_inference

    Crash log:

    10.233.250.163: Loading extension module utils...
    10.233.250.163: [INFO|trainer.py:680] 2023-09-12 09:11:29,269 >> ***** Running training *****
    10.233.250.163: [INFO|trainer.py:681] 2023-09-12 09:11:29,269 >>   Num examples = 2,334
    10.233.250.163: [INFO|trainer.py:682] 2023-09-12 09:11:29,269 >>   Num Epochs = 1
    10.233.250.163: [INFO|trainer.py:683] 2023-09-12 09:11:29,269 >>   Instantaneous batch size per device = 2
    10.233.250.163: [INFO|trainer.py:686] 2023-09-12 09:11:29,269 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
    10.233.250.163: [INFO|trainer.py:687] 2023-09-12 09:11:29,269 >>   Gradient Accumulation steps = 1
    10.233.250.163: [INFO|trainer.py:688] 2023-09-12 09:11:29,269 >>   Total optimization steps = 73
    10.233.250.163: [INFO|trainer.py:689] 2023-09-12 09:11:29,274 >>   Number of trainable parameters = 20,554,567,680
    10.233.168.102: Time to load utils op: 0.0013871192932128906 seconds
    10.233.168.102: Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
    10.233.168.102: No modifications detected for re-loaded extension module utils, skipping build step...
    10.233.168.102: Loading extension module utils...
    10.233.168.102: Time to load utils op: 0.00066375732421875 seconds
    10.233.168.102: Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
    10.233.168.102: No modifications detected for re-loaded extension module utils, skipping build step...
    10.233.168.102: Loading extension module utils...
    10.233.168.102: Time to load utils op: 0.0005764961242675781 seconds
    10.233.168.102: Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
    10.233.168.102: No modifications detected for re-loaded extension module utils, skipping build step...
    10.233.168.102: Loading extension module utils...
    10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:605:forward] Activation Checkpointing Information
    10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:606:forward] ----Partition Activations False, CPU CHECKPOINTING False
    10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:607:forward] ----contiguous Memory Checkpointing False with None total layers
    10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:609:forward] ----Synchronization False
    10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:610:forward] ----Profiling time in checkpointing False
    10.233.168.102: Internal Error: Received signal - Segmentation fault
    10.233.168.102: Internal Error: Received signal - Segmentation fault
    10.233.250.163: 
      0%|          | 0/73 [00:00<?, ?it/s]Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.168.102: Internal Error: Received signal - Segmentation fault
    10.233.168.102: Internal Error: Received signal - Segmentation fault
    10.233.168.102: Internal Error: Received signal - Segmentation fault
    10.233.168.102: Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.168.102: Internal Error: Received signal - Segmentation fault 
    

    Expected behavior

    It should work the same as with 1.6.1.

    ImportError: No module named optimum.habana.distributed

    System Info

    Optimum Habana : 1.6.0
    SynapseAI : 1.10.0
    Docker Image : vault.habana.ai/gaudi-docker/1.10.0/amzn2/habanalabs/pytorch-installer-2.0.1:latest

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. Pull this image vault.habana.ai/gaudi-docker/1.10.0/amzn2/habanalabs/pytorch-installer-2.0.1:latest into the cnvrg repository
    2. Run the following commands
    3. pip install optimum[habana]
    4. cd optimum-habana/examples/text-generation
    5. pip install -r requirements.txt
    6. pip install git+https://github.com/HabanaAI/[email protected]
    7. python ../gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py
      --model_name_or_path bigscience/bloom-560m
      --use_hpu_graphs
      --use_kv_cache
      --max_new_tokens 100
      --do_sample
      --prompt "Tell me a poem about stone and water"

    Expected behavior

    It should either generate text or give a permission error.

    Run text-generation with non-deepspeed mode

    Feature request

    The current text-generation example only supports bloom & bloomz with DeepSpeed, and does not support other generation models like gpt2, gpt-j, or neox.
    There is also a PR that shows how to run inference in each of the examples, but that inference is training evaluation, not a real generation case.
    So would it be possible to support the following features?

    • text-generation support for other generation models (e.g. gpt2, gpt-j)
    • text-generation support for non-DeepSpeed mode

    Motivation

    text-generation only supports bloom and bloomz; it cannot be run with gpt2, gpt-j, neox, ...

    Your contribution

    submitting a PR

    Runtime Error in Eager mode evaluation: The number of dims cannot be packed into CompleteArgumentSpec:65535

    Error Message:

    100%|██████████| 2702/2702 [02:58<00:00, 17.93it/s]Traceback (most recent call last):
      File "examples/question-answering/run_qa.py", line 664, in <module>
        main()
      File "examples/question-answering/run_qa.py", line 621, in main
        metrics = trainer.evaluate()
      File "/root/optimum-habana/examples/question-answering/trainer_qa.py", line 45, in evaluate
        output = self.evaluation_loop(
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/trainer.py", line 1112, in evaluation_loop
        logits = nested_numpify(preds_host)
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer_pt_utils.py", line 138, in nested_numpify
        return type(tensors)(nested_numpify(t) for t in tensors)
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer_pt_utils.py", line 138, in <genexpr>
        return type(tensors)(nested_numpify(t) for t in tensors)
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer_pt_utils.py", line 139, in nested_numpify
        t = tensors.cpu()
    RuntimeError: The number of dims cannot be packed into CompleteArgumentSpec:65535
    

    Attaching the log file below:
    albert_large_bf16_squad_eager.log

    Command used:

    python examples/question-answering/run_qa.py --model_name_or_path albert-large-v2 --dataset_name squad  --do_train --do_eval --max_seq_length 384 --per_device_train_batch_size 32 --learning_rate 5e-05 --num_train_epochs 2 --save_steps 5000 --seed 42 --doc_stride 128 --per_device_eval_batch_size 4 --use_lazy_mode false --use_habana  --output_dir=./albert_large_bf16_squad_eager  --cache_dir /software/lfs/data/pytorch/transformers/Squad 2>&1 | tee albert_large_bf16_squad_eager.log
    

    Add a utility method to get the memory consumptions for various batch sizes

    Feature request

    The GaudiTrainer class should provide a method that takes a list of batch sizes as an argument and returns the memory consumption on HPU for each batch size.
    For each batch size in the list, a training run of 5 steps should be performed without logging anything, and the maximum memory consumption of this run should be returned.
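
    A rough, hypothetical sketch of what such a utility could look like is given below. None of this exists in optimum-habana today; the memory calls assume the torch.cuda-style statistics exposed under habana_frameworks.torch.hpu, and make_trainer is a user-supplied factory:

    import habana_frameworks.torch.hpu as hthpu  # assumed to expose CUDA-style memory statistics

    def probe_hpu_memory(make_trainer, batch_sizes, steps=5):
        """Hypothetical helper: run a few training steps per batch size and record peak HPU memory.

        make_trainer(batch_size, max_steps) is expected to return a configured GaudiTrainer.
        """
        peaks = {}
        for batch_size in batch_sizes:
            hthpu.reset_peak_memory_stats()                    # assumed API
            trainer = make_trainer(batch_size, steps)
            trainer.train()
            peaks[batch_size] = hthpu.max_memory_allocated()   # assumed API, in bytes
        return peaks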

    Motivation

    This feature will save users from launching a full training run just to check the memory consumption in the logs.

    Your contribution

    I will submit a PR.

    Several greedy search Test cases failing with KeyError: 'bucket_size'

    System Info

    After the recent integration of Transformers tests into optimum-habana, several test cases are failing with the following error.
    @ssarkar2, please have a look. We should probably check for the key's availability before accessing it and also return gracefully, so that existing functionality is not affected. Otherwise, please suggest the modification required in the tests to update the arguments.
    
    FAILED test_modeling_t5.py::T5ModelTest::test_greedy_generate - KeyError: 'bucket_size'
    FAILED test_modeling_t5.py::T5ModelTest::test_greedy_generate_dict_outputs - KeyError: 'bucket_size'
    FAILED test_modeling_t5.py::T5ModelTest::test_greedy_generate_dict_outputs_use_cache - KeyError: 'bucket_size'
    
    
    <pt> (conda_qnpu1) (anneog_transformers_tests_updates) anneog@anneog-vm-u20:t5 $ python -m pytest -vs test_modeling_t5.py::T5ModelTest::test_greedy_generate
    ============================================================================================= test session starts =============================================================================================
    platform linux -- Python 3.8.18, pytest-7.4.2, pluggy-1.3.0 -- /home/anneog/anaconda3/envs/conda_qnpu1/bin/python
    cachedir: .pytest_cache
    rootdir: /home/anneog/github/ankurneog/optimum-habana
    configfile: setup.cfg
    collecting ... [WARNING|utils.py:179] 2023-10-04 07:28:42,015 >> optimum-habana v1.8.0.dev0 has been validated for SynapseAI v1.11.0 but habana-frameworks v1.13.0.133 was found, this could lead to undefined behavior!
    [WARNING|utils.py:196] 2023-10-04 07:28:42,043 >> Could not run `hl-smi`, please follow the installation guide: https://docs.habana.ai/en/latest/Installation_Guide/index.html.
    collected 1 item                                                                                                                                                                                              
    
    test_modeling_t5.py::T5ModelTest::test_greedy_generate ============================= HABANA PT BRIDGE CONFIGURATION =========================== 
     PT_HPU_LAZY_MODE = 1
     PT_RECIPE_CACHE_PATH = 
     PT_CACHE_FOLDER_DELETE = 0
     PT_HPU_RECIPE_CACHE_CONFIG = 
     PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
     PT_HPU_LAZY_ACC_PAR_MODE = 1
     PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
    ---------------------------: System Configuration :---------------------------
    Num CPU Cores : 8
    CPU RAM       : 40852220 KB
    ------------------------------------------------------------------------------
    FAILED
    
    ================================================================================================== FAILURES ===================================================================================================
    ______________________________________________________________________________________ T5ModelTest.test_greedy_generate _______________________________________________________________________________________
    
    self = <tests.models.t5.test_modeling_t5.T5ModelTest testMethod=test_greedy_generate>
    
        def test_greedy_generate(self):
            # check `generate()` and `greedy_search()` are equal
            for model_class in self.all_generative_model_classes:
                config, input_ids, attention_mask, max_length = self._get_input_ids_and_config()
                # test old generation output for backwards compatibility
                model = model_class(config).to(torch_device).eval()
    >           output_greedy, output_generate = self._greedy_generate(
                    model=model, input_ids=input_ids, attention_mask=attention_mask, max_length=max_length
                )
    
    ../../generation/test_utils.py:704: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    ../../generation/test_utils.py:293: in _greedy_generate
        output_greedy = model.greedy_search(
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    
    self = T5ForConditionalGeneration(
      (shared): Embedding(99, 32)
      (encoder): T5Stack(
        (embed_tokens): Embedding(99, 32)
    ...m()
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (lm_head): Linear(in_features=32, out_features=99, bias=False)
    )
    input_ids = tensor([[0],
            [0]], device='hpu:0')
    logits_processor = [<transformers.generation.logits_process.MinLengthLogitsProcessor object at 0x7f75ae6456d0>, <transformers.generation....at 0x7f75ae6456a0>, <transformers.generation.logits_process.RepetitionPenaltyLogitsProcessor object at 0x7f75ae5e8520>]
    stopping_criteria = [<transformers.generation.stopping_criteria.MaxLengthCriteria object at 0x7f75ae5ea8e0>], max_length = 4, pad_token_id = 0, eos_token_id = [1], output_attentions = False
    output_hidden_states = False, output_scores = False, return_dict_in_generate = False, synced_gpus = False, streamer = None, lazy_mode = False, ignore_eos = False, profiling_warmup_steps = 0
    profiling_steps = 0
    model_kwargs = {'encoder_outputs': BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[-1.8599e-03,  1.3660e-03, -1...    grad_fn=<IndexSelectBackward0>), past_key_values=None, hidden_states=None, attentions=None, cross_attentions=None)}
    eos_token_id_tensor = tensor([1], device='hpu:0'), scores = None, decoder_attentions = None, cross_attentions = None, decoder_hidden_states = None
    
        def greedy_search(
            self,
            input_ids: torch.LongTensor,
            logits_processor: Optional[LogitsProcessorList] = None,
            stopping_criteria: Optional[StoppingCriteriaList] = None,
            max_length: Optional[int] = None,
            pad_token_id: Optional[int] = None,
            eos_token_id: Optional[Union[int, List[int]]] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            output_scores: Optional[bool] = None,
            return_dict_in_generate: Optional[bool] = None,
            synced_gpus: bool = False,
            streamer: Optional["BaseStreamer"] = None,
            lazy_mode: Optional[bool] = False,
            ignore_eos: Optional[bool] = False,
            profiling_warmup_steps: Optional[int] = 0,
            profiling_steps: Optional[int] = 0,
            **model_kwargs,
        ) -> Union[GreedySearchOutput, torch.LongTensor]:
            r"""
            Generates sequences of token ids for models with a language modeling head using **greedy decoding** and can be
            used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.
        
            <Tip warning={true}>
        
            In most cases, you do not need to call [`~generation.GenerationMixin.greedy_search`] directly. Use generate()
            instead. For an overview of generation strategies and code examples, check the [following
            guide](../generation_strategies).
        
            </Tip>
        
        
            Parameters:
                input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
                    The sequence used as a prompt for the generation.
                logits_processor (`LogitsProcessorList`, *optional*):
                    An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsProcessor`]
                    used to modify the prediction scores of the language modeling head applied at each generation step.
                stopping_criteria (`StoppingCriteriaList`, *optional*):
                    An instance of [`StoppingCriteriaList`]. List of instances of class derived from [`StoppingCriteria`]
                    used to tell if the generation loop should stop.
                max_length (`int`, *optional*, defaults to 20):
                    **DEPRECATED**. Use `logits_processor` or `stopping_criteria` directly to cap the number of generated
                    tokens. The maximum length of the sequence to be generated.
                pad_token_id (`int`, *optional*):
                    The id of the *padding* token.
                eos_token_id (`Union[int, List[int]]`, *optional*):
                    The id of the *end-of-sequence* token. Optionally, use a list to set multiple *end-of-sequence* tokens.
                output_attentions (`bool`, *optional*, defaults to `False`):
                    Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                    returned tensors for more details.
                output_hidden_states (`bool`, *optional*, defaults to `False`):
                    Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
                    for more details.
                output_scores (`bool`, *optional*, defaults to `False`):
                    Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
                return_dict_in_generate (`bool`, *optional*, defaults to `False`):
                    Whether or not to return a [`transformers.generationutils.ModelOutput`] instead of a plain tuple.
                synced_gpus (`bool`, *optional*, defaults to `False`):
                    Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
                streamer (`BaseStreamer`, *optional*):
                    Streamer object that will be used to stream the generated sequences. Generated tokens are passed
                    through `streamer.put(token_ids)` and the streamer is responsible for any further processing.
                lazy_mode (`bool`, *optional*, defaults to `False`):
                    Whether the run is executed in lazy mode or not (i.e. eager mode).
                ignore_eos (`bool`, *optional*, defaults to `False`):
                    Whether to ignore finished sequences (faster in lazy mode and with HPU graphs) or not (eager mode).
                profiling_warmup_steps (`int`, *optional*, defaults to 0):
                    Number of steps to ignore for profling.
                profiling_steps (`int`, *optional*, defaults to 0):
                    Number of steps to be captured when enabling profiling.
                model_kwargs:
                    Additional model specific keyword arguments will be forwarded to the `forward` function of the model.
                    If model is an encoder-decoder model the kwargs should include `encoder_outputs`.
        
            Return:
                [`transformers.generation.GreedySearchDecoderOnlyOutput`], [`transformers.generation.GreedySearchEncoderDecoderOutput`]
                or `torch.LongTensor`: A `torch.LongTensor` containing the generated tokens (default behaviour) or a
                [`transformers.generation.GreedySearchDecoderOnlyOutput`] if `model.config.is_encoder_decoder=False` and
                `return_dict_in_generate=True` or a [`transformers.generation.GreedySearchEncoderDecoderOutput`] if
                `model.config.is_encoder_decoder=True`.
        
            Examples:
        
            
            >>> from transformers import (
            ...     AutoTokenizer,
            ...     AutoModelForCausalLM,
            ...     LogitsProcessorList,
            ...     MinLengthLogitsProcessor,
            ...     StoppingCriteriaList,
            ...     MaxLengthCriteria,
            ... )
        
            >>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
            >>> model = AutoModelForCausalLM.from_pretrained("gpt2")
        
            >>> # set pad_token_id to eos_token_id because GPT2 does not have a PAD token
            >>> model.generation_config.pad_token_id = model.generation_config.eos_token_id
        
            >>> input_prompt = "It might be possible to"
            >>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids
        
            >>> # instantiate logits processors
            >>> logits_processor = LogitsProcessorList(
            ...     [
            ...         MinLengthLogitsProcessor(10, eos_token_id=model.generation_config.eos_token_id),
            ...     ]
            ... )
            >>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])
        
            >>> outputs = model.greedy_search(
            ...     input_ids, logits_processor=logits_processor, stopping_criteria=stopping_criteria
            ... )
        
            >>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
            ["It might be possible to get a better understanding of the nature of the problem, but it's not"]
            """
            # init values
            logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
            stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()
            if max_length is not None:
                warnings.warn(
                    (
                        "`max_length` is deprecated in this function, use"
                        " `stopping_criteria=StoppingCriteriaList([MaxLengthCriteria(max_length=max_length)])` instead."
                    ),
                    UserWarning,
                )
                stopping_criteria = validate_stopping_criteria(stopping_criteria, max_length)
            pad_token_id = pad_token_id if pad_token_id is not None else self.generation_config.pad_token_id
            eos_token_id = eos_token_id if eos_token_id is not None else self.generation_config.eos_token_id
            if isinstance(eos_token_id, int):
                eos_token_id = [eos_token_id]
            eos_token_id_tensor = torch.tensor(eos_token_id).to(input_ids.device) if eos_token_id is not None else None
            output_scores = output_scores if output_scores is not None else self.generation_config.output_scores
            output_attentions = (
                output_attentions if output_attentions is not None else self.generation_config.output_attentions
            )
            output_hidden_states = (
                output_hidden_states if output_hidden_states is not None else self.generation_config.output_hidden_states
            )
            return_dict_in_generate = (
                return_dict_in_generate
                if return_dict_in_generate is not None
                else self.generation_config.return_dict_in_generate
            )
        
            # init attention / hidden states / scores tuples
            scores = () if (return_dict_in_generate and output_scores) else None
            decoder_attentions = () if (return_dict_in_generate and output_attentions) else None
            cross_attentions = () if (return_dict_in_generate and output_attentions) else None
            decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None
        
            # if model is an encoder-decoder, retrieve encoder attention weights and hidden states
            if return_dict_in_generate and self.config.is_encoder_decoder:
                encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None
                encoder_hidden_states = (
                    model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None
                )
        
            # keep track of which sequences are already finished
            if not ignore_eos:
                unfinished_sequences = torch.ones(input_ids.shape[0], dtype=torch.long, device=input_ids.device)
        
            hb_profer = HabanaProfile(warmup=profiling_warmup_steps, active=profiling_steps)
            hb_profer.start()
            this_peer_finished = False  # used by synced_gpus only
    >       bucket_size = model_kwargs["bucket_size"]
    E       KeyError: 'bucket_size'
    
    ../../../../../optimum/habana/transformers/generation/utils.py:1252: KeyError

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. clone optimum-habana
    2. pip install pytest
    3. cd optimum-habana/transformers/tests/models/t5
    4. python -m pytest -vs test_modeling_t5.py::T5ModelTest::test_greedy_generate

    Expected behavior

    Test case should pass without errors

    accelerate llama inference in TGI

    TGI only supports the models listed at https://github.com/huggingface/optimum-habana/blob/main/text-generation-inference/server/text_generation_server/models/causal_lm.py#L25 ("bloom", "gpt2", "gptj", "gpt_neox", "opt") with static shapes. Llama static shapes are now supported in the optimum-habana main branch. When will the next tag of optimum-habana be released, so that TGI can benefit from the Llama acceleration?
    Currently, TGI is installed with optimum-habana 1.6.1 (the latest tag is 1.6.1).

    Reproduction

    Launch TGI with a Llama model and run a text-generation job from the client.

    Expected behavior

    The Llama model should run successfully with better throughput.

    RoBERTa large 8x run failed

    With the following command, roberta-large fails on 8x:

    python ../gaudi_spawn.py --world_size 8 --use_mpi run_qa.py --model_name_or_path roberta-large --gaudi_config_name ../gaudi_config.json --dataset_name squad --do_train --do_eval --per_device_train_batch_size 12 --per_device_eval_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir ./roberta_large_8x_bf16_lazy --use_habana --use_lazy_mode

    To make the issue easier to reproduce, add the following flag:
    --save_steps 5

    It is related to the save step; we need to find out which part of saving is responsible:
    configuration or checkpoint, tokenizer config, special tokens

    Compute Accuracy in clip-roberta

    Feature request

    Is it possible to include an accuracy metric when training clip-roberta?

    Motivation

    We would like to have something other than loss to track during training and evaluation.

    Your contribution

    I tried creating a dummy compute_metrics function to pass to GaudiTrainer, as follows.

    import evaluate

    metric = evaluate.load("accuracy")
    def compute_metrics(p):
        # dummy metric function, just to check that compute_metrics can be passed to GaudiTrainer
        return 1
    

    I get the following error:

    Traceback (most recent call last):
      File "run_clip.py", line 553, in <module>
        main()
      File "run_clip.py", line 532, in main
        metrics = trainer.evaluate()
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2932, in evaluate
        output = eval_loop(
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/transformers/trainer.py", line 1074, in evaluation_loop
        logits_dtype = get_dtype(logits)
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/transformers/trainer_utils.py", line 43, in get_dtype
        return [get_dtype(logits_tensor) for logits_tensor in logits]
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/transformers/trainer_utils.py", line 43, in <listcomp>
        return [get_dtype(logits_tensor) for logits_tensor in logits]
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/transformers/trainer_utils.py", line 45, in get_dtype
        raise TypeError(f"logits should be of type torch.Tensor or tuple, got {type(logits)} which is not supported")
    TypeError: logits should be of type torch.Tensor or tuple, got <class 'transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions'> which is not supported
    

    Multi-node distributed training on Ray fails.

    System Info

    Docker image: vault.habana.ai/gaudi-docker/1.12.1/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
    Habana version: 1.12.1
    Deepspeed: https://github.com/HabanaAI/[email protected]
    optimum-habana: pip install --upgrade-strategy eager optimum[habana]

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    I didn't use gaudi_spawn.py to launch the distributed training of run_clm.py; I used the Ray TorchTrainer to run the GaudiTrainer.
    Single-node multi-card training works, but multi-node multi-card training fails. Can you help provide some clues?

    Logs:

    (RayTrainWorker pid=19045) [2023-11-02 13:17:59,227 E 19045 21937] logging.cc:97: Unhandled exception: N3c105ErrorE. what(): Collective call returned error
    (RayTrainWorker pid=19045) Exception raised from operator() at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/process_group_hccl_base.cpp:286 (most recent call first):
    (RayTrainWorker pid=19045) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ee2d91f557c in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
    (RayTrainWorker pid=19045) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x84 (0x7ee2d91bb220 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
    (RayTrainWorker pid=19045) frame #2: <unknown function> + 0x3497f (0x7ee22db4c97f in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so)
    (RayTrainWorker pid=19045) frame #3: <unknown function> + 0x5b0e9 (0x7ee22db730e9 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so)
    (RayTrainWorker pid=19045) frame #4: habana_helpers::JobThread::threadFunction() + 0x128 (0x7ee236356578 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
    (RayTrainWorker pid=19045) frame #5: <unknown function> + 0xd6df4 (0x7f124793edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
    (RayTrainWorker pid=19045) frame #6: <unknown function> + 0x8609 (0x7f1247c15609 in /lib/x86_64-linux-gnu/libpthread.so.0)
    (RayTrainWorker pid=19045) frame #7: clone + 0x43 (0x7f1247d4f133 in /lib/x86_64-linux-gnu/libc.so.6)
    (RayTrainWorker pid=19045)
    (RayTrainWorker pid=19045) [2023-11-02 13:17:59,234 E 19045 21937] logging.cc:104: Stack trace:
    (RayTrainWorker pid=19045)  /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xf2e81a) [0x7f1246a9a81a] ray::operator<<()
    (RayTrainWorker pid=19045) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xf30fd8) [0x7f1246a9cfd8] ray::TerminateHandler()
    (RayTrainWorker pid=19045) /usr/lib/habanalabs/libhl_logger.so(+0x1d45a) [0x7ee23430f45a]
    (RayTrainWorker pid=19045) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f124791237c]
    (RayTrainWorker pid=19045) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f12479123e7]
    (RayTrainWorker pid=19045) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa699) [0x7f1247912699]
    (RayTrainWorker pid=19045) /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so(_ZN3c106detail14torchCheckFailEPKcS2_jS2_+0xac) [0x7ee2d91bb248] c10::detail::torchCheckFail()
    (RayTrainWorker pid=19045) /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(+0x3497f) [0x7ee22db4c97f] std::_Function_handler<>::_M_invoke()
    (RayTrainWorker pid=19045) /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(+0x5b0e9) [0x7ee22db730e9] c10d::ProcessGroupHCCL::collective()::{lambda()#3}::operator()()
    (RayTrainWorker pid=19045) /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(_ZN14habana_helpers9JobThread14threadFunctionEv+0x128) [0x7ee236356578] habana_helpers::JobThread::threadFunction()
    (RayTrainWorker pid=19045) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f124793edf4]
    (RayTrainWorker pid=19045) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f1247c15609] start_thread
    (RayTrainWorker pid=19045) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f1247d4f133] __clone
    (RayTrainWorker pid=19045)
    (RayTrainWorker pid=19045) Internal Error: Received signal - Aborted
    (RayTrainWorker pid=19045) *** SIGABRT received at time=1698931079 on cpu 27 ***
    (RayTrainWorker pid=19045) PC: @     0x7f1247c7300b  (unknown)  raise
    (RayTrainWorker pid=19045)     @     0x7ee23430fa3b  (unknown)  signalHandler()
    (RayTrainWorker pid=19045) [2023-11-02 13:17:59,236 E 19045 21937] logging.cc:361: *** SIGABRT received at time=1698931079 on cpu 27 ***
    (RayTrainWorker pid=19045) [2023-11-02 13:17:59,236 E 19045 21937] logging.cc:361: PC: @     0x7f1247c7300b  (unknown)  raise
    (RayTrainWorker pid=19045) [2023-11-02 13:17:59,236 E 19045 21937] logging.cc:361:     @     0x7ee23430fa3b  (unknown)  signalHandler()
    (RayTrainWorker pid=19045) Fatal Python error: Aborted
    (RayTrainWorker pid=19045)
    (RayTrainWorker pid=19046) /home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/interfaces/hcl_idevice.cpp::423(allocateConnection): The condition [ isNicUp(port) ] failed. Port(22) is DOWN, can't allocate connection  [repeated 15x across cluster]
    
    
    
    

    Expected behavior

    Multi-node distributed training should work on Ray.

    clip-vit-large-patch14 image classification support

    Feature request

    I am trying to run the following model using the optimum-habana repository: https://huggingface.co/openai/clip-vit-large-patch14. Do you have any suggestions for fine-tuning this model with the existing optimum-habana code? If not, could we create a script or amend a current script so this model can be supported?

    Motivation

    We would like to train this particular model on Habana hardware.

    Your contribution

    I tried using the existing image_classification script because it supports regular ViT; I used the command from the image classification README. However, I found that the AutoModelForImageClassification class invoked there does not support the CLIP config (it is not in the list of supported configs). So I tried swapping out this class for the generic AutoModel class, where CLIPConfig is supported. I get the following error with that change:

    hmp:opt_level O1
    Traceback (most recent call last):
    File "run_image_classification.py", line 410, in
    main()
    File "run_image_classification.py", line 384, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    File "/root/optimum/habana/transformers/trainer.py", line 397, in train
    return inner_training_loop(
    File "/root/optimum/habana/transformers/trainer.py", line 500, in _inner_training_loop
    self._load_optimizer_and_scheduler(resume_from_checkpoint)
    File "/root/optimum/habana/transformers/trainer.py", line 960, in _load_optimizer_and_scheduler
    self.optimizer.load_state_dict(
    File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 201, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
    ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

    Upon further digging, I also see that CLIP may still be a work in progress on the Transformers side: I do not see _MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES used anywhere. Let me know if this feature enablement is possible; I would be happy to work on it with some direction.
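    In the meantime, a hedged sketch of one possible approach (not an officially validated path): wrap the CLIP vision tower with a small classification head and train that wrapper with GaudiTrainer. The class name and num_labels below are illustrative assumptions.

    # Hedged sketch: CLIP vision encoder + linear head for image classification.
    from torch import nn
    from transformers import CLIPVisionModel

    class ClipForImageClassification(nn.Module):
        def __init__(self, checkpoint="openai/clip-vit-large-patch14", num_labels=10):
            super().__init__()
            self.vision = CLIPVisionModel.from_pretrained(checkpoint)
            self.classifier = nn.Linear(self.vision.config.hidden_size, num_labels)
            self.loss_fct = nn.CrossEntropyLoss()

        def forward(self, pixel_values, labels=None):
            pooled = self.vision(pixel_values=pixel_values).pooler_output
            logits = self.classifier(pooled)
            loss = self.loss_fct(logits, labels) if labels is not None else None
            # Returning a dict lets Trainer-style code pick up "loss" and "logits".
            return {"loss": loss, "logits": logits}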

    'meta-llama/Llama-2-7b-hf' tests fail with Authentication failure.

    System Info

    optimum-habana - 1.8.0.dev0
    Synapse version - 1.13.0-90
    Docker - 1.13.0-90

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. git clone https://github.com/huggingface/optimum-habana.git
    2. cd optimum-habana
    3. git checkout main
    4. pip install .
    5. huggingface-cli login --token
    6. cd tests
    7. pytest -s -k test_text_generation_bf16[token0-meta-llama/Llama-2-7b-hf-43.951804139391925]

    Expected behavior

    This is a test inside optimum-habana that runs a Llama model. It is expected to run without any issues.
    But authorization fails even after logging into Hugging Face using a token generated with a personal account.
    A screenshot showing the authorization error is attached (llama_gated_repo).

    RuntimeError: Device acquire failed. in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/hpu/__init__.py"

    System Info

    Optimum Habana : 1.6.0
    SynapseAI : 1.4.0
    Docker Image : Deep Learning AMI Habana PyTorch 1.10.2 SynapseAI 1.4.0 (Ubuntu 20.04) 20220425

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. Start an EC2 instance with a DL1 resource and this image:
    2. Install "Deep Learning AMI Habana PyTorch 1.10.2 SynapseAI 1.4.0 (Ubuntu 20.04) 20220425"
    3. Run these commands:
      a. docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
      b. git clone https://github.com/huggingface/optimum-habana.git
      c. cd optimum-habana && python setup.py install
      d. cd examples
      e. cd text-generation
      f. pip install -r requirements.txt
      g. python run_generation.py \
           --model_name_or_path gpt2 \
           --use_hpu_graphs \
           --use_kv_cache \
           --max_new_tokens 100 \
           --do_sample \
           --prompt "Tell me a poem about stone and water"

    Expected behavior

    Should get the desired text (poem about stone and water)

    Error: "Getting size for given data type is not supported" while fine-tuning the starcoder model on optimum-habana

    System Info

    Hello Team,
    We are trying to fine-tune the bigcode/starcoderbase-7b model on a multi-HPU (8 HPU) node and have been following the guidance at https://github.com/huggingface/optimum-habana/tree/main/examples/language-modeling .

    However, we are encountering an issue similar to the one mentioned in #318.

    We are also using a custom ConstantLengthDataset(IterableDataset) class. Essentially, we are trying to port https://github.com/bigcode-project/starcoder/blob/main/finetune/finetune.py to Habana, using from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments in the appropriate places (see the sketch below).
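    A minimal sketch of that swap (illustrative only: the batch size, max_steps, and gaudi_config name are assumptions, and the dataset placeholder stands for the custom ConstantLengthDataset instance):

    # Hedged sketch of the Trainer -> GaudiTrainer swap described above.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from optimum.habana import GaudiTrainer, GaudiTrainingArguments

    model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-7b")
    tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase-7b")
    train_dataset = ...  # placeholder for the custom ConstantLengthDataset(IterableDataset)

    training_args = GaudiTrainingArguments(
        output_dir="./starcoder-finetuned",
        per_device_train_batch_size=1,
        max_steps=1000,                   # an IterableDataset has no __len__, so max_steps is required
        use_habana=True,                  # run on HPU
        use_lazy_mode=True,               # Gaudi lazy execution mode
        gaudi_config_name="Habana/gpt2",  # placeholder Gaudi config
    )

    GaudiTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
    ).train()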

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Training...
    Training...
    Training...
    terminate called after throwing an instance of 'c10::Error'
      what():  Getting size for given data type is not supported: 0
    Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ff0b09bd53c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7ff0b098310c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
    frame #2: <unknown function> + 0x544ea (0x7ff0b02f84ea in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/distributed/_hccl_C.so)
    frame #3: habana_helpers::JobThread::threadFunction() + 0x128 (0x7ff020da6ae8 in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
    frame #4: <unknown function> + 0xd6df4 (0x7ff0b47dedf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
    frame #5: <unknown function> + 0x8609 (0x7ff0b4ab5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
    frame #6: clone + 0x43 (0x7ff0b4bef133 in /lib/x86_64-linux-gnu/libc.so.6)
    
    Internal Error: Received signal - Aborted
    terminate called after throwing an instance of 'c10::Error'
      what():  Getting size for given data type is not supported: 0
    Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f881daf453c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f881daba10c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
    frame #2: <unknown function> + 0x544ea (0x7f88161634ea in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/distributed/_hccl_C.so)
    frame #3: habana_helpers::JobThread::threadFunction() + 0x128 (0x7f8816ee3ae8 in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
    frame #4: <unknown function> + 0xd6df4 (0x7f8822904df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
    frame #5: <unknown function> + 0x8609 (0x7f8822bdb609 in /lib/x86_64-linux-gnu/libpthread.so.0)
    frame #6: clone + 0x43 (0x7f8822d15133 in /lib/x86_64-linux-gnu/libc.so.6)
    
    Internal Error: Received signal - Aborted
    terminate called after throwing an instance of 'c10::Error'
      what():  Getting size for given data type is not supported: 0
    Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
    ...
    
    Internal Error: Received signal - Aborted
    --------------------------------------------------------------------------
    Primary job  terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun noticed that process rank 2 with PID 0 on node idc382 exited on signal 6 (Aborted).

    Expected behavior

    We should be able to complete the training loop without issues. We did try to add a fake __len__ method inside the ConstantLengthDataset(IterableDataset) class, but it still failed:

    def __len__(self):
        return 10
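    For context, the workaround sits roughly like this (a hedged sketch; the real ConstantLengthDataset from the starcoder finetune script has more logic than shown here):

    # Hedged sketch of the workaround: give the iterable dataset a (fake) __len__
    # so utilities that expect a sized dataset do not complain. It did not resolve
    # the HCCL "getting size for data type" error above.
    from torch.utils.data import IterableDataset

    class ConstantLengthDataset(IterableDataset):
        def __init__(self, token_chunks):
            self.token_chunks = token_chunks  # pre-tokenized, fixed-length chunks

        def __iter__(self):
            yield from self.token_chunks

        def __len__(self):
            return 10  # fake length added for the workaround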
    

    But at the same time, I see the following observations:

    • We cannot run the starcoder-7B model on 1 HPU due to OOM.
    • We can run the 3B model on 1 HPU with no issue fetching the dataset length.
    • We cannot run the 3B model on 8 HPUs (in fact, on more than 1 HPU); it fails with the same "getting size for data type" issue.

    So the issue arises whenever we shift to multiple HPUs, i.e. distributed training on more than 1 HPU.

    A single Gaudi2 HPU has 96 GB of device memory.

    return super().__torch_function__(func, types, new_args, kwargs) RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::3288334336 (3136)MB

    System Info

    Optimum Habana : 1.6.0
    SynapseAI : 1.10.0
    Docker Image : Habana® Deep Learning Base AMI (Ubuntu 20.04)
    Volume : 1000 GiB

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Start an EC2 instance with DL1 Resource and this image : Habana® Deep Learning Base AMI (Ubuntu 20.04)
    Run these commands
    a. docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
    b. git clone https://github.com/huggingface/optimum-habana.git
    c. pip install optimum[habana]
    d. cd examples
    e. cd text-generation
    f. pip install -r requirements.txt
    g. python run_generation.py \
         --model_name_or_path bigscience/bloom \
         --use_hpu_graphs \
         --use_kv_cache \
         --max_new_tokens 100 \
         --do_sample \
         --prompt "Tell me a poem about stone and water"

    Expected behavior

    It should give a sample poem rather than an error

    Repo card metadata block was not found. Setting CardData to empty

    System Info

    optimum-habana          1.8.0
    docker vault.habana.ai/gaudi-docker/1.12.0/ubuntu22.04/habanalabs/pytorch-installer-2.0.1:latest
    Synapse Version 1.12.0-480 
    
    https://github.com/huggingface/optimum-habana/tree/ee5e8fc39e78800eb3763d048192bef036fadc4c/examples/contrastive-image-text
    
    The step in the README fails with "Repo card metadata block was not found. Setting CardData to empty".
    The dataset validation step is pasted below.
    
    import os
    import datasets
    
    COCO_DIR = os.path.join(os.getcwd(), "data")
    ds = datasets.load_dataset("ydshieh/coco_dataset_script", "2017", data_dir=COCO_DIR)

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Just launch the Docker container and follow the README steps; you will see this issue.

    Expected behavior

    I assume the dataset should get loaded; instead, I just see no response.

    Loading from flax checkpoint to resume training with pytorch

    @regisss

    I am trying to resume training from a Flax checkpoint and continue in PyTorch. I changed the following lines of code in the file pytorch/question-answering/run_qa.py:

    model = AutoModelForQuestionAnswering.from_pretrained(
        model_args.model_name_or_path,
        from_flax=True,  # was: from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )
    

    I get the following error

    Traceback (most recent call last):
    File "run_qa.py", line 652, in
    main()
    File "run_qa.py", line 593, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    File "/venv_bert/lib/python3.6/site-packages/transformers/trainer.py", line 1170, in train
    raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
    ValueError: Can't find a valid checkpoint at transformers/examples/flax/question-answering/bert-qa-squad/

    I have tried this on both optimum-habana and the main huggingface/transformers repo. Any advice?
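    One hedged workaround sketch (not verified on this setup): convert the Flax checkpoint to a PyTorch checkpoint once with from_flax=True and save_pretrained, then point --model_name_or_path at the converted directory instead of passing the Flax directory to resume_from_checkpoint, which appears to expect a full Trainer checkpoint (with optimizer/scheduler state) rather than a bare Flax model directory. The directory names below are placeholders taken from the traceback above.

    # Hedged one-time conversion sketch (requires flax to be installed);
    # the directory names are placeholders taken from the traceback above.
    from transformers import AutoConfig, AutoModelForQuestionAnswering

    flax_dir = "transformers/examples/flax/question-answering/bert-qa-squad/"
    config = AutoConfig.from_pretrained(flax_dir)
    model = AutoModelForQuestionAnswering.from_pretrained(flax_dir, from_flax=True, config=config)
    model.save_pretrained("bert-qa-squad-pt")  # writes a PyTorch checkpoint + config.json

    # Then run run_qa.py with --model_name_or_path bert-qa-squad-pt and without
    # --resume_from_checkpoint, since there is no optimizer/scheduler state to resume.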

    Beam search transformers test cases are failing with KeyError: 'limit_hpu_graphs'

    System Info

    After the recent integration of Transformers test cases into optimum-habana, it was observed that several beam-search-related test cases are failing with the following error. The test case below is for T5, but several other language-modeling models such as GPT2, GPT-J, GPT-NeoX, etc. invoke the same test case, so they fail as well.
    
    Logs : 
     <pt> (conda_qnpu1) (anneog_transformers_tests_updates) anneog@anneog-vm-u20:t5 $ python -m pytest -vs test_modeling_t5.py::T5ModelTest::test_beam_search_generate
    ============================================================================================= test session starts =============================================================================================
    platform linux -- Python 3.8.18, pytest-7.4.2, pluggy-1.3.0 -- /home/anneog/anaconda3/envs/conda_qnpu1/bin/python
    cachedir: .pytest_cache
    rootdir: /home/anneog/github/ankurneog/optimum-habana
    configfile: setup.cfg
    collecting ... [WARNING|utils.py:179] 2023-10-04 06:56:49,584 >> optimum-habana v1.8.0.dev0 has been validated for SynapseAI v1.11.0 but habana-frameworks v1.13.0.133 was found, this could lead to undefined behavior!
    [WARNING|utils.py:196] 2023-10-04 06:56:49,606 >> Could not run `hl-smi`, please follow the installation guide: https://docs.habana.ai/en/latest/Installation_Guide/index.html.
    collected 1 item                                                                                                                                                                                              
    
    test_modeling_t5.py::T5ModelTest::test_beam_search_generate ============================= HABANA PT BRIDGE CONFIGURATION =========================== 
     PT_HPU_LAZY_MODE = 1
     PT_RECIPE_CACHE_PATH = 
     PT_CACHE_FOLDER_DELETE = 0
     PT_HPU_RECIPE_CACHE_CONFIG = 
     PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
     PT_HPU_LAZY_ACC_PAR_MODE = 1
     PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
    ---------------------------: System Configuration :---------------------------
    Num CPU Cores : 8
    CPU RAM       : 40852220 KB
    ------------------------------------------------------------------------------
    FAILED
    
    ================================================================================================== FAILURES ===================================================================================================
    ____________________________________________________________________________________ T5ModelTest.test_beam_search_generate ____________________________________________________________________________________
    
    self = <tests.models.t5.test_modeling_t5.T5ModelTest testMethod=test_beam_search_generate>
    
        def test_beam_search_generate(self):
            for model_class in self.all_generative_model_classes:
                config, input_ids, attention_mask, max_length = self._get_input_ids_and_config()
        
                # It is important set set the eos_token_id to None to ensure that no sequences
                # shorter than `max_length` can be generated which could lead to flaky circle ci
                # failures if the top `num_return_sequences` beams are all shorter than the longest beam
                config.eos_token_id = None
                config.forced_eos_token_id = None
        
                model = model_class(config).to(torch_device).eval()
                if model.config.is_encoder_decoder:
                    max_length = 4
        
                logits_process_kwargs, logits_processor = self._get_logits_processor_and_kwargs(
                    input_ids.shape[-1],
                    config.eos_token_id,
                    config.forced_bos_token_id,
                    config.forced_eos_token_id,
                    max_length,
                )
                beam_kwargs, beam_scorer = self._get_beam_scorer_and_kwargs(input_ids.shape[0], max_length)
        
                # check `generate()` and `beam_search()` are equal
    >           output_generate, output_beam_search = self._beam_search_generate(
                    model=model,
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    max_length=max_length,
                    beam_scorer=beam_scorer,
                    beam_kwargs=beam_kwargs,
                    logits_process_kwargs=logits_process_kwargs,
                    logits_processor=logits_processor,
                )
    
    ../../generation/test_utils.py:881: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    ../../generation/test_utils.py:422: in _beam_search_generate
        output_beam_search = model.beam_search(
    ../../../../../optimum/habana/transformers/generation/utils.py:1995: in beam_search
        hpu_graphs_kwargs = self._get_hpu_graphs_kwargs(model_kwargs)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    
    self = T5ForConditionalGeneration(
      (shared): Embedding(99, 32)
      (encoder): T5Stack(
        (embed_tokens): Embedding(99, 32)
    ...m()
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (lm_head): Linear(in_features=32, out_features=99, bias=False)
    )
    model_kwargs = {'encoder_outputs': BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[-4.7808e-04, -6.3646e-04, -2...    grad_fn=<IndexSelectBackward0>), past_key_values=None, hidden_states=None, attentions=None, cross_attentions=None)}
    
        def _get_hpu_graphs_kwargs(self, model_kwargs):
            hpu_graphs_kwargs = {}
    >       if model_kwargs["limit_hpu_graphs"]:
    E       KeyError: 'limit_hpu_graphs'
    
    ../../../../../optimum/habana/transformers/generation/utils.py:141: KeyError
    
    @p9olisettyvarma, could you have a look? I think we should modify the code so that the key is not accessed when it is not present in the dictionary, e.g. check whether the key is in model_kwargs and otherwise return hpu_graphs_kwargs with default values (see the sketch below).
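    A hedged sketch of that guard (the surrounding logic in optimum/habana/transformers/generation/utils.py is elided and may differ):

    # Use dict.get() with a default instead of indexing, so callers that never
    # set "limit_hpu_graphs" (e.g. the transformers tests) do not hit KeyError.
    def _get_hpu_graphs_kwargs(self, model_kwargs):
        hpu_graphs_kwargs = {}
        if model_kwargs.get("limit_hpu_graphs", False):  # default: HPU graphs not limited
            ...  # existing logic for the limited-HPU-graphs case stays unchanged
        return hpu_graphs_kwargs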

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. Clone optimum-habana
    2. pip install pytest
    3. cd optimum-habana/tests/transformers/tests/model/t5
    4. python -m pytest -vs test_modeling_t5.py::T5ModelTest::test_beam_search_generate

    Expected behavior

    The test should pass.

    Inconsistent argument bf16/bf16_full_eval with json file

    With gaudi_config.json

    {
      "execution_mode": "lazy",
      "use_habana_mixed_precision": true,
      "world_size": 8,
      "use_fused_adam": true,
      "use_fused_clip_norm": true
    }

    The bf16/bf16_full_eval arguments are still False:

    04/05/2022 15:19:13 - INFO - main - Training/evaluation parameters GaudiTrainingArguments(
    _n_gpu=1,
    adafactor=False,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    bf16=False,
    bf16_full_eval=False,
    dataloader_drop_last=False,
    dataloader_num_workers=0,
    dataloader_pin_memory=True,
    ddp_bucket_cap_mb=None,

    Several links in the doc are broken

    For example https://huggingface.co/docs/optimum.habana/main/en/trainer#optimum.habana.GaudiTrainer in https://huggingface.co/docs/optimum/main/en/habana_single_hpu
    Or https://github.com/huggingface/optimum.habana/blob/main/optimum/habana/trainer.py#L97 in https://huggingface.co/docs/optimum/main/en/habana_trainer#optimum.habana.GaudiTrainer

    The first one seems to be broken because it uses optimum.habana instead of optimum, and the path main/en/trainer instead of main/en/habana_trainer.

    The second one is broken because it uses optimum.habana instead of optimum-habana in the GitHub path.

    @regisss

    Adding profiling

    Feature request

    Habana supports the PyTorch profiler, but currently, if users want to figure out the bottleneck of training or inference, they have to modify the source code (adding torch.profiler into GaudiTrainer or generate), which is not a good user experience.
    If we supported torch.profiler in optimum-habana, users could generate profiling data simply with --do_profiling 10, where:

    • the default value of do_profiling is 0, which means no profiling data is generated
    • 10 is the number of steps or iterations that will be captured

    Refer to profiling-with-pytorch. A minimal sketch of the underlying torch.profiler usage is shown below.
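    A hedged sketch of the torch.profiler usage that the proposed --do_profiling flag would wrap (the flag itself does not exist yet; the function and directory names below are illustrative):

    # Hedged sketch: profile a fixed number of steps and dump a TensorBoard trace.
    from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

    def run_with_profiling(step_fn, dataloader, do_profiling=10, logdir="./hpu_profile"):
        prof_schedule = schedule(wait=0, warmup=1, active=do_profiling, repeat=1)
        with profile(
            activities=[ProfilerActivity.CPU],  # an HPU activity may also be available via habana_frameworks
            schedule=prof_schedule,
            on_trace_ready=tensorboard_trace_handler(logdir),
        ) as prof:
            for step, batch in enumerate(dataloader):
                step_fn(batch)   # one training or inference step
                prof.step()      # advance the profiler schedule
                if step >= do_profiling:
                    break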

    Motivation

    • easy to address the bottleneck of model both for training and inference

    Your contribution

    submit a PR

    Does it make sense to also provide an option of max input tokens for text generation ?

    Feature request

    Right now we tokenize as below, and the input IDs are padded based on the input sentences in the batch.
    input_tokens = tokenizer.batch_encode_plus(input_sentences, return_tensors="pt", padding=True)

    If the next batch of data exceeds the previous input token length, it will trigger recompilation.

    Would it not be better to also introduce a max input length argument and tokenize as below, padding to max_length with truncation set to True:

    input_tokens = tokenizer.batch_encode_plus(batch_sentences, padding='max_length', truncation=True, max_length=args.max_input_length)

    Motivation

    Avoid graph recompilations.

    Your contribution

    Yes, if we agree.

    Fine-tuning BERT model without Trainer

    Hello,

    I have a custom model that I've incorporated BERT into. Is it possible to train this model using a normal training loop?

    Example:

    def training_loop(dataloader, model1):
        device = torch.device('hpu')
        model1 = model1.to(device)
        model2 = AutoModel.from_pretrained('bert-base-uncased').to(device)
        custom_model = some_wrapper(model1, model2)
        for batch in dataloader:
            batch = batch.to(device)
            output = custom_model(batch)
        
        ...
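    For reference, a hedged sketch of what such a loop typically needs on HPU in lazy mode; the wrapper below is a stand-in for the custom model, and the key HPU-specific addition is htcore.mark_step() after backward and after the optimizer step:

    # Hedged sketch of a plain training loop on HPU (lazy mode); the wrapper,
    # loss and batch layout are assumptions, not a validated recipe.
    import torch
    import habana_frameworks.torch.core as htcore  # registers 'hpu' and provides mark_step()
    from transformers import AutoModel

    class SomeWrapper(torch.nn.Module):
        """Placeholder wrapper combining the custom model and BERT; assumed to return a loss."""
        def __init__(self, model1, model2):
            super().__init__()
            self.model1, self.model2 = model1, model2

        def forward(self, **batch):
            hidden = self.model2(**batch).last_hidden_state
            return self.model1(hidden)  # assumed to return a scalar loss

    def training_loop(dataloader, model1):
        device = torch.device("hpu")
        model2 = AutoModel.from_pretrained("bert-base-uncased")
        custom_model = SomeWrapper(model1, model2).to(device)
        optimizer = torch.optim.AdamW(custom_model.parameters(), lr=2e-5)
        custom_model.train()
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = custom_model(**batch)
            loss.backward()
            htcore.mark_step()   # flush the accumulated lazy graph before the optimizer step
            optimizer.step()
            optimizer.zero_grad()
            htcore.mark_step()   # flush the optimizer update as well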
    

    htcore issue on "text-generation-inference" server with "langchain" client

    System Info

    optimum-habana                         1.6.1
    text-generation                        0.6.0
    text-generation-server                 0.9.2
    langchain                              0.0.265

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Step to reproduce:

    1. Start the text-generation-server following the instructions at:
      https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference
    2. Launch bash in the Docker container.
    3. Inside the container, run:
      pip install langchain text-generation
    4. Run the following script with Python:

    from langchain import HuggingFaceTextGenInference

    llm = HuggingFaceTextGenInference(
        inference_server_url="http://127.0.0.1:80",
        max_new_tokens=64,
        top_k=10,
        top_p=0.95,
        typical_p=0.95,
        temperature=0.01,
        repetition_penalty=1.03,
    )
    output = llm("What is Machine Learning?")

    print(output)

    5. Notice the following errors.
      Client side:

    Traceback (most recent call last):
    File "/root/langchain_client/langchain-client.py", line 12, in
    output = llm("What is Machine Learning?")
    File "/usr/local/lib/python3.10/dist-packages/langchain/llms/base.py", line 802, in call
    self.generate(
    File "/usr/local/lib/python3.10/dist-packages/langchain/llms/base.py", line 598, in generate
    output = self._generate_helper(
    File "/usr/local/lib/python3.10/dist-packages/langchain/llms/base.py", line 504, in _generate_helper
    raise e
    File "/usr/local/lib/python3.10/dist-packages/langchain/llms/base.py", line 491, in _generate_helper
    self._generate(
    File "/usr/local/lib/python3.10/dist-packages/langchain/llms/base.py", line 977, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/langchain/llms/huggingface_text_gen_inference.py", line 164, in _call
    res = self.client.generate(prompt, **invocation_params)
    File "/usr/local/lib/python3.10/dist-packages/text_generation/client.py", line 149, in generate
    raise parse_error(resp.status_code, payload)
    text_generation.errors.GenerationError: Request failed during generation: Server error: name 'htcore' is not defined

    Server side:

    Traceback (most recent call last):
    File "/usr/local/bin/text-generation-server", line 8, in
    sys.exit(app())
    File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in call
    return get_command(self)(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in call
    return self.main(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
    return _main(
    File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params) # type: ignore
    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 66, in serve
    server.serve(model_id, revision, dtype, uds_path)
    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 161, in serve
    asyncio.run(serve_inner(model_id, revision, dtype))
    File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
    File "/usr/lib/python3.10/asyncio/base_events.py", line 633, in run_until_complete
    self.run_forever()
    File "/usr/lib/python3.10/asyncio/base_events.py", line 600, in run_forever
    self._run_once()
    File "/usr/lib/python3.10/asyncio/base_events.py", line 1896, in _run_once
    handle._run()
    File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
    File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(

    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
    File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
    File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 76, in Prefill
    generations, next_batch = self.model.generate_token(batch)
    File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 557, in generate_token
    next_token_id, logprobs = next_token_chooser(all_input_ids.view(1, -1), logits[-1:, :])
    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/utils/tokens.py", line 65, in call
    scores, next_logprob = self.static_warper(scores)
    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/utils/logits_process.py", line 51, in call
    self.hpu_graph = htcore.hpu.HPUGraph()
    NameError: name 'htcore' is not defined

    Expected behavior

    The expected behavior should run smoothly and provide the output text.

    Readme suggestions

    1. Can you add "(HPU)" after Habana's Gaudi processor?
      🤗 Optimum Habana is the interface between the 🤗 Transformers library and Habana's Gaudi processor.

    2. Move the following ops to the bf16 list:
      "truediv",
      "div",
      "softmax"

    3. Add a section with the recommended training parameters for the 4 models

    No support for optimum-habana pipeline() causes error during inference for PyTorch BERT finetuned model using dtype bf16

    System Info

    optimum-habana 1.5.0
    docker version 1.9.0
    pytorch version 1.13.1

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    During inference of BERT (bert-large-uncased), fine-tuned on the Financial PhraseBank dataset with the bf16 data type, an error occurs.

    The fine-tuning on Gaudi (HPU) is done with the help of the optimum-habana library.

    The transformers library (4.28.1) and supporting libraries are installed as part of the optimum-habana installation.

    The fine-tuning works well for both data types (bf16 and fp32). Inference works well with the fp32 data type, but when inference is done with bf16, it results in an error.

    The fine-tuning code is in the finbert.py file below.

    import sys
    import subprocess
    
    subprocess.check_call([sys.executable, '-m', 'pip', 'install',
                           'numpy', 'pandas', 'scikit-learn', 'datasets', 'optimum[habana]', '--user'])
    
    import pandas as pd
    import numpy as np
    from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification
    from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from datasets import Dataset
    
    
    def load_data():
        df = pd.read_csv(
            'FinancialPhraseBank-v1.0/Sentences_50Agree.txt',
            sep='@',
            names=['sentence', 'label'],
            encoding = "ISO-8859-1")
        df = df.dropna()
        df['label'] = df['label'].map({"neutral": 0, "positive": 1, "negative": 2})
        df.head()
    
        df_train, df_test, = train_test_split(df, stratify=df['label'], test_size=0.1, random_state=42)
        df_train, df_val = train_test_split(df_train, stratify=df_train['label'],test_size=0.1, random_state=42)
    
        dataset_train = Dataset.from_pandas(df_train, preserve_index=False)
        dataset_val = Dataset.from_pandas(df_val, preserve_index=False)
        dataset_test = Dataset.from_pandas(df_test, preserve_index=False)
    
        return dataset_train, dataset_val, dataset_test
    
    
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        return {'accuracy': accuracy_score(predictions, labels)}
    
    
    def main():
        dataset_train, dataset_val, dataset_test = load_data()
    
        bert_model = AutoModelForSequenceClassification.from_pretrained('bert-large-uncased', num_labels=3)
        bert_tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased')
    
        dataset_train = dataset_train.map(lambda e: bert_tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
        dataset_val = dataset_val.map(lambda e: bert_tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
        dataset_test = dataset_test.map(lambda e: bert_tokenizer(e['sentence'], truncation=True, padding='max_length' , max_length=128), batched=True)
    
        dataset_train.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
        dataset_val.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
        dataset_test.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
    
        args = GaudiTrainingArguments(
            output_dir='temp/',
            overwrite_output_dir=True,
            evaluation_strategy='epoch',
            save_strategy='no',
            logging_strategy='epoch',
            logging_dir='logs/',
            report_to='tensorboard',
    
            learning_rate=2e-5,
            per_device_train_batch_size=8,
            per_device_eval_batch_size=4,
            num_train_epochs=5,
            weight_decay=0.01,
            metric_for_best_model='accuracy',
    
            use_habana=True,                        # use Habana device
            use_lazy_mode=True,                     # use Gaudi lazy mode
            use_hpu_graphs=True,                    # set value for hpu_graphs
            gaudi_config_name='gaudi_config.json',  # load config file
        )
    
        trainer = GaudiTrainer(
            model=bert_model,                   # the instantiated 🤗 Transformers model to be trained
            args=args,                          # training arguments, defined above
            train_dataset=dataset_train,        # training dataset
            eval_dataset=dataset_val,           # evaluation dataset
            compute_metrics=compute_metrics
        )
    
        trainer.train()   
    
    
    if __name__ == '__main__':
        main()

    It also needs a gaudi_config.json file which has details for bf16 dtype training.
    The gaudi_config.json file is:

    {
      "use_habana_mixed_precision": true,
      "hmp_is_verbose": false,
      "use_fused_adam": true,
      "use_fused_clip_norm": true,
      "hmp_bf16_ops": [
        "add",
        "addmm",
        "bmm",
        "div",
        "dropout",
        "gelu",
        "iadd",
        "linear",
        "layer_norm",
        "matmul",
        "mm",
        "rsub",
        "softmax",
        "truediv"
      ],
      "hmp_fp32_ops": [
        "embedding",
        "nll_loss",
        "log_softmax",
        "cross_entropy"
      ]
    }

    Note: Keep both the finbert.py and gaudi_config.json files in the same folder.

    Run it with the following command:
    export MASTER_ADDR="localhost"
    export MASTER_PORT="12345"
    mpirun -n 8 --bind-to core --map-by socket:PE=4 --rank-by core --report-bindings --allow-run-as-root python finbert.py

    Note: It can also be fine-tuned on 1 card for debugging purposes.

    After completing the fine-tuning with the bf16 dtype, running either inference code-1 or code-2 below results in an error.

    Inference code-1:

    import torch
    from transformers import pipeline

    # bert_model and bert_tokenizer come from the fine-tuning script above
    device = torch.device('hpu')
    pipe = pipeline("text-classification", model=bert_model, tokenizer=bert_tokenizer, device=device)
    print(pipe("Alabama Takes From the Poor and Gives to the Rich"))
    print(pipe("Economists are predicting the highest rate of employment in 15 years"))

    Inference code-2:

    import torch
    from transformers import TextClassificationPipeline

    # bert_model and bert_tokenizer come from the fine-tuning script above
    pipe = TextClassificationPipeline(model=bert_model, tokenizer=bert_tokenizer)
    pipe.device = torch.device('hpu')
    print(pipe("Alabama Takes From the Poor and Gives to the Rich"))
    print(pipe("Economists are predicting the highest rate of employment in 15 years"))

    Error seen after inference:

    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    File /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:1146, in _LazyModule._get_module(self, module_name)
       1145 try:
    -> 1146     return importlib.import_module("." + module_name, self.__name__)
       1147 except Exception as e:File /usr/lib/python3.8/importlib/__init__.py:127, in import_module(name, package)
        126         level += 1
    --> 127 return _bootstrap._gcd_import(name[level:], package, level)File <frozen importlib._bootstrap>:1014, in _gcd_import(name, package, level)File <frozen importlib._bootstrap>:991, in _find_and_load(name, import_)File <frozen importlib._bootstrap>:975, in _find_and_load_unlocked(name, import_)File <frozen importlib._bootstrap>:671, in _load_unlocked(spec)File <frozen importlib._bootstrap_external>:848, in exec_module(self, module)File <frozen importlib._bootstrap>:219, in _call_with_frames_removed(f, *args, **kwds)File /usr/local/lib/python3.8/dist-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py:56
         51 # Fused kernels
         52 # Use separate functions for each case because conditionals prevent kernel fusion.
         53 # TODO: Could have better fused kernels depending on scaling, dropout and head mask.
         54 #  Is it doable without writing 32 functions?
         55 @torch.jit.script
    ---> 56 def upcast_masked_softmax(
         57     x: torch.Tensor, mask: torch.Tensor, mask_value: torch.Tensor, scale: float, softmax_dtype: torch.dtype
         58 ):
         59     input_dtype = x.dtypeFile /usr/local/lib/python3.8/dist-packages/torch/jit/_script.py:1343, in script(obj, optimize, _frames_up, _rcb, example_inputs)
       1342     _rcb = _jit_internal.createResolutionCallbackFromClosure(obj)
    -> 1343 fn = torch._C._jit_script_compile(
       1344     qualified_name, ast, _rcb, get_default_args(obj)
       1345 )
       1346 # Forward docstringsFile /usr/local/lib/python3.8/dist-packages/torch/jit/_recursive.py:863, in try_compile_fn(fn, loc)
        862 rcb = _jit_internal.createResolutionCallbackFromClosure(fn)
    --> 863 return torch.jit.script(fn, _rcb=rcb)File /usr/local/lib/python3.8/dist-packages/torch/jit/_script.py:1343, in script(obj, optimize, _frames_up, _rcb, example_inputs)
       1342     _rcb = _jit_internal.createResolutionCallbackFromClosure(obj)
    -> 1343 fn = torch._C._jit_script_compile(
       1344     qualified_name, ast, _rcb, get_default_args(obj)
       1345 )
       1346 # Forward docstringsRuntimeError: 
    Unknown type name 'DType':
      File "/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/hpex/hmp/utils.py", line 1811
    def softmax(input: Tensor, dim: Optional[int] = None, _stacklevel: int = 3, dtype: Optional[DType] = None) -> Tensor:
                                                                                                ~~~~~ <--- HERE
        r"""Applies a softmax function.
    'softmax' is being compiled since it was called from 'upcast_masked_softmax'
      File "/usr/local/lib/python3.8/dist-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py", line 62
        x = x.to(softmax_dtype) * scale
        x = torch.where(mask, x, mask_value)
        x = torch.nn.functional.softmax(x, dim=-1).to(input_dtype)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return x
    The above exception was the direct cause of the following exception:RuntimeError                              Traceback (most recent call last)
    Cell In[26], line 3
          1 from transformers import pipeline
          2 device=torch.device('hpu')
    ----> 3 pipe = pipeline("text-classification", model=trainer.model, tokenizer=bert_tokenizer, device=device)
          4 #pipe = TextClassificationPipeline(model=bert_model, tokenizer=bert_tokenizer)
          5 #pipe = TextClassificationPipeline(model=bert_model, tokenizer=bert_tokenizer)
          6 #pipe.device=torch.device('hpu')
          8 print(pipe("Alabama Takes From the Poor and Gives to the Rich"))File /usr/local/lib/python3.8/dist-packages/transformers/pipelines/__init__.py:979, in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
        976 if device is not None:
        977     kwargs["device"] = device
    --> 979 return pipeline_class(model=model, framework=framework, task=task, **kwargs)File /usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_classification.py:85, in TextClassificationPipeline.__init__(self, **kwargs)
         82 def __init__(self, **kwargs):
         83     super().__init__(**kwargs)
    ---> 85     self.check_model_type(
         86         TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING
         87         if self.framework == "tf"
         88         else MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING
         89     )File /usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py:942, in Pipeline.check_model_type(self, supported_models)
        940 if not isinstance(supported_models, list):  # Create from a model mapping
        941     supported_models_names = []
    --> 942     for config, model in supported_models.items():
        943         # Mapping can now contain tuples of models for the same configuration.
        944         if isinstance(model, tuple):
        945             supported_models_names.extend([_model.__name__ for _model in model])File /usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py:644, in _LazyAutoMapping.items(self)
        643 def items(self):
    --> 644     mapping_items = [
        645         (
        646             self._load_attr_from_module(key, self._config_mapping[key]),
        647             self._load_attr_from_module(key, self._model_mapping[key]),
        648         )
        649         for key in self._model_mapping.keys()
        650         if key in self._config_mapping.keys()
        651     ]
        652     return mapping_items + list(self._extra_content.items())File /usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py:647, in <listcomp>(.0)
        643 def items(self):
        644     mapping_items = [
        645         (
        646             self._load_attr_from_module(key, self._config_mapping[key]),
    --> 647             self._load_attr_from_module(key, self._model_mapping[key]),
        648         )
        649         for key in self._model_mapping.keys()
        650         if key in self._config_mapping.keys()
        651     ]
        652     return mapping_items + list(self._extra_content.items())File /usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py:616, in _LazyAutoMapping._load_attr_from_module(self, model_type, attr)
        614 if module_name not in self._modules:
        615     self._modules[module_name] = importlib.import_module(f".{module_name}", "transformers.models")
    --> 616 return getattribute_from_module(self._modules[module_name], attr)File /usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py:561, in getattribute_from_module(module, attr)
        559 if isinstance(attr, tuple):
        560     return tuple(getattribute_from_module(module, a) for a in attr)
    --> 561 if hasattr(module, attr):
        562     return getattr(module, attr)
        563 # Some of the mappings have entries model_type -> object of another model type. In that case we try to grab the
        564 # object at the top level.File /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:1136, in _LazyModule.__getattr__(self, name)
       1134     value = self._get_module(name)
       1135 elif name in self._class_to_module.keys():
    -> 1136     module = self._get_module(self._class_to_module[name])
       1137     value = getattr(module, name)
       1138 else:File /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:1148, in _LazyModule._get_module(self, module_name)
       1146     return importlib.import_module("." + module_name, self.__name__)
       1147 except Exception as e:
    -> 1148     raise RuntimeError(
       1149         f"Failed to import {self.__name__}.{module_name} because of the following error (look up to see its"
       1150         f" traceback):\n{e}"
       1151     ) from eRuntimeError: Failed to import transformers.models.gpt_bigcode.modeling_gpt_bigcode because of the following error (look up to see its traceback):Unknown type name 'DType':
      File "/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/hpex/hmp/utils.py", line 1811
    def softmax(input: Tensor, dim: Optional[int] = None, _stacklevel: int = 3, dtype: Optional[DType] = None) -> Tensor:
                                                                                                ~~~~~ <--- HERE
        r"""Applies a softmax function.
    'softmax' is being compiled since it was called from 'upcast_masked_softmax'
      File "/usr/local/lib/python3.8/dist-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py", line 62
        x = x.to(softmax_dtype) * scale
        x = torch.where(mask, x, mask_value)
        x = torch.nn.functional.softmax(x, dim=-1).to(input_dtype)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return x

    Expected behavior

    Inference is expected to work in bf16 just as it does with the fp32 dtype.

    The output shown below is expected:
    [{'label': 'neutral', 'score': 0.9094224572181702}]
    [{'label': 'positive', 'score': 0.9752092957496643}]

    Will Hugging Face support GLM series models (ChatGLM-6B, ChatGLM2-6B ...) in Transformers?

    Feature request

    We plan to enable the ChatGLM-6B and ChatGLM2-6B series models on HPU but found that the models are not in Transformers.

    Model definitions and weights are all on Hugging Face model card:
    ChatGLM-6B: https://huggingface.co/THUDM/chatglm-6b
    ChatGLM2-6B: https://huggingface.co/THUDM/chatglm2-6b

    Is it possible to support these models in Transformers so that optimum-habana could do some hijacking and then enable them on HPU?

    Motivation

    enable ChatGLM-6B, ChatGLM2-6B on HPU

    Your contribution

    enable ChatGLM-6B, ChatGLM2-6B on HPU

    TGI server keeps repeating short words in the output.

    System Info

    optimum-habana                         1.7.2
    text-generation                        0.6.0
    text-generation-server                 1.0.3
    langchain                              0.0.279

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Step to reproduce:

    1. Start the text-generation-server following the instructions at:
      https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference
    2. Launch bash in the Docker container.
    3. Inside the container, run:
      pip install langchain text-generation
    4. Run the following script (lclient.py) with Python:

    from langchain import HuggingFaceTextGenInference

    llm = HuggingFaceTextGenInference(
        inference_server_url="http://127.0.0.1:80",
        max_new_tokens=64,
        top_k=10,
        top_p=0.95,
        typical_p=0.95,
        temperature=0.01,
        repetition_penalty=1.03,
    )
    output = llm("What is Machine Learning?")

    print(output)

    5. Notice: on the client side, for the first run, the output is as follows, which is OK.

    root@653f39f1fd71:/text-gen/client# python lclient.py

    Learn More
    The purpose of this article is to show you how to use the machine learning algorithm to solve a problem. The goal of this article is to demonstrate that the machine learning algorithm can solve many problems. Learn More
    The first section of our book is about general computer programs and their usage. We will cover

    Then, after the first run, the output is as follows.

    root@653f39f1fd71:/text-gen/client# python lclient.py
    explain give cover give then then give then give then cover cover give explain discuss explain discuss cover then give discuss give cover then give then explain give then explain explain discuss explain cover give give discuss cover explain then discuss cover discuss then cover explain discuss then explain explain then cover give cover explain discuss cover give explain explain give cover cover discuss

    root@653f39f1fd71:/text-gen/client# python lc.py
    give give explain explain explain then cover then cover explain give then give then discuss explain cover give give give give then then explain then cover explain cover then discuss cover cover explain then give cover then cover then discuss then cover cover give explain discuss then discuss cover cover cover explain explain explain then then explain give give cover discuss give give then

    root@653f39f1fd71:/text-gen/client# python lc.py
    cover give discuss explain explain cover then explain then explain give discuss explain discuss discuss discuss give give then then give cover cover explain cover then give then then give then discuss discuss then discuss explain then explain then give give then explain explain cover discuss discuss cover discuss discuss discuss discuss then discuss cover discuss then explain discuss give then cover give discuss

    Expected behavior

    The second client request should yield similar expected output to the first.

    Adaptive output and contextual dialogue capabilities of text-generation-inference

    System Info

    System Info
    HL-SMI Version: hl-1.11.0-fw-45.1.1.1
    Driver Version: 1.11.0-e6eb0fd

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    I deploy the Llama-2-7b-chat-hf model through text-generation-inference, but there is no adaptive output when using the following command; instead, the input and output sizes are always max_new_tokens.

    curl 127.0.0.1:8080/generate_stream -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":200}}'     -H 'Content-Type: application/json'
    

    Also, how can chat functionality with context be implemented? Similar to GPT-4, which adaptively outputs appropriate content and can carry on a dialogue with context.
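    For the dialogue-with-context part, a hedged sketch of what clients usually do with the /generate endpoint: keep the history client-side and resend it with each request. The [INST] formatting below follows the Llama-2 chat convention and is an assumption, not something the server enforces; the URL reuses the port from the command above.

    # Hedged sketch: keep dialogue context client-side and resend it on each call.
    import requests

    TGI_URL = "http://127.0.0.1:8080/generate"
    history = []  # list of (user, assistant) turns

    def chat(user_message, max_new_tokens=200):
        prompt = ""
        for user, assistant in history:
            prompt += f"[INST] {user} [/INST] {assistant} "
        prompt += f"[INST] {user_message} [/INST]"
        resp = requests.post(
            TGI_URL,
            json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
            timeout=300,
        )
        answer = resp.json()["generated_text"]
        history.append((user_message, answer))
        return answer

    print(chat("What is Deep Learning?"))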

    Expected behavior

    1. adaptive output
    2. dialogue with context

    StableDiffusion v2.1 produces incorrect images

    System Info

    diffusers==0.23.1
    habana-torch-plugin==1.13.0.463
    optimum==1.14.1
    optimum-habana==1.8.1
    transformers==4.34.1
    optimum-habana repo (examples) on main (which is c7eb594aa9eaf45ef8e4ac8f4a20d0038be50aa6)

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. git clone https://github.com/huggingface/optimum-habana.git /root/optimum-habana
    2. pip install optimum[habana]
    3. cd /root/optimum-habana/examples/stable-diffusion/
    4. python text_to_image_generation.py --model_name_or_path stabilityai/stable-diffusion-2-1 --prompts "a professional photograph of an astronaut riding a horse" --num_images_per_prompt 4 --batch_size 1 --height 768 --width 768 --image_save_dir stable_diffusion_images --use_habana --use_hpu_graphs --gaudi_config Habana/stable-diffusion-2

    Expected behavior

    Generated images are incorrect. An astronaut riding a horse is expected, but the first output image shows what appears to be a construction site instead.

    setup.py needs to be updated to 0.22.0 for accelerate

    System Info

    optimum habana: 1.8.0-dev0
    transformers: 4.32.1

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. pip install .
    2. cd examples/text-generation
    3. pip install -r requirements.txt
    4. python run_generation.py
      --model_name_or_path gpt2
      --use_hpu_graphs
      --use_kv_cache
      --max_new_tokens 100
      --do_sample
      --prompt "Here is my prompt"

    Expected behavior

    The script should run successfully, but the current result is the error "cannot import name 'AutocastKwargs' from ...

    Example failing on AWS Habana instance - cannot import name cached_path

    When training a model from scratch, configuration values may be overridden with the help of --config_overrides:

    python run_clm.py \
        --model_type gpt2 \
        --tokenizer_name gpt2 \
        --config_overrides="n_embd=1024,n_head=16,n_layer=48,n_positions=102" \
        --use_habana \
        --use_lazy_mode \
        --gaudi_config_name Habana/gpt2 \
        --throughput_warmup_steps 2

    output:

    Traceback (most recent call last):
      File "run_clm.py", line 36, in <module>
        from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/__init__.py", line 19, in <module>
        from .gaudi_configuration import GaudiConfig
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/gaudi_configuration.py", line 17, in <module>
        from optimum.configuration_utils import BaseConfig
      File "/usr/local/lib/python3.8/dist-packages/optimum/configuration_utils.py", line 25, in <module>
        from transformers.file_utils import cached_path, get_list_of_files, hf_bucket_url, is_offline_mode, is_remote_url
    ImportError: cannot import name 'cached_path' from 'transformers.file_utils' (/home/ubuntu/.local/lib/python3.8/site-packages/transformers/file_utils.py)
    Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
    Exiting Application
    ################################################################################
    Stack trace:
    ################################################################################
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7f954e4c4f06]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7f954e4bc8e5]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7f954e3e1e09]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f954e4c5a3d]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7f954e3df948]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f954e4c5a3d]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7f954e39ab46]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7f954ddff46a]
    /lib/x86_64-linux-gnu/libc.so.6(+0x468a7) [0x7f954fd7f8a7]
    /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7f954fd7fa60]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7f954fd5d08a]
    python(_start+0x2e) [0x5fc5fe]
    Aborted (core dumped)
    

    Add support for max_length in run_generation

    System Info

    This was caught while executing the Transformers unit tests for optimum-habana:
    https://github.com/huggingface/optimum-habana/blob/main/tests/transformers/tests/models/gpt2/test_modeling_gpt2.py
    
    Several test cases are failing because the config in the tests sets max_length for text generation rather than max_new_tokens.
    
    Hence text generation fails for decoder-only models due to this check (a sketch of one possible fix follows the failure excerpt):
                if not self.config.is_encoder_decoder:
                    # only pad if bucket_size < -1. If we are bucketing (bucket_size > 0), then that is taken care in greedy_search()
                    if not is_greedy_and_bucket:
                        # token_idx is the current index in the generation process, it is incremented each time a new token is generated
                        model_kwargs["token_idx"] = torch.tensor(inputs_tensor.shape[-1], device=inputs_tensor.device)
    >                   inputs_tensor = torch.nn.functional.pad(
                            inputs_tensor, (0, generation_config.max_new_tokens), value=generation_config.pad_token_id
                        )
    E                   TypeError: pad(): argument 'pad' must be tuple of ints, but found element of type NoneType at pos 2
    
    max_new_tokens ends up as None here, which is what the pad() call rejects.
    
    
    FAILED test_modeling_gpt2.py::GPT2ModelTest::test_beam_search_generate - TypeError: pad(): argument 'pad' must be tuple of ints, but found element of type NoneType at pos 2
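    One possible direction, sketched purely as an illustration (the helper name and where it would be called are assumptions, not the library's actual code): derive max_new_tokens from max_length and the prompt length before the static-shape padding, so that pad() receives an integer.

    import torch

    def resolve_max_new_tokens(generation_config, input_ids: torch.Tensor) -> int:
        # Hypothetical helper: fall back to max_length minus the prompt length when
        # only max_length is set, so downstream static-shape padding gets an int.
        if generation_config.max_new_tokens is not None:
            return generation_config.max_new_tokens
        if generation_config.max_length is not None:
            return max(generation_config.max_length - input_ids.shape[-1], 0)
        raise ValueError("Either max_new_tokens or max_length must be set.")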

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    python -m pytest -vs test_modeling_gpt2.py::GPT2ModelTest::test_beam_search_generate

    Expected behavior

    test should pass

    text_generation_launcher: Waiting for shard to be ready... rank=1 forever if we pass --num-shard

    System Info

    model=bigscience/bloom-560m  (same issue with 
    docker run -p 8080:80 -v $volume:/data --runtime=habana --privileged -e HABANA_VISIBLE_DEVICES=all -e HUGGING_FACE_HUB_TOKEN=$token  -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi_gaudi  --model-id $model --num-shard 2
    
    When running without --num-shard it works fine, but then it seems to use only one Gaudi HPU.
    Instance: EC2 DL1 instance

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Log:
    2023-08-31T19:25:28.057753Z INFO text_generation_launcher: Sharding model on 2 processes
    2023-08-31T19:25:28.057837Z INFO download: text_generation_launcher: Starting download process.
    2023-08-31T19:25:35.148837Z INFO text_generation_launcher: Files are already present on the host. Skipping download.

    2023-08-31T19:25:35.468574Z INFO download: text_generation_launcher: Successfully downloaded weights.
    2023-08-31T19:25:35.468985Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
    2023-08-31T19:25:35.468987Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
    2023-08-31T19:25:45.485025Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
    2023-08-31T19:25:45.485090Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
    2023-08-31T19:25:55.500375Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
    2023-08-31T19:25:55.500375Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
    2023-08-31T19:26:05.515041Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
    2023-08-31T19:26:05.515110Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
    2023-08-31T19:26:11.027830Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

    2023-08-31T19:26:11.123093Z INFO shard-manager: text_generation_launcher: Shard ready in 35.652902762s rank=0
    2023-08-31T19:26:11.312428Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

    2023-08-31T19:26:15.529479Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1

    Expected behavior

    The goal is to run the Llama 2 7B model with TGI and connect it to LangChain. With one HPU the performance is very low, so I am trying to leverage multiple HPUs to improve it.

    When loading datasets with HuggingFace datasets.load_dataset (like cifar10), could it be possible to return the dataset without decoding automatically?

    Feature request

    When loading datasets by HuggingFace datasets.load_dataset like cifar10, could it be possible to return the dataset without decoding automatically?

    Motivation

    According to #189, the scaling efficiency is about 72.6% for Gaudi2 and 79.4% for Gaudi. We found that the efficiency on Gaudi2 is low because of the data loader, so we intend to implement a data loader (especially for Gaudi2) based on the Habana Media Pipeline to perform the decoding, RandomResizedCrop, RandomHorizontalFlip, and Normalize steps.

    As described in the cifar10 dataset data fields, accessing the image column with dataset[0]["image"] automatically decodes the image file, and that decoding runs on the CPU. Could the dataset instead just expose a root path to the image files and let our self-defined data loader do the decoding on HPU? (A minimal sketch of disabling automatic decoding follows.)
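    One workaround that may cover this, sketched under the assumption that the image column in cifar10 is named "img" (check the dataset card): cast the column with datasets.Image(decode=False) so examples return the raw bytes/path instead of a PIL image decoded on the CPU, leaving the actual decode to a custom HPU-side data loader.

    from datasets import Image, load_dataset

    ds = load_dataset("cifar10", split="train")
    # With decode=False, accessing an example yields {"bytes": ..., "path": ...}
    # instead of triggering a CPU-side image decode; the column name is an assumption.
    ds = ds.cast_column("img", Image(decode=False))
    print(ds[0]["img"])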

    Your contribution

    Implement a Habana media-based data loader

    Docker build fails in https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference

    Do we need to upgrade the text-generation-inference code to main?
    See huggingface/text-generation-inference#840.

    I hit the same issue when building the Docker image for the Habana TGI.

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference

    docker build -t tgi_gaudi .

    Expected behavior

    build success

    Bad Performance of text-generation with sampling algo

    System Info

    Ubuntu 20.04

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    With do_sample, the throughput is 450 tps:

    python run_generation.py --model_name_or_path gpt2  --batch_size 1 --max_new_tokens 100 --use_hpu_graphs --prompt "I am a student from" --do_sample

    Without do_sample, the throughput is 900 tps:

    python run_generation.py --model_name_or_path gpt2  --batch_size 1 --max_new_tokens 100 --use_hpu_graphs --prompt "I am a student from"
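    For context, here is an illustration-only sketch of the extra per-token work that sampling adds over greedy decoding; it is not the run_generation.py code path, and it does not establish whether a 2x gap should be expected on HPU.

    import torch

    logits = torch.randn(1, 50257)  # one decoding step of GPT-2-sized vocabulary logits

    # Greedy decoding: a single argmax per generated token.
    greedy_token = logits.argmax(dim=-1)

    # Sampling: a softmax plus a multinomial draw per token (plus any configured
    # top-k/top-p filtering), i.e. extra work on every generation step.
    probs = torch.softmax(logits, dim=-1)
    sampled_token = torch.multinomial(probs, num_samples=1)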

    Expected behavior

    Decent performance with the do_sample algorithm.

    FileNotFoundError: Couldn't find a dataset script

    System Info

    GPT-NeoX text-generation training on HLS2
    
    --branch v1.7.3 (fail)
    --branch v1.7.4 (fail)
    --branch v1.7.5 (fail)
    
    I encountered the following error (a possible cause is sketched after the log link):
    
    FileNotFoundError: Couldn't find a dataset script at /root/optimum-habana/examples/text-generation/wikitext-2-raw-v1/wikitext-2-raw-v1.py or any data file in the same directory. Couldn't find 'wikitext-2-raw-v1' on the Hugging Face Hub either: FileNotFoundError: Dataset 'wikitext-2-raw-v1' doesn't exist on the Hub. If the repo is private or gated, make sure to log in with `huggingface-cli login`.
    
    Full log:
    
    https://logs-browser.k8s-infra.habana-labs.com/files/qa-tester-9-004527355-2642-tfjob/log.txt
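    The error message suggests that the config name "wikitext-2-raw-v1" was passed where the datasets library expects a dataset name or path. If that is indeed the cause, a minimal sketch of how this dataset is normally loaded (outside the example scripts) looks like this:

    from datasets import load_dataset

    # "wikitext" is the Hub dataset id; "wikitext-2-raw-v1" is one of its configurations.
    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    print(ds[0]["text"][:80])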

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    pip install --upgrade-strategy eager optimum[habana]
    git clone https://github.com/huggingface/optimum-habana --branch v1.7.4
    pip install -r requirements.txt
    pip install git+https://github.com/HabanaAI/[email protected]
    python ../gaudi_spawn.py --use_deepspeed --world_size number_of_devices run_generation.py ARGS

    python run_generation.py
    --model_name_or_path gpt2
    --use_hpu_graphs
    --use_kv_cache
    --max_new_tokens 100
    --do_sample
    --prompt "Here is my prompt"

    Expected behavior

    Pass

    installation command needs a change

    System Info

    The following steps do not install all components:
     1. git clone https://github.com/huggingface/optimum-habana.git
     2. cd optimum-habana
     3. python setup.py install
    
    Step 3 needs to be changed to "pip install -e .".

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    git clone https://github.com/huggingface/optimum-habana.git
    cd optimum-habana
    python setup.py install

    Run the t5-small summarization example and it reports that the transformers module cannot be found.

    Expected behavior

    git clone https://github.com/huggingface/optimum-habana.git
    cd optimum-habana
    pip install -e .

    Device Acquire failed

    System Info

    Running this command on a single Gaudi device works very well:
    optimum-habana/examples/language-modeling/run_lora_clm.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --train_file merged_final_ultimate_andy.json \
        --bf16 True \
        --output_dir ./model_lora_llama \
        --num_train_epochs 3 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --evaluation_strategy no \
        --save_strategy steps \
        --save_total_limit 1 \
        --learning_rate 1e-4 \
        --logging_steps 1 \
        --dataset_concatenation \
        --do_train \
        --use_habana \
        --use_lazy_mode \
        --throughput_warmup_steps 1
    
    and it finishes the training on Gaudi.
    
    But trying this command fails:
    python optimum-habana/examples/gaudi_spawn.py \
        --world_size 4 --use_mpi optimum-habana/examples/language-modeling/run_lora_clm.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --train_file merged_final_ultimate_andy.json \
        --bf16 True \
        --output_dir ./model_lora_llama \
        --num_train_epochs 3 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --evaluation_strategy no \
        --save_strategy steps \
        --save_total_limit 1 \
        --learning_rate 1e-4 \
        --logging_steps 1 \
        --dataset_concatenation \
        --do_train \
        --use_habana \
        --use_lazy_mode \
        --throughput_warmup_steps 1
    
    The same failure occurs while using only 2 devices:
    python optimum-habana/examples/gaudi_spawn.py \
        --world_size 2 --use_mpi optimum-habana/examples/language-modeling/run_lora_clm.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --train_file merged_final_ultimate_andy.json \
        --bf16 True \
        --output_dir ./model_lora_llama \
        --num_train_epochs 3 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --evaluation_strategy no \
        --save_strategy steps \
        --save_total_limit 1 \
        --learning_rate 1e-4 \
        --logging_steps 1 \
        --dataset_concatenation \
        --do_train \
        --use_habana \
        --use_lazy_mode \
        --throughput_warmup_steps 1
    
    This command also fails with a failure to acquire a device:
    python optimum-habana/examples/gaudi_spawn.py \
        --world_size 4 --use_deepspeed optimum-habana/examples/language-modeling/run_lora_clm.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --train_file merged_final_ultimate_andy.json \
        --bf16 True \
        --output_dir ./model_lora_llama \
        --num_train_epochs 3 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --evaluation_strategy no \
        --save_strategy steps \
        --save_total_limit 1 \
        --learning_rate 1e-4 \
        --logging_steps 1 \
        --dataset_concatenation \
        --do_train \
        --use_habana \
        --use_lazy_mode \
        --throughput_warmup_steps 1 \
        --deepspeed gaudi_config.json
    I used this config file:
    
    {
        "steps_per_print": 64,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "bf16": {
            "enabled": true
        },
        "gradient_clipping": 1.0,
        "zero_optimization": {
            "stage": 2,
            "overlap_comm": false,
            "reduce_scatter": false,
            "contiguous_gradients": false
        }
    }
    
    Note: I am using 7 devices in my template, which gives me 7 HPUs.
    
    
    For a while I have been facing issues running distributed workloads on multiple Gaudi devices, and I really want to run a 70B Llama model, but for now I am stuck. (A small device-visibility check is sketched below.)
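    As a first sanity check before spawning multiple workers, here is a minimal sketch for confirming how many HPUs the process can actually see; is_available() and device_count() are provided by the Habana PyTorch plugin in recent releases, but treat the exact names as an assumption for your SynapseAI version.

    import habana_frameworks.torch.hpu as hthpu

    # Print the devices visible to this process before launching gaudi_spawn.py.
    print("HPU available:", hthpu.is_available())
    print("Visible HPU count:", hthpu.device_count())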

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    described in info

    Expected behavior

    to work
