
optimum-habana's Introduction

Optimum for Intel® Gaudi® Accelerators

Optimum for Intel Gaudi - a.k.a. optimum-habana - is the interface between the Transformers and Diffusers libraries and Intel Gaudi AI Accelerators (HPU). It provides a set of tools enabling easy model loading, training and inference on single- and multi-HPU settings for different downstream tasks. The list of officially validated models and tasks is available here. Users can also try any of the thousands of other Hugging Face models and tasks on Intel Gaudi accelerators with only a few changes.

What are Intel Gaudi AI Accelerators (HPUs)?

HPUs offer fast model training and inference as well as a great price-performance ratio. Check out this blog post about BLOOM inference and this post benchmarking Intel Gaudi 2 and NVIDIA A100 GPUs for BridgeTower training for concrete examples.

Gaudi Setup

Please refer to the Intel Gaudi AI Accelerator official installation guide.

Tests should be run in a Docker container based on Intel Gaudi Docker images.

The current version has been validated for SynapseAI 1.17.

Install the library and get example scripts

Option 1: Use the latest stable release

To install the latest stable release of this package

pip install --upgrade-strategy eager optimum[habana]

The --upgrade-strategy eager option is needed to ensure optimum-habana is upgraded to the latest stable release.

To use the examples associated with the latest stable release, run:

git clone https://github.com/huggingface/optimum-habana
cd optimum-habana && git checkout v1.13.1

where v1.13.1 is the version number of this release.

Option 2: Use the latest main branch under development

Optimum for Intel Gaudi is a fast-moving project, and you may want to install it from source to get the latest scripts:

pip install git+https://github.com/huggingface/optimum-habana.git
git clone https://github.com/huggingface/optimum-habana

Option 3: Use the transformers_future branch to have the latest changes from Transformers

The transformers_future branch is regularly updated with the latest changes from the main branches of Optimum Habana and Transformers. This enables you to try out new Transformers features that have not been merged into the main branch yet.

Warning

The transformers_future branch may have some regressions or bugs and may be less stable than the main branch.

pip install git+https://github.com/huggingface/optimum-habana.git@transformers_future
git clone -b transformers_future https://github.com/huggingface/optimum-habana

Install dependencies

To use DeepSpeed on HPUs, you also need to run the following command:

pip install git+https://github.com/HabanaAI/[email protected]
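
DeepSpeed is then enabled through the training arguments. Below is a minimal sketch, assuming the deepspeed argument inherited from Transformers' TrainingArguments and a hypothetical local ds_config.json file; multi-card runs are typically launched through gaudi_spawn.py --use_deepspeed as shown in the example commands later in this page.

from optimum.habana import GaudiTrainingArguments

# Minimal sketch: "ds_config.json" is a hypothetical DeepSpeed configuration file
# (e.g. ZeRO settings); the `deepspeed` argument is inherited from Transformers'
# TrainingArguments and accepts a path to such a file or a dict.
training_args = GaudiTrainingArguments(
    output_dir="./output",
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name="Habana/bert-base-uncased",  # example Hub configuration
    deepspeed="ds_config.json",
)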

To install the requirements for every example:

cd <example-folder>
pip install -r requirements.txt

How to use it?

Quick Start

Optimum for Intel Gaudi was designed with one goal in mind: to make training and inference straightforward for Transformers and Diffusers users, while fully leveraging the power of Intel Gaudi AI Accelerators.

Transformers Interface

There are two main classes one needs to know:

  • GaudiTrainer: the trainer class that takes care of compiling and distributing the model to run on HPUs, and performing training and evaluation.
  • GaudiConfig: the class that lets you configure Habana Mixed Precision and decide whether optimized operators and optimizers should be used.

The GaudiTrainer is very similar to the Transformers Trainer, and adapting a script that uses the Trainer to make it work with Intel Gaudi accelerators will mostly consist of simply swapping the Trainer class for the GaudiTrainer one. That's how most of the example scripts were adapted from their original counterparts.

Here is an example:

- from transformers import Trainer, TrainingArguments
+ from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

- training_args = TrainingArguments(
+ training_args = GaudiTrainingArguments(
  # training arguments...
+ use_habana=True,
+ use_lazy_mode=True,  # whether to use lazy or eager mode
+ gaudi_config_name=path_to_gaudi_config,
)

# A lot of code here

# Initialize our Trainer
- trainer = Trainer(
+ trainer = GaudiTrainer(
    model=model,
    args=training_args,  # Original training arguments.
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

where gaudi_config_name is the name of a model from the Hub (Intel Gaudi configurations are stored in model repositories) or a path to a local Intel Gaudi configuration file (you can see here how to write your own).
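
For reference, here is a minimal sketch of how a Gaudi configuration can be loaded or built in Python. The attribute names below (use_fused_adam, use_fused_clip_norm, use_torch_autocast) are taken from the configurations published under the Habana organization on the Hub and may differ between releases:

from optimum.habana import GaudiConfig

# Load an existing Intel Gaudi configuration from the Hub
# ("Habana/bert-base-uncased" is an example repository)...
gaudi_config = GaudiConfig.from_pretrained("Habana/bert-base-uncased")

# ...or build one programmatically (attribute names assumed from the Hub configurations).
gaudi_config = GaudiConfig(
    use_fused_adam=True,       # Habana's fused AdamW implementation
    use_fused_clip_norm=True,  # Habana's fused gradient norm clipping
    use_torch_autocast=True,   # bf16 mixed precision through torch autocast
)

A GaudiConfig object built this way should also be accepted by GaudiTrainer through its gaudi_config argument, as an alternative to setting gaudi_config_name in the training arguments.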

Diffusers Interface

You can generate images from prompts using Stable Diffusion on Intel Gaudi with the GaudiStableDiffusionPipeline class and the GaudiDDIMScheduler, which have both been optimized for HPUs. Here is how to use them and how they differ from the Diffusers library:

- from diffusers import DDIMScheduler, StableDiffusionPipeline
+ from optimum.habana.diffusers import GaudiDDIMScheduler, GaudiStableDiffusionPipeline


model_name = "CompVis/stable-diffusion-v1-4"

- scheduler = DDIMScheduler.from_pretrained(model_name, subfolder="scheduler")
+ scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")

- pipeline = StableDiffusionPipeline.from_pretrained(
+ pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    model_name,
    scheduler=scheduler,
+   use_habana=True,
+   use_hpu_graphs=True,
+   gaudi_config="Habana/stable-diffusion",
)

outputs = pipeline(
    ["An image of a squirrel in Picasso style"],
    num_images_per_prompt=16,
+   batch_size=4,
)
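
The call returns a standard Diffusers-style output object, so the generated images can be retrieved and saved as usual. A short follow-up sketch (the file names are arbitrary):

# outputs.images holds the generated PIL images.
for i, image in enumerate(outputs.images):
    image.save(f"squirrel_{i}.png")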

Documentation

Check out the documentation of Optimum for Intel Gaudi for more advanced usage.

Validated Models

The following model architectures, tasks and device distributions have been validated for Optimum for Intel Gaudi:

In the tables below, ✔️ means single-card, multi-card and DeepSpeed have all been validated.

  • Transformers:

    Architecture | Training | Inference | Tasks
    BERT | ✔️ | ✔️ | text classification, question answering, language modeling, text feature extraction
    RoBERTa | ✔️ | ✔️ | question answering, language modeling
    ALBERT | ✔️ | ✔️ | question answering, language modeling
    DistilBERT | ✔️ | ✔️ | question answering, language modeling
    GPT2 | ✔️ | ✔️ | language modeling, text generation
    BLOOM(Z) |  | DeepSpeed | text generation
    StarCoder / StarCoder2 | ✔️ | Single card | language modeling, text generation
    GPT-J | DeepSpeed | Single card, DeepSpeed | language modeling, text generation
    GPT-NeoX | DeepSpeed | DeepSpeed | language modeling, text generation
    OPT |  | DeepSpeed | text generation
    Llama 2 / CodeLlama / Llama 3 / Llama Guard / Granite | ✔️ | ✔️ | language modeling, text generation, question answering, text classification (Llama Guard)
    StableLM |  | Single card | text generation
    Falcon | LoRA | ✔️ | language modeling, text generation
    CodeGen |  | Single card | text generation
    MPT |  | Single card | text generation
    Mistral |  | Single card | text generation
    Phi | ✔️ | Single card | language modeling, text generation
    Mixtral |  | Single card | text generation
    Persimmon |  | Single card | text generation
    Qwen2 | Single card | Single card | language modeling, text generation
    Gemma | ✔️ | Single card | language modeling, text generation
    T5 / Flan T5 | ✔️ | ✔️ | summarization, translation, question answering
    BART |  | Single card | summarization, translation, question answering
    ViT | ✔️ | ✔️ | image classification
    Swin | ✔️ | ✔️ | image classification
    Wav2Vec2 | ✔️ | ✔️ | audio classification, speech recognition
    Whisper | ✔️ | ✔️ | speech recognition
    SpeechT5 |  | Single card | text to speech
    CLIP | ✔️ | ✔️ | contrastive image-text training
    BridgeTower | ✔️ | ✔️ | contrastive image-text training
    ESMFold |  | Single card | protein folding
    Blip |  | Single card | visual question answering, image to text
    OWLViT |  | Single card | zero shot object detection
    ClipSeg |  | Single card | object segmentation
    Llava / Llava-next |  | Single card | image to text
    Segment Anything Model |  | Single card | object segmentation
    VideoMAE |  | Single card | video classification
    TableTransformer |  | Single card | table object detection
    DETR |  | Single card | object detection

  • Diffusers:

    Architecture | Training | Inference | Tasks
    Stable Diffusion | textual inversion, ControlNet | Single card | text-to-image generation
    Stable Diffusion XL | fine-tuning | Single card | text-to-image generation
    Stable Diffusion Depth2img |  | Single card | depth-to-image generation
    LDM3D |  | Single card | text-to-image generation
    Text to Video |  | Single card | text-to-video generation

  • PyTorch Image Models/TIMM:

    Architecture | Training | Inference | Tasks
    FastViT |  | Single card | image classification

  • TRL:

    Architecture | Training | Inference | Tasks
    Llama 2 | ✔️ |  | DPO Pipeline
    Llama 2 | ✔️ |  | PPO Pipeline
    Stable Diffusion | ✔️ |  | DDPO Pipeline

Other models and tasks supported by the Transformers and Diffusers libraries may also work. You can refer to this section for using them with Optimum for Intel Gaudi. In addition, this page explains how to modify any example from the Transformers library to make it work with Optimum for Intel Gaudi.
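
As a rough illustration only, running a Transformers model that is not listed above on HPU for inference can look like the sketch below. adapt_transformers_to_gaudi is the helper used for this purpose in the Optimum for Intel Gaudi codebase, gpt2 is just an example checkpoint, and details such as lazy vs. eager mode or HPU graphs are left out:

import torch
import habana_frameworks.torch.core as htcore  # registers the HPU device with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

# Patch Transformers so that models with Gaudi-optimized implementations use them.
adapt_transformers_to_gaudi()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16).to("hpu")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("hpu")
# Generation options such as lazy_mode or HPU graphs are omitted for brevity.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))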

    If you find any issues while using those, please open an issue or a pull request.

    After training your model, feel free to submit it to the Intel leaderboard, which is designed to evaluate, score, and rank open-source LLMs that have been pre-trained or fine-tuned on Intel hardware. Models submitted to the leaderboard will be evaluated on the Intel Developer Cloud. The evaluation platform consists of Gaudi Accelerators and Xeon CPUs running benchmarks from the Eleuther AI Language Model Evaluation Harness.

    Development

    Check the contributor guide for instructions.


    optimum-habana's Issues

    Cannot run text-generation with bloom deepspeed?

    System Info

    ubuntu 20.04
    docker 1.9

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. refer https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation to setup
    2. python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path bigscience/bloom-560m --bf16 --max_new_tokens 10 --batch_size 1 --use_kv_cache --do_sample

    Expected behavior

    running correctly

    Latest version of optimum-habana does not work with transformers==4.32.1

    System Info

    optimum-habana: 1.8.0.dev0
    transformers: 4.32.1 (preinstalled)
    
    Following message seen when installing optimum-habana:
    Requirement already satisfied: transformers>=4.32.0

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    python run_qa.py
    --model_name_or_path bert-large-uncased-whole-word-masking
    --gaudi_config_name Habana/bert-large-uncased-whole-word-masking
    --dataset_name squad
    --do_train
    --do_eval
    --per_device_train_batch_size 24
    --per_device_eval_batch_size 8
    --learning_rate 3e-5
    --num_train_epochs 1
    --max_seq_length 384
    --doc_stride 128
    --output_dir /tmp/squad/
    --use_habana
    --use_lazy_mode
    --use_hpu_graphs_for_inference
    --bf16
    --throughput_warmup_steps 3

    Fails with the following error:
    ModuleNotFoundError: No module named 'transformers.integrations.deepspeed'; 'transformers.integrations' is not a package

    Expected behavior

    Since transformers 4.32.0 is no longer supported by the latest optimum-habana, it should install the minimum transformers version that is supported.

    Error in tests when test_trainer is run before test_trainer_distributed

    Unit and integration tests currently need to be run with pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py. If not, for instance with pytest tests/, test_trainer will be executed before test_trainer_distributed and the latter will fail without any error message.

    The following code snippet in training_args.py should actually not be executed in single-card mode and is responsible for this error:

    try:
      global mpi_comm
      from mpi4py import MPI
      
      mpi_comm = MPI.COMM_WORLD
      world_size = mpi_comm.Get_size()
      if world_size > 1:
          rank = mpi_comm.Get_rank()
          self.local_rank = rank
      else:
          raise ("Single MPI process")
    except Exception as e:
      logger.info("Single node run")

    However, even when this is corrected, I still get the following error:

    Traceback (most recent call last):
      File "/root/shared/optimum-habana/tests/test_trainer_distributed.py", line 117, in <module>
        trainer = GaudiTrainer(
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/trainer.py", line 118, in __init__
        super().__init__(
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 382, in __init__
        self._move_model_to_device(model, args.device)
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 548, in _move_model_to_device
        model = model.to(device)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 899, in to
        return self._apply(convert)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 570, in _apply
        module._apply(fn)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 593, in _apply
        param_applied = fn(param)
      File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 897, in convert
        return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
    RuntimeError: Device acquire failed.

    I think this is due to the fact that one process may still be running on an HPU when Torch tries to acquire devices.

    Resume from checkpoint does not work

    Error Message:

    Traceback (most recent call last):
      File "examples/question-answering/run_qa.py", line 664, in <module>
        main()
      File "examples/question-answering/run_qa.py", line 605, in main
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/trainer.py", line 517, in train
        self._load_optimizer_and_scheduler(resume_from_checkpoint)
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1795, in _load_optimizer_and_scheduler
        torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
      File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 607, in load
        return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
      File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 882, in _load
        result = unpickler.load()
      File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 857, in persistent_load
        load_tensor(data_type, size, key, _maybe_decode_ascii(location))
      File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 846, in load_tensor
        loaded_storages[key] = restore_location(storage, location)
      File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 827, in restore_location
        return default_restore_location(storage, str(map_location))
      File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 178, in default_restore_location
        raise RuntimeError("don't know how to restore data location of "
    RuntimeError: don't know how to restore data location of torch.FloatStorage (tagged with hpu)
    

    Command used to run training :

    python examples/question-answering/run_qa.py --model_name_or_path albert-xxlarge-v1 --dataset_name squad  --do_train --do_eval --per_device_train_batch_size=12 --learning_rate=5e-06 --num_train_epochs 2 --save_steps 5000 --seed 42 --doc_stride 128 --max_seq_length 384 --per_device_eval_batch_size 2 --use_lazy_mode  --use_habana --output_dir=./albert_xxlarge_bf16_squad 2>&1 | tee albert_xxlarge_bf16_squad_continued.log
    

    Method for reproducing the issue:

    1. Use above command to run the training.
    2. Halt the training after a few steps/epochs.
    3. Resume the training using the same command with --resume_from_checkpoint flag pointing to the output directory of the above command.
    4. Above error is encountered.

    Attached Log file:
    albert_xxlarge_bf16_squad_continued.log

    AttributeError: 'GaudiStableDiffusionPipeline' object has no attribute '_internal_dict'

    System Info

    Optimum habana version: 1.5.0.dev
    Docker image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu20.04/habanalabs/pytorch-installer-1.13.1

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. Go into the examples/stable-diffusion folder.
    2. Run "python text_to_image_generation.py --model_name_or_path stabilityai/stable-diffusion-2-base --prompts "a photo of an astronaut riding a horse on mars" --num_images_per_prompt 1 --batch_size 1 --image_save_dir /tmp/stable_diffusion_images --use_habana --use_hpu_graph --gaudi_config Habana/stable-diffusion".
    3. The following error is raised: AttributeError: 'GaudiStableDiffusionPipeline' object has no attribute '_internal_dict'

    Expected behavior

    Is there any way to fix this issue?

    Enable beam_search for text-generation

    Feature request

    Currently, text-generation only supports greedy decoding with a single beam, and this is not exposed to users. Would it be possible to expose the number of beams as a command-line option?

    Motivation

    Enable the beam_search code path when the number of beams is greater than 1, while a single beam falls back to greedy_search (see the sketch below).
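
    For context, in the standard Transformers generate() API the decoding strategy is selected by the number of beams; a short sketch with a placeholder model:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # "gpt2" is only an illustrative checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The weather today is", return_tensors="pt")

    # num_beams == 1 selects greedy search, num_beams > 1 selects beam search.
    greedy_output = model.generate(**inputs, max_new_tokens=10, num_beams=1)
    beam_output = model.generate(**inputs, max_new_tokens=10, num_beams=4)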

    Your contribution

    submit a PR

    Performance is better in the 1.6.1 release than in the 1.7.4 release for many models

    System Info

    Optimum-habana - 1.7.4
    Synapse AI - 1.12.0
    Docker - 1.12.0-463
    Gaudi2 (HLS 225) - 1x and 8x.

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Steps to reproduce running SwinT on a single card (1x):

    1. Download and install optimum-habana
    2. git clone https://github.com/huggingface/optimum-habana.git
    3. cd optimum-habana
    4. git checkout v1.6-release
    5. pip install -r examples/image-classification/requirements.txt
    6. pip install optimum-habana==1.6.1
    7. python3 /root//optimum-habana/examples/image-classification/run_image_classification.py --model_name_or_path microsoft/swin-base-patch4-window7-224 --dataset_name cifar10 --output_dir /tmp/swint_hf/results/ --remove_unused_columns False --do_train --learning_rate 2e-05 --per_device_train_batch_size 64 --evaluation_strategy no --save_strategy no --load_best_model_at_end True --save_total_limit 3 --seed 1337 --use_habana --use_lazy_mode --gaudi_config_name Habana/swin --throughput_warmup_steps 3 --ignore_mismatched_sizes --bf16 --num_train_epochs 1 --logging_steps 20 --dataloader_num_workers 8

    Expected behavior

    The expected behavior is that, on 1.12.0-463, optimum-habana 1.6.1 and optimum-habana 1.7.4 deliver similar performance.

    But what is observed is that performance is better with optimum-habana 1.6.1 and comparatively lower with 1.7.4.

    This is applicable to SwinT, ViT, and BERT-Large on both 8x and 1x.
    E.g., throughput values for SwinT are given below:

    OH - 1.7.4 values
    362.524
    362.566
    360.719
    358.089

    OH - 1.6.1 values
    389.045
    390.971
    389.587

    Almost 7.5% drop

    Default value of ignore_eos

    System Info

    Currently, the default value of ignore_eos is set to lazy_mode, whereas the default value in vanilla Transformers is False.
    
    This results in different accuracy and performance values on Gaudi when compared with GPU.
    Also, in summarization tasks, what is the need for setting ignore_eos to True? Stopping when we reach EOS should be the behavior for summarization, right?
    
    Right now, the only way to set the flag is from generation_config. With the default generation config, we have to update it, and that will cause issues with CI accuracy and performance, which currently run with ignore_eos as lazy_mode (True in most cases).
    
    Shouldn't there be an extra flag that can be provided on the command line, so that each model can set it based on its needs?

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    pytest -s -v tests/test_encoder_decoder_text_summarization.py

    Expected behavior

    There should be a way to set ignore_eos value without impacting other runs and models

    Where in the directory "/tmp/tst-summarization" is the summarization output stored?

    System Info

    Optimum Habana : 1.6.0
    SynapseAI : 1.10.0
    Docker Image : Habana® Deep Learning Base AMI (Ubuntu 20.04)
    Volume : 1000 GiB

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Start an EC2 instance with a DL1 resource and this image: Habana® Deep Learning Base AMI (Ubuntu 20.04)
    Run these commands
    a. docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
    b. git clone https://github.com/huggingface/optimum-habana.git
    c. pip install optimum[habana]
    d. cd examples
    e. cd summarization
    f. pip install -r requirements.txt

    python run_summarization.py
    --model_name_or_path t5-small
    --do_eval
    --dataset_name cnn_dailymail
    --dataset_config "3.0.0"
    --source_prefix "summarize: "
    --output_dir /tmp/tst-summarization
    --per_device_train_batch_size 4
    --per_device_eval_batch_size 4
    --overwrite_output_dir
    --predict_with_generate
    --use_habana
    --use_lazy_mode
    --use_hpu_graphs_for_inference
    --gaudi_config_name Habana/t5
    --ignore_pad_token_for_loss False
    --pad_to_max_length
    --save_strategy epoch
    --throughput_warmup_steps 3

    Expected behavior

    Need a file with the summarized text and not just the evaluation metrics

    GPT-NeoX fine-tuning does not work (segmentation fault) since 1.7.0

    System Info

    optimum-habana version >1.7.0
    deepspeed 1.11.0

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    python3 /root/repos/optimum-habana/examples/gaudi_spawn.py   --hostfile /root/repos/hostsfile --world_size 8 --use_deepspeed /root/repos/optimum-habana/examples/language-modeling/run_clm.py --deepspeed /root/repos/optimum-habana/tests/configs/deepspeed_zero_2.json --model_name_or_path 'EleutherAI/gpt-neox-20b' --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --num_train_epochs 1 --do_train --output_dir ~/gpt-neox-20b --gaudi_config_name Habana/gpt2 --gradient_checkpointing --use_habana --use_lazy_mode --throughput_warmup_steps 3 --overwrite_output_dir --use_hpu_graphs_for_inference

    Crash log:

    10.233.250.163: Loading extension module utils...
    10.233.250.163: [INFO|trainer.py:680] 2023-09-12 09:11:29,269 >> ***** Running training *****
    10.233.250.163: [INFO|trainer.py:681] 2023-09-12 09:11:29,269 >>   Num examples = 2,334
    10.233.250.163: [INFO|trainer.py:682] 2023-09-12 09:11:29,269 >>   Num Epochs = 1
    10.233.250.163: [INFO|trainer.py:683] 2023-09-12 09:11:29,269 >>   Instantaneous batch size per device = 2
    10.233.250.163: [INFO|trainer.py:686] 2023-09-12 09:11:29,269 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
    10.233.250.163: [INFO|trainer.py:687] 2023-09-12 09:11:29,269 >>   Gradient Accumulation steps = 1
    10.233.250.163: [INFO|trainer.py:688] 2023-09-12 09:11:29,269 >>   Total optimization steps = 73
    10.233.250.163: [INFO|trainer.py:689] 2023-09-12 09:11:29,274 >>   Number of trainable parameters = 20,554,567,680
    10.233.168.102: Time to load utils op: 0.0013871192932128906 seconds
    10.233.168.102: Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
    10.233.168.102: No modifications detected for re-loaded extension module utils, skipping build step...
    10.233.168.102: Loading extension module utils...
    10.233.168.102: Time to load utils op: 0.00066375732421875 seconds
    10.233.168.102: Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
    10.233.168.102: No modifications detected for re-loaded extension module utils, skipping build step...
    10.233.168.102: Loading extension module utils...
    10.233.168.102: Time to load utils op: 0.0005764961242675781 seconds
    10.233.168.102: Using /root/.cache/torch_extensions/py38_cpu as PyTorch extensions root...
    10.233.168.102: No modifications detected for re-loaded extension module utils, skipping build step...
    10.233.168.102: Loading extension module utils...
    10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:605:forward] Activation Checkpointing Information
    10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:606:forward] ----Partition Activations False, CPU CHECKPOINTING False
    10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:607:forward] ----contiguous Memory Checkpointing False with None total layers
    10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:609:forward] ----Synchronization False
    10.233.250.163: [2023-09-12 09:11:30,324] [INFO] [checkpointing.py:610:forward] ----Profiling time in checkpointing False
    10.233.168.102: Internal Error: Received signal - Segmentation fault
    10.233.168.102: Internal Error: Received signal - Segmentation fault
    10.233.250.163: 
      0%|          | 0/73 [00:00<?, ?it/s]Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.168.102: Internal Error: Received signal - Segmentation fault
    10.233.168.102: Internal Error: Received signal - Segmentation fault
    10.233.168.102: Internal Error: Received signal - Segmentation fault
    10.233.168.102: Internal Error: Received signal - Segmentation fault
    10.233.250.163: Internal Error: Received signal - Segmentation fault
    10.233.168.102: Internal Error: Received signal - Segmentation fault 
    

    Expected behavior

    It should work the same as with 1.6.1.

    ImportError: No module named optimum.habana.distributed

    System Info

    Optimum Habana : 1.6.0
    SynapseAI : 1.10.0
    Docker Image : vault.habana.ai/gaudi-docker/1.10.0/amzn2/habanalabs/pytorch-installer-2.0.1:latest

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. Pull this image vault.habana.ai/gaudi-docker/1.10.0/amzn2/habanalabs/pytorch-installer-2.0.1:latest into the cnvrg repository
    2. Run the following commands
    3. pip install optimum[habana]
    4. cd optimum-habana/examples/text-generation
    5. pip install -r requirements.txt
    6. pip install git+https://github.com/HabanaAI/[email protected]
    7. python ../gaudi_spawn.py --use_deepspeed --world_size 2 run_generation.py
      --model_name_or_path bigscience/bloom-560m
      --use_hpu_graphs
      --use_kv_cache
      --max_new_tokens 100
      --do_sample
      --prompt "Tell me a poem about stone and water"

    Expected behavior

    It should either generate text or give a permission error.

    Run text-generation with non-deepspeed mode

    Feature request

    The current text-generation example only supports bloom & bloomz with DeepSpeed, and does not support other generation models like gpt2, gpt-j, or neox.
    There is also a PR that shows how to run inference in each of the examples, but that inference is training evaluation, not a real generation case.
    So would it be possible to support the following features?

    • text-generation support for other generation models (e.g. gpt2, gpt-j)
    • text-generation support for non-DeepSpeed mode

    Motivation

    text-generation only supports bloom and bloomz; it cannot be run with gpt2, gpt-j, neox, ...

    Your contribution

    submitting a PR

    Runtime Error in Eager mode evaluation: The number of dims cannot be packed into CompleteArgumentSpec:65535

    Error Message:

    100%|██████████| 2702/2702 [02:58<00:00, 17.93it/s]Traceback (most recent call last):
      File "examples/question-answering/run_qa.py", line 664, in <module>
        main()
      File "examples/question-answering/run_qa.py", line 621, in main
        metrics = trainer.evaluate()
      File "/root/optimum-habana/examples/question-answering/trainer_qa.py", line 45, in evaluate
        output = self.evaluation_loop(
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/trainer.py", line 1112, in evaluation_loop
        logits = nested_numpify(preds_host)
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer_pt_utils.py", line 138, in nested_numpify
        return type(tensors)(nested_numpify(t) for t in tensors)
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer_pt_utils.py", line 138, in <genexpr>
        return type(tensors)(nested_numpify(t) for t in tensors)
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer_pt_utils.py", line 139, in nested_numpify
        t = tensors.cpu()
    RuntimeError: The number of dims cannot be packed into CompleteArgumentSpec:65535
    

    Attaching the log file below:
    albert_large_bf16_squad_eager.log

    Command used:

    python examples/question-answering/run_qa.py --model_name_or_path albert-large-v2 --dataset_name squad  --do_train --do_eval --max_seq_length 384 --per_device_train_batch_size 32 --learning_rate 5e-05 --num_train_epochs 2 --save_steps 5000 --seed 42 --doc_stride 128 --per_device_eval_batch_size 4 --use_lazy_mode false --use_habana  --output_dir=./albert_large_bf16_squad_eager  --cache_dir /software/lfs/data/pytorch/transformers/Squad 2>&1 | tee albert_large_bf16_squad_eager.log
    

    Add a utility method to get the memory consumptions for various batch sizes

    Feature request

    The GaudiTrainer class should provide a method that takes a list of batch sizes as an argument and returns the memory consumption on HPU for each batch size.
    For each batch size in the list, a training run of 5 steps should be performed without logging anything, and the maximum memory consumption of this run should be returned.
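
    A rough, hypothetical sketch of what such a utility could look like is given below. None of this exists in optimum-habana today; the memory calls assume the torch.cuda-style statistics exposed under habana_frameworks.torch.hpu, and make_trainer is a user-supplied factory:

    import habana_frameworks.torch.hpu as hthpu  # assumed to expose CUDA-style memory statistics

    def probe_hpu_memory(make_trainer, batch_sizes, steps=5):
        """Hypothetical helper: run a few training steps per batch size and record peak HPU memory.

        make_trainer(batch_size, max_steps) is expected to return a configured GaudiTrainer.
        """
        peaks = {}
        for batch_size in batch_sizes:
            hthpu.reset_peak_memory_stats()                    # assumed API
            trainer = make_trainer(batch_size, steps)
            trainer.train()
            peaks[batch_size] = hthpu.max_memory_allocated()   # assumed API, in bytes
        return peaks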

    Motivation

    This feature will save users from launching a full training run just to check the memory consumption in the logs.

    Your contribution

    I will submit a PR.

    Several greedy search Test cases failing with KeyError: 'bucket_size'

    System Info

    After the recent integration of Transformers tests into optimum-habana, several test cases are failing with the following error.
    @ssarkar2, please have a look. We should probably check for the key's availability before accessing it and also return gracefully, so that existing functionality is not affected. Otherwise, please suggest the modification required in the tests to update the arguments.
    
    FAILED test_modeling_t5.py::T5ModelTest::test_greedy_generate - KeyError: 'bucket_size'
    FAILED test_modeling_t5.py::T5ModelTest::test_greedy_generate_dict_outputs - KeyError: 'bucket_size'
    FAILED test_modeling_t5.py::T5ModelTest::test_greedy_generate_dict_outputs_use_cache - KeyError: 'bucket_size'
    
    
    <pt> (conda_qnpu1) (anneog_transformers_tests_updates) anneog@anneog-vm-u20:t5 $ python -m pytest -vs test_modeling_t5.py::T5ModelTest::test_greedy_generate
    ============================================================================================= test session starts =============================================================================================
    platform linux -- Python 3.8.18, pytest-7.4.2, pluggy-1.3.0 -- /home/anneog/anaconda3/envs/conda_qnpu1/bin/python
    cachedir: .pytest_cache
    rootdir: /home/anneog/github/ankurneog/optimum-habana
    configfile: setup.cfg
    collecting ... [WARNING|utils.py:179] 2023-10-04 07:28:42,015 >> optimum-habana v1.8.0.dev0 has been validated for SynapseAI v1.11.0 but habana-frameworks v1.13.0.133 was found, this could lead to undefined behavior!
    [WARNING|utils.py:196] 2023-10-04 07:28:42,043 >> Could not run `hl-smi`, please follow the installation guide: https://docs.habana.ai/en/latest/Installation_Guide/index.html.
    collected 1 item                                                                                                                                                                                              
    
    test_modeling_t5.py::T5ModelTest::test_greedy_generate ============================= HABANA PT BRIDGE CONFIGURATION =========================== 
     PT_HPU_LAZY_MODE = 1
     PT_RECIPE_CACHE_PATH = 
     PT_CACHE_FOLDER_DELETE = 0
     PT_HPU_RECIPE_CACHE_CONFIG = 
     PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
     PT_HPU_LAZY_ACC_PAR_MODE = 1
     PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
    ---------------------------: System Configuration :---------------------------
    Num CPU Cores : 8
    CPU RAM       : 40852220 KB
    ------------------------------------------------------------------------------
    FAILED
    
    ================================================================================================== FAILURES ===================================================================================================
    ______________________________________________________________________________________ T5ModelTest.test_greedy_generate _______________________________________________________________________________________
    
    self = <tests.models.t5.test_modeling_t5.T5ModelTest testMethod=test_greedy_generate>
    
        def test_greedy_generate(self):
            # check `generate()` and `greedy_search()` are equal
            for model_class in self.all_generative_model_classes:
                config, input_ids, attention_mask, max_length = self._get_input_ids_and_config()
                # test old generation output for backwards compatibility
                model = model_class(config).to(torch_device).eval()
    >           output_greedy, output_generate = self._greedy_generate(
                    model=model, input_ids=input_ids, attention_mask=attention_mask, max_length=max_length
                )
    
    ../../generation/test_utils.py:704: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    ../../generation/test_utils.py:293: in _greedy_generate
        output_greedy = model.greedy_search(
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    
    self = T5ForConditionalGeneration(
      (shared): Embedding(99, 32)
      (encoder): T5Stack(
        (embed_tokens): Embedding(99, 32)
    ...m()
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (lm_head): Linear(in_features=32, out_features=99, bias=False)
    )
    input_ids = tensor([[0],
            [0]], device='hpu:0')
    logits_processor = [<transformers.generation.logits_process.MinLengthLogitsProcessor object at 0x7f75ae6456d0>, <transformers.generation....at 0x7f75ae6456a0>, <transformers.generation.logits_process.RepetitionPenaltyLogitsProcessor object at 0x7f75ae5e8520>]
    stopping_criteria = [<transformers.generation.stopping_criteria.MaxLengthCriteria object at 0x7f75ae5ea8e0>], max_length = 4, pad_token_id = 0, eos_token_id = [1], output_attentions = False
    output_hidden_states = False, output_scores = False, return_dict_in_generate = False, synced_gpus = False, streamer = None, lazy_mode = False, ignore_eos = False, profiling_warmup_steps = 0
    profiling_steps = 0
    model_kwargs = {'encoder_outputs': BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[-1.8599e-03,  1.3660e-03, -1...    grad_fn=<IndexSelectBackward0>), past_key_values=None, hidden_states=None, attentions=None, cross_attentions=None)}
    eos_token_id_tensor = tensor([1], device='hpu:0'), scores = None, decoder_attentions = None, cross_attentions = None, decoder_hidden_states = None
    
        def greedy_search(
            self,
            input_ids: torch.LongTensor,
            logits_processor: Optional[LogitsProcessorList] = None,
            stopping_criteria: Optional[StoppingCriteriaList] = None,
            max_length: Optional[int] = None,
            pad_token_id: Optional[int] = None,
            eos_token_id: Optional[Union[int, List[int]]] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            output_scores: Optional[bool] = None,
            return_dict_in_generate: Optional[bool] = None,
            synced_gpus: bool = False,
            streamer: Optional["BaseStreamer"] = None,
            lazy_mode: Optional[bool] = False,
            ignore_eos: Optional[bool] = False,
            profiling_warmup_steps: Optional[int] = 0,
            profiling_steps: Optional[int] = 0,
            **model_kwargs,
        ) -> Union[GreedySearchOutput, torch.LongTensor]:
            r"""
            Generates sequences of token ids for models with a language modeling head using **greedy decoding** and can be
            used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.
        
            <Tip warning={true}>
        
            In most cases, you do not need to call [`~generation.GenerationMixin.greedy_search`] directly. Use generate()
            instead. For an overview of generation strategies and code examples, check the [following
            guide](../generation_strategies).
        
            </Tip>
        
        
            Parameters:
                input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
                    The sequence used as a prompt for the generation.
                logits_processor (`LogitsProcessorList`, *optional*):
                    An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsProcessor`]
                    used to modify the prediction scores of the language modeling head applied at each generation step.
                stopping_criteria (`StoppingCriteriaList`, *optional*):
                    An instance of [`StoppingCriteriaList`]. List of instances of class derived from [`StoppingCriteria`]
                    used to tell if the generation loop should stop.
                max_length (`int`, *optional*, defaults to 20):
                    **DEPRECATED**. Use `logits_processor` or `stopping_criteria` directly to cap the number of generated
                    tokens. The maximum length of the sequence to be generated.
                pad_token_id (`int`, *optional*):
                    The id of the *padding* token.
                eos_token_id (`Union[int, List[int]]`, *optional*):
                    The id of the *end-of-sequence* token. Optionally, use a list to set multiple *end-of-sequence* tokens.
                output_attentions (`bool`, *optional*, defaults to `False`):
                    Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                    returned tensors for more details.
                output_hidden_states (`bool`, *optional*, defaults to `False`):
                    Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
                    for more details.
                output_scores (`bool`, *optional*, defaults to `False`):
                    Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
                return_dict_in_generate (`bool`, *optional*, defaults to `False`):
                    Whether or not to return a [`transformers.generationutils.ModelOutput`] instead of a plain tuple.
                synced_gpus (`bool`, *optional*, defaults to `False`):
                    Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
                streamer (`BaseStreamer`, *optional*):
                    Streamer object that will be used to stream the generated sequences. Generated tokens are passed
                    through `streamer.put(token_ids)` and the streamer is responsible for any further processing.
                lazy_mode (`bool`, *optional*, defaults to `False`):
                    Whether the run is executed in lazy mode or not (i.e. eager mode).
                ignore_eos (`bool`, *optional*, defaults to `False`):
                    Whether to ignore finished sequences (faster in lazy mode and with HPU graphs) or not (eager mode).
                profiling_warmup_steps (`int`, *optional*, defaults to 0):
                    Number of steps to ignore for profling.
                profiling_steps (`int`, *optional*, defaults to 0):
                    Number of steps to be captured when enabling profiling.
                model_kwargs:
                    Additional model specific keyword arguments will be forwarded to the `forward` function of the model.
                    If model is an encoder-decoder model the kwargs should include `encoder_outputs`.
        
            Return:
                [`transformers.generation.GreedySearchDecoderOnlyOutput`], [`transformers.generation.GreedySearchEncoderDecoderOutput`]
                or `torch.LongTensor`: A `torch.LongTensor` containing the generated tokens (default behaviour) or a
                [`transformers.generation.GreedySearchDecoderOnlyOutput`] if `model.config.is_encoder_decoder=False` and
                `return_dict_in_generate=True` or a [`transformers.generation.GreedySearchEncoderDecoderOutput`] if
                `model.config.is_encoder_decoder=True`.
        
            Examples:
        
            
            >>> from transformers import (
            ...     AutoTokenizer,
            ...     AutoModelForCausalLM,
            ...     LogitsProcessorList,
            ...     MinLengthLogitsProcessor,
            ...     StoppingCriteriaList,
            ...     MaxLengthCriteria,
            ... )
        
            >>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
            >>> model = AutoModelForCausalLM.from_pretrained("gpt2")
        
            >>> # set pad_token_id to eos_token_id because GPT2 does not have a PAD token
            >>> model.generation_config.pad_token_id = model.generation_config.eos_token_id
        
            >>> input_prompt = "It might be possible to"
            >>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids
        
            >>> # instantiate logits processors
            >>> logits_processor = LogitsProcessorList(
            ...     [
            ...         MinLengthLogitsProcessor(10, eos_token_id=model.generation_config.eos_token_id),
            ...     ]
            ... )
            >>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])
        
            >>> outputs = model.greedy_search(
            ...     input_ids, logits_processor=logits_processor, stopping_criteria=stopping_criteria
            ... )
        
            >>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
            ["It might be possible to get a better understanding of the nature of the problem, but it's not"]
            """
            # init values
            logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
            stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()
            if max_length is not None:
                warnings.warn(
                    (
                        "`max_length` is deprecated in this function, use"
                        " `stopping_criteria=StoppingCriteriaList([MaxLengthCriteria(max_length=max_length)])` instead."
                    ),
                    UserWarning,
                )
                stopping_criteria = validate_stopping_criteria(stopping_criteria, max_length)
            pad_token_id = pad_token_id if pad_token_id is not None else self.generation_config.pad_token_id
            eos_token_id = eos_token_id if eos_token_id is not None else self.generation_config.eos_token_id
            if isinstance(eos_token_id, int):
                eos_token_id = [eos_token_id]
            eos_token_id_tensor = torch.tensor(eos_token_id).to(input_ids.device) if eos_token_id is not None else None
            output_scores = output_scores if output_scores is not None else self.generation_config.output_scores
            output_attentions = (
                output_attentions if output_attentions is not None else self.generation_config.output_attentions
            )
            output_hidden_states = (
                output_hidden_states if output_hidden_states is not None else self.generation_config.output_hidden_states
            )
            return_dict_in_generate = (
                return_dict_in_generate
                if return_dict_in_generate is not None
                else self.generation_config.return_dict_in_generate
            )
        
            # init attention / hidden states / scores tuples
            scores = () if (return_dict_in_generate and output_scores) else None
            decoder_attentions = () if (return_dict_in_generate and output_attentions) else None
            cross_attentions = () if (return_dict_in_generate and output_attentions) else None
            decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None
        
            # if model is an encoder-decoder, retrieve encoder attention weights and hidden states
            if return_dict_in_generate and self.config.is_encoder_decoder:
                encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None
                encoder_hidden_states = (
                    model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None
                )
        
            # keep track of which sequences are already finished
            if not ignore_eos:
                unfinished_sequences = torch.ones(input_ids.shape[0], dtype=torch.long, device=input_ids.device)
        
            hb_profer = HabanaProfile(warmup=profiling_warmup_steps, active=profiling_steps)
            hb_profer.start()
            this_peer_finished = False  # used by synced_gpus only
    >       bucket_size = model_kwargs["bucket_size"]
    E       KeyError: 'bucket_size'
    
    ../../../../../optimum/habana/transformers/generation/utils.py:1252: KeyError

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. clone optimum-habana
    2. pip install pytest
    3. cd optimum-habana/transformers/tests/models/t5
    4. python -m pytest -vs test_modeling_t5.py::T5ModelTest::test_greedy_generate

    Expected behavior

    Test case should pass without errors

    accelerate llama inference in TGI

    TGI only supports the models listed at https://github.com/huggingface/optimum-habana/blob/main/text-generation-inference/server/text_generation_server/models/causal_lm.py#L25 ("bloom", "gpt2", "gptj", "gpt_neox", "opt") with static shapes. Llama static shapes are now supported in the optimum-habana main branch. When will the next tag of optimum-habana be released, so that TGI can benefit from the Llama acceleration?
    Currently, TGI is installed with optimum-habana 1.6.1 (the latest tag is 1.6.1).

    Reproduction

    Launch TGI with a Llama model and run a text-generation job from the client.

    Expected behavior

    The Llama model should run successfully with better throughput.

    RoBERTa large 8x run failed

    With the following command, roberta-large fails on 8x:

    python ../gaudi_spawn.py --world_size 8 --use_mpi run_qa.py --model_name_or_path roberta-large --gaudi_config_name ../gaudi_config.json --dataset_name squad --do_train --do_eval --per_device_train_batch_size 12 --per_device_eval_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir ./roberta_large_8x_bf16_lazy --use_habana --use_lazy_mode

    To make the issue easier to reproduce, add the following flag:
    --save_steps 5

    It is related to the save step; we need to find out which part of saving is responsible:
    configuration or checkpoint, tokenizer config, special tokens

    Compute Accuracy in clip-roberta

    Feature request

    Is it possible to include an accuracy metric when training clip-roberta?

    Motivation

    We would like to have something other than loss to track during training and evaluation.

    Your contribution

    I tried creating a dummy compute_metrics function to pass to GaudiTrainer, as follows.

    import evaluate

    metric = evaluate.load("accuracy")
    def compute_metrics(p):
        # dummy metric function, just to check that compute_metrics can be passed to GaudiTrainer
        return 1
    

    I get the following error:

    Traceback (most recent call last):
      File "run_clip.py", line 553, in <module>
        main()
      File "run_clip.py", line 532, in main
        metrics = trainer.evaluate()
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2932, in evaluate
        output = eval_loop(
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/transformers/trainer.py", line 1074, in evaluation_loop
        logits_dtype = get_dtype(logits)
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/transformers/trainer_utils.py", line 43, in get_dtype
        return [get_dtype(logits_tensor) for logits_tensor in logits]
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/transformers/trainer_utils.py", line 43, in <listcomp>
        return [get_dtype(logits_tensor) for logits_tensor in logits]
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/transformers/trainer_utils.py", line 45, in get_dtype
        raise TypeError(f"logits should be of type torch.Tensor or tuple, got {type(logits)} which is not supported")
    TypeError: logits should be of type torch.Tensor or tuple, got <class 'transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions'> which is not supported
    

    Multi-node distributed training on Ray fails.

    System Info

    Docker image: vault.habana.ai/gaudi-docker/1.12.1/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
    Habana version: 1.12.1
    Deepspeed: https://github.com/HabanaAI/[email protected]
    optimum-habana: pip install --upgrade-strategy eager optimum[habana]

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    I didn't use gaudi_spawn.py to launch the distributed training of run_clm.py; I used the Ray TorchTrainer to run the GaudiTrainer.
    Single-node multi-card training works, but multi-node multi-card training fails. Can you help provide some clues?

    Logs:

    (RayTrainWorker pid=19045) [2023-11-02 13:17:59,227 E 19045 21937] logging.cc:97: Unhandled exception: N3c105ErrorE. what(): Collective call returned error
    (RayTrainWorker pid=19045) Exception raised from operator() at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/process_group_hccl_base.cpp:286 (most recent call first):
    (RayTrainWorker pid=19045) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ee2d91f557c in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
    (RayTrainWorker pid=19045) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x84 (0x7ee2d91bb220 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
    (RayTrainWorker pid=19045) frame #2: <unknown function> + 0x3497f (0x7ee22db4c97f in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so)
    (RayTrainWorker pid=19045) frame #3: <unknown function> + 0x5b0e9 (0x7ee22db730e9 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so)
    (RayTrainWorker pid=19045) frame #4: habana_helpers::JobThread::threadFunction() + 0x128 (0x7ee236356578 in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
    (RayTrainWorker pid=19045) frame #5: <unknown function> + 0xd6df4 (0x7f124793edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
    (RayTrainWorker pid=19045) frame #6: <unknown function> + 0x8609 (0x7f1247c15609 in /lib/x86_64-linux-gnu/libpthread.so.0)
    (RayTrainWorker pid=19045) frame #7: clone + 0x43 (0x7f1247d4f133 in /lib/x86_64-linux-gnu/libc.so.6)
    (RayTrainWorker pid=19045)
    (RayTrainWorker pid=19045) [2023-11-02 13:17:59,234 E 19045 21937] logging.cc:104: Stack trace:
    (RayTrainWorker pid=19045)  /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xf2e81a) [0x7f1246a9a81a] ray::operator<<()
    (RayTrainWorker pid=19045) /usr/local/lib/python3.8/dist-packages/ray/_raylet.so(+0xf30fd8) [0x7f1246a9cfd8] ray::TerminateHandler()
    (RayTrainWorker pid=19045) /usr/lib/habanalabs/libhl_logger.so(+0x1d45a) [0x7ee23430f45a]
    (RayTrainWorker pid=19045) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f124791237c]
    (RayTrainWorker pid=19045) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f12479123e7]
    (RayTrainWorker pid=19045) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa699) [0x7f1247912699]
    (RayTrainWorker pid=19045) /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so(_ZN3c106detail14torchCheckFailEPKcS2_jS2_+0xac) [0x7ee2d91bb248] c10::detail::torchCheckFail()
    (RayTrainWorker pid=19045) /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(+0x3497f) [0x7ee22db4c97f] std::_Function_handler<>::_M_invoke()
    (RayTrainWorker pid=19045) /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/distributed/_hccl_C.so(+0x5b0e9) [0x7ee22db730e9] c10d::ProcessGroupHCCL::collective()::{lambda()#3}::operator()()
    (RayTrainWorker pid=19045) /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so(_ZN14habana_helpers9JobThread14threadFunctionEv+0x128) [0x7ee236356578] habana_helpers::JobThread::threadFunction()
    (RayTrainWorker pid=19045) /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f124793edf4]
    (RayTrainWorker pid=19045) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f1247c15609] start_thread
    (RayTrainWorker pid=19045) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f1247d4f133] __clone
    (RayTrainWorker pid=19045)
    (RayTrainWorker pid=19045) Internal Error: Received signal - Aborted
    (RayTrainWorker pid=19045) *** SIGABRT received at time=1698931079 on cpu 27 ***
    (RayTrainWorker pid=19045) PC: @     0x7f1247c7300b  (unknown)  raise
    (RayTrainWorker pid=19045)     @     0x7ee23430fa3b  (unknown)  signalHandler()
    (RayTrainWorker pid=19045) [2023-11-02 13:17:59,236 E 19045 21937] logging.cc:361: *** SIGABRT received at time=1698931079 on cpu 27 ***
    (RayTrainWorker pid=19045) [2023-11-02 13:17:59,236 E 19045 21937] logging.cc:361: PC: @     0x7f1247c7300b  (unknown)  raise
    (RayTrainWorker pid=19045) [2023-11-02 13:17:59,236 E 19045 21937] logging.cc:361:     @     0x7ee23430fa3b  (unknown)  signalHandler()
    (RayTrainWorker pid=19045) Fatal Python error: Aborted
    (RayTrainWorker pid=19045)
    (RayTrainWorker pid=19046) /home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/interfaces/hcl_idevice.cpp::423(allocateConnection): The condition [ isNicUp(port) ] failed. Port(22) is DOWN, can't allocate connection  [repeated 15x across cluster]
    
    
    
    

    Expected behavior

    Multi-node distributed training should work on Ray.

    clip-vit-large-patch14 image classification support

    Feature request

    I am trying to run the following model using the optimum-habana repository: https://huggingface.co/openai/clip-vit-large-patch14. Do you have any suggestions for fine-tuning this model with the existing optimum-habana code? If not, could we create a script or amend a current script so this model can be supported?

    Motivation

    We would like to train this particular model on Habana hardware.

    Your contribution

    I tried using the existing image_classification script because it supports regular ViT; I used the command from the image classification README. However, I found that the AutoModelForImageClassification class invoked there does not support the CLIP config (it is not in the list of supported configs). So I tried swapping out this class for the generic AutoModel class, where CLIPConfig is supported. I get the following error with that change:

    hmp:opt_level O1
    Traceback (most recent call last):
    File "run_image_classification.py", line 410, in
    main()
    File "run_image_classification.py", line 384, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    File "/root/optimum/habana/transformers/trainer.py", line 397, in train
    return inner_training_loop(
    File "/root/optimum/habana/transformers/trainer.py", line 500, in _inner_training_loop
    self._load_optimizer_and_scheduler(resume_from_checkpoint)
    File "/root/optimum/habana/transformers/trainer.py", line 960, in _load_optimizer_and_scheduler
    self.optimizer.load_state_dict(
    File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 201, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
    ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

    Upon further digging, I also see that CLIP may still be a work in progress on the Transformers side: I do not see _MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES used anywhere. Let me know if this feature enablement is possible; I would be happy to work on it with some direction.
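    In the meantime, a hedged sketch of one possible approach (not an officially validated path): wrap the CLIP vision tower with a small classification head and train that wrapper with GaudiTrainer. The class name and num_labels below are illustrative assumptions.

    # Hedged sketch: CLIP vision encoder + linear head for image classification.
    from torch import nn
    from transformers import CLIPVisionModel

    class ClipForImageClassification(nn.Module):
        def __init__(self, checkpoint="openai/clip-vit-large-patch14", num_labels=10):
            super().__init__()
            self.vision = CLIPVisionModel.from_pretrained(checkpoint)
            self.classifier = nn.Linear(self.vision.config.hidden_size, num_labels)
            self.loss_fct = nn.CrossEntropyLoss()

        def forward(self, pixel_values, labels=None):
            pooled = self.vision(pixel_values=pixel_values).pooler_output
            logits = self.classifier(pooled)
            loss = self.loss_fct(logits, labels) if labels is not None else None
            # Returning a dict lets Trainer-style code pick up "loss" and "logits".
            return {"loss": loss, "logits": logits}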

    'meta-llama/Llama-2-7b-hf' tests fail with Authentication failure.

    System Info

    optimum-habana - 1.8.0.dev0
    Synapse version - 1.13.0-90
    Docker - 1.13.0-90

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. git clone https://github.com/huggingface/optimum-habana.git
    2. cd optimum-habana
    3. git checkout main
    4. pip install .
    5. huggingface-cli login --token
    6. cd tests
    7. pytest -s -k test_text_generation_bf16[token0-meta-llama/Llama-2-7b-hf-43.951804139391925]

    Expected behavior

    This is a test inside optimum-habana that runs a Llama model. It is expected to run without any issues.
    But authorization fails even after logging into Hugging Face using a token generated with a personal account.
    A screenshot showing the authorization error is attached (llama_gated_repo).

    RuntimeError: Device acquire failed. in /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/hpu/__init__.py"

    System Info

    Optimum Habana : 1.6.0
    SynapseAI : 1.4.0
    Docker Image : Deep Learning AMI Habana PyTorch 1.10.2 SynapseAI 1.4.0 (Ubuntu 20.04) 20220425

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. Start an EC2 instance with a DL1 resource and this image:
    2. Install "Deep Learning AMI Habana PyTorch 1.10.2 SynapseAI 1.4.0 (Ubuntu 20.04) 20220425"
    3. Run these commands:
      a. docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
      b. git clone https://github.com/huggingface/optimum-habana.git
      c. cd optimum-habana && python setup.py install
      d. cd examples
      e. cd text-generation
      f. pip install -r requirements.txt
      g. python run_generation.py \
           --model_name_or_path gpt2 \
           --use_hpu_graphs \
           --use_kv_cache \
           --max_new_tokens 100 \
           --do_sample \
           --prompt "Tell me a poem about stone and water"

    Expected behavior

    Should get the desired text (poem about stone and water)

    Error: "Getting size for given data type is not supported" while fine-tuning the starcoder model on optimum-habana

    System Info

    Hello Team,
    We are trying to fine-tune the bigcode/starcoderbase-7b model on a multi-HPU (8 HPU) node and have been following the guidance at https://github.com/huggingface/optimum-habana/tree/main/examples/language-modeling .

    However, we are encountering an issue similar to the one mentioned in #318.

    We are also using a custom ConstantLengthDataset(IterableDataset) class. Essentially, we are trying to port https://github.com/bigcode-project/starcoder/blob/main/finetune/finetune.py to Habana, using from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments in the appropriate places (see the sketch below).
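    A minimal sketch of that swap (illustrative only: the batch size, max_steps, and gaudi_config name are assumptions, and the dataset placeholder stands for the custom ConstantLengthDataset instance):

    # Hedged sketch of the Trainer -> GaudiTrainer swap described above.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from optimum.habana import GaudiTrainer, GaudiTrainingArguments

    model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-7b")
    tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase-7b")
    train_dataset = ...  # placeholder for the custom ConstantLengthDataset(IterableDataset)

    training_args = GaudiTrainingArguments(
        output_dir="./starcoder-finetuned",
        per_device_train_batch_size=1,
        max_steps=1000,                   # an IterableDataset has no __len__, so max_steps is required
        use_habana=True,                  # run on HPU
        use_lazy_mode=True,               # Gaudi lazy execution mode
        gaudi_config_name="Habana/gpt2",  # placeholder Gaudi config
    )

    GaudiTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
    ).train()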

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Training...
    Training...
    Training...
    terminate called after throwing an instance of 'c10::Error'
      what():  Getting size for given data type is not supported: 0
    Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7ff0b09bd53c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7ff0b098310c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
    frame #2: <unknown function> + 0x544ea (0x7ff0b02f84ea in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/distributed/_hccl_C.so)
    frame #3: habana_helpers::JobThread::threadFunction() + 0x128 (0x7ff020da6ae8 in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
    frame #4: <unknown function> + 0xd6df4 (0x7ff0b47dedf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
    frame #5: <unknown function> + 0x8609 (0x7ff0b4ab5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
    frame #6: clone + 0x43 (0x7ff0b4bef133 in /lib/x86_64-linux-gnu/libc.so.6)
    
    Internal Error: Received signal - Aborted
    terminate called after throwing an instance of 'c10::Error'
      what():  Getting size for given data type is not supported: 0
    Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f881daf453c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f881daba10c in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/torch/lib/libc10.so)
    frame #2: <unknown function> + 0x544ea (0x7f88161634ea in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/distributed/_hccl_C.so)
    frame #3: habana_helpers::JobThread::threadFunction() + 0x128 (0x7f8816ee3ae8 in /home/devcloud/habanalabs-venv/lib/python3.8/site-packages/habana_frameworks/torch/lib/libhabana_pytorch_plugin.so)
    frame #4: <unknown function> + 0xd6df4 (0x7f8822904df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
    frame #5: <unknown function> + 0x8609 (0x7f8822bdb609 in /lib/x86_64-linux-gnu/libpthread.so.0)
    frame #6: clone + 0x43 (0x7f8822d15133 in /lib/x86_64-linux-gnu/libc.so.6)
    
    Internal Error: Received signal - Aborted
    terminate called after throwing an instance of 'c10::Error'
      what():  Getting size for given data type is not supported: 0
    Exception raised from getHCCLDataSize at /npu-stack/pytorch-integration/python_packages/habana_frameworks/torch/distributed/hccl/ProcessGroupHCCL.cpp:128 (most recent call first):
    ...
    
    Internal Error: Received signal - Aborted
    --------------------------------------------------------------------------
    Primary job  terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun noticed that process rank 2 with PID 0 on node idc382 exited on signal 6 (Aborted).

    Expected behavior

    We should be able to complete the training loop without issues. We did try to add a fake __len__ method inside the ConstantLengthDataset(IterableDataset) class, but it still failed:

    def __len__(self):
        return 10
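    For context, the workaround sits roughly like this (a hedged sketch; the real ConstantLengthDataset from the starcoder finetune script has more logic than shown here):

    # Hedged sketch of the workaround: give the iterable dataset a (fake) __len__
    # so utilities that expect a sized dataset do not complain. It did not resolve
    # the HCCL "getting size for data type" error above.
    from torch.utils.data import IterableDataset

    class ConstantLengthDataset(IterableDataset):
        def __init__(self, token_chunks):
            self.token_chunks = token_chunks  # pre-tokenized, fixed-length chunks

        def __iter__(self):
            yield from self.token_chunks

        def __len__(self):
            return 10  # fake length added for the workaround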
    

    But at the same time, I see the following observations:

    • We cannot run the starcoder-7B model on 1 HPU due to OOM.
    • We can run the 3B model on 1 HPU with no issue fetching the dataset length.
    • We cannot run the 3B model on 8 HPUs (in fact, on more than 1 HPU); it fails with the same "getting size for data type" issue.

    So the issue arises whenever we shift to multiple HPUs, i.e. distributed training on more than 1 HPU.

    A single Gaudi2 HPU has 96 GB of device memory.

    return super().__torch_function__(func, types, new_args, kwargs) RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::3288334336 (3136)MB

    System Info

    Optimum Habana : 1.6.0
    SynapseAI : 1.10.0
    Docker Image : Habana® Deep Learning Base AMI (Ubuntu 20.04)
    Volume : 1000 GiB

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Start an EC2 instance with DL1 Resource and this image : Habana® Deep Learning Base AMI (Ubuntu 20.04)
    Run these commands
    a. docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
    b. git clone https://github.com/huggingface/optimum-habana.git
    c. pip install optimum[habana]
    d. cd examples
    e. cd text-generation
    f. pip install -r requirements.txt
    g. python run_generation.py \
         --model_name_or_path bigscience/bloom \
         --use_hpu_graphs \
         --use_kv_cache \
         --max_new_tokens 100 \
         --do_sample \
         --prompt "Tell me a poem about stone and water"

    Expected behavior

    It should give a sample poem rather than an error

    Repo card metadata block was not found. Setting CardData to empty

    System Info

    optimum-habana          1.8.0
    docker vault.habana.ai/gaudi-docker/1.12.0/ubuntu22.04/habanalabs/pytorch-installer-2.0.1:latest
    Synapse Version 1.12.0-480 
    
    https://github.com/huggingface/optimum-habana/tree/ee5e8fc39e78800eb3763d048192bef036fadc4c/examples/contrastive-image-text
    
    The step in the README fails with "Repo card metadata block was not found. Setting CardData to empty".
    The dataset validation step is pasted below.
    
    import os
    import datasets
    
    COCO_DIR = os.path.join(os.getcwd(), "data")
    ds = datasets.load_dataset("ydshieh/coco_dataset_script", "2017", data_dir=COCO_DIR)

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Just launch the Docker container and follow the README steps; you will see this issue.

    Expected behavior

    I assume the dataset should get loaded; instead, I just see no response.

    Loading from flax checkpoint to resume training with pytorch

    @regisss

    I am trying to resume training from a Flax checkpoint and continue in PyTorch. I changed the following lines of code in the file pytorch/question-answering/run_qa.py:

    model = AutoModelForQuestionAnswering.from_pretrained(
        model_args.model_name_or_path,
        from_flax=True,  # was: from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )
    

    I get the following error

    Traceback (most recent call last):
    File "run_qa.py", line 652, in
    main()
    File "run_qa.py", line 593, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    File "/venv_bert/lib/python3.6/site-packages/transformers/trainer.py", line 1170, in train
    raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
    ValueError: Can't find a valid checkpoint at transformers/examples/flax/question-answering/bert-qa-squad/

    I have tried this on both optimum-habana and the main huggingface/transformers repo. Any advice?
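    One hedged workaround sketch (not verified on this setup): convert the Flax checkpoint to a PyTorch checkpoint once with from_flax=True and save_pretrained, then point --model_name_or_path at the converted directory instead of passing the Flax directory to resume_from_checkpoint, which appears to expect a full Trainer checkpoint (with optimizer/scheduler state) rather than a bare Flax model directory. The directory names below are placeholders taken from the traceback above.

    # Hedged one-time conversion sketch (requires flax to be installed);
    # the directory names are placeholders taken from the traceback above.
    from transformers import AutoConfig, AutoModelForQuestionAnswering

    flax_dir = "transformers/examples/flax/question-answering/bert-qa-squad/"
    config = AutoConfig.from_pretrained(flax_dir)
    model = AutoModelForQuestionAnswering.from_pretrained(flax_dir, from_flax=True, config=config)
    model.save_pretrained("bert-qa-squad-pt")  # writes a PyTorch checkpoint + config.json

    # Then run run_qa.py with --model_name_or_path bert-qa-squad-pt and without
    # --resume_from_checkpoint, since there is no optimizer/scheduler state to resume.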

    Beam search transformers test cases are failing with KeyError: 'limit_hpu_graphs'

    System Info

    After the recent integration of Transformers test cases into optimum-habana, it was observed that several beam-search-related test cases are failing with the following error. The test case below is for T5, but several other language-modeling models such as GPT2, GPT-J, GPT-NeoX, etc. invoke the same test case, so they fail as well.
    
    Logs : 
     <pt> (conda_qnpu1) (anneog_transformers_tests_updates) anneog@anneog-vm-u20:t5 $ python -m pytest -vs test_modeling_t5.py::T5ModelTest::test_beam_search_generate
    ============================================================================================= test session starts =============================================================================================
    platform linux -- Python 3.8.18, pytest-7.4.2, pluggy-1.3.0 -- /home/anneog/anaconda3/envs/conda_qnpu1/bin/python
    cachedir: .pytest_cache
    rootdir: /home/anneog/github/ankurneog/optimum-habana
    configfile: setup.cfg
    collecting ... [WARNING|utils.py:179] 2023-10-04 06:56:49,584 >> optimum-habana v1.8.0.dev0 has been validated for SynapseAI v1.11.0 but habana-frameworks v1.13.0.133 was found, this could lead to undefined behavior!
    [WARNING|utils.py:196] 2023-10-04 06:56:49,606 >> Could not run `hl-smi`, please follow the installation guide: https://docs.habana.ai/en/latest/Installation_Guide/index.html.
    collected 1 item                                                                                                                                                                                              
    
    test_modeling_t5.py::T5ModelTest::test_beam_search_generate ============================= HABANA PT BRIDGE CONFIGURATION =========================== 
     PT_HPU_LAZY_MODE = 1
     PT_RECIPE_CACHE_PATH = 
     PT_CACHE_FOLDER_DELETE = 0
     PT_HPU_RECIPE_CACHE_CONFIG = 
     PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
     PT_HPU_LAZY_ACC_PAR_MODE = 1
     PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
    ---------------------------: System Configuration :---------------------------
    Num CPU Cores : 8
    CPU RAM       : 40852220 KB
    ------------------------------------------------------------------------------
    FAILED
    
    ================================================================================================== FAILURES ===================================================================================================
    ____________________________________________________________________________________ T5ModelTest.test_beam_search_generate ____________________________________________________________________________________
    
    self = <tests.models.t5.test_modeling_t5.T5ModelTest testMethod=test_beam_search_generate>
    
        def test_beam_search_generate(self):
            for model_class in self.all_generative_model_classes:
                config, input_ids, attention_mask, max_length = self._get_input_ids_and_config()
        
                # It is important set set the eos_token_id to None to ensure that no sequences
                # shorter than `max_length` can be generated which could lead to flaky circle ci
                # failures if the top `num_return_sequences` beams are all shorter than the longest beam
                config.eos_token_id = None
                config.forced_eos_token_id = None
        
                model = model_class(config).to(torch_device).eval()
                if model.config.is_encoder_decoder:
                    max_length = 4
        
                logits_process_kwargs, logits_processor = self._get_logits_processor_and_kwargs(
                    input_ids.shape[-1],
                    config.eos_token_id,
                    config.forced_bos_token_id,
                    config.forced_eos_token_id,
                    max_length,
                )
                beam_kwargs, beam_scorer = self._get_beam_scorer_and_kwargs(input_ids.shape[0], max_length)
        
                # check `generate()` and `beam_search()` are equal
    >           output_generate, output_beam_search = self._beam_search_generate(
                    model=model,
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    max_length=max_length,
                    beam_scorer=beam_scorer,
                    beam_kwargs=beam_kwargs,
                    logits_process_kwargs=logits_process_kwargs,
                    logits_processor=logits_processor,
                )
    
    ../../generation/test_utils.py:881: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    ../../generation/test_utils.py:422: in _beam_search_generate
        output_beam_search = model.beam_search(
    ../../../../../optimum/habana/transformers/generation/utils.py:1995: in beam_search
        hpu_graphs_kwargs = self._get_hpu_graphs_kwargs(model_kwargs)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    
    self = T5ForConditionalGeneration(
      (shared): Embedding(99, 32)
      (encoder): T5Stack(
        (embed_tokens): Embedding(99, 32)
    ...m()
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (lm_head): Linear(in_features=32, out_features=99, bias=False)
    )
    model_kwargs = {'encoder_outputs': BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[-4.7808e-04, -6.3646e-04, -2...    grad_fn=<IndexSelectBackward0>), past_key_values=None, hidden_states=None, attentions=None, cross_attentions=None)}
    
        def _get_hpu_graphs_kwargs(self, model_kwargs):
            hpu_graphs_kwargs = {}
    >       if model_kwargs["limit_hpu_graphs"]:
    E       KeyError: 'limit_hpu_graphs'
    
    ../../../../../optimum/habana/transformers/generation/utils.py:141: KeyError
    
    @p9olisettyvarma, could you have a look? I think we should modify the code so that the key is not accessed when it is not present in the dictionary, e.g. check whether the key is in model_kwargs and otherwise return hpu_graphs_kwargs with default values (see the sketch below).
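    A hedged sketch of that guard (the surrounding logic in optimum/habana/transformers/generation/utils.py is elided and may differ):

    # Use dict.get() with a default instead of indexing, so callers that never
    # set "limit_hpu_graphs" (e.g. the transformers tests) do not hit KeyError.
    def _get_hpu_graphs_kwargs(self, model_kwargs):
        hpu_graphs_kwargs = {}
        if model_kwargs.get("limit_hpu_graphs", False):  # default: HPU graphs not limited
            ...  # existing logic for the limited-HPU-graphs case stays unchanged
        return hpu_graphs_kwargs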

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. Clone optimum-habana
    2. pip install pytest
    3. cd optimum-habana/tests/transformers/tests/model/t5
    4. python -m pytest -vs test_modeling_t5.py::T5ModelTest::test_beam_search_generate

    Expected behavior

    The test should pass.

    Inconsistent argument bf16/bf16_full_eval with json file

    With gaudi_config.json

    {
      "execution_mode": "lazy",
      "use_habana_mixed_precision": true,
      "world_size": 8,
      "use_fused_adam": true,
      "use_fused_clip_norm": true
    }

    The bf16/bf16_full_eval arguments are still False:

    04/05/2022 15:19:13 - INFO - main - Training/evaluation parameters GaudiTrainingArguments(
    _n_gpu=1,
    adafactor=False,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    bf16=False,
    bf16_full_eval=False,
    dataloader_drop_last=False,
    dataloader_num_workers=0,
    dataloader_pin_memory=True,
    ddp_bucket_cap_mb=None,

    Several links in the doc are broken

    For example https://huggingface.co/docs/optimum.habana/main/en/trainer#optimum.habana.GaudiTrainer in https://huggingface.co/docs/optimum/main/en/habana_single_hpu
    Or https://github.com/huggingface/optimum.habana/blob/main/optimum/habana/trainer.py#L97 in https://huggingface.co/docs/optimum/main/en/habana_trainer#optimum.habana.GaudiTrainer

    The first one seems to be broken because it uses optimum.habana instead of optimum, and the path main/en/trainer instead of main/en/habana_trainer.

    The second one is broken because it uses optimum.habana instead of optimum-habana in the GitHub path.

    @regisss

    Adding profiling

    Feature request

    Habana supports the PyTorch profiler, but currently, if users want to figure out the bottleneck of training or inference, they have to modify the source code (adding torch.profiler into GaudiTrainer or generate), which is not a good user experience.
    If we supported torch.profiler in optimum-habana, users could generate profiling data simply with --do_profiling 10, where:

    • the default value of do_profiling is 0, which means no profiling data is generated
    • 10 is the number of steps or iterations that will be captured

    Refer to profiling-with-pytorch. A minimal sketch of the underlying torch.profiler usage is shown below.
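    A hedged sketch of the torch.profiler usage that the proposed --do_profiling flag would wrap (the flag itself does not exist yet; the function and directory names below are illustrative):

    # Hedged sketch: profile a fixed number of steps and dump a TensorBoard trace.
    from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

    def run_with_profiling(step_fn, dataloader, do_profiling=10, logdir="./hpu_profile"):
        prof_schedule = schedule(wait=0, warmup=1, active=do_profiling, repeat=1)
        with profile(
            activities=[ProfilerActivity.CPU],  # an HPU activity may also be available via habana_frameworks
            schedule=prof_schedule,
            on_trace_ready=tensorboard_trace_handler(logdir),
        ) as prof:
            for step, batch in enumerate(dataloader):
                step_fn(batch)   # one training or inference step
                prof.step()      # advance the profiler schedule
                if step >= do_profiling:
                    break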

    Motivation

    • easy to address the bottleneck of model both for training and inference

    Your contribution

    submit a PR

    Does it make sense to also provide an option of max input tokens for text generation ?

    Feature request

    Right now we tokenize as below, and the input IDs are padded based on the input sentences in the batch.
    input_tokens = tokenizer.batch_encode_plus(input_sentences, return_tensors="pt", padding=True)

    If the next batch of data exceeds the previous input token length, it will trigger recompilation.

    Would it not be better to also introduce a max input length argument and tokenize as below, padding to max_length with truncation set to True:

    input_tokens = tokenizer.batch_encode_plus(batch_sentences, padding='max_length', truncation=True, max_length=args.max_input_length)

    Motivation

    Avoid graph recompilations.

    Your contribution

    Yes, if we agree.

    Fine-tuning BERT model without Trainer

    Hello,

    I have a custom model that I've incorporated BERT into. Is it possible to train this model using a normal training loop?

    Example:

    def training_loop(dataloader, model1):
        device = torch.device('hpu')
        model1 = model1.to(device)
        model2 = AutoModel.from_pretrained('bert-base-uncased').to(device)
        custom_model = some_wrapper(model1, model2)
        for batch in dataloader:
            batch = batch.to(device)
            output = custom_model(batch)
        
        ...
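    For reference, a hedged sketch of what such a loop typically needs on HPU in lazy mode; the wrapper below is a stand-in for the custom model, and the key HPU-specific addition is htcore.mark_step() after backward and after the optimizer step:

    # Hedged sketch of a plain training loop on HPU (lazy mode); the wrapper,
    # loss and batch layout are assumptions, not a validated recipe.
    import torch
    import habana_frameworks.torch.core as htcore  # registers 'hpu' and provides mark_step()
    from transformers import AutoModel

    class SomeWrapper(torch.nn.Module):
        """Placeholder wrapper combining the custom model and BERT; assumed to return a loss."""
        def __init__(self, model1, model2):
            super().__init__()
            self.model1, self.model2 = model1, model2

        def forward(self, **batch):
            hidden = self.model2(**batch).last_hidden_state
            return self.model1(hidden)  # assumed to return a scalar loss

    def training_loop(dataloader, model1):
        device = torch.device("hpu")
        model2 = AutoModel.from_pretrained("bert-base-uncased")
        custom_model = SomeWrapper(model1, model2).to(device)
        optimizer = torch.optim.AdamW(custom_model.parameters(), lr=2e-5)
        custom_model.train()
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = custom_model(**batch)
            loss.backward()
            htcore.mark_step()   # flush the accumulated lazy graph before the optimizer step
            optimizer.step()
            optimizer.zero_grad()
            htcore.mark_step()   # flush the optimizer update as well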
    

    htcore issue on "text-generation-inference" server with "langchain" client

    System Info

    optimum-habana                         1.6.1
    text-generation                        0.6.0
    text-generation-server                 0.9.2
    langchain                              0.0.265

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Step to reproduce:

    1. Start the text-generation-server following the instructions at:
      https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference
    2. Launch bash in the Docker container.
    3. Inside the container, run:
      pip install langchain text-generation
    4. Run the following script with Python:

    from langchain import HuggingFaceTextGenInference

    llm = HuggingFaceTextGenInference(
        inference_server_url="http://127.0.0.1:80",
        max_new_tokens=64,
        top_k=10,
        top_p=0.95,
        typical_p=0.95,
        temperature=0.01,
        repetition_penalty=1.03,
    )
    output = llm("What is Machine Learning?")

    print(output)

    5. Notice the following errors.
      Client side:

    Traceback (most recent call last):
    File "/root/langchain_client/langchain-client.py", line 12, in
    output = llm("What is Machine Learning?")
    File "/usr/local/lib/python3.10/dist-packages/langchain/llms/base.py", line 802, in call
    self.generate(
    File "/usr/local/lib/python3.10/dist-packages/langchain/llms/base.py", line 598, in generate
    output = self._generate_helper(
    File "/usr/local/lib/python3.10/dist-packages/langchain/llms/base.py", line 504, in _generate_helper
    raise e
    File "/usr/local/lib/python3.10/dist-packages/langchain/llms/base.py", line 491, in _generate_helper
    self._generate(
    File "/usr/local/lib/python3.10/dist-packages/langchain/llms/base.py", line 977, in _generate
    self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/langchain/llms/huggingface_text_gen_inference.py", line 164, in _call
    res = self.client.generate(prompt, **invocation_params)
    File "/usr/local/lib/python3.10/dist-packages/text_generation/client.py", line 149, in generate
    raise parse_error(resp.status_code, payload)
    text_generation.errors.GenerationError: Request failed during generation: Server error: name 'htcore' is not defined

    Server side:

    Traceback (most recent call last):
    File "/usr/local/bin/text-generation-server", line 8, in
    sys.exit(app())
    File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in call
    return get_command(self)(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in call
    return self.main(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
    return _main(
    File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params) # type: ignore
    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 66, in serve
    server.serve(model_id, revision, dtype, uds_path)
    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 161, in serve
    asyncio.run(serve_inner(model_id, revision, dtype))
    File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
    File "/usr/lib/python3.10/asyncio/base_events.py", line 633, in run_until_complete
    self.run_forever()
    File "/usr/lib/python3.10/asyncio/base_events.py", line 600, in run_forever
    self._run_once()
    File "/usr/lib/python3.10/asyncio/base_events.py", line 1896, in _run_once
    handle._run()
    File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
    File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(

    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
    File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
    File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 76, in Prefill
    generations, next_batch = self.model.generate_token(batch)
    File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 557, in generate_token
    next_token_id, logprobs = next_token_chooser(all_input_ids.view(1, -1), logits[-1:, :])
    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/utils/tokens.py", line 65, in call
    scores, next_logprob = self.static_warper(scores)
    File "/usr/local/lib/python3.10/dist-packages/text_generation_server/utils/logits_process.py", line 51, in call
    self.hpu_graph = htcore.hpu.HPUGraph()
    NameError: name 'htcore' is not defined

    Expected behavior

    The expected behavior should run smoothly and provide the output text.

    Readme suggestions

    1. Can you add "(HPU)" after Habana's Gaudi processor?
      🤗 Optimum Habana is the interface between the 🤗 Transformers library and Habana's Gaudi processor.

    2. Move the following ops to the bf16 list:
      "truediv",
      "div",
      "softmax"

    3. Add a section with the recommended training parameters for the 4 models

    No support for optimum-habana pipeline() causes error during inference for PyTorch BERT finetuned model using dtype bf16

    System Info

    optimum-habana 1.5.0
    docker version 1.9.0
    pytorch version 1.13.1

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    During inference of BERT (bert-large-uncased), fine-tuned on the Financial PhraseBank dataset with the bf16 data type, an error occurs.

    The fine-tuning on Gaudi (HPU) is done with the help of the optimum-habana library.

    The transformers library (4.28.1) and supporting libraries are installed as part of the optimum-habana installation.

    The fine-tuning works well for both data types (bf16 and fp32). Inference works well with the fp32 data type, but when inference is done with bf16, it results in an error.

    The fine-tuning code is in the finbert.py file below.

    import sys
    import subprocess
    
    subprocess.check_call([sys.executable, '-m', 'pip', 'install',
                           'numpy', 'pandas', 'scikit-learn', 'datasets', 'optimum[habana]', '--user'])
    
    import pandas as pd
    import numpy as np
    from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification
    from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from datasets import Dataset
    
    
    def load_data():
        df = pd.read_csv(
            'FinancialPhraseBank-v1.0/Sentences_50Agree.txt',
            sep='@',
            names=['sentence', 'label'],
            encoding = "ISO-8859-1")
        df = df.dropna()
        df['label'] = df['label'].map({"neutral": 0, "positive": 1, "negative": 2})
        df.head()
    
        df_train, df_test, = train_test_split(df, stratify=df['label'], test_size=0.1, random_state=42)
        df_train, df_val = train_test_split(df_train, stratify=df_train['label'],test_size=0.1, random_state=42)
    
        dataset_train = Dataset.from_pandas(df_train, preserve_index=False)
        dataset_val = Dataset.from_pandas(df_val, preserve_index=False)
        dataset_test = Dataset.from_pandas(df_test, preserve_index=False)
    
        return dataset_train, dataset_val, dataset_test
    
    
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        return {'accuracy': accuracy_score(predictions, labels)}
    
    
    def main():
        dataset_train, dataset_val, dataset_test = load_data()
    
        bert_model = AutoModelForSequenceClassification.from_pretrained('bert-large-uncased', num_labels=3)
        bert_tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased')
    
        dataset_train = dataset_train.map(lambda e: bert_tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
        dataset_val = dataset_val.map(lambda e: bert_tokenizer(e['sentence'], truncation=True, padding='max_length', max_length=128), batched=True)
        dataset_test = dataset_test.map(lambda e: bert_tokenizer(e['sentence'], truncation=True, padding='max_length' , max_length=128), batched=True)
    
        dataset_train.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
        dataset_val.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
        dataset_test.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
    
        args = GaudiTrainingArguments(
            output_dir='temp/',
            overwrite_output_dir=True,
            evaluation_strategy='epoch',
            save_strategy='no',
            logging_strategy='epoch',
            logging_dir='logs/',
            report_to='tensorboard',
    
            learning_rate=2e-5,
            per_device_train_batch_size=8,
            per_device_eval_batch_size=4,
            num_train_epochs=5,
            weight_decay=0.01,
            metric_for_best_model='accuracy',
    
            use_habana=True,                        # use Habana device
            use_lazy_mode=True,                     # use Gaudi lazy mode
            use_hpu_graphs=True,                    # set value for hpu_graphs
            gaudi_config_name='gaudi_config.json',  # load config file
        )
    
        trainer = GaudiTrainer(
            model=bert_model,                   # the instantiated 🤗 Transformers model to be trained
            args=args,                          # training arguments, defined above
            train_dataset=dataset_train,        # training dataset
            eval_dataset=dataset_val,           # evaluation dataset
            compute_metrics=compute_metrics
        )
    
        trainer.train()   
    
    
    if __name__ == '__main__':
        main()

    It also needs a gaudi_config.json file which has details for bf16 dtype training.
    The gaudi_config.json file is:

    {
      "use_habana_mixed_precision": true,
      "hmp_is_verbose": false,
      "use_fused_adam": true,
      "use_fused_clip_norm": true,
      "hmp_bf16_ops": [
        "add",
        "addmm",
        "bmm",
        "div",
        "dropout",
        "gelu",
        "iadd",
        "linear",
        "layer_norm",
        "matmul",
        "mm",
        "rsub",
        "softmax",
        "truediv"
      ],
      "hmp_fp32_ops": [
        "embedding",
        "nll_loss",
        "log_softmax",
        "cross_entropy"
      ]
    }

    Note: Keep both the finbert.py and gaudi_config.json files in the same folder.

    Run it with the following command:
    export MASTER_ADDR="localhost"
    export MASTER_PORT="12345"
    mpirun -n 8 --bind-to core --map-by socket:PE=4 --rank-by core --report-bindings --allow-run-as-root python finbert.py

    Note: It can also be fine-tuned on 1 card for debugging purposes.

    After completing the fine-tuning with the bf16 dtype, running either inference code-1 or code-2 below results in an error.

    Inference code-1:

    import torch
    from transformers import pipeline

    # bert_model and bert_tokenizer come from the fine-tuning script above
    device = torch.device('hpu')
    pipe = pipeline("text-classification", model=bert_model, tokenizer=bert_tokenizer, device=device)
    print(pipe("Alabama Takes From the Poor and Gives to the Rich"))
    print(pipe("Economists are predicting the highest rate of employment in 15 years"))

    Inference code-2:

    import torch
    from transformers import TextClassificationPipeline

    # bert_model and bert_tokenizer come from the fine-tuning script above
    pipe = TextClassificationPipeline(model=bert_model, tokenizer=bert_tokenizer)
    pipe.device = torch.device('hpu')
    print(pipe("Alabama Takes From the Poor and Gives to the Rich"))
    print(pipe("Economists are predicting the highest rate of employment in 15 years"))

    Error seen after inference:

    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    File /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:1146, in _LazyModule._get_module(self, module_name)
       1145 try:
    -> 1146     return importlib.import_module("." + module_name, self.__name__)
       1147 except Exception as e:File /usr/lib/python3.8/importlib/__init__.py:127, in import_module(name, package)
        126         level += 1
    --> 127 return _bootstrap._gcd_import(name[level:], package, level)File <frozen importlib._bootstrap>:1014, in _gcd_import(name, package, level)File <frozen importlib._bootstrap>:991, in _find_and_load(name, import_)File <frozen importlib._bootstrap>:975, in _find_and_load_unlocked(name, import_)File <frozen importlib._bootstrap>:671, in _load_unlocked(spec)File <frozen importlib._bootstrap_external>:848, in exec_module(self, module)File <frozen importlib._bootstrap>:219, in _call_with_frames_removed(f, *args, **kwds)File /usr/local/lib/python3.8/dist-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py:56
         51 # Fused kernels
         52 # Use separate functions for each case because conditionals prevent kernel fusion.
         53 # TODO: Could have better fused kernels depending on scaling, dropout and head mask.
         54 #  Is it doable without writing 32 functions?
         55 @torch.jit.script
    ---> 56 def upcast_masked_softmax(
         57     x: torch.Tensor, mask: torch.Tensor, mask_value: torch.Tensor, scale: float, softmax_dtype: torch.dtype
         58 ):
         59     input_dtype = x.dtypeFile /usr/local/lib/python3.8/dist-packages/torch/jit/_script.py:1343, in script(obj, optimize, _frames_up, _rcb, example_inputs)
       1342     _rcb = _jit_internal.createResolutionCallbackFromClosure(obj)
    -> 1343 fn = torch._C._jit_script_compile(
       1344     qualified_name, ast, _rcb, get_default_args(obj)
       1345 )
       1346 # Forward docstringsFile /usr/local/lib/python3.8/dist-packages/torch/jit/_recursive.py:863, in try_compile_fn(fn, loc)
        862 rcb = _jit_internal.createResolutionCallbackFromClosure(fn)
    --> 863 return torch.jit.script(fn, _rcb=rcb)File /usr/local/lib/python3.8/dist-packages/torch/jit/_script.py:1343, in script(obj, optimize, _frames_up, _rcb, example_inputs)
       1342     _rcb = _jit_internal.createResolutionCallbackFromClosure(obj)
    -> 1343 fn = torch._C._jit_script_compile(
       1344     qualified_name, ast, _rcb, get_default_args(obj)
       1345 )
       1346 # Forward docstringsRuntimeError: 
    Unknown type name 'DType':
      File "/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/hpex/hmp/utils.py", line 1811
    def softmax(input: Tensor, dim: Optional[int] = None, _stacklevel: int = 3, dtype: Optional[DType] = None) -> Tensor:
                                                                                                ~~~~~ <--- HERE
        r"""Applies a softmax function.
    'softmax' is being compiled since it was called from 'upcast_masked_softmax'
      File "/usr/local/lib/python3.8/dist-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py", line 62
        x = x.to(softmax_dtype) * scale
        x = torch.where(mask, x, mask_value)
        x = torch.nn.functional.softmax(x, dim=-1).to(input_dtype)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return x
    The above exception was the direct cause of the following exception:RuntimeError                              Traceback (most recent call last)
    Cell In[26], line 3
          1 from transformers import pipeline
          2 device=torch.device('hpu')
    ----> 3 pipe = pipeline("text-classification", model=trainer.model, tokenizer=bert_tokenizer, device=device)
          4 #pipe = TextClassificationPipeline(model=bert_model, tokenizer=bert_tokenizer)
          5 #pipe = TextClassificationPipeline(model=bert_model, tokenizer=bert_tokenizer)
          6 #pipe.device=torch.device('hpu')
          8 print(pipe("Alabama Takes From the Poor and Gives to the Rich"))File /usr/local/lib/python3.8/dist-packages/transformers/pipelines/__init__.py:979, in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
        976 if device is not None:
        977     kwargs["device"] = device
    --> 979 return pipeline_class(model=model, framework=framework, task=task, **kwargs)File /usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_classification.py:85, in TextClassificationPipeline.__init__(self, **kwargs)
         82 def __init__(self, **kwargs):
         83     super().__init__(**kwargs)
    ---> 85     self.check_model_type(
         86         TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING
         87         if self.framework == "tf"
         88         else MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING
         89     )File /usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py:942, in Pipeline.check_model_type(self, supported_models)
        940 if not isinstance(supported_models, list):  # Create from a model mapping
        941     supported_models_names = []
    --> 942     for config, model in supported_models.items():
        943         # Mapping can now contain tuples of models for the same configuration.
        944         if isinstance(model, tuple):
        945             supported_models_names.extend([_model.__name__ for _model in model])File /usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py:644, in _LazyAutoMapping.items(self)
        643 def items(self):
    --> 644     mapping_items = [
        645         (
        646             self._load_attr_from_module(key, self._config_mapping[key]),
        647             self._load_attr_from_module(key, self._model_mapping[key]),
        648         )
        649         for key in self._model_mapping.keys()
        650         if key in self._config_mapping.keys()
        651     ]
        652     return mapping_items + list(self._extra_content.items())File /usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py:647, in <listcomp>(.0)
        643 def items(self):
        644     mapping_items = [
        645         (
        646             self._load_attr_from_module(key, self._config_mapping[key]),
    --> 647             self._load_attr_from_module(key, self._model_mapping[key]),
        648         )
        649         for key in self._model_mapping.keys()
        650         if key in self._config_mapping.keys()
        651     ]
        652     return mapping_items + list(self._extra_content.items())File /usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py:616, in _LazyAutoMapping._load_attr_from_module(self, model_type, attr)
        614 if module_name not in self._modules:
        615     self._modules[module_name] = importlib.import_module(f".{module_name}", "transformers.models")
    --> 616 return getattribute_from_module(self._modules[module_name], attr)File /usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py:561, in getattribute_from_module(module, attr)
        559 if isinstance(attr, tuple):
        560     return tuple(getattribute_from_module(module, a) for a in attr)
    --> 561 if hasattr(module, attr):
        562     return getattr(module, attr)
        563 # Some of the mappings have entries model_type -> object of another model type. In that case we try to grab the
        564 # object at the top level.File /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:1136, in _LazyModule.__getattr__(self, name)
       1134     value = self._get_module(name)
       1135 elif name in self._class_to_module.keys():
    -> 1136     module = self._get_module(self._class_to_module[name])
       1137     value = getattr(module, name)
       1138 else:File /usr/local/lib/python3.8/dist-packages/transformers/utils/import_utils.py:1148, in _LazyModule._get_module(self, module_name)
       1146     return importlib.import_module("." + module_name, self.__name__)
       1147 except Exception as e:
    -> 1148     raise RuntimeError(
       1149         f"Failed to import {self.__name__}.{module_name} because of the following error (look up to see its"
       1150         f" traceback):\n{e}"
       1151     ) from eRuntimeError: Failed to import transformers.models.gpt_bigcode.modeling_gpt_bigcode because of the following error (look up to see its traceback):Unknown type name 'DType':
      File "/usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/hpex/hmp/utils.py", line 1811
    def softmax(input: Tensor, dim: Optional[int] = None, _stacklevel: int = 3, dtype: Optional[DType] = None) -> Tensor:
                                                                                                ~~~~~ <--- HERE
        r"""Applies a softmax function.
    'softmax' is being compiled since it was called from 'upcast_masked_softmax'
      File "/usr/local/lib/python3.8/dist-packages/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py", line 62
        x = x.to(softmax_dtype) * scale
        x = torch.where(mask, x, mask_value)
        x = torch.nn.functional.softmax(x, dim=-1).to(input_dtype)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return x

    Expected behavior

    Inference is expected to work in bf16 just as it does with the fp32 dtype.

    The output shown below is expected:
    [{'label': 'neutral', 'score': 0.9094224572181702}]
    [{'label': 'positive', 'score': 0.9752092957496643}]

    Will Hugging Face support GLM series models (ChatGLM-6B, ChatGLM2-6B ...) in Transformers?

    Feature request

    We plan to enable the ChatGLM-6B and ChatGLM2-6B series models on HPU but found that the models are not in Transformers.

    Model definitions and weights are all on Hugging Face model card:
    ChatGLM-6B: https://huggingface.co/THUDM/chatglm-6b
    ChatGLM2-6B: https://huggingface.co/THUDM/chatglm2-6b

    Is it possible to support these models in Transformers so that optimum-habana could do some hijacking and then enable them on HPU?

    Motivation

    enable ChatGLM-6B, ChatGLM2-6B on HPU

    Your contribution

    enable ChatGLM-6B, ChatGLM2-6B on HPU

    TGI server keeps repeating short words in the output.

    System Info

    optimum-habana                         1.7.2
    text-generation                        0.6.0
    text-generation-server                 1.0.3
    langchain                              0.0.279

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Step to reproduce:

    1. Start the text-generation-server following the instructions at:
      https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference
    2. Launch bash in the Docker container.
    3. Inside the container, run:
      pip install langchain text-generation
    4. Run the following script (lclient.py) with Python:

    from langchain import HuggingFaceTextGenInference

    llm = HuggingFaceTextGenInference(
        inference_server_url="http://127.0.0.1:80",
        max_new_tokens=64,
        top_k=10,
        top_p=0.95,
        typical_p=0.95,
        temperature=0.01,
        repetition_penalty=1.03,
    )
    output = llm("What is Machine Learning?")

    print(output)

    5. Notice: on the client side, for the first run, the output is as follows, which is OK.

    root@653f39f1fd71:/text-gen/client# python lclient.py

    Learn More
    The purpose of this article is to show you how to use the machine learning algorithm to solve a problem. The goal of this article is to demonstrate that the machine learning algorithm can solve many problems. Learn More
    The first section of our book is about general computer programs and their usage. We will cover

    Then, after the first run, the output is as follows.

    root@653f39f1fd71:/text-gen/client# python lclient.py
    explain give cover give then then give then give then cover cover give explain discuss explain discuss cover then give discuss give cover then give then explain give then explain explain discuss explain cover give give discuss cover explain then discuss cover discuss then cover explain discuss then explain explain then cover give cover explain discuss cover give explain explain give cover cover discuss

    root@653f39f1fd71:/text-gen/client# python lc.py
    give give explain explain explain then cover then cover explain give then give then discuss explain cover give give give give then then explain then cover explain cover then discuss cover cover explain then give cover then cover then discuss then cover cover give explain discuss then discuss cover cover cover explain explain explain then then explain give give cover discuss give give then

    root@653f39f1fd71:/text-gen/client# python lc.py
    cover give discuss explain explain cover then explain then explain give discuss explain discuss discuss discuss give give then then give cover cover explain cover then give then then give then discuss discuss then discuss explain then explain then give give then explain explain cover discuss discuss cover discuss discuss discuss discuss then discuss cover discuss then explain discuss give then cover give discuss

    Expected behavior

    The second client request should yield similar expected output to the first.

    Adaptive output and contextual dialogue capabilities of text-generation-inference

    System Info

    System Info
    HL-SMI Version: hl-1.11.0-fw-45.1.1.1
    Driver Version: 1.11.0-e6eb0fd

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    I deploy the Llama-2-7b-chat-hf model through text-generation-inference, but there is no adaptive output when using the following command; instead, the input and output sizes are always max_new_tokens.

    curl 127.0.0.1:8080/generate_stream -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":200}}'     -H 'Content-Type: application/json'
    

    Also, how can chat functionality with context be implemented? Similar to GPT-4, which adaptively outputs appropriate content and can carry on a dialogue with context.
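    For the dialogue-with-context part, a hedged sketch of what clients usually do with the /generate endpoint: keep the history client-side and resend it with each request. The [INST] formatting below follows the Llama-2 chat convention and is an assumption, not something the server enforces; the URL reuses the port from the command above.

    # Hedged sketch: keep dialogue context client-side and resend it on each call.
    import requests

    TGI_URL = "http://127.0.0.1:8080/generate"
    history = []  # list of (user, assistant) turns

    def chat(user_message, max_new_tokens=200):
        prompt = ""
        for user, assistant in history:
            prompt += f"[INST] {user} [/INST] {assistant} "
        prompt += f"[INST] {user_message} [/INST]"
        resp = requests.post(
            TGI_URL,
            json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
            timeout=300,
        )
        answer = resp.json()["generated_text"]
        history.append((user_message, answer))
        return answer

    print(chat("What is Deep Learning?"))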

    Expected behavior

    1. adaptive output
    2. dialogue with context

    StableDiffusion v2.1 produces incorrect images

    System Info

    diffusers==0.23.1
    habana-torch-plugin==1.13.0.463
    optimum==1.14.1
    optimum-habana==1.8.1
    transformers==4.34.1
    optimum-habana repo (examples) on main (which is c7eb594aa9eaf45ef8e4ac8f4a20d0038be50aa6)

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. git clone https://github.com/huggingface/optimum-habana.git /root/optimum-habana
    2. pip install optimum[habana]
    3. cd /root/optimum-habana/examples/stable-diffusion/
    4. python text_to_image_generation.py --model_name_or_path stabilityai/stable-diffusion-2-1 --prompts "a professional photograph of an astronaut riding a horse" --num_images_per_prompt 4 --batch_size 1 --height 768 --width 768 --image_save_dir stable_diffusion_images --use_habana --use_hpu_graphs --gaudi_config Habana/stable-diffusion-2

    Expected behavior

    Generated images are incorrect. An astronaut riding a horse is expected, but the first output image shows what appears to be a construction site instead.

    setup.py needs to be updated to 0.22.0 for accelerate

    System Info

    optimum habana: 1.8.0-dev0
    transformers: 4.32.1

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    1. pip install .
    2. cd examples/text-generation
    3. pip install -r requirements.txt
    4. python run_generation.py
      --model_name_or_path gpt2
      --use_hpu_graphs
      --use_kv_cache
      --max_new_tokens 100
      --do_sample
      --prompt "Here is my prompt"

    Expected behavior

    The script should run successfully, but the current result is the error "cannot import name 'AutocastKwargs' from ...

    Example failing on AWS Habana instance - cannot import name cached_path

    When training a model from scratch, configuration values may be overridden with the help of --config_overrides:

    python run_clm.py \
        --model_type gpt2 \
        --tokenizer_name gpt2 \
        --config_overrides="n_embd=1024,n_head=16,n_layer=48,n_positions=102" \
        --use_habana \
        --use_lazy_mode \
        --gaudi_config_name Habana/gpt2 \
        --throughput_warmup_steps 2

    output:

    Traceback (most recent call last):
      File "run_clm.py", line 36, in <module>
        from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/__init__.py", line 19, in <module>
        from .gaudi_configuration import GaudiConfig
      File "/usr/local/lib/python3.8/dist-packages/optimum/habana/gaudi_configuration.py", line 17, in <module>
        from optimum.configuration_utils import BaseConfig
      File "/usr/local/lib/python3.8/dist-packages/optimum/configuration_utils.py", line 25, in <module>
        from transformers.file_utils import cached_path, get_list_of_files, hf_bucket_url, is_offline_mode, is_remote_url
    ImportError: cannot import name 'cached_path' from 'transformers.file_utils' (/home/ubuntu/.local/lib/python3.8/site-packages/transformers/file_utils.py)
    Fatal error condition occurred in /opt/vcpkg/buildtrees/aws-c-io/src/9e6648842a-364b708815.clean/source/event_loop.c:72: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
    Exiting Application
    ################################################################################
    Stack trace:
    ################################################################################
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x200af06) [0x7f954e4c4f06]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x20028e5) [0x7f954e4bc8e5]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x1f27e09) [0x7f954e3e1e09]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f954e4c5a3d]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x1f25948) [0x7f954e3df948]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x200ba3d) [0x7f954e4c5a3d]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x1ee0b46) [0x7f954e39ab46]
    /usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.900(+0x194546a) [0x7f954ddff46a]
    /lib/x86_64-linux-gnu/libc.so.6(+0x468a7) [0x7f954fd7f8a7]
    /lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7f954fd7fa60]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7f954fd5d08a]
    python(_start+0x2e) [0x5fc5fe]
    Aborted (core dumped)
    

    Add support for max_length in run_generation

    System Info

    This was caught while executing the Transformers unit tests for optimum-habana:
    https://github.com/huggingface/optimum-habana/blob/main/tests/transformers/tests/models/gpt2/test_modeling_gpt2.py
    
    Several test cases are failing because the config in the tests sets max_length for text generation rather than max_new_tokens.
    
    Hence text generation fails for decoder-only models due to this check (a sketch of one possible fix follows the failure excerpt):
                if not self.config.is_encoder_decoder:
                    # only pad if bucket_size < -1. If we are bucketing (bucket_size > 0), then that is taken care in greedy_search()
                    if not is_greedy_and_bucket:
                        # token_idx is the current index in the generation process, it is incremented each time a new token is generated
                        model_kwargs["token_idx"] = torch.tensor(inputs_tensor.shape[-1], device=inputs_tensor.device)
    >                   inputs_tensor = torch.nn.functional.pad(
                            inputs_tensor, (0, generation_config.max_new_tokens), value=generation_config.pad_token_id
                        )
    E                   TypeError: pad(): argument 'pad' must be tuple of ints, but found element of type NoneType at pos 2
    
    max_new_tokens ends up as None here, which is what the pad() call rejects.
    
    
    FAILED test_modeling_gpt2.py::GPT2ModelTest::test_beam_search_generate - TypeError: pad(): argument 'pad' must be tuple of ints, but found element of type NoneType at pos 2
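    One possible direction, sketched purely as an illustration (the helper name and where it would be called are assumptions, not the library's actual code): derive max_new_tokens from max_length and the prompt length before the static-shape padding, so that pad() receives an integer.

    import torch

    def resolve_max_new_tokens(generation_config, input_ids: torch.Tensor) -> int:
        # Hypothetical helper: fall back to max_length minus the prompt length when
        # only max_length is set, so downstream static-shape padding gets an int.
        if generation_config.max_new_tokens is not None:
            return generation_config.max_new_tokens
        if generation_config.max_length is not None:
            return max(generation_config.max_length - input_ids.shape[-1], 0)
        raise ValueError("Either max_new_tokens or max_length must be set.")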

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    python -m pytest -vs test_modeling_gpt2.py::GPT2ModelTest::test_beam_search_generate

    Expected behavior

    test should pass

    text_generation_launcher: Waiting for shard to be ready... rank=1 forever if we pass --num-shard

    System Info

    model=bigscience/bloom-560m  (same issue with 
    docker run -p 8080:80 -v $volume:/data --runtime=habana --privileged -e HABANA_VISIBLE_DEVICES=all -e HUGGING_FACE_HUB_TOKEN=$token  -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi_gaudi  --model-id $model --num-shard 2
    
    When running without --num-shard it works fine, but then it seems to use only one Gaudi HPU.
    Instance: EC2 DL1 instance

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    Log:
    2023-08-31T19:25:28.057753Z INFO text_generation_launcher: Sharding model on 2 processes
    2023-08-31T19:25:28.057837Z INFO download: text_generation_launcher: Starting download process.
    2023-08-31T19:25:35.148837Z INFO text_generation_launcher: Files are already present on the host. Skipping download.

    2023-08-31T19:25:35.468574Z INFO download: text_generation_launcher: Successfully downloaded weights.
    2023-08-31T19:25:35.468985Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
    2023-08-31T19:25:35.468987Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
    2023-08-31T19:25:45.485025Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
    2023-08-31T19:25:45.485090Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
    2023-08-31T19:25:55.500375Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
    2023-08-31T19:25:55.500375Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
    2023-08-31T19:26:05.515041Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
    2023-08-31T19:26:05.515110Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
    2023-08-31T19:26:11.027830Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

    2023-08-31T19:26:11.123093Z INFO shard-manager: text_generation_launcher: Shard ready in 35.652902762s rank=0
    2023-08-31T19:26:11.312428Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

    2023-08-31T19:26:15.529479Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1

    Expected behavior

    The goal is to run the Llama 2 7B model with TGI and connect it to LangChain. With one HPU the performance is very low, so I am trying to leverage multiple HPUs to improve it.

    When loading datasets with HuggingFace datasets.load_dataset (like cifar10), could it be possible to return the dataset without decoding automatically?

    Feature request

    When loading datasets by HuggingFace datasets.load_dataset like cifar10, could it be possible to return the dataset without decoding automatically?

    Motivation

    According to #189, the scaling efficiency is about 72.6% for Gaudi2 and 79.4% for Gaudi. We found that the efficiency on Gaudi2 is low because of the data loader, so we intend to implement a data loader (especially for Gaudi2) based on the Habana Media Pipeline to perform the decoding, RandomResizedCrop, RandomHorizontalFlip, and Normalize steps.

    As described in the cifar10 dataset data fields, accessing the image column with dataset[0]["image"] automatically decodes the image file, and that decoding runs on the CPU. Could the dataset instead just expose a root path to the image files and let our self-defined data loader do the decoding on HPU? (A minimal sketch of disabling automatic decoding follows.)
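    One workaround that may cover this, sketched under the assumption that the image column in cifar10 is named "img" (check the dataset card): cast the column with datasets.Image(decode=False) so examples return the raw bytes/path instead of a PIL image decoded on the CPU, leaving the actual decode to a custom HPU-side data loader.

    from datasets import Image, load_dataset

    ds = load_dataset("cifar10", split="train")
    # With decode=False, accessing an example yields {"bytes": ..., "path": ...}
    # instead of triggering a CPU-side image decode; the column name is an assumption.
    ds = ds.cast_column("img", Image(decode=False))
    print(ds[0]["img"])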

    Your contribution

    Implement a Habana media-based data loader

    Docker build fails in https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference

    Do we need to upgrade the text-generation-inference code to main?
    See huggingface/text-generation-inference#840.

    I hit the same issue when building the Docker image for the Habana TGI.

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference

    docker build -t tgi_gaudi .

    Expected behavior

    build success

    Bad Performance of text-generation with sampling algo

    System Info

    Ubuntu 20.04

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    With do_sample, the throughput is 450 tps:

    python run_generation.py --model_name_or_path gpt2  --batch_size 1 --max_new_tokens 100 --use_hpu_graphs --prompt "I am a student from" --do_sample

    Without do_sample, the throughput is 900 tps:

    python run_generation.py --model_name_or_path gpt2  --batch_size 1 --max_new_tokens 100 --use_hpu_graphs --prompt "I am a student from"
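    For context, here is an illustration-only sketch of the extra per-token work that sampling adds over greedy decoding; it is not the run_generation.py code path, and it does not establish whether a 2x gap should be expected on HPU.

    import torch

    logits = torch.randn(1, 50257)  # one decoding step of GPT-2-sized vocabulary logits

    # Greedy decoding: a single argmax per generated token.
    greedy_token = logits.argmax(dim=-1)

    # Sampling: a softmax plus a multinomial draw per token (plus any configured
    # top-k/top-p filtering), i.e. extra work on every generation step.
    probs = torch.softmax(logits, dim=-1)
    sampled_token = torch.multinomial(probs, num_samples=1)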

    Expected behavior

    Decent performance with the do_sample algorithm.

    FileNotFoundError: Couldn't find a dataset script

    System Info

    GPT-NeoX text-generation training on HLS2
    
    --branch v1.7.3 (fail)
    --branch v1.7.4 (fail)
    --branch v1.7.5 (fail)
    
    I encountered the following error (a possible cause is sketched after the log link):
    
    FileNotFoundError: Couldn't find a dataset script at /root/optimum-habana/examples/text-generation/wikitext-2-raw-v1/wikitext-2-raw-v1.py or any data file in the same directory. Couldn't find 'wikitext-2-raw-v1' on the Hugging Face Hub either: FileNotFoundError: Dataset 'wikitext-2-raw-v1' doesn't exist on the Hub. If the repo is private or gated, make sure to log in with `huggingface-cli login`.
    
    Full log:
    
    https://logs-browser.k8s-infra.habana-labs.com/files/qa-tester-9-004527355-2642-tfjob/log.txt
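    The error message suggests that the config name "wikitext-2-raw-v1" was passed where the datasets library expects a dataset name or path. If that is indeed the cause, a minimal sketch of how this dataset is normally loaded (outside the example scripts) looks like this:

    from datasets import load_dataset

    # "wikitext" is the Hub dataset id; "wikitext-2-raw-v1" is one of its configurations.
    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    print(ds[0]["text"][:80])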

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    pip install --upgrade-strategy eager optimum[habana]
    git clone https://github.com/huggingface/optimum-habana --branch v1.7.4
    pip install -r requirements.txt
    pip install git+https://github.com/HabanaAI/[email protected]
    python ../gaudi_spawn.py --use_deepspeed --world_size number_of_devices run_generation.py ARGS

    python run_generation.py
    --model_name_or_path gpt2
    --use_hpu_graphs
    --use_kv_cache
    --max_new_tokens 100
    --do_sample
    --prompt "Here is my prompt"

    Expected behavior

    Pass

    installation command needs a change

    System Info

    The following steps do not install all components:
     1. git clone https://github.com/huggingface/optimum-habana.git
     2. cd optimum-habana
     3. python setup.py install
    
    Step 3 needs to be changed to "pip install -e .".

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    git clone https://github.com/huggingface/optimum-habana.git
    cd optimum-habana
    python setup.py install

    Run the t5-small summarization example and it reports that the transformers module cannot be found.

    Expected behavior

    git clone https://github.com/huggingface/optimum-habana.git
    cd optimum-habana
    pip install -e .

    Device Acquire failed

    System Info

    Running this command on a single Gaudi device works very well:
    optimum-habana/examples/language-modeling/run_lora_clm.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --train_file merged_final_ultimate_andy.json \
        --bf16 True \
        --output_dir ./model_lora_llama \
        --num_train_epochs 3 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --evaluation_strategy no \
        --save_strategy steps \
        --save_total_limit 1 \
        --learning_rate 1e-4 \
        --logging_steps 1 \
        --dataset_concatenation \
        --do_train \
        --use_habana \
        --use_lazy_mode \
        --throughput_warmup_steps 1
    
    and it finishes the training on Gaudi.
    
    But trying this command fails:
    python optimum-habana/examples/gaudi_spawn.py \
        --world_size 4 --use_mpi optimum-habana/examples/language-modeling/run_lora_clm.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --train_file merged_final_ultimate_andy.json \
        --bf16 True \
        --output_dir ./model_lora_llama \
        --num_train_epochs 3 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --evaluation_strategy no \
        --save_strategy steps \
        --save_total_limit 1 \
        --learning_rate 1e-4 \
        --logging_steps 1 \
        --dataset_concatenation \
        --do_train \
        --use_habana \
        --use_lazy_mode \
        --throughput_warmup_steps 1
    
    The same failure occurs while using only 2 devices:
    python optimum-habana/examples/gaudi_spawn.py \
        --world_size 2 --use_mpi optimum-habana/examples/language-modeling/run_lora_clm.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --train_file merged_final_ultimate_andy.json \
        --bf16 True \
        --output_dir ./model_lora_llama \
        --num_train_epochs 3 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --evaluation_strategy no \
        --save_strategy steps \
        --save_total_limit 1 \
        --learning_rate 1e-4 \
        --logging_steps 1 \
        --dataset_concatenation \
        --do_train \
        --use_habana \
        --use_lazy_mode \
        --throughput_warmup_steps 1
    
    This command also fails with a failure to acquire a device:
    python optimum-habana/examples/gaudi_spawn.py \
        --world_size 4 --use_deepspeed optimum-habana/examples/language-modeling/run_lora_clm.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --train_file merged_final_ultimate_andy.json \
        --bf16 True \
        --output_dir ./model_lora_llama \
        --num_train_epochs 3 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 4 \
        --evaluation_strategy no \
        --save_strategy steps \
        --save_total_limit 1 \
        --learning_rate 1e-4 \
        --logging_steps 1 \
        --dataset_concatenation \
        --do_train \
        --use_habana \
        --use_lazy_mode \
        --throughput_warmup_steps 1 \
        --deepspeed gaudi_config.json
    I used this config file:
    
    {
        "steps_per_print": 64,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "bf16": {
            "enabled": true
        },
        "gradient_clipping": 1.0,
        "zero_optimization": {
            "stage": 2,
            "overlap_comm": false,
            "reduce_scatter": false,
            "contiguous_gradients": false
        }
    }
    
    Note: I am using 7 devices in my template, which gives me 7 HPUs.
    
    
    For a while I have been facing issues running distributed workloads on multiple Gaudi devices, and I really want to run a 70B Llama model, but for now I am stuck. (A small device-visibility check is sketched below.)
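    As a first sanity check before spawning multiple workers, here is a minimal sketch for confirming how many HPUs the process can actually see; is_available() and device_count() are provided by the Habana PyTorch plugin in recent releases, but treat the exact names as an assumption for your SynapseAI version.

    import habana_frameworks.torch.hpu as hthpu

    # Print the devices visible to this process before launching gaudi_spawn.py.
    print("HPU available:", hthpu.is_available())
    print("Visible HPU count:", hthpu.device_count())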

    Information

    • The official example scripts
    • My own modified scripts

    Tasks

    • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
    • My own task or dataset (give details below)

    Reproduction

    described in info

    Expected behavior

    to work
