aws-neuron / transformers-neuronx
License: Apache License 2.0
When instantiating a model that has already been compiled, one can simply point to the serialized compiled artifacts to avoid recompiling the model.
However, before reaching that point, we must still instantiate the model from the original checkpoint.
This means that, unless I am mistaken, to completely serialize the model we need to store the weights twice: once in the checkpoint and once in the compiled artifacts.
Would it be possible to instantiate a pre-compiled model without the original checkpoint, cutting the storage requirements in half (or more, if the compiled artifacts use a lower precision)?
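To put rough numbers on the potential savings, a back-of-the-envelope sketch (my estimates, not measured sizes):

```python
# Back-of-the-envelope numbers (mine, not measured) for the doubled storage:
# a 7B-parameter model kept both as an fp16 checkpoint and as fp16 compiled
# artifacts stores every weight twice.
params = 7_000_000_000
bytes_per_param = 2                                  # fp16
checkpoint_gb = params * bytes_per_param / 1024**3   # ~13 GB
total_gb = 2 * checkpoint_gb                         # checkpoint + compiled artifacts
print(f"checkpoint ~{checkpoint_gb:.1f} GB, with compiled artifacts ~{total_gb:.1f} GB")
```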
I was previously able to compile llama 2 7B using tensor parallelism on 2 Neuron Cores, with the default n_positions=2048 and a batch_size=1. With transformers-neuronx==0.7.84 and neuronx-cc==2.10.0.34, I get the following error:
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/5c616a4d-b0cc-4c8d-8768-df08facd8aec/model.MODULE_875d0cfa
b1be718dcdb8+8737852b.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/5c616a4d-b0cc-4c8d-8768-df08facd8aec/model.MODULE_875d0cfab1be718dcdb8+8737852b.neff', '--model-type=transformer', '--
model-type=transformer', '--verbose=35']: 2023-09-18T14:15:06Z Too many instructions after unroll for function sg0000 !
I only managed to compile the model properly by setting batch_size=1 and n_positions=784. With that configuration, the device memory during inference in neuron-top is at 20.4 G (out of 16 G x 2 cores = 32 G).
I did another test, this time splitting the model on 24 Neuron Cores, and faced the same error. In that configuration, however, I managed to get up to n_positions=1536.
If I try to estimate the KV cache memory requirements for the 7B model, knowing that each token requires 2 x num_layers x hidden_size x byte_size = 2 x 32 x 4096 x 2 bytes = 512 KB, it gives n_positions x token_size = 2048 x 512 KB = 1 GB per batch.
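The estimate can be written out as a small helper (num_layers=32 and hidden_size=4096 are the llama 2 7B values; the leading factor of 2 covers the separate K and V caches, and byte_size=2 assumes fp16):

```python
# Back-of-the-envelope KV cache sizing for llama 2 7B in fp16.
def kv_cache_bytes(n_positions, batch_size=1, num_layers=32, hidden_size=4096, byte_size=2):
    per_token = 2 * num_layers * hidden_size * byte_size   # K and V for every layer
    return batch_size * n_positions * per_token

print(kv_cache_bytes(1) // 1024, "KB per token")                   # 512 KB
print(kv_cache_bytes(2048) // 1024**3, "GB for n_positions=2048")  # 1 GB
```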
Considering that:
As a final note, I face the same kind of errors when using the larger llama 2 13B model. I was previously able to compile and run it just fine on 24 Neuron Cores with n_positions=2048 and batch_size=2, but now I only manage to run it with n_positions=1024 and batch_size=1.
I tried generation with OPT-1.3B, and I am getting garbage output. What am I doing wrong?
Here are two code samples - one for Neuron (running on inf2.xlarge) and another for Nvidia - along with the outputs of both:
Neuron:
from transformers.models.opt import OPTForCausalLM
hf_model = OPTForCausalLM.from_pretrained('facebook/opt-1.3b', low_cpu_mem_usage=True)
from transformers_neuronx.module import save_pretrained_split
save_pretrained_split(hf_model, './opt-1.3b-f32-split')
from transformers_neuronx.opt.model import OPTForSampling
neuron_model = OPTForSampling.from_pretrained('./opt-1.3b-f32-split', batch_size=1, tp_degree=2, amp='f32', unroll=None)
neuron_model.to_neuron()
from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter
model = HuggingFaceGenerationModelAdapter(hf_model.config, neuron_model)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b')
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
text = "The quick brown fox"
encoded_input = tokenizer(text, return_tensors='pt', padding=True)
model.reset_generation()
sample_output = model.generate(
    input_ids=encoded_input.input_ids,
    attention_mask=encoded_input.attention_mask,
    do_sample=True,
    max_length=256,
    temperature=0.7,
)
print([tokenizer.decode(tok) for tok in sample_output])
Output:
['</s>The quick brown fox box box box\n\n\n\n\n\n6 arsor\n:\n\n(w a natural spring 20 18\n011161811.\n32.\n.\n########\n-\n1.\n1 of the best.\n.\n(-\n7\n-\n\n-\n####\n”\n--\n9 91911,\n”””\nThe next to the\nN.\n””\n#############\n1.########\nThe. ### ######\n-\n\n############################################################################################################################################################################################################################################################']
Nvidia:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
device = torch.device('cuda')
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
model.to(device)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
text = "The quick brown fox"
encoded_input = tokenizer(text, return_tensors='pt', padding=True)
encoded_input.to(device)
sample_output = model.generate(
    input_ids=encoded_input.input_ids,
    attention_mask=encoded_input.attention_mask,
    do_sample=True,
    max_length=256,
    temperature=0.7,
)
print([tokenizer.decode(tok) for tok in sample_output])
Output:
['</s>The quick brown fox jumps over the lazy dog\nthe lazy fox jumps over the quick brown fox</s>']
Hi AWS Team,
Very cool project and I am looking forward to using it! Are you planning to add support for MPT-based models?
Thanks!
I understand the llama model is still a 'prototype', so it is expected to have some issues when running it.
I am nevertheless creating this issue in the hope that my tests will help fix them.
I am using openlm-research/open_llama_3b for simple text generation. I am using the transformers-neuronx main branch. The neuronx-cc compiler version is 2.8.0.25+a3ad0f342. The model is compiled with batch_size = 1, amp = 'f16', tp_degree = 2. I use transformers.set_seed(42) to get reproducible results, always with the same "One of my fondest memory is" prompt.
If I generate only 128 tokens, I get something that looks ok:
'<s> One of my fondest memory is my school time. Before I went to college, I was studying at Kofu Municipal Technical
School in the north part of Yamanashi. It was a small school (about 300 students) but it was very cozy. Not just the school
building, but the area where the school was situated was very calm and I was very happy living in that area.\nMy father
was the principal of the school. I was a senior high school student and I got married when I was 19 years old. My father
died when I was 35 years old. I stayed in'
But if I increase the number of generated tokens, the output tokens are duplicated or corrupted (here starting from token 133):
'<s> One of my fondest memory is my school time. Before I went to college, I was studying at Kofu Municipal Technical
School in the north part of Yamanashi. It was a small school (about 300 students) but it was very cozy. Not just the school
building, but the area where the school was situated was very calm and I was very happy living in that area.\nMy father
was the principal of the school. I was a senior high school student and I got married when I was 19 years old. My father
died when I was 35 years old. I stayed in that area until Iw at his office from 39 at least 568833 years later on this blog,
I I’ The school'
The more tokens I generate, the worse it gets.
I tried the latest version of transformers-neuronx with openlm-research/open_llama_3b, and while the outputs are fine with batch sizes 1 and 2, I get gibberish output with batch size 3.
from transformers import AutoModelForCausalLM
from transformers_neuronx.module import save_pretrained_split
model = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_3b")
save_pretrained_split(model, './llama_split')
import os
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling
os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"
batch_size = 3
neuron_model = LlamaForSampling.from_pretrained("./llama_split", tp_degree=2, batch_size=batch_size, amp='f16')
neuron_model.to_neuron()
# construct a tokenizer and encode prompt text
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
prompts = ["My name is David and"] * batch_size
tokens = tokenizer(prompts, return_tensors="pt")
# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(tokens.input_ids, sequence_length=128, top_k=50)
    elapsed = time.time() - start
generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')
Projects like vLLM help optimize model serving throughput. I was wondering whether implementing PagedAttention, or integrating with vLLM, is on your roadmap to improve using the Inf2 processors in production?
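For context, here is a toy, pure-Python sketch of the block-table indirection PagedAttention relies on; it is an illustration of the idea only, not vLLM's implementation:

```python
# Toy sketch of PagedAttention's block-table indirection: the KV cache is
# carved into fixed-size physical blocks, and each sequence keeps a table
# mapping its logical token positions to physical blocks, so memory is
# allocated on demand instead of reserved up front for the full n_positions.
class PagedKVCache:
    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical blocks not yet in use
        self.tables = {}                             # seq_id -> list of physical block ids
        self.lengths = {}                            # seq_id -> number of tokens cached

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:                 # last block full (or no block yet)
            self.tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id, pos):
        # Translate a logical position into a slot inside a physical block
        block = self.tables[seq_id][pos // self.block_size]
        return block * self.block_size + pos % self.block_size

cache = PagedKVCache(num_blocks=8)
for _ in range(6):
    cache.append_token("seq0")
# 6 tokens with block_size=4 only pin down 2 physical blocks
assert len(cache.tables["seq0"]) == 2
```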
Relevant TF PR - huggingface/transformers#27064
save_split calls model.save_pretrained(save_directory, save_function=save_split, max_shard_size='10000GB'). Since safetensors serialization now defaults to true, save_pretrained will not call the save_function.
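A minimal pure-Python mock of that dispatch (names and structure are illustrative; the real logic lives in transformers' PreTrainedModel.save_pretrained) shows why the save_function is bypassed when safetensors is the default, and that passing safe_serialization=False restores the legacy path:

```python
# Hypothetical mock of the save_pretrained dispatch; not the real transformers code.
def mock_save_pretrained(save_directory, save_function=None, safe_serialization=True):
    if safe_serialization:
        # safetensors path: the custom save_function is never invoked
        return "safetensors"
    # legacy torch path: serialization goes through save_function
    save_function({}, f"{save_directory}/pytorch_model.bin")
    return "save_function"

calls = []
def save_split(state_dict, path):
    calls.append(path)

# Default (safe_serialization=True): save_split is silently skipped
assert mock_save_pretrained("./out", save_function=save_split) == "safetensors"
assert calls == []

# Explicitly disabling safetensors routes through save_split again
assert mock_save_pretrained("./out", save_function=save_split, safe_serialization=False) == "save_function"
assert calls == ["./out/pytorch_model.bin"]
```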
You can reproduce this by following https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb using the latest transformers package.
I attempted to use this model on an inf2.24xlarge. The model is based on the GPT-J architecture, but when I run it on Neuron, the results differ greatly from those on a GPU-based system: completely meaningless words are output. It works fine on GPU.
Below is the compilation code:
from transformers.models.auto import AutoModelForCausalLM
import torch
from transformers_neuronx.module import save_pretrained_split
hf_model = AutoModelForCausalLM.from_pretrained('PygmalionAI/pygmalion-6b', low_cpu_mem_usage=True)
def amp_callback(model, dtype):
    for block in model.transformer.h:
        block.attn.to(dtype)
        block.mlp.to(dtype)
    model.lm_head.to(dtype)
amp_callback(hf_model, torch.float16)
save_pretrained_split(hf_model, './pygmalion-6b-split')
Below is the inference code:
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.gptj.model import GPTJForSampling
neuron_model = GPTJForSampling.from_pretrained('./pygmalion-6b-split', n_positions=1024, batch_size=1, tp_degree=8, amp='f16')
neuron_model.to_neuron()
# construct a tokenizer and encode prompt text
tokenizer = AutoTokenizer.from_pretrained('PygmalionAI/pygmalion-6b')
batch_prompts = [
"Jihye's Persona: A 22-year-old woman working part-time at a convenience store in Seoul.\n<START>\nYou: ...\nJihye: Welcome, man.\nYou: hello?\nJihye: ",]
input_ids = torch.as_tensor([tokenizer.encode(text) for text in batch_prompts])
with torch.inference_mode():
    # warmup
    generated_sequences = neuron_model.sample(input_ids, sequence_length=1024)
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=1024)
    elapsed = time.time() - start
generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')
Environment:
AMI: Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230720
VENV: aws_neuron_venv_pytorch
In the current version, transformers-neuronx models can only be instantiated from a directory where the Hugging Face checkpoint has been split into multiple files.
This raises two major issues:
After executing pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com, !pip show transformers-neuronx gives:
Name: transformers-neuronx
Version: 0.6.106
Summary: UNKNOWN
Home-page: UNKNOWN
Author:
Author-email:
License: UNKNOWN
Location: /home/ubuntu/.local/lib/python3.10/site-packages
Requires: accelerate, torch-neuronx, transformers
Required-by:
Hi team,
It would be great if this project supported BART.
What work needs to be done to add this support?
I noticed /usr/local/bin/neuronx-cc is always called with --target=trn1, even on my inf2 machines.
I'm evaluating non-GPU machines; will this have any speed or quality effect when evaluating inf2 machines?
I tried to run the gpt-neox demo CLI @0.4.60 on an inf2.24xlarge instance.
I first saved the model with float16 precision:
$ gptneox_demo --amp f16 save gpt-neox-20b
When trying a conversion and inference, I ran out of memory:
$ gptneox_demo --amp f16 run --batch_size 1 gpt-neox-20b
[tokenizer download progress elided]
running GPTNeoXForSampling.from_pretrained
/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py:40: UserWarning: hidden_act="gelu_fast" ignored in favor of hidden_act="gelu_new"
warnings.warn(f'hidden_act="{self.config.activation_function}" ignored in favor of hidden_act="gelu_new"')
running model.to_neuron
[compiler progress output elided: allocation selection, dependency analysis of Block1, dependency reduction of sg0000]
Compiler status PASS
2023-Jun-23 09:11:18.0735 6737:6737 ERROR TDRV:dmem_alloc_internal Failed to alloc DEVICE memory: 150994944
2023-Jun-23 09:11:18.0738 6737:6737 ERROR TDRV:dml_dump Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_0_649561b6.csv
2023-Jun-23 09:11:18.0738 6737:6737 ERROR TDRV:log_dev_mem Failed to allocate 144.000MB (usage: tensors) on ND 0:NC 0, current utilization:
* total: 15.951GB
* tensors: 15.951GB
* runtime: 1.062KB
* dma rings: 32.000KB
2023-Jun-23 09:11:18.0738 6737:6737 ERROR TDRV:tensor_allocate Failed to allocate 150994944 bytes on DEVICE for tensor UNKNOWN.
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/gptneox_demo", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/demo.py", line 28, in main
demo('EleutherAI/gpt-neox-20b', GPTNeoXForSampling, amp_callback)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 61, in demo
run(args, model_name, model_cls)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 105, in run
model.to_neuron()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py", line 71, in to_neuron
block.to_neuron(n_positions_list)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py", line 285, in to_neuron
self.mlp_out_weight = shard_along(mlp.dense_4h_to_h.weight.detach().T, dim=0)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/parallel.py", line 109, in shard_along
return ops.parallel_to_nc(self.shard_along_on_cpu(tensor, dim))
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/ops.py", line 49, in parallel_to_nc
return torch.ops.neuron._parallel_to_neuron(tensors)
File "/usr/local/lib/python3.8/dist-packages/torch/_ops.py", line 442, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: nrt_tensor_allocate status=4
This was somewhat expected, as by default the CLI uses only two Neuron Cores and the model is quite big (20B parameters).
By increasing the number of Neuron Cores, I was able to run the model, but the result is garbage:
$ gptneox_demo --amp f16 run --batch_size 1 --tp_degree 4 gpt-neox-20b
running GPTNeoXForSampling.from_pretrained
/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py:40: UserWarning: hidden_act="gelu_fast" ignored in favor of hidden_act="gelu_new"
warnings.warn(f'hidden_act="{self.config.activation_function}" ignored in favor of hidden_act="gelu_new"')
running model.to_neuron
[compiler progress output elided: allocation selection, dependency analysis of Block1, dependency reduction of sg0000]
Compiler status PASS
2023-Jun-23 09:16:14.0528 7163:7163 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-Jun-23 09:16:14.0528 7163:7163 [0] init.cc:99 CCOM WARN OFI plugin initNet() failed is EFA enabled?
running model.sample
generated_sequence= tensor([[12092, 13, 309, 1353, 247, 3448, 1566, 13, 29589, 22702,
8822, 22702, 42010, 22702, 22702, 8834, 42010, 42010, 42010, 42010,
42010, 22702, 42010, 22702, 22702, 42010, 29589, 42010, 22702, 42010,
42010, 42010, 42010, 22702, 8822, 22702, 22702, 42010, 42010, 22702,
42010, 42010, 42010, 42010, 42010, 22702, 22702, 8834, 42010, 42010,
42010, 42010, 22702, 42010, 22702, 42010, 29589, 42010, 22702, 42010,
42010, 22702, 42010, 42010, 42010, 42010, 42010, 22702, 22702, 42010,
42010, 42010, 42010, 42010, 22702, 42010, 42010, 42010, 42010, 22702,
42010, 42010, 42010, 22702, 42010, 22702, 42010, 42010, 22702, 42010,
42010, 42010, 22702, 42010, 42010, 42010, 42010, 22702, 42010, 42010,
22702, 42010, 22702, 42010, 42010, 42010, 42010, 8822, 8828, 22702,
42010, 29589, 42010, 22702, 42010, 42010, 42010, 42010, 22702, 42010,
42010, 8828, 42010, 22702, 29589, 29589, 8828, 22702]])
["Hello, I'm a language model,blockList errnoErramssymb errnoErr BytePtrFromString errnoErr errnoErramsfonts BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString errnoErr errnoErr BytePtrFromStringblockList BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErramssymb errnoErr errnoErr BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr errnoErramsfonts BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString errnoErr BytePtrFromStringblockList BytePtrFromString errnoErr BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString errnoErr BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString errnoErr BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromStringamssymbmathrsfs errnoErr BytePtrFromStringblockList BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromStringmathrsfs BytePtrFromString errnoErrblockListblockListmathrsfs errnoErr"]
I've been following the https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb example. However, I ran into an issue using a modified version of Llama made for MiniGPT4.
I'm running on a Inf2.8xlarge with "AMI Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20231205".
I updated to the latest Neuron version via python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.0.* torchvision
Here's my code to compile. This finishes properly.
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained('wangrongsheng/MiniGPT-4-LLaMA-7B')
import torch
from transformers_neuronx.module import save_pretrained_split
save_pretrained_split(model, './MiniGPT-4-LLaMA-7b-split')
I then attempt to run it with the following code:
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling
from minigpt4.models.modeling_llama import LlamaForCausalLM
import os
# Compiler flag -O1 is a workaround for “Too many instructions after unroll” in SDK 2.14
# os.environ['NEURON_CC_FLAGS'] = '-O1'
# load meta-llama/Llama-2-13b to the NeuronCores with 24-way tensor parallelism and run compilation
neuron_model = LlamaForSampling.from_pretrained('./MiniGPT-4-LLaMA-7b-split', batch_size=1, tp_degree=2, amp='f16')
neuron_model.to_neuron()
# construct a tokenizer and encode prompt text
tokenizer = AutoTokenizer.from_pretrained('wangrongsheng/MiniGPT-4-LLaMA-7B')
prompt = "Hello, I'm a language model,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    elapsed = time.time() - start
generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')
And I get this error:
Traceback (most recent call last):
File "run.py", line 12, in <module>
neuron_model = LlamaForSampling.from_pretrained('./MiniGPT-4-LLaMA-7b-split', batch_size=1, tp_degree=2, amp='f16')
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/module.py", line 145, in from_pretrained
state_dict_path = os.path.join(pretrained_model_path, 'pytorch_model.bin')
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/serialization.py", line 771, in load
with _open_file_like(f, 'rb') as opened_file:
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/serialization.py", line 270, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/serialization.py", line 251, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './MiniGPT-4-LLaMA-7b-split/pytorch_model.bin'
These are the files in ./MiniGPT-4-LLaMA-7b-split:
config.json generation_config.json model.safetensors
Any help or direction would be stellar! Thanks.
Request to support new popular model Mixtral
I am trying to skip generating tokens that can simply be copied over, hoping to speed up generation by about 70% for my use case. The problem I am running into is that when I reset caching, the overhead takes too much time; when I maintain caching, the probabilities seem to not be totally wrong.
The main goal, in the end, is constrained generation that saves time whenever there is only one possible next token to generate.
Here is my current code to reproduce this:
def convert_to_tree(sequences):
    tree = {}
    for sequence in sequences:
        sequence_ids = tokenizer.encode(sequence, add_special_tokens=False)
        current_tree = tree
        for token in sequence_ids:
            if token not in current_tree:
                current_tree[token] = {
                    "token_string": tokenizer.decode([token]),
                    "dangling": True,
                    "tree": {}
                }
            else:
                # Update the dangling value if this token appears more than once
                current_tree[token]["dangling"] = False
            current_tree = current_tree[token]["tree"]
    return tree
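The builder above can be exercised standalone with a stub tokenizer (one "token" per character, purely for illustration; the real code uses the llama tokenizer) to show the tree shape it produces:

```python
# StubTokenizer is a hypothetical stand-in so the example runs without transformers.
class StubTokenizer:
    def encode(self, text, add_special_tokens=False):
        return [ord(c) for c in text]
    def decode(self, tokens):
        return "".join(chr(t) for t in tokens)

tokenizer = StubTokenizer()

def convert_to_tree(sequences):
    tree = {}
    for sequence in sequences:
        sequence_ids = tokenizer.encode(sequence, add_special_tokens=False)
        current_tree = tree
        for token in sequence_ids:
            if token not in current_tree:
                current_tree[token] = {
                    "token_string": tokenizer.decode([token]),
                    "dangling": True,
                    "tree": {},
                }
            else:
                # a token shared by several sequences is no longer "dangling"
                current_tree[token]["dangling"] = False
            current_tree = current_tree[token]["tree"]
    return tree

tree = convert_to_tree(["ab", "ac"])
assert set(tree) == {ord("a")}                      # shared one-token prefix
assert tree[ord("a")]["dangling"] is False          # "a" appears in both sequences
assert set(tree[ord("a")]["tree"]) == {ord("b"), ord("c")}
```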
tree = convert_to_tree([" a business man who"])
from transformers_neuronx import bucket, utils
def sample(self, input_ids, sequence_length, start_ids=None,
           top_k=50, top_p=1.0, eos_token_override=None, temperature=1.0, streamer=None, c_tree=None):
    # To enable optimized context encoding network, we must pad
    # up to the context length estimate or we will not correctly
    # select the final context logits (See: layers/transformer.py).
    # This also means we need to shift the start_ids over to correct
    # for padding.
    offset = 0
    batch_size, context_length = input_ids.shape
    prefixed_length = self.prefixed_length
    if context_length < prefixed_length:
        self.prefixed_length = 0
    else:
        input_ids = input_ids[:, prefixed_length:]
        context_length -= prefixed_length
        sequence_length -= prefixed_length
    estimate = bucket.find(self.context_buckets, context_length)
    if estimate:
        if context_length < estimate:
            input_ids = utils.pad(input_ids, 1, estimate, left=True)
            offset = estimate - context_length
            if not prefixed_length:
                if start_ids is None:
                    start_ids = torch.zeros(batch_size, dtype=torch.int32)
                start_ids += offset
            sequence_length += offset
    # Sequence length cannot be greater than n_positions
    sequence_length = min(sequence_length, self.max_positions)
    result = sample_llama(
        self, input_ids, start_ids, sequence_length,
        eos_token_id=self.config.eos_token_id if eos_token_override is None else eos_token_override,
        top_k=top_k, top_p=top_p, temperature=temperature, streamer=streamer, c_tree=c_tree
    )
    if offset != 0:
        result = result[:, offset:]
    return result
from transformers_neuronx.sampling import validate_top_k_top_p_min_tokens_to_keep, top_k_top_p_filtering

@torch.no_grad()
def sample_llama(model, input_ids, start_ids, sequence_length, eos_token_id=2, top_k=50, top_p=1.0,
                 temperature=1.0, streamer=None, c_tree=None):
    # validate_top_k_top_p_min_tokens_to_keep(top_k, top_p, None)
    # populate key/value caches according to the prompt text
    _, start = input_ids.shape
    cache_ids = torch.arange(start, dtype=torch.int32)
    next_token_scores = model(input_ids, cache_ids, start_ids)
    return sample_loop_llama(
        model, input_ids, start_ids, next_token_scores, sequence_length, eos_token_id,
        top_k, top_p, temperature, streamer, c_tree
    )
# test if caching is working by turning off restricted generation after 3 tokens were added manually
next_token_scores = None
def sample_loop_llama(model, input_ids, start_ids, next_token_scores, sequence_length, eos_token_id=2,
                      top_k=50, top_p=1.0, temperature=1.0, streamer=None, c_tree=None):
    validate_top_k_top_p_min_tokens_to_keep(top_k, top_p, None)
    if not isinstance(temperature, float) or not (temperature > 0):
        raise ValueError('temperature has to be a strictly positive float.')
    # Flags, one per sequence in a batch, to indicate if a sequence hit eos_token_id
    done_flags = torch.full((input_ids.size(dim=0), 1), False)
    tokens = [input_ids]
    _, start = input_ids.shape
    cache_ids = torch.arange(start, dtype=torch.int32)
    next_token_scores = model(input_ids, cache_ids, start_ids)
    print("inputs")
    print((input_ids, input_ids.shape))
    print("cache_ids")
    print((cache_ids, cache_ids.shape))
    tokens_tmp = []
    cache_ids_temp = []
    for cur_len in range(start, sequence_length):
        next_len = cur_len + 1
        # top_values, top_indices = top_k_top_p_filtering(next_token_scores, top_k=top_k, top_p=top_p)
        top_indices = list(c_tree.keys())
        if len(top_indices) == 1:
            # skip next_token_scores because there is only one possible token
            inputs = top_indices[0]
            inputs = torch.reshape(torch.tensor(inputs), (1, 1))
            done_flags = torch.logical_or(done_flags, inputs == eos_token_id)
            token = torch.where(done_flags.eq(True), eos_token_id, inputs)
            tokens.append(token)
            if streamer is not None and hasattr(streamer, 'response_with_prefix') and streamer.response_with_prefix:
                streamer.put(torch.cat(tokens, dim=-1))
            elif streamer:
                streamer.put(token)
            c_tree = c_tree[top_indices[0]]['tree']
            if len(list(c_tree.keys())) == 0:
                pass  # break
            # forward pass to get next token
            cache_ids_temp.append(cur_len)
            tokens_tmp.append(token)
            # TODO: assign token to cache_ids
        elif len(top_indices) == 0:
            cache_ids = torch.as_tensor(cache_ids_temp, dtype=torch.int32)
            tokens_pt = torch.as_tensor([tokens_tmp], dtype=torch.int32)
            print("inputs")
            print((tokens_pt, tokens_pt.shape))
            print("cache_ids")
            print((cache_ids, cache_ids.shape))
            if len(tokens_tmp) != 0:  # header condition only
                next_token_scores = model(tokens_pt, cache_ids, start_ids)
            cache_ids_temp = []
            tokens_tmp = []
            # This whole block would make it constrained generation, but it was commented out
            # to make sure probabilities for random generation are working correctly
            # top_values = next_token_scores[0][top_indices]
            # top_value = torch.argmax(top_values)
            # inputs = top_indices[top_value]
            # c_tree = c_tree[inputs]['tree']
            # inputs = torch.reshape(torch.tensor(inputs), (1, 1))
            # # Update done flags.
            # done_flags = torch.logical_or(done_flags, inputs == eos_token_id)
            # # Update token id to be eos_token_id if the corresponding done flag is True. For a batch,
            # # this means that, while every sequence in the batch has the same length, a sequence that
            # # encounters eos_token_id earlier will be filled with eos_token_ids post the first appearance
            # # of eos_token_id.
            # token = torch.where(done_flags.eq(True), eos_token_id, inputs)
            # tokens.append(token)
            if temperature != 1.0:
                next_token_scores /= temperature
            top_values, top_indices = top_k_top_p_filtering(next_token_scores, top_k=top_k, top_p=top_p)
            # sample
            probs = torch.nn.functional.softmax(top_values, dim=-1)
            inputs_in_topk = torch.multinomial(probs, num_samples=1, replacement=True)
            inputs = torch.gather(top_indices, 1, inputs_in_topk)
            done_flags = torch.logical_or(done_flags, inputs == eos_token_id)
            token = torch.where(done_flags.eq(True), eos_token_id, inputs)
            tokens.append(token)
            if streamer is not None and hasattr(streamer, 'response_with_prefix') and streamer.response_with_prefix:
                streamer.put(torch.cat(tokens, dim=-1))
            elif streamer:
                streamer.put(token)
            if len(list(c_tree.keys())) == 0:
                pass  # break
            cache_ids_temp.append(cur_len)
            tokens_tmp.append(token)
            # if next_len >= sequence_length or done_flags.all():
            #     break
        # forward pass to get next token
        # add multiple models to merge multiple scores
    if streamer:
        streamer.end()
    return torch.cat(tokens, dim=-1)
# run inference with top-k sampling
print(len(input_ids[0]))
with torch.inference_mode():
    start = time.time()
    # print(neuron_model.forward(input_ids))
    # generated_sequences = neuron_model.sample(input_ids, sequence_length=len(input_ids[0]) + 100)
    generated_sequences = sample(neuron_model, input_ids, sequence_length=len(input_ids[0]) + 15, temperature=.8, c_tree=tree)
    elapsed = time.time() - start
nl = "\n\n\n\n"
generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {nl.join(generated_sequences)} in {elapsed} seconds')
The key logs are below. I thought I was doing everything right, but it still produced wrong results:
inputs ### new inputs with 5 extra cache_ids
(tensor([[29871, 263, 5381, 767, 1058]], dtype=torch.int32), torch.Size([1, 5]))
cache_ids
(tensor([800, 801, 802, 803, 804], dtype=torch.int32), torch.Size([5]))
#I expect normal generation afterwards
inputs
(tensor([[29949]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([805], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[259]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([806], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[259]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([807], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[903]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([808], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[386]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([809], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[29899]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([810], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[29871]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([811], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[259]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([812], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[29955]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([813], dtype=torch.int32), torch.Size([1]))
generated sequences <s> At the far end of town where the Gricklegrass grows and the wind smells slowandsour when it blows and no
birds ever sing excepting old crows is the Street of the Lifted Lorax
And deep in the Gricklegrass some people say if you look deep enough you can still see today where the
Lorax once stood just as long as it could before somebody lifted the Lorax away
What was the Lorax Any why was it there And why was it lifted and taken somewhere from the far end of
town where the Gricklegrass grows The old Onceler still lives here
Ask him he knows
This story is about a business man whoO _th- 7 in 1.5860817432403564 seconds
The key detail in the logs is that on the next model logits call I add 5 extra tokens along with 5 extra cache_ids, correctly ordered. I thought I did it correctly, but the subsequent normal generation produced garbage.
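For what it's worth, the contiguous cache_ids for the multi-token call can be built as below; `make_cache_ids` is just an illustrative helper, not part of the library:

```python
import torch

def make_cache_ids(cur_len: int, num_new_tokens: int) -> torch.Tensor:
    # Positions cur_len .. cur_len + num_new_tokens - 1: the KV-cache slots
    # that the extra input tokens should be written to, in order.
    return torch.arange(cur_len, cur_len + num_new_tokens, dtype=torch.int32)

print(make_cache_ids(800, 5))  # tensor([800, 801, 802, 803, 804], dtype=torch.int32)
```

This matches the `cache_ids` tensor shown in the first log entry above.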
I am running a pretraining job following https://github.com/aws-neuron/aws-neuron-parallelcluster-samples/blob/master/examples/jobs/neuronx-nemo-megatron-llamav2-job.md
And the script works smoothly. However, I don't know where the trained model is saved. I have set
model.save_xser = True
in https://github.com/aws-neuron/neuronx-nemo-megatron/blob/da1fb6643838e01c9110723bb4190081b4a249b0/nemo/examples/nlp/language_modeling/test_llama.sh
But I still cannot find the trained model anywhere.
What is more, I am not sure whether the script loads the parameters from the model.tokenizer.type folder to initialize the llama model.
It seems like the example script uses randomly initialized parameters for training, so what should I do if I want to initialize the model with pretrained parameters? Must the parameter file be a ckpt file?
Looking at the code of the `top_k_top_p_filtering` method in `sampling.py`, I am wondering if the algorithm for applying the top-p filtering is correct.
Unlike the `transformers` implementation, the algorithm performs a cumulative sum on logits probabilities sorted in descending order, which seems to lead to a different selection.
Example:
scores = [0.1, 0.2, 0.7]
top_p = 0.8
top_p_scores = [0.2, 0.7]

`transformers` algorithm:
sorted_scores = [0.1, 0.2, 0.7]
cum_sum = [0.1, 0.3, 1.0]
top_p_scores = sorted_scores[cum_sum > (1 - top_p)] = [0.2, 0.7] <- correct

`transformers-neuronx` algorithm:
sorted_scores = [0.7, 0.2, 0.1]
cum_sum = [0.7, 0.9, 1.0]
top_p_scores = sorted_scores[cum_sum <= top_p] = [0.7] <- incorrect
I checked the result by crafting a sample:
>>> import torch
>>> from transformers_neuronx.sampling import top_k_top_p_filtering
>>> probs = torch.tensor([[0.1, 0.2, 0.7]])
>>> logits = torch.log(probs)
>>> top_p_logits, top_p_indices = top_k_top_p_filtering(logits, top_k=None, top_p=0.8)
>>> print(top_p_indices)
tensor([[2]])
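For comparison, here is a minimal plain-torch sketch of the descending-sort convention that keeps the token crossing the top-p threshold. This is illustrative only, not the library code:

```python
import torch

def top_p_filter(probs: torch.Tensor, top_p: float) -> torch.Tensor:
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cum = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens that come after the cumulative sum has already reached top_p;
    # subtracting sorted_probs keeps the token that crosses the threshold.
    remove = (cum - sorted_probs) > top_p
    sorted_probs = sorted_probs.masked_fill(remove, 0.0)
    out = torch.zeros_like(probs)
    return out.scatter_(-1, sorted_idx, sorted_probs)

probs = torch.tensor([[0.1, 0.2, 0.7]])
print(top_p_filter(probs, 0.8))  # keeps 0.2 and 0.7, matching the expected selection
```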
Hi team, I was recently playing with Llama 2 on inf2.24xlarge instance. The AMI I am currently using is: AWS Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2).
The Llama2 model is getting compiled and running successfully. While running the model, I also observed the instance performance using neuron-top, and the model seems to be utilizing all the neuron-cores from all neuron-core-devices.
But, while compiling the model, I get the following message:
2023-Sep-05 13:08:49.0371 20080:29631 [0] init.cc:97 CCOM WARN Linux kernel 5.10 requires setting FI_EFA_FORK_SAFE=1 environment variable. Multi-node support will be disabled.
Please restart with FI_EFA_FORK_SAFE=1 set.
Can someone please guide me in understanding whether this is a concern and, if so, how to resolve it?
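The warning itself suggests the workaround; one way to apply it (assuming you launch the compilation from a shell) is:

```shell
# Export the variable so it is inherited by the compilation/inference process
export FI_EFA_FORK_SAFE=1

# Verify that child processes see it
python3 -c 'import os; print(os.environ["FI_EFA_FORK_SAFE"])'
```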
When trying to compile llama 7B
with:
I get these errors (multiple instances):
2023-Nov-20 08:13:58.361948 4241:4538 ERROR NMGR:dlr_kelf_load Failed to load mlaop
2023-Nov-20 08:13:58.361956 4241:4538 ERROR NMGR:load_kelf_graphs Failed to load KELF kelf-0.json
2023-Nov-20 08:13:58.677772 4241:4551 ERROR NEFF:json_parse_load_elements Unable to parse: sg00/Activation.json - 1
2023-Nov-20 08:13:58.677837 4241:4551 ERROR NEFF:json_parse_load_elements File sg00/Activation.json size (4375834152) exceeds json parser maximum (4294967295)
2023-Nov-20 08:13:58.677857 4241:4551 ERROR NEFF:construct_kbin Failed to load subgraph sg00/def.json
2023-Nov-20 08:13:58.679344 4241:4551 ERROR NEFF:kelf_load Failed to load subgraph 0
2023-Nov-20 08:13:58.679362 4241:4551 ERROR NMGR:dlr_kelf_load Failed to load mlaop
2023-Nov-20 08:13:58.679371 4241:4551 ERROR NMGR:load_kelf_graphs Failed to load KELF kelf-0.json
2023-Nov-20 08:13:59.027739 4241:4538 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/ubuntu/neuroncc_compile_workdir/d7686f21-27f8-48c8-8ad6-241f83f4e865/model.MODULE_ddc3bb0a8f815a1d05f6+8737852b.
neff, err: 2
2023-Nov-20 08:13:59.084440 4241:4551 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/ubuntu/neuroncc_compile_workdir/d7686f21-27f8-48c8-8ad6-241f83f4e865/model.MODULE_ddc3bb0a8f815a1d05f6+8737852b.
neff, err: 2
2023-Nov-20 08:13:59.107413 4241:4547 ERROR NEFF:json_parse_load_elements Unable to parse: sg00/Activation.json - 1
I am using the `optimum-neuron` example script with the following command:
python examples/text-generation/generation.py export meta-llama/Llama-2-7b-chat-hf --batch_size 8 --sequence_length 2048 --num_cores 24 --auto_cast_type fp16
Hi,
according to this blog https://huggingface.co/blog/inferentia-llama2
it seems the expected latency is about 60 ms/token when running inference for llama-2 on inf2.xlarge.
I do get these results when running llama2 on an inf2.xlarge.
I have also tested running codellama (fine-tuned from llama2) on an inf2.xlarge and I'm getting about 50 to 60 ms/token.
When I run codellama on a g5.xlarge I get 30 ms/token, faster than on an inf2.xlarge.
Is this expected?
Thank you
Looking at the source code of the latest package (0.60.106), it seems that the precompiled artifacts are not properly reloaded when `to_neuron()` is called.
To be more specific, it seems that the model contains a `main` head and several `context` heads.
Note: my understanding is that the `main` head is used to generate new tokens while the `context` heads are all in charge of the encoding of a subset of the context. Please correct me if I am wrong.
The precompiled artifacts are correctly reloaded for the `main` head, but not for the `context` heads, because the path to the precompiled artifacts is not passed down to them.
The following change in `decoder.py` seems to fix the issue:
def build_weight_shared(self, n_positions_list=None, n_active_tokens=None, batch_size=None,
unroll=None, share_caches=False):
...
new = DecoderLmHeadForSamplingNoEmbedding(
self.tp_degree, n_positions_list, n_active_tokens, batch_size, self.attention_head_size,
self.amp, self.num_layers, unroll, neuron_config=self.neuron_config, allow_pad=self.allow_pad
)
+ new.compiler_artifacts_path = self.compiler_artifacts_path
new.add_inputs_builder(self.inputs_builder)
I'm running this sample code (https://github.com/aws-neuron/transformers-neuronx#hugging-face-generate-api-support) using GPT-NeoX on an inf2.24xlarge instance, but the `model.generate` method kills the kernel on Jupyter. I am using padding and truncation in the tokenizer, and this fails for both single and double input sequences (texts). The batch size is 2.
I usually install `transformers_neuronx` from git, and since the last commit says that it was updated for SDK release 2.12, I assumed it was the same version available from GitHub. However, running `gpt2_demo` with the GitHub version breaks:
(aws_neuron_venv_pytorch) ubuntu@ip-172-31-40-142:~$ gpt2_demo run gpt2-small
running GPT2ForSampling.from_pretrained
running model.to_neuron
....
Compiler status PASS
Traceback (most recent call last):
File "/opt/aws_neuron_venv_pytorch/bin/gpt2_demo", line 8, in <module>
sys.exit(main())
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/gpt2/demo.py", line 20, in main
demo('gpt2', GPT2ForSampling, amp_callback)
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 61, in demo
run(args, model_name, model_cls)
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 105, in run
model.to_neuron()
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/gpt2/model.py", line 117, in to_neuron
self.decoder_lm_head.to_neuron()
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 121, in to_neuron
self.program.setup(self.layers, ln_lm_head_params)
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 872, in setup
super().setup(layers, ln_lm_head_params)
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 827, in setup
kernel.load()
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 376, in load
self.model = torch.classes.neuron.ParallelModel(self.neff_bytes, self.tp_degree, self.g_start_device_id, self.g_device_count)
RuntimeError: __init__() expected at most 3 argument(s) but received 5 argument(s). Declaration: __init__(__torch__.torch.classes.neuron.ParallelModel _0, str _1, int _2) -> NoneType _0
I `diff`ed the latest version of the wheel (https://pip.repos.neuron.amazonaws.com/transformers-neuronx/transformers_neuronx-0.5.58-py3-none-any.whl) against what's in git, and it seems like git has many extra changes, so now I'm wondering if the pip wheel is outdated, or if they have diverged somehow.
I am trying to run a fine-tuned version of Llama 2 on inf2, but keep getting an `AssertionError: Try to load with neff bytes as None, might due to compilation failure`.
I used an inf2.8xlarge instance with the Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230817, all upgraded as described here.
Here's the full log:
(aws_neuron_venv_pytorch) $ python
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import LlamaForCausalLM
>>>
>>> model = LlamaForCausalLM.from_pretrained('llama-2-7b-hf')
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.62it/s]
>>> import torch
>>> from transformers_neuronx.module import save_pretrained_split
>>> save_pretrained_split(model, './llama-2-7b-split')
>>> quit()
(aws_neuron_venv_pytorch) $ python
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> import time
>>> import torch
>>> from transformers import AutoTokenizer
>>> from transformers_neuronx.llama.model import LlamaForSampling
>>> os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"
>>> neuron_model = LlamaForSampling.from_pretrained('./llama-2-7b-split', batch_size=1, tp_degree=2, amp='f16')
>>> neuron_model.to_neuron()
2023-Sep-08 23:03:12.0613 10779:10833 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-Sep-08 23:03:12.0613 10779:10833 [0] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/llama/model.py", line 117, in to_neuron
    model = self.decoder_lm_head.build_weight_shared(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 157, in build_weight_shared
    new.program.setup(new.layers, ln_lm_head_params)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 983, in setup
    super().setup(layers, ln_lm_head_params)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 879, in setup
    kernel.load()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 375, in load
    assert self.neff_bytes is not None, f"Try to load with neff bytes as None, might due to compilation failure"
AssertionError: Try to load with neff bytes as None, might due to compilation failure
Thanks!
Hey, is there any plan to add support for Falcon models?
That would be really great!
When generating only a few tokens with the recently added llama model, the generated sequence looks ok, but when the number of generated tokens increases (typically with 128 tokens), the end of the sequence is gibberish.
Hi AWS Neuron team,
It would be great to have Mistral-7B model support since its performance is better than LLAMA2 and it is with a better license. Will this model be on the roadmap?
Best,
Henry
I am trying to save the Neuron model and deploy it to SageMaker as an endpoint. I noticed in the documentation, under serialization support, it is stated that all models can be loaded or saved except GPTJ and GPTNeoX model classes.
However, I tried several models, including Llama2-13b, OPT-30B, OPT-66B, and Llama2-70B, and none of them can be saved. I tried several methods:
- `<neuron_model>.save`, which doesn't exist. It only appears to exist for GPT2 models.
- `<neuron_model>.state_dict()`, which fails on all LazyModules.
- `torch.save`, or TorchScript via `torch.jit.save`, and then trying to use the `state_dict()`.

Below is an example using OPT-66B.
Traceback (most recent call last):
  File "opt.py", line 63, in <module>
    print(f"\ndecoder: {neuron_model.chkpt_model.model.decoder.state_dict()}")
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1445, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1356, in _save_to_state_dict
    destination[prefix + name] = param if keep_vars else param.detach()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/parameter.py", line 144, in __torch_function__
    raise ValueError(
ValueError: Attempted to use an uninitialized parameter in <method 'detach' of 'torch._C._TensorBase' objects>. This error happens when you are using a `LazyModule` or explicitly manipulating `torch.nn.parameter.UninitializedParameter` objects. When using LazyModules call `forward` with a dummy batch to initialize the parameters before calling torch functions
Is there anything that can be done to fix this? I've tried the last five versions of `transformers-neuronx`; see here. Please advise. Thanks!
The GPT-Neox-20B model is too big to run on an inf2.8xlarge instance, so I tried to convert a model with the same architecture but less parameters: EleutherAI/pythia-1.4B.
I first saved the model locally:
$ gptneox_demo --model_name EleutherAI/pythia-1.4B save ./pythia-1.4B
Then I converted it and ran an inference, but the output is garbage:
$ gptneox_demo --model_name EleutherAI/pythia-1.4B run --batch_size 1 --n_positions 20 ./pythia-1.4B
running GPTNeoXForSampling.from_pretrained
/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py:40: UserWarning: hidden_act="gelu" ignored in favor of hidden_act="gelu_new"
warnings.warn(f'hidden_act="{self.config.activation_function}" ignored in favor of hidden_act="gelu_new"')
running model.to_neuron
..Selecting 7380 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Analyzing dependencies of Block1
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Analyzing dependencies of Block1
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Dependency reduction of sg0000
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Compiler status PASS
2023-Jun-23 09:32:02.0148 1784:1784 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-Jun-23 09:32:02.0148 1784:1784 [0] init.cc:99 CCOM WARN OFI plugin initNet() failed is EFA enabled?
running model.sample
generated_sequence= tensor([[12092, 13, 309, 1353, 247, 3448, 1566, 13, 50276, 50276,
521, 1028, 292, 35106, 11, 50276, 88, 50276, 7112, 11]])
["Hello, I'm a language model, hisucetunto* w ido*"]
For the record, the output with the standard CPU inference:
tensor([[12092, 13, 309, 1353, 247, 3448, 1566, 13, 285, 309,
1353, 2820, 281, 1973, 247, 1566, 326, 476, 320, 908]])
["Hello, I'm a language model, and I'm trying to build a model that can be used"]
Following the example of `HuggingFaceGenerationModelAdapter`, I have created a `NeuronModelForCausalLM` adapter class for HuggingFace optimum-neuron (see huggingface/optimum-neuron#117).
I compared the inference times of calling the `GPT2ForSampling.sample()` method directly and using the `generate()` method.
| @inf2.8xlarge | sample | generate |
|---|---|---|
| 128 tokens | 0.5 s | 0.9 s |
| 1000 tokens | 4.5 s | 7.4 s |
Calling `generate()` through the wrapper is significantly slower: is this expected? Did I miss something?
I followed the Llama NeuronX tutorial to host Llama2 on Amazon EC2 with NeuronX and TorchServe. The model works well, achieving 50+ tokens/sec as expected.
Issue
However, for my use case the input contexts are 500-3000 tokens. When I provide an example 3000 token context, there is a 10-30 second overhead before the first token is generated. After the first token, the inference speed is 50 tok/sec as expected.
Attempted fixes
I have tried the following to resolve the long context overhead:
- Adjusting `maxWorkers`, `maxBatchDelay`, and `batchSize` - no improvement
- Increasing the `max_length` parameter to support longer sequences - no improvement
- Tuning `micro_batch_size` and parallelism values - no improvement

model-config.yaml:
minWorkers: 2
maxWorkers: 8 #did not help
maxBatchDelay: 20
responseTimeout: 1080
batchSize: 4 #did not help
handler:
model_checkpoint_dir: "llama-2-13b-split"
amp: "bf16"
tp_degree: 6
max_length: 100
#did not help either
# micro_batching:
# micro_batch_size: 8
# parallelism:
# preprocess: 4
# inference: 1
# postprocess: 4
pip list
torch 1.13.1+cpu
torch-model-archiver 0.9.0b20231026
torch-neuronx 1.13.1.1.12.1
torch-workflow-archiver 0.2.11b20231026
torch-xla 1.13.1+torchneuronc
transformers-neuronx 0.8.268
torchserve --ncs --start --model-store model_store --ts-config config.properties --models llama-2-13b
(aws_neuron_venv_pytorch) ubuntu@ip-10-72-158-249:~/serve/examples/large_models/inferentia2/llama2$ WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-12-04T23:54:37,499 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2023-12-04T23:54:37,501 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-12-04T23:54:37,545 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml
2023-12-04T23:54:37,683 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.9.0
TS Home: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages
Current directory: /home/ubuntu/serve/examples/large_models/inferentia2/llama2
Temp directory: /tmp
Metrics config path: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml
Number of GPUs: 0
Number of CPUs: 96
Max heap size: 30688 M
Python executable: /opt/aws_neuron_venv_pytorch/bin/python
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/model_store
Initial Models: llama-2-13b
Log dir: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/logs
Metrics dir: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 96
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: log
Disable system metrics: false
Workflow Store: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/model_store
Model config: N/A
2023-12-04T23:54:37,689 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2023-12-04T23:54:37,703 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: llama-2-13b
2023-12-04T23:54:37,709 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir /tmp/models/6b6627abd2334517acf43ddc5e377cd5
2023-12-04T23:54:37,710 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /tmp/models/6b6627abd2334517acf43ddc5e377cd5/llama-2-13b
2023-12-04T23:54:37,718 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model llama-2-13b
2023-12-04T23:54:37,719 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model llama-2-13b
2023-12-04T23:54:48,067 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model llama-2-13b loaded.
2023-12-04T23:54:48,067 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: llama-2-13b, count: 2
2023-12-04T23:54:48,074 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/aws_neuron_venv_pytorch/bin/python, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2023-12-04T23:54:48,074 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/aws_neuron_venv_pytorch/bin/python, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9001, --metrics-config, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2023-12-04T23:54:48,075 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-12-04T23:54:48,125 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2023-12-04T23:54:48,125 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2023-12-04T23:54:48,272 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:9.1|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40732955932617|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.85419082641602|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:364036.0625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:12472.20703125|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:3.9|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,779 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9000, pid=492260
2023-12-04T23:54:48,779 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9000
2023-12-04T23:54:48,779 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9001, pid=492261
2023-12-04T23:54:48,780 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9001
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Successfully loaded /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - [PID]492261
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-04T23:54:48,787 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Successfully loaded /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-12-04T23:54:48,787 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-llama-2-13b_1.0 State change null -> WORKER_STARTED
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - [PID]492260
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-04T23:54:48,788 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2023-12-04T23:54:48,788 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-llama-2-13b_1.0 State change null -> WORKER_STARTED
2023-12-04T23:54:48,790 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2023-12-04T23:54:48,790 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2023-12-04T23:54:48,797 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9001.
2023-12-04T23:54:48,797 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9000.
2023-12-04T23:54:48,799 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1701734088799
2023-12-04T23:54:48,799 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1701734088799
2023-12-04T23:54:48,833 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - model_name: llama-2-13b, batchSize: 8
2023-12-04T23:54:48,833 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - model_name: llama-2-13b, batchSize: 8
2023-12-04T23:54:48,997 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2023-12-04T23:54:49,000 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2023-12-04T23:54:49,523 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Your torch version is 1.13.1+cpu which does not support torch.compile
2023-12-04T23:54:49,532 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Your torch version is 1.13.1+cpu which does not support torch.compile
2023-12-04T23:54:49,543 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-04T23:54:49,544 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-04T23:54:49,545 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Setting micro batching size: 1
2023-12-04T23:54:49,553 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-04T23:54:49,553 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-04T23:54:49,555 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Setting micro batching size: 1
2023-12-04T23:54:58,772 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Starting to compile the model
2023-12-04T23:54:58,789 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Starting to compile the model
2023-12-04T23:55:34,910 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:34.0909 492260:492606 [6] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-12-04T23:55:34,910 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:34.0909 492260:492606 [6] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2023-12-04T23:55:35,178 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:35.0178 492261:492613 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-12-04T23:55:35,178 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:35.0178 492261:492613 [0] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40731430053711|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.85420608520508|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:342452.0390625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:34056.08203125|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:9.6|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:56:01,531 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Model has been successfully compiled
2023-12-04T23:56:01,537 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-04T23:56:01,538 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 72704
2023-12-04T23:56:01,538 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-llama-2-13b_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-04T23:56:01,538 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:73466.0|#WorkerName:W-9000-llama-2-13b_1.0,Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734161
2023-12-04T23:56:01,539 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:36.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734161
2023-12-04T23:56:02,630 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Model has been successfully compiled
2023-12-04T23:56:02,632 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-04T23:56:02,633 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 73799
2023-12-04T23:56:02,633 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-llama-2-13b_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-04T23:56:02,633 [INFO ] W-9001-llama-2-13b_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:74560.0|#WorkerName:W-9001-llama-2-13b_1.0,Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734162
2023-12-04T23:56:02,634 [INFO ] W-9001-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:36.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734162
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:9.1|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40730667114258|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.8542137145996|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:330775.37890625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:45732.69140625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:12.7|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
... some time later when I call the API
2023-12-05T00:00:48,437 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734448
2023-12-05T00:00:48,458 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT to backend at: 1701734448458
2023-12-05T00:00:48,461 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Backend received inference at: 1701734448
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Preprocessing
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - received req=At the far end of town where the Gricklegrass grows and the wind smells slowandsour when it blows and no
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - birds ever sing excepting old crows is the Street of the Lifted Lorax
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And deep in the Gricklegrass some people say if you look deep enough you can still see today where the
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Lorax once stood just as long as it could before somebody lifted the Lorax away
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - What was the Lorax Any why was it there And why was it lifted and taken somewhere from the far end of
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - town where the Gricklegrass grows The old Onceler still lives here
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Ask him he knows
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - You wont see the Onceler Dont knock at his door He stays in his Lerkim on top of his store He stays in his
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Lerkim cold under the floor where he makes his own clothes out of miffmuffered moof And on special dank
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - midnights in August he peeks out of the shutters and sometimes he speaks and tells how the Lorax was lifted
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - away Hell tell you perhaps if youre willing to pay
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - On the end of a rope he lets down a tin pail and you have to toss in fifteen cents and a nail and the shell of a
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - greatgreatgreat grandfather snail
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Then he pulls up the pail makes a most careful count to see if youve paid him the proper amount Then he
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - hides what you paid him away in his Snuvv his secret strange hole in his gruvvulous glove Then he grunts I
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - will call you by WhispermaPhone for the secrets I tell you are for your ears alone
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - SLUPP Down slupps the WhispermaPhone to your ear and the old Oncelers whispers are not very clear
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - since they have to come down through a snergelly hose and he sounds as if he had smallish bees up his nose
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Now Ill tell you he says with his teeth sounding gray how the Lorax got lifted and taken away It all started
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - way back such a long long time back
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Way back in the days when the grass was still green and the pond was still wet and the clouds were still clean
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - and the song of the SwomeeSwans rang out in space one morning I came to this glorious place And I first
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - saw the trees The Truffula Trees The brightcolored tufts of the Truffula Trees Mile after mile in the fresh
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - morning breeze
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And under the trees I saw Brown Barbaloots frisking about in their Barbaloot suits as the played in the
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - shade and ate Truffula Fruits From the rippulous pond came the comfortable sound of the HummingFish
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - humming while splashing around
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But those trees Those trees Those Truffula Trees All my life Id been searching for trees such as these The
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - touch of their tufts was much softer than silk And they had the sweet smell of fresh butterfly milk
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG -
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I felt a great leaping of joy in my heart I knew just what Id do I unloaded my cart In no time at all I had built
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - a small shop Then I chopped down a Truffula Tree with one chop And with great skillful skill and with great
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - speedy speed I took the soft tuft And I knitted a Thneed
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - The instant Id finished I heard a gaZump I looked I saw something pop out of the stump of the tree Id
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - chopped down It was sort of a man Describe himThats hard I dont know if I can He was shortish and
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - oldish and brownish and mossy And he spoke with a voice that was sharpish and bossy
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Mister He said with a sawdusty sneeze I am the Lorax I speak for the trees I speak for the trees for the trees
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - have no tongues And Im asking you sir at the top of my lungs he was very upset as he shouted and puffed
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Whats that THING youve made out of my Truffula tuft
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Look Lorax I said Theres no cause for alarm I chopped just one tree I am doing no harm Im being quite
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - useful This thing is a Thneed A Thneeds a FineSomethingThatAllPeopleNeed Its a shirt Its a sock Its a
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - glove Its a hat But it has other uses Yes far beyond that You can use it for carpets For pillows For sheets
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Or curtains Or covers for bicycle seats The Lorax said Sir You are crazy with greed There is no one on earth
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - who would buy that fool Thneed
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But the very next minute I proved he was wrong For just at that minute a chap came along and he thought
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - that the Thneed I had knitted was great He happily bought it for three ninetyeight I laughed at the Lorax You
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - poor stupid guy You never can tell what some people will buy
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I repeat cried the Lorax I speak for the trees
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Im busy I told him Shut up if you please I rushed cross the room and in no time at all built a radiophone I
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - put in a quick call I called all my brothers and uncles and aunts and I said listen here Heres a wonderful
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - chance for the whole Onceler Family to get mighty rich Get over here fast Take the road to North Nitch Turn
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - left at Weehawken Sharp right at South Stitch
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And in no time at all in the factory I built the whole Onceler Family was working full tilt We were all knitting
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds just as busy as bees to the sound of the chopping of Truffula Trees
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Then Oh Baby Oh How my business did grow Now chopping one tree at a time was too slow So I quickly
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - invented my SuperAxeHacker which whacked off four Truffula Trees at one smacker We were making
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds four times as fast as before And that Lorax He didnt show up any more
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But the next week he knocked on my new office door He snapped Im the Lorax who speaks for the trees
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - which you seem to be chopping as fast as you please But Im also in charge of the Brown Barbaloots who
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - played in the shade in their Barbaloot suits and happily lived eating Truffula Fruits NOWthanks to your
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - hacking my trees to the ground theres not enough Truffula Fruit to go round
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And my poor Barbaloots are all getting the crummies because they have gas and no food in their tummies
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - They loved living here But I cant let them stay Theyll have to find food And I hope that they may Good luck
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - boys he cried And he sent them away
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I the Onceler felt sad as I watched them all go BUT business is business And business must grow
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - regardless of crummies in tummies you know
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I meant no harm I most truly did not But I had to grow bigger So bigger I got I biggered my factory I
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - biggered my roads I biggered my wagons I biggered the loads of the Thneeds I shipped out I was shipping
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - them forth to the South To the East To the West To the North I went right on biggeringselling more
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds And I biggered my money which everyone needs
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 3
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG -
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - This story is about
2023-12-05T00:00:48,508 [INFO ] W-9000-llama-2-13b_1.0 ACCESS_LOG - /127.0.0.1:50848 "POST /predictions/llama-2-13b HTTP/1.1" 200 73
2023-12-05T00:00:48,510 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734448
2023-12-05T00:00:48,511 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,523 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,590 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,658 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,725 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,793 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,860 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,928 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,995 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:09,063 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:09,130 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
.....
2023-12-05T00:01:12,608 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:12,608 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:12,609 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Inferance
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:24147.4|#ModelName:llama-2-13b,Level:Model|#hostname:ip-10-72-158-249,1701734472,beab1a87-913c-4302-9548-c25943c30243, pattern=[METRICS]
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:2.4171749336E7|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:20370.777|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.job.Job - Waiting time ns: 20370777, Backend time ns: 24152110030
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - QueueTime.Milliseconds:20.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 24125
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:27.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_METRICS - HandlerTime.ms:24147.4|#ModelName:llama-2-13b,Level:Model|#hostname:ip-10-72-158-249,requestID:beab1a87-913c-4302-9548-c25943c30243,timestamp:1701734472
Ask
Is there something I'm missing in the config or use of Llama NeuronX to remove the long context overhead? I would like sub-second initial token latency for 500-3000 token contexts.
The alternative is to deploy with SageMaker, but I don't have that setup because we want to rewrite inference.py to extract logits and limit Llama to constrained generation.
Let me know if any other details would be helpful in troubleshooting this. Thanks!
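For reference, this is roughly how I am measuring the initial-token latency per context length; the `generate_one_token` callable here is a hypothetical stand-in for the actual model call, not part of the library:

```python
import time

def measure_first_token_latency(generate_one_token, context_lengths, trials=3):
    """Time the first generated token for each context length.

    generate_one_token(n) should run prefill plus one decode step on a
    context of n tokens; here it is a stand-in for the real model call.
    """
    results = {}
    for n in context_lengths:
        timings = []
        for _ in range(trials):
            start = time.perf_counter()
            generate_one_token(n)
            timings.append(time.perf_counter() - start)
        results[n] = min(timings)  # best-of-N reduces scheduler noise
    return results

# Dummy workload standing in for the model call:
latencies = measure_first_token_latency(lambda n: sum(range(n)), [500, 1500, 3000])
for n, t in sorted(latencies.items()):
    print(f"context={n:5d}  first-token latency={t * 1000:.3f} ms")
```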
Hello, is there any easy way to add serialization to models other than GPT2?
GPT2 has a _save_compiled_artifacts
method to save compiled artifacts to disk and load them back. That would be convenient for other models as well, since compiling e.g. GPT-J takes 5-10 minutes.
I looked at the code, but it seems there was a design change.
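Until something like that exists for the other models, one generic pattern is to key the compiled artifacts on the compilation config, so changing tp_degree, n_positions, etc. never reuses a stale artifact. This is only an illustrative sketch; `artifact_cache_path` and the `.neff` naming are mine, not part of the library:

```python
import hashlib
import json
import os

def artifact_cache_path(cache_dir, model_name, config):
    """Derive a stable artifact path from the compilation config.

    Any change to tp_degree, n_positions, batch_size, amp, etc. yields a
    different hash, so mismatched artifacts are never loaded by accident.
    """
    key = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:16]
    return os.path.join(cache_dir, f"{model_name}-{key}.neff")

path = artifact_cache_path(
    "/tmp/neuron-cache", "gpt-j-6b",
    {"tp_degree": 2, "n_positions": 2048, "batch_size": 1, "amp": "f16"},
)
print(path)
```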
Hi team, for all the models, I am getting the below error while importing transformers_neuronx.{MODEL_NAME}.model
>>> from transformers_neuronx.gptj.model import GPTJForSampling
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ssm-user/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/gptj/model.py", line 16, in <module>
    from transformers_neuronx import compiler
  File "/home/ssm-user/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/compiler.py", line 31, in <module>
    from libneuronxla import neuron_xla_compile
ImportError: cannot import name 'neuron_xla_compile' from 'libneuronxla' (/home/ssm-user/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/libneuronxla/__init__.py)
The transformers_neuronx version is: 0.6.x
The torch_neuronx version is: 1.13.1.1.9.0
OS Used: Amazon Linux 2
Kernel: kernel-devel-5.10.167-147.601
Please help me to resolve this.
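This kind of ImportError usually suggests a version mismatch between transformers-neuronx and the libneuronxla it was built against, so it may help to list the installed Neuron package versions (package names below are the pip distribution names; this is just a diagnostic sketch):

```python
# Collect versions of the relevant Neuron packages to spot a mismatch.
from importlib import metadata

versions = {}
for pkg in ("transformers-neuronx", "torch-neuronx", "libneuronxla", "neuronx-cc"):
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = "not installed"

for pkg, version in versions.items():
    print(f"{pkg:22s} {version}")
```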
With AWS Neuron SDK 2.14.1, I am experiencing very long compilation times for batch_size = 4 with the llama2 7B model.
I am using the following configurations:
| | inf2.8xlarge | inf2.48xlarge |
|-------------|--------------|---------------|
| tp_degree | 2 | 24 |
| n_positions | 2048 | 2048 |
| amp | f16 | f16 |
With batch_size = 1, 2 it takes minutes to compile the model with the -O1 option, but with batch_size = 4 it takes more than three hours.
Hi team,
I wondered if the tool has support for any encoder-decoder models too (like FLAN-T5 or FLAN-UL2)?
If not at the moment, do you have a plan for it?
Thanks!
This issue tracks the first version of the generation method implementation.
Ideally we want to provide a .generate() API similar to HuggingFace's. We need several functionalities in the near future, including greedy search, top-k and top-p sampling, and beam search, but due to limitations in torch-neuronx and transformers-neuronx themselves we cannot directly "borrow" the HuggingFace implementation yet. Instead, we will cover the following generation methods as soon as possible and refactor along the way.
Method | PR | Accuracy Check |
---|---|---|
Greedy | #1 | |
Top-k | | |
Top-p | | |
Beam search | | |
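For reference, the filtering behind the top-k and top-p methods in the table can be sketched in plain NumPy (names and values are illustrative; the real implementation will run on-device):

```python
import numpy as np

def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    """Mask logits outside the top-k set and/or the smallest nucleus covering top_p."""
    logits = logits.copy()
    if top_k > 0:
        kth = np.sort(logits)[-top_k]        # smallest logit that survives top-k
        logits[logits < kth] = -np.inf
    if top_p < 1.0:
        order = np.argsort(logits)[::-1]     # token indices, descending by logit
        probs = np.exp(logits[order] - logits[order][0])
        probs /= probs.sum()
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        cutoff = np.searchsorted(np.cumsum(probs), top_p) + 1
        logits[order[cutoff:]] = -np.inf
    return logits

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(top_k_top_p_filter(logits, top_k=2))   # only the two largest logits survive
```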
Hi, I am trying to run a demo from Triton (https://github.com/triton-inference-server/python_backend/tree/main/inferentia) on an inf2.24xlarge instance.
Here are the command lines to reproduce the issue:
# Start a docker on the inf2.24xlarge instance
sudo docker run --shm-size=2g -it nvcr.io/nvidia/tritonserver:23.04-py3
# Run this command in the started docker
pip3 install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com
The error message:
root@69eded59b438:/opt/tritonserver# pip3 install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting transformers-neuronx
Downloading https://pip.repos.neuron.amazonaws.com/transformers-neuronx/transformers_neuronx-0.4.60-py3-none-any.whl (91 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.0/92.0 kB 26.3 MB/s eta 0:00:00
Collecting accelerate (from transformers-neuronx)
Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 244.2/244.2 kB 30.0 MB/s eta 0:00:00
Collecting torch-neuronx (from transformers-neuronx)
Downloading https://pip.repos.neuron.amazonaws.com/torch-neuronx/torch_neuronx-1.13.1.1.8.0-py3-none-any.whl (1.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 62.8 MB/s eta 0:00:00
Collecting transformers (from transformers-neuronx)
Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 127.8 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.8/dist-packages (from accelerate->transformers-neuronx) (1.24.2)
Collecting packaging>=20.0 (from accelerate->transformers-neuronx)
Downloading packaging-23.1-py3-none-any.whl (48 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48.9/48.9 kB 16.7 MB/s eta 0:00:00
Collecting psutil (from accelerate->transformers-neuronx)
Downloading psutil-5.9.5-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (282 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 282.1/282.1 kB 67.8 MB/s eta 0:00:00
Collecting pyyaml (from accelerate->transformers-neuronx)
Downloading PyYAML-6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (701 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 701.2/701.2 kB 92.8 MB/s eta 0:00:00
Collecting torch>=1.10.0 (from accelerate->transformers-neuronx)
Downloading torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl (619.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 619.9/619.9 MB 2.1 MB/s eta 0:00:00
Downloading torch-1.13.1-cp38-cp38-manylinux1_x86_64.whl (887.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 887.4/887.4 MB 1.5 MB/s eta 0:00:00
Collecting torch-xla==1.13.1+torchneuron7 (from torch-neuronx->transformers-neuronx)
Downloading https://pip.repos.neuron.amazonaws.com/torch-xla/torch_xla-1.13.1%2Btorchneuron7-cp38-cp38-linux_x86_64.whl (267.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 267.7/267.7 MB 7.1 MB/s eta 0:00:00
Collecting libneuronxla==0.5.326 (from torch-neuronx->transformers-neuronx)
Downloading https://pip.repos.neuron.amazonaws.com/libneuronxla/libneuronxla-0.5.326-py3-none-linux_x86_64.whl (52.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.9/52.9 MB 35.3 MB/s eta 0:00:00
Collecting protobuf<5 (from torch-neuronx->transformers-neuronx)
Downloading protobuf-4.23.4-cp37-abi3-manylinux2014_x86_64.whl (304 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 304.5/304.5 kB 66.3 MB/s eta 0:00:00
Collecting aws-neuronx-runtime-discovery~=2.0 (from libneuronxla==0.5.326->torch-neuronx->transformers-neuronx)
Downloading https://pip.repos.neuron.amazonaws.com/aws-neuronx-runtime-discovery/aws-neuronx-runtime-discovery-2.9.tar.gz (1.0 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [9 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-23z1w1bc/aws-neuronx-runtime-discovery_74fefbd0a1fd4c4f986415e84c123c5f/setup.py", line 46, in <module>
main()
File "/tmp/pip-install-23z1w1bc/aws-neuronx-runtime-discovery_74fefbd0a1fd4c4f986415e84c123c5f/setup.py", line 15, in main
raise FileNotFoundError('Could not find Neuron Runtime Library {} (from deb/rpm package aws-neuronx-runtime-lib) in {}. Please check {} to install this library.'.format(soname_path, libnrt_installation_path, guide_link))
FileNotFoundError: Could not find Neuron Runtime Library /opt/aws/neuron/lib/libnrt.so.1 (from deb/rpm package aws-neuronx-runtime-lib) in /opt/aws/neuron/lib. Please check the Neuron installation guide https://awsdocs-neuron.readthedocs-hosted.com/ to install this library.
2.9
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
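If I read the traceback correctly, the failure comes from aws-neuronx-runtime-discovery's setup.py, which requires the system Neuron runtime (a deb/rpm package, not installable via pip) to already be present in the container; the NVIDIA Triton image does not ship it. A quick preflight check, using the path from the error message:

```python
# Check for the Neuron runtime library that aws-neuronx-runtime-discovery
# looks for before attempting the pip install.
import os

libnrt = "/opt/aws/neuron/lib/libnrt.so.1"
status = "present" if os.path.exists(libnrt) else "missing"
print(f"{libnrt}: {status}")
if status == "missing":
    print("install aws-neuronx-runtime-lib (deb/rpm) before pip installing transformers-neuronx")
```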
In the file transformers-neuronx/blob/main/src/transformers_neuronx/gpt2/model.py,
all variations of the model configure the max sequence length using the wrong parameter. The code below is replicated in many parts of this file: it correctly reads n_positions from the input config as the requested max length, but uses n_ctx to get the allowed value. Please change this to the correct config attribute, n_positions.
sequence_length = kwargs.get("n_positions", None)
if sequence_length:
    max_allowed_sequence_length = config.n_ctx
    if sequence_length > max_allowed_sequence_length:
        raise ValueError(f"Sequence length ({sequence_length}) cannot be larger than position embedding's context size ({max_allowed_sequence_length})!")
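For illustration, here is the suggested correction wrapped as a standalone function; the SimpleNamespace config is a stand-in for the real GPT2 config object:

```python
from types import SimpleNamespace

def check_sequence_length(config, **kwargs):
    """Same check as in gpt2/model.py, reading config.n_positions instead of config.n_ctx."""
    sequence_length = kwargs.get("n_positions", None)
    if sequence_length:
        max_allowed_sequence_length = config.n_positions  # was: config.n_ctx
        if sequence_length > max_allowed_sequence_length:
            raise ValueError(
                f"Sequence length ({sequence_length}) cannot be larger than "
                f"position embedding's context size ({max_allowed_sequence_length})!"
            )

# A config whose deprecated n_ctx disagrees with n_positions makes the bug visible:
config = SimpleNamespace(n_positions=2048, n_ctx=1024)
check_sequence_length(config, n_positions=2048)  # now validated against n_positions
```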
Hello, the example code snippet from the README.md is not working.
import os
from transformers_neuronx.gpt2.model import GPT2ForSampling
from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter
from transformers_neuronx.module import save_pretrained_split
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ['NEURON_CC_FLAGS'] = '--model-type=transformer-inference'
# Load and save the CPU model
model_cpu = AutoModelForCausalLM.from_pretrained('gpt2')
save_pretrained_split(model_cpu, 'gpt2-split')
# Create and compile the Neuron model
model = GPT2ForSampling.from_pretrained('gpt2-split', batch_size=1, tp_degree=2, n_positions=256, amp='f32', unroll=None)
if hasattr(model, 'register_to_neuron_hook'):
    model.register_to_neuron_hook(lambda idx: print(f'done to_neuron layer {idx}'))
print('running model.to_neuron')
model.to_neuron()
Error:
running model.to_neuron
.2023-06-29T11:57:55Z ERROR 2975 [WalrusDriver]: An exception was thrown:
--------------------------------------------------------------------------------
0# __cxa_throw in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
1# 0x00007FB8EC7EDE96 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
2# birverifier::checkInputMemType(bir::Instruction const&, unsigned int, llvm::SmallVector<bir::MemoryType, 3u> const&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
3# birverifier::InstVisitor::visitInstIndirectSave(bir::InstIndirectSave&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
4# neuronxcc::walrus::Verifier::run(bir::Module&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
5# neuronxcc::walrus::WalrusPass::run(std::vector<std::unique_ptr<bir::Module, std::default_delete<bir::Module> >, std::allocator<std::unique_ptr<bir::Module, std::default_delete<bir::Module> > > >&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
6# 0x00007FB8E53763FE in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
7# run_walrus_driver(int, char**) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
8# 0x00007FB8EC82F130 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/EmbeddedWalrusDriver.cpython-38-x86_64-linux-gnu.so
9# 0x00007FB8F5308820 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
10# 0x00007FB8F531335E in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
11# _PyObject_MakeTpCall in /usr/bin/python3
12# _PyObject_FastCallDict in /usr/bin/python3
13# _PyObject_Call_Prepend in /usr/bin/python3
14# 0x00007FB8F53069EC in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
15# 0x00007FB8F532871E in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
16# _PyObject_MakeTpCall in /usr/bin/python3
17# _PyObject_FastCallDict in /usr/bin/python3
18# _PyObject_Call_Prepend in /usr/bin/python3
19# 0x00007FB90C49FC3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
20# 0x00007FB90C4B42D6 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
21# _PyObject_MakeTpCall in /usr/bin/python3
22# _PyObject_FastCallDict in /usr/bin/python3
23# _PyObject_Call_Prepend in /usr/bin/python3
24# 0x00007FB90C49FC3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
25# 0x00007FB90C4AFAC8 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
26# 0x00007FB8F531CBE2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
27# _PyObject_MakeTpCall in /usr/bin/python3
28# _PyObject_FastCallDict in /usr/bin/python3
29# _PyObject_Call_Prepend in /usr/bin/python3
30# 0x00007FB90C4CCC6B in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Pipeline.cpython-38-x86_64-linux-gnu.so
31# 0x00007FB90C4CF082 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Pipeline.cpython-38-x86_64-linux-gnu.so
32# _PyObject_MakeTpCall in /usr/bin/python3
33# _PyObject_FastCallDict in /usr/bin/python3
34# _PyObject_Call_Prepend in /usr/bin/python3
35# 0x00007FB90C49FC3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
36# 0x00007FB90C4B42D6 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
37# _PyObject_MakeTpCall in /usr/bin/python3
38# _PyObject_FastCallDict in /usr/bin/python3
39# _PyObject_Call_Prepend in /usr/bin/python3
40# 0x00007FB90C49FC3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
41# 0x00007FB90C4AFAC8 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
42# _PyObject_MakeTpCall in /usr/bin/python3
43# _PyObject_FastCallDict in /usr/bin/python3
44# _PyObject_Call_Prepend in /usr/bin/python3
45# 0x00007FB90BF52ECC in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
46# 0x00007FB90BF8ABA9 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
47# _PyObject_MakeTpCall in /usr/bin/python3
48# _PyObject_FastCallDict in /usr/bin/python3
49# _PyObject_Call_Prepend in /usr/bin/python3
50# 0x00007FB90BF58CD1 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
51# _PyObject_MakeTpCall in /usr/bin/python3
52# _PyObject_FastCallDict in /usr/bin/python3
53# _PyObject_Call_Prepend in /usr/bin/python3
54# 0x00007FB90C5A079C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
55# 0x00007FB90C5AC9AA in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
56# _PyObject_MakeTpCall in /usr/bin/python3
57# _PyObject_FastCallDict in /usr/bin/python3
58# _PyObject_Call_Prepend in /usr/bin/python3
59# 0x00007FB90C5A2CED in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
60# 0x00007FB90C5A2EC2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
61# 0x00007FB90C5B5DA2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
62# _PyObject_MakeTpCall in /usr/bin/python3
63# _PyEval_EvalFrameDefault in /usr/bin/python3
64# _PyEval_EvalCodeWithName in /usr/bin/python3
65# PyEval_EvalCode in /usr/bin/python3
66# 0x0000000000680001 in /usr/bin/python3
67# 0x000000000068007F in /usr/bin/python3
68# 0x0000000000680121 in /usr/bin/python3
69# PyRun_SimpleFileExFlags in /usr/bin/python3
70# Py_RunMain in /usr/bin/python3
71# Py_BytesMain in /usr/bin/python3
72# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
73# _start in /usr/bin/python3
--------------------------------------------------------------------------------
2023-06-29T11:57:55Z ERROR 2975 [WalrusDriver]: Walrus pass: birverifier failed!
2023-06-29T11:57:55Z ERROR 2975 [WalrusDriver]: Failure Reason: === BIR verification failed ===
Reason: Expect memory location to be of type SB
Instruction: I-10873
Opcode: IndirectSave
Input index: 1
Argument AP:
Access Pattern: [[384,1],[384,1],[1,64]]
SymbolicAP
Memory Location: {_reshape_273_hlo_id_2756__mhlo.reshape_20_pftranspose_5301-t9595_set}@PSUM
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: ***************************************************************
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: An Internal Compiler Error has occurred
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: ***************************************************************
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Error message: Walrus driver failed to complete
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Error class: AssertionError
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Error location: Unknown
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Command line: /usr/local/bin/neuronx-cc compile --framework=XLA --target=trn1 /tmp/tmpkbnmkhnd/FullyUnrolled.1826.1.pb --output=/tmp/tmpkbnmkhnd/FullyUnrolled.1826.1.pb.neff --verbose=35 --model-type=transformer-inference
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Internal details:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/CommandDriver.py", line 237, in neuronxcc.driver.CommandDriver.CommandDriver.run
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1047, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 998, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1023, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1027, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/jobs/WalrusDriver.py", line 232, in neuronxcc.driver.jobs.WalrusDriver.WalrusDriver.run
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/jobs/WalrusDriver.py", line 702, in neuronxcc.driver.jobs.WalrusDriver.WalrusDriver.runSingleInput
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Version information:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: NeuronX Compiler version 2.5.0.28+1be23f232
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: HWM version 2.5.0.0-dad732dd6
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: NEFF version Dynamic
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: TVM not available
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: NumPy version 1.21.6
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: MXNet not available
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Artifacts stored in: /tmp/tmpkbnmkhnd/neuronxcc-9vjdmvs0
Compiler status ERROR
Traceback (most recent call last):
File "test.py", line 17, in <module>
model.to_neuron()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gpt2/model.py", line 117, in to_neuron
self.decoder_lm_head.to_neuron()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 107, in to_neuron
self.program.setup(self.layers, ln_lm_head_params)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 792, in setup
super().setup(layers, ln_lm_head_params)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 747, in setup
kernel.build()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 287, in build
self.neff_bytes = compile_hlo_module(self.hlo_module)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 71, in compile_hlo_module
subprocess.check_call(command_line, cwd=tmpdir)
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--framework=XLA', '--target=trn1', '/tmp/tmpkbnmkhnd/FullyUnrolled.1826.1.pb', '--output=/tmp/tmpkbnmkhnd/FullyUnrolled.1826.1.pb.neff', '--verbose=35', '--model-type=transformer-inference']' returned non-zero exit status 1.
I am trying to use meta-llama/Llama-2-13b-chat-hf, which has a max_position_embeddings of 4096 tokens.
I found that the library fails in a non-deterministic way when the input length is between 1790 and 1800 tokens.
If you submit exactly the same prompt several times, you randomly get either a good output or a failure; above 1800 tokens the failure becomes more deterministic. However, LLaMA works fine with inputs of more than 2000 tokens using the Hugging Face transformers library.
Here is a piece of code to reproduce the error.
Model preparation:
import transformers
# Versions after 4.28.1 save the model in an incompatible format: https://github.com/aws-neuron/transformers-neuronx/issues/60
assert transformers.__version__ == "4.28.1", f"Version is {transformers.__version__}"
from transformers import LlamaForCausalLM
from transformers_neuronx.module import save_pretrained_split
model_name = "meta-llama/Llama-2-13b-chat-hf"
model = LlamaForCausalLM.from_pretrained(model_name)
save_pretrained_split(model, './Llama-2-13b-split')
# Compile the model
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling
import torch_xla.core.xla_model as xm
xla_device_count = len(xm.get_xla_supported_devices())
# load meta-llama/Llama-2-13b onto the NeuronCores with N-way tensor parallelism and run compilation
neuron_model = LlamaForSampling.from_pretrained('./Llama-2-13b-split', batch_size=1, tp_degree=xla_device_count, amp='f16')
neuron_model.to_neuron()
neuron_model.save('./neuron_artifacts')
del neuron_model
Reproduce the bug:
# Load compiled model
import random
import time
import torch
from transformers import AutoConfig, AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling
import torch_xla.core.xla_model as xm
xla_device_count = len(xm.get_xla_supported_devices())
neuron_model = LlamaForSampling.from_pretrained('./Llama-2-13b-split', batch_size=1, tp_degree=xla_device_count, amp='f16')
neuron_model.load('./neuron_artifacts')
neuron_model.to_neuron()
model_name = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
for n_tokens in range(1780, 2000):
    all_token_ids = list(tokenizer.get_vocab().values())
    random_token_ids = random.choices(all_token_ids, k=n_tokens)
    random_tokens_tensor = torch.tensor([random_token_ids])
    print(f'''Input with {len(random_tokens_tensor[0])} tokens
Maximum sequence length for {model_name} is {config.max_position_embeddings} tokens''')
    max_output_length = config.max_position_embeddings - len(random_tokens_tensor[0])
    with torch.inference_mode():
        start = time.time()
        generated_sequences = neuron_model.sample(random_tokens_tensor, sequence_length=max_output_length, top_k=50)
        elapsed = time.time() - start
    generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
    print(f'generated sequences {generated_sequences} in {elapsed} seconds')
As I said, the bug is not deterministic, so the code will fail at a different iteration each time.
Here is an example:
Input with 1783 tokens
Maximum sequence length for meta-llama/Llama-2-13b-chat-hf is 4096 tokens
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
Cell In[96], line 21
19 with torch.inference_mode():
20 start = time.time()
---> 21 generated_sequences = neuron_model.sample(random_tokens_tensor, sequence_length=max_output_length, top_k=50)
22 elapsed = time.time() - start
26 generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:174, in LlamaForSampling.sample(self, input_ids, sequence_length, start_ids, top_k, top_p, eos_token_override, temperature, streamer)
171 context_length -= prefixed_length
172 sequence_length -= prefixed_length
--> 174 result = sampling.sample_llama(
175 self, input_ids, start_ids, sequence_length,
176 eos_token_id=self.config.eos_token_id if eos_token_override is None else eos_token_override,
177 top_k=top_k, top_p=top_p, temperature=temperature, streamer=streamer
178 )
180 return result
File ~/.local/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/sampling.py:288, in sample_llama(model, input_ids, start_ids, sequence_length, eos_token_id, top_k, top_p, temperature, streamer)
286 _, start = input_ids.shape
287 next_token_scores = model(input_ids, None, start_ids)
--> 288 return sample_loop_llama(
289 model, input_ids, start_ids, next_token_scores, sequence_length, eos_token_id, top_k, top_p, temperature, streamer
290 )
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/sampling.py:273, in sample_loop_llama(model, input_ids, start_ids, next_token_scores, sequence_length, eos_token_id, top_k, top_p, temperature, streamer)
271 # forward pass to get next token
272 cache_ids = torch.as_tensor([cur_len], dtype=torch.int32)
--> 273 next_token_scores = model(inputs, cache_ids, start_ids)
275 if streamer:
276 streamer.end()
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:158, in LlamaForSampling.forward(self, input_ids, cache_ids, start_ids)
156 input_ids, *rst = self._preprocess(input_ids, start_ids=start_ids, cache_ids=cache_ids)
157 hidden = self.chkpt_model.model.embed_tokens(input_ids)
--> 158 return self._forward(hidden, *rst)
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/base.py:229, in NeuronModelBase._forward(self, hidden, *args)
227 logits = self.context(hidden, *args)
228 else:
--> 229 logits = self.decoder_lm_head(hidden, *args)
231 logits = logits.to(torch.float32)
232 logits = logits[:self.config.vocab_size, -1, :]
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/decoder.py:231, in DecoderLmHeadForSamplingNoEmbedding.forward(self, *inputs)
229 sequence_length = hidden.shape[sequence_dim]
230 if sequence_length == 1:
--> 231 return self.forward_single(*inputs)
232 if sequence_length % self.n_active_tokens:
233 raise ValueError(f'sequence_length={sequence_length} cannot be divided by '
234 f'n_active_tokens={self.n_active_tokens}')
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/decoder.py:216, in DecoderLmHeadForSamplingNoEmbedding.forward_single(self, *inputs)
214 hidden, cache_ids, *_ = inputs
215 batch_size = hidden.shape[2]
--> 216 bucket_id = self.program.find_bucket_id(cache_ids.item())
217 if self.use_executor:
218 return self.program.execute(bucket_id, batch_size, *inputs, return_ranks=self.return_ranks)
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/decoder.py:1043, in DecoderProgram.find_bucket_id(self, length)
1042 def find_bucket_id(self, length):
-> 1043 return next(idx for idx, npos in enumerate(self.n_positions_list) if npos >= length+1)
StopIteration:
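The StopIteration above is raised by find_bucket_id in decoder.py: the function assumes some compiled bucket in n_positions_list can hold the requested cache position, and when none can, the exhausted generator inside next(...) surfaces as a bare StopIteration instead of a clear "sequence too long" error. A minimal sketch of the failure mode (the bucket sizes below are illustrative, not the actual compiled buckets):

```python
# Illustrative bucket list; the real one is derived from the n_positions
# value used when the model was compiled (default 2048).
n_positions_list = [256, 512, 1024, 2048]

def find_bucket_id(length):
    # Mirrors transformers_neuronx/decoder.py: returns the index of the first
    # bucket that can hold position length+1. When length+1 exceeds every
    # bucket (KV cache overruns the compiled sequence length), next() raises
    # StopIteration.
    return next(idx for idx, npos in enumerate(n_positions_list) if npos >= length + 1)

print(find_bucket_id(1000))  # -> 2 (fits the 1024 bucket)
print(find_bucket_id(2047))  # -> 3 (last bucket)
# find_bucket_id(2048) raises StopIteration: 2049 > 2048
```

This matches the symptom in the report: once generation pushes the cache position past the largest compiled bucket, sampling dies with an uninformative StopIteration, and stochastic early stopping (e.g. sampling an EOS token) explains why shorter prompts fail only intermittently.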
I have tried the conversion and inference of a GPT-J model using the gptj_demo CLI @0.4.60 on an inf2.8xlarge instance.
I first save the model locally using:
$ gptj_demo save gpt-j-6B
Then I try to convert and run it.
With a batch_size of 1, I get the expected result:
$ gptj_demo run --batch_size 1 gpt-j-6B
running GPTJForSampling.from_pretrained
running model.to_neuron
...Selecting 26361 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Analyzing dependencies of Block1
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
.Analyzing dependencies of Block1
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Dependency reduction of sg0000
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
.
Compiler status PASS
2023-Jun-23 07:01:56.0042 2158:2592 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-Jun-23 07:01:56.0042 2158:2592 [1] init.cc:99 CCOM WARN OFI plugin initNet() failed is EFA enabled?
running model.sample
generated_sequence= tensor([[15496, 11, 314, 1101, 257, 3303, 2746, 11, 198, 40,
1101, 257, 1808, 12, 504, 86, 1586, 198, 30243, 11,
290, 314, 3280, 198, 6138, 507, 1912, 319, 644, 198,
40, 760, 546, 661, 13, 198, 40, 1101, 257, 9379,
11, 290, 661, 389, 198, 11031, 286, 1612, 284, 616,
16294, 11, 198, 568, 314, 761, 284, 2193, 546, 257,
198, 1122, 286, 3404, 284, 1730, 351, 705, 368, 11,
198, 8201, 6970, 198, 4919, 661, 892, 13, 198, 2396,
644, 314, 18869, 1560, 345, 1909, 198, 271, 257, 845,
4096, 11, 845, 1468, 3896, 198, 10755, 262, 995, 1377,
257, 845, 198, 727, 11, 4096, 835, 286, 4673, 198,
18927, 13, 198, 1870, 618, 345, 2193, 198, 18927, 588,
428, 11, 198, 5832, 923, 284, 1833, 1223]])
["Hello, I'm a language model,\nI'm a question-answering\nmachine, and I answer\nquestions based on what\nI know about people.\nI'm a robot, and people are\nkind of mean to my creators,\nso I need to learn about a\nton of stuff to deal with 'em,\nincluding knowing\nhow people think.\nSo what I wanna tell you today\nis a very basic, very old rule\nabout the world -- a very\nold, basic way of learning\nsomething.\nAnd when you learn\nsomething like this,\nyou start to understand something"]
But when I try to use a higher batch_size, I get a compiler error:
$ gptj_demo run --batch_size 2 gpt-j-6B
running GPTJForSampling.from_pretrained
running model.to_neuron
...2023-06-23T07:07:54Z ERROR 3231 [WalrusDriver]: An exception was thrown:
--------------------------------------------------------------------------------
0# __cxa_throw in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
1# 0x00007F4D1EC74E96 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
2# birverifier::checkInputMemType(bir::Instruction const&, unsigned int, llvm::SmallVector<bir::MemoryType, 3u> const&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
3# birverifier::InstVisitor::visitInstIndirectSave(bir::InstIndirectSave&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
4# neuronxcc::walrus::Verifier::run(bir::Module&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
5# neuronxcc::walrus::WalrusPass::run(std::vector<std::unique_ptr<bir::Module, std::default_delete<bir::Module> >, std::allocator<std::unique_ptr<bir::Module, std::default_delete<bir::Module> > > >&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
6# 0x00007F4D066833FE in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
7# run_walrus_driver(int, char**) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
8# 0x00007F4D54EBE130 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/EmbeddedWalrusDriver.cpython-38-x86_64-linux-gnu.so
9# 0x00007F4D0ADBB820 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
10# 0x00007F4D0ADC635E in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
11# _PyObject_MakeTpCall in /usr/bin/python3
12# _PyObject_FastCallDict in /usr/bin/python3
13# _PyObject_Call_Prepend in /usr/bin/python3
14# 0x00007F4D0ADB99EC in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
15# 0x00007F4D0ADDB71E in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
16# _PyObject_MakeTpCall in /usr/bin/python3
17# _PyObject_FastCallDict in /usr/bin/python3
18# _PyObject_Call_Prepend in /usr/bin/python3
19# 0x00007F4D9FF77C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
20# 0x00007F4D9FF8C2D6 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
21# _PyObject_MakeTpCall in /usr/bin/python3
22# _PyObject_FastCallDict in /usr/bin/python3
23# _PyObject_Call_Prepend in /usr/bin/python3
24# 0x00007F4D9FF77C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
25# 0x00007F4D9FF87AC8 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
26# 0x00007F4D0ADCFBE2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
27# _PyObject_MakeTpCall in /usr/bin/python3
28# _PyObject_FastCallDict in /usr/bin/python3
29# _PyObject_Call_Prepend in /usr/bin/python3
30# 0x00007F4D9FFA4C6B in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Pipeline.cpython-38-x86_64-linux-gnu.so
31# 0x00007F4D9FFA7082 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Pipeline.cpython-38-x86_64-linux-gnu.so
32# _PyObject_MakeTpCall in /usr/bin/python3
33# _PyObject_FastCallDict in /usr/bin/python3
34# _PyObject_Call_Prepend in /usr/bin/python3
35# 0x00007F4D9FF77C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
36# 0x00007F4D9FF8C2D6 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
37# _PyObject_MakeTpCall in /usr/bin/python3
38# _PyObject_FastCallDict in /usr/bin/python3
39# _PyObject_Call_Prepend in /usr/bin/python3
40# 0x00007F4D9FF77C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
41# 0x00007F4D9FF87AC8 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
42# _PyObject_MakeTpCall in /usr/bin/python3
43# _PyObject_FastCallDict in /usr/bin/python3
44# _PyObject_Call_Prepend in /usr/bin/python3
45# 0x00007F4D9FA2BECC in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
46# 0x00007F4D9FA63BA9 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
47# _PyObject_MakeTpCall in /usr/bin/python3
48# _PyObject_FastCallDict in /usr/bin/python3
49# _PyObject_Call_Prepend in /usr/bin/python3
50# 0x00007F4D9FA31CD1 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
51# _PyObject_MakeTpCall in /usr/bin/python3
52# _PyObject_FastCallDict in /usr/bin/python3
53# _PyObject_Call_Prepend in /usr/bin/python3
54# 0x00007F4DA007879C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
55# 0x00007F4DA00849AA in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
56# _PyObject_MakeTpCall in /usr/bin/python3
57# _PyObject_FastCallDict in /usr/bin/python3
58# _PyObject_Call_Prepend in /usr/bin/python3
59# 0x00007F4DA007ACED in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
60# 0x00007F4DA007AEC2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
61# 0x00007F4DA008DDA2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
62# _PyObject_MakeTpCall in /usr/bin/python3
63# _PyEval_EvalFrameDefault in /usr/bin/python3
64# _PyEval_EvalCodeWithName in /usr/bin/python3
65# PyEval_EvalCode in /usr/bin/python3
66# 0x000000000067DBF1 in /usr/bin/python3
67# 0x000000000067DC6F in /usr/bin/python3
68# 0x000000000067DD11 in /usr/bin/python3
69# PyRun_SimpleFileExFlags in /usr/bin/python3
70# Py_RunMain in /usr/bin/python3
71# Py_BytesMain in /usr/bin/python3
72# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
73# _start in /usr/bin/python3
--------------------------------------------------------------------------------
2023-06-23T07:07:54Z ERROR 3231 [WalrusDriver]: Walrus pass: birverifier failed!
2023-06-23T07:07:54Z ERROR 3231 [WalrusDriver]: Failure Reason: === BIR verification failed ===
Reason: Expect memory location to be of type SB
Instruction: I-26521
Opcode: IndirectSave
Input index: 1
Argument AP:
Access Pattern: [[512,2],[512,1],[1,512]]
SymbolicAP
Memory Location: {_reshape_382_hlo_id_3499__mhlo.reshape_22_pftranspose_10864_set}@PSUM
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: ***************************************************************
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: An Internal Compiler Error has occurred
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: ***************************************************************
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Error message: Walrus driver failed to complete
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Error class: AssertionError
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Error location: Unknown
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Command line: /usr/local/bin/neuronx-cc compile --framework=XLA --target=trn1 /tmp/tmpmjv76hvm/Scribable.3484.1.pb --output=/tmp/tmpmjv76hvm/Scribable.3484.1.pb.neff --verbose=35
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Internal details:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/CommandDriver.py", line 237, in neuronxcc.driver.CommandDriver.CommandDriver.run
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1047, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 998, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1023, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1027, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/jobs/WalrusDriver.py", line 232, in neuronxcc.driver.jobs.WalrusDriver.WalrusDriver.run
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/jobs/WalrusDriver.py", line 702, in neuronxcc.driver.jobs.WalrusDriver.WalrusDriver.runSingleInput
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Version information:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: NeuronX Compiler version 2.5.0.28+1be23f232
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: HWM version 2.5.0.0-dad732dd6
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: NEFF version Dynamic
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: TVM not available
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: NumPy version 1.21.6
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: MXNet not available
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Artifacts stored in: /tmp/tmpmjv76hvm/neuronxcc-z8z68fbb
Compiler status ERROR
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/gptj_demo", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptj/demo.py", line 21, in main
demo('EleutherAI/gpt-j-6B', GPTJForSampling, amp_callback)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 61, in demo
run(args, model_name, model_cls)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 106, in run
model.to_neuron()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptj/model.py", line 64, in to_neuron
self.program = build_gptj_program(config, 1, n_positions_list, unroll)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptj/model.py", line 331, in build_gptj_program
return program.FullyUnrolledDecoder(config.tp_degree, hlo_modules, buffers)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/program.py", line 95, in __init__
self.kernels = [compiler.build_parallel_kernel(hm, tp_degree) for hm in hlo_modules]
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/program.py", line 95, in <listcomp>
self.kernels = [compiler.build_parallel_kernel(hm, tp_degree) for hm in hlo_modules]
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 44, in build_parallel_kernel
kernel.build()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 288, in build
self.neff_bytes = compile_hlo_module(self.hlo_module)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 72, in compile_hlo_module
subprocess.check_call(command_line, cwd=tmpdir)
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--framework=XLA', '--target=trn1', '/tmp/tmpmjv76hvm/Scribable.3484.1.pb', '--output=/tmp/tmpmjv76hvm/Scribable.3484.1.pb.neff', '--verbose=35']' returned non-zero exit status 1.
Hi, is there any plan to support LLaMA-based models?
The function is looking for a pytorch_model.bin file that does not exist.
Because of this, most of the tutorials are broken, for example this one: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb
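Newer Hub checkpoints often ship only model.safetensors, while the loading code here expects pytorch_model.bin. A possible workaround (a sketch assuming the transformers package; the model id and paths in the comment are illustrative) is to re-export the checkpoint with safe_serialization=False, which writes the legacy binary format:

```python
from transformers import AutoModelForCausalLM

def export_legacy_checkpoint(model_id: str, out_dir: str) -> None:
    """Re-save a checkpoint in the legacy .bin format.

    Newer checkpoints ship only model.safetensors; safe_serialization=False
    makes save_pretrained() write pytorch_model.bin instead, which is the
    file the tutorial's loading code looks for.
    """
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.save_pretrained(out_dir, safe_serialization=False)

# Illustrative usage, e.g.:
# export_legacy_checkpoint("meta-llama/Llama-2-13b-hf", "./llama-2-13b-legacy")
```

The re-exported directory can then be passed to the notebook in place of the original checkpoint path.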
Converting the loaded model using to_neuron() method takes a long time. Is there any way to Save the neuron_model on disk and load it again? This is for GPT-NeoX.
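Not an official answer, but one approach used in the AWS Neuron sample notebooks is to persist the compiled NEFF artifacts to a cache directory, so that a later instantiation reuses them instead of recompiling. The environment-variable names and the from_pretrained signature below are assumptions taken from those samples and should be checked against your SDK version:

```python
import os

# Assumed caching env vars (seen in AWS Neuron sample notebooks; verify
# against your SDK release): compiled NEFFs are dumped to / reloaded from
# this directory across process restarts.
os.environ["NEURONX_CACHE"] = "on"
os.environ["NEURONX_DUMP_TO"] = "./gptneox_neuron_artifacts"

def load_gptneox(checkpoint_dir: str, tp_degree: int = 2):
    """Compile a GPT-NeoX model for Neuron, reusing cached artifacts if present."""
    from transformers_neuronx.gptneox.model import GPTNeoXForSampling

    model = GPTNeoXForSampling.from_pretrained(
        checkpoint_dir, tp_degree=tp_degree, amp="f16"
    )
    # The first run compiles and fills the cache directory; subsequent runs
    # load the cached NEFFs, so to_neuron() becomes much faster.
    model.to_neuron()
    return model
```

Note that this caches the compiled kernels, not the sharded weights, so the original checkpoint is still needed at load time.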
With AWS Neuron SDK 2.14.1, I have reproducible core dumps during inference for batch_size=4 with the llama2 7B model.
I am using the following configuration:
| | inf2.8xlarge | inf2.48xlarge |
|-------------|--------------|---------------|
| tp_degree | 2 | 24 |
| n_positions | 2048 | 2048 |
| amp | f16 | f16 |
The inference works fine if the input context is below 1024 tokens, but a crash happens with a longer context.
2023-Sep-28 15:05:08.0675 98833:99454 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:3)
2023-Sep-28 15:05:08.0675 98833:99454 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:4)
2023-Sep-28 15:05:08.0675 98833:99454 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2023-Sep-28 15:05:08.0675 98833:99453 ERROR TDRV:exec_consume_tpb_status_notifications 2023-Sep-28 15:05:08.0675 98833:99454 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2023-Sep-28 15:05:08.0675 98833:99454 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
Missing infer_status notification: (end:1)
2023-Sep-28 15:05:08.0675 98833:99453 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2023-Sep-28 15:05:08.0675 98833:99453 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:3)
2023-Sep-28 15:05:08.0675 98833:99453 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:4)
2023-Sep-28 15:05:08.0675 98833:99453 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2023-Sep-28 15:05:08.0676 98833:99454 ERROR TDRV:exec_core_dump Unable to find /opt/aws/neuron/lib/libndbg.so. Core dump will not be generated.
2023-Sep-28 15:05:08.0676 98833:99454 ERROR TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) execution timeout (30000 ms) on Neuron Device 0 NC 1, model /tmp/neuroncc_compile_workdir/a31f3317-fd60-4895-9b23-899233dd25c2/model.MODULE_402d21faaed2b99de995+360ecc97.neff, waiting for execution completion notification
2023-Sep-28 15:05:08.0676 98833:99453 ERROR TDRV:exec_core_dump Unable to find /opt/aws/neuron/lib/libndbg.so. Core dump will not be generated.
2023-Sep-28 15:05:08.0676 98833:99453 ERROR TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) execution timeout (30000 ms) on Neuron Device 0 NC 0, model /tmp/neuroncc_compile_workdir/a31f3317-fd60-4895-9b23-899233dd25c2/model.MODULE_402d21faaed2b99de995+360ecc97.neff, waiting for execution completion notification
2023-Sep-28 15:05:08.0676 98833:99454 ERROR NMGR:dlr_infer Inference completed with err: 5
2023-Sep-28 15:05:08.0676 98833:99453 ERROR NMGR:dlr_infer Inference completed with err: 5
terminate called after throwing an instance of 'c10::Error'
what(): nrt_execute status=5
Exception raised from task at /opt/workspace/KaenaPyTorchRuntime/neuron_op/ops/tensor.cpp:845 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f04062a5457 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f040626f3ec in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: neuron::task(void*) + 0x2e9 (0x7f03ad630c19 in /home/ubuntu/.local/lib/python3.8/site-packages/torch_neuronx/lib/libtorchneuron.so)
frame #3: <unknown function> + 0x8609 (0x7f047d521609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f047d65b133 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called recursively
Aborted (core dumped)
Being able to inject long contexts at inference time is a requirement when deploying an inference endpoint: a new KV cache has to be built by stacking new requests with partially generated sequences, so the length of the input context is driven by the length of the longest sequence.
I hope this is the right place to ask this question. Let me know if I need to move to another repo.
Currently I'm using NeuronModelForCausalLM, which uses LlamaForSampling under the hood.
I have a use case where I need to be able to do the following:
I am able to do steps 1 & 2 currently using the following:
from optimum.neuron import NeuronModelForCausalLM
llama_model = NeuronModelForCausalLM.from_pretrained('aws-neuron/Llama-2-7b-chat-hf-seqlen-2048-bs-1')
embedded_tokens = llama_model.model.chkpt_model.model.embed_tokens(token_ids)
### Code to modify embedded_tokens
However, as far as I can tell, generation with these modified embeddings is not possible with llama_model.generate().
When I pass the inputs_embeds keyword argument and set input_ids=None, I get the following:
ValueError: The following `model_kwargs` are not used by the model: ['inputs_embeds']
If this is not possible with the NeuronModelForCausalLM.generate() currently, is there a way to work around this manually? If so, could you provide an example?
Thanks very much for your help!
When can we expect support for the popular open-source Vicuna 13B model?
Following the example of HuggingFaceGenerationModelAdapter, I have created a NeuronModelForCausalLM adapter class for HuggingFace optimum-neuron (see huggingface/optimum-neuron#117).
I compared the inference times for batch sizes of 2 and 16, using either the GPT2ForSampling model sample() method or transformers generate().
I also added the original pytorch model inference time using transformers generate() as a reference.
| | neuron sample | neuron generate | pytorch generate |
|---|---|---|---|
| 128 tokens, batch_size 2 | 0.5 s | 0.9 s | 4.6 s |
| 128 tokens, batch_size 16 | 0.8 s | 2.7 s | 7 s |
Going from batch size 2 to 16, the neuron model latency using generate() is multiplied by 3, whereas the other two configurations see less than a 2x increase.
Is this expected?
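For context, the measurement above wraps the compiled model with the adapter roughly as sketched below. This is a hedged sketch: it requires Neuron hardware and the transformers-neuronx package to actually run, and the from_pretrained kwargs follow the sample notebooks of this era, so they may differ across SDK versions.

```python
from transformers import AutoConfig

def build_generate_wrapper(checkpoint_dir: str, batch_size: int = 2):
    """Wrap a compiled Neuron GPT-2 model so it exposes transformers' generate().

    Requires Neuron hardware; import paths and kwargs are taken from the
    transformers-neuronx samples and may vary by SDK version.
    """
    from transformers_neuronx.gpt2.model import GPT2ForSampling
    from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter

    neuron_model = GPT2ForSampling.from_pretrained(
        checkpoint_dir, batch_size=batch_size, tp_degree=2, amp="f16"
    )
    neuron_model.to_neuron()
    config = AutoConfig.from_pretrained(checkpoint_dir)
    # generate() runs extra Python-side sampling logic on top of the Neuron
    # kernels, which is one plausible source of the latency gap measured above.
    return HuggingFaceGenerationModelAdapter(config, neuron_model)
```

The returned adapter can then be called like any transformers model, e.g. wrapper.generate(input_ids, max_new_tokens=128).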