aws-neuron / transformers-neuronx
License: Apache License 2.0
When instantiating a model that has already been compiled, one can simply point to the serialized compiled artifacts to avoid recompiling the model.
However, before reaching that point, we must still instantiate the model from the original checkpoint.
This means that, unless I am mistaken, to completely serialize the model we need to store the weights twice: once in the checkpoint and once in the compiled artifacts.
Would it be possible to instantiate a pre-compiled model without the original checkpoint, cutting the storage requirements in half (or more, if the compiled artifacts use a lower precision)?
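To put rough numbers on the potential savings, a back-of-the-envelope sketch (my estimates, not measured sizes):

```python
# Back-of-the-envelope numbers (mine, not measured) for the doubled storage:
# a 7B-parameter model kept both as an fp16 checkpoint and as fp16 compiled
# artifacts stores every weight twice.
params = 7_000_000_000
bytes_per_param = 2                                  # fp16
checkpoint_gb = params * bytes_per_param / 1024**3   # ~13 GB
total_gb = 2 * checkpoint_gb                         # checkpoint + compiled artifacts
print(f"checkpoint ~{checkpoint_gb:.1f} GB, with compiled artifacts ~{total_gb:.1f} GB")
```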
I was previously able to compile llama 2 7B using tensor parallelism on 2 Neuron Cores, with the default n_positions=2048 and a batch_size=1. With transformers-neuronx==0.7.84 and neuronx-cc==2.10.0.34, I get the following error:
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/5c616a4d-b0cc-4c8d-8768-df08facd8aec/model.MODULE_875d0cfa
b1be718dcdb8+8737852b.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/5c616a4d-b0cc-4c8d-8768-df08facd8aec/model.MODULE_875d0cfab1be718dcdb8+8737852b.neff', '--model-type=transformer', '--
model-type=transformer', '--verbose=35']: 2023-09-18T14:15:06Z Too many instructions after unroll for function sg0000 !
I only managed to compile the model properly by setting batch_size=1 and n_positions=784. With that configuration, the device memory during inference in neuron-top is at 20.4 G (out of 16 G x 2 cores = 32 G).
I did another test, this time splitting the model on 24 Neuron Cores, and faced the same error. In that configuration, however, I managed to get up to n_positions=1536.
If I try to estimate the KV cache memory requirements for the 7B model, knowing that each token requires 2 x num_layers x hidden_size x byte_size = 2 x 32 x 4096 x 2 bytes = 512 KB, it gives n_positions x token_size = 2048 x 512 KB = 1 GB per batch.
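The estimate can be written out as a small helper (num_layers=32 and hidden_size=4096 are the llama 2 7B values; the leading factor of 2 covers the separate K and V caches, and byte_size=2 assumes fp16):

```python
# Back-of-the-envelope KV cache sizing for llama 2 7B in fp16.
def kv_cache_bytes(n_positions, batch_size=1, num_layers=32, hidden_size=4096, byte_size=2):
    per_token = 2 * num_layers * hidden_size * byte_size   # K and V for every layer
    return batch_size * n_positions * per_token

print(kv_cache_bytes(1) // 1024, "KB per token")                   # 512 KB
print(kv_cache_bytes(2048) // 1024**3, "GB for n_positions=2048")  # 1 GB
```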
Considering that:
As a final note, I face the same kind of errors when using the larger llama 2 13B model. I was previously able to compile and run it just fine on 24 Neuron Cores with n_positions=2048 and batch_size=2, but now I only manage to run it with n_positions=1024 and batch_size=1.
I tried generation with OPT-1.3B, and I am getting garbage output. What am I doing wrong?
Here are two code samples - one for Neuron (running on inf2.xlarge) and another for Nvidia - along with the outputs of both:
Neuron:
from transformers.models.opt import OPTForCausalLM
hf_model = OPTForCausalLM.from_pretrained('facebook/opt-1.3b', low_cpu_mem_usage=True)
from transformers_neuronx.module import save_pretrained_split
save_pretrained_split(hf_model, './opt-1.3b-f32-split')
from transformers_neuronx.opt.model import OPTForSampling
neuron_model = OPTForSampling.from_pretrained('./opt-1.3b-f32-split', batch_size=1, tp_degree=2, amp='f32', unroll=None)
neuron_model.to_neuron()
from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter
model = HuggingFaceGenerationModelAdapter(hf_model.config, neuron_model)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b')
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
text = "The quick brown fox"
encoded_input = tokenizer(text, return_tensors='pt', padding=True)
model.reset_generation()
sample_output = model.generate(
    input_ids=encoded_input.input_ids,
    attention_mask=encoded_input.attention_mask,
    do_sample=True,
    max_length=256,
    temperature=0.7,
)
print([tokenizer.decode(tok) for tok in sample_output])
Output:
['</s>The quick brown fox box box box\n\n\n\n\n\n6 arsor\n:\n\n(w a natural spring 20 18\n011161811.\n32.\n.\n########\n-\n1.\n1 of the best.\n.\n(-\n7\n-\n\n-\n####\n”\n--\n9 91911,\n”””\nThe next to the\nN.\n””\n#############\n1.########\nThe. ### ######\n-\n\n############################################################################################################################################################################################################################################################']
Nvidia:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
device = torch.device('cuda')
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
model.to(device)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
text = "The quick brown fox"
encoded_input = tokenizer(text, return_tensors='pt', padding=True)
encoded_input.to(device)
sample_output = model.generate(
    input_ids=encoded_input.input_ids,
    attention_mask=encoded_input.attention_mask,
    do_sample=True,
    max_length=256,
    temperature=0.7,
)
print([tokenizer.decode(tok) for tok in sample_output])
Output:
['</s>The quick brown fox jumps over the lazy dog\nthe lazy fox jumps over the quick brown fox</s>']
Hi AWS Team,
Very cool project and I am looking forward to using it! Are you planning to add support for MPT-based models?
Thanks!
I understand the llama model is still a 'prototype', so it is expected to have some issues when running it.
I am nevertheless creating this issue in the hope that my tests will help fix them.
I am using openlm-research/open_llama_3b for simple text generation. I am using the transformers-neuronx main branch. The neuronx-cc compiler version is 2.8.0.25+a3ad0f342. The model is compiled with batch_size = 1, amp = 'f16', tp_degree = 2. I use transformers.set_seed(42) to get reproducible results, always with the same "One of my fondest memory is" prompt.
If I generate only 128 tokens, I get something that looks ok:
'<s> One of my fondest memory is my school time. Before I went to college, I was studying at Kofu Municipal Technical
School in the north part of Yamanashi. It was a small school (about 300 students) but it was very cozy. Not just the school
building, but the area where the school was situated was very calm and I was very happy living in that area.\nMy father
was the principal of the school. I was a senior high school student and I got married when I was 19 years old. My father
died when I was 35 years old. I stayed in'
But if I increase the number of generated tokens, the output tokens are duplicated or corrupted (here starting from token 133):
'<s> One of my fondest memory is my school time. Before I went to college, I was studying at Kofu Municipal Technical
School in the north part of Yamanashi. It was a small school (about 300 students) but it was very cozy. Not just the school
building, but the area where the school was situated was very calm and I was very happy living in that area.\nMy father
was the principal of the school. I was a senior high school student and I got married when I was 19 years old. My father
died when I was 35 years old. I stayed in that area until Iw at his office from 39 at least 568833 years later on this blog,
I I’ The school'
The more tokens I generate, the worse it gets.
I tried the latest version of transformers-neuronx with openlm-research/open_llama_3b, and while the outputs are fine with batch sizes 1 and 2, I get gibberish output with batch size 3.
from transformers import AutoModelForCausalLM
from transformers_neuronx.module import save_pretrained_split
model = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_3b")
save_pretrained_split(model, './llama_split')
import os
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling
os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"
batch_size = 3
neuron_model = LlamaForSampling.from_pretrained("./llama_split", tp_degree=2, batch_size=batch_size, amp='f16')
neuron_model.to_neuron()
# construct a tokenizer and encode prompt text
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
prompts = ["My name is David and"] * batch_size
tokens = tokenizer(prompts, return_tensors="pt")
# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(tokens.input_ids, sequence_length=128, top_k=50)
    elapsed = time.time() - start
generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')
Projects like vLLM help optimize model serving throughput. I was wondering whether implementing PagedAttention, or integrating with vLLM, is on your roadmap to improve using the Inf2 processors in production?
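For context, here is a toy, pure-Python sketch of the block-table indirection PagedAttention relies on; it is an illustration of the idea only, not vLLM's implementation:

```python
# Toy sketch of PagedAttention's block-table indirection: the KV cache is
# carved into fixed-size physical blocks, and each sequence keeps a table
# mapping its logical token positions to physical blocks, so memory is
# allocated on demand instead of reserved up front for the full n_positions.
class PagedKVCache:
    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical blocks not yet in use
        self.tables = {}                             # seq_id -> list of physical block ids
        self.lengths = {}                            # seq_id -> number of tokens cached

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:                 # last block full (or no block yet)
            self.tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id, pos):
        # Translate a logical position into a slot inside a physical block
        block = self.tables[seq_id][pos // self.block_size]
        return block * self.block_size + pos % self.block_size

cache = PagedKVCache(num_blocks=8)
for _ in range(6):
    cache.append_token("seq0")
# 6 tokens with block_size=4 only pin down 2 physical blocks
assert len(cache.tables["seq0"]) == 2
```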
Relevant TF PR - huggingface/transformers#27064
save_split calls model.save_pretrained(save_directory, save_function=save_split, max_shard_size='10000GB'). Since safetensors serialization now defaults to true, save_pretrained will not call the save_function.
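A minimal pure-Python mock of that dispatch (names and structure are illustrative; the real logic lives in transformers' PreTrainedModel.save_pretrained) shows why the save_function is bypassed when safetensors is the default, and that passing safe_serialization=False restores the legacy path:

```python
# Hypothetical mock of the save_pretrained dispatch; not the real transformers code.
def mock_save_pretrained(save_directory, save_function=None, safe_serialization=True):
    if safe_serialization:
        # safetensors path: the custom save_function is never invoked
        return "safetensors"
    # legacy torch path: serialization goes through save_function
    save_function({}, f"{save_directory}/pytorch_model.bin")
    return "save_function"

calls = []
def save_split(state_dict, path):
    calls.append(path)

# Default (safe_serialization=True): save_split is silently skipped
assert mock_save_pretrained("./out", save_function=save_split) == "safetensors"
assert calls == []

# Explicitly disabling safetensors routes through save_split again
assert mock_save_pretrained("./out", save_function=save_split, safe_serialization=False) == "save_function"
assert calls == ["./out/pytorch_model.bin"]
```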
You can reproduce this by following https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb using the latest transformers package.
I attempted to use this model on an inf2.24xlarge. The model is based on the GPT-J architecture, but when I run it on Neuron, the results differ greatly from those on a GPU-based system: completely meaningless words are output. It works fine on GPU.
Below is the compilation code:
from transformers.models.auto import AutoModelForCausalLM
import torch
from transformers_neuronx.module import save_pretrained_split
hf_model = AutoModelForCausalLM.from_pretrained('PygmalionAI/pygmalion-6b', low_cpu_mem_usage=True)
def amp_callback(model, dtype):
    for block in model.transformer.h:
        block.attn.to(dtype)
        block.mlp.to(dtype)
    model.lm_head.to(dtype)
amp_callback(hf_model, torch.float16)
save_pretrained_split(hf_model, './pygmalion-6b-split')
Below is the inference code:
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.gptj.model import GPTJForSampling
neuron_model = GPTJForSampling.from_pretrained('./pygmalion-6b-split', n_positions=1024, batch_size=1, tp_degree=8, amp='f16')
neuron_model.to_neuron()
# construct a tokenizer and encode prompt text
tokenizer = AutoTokenizer.from_pretrained('PygmalionAI/pygmalion-6b')
batch_prompts = [
"Jihye's Persona: A 22-year-old woman working part-time at a convenience store in Seoul.\n<START>\nYou: ...\nJihye: Welcome, man.\nYou: hello?\nJihye: ",]
input_ids = torch.as_tensor([tokenizer.encode(text) for text in batch_prompts])
with torch.inference_mode():
    # warmup
    generated_sequences = neuron_model.sample(input_ids, sequence_length=1024)
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=1024)
    elapsed = time.time() - start
generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')
Environment:
AMI: Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230720
VENV: aws_neuron_venv_pytorch
In the current version, transformers-neuronx models can only be instantiated from a directory where the Hugging Face checkpoint has been split into multiple files.
This raises two major issues:
After executing pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com, !pip show transformers-neuronx gives:
Name: transformers-neuronx
Version: 0.6.106
Summary: UNKNOWN
Home-page: UNKNOWN
Author:
Author-email:
License: UNKNOWN
Location: /home/ubuntu/.local/lib/python3.10/site-packages
Requires: accelerate, torch-neuronx, transformers
Required-by:
Hi team,
It would be great if this project supported BART.
What work needs to be done to add this support?
I noticed /usr/local/bin/neuronx-cc is always called with --target=trn1, even on my inf2 machines.
I'm evaluating non-GPU machines; will this have any speed or quality effect when evaluating inf2 machines?
I tried to run the gpt-neox demo CLI @0.4.60 on an inf2.24xlarge instance.
I first saved the model with float16 precision:
$ gptneox_demo --amp f16 save gpt-neox-20b
When trying a conversion and inference, I ran out of memory:
$ gptneox_demo --amp f16 run --batch_size 1 gpt-neox-20b
[tokenizer download progress elided]
running GPTNeoXForSampling.from_pretrained
/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py:40: UserWarning: hidden_act="gelu_fast" ignored in favor of hidden_act="gelu_new"
warnings.warn(f'hidden_act="{self.config.activation_function}" ignored in favor of hidden_act="gelu_new"')
running model.to_neuron
[compiler progress output elided: allocation selection, dependency analysis of Block1, dependency reduction of sg0000]
Compiler status PASS
2023-Jun-23 09:11:18.0735 6737:6737 ERROR TDRV:dmem_alloc_internal Failed to alloc DEVICE memory: 150994944
2023-Jun-23 09:11:18.0738 6737:6737 ERROR TDRV:dml_dump Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_0_649561b6.csv
2023-Jun-23 09:11:18.0738 6737:6737 ERROR TDRV:log_dev_mem Failed to allocate 144.000MB (usage: tensors) on ND 0:NC 0, current utilization:
* total: 15.951GB
* tensors: 15.951GB
* runtime: 1.062KB
* dma rings: 32.000KB
2023-Jun-23 09:11:18.0738 6737:6737 ERROR TDRV:tensor_allocate Failed to allocate 150994944 bytes on DEVICE for tensor UNKNOWN.
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/gptneox_demo", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/demo.py", line 28, in main
demo('EleutherAI/gpt-neox-20b', GPTNeoXForSampling, amp_callback)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 61, in demo
run(args, model_name, model_cls)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 105, in run
model.to_neuron()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py", line 71, in to_neuron
block.to_neuron(n_positions_list)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py", line 285, in to_neuron
self.mlp_out_weight = shard_along(mlp.dense_4h_to_h.weight.detach().T, dim=0)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/parallel.py", line 109, in shard_along
return ops.parallel_to_nc(self.shard_along_on_cpu(tensor, dim))
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/ops.py", line 49, in parallel_to_nc
return torch.ops.neuron._parallel_to_neuron(tensors)
File "/usr/local/lib/python3.8/dist-packages/torch/_ops.py", line 442, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: nrt_tensor_allocate status=4
This was somewhat expected, as by default the CLI uses only two Neuron Cores and the model is quite big (20B parameters).
By increasing the number of Neuron Cores, I was able to run the model, but the result is garbage:
$ gptneox_demo --amp f16 run --batch_size 1 --tp_degree 4 gpt-neox-20b
running GPTNeoXForSampling.from_pretrained
/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py:40: UserWarning: hidden_act="gelu_fast" ignored in favor of hidden_act="gelu_new"
warnings.warn(f'hidden_act="{self.config.activation_function}" ignored in favor of hidden_act="gelu_new"')
running model.to_neuron
[compiler progress output elided: allocation selection, dependency analysis of Block1, dependency reduction of sg0000]
Compiler status PASS
2023-Jun-23 09:16:14.0528 7163:7163 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-Jun-23 09:16:14.0528 7163:7163 [0] init.cc:99 CCOM WARN OFI plugin initNet() failed is EFA enabled?
running model.sample
generated_sequence= tensor([[12092, 13, 309, 1353, 247, 3448, 1566, 13, 29589, 22702,
8822, 22702, 42010, 22702, 22702, 8834, 42010, 42010, 42010, 42010,
42010, 22702, 42010, 22702, 22702, 42010, 29589, 42010, 22702, 42010,
42010, 42010, 42010, 22702, 8822, 22702, 22702, 42010, 42010, 22702,
42010, 42010, 42010, 42010, 42010, 22702, 22702, 8834, 42010, 42010,
42010, 42010, 22702, 42010, 22702, 42010, 29589, 42010, 22702, 42010,
42010, 22702, 42010, 42010, 42010, 42010, 42010, 22702, 22702, 42010,
42010, 42010, 42010, 42010, 22702, 42010, 42010, 42010, 42010, 22702,
42010, 42010, 42010, 22702, 42010, 22702, 42010, 42010, 22702, 42010,
42010, 42010, 22702, 42010, 42010, 42010, 42010, 22702, 42010, 42010,
22702, 42010, 22702, 42010, 42010, 42010, 42010, 8822, 8828, 22702,
42010, 29589, 42010, 22702, 42010, 42010, 42010, 42010, 22702, 42010,
42010, 8828, 42010, 22702, 29589, 29589, 8828, 22702]])
["Hello, I'm a language model,blockList errnoErramssymb errnoErr BytePtrFromString errnoErr errnoErramsfonts BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString errnoErr errnoErr BytePtrFromStringblockList BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErramssymb errnoErr errnoErr BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr errnoErramsfonts BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString errnoErr BytePtrFromStringblockList BytePtrFromString errnoErr BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString errnoErr BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromString errnoErr BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromStringamssymbmathrsfs errnoErr BytePtrFromStringblockList BytePtrFromString errnoErr BytePtrFromString BytePtrFromString BytePtrFromString BytePtrFromString errnoErr BytePtrFromString BytePtrFromStringmathrsfs BytePtrFromString errnoErrblockListblockListmathrsfs errnoErr"]
I've been following the https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb example. However, I ran into an issue using a modified version of Llama made for MiniGPT4.
I'm running on a Inf2.8xlarge with "AMI Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20231205".
I updated to the latest Neuron version via python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.0.* torchvision
Here's my code to compile. This finishes properly.
from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained('wangrongsheng/MiniGPT-4-LLaMA-7B')
import torch
from transformers_neuronx.module import save_pretrained_split
save_pretrained_split(model, './MiniGPT-4-LLaMA-7b-split')
I then attempt to run it with the following code:
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling
from minigpt4.models.modeling_llama import LlamaForCausalLM
import os
# Compiler flag -O1 is a workaround for “Too many instructions after unroll” in SDK 2.14
# os.environ['NEURON_CC_FLAGS'] = '-O1'
# load meta-llama/Llama-2-13b to the NeuronCores with 24-way tensor parallelism and run compilation
neuron_model = LlamaForSampling.from_pretrained('./MiniGPT-4-LLaMA-7b-split', batch_size=1, tp_degree=2, amp='f16')
neuron_model.to_neuron()
# construct a tokenizer and encode prompt text
tokenizer = AutoTokenizer.from_pretrained('wangrongsheng/MiniGPT-4-LLaMA-7B')
prompt = "Hello, I'm a language model,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    elapsed = time.time() - start
generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')
And I get this error:
Traceback (most recent call last):
File "run.py", line 12, in <module>
neuron_model = LlamaForSampling.from_pretrained('./MiniGPT-4-LLaMA-7b-split', batch_size=1, tp_degree=2, amp='f16')
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/module.py", line 145, in from_pretrained
state_dict_path = os.path.join(pretrained_model_path, 'pytorch_model.bin')
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/serialization.py", line 771, in load
with _open_file_like(f, 'rb') as opened_file:
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/serialization.py", line 270, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/serialization.py", line 251, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './MiniGPT-4-LLaMA-7b-split/pytorch_model.bin'
These are the files in ./MiniGPT-4-LLaMA-7b-split:
config.json generation_config.json model.safetensors
Any help or direction would be stellar! Thanks.
Request to support new popular model Mixtral
I am trying to skip generating tokens that can simply be copied over, hoping to speed up generation by about 70% for my use case. The problem I am running into is that when I reset caching, the overhead takes too much time; when I maintain caching, the probabilities seem to not be totally wrong.
The main goal, in the end, is constrained generation that saves time whenever there is only one possible next token to generate.
Here is my current code to reproduce this:
def convert_to_tree(sequences):
    tree = {}
    for sequence in sequences:
        sequence_ids = tokenizer.encode(sequence, add_special_tokens=False)
        current_tree = tree
        for token in sequence_ids:
            if token not in current_tree:
                current_tree[token] = {
                    "token_string": tokenizer.decode([token]),
                    "dangling": True,
                    "tree": {}
                }
            else:
                # Update the dangling value if this token appears more than once
                current_tree[token]["dangling"] = False
            current_tree = current_tree[token]["tree"]
    return tree
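The builder above can be exercised standalone with a stub tokenizer (one "token" per character, purely for illustration; the real code uses the llama tokenizer) to show the tree shape it produces:

```python
# StubTokenizer is a hypothetical stand-in so the example runs without transformers.
class StubTokenizer:
    def encode(self, text, add_special_tokens=False):
        return [ord(c) for c in text]
    def decode(self, tokens):
        return "".join(chr(t) for t in tokens)

tokenizer = StubTokenizer()

def convert_to_tree(sequences):
    tree = {}
    for sequence in sequences:
        sequence_ids = tokenizer.encode(sequence, add_special_tokens=False)
        current_tree = tree
        for token in sequence_ids:
            if token not in current_tree:
                current_tree[token] = {
                    "token_string": tokenizer.decode([token]),
                    "dangling": True,
                    "tree": {},
                }
            else:
                # a token shared by several sequences is no longer "dangling"
                current_tree[token]["dangling"] = False
            current_tree = current_tree[token]["tree"]
    return tree

tree = convert_to_tree(["ab", "ac"])
assert set(tree) == {ord("a")}                      # shared one-token prefix
assert tree[ord("a")]["dangling"] is False          # "a" appears in both sequences
assert set(tree[ord("a")]["tree"]) == {ord("b"), ord("c")}
```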
tree = convert_to_tree([" a business man who"])
from transformers_neuronx import bucket, utils
def sample(self, input_ids, sequence_length, start_ids=None,
           top_k=50, top_p=1.0, eos_token_override=None, temperature=1.0, streamer=None, c_tree=None):
    # To enable optimized context encoding network, we must pad
    # up to the context length estimate or we will not correctly
    # select the final context logits (See: layers/transformer.py).
    # This also means we need to shift the start_ids over to correct
    # for padding.
    offset = 0
    batch_size, context_length = input_ids.shape
    prefixed_length = self.prefixed_length
    if context_length < prefixed_length:
        self.prefixed_length = 0
    else:
        input_ids = input_ids[:, prefixed_length:]
        context_length -= prefixed_length
        sequence_length -= prefixed_length
    estimate = bucket.find(self.context_buckets, context_length)
    if estimate:
        if context_length < estimate:
            input_ids = utils.pad(input_ids, 1, estimate, left=True)
            offset = estimate - context_length
            if not prefixed_length:
                if start_ids is None:
                    start_ids = torch.zeros(batch_size, dtype=torch.int32)
                start_ids += offset
            sequence_length += offset
    # Sequence length cannot be greater than n_positions
    sequence_length = min(sequence_length, self.max_positions)
    result = sample_llama(
        self, input_ids, start_ids, sequence_length,
        eos_token_id=self.config.eos_token_id if eos_token_override is None else eos_token_override,
        top_k=top_k, top_p=top_p, temperature=temperature, streamer=streamer, c_tree=c_tree
    )
    if offset != 0:
        result = result[:, offset:]
    return result
from transformers_neuronx.sampling import validate_top_k_top_p_min_tokens_to_keep, top_k_top_p_filtering

@torch.no_grad()
def sample_llama(model, input_ids, start_ids, sequence_length, eos_token_id=2, top_k=50, top_p=1.0,
                 temperature=1.0, streamer=None, c_tree=None):
    # validate_top_k_top_p_min_tokens_to_keep(top_k, top_p, None)
    # populate key/value caches according to the prompt text
    _, start = input_ids.shape
    cache_ids = torch.arange(start, dtype=torch.int32)
    next_token_scores = model(input_ids, cache_ids, start_ids)
    return sample_loop_llama(
        model, input_ids, start_ids, next_token_scores, sequence_length, eos_token_id,
        top_k, top_p, temperature, streamer, c_tree
    )
# test if caching is working by turning off restricted generation after 3 tokens were added manually
next_token_scores = None
def sample_loop_llama(model, input_ids, start_ids, next_token_scores, sequence_length, eos_token_id=2,
                      top_k=50, top_p=1.0, temperature=1.0, streamer=None, c_tree=None):
    validate_top_k_top_p_min_tokens_to_keep(top_k, top_p, None)
    if not isinstance(temperature, float) or not (temperature > 0):
        raise ValueError('temperature has to be a strictly positive float.')
    # Flags, one per sequence in a batch, to indicate if a sequence hit eos_token_id
    done_flags = torch.full((input_ids.size(dim=0), 1), False)
    tokens = [input_ids]
    _, start = input_ids.shape
    cache_ids = torch.arange(start, dtype=torch.int32)
    next_token_scores = model(input_ids, cache_ids, start_ids)
    print("inputs")
    print((input_ids, input_ids.shape))
    print("cache_ids")
    print((cache_ids, cache_ids.shape))
    tokens_tmp = []
    cache_ids_temp = []
    for cur_len in range(start, sequence_length):
        next_len = cur_len + 1
        # top_values, top_indices = top_k_top_p_filtering(next_token_scores, top_k=top_k, top_p=top_p)
        top_indices = list(c_tree.keys())
        if len(top_indices) == 1:
            # skip next_token_scores because there is only one possible token
            inputs = top_indices[0]
            inputs = torch.reshape(torch.tensor(inputs), (1, 1))
            done_flags = torch.logical_or(done_flags, inputs == eos_token_id)
            token = torch.where(done_flags.eq(True), eos_token_id, inputs)
            tokens.append(token)
            if streamer is not None and hasattr(streamer, 'response_with_prefix') and streamer.response_with_prefix:
                streamer.put(torch.cat(tokens, dim=-1))
            elif streamer:
                streamer.put(token)
            c_tree = c_tree[top_indices[0]]['tree']
            if len(list(c_tree.keys())) == 0:
                pass  # break
            # forward pass to get next token
            cache_ids_temp.append(cur_len)
            tokens_tmp.append(token)
            # TODO: assign token to cache_ids
        elif len(top_indices) == 0:
            cache_ids = torch.as_tensor(cache_ids_temp, dtype=torch.int32)
            tokens_pt = torch.as_tensor([tokens_tmp], dtype=torch.int32)
            print("inputs")
            print((tokens_pt, tokens_pt.shape))
            print("cache_ids")
            print((cache_ids, cache_ids.shape))
            if len(tokens_tmp) != 0:  # header condition only
                next_token_scores = model(tokens_pt, cache_ids, start_ids)
            cache_ids_temp = []
            tokens_tmp = []
            # This whole block would make it constrained generation, but it was commented out
            # to make sure probabilities for random generation are working correctly
            # top_values = next_token_scores[0][top_indices]
            # top_value = torch.argmax(top_values)
            # inputs = top_indices[top_value]
            # c_tree = c_tree[inputs]['tree']
            # inputs = torch.reshape(torch.tensor(inputs), (1, 1))
            # # Update done flags.
            # done_flags = torch.logical_or(done_flags, inputs == eos_token_id)
            # # Update token id to be eos_token_id if the corresponding done flag is True. For a batch,
            # # this means that, while every sequence in the batch has the same length, a sequence that
            # # encounters eos_token_id earlier will be filled with eos_token_ids post the first appearance
            # # of eos_token_id.
            # token = torch.where(done_flags.eq(True), eos_token_id, inputs)
            # tokens.append(token)
            if temperature != 1.0:
                next_token_scores /= temperature
            top_values, top_indices = top_k_top_p_filtering(next_token_scores, top_k=top_k, top_p=top_p)
            # sample
            probs = torch.nn.functional.softmax(top_values, dim=-1)
            inputs_in_topk = torch.multinomial(probs, num_samples=1, replacement=True)
            inputs = torch.gather(top_indices, 1, inputs_in_topk)
            done_flags = torch.logical_or(done_flags, inputs == eos_token_id)
            token = torch.where(done_flags.eq(True), eos_token_id, inputs)
            tokens.append(token)
            if streamer is not None and hasattr(streamer, 'response_with_prefix') and streamer.response_with_prefix:
                streamer.put(torch.cat(tokens, dim=-1))
            elif streamer:
                streamer.put(token)
            if len(list(c_tree.keys())) == 0:
                pass  # break
            cache_ids_temp.append(cur_len)
            tokens_tmp.append(token)
            # if next_len >= sequence_length or done_flags.all():
            #     break
        # forward pass to get next token
        # add multiple models to merge multiple scores
    if streamer:
        streamer.end()
    return torch.cat(tokens, dim=-1)
# run inference with top-k sampling
print(len(input_ids[0]))
with torch.inference_mode():
    start = time.time()
    # print(neuron_model.forward(input_ids))
    # generated_sequences = neuron_model.sample(input_ids, sequence_length=len(input_ids[0]) + 100)
    generated_sequences = sample(neuron_model, input_ids, sequence_length=len(input_ids[0]) + 15, temperature=.8, c_tree=tree)
    elapsed = time.time() - start
nl = "\n\n\n\n"
generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {nl.join(generated_sequences)} in {elapsed} seconds')
The key logs are below. I thought I was doing everything right, but it still produced wrong results:
inputs ### new inputs with 5 extra cache_ids
(tensor([[29871, 263, 5381, 767, 1058]], dtype=torch.int32), torch.Size([1, 5]))
cache_ids
(tensor([800, 801, 802, 803, 804], dtype=torch.int32), torch.Size([5]))
#I expect normal generation afterwards
inputs
(tensor([[29949]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([805], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[259]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([806], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[259]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([807], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[903]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([808], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[386]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([809], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[29899]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([810], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[29871]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([811], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[259]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([812], dtype=torch.int32), torch.Size([1]))
inputs
(tensor([[29955]], dtype=torch.int32), torch.Size([1, 1]))
cache_ids
(tensor([813], dtype=torch.int32), torch.Size([1]))
generated sequences <s> At the far end of town where the Gricklegrass grows and the wind smells slowandsour when it blows and no
birds ever sing excepting old crows is the Street of the Lifted Lorax
And deep in the Gricklegrass some people say if you look deep enough you can still see today where the
Lorax once stood just as long as it could before somebody lifted the Lorax away
What was the Lorax Any why was it there And why was it lifted and taken somewhere from the far end of
town where the Gricklegrass grows The old Onceler still lives here
Ask him he knows
This story is about a business man whoO _th- 7 in 1.5860817432403564 seconds
The key detail in the logs is that on the next model logits call I add 5 extra tokens along with 5 extra cache_ids, correctly ordered. I thought I did it correctly, but the subsequent normal generation produced garbage.
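For what it's worth, the contiguous cache_ids for the multi-token call can be built as below; `make_cache_ids` is just an illustrative helper, not part of the library:

```python
import torch

def make_cache_ids(cur_len: int, num_new_tokens: int) -> torch.Tensor:
    # Positions cur_len .. cur_len + num_new_tokens - 1: the KV-cache slots
    # that the extra input tokens should be written to, in order.
    return torch.arange(cur_len, cur_len + num_new_tokens, dtype=torch.int32)

print(make_cache_ids(800, 5))  # tensor([800, 801, 802, 803, 804], dtype=torch.int32)
```

This matches the `cache_ids` tensor shown in the first log entry above.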
I am running a pretraining job following https://github.com/aws-neuron/aws-neuron-parallelcluster-samples/blob/master/examples/jobs/neuronx-nemo-megatron-llamav2-job.md
And the script works smoothly. However, I don't know where the trained model is saved. I have set
model.save_xser = True
in https://github.com/aws-neuron/neuronx-nemo-megatron/blob/da1fb6643838e01c9110723bb4190081b4a249b0/nemo/examples/nlp/language_modeling/test_llama.sh
But I still cannot find the trained model anywhere.
What is more, I am not sure whether the script loads the parameters from the model.tokenizer.type folder to initialize the llama model.
It seems like the example script uses randomly initialized parameters for training, so what should I do if I want to initialize the model with pretrained parameters? Must the parameter file be a ckpt file?
Looking at the code of the `top_k_top_p_filtering` method in `sampling.py`, I am wondering if the algorithm for applying the top-p filtering is correct.
Unlike the `transformers` implementation, the algorithm performs a cumulative sum on logits probabilities sorted in descending order, which seems to lead to a different selection.
Example:
scores = [0.1, 0.2, 0.7]
top_p = 0.8
top_p_scores = [0.2, 0.7]

`transformers` algorithm:
sorted_scores = [0.1, 0.2, 0.7]
cum_sum = [0.1, 0.3, 1.0]
top_p_scores = sorted_scores[cum_sum > (1 - top_p)] = [0.2, 0.7] <- correct

`transformers-neuronx` algorithm:
sorted_scores = [0.7, 0.2, 0.1]
cum_sum = [0.7, 0.9, 1.0]
top_p_scores = sorted_scores[cum_sum <= top_p] = [0.7] <- incorrect
I checked the result by crafting a sample:
>>> import torch
>>> from transformers_neuronx.sampling import top_k_top_p_filtering
>>> probs = torch.tensor([[0.1, 0.2, 0.7]])
>>> logits = torch.log(probs)
>>> top_p_logits, top_p_indices = top_k_top_p_filtering(logits, top_k=None, top_p=0.8)
>>> print(top_p_indices)
tensor([[2]])
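For comparison, here is a minimal plain-torch sketch of the descending-sort convention that keeps the token crossing the top-p threshold. This is illustrative only, not the library code:

```python
import torch

def top_p_filter(probs: torch.Tensor, top_p: float) -> torch.Tensor:
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cum = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens that come after the cumulative sum has already reached top_p;
    # subtracting sorted_probs keeps the token that crosses the threshold.
    remove = (cum - sorted_probs) > top_p
    sorted_probs = sorted_probs.masked_fill(remove, 0.0)
    out = torch.zeros_like(probs)
    return out.scatter_(-1, sorted_idx, sorted_probs)

probs = torch.tensor([[0.1, 0.2, 0.7]])
print(top_p_filter(probs, 0.8))  # keeps 0.2 and 0.7, matching the expected selection
```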
Hi team, I was recently playing with Llama 2 on inf2.24xlarge instance. The AMI I am currently using is: AWS Deep Learning AMI Neuron PyTorch 1.13 (Amazon Linux 2).
The Llama2 model is getting compiled and running successfully. While running the model, I also observed the instance performance using neuron-top, and the model seems to be utilizing all the neuron-cores from all neuron-core-devices.
But, while compiling the model, I get the following message:
2023-Sep-05 13:08:49.0371 20080:29631 [0] init.cc:97 CCOM WARN Linux kernel 5.10 requires setting FI_EFA_FORK_SAFE=1 environment variable. Multi-node support will be disabled.
Please restart with FI_EFA_FORK_SAFE=1 set.
Can someone please guide me in understanding whether this is a concern and, if so, how to resolve it?
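The warning itself suggests the workaround; one way to apply it (assuming you launch the compilation from a shell) is:

```shell
# Export the variable so it is inherited by the compilation/inference process
export FI_EFA_FORK_SAFE=1

# Verify that child processes see it
python3 -c 'import os; print(os.environ["FI_EFA_FORK_SAFE"])'
```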
When trying to compile llama 7B
with:
I get these errors (multiple instances):
2023-Nov-20 08:13:58.361948 4241:4538 ERROR NMGR:dlr_kelf_load Failed to load mlaop
2023-Nov-20 08:13:58.361956 4241:4538 ERROR NMGR:load_kelf_graphs Failed to load KELF kelf-0.json
2023-Nov-20 08:13:58.677772 4241:4551 ERROR NEFF:json_parse_load_elements Unable to parse: sg00/Activation.json - 1
2023-Nov-20 08:13:58.677837 4241:4551 ERROR NEFF:json_parse_load_elements File sg00/Activation.json size (4375834152) exceeds json parser maximum (4294967295)
2023-Nov-20 08:13:58.677857 4241:4551 ERROR NEFF:construct_kbin Failed to load subgraph sg00/def.json
2023-Nov-20 08:13:58.679344 4241:4551 ERROR NEFF:kelf_load Failed to load subgraph 0
2023-Nov-20 08:13:58.679362 4241:4551 ERROR NMGR:dlr_kelf_load Failed to load mlaop
2023-Nov-20 08:13:58.679371 4241:4551 ERROR NMGR:load_kelf_graphs Failed to load KELF kelf-0.json
2023-Nov-20 08:13:59.027739 4241:4538 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/ubuntu/neuroncc_compile_workdir/d7686f21-27f8-48c8-8ad6-241f83f4e865/model.MODULE_ddc3bb0a8f815a1d05f6+8737852b.
neff, err: 2
2023-Nov-20 08:13:59.084440 4241:4551 ERROR NMGR:kmgr_load_nn_post_metrics Failed to load NN: /tmp/ubuntu/neuroncc_compile_workdir/d7686f21-27f8-48c8-8ad6-241f83f4e865/model.MODULE_ddc3bb0a8f815a1d05f6+8737852b.
neff, err: 2
2023-Nov-20 08:13:59.107413 4241:4547 ERROR NEFF:json_parse_load_elements Unable to parse: sg00/Activation.json - 1
I am using the `optimum-neuron` example script with the following command:
python examples/text-generation/generation.py export meta-llama/Llama-2-7b-chat-hf --batch_size 8 --sequence_length 2048 --num_cores 24 --auto_cast_type fp16
Hi,
according to this blog https://huggingface.co/blog/inferentia-llama2
it seems the expected latency is about 60 ms/token when running inference for llama-2 on inf2.xlarge.
I do get these results when running llama2 on an inf2.xlarge.
I have also tested running codellama (fine-tuned from llama2) on an inf2.xlarge and I'm getting about 50 to 60 ms/token.
When I run codellama on a g5.xlarge I get 30 ms/token, faster than on an inf2.xlarge.
Is this expected?
Thank you
Looking at the source code of the latest package (0.60.106), it seems that the precompiled artifacts are not properly reloaded when `to_neuron()` is called.
To be more specific, it seems that the model contains a `main` head and several `context` heads.
Note: my understanding is that the `main` head is used to generate new tokens while the `context` heads are all in charge of the encoding of a subset of the context. Please correct me if I am wrong.
The precompiled artifacts are correctly reloaded for the `main` head, but not for the `context` heads, because the path to the precompiled artifacts is not passed down to them.
The following change in `decoder.py` seems to fix the issue:
def build_weight_shared(self, n_positions_list=None, n_active_tokens=None, batch_size=None,
unroll=None, share_caches=False):
...
new = DecoderLmHeadForSamplingNoEmbedding(
self.tp_degree, n_positions_list, n_active_tokens, batch_size, self.attention_head_size,
self.amp, self.num_layers, unroll, neuron_config=self.neuron_config, allow_pad=self.allow_pad
)
+ new.compiler_artifacts_path = self.compiler_artifacts_path
new.add_inputs_builder(self.inputs_builder)
I'm running this sample code (https://github.com/aws-neuron/transformers-neuronx#hugging-face-generate-api-support) using GPT-NeoX on an inf2.24xlarge instance, but the `model.generate` method kills the kernel on Jupyter. I am using padding and truncation in the tokenizer, and this fails for both single and double input sequences (texts). The batch size is 2.
I usually install `transformers_neuronx` from git, and since the last commit says that it was updated for SDK release 2.12, I assumed it was the same version available from GitHub. However, running `gpt2_demo` with the GitHub version breaks:
(aws_neuron_venv_pytorch) ubuntu@ip-172-31-40-142:~$ gpt2_demo run gpt2-small
running GPT2ForSampling.from_pretrained
running model.to_neuron
....
Compiler status PASS
Traceback (most recent call last):
File "/opt/aws_neuron_venv_pytorch/bin/gpt2_demo", line 8, in <module>
sys.exit(main())
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/gpt2/demo.py", line 20, in main
demo('gpt2', GPT2ForSampling, amp_callback)
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 61, in demo
run(args, model_name, model_cls)
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 105, in run
model.to_neuron()
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/gpt2/model.py", line 117, in to_neuron
self.decoder_lm_head.to_neuron()
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 121, in to_neuron
self.program.setup(self.layers, ln_lm_head_params)
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 872, in setup
super().setup(layers, ln_lm_head_params)
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 827, in setup
kernel.load()
File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 376, in load
self.model = torch.classes.neuron.ParallelModel(self.neff_bytes, self.tp_degree, self.g_start_device_id, self.g_device_count)
RuntimeError: __init__() expected at most 3 argument(s) but received 5 argument(s). Declaration: __init__(__torch__.torch.classes.neuron.ParallelModel _0, str _1, int _2) -> NoneType _0
I `diff`ed the latest version of the wheel (https://pip.repos.neuron.amazonaws.com/transformers-neuronx/transformers_neuronx-0.5.58-py3-none-any.whl) against what's in git, and it seems like git has many extra changes, so now I'm wondering if the pip wheel is outdated, or if they have diverged somehow.
I am trying to run a fine-tuned version of Llama 2 on inf2, but keep getting an `AssertionError: Try to load with neff bytes as None, might due to compilation failure`.
I used an inf2.8xlarge instance with the Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230817, all upgraded as described here.
Here's the full log:
(aws_neuron_venv_pytorch) $ python
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import LlamaForCausalLM
>>>
>>> model = LlamaForCausalLM.from_pretrained('llama-2-7b-hf')
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.62it/s]
>>> import torch
>>> from transformers_neuronx.module import save_pretrained_split
>>> save_pretrained_split(model, './llama-2-7b-split')
>>> quit()
(aws_neuron_venv_pytorch) $ python
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> import time
>>> import torch
>>> from transformers import AutoTokenizer
>>> from transformers_neuronx.llama.model import LlamaForSampling
>>> os.environ["NEURON_CC_FLAGS"] = "--model-type=transformer-inference"
>>> neuron_model = LlamaForSampling.from_pretrained('./llama-2-7b-split', batch_size=1, tp_degree=2, amp='f16')
>>> neuron_model.to_neuron()
2023-Sep-08 23:03:12.0613 10779:10833 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-Sep-08 23:03:12.0613 10779:10833 [0] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/llama/model.py", line 117, in to_neuron
    model = self.decoder_lm_head.build_weight_shared(
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 157, in build_weight_shared
    new.program.setup(new.layers, ln_lm_head_params)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 983, in setup
    super().setup(layers, ln_lm_head_params)
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 879, in setup
    kernel.load()
  File "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 375, in load
    assert self.neff_bytes is not None, f"Try to load with neff bytes as None, might due to compilation failure"
AssertionError: Try to load with neff bytes as None, might due to compilation failure
Thanks!
Hey, is there any plan to add support for Falcon models?
That would be really great!
When generating only a few tokens with the recently added llama model, the generated sequence looks ok, but when the number of generated tokens increases (typically with 128 tokens), the end of the sequence is gibberish.
Hi AWS Neuron team,
It would be great to have Mistral-7B model support since its performance is better than LLAMA2 and it is with a better license. Will this model be on the roadmap?
Best,
Henry
I am trying to save the Neuron model and deploy it to SageMaker as an endpoint. I noticed in the documentation, under serialization support, it is stated that all models can be loaded or saved except GPTJ and GPTNeoX model classes.
However, I tried several models, including Llama2-13b, OPT-30B, OPT-66B, and Llama2-70B, and none of them can be saved. I tried several methods:
- `<neuron_model>.save`, which doesn't exist. It only appears to exist for GPT2 models.
- `<neuron_model>.state_dict()`, which fails on all LazyModules.
- `torch.save`, or TorchScript via `torch.jit.save`, and then trying to use the `state_dict()`.

Below is an example using OPT-66B.
Traceback (most recent call last):
  File "opt.py", line 63, in <module>
    print(f"\ndecoder: {neuron_model.chkpt_model.model.decoder.state_dict()}")
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1445, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1356, in _save_to_state_dict
    destination[prefix + name] = param if keep_vars else param.detach()
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/nn/parameter.py", line 144, in __torch_function__
    raise ValueError(
ValueError: Attempted to use an uninitialized parameter in <method 'detach' of 'torch._C._TensorBase' objects>. This error happens when you are using a `LazyModule` or explicitly manipulating `torch.nn.parameter.UninitializedParameter` objects. When using LazyModules call `forward` with a dummy batch to initialize the parameters before calling torch functions
Is there anything that can be done to fix this? I've tried the last five versions of `transformers-neuronx`; see here. Please advise. Thanks!
The GPT-Neox-20B model is too big to run on an inf2.8xlarge instance, so I tried to convert a model with the same architecture but less parameters: EleutherAI/pythia-1.4B.
I first saved the model locally:
$ gptneox_demo --model_name EleutherAI/pythia-1.4B save ./pythia-1.4B
Then I converted it and ran an inference, but the output is garbage:
$ gptneox_demo --model_name EleutherAI/pythia-1.4B run --batch_size 1 --n_positions 20 ./pythia-1.4B
running GPTNeoXForSampling.from_pretrained
/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptneox/model.py:40: UserWarning: hidden_act="gelu" ignored in favor of hidden_act="gelu_new"
warnings.warn(f'hidden_act="{self.config.activation_function}" ignored in favor of hidden_act="gelu_new"')
running model.to_neuron
..Selecting 7380 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Analyzing dependencies of Block1
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Analyzing dependencies of Block1
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Dependency reduction of sg0000
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Compiler status PASS
2023-Jun-23 09:32:02.0148 1784:1784 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-Jun-23 09:32:02.0148 1784:1784 [0] init.cc:99 CCOM WARN OFI plugin initNet() failed is EFA enabled?
running model.sample
generated_sequence= tensor([[12092, 13, 309, 1353, 247, 3448, 1566, 13, 50276, 50276,
521, 1028, 292, 35106, 11, 50276, 88, 50276, 7112, 11]])
["Hello, I'm a language model, hisucetunto* w ido*"]
For the record, the output with the standard CPU inference:
tensor([[12092, 13, 309, 1353, 247, 3448, 1566, 13, 285, 309,
1353, 2820, 281, 1973, 247, 1566, 326, 476, 320, 908]])
["Hello, I'm a language model, and I'm trying to build a model that can be used"]
Following the example of `HuggingFaceGenerationModelAdapter`, I have created a `NeuronModelForCausalLM` adapter class for HuggingFace optimum-neuron (see huggingface/optimum-neuron#117).
I compared the inference times of calling the `GPT2ForSampling.sample()` method directly and using the `generate()` method.
| @inf2.8xlarge | sample | generate |
|---|---|---|
| 128 tokens | 0.5 s | 0.9 s |
| 1000 tokens | 4.5 s | 7.4 s |
Calling `generate()` through the wrapper is significantly slower: is this expected? Did I miss something?
I followed the Llama NeuronX tutorial to host Llama2 on Amazon EC2 with NeuronX and TorchServe. The model works well, achieving 50+ tokens/sec as expected.
Issue
However, for my use case the input contexts are 500-3000 tokens. When I provide an example 3000 token context, there is a 10-30 second overhead before the first token is generated. After the first token, the inference speed is 50 tok/sec as expected.
Attempted fixes
I have tried the following to resolve the long context overhead:
- Adjusting `maxWorkers`, `maxBatchDelay`, and `batchSize` - no improvement
- Increasing the `max_length` parameter to support longer sequences - no improvement
- Tuning `micro_batch_size` and parallelism values - no improvement

model-config.yaml:
minWorkers: 2
maxWorkers: 8 #did not help
maxBatchDelay: 20
responseTimeout: 1080
batchSize: 4 #did not help
handler:
model_checkpoint_dir: "llama-2-13b-split"
amp: "bf16"
tp_degree: 6
max_length: 100
#did not help either
# micro_batching:
# micro_batch_size: 8
# parallelism:
# preprocess: 4
# inference: 1
# postprocess: 4
pip list
torch 1.13.1+cpu
torch-model-archiver 0.9.0b20231026
torch-neuronx 1.13.1.1.12.1
torch-workflow-archiver 0.2.11b20231026
torch-xla 1.13.1+torchneuronc
transformers-neuronx 0.8.268
torchserve --ncs --start --model-store model_store --ts-config config.properties --models llama-2-13b
(aws_neuron_venv_pytorch) ubuntu@ip-10-72-158-249:~/serve/examples/large_models/inferentia2/llama2$ WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-12-04T23:54:37,499 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2023-12-04T23:54:37,501 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-12-04T23:54:37,545 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml
2023-12-04T23:54:37,683 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.9.0
TS Home: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages
Current directory: /home/ubuntu/serve/examples/large_models/inferentia2/llama2
Temp directory: /tmp
Metrics config path: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml
Number of GPUs: 0
Number of CPUs: 96
Max heap size: 30688 M
Python executable: /opt/aws_neuron_venv_pytorch/bin/python
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/model_store
Initial Models: llama-2-13b
Log dir: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/logs
Metrics dir: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 96
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: log
Disable system metrics: false
Workflow Store: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/model_store
Model config: N/A
2023-12-04T23:54:37,689 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2023-12-04T23:54:37,703 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: llama-2-13b
2023-12-04T23:54:37,709 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir /tmp/models/6b6627abd2334517acf43ddc5e377cd5
2023-12-04T23:54:37,710 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /tmp/models/6b6627abd2334517acf43ddc5e377cd5/llama-2-13b
2023-12-04T23:54:37,718 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model llama-2-13b
2023-12-04T23:54:37,719 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model llama-2-13b
2023-12-04T23:54:48,067 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model llama-2-13b loaded.
2023-12-04T23:54:48,067 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: llama-2-13b, count: 2
2023-12-04T23:54:48,074 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/aws_neuron_venv_pytorch/bin/python, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2023-12-04T23:54:48,074 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/aws_neuron_venv_pytorch/bin/python, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9001, --metrics-config, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2023-12-04T23:54:48,075 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-12-04T23:54:48,125 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2023-12-04T23:54:48,125 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2023-12-04T23:54:48,272 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:9.1|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40732955932617|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.85419082641602|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:364036.0625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:12472.20703125|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:3.9|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,779 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9000, pid=492260
2023-12-04T23:54:48,779 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9000
2023-12-04T23:54:48,779 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9001, pid=492261
2023-12-04T23:54:48,780 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9001
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Successfully loaded /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - [PID]492261
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-04T23:54:48,787 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Successfully loaded /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-12-04T23:54:48,787 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-llama-2-13b_1.0 State change null -> WORKER_STARTED
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - [PID]492260
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-04T23:54:48,788 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2023-12-04T23:54:48,788 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-llama-2-13b_1.0 State change null -> WORKER_STARTED
2023-12-04T23:54:48,790 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2023-12-04T23:54:48,790 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2023-12-04T23:54:48,797 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9001.
2023-12-04T23:54:48,797 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9000.
2023-12-04T23:54:48,799 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1701734088799
2023-12-04T23:54:48,799 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1701734088799
2023-12-04T23:54:48,833 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - model_name: llama-2-13b, batchSize: 8
2023-12-04T23:54:48,833 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - model_name: llama-2-13b, batchSize: 8
2023-12-04T23:54:48,997 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2023-12-04T23:54:49,000 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2023-12-04T23:54:49,523 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Your torch version is 1.13.1+cpu which does not support torch.compile
2023-12-04T23:54:49,532 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Your torch version is 1.13.1+cpu which does not support torch.compile
2023-12-04T23:54:49,543 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-04T23:54:49,544 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-04T23:54:49,545 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Setting micro batching size: 1
2023-12-04T23:54:49,553 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-04T23:54:49,553 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-04T23:54:49,555 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Setting micro batching size: 1
2023-12-04T23:54:58,772 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Starting to compile the model
2023-12-04T23:54:58,789 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Starting to compile the model
2023-12-04T23:55:34,910 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:34.0909 492260:492606 [6] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-12-04T23:55:34,910 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:34.0909 492260:492606 [6] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2023-12-04T23:55:35,178 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:35.0178 492261:492613 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-12-04T23:55:35,178 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:35.0178 492261:492613 [0] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40731430053711|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.85420608520508|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:342452.0390625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:34056.08203125|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:9.6|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:56:01,531 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Model has been successfully compiled
2023-12-04T23:56:01,537 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-04T23:56:01,538 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 72704
2023-12-04T23:56:01,538 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-llama-2-13b_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-04T23:56:01,538 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:73466.0|#WorkerName:W-9000-llama-2-13b_1.0,Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734161
2023-12-04T23:56:01,539 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:36.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734161
2023-12-04T23:56:02,630 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Model has been successfully compiled
2023-12-04T23:56:02,632 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-04T23:56:02,633 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 73799
2023-12-04T23:56:02,633 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-llama-2-13b_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-04T23:56:02,633 [INFO ] W-9001-llama-2-13b_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:74560.0|#WorkerName:W-9001-llama-2-13b_1.0,Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734162
2023-12-04T23:56:02,634 [INFO ] W-9001-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:36.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734162
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:9.1|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40730667114258|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.8542137145996|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:330775.37890625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:45732.69140625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:12.7|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
... some time later when I call the API
2023-12-05T00:00:48,437 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734448
2023-12-05T00:00:48,458 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT to backend at: 1701734448458
2023-12-05T00:00:48,461 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Backend received inference at: 1701734448
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Preprocessing
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - received req=At the far end of town where the Gricklegrass grows and the wind smells slowandsour when it blows and no
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - birds ever sing excepting old crows is the Street of the Lifted Lorax
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And deep in the Gricklegrass some people say if you look deep enough you can still see today where the
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Lorax once stood just as long as it could before somebody lifted the Lorax away
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - What was the Lorax Any why was it there And why was it lifted and taken somewhere from the far end of
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - town where the Gricklegrass grows The old Onceler still lives here
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Ask him he knows
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - You wont see the Onceler Dont knock at his door He stays in his Lerkim on top of his store He stays in his
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Lerkim cold under the floor where he makes his own clothes out of miffmuffered moof And on special dank
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - midnights in August he peeks out of the shutters and sometimes he speaks and tells how the Lorax was lifted
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - away Hell tell you perhaps if youre willing to pay
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - On the end of a rope he lets down a tin pail and you have to toss in fifteen cents and a nail and the shell of a
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - greatgreatgreat grandfather snail
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Then he pulls up the pail makes a most careful count to see if youve paid him the proper amount Then he
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - hides what you paid him away in his Snuvv his secret strange hole in his gruvvulous glove Then he grunts I
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - will call you by WhispermaPhone for the secrets I tell you are for your ears alone
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - SLUPP Down slupps the WhispermaPhone to your ear and the old Oncelers whispers are not very clear
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - since they have to come down through a snergelly hose and he sounds as if he had smallish bees up his nose
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Now Ill tell you he says with his teeth sounding gray how the Lorax got lifted and taken away It all started
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - way back such a long long time back
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Way back in the days when the grass was still green and the pond was still wet and the clouds were still clean
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - and the song of the SwomeeSwans rang out in space one morning I came to this glorious place And I first
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - saw the trees The Truffula Trees The brightcolored tufts of the Truffula Trees Mile after mile in the fresh
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - morning breeze
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And under the trees I saw Brown Barbaloots frisking about in their Barbaloot suits as the played in the
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - shade and ate Truffula Fruits From the rippulous pond came the comfortable sound of the HummingFish
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - humming while splashing around
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But those trees Those trees Those Truffula Trees All my life Id been searching for trees such as these The
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - touch of their tufts was much softer than silk And they had the sweet smell of fresh butterfly milk
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG -
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I felt a great leaping of joy in my heart I knew just what Id do I unloaded my cart In no time at all I had built
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - a small shop Then I chopped down a Truffula Tree with one chop And with great skillful skill and with great
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - speedy speed I took the soft tuft And I knitted a Thneed
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - The instant Id finished I heard a gaZump I looked I saw something pop out of the stump of the tree Id
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - chopped down It was sort of a man Describe himThats hard I dont know if I can He was shortish and
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - oldish and brownish and mossy And he spoke with a voice that was sharpish and bossy
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Mister He said with a sawdusty sneeze I am the Lorax I speak for the trees I speak for the trees for the trees
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - have no tongues And Im asking you sir at the top of my lungs he was very upset as he shouted and puffed
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Whats that THING youve made out of my Truffula tuft
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Look Lorax I said Theres no cause for alarm I chopped just one tree I am doing no harm Im being quite
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - useful This thing is a Thneed A Thneeds a FineSomethingThatAllPeopleNeed Its a shirt Its a sock Its a
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - glove Its a hat But it has other uses Yes far beyond that You can use it for carpets For pillows For sheets
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Or curtains Or covers for bicycle seats The Lorax said Sir You are crazy with greed There is no one on earth
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - who would buy that fool Thneed
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But the very next minute I proved he was wrong For just at that minute a chap came along and he thought
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - that the Thneed I had knitted was great He happily bought it for three ninetyeight I laughed at the Lorax You
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - poor stupid guy You never can tell what some people will buy
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I repeat cried the Lorax I speak for the trees
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Im busy I told him Shut up if you please I rushed cross the room and in no time at all built a radiophone I
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - put in a quick call I called all my brothers and uncles and aunts and I said listen here Heres a wonderful
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - chance for the whole Onceler Family to get mighty rich Get over here fast Take the road to North Nitch Turn
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - left at Weehawken Sharp right at South Stitch
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And in no time at all in the factory I built the whole Onceler Family was working full tilt We were all knitting
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds just as busy as bees to the sound of the chopping of Truffula Trees
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Then Oh Baby Oh How my business did grow Now chopping one tree at a time was too slow So I quickly
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - invented my SuperAxeHacker which whacked off four Truffula Trees at one smacker We were making
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds four times as fast as before And that Lorax He didnt show up any more
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But the next week he knocked on my new office door He snapped Im the Lorax who speaks for the trees
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - which you seem to be chopping as fast as you please But Im also in charge of the Brown Barbaloots who
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - played in the shade in their Barbaloot suits and happily lived eating Truffula Fruits NOWthanks to your
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - hacking my trees to the ground theres not enough Truffula Fruit to go round
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And my poor Barbaloots are all getting the crummies because they have gas and no food in their tummies
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - They loved living here But I cant let them stay Theyll have to find food And I hope that they may Good luck
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - boys he cried And he sent them away
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I the Onceler felt sad as I watched them all go BUT business is business And business must grow
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - regardless of crummies in tummies you know
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I meant no harm I most truly did not But I had to grow bigger So bigger I got I biggered my factory I
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - biggered my roads I biggered my wagons I biggered the loads of the Thneeds I shipped out I was shipping
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - them forth to the South To the East To the West To the North I went right on biggeringselling more
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds And I biggered my money which everyone needs
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 3
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG -
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - This story is about
2023-12-05T00:00:48,508 [INFO ] W-9000-llama-2-13b_1.0 ACCESS_LOG - /127.0.0.1:50848 "POST /predictions/llama-2-13b HTTP/1.1" 200 73
2023-12-05T00:00:48,510 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734448
2023-12-05T00:00:48,511 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,523 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,590 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,658 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,725 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,793 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,860 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,928 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,995 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:09,063 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:09,130 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
.....
2023-12-05T00:01:12,608 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:12,608 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:12,609 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Inferance
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:24147.4|#ModelName:llama-2-13b,Level:Model|#hostname:ip-10-72-158-249,1701734472,beab1a87-913c-4302-9548-c25943c30243, pattern=[METRICS]
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:2.4171749336E7|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:20370.777|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.job.Job - Waiting time ns: 20370777, Backend time ns: 24152110030
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - QueueTime.Milliseconds:20.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 24125
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:27.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_METRICS - HandlerTime.ms:24147.4|#ModelName:llama-2-13b,Level:Model|#hostname:ip-10-72-158-249,requestID:beab1a87-913c-4302-9548-c25943c30243,timestamp:1701734472
Ask
Is there something I'm missing in the config or use of Llama NeuronX to remove the long context overhead? I would like sub-second initial token latency for 500-3000 token contexts.
The alternative is to deploy with SageMaker, but I don't have that setup because we want to rewrite inference.py to extract logits and limit Llama to constrained generation.
Let me know if any other details would be helpful in troubleshooting this. Thanks!
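For reference, this is roughly how I am measuring the initial-token latency per context length; the `generate_one_token` callable here is a hypothetical stand-in for the actual model call, not part of the library:

```python
import time

def measure_first_token_latency(generate_one_token, context_lengths, trials=3):
    """Time the first generated token for each context length.

    generate_one_token(n) should run prefill plus one decode step on a
    context of n tokens; here it is a stand-in for the real model call.
    """
    results = {}
    for n in context_lengths:
        timings = []
        for _ in range(trials):
            start = time.perf_counter()
            generate_one_token(n)
            timings.append(time.perf_counter() - start)
        results[n] = min(timings)  # best-of-N reduces scheduler noise
    return results

# Dummy workload standing in for the model call:
latencies = measure_first_token_latency(lambda n: sum(range(n)), [500, 1500, 3000])
for n, t in sorted(latencies.items()):
    print(f"context={n:5d}  first-token latency={t * 1000:.3f} ms")
```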
Hello, is there any easy way to add serialization to models other than GPT2?
GPT2 has a _save_compiled_artifacts
method to save compiled artifacts to disk and load them back. That would be convenient for other models as well, since compiling e.g. GPT-J takes 5-10 minutes.
I looked at the code, but it seems there was a design change.
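Until something like that exists for the other models, one generic pattern is to key the compiled artifacts on the compilation config, so changing tp_degree, n_positions, etc. never reuses a stale artifact. This is only an illustrative sketch; `artifact_cache_path` and the `.neff` naming are mine, not part of the library:

```python
import hashlib
import json
import os

def artifact_cache_path(cache_dir, model_name, config):
    """Derive a stable artifact path from the compilation config.

    Any change to tp_degree, n_positions, batch_size, amp, etc. yields a
    different hash, so mismatched artifacts are never loaded by accident.
    """
    key = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:16]
    return os.path.join(cache_dir, f"{model_name}-{key}.neff")

path = artifact_cache_path(
    "/tmp/neuron-cache", "gpt-j-6b",
    {"tp_degree": 2, "n_positions": 2048, "batch_size": 1, "amp": "f16"},
)
print(path)
```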
Hi team, for all the models, I am getting the below error while importing transformers_neuronx.{MODEL_NAME}.model
>>> from transformers_neuronx.gptj.model import GPTJForSampling
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ssm-user/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/gptj/model.py", line 16, in <module>
    from transformers_neuronx import compiler
  File "/home/ssm-user/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/compiler.py", line 31, in <module>
    from libneuronxla import neuron_xla_compile
ImportError: cannot import name 'neuron_xla_compile' from 'libneuronxla' (/home/ssm-user/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/libneuronxla/__init__.py)
The transformers_neuronx version is: 0.6.x
The torch_neuronx version is: 1.13.1.1.9.0
OS Used: Amazon Linux 2
Kernel: kernel-devel-5.10.167-147.601
Please help me to resolve this.
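This kind of ImportError usually suggests a version mismatch between transformers-neuronx and the libneuronxla it was built against, so it may help to list the installed Neuron package versions (package names below are the pip distribution names; this is just a diagnostic sketch):

```python
# Collect versions of the relevant Neuron packages to spot a mismatch.
from importlib import metadata

versions = {}
for pkg in ("transformers-neuronx", "torch-neuronx", "libneuronxla", "neuronx-cc"):
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = "not installed"

for pkg, version in versions.items():
    print(f"{pkg:22s} {version}")
```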
With AWS Neuron SDK 2.14.1, I am experiencing very long compilation times for batch_size = 4 with the llama2 7B model.
I am using the following configurations:
| | inf2.8xlarge | inf2.48xlarge |
|-------------|--------------|---------------|
| tp_degree | 2 | 24 |
| n_positions | 2048 | 2048 |
| amp | f16 | f16 |
With batch_size = 1, 2 it takes minutes to compile the model with the -O1 option, but with batch_size = 4 it takes more than three hours.
Hi team,
I wondered if the tool has support for any encoder-decoder models too (like FLAN-T5 or FLAN-UL2)?
If not at the moment, do you have a plan for it?
Thanks!
This issue tracks the first version of the generation method implementation.
Ideally we want to provide a .generate() API similar to HuggingFace's. We need several functionalities in the near future, including greedy search, top-k and top-p sampling, and beam search, but due to limitations in torch-neuronx and transformers-neuronx themselves we cannot directly "borrow" the HuggingFace implementation yet. Instead, we will cover the following generation methods as soon as possible and refactor along the way.
Method | PR | Accuracy Check |
---|---|---|
Greedy | #1 | |
Top-k | | |
Top-p | | |
Beam search | | |
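For reference, the filtering behind the top-k and top-p methods in the table can be sketched in plain NumPy (names and values are illustrative; the real implementation will run on-device):

```python
import numpy as np

def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    """Mask logits outside the top-k set and/or the smallest nucleus covering top_p."""
    logits = logits.copy()
    if top_k > 0:
        kth = np.sort(logits)[-top_k]        # smallest logit that survives top-k
        logits[logits < kth] = -np.inf
    if top_p < 1.0:
        order = np.argsort(logits)[::-1]     # token indices, descending by logit
        probs = np.exp(logits[order] - logits[order][0])
        probs /= probs.sum()
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        cutoff = np.searchsorted(np.cumsum(probs), top_p) + 1
        logits[order[cutoff:]] = -np.inf
    return logits

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(top_k_top_p_filter(logits, top_k=2))   # only the two largest logits survive
```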
Hi, I am trying to run a demo from Triton (https://github.com/triton-inference-server/python_backend/tree/main/inferentia) on an inf2.24xlarge instance.
Here are the command lines to reproduce the issue:
# Start a docker on the inf2.24xlarge instance
sudo docker run --shm-size=2g -it nvcr.io/nvidia/tritonserver:23.04-py3
# Run this command in the started docker
pip3 install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com
The error message:
root@69eded59b438:/opt/tritonserver# pip3 install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting transformers-neuronx
Downloading https://pip.repos.neuron.amazonaws.com/transformers-neuronx/transformers_neuronx-0.4.60-py3-none-any.whl (91 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92.0/92.0 kB 26.3 MB/s eta 0:00:00
Collecting accelerate (from transformers-neuronx)
Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 244.2/244.2 kB 30.0 MB/s eta 0:00:00
Collecting torch-neuronx (from transformers-neuronx)
Downloading https://pip.repos.neuron.amazonaws.com/torch-neuronx/torch_neuronx-1.13.1.1.8.0-py3-none-any.whl (1.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 62.8 MB/s eta 0:00:00
Collecting transformers (from transformers-neuronx)
Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 127.8 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.8/dist-packages (from accelerate->transformers-neuronx) (1.24.2)
Collecting packaging>=20.0 (from accelerate->transformers-neuronx)
Downloading packaging-23.1-py3-none-any.whl (48 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48.9/48.9 kB 16.7 MB/s eta 0:00:00
Collecting psutil (from accelerate->transformers-neuronx)
Downloading psutil-5.9.5-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (282 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 282.1/282.1 kB 67.8 MB/s eta 0:00:00
Collecting pyyaml (from accelerate->transformers-neuronx)
Downloading PyYAML-6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (701 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 701.2/701.2 kB 92.8 MB/s eta 0:00:00
Collecting torch>=1.10.0 (from accelerate->transformers-neuronx)
Downloading torch-2.0.1-cp38-cp38-manylinux1_x86_64.whl (619.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 619.9/619.9 MB 2.1 MB/s eta 0:00:00
Downloading torch-1.13.1-cp38-cp38-manylinux1_x86_64.whl (887.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 887.4/887.4 MB 1.5 MB/s eta 0:00:00
Collecting torch-xla==1.13.1+torchneuron7 (from torch-neuronx->transformers-neuronx)
Downloading https://pip.repos.neuron.amazonaws.com/torch-xla/torch_xla-1.13.1%2Btorchneuron7-cp38-cp38-linux_x86_64.whl (267.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 267.7/267.7 MB 7.1 MB/s eta 0:00:00
Collecting libneuronxla==0.5.326 (from torch-neuronx->transformers-neuronx)
Downloading https://pip.repos.neuron.amazonaws.com/libneuronxla/libneuronxla-0.5.326-py3-none-linux_x86_64.whl (52.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.9/52.9 MB 35.3 MB/s eta 0:00:00
Collecting protobuf<5 (from torch-neuronx->transformers-neuronx)
Downloading protobuf-4.23.4-cp37-abi3-manylinux2014_x86_64.whl (304 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 304.5/304.5 kB 66.3 MB/s eta 0:00:00
Collecting aws-neuronx-runtime-discovery~=2.0 (from libneuronxla==0.5.326->torch-neuronx->transformers-neuronx)
Downloading https://pip.repos.neuron.amazonaws.com/aws-neuronx-runtime-discovery/aws-neuronx-runtime-discovery-2.9.tar.gz (1.0 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [9 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-23z1w1bc/aws-neuronx-runtime-discovery_74fefbd0a1fd4c4f986415e84c123c5f/setup.py", line 46, in <module>
main()
File "/tmp/pip-install-23z1w1bc/aws-neuronx-runtime-discovery_74fefbd0a1fd4c4f986415e84c123c5f/setup.py", line 15, in main
raise FileNotFoundError('Could not find Neuron Runtime Library {} (from deb/rpm package aws-neuronx-runtime-lib) in {}. Please check {} to install this library.'.format(soname_path, libnrt_installation_path, guide_link))
FileNotFoundError: Could not find Neuron Runtime Library /opt/aws/neuron/lib/libnrt.so.1 (from deb/rpm package aws-neuronx-runtime-lib) in /opt/aws/neuron/lib. Please check the Neuron installation guide https://awsdocs-neuron.readthedocs-hosted.com/ to install this library.
2.9
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
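If I read the traceback correctly, the failure comes from aws-neuronx-runtime-discovery's setup.py, which requires the system Neuron runtime (a deb/rpm package, not installable via pip) to already be present in the container; the NVIDIA Triton image does not ship it. A quick preflight check, using the path from the error message:

```python
# Check for the Neuron runtime library that aws-neuronx-runtime-discovery
# looks for before attempting the pip install.
import os

libnrt = "/opt/aws/neuron/lib/libnrt.so.1"
status = "present" if os.path.exists(libnrt) else "missing"
print(f"{libnrt}: {status}")
if status == "missing":
    print("install aws-neuronx-runtime-lib (deb/rpm) before pip installing transformers-neuronx")
```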
In the file transformers-neuronx/blob/main/src/transformers_neuronx/gpt2/model.py,
all variations of the model configure the max sequence length using the wrong parameter. The code below is replicated in many parts of this file: it correctly reads n_positions from the input config as the requested max length, but uses n_ctx to get the allowed value. Please change this to the correct config attribute, n_positions.
sequence_length = kwargs.get("n_positions", None)
if sequence_length:
    max_allowed_sequence_length = config.n_ctx
    if sequence_length > max_allowed_sequence_length:
        raise ValueError(f"Sequence length ({sequence_length}) cannot be larger than position embedding's context size ({max_allowed_sequence_length})!")
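For illustration, here is the suggested correction wrapped as a standalone function; the SimpleNamespace config is a stand-in for the real GPT2 config object:

```python
from types import SimpleNamespace

def check_sequence_length(config, **kwargs):
    """Same check as in gpt2/model.py, reading config.n_positions instead of config.n_ctx."""
    sequence_length = kwargs.get("n_positions", None)
    if sequence_length:
        max_allowed_sequence_length = config.n_positions  # was: config.n_ctx
        if sequence_length > max_allowed_sequence_length:
            raise ValueError(
                f"Sequence length ({sequence_length}) cannot be larger than "
                f"position embedding's context size ({max_allowed_sequence_length})!"
            )

# A config whose deprecated n_ctx disagrees with n_positions makes the bug visible:
config = SimpleNamespace(n_positions=2048, n_ctx=1024)
check_sequence_length(config, n_positions=2048)  # now validated against n_positions
```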
Hello, the example code snippet from the README.md is not working.
import os
from transformers_neuronx.gpt2.model import GPT2ForSampling
from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter
from transformers_neuronx.module import save_pretrained_split
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ['NEURON_CC_FLAGS'] = '--model-type=transformer-inference'
# Load and save the CPU model
model_cpu = AutoModelForCausalLM.from_pretrained('gpt2')
save_pretrained_split(model_cpu, 'gpt2-split')
# Create and compile the Neuron model
model = GPT2ForSampling.from_pretrained('gpt2-split', batch_size=1, tp_degree=2, n_positions=256, amp='f32', unroll=None)
if hasattr(model, 'register_to_neuron_hook'):
    model.register_to_neuron_hook(lambda idx: print(f'done to_neuron layer {idx}'))
print('running model.to_neuron')
model.to_neuron()
Error:
running model.to_neuron
.2023-06-29T11:57:55Z ERROR 2975 [WalrusDriver]: An exception was thrown:
--------------------------------------------------------------------------------
0# __cxa_throw in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
1# 0x00007FB8EC7EDE96 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
2# birverifier::checkInputMemType(bir::Instruction const&, unsigned int, llvm::SmallVector<bir::MemoryType, 3u> const&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
3# birverifier::InstVisitor::visitInstIndirectSave(bir::InstIndirectSave&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
4# neuronxcc::walrus::Verifier::run(bir::Module&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
5# neuronxcc::walrus::WalrusPass::run(std::vector<std::unique_ptr<bir::Module, std::default_delete<bir::Module> >, std::allocator<std::unique_ptr<bir::Module, std::default_delete<bir::Module> > > >&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
6# 0x00007FB8E53763FE in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
7# run_walrus_driver(int, char**) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
8# 0x00007FB8EC82F130 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/EmbeddedWalrusDriver.cpython-38-x86_64-linux-gnu.so
9# 0x00007FB8F5308820 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
10# 0x00007FB8F531335E in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
11# _PyObject_MakeTpCall in /usr/bin/python3
12# _PyObject_FastCallDict in /usr/bin/python3
13# _PyObject_Call_Prepend in /usr/bin/python3
14# 0x00007FB8F53069EC in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
15# 0x00007FB8F532871E in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
16# _PyObject_MakeTpCall in /usr/bin/python3
17# _PyObject_FastCallDict in /usr/bin/python3
18# _PyObject_Call_Prepend in /usr/bin/python3
19# 0x00007FB90C49FC3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
20# 0x00007FB90C4B42D6 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
21# _PyObject_MakeTpCall in /usr/bin/python3
22# _PyObject_FastCallDict in /usr/bin/python3
23# _PyObject_Call_Prepend in /usr/bin/python3
24# 0x00007FB90C49FC3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
25# 0x00007FB90C4AFAC8 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
26# 0x00007FB8F531CBE2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
27# _PyObject_MakeTpCall in /usr/bin/python3
28# _PyObject_FastCallDict in /usr/bin/python3
29# _PyObject_Call_Prepend in /usr/bin/python3
30# 0x00007FB90C4CCC6B in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Pipeline.cpython-38-x86_64-linux-gnu.so
31# 0x00007FB90C4CF082 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Pipeline.cpython-38-x86_64-linux-gnu.so
32# _PyObject_MakeTpCall in /usr/bin/python3
33# _PyObject_FastCallDict in /usr/bin/python3
34# _PyObject_Call_Prepend in /usr/bin/python3
35# 0x00007FB90C49FC3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
36# 0x00007FB90C4B42D6 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
37# _PyObject_MakeTpCall in /usr/bin/python3
38# _PyObject_FastCallDict in /usr/bin/python3
39# _PyObject_Call_Prepend in /usr/bin/python3
40# 0x00007FB90C49FC3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
41# 0x00007FB90C4AFAC8 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
42# _PyObject_MakeTpCall in /usr/bin/python3
43# _PyObject_FastCallDict in /usr/bin/python3
44# _PyObject_Call_Prepend in /usr/bin/python3
45# 0x00007FB90BF52ECC in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
46# 0x00007FB90BF8ABA9 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
47# _PyObject_MakeTpCall in /usr/bin/python3
48# _PyObject_FastCallDict in /usr/bin/python3
49# _PyObject_Call_Prepend in /usr/bin/python3
50# 0x00007FB90BF58CD1 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
51# _PyObject_MakeTpCall in /usr/bin/python3
52# _PyObject_FastCallDict in /usr/bin/python3
53# _PyObject_Call_Prepend in /usr/bin/python3
54# 0x00007FB90C5A079C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
55# 0x00007FB90C5AC9AA in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
56# _PyObject_MakeTpCall in /usr/bin/python3
57# _PyObject_FastCallDict in /usr/bin/python3
58# _PyObject_Call_Prepend in /usr/bin/python3
59# 0x00007FB90C5A2CED in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
60# 0x00007FB90C5A2EC2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
61# 0x00007FB90C5B5DA2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
62# _PyObject_MakeTpCall in /usr/bin/python3
63# _PyEval_EvalFrameDefault in /usr/bin/python3
64# _PyEval_EvalCodeWithName in /usr/bin/python3
65# PyEval_EvalCode in /usr/bin/python3
66# 0x0000000000680001 in /usr/bin/python3
67# 0x000000000068007F in /usr/bin/python3
68# 0x0000000000680121 in /usr/bin/python3
69# PyRun_SimpleFileExFlags in /usr/bin/python3
70# Py_RunMain in /usr/bin/python3
71# Py_BytesMain in /usr/bin/python3
72# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
73# _start in /usr/bin/python3
--------------------------------------------------------------------------------
2023-06-29T11:57:55Z ERROR 2975 [WalrusDriver]: Walrus pass: birverifier failed!
2023-06-29T11:57:55Z ERROR 2975 [WalrusDriver]: Failure Reason: === BIR verification failed ===
Reason: Expect memory location to be of type SB
Instruction: I-10873
Opcode: IndirectSave
Input index: 1
Argument AP:
Access Pattern: [[384,1],[384,1],[1,64]]
SymbolicAP
Memory Location: {_reshape_273_hlo_id_2756__mhlo.reshape_20_pftranspose_5301-t9595_set}@PSUM
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: ***************************************************************
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: An Internal Compiler Error has occurred
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: ***************************************************************
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Error message: Walrus driver failed to complete
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Error class: AssertionError
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Error location: Unknown
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Command line: /usr/local/bin/neuronx-cc compile --framework=XLA --target=trn1 /tmp/tmpkbnmkhnd/FullyUnrolled.1826.1.pb --output=/tmp/tmpkbnmkhnd/FullyUnrolled.1826.1.pb.neff --verbose=35 --model-type=transformer-inference
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Internal details:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/CommandDriver.py", line 237, in neuronxcc.driver.CommandDriver.CommandDriver.run
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1047, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 998, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1023, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1027, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/jobs/WalrusDriver.py", line 232, in neuronxcc.driver.jobs.WalrusDriver.WalrusDriver.run
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: File "neuronxcc/driver/jobs/WalrusDriver.py", line 702, in neuronxcc.driver.jobs.WalrusDriver.WalrusDriver.runSingleInput
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Version information:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: NeuronX Compiler version 2.5.0.28+1be23f232
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: HWM version 2.5.0.0-dad732dd6
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: NEFF version Dynamic
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: TVM not available
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: NumPy version 1.21.6
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: MXNet not available
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]:
2023-06-29T11:57:55Z ERROR 2975 [neuronx-cc]: Artifacts stored in: /tmp/tmpkbnmkhnd/neuronxcc-9vjdmvs0
Compiler status ERROR
Traceback (most recent call last):
File "test.py", line 17, in <module>
model.to_neuron()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gpt2/model.py", line 117, in to_neuron
self.decoder_lm_head.to_neuron()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 107, in to_neuron
self.program.setup(self.layers, ln_lm_head_params)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 792, in setup
super().setup(layers, ln_lm_head_params)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/decoder.py", line 747, in setup
kernel.build()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 287, in build
self.neff_bytes = compile_hlo_module(self.hlo_module)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 71, in compile_hlo_module
subprocess.check_call(command_line, cwd=tmpdir)
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--framework=XLA', '--target=trn1', '/tmp/tmpkbnmkhnd/FullyUnrolled.1826.1.pb', '--output=/tmp/tmpkbnmkhnd/FullyUnrolled.1826.1.pb.neff', '--verbose=35', '--model-type=transformer-inference']' returned non-zero exit status 1.
I am trying to use meta-llama/Llama-2-13b-chat-hf, which has a max_position_embeddings of 4096 tokens.
I found that the library fails in a non-deterministic way when the input length is between 1790 and 1800 tokens.
If you submit exactly the same prompt several times, you randomly get either a good output or a failure; above 1800 tokens the failure becomes more deterministic. However, LLaMA works fine with inputs of more than 2000 tokens using the Hugging Face transformers library.
Here is a piece of code to reproduce the error.
Model preparation:
import transformers
# Versions after 4.28.1 save the model in an incompatible format: https://github.com/aws-neuron/transformers-neuronx/issues/60
assert transformers.__version__ == "4.28.1", f"Version is {transformers.__version__}"
from transformers import LlamaForCausalLM
from transformers_neuronx.module import save_pretrained_split
model_name = "meta-llama/Llama-2-13b-chat-hf"
model = LlamaForCausalLM.from_pretrained(model_name)
save_pretrained_split(model, './Llama-2-13b-split')
# Compile the model
import time
import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling
import torch_xla.core.xla_model as xm
xla_device_count = len(xm.get_xla_supported_devices())
# load meta-llama/Llama-2-13b onto the NeuronCores with N-way tensor parallelism and run compilation
neuron_model = LlamaForSampling.from_pretrained('./Llama-2-13b-split', batch_size=1, tp_degree=xla_device_count, amp='f16')
neuron_model.to_neuron()
neuron_model.save('./neuron_artifacts')
del neuron_model
Reproduce the bug:
# Load compiled model
import random
import time
import torch
from transformers import AutoConfig, AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling
import torch_xla.core.xla_model as xm
xla_device_count = len(xm.get_xla_supported_devices())
neuron_model = LlamaForSampling.from_pretrained('./Llama-2-13b-split', batch_size=1, tp_degree=xla_device_count, amp='f16')
neuron_model.load('./neuron_artifacts')
neuron_model.to_neuron()
model_name = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
for n_tokens in range(1780, 2000):
    all_token_ids = list(tokenizer.get_vocab().values())
    random_token_ids = random.choices(all_token_ids, k=n_tokens)
    random_tokens_tensor = torch.tensor([random_token_ids])
    print(f'''Input with {len(random_tokens_tensor[0])} tokens
Maximum sequence length for {model_name} is {config.max_position_embeddings} tokens''')
    max_output_length = config.max_position_embeddings - len(random_tokens_tensor[0])
    with torch.inference_mode():
        start = time.time()
        generated_sequences = neuron_model.sample(random_tokens_tensor, sequence_length=max_output_length, top_k=50)
        elapsed = time.time() - start
    generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
    print(f'generated sequences {generated_sequences} in {elapsed} seconds')
As I said, the bug is not deterministic, so the code will fail at a different iteration each time.
Here is an example:
Input with 1783 tokens
Maximum sequence length for meta-llama/Llama-2-13b-chat-hf is 4096 tokens
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
Cell In[96], line 21
19 with torch.inference_mode():
20 start = time.time()
---> 21 generated_sequences = neuron_model.sample(random_tokens_tensor, sequence_length=max_output_length, top_k=50)
22 elapsed = time.time() - start
26 generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:174, in LlamaForSampling.sample(self, input_ids, sequence_length, start_ids, top_k, top_p, eos_token_override, temperature, streamer)
171 context_length -= prefixed_length
172 sequence_length -= prefixed_length
--> 174 result = sampling.sample_llama(
175 self, input_ids, start_ids, sequence_length,
176 eos_token_id=self.config.eos_token_id if eos_token_override is None else eos_token_override,
177 top_k=top_k, top_p=top_p, temperature=temperature, streamer=streamer
178 )
180 return result
File ~/.local/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/sampling.py:288, in sample_llama(model, input_ids, start_ids, sequence_length, eos_token_id, top_k, top_p, temperature, streamer)
286 _, start = input_ids.shape
287 next_token_scores = model(input_ids, None, start_ids)
--> 288 return sample_loop_llama(
289 model, input_ids, start_ids, next_token_scores, sequence_length, eos_token_id, top_k, top_p, temperature, streamer
290 )
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/sampling.py:273, in sample_loop_llama(model, input_ids, start_ids, next_token_scores, sequence_length, eos_token_id, top_k, top_p, temperature, streamer)
271 # forward pass to get next token
272 cache_ids = torch.as_tensor([cur_len], dtype=torch.int32)
--> 273 next_token_scores = model(inputs, cache_ids, start_ids)
275 if streamer:
276 streamer.end()
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/llama/model.py:158, in LlamaForSampling.forward(self, input_ids, cache_ids, start_ids)
156 input_ids, *rst = self._preprocess(input_ids, start_ids=start_ids, cache_ids=cache_ids)
157 hidden = self.chkpt_model.model.embed_tokens(input_ids)
--> 158 return self._forward(hidden, *rst)
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/base.py:229, in NeuronModelBase._forward(self, hidden, *args)
227 logits = self.context(hidden, *args)
228 else:
--> 229 logits = self.decoder_lm_head(hidden, *args)
231 logits = logits.to(torch.float32)
232 logits = logits[:self.config.vocab_size, -1, :]
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/decoder.py:231, in DecoderLmHeadForSamplingNoEmbedding.forward(self, *inputs)
229 sequence_length = hidden.shape[sequence_dim]
230 if sequence_length == 1:
--> 231 return self.forward_single(*inputs)
232 if sequence_length % self.n_active_tokens:
233 raise ValueError(f'sequence_length={sequence_length} cannot be divided by '
234 f'n_active_tokens={self.n_active_tokens}')
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/decoder.py:216, in DecoderLmHeadForSamplingNoEmbedding.forward_single(self, *inputs)
214 hidden, cache_ids, *_ = inputs
215 batch_size = hidden.shape[2]
--> 216 bucket_id = self.program.find_bucket_id(cache_ids.item())
217 if self.use_executor:
218 return self.program.execute(bucket_id, batch_size, *inputs, return_ranks=self.return_ranks)
File ~/.local/lib/python3.10/site-packages/transformers_neuronx/decoder.py:1043, in DecoderProgram.find_bucket_id(self, length)
1042 def find_bucket_id(self, length):
-> 1043 return next(idx for idx, npos in enumerate(self.n_positions_list) if npos >= length+1)
StopIteration:
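The StopIteration above is raised by find_bucket_id in decoder.py: the function assumes some compiled bucket in n_positions_list can hold the requested cache position, and when none can, the exhausted generator inside next(...) surfaces as a bare StopIteration instead of a clear "sequence too long" error. A minimal sketch of the failure mode (the bucket sizes below are illustrative, not the actual compiled buckets):

```python
# Illustrative bucket list; the real one is derived from the n_positions
# value used when the model was compiled (default 2048).
n_positions_list = [256, 512, 1024, 2048]

def find_bucket_id(length):
    # Mirrors transformers_neuronx/decoder.py: returns the index of the first
    # bucket that can hold position length+1. When length+1 exceeds every
    # bucket (KV cache overruns the compiled sequence length), next() raises
    # StopIteration.
    return next(idx for idx, npos in enumerate(n_positions_list) if npos >= length + 1)

print(find_bucket_id(1000))  # -> 2 (fits the 1024 bucket)
print(find_bucket_id(2047))  # -> 3 (last bucket)
# find_bucket_id(2048) raises StopIteration: 2049 > 2048
```

This matches the symptom in the report: once generation pushes the cache position past the largest compiled bucket, sampling dies with an uninformative StopIteration, and stochastic early stopping (e.g. sampling an EOS token) explains why shorter prompts fail only intermittently.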
I have tried the conversion and inference of a GPT-J model using the gptj_demo CLI @0.4.60 on an inf2.8xlarge instance.
I first save the model locally using:
$ gptj_demo save gpt-j-6B
Then I try to convert and run it.
With a batch_size of 1, I get the expected result:
$ gptj_demo run --batch_size 1 gpt-j-6B
running GPTJForSampling.from_pretrained
running model.to_neuron
...Selecting 26361 allocations
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Analyzing dependencies of Block1
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
.Analyzing dependencies of Block1
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Dependency reduction of sg0000
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
.
Compiler status PASS
2023-Jun-23 07:01:56.0042 2158:2592 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-Jun-23 07:01:56.0042 2158:2592 [1] init.cc:99 CCOM WARN OFI plugin initNet() failed is EFA enabled?
running model.sample
generated_sequence= tensor([[15496, 11, 314, 1101, 257, 3303, 2746, 11, 198, 40,
1101, 257, 1808, 12, 504, 86, 1586, 198, 30243, 11,
290, 314, 3280, 198, 6138, 507, 1912, 319, 644, 198,
40, 760, 546, 661, 13, 198, 40, 1101, 257, 9379,
11, 290, 661, 389, 198, 11031, 286, 1612, 284, 616,
16294, 11, 198, 568, 314, 761, 284, 2193, 546, 257,
198, 1122, 286, 3404, 284, 1730, 351, 705, 368, 11,
198, 8201, 6970, 198, 4919, 661, 892, 13, 198, 2396,
644, 314, 18869, 1560, 345, 1909, 198, 271, 257, 845,
4096, 11, 845, 1468, 3896, 198, 10755, 262, 995, 1377,
257, 845, 198, 727, 11, 4096, 835, 286, 4673, 198,
18927, 13, 198, 1870, 618, 345, 2193, 198, 18927, 588,
428, 11, 198, 5832, 923, 284, 1833, 1223]])
["Hello, I'm a language model,\nI'm a question-answering\nmachine, and I answer\nquestions based on what\nI know about people.\nI'm a robot, and people are\nkind of mean to my creators,\nso I need to learn about a\nton of stuff to deal with 'em,\nincluding knowing\nhow people think.\nSo what I wanna tell you today\nis a very basic, very old rule\nabout the world -- a very\nold, basic way of learning\nsomething.\nAnd when you learn\nsomething like this,\nyou start to understand something"]
But when I try to use a higher batch_size, I get a compiler error:
$ gptj_demo run --batch_size 2 gpt-j-6B
running GPTJForSampling.from_pretrained
running model.to_neuron
...2023-06-23T07:07:54Z ERROR 3231 [WalrusDriver]: An exception was thrown:
--------------------------------------------------------------------------------
0# __cxa_throw in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
1# 0x00007F4D1EC74E96 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
2# birverifier::checkInputMemType(bir::Instruction const&, unsigned int, llvm::SmallVector<bir::MemoryType, 3u> const&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
3# birverifier::InstVisitor::visitInstIndirectSave(bir::InstIndirectSave&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libBIRVerifier.so
4# neuronxcc::walrus::Verifier::run(bir::Module&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
5# neuronxcc::walrus::WalrusPass::run(std::vector<std::unique_ptr<bir::Module, std::default_delete<bir::Module> >, std::allocator<std::unique_ptr<bir::Module, std::default_delete<bir::Module> > > >&) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
6# 0x00007F4D066833FE in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
7# run_walrus_driver(int, char**) in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/../../../starfish/lib/libwalrus.so
8# 0x00007F4D54EBE130 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/support/EmbeddedWalrusDriver.cpython-38-x86_64-linux-gnu.so
9# 0x00007F4D0ADBB820 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
10# 0x00007F4D0ADC635E in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
11# _PyObject_MakeTpCall in /usr/bin/python3
12# _PyObject_FastCallDict in /usr/bin/python3
13# _PyObject_Call_Prepend in /usr/bin/python3
14# 0x00007F4D0ADB99EC in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
15# 0x00007F4D0ADDB71E in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
16# _PyObject_MakeTpCall in /usr/bin/python3
17# _PyObject_FastCallDict in /usr/bin/python3
18# _PyObject_Call_Prepend in /usr/bin/python3
19# 0x00007F4D9FF77C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
20# 0x00007F4D9FF8C2D6 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
21# _PyObject_MakeTpCall in /usr/bin/python3
22# _PyObject_FastCallDict in /usr/bin/python3
23# _PyObject_Call_Prepend in /usr/bin/python3
24# 0x00007F4D9FF77C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
25# 0x00007F4D9FF87AC8 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
26# 0x00007F4D0ADCFBE2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/jobs/WalrusDriver.cpython-38-x86_64-linux-gnu.so
27# _PyObject_MakeTpCall in /usr/bin/python3
28# _PyObject_FastCallDict in /usr/bin/python3
29# _PyObject_Call_Prepend in /usr/bin/python3
30# 0x00007F4D9FFA4C6B in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Pipeline.cpython-38-x86_64-linux-gnu.so
31# 0x00007F4D9FFA7082 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Pipeline.cpython-38-x86_64-linux-gnu.so
32# _PyObject_MakeTpCall in /usr/bin/python3
33# _PyObject_FastCallDict in /usr/bin/python3
34# _PyObject_Call_Prepend in /usr/bin/python3
35# 0x00007F4D9FF77C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
36# 0x00007F4D9FF8C2D6 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
37# _PyObject_MakeTpCall in /usr/bin/python3
38# _PyObject_FastCallDict in /usr/bin/python3
39# _PyObject_Call_Prepend in /usr/bin/python3
40# 0x00007F4D9FF77C3C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
41# 0x00007F4D9FF87AC8 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/Job.cpython-38-x86_64-linux-gnu.so
42# _PyObject_MakeTpCall in /usr/bin/python3
43# _PyObject_FastCallDict in /usr/bin/python3
44# _PyObject_Call_Prepend in /usr/bin/python3
45# 0x00007F4D9FA2BECC in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
46# 0x00007F4D9FA63BA9 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
47# _PyObject_MakeTpCall in /usr/bin/python3
48# _PyObject_FastCallDict in /usr/bin/python3
49# _PyObject_Call_Prepend in /usr/bin/python3
50# 0x00007F4D9FA31CD1 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/commands/CompileCommand.cpython-38-x86_64-linux-gnu.so
51# _PyObject_MakeTpCall in /usr/bin/python3
52# _PyObject_FastCallDict in /usr/bin/python3
53# _PyObject_Call_Prepend in /usr/bin/python3
54# 0x00007F4DA007879C in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
55# 0x00007F4DA00849AA in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
56# _PyObject_MakeTpCall in /usr/bin/python3
57# _PyObject_FastCallDict in /usr/bin/python3
58# _PyObject_Call_Prepend in /usr/bin/python3
59# 0x00007F4DA007ACED in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
60# 0x00007F4DA007AEC2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
61# 0x00007F4DA008DDA2 in /usr/local/lib/python3.8/dist-packages/neuronxcc/driver/CommandDriver.cpython-38-x86_64-linux-gnu.so
62# _PyObject_MakeTpCall in /usr/bin/python3
63# _PyEval_EvalFrameDefault in /usr/bin/python3
64# _PyEval_EvalCodeWithName in /usr/bin/python3
65# PyEval_EvalCode in /usr/bin/python3
66# 0x000000000067DBF1 in /usr/bin/python3
67# 0x000000000067DC6F in /usr/bin/python3
68# 0x000000000067DD11 in /usr/bin/python3
69# PyRun_SimpleFileExFlags in /usr/bin/python3
70# Py_RunMain in /usr/bin/python3
71# Py_BytesMain in /usr/bin/python3
72# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
73# _start in /usr/bin/python3
--------------------------------------------------------------------------------
2023-06-23T07:07:54Z ERROR 3231 [WalrusDriver]: Walrus pass: birverifier failed!
2023-06-23T07:07:54Z ERROR 3231 [WalrusDriver]: Failure Reason: === BIR verification failed ===
Reason: Expect memory location to be of type SB
Instruction: I-26521
Opcode: IndirectSave
Input index: 1
Argument AP:
Access Pattern: [[512,2],[512,1],[1,512]]
SymbolicAP
Memory Location: {_reshape_382_hlo_id_3499__mhlo.reshape_22_pftranspose_10864_set}@PSUM
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: ***************************************************************
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: An Internal Compiler Error has occurred
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: ***************************************************************
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Error message: Walrus driver failed to complete
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Error class: AssertionError
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Error location: Unknown
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Command line: /usr/local/bin/neuronx-cc compile --framework=XLA --target=trn1 /tmp/tmpmjv76hvm/Scribable.3484.1.pb --output=/tmp/tmpmjv76hvm/Scribable.3484.1.pb.neff --verbose=35
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Internal details:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/CommandDriver.py", line 237, in neuronxcc.driver.CommandDriver.CommandDriver.run
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1047, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 998, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1023, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1027, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/jobs/WalrusDriver.py", line 232, in neuronxcc.driver.jobs.WalrusDriver.WalrusDriver.run
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: File "neuronxcc/driver/jobs/WalrusDriver.py", line 702, in neuronxcc.driver.jobs.WalrusDriver.WalrusDriver.runSingleInput
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Version information:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: NeuronX Compiler version 2.5.0.28+1be23f232
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: HWM version 2.5.0.0-dad732dd6
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: NEFF version Dynamic
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: TVM not available
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: NumPy version 1.21.6
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: MXNet not available
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]:
2023-06-23T07:07:54Z ERROR 3231 [neuronx-cc]: Artifacts stored in: /tmp/tmpmjv76hvm/neuronxcc-z8z68fbb
Compiler status ERROR
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/gptj_demo", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptj/demo.py", line 21, in main
demo('EleutherAI/gpt-j-6B', GPTJForSampling, amp_callback)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 61, in demo
run(args, model_name, model_cls)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gpt_demo.py", line 106, in run
model.to_neuron()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptj/model.py", line 64, in to_neuron
self.program = build_gptj_program(config, 1, n_positions_list, unroll)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/gptj/model.py", line 331, in build_gptj_program
return program.FullyUnrolledDecoder(config.tp_degree, hlo_modules, buffers)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/program.py", line 95, in __init__
self.kernels = [compiler.build_parallel_kernel(hm, tp_degree) for hm in hlo_modules]
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/program.py", line 95, in <listcomp>
self.kernels = [compiler.build_parallel_kernel(hm, tp_degree) for hm in hlo_modules]
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 44, in build_parallel_kernel
kernel.build()
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 288, in build
self.neff_bytes = compile_hlo_module(self.hlo_module)
File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers_neuronx/compiler.py", line 72, in compile_hlo_module
subprocess.check_call(command_line, cwd=tmpdir)
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--framework=XLA', '--target=trn1', '/tmp/tmpmjv76hvm/Scribable.3484.1.pb', '--output=/tmp/tmpmjv76hvm/Scribable.3484.1.pb.neff', '--verbose=35']' returned non-zero exit status 1.
Hi, is there any plan to support LLaMA-based models?
The function is looking for a pytorch_model.bin file that does not exist.
Because of this, most of the tutorials are broken, for example this one: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb
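Newer Hub checkpoints often ship only model.safetensors, while the loading code here expects pytorch_model.bin. A possible workaround (a sketch assuming the transformers package; the model id and paths in the comment are illustrative) is to re-export the checkpoint with safe_serialization=False, which writes the legacy binary format:

```python
from transformers import AutoModelForCausalLM

def export_legacy_checkpoint(model_id: str, out_dir: str) -> None:
    """Re-save a checkpoint in the legacy .bin format.

    Newer checkpoints ship only model.safetensors; safe_serialization=False
    makes save_pretrained() write pytorch_model.bin instead, which is the
    file the tutorial's loading code looks for.
    """
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.save_pretrained(out_dir, safe_serialization=False)

# Illustrative usage, e.g.:
# export_legacy_checkpoint("meta-llama/Llama-2-13b-hf", "./llama-2-13b-legacy")
```

The re-exported directory can then be passed to the notebook in place of the original checkpoint path.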
Converting the loaded model using to_neuron() method takes a long time. Is there any way to Save the neuron_model on disk and load it again? This is for GPT-NeoX.
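Not an official answer, but one approach used in the AWS Neuron sample notebooks is to persist the compiled NEFF artifacts to a cache directory, so that a later instantiation reuses them instead of recompiling. The environment-variable names and the from_pretrained signature below are assumptions taken from those samples and should be checked against your SDK version:

```python
import os

# Assumed caching env vars (seen in AWS Neuron sample notebooks; verify
# against your SDK release): compiled NEFFs are dumped to / reloaded from
# this directory across process restarts.
os.environ["NEURONX_CACHE"] = "on"
os.environ["NEURONX_DUMP_TO"] = "./gptneox_neuron_artifacts"

def load_gptneox(checkpoint_dir: str, tp_degree: int = 2):
    """Compile a GPT-NeoX model for Neuron, reusing cached artifacts if present."""
    from transformers_neuronx.gptneox.model import GPTNeoXForSampling

    model = GPTNeoXForSampling.from_pretrained(
        checkpoint_dir, tp_degree=tp_degree, amp="f16"
    )
    # The first run compiles and fills the cache directory; subsequent runs
    # load the cached NEFFs, so to_neuron() becomes much faster.
    model.to_neuron()
    return model
```

Note that this caches the compiled kernels, not the sharded weights, so the original checkpoint is still needed at load time.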
With AWS Neuron SDK 2.14.1, I have reproducible core dumps during inference for batch_size=4 with the llama2 7B model.
I am using the following configuration:
| | inf2.8xlarge | inf2.48xlarge |
|-------------|--------------|---------------|
| tp_degree | 2 | 24 |
| n_positions | 2048 | 2048 |
| amp | f16 | f16 |
The inference works fine if the input context is below 1024 tokens, but a crash happens with a longer context.
2023-Sep-28 15:05:08.0675 98833:99454 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:3)
2023-Sep-28 15:05:08.0675 98833:99454 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:4)
2023-Sep-28 15:05:08.0675 98833:99454 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2023-Sep-28 15:05:08.0675 98833:99453 ERROR TDRV:exec_consume_tpb_status_notifications 2023-Sep-28 15:05:08.0675 98833:99454 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:1)
2023-Sep-28 15:05:08.0675 98833:99454 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
Missing infer_status notification: (end:1)
2023-Sep-28 15:05:08.0675 98833:99453 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:2)
2023-Sep-28 15:05:08.0675 98833:99453 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:3)
2023-Sep-28 15:05:08.0675 98833:99453 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:4)
2023-Sep-28 15:05:08.0675 98833:99453 ERROR TDRV:exec_consume_tpb_status_notifications Missing infer_status notification: (end:0)
2023-Sep-28 15:05:08.0676 98833:99454 ERROR TDRV:exec_core_dump Unable to find /opt/aws/neuron/lib/libndbg.so. Core dump will not be generated.
2023-Sep-28 15:05:08.0676 98833:99454 ERROR TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) execution timeout (30000 ms) on Neuron Device 0 NC 1, model /tmp/neuroncc_compile_workdir/a31f3317-fd60-4895-9b23-899233dd25c2/model.MODULE_402d21faaed2b99de995+360ecc97.neff, waiting for execution completion notification
2023-Sep-28 15:05:08.0676 98833:99453 ERROR TDRV:exec_core_dump Unable to find /opt/aws/neuron/lib/libndbg.so. Core dump will not be generated.
2023-Sep-28 15:05:08.0676 98833:99453 ERROR TDRV:exec_consume_infer_status_notifications (FATAL-RT-UNDEFINED-STATE) execution timeout (30000 ms) on Neuron Device 0 NC 0, model /tmp/neuroncc_compile_workdir/a31f3317-fd60-4895-9b23-899233dd25c2/model.MODULE_402d21faaed2b99de995+360ecc97.neff, waiting for execution completion notification
2023-Sep-28 15:05:08.0676 98833:99454 ERROR NMGR:dlr_infer Inference completed with err: 5
2023-Sep-28 15:05:08.0676 98833:99453 ERROR NMGR:dlr_infer Inference completed with err: 5
terminate called after throwing an instance of 'c10::Error'
what(): nrt_execute status=5
Exception raised from task at /opt/workspace/KaenaPyTorchRuntime/neuron_op/ops/tensor.cpp:845 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f04062a5457 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f040626f3ec in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: neuron::task(void*) + 0x2e9 (0x7f03ad630c19 in /home/ubuntu/.local/lib/python3.8/site-packages/torch_neuronx/lib/libtorchneuron.so)
frame #3: <unknown function> + 0x8609 (0x7f047d521609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f047d65b133 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called recursively
Aborted (core dumped)
Being able to inject long contexts at inference time is a requirement when deploying an inference endpoint: a new KV cache has to be built by stacking new requests with partially generated sequences, so the length of the input context is driven by the length of the longest sequence.
I hope this is the right place to ask this question. Let me know if I need to move to another repo.
Currently I'm using NeuronModelForCausalLM, which uses LlamaForSampling under the hood.
I have a use case where I need to be able to do the following:
I am able to do steps 1 & 2 currently using the following:
from optimum.neuron import NeuronModelForCausalLM
llama_model = NeuronModelForCausalLM.from_pretrained('aws-neuron/Llama-2-7b-chat-hf-seqlen-2048-bs-1')
embedded_tokens = llama_model.model.chkpt_model.model.embed_tokens(token_ids)
### Code to modify embedded_tokens
However, as far as I can tell, generation with these modified embeddings is not possible with llama_model.generate().
When I pass the inputs_embeds keyword argument and set input_ids=None, I get the following:
ValueError: The following `model_kwargs` are not used by the model: ['inputs_embeds']
If this is not possible with the NeuronModelForCausalLM.generate() currently, is there a way to work around this manually? If so, could you provide an example?
Thanks very much for your help!
When can we expect support for the popular open-source Vicuna 13B model?
Following the example of HuggingFaceGenerationModelAdapter, I have created a NeuronModelForCausalLM adapter class for HuggingFace optimum-neuron (see huggingface/optimum-neuron#117).
I compared the inference times for batch sizes of 2 and 16, using either the GPT2ForSampling model sample() method or transformers generate().
I also added the original pytorch model inference time using transformers generate() as a reference.
| | neuron sample | neuron generate | pytorch generate |
|---|---|---|---|
| 128 tokens, batch_size 2 | 0.5 s | 0.9 s | 4.6 s |
| 128 tokens, batch_size 16 | 0.8 s | 2.7 s | 7 s |
Going from batch size 2 to 16, the neuron model latency using generate() is multiplied by 3, whereas the other two configurations see less than a 2x increase.
Is this expected?
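For context, the measurement above wraps the compiled model with the adapter roughly as sketched below. This is a hedged sketch: it requires Neuron hardware and the transformers-neuronx package to actually run, and the from_pretrained kwargs follow the sample notebooks of this era, so they may differ across SDK versions.

```python
from transformers import AutoConfig

def build_generate_wrapper(checkpoint_dir: str, batch_size: int = 2):
    """Wrap a compiled Neuron GPT-2 model so it exposes transformers' generate().

    Requires Neuron hardware; import paths and kwargs are taken from the
    transformers-neuronx samples and may vary by SDK version.
    """
    from transformers_neuronx.gpt2.model import GPT2ForSampling
    from transformers_neuronx.generation_utils import HuggingFaceGenerationModelAdapter

    neuron_model = GPT2ForSampling.from_pretrained(
        checkpoint_dir, batch_size=batch_size, tp_degree=2, amp="f16"
    )
    neuron_model.to_neuron()
    config = AutoConfig.from_pretrained(checkpoint_dir)
    # generate() runs extra Python-side sampling logic on top of the Neuron
    # kernels, which is one plausible source of the latency gap measured above.
    return HuggingFaceGenerationModelAdapter(config, neuron_model)
```

The returned adapter can then be called like any transformers model, e.g. wrapper.generate(input_ids, max_new_tokens=128).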