Comments (16)
Hi, @jiafuzha, our NS RTN quant has some regressions which need to be fixed and aligned (for example, we quant `lm_head` and `token_embedding` for llama). We will let you know when it is fixed. Thanks.
We have fixed it in this PR: #202. Please try the newest branch.
@a32543254 It does get fixed for a single generate call. But for continuous batching in ModelServer, the issue still exists. Here is the log after running test_model_server.py:
```
=======REFERENCE RESULTS FOR COMPARISON=========
=======FOR LOOP GREEDY SEARCH GENERATION RESULTS WITH MHA==========
ARCH_REQ_XCOMP_PERM XTILE_DATA successful.
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
beam_size: 1, do_sample: 0, top_k: 40, top_p: 0.950, continuous_batching: 1, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
model.cpp: loading model from runtime_outs/ne_llama_q_int4_bestla_cint8_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 32000
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 256
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 32
load_ne_hparams 5.hparams.n_layer = 32
load_ne_hparams 6.hparams.n_rot = 128
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 0
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.par_res = 0
load_ne_hparams 12.hparams.word_embed_proj_dim = 0
load_ne_hparams 13.hparams.do_layer_norm_before = 0
load_ne_hparams 14.hparams.multi_query_group_num = 0
load_ne_hparams 15.hparams.ffn_hidden_size = 11008
load_ne_hparams 16.hparams.inner_hidden_size = 0
load_ne_hparams 17.hparams.n_experts = 0
load_ne_hparams 18.hparams.n_experts_used = 0
load_ne_hparams 19.hparams.n_embd_head_k = 0
load_ne_hparams 20.hparams.norm_eps = 0.000010
load_ne_hparams 21.hparams.freq_base = 10000.000
load_ne_hparams 22.hparams.freq_scale = 1.000
load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
load_ne_hparams 24.hparams.original_max_position_embeddings = 0
load_ne_hparams 25.hparams.use_yarn = 0
load_ne_vocab 26.vocab.bos_token_id = 1
load_ne_vocab 27.vocab.eos_token_id = 2
load_ne_vocab 28.vocab.pad_token_id = 2
load_ne_vocab 29.vocab.sep_token_id = -1
init: n_vocab = 32000
init: n_ctx = 0
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ctx size = 4427.43 MB
load: scratch0 = 4096.00 MB
load: scratch1 = 2048.00 MB
load: scratch2 = 4096.00 MB
load: mem required = 14667.43 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size = 552.00 MB
ARCH_REQ_XCOMP_PERM XTILE_DATA successful.
What's your favorite animal?
Unterscheidung between different types of animals is difficult, as different people may have different preferences and cultural backgrounds can also play a role in shaping one's preferences. However, some animals are generally considered to be popular or iconic, and these are often the ones that people mention as their favorites.
Some of the most popular animals that people tend to mention as their favorites include:
- Dogs: Many people consider dogs to be their favorite animals, and it's not hard to see why. Dogs are known for their loyalty, affection, and playful nature, making them
================================
=======FOR LOOP BEAM SEARCH GENERATION RESULTS WITH MHA==========
Will start to reinit model from bin due to different max request num.
beam_size: 4, do_sample: 0, top_k: 40, top_p: 0.950, continuous_batching: 1, max_request_num: 1, early_stopping: 1, scratch_size_ratio: 1.000
model.cpp: loading model from runtime_outs/ne_llama_q_int4_bestla_cint8_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 32000
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 256
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 32
load_ne_hparams 5.hparams.n_layer = 32
load_ne_hparams 6.hparams.n_rot = 128
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 0
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.par_res = 0
load_ne_hparams 12.hparams.word_embed_proj_dim = 0
load_ne_hparams 13.hparams.do_layer_norm_before = 0
load_ne_hparams 14.hparams.multi_query_group_num = 0
load_ne_hparams 15.hparams.ffn_hidden_size = 11008
load_ne_hparams 16.hparams.inner_hidden_size = 0
load_ne_hparams 17.hparams.n_experts = 0
load_ne_hparams 18.hparams.n_experts_used = 0
load_ne_hparams 19.hparams.n_embd_head_k = 0
load_ne_hparams 20.hparams.norm_eps = 0.000010
load_ne_hparams 21.hparams.freq_base = 10000.000
load_ne_hparams 22.hparams.freq_scale = 1.000
load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
load_ne_hparams 24.hparams.original_max_position_embeddings = 0
load_ne_hparams 25.hparams.use_yarn = 0
load_ne_vocab 26.vocab.bos_token_id = 1
load_ne_vocab 27.vocab.eos_token_id = 2
load_ne_vocab 28.vocab.pad_token_id = 2
load_ne_vocab 29.vocab.sep_token_id = -1
init: n_vocab = 32000
init: n_ctx = 0
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ctx size = 4427.43 MB
load: scratch0 = 16384.00 MB
load: scratch1 = 8192.00 MB
load: scratch2 = 16384.00 MB
load: mem required = 45387.43 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size = 2208.00 MB
What's your favorite animal? �������������������������������������������������������������������������������������������������������������������������������
```
Hi, @jiafuzha, sorry for the late response.

1. The `�` in your `test_model_server.py` script is not related to `cont-batching` or `ModelServer`. It just uses a different `num_beams` (4) compared to your first "single generate call", and in fact it is still a "single generate call".
2. What does the `�` mean? I reproduced your issue with `num_beams=4, do_sample=False, max_new_token=10`. The generated tokens (with prompt) are `[[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 179, 243, 162, 147, 185, 243]]`. Let's pick the last token, `243`: it maps to the byte token `<0xF0>` (from the llama2 tokenizer.json). So it seems to be a hexadecimal (byte-level) representation; however, I'm not familiar with it, so I don't know why these hexadecimal representations exist. (See the decoding sketch after this list.)
3. Is it caused by our C++ `beam search`, `model_eval`, or just the `model` itself?

   - Yes, our C++ `beam_search` is not exactly the same as `transformers`, but the results should not differ much since we refer to their Python implementation. For example, you can check the `beam search` results between PyTorch FP32 and NS FP32. Env: `INTEL(R) XEON(R) PLATINUM 8580`, latest `NS` and `ITREX` (both built from source). Remember to clean up the `runtime_outs` folder when you change quant-related args.

     PyTorch:

     ```python
     from intel_extension_for_transformers.transformers import AutoModelForCausalLM
     model = AutoModelForCausalLM.from_pretrained(model_name, use_neural_speed=False, trust_remote_code=True).eval()
     generate_ids = model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)[0]
     print(generate_ids)
     print(tokenizer.decode(generate_ids, skip_special_tokens=True))
     ```

     And it outputs:

     ```
     tensor([ 1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 185, 243, 162, 147, 180, 243])
     What's your favorite animal? ���������
     ```

     `NS`:

     ```python
     model.init(model_name, use_quant=False)
     # ...same code as above
     ```

     And it outputs:

     ```
     [[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 185, 243, 162, 147, 180, 243]]
     What's your favorite animal? ���������
     ```

     They are the same! And the FP32 model outputs `�` too (maybe llama2 hallucinates when it meets your prompt...).

   - Use the `ITREX` RTN algo instead of `NS` to quant the model and generate via `transformers`. You can refer to this example for how to quant and save a low-bit model from `ITREX`. The quant cmd is: `python run_generation.py --model xxx --woq --woq_algo Rtn --bits 4 --weight_dtype int4_clip --compute_dtype int8 --group_size 32 --benchmark`. Once it finishes, you will see the low-bit model in the `saved_results` folder. After running:

     ```python
     from intel_extension_for_transformers.transformers import AutoModelForCausalLM
     model = AutoModelForCausalLM.from_pretrained(model_name, use_neural_speed=False, trust_remote_code=True).eval()
     generate_ids = model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)[0]
     print(tokenizer.decode(generate_ids, skip_special_tokens=True))
     ```

     you will see:

     ```
     What's your favorite animal? ���������
     ```

   - Change the `RTN` quant args. Let's use per-channel this time; the Python cmd is `model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8", group_size=-1)` (see the group-size sketch at the end of this comment). The output is `What's your favorite animal? Why? (Submitted 10:`, which seems a bit more reasonable.
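For illustration (this sketch is my own, not part of the original reply), here is why the `�` appear, assuming the standard llama2 vocabulary layout where the byte-fallback tokens `<0x00>`..`<0xFF>` occupy ids 3..258 (so byte value = token id - 3):

```python
# Byte-fallback ids from the generation above (the run after the space token 29871).
# Assumes llama2's <0x00>..<0xFF> tokens sit at vocabulary ids 3..258.
ids = [243, 162, 147, 179, 243, 162, 147, 185, 243]
raw = bytes(i - 3 for i in ids)                 # b'\xf0\x9f\x90\xb0\xf0\x9f\x90\xb6\xf0'
print(raw.decode("utf-8", errors="replace"))    # 🐰🐶� -- the last 0xF0 is a cut-off emoji

# A detokenizer that decodes each byte token on its own (no byte buffering)
# produces one U+FFFD '�' per token, i.e. the garbled runs seen in the logs:
print("".join(bytes([i - 3]).decode("utf-8", errors="replace") for i in ids))
```

So the model is emitting emoji byte by byte; the `�` come from rendering those bytes individually, or from a sequence truncated mid-emoji by `max_new_tokens`.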
So I think this issue is more of a model-related problem (RTN quantization, hallucination, etc.). If you still meet this generation problem after trying more models or more quant algorithms (GPTQ, AWQ, AutoRound), please let me know. Thanks.
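As referenced in the per-channel bullet above, a toy sketch (my own illustration, not NS/ITREX code) of what `group_size` controls in RTN weight-only quantization: with `group_size=32`, every 32 weights along the input dimension share one scale, while `group_size=-1` means per-channel, i.e. one scale per output row:

```python
import numpy as np

def rtn_quant_int4(w: np.ndarray, group_size: int) -> np.ndarray:
    """Toy symmetric round-to-nearest int4 quantization of a [out, in] weight
    matrix, returning the dequantized weights. group_size=-1 -> per-channel."""
    out_ch, in_ch = w.shape
    gs = in_ch if group_size == -1 else group_size
    wg = w.reshape(out_ch, in_ch // gs, gs)                # one scale per group
    scale = np.abs(wg).max(axis=-1, keepdims=True) / 7.0   # symmetric int4 grid [-7, 7]
    q = np.clip(np.round(wg / scale), -7, 7)
    return (q * scale).reshape(out_ch, in_ch)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
for gs in (32, -1):
    err = np.abs(rtn_quant_int4(w, gs) - w).mean()
    print(f"group_size={gs}: mean abs round-trip error = {err:.4f}")
```

Smaller groups track local weight ranges more closely and usually give lower round-trip error, at the cost of storing more scales; per-channel is the coarsest setting.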
@zhentaoyu thanks for the detailed response. I have some new findings to share:
- I am able to get the correct result after changing `max_new_tokens` from 10 to 50, with both vanilla transformers and ITREX:

  ```
  What's your favorite animal? 🐰🐶🐱🐷
  My favorite animal is the penguin! 🐧 I think they're so cute and funny, and they're great
  ```

  tokens:

  ```
  tensor([    1,  1724, 29915, 29879,   596, 25448, 13019, 29973, 29871,   243,
            162,   147,   179,   243,   162,   147,   185,   243,   162,   147,
            180,   243,   162,   147,   186,    13,    13,  3421, 25448, 13019,
            338,   278,   282, 19636,   262, 29991, 29871,   243,   162,   147,
            170,   306,  1348,   896, 29915,   276,   577,   274,  1082,   322,
           2090,  1460, 29892,   322,   896, 29915,   276,  2107])
  ```
- With neural-speed, however, I still get garbled characters. After checking the token IDs, I found most of the tokens are just repeating themselves. Do you think it's related to the lack of repetition penalty in NS?

  ```
  [1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 179, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186, 243, 162, 147, 183, 243, 162, 147, 184, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186, 243, 162, 147, 183, 243, 162, 147, 184, 243, 162, 147, 185, 243]
  ```
By the way, another case of garbled characters is with the prompt "What's your favorite food?".

NS:

```
[1, 1724, 29915, 29879, 596, 25448, 9687, 29973, 29871, 243, 162, 144, 151, 243, 162, 144, 162, 243, 162, 168, 167, 243, 162, 143, 177, 243, 162, 144, 152, 243, 162, 168, 171, 243, 162, 143, 177, 243, 162, 144, 151, 243, 162, 144, 162, 243, 162, 168, 167, 243, 162, 143, 177, 243, 162, 144, 152, 243]
What's your favorite food? �������������������������������������������������
```

vanilla transformers:

```
tensor([    1,  1724, 29915, 29879,   596, 25448,  9687, 29973,    13,    13,
         3421, 25448,  9687,   338,   282, 24990, 29889,   306,  5360,   278,
        10296,   310,   278,  2181,   275,  2272,  2181,   504, 29892, 18806,
        29891,  6454,  1219, 12507,   346, 29892,   322,   286,  2152,   287,
          286,  2112, 29920,   598, 13520,   923,   968, 29889,   739, 29915,
        29879,   278,  4922, 13016,  9687, 29889,    13,    13])
What's your favorite food?

My favorite food is pizza. I love the combination of the crispy crust, tangy tomato sauce, and melted mozzarella cheese. It's the perfect comfort food.
```
- Are the NS results from RTN quant or FP32? The RTN-quantized model may have bad chat quality.
- `beam search` in NS has no `repetition penalty`; it only has a `length_penalty` (prefer longer or shorter sequence results). See the sketch below for what a repetition penalty does.
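For context, a minimal sketch of a repetition penalty, mirroring the logic of transformers' `RepetitionPenaltyLogitsProcessor` (this is an illustration, not NS code):

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor, generated_ids: torch.Tensor,
                             penalty: float = 1.2) -> torch.Tensor:
    """Push down logits of tokens that were already generated: positive
    logits are divided by `penalty`, negative ones multiplied, so repeats
    (like the looping emoji byte tokens above) become less likely."""
    scores = torch.gather(logits, 1, generated_ids)
    scores = torch.where(scores < 0, scores * penalty, scores / penalty)
    return logits.scatter(1, generated_ids, scores)

logits = torch.randn(1, 32000)                  # one decode step, llama2-sized vocab
history = torch.tensor([[243, 162, 147, 179]])  # ids generated so far
logits = apply_repetition_penalty(logits, history)
```

Without such a step, nothing in the beam-search scoring discourages a loop of identical byte tokens; `length_penalty` only rescales finished hypotheses by their length.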
> Are the NS results from RTN quant or FP32? The RTN-quantized model may have bad chat quality.

The NS result is from `model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8")`.
I see. You can use `model.init(model_name, use_quant=False)` to compare with your vanilla transformers results.
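A minimal sketch of that comparison (assuming the `neural_speed.Model` Python API and the model/prompt used earlier in this thread; treat the exact argument names as assumptions and check the NS docs):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_speed import Model  # assumed NS Python entry point

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokens = tokenizer("What's your favorite animal?", return_tensors="pt").input_ids

# Neural Speed, FP32 (no quantization)
ns_model = Model()
ns_model.init(model_name, use_quant=False)
ns_ids = ns_model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)

# vanilla transformers, FP32
hf_model = AutoModelForCausalLM.from_pretrained(model_name).eval()
hf_ids = hf_model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)

print(ns_ids[0])           # NS returns a list of token-id lists
print(hf_ids[0].tolist())  # should match if both run FP32 beam search
```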
Yes, with FP32, I can get the correct result from NS.

I also tried the code below from https://huggingface.co/docs/transformers/main/en/quantization. It is also weight-only quant and gives me the correct result.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)
```
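A possible follow-up (standard `transformers` API; the prompt here is illustrative) to actually generate with the quantized model:

```python
# Tokenize a prompt, run beam search on the quantized model, decode the result.
inputs = tokenizer("What's your favorite food?", return_tensors="pt").to("cuda:0")
output_ids = quantized_model.generate(**inputs, num_beams=4, do_sample=False, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```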
> I also tried the code below from https://huggingface.co/docs/transformers/main/en/quantization. It is also weight-only quant and gives me the correct result.

Hi, @jiafuzha, that's a different model_id and weight dtype.

@a32543254 Does NS have some difference in RTN quant compared to ITREX? I found that the pipeline `ITREX RTN QUANT -> NS LOAD -> NS BEAM SEARCH` gives more reasonable results. The ITREX RTN quant follows this example, and with `max_new_tokens=50` the result is like: `What's your favorite animal? 🐰🐶🐱🐷 everybody loves animals, and there are so many amazing creatures to choose from! 😍 whether you're a cat person, a`
> Hi, @jiafuzha, that's a different model_id and weight dtype.
Sorry, I copied the wrong code. I was actually using:

```python
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
p = "What's your favorite food?"
quantization_config = QuantoConfig(weights="int4")
...
```

I got:

```
tensor([    1,  1724, 29915, 29879,   596, 25448,  9687, 29973, 26833,   338,
          282, 24990, 29991, 29871,   243,   162,   144,   152,   243,   162,
          148,   143,   396,  1181,   397,   347,   396, 29886, 24990,   396,
        29891,   398,     2])
What's your favorite food? Mine is pizza! 🍕👌 #foodie #pizza #yum
```
@zhentaoyu @a32543254 any more comments?
any update on this?
Hi, @jiafuzha, sorry for the late response. We are tied up with other things at the moment. We will dig into it and let you know if we have any findings. Thanks a lot.
> Hi, @jiafuzha, sorry for the late response. We are tied up with other things at the moment. We will dig into it and let you know if we have any findings. Thanks a lot.

No worries, looking forward to your fix.