Comments (16)
Hi, @jiafuzha, our NS RTN quant has some regressions which need to be fixed and aligned (for example, we quant `lm_head` and `token_embedding` for llama). We will let you know when it is fixed. Thanks.
We have fixed it in this PR: #202. Please try the newest branch.
@a32543254 It does get fixed for a single generate call. But for continuous batching in ModelServer, the issue still exists. Here is the log after running test_model_server.py:
```
=======REFERENCE RESULTS FOR COMPARISON=========
=======FOR LOOP GREEDY SEARCH GENERATION RESULTS WITH MHA==========
ARCH_REQ_XCOMP_PERM XTILE_DATA successful.
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
beam_size: 1, do_sample: 0, top_k: 40, top_p: 0.950, continuous_batching: 1, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
model.cpp: loading model from runtime_outs/ne_llama_q_int4_bestla_cint8_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 32000
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 256
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 32
load_ne_hparams 5.hparams.n_layer = 32
load_ne_hparams 6.hparams.n_rot = 128
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 0
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.par_res = 0
load_ne_hparams 12.hparams.word_embed_proj_dim = 0
load_ne_hparams 13.hparams.do_layer_norm_before = 0
load_ne_hparams 14.hparams.multi_query_group_num = 0
load_ne_hparams 15.hparams.ffn_hidden_size = 11008
load_ne_hparams 16.hparams.inner_hidden_size = 0
load_ne_hparams 17.hparams.n_experts = 0
load_ne_hparams 18.hparams.n_experts_used = 0
load_ne_hparams 19.hparams.n_embd_head_k = 0
load_ne_hparams 20.hparams.norm_eps = 0.000010
load_ne_hparams 21.hparams.freq_base = 10000.000
load_ne_hparams 22.hparams.freq_scale = 1.000
load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
load_ne_hparams 24.hparams.original_max_position_embeddings = 0
load_ne_hparams 25.hparams.use_yarn = 0
load_ne_vocab 26.vocab.bos_token_id = 1
load_ne_vocab 27.vocab.eos_token_id = 2
load_ne_vocab 28.vocab.pad_token_id = 2
load_ne_vocab 29.vocab.sep_token_id = -1
init: n_vocab = 32000
init: n_ctx = 0
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ctx size = 4427.43 MB
load: scratch0 = 4096.00 MB
load: scratch1 = 2048.00 MB
load: scratch2 = 4096.00 MB
load: mem required = 14667.43 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size = 552.00 MB
ARCH_REQ_XCOMP_PERM XTILE_DATA successful.
What's your favorite animal?
Unterscheidung between different types of animals is difficult, as different people may have different preferences and cultural backgrounds can also play a role in shaping one's preferences. However, some animals are generally considered to be popular or iconic, and these are often the ones that people mention as their favorites.
Some of the most popular animals that people tend to mention as their favorites include:
- Dogs: Many people consider dogs to be their favorite animals, and it's not hard to see why. Dogs are known for their loyalty, affection, and playful nature, making them
================================
=======FOR LOOP BEAM SEARCH GENERATION RESULTS WITH MHA==========
Will start to reinit model from bin due to different max request num.
beam_size: 4, do_sample: 0, top_k: 40, top_p: 0.950, continuous_batching: 1, max_request_num: 1, early_stopping: 1, scratch_size_ratio: 1.000
model.cpp: loading model from runtime_outs/ne_llama_q_int4_bestla_cint8_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 32000
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 256
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 32
load_ne_hparams 5.hparams.n_layer = 32
load_ne_hparams 6.hparams.n_rot = 128
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 0
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.par_res = 0
load_ne_hparams 12.hparams.word_embed_proj_dim = 0
load_ne_hparams 13.hparams.do_layer_norm_before = 0
load_ne_hparams 14.hparams.multi_query_group_num = 0
load_ne_hparams 15.hparams.ffn_hidden_size = 11008
load_ne_hparams 16.hparams.inner_hidden_size = 0
load_ne_hparams 17.hparams.n_experts = 0
load_ne_hparams 18.hparams.n_experts_used = 0
load_ne_hparams 19.hparams.n_embd_head_k = 0
load_ne_hparams 20.hparams.norm_eps = 0.000010
load_ne_hparams 21.hparams.freq_base = 10000.000
load_ne_hparams 22.hparams.freq_scale = 1.000
load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
load_ne_hparams 24.hparams.original_max_position_embeddings = 0
load_ne_hparams 25.hparams.use_yarn = 0
load_ne_vocab 26.vocab.bos_token_id = 1
load_ne_vocab 27.vocab.eos_token_id = 2
load_ne_vocab 28.vocab.pad_token_id = 2
load_ne_vocab 29.vocab.sep_token_id = -1
init: n_vocab = 32000
init: n_ctx = 0
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ctx size = 4427.43 MB
load: scratch0 = 16384.00 MB
load: scratch1 = 8192.00 MB
load: scratch2 = 16384.00 MB
load: mem required = 45387.43 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size = 2208.00 MB
What's your favorite animal? �������������������������������������������������������������������������������������������������������������������������������
```
Hi, @jiafuzha, sorry for the late response.

1. The `�` in your `test_model_server.py` script is not related to `cont-batching` or `ModelServer`. It just uses a different `num_beams` (4) compared to your first "single generate call", and in fact it is still a "single generate call".
2. What does the `�` mean? I reproduced your issue with `num_beams=4, do_sample=False, max_new_token=10`. The generated tokens (with prompt) are `[[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 179, 243, 162, 147, 185, 243]]`. Let's pick the last token, `243`: it maps to the byte token `<0xF0>` (from the llama2 tokenizer.json). So it seems to be a hexadecimal (byte-level) representation; however, I'm not familiar with it, so I don't know why these hexadecimal representations exist. (See the decoding sketch after this list.)
3. Is it caused by our C++ `beam search`, `model_eval`, or just the `model` itself?

   - Yes, our C++ `beam_search` is not exactly the same as `transformers`, but the results should not differ much since we refer to their Python implementation. For example, you can check the `beam search` results between PyTorch FP32 and NS FP32. Env: `INTEL(R) XEON(R) PLATINUM 8580`, latest `NS` and `ITREX` (both built from source). Remember to clean up the `runtime_outs` folder when you change quant-related args.

     PyTorch:

     ```python
     from intel_extension_for_transformers.transformers import AutoModelForCausalLM
     model = AutoModelForCausalLM.from_pretrained(model_name, use_neural_speed=False, trust_remote_code=True).eval()
     generate_ids = model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)[0]
     print(generate_ids)
     print(tokenizer.decode(generate_ids, skip_special_tokens=True))
     ```

     And it outputs:

     ```
     tensor([ 1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 185, 243, 162, 147, 180, 243])
     What's your favorite animal? ���������
     ```

     `NS`:

     ```python
     model.init(model_name, use_quant=False)
     # ...same code as above
     ```

     And it outputs:

     ```
     [[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 185, 243, 162, 147, 180, 243]]
     What's your favorite animal? ���������
     ```

     They are the same! And the FP32 model outputs `�` too (maybe llama2 hallucinates when it meets your prompt...).

   - Use the `ITREX` RTN algo instead of `NS` to quant the model and generate via `transformers`. You can refer to this example for how to quant and save a low-bit model from `ITREX`. The quant cmd is: `python run_generation.py --model xxx --woq --woq_algo Rtn --bits 4 --weight_dtype int4_clip --compute_dtype int8 --group_size 32 --benchmark`. Once it finishes, you will see the low-bit model in the `saved_results` folder. After running:

     ```python
     from intel_extension_for_transformers.transformers import AutoModelForCausalLM
     model = AutoModelForCausalLM.from_pretrained(model_name, use_neural_speed=False, trust_remote_code=True).eval()
     generate_ids = model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)[0]
     print(tokenizer.decode(generate_ids, skip_special_tokens=True))
     ```

     you will see:

     ```
     What's your favorite animal? ���������
     ```

   - Change the `RTN` quant args. Let's use per-channel this time; the Python cmd is `model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8", group_size=-1)` (see the group-size sketch at the end of this comment). The output is `What's your favorite animal? Why? (Submitted 10:`, which seems a bit more reasonable.
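For illustration (this sketch is my own, not part of the original reply), here is why the `�` appear, assuming the standard llama2 vocabulary layout where the byte-fallback tokens `<0x00>`..`<0xFF>` occupy ids 3..258 (so byte value = token id - 3):

```python
# Byte-fallback ids from the generation above (the run after the space token 29871).
# Assumes llama2's <0x00>..<0xFF> tokens sit at vocabulary ids 3..258.
ids = [243, 162, 147, 179, 243, 162, 147, 185, 243]
raw = bytes(i - 3 for i in ids)                 # b'\xf0\x9f\x90\xb0\xf0\x9f\x90\xb6\xf0'
print(raw.decode("utf-8", errors="replace"))    # 🐰🐶� -- the last 0xF0 is a cut-off emoji

# A detokenizer that decodes each byte token on its own (no byte buffering)
# produces one U+FFFD '�' per token, i.e. the garbled runs seen in the logs:
print("".join(bytes([i - 3]).decode("utf-8", errors="replace") for i in ids))
```

So the model is emitting emoji byte by byte; the `�` come from rendering those bytes individually, or from a sequence truncated mid-emoji by `max_new_tokens`.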
So I think this issue is more of a model-related problem (RTN quantization, hallucination, etc.). If you still meet this generation problem after trying more models or more quant algorithms (GPTQ, AWQ, AutoRound), please let me know. Thanks.
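As referenced in the per-channel bullet above, a toy sketch (my own illustration, not NS/ITREX code) of what `group_size` controls in RTN weight-only quantization: with `group_size=32`, every 32 weights along the input dimension share one scale, while `group_size=-1` means per-channel, i.e. one scale per output row:

```python
import numpy as np

def rtn_quant_int4(w: np.ndarray, group_size: int) -> np.ndarray:
    """Toy symmetric round-to-nearest int4 quantization of a [out, in] weight
    matrix, returning the dequantized weights. group_size=-1 -> per-channel."""
    out_ch, in_ch = w.shape
    gs = in_ch if group_size == -1 else group_size
    wg = w.reshape(out_ch, in_ch // gs, gs)                # one scale per group
    scale = np.abs(wg).max(axis=-1, keepdims=True) / 7.0   # symmetric int4 grid [-7, 7]
    q = np.clip(np.round(wg / scale), -7, 7)
    return (q * scale).reshape(out_ch, in_ch)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
for gs in (32, -1):
    err = np.abs(rtn_quant_int4(w, gs) - w).mean()
    print(f"group_size={gs}: mean abs round-trip error = {err:.4f}")
```

Smaller groups track local weight ranges more closely and usually give lower round-trip error, at the cost of storing more scales; per-channel is the coarsest setting.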
@zhentaoyu thanks for the detailed response. I have some new findings to share:
- I am able to get the correct result after changing `max_new_tokens` from 10 to 50, with both vanilla transformers and ITREX:

  ```
  What's your favorite animal? 🐰🐶🐱🐷
  My favorite animal is the penguin! 🐧 I think they're so cute and funny, and they're great
  ```

  tokens:

  ```
  tensor([    1,  1724, 29915, 29879,   596, 25448, 13019, 29973, 29871,   243,
            162,   147,   179,   243,   162,   147,   185,   243,   162,   147,
            180,   243,   162,   147,   186,    13,    13,  3421, 25448, 13019,
            338,   278,   282, 19636,   262, 29991, 29871,   243,   162,   147,
            170,   306,  1348,   896, 29915,   276,   577,   274,  1082,   322,
           2090,  1460, 29892,   322,   896, 29915,   276,  2107])
  ```
- With neural-speed, however, I still get garbled characters. After checking the token IDs, I found most of the tokens are just repeating themselves. Do you think it's related to the lack of repetition penalty in NS?

  ```
  [1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 179, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186, 243, 162, 147, 183, 243, 162, 147, 184, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186, 243, 162, 147, 183, 243, 162, 147, 184, 243, 162, 147, 185, 243]
  ```
By the way, another case of garbled characters is with the prompt "What's your favorite food?".

NS:

```
[1, 1724, 29915, 29879, 596, 25448, 9687, 29973, 29871, 243, 162, 144, 151, 243, 162, 144, 162, 243, 162, 168, 167, 243, 162, 143, 177, 243, 162, 144, 152, 243, 162, 168, 171, 243, 162, 143, 177, 243, 162, 144, 151, 243, 162, 144, 162, 243, 162, 168, 167, 243, 162, 143, 177, 243, 162, 144, 152, 243]
What's your favorite food? �������������������������������������������������
```

vanilla transformers:

```
tensor([    1,  1724, 29915, 29879,   596, 25448,  9687, 29973,    13,    13,
         3421, 25448,  9687,   338,   282, 24990, 29889,   306,  5360,   278,
        10296,   310,   278,  2181,   275,  2272,  2181,   504, 29892, 18806,
        29891,  6454,  1219, 12507,   346, 29892,   322,   286,  2152,   287,
          286,  2112, 29920,   598, 13520,   923,   968, 29889,   739, 29915,
        29879,   278,  4922, 13016,  9687, 29889,    13,    13])
What's your favorite food?

My favorite food is pizza. I love the combination of the crispy crust, tangy tomato sauce, and melted mozzarella cheese. It's the perfect comfort food.
```
- Are the NS results from RTN quant or FP32? The RTN-quantized model may have bad chat quality.
- `beam search` in NS has no `repetition penalty`; it only has a `length_penalty` (prefer longer or shorter sequence results). See the sketch below for what a repetition penalty does.
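For context, a minimal sketch of a repetition penalty, mirroring the logic of transformers' `RepetitionPenaltyLogitsProcessor` (this is an illustration, not NS code):

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor, generated_ids: torch.Tensor,
                             penalty: float = 1.2) -> torch.Tensor:
    """Push down logits of tokens that were already generated: positive
    logits are divided by `penalty`, negative ones multiplied, so repeats
    (like the looping emoji byte tokens above) become less likely."""
    scores = torch.gather(logits, 1, generated_ids)
    scores = torch.where(scores < 0, scores * penalty, scores / penalty)
    return logits.scatter(1, generated_ids, scores)

logits = torch.randn(1, 32000)                  # one decode step, llama2-sized vocab
history = torch.tensor([[243, 162, 147, 179]])  # ids generated so far
logits = apply_repetition_penalty(logits, history)
```

Without such a step, nothing in the beam-search scoring discourages a loop of identical byte tokens; `length_penalty` only rescales finished hypotheses by their length.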
> Are the NS results from RTN quant or FP32? The RTN-quantized model may have bad chat quality.

The NS result is from `model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8")`.
I see. You can use `model.init(model_name, use_quant=False)` to compare with your vanilla transformers results.
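A minimal sketch of that comparison (assuming the `neural_speed.Model` Python API and the model/prompt used earlier in this thread; treat the exact argument names as assumptions and check the NS docs):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_speed import Model  # assumed NS Python entry point

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokens = tokenizer("What's your favorite animal?", return_tensors="pt").input_ids

# Neural Speed, FP32 (no quantization)
ns_model = Model()
ns_model.init(model_name, use_quant=False)
ns_ids = ns_model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)

# vanilla transformers, FP32
hf_model = AutoModelForCausalLM.from_pretrained(model_name).eval()
hf_ids = hf_model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)

print(ns_ids[0])           # NS returns a list of token-id lists
print(hf_ids[0].tolist())  # should match if both run FP32 beam search
```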
Yes, with FP32, I can get the correct result from NS.

I also tried the code below from https://huggingface.co/docs/transformers/main/en/quantization. It is also weight-only quant and gives me the correct result.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)
```
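A possible follow-up (standard `transformers` API; the prompt here is illustrative) to actually generate with the quantized model:

```python
# Tokenize a prompt, run beam search on the quantized model, decode the result.
inputs = tokenizer("What's your favorite food?", return_tensors="pt").to("cuda:0")
output_ids = quantized_model.generate(**inputs, num_beams=4, do_sample=False, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```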
> I also tried the code below from https://huggingface.co/docs/transformers/main/en/quantization. It is also weight-only quant and gives me the correct result.

Hi, @jiafuzha, that's a different model_id and weight dtype.

@a32543254 Does NS have some difference in RTN quant compared to ITREX? I found that the pipeline `ITREX RTN QUANT -> NS LOAD -> NS BEAM SEARCH` gives more reasonable results. The ITREX RTN quant follows this example, and with `max_new_tokens=50` the result is like: `What's your favorite animal? 🐰🐶🐱🐷 everybody loves animals, and there are so many amazing creatures to choose from! 😍 whether you're a cat person, a`
> Hi, @jiafuzha, that's a different model_id and weight dtype.
Sorry, I copied the wrong code. I was actually using:

```python
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
p = "What's your favorite food?"
quantization_config = QuantoConfig(weights="int4")
...
```

I got:

```
tensor([    1,  1724, 29915, 29879,   596, 25448,  9687, 29973, 26833,   338,
          282, 24990, 29991, 29871,   243,   162,   144,   152,   243,   162,
          148,   143,   396,  1181,   397,   347,   396, 29886, 24990,   396,
        29891,   398,     2])
What's your favorite food? Mine is pizza! 🍕👌 #foodie #pizza #yum
```
@zhentaoyu @a32543254 any more comments?
any update on this?
Hi, @jiafuzha, sorry for the late response. We are tied up with other things at the moment. We will dig into it and let you know if we have any findings. Thanks a lot.
> Hi, @jiafuzha, sorry for the late response. We are tied up with other things at the moment. We will dig into it and let you know if we have any findings. Thanks a lot.

No worries, looking forward to your fix.