Comments (7)
Yes, it is supported, but only for a few model architectures. Please refer to https://github.com/intel/neural-speed/blob/main/docs/continuous_batching.md
Hi, @QuPengfei, if you have no other questions, we will close this issue. Thanks.
Could you please share the performance testing method for multi-batch inference?
Here is the script that I used:
import argparse
from pathlib import Path
from typing import List, Optional
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig

def main(args_in: Optional[List[str]] = None) -> None:
    parser = argparse.ArgumentParser(description="Run multi-batch inference with an NE compatible model")
    parser.add_argument("--model_path", type=Path,
                        help="model path, local or from HF", default="meta-llama/Llama-2-7b-hf")
    parser.add_argument("--prompt", type=str, help="input prompt",
                        default="Once upon a time, there existed a little girl,")
    parser.add_argument("--not_quant", action="store_false",
                        help="whether to use a model with low-bit quantization")
    parser.add_argument("--weight_dtype", type=str,
                        help="output weight type, default: int4; int4, int8, nf4 and others are supported",
                        default="int4")
    parser.add_argument("--compute_dtype", type=str, help="compute type", default="int8")
    parser.add_argument("--group_size", type=int, help="group size", default=128)
    parser.add_argument("--use_gptq", action="store_true")
    parser.add_argument("--n_ctx", type=int, help="n_ctx", default=512)
    parser.add_argument("--max_new_tokens", type=int, help="max_new_tokens", default=300)
    parser.add_argument("--batch_size", type=int, help="number of copies of the prompt in one batch", default=1)
    args = parser.parse_args(args_in)

    model_name = args.model_path
    woq_config = RtnConfig(load_in_4bit=True, use_quant=args.not_quant,
                           weight_dtype=args.weight_dtype, compute_dtype=args.compute_dtype,
                           group_size=args.group_size, use_gptq=args.use_gptq)

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    # Replicate the same prompt batch_size times to build the batch.
    prompt = args.prompt
    prompts = [prompt for _ in range(args.batch_size)]
    inputs = tokenizer(prompts, return_tensors="pt").input_ids

    model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config,
                                                 trust_remote_code=True)
    outputs = model.generate(inputs, pad_token=tokenizer.eos_token_id,
                             ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens)

if __name__ == "__main__":
    main()
And the cmd is:
OMP_NUM_THREADS=56 numactl -m 1 -C 56-111 python run_inference_mutil_batch.py \
--model_path meta-llama/Llama-2-7b-hf \
--prompt "Once upon a time, there was a little girl. She was born in the midst of winter, a little shy of three weeks into the New year." \
--max_new_tokens 32 \
--group_size 128 \
--batch_size 8
Here are the results I obtained on the Intel SPR 8480+ CPU with ITREX settings: batch_size=8, weight_dtype=int4, compute_dtype=int8, group_size=32.
First-token latency: 489ms
Next-token latency: 104ms
In comparison, the results from IPEX are:
First-token latency: 252ms
Next-token latency: 52ms
Is it normal to see a performance difference between ITREX and IPEX, or is there an issue with my testing method?
Hi, @hezhiqian01, the first-token and next-token latencies you get from neural_speed are batched-token latencies; see https://github.com/intel/neural-speed/blob/main/neural_speed/models/llama/llama.cpp#L777-L785.
In your example, the first-token latency of 489ms is the inference time for 8 * prompt_length tokens, and the next-token latency of 104ms is the inference time for 8 * 1 tokens (the eval log from NEURAL_SPEED_VERBOSE=1 reports an inaccurate number of runs). I don't know how IPEX defines its first-token/next-token latency for batched inference. Can you compare the model.generate duration between neural-speed and IPEX in your local env?
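For instance, a minimal way to compare the end-to-end duration could look like the sketch below; the timing code is my own, and model, tokenizer, inputs, and args are assumed to come from the script shared earlier in this thread (the same wrapper works for the IPEX model).

import time

# Time one end-to-end generate call; run the same snippet against the IPEX
# model to compare total generation time instead of the per-token verbose logs.
start = time.perf_counter()
outputs = model.generate(inputs, pad_token=tokenizer.eos_token_id,
                         ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens)
print(f"model.generate took {time.perf_counter() - start:.3f}s")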
Hi @zhentaoyu ,
Thanks for your reply!
Since the latency measurement method is confusing, let's look at throughput instead.
In my case, since the input for each batch is the same, the throughput is measured as:
fps = args.batch_size * max_new_tokens / total_used_time
| | batch_size=1 | batch_size=2 | batch_size=4 | batch_size=8 | batch_size=16 | batch_size=32 |
|---|---|---|---|---|---|---|
| ITREX (tokens/s) | 38 | 58 | 76 | 71 | 86 | 68 |
| IPEX (tokens/s) | 29 | 52 | 78 | 133 | 135 | 164 |
There is still a performance gap between ITREX and IPEX.
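A minimal sketch of how the ITREX row of the table could be produced, assuming the model, tokenizer, prompt, and args from the script above (the loop and timing variables are my own):

import time

# Sweep batch sizes with the same prompt replicated in each batch.
for batch_size in [1, 2, 4, 8, 16, 32]:
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt").input_ids
    start = time.perf_counter()
    model.generate(inputs, pad_token=tokenizer.eos_token_id,
                   ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens)
    total_used_time = time.perf_counter() - start
    # Assumes every sequence generates exactly max_new_tokens tokens.
    fps = batch_size * args.max_new_tokens / total_used_time
    print(f"batch_size={batch_size}: {fps:.1f} tokens/s")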
Hi @hezhiqian01, thanks for your benchmarking.
Our batching approach is like a continuous batching mechanism: we merge all bs prompts into one sequence for GEMM inference, except for self-attention and RoPE (computed per sequence in a for-loop). For more details, please refer to this doc. This approach avoids padding overhead offline and improves efficiency in serving.
In your example, none of the bs sequences has padding tokens since they are all the same prompt, so IPEX's bs-parallel MHA kernel accelerates inference, while neural speed does not do that in this scenario. Pinging @luoyu-intel, @DDEle and @a32543254 for more kernel-related comments.
BTW, I use sum([len(p) for p in outputs]) / generated_duration to calculate fps, since some prompts will stop early (eos token, etc.). You can use model.generate(xxx, ignore_prompt=True) to exclude prompt tokens from your outputs.
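A minimal sketch of that calculation, assuming the same model, tokenizer, inputs, and args as in the script above (generated_duration is just my name for the wall-clock time around generate):

import time

start = time.perf_counter()
# ignore_prompt=True so outputs hold only the newly generated tokens per sequence.
outputs = model.generate(inputs, pad_token=tokenizer.eos_token_id,
                         ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens,
                         ignore_prompt=True)
generated_duration = time.perf_counter() - start

# Sum the actual generated lengths; some sequences may stop early at an eos token.
fps = sum([len(p) for p in outputs]) / generated_duration
print(f"throughput: {fps:.1f} tokens/s")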
In your benchmark, it seems you set all inputs to the same sequence length, which is completely different from the real scenario.
In a real deployment scenario, multiple inputs means different sequence lengths, and our solution targets that scenario.
Please measure with inputs of different sequence lengths, and you will see the advantage of our continuous batching.
In contrast, IPEX only supports static batching, so you need a lot of redundant padding to make all inputs the same size, and that redundant computation pulls down their throughput.
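For example, a batch with genuinely different prompt lengths could be built like the sketch below. This is only illustrative: the prompts are made up, and the tokenizer, model, and args are assumed to come from the script earlier in this thread.

import time

# Hypothetical prompts of noticeably different lengths (made up for illustration).
prompts = [
    "Hello,",
    "Once upon a time, there existed a little girl,",
    "Once upon a time, there was a little girl. She was born in the midst of winter, "
    "a little shy of three weeks into the New year.",
    "Tell me a story about a dragon.",
]

# Llama's tokenizer has no pad token by default; reuse eos for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# A static-batching backend must pad every prompt to the longest one and then
# compute over the pad tokens, while continuous batching avoids that redundancy.
inputs = tokenizer(prompts, return_tensors="pt", padding=True).input_ids

start = time.perf_counter()
outputs = model.generate(inputs, pad_token=tokenizer.pad_token_id,
                         ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens,
                         ignore_prompt=True)
duration = time.perf_counter() - start
print(f"throughput: {sum(len(p) for p in outputs) / duration:.1f} tokens/s")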