Comments (7)
Yes, it is supported, but only for a few model architectures. Please refer to https://github.com/intel/neural-speed/blob/main/docs/continuous_batching.md
Hi, @QuPengfei, if you have no other questions, we will close this issue. Thanks.
Could you please share the performance testing method for multi-batch inference?
Here is the script that I used:
import argparse
from pathlib import Path
from typing import List, Optional
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig

def main(args_in: Optional[List[str]] = None) -> None:
    parser = argparse.ArgumentParser(description="Run multi-batch inference with an NE compatible model")
    parser.add_argument("--model_path", type=Path,
                        help="model path, local or from HF", default="meta-llama/Llama-2-7b-hf")
    parser.add_argument("--prompt", type=str, help="input prompt",
                        default="Once upon a time, there existed a little girl,")
    parser.add_argument("--not_quant", action="store_false",
                        help="whether to use a model with low-bit quantization")
    parser.add_argument("--weight_dtype", type=str,
                        help="output weight type, default: int4; int4, int8, nf4 and others are supported",
                        default="int4")
    parser.add_argument("--compute_dtype", type=str, help="compute type", default="int8")
    parser.add_argument("--group_size", type=int, help="group size", default=128)
    parser.add_argument("--use_gptq", action="store_true")
    parser.add_argument("--n_ctx", type=int, help="n_ctx", default=512)
    parser.add_argument("--max_new_tokens", type=int, help="max_new_tokens", default=300)
    parser.add_argument("--batch_size", type=int, help="number of copies of the prompt in one batch", default=1)
    args = parser.parse_args(args_in)

    model_name = args.model_path
    woq_config = RtnConfig(load_in_4bit=True, use_quant=args.not_quant,
                           weight_dtype=args.weight_dtype, compute_dtype=args.compute_dtype,
                           group_size=args.group_size, use_gptq=args.use_gptq)

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    # Replicate the same prompt batch_size times to build the batch.
    prompt = args.prompt
    prompts = [prompt for _ in range(args.batch_size)]
    inputs = tokenizer(prompts, return_tensors="pt").input_ids

    model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config,
                                                 trust_remote_code=True)
    outputs = model.generate(inputs, pad_token=tokenizer.eos_token_id,
                             ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens)

if __name__ == "__main__":
    main()
And the cmd is:
OMP_NUM_THREADS=56 numactl -m 1 -C 56-111 python run_inference_mutil_batch.py \
--model_path meta-llama/Llama-2-7b-hf \
--prompt "Once upon a time, there was a little girl. She was born in the midst of winter, a little shy of three weeks into the New year." \
--max_new_tokens 32 \
--group_size 128 \
--batch_size 8
Here are the results I obtained on the Intel SPR 8480+ CPU with ITREX settings: batch_size=8, weight_dtype=int4, compute_dtype=int8, group_size=32.
First-token latency: 489ms
Next-token latency: 104ms
In comparison, the results from IPEX are:
First-token latency: 252ms
Next-token latency: 52ms
Is it normal to see a performance difference between ITREX and IPEX, or is there an issue with my testing method?
Hi, @hezhiqian01, the first-token and next-token latencies you get from neural_speed are batched-token latencies; see https://github.com/intel/neural-speed/blob/main/neural_speed/models/llama/llama.cpp#L777-L785.
In your example, the first-token latency of 489ms is the inference time for 8 * prompt_length tokens, and the next-token latency of 104ms is the inference time for 8 * 1 tokens (the eval log from NEURAL_SPEED_VERBOSE=1 reports an inaccurate number of runs). I don't know how IPEX defines its first-token/next-token latency for batched inference. Can you compare the model.generate duration between neural-speed and IPEX in your local env?
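For instance, a minimal way to compare the end-to-end duration could look like the sketch below; the timing code is my own, and model, tokenizer, inputs, and args are assumed to come from the script shared earlier in this thread (the same wrapper works for the IPEX model).

import time

# Time one end-to-end generate call; run the same snippet against the IPEX
# model to compare total generation time instead of the per-token verbose logs.
start = time.perf_counter()
outputs = model.generate(inputs, pad_token=tokenizer.eos_token_id,
                         ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens)
print(f"model.generate took {time.perf_counter() - start:.3f}s")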
Hi @zhentaoyu ,
Thanks for your reply!
Since the latency measurement method is confusing, let's look at throughput instead.
In my case, since the input for each batch is the same, the throughput is measured as:
fps = args.batch_size * max_new_tokens / total_used_time
| | batch_size=1 | batch_size=2 | batch_size=4 | batch_size=8 | batch_size=16 | batch_size=32 |
|---|---|---|---|---|---|---|
| ITREX (tokens/s) | 38 | 58 | 76 | 71 | 86 | 68 |
| IPEX (tokens/s) | 29 | 52 | 78 | 133 | 135 | 164 |
There is still a performance gap between ITREX and IPEX.
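A minimal sketch of how the ITREX row of the table could be produced, assuming the model, tokenizer, prompt, and args from the script above (the loop and timing variables are my own):

import time

# Sweep batch sizes with the same prompt replicated in each batch.
for batch_size in [1, 2, 4, 8, 16, 32]:
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt").input_ids
    start = time.perf_counter()
    model.generate(inputs, pad_token=tokenizer.eos_token_id,
                   ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens)
    total_used_time = time.perf_counter() - start
    # Assumes every sequence generates exactly max_new_tokens tokens.
    fps = batch_size * args.max_new_tokens / total_used_time
    print(f"batch_size={batch_size}: {fps:.1f} tokens/s")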
Hi @hezhiqian01, thanks for your benchmarking.
Our batching approach is like a continuous batching mechanism: we merge all bs prompts into one sequence for GEMM inference, except for self-attention and RoPE (computed per sequence in a for-loop). For more details, please refer to this doc. This approach avoids padding overhead offline and improves efficiency in serving.
In your example, none of the bs sequences has padding tokens since they are all the same prompt, so IPEX's bs-parallel MHA kernel accelerates inference, while neural speed does not do that in this scenario. Pinging @luoyu-intel, @DDEle and @a32543254 for more kernel-related comments.
BTW, I use sum([len(p) for p in outputs]) / generated_duration to calculate fps, since some prompts will stop early (eos token, etc.). You can use model.generate(xxx, ignore_prompt=True) to exclude prompt tokens from your outputs.
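A minimal sketch of that calculation, assuming the same model, tokenizer, inputs, and args as in the script above (generated_duration is just my name for the wall-clock time around generate):

import time

start = time.perf_counter()
# ignore_prompt=True so outputs hold only the newly generated tokens per sequence.
outputs = model.generate(inputs, pad_token=tokenizer.eos_token_id,
                         ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens,
                         ignore_prompt=True)
generated_duration = time.perf_counter() - start

# Sum the actual generated lengths; some sequences may stop early at an eos token.
fps = sum([len(p) for p in outputs]) / generated_duration
print(f"throughput: {fps:.1f} tokens/s")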
In your benchmark, it seems you set all inputs to the same sequence length, which is completely different from the real scenario.
In a real deployment scenario, multiple inputs means different sequence lengths, and our solution targets that scenario.
Please measure with inputs of different sequence lengths, and you will see the advantage of our continuous batching.
In contrast, IPEX only supports static batching, so you need a lot of redundant padding to make all inputs the same size, and that redundant computation pulls down their throughput.
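For example, a batch with genuinely different prompt lengths could be built like the sketch below. This is only illustrative: the prompts are made up, and the tokenizer, model, and args are assumed to come from the script earlier in this thread.

import time

# Hypothetical prompts of noticeably different lengths (made up for illustration).
prompts = [
    "Hello,",
    "Once upon a time, there existed a little girl,",
    "Once upon a time, there was a little girl. She was born in the midst of winter, "
    "a little shy of three weeks into the New year.",
    "Tell me a story about a dragon.",
]

# Llama's tokenizer has no pad token by default; reuse eos for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# A static-batching backend must pad every prompt to the longest one and then
# compute over the pad tokens, while continuous batching avoids that redundancy.
inputs = tokenizer(prompts, return_tensors="pt", padding=True).input_ids

start = time.perf_counter()
outputs = model.generate(inputs, pad_token=tokenizer.pad_token_id,
                         ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens,
                         ignore_prompt=True)
duration = time.perf_counter() - start
print(f"throughput: {sum(len(p) for p in outputs) / duration:.1f} tokens/s")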