Comments (7)

zhentaoyu commented on July 28, 2024

Yes, it is supported, but only for a few model architectures. Please refer to https://github.com/intel/neural-speed/blob/main/docs/continuous_batching.md

zhentaoyu commented on July 28, 2024

Hi, @QuPengfei, if you have no other questions, we will close this issue. Thanks.

hezhiqian01 commented on July 28, 2024

Could you please share the performance testing method for multi-batch inference?

Here is the script that I used:

import argparse
from pathlib import Path
from typing import List, Optional

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig


def main(args_in: Optional[List[str]] = None) -> None:
    parser = argparse.ArgumentParser(description="Run multi-batch text generation with a low-bit quantized model")
    parser.add_argument("--model_path", type=Path,
                        help="model path, local or from HF", default="meta-llama/Llama-2-7b-hf")
    parser.add_argument("--prompt", type=str, help="prompt text",
                        default="Once upon a time, there existed a little girl,")
    parser.add_argument("--not_quant", action="store_false",
                        help="whether to use a model with low-bit quantization")
    parser.add_argument("--weight_dtype", type=str,
                        help="output weight type, default: int4; int4, int8, nf4 and others are supported",
                        default="int4")
    parser.add_argument("--compute_dtype", type=str, help="compute type", default="int8")
    parser.add_argument("--group_size", type=int, help="group size", default=128)
    parser.add_argument("--use_gptq", action="store_true")
    parser.add_argument("--n_ctx", type=int, help="n_ctx", default=512)
    parser.add_argument("--max_new_tokens", type=int, help="max_new_tokens", default=300)
    parser.add_argument("--batch_size", type=int, help="number of prompts per batch", default=1)
    args = parser.parse_args(args_in)

    model_name = args.model_path
    woq_config = RtnConfig(load_in_4bit=True, use_quant=args.not_quant,
                           weight_dtype=args.weight_dtype, compute_dtype=args.compute_dtype,
                           group_size=args.group_size, use_gptq=args.use_gptq)

    # Replicate the same prompt batch_size times and tokenize as one batch.
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    prompts = [args.prompt for _ in range(args.batch_size)]
    inputs = tokenizer(prompts, return_tensors="pt").input_ids

    model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config,
                                                 trust_remote_code=True)
    outputs = model.generate(inputs, pad_token=tokenizer.eos_token_id,
                             ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens)


if __name__ == "__main__":
    main()

And the cmd is:

OMP_NUM_THREADS=56 numactl -m 1 -C 56-111 python run_inference_mutil_batch.py \
--model_path meta-llama/Llama-2-7b-hf \
--prompt "Once upon a time, there was a little girl. She was born in the midst of winter, a little shy of three weeks into the New year." \
--max_new_tokens 32 \
--group_size 128 \
--batch_size 8

Here are the results I obtained on the Intel SPR 8480+ CPU with ITREX settings: batch_size=8, weight_dtype=int4, compute_dtype=int8, group_size=32.

First-token latency: 489ms
Next-token latency: 104ms

In comparison, the results from IPEX are:

First-token latency: 252ms
Next-token latency: 52ms

Is it normal to see a performance difference between ITREX and IPEX, or is there an issue with my testing method?

zhentaoyu commented on July 28, 2024

Hi, @hezhiqian01, the first-token and next-token latencies you get from neural_speed are batched-token latencies; see https://github.com/intel/neural-speed/blob/main/neural_speed/models/llama/llama.cpp#L777-L785.
In your example, the first-token latency of 489 ms is the inference time for 8 * prompt_length tokens, and the next-token latency of 104 ms is the inference time for 8 * 1 tokens (the eval log of NEURAL_SPEED_VERBOSE=1 reports an inaccurate number of runs). I don't know how IPEX defines its first-token/next-token latency in batched inference. Can you measure the model.generate duration of both neural-speed and IPEX in your local environment?
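
If it helps, here is a minimal sketch of timing the whole model.generate call so the two stacks can be compared end to end. It assumes the model, tokenizer, inputs and args are built exactly as in the script above; the time.perf_counter wrapping is just an illustration, not an official benchmarking method.

import time

# Minimal sketch: measure end-to-end generate() wall time instead of relying
# on per-token latency logs. Assumes `model`, `tokenizer`, `inputs` and `args`
# are constructed as in the script above.
start = time.perf_counter()
outputs = model.generate(inputs, pad_token=tokenizer.eos_token_id,
                         ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens)
elapsed = time.perf_counter() - start
print(f"generate() wall time for batch_size={args.batch_size}: {elapsed:.2f} s")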

hezhiqian01 commented on July 28, 2024

Hi @zhentaoyu ,

Thanks for your reply!

Since the latency measurement method is confusing, let's look at throughput instead.

In my case, since the input for every sequence in the batch is the same, the throughput is measured as:

fps = args.batch_size * max_new_tokens / total_used_time

                  batch_size=1  batch_size=2  batch_size=4  batch_size=8  batch_size=16  batch_size=32
ITREX (tokens/s)       38            58            76            71            86             68
IPEX  (tokens/s)       29            52            78           133           135            164

There is still a performance gap between ITREX and IPEX.

zhentaoyu commented on July 28, 2024

Hi @hezhiqian01, thanks for your benchmarking.
Our batching approach is like a continuous batching mechanism: we merge all bs prompts into one sequence for GEMM inference, except for self-attention and RoPE (computed in a for-loop). For more details you can refer to this doc. This approach saves padding time offline and improves efficiency in serving.
In your example, none of the bs sequences have padding tokens since they share the same prompt, so IPEX's batch-parallel MHA kernel accelerates inference while neural-speed does not in this scenario. Ping @luoyu-intel, @DDEle and @a32543254 for more kernel-related comments.

BTW, I use sum([len(p) for p in outputs]) / generated_duration to calculate fps, since some prompts will stop early (EOS token, etc.).
You can use model.generate(xxx, ignore_prompt=True) to exclude prompt tokens from your outputs.
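
For reference, a minimal sketch of that fps calculation, assuming the same script setup as above; ignore_prompt=True and the sum([len(p) for p in outputs]) counting come from the comment above, while the timing variable names are illustrative.

import time

# Sketch: count only generated tokens (prompt excluded via ignore_prompt=True)
# and divide by the wall time of generate(). Some sequences may stop early on
# an EOS token, so sum each sequence's length instead of assuming max_new_tokens.
start = time.perf_counter()
outputs = model.generate(inputs, pad_token=tokenizer.eos_token_id,
                         ctx_size=args.n_ctx, max_new_tokens=args.max_new_tokens,
                         ignore_prompt=True)
generated_duration = time.perf_counter() - start

total_generated = sum(len(p) for p in outputs)
fps = total_generated / generated_duration
print(f"throughput: {fps:.1f} tokens/s")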

a32543254 commented on July 28, 2024

In your benchmarking, it seems you set all the inputs to the same sequence length, which is completely different from a real-world scenario.
In a real deployment, multiple inputs means different sequence lengths, and our solution targets that scenario.
Please measure inputs with different sequence lengths and you will see the advantage of our continuous batching.
In contrast, IPEX only supports static batching, so you need a lot of redundant padding to make all inputs the same size, and that redundant computation pulls down their throughput.
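
As a back-of-the-envelope illustration of that padding overhead (made-up prompt lengths, not a benchmark), the sketch below compares how many tokens a statically padded batch processes versus the padding-free total that continuous batching works on.

# Illustrative only: static batching pads every prompt to the longest one,
# while continuous batching processes only the real tokens.
prompt_lengths = [37, 512, 128, 256, 64, 480, 90, 300]  # hypothetical lengths

padded_tokens = len(prompt_lengths) * max(prompt_lengths)  # static batching
real_tokens = sum(prompt_lengths)                          # continuous batching

waste = 1 - real_tokens / padded_tokens
print(f"padded: {padded_tokens}, real: {real_tokens}, wasted: {waste:.0%}")
# With these lengths, roughly 54% of the padded batch is padding.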
