Comments (8)
Question: In BesTLA, is only s32 output supported (i.e. u8s8s32/s8s8s32), or is s8 output also supported (i.e. u8s8s8/s8s8s8)?
It depends on the epilogue class: AccumulatorWriteBackInt32 outputs the int32 result, while AlphaBetaProcessS32U8 outputs the u8 result.
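The difference between an int32 write-back and a requantizing u8 write-back can be sketched in plain C++ as follows. This is a minimal illustration under stated assumptions, not BesTLA code: the function names `write_back_s32` and `requantize_to_u8` are hypothetical, and the scale plus zero-point formula is a common generic requantization scheme rather than the confirmed behavior of AlphaBetaProcessS32U8.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a BesTLA API): the int32 accumulators are written
// back unchanged, which corresponds to the u8s8s32 case.
std::vector<int32_t> write_back_s32(const std::vector<int32_t>& acc) {
  return acc;
}

// Hypothetical helper (not a BesTLA API): each int32 accumulator is scaled,
// rounded, shifted by a zero point, and saturated to [0, 255], which
// corresponds to a u8 output.
std::vector<uint8_t> requantize_to_u8(const std::vector<int32_t>& acc,
                                      float scale, int32_t zero_point) {
  assert(scale > 0.0f);
  std::vector<uint8_t> out(acc.size());
  for (std::size_t i = 0; i < acc.size(); ++i) {
    const float scaled = std::nearbyint(static_cast<float>(acc[i]) * scale);
    const float shifted = scaled + static_cast<float>(zero_point);
    // Saturate so that out-of-range values do not wrap around.
    const float clamped = std::min(std::max(shifted, 0.0f), 255.0f);
    out[i] = static_cast<uint8_t>(clamped);
  }
  return out;
}
```

The u8 path writes a quarter of the bytes per output element, at the cost of extra arithmetic in the epilogue.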
Question: Are there any specific environment variables that need to be set to get the best performance out of the BesTLA kernels?
No environment variables are needed. It is better to run the benchmark on a socket with a single NUMA node; a CPU socket with multiple NUMA nodes has performance issues. If you are running the benchmark on hybrid CPUs, please add this to the CMake command: -DBTLA_UT_OPENMP=OFF
from neural-speed.
Thanks @luoyu-intel for the clarification. I have some follow-up questions on the same topic, which looks interesting.
I have been using the benchmarking infra provided in the repo:
https://github.com/intel/neural-speed/tree/main/bestla --> bestla/bestla/ut/bestla_benchmark.cpp
mkdir build && cd build
cmake .. -DBTLA_UT_BENCHMARK=ON -DBTLA_UT_ALL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
./bin/bestla_benchmark
With this infra I have benchmarked the BesTLA kernels for u8s8s32 (AccumulatorWriteBackInt32) and u8s8u8 (AlphaBetaProcessS32U8). I have also benchmarked the oneDNN kernels using benchdnn, as it also supports low-precision kernels: https://github.com/oneapi-src/oneDNN/blob/main/tests/benchdnn/README.md
The BesTLA kernels are run with u8s8s32 and the oneDNN kernels are run with u8s8s8. With BesTLA I have also verified the 8-bit output type (i.e. AlphaBetaProcessS32U8), where we observe up to a 5% improvement over the 32-bit output type.
Question 1: On the BesTLA side, the benchmark infra (./bin/bestla_benchmark) is used to get OP level timing for different ISAs. I would like to confirm whether I can proceed further with the above script/infra for more OP level analysis.
Question 2: From the above image, we observe that the BesTLA micro-kernels are not on par with, or faster than, the oneDNN kernels. I would like to know what might be the reason that the BesTLA time is not faster than the oneDNN time.
Parallelism
Neural Speed provides tensor parallelism, and BesTLA also provides parallelism through its parallel template classes.
Question: Is parallelization handled by BesTLA, by Neural Speed, or by Neural Speed followed by the BesTLA micro-kernels?
Would like to confirm If I can proceed further with the above script/infra for more OP level analysis?
What do you mean "more OP level analysis"?
Would like to know what might be the reason for not observing bestla time faster than OneDNN time taken?
BesTLA was developed by a tiny group at Intel (~3 people) but covers all ISAs since AVX2. So we are not able to make it as fast as oneDNN on arbitrary devices with arbitrary cores and arbitrary problem sizes. Our highlight is support for other low-bit types through C++ templates, such as int3, int4, and int5.
Question: Is parallelization taken care by Bestla or Neural Speed or Neural Speed followed by Bestla micro kernels?
TP is done by Neural Speed. To better support Intel's new Xeon CPU, we will support it inside BesTLA.
Thanks @luoyu-intel for the quick response.
What do you mean "more OP level analysis"?
I was referring to running more arbitrary problem sizes and observing the behavior.
So can we continue with the infra below to run arbitrary problem sizes (specifically with the low-bit types via C++ templates, to observe their impact)? https://github.com/intel/neural-speed/tree/main/bestla
So we are not able to make it as fast as OneDNN on arbitrary devices with arbitrary cores and arbitrary problem sizes.
Question: Do you have any suggestions on the device, cores, and problem sizes where we can observe BesTLA performing better than oneDNN?
Yes, you can add the problem sizes to the benchmark's source code and then compile and run it. We are not planning to provide benchdnn-like CLI parameters.
Question: Do you have any suggestions on device, cores and problem sizes where we can observe BesTLA performing better than OneDNN?
I'd suggest working on this Scheduler class. It assigns parts of the problem to each core and does the cache blocking. Optimizing the schedule for one problem size may have a 10% performance impact.
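The two jobs mentioned here, dividing a problem dimension across cores and then cutting each core's share into cache-sized blocks, can be sketched generically as follows. This is a minimal sketch and not BesTLA's Scheduler: the `Range` type and the `split_even` and `cache_blocks` functions are hypothetical names introduced only for illustration, and a real scheduler would also account for cache sizes, tile shapes, and the ISA.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration only: a half-open index range [begin, end).
struct Range {
  std::size_t begin;
  std::size_t end;
};

// Split `total` items across `workers` as evenly as possible.
// The first (total % workers) workers receive one extra item.
std::vector<Range> split_even(std::size_t total, std::size_t workers) {
  assert(workers > 0);
  std::vector<Range> parts;
  const std::size_t base = total / workers;
  const std::size_t extra = total % workers;
  std::size_t pos = 0;
  for (std::size_t w = 0; w < workers; ++w) {
    const std::size_t len = base + (w < extra ? 1 : 0);
    parts.push_back({pos, pos + len});
    pos += len;
  }
  return parts;
}

// Cut one worker's range into blocks of at most `block` items, so that each
// block's working set can be sized to fit in cache.
std::vector<Range> cache_blocks(Range r, std::size_t block) {
  assert(block > 0);
  std::vector<Range> blocks;
  for (std::size_t b = r.begin; b < r.end; b += block) {
    blocks.push_back({b, std::min(b + block, r.end)});
  }
  return blocks;
}
```

Tuning for one problem size then amounts to choosing which dimension to split and which block size to use for that particular shape.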
Sure @luoyu-intel, Thanks
Here is my use case: I am trying to run a Llama model from Hugging Face with low-precision data types (int8, int4) through ipex llm and other libraries. Based on the above discussion:
Our highlight is supporting other low-bits by cpp templates, like: int3,int4,int5.
Question 1: To achieve the best performance with INT8, would you suggest using oneDNN over BesTLA (as the focus here is on other low-precision data types) and comparing against ipex llm?
Question 2: With the INT4 dtype, would you suggest using the BesTLA kernels to get the best performance over ipex llm?
Question 1: In order to achieve the best performance with INT8 would you suggest to use OneDNN over Bestla (As here the focus is towards other low precision data types) and compare against ipex llm?
oneDNN requires an activation reorder in many cases on both CPU and GPU, but benchdnn does not include the reorder process (as I remember). So I'm not sure about this.
Question 2: With INT4 dtype would you suggest to use Bestla kernels to get best performance over ipex llm ?
I'm not familiar with ipex llm's int4 performance.
Thanks @luoyu-intel.
As the BesTLA kernels' performance focus is mostly on other low-precision kernels, such as int3, int4, and int5,
I am trying to use the int4 kernels from the benchmark's source code and then compile it (https://github.com/intel/neural-speed/tree/main/bestla --> bestla/bestla/ut/bestla_benchmark.cpp).
For int4, to extract the time taken for arbitrary sizes, I have used the UTWOQ_CompInt8 class for computation with data type BTLA_DTYPE::S4_CLIP and scale types BF16 and F32.
I have noticed that the data format is as follows: Input: F32, Weights: INT4, Output: F32
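The F32 input, INT4 weights, F32 output pattern is what weight-only quantization generally looks like. A minimal scalar sketch follows, under stated assumptions: the packing layout (two signed 4-bit values per byte, low nibble first), the per-group F32 scales, and the names `unpack_s4` and `gemv_s4_f32` are hypothetical and are not taken from BesTLA or from the S4_CLIP storage format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout for illustration: two signed 4-bit values per byte,
// low nibble first.
inline int8_t unpack_s4(uint8_t byte, bool high) {
  const int nibble = high ? (byte >> 4) : (byte & 0x0F);
  // Sign-extend: 0..7 stay as they are, 8..15 map to -8..-1.
  return static_cast<int8_t>(nibble >= 8 ? nibble - 16 : nibble);
}

// y = W * x, where W has shape [rows, cols] and is stored as packed int4 with
// one float scale per `group` consecutive weights in a row.
std::vector<float> gemv_s4_f32(const std::vector<uint8_t>& packed,
                               const std::vector<float>& scales,
                               const std::vector<float>& x,
                               std::size_t rows, std::size_t cols,
                               std::size_t group) {
  assert(group > 0 && cols % 2 == 0 && cols % group == 0);
  assert(packed.size() == rows * cols / 2);
  assert(scales.size() == rows * (cols / group));
  assert(x.size() == cols);
  const std::size_t groups_per_row = cols / group;
  std::vector<float> y(rows, 0.0f);
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      const std::size_t idx = r * cols + c;
      const int8_t q = unpack_s4(packed[idx / 2], idx % 2 == 1);
      // Dequantize the weight to float, then accumulate in float.
      const float w = static_cast<float>(q) * scales[r * groups_per_row + c / group];
      y[r] += w * x[c];
    }
  }
  return y;
}
```

The activations stay in F32 in this sketch; a path that computes in int8 would additionally quantize the activations before the multiply.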
Epilogue:
I was looking at the epilogue classes for post-processing from F32 to a low-precision type (8-bit and 4-bit). There we can find different write-backs from F32 to BF16, INT32, BF16.
neural-speed/bestla/bestla/bestla_epilogue.h, line 156 in 97c8190
Question 1: Is there an API that does write-back from F32 to 8-bit and 4-bit types?
Prologue:
I am trying to find something similar in the Prologue classes for data type conversion from F32 to INT8 to handle the computation.
Question 2: Can you point me to the API that takes care of this specific data type conversion?
Question 3: As there are direct classes for (u8s8s32, s8s8s32), is there a similar class for INT4?
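For reference, the kind of F32-to-8-bit conversion asked about in the Prologue questions is commonly done with dynamic asymmetric quantization. The sketch below shows that generic technique only: `QuantizedU8` and `quantize_f32_to_u8` are hypothetical names, per-tensor min/max scaling is an assumption, and nothing here is a BesTLA prologue API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: quantized data plus the parameters needed to
// map it back to float via x ~= scale * (q - zero_point).
struct QuantizedU8 {
  std::vector<uint8_t> data;
  float scale;
  int32_t zero_point;
};

// Quantize a float vector to u8 using one scale and zero point for the
// whole vector (per-tensor, computed from the observed min and max).
QuantizedU8 quantize_f32_to_u8(const std::vector<float>& x) {
  assert(!x.empty());
  const auto mm = std::minmax_element(x.begin(), x.end());
  // Include 0 in the range so that 0.0f maps exactly to the zero point.
  const float lo = std::min(*mm.first, 0.0f);
  const float hi = std::max(*mm.second, 0.0f);
  const float scale = (hi > lo) ? (hi - lo) / 255.0f : 1.0f;
  const int32_t zp = static_cast<int32_t>(std::nearbyint(-lo / scale));
  QuantizedU8 q{std::vector<uint8_t>(x.size()), scale, zp};
  for (std::size_t i = 0; i < x.size(); ++i) {
    const float v = std::nearbyint(x[i] / scale) + static_cast<float>(zp);
    // Saturate to the u8 range.
    q.data[i] = static_cast<uint8_t>(std::min(std::max(v, 0.0f), 255.0f));
  }
  return q;
}
```

A u8 activation produced this way is what a u8s8s32 kernel would consume, with the scale and zero point applied again when converting the int32 result back to float.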