
Comments (8)

luoyu-intel commented on July 28, 2024

Question: In BesTLA, do we have support only for s32 output (i.e. u8s8s32/s8s8s32), or do we also have support for s8 output (i.e. u8s8s8/s8s8s8)?

It depends on the epilogue class: AccumulatorWriteBackInt32 outputs an int32 result, while AlphaBetaProcessS32U8 outputs a u8 result.
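As a rough illustration of the epilogue idea (a conceptual sketch only, not the actual BesTLA classes): the epilogue is the small functor that consumes the int32 accumulator tile and writes it to the destination buffer, so swapping the epilogue template argument is what changes the output type.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Int32 writeback: copy the accumulators out unchanged (u8s8s32 / s8s8s32 style).
struct WriteBackS32 {
  void operator()(const std::int32_t* acc, std::int32_t* dst, std::size_t n) const {
    std::copy(acc, acc + n, dst);
  }
};

// Requantizing writeback: scale the int32 accumulators and clamp to u8
// (u8s8u8 style); 'scale' and 'zero_point' are illustrative parameters.
struct WriteBackU8 {
  float scale;
  int zero_point;
  void operator()(const std::int32_t* acc, std::uint8_t* dst, std::size_t n) const {
    for (std::size_t i = 0; i < n; ++i) {
      int v = static_cast<int>(std::lround(acc[i] * scale)) + zero_point;
      dst[i] = static_cast<std::uint8_t>(std::clamp(v, 0, 255));
    }
  }
};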

Question: Do we have any specific env variables that need to be set to get the best performance out of the BesTLA kernels?

No env variable. It is better to run the benchmark on a socket with a single NUMA node; one CPU socket with multiple NUMA nodes has a performance issue. If you are running the benchmark on hybrid CPUs, please add this to the CMake command: -DBTLA_UT_OPENMP=OFF
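For example, on Linux the benchmark can typically be pinned to a single NUMA node with numactl (adjust the node index to your machine):

numactl --cpunodebind=0 --membind=0 ./bin/bestla_benchmark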


Alavandar08 commented on July 28, 2024

Thanks @luoyu-intel for the clarification. I have some interesting follow-up questions on the same topic.

I have been using this benchmarking infra provided in the repo
https://github.com/intel/neural-speed/tree/main/bestla --> bestla/bestla/ut/bestla_benchmark.cpp
mkdir build && cd build
cmake .. -DBTLA_UT_BENCHMARK=ON -DBTLA_UT_ALL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . -j
./bin/bestla_benchmark

With this infra I have benchmarked the BesTLA kernels for u8s8s32 (AccumulatorWriteBackInt32) and u8s8u8 (AlphaBetaProcessS32U8), and I have also benchmarked the oneDNN kernels using benchdnn, as it also supports low-precision kernels - https://github.com/oneapi-src/oneDNN/blob/main/tests/benchdnn/README.md.

The results are as follows:
[benchmark results image from the original issue comparing BesTLA and oneDNN timings]

The BesTLA kernels are run with u8s8s32, and the oneDNN kernels are run with u8s8s8. With BesTLA I have also verified that with the 8-bit output type (i.e. AlphaBetaProcessS32U8) we observe up to a 5% improvement over the 32-bit output type.

Question 1: On the BesTLA side, the benchmark infra above (./bin/bestla_benchmark) is what I use to get op-level timings for different ISAs. I would like to confirm whether I can proceed with this script/infra for more op-level analysis.


Question 2: From the above image we observe that the BesTLA micro-kernels are not on par with or faster than the oneDNN kernels. What might be the reason the BesTLA times are not faster than the oneDNN times?

Parallelism

Neural Speed provides functionality called tensor parallelism; BesTLA also provides parallelism functionality through its parallel template classes.
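For context, the sketch below shows what library-level parallelism means here in its simplest form (illustrative only, not the BesTLA parallel template classes): split the GEMM output rows across OpenMP threads and run a single-threaded block kernel on each slice.

#include <cstddef>
#include <omp.h>

// Naive single-threaded block: rows [m0, m1) of C = A (m x k) * B (k x n).
void gemm_block(const float* A, const float* B, float* C,
                std::size_t m0, std::size_t m1, std::size_t n, std::size_t k) {
  for (std::size_t i = m0; i < m1; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.f;
      for (std::size_t p = 0; p < k; ++p) acc += A[i * k + p] * B[p * n + j];
      C[i * n + j] = acc;
    }
}

// Library-level parallelism: each OpenMP thread computes its own row slice.
void gemm_parallel(const float* A, const float* B, float* C,
                   std::size_t m, std::size_t n, std::size_t k) {
#pragma omp parallel
  {
    std::size_t nthreads = static_cast<std::size_t>(omp_get_num_threads());
    std::size_t tid = static_cast<std::size_t>(omp_get_thread_num());
    std::size_t rows = (m + nthreads - 1) / nthreads;
    std::size_t m0 = tid * rows;
    std::size_t m1 = m0 + rows < m ? m0 + rows : m;
    if (m0 < m) gemm_block(A, B, C, m0, m1, n, k);
  }
}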

Question: Is parallelization handled by BesTLA, by Neural Speed, or by Neural Speed followed by the BesTLA micro-kernels?


luoyu-intel commented on July 28, 2024

@Alavandar08

I would like to confirm whether I can proceed with this script/infra for more op-level analysis.

What do you mean by "more op-level analysis"?

What might be the reason the BesTLA times are not faster than the oneDNN times?

BesTLA was developed by a small team at Intel (~3 people) but covers all ISAs since AVX2. So we are not able to make it as fast as oneDNN on arbitrary devices with arbitrary core counts and arbitrary problem sizes. Our highlight is supporting other low-bit types via C++ templates, e.g. int3, int4, int5.
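(For a concrete picture of what "low-bit" means, here is a purely illustrative sketch — not BesTLA's actual S4_CLIP layout — of packing two signed 4-bit weights per byte and unpacking them before the int8 compute.)

#include <cstdint>

// Pack two signed 4-bit values (each in [-8, 7]) into one byte.
inline std::uint8_t pack_s4x2(std::int8_t lo, std::int8_t hi) {
  return static_cast<std::uint8_t>((lo & 0x0F) | ((hi & 0x0F) << 4));
}

// Unpack and sign-extend the two nibbles back to int8.
inline void unpack_s4x2(std::uint8_t packed, std::int8_t& lo, std::int8_t& hi) {
  lo = static_cast<std::int8_t>(static_cast<std::int8_t>(packed << 4) >> 4);
  hi = static_cast<std::int8_t>(static_cast<std::int8_t>(packed) >> 4);
}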

Question: Is parallelization handled by BesTLA, by Neural Speed, or by Neural Speed followed by the BesTLA micro-kernels?

TP is done by Neural Speed. To better support Intel's new Xeon CPUs, we will support it inside BesTLA.


Alavandar08 commented on July 28, 2024

Thanks @luoyu-intel for the quick response.

What do you mean by "more op-level analysis"?

I was referring to running with more arbitrary problem sizes and observing their behavior.

So can we continue with the infra below to run arbitrary problem sizes (specifically with the low-bit C++ templates, to observe their impact)? - https://github.com/intel/neural-speed/tree/main/bestla

So we are not able to make it as fast as oneDNN on arbitrary devices with arbitrary core counts and arbitrary problem sizes.

Question: Do you have any suggestions on devices, core counts, and problem sizes where we can observe BesTLA performing better than oneDNN?


luoyu-intel commented on July 28, 2024

Yes, you can add the problem sizes to the benchmark's source code and then compile and run it. We are not planning to provide benchdnn-like CLI parameters.
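Concretely, that just means listing the extra (m, n, k) shapes you care about next to the existing cases; the snippet below is a hypothetical illustration only — mirror one of the real cases in bestla_benchmark.cpp for the actual call pattern.

struct Shape { int m, n, k; };
static const Shape extra_shapes[] = {
    {1, 4096, 4096},    // token-by-token decode style
    {32, 4096, 4096},   // small batch
    {1024, 4096, 4096}, // prefill style
};
// for (const auto& s : extra_shapes) run_benchmark_case(s.m, s.n, s.k);  // hypothetical call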

Question: Do you have any suggestions on devices, core counts, and problem sizes where we can observe BesTLA performing better than oneDNN?

I'd suggest working on the Scheduler class. It distributes the problem size across the cores and does the cache-blocking work. Optimizing the schedule for one problem size may have around a 10% performance impact.
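Roughly speaking, the decision such a scheduler makes looks like the sketch below (conceptual only, not the actual BesTLA Scheduler class): choose a tile size so the working set fits a per-core cache budget and assign tiles to cores.

#include <cstddef>
#include <vector>

// One tile of the N dimension assigned to one core.
struct Tile { std::size_t n_begin, n_end; int core; };

// Pick an N-tile so the B panel (k x tile_n elements) stays within a per-core
// cache budget, then hand tiles out round-robin to the cores.
std::vector<Tile> schedule_n(std::size_t n, std::size_t k, std::size_t elem_size,
                             std::size_t cache_bytes, int ncores) {
  std::size_t tile_n = cache_bytes / (k * elem_size);
  if (tile_n == 0) tile_n = 1;
  std::vector<Tile> tiles;
  int core = 0;
  for (std::size_t n0 = 0; n0 < n; n0 += tile_n) {
    std::size_t n1 = n0 + tile_n < n ? n0 + tile_n : n;
    tiles.push_back({n0, n1, core});
    core = (core + 1) % ncores;
  }
  return tiles;
}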


Alavandar08 commented on July 28, 2024

Sure @luoyu-intel, thanks.

Here is my use case: I am trying to run a LLaMA model from Hugging Face with low-precision data types (int8, int4) through ipex-llm and other libraries. Based on the above discussion:

Our highlight is supporting other low-bits by cpp templates, like: int3,int4,int5.

Question 1: In order to achieve the best performance with INT8, would you suggest using oneDNN over BesTLA (since the focus here is on other low-precision data types) and comparing against ipex-llm?

Question 2: With the INT4 dtype, would you suggest using the BesTLA kernels to get the best performance over ipex-llm?


luoyu-intel commented on July 28, 2024

Question 1: In order to achieve the best performance with INT8, would you suggest using oneDNN over BesTLA (since the focus here is on other low-precision data types) and comparing against ipex-llm?

oneDNN requires an activation reorder in many cases on both CPU and GPU, but benchdnn does not include the reorder in its timing (as I remember). So I'm not sure about this.

Question 2: With the INT4 dtype, would you suggest using the BesTLA kernels to get the best performance over ipex-llm?

I'm not familiar with ipex-llm's int4 performance.


Alavandar08 commented on July 28, 2024

Thanks @luoyu-intel.

As BesTLA's performance focus is mostly on other low-precision kernels, e.g. int3, int4, int5,

I am trying to use the int4 kernels from the benchmark's source code and then compile it (https://github.com/intel/neural-speed/tree/main/bestla --> bestla/bestla/ut/bestla_benchmark.cpp).

For int4, to extract the time taken for arbitrary sizes, I have used the UTWOQ_CompInt8 class for computation with data type BTLA_DTYPE::S4_CLIP and scale types BF16 and F32.
I have noticed that the data format is this: Input: F32, Weights: INT4, Output: F32.

auto memsize = gemm_memsize(m, n, k, BTLA_DTYPE::F32, qtype, BTLA_DTYPE::F32);

Epilogue:
I was looking at the epilogue classes for post-processing from F32 to a low-precision type (8-bit and 4-bit). Here we can find different writebacks, e.g. to FP32, INT32, and BF16.

using AccumulatorWriteBackFp32 = AccumulatorWriteBack<float, float>;

Question 1: I was looking for an API that does the writeback from F32 to 8-bit or 4-bit. Do we have any API that supports this case?
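For context, what I mean by such a writeback is, conceptually, something like the sketch below (hypothetical code, not an existing BesTLA API): scale, round, and clamp each F32 accumulator element into int8.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// 'scale' is the (hypothetical) quantization scale of the destination tensor.
void writeback_f32_to_s8(const float* acc, std::int8_t* dst, std::size_t n, float scale) {
  for (std::size_t i = 0; i < n; ++i) {
    int v = static_cast<int>(std::lround(acc[i] / scale));
    dst[i] = static_cast<std::int8_t>(std::clamp(v, -128, 127));
  }
}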

Prologue:
I am trying to find something similar among the prologue classes for data-type conversion from F32 to INT8 to handle the computation.
Question 2: Can you point me to the API that takes care of this specific data-type conversion?
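(For reference, the conversion I have in mind is dynamic quantization of the F32 activation; the sketch below shows the arithmetic only and is not the BesTLA prologue API.)

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Per-tensor asymmetric dynamic quantization of an F32 activation to u8:
// find min/max, derive scale and zero point, then round and clamp.
void quantize_f32_to_u8(const float* src, std::uint8_t* dst, std::size_t n,
                        float& scale, int& zero_point) {
  if (n == 0) return;
  float lo = src[0], hi = src[0];
  for (std::size_t i = 1; i < n; ++i) { lo = std::min(lo, src[i]); hi = std::max(hi, src[i]); }
  scale = (hi - lo) / 255.f;
  if (scale == 0.f) scale = 1.f;  // constant input: avoid division by zero
  zero_point = static_cast<int>(std::lround(-lo / scale));
  for (std::size_t i = 0; i < n; ++i) {
    int v = static_cast<int>(std::lround(src[i] / scale)) + zero_point;
    dst[i] = static_cast<std::uint8_t>(std::clamp(v, 0, 255));
  }
}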

Question 3: As we have direct classes for u8s8s32 and s8s8s32, do we have a similar class for INT4?

