intel-xpu-backend-for-triton's Issues

GEMM performance is lower than XeTLA

  • DPAS instructions are not generated where needed
  • 2D block loads/stores are not generated where needed
  • Prefetch instructions are not inserted where needed

Make the llvm-target branch use the Triton plugin infrastructure

The code in the llvm-target branch is a fork of the OpenAI Triton code with modifications in several files. The structure of the project mirrors the structure of the AMD port. The objective of this work item is to make the project use the Triton plugin infrastructure. The work required is similar to what is needed for the AMD port, which is also a fork, so we should determine whether the same mechanism can be reused.

GEMM Block-pointer Path

Get Triton GEMM performance to 80% of oneDNN/XeTLA by utilizing GenISA/vc-intrinsics.
The lowering pipeline would be: triton -> tritongpu -> optimized/simplified tritongpu -> llvm/spirv.

This serves as an umbrella issue for the related work items.

[atomic_rmw]: Support for fp16

The current implementation of tt.atomic_rmw is functional for several data types, but it does not yet work for float16. The root cause is in IGC.
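
For context, a minimal Triton kernel that exercises this path (a sketch, not the actual failing workload; the kernel and variable names are illustrative, and it assumes intel_extension_for_pytorch provides the xpu device):

import torch
import intel_extension_for_pytorch  # noqa: F401  (assumed: provides the "xpu" device)
import triton
import triton.language as tl

@triton.jit
def atomic_add_fp16_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    # tl.atomic_add lowers to tt.atomic_rmw; float16 is the unsupported case.
    tl.atomic_add(out_ptr + offsets, x, mask=mask)

x = torch.randn(1024, dtype=torch.float16, device="xpu")
out = torch.zeros_like(x)
atomic_add_fp16_kernel[(triton.cdiv(x.numel(), 128),)](x, out, x.numel(), BLOCK=128)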

[DPAS]: Initial implementation for Triton's `tt.dot` operation using the DPAS instruction

Currently the tt.dot operation is lowered to a loop containing scalar FMA instructions. This works from a functional perspective but performs poorly.

This work item entails extending the conversion code for tt.dot so that it generates DPAS instruction(s) via the GENX dialect operation @llvm.genx.GenISA.sub.group.dpas.
In this first step, performance is not the main objective. The operands of the tt.dot operation are expected to be placed into shared local memory by converting from a blocked layout to a shared layout (as is the case for NVIDIA). Once in shared memory, the operands are converted to a dot layout (with an underlying DPAS layout).

Efforts or experiments to elide the blocked-to-shared layout conversions should be handled separately from this work item.
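
For reference, a minimal Triton kernel whose tl.dot call produces a tt.dot operation (a sketch assuming fp16 inputs and a single 16x16x16 tile; the names and shapes are illustrative):

import triton
import triton.language as tl

@triton.jit
def single_tile_dot_kernel(a_ptr, b_ptr, c_ptr,
                           M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_k = tl.arange(0, K)
    a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])  # M x K tile
    b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])  # K x N tile
    # tl.dot becomes tt.dot; this work item lowers it to DPAS instead of scalar FMAs.
    c = tl.dot(a, b)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)

# Example launch: single_tile_dot_kernel[(1,)](a, b, c, M=16, N=16, K=16)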

DeepSpeed Triton backend failure on GPT-J inference

Hi, I successfully built Triton 2.1.0 at commit 5df9042 with the XPU backend at 0bcc485 and added some modifications to make it compatible with ipex-2.0. I can successfully pass the XPU pytest in Triton with pytest -xvs test/backend/third_party_backends/test_xpu_backend.py --backend xpu.

However, when I run DeepSpeed inference on GPT-J, I get errors like Translate to SPIRV IR failed / LLVM ERROR: Failed to translate TritonGPU to SPIRV IR. Is the support still limited to specific operations, or is this a bug? Also, I cannot compile the latest main branch of intel-xpu-backend-for-triton; it fails with:

/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/Utility.h:11:62: error: template argument 1 is invalid
   11 | #define ptrtoint(...) rewriter.create<spirv::ConvertPtrToUOp>(loc, __VA_ARGS__)
      |                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/ConvertLayoutOpToSPIRV.cpp:220:25: note: in expansion of macro ‘ptrtoint’
  220 |               currVal = ptrtoint(llvmElemTy, currVal);
      |                         ^~~~~~~~
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/Utility.h:10:46: error: ‘ConvertUToPtrOp’ is not a member of ‘mlir::spirv’; did you mean ‘ConvertUToFOp’?
   10 | #define inttoptr(...) rewriter.create<spirv::ConvertUToPtrOp>(loc, __VA_ARGS__)
      |                                              ^~~~~~~~~~~~~~~
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/Utility.h:10:46: note: in definition of macro ‘inttoptr’
   10 | #define inttoptr(...) rewriter.create<spirv::ConvertUToPtrOp>(loc, __VA_ARGS__)
      |                                              ^~~~~~~~~~~~~~~
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/Utility.h:10:62: error: no matching function for call to ‘mlir::ConversionPatternRewriter::create<<expression error> >(mlir::Location&, mlir::Type&, mlir::Value&)’
   10 | #define inttoptr(...) rewriter.create<spirv::ConvertUToPtrOp>(loc, __VA_ARGS__)
      |                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~

Complete log:

[0] [2023-08-03 01:28:35,085] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.1+7de73909, git-hash=7de73909, git-branch=master
[0] [2023-08-03 01:28:35,085] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[0] [2023-08-03 01:28:35,085] [INFO] [logging.py:96:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[0] [2023-08-03 01:28:35,130] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 4096, 'intermediate_size': 16384, 'heads': 16, 'num_hidden_layers': -1, 'dtype': torch.float16, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': 64, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': False, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': True, 'triton_autotune': False}
[0] [2023-08-03 01:28:35,130] [INFO] [logging.py:96:log_dist] [Rank 0] Injecting Triton kernels ...
Loading 1 checkpoint shards:   0%|          | 0/1 [00:00<?, ?it/s][0] checkpoint loading time at rank 0: 48.9048216342926 sec
Loading 1 checkpoint shards: 100%|██████████| 1/1 [00:48<00:00, 48.90s/it]
[0] [2023-08-03 01:29:24,846] [INFO] [utils.py:798:see_memory_usage] post-ds-inference-init
[0] [2023-08-03 01:29:24,846] [INFO] [utils.py:799:see_memory_usage] MA 11.27 GB         Max_MA 11.27 GB         CA 11.27 GB         Max_CA 11 GB
[0] [2023-08-03 01:29:24,846] [INFO] [utils.py:806:see_memory_usage] CPU Virtual Memory:  used = 11.66 GB, percent = 1.2%
[0] *** Starting to generate 32 tokens with bs=1
[0] Generate args {'max_new_tokens': 32, 'do_sample': False, 'num_beams': 1, 'token_latency': True}
[0] *** Prompt size:  32
[0] [2023-08-03 01:29:24,933] [INFO] [utils.py:798:see_memory_usage] end-of-run
[0] [2023-08-03 01:29:24,934] [INFO] [utils.py:799:see_memory_usage] MA 11.27 GB         Max_MA 11.27 GB         CA 11.27 GB         Max_CA 11 GB
[0] [2023-08-03 01:29:24,934] [INFO] [utils.py:806:see_memory_usage] CPU Virtual Memory:  used = 11.66 GB, percent = 1.2%
[0] *** Running benchmark[0]
[0] Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[0] >>> inner function (32,)
[0] <function get_backend at 0x7f6b81fe3910>
[0] <triton.third_party.xpu.XPUBackend object at 0x7f6acea6d960>
[0] /home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/xpu/cpp_extension.py:1373: UserWarning: This extension has static linked onednn library. Please attaction to                 that, this path of onednn version maybe not match with the built-in version.
[0]   warnings.warn(
[0] 2023-08-03 01:29:25,509 - root - INFO - running build_ext
[0] 2023-08-03 01:29:25,514 - root - INFO - building 'layer_norm_kernel' extension
[0] 2023-08-03 01:29:25,514 - root - INFO - creating /tmp/tmp2a44hft2/tmp
[0] 2023-08-03 01:29:25,514 - root - INFO - creating /tmp/tmp2a44hft2/tmp/tmp2a44hft2
[0] 2023-08-03 01:29:25,515 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icx -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2023.1.0/linux/include -I/opt/intel/oneapi/compiler/2023.1.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2023.1.0/include -I/home/xxx/onednn/include -I/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/include -I/home/xxx/miniconda3/envs/llm/include/python3.10 -c /tmp/tmp2a44hft2/main.c -o /tmp/tmp2a44hft2/tmp/tmp2a44hft2/main.o -fPIC -fPIC -w -fsycl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=layer_norm_kernel -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[0] 2023-08-03 01:29:34,504 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx -shared -fsycl /tmp/tmp2a44hft2/tmp/tmp2a44hft2/main.o -lze_loader -o /tmp/tmp2a44hft2/layer_norm_kernel.cpython-310-x86_64-linux-gnu.so -lc10 -ltorch_cpu -ltorch -ltorch_python -L/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/lib -L/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/lib -L/home/xxx/onednn/lib -Wl,--start-group /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sycl.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_intel_ilp64.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -lm -ldl -ldnnl -lintel-ext-pt-gpu
[0] >>> inner function <function Fp16Matmul.forward.<locals>.<lambda> at 0x7f6ace1720e0>
[0] <function get_backend at 0x7f6b81fe3910>
[0] <triton.third_party.xpu.XPUBackend object at 0x7f6acea6d960>[0]
[0] /home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/xpu/cpp_extension.py:1373: UserWarning: This extension has static linked onednn library. Please attaction to                 that, this path of onednn version maybe not match with the built-in version.
[0]   warnings.warn(
[0] 2023-08-03 01:29:50,809 - root - INFO - running build_ext
[0] 2023-08-03 01:29:50,812 - root - INFO - building '_fp_matmul' extension
[0] 2023-08-03 01:29:50,812 - root - INFO - creating /tmp/tmp7aguxrst/tmp
[0] 2023-08-03 01:29:50,812 - root - INFO - creating /tmp/tmp7aguxrst/tmp/tmp7aguxrst
[0] 2023-08-03 01:29:50,812 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icx -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2023.1.0/linux/include -I/opt/intel/oneapi/compiler/2023.1.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2023.1.0/include -I/home/xxx/onednn/include -I/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/include -I/home/xxx/miniconda3/envs/llm/include/python3.10 -c /tmp/tmp7aguxrst/main.c -o /tmp/tmp7aguxrst/tmp/tmp7aguxrst/main.o -fPIC -fPIC -w -fsycl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=_fp_matmul -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[0] 2023-08-03 01:29:59,847 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx -shared -fsycl /tmp/tmp7aguxrst/tmp/tmp7aguxrst/main.o -lze_loader -o /tmp/tmp7aguxrst/_fp_matmul.cpython-310-x86_64-linux-gnu.so -lc10 -ltorch_cpu -ltorch -ltorch_python -L/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/lib -L/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/lib -L/home/xxx/onednn/lib -Wl,--start-group /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sycl.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_intel_ilp64.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -lm -ldl -ldnnl -lintel-ext-pt-gpu
[0] >>> inner function <function Fp16Matmul._score_4d_matmul.<locals>.<lambda> at 0x7f6ace1720e0>
[0] <function get_backend at 0x7f6b81fe3910>
[0] <triton.third_party.xpu.XPUBackend object at 0x7f6acea6d960>
[0] /home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/xpu/cpp_extension.py:1373: UserWarning: This extension has static linked onednn library. Please attaction to                 that, this path of onednn version maybe not match with the built-in version.
[0]   warnings.warn(
[0] 2023-08-03 01:30:23,065 - root - INFO - running build_ext
[0] 2023-08-03 01:30:23,068 - root - INFO - building 'matmul_4d_kernel' extension
[0] 2023-08-03 01:30:23,069 - root - INFO - creating /tmp/tmpsihcyg_g/tmp
[0] 2023-08-03 01:30:23,069 - root - INFO - creating /tmp/tmpsihcyg_g/tmp/tmpsihcyg_g
[0] 2023-08-03 01:30:23,069 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icx -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2023.1.0/linux/include -I/opt/intel/oneapi/compiler/2023.1.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2023.1.0/include -I/home/xxx/onednn/include -I/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/include -I/home/xxx/miniconda3/envs/llm/include/python3.10 -c /tmp/tmpsihcyg_g/main.c -o /tmp/tmpsihcyg_g/tmp/tmpsihcyg_g/main.o -fPIC -fPIC -w -fsycl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=matmul_4d_kernel -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[0] 2023-08-03 01:30:32,197 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx -shared -fsycl /tmp/tmpsihcyg_g/tmp/tmpsihcyg_g/main.o -lze_loader -o /tmp/tmpsihcyg_g/matmul_4d_kernel.cpython-310-x86_64-linux-gnu.so -lc10 -ltorch_cpu -ltorch -ltorch_python -L/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/lib -L/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/lib -L/home/xxx/onednn/lib -Wl,--start-group /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sycl.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_intel_ilp64.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -lm -ldl -ldnnl -lintel-ext-pt-gpu
[0] >>> inner function (512,)
[0] <function get_backend at 0x7f6b81fe3910>
[0] <triton.third_party.xpu.XPUBackend object at 0x7f6acea6d960>
[0] /home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/xpu/cpp_extension.py:1373: UserWarning: This extension has static linked onednn library. Please attaction to                 that, this path of onednn version maybe not match with the built-in version.
[0]   warnings.warn(
[0] 2023-08-03 01:30:49,560 - root - INFO - running build_ext
[0] 2023-08-03 01:30:49,563 - root - INFO - building 'softmax_kernel' extension
[0] 2023-08-03 01:30:49,563 - root - INFO - creating /tmp/tmpxm55vb5j/tmp
[0] 2023-08-03 01:30:49,563 - root - INFO - creating /tmp/tmpxm55vb5j/tmp/tmpxm55vb5j
[0] 2023-08-03 01:30:49,563 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icx -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2023.1.0/linux/include -I/opt/intel/oneapi/compiler/2023.1.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2023.1.0/include -I/home/xxx/onednn/include -I/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/include -I/home/xxx/miniconda3/envs/llm/include/python3.10 -c /tmp/tmpxm55vb5j/main.c -o /tmp/tmpxm55vb5j/tmp/tmpxm55vb5j/main.o -fPIC -fPIC -w -fsycl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=softmax_kernel -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[0] 2023-08-03 01:30:58,436 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx -shared -fsycl /tmp/tmpxm55vb5j/tmp/tmpxm55vb5j/main.o -lze_loader -o /tmp/tmpxm55vb5j/softmax_kernel.cpython-310-x86_64-linux-gnu.so -lc10 -ltorch_cpu -ltorch -ltorch_python -L/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/lib -L/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/lib -L/home/xxx/onednn/lib -Wl,--start-group /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sycl.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_intel_ilp64.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -lm -ldl -ldnnl -lintel-ext-pt-gpu
[0] loc("-":17:13): error: external function is unhandled
[0] Translate to SPIRV IR failedLLVM ERROR: Failed to translate TritonGPU to SPIRV IR.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 1046382 RUNNING AT 
=   KILLED BY SIGNAL: 6 (Aborted)
===================================================================================

`/usr/bin/ld: cannot find -lz` when building the backend

I'd like to try the compile bundle from #83, but the very last step (building Triton) fails with:

/usr/bin/ld: cannot find -lz

A solution is suggested at https://github.com/intel/intel-xpu-backend-for-triton/wiki/Possible-Build-Bugs#1-usrbinld-cannot-find--lz, but I am not root on the system, so I cannot install that package.

I also have zlib installed in the conda environment, and I have tried extracting the package and using LD_LIBRARY_PATH to point to other locations, but it doesn't help.

Is there a way to build without the system zlib?

Test `dot_combine` in `test_line_info.py` fails to compile

PR #190 exposed an issue in compiling one of the tests contained in test_line_info.py: the dot_combine kernel fails to compile cleanly. To reproduce the issue, comment out all tests except dot_combine and execute:

python3 -m pytest --verbose --device xpu test_line_info.py

The backtrace is:

test_line_info.py::test_line_info[dot_combine] Fatal Python error: Aborted

Current thread 0x00007f291caa52c0 (most recent call first):
  File "/home/etiotto/intel-xpu-backend-for-triton/python/triton/compiler/backends/xpu.py", line 191 in make_llir
  File "/home/etiotto/intel-xpu-backend-for-triton/python/triton/compiler/backends/xpu.py", line 208 in <lambda>
  File "/home/etiotto/intel-xpu-backend-for-triton/python/triton/compiler/compiler.py", line 200 in compile

Reduced test case

#blocked = #triton_gpu.blocked<{sizePerThread = [2, 2], threadsPerWarp = [2, 16], warpsPerCTA = [4, 1], order = [1, 0], CTAsPerCGA = [1, 1], CTASplitNum = [1, 1], CTAOrder = [1, 0]}>

module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 4 : i32} {
  tt.func public @kernel_dot_combine_0(%op1: tensor<2x2xi8, #triton_gpu.dot_op<{opIdx = 0, parent = #blocked}>>, 
                                       %op2: tensor<2x2xi8, #triton_gpu.dot_op<{opIdx = 1, parent = #blocked}>>, 
                                       %cst : tensor<2x2xi32, #blocked>) {
    %0 = tt.dot %op1, %op2, %cst {allowTF32 = true, maxNumImpreciseAcc = 0 : i32} : tensor<2x2xi8, #triton_gpu.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<2x2xi8, #triton_gpu.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<2x2xi32, #blocked>
    tt.return
  }
}

To reproduce using the reduced test case:

% triton-opt  -split-input-file --convert-triton-gpu-to-llvm="target=genx" t1.ttgir
Unsupported conversion from i8 to i32

UNREACHABLE executed at /home/etiotto/intel-xpu-backend-for-triton/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:1415!
...
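
For reference, a Python-level kernel in the same spirit as dot_combine (a hedged sketch, not the actual test body) that goes through the same i8-to-i32 path via an integer tl.dot followed by an elementwise combine:

import triton
import triton.language as tl

@triton.jit
def dot_combine_like_kernel(x):
    # Two constant int8 tiles; tl.dot on int8 operands accumulates into int32.
    a = tl.full((32, 32), 4, dtype=tl.int8)
    b = tl.full((32, 32), 2, dtype=tl.int8)
    d = tl.dot(a, b)
    # The "combine" step: an elementwise add on the int32 result.
    d = d + x
    tl.device_print("", d)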

[CI]: Create initial integration test runs

All pull requests against branch https://github.com/intel/intel-xpu-backend-for-triton/tree/llvm-target need to run basic integration tests. CI should:

  1. Clone and build MLIR from the following repository: https://github.com/intel/llvm/tree/genx
  2. Cache the libraries built in step 1
  3. Clone and build the code in the pull request
  4. Run the following tests:
    a. Triton C++ tests
    b. Triton LIT tests
    c. Triton core tests (Python end-to-end tests), running on a PVC GPU
    d. A subset of the Triton tutorial tests

LLVM Platform issue

I'm currently building an extension to Triton (pre-Hopper commit) for Apple Silicon, and I have settled on a method to do it:

  1. Convert TritonGPU to SPIRV; since this has already been implemented, I don't want to redo it myself if I don't have to
  2. Convert the .spv to human-readable Metal Shading Language (.metal) using SPIRV-Cross
  3. Compile .metal to .air
  4. Compile .air to .metallib
  5. Use metal-cpp to consume the .metallib with pybind11, which gets JIT compiled similarly to your usage of SYCL, but with Metal

Is this a good process, and how do the different LLVM versions play into it? Is the Intel fork of LLVM compatible with macos-arm64?
Any feedback would be great.

Conversion code for `triton_gpu.async_wait`

When the target is NVIDIA, the triton_gpu.async_wait operation is lowered to the following inline asm:

%2 = "llvm.inline_asm"() {asm_dialect = 0 : i64, asm_string = "cp.async.wait_group 0x4;", constraints = "", has_side_effects, operand_attrs = []} : () -> !llvm.void

We need to support the equivalent of the cp.async.wait_group operation.

The cp.async.wait_group instruction waits for completion of prior asynchronous copy operations. The syntax is:

cp.async.wait_group N;

The instruction causes the executing thread to wait until at most N of the most recent cp.async-groups are pending and all prior cp.async-groups committed by the executing thread are complete. For example, when N is 0, the executing thread waits for all prior cp.async-groups to complete. Operand N is an integer constant.
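
For context, these triton_gpu.async_* operations typically appear when Triton software-pipelines a dot loop; a sketch of such a kernel (shapes, names, and the num_stages value are illustrative, and the input pointers are assumed to be fp16):

import triton
import triton.language as tl

@triton.jit
def pipelined_dot_kernel(a_ptr, b_ptr, c_ptr, K,
                         BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # With num_stages > 1, the pipeliner turns these loads into asynchronous
        # copies guarded by async_commit_group / async_wait on NVIDIA targets.
        a = tl.load(a_ptr + offs_m[:, None] * K + (k + offs_k)[None, :])
        b = tl.load(b_ptr + (k + offs_k)[:, None] * BLOCK_N + offs_n[None, :])
        acc += tl.dot(a, b)
    tl.store(c_ptr + offs_m[:, None] * BLOCK_N + offs_n[None, :], acc)

# Example launch: pipelined_dot_kernel[(1,)](a, b, c, K, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32, num_stages=3)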

Port and run operator tests

We need to run the Python operator tests.

As part of this investigation we might discover missing functionality or expose functional issues in the current implementation. Separate work items may be added to track compiler work required to complete this task.

Merge OpenAI Triton till Feb 11

It is essential to keep up to date with OpenAI Triton in order to get the latest features and to reduce the difficulty of upstreaming our changes to OpenAI Triton.

Conversion code for `triton_gpu.async_commit_group`

When the target is NVIDIA, the triton_gpu.async_commit_group operation is lowered to the following inline asm:

  llvm.func @async_commit_group() attributes {nvvm.kernel = 1 : ui1, nvvm.maxntid = [128 : i32]} {
    ...
    %2 = llvm.inline_asm has_side_effects asm_dialect = att operand_attrs = [] "cp.async.commit_group ;", ""  : () -> !llvm.void
    llvm.return
  }

We need to support the equivalent operation of cp.async.commit_group.

Convert TritonGPU to XeGPU

It is roughly a 1:1 mapping:

  tt.make_tensor_ptr %shape %strides %offsets => xegpu.create_nd_tdesc %shape %strides %offsets
  tt.load %ptr => xegpu.load_nd %ptr
  tt.dot %a %b %c => xegpu.dpas %a %b %c
  tt.store %ptr, %value => xegpu.store_nd %value, %ptr
  tt.advance %ptr %offsets => xegpu.update_nd_offset %ptr %offsets

Example
TritonGPU code

// Total problem size: 8x16x1024.
// There is 1 workgroup; each workgroup calculates an [8x16 = 8x1024 * 1024x16] block.
// Each workgroup has 1 subgroup; each subgroup calculates an [8x16 = 8x1024 * 1024x16] block.
module {
tt.func @test_kernel(%arg0: !tt.ptr<f16, 1> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16, 1> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f16, 1> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
    %c0 = arith.constant 0 : i32
    %c16 = arith.constant 16 : i32
    %c8 = arith.constant 8 : i32
    %c1024 = arith.constant 1024 : i32
    %m = arith.constant 8 : i64
    %n = arith.constant 16 : i64
    %k = arith.constant 1024 : i64
    %c1_i64 = arith.constant 1 : i64
    %cst = arith.constant dense<0.000000e+00> : tensor<8x16xf32>
    %aPtr = tt.make_tensor_ptr %arg0, [%m, %k], [%k, %c1_i64], [%c0, %c0] {order = array<i32: 1, 0>} : <tensor<8x16xf16>, 1>
    %bPtr = tt.make_tensor_ptr %arg1, [%k, %n], [%n, %c1_i64], [%c0, %c0] {order = array<i32: 1, 0>} : <tensor<16x16xf16>, 1>
    %6:3 = scf.for %arg3 = %c0 to %c1024 step %c16 iter_args(%arg4 = %cst, %subA = %aPtr, %subB = %bPtr) -> (tensor<8x16xf32>, !tt.ptr<tensor<8x16xf16>, 1>, !tt.ptr<tensor<16x16xf16>, 1>) : i32 {
      %a = tt.load %subA {boundaryCheck = array<i32: 0, 1>, cache = 1 : i32, evict = 1 : i32, isVolatile = false} : !tt.ptr<tensor<8x16xf16>, 1> -> tensor<8x16xf16>
      %b = tt.load %subB {boundaryCheck = array<i32: 0, 1>, cache = 1 : i32, evict = 1 : i32, isVolatile = false, DotB = true} : !tt.ptr<tensor<16x16xf16>, 1> -> tensor<16x16xf16>
      %c = tt.dot %a, %b, %arg4 {allowTF32 = true, maxNumImpreciseAcc = 0 : i32} : tensor<8x16xf16> * tensor<16x16xf16> -> tensor<8x16xf32>
      %30 = tt.advance %aPtr, [%c0, %c16] : <tensor<8x16xf16>, 1>
      %31 = tt.advance %bPtr, [%c16, %c0] : <tensor<16x16xf16>, 1>
      scf.yield %c, %30, %31 : tensor<8x16xf32>, !tt.ptr<tensor<8x16xf16>, 1>, !tt.ptr<tensor<16x16xf16>, 1>
    }
    %value = arith.truncf %6#0 : tensor<8x16xf32> to tensor<8x16xf16>
    %cPtr = tt.make_tensor_ptr %arg2, [%m, %n], [%n, %c1_i64], [%c0, %c0] {order = array<i32: 1, 0>} : <tensor<8x16xf16>, 1>
    tt.store %cPtr, %value {boundaryCheck = array<i32: 0, 1>, cache = 1 : i32, evict = 1 : i32} : !tt.ptr<tensor<8x16xf16>, 1>, tensor<8x16xf16>
    tt.return

}
}

XeGPU code

   func.func @test_kernel(%arg0: memref<8x1024xf16>, %arg1: memref<1024x16xf16>, %arg2: memref<8x16xf16>) kernel attributes {
      %c0 = arith.constant 0 : index
      %c16 = arith.constant 16 : index
      %c8 = arith.constant 8 : index
      %c1024 = arith.constant 1024 : index
      %cst = arith.constant dense<0.0> :vector<8x16xf32>
        %7 = xegpu.create_nd_tdesc %arg0[%c0, %c0] {mode=vc}: memref<8x1024xf16> -> !xegpu.tensor_desc<8x16xf16>
        %8 = xegpu.create_nd_tdesc %arg1[%c0, %c0] {mode=vc}: memref<1024x16xf16> -> !xegpu.tensor_desc<16x16xf16>
      %6:3 = scf.for %arg3 = %c0 to %c1024 step %c16 iter_args(%arg4 = %cst, %subA = %7, %subB = %8) -> (vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>) {
        %9  = xegpu.load_nd %subA  {mode=vc, vnni_axis = 1}: !xegpu.tensor_desc<8x16xf16> -> vector<8x8x2xf16>
        %10 = xegpu.load_nd %subB  {mode=vc, vnni_axis = 0} : !xegpu.tensor_desc<16x16xf16> -> vector<8x16x2xf16>
        %11 = xegpu.dpas %9, %10, %arg4 {mode=vc}: vector<8x8x2xf16>, vector<8x16x2xf16>, vector<8x16xf32> -> vector<8x16xf32>
        %12 = xegpu.update_nd_offset %subA, [%c0, %c16] {mode=vc}: !xegpu.tensor_desc<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
        %13 = xegpu.update_nd_offset %subB, [%c16, %c0] {mode=vc}: !xegpu.tensor_desc<16x16xf16> -> !xegpu.tensor_desc<16x16xf16>
        scf.yield %11, %12, %13: vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>
      }
      %value = arith.truncf %6#0 : vector<8x16xf32> to vector<8x16xf16>
      %4 = xegpu.create_nd_tdesc %arg2[%c0, %c0] {mode = vc} : memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
      xegpu.store_nd %value, %4 {mode = vc}: vector<8x16xf16>, !xegpu.tensor_desc<8x16xf16>
      return
    }

[DPAS]: Use 2d-loads instruction to load the operand of `tt.dot`

The operands of Triton's tt.dot operation should be loaded using specialized instructions that load 2D blocks of the matrices.
Loading the operands in blocks is more efficient than loading them with regular loads (@llvm.genx.GenISA.LSCPrefetch).

We might need to leverage the semantic information associated with Triton's block pointers (https://triton-lang.org/main/getting-started/tutorials/08-experimental-block-pointer.html) in order to generate 2D block loads.
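
As an illustration, a block-pointer copy kernel (a sketch; the kernel and its parameter names are illustrative) in which each tl.load carries the shape/stride/offset information a backend needs to emit a 2D block load:

import triton
import triton.language as tl

@triton.jit
def block_copy_kernel(a_ptr, out_ptr, M, K, stride_am, stride_ak,
                      BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    pid = tl.program_id(0)
    # Block pointers describe the parent tensor (shape, strides) and the block
    # being accessed (offsets, block_shape, order).
    a_block = tl.make_block_ptr(base=a_ptr, shape=(M, K),
                                strides=(stride_am, stride_ak),
                                offsets=(pid * BLOCK_M, 0),
                                block_shape=(BLOCK_M, BLOCK_K), order=(1, 0))
    out_block = tl.make_block_ptr(base=out_ptr, shape=(M, K),
                                  strides=(stride_am, stride_ak),
                                  offsets=(pid * BLOCK_M, 0),
                                  block_shape=(BLOCK_M, BLOCK_K), order=(1, 0))
    tile = tl.load(a_block, boundary_check=(0, 1))
    tl.store(out_block, tile, boundary_check=(0, 1))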

Port and run `test_conversions.py`

intel-xpu-backend-for-triton/python/test/unit/language/test_conversions.py

As part of this investigation we might discover missing functionality or expose functional issues in the current implementation. Separate work items may be added to track compiler work required to complete this task.

Unknown CMake command "mlir_tablegen"

Could you please add documentation on building and running the examples? Thanks.

  1. Clone the repo
  2. mkdir build && cd build
  3. cmake ..

-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at include/triton/Conversion/TritonGPUToSPIRV/CMakeLists.txt:2 (mlir_tablegen):
Unknown CMake command "mlir_tablegen".

"python setup.py develop" when users do not have root permission

Could you suggest how to run "python setup.py develop" when users do not have root permission? Thanks.

# cd to triton root folder and checkout to pinned commit
cd ../..
git checkout `cat third_party/intel_xpu_backend/triton_hash.txt`
# Build triton with XPU backend enabled
cd python
TRITON_CODEGEN_INTEL_XPU_BACKEND=1 python setup.py develop

[block_pointers]: Enable test `test_block_pointer.py` on PVC

The upstream tests in test_block_pointer.py need to be ported to XPU. Support for block pointers is new and has never been exercised in our prototype; by attempting to run these tests, we might expose functional features that need to be implemented.

Can I help by providing early feedback?

I've been exploring writing kernels with numba_dpex, which brings a numba.cuda-like interface to XPU devices, and I've also been following this project with great interest. If you can provide early install instructions, I'd be very interested in testing this backend on Intel XPU devices (Flex 170 GPU and iGPUs) and giving feedback.

Port and run runtime tests

Port and run the tests in the runtime directory.

As part of this investigation we might discover missing functionality or expose functional issues in the current implementation. Separate work items may be added to track compiler work required to complete this task.

Port and run Triton tutorials

We are currently running only the first 3 tutorial tests; we need to enable all the remaining tutorial tests.

As part of this investigation we might discover missing functionality or expose functional issues in the current implementation. Separate work items may be added to track compiler work required to complete this task.

[CI]: Enhance basic integration tests

Work item #144 establishes the basic integration tests (CI) for the Triton compiler generating LLVM IR. The goal of this work item is to enhance CI and bring it closer to the CI system used upstream.

As inspiration for this work item, we should analyze the following workloads and determine whether we want to adopt them as part of CI.

AttributeError: 'Stream' object has no attribute 'sycl_queue'

Running the vadd example shows the following error. Thank you for any suggestions.

/path/to/lib/python3.10/site-packages/triton/third_party/xpu/__init__.py", line 400, in get_stream
    return torch.xpu.current_stream(idx).sycl_queue
AttributeError: 'Stream' object has no attribute 'sycl_queue'
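
A quick check (a sketch; it assumes intel_extension_for_pytorch is installed and provides torch.xpu) to see whether the installed IPEX build exposes the SYCL queue on its Stream objects:

import torch
import intel_extension_for_pytorch  # noqa: F401  (assumed: provides torch.xpu)

stream = torch.xpu.current_stream()
# Builds that do not expose the underlying SYCL queue trigger the AttributeError above.
print(type(stream), hasattr(stream, "sycl_queue"))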
