intel / intel-xpu-backend-for-triton
OpenAI Triton backend for Intel® GPUs
License: MIT License
This issue will be auto-commented by GitHub Actions when a nightly failure is detected, to notify the relevant owners.
DPAS instruction is not generated where needed
2D-block load/store are not generated where needed
prefetch instruction is not inserted where needed
Add test_subprocess.py to CI and test-triton.sh.
intel/intel-extension-for-pytorch#489
The tests in https://github.com/openai/triton/tree/main/python/test/unit need to pass 100%.
To complete this task we need to finish the following work items:
The code in the llvm-target branch is a fork of the OpenAI Triton code with modifications in several files. The structure of the project mirrors the structure of the AMD port. The objective of this work item is to make the project use the Triton plugin infrastructure. The work is similar to what is required for the AMD port, which is also a fork, so we should determine whether we can use the same mechanism.
The end-to-end core test test_masked_load_shared_memory runs correctly on a PVC GPU when the input is float32 or float16, but fails when the input is bfloat16.
The current implementation of tt.atomic_rmw is functional for several data types but, crucially, it doesn't yet work for float16. The issue is in IGC.
Currently the tt.dot operation is lowered to a loop containing scalar (FMA) instructions. This works from a functional perspective but performs poorly.
This work item entails extending the conversion code for tt.dot so that it generates DPAS instruction(s) via the GENX dialect operation @llvm.genx.GenISA.sub.group.dpas.
In this first step performance is not the main objective. The operands of the tt.dot operation are expected to be put into shared local memory by converting from a blocked layout to a shared layout (as is the case for NVIDIA). Once in shared memory, the operands are to be converted to a dot layout (with an underlying DPAS layout).
Efforts/experiments to elide the blocked-layout-to-shared-layout conversions should be handled separately from this work item.
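For context, a kernel whose tl.dot should exercise this lowering can be as small as the following sketch (ordinary Triton, not backend-specific code; names, shapes, and block sizes are illustrative):

import triton
import triton.language as tl

@triton.jit
def dot_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
               stride_am, stride_ak, stride_bk, stride_bn,
               stride_cm, stride_cn,
               BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        # tl.dot becomes tt.dot in Triton IR; on PVC the goal is to lower
        # it to DPAS instead of a scalar FMA loop.
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)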
Hi, I successfully built triton 2.1.0 at commit 5df9042 with the xpu-backend at 0bcc485 and added some modifications to be compatible with ipex-2.0. I can successfully pass the xpu pytest in triton with pytest -xvs test/backend/third_party_backends/test_xpu_backend.py --backend xpu.
However, when I run DeepSpeed inference on GPT-J, there are errors like Translate to SPIRV IR failed / LLVM ERROR: Failed to translate TritonGPU to SPIRV IR.
Is the support still limited to specific operations, or is it a bug? Also, I cannot successfully compile the latest main branch of intel-xpu-backend-for-triton; it fails with:
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/Utility.h:11:62: error: template argument 1 is invalid
11 | #define ptrtoint(...) rewriter.create<spirv::ConvertPtrToUOp>(loc, __VA_ARGS__)
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/ConvertLayoutOpToSPIRV.cpp:220:25: note: in expansion of macro ‘ptrtoint’
220 | currVal = ptrtoint(llvmElemTy, currVal);
| ^~~~~~~~
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/Utility.h:10:46: error: ‘ConvertUToPtrOp’ is not a member of ‘mlir::spirv’; did you mean ‘ConvertUToFOp’?
10 | #define inttoptr(...) rewriter.create<spirv::ConvertUToPtrOp>(loc, __VA_ARGS__)
| ^~~~~~~~~~~~~~~
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/Utility.h:10:46: note: in definition of macro ‘inttoptr’
10 | #define inttoptr(...) rewriter.create<spirv::ConvertUToPtrOp>(loc, __VA_ARGS__)
| ^~~~~~~~~~~~~~~
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/Utility.h:10:62: error: no matching function for call to ‘mlir::ConversionPatternRewriter::create<<expression error> >(mlir::Location&, mlir::Type&, mlir::Value&)’
10 | #define inttoptr(...) rewriter.create<spirv::ConvertUToPtrOp>(loc, __VA_ARGS__)
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
Complete log:
[0] [2023-08-03 01:28:35,085] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.1+7de73909, git-hash=7de73909, git-branch=master
[0] [2023-08-03 01:28:35,085] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[0] [2023-08-03 01:28:35,085] [INFO] [logging.py:96:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[0] [2023-08-03 01:28:35,130] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 4096, 'intermediate_size': 16384, 'heads': 16, 'num_hidden_layers': -1, 'dtype': torch.float16, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': 64, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': False, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': True, 'triton_autotune': False}
[0] [2023-08-03 01:28:35,130] [INFO] [logging.py:96:log_dist] [Rank 0] Injecting Triton kernels ...
Loading 1 checkpoint shards: 0%| | 0/1 [00:00<?, ?it/s][0] checkpoint loading time at rank 0: 48.9048216342926 sec
Loading 1 checkpoint shards: 100%|██████████| 1/1 [00:48<00:00, 48.90s/it]
[0] [2023-08-03 01:29:24,846] [INFO] [utils.py:798:see_memory_usage] post-ds-inference-init
[0] [2023-08-03 01:29:24,846] [INFO] [utils.py:799:see_memory_usage] MA 11.27 GB Max_MA 11.27 GB CA 11.27 GB Max_CA 11 GB
[0] [2023-08-03 01:29:24,846] [INFO] [utils.py:806:see_memory_usage] CPU Virtual Memory: used = 11.66 GB, percent = 1.2%
[0] *** Starting to generate 32 tokens with bs=1
[0] Generate args {'max_new_tokens': 32, 'do_sample': False, 'num_beams': 1, 'token_latency': True}
[0] *** Prompt size: 32
[0] [2023-08-03 01:29:24,933] [INFO] [utils.py:798:see_memory_usage] end-of-run
[0] [2023-08-03 01:29:24,934] [INFO] [utils.py:799:see_memory_usage] MA 11.27 GB Max_MA 11.27 GB CA 11.27 GB Max_CA 11 GB
[0] [2023-08-03 01:29:24,934] [INFO] [utils.py:806:see_memory_usage] CPU Virtual Memory: used = 11.66 GB, percent = 1.2%
[0] *** Running benchmark[0]
[0] Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[0] >>> inner function (32,)
[0] <function get_backend at 0x7f6b81fe3910>
[0] <triton.third_party.xpu.XPUBackend object at 0x7f6acea6d960>
[0] /home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/xpu/cpp_extension.py:1373: UserWarning: This extension has static linked onednn library. Please attaction to that, this path of onednn version maybe not match with the built-in version.
[0] warnings.warn(
[0] 2023-08-03 01:29:25,509 - root - INFO - running build_ext
[0] 2023-08-03 01:29:25,514 - root - INFO - building 'layer_norm_kernel' extension
[0] 2023-08-03 01:29:25,514 - root - INFO - creating /tmp/tmp2a44hft2/tmp
[0] 2023-08-03 01:29:25,514 - root - INFO - creating /tmp/tmp2a44hft2/tmp/tmp2a44hft2
[0] 2023-08-03 01:29:25,515 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icx -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2023.1.0/linux/include -I/opt/intel/oneapi/compiler/2023.1.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2023.1.0/include -I/home/xxx/onednn/include -I/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/include -I/home/xxx/miniconda3/envs/llm/include/python3.10 -c /tmp/tmp2a44hft2/main.c -o /tmp/tmp2a44hft2/tmp/tmp2a44hft2/main.o -fPIC -fPIC -w -fsycl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=layer_norm_kernel -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[0] 2023-08-03 01:29:34,504 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx -shared -fsycl /tmp/tmp2a44hft2/tmp/tmp2a44hft2/main.o -lze_loader -o /tmp/tmp2a44hft2/layer_norm_kernel.cpython-310-x86_64-linux-gnu.so -lc10 -ltorch_cpu -ltorch -ltorch_python -L/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/lib -L/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/lib -L/home/xxx/onednn/lib -Wl,--start-group /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sycl.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_intel_ilp64.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -lm -ldl -ldnnl -lintel-ext-pt-gpu
[0] >>> inner function <function Fp16Matmul.forward.<locals>.<lambda> at 0x7f6ace1720e0>
[0] <function get_backend at 0x7f6b81fe3910>
[0] <triton.third_party.xpu.XPUBackend object at 0x7f6acea6d960>[0]
[0] /home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/xpu/cpp_extension.py:1373: UserWarning: This extension has static linked onednn library. Please attaction to that, this path of onednn version maybe not match with the built-in version.
[0] warnings.warn(
[0] 2023-08-03 01:29:50,809 - root - INFO - running build_ext
[0] 2023-08-03 01:29:50,812 - root - INFO - building '_fp_matmul' extension
[0] 2023-08-03 01:29:50,812 - root - INFO - creating /tmp/tmp7aguxrst/tmp
[0] 2023-08-03 01:29:50,812 - root - INFO - creating /tmp/tmp7aguxrst/tmp/tmp7aguxrst
[0] 2023-08-03 01:29:50,812 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icx -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2023.1.0/linux/include -I/opt/intel/oneapi/compiler/2023.1.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2023.1.0/include -I/home/xxx/onednn/include -I/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/include -I/home/xxx/miniconda3/envs/llm/include/python3.10 -c /tmp/tmp7aguxrst/main.c -o /tmp/tmp7aguxrst/tmp/tmp7aguxrst/main.o -fPIC -fPIC -w -fsycl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=_fp_matmul -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[0] 2023-08-03 01:29:59,847 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx -shared -fsycl /tmp/tmp7aguxrst/tmp/tmp7aguxrst/main.o -lze_loader -o /tmp/tmp7aguxrst/_fp_matmul.cpython-310-x86_64-linux-gnu.so -lc10 -ltorch_cpu -ltorch -ltorch_python -L/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/lib -L/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/lib -L/home/xxx/onednn/lib -Wl,--start-group /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sycl.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_intel_ilp64.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -lm -ldl -ldnnl -lintel-ext-pt-gpu
[0] >>> inner function <function Fp16Matmul._score_4d_matmul.<locals>.<lambda> at 0x7f6ace1720e0>
[0] <function get_backend at 0x7f6b81fe3910>
[0] <triton.third_party.xpu.XPUBackend object at 0x7f6acea6d960>
[0] /home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/xpu/cpp_extension.py:1373: UserWarning: This extension has static linked onednn library. Please attaction to that, this path of onednn version maybe not match with the built-in version.
[0] warnings.warn(
[0] 2023-08-03 01:30:23,065 - root - INFO - running build_ext
[0] 2023-08-03 01:30:23,068 - root - INFO - building 'matmul_4d_kernel' extension
[0] 2023-08-03 01:30:23,069 - root - INFO - creating /tmp/tmpsihcyg_g/tmp
[0] 2023-08-03 01:30:23,069 - root - INFO - creating /tmp/tmpsihcyg_g/tmp/tmpsihcyg_g
[0] 2023-08-03 01:30:23,069 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icx -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2023.1.0/linux/include -I/opt/intel/oneapi/compiler/2023.1.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2023.1.0/include -I/home/xxx/onednn/include -I/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/include -I/home/xxx/miniconda3/envs/llm/include/python3.10 -c /tmp/tmpsihcyg_g/main.c -o /tmp/tmpsihcyg_g/tmp/tmpsihcyg_g/main.o -fPIC -fPIC -w -fsycl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=matmul_4d_kernel -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[0] 2023-08-03 01:30:32,197 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx -shared -fsycl /tmp/tmpsihcyg_g/tmp/tmpsihcyg_g/main.o -lze_loader -o /tmp/tmpsihcyg_g/matmul_4d_kernel.cpython-310-x86_64-linux-gnu.so -lc10 -ltorch_cpu -ltorch -ltorch_python -L/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/lib -L/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/lib -L/home/xxx/onednn/lib -Wl,--start-group /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sycl.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_intel_ilp64.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -lm -ldl -ldnnl -lintel-ext-pt-gpu
[0] >>> inner function (512,)
[0] <function get_backend at 0x7f6b81fe3910>
[0] <triton.third_party.xpu.XPUBackend object at 0x7f6acea6d960>
[0] /home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/xpu/cpp_extension.py:1373: UserWarning: This extension has static linked onednn library. Please attaction to that, this path of onednn version maybe not match with the built-in version.
[0] warnings.warn(
[0] 2023-08-03 01:30:49,560 - root - INFO - running build_ext
[0] 2023-08-03 01:30:49,563 - root - INFO - building 'softmax_kernel' extension
[0] 2023-08-03 01:30:49,563 - root - INFO - creating /tmp/tmpxm55vb5j/tmp
[0] 2023-08-03 01:30:49,563 - root - INFO - creating /tmp/tmpxm55vb5j/tmp/tmpxm55vb5j
[0] 2023-08-03 01:30:49,563 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icx -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2023.1.0/linux/include -I/opt/intel/oneapi/compiler/2023.1.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2023.1.0/include -I/home/xxx/onednn/include -I/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/include -I/home/xxx/miniconda3/envs/llm/include/python3.10 -c /tmp/tmpxm55vb5j/main.c -o /tmp/tmpxm55vb5j/tmp/tmpxm55vb5j/main.o -fPIC -fPIC -w -fsycl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=softmax_kernel -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[0] 2023-08-03 01:30:58,436 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx -shared -fsycl /tmp/tmpxm55vb5j/tmp/tmpxm55vb5j/main.o -lze_loader -o /tmp/tmpxm55vb5j/softmax_kernel.cpython-310-x86_64-linux-gnu.so -lc10 -ltorch_cpu -ltorch -ltorch_python -L/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/lib -L/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/lib -L/home/xxx/onednn/lib -Wl,--start-group /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sycl.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_intel_ilp64.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -lm -ldl -ldnnl -lintel-ext-pt-gpu
[0] loc("-":17:13): error: external function is unhandled
[0] Translate to SPIRV IR failedLLVM ERROR: Failed to translate TritonGPU to SPIRV IR.
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 1046382 RUNNING AT
= KILLED BY SIGNAL: 6 (Aborted)
===================================================================================
Port and run (on PVC) the test contained in test_line_info.py.
I'd like to try the compile bundle from #83, but the very last step (building triton) fails with:
/usr/bin/ld: cannot find -lz
A solution is suggested at https://github.com/intel/intel-xpu-backend-for-triton/wiki/Possible-Build-Bugs#1-usrbinld-cannot-find--lz, but I'm not root on the system so I can't install that package.
I also have zlib installed in the conda environment, and I tried extracting the package and using LD_LIBRARY_PATH to point to other locations, but it doesn't help.
Is there a way to do this without the system zlib?
Currently (as of Jan 17, 2024) the Triton compiler in https://github.com/intel/intel-xpu-backend-for-triton/tree/llvm-target works with a custom version of torch 2.1.0. The objective of this work item is to make it work with torch 2.4.
PR #190 exposed an issue in compiling one of the tests contained in test_line_info.py. The dot_combine kernel fails to compile cleanly. To reproduce the issue, comment out all tests except dot_combine and execute:
python3 -m pytest --verbose --device xpu test_line_info.py
The backtrace is:
test_line_info.py::test_line_info[dot_combine] Fatal Python error: Aborted
Current thread 0x00007f291caa52c0 (most recent call first):
File "/home/etiotto/intel-xpu-backend-for-triton/python/triton/compiler/backends/xpu.py", line 191 in make_llir
File "/home/etiotto/intel-xpu-backend-for-triton/python/triton/compiler/backends/xpu.py", line 208 in <lambda>
File "/home/etiotto/intel-xpu-backend-for-triton/python/triton/compiler/compiler.py", line 200 in compile
Reduced test case:
#blocked = #triton_gpu.blocked<{sizePerThread = [2, 2], threadsPerWarp = [2, 16], warpsPerCTA = [4, 1], order = [1, 0], CTAsPerCGA = [1, 1], CTASplitNum = [1, 1], CTAOrder = [1, 0]}>
module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 4 : i32} {
tt.func public @kernel_dot_combine_0(%op1: tensor<2x2xi8, #triton_gpu.dot_op<{opIdx = 0, parent = #blocked}>>,
%op2: tensor<2x2xi8, #triton_gpu.dot_op<{opIdx = 1, parent = #blocked}>>,
%cst : tensor<2x2xi32, #blocked>) {
%0 = tt.dot %op1, %op2, %cst {allowTF32 = true, maxNumImpreciseAcc = 0 : i32} : tensor<2x2xi8, #triton_gpu.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<2x2xi8, #triton_gpu.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<2x2xi32, #blocked>
tt.return
}
}
To reproduce using the reduced test case:
% triton-opt -split-input-file --convert-triton-gpu-to-llvm="target=genx" t1.ttgir
Unsupported conversion from i8 to i32
UNREACHABLE executed at /home/etiotto/intel-xpu-backend-for-triton/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:1415!
...
Split into a separate issue from #140.
All pull requests against branch https://github.com/intel/intel-xpu-backend-for-triton/tree/llvm-target need to run basic integration tests. CI should:
The end-to-end core test test_typeconvert_downcast in test_conversions.py fails to run correctly on a PVC GPU.
I'm currently building an extension to triton (pre-Hopper commit) for Apple Silicon, and I have settled on a method to do that:
Is this a good process, and how will the different LLVM versions play into it? Is the Intel fork of LLVM compatible with macos-arm64?
Any feedback would be great.
When the target is NVIDIA, the triton_gpu.async_wait operation is lowered to the following inline asm:
%2 = "llvm.inline_asm"() {asm_dialect = 0 : i64, asm_string = "cp.async.wait_group 0x4;", constraints = "", has_side_effects, operand_attrs = []} : () -> !llvm.void
We need to support the equivalent of the cp.async.wait_group operation. The cp.async.wait_group instruction waits for completion of prior asynchronous copy operations. The syntax is:
cp.async.wait_group N;
The instruction causes the executing thread to wait until only N or fewer of the most recent cp.async groups are pending and all prior cp.async groups committed by the executing thread are complete. For example, when N is 0, the executing thread waits for all prior cp.async groups to complete. Operand N is an integer constant.
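To make the group-counting semantics concrete, here is a small Python model of commit_group/wait_group (purely illustrative; not a real API, and the copy handles are hypothetical):

# FIFO of committed groups; each group is a list of outstanding copies.
pending_groups = []

def commit_group(copies):
    # cp.async.commit_group: bundle all copies issued since the last
    # commit into one group.
    pending_groups.append(copies)

def wait_group(n):
    # cp.async.wait_group N: block until at most n groups are pending;
    # n == 0 waits for every previously committed group.
    while len(pending_groups) > n:
        for copy in pending_groups.pop(0):
            copy.wait()  # hypothetical completion handle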
Rewrite the Triton FE code we have ported to use the SYCL runtime rather than the L0 runtime.
The end-to-end core test test_chain_reduce runs correctly on a PVC GPU when the input operand has a blocked layout. However, it fails when the operand has a DPAS layout.
The end-to-end core test test_locality fails to run correctly on a PVC GPU. We need to investigate the cause of the functional problem(s).
It is essential to keep up to date with OpenAI Triton, both to get the latest features and to reduce the difficulty of upstreaming our changes to OpenAI Triton.
Add test_annotations.py to CI and the test-triton.sh script.
It works when the data type is float32 but fails if the data type is float16.
The end-to-end core test test_scan_layouts fails to run correctly on a PVC GPU. We need to investigate the cause of the functional problem(s).
When the target is NVIDIA, the triton_gpu.async_commit_group operation is lowered to the following inline asm:
llvm.func @async_commit_group() attributes {nvvm.kernel = 1 : ui1, nvvm.maxntid = [128 : i32]} {
...
%2 = llvm.inline_asm has_side_effects asm_dialect = att operand_attrs = [] "cp.async.commit_group ;", "" : () -> !llvm.void
llvm.return
}
We need to support the equivalent operation of cp.async.commit_group.
It is roughly a 1:1 mapping:
tt.make_tensor_ptr %shape %strides %offsets => xegpu.create_nd_tdesc %shape %strides %offsets
tt.load %ptr => xegpu.load_nd %ptr
tt.dot %a %b %c => xegpu.dpas %a %b %c
tt.store %ptr, %value => xegpu.store_nd %ptr, %value
tt.advance %ptr %offsets => xegpu.update_nd_offset %ptr %offsets
Example
TritonGPU code:
// Total problem size: 8x16x1024.
// There is 1 workgroup; each workgroup calculates an [8x16 = 8x1024 * 1024x16] block.
// Each workgroup has 1 subgroup; each subgroup calculates an [8x16 = 8x1024 * 1024x16] block.
module {
tt.func @test_kernel(%arg0: !tt.ptr<f16, 1> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16, 1> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f16, 1> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
%c0 = arith.constant 0 : i32
%c16 = arith.constant 16 : i32
%c8 = arith.constant 8 : i32
%c1024 = arith.constant 1024 : i32
%m = arith.constant 8 : i64
%n = arith.constant 16 : i64
%k = arith.constant 1024 : i64
%c1_i64 = arith.constant 1 : i64
%cst = arith.constant dense<0.000000e+00> : tensor<8x16xf32>
%aPtr = tt.make_tensor_ptr %arg0, [%m, %k], [%k, %c1_i64], [%c0, %c0] {order = array<i32: 1, 0>} : <tensor<8x16xf16>, 1>
%bPtr = tt.make_tensor_ptr %arg1, [%k, %n], [%n, %c1_i64], [%c0, %c0] {order = array<i32: 1, 0>} : <tensor<16x16xf16>, 1>
%6:3 = scf.for %arg3 = %c0 to %c1024 step %c16 iter_args(%arg4 = %cst, %subA = %aPtr, %subB = %bPtr) -> (tensor<8x16xf32>, !tt.ptr<tensor<8x16xf16>, 1>, !tt.ptr<tensor<16x16xf16>, 1>) : i32 {
%a = tt.load %subA {boundaryCheck = array<i32: 0, 1>, cache = 1 : i32, evict = 1 : i32, isVolatile = false} : !tt.ptr<tensor<8x16xf16>, 1> -> tensor<8x16xf16>
%b = tt.load %subB {boundaryCheck = array<i32: 0, 1>, cache = 1 : i32, evict = 1 : i32, isVolatile = false, DotB = true} : !tt.ptr<tensor<16x16xf16>, 1> -> tensor<16x16xf16>
%c = tt.dot %a, %b, %arg4 {allowTF32 = true, maxNumImpreciseAcc = 0 : i32} : tensor<8x16xf16> * tensor<16x16xf16> -> tensor<8x16xf32>
%30 = tt.advance %aPtr, [%c0, %c16] : <tensor<8x16xf16>, 1>
%31 = tt.advance %bPtr, [%c16, %c0] : <tensor<16x16xf16>, 1>
scf.yield %c, %30, %31 : tensor<8x16xf32>, !tt.ptr<tensor<8x16xf16>, 1>, !tt.ptr<tensor<16x16xf16>, 1>
}
%value = arith.truncf %6#0 : tensor<8x16xf32> to tensor<8x16xf16>
%cPtr = tt.make_tensor_ptr %arg2, [%m, %n], [%n, %c1_i64], [%c0, %c0] {order = array<i32: 1, 0>} : <tensor<8x16xf16>, 1>
tt.store %cPtr, %value {boundaryCheck = array<i32: 0, 1>, cache = 1 : i32, evict = 1 : i32} : !tt.ptr<tensor<8x16xf16>, 1>, tensor<8x16xf16>
tt.return
}
}
XeGPU code:
func.func @test_kernel(%arg0: memref<8x1024xf16>, %arg1: memref<1024x16xf16>, %arg2: memref<8x16xf16>) kernel attributes {
%c0 = arith.constant 0 : index
%c16 = arith.constant 16 : index
%c8 = arith.constant 8 : index
%c1024 = arith.constant 1024 : index
%cst = arith.constant dense<0.0> :vector<8x16xf32>
%7 = xegpu.create_nd_tdesc %arg0[%c0, %c0] {mode=vc}: memref<8x1024xf16> -> !xegpu.tensor_desc<8x16xf16>
%8 = xegpu.create_nd_tdesc %arg1[%c0, %c0] {mode=vc}: memref<1024x16xf16> -> !xegpu.tensor_desc<16x16xf16>
%6:3 = scf.for %arg3 = %c0 to %c1024 step %c16 iter_args(%arg4 = %cst, %subA = %7, %subB = %8) -> (vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>) {
%9 = xegpu.load_nd %subA {mode=vc, vnni_axis = 1}: !xegpu.tensor_desc<8x16xf16> -> vector<8x8x2xf16>
%10 = xegpu.load_nd %subB {mode=vc, vnni_axis = 0} : !xegpu.tensor_desc<16x16xf16> -> vector<8x16x2xf16>
%11 = xegpu.dpas %9, %10, %arg4 {mode=vc}: vector<8x8x2xf16>, vector<8x16x2xf16>, vector<8x16xf32> -> vector<8x16xf32>
%12 = xegpu.update_nd_offset %subA, [%c0, %c16] {mode=vc}: !xegpu.tensor_desc<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
%13 = xegpu.update_nd_offset %subB, [%c16, %c0] {mode=vc}: !xegpu.tensor_desc<16x16xf16> -> !xegpu.tensor_desc<16x16xf16>
scf.yield %11, %12, %13: vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>
}
%value = arith.truncf %6#0 : vector<8x16xf32> to vector<8x16xf16>
%4 = xegpu.create_nd_tdesc %arg2[%c0, %c0] {mode = vc} : memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
xegpu.store_nd %value, %4 {mode = vc}: vector<8x16xf16>, !xegpu.tensor_desc<8x16xf16>
return
}
The operands of Triton's tt.dot operation should be loaded using specialized instructions that load 2D blocks of the matrices. Loading the operands in blocks is more efficient than loading them with regular loads (cf. the @llvm.genx.GenISA.LSCPrefetch intrinsic).
We might need to leverage the semantic information associated with Triton's blocked pointers (https://triton-lang.org/main/getting-started/tutorials/08-experimental-block-pointer.html) in order to generate 2D block loads.
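For reference, a kernel written with Triton's experimental block-pointer API carries exactly the shape/stride/order information a 2D block load needs. A minimal sketch (row-major operands and a float16 output assumed, mirroring the truncf in the XeGPU example above; names and block sizes illustrative):

import triton
import triton.language as tl

@triton.jit
def block_ptr_dot(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Block pointers bundle base, shape, strides, and order: the semantic
    # information needed to emit 2D block loads/stores.
    a_block = tl.make_block_ptr(base=a_ptr, shape=(M, K), strides=(K, 1),
                                offsets=(0, 0), block_shape=(BLOCK_M, BLOCK_K),
                                order=(1, 0))
    b_block = tl.make_block_ptr(base=b_ptr, shape=(K, N), strides=(N, 1),
                                offsets=(0, 0), block_shape=(BLOCK_K, BLOCK_N),
                                order=(1, 0))
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_block, boundary_check=(0, 1))
        b = tl.load(b_block, boundary_check=(0, 1))
        acc += tl.dot(a, b)
        a_block = tl.advance(a_block, (0, BLOCK_K))
        b_block = tl.advance(b_block, (BLOCK_K, 0))
    c_block = tl.make_block_ptr(base=c_ptr, shape=(M, N), strides=(N, 1),
                                offsets=(0, 0), block_shape=(BLOCK_M, BLOCK_N),
                                order=(1, 0))
    tl.store(c_block, acc.to(tl.float16), boundary_check=(0, 1))

tl.make_block_ptr, tl.load, tl.advance, and tl.store here correspond to the tt.make_tensor_ptr, tt.load, tt.advance, and tt.store operations shown in the TritonGPU IR above.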
intel-xpu-backend-for-triton/python/test/unit/language/test_conversions.py
As part of this investigation we might discover missing functionality or expose functional issues in the current implementation. Separate work items may be added to track compiler work required to complete this task.
Could you please add documentation on building and running examples? Thanks.
1. Clone the repo
2. mkdir build && cd build
3. cmake ..
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at include/triton/Conversion/TritonGPUToSPIRV/CMakeLists.txt:2 (mlir_tablegen):
Unknown CMake command "mlir_tablegen".
Could you suggest the command "python setup.py develop" for users who do not have root permission? Thanks.
# cd to triton root folder and checkout to pinned commit
cd ../..
git checkout `cat third_party/intel_xpu_backend/triton_hash.txt`
# Build triton with XPU backend enabled
cd python
TRITON_CODEGEN_INTEL_XPU_BACKEND=1 python setup.py develop
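After the build, a quick smoke test is a minimal vector-add kernel (a sketch; it assumes a working XPU build plus intel_extension_for_pytorch to register the xpu device, and is not the repository's own vadd example):

import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the "xpu" device)
import triton
import triton.language as tl

@triton.jit
def vadd_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

n = 1 << 16
x = torch.rand(n, device="xpu")
y = torch.rand(n, device="xpu")
out = torch.empty_like(x)
vadd_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
assert torch.allclose(out, x + y)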
The upstream tests in test_block_pointer.py need to be ported to XPU. Support for blocked pointers is new and has never been exercised in our prototype; by attempting to run these tests we might expose functionality that still needs to be implemented.
I've been exploring writing kernels with numba_dpex, which brings a numba.cuda-like interface to XPU devices, and I've also been following this project with great interest. If you can provide me with early install instructions, I'd be very interested in testing this backend early on Intel XPU devices (Flex 170 GPU and iGPUs) and giving feedback.
Test cases for the tt.random.* operations are in test_random.py, which needs to be ported to XPU.
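For reference, tl.rand is the language-level construct these tests exercise; a minimal sketch of the kind of kernel involved (names illustrative):

import triton
import triton.language as tl

@triton.jit
def rand_kernel(out_ptr, seed, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # Philox-based uniform random numbers in [0, 1).
    r = tl.rand(seed, offs)
    tl.store(out_ptr + offs, r, mask=mask)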
We are currently running only the first 3 tutorial tests and we need to enable all the remaining tutorial tests:
As part of this investigation we might discover missing functionality or expose functional issues in the current implementation. Separate work items may be added to track compiler work required to complete this task.
The current Triton compiler uses the oneAPI libraries. It works when oneAPI is 2023.2 but fails with 2024.0. Investigate the reason(s) and address the problem(s).
Work item #144 establishes the base integration tests (CI) for the Triton compiler generating LLVM IR. The goal of this work item is to enhance CI to bring it closer to the CI system used upstream.
As inspiration for this work item we should analyze the following workloads and determine whether we want to adopt them as part of CI:
The end-to-end core test test_scan2d fails to run correctly on a PVC GPU for certain combinations of arguments/attributes (axis = 1, num_warps = 4, shape = [1024, 2]).
The end-to-end core test test_reduce_layouts runs correctly on a PVC GPU but fails for certain combinations of arguments/attributes. Specifically, it fails when first_axis == 0 and op == "max" and [M, N] in [[128, 128], [256, 128]].
The goal is to understand the project at https://github.com/microsoft/triton-shared and determine how it fits within our Triton port.
Running the vadd example shows the following error. Thank you for the suggestion.
/path/to/lib/python3.10/site-packages/triton/third_party/xpu/__init__.py", line 400, in get_stream
return torch.xpu.current_stream(idx).sycl_queue
AttributeError: 'Stream' object has no attribute 'sycl_queue'