intel / intel-xpu-backend-for-triton
OpenAI Triton backend for Intel® GPUs
License: MIT License
This issue will be auto-commented by GitHub Actions when a nightly failure is detected, to notify the relevant owners.
DPAS instruction is not generated where needed
2D-block load/store are not generated where needed
prefetch instruction is not inserted where needed
Add test_subprocess.py to CI and test-triton.sh.
intel/intel-extension-for-pytorch#489
The tests in https://github.com/openai/triton/tree/main/python/test/unit need to pass 100%.
To complete this task we need to finish the following work items:
The code in the llvm-target branch is a fork of the OpenAI Triton code with modifications in several files. The structure of the project mirrors the structure of the AMD port. The objective of this work item is to make the project use the Triton plugin infrastructure. The work is similar to what is required for the AMD port, which is also a fork, so we should determine whether we can use the same mechanism.
The end-to-end core test test_masked_load_shared_memory runs correctly on a PVC GPU when the input is float32 or float16, but fails when the input is bfloat16.
The current implementation of tt.atomic_rmw is functional for several data types but, crucially, it doesn't yet work for float16. The issue is in IGC.
Currently the tt.dot operation is lowered to a loop containing scalar (FMA) instructions. This works from a functional perspective but performs poorly.
This work item entails extending the conversion code for tt.dot so that it generates DPAS instruction(s) via the GENX dialect operation @llvm.genx.GenISA.sub.group.dpas.
In this first step performance is not the main objective. The operands of the tt.dot operation are expected to be put into shared local memory by converting from a blocked layout to a shared layout (as is the case for NVIDIA). Once in shared memory, the operands are to be converted to a dot layout (with an underlying DPAS layout).
Efforts/experiments to elide the blocked-layout-to-shared-layout conversions should be handled separately from this work item.
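For context, a kernel whose tl.dot should exercise this lowering can be as small as the following sketch (ordinary Triton, not backend-specific code; names, shapes, and block sizes are illustrative):

import triton
import triton.language as tl

@triton.jit
def dot_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
               stride_am, stride_ak, stride_bk, stride_bn,
               stride_cm, stride_cn,
               BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        # tl.dot becomes tt.dot in Triton IR; on PVC the goal is to lower
        # it to DPAS instead of a scalar FMA loop.
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)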
Hi, I successfully built triton 2.1.0 at commit 5df9042 with the xpu-backend at 0bcc485 and added some modifications to be compatible with ipex-2.0. I can successfully pass the xpu pytest in triton with pytest -xvs test/backend/third_party_backends/test_xpu_backend.py --backend xpu.
However, when I run DeepSpeed inference on GPT-J, there are errors like Translate to SPIRV IR failed / LLVM ERROR: Failed to translate TritonGPU to SPIRV IR.
Is the support still limited to specific operations, or is it a bug? Also, I cannot successfully compile the latest main branch of intel-xpu-backend-for-triton; it fails with:
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/Utility.h:11:62: error: template argument 1 is invalid
11 | #define ptrtoint(...) rewriter.create<spirv::ConvertPtrToUOp>(loc, __VA_ARGS__)
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/ConvertLayoutOpToSPIRV.cpp:220:25: note: in expansion of macro ‘ptrtoint’
220 | currVal = ptrtoint(llvmElemTy, currVal);
| ^~~~~~~~
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/Utility.h:10:46: error: ‘ConvertUToPtrOp’ is not a member of ‘mlir::spirv’; did you mean ‘ConvertUToFOp’?
10 | #define inttoptr(...) rewriter.create<spirv::ConvertUToPtrOp>(loc, __VA_ARGS__)
| ^~~~~~~~~~~~~~~
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/Utility.h:10:46: note: in definition of macro ‘inttoptr’
10 | #define inttoptr(...) rewriter.create<spirv::ConvertUToPtrOp>(loc, __VA_ARGS__)
| ^~~~~~~~~~~~~~~
/home/xxx/Project/triton/third_party/intel_xpu_backend/lib/Conversion/TritonGPUToSPIRV/Utility.h:10:62: error: no matching function for call to ‘mlir::ConversionPatternRewriter::create<<expression error> >(mlir::Location&, mlir::Type&, mlir::Value&)’
10 | #define inttoptr(...) rewriter.create<spirv::ConvertUToPtrOp>(loc, __VA_ARGS__)
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
Complete log:
[0] [2023-08-03 01:28:35,085] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.1+7de73909, git-hash=7de73909, git-branch=master
[0] [2023-08-03 01:28:35,085] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[0] [2023-08-03 01:28:35,085] [INFO] [logging.py:96:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[0] [2023-08-03 01:28:35,130] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 4096, 'intermediate_size': 16384, 'heads': 16, 'num_hidden_layers': -1, 'dtype': torch.float16, 'pre_layer_norm': True, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': 64, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': False, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': True, 'triton_autotune': False}
[0] [2023-08-03 01:28:35,130] [INFO] [logging.py:96:log_dist] [Rank 0] Injecting Triton kernels ...
Loading 1 checkpoint shards: 0%| | 0/1 [00:00<?, ?it/s][0] checkpoint loading time at rank 0: 48.9048216342926 sec
Loading 1 checkpoint shards: 100%|██████████| 1/1 [00:48<00:00, 48.90s/it]
[0] [2023-08-03 01:29:24,846] [INFO] [utils.py:798:see_memory_usage] post-ds-inference-init
[0] [2023-08-03 01:29:24,846] [INFO] [utils.py:799:see_memory_usage] MA 11.27 GB Max_MA 11.27 GB CA 11.27 GB Max_CA 11 GB
[0] [2023-08-03 01:29:24,846] [INFO] [utils.py:806:see_memory_usage] CPU Virtual Memory: used = 11.66 GB, percent = 1.2%
[0] *** Starting to generate 32 tokens with bs=1
[0] Generate args {'max_new_tokens': 32, 'do_sample': False, 'num_beams': 1, 'token_latency': True}
[0] *** Prompt size: 32
[0] [2023-08-03 01:29:24,933] [INFO] [utils.py:798:see_memory_usage] end-of-run
[0] [2023-08-03 01:29:24,934] [INFO] [utils.py:799:see_memory_usage] MA 11.27 GB Max_MA 11.27 GB CA 11.27 GB Max_CA 11 GB
[0] [2023-08-03 01:29:24,934] [INFO] [utils.py:806:see_memory_usage] CPU Virtual Memory: used = 11.66 GB, percent = 1.2%
[0] *** Running benchmark[0]
[0] Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
[0] >>> inner function (32,)
[0] <function get_backend at 0x7f6b81fe3910>
[0] <triton.third_party.xpu.XPUBackend object at 0x7f6acea6d960>
[0] /home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/xpu/cpp_extension.py:1373: UserWarning: This extension has static linked onednn library. Please attaction to that, this path of onednn version maybe not match with the built-in version.
[0] warnings.warn(
[0] 2023-08-03 01:29:25,509 - root - INFO - running build_ext
[0] 2023-08-03 01:29:25,514 - root - INFO - building 'layer_norm_kernel' extension
[0] 2023-08-03 01:29:25,514 - root - INFO - creating /tmp/tmp2a44hft2/tmp
[0] 2023-08-03 01:29:25,514 - root - INFO - creating /tmp/tmp2a44hft2/tmp/tmp2a44hft2
[0] 2023-08-03 01:29:25,515 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icx -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2023.1.0/linux/include -I/opt/intel/oneapi/compiler/2023.1.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2023.1.0/include -I/home/xxx/onednn/include -I/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/include -I/home/xxx/miniconda3/envs/llm/include/python3.10 -c /tmp/tmp2a44hft2/main.c -o /tmp/tmp2a44hft2/tmp/tmp2a44hft2/main.o -fPIC -fPIC -w -fsycl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=layer_norm_kernel -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[0] 2023-08-03 01:29:34,504 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx -shared -fsycl /tmp/tmp2a44hft2/tmp/tmp2a44hft2/main.o -lze_loader -o /tmp/tmp2a44hft2/layer_norm_kernel.cpython-310-x86_64-linux-gnu.so -lc10 -ltorch_cpu -ltorch -ltorch_python -L/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/lib -L/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/lib -L/home/xxx/onednn/lib -Wl,--start-group /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sycl.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_intel_ilp64.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -lm -ldl -ldnnl -lintel-ext-pt-gpu
[0] >>> inner function <function Fp16Matmul.forward.<locals>.<lambda> at 0x7f6ace1720e0>
[0] <function get_backend at 0x7f6b81fe3910>
[0] <triton.third_party.xpu.XPUBackend object at 0x7f6acea6d960>[0]
[0] /home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/xpu/cpp_extension.py:1373: UserWarning: This extension has static linked onednn library. Please attaction to that, this path of onednn version maybe not match with the built-in version.
[0] warnings.warn(
[0] 2023-08-03 01:29:50,809 - root - INFO - running build_ext
[0] 2023-08-03 01:29:50,812 - root - INFO - building '_fp_matmul' extension
[0] 2023-08-03 01:29:50,812 - root - INFO - creating /tmp/tmp7aguxrst/tmp
[0] 2023-08-03 01:29:50,812 - root - INFO - creating /tmp/tmp7aguxrst/tmp/tmp7aguxrst
[0] 2023-08-03 01:29:50,812 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icx -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2023.1.0/linux/include -I/opt/intel/oneapi/compiler/2023.1.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2023.1.0/include -I/home/xxx/onednn/include -I/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/include -I/home/xxx/miniconda3/envs/llm/include/python3.10 -c /tmp/tmp7aguxrst/main.c -o /tmp/tmp7aguxrst/tmp/tmp7aguxrst/main.o -fPIC -fPIC -w -fsycl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=_fp_matmul -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[0] 2023-08-03 01:29:59,847 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx -shared -fsycl /tmp/tmp7aguxrst/tmp/tmp7aguxrst/main.o -lze_loader -o /tmp/tmp7aguxrst/_fp_matmul.cpython-310-x86_64-linux-gnu.so -lc10 -ltorch_cpu -ltorch -ltorch_python -L/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/lib -L/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/lib -L/home/xxx/onednn/lib -Wl,--start-group /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sycl.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_intel_ilp64.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -lm -ldl -ldnnl -lintel-ext-pt-gpu
[0] >>> inner function <function Fp16Matmul._score_4d_matmul.<locals>.<lambda> at 0x7f6ace1720e0>
[0] <function get_backend at 0x7f6b81fe3910>
[0] <triton.third_party.xpu.XPUBackend object at 0x7f6acea6d960>
[0] /home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/xpu/cpp_extension.py:1373: UserWarning: This extension has static linked onednn library. Please attaction to that, this path of onednn version maybe not match with the built-in version.
[0] warnings.warn(
[0] 2023-08-03 01:30:23,065 - root - INFO - running build_ext
[0] 2023-08-03 01:30:23,068 - root - INFO - building 'matmul_4d_kernel' extension
[0] 2023-08-03 01:30:23,069 - root - INFO - creating /tmp/tmpsihcyg_g/tmp
[0] 2023-08-03 01:30:23,069 - root - INFO - creating /tmp/tmpsihcyg_g/tmp/tmpsihcyg_g
[0] 2023-08-03 01:30:23,069 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icx -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2023.1.0/linux/include -I/opt/intel/oneapi/compiler/2023.1.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2023.1.0/include -I/home/xxx/onednn/include -I/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/include -I/home/xxx/miniconda3/envs/llm/include/python3.10 -c /tmp/tmpsihcyg_g/main.c -o /tmp/tmpsihcyg_g/tmp/tmpsihcyg_g/main.o -fPIC -fPIC -w -fsycl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=matmul_4d_kernel -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[0] 2023-08-03 01:30:32,197 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx -shared -fsycl /tmp/tmpsihcyg_g/tmp/tmpsihcyg_g/main.o -lze_loader -o /tmp/tmpsihcyg_g/matmul_4d_kernel.cpython-310-x86_64-linux-gnu.so -lc10 -ltorch_cpu -ltorch -ltorch_python -L/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/lib -L/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/lib -L/home/xxx/onednn/lib -Wl,--start-group /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sycl.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_intel_ilp64.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -lm -ldl -ldnnl -lintel-ext-pt-gpu
[0] >>> inner function (512,)
[0] <function get_backend at 0x7f6b81fe3910>
[0] <triton.third_party.xpu.XPUBackend object at 0x7f6acea6d960>
[0] /home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/xpu/cpp_extension.py:1373: UserWarning: This extension has static linked onednn library. Please attaction to that, this path of onednn version maybe not match with the built-in version.
[0] warnings.warn(
[0] 2023-08-03 01:30:49,560 - root - INFO - running build_ext
[0] 2023-08-03 01:30:49,563 - root - INFO - building 'softmax_kernel' extension
[0] 2023-08-03 01:30:49,563 - root - INFO - creating /tmp/tmpxm55vb5j/tmp
[0] 2023-08-03 01:30:49,563 - root - INFO - creating /tmp/tmpxm55vb5j/tmp/tmpxm55vb5j
[0] 2023-08-03 01:30:49,563 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icx -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2023.1.0/linux/include -I/opt/intel/oneapi/compiler/2023.1.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2023.1.0/include -I/home/xxx/onednn/include -I/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/include -I/home/xxx/miniconda3/envs/llm/include/python3.10 -c /tmp/tmpxm55vb5j/main.c -o /tmp/tmpxm55vb5j/tmp/tmpxm55vb5j/main.o -fPIC -fPIC -w -fsycl -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=softmax_kernel -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[0] 2023-08-03 01:30:58,436 - root - INFO - /opt/intel/oneapi/compiler/2023.1.0/linux/bin/icpx -shared -fsycl /tmp/tmpxm55vb5j/tmp/tmpxm55vb5j/main.o -lze_loader -o /tmp/tmpxm55vb5j/softmax_kernel.cpython-310-x86_64-linux-gnu.so -lc10 -ltorch_cpu -ltorch -ltorch_python -L/home/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/lib -L/home/xxx/Project/frameworks.ai.pytorch.ipex-gpu.rls/intel_extension_for_pytorch/lib -L/home/xxx/onednn/lib -Wl,--start-group /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sycl.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_intel_ilp64.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_sequential.a /opt/intel/oneapi/mkl/2023.1.0/lib/intel64/libmkl_core.a -Wl,--end-group -lsycl -lOpenCL -lpthread -lm -ldl -ldnnl -lintel-ext-pt-gpu
[0] loc("-":17:13): error: external function is unhandled
[0] Translate to SPIRV IR failedLLVM ERROR: Failed to translate TritonGPU to SPIRV IR.
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 1046382 RUNNING AT
= KILLED BY SIGNAL: 6 (Aborted)
===================================================================================
Port and run (on PVC) the test contained in test_line_info.py.
I'd like to try the compile bundle from #83, but the very last step (building triton) fails with:
/usr/bin/ld: cannot find -lz
A solution is suggested at https://github.com/intel/intel-xpu-backend-for-triton/wiki/Possible-Build-Bugs#1-usrbinld-cannot-find--lz, but I'm not root on the system so I can't install that package.
I also have zlib installed in the conda environment, and I tried extracting the package and using LD_LIBRARY_PATH to point to other locations, but it doesn't help.
Is there a way to do this without the system zlib?
Currently (as of Jan 17, 2024) the Triton compiler in https://github.com/intel/intel-xpu-backend-for-triton/tree/llvm-target works with a custom version of torch 2.1.0. The objective of this work item is to make it work with torch 2.4.
PR #190 exposed an issue in compiling one of the tests contained in test_line_info.py. The dot_combine kernel fails to compile cleanly. To reproduce the issue, comment out all tests except dot_combine and execute:
python3 -m pytest --verbose --device xpu test_line_info.py
The backtrace is:
test_line_info.py::test_line_info[dot_combine] Fatal Python error: Aborted
Current thread 0x00007f291caa52c0 (most recent call first):
File "/home/etiotto/intel-xpu-backend-for-triton/python/triton/compiler/backends/xpu.py", line 191 in make_llir
File "/home/etiotto/intel-xpu-backend-for-triton/python/triton/compiler/backends/xpu.py", line 208 in <lambda>
File "/home/etiotto/intel-xpu-backend-for-triton/python/triton/compiler/compiler.py", line 200 in compile
Reduced test case:
#blocked = #triton_gpu.blocked<{sizePerThread = [2, 2], threadsPerWarp = [2, 16], warpsPerCTA = [4, 1], order = [1, 0], CTAsPerCGA = [1, 1], CTASplitNum = [1, 1], CTAOrder = [1, 0]}>
module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 4 : i32} {
tt.func public @kernel_dot_combine_0(%op1: tensor<2x2xi8, #triton_gpu.dot_op<{opIdx = 0, parent = #blocked}>>,
%op2: tensor<2x2xi8, #triton_gpu.dot_op<{opIdx = 1, parent = #blocked}>>,
%cst : tensor<2x2xi32, #blocked>) {
%0 = tt.dot %op1, %op2, %cst {allowTF32 = true, maxNumImpreciseAcc = 0 : i32} : tensor<2x2xi8, #triton_gpu.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<2x2xi8, #triton_gpu.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<2x2xi32, #blocked>
tt.return
}
}
To reproduce using the reduced test case:
% triton-opt -split-input-file --convert-triton-gpu-to-llvm="target=genx" t1.ttgir
Unsupported conversion from i8 to i32
UNREACHABLE executed at /home/etiotto/intel-xpu-backend-for-triton/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:1415!
...
Split into a separate issue from #140.
All pull requests against branch https://github.com/intel/intel-xpu-backend-for-triton/tree/llvm-target need to run basic integration tests. CI should:
The end-to-end core test test_typeconvert_downcast in test_conversions.py fails to run correctly on a PVC GPU.
I'm currently building an extension to triton (pre-Hopper commit) for Apple Silicon, and I have settled on a method to do that:
Is this a good process, and how will the different LLVM versions play into it? Is the Intel fork of LLVM compatible with macos-arm64?
Any feedback would be great.
When the target is NVIDIA, the triton_gpu.async_wait operation is lowered to the following inline asm:
%2 = "llvm.inline_asm"() {asm_dialect = 0 : i64, asm_string = "cp.async.wait_group 0x4;", constraints = "", has_side_effects, operand_attrs = []} : () -> !llvm.void
We need to support the equivalent of the cp.async.wait_group operation. The cp.async.wait_group instruction waits for completion of prior asynchronous copy operations. The syntax is:
cp.async.wait_group N;
The instruction causes the executing thread to wait until only N or fewer of the most recent cp.async groups are pending and all prior cp.async groups committed by the executing thread are complete. For example, when N is 0, the executing thread waits for all prior cp.async groups to complete. Operand N is an integer constant.
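To make the group-counting semantics concrete, here is a small Python model of commit_group/wait_group (purely illustrative; not a real API, and the copy handles are hypothetical):

# FIFO of committed groups; each group is a list of outstanding copies.
pending_groups = []

def commit_group(copies):
    # cp.async.commit_group: bundle all copies issued since the last
    # commit into one group.
    pending_groups.append(copies)

def wait_group(n):
    # cp.async.wait_group N: block until at most n groups are pending;
    # n == 0 waits for every previously committed group.
    while len(pending_groups) > n:
        for copy in pending_groups.pop(0):
            copy.wait()  # hypothetical completion handle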
Rewrite the Triton FE code we have ported to use the SYCL runtime rather than the L0 runtime.
The end-to-end core test test_chain_reduce runs correctly on a PVC GPU when the input operand has a blocked layout. However, it fails when the operand has a DPAS layout.
The end-to-end core test test_locality fails to run correctly on a PVC GPU. We need to investigate the cause of the functional problem(s).
It is essential to keep up to date with OpenAI Triton, both to get the latest features and to reduce the difficulty of upstreaming our changes to OpenAI Triton.
Add test_annotations.py to CI and the test-triton.sh script.
It works when the data type is float32 but fails if the data type is float16.
The end-to-end core test test_scan_layouts fails to run correctly on a PVC GPU. We need to investigate the cause of the functional problem(s).
When the target is NVIDIA, the triton_gpu.async_commit_group operation is lowered to the following inline asm:
llvm.func @async_commit_group() attributes {nvvm.kernel = 1 : ui1, nvvm.maxntid = [128 : i32]} {
...
%2 = llvm.inline_asm has_side_effects asm_dialect = att operand_attrs = [] "cp.async.commit_group ;", "" : () -> !llvm.void
llvm.return
}
We need to support the equivalent operation of cp.async.commit_group.
It is roughly a 1:1 mapping:
tt.make_tensor_ptr %shape %strides %offsets => xegpu.create_nd_tdesc %shape %strides %offsets
tt.load %ptr => xegpu.load_nd %ptr
tt.dot %a %b %c => xegpu.dpas %a %b %c
tt.store %ptr, %value => xegpu.store_nd %ptr, %value
tt.advance %ptr %offsets => xegpu.update_nd_offset %ptr %offsets
Example
TritonGPU code:
// Total problem size: 8x16x1024.
// There is 1 workgroup; each workgroup calculates an [8x16 = 8x1024 * 1024x16] block.
// Each workgroup has 1 subgroup; each subgroup calculates an [8x16 = 8x1024 * 1024x16] block.
module {
tt.func @test_kernel(%arg0: !tt.ptr<f16, 1> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16, 1> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f16, 1> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
%c0 = arith.constant 0 : i32
%c16 = arith.constant 16 : i32
%c8 = arith.constant 8 : i32
%c1024 = arith.constant 1024 : i32
%m = arith.constant 8 : i64
%n = arith.constant 16 : i64
%k = arith.constant 1024 : i64
%c1_i64 = arith.constant 1 : i64
%cst = arith.constant dense<0.000000e+00> : tensor<8x16xf32>
%aPtr = tt.make_tensor_ptr %arg0, [%m, %k], [%k, %c1_i64], [%c0, %c0] {order = array<i32: 1, 0>} : <tensor<8x16xf16>, 1>
%bPtr = tt.make_tensor_ptr %arg1, [%k, %n], [%n, %c1_i64], [%c0, %c0] {order = array<i32: 1, 0>} : <tensor<16x16xf16>, 1>
%6:3 = scf.for %arg3 = %c0 to %c1024 step %c16 iter_args(%arg4 = %cst, %subA = %aPtr, %subB = %bPtr) -> (tensor<8x16xf32>, !tt.ptr<tensor<8x16xf16>, 1>, !tt.ptr<tensor<16x16xf16>, 1>) : i32 {
%a = tt.load %subA {boundaryCheck = array<i32: 0, 1>, cache = 1 : i32, evict = 1 : i32, isVolatile = false} : !tt.ptr<tensor<8x16xf16>, 1> -> tensor<8x16xf16>
%b = tt.load %subB {boundaryCheck = array<i32: 0, 1>, cache = 1 : i32, evict = 1 : i32, isVolatile = false, DotB = true} : !tt.ptr<tensor<16x16xf16>, 1> -> tensor<16x16xf16>
%c = tt.dot %a, %b, %arg4 {allowTF32 = true, maxNumImpreciseAcc = 0 : i32} : tensor<8x16xf16> * tensor<16x16xf16> -> tensor<8x16xf32>
%30 = tt.advance %aPtr, [%c0, %c16] : <tensor<8x16xf16>, 1>
%31 = tt.advance %bPtr, [%c16, %c0] : <tensor<16x16xf16>, 1>
scf.yield %c, %30, %31 : tensor<8x16xf32>, !tt.ptr<tensor<8x16xf16>, 1>, !tt.ptr<tensor<16x16xf16>, 1>
}
%value = arith.truncf %6#0 : tensor<8x16xf32> to tensor<8x16xf16>
%cPtr = tt.make_tensor_ptr %arg2, [%m, %n], [%n, %c1_i64], [%c0, %c0] {order = array<i32: 1, 0>} : <tensor<8x16xf16>, 1>
tt.store %cPtr, %value {boundaryCheck = array<i32: 0, 1>, cache = 1 : i32, evict = 1 : i32} : !tt.ptr<tensor<8x16xf16>, 1>, tensor<8x16xf16>
tt.return
}
}
XeGPU code:
func.func @test_kernel(%arg0: memref<8x1024xf16>, %arg1: memref<1024x16xf16>, %arg2: memref<8x16xf16>) kernel attributes {
%c0 = arith.constant 0 : index
%c16 = arith.constant 16 : index
%c8 = arith.constant 8 : index
%c1024 = arith.constant 1024 : index
%cst = arith.constant dense<0.0> :vector<8x16xf32>
%7 = xegpu.create_nd_tdesc %arg0[%c0, %c0] {mode=vc}: memref<8x1024xf16> -> !xegpu.tensor_desc<8x16xf16>
%8 = xegpu.create_nd_tdesc %arg1[%c0, %c0] {mode=vc}: memref<1024x16xf16> -> !xegpu.tensor_desc<16x16xf16>
%6:3 = scf.for %arg3 = %c0 to %c1024 step %c16 iter_args(%arg4 = %cst, %subA = %7, %subB = %8) -> (vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>) {
%9 = xegpu.load_nd %subA {mode=vc, vnni_axis = 1}: !xegpu.tensor_desc<8x16xf16> -> vector<8x8x2xf16>
%10 = xegpu.load_nd %subB {mode=vc, vnni_axis = 0} : !xegpu.tensor_desc<16x16xf16> -> vector<8x16x2xf16>
%11 = xegpu.dpas %9, %10, %arg4 {mode=vc}: vector<8x8x2xf16>, vector<8x16x2xf16>, vector<8x16xf32> -> vector<8x16xf32>
%12 = xegpu.update_nd_offset %subA, [%c0, %c16] {mode=vc}: !xegpu.tensor_desc<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
%13 = xegpu.update_nd_offset %subB, [%c16, %c0] {mode=vc}: !xegpu.tensor_desc<16x16xf16> -> !xegpu.tensor_desc<16x16xf16>
scf.yield %11, %12, %13: vector<8x16xf32>, !xegpu.tensor_desc<8x16xf16>, !xegpu.tensor_desc<16x16xf16>
}
%value = arith.truncf %6#0 : vector<8x16xf32> to vector<8x16xf16>
%4 = xegpu.create_nd_tdesc %arg2[%c0, %c0] {mode = vc} : memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
xegpu.store_nd %value, %4 {mode = vc}: vector<8x16xf16>, !xegpu.tensor_desc<8x16xf16>
return
}
The operands of Triton's tt.dot operation should be loaded using specialized instructions that load 2D blocks of the matrices. Loading the operands in blocks is more efficient than loading them with regular loads (cf. the @llvm.genx.GenISA.LSCPrefetch intrinsic).
We might need to leverage the semantic information associated with Triton's blocked pointers (https://triton-lang.org/main/getting-started/tutorials/08-experimental-block-pointer.html) in order to generate 2D block loads.
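For reference, a kernel written with Triton's experimental block-pointer API carries exactly the shape/stride/order information a 2D block load needs. A minimal sketch (row-major operands and a float16 output assumed, mirroring the truncf in the XeGPU example above; names and block sizes illustrative):

import triton
import triton.language as tl

@triton.jit
def block_ptr_dot(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Block pointers bundle base, shape, strides, and order: the semantic
    # information needed to emit 2D block loads/stores.
    a_block = tl.make_block_ptr(base=a_ptr, shape=(M, K), strides=(K, 1),
                                offsets=(0, 0), block_shape=(BLOCK_M, BLOCK_K),
                                order=(1, 0))
    b_block = tl.make_block_ptr(base=b_ptr, shape=(K, N), strides=(N, 1),
                                offsets=(0, 0), block_shape=(BLOCK_K, BLOCK_N),
                                order=(1, 0))
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_block, boundary_check=(0, 1))
        b = tl.load(b_block, boundary_check=(0, 1))
        acc += tl.dot(a, b)
        a_block = tl.advance(a_block, (0, BLOCK_K))
        b_block = tl.advance(b_block, (BLOCK_K, 0))
    c_block = tl.make_block_ptr(base=c_ptr, shape=(M, N), strides=(N, 1),
                                offsets=(0, 0), block_shape=(BLOCK_M, BLOCK_N),
                                order=(1, 0))
    tl.store(c_block, acc.to(tl.float16), boundary_check=(0, 1))

tl.make_block_ptr, tl.load, tl.advance, and tl.store here correspond to the tt.make_tensor_ptr, tt.load, tt.advance, and tt.store operations shown in the TritonGPU IR above.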
intel-xpu-backend-for-triton/python/test/unit/language/test_conversions.py
As part of this investigation we might discover missing functionality or expose functional issues in the current implementation. Separate work items may be added to track compiler work required to complete this task.
Could you please add documentation on building and running examples? Thanks.
1. Clone the repo
2. mkdir build && cd build
3. cmake ..
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at include/triton/Conversion/TritonGPUToSPIRV/CMakeLists.txt:2 (mlir_tablegen):
Unknown CMake command "mlir_tablegen".
Could you suggest the command "python setup.py develop" for users who do not have root permission? Thanks.
# cd to triton root folder and checkout to pinned commit
cd ../..
git checkout `cat third_party/intel_xpu_backend/triton_hash.txt`
# Build triton with XPU backend enabled
cd python
TRITON_CODEGEN_INTEL_XPU_BACKEND=1 python setup.py develop
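After the build, a quick smoke test is a minimal vector-add kernel (a sketch; it assumes a working XPU build plus intel_extension_for_pytorch to register the xpu device, and is not the repository's own vadd example):

import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the "xpu" device)
import triton
import triton.language as tl

@triton.jit
def vadd_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

n = 1 << 16
x = torch.rand(n, device="xpu")
y = torch.rand(n, device="xpu")
out = torch.empty_like(x)
vadd_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
assert torch.allclose(out, x + y)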
The upstream tests in test_block_pointer.py need to be ported to XPU. Support for blocked pointers is new and has never been exercised in our prototype; by attempting to run these tests we might expose functionality that still needs to be implemented.
I've been exploring writing kernels with numba_dpex, which brings a numba.cuda-like interface to XPU devices, and I've also been following this project with great interest. If you can provide me with early install instructions, I'd be very interested in testing this backend early on Intel XPU devices (Flex 170 GPU and iGPUs) and giving feedback.
Test cases for the tt.random.* operations are in test_random.py, which needs to be ported to XPU.
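For reference, tl.rand is the language-level construct these tests exercise; a minimal sketch of the kind of kernel involved (names illustrative):

import triton
import triton.language as tl

@triton.jit
def rand_kernel(out_ptr, seed, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # Philox-based uniform random numbers in [0, 1).
    r = tl.rand(seed, offs)
    tl.store(out_ptr + offs, r, mask=mask)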
We are currently running only the first 3 tutorial tests and we need to enable all the remaining tutorial tests:
As part of this investigation we might discover missing functionality or expose functional issues in the current implementation. Separate work items may be added to track compiler work required to complete this task.
The current Triton compiler uses the oneAPI libraries. It works when oneAPI is 2023.2 but fails with 2024.0. Investigate the reason(s) and address the problem(s).
Work item #144 establishes the base integration tests (CI) for the Triton compiler generating LLVM IR. The goal of this work item is to enhance CI to bring it closer to the CI system used upstream.
As inspiration for this work item we should analyze the following workloads and determine whether we want to adopt them as part of CI:
The end-to-end core test test_scan2d fails to run correctly on a PVC GPU for certain combinations of arguments/attributes (axis = 1, num_warps = 4, shape = [1024, 2]).
The end-to-end core test test_reduce_layouts runs correctly on a PVC GPU but fails for certain combinations of arguments/attributes. Specifically, it fails when first_axis == 0 and op == "max" and [M, N] in [[128, 128], [256, 128]].
The goal is to understand the project at https://github.com/microsoft/triton-shared and determine how it fits within our Triton port.
Running the vadd example shows the following error. Thank you for the suggestion.
/path/to/lib/python3.10/site-packages/triton/third_party/xpu/__init__.py", line 400, in get_stream
return torch.xpu.current_stream(idx).sycl_queue
AttributeError: 'Stream' object has no attribute 'sycl_queue'