Comments (1)
The design philosophy of ggml/llama.cpp is to avoid external dependencies wherever possible. I was recently told by an NVIDIA engineer that the way to go for tensor cores is to write PTX code directly (the NVIDIA equivalent of assembly), so I may take a look at the project with that in mind; a sketch of what that looks like follows below.
Also, I know that you're an AMD user, so I would advise you not to count your chickens before they hatch. If the project does what I think it does, it would take significant effort to write the equivalent of PTX code for AMD (at least if the performance is supposed to be actually good), so I'm skeptical about AMD support coming "soon" (but I'll gladly let myself be proven wrong).
from llama.cpp.
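For context, "writing PTX directly" usually means embedding tensor-core instructions such as `mma.sync` in inline assembly instead of going through cuBLAS or the WMMA intrinsics. Below is a minimal sketch, not llama.cpp's actual code: one warp computes a single 16x8 output tile on an assumed sm_80 (Ampere) GPU, with the per-thread fragment layout taken from the public PTX ISA documentation; the kernel and file names are made up for illustration.

```cuda
// Minimal sketch (assumes sm_80; hypothetical file name):
//   nvcc -arch=sm_80 mma_sketch.cu -o mma_sketch
#include <cuda_fp16.h>
#include <cmath>
#include <cstdio>

// One warp computes D(16x8, f32) = A(16x16, f16, row-major) * B(16x8, f16, col-major).
__global__ void mma_16x8x16(const half *A, const half *B, float *D) {
    const int lane  = threadIdx.x & 31; // lane id within the warp
    const int group = lane >> 2;        // "groupID" in the PTX ISA docs
    const int tig   = lane & 3;         // "threadID_in_group"

    // Load two consecutive f16 values as one packed 32-bit register.
    auto ld2 = [](const half *p) { return *reinterpret_cast<const unsigned *>(p); };

    // Per-thread fragments, distributed across the warp exactly as the PTX ISA
    // documents for mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32.
    unsigned a[4], b[2];
    float d[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    a[0] = ld2(A + (group    ) * 16 + tig * 2    );  // a0,a1
    a[1] = ld2(A + (group + 8) * 16 + tig * 2    );  // a2,a3
    a[2] = ld2(A + (group    ) * 16 + tig * 2 + 8);  // a4,a5
    a[3] = ld2(A + (group + 8) * 16 + tig * 2 + 8);  // a6,a7
    b[0] = ld2(B + group * 16 + tig * 2    );        // b0,b1 (B stored col-major)
    b[1] = ld2(B + group * 16 + tig * 2 + 8);        // b2,b3

    // The actual tensor-core instruction: D = A*B + D for one 16x8x16 tile.
    asm("mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};"
        : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1]));

    // Scatter the accumulator fragment back to a row-major 16x8 result.
    D[(group    ) * 8 + tig * 2    ] = d[0];
    D[(group    ) * 8 + tig * 2 + 1] = d[1];
    D[(group + 8) * 8 + tig * 2    ] = d[2];
    D[(group + 8) * 8 + tig * 2 + 1] = d[3];
}

int main() {
    half hA[16 * 16], hB[16 * 8];          // A row-major, B column-major
    float hD[16 * 8], ref[16 * 8];
    for (int i = 0; i < 16 * 16; ++i) hA[i] = __float2half((i % 7) * 0.25f);
    for (int i = 0; i < 16 * 8;  ++i) hB[i] = __float2half((i % 5) * 0.5f);
    for (int r = 0; r < 16; ++r)           // CPU reference for the same product
        for (int c = 0; c < 8; ++c) {
            float acc = 0.0f;
            for (int k = 0; k < 16; ++k)
                acc += __half2float(hA[r * 16 + k]) * __half2float(hB[c * 16 + k]);
            ref[r * 8 + c] = acc;
        }
    half *dA, *dB; float *dD;
    cudaMalloc(&dA, sizeof(hA)); cudaMalloc(&dB, sizeof(hB)); cudaMalloc(&dD, sizeof(hD));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);
    mma_16x8x16<<<1, 32>>>(dA, dB, dD);
    cudaMemcpy(hD, dD, sizeof(hD), cudaMemcpyDeviceToHost);
    float max_err = 0.0f;
    for (int i = 0; i < 16 * 8; ++i)
        max_err = fmaxf(max_err, fabsf(hD[i] - ref[i]));
    printf("max abs error vs. CPU reference: %f\n", max_err);
    return 0;
}
```

A real kernel would stage tiles through shared memory (e.g. with `ldmatrix`) and pipeline the loads; the point here is only that the programmer, not a library, controls the exact instruction and register layout, which is also what would have to be redone from scratch for AMD hardware.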
Related Issues (20)
- Pretokenizer not supported by conversion script
- convert.py still fails on llama3 8B-Instruct downloaded directly from Meta (Huggingface works)
- Flash attention implementations do not handle case where value vectors have different dimension from query vectors
- AMD ROCm: 8x22B Model Causes 100% GPU Utilization Stall
- [Android/Termux] Significantly higher RAM usage with Vulkan compared to CPU only
- Description of "-t N" option for server is inaccurate
- Need help on building shared libraries on Windows machine for Android x86_64 (emulator)
- [SYCL] include shared libs in sycl release
- Can I handle multiple images in the same context?
- bf16 problem
- llama_model_load: error loading model: unable to allocate backend buffer
- Funny response with LLaMa 3 8B
- Why does the server-cuda container consume CPU time?
- convert-hf-to-gguf.py fails PR #7234
- Custom `seed` values ignored by `llama.cpp HTTP server`
- Different result between use llama_tokenize and python original transformers tokenizer
- Possible bug in the 'deepseek-coder' chat template's system message
- llama_get_logits_ith: invalid logits id 14, reason: no logits
- Possible (very serious) bug in chat templates that use '<s>' token having a space added after it
- Llama.cpp server doesn't return grammar error messages when in streaming mode