Optimized kernels for ring-attention [WIP]
Every Sunday at 5 PM UTC we meet in the "General" voice channel of the CUDA MODE Discord server. You can contact us any time asynchronously in the #ring-attention channel.
- Paper: Ring Attention with Blockwise Transformers for Near-Infinite Context
  - code: lhao499/ring-attention
- Paper: World Model on Million-Length Video And Language With RingAttention
  - code: LargeWorldModel/LWM
  - project site: largeworldmodel.github.io
  - models: HF/LargeWorldModel
- Paper: Online normalizer calculation for softmax (NVIDIA, 2018)
- LWM model in ollama: https://ollama.com/ifioravanti/lwm
- lucidrains PyTorch impl (WIP?): lucidrains/ring-attention-pytorch
- Incremental Softmax (to understand the algorithm in 'high-level' PyTorch; see the first sketch below)
- Naive flash-attn (to understand the algorithm in 'high-level' PyTorch; see the second sketch below)
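
The incremental ("online") softmax from the NVIDIA paper above is the trick that makes blockwise and ring attention possible: keep a running maximum and a running normalizer, and rescale the old partial sum whenever the maximum grows. A minimal sketch in PyTorch (the function name and the element-by-element loop are ours, for illustration only):

```python
import torch

def online_softmax(x: torch.Tensor) -> torch.Tensor:
    # Running maximum m and running normalizer d, updated one element at a
    # time, so the input never has to be scanned twice (illustrative sketch).
    m = torch.tensor(float("-inf"))
    d = torch.tensor(0.0)
    for xi in x:
        m_new = torch.maximum(m, xi)
        # Rescale the old sum to the new maximum, then add the new term.
        d = d * torch.exp(m - m_new) + torch.exp(xi - m_new)
        m = m_new
    return torch.exp(x - m) / d

x = torch.randn(16)
assert torch.allclose(online_softmax(x), torch.softmax(x, dim=0), atol=1e-6)
```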
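
Blockwise ("naive flash") attention applies the same rescaling to a partial attention output, consuming the keys and values one block at a time; in ring attention, each step performs this update with the KV block just received from the neighboring device. A hedged sketch with illustrative shapes and block size, not the repo's kernel:

```python
import torch

def blockwise_attention(q, k, v, block_size=64):
    # Running row max m, running normalizer d, unnormalized output o.
    m = torch.full((q.shape[0], 1), float("-inf"))
    d = torch.zeros(q.shape[0], 1)
    o = torch.zeros_like(q)
    for start in range(0, k.shape[0], block_size):
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        s = q @ kb.T / q.shape[-1] ** 0.5            # scores for this KV block
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)
        scale = torch.exp(m - m_new)                 # rescale earlier partials
        d = d * scale + p.sum(dim=-1, keepdim=True)
        o = o * scale + p @ vb
        m = m_new
    return o / d

q, k, v = torch.randn(32, 64), torch.randn(256, 64), torch.randn(256, 64)
ref = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-4)
```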
- NVIDIA Collective Communication Library (NCCL) Documentation
- PyTorch Distributed Overview
- Distributed communication package - torch.distributed (send(), recv(), broadcast(), etc.; see the ring-exchange sketch after this list)
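
Ring attention's communication pattern maps directly onto these point-to-point primitives: every rank sends its current KV block to the next rank and receives a new one from the previous rank, ideally overlapped with the blockwise compute above. A minimal sketch of one such ring exchange (the gloo backend, shapes, and function name are illustrative assumptions; launch with `torchrun --nproc_per_node=2`):

```python
import torch
import torch.distributed as dist

def ring_exchange(block: torch.Tensor) -> torch.Tensor:
    # Send our KV block to the next rank while receiving the previous
    # rank's block; non-blocking ops avoid a send/send deadlock.
    rank, world = dist.get_rank(), dist.get_world_size()
    recv_buf = torch.empty_like(block)
    send_req = dist.isend(block, dst=(rank + 1) % world)
    recv_req = dist.irecv(recv_buf, src=(rank - 1) % world)
    send_req.wait()
    recv_req.wait()
    return recv_buf

if __name__ == "__main__":
    dist.init_process_group("gloo")  # torchrun supplies rank/world-size env vars
    kv = torch.full((4, 8), float(dist.get_rank()))  # stand-in for a local KV block
    for _ in range(dist.get_world_size() - 1):
        kv = ring_exchange(kv)  # each step shifts blocks one hop around the ring
        # ...a real implementation would run blockwise attention against kv here...
    dist.destroy_process_group()
```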
Contact us on the CUDA MODE Discord server: https://discord.gg/cudamode. PRs are welcome (please create an issue first).