Comments (8)
Is there any plan to implement this feature?
I would like to apply it to my custom JAX code.
from flash-attention-jax.
^0^ great. Thanks for your support! Take care.
from flash-attention-jax.
I am looking into this code more carefully, and it seems that the unneeded computation (the upper-triangular region in causal attention) is not excluded from the computation. (I don't expect the compiler to handle this aspect either.)
I think this is intentional, to keep the flash attention implementation easy to understand, but causal attention could be roughly 2x faster when the query and key lengths are equal and the process is compute-bound, since only the blocks on or below the diagonal actually need to be computed.
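As an illustration, here is a minimal sketch (my own example, not this repository's code) of a blockwise causal loop that skips the key/value blocks lying entirely above the diagonal; this is where the roughly 2x saving would come from when the query and key lengths match.

import jax
import jax.numpy as jnp

def blockwise_causal_attention(q, k, v, block_size=128):
    # q, k, v: (seq_len, dim); assumes seq_len is a multiple of block_size
    seq_len, dim = q.shape
    scale = dim ** -0.5
    out_blocks = []
    for start in range(0, seq_len, block_size):
        end = start + block_size
        q_blk = q[start:end] * scale
        # key/value blocks strictly after this query block lie entirely in the
        # upper triangle, so they are never loaded or multiplied
        k_ctx, v_ctx = k[:end], v[:end]
        sim = q_blk @ k_ctx.T
        # mask the remaining upper-triangular entries inside the diagonal block
        q_idx = jnp.arange(start, end)[:, None]
        k_idx = jnp.arange(end)[None, :]
        sim = jnp.where(q_idx >= k_idx, sim, -jnp.inf)
        out_blocks.append(jax.nn.softmax(sim, axis=-1) @ v_ctx)
    return jnp.concatenate(out_blocks)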
The issue is closed.
from flash-attention-jax.
yea, it should, as it is agnostic to how many leading dimensions there are (whether it is batch, heads, etc.)
from flash-attention-jax.
oh shoot, i never built it. believe at the time i thought vmap would suffice
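For reference, here is a minimal sketch (my own example, not the repository's code) of that approach: jax.vmap lifts a single-sequence attention function over the head and batch dimensions, so the core kernel stays agnostic to leading dimensions.

import jax
import jax.numpy as jnp

def attention(q, k, v):
    # single-sequence attention on (seq_len, dim) inputs
    sim = (q @ k.T) * (q.shape[-1] ** -0.5)
    return jax.nn.softmax(sim, axis=-1) @ v

# inner vmap maps over heads, outer vmap over batch,
# so the lifted function accepts (batch, heads, seq_len, dim)
batched_multihead_attention = jax.vmap(jax.vmap(attention))

q = k = v = jnp.ones((2, 8, 1024, 64))
print(batched_multihead_attention(q, k, v).shape)  # (2, 8, 1024, 64)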
from flash-attention-jax.
sure! I can add it tomorrow morning, California time
from flash-attention-jax.
provided I don't drink too much tonight :)
from flash-attention-jax.
@sh0416 ok it's done, you can test it by running
from flash_attention_jax import causal_attention, causal_flash_attention, value_and_grad_difference

diff, (dq_diff, dk_diff, dv_diff) = value_and_grad_difference(
    causal_attention,
    causal_flash_attention,
    seed = 42,
    add_key_mask = False
)

print('shows differences between normal and flash attention for output, dq, dk, dv')
print(f'o: {diff}')     # < 1e-4
print(f'dq: {dq_diff}') # < 1e-6
print(f'dk: {dk_diff}') # < 1e-6
print(f'dv: {dv_diff}') # < 1e-6
from flash-attention-jax.
Related Issues (11)
- Question about calculation of Q and transpose(K).
- Slower than non-flash attention HOT 1
- Reshape error in causal_flash_attention when sequence length is not a multiple of 1024
- Online Softmax from FlashAttention HOT 2
- can I work on making a flax attention function out of this repository? HOT 1
- batch & multihead support? HOT 3
- more general mask support HOT 1
- support for per-head scales for cosine sim attention HOT 6
- fix compatibility with jax transformations HOT 28
- Performance benchmarks? HOT 20