
soft's People

Contributors

andy-zd, lzrobots, victorllu, yjhmitweb


soft's Issues

Ask for pretrained model

Dear Jiachen,
Thanks for your work! Would you be willing to share the models pretrained on ImageNet? Best wishes!

Code for DeiT

Thank you for publishing your code. I saw the DeiT ablation in your paper. Is there a chance you could also provide code to reproduce that? If you'd prefer to contact me in private, my email is [email protected]

Thanks again!

How should the linear complexity claimed in the paper be understood?

Apologies for asking in Chinese for convenience.

The key to the linear cost in the paper is downsampling with a strided conv, but once the conv is trained its kernel size and stride are fixed, so the sampling ratio is fixed too.
That means that after training, if a longer sequence is used at test time, the number of landmarks m grows with the sequence length n, and the complexity is still O(n^2) rather than O(n).
I looked at the OpenReview comments and a reviewer seems to have raised this; the rebuttal says m is fixed at 49, but when the test sequence is longer that seems impossible without changing the stride. Nystromformer's adaptive pooling feels more in keeping with the meaning of landmarks.
Also, the conv that generates the landmarks is followed by a norm and a GELU; is that actually the key to convergence?
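A minimal sketch (assumed shapes and layer choices, not the authors' code) of the two landmark-sampling strategies being compared here: a strided conv, whose number of landmarks m grows with the input resolution, versus Nystromformer-style adaptive pooling, which keeps m fixed at 49 regardless of input size.

    import torch
    import torch.nn as nn

    class StridedConvLandmarks(nn.Module):
        """Landmarks via a strided conv: m = (H/s) * (W/s), so m grows with the input size."""
        def __init__(self, dim, stride=4):
            super().__init__()
            self.proj = nn.Conv2d(dim, dim, kernel_size=stride, stride=stride)

        def forward(self, x):                               # x: (B, C, H, W) token map
            return self.proj(x).flatten(2).transpose(1, 2)  # (B, m, C), m depends on H, W

    class AdaptivePoolLandmarks(nn.Module):
        """Landmarks via adaptive pooling: m is fixed (e.g. 7*7 = 49) for any H, W."""
        def __init__(self, output_size=7):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(output_size)

        def forward(self, x):                               # x: (B, C, H, W) token map
            return self.pool(x).flatten(2).transpose(1, 2)  # (B, 49, C) for output_size=7

    x_small = torch.randn(1, 64, 28, 28)
    x_large = torch.randn(1, 64, 56, 56)
    print(StridedConvLandmarks(64)(x_small).shape, StridedConvLandmarks(64)(x_large).shape)
    # (1, 49, 64) and (1, 196, 64): m grows with the sequence length
    print(AdaptivePoolLandmarks()(x_small).shape, AdaptivePoolLandmarks()(x_large).shape)
    # (1, 49, 64) and (1, 49, 64): m stays at 49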

Substitute regular attention module with softmax-free attention module

Hello,

The background is that the computation platform I'm using is limited in that the softmax operator costs a lot of time, so I'm trying to substitute the regular attention modules with softmax-free attention modules.

I have one question about the structure of SOFT. The core of the softmax-free attention module runs like this:

    def forward(self, X, H, W):
        # Project the input tokens into queries and values (note: no separate
        # key projection) and split them into attention heads.
        Q = self.split_heads(self.W_q(X))
        V = self.split_heads(self.W_v(X))
        # Softmax-free attention over the token map of spatial size H x W.
        attn_out = self.attn(Q, V, H, W)
        attn_out = self.combine_heads(attn_out)
        # Final feed-forward projection.
        out = self.ff(attn_out)
        return out

As Q and V are both generated from X, does that mean this attention module is akin to a self-attention module rather than a cross-attention module, where Q, K, and V come from different domains? If that is the case, is there any suggestion for substituting a regular cross-attention module with softmax-free attention? Thanks.

Best,
Chenxi
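One possible direction, sketched below. This is not part of the SOFT codebase: SoftmaxFreeCrossAttention and attn_core are hypothetical placeholders, and attn_core is assumed to behave like the self.attn(Q, V, H, W) call in the snippet above. The only structural change is projecting the queries from one input and the values from a second input; whether the landmark sampling inside the core still makes sense when the two inputs have different spatial layouts would need to be checked against the actual implementation.

    import torch.nn as nn

    class SoftmaxFreeCrossAttention(nn.Module):
        """Hypothetical cross-attention wrapper around an assumed softmax-free core."""
        def __init__(self, dim, attn_core, num_heads=8):
            super().__init__()
            self.num_heads = num_heads
            self.W_q = nn.Linear(dim, dim)   # queries projected from the "target" sequence
            self.W_v = nn.Linear(dim, dim)   # values projected from the "source" sequence
            self.attn = attn_core            # assumed softmax-free core: attn(Q, V, H, W)
            self.ff = nn.Linear(dim, dim)

        def split_heads(self, x):            # (B, N, C) -> (B, heads, N, C // heads)
            B, N, C = x.shape
            return x.view(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        def combine_heads(self, x):          # inverse of split_heads
            B, H, N, D = x.shape
            return x.transpose(1, 2).reshape(B, N, H * D)

        def forward(self, X_q, X_kv, H, W):
            Q = self.split_heads(self.W_q(X_q))    # queries from one domain
            V = self.split_heads(self.W_v(X_kv))   # values (and landmarks) from the other
            out = self.combine_heads(self.attn(Q, V, H, W))
            return self.ff(out)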

Regular softmax attention in last block

Hello,

First, nice job on the work; I think this is a really interesting paper with a lot of potential to enable further theoretical investigation of deep attention mechanisms.

Looking into the code, I noticed that the final block of SOFT uses a normal softmax attention layer. Is there a reason for this? Also, did you notice any quantitative or qualitative differences between the attention heatmaps produced by this regular softmax layer and those from the approximated, softmax-free attention layers?

Thanks in advance for your time and work

some issue

Thanks for your work. In your code, line 6 of the file <substraction> always produces errors. Please help me.
