Comments (7)
Ran into the same problem:
- Without flash-attn, batch_size=1 gives normal results, but with batch_size>1 the samples that contain padding come out of the model as NaN.
- Installing flash-attn fixed it.
from qwen.
Hi, I'm not sure I understand your use case. May I know what results you were expecting? You have literally prevented the initial position from attending to itself, so it should be expected that the model does not know what the next token would be.
from qwen.
Yeah, sorry that I didn't make it clear.
I was trying to use the model.forward function directly, rather than calling model.generate, in order to observe the model's behavior in the forward pass.
My inputs are of different lengths, so I have to pad them to the same length. I used left padding, prepending <|endoftext|> pad tokens. In my opinion, those pad tokens should not be attended to, and attention_mask is used for exactly this: setting those positions to 0 so that the model won't attend to the pad tokens in the forward pass.
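For concreteness, the left-padding scheme described above can be sketched in plain Python. The `left_pad` helper and `pad_id=0` are illustrative only, not any Qwen tokenizer API; in practice the real pad token id (the `<|endoftext|>` id here) would be used.

```python
# Illustrative sketch of left padding: pads go on the LEFT of each sequence,
# and attention_mask is 0 for pad positions, 1 for real tokens.
# left_pad is a made-up helper; pad_id stands in for the real pad token id.
def left_pad(batch, pad_id):
    width = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + seq)              # prepend pads
        attention_mask.append([0] * n_pad + [1] * len(seq))   # 0 = don't attend
    return input_ids, attention_mask

ids, mask = left_pad([[5, 6, 7], [8]], pad_id=0)
# ids  -> [[5, 6, 7], [0, 0, 8]]
# mask -> [[1, 1, 1], [0, 0, 1]]
```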
However, I got all NaN logits, which confuses me. I tried not to pass the attention_mask parameter, and there are no NaN values in the logits, which is I expected. So I infer that this may be the problem of the attention_mask. To further locate the problem, I tried different attention_masks, finally found out that If we set the first position to 0 (in which case the model won't attend to the first token which is a pad token), the return values of model.forward function , i.e., the logits, will all be NaN values.
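One plausible mechanism for this, sketched as a toy in plain PyTorch (this is not Qwen's actual attention kernel, just an illustration): when the causal mask is combined with a padding mask that zeroes out position 0, that query position is left with no valid key at all, so every score in its row becomes -inf and softmax over the row produces NaN, which a fused kernel can then propagate.

```python
import torch

# Toy demonstration: a query position with every key masked out
# sees a row of -inf scores, and softmax over that row returns NaN.
seq_len = 4
scores = torch.zeros(seq_len, seq_len)                    # dummy attention scores
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
pad_mask = torch.tensor([0, 1, 1, 1], dtype=torch.bool)   # position 0 masked out

allowed = causal & pad_mask.unsqueeze(0)                  # causal AND not-padding
probs = scores.masked_fill(~allowed, float("-inf")).softmax(dim=-1)

# Row 0 (the masked pad position) can attend to nothing -> all NaN.
# Rows 1..3 still have at least one valid key, so they stay finite here.
```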
I also tried the Qwen1.5-7B-Chat model, and it does not have this problem: even if I set the attention_mask of the first position to 0, the output is still free of NaN values. So I suspect this may be a problem specific to Qwen-7B-Chat.
But I may also have made a mistake; please let me know if I did.
from qwen.
And if the masked tokens in the left positions should not know what the next token is, because they are prevented from attending to themselves, why are the logits of the other, un-masked positions (the right positions) also NaN? Did I get it wrong?
from qwen.
Hi, after reading through your comments, and if I understood correctly, Qwen1.5 was working as you would expect. I would suggest just using Qwen1.5.
P.S.: Investigating the original issue is more complicated than it appeared. Was flash attention enabled? Were you following the instructions in the README to do batch inference?
from qwen.
Hi, Qwen1.0 models and code will not be updated anymore. Please try Qwen2.0 instead.
from qwen.
Running into the same problem: using a Qwen2 model for a classification (cls) task works fine with flash-attn, but NaN appears when it is not used. Also, ONNX export does not yet seem to support the flash-attn operators, so the export fails with an error.
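If the flash-attn custom ops are what blocks the ONNX export, one possible workaround (a sketch only; the checkpoint name is a placeholder, and `attn_implementation` is the standard transformers `from_pretrained` flag for selecting the attention backend) is to load the model with plain eager attention before exporting, so no flash-attn ops end up in the graph:

```python
# Sketch: select plain PyTorch ("eager") attention so the exported ONNX graph
# contains no flash-attn custom ops. Checkpoint name is a placeholder.
load_kwargs = {"attn_implementation": "eager"}
# from transformers import AutoModelForSequenceClassification
# model = AutoModelForSequenceClassification.from_pretrained(
#     "Qwen/Qwen2-7B-Instruct", **load_kwargs)
```

Whether this resolves the NaN issue without flash-attn is a separate question; it only sidesteps the unsupported operators during export.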
from qwen.
Related Issues (20)
- [BUG] The max_position_embeddings parameter in config.json for Qwen2-57B-A14B has been mistakenly set to 131072.
- [BUG] Adding regular tokens is not supported
- How to modify the model's architecture
- [BUG] vLLM inference produces garbled output
- Can Qwen's open-source models output logprobs?
- [BUG] docker_openai_api.sh reports "can't open file 'openai_api.py'"
- Why is GPU memory usage so low during inference?
- [BUG] Loss becomes 0 when fine-tuning Qwen2-7B-Instruct with SFT; how can this be fixed?
- What advantages does LLM function calling have over traditional NLP approaches?
- [BUG] The function call example in the Bailian documentation is incorrect
- Why is the tokenizer's padding_side different in Qwen/finetune.py and Qwen/eval/evaluate_ceval.py?
- [BUG] Qwen 1.8B errors out during multi-threaded inference
- [BUG] model_max_length 32768 does not work
- [BUG] Does the QWenModel module inside QWenLMHeadModel process the text information?
- The pad_token in the official inference script and in the model files are inconsistent
- The difference between Qwen-Chat-RLHF and Qwen-Chat
- [BUG] Garbled output after increasing the context length
- [BUG] Running a QLoRA-finetuned model on an Nvidia Jetson Orin NX board fails with: QuantLinear() not supported
- After AWQ quantization the output does not stop properly; unquantized inference is normal
- Can fine-tuning with a local knowledge base be supported?