Multi-GPU fine-tuning error, about yuanzhoulvpi2017/zero_nlp

Comments (19)

cywjava commented on May 14, 2024 1

You need to use the .py files he provides under thuglm.

from zero_nlp.

xiamaozi11 commented on May 14, 2024 1

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
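
The message above suggests CUDA_LAUNCH_BLOCKING=1. One way to apply it from Python is sketched below. The helper name `enable_blocking_cuda_launches` is hypothetical, and the sketch assumes it runs before any CUDA work starts, so the stack trace points at the call that actually failed.

```python
import os


def enable_blocking_cuda_launches(env=os.environ):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given environment mapping."""
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


# Call this at the very top of the training script. Setting it before
# `import torch` is the safe choice, since it must be in place before
# CUDA is initialized.
enable_blocking_cuda_launches()
```

Synchronous launches slow training down, so this is meant for debugging runs only.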

from zero_nlp.

1006076811 commented on May 14, 2024 1

I've run into this problem too. Is there a fix?

from zero_nlp.

cywjava commented on May 14, 2024

File "/home/thudm/.cache/huggingface/modules/transformers_modules/local/modeling_chatglm.py", line 864, in forward
logger.warning_once(
AttributeError: 'Logger' object has no attribute 'warning_once'
I'm a beginner at debugging, and every step is another pitfall...
Why isn't it using the file you modified?

from zero_nlp.

pollymars commented on May 14, 2024

I ran into this bug too. I changed warning_once to info.
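
A minimal sketch of that workaround, assuming the logger in use has no `warning_once` method (as the traceback above shows). The helper name `warn_once_compat` is hypothetical and not part of modeling_chatglm.py, and the standard library `logging` module stands in for the real logger here.

```python
import logging

logger = logging.getLogger("chatglm_compat_sketch")


def warn_once_compat(log, message):
    """Use log.warning_once when it exists, otherwise fall back to log.info."""
    emit = getattr(log, "warning_once", None)
    if emit is None:
        # Older loggers lack warning_once, so log at info level instead.
        emit = log.info
    emit(message)
```

Replacing the failing `logger.warning_once(...)` call with `logger.info(...)` directly, as described above, has the same effect on loggers that lack the method.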

from zero_nlp.

cywjava commented on May 14, 2024

Changing it to info does work. It's running normally now, on four P40s.

from zero_nlp.

xiamaozi11 commented on May 14, 2024

Did you get it fixed?

from zero_nlp.

cywjava commented on May 14, 2024

from zero_nlp.

xiamaozi11 commented on May 14, 2024

Can you say where to make the change?

from zero_nlp.

xiamaozi11 commented on May 14, 2024

from zero_nlp.

zhangzai666 commented on May 14, 2024

A question for the experts. Fine-tuning fails with this error:
RuntimeError: Caught RuntimeError in replica 1 on device 1.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper__index_select)
Has anyone run into this? How should I fix it?

from zero_nlp.

cywjava commented on May 14, 2024

Use the files in his directory, copy the original bin files into it along with the vocabulary file ice_txt.bin, and that's all you need.

from zero_nlp.

zhangzai666 commented on May 14, 2024

Thanks for the answer. Now it reports a new error.
ValueError: Caught ValueError in replica 1 on device 1.
ValueError: 150001 is not in list

from zero_nlp.

cywjava commented on May 14, 2024

Try setting export CUDA_VISIBLE_DEVICES=0

from zero_nlp.

zhangzai666 commented on May 14, 2024

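Still no luck. The message seems to say the list has no element 150001. The earlier error looked like a GPU memory problem, but I have four 40G cards, so in principle that shouldn't happen. Is this caused by multi-GPU training? I haven't read the code carefully yet, and it threw a lot of errors right away.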
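
For context, "150001 is not in list" is the message Python's `list.index` raises when the value is missing. The sketch below shows a lookup with a more descriptive error. It assumes the failing code searches a list of token ids for a special token, which this thread does not confirm, and the name `find_token_position` is hypothetical.

```python
def find_token_position(input_ids, token_id):
    """Return the index of token_id in input_ids, with a clearer error if absent."""
    if token_id not in input_ids:
        raise ValueError(
            f"token id {token_id} not found in input_ids; "
            f"largest id present is {max(input_ids, default=None)}. "
            "Check that the tokenizer files match the model code."
        )
    return input_ids.index(token_id)


# Hypothetical ids, for illustration only.
position = find_token_position([5, 150001, 7], 150001)
```

If the largest id present is far below the one being searched for, a mismatch between the vocabulary file and the model code is one possible cause.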

from zero_nlp.

cywjava commented on May 14, 2024

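With multi-GPU training, why is the first card always at over 90% load while the others sit idle? When I train my GPT2 model on multiple GPUs, this problem doesn't occur.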

from zero_nlp.

yuanzhoulvpi2017 commented on May 14, 2024

Could it be that the vocabulary file wasn't downloaded correctly?

from zero_nlp.

xiaoweiweixiao commented on May 14, 2024

I've hit this problem too. Did you solve it?

from zero_nlp.

yuanzhoulvpi2017 commented on May 14, 2024

I've added single-machine multi-GPU training code. The link is here: https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/Chatglm6b_ModelParallel

from zero_nlp.

Multi-GPU fine-tuning error, about zero_nlp: 19 comments, closed
