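A few questions: 1. In section 3.2.3, "Download the pretrained model and model config files", the downloaded model folder is named ChatLM-mini-Chinese, but the command is mv ChatLM-Chines…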

Error when running SFT on the provided model · chatlm-mini-chinese · 13 comments · CLOSED

charent commented on July 29, 2024

Error when running SFT on the provided model

from chatlm-mini-chinese.

Comments (13)

charent commented on July 29, 2024

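First of all, thanks for the feedback.

1. For the mv command, go by the folder you actually downloaded, ChatLM-mini-Chinese; I will update the README shortly. The Hugging Face repo was originally named ChatLM-Chinese-0.2B, but calling AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True) raised an error and could not download the model files. Renaming the repo fixed it, so it was probably a Hugging Face issue; I don't know whether it has been fixed since.
2. You can edit config.py: find the finetune_from_ckp_file variable on line 64 (under the SFTconfig class) and change the default PROJECT_ROOT + '/model_save/pretrain' to the path of the model files you downloaded.
3. Lines 46-55 of sft_train.py load the pretrained model. If the finetune_from_ckp_file you set in item 2 is a folder, TextToTextModel.from_pretrained is called, and that method loads safetensors without problems. So you probably set finetune_from_ckp_file to the safetensors file itself; change it to the folder. load_state_dict is for loading a native PyTorch .bin checkpoint.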

    # step 2. Load the pretrained model
    model = None
    if os.path.isdir(config.finetune_from_ckp_file):
        # A folder was passed in: load with from_pretrained
        model = TextToTextModel.from_pretrained(config.finetune_from_ckp_file)
    else:
        # A single checkpoint file was passed in: build the model, then load_state_dict
        t5_config = get_T5_config(T5ModelConfig(), vocab_size=len(tokenizer), decoder_start_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id)
        model = TextToTextModel(t5_config)
        model.load_state_dict(torch.load(config.finetune_from_ckp_file, map_location='cpu'))  # map to CPU to avoid device errors
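The folder-versus-file branch above suggests a small guard. The following is only a sketch under assumptions: `resolve_ckpt_path` is a hypothetical helper, not part of the repository, that turns a path pointing at a `.safetensors` file into its containing folder so that the `from_pretrained` branch is taken.

```python
import os


def resolve_ckpt_path(path: str) -> str:
    """Return the path to use as finetune_from_ckp_file.

    Hypothetical helper for illustration only (not in the repository).
    """
    if path.endswith(".safetensors"):
        # A safetensors file is not a pickled state dict, so torch.load
        # cannot read it; point at the containing folder instead.
        return os.path.dirname(os.path.abspath(path))
    # Folders and native PyTorch checkpoint files are passed through unchanged.
    return path
```

With this, a value such as `/data/pretrain/model.safetensors` (an example path, not from the thread) would become `/data/pretrain`.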

cq1316 commented on July 29, 2024

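On item 3: the error is raised exactly on the line model = TextToTextModel.from_pretrained(config.finetune_from_ckp_file). I have already put the model files under model_save/pretrain, with no other folder in between, but it still fails with Error while deserializing header: HeaderTooLarge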

charent commented on July 29, 2024

What files are in your model_save/pretrain folder? It needs all of the following files; putting only model.safetensors there is not enough.

├─model_save
|  ├─config.json
|  ├─configuration_chat_model.py
|  ├─generation_config.json
|  ├─model.safetensors
|  ├─modeling_chat_model.py
|  ├─special_tokens_map.json
|  ├─tokenizer.json
|  └─tokenizer_config.json
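A quick way to check a folder against the listing above is sketched below. The file names come from that listing; `missing_model_files` is a hypothetical helper name, not something the repository provides.

```python
import os
from typing import List

# File names taken from the listing above.
REQUIRED_FILES = [
    "config.json",
    "configuration_chat_model.py",
    "generation_config.json",
    "model.safetensors",
    "modeling_chat_model.py",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_model_files(model_dir: str) -> List[str]:
    """Return the required file names that are absent from model_dir."""
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(model_dir, name))
    ]
```

An empty list means every listed file is present. It says nothing about whether the files are complete.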

cq1316 commented on July 29, 2024

I think I found the problem. finetune_from_ckp_file here points to model_save, but the first error said there was no pretrain folder under model_save. Your model_save directory ships with a tokenizer folder.

charent commented on July 29, 2024

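The tokenizer folder that ships under model_save is a legacy leftover, lol; I didn't plan things well at the time. I have wanted to delete it, but I'm afraid someone is using it or that something else would break.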

cq1316 commented on July 29, 2024

Still not working. The file locations and contents now all match the code, but running SFT still fails with safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge

charent commented on July 29, 2024

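Do your PyTorch and transformers versions match the ones required in requirements.txt? Can you run the following code directly? If so, your environment is fine. Do not change model_id to a local path; let it download directly from Hugging Face.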

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = 'charent/ChatLM-mini-Chinese'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True).to(device)

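Check whether the model files you downloaded are complete. You can set model_dir under InferConfig in config.py to the directory of the model you want to fine-tune, then run python cli_demo.py to see whether it loads. If it does not load, the downloaded model files are incomplete; just download them again.
I tried it on my end and SFT runs normally, as the screenshot shows: [screenshot not preserved]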

cq1316 commented on July 29, 2024

I am using the model you provided, and cli_demo fails with the same error.

charent commented on July 29, 2024

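"Still not working. The file locations and contents now all match the code, but running SFT still fails with safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge"

I searched for this error: it points to a problem with the model file itself. The file is incomplete, and downloading it again fixes it. Here are the MD5 values I got with the md5sum command so you can compare; the one that matters most is model.safetensors.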

1de0ba231817fcdaf97e025aa0dfcd00  config.json
e7356676de6c8bad26d2c7ceedc92fad  generation_config.json
655bcc42640baefba8b188a0aa65d339  model.safetensors
adeee419c31a613d7dd281b736e3873a  modeling_chat_model.py
ba22587440fe5ff64aab2cb552cb8654  special_tokens_map.json
0b65eef22c7fb9e1c16a4e51f359134a  tokenizer.json
9fc5ebbabcf9eb5ad752e16649938afc  tokenizer_config.json
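For systems without the md5sum command, the same comparison can be sketched in Python with the standard `hashlib` module. `md5_of_file` is a hypothetical helper name; the expected value is copied from the list above.

```python
import hashlib

# Expected digest copied from the md5sum output above.
EXPECTED_MD5 = {
    "model.safetensors": "655bcc42640baefba8b188a0aa65d339",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large model files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `md5_of_file("model.safetensors")` against `EXPECTED_MD5["model.safetensors"]` then shows whether the local copy matches.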

cq1316 commented on July 29, 2024

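Problem solved: Git LFS was not installed, so the model files were not fully downloaded. I suggest adding a step to the README for checking whether Git LFS is installed.
One more issue: after SFT training finishes, two files are missing from the model directory and have to be moved over manually from the original model folder. I suggest copying the missing files over automatically when SFT finishes.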
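A likely explanation for the HeaderTooLarge message, offered as a sketch rather than a confirmed diagnosis: a safetensors file begins with an 8-byte little-endian unsigned integer giving the length of its JSON header, while a clone made without Git LFS leaves a small text pointer file that begins with `version https://git-lfs.github.com/spec/v1`. Reading that text as a header length yields an enormous number. The helper names below are hypothetical.

```python
import struct

# First line of every Git LFS pointer file.
LFS_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_git_lfs_pointer(path: str) -> bool:
    """Return True if the file looks like a Git LFS pointer, not real data."""
    with open(path, "rb") as f:
        return f.read(len(LFS_PREFIX)) == LFS_PREFIX


def declared_header_len(path: str) -> int:
    """Interpret the first 8 bytes as a little-endian u64 header length."""
    with open(path, "rb") as f:
        first8 = f.read(8)
    if len(first8) < 8:
        raise ValueError("file is shorter than 8 bytes")
    return struct.unpack("<Q", first8)[0]
```

If `is_git_lfs_pointer("model.safetensors")` returns True, installing Git LFS and fetching the real file (or downloading it again) is the fix described in this thread.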

charent commented on July 29, 2024

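The README already says that Git LFS must be installed first if you download the files with git. I saw that some people download the files manually in a browser and move them over, so I did not emphasize it. I will fix that the next time I update the README.

On the second question: you mean the two .py files, right? TextToTextModel is a custom class (it simply inherits from T5 and implements its own generate method). The two files are only needed because the model was uploaded to the Hugging Face repo for others to use: running AutoModelForSeq2SeqLM.from_pretrained('charent/ChatLM-mini-Chinese', trust_remote_code=True) has to download these two .py files in order to load TextToTextModel.
For local use:
1. If you load the model with TextToTextModel.from_pretrained(...), the two .py files are not needed, because the model folder is already there from the clone; from model.chat_model import TextToTextModel is enough.
2. The two .py files are only needed if you load with AutoModelForSeq2SeqLM.from_pretrained(...), which uses them to load TextToTextModel.

Either way works. My code always loads local models with TextToTextModel, so you can ignore these two .py files. To see how the two .py files map to the concrete model class, take a look at config.json.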
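To inspect that mapping, the sketch below reads config.json and returns its `auto_map` entry, which is the key transformers uses to locate custom code when `trust_remote_code=True` is set. That this repository's config.json contains an `auto_map` key is an assumption here, and `read_auto_map` is a hypothetical helper name.

```python
import json
import os


def read_auto_map(model_dir: str) -> dict:
    """Return the auto_map section of config.json, or {} if it is absent."""
    config_path = os.path.join(model_dir, "config.json")
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # Keys are Auto* class names; values are typically "module_file.ClassName".
    return config.get("auto_map", {})
```

Each value names the .py file (without the extension) and the class inside it that the corresponding Auto class should instantiate.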
