charent / chatlm-mini-chinese

A small 0.2B-parameter Chinese dialogue model (ChatLM-Chinese-0.2B). All code for the full pipeline is open-sourced: dataset sources, data cleaning, tokenizer training, model pre-training, SFT instruction fine-tuning, RLHF optimization, and more. Downstream SFT fine-tuning is supported, with a worked example of fine-tuning for triple (subject, predicate, object) information extraction.

License: Apache License 2.0

Python 68.49% Jupyter Notebook 31.51%
chatbot language-model t5-model text-generation

chatlm-mini-chinese's Introduction


Hi there 👋

Thanks for visiting my GitHub page. Here are some facts about me:

  • 🔭 I'm currently working on: machine learning / deep learning, data analysis / risk control / data mining, and algorithms.
  • 🌱 I'm also working on these NLP directions: text classification, information extraction, and text generation.
  • 🔬 I'm currently interested in how to obtain high-quality text for training language models (for example, text-to-text models such as T5 and causal language models such as GPT-2 / Phi), and in how to speed up LLM (Large Language Model) training, fine-tuning, and inference. The application of LLMs in vertical domains, such as RAG (Retrieval-Augmented Generation), is also a very interesting direction.
  • 📫 ······

My skills 🛠️

  • Languages:
    • Python, SQL, Shell, C++, a little Golang and a little Java.
  • Frameworks:
    • PyTorch, Huggingface's NLP framework, Pandas & Numpy, PySpark, Hive.
  • Developments:
    • Linux, Git, Docker, VSCode, Markdown.

Contributions 🧑‍💻

(GitHub streak, stats, and top-languages badges)

Links 🔗

chatlm-mini-chinese's People

Contributors

charent, dependabot[bot]


chatlm-mini-chinese's Issues

Shape mismatch when using train.py

When pre-training locally with train.py, I get accelerate.utils.operations.DistributedOperationException: Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid. Input shapes: - Process 0: [16, 174] - Process 1: [16, 167]
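
Gathering per-rank tensors with different sequence lengths (174 vs. 167 here) produces exactly this DistributedOperationException. A minimal sketch of one common workaround, assuming an accelerate-based loop like train.py's: pad the tensors to a common length on every rank before any cross-device gather (the tensor name and pad id below are illustrative):

    import torch
    from accelerate import Accelerator

    accelerator = Accelerator()
    # simulate ranks producing different sequence lengths, e.g. [16, 174] vs [16, 167]
    input_ids = torch.randint(0, 100, (16, 167 + 7 * accelerator.process_index))
    # pad dim=1 (sequence length) with the pad id so every rank ends up with the same shape
    input_ids = accelerator.pad_across_processes(input_ids, dim=1, pad_index=0)
    gathered = accelerator.gather(input_ids)  # now safe: identical shapes on all ranks

Padding the data collator to a fixed max_length on every process would achieve the same effect further upstream.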

How do I run step "3.3 Tokenizer training"?

Following the steps in "3.2 Start from cloning the repository", I completed the following three steps:
3.2.1 Clone the project
3.2.2 Install the dependencies
3.2.3 Download the pre-trained model and model configuration files

The final model_save directory structure is as follows:

(aaaenv) root@bcdeee:/ChatLMChinese02B/ChatLM-mini-Chinese# ll model_save/
total 734283
drwxr-xr-x 1 root root        12 Feb  4 10:53 ./
drwxr-xr-x 1 root root        25 Feb  4 10:49 ../
-rw-r--r-- 1 root root     24974 Feb  4 10:44 README.md
-rw-r--r-- 1 root root       803 Feb  4 10:44 config.json
-rw-r--r-- 1 root root       126 Feb  4 10:44 configuration.json
-rw-r--r-- 1 root root        95 Feb  4 10:44 configuration_chat_model.py
drwxr-xr-x 1 root root         8 Feb  4 10:44 description/
-rw-r--r-- 1 root root       142 Feb  4 10:44 generation_config.json
-rw-r--r-- 1 root root 750794624 Feb  4 10:44 model.safetensors
-rw-r--r-- 1 root root      3208 Feb  4 10:44 modeling_chat_model.py
-rw-r--r-- 1 root root         0 Feb  4 09:48 put_model_files_here
-rw-r--r-- 1 root root        75 Feb  4 10:44 special_tokens_map.json
-rw-r--r-- 1 root root   1077208 Feb  4 10:44 tokenizer.json
-rw-r--r-- 1 root root      1420 Feb  4 10:44 tokenizer_config.json

My question: for step "3.3 Tokenizer training", is this how it should be run?

(aaaenv) root@bcdeee:/ChatLMChinese02B/ChatLM-mini-Chinese# python utils/train_tokenizer.py

Running python utils/train_tokenizer.py this way fails with:

  File "/ChatLMChinese02B/ChatLM-mini-Chinese/utils/train_tokenizer.py", line 23, in <module>
    from config import PROJECT_ROOT
ModuleNotFoundError: No module named 'config'

Do I need to download a corpus from somewhere first? Before running step 3.3, does the corpus need to be downloaded and processed?
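
As for the ModuleNotFoundError: the script imports config.py from the project root, which is not on the import path when utils/train_tokenizer.py is launched directly. One possible workaround (a sketch, not necessarily the author's intended invocation) is to run it with the project root on PYTHONPATH, e.g. PYTHONPATH=. python utils/train_tokenizer.py, or to prepend the root to sys.path inside the script:

    # Hypothetical patch near the top of utils/train_tokenizer.py:
    # make the repository root importable so `from config import PROJECT_ROOT` resolves
    import os
    import sys

    sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

    from config import PROJECT_ROOT  # noqa: E402

The corpus question is separate: the tokenizer trainer will still need its training text once the import issue is resolved.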

OOM during pre-training with 1.6M samples (about 2 GB of sentence pairs) on 48 GB A40 GPUs, regardless of using 1/2/3/4 cards

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Using auto half precision backend
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
***** Running training *****
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 128
Gradient Accumulation steps = 8
Total optimization steps = 25,458
Number of trainable parameters = 223,395,072
0%| | 0/25458 [00:00<?, ?it/s]
/root/miniconda3/envs/myenv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2692: UserWarning: max_length is ignored when padding=True and there is no truncation strategy. To pad to max length, use padding='max_length'.
warnings.warn(
***** Running training *****
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Training with DataParallel so batch size has been adjusted to: 8
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 8
Total optimization steps = 50,918
Number of trainable parameters = 223,395,072
0%| | 0/25458 [00:01<?, ?it/s]
{'loss': 10.8039, 'grad_norm': 44.39530563354492, 'learning_rate': 9.765625e-08, 'epoch': 0.0} [00:00<?, ?it/s]
***** Running training *****  | 9/50918 [00:22<35:08:12, 2.48s/it]
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Training with DataParallel so batch size has been adjusted to: 4
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 8
Total optimization steps = 101,836
Number of trainable parameters = 223,395,072
0%| | 9/50918 [00:25<40:38:53, 2.87s/it]
{'loss': 10.6074, 'grad_norm': 47.77275085449219, 'learning_rate': 9.765625e-08, 'epoch': 0.0}
{'loss': 10.1781, 'grad_norm': 27.392656326293945, 'learning_rate': 4.8828125e-06, 'epoch': 0.0}
0%| | 73/101836 [01:20<50:52:08, 1.80s/it]
***** Running training *****
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Training with DataParallel so batch size has been adjusted to: 2
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 8
Total optimization steps = 203,674
Number of trainable parameters = 223,395,072
0%| | 73/101836 [01:23<32:14:29, 1.14s/it]
{'loss': 9.493, 'grad_norm': 18.48982048034668, 'learning_rate': 9.765625e-08, 'epoch': 0.0}74 [00:00<?, ?it/s]
{'loss': 9.3833, 'grad_norm': 24.369417190551758, 'learning_rate': 4.8828125e-06, 'epoch': 0.0}
{'loss': 9.328, 'grad_norm': 31.319684982299805, 'learning_rate': 9.765625e-06, 'epoch': 0.0}
***** Running training *****  | 146/203674 [01:53<50:35:35, 1.12it/s]
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Training with DataParallel so batch size has been adjusted to: 1
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 8
Total optimization steps = 407,348
Number of trainable parameters = 223,395,072
0%| | 146/203674 [01:56<45:07:49, 1.25it/s]
{'loss': 9.116, 'grad_norm': 35.11280822753906, 'learning_rate': 9.765625e-08, 'epoch': 0.0}
{'loss': 8.8506, 'grad_norm': 42.10904312133789, 'learning_rate': 4.8828125e-06, 'epoch': 0.0}
{'loss': 8.7471, 'grad_norm': 67.85017395019531, 'learning_rate': 9.765625e-06, 'epoch': 0.0}
{'loss': 8.6334, 'grad_norm': 60.81837844848633, 'learning_rate': 1.4648437500000001e-05, 'epoch': 0.0}
{'loss': 8.4838, 'grad_norm': 69.64332580566406, 'learning_rate': 1.953125e-05, 'epoch': 0.0}
{'loss': 8.3629, 'grad_norm': 44.6363525390625, 'learning_rate': 2.44140625e-05, 'epoch': 0.0}
{'loss': 8.2015, 'grad_norm': 52.63124084472656, 'learning_rate': 2.9296875000000002e-05, 'epoch': 0.0}
0%| | 344/407348 [04:03<79:03:03, 1.43it/s]
Traceback (most recent call last):
File "/root/t5/./pre_train.py", line 144, in <module>
pre_train(config)
File "/root/t5/./pre_train.py", line 127, in pre_train
trainer.train(
File "/root/miniconda3/envs/myenv/lib/python3.11/site-packages/transformers/trainer.py", line 1859, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/myenv/lib/python3.11/site-packages/accelerate/utils/memory.py", line 140, in decorator
raise RuntimeError("No executable batch size found, reached zero.")

This is the run log from a single A40 (48 GB).
It looks as if the GPU memory is not being released.
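
The restarts in the log ("batch size has been adjusted to: 8 / 4 / 2 / 1") are the Trainer's automatic batch-size finder halving the batch after each OOM until it reaches zero; the underlying problem is peak memory per device. A sketch of settings that typically lower peak memory on a single 48 GB card (hypothetical values, not the project's official configuration):

    from transformers import Seq2SeqTrainingArguments

    # hypothetical memory-saving settings; tune batch size and accumulation to taste
    args = Seq2SeqTrainingArguments(
        output_dir="model_save/pretrain",
        per_device_train_batch_size=8,    # smaller than the original 16
        gradient_accumulation_steps=16,   # keep the effective batch size similar
        gradient_checkpointing=True,      # trade extra compute for less activation memory
        bf16=True,                        # half precision, supported on the A40
        auto_find_batch_size=False,       # fail fast instead of halving down to zero
    )

Capping the sequence length in the collator also helps: the warning above shows padding=True without a truncation strategy, so unusually long samples are not being truncated.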

Fine-tuning with LoRA and sft_train.py seems to have no effect; is there a better approach?

I ran a fine-tuning test with only two samples, once with LoRA and once with sft_train.py. The training data contained just two items: "什么是大排档" (What is a dai pai dong?) and the provided example "对于花园街,你有什么了解或看法吗?" (What do you know or think about Fa Yuen Street?).
The data was already formatted as "prompt"/"response" pairs, and "num_train_epochs" was set to 10.

Output:
***** Running training *****
Num examples = 2
Num Epochs = 10
Instantaneous batch size per device = 12
Total train batch size (w. parallel, distributed & accumulation) = 48
Gradient Accumulation steps = 4
Total optimization steps = 10
Number of trainable parameters = 187,692,288
{'loss': 1.0191, 'learning_rate': 1.0000000000000001e-07, 'epoch': 1.0}

{'train_runtime': 20.1565, 'train_samples_per_second': 0.992, 'train_steps_per_second': 0.496, 'train_loss': 1.0059864044189453, 'epoch': 10.0}

I'm new to this, so please bear with me: is it simply because there is too little data? For "大排档" I also tried adding it as a token with tokenizer.add_tokens(['大排档']), but it made no difference; judging by the output, the model seems to interpret it as a car's engine displacement (大排量) and gear (档位).

Also, could this warning indicate a problem? I checked and the dataset does contain prompt, input_mask, and response:

The following columns in the training set don't have a corresponding argument in TextToTextModel.forward and have been ignored: prompt, input_mask, response. If prompt, input_mask, response are not expected by TextToTextModel.forward, you can safely ignore this message.
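
Two training samples are far too few for the behaviour to change noticeably. Also note that tokenizer.add_tokens only updates the tokenizer: the model's embedding matrix must be resized as well, and the new row starts from random initialisation. A minimal sketch, assuming the standard Hugging Face loading path works for this model (class names and paths are illustrative):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # illustrative paths; point these at the actual model directory
    tokenizer = AutoTokenizer.from_pretrained("model_save", trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained("model_save", trust_remote_code=True)

    num_added = tokenizer.add_tokens(["大排档"])
    if num_added > 0:
        # grow the embedding matrix so the new token has its own (randomly initialised) row
        model.resize_token_embeddings(len(tokenizer))

As for the warning, the message itself says it can usually be ignored: it only reports that raw dataset columns not accepted by TextToTextModel.forward were dropped before the forward pass.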

Some NCCL operations have failed or timed out.

[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=_ALLGATHER_BASE, NumelIn=7168, NumelOut=14336, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e2ae5781d87 in /home/dongbingcheng/anaconda3/envs/llmfinetuning/lib/python3.9/site-packages/torch/lib/libc10.so)

I'm training on two GPUs, and the error seems to appear right after the first epoch finishes.
I'm using the provided train.py.
Could it be that at evaluation time an earlier process hasn't finished yet?
Should an accelerator.wait_for_everyone() be added?
Thanks for any help!
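
The _ALLGATHER_BASE timeout after 30 minutes is consistent with one rank entering a collective (evaluation or checkpoint saving) while the other rank is still busy. The suggestion above is the usual remedy: add a barrier so every rank reaches the same point first. A sketch of where such a barrier typically goes, assuming an accelerate-based loop like train.py's:

    from accelerate import Accelerator

    accelerator = Accelerator()

    # ... per-rank training work for the epoch happens here ...

    # barrier: block until every rank arrives, so evaluation / checkpointing
    # does not start a collective while another rank is still training
    accelerator.wait_for_everyone()

    if accelerator.is_main_process:
        print("all ranks synchronised; safe to evaluate and save the checkpoint")

    accelerator.wait_for_everyone()  # re-sync before the next epoch begins

If evaluation or saving legitimately takes longer than the 30-minute NCCL timeout, the timeout itself can also be raised via accelerate's InitProcessGroupKwargs when the Accelerator is constructed.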

Why are the predicted triples wrong after fine-tuning?

Hello. I fine-tuned the triple-extraction task with the provided script; after 5 epochs the loss converged to 0.087100, but the triples predicted for the test samples look rather unreasonable.

ret = bot.chat('请抽取出给定句子中的所有三元组。给定句子:傅淑云,女,汉族,1915年出生,上海人')
# output: [(,,),(,,1915),(,,)][EOS]

bot.chat(['你好', '请抽取出给定句子中的所有三元组。给定句子:江苏省赣榆海洋经济开发区位于赣榆区青口镇临海而建,2003年1月28日,经江苏省人民政府《关于同意设立赣榆海洋经济开发区的批复》(苏政复〔2003〕14号)文件批准为全省首家省级海洋经济开发区,','如何看待最近南方天气突然变冷?'])
# ['[(,,)][EOS]',
# '[(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,',
# '[(,,)][EOS]']

Is this caused by incorrect default parameter settings, or by different package versions? The training and test data both use DuIE 1.0, processed into the expected format. The base model is t5-base downloaded from Hugging Face.

Hello, first time using this project. What does the runtime error unsupported operand type(s) for |: 'types.GenericAlias' and 'type' mean?

The official demo runs fine, but running the API, training, and fine-tuning from the git repository all hit this error. The exact location is:

Traceback (most recent call last):
File "/Users/xxxx/xxx/ChatLM-mini-Chinese/sft_train.py", line 16, in
from utils.functions import get_T5_config
File "/Users/xxxx/xxx/ChatLM-mini-Chinese/utils/functions.py", line 27, in
def _get_doc_mini_hash(doc: list[str] | str, num_perm: int) -> MinHash:
TypeError: unsupported operand type(s) for |: 'types.GenericAlias' and 'type'
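
This error comes from the annotation list[str] | str in utils/functions.py: the X | Y union syntax in annotations is evaluated at definition time and only works on Python 3.10 or newer, so the error indicates an older interpreter (typically Python 3.9). Besides upgrading Python, a one-line sketch of a backwards-compatible fix (a hypothetical edit, relying on PEP 563 deferred annotations):

    # Hypothetical edit at the very top of utils/functions.py (before the other imports):
    # with deferred evaluation, `list[str] | str` is no longer executed at definition time
    from __future__ import annotations

Alternatively, the annotation can be rewritten as typing.Union[List[str], str], which also works on Python 3.9.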

Error partway through SFT fine-tuning

Operating system: Windows 10

The first time save_steps (5000 steps) is reached and the model checkpoint has been saved, a permission error occurs:

Saving model checkpoint to D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000
Configuration saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\config.json
Configuration saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\generation_config.json
Model weights saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\model.safetensors
tokenizer config file saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\tokenizer_config.json
Special tokens file saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\special_tokens_map.json
Traceback (most recent call last):
  File "D:\sj\project\python\ChatLM-mini-Chinese\sft_train.py", line 127, in <module>
    sft_train(config)
  File "D:\sj\project\python\ChatLM-mini-Chinese\sft_train.py", line 114, in sft_train
    trainer.train(
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\transformers\trainer.py", line 1539, in train
    return inner_training_loop(
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\accelerate\utils\memory.py", line 136, in decorator
    return function(batch_size, *args, **kwargs)
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\transformers\trainer.py", line 1929, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\transformers\trainer.py", line 2300, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\transformers\trainer.py", line 2418, in _save_checkpoint
    fd = os.open(output_dir, os.O_RDONLY)
PermissionError: [Errno 13] Permission denied: 'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000'

I have tried twice with the same result, and the second time I ran it with administrator privileges.

Problem processing the train_3.5M_CN data

The BelleGroup train_3.5M_CN data is currently in this format:
{
"id":"66623028",
"conversations":[
{
"from":"human",
"value":"归纳介绍三种不同类型的电影题材"
},
{
"from":"assistant",
"value":"1. 动作片:动作片通常强调动作和战斗场面,故事情节紧凑,充满了紧张和兴奋的氛围。主角一般是具有超凡实力或特殊技能的英雄人物,往往会与反派角色进行一系列激烈的格斗,追逐和爆破等酷炫场面。例如《速度与激情》和《复仇者联盟》等。\n2. 爱情片:爱情片是以爱情为主题的电影,着重呈现角色之间的感情纠葛,展现了爱情的美好和悲壮。情节与角色之间的发展令人深思,增加了观众的感性体验。这种电影的故事情节主要以人类与人类之间的情感变化为主题。例如《泰坦尼克号》和《让子弹飞》等。\n3. 科幻片:科幻片是以未来科技和未知世界为主题的电影,通常包括时间旅行、外星人、虚拟现实、未来社会等各种奇幻的元素。这种电影描绘了一种比现实更加宏伟、更加神秘和惊奇的世界。例如《星际穿越》和《发条橙》等。"
}
]
}

This differs from the train_2M_CN format, and the current data-processing code cannot handle train_3.5M_CN. Since this data is in multi-turn conversation form, should it simply be discarded, or can the code be modified to make use of it?
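
If the data is to be reused rather than discarded, one option (a sketch, not the project's official preprocessing; it assumes one JSON record per line in a hypothetical train_3.5M_CN.json file) is to split each multi-turn conversation into single-turn prompt/response pairs, which is the format the existing pipeline already handles:

    import json

    def conversations_to_pairs(record: dict) -> list:
        """Split one train_3.5M_CN record into single-turn prompt/response pairs."""
        turns = record["conversations"]
        pairs = []
        # walk the turns two at a time: human question, assistant answer
        for human, assistant in zip(turns[0::2], turns[1::2]):
            if human["from"] == "human" and assistant["from"] == "assistant":
                pairs.append({"prompt": human["value"], "response": assistant["value"]})
        return pairs

    with open("train_3.5M_CN.json", encoding="utf-8") as fin:  # hypothetical file name
        for line in fin:
            for pair in conversations_to_pairs(json.loads(line)):
                print(json.dumps(pair, ensure_ascii=False))

This pairing simply discards dialogue history; earlier turns could instead be concatenated into the prompt if multi-turn context should be preserved.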

Why do I get this when starting the API?

Number of processes started: 1
INFO: Will watch for changes in these directories: ['C:\ChatLM-mini-Chinese']
INFO: Uvicorn running on http://127.0.0.1:8812 (Press CTRL+C to quit)
INFO: Started reloader process [5640] using StatReload
INFO: Started server process [4988]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 127.0.0.1:52111 - "GET / HTTP/1.1" 404 Not Found

sft_train

Does the Hugging Face-based sft_train.py implement the corresponding freezing of the embeddings and the encoder?
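
Whether sft_train.py freezes anything needs to be checked in its source, but for reference, freezing the shared embeddings and the encoder of a T5-style model before passing it to the Trainer typically looks like the following (a sketch with illustrative paths and loading class):

    from transformers import AutoModelForSeq2SeqLM

    model = AutoModelForSeq2SeqLM.from_pretrained("model_save", trust_remote_code=True)

    for param in model.get_input_embeddings().parameters():
        param.requires_grad = False   # freeze the shared embedding table
    for param in model.get_encoder().parameters():
        param.requires_grad = False   # freeze the entire encoder stack

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable:,}")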

Errors when running SFT on the provided model

A few issues:
1. In "3.2.3 Download the pre-trained model and model configuration files", the downloaded model folder is named ChatLM-mini-Chinese, but the command given is mv ChatLM-Chinese-0.2B model_save, so the folder names do not match.
2. After placing the model folder under model_save and running python sft_train.py, it complains that the model_save/pretrain directory does not exist.
3. After renaming the folder to pretrain and running python sft_train.py again, it fails with Error while deserializing header: HeaderTooLarge.

Some posts online suggest changing the safetensors suffix to ckpt; I tried that and got OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory.

How can I get SFT running?

How do I run it?

I followed the instructions and completed all of the installation in section 3.2, but I'm not sure what to do next to train, nor how to launch the chat demo that is shown.

How do I load the model after SFT?

After SFT training, a new sft folder appeared inside model_save, containing checkpoint-10000 and some other files. How do I load and use the SFT-ed model?
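
A checkpoint directory written by the Trainer can normally be loaded by pointing from_pretrained at that folder. A sketch (paths and classes are illustrative; the project may wrap its T5 model differently):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    ckpt_dir = "model_save/sft/checkpoint-10000"   # illustrative checkpoint path
    tokenizer = AutoTokenizer.from_pretrained(ckpt_dir, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt_dir, trust_remote_code=True)

    inputs = tokenizer("你好", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))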

How can this project be debugged with FastChat?

Loading it directly fails, and I'm not sure how to handle it.

(sdw) PS F:\VM> python -m fastchat.serve.model_worker --model-path ChatLM-mini-Chinese
2024-02-29 16:50:15 | INFO | model_worker | args: Namespace(host='localhost', port=21002, worker_address='http://localhost:21002', controller_address='http://localhost:21001', model_path='ChatLM-mini-Chinese', revision='main', device='cuda', gpus=None, num_gpus=1, max_gpu_memory=None, dtype=None, load_8bit=False, cpu_offloading=False, gptq_ckpt=None, gptq_wbits=16, gptq_groupsize=-1, gptq_act_order=False, awq_ckpt=None, awq_wbits=16, awq_groupsize=-1, enable_exllama=False, exllama_max_seq_len=4096, exllama_gpu_split=None, exllama_cache_8bit=False, enable_xft=False, xft_max_seq_len=4096, xft_dtype=None, model_names=None, conv_template=None, embed_in_truncate=False, limit_worker_concurrency=5, stream_interval=2, no_register=False, seed=None, debug=False, ssl=False)
2024-02-29 16:50:15 | INFO | model_worker | Loading the model ['ChatLM-mini-Chinese'] on worker 43282e2a ...
2024-02-29 16:50:15 | ERROR | stderr | Traceback (most recent call last):
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\runpy.py", line 196, in _run_module_as_main
2024-02-29 16:50:15 | ERROR | stderr | return _run_code(code, main_globals, None,
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\runpy.py", line 86, in _run_code
2024-02-29 16:50:15 | ERROR | stderr | exec(code, run_globals)
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\site-packages\fastchat\serve\model_worker.py", line 414, in <module>
2024-02-29 16:50:15 | ERROR | stderr | args, worker = create_model_worker()
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\site-packages\fastchat\serve\model_worker.py", line 385, in create_model_worker
2024-02-29 16:50:15 | ERROR | stderr | worker = ModelWorker(
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\site-packages\fastchat\serve\model_worker.py", line 77, in __init__
2024-02-29 16:50:15 | ERROR | stderr | self.model, self.tokenizer = load_model(
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\site-packages\fastchat\model\model_adapter.py", line 353, in load_model
2024-02-29 16:50:15 | ERROR | stderr | model, tokenizer = adapter.load_model(model_path, kwargs)
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\site-packages\fastchat\model\model_adapter.py", line 99, in load_model
2024-02-29 16:50:15 | ERROR | stderr | model = AutoModelForCausalLM.from_pretrained(
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\site-packages\transformers\models\auto\auto_factory.py", line 564, in from_pretrained
2024-02-29 16:50:15 | ERROR | stderr | raise ValueError(
2024-02-29 16:50:15 | ERROR | stderr | ValueError: Unrecognized configuration class <class 'transformers.models.t5.configuration_t5.T5Config'> for this kind of AutoModel: AutoModelForCausalLM.
2024-02-29 16:50:15 | ERROR | stderr | Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, LlamaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MvpConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

Error during SFT fine-tuning (out of GPU memory)

(screenshots from 2024-03-20: error trace and training settings)
Doing SFT fine-tuning on the SageMaker platform, training runs out of GPU memory; the screenshots show my training settings, and even switching to a machine with 24 GB of GPU memory is not enough.
There are 135,045 training samples, and each prompt plus response totals about 1,500 tokens.

A question about repetitive generated replies

A question: I pre-trained with the datasets mentioned in the README and, as in the README, stopped training as soon as the loss dropped to about 3.x. No SFT or RLHF yet. Testing directly with cli_demo (stream=False), I get the reply below. Is this because pre-training was too short, or because SFT hasn't been done?

User: 你好
你好
ChatBot:
你好,你好,你好,你的情况,考虑为肾虚,可以服用六味地黄丸或六味地黄丸进行治疗,同时服用六味地黄丸进行治疗,不吃辛辣刺激性的食物很重要。
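
Repetitive, generic replies like this are expected from a base model that has only been pre-trained and not SFT-ed, but repetition can also be damped at decode time. A sketch of common generate() settings that discourage loops (illustrative paths and values; the project's inference code may already expose some of these):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # illustrative paths; point these at the pre-trained weights
    tokenizer = AutoTokenizer.from_pretrained("model_save", trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained("model_save", trust_remote_code=True)

    outputs = model.generate(
        **tokenizer("你好", return_tensors="pt"),
        max_new_tokens=128,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
        repetition_penalty=1.2,   # penalise tokens that have already been generated
        no_repeat_ngram_size=4,   # forbid repeating any 4-gram verbatim
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))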

Running pre_train fails with TypeError: Accelerator.__init__() got an unexpected keyword argument 'use_seedable_sampler'

E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\venv\Scripts\python.exe E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\pre_train.py
Traceback (most recent call last):
File "E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\pre_train.py", line 136, in
pre_train(config)
File "E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\pre_train.py", line 109, in pre_train
trainer = Seq2SeqTrainer(
^^^^^^^^^^^^^^^
File "E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\venv\Lib\site-packages\transformers\trainer_seq2seq.py", line 56, in init
super().init(
File "E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\venv\Lib\site-packages\transformers\trainer.py", line 367, in init
self.create_accelerator_and_postprocess()
File "E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\venv\Lib\site-packages\transformers\trainer.py", line 4127, in create_accelerator_and_postprocess
self.accelerator = Accelerator(
^^^^^^^^^^^^
TypeError: Accelerator.__init__() got an unexpected keyword argument 'use_seedable_sampler'

Process finished with exit code 1
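
This TypeError usually means the installed accelerate predates the use_seedable_sampler argument that the installed transformers is passing, i.e. the two library versions are out of step. A quick sketch for confirming both versions before upgrading accelerate (for example with pip install -U accelerate) or pinning transformers to an older release:

    # Print the two versions side by side to confirm the mismatch
    import accelerate
    import transformers

    print("transformers:", transformers.__version__)
    print("accelerate:", accelerate.__version__)
    # if transformers passes use_seedable_sampler but accelerate is too old to accept it,
    # upgrading accelerate usually resolves this TypeError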

Pre-training dataset

Hello author, why does the training data for the pre-training stage also use the prompt/response format?
My understanding is that pre-training should be unsupervised language modelling; what difference does it make to feed raw natural-language text directly rather than prompt/response-formatted data?
What was your reason for organising the pre-training data this way?

Can it be run on a server?

Traceback (most recent call last):
File "/home/aidata/work/service/ChatLM-mini-Chinese-main/cli_demo.py", line 13, in
chat_bot = ChatBot(infer_config=infer_config)
File "/home/aidata/work/service/ChatLM-mini-Chinese-main/model/infer.py", line 46, in init
model = load_and_quantize_model(
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/accelerate/utils/bnb.py", line 193, in load_and_quantize_model
return dispatch_model(model, device_map=device_map, offload_dir=offload_folder)
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/accelerate/big_modeling.py", line 436, in dispatch_model
model.to(device)
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2460, in to
return super().to(*args, **kwargs)
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
This error appears when testing the model on the server.

With multiple GPUs, is the same dataset loaded multiple times?

Thank you very much for contributing such complete code!
A question: in a single-machine multi-GPU setup, I removed shuffling from the dataset and printed the prompt returned by get_item, and different processes print the same prompt. Does that mean that with several GPUs the same sample is trained multiple times? batch_size=12, num_work=4.
Launch command: accelerate launch --multi_gpu --num_processes 2 pre_train.py

Must the pre-training dataset be in the {"prompt": ..., "response": ...} format?

{
"prompt": "对于花园街,你有什么了解或看法吗?",
"response": "花园街(是香港油尖旺区的一条富有特色的街道,位于九龙旺角东部,北至界限街,南至登打士街,与通菜街及洗衣街等街道平行。现时这条街道是香港著名的购物区之一。位于亚皆老街以南的一段花园街,也就是"波鞋街"整条街约150米长,有50多间售卖运动鞋和运动用品的店舖。旺角道至太子道西一段则为排档区,售卖成衣、蔬菜和水果等。花园街一共分成三段。明清时代,花园街是芒角村栽种花卉的地方。此外,根据历史专家郑宝鸿的考证:花园街曾是1910年代东方殷琴拿烟厂的花园。纵火案。自2005年起,花园街一带最少发生5宗纵火案,当中4宗涉及排档起火。2010年。2010年12月6日,花园街222号一个卖鞋的排档于凌晨5时许首先起火,浓烟涌往旁边住宅大厦,消防接报4"
}
Why is the pre-training dataset in the format above? Shouldn't it just be concatenated plain text?
