charent / chatlm-mini-chinese

A small 0.2B-parameter Chinese dialogue model (ChatLM-Chinese-0.2B). All code for the full pipeline is open-sourced: dataset sources, data cleaning, tokenizer training, model pre-training, SFT instruction fine-tuning, RLHF optimization, and more. Downstream SFT fine-tuning is supported, with a worked example of fine-tuning for triple (subject, predicate, object) information extraction.

License: Apache License 2.0

Python 68.49% Jupyter Notebook 31.51%
chatbot language-model t5-model text-generation

chatlm-mini-chinese's Introduction


Hi there 👋

Thanks for visiting my GitHub page. Here are some facts about me:

  • 🔭 I'm currently working on: machine learning / deep learning, data analysis / risk control / data mining, and algorithms.
  • 🌱 I'm also working on these NLP directions: text classification, information extraction, and text generation.
  • 🔬 I'm currently interested in how to obtain high-quality text for training language models (for example, text-to-text models such as T5 and causal language models such as GPT-2 / Phi), and in how to speed up LLM (Large Language Model) training, fine-tuning, and inference. The application of LLMs in vertical domains, such as RAG (Retrieval-Augmented Generation), is also a very interesting direction.
  • 📫 ······

My skills 🛠️

  • Languages:
    • Python, SQL, Shell, C++, a little Golang and a little Java.
  • Frameworks:
    • PyTorch, Huggingface's NLP framework, Pandas & Numpy, PySpark, Hive.
  • Developments:
    • Linux, Git, Docker, VSCode, Markdown.

Contributions 🧑‍💻

(GitHub streak, stats, and top-languages badges)

Links 🔗

chatlm-mini-chinese's People

Contributors

charent, dependabot[bot]


chatlm-mini-chinese's Issues

Shape mismatch when using train.py

When pre-training locally with train.py, I get accelerate.utils.operations.DistributedOperationException: Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid. Input shapes: - Process 0: [16, 174] - Process 1: [16, 167]
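
Gathering per-rank tensors with different sequence lengths (174 vs. 167 here) produces exactly this DistributedOperationException. A minimal sketch of one common workaround, assuming an accelerate-based loop like train.py's: pad the tensors to a common length on every rank before any cross-device gather (the tensor name and pad id below are illustrative):

    import torch
    from accelerate import Accelerator

    accelerator = Accelerator()
    # simulate ranks producing different sequence lengths, e.g. [16, 174] vs [16, 167]
    input_ids = torch.randint(0, 100, (16, 167 + 7 * accelerator.process_index))
    # pad dim=1 (sequence length) with the pad id so every rank ends up with the same shape
    input_ids = accelerator.pad_across_processes(input_ids, dim=1, pad_index=0)
    gathered = accelerator.gather(input_ids)  # now safe: identical shapes on all ranks

Padding the data collator to a fixed max_length on every process would achieve the same effect further upstream.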

How do I run step "3.3 Tokenizer training"?

Following the steps in "3.2 Start from cloning the repository", I completed the following three steps:
3.2.1 Clone the project
3.2.2 Install the dependencies
3.2.3 Download the pre-trained model and model configuration files

The final model_save directory structure is as follows:

(aaaenv) root@bcdeee:/ChatLMChinese02B/ChatLM-mini-Chinese# ll model_save/
total 734283
drwxr-xr-x 1 root root        12 Feb  4 10:53 ./
drwxr-xr-x 1 root root        25 Feb  4 10:49 ../
-rw-r--r-- 1 root root     24974 Feb  4 10:44 README.md
-rw-r--r-- 1 root root       803 Feb  4 10:44 config.json
-rw-r--r-- 1 root root       126 Feb  4 10:44 configuration.json
-rw-r--r-- 1 root root        95 Feb  4 10:44 configuration_chat_model.py
drwxr-xr-x 1 root root         8 Feb  4 10:44 description/
-rw-r--r-- 1 root root       142 Feb  4 10:44 generation_config.json
-rw-r--r-- 1 root root 750794624 Feb  4 10:44 model.safetensors
-rw-r--r-- 1 root root      3208 Feb  4 10:44 modeling_chat_model.py
-rw-r--r-- 1 root root         0 Feb  4 09:48 put_model_files_here
-rw-r--r-- 1 root root        75 Feb  4 10:44 special_tokens_map.json
-rw-r--r-- 1 root root   1077208 Feb  4 10:44 tokenizer.json
-rw-r--r-- 1 root root      1420 Feb  4 10:44 tokenizer_config.json

My question: for step "3.3 Tokenizer training", is this how it should be run?

(aaaenv) root@bcdeee:/ChatLMChinese02B/ChatLM-mini-Chinese# python utils/train_tokenizer.py

Running python utils/train_tokenizer.py this way fails with:

  File "/ChatLMChinese02B/ChatLM-mini-Chinese/utils/train_tokenizer.py", line 23, in <module>
    from config import PROJECT_ROOT
ModuleNotFoundError: No module named 'config'

Do I need to download a corpus from somewhere first? Before running step 3.3, does the corpus need to be downloaded and processed?
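
As for the ModuleNotFoundError: the script imports config.py from the project root, which is not on the import path when utils/train_tokenizer.py is launched directly. One possible workaround (a sketch, not necessarily the author's intended invocation) is to run it with the project root on PYTHONPATH, e.g. PYTHONPATH=. python utils/train_tokenizer.py, or to prepend the root to sys.path inside the script:

    # Hypothetical patch near the top of utils/train_tokenizer.py:
    # make the repository root importable so `from config import PROJECT_ROOT` resolves
    import os
    import sys

    sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

    from config import PROJECT_ROOT  # noqa: E402

The corpus question is separate: the tokenizer trainer will still need its training text once the import issue is resolved.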

OOM during pre-training with 1.6M samples (about 2 GB of sentence pairs) on 48 GB A40 GPUs, regardless of using 1/2/3/4 cards

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Using auto half precision backend
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
***** Running training *****
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 128
Gradient Accumulation steps = 8
Total optimization steps = 25,458
Number of trainable parameters = 223,395,072
0%| | 0/25458 [00:00<?, ?it/s]
/root/miniconda3/envs/myenv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2692: UserWarning: max_length is ignored when padding=True and there is no truncation strategy. To pad to max length, use padding='max_length'.
warnings.warn(
***** Running training *****
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Training with DataParallel so batch size has been adjusted to: 8
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 8
Total optimization steps = 50,918
Number of trainable parameters = 223,395,072
0%| | 0/25458 [00:01<?, ?it/s]
{'loss': 10.8039, 'grad_norm': 44.39530563354492, 'learning_rate': 9.765625e-08, 'epoch': 0.0} [00:00<?, ?it/s]
***** Running training *****  | 9/50918 [00:22<35:08:12, 2.48s/it]
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Training with DataParallel so batch size has been adjusted to: 4
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 8
Total optimization steps = 101,836
Number of trainable parameters = 223,395,072
0%| | 9/50918 [00:25<40:38:53, 2.87s/it]
{'loss': 10.6074, 'grad_norm': 47.77275085449219, 'learning_rate': 9.765625e-08, 'epoch': 0.0}
{'loss': 10.1781, 'grad_norm': 27.392656326293945, 'learning_rate': 4.8828125e-06, 'epoch': 0.0}
0%| | 73/101836 [01:20<50:52:08, 1.80s/it]
***** Running training *****
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Training with DataParallel so batch size has been adjusted to: 2
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 8
Total optimization steps = 203,674
Number of trainable parameters = 223,395,072
0%| | 73/101836 [01:23<32:14:29, 1.14s/it]
{'loss': 9.493, 'grad_norm': 18.48982048034668, 'learning_rate': 9.765625e-08, 'epoch': 0.0}74 [00:00<?, ?it/s]
{'loss': 9.3833, 'grad_norm': 24.369417190551758, 'learning_rate': 4.8828125e-06, 'epoch': 0.0}
{'loss': 9.328, 'grad_norm': 31.319684982299805, 'learning_rate': 9.765625e-06, 'epoch': 0.0}
***** Running training *****  | 146/203674 [01:53<50:35:35, 1.12it/s]
Num examples = 1,629,399
Num Epochs = 2
Instantaneous batch size per device = 16
Training with DataParallel so batch size has been adjusted to: 1
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 8
Total optimization steps = 407,348
Number of trainable parameters = 223,395,072
0%| | 146/203674 [01:56<45:07:49, 1.25it/s]
{'loss': 9.116, 'grad_norm': 35.11280822753906, 'learning_rate': 9.765625e-08, 'epoch': 0.0}
{'loss': 8.8506, 'grad_norm': 42.10904312133789, 'learning_rate': 4.8828125e-06, 'epoch': 0.0}
{'loss': 8.7471, 'grad_norm': 67.85017395019531, 'learning_rate': 9.765625e-06, 'epoch': 0.0}
{'loss': 8.6334, 'grad_norm': 60.81837844848633, 'learning_rate': 1.4648437500000001e-05, 'epoch': 0.0}
{'loss': 8.4838, 'grad_norm': 69.64332580566406, 'learning_rate': 1.953125e-05, 'epoch': 0.0}
{'loss': 8.3629, 'grad_norm': 44.6363525390625, 'learning_rate': 2.44140625e-05, 'epoch': 0.0}
{'loss': 8.2015, 'grad_norm': 52.63124084472656, 'learning_rate': 2.9296875000000002e-05, 'epoch': 0.0}
0%| | 344/407348 [04:03<79:03:03, 1.43it/s]
Traceback (most recent call last):
File "/root/t5/./pre_train.py", line 144, in <module>
pre_train(config)
File "/root/t5/./pre_train.py", line 127, in pre_train
trainer.train(
File "/root/miniconda3/envs/myenv/lib/python3.11/site-packages/transformers/trainer.py", line 1859, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/myenv/lib/python3.11/site-packages/accelerate/utils/memory.py", line 140, in decorator
raise RuntimeError("No executable batch size found, reached zero.")

This is the run log from a single A40 (48 GB).
It looks as if the GPU memory is not being released.
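
The restarts in the log ("batch size has been adjusted to: 8 / 4 / 2 / 1") are the Trainer's automatic batch-size finder halving the batch after each OOM until it reaches zero; the underlying problem is peak memory per device. A sketch of settings that typically lower peak memory on a single 48 GB card (hypothetical values, not the project's official configuration):

    from transformers import Seq2SeqTrainingArguments

    # hypothetical memory-saving settings; tune batch size and accumulation to taste
    args = Seq2SeqTrainingArguments(
        output_dir="model_save/pretrain",
        per_device_train_batch_size=8,    # smaller than the original 16
        gradient_accumulation_steps=16,   # keep the effective batch size similar
        gradient_checkpointing=True,      # trade extra compute for less activation memory
        bf16=True,                        # half precision, supported on the A40
        auto_find_batch_size=False,       # fail fast instead of halving down to zero
    )

Capping the sequence length in the collator also helps: the warning above shows padding=True without a truncation strategy, so unusually long samples are not being truncated.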

Fine-tuning with LoRA and sft_train.py seems to have no effect; is there a better approach?

I ran a fine-tuning test with only two samples, once with LoRA and once with sft_train.py. The training data contained just two items: "什么是大排档" (What is a dai pai dong?) and the provided example "对于花园街,你有什么了解或看法吗?" (What do you know or think about Fa Yuen Street?).
The data was already formatted as "prompt"/"response" pairs, and "num_train_epochs" was set to 10.

Output:
***** Running training *****
Num examples = 2
Num Epochs = 10
Instantaneous batch size per device = 12
Total train batch size (w. parallel, distributed & accumulation) = 48
Gradient Accumulation steps = 4
Total optimization steps = 10
Number of trainable parameters = 187,692,288
{'loss': 1.0191, 'learning_rate': 1.0000000000000001e-07, 'epoch': 1.0}

{'train_runtime': 20.1565, 'train_samples_per_second': 0.992, 'train_steps_per_second': 0.496, 'train_loss': 1.0059864044189453, 'epoch': 10.0}

I'm new to this, so please bear with me: is it simply because there is too little data? For "大排档" I also tried adding it as a token with tokenizer.add_tokens(['大排档']), but it made no difference; judging by the output, the model seems to interpret it as a car's engine displacement (大排量) and gear (档位).

Also, could this warning indicate a problem? I checked and the dataset does contain prompt, input_mask, and response:

The following columns in the training set don't have a corresponding argument in TextToTextModel.forward and have been ignored: prompt, input_mask, response. If prompt, input_mask, response are not expected by TextToTextModel.forward, you can safely ignore this message.
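
Two training samples are far too few for the behaviour to change noticeably. Also note that tokenizer.add_tokens only updates the tokenizer: the model's embedding matrix must be resized as well, and the new row starts from random initialisation. A minimal sketch, assuming the standard Hugging Face loading path works for this model (class names and paths are illustrative):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # illustrative paths; point these at the actual model directory
    tokenizer = AutoTokenizer.from_pretrained("model_save", trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained("model_save", trust_remote_code=True)

    num_added = tokenizer.add_tokens(["大排档"])
    if num_added > 0:
        # grow the embedding matrix so the new token has its own (randomly initialised) row
        model.resize_token_embeddings(len(tokenizer))

As for the warning, the message itself says it can usually be ignored: it only reports that raw dataset columns not accepted by TextToTextModel.forward were dropped before the forward pass.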

Some NCCL operations have failed or timed out.

[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=_ALLGATHER_BASE, NumelIn=7168, NumelOut=14336, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e2ae5781d87 in /home/dongbingcheng/anaconda3/envs/llmfinetuning/lib/python3.9/site-packages/torch/lib/libc10.so)

I'm training on two GPUs, and the error seems to appear right after the first epoch finishes.
I'm using the provided train.py.
Could it be that at evaluation time an earlier process hasn't finished yet?
Should an accelerator.wait_for_everyone() be added?
Thanks for any help!
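
The _ALLGATHER_BASE timeout after 30 minutes is consistent with one rank entering a collective (evaluation or checkpoint saving) while the other rank is still busy. The suggestion above is the usual remedy: add a barrier so every rank reaches the same point first. A sketch of where such a barrier typically goes, assuming an accelerate-based loop like train.py's:

    from accelerate import Accelerator

    accelerator = Accelerator()

    # ... per-rank training work for the epoch happens here ...

    # barrier: block until every rank arrives, so evaluation / checkpointing
    # does not start a collective while another rank is still training
    accelerator.wait_for_everyone()

    if accelerator.is_main_process:
        print("all ranks synchronised; safe to evaluate and save the checkpoint")

    accelerator.wait_for_everyone()  # re-sync before the next epoch begins

If evaluation or saving legitimately takes longer than the 30-minute NCCL timeout, the timeout itself can also be raised via accelerate's InitProcessGroupKwargs when the Accelerator is constructed.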

Why are the predicted triples wrong after fine-tuning?

Hello. I fine-tuned the triple-extraction task with the provided script; after 5 epochs the loss converged to 0.087100, but the triples predicted for the test samples look rather unreasonable.

ret = bot.chat('请抽取出给定句子中的所有三元组。给定句子:傅淑云,女,汉族,1915年出生,上海人')
# output: [(,,),(,,1915),(,,)][EOS]

bot.chat(['你好', '请抽取出给定句子中的所有三元组。给定句子:江苏省赣榆海洋经济开发区位于赣榆区青口镇临海而建,2003年1月28日,经江苏省人民政府《关于同意设立赣榆海洋经济开发区的批复》(苏政复〔2003〕14号)文件批准为全省首家省级海洋经济开发区,','如何看待最近南方天气突然变冷?'])
# ['[(,,)][EOS]',
# '[(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,,),(,',
# '[(,,)][EOS]']

Is this caused by incorrect default parameter settings, or by different package versions? The training and test data both use DuIE 1.0, processed into the expected format. The base model is t5-base downloaded from Hugging Face.

Hello, first time using this project. What does the runtime error unsupported operand type(s) for |: 'types.GenericAlias' and 'type' mean?

The official demo runs fine, but running the API, training, and fine-tuning from the git repository all hit this error. The exact location is:

Traceback (most recent call last):
File "/Users/xxxx/xxx/ChatLM-mini-Chinese/sft_train.py", line 16, in
from utils.functions import get_T5_config
File "/Users/xxxx/xxx/ChatLM-mini-Chinese/utils/functions.py", line 27, in
def _get_doc_mini_hash(doc: list[str] | str, num_perm: int) -> MinHash:
TypeError: unsupported operand type(s) for |: 'types.GenericAlias' and 'type'
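
This error comes from the annotation list[str] | str in utils/functions.py: the X | Y union syntax in annotations is evaluated at definition time and only works on Python 3.10 or newer, so the error indicates an older interpreter (typically Python 3.9). Besides upgrading Python, a one-line sketch of a backwards-compatible fix (a hypothetical edit, relying on PEP 563 deferred annotations):

    # Hypothetical edit at the very top of utils/functions.py (before the other imports):
    # with deferred evaluation, `list[str] | str` is no longer executed at definition time
    from __future__ import annotations

Alternatively, the annotation can be rewritten as typing.Union[List[str], str], which also works on Python 3.9.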

Error partway through SFT fine-tuning

Operating system: Windows 10

The first time save_steps (5000 steps) is reached and the model checkpoint has been saved, a permission error occurs:

Saving model checkpoint to D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000
Configuration saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\config.json
Configuration saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\generation_config.json
Model weights saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\model.safetensors
tokenizer config file saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\tokenizer_config.json
Special tokens file saved in D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\tmp-checkpoint-5000\special_tokens_map.json
Traceback (most recent call last):
  File "D:\sj\project\python\ChatLM-mini-Chinese\sft_train.py", line 127, in <module>
    sft_train(config)
  File "D:\sj\project\python\ChatLM-mini-Chinese\sft_train.py", line 114, in sft_train
    trainer.train(
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\transformers\trainer.py", line 1539, in train
    return inner_training_loop(
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\accelerate\utils\memory.py", line 136, in decorator
    return function(batch_size, *args, **kwargs)
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\transformers\trainer.py", line 1929, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\transformers\trainer.py", line 2300, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "D:\sj\project\python\ChatLM-mini-Chinese\.venv\lib\site-packages\transformers\trainer.py", line 2418, in _save_checkpoint
    fd = os.open(output_dir, os.O_RDONLY)
PermissionError: [Errno 13] Permission denied: 'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000'

I have tried twice with the same result, and the second time I ran it with administrator privileges.

Problem processing the train_3.5M_CN data

The BelleGroup train_3.5M_CN data is currently in this format:
{
"id":"66623028",
"conversations":[
{
"from":"human",
"value":"归纳介绍三种不同类型的电影题材"
},
{
"from":"assistant",
"value":"1. 动作片:动作片通常强调动作和战斗场面,故事情节紧凑,充满了紧张和兴奋的氛围。主角一般是具有超凡实力或特殊技能的英雄人物,往往会与反派角色进行一系列激烈的格斗,追逐和爆破等酷炫场面。例如《速度与激情》和《复仇者联盟》等。\n2. 爱情片:爱情片是以爱情为主题的电影,着重呈现角色之间的感情纠葛,展现了爱情的美好和悲壮。情节与角色之间的发展令人深思,增加了观众的感性体验。这种电影的故事情节主要以人类与人类之间的情感变化为主题。例如《泰坦尼克号》和《让子弹飞》等。\n3. 科幻片:科幻片是以未来科技和未知世界为主题的电影,通常包括时间旅行、外星人、虚拟现实、未来社会等各种奇幻的元素。这种电影描绘了一种比现实更加宏伟、更加神秘和惊奇的世界。例如《星际穿越》和《发条橙》等。"
}
]
}

This differs from the train_2M_CN format, and the current data-processing code cannot handle train_3.5M_CN. Since this data is in multi-turn conversation form, should it simply be discarded, or can the code be modified to make use of it?
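
If the data is to be reused rather than discarded, one option (a sketch, not the project's official preprocessing; it assumes one JSON record per line in a hypothetical train_3.5M_CN.json file) is to split each multi-turn conversation into single-turn prompt/response pairs, which is the format the existing pipeline already handles:

    import json

    def conversations_to_pairs(record: dict) -> list:
        """Split one train_3.5M_CN record into single-turn prompt/response pairs."""
        turns = record["conversations"]
        pairs = []
        # walk the turns two at a time: human question, assistant answer
        for human, assistant in zip(turns[0::2], turns[1::2]):
            if human["from"] == "human" and assistant["from"] == "assistant":
                pairs.append({"prompt": human["value"], "response": assistant["value"]})
        return pairs

    with open("train_3.5M_CN.json", encoding="utf-8") as fin:  # hypothetical file name
        for line in fin:
            for pair in conversations_to_pairs(json.loads(line)):
                print(json.dumps(pair, ensure_ascii=False))

This pairing simply discards dialogue history; earlier turns could instead be concatenated into the prompt if multi-turn context should be preserved.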

Why do I get this when starting the API?

Number of processes started: 1
INFO: Will watch for changes in these directories: ['C:\ChatLM-mini-Chinese']
INFO: Uvicorn running on http://127.0.0.1:8812 (Press CTRL+C to quit)
INFO: Started reloader process [5640] using StatReload
INFO: Started server process [4988]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 127.0.0.1:52111 - "GET / HTTP/1.1" 404 Not Found

sft_train

Does the Hugging Face-based sft_train.py implement the corresponding freezing of the embeddings and the encoder?
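
Whether sft_train.py freezes anything needs to be checked in its source, but for reference, freezing the shared embeddings and the encoder of a T5-style model before passing it to the Trainer typically looks like the following (a sketch with illustrative paths and loading class):

    from transformers import AutoModelForSeq2SeqLM

    model = AutoModelForSeq2SeqLM.from_pretrained("model_save", trust_remote_code=True)

    for param in model.get_input_embeddings().parameters():
        param.requires_grad = False   # freeze the shared embedding table
    for param in model.get_encoder().parameters():
        param.requires_grad = False   # freeze the entire encoder stack

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable:,}")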

Errors when running SFT on the provided model

A few issues:
1. In "3.2.3 Download the pre-trained model and model configuration files", the downloaded model folder is named ChatLM-mini-Chinese, but the command given is mv ChatLM-Chinese-0.2B model_save, so the folder names do not match.
2. After placing the model folder under model_save and running python sft_train.py, it complains that the model_save/pretrain directory does not exist.
3. After renaming the folder to pretrain and running python sft_train.py again, it fails with Error while deserializing header: HeaderTooLarge.

Some posts online suggest changing the safetensors suffix to ckpt; I tried that and got OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory.

How can I get SFT running?

How do I run it?

I followed the instructions and completed all of the installation in section 3.2, but I'm not sure what to do next to train, nor how to launch the chat demo that is shown.

How do I load the model after SFT?

After SFT training, a new sft folder appeared inside model_save, containing checkpoint-10000 and some other files. How do I load and use the SFT-ed model?
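
A checkpoint directory written by the Trainer can normally be loaded by pointing from_pretrained at that folder. A sketch (paths and classes are illustrative; the project may wrap its T5 model differently):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    ckpt_dir = "model_save/sft/checkpoint-10000"   # illustrative checkpoint path
    tokenizer = AutoTokenizer.from_pretrained(ckpt_dir, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt_dir, trust_remote_code=True)

    inputs = tokenizer("你好", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))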

How can this project be debugged with FastChat?

Loading it directly fails, and I'm not sure how to handle it.

(sdw) PS F:\VM> python -m fastchat.serve.model_worker --model-path ChatLM-mini-Chinese
2024-02-29 16:50:15 | INFO | model_worker | args: Namespace(host='localhost', port=21002, worker_address='http://localhost:21002', controller_address='http://localhost:21001', model_path='ChatLM-mini-Chinese', revision='main', device='cuda', gpus=None, num_gpus=1, max_gpu_memory=None, dtype=None, load_8bit=False, cpu_offloading=False, gptq_ckpt=None, gptq_wbits=16, gptq_groupsize=-1, gptq_act_order=False, awq_ckpt=None, awq_wbits=16, awq_groupsize=-1, enable_exllama=False, exllama_max_seq_len=4096, exllama_gpu_split=None, exllama_cache_8bit=False, enable_xft=False, xft_max_seq_len=4096, xft_dtype=None, model_names=None, conv_template=None, embed_in_truncate=False, limit_worker_concurrency=5, stream_interval=2, no_register=False, seed=None, debug=False, ssl=False)
2024-02-29 16:50:15 | INFO | model_worker | Loading the model ['ChatLM-mini-Chinese'] on worker 43282e2a ...
2024-02-29 16:50:15 | ERROR | stderr | Traceback (most recent call last):
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\runpy.py", line 196, in _run_module_as_main
2024-02-29 16:50:15 | ERROR | stderr | return _run_code(code, main_globals, None,
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\runpy.py", line 86, in _run_code
2024-02-29 16:50:15 | ERROR | stderr | exec(code, run_globals)
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\site-packages\fastchat\serve\model_worker.py", line 414, in <module>
2024-02-29 16:50:15 | ERROR | stderr | args, worker = create_model_worker()
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\site-packages\fastchat\serve\model_worker.py", line 385, in create_model_worker
2024-02-29 16:50:15 | ERROR | stderr | worker = ModelWorker(
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\site-packages\fastchat\serve\model_worker.py", line 77, in __init__
2024-02-29 16:50:15 | ERROR | stderr | self.model, self.tokenizer = load_model(
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\site-packages\fastchat\model\model_adapter.py", line 353, in load_model
2024-02-29 16:50:15 | ERROR | stderr | model, tokenizer = adapter.load_model(model_path, kwargs)
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\site-packages\fastchat\model\model_adapter.py", line 99, in load_model
2024-02-29 16:50:15 | ERROR | stderr | model = AutoModelForCausalLM.from_pretrained(
2024-02-29 16:50:15 | ERROR | stderr | File "C:\ProgramData\anaconda3\envs\sdw\lib\site-packages\transformers\models\auto\auto_factory.py", line 564, in from_pretrained
2024-02-29 16:50:15 | ERROR | stderr | raise ValueError(
2024-02-29 16:50:15 | ERROR | stderr | ValueError: Unrecognized configuration class <class 'transformers.models.t5.configuration_t5.T5Config'> for this kind of AutoModel: AutoModelForCausalLM.
2024-02-29 16:50:15 | ERROR | stderr | Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, LlamaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MvpConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

Error during SFT fine-tuning (out of GPU memory)

(screenshots from 2024-03-20: error trace and training settings)
Doing SFT fine-tuning on the SageMaker platform, training runs out of GPU memory; the screenshots show my training settings, and even switching to a machine with 24 GB of GPU memory is not enough.
There are 135,045 training samples, and each prompt plus response totals about 1,500 tokens.

A question about repetitive generated replies

A question: I pre-trained with the datasets mentioned in the README and, as in the README, stopped training as soon as the loss dropped to about 3.x. No SFT or RLHF yet. Testing directly with cli_demo (stream=False), I get the reply below. Is this because pre-training was too short, or because SFT hasn't been done?

User: 你好
你好
ChatBot:
你好,你好,你好,你的情况,考虑为肾虚,可以服用六味地黄丸或六味地黄丸进行治疗,同时服用六味地黄丸进行治疗,不吃辛辣刺激性的食物很重要。
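
Repetitive, generic replies like this are expected from a base model that has only been pre-trained and not SFT-ed, but repetition can also be damped at decode time. A sketch of common generate() settings that discourage loops (illustrative paths and values; the project's inference code may already expose some of these):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # illustrative paths; point these at the pre-trained weights
    tokenizer = AutoTokenizer.from_pretrained("model_save", trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained("model_save", trust_remote_code=True)

    outputs = model.generate(
        **tokenizer("你好", return_tensors="pt"),
        max_new_tokens=128,
        do_sample=True,
        top_p=0.9,
        temperature=0.8,
        repetition_penalty=1.2,   # penalise tokens that have already been generated
        no_repeat_ngram_size=4,   # forbid repeating any 4-gram verbatim
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))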

Running pre_train fails with TypeError: Accelerator.__init__() got an unexpected keyword argument 'use_seedable_sampler'

E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\venv\Scripts\python.exe E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\pre_train.py
Traceback (most recent call last):
File "E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\pre_train.py", line 136, in
pre_train(config)
File "E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\pre_train.py", line 109, in pre_train
trainer = Seq2SeqTrainer(
^^^^^^^^^^^^^^^
File "E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\venv\Lib\site-packages\transformers\trainer_seq2seq.py", line 56, in init
super().init(
File "E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\venv\Lib\site-packages\transformers\trainer.py", line 367, in init
self.create_accelerator_and_postprocess()
File "E:\ChatLM-mini-Chinese\ChatLM-mini-Chinese\venv\Lib\site-packages\transformers\trainer.py", line 4127, in create_accelerator_and_postprocess
self.accelerator = Accelerator(
^^^^^^^^^^^^
TypeError: Accelerator.__init__() got an unexpected keyword argument 'use_seedable_sampler'

Process finished with exit code 1
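
This TypeError usually means the installed accelerate predates the use_seedable_sampler argument that the installed transformers is passing, i.e. the two library versions are out of step. A quick sketch for confirming both versions before upgrading accelerate (for example with pip install -U accelerate) or pinning transformers to an older release:

    # Print the two versions side by side to confirm the mismatch
    import accelerate
    import transformers

    print("transformers:", transformers.__version__)
    print("accelerate:", accelerate.__version__)
    # if transformers passes use_seedable_sampler but accelerate is too old to accept it,
    # upgrading accelerate usually resolves this TypeError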

Pre-training dataset

Hello author, why does the training data for the pre-training stage also use the prompt/response format?
My understanding is that pre-training should be unsupervised language modelling; what difference does it make to feed raw natural-language text directly rather than prompt/response-formatted data?
What was your reason for organising the pre-training data this way?

Can it be run on a server?

Traceback (most recent call last):
File "/home/aidata/work/service/ChatLM-mini-Chinese-main/cli_demo.py", line 13, in
chat_bot = ChatBot(infer_config=infer_config)
File "/home/aidata/work/service/ChatLM-mini-Chinese-main/model/infer.py", line 46, in init
model = load_and_quantize_model(
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/accelerate/utils/bnb.py", line 193, in load_and_quantize_model
return dispatch_model(model, device_map=device_map, offload_dir=offload_folder)
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/accelerate/big_modeling.py", line 436, in dispatch_model
model.to(device)
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2460, in to
return super().to(*args, **kwargs)
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/home/aisdb1/envs/chatmini/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
This error appears when testing the model on the server.

With multiple GPUs, is the same dataset loaded multiple times?

Thank you very much for contributing such complete code!
A question: in a single-machine multi-GPU setup, I removed shuffling from the dataset and printed the prompt returned by get_item, and different processes print the same prompt. Does that mean that with several GPUs the same sample is trained multiple times? batch_size=12, num_work=4.
Launch command: accelerate launch --multi_gpu --num_processes 2 pre_train.py

Must the pre-training dataset be in the {"prompt": ..., "response": ...} format?

{
"prompt": "对于花园街,你有什么了解或看法吗?",
"response": "花园街(是香港油尖旺区的一条富有特色的街道,位于九龙旺角东部,北至界限街,南至登打士街,与通菜街及洗衣街等街道平行。现时这条街道是香港著名的购物区之一。位于亚皆老街以南的一段花园街,也就是"波鞋街"整条街约150米长,有50多间售卖运动鞋和运动用品的店舖。旺角道至太子道西一段则为排档区,售卖成衣、蔬菜和水果等。花园街一共分成三段。明清时代,花园街是芒角村栽种花卉的地方。此外,根据历史专家郑宝鸿的考证:花园街曾是1910年代东方殷琴拿烟厂的花园。纵火案。自2005年起,花园街一带最少发生5宗纵火案,当中4宗涉及排档起火。2010年。2010年12月6日,花园街222号一个卖鞋的排档于凌晨5时许首先起火,浓烟涌往旁边住宅大厦,消防接报4"
}
Why is the pre-training dataset in the format above? Shouldn't it just be concatenated plain text?
