Comments (11)
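This is most likely a problem with your own environment. Check whether the user you are logged in as has write permission on drive D and on your file path (right-click -> Properties -> Security -> select your logged-in user name -> check for write permission; check both the folder and drive D). I cannot reproduce your problem. Write a small Python script that writes a file, and use it to check whether your path D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000 is writable (try the different separators: \\, \ and /).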
from chatlm-mini-chinese.
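I also recommend first setting a small save_step, a small epoch count or a small max_step, and using only a few dozen samples, so that the whole pipeline runs end to end before you start the real training. Otherwise you may finish training only to find that the model cannot be saved, and all that work is wasted.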
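A minimal sketch of what such a dry run could look like. The key names follow Hugging Face `TrainingArguments` fields (`max_steps`, `save_steps`, `logging_steps`, `save_total_limit`); the values and the helper `checkpoints_expected` are illustrative assumptions, not taken from this project's config.

```python
# Hypothetical overrides for a quick end-to-end dry run; values are illustrative only.
smoke_test_overrides = {
    "max_steps": 20,        # stop after a handful of optimizer steps
    "save_steps": 10,       # force at least one checkpoint save during the run
    "logging_steps": 5,     # log often enough to see progress
    "save_total_limit": 2,  # keep the output folder small
}


def checkpoints_expected(max_steps: int, save_steps: int) -> list[str]:
    """Folder names a step-based save schedule should produce (assumes saving every save_steps)."""
    return [f"checkpoint-{step}" for step in range(save_steps, max_steps + 1, save_steps)]


expected = checkpoints_expected(
    smoke_test_overrides["max_steps"], smoke_test_overrides["save_steps"]
)
print(expected)
```

If the folders listed in `expected` appear under output_dir after a run of a few seconds, the save path works and the full training can start.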
from chatlm-mini-chinese.
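I have checked and confirmed that my logged-in user has write permission on drive D and on the file path.
I also wrote the following Python script to test it: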
import os


def check_write_permission(path):
    try:
        # Try to create a temporary file under the given path
        with open(os.path.join(path, 'test_file.tmp'), 'w') as f:
            f.write('Testing write permission.')
        # The file was created, so the write-permission check passes
        print(f"Write-permission check passed: {path}")
        return True
    except Exception as e:
        # An exception while creating the file means the check failed
        print(f"Write-permission check failed: {e}")
        return False


# Paths to check for write permission
path_to_check_list = [
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\checkpoint-5000',
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\\checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft/checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft//checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000',
]

for path in path_to_check_list:
    check_write_permission(path)
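Note that when I paste D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000 directly into the File Explorer address bar, it cannot be opened, whereas D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\checkpoint-5000 works fine.
But this path is generated automatically by the program, isn't it?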
from chatlm-mini-chinese.
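In the permissions screenshot you uploaded, there are quite a few more users under "Group or user names". Select the other users and check them too; the logged-in user is usually not listed first, so go through all of them.
If you can write files, could it be that you cannot create folders? Try using code to create a folder under the /sft directory of your path.
The checkpoint-5000 folder is generated automatically by the trainer when it saves a checkpoint: it appends checkpoint-5000 to the output_dir you defined.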
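A small sketch of how the mixed-separator path can arise, assuming the checkpoint folder is appended with a Windows-style `os.path.join`. It uses the standard-library `ntpath` module so the Windows joining rules can be run on any OS. It also suggests that the doubled backslash in the reported path may simply be how Python's `repr` displays a single backslash.

```python
import ntpath  # Windows path rules, importable on any operating system

# output_dir as built from forward slashes, taken from the path in this thread
output_dir = "D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft"

# Joining under Windows rules inserts a single backslash before the new component
checkpoint_dir = ntpath.join(output_dir, "checkpoint-5000")

print(checkpoint_dir)        # the actual string: one backslash before checkpoint-5000
print(repr(checkpoint_dir))  # the quoted form: that backslash is displayed escaped
```

Under this assumption the real path contains one backslash, which would explain why typing two backslashes into File Explorer fails while the program-generated path is valid.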
from chatlm-mini-chinese.
My output_dir is: output_dir: str = PROJECT_ROOT + '/model_save/sft', unchanged.
I checked every user, and all of them have write permission.
I also added a folder-creation function to the test code:
import os


def check_write_permission(path):
    try:
        # Try to create a temporary file under the given path
        with open(os.path.join(path, 'test_file.tmp'), 'w') as f:
            f.write('Testing write permission.')
        # The file was created, so the write-permission check passes
        print(f"Write-permission check passed: {path}")
        return True
    except Exception as e:
        # An exception while creating the file means the check failed
        print(f"Write-permission check failed: {e}")
        return False


def test_create_directory(path):
    try:
        # Try to create a temporary folder
        os.makedirs(path, exist_ok=True)
        # The folder was created, so the folder-creation test passes
        print(f"Folder-creation test passed: {path}")
        return True
    except Exception as e:
        # An exception while creating the folder means the test failed
        print(f"Folder-creation test failed: {e}")
        return False


# Paths to check for write permission
path_to_check_list = [
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\checkpoint-5000',
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\\checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft/checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft//checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000',
]

# Paths to use for the folder-creation test
path_to_create_list = [
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\checkpoint-test0',
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\\checkpoint-test1',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft/checkpoint-test2',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft//checkpoint-test3',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-test4',
]

for path in path_to_check_list:
    check_write_permission(path)

for path in path_to_create_list:
    test_create_directory(path)
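I ran the test in the same virtual environment as before, and everything still passed.
If the user lacked permission, I would expect the Python code above to fail as well (I did not even run cmd as administrator).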
from chatlm-mini-chinese.
That's odd; it works fine on my side.
I looked at the source of transformers\trainer.py. Here is an excerpt, annotated at the part where your error occurs:
....
# output_dir `......../checkpoint-5000` already exists and the folder is non-empty
if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
    logger.warning(
        f"Checkpoint destination directory {output_dir} already exists and is non-empty."
        "Saving will proceed but saved results may be invalid."
    )
    staging_output_dir = output_dir
else:
    # Your run took this branch
    staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
.....
# Then go through the rewriting process, only renaming and rotating from main process(es)
if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
    if staging_output_dir != output_dir:  # the two differ and staging_output_dir exists, so the block below runs
        if os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)

            # Ensure rename completed in cases where os.rename is not atomic
            fd = os.open(output_dir, os.O_RDONLY)  # line 2418
            # This is where your run raises the error: after tmp-{checkpoint_folder} is renamed to output_dir,
            # it cannot be opened again because permission is denied
            # As for why permission is denied, I don't know 😂
            os.fsync(fd)
            os.close(fd)

    # Maybe delete some older checkpoints.
    if self.args.should_save:
        self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
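One possible explanation, not confirmed in this thread, is that the failure is unrelated to the account's permissions. The assumption here is that on Windows `os.open()` cannot open a directory at all and reports the refusal as a permission error, while on POSIX systems the same call succeeds. The sketch below reproduces the rename-then-fsync sequence in a temporary folder; the helper `fsync_directory` is hypothetical and is not part of transformers.

```python
import os
import tempfile


def fsync_directory(path: str) -> bool:
    """Try to open and fsync a directory; return False if the OS refuses either step."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        # Assumption: this is the branch taken on Windows (PermissionError)
        return False
    try:
        os.fsync(fd)
    except OSError:
        return False
    finally:
        os.close(fd)
    return True


with tempfile.TemporaryDirectory() as run_dir:
    staging = os.path.join(run_dir, "tmp-checkpoint-5000")
    final = os.path.join(run_dir, "checkpoint-5000")
    os.makedirs(staging)
    os.rename(staging, final)          # same rename step as in the excerpt
    renamed = os.path.isdir(final)     # the rename itself succeeded
    synced = fsync_directory(final)    # may be False on Windows

print(f"renamed: {renamed}, directory fsync performed: {synced}")
```

If this prints `directory fsync performed: False` on the affected machine while the rename succeeds, the write-permission tests above would pass and the checkpoint save would still fail at the `os.open` line.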
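All the paths in your test code use the r'' form. With the r prefix, escape sequences inside the path are disabled, so \\ counts as two backslashes rather than a single escaped backslash. The paths in transformers\trainer.py are ordinary escaped strings without r'' to stop escaping, and I am not sure whether adding r makes a difference.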
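The difference between the two string forms can be checked directly. This sketch uses a shortened version of the path from this thread and the standard-library `ntpath` module, which applies Windows path rules on any OS.

```python
import ntpath  # Windows path rules, importable on any operating system

raw = r"model_save\sft\\checkpoint-5000"       # raw string: \\ stays as two backslashes
escaped = "model_save\\sft\\checkpoint-5000"   # normal string: each \\ is one backslash

print(raw.count("\\"), escaped.count("\\"))    # number of backslash characters in each

# Normalising under Windows rules collapses the doubled separator
print(ntpath.normpath(raw) == escaped)
```

The raw string really does contain an extra backslash, but path normalisation collapses repeated separators, which is consistent with the raw-string test paths passing anyway.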
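My final suggestion: delete sft and its subfolders, restart the computer, then set a small save_step, a small epoch count or a small max_step with a few dozen samples, something that finishes in tens of seconds, and see whether it works. If it does not, change output_dir to another path, such as ./model_sft or PROJECT_ROOT + '/model_sft'.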
from chatlm-mini-chinese.
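The test script also works fine with r'' removed.
I have tried restarting too, and the two fine-tuning runs were on different days anyway.
I'll keep investigating. Thanks a lot 😂 Maybe I should report this to transformers?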
from chatlm-mini-chinese.
Also, in sft_train.py, the line loss_log.to_csv(f"./logs/sft_train_log_{time.strftime('%Y%m%d-%H%M')}.csv") raises OSError: Cannot save file into a non-existent directory: 'logs' if the logs directory does not exist. It only works after creating the directory by hand. (Perhaps a trivial bug.)
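One way to avoid the manual step is to create the parent directory right before writing. This is a sketch under assumptions: `ensure_parent_dir` is a hypothetical helper, not the fix that was committed to the project, and a plain `open()` stands in for `loss_log.to_csv` so the example has no third-party dependencies.

```python
import os
import tempfile
import time


def ensure_parent_dir(file_path: str) -> str:
    """Create the directory that will contain file_path, if any, and return file_path unchanged."""
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)  # no error if the directory already exists
    return file_path


# Demonstrate in a throwaway working directory that has no logs folder yet
with tempfile.TemporaryDirectory() as workdir:
    log_path = os.path.join(
        workdir, "logs", f"sft_train_log_{time.strftime('%Y%m%d-%H%M')}.csv"
    )
    with open(ensure_parent_dir(log_path), "w", encoding="utf-8") as f:
        f.write("step,loss\n")
    log_written = os.path.isfile(log_path)

print(f"log written: {log_written}")
```

In sft_train.py the same helper could wrap the path passed to `loss_log.to_csv(...)`.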
from chatlm-mini-chinese.
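The main issue is that I cannot reproduce it 😂. Both Win11 and WSL work fine here; surely it can't be a Win10 problem? Other people doing SFT have not hit this either, see #issuecomment-1897843741.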
from chatlm-mini-chinese.
OK, I'll fix the logs problem shortly.
from chatlm-mini-chinese.
- 61af2fe
No solution found. The logs issue is fixed; closing for now.
from chatlm-mini-chinese.