Comments (11)

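charent commented on July 29, 2024

This is most likely a problem with your own environment. Check whether the user you are logged in as has write permission on drive D and on your file path (right-click -> Properties -> Security -> select your login user name -> check for write permission; check both the folder and drive D). I can't reproduce your problem. Write a small Python script that writes a file, to check whether your path D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000 is actually writable (note the difference between \\ and /).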

from chatlm-mini-chinese.

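charent commented on July 29, 2024

I also suggest first setting a small save_step, a small epoch count or a small max_step, and a few dozen samples, so that you run through the whole pipeline end to end before starting the real training. Otherwise you may finish training only to find that nothing can be saved, and all that work is wasted.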
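
A smoke-test setup along these lines could look like the sketch below. The setting names mirror common transformers TrainingArguments fields and are an assumption here; the project's own config may spell them differently, and take_smoke_subset is a hypothetical helper.

```python
# Hypothetical smoke-test settings; the names mirror common
# transformers TrainingArguments fields and may differ in this project's config.
smoke_test_overrides = {
    "max_steps": 20,     # stop after a handful of optimizer steps
    "save_steps": 10,    # force at least one checkpoint save during the run
    "logging_steps": 5,  # log often enough to see progress
}


def take_smoke_subset(samples, n=50):
    """Return the first n samples so the whole run takes seconds, not hours."""
    return list(samples)[:n]


if __name__ == "__main__":
    # Made-up data standing in for the real SFT dataset.
    data = [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(10_000)]
    subset = take_smoke_subset(data)
    print(len(subset), smoke_test_overrides)
```

The point is only that save_steps is smaller than max_steps, so the checkpoint-saving path runs at least once before the job ends.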


aoguai commented on July 29, 2024

I have checked and confirmed that my logged-in user has write permission on drive D and on the file path.
I also wrote the following Python script to test it:

import os


def check_write_permission(path):
    try:
        # Try to create a temporary file under the given path
        with open(os.path.join(path, 'test_file.tmp'), 'w') as f:
            f.write('Testing write permission.')
        # If the file was created, the write-permission check passes
        print(f"Write-permission check passed: {path}")
        return True
    except Exception as e:
        # If creating the file raised, the write-permission check fails
        print(f"Error while checking write permission: {e}")
        return False


# Paths to check for write permission
path_to_check_list = [
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\checkpoint-5000',
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\\checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft/checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft//checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000',
]

for path in path_to_check_list:
    check_write_permission(path)

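All of the checks passed.

It is worth noting that when I paste D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000 directly into the File Explorer address bar, the path cannot be opened, whereas D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\checkpoint-5000 works fine.

But this path is generated automatically by the program, isn't it?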
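
One plausible origin of the mixed-separator path, assuming it was copied from a Python error message: on Windows, os.path.join inserts a backslash between components, and the message prints the path with repr, which doubles that backslash. The sketch below uses ntpath (the Windows flavour of os.path, importable on any platform) and a shortened, hypothetical output_dir.

```python
import ntpath  # Windows path rules, importable on any platform

# Assumption: output_dir is built with forward slashes, as in
# PROJECT_ROOT + '/model_save/sft'; the drive and folders here are shortened.
output_dir = "D:/project/model_save/sft"

# On Windows os.path is ntpath, and join() inserts a backslash separator.
joined = ntpath.join(output_dir, "checkpoint-5000")

print(joined)        # the actual characters in the path
print(repr(joined))  # how the same string appears inside an error message

# normpath() rewrites everything to single backslashes.
print(ntpath.normpath(joined))
```

If that is what happened, the real path contains a single backslash, and pasting the doubled form into File Explorer is not the same path.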


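charent commented on July 29, 2024

In the permissions screenshot you uploaded, there are quite a few other entries under "Group or user names". Select the other users and look at them too; the logged-in user is usually not the first one listed, so check every user.
If files can be written, could it be that folders cannot be created? Try using code to create a folder under the /sft directory of your path and see whether that works.

The checkpoint-5000 folder is created automatically by the trainer when it saves a checkpoint; it appends checkpoint-5000 to the output_dir you defined.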


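aoguai commented on July 29, 2024

My output_dir is as follows and has not been changed: output_dir: str = PROJECT_ROOT + '/model_save/sft'

I checked the permissions of every user, and all of them have write access. I also added a function to the test code that creates a folder: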

import os


def check_write_permission(path):
    try:
        # Try to create a temporary file under the given path
        with open(os.path.join(path, 'test_file.tmp'), 'w') as f:
            f.write('Testing write permission.')
        # If the file was created, the write-permission check passes
        print(f"Write-permission check passed: {path}")
        return True
    except Exception as e:
        # If creating the file raised, the write-permission check fails
        print(f"Error while checking write permission: {e}")
        return False


def test_create_directory(path):
    try:
        # Try to create a temporary folder
        os.makedirs(path, exist_ok=True)
        # If the folder was created, the folder-creation test passes
        print(f"Folder-creation test passed: {path}")
        return True
    except Exception as e:
        # If creating the folder raised, the folder-creation test fails
        print(f"Error during folder-creation test: {e}")
        return False


# Paths to check for write permission
path_to_check_list = [
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\checkpoint-5000',
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\\checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft/checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft//checkpoint-5000',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-5000',
]

# Paths to use for the folder-creation test
path_to_create_list = [
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\checkpoint-test0',
    r'D:\sj\project\python\ChatLM-mini-Chinese\model_save\sft\\checkpoint-test1',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft/checkpoint-test2',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft//checkpoint-test3',
    r'D:/sj/project/python/ChatLM-mini-Chinese/model_save/sft\\checkpoint-test4',
]

for path in path_to_check_list:
    check_write_permission(path)

for path in path_to_create_list:
    test_create_directory(path)

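I ran it in the same virtual environment as before, and everything still passed.

If the user lacked permission, I would expect the Python code above to fail as well (I did not even run cmd as administrator).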


charent commented on July 29, 2024

Very strange; everything works on my side.

I looked at the source of transformers\trainer.py. Here is an excerpt, annotated at the part where your error is raised:

....
# output_dir (`......../checkpoint-5000`) already exists and the folder is non-empty
if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
    logger.warning(
        f"Checkpoint destination directory {output_dir} already exists and is non-empty."
        "Saving will proceed but saved results may be invalid."
    )
    staging_output_dir = output_dir
else:
    # Your run took this branch
    staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
.....

# Then go through the rewriting process, only renaming and rotating from main process(es)
if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
    if staging_output_dir != output_dir:  # These two differ and the staging_output_dir folder exists, so the block below runs
        if os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)

            # Ensure rename completed in cases where os.rename is not atomic
            fd = os.open(output_dir, os.O_RDONLY)  # line 2418
            # This is where your run raised: after tmp-{checkpoint_folder} is renamed to output_dir,
            # it cannot be opened again because of a permission error
            # As for why the permission is missing, I don't know 😂

            os.fsync(fd)
            os.close(fd)

    # Maybe delete some older checkpoints.
    if self.args.should_save:
        self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)

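In your test code, every path is written as r''. With the r prefix, escape sequences inside the path are not processed, so \\ counts as two backslashes rather than one escaped backslash. The paths in transformers\trainer.py are ordinary strings with escaping, not r'' strings, and I am not sure whether adding r makes a difference.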
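
To make the r'' point concrete, here is a small sketch with a shortened, hypothetical path. It only shows how Python reads the two spellings and how ntpath.normpath (the Windows flavour of os.path.normpath) treats them; it does not say anything about what the trainer does.

```python
import ntpath  # Windows path rules, importable on any platform

raw = r"D:\sft\\checkpoint-5000"    # raw: both backslashes are kept
plain = "D:\\sft\\checkpoint-5000"  # escaped: each \\ is one backslash

print(len(raw), len(plain))  # 23 22
print(raw == plain)          # False

# After normalisation, both spellings name the same folder.
print(ntpath.normpath(raw) == ntpath.normpath(plain))  # True
```

So the raw-string test paths do contain an extra backslash, but normalisation collapses it.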

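Finally, my suggestions: delete sft and its subfolders, reboot, then set a small save_step, a small epoch count or a small max_step, and a few dozen samples, the kind of run that finishes in tens of seconds, and see whether it works. If it does not, change output_dir to a different path, for example ./model_sft or PROJECT_ROOT + '/model_sft'.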
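
The earlier test scripts only wrote a file and created folders; neither performed the step that raised in the trainer.py excerpt (rename a staging folder, then open the resulting directory). A minimal sketch that isolates just that sequence is below; rename_then_open and the checkpoint-test folder name are hypothetical. Opening a directory with os.open is a POSIX idiom, and on Windows Python typically refuses it with PermissionError regardless of folder permissions, so running this on the affected machine may help tell the two causes apart. That is an assumption to check, not something confirmed in this thread.

```python
import os
import tempfile


def rename_then_open(run_dir, checkpoint_folder="checkpoint-test"):
    """Mimic the excerpt: rename a staging folder, then open and fsync the result.

    Returns None on success, or the OSError that was raised.
    """
    staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
    output_dir = os.path.join(run_dir, checkpoint_folder)
    os.makedirs(staging_output_dir)
    # Put one file inside so the folder is not empty, like a real checkpoint.
    with open(os.path.join(staging_output_dir, "dummy.bin"), "w") as f:
        f.write("x")

    os.rename(staging_output_dir, output_dir)
    try:
        fd = os.open(output_dir, os.O_RDONLY)  # opens a directory, not a file
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    except OSError as e:
        return e
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:
        error = rename_then_open(run_dir)
        print("ok" if error is None else f"failed: {error!r}")
```

To test the real location, pass the sft folder as run_dir instead of a temporary directory.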


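aoguai commented on July 29, 2024

The test script also works fine with the r'' prefixes removed.

I have tried rebooting as well, and the two fine-tuning runs were a day apart anyway.
I'll keep looking into it. Thanks a lot 😂 Maybe I should report this to transformers?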


aoguai commented on July 29, 2024

I tested again with a small dataset, and everything worked.

Also, in sft_train.py, the line loss_log.to_csv(f"./logs/sft_train_log_{time.strftime('%Y%m%d-%H%M')}.csv") raises OSError: Cannot save file into a non-existent directory: 'logs' if there is no logs directory. It only works after creating the directory by hand. (Perhaps a trivial bug.)
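
One way to avoid the missing-directory error is to create the folder just before writing. The sketch below uses the standard csv module instead of the to_csv call so that it runs on its own; save_loss_log and the step/loss columns are hypothetical, and this is not necessarily how the project fixed it.

```python
import csv
import time
from pathlib import Path


def save_loss_log(rows, log_dir="./logs"):
    """Write rows to a timestamped CSV, creating log_dir first if needed."""
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    out_file = log_path / f"sft_train_log_{time.strftime('%Y%m%d-%H%M')}.csv"
    with open(out_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["step", "loss"])  # hypothetical column names
        writer.writerows(rows)
    return out_file
```

With the original to_csv call, the equivalent change would be a single os.makedirs('./logs', exist_ok=True) before that line.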


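charent commented on July 29, 2024

The main problem is that I can't reproduce it 😂. Everything works for me on both Win11 and WSL; surely it can't be a Win10 problem? As far as I can see, other people doing SFT have not hit this either: #issuecomment-1897843741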


charent commented on July 29, 2024

OK, I'll fix the logs problem shortly.


aoguai commented on July 29, 2024
  • 61af2fe
    No solution found. The logs issue has been fixed, so I am closing this for now.

