Comments (9)

Facico commented on May 11, 2024

@chenzk1993 Your problem is probably the same as this issue. You can try pulling the 13B model directly from Hugging Face, e.g. set --model_path to decapoda-research/llama-13b-hf.
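A quick sanity check that the Hub pull itself works, sketched under the assumption that your transformers version already ships the LLaMA classes:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# If this finishes without warnings about missing or unexpected weights,
# the 13B checkpoint pulled from the Hub is intact.
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-13b-hf")
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-13b-hf", torch_dtype=torch.float16
)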

Facico commented on May 11, 2024

@chenzk1993 Hi, could you share more detailed training information? For example, the training script you used, whether you changed any parameters, and what data you trained on.

chenzk1993 commented on May 11, 2024

Thanks for the reply.
The parameters are below; for training I randomly selected 10,000 entries from the merge.json (3,469,936 lines) that you provided.
parser.add_argument("--data_path", type=str, default="./sample/merge.json")
parser.add_argument("--output_path", type=str, default="lora-Vicuna")
parser.add_argument("--model_path", type=str, default="decapoda-research/llama-13b-hf")
parser.add_argument("--eval_steps", type=int, default=200)
parser.add_argument("--save_steps", type=int, default=200)
parser.add_argument("--test_size", type=int, default=200)

MICRO_BATCH_SIZE = 4 # this could actually be 5 but i like powers of 2
BATCH_SIZE = 128
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
EPOCHS = 2
Training on three GPUs.

Facico commented on May 11, 2024

@chenzk1993 Our data is only about 700k entries; the 3M+ figure refers to the number of lines in the file. Is your 10,000-entry sample in the correct format?

Our finetune.sh provides roughly three scripts. Could you tell us which script you used and with what arguments? (Everything you pasted is a default parameter, so those are certainly fine.)

chenzk1993 commented on May 11, 2024

On my side it is also about 10,000 lines, roughly 3,000 entries. The parameters above are exactly what I set; I didn't pass anything else through finetune.sh. How do I check whether the data format is correct?

Facico commented on May 11, 2024

Actually, if the data loads correctly there shouldn't be much of a problem; any normal JSON layout works (a dict, a list of dicts, or one dict per line). You can use our sample.json as a reference; a quick check is sketched below.
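A quick way to check is just to load the file yourself; a minimal sketch (the path is your own --data_path) covering the three layouts mentioned above:

import json

with open("./sample/merge.json") as f:
    try:
        data = json.load(f)  # covers a single dict or a list of dicts
        n = len(data) if isinstance(data, list) else 1
        print("parsed as json:", n, "entries")
    except json.JSONDecodeError:
        f.seek(0)  # fall back to one dict per line (JSON Lines)
        data = [json.loads(line) for line in f if line.strip()]
        print("parsed as jsonl:", len(data), "entries")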

It looks like reaching 1.33 epochs already took quite a while. Were the earlier steps normal? Was the loss of 100+ perhaps that large from the very start? (When training 7B the loss starts around 1, and 13B is about the same.)
1. If it is large from the start, the model weights may not have loaded properly (weights that fail to load may end up randomly initialized). You can try loading one of our already-trained loras and check whether the loss is then normal; see the sketch after this list.
2. lr=0 usually means the current run loaded optimizer and lr_scheduler state whose max_step is smaller than your current step, so the correct lr cannot be read from the lr_scheduler and it falls back to 0. (You can hit this if you load the optimizer and lr_scheduler from one of our existing checkpoints without setting resume_from_checkpoint.)
3. Dependency or hardware issues, which are harder to diagnose, e.g. a wrong library version when loading in 8-bit corrupting the parameters, or (far less likely) the hardware computing them incorrectly.
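For point 1, a minimal sketch of loading a trained lora on top of the base model (assumes peft, transformers, and bitsandbytes are installed; the lora id below is an example, not necessarily the exact checkpoint meant here):

from transformers import LlamaForCausalLM
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-13b-hf",
    load_in_8bit=True,
    device_map="auto",
)
# Hypothetical lora id -- substitute whichever trained lora you downloaded.
model = PeftModel.from_pretrained(base, "Chinese-Vicuna/Chinese-Vicuna-lora-13b-belle-and-guanaco")
# If the loss with these weights is still ~100, suspect case 1 (base weights
# not loading) rather than your data.

For point 2, transformers' Trainer restores the step count together with the optimizer and scheduler state when you pass resume_from_checkpoint to trainer.train(), which avoids the max_step mismatch described above.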

chenzk1993 commented on May 11, 2024

Thanks for the reply. Since no loss was ever printed during training, I don't know whether the earlier loss was normal; I only found out when a warning late in training showed the learning rate was zero. I'll try the methods you described. Also, how do I print the loss during training?

Facico commented on May 11, 2024

@chenzk1993 You can set logging_steps to 1 in the program; it will then log once per batch (128 samples).
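In transformers terms that is the logging_steps field of TrainingArguments; a minimal sketch using the batch sizes from earlier in the thread:

import transformers

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=4,   # MICRO_BATCH_SIZE
    gradient_accumulation_steps=32,  # BATCH_SIZE // MICRO_BATCH_SIZE
    logging_steps=1,                 # report loss after every optimizer step (128 samples)
    output_dir="lora-Vicuna",
)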

chenzk1993 commented on May 11, 2024

The problem is solved: it was the GPU model. I started out on a P40, and after switching to a 3090 it worked.
