
llm-rlhf-tuning's Introduction

LLM-RLHF-Tuning

This project implements all three stages of RLHF training from scratch, with the implementation details written up in the accompanying docs. Questions and discussion are welcome via WeChat.

Main features:

  • Supports instruction fine-tuning of Alpaca-style models
  • Supports reward model (RM) training
  • Supports PPO training of the RL model (a loading sketch for the single-base-model setups follows this list)
    • Two base models with two LoRA adapters, loading all four models (RM, SFT, Actor, Critic) at once; supports accelerate distributed training (see the PPO implementation notes)
    • One base model with two LoRA adapters, loading all four models (RM, SFT, Actor, Critic) at once; supports accelerate and deepspeed training
    • One base model with one LoRA adapter, where the Actor and Critic share the base model and a single model serves all four roles (RM, SFT, Actor, Critic); supports accelerate and deepspeed training
  • Supports DPO training
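
A minimal sketch of the single-base-model, two-adapter setup using the pinned peft==0.4.0 API. All paths and adapter names are illustrative (not the repo's actual config), and the scalar value head the Critic needs is omitted; this only shows how one set of base weights can be switched between roles:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Base weights; assumed here to be the SFT-merged LLaMA checkpoint.
base = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-sft-merged",  # hypothetical path
    torch_dtype=torch.float16,
)

# Adapter 1: the Actor, initialized from the SFT LoRA weights.
model = PeftModel.from_pretrained(
    base, "path/to/sft-lora", adapter_name="actor", is_trainable=True
)
# Adapter 2: the Critic, initialized from the reward-model LoRA weights.
model.load_adapter("path/to/rm-lora", adapter_name="critic", is_trainable=True)

model.set_adapter("actor")   # forward passes now act as the Actor policy
# ... compute action log-probs for PPO here ...

model.set_adapter("critic")  # the same base weights now act as the Critic
# ... compute values / rewards here (a value head would be attached) ...

# With all adapters disabled, the bare base model can serve as the
# frozen SFT/reference policy for the KL penalty.
with model.disable_adapter():
    pass  # reference log-probs here
```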

Updates

  • [23/8/23] Added LLaMA2 model training; added DPO training; added PPO training on a single base model with a choice of one or two LoRA adapters; added accelerate and deepspeed training support
  • [23/8/13] Added LLaMA model training; added PPO training with two base models and two LoRA adapters; added accelerate distributed training

Features

Feature comparison with open-source RLHF training frameworks (as of August 2023):

| Framework | SFT Train | RM Train | PPO Train | DPO Train |
| --- | --- | --- | --- | --- |
| Ours | ✓ | ✓ | ✓ | ✓ |
| Deepspeed-chat | ✓ | ✓ | ✓ | |
| trl | ✓ | ✓ | ✓ | ✓ |
| MOSS-RLHF | | | ✓ | |
PPO Train

| Framework | Accelerate | Deepspeed | Multi LoRA | Min. loaded parameters (7B models as an example) |
| --- | --- | --- | --- | --- |
| Ours | ✓ | ✓ | ✓ | single model size ~ 7B |
| Deepspeed-chat | | ✓ | | sft + rm + actor + critic ~ 28B |
| trl | ✓ | | | single model size (no separate ref model) ~ 7B |
| MOSS-RLHF | actor model, critic model | sft model, rm model | | sft + rm + actor + critic ~ 28B |

Sharing one base model across the Actor, Critic, RM, and SFT roles via LoRA adapters is what shrinks the footprint: instead of loading four separate 7B models (≈ 28B parameters in total), a single ≈ 7B base plus small adapter weights suffices.

Usage Guide

Environment Setup

accelerate==0.21.0
datasets==2.13.1
scikit-learn==1.3.0
sentencepiece==0.1.99
tqdm==4.65.0
transformers==4.31.0
wandb==0.15.8
peft==0.4.0
torch==2.0.1
trl==0.5.0
deepspeed==0.10.0

Supported Models

  • LLaMA
  • LLaMA2

Supported Training Methods

  • LoRA
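
A minimal LoRA setup sketch with the pinned peft==0.4.0; the rank, alpha, dropout, and target modules shown here are illustrative defaults, not necessarily the repo's configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")  # hypothetical path
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # LLaMA attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```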

Training Details

Instruction Fine-Tuning (SFT)

Reward Model Training
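
For reference, a minimal sketch of the standard pairwise (Bradley–Terry style) reward loss commonly used at this stage; the repo's exact formulation is described in its training docs:

```python
import torch
import torch.nn.functional as F

def reward_loss(chosen_rewards: torch.Tensor,
                rejected_rewards: torch.Tensor) -> torch.Tensor:
    """chosen/rejected_rewards: scalar reward per preference pair, shape (batch,)."""
    # Maximize the margin between preferred and rejected responses:
    # -log(sigmoid(r_chosen - r_rejected)).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```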

PPO Training
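
For reference, a minimal sketch of the clipped PPO policy objective; the full recipe (KL penalty against the SFT reference, value loss, and the ppo-max stability tricks listed in the TODO) is beyond this sketch:

```python
import torch

def ppo_policy_loss(logprobs: torch.Tensor,
                    old_logprobs: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio pi_new / pi_old in log space.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (min) bound, negated because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()
```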

DPO Training
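
For reference, a minimal sketch of the DPO loss (Rafailov et al., 2023), which optimizes the policy directly from preference pairs against a frozen reference model; the beta value is illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward margins of the policy relative to the frozen reference.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```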

TODO

  • Support the LLaMA2 model
  • Support deepspeed training
  • Support DPO training
  • Improve PPO training stability; implement ppo-max
  • Support DDPO training
  • Support RRHF
  • Support RAFT
  • Support rejection sampling (RFT)
  • Support the BLOOM model
  • Support the Baichuan model
  • Support QLoRA training

Join the WeChat group for discussion.

llm-rlhf-tuning's People

Contributors

joyce94


llm-rlhf-tuning's Issues

Sharing RLHF experience

Hello, I saw that you open-sourced your RLHF approach.
Could I ask: in your actual tuning runs, after PPO converges, are the output samples, under human inspection, consistently better than those of the original SFT model? (outside of safety-related domains)
