
Building a Long-Text-Supported Model with DeepSpeed for Gemma


Environment Setup

  • Download the pre-trained checkpoint: gemma.ckpt
  • Prepare a server with sufficient GPU power (my setup is an 8000 server), preferably with a recent CUDA version
  • Install this repository; dependencies will be configured automatically: python setup.py install
  • Prepare a long-text dataset
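As the usage section notes, the dataset should be in jsonl format with input and output keys. A quick sanity check over the file can be sketched as follows (the helper name is mine, not part of this repo):

```python
import json

def validate_jsonl(path, required_keys=("input", "output")):
    """Check that every line is valid JSON and contains the required keys."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            record = json.loads(line)
            missing = [k for k in required_keys if k not in record]
            if missing:
                raise ValueError(f"line {i} is missing keys: {missing}")
    return True
```

Running this once before training catches malformed lines early, instead of partway through an epoch.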

Usage

  • Choose a fine-tuning method: lora / lora+ / galore / full
  • Choose a parallel training method: pp (pipeline parallel) / dp (data parallel)
  • Customize the corresponding script for your local setup; for example, when using lora with pp, edit /scripts/pp_scrtpt/lora_train.sh
  • Start by editing the settings in base_options:
    • Replace data-path with the path to your dataset; the dataset must be in jsonl format with input and output keys
    • Replace output-path with the save location for your model checkpoints
    • Replace ckpt-path with the path to your pre-trained model checkpoint
  • Then edit the settings in options according to your training requirements
#!/bin/bash
base_options="--data-path /workspace/longtext-2k-clean.jsonl \
--tokenizer-path /workspace/tokenizer.model \
--output-path /workspace/gemma/output \
--ckpt-path /workspace/gemma-2b-it.ckpt
"
options="$base_options \
    --experiment-name train_pi_test \
    --show-loss-step 1 \
    --epochs 3 \
    --batch-size-per-gpu 1 \
    --fp16 \
    --gradient-accumulation-steps 2 \
    --warmup 0.02 \
    --device cuda \
    --num-pp-stages 4 \
    --max-len 16384 \
    --max-src-len 16000 \
    --seed 42 \
    --read-nums 100 \
    --ds-config-path /workspace/gemma_long_rope/gemma/ds_config/pipeline.json \
    --variant 2b \
    --train-pi 2 \
    --lr 1e-5 \
    --warmup-min-lr 1e-6 \
    --warmup-max-lr 2e-5 \
    --use-lora-plus \
    --activation-checkpoint \
    --diy-optimizer \
    --atten-type flash_atten \
    "

run_cmd="deepspeed --include localhost:0,1,2,3 --master_port 16666 /workspace/gemma_long_rope/gemma/train.py ${options}"
echo ${run_cmd}
eval ${run_cmd}

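The --train-pi 2 flag in the script above refers to position interpolation: the RoPE positions are compressed by the interpolation factor so that a longer sequence maps into the position range the model was pre-trained on. A minimal sketch of the idea (function name and shapes are my assumptions, not this repo's API):

```python
import torch

def rope_frequencies(seq_len, dim, pi_factor=1.0, base=10000.0):
    """Compute RoPE cos/sin tables; pi_factor > 1 compresses positions
    (position interpolation) so longer sequences reuse trained positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Position interpolation: divide positions by the scaling factor.
    positions = torch.arange(seq_len).float() / pi_factor
    angles = torch.outer(positions, inv_freq)
    return torch.cos(angles), torch.sin(angles)
```

With pi_factor=2, position 2k sees exactly the angles that position k saw during pre-training, which is why a 8k-trained model can be fine-tuned at 16k with --train-pi 2.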

NEWS

  • Added support for activation checkpointing
  • Added support for custom optimizers
  • Added support for the lr scheduler from the sat library
  • Added support for lora+ (applying different learning rates to LoRA's A and B matrices can yield better performance and faster convergence)
  • 2024-03-20 Added support for galore (known issue with gradient dimensions, to be fixed)
  • 2024-03-21 Added support for torch's flash-attention implementation
  • 2024-03-21 Added support for pure dp training
  • 2024-03-23 Added support for lora-fa
  • 2024-04-01 Added support for deepspeed-ulysses, enabling training on even longer sequences!
  • 2024-04-02 Added support for DoRA
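The lora+ item above (different learning rates for the A and B matrices) amounts to splitting LoRA parameters into optimizer groups. A sketch under assumptions of my own (the 16x ratio and the lora_A/lora_B naming follow the LoRA+ paper's convention, not necessarily this repo's defaults):

```python
import torch

def lora_plus_param_groups(model, base_lr=1e-5, ratio=16.0):
    """Split trainable LoRA parameters into two optimizer groups:
    matrix B gets a higher learning rate than matrix A (LoRA+)."""
    groups = {"lora_A": [], "lora_B": []}
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "lora_A" in name:
            groups["lora_A"].append(p)
        elif "lora_B" in name:
            groups["lora_B"].append(p)
    return [
        {"params": groups["lora_A"], "lr": base_lr},
        {"params": groups["lora_B"], "lr": base_lr * ratio},
    ]
```

Passing the result to e.g. torch.optim.AdamW then applies the two rates, which is the kind of behavior the --diy-optimizer flag in the script would enable.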

TODO

  • Support more lora variants, such as plora
  • Support more length-extrapolation methods
  • Support more memory-efficient training methods
  • Explore more advanced techniques
  • Improve the scheduler code
  • Add support for ring-attention

Acknowledgements

Thanks to the open-source code and model weights from the following repositories:

Contact

Feel free to contact me via:
