CodeGen fine-tuning with HuggingFace + DeepSpeed

This is a step-by-step guide for fine-tuning CodeGen on specific programming languages using HuggingFace Transformers and DeepSpeed.

CodeGen is a suite of code language models from Salesforce (https://github.com/salesforce/CodeGen/blob/main/README.md). The models differ in their training corpus and parameter count, and are named following the convention codegen-{model-size}-{data}.

model-size has 4 options: 350M, 2B, 6B, and 16B, denoting the number of parameters in each model.

data has 3 options: nl, multi, mono.

  • nl models are randomly initialized and trained on The Pile, an 825.18 GB English text corpus.
  • multi models are initialized from nl models and then trained on a corpus with code data consisting of multiple programming languages.
  • mono models are initialized from multi models and then trained on a corpus with Python code data.

A detailed description of the models is as follows:

CodeGen models

model name           data    model-size
codegen-350M-nl      nl      350M
codegen-350M-multi   multi   350M
codegen-350M-mono    mono    350M
codegen-2B-nl        nl      2B
codegen-2B-multi     multi   2B
codegen-2B-mono      mono    2B
codegen-6B-nl        nl      6B
codegen-6B-multi     multi   6B
codegen-6B-mono      mono    6B
codegen-16B-nl       nl      16B
codegen-16B-multi    multi   16B
codegen-16B-mono     mono    16B
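
All of these checkpoints are available on the HuggingFace Hub and can be loaded with the standard transformers classes. As a quick sanity check before fine-tuning, the following sketch loads one of the smaller checkpoints and generates a short completion (the checkpoint name is just an example; any entry from the table above works the same way):

from transformers import AutoTokenizer, AutoModelForCausalLM

# example checkpoint; swap in any codegen-{model-size}-{data} name from the table above
checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))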

The following is a detailed set of instructions for replicating the CodeGen fine-tuning on a local server:

The following steps have been tested on an HPC in a Singularity container with Ubuntu 20.04 and 50 GB RAM. However, the setup can also be replicated on a standalone machine running Ubuntu 20.04.

Prepare the training corpus

For CodeGen models, the training data has to be in a loose JSON (JSON Lines) format, with one JSON object per line followed by a newline, as follows:

{"text": "your data chunk 1"}\n{"text": "your data chunk 2"}\n...

I used the following code snippet to prepare the JSON file:

import json

# df_code is a pandas DataFrame with a column named 'text'
with open('code_segments.json', 'a') as f:
    for row in df_code['text'].values:
        dic = {"text": str(row)}
        f.write(json.dumps(dic))
        f.write('\n')

Note that, in this case, the for loop iterates over a pandas DataFrame df_code with a column named text. You may tweak the code snippet according to the type of data you will be reading.
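
Before moving on, it can be worth a quick check that the resulting file parses cleanly as JSON Lines. A minimal sketch, assuming the datasets library is available (it is installed with the transformers requirements below) and the file name code_segments.json from the snippet above:

from datasets import load_dataset

# load the JSON-lines file back and inspect the first record
ds = load_dataset("json", data_files="code_segments.json", split="train")
print(len(ds), ds[0])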

Prepare the environment on your machine

I recommend the following setup for fine-tuning. I created a conda environment inside the Singularity container; however, if you are not using a container, you may create a conda environment directly on your machine:

conda create --name anyname python=3.X

Then, activate the environment:

conda activate anyname

Later, install the following software libraries inside the environment (conda activate name_of_the_conda_env). Please note that it is assumed the prerequisites are installed (pip, sklearn, pandas, numpy, scipy, and other packages for doing basic data science).

  • Clone the transformers repo from GitHub: git clone https://github.com/huggingface/transformers

  • Navigate to the path YOUR_ROOT/transformers/examples/pytorch/language-modeling/

  • Run the following pip commands to install the requirements:

pip install -r requirements.txt
pip install git+https://github.com/huggingface/transformers/
pip install deepspeed
  • Put the JSON file we prepared in the first step into a folder under the path above (../transformers/examples/pytorch/language-modeling/); the name of the folder should be the same as the name of your JSON file without the extension, as shown below.
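
For instance, if the file from the first step is named code_segments_verilog.json (matching the --dataset_name used in the fine-tuning command further below), the layout would look roughly like this; the source path is a placeholder:

cd YOUR_ROOT/transformers/examples/pytorch/language-modeling/
mkdir -p code_segments_verilog
mv /path/to/code_segments_verilog.json code_segments_verilog/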

  • At this point you are ready to run fine-tuning, provided everything installed cleanly. It is possible that you will run into some package conflicts and other issues, which you will have to resolve along the way; you can also let me know, as I may have already encountered them.

  • The following command runs the fine-tuning script run_clm.py using DeepSpeed (https://huggingface.co/docs/transformers/main_classes/deepspeed). In this case, DeepSpeed requests two GPUs on one node. You can play around with the run_clm.py options and the DeepSpeed configuration (ds_config.json) and change the save_steps, model name, number of training epochs, input token length, and other parameters. The following configuration of run_clm.py has been tested to work on the HPC with Ubuntu 20.04.

deepspeed --num_gpus 2 --num_nodes 1 run_clm.py --model_name_or_path=Salesforce/codegen-6B-multi --save_steps=100 --per_device_train_batch_size=1 --learning_rate 2e-5 --num_train_epochs 1 --output_dir=CodeGen/codegen-6B-verilog-3-epochs --report_to 'wandb' --dataset_name code_segments_verilog --tokenizer_name Salesforce/codegen-16B-multi --block_size 1024 --gradient_accumulation_steps 32 --do_train --do_eval --fp16 --overwrite_output_dir --deepspeed ds_config.json

To run the fine-tuning as a job on the HPC, I created a Slurm script (run-codegen-finetune.SBATCH) which runs the above command with the conda environment inside the Singularity container; a rough sketch of such a script follows.
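
The exact script depends on your cluster, so treat the following only as a sketch: the resource requests, time limit, container image path, conda install path, and environment name are placeholders you will need to adapt.

#!/bin/bash
#SBATCH --job-name=codegen-finetune
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --mem=50GB
#SBATCH --time=48:00:00
#SBATCH --output=codegen-finetune-%j.out

# run the fine-tuning command inside the Singularity container with the conda env activated
singularity exec --nv /path/to/your_container.sif bash -c "
  source /path/to/conda/etc/profile.d/conda.sh
  conda activate anyname
  cd YOUR_ROOT/transformers/examples/pytorch/language-modeling
  deepspeed --num_gpus 2 --num_nodes 1 run_clm.py --model_name_or_path=Salesforce/codegen-6B-multi \
    --dataset_name code_segments_verilog --tokenizer_name Salesforce/codegen-16B-multi --block_size 1024 \
    --per_device_train_batch_size=1 --gradient_accumulation_steps 32 --learning_rate 2e-5 \
    --num_train_epochs 1 --save_steps=100 --do_train --do_eval --fp16 --report_to 'wandb' \
    --overwrite_output_dir --output_dir=CodeGen/codegen-6B-verilog-3-epochs --deepspeed ds_config.json
"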

  • The DeepSpeed configuration is included in the ds_config.json file; an example of what such a file can look like is shown below.
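
If you need a starting point for that file, the HuggingFace DeepSpeed integration docs describe ZeRO configurations in which most values are set to "auto" so that they stay consistent with the run_clm.py arguments. A minimal ZeRO stage 2 sketch along those lines (the actual ds_config.json in this repo may differ):

{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
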
  • One more step: if you look at the arguments of the run_clm.py command above, you will notice the term "wandb". It is similar to TensorBoard. wandb is a web portal (https://wandb.ai/) that is integrated with transformers and helps visualize system usage, logs, and other details while the fine-tuning progresses.
  • Make sure that you install wandb (pip install wandb) and register on their portal.
  • Next, log in to wandb in the terminal within your Singularity container (or in the terminal on your machine) before executing the deepspeed command above (or running the slurm script), as follows:

wandb login

Note that the wandb session may time out, so you can also open a new terminal, log in to wandb, and leave that terminal open while you execute the fine-tuning in another window.

  • Upon executing the wandb login command, a prompt on the terminal will ask you to paste the API key (available from your profile page on the wandb portal, https://wandb.ai).

It is possible to remove wandb from the fine-tuning altogether by dropping the --report_to option and continuing as before. If you would like to use TensorBoard in place of wandb, simply replace 'wandb' with 'tensorboard' and configure the TensorBoard log path (https://www.tensorflow.org/tensorboard/get_started).
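
For example, the relevant change to the deepspeed command above would be along these lines (--logging_dir is where the TensorBoard event files get written; the path is just a placeholder):

--report_to 'tensorboard' --logging_dir ./tensorboard_logs

You can then browse the runs locally with tensorboard --logdir ./tensorboard_logs.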

  • Run the above command (or start the batch job) and check your job log file for any errors.
  • If you are running the fine-tuning on the HPC, I would suggest first requesting only one GPU on one node with less memory, which will be allocated more quickly, and resolving any errors that pop up along the way.
  • If everything is installed and compatible, the fine-tuning should execute, and you will be able to track the progress on the wandb portal and from the log file on your machine.

Installing packages using requirements.txt

You can also install the requirements in a Python virtual environment as follows, and take care of any conflicting libraries along the way:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
