Code Monkey home page Code Monkey logo

starcoder's Introduction

๐Ÿ’ซ StarCoder

What is this about?

๐Ÿ’ซ StarCoder is a language model (LM) trained on source code and natural language text. Its training data incorporates more that 80 different programming languages as well as text extracted from github issues and commits and from notebooks. This repository showcases how we get an overview of this LM's capabilities.

Disclaimer

Before you can use the model go to hf.co/bigcode/starcoder and accept the agreement.

Table of Contents

  1. Quickstart
  2. Fine-tuning

Quickstart

StarCoder was trained on github code, thus it can be used to perform code generation. More precisely, the model can complete the implementation of a function or infer the following characters in a line of code. This can be done with the help of the ๐Ÿค—'s transformers library.

Installation

First, we have to install all the libraries listed in requirements.txt

pip install -r requirements.txt

Code generation

The code generation pipeline is as follows

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

or

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_ckpt = "bigcode/starcoder"

model = AutoModelForCausalLM.from_pretrained(model_ckpt)
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
print( pipe("def hello():") )

Text-generation-inference

docker run --gpus '"device:0"' -p 8080:80 -v $PWD/data:/data -e HUGGING_FACE_HUB_TOKEN=<YOUR BIGCODE ENABLED TOKEN> -e HF_HUB_ENABLE_HF_TRANSFER=0 -d  ghcr.io/huggingface/text-generation-inference:sha-880a76e --model-id bigcode/starcoder --max-total-tokens 8192

For more details, see here.

Fine-tuning

Here, we showcase how we can fine-tune this LM on a specific downstream task.

Step by step installation with conda

Create a new conda environment and activate it

conda create -n env
conda activate env

Install the pytorch version compatible with your version of cuda here, for example the following command works with cuda 11.6

conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia

Install transformers and peft

conda install -c huggingface transformers 
pip install git+https://github.com/huggingface/peft.git

Note that you can install the latest stable version of transformers by using

pip install git+https://github.com/huggingface/transformers

Install datasets, accelerate and huggingface_hub

conda install -c huggingface -c conda-forge datasets
conda install -c conda-forge accelerate
conda install -c conda-forge huggingface_hub

Finally, install bitsandbytes and wandb

pip install bitsandbytes
pip install wandb

To get the full list of arguments with descriptions you can run the following command on any script:

python scripts/some_script.py --help

Before you run any of the scripts make sure you are logged in and can push to the hub:

huggingface-cli login

Make sure you are logged in wandb:

wandb login

Now that everything is done, you can clone the repository and get into the corresponding directory.

Datasets

๐Ÿ’ซ StarCoder can be fine-tuned to achieve multiple downstream tasks. Our interest here is to fine-tune StarCoder in order to make it follow instructions. Instruction fine-tuning has gained a lot of attention recently as it proposes a simple framework that teaches language models to align their outputs with human needs. That procedure requires the availability of quality instruction datasets, which contain multiple instruction - answer pairs. Unfortunately such datasets are not ubiquitous but thanks to Hugging Face ๐Ÿค—'s datasets library we can have access to some good proxies. To fine-tune cheaply and efficiently, we use Hugging Face ๐Ÿค—'s PEFT as well as Tim Dettmers' bitsandbytes.

Stack Exchange SE

Stack Exchange is a well-known network of Q&A websites on topics in diverse fields. It is a place where a user can ask a question and obtain answers from other users. Those answers are scored and ranked based on their quality. Stack exchange instruction is a dataset that was obtained by scrapping the site in order to build a collection of Q&A pairs. A language model can then be fine-tuned on that dataset to make it elicit strong and diverse question-answering skills.

To execute the fine-tuning script run the following command:

python finetune/finetune.py \
  --model_path="bigcode/starcoder"\
  --dataset_name="ArmelR/stack-exchange-instruction"\
  --subset="data/finetune"\
  --split="train"\
  --size_valid_set 10000\
  --streaming\
  --seq_length 2048\
  --max_steps 1000\
  --batch_size 1\
  --input_column_name="question"\
  --output_column_name="response"\ 
  --gradient_accumulation_steps 16\
  --learning_rate 1e-4\
  --lr_scheduler_type="cosine"\
  --num_warmup_steps 100\
  --weight_decay 0.05\
  --output_dir="./checkpoints" \

The size of the SE dataset is better manageable when using streaming. We also have to precise the split of the dataset that is used. For more details, check the dataset's page on ๐Ÿค—. Similarly we can modify the command to account for the availability of GPUs

python -m torch.distributed.launch \
  --nproc_per_node number_of_gpus finetune/finetune.py \
  --model_path="bigcode/starcoder"\
  --dataset_name="ArmelR/stack-exchange-instruction"\
  --subset="data/finetune"\
  --split="train"\
  --size_valid_set 10000\
  --streaming \
  --seq_length 2048\
  --max_steps 1000\
  --batch_size 1\
  --input_column_name="question"\
  --output_column_name="response"\ 
  --gradient_accumulation_steps 16\
  --learning_rate 1e-4\
  --lr_scheduler_type="cosine"\
  --num_warmup_steps 100\
  --weight_decay 0.05\
  --output_dir="./checkpoints" \

Merging PEFT adapter layers

If you train a model with PEFT, you'll need to merge the adapter layers with the base model if you want to run inference / evaluation. To do so, run:

python finetune/merge_peft_adapters.py --model_name_or_path model_to_merge --peft_model_path model_checkpoint

# Push merged model to the Hub
python finetune/merge_peft_adapters.py --model_name_or_path model_to_merge --peft_model_path model_checkpoint --push_to_hub

For example

python finetune/merge_peft_adapters.py --model_name_or_path bigcode/starcoder --peft_model_path checkpoints/checkpoint-1000 --push_to_hub

starcoder's People

Contributors

armelrandy avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.