
code-eval

What

This is a repo I use to run HumanEval on code models; adjust as needed. Some scripts were adapted from the WizardCoder repo (process_eval.py). The evaluation code is duplicated across several files, mostly to handle edge cases around model tokenization and loading (this will be cleaned up).

Results

The table is sorted by pass@1 score.

| model | size | pass@1 | pass@10 | screenshot |
| --- | --- | --- | --- | --- |
| sahil2801/replit-code-instruct-glaive | 3B | 63.5% | 67% | instruct-glaive |
| WizardCoder-15B-V1.0 | 15B | 57% | 68.9% | wizardcoder |
| bigcode/starcoder | 15B | 34.6% | 48.7% | starcoder |
| openchat/opencoderplus | 15B | 27.3% | 43.9% | opencoder |
| teknium/Replit-v1-CodeInstruct-3B | 3B | 25.8% | 42.6% | replit-codeinstruct-v1 |
| teknium/Replit-v2-CodeInstruct-3B | 3B | 21.5% | 31% | replit-codeinstruct-v2 |
| replit-code-v1-3b | 3B | 17.1% | 29.8% | replit-code-v1 |
| mpt-7b | 7B | 15.9% | 23.7% | mpt-7b |
| xgen-7b-8k-base | 7B | 14.9% | 22.5% | xgen-7b-8k-base |
| openllama-7b-v2 | 7B | 14% | 23.1% | openllama-7b-v2 |
| llama-2-7b | 7B | 13.1% | 21.9% | llama-2-7b |
| llama-7b | 7B | 12.1% | 18.9% | llama-7b |
| mpt-30b | 30B | pending | pending | pending |

FAQ

Why do some scores differ from the officially published numbers?

Because the exact prompts and post-processing the official evaluations used on this benchmark are often not published. The goal here is to reproduce those numbers as closely as possible; in many cases it is possible to get very close to the published figures.

All of the scores here were produced independently of any published numbers and are reproducible by cloning the repo and following the setup below.

Why do some models have a filter_code post-generation step?

Base models can in many cases repeat outputs, which breaks the benchmark scores. Instruct models don't have this problem, so you won't see this step for them; they tend to emit an end-of-sequence token. A minimal sketch of such a filter is shown below.
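This is a hedged approximation of what a filter_code step might look like; the stop markers are illustrative and not necessarily the exact ones this repo uses:

```python
# Illustrative sketch of a post-generation filter for base models:
# truncate the completion at the first "stop" boundary so repeated
# functions/classes don't pollute the evaluated sample.
STOP_SEQUENCES = ["\ndef ", "\nclass ", "\nif __name__", "\nprint("]

def filter_code(completion: str) -> str:
    """Keep only the first function body by cutting at known stop markers."""
    min_index = len(completion)
    for stop in STOP_SEQUENCES:
        index = completion.find(stop)
        if index != -1:
            min_index = min(min_index, index)
    return completion[:min_index]
```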

Setup

Create a Python environment

python -m venv env && source env/bin/activate

Install dependencies

pip install -r requirements.txt

Run the eval script

# replace script file name for various models:
# eval_wizard.py
# eval_opencode.py
# eval_mpt.py
# eval_starcoder.py
# eval_replit.py
# eval_replit_glaive.py
# eval_replit_instruct.py

python eval_wizard.py
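
Each eval script follows the same broad pattern. The sketch below is a hedged approximation (the model name, sampling settings, and output path are illustrative, not the repo's exact values), using the human-eval package's read_problems/write_jsonl helpers:

```python
# Rough shape of an eval_*.py script: generate completions for every
# HumanEval problem and write them to a jsonl file for scoring.
from human_eval.data import read_problems, write_jsonl
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "WizardLM/WizardCoder-15B-V1.0"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

problems = read_problems()  # the 164 HumanEval tasks
samples = []
for task_id, problem in problems.items():
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to(model.device)
    # pass@10 requires several samples per task; one is shown for brevity
    out = model.generate(**inputs, max_new_tokens=512,
                         do_sample=True, temperature=0.2)
    # decode only the newly generated tokens, not the prompt
    completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("results/wizard/eval.jsonl", samples)
```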

Process the jsonl file to extract code samples from model completions.

Note: only wizard & opencoder require this, since they return markdown output containing code.

# replace args for various models:
# --path results/wizard --out_path results/wizard/eval.jsonl
# --path results/opencode --out_path results/opencode/eval.jsonl

python process_eval.py --path results/wizard --out_path results/wizard/processed.jsonl --add_prompt
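
Conceptually, the extraction step pulls the code out of a markdown-formatted completion. A hedged sketch of that idea (the repo's actual process_eval.py logic may differ):

```python
# Extract the first fenced code block from a markdown completion.
import re

# Build the fence token programmatically to avoid writing literal
# backtick fences inside this example.
FENCE = chr(96) * 3  # three backticks
PATTERN = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)

def extract_code(completion: str) -> str:
    """Return the first fenced code block, or the raw completion if none."""
    match = PATTERN.search(completion)
    return match.group(1) if match else completion
```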

Then get the results

# replace args for various models:
# results/wizard/processed.jsonl
# results/starcoder/eval.jsonl
# results/mpt/eval.jsonl
# results/opencode/processed.jsonl
# results/replit_instruct/eval.jsonl
# results/replit_glaive/eval.jsonl
# results/replit/eval.jsonl

evaluate_functional_correctness results/wizard/processed.jsonl
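
For reference, evaluate_functional_correctness reports pass@k using the unbiased estimator from the HumanEval paper: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). A minimal re-implementation of the per-problem estimate:

```python
# Unbiased pass@k estimate for a single problem (the CLI computes this
# internally; this snippet is only for reference).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples, c = passing samples, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```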


Issues

No GPU Found

@abacaj My environment is an NVIDIA TX2. When I use the codecarbon package to get GPU information, it cannot find the GPU:

[codecarbon INFO @ 21:03:55] [setup] RAM Tracking...
[codecarbon INFO @ 21:03:55] [setup] GPU Tracking...
[codecarbon INFO @ 21:03:55] No GPU found.
[codecarbon INFO @ 21:03:55] [setup] CPU Tracking...
[codecarbon WARNING @ 21:03:55] No CPU tracking mode found. Falling back on CPU constant mode.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[codecarbon WARNING @ 21:03:55] We saw that you have a ARMv8 Processor rev 1 (v8l) but we don't know it. Please contact us.
[codecarbon INFO @ 21:03:55] CPU Model on constant consumption mode: ARMv8 Processor rev 1 (v8l)
[codecarbon INFO @ 21:03:55] >>> Tracker's metadata:
[codecarbon INFO @ 21:03:55]   Platform system: Linux-5.10.104-tegra-aarch64-with-glibc2.17
[codecarbon INFO @ 21:03:55]   Python version: 3.8.13
[codecarbon INFO @ 21:03:55]   CodeCarbon version: 2.3.4
[codecarbon INFO @ 21:03:55]   Available RAM : 6.329 GB
[codecarbon INFO @ 21:03:55]   CPU count: 6
[codecarbon INFO @ 21:03:55]   CPU model: ARMv8 Processor rev 1 (v8l)
[codecarbon INFO @ 21:03:55]   GPU count: None
[codecarbon INFO @ 21:03:55]   GPU model: None

But torch.cuda.is_available() returns true, so I want to know whether codecarbon can support the TX2. Looking forward to your reply.
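
For context on why the two libraries can disagree here: PyTorch talks to CUDA directly, while codecarbon discovers GPUs through NVML (via pynvml), which Jetson-class boards such as the TX2 generally do not expose. A quick hedged check of both paths:

```python
# Compare the two detection paths that disagree in the log above.
import torch

print("torch sees CUDA:", torch.cuda.is_available())

try:
    import pynvml
    pynvml.nvmlInit()
    print("NVML sees", pynvml.nvmlDeviceGetCount(), "GPU(s)")
except Exception as exc:  # typically raises on Jetson-class devices
    print("NVML unavailable:", exc)
```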

Performance of llama-2

Why am I getting low scores on llama-2-13b (pass@1: 3.05%, pass@10: 19.51%)? Are you applying any particular prompts in this setup, or could the scores be related to batch decoding? My setup requires generating the samples sequentially; I can't perform batch decoding.
