ise-uiuc / magicoder
[ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct
Home Page: https://proceedings.mlr.press/v235/wei24h.html
License: MIT License
Awesome models! Great job guys! :)
I am wondering if you plan to fine-tune on top of deepseek-coder-33b-instruct as well? I'm curious how high the evaluations would go with that model :)
I tried to use Magicoder with ollama on a MacBook Air M1 (16 GB). It works for other models, but when I run this one I get an error:
...
ggml_metal_init: GPU name: Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MiB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 1083.07 MiB
llama_new_context_with_model: max tensor size = 102.54 MiB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 3648.58 MiB, ( 3649.20 / 10922.67)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 8192.00 MiB, offs = 0
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 0.03 MiB, offs = 8589918208, (11841.23 / 10922.67)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 1080.02 MiB, (12921.25 / 10922.67)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_graph_compute: command buffer 0 failed with status 5
GGML_ASSERT: /tmp/ollama-20231213-4188-jpu97j/llm/llama.cpp/gguf/ggml-metal.m:1623: false
2023/12/23 16:46:59 llama.go:451: signal: abort trap
2023/12/23 16:46:59 llama.go:459: error starting llama runner: llama runner process has terminated
2023/12/23 16:46:59 llama.go:525: llama runner stopped successfully
After some googling: is this similar to ggerganov/llama.cpp#2048?
I am not sure whether it can be tuned to work on this Mac; if not, it would be better to note the limitation (or hardware requirement) in the README.
There is an input format mismatch between the evaluation and training processes. Is the intent to rephrase/emphasize the problem before the model generates its output?
When doing the HumanEval(+) evaluation, the compiled inputs look like this, e.g.:
@@ Instruction
Write a solution to the following problem:
```python
def fib(n: int):
    """Return n-th Fibonacci number.
    >>> fib(10)
    55
    >>> fib(1)
    1
    >>> fib(8)
    21
    """
@@ Response
def fib(n: int):
    """Return n-th Fibonacci number.
    >>> fib(10)
    55
    >>> fib(1)
    1
    >>> fib(8)
    21
    """
```
But in the data-processing and training code, the instruction data is compiled as:
You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.
@@ Instruction
Write a solution to the following coding problem:
{problem}
@@ Response
{response}
There is no such **_rephrasing/emphasizing_** in the training data of Magicoder.
From the eval results, this mismatch does not seem to have obvious negative effects, but was it deliberate?
Sorry if this was mentioned before, but is there a stock prompt template in ooba's text-gen that works with this?
Hi, thx for the brilliant work!
I am curious about the decision to use Adafactor as the optimizer for Magicoder. Have other options been explored or tried in this context? 🤔
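In case it helps anyone experimenting with alternatives: a minimal sketch of swapping the optimizer through the Hugging Face Trainer (this assumes the training script uses `TrainingArguments`; the values below are illustrative, not the paper's settings).

```python
from transformers import TrainingArguments

# "adafactor" and "adamw_torch" are both built-in choices for the `optim` field,
# so comparing optimizers only requires changing this string.
args = TrainingArguments(
    output_dir="out",
    optim="adafactor",              # or "adamw_torch" for a comparison run
    learning_rate=5e-5,             # illustrative value
    per_device_train_batch_size=2,  # illustrative value
    num_train_epochs=2,
)
```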
We used the 110K data to fine-tune CodeLlama and tested on HumanEval; the pass@1 is 60.6%. I wonder whether some settings differ between our experiments.
Thanks for the awesome work!!! My question is: why not deepseek-coder 1b? :((
This would enable amazing apps on edge devices, like decision makers in small robots. It would be a great model when combined with chain-of-code and programmatic reasoning.
https://arxiv.org/abs/2312.03052
https://sites.google.com/view/chain-of-code
After reading through the page, I didn't see any mention of a max token limit for Magicoder-S-DS-6.7B & Magicoder-DS-6.7B. Is it safe to assume it's the same as DeepSeek Coder, i.e. 16k?
Fill-in-the-middle can allow the model to make better completions. It would be nice if this model supported it.
Hi,
Thanks for open-sourcing this. However, when I fine-tuned on my own dataset (whether full-parameter or LoRA), catastrophic forgetting kept coming up (a drop in performance on HumanEval). I don't know how to solve it; do you have any tips?
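One common mitigation (not something the authors prescribe) is to mix a slice of the original instruction data back into your own fine-tuning set so the model keeps seeing the old distribution. A rough sketch with `datasets`, where the path to your own data and the 20% replay ratio are just placeholders:

```python
from datasets import load_dataset, concatenate_datasets

# The original OSS-Instruct data released by the authors.
oss = load_dataset("ise-uiuc/Magicoder-OSS-Instruct-75K", split="train")

# Your own instruction data; the path and column layout are assumptions, and the
# columns must match those of `oss` for concatenate_datasets to work.
mine = load_dataset("json", data_files="my_dataset.jsonl", split="train")

# Keep ~20% of the original data as a "replay" buffer to reduce forgetting.
replay = oss.shuffle(seed=42).select(range(int(0.2 * len(oss))))
mixed = concatenate_datasets([mine, replay]).shuffle(seed=42)
```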
With the recent release of Mixtral 8x7B, there's a lot of hope and excitement around open-source MoE models.
It would be very interesting to see how a narrowly focused MoE model performs.
Hello! I noticed that there are new experiments on the APPS benchmark in Appendix C.1 of the updated paper, where MagicoderS-DS outperforms all other models. Could you provide the evaluation code for reproduction? Thanks a lot!
I want to ask whether I can replace the dilated attention with the attention used in the base model and then fine-tune. The idea is to reduce the complexity of attention and increase the context window. Does DeepSeek use Llama 2 as its base model with the same architecture? If so, can I load the checkpoints of layers such as the norm layers and feed-forward layers, or do I need to re-factor the LLM from scratch?
Or is there any method to adapt or share the weights?
After experimenting with text-generation-webui by oobabooga, I found the following:
- Magicoder models are all instruct-only models (no chat/chat-instruct)
- you need to create a new custom template under the Parameters/Instruction template tab
- you also need to change these values under the Parameters/Generation tab (max_new_tokens=1024, top_p=0.9, top_k=50, repetition_penalty=1, repetition_penalty_range=1024)
- copy the content of the text file below into the custom template under Instruction template
instruction-template-magicode.txt
I took the generation parameters from deepseek-coder/demo/app.py
The instruction template is edited from Airoboros-v1.2 after comparing it with Magicoder's prompt template.
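For reference, a rough mapping of those generation settings to plain `transformers` `generate()` keyword arguments (a sketch only; text-generation-webui may apply additional processing internally, and `repetition_penalty_range` has no direct equivalent here):

```python
# Web-UI setting -> generate() argument
generation_kwargs = dict(
    max_new_tokens=1024,     # max_new_tokens
    do_sample=True,
    top_p=0.9,               # top_p
    top_k=50,                # top_k
    repetition_penalty=1.0,  # repetition_penalty (1.0 = disabled)
)
# usage: model.generate(**inputs, **generation_kwargs)
```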
Thanks to your amazing tutorial, we reproduced the training process and experiments in the paper. The models we fine-tuned ourselves achieved performance close to yours. For HumanEval(+), we got 57.32% / 52.44% pass@1 for Magicoder and 70.12% / 67.07% for MagicoderS.
Moreover, we conducted ablation studies to clarify the contribution of OSS-Instruct versus Evol-Instruct in the training of MagicoderS. We trained a model on only the evol-codealpaca-v1 dataset under the same training setting mentioned in the tutorial and compared it with MagicoderS, which is trained on both oss-instruct-75k and evol-codealpaca-v1. We noticed that oss-instruct-75k was generated using gpt-3.5-turbo-1106, whereas evol-codealpaca-v1 was generated by gpt-4, so the MagicoderS comparison may be unfair. I think there should be additional evidence showing the contribution of OSS-Instruct when it is combined with other data-generation methods.
First of all, thank you for your amazing work!
I'm attempting to replicate the training process, and I have a question regarding the train.py file. In your paper, you mentioned using two A100-80G GPUs, but I couldn't find any mention of multiprocessing or distributed training in your code. I'm curious if you used deepspeed for training? If not, could you provide guidance on modifying the code to make it compatible with a multi-GPU setup?
Thanks once again!
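Not the authors' exact setup, but one way to get a multi-GPU run with the Hugging Face Trainer is to launch the unmodified script through a distributed launcher (e.g. `torchrun --nproc_per_node=2 train.py ...` or `accelerate launch`), since the Trainer picks up data parallelism automatically; a DeepSpeed config can optionally be passed via `TrainingArguments`. A minimal sketch with illustrative values:

```python
from transformers import TrainingArguments

# Minimal DeepSpeed ZeRO stage-1 config passed inline (a dict behaves the same
# as pointing `deepspeed` at a ds_config.json file).
ds_config = {
    "zero_optimization": {"stage": 1},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,  # illustrative, not the paper's setting
    gradient_accumulation_steps=8,  # illustrative
    bf16=True,
    deepspeed=ds_config,            # accepts a dict or a path to a JSON file
)
```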
Thanks for releasing your research codes to everyone.
I found it a bit difficult to figure out what the variables here are for.
Can you please explain them?
And in general, could you please add a more comprehensive readme about the data collection/generation parts of the code base?
Thanks!
I've run a prompt for Python code completion using:
prompt_template = f"""Write a solution to the following problem:
```python
{code}
```"""
but the LLM output contained nothing new for the code part of the prompt; it just generated some other information.
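In case the prompt format is part of the issue: the Hugging Face model card wraps the instruction in a system line plus the @@ Instruction / @@ Response template rather than passing the code directly. A minimal sketch along those lines (the instruction string is only an example):

```python
import torch
from transformers import pipeline

MAGICODER_PROMPT = """You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{instruction}

@@ Response
"""

instruction = "Write a Python function that returns the n-th Fibonacci number."

generator = pipeline(
    task="text-generation",
    model="ise-uiuc/Magicoder-S-DS-6.7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
result = generator(
    MAGICODER_PROMPT.format(instruction=instruction),
    max_new_tokens=512,
    num_return_sequences=1,
)
print(result[0]["generated_text"])
```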
Because the README-DEV.md script, launched directly with accelerate, reported out-of-memory during training, I switched to launching with DeepSpeed stage 1; all other parameters were left at their defaults. Since I used 8 GPUs, the number of iteration steps was reduced to 1/4.
After running the experiment I found:
I want to ask: is it possible that a different machine, plus adding DeepSpeed, could make the results this much worse?
I am very excited to read the cool work Magicoder. I strongly believe that OSS-Instruct will push the boundaries of instruction tuning for code LLMs.
I want to ask a question about Magicoder. It seems that you do not test the correctness of the solutions generated from seed code snippets. I am curious why it is not necessary to go through a code-validity checking process. Below are some assumptions I made about this:
What’s your opinion on this problem? I am looking forward to your reply and thanks for your help!
Hello
I am trying to fine-tune CodeLlama-Python-hf on 4 GPUs with 22 GB of memory. Using the training process described in the Magicoder README gives a CUDA out-of-memory error.
How can I quantise the model or optimise memory usage so it loads on my machine?
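Not from this repo, but a common way to fit a ~7B model on 22 GB cards is 4-bit loading with bitsandbytes, typically combined with LoRA/peft and gradient checkpointing for fine-tuning. A minimal loading sketch, where the model id is assumed to be the CodeLlama Python base:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "codellama/CodeLlama-7b-Python-hf"  # assumption: the base model being fine-tuned

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```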
I've discovered a significant overlap between the Magicoder-Evol-Instruct-110K and HumanEval. Specifically, I found that:
This is a conservative estimate, and there may be more similar items that I missed. Approximately 3000 items in the training set resemble those in the test set.
Is the overlap between Magicoder-Evol-Instruct-110K and HumanEval a test set leakage issue, or is it acceptable since the overlapping problems are paraphrased rather than identical?
HumanEval ID | Line no. in Magicoder-Evol-Instruct-110K |
---|---|
HumanEval/0 | 60301 |
HumanEval/1 | 12556 |
HumanEval/3 | 8985 |
HumanEval/5 | 1934 |
HumanEval/6 | 75548 |
HumanEval/8 | 50011 |
HumanEval/9 | 99748 |
HumanEval/11 | 16654 |
HumanEval/12 | 21800 |
HumanEval/13 | 2630 |
HumanEval/15 | 106409 |
HumanEval/17 | 1993 |
HumanEval/18 | 54 |
HumanEval/20 | 76613 |
HumanEval/21 | 44962 |
HumanEval/24 | 60016 |
HumanEval/25 | 52864 |
HumanEval/26 | 38476 |
HumanEval/27 | 59049 |
HumanEval/29 | 77775 |
HumanEval/31 | 33 |
HumanEval/33 | 13447 |
HumanEval/34 | 47957 |
HumanEval/35 | 108174 |
HumanEval/40 | 34303 |
HumanEval/47 | 2879 |
HumanEval/48 | 103 |
HumanEval/51 | 17387 |
HumanEval/52 | 19399 |
HumanEval/55 | 5484 |
HumanEval/57 | 778 |
HumanEval/58 | 4614 |
HumanEval/59 | 10281 |
HumanEval/60 | 45722 |
HumanEval/61 | 12849 |
HumanEval/63 | 665 |
HumanEval/64 | 9014 |
HumanEval/66 | 16677 |
HumanEval/67 | 66099 |
HumanEval/71 | 87392 |
HumanEval/72 | 17223 |
HumanEval/75 | 107027 |
HumanEval/86 | 95562 |
HumanEval/87 | 47935 |
HumanEval/90 | 4369 |
HumanEval/96 | 6696 |
HumanEval/97 | 39228 |
HumanEval/98 | 5013 |
HumanEval/100 | 39904 |
HumanEval/102 | 75537 |
HumanEval/105 | 12560 |
HumanEval/106 | 835 |
HumanEval/110 | 14161 |
HumanEval/116 | 21351 |
HumanEval/118 | 7984 |
HumanEval/121 | 60649 |
HumanEval/137 | 29729 |
HumanEval/149 | 20001 |
HumanEval/152 | 32034 |
HumanEval/154 | 34346 |
HumanEval/155 | 41634 |
HumanEval/156 | 11264 |
HumanEval/158 | 19315 |
HumanEval/160 | 105483 |
HumanEval/162 | 68840 |
HumanEval/80 | 29996 |
HumanEval/82 | 72856 |
HumanEval/4 | 15300 |
HumanEval/22 | 99988 |
HumanEval/37 | 91008 |
HumanEval/46 | 56824 |
HumanEval/70 | 68607 |
HumanEval/74 | 57346 |
HumanEval/78 | 105847 |
HumanEval/89 | 78000 |
HumanEval/99 | 76033 |
HumanEval/104 | 51604 |
HumanEval/109 | 5608 |
HumanEval/111 | 44999 |
HumanEval/119 | 105874 |
HumanEval/148 | 11486 |
HumanEval/159 | 80388 |
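For context, a rough sketch of how this kind of overlap can be screened for (not necessarily how the table above was produced), using a simple string-similarity ratio between HumanEval prompts and training instructions; the column names are assumptions about the public datasets:

```python
from difflib import SequenceMatcher
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")
train = load_dataset("ise-uiuc/Magicoder-Evol-Instruct-110K", split="train")

def similar(a: str, b: str, threshold: float = 0.6) -> bool:
    # quick_ratio() is an upper bound on ratio(), so it works as a cheap pre-filter.
    m = SequenceMatcher(None, a, b)
    return m.quick_ratio() >= threshold and m.ratio() >= threshold

# Warning: a full 164 x ~110k pairwise scan is slow; this loop is illustrative only.
for task in humaneval:
    for i, row in enumerate(train):
        if similar(task["prompt"], row["instruction"]):
            print(task["task_id"], i)
            break
```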
Hello,
Thanks for providing such an amazing repo for LLMs to generate code.
I am impressed that Magicoder got a great result on HumanEval; however, I can't find the evaluation code for this.
It would be great if the evaluation code were made available.
I got the same problem when using the quick start script to run an inference task.
It is the same as this issue: #22
What should I do to solve this problem?
Hello,
I have used the Hugging Face playground for this model in the past, but now it throws a runtime error. Should I expect a fix soon?
Thank You!
Hi, thanks for your great work.
I tested the performance of Magicoder, but the result does not align with the paper (68.9 vs. 76.8). I guess it is because I used different inference hyperparameters, e.g. --top_p 1, --temperature 1, and so on. I would be grateful if the authors could provide the specific inference hyperparameters. Thank you. I list the script I used below.
I evaluate Magicoder with the script you provided in the 'experiments' folder using the following command:
python experiments/text2code.py \
--model_key deepseek-ai/deepseek-coder-6.7b-base \
--dataset humaneval \
--save_path output_dir/mc_6_7_ds \
--n_batches 1 \
--n_problems_per_batch 1 \
--n_samples_per_problem 1 \
--model_name_or_path ~/weight/magicoders-s-ds-6.7/ \
--top_p 1 \
--max_new_tokens 4096 \
--temperature 1
Then I use the following command for evalplus:
docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples output_dir/mc_6_7_ds.jsonl
Finally, I got this result:
Base
{'pass@1': 0.6890243902439024}
Base + Extra
{'pass@1': 0.6158536585365854}
Hello!
Thank you for your great work. Recently I followed the code you provided on GitHub and your hyperparameters to train Magicoder. However, the results I reproduced differ from those of the model you provided on Hugging Face. I am sure that everything matches your paper.
To clarify, I used my own evaluation code to evaluate the two models, but since both models share the same evaluation code, I think it doesn't matter.
Best regards,
Shen
How about the quality of the Python code implementations in the solutions?
Can they be used directly to train the model?
Thanks for releasing your research codes to everyone.
I noticed that you used the starcoderdata dataset in your research and generated 80K data samples covering Python, C++, Java, TypeScript, Shell, C#, Rust, PHP, and Swift.
But I did not find any Swift language data in the starcoderdata dataset.
Did you also generate the Swift data from here?
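For reference, individual language subsets of starcoderdata can be loaded by directory name; a minimal sketch (I have not checked whether a `swift` directory actually exists in the dataset, which is the point of the question):

```python
from datasets import load_dataset

# Stream only one language directory of StarCoderData; replace "python" with
# another directory name (e.g. "swift", if it exists) to check availability.
subset = load_dataset(
    "bigcode/starcoderdata",
    data_dir="python",
    split="train",
    streaming=True,
)
print(next(iter(subset)))
```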
Could you please include instructions for reproducing the OSS-Instruct dataset in the README?
I've found the generation script here https://github.com/ise-uiuc/magicoder/blob/main/src/magicoder/generate_data.py
Hi, thx for the work!
I was wondering how you format the OSS75k data for training. Is it in an Alpaca-style format, like:
You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.
@@ Instruction
{instruction} # problem column of the OSS75k dataset
@@ Response
{response} # solution column of the OSS75k dataset
Thx
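In case it's useful to others, a minimal sketch of applying that template with `datasets`, assuming the published Magicoder-OSS-Instruct-75K columns are named `problem` and `solution` as the comments above suggest:

```python
from datasets import load_dataset

PROMPT = """You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{problem}

@@ Response
{solution}"""

ds = load_dataset("ise-uiuc/Magicoder-OSS-Instruct-75K", split="train")
ds = ds.map(lambda row: {"text": PROMPT.format(problem=row["problem"], solution=row["solution"])})
print(ds[0]["text"][:400])
```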
Hello,
I observed that in your paper, the MBPP pass@1 for CodeLlama-Python is listed as 57.6% (see Table 1). But, according to the original CodeLlama paper ([1; Table 2]), it's actually 47.6%. Could you please update this?
Reference:
[1] https://arxiv.org/pdf/2308.12950.pdf
Just started reproducing Magicoder and could not help wondering, would a bigger OSS-Instruct dataset work better and how much better?
PS. There are 12,000,000 files in Python inside bigcode/Starcoderdata, with only 40K/12M being used.
Hi,
Thank you very much for your code. I am reproducing your training process. I wonder what your training loss and validation loss were during training, as I want to align my runs with yours on the Magicoder-OSS-Instruct-75K and ise-uiuc/Magicoder-Evol-Instruct-110K datasets.
Thx