ise-uiuc / magicoder
[ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct
Home Page: https://proceedings.mlr.press/v235/wei24h.html
License: MIT License
Awesome models! Great job guys! :)
I am wondering if you plan to fine-tune on top of deepseek-coder-33b-instruct as well? I'm curious how high the evaluations would go with that model :)
I tried to use Magicoder with ollama on a MacBook Air M1 (16 GB). It works for other models, but when I run this one I get an error:
...
ggml_metal_init: GPU name: Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MiB
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 1083.07 MiB
llama_new_context_with_model: max tensor size = 102.54 MiB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 3648.58 MiB, ( 3649.20 / 10922.67)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 8192.00 MiB, offs = 0
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 0.03 MiB, offs = 8589918208, (11841.23 / 10922.67)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 1080.02 MiB, (12921.25 / 10922.67)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_graph_compute: command buffer 0 failed with status 5
GGML_ASSERT: /tmp/ollama-20231213-4188-jpu97j/llm/llama.cpp/gguf/ggml-metal.m:1623: false
2023/12/23 16:46:59 llama.go:451: signal: abort trap
2023/12/23 16:46:59 llama.go:459: error starting llama runner: llama runner process has terminated
2023/12/23 16:46:59 llama.go:525: llama runner stopped successfully
After some googling: is this similar to ggerganov/llama.cpp#2048?
I am not sure whether it can be tuned to work on this Mac; if not, it would be better to note the limitation (or hardware requirement) in the README.
There is an input format mismatch between the evaluation and training processes. Is the intent to rephrase/emphasize the problem before the model generates its output?
When doing the HumanEval(+) evaluation, the compiled inputs look like this, e.g.:
@@ Instruction
Write a solution to the following problem:
```python
def fib(n: int):
    """Return n-th Fibonacci number.
    >>> fib(10)
    55
    >>> fib(1)
    1
    >>> fib(8)
    21
    """
@@ Response
def fib(n: int):
    """Return n-th Fibonacci number.
    >>> fib(10)
    55
    >>> fib(1)
    1
    >>> fib(8)
    21
    """
```
But in the data-processing and training code, the instruction data is compiled as:
You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.
@@ Instruction
Write a solution to the following coding problem:
{problem}
@@ Response
{response}
There is no such **_rephrasing/emphasizing_** in the training data of Magicoder.
From the eval results, this mismatch does not seem to have obvious negative effects, but was it deliberate?
Sorry if this was mentioned before, but is there a stock prompt template in ooba's text-gen that works with this?
Hi, thx for the brilliant work!
I am curious about the decision to use Adafactor as the optimizer for Magicoder. Have other options been explored or tried in this context? 🤔
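In case it helps anyone experimenting with alternatives: a minimal sketch of swapping the optimizer through the Hugging Face Trainer (this assumes the training script uses `TrainingArguments`; the values below are illustrative, not the paper's settings).

```python
from transformers import TrainingArguments

# "adafactor" and "adamw_torch" are both built-in choices for the `optim` field,
# so comparing optimizers only requires changing this string.
args = TrainingArguments(
    output_dir="out",
    optim="adafactor",              # or "adamw_torch" for a comparison run
    learning_rate=5e-5,             # illustrative value
    per_device_train_batch_size=2,  # illustrative value
    num_train_epochs=2,
)
```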
We used the 110K data to fine-tune CodeLlama and tested on HumanEval; the pass@1 is 60.6%. I wonder whether some settings differ between our experiments.
Thanks for the awesome work!!! My question is: why not deepseek-coder 1b? :((
This would enable amazing apps on edge devices, like decision makers in small robots. It would be a great model when combined with chain-of-code and programmatic reasoning.
https://arxiv.org/abs/2312.03052
https://sites.google.com/view/chain-of-code
After reading through the page, I didn't see any mention of a max token limit for Magicoder-S-DS-6.7B & Magicoder-DS-6.7B. Is it safe to assume it's the same as DeepSeek Coder, i.e. 16k?
Fill-in-the-middle can allow the model to make better completions. It would be nice if this model supported it.
Hi,
Thanks for open-sourcing this. However, when I fine-tuned on my own dataset (whether full-parameter or LoRA), catastrophic forgetting kept coming up (a drop in performance on HumanEval). I don't know how to solve it; do you have any tips?
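One common mitigation (not something the authors prescribe) is to mix a slice of the original instruction data back into your own fine-tuning set so the model keeps seeing the old distribution. A rough sketch with `datasets`, where the path to your own data and the 20% replay ratio are just placeholders:

```python
from datasets import load_dataset, concatenate_datasets

# The original OSS-Instruct data released by the authors.
oss = load_dataset("ise-uiuc/Magicoder-OSS-Instruct-75K", split="train")

# Your own instruction data; the path and column layout are assumptions, and the
# columns must match those of `oss` for concatenate_datasets to work.
mine = load_dataset("json", data_files="my_dataset.jsonl", split="train")

# Keep ~20% of the original data as a "replay" buffer to reduce forgetting.
replay = oss.shuffle(seed=42).select(range(int(0.2 * len(oss))))
mixed = concatenate_datasets([mine, replay]).shuffle(seed=42)
```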
With the recent release of Mixtral 8x7B, there's a lot of hope and excitement around open-source MoE models.
It would be very interesting to see how a narrowly focused MoE model performs.
Hello! I noticed that there are new experiments on the APPS benchmark in Appendix C.1 of the updated paper, where MagicoderS-DS outperforms all other models. Could you provide the evaluation code for reproduction? Thanks a lot!
I want to ask whether I can replace the dilated attention with the attention used in the base model and then fine-tune. The idea is to reduce the complexity of attention and increase the context window. Does DeepSeek use Llama 2 as its base model with the same architecture? If so, can I load the checkpoints of layers such as the norm layers and feed-forward layers, or do I need to re-factor the LLM from scratch?
Or is there any method to adapt or share the weights?
After experimenting with text-generation-webui by oobabooga, I found the following:
- Magicoder models are all instruct-only models (no chat/chat-instruct)
- you need to create a new custom template under the Parameters/Instruction template tab
- you also need to change these values under the Parameters/Generation tab (max_new_tokens=1024, top_p=0.9, top_k=50, repetition_penalty=1, repetition_penalty_range=1024)
- copy the content of the text file below into the custom template under Instruction template
instruction-template-magicode.txt
I took the generation parameters from deepseek-coder/demo/app.py
The instruction template is edited from Airoboros-v1.2 after comparing it with Magicoder's prompt template.
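For reference, a rough mapping of those generation settings to plain `transformers` `generate()` keyword arguments (a sketch only; text-generation-webui may apply additional processing internally, and `repetition_penalty_range` has no direct equivalent here):

```python
# Web-UI setting -> generate() argument
generation_kwargs = dict(
    max_new_tokens=1024,     # max_new_tokens
    do_sample=True,
    top_p=0.9,               # top_p
    top_k=50,                # top_k
    repetition_penalty=1.0,  # repetition_penalty (1.0 = disabled)
)
# usage: model.generate(**inputs, **generation_kwargs)
```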
Thanks to your amazing tutorial, we reproduced the training process and experiments in the paper. The models we fine-tuned ourselves achieved performance close to yours. For HumanEval(+), we got 57.32% / 52.44% pass@1 for Magicoder and 70.12% / 67.07% for MagicoderS.
Moreover, we conducted ablation studies to clarify the contribution of OSS-Instruct versus Evol-Instruct in the training of MagicoderS. We trained a model on only the evol-codealpaca-v1 dataset under the same training setting mentioned in the tutorial and compared it with MagicoderS, which is trained on both oss-instruct-75k and evol-codealpaca-v1. We noticed that oss-instruct-75k was generated using gpt-3.5-turbo-1106, whereas evol-codealpaca-v1 was generated by gpt-4, so the MagicoderS comparison may be unfair. I think there should be additional evidence showing the contribution of OSS-Instruct when it is combined with other data-generation methods.
First of all, thank you for your amazing work!
I'm attempting to replicate the training process, and I have a question regarding the train.py file. In your paper, you mentioned using two A100-80G GPUs, but I couldn't find any mention of multiprocessing or distributed training in your code. I'm curious if you used deepspeed for training? If not, could you provide guidance on modifying the code to make it compatible with a multi-GPU setup?
Thanks once again!
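Not the authors' exact setup, but one way to get a multi-GPU run with the Hugging Face Trainer is to launch the unmodified script through a distributed launcher (e.g. `torchrun --nproc_per_node=2 train.py ...` or `accelerate launch`), since the Trainer picks up data parallelism automatically; a DeepSpeed config can optionally be passed via `TrainingArguments`. A minimal sketch with illustrative values:

```python
from transformers import TrainingArguments

# Minimal DeepSpeed ZeRO stage-1 config passed inline (a dict behaves the same
# as pointing `deepspeed` at a ds_config.json file).
ds_config = {
    "zero_optimization": {"stage": 1},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,  # illustrative, not the paper's setting
    gradient_accumulation_steps=8,  # illustrative
    bf16=True,
    deepspeed=ds_config,            # accepts a dict or a path to a JSON file
)
```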
Thanks for releasing your research codes to everyone.
I found it a bit difficult to figure out what the variables here are for.
Can you please explain them?
And in general, could you please add a more comprehensive readme about the data collection/generation parts of the code base?
Thanks!
I've run a prompt for Python code completion using:
prompt_template = f"""Write a solution to the following problem:
```python
{code}
```"""
but the LLM output contained nothing new for the code part of the prompt; it just generated some other information.
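In case the prompt format is part of the issue: the Hugging Face model card wraps the instruction in a system line plus the @@ Instruction / @@ Response template rather than passing the code directly. A minimal sketch along those lines (the instruction string is only an example):

```python
import torch
from transformers import pipeline

MAGICODER_PROMPT = """You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{instruction}

@@ Response
"""

instruction = "Write a Python function that returns the n-th Fibonacci number."

generator = pipeline(
    task="text-generation",
    model="ise-uiuc/Magicoder-S-DS-6.7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
result = generator(
    MAGICODER_PROMPT.format(instruction=instruction),
    max_new_tokens=512,
    num_return_sequences=1,
)
print(result[0]["generated_text"])
```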
Because the README-DEV.md script, launched directly with accelerate, reported out-of-memory during training, I switched to launching with DeepSpeed stage 1; all other parameters were left at their defaults. Since I used 8 GPUs, the number of iteration steps was reduced to 1/4.
After running the experiment I found:
I want to ask: is it possible that a different machine, plus adding DeepSpeed, could make the results this much worse?
I am very excited to read the cool work Magicoder. I strongly believe that OSS-Instruct will push the boundaries of instruction tuning for code LLMs.
I want to ask a question about Magicoder. It seems that you do not test the correctness of the solutions generated from seed code snippets. I am curious why it is not necessary to go through a code-validity checking process. Below are some assumptions I made about this:
What’s your opinion on this problem? I am looking forward to your reply and thanks for your help!
Hello
I am trying to fine-tune CodeLlama-Python-hf on 4 GPUs with 22 GB of memory. Using the training process described in the Magicoder README gives a CUDA out-of-memory error.
How can I quantise the model or optimise memory usage so it loads on my machine?
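Not from this repo, but a common way to fit a ~7B model on 22 GB cards is 4-bit loading with bitsandbytes, typically combined with LoRA/peft and gradient checkpointing for fine-tuning. A minimal loading sketch, where the model id is assumed to be the CodeLlama Python base:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "codellama/CodeLlama-7b-Python-hf"  # assumption: the base model being fine-tuned

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```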
I've discovered a significant overlap between the Magicoder-Evol-Instruct-110K and HumanEval. Specifically, I found that:
This is a conservative estimate, and there may be more similar items that I missed. Approximately 3000 items in the training set resemble those in the test set.
Is the overlap between Magicoder-Evol-Instruct-110K and HumanEval a test set leakage issue, or is it acceptable since the overlapping problems are paraphrased rather than identical?
HumanEval ID | Line no. in Magicoder-Evol-Instruct-110K |
---|---|
HumanEval/0 | 60301 |
HumanEval/1 | 12556 |
HumanEval/3 | 8985 |
HumanEval/5 | 1934 |
HumanEval/6 | 75548 |
HumanEval/8 | 50011 |
HumanEval/9 | 99748 |
HumanEval/11 | 16654 |
HumanEval/12 | 21800 |
HumanEval/13 | 2630 |
HumanEval/15 | 106409 |
HumanEval/17 | 1993 |
HumanEval/18 | 54 |
HumanEval/20 | 76613 |
HumanEval/21 | 44962 |
HumanEval/24 | 60016 |
HumanEval/25 | 52864 |
HumanEval/26 | 38476 |
HumanEval/27 | 59049 |
HumanEval/29 | 77775 |
HumanEval/31 | 33 |
HumanEval/33 | 13447 |
HumanEval/34 | 47957 |
HumanEval/35 | 108174 |
HumanEval/40 | 34303 |
HumanEval/47 | 2879 |
HumanEval/48 | 103 |
HumanEval/51 | 17387 |
HumanEval/52 | 19399 |
HumanEval/55 | 5484 |
HumanEval/57 | 778 |
HumanEval/58 | 4614 |
HumanEval/59 | 10281 |
HumanEval/60 | 45722 |
HumanEval/61 | 12849 |
HumanEval/63 | 665 |
HumanEval/64 | 9014 |
HumanEval/66 | 16677 |
HumanEval/67 | 66099 |
HumanEval/71 | 87392 |
HumanEval/72 | 17223 |
HumanEval/75 | 107027 |
HumanEval/86 | 95562 |
HumanEval/87 | 47935 |
HumanEval/90 | 4369 |
HumanEval/96 | 6696 |
HumanEval/97 | 39228 |
HumanEval/98 | 5013 |
HumanEval/100 | 39904 |
HumanEval/102 | 75537 |
HumanEval/105 | 12560 |
HumanEval/106 | 835 |
HumanEval/110 | 14161 |
HumanEval/116 | 21351 |
HumanEval/118 | 7984 |
HumanEval/121 | 60649 |
HumanEval/137 | 29729 |
HumanEval/149 | 20001 |
HumanEval/152 | 32034 |
HumanEval/154 | 34346 |
HumanEval/155 | 41634 |
HumanEval/156 | 11264 |
HumanEval/158 | 19315 |
HumanEval/160 | 105483 |
HumanEval/162 | 68840 |
HumanEval/80 | 29996 |
HumanEval/82 | 72856 |
HumanEval/4 | 15300 |
HumanEval/22 | 99988 |
HumanEval/37 | 91008 |
HumanEval/46 | 56824 |
HumanEval/70 | 68607 |
HumanEval/74 | 57346 |
HumanEval/78 | 105847 |
HumanEval/89 | 78000 |
HumanEval/99 | 76033 |
HumanEval/104 | 51604 |
HumanEval/109 | 5608 |
HumanEval/111 | 44999 |
HumanEval/119 | 105874 |
HumanEval/148 | 11486 |
HumanEval/159 | 80388 |
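For context, a rough sketch of how this kind of overlap can be screened for (not necessarily how the table above was produced), using a simple string-similarity ratio between HumanEval prompts and training instructions; the column names are assumptions about the public datasets:

```python
from difflib import SequenceMatcher
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")
train = load_dataset("ise-uiuc/Magicoder-Evol-Instruct-110K", split="train")

def similar(a: str, b: str, threshold: float = 0.6) -> bool:
    # quick_ratio() is an upper bound on ratio(), so it works as a cheap pre-filter.
    m = SequenceMatcher(None, a, b)
    return m.quick_ratio() >= threshold and m.ratio() >= threshold

# Warning: a full 164 x ~110k pairwise scan is slow; this loop is illustrative only.
for task in humaneval:
    for i, row in enumerate(train):
        if similar(task["prompt"], row["instruction"]):
            print(task["task_id"], i)
            break
```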
Hello,
Thanks for providing such an amazing repo for LLMs to generate code.
I am impressed that Magicoder got a great result on HumanEval; however, I can't find the evaluation code for this.
It would be great if the evaluation code were made available.
I got the same problem when using the quick start script to run an inference task.
It is the same as this issue: #22
What should I do to solve this problem?
Hello,
I have used the Hugging Face playground for this model in the past, but now it throws a runtime error. Should I expect a fix soon?
Thank You!
Hi, thanks for your great work.
I tested the performance of Magicoder, but the result does not align with the paper (68.9 vs. 76.8). I guess it is because I used different inference hyperparameters, e.g. --top_p 1, --temperature 1, and so on. I would be grateful if the authors could provide the specific inference hyperparameters. Thank you. I list the script I used below.
I evaluate Magicoder with the script you provided in the 'experiments' folder using the following command:
python experiments/text2code.py \
--model_key deepseek-ai/deepseek-coder-6.7b-base \
--dataset humaneval \
--save_path output_dir/mc_6_7_ds \
--n_batches 1 \
--n_problems_per_batch 1 \
--n_samples_per_problem 1 \
--model_name_or_path ~/weight/magicoders-s-ds-6.7/ \
--top_p 1 \
--max_new_tokens 4096 \
--temperature 1
Then I use the following command for evalplus:
docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples output_dir/mc_6_7_ds.jsonl
Finally, I got this result:
Base
{'pass@1': 0.6890243902439024}
Base + Extra
{'pass@1': 0.6158536585365854}
Hello!
Thank you for your great work. Recently I followed the code you provided on GitHub and your hyperparameters to train Magicoder. However, the results I reproduced differ from those of the model you provided on Hugging Face. I am sure that everything matches your paper.
To clarify, I used my own evaluation code to evaluate the two models, but since both models share the same evaluation code, I think it doesn't matter.
Best regards,
Shen
How about the quality of the Python code implementations in the solutions?
Can they be used directly to train the model?
Thanks for releasing your research codes to everyone.
I noticed that you used the starcoderdata dataset in your research and generated 80K data samples covering Python, C++, Java, TypeScript, Shell, C#, Rust, PHP, and Swift.
But I did not find any Swift language data in the starcoderdata dataset.
Did you also generate the Swift data from here?
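For reference, individual language subsets of starcoderdata can be loaded by directory name; a minimal sketch (I have not checked whether a `swift` directory actually exists in the dataset, which is the point of the question):

```python
from datasets import load_dataset

# Stream only one language directory of StarCoderData; replace "python" with
# another directory name (e.g. "swift", if it exists) to check availability.
subset = load_dataset(
    "bigcode/starcoderdata",
    data_dir="python",
    split="train",
    streaming=True,
)
print(next(iter(subset)))
```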
Could you please include instructions for reproducing the OSS-Instruct dataset in the README?
I've found the generation script here https://github.com/ise-uiuc/magicoder/blob/main/src/magicoder/generate_data.py
Hi, thx for the work!
I was wondering how you format the OSS75k data for training. Is it in an Alpaca-style format, like:
You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.
@@ Instruction
{instruction} # problem column of the OSS75k dataset
@@ Response
{response} # solution column of the OSS75k dataset
Thx
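In case it's useful to others, a minimal sketch of applying that template with `datasets`, assuming the published Magicoder-OSS-Instruct-75K columns are named `problem` and `solution` as the comments above suggest:

```python
from datasets import load_dataset

PROMPT = """You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{problem}

@@ Response
{solution}"""

ds = load_dataset("ise-uiuc/Magicoder-OSS-Instruct-75K", split="train")
ds = ds.map(lambda row: {"text": PROMPT.format(problem=row["problem"], solution=row["solution"])})
print(ds[0]["text"][:400])
```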
Hello,
I observed that in your paper, the MBPP pass@1 for CodeLlama-Python is listed as 57.6% (see Table 1). But, according to the original CodeLlama paper ([1; Table 2]), it's actually 47.6%. Could you please update this?
Reference:
[1] https://arxiv.org/pdf/2308.12950.pdf
Just started reproducing Magicoder and could not help wondering, would a bigger OSS-Instruct dataset work better and how much better?
PS. There are 12,000,000 files in Python inside bigcode/Starcoderdata, with only 40K/12M being used.
Hi,
Thank you very much for your code. I am reproducing your training process. I wonder what your training loss and validation loss were during training, as I want to align my runs with yours on the Magicoder-OSS-Instruct-75K and ise-uiuc/Magicoder-Evol-Instruct-110K datasets.
Thx