
magicoder's Issues

Any plans for a 33b fine tune?

Awesome models! Great job guys! :)
I am wondering whether you plan to fine-tune on top of deepseek-coder-33b-instruct as well? I wonder how high the evaluations would go with that model :)

Any environment requirements for the model? It doesn't work on a MacBook Air M1 (16 GB)

I tried to use Magicoder with ollama on a MacBook Air M1 (16 GB). It works for other models, but when I run this one I get the following error:

...
ggml_metal_init: GPU name:   Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 10922.67 MiB
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size = 1083.07 MiB
llama_new_context_with_model: max tensor size =   102.54 MiB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3648.58 MiB, ( 3649.20 / 10922.67)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  8192.00 MiB, offs =            0
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =     0.03 MiB, offs =   8589918208, (11841.23 / 10922.67)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =  1080.02 MiB, (12921.25 / 10922.67)ggml_metal_add_buffer: warning: current allocated size is greater than the recommended max working set size
ggml_metal_graph_compute: command buffer 0 failed with status 5
GGML_ASSERT: /tmp/ollama-20231213-4188-jpu97j/llm/llama.cpp/gguf/ggml-metal.m:1623: false
2023/12/23 16:46:59 llama.go:451: signal: abort trap
2023/12/23 16:46:59 llama.go:459: error starting llama runner: llama runner process has terminated
2023/12/23 16:46:59 llama.go:525: llama runner stopped successfully

After some googling: is this similar to ggerganov/llama.cpp#2048?

I am not sure whether it can be tuned to work on this Mac; if not, it would be better to add the limitation (or requirement) to the README.
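
For reference, and not as a confirmed fix: the 8 GiB "kv" buffer in the log above is the KV cache, whose size scales with the context length, so a smaller context may fit within 16 GB of unified memory (in ollama the corresponding knob is the num_ctx parameter). A minimal sketch with llama-cpp-python, using a hypothetical local GGUF path:

from llama_cpp import Llama

# Hypothetical path to a quantized Magicoder GGUF; adjust to your local file.
llm = Llama(
    model_path="./magicoder-s-ds-6.7b.Q4_K_M.gguf",
    n_ctx=2048,       # smaller context -> much smaller KV cache than the 8 GiB above
    n_gpu_layers=-1,  # offload all layers to the M1 GPU via Metal
)

out = llm("Write a Python function that reverses a string.", max_tokens=128)
print(out["choices"][0]["text"])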

The templates used in reproducing the eval results: why is the instruction added again after "@@ Response"?

There is an input format mismatch between the eval and training process. Do you intend to emphasize the problem before the model generates its output?

When doing the HumanEval(+) eval, the compiled inputs are as follows, e.g.:


@@ Instruction
Write a solution to the following problem:
```python
def fib(n: int):
    """Return n-th Fibonacci number.
    >>> fib(10)
    55
    >>> fib(1)
    1
    >>> fib(8)
    21
    """

@@ Response

def fib(n: int):
    """Return n-th Fibonacci number.
    >>> fib(10)
    55
    >>> fib(1)
    1
    >>> fib(8)
    21
    """```

But in the data-processing and training code, the instruction data is compiled as:

You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
Write a solution to the following coding problem:
{problem}

@@ Response
{response}

There is no such rephrasing/emphasizing in the training data of Magicoder.
From the eval results, this mismatch does not seem to have obvious negative effects, but was it deliberate?
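
For concreteness, a rough sketch of the two formats in Python, based on the snippets above (the exact fence placement and whether the system line is included at eval time are my assumptions, not necessarily the released code):

SYSTEM = (
    "You are an exceptionally intelligent coding assistant that consistently "
    "delivers accurate and reliable responses to user instructions."
)

# Training-style example: the response section contains only the solution.
def training_example(problem: str, response: str) -> str:
    return (
        f"{SYSTEM}\n\n"
        "@@ Instruction\n"
        f"Write a solution to the following coding problem:\n{problem}\n\n"
        f"@@ Response\n{response}"
    )

# Eval-style prompt (per the compiled HumanEval input above): the problem is
# restated once more right after "@@ Response", so the model continues from it.
def eval_prompt(problem: str) -> str:
    return (
        f"{SYSTEM}\n\n"
        "@@ Instruction\n"
        f"Write a solution to the following problem:\n```python\n{problem}\n```\n\n"
        f"@@ Response\n{problem}"
    )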

Text-gen prompt template?

Sorry if this was mentioned before, but is there a stock prompt template in ooba's text-gen that works with this?

Optimizer selection

Hi, thx for the brilliant work!

I am curious about the decision to use Adafactor as the optimizer for Magicoder. Have other options been explored or tried in this context? 🤔
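
For anyone who wants to compare optimizers themselves: with the Hugging Face Trainer the optimizer is a one-line switch. A minimal sketch (the hyperparameter values are placeholders, not the paper's settings):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    optim="adafactor",              # swap for "adamw_torch" to compare against AdamW
    learning_rate=5e-5,             # placeholder value
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    bf16=True,
)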

Max Token = ?

After reading through the page, I didn't see any mention of a max token limit for Magicoder-S-DS-6.7B & Magicoder-DS-6.7B (is it safe to assume it's the same as DeepSeek Coder, i.e. 16k?).
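
One way to check rather than assume is to read the model config; a short sketch (ise-uiuc/Magicoder-S-DS-6.7B is the Hugging Face id assumed here):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("ise-uiuc/Magicoder-S-DS-6.7B")
# max_position_embeddings is the configured positional range; rope_scaling,
# if present, shows how the DeepSeek base extends its effective context.
print(cfg.max_position_embeddings)
print(getattr(cfg, "rope_scaling", None))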

catastrophic forgetting problem

Hi,

Thanks for open-sourcing this. When I fine-tuned on my own dataset (whether full-parameter or LoRA), catastrophic forgetting kept coming up (a decrease in performance on HumanEval). I do not know how to solve it; do you have any tips?
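
Not an official answer, but one common mitigation is rehearsal: mix a slice of the original instruction data back into the fine-tuning set so the model keeps seeing it. A rough sketch, assuming your data is mapped to the same problem/solution columns as Magicoder-OSS-Instruct-75K:

from datasets import load_dataset, concatenate_datasets

# Your own instruction data (hypothetical file name), already in the same schema.
own = load_dataset("json", data_files="my_instructions.jsonl", split="train")

# Replay a sample of the original Magicoder data to reduce forgetting.
oss = load_dataset("ise-uiuc/Magicoder-OSS-Instruct-75K", split="train")
replay = oss.shuffle(seed=42).select(range(10_000))

# Column names and types must match before concatenation; rename/map if needed.
mixed = concatenate_datasets([own, replay]).shuffle(seed=42)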

Possibility for a Mixture-of-Experts Model?

With the recent release of Mixtral 8x7B, there's a lot of hope and excitement around open-source MoE models.

It would be very interesting to see how a narrowly focused MoE model performs.

Code for the evaluations on APPS.

Hello! I noticed that there are some new experiments on the APPS benchmark in Appendix C.1 of the updated paper, where MagicoderS-DS outperforms all other models. I wonder if you could provide the evaluation code for reproduction? Thanks a lot!

Using dilated attention instead of vanilla attention in the Llama model and fine-tuning

I want to ask whether I can replace the vanilla attention used in the base model with dilated attention and then do the fine-tuning. The idea is to reduce the complexity of attention and increase the context window. Does DeepSeek use Llama 2 as its base model with the same architecture? If so, can I load the checkpoints of the model's layers (such as the norm layers and feed-forward layers), or do I need to refactor the LLM from scratch?
Or is there any method to adapt or share the weights?

Can you consider adding my explanation of how to use Magicoder in text-generation-webui?

After experimenting with text-generation-webui by oobabooga, I found the following:
  • Magicoder models are all instruct-only models (no chat/chat-instruct).
  • You need to create a new custom template under the Parameters / Instruction template tab.
  • You also need to change these values under the Parameters / Generation tab: max_new_tokens=1024, top_p=0.9, top_k=50, repetition_penalty=1, repetition_penalty_range=1024 (a rough transformers equivalent is sketched below).
  • Copy the content of the attached text file into the custom template / instruction template field:
    instruction-template-magicode.txt

I took the generation parameters from deepseek-coder/demo/app.py
The instruction template is adapted from Airoboros-v1.2 after comparing it with Magicoder's prompt template.
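
For anyone running outside the webui, a rough transformers equivalent of those generation settings (repetition_penalty_range has no direct counterpart here; the model id ise-uiuc/Magicoder-S-DS-6.7B is assumed):

import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="ise-uiuc/Magicoder-S-DS-6.7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = (
    "You are an exceptionally intelligent coding assistant that consistently "
    "delivers accurate and reliable responses to user instructions.\n\n"
    "@@ Instruction\nWrite a Python function that checks whether a number is prime.\n\n"
    "@@ Response\n"
)
out = generator(
    prompt,
    max_new_tokens=1024,
    do_sample=True,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.0,
    return_full_text=False,
)
print(out[0]["generated_text"])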

Achieved performance close to MagicoderS by fine-tuning only with `evol-codealpaca-v1`.

Thanks to your amazing tutorial, we reproduced the training process and the experiments in the paper. The models we fine-tuned ourselves achieved performance close to yours: for HumanEval(+), we got 57.32% / 52.44% pass@1 for Magicoder and 70.12% / 67.07% for MagicoderS.
Moreover, we conducted ablation studies to clarify the contribution of OSS-Instruct relative to Evol-Instruct in the training of MagicoderS.

  • We achieved performance close to MagicoderS (70.12% / 65.24% on HumanEval(+), with similar results on DS-1000) by fine-tuning ONLY with the evol-codealpaca-v1 dataset under the same training setting mentioned in the tutorial.
  • We got even worse results (66.46% / 62.20% on HumanEval(+)) by swapping the training order of oss-instruct-75k and evol-codealpaca-v1.

We noticed that oss-instruct-75k was generated with gpt-3.5-turbo-1106, whereas evol-codealpaca-v1 was generated with gpt-4, so the MagicoderS comparison may be unfair. I think there should be additional evidence showing the contribution of OSS-Instruct when it is combined with other data-generation methods.

Confusion about the training code

First of all, thank you for your amazing work!
I'm attempting to replicate the training process, and I have a question regarding the train.py file. In your paper, you mentioned using two A100-80G GPUs, but I couldn't find any mention of multiprocessing or distributed training in your code. I'm curious whether you used DeepSpeed for training. If not, could you provide guidance on modifying the code to make it compatible with a multi-GPU setup?
Thanks once again!

Data collection and generation

Thanks for releasing your research codes to everyone.
I found it a bit difficult to figure out what the variables here are for.
Can you please explain them?
And in general, could you please add a more comprehensive readme about the data collection/generation parts of the code base?
Thanks!

How to write a prompt for the code completion task

I've run a prompt for Python code completion using:
prompt_template = f"""Write a solution to the following problem:
```python
{code}
```"""
but the LLM output contained nothing new for the code in the input prompt; it just generated some other information.
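
Not an official answer, but one thing that may help: wrap the partial code in the full Magicoder instruction prompt (as shown elsewhere in these issues) and decode only the newly generated tokens, dropping the echoed input. A sketch, with the model id ise-uiuc/Magicoder-S-DS-6.7B assumed:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "ise-uiuc/Magicoder-S-DS-6.7B"  # assumed HF id
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def complete(code: str) -> str:
    prompt = (
        "You are an exceptionally intelligent coding assistant that consistently "
        "delivers accurate and reliable responses to user instructions.\n\n"
        "@@ Instruction\n"
        "Write a solution to the following problem:\n"
        f"```python\n{code}\n```\n\n"
        "@@ Response\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the tokens generated after the prompt, i.e. the actual completion.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)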

Reproducing the Magicoder-S-DS-6.7B results on 8 A40 GPUs

Because running the README-DEV.md script directly with accelerate reported out-of-memory during training, I switched to launching with DeepSpeed stage 1, keeping all other parameters at their defaults. Since there are 8 GPUs, the iteration steps were reduced to 1/4.

After running the experiments, I found:

  1. Training speed dropped significantly.
  2. Neither the stage-1 nor the stage-2 trained model can reach 60%.

I would like to ask: is it possible that different hardware, plus adding DeepSpeed, could make the results this much worse?

Inquiry about Paper Details of Magicoder

I am very excited to read the cool work Magicoder. I strongly believe that OSS-Instruct will push the boundaries of instruction tuning for code LLMs.

I want to ask a question about Magicoder. It seems that you do not test the correctness of the solutions generated from the seed code snippets. I am curious why it is not necessary to go through a code-validity checking process. Below are some assumptions I made about this:

  1. Most of the generated solutions are simply correct (by manual checking), and LLMs are robust to some incorrect code during fine-tuning.
  2. OSS-Instruct creates new data that is more like a combination of seed code snippets, and the LLMs (GPT-3.5/GPT-4) used to generate solutions can handle the combination easily since they have seen correct seed code snippets.

What’s your opinion on this problem? I am looking forward to your reply and thanks for your help!

Quantised Finetuning on 22GB*4 GPUs

Hello,
I am trying to fine-tune CodeLlama-Python-hf on 4 GPUs with 22 GB of memory each. Using the training process mentioned in the Magicoder README gives a CUDA out-of-memory error.

How can I quantise the model or optimise memory usage so that it fits on my machine?
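
One common way to make this fit on ~22 GB cards is QLoRA-style training: load the base model in 4-bit and train LoRA adapters instead of the full-parameter recipe from the README. A sketch (not the Magicoder training script; assumes bitsandbytes and peft are installed, and a 7B CodeLlama checkpoint):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL = "codellama/CodeLlama-7b-Python-hf"  # adjust to the checkpoint you fine-tune

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # also enables gradient checkpointing

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable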

Overlap between Magicoder-Evol-Instruct-110K and HumanEval

I've discovered a significant overlap between Magicoder-Evol-Instruct-110K and HumanEval. Specifically, I found that:

  • 82 out of 164 HumanEval problems are directly present in the training set (the code snippets directly pass the unit tests of the corresponding HumanEval problem)
  • Around 130 problems have significant similarities

This is a conservative estimate, and there may be more similar items that I missed. Approximately 3000 items in the training set resemble those in the test set.

Is the overlap between Magicoder-Evol-Instruct-110K and HumanEval a test-set leakage issue, or is it acceptable since the overlapping problems are paraphrased rather than identical? (A rough similarity-check sketch follows the list below.)

HumanEval ID / line number in Magicoder-Evol-Instruct-110K
HumanEval/0 60301
HumanEval/1 12556
HumanEval/3 8985
HumanEval/5 1934
HumanEval/6 75548
HumanEval/8 50011
HumanEval/9 99748
HumanEval/11 16654
HumanEval/12 21800
HumanEval/13 2630
HumanEval/15 106409
HumanEval/17 1993
HumanEval/18 54
HumanEval/20 76613
HumanEval/21 44962
HumanEval/24 60016
HumanEval/25 52864
HumanEval/26 38476
HumanEval/27 59049
HumanEval/29 77775
HumanEval/31 33
HumanEval/33 13447
HumanEval/34 47957
HumanEval/35 108174
HumanEval/40 34303
HumanEval/47 2879
HumanEval/48 103
HumanEval/51 17387
HumanEval/52 19399
HumanEval/55 5484
HumanEval/57 778
HumanEval/58 4614
HumanEval/59 10281
HumanEval/60 45722
HumanEval/61 12849
HumanEval/63 665
HumanEval/64 9014
HumanEval/66 16677
HumanEval/67 66099
HumanEval/71 87392
HumanEval/72 17223
HumanEval/75 107027
HumanEval/86 95562
HumanEval/87 47935
HumanEval/90 4369
HumanEval/96 6696
HumanEval/97 39228
HumanEval/98 5013
HumanEval/100 39904
HumanEval/102 75537
HumanEval/105 12560
HumanEval/106 835
HumanEval/110 14161
HumanEval/116 21351
HumanEval/118 7984
HumanEval/121 60649
HumanEval/137 29729
HumanEval/149 20001
HumanEval/152 32034
HumanEval/154 34346
HumanEval/155 41634
HumanEval/156 11264
HumanEval/158 19315
HumanEval/160 105483
HumanEval/162 68840
HumanEval/80 29996
HumanEval/82 72856
HumanEval/4 15300
HumanEval/22 99988
HumanEval/37 91008
HumanEval/46 56824
HumanEval/70 68607
HumanEval/74 57346
HumanEval/78 105847
HumanEval/89 78000
HumanEval/99 76033
HumanEval/104 51604
HumanEval/109 5608
HumanEval/111 44999
HumanEval/119 105874
HumanEval/148 11486
HumanEval/159 80388
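
For anyone who wants to run a rough screen of their own (this is not necessarily the exact procedure used above), a simple similarity check between HumanEval prompts and the training instructions could look roughly like this; the "instruction" column name is an assumption about the dataset schema:

import difflib
from datasets import load_dataset
from evalplus.data import get_human_eval_plus

train = load_dataset("ise-uiuc/Magicoder-Evol-Instruct-110K", split="train")
problems = get_human_eval_plus()

def best_match(prompt: str):
    # Brute-force similarity scan; slow, but fine for a spot check.
    best_score, best_idx = 0.0, None
    for i, row in enumerate(train):
        score = difflib.SequenceMatcher(None, prompt, row["instruction"]).ratio()
        if score > best_score:
            best_score, best_idx = score, i
    return best_score, best_idx

# Spot-check a handful of problems; scanning all 164 takes a while.
for task_id, problem in list(problems.items())[:5]:
    score, idx = best_match(problem["prompt"])
    print(task_id, round(score, 2), idx)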

Evaluation code not found

Hello,

Thanks for providing such an amazing repo for LLM code generation.
I am impressed that Magicoder got a great result on HumanEval; however, I can't find the evaluation code for this.
It would be great if the evaluation code were made available.

HuggingFace Playground has failed

Hello,
I have used the Hugging Face playground for this model in the past, but now it has a runtime error. Should I expect a fix?

Thank You!

The evaluation result of Magicoder is not aligned with the result in the paper

Hi, thanks for your great work.

I tested the performance of Magicoder; however, the result does not align with the one in the paper (68.9 vs. 76.8). I guess it is because I used different hyperparameters for inference, e.g. --top_p 1, --temperature 1, and so on. I would be grateful if the authors could provide the specific inference hyperparameters. Thank you. I list the script I used below:

I try to evaluate Magicoder with the script you provided in the 'experiments' folder, using the following command:

python experiments/text2code.py \
    --model_key deepseek-ai/deepseek-coder-6.7b-base \
    --dataset humaneval \
    --save_path output_dir/mc_6_7_ds \
    --n_batches 1 \
    --n_problems_per_batch 1 \
    --n_samples_per_problem 1 \
    --model_name_or_path ~/weight/magicoders-s-ds-6.7/ \
    --top_p 1 \
    --max_new_tokens 4096 \
    --temperature 1

Then I use the following command for evalplus:

docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples output_dir/mc_6_7_ds.jsonl

Finally, I got this result:

Base
{'pass@1': 0.6890243902439024}
Base + Extra
{'pass@1': 0.6158536585365854}

Question about the different replication result

Hello!

Thank you for your great work. Recently I followed the code you provided on GitHub and your hyperparameters to train Magicoder. However, the results I reproduced differ from those of the model you provided on Hugging Face. I am sure that everything is the same as in your paper.

Here's my result.
[results screenshot attached]

To clarify, I use my own evaluation code to evaluate the two models. But since the two models share the same evaluation code, I think it doesn't matter.

Best regards,
Shen

The correctness of the solutions

How is the quality of the Python code implementations in the solutions?
Can they be used directly to train the model?

A question about the data generated from starcoderdata

Thanks for releasing your research codes to everyone.
I noticed that you used the starcoderdata dataset in your research and generated 80K data samples covering Python, C++, Java, TypeScript, Shell, C#, Rust, PHP, and Swift.
But I did not find any Swift data in the starcoderdata dataset.
Did you also generate the Swift data from here?

Training data format for Magicoder-OSS-Instruct-75K

Hi, thx for the work!

I was wondering how you format the OSS-75K data for training. Is it in an Alpaca-like format such as:

You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{instruction} # problem column of the OSS75k dataset

@@ Response
{response} # solution column of the OSS75k dataset

Thx

Are the training loss and validation loss recorded?

Hi,

Thank you very much for your code. I am reproducing your training process. I wonder what your training loss and validation loss were during training; I want to align mine with yours on the Magicoder-OSS-Instruct-75K and Magicoder-Evol-Instruct-110K datasets.

Thx
