
Self-Instruct: Aligning LMs with Self-Generated Instructions

This repository contains code and data for the Self-Instruct paper, a method for aligning pretrained language models with instructions.

Introduction

Self-Instruct is a framework that helps language models improve their ability to follow natural language instructions. It does this by using the model's own generations to create a large collection of instructional data. With Self-Instruct, it is possible to improve the instruction-following capabilities of language models without relying on extensive manual annotation.

Background

In recent years, there has been a growing interest in building models that can follow natural language instructions to perform a wide range of tasks. These models, known as "instruction-tuned" language models, have demonstrated the ability to generalize to new tasks. However, their performance is heavily dependent on the quality and quantity of the human-written instruction data used to train them, which can be limited in diversity and creativity. To overcome these limitations, it is important to develop alternative approaches for supervising instruction-tuned models and improving their instruction-following capabilities.

How does Self-Instruct work?

The Self-Instruct process is an iterative bootstrapping algorithm that starts with a seed set of manually-written instructions and uses them to prompt the language model to generate new instructions and corresponding input-output instances. These generations are then filtered to remove low-quality or similar ones, and the resulting data is added back to the task pool. This process can be repeated multiple times, resulting in a large collection of instructional data that can be used to fine-tune the language model to follow instructions more effectively.
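The loop described above can be sketched in Python. This is an illustrative outline only, not the repository's actual code; `generate_from_lm`, `is_low_quality`, and `too_similar` are hypothetical stand-ins for the model call and the paper's filtering heuristics.

```python
import random

def self_instruct_bootstrap(seed_tasks, generate_from_lm, is_low_quality,
                            too_similar, num_rounds=3):
    """Illustrative sketch of the Self-Instruct bootstrapping loop.

    seed_tasks: list of human-written instructions.
    generate_from_lm: callable taking a few in-context examples and
        returning newly generated instructions (stand-in for the LM call).
    is_low_quality / too_similar: stand-ins for the filtering heuristics.
    """
    task_pool = list(seed_tasks)
    for _ in range(num_rounds):
        # Prompt the LM with a sample of existing tasks to elicit new ones.
        prompt_examples = random.sample(task_pool, min(8, len(task_pool)))
        candidates = generate_from_lm(prompt_examples)
        # Filter out low-quality or near-duplicate generations.
        kept = [c for c in candidates
                if not is_low_quality(c)
                and not any(too_similar(c, t) for t in task_pool)]
        # Surviving tasks go back into the pool for the next round.
        task_pool.extend(kept)
    return task_pool
```

Because kept generations re-enter the pool, later rounds are prompted with a mix of human-written and model-generated tasks, which is what makes the process a bootstrap rather than a single pass.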

Here is an overview of Self-Instruct:

(Figure: the pipeline for generating instruction data from the language model itself.)

Usage

* This work is still in progress. We may update the code and data as we make progress, so please be mindful of which version you are using.

Instruction-tuning using our Self-Instruct data

We release a dataset that contains 52K instructions, paired with 82K instance inputs and outputs. This instruction data can be used to conduct instruction-tuning for language models and make them follow instructions better. The entire model-generated dataset can be accessed in data/gpt3-generations/batch_221203/all_instances_82K.jsonl. This data (plus the 175 seed tasks), reformatted into the clean GPT3 finetuning format (prompt + completion), is available in data/finetuning/self_instruct_221203. You can use the script ./scripts/finetune_gpt3.sh to finetune GPT3 on this data.
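The finetuning files follow the one-JSON-object-per-line (JSONL) prompt/completion convention used for GPT3 finetuning. A minimal sketch of loading such a file; the `prompt` and `completion` field names follow that convention, and the exact schema of the released files should be checked against the data itself.

```python
import json

def load_finetuning_examples(path):
    """Read a GPT3-style finetuning file: one JSON object per line,
    each with "prompt" and "completion" fields."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            examples.append((record["prompt"], record["completion"]))
    return examples
```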

Note: This data is generated by a language model (GPT3) and inevitably contains some errors or biases. We analyzed the data quality on 200 random instructions in our paper, and found that 46% of the data points may have problems. We encourage users to use this data with caution and propose new methods to filter or improve the imperfections.

Evaluating instruction-following capabilities

We also release a new set of 252 expert-written tasks and their instructions motivated by user-oriented applications (rather than well-studied NLP tasks). This data is used in the human evaluation section of the self-instruct paper. Please refer to the human evaluation README for more details.

Generating Self-Instruct data from scratch

To generate Self-Instruct data using your own seed tasks or other models, we open-source our scripts for the entire pipeline here. Our current code is only tested on the GPT3 model accessible via the OpenAI API.

Here are the scripts for generating the data:

# 1. Generate instructions from the seed tasks
./scripts/generate_instructions.sh

# 2. Identify whether the instruction represents a classification task or not
./scripts/is_clf_or_not.sh

# 3. Generate instances for each instruction
./scripts/generate_instances.sh

# 4. Filtering, processing, and reformatting
./scripts/prepare_for_finetuning.sh
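The filtering in the last step includes the deduplication described earlier: the paper keeps a newly generated instruction only if its ROUGE-L similarity to every instruction already in the pool stays below a threshold. A simplified, self-contained sketch of that check (token-level LCS-based ROUGE-L F1, not the repository's actual implementation; the 0.7 threshold here is illustrative):

```python
def rouge_l_f1(candidate, reference):
    """Simple ROUGE-L F1 between two whitespace-tokenized strings,
    based on the longest common subsequence (LCS) over tokens."""
    a, b = candidate.split(), reference.split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[len(a)][len(b)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)

def keep_instruction(new_instruction, pool, threshold=0.7):
    """Keep a generated instruction only if it is not too similar
    (ROUGE-L at or above the threshold) to anything already in the pool."""
    return all(rouge_l_f1(new_instruction, old) < threshold for old in pool)
```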

Citation

If you use the Self-Instruct framework or data, please cite us:

@misc{selfinstruct,
  title={Self-Instruct: Aligning Language Model with Self Generated Instructions},
  author={Wang, Yizhong and Kordi, Yeganeh and Mishra, Swaroop and Liu, Alisa and Smith, Noah A. and Khashabi, Daniel and Hajishirzi, Hannaneh},
  journal={arXiv preprint arXiv:2212.10560},
  year={2022}
}


self-instruct's Issues

Question about using task instruction as context

Is there a reason for using 6 human-written and 2 model-generated instructions as context? Do these two hyperparameters, and also the diversity of the sampled instruction types (e.g., NER, classification, generation), make any difference?

[How to] Generate a dataset from PDFs?

I have my data in a bundle of PDFs, documents, etc. Is there any way to extract data from them and generate an instruction dataset using Self-Instruct?

What are the steps to create a new instruction from a seed?

The paper is not clear to me. If I have an instruction seed written by a human, what is the process for creating a single new instruction from this single seed?

In addition, the repository says the instructions are "generated by themselves", but they are not generated entirely by the model itself; a third-party API is used.

The problem of training my own data

Thanks for sharing!

I want to run Self-Instruct on my Chinese data, but I can't call the OpenAI API. If I want to use an existing model (such as LLAMA2) to run Self-Instruct locally and offline, how do I modify the code? Or do you have any suggestions?

Thanks!

Is there a more detailed analysis of seed tasks?

Why is the number of seed tasks set to 175? How does the number of seed tasks affect the final results, including the quality of the generated instructions and the performance of the instruction-tuned model?
I have been considering generating more domain-specific instructions recently. The number of seed tasks would be smaller and the content (or format) more uniform. Is there anything I should pay attention to if I craft the seed task set myself, for example the number and the content? And do you think models tuned on domain-specific instructions will do better in that domain?
Thanks a lot, wish you a good day :).

Minor Grammatical Errors

Greetings!

Excellent code! I saw a few grammatical errors in some of your code that I figured I'd share with you.

on Prep:
Line 32 - Word misspelled. ign instead of ing.

on GPT:
Line 43 - there is a word capitalized after a comma.
Lines 74 and 79 - gpt is lowercase and the rest are upper case.

on Bootstrap:
Line 116 - Used 'GPT-3', however, other instances in your code refer to it as 'GPT3'.
Line 121 - 'missing quotes' around variable referenced.

on CLF:
Line 55 - 'missing quotes' around variable referenced.

Trivial in nature as it does not interfere with your code, but figured you may want some uniformity.

Regards,

Atlas

What does error mean?

Hello Community,

I keep getting the same error. What does this error mean?

usage: generate_instances.py [-h] --batch_dir BATCH_DIR
[--input_file INPUT_FILE]
[--output_file OUTPUT_FILE]
[--num_instructions NUM_INSTRUCTIONS]
[--max_instances_to_generate MAX_INSTANCES_TO_GENERATE]
[--generation_tasks_only]
[--classification_tasks_only] [--engine ENGINE]
[--request_batch_size REQUEST_BATCH_SIZE]
[--api_key API_KEY] [--organization ORGANIZATION]
generate_instances.py: error: the following arguments are required: --batch_dir

THX!
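The usage message above means the script was invoked without the mandatory --batch_dir flag: argparse prints its usage text and exits whenever a required argument is missing. A minimal standalone reproduction of that behavior (the flag names mirror the usage output, but this is not the repository's actual script, and the default engine value here is illustrative):

```python
import argparse

def build_parser():
    """Mirror the required/optional split shown in the usage message."""
    parser = argparse.ArgumentParser(prog="generate_instances.py")
    # required=True is what produces "the following arguments are
    # required: --batch_dir" when the flag is omitted.
    parser.add_argument("--batch_dir", required=True,
                        help="directory holding the generated instructions")
    parser.add_argument("--input_file")
    parser.add_argument("--output_file")
    parser.add_argument("--engine", default="davinci")  # illustrative default
    return parser
```

So the fix is simply to pass a batch directory on the command line, e.g. `--batch_dir <your_batch_dir>`.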

[QUESTION] Unexpected results by GPT-SelfInstruct+SuperNI

As I understood figure 5 in your paper, you further fine-tuned GPT-SelfInstruct on the SuperNatural Instructions data and surprisingly the results got worse compared to the "vanilla" GPT-SelfInstruct.

Is my understanding correct? If so, do you have any assumptions why a high-quality human annotated dataset as additional fine-tuning data worsened the overall performance?
