
progen's Introduction

PROGEN

This repository contains the code for our paper “ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback”. The implementation is built on the source code from ZeroGen.

If you use this code, please cite our paper:

@inproceedings{ye-etal-2022-progen,
    title = "{P}ro{G}en: Progressive Zero-shot Dataset Generation via In-context Feedback",
    author = "Ye, Jiacheng  and
      Gao, Jiahui  and
      Wu, Zhiyong  and
      Feng, Jiangtao  and
      Yu, Tao  and
      Kong, Lingpeng",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.269",
    pages = "3671--3683"
}

Setup

All requirements for PROGEN can be found in requirements.txt. You can install all required packages in a new environment with pip install -r requirements.txt.

Usage

The scripts/run_main.sh script contains the commands to run the following methods:

  • SUPERVISED: supervised learning with human annotations;
  • PROMPTING: prompt-based zero-shot learning;
  • PROMPTING*: prompt-based zero-shot learning with calibration by setting --calibrate in prompting;
  • ZEROGEN: efficient zero-shot learning via dataset generation by setting in_context_type=none;
  • PROGEN: progressive dataset generation via in-context feedback.

We track the performance of the small model during the data generation procedure. Before running, set the following parameters to your own values:

  • batch_size: the batch size for generation with the PLM. For SST-2, generation takes about 16 GB of GPU memory with a batch size of 32 and gpt2-xl, so decrease the batch size if needed (see the memory-probe sketch after this list);
  • WANDB_PROJECT: the wandb project name, ProGen by default;
  • WANDB_ENTITY: your wandb username;
  • WANDB_API_KEY: your wandb API key.
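
The ~16 GB figure above depends on hardware, sequence length, and generation settings. As a rough, standalone way to probe what your own batch size costs, the sketch below (not part of this repo) loads gpt2-xl through the Hugging Face transformers API, generates one batch from a dummy prompt, and reports peak GPU memory; the prompt and generation settings are placeholders, not the repo's actual task templates.

```python
# Standalone memory probe (not part of this repo): generate one batch with
# gpt2-xl and report peak GPU memory, to help pick a batch_size.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

batch_size = 32          # the value you plan to pass to run_main.sh
model_name = "gpt2-xl"   # same PLM as the default setting

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name).cuda().eval()

# A dummy prompt; the real prompts come from the task templates in this repo.
prompts = ["The movie review in positive sentiment is:"] * batch_size
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9,
                   pad_token_id=tokenizer.eos_token_id)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```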

By default we use GPT2-XL as the pre-trained language model (PLM) and DistilBERT as the tiny task model (TAM). To change the size of the PLM or the TAM, modify model_name and small_model_name in the run_main.sh script. We also include API-based inference (i.e., OPT-175B), where the dataset is generated through a black-box API.

Run with a synthesized dataset

After dataset generation, the synthetic dataset is saved at out-${task_name}-x2/${dataset}/${task_name}-dataset.jsonl. The file is in JSON Lines format, with one example per line (e.g., {"C": "The Book of Mormon Musical", "X": "The Book of Mormon Musical brings all the drama and excitement of a real revival of the Broadway production to the big screen.", "Y": 0}).
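
For reference, such a file can be inspected with a few lines of standard Python. The snippet below is only a sketch: the path is a hypothetical concrete instance of the pattern above, and the fields C, X, and Y appear to hold the context, the generated text, and the pseudo label, as in the sample line.

```python
# Sketch: load a synthesized dataset and inspect its label distribution.
# The path is hypothetical; substitute your own task_name/dataset values.
import json
from collections import Counter

path = "out-sst-2-x2/gpt2-xl/sst-2-dataset.jsonl"  # example path only

examples = []
with open(path) as f:
    for line in f:
        examples.append(json.loads(line))

print(f"{len(examples)} synthetic examples")
print("label distribution:", Counter(ex["Y"] for ex in examples))
print("sample:", examples[0]["X"][:80], "->", examples[0]["Y"])
```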

To train DistilBERT or an LSTM on a generated dataset, use the scripts/run_tam_training.sh script.
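
As a purely illustrative companion to that script (not a replacement for it), the sketch below shows what training a TAM on such a file amounts to, using the Hugging Face datasets and Trainer APIs; the file path is hypothetical, and the noise-robust training details from the paper are omitted.

```python
# Independent sketch (not scripts/run_tam_training.sh): fine-tune DistilBERT
# on a synthesized jsonl file with Hugging Face datasets + Trainer.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

data_file = "out-sst-2-x2/gpt2-xl/sst-2-dataset.jsonl"  # hypothetical path
raw = load_dataset("json", data_files=data_file, split="train")

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # X is the generated text, Y the pseudo label (see the sample line above).
    enc = tokenizer(batch["X"], truncation=True, max_length=128)
    enc["labels"] = batch["Y"]
    return enc

train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tam-distilbert", num_train_epochs=3,
                           per_device_train_batch_size=32, learning_rate=2e-5),
    train_dataset=train_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```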

Influence Function Experiments

To test the effectiveness of the influence function (IF) when the validation set is noisy, run bash scripts/run_IF_exp.sh. You will see that the effectiveness of IF is strongly affected by noise in the validation set (Fig. 6 in the paper), and that using RCE as the loss function reduces this effect.

In this experiment (obj=ce vs. obj=rce on SST-2), the identified helpful examples are removed and the resulting data is used to train the small model. The rce loss identifies the helpful instances in the training set well even when the validation set contains 40% noise.
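
For readers unfamiliar with the rce objective: assuming it denotes the reverse cross-entropy term popularized by symmetric cross-entropy (Wang et al., 2019), one common formulation is sketched below. This is a generic illustration, not necessarily the exact loss implemented in this repo.

```python
# Generic reverse cross-entropy (RCE) sketch, assuming the Wang et al. (2019)
# formulation: swap the roles of prediction and label in cross-entropy and
# clamp log(0) on the one-hot label to a constant A (commonly -4).
import torch
import torch.nn.functional as F

def reverse_cross_entropy(logits, targets, A=-4.0):
    probs = F.softmax(logits, dim=-1)                      # model distribution
    one_hot = F.one_hot(targets, logits.size(-1)).float()  # label distribution
    log_labels = torch.clamp(torch.log(one_hot), min=A)    # log 0 -> A
    return -(probs * log_labels).sum(dim=-1).mean()

# Tiny usage example with random logits for a 2-class task.
logits = torch.randn(8, 2)
targets = torch.randint(0, 2, (8,))
print("rce:", reverse_cross_entropy(logits, targets).item())
```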


progen's Issues

Validation results of ProGen are not on the gold dataset?

Hi authors,

I am very interested in your series of works on data generation!
Currently, I am trying to develop my own data generation workflow based on your research (ProGen/ZeroGen/SunGen).

When I read through your code in main.py and cls_generation.py, I found that the TAMs are evaluated on the synthesized validation set.
I am a little confused about this setting, because I believe a fair comparison would train on the synthesized dataset but evaluate on the gold validation set.
Otherwise, the TAM fine-tuning results of ProGen/ZeroGen are not comparable with prompting or traditional fine-tuning strategies.

Please correct me if I have misunderstood your code.

Best,
Xuansheng

Missing dependency: accelerate module

When I try to run the code via main.py, the following error is raised:

File "C:\Users\DLGA\Desktop\xuansheng\ProGen\model_util.py", line 1, in <module>
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
ModuleNotFoundError: No module named 'accelerate'
