
SynPG

Code for our EACL-2021 paper "Generating Syntactically Controlled Paraphrases without Using Annotated Parallel Pairs".

If you find that the code is useful in your research, please consider citing our paper.

@inproceedings{Huang2021synpg,
    author    = {Kuan-Hao Huang and Kai-Wei Chang},
    title     = {Generating Syntactically Controlled Paraphrases without Using Annotated Parallel Pairs},
    booktitle = {Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
    year      = {2021},
}

Setup

  • Python 3.7.10
  • Install the dependencies:

$ pip install -r requirements.txt

Pretrained Models

Demo

python generate.py \
    --synpg_model_path ./model/pretrained_synpg.pt \
    --pg_model_path ./model/pretrained_parse_generator.pt \
    --input_path ./demo/input.txt \
    --output_path ./demo/output.txt \
    --bpe_codes_path ./data/bpe.codes \
    --bpe_vocab_path ./data/vocab.txt \
    --bpe_vocab_thresh 50 \
    --dictionary_path ./data/dictionary.pkl \
    --max_sent_len 40 \
    --max_tmpl_len 100 \
    --max_synt_len 160 \
    --temp 0.5 \
    --seed 0
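Both the demo and the training commands accept a --temp flag, which controls how sharp the sampling distribution is. The sketch below is an illustrative temperature-scaled softmax, not code taken from this repository; the exact way --temp is applied inside the model may differ.

```python
import math

def softmax_with_temperature(logits, temp=0.5):
    # temp < 1 sharpens the distribution; temp > 1 flattens it.
    # (Illustrative only -- the repository's exact use of --temp may differ.)
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# The same logits become more peaked at temp=0.5 than at temp=1.0:
sharp = softmax_with_temperature([2.0, 1.0, 0.0], temp=0.5)
flat = softmax_with_temperature([2.0, 1.0, 0.0], temp=1.0)
```

With temp=0.5 the highest-scoring option receives noticeably more probability mass, so generation is closer to greedy decoding; larger values make the outputs more diverse.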

Training

  • Download data and put them under ./data/
  • Download glove.840B.300d.txt and put it under ./data/
  • Run scripts/train_synpg.sh or the following command to train SynPG
python train_synpg.py \
    --model_dir ./model \
    --output_dir ./output \
    --bpe_codes_path ./data/bpe.codes \
    --bpe_vocab_path ./data/vocab.txt \
    --bpe_vocab_thresh 50 \
    --dictionary_path ./data/dictionary.pkl \
    --train_data_path ./data/train_data.h5 \
    --valid_data_path ./data/valid_data.h5 \
    --emb_path ./data/glove.840B.300d.txt \
    --max_sent_len 40 \
    --max_synt_len 160 \
    --word_dropout 0.4 \
    --n_epoch 5 \
    --batch_size 64 \
    --lr 1e-4 \
    --weight_decay 1e-5 \
    --log_interval 250 \
    --gen_interval 5000 \
    --save_interval 10000 \
    --temp 0.5 \
    --seed 0
  • Run scripts/train_parse_generator.sh or the following command to train the parse generator
python train_parse_generator.py \
    --model_dir ./model \
    --output_dir ./output_pg \
    --dictionary_path ./data/dictionary.pkl \
    --train_data_path ./data/train_data.h5 \
    --valid_data_path ./data/valid_data.h5 \
    --max_sent_len 40 \
    --max_tmpl_len 100 \
    --max_synt_len 160 \
    --word_dropout 0.2 \
    --n_epoch 5 \
    --batch_size 32 \
    --lr 1e-4 \
    --weight_decay 1e-5 \
    --log_interval 250 \
    --gen_interval 5000 \
    --save_interval 10000 \
    --temp 0.5 \
    --seed 0

Evaluating

  • Download testing data and put them under ./data/
  • Run scripts/eval.sh or the following command to evaluate SynPG
python eval_generate.py \
  --test_data ./data/test_data_mrpc.h5 \
  --dictionary_path ./data/dictionary.pkl \
  --model_path ./model/pretrained_synpg.pt \
  --output_dir ./eval/ \
  --bpe_codes ./data/bpe.codes \
  --bpe_vocab ./data/vocab.txt \
  --bpe_vocab_thresh 50 \
  --max_sent_len 40 \
  --max_synt_len 160 \
  --word_dropout 0.0 \
  --batch_size 64 \
  --temp 0.5 \
  --seed 0

python eval_calculate_bleu.py --ref ./eval/target_sents.txt --input ./eval/outputs.txt
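eval_calculate_bleu.py scores the generated paraphrases against the references with BLEU. As a rough picture of what that score measures, here is a minimal corpus-level BLEU (uniform weights over 1- to 4-grams plus a brevity penalty); it is a simplified stand-in, and the actual script may tokenize and smooth differently.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(references, hypotheses, max_n=4):
    # Minimal corpus-level BLEU: clipped n-gram precision with uniform
    # weights, times a brevity penalty. Not the repository's exact script.
    clipped = [0] * max_n
    total = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref, hyp = ref.split(), hyp.split()
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            r, h = ngrams(ref, n), ngrams(hyp, n)
            clipped[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += max(len(hyp) - n + 1, 0)
    if hyp_len == 0 or min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)
```

An exact match scores 1.0, and a hypothesis with no overlapping 4-grams scores 0.0; published BLEU numbers like the table below are this quantity scaled to 0-100.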

The BLEU scores should be similar to the following.

                  MRPC   PAN    Quora
SynPG             26.2   27.3   33.2
SynPG-Large       36.2   27.1   34.7

Fine-Tuning

One main advantage of SynPG is that it learns to generate paraphrases without using any paraphrase pairs. Therefore, when raw texts from the target domain are available, SynPG can be fine-tuned on them directly, without ground-truth paraphrases. As shown in our paper, this fine-tuning step significantly improves the quality of paraphrase generation in the target domain.
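The key point is that each training example is built from a single sentence: the model receives the sentence's words without their order plus its parse, and is trained to reconstruct the sentence itself. A toy illustration of this self-supervised setup (the field names are mine, not the repository's data format):

```python
def make_training_example(sentence, parse):
    # Order-free word set = semantic signal; parse = syntactic signal;
    # the original sentence is the reconstruction target.
    # (Toy illustration; the repository stores data differently, in h5 files.)
    words = sorted(set(sentence.split()))
    return {"semantics": words, "syntax": parse, "target": sentence}

ex = make_training_example(
    "the cat sat on the mat",
    "(ROOT (S (NP (DT) (NN)) (VP (VBD) (PP (IN) (NP (DT) (NN))))))",
)
```

Because no paraphrase of the input is ever needed, the same objective applies unchanged to unannotated target-domain text, which is what the fine-tuning command below exploits.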

  • Download testing data and put them under ./data/
  • Run scripts/finetune_synpg.sh or the following command to fine-tune SynPG
python finetune_synpg.py \
  --model_dir ./model_finetune \
  --model_path ./model/pretrained_synpg.pt \
  --output_dir ./output_finetune \
  --bpe_codes_path ./data/bpe.codes \
  --bpe_vocab_path ./data/vocab.txt \
  --bpe_vocab_thresh 50 \
  --dictionary_path ./data/dictionary.pkl \
  --train_data_path ./data/test_data_mrpc.h5 \
  --valid_data_path ./data/test_data_mrpc.h5 \
  --max_sent_len 40 \
  --max_synt_len 160 \
  --word_dropout 0.4 \
  --n_epoch 50 \
  --batch_size 64 \
  --lr 1e-4 \
  --weight_decay 1e-5 \
  --log_interval 250 \
  --gen_interval 5000 \
  --save_interval 10000 \
  --temp 0.5 \
  --seed 0

The table below shows the significant improvement in BLEU scores from fine-tuning.

                  MRPC   PAN    Quora
SynPG             26.2   27.3   33.2
SynPG-Large       36.2   27.1   34.7
SynPG-Fine-Tune   48.7   37.7   49.8

Author

Kuan-Hao Huang / @ej0cl6


Issues

License

Could you please add a license to the code? Thanks!

Selecting top-1 output

The generation script produces four candidate outputs for each input. What is the "correct" way to select the preferred top-1 output? Is it simply the first generated one?

Downloading "data" always fails from China

Hello, I am in China. When I click the "data" link to download the data, the download always fails at around 500 MB. Could you provide the original page hosting the data so I can download it from there, or share a Baidu Netdisk link?

Errors for train_synpg.sh and train_parse_generator.sh

Following the Training instructions with the downloaded data, train_synpg.sh fails:

==== start training ====
Traceback (most recent call last):
  File "train_synpg.py", line 285, in <module>
    train(epoch, model, train_data, valid_data, train_loader, valid_loader, optimizer, criterion, dictionary, bpe, args)
  File "train_synpg.py", line 90, in train
    sent_ = bpe.segment(sent_).split()
  File "/home/wmk/synpg/subwordnmt/apply_bpe.py", line 59, in segment
    new_word = [out for segment in self._isolate_glossaries(word)
  File "/home/wmk/synpg/subwordnmt/apply_bpe.py", line 67, in <listcomp>
    self.glossaries)]
  File "/home/wmk/synpg/subwordnmt/apply_bpe.py", line 144, in encode
    word = tuple(orig[:-1]) + ( orig[-1] + '</w>',)
TypeError: unsupported operand type(s) for +: 'int' and 'str'

Following the Training instructions, train_parse_generator.sh fails:

==== loading data ====
number of train examples: 45377426
number of valid examples: 12800
==== start training ====
Traceback (most recent call last):
  File "train_parse_generator.py", line 282, in <module>
    train(epoch, model, train_data, valid_data, train_loader, valid_loader, optimizer, criterion, dictionary, args)
  File "train_parse_generator.py", line 81, in train
    synt_ = ParentedTree.fromstring(synt_)
  File "/opt/conda/envs/synpg/lib/python3.7/site-packages/nltk/tree.py", line 669, in fromstring
    for match in token_re.finditer(s):
TypeError: cannot use a string pattern on a bytes-like object
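Both tracebacks point at the same likely cause: under Python 3, strings read back from HDF5 datasets often come out as bytes (or numeric arrays), which then break string-only code such as the BPE segmenter and ParentedTree.fromstring. A hedged sketch of the kind of decoding that avoids both errors (the helper name is mine, not the repository's):

```python
def to_str(x):
    # Decode bytes (as returned by h5py under Python 3) to str;
    # pass str through unchanged. (Hypothetical helper, not repo code.)
    return x.decode("utf-8") if isinstance(x, bytes) else x

sent = to_str(b"the cat sat on the mat")            # bytes from an h5 dataset
synt = to_str("(ROOT (S (NP (DT the) (NN cat))))")  # already a str
```

Applying such a conversion to the sentences and parses right after loading them from the h5 files should make them safe to pass to bpe.segment and ParentedTree.fromstring.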

Data preprocessing

Hi,

Could you please share the command you used to generate the parses used as input, along with any other required preprocessing steps? I have some data I would like to test your system on; how should I generate an input file in the correct format?

Thanks!
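As a guess at one part of the preprocessing, the syntactic input appears to be a linearized constituency parse with the word leaves removed, leaving only the bracketed template. A small sketch of stripping leaves from a Stanford-style parse string (the authors' exact pipeline may well differ):

```python
import re

def strip_leaves(parse):
    # Turn "(NN cat)" into "(NN)": drop the word after each POS tag,
    # keeping only the bracketed syntactic template.
    # (A guess at the required format; the authors' pipeline may differ.)
    return re.sub(r"\(([^\s()]+) [^\s()]+\)", r"(\1)", parse)

tmpl = strip_leaves("(ROOT (S (NP (DT the) (NN cat)) (VP (VBD sat))))")
# tmpl == "(ROOT (S (NP (DT) (NN)) (VP (VBD))))"
```

The parses themselves would come from a constituency parser such as Stanford CoreNLP, as mentioned in the next issue.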

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Hi, when I run train_synpg.sh, I get an error:

Traceback (most recent call last):
  File "train_synpg.py", line 285, in <module>
    train(epoch, model, train_data, valid_data, train_loader, valid_loader, optimizer, criterion, dictionary, bpe, args)
  File "train_synpg.py", line 90, in train
    sent_ = bpe.segment(sent_).split()
  File "/home/dingp/synpg-master/subwordnmt/apply_bpe.py", line 59, in segment
    new_word = [out for segment in self._isolate_glossaries(word)
  File "/home/dingp/synpg-master/subwordnmt/apply_bpe.py", line 60, in <listcomp>
    for out in encode(segment,
  File "/home/dingp/synpg-master/subwordnmt/apply_bpe.py", line 144, in encode
    word = tuple(orig[:-1]) + ( orig[-1] + '</w>',)
TypeError: unsupported operand type(s) for +: 'int' and 'str'

It seems something is wrong with the BPE. Could you help me? Thanks!

A bug in the generate.py

Hi, nice job! I found a bug in generate.py:

Line 90,

eos_pos = eos_pos[0]+1 if len(eos_pos) > 0 else len(idx)

the variable 'idx' is not defined.

Data preprocessing process

Hi,
We want to test this model on a new dataset. Could you open-source your preprocessing code? More specifically, how do we generate the h5 files containing sentence and syntax information from files processed by StanfordCoreNLP?
Many thanks!
