
SynPG

Code for our EACL-2021 paper "Generating Syntactically Controlled Paraphrases without Using Annotated Parallel Pairs".

If you find that the code is useful in your research, please consider citing our paper.

@inproceedings{Huang2021synpg,
    author    = {Kuan-Hao Huang and Kai-Wei Chang},
    title     = {Generating Syntactically Controlled Paraphrases without Using Annotated Parallel Pairs},
    booktitle = {Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
    year      = {2021},
}

Setup

  • Python 3.7.10
  • Install the dependencies:

$ pip install -r requirements.txt

Pretrained Models

Demo

python generate.py \
    --synpg_model_path ./model/pretrained_synpg.pt \
    --pg_model_path ./model/pretrained_parse_generator.pt \
    --input_path ./demo/input.txt \
    --output_path ./demo/output.txt \
    --bpe_codes_path ./data/bpe.codes \
    --bpe_vocab_path ./data/vocab.txt \
    --bpe_vocab_thresh 50 \
    --dictionary_path ./data/dictionary.pkl \
    --max_sent_len 40 \
    --max_tmpl_len 100 \
    --max_synt_len 160 \
    --temp 0.5 \
    --seed 0
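Both the demo and the training commands accept a --temp flag, which controls how sharp the sampling distribution is. The sketch below is an illustrative temperature-scaled softmax, not code taken from this repository; the exact way --temp is applied inside the model may differ.

```python
import math

def softmax_with_temperature(logits, temp=0.5):
    # temp < 1 sharpens the distribution; temp > 1 flattens it.
    # (Illustrative only -- the repository's exact use of --temp may differ.)
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# The same logits become more peaked at temp=0.5 than at temp=1.0:
sharp = softmax_with_temperature([2.0, 1.0, 0.0], temp=0.5)
flat = softmax_with_temperature([2.0, 1.0, 0.0], temp=1.0)
```

With temp=0.5 the highest-scoring option receives noticeably more probability mass, so generation is closer to greedy decoding; larger values make the outputs more diverse.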

Training

  • Download data and put them under ./data/
  • Download glove.840B.300d.txt and put it under ./data/
  • Run scripts/train_synpg.sh or the following command to train SynPG
python train_synpg.py \
    --model_dir ./model \
    --output_dir ./output \
    --bpe_codes_path ./data/bpe.codes \
    --bpe_vocab_path ./data/vocab.txt \
    --bpe_vocab_thresh 50 \
    --dictionary_path ./data/dictionary.pkl \
    --train_data_path ./data/train_data.h5 \
    --valid_data_path ./data/valid_data.h5 \
    --emb_path ./data/glove.840B.300d.txt \
    --max_sent_len 40 \
    --max_synt_len 160 \
    --word_dropout 0.4 \
    --n_epoch 5 \
    --batch_size 64 \
    --lr 1e-4 \
    --weight_decay 1e-5 \
    --log_interval 250 \
    --gen_interval 5000 \
    --save_interval 10000 \
    --temp 0.5 \
    --seed 0
  • Run scripts/train_parse_generator.sh or the following command to train the parse generator
python train_parse_generator.py \
    --model_dir ./model \
    --output_dir ./output_pg \
    --dictionary_path ./data/dictionary.pkl \
    --train_data_path ./data/train_data.h5 \
    --valid_data_path ./data/valid_data.h5 \
    --max_sent_len 40 \
    --max_tmpl_len 100 \
    --max_synt_len 160 \
    --word_dropout 0.2 \
    --n_epoch 5 \
    --batch_size 32 \
    --lr 1e-4 \
    --weight_decay 1e-5 \
    --log_interval 250 \
    --gen_interval 5000 \
    --save_interval 10000 \
    --temp 0.5 \
    --seed 0

Evaluating

  • Download testing data and put them under ./data/
  • Run scripts/eval.sh or the following command to evaluate SynPG
python eval_generate.py \
  --test_data ./data/test_data_mrpc.h5 \
  --dictionary_path ./data/dictionary.pkl \
  --model_path ./model/pretrained_synpg.pt \
  --output_dir ./eval/ \
  --bpe_codes ./data/bpe.codes \
  --bpe_vocab ./data/vocab.txt \
  --bpe_vocab_thresh 50 \
  --max_sent_len 40 \
  --max_synt_len 160 \
  --word_dropout 0.0 \
  --batch_size 64 \
  --temp 0.5 \
  --seed 0

python eval_calculate_bleu.py --ref ./eval/target_sents.txt --input ./eval/outputs.txt
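eval_calculate_bleu.py scores the generated paraphrases against the references with BLEU. As a rough picture of what that score measures, here is a minimal corpus-level BLEU (uniform weights over 1- to 4-grams plus a brevity penalty); it is a simplified stand-in, and the actual script may tokenize and smooth differently.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(references, hypotheses, max_n=4):
    # Minimal corpus-level BLEU: clipped n-gram precision with uniform
    # weights, times a brevity penalty. Not the repository's exact script.
    clipped = [0] * max_n
    total = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref, hyp = ref.split(), hyp.split()
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            r, h = ngrams(ref, n), ngrams(hyp, n)
            clipped[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += max(len(hyp) - n + 1, 0)
    if hyp_len == 0 or min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)
```

An exact match scores 1.0, and a hypothesis with no overlapping 4-grams scores 0.0; published BLEU numbers like the table below are this quantity scaled to 0-100.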

The BLEU scores should be similar to the following.

                  MRPC   PAN    Quora
SynPG             26.2   27.3   33.2
SynPG-Large       36.2   27.1   34.7

Fine-Tuning

One main advantage of SynPG is that it learns to generate paraphrases without using any paraphrase pairs. Therefore, when raw texts from the target domain are available, SynPG can be fine-tuned on them directly, without ground-truth paraphrases. As shown in our paper, this fine-tuning step significantly improves the quality of paraphrase generation in the target domain.
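The key point is that each training example is built from a single sentence: the model receives the sentence's words without their order plus its parse, and is trained to reconstruct the sentence itself. A toy illustration of this self-supervised setup (the field names are mine, not the repository's data format):

```python
def make_training_example(sentence, parse):
    # Order-free word set = semantic signal; parse = syntactic signal;
    # the original sentence is the reconstruction target.
    # (Toy illustration; the repository stores data differently, in h5 files.)
    words = sorted(set(sentence.split()))
    return {"semantics": words, "syntax": parse, "target": sentence}

ex = make_training_example(
    "the cat sat on the mat",
    "(ROOT (S (NP (DT) (NN)) (VP (VBD) (PP (IN) (NP (DT) (NN))))))",
)
```

Because no paraphrase of the input is ever needed, the same objective applies unchanged to unannotated target-domain text, which is what the fine-tuning command below exploits.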

  • Download testing data and put them under ./data/
  • Run scripts/finetune_synpg.sh or the following command to fine-tune SynPG
python finetune_synpg.py \
  --model_dir ./model_finetune \
  --model_path ./model/pretrained_synpg.pt \
  --output_dir ./output_finetune \
  --bpe_codes_path ./data/bpe.codes \
  --bpe_vocab_path ./data/vocab.txt \
  --bpe_vocab_thresh 50 \
  --dictionary_path ./data/dictionary.pkl \
  --train_data_path ./data/test_data_mrpc.h5 \
  --valid_data_path ./data/test_data_mrpc.h5 \
  --max_sent_len 40 \
  --max_synt_len 160 \
  --word_dropout 0.4 \
  --n_epoch 50 \
  --batch_size 64 \
  --lr 1e-4 \
  --weight_decay 1e-5 \
  --log_interval 250 \
  --gen_interval 5000 \
  --save_interval 10000 \
  --temp 0.5 \
  --seed 0

The table below shows the significant improvement in BLEU scores from fine-tuning.

                  MRPC   PAN    Quora
SynPG             26.2   27.3   33.2
SynPG-Large       36.2   27.1   34.7
SynPG-Fine-Tune   48.7   37.7   49.8

Author

Kuan-Hao Huang / @ej0cl6


Issues

License

Could you please add a license to the code? Thanks!

Selecting top-1 output

The generation script produces four candidate outputs for each input. What is the "correct" way to select the preferred top-1 output? Is it simply the first generated one?

Downloading "data" always fails from China

Hello, I am in China. When I click the "data" link to download the data, the download always fails at around 500 MB. Could you provide the original page hosting the data so I can download it from there, or share a Baidu Netdisk link?

Errors for train_synpg.sh and train_parse_generator.sh

Following the Training instructions with the downloaded data, train_synpg.sh fails:

==== start training ====
Traceback (most recent call last):
  File "train_synpg.py", line 285, in <module>
    train(epoch, model, train_data, valid_data, train_loader, valid_loader, optimizer, criterion, dictionary, bpe, args)
  File "train_synpg.py", line 90, in train
    sent_ = bpe.segment(sent_).split()
  File "/home/wmk/synpg/subwordnmt/apply_bpe.py", line 59, in segment
    new_word = [out for segment in self._isolate_glossaries(word)
  File "/home/wmk/synpg/subwordnmt/apply_bpe.py", line 67, in <listcomp>
    self.glossaries)]
  File "/home/wmk/synpg/subwordnmt/apply_bpe.py", line 144, in encode
    word = tuple(orig[:-1]) + ( orig[-1] + '</w>',)
TypeError: unsupported operand type(s) for +: 'int' and 'str'

Following the Training instructions, train_parse_generator.sh fails:

==== loading data ====
number of train examples: 45377426
number of valid examples: 12800
==== start training ====
Traceback (most recent call last):
  File "train_parse_generator.py", line 282, in <module>
    train(epoch, model, train_data, valid_data, train_loader, valid_loader, optimizer, criterion, dictionary, args)
  File "train_parse_generator.py", line 81, in train
    synt_ = ParentedTree.fromstring(synt_)
  File "/opt/conda/envs/synpg/lib/python3.7/site-packages/nltk/tree.py", line 669, in fromstring
    for match in token_re.finditer(s):
TypeError: cannot use a string pattern on a bytes-like object
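Both tracebacks point at the same likely cause: under Python 3, strings read back from HDF5 datasets often come out as bytes (or numeric arrays), which then break string-only code such as the BPE segmenter and ParentedTree.fromstring. A hedged sketch of the kind of decoding that avoids both errors (the helper name is mine, not the repository's):

```python
def to_str(x):
    # Decode bytes (as returned by h5py under Python 3) to str;
    # pass str through unchanged. (Hypothetical helper, not repo code.)
    return x.decode("utf-8") if isinstance(x, bytes) else x

sent = to_str(b"the cat sat on the mat")            # bytes from an h5 dataset
synt = to_str("(ROOT (S (NP (DT the) (NN cat))))")  # already a str
```

Applying such a conversion to the sentences and parses right after loading them from the h5 files should make them safe to pass to bpe.segment and ParentedTree.fromstring.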

Data preprocessing

Hi,

Could you please share the command you used to generate the parses used as input, along with any other required preprocessing steps? I have some data I would like to test your system on; how should I generate an input file in the correct format?

Thanks!
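As a guess at one part of the preprocessing, the syntactic input appears to be a linearized constituency parse with the word leaves removed, leaving only the bracketed template. A small sketch of stripping leaves from a Stanford-style parse string (the authors' exact pipeline may well differ):

```python
import re

def strip_leaves(parse):
    # Turn "(NN cat)" into "(NN)": drop the word after each POS tag,
    # keeping only the bracketed syntactic template.
    # (A guess at the required format; the authors' pipeline may differ.)
    return re.sub(r"\(([^\s()]+) [^\s()]+\)", r"(\1)", parse)

tmpl = strip_leaves("(ROOT (S (NP (DT the) (NN cat)) (VP (VBD sat))))")
# tmpl == "(ROOT (S (NP (DT) (NN)) (VP (VBD))))"
```

The parses themselves would come from a constituency parser such as Stanford CoreNLP, as mentioned in the next issue.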

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Hi, when I run train_synpg.sh, I get an error:

Traceback (most recent call last):
  File "train_synpg.py", line 285, in <module>
    train(epoch, model, train_data, valid_data, train_loader, valid_loader, optimizer, criterion, dictionary, bpe, args)
  File "train_synpg.py", line 90, in train
    sent_ = bpe.segment(sent_).split()
  File "/home/dingp/synpg-master/subwordnmt/apply_bpe.py", line 59, in segment
    new_word = [out for segment in self._isolate_glossaries(word)
  File "/home/dingp/synpg-master/subwordnmt/apply_bpe.py", line 60, in <listcomp>
    for out in encode(segment,
  File "/home/dingp/synpg-master/subwordnmt/apply_bpe.py", line 144, in encode
    word = tuple(orig[:-1]) + ( orig[-1] + '</w>',)
TypeError: unsupported operand type(s) for +: 'int' and 'str'

It seems something is wrong with the BPE. Could you help me? Thanks!

A bug in the generate.py

Hi, nice job! I found a bug in generate.py:

Line 90,

eos_pos = eos_pos[0]+1 if len(eos_pos) > 0 else len(idx)

the variable 'idx' is not defined.

Data preprocessing process

Hi,
We want to test this model on a new dataset. Could you open-source your preprocessing code? More specifically, how do we generate the h5 files containing sentence and syntax information from files processed by StanfordCoreNLP?
Many thanks!
