compound-pcfg's Introduction

Compound Probabilistic Context-Free Grammars

Code for the paper:
Compound Probabilistic Context-Free Grammars for Grammar Induction
Yoon Kim, Chris Dyer, Alexander Rush
ACL 2019

The preprocessed datasets, trained models, and the datasets parsed with the trained models can be found here.

Dependencies

The code was tested with Python 3.6 and PyTorch 1.0. The nltk package is also required when creating the processed data from the raw PTB dataset.

Data

The processed version of PTB can be downloaded at the above link. This contains the train/validation/test sets, as well as the vocabulary used (ptb.dict). If you want to create this from scratch, you can run

python process_ptb.py --ptb_path PATH-TO-PTB/parsed/mrg/wsj --output_path DATA-FOLDER

where PATH-TO-PTB is the location of your PTB corpus and DATA-FOLDER is where the processed trees are saved. This will create ptb-train.txt, ptb-valid.txt, ptb-test.txt in DATA-FOLDER.

Now run the preprocessing script

python preprocess.py --trainfile data/ptb-train.txt --valfile data/ptb-valid.txt \
    --testfile data/ptb-test.txt --outputfile data/ptb --vocabsize 10000 --lowercase 1 --replace_num 1

See preprocess.py for more options (e.g. batch size). Running this will save the following files in the data/ folder: ptb-train.pkl, ptb-val.pkl, ptb-test.pkl, ptb.dict. Here ptb.dict is the word-idx mapping, and you can change the output folder/name by changing the argument to --outputfile.
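As a rough illustration of what a size-limited word-idx mapping such as ptb.dict involves, here is a minimal sketch. It is not the code in preprocess.py: the helper names build_vocab and encode, and the special tokens <pad> and <unk>, are assumptions made for this example only.

```python
from collections import Counter

def build_vocab(sentences, vocab_size, specials=("<pad>", "<unk>")):
    """Map the vocab_size most frequent tokens to integer ids (hypothetical helper)."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    # Special tokens come first; their names are assumptions for this sketch.
    word2idx = {tok: i for i, tok in enumerate(specials)}
    for tok, _ in counts.most_common(vocab_size):
        word2idx[tok] = len(word2idx)
    return word2idx

def encode(sentence, word2idx):
    """Replace each token by its id, falling back to the <unk> id."""
    unk = word2idx["<unk>"]
    return [word2idx.get(tok, unk) for tok in sentence.split()]

vocab = build_vocab(["the cat sat", "the dog sat"], vocab_size=2)
ids = encode("the bird sat", vocab)
```

Tokens outside the most frequent vocab_size words map to the unknown id, which is why a fixed vocabulary file needs to be reused when other data is encoded for the same model.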

Training

To train the compound PCFG, run

python train.py --train_file data/ptb-train.pkl --val_file data/ptb-val.pkl \
    --save_path compound-pcfg.pt --gpu 0

where --save_path is where you want to save the model, and --gpu 0 is for using the first GPU in the cluster (the mapping from PyTorch GPU index to your cluster's GPU index may vary).

To train the neural PCFG that does not use continuous latent variables, run

python train.py --z_dim 0 --train_file data/ptb-train.pkl --val_file data/ptb-val.pkl \
    --save_path neural-pcfg.pt --gpu 0

Training will take 2-4 days depending on your setup.

Evaluation

To evaluate the trained model on the test set, run

python eval.py --model_file compound-pcfg.pt --data_file data/ptb-test.txt \
    --out_file pred-parse.txt --gold_out_file gold-parse.txt --gpu 0

where --out_file is where you want to output the predicted parses. This will calculate the F1 scores between the predicted trees and the gold trees in ptb-test.txt.

To parse a new set of sentences, run

python parse.py --model_file compound-pcfg.pt --data_file sents-to-be-parsed.txt \
    --out_file pred-parse.txt --gpu 0

Note that sents-to-be-parsed.txt should have one sentence per line, and be preprocessed in a way that roughly matches the processing in process_ptb.py (e.g. no punctuation).
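A minimal sketch of how raw sentences might be brought closer to that format is shown below. It is an approximation, not the logic of process_ptb.py: prepare_sentence is a hypothetical helper, and lowercasing and mapping numbers to the placeholder N are assumptions modeled on the --lowercase 1 --replace_num 1 options used during preprocessing.

```python
import re
import string

def prepare_sentence(line):
    """Return a whitespace-tokenized line without punctuation tokens (hypothetical helper)."""
    kept = []
    for tok in line.strip().split():
        if all(ch in string.punctuation for ch in tok):
            continue  # drop tokens made only of punctuation characters
        tok = tok.lower()
        # Assumption: numeric tokens collapse to a single placeholder symbol.
        if re.fullmatch(r"[0-9][0-9.,]*", tok):
            tok = "N"
        kept.append(tok)
    return " ".join(kept)

cleaned = prepare_sentence("The cat paid 3,000 dollars .")
```

Applying this to each input line and writing one result per line gives a file in the shape parse.py expects; whether the exact token rules match the trained model's vocabulary should be checked against process_ptb.py.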

To just evaluate F1 given the trees, run (for example)

python compare_trees.py --tree1 data/parsed-data/ptb-test-gold-filtered.txt \
    --tree2 data/parsed-data/ptb-test-compound-pcfg.txt

Note regarding F1 calculation

To be comparable to the numbers reported in PRPN/Ordered Neurons papers, we use the original sentence F1 evaluation code based on L83-89 of the PRPN repo. This has quirky behavior in corner cases where the gold tree is over a sentence of length > 2 but only has the sentence-level trivial span. In this case the sentence F1 for that example could be potentially nonzero according to the code. Corpus F1 does not have this issue.
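The difference between the two averages can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code itself: spans are (start, end) tuples with the trivial sentence-level span already removed, and the rule that an empty gold set gets recall 1 (and precision 1 if the prediction is also empty) is an assumed convention for the corner case described above.

```python
def sentence_f1(pred, gold, eps=1e-8):
    """F1 for one sentence, computed from its own span sets."""
    overlap = len(pred & gold)
    prec = overlap / (len(pred) + eps)
    reca = overlap / (len(gold) + eps)
    if len(gold) == 0:  # assumed corner-case convention
        reca = 1.0
        if len(pred) == 0:
            prec = 1.0
    return 2 * prec * reca / (prec + reca + eps)

def corpus_f1(pairs, eps=1e-8):
    """F1 from true/false positives pooled over all sentences."""
    tp = fp = fn = 0
    for pred, gold in pairs:
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    prec = tp / (tp + fp + eps)
    reca = tp / (tp + fn + eps)
    return 2 * prec * reca / (prec + reca + eps)

regular = ({(0, 1), (0, 2)}, {(0, 1), (1, 2)})
empty_gold = (set(), set())
f1_regular = sentence_f1(*regular)
f1_corner = sentence_f1(*empty_gold)
f1_pooled_corner = corpus_f1([empty_gold])
```

Under this convention a sentence with no gold spans can still receive a sentence-level score, whereas it adds nothing to the pooled counts used for corpus F1.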

Trained models

We provide the best neural/compound PCFG models, under the data/trained-model folder. These can be used for eval.py or parse.py.

Parsed Datasets

We also provide parsed train/val/test sets for the best run of each model for further analysis and RNNG training. These can be found under the data/parsed-data folder when you download the processed datasets from above:

  • ptb-{train/valid/test}-gold-filtered.txt: Gold trees where length 1 sentences have been filtered out.
  • ptb-{train/valid/test}-{neural-pcfg/compound-pcfg/prpn/on}.txt: Predicted trees from the best run of different models. For example ptb-test-compound-pcfg.txt is the test set parsed with the compound PCFG.
  • ptb-test-{neural-pcfg/compound-pcfg/prpn/on}-{rnng/urnng}.txt: Predicted trees for the test set only, for RNNG and URNNG (i.e. train on induced trees with RNNG, then fine-tune with the URNNG objective). For example ptb-test-compound-pcfg-urnng.txt contains the predicted trees from an RNNG trained on compound PCFG trees then fine-tuned with the URNNG objective.

Results

Here are the sentence-level F1 numbers on the PTB test set for the models that performed best on the validation set. "F1 with Induced URNNG" indicates training an RNNG on the induced trees and then fine-tuning with the URNNG objective (see below). This gave improvements across the board.

Model              F1      F1 with Induced URNNG
PRPN               47.9    51.5
Ordered Neurons    50.0    55.1
Neural PCFG        52.6    58.7
Compound PCFG      60.1    66.9

Training Recurrent Neural Network Grammars (RNNG) on Induced Trees

Training the RNNG on induced trees and fine-tuning with the Unsupervised RNNG (URNNG) objective uses code from Unsupervised Recurrent Neural Network Grammars. The commands below should be run from the urnng folder.

First preprocess the training set with induced trees, for example with the compound PCFG:

python preprocess.py --batchsize 16 --vocabfile data/ptb.dict --lowercase 1 --replace_num 1 \
    --trainfile data/parsed-data/ptb-train-compound-pcfg.txt \
    --valfile data/parsed-data/ptb-valid-compound-pcfg.txt \
    --testfile data/parsed-data/ptb-test-compound-pcfg.txt \
    --outputfile data/ptb-comp-pcfg

Note the use of --vocabfile to use the same vocabulary as the one used in the above experiments.

Then use the above files to train an RNNG (and fine-tune with URNNG) using instructions from the URNNG folder, e.g.

python train.py --train_file /compound-pcfg/data/ptb-comp-pcfg-train.pkl \
    --val_file /compound-pcfg/data/ptb-comp-pcfg-val.pkl --save_path compound-pcfg-rnng.pt \
    --mode supervised --train_q_epochs 18 --count_eos_ppl 1 --gpu 0

For this version of PTB we count the </s> token in PPL calculations, hence the use of --count_eos_ppl 1. Note that this only affects evaluation and not training.
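The effect of counting </s> can be seen in a small worked sketch. The perplexity helper and its arguments are hypothetical and only illustrate the arithmetic: perplexity is the exponential of the total negative log-likelihood divided by the number of predicted tokens, and counting </s> adds one token per sentence to that denominator.

```python
import math

def perplexity(total_nll, sentence_lengths, count_eos=True):
    """exp(total NLL / token count), optionally counting one </s> per sentence."""
    num_words = sum((n + 1) if count_eos else n for n in sentence_lengths)
    return math.exp(total_nll / num_words)

lengths = [5, 5]
total_nll = 12 * math.log(10)  # toy value chosen so the result is easy to read
ppl_with_eos = perplexity(total_nll, lengths, count_eos=True)
ppl_without_eos = perplexity(total_nll, lengths, count_eos=False)
```

With the same total likelihood, the reported number changes with the token count, so perplexities are only comparable between runs that use the same counting convention.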

For fine-tuning:

python train.py --train_from compound-pcfg-rnng.pt --save_path compound-pcfg-urnng.pt \
    --train_file /compound-pcfg/data/ptb-comp-pcfg-train.pkl \
    --val_file /compound-pcfg/data/ptb-comp-pcfg-val.pkl \
    --mode unsupervised --train_q_epochs 10 --epochs 10 --count_eos_ppl 1 --lr 0.1 --gpu 0 --kl_warmup 0

For evaluation:

python eval_ppl.py --model_file compound-pcfg-urnng.pt --samples 1000 --is_temp 2 --gpu 0 \
    --test_file /compound-pcfg/data/ptb-test.pkl --count_eos_ppl 1

For parsing F1:

python parse.py --model_file compound-pcfg-urnng.pt --data_file /compound-pcfg/data/ptb-test.txt \
    --out_file pred-parse.txt --gold_out_file gold-parse.txt --gpu 0 --lowercase 1 --replace_num 1

Miscellaneous Stuff

  • LSTM LM training in Table 3 uses code/hyperparameters from here.
  • For Table 3, LSTM/RNNG is trained with SGD, while PRPN/ON is trained with Adam. However, it seems like models trained with ASGD, as in the AWD-LSTM, do quite a bit better in terms of perplexity. Thanks to Freda for pointing this out!
  • Curriculum learning based on length does not always seem to help, and in subsequent experiments I am seeing similar results without it. For example, one could use --max_length 45 --final_max_length 45 to train on all sentences of length up to 45 from the first epoch (limit on 45 is due to memory reasons).
  • Yanpeng has a much faster implementation of compound PCFGs utilizing the pytorch-struct library. Check it out here!

Acknowledgements

Much of our preprocessing and evaluation code is based on the following repositories:

License

MIT

compound-pcfg's Issues

No root symbol in MAP parse trees

The root symbol is not considered when getting the MAP parse tree both in the paper and in the code:

compound-pcfg/eval.py

Lines 176 to 189 in 1c0078c

for i in range(length):
  tag = "T-" + str(int(argmax_tags[i].item())+1)
  pred_tree[i] = "(" + tag + " " + sent_orig[i] + ")"
for k in np.arange(1, length):
  for s in np.arange(length):
    t = s + k
    if t > length - 1: break
    if binary_matrix[s][t] == 1:
      nt = "NT-" + str(int(label_matrix[s][t])+1)
      span = "(" + nt + " " + pred_tree[s] + " " + pred_tree[t] + ")"
      pred_tree[s] = span
      pred_tree[t] = span
pred_tree = pred_tree[0]
pred_out.write(pred_tree.strip() + "\n")

But according to the Viterbi algorithm and the majority of gold parse trees in the treebank, there should be a root symbol (although I haven't looked at the Viterbi implementation here in PCFG.py). Why don't we have a root symbol in the MAP trees?

How to get PCFG rules and rule probabilities?

Is there a way to get the whole set of PCFG rules and rule probabilities while training the model? I found there's a tensor called rule_scores with shape b x NT x (NT+T) X (NT+T), but I don't see how to get rules and rule probabilities from there, and where are rules going from pre-terminals to terminals.
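One possible way to read rules off such a tensor is sketched below, for a single batch element. This is not confirmed by the repository: it assumes the entry at [A][B][C] is the log-probability of the binary rule A -> B C, and that child indices below NT denote nonterminals while the remaining indices denote pre-terminals. The function names are hypothetical, and nested Python lists stand in for the tensor.

```python
import math

def symbol_name(index, num_nt):
    """Name a child symbol, assuming indices below num_nt are nonterminals."""
    if index < num_nt:
        return "NT-" + str(index + 1)
    return "T-" + str(index - num_nt + 1)

def top_binary_rules(log_probs, num_nt, k=3):
    """Return the k most probable rules A -> B C from log_probs[A][B][C]."""
    rules = []
    for a, per_parent in enumerate(log_probs):
        for b, per_left in enumerate(per_parent):
            for c, lp in enumerate(per_left):
                rule = "NT-%d -> %s %s" % (
                    a + 1, symbol_name(b, num_nt), symbol_name(c, num_nt))
                rules.append((math.exp(lp), rule))
    rules.sort(reverse=True)
    return rules[:k]

# Toy grammar: 1 nonterminal and 1 pre-terminal, so 2 x 2 child pairs.
toy = [[[math.log(0.1), math.log(0.2)],
        [math.log(0.3), math.log(0.4)]]]
top_rules = top_binary_rules(toy, num_nt=1, k=2)
```

Rules from pre-terminals to terminals would have to come from a separate set of emission scores, which this sketch does not cover.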

Inquiry on some minor implementation details

Hello, thanks for sharing your code, first of all, which is well-structured and easy to understand.
I have a few questions about the details of your implementation.

  1. In your paper, you said

We employ a curriculum learning strategy (Bengio et al., 2009) where we train only on sentences of length up to 30 in the first epoch, and increase this length limit by 1 each epoch.

But in the code, I found that the maximum of the lengths of sentences increases as
30 -> 40 -> 41 -> ... even though I expected this should change like 30 -> 31 -> 32 -> ...
This possibly undesirable behavior seems to come from the line below in train.py.

args.max_length = max(args.final_max_length, args.max_length + args.len_incr)

Is there any reason you used max instead of min?
It seems right to change the max function into the min function on the basis of your paper.

  2. Also, I'm curious why we should consider the EOS token ('</s>') in this model by explicitly adding 1 to the number of words in a sentence, as below.

num_words += batch_size * (length + 1) # we implicitly generate </s> so we explicitly count it

Thanks again for opening your code!
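The two schedules discussed in the first question can be compared with a small sketch. length_schedule is a hypothetical helper, not code from train.py, and the final length of 40 is inferred from the 30 -> 40 -> 41 sequence reported above rather than read from the script's defaults.

```python
def length_schedule(start, final, incr, epochs, clip):
    """Return the maximum sentence length used in each epoch.

    clip is the function that combines the final length with the incremented
    length; passing max or min reproduces the two variants under discussion.
    """
    lengths = [start]
    current = start
    for _ in range(epochs - 1):
        current = clip(final, current + incr)
        lengths.append(current)
    return lengths

with_max = length_schedule(30, 40, 1, epochs=4, clip=max)
with_min = length_schedule(30, 40, 1, epochs=4, clip=min)
```

With max the limit jumps to the final length after the first epoch and then keeps growing, while with min it grows by one per epoch and stops at the final length, which is the behavior the paper describes.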

guide papers to understanding

Hi @yoonkim
Could you point me to some guide papers for understanding the paper (not all the references)? I have tried several times to understand your PCFG paper, but it is not easy.

(I know studying all the references of this paper is the best way to understand it. However, I'm just a beginner who is studying CS224N, an NLP course at Stanford University. Reading all the references of each reference paper is almost impossible for me.)

If you don't mind, could you share your experience with OpenNMT? I would like to know what I need to do if I want to work on speech recognition and NMT in the future.

Thank you

Extra indents in PCFG.py?

Are there extra indents on lines 70-71 of PCFG.py?

compound-pcfg/PCFG.py

Lines 70 to 72 in 47a7b16

log_Z = self.beta[:, 0, n-1, :self.nt_states] + root_scores
log_Z = self.logsumexp(log_Z, 1)
return log_Z

Should it be:

    log_Z = self.beta[:, 0, n-1, :self.nt_states] + root_scores
    log_Z = self.logsumexp(log_Z, 1)
    return log_Z
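For readers unfamiliar with this step, the computation in these three lines can be sketched for a single sentence in plain Python. The names below are hypothetical and scalar lists stand in for the batched tensors: the inside score of each root nonterminal over the whole sentence is added to that nonterminal's root score, and a log-sum-exp over nonterminals gives the log partition function.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(inside_root, root_scores):
    """Combine per-nonterminal inside scores for the full span with root scores."""
    return logsumexp([b + r for b, r in zip(inside_root, root_scores)])

# Toy example with two nonterminals.
inside_root = [math.log(0.2), math.log(0.1)]
root_scores = [math.log(0.5), math.log(0.5)]
log_Z = log_partition(inside_root, root_scores)
```

Since the sum runs over the complete sentence span only, it is computed once per sentence after the chart has been filled.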

a fast implementation of C-PCFGs

Hi Yoon! Thank you for sharing the amazing code. I re-implemented C-PCFGs based on Torch-Struct. It is faster (~25min / epoch) and slightly more accurate (~55.7% F1). I am happy to share it. It may help people explore the potential of C-PCFGs more easily. Here is my implementation.

-Yanpeng
