CCG Parsing: 2015

This is the software used for the following publications:

Weakly-Supervised Grammar-Informed Bayesian CCG Parser Learning
Dan Garrette, Chris Dyer, Jason Baldridge, and Noah A. Smith
In Proceedings of AAAI 2015

A Supertag-Context Model for Weakly-Supervised CCG Parser Learning
Dan Garrette, Chris Dyer, Jason Baldridge, and Noah A. Smith
In Proceedings of CoNLL 2015

Getting the code

$ git clone [email protected]:dhgarrette/2015-ccg-parsing.git
$ cd 2015-ccg-parsing

Data setup

Put the English, Chinese, and Italian data into the following directories:

 data/ccgbank
 data/ccgbank-chinese
 data/ccgbank-italian

The files should be arranged as follows:

$ ls data/ccgbank/AUTO
00	02	04	06	08	10	12	14	16	18	20	22	24
01	03	05	07	09	11	13	15	17	19	21	23
$ ls data/ccgbank-chinese/AUTO
00	02	04	06	08	10	20	22	24	26	28	30
01	03	05	07	09	11	21	23	25	27	29	31
$ ls data/ccgbank-italian/pro
civil_law.pro.txt	jrc_acquis.pro.txt	newspaper.pro.txt    
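
Before compiling, it can help to confirm that the data sits where the listing above expects it. The following is a minimal sketch, not part of the repository: `check_data_dirs` is a hypothetical helper that only checks for the three top-level directories shown above and does not inspect the section folders or files inside them.

```shell
# Hypothetical helper (not part of the repository): report which of the
# expected data directories exist under a given root (default: current dir).
check_data_dirs() {
  root="${1:-.}"
  missing=0
  for dir in data/ccgbank/AUTO data/ccgbank-chinese/AUTO data/ccgbank-italian/pro; do
    if [ -d "$root/$dir" ]; then
      echo "ok:      $dir"
    else
      echo "missing: $dir"
      missing=$((missing + 1))
    fi
  done
  # Summary line: how many of the three directories were not found.
  echo "$missing of 3 expected directories missing"
}

# Run from the repository root (2015-ccg-parsing).
check_data_dirs .
```

If you only plan to use one language (see --lang below), the other two directories presumably do not need to be present; that is an assumption, not something this README states.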

Running the code

First, compile the code and generate the run script:

$ ./compile

Then run:

$ target/start dhg.ccg.run.Parse2015Run [options]

Options

  • --model: The model to use. Options (no default):
    • pcfg: the PCFG model (see the AAAI-2015 paper)
    • scg: the supertag-context model (see the CoNLL-2015 paper)
  • --learning: The learning algorithm to use to train the model. Options: {mcmc}. Default: mcmc.
  • --additional-rules: Additional CCG rules to be allowed by the parser (comma-separated). Example: FC,BX,FC2,BX2. Default: x (meaning no additional rules).
  • --lang: The language of the CCGBank to use. Options: {en, ch, it}. Default: en.
  • --max-sent-len: The maximum sentence length allowed (all sentences longer than this are filtered out). Options: an integer or all for no limit. Default: all.
  • --td-tok: The maximum number of tokens to be read when building the tag dictionary. Options: an integer (following an integer with k will expand to 000; e.g. 10k becomes 10000) or all for no limit. Default: all.
  • --train-sent: The maximum number of sentences to be used for training. Options: an integer or all for no limit. Default: all.
  • --test-sent: The maximum number of sentences to be used for testing. Options: an integer or all for no limit. Default: all.
  • --sampling-iterations: The number of MCMC sampling iterations to run. Default: 500.
  • --burnin-iterations: The number of MCMC burn-in iterations to run. Default: 0.
  • --alpha-root: See paper for details. Default: 1.0.
  • --alpha-biny: See paper for details. Default: 100.0.
  • --alpha-unry: See paper for details. Default: 100.0.
  • --alpha-term: See paper for details. Default: 10000.0.
  • --alpha-prod: See paper for details. Default: 100.0.
  • --alpha-cntx: See paper for details. Only relevant for --model scg. Default: 1000.0.
  • --root-init: Root parameter initializer. Options:
    • uniform.
    • catprior: use the grammar-defined category prior.
    • tdecatprior: use the grammar-defined category prior, with atomic category probabilities estimated using the tag dictionary and raw data. DEFAULT.
  • --nt-prod-init: Nonterminal production parameter initializer (for both binary and unary). Options:
    • uniform.
    • catprior: use the grammar-defined category prior.
    • tdecatprior: use the grammar-defined category prior, with atomic category probabilities estimated using the tag dictionary and raw data. DEFAULT.
  • --term-prod-init: Terminal production parameter initializer (for both binary and unary). Options:
    • uniform.
    • tdentry: Use the tag dictionary and raw data to estimate terminal (word) probabilities for each supertag. DEFAULT.
  • --tr-init: Context production parameter initializer (for both left and right contexts). Only relevant for --model scg. Options:
    • uniform.
    • tdentry: use the tag dictionary and raw data to estimate transition probabilities.
    • combine-uniform: use CCG supertag combinability mixed with uniform.
    • combine-tdentry: use CCG supertag combinability mixed with tdentry. DEFAULT.
  • --pterm: See paper for details. Default: 0.7.
  • --pmod: See paper for details. Default: 0.1.
  • --pfwd: See paper for details. Default: 0.5.
  • --comb-tr-mass: Amount of probability mass devoted to "combinable" contexts (called sigma (σ) in the CoNLL-2015 paper). Only relevant for --model scg. Default: 0.85.
  • --td-cutoff: Exclude tag dictionary entries that occur with less than this proportion in the TD-training corpus. Default: 0.0.
  • --max-accept-tries: Number of samples drawn for each sentence in each iteration. Only relevant for --model scg. Default: 1.
  • --output-file: A file where the parsed trees of the test sentences should be written. Default: do not write out trees.
  • --train-termdel: Allow terminal deletion from a training sentence when a parse is not found. Options: {false, true}. Default: false.
  • --test-termdel: Allow terminal deletion from a test sentence when a parse is not found. Options: {false, true}. Default: false.
  • --max-train-tok: The maximum number of tokens to be read for the training data. Options: an integer (following an integer with k will expand to 000; e.g. 10k becomes 10000) or all for no limit. Default: all.
  • --mcmc-output-count-file: File where additional data should be written. Default: do not write out this information.
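
As an illustration of how the options above combine, here is a sketch of one possible invocation. Every flag comes from the list above, but the values (a length limit of 20, a 10k-token tag dictionary, the output file name parsed-trees.txt) are hypothetical and are not settings recommended by the papers. The snippet only prints the command, so it is safe to run before the code has been compiled.

```shell
# Hypothetical settings: the flags are from the option list above,
# the values are illustrative only.
cmd="target/start dhg.ccg.run.Parse2015Run"
cmd="$cmd --model scg --lang en"
cmd="$cmd --max-sent-len 20 --td-tok 10k"
cmd="$cmd --sampling-iterations 500 --burnin-iterations 0"
cmd="$cmd --output-file parsed-trees.txt"

# Dry run: print the assembled command instead of executing it.
echo "$cmd"
```

To actually run the parser, paste the printed line into a shell after ./compile has finished. Since --model has no default, it is the one flag that must always be given.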
