morphforest's Introduction

MorphForest

Code for the paper Unsupervised Learning of Morphological Forests (to appear in TACL 2017).

The repo contains a baseline model that is roughly based on this.

Installation

Theano and Gurobi are required. A free academic license is available for the Gurobi solver.

After installation, you should be able to do a test run with python run.py eng -ILP -DEBUG.

Data preparation

Before running the model, prepare three input files. By default they are read from the data folder, which also includes three sample files.

  • Gold segmentation file, named gold.<lc>. One word per line, in the format <word>:<segmentations>, where morphemes are separated by hyphens and alternative segmentations are separated by spaces. See data/gold.eng.toy for an example.
  • Word vector file, named wv.<lc>. One word per line: the word followed by its vector of floats, all separated by spaces. See data/wv.eng.toy for an example.
  • Wordlist file, named wordlist.<lc>. One word per line, followed by its frequency and separated from it by a space. See data/wordlist.eng.toy for an example.

<lc> is the language code you have to specify, usually a three-letter or two-letter string, e.g. eng for English. This language code is entered as an argument when you run the code (as detailed below), and is used to find the input files in the data folder.
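
The three formats above can be sketched as small parsers. This is only an illustration under stated assumptions, not code from the repo: the function names are hypothetical, the sample lines are made up, frequencies are assumed to be integers, and the word and its segmentations in the gold file are assumed to be separated by a colon.

```python
import os


def input_paths(lc, data_dir="data"):
    """Return the default input file paths for language code `lc` (hypothetical helper)."""
    return {
        "gold": os.path.join(data_dir, "gold." + lc),
        "wv": os.path.join(data_dir, "wv." + lc),
        "wordlist": os.path.join(data_dir, "wordlist." + lc),
    }


def parse_gold_line(line):
    """Split 'word:seg1 seg2' into the word and a list of morpheme lists."""
    word, _, segs = line.strip().partition(":")
    # Alternative segmentations are space-separated; morphemes are hyphen-separated.
    return word, [seg.split("-") for seg in segs.split()]


def parse_vector_line(line):
    """Split 'word f1 f2 ...' into the word and its vector of floats."""
    parts = line.split()
    return parts[0], [float(x) for x in parts[1:]]


def parse_wordlist_line(line):
    """Split 'word freq' into the word and its frequency (assumed to be an integer)."""
    word, freq = line.split()
    return word, int(freq)
```

For example, a made-up gold line such as unlocked:un-lock-ed unlock-ed would yield the word unlocked with two alternative segmentations.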

Running the code

Use python run.py <lc> to run the model, where <lc> is the aforementioned language code. Add the -h flag to see the list of settings you can change, some of which are detailed below.

  • --top-affixes or -A, number of most frequent affixes to use, 100 by default
  • --top-words or -W, number of most frequent words to train on, 5000 by default
  • -compounding, flag to include compounding features
  • -sibling, flag to include sibling features
  • -supervised, flag to train a supervised model
  • -ILP, flag to use the full model, which is trained iteratively. Without this flag, a baseline model will be trained. The number of iterations is specified by --iter, 5 by default.
  • --save and --load, save the model to or load the model from a specified location.
  • After training or loading a model, use --input-file or -I to specify the file of words (one word per line) to segment, and --output-file or -O to store the segmentations.
  • --alpha or -a, and --beta or -b are hyperparameters as introduced in the paper, 0.001 and 1.0 by default respectively. To reproduce the results as reported in Table 4, use the default values for English and Arabic. For Turkish and German, use --beta 3.0, as more affixes are expected for both languages.

Segmentation results will be saved in the out folder, along with feature weights.
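
As a rough illustration of how the flags listed above combine, the sketch below assembles a command line as a list of arguments. The helper build_run_command is hypothetical and not part of the repo, and the language code tur in the usage note is only an assumed example; the flags themselves are the ones described above.

```python
def build_run_command(lc, ilp=False, iterations=None, beta=None,
                      save=None, load=None, input_file=None, output_file=None):
    """Assemble the argument list for run.py (hypothetical helper, flags from the README)."""
    cmd = ["python", "run.py", lc]
    if ilp:
        cmd.append("-ILP")  # full model, trained iteratively
    if iterations is not None:
        cmd += ["--iter", str(iterations)]
    if beta is not None:
        cmd += ["--beta", str(beta)]
    if save is not None:
        cmd += ["--save", save]
    if load is not None:
        cmd += ["--load", load]
    if input_file is not None:
        cmd += ["--input-file", input_file]
    if output_file is not None:
        cmd += ["--output-file", output_file]
    return cmd
```

Calling build_run_command("tur", ilp=True, beta=3.0) gives the argument list for a full-model run with the beta value recommended for Turkish. The list can be passed to subprocess.run(cmd, check=True) from the folder that contains run.py.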

Dataset

The segmentation dataset is available on my website.

morphforest's People

Contributors

j-luo93

morphforest's Issues

Data directory CLI argument is not used; paths to word vectors are fixed from pickle files

This is from the saral branch.

head <Some File>.txt | ./segmenter.py  sw ../saves/sw-ft-tc-A100-W10000-ILP.pkl  -dd ../word_embeddings/sw-ft-tc
Traceback (most recent call last):
....
    with codecs.open(wv_path, encoding='utf8', errors='strict') as fin:
  File "/nas/home/tg/.conda/envs/gurobi/lib/python2.7/codecs.py", line 896, in open
    file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 2] No such file or directory: 'data/sw-ft-tc/wv.sw'
Traceback (most recent call last):
  File "./segmenter.py", line 82, in <module>
    main(args)
  File "./segmenter.py", line 32, in main
    subprocess.check_output('python src/run.py %s --load %s -I %s -O %s -d %s'  %(lang, model, fin, 'tmp', data_dir), shell=True)
  File "/nas/home/tg/.conda/envs/gurobi/lib/python2.7/subprocess.py", line 219, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command 'python src/run.py sw --load ../saves/sw-ft-tc-A100-W10000-ILP.pkl -I stdin -O tmp -d ../word_embeddings/sw-ft-tc' returned non-zero exit status 1
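
The traceback suggests that the loaded model looked for data/sw-ft-tc/wv.sw even though a different data directory was passed on the command line. A minimal sketch of the behaviour this issue seems to ask for is shown below. It is not the repo's actual code: resolve_wv_path and both of its directory parameters are hypothetical names, and it assumes the word vector file is always named wv.<lc>.

```python
import os


def resolve_wv_path(lang, cli_data_dir=None, pickled_data_dir="data"):
    """Pick the word vector path, preferring the CLI data directory (hypothetical helper).

    `pickled_data_dir` stands in for whatever directory was stored with the saved model.
    """
    data_dir = cli_data_dir if cli_data_dir else pickled_data_dir
    return os.path.join(data_dir, "wv." + lang)
```

With this rule, a directory given on the command line overrides the one stored in the pickle, and the stored one is used only as a fallback.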
