MorphForest

Code for the paper Unsupervised Learning of Morphological Forests (to appear in TACL 2017).

The repo contains a baseline model that is roughly based on this.

Installation

Theano and Gurobi are required. A free academic license is available for the Gurobi solver.

After installation, you should be able to do a test run with python run.py eng -ILP -DEBUG.

Data preparation

Before running the model, prepare three input files. By default they are read from the data folder, which also includes three sample files.

Gold segmentation file, named gold.<lc>. One word per line, in the format <word>:<segmentations>, where morphemes are separated by hyphens and alternative segmentations are separated by spaces. See data/gold.eng.toy for an example.
Word vector file, named wv.<lc>. One word per line: the word followed by its vector of floating-point numbers, all separated by spaces. See data/wv.eng.toy for an example.
Wordlist file, named wordlist.<lc>. One word per line, followed by its frequency and separated from it by a space. See data/wordlist.eng.toy for an example.
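
The three line formats listed above can be read with a few lines of Python. This is a minimal sketch, not code from the repository: the function names are hypothetical, the gold file is assumed to put a colon between the word and its segmentations, and frequencies are assumed to be integers.

```python
def parse_gold_line(line):
    """Parse 'word:seg1 seg2' (assumed layout) into (word, list of morpheme lists)."""
    word, _, segs = line.strip().partition(":")
    # Alternative segmentations are space-separated; morphemes are hyphen-separated.
    return word, [seg.split("-") for seg in segs.split()]


def parse_wv_line(line):
    """Parse 'word f1 f2 ...' into (word, list of floats)."""
    parts = line.split()
    return parts[0], [float(x) for x in parts[1:]]


def parse_wordlist_line(line):
    """Parse 'word frequency' into (word, int frequency)."""
    word, freq = line.split()
    return word, int(freq)
```

Checking a few lines of each file with helpers like these before training can catch formatting mistakes early, for example a gold line that is missing its colon.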

<lc> is the language code you have to specify, usually a three-letter or two-letter string, e.g. eng for English. This language code is entered as an argument when you run the code (as detailed below), and is used to find the input files in the data folder.

Running the code

Use python run.py <lc> to run the model, where <lc> is the aforementioned language code. Add the -h flag to see a list of the settings you can change, some of which are detailed below.

--top-affixes or -A, number of most frequent affixes to use, 100 by default
--top-words or -W, number of most frequent words to train on, 5000 by default
-compounding, flag to include compounding features
-sibling, flag to include sibling features
-supervised, flag to train a supervised model
-ILP, flag to use the full model, which is trained iteratively. Without this flag, a baseline model will be trained. The number of iterations is specified by --iter, 5 by default.
--save and --load, save the model to or load the model from a specified location.
After training or loading a model, use --input-file or -I to specify the file of words (one word per line) to segment, and --output-file or -O to store the segmentations.
--alpha or -a, and --beta or -b are hyperparameters as introduced in the paper, 0.001 and 1.0 by default respectively. To reproduce the results as reported in Table 4, use the default values for English and Arabic. For Turkish and German, use --beta 3.0, as more affixes are expected for both languages.

Segmentation results will be saved in the out folder, along with feature weights.
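
To make the combinations of flags above concrete, here is a minimal sketch of a helper that assembles a command line from them. The helper build_command and its parameter names are hypothetical and not part of the repository; it only uses flags described in this README.

```python
def build_command(lc, ilp=False, iterations=None, beta=None,
                  load=None, save=None, input_file=None, output_file=None):
    """Assemble an argument list for run.py from the flags described above."""
    cmd = ["python", "run.py", lc]
    if ilp:
        cmd.append("-ILP")  # full, iteratively trained model
    if iterations is not None:
        cmd += ["--iter", str(iterations)]
    if beta is not None:
        cmd += ["--beta", str(beta)]
    if load is not None:
        cmd += ["--load", load]
    if save is not None:
        cmd += ["--save", save]
    if input_file is not None:
        cmd += ["--input-file", input_file]
    if output_file is not None:
        cmd += ["--output-file", output_file]
    return cmd


# Full model for English with default hyperparameters.
print(" ".join(build_command("eng", ilp=True)))
# prints: python run.py eng -ILP
```

For Turkish or German, the same helper would be called with beta=3.0 and the language code you chose for your data files. The resulting list can be passed to subprocess.run without a shell.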

Dataset

The segmentation dataset is available on my website.

morphforest's Issues

Data directory CLI argument is not used; paths to word vectors are fixed from pickle files

This is from the saral branch.

head <Some File>.txt | ./segmenter.py  sw ../saves/sw-ft-tc-A100-W10000-ILP.pkl  -dd ../word_embeddings/sw-ft-tc
Traceback (most recent call last):
....
    with codecs.open(wv_path, encoding='utf8', errors='strict') as fin:
  File "/nas/home/tg/.conda/envs/gurobi/lib/python2.7/codecs.py", line 896, in open
    file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 2] No such file or directory: 'data/sw-ft-tc/wv.sw'
Traceback (most recent call last):
  File "./segmenter.py", line 82, in <module>
    main(args)
  File "./segmenter.py", line 32, in main
    subprocess.check_output('python src/run.py %s --load %s -I %s -O %s -d %s'  %(lang, model, fin, 'tmp', data_dir), shell=True)
  File "/nas/home/tg/.conda/envs/gurobi/lib/python2.7/subprocess.py", line 219, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command 'python src/run.py sw --load ../saves/sw-ft-tc-A100-W10000-ILP.pkl -I stdin -O tmp -d ../word_embeddings/sw-ft-tc' returned non-zero exit status 1
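
One possible way to avoid this failure would be to let the data directory given on the command line take precedence over the path stored in the pickle. The sketch below illustrates that idea only. resolve_wv_path is a hypothetical helper, not a function in the repository, and it assumes the word vector file follows the wv.<lc> naming described in the README.

```python
import os


def resolve_wv_path(pickled_path, lang, data_dir=None):
    """Pick the word vector path, preferring a data directory passed at run time.

    pickled_path: path stored inside the saved model (may no longer exist).
    lang:         language code, e.g. 'sw'.
    data_dir:     directory given on the command line, or None if absent.
    """
    if data_dir:
        # Assumed naming convention: wv.<lc> inside the data directory.
        return os.path.join(data_dir, "wv." + lang)
    return pickled_path
```

With the values from the traceback above, resolve_wv_path('data/sw-ft-tc/wv.sw', 'sw', '../word_embeddings/sw-ft-tc') would point at wv.sw inside ../word_embeddings/sw-ft-tc rather than at the stale pickled path.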
