
lexicon

**As of 15/01/14:** since the main file (models_kyle.py) was getting messy, I split it into several subfiles:

- main.py (it's in the name)
- a superclass, lm.py (with methods create_model, generate, and evaluate), from which the subclasses inherit:
  - nphone.py
  - nsyll.py
  - bigmatch.py
  - pcfg.py

The model should be specified on the command line (default: nphone --n 3):

- --model bigmatch_ngram
- --model bigmatch_blick
- --model bigmatch_both
- --model pcfg
- --model nsyll (only for n = 1 or n = 2)

A minimal sketch of this class layout follows.
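The sketch below assumes only the method names listed above; the class bodies are placeholders, not the repo's actual implementations.

```python
# Hypothetical sketch of lm.py and its subclasses; only the method
# names (create_model, generate, evaluate) come from the README.
class LM:
    """Superclass in lm.py."""

    def create_model(self, corpus):
        raise NotImplementedError

    def generate(self, n_words):
        raise NotImplementedError

    def evaluate(self, test_corpus):
        raise NotImplementedError


class Nphone(LM):
    """nphone.py: ngrams over phones (default, --n 3)."""


class Nsyll(LM):
    """nsyll.py: ngrams over syllables (n = 1 or 2 only)."""


class Bigmatch(LM):
    """bigmatch.py: bigmatch_ngram / bigmatch_blick / bigmatch_both."""


class PCFG(LM):
    """pcfg.py: PCFG-based model."""
```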

Methods for evaluation are in evaluation.py. Note that the cross-entropy function currently does NOT return cross-entropy but the average logprob: I was checking for consistency with previous results (I was used to the logprob), so do not mind the crazy perplexity score. Methods for generation are in generation.py. The snippet below shows the relationship between the two quantities.
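This illustrative snippet (not the repo's evaluation.py code) shows why returning the average logprob instead of cross-entropy makes downstream perplexity look wrong: cross-entropy is the negative mean log probability, and perplexity is its exponential.

```python
def cross_entropy(logprobs):
    """logprobs: per-word log2 probabilities of the test set."""
    return -sum(logprobs) / len(logprobs)

def perplexity(logprobs):
    # Perplexity is 2 to the cross-entropy when logs are base 2.
    return 2 ** cross_entropy(logprobs)

print(cross_entropy([-3.0, -5.0]))  # 4.0 bits
print(perplexity([-3.0, -5.0]))     # 16.0
```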

We can now evaluate from nphone and nsyll by specifying --fnc evaluate. This smooths the conditional probabilities with Katz smoothing from katz.py and gt.py. The current implementation is ugly and needs to be updated; it was only meant to test that everything works. A simplified sketch of the back-off idea follows.
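The sketch below is a simplified Katz back-off for bigrams, assuming a fixed absolute discount instead of the full Good-Turing count-of-counts estimate that katz.py and gt.py presumably implement; it only shows where the probability mass for unseen bigrams comes from.

```python
from collections import Counter, defaultdict

def train_katz_bigram(tokens, discount=0.5):
    """Return a P(w2 | w1) function with discounting and back-off."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    p_uni = {w: c / total for w, c in unigrams.items()}

    # Discounted conditional probabilities for seen bigrams.
    p_bi = defaultdict(dict)
    for (w1, w2), c in bigrams.items():
        p_bi[w1][w2] = (c - discount) / unigrams[w1]

    # Back-off weight alpha(w1): the mass freed by discounting,
    # renormalized over the unigram mass of unseen continuations.
    alpha = {}
    for w1, seen in p_bi.items():
        leftover = 1.0 - sum(seen.values())
        unseen_mass = 1.0 - sum(p_uni[w2] for w2 in seen)
        alpha[w1] = leftover / unseen_mass if unseen_mass > 0 else 0.0

    def prob(w1, w2):
        if w2 in p_bi.get(w1, {}):
            return p_bi[w1][w2]
        return alpha.get(w1, 1.0) * p_uni.get(w2, 0.0)

    return prob

# Example with phones as single characters:
prob = train_katz_bigram(list("abracadabra"))
print(prob("a", "b"))  # seen bigram, discounted
print(prob("a", "r"))  # unseen bigram, backed off to unigram
```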

-perplexity.py: uses SRILM; for any computer used, add the SRILM path at the top of the file. It accepts as input either a training list and a test list (with words already segmented and space-separated) or a training file and a test file with one word per line. It can currently do either addsmooth or wbdiscount; if addsmooth, set the smoothing alpha (currently 0.1). New smoothing schemes can easily be added by specifying them. A sketch of the underlying SRILM calls is below.
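A hedged sketch of how such a wrapper can shell out to SRILM; the SRILM_PATH value, file names, and function name are placeholders, while -addsmooth, -wbdiscount, ngram-count, and ngram -ppl are standard SRILM options.

```python
import os
import subprocess

SRILM_PATH = "/usr/local/srilm/bin/i686-m64"  # placeholder; set per machine

def srilm_perplexity(train_file, test_file, smoothing="addsmooth",
                     alpha=0.1, order=3, lm_file="tmp.lm"):
    """Train an n-gram LM on train_file and score test_file with SRILM."""
    if smoothing == "addsmooth":
        smooth_flags = ["-addsmooth", str(alpha)]
    else:
        smooth_flags = ["-wbdiscount"]
    # Build the model (one word per line, words already spaced).
    subprocess.check_call(
        [os.path.join(SRILM_PATH, "ngram-count"),
         "-text", train_file, "-order", str(order), "-lm", lm_file]
        + smooth_flags)
    # Score the test set; SRILM prints logprob and perplexity.
    out = subprocess.check_output(
        [os.path.join(SRILM_PATH, "ngram"),
         "-lm", lm_file, "-order", str(order), "-ppl", test_file])
    return out.decode()
```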

**OLD:** model.py was the main file for adding and implementing different models. It has two functions:

1. generating a random lexicon according to a given model (default mode)
2. evaluating the given model on a corpus (split between train and test sets)

Currently implemented:

- ngrams on phones (default)
- ngrams on syllables: --model nsyll
- pcfg: --model pcfg
- ngrams using a pool of pre-generated words: --model npool (matches each real word with a nonword of the same ngram probability +/- eps; see the sketch after this list)
- --model bigmatch_ngram
- --model bigmatch_blick
- --model bigmatch_both
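A hedged sketch of the npool matching step described above; the function and variable names are illustrative, not the repo's actual API.

```python
import random

def match_pool(real_logprobs, pool, eps=1e-2):
    """Pair each real word with a pool nonword whose ngram
    log-probability lies within +/- eps of the real word's.

    real_logprobs: {word: logprob}; pool: [(nonword, logprob), ...]
    """
    matched = {}
    for word, lp in real_logprobs.items():
        candidates = [nw for nw, nlp in pool if abs(nlp - lp) <= eps]
        if candidates:
            matched[word] = random.choice(candidates)
    return matched
```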

-calc_stats.py: input a lexicon file in the format below and output lexical stats in rfiles.

    -1,word
    -1,word
    0,simword
    0,simword
    1,simword1
    1,simword1
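A hedged reader for that input format: each line is "<label>,<word>". Judging from the example, -1 appears to tag real-lexicon words and 0, 1, ... simulated lexicons, but that is an inference, not documented behavior.

```python
def read_lex(path):
    """Parse "<label>,<word>" lines into (label, word) pairs."""
    rows = []
    with open(path) as f:
        for line in f:
            label, word = line.strip().split(",", 1)
            rows.append((int(label), word))
    return rows
```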

-celex.py: takes input arguments and outputs a CELEX file in the folder celexes (one word per line).

Code Notes:

- The ngram syllable class is redundant with the simple ngram class; I only wrote it this way to avoid messing up the rest, but the two could be merged.
- The smoothing function in NLTK is buggy, or I just did not get why the number of bins must be specified if it is precisely meant for unseen data (this only matters for the evaluation function). UPDATE: it seems to work OK if you give an upper bound (such as the number of possible syllables or possible phonemes), but it may actually underestimate the right probability; see the sketch after this list.
- The npool model is not written as a class; I just added a function generate_correct_number_pool. There is also an extra function for generating this pool of words if it does not already exist.
- There is some inelegant stuff in here; feel free to modify whatever you want, I am not particularly attached to anything!
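The snippet below illustrates the bins issue with NLTK's probability distributions, using LidstoneProbDist as a stand-in (the repo's evaluation path uses its own katz.py/gt.py): bins should be the total number of possible outcomes, seen and unseen, e.g. an upper bound on the phoneme or syllable inventory.

```python
from nltk.probability import FreqDist, LidstoneProbDist

# Toy phoneme counts; real code would build this from the corpus.
fd = FreqDist({"a": 5, "b": 2, "r": 2, "c": 1, "d": 1})

# bins = upper bound on distinct phonemes, so unseen ones get mass.
smoothed = LidstoneProbDist(fd, gamma=0.1, bins=40)
print(smoothed.prob("a"))  # seen: (5 + 0.1) / (11 + 40 * 0.1)
print(smoothed.prob("z"))  # unseen but nonzero: 0.1 / (11 + 4)
```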

Results Notes:

The npool model actually gets pretty close to the true lexicon's number of neighbors if you make eps quite small (compare eps = 1e-2 and 1e-6; it under- or overshoots depending on the generated pool). Results are similar to the bisyllabic ngram model, with about 80% existing words generated, so not surprisingly very close to what the real lexicon is. Also, when evaluating the probability of a word in this model, even if it is true that we matched the ngram probability, the actual probability of the word is not that (or at least not in the way I did it), because we need to take into account that we draw a random word from the set S of all words in the interval +/- eps around the ngram probability of the real word. So the actual probability should be P_ngram * 1/|S|, as in the sketch below.
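A small sketch of that correction, with illustrative names: since the matched nonword is drawn uniformly from the set S, its log probability is the ngram logprob minus log |S|.

```python
import math

def corrected_logprob(word_logprob, pool_logprobs, eps=1e-2):
    """log(P_ngram * 1/|S|), where S is the set of pool words whose
    ngram logprob falls within +/- eps of the real word's."""
    S = [lp for lp in pool_logprobs if abs(lp - word_logprob) <= eps]
    if not S:
        return float("-inf")  # no match possible
    return word_logprob - math.log(len(S))
```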

-make_hists.R:
  - takes all files in rfiles (either indwords_ or global_) and makes histograms
  - right now just for neighbors, but easily modifiable for all stats
  - indwords_ files contain length-separated versions
