morfessor's People

Contributors

anmolgulati, pabs3, psmit, svirpioj, waino

morfessor's Issues

Fix version check in io.py

This is in reference to Gensim PR #1067.
On Python 2.6 (Travis job), the version check on line 18 of morfessor/io.py fails with the following error:
Traceback (most recent call last):

  File "/home/travis/build/RaRe-Technologies/gensim/gensim/test/test_varembed_wrapper.py", line 49, in testEnsembleMorphemeEmbeddings
    morfessor_model=varembed_model_morfessor_file, use_morphemes=True)
  File "/home/travis/build/RaRe-Technologies/gensim/gensim/models/wrappers/varembed.py", line 70, in load_varembed_format
    import morfessor
  File "/home/travis/miniconda2/envs/gensim-test/lib/python2.6/site-packages/morfessor/__init__.py", line 29, in <module>
    from .cmd import main, get_default_argparser, main_evaluation, \
  File "/home/travis/miniconda2/envs/gensim-test/lib/python2.6/site-packages/morfessor/cmd.py", line 12, in <module>
    from .io import MorfessorIO
  File "/home/travis/miniconda2/envs/gensim-test/lib/python2.6/site-packages/morfessor/io.py", line 18, in <module>
    PY3 = sys.version_info.major == 3
AttributeError: 'tuple' object has no attribute 'major'

There seems to be a really simple fix that gets this working on Python 2.6: use sys.version_info[0] instead of sys.version_info.major.

If that's fine, I'll go ahead and submit a PR with this fix, as we intend to integrate that PR into Gensim as well.
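A minimal sketch of the proposed check, shown standalone rather than as the actual contents of morfessor/io.py. On Python 2.6, sys.version_info is a plain tuple, so only index access is available; index access also works on later versions.

```python
import sys

# Index-based access works on Python 2.6, where version_info is a plain
# tuple, and on later versions, where it is a named tuple.
PY3 = sys.version_info[0] == 3

print(PY3)
```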

Is the tokenizer.model deterministic?

Hi, I'm developing a tokenizer for Korean.
My project is to build a language model with SRILM's ngram tools, so the tokenizer plays a very important role.
I haven't been able to run the experiment myself because the corpus is very large, but I'd like a quick answer, so I'm opening this issue.

Is Morfessor's output deterministic? In other words, will the same model be produced if training is repeated dozens of times?
If it is non-deterministic, is there an index or method for measuring how much the resulting tokenizers differ?
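Whatever the answer on determinism, one possible way to quantify how much two training runs differ is to compare the split positions they assign to the same words. The sketch below is an illustration only, not part of Morfessor: the function names and the dictionary input format are assumptions, and the example segmentations are made up.

```python
def boundaries(morphs):
    """Return the set of internal split positions for a list of morphs."""
    positions = set()
    offset = 0
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_agreement(seg_a, seg_b):
    """Compare two segmentations given as {word: [morph, ...]} dicts.

    Returns (fraction of shared words segmented identically,
             F1 score over split positions, with seg_a as reference).
    """
    words = set(seg_a) & set(seg_b)
    identical = hits = total_a = total_b = 0
    for word in words:
        a = boundaries(seg_a[word])
        b = boundaries(seg_b[word])
        identical += int(a == b)
        hits += len(a & b)
        total_a += len(a)
        total_b += len(b)
    exact = identical / float(len(words)) if words else 1.0
    if total_a + total_b == 0:
        f1 = 1.0  # neither run splits anything, so they agree trivially
    else:
        f1 = 2.0 * hits / (total_a + total_b)
    return exact, f1


# Hypothetical outputs of two training runs on the same vocabulary.
run1 = {"design": ["de", "sign"], "biography": ["bio", "graphy"]}
run2 = {"design": ["de", "sign"], "biography": ["bi", "o", "graphy"]}
exact, f1 = boundary_agreement(run1, run2)
print(exact, f1)
```

For a language-model pipeline, comparing perplexity on held-out text across runs would be a complementary, task-level measure.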

Sample data lines for Turkish or English

I want to use Morfessor to split Turkish words into stem + suffixes.
I don't have a sample data set, so I have to create a new one for training.
Can you give me some illustrative example lines, in Turkish or English, showing what the data set should contain?
Thanks.

Trained models

Is there a trained model for Finnish available for download somewhere?

Where is the detailed documentation of the training data format?

Hi There,

I tried to craft some simple training data like

design de sign, de sign
gender gen der, gen der
bilingual bi lingual, bi lingual
biography bio graphy, bio graphy

with this test list

design
gender
bilingual
biography

and got this result

 morfessor -t td1.txt -S model.segm -T text.txt 
Reading corpus from 'td1.txt'...
Detected utf-8 encoding
Done.
Compounds in training data: 16 types / 16 tokens
Starting batch training
Epochs: 0	Cost: 344.6809466060173
.................
Epochs: 1	Cost: 206.03260380373735
.................
Epochs: 2	Cost: 206.0326038037374
Done.
Epochs: 2
Final cost: 206.0326038037374
Training time: 0.017s
Saving segmentations to 'model.segm'...
Done.
Segmenting test data...
Reading corpus from 'text.txt'...
de sign
gen der
bi lingual
bi o graphy
Done.

Done.

whereas the expected result is

de sign
gen der
bi lingual
bio graphy

My questions are:

  • How can I craft the training data correctly?
  • Where can I find the training data specification?

-R
Jarod
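To make the structure of the sample lines above explicit, here is a small sketch that parses them under the assumption that each line is a word followed by comma-separated alternative segmentations. The function name is hypothetical, and the sketch does not say which Morfessor option expects this layout; it only checks that every alternative spells out the word.

```python
def parse_annotation_line(line):
    """Split 'word seg seg, seg seg' into (word, [alternative analyses])."""
    word, _, rest = line.strip().partition(" ")
    analyses = [alt.split() for alt in rest.split(",") if alt.strip()]
    for morphs in analyses:
        # Every alternative should concatenate back to the original word.
        if "".join(morphs) != word:
            raise ValueError("analysis %r does not spell %r" % (morphs, word))
    return word, analyses


word, analyses = parse_annotation_line("biography bio graphy, bio graphy")
print(word, analyses)
```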

The `--atom-separator` option doesn't work on Python 3

vocab-vi.txt is a list of Vietnamese terms, with syllables separated by _. I tried using Morfessor to group the syllables into words:

morfessor -t vocab-vi.txt -T vocab-vi.txt -x lexicon-vi.txt -S lexicon-vi.morf --traindata-list --atom-separator '_'

and I got this error, from code that apparently hasn't been ported to Python 3:

Traceback (most recent call last):
  File "/home/rspeer/.virtualenvs/lum/bin/morfessor", line 22, in <module>
    main(sys.argv[1:])
  File "/home/rspeer/.virtualenvs/lum/bin/morfessor", line 13, in main
    morfessor.main(args)
  File "/home/rspeer/.virtualenvs/lum/lib/python3.5/site-packages/morfessor/cmd.py", line 393, in main
    args.finish_threshold, args.maxepochs)
  File "/home/rspeer/.virtualenvs/lum/lib/python3.5/site-packages/morfessor/baseline.py", line 572, in train_batch
    (w, _constructions_to_str(segments)))
  File "/home/rspeer/.virtualenvs/lum/lib/python3.5/site-packages/morfessor/baseline.py", line 17, in _constructions_to_str
    isinstance(constructions[0], unicode)):
NameError: name 'unicode' is not defined

Replacing that check with a check for str doesn't solve the problem either; it just uncovers another one:

Traceback (most recent call last):
  File "/home/rspeer/.virtualenvs/lum/bin/morfessor", line 22, in <module>
    main(sys.argv[1:])
  File "/home/rspeer/.virtualenvs/lum/bin/morfessor", line 13, in main
    morfessor.main(args)
  File "/home/rspeer/.virtualenvs/lum/lib/python3.5/site-packages/morfessor/cmd.py", line 466, in main
    analysis = csep.join(constructions)
TypeError: sequence item 0: expected str instance, tuple found
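The second traceback suggests that, with an atom separator, each construction is a tuple of atoms rather than a string, so it has to be joined before the constructions themselves are joined. A minimal sketch of that idea follows; the function name, parameter names, and separators are assumptions for illustration, not Morfessor's actual code.

```python
def constructions_to_str(constructions, separator=" + ", atom_separator="_"):
    """Join constructions that are either strings or tuples of atoms."""
    parts = []
    for construction in constructions:
        if isinstance(construction, str):
            parts.append(construction)
        else:
            # A tuple of atoms (e.g. syllables): rejoin with the atom separator.
            parts.append(atom_separator.join(construction))
    return separator.join(parts)


print(constructions_to_str([("hoc", "sinh"), ("gioi",)]))
```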

Unpickling a binary model fails

Hi,
I am trying to load a model on Python 3.6 using the Python API, but it fails.

Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import morfessor
>>> io = morfessor.MorfessorIO()
>>> mf = 'something.morfmodel.bin'
>>> model = io.read_binary_model_file(mf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/someone/libs/miniconda3/envs/py3/lib/python3.6/site-packages/morfessor/io.py", line 179, in read_binary_model_file
    model = pickle.load(fobj)
AttributeError: 'ConstrNode' object has no attribute '__dict__'

The model may have been trained on Python 2 (I am not sure; my coworker trained it).
The question is: shouldn't a model trained on Python 2 work on Python 3, given that the same code is used?

KeyError

Hello,
I'm getting this issue:

p3/bin/morfessor -t en-cs/train.en.tok --num-morph-types 50000 -S morf-models/morf-model.train.en-cs.50k.en -s morf-model.train.en-cs.50k.pickle.en
INFO:morfessor.io:Reading corpus from 'en-cs/train.en.tok'...
INFO:morfessor.io:Detected utf-8 encoding
INFO:morfessor.io:Done.
INFO:morfessor.baseline:Compounds in training data: 1938261 types / 1938261 tokens
INFO:morfessor.baseline:Starting batch training
INFO:morfessor.baseline:Epochs: 0       Cost: 75567655.89912468
.......................................................ERROR:morfessor:Fatal Error <class 'KeyError'> 'lhjij'
Traceback (most recent call last):
  File "p3/bin/morfessor", line 22, in <module>
    main(sys.argv[1:])
  File "p3/bin/morfessor", line 13, in main
    morfessor.main(args)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/cmd.py", line 435, in main
    args.finish_threshold, args.maxepochs)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 595, in train_batch
    segments = self._recursive_optimize(w, *algorithm_params)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 299, in _recursive_optimize
    constructions += self._recursive_split(part)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 312, in _recursive_split
    rcount, count = self._remove(construction)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 124, in _remove
    rcount, count, splitloc = self._analyses[construction]
KeyError: 'lhjij'

Morfessor (2.0.3)

The input file is the tokenized English side of CzEng. Is that correct?

UnicodeDecodeError when installing via pip

Downloading Morfessor-2.0.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-wy451zlq/morfessor/setup.py", line 9, in <module>
        main_py = open('morfessor/__init__.py').read()
      File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 373: ordinal not in range(128)

cf061cf#diff-4ffe01edeab0886c81c728bf704ac894R13

Maybe you need to add # -*- coding: utf-8 -*-
https://www.python.org/dev/peps/pep-0263/
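Note that a PEP 263 coding line tells the interpreter how to parse a source file; it does not change how open() decodes a file that is read as text. In the traceback the failure comes from reading morfessor/__init__.py with the locale's ASCII default, so an alternative is to pass the encoding explicitly. The sketch below demonstrates this on a temporary file (the file name and contents are made up for the example).

```python
import io
import os
import tempfile

# Create a small file containing a non-ASCII character; "ä" is encoded
# as the bytes 0xc3 0xa4 in UTF-8, matching the byte in the error message.
path = os.path.join(tempfile.mkdtemp(), "example_init.py")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"# -*- coding: utf-8 -*-\nname = '\u00e4'\n")

# Passing the encoding explicitly avoids depending on the locale default.
with io.open(path, encoding="utf-8") as f:
    main_py = f.read()

print(len(main_py))
```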

Command line vs. API

I wrote Python code to segment given words; the main part is:

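import morfessor

io = morfessor.MorfessorIO()
model = io.read_any_model('model.bin')
with open('test.txt', 'r') as input_file:
    for line in input_file:
        words = line.strip().split()
        # viterbi_segment returns (morphs, cost); keep only the morphs
        morphemes = [(w, " ".join(model.viterbi_segment(w)[0])) for w in words]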

Only a few words are segmented, but when I use the same model on the command line to segment the same text, most of the words are segmented:
$ morfessor-segment -l model.bin test.txt

Any idea what is wrong in my Python code? Thank you!

Segmented output format

I used morfessor-segment -L en.model test.data > test.morf

It works, but the resulting file test.morf has one word per line. Since my corpus has one sentence per line, I would like the output in the same format, but I cannot find how to achieve that.

Thanks in advance
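One way to get sentence-per-line output is to do the line handling in Python and segment word by word. The sketch below is independent of Morfessor: segment_lines and toy_segment are hypothetical names, and toy_segment is only a stand-in for a real segmenter. With a loaded model, something like lambda w: model.viterbi_segment(w)[0] (the call used in the API issue above) could be passed instead.

```python
def segment_lines(lines, segment, morph_sep=" ", word_sep=" "):
    """Yield one output line per input line, segmenting each word.

    `segment` is any callable mapping a word to a list of morphs.
    Empty input lines produce empty output lines, so line counts match.
    """
    for line in lines:
        words = line.split()
        yield word_sep.join(morph_sep.join(segment(w)) for w in words)


def toy_segment(word):
    """Stand-in segmenter for the example: split after two characters."""
    return [word[:2], word[2:]] if len(word) > 2 else [word]


result = list(segment_lines(["yksi kaksi", "", "kolme"], toy_segment))
for out_line in result:
    print(out_line)
```

If word boundaries must remain recoverable, choose a morph_sep that differs from word_sep.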

--output-newlines squeezes multiple newlines

Command-line.

flammie@saarkaany ~/Koodit/mt-development/complexity-stats (2145) [01:21:54] 
$ cat > kolme
yksi

kolme
flammie@saarkaany ~/Koodit/mt-development/complexity-stats (2146) [01:22:08] 
$ morfessor -l europarl-v7.fi-en.fi.morfessor --output-format-separator '> <' --output-newlines --output-format '{analysis} ' -T - < kolme
INFO:morfessor.io:Loading model from 'europarl-v7.fi-en.fi.morfessor'...
INFO:morfessor.io:Done.
No training data files specified.
Segmenting test data...
INFO:morfessor.io:Reading corpus from '-'...
yksi 
kolme 
INFO:morfessor.io:Done.

Done.

There should be an empty line between yksi and kolme. This matters for machine translation pipelines, where tools commonly fail when line counts don't match.

Morfessor Models Sizes

I am using Morfessor with word counts generated from Wikipedia. I noticed that the larger the word count file, the larger the model; the pickle file is around 0.5 GiB.

Is there a correlation?

What do you think the best practice is?

How to save the segmented word to file?

Hi,
I use the following command for model training (Morfessor 2.0):
morfessor-train --traindata-list --logfile=log.log -S model.segm -d ones inputdata.txt
Then use the following command for word segmentation:
morfessor-segment -L model.segm test.txt
Why is the output printed to the terminal after segmentation? How can I save the segmented words to a specified file?

Looking forward to your advice or answers.
Best regards,

yapingzhao
