aalto-speech / morfessor

Morfessor is a tool for unsupervised and semi-supervised morphological segmentation

License: BSD 2-Clause "Simplified" License

Python 100.00%

segmentation subword-units subword-segmentation python

morfessor's Introduction

Morfessor 2.0 - Quick start
===========================


Installation
------------

Morfessor 2.0 is installed using the setuptools library for Python. To
build and install the module and scripts to default paths, type

python setup.py install

For details, see http://docs.python.org/install/


Documentation
-------------

User instructions for Morfessor 2.0 are available in the docs directory
as Sphinx source files (see http://sphinx-doc.org/). Instructions on how
to build the documentation can be found in docs/README.

The documentation is also available on-line at http://morfessor.readthedocs.org/

Details of the implemented algorithms and methods and a set of
experiments are described in the following technical report:

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko
Kurimo. Morfessor 2.0: Python Implementation and Extensions for
Morfessor Baseline. Aalto University publication series SCIENCE +
TECHNOLOGY, 25/2013. Aalto University, Helsinki, 2013. ISBN
978-952-60-5501-5.

The report is available online at 

http://urn.fi/URN:ISBN:978-952-60-5501-5


Contact
-------

Questions or feedback? Email: [email protected]

morfessor's Issues

The `--atom-separator` option doesn't work on Python 3

vocab-vi.txt is a list of Vietnamese terms, with syllables separated by _. I tried using Morfessor to group the syllables into words:

morfessor -t vocab-vi.txt -T vocab-vi.txt -x lexicon-vi.txt -S lexicon-vi.morf --traindata-list --atom-separator '_'

and I got this error, from code that apparently hasn't been ported to Python 3:

Traceback (most recent call last):
  File "/home/rspeer/.virtualenvs/lum/bin/morfessor", line 22, in <module>
    main(sys.argv[1:])
  File "/home/rspeer/.virtualenvs/lum/bin/morfessor", line 13, in main
    morfessor.main(args)
  File "/home/rspeer/.virtualenvs/lum/lib/python3.5/site-packages/morfessor/cmd.py", line 393, in main
    args.finish_threshold, args.maxepochs)
  File "/home/rspeer/.virtualenvs/lum/lib/python3.5/site-packages/morfessor/baseline.py", line 572, in train_batch
    (w, _constructions_to_str(segments)))
  File "/home/rspeer/.virtualenvs/lum/lib/python3.5/site-packages/morfessor/baseline.py", line 17, in _constructions_to_str
    isinstance(constructions[0], unicode)):
NameError: name 'unicode' is not defined

Replacing that check with just a check for str doesn't solve the problem; it only uncovers another one:

Traceback (most recent call last):
  File "/home/rspeer/.virtualenvs/lum/bin/morfessor", line 22, in <module>
    main(sys.argv[1:])
  File "/home/rspeer/.virtualenvs/lum/bin/morfessor", line 13, in main
    morfessor.main(args)
  File "/home/rspeer/.virtualenvs/lum/lib/python3.5/site-packages/morfessor/cmd.py", line 466, in main
    analysis = csep.join(constructions)
TypeError: sequence item 0: expected str instance, tuple found
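
Both tracebacks point at the same gap: with an atom separator, constructions can be tuples of atoms rather than plain strings, and the Python 2 `unicode` name no longer exists on Python 3. A minimal sketch of a version-agnostic join follows. The helper name, its parameters, and the separators are hypothetical and only illustrate the idea; this is not Morfessor's own code.

```python
# On Python 3 there is no separate `unicode` type; fall back to `str`.
try:
    string_types = (str, unicode)  # Python 2
except NameError:
    string_types = (str,)          # Python 3


def constructions_to_str(constructions, atom_sep="_", constr_sep=" + "):
    """Join constructions that are plain strings or tuples of atoms.

    Hypothetical helper for illustration, not part of Morfessor.
    """
    parts = []
    for c in constructions:
        if isinstance(c, string_types):
            parts.append(c)
        else:
            # A tuple of atoms is joined back together with atom_sep.
            parts.append(atom_sep.join(c))
    return constr_sep.join(parts)


joined_atoms = constructions_to_str([("xin", "chao"), ("ban",)])
joined_plain = constructions_to_str(["de", "sign"])
```

The same normalization would have to be applied wherever constructions are joined for output, which is where the second traceback fails.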

Command line vs. API

I wrote Python code to segment given words. The main code is:

import morfessor
io = morfessor.MorfessorIO()
model = io.read_any_model('model.bin')
with open('test.txt', 'r') as input_file:
    for line in input_file:
        words = line.strip().split()
        # viterbi_segment returns a pair; element 0 holds the segments
        morphemes = [(w, " ".join(model.viterbi_segment(w)[0])) for w in words]

Only a few words are segmented, but when I use the same model on the command line to segment the same text, most of the words are segmented:
$ morfessor-segment -l model.bin test.txt

So any idea what is wrong in my Python code? Thank you!

Sample data lines for Turkish or English

I want to use Morfessor to separate Turkish words into stem+suffixes.
I don't have a sample database, so I must create a new data set for training.
Can you give me some example data lines, in Turkish or English, of the kind that should be in the data set?
Thanks.

How do I control dictionary size

tarball on website out of date

I noticed that the website says the latest version is 2.0.1 but the latest GitHub tag is 2.0.6:

http://morpho.aalto.fi/projects/morpho/morfessor2.html
https://github.com/aalto-speech/morfessor/releases

Where is the detailed specification of the training data format?

Hi there,

I tried to craft some simple training data like

design de sign, de sign
gender gen der, gen der
bilingual bi lingual, bi lingual
biography bio graphy, bio graphy

with a test list of

design
gender
bilingual
biography

and got this result:

 morfessor -t td1.txt -S model.segm -T text.txt 
Reading corpus from 'td1.txt'...
Detected utf-8 encoding
Done.
Compounds in training data: 16 types / 16 tokens
Starting batch training
Epochs: 0	Cost: 344.6809466060173
.................
Epochs: 1	Cost: 206.03260380373735
.................
Epochs: 2	Cost: 206.0326038037374
Done.
Epochs: 2
Final cost: 206.0326038037374
Training time: 0.017s
Saving segmentations to 'model.segm'...
Done.
Segmenting test data...
Reading corpus from 'text.txt'...
de sign
gen der
bi lingual
bi o graphy
Done.

Done.

whereas the expected result is

de sign
gen der
bi lingual
bio graphy

My questions are:

How can I craft the training data correctly?
Where can I find the training data specification?
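
To make the intended structure of the sample lines explicit, here is a minimal sketch that reads each line as a word followed by comma-separated alternative segmentations. That reading is an assumption based only on the lines quoted in this issue. Whether Morfessor accepts this layout as training data is exactly what the issue asks and is not confirmed here. The function name is hypothetical.

```python
def parse_annotation_line(line):
    """Split 'word seg seg, seg seg' into (word, [[seg, ...], ...]).

    Hypothetical parser for the layout used in the sample lines above.
    """
    # The first token is the word; the rest holds the alternatives.
    word, _, rest = line.strip().partition(" ")
    # Alternatives are comma-separated; segments are space-separated.
    analyses = [alt.split() for alt in rest.split(",") if alt.strip()]
    return word, analyses


parsed = parse_annotation_line("biography bio graphy, bio graphy")
```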

-R
Jarod

Is your trained English model available?

Hi,

I was wondering if the English trained model behind your demo is available for others to use. I hope this is the case.

Colin Goldberg

Is there a trained model for Kazakh available for download somewhere?

Fix version check in io.py

This is in reference to Gensim PR #1067.
On Python 2.6 (Travis job), the version check on line 18 of morfessor/io.py fails with the following error:
Traceback (most recent call last):

  File "/home/travis/build/RaRe-Technologies/gensim/gensim/test/test_varembed_wrapper.py", line 49, in testEnsembleMorphemeEmbeddings
    morfessor_model=varembed_model_morfessor_file, use_morphemes=True)
  File "/home/travis/build/RaRe-Technologies/gensim/gensim/models/wrappers/varembed.py", line 70, in load_varembed_format
    import morfessor
  File "/home/travis/miniconda2/envs/gensim-test/lib/python2.6/site-packages/morfessor/__init__.py", line 29, in <module>
    from .cmd import main, get_default_argparser, main_evaluation, \
  File "/home/travis/miniconda2/envs/gensim-test/lib/python2.6/site-packages/morfessor/cmd.py", line 12, in <module>
    from .io import MorfessorIO
  File "/home/travis/miniconda2/envs/gensim-test/lib/python2.6/site-packages/morfessor/io.py", line 18, in <module>
    PY3 = sys.version_info.major == 3
AttributeError: 'tuple' object has no attribute 'major'

There seems to be a really simple fix that gets this working on Python 2.6: use sys.version_info[0] instead of sys.version_info.major.
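
The proposed change can be sketched as follows. Named attributes such as `.major` were added to `sys.version_info` in Python 2.7, while index access works on every version, including 2.6. The `PY3` name matches the traceback above; the rest of io.py is not reproduced here.

```python
import sys

# sys.version_info is a plain tuple on Python 2.6, so attribute access
# (.major) fails there; indexing works on every Python version.
PY3 = sys.version_info[0] == 3
```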

If that's fine, I'll go ahead and submit a PR with this fix, since we need it in order to integrate that PR into Gensim.

Trained models

Is there a trained model for Finnish available for download somewhere?

Morfessor Models Sizes

I am using Morfessor with a word count file generated from Wikipedia. I noticed that the larger the word count file is, the larger the model is. The pickle file is around 0.5 GiB.

Is there a correlation?

What do you think the best practice is?

UnicodeDecodeError when installing via pip

Downloading Morfessor-2.0.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-wy451zlq/morfessor/setup.py", line 9, in <module>
        main_py = open('morfessor/__init__.py').read()
      File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 373: ordinal not in range(128)

cf061cf#diff-4ffe01edeab0886c81c728bf704ac894R13

maybe you need to add # -*- coding: utf-8 -*-
https://www.python.org/dev/peps/pep-0263/
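
A note on the suggestion above: the traceback fails inside `open(...).read()`, where the file is decoded with the locale's default encoding (ASCII here). A PEP 263 coding line affects how the interpreter parses a source file, not how `open()` decodes one. A possible alternative, sketched below under that assumption, is to pass the encoding explicitly when reading the file. The helper name and the demo file are hypothetical.

```python
import io
import os
import tempfile


def read_text_utf8(path):
    """Read a text file as UTF-8 regardless of the locale's default."""
    # io.open accepts `encoding` on both Python 2 and Python 3.
    with io.open(path, encoding="utf-8") as fobj:
        return fobj.read()


# Demo: write a small file containing a non-ASCII character and read it back.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_init.py")
with io.open(demo_path, "w", encoding="utf-8") as fobj:
    fobj.write(u"__author__ = 'Gr\u00f6nroos'\n")

text = read_text_utf8(demo_path)
```

In UTF-8, the character `ö` is encoded as the bytes 0xC3 0xB6, which is consistent with the 0xc3 byte reported in the error.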

--output-newlines squeezes multiple newlines

Command-line.

flammie@saarkaany ~/Koodit/mt-development/complexity-stats (2145) [01:21:54] 
$ cat > kolme
yksi

kolme
flammie@saarkaany ~/Koodit/mt-development/complexity-stats (2146) [01:22:08] 
$ morfessor -l europarl-v7.fi-en.fi.morfessor --output-format-separator '> <' --output-newlines --output-format '{analysis} ' -T - < kolme
INFO:morfessor.io:Loading model from 'europarl-v7.fi-en.fi.morfessor'...
INFO:morfessor.io:Done.
No training data files specified.
Segmenting test data...
INFO:morfessor.io:Reading corpus from '-'...
yksi 
kolme 
INFO:morfessor.io:Done.

Done.

There should be an empty line between yksi and kolme. This matters in machine translation pipelines, where tools commonly fail when the line counts of input and output don't match.

Unpickling a binary model fails

Hi,
I am trying to load a model on Python 3.6 using the Python API, but it fails.

Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import morfessor
>>> io = morfessor.MorfessorIO()
>>> mf = 'something.morfmodel.bin'
>>> model = io.read_binary_model_file(mf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/someone/libs/miniconda3/envs/py3/lib/python3.6/site-packages/morfessor/io.py", line 179, in read_binary_model_file
    model = pickle.load(fobj)
AttributeError: 'ConstrNode' object has no attribute '__dict__'

It is possible that the model was trained on Python 2 (I am not sure; my coworker trained it).
The question is: shouldn't a model trained on Python 2 work on Python 3, given that the same code is used?

KeyError

Hello,
I'm getting this issue:

p3/bin/morfessor -t en-cs/train.en.tok --num-morph-types 50000 -S morf-models/morf-model.train.en-cs.50k.en -s morf-model.train.en-cs.50k.pickle.en

INFO:morfessor.io:Reading corpus from 'en-cs/train.en.tok'...
INFO:morfessor.io:Detected utf-8 encoding
INFO:morfessor.io:Done.
INFO:morfessor.baseline:Compounds in training data: 1938261 types / 1938261 tokens
INFO:morfessor.baseline:Starting batch training
INFO:morfessor.baseline:Epochs: 0       Cost: 75567655.89912468
.......................................................ERROR:morfessor:Fatal Error <class 'KeyError'> 'lhjij'
Traceback (most recent call last):
  File "p3/bin/morfessor", line 22, in <module>
    main(sys.argv[1:])
  File "p3/bin/morfessor", line 13, in main
    morfessor.main(args)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/cmd.py", line 435, in main
    args.finish_threshold, args.maxepochs)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 595, in train_batch
    segments = self._recursive_optimize(w, *algorithm_params)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 299, in _recursive_optimize
    constructions += self._recursive_split(part)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 312, in _recursive_split
    rcount, count = self._remove(construction)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 124, in _remove
    rcount, count, splitloc = self._analyses[construction]
KeyError: 'lhjij'

Morfessor (2.0.3)

The input file is the tokenized English side of CzEng. Is that correct?

How to save the segmented words to a file?

Hi,
I use the following command for model training (Morfessor 2.0):
morfessor-train --traindata-list --logfile=log.log -S model.segm -d ones inputdata.txt
Then I use the following command for word segmentation:
morfessor-segment -L model.segm test.txt
Why is the output printed to the terminal after segmentation? How can I save the segmented words to a specified file?

Looking forward to your advice or answers.
Best regards,

yapingzhao

Segmented output format

I used morfessor-segment -L en.model test.data > test.morf

It works, but the resulting file test.morf has one word on each line. My corpus has one sentence per line, and I would like the output in the same format, but I cannot find out how to achieve that.
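
One way to get sentence-per-line output is to do the regrouping in Python: segment each word of a line and join the results back onto a single line. The sketch below assumes a caller-supplied `segment_word` callable. The function names, the `+` morph separator, and the toy segmenter are all hypothetical stand-ins, not Morfessor functions.

```python
def segment_sentences(lines, segment_word, morph_sep="+"):
    """Yield one output line per input sentence.

    segment_word is any callable mapping a word to a list of morphs;
    it is a placeholder here, not a Morfessor function.
    """
    for line in lines:
        words = line.split()
        # Segment each word, then put the sentence back on a single line.
        yield " ".join(morph_sep.join(segment_word(w)) for w in words)


def toy_segment(word):
    # Stand-in segmenter: split off a final "s" if there is one.
    if len(word) > 1 and word.endswith("s"):
        return [word[:-1], "s"]
    return [word]


result = list(segment_sentences(
    ["cats chase dogs\n", "\n", "a bird sings\n"], toy_segment))
```

With a loaded model, `segment_word` could be something like `lambda w: model.viterbi_segment(w)[0]`, following the API call quoted in an earlier issue on this page. Empty input lines come out as empty output lines, so the line counts of input and output stay aligned.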

Thanks in advance

Is the tokenizer.model deterministic?

Hi, I'm developing a tokenizer for Korean.
Since my project is to develop a language model using SRILM's ngram, the role of the tokenizer is very important.
I couldn't run the experiment myself because the corpus is very large, but I'd like a quick answer, so I'm leaving an issue.

Is the result of Morfessor deterministic? In other words, will the same model be created if training is repeated dozens of times?
If it is non-deterministic, is there any index or method to measure how much the resulting tokenizers vary?
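
On the second question, one simple option (a sketch, not an established metric and not something Morfessor provides) is to segment the same word list with two trained models and report the fraction of words that both segment identically. The function name and the dict-based input format are assumptions made for illustration.

```python
def segmentation_agreement(run_a, run_b):
    """Fraction of shared words segmented identically in both runs.

    run_a and run_b map each word to its list of morphs; this input
    format is a hypothetical choice for the sketch.
    """
    shared = set(run_a) & set(run_b)
    if not shared:
        return 0.0
    same = sum(1 for w in shared if list(run_a[w]) == list(run_b[w]))
    return same / float(len(shared))


run_1 = {"design": ["de", "sign"], "biography": ["bio", "graphy"]}
run_2 = {"design": ["de", "sign"], "biography": ["bi", "o", "graphy"]}
agreement = segmentation_agreement(run_1, run_2)
```

A value of 1.0 means the two runs agree on every shared word. Lower values give a rough sense of how much the tokenizers differ, though they say nothing about which run is better for the downstream language model.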
