Comments (6)

svirpioj commented on August 26, 2024

There is nothing wrong with the command, so the problem must be in the data. Can you show an example of what en-cs/train.en.tok looks like? A minimal example that reproduces the error would be great.

from morfessor.

Gldkslfmsd commented on August 26, 2024

Thanks for the reply. Here it is:

machacek@cosmos:/net/work/people/machacek/morf-seg-nmt$ cat en-cs/s
The Tanguts called their own state " phiow ¹ -bjij ² -lhjij-lhjij ² " which translates as " The Great State of the White and the Lofty . "
Since it was located in the west , the Chinese name is Xi-Xia ( 西夏 ) , literally " Western Xia , " and thus that name is often used in Sinological literature .
machacek@cosmos:/net/work/people/machacek/morf-seg-nmt$ p3/bin/morfessor -t en-cs/s
INFO:morfessor.io:Reading corpus from 'en-cs/s'...
INFO:morfessor.io:Detected utf-8 encoding
INFO:morfessor.io:Done.
INFO:morfessor.baseline:Compounds in training data: 46 types / 46 tokens
INFO:morfessor.baseline:Starting batch training
INFO:morfessor.baseline:Epochs: 0	Cost: 961.810623090051
.........................ERROR:morfessor:Fatal Error <class 'KeyError'> 'lhjij'
Traceback (most recent call last):
  File "p3/bin/morfessor", line 22, in <module>
    main(sys.argv[1:])
  File "p3/bin/morfessor", line 13, in main
    morfessor.main(args)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/cmd.py", line 435, in main
    args.finish_threshold, args.maxepochs)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 595, in train_batch
    segments = self._recursive_optimize(w, *algorithm_params)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 299, in _recursive_optimize
    constructions += self._recursive_split(part)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 312, in _recursive_split
    rcount, count = self._remove(construction)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 124, in _remove
    rcount, count, splitloc = self._analyses[construction]
KeyError: 'lhjij'

svirpioj commented on August 26, 2024

Thanks! This looks like a bug in how forced splits around certain characters (by default, hyphens) are handled. It affects tokens with a pattern like "-lhjij-lhjij" (or, more generally, (\F.{2,}).*\1, where \F is any character in the force-split list).
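To check a corpus for tokens of this shape before training, something like the following might help. It is only a rough sketch: it reads the quantifier in the pattern above as "two or more characters" (`{2,}` in Python syntax), which is an assumption, and the function names `risky_pattern` and `find_risky_tokens` are made up for illustration and are not part of Morfessor.

```python
import re


def risky_pattern(forcesplit_chars="-"):
    """Build a regex for: a force-split character followed by two or
    more characters, with that whole chunk repeated later in the token."""
    # re.escape keeps characters such as '-' or ']' literal inside the class.
    char_class = "[" + re.escape(forcesplit_chars) + "]"
    return re.compile("(" + char_class + ".{2,}).*\\1")


def find_risky_tokens(line, forcesplit_chars="-"):
    """Return the whitespace-separated tokens of `line` that match."""
    if not forcesplit_chars:
        # No forced splitting configured, so nothing can match.
        return []
    pattern = risky_pattern(forcesplit_chars)
    return [tok for tok in line.split() if pattern.search(tok)]


if __name__ == "__main__":
    sample = "phiow ¹ -bjij ² -lhjij-lhjij ² Xi-Xia"
    print(find_risky_tokens(sample))
```

On the sample sentence from the report, only "-lhjij-lhjij" should be flagged; "-bjij" and "Xi-Xia" contain a hyphen but no repeated chunk.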

While we are fixing this, you can use --forcesplit "" to disable forced splitting for hyphens.

Gldkslfmsd commented on August 26, 2024

> While we are fixing this, you can use --forcesplit "" to disable forced splitting for hyphens.

Will I get exactly the same output for all other files with and without this option? I want all my corpora to be processed in exactly the same way. Do I have to repeat the training?

svirpioj commented on August 26, 2024

> Will I get exactly the same output for all other files with and without this option? I want all my corpora to be processed in exactly the same way. Do I have to repeat the training?

The model will naturally be somewhat different with and without forced splits, although hyphens are in any case split in most contexts. But forced splits are applied only during training, so once you have a model file, the option does not affect the Viterbi segmentations produced by the model.
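To illustrate why the option has no effect after training, here is a toy Viterbi segmenter. It is a simplified sketch and not Morfessor's actual implementation: the function `viterbi_segment`, the `unknown_cost` fallback, and the morph costs are all made up for the example. The point is only that decoding looks at the stored morphs and their costs and has no forced-split setting at all.

```python
import math


def viterbi_segment(word, morph_costs, unknown_cost=20.0):
    """Return the lowest-cost split of `word` into morphs.

    `morph_costs` maps each known morph to a cost (think negative
    log-probability). Single unknown characters are allowed at
    `unknown_cost` so that every word can be segmented.
    """
    n = len(word)
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost of word[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last morph
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in morph_costs:
                cost = morph_costs[piece]
            elif end - start == 1:
                cost = unknown_cost
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end of the word to recover the morphs.
    morphs = []
    pos = n
    while pos > 0:
        morphs.append(word[back[pos]:pos])
        pos = back[pos]
    return morphs[::-1]


if __name__ == "__main__":
    # Hypothetical costs, as if learned without forced splits on hyphens.
    toy_model = {"-": 2.0, "lhjij": 8.0, "-lhjij": 9.0}
    print(viterbi_segment("-lhjij-lhjij", toy_model))
```

With these made-up costs the hyphenated morph "-lhjij" is cheaper than "-" plus "lhjij", so the toy decoder keeps it whole. Whether a real model does the same depends entirely on what it learned during training.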

I assume that you are using the output for machine translation. In that case I would not use forced splits on hyphens anyway, but let the model decide whether to leave frequent word parts with hyphens unsegmented.

svirpioj commented on August 26, 2024

Fixed in 2.0.4.
