Comments (6)
There is nothing wrong with the command, so it must be something in the data. Can you show an example of what en-cs/train.en.tok looks like? A minimal example that produces the error would be great.
from morfessor.
Thanks for the reply. Here it is:
machacek@cosmos:/net/work/people/machacek/morf-seg-nmt$ cat en-cs/s
The Tanguts called their own state " phiow ¹ -bjij ² -lhjij-lhjij ² " which translates as " The Great State of the White and the Lofty . "
Since it was located in the west , the Chinese name is Xi-Xia ( 西夏 ) , literally " Western Xia , " and thus that name is often used in Sinological literature .
machacek@cosmos:/net/work/people/machacek/morf-seg-nmt$ p3/bin/morfessor -t en-cs/s
INFO:morfessor.io:Reading corpus from 'en-cs/s'...
INFO:morfessor.io:Detected utf-8 encoding
INFO:morfessor.io:Done.
INFO:morfessor.baseline:Compounds in training data: 46 types / 46 tokens
INFO:morfessor.baseline:Starting batch training
INFO:morfessor.baseline:Epochs: 0 Cost: 961.810623090051
.........................ERROR:morfessor:Fatal Error <class 'KeyError'> 'lhjij'
Traceback (most recent call last):
  File "p3/bin/morfessor", line 22, in <module>
    main(sys.argv[1:])
  File "p3/bin/morfessor", line 13, in main
    morfessor.main(args)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/cmd.py", line 435, in main
    args.finish_threshold, args.maxepochs)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 595, in train_batch
    segments = self._recursive_optimize(w, *algorithm_params)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 299, in _recursive_optimize
    constructions += self._recursive_split(part)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 312, in _recursive_split
    rcount, count = self._remove(construction)
  File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 124, in _remove
    rcount, count, splitloc = self._analyses[construction]
KeyError: 'lhjij'
from morfessor.
Thanks! This looks like a bug in how forced splits around certain characters (hyphens by default) are handled. It affects specific patterns such as "-lhjij-lhjij" or, more generally, (\F.{2,}).*\1, where \F is any character in the forced-split list.
While we are fixing this, you can use --forcesplit "" to disable forced splitting for hyphens.
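To find which tokens in a corpus fit the pattern described above, a small scan like the following may help. It is only a sketch: the function names are made up for illustration, the regular expression is a literal reading of (\F.{2,}).*\1 with \F expanded to a character class, and it does not reproduce Morfessor's internals. A match marks a candidate token; whether every match actually triggers the KeyError is not confirmed in this thread.

```python
import re


def build_risky_pattern(force_split_chars="-"):
    # Hypothetical helper: expands \F in (\F.{2,}).*\1 to a character class
    # built from the forced-split characters (Morfessor's default is "-").
    if not force_split_chars:
        raise ValueError("need at least one forced-split character")
    char_class = "[" + "".join(re.escape(c) for c in force_split_chars) + "]"
    return re.compile("(" + char_class + ".{2,}).*\\1")


def find_risky_tokens(line, force_split_chars="-"):
    # Split on whitespace, since the corpus in this thread is already tokenized,
    # and keep tokens in which a forced-split character plus at least two
    # following characters occurs again later in the same token.
    pattern = build_risky_pattern(force_split_chars)
    return [tok for tok in line.split() if pattern.search(tok)]


if __name__ == "__main__":
    sample = '" phiow ¹ -bjij ² -lhjij-lhjij ² "'
    print(find_risky_tokens(sample))
```

Running this over each line of a training file gives a short list of tokens to inspect or to cut down into a minimal test case.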
from morfessor.
Does it produce exactly the same output for all other files with and without this option? I want all my corpora to be processed in exactly the same way. Do I have to repeat the training?
from morfessor.
The model will naturally be somewhat different with and without forced splits, although hyphens are split in most contexts in any case. Forced splits are applied only during training, so once you have a model file, the option does not affect the Viterbi segmentations produced by the model.
I assume that you are using the output for machine translation. In that case I would not use forced splits on hyphens anyway, but let the model decide whether to leave frequent word parts with hyphens unsegmented.
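As a rough illustration of the point about segmentation being independent of the training-time option, the sketch below loads a saved model through Morfessor's Python API and segments tokens with it. The helper names and the file name model.bin are assumptions for the example; the only library calls used are MorfessorIO.read_binary_model_file and the model's viterbi_segment method, which returns the morphs together with a cost.

```python
def load_model(path):
    # Imported lazily so segment_tokens below can be tried without Morfessor
    # installed. `path` is assumed to be a binary model saved by Morfessor.
    import morfessor
    return morfessor.MorfessorIO().read_binary_model_file(path)


def segment_tokens(model, tokens):
    # Segment each token with the trained model. No forced-split setting is
    # passed here; per the explanation above, that option matters only while
    # the model is being trained.
    segmented = []
    for token in tokens:
        morphs, _cost = model.viterbi_segment(token)
        segmented.append(morphs)
    return segmented


if __name__ == "__main__":
    model = load_model("model.bin")  # hypothetical file name
    print(segment_tokens(model, ["Xi-Xia", "Sinological"]))
```

Applying the same model file to every corpus in this way keeps the segmentation consistent across them, whichever forced-split setting was used when the model was trained.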
from morfessor.
Fixed in 2.0.4.
from morfessor.