Comments (8)
This is typical behavior for Morfessor models: the larger the training data, the larger the morph lexicon and the longer the average morph length.
There are several options to reduce the model size:
- Train the model with word types (-d ones), if you are not already doing so.
- Set the corpus weight parameter below one (e.g. -w 0.1).
- Discard low-frequency words from training (e.g. --batch-minfreq 2). Useful especially if the data is noisy (likely not the case with Wikipedia data).
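To make two of these options concrete, here is a minimal sketch in plain Python of what type-based training and a minimum-frequency cutoff do to the word counts before training. It is an illustration only, not Morfessor's own code: the function name `prepare_training_counts` and its parameters are hypothetical, and applying the cutoff before flattening the counts is an assumption. The corpus weight (-w) changes the cost function rather than the data, so it is not covered here.

```python
from collections import Counter


def prepare_training_counts(tokens, min_freq=1, use_types=False):
    """Illustrative preprocessing of a token list (hypothetical helper).

    min_freq  -- drop words seen fewer than min_freq times
                 (the idea behind --batch-minfreq)
    use_types -- replace every remaining count with 1
                 (the idea behind -d ones)
    """
    counts = Counter(tokens)
    # Assumption: the frequency cutoff uses the raw corpus counts.
    kept = {word: count for word, count in counts.items() if count >= min_freq}
    if use_types:
        # Each distinct word form now contributes exactly once.
        kept = {word: 1 for word in kept}
    return kept


tokens = "walked walking walked walks walkd walking walked".split()
print(prepare_training_counts(tokens, min_freq=2, use_types=True))
```

With the cutoff at 2, the rare forms "walks" and the misspelled "walkd" are discarded, which is why this option helps most on noisy data.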
More details and discussion can be found, for example, in this article: http://dspace.utlib.ee/dspace/handle/10062/17313
If your concern is just the size of the model file, you can try saving it in gzipped Morfessor 1.0 format. It is slower to load and does not store any training parameters, but the file should be smaller.
from morfessor.
In the next version (which will be released in the coming months), there is an option for storing a reduced model, that is, a model that can only be used for segmenting data.
Is there any progress on the issue of reducing the size of the trained models?
Yes, we have implemented reduced models, and we have been using them internally for a long time. The release of Morfessor 2.1 should come soon, but until then you can already use this branch: https://github.com/phsmit/morfessor/tree/develop
On the command line there is the --save-reduced option; in the code it is model.make_segment_only().
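As a rough sketch of the code route, the function below loads a trained binary model, reduces it, and writes it back out. The call model.make_segment_only() is the one named above; using morfessor.MorfessorIO with read_binary_model_file and write_binary_model_file for loading and saving is an assumption about the installed version, and the function name and file paths are hypothetical.

```python
def save_reduced_model(model_in, model_out):
    """Load a trained model, strip it to segmentation-only, and save it.

    Hypothetical helper; the import is kept inside the function so the
    definition itself does not require Morfessor to be installed.
    """
    import morfessor

    io = morfessor.MorfessorIO()
    # Assumption: the input is a binary (pickled) Morfessor 2.0 model.
    model = io.read_binary_model_file(model_in)
    # Drops the data needed for further training; the model can then
    # only be used for segmenting.
    model.make_segment_only()
    io.write_binary_model_file(model_out, model)


# Example call (paths are placeholders):
# save_reduced_model("model.bin", "model.reduced.bin")
```

Since a reduced model can no longer be trained, it is probably worth keeping the full model file as well.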
It does indeed reduce the size of the models. It seems the option is already available in the PyPI package (Morfessor 2.0.2alpha1). Is it necessary to use the development branch?
Once I have trained a model, can I use the PyPI version to actually segment text, or do I still need the development branch?
I am developing a package that will use Morfessor as the backend for text segmentation, and I would like to use the PyPI package to manage my dependencies.
Ah, indeed. No need to use the development branch. The models between the develop and alpha branches should be interchangeable, but I can't guarantee it. We are thinking of more persistent models, but it is not easy...
I've just downloaded Morfessor-2.0.2a4 on Ubuntu. I could not load the Morfessor 1.0 style text model; it throws a "no such directory exist" error. Where can I find this file?
@bhashi12 Sorry, I had not seen this question before. If the problem still persists, could you open a new issue?