Comments (3)
Sorry, the demo models are not currently available for download. We'll look into it, but there may be compatibility issues with the current version.
However, most of the models can easily be retrained with the Morpho Challenge data sets. For example, the unsupervised English model should be essentially the same as the output of these commands:
wget http://morpho.aalto.fi/events/morphochallenge2009/data/wordlist.eng.gz
morfessor-train -s unsup_model.bin --traindata-list wordlist.eng.gz
And the English semi-supervised model (based on the parameters shown on the demo page):
wget http://morpho.aalto.fi/events/morphochallenge2010/data/goldstd_trainset.segmentation.eng
morfessor-train -s semisup_model.bin --traindata-list wordlist.eng.gz -A goldstd_trainset.segmentation.eng -w 0.83 -W 361.32
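For anyone who would rather script the two training runs, here is a minimal sketch. It assumes `morfessor-train` is on the PATH and the data files have already been downloaded. The helper names `build_train_command` and `train` are illustrative and not part of Morfessor; only the flags quoted in the commands above are used.

```python
import subprocess


def build_train_command(model_path, wordlist, annotations=None,
                        corpus_weight=None, annotation_weight=None):
    """Assemble a morfessor-train call from the flags quoted in this thread.

    Hypothetical helper, not part of Morfessor.
    """
    cmd = ["morfessor-train", "-s", model_path, "--traindata-list", wordlist]
    if annotations is not None:
        cmd += ["-A", annotations]            # annotated (gold standard) segmentations
    if corpus_weight is not None:
        cmd += ["-w", str(corpus_weight)]
    if annotation_weight is not None:
        cmd += ["-W", str(annotation_weight)]
    return cmd


def train(**kwargs):
    """Run the training; raises CalledProcessError if morfessor-train fails."""
    subprocess.run(build_train_command(**kwargs), check=True)


# The two commands from the comment above, expressed as argument lists.
UNSUP_CMD = build_train_command("unsup_model.bin", "wordlist.eng.gz")
SEMISUP_CMD = build_train_command(
    "semisup_model.bin", "wordlist.eng.gz",
    annotations="goldstd_trainset.segmentation.eng",
    corpus_weight=0.83, annotation_weight=361.32,
)
```

Calling `train(model_path="unsup_model.bin", wordlist="wordlist.eng.gz")` would then start the unsupervised run.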
from morfessor.
Could you provide a developer-friendly interface and make available models trained on an open source such as Wikipedia dumps? There's a use case for off-the-shelf decompounding and morphological splitting tools, but Morfessor doesn't ship trained models, so it's not convenient enough for developers to try. Right now, even if you know how to use Morfessor, there's rarely time to train and tune models for a project where it could be useful.
Ideally, splitting with Morfessor would be as easy as this:
import morfessor
morfessor_model = morfessor.read_model("finnish_model.pkl")
morfessor_model.split("Lentokonesuihkuturbiinimoottoriapumekaanikkoaliupseerioppilas")
Better yet, follow the scikit-learn API for the model, so that it is accessed through .fit() and .transform() methods. This would make it accessible to a wider community.
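As a rough illustration of that request, a thin wrapper could be written outside the library. The class name `MorfessorSegmenter` and its `load` helper are hypothetical. The Morfessor calls inside it (`MorfessorIO.read_binary_model_file`, `BaselineModel.load_data`, `train_batch`, `viterbi_segment`) reflect my understanding of the Morfessor 2.0 Python API and should be checked against the installed version.

```python
class MorfessorSegmenter:
    """Hypothetical scikit-learn-style wrapper; not part of Morfessor."""

    def __init__(self, model=None):
        # Any object with viterbi_segment(word) -> (segments, cost) will do.
        self.model = model

    @classmethod
    def load(cls, path):
        """Wrap a previously trained binary model file."""
        import morfessor  # imported lazily so the class definition has no hard dependency
        return cls(morfessor.MorfessorIO().read_binary_model_file(path))

    def fit(self, words, y=None):
        """Train a baseline model on an iterable of words, unless one is already set."""
        if self.model is None:
            import morfessor
            model = morfessor.BaselineModel()
            # Assumption: load_data takes (count, compound) pairs; each word counted once.
            model.load_data((1, word) for word in words)
            model.train_batch()
            self.model = model
        return self

    def transform(self, words):
        """Return one list of segments per input word."""
        return [self.model.viterbi_segment(word)[0] for word in words]
```

With a trained model on disk, usage would look like `MorfessorSegmenter.load("unsup_model.bin").transform(["unsupervised"])`. Passing the model in through the constructor also keeps the wrapper easy to test with a stand-in object.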
from morfessor.
I would suggest treating model files the way you would treat compiled executables. Store the open-source-licensed source data for an individual model in a single GitHub repository (possibly using git-lfs to reduce disk usage for updates), then add a Makefile or similar to train the model automatically, and attach the model binaries to each source data release. If multiple models share source data, you could create one GitHub repository containing all of it.
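The "Makefile or similar" step mostly comes down to retraining only when the source data is newer than the model binary. Here is a minimal sketch of that check, with hypothetical helper names `needs_rebuild` and `build_if_stale`; the actual training call is passed in as a callable, so nothing here depends on Morfessor itself.

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of the source files."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def build_if_stale(target, sources, build):
    """Call build() only when the model is out of date; report whether it ran."""
    if needs_rebuild(target, sources):
        build()
        return True
    return False
```

A release script could call `build_if_stale("model.bin", ["wordlist.txt"], run_training)`, where `run_training` is whatever function invokes the trainer, and then upload `model.bin` as a release asset.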
from morfessor.