
Comments (8)

svirpioj commented on July 23, 2024

This is typical behavior for Morfessor models: the larger the training data, the larger the morph lexicon and the longer the average morph length.

There are several options to reduce the model size:

  • Train the model with word types (-d ones), if you are not already doing so.
  • Set the corpus weight parameter below one (e.g. -w 0.1).
  • Discard low-frequency words from training (e.g. --batch-minfreq 2). Useful especially if the data is noisy (likely not the case with Wikipedia data).
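To make the first and third options concrete, here is a minimal sketch of what they do to the training counts before the model sees them. This is not Morfessor's own code: the function name `prepare_counts` and its parameters are made up for illustration, and the order of operations (threshold first, then dampening) is an assumption.

```python
from collections import Counter


def prepare_counts(tokens, dampening="ones", min_freq=1):
    """Illustrative helper (hypothetical, not part of Morfessor).

    Counts word types, drops types seen fewer than `min_freq` times
    (the idea behind --batch-minfreq), then optionally replaces every
    remaining count with 1 (the idea behind -d ones).
    """
    counts = Counter(tokens)
    prepared = {}
    for word, count in counts.items():
        if count < min_freq:
            continue  # discard low-frequency words
        if dampening == "ones":
            count = 1  # train on word types instead of tokens
        elif dampening != "none":
            raise ValueError("unknown dampening: %r" % dampening)
        prepared[word] = count
    return prepared


tokens = "the cat saw the cats and the dog saw a cat".split()
types_only = prepare_counts(tokens, dampening="ones", min_freq=2)
print(types_only)
```

With types instead of token counts, frequent words no longer dominate the corpus cost, which tends to favor a smaller lexicon.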

More details and discussion can be found, for example, in this article: http://dspace.utlib.ee/dspace/handle/10062/17313

If your concern is just the size of the model file, you can try saving it in gzipped Morfessor 1.0 format. It is slower to load and doesn't store any training parameters, but the file should be smaller.
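As a rough illustration of the gzipped text format, the sketch below writes and reads a file by hand using only the standard library. It assumes that each line has the form `<count> <morph> + <morph> ...` and that lines starting with `#` are comments; the helper names and the example segmentations are invented, and in practice Morfessor's own I/O routines would do this for you.

```python
import gzip
import os
import tempfile


def write_segmentation_gz(path, segmentations):
    """Write (count, [morphs]) pairs as gzipped text, one word per line."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for count, morphs in segmentations:
            f.write("%d %s\n" % (count, " + ".join(morphs)))


def read_segmentation_gz(path):
    """Read the file back into a list of (count, [morphs]) pairs."""
    result = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            count, analysis = line.split(" ", 1)
            result.append((int(count), analysis.split(" + ")))
    return result


# Made-up segmentations, only to exercise the round trip.
example = [(3, ["walk", "ing"]), (1, ["un", "walk", "able"])]
path = os.path.join(tempfile.mkdtemp(), "model.txt.gz")
write_segmentation_gz(path, example)
roundtrip = read_segmentation_gz(path)
```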

from morfessor.

psmit commented on July 23, 2024

In the next version (which will be released in the coming months), there is an option for storing a reduced model; a model that can only be used for segmenting data.


aboSamoor commented on July 23, 2024

Is there any progress on the issue of reducing the size of the trained models?


psmit commented on July 23, 2024

Yes, we have implemented reduced models, and we have been using them internally for a long time. The release of Morfessor 2.1 should come soon, but until then you can already use this branch: https://github.com/phsmit/morfessor/tree/develop

On the command line there is the --save-reduced option; in the code it is model.make_segment_only().
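A minimal sketch of the in-code route, assuming the Morfessor Python package is installed and that the model was saved in its binary format. The function name and both file paths are placeholders; only `make_segment_only()` comes from this thread, and the surrounding I/O calls should be checked against the installed version.

```python
def save_reduced_model(full_model_path, reduced_model_path):
    """Load a trained model, strip it to segment-only form, and save it.

    Both paths are hypothetical placeholders supplied by the caller.
    """
    # Imported here so the sketch can be read without Morfessor installed.
    import morfessor

    io = morfessor.MorfessorIO()
    model = io.read_binary_model_file(full_model_path)
    # After this call the model can only segment; it cannot be trained further.
    model.make_segment_only()
    io.write_binary_model_file(reduced_model_path, model)
    return model
```

The returned model can then be used for segmentation, for example with `model.viterbi_segment(word)`.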


aboSamoor commented on July 23, 2024

It does indeed reduce the size of the models. The option seems to be available already in the PyPI package (Morfessor 2.0.2alpha1). Is it necessary to use the development branch?

Once I have trained a model, can I use the PyPI version to segment text, or do I still need the development branch for that?

I am developing a package that will use Morfessor as the backend for text segmentation, and I would like to use the PyPI package to manage my dependencies.


psmit commented on July 23, 2024

Ah, indeed. No need to use the development branch. The models from the develop and alpha branches should be interchangeable, but I can't guarantee it. We are thinking about more persistent model formats, but it is not easy...


bhashi12 commented on July 23, 2024

I've just downloaded Morfessor-2.0.2a4 on Ubuntu. I could not load a Morfessor 1.0 style text model; it throws the error "no such directory exist". Where can I find this file?


psmit commented on July 23, 2024

@bhashi12 Sorry, I had not seen this question before. If the problem persists, would you open a new issue?

