Code Monkey home page Code Monkey logo

Comments (5)

mammothb avatar mammothb commented on June 14, 2024

@k-dal Hi, sorry, I think it's because pickling was implemented before bigrams and I forgot to add it in when implementing bigrams. If you have managed to fix the issue on your end, would you like to submit a pull request so bigrams is included during pickling?

from symspellpy.

k-dal avatar k-dal commented on June 14, 2024

@mammothb Got it, that makes sense (and no need to apologize!).

Since you're open to a pull request, I'm thinking a handful of pickler-related alterations/upgrades might be good to do at the same time:

  1. Add missing attributes to pickle save & load functions. Specifically:
    1. bigrams
    2. below_threshold_words
    3. replaced_words
    4. max_dictionary_edit_distance
    5. prefix_length
    6. count_threshold
  2. An accessory function to raise a warning if load_pickle imports a state where the loaded params conflict with the SymSpell instance's existing params.
    1. Since it looks like the dict attribute entries are heavily influenced by the parameter values (e.g., prefix_length, etc) at time of creation, it seems worthwhile to warn the user if she initializes the class one way but changes parameter states when loading her pickle.
    2. Does this seem sensible from your perspective? I'm hesitant to omit the int params entirely - it feels like important context/info would be missing from the pickle without the params.
  3. A flag to call pickle.dumps() and pickle.loads() instead of the file/stream .dump() and .load() functions.
    1. My motivation here is that strings may be preferable if the pickle is going to be stored in a database rather than written to file.
    2. It may also make things easier when using symspellpy in a serverless environment since you don't always have a persistent file system available.

Anyway, thank you for the quick reply! Let me know what you think.

from symspellpy.

mammothb avatar mammothb commented on June 14, 2024

@k-dal

  1. I think pickling was originally intended to avoid having to parse the dictionary files again as some users have large dictionary files and it takes very long to load those. replaced_words contains the replaced/corrected words while running lookup_compound. I am picturing the following use case:
    The user loads a dictionary file once and would like to skip the creation of dictionary entries and use the same dictionary for multiple projects in the future, perhaps similar to loading model weights for a deep learning model. replaced_words behaves like some sort of output in this case, so I don't think it is necessary include it in the pickle.

  2. I agree, the int params are necessary if we are planning to load additional dictionary files to a previously processed pickle. There is some form of version checking here, perhaps the accessory function could be place around there.

  3. I think the current state of how saving and loading pickle work is partly from the discussion at #31 (comment). If the new save and load functions do not break existing use cases, I am fine adding a flag to call pickle.dumps() and pickle.loads().

from symspellpy.

k-dal avatar k-dal commented on June 14, 2024

@mammothb

  1. Got it, that explanation helps a lot. You're definitely right so I'll plan to include everything except for replaced_words and will make note of that deliberate omission in the docstring(s).

  2. Ah okay, yes, putting it in this area makes sense. I wasn't sure what data_version meant -- since it was hardcoded, I figured it was probably referring to the version of the included .txt dictionary files. Sounds like there should also be some contextual reflection of the "version" when pickled too. Maybe a pickled_at timestamp or optional version_name that to capture some context. Other ideas welcome.

  3. This context helps. I had the same experience with compression causing slow-downs, so now I understand why the compression boolean flag is there. Adding pickle.dumps() and pickle.loads() won't break any existing use cases. We can treat it exactly the same way as compression, except the default will be False so any existing uses are unaffected.

Great, I'm looking forward to contributing here!

(P.S. I'm getting married this weekend, and time is a bit tight this week and next while family's in town. So it may take me a few weeks to submit a PR. Let me know if any additional thoughts come to mind in the meantime.)

from symspellpy.

mammothb avatar mammothb commented on June 14, 2024

@k-dal Congratulations! There's no rush on the PR.

  1. I think data_version was added when we thought we should break compatibility with earlier versions of the pickle and force users to recreate the pickle. Perhaps version_name would be helpful since we want to track the fields in the pickle relative to the state of the codebase/library.

from symspellpy.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.