I noticed I was getting better results when doing a fresh load of the included diction

Bigrams omitted from save and load pickle methods about symspellpy HOT 5 CLOSED

k-dal commented on June 14, 2024

Bigrams omitted from save and load pickle methods

from symspellpy.

Comments (5)

mammothb commented on June 14, 2024

@k-dal Hi, sorry, I think it's because pickling was implemented before bigrams and I forgot to add them when implementing bigrams. If you have managed to fix the issue on your end, would you like to submit a pull request so bigrams are included during pickling?

k-dal commented on June 14, 2024

@mammothb Got it, that makes sense (and no need to apologize!).

Since you're open to a pull request, I'm thinking a handful of pickler-related alterations/upgrades might be good to do at the same time:

1. Add missing attributes to pickle save & load functions. Specifically:
   1. bigrams
   2. below_threshold_words
   3. replaced_words
   4. max_dictionary_edit_distance
   5. prefix_length
   6. count_threshold
2. An accessory function to raise a warning if load_pickle imports a state where the loaded params conflict with the SymSpell instance's existing params.
   1. Since it looks like the dict attribute entries are heavily influenced by the parameter values (e.g., prefix_length, etc) at time of creation, it seems worthwhile to warn the user if she initializes the class one way but changes parameter states when loading her pickle.
   2. Does this seem sensible from your perspective? I'm hesitant to omit the int params entirely - it feels like important context/info would be missing from the pickle without the params.
3. A flag to call pickle.dumps() and pickle.loads() instead of the file/stream .dump() and .load() functions.
   1. My motivation here is that strings may be preferable if the pickle is going to be stored in a database rather than written to file.
   2. It may also make things easier when using symspellpy in a serverless environment since you don't always have a persistent file system available.
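
For illustration, a minimal sketch of what pickling the six listed attributes could look like. `TinySpell`, `PICKLED_FIELDS`, `to_state` and `from_state` are hypothetical names invented for this example and are not part of symspellpy; the default parameter values are placeholders. The round trip goes through pickle.dumps() and pickle.loads() so no file is involved.

```python
import pickle
from dataclasses import dataclass, field

# Attribute names come from the proposal; everything else is hypothetical.
PICKLED_FIELDS = (
    "bigrams",
    "below_threshold_words",
    "replaced_words",
    "max_dictionary_edit_distance",
    "prefix_length",
    "count_threshold",
)


@dataclass
class TinySpell:
    """Hypothetical stand-in for a SymSpell instance, not the library class."""

    max_dictionary_edit_distance: int = 2  # placeholder default
    prefix_length: int = 7  # placeholder default
    count_threshold: int = 1  # placeholder default
    bigrams: dict = field(default_factory=dict)
    below_threshold_words: dict = field(default_factory=dict)
    replaced_words: dict = field(default_factory=dict)

    def to_state(self) -> dict:
        """Collect every attribute that should end up in the pickle."""
        return {name: getattr(self, name) for name in PICKLED_FIELDS}

    @classmethod
    def from_state(cls, state: dict) -> "TinySpell":
        """Rebuild an instance from a previously saved state dict."""
        return cls(**{name: state[name] for name in PICKLED_FIELDS})


original = TinySpell(prefix_length=5)
original.bigrams["new york"] = 42

blob = pickle.dumps(original.to_state())  # in-memory bytes, no file needed
restored = TinySpell.from_state(pickle.loads(blob))
```

Note that pickle.dumps() returns a bytes object rather than a str, so a database column would need a binary type (or the bytes would need encoding first). As with any pickle, the blob should only be loaded from a trusted source.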

Anyway, thank you for the quick reply! Let me know what you think.

mammothb commented on June 14, 2024

@k-dal

1. I think pickling was originally intended to avoid having to parse the dictionary files again, as some users have large dictionary files and it takes very long to load those. replaced_words contains the replaced/corrected words while running lookup_compound. I am picturing the following use case:
   The user loads a dictionary file once and would like to skip the creation of dictionary entries and use the same dictionary for multiple projects in the future, perhaps similar to loading model weights for a deep learning model. replaced_words behaves like some sort of output in this case, so I don't think it is necessary to include it in the pickle.
2. I agree, the int params are necessary if we are planning to load additional dictionary files to a previously processed pickle. There is some form of version checking here, perhaps the accessory function could be placed around there.
3. I think the current state of how saving and loading pickle work is partly from the discussion at #31 (comment). If the new save and load functions do not break existing use cases, I am fine adding a flag to call pickle.dumps() and pickle.loads().
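
A rough sketch of the accessory check discussed above, assuming the instance's params and the pickled params are both available as plain dicts. `PARAM_NAMES`, `find_param_conflicts` and `warn_on_param_conflicts` are hypothetical helpers made up for this example; where such a check would actually sit relative to the existing version check is not shown.

```python
import warnings

# The three int params named in the thread.
PARAM_NAMES = ("max_dictionary_edit_distance", "prefix_length", "count_threshold")


def find_param_conflicts(current: dict, loaded: dict) -> dict:
    """Return {name: (instance_value, pickled_value)} for every mismatch."""
    return {
        name: (current[name], loaded[name])
        for name in PARAM_NAMES
        if name in loaded and current[name] != loaded[name]
    }


def warn_on_param_conflicts(current: dict, loaded: dict) -> dict:
    """Emit one UserWarning per conflicting param and return the conflicts."""
    conflicts = find_param_conflicts(current, loaded)
    for name, (have, got) in conflicts.items():
        warnings.warn(
            f"{name}: instance was created with {have}, "
            f"but the pickled entries were built with {got}",
            UserWarning,
            stacklevel=2,
        )
    return conflicts
```

Calling `warn_on_param_conflicts({"max_dictionary_edit_distance": 2, "prefix_length": 7, "count_threshold": 1}, {"max_dictionary_edit_distance": 2, "prefix_length": 5, "count_threshold": 1})` would warn once, about prefix_length. Params missing from an older pickle are skipped rather than reported, which is a design choice of this sketch.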

k-dal commented on June 14, 2024

@mammothb

1. Got it, that explanation helps a lot. You're definitely right, so I'll plan to include everything except for replaced_words and will make note of that deliberate omission in the docstring(s).
2. Ah okay, yes, putting it in this area makes sense. I wasn't sure what data_version meant -- since it was hardcoded, I figured it was probably referring to the version of the included .txt dictionary files. Sounds like there should also be some contextual reflection of the "version" when pickled too. Maybe a pickled_at timestamp or optional version_name to capture some context. Other ideas welcome.
3. This context helps. I had the same experience with compression causing slow-downs, so now I understand why the compression boolean flag is there. Adding pickle.dumps() and pickle.loads() won't break any existing use cases. We can treat it exactly the same way as compression, except the default will be False so any existing uses are unaffected.
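
One way the opt-in flag could sit next to the compression flag, sketched as standalone functions. `save_state`, `load_state`, `to_bytes` and `from_bytes` are hypothetical names chosen for this example, not symspellpy's actual signatures. With the new flags left at False the functions only read and write files, which is the "existing uses are unaffected" part.

```python
import gzip
import pickle
from typing import Optional


def save_state(state: dict, filename: Optional[str] = None,
               compressed: bool = True, to_bytes: bool = False) -> Optional[bytes]:
    """Write state to a file, or return it as bytes when to_bytes=True."""
    if to_bytes:
        blob = pickle.dumps(state)
        return gzip.compress(blob) if compressed else blob
    if filename is None:
        raise ValueError("filename is required unless to_bytes=True")
    opener = gzip.open if compressed else open
    with opener(filename, "wb") as stream:
        pickle.dump(state, stream)
    return None


def load_state(source, compressed: bool = True, from_bytes: bool = False) -> dict:
    """Read state from a file path, or from a bytes object when from_bytes=True."""
    if from_bytes:
        blob = gzip.decompress(source) if compressed else source
        return pickle.loads(blob)
    opener = gzip.open if compressed else open
    with opener(source, "rb") as stream:
        return pickle.load(stream)
```

For example, `blob = save_state(state, to_bytes=True)` followed by `load_state(blob, from_bytes=True)` round-trips without touching the file system, which is the database and serverless case mentioned earlier in the thread.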

Great, I'm looking forward to contributing here!

(P.S. I'm getting married this weekend, and time is a bit tight this week and next while family's in town. So it may take me a few weeks to submit a PR. Let me know if any additional thoughts come to mind in the meantime.)

mammothb commented on June 14, 2024

@k-dal Congratulations! There's no rush on the PR.

I think data_version was added when we thought we should break compatibility with earlier versions of the pickle and force users to recreate the pickle. Perhaps version_name would be helpful since we want to track the fields in the pickle relative to the state of the codebase/library.
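
A small sketch of how data_version, version_name and pickled_at might travel together in the pickled payload. `DATA_VERSION`'s value, `wrap_state` and `unwrap_state` are hypothetical, and raising on a mismatch is an assumption of this example, not a description of what the library currently does.

```python
from datetime import datetime, timezone
from typing import Optional

DATA_VERSION = 3  # hypothetical value, bumped when the pickle layout changes


def wrap_state(state: dict, version_name: Optional[str] = None) -> dict:
    """Attach version metadata to the state before it is pickled."""
    return {
        "data_version": DATA_VERSION,
        "version_name": version_name,  # optional free-form label
        "pickled_at": datetime.now(timezone.utc).isoformat(),
        "state": state,
    }


def unwrap_state(payload: dict) -> dict:
    """Return the state, refusing payloads from an incompatible layout."""
    found = payload.get("data_version")
    if found != DATA_VERSION:
        raise ValueError(
            f"pickle has data_version {found!r}, expected {DATA_VERSION}; "
            "recreate the pickle"
        )
    return payload["state"]
```

Here data_version stays the hard compatibility gate, while version_name and pickled_at are informational only and never cause a load to fail.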
