Comments (5)
@k-dal Hi, sorry, I think it's because pickling was implemented before bigrams and I forgot to add it in when implementing bigrams. If you have managed to fix the issue on your end, would you like to submit a pull request so bigrams is included during pickling?
from symspellpy.
@mammothb Got it, that makes sense (and no need to apologize!).
Since you're open to a pull request, I'm thinking a handful of pickler-related alterations/upgrades might be good to do at the same time:
- Add missing attributes to pickle save & load functions. Specifically:
  - `bigrams`
  - `below_threshold_words`
  - `replaced_words`
  - `max_dictionary_edit_distance`
  - `prefix_length`
  - `count_threshold`
- An accessory function to raise a warning if `load_pickle` imports a state where the loaded params conflict with the SymSpell instance's existing params.
  - Since it looks like the dict attribute entries are heavily influenced by the parameter values (e.g., `prefix_length`, etc.) at time of creation, it seems worthwhile to warn the user if she initializes the class one way but changes parameter states when loading her pickle.
  - Does this seem sensible from your perspective? I'm hesitant to omit the `int` params entirely - it feels like important context/info would be missing from the pickle without the params.
- A flag to call `pickle.dumps()` and `pickle.loads()` instead of the file/stream `.dump()` and `.load()` functions.
  - My motivation here is that byte strings may be preferable if the pickle is going to be stored in a database rather than written to file.
  - It may also make things easier when using symspellpy in a serverless environment since you don't always have a persistent file system available.
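The warning idea in the second bullet could look roughly like the sketch below. It is only an illustration: the helper names (`find_param_conflicts`, `warn_on_param_conflicts`) and the plain-dict representation of the parameters are assumptions made for this example, not symspellpy's actual code.

```python
import pickle
import warnings

# Parameter names taken from the list above; everything else here is hypothetical.
PARAM_KEYS = ("max_dictionary_edit_distance", "prefix_length", "count_threshold")


def find_param_conflicts(current, loaded):
    """Return {name: (current_value, loaded_value)} for each parameter that differs."""
    return {
        key: (current.get(key), loaded[key])
        for key in PARAM_KEYS
        if key in loaded and current.get(key) != loaded[key]
    }


def warn_on_param_conflicts(current, loaded):
    """Emit a UserWarning when pickled parameters differ from the instance's."""
    conflicts = find_param_conflicts(current, loaded)
    if conflicts:
        details = ", ".join(
            f"{key}: {old!r} -> {new!r}" for key, (old, new) in conflicts.items()
        )
        warnings.warn(f"Loaded pickle overrides instance parameters ({details})")
    return conflicts


# Simulate an instance created with prefix_length=7 loading a pickle built with 5.
instance_params = {
    "max_dictionary_edit_distance": 2,
    "prefix_length": 7,
    "count_threshold": 1,
}
pickled = pickle.dumps(
    {"max_dictionary_edit_distance": 2, "prefix_length": 5, "count_threshold": 1}
)
conflicts = warn_on_param_conflicts(instance_params, pickle.loads(pickled))
```

Here only `prefix_length` differs, so a single warning is raised and the other two parameters pass silently.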
Anyway, thank you for the quick reply! Let me know what you think.
- I think pickling was originally intended to avoid having to parse the dictionary files again, as some users have large dictionary files and it takes very long to load those. `replaced_words` contains the replaced/corrected words while running `lookup_compound`. I am picturing the following use case: the user loads a dictionary file once and would like to skip the creation of dictionary entries and use the same dictionary for multiple projects in the future, perhaps similar to loading model weights for a deep learning model. `replaced_words` behaves like some sort of output in this case, so I don't think it is necessary to include it in the pickle.
- I agree, the `int` params are necessary if we are planning to load additional dictionary files to a previously processed pickle. There is some form of version checking here, perhaps the accessory function could be placed around there.
- I think the current state of how saving and loading pickle work is partly from the discussion at #31 (comment). If the new save and load functions do not break existing use cases, I am fine adding a flag to call `pickle.dumps()` and `pickle.loads()`.
- Got it, that explanation helps a lot. You're definitely right, so I'll plan to include everything except for `replaced_words` and will make note of that deliberate omission in the docstring(s).
- Ah okay, yes, putting it in this area makes sense. I wasn't sure what `data_version` meant -- since it was hardcoded, I figured it was probably referring to the version of the included .txt dictionary files. Sounds like there should also be some contextual reflection of the "version" when pickled too. Maybe a `pickled_at` timestamp or optional `version_name` to capture some context. Other ideas welcome.
- This context helps. I had the same experience with compression causing slow-downs, so now I understand why the `compression` boolean flag is there. Adding `pickle.dumps()` and `pickle.loads()` won't break any existing use cases. We can treat it exactly the same way as compression, except the default will be `False` so any existing uses are unaffected.
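A rough sketch of how such a flag might sit next to the compression option is shown below. The function names (`save_state`, `load_state`) and the `to_bytes` / `from_bytes` parameters are hypothetical, and the state is a plain dict standing in for the real attributes. Both flags default to `False`, so a caller who passes only a filename keeps the file-based behavior.

```python
import gzip
import pickle


def save_state(state, filename=None, compressed=True, to_bytes=False):
    """Pickle `state` to `filename`, or return the pickled bytes if to_bytes=True."""
    if to_bytes:
        payload = pickle.dumps(state)
        return gzip.compress(payload) if compressed else payload
    opener = gzip.open if compressed else open
    with opener(filename, "wb") as f:
        pickle.dump(state, f)
    return None


def load_state(source, compressed=True, from_bytes=False):
    """Load a state from a file path, or from a bytes object if from_bytes=True."""
    if from_bytes:
        payload = gzip.decompress(source) if compressed else source
        return pickle.loads(payload)
    opener = gzip.open if compressed else open
    with opener(source, "rb") as f:
        return pickle.load(f)


# Round trip through bytes, e.g. for storing the blob in a database column.
state = {"words": {"hello": 10}, "bigrams": {"hello world": 3}}
blob = save_state(state, to_bytes=True)
restored = load_state(blob, from_bytes=True)
```

Note that `pickle.dumps()` returns a `bytes` object, so a database column holding it would need a binary type rather than a text type.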
Great, I'm looking forward to contributing here!
(P.S. I'm getting married this weekend, and time is a bit tight this week and next while family's in town. So it may take me a few weeks to submit a PR. Let me know if any additional thoughts come to mind in the meantime.)
@k-dal Congratulations! There's no rush on the PR.
- I think `data_version` was added when we thought we should break compatibility with earlier versions of the pickle and force users to recreate the pickle. Perhaps `version_name` would be helpful since we want to track the fields in the pickle relative to the state of the codebase/library.
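One way to picture the version check discussed here is the sketch below: the payload carries a `data_version` next to the state, and loading refuses a mismatched pickle. The constant value, the payload layout, and the function names are assumptions for illustration only; they do not describe how symspellpy stores its pickle.

```python
import pickle

# Hypothetical constant: bump it whenever the set of pickled fields changes.
DATA_VERSION = 3


def dump_with_version(state, version_name=None):
    """Wrap `state` with version metadata and return the pickled bytes."""
    payload = {
        "data_version": DATA_VERSION,
        "version_name": version_name,  # optional free-form label
        "state": state,
    }
    return pickle.dumps(payload)


def load_with_version(blob):
    """Return (state, version_name), or raise if the pickle predates DATA_VERSION."""
    payload = pickle.loads(blob)
    found = payload.get("data_version")
    if found != DATA_VERSION:
        raise ValueError(
            f"Pickle has data_version {found!r}, expected {DATA_VERSION}; "
            "recreate the pickle."
        )
    return payload["state"], payload.get("version_name")


blob = dump_with_version({"words": {"hello": 1}}, version_name="demo")
state, name = load_with_version(blob)
```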