
Comments (21)

kakaroto commented on May 18, 2024

Ah, thanks @mammothb! I see the problem and there's definitely some improvement that can be made here but the SuggestionStage is not the way to go.

Looking at what the C# code does, I see this on every delete insertion:


                if (deletes.TryGetValue(deleteHash, out string[] suggestions))
                {
                    var newSuggestions = new string[suggestions.Length + 1];
                    Array.Copy(suggestions, newSuggestions, suggestions.Length);
                    deletes[deleteHash] = suggestions = newSuggestions;
                }
                else
                {
                    suggestions = new string[1];
                    deletes.Add(deleteHash, suggestions);
                }
                suggestions[suggestions.Length - 1] = key;

That's indeed very time consuming: create a new string array, copy the old array into it, then add the new entry. So the SuggestionStage makes this go much faster by doing this:


            if (!Deletes.TryGetValue(deleteHash, out Entry entry)) entry = new Entry { count = 0, first = -1 };
            int next = entry.first;
            entry.count++;
            entry.first = Nodes.Count;
            Deletes[deleteHash] = entry;
            Nodes.Add(new Node { suggestion = suggestion, next = next });

Which is basically: build a linked list and append to it instead, then convert to a string array only at the end.
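For illustration, here is a rough Python sketch of that staging technique. The names (`SuggestionStage`, `add`, `commit_to`) mirror the C# code above but are hypothetical, not symspellpy's actual API:

```python
# Hypothetical Python transcription of the C# SuggestionStage idea:
# `deletes` maps a delete hash to (count, index of list head), and
# `nodes` is flat storage for singly linked lists of suggestions.
class SuggestionStage:
    def __init__(self):
        self.deletes = {}   # delete_hash -> (count, first_node_index)
        self.nodes = []     # flat node storage: (suggestion, next_index)

    def add(self, delete_hash, suggestion):
        count, first = self.deletes.get(delete_hash, (0, -1))
        # prepend a node; its "next" pointer is the previous list head
        self.deletes[delete_hash] = (count + 1, len(self.nodes))
        self.nodes.append((suggestion, first))

    def commit_to(self, deletes_dict):
        # walk each linked list once and materialize it as a plain list
        for delete_hash, (count, index) in self.deletes.items():
            suggestions = []
            while index >= 0:
                suggestion, index = self.nodes[index]
                suggestions.append(suggestion)
            deletes_dict.setdefault(delete_hash, []).extend(suggestions)
```

Note that this only pays off when array reallocation is expensive; as discussed below, Python's built-in list already appends cheaply.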

Now, in the Python implementation you're not creating arrays: you're already using a linked list and calling .append on it, so you were already kind of using a 'suggestion stage', since a linked list is already very fast for insertions.
What you did instead is replace this code:

            for delete in edits:
                delete_hash = self._get_str_hash(delete)
                self._deletes[delete_hash].append(key)

with this :

            for delete in edits:
                staging.add(self._get_str_hash(delete), key)

So you are already looping through the same number of edits and calling the same get_str_hash on every one; whatever staging.add() does, if it's not faster than the already very fast Python list.append, it's of course going to be slower overall and not helpful.

So there are a couple of things I noticed. First, the C# implementation uses a Dictionary<int, string[]>, and since the key is an int, they need to hash the delete string, hence the self._get_str_hash. But you don't actually need to do that: a Python dictionary is already a hash table, so you can key it by strings directly and it will be just as fast as a dictionary keyed by ints.
The second thing is that the custom hash function you wrote cannot be faster (or as good/complete) than Python's optimized hash function. To confirm, I ran each 10 million times and timed it: Python's hash function is 10 times faster than the custom implementation:

  import time  # `s` here is a SymSpell instance

  a = time.time()
  for i in range(10000000):
      s._get_str_hash("foobar")
  b = time.time(); print(b - a)
  # 13.685925960540771

  a = time.time()
  for i in range(10000000):
      hash("foobar")
  b = time.time(); print(b - a)
  # 1.3400061130523682

So I made this simple change: remove the calls to _get_str_hash and make the dictionary take the string directly:

        for delete in edits:
            self._deletes[delete].append(key)

and in lookup, do the same :

            if candidate in self._deletes:
                dict_suggestions = self._deletes[candidate]

And then I tested the speeds again (average of 10 runs) :

| Function | Hash | Distance | Average time (seconds) |
| --- | --- | --- | --- |
| load_dictionary | _get_str_hash | 2 | 5.7 |
| load_dictionary | _get_str_hash | 5 | 23.5 |
| load_dictionary | python hash | 2 | 3.4 |
| load_dictionary | python hash | 5 | 14.1 |

So this simple change gives us about a 40% speed improvement. I tried a lookup for "helo" and I got the same results and no crash, so yay! But I haven't done any extensive testing on this, and I'm sure there are many other ways to make it more efficient. (Note: this would break the pickle. It would be good to add a 'version' to the pickle to make sure you only load data from the same version of the internal format.)
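A minimal sketch of that versioning idea, assuming the pickle stores the internal `_deletes`/`_words` dictionaries. The function names echo the save_pickle/load_pickle discussed in this thread, but the layout and version constant are invented for illustration:

```python
import pickle

PICKLE_VERSION = 2  # hypothetical: bump whenever the internal layout changes

def save_pickle(sym_spell, path):
    # store a version tag alongside the internal dictionaries
    with open(path, "wb") as f:
        pickle.dump({"version": PICKLE_VERSION,
                     "deletes": sym_spell._deletes,
                     "words": sym_spell._words}, f)

def load_pickle(sym_spell, path):
    with open(path, "rb") as f:
        data = pickle.load(f)
    if data.get("version") != PICKLE_VERSION:
        return False  # reject data written with an incompatible layout
    sym_spell._deletes = data["deletes"]
    sym_spell._words = data["words"]
    return True
```

An old-format pickle (no "version" key, or a stale one) is simply rejected, which matches the "reject and rebuild" approach agreed on later in the thread.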

This is the diff I have for now. I didn't clone/push because I haven't tested it extensively; once I look at the rest of the code and see how it can be improved further, I'll make a proper patch and send a PR. I did have a quick look at the _edits function, though, and it's going to be hard to optimize:

diff -ru /usr/local/lib/python3.6/site-packages/symspellpy/symspellpy.py symspellpy_en/symspellpy.py
--- /usr/local/lib/python3.6/site-packages/symspellpy/symspellpy.py     2019-02-22 20:18:43.693372932 -0500
+++ symspellpy_en/symspellpy.py 2019-02-26 00:35:26.881581236 -0500
@@ -136,8 +136,7 @@
         # create deletes
         edits = self._edits_prefix(key)
         for delete in edits:
-            delete_hash = self._get_str_hash(delete)
-            self._deletes[delete_hash].append(key)
+            self._deletes[delete].append(key)
         return True

     def load_dictionary(self, corpus, term_index, count_index,
@@ -328,8 +327,8 @@
                     continue
                 break

-            if self._get_str_hash(candidate) in self._deletes:
-                dict_suggestions = self._deletes[self._get_str_hash(candidate)]
+            if candidate in self._deletes:
+                dict_suggestions = self._deletes[candidate]
                 for suggestion in dict_suggestions:
                     if suggestion == phrase:
                         continue

from symspellpy.

jalajthanaki commented on May 18, 2024

@mammothb I have tried to save the delete combinations as a pickle, and loading the pickle significantly improves the loading time for me. Are there any guidelines for contributors?

Soon I will be adding the function(s) which can use an in-memory database.

jalajthanaki commented on May 18, 2024

> @mammothb I see that you merged the code in the latest version. Are you going to update the readme file and instructions accordingly?

I will be doing it by tomorrow EOD.

Meanwhile, refer to the following code.

Step 1: The following code is for when we generate the symspell delete combinations for the first time:

    if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
        print("Dictionary file not found")
        return
    else:
        sym_spell.save_pickle('symspell.pkl')

Step 2: Use this code to load the saved pickle: sym_spell.load_pickle('symspell.pkl')
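Putting the two steps together, a cache-on-first-run helper could look like the sketch below. `load_or_build` is an invented name, and `sym_spell`, `dictionary_path`, `term_index` and `count_index` are assumed to come from the surrounding setup:

```python
import os

# Hypothetical helper: reuse the cached pickle when it exists,
# otherwise build the deletes from the dictionary and cache them.
def load_or_build(sym_spell, dictionary_path, term_index, count_index,
                  pickle_path="symspell.pkl"):
    if os.path.exists(pickle_path):
        sym_spell.load_pickle(pickle_path)      # fast path: reuse deletes
    else:
        if not sym_spell.load_dictionary(dictionary_path, term_index,
                                         count_index):
            raise FileNotFoundError("Dictionary file not found")
        sym_spell.save_pickle(pickle_path)      # cache for the next run
    return sym_spell
```

The first run pays the full load_dictionary cost; every later run only pays the much cheaper load_pickle cost.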

mammothb commented on May 18, 2024

@kakaroto i have created a new branch with an initial implementation of SuggestionStage

mammothb commented on May 18, 2024
  1. I have included the documentation in the source code itself. However, I can look into creating proper documentation for the package.
  2. The original code uses SuggestionStage to cut down the load time. However, I have yet to figure out how to implement that in Python. If you are able to implement something similar, I will be glad to include it in the package.

jalajthanaki commented on May 18, 2024

> Just a suggestion:
> wouldn't pickle make things much faster to load? i.e. saving the frequency dictionary to a pickle, then loading it to use?

@mzeidhassan Thank you for your suggestion. I did experiment with loading the dictionary in pickle format, but the actual time-consuming process is building the pre-calculated delete combinations for a large dictionary. I'm thinking about how I can save the pre-calculated deletions and reuse them by putting them into an in-memory database. See this interesting post.

@mammothb and @mzeidhassan Do let me know if you want to add something. Correct me/add information if I'm missing something.

dongyuwei commented on May 18, 2024

@jalajthanaki I think LMDB is suitable for your use case:

> With memory-mapped files, it has the read performance of a pure in-memory database while retaining the persistence of standard disk-based databases.

I had tried redis and leveldb; LMDB is more suitable to work with symspell. If your dictionary is not updated frequently, you can build the symspell delete combinations once.
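To illustrate the build-once/reuse idea, here is a sketch using Python's stdlib dbm as a stand-in for LMDB (py-lmdb exposes a similar key/value shape backed by memory-mapped files, but requires a third-party install). Keys are delete strings and values are JSON-encoded suggestion lists; all names here are illustrative:

```python
import dbm
import json
import os
import tempfile

# Hypothetical on-disk store for pre-calculated deletes;
# stdlib dbm stands in for LMDB purely for illustration.
path = os.path.join(tempfile.mkdtemp(), "deletes.db")

# Build the delete combinations once...
with dbm.open(path, "c") as db:
    db["helo"] = json.dumps(["hello"])
    db["wrld"] = json.dumps(["world"])

# ...and reuse them on every later run without rebuilding.
with dbm.open(path, "r") as db:
    print(json.loads(db["helo"]))
```

With LMDB the lookup side would read straight from the memory map, so repeated runs skip the expensive delete-generation phase entirely.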

kakaroto commented on May 18, 2024

By the way, I'm happy to see that the ~40% speed improvement from removing _get_str_hash is consistent with @jalajthanaki's profiling, which showed the function taking 36% and 37.5% of the time. And the builtin hash() isn't adding any time, because it was already being called when inserting into the dictionary (though it was being called on the str_hash itself).
I've also found out that Python's list is not actually a linked list but rather an array with a pre-allocated size, and its append happens in amortized O(1).
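That flat append cost is easy to see empirically: because CPython over-allocates list capacity, the per-append cost stays roughly constant no matter how large the list already is. A quick (illustrative, timings will vary) check:

```python
import timeit

# list.append is amortized O(1): the cost of 100k appends stays flat
# even when the starting list is already large, because CPython
# over-allocates capacity and rarely reallocates.
for start in (0, 100_000, 1_000_000):
    base = [0] * start
    t = timeit.timeit(lambda: base.append(0), number=100_000)
    print(f"start size {start:>9,}: {t:.4f}s for 100k appends")
```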

A quick little update as well (using frequency_dictionary_en_82_765.txt and the optimized code):

| Function | gzipped | Distance | Size (MB) | Average time (seconds) |
| --- | --- | --- | --- | --- |
| load_dictionary | - | 2 | 1.3 | 3.4 |
| save_pickle | True | 2 | 8.6 | 9.5 |
| load_pickle | True | 2 | 8.6 | 1.2 |
| save_pickle | False | 2 | 26 | 0.6 |
| load_pickle | False | 2 | 26 | 0.9 |
| load_dictionary | - | 5 | 1.3 | 14.1 |
| save_pickle | True | 5 | 21 | 37.8 |
| load_pickle | True | 5 | 21 | 2.35 |
| save_pickle | False | 5 | 54 | 1.0 |
| load_pickle | False | 5 | 54 | 1.5 |

So it looks like the biggest time consumer in loading/saving the pickle is the compression, which sounds obvious now that I think about it. I also tested JSON, and even the non-gzipped version takes longer (with larger file sizes) compared to pickle.

Note that the optimized code makes loading/saving the pickle faster too (9.5 vs 14.5s to save, 1.2 vs 1.5s to load) probably because the length of the hash was generally longer than the length of the average word.
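The compression-vs-serialization split can be measured directly by timing the same dump with and without gzip. The dictionary below is a synthetic stand-in for the real deletes data, so the absolute numbers are only illustrative:

```python
import gzip
import os
import pickle
import tempfile
import time

# Synthetic stand-in for the deletes dictionary.
data = {str(i): [str(i)] * 5 for i in range(200_000)}
tmp = tempfile.mkdtemp()
raw_path = os.path.join(tmp, "raw.pkl")
gz_path = os.path.join(tmp, "gz.pkl")

t0 = time.time()
with open(raw_path, "wb") as f:
    pickle.dump(data, f)               # plain pickle: fast, large file
raw_time = time.time() - t0

t0 = time.time()
with gzip.open(gz_path, "wb") as f:
    pickle.dump(data, f)               # gzipped pickle: slower, small file
gz_time = time.time() - t0

print(f"raw:  {raw_time:.2f}s, {os.path.getsize(raw_path)} bytes")
print(f"gzip: {gz_time:.2f}s, {os.path.getsize(gz_path)} bytes")
```

The gap between the two timings is almost entirely compression cost, matching the save/load table above.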

For reference, here's another test done with aspell's en_US dictionary, which contains 120,400 words (more words, but a smaller file size and faster to load/save):

| Function | gzipped | Distance | Average time (seconds) |
| --- | --- | --- | --- |
| load_dictionary | - | 2 | 3.0 |
| save_pickle | True | 2 | 9.3 |
| load_pickle | True | 2 | 2.0 |
| save_pickle | False | 2 | 0.65 |
| load_pickle | False | 2 | 1.2 |
| save_json | False | 2 | 2.3 |
| load_json | False | 2 | 2.2 |

And for some more timing tests: in create_dictionary_entry, if I do:

        edits = self._edits_prefix(key)
        edits = []

The time to load drops from 3.0s to 1.8s, and if I change it to:

        #edits = self._edits_prefix(key)
        edits = []

It drops to 0.13s, which is again more or less consistent with the profiling information of about 20% each in the call to self._edits and in create_dictionary_entry itself (where most of the time is spent looping over those edits and inserting them into the dictionary).

So we currently spend more than 50% of the new time on these two lines in create_dictionary_entry:

        for delete in edits:
            self._deletes[delete].append(key)

There isn't a lot we can do to improve those, I think. There's a better chance of improving the _edits method, but I haven't been able to find the right way to do it.
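For context, the recursive delete generation looks roughly like this. This is a simplified sketch of the algorithm, not symspellpy's exact method (the real code also handles prefix lengths and returns results differently):

```python
# Simplified sketch of SymSpell-style delete generation: remove one
# character at a time, recursing until max_distance deletions are reached.
def edits(word, max_distance, distance=0, deletes=None):
    deletes = set() if deletes is None else deletes
    distance += 1
    if len(word) > 1:
        for i in range(len(word)):
            delete = word[:i] + word[i + 1:]       # drop character i
            if delete not in deletes:
                deletes.add(delete)
                if distance < max_distance:        # recurse for deeper deletes
                    edits(delete, max_distance, distance, deletes)
    return deletes
```

The cost is inherent to the algorithm: a word of length n has n single-character deletes, each of which spawns n-1 more at the next distance, which is why the loop over edits dominates load time.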

By the way, if someone has another dictionary to provide (@jalajthanaki you have one with 4M words?), I can test with that as well.

mammothb commented on May 18, 2024

@kakaroto I think it's fine to just reject the old version and have the users manually create a new version themselves.

mzeidhassan commented on May 18, 2024

Just a suggestion:
wouldn't pickle make things much faster to load? i.e. saving the frequency dictionary to a pickle, then load it to use?

jalajthanaki commented on May 18, 2024

> I had tried redis and leveldb; LMDB is more suitable to work with symspell. If your dictionary is not updated frequently, you can build the symspell delete combinations once.

@dongyuwei I am planning to update my dictionary every 3 months. So totally agree with your suggestion. I will try LMDB and redis.

mzeidhassan commented on May 18, 2024

This is amazing, @jalajthanaki ! Thank you so much for helping. This is a great contribution to symspellpy.

mzeidhassan commented on May 18, 2024

@mammothb I see that you merged the code in the latest version. Are you going to update the readme file and instructions accordingly?

mzeidhassan commented on May 18, 2024

Thanks a million @jalajthanaki ! Appreciate it.

dongyuwei commented on May 18, 2024

@jalajthanaki According to this post, JSON is faster than pickle: https://konstantin.blog/2010/pickle-vs-json-which-is-faster/. I did not test/verify it, but maybe it is helpful for this project.

> JSON is 25 times faster in reading (loads) and 15 times faster in writing (dumps).

Or, as @wolfgarbe suggested, use flatbuffers, since they can access serialized data without parsing/unpacking.

kakaroto commented on May 18, 2024

I just did some tests with the 82k word file:

| Function | Distance | Average time (seconds) |
| --- | --- | --- |
| load_dictionary | 2 | 5.7 |
| save_pickle | 2 | 14.5 |
| load_pickle | 2 | 1.5 |
| save_json | 2 | 7.1 |
| load_json | 2 | 1.4 |
| load_dictionary | 5 | 23.5 |
| save_pickle | 5 | 37.8 |
| load_pickle | 5 | 2.35 |
| save_json | 5 | 33.5 |
| load_json | 5 | 3.1 |

I didn't try flatbuffers as @dongyuwei suggested because it seemed more complex to integrate than just replacing pickle with JSON.
My conclusion here is that JSON is not particularly faster than pickle in this use case, and loading is still pretty slow. Hopefully the SuggestionStage can be implemented at some point, but that's irrelevant to loading the pickle/JSON file. The problem there is that the amount of data is simply massive (I tried compact_level=10; it didn't make much difference).

mammothb commented on May 18, 2024

@kakaroto I tried implementing SuggestionStage (not pushed), but it turned out to take longer than the current code. I assume it is because preallocating memory doesn't work in Python.

kakaroto commented on May 18, 2024

I think there are ways to preallocate memory in Python, though it depends on how you're going to use it; it might not work. If you want to share that code in a separate branch, I could have a look tomorrow and see if I can test it and figure out a way to speed it up.

By the way, I would suggest also storing the max distance in the pickled dictionary, because the deletes/words dictionaries are directly related to the distance. I've been generating different dictionaries with various distances (and therefore different file sizes), but they all return the same suggestions because the SymSpell object I create always has a distance of 2, so the pickle file's content is irrelevant. I'm not sure, though, whether it's a good idea to change a user-provided setting when loading a dictionary. Maybe only do that if the SymSpell object is created without specifying a distance, but keep the distance fixed if that argument is passed to __init__?
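A sketch of what that check could look like on the load side. Everything here is hypothetical (the pickle is assumed to carry a "max_distance" key, and the `distance_arg_given` flag stands in for "the user passed a distance to __init__"):

```python
import pickle

# Hypothetical: adopt the pickle's distance only when the user did not
# pass one; otherwise refuse a file built with a different distance.
def load_pickle(sym_spell, path, distance_arg_given=True):
    with open(path, "rb") as f:
        data = pickle.load(f)
    if distance_arg_given:
        # keep the user's setting; a mismatched pickle would silently
        # return wrong suggestions, so reject it instead
        if data["max_distance"] != sym_spell._max_dictionary_edit_distance:
            return False
    else:
        sym_spell._max_dictionary_edit_distance = data["max_distance"]
    sym_spell._deletes = data["deletes"]
    sym_spell._words = data["words"]
    return True
```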

mammothb commented on May 18, 2024

@kakaroto any updates on the patch? I could implement the suggested improvement if you're busy with other things

kakaroto commented on May 18, 2024

@mammothb sorry, I've been busy. I've spent some time trying to optimize the rest of it, but unfortunately unfruitfully.
I should have time to take care of it this week and then send the patch. I also want to add versioning to the pickle, to avoid loading a pickle made with the old format (hash instead of string in the deletes dict). I think it should just reject the old version and throw an error. I wonder if that's an acceptable solution (it should be possible to re-create the deletes table from the words dictionary of the pickle, but with no savings in loading time that defeats the purpose; I can do that if you prefer).

kakaroto commented on May 18, 2024

I've sent the pull request, let me know if there's anything more I can do! Thanks.
