mammothb / symspellpy Goto Github PK

Python port of SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm

License: MIT License

Python 95.28% Jupyter Notebook 4.72%

python spellcheck spell-check fuzzy-matching fuzzy-search spelling-correction damerau-levenshtein approximate-string-matching levenshtein edit-distance

symspellpy's Introduction

symspellpy

symspellpy is a Python port of SymSpell v6.7.1, which provides much higher speed and lower memory consumption. Unit tests from the original project are implemented to ensure the accuracy of the port.

Please note that the port has not been optimized for speed.

Notable Changes

v6.7.2: Implemented fast distance comparer with editdistpy. Approximately 2x speed up for usage under default settings, benchmarks found here.

Install

For installation instructions, see the INSTALL.rst file or the install documentation.

Usage

Check out the examples provided for sample usage.

symspellpy's People

Contributors

Stargazers

Watchers

Forkers

jeicoo crossnox elementai sagarailani noke8868 ab93 fjteam altaml ai-zhong wang1128 hnikana dataenthusiastsathya dst1213 yifuliu yyht robingong soares-f devashishjoshi dataagreeswithme mzeidhassan sanjeeku kakaroto mbasaglia gheyret azizullah2017 seanbe ilhamfp scievan gedman4b sudhinbabu zoltan-fedor frutik naushadzaman pidugusundeep darthbhyrava marcoffee wan-docai ttong-ai z3193631 catch-n-release sergshulga alirezabayatmk nqtrieu7987 mayhs19 yapus pdahale95 po-oya ahmadizzan allensmile tonykwok kien-nguyen-ngoc gaihostin martenmink paridhimnnit aluckydog0716 asmitakhaneja ermlab phuctu1901 amritsingh183 dogterbox youikim gerty-ai anjakirchhoefer gracecarrillo p-sodmann onlyrohits potipot akrmandpredict coolsaint iamprashant florencemnrg jkamlah kassenov srikant-ai flyfie sharduls007 selgyun rinabuoy world4jason blackjack4494 mainakmaitra tomaszrz1 xiaohulei2018 m2man ahrvo-technologies mishav78 tetiana-myronivska guanchenpan nikolaik nickcrews ebell495 mayhemheroes techthiyanes trungvan86 kamaruladha yli90 dominiklindorfer indigoviolet amenezes jdongian

symspellpy's Issues

preserve capitalization

Hello symspellpy team,

How should I use the option for preserving capitalization than changing all text to lowercase?

how do i add new terms?

Hi,

Apologies for using the issue tracker to ask a question.

How do I add a replacement for gr8 to great?? Is there a way to augment/extend the existing dictionary frequency_dictionary_en_82_765.txt to include these replacements? if yes how?

Best wishes and great work!

How to improve spell correction accuracy

Hello Symspellpy team,

How would I improve spell correction performance?

Incorrect Input Sentence: The World Econemic Forum is the Intarnational Organizetion for Public-Private Cooperation.

Correct Output Sentence: The World Economic Forum is the International Organization for Public- Private Cooperation.

Output Duration: 2.49 sec

To correct a sentence it is taking 2 to 3 seconds which is much higher. Please provide me suggestion or any solutions is available.

Regards,
Hardik

How to disable word segmentation?

HI @mammothb,

Hope all is well with you.

Can you please let me know how can I disable word segmentation? Apparently, this bringing more harm than good when I test it with Arabic. It makes up undesired corrections.

Thanks

Make a dictionary on another languange

Hai Mammothb, How i can make dictionary file like yours in another languange ?
Or maybe you can explain to me How do you make frequency_dictionary_en_82_765.txt ? Please.

By the way Good project

Handling Specific Contractions

I noticed that symspellpy gives some contractions as suggestions (e.g. "can't"), but when I try to produce suggestions for the word "isn't," for example, it gives me "int." Any ideas as to how I could fix this?

Carrying over casing

Hi,
For my project I am implementing an extension to allow me to carry over the casing (eg uppercase vs lowercase characters) from the input phrase to the typoe-corrected case.

For example:
"I have a typo prooblem in Neew York"
currently turns into
"i have a typo problem in new york"

But I will make it to be:
"I have a typo problem in New York"

I am wondering whether you would be interested in a pull request for it, knowing that this is a port of https://github.com/wolfgarbe/SymSpell.

pip install -U not updating with the new methods

Hi,

Saw you implemented the create_dictionary(). But I can't update the package even after running the pip install. Could you help?
Sorry I am an R user so learning python at the same time.
Ta

Allow removal of dictionary item

Hi, great work on this! Other than the loading of the dictionary, it seems to be very fast at doing suggestions.
I like having the ability to do a create_dictionary_entry to add a new word to the dictionary, however, there is no delete_dictionary_entry. I've tried to just give a negative count for the word, but the code ignores it if count <= 0. I can probably do a del obj._words[word] but I assume it wouldn't be a great idea to do that without also affecting the deletes. I had a quick look at the create_dictionary_entry and I feel like this should be enough to delete a word ?

        del self._words[key]
        # find deletes
        edits = self._edits_prefix(key)
        for delete in edits:
            delete_hash = self._get_str_hash(delete)
            self._deletes[delete_hash].remove(key)

Transfer Casing argument not found

Hi,

I've downloaded the symspellpy package both directly from github and using pip install as directed and I'm not seeing the transfer_casing boolean. I know that it was added a month ago or so, do you have any suggestions on how I could make that work?

Appreciate it.

Punctuations getting removed

Hello. Great work!

For inputs with punctuations, There are many novel, innovaive, and empirical anayysis availaible. we get outputs like there are many novel innovative and empirical analysis available. Is it possible to keep the punctuations?

Facing some issues with ignore_non_words

Hi,

I am currently working on a chat bot use case where i found symspellpy very useful but i am facing some issues with "ignore_non_words" parameter of lookup_compound.
I need a specific pattern like an account no xx004453 to be ignored by spell checker and it kinda of works as well.

My regex is made to satisfy patterns like starts with 2 or 3 alphabets and then numbers or hyphens.

My issue is as follows

je3453 -> jeff hi (it is converting these numbers)[Happens to small numbers only like xx123]
xx1234-5678-1234 -> xx1234 5678 1234 (all the '-' are removed)

How to solve these issue ?

Inconsistent example output

Installed 6.3.7 from pip as instructed on both latest macOS and ubuntu 16.04 and downloaded frequency_dictionary_en_82_765.txt from the official github.
Just ran the examples in the README.md and got inconsistent output on both platforms as follows:

Sample usage (lookup and lookup_compound)

The last number(log_prob_sum) is 11 instead of 10.

members, 226656153, 1
where is to love he had dated for much of the past who couldn't read in six grade and inspired him, 300000, 11

Sample usage (word_segmentation)

The first word the is segmented as t and he which are a bit obvious.
Also overt he should be over the.
I noticed the last two numbers are different from 8 -34.491167981910635.

t he quick brown fox jumps overt he lazy dog, 10, -52.10066239535173

Next, I tried to segment the test string itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness from the official site and the output shows the same error pattern of the as t he.

Any ideas?

pip install cannot find symspellpy

pip install -U symspellpy
Collecting symspellpy
Could not find a version that satisfies the requirement symspellpy (from versions: )
No matching distribution found for symspellpy

Is Symspell Thread Safe?

Hi @mammothb,

I try to deploy symspell model in multi-threading config using simple flask API. When we're trying to validate the result several times, turns out the result is not consistent (yield different result after test several times).

I have feeling that symspell is not thread safe. So, i tested it using single thread. And the result now consistent.

Is my hypothesis true? Related article for safe thread issue.

Thanks!

key-value store integration

Hi @mammothb

Some suggestions around scaling symspell. Store all deletes, word-count mappings into an in-memory key-value store. This might scale well, considering if we keep storing all deletes in python dictionary, the application would get heavy.

If this sounds like a good addition, I would like to give it a try. Let me know

Word segmentation of LatexEquation123

I recently found some examples are not segmented as properly as expected. For instance, the segmentation of LatexEquation123 is La tex Equ at ion 123 but the expected output should be Latex Equation 123. I checked the frequency entries in frequency_dictionary_en_82_765.txt and found latex and equation.

Is this expected in terms of the algorithm?

N-grams in dictionary

Hi @mammothb, great job with SymspellPy.

I recently saw Issue 15 at Symspell's Github (wolfgarbe/SymSpell#15), and the last comment caught my attention. Apparently Symspell suppports N-grams in the dictionary file, but I did a small test in SysmpellPy and I was not able to achieve the desired behavior. My approach was the following:

I added on top of a custom frequency dictionary the following sequence:
abc def ghi 116422658 (highest frequency in the dictionary)
I obtained suggestions to the sentence: abc dff ghi, using both lookup and lookup_compound
The returned corrections were based on single words (1-grams) I had previously defined un my dictionary and not on the newly inserted 3-gram: abc off ghi

I would like to know if there is any way to reproduce the desired behavior in SymspellPy, that is, obtaining a prediction based on the N-gram counts, or if there are any plans to add it as a feature in the near future.

Thanks for your time!

adding new terms and typical workflow

Hi,

I am trying to add new terms by:
initial_capacity = 83000 # maximum edit distance per dictionary precalculation max_edit_distance_dictionary = 0 prefix_length = 7

sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary, prefix_length)
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
print("Dictionary file not found")
return

sym_spell.create_dictionary_entry("steama", 4) sym_spell.create_dictionary_entry("steamb", 6) sym_spell.create_dictionary_entry("steamc", 2)

result = sym_spell.lookup("streama", 2) print(result)

I am getting an empty []. What am I missing?

Additionally, could you provide a skeleton code on how to feed it a text file and it creates a new column of corrected text please?
This will help massively in my text analysis.

Much appreciated

what exactly is max_edit_distance_dictionary

What exactly does max_edit_distance_dictionary do, and why it should be larger than max_edit_distance.

Thanks.

Arabic output garbled from dictionary creation

Hi @mammothb,

I am using your code to create a frequency dictionary in Arabic.
My corpus file is in utf-8 format.

Here is the output I am getting:

طµظپطھظٹظ 2
515 1
ظٹط²ظٹظٹظ 4
طھظˆط³ظٹظ 12
ظپظƒطœ 2
ظٹظˆظƒظٹطھط 14
ط³ظٹط³ظˆط 5

This looks like corrupt characters, so I am not sure what is causing this.

I even tried to print out to file and used 'encoing='utf-8', but again I am getting the same result as you can see in the attached screen shot.

Any idea how can I fix this or what is causing this issue? I am using Anaconda with Python 3..5.2 by the way.

Thanks

Using discreet word frequency dictionary in case of multiple users

I have a scenario where in multiple users want to use auto-correction. Each user has his own set of his domain specific words(having some fixed frequency)which he wants to include in the frequency dictionary. The users do not want that other users be shown the words which they added. Is there a way in which the lookup/lookup-compound would first consider the set of words passed by the users. If the word is not found it will check the word in the default frequency dictionary provided.

I can not understand the cause of the error

I'm trying to run your example...
...
from symspellpy.symspellpy import SymSpell, Verbosity # import the module
...
and I get an error for the import line:
ModuleNotFoundError: No module named 'symspellpy.symspellpy'; 'symspellpy' is not a package

Failed to create dictionary for chinese corpus!!

Hello @mammothb , I am using symspellpy for a last couple of months. I am trying to create dictionary with Chinese corpus dataset. But it just split words from text by using space. (just like english corpus)
BTW, we shouldn't do that for Chinese corpus. Is there any alternative way for create frequency dictionary for chinese corpus ?
Thanks in Advance : )

lookup_compound result not consistent with the word frequency

Using self-defined dictionary, The code is:

from symspellpy.editdistance import EditDistance, DistanceAlgorithm
from symspellpy.symspellpy import SymSpell, Verbosity 

max_edit_distance_dictionary = 2
prefix_length = 7
max_edit_distance_lookup = 1
suggestion_verbosity = Verbosity.TOP
# create object
sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
sym_spell.load_dictionary('dictionary.txt', term_index=0, count_index=1)

By comparing the distance of the original string and 2 correction candidates as:

ed = EditDistance(DistanceAlgorithm.DAMERUAUOSA)
ed.compare('andbanana', 'an banana', max_distance=2)
ed.compare('andbanana', 'and banana', max_distance=2)

the distances are both 1

However, by using
sym_spell.lookup_compound(input_term, max_edit_distance_lookup, transfer_casing=True)
the result is
an banana
However, in the dictionary the frequency of and is much higher than an, and the frequency of them are both lower than banana. How an banana is ranked higher than and banana?

Thanks.

Too much time for loading large dictionary

Hi Symspellpy team,

Amazing tool....! Your team did fabulous job.

I have following queries.

Is there any detail documentation for symspellpy?
I have a big dictionary having more than 4M words in it. It's taking too much time to load the dictionary. Is there any ways that I can optimize the loading time. I have attached the screenshots as well which can help you to understand the problem.

This screenshot is for 82,765 words load time

This screenshot is for 4M words load time

Any help or suggestion will be helpful. Even detail explanation of each method help me. I will look forward to collaborating to symspell code if needed.

Missing Word Segmentation from SymSpell

Missing Word Segmentation from https://github.com/wolfgarbe/SymSpell. Please suggest.

include_unknown doesn't actually include unknown words

In lookup(), since the include_unknown flag is only included within early_exit(), it doesn't seem to work if early_exit is never called. I've had unknown words return 0 suggestions. For example, the word "cablevantage" returns nothing. It seems like including if include_unknown and not suggestions: suggestions.append(SuggestItem(phrase, max_edit_distance + 1, 0)) before the final sort, fixes the issue. Is this a symptom of a bigger bug, or a design choice?

How to create another dictionary

hello,

I wanted to modify the existing dictionary by adding some words to it for the purpose of EHR (electronic health record ) analysis. Unfortunately , the program does not read my file properly. I used the pandas package to read the existing dictionary , the new one I want to add to and drop the duplicate. But it does not work . Any clue how to solve the issue ?

dic_1 = pd.read_csv('frequency_dictionary_en_82_765.txt' , delimiter=" " , header=None)
dic_2 = pd.read_csv('medical_wordlist_original.txt',delimiter=" ", header=None,error_bad_lines=False)
dic_2[0]= dic_2[0].str.lower()
dic_big = pd.concat([dic_1,dic_2],ignore_index=True)

dic_big.fillna(floor(dic_big.mean()), inplace=True)
dic_big.sort_values(by = [1] ,ascending=0 , inplace=True)
dic_big.drop_duplicates(subset=0,keep='first', inplace = True)

dic_big.to_csv('dictionary.txt' , sep=" ", index=False, header=False , encoding ='utf-8' )

Another way I tried was to just add the new terminologies to the existing dictionary but that way , I could not take care of duplicated terms.

Another question is that , Should the new dictionary be sorted down by frequency or it does not matter ? How about the duplicated terms ? what if there is not frequency in our dictionary , would it work ? why ?

Need Python 2.x build

Hi,
We are not able to find package on pypi for python2.x

Tried installing using "python setup.py install" but it throws error for python2.7

    "Traceback (most recent call last):
  File "setup.py", line 43, in <module>
    encoding="utf-8") as infile:
TypeError: 'encoding' is an invalid keyword argument for this function"

Is there any means we can build for python 2.x, please assist.

Support python native dictionary object in `load_dictionary()`

Is there a way to pass a Python dictionary object like following:

dict_obj = {'test': 100, "testing": 200, "test-again": 300}

Something like: sym_spell.load_dictionary(dict_obj)
instead of txt file word-frequency

Not getting those results with the dictionary provided

I tried :

max_edit_distance_dictionary = 0
prefix_length = 7



sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
sym_spell.create_dictionary("frequency_dictionary_en_82_765.txt")

input_term = "thequickbrownfoxjumpsoverthelazydog"


result = sym_spell.word_segmentation(input_term)

print("{}, {}, {}".format(result.corrected_string, result.distance_sum,
                              result.log_prob_sum))

It returns:

thequickb row nfox jump s over thelazyd og, 28, -109.83603700895742

lookup('correcn', 1) doesn't output anything.

I see the word "correct" in the supplied dictionary, and yet when I tried to load it and do the above lookup it doesn't output anything. Am I doing something wrong, are you seeing any output?

Why is Numpy used?

Thank to read my issue, and I use symspellpy gratefully.

I wonder why the numpy library is used to generate arrays for the sake of storing costs in every step.

Is there any problem if native arrays were used?
(It's just a question... not serious)

thianks.

Incorrect lookup_compound distance.

Hi, it seems to me, that the returned distance of lookup_compound is off by 1.

The issue:

sym_spell.lookup_compound('whereis', 2)
//output: where is, 360468344, 2

However the distance should be just 1 (add a single whitespace).

How to reproduce

# create object
initial_capacity = 83000
# maximum edit distance per dictionary precalculation
max_edit_distance_dictionary = 2
prefix_length = 7
sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary,
                     prefix_length)
# load dictionary
dictionary_path = os.path.join(os.path.dirname(__file__),
                               "frequency_dictionary_en_82_765.txt")
term_index = 0  # column of the term in the dictionary text file
count_index = 1  # column of the term frequency in the dictionary text file
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

# lookup suggestions for multi-word input strings (supports compound
# splitting & merging)
input_term = "whereis"
# max edit distance per lookup (per single word, not per whole input string)
max_edit_distance_lookup = 2
suggestions = sym_spell.lookup_compound(input_term,
                                        max_edit_distance_lookup)
# display suggestion term, edit distance, and term frequency
for suggestion in suggestions:
    print("{}, {}, {}".format(suggestion.term, suggestion.count,
                              suggestion.distance))

Possible solution:
When I compared the code with the original repo https://github.com/wolfgarbe/SymSpell, there is a slight change. On this line, you're not using right stripped term for the distance calculation, but only as an argument for the term.

suggestion = SuggestItem(joined_term.rstrip(),
                                 distance_comparer.compare(
                                     phrase, joined_term, 2 ** 31 - 1),
                                 joined_count)

Do you agree that this is a bug? Should I fix this and make a PR?

Thank you for the awesome work!

Get the number of misspelled words

It would be useful if lookup_compound could return the number of words it has fixed, or perhaps a different method which only checks for the number of mistakes and returns the total.

Is this possible with the current implementation?

Stop auto-correcting Numbers and special characters

Hi again,

If I have the following string:
The support is 24/7, awesome!

It removes 24/7. Is there a way to avoid deleting numbers and other special characters such as"$£" etc.?

How to ignore specific tokens?

Given this string:

24th december

symspell is correcting as:

In:

    new = sym_spell.word_segmentation('24th december')
    new[1]

Out:

beth december

Is there any way of skipping those cases with a regular expression? That is all the tokens that match \d{2}\bth\b skip them? I tried to tokenize and apply the algorithm to each word. However, this damage the accuracy for short words.

Default to \t split when creating dictionary

Currently when the dictionary is loaded from a txt file it is split on space. Using \t as default split would allow to add the n-gram phrases directly via a file. It will helpful to do by just adding a dictionary text file with lots of phrases. Adding an option to specify the split char will be helpful too.

Not able to differentiate name(noun) while correcting spelling

Example:
Input:
"My name is Abhijeet Kumar"
Output:
"my name is ash meet human"

sym_spell.lookup_compound verbosity

Any plans to have verbosity param for compound lookups?

Explain the architecture of Symspell

Hey there, i'm just using Sympell and wants to explore how its works. can anyone able to explain the concept behind this?

Provide create dictionary example

Could you provide an example creating a new dictionary?

Need another init.py at the root level?

You import like "from symspellpy.editdistance import DistanceAlgorithm, EditDistance" and when I tried to run this, it fails to import. Python doesn't think symspellpy is a module because no init.py. Consider adding one or importing differently.

Exporting in memory dictionary with updated words to .txt file.

Hi @mammothb ,

Could you help me with the following that I have done :

from symspellpy.symspellpy import SymSpell, Verbosity
max_edit_distance_dictionary = 4
prefix_length = 7
sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
sym_spell.load_dictionary("xyz.txt",term_index=0,count_index=1)
sym_spell.create_dictionary_entry("abcxyz", 1)

Assuming the text file I loaded was blank and I have created multiple new entries
with create_dictionary_entry() function.

Is there a way to export the in memory update dictionary to a text file and/or overwrite the text file i.e. xyz.txt initially loaded.

Thank you for your time.

Short word Transfer Casing suggestions

Just wanted to alert you to some weird behavior with Transfer Casing.

For example:

Fr -> FOr
Nd -> ANd

My symspellpy settings are suggestion_verbosity=TOP, max_edit_distance=2, include_unknown=True and transfer_casing=True. Appreciate your work and the package!

Some example of this are:
ss_py.lookup('Nd', suggestion_verbosity, max_edit_distance=2, include_unknown = True, transfer_casing=True )

Hash table collision check is not necessary

Python dict handles hash collision (two different keys can have different values even if their hash equal). This article makes it clear.

Thus, logic about hash collision checking is not necessary, and code will be simplified.

Thank you

How Verbosity::Closest is Faster than Verbosity::Top?

Hi @mammothb ,
I am using symspell for a last couple of months. It works awesome. I am working with a large corpus for different languages. In my point of view, Top will always give exactly one correction, And Closest will give one or more corrections. So, what was the reason, Verbosity::Top is slower than Verbsity::Closest?

Failing cases! [How to fix these special case?]

Hello,

This is a great repository because it handles almost all the cases but a couple of cases I have to which it is unable to perform:

Here are the cases: Bold ones are incorrect, italics are correct.

Reinforcing important information keep ng us a ert to danger and more
Reinforcing important information keeping us alert to danger and more

by evaluating perceptions regu at ng emot ona arousal
by evaluating perceptions regulating emotional arousal

during which it shifts into ad fferent mode of regional activation
during which it shifts into a different mode of regional activation

The 3rd failure a most k ed the company.
The 3rd failure almost killed the company.

We partnered with pevate com ies
We partnered with private companies

Can we handle these kinds of cases? if you can guide me where to look into symspellpy.py and put a fix for this kind of cases, I would love to do that.

Thanks

mammothb / symspellpy Goto Github PK

symspellpy's Introduction

symspellpy

Notable Changes

Install

Usage

symspellpy's People

Contributors

Stargazers

Watchers

Forkers

symspellpy's Issues

My regex is made to satisfy patterns like starts with 2 or 3 alphabets and then numbers or hyphens.

Sample usage (lookup and lookup_compound)

Sample usage (word_segmentation)

Recommend Projects

Recommend Topics

Recommend Org