mightguy / customized-symspell

Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm

License: MIT License

symspell spellchecker java-8 spelling-correction qwerty-based-char-distance word-segmentation damerau-levenshtein levenshtein-distance weighted-damerau-levenshtein

Customized SymSpell SpellCheck Java

This customized spell checker is based on the fuzzy-search spelling-correction library SymSpell, with a few customizations and optimizations.

Java Ported v6.6 (Bigrams)

This port includes the optional bigram dictionary, which uses sentence-level context information to select the best spelling correction.

SymSpell

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance.
It is six orders of magnitude faster than the standard approach (deletes + transposes + replaces + inserts) and is language independent.
Unlike other algorithms, only deletes are required; no transposes, replaces, or inserts are needed. Transposes, replaces, and inserts of the input term are transformed into deletes of the dictionary term.
The speed comes from the inexpensive delete-only edit candidate generation and the pre-calculation.
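
To make the delete-only idea concrete, here is a minimal sketch of candidate generation. The class and method names (`DeleteCandidates`, `deletes`) are hypothetical and are not part of this library's API; the sketch only shows how all delete variants within a given distance can be enumerated.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not the library's implementation.
final class DeleteCandidates {

  /** Returns every string reachable from word by deleting 1..maxDistance characters. */
  static Set<String> deletes(String word, int maxDistance) {
    Set<String> result = new HashSet<>();
    collect(word, maxDistance, result);
    return result;
  }

  private static void collect(String word, int remaining, Set<String> result) {
    // Stop when the budget is used up or the word cannot be shortened further.
    if (remaining == 0 || word.length() <= 1) {
      return;
    }
    for (int i = 0; i < word.length(); i++) {
      String shorter = word.substring(0, i) + word.substring(i + 1);
      // A variant of a given length is always reached at the same depth,
      // so recursing only on first sight loses no candidates.
      if (result.add(shorter)) {
        collect(shorter, remaining - 1, result);
      }
    }
  }
}
```

Both `slives` and `slices` produce the delete `slies`, which is how a misspelled input and a dictionary term can meet through deletes alone.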

Customizations

We replaced the Damerau-Levenshtein implementation with a weighted Damerau-Levenshtein implementation, in which each operation (delete, insert, swap, replace) can have a different edit weight.
We added some customizing "hooks" that are used to rerank the top-k results (the candidate list). The results are then reordered based on a combined proximity:
- added keyboard distance to get a dynamic replacement weight (since letters close to each other on the keyboard are more likely to be substituted for one another)
- do some query normalization before the search
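
As an illustration of per-operation weights, the following is a minimal sketch of a weighted restricted Damerau-Levenshtein distance (the optimal string alignment variant). It is not the library's `WeightedDamerauLevenshteinDistance`; the class name and parameter order here are assumptions made for the example.

```java
// Hypothetical sketch, not the library's implementation.
final class WeightedDistanceSketch {

  static double distance(String source, String target,
      double deleteWeight, double insertWeight,
      double replaceWeight, double transposeWeight) {
    int n = source.length();
    int m = target.length();
    double[][] d = new double[n + 1][m + 1];
    // Turning a prefix into the empty string costs only deletes, and vice versa.
    for (int i = 1; i <= n; i++) {
      d[i][0] = i * deleteWeight;
    }
    for (int j = 1; j <= m; j++) {
      d[0][j] = j * insertWeight;
    }
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        char a = source.charAt(i - 1);
        char b = target.charAt(j - 1);
        double best = d[i - 1][j - 1] + (a == b ? 0 : replaceWeight);
        best = Math.min(best, d[i - 1][j] + deleteWeight);
        best = Math.min(best, d[i][j - 1] + insertWeight);
        // Swap of two adjacent characters: "ab" versus "ba".
        if (i > 1 && j > 1 && a == target.charAt(j - 2) && source.charAt(i - 2) == b) {
          best = Math.min(best, d[i - 2][j - 2] + transposeWeight);
        }
        d[i][j] = best;
      }
    }
    return d[n][m];
  }
}
```

With all weights set to 1 this reduces to the unweighted distance; lowering one weight makes that kind of error cheaper than the others.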

Keyboard based Qwerty/Qwertz Distance

There are two keyboard implementations: one based on the English Qwerty layout and one based on the German Qwertz layout. Both use the adjacency graph of the keyboard to assign weights between connected nodes.

Example

For 2 terms:
        slices
        olives

If the misspelled word is slives, both slices and olives are at edit distance 1,
so by default the one with the higher frequency ends up in the result.
With the Qwerty-based char distance, however, slives is closer to slices.

The reason is that on a Qwerty keyboard S and O are far apart, while V and C are adjacent.
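
The example can be sketched in code as follows. This is only an illustration: the adjacency rule (neighbouring row and column on a simplified three-row layout) and the 0.5 discount for adjacent keys are assumptions, and the library's own `QwertyDistance` adjacency graph may differ.

```java
// Hypothetical sketch, not the library's QwertyDistance.
final class QwertyAdjacencySketch {

  // Simplified letter rows of a Qwerty keyboard; row stagger is ignored.
  private static final String[] ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};

  /** Returns {row, column} of a letter key, or null if the key is not on the layout. */
  private static int[] position(char key) {
    char lower = Character.toLowerCase(key);
    for (int row = 0; row < ROWS.length; row++) {
      int column = ROWS[row].indexOf(lower);
      if (column >= 0) {
        return new int[] {row, column};
      }
    }
    return null;
  }

  /** Two different keys are adjacent if they are at most one row and one column apart. */
  static boolean isAdjacent(char first, char second) {
    int[] p = position(first);
    int[] q = position(second);
    if (p == null || q == null || (p[0] == q[0] && p[1] == q[1])) {
      return false;
    }
    return Math.abs(p[0] - q[0]) <= 1 && Math.abs(p[1] - q[1]) <= 1;
  }

  /** Assumed weighting: replacing a key with a neighbour costs half the base weight. */
  static double replaceWeight(char typed, char intended, double baseWeight) {
    return isAdjacent(typed, intended) ? baseWeight * 0.5 : baseWeight;
  }
}
```

Under these assumptions, replacing v with c (slives to slices) costs 0.5, while replacing s with o (slives to olives) keeps the full weight of 1.0, so slices ranks first.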

Generation of Deletes

Word deletes are generated using an edit distance that is the minimum of the max edit distance and 0.3 * word.length.
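
A small sketch of this rule is shown below. The rounding step follows the `getEdistance` snippet quoted in the "Factor should be configurable" issue further down; the class name and the `factor` parameter are hypothetical and only illustrate how the 0.3 constant could be passed in.

```java
// Hypothetical sketch of the length-based cap on the edit distance.
final class DeleteDistanceSketch {

  /**
   * Returns the smaller of maxEditDistance and round(factor * wordLength),
   * so short words get fewer delete variants than long ones.
   */
  static double effectiveEditDistance(double maxEditDistance, int wordLength, double factor) {
    double lengthBased = Math.round(factor * wordLength);
    return Math.min(maxEditDistance, lengthBased);
  }
}
```

With a max edit distance of 2 and a factor of 0.3, a 3-letter word is limited to distance 1, while a 10-letter word is capped at the configured maximum of 2.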

Usage

Solr Usage

Accuracy Summary

Indexed Docs: 3695

Searches: 8060

Spellcorrection Strategy	Accuracy	Failures	TP	TN	FP	FN
LUCENE	78.96%	21.04%	5883	481	146	1550
Vanilla SymSpell	88.80%	11.20%	6888	269	358	545
Weighted SymSpell	75.74%	24.26%	5781	324	303	1652
Qwerty Vanilla SymSpell	88.57%	11.43%	6860	279	348	573
Qwerty Weighted SymSpell	75.36%	24.64%	5744	330	297	1689
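
The accuracy column appears to be (TP + TN) divided by the number of searches; this is inferred from the figures in the table rather than stated by the authors. A minimal sketch that reproduces the LUCENE row under that assumption (class and method names are hypothetical):

```java
import java.util.Locale;

// Hypothetical helper for recomputing the accuracy column.
final class AccuracySketch {

  /** Share of searches that were classified correctly. */
  static double accuracy(long tp, long tn, long fp, long fn) {
    return (double) (tp + tn) / (tp + tn + fp + fn);
  }

  /** Formats a ratio such as 0.7896 as "78.96%". */
  static String asPercent(double ratio) {
    return String.format(Locale.ROOT, "%.2f%%", ratio * 100);
  }
}
```

For example, `AccuracySketch.asPercent(AccuracySketch.accuracy(5883, 481, 146, 1550))` gives `78.96%`, matching the LUCENE row.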

Benchmark Summary

We did 3 runs each for the 30k and 80k data sets, which also include results for each verbosity level. After the runs, the final benchmark looks like this:

Average Precalculation time instance 30843.33 ms
Average Lookup time instance 138141.09296296295 ns ~ 0.13814 ms
Total Lookup results instance 648092

More Detailed summary

Built With

Maven

Versioning

We use SemVer for versioning.

Nexus

Link to Nexus-Releases

License

The MIT License (MIT)
Copyright © 2019 Lucky Sharma ( https://github.com/MighTguY/customized-symspell )
Copyright © 2018 Wolf Garbe (Original C# implementation https://github.com/wolfgarbe/SymSpell )

Permission is hereby granted, free of charge, to any person 
obtaining a copy of this software and associated documentation files
(the “Software”), to deal in the Software without restriction, 
including without limitation the rights to use, copy, modify,
merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is 
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall 
be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, 
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, 
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR 
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR 
THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Special Mentions

Sachin Lala

customized-symspell's Issues

I can't get compound word errors to be split properly

Describe the bug
I have a sentence like:

String resp = "Yes because it's: $~num~ for ~misc~ payment contract $~num~ for planTOTAL $~num~";

but I can't configure the parameters correctly to get "plan total". I have tried the default settings and various combinations of parameters like:

    SpellCheckSettings spellCheckSettings = SpellCheckSettings.builder()
    //  .countThreshold(1).deletionWeight(1).insertionWeight(1)
    //  .replaceWeight(1).maxEditDistance(2).transpositionWeight(1).topK(5)
    //  .prefixLength(10).verbosity(Verbosity.ALL)
        .build();

    DataHolder dataHolder = new InMemoryDataHolder(spellCheckSettings,
        new Murmur3HashFunction());

    StringDistance weightedDamerauLevenshteinDistance = new WeightedDamerauLevenshteinDistance(
        spellCheckSettings.getDeletionWeight(),
        spellCheckSettings.getInsertionWeight(),
        spellCheckSettings.getReplaceWeight(),
        spellCheckSettings.getTranspositionWeight(), new QwertyDistance());

    SymSpellCheck checker = new SymSpellCheck(dataHolder,
        weightedDamerauLevenshteinDistance, spellCheckSettings);

    List<SuggestionItem> suggestions = checker.lookupCompound(resp, 1.0d, true);

To Reproduce
Steps to reproduce the behavior:

Use code above and examine the suggestions.get(0).getTerm() to see that there is no split.

Expected behavior
The term should show "plan total" (or better, preserving case... (is there an option for this?) "plan TOTAL") and not "plantotal"

Desktop (please complete the following information):

OS: MacOS 10.15.7
openjdk version "1.8.0_252"

Additional context
Providing more details on the impact of settings would be helpful for people not familiar with the art.

This may be related to #53

Upgrade Symspell to 6.7.1

Upgrade the current Symspell algorithm from 6.6 to 6.7.1

Problem with early exit

SpellHelper.earlyExit(suggestionItems, phrase, maxEditDistance);
Adding items inside this method seems to be wrong

lookupCompound() does not allow looking up 2 correctly spelled terms with only a missing space

Precondition:
SpellCheckSettings is initiated with maxEditDistance > 0.

I want to separately cover the corner case of a missing space, but only between correctly spelled words (maxEditDistance = 0 for each word separately).
But this is impossible with the same SymSpell instance if it was created using SpellCheckSettings with maxEditDistance > 0.

To cover the missing-space case, lookupCompound() uses the method lookupSplitWords().
Inside it, a word is split into part1 and part2.
lookup() is called for each part. It contains the following code:

    if (maxEditDistance <= 0) {
      maxEditDistance = spellCheckSettings.getMaxEditDistance();
    }

Now the scenario:
query: {applewatch}

I want to look up a missing space between correctly spelled words only, which means maxEditDistance = 1 (the missing space).

With the current implementation, SymSpell looks for the missing space between 2 words and allows an additional edit distance of 1 for each word. There is no way to prevent this.
Total maxEditDistance:
lookup(part1, maxEditDistance) = 1
lookup(part2, maxEditDistance) = 1
lookupCompound(part1+part2) = 1
Total = 3

Depending on the dictionary, the following results are possible:
{apple watch}, editDistance = 1
{apple patch}, editDistance = 2 (if watch is not present in the dictionary)
{apply patch}, editDistance = 3 (if neither apple nor watch is present in the dictionary)
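
The behaviour requested here, splitting only into exact dictionary words, can be sketched independently of the library as follows. `ExactSplitSketch` and `splitExact` are hypothetical names, and the plain `Set<String>` dictionary stands in for the library's data holder.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of a missing-space lookup with edit distance 0 per part.
final class ExactSplitSketch {

  /**
   * Tries every split position and accepts the first one where both parts
   * are exact dictionary entries, so only the missing space counts as an edit.
   */
  static Optional<String> splitExact(String term, Set<String> dictionary) {
    for (int i = 1; i < term.length(); i++) {
      String part1 = term.substring(0, i);
      String part2 = term.substring(i);
      if (dictionary.contains(part1) && dictionary.contains(part2)) {
        return Optional.of(part1 + " " + part2);
      }
    }
    return Optional.empty();
  }
}
```

With a dictionary containing apple and watch, `applewatch` yields `apple watch`; with only apple and patch, no result is returned instead of `apple patch`.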

Get priority words based on a dictionary or get maximal segmentation possibility

I am trying to use symspell for post-OCR correction in the financial domain.

For a sample sentence like Pro f i t in the year 2020, symspell gives the suggestion prof it in the year 20 20.

Could anyone suggest how to:

- correct pro f i t as profit
- skip segmentation for numbers or for tokens matching some regex
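
One possible workaround for the second point is to filter tokens before they reach the spell checker. The sketch below is an assumption about how that could be done outside the library: `ProtectedTokensSketch` is a hypothetical wrapper, and the `corrector` function stands in for whatever lookup call is used per token. It does not address merging `pro f i t` back into `profit`.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical pre-filter; the corrector is supplied by the caller.
final class ProtectedTokensSketch {

  // Tokens that must pass through untouched: plain numbers such as 2020 or 1,250.75.
  private static final Pattern PROTECTED = Pattern.compile("\\d+([.,]\\d+)*");

  /** Applies the corrector to every whitespace-separated token except protected ones. */
  static String correct(String sentence, UnaryOperator<String> corrector) {
    StringBuilder out = new StringBuilder();
    for (String token : sentence.trim().split("\\s+")) {
      if (out.length() > 0) {
        out.append(' ');
      }
      boolean keepAsIs = PROTECTED.matcher(token).matches();
      out.append(keepAsIs ? token : corrector.apply(token));
    }
    return out.toString();
  }
}
```

Passing an upper-casing function as the corrector makes the effect visible: `profit in 2020` becomes `PROFIT IN 2020`, with the number left as it was.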

[QUESTION] Short sentences in en

Hello, I have found this case that seems strange for the input string "I am the begt spell cherken!":

int maxEd = 2;
suggestionItems = symSpellCheck.lookupCompound("I am the begt spell cherken!", maxEd);
for (SuggestionItem elem : suggestionItems) {
  System.out.println("compound : " + elem.getTerm().trim());
}

I'm getting compound : a am the best spell cher ken

My setup is the default one:

    SpellCheckSettings spellCheckSettings = SpellCheckSettings.builder().countThreshold(1).deletionWeight(1f)
        .insertionWeight(1f).replaceWeight(1f).maxEditDistance(2).transpositionWeight(1f).topK(5).prefixLength(10)
        .verbosity(Verbosity.ALL).build();

    dataHolder = new InMemoryDataHolder(spellCheckSettings, new Murmur3HashFunction());

// weighted Damerau-Levenshtein
    weightedDamerauLevenshteinDistance = new WeightedDamerauLevenshteinDistance(spellCheckSettings.getDeletionWeight(),
        spellCheckSettings.getInsertionWeight(), spellCheckSettings.getReplaceWeight(),
        spellCheckSettings.getTranspositionWeight(), null);

    symSpellCheck = new SymSpellCheck(dataHolder, weightedDamerauLevenshteinDistance, spellCheckSettings);

Test question

First I just want to say thank you for the awesome project!

I ported it to Kotlin Multiplatform and had a few questions, I know it's been a long time, so I understand if you don't remember.

Some of my tests are failing, very slight errors in word distance and such which I'm investigating, but one in particular puzzled me:
testWordBreak()

Passes in:
"itwasabrightcolddayinaprilandtheclockswerestrikingthirteen"

And expects back:
"it was bright cold day in april and the clock were striking thirteen"

Which is missing the "a" between "was" and "bright", as well as the "s" at the end of "clocks".

My Kotlin version returns what looks more correct to me:
"it was a bright cold day in april and the clocks were striking thirteen"

Now my question is this: Did I inadvertently fix a bug? Or is the output you are expecting in your test actually correct, and an expected side effect of the algorithm?

Solr Plugin

Plugin for Solr, to use symspell lib

Build is Failing

Describe the bug
The build has been failing for a long time; the deployment step fails with an unauthorised error.

solr test

Hello.
I am currently testing customized-symspell with Solr.
I wanted to see that when I submit the query slives, I get slices and not olives.
But I could not get the result I wanted.

Before running the query:
First, I indexed documents that include the keywords slices and olives.
Second, I put the jar files on the specific core.

[jar files]

commons-collections4-4.4.jar
murmur-1.0.0.jar
solr-commons-1.2.34.jar
symspell-lib-6.6.154.jar
symspell-service-6.6.154.jar
symspell-solr-6.6.154.jar

Third, I updated the solrconfig.xml file as shown in the image below (lib path, component, requesthandler).

[add the component and requesthandler]

Is there anything else that I have to do?
Please help me! 😢

Factor should be configurable

private static Double getEdistance(double maxEditDistance, int length) {
  double factor = 0.3;
  double computedEd = Math.round(factor * length);
  if (Math.min(maxEditDistance, computedEd) == maxEditDistance) {
    return maxEditDistance;
  }
  return computedEd;
}

This factor should be configurable in SpellCheckSettings.

Extra space is not trimmed in suggestions from lookupCompound()

SymSpellCheck.class
lookupCompound()
Line 113
```
joinedTerm = joinedTerm.concat(si.getTerm()).concat(" ");
```
causes joinedTerm to end with a space.

The next line, which calculates editDistance, calls trim() on joinedTerm only for that calculation; the stored result should be trimmed as well.

phrase: "ipd"
expected joinedTerm: "ipad"
actual joinedTerm: "ipad "

The same applies to multiword phrases.
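
One way to avoid the trailing space altogether is to join the terms with a separator instead of appending a space after each one. This is a sketch of the idea only; `JoinTermsSketch` is a hypothetical class and a plain list of strings stands in for the suggestion items.

```java
import java.util.List;

// Hypothetical sketch: join terms with single spaces and no trailing space.
final class JoinTermsSketch {

  static String joinTerms(List<String> terms) {
    // String.join places the separator only between elements.
    return String.join(" ", terms);
  }
}
```

A single term such as ipad is returned unchanged, and multiword results get exactly one space between words.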

Suggestions are not properly sorted with Verbosity.ALL

From the docs for Verbosity.ALL:

All suggestions within maxEditDistance, suggestions ordered by edit distance, 
then by term frequency (slower, no early termination).

Several times I have faced cases where suggestions are not sorted correctly, neither by editDistance nor by count.
Settings are mostly default. Please see them in the screenshots.
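
For reference, the ordering described in the docs (edit distance ascending, then term frequency descending) can be expressed as a comparator. The `Suggestion` class below is a hypothetical stand-in for the library's `SuggestionItem`, whose accessors are not shown in this report.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the documented ordering for Verbosity.ALL.
final class SuggestionOrderSketch {

  /** Minimal stand-in for a suggestion: term, edit distance and frequency count. */
  static final class Suggestion {
    final String term;
    final double distance;
    final long count;

    Suggestion(String term, double distance, long count) {
      this.term = term;
      this.distance = distance;
      this.count = count;
    }
  }

  // Smaller distance first; among equal distances, higher count first.
  static final Comparator<Suggestion> BY_DISTANCE_THEN_COUNT =
      Comparator.comparingDouble((Suggestion s) -> s.distance)
          .thenComparing(Comparator.comparingLong((Suggestion s) -> s.count).reversed());

  /** Returns a sorted copy and leaves the input list untouched. */
  static List<Suggestion> sorted(List<Suggestion> suggestions) {
    List<Suggestion> copy = new ArrayList<>(suggestions);
    copy.sort(BY_DISTANCE_THEN_COUNT);
    return copy;
  }
}
```

A frequent term at distance 2 still sorts after every term at distance 1, which is the behaviour the report expects.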

Running Example

Hello, thanks a lot for this port! I was using another SymSpell Java port, but it is out of date now.
I have successfully compiled the project. Do you have a basic execution example?
Thanks a lot.

Maven Repository

Could you please create a Maven repository so that we can use this as a library?

isTransposition

Hi, to me it looks like the method isTransposition is bugged:

private boolean isTransposition(int i, int j, String a, String b) {
  return i > 2
      && j > 2
      && a.charAt(i - 2) == a.charAt(i - 1)
      && b.charAt(j - 2) == b.charAt(j - 1);
}

The letters of the strings a and b are not compared with each other. Instead, the check tests whether a and b each have two identical consecutive characters. Shouldn't it be the following?

private boolean isTransposition(int i, int j, String a, String b) {
  return i > 2
      && j > 2
      && a.charAt(i - 2) == b.charAt(j - 1)
      && b.charAt(j - 2) == a.charAt(i - 1);
}
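
The difference between the two versions can be checked with a small standalone comparison. Both methods are copied from this report into a hypothetical class, made static for convenience; the inputs assume that i and j are 1-based positions of the characters being compared, which is how the guards `i > 2` and `j > 2` are read here.

```java
// Hypothetical harness comparing the two versions quoted in this report.
final class TranspositionCheckSketch {

  /** Version currently in the code base, as quoted above. */
  static boolean original(int i, int j, String a, String b) {
    return i > 2
        && j > 2
        && a.charAt(i - 2) == a.charAt(i - 1)
        && b.charAt(j - 2) == b.charAt(j - 1);
  }

  /** Proposed version: compares the last two characters of a and b crosswise. */
  static boolean proposed(int i, int j, String a, String b) {
    return i > 2
        && j > 2
        && a.charAt(i - 2) == b.charAt(j - 1)
        && b.charAt(j - 2) == a.charAt(i - 1);
  }
}
```

For a = "xab" and b = "xba" with i = j = 3, the original returns false and the proposed version returns true. For a = "xaa" and b = "xbb", the original returns true although no characters are swapped, while the proposed version returns false.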