customized-symspell's Introduction

Customized SymSpell SpellCheck Java

This customized spell checker is based on SymSpell, a fuzzy-search spelling correction library, with a few customizations and optimizations.

Java Ported v6.6 (Bigrams)

  • Includes the optional bigram dictionary, which uses sentence-level context information to select the best spelling correction.

SymSpell

  • The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance.
  • It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.
  • Unlike other algorithms, only deletes are required; no transposes + replaces + inserts. Transposes + replaces + inserts of the input term are transformed into deletes of the dictionary term.
  • The speed comes from the inexpensive delete-only edit candidate generation and the pre-calculation.
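The delete-only candidate generation described above can be sketched as follows. This is a minimal illustration, not the library's code: the class and method names are hypothetical, and the real implementation additionally applies prefix length and hashing.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, for illustration only.
class DeleteCandidates {

  // Collects every string reachable from `word` by removing
  // between 1 and `maxDistance` characters.
  static Set<String> deletes(String word, int maxDistance) {
    Set<String> result = new HashSet<>();
    collect(word, maxDistance, result);
    return result;
  }

  private static void collect(String word, int remaining, Set<String> result) {
    if (remaining == 0 || word.isEmpty()) {
      return;
    }
    for (int i = 0; i < word.length(); i++) {
      String shorter = word.substring(0, i) + word.substring(i + 1);
      // A string of a given length is always reached at the same depth,
      // so it only needs to be expanded the first time it is seen.
      if (result.add(shorter)) {
        collect(shorter, remaining - 1, result);
      }
    }
  }
}
```

With a maximum distance of 1, both the dictionary term slices and the input slives produce the delete slies, which is how a replaced character is found using deletes alone.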

Customizations

  • We replaced the Damerau-Levenshtein implementation with a weighted Damerau-Levenshtein implementation, in which each operation (delete, insert, swap, replace) can have its own edit weight.
  • We added customization "hooks" that rerank the top-k results (the candidate list). The results are reordered based on a combined proximity:
    • keyboard distance is used to derive a dynamic replacement weight (since letters close to each other on the keyboard are more likely to be typed in place of one another)
    • the query is normalized before the search
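To make the idea of per-operation weights concrete, here is a minimal sketch of a weighted distance in the restricted (optimal string alignment) form of Damerau-Levenshtein. It is an assumption-laden illustration: the class name, method signature, and the choice of the restricted variant are not taken from the library.

```java
// Hypothetical sketch; not the library's WeightedDamerauLevenshteinDistance.
class WeightedDistanceSketch {

  // Cost of turning `source` into `target`, where each operation has its own weight.
  static double distance(String source, String target,
      double deleteW, double insertW, double replaceW, double transposeW) {
    int n = source.length();
    int m = target.length();
    double[][] d = new double[n + 1][m + 1];
    for (int i = 1; i <= n; i++) {
      d[i][0] = i * deleteW; // delete all characters of source
    }
    for (int j = 1; j <= m; j++) {
      d[0][j] = j * insertW; // insert all characters of target
    }
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        char a = source.charAt(i - 1);
        char b = target.charAt(j - 1);
        double best = d[i - 1][j - 1] + (a == b ? 0.0 : replaceW);
        best = Math.min(best, d[i - 1][j] + deleteW);
        best = Math.min(best, d[i][j - 1] + insertW);
        // Swap of two adjacent characters: source "..xy" versus target "..yx".
        if (i > 1 && j > 1 && a == target.charAt(j - 2) && source.charAt(i - 2) == b) {
          best = Math.min(best, d[i - 2][j - 2] + transposeW);
        }
        d[i][j] = best;
      }
    }
    return d[n][m];
  }
}
```

With all weights set to 1 this reduces to the unweighted distance; lowering a single weight, for example the transposition weight, makes that kind of typo cheaper than the others.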

Keyboard based Qwerty/Qwertz Distance

There are two keyboard implementations: one based on the English QWERTY layout and one based on the German QWERTZ layout. Both use the adjacency graph of the keyboard to assign weights to connected nodes (neighboring keys).

Example

Take two dictionary terms:
        slices
        olives

If the misspelled word is slives, both slices and olives are at edit distance 1,
so by default the term with the higher frequency ends up in the result.
With the QWERTY-based character distance, slives is closer to slices.

The reason is that on a QWERTY keyboard S and O are far apart, while V and C are adjacent.
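The example can be reproduced with a small sketch. Everything here is an assumption made for illustration: the class name, the simplified grid model of the keyboard (row stagger is ignored), and the weights 0.5 and 1.0 are not taken from the library's QwertyDistance.

```java
// Hypothetical sketch of a keyboard-aware replacement weight.
class QwertyAdjacencySketch {

  // Letter rows of a QWERTY keyboard, treated as a plain grid.
  private static final String[] ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};

  // Returns {row, column} for a letter, or null if it is not on the letter rows.
  private static int[] position(char c) {
    char lower = Character.toLowerCase(c);
    for (int row = 0; row < ROWS.length; row++) {
      int column = ROWS[row].indexOf(lower);
      if (column >= 0) {
        return new int[] {row, column};
      }
    }
    return null;
  }

  // Two different keys are adjacent when they are at most one row and one column apart.
  static boolean adjacent(char a, char b) {
    int[] pa = position(a);
    int[] pb = position(b);
    if (pa == null || pb == null) {
      return false;
    }
    boolean sameKey = pa[0] == pb[0] && pa[1] == pb[1];
    return !sameKey && Math.abs(pa[0] - pb[0]) <= 1 && Math.abs(pa[1] - pb[1]) <= 1;
  }

  // Assumed weights: 0.5 for neighboring keys, 1.0 otherwise.
  static double replaceWeight(char typed, char intended) {
    if (typed == intended) {
      return 0.0;
    }
    return adjacent(typed, intended) ? 0.5 : 1.0;
  }

  // Sum of replacement weights for two strings of equal length.
  static double substitutionCost(String typed, String candidate) {
    if (typed.length() != candidate.length()) {
      throw new IllegalArgumentException("strings must have equal length");
    }
    double cost = 0.0;
    for (int i = 0; i < typed.length(); i++) {
      cost += replaceWeight(typed.charAt(i), candidate.charAt(i));
    }
    return cost;
  }
}
```

Under these assumed weights, substitutionCost("slives", "slices") is 0.5 and substitutionCost("slives", "olives") is 1.0, so slices ranks first regardless of term frequency.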

Generation of Deletes

Deletes for a word are generated up to an edit distance that is the minimum of the maximum edit distance and 0.3 * word.length.
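As a rough illustration of this rule, the sketch below computes the delete distance for a few word lengths. The class and method names are hypothetical, and rounding with Math.round is an assumption based on the getEdistance snippet quoted in one of the issues further down this page.

```java
// Hypothetical sketch of the per-word delete distance.
class DeleteDistanceSketch {

  // Minimum of the configured maximum and the rounded length-based value.
  static double deleteEditDistance(double maxEditDistance, int wordLength) {
    double computed = Math.round(0.3 * wordLength);
    return Math.min(maxEditDistance, computed);
  }

  public static void main(String[] args) {
    // Prints the delete distance for several word lengths with a maximum of 2.
    for (int length : new int[] {1, 3, 6, 10}) {
      System.out.println(length + " -> " + deleteEditDistance(2.0, length));
    }
  }
}
```

The effect is that short words get fewer deletes: with a maximum edit distance of 2, a three-letter word is only indexed with deletes up to distance 1.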

Accuracy Summary

Indexed Docs: 3695

Searches: 8060

Spell Correction Strategy   Accuracy   Failures   TP     TN    FP    FN
LUCENE                      78.96%     21.04%     5883   481   146   1550
Vanilla SymSpell            88.80%     11.20%     6888   269   358   545
Weighted SymSpell           75.74%     24.26%     5781   324   303   1652
Qwerty Vanilla SymSpell     88.57%     11.43%     6860   279   348   573
Qwerty Weighted SymSpell    75.36%     24.64%     5744   330   297   1689
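The Accuracy column is consistent with (TP + TN) divided by the number of searches, and Failures with (FP + FN) divided by the same total. The following sketch recomputes the figures from the counts; the class and method names are hypothetical.

```java
// Hypothetical helper for recomputing the table's percentages.
class AccuracySketch {

  // Share of searches counted as correct (true positives plus true negatives).
  static double accuracyPercent(long tp, long tn, long fp, long fn) {
    return 100.0 * (tp + tn) / (tp + tn + fp + fn);
  }

  // Share of searches counted as failures (false positives plus false negatives).
  static double failurePercent(long tp, long tn, long fp, long fn) {
    return 100.0 * (fp + fn) / (tp + tn + fp + fn);
  }

  public static void main(String[] args) {
    // LUCENE row from the table above.
    System.out.printf("%.2f%%%n", accuracyPercent(5883, 481, 146, 1550));
    System.out.printf("%.2f%%%n", failurePercent(5883, 481, 146, 1550));
  }
}
```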

Benchmark Summary

We did 3 runs each for the 30k and 80k data sets, covering every verbosity level. The final benchmark results are:

Average Precalculation time instance 30843.33 ms
Average Lookup time instance 138141.09296296295 ns ~ 0.13814 ms
Total Lookup results instance 648092

Versioning

We use SemVer for versioning.

License

The MIT License (MIT)
Copyright © 2019 Lucky Sharma ( https://github.com/MighTguY/customized-symspell )
Copyright © 2018 Wolf Garbe (Original C# implementation https://github.com/wolfgarbe/SymSpell )

Permission is hereby granted, free of charge, to any person 
obtaining a copy of this software and associated documentation files
(the “Software”), to deal in the Software without restriction, 
including without limitation the rights to use, copy, modify,
merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is 
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall 
be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, 
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, 
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR 
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR 
THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Special Mentions

Sachin Lala

customized-symspell's People

Contributors

dependabot[bot], markusheiden, mightguy, tomsquest

customized-symspell's Issues

I can't get compound word errors to be split properly

Describe the bug
I have a sentence like:

String resp = "Yes because it's: $~num~ for ~misc~ payment contract $~num~ for planTOTAL $~num~";

but I can't configure the parameters correctly to get "plan total". I have tried the default settings and various combinations of parameters like:

         SpellCheckSettings spellCheckSettings = SpellCheckSettings.builder()
//            .countThreshold(1).deletionWeight(1).insertionWeight(1)
//            .replaceWeight(1).maxEditDistance(2).transpositionWeight(1).topK(5)
//            .prefixLength(10).verbosity(Verbosity.ALL)
            .build();

         DataHolder dataHolder = new InMemoryDataHolder(spellCheckSettings,
            new Murmur3HashFunction());

         StringDistance weightedDamerauLevenshteinDistance = new WeightedDamerauLevenshteinDistance(
            spellCheckSettings.getDeletionWeight(),
            spellCheckSettings.getInsertionWeight(),
            spellCheckSettings.getReplaceWeight(),
            spellCheckSettings.getTranspositionWeight(), new QwertyDistance());

         SymSpellCheck checker = new SymSpellCheck(dataHolder,
            weightedDamerauLevenshteinDistance, spellCheckSettings);

         List<SuggestionItem> suggestions = checker.lookupCompound(resp, 1.0d, true );

To Reproduce
Steps to reproduce the behavior:

  1. Use code above and examine the suggestions.get(0).getTerm() to see that there is no split.

Expected behavior
The result should contain "plan total" (or better, "plan TOTAL" with the case preserved; is there an option for this?) rather than "plantotal".

Desktop (please complete the following information):

  • OS: MacOS 10.15.7
  • openjdk version "1.8.0_252"

Additional context
Providing more details on the impact of settings would be helpful for people not familiar with the art.

This may be related to #53

Problem with early exit

SpellHelper.earlyExit(suggestionItems, phrase, maxEditDistance);
Adding items inside this method seems to be wrong

lookupCompound() cannot look up two correctly spelled terms separated only by a missing space

Precondition:
SpellCheckSettings is initialized with maxEditDistance > 0.

I want to handle the corner case of a missing space separately, but only between correctly spelled words (maxEditDistance = 0 for each word on its own).
This is impossible with the same SymSpell instance if it was created from SpellCheckSettings with maxEditDistance > 0.

To handle a missing space, lookupCompound() calls the method lookupSplitWords().
That method splits a word into part1 and part2
and calls lookup() for each part. lookup() contains the following code:

    if (maxEditDistance <= 0) {
      maxEditDistance = spellCheckSettings.getMaxEditDistance();
    }

Now the scenario:
query: {applewatch}

I want to look for a missing space between correctly spelled words only, which means maxEditDistance = 1 (the missing space).

With the current implementation, SymSpell looks for the missing space between the two words and also allows an additional edit distance of 1 for each word. There is no way to prevent this.
Total maxEditDistance:
lookup(part1, maxEditDistance) = 1
lookup(part2, maxEditDistance) = 1
lookupCompound(part1 + part2) = 1
Total = 3

Depending on the dictionary, the following results are possible:
{apple watch}, editDistance = 1
{apple patch}, editDistance = 2 (if watch is not present in the dictionary)
{apply patch}, editDistance = 3 (if neither apple nor watch is present in the dictionary)
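The behavior requested here, a split that accepts only exact dictionary words on both sides, could look roughly like the sketch below. It works on a plain set of words and does not use the library's lookup(); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a missing-space lookup with edit distance 0 per part.
class ExactSplitSketch {

  // Returns every "part1 part2" split where both parts are exact dictionary words.
  static List<String> splitIntoKnownWords(String term, Set<String> dictionary) {
    List<String> splits = new ArrayList<>();
    for (int i = 1; i < term.length(); i++) {
      String part1 = term.substring(0, i);
      String part2 = term.substring(i);
      // No fuzzy matching: each part must be present as-is.
      if (dictionary.contains(part1) && dictionary.contains(part2)) {
        splits.add(part1 + " " + part2);
      }
    }
    return splits;
  }
}
```

For the query applewatch and a dictionary containing apple, watch, apply and patch, only apple watch is returned; if watch is missing from the dictionary, the result is empty rather than apple patch.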

[QUESTION] Short sentences in en

Hello, I have found this case that seems strange for the input string "I am the begt spell cherken!":

int maxEd = 2;
suggestionItems = symSpellCheck.lookupCompound("I am the begt spell cherken!", maxEd);
    for (SuggestionItem elem : suggestionItems) {
      System.out.println("compound : " + elem.getTerm().trim());
    }

I'm getting compound : a am the best spell cher ken

My setup is the default one:

    SpellCheckSettings spellCheckSettings = SpellCheckSettings.builder().countThreshold(1).deletionWeight(1f)
        .insertionWeight(1f).replaceWeight(1f).maxEditDistance(2).transpositionWeight(1f).topK(5).prefixLength(10)
        .verbosity(Verbosity.ALL).build();

    dataHolder = new InMemoryDataHolder(spellCheckSettings, new Murmur3HashFunction());

// weighted Damerau-Levenshtein
    weightedDamerauLevenshteinDistance = new WeightedDamerauLevenshteinDistance(spellCheckSettings.getDeletionWeight(),
        spellCheckSettings.getInsertionWeight(), spellCheckSettings.getReplaceWeight(),
        spellCheckSettings.getTranspositionWeight(), null);

    symSpellCheck = new SymSpellCheck(dataHolder, weightedDamerauLevenshteinDistance, spellCheckSettings);

Test question

First I just want to say thank you for the awesome project!

I ported it to Kotlin Multiplatform and had a few questions, I know it's been a long time, so I understand if you don't remember.

Some of my tests are failing, very slight errors in word distance and such which I'm investigating, but one in particular puzzled me:
testWordBreak()

Passes in:
"itwasabrightcolddayinaprilandtheclockswerestrikingthirteen"

And expects back:
"it was bright cold day in april and the clock were striking thirteen"

Which is missing the "a" between "was" and "bright", as well as the "s" at the end of "clocks".

My Kotlin version returns what looks more correct to me:
"it was a bright cold day in april and the clocks were striking thirteen"

Now my question is this: Did I inadvertently fix a bug? Or is the output you are expecting in your test actually correct, and an expected side effect of the algorithm?

Build is Failing

Describe the bug
The build has been failing for a long time; the deployment step fails with an unauthorized error.

solr test

Hello.
I am currently testing customized-symspell with Solr.
I wanted to reproduce the case where the query slives returns slices rather than olives,
but I could not get the result I wanted.

[Screenshot 2023-04-28 11:48:42 AM]

Here is what I did before getting this result.
First, I indexed documents that include the keywords slices and olives.
Second, I put the jar files on the specific core.

[jar files]

  • commons-collections4-4.4.jar
  • murmur-1.0.0.jar
  • solr-commons-1.2.34.jar
  • symspell-lib-6.6.154.jar
  • symspell-service-6.6.154.jar
  • symspell-solr-6.6.154.jar

Third, I updated the solrconfig.xml file as shown in the image below (lib path, component, request handler).

[added component and request handler]
[Screenshot 2023-04-28 11:51:46 AM]

Is there anything else I have to do?
Please help me! 😢

Factor should be configurable

    private static Double getEdistance(double maxEditDistance, int length) {
      double factor = 0.3;
      double computedEd = Math.round(factor * length);
      if (Math.min(maxEditDistance, computedEd) == maxEditDistance) {
        return maxEditDistance;
      }
      return computedEd;
    }

This factor should be configurable in SpellCheckSettings.
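One possible shape for this request is sketched below. The class is hypothetical and stands in for whatever SpellCheckSettings would expose; the library does not necessarily offer such a setting.

```java
// Hypothetical holder for a configurable length factor.
class ConfigurableDeleteDistance {

  private final double factor;

  ConfigurableDeleteDistance(double factor) {
    if (factor < 0) {
      throw new IllegalArgumentException("factor must not be negative");
    }
    this.factor = factor;
  }

  // Same rule as getEdistance above, with the factor supplied by the caller.
  double editDistance(double maxEditDistance, int length) {
    double computedEd = Math.round(factor * length);
    return Math.min(maxEditDistance, computedEd);
  }
}
```

With a factor of 0.3 a three-letter word gets an edit distance of 1; with a factor of 0.5 the same word gets 2 (capped by a maximum edit distance of 2).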

Extra space is not trimmed in suggestions from lookupCompound()

SymSpellCheck.class
lookupCompound()
Line 113

    joinedTerm = joinedTerm.concat(si.getTerm()).concat(" ");

As a result, joinedTerm ends with a space.

For example, the next line, which calculates editDistance, calls trim() on joinedTerm only for that calculation, while the stored result should be trimmed as well.

phrase: "ipd"
expected joinedTerm: "ipad"
actual joinedTerm: "ipad "

The same happens for multiword phrases.
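The difference can be shown with a small sketch that works on plain strings instead of SuggestionItem objects. The class and method names are hypothetical; String.join is one way to avoid the trailing separator, and calling trim() on the stored result would work as well.

```java
import java.util.List;

// Hypothetical sketch comparing the two ways of joining terms.
class JoinTermsSketch {

  // Mirrors the reported pattern: every term is followed by a space.
  static String joinWithConcat(List<String> terms) {
    String joinedTerm = "";
    for (String term : terms) {
      joinedTerm = joinedTerm.concat(term).concat(" ");
    }
    return joinedTerm;
  }

  // Places the separator only between terms, so nothing needs trimming.
  static String joinWithoutTrailingSpace(List<String> terms) {
    return String.join(" ", terms);
  }
}
```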

Suggestions are not properly sorted with Verbosity.ALL

From docs to Verbosity.ALL:

All suggestions within maxEditDistance, suggestions ordered by edit distance, 
then by term frequency (slower, no early termination).

I have several times run into cases where suggestions are not sorted correctly, neither by editDistance nor by count.
The settings are mostly default; please see them in the screenshots.

ScreenShot_1

ScreenShot_2
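The ordering that the documentation describes (edit distance ascending, then term frequency descending) can be expressed as a comparator. The sketch below uses a hypothetical stand-in class rather than the library's SuggestionItem.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the documented ordering for Verbosity.ALL.
class SuggestionOrderSketch {

  // Minimal stand-in for a suggestion: term, edit distance, frequency count.
  static final class Suggestion {
    final String term;
    final double distance;
    final long count;

    Suggestion(String term, double distance, long count) {
      this.term = term;
      this.distance = distance;
      this.count = count;
    }
  }

  // Smaller distance first; for equal distance, higher count first.
  static final Comparator<Suggestion> BY_DISTANCE_THEN_COUNT =
      Comparator.comparingDouble((Suggestion s) -> s.distance)
          .thenComparing(Comparator.comparingLong((Suggestion s) -> s.count).reversed());

  // Returns the terms in the documented order without modifying the input list.
  static List<String> sortedTerms(List<Suggestion> suggestions) {
    List<Suggestion> copy = new ArrayList<>(suggestions);
    copy.sort(BY_DISTANCE_THEN_COUNT);
    List<String> terms = new ArrayList<>();
    for (Suggestion s : copy) {
      terms.add(s.term);
    }
    return terms;
  }
}
```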

Running Example

Hello, thanks a lot for this port! I was using another SymSpell Java port, but it is out of date now.
I have successfully compiled the project. Do you have a basic execution example?
Thanks a lot.

Maven Repository

Could you please publish this to a Maven repository so that we can use it as a library?

isTransposition

Hi, to me it looks like the method isTransposition is bugged:

    private boolean isTransposition(int i, int j, String a, String b) {
      return i > 2
          && j > 2
          && a.charAt(i - 2) == a.charAt(i - 1)
          && b.charAt(j - 2) == b.charAt(j - 1);
    }

The letters of the strings a and b are not compared with each other. Instead, the code checks whether a and b each contain two identical consecutive characters. Shouldn't it be the following?

    private boolean isTransposition(int i, int j, String a, String b) {
      return i > 2
          && j > 2
          && a.charAt(i - 2) == b.charAt(j - 1)
          && b.charAt(j - 2) == a.charAt(i - 1);
    }
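The two variants can be compared side by side on small inputs. The sketch below copies both predicates into a hypothetical test class and keeps the i > 2 and j > 2 guards exactly as quoted, since the indexing convention of the surrounding code is not shown here.

```java
// Hypothetical harness comparing the reported and the proposed check.
class TranspositionCheckSketch {

  // Reported variant: compares characters within the same string.
  static boolean reported(int i, int j, String a, String b) {
    return i > 2
        && j > 2
        && a.charAt(i - 2) == a.charAt(i - 1)
        && b.charAt(j - 2) == b.charAt(j - 1);
  }

  // Proposed variant: compares the crossed characters of both strings.
  static boolean proposed(int i, int j, String a, String b) {
    return i > 2
        && j > 2
        && a.charAt(i - 2) == b.charAt(j - 1)
        && b.charAt(j - 2) == a.charAt(i - 1);
  }

  public static void main(String[] args) {
    // "xab" versus "xba" is a swap of the last two characters.
    System.out.println(reported(3, 3, "xab", "xba") + " " + proposed(3, 3, "xab", "xba"));
    // "xaa" versus "xbb" is not a swap, only doubled letters.
    System.out.println(reported(3, 3, "xaa", "xbb") + " " + proposed(3, 3, "xaa", "xbb"));
  }
}
```

For xab versus xba the reported variant returns false and the proposed one returns true; for xaa versus xbb it is the other way around, which matches the description in this issue.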
