
SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm

Home Page: https://seekstorm.com/blog/1000x-spelling-correction/

License: MIT License

C# 96.99% Batchfile 1.33% Python 1.68%
levenshtein fuzzy-search approximate-string-matching edit-distance spellcheck spell-check levenshtein-distance damerau-levenshtein spelling fuzzy-matching


SymSpell

Spelling correction & Fuzzy search: 1 million times faster through Symmetric Delete spelling correction algorithm

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.

In contrast to other algorithms, only deletes are required; no transposes, replaces, or inserts. Transposes, replaces, and inserts of the input term are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g., Chinese has 70,000 Unicode Han characters!

The speed comes from the inexpensive delete-only edit candidate generation and the pre-calculation.
An average 5 letter word has about 3 million possible spelling errors within a maximum edit distance of 3,
but SymSpell needs to generate only 25 deletes to cover them all, both at pre-calculation and at lookup time. Magic!
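The delete-candidate generation described above can be sketched in a few lines. The following is a minimal illustrative Python sketch (not the library's implementation): it repeatedly deletes one character and collects the distinct results, and confirms that a 5-letter word with distinct letters yields exactly C(5,1) + C(5,2) + C(5,3) = 25 proper deletes within a maximum edit distance of 3.

```python
def delete_candidates(term, max_distance):
    """Generate all distinct strings obtainable from `term` by deleting
    up to `max_distance` characters (the symmetric-delete idea)."""
    candidates = {term}
    frontier = {term}
    for _ in range(max_distance):
        next_frontier = set()
        for word in frontier:
            for i in range(len(word)):
                # delete the character at position i
                next_frontier.add(word[:i] + word[i + 1:])
        candidates |= next_frontier
        frontier = next_frontier
    return candidates

# A 5-letter word with distinct letters has 5 + 10 + 10 = 25 proper deletes
deletes = delete_candidates("house", 3) - {"house"}
print(len(deletes))  # 25
```

Because the same deletes are precalculated for every dictionary term, a match between an input-term delete and a dictionary-term delete implies the two are within the chosen edit distance of each other.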

If you like SymSpell, try SeekStorm - a sub-millisecond full-text search library & multi-tenancy server in Rust (Open Source).


Copyright (c) 2022 Wolf Garbe
Version: 6.7.2
Author: Wolf Garbe <[email protected]>
Maintainer: Wolf Garbe <[email protected]>
URL: https://github.com/wolfgarbe/symspell
Description: https://seekstorm.com/blog/1000x-spelling-correction/

MIT License

Copyright (c) 2022 Wolf Garbe

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated 
documentation files (the "Software"), to deal in the Software without restriction, including without limitation 
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, 
and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

https://opensource.org/licenses/MIT

Single word spelling correction

Lookup provides a very fast spelling correction of single words.

  • A Verbosity parameter controls the number of returned results:
    Top: the single suggestion with the highest term frequency among the suggestions of smallest edit distance found.
    Closest: all suggestions of smallest edit distance found, ordered by term frequency.
    All: all suggestions within maxEditDistance, ordered by edit distance, then by term frequency.
  • The Maximum edit distance parameter controls up to which edit distance words from the dictionary are treated as suggestions.
  • The required Word frequency dictionary can either be loaded directly from text files (LoadDictionary) or generated from a large text corpus (CreateDictionary).
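To make the three verbosity modes concrete, here is a small illustrative Python sketch (not the library's code) that filters a hypothetical candidate list of (term, edit distance, frequency) tuples the way the Verbosity parameter is described above:

```python
def filter_suggestions(candidates, verbosity):
    """candidates: list of (term, distance, count) tuples within maxEditDistance.
    verbosity: 'top', 'closest', or 'all' (mirrors SymSpell.Verbosity)."""
    if not candidates:
        return []
    best = min(d for _, d, _ in candidates)
    if verbosity == "top":
        # single suggestion: smallest distance, then highest term frequency
        return [max((c for c in candidates if c[1] == best), key=lambda c: c[2])]
    if verbosity == "closest":
        # all suggestions at the smallest distance, ordered by term frequency
        return sorted((c for c in candidates if c[1] == best), key=lambda c: -c[2])
    # 'all': ordered by edit distance, then by term frequency
    return sorted(candidates, key=lambda c: (c[1], -c[2]))

cands = [("house", 1, 5000), ("horse", 1, 3000), ("hose", 2, 800)]
print(filter_suggestions(cands, "top"))  # [('house', 1, 5000)]
```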

Applications

  • Spelling correction
  • Query correction (10–15% of queries contain misspelled terms)
  • Chatbots
  • OCR post-processing
  • Automated proofreading
  • Fuzzy search & approximate string matching

Performance (single term)

0.033 milliseconds/word (edit distance 2) and 0.180 milliseconds/word (edit distance 3) (single core on a 2012 MacBook Pro)

Benchmark

1,870 times faster than BK-tree (see Benchmark 1: dictionary size=500,000, maximum edit distance=3, query terms with random edit distance = 0...maximum edit distance, verbose=0)

1 million times faster than Norvig's algorithm (see Benchmark 2: dictionary size=29,157, maximum edit distance=3, query terms with fixed edit distance = maximum edit distance, verbose=0)

Blog Posts: Algorithm, Benchmarks, Applications

1000x Faster Spelling Correction algorithm
Fast approximate string matching with large edit distances in Big Data
Very fast Data cleaning of product names, company names & street names
Sub-millisecond compound aware automatic spelling correction
SymSpell vs. BK-tree: 100x faster fuzzy string search & spell checking
Fast Word Segmentation for noisy text
The Pruning Radix Trie — a Radix trie on steroids


Compound aware multi-word spelling correction

LookupCompound supports compound aware automatic spelling correction of multi-word input strings.

1. Compound splitting & decompounding

Lookup() treats every input string as a single term. LookupCompound also supports compound splitting / decompounding, covering three cases:

  1. mistakenly inserted space within a correct word led to two incorrect terms
  2. mistakenly omitted space between two correct words led to one incorrect combined term
  3. multiple input terms with/without spelling errors

Splitting errors, concatenation errors, substitution errors, transposition errors, deletion errors, and insertion errors can be mixed within the same word.
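The first two cases (a mistakenly inserted space and a mistakenly omitted space) can be illustrated with a deliberately simplified toy sketch. This is not the library's LookupCompound (which also performs spelling correction and frequency-based ranking); it only shows the merge/split decisions against a small hypothetical word set:

```python
DICT = {"quarter", "third", "in", "the", "of", "last", "year"}

def fix_spaces(tokens):
    """Toy sketch: repair inserted/omitted spaces against a word set.
    Tries merging a token with its successor, then splitting a token."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        # case 1: mistakenly inserted space -> merge "qua rter" into "quarter"
        if tok not in DICT and i + 1 < len(tokens) and tok + tokens[i + 1] in DICT:
            out.append(tok + tokens[i + 1])
            i += 2
            continue
        # case 2: mistakenly omitted space -> split "oflast" into "of last"
        if tok not in DICT:
            for j in range(1, len(tok)):
                if tok[:j] in DICT and tok[j:] in DICT:
                    out.extend([tok[:j], tok[j:]])
                    break
            else:
                out.append(tok)  # case 3: leave the token for spell correction
            i += 1
            continue
        out.append(tok)
        i += 1
    return out

print(fix_spaces(["in", "the", "qua", "rter", "oflast", "year"]))
# ['in', 'the', 'quarter', 'of', 'last', 'year']
```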

2. Automatic spelling correction

  • Large document collections make manual correction infeasible and require unsupervised, fully-automatic spelling correction.
  • In conventional spelling correction of a single token, the user is presented with multiple spelling correction suggestions.
    For automatic spelling correction of long multi-word text the algorithm itself has to make an educated choice.

Examples:

- whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixthgrade and ins pired him
+ where is the love he had dated for much of the past who couldn't read in sixth grade and inspired him  (9 edits)

- in te dhird qarter oflast jear he hadlearned ofca sekretplan
+ in the third quarter of last year he had learned of a secret plan  (9 edits)

- the bigjest playrs in te strogsommer film slatew ith plety of funn
+ the biggest players in the strong summer film slate with plenty of fun  (9 edits)

- Can yu readthis messa ge despite thehorible sppelingmsitakes
+ can you read this message despite the horrible spelling mistakes  (9 edits)

Performance (compounds)

0.2 milliseconds/word (edit distance 2); 5,000 words/second (single core on a 2012 MacBook Pro)


Word Segmentation of noisy text

WordSegmentation divides a string into words by inserting missing spaces at appropriate positions.

  • Misspelled words are corrected and do not prevent segmentation.
  • Existing spaces are allowed and considered for optimum segmentation.
  • SymSpell.WordSegmentation uses a Triangular Matrix approach instead of the conventional Dynamic Programming: It uses an array instead of a dictionary for memoization, loops instead of recursion and incrementally optimizes prefix strings instead of remainder strings.
  • The Triangular Matrix approach is faster than the Dynamic Programming approach. It has a lower memory consumption, better scaling (constant O(1) memory consumption vs. linear O(n)) and is GC friendly.
  • While each string of length n can be segmented into 2^n−1 possible compositions,
    SymSpell.WordSegmentation has a linear runtime O(n) to find the optimum composition.
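To make the segmentation idea concrete, here is a simplified iterative dynamic-programming sketch in Python. It is not the library's triangular-matrix implementation and performs no spelling correction; it only shows how a frequency-scored optimum composition can be found without recursion, using a toy frequency table (the counts are illustrative):

```python
import math

FREQ = {"the": 23135851162, "quick": 414158017, "brown": 648919551,
        "fox": 75688498, "jumps": 32667175, "over": 1333778594,
        "lazy": 25119418, "dog": 191206835}
TOTAL = sum(FREQ.values())
MAX_WORD_LEN = max(map(len, FREQ))

def segment(text):
    """Iterative DP: best[i] holds the (log-probability, segmentation)
    of the prefix text[:i]; each step extends a prefix by one word."""
    best = [(-math.inf, "")] * (len(text) + 1)
    best[0] = (0.0, "")
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            word = text[j:i]
            if word in FREQ and best[j][0] > -math.inf:
                score = best[j][0] + math.log(FREQ[word] / TOTAL)
                if score > best[i][0]:
                    best[i] = (score, (best[j][1] + " " + word).strip())
    return best[len(text)][1]

print(segment("thequickbrownfoxjumpsoverthelazydog"))
# the quick brown fox jumps over the lazy dog
```

Although 2^n−1 compositions exist, the DP visits each (start, end) prefix pair at most once, which is what makes a non-exponential runtime possible.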

Examples:

- thequickbrownfoxjumpsoverthelazydog
+ the quick brown fox jumps over the lazy dog

- itwasabrightcolddayinaprilandtheclockswerestrikingthirteen
+ it was a bright cold day in april and the clocks were striking thirteen

- itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness
+ it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness 

Applications:

  • Word segmentation for CJK languages, for indexing, spelling correction, machine translation, language understanding, and sentiment analysis
  • Normalizing English compound nouns for search & indexing (e.g. ice box = ice-box = icebox; pig sty = pig-sty = pigsty)
  • Word segmentation for compounds if both the original word and the split word parts should be indexed
  • Correction of missing spaces caused by typing errors
  • Correction of conversion errors: spaces between words may get lost, e.g. when removing line breaks
  • Correction of OCR errors: inferior quality of original documents or handwritten text may prevent all spaces from being recognized
  • Correction of transmission errors: during transmission over noisy channels, spaces can get lost or spelling errors be introduced
  • Keyword extraction from URL addresses, domain names, #hashtags, table column descriptions, or programming variables written without spaces
  • Password analysis: the extraction of terms from passwords can be required
  • Speech recognition, if spaces between words are not properly recognized in spoken language
  • Automatic CamelCasing of programming variables
  • Applications beyond natural language processing, e.g. segmenting a DNA sequence into words

Performance:

4 milliseconds for segmenting a 185-character string into 53 words (single core on a 2012 MacBook Pro)


Usage SymSpell Demo

single word + Enter: Display spelling suggestions
Enter without input: Terminate the program

Usage SymSpellCompound Demo

multiple words + Enter: Display spelling suggestions
Enter without input: Terminate the program

Usage Segmentation Demo

string without spaces + Enter: Display word segmented text
Enter without input: Terminate the program

The Demo, DemoCompound and SegmentationDemo projects can be built with the free Visual Studio Code, which runs on Windows, macOS and Linux.

Usage SymSpell Library

//create object
int initialCapacity = 82765;
int maxEditDistanceDictionary = 2; //maximum edit distance per dictionary precalculation
var symSpell = new SymSpell(initialCapacity, maxEditDistanceDictionary);
      
//load dictionary
string baseDirectory = AppDomain.CurrentDomain.BaseDirectory;
string dictionaryPath= baseDirectory + "../../../../SymSpell/frequency_dictionary_en_82_765.txt";
int termIndex = 0; //column of the term in the dictionary text file
int countIndex = 1; //column of the term frequency in the dictionary text file
if (!symSpell.LoadDictionary(dictionaryPath, termIndex, countIndex))
{
  Console.WriteLine("File not found!");
  //press any key to exit program
  Console.ReadKey();
  return;
}

//lookup suggestions for single-word input strings
string inputTerm="house";
int maxEditDistanceLookup = 1; //max edit distance per lookup (maxEditDistanceLookup<=maxEditDistanceDictionary)
var suggestionVerbosity = SymSpell.Verbosity.Closest; //Top, Closest, All
var suggestions = symSpell.Lookup(inputTerm, suggestionVerbosity, maxEditDistanceLookup);

//display suggestions, edit distance and term frequency
foreach (var suggestion in suggestions)
{ 
  Console.WriteLine(suggestion.term +" "+ suggestion.distance.ToString() +" "+ suggestion.count.ToString("N0"));
}


//load bigram dictionary
dictionaryPath = baseDirectory + "../../../../SymSpell/frequency_bigramdictionary_en_243_342.txt";
termIndex = 0; //column of the term in the dictionary text file
countIndex = 2; //column of the term frequency in the dictionary text file
if (!symSpell.LoadBigramDictionary(dictionaryPath, termIndex, countIndex))
{
  Console.WriteLine("File not found!");
  //press any key to exit program
  Console.ReadKey();
  return;
}

//lookup suggestions for multi-word input strings (supports compound splitting & merging)
inputTerm="whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixtgrade and ins pired him";
maxEditDistanceLookup = 2; //max edit distance per lookup (per single word, not per whole input string)
suggestions = symSpell.LookupCompound(inputTerm, maxEditDistanceLookup);

//display suggestions, edit distance and term frequency
foreach (var suggestion in suggestions)
{ 
  Console.WriteLine(suggestion.term +" "+ suggestion.distance.ToString() +" "+ suggestion.count.ToString("N0"));
}


//word segmentation and correction for multi-word input strings with/without spaces
inputTerm = "thequickbrownfoxjumpsoverthelazydog";
int maxEditDistance = 0; //max edit distance per word during segmentation
var suggestion = symSpell.WordSegmentation(inputTerm, maxEditDistance);

//display term and edit distance
Console.WriteLine(suggestion.correctedString + " " + suggestion.distanceSum.ToString("N0"));


//press any key to exit program
Console.ReadKey();

Three ways to add SymSpell to your project:

  1. Add SymSpell.cs, EditDistance.cs and frequency_dictionary_en_82_765.txt to your project. All three files are located in the SymSpell folder. Enabling the compiler option "Prefer 32-bit" will significantly reduce the memory consumption of the precalculated dictionary.
  2. Add the SymSpell NuGet package to your .NET Framework project: Visual Studio / Tools / NuGet Package Manager / Manage NuGet packages for solution / select the "Browse" tab / search for SymSpell / select SymSpell / check your project in the right-hand window / click the Install button. The frequency_dictionary_en_82_765.txt is installed automatically.
  3. Add the SymSpell NuGet package to your .NET Core project: follow the same steps as for .NET Framework. The frequency_dictionary_en_82_765.txt must be copied manually to your project.

SymSpell targets .NET Standard v2.0 and can be used in:

  1. .NET Framework (Windows Forms, WPF, ASP.NET),
  2. .NET Core (UWP, ASP.NET Core, Windows, OS X, Linux),
  3. Xamarin (iOS, OS X, Android) projects.

The SymSpell, Demo, DemoCompound and Benchmark projects can be built with the free Visual Studio Code, which runs on Windows, macOS and Linux.


Frequency dictionary

Dictionary quality is paramount for correction quality. To achieve this, two data sources were combined by intersection: Google Books Ngram data, which provides representative word frequencies (but contains many entries with spelling errors), and SCOWL — Spell Checker Oriented Word Lists, which ensures genuine English vocabulary (but contains no word frequencies, which are required for ranking suggestions within the same edit distance).

The frequency_dictionary_en_82_765.txt was created by intersecting the two lists mentioned above: through reciprocal filtering, only those words which appear in both lists are used. Additional filters were applied and the resulting list was truncated to the ≈ 80,000 most frequent words.

Dictionary file format

  • Plain text file in UTF-8 encoding.
  • Word and word frequency are separated by space or tab. By default, the word is expected in the first column and the frequency in the second column. With the termIndex and countIndex parameters in LoadDictionary(), the position and order of the values can be changed and selected from a row with more than two values. This makes it possible to augment the dictionary with additional information, or to adapt to existing dictionaries without reformatting.
  • Each word-frequency pair is on a separate line. A line is defined as a sequence of characters followed by a line feed ("\n"), a carriage return ("\r"), or a carriage return immediately followed by a line feed ("\r\n").
  • Both dictionary terms and the input term are expected to be in lower case.
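The column-selection behavior described above can be sketched as a small parser. This is an illustrative Python sketch (not the C# LoadDictionary implementation) showing how termIndex and countIndex select columns from whitespace-separated rows:

```python
def parse_dictionary(lines, term_index=0, count_index=1):
    """Parse whitespace-separated word/frequency rows into a dict,
    mirroring LoadDictionary's termIndex/countIndex column selection."""
    freq = {}
    for line in lines:
        parts = line.split()
        if len(parts) <= max(term_index, count_index):
            continue  # skip malformed or too-short rows
        freq[parts[term_index]] = int(parts[count_index])
    return freq

rows = ["the 23135851162", "of 13151942776", "and 12997637966"]
print(parse_dictionary(rows))
# {'the': 23135851162, 'of': 13151942776, 'and': 12997637966}
```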

You can build your own frequency dictionary for your language or your specialized technical domain. The SymSpell spelling correction algorithm supports languages with non-Latin characters, e.g. Cyrillic, Chinese or Georgian.

Frequency dictionaries in other languages

SymSpell includes an English frequency dictionary

Dictionaries for Chinese, English, French, German, Hebrew, Italian, Russian and Spanish are located here:
SymSpell.FrequencyDictionary

Frequency dictionaries in many other languages can be found here:
FrequencyWords repository
Frequency dictionaries


C# (original source code)
https://github.com/wolfgarbe/symspell

.NET (NuGet package)
https://www.nuget.org/packages/symspell

Ports

I have not tested whether the following third-party ports or reimplementations in other programming languages are exact ports, error-free, provide identical results, or are as fast as the original algorithm.

Most ports target SymSpell version 3.0, but version 6.1 provides much higher speed & lower memory consumption!

WebAssembly
https://github.com/justinwilaby/spellchecker-wasm

WEB API (Docker)
https://github.com/LeonErath/SymSpellAPI (Version 6.3)

C++
https://github.com/AtheS21/SymspellCPP (Version 6.5)
https://github.com/erhanbaris/SymSpellPlusPlus (Version 6.1)

Crystal
https://github.com/chenkovsky/aha/blob/master/src/aha/sym_spell.cr

Go
https://github.com/sajari/fuzzy
https://github.com/eskriett/spell

Haskell
https://github.com/cbeav/symspell

Java
https://github.com/MighTguY/customized-symspell (Version 6.6)
https://github.com/rxp90/jsymspell (Version 6.6)
https://github.com/Lundez/JavaSymSpell (Version 6.4)
https://github.com/gpranav88/symspell
https://github.com/searchhub/preDict
https://github.com/jpsingarayar/SpellBlaze

Javascript
https://github.com/MathieuLoutre/node-symspell (Version 6.6, needs Node.js)
https://github.com/itslenny/SymSpell.js
https://github.com/dongyuwei/SymSpell
https://github.com/IceCreamYou/SymSpell
https://github.com/Yomguithereal/mnemonist/blob/master/symspell.js

Julia
https://github.com/Arkoniak/SymSpell.jl

Kotlin
https://github.com/Wavesonics/SymSpellKt

Objective-C
https://github.com/AmitBhavsarIphone/SymSpell (Version 6.3)

Python
https://github.com/mammothb/symspellpy (Version 6.7)
https://github.com/viig99/SymSpellCppPy (Version 6.5)
https://github.com/zoho-labs/symspell (Python bindings of Rust version)
https://github.com/ne3x7/pysymspell/ (Version 6.1)
https://github.com/Ayyuriss/SymSpell
https://github.com/ppgmg/github_public/blob/master/spell/symspell_python.py
https://github.com/rcourivaud/symspellcompound
https://github.com/Esukhia/sympound-python
https://www.kaggle.com/yk1598/symspell-spell-corrector

Ruby
https://github.com/PhilT/symspell

Rust
https://github.com/reneklacan/symspell (Version 6.6, compiles to WebAssembly)
https://github.com/luketpeterson/fuzzy_rocks (persistent datastore backed by RocksDB)

Scala
https://github.com/semkath/symspell

Swift
https://github.com/Archivus/SymSpell


Citations

Contextual Multilingual Spellchecker for User Queries
Sanat Sharma, Josep Valls-Vargas, Tracy Holloway King, Francois Guerin, Chirag Arora (Adobe)
https://arxiv.org/abs/2305.01082

A context sensitive real-time Spell Checker with language adaptability
Prabhakar Gupta (Amazon)
https://arxiv.org/abs/1910.11242

An Extended Sequence Tagging Vocabulary for Grammatical Error Correction
Stuart Mesham, Christopher Bryant, Marek Rei, Zheng Yuan
https://arxiv.org/abs/2302.05913

German Parliamentary Corpus (GERPARCOR)
Giuseppe Abrami, Mevlüt Bagci, Leon Hammerla, Alexander Mehler
https://arxiv.org/abs/2204.10422

iOCR: Informed Optical Character Recognition for Election Ballot Tallies
Kenneth U. Oyibo, Jean D. Louis, Juan E. Gilbert
https://arxiv.org/abs/2208.00865

Amazigh spell checker using Damerau-Levenshtein algorithm and N-gram
Youness Chaabi, Fadoua Ataa Allah
https://www.sciencedirect.com/science/article/pii/S1319157821001828

Survey of Query correction for Thai business-oriented information retrieval
Phongsathorn Kittiworapanya, Nuttapong Saelek, Anuruth Lertpiya, Tawunrat Chalothorn
https://ieeexplore.ieee.org/document/9376809

SymSpell and LSTM based Spell- Checkers for Tamil
Selvakumar Murugan, Tamil Arasan Bakthavatchalam, Malaikannan Sankarasubbu
https://www.researchgate.net/publication/349924975_SymSpell_and_LSTM_based_Spell-_Checkers_for_Tamil

SymSpell4Burmese: Symmetric Delete Spelling Correction Algorithm (SymSpell) for Burmese Spelling Checking
Ei Phyu Phyu Mon; Ye Kyaw Thu; Than Than Yu; Aye Wai Oo
https://ieeexplore.ieee.org/document/9678171

Spell Check Indonesia menggunakan Norvig dan SymSpell
Yasir Abdur Rohman
https://medium.com/@yasirabd/spell-check-indonesia-menggunakan-norvig-dan-symspell-4fa583d62c24

Analisis Perbandingan Metode Burkhard Keller Tree dan SymSpell dalam Spell Correction Bahasa Indonesia
Muhammad Hafizh Ferdiansyah, I Kadek Dwi Nuryana
https://ejournal.unesa.ac.id/index.php/jinacs/article/download/50989/41739

Improving Document Retrieval with Spelling Correction for Weak and Fabricated Indonesian-Translated Hadith
Muhammad Zaky Ramadhan, Kemas M. Lhaksmana
https://www.researchgate.net/publication/342390145_Improving_Document_Retrieval_with_Spelling_Correction_for_Weak_and_Fabricated_Indonesian-Translated_Hadith

Korean Spelling Correction Using SymSpell (Symspell을 이용한 한글 맞춤법 교정)
Heegyu Kim (김희규)
https://heegyukim.medium.com/symspell%EC%9D%84-%EC%9D%B4%EC%9A%A9%ED%95%9C-%ED%95%9C%EA%B8%80-%EB%A7%9E%EC%B6%A4%EB%B2%95-%EA%B5%90%EC%A0%95-3def9ca00805

Mending Fractured Texts. A heuristic procedure for correcting OCR data
Jens Bjerring-Hansen, Ross Deans Kristensen-McLachlan, Philip Diderichsen and Dorte Haltrup Hansen
https://ceur-ws.org/Vol-3232/paper14.pdf

Towards the Natural Language Processing as Spelling Correction for Offline Handwritten Text Recognition Systems
Arthur Flor de Sousa Neto; Byron Leite Dantas Bezerra; and Alejandro Héctor Toselli
https://www.mdpi.com/2076-3417/10/21/7711

When to Use OCR Post-correction for Named Entity Recognition?
Vinh-Nam Huynh, Ahmed Hamdi, Antoine Doucet
https://hal.science/hal-03034484v1/

Automatic error Correction: Evaluating Performance of Spell Checker Tools
A. Tolegenova
https://journals.sdu.edu.kz/index.php/nts/article/view/690

ZHAW-CAI: Ensemble Method for Swiss German Speech to Standard German Text
Malgorzata Anna Ulasik, Manuela Hurlimann, Bogumila Dubel, Yves Kaufmann,
Silas Rudolf, Jan Deriu, Katsiaryna Mlynchyk, Hans-Peter Hutter, and Mark Cieliebak
https://ceur-ws.org/Vol-2957/sg_paper3.pdf

Cyrillic Word Error Program Based on Machine Learning
Battumur, K., Dulamragchaa, U., Enkhbat, S., Altanhuyag, L., & Tumurbaatar, P.
https://mongoliajol.info/index.php/JIMDT/article/view/2661

Fast Approximate String Search for Wikification
Szymon Olewniczak, Julian Szymanski
https://www.iccs-meeting.org/archive/iccs2021/papers/127440334.pdf

RuMedSpellchecker: Correcting Spelling Errors for Natural Russian Language in Electronic Health Records Using Machine Learning Techniques
Dmitrii Pogrebnoi, Anastasia Funkner, Sergey Kovalchuk
https://link.springer.com/chapter/10.1007/978-3-031-36024-4_16

An Extended Sequence Tagging Vocabulary for Grammatical Error Correction
Stuart Mesham, Christopher Bryant, Marek Rei, Zheng Yuan
https://aclanthology.org/2023.findings-eacl.119.pdf

Lightning-fast adaptive immune receptor similarity search by symmetric deletion lookup
Touchchai Chotisorayuth, Andreas Tiffeau-Mayer
https://arxiv.org/html/2403.09010v1

Unveiling Disguised Toxicity: A Novel Pre-processing Module for Enhanced Content Moderation
Johnny Chan, Yuming Li
https://www.sciencedirect.com/science/article/pii/S2215016124001225


Upcoming changes

  1. Utilizing the pigeonhole principle by partitioning both query and dictionary terms will result in 5x less memory consumption and 3x faster precalculation time.
  2. Option to preserve case (upper/lower case) of input term.
  3. Open source the code for creating custom frequency dictionaries in any language and size as the intersection between Google Books Ngram data (provides representative word frequencies) and SCOWL Spell Checker Oriented Word Lists (ensures genuine English vocabulary).

Changes in v6.7.2

  1. Exception fixed in WordSegmentation
  2. Platform changed from netcore 2.1 to netcore 3.0

Changes in v6.7.1

  1. Framework target changed from net472 to net47
  2. Framework target netcoreapp3.0 added
  3. More common contractions added to frequency_dictionary_en_82_765.txt

Changes in v6.7

  1. WordSegmentation did not work correctly if input string contained words in uppercase.
  2. WordSegmentation now retains/preserves case.
  3. WordSegmentation now keeps punctuation or apostrophe adjacent to previous word.
  4. WordSegmentation now normalizes ligatures: "scientiﬁc" -> "scientific".
  5. WordSegmentation now removes hyphens prior to word segmentation (as they might be caused by syllabification).
  6. American English word forms added to dictionary in addition to British English e.g. favourable -> favorable.

Changes in v6.6

  1. IMPROVEMENT: LoadDictionary and LoadBigramDictionary now have an optional separator parameter, which defines the separator characters (e.g. '\t') between term(s) and count. The default is defaultSeparatorChars=null for white space.
    This allows dictionaries to contain space-separated phrases.
    If in LoadBigramDictionary no separator parameter is given, or defaultSeparatorChars (whitespace) is given as the separator, then two term parts are taken; otherwise only one (which then itself is a space-separated bigram).

Changes in v6.5

  1. IMPROVEMENT: Better SymSpell.LookupCompound correction quality with an existing single-term dictionary by using Naive Bayes probability for selecting the best word splitting.
    bycycle -> bicycle (instead of "by cycle")
    inconvient -> inconvenient (instead of "i convent")
  2. IMPROVEMENT: Even better SymSpell.LookupCompound correction quality, when using the optional bigram dictionary in order to use sentence level context information for selecting best spelling correction.
  3. IMPROVEMENT: English bigram frequency dictionary included

Changes in v6.4

  1. LoadDictionary(Stream, ...) and CreateDictionary(Stream) methods added (contribution by ccady).
    Allows loading dictionaries from network streams, memory streams, and resource streams in addition to the previously supported files.

Changes in v6.3

  1. IMPROVEMENT: WordSegmentation added:
    WordSegmentation divides a string into words by inserting missing spaces at appropriate positions.
    Misspelled words are corrected and do not prevent segmentation.
    Existing spaces are allowed and considered for optimum segmentation.
    SymSpell.WordSegmentation uses a novel approach to word segmentation without recursion.
    While each string of length n can be segmented into 2^n−1 possible compositions,
    SymSpell.WordSegmentation has a linear runtime O(n) to find the optimum composition.
  2. IMPROVEMENT: New CommandLine parameters:
    LookupType: lookup, lookupcompound, wordsegment.
    OutputStats: switch to show only corrected string or corrected string, edit distance, word frequency/probability.
  3. IMPROVEMENT: Lookup with maxEditDistance=0 faster.

Changes in v6.2

  1. IMPROVEMENT: SymSpell.CommandLine project added. Allows pipes and redirects for input & output. Dictionary/Corpus file, MaxEditDistance, Verbosity, PrefixLength can be specified via the command line. No programming required.
  2. IMPROVEMENT: DamerauOSA edit distance updated, Levenshtein edit distance added (in SoftWx.Match by Steve Hatchett)
  3. CHANGE: Other projects in the SymSpell solution now use references to SymSpell instead of links to the source files.

Changes in v6.1

  1. IMPROVEMENT: SymSpellCompound has been refactored from a static to an instantiated class and integrated into SymSpell. Therefore SymSpellCompound is now also based on the latest SymSpell version, with all fixes and performance improvements.
  2. IMPROVEMENT: symspell.demo.csproj, symspell.demoCompound.csproj and symspell.Benchmark.csproj have been recreated from scratch and now target .NET Core instead of .NET Framework, for improved compatibility with other platforms like macOS and Linux.
  3. CHANGE: The testdata directory has been moved from the demo folder into the benchmark folder
  4. CHANGE: License changed from LGPL 3.0 to the more permissive MIT license to allow frictionless commercial usage.

Changes in v6.0

  1. IMPROVEMENT: SymSpell internal dictionary has been refactored by Steve Hatchett.
    2x faster dictionary precalculation and 2x lower memory consumption.

Changes in v5.1

  1. IMPROVEMENT: SymSpell has been refactored from static to instantiated class by Steve Hatchett.
  2. IMPROVEMENT: Added benchmarking project.
  3. IMPROVEMENT: Added unit test project.
  4. IMPROVEMENT: Different maxEditDistance for dictionary precalculation and for Lookup.
  5. CHANGE: Removed language feature (use separate SymSpell instances instead).
  6. CHANGE: Verbosity parameter changed from Int to Enum
  7. FIX: Incomplete lookup results, if maxEditDistance=1 AND input.Length>prefixLength.
  8. FIX: count overflow protection fixed.

Changes in v5.0

  1. FIX: Suggestions were not always complete for input.Length <= editDistanceMax.
  2. FIX: Suggestions were not always complete/best for verbose < 2.
  3. IMPROVEMENT: Prefix indexing implemented: more than 90% memory reduction, depending on prefix length and edit distance. The discriminatory power of additional chars is decreasing with word length. By restricting the delete candidate generation to the prefix, we can save space, without sacrificing filter efficiency too much. Longer prefix length means higher search speed at the cost of higher index size.
  4. IMPROVEMENT: Algorithm for DamerauLevenshteinDistance() changed for a faster one.
  5. ParseWords() without LINQ
  6. CreateDictionaryEntry simplified, AddLowestDistance() removed.
  7. Lookup() improved.
  8. Benchmark() added: Lookup of 1000 terms with random spelling errors.

Changes in v4.1

  1. symspell.csproj Generates a SymSpell NuGet package (which can be added to your project)
  2. symspelldemo.csproj Shows how SymSpell can be used in your project (by using symspell.cs directly or by adding the SymSpell NuGet package )

Changes in v4.0

  1. Fix: previously, not all suggestions within the edit distance (verbose=1) or the best suggestion (verbose=0) were always returned: e.g. "elove" did not return "love".
  2. Regex will no longer split words at apostrophes.
  3. Dictionary<string, object> dictionary changed to Dictionary<string, Int32> dictionary
  4. LoadDictionary() added to load a frequency dictionary. CreateDictionary remains and can be used alternatively to create a dictionary from a large text corpus.
  5. English word frequency dictionary added (wordfrequency_en.txt). Dictionary quality is paramount for correction quality. To achieve this, two data sources were combined by intersection: Google Books Ngram data, which provides representative word frequencies (but contains many entries with spelling errors), and SCOWL — Spell Checker Oriented Word Lists, which ensures genuine English vocabulary (but contains no word frequencies, which are required for ranking suggestions within the same edit distance).
  6. dictionaryItem.count was changed from Int32 to Int64 for compatibility with dictionaries derived from Google Ngram data.

SymSpell is contributed by SeekStorm - the high-performance Search as a Service & search API.


Contributors

altmas5, ashkanparsa, cbeav, ccady, devleoko, hanabi1224, jpsingarayar, kestasjk, leonerath, reneklacan, rkttu, softwx, starsbit, tbroadley, wolfgarbe, xrmx


symspell's Issues

Update NuGet package to 5.0

I have noticed that you have updated the code to version 5.0, but the NuGet package is still referencing 4.1.

Unable to replicate spelling correction (LookupCompound) shown in README.md

In SymSpell.CompoundDemo (v6.1, and v6.3),
in te dhird qarter oflast jear he hadlearned ofca sekretplan y iran is corrected to
in the third quarter of last year he learned of a secret plan a iran instead of
in the third quarter of last year he learned of a secret plan by iran as shown in README.md. Note that the y in front of iran is corrected to a instead of by.

May I know if I need to change some argument values to get the correction shown in README.md?

repeated characters issue

If I type hekko (for hello) or rppm (for room), I get unexpected results: it does not suggest hello and room. (Edit distance = 2 and Verbose = ALL)

Dictionary loading/lookup optimization

When you load the dictionary all at once, this can consume quite a bit of memory and is a blocking process. If the dictionary is large, this can cause responsiveness issues, assuming you don't handle that another way.

What do you think about loading the dictionary in chunks in a Lookup method? So, that alternative Lookup method would load a chunk of the dictionary, try to find a match/suggestion in that chunk, and repeat until either a match/suggestion is found or there are no more chunks to load. Effectively, you'd be streaming in the dictionary as needed, rather than loading the whole shebang into memory.

I haven't looked at the source, so I don't know if that's feasible. Maybe you need the whole dictionary loaded? That said, an index-based approach instead of chunking might even be better, like Sphinx does.

Just some thoughts.

How to read line

Hi,

First of all, thank you for the library.

I'm working on a Swift version of the library and can't understand this line:

if ((prefixLength - maxEditDistance == candidateLen)
                        && (((min = Math.Min(inputLen, suggestionLen) - prefixLength) > 1)
                            && (input.Substring(inputLen + 1 - min) != suggestion.Substring(suggestionLen + 1 - min)))
                           || ((min > 0) && (input[inputLen - min] != suggestion[suggestionLen - min])
                               && ((input[inputLen - min - 1] != suggestion[suggestionLen - min])
                                   || (input[inputLen - min] != suggestion[suggestionLen - min - 1]))))
                    {

When will min be changed?
Only if prefixLength - maxEditDistance == candidateLen?
Or if (Math.Min(inputLen, suggestionLen) - prefixLength) > 1?

Thanks a lot!

Objective-C or Swift version

Hi,
This is a very efficient and magical way to do auto-correction.
Is it possible for you to provide an Objective-C or Swift version of the same?

It would be very helpful for me in my application.

Thank you.

Support for weighted edit distance

I'm not sure if SymSpell already has support for weighted edit distance. If so, please tell me how to use it.

Otherwise, I suggest adding this as another possible distance metric, in addition to Levenshtein and Damerau-Levenshtein. The implementation itself shouldn't be problematic: just use the weight matrix instead of the default unit cost. The matrix is input to the constructor, and for command line use it can be stored in a file. (I could in principle do it myself, but I don't know C#.)
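For what it's worth, the requested metric is straightforward to sketch. A minimal Python illustration (not SymSpell's API; the function name and the example weights are hypothetical) of a Levenshtein distance with a substitution weight matrix:

```python
def weighted_edit_distance(a, b, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Levenshtein distance where each substitution cost is looked up in
    `sub_cost`, a dict keyed by (char_from, char_to); unlisted pairs cost 1.0.
    Insertions and deletions have uniform costs here for brevity."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost.get((a[i - 1], b[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + del_cost,      # delete from a
                          d[i][j - 1] + ins_cost,      # insert into a
                          d[i - 1][j - 1] + sub)       # substitute (weighted)
    return d[m][n]

# With a hypothetical cheap 'j' -> 'g' substitution, "gist" outranks "list" for "jist":
weights = {('j', 'g'): 0.2}
print(weighted_edit_distance("jist", "gist", weights))  # 0.2
print(weighted_edit_distance("jist", "list", weights))  # 1.0
```

A full integration would presumably also need weighted insert/delete costs and a matching change to the verification step, since the symmetric delete index itself only prunes candidates by unweighted delete count.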

Issue with apostrophes

I am working off the Java port of SymSpell here (https://github.com/gpranav88/symspell), but I think this error would affect this project too.

But looking at the code, @line78 in the parseWords(text) function:

return Regex.Matches(text.ToLower(), @"[\w-[\d_]]+") .Cast<Match>() .Select(m => m.Value);

It seems to split words such as "shouldn't" into 2 words ("shouldn" and "t").

Maybe I am wrong here, but shouldn't that be

return Regex.Matches(text.ToLower(), @"[\w-[\d_']]+") .Cast<Match>() .Select(m => m.Value);

so that contractions are added to the dictionary?
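For readers following along in a port: the pattern above uses .NET character class subtraction, which Python's re module lacks; the equivalent class is [^\W\d_]. A small Python sketch of the reported split and an apostrophe-friendly alternative (the patterns here are illustrative, not the library's own code):

```python
import re

text = "shouldn't we"

# Equivalent of the C# pattern [\w-[\d_]]+ (word chars minus digits/underscore):
print(re.findall(r"[^\W\d_]+", text.lower()))  # ['shouldn', 't', 'we']

# Allowing an apostrophe between letter runs keeps contractions whole:
print(re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower()))  # ["shouldn't", 'we']
```

One caveat on the suggested .NET fix: since ' is not in \w to begin with, subtracting it in [\w-[\d_']] is a no-op; the apostrophe would need to be added to the base class rather than the subtracted one.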

Unable to run the code : New to C# and .net

This question might sound very basic/stupid, but I am very new to C# and .NET. I installed VS Code and opened the downloaded SymSpell project code in it. Now how can I run the demo?

The Problems panel shows the error: The type 'List<>' is defined in an assembly that is not referenced. You must add a reference to assembly 'netstandard, Version=2.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'. [SymSpell.Test]

When I try to run the SegmentationDemo with Ctrl+F5 it gives this error:

Unhandled Exception: System.PlatformNotSupportedException: Operation is not supported on this platform.
at System.ConsolePal.set_WindowWidth(Int32 value)
at symspell.Benchmark.Benchmark.Main(String[] args) in /Users/aashishamber/Downloads/SymSpell-master/SymSpell.Benchmark/SymSpell.Benchmark.cs:line 52

Kindly help how to run the code.

Better usage documentation?

How are you supposed to integrate SymSpell into a C# project?

I installed the package via NuGet into a .NET 4.6.2 solution.

 var symSpell = new SymSpell(initialCapacity, maxEditDistance, prefixLength);

This line produces an error in my project: Cannot create an instance of the abstract class SymSpell.

Strangely, that same line does not produce an error in the demo project.

As an aside, I had no issues with integrating SymSpellCompound, although the Correct method seemed to only parrot the input words or input text rather than suggest any corrections.

edit: Well, I went ahead and compiled the SymSpell solution, and included the DLL as a reference in my project. I guess that's how? What's the point of the NuGet package then?

edit: Well, the reason why SymSpellCompound wasn't suggesting anything was because the dictionary wasn't loaded, and Lookup wasn't throwing an exception.

Suggestion: Split dictionary and core

I tried SymSpell and it looks great. But one thing I noticed almost immediately is that it brings the dictionary file into my project (even if I do not use it). I understand it helps with a quick start, but I strongly believe that in real applications most users make their own. Even so, I think it would be better to split the NuGet package into SymSpell.Core and SymSpell.Dic.En, for example. To keep compatibility, SymSpell could be composed of these two packages (something like Microsoft.AspNetCore.App).

Next valid letters

Does SymSpell have a way to get the next valid set of letters given a prefix? E.g. if I give it "carpo" it would return "o" and "r" for "carpool" and "carport" (maybe others).
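As far as I can tell, SymSpell's delete index doesn't expose prefix completion, but the behavior being asked for is easy to sketch over the raw word list (the function name and word list here are made up for illustration; a trie would be the efficient structure for this):

```python
def next_letters(words, prefix):
    """Set of characters that can follow `prefix` in any dictionary word.
    A linear scan for clarity; a trie answers this in O(len(prefix))."""
    following = set()
    for w in words:
        if w.startswith(prefix) and len(w) > len(prefix):
            following.add(w[len(prefix)])
    return following

words = ["carpool", "carport", "carpet", "cargo"]
print(sorted(next_letters(words, "carpo")))  # ['o', 'r']
```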

Suggestion: instantiable class instead of static

Using SymSpell is great. However, I notice that it's all one static class. This makes it difficult to modify settings for different uses, especially the dictionary and verbosity. It would be helpful to enable this kind of code:

SymSpell spellingDict1 = new SymSpell(path1, "", 0, 1, 0); // dictionary 1 with verbosity 0
SymSpell spellingDict2 = new SymSpell(path1, "", 0, 1, 1); // dictionary 1 with verbosity 1
SymSpell spellingDict3 = new SymSpell(path1, "", 0, 1, 2); // dictionary 1 with verbosity 2
SymSpell spellingDict4 = new SymSpell(path2, "", 0, 1, 0); // dictionary 2 with verbosity 0

The constructor could take the dictionary, or set it later, with optional additional parameters for verbosity, edit distance, lp, etc.

Performance while adding dictionary

Thank you for this great code. I'd like to ask for a recommendation/suggestion on how I could make loading a big dictionary faster without slowing down the application.
I have an idea; please answer, as the creator, whether this is going to work in your opinion:

  1. I thought about adding words in smaller chunks via multiple for loops using CreateDictionaryEntry() function on arrays.
  2. To make this not stall the application, I could run those loops on a background thread. Small enough chunks should not affect performance, while gradually building the dictionary before the user gets to type in a word.

My previous implementation was my own, but it was using SQL database. Queries were too slow for this kind of usage or too many with a bigger dictionary.

An issue: for some reason, CreateDictionaryEntry() is the only way that works for me after using LoadDictionary() first. CreateDictionary() doesn't add more terms to an existing dictionary; the word count stays the same. Maybe it's because it's plain text without a second column for word frequency?

And two short questions:

  1. How do I add multi-word phrases to the dictionary among normal single-word ones? Currently, CreateDictionaryEntry() seems to add some phrases while throwing others out. If a phrase starts with a number and the second word has only letters, it shows up in Lookup(). But if the first word starts with an uppercase letter and the second is just 3 uppercase letters, Lookup() doesn't work. The same when there's some punctuation.
  2. How do I add phrases (single and multi-word) with punctuation and make them show up as suggestions among single-word phrases?

symspell paper

Is there any paper for SymSpell?
Does SymSpell detect real-word errors?

Can the processed dictionary created on-the-fly by the command line be saved for re-use?

First of all, thank you for developing this tool. It is amazingly useful and fast!
I have been using the command line version since my programming skills are limited (my skills being in VBA, SQL and some Java). I do have Visual Studio 2017 installed, so perhaps that could help if I need to modify the project on my end.

So my hope would be that I could find a way to do the following:

  1. process the frequency dictionary once and save it for re-use
    (I want to confirm that there is no way to have the processed dictionary be loaded into a db like mysql..I assume this won't work because mysql cannot create the proper indexes..correct?)

  2. use a command line switch to set the number of matches returned (in frequency order, of course) for edit distance 1, edit distance 2, etc. So let's say I want the "top 5" and I set my max edit distance to 3; then I would get 15 results (assuming there are >=5 matches for each edit distance). As it is now I may get a few for distance 1, a lot for edit distance 2 and a massive list for edit distance 3. I have been attempting to clean up names from the census which have transcription errors, and many times the correct name is the 1st or 2nd result in edit distance 2 or 3 (not edit distance 1). If I could get the top few matches from each edit distance then I have a phonetic algorithm that narrows the results.

  3. ultimately I would really want to create an Excel function that could call the command line for matches, where the processed dictionary is already loaded into memory and that environment is accessible to VBA

As a first step - items 1 and 2 are most important (saving the processed dictionary and setting the max # matches ordered by frequency)

Do you think this is possible? And could the dictionary ever be moved into a DB? Thanks for your help and for sharing this excellent tool.

As a side note, could this ever be successfully migrated to Node.js to create an API?

How to augment existing dictionary?

Hi,

Apologies for using the issue tracker to ask a question.

How do I add a replacement for gr8 to great? Is there a way to augment/extend the existing dictionary frequency_dictionary_en_82_765.txt using Python to include these replacements? If yes, how?

I also tried the example I like readying, writing and singing, expecting an outcome of I like reading writing and singing, but it never changed anything apart from stripping out the punctuation.

I then checked the frequency dictionary and found that readying is included in there as well. Do I have to take it out to get it to replace readying with reading?

Best wishes and great work!

File and folder capitalisation problems

Firstly, thank you for this is project - it looks really impressive.

I'm running this using .NET Core on Linux and am running into problems with the capitalisation of files and folders when I try dotnet restore:

/opt/dotnet/sdk/2.1.4/NuGet.targets(227,5): warning MSB3202: The project file "/home/neil/Projects/SymSpell/symspell.Demo/symspell.Demo.csproj" was not found. [/home/neil/Projects/SymSpell/SymSpell.sln]

Having cloned the repo from GitHub, the folder is actually SymSpell.Demo, with similar differences in the files within those folders (e.g. SymSpell.CompoundDemo, SymSpell.Benchmark).

The fix isn't hard: either change the entries in SymSpell.sln to reflect the folders/filenames, or adjust those folders/filenames to match the .sln file.

I opted to change the folders and files but then had to adjust the location the demo looked for the frequency dictionary as that was looking in the capitalised version of the folder:

File not found: /home/neil/Projects/SymSpell/SymSpell/frequency_dictionary_en_82_765.txt

I am new to using .NET Core on Linux, so there may be some setting or tip I'm overlooking to help with this, but I suspect that if development is being done on Windows then the problem may have gone unnoticed there, because Windows tends not to be case sensitive for folders/files whereas Linux is.

Let me know if there are any more details you'd like me to supply for this. Thx!

use SymSpell with hunspell dictionaries?

This is not really a bug report, sorry for using the issue tracker for this, but I'd like to see if someone has worked on this. Please point me to a better place if you know one:

Has anybody worked on using SymSpell with German hunspell dictionaries? German uses compounds, so you cannot just export a long list from the hunspell dictionaries and use them as input for SymSpell. The hunspell dictionary has special flags that indicate which words can be used in a compound, these would need to be considered somehow.

More than 2 columns and space separated words

Hi,

1 - I want to add more columns like 'category' or 'type' or 'Culture' to the dataset, and in that case I may need to have a word twice in the dataset.
To add more columns, which you mentioned is possible, should I change the LoadDictionary method to support more than 2 columns?

2 - What can I do for space separated words, something like Mercedes Benz?

Best,
Amir

prioritizing types of distance errors?

Great library! It works very well and is highly performant. To be honest I don't know or understand the underlying algorithm, but I have a suggestion / request from a user's perspective.

Right now the results seem to be ordered only by the # of character modifications, and are agnostic to the type of modification. However, this results in some unexpected "corrections". I think adjacent letter swaps should be highest priority, followed by missing a repeated letter, followed by adding a letter, followed by removing a letter, followed by total letter replacement (this is most likely to result in a different intended word). Alternatively, some sort of ranking/sorting that favors keeping a larger percentage of the input word's characters.

Examples:
basicly --> basic --> expected basically
collegue --> college --> expected colleague
finaly --> final --> expected finally
jist --> list --> expected gist (this one could be theoretically helped by the j sound being the same as g sound)
liase --> laser --> expected liaise
peice --> price --> expected piece
politican --> political --> expected politician
realy --> real --> expected really
rember --> member --> expected remember (this is two steps away, so maybe ignore it)
seige --> beige --> expected siege
tonge --> lounge --> expected tongue

This is an incomplete example list that I put together quickly from https://en.oxforddictionaries.com/spelling/common-misspellings
using SymSpell.editDistanceMax = 3; (because 2 missed too many misspellings).

Overall, great library, thank you for maintaining it.

Best approach for a language that has Clitic pronouns

Currently I am trying to figure out an approach to make a spell checker for Central Kurdish. Just like some other Indo-European languages, the Kurdish language has clitic pronouns. It's a bit tricky because:

  • Kurdish has two sets of clitic pronouns that are used depending on the tense of the verb and whether it's transitive or not.
  • The pronouns can stick to most parts of speech: Nouns, Verbs, Adjectives and Adverbs.
  • They don't always stick to the end of the word, they can also appear in the middle of the word (after the root of the verbs or after the first word of a compound word)
  • There can be more than one pronoun stuck to a verb (One of them acts as a subject the other acts as an object). See example 3

Because there are two sets of pronouns, pre-calculating a dictionary with 100K words would result in about 1M words. That's before calculating the edit distances.

Which of these approaches do you think is the best in this case?

  • Compile a big dictionary with every valid combination
  • Compile a smaller dictionary composed of only the base words and categorize them and then expand each of them at run-time based on the PoS and other properties of each word.
  • Use a Recurrent Neural Network to do the job

Examples:

Note: letters in bold are pronouns.

Eat

  • I eat => Min Dexom [Min, m]
  • We eat => Ême Dexoyn [Ême, yn]
  • They eat => Ewan Dexon [Ewan, n]

Work

  • I worked => Min Karmdekird [Min, m]
  • He Worked => Ew Karîdekird [Ew, î]
  • They worked => Ewan Karyandekird [Ewan, yan]

Forgive

  • [You (plural)] Forgive me => Bimbexshin [m, n]
  • [You] Forgive her => Bîbexshe [î, e]
  • [You] Forgive them => Biyanbexshe [yan, e]

How to calculate distance manually in symspell?

Greetings Sir,
I'm working on a spell checker for the Urdu language.
I tried your algorithm and it gives me great results.
Now, can you explain how the SymSpell algorithm works?
If I wanted to calculate the Levenshtein edit distance manually, I know the way.
So how do I calculate the deletions in SymSpell or SymSpellCompound manually? What is the procedure for calculating it by hand?

Plus, I didn't get the algorithm. I know there are deletions instead of insertions and all,
but I don't have a proper understanding of the given algorithm.
I read the algorithm written by you on Medium.
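To make the manual procedure concrete: the core of the algorithm is that only deletes are ever generated, for both the dictionary terms (at index time) and the input term (at lookup time). A minimal Python sketch of that delete generation (illustrative, not the library's code):

```python
def deletes(term, max_distance):
    """All strings reachable from `term` by deleting up to
    `max_distance` characters."""
    results = set()

    def recurse(word, distance):
        if distance == 0:
            return
        for i in range(len(word)):
            shorter = word[:i] + word[i + 1:]   # drop character i
            if shorter not in results:
                results.add(shorter)
                recurse(shorter, distance - 1)

    recurse(term, max_distance)
    return results

print(sorted(deletes("word", 1)))  # ['ord', 'wod', 'wor', 'wrd']
```

An input term and a dictionary term become candidates for each other when one equals the other, one appears in the other's delete set, or their delete sets share an entry; the true Damerau-Levenshtein distance is then computed only for those few candidates, which is where the speedup comes from.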

Using different Dataset of words

Dear Wolfgarbe,

I'm currently trying to make this program work for the Georgian language. I made a dataset of words with their respective frequencies. For some reason, SymSpell does not return my suggestions. I'm sure that it is not SymSpell's fault. I attached the dataset.
The only difference I see between this and the demo dataset is that mine is not sorted. So does the sort order matter in this case? Or is there some other issue with my dataset?
workfile.txt

Sincerely

Phonetic Suggestions

I am testing this library and need some advice as to whether this is the right tool for the job, as the top suggestion appears to be based simply on the nearest match rather than on any phonetic matching used by other spell checkers.

For example, searching "kween" returns a top match of "keen". Using Closest returns keen, tween, ween yet something like Hunspell will return Queen as the top match which is what I would expect in this case.

Command line usage?

Hi,

Is there a way to use symspell from the command line? I am not a programmer and want to use symspell without having to build a project (or use old ports). I am on linux. Thanks for any advice.

Use as search engine

Hi, how might I use SymSpell as a search algorithm that returns an index given a query string?

[Question] About SymSpell model and probabilistic models (Norvig, etc.)

I'm currently using both Hunspell and SymSpell as my main spelling correction systems. They both work OK; SymSpell works great (quality, performance, etc.). That said, I have a question about Norvig's probabilistic spell checker, which I'll illustrate with a simple case.
In some romanized languages, there is no one-to-one relation from the source script language term to the English (romanized) language term. So given the romanization of, say, Hindi, you will get several possible English words as the destination. This is a typical output of such a system: 1 (Hindi) word -> N (eng) words.
Typically, deciding which of the N words is best is done with algorithms like beam search, Viterbi, etc., but there are a lot of cases where the indecision remains.
Also, in the other direction we have eng (N) -> hi (M), so this function is not bijective at all.
Given that a spell checker has knowledge of all (or most of) the words in a language, and supposing I need context (like in this case) to go back from eng (N) -> hi (M), do you think that SymSpell or Norvig's probabilistic model could give a valid hint about the M choices (or the N in the opposite direction)? What's your opinion on that?

SymSpell lookupcompound with verbosity param

Is there a way to look up a composite word, but with the Verbosity param?
For example I have names of people and I would like to get back the closest 5 people, not just the best match.

Symmetric Delete spelling correction algorithm

Hello sir, thank you for this nice code. I am writing a paper about word segmentation and fuzzy search of words for a certain script. Can I get the detailed algorithm used for this code?

Levenshtein

Wolf,
I've updated the GitHub project that has my Levenshtein function. It now supports being used via an instantiated class. Used this way, it consumes almost no memory, even temporarily, so its memory impact is extremely negligible. I also made it a NuGet package, packaged as a .NET Standard 1.0 library. If you're interested in pulling it into SymSpell, let me know, and I can do that and submit a pull request. It would be cleaner, I think, if the source were pulled into the SymSpell project rather than creating the package dependency. Before doing it, though, it might be good if I had a better idea of how SymSpell is used by you and other folks as far as multi-threading is concerned. The Levenshtein distance can be computed via an instantiated class, but the distance function itself is not threadsafe. There is also a static version of the function that is only a tiny bit slower, but it loses the memory consumption advantage of the instantiated version.
Steve

Add overload(s) to LoadDictionary which accept Stream instead of file name

At the moment, LoadDictionary assumes that the dictionary is located in a separate file on the file system.
This limits the ways of storing the dictionary.
E.g. an embedded resource or remote storage is not an option unless the file is copied locally.

Alternatively, all records could be parsed one by one; that, however, would require writing usage code which is almost identical to the body of LoadDictionary.

P.S. I'm happy with doing a pull request in case it is an acceptable change
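The decoupling being requested can be sketched in a few lines. In Python terms (hypothetical function, not the C# API), accepting any file-like object instead of a path looks like:

```python
import io

def load_dictionary(stream, term_index=0, count_index=1):
    """Parse 'term count' lines from any file-like object, so the dictionary
    can come from a file, an embedded resource, or a network stream alike."""
    entries = {}
    for line in stream:
        parts = line.split()
        if len(parts) > max(term_index, count_index):
            entries[parts[term_index]] = int(parts[count_index])
    return entries

# Works with an in-memory stream just as well as with open(path):
data = io.StringIO("the 23135851162\nof 13151942776\n")
print(load_dictionary(data))  # {'the': 23135851162, 'of': 13151942776}
```

The same shape in C# would be an overload taking a Stream or TextReader, with the existing path-based overload delegating to it.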

How to exclude Nouns in SymSpell

Is there a way to exclude nouns while correcting for spelling mistakes. When I do a spell check for a paragraph, the names also get changed into some word in the dictionary. Please let me know if there are any options available to implement this.

Can I use SymSpell in an Android app?

Hi,
I want to use SymSpell in Android but couldn't find the dependencies to run it on Android. How can I bind it with my Android app?

Thanks.

Ported to Ruby

I hope you don't mind, I ported your C# code to Ruby.

https://github.com/PhilT/symspell

I've only just got it working and it's late so committing what I have now. Let me know if you'd like me to put any more of the original project details/copyright etc in my port.

Thanks for sharing!

How to handle completely wrong sentence word?

Hi,
First of all, SymSpell is damn fast and pretty much does my job for spell correction, but the issue I am facing is that when my application's user intentionally types a completely wrong word or sentence, SymSpell will come up with a "right" word for it, which should be avoided.
Example
User types: avedoamlkejuike...
SymSpell: a video am like juice keen...
Something like this is totally irrelevant for my use case.
So how can I solve this just by using SymSpell?
Thanks in advance

SymSpell LookupCompound excluding Numbers and Special characters

I'm trying to use SymSpell for OCR post-processing spell correction.
I have noticed that SymSpell LookupCompound excludes numbers and special characters from the output. In my context, numbers and special characters are really important for further analysis.
Is it possible to avoid the elimination of numbers and special characters?

Version: SymSpell 6.3 C# project

Steps to reproduce:

  1. Build the SymSpell C# code

  2. Go to \SymSpell\SymSpell.CompoundDemo

  3. Run dotnet run .

  4. Enter below input
    "To find out more about how we use information, visit or contact-any of our offices 24/7"

  5. It gives below output.
    to find out more about how we use information visit or contact any of our offices of 5 30,646,750

Problem:
We can notice that, the output doesn't contain ',' and 24/7

Expected Behavior
to find out more about how we use information, visit or contact any of our offices 24/7

Python V6.3 Multiple Space Segmentation

LookupCompound finds single missing spaces; Segmentation finds multiple.

For the case when many words are concatenated without spaces,
how is Segmentation implemented in Python to add spaces where needed?

Introduce an interface to SymSpell class to simplify mocking/testing

At the moment we're forced to write our own wrappers around SymSpell that allow mocking the class in our unit tests.
A predefined interface (or at least virtual attributes on most of the compute-intensive methods) would remove this requirement.

P.S. I'm happy with doing a pull request in case it is an acceptable change

Non-English not working so well

I'm trying to use SymSpell on a non-English text (Norwegian). Got a good database from https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/no/no_full.txt.

However, the following example does not work at all:

//lookup suggestions for multi-word input strings (supports compound splitting & merging)
inputTerm="dettefungererikkeveldigbra";
maxEditDistanceLookup = 2; //max edit distance per lookup (per single word, not per whole input string)
suggestions = symSpell.LookupCompound(inputTerm, maxEditDistanceLookup);

//display suggestions, edit distance and term frequency
foreach (var suggestion in suggestions)
{ 
  Console.WriteLine(suggestion.term +" "+ suggestion.distance.ToString() +" "+ suggestion.count.ToString("N0"));
}

Not sure if this is an error on my part, or if non-English is not supported.

I would have expected the output to be "dette fungerer ikke veldig bra", but instead I just get the original input back.

Lookup() however, works I think.

Frequency dictionary errors

Using frequency_dictionary_en_500_000.txt, I spotted some errors:

  • youre should be you're
  • dont should be don't
  • dont's should be don'ts
  • Lots of weird [a-zA-Z]+[0-9]+ words, too.

This list seems to include common misspellings and 1337 speak (e.g., f1nancially, di3.)

Issue with word segmentation

int imax = Math.Min(input.Length - j, maxSegmentationWordLength);

I have a doubt about how the code for word segmentation will be able to segment the given example:

Input : thequickbrownfoxjumpsoverthelazydog
Output : the quick brown fox jumps over the lazy dog

because in the outer loop "j" iterates from 0 to "input.length", and in the inner loop "i" varies from 1 to "imax".
Assuming maxSegmentationWordLength is large enough, imax is always taking the value (input.length - j),
so as j increases, imax decreases, and the scope of the substring that we take, i.e. "part", will shrink. So my concerns are:

  • the moment j crosses (input.length / 2), "i" would always be smaller than j, and the "part" substring would make no sense, as we are taking part = input.Substring(j, i) and j > i
  • my other concern is that strings in the half beyond the middle index will not get segmented, because they will never be assigned to "part", for the above reason.
  • So how would we be able to segment the complete string?
    I have not implemented the actual C# code, but wrote a Python 2 implementation of the same, and I am facing functional issues, which I have described above.

Please assist here. Thanks a lot.
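One note that may resolve the doubt: in C#, input.Substring(j, i) takes i characters starting at index j, so i is a word length, not an absolute index, and j > i is perfectly fine; every position of the string is reached. A self-contained Python sketch of the same idea, simplified to a naive dynamic program over unigram frequencies (no spelling correction; the frequency values are illustrative):

```python
import math

def segment(text, freq, max_word_len=20, total=1_000_000):
    """Split `text` into words maximizing the product of unigram
    probabilities. `freq` maps word -> count; unknown chunks get
    a tiny pseudo-count so the recursion never dead-ends."""
    n = len(text)
    best = [(-math.inf, "")] * (n + 1)  # best[k] = (log prob, segmentation of text[:k])
    best[0] = (0.0, "")
    for j in range(n):                  # j = start index of the next word
        if best[j][0] == -math.inf:
            continue
        for i in range(1, min(n - j, max_word_len) + 1):  # i = word LENGTH, so j > i is fine
            part = text[j:j + i]        # same role as input.Substring(j, i)
            logp = math.log(freq.get(part, 0.1) / total)
            cand = (best[j][0] + logp, (best[j][1] + " " + part).strip())
            if cand[0] > best[j + i][0]:
                best[j + i] = cand
    return best[n][1]

freq = {"the": 23135851162, "quick": 41331, "brown": 5344, "fox": 8888}
print(segment("thequickbrownfox", freq))  # the quick brown fox
```

Because best[j + i] is updated for every start j and every length i, the second half of the string is segmented just like the first.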

Better explanation of SuggestionStage

I don't get the whole SuggestionStage thing. The comments say it's there to speed things up and be more memory efficient, but looking at the code I can't see how it would do anything but the opposite. Creating and populating the staging object might be faster, but in the end everything still has to go into the final data structure, so it's just an intermediate data object. I must be missing something.
