Overview
There are several open-source dictionaries available in the Hunspell *.dict and *.aff formats. Notably, there are a good many here.
Why?
Right now, the main problem with the spellchecker is the available word list.
The current one, english_words.txt, has too many words.
Not only that, but the word list also contains a lot of "words" that don't seem to be part of the standard English lexicon (e.g. "aarp").
By enabling Harper to use Hunspell dictionaries, we can lean on the existing curation.
The Formats
Source
*.dict File
The *.dict file is extremely similar in usage to our existing english_words.txt file.
The main difference is the addition of "/"-separated postfixes that provide additional information about each word.
These postfixes allow Hunspell to ship a relatively small word set, and expand it at runtime.
This file could technically act as a drop-in replacement for the existing word list, but certain words would be marked as invalid, since we wouldn't be processing the postfixes.
For example, "there" would be marked as valid, but "there's" would not.
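As a rough illustration of the "/"-separated postfixes, the sketch below splits dictionary lines into a stem and its flags. It is written in Python for brevity and is not Harper's implementation. The names DictEntry, parse_dict_line, and parse_dict are hypothetical, the flag "X" is made up, and the sketch assumes single-character flags and ignores any other per-line metadata.

```python
from __future__ import annotations

from typing import NamedTuple


class DictEntry(NamedTuple):
    stem: str
    flags: str  # assumption: one character per flag, e.g. "XY"


def parse_dict_line(line: str) -> DictEntry:
    """Split 'stem/FLAGS' into its parts; a line without '/' has no flags."""
    stem, _, flags = line.strip().partition("/")
    return DictEntry(stem, flags)


def parse_dict(text: str) -> list[DictEntry]:
    """Parse the whole text of a dictionary file into entries."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Hunspell dictionaries conventionally start with an approximate
    # entry count on the first line; skip it if present.
    if lines and lines[0].strip().isdigit():
        lines = lines[1:]
    return [parse_dict_line(ln) for ln in lines]
```

Using only the stem of each entry is what treating the file as a drop-in word list amounts to: "there" is found, while any form that exists only through a flag is not.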
*.aff File
The affix file defines how the postfixes described above should be expanded.
Right now, we do not intend to support the entire *.aff file format, just enough to fit our needs with a specific dictionary. For example, we will ignore the encoding setting and assume all dictionaries are UTF-8.
We will also (at least initially) not support compounding.
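To make the idea of a limited subset concrete, here is a minimal Python sketch of suffix expansion. It is not Harper's implementation and covers only suffix (SFX) rule lines of the form "SFX flag strip add condition", with no prefixes, no compounding, and no encoding handling. SAMPLE_AFF and its flags "X" and "Y" are invented for illustration, and the names SuffixRule, parse_suffix_rules, and expand are hypothetical. The sketch also assumes single-character flags and treats each condition as a regular expression anchored at the end of the stem.

```python
import re
from typing import NamedTuple

# Hypothetical affix rules, not taken from any real dictionary.
SAMPLE_AFF = """\
SFX X Y 1
SFX X 0 's .
SFX Y Y 2
SFX Y y ies [^aeiou]y
SFX Y 0 s [^y]
"""


class SuffixRule(NamedTuple):
    flag: str
    strip: str      # characters removed from the end of the stem
    add: str        # characters appended after stripping
    condition: str  # pattern the end of the stem must match


def parse_suffix_rules(aff_text):
    rules = []
    for line in aff_text.splitlines():
        parts = line.split()
        # Rule lines have five fields; the four-field header lines
        # (SFX flag Y/N count) and every other directive are skipped.
        if len(parts) >= 5 and parts[0] == "SFX":
            _, flag, strip, add, condition = parts[:5]
            # "0" stands for "nothing" in the strip and add fields.
            rules.append(SuffixRule(flag,
                                    "" if strip == "0" else strip,
                                    "" if add == "0" else add,
                                    condition))
    return rules


def expand(stem, flags, rules):
    """Return the stem plus every form produced by its flagged suffix rules."""
    forms = {stem}
    for rule in rules:
        if rule.flag not in flags:
            continue
        if not re.search(f"(?:{rule.condition})$", stem):
            continue
        if rule.strip and not stem.endswith(rule.strip):
            continue
        base = stem[: len(stem) - len(rule.strip)]
        forms.add(base + rule.add)
    return forms
```

With these sample rules, an entry "there/X" expands to both "there" and "there's", which is the case the plain word list would miss.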