wooorm / dictionaries
Hunspell dictionaries in UTF-8
License: MIT License
Seems this one is missing.
This seems trivial, but because a lot is generated there are multiple ways to do it.
My proposal: add dictionaries/script/template/index.d.ts, and add index.d.ts to the list of requiredFiles in test.js.
Hello,
I'm having some trouble with the French dictionary: most invariable words are not checked correctly.
I opened the .dic file and found lines like:
voici 89
I could not find any documentation about this syntax. In dictionary-en there are no digits after the words.
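For illustration: a .dic entry generally consists of the word (optionally followed by /flags) and then optional whitespace-separated data fields, so a consumer can strip the trailing number. A minimal parsing sketch (parseDicLine is a hypothetical helper, not part of any of these packages):

```javascript
// Split a .dic entry into the bare word, its flags, and any trailing
// whitespace-separated data fields, so "voici 89" still yields "voici".
function parseDicLine(line) {
  const [wordAndFlags, ...fields] = line.split(/\s+/)
  const [word, flags = ''] = wordAndFlags.split('/')
  return {word, flags, fields}
}

console.log(parseDicLine('voici 89'))
// → { word: 'voici', flags: '', fields: [ '89' ] }
```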
While looking at the Hungarian dictionary, I found HTML entities in the .aff file.
REP Angström Ångström
Some of them are not real entities:
dictionaries/dictionaries/hu/index.aff
Line 101 in 5ee9325
I was not able to find any reference to Hunspell supporting HTML entities in .aff files.
By the way, thank you for maintaining these dictionaries.
"przypuszczać - przypuszczający"
https://sjp.pwn.pl/szukaj/przypuszczaj%C4%85cy.html
I have been attempting to convert these dictionaries to QtWebEngine format using Qt's qwebengine_convert_dict tool.
I was unable to convert the file el-polyton/index.bdic due to the following error:
Did not find a space in 'έψ εύσ'.
Most other dictionaries did build with the tool, which leads me to believe the fault may lie with el-polyton/index.aff, but as someone unfamiliar with Hunspell, I cannot tell whether it needs a space instead of the tab.
If so, it seems more useful to report it here than to just fix it at my end.
Could you extend the German dictionaries with those in here: de_dicts.zip? They are all in Hunspell format, but I have never created nor modified these dictionaries.
Note that the archive contains three subfolders:
1901: the so-called Old Rules;
1996: the so-called 1996 Reform;
2006: the so-called 2006 Reform.
Currently, de/index.dic contains only 75,767 words, whereas 2006/de_DE.dic contains 163,202 words.
I'm getting an error when I try to load the Spanish dictionary inside an Electron application:
import dictEs from 'dictionary-es'
import nspell from "nspell"
dictEs(ondictionary)
function ondictionary(err, dict) {
if (err) {
console.log(err);
throw err
}
var spell = nspell(dict);
}
Throws this error:
Uncaught Error: ENOENT, renderer\index.aff not found in C:\Users\LID-Mobile\Development\cccreator-desktop\node_modules\electron\dist\resources\electron.asar
at notFoundError (ELECTRON_ASAR.js:108)
at fs.readFile (ELECTRON_ASAR.js:536)
at one (index.js?e606:15)
at load (index.js?e606:11)
at Object.initDictionary (globalFunc.js?ff72:529)
at Store.updateLanguage (store.js?c0d6:204)
at wrappedMutationHandler (vuex.esm.js?2f62:714)
at commitIterator (vuex.esm.js?2f62:382)
at Array.forEach (<anonymous>)
at eval (vuex.esm.js?2f62:381)
I don't get the same error when loading the dictionary-en-us or dictionary-en-ca dictionaries.
Thoughts?
Hi,
Thanks for this easy-to-use, awesome library!
I've been using the English dictionary without issues, but the Swedish dictionary at dictionary-sv does not detect any errors; it sees all input as correct.
I looked at index.js and it looks the same as the one for English, so I cannot find the problem here and would appreciate any help.
The Swiss German (de-CH) dictionaries should not include Eszett (ß) characters. In Swiss German this special character is replaced by ss, as described here.
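The replacement described above can be sketched as follows (toSwissGerman is a hypothetical helper, shown only to illustrate the ß → ss rule):

```javascript
// Convert standard-German spellings to Swiss German orthography by
// replacing every Eszett (ß) with "ss".
function toSwissGerman(text) {
  return text.replace(/ß/g, 'ss')
}

console.log(toSwissGerman('Strauß')) // "Strauss"
```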
Hunspell reads the affix file byte by byte and decodes UTF-8 on demand. If it's not instructed to do so for flags, it doesn't. So non-ASCII characters like "ý" are treated as several characters, and due to another bug Hunspell silently takes just the first character and ignores the rest. So words can end up with unexpected flags.
Example: pt contains FORBIDDENWORD ý, and the perfectly valid word trabalhar/akYMjLÀÚ is treated as having this flag and thus considered misspelled.
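The failure mode described above can be sketched as follows (an illustration of the reported byte-truncation behavior, not Hunspell's actual source; truncatedFlag is a hypothetical helper):

```javascript
const encoder = new TextEncoder()

// Per the reported bug, a multi-byte UTF-8 flag is cut down to its
// first byte and the remaining bytes are silently ignored.
function truncatedFlag(flag) {
  return encoder.encode(flag)[0]
}

const forbidden = truncatedFlag('ý') // 0xC3, first byte of C3 BD
const flagsOfWord = [...'akYMjLÀÚ'].map(truncatedFlag)

// 'À' (C3 80) also starts with byte 0xC3, so after truncation the
// valid word appears to carry the FORBIDDENWORD flag.
console.log(flagsOfWord.includes(forbidden)) // true
```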
According to https://dictionary.cambridge.org/dictionary/english/onwards, 'onwards' is more common in British English, however:
dictionaries/dictionaries/en-GB/index.dic
Line 34383 in 4099de3
And actually it's got 'afterwards':
dictionaries/dictionaries/en-GB/index.dic
Line 11251 in 4099de3
...so I guess it's better to make them consistent?
I have been attempting to convert these dictionaries to QtWebEngine format using Qt's qwebengine_convert_dict tool.
I was unable to convert the file ko/index.bdic due to the following error:
Word does not match! - Index: 14081 - Expected: 김수한무거북이와두루미삼천갑자동방삭치치카포사리사리센타워리워리세브리캉무드셀라구름위허리케인에담벼락서생원에고양이고양이는바둑이바둑이는돌돌이 - Actual: 김수한무거북이와두루미삼천갑자동방� - ERROR converting, the dictionary does not check out OK.
Most other dictionaries did build with the tool, which leads me to believe the fault may lie with ko/index.bdic, but as someone unfamiliar with Hunspell, I cannot tell whether the Expected string should replace the Actual one.
If so, it seems more useful to report it here than to just fix it at my end.
I think this may be a lot of redundant work. I'm just not clear about the sources; this is a large collection of dictionaries, and I was looking for something like this. This project seems to source from https://extensions.openoffice.org/en/project/polish-dictionary-pack, but those were last edited in '08.
LibreOffice replaces those dictionaries and is another large repository of Hunspell files. It makes more sense to just use that -- that's what I'm doing for my project. It seems like you have similar aims, so it would make more sense to nuke this and pull in from there.
Maybe there is an advantage to the current setup; I'd be glad to know it.
We are seeing a difference in apostrophes between browser input and Word input.
The top picture shows the document as it was opened. You can see that all these properly spelled words are marked incorrect because of the apostrophe, but once I delete the apostrophe and type it back in, they are correct: " ’ " vs " ' ".
Word insert:
Browser insert:
We have attached the affix file: en_US.zip
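A workaround on the consumer side could be to normalize typographic apostrophes (U+2019) to the ASCII apostrophe (U+0027) before checking; a minimal sketch (normalizeApostrophes is a hypothetical helper, not part of the dictionary or spellchecker):

```javascript
// Replace every right single quotation mark (U+2019, as Word inserts)
// with the ASCII apostrophe (U+0027) the dictionary entries use.
function normalizeApostrophes(text) {
  return text.replace(/\u2019/g, "'")
}

console.log(normalizeApostrophes('don’t')) // "don't"
```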
The Hungarian dictionary, despite having UTF-8 encoding, doesn't contain the proper Hungarian characters such as ü, ű, á, í, ö, ó, ő, etc.
Example:
Ăźzenet/1 1
instead of
Üzenet/1 1
I tried to rule out my computer encoding it wrongly, but after re-encoding with Notepad++, and even setting the encoding manually to UTF-8 in Chrome, the issue still persists. Thus, in its current form, this dictionary is unusable by any spellchecker, because the special Latin-2 characters are all wrong.
I have been attempting to convert these dictionaries to QtWebEngine format using Qt's qwebengine_convert_dict tool.
I was unable to convert the file hu/index.bdic due to the following error:
Word does not match! - Index: 35768 - Expected: góóóóóóóóóóóóóóóóóóóóóóóóóóóóóóóól - Actual: góóóóóóóóóóóóóóóóóóóóóóóóóóóóóóóĂl - ERROR converting, the dictionary does not check out OK.
Most other dictionaries did build with the tool, which leads me to believe the fault may lie with hu/index.bdic, but as someone unfamiliar with Hunspell, I cannot tell whether it should be the Expected string.
If so, it seems more useful to report it here than to just fix it at my end.
I'm trying this package, but the console message is never shown.
const nspell = require('nspell')
const dictionaryPt = require('dictionary-pt')
function testSpell(txt){
dictionaryPt((error, pt) => {
if (error) throw error
var spell = nspell(pt)
console.log(spell.suggest(txt))
})
}
testSpell("Maquina")
dictionary-pt: "^3.1.0",
nspell: "^2.1.5"
I checked: the program gets stuck at the var spell = nspell(pt)
line. Any idea what's wrong?
Aloha there,
Can you add an Arabic dictionary, please?
Thanks in advance.
Nice work, thank you for maintaining this repo. The Lithuanian dictionary seems not to be UTF-8 encoded and gives errors. For example, checking each word of "Tęsti Žaidimą", which means "Continue Game", gives an error for every word.
I want to build a very simple translator using dictionaries, my concerned languages are English, French and Arabic.
Imagine the following dictionaries:
French
[bourgeois, brunette, contraire, ]
English
[bourgeois, brunette, contrary, ]
If there is an index between the meanings of terms, then I can map words easily.
Thanks a lot !
Can you add these files to the exports property in package.json? We use direct imports of these files to use them in the browser, but with the latest versions webpack fails with this error:
Module not found: Error: Package path ./index.dic is not exported from package \node_modules\dictionary-en (see exports field in \node_modules\dictionary-en\package.json)
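For reference, a sketch of what the requested exports entries might look like (assuming the package keeps index.js, index.aff, and index.dic at its root; the exact layout should be checked against the published package):

```json
{
  "exports": {
    ".": "./index.js",
    "./index.aff": "./index.aff",
    "./index.dic": "./index.dic",
    "./package.json": "./package.json"
  }
}
```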
I have used the Syncfusion spellchecker with Hunspell in ASP.NET Core, but when I tried to get suggestions for Swedish I could not receive any results, and the spell-checking function is not working properly either. When not using suggestions it works properly, and both the suggestion and checking functions work with other languages like the English variants, Russian, French, etc. So are there any limitations in the Swedish dictionary?
Because 5f2b26a introduced breaking changes, I think now would be the right time to convert to ESM.
The Ukrainian dictionary license situation seems to be a bit confusing.
It states that the dictionary files themselves are under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, and only the scripts building them are GPL-3.0.
I'll be the first to say that I'm not an expert in OSS licensing, but if I'm right, then a change of license in this project should be in order.
I looked for a paper that explains how Hunspell works, but without any success. I would like to know how Hunspell works, and especially how it makes suggestions. Does it use Levenshtein distance to look for the best suggestion?
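For the edit-distance part of the question, here is a minimal Levenshtein sketch. This is an illustration of the metric only, not Hunspell's actual algorithm; as far as I know, Hunspell's suggestion pipeline combines several heuristics (such as the TRY characters and REP pairs from the .aff file) rather than a single edit-distance pass.

```javascript
// Classic dynamic-programming Levenshtein distance: the minimum number
// of insertions, deletions, and substitutions turning string a into b.
function levenshtein(a, b) {
  const dp = Array.from({length: a.length + 1}, (_, i) =>
    Array.from({length: b.length + 1}, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  )
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      )
    }
  }
  return dp[a.length][b.length]
}

console.log(levenshtein('hunspell', 'hunspel')) // 1
```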
Do we have a way to use 'ё' instead of 'е' for ru where needed? E.g. 'актер', which is currently in the .dic file, should not be valid in accent-sensitive (case-sensitive) mode; it should be 'актёр'. I see that Portuguese has this issue figured out.
Hi there!
It looks like the suggest/correct methods don't handle some words in French, like "préavis".
In index.dic we can find: préavis po:nom is:mas is:inv
But console.log(spell.correct("preavis")) returns false.
Any idea how to fix this?
Hello there, and thanks for the work!
Would it be possible to also add Swahili dictionaries?
LibreOffice provides Hunspell dictionaries for Kenyan Swahili and Tanzanian Swahili:
https://extensions.libreoffice.org/en/extensions/show/swahili-dictionary
There are more places that provide these dictionaries, but I suppose they all have the same content:
https://addons.mozilla.org/en-US/firefox/addon/kiswahili-spell-checker/
https://cgit.freedesktop.org/libreoffice/dictionaries/tree/sw_TZ
https://github.com/elastic/hunspell/tree/master/dicts/sw
Best regards
Hi, it looks like some of the dictionaries have had the .dic and .aff files mixed up. Looking through crawl.sh, the generate method calls do look incorrect. I was wondering if this is on purpose or a typo. The affected dictionaries are...
I have tried to create a dictionary that would contain all (or most) Slovak words including all their forms, but I have failed, as I have never done it before.
The Slovak Academy of Sciences (Slovenska akademia vied, SAV) worked on a morphology analyser until approximately 2015, which contains 100 MB of data. Each word has a flag for what part of speech it is and what grammatical case it is in.
Some links (all in Slovak; if needed, I can translate them into English for you):
ma-2015-02-05.txt.xz
I could help you with the translation of the Slovak texts, with Slovak grammar, and with testing.
The current source for the Russian dictionaries was last updated a long time ago and no longer seems to be maintained.
Arch Linux's AUR ships a newer, updated version from the LibreOffice extension.
It would be great to use those dictionaries here too.
Problem:
Solution:
Should the data be exposed both as a .json file ("1mb of data") and as the normal file (1mb of data)?
import aff from './index.aff.json' assert {type: 'json'}
import dic from './index.dic.json' assert {type: 'json'}
export {aff, dic}
export const dictionary = {aff, dic}
export default dictionary
import aff from 'data:application/json,"..."' assert {type: 'json'}
import dic from 'data:application/json,"..."' assert {type: 'json'}
export {aff, dic}
export const dictionary = {aff, dic}
export default dictionary
However, that means:
If you don't mind me asking, where do the Russian dictionaries come from? Who created them and licensed them as LGPL-3.0?
We are facing an issue with the "en-GB" dictionary: we couldn't find the word "ability" in the .aff file, so the issue occurs for all the words related to "ability" through suffixes. Can you please provide the definition for it, or any other alternative solution?
Can I use this library to generate a list of words of a certain type, e.g. get all nouns, then all verbs, then all conjunctions, then all adjectives, ...?
Or is its only purpose spell checking (as presented in the example inside the readme)?
For some reason the Korean dictionary always returns true when used with nspell.
var dictionary = require('dictionary-ko');
var nspell = require('nspell');
dictionary(ondictionary);
function ondictionary(err, dict) {
if (err) {
throw err
}
var spell = nspell(dict);
console.log(spell.correct('hello'));
}
Hey 👋🏻 As you suggested, we should continue our conversation here on GitHub.
Disclaimer: I have almost no clue how Hunspell works, so please forgive me if this is a dumb question.
I'm using ReSpeller together with ReSharper in Visual Studio. The German phrase Zahlung gelöscht gets marked as misspelled, so I thought about adding these words to the German Hunspell dictionary.
In your README I found that the source for the German dictionary is j3e. On his page there is a little online spell checker and when entering my sentence Zahlung gelöscht, the result is:
Spellcheck result:
no errors found
And now I'm confused: within the German dictionary I don't find the word Zahlung, but I do find gelöscht:
I'd really appreciate your feedback!
Code:
var dictionary = require('dictionary-hu')
var nspell = require('nspell')
dictionary(ondictionary)
function ondictionary(err, dict) {
if (err) {
throw err
}
var spell = nspell(dict)
console.log(spell.correct('Szerelem'))
}
Error:
C:\Program Files\nodejs\node.exe .\test.js
Process exited with code 3221225477
When I run the following code
fs.readFileSync(path.join(base, 'index.dic'), 'utf-8');
fs.readFileSync(path.join(base, 'index.aff'), 'utf-8');
I get the following error. Please help:
fs.js:646
return binding.open(pathModule._makeLong(path), stringToFlags(flags), mode);
^
Error: ENOTDIR: not a directory, open '/home/deeven/Documents/eloquent javascript/work/node_modules/dictionary-en-us/index.js/index.dic'
at Object.fs.openSync (fs.js:646:18)
at Object.fs.readFileSync (fs.js:551:33)
at Object.<anonymous> (/home/deeven/Documents/eloquent javascript/work/nspell.js:7:4)
at Module._compile (module.js:653:30)
at Object.Module._extensions..js (module.js:664:10)
at Module.load (module.js:566:32)
at tryModuleLoad (module.js:506:12)
at Function.Module._load (module.js:498:3)
at Function.Module.runMain (module.js:694:10)
at startup (bootstrap_node.js:204:16)