Comments (25)
echo operidum|hfst-ospell tools/grammarcheckers/smj.zhfst -S
on Mac gives the correct suggestion as the eighth suggestion, whereas on Linux it is down around position twenty-something
from lang-smj.
Does the test require the suggestions to be in a specific order? That seems like a recipe for bogus fails. What is relevant is whether the correct suggestion is in the list or not, and possibly the position of the correct suggestion (the higher the better). Other than that, the order and number of suggestions should not be considered at all.
No, the test framework uses the output of divvun-checker, looking for the correct suggestion among the suggestions given by divvun-checker.
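A minimal sketch of the kind of check discussed here: rather than comparing whole suggestion lists, look up the position of the expected form in the output of hfst-ospell -S. The name suggestion_rank is hypothetical and not part of the test framework; the sketch assumes each suggestion is printed as a word followed by a numeric weight, as in the listings in this thread.

```sh
#!/bin/sh
# suggestion_rank WORD: read hfst-ospell -S style output on stdin and print
# the 1-based position of WORD among the suggestions, or 0 if it is absent.
# (Hypothetical helper; assumes "word weight" lines as shown in this thread.)
suggestion_rank() {
    awk -v want="$1" '
        # A suggestion line has exactly two fields, the second one numeric.
        NF == 2 && $2 ~ /^[0-9]+(\.[0-9]+)?$/ {
            rank++
            if ($1 == want) { print rank; found = 1; exit }
        }
        END { if (!found) print 0 }
    '
}

# Example:
# echo operidum | hfst-ospell tools/spellcheckers/smj.zhfst -S | suggestion_rank opereridum
```

A test could then accept any rank between 1 and some cutoff instead of depending on the exact order of equally weighted suggestions.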
Seems strange. I could have imagined tiny variations in floating point math between operating systems and processors or so, but a whole 0.4 is unexpected. I'd probably start by unzipping the zhfst files and diffing the hfst-fst2txt outputs, hoping they have simple differences only; otherwise it needs to be debugged, probably with debug prints at each step of the build in the compiler...
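One way to tell whether the dumps have simple differences only is to compare them with the weight column removed. The sketch below assumes that the hfst-fst2txt output is tab-separated AT&T format with the weight as the last column, and that the archive contains acceptor.default.hfst and errmodel.default.hfst; the helper names strip_weights and same_without_weights are made up for illustration. State numbers may also differ between builds, in which case this naive comparison reports a difference even for equivalent transducers.

```sh
#!/bin/sh
# Assumed preparation (not run here):
#   unzip -o smj.zhfst -d linux/
#   hfst-fst2txt -i linux/acceptor.default.hfst -o linux/acceptor.default.txt

# strip_weights FILE: print FILE without its trailing weight column, sorted.
strip_weights() {
    LC_ALL=C awk -F '\t' -v OFS='\t' '
        # Transition line: source, target, input, output[, weight].
        NF >= 4 { print $1, $2, $3, $4; next }
        # Final-state line: state[, weight].
        NF >= 1 { print $1 }
    ' "$1" | LC_ALL=C sort
}

# same_without_weights A B: succeed if A and B are equal once weights are ignored.
same_without_weights() {
    a=$(mktemp) && b=$(mktemp) || return 2
    strip_weights "$1" > "$a"
    strip_weights "$2" > "$b"
    cmp -s "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Example: same_without_weights linux/errmodel.default.txt mac/errmodel.default.txt
```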
There are tons of environment variables and probably other ways to change locale settings, and some UTF-8 locales might not have that in their name; the locale -ck charmap command will usually tell what it resolves to in typical programs.
I'll try to experiment with a more minimal example to see if this can be reproduced under Linux with , and . locales...
The failing test on Linux is:
[ 4/55][FAIL fp1] operidum:opereridum (, ()) => operidum:[opteridum, optieridum, superidum, tuperidum, duperidum, kuperidum, doneridum, operadume, hoveridum, oarridum] (typo)
This is the result on Mac:
[ 4/55][PASS tp] operidum:opereridum (, ()) => operidum:[opteridum, optieridum, kuperidum, duperidum, superidum, tuperidum, jieridum, opereridum, opteridam, åhkeridum] (typo)
Could it be that the speller package (lib-speller-something) is out of sync on Mac nightly and Linux nightly, since this is a difference in suggestions on a typo?
apt-cache show divvun-gramcheck
Package: divvun-gramcheck
Source: libdivvun
Version: 0.3.11+g563~e101aba9-1~jammy1
Architecture: amd64
Maintainer: Debian Science Team <[email protected]>
Installed-Size: 1587
Depends: libxml2-utils, libarchive13 (>= 3.0.4), libc6 (>= 2.34), libdivvun0 (>= 0.3.11+g563~e101aba9), libgcc-s1 (>= 3.3.1), libhfst55 (>= 3.16.0+g3882~0136e846), libhfstospell11 (>= 0.5.3+g381~9bed46c8), libpugixml1v5 (>= 1.4), libstdc++6 (>= 11)
Provides: libdivvun-bin, libdivvun-tools
Homepage: https://github.com/divvun/libdivvun
Priority: optional
Section: science
Filename: pool/main/libd/libdivvun/divvun-gramcheck_0.3.11+g563~e101aba9-1~jammy1_amd64.deb
Size: 332816
SHA256: eaf079c04167894a3cfbdc3a035cbb80a7dfd8c6067e8d9ec363353619af72d8
SHA1: 7fc49678a033301d47cd21514353fd96371180ad
MD5sum: 33ef45415af7e6be098652971a1a4ef8
Description: Grammar checker tools for Divvun languages
Helper tools for grammar checking for Divvun languages
Description-md5: 8151298b2db6426a3d1d4f55957d131a
hfst-ospell --version
gives this result
hfst-ospell --version
hfstospell 0.5.3
copyright (C) 2009 - 2018 University of Helsinki
on both machines
Git hash e101aba9 is the most recent version, from 3 days ago... but I don't think anyone has touched suggestion sorting in ages. Are the weights of the mismatching suggestions exactly the same? If so, it can easily happen that they end up in a different order under various circumstances, including different OSes' data structures and sort algorithms...
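If the mismatching suggestions do have identical weights, a reproducible order needs an explicit tie-breaker. A minimal sketch that post-processes a saved suggestion list; sort_suggestions is a hypothetical helper and does not describe how hfst-ospell or divvun-checker sort internally.

```sh
#!/bin/sh
# sort_suggestions: read "word weight" lines on stdin and order them by
# ascending weight, breaking ties by the byte order of the word, so that the
# result does not depend on the platform or on the current locale.
sort_suggestions() {
    LC_ALL=C sort -k2,2n -k1,1
}

# Example: sort_suggestions < suggestions.txt
```

This only makes the order deterministic. With many suggestions sharing one weight, the expected form may still fall outside a list truncated to ten entries.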
The weight on Linux is ~37.99, on Mac it is ~37.59
The number of typo-suggestions from divvun-checker seems to be truncated to ten results
The weight on Linux is ~37.99, on Mac it is ~37.59
Should they not be the same regardless of platform?
(But hfst-ospell's last commit was in June, hfst's last was in September, so why did this only happen now?)
I tested the smj.zhfst that I built on my Mac on my Linux machine. The weights are different (see the comment above), but the wordforms are the same.
❯ echo operidum|hfst-ospell ~/Viežžamat/smj.zhfst -S
"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum 27.590923
optieridum 31.590923
superidum 32.590923
tuperidum 32.590923
duperidum 32.590923
kuperidum 32.590923
doneridum 37.590923
operadume 37.590923
hoveridum 37.590923
oarridum 37.590923
vomeridum 37.590923
moveridum 37.590923
råhperidum 37.590923
gåhperidum 37.590923
noteridum 37.590923
voteridum 37.590923
poneridum 37.590923
poleridum 37.590923
apteridum 37.590923
exeridum 37.590923
roteridum 37.590923
opteridus 37.590923
opteridu 37.590923
dåhperidum 37.590923
poseridum 37.590923
rokeridum 37.590923
åhkeridum 37.590923
opteridam 37.590923
logeridum 37.590923
ageridum 37.590923
koseridum 37.590923
doseridum 37.590923
doteridum 37.590923
joderidum 37.590923
opteridup 37.590923
opteridim 37.590923
jieridum 37.590923
opereridum 37.590923
vs. the smj.zhfst built on the Linux box:
❯ echo operidum|hfst-ospell tools/spellcheckers/smj.zhfst -S
"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum 27.996086
optieridum 31.996086
superidum 32.996086
tuperidum 32.996086
duperidum 32.996086
kuperidum 32.996086
doneridum 37.996086
operadume 37.996086
hoveridum 37.996086
oarridum 37.996086
vomeridum 37.996086
moveridum 37.996086
råhperidum 37.996086
gåhperidum 37.996086
noteridum 37.996086
voteridum 37.996086
poneridum 37.996086
poleridum 37.996086
apteridum 37.996086
exeridum 37.996086
roteridum 37.996086
opteridus 37.996086
opteridu 37.996086
dåhperidum 37.996086
poseridum 37.996086
rokeridum 37.996086
åhkeridum 37.996086
opteridam 37.996086
logeridum 37.996086
ageridum 37.996086
koseridum 37.996086
doseridum 37.996086
doteridum 37.996086
joderidum 37.996086
opteridup 37.996086
opteridim 37.996086
jieridum 37.996086
opereridum 37.996086
This is the output of hfst-ospell on my Mac, with natively built .zhfst vs the one built on Linux:
Mac-built
❯ echo operidum|hfst-ospell tools/spellcheckers/smj.zhfst -S
"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum 27.590923
optieridum 31.590923
kuperidum 32.590923
duperidum 32.590923
superidum 32.590923
tuperidum 32.590923
jieridum 37.590923
opereridum 37.590923
Linux-built
❯ echo operidum|hfst-ospell ~/Downloads/smj.zhfst -S
"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum 27.996086
optieridum 31.996086
kuperidum 32.996086
duperidum 32.996086
superidum 32.996086
tuperidum 32.996086
jieridum 37.996086
opereridum 37.996086
The weights depend on where they were built and the wanted suggestion is way further down the list on Linux than on Mac.
I notice the weights on the Linux side are consistently about 0.4 higher than on the Mac (0.405163 in the listings above). The whole weight difference is strange; the math and the source code should be identical. @flammie do you have any ideas?
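The offset can be computed directly from two saved listings. A small sketch, assuming the hfst-ospell -S output for each build has been saved to a file; weight_offsets and the file names in the example are hypothetical.

```sh
#!/bin/sh
# weight_offsets FILE_A FILE_B: for every suggestion present in both
# listings, print the word and (weight in FILE_B - weight in FILE_A).
weight_offsets() {
    # LC_ALL=C keeps awk on a decimal point regardless of the user locale.
    LC_ALL=C awk '
        # Skip headers: keep only "word weight" lines.
        NF != 2 || $2 !~ /^[0-9]+(\.[0-9]+)?$/ { next }
        FILENAME == ARGV[1] { first[$1] = $2; next }
        ($1 in first) { printf "%s %.6f\n", $1, $2 - first[$1] }
    ' "$1" "$2"
}

# Example: weight_offsets mac.txt linux.txt
```

If every line shows the same offset, the difference is a constant shift in the weights rather than a reordering, which points at the build step that assigns weights rather than at hfst-ospell itself.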
After converting and diffing the files, the differences seem to be massive:
❯ wc -l linux/*.txt mac/*.txt *.diff
manually ordered output
1 933 236 linux/acceptor.default.txt
1 930 427 mac/acceptor.default.txt
3 863 469 acceptor.default.diff
2 171 024 linux/errmodel.default.txt
2 171 024 mac/errmodel.default.txt
3 633 755 errmodel.default.diff
Well, the error models seem to be the same size, which is kind of unfortunate, of course, since the acceptor is the one that has a 100-step build process. I don't know if there's any way to debug and bisect other than going through the process step by step and comparing; maybe it diverges at some obvious step... Since current HFST builds also use OpenFST as a library, there could be a version difference there too between Mac and Linux.
There is no version difference. The Linux and macOS builds both use external OpenFST and Foma, and the same versions of them.
The OpenFST x86 and x86_64 Linux builds use SSE math; without that, the HFST test suite failed. That should actually keep things more consistent, as it forces 64-bit floating point math everywhere; it would use the 80-bit x87 FPU otherwise. And a difference of about 0.4 is indeed rather big.
When building FSTs, LC_ALL affects the weights. On my Mac, echo $LC_ALL gives an empty line.
When the FSTs are built with an empty LC_ALL, hfst-ospell gives this list:
❯ echo operidum|hfst-ospell tools/spellcheckers/smj.zhfst -S
"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum 27.590923
optieridum 31.590923
kuperidum 32.590923
duperidum 32.590923
superidum 32.590923
tuperidum 32.590923
jieridum 37.590923
opereridum 37.590923
When I set export LC_ALL=C and then compile the FSTs, it gives this list:
❯ echo operidum|hfst-ospell tools/spellcheckers/smj.zhfst -S
"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum 27.656849
optieridum 31.656849
kuperidum 32.656849
duperidum 32.656849
superidum 32.656849
tuperidum 32.656849
jieridum 37.656849
opereridum 37.656849
LC_ALL is definitely important. I always ensure all machines have a UTF-8 locale. On Linux this is usually C.UTF-8 and on macOS en_US.UTF-8. The Greenlandic team's scripts check and error out if LC_ALL does not match the case-insensitive regex UTF-?8
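A guard of the kind described here could look roughly like the following. This is only a sketch built from the regex quoted above; the Greenlandic team's actual scripts are not shown in this thread, and is_utf8_locale is a hypothetical name.

```sh
#!/bin/sh
# is_utf8_locale VALUE: succeed if VALUE names a UTF-8 locale according to
# the case-insensitive regex UTF-?8.
is_utf8_locale() {
    printf '%s\n' "$1" | grep -Eiq 'UTF-?8'
}

# Example guard at the top of a build script:
# is_utf8_locale "${LC_ALL:-}" || { echo "LC_ALL must be a UTF-8 locale" >&2; exit 1; }
```

Note that an empty or unset LC_ALL fails this check, as does a plain C locale.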
I could guess there's at least one step there that reads or writes weights as floats and uses a decimal comma, and another part that expects a decimal point, or vice versa...
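To illustrate the failure mode guessed at here (this is not a confirmed diagnosis): a tool that expects a decimal point silently truncates a weight written with a decimal comma. The sketch uses awk only as a stand-in for such a tool; parse_weight is a hypothetical name.

```sh
#!/bin/sh
# parse_weight TEXT: convert TEXT to a number the way awk does in the C
# locale, i.e. using only the leading part that is valid with a decimal point.
parse_weight() {
    LC_ALL=C awk -v s="$1" 'BEGIN { print s + 0 }'
}

parse_weight 37.59   # prints 37.59
parse_weight 37,59   # prints 37: everything from the comma on is dropped
```

No error is raised in the second case, so a mismatch like this would show up only as shifted weights further down the pipeline.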
Looking at the process now, there's all the weighting in tools/spellers that uses a lot of coreutils. I could guess there are also many ways to diverge there, as we know even sort and uniq don't agree between Linux and macOS and across locales (I can't reproduce diffs between Linux C.utf8 and fi-FI.utf8, though).