
Comments (25)

albbas avatar albbas commented on June 22, 2024 1

echo operidum|hfst-ospell tools/grammarcheckers/smj.zhfst -S on Mac gives the correct suggestion as the eighth suggestion, whereas on Linux it is down in twenty-something

from lang-smj.

albbas avatar albbas commented on June 22, 2024 1

Does the test require the suggestions to be in a specific order? That seems like a recipe for bogus fails. What is relevant is whether the correct suggestion is in the list or not, and possibly the position of the correct suggestion (the higher the better). Other than that, the order and number of suggestions should not be considered at all.

No, the test framework uses the output of divvun-checker, looking for the correct suggestion among the suggestions given by divvun-checker.


flammie avatar flammie commented on June 22, 2024 1

Seems strange. I would have thought there could be tiny variations in floating-point math between operating systems and processors, but a whole 0.4 is unexpected. I'd probably start by unzipping the zhfst files and diffing the hfst-fst2txt outputs, hoping there are only simple differences; otherwise it needs to be debugged, probably with debug prints at each step of the build in the compiler...


flammie avatar flammie commented on June 22, 2024 1

There are tons of environment variables, and probably other ways, to change locale settings, and some UTF-8 locales might not have that in their name; the locale -ck charmap command will usually tell what it resolves to in typical programs.

I'll try to experiment with a more minimal example to see whether this can be reproduced under Linux with "," and "." locales...


albbas avatar albbas commented on June 22, 2024

The failing test on Linux is:
[ 4/55][FAIL fp1] operidum:opereridum (, ()) => operidum:[opteridum, optieridum, superidum, tuperidum, duperidum, kuperidum, doneridum, operadume, hoveridum, oarridum] (typo)


albbas avatar albbas commented on June 22, 2024

This is the result on Mac:
[ 4/55][PASS tp] operidum:opereridum (, ()) => operidum:[opteridum, optieridum, kuperidum, duperidum, superidum, tuperidum, jieridum, opereridum, opteridam, åhkeridum] (typo)


albbas avatar albbas commented on June 22, 2024

Could it be that the speller package (lib-speller-something) is out of sync on Mac nightly and Linux nightly, since this is a difference in suggestions on a typo?


albbas avatar albbas commented on June 22, 2024
apt-cache show divvun-gramcheck

Package: divvun-gramcheck
Source: libdivvun
Version: 0.3.11+g563~e101aba9-1~jammy1
Architecture: amd64
Maintainer: Debian Science Team <[email protected]>
Installed-Size: 1587
Depends: libxml2-utils, libarchive13 (>= 3.0.4), libc6 (>= 2.34), libdivvun0 (>= 0.3.11+g563~e101aba9), libgcc-s1 (>= 3.3.1), libhfst55 (>= 3.16.0+g3882~0136e846), libhfstospell11 (>= 0.5.3+g381~9bed46c8), libpugixml1v5 (>= 1.4), libstdc++6 (>= 11)
Provides: libdivvun-bin, libdivvun-tools
Homepage: https://github.com/divvun/libdivvun
Priority: optional
Section: science
Filename: pool/main/libd/libdivvun/divvun-gramcheck_0.3.11+g563~e101aba9-1~jammy1_amd64.deb
Size: 332816
SHA256: eaf079c04167894a3cfbdc3a035cbb80a7dfd8c6067e8d9ec363353619af72d8
SHA1: 7fc49678a033301d47cd21514353fd96371180ad
MD5sum: 33ef45415af7e6be098652971a1a4ef8
Description: Grammar checker tools for Divvun languages
 Helper tools for grammar checking for Divvun languages
Description-md5: 8151298b2db6426a3d1d4f55957d131a


albbas avatar albbas commented on June 22, 2024

hfst-ospell --version gives this result

hfst-ospell --version

hfstospell 0.5.3
copyright (C) 2009 - 2018 University of Helsinki

on both machines


flammie avatar flammie commented on June 22, 2024

Git hash e101aba9 is the most recent version, from 3 days ago... but I don't think anyone has touched suggestion sorting in ages. Are the weights of the mismatching suggestions exactly the same? If so, it can easily happen that they end up in a different order, depending on circumstances such as different OSes' data structures and sort algorithms...
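
The point about tied weights can be illustrated with a small Python sketch. It is not how hfst-ospell or divvun-checker order their output; `order_suggestions` is a hypothetical helper that sorts by weight and then by the word itself, so tied suggestions come out in the same order regardless of the order they arrived in. The words and weights are the tied entries from the hfst-ospell outputs quoted elsewhere in this thread.

```python
def order_suggestions(suggestions, limit=None):
    """Sort (word, weight) pairs by weight, then by word to break ties."""
    ranked = sorted(suggestions, key=lambda pair: (pair[1], pair[0]))
    return ranked if limit is None else ranked[:limit]


# The same four tied suggestions, in the two orders seen in this thread.
order_a = [("kuperidum", 32.590923), ("duperidum", 32.590923),
           ("superidum", 32.590923), ("tuperidum", 32.590923)]
order_b = [("superidum", 32.590923), ("tuperidum", 32.590923),
           ("duperidum", 32.590923), ("kuperidum", 32.590923)]

# With the tie-break applied, both inputs yield an identical ranking.
print(order_suggestions(order_a) == order_suggestions(order_b))
```

Python compares strings by code point, so this tie-break does not depend on the locale. The optional `limit` argument is there because a truncated list is where the order of tied entries starts to matter.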


snomos avatar snomos commented on June 22, 2024

Does the test require the suggestions to be in a specific order? That seems like a recipe for bogus fails. What is relevant is whether the correct suggestion is in the list or not, and possibly the position of the correct suggestion (the higher the better). Other than that, the order and number of suggestions should not be considered at all.
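
A minimal sketch of such an order-insensitive check, in Python. The function names and the `top_n` cut-off are assumptions made for illustration and are not taken from the actual test framework; the two suggestion lists are the ones from the failing Linux test and the passing Mac test in this thread.

```python
def suggestion_rank(expected, suggestions):
    """Return the 1-based position of expected, or None if it is absent."""
    try:
        return suggestions.index(expected) + 1
    except ValueError:
        return None


def check_suggestions(expected, suggestions, top_n=10):
    """Report whether expected occurs among the first top_n suggestions."""
    rank = suggestion_rank(expected, suggestions[:top_n])
    return ("found", rank) if rank is not None else ("missing", None)


linux = ["opteridum", "optieridum", "superidum", "tuperidum", "duperidum",
         "kuperidum", "doneridum", "operadume", "hoveridum", "oarridum"]
mac = ["opteridum", "optieridum", "kuperidum", "duperidum", "superidum",
       "tuperidum", "jieridum", "opereridum", "opteridam", "åhkeridum"]

print(check_suggestions("opereridum", linux))
print(check_suggestions("opereridum", mac))
```

Only membership and position of the wanted form are inspected; the order of the other suggestions plays no role.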


albbas avatar albbas commented on June 22, 2024

The weight on Linux is ~37.99, on Mac it is ~37,59


albbas avatar albbas commented on June 22, 2024

The number of typo-suggestions from divvun-checker seems to be truncated to ten results


unhammer avatar unhammer commented on June 22, 2024

The weight on Linux is ~37.99, on Mac it is ~37,59

Should they not be the same regardless of platform?

(But hfst-ospell's last commit was in June, hfst's last was in September, so why did this only happen now?)


albbas avatar albbas commented on June 22, 2024

I tested the smj.zhfst that I built on my Mac on my Linux machine. The weights are different (see the comment above), but the wordforms are the same.

❯ echo operidum|hfst-ospell ~/Viežžamat/smj.zhfst -S
"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum    27.590923
optieridum    31.590923
superidum    32.590923
tuperidum    32.590923
duperidum    32.590923
kuperidum    32.590923
doneridum    37.590923
operadume    37.590923
hoveridum    37.590923
oarridum    37.590923
vomeridum    37.590923
moveridum    37.590923
råhperidum    37.590923
gåhperidum    37.590923
noteridum    37.590923
voteridum    37.590923
poneridum    37.590923
poleridum    37.590923
apteridum    37.590923
exeridum    37.590923
roteridum    37.590923
opteridus    37.590923
opteridu    37.590923
dåhperidum    37.590923
poseridum    37.590923
rokeridum    37.590923
åhkeridum    37.590923
opteridam    37.590923
logeridum    37.590923
ageridum    37.590923
koseridum    37.590923
doseridum    37.590923
doteridum    37.590923
joderidum    37.590923
opteridup    37.590923
opteridim    37.590923
jieridum    37.590923
opereridum    37.590923

vs. the smj.zhfst built on the Linux box:

❯ echo operidum|hfst-ospell tools/spellcheckers/smj.zhfst -S
"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum    27.996086
optieridum    31.996086
superidum    32.996086
tuperidum    32.996086
duperidum    32.996086
kuperidum    32.996086
doneridum    37.996086
operadume    37.996086
hoveridum    37.996086
oarridum    37.996086
vomeridum    37.996086
moveridum    37.996086
råhperidum    37.996086
gåhperidum    37.996086
noteridum    37.996086
voteridum    37.996086
poneridum    37.996086
poleridum    37.996086
apteridum    37.996086
exeridum    37.996086
roteridum    37.996086
opteridus    37.996086
opteridu    37.996086
dåhperidum    37.996086
poseridum    37.996086
rokeridum    37.996086
åhkeridum    37.996086
opteridam    37.996086
logeridum    37.996086
ageridum    37.996086
koseridum    37.996086
doseridum    37.996086
doteridum    37.996086
joderidum    37.996086
opteridup    37.996086
opteridim    37.996086
jieridum    37.996086
opereridum    37.996086


albbas avatar albbas commented on June 22, 2024

This is the output of hfst-ospell on my Mac, with natively built .zhfst vs the one built on Linux:

Mac-built

❯ echo operidum|hfst-ospell tools/spellcheckers/smj.zhfst -S
"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum    27.590923
optieridum    31.590923
kuperidum    32.590923
duperidum    32.590923
superidum    32.590923
tuperidum    32.590923
jieridum    37.590923
opereridum    37.590923

Linux-built

❯ echo operidum|hfst-ospell ~/Downloads/smj.zhfst -S
"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum    27.996086
optieridum    31.996086
kuperidum    32.996086
duperidum    32.996086
superidum    32.996086
tuperidum    32.996086
jieridum    37.996086
opereridum    37.996086


albbas avatar albbas commented on June 22, 2024

The weights depend on where they were built and the wanted suggestion is way further down the list on Linux than on Mac.


snomos avatar snomos commented on June 22, 2024

I notice the weights on the Linux side are consistently 0.5 higher than on the Mac. The whole weight difference is strange; the math and the source code should be identical. @flammie do you have any ideas?
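
Whether the offset really is constant can be checked mechanically. The Python sketch below parses two hfst-ospell outputs and collects the distinct per-word weight differences; the helper names are hypothetical, and the sample text is a subset of the Mac-built and Linux-built outputs quoted in this thread. On those figures the offset works out to about 0.405 for every word.

```python
def parse_corrections(text):
    """Map each suggested word to its weight, skipping non-data lines."""
    weights = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # header lines have more than two fields
        word, raw = parts
        try:
            weights[word] = float(raw)
        except ValueError:
            continue
    return weights


def weight_offsets(a, b):
    """Return the set of distinct differences b - a over shared words."""
    return {round(b[w] - a[w], 6) for w in a.keys() & b.keys()}


mac_text = '''"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum    27.590923
optieridum    31.590923
jieridum    37.590923
opereridum    37.590923
'''
linux_text = mac_text.replace(".590923", ".996086")

offsets = weight_offsets(parse_corrections(mac_text),
                         parse_corrections(linux_text))
print(offsets)
```

A single value in the resulting set means every weight was shifted by the same amount, which points at one constant term in the build rather than at per-arc rounding differences.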


albbas avatar albbas commented on June 22, 2024

After converting and diffing the files, the differences seem to be massive:

❯ wc -l linux/*.txt mac/*.txt *.diff

manually ordered output

 1 933 236 linux/acceptor.default.txt
 1 930 427 mac/acceptor.default.txt
 3 863 469 acceptor.default.diff
 
 2 171 024 linux/errmodel.default.txt
 2 171 024 mac/errmodel.default.txt
 3 633 755 errmodel.default.diff


flammie avatar flammie commented on June 22, 2024

Well, the error models seem to be the same size, which is kind of unfortunate of course, since the acceptor is the one that has a 100-step build process. I don't know if there's any way to debug and bisect other than going through the process step by step and comparing; maybe it diverges at some obvious step... I think, since current HFST builds also use OpenFST as a library, there can be a version difference there too between Mac and Linux.


TinoDidriksen avatar TinoDidriksen commented on June 22, 2024

There is no version difference. Linux and macOS builds both use external OpenFST and Foma, and same version of them.

The OpenFST x86 and x86_64 Linux builds are with SSE math - without that, the HFST test suite failed. And it should actually keep things more consistent, as it forces 64 bit floating point math everywhere - it would use the 80 bit x87 FPU otherwise. And 0.5 is indeed a rather big difference.


albbas avatar albbas commented on June 22, 2024

When building fsts, LC_ALL affects the weights. On my Mac, echo $LC_ALL gives an empty line

When fsts are built using empty LC_ALL, hfst-ospell then gives this list:

❯ echo operidum|hfst-ospell tools/spellcheckers/smj.zhfst -S

"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum    27.590923
optieridum    31.590923
kuperidum    32.590923
duperidum    32.590923
superidum    32.590923
tuperidum    32.590923
jieridum    37.590923
opereridum    37.590923

When I set export LC_ALL=C, then compile the fsts, it gives this list:

echo operidum|hfst-ospell tools/spellcheckers/smj.zhfst -S

"operidum" is NOT in the lexicon:
Corrections for "operidum":
opteridum    27.656849
optieridum    31.656849
kuperidum    32.656849
duperidum    32.656849
superidum    32.656849
tuperidum    32.656849
jieridum    37.656849
opereridum    37.656849


TinoDidriksen avatar TinoDidriksen commented on June 22, 2024

LC_ALL is definitely important. I always ensure all machines have a UTF-8 locale. On Linux this is usually C.UTF-8 and on macOS en_US.UTF-8. The Greenlandic team's scripts check and error out if LC_ALL does not match icase regex UTF-?8
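
A guard of the kind described could look roughly like the Python sketch below. The Greenlandic team's actual scripts are not shown in this thread, so the function name and the way the check is wired up are assumptions; only the case-insensitive pattern UTF-?8 comes from the description above.

```python
import os
import re

# Case-insensitive match for "UTF-8" or "UTF8" anywhere in the value.
UTF8_PATTERN = re.compile(r"UTF-?8", re.IGNORECASE)


def locale_is_utf8(environ=None):
    """Return True if LC_ALL is set and names a UTF-8 locale."""
    env = os.environ if environ is None else environ
    return bool(UTF8_PATTERN.search(env.get("LC_ALL", "")))


if __name__ == "__main__":
    if not locale_is_utf8():
        raise SystemExit("LC_ALL must be a UTF-8 locale, e.g. C.UTF-8")
```

An empty or unset LC_ALL fails the check, as does plain C, which covers both build settings tried earlier in this thread.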


flammie avatar flammie commented on June 22, 2024

I could guess there's at least one step there that reads or writes weights as floats and uses a decimal comma, and another part that expects a decimal dot, or vice versa...
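
To show how such a mismatch could silently change a weight, here is a small Python sketch. It is purely illustrative and has not been traced to any step of the actual build: `parse_like_strtod` is a hypothetical stand-in for a reader that, like C's strtod in the C locale, consumes the longest leading dot-decimal number and ignores whatever follows.

```python
import re

# Leading optional sign, digits, and an optional dot-separated fraction.
_LEADING_NUMBER = re.compile(r"[+-]?\d+(?:\.\d+)?")


def parse_like_strtod(text):
    """Read the longest leading dot-decimal number; ignore the rest."""
    match = _LEADING_NUMBER.match(text.strip())
    return float(match.group()) if match else 0.0


# A value written with a decimal comma loses its fraction when read
# by a parser that only understands a decimal dot.
print(parse_like_strtod("37,59"))
print(parse_like_strtod("37.59"))
```

No error is raised in the comma case; the value is simply truncated, which is why this kind of mismatch can go unnoticed until the weights are compared.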


flammie avatar flammie commented on June 22, 2024

Looking at the process now, all the weighting in tools/spellers uses a lot of coreutils. I could guess there are also many ways to diverge there, as we know even sort and uniq don't agree between Linux and macOS and across locales (I can't reproduce diffs between Linux C.utf8 and fi-FI.utf8, though).
