This issue was created automatically with bugzilla2github Bugzilla

Affixes propernouns test wants lowercase letters (

albbas commented on June 16, 2024

Affixes propernouns test wants lowercase letters (

from lang-smj.

Comments (17)

albbas commented on June 16, 2024

Comment 9508

Date: 2014-06-24 11:52:16 +0200
From: Inga Lill Sigga Mikkelsen <<inga.l.mikkelsen>>

LEXC test 3: analyser-gt-norm.xfst + affixes/propernouns.lexc - 514/578/1092 FAIL

[1/3][PASS] Tjuorri+N+Prop+Ani+Sg+Nom => Tjuorri
[1/3][FAIL] Tjuorri+N+Prop+Ani+Sg+Nom => Unexpected results: tjuorri
[2/3][PASS] Tjuorri+N+Prop+Ani+Sg+Ill => Tjuorrij
[2/3][FAIL] Tjuorri+N+Prop+Ani+Sg+Ill => Unexpected results: tjuorrij
[3/3][PASS] Tjuorri+N+Prop+Ani+Sg+Ela => Tjuorris
[3/3][FAIL] Tjuorri+N+Prop+Ani+Sg+Ela => Unexpected results: tjuorris

The regular yaml test works as it should with respect to capital letters for proper nouns. The test in the affix file also wants proper nouns with a lowercase initial letter.
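To separate this symptom from other failures in a long test report, the PASS/FAIL lines can be paired up mechanically. The following is a minimal Python sketch, assuming the report format shown above and assuming that the PASS line for an analysis appears before its FAIL line; the function name `lowercase_initial_failures` is illustrative and not part of the test tooling.

```python
import re

# Line shapes as they appear in the report quoted above.
PASS_RE = re.compile(r"\[PASS\]\s+(\S+)\s+=>\s+(\S+)")
FAIL_RE = re.compile(r"\[FAIL\]\s+(\S+)\s+=>\s+Unexpected results:\s+(\S+)")


def lowercase_initial_failures(report):
    """Return (analysis, expected, got) where got is expected with a lowercased initial."""
    expected = {}
    hits = []
    for line in report.splitlines():
        passed = PASS_RE.search(line)
        if passed:
            analysis, form = passed.groups()
            expected[analysis] = form
            continue
        failed = FAIL_RE.search(line)
        if failed:
            analysis, form = failed.groups()
            want = expected.get(analysis)
            if want and form != want and form == want[0].lower() + want[1:]:
                hits.append((analysis, want, form))
    return hits


report = """\
[1/3][PASS] Tjuorri+N+Prop+Ani+Sg+Nom => Tjuorri
[1/3][FAIL] Tjuorri+N+Prop+Ani+Sg+Nom => Unexpected results: tjuorri
[2/3][PASS] Tjuorri+N+Prop+Ani+Sg+Ill => Tjuorrij
[2/3][FAIL] Tjuorri+N+Prop+Ani+Sg+Ill => Unexpected results: tjuorrij
"""
for analysis, want, got in lowercase_initial_failures(report):
    print(f"{analysis}: expected {want}, also generated {got}")
```

Failures of another kind (for example "Missing results" lines) are ignored by this sketch and would still need to be read by hand.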


albbas commented on June 16, 2024

Comment 9509

Date: 2014-06-24 11:53:24 +0200
From: Inga Lill Sigga Mikkelsen <<inga.l.mikkelsen>>

Sjur is the one who has worked on these tests, so I am changing "Assigned to" to Sjur.


albbas commented on June 16, 2024

Comment 9515

Date: 2014-06-30 10:09:56 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

The tests show that the fst generates word forms that should not be generated (lowercase initial letter in names). So the error is in the fst, not in the test setup.

It looks as if all names incorrectly get a lowercase first letter. In addition, some names are tagged incorrectly:

Test 131: Bieva (Lexical/Generation)

[1/3][FAIL] Bieva+N+Prop+Plc+Sg+Nom => Missing results: Bieva
[1/3][FAIL] Bieva+N+Prop+Plc+Sg+Nom => Unexpected results: Bieva+N+Prop+Plc+Sg+Nom+?
[2/3][FAIL] Bieva+N+Prop+Plc+Sg+Ill => Missing results: Bievijda
[2/3][FAIL] Bieva+N+Prop+Plc+Sg+Ill => Unexpected results: Bieva+N+Prop+Plc+Sg+Ill+?
[3/3][FAIL] Bieva+N+Prop+Plc+Sg+Ela => Missing results: Bievijs
[3/3][FAIL] Bieva+N+Prop+Plc+Sg+Ela => Unexpected results: Bieva+N+Prop+Plc+Sg+Ela+?

Test 131 - Passes: 0, Fails: 3, Total: 3
=========== VS: ============
$ lookup -q -flags mbTT src/analyser-gt-norm.xfst
Bieva
Bieva Bieva+N+Prop+Plc+Pl+Nom
Bieva Bieva+N+Prop+Plc+Pl+Nom

That is, Sg vs. Pl.

Since all names are affected, it looks like an error in the build rather than an error in the lexc or twolc code.


albbas commented on June 16, 2024

Comment 9517

Date: 2014-06-30 16:33:28 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

Trond and I looked at these today, but could not figure it out. It affects all fsts in smj (except -raw-), and it affects only smj, no other languages.

It is really mysterious, because the build steps are the same for all languages. I checked the parts that are specific to smj, but there was nothing there; the problem is also present in the temporary fsts that are built the same way for all languages, before the language-specific processing starts.

Inga:
I cannot do anything more with this now before I go on holiday, but it would be good if you fixed the other errors in the proper files, e.g. the Sg instead of Pl issue.


albbas commented on June 16, 2024

Comment 9544

Date: 2014-08-06 13:40:24 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

I hand this over to Tomi, and add myself and Trond to the Cc list. Also changed Product and Component.


albbas commented on June 16, 2024

Comment 9545

Date: 2014-08-06 15:16:55 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

Inga's testing has been done using Xerox, but Hfst has some issues as well, as it turns out. It might be a hint at what the problem is:

$ hfst-lookup src/analyser-gt-desc.hfst

oslo
hfst-lookup: warning: Got infinite results, number of cycles limited to 5
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results

$ hfst-lookup src/analyser-gt-desc.hfst

Oslo
hfst-lookup: warning: Got infinite results, number of cycles limited to 5
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results

That is, something is inserting an unlimited number of optional symbols into the analysing fst, causing it to fail as above.

Generation works fine:

$ hfst-lookup src/generator-gt-desc.hfst

Oslo+N+Prop+Plc+Sg+Nom
Oslo+N+Prop+Plc+Sg+Nom Oslo 0,000000
Oslo+N+Prop+Plc+Sg+Nom Oslo 0,000000

Oslo+N+Prop+Plc+Sg+Acc
Oslo+N+Prop+Plc+Sg+Acc Oslov 0,000000
Oslo+N+Prop+Plc+Sg+Acc Oslov 0,000000


albbas commented on June 16, 2024

Comment 9549

Date: 2014-08-07 09:05:28 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

Forget comment #5, I had local modifications that destroyed some of the results in unpredictable ways. The following is produced with clean data directly from the svn repository, using svn revision 98060:

Xerox:

$ lookup src/analyser-gt-desc.xfst
Oslo
Oslo Oslo +N+Prop+Plc+Pl+Nom
Oslo Oslo +N+Prop+Plc+Sg+Nom
Oslo Oslo +N+Prop+Plc+Sg+Gen
Oslo Oslo +N+Prop+Plc+Pl+Nom
Oslo Oslo +N+Prop+Plc+Sg+Nom
Oslo Oslo +N+Prop+Plc+Sg+Gen

oslo
oslo Oslo +N+Prop+Plc+Pl+Nom
oslo Oslo +N+Prop+Plc+Sg+Nom
oslo Oslo +N+Prop+Plc+Sg+Gen

^C
$ lookup src/generator-gt-desc.xfst
Oslo+N+Prop+Plc+Pl+Nom
Oslo+N+Prop+Plc+Pl+Nom oslo
Oslo+N+Prop+Plc+Pl+Nom Oslo

That is, Xerox has the bug with lower-case initial letters in proper nouns.

Hfst:

$ hfst-lookup src/analyser-gt-desc.hfst

Oslo
Oslo Oslo+N+Prop+Plc+Pl+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Gen 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Pl+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Gen 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Nom 0,000000

oslo
oslo oslo+? inf

^C
$ hfst-lookup src/generator-gt-desc.hfst
Oslo+N+Prop+Plc+Pl+Nom
Oslo+N+Prop+Plc+Pl+Nom Oslo 0,000000
Oslo+N+Prop+Plc+Pl+Nom Oslo 0,000000

^C

That is, everything works as expected using Hfst.

The questions are:

why only smj?
why only Xerox?


albbas commented on June 16, 2024

Comment 9550

Date: 2014-08-07 10:47:13 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

The bug is related to the regex for handling downcasing of derived proper strings. If I remove the following line from am-shared/src-dir-include.am:

		.o. @\"orthography/downcase-derived_proper-strings.hfst\" \

both Hfst and Xerox behave properly.

Hfst:

$ hfst-lookup analyser-gt-desc.hfst

oslo
oslo oslo+? inf

Oslo
Oslo Oslo+N+Prop+Plc+Pl+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Gen 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Pl+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Gen 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Nom 0,000000

^C
$ hfst-lookup generator-gt-desc.hfst
Oslo+N+Prop+Plc+Pl+Nom
Oslo+N+Prop+Plc+Pl+Nom Oslo 0,000000
Oslo+N+Prop+Plc+Pl+Nom Oslo 0,000000

^C

Xerox:

$ lookup analyser-gt-desc.xfst
Oslo
Oslo Oslo +N+Prop+Plc+Pl+Nom
Oslo Oslo +N+Prop+Plc+Sg+Gen
Oslo Oslo +N+Prop+Plc+Sg+Nom
Oslo Oslo +N+Prop+Plc+Pl+Nom
Oslo Oslo +N+Prop+Plc+Sg+Gen
Oslo Oslo +N+Prop+Plc+Sg+Nom

oslo
oslo oslo +?

^C
$ lookup generator-gt-desc.xfst
Oslo+N+Prop+Plc+Pl+Nom
Oslo+N+Prop+Plc+Pl+Nom Oslo
Oslo+N+Prop+Plc+Pl+Nom Oslo

That is, without the downcasing of derived proper nouns, everything is working ok for both Xerox and Hfst. But when adding the downcasing, things break for Xerox but not for Hfst.


albbas commented on June 16, 2024

Comment 9551

Date: 2014-08-07 10:59:12 +0200
From: Trond Trosterud <<trond.trosterud>>

The first question remains: Why only smj?


albbas commented on June 16, 2024

Comment 9552

Date: 2014-08-07 11:40:49 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

(In reply to comment #8)

The first question remains: Why only smj?

Both questions remain, and we have a third one:

How can we get downcasing of proper nouns to work properly and identically for both Xerox and Hfst, without affecting analysis of regular proper nouns? And preferably by using exactly the same build instructions for both Xerox and Hfst.


albbas commented on June 16, 2024

Comment 9560

Date: 2014-08-20 12:36:08 +0200
From: Tomi Pieski <<tomi.k.pieski>>

In the affix file for smj proper nouns, the obligatory uppercase strings also need to be marked with @U.Cap.Obl@.

The creation of proper nouns has two paths, one with obligatory initial uppercase and one with optional initial uppercase:

@U.Cap.Obl@ ProperNoun ; ! These flags are for
@U.Cap.Opt@ ProperNoun ; ! downcasing the propernouns

The Oslo example is directed to the ACCRA-plc lexicon:

LEXICON ACCRA-plc !!= @code@ Vowel-final names where caseendings are added directly, no cg. Place names
+N+Prop+Plc:%> ACCRADECL-PLC ;
@U.Cap.Opt@+N+Prop+Plc+Sg+Gen:%>@U.Cap.Opt@ VUONAK ;

LEXICON ACCRADECL-PLC
! These sublexica are irrelevant for ACCRA, but added
! for the sake of the lexicon MARJA
ACCRA-DC ;
@U.Cap.Opt@+Err/Sub:@U.Cap.Opt@ LASJ ; !

LEXICON ACCRA-DC
+Sg+Nom: K ;
+Sg+Nom: RHyph ;
+Ess:n K ;
+Sg+Ill:Q4j K ; !Q4 to get e:i
+Sg+Gen: K ;
+Sg+Gen: RHyph ;
+Sg+Acc:v K ;
+Sg+Ine:n K ;
+Sg+Ela:s K ;
+Pl+Nom: K ;
+Pl+Gen:Q4j K ; !Q4 to get e:i
+Pl+Gen:Q4j K ; !Q4 to get e:i
+Pl+Acc:Q4jt K ;
+Pl+Ill:Q4jda K ;
+Pl+Ine:Q4jn K ;
+Pl+Ela:Q4js K ;
+Pl+Com:Q4j K ; !Q4 to get e:i
+Sg+Com:Q4jn K ;
+Sg+Com+Err/Sub:X5jn K ; ! Norgejn

In the above example, the @U.Cap.Opt@ path with 'regular' cases is not filtered out of the fst, since there is no @U.Cap.Obl@ flag.
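The filtering follows from how unification flags behave: @U.Cap.X@ succeeds when Cap is unset or already has the value X, and fails otherwise. Below is a minimal Python sketch of that check. It models only U-type flags, and the function name `path_survives` and the symbol sequences are illustrative assumptions, not taken from the smj sources.

```python
import re

# Matches unification flag diacritics such as @U.Cap.Obl@ (U-type only).
U_FLAG = re.compile(r"@U\.([^.@]+)\.([^.@]+)@")


def path_survives(symbols):
    """Return True if the U-flags along a path unify without conflict."""
    features = {}
    for symbol in symbols:
        match = U_FLAG.fullmatch(symbol)
        if match is None:
            continue  # ordinary symbol, not a U-flag
        feature, value = match.groups()
        if features.setdefault(feature, value) != value:
            return False  # same feature already set to another value
    return True


# Optional-case entry point followed by a continuation entry without any flag:
# nothing conflicts, so the path stays in the fst.
unflagged = ["@U.Cap.Opt@", "O", "s", "l", "o", "+Sg+Nom"]
# Same entry point, but the continuation entry carries @U.Cap.Obl@.
flagged = ["@U.Cap.Opt@", "O", "s", "l", "o", "@U.Cap.Obl@", "+Sg+Nom"]
# Derivation entry marked @U.Cap.Opt@, as in the LASJ line above.
derived = ["@U.Cap.Opt@", "O", "s", "l", "o", "@U.Cap.Opt@", "+Der/lasj"]

print(path_survives(unflagged))  # True
print(path_survives(flagged))    # False
print(path_survives(derived))    # True
```

Under this reading, an unflagged case entry is compatible with both entry points, which is why the optional-case path survives for regular case forms.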


albbas commented on June 16, 2024

Comment 9566

Date: 2014-08-27 15:30:36 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

The build system is working as it should, and the basic flag setup is correct, as demonstrated by the following:

$ lookup -q src/generator-gt-desc.xfst
Vuolleednama+N+Prop+Plc+Der/lasj+N+Sg+Nom
Vuolleednama+N+Prop+Plc+Der/lasj+N+Sg+Nom vuolleednamlasj

The problem is that several/many of the continuation lexicons for proper nouns are lacking the required flag(s).

The job is thus to go through the file src/morphology/affixes/propernouns.lexc, and make sure all lexicons are set up as e.g. EATNAMAT-plc (the one for Vuolleednama), which works as intended.
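A first pass over the file could be automated before reviewing lexicons by hand. The sketch below is a rough Python heuristic, not part of the build system: it assumes that every entry with an upper:lower pair should mention a @U.Cap.Obl@ or @U.Cap.Opt@ flag, it treats everything after `!` as a comment (so it would mishandle an escaped `%!`), and the function name `entries_without_cap_flag` is made up for illustration.

```python
import re

LEXICON_RE = re.compile(r"^LEXICON\s+(\S+)")
CAP_FLAG_RE = re.compile(r"@U\.Cap\.(?:Obl|Opt)@")


def entries_without_cap_flag(lexc_text):
    """Map each lexicon name to its entry lines that carry no @U.Cap.*@ flag."""
    missing = {}
    current = None
    for raw in lexc_text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop comments (assumption: no %!)
        if not line:
            continue
        header = LEXICON_RE.match(line)
        if header:
            current = header.group(1)
            continue
        is_entry = current is not None and line.endswith(";") and ":" in line
        if is_entry and not CAP_FLAG_RE.search(line):
            missing.setdefault(current, []).append(line)
    return missing


sample = """\
LEXICON ACCRADECL-PLC
 ACCRA-DC ;
 @U.Cap.Opt@+Err/Sub:@U.Cap.Opt@ LASJ ; !
LEXICON ACCRA-DC
 +Sg+Nom: K ;
 +Sg+Acc:v K ; ! accusative
"""
for name, lines in entries_without_cap_flag(sample).items():
    print(name, len(lines))  # ACCRA-DC 2
```

The report is only a list of candidates: some unflagged entries may be correct as they are, so each hit still needs to be compared against a lexicon known to work, such as EATNAMAT-plc.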

This is tested against svn revision 98779.


albbas commented on June 16, 2024

Comment 9567

Date: 2014-08-27 15:42:53 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

Actually, the main reason is that most derivations are Sub-marked, e.g. all LASJ derivations from placenames. I have removed some of the sub-marking for cases where the yaml tests say that they should be accepted by the normative transducer.


albbas commented on June 16, 2024

Comment 9568

Date: 2014-08-27 16:56:34 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

Handing this one over to Sandra.


albbas commented on June 16, 2024

Comment 9569

Date: 2014-08-28 08:15:55 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

One more observation:

-lasj is Sub-marked in its own lexicon:

LEXICON LASJ !SUB !+CmpN/SgN +CmpN/SgG +CmpN/PlG !from placenames
+Der1+Der/lasj+N+Sg+Nom+Err/Sub:»lasj K ;
+Der1+Der/lasj+N+Err/Sub:»ladtja GÅNÅGIS-EVEN ;

This should probably be reevaluated, since subbing all lasj derivations is probably too strong. I will remove the sub-tagging now, but leave a note for future work.


albbas commented on June 16, 2024

Comment 9575

Date: 2014-09-11 22:49:35 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

This comment is only indirectly related to this bug, but it still is worth mentioning. "oslolasj" type derivations (with initial lowercase letter) have not been working with hfst, because of a bug / xerox incompatibility related to composition with flag diacritics. In the very latest hfst code this is now fixed:

$ hfst-lookup src/analyser-gt-desc.hfst

Oslo
Oslo Oslo+N+Prop+Plc+Pl+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Gen 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Nom 0,000000

oslo
oslo oslo+? inf

oslolasj
oslolasj Oslo+N+Prop+Plc+Der/lasj+N+Sg+Nom 0,000000

Oslolasj
Oslolasj Oslo+N+Prop+Plc+Der/lasj+N+Sg+Nom 0,000000

That is, Xerox and Hfst behave the same.

NB! This really requires the very latest hfst code, as of today.


albbas commented on June 16, 2024

Comment 9610

Date: 2014-10-07 08:21:19 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

As of Hfst 3.8 we have an official release with all required bug fixes included. Summary after all the back and forth:

Xerox build instructions are correct
Hfst build instructions are correct
Xerox and Hfst build instructions are identical on all relevant points
Proper nouns behave as they should given correct flag diacritics and no sub-tagging

Sandra, you can close this bug as fixed as soon as you have verified that all proper noun continuation lexicons have the correct flag diacritics.
