Comments (17)
Comment 9508
Date: 2014-06-24 11:52:16 +0200
From: Inga Lill Sigga Mikkelsen <<inga.l.mikkelsen>>
LEXC test 3: analyser-gt-norm.xfst + affixes/propernouns.lexc - 514/578/1092 FAIL
[1/3][PASS] Tjuorri+N+Prop+Ani+Sg+Nom => Tjuorri
[1/3][FAIL] Tjuorri+N+Prop+Ani+Sg+Nom => Unexpected results: tjuorri
[2/3][PASS] Tjuorri+N+Prop+Ani+Sg+Ill => Tjuorrij
[2/3][FAIL] Tjuorri+N+Prop+Ani+Sg+Ill => Unexpected results: tjuorrij
[3/3][PASS] Tjuorri+N+Prop+Ani+Sg+Ela => Tjuorris
[3/3][FAIL] Tjuorri+N+Prop+Ani+Sg+Ela => Unexpected results: tjuorris
The regular yaml test works as it should with regard to capital letters for proper nouns. The test in the affix file, however, also gets proper nouns with a lowercase initial letter.
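One way to triage a log like the one above is to pick out the failures that differ from the expected form only in the case of the first letter. The following is a minimal Python sketch, not part of the test suite. The function name `case_only_failures` is made up for illustration, and the code assumes the log line format shown above, with each PASS line appearing before the FAIL line for the same analysis.

```python
import re

# Matches lines such as:
#   [1/3][FAIL] Tjuorri+N+Prop+Ani+Sg+Nom => Unexpected results: tjuorri
LINE = re.compile(r"\[\d+/\d+\]\[(PASS|FAIL)\] (\S+) => (.+)")
UNEXPECTED = "Unexpected results: "


def case_only_failures(log_lines):
    """Return (analysis, expected, unexpected) triples where the unexpected
    form equals the expected form with a lowercased first letter."""
    passed = {}  # analysis -> set of forms reported as PASS
    hits = []
    for line in log_lines:
        m = LINE.match(line.strip())
        if m is None:
            continue  # not a test result line
        status, analysis, rest = m.groups()
        if status == "PASS":
            passed.setdefault(analysis, set()).add(rest)
        elif rest.startswith(UNEXPECTED):
            form = rest[len(UNEXPECTED):]
            for ok in sorted(passed.get(analysis, ())):
                if form != ok and form == ok[:1].lower() + ok[1:]:
                    hits.append((analysis, ok, form))
    return hits


log = [
    "[1/3][PASS] Tjuorri+N+Prop+Ani+Sg+Nom => Tjuorri",
    "[1/3][FAIL] Tjuorri+N+Prop+Ani+Sg+Nom => Unexpected results: tjuorri",
    "[2/3][PASS] Tjuorri+N+Prop+Ani+Sg+Ill => Tjuorrij",
    "[2/3][FAIL] Tjuorri+N+Prop+Ani+Sg+Ill => Unexpected results: tjuorrij",
]
for analysis, expected, unexpected in case_only_failures(log):
    print(analysis, expected, unexpected)
```

If every failure in a run shows up in this list, the problem is most likely the casing of proper nouns in general rather than individual lexicon entries.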
from lang-smj.
Comment 9509
Date: 2014-06-24 11:53:24 +0200
From: Inga Lill Sigga Mikkelsen <<inga.l.mikkelsen>>
Sjur is the one who has worked on these tests, so I am changing "Assigned to" to Sjur.
Comment 9515
Date: 2014-06-30 10:09:56 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
The tests show that the fst generates word forms that should not be generated (names with a lowercase initial letter). So the error is in the fst, not in the test setup.
It looks as if all names wrongly get a lowercase initial letter. In addition, some names are tagged incorrectly:
Test 131: Bieva (Lexical/Generation)
[1/3][FAIL] Bieva+N+Prop+Plc+Sg+Nom => Missing results: Bieva
[1/3][FAIL] Bieva+N+Prop+Plc+Sg+Nom => Unexpected results: Bieva+N+Prop+Plc+Sg+Nom+?
[2/3][FAIL] Bieva+N+Prop+Plc+Sg+Ill => Missing results: Bievijda
[2/3][FAIL] Bieva+N+Prop+Plc+Sg+Ill => Unexpected results: Bieva+N+Prop+Plc+Sg+Ill+?
[3/3][FAIL] Bieva+N+Prop+Plc+Sg+Ela => Missing results: Bievijs
[3/3][FAIL] Bieva+N+Prop+Plc+Sg+Ela => Unexpected results: Bieva+N+Prop+Plc+Sg+Ela+?
Test 131 - Passes: 0, Fails: 3, Total: 3
=========== VS: ============
$ lookup -q -flags mbTT src/analyser-gt-norm.xfst
Bieva
Bieva Bieva+N+Prop+Plc+Pl+Nom
Bieva Bieva+N+Prop+Plc+Pl+Nom
That is, Sg vs Pl.
Since all names are affected, this looks like an error in the build rather than an error in the lexc or twolc code.
Comment 9517
Date: 2014-06-30 16:33:28 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
Trond and I looked at these today, but could not figure it out. It affects all fsts in smj (except -raw-), and it affects only smj, no other languages.
It is quite mysterious, because the build steps are the same for all languages. I checked the parts that are specific to smj, but there was nothing there - the problem is also present in the temporary fsts that are built identically for all languages, before the language-specific processing begins.
Inga:
I cannot do anything more with this now before I go on holiday, but it would be good if you fixed other errors in the proper files, e.g. the Sg instead of Pl issue.
Comment 9544
Date: 2014-08-06 13:40:24 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
I hand this over to Tomi, and add myself and Trond to the Cc list. Also changed Product and Component.
Comment 9545
Date: 2014-08-06 15:16:55 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
Inga's testing has been done using Xerox, but Hfst has some issues as well, as it turns out. It might be a hint at what the problem is:
$ hfst-lookup src/analyser-gt-desc.hfst
oslo
hfst-lookup: warning: Got infinite results, number of cycles limited to 5
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
$ hfst-lookup src/analyser-gt-desc.hfst
Oslo
hfst-lookup: warning: Got infinite results, number of cycles limited to 5
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
optimized-lookup/transducer.cc: maximum recursion depth exceeded, discarding results
That is, something is inserting an unbounded number of optional symbols in the analysing fst, causing it to fail as above.
Generation works fine:
$ hfst-lookup src/generator-gt-desc.hfst
Oslo+N+Prop+Plc+Sg+Nom
Oslo+N+Prop+Plc+Sg+Nom Oslo 0,000000
Oslo+N+Prop+Plc+Sg+Nom Oslo 0,000000
Oslo+N+Prop+Plc+Sg+Acc
Oslo+N+Prop+Plc+Sg+Acc Oslov 0,000000
Oslo+N+Prop+Plc+Sg+Acc Oslov 0,000000
Comment 9549
Date: 2014-08-07 09:05:28 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
Forget comment #5, I had local modifications that destroyed some of the results in unpredictable ways. The following is produced with clean data directly from the svn repository, using svn revision 98060:
Xerox:
$ lookup src/analyser-gt-desc.xfst
Oslo
Oslo Oslo +N+Prop+Plc+Pl+Nom
Oslo Oslo +N+Prop+Plc+Sg+Nom
Oslo Oslo +N+Prop+Plc+Sg+Gen
Oslo Oslo +N+Prop+Plc+Pl+Nom
Oslo Oslo +N+Prop+Plc+Sg+Nom
Oslo Oslo +N+Prop+Plc+Sg+Gen
oslo
oslo Oslo +N+Prop+Plc+Pl+Nom
oslo Oslo +N+Prop+Plc+Sg+Nom
oslo Oslo +N+Prop+Plc+Sg+Gen
^C
$ lookup src/generator-gt-desc.xfst
Oslo+N+Prop+Plc+Pl+Nom
Oslo+N+Prop+Plc+Pl+Nom oslo
Oslo+N+Prop+Plc+Pl+Nom Oslo
That is, Xerox has the bug where proper nouns also get a lower-case initial letter.
Hfst:
$ hfst-lookup src/analyser-gt-desc.hfst
Oslo
Oslo Oslo+N+Prop+Plc+Pl+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Gen 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Pl+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Gen 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Nom 0,000000
oslo
oslo oslo+? inf
^C
$ hfst-lookup src/generator-gt-desc.hfst
Oslo+N+Prop+Plc+Pl+Nom
Oslo+N+Prop+Plc+Pl+Nom Oslo 0,000000
Oslo+N+Prop+Plc+Pl+Nom Oslo 0,000000
^C
That is, everything works as expected using Hfst.
The questions are:
- why only smj?
- why only Xerox?
Comment 9550
Date: 2014-08-07 10:47:13 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
The bug is related to the regex for handling downcasing of derived proper strings. If I remove the following line from am-shared/src-dir-include.am:
.o. @\"orthography/downcase-derived_proper-strings.hfst\" \
both Hfst and Xerox behave properly.
Hfst:
$ hfst-lookup analyser-gt-desc.hfst
oslo
oslo oslo+? inf
Oslo
Oslo Oslo+N+Prop+Plc+Pl+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Gen 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Pl+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Gen 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Nom 0,000000
^C
$ hfst-lookup generator-gt-desc.hfst
Oslo+N+Prop+Plc+Pl+Nom
Oslo+N+Prop+Plc+Pl+Nom Oslo 0,000000
Oslo+N+Prop+Plc+Pl+Nom Oslo 0,000000
^C
Xerox:
$ lookup analyser-gt-desc.xfst
Oslo
Oslo Oslo +N+Prop+Plc+Pl+Nom
Oslo Oslo +N+Prop+Plc+Sg+Gen
Oslo Oslo +N+Prop+Plc+Sg+Nom
Oslo Oslo +N+Prop+Plc+Pl+Nom
Oslo Oslo +N+Prop+Plc+Sg+Gen
Oslo Oslo +N+Prop+Plc+Sg+Nom
oslo
oslo oslo +?
^C
$ lookup generator-gt-desc.xfst
Oslo+N+Prop+Plc+Pl+Nom
Oslo+N+Prop+Plc+Pl+Nom Oslo
Oslo+N+Prop+Plc+Pl+Nom Oslo
That is, without the downcasing of derived proper nouns, everything is working ok for both Xerox and Hfst. But when adding the downcasing, things break for Xerox but not for Hfst.
Comment 9551
Date: 2014-08-07 10:59:12 +0200
From: Trond Trosterud <<trond.trosterud>>
The first question remains: Why only smj?
Comment 9552
Date: 2014-08-07 11:40:49 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
(In reply to comment #8)
> The first question remains: Why only smj?
Both questions remain, and we have a third one:
How can we get downcasing of proper nouns to work properly and identically for both Xerox and Hfst, without affecting analysis of regular proper nouns? And preferably by using exactly the same build instructions for both Xerox and Hfst.
Comment 9560
Date: 2014-08-20 12:36:08 +0200
From: Tomi Pieski <<tomi.k.pieski>>
In the affix file for smj proper nouns, the strings with obligatory uppercase also need to be marked with @U.Cap.Obl@.
The creation of proper nouns has two paths, one with obligatory initial uppercase and one with optional initial uppercase:
@U.Cap.Obl@ ProperNoun ; ! These flags are for
@U.Cap.Opt@ ProperNoun ; ! downcasing the propernouns
The Oslo example is directed to the ACCRA-plc lexicon:
LEXICON ACCRA-plc !!= @code@ Vowel-final names where caseendings are added directly, no cg. Place names
+N+Prop+Plc:%> ACCRADECL-PLC ;
@U.Cap.Opt@+N+Prop+Plc+Sg+Gen:%>@U.Cap.Opt@ VUONAK ;
LEXICON ACCRADECL-PLC
! These sublexica are irrelevant for ACCRA, but added
! for the sake of the lexicon MARJA
ACCRA-DC ;
@U.Cap.Opt@+Err/Sub:@U.Cap.Opt@ LASJ ; !
LEXICON ACCRA-DC
+Sg+Nom: K ;
+Sg+Nom: RHyph ;
+Ess:n K ;
+Sg+Ill:Q4j K ; !Q4 to get e:i
+Sg+Gen: K ;
+Sg+Gen: RHyph ;
+Sg+Acc:v K ;
+Sg+Ine:n K ;
+Sg+Ela:s K ;
+Pl+Nom: K ;
+Pl+Gen:Q4j K ; !Q4 to get e:i
+Pl+Gen:Q4j K ; !Q4 to get e:i
+Pl+Acc:Q4jt K ;
+Pl+Ill:Q4jda K ;
+Pl+Ine:Q4jn K ;
+Pl+Ela:Q4js K ;
+Pl+Com:Q4j K ; !Q4 to get e:i
+Sg+Com:Q4jn K ;
+Sg+Com+Err/Sub:X5jn K ; ! Norgejn
In the above example, the @U.Cap.Opt@ path with 'regular' cases is not filtered out of the fst, since the regular case entries carry no @U.Cap.Obl@ flag that would conflict with it.
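The filtering described here follows from how unification flags (@U.FEATURE.VALUE@) behave: the first flag on a path sets the feature, and any later flag for the same feature must carry the same value, otherwise the path is rejected. Below is a minimal Python sketch of that check. The symbol sequences are simplified, hypothetical paths made up for illustration (whole strings stand in for symbol runs), not the actual contents of the smj fst.

```python
import re

# A unification flag looks like @U.Cap.Obl@: feature "Cap", value "Obl".
FLAG = re.compile(r"@U\.([^.@]+)\.([^.@]+)@")


def path_is_consistent(symbols):
    """Return True if all U-flags on the path unify, i.e. no feature is
    seen with two different values along the path."""
    features = {}
    for sym in symbols:
        m = FLAG.fullmatch(sym)
        if m is None:
            continue  # ordinary symbol, does not affect flag state
        feature, value = m.groups()
        # setdefault stores the first value seen and returns the stored one
        if features.setdefault(feature, value) != value:
            return False
    return True


# Hypothetical, simplified paths (illustration only):
paths = {
    "Oslo, case entry flagged Obl": ["@U.Cap.Obl@", "Oslo", "@U.Cap.Obl@", "+Sg+Nom"],
    "oslo, case entry flagged Obl": ["@U.Cap.Opt@", "oslo", "@U.Cap.Obl@", "+Sg+Nom"],
    "oslo, case entry unflagged": ["@U.Cap.Opt@", "oslo", "+Sg+Nom"],
    "oslolasj, LASJ flagged Opt": ["@U.Cap.Opt@", "oslo", "@U.Cap.Opt@", "lasj"],
}
for name, path in paths.items():
    print(name, path_is_consistent(path))
```

The third path is the problematic one: with no flag on the case entry there is nothing for @U.Cap.Opt@ to conflict with, so the lowercase form survives.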
Comment 9566
Date: 2014-08-27 15:30:36 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
The build system is working as it should, and the basic flag setup is correct, as demonstrated by the following:
$ lookup -q src/generator-gt-desc.xfst
Vuolleednama+N+Prop+Plc+Der/lasj+N+Sg+Nom
Vuolleednama+N+Prop+Plc+Der/lasj+N+Sg+Nom vuolleednamlasj
The problem is that several/many of the continuation lexicons for proper nouns are lacking the required flag(s).
The job is thus to go through the file src/morphology/affixes/propernouns.lexc, and make sure all lexicons are set up as e.g. EATNAMAT-plc (the one for Vuolleednama), which works as intended.
This is tested against svn revision 98779.
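As a rough aid for going through the file, the Python sketch below lists the lexicons in a lexc source in which no entry mentions a @U.Cap. flag at all. It is a coarse heuristic under stated assumptions, not an existing tool: the function name `lexicons_without_cap_flags` is invented for illustration, everything after a `!` is treated as a comment (an escaped `%!` is not handled), and a lexicon where only some entries carry the flag is not reported.

```python
import re


def lexicons_without_cap_flags(lexc_text):
    """Return the names of LEXICON blocks in which no entry contains
    a @U.Cap.*@ flag diacritic, in order of appearance."""
    current = None
    has_flag = {}  # lexicon name -> True if any entry mentions @U.Cap.
    for raw in lexc_text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop lexc comments (simplified)
        if not line:
            continue
        m = re.match(r"LEXICON\s+(\S+)", line)
        if m:
            current = m.group(1)
            has_flag.setdefault(current, False)
        elif current is not None and "@U.Cap." in line:
            has_flag[current] = True
    return [name for name, ok in has_flag.items() if not ok]


sample = """\
LEXICON ACCRA-plc !!= @code@ Place names
+N+Prop+Plc:%> ACCRADECL-PLC ;
@U.Cap.Opt@+N+Prop+Plc+Sg+Gen:%>@U.Cap.Opt@ VUONAK ;
LEXICON ACCRA-DC
+Sg+Nom: K ;
+Sg+Acc:v K ;
"""
print(lexicons_without_cap_flags(sample))
```

The output is only a list of candidates to inspect by hand. Whether a given lexicon actually needs a flag still has to be decided against a working one such as EATNAMAT-plc.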
Comment 9567
Date: 2014-08-27 15:42:53 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
Actually, the main reason is that most derivations are Sub-marked, e.g. all LASJ derivations from placenames. I have removed some of the sub-marking for cases where the yaml tests say that they should be accepted by the normative transducer.
Comment 9568
Date: 2014-08-27 16:56:34 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
Handing this one over to Sandra.
Comment 9569
Date: 2014-08-28 08:15:55 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
One more observation:
-lasj is Sub-marked in its own lexicon:
LEXICON LASJ !SUB !+CmpN/SgN +CmpN/SgG +CmpN/PlG !from placenames
+Der1+Der/lasj+N+Sg+Nom+Err/Sub:»lasj K ;
+Der1+Der/lasj+N+Err/Sub:»ladtja GÅNÅGIS-EVEN ;
This should probably be reevaluated, since subbing all lasj derivations is probably too strong. I will remove the sub-tagging now, but leave a note for future work.
Comment 9575
Date: 2014-09-11 22:49:35 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
This comment is only indirectly related to this bug, but it still is worth mentioning. "oslolasj" type derivations (with initial lowercase letter) have not been working with hfst, because of a bug / xerox incompatibility related to composition with flag diacritics. In the very latest hfst code this is now fixed:
$ hfst-lookup src/analyser-gt-desc.hfst
Oslo
Oslo Oslo+N+Prop+Plc+Pl+Nom 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Gen 0,000000
Oslo Oslo+N+Prop+Plc+Sg+Nom 0,000000
oslo
oslo oslo+? inf
oslolasj
oslolasj Oslo+N+Prop+Plc+Der/lasj+N+Sg+Nom 0,000000
Oslolasj
Oslolasj Oslo+N+Prop+Plc+Der/lasj+N+Sg+Nom 0,000000
That is, Xerox and Hfst behave the same.
NB! This really requires the very latest hfst code, as of today.
Comment 9610
Date: 2014-10-07 08:21:19 +0200
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
As of Hfst 3.8 we have an official release with all required bug fixes included. Summary after all the back and forth:
- Xerox build instructions are correct
- Hfst build instructions are correct
- Xerox and Hfst build instructions are identical on all relevant points
- Proper nouns behave as they should given correct flag diacritics and no sub-tagging
Sandra, you can close this bug as fixed as soon as you have verified that all proper noun continuation lexicons have the correct flag diacritics.