apertium / apertium-uig

Apertium linguistic data for Uyghur

License: GNU General Public License v3.0

Makefile 51.51% Shell 9.08% M4 10.51% XML 19.27% Python 9.63%

apertium-languages

apertium-uig's Introduction

Uyghur (`apertium-uig`)

This is an Apertium monolingual language package for Uyghur. What you can use this language package for:

Morphological analysis of Uyghur
Morphological generation of Uyghur
Part-of-speech tagging of Uyghur

Requirements

You will need the following software installed:

lttoolbox (>= 3.3.0)
apertium (>= 3.3.0)
vislcg3 (>= 0.9.9.10297)

If this does not make any sense, we recommend you look at: apertium.org

Compiling

With the requirements installed, you should be able to run:

$ ./configure
$ make

You can use ./autogen.sh instead of ./configure if you're compiling from GitHub.

If you're doing development, you don't have to install the data; you can use it directly from this directory.

If you are installing this language package as a prerequisite for an Apertium translation pair, then do (typically as root / with sudo):

# make install

You can give a --prefix to ./configure to install as a non-root user, but make sure to use the same prefix when installing the translation pair and any other language packages.

Testing

If you are in the source directory after running make, the following commands should work:

$ echo "لېكىن بۇنىڭ مېھرى باشقىچە ئىسسىق بىلىندى."  | apertium -d . uig-morph
^لېكىن/لېكىن<cnjcoo>$ ^بۇنىڭ/بۇ<prn><dem><gen>$ ^مېھرى/مېھر<n><px3sp><nom>$ 
^باشقىچە/باش<n><ter>$ ^ئىسسىق/ئىسسىق<adj>/ئىسسىق<adj><advl>/ئىسسىق<adj><subst><nom>$ 
^بىلىندى/بىل<v><tv><pass><ifi><p3><sg>/بىل<v><tv><pass><ifi><p3><pl>$^./.<sent>$

$ echo "لېكىن بۇنىڭ مېھرى باشقىچە ئىسسىق بىلىندى."  | apertium -d . uig-tagger
^لېكىن/لېكىن<cnjcoo>$ ^بۇنىڭ/بۇ<prn><dem><gen>$ ^مېھرى/مېھر<n><px3sp><nom>$ 
^باشقىچە/باش<n><ter>$ ^ئىسسىق/ئىسسىق<adj><advl>$ ^بىلىندى/بىل<v><tv><pass><ifi><p3><sg>$
^./.<sent>$
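
The output above is in the Apertium stream format: each lexical unit is written as `^surface/reading/reading$`, and tags appear in angle brackets. As a minimal sketch for inspecting such output in a script, the following Python splits a stream into units and lists the forms that are still ambiguous. The helper names (`parse_stream`, `tags`, `ambiguous`) are hypothetical, and the sketch assumes no escaped `/`, `^` or `$` characters occur in the text.

```python
import re

# One lexical unit looks like ^surface/reading1/reading2$
LU_RE = re.compile(r"\^(.*?)\$", re.DOTALL)
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_stream(text):
    """Return a list of (surface, [readings]) pairs.

    Assumption: the stream contains no backslash-escaped delimiters.
    """
    units = []
    for match in LU_RE.finditer(text):
        surface, *readings = match.group(1).split("/")
        units.append((surface, readings))
    return units


def tags(reading):
    """Return the tag names of one reading, in order."""
    return TAG_RE.findall(reading)


def ambiguous(units):
    """Return the surface forms that have more than one reading."""
    return [surface for surface, readings in units if len(readings) > 1]


# Part of the uig-morph output shown above.
sample = (
    "^ئىسسىق/ئىسسىق<adj>/ئىسسىق<adj><advl>/ئىسسىق<adj><subst><nom>$ "
    "^بىلىندى/بىل<v><tv><pass><ifi><p3><sg>/بىل<v><tv><pass><ifi><p3><pl>$"
    "^./.<sent>$"
)
units = parse_stream(sample)
for surface, readings in units:
    print(surface, len(readings))
```

Running the same function on the `uig-tagger` output should give exactly one reading per unit, which makes it a quick way to confirm that disambiguation ran.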

Files and data

apertium-uig.uig.lexc - Lexicon file
apertium-uig.uig.twol - Phonological rules
uig.prob - Tagger model
apertium-uig.uig.rlx - Constraint Grammar disambiguation rules
apertium-uig.post-uig.dix - Post-generator
modes.xml - Translation modes

For more information

Help and support

If you need help using this language pair or data, you can contact:

Mailing list: [email protected]
IRC: #apertium on irc.oftc.net

See also the file AUTHORS included in this distribution.

apertium-uig's Issues

No morphology on numbers/punctuation

quoting @koguzhan from the wiki:

Suffixes after Numbers or characters like " << >> are currently not analyzed at all

Copying the CRH or TUR solution might work, though I'm not sure how Uyghur usually combines numbers with morphology.

Missing analysis for "mu"

"<ئۇيغۇرلارمۇ>"
        "ئۇيغۇر" n pl nom
                "مۇ" qst

مۇ here should have an analysis similar to the Turkish -dA, as in Onlar da öğrenci, "they too are students."

imish forms

context	base	full form	no space	reduced	with mi
nouns (nom, etc.), p3	doktur	doktur imish	dokturimish	dokturmish	—
→ normal analysis	`doktur<n><nom>`	`doktur<n><nom> i<cop><aor><dub><p3><sg>`	`doktur<n><nom>+i<cop><aor><dub><p3><sg>`	`doktur<n><nom>+i<cop><aor><dub><p3><sg>`	—
→ linguistic analysis	`doktur<n><nom>`	`doktur<n><nom> i<cop><aor><dub><p3><sg>`	`doktur<n><nom>+i<cop><aor><dub><p3><sg>`	`doktur<n><nom>+i<cop_no_i><aor><dub><p3><sg>`	—

Copula morphology needs to be checked

Some of the copula stuff in uig was recently developed on the fly and it would be good to check and compare with the other Turkic analyzers.

Clitics not splitting off

LEXICON CLITICS
# ;
%<qst%>:%>ﻡۇ V-COP-PERS ;
%<qst%>:%>چۇ # ;
!!%<gm%>:%>%{D%}ۇﺭ # ;
%<comp%>:%>ﺭ%{A%}%{K%} # ;
%<dek%>:%>%{D%}ەﻙ # ;
%<che%>:%>چە # ;
%<dub%>:%>%{y%}ﻙەﻥ V-COP-PERS ;
%<postadv%>:%>مۇ # ;
%<cnjadv%>:%>كى # ; !

It looks like these clitics are not splitting off properly, causing their tags to be added to the previous word, e.g.
^سىرت<n><px3sp><loc><p3><sg><cnjadv>$
The solution is to split them off and make sure they are assigned their own lemmas:
+%+كى%<cnjadv%>:%>كى # ; !
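
One way to look for this problem in analyzer output is to check whether a clitic tag appears inside the first sub-reading, that is, before any `+`, where it has no lemma of its own. The following is a minimal sketch under stated assumptions: the function name `glued_clitics` is hypothetical, the default tag set is a small subset of the CLITICS lexicon chosen for illustration, and the shape of the "fixed" reading in the usage lines is an assumption about what the split analysis would look like.

```python
import re

TAG_RE = re.compile(r"<([^<>]+)>")

# Assumption: only these tags from the CLITICS lexicon are checked here.
DEFAULT_CLITIC_TAGS = frozenset({"cnjadv", "postadv", "qst"})


def glued_clitics(reading, clitic_tags=DEFAULT_CLITIC_TAGS):
    """Return clitic tags attached to the host word instead of their own lemma.

    `reading` is a single reading without the ^ and $ delimiters.
    Only the part before the first '+' is inspected, and the first tag
    (the host's part of speech) is skipped.
    """
    host = reading.split("+", 1)[0]
    host_tags = TAG_RE.findall(host)
    return [tag for tag in host_tags[1:] if tag in clitic_tags]


# Reading reported in the issue: the clitic tag is glued to the noun.
print(glued_clitics("سىرت<n><px3sp><loc><p3><sg><cnjadv>"))
# Assumed shape once the clitic is split off with its own lemma.
print(glued_clitics("سىرت<n><px3sp><loc><p3><sg>+كى<cnjadv>"))
```

The first call reports `cnjadv`; the second returns an empty list because the tag now belongs to the sub-reading that follows the `+`.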

Locative -dA expecting the wrong vowel

Words like "musabiqe" and "téxnika", which conflict with vowel harmony, cause problems when they take the locative suffix, especially when they take both possessive and locative suffixes.
For example, تېخنىكىدا and مۇسابىقىسىدە are the correct forms, but the analyzer expects تېخنىكىدە and مۇسابىقىسىدا.
Update: roots ending in ى, like كىشى, are also problematic. The analyzer treats ى as a back vowel, for example when suffixing -lAr, so it produces كىشىلار where كىشىلەر is the correct form.
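
The expected forms above are consistent with choosing the suffix vowel from the underlying stem rather than from the raised surface vowel. The toy sketch below illustrates that idea in Latin transliteration; it is not how the analyzer's twol rules are written. Its assumptions: `i` and `é` are treated as neutral, a stem with only neutral vowels defaults to front (as with كىشىلەر), the d/t alternation of the suffix is ignored, and the function names are hypothetical.

```python
import unicodedata

BACK = {"a", "o", "u"}
FRONT = {"e", "\u00f6", "\u00fc"}  # e, ö, ü
# Assumption: i and é are neutral and do not decide harmony.


def harmony(underlying_stem, default="front"):
    """Return 'back' or 'front' from the last non-neutral vowel of the stem."""
    stem = unicodedata.normalize("NFC", underlying_stem)
    for ch in reversed(stem):
        if ch in BACK:
            return "back"
        if ch in FRONT:
            return "front"
    return default  # only neutral vowels found


def locative(surface_stem, underlying_stem):
    """Attach -da/-de to the surface stem, using the underlying stem for harmony."""
    vowel = "a" if harmony(underlying_stem) == "back" else "e"
    return unicodedata.normalize("NFC", surface_stem) + "d" + vowel


print(locative("téxniki", "téxnika"))      # surface stem after vowel raising
print(locative("musabiqisi", "musabiqe"))  # with the possessive suffix
print(harmony("kishi"))
```

The point of the sketch is that the raised surface vowel `i` carries no harmony information, so a rule that inspects only the surface form will pick the wrong suffix for these stems.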

Installed modes are missing files

modes.xml includes some modes with install="yes", but the required files aren't installed.

Some generic suggestions:

-lexc and -twol modes probably aren't useful to users
-spell modes should depend on --enable-ospell
.deps files are never installed, so any modes using them shouldn't be installed.

Messages for package app-dicts/apertium-uig-9999:
Failed to find '/usr/share/apertium/apertium-uig/.deps/uig.twol.hfst' in install image.
QA: missing files required for mode uig-twol.
Failed to find '/usr/share/apertium/apertium-uig/.deps/uig.LR.lexc.hfst' in install image.
QA: missing files required for mode uig-lexc.
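
A check like the QA message above can be approximated before packaging by scanning modes.xml for installed modes that reference files under `.deps/`. The following is a minimal sketch using only the Python standard library; the function name and the sample XML are hypothetical, and the sample only imitates the usual `mode` / `pipeline` / `program` / `file` layout rather than reproducing this package's actual modes.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modelled on the usual modes.xml layout.
SAMPLE = """<modes>
  <mode name="uig-morph" install="yes">
    <pipeline>
      <program name="lt-proc -w"><file name="uig.automorf.bin"/></program>
    </pipeline>
  </mode>
  <mode name="uig-twol" install="yes">
    <pipeline>
      <program name="hfst-fst2strings"><file name=".deps/uig.twol.hfst"/></program>
    </pipeline>
  </mode>
  <mode name="uig-lexc" install="no">
    <pipeline>
      <program name="hfst-fst2strings"><file name=".deps/uig.LR.lexc.hfst"/></program>
    </pipeline>
  </mode>
</modes>"""


def installed_modes_using_deps(modes_xml):
    """Map each install="yes" mode to the .deps/ files it references."""
    root = ET.fromstring(modes_xml)
    problems = {}
    for mode in root.iter("mode"):
        if mode.get("install") != "yes":
            continue
        deps = [
            f.get("name")
            for f in mode.iter("file")
            if f.get("name", "").startswith(".deps/")
        ]
        if deps:
            problems[mode.get("name")] = deps
    return problems


print(installed_modes_using_deps(SAMPLE))
```

In the sample, only `uig-twol` is reported: `uig-lexc` also uses a `.deps/` file but is not marked for installation.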

Compile error when CG isn't found

make: *** No rule to make target `no', needed by `uig.rlx.bin'.  Stop.

The "no" comes from the $(CGCOMP) variable in the Makefile. The Makefile should probably not try to compile the CG if it is not there.

This probably applies to other languages too.
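
The failure mode is that a literal `no` is later used as if it were a program path. The real guard belongs in the build system, but the check itself is simple, and the sketch below illustrates it in Python. The function name `usable_tool` is hypothetical, and `cg-comp` is assumed to be the compiler that `$(CGCOMP)` is meant to point to.

```python
import shutil


def usable_tool(configured_value, which=shutil.which):
    """Return the resolved tool path, or None if the tool is unavailable.

    A value of "no" (or an empty value) means the tool was not found at
    configure time and must not be treated as a program name.
    """
    if not configured_value or configured_value == "no":
        return None
    return which(configured_value)


# "no" is what the Makefile variable held in the error above.
print(usable_tool("no"))
# Resolves to a path only if cg-comp is actually on PATH.
print(usable_tool("cg-comp"))
```

A build step that depends on the compiler would be skipped whenever this returns `None`, instead of creating a dependency on a target called `no`.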

Vowel Harmony-ى

ى still causes problems with vowel harmony, especially when it interacts with special characters.

iken forms

context	base	full form	no space	reduced	with mi
nouns (nom, etc.), p3	doktur	doktur iken	dokturiken	dokturken	doktur miken
→ normal analysis	`doktur<n><nom>`	`doktur<n><nom> i<cop><aor><evid><p3><sg>`	`doktur<n><nom>+i<cop><aor><evid><p3><sg>`	`doktur<n><nom>+i<cop><aor><evid><p3><sg>`	`doktur<n><nom>+mi<qst>+i<cop><aor><evid><p3><sg>`
→ linguistic analysis	`doktur<n><nom>`	`doktur<n><nom> i<cop><aor><evid><p3><sg>`	`doktur<n><nom>+i<cop><aor><evid><p3><sg>`	`doktur<n><nom>+i<cop_no_i><aor><evid><p3><sg>`	`doktur<n><nom>+mi<qst>+i<cop><aor><evid><p3><sg>`