apertium-kaz's Introduction

Kazakh: apertium-kaz

This is an Apertium monolingual language package for Kazakh. What you can use this language package for:

  • Morphological analysis of Kazakh
  • Morphological generation of Kazakh
  • Part-of-speech tagging of Kazakh

Requirements

You will need the following software installed:

  • lttoolbox (>= 3.3.0)
  • apertium (>= 3.3.0)
  • vislcg3 (>= 0.9.9.10297)

If this does not make any sense, we recommend you look at: apertium.org

Compiling

With the requirements installed, you should be able to just run:

$ ./configure
$ make

You can use ./autogen.sh instead of ./configure if you're compiling from SVN.

If you're doing development, you don't have to install the data; you can use it directly from this directory.

If you are installing this language package as a prerequisite for an Apertium translation pair, then do (typically as root / with sudo):

# make install

You can give a --prefix to ./configure to install as a non-root user, but make sure to use the same prefix when installing the translation pair and any other language packages.

Testing

If you are in the source directory after running make, the following commands should work:

$ echo "Сәлем!" | apertium -d . kaz-morph
^Сәлем/сәлем<ij>/сәлем<n><nom>/сәлем<n><attr>/
сәлем<n><nom>+е<cop><aor><p3><pl>/сәлем<n><nom>+е<cop><aor><p3><sg>$
^!/!<sent>$^./.<sent>$

$ echo "Оқу инемен құдық қазғандай." | apertium -d . kaz-tagger
^Оқу/оқу<adj>$ ^инемен/ине<n><ins>$ ^құдық/құдық<n><nom>$
^қазғандай/қаз<v><tv><ger_past><sim>$^./.<sent>$^./.<sent>$ 
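If you want to post-process this output, the stream format shown above (`^surface/analysis/...$`) is easy to split up. The following is a minimal sketch, not part of this package; the function names are hypothetical, and it assumes the text contains no escaped `^`, `$` or `/` characters, which holds for the examples above.

```python
import re

# One lexical unit looks like ^surface/analysis1/analysis2$.
# Assumption: no escaped ^, $ or / inside a unit.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")
TAG = re.compile(r"<([^<>]+)>")


def parse_stream(text):
    """Return a list of (surface, [analysis, ...]) pairs."""
    units = []
    for match in LEXICAL_UNIT.finditer(text):
        surface, *analyses = match.group(1).split("/")
        units.append((surface, analyses))
    return units


def tags(analysis):
    """Return the tag names of one analysis, e.g. ['n', 'ins']."""
    return TAG.findall(analysis)


tagged = "^Оқу/оқу<adj>$ ^инемен/ине<n><ins>$ ^құдық/құдық<n><nom>$"
for surface, analyses in parse_stream(tagged):
    print(surface, analyses, [tags(a) for a in analyses])
```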

Files and data

  • apertium-kaz.kaz.lexc - Monolingual dictionary
  • apertium-kaz.kaz.twol - Morphophonological rules
  • apertium-kaz.kaz.err.twol -
  • apertium-kaz.kaz.guesser.twol -
  • kaz.prob - Tagger model
  • apertium-kaz.kaz.rlx - Constraint Grammar disambiguation rules
  • apertium-kaz.post-kaz.dix - Post-generator
  • apertium-kaz.kaz.mtx -
  • apertium-kaz.kaz.tsx -
  • apertium-kaz.kaz.udx -
  • modes.xml - Translation modes

For more information

Help and support

If you need help using this language pair or data, you can contact:

See also the file AUTHORS included in this distribution.

Acknowledgements

If you use this in your work, please cite:

apertium-kaz's People

Contributors

asselbaltabayeva, assem7shormak, beknazar, dina-ta, frankier, ftyers, gabatanekeyev, ilnarselimcan, inariksit, jonorthwash, kantoro, mlforcada, mr-martian, ryanachi, sevilaybayatli, snomos, sundetova, sushain97, tinodidriksen, unhammer, zh3nis

apertium-kaz's Issues

kaz-morph is broken

What's broken

The kaz-morph mode uses lt-proc and kaz.automorf.bin.

However, it only returns this:

Error: Invalid dictionary (hint: the left side of an entry is empty)

Why it's broken

This appears to be because of some ~empty paths, e.g.

0	2	@0@	<ltr>	0.000000
2	83630	@0@	@0@	0.000000
2	848	@0@	@0@	0.000000
848	0.000000
83630	0.000000

These empty paths appear to be due to the guesser being intersected with Cyrl-Arab. Relevant excerpts below:

apertium-kaz.kaz.lexc:

LEXICON LTR

%<ltr%>: # ;

LEXICON Guesser

<( а | ә | б | в | г | ғ | д | е | ё | ж | з | и | і | й | к | қ | л | м | н |
   ң | о | ө | п | р | с | т | у | ұ | ү | ф | х | һ | ц | ч | ш | щ | ь | ы |
   ъ | э | ю | я )> LTR ;

apertium-kaz.Cyrl-Arab.twol:

 ь:0
 ы:ى
 ъ:0
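To check a transducer for this kind of path mechanically, something like the following sketch could be run over an AT&T-format dump like the one quoted above. It is only an illustration: `has_empty_input_path` is a hypothetical helper, it assumes the start state is `0`, and it treats the third column as the "left side" that lt-proc complains about.

```python
from collections import defaultdict, deque

EPSILON = "@0@"  # epsilon symbol in AT&T-format dumps


def has_empty_input_path(att_text, start="0"):
    """True if a final state is reachable from `start` using only arcs
    whose input (left) symbol is epsilon."""
    eps_arcs = defaultdict(list)
    finals = set()
    for line in att_text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) >= 4:    # arc: src dst input output [weight]
            src, dst, in_sym = fields[0], fields[1], fields[2]
            if in_sym == EPSILON:
                eps_arcs[src].append(dst)
        elif fields[0]:         # final state: state [weight]
            finals.add(fields[0])
    # Breadth-first search restricted to epsilon-input arcs.
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state in finals:
            return True
        for nxt in eps_arcs.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

On the five-line excerpt above this returns True, because state 848 is final and is reached through `@0@` inputs only.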

What we should do about it

Ideally, I think we need to find some way to not intersect the guesser part of the transducer with Cyrl-Arab. Alternatively, we could tweak the lexical conversion to not allow paths that would just be 0 (though I'm not positive how to do that upon first contemplation).

(Thanks to @mr-martian for helping me figure out why lt-proc was failing.)
@IlnarSelimcan @ftyers

NUM morphotactics in testvoc lite don't compile

hfst-fst2strings -c 1 .deps/NUM.hfst | gzip -c > NUM.txt.gz
Killed
make: *** No rule to make target 'NUM-ROMAN.txt.gz', needed by 'all'.  Stop.
rm .deps/NUM.hfst .deps/NUM.prefix.bin .deps/NUM.prefix.upper .deps/NUM.prefix.att .deps/NUM.prefix.hfst .deps/NUM.prefixes

Took about an hour and used up all 64GB of RAM on my lab machine plus the additional 64GB of swap before dying.

I'm wondering if it might be cyclical?
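One way to test the cyclicity guess without enumerating strings is to look for a cycle in the state graph directly. The sketch below is an assumption-laden illustration: it expects an AT&T-format text dump of the transducer (for instance one produced with hfst-fst2txt), and `has_cycle` is a hypothetical helper name.

```python
from collections import defaultdict


def has_cycle(att_text):
    """True if the state graph of an AT&T-format transducer dump has a cycle."""
    graph = defaultdict(set)
    for line in att_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 4:  # arc lines: src dst input output [weight]
            graph[fields[0]].add(fields[1])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    colour = {}
    for root in list(graph):
        if colour.get(root, WHITE) != WHITE:
            continue
        colour[root] = GREY
        stack = [(root, iter(graph.get(root, ())))]
        # Iterative depth-first search, so deep transducers do not hit
        # Python's recursion limit.
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                state = colour.get(nxt, WHITE)
                if state == GREY:      # back edge: cycle found
                    return True
                if state == WHITE:
                    colour[nxt] = GREY
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                colour[node] = BLACK
                stack.pop()
    return False
```

Note that a cycle alone would not explain unbounded output when cycles are followed at most once; a very large acyclic number grammar can also exhaust memory.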

Installed modes are missing files.

modes.xml includes a handful of modes with install="yes", but the
required files aren't installed.

 * Failed to find '/usr/share/apertium/apertium-kaz/.deps/kaz.twol.hfst' in install image.
 * QA: missing files required for mode kaz-twol.
 * Failed to find '/usr/share/apertium/apertium-kaz/.deps/kaz.lexc.hfst' in install image.
 * QA: missing files required for mode kaz-lexc.
 * Failed to find '/usr/share/apertium/apertium-kaz/kaz.zhfst' in install image.
 * QA: missing files required for mode kaz-spell.
 * Failed to find '/usr/share/apertium/apertium-kaz/.deps/acceptor.default.hfst' in install image.
 * QA: missing files required for mode kaz-tokenise.

My guess is that kaz-{twol,lexc} shouldn't be installed, and that kaz-spell
should depend on --enable-ospell. Not sure about kaz-tokenise.
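A check like the QA one above can be reproduced locally with a short script. This is a sketch only: `missing_install_files` is a hypothetical helper, and it assumes the usual modes.xml layout in which each `<mode>` contains `<file name="..."/>` elements with paths relative to the install directory.

```python
import os
import xml.etree.ElementTree as ET


def missing_install_files(modes_xml, install_dir, exists=os.path.exists):
    """Map each install="yes" mode name to the files it references
    that are absent from install_dir."""
    root = ET.fromstring(modes_xml)
    missing = {}
    for mode in root.iter("mode"):
        if mode.get("install") != "yes":
            continue
        absent = [
            f.get("name")
            for f in mode.iter("file")
            if not exists(os.path.join(install_dir, f.get("name")))
        ]
        if absent:
            missing[mode.get("name")] = absent
    return missing
```

For example, `missing_install_files(open("modes.xml", "rb").read(), "/usr/share/apertium/apertium-kaz")` would list the affected modes after `make install`. The `exists` parameter is there only so that the check can be exercised without touching the filesystem.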

Redundant and miscategorized stems in apertium-kaz.kaz.lexc

The vocabulary of apertium-kaz.kaz.lexc requires checking for redundancy, consistency and miscategorizations. Here are some examples:

кептірген:кептірген A1 ; ! ""
аршылған:аршылған A1 ; ! ""
жонылған:жонылған A1 ; ! ""
сүрілген:сүрілген A1 ; ! ""

Along with that, the reasons why these are considered mistakes and, more generally, the choices made should be documented in apertium-kaz/docs so that this kind of issue doesn't happen in the future.

At that point, since the coverage of apertium-kaz is relatively high, that documentation will probably be more useful for other (Turkic) languages than for Kazakh.

[puupankki.conllu] солай олай қалай

(1) 5 солай сол PRON prn PronType=Dem 8 advmod _ _
(2) 2 солай солай ADV adv _ 5 ccomp _ _
(3) 3 олай ол PRON prn PronType=Dem 5 advmod _ _

A relevant snippet from validate.py's output:

[Line 1935 Sent akorda-random.tagged.txt:164:2942 Node 5]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'PRON'

Therefore #17 will change UPOS to ADV, but keep XPOS = prn and also keep PronType=Dem to keep puupankki Apertium-compatible. The kaz.udx file will be adjusted accordingly so that {ол|сол<prn>} get converted into ADV prn PronType=Dem.
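For illustration, the effect of that change on a single treebank row could be sketched as below. This is not the actual change made in #17 or in kaz.udx; `retag_row` and the set of forms are hypothetical, and the row is assumed to be a standard tab-separated 10-column CoNLL-U line.

```python
# CoNLL-U column indices (0-based)
FORM, UPOS, XPOS = 1, 3, 4

# Assumption: only these surface forms are affected, as in rows (1) and (3).
ADVERBIAL_DEMONSTRATIVES = {"солай", "олай"}


def retag_row(row):
    """Set UPOS to ADV for the listed forms; XPOS and FEATS stay untouched."""
    cols = row.split("\t")
    if (
        len(cols) == 10
        and cols[FORM].lower() in ADVERBIAL_DEMONSTRATIVES
        and cols[XPOS] == "prn"
    ):
        cols[UPOS] = "ADV"
    return "\t".join(cols)


row = "5\tсолай\tсол\tPRON\tprn\tPronType=Dem\t8\tadvmod\t_\t_"
print(retag_row(row))
```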

This issue is here for people who won't see the validation errors I'm going through and who are likely to ask me later "why did you change this". (rant)

err_orth in testvoc files

In generating files for testvoc lite in tests/morphotactics, N1.txt.gz contains things like
мектепсұңдаршү:мектеп<n><nom>+е<cop><aor><p2><pl>+шы<emph><err_orth>. Is this intended?
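To see how widespread this is, the generated files can be scanned for the tag. A minimal sketch, assuming each line has the `surface:analysis` shape shown above and that the surface form itself contains no colon; `err_orth_pairs` is a hypothetical helper.

```python
def err_orth_pairs(lines):
    """Yield (surface, analysis) for entries whose analysis carries <err_orth>."""
    for line in lines:
        surface, sep, analysis = line.rstrip("\n").partition(":")
        if sep and "<err_orth>" in analysis:
            yield surface, analysis


# Usage on a real file (not run here):
#   import gzip
#   with gzip.open("N1.txt.gz", "rt", encoding="utf-8") as fh:
#       print(sum(1 for _ in err_orth_pairs(fh)))
```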

two forms generated for neg.ifi.evid forms

Currently all possible <neg><ifi><evid> forms have two possible generated forms. For example, кет<v><iv><neg><ifi><evid><p1><sg> outputs both кетпеппін and кеткен жоқ екенмін.

The forms кетпедім and кеткен жоқпын both analyse as кет<v><iv><neg><ifi><p1><sg>, but this analysis only generates the latter form.

We eventually need to find a (tag-based) way to distinguish between these forms. For the time being, we probably need to set one of the neg.ifi.evid forms to Dir/LR.
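Analyses that generate more than one surface form can be listed by grouping generator output by analysis. The sketch below assumes you already have (surface, analysis) pairs, for example parsed from testvoc files; `multiple_surface_forms` is a hypothetical helper.

```python
from collections import defaultdict


def multiple_surface_forms(pairs):
    """Return {analysis: sorted surface forms} for analyses with
    more than one distinct surface form."""
    forms = defaultdict(set)
    for surface, analysis in pairs:
        forms[analysis].add(surface)
    return {a: sorted(s) for a, s in forms.items() if len(s) > 1}


pairs = [
    ("кетпеппін", "кет<v><iv><neg><ifi><evid><p1><sg>"),
    ("кеткен жоқ екенмін", "кет<v><iv><neg><ifi><evid><p1><sg>"),
    ("кеткен жоқпын", "кет<v><iv><neg><ifi><p1><sg>"),
]
for analysis, surfaces in multiple_surface_forms(pairs).items():
    print(analysis, surfaces)
```

Each analysis reported this way is a candidate for a Dir/LR restriction on one of its forms.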

transducer no longer meets Apertium Turkic standards

The issue with the reorganisation of the lexicon in de4c77a is that different parts of speech are all lumped together.

Every single other Turkic transducer uses the lexicon names Nouns, Adjectives, Verbs, ProperNouns, etc. This is standardised for several reasons, one of which is that it gives us an easy way to count the number of stems of a particular type. E.g., note that the countstems script was broken by your changes.
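To show why the lexicon names matter for counting, here is a simplified sketch of a per-lexicon stem count. It is not the actual countstems script; `count_stems` is a hypothetical helper, and it assumes that `!` always starts a comment and that every entry ends with one `;`.

```python
def count_stems(lexc_text):
    """Count entries (terminated by ';') under each LEXICON heading."""
    counts = {}
    current = None
    for raw in lexc_text.splitlines():
        code = raw.split("!", 1)[0].strip()  # drop comments
        if code.startswith("LEXICON "):
            current = code.split()[1]
            counts.setdefault(current, 0)
            continue
        if current is not None:
            counts[current] += code.count(";")
    return counts
```

With standard names, `count_stems(text)["Nouns"]` is directly comparable across Turkic transducers; once different parts of speech share one lexicon, no such per-type count can be read off.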

@IlnarSelimcan, could you justify why you did this reorganisation? Also, in principle this sort of major restructuring should be done in consultation with and by consensus among everyone it affects—that is, everyone who's committed to this repo, or at least the apertium-turkic mailing list.

two neg.ifi paradigms

Similar to #10, Kazakh has the issue of two neg.ifi paradigms.

First-person singular (neg.ifi.p1.sg) looks like this:

  • мен барған жоқпын
  • мен бармадым

The question is whether there is a difference in usage between these two forms, or if they are identical. The answer to this question will inform what needs to be done in the transducer with regard to the issue.

[puupankki.conllu] мың миллион миллиард млн млрд трлн are inconsistent

General context: #17

Actually several related issues:

  1. мың and миллиард are NUM num everywhere, while миллион in some cases is NUM num, and in others NOUN n.

  2. млрд. and трлн. are NOUN abbr everywhere, while млн. is in some cases tagged as NUM num, in others as NOUN abbr.

  3. (a)

4	2	2	NUM	num	NumType=Card	5	compound	_	_
5	миллиард	миллиард	NUM	num	NumType=Card	6	compound	_	_
6	300	300	NUM	num	NumType=Card	7	compound	_	_
7	миллион	миллион	NUM	num	NumType=Card	8	nummod	_	_
8	теңгеден	теңге	NOUN	n	Case=Abl	10	nmod	_	_
9	астам	астам	ADJ	adj	_	10	amod	_	_
10	қаржы	қаржы	NOUN	n	Case=Nom	11	obj	_	_

vs (b)

3	4,3	4,3	NUM	num	NumType=Card	4	nummod	_	_
4	мыңнан	мың	NUM	num	Case=Abl|NumType=Card,Ord	6	nmod	_	_
5	астам	астам	ADJ	adj	_	6	amod	_	_
6	шақырымды	шақырым	NOUN	n	Case=Acc	7	obj	_	_

Hereby I suggest:

  • to tag all of мың, миллион, миллиард, триллион, млн., млрд. and трлн. as NUM num. For the latter three, apertium-kaz & co can be modified to output <abbr> as a secondary tag, i.e. млн\.? --> <num><abbr>. Since there are abbreviated nouns, abbreviated numerals etc., for known abbreviations I think it makes sense to make <abbr> a secondary tag, especially in the context of UD annotation:

[quote https://universaldependencies.org/u/pos/all.html#sym-symbol]

Strings that consists entirely of alphanumeric characters are not symbols but they may be proper nouns: 130XE, DC10; others may be tagged PROPN (rather than SYM) even if they contain special characters: DC-10. Similarly, abbreviations for single words are not symbols but are assigned the part of speech of the full form. For example, Mr. (mister), kg (kilogram), km (kilometer), Dr (Doctor) should be tagged nouns. Acronyms for proper names such as UN and NATO should be tagged as proper nouns.

[unquote]

but also generally speaking knowing the POS of the unabbreviated form is considered helpful for applications.

UPDATE: note that in UD there is the Abbr feature: https://universaldependencies.org/u/feat/Abbr.html

  • to handle all numerical constructions like the above as compounds (i.e. as done in 3a). In other words, a flat chain of compounds, with the rightmost element being the head and receiving nummod, nmod or whichever relation applies.
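Inconsistencies like those in points 1 and 2 can be found automatically by collecting the (UPOS, XPOS) pairs seen for each lemma. The following is a sketch under the assumption of standard tab-separated 10-column CoNLL-U rows; `inconsistent_tags` is a hypothetical helper, and the exact lemma strings used in the treebank (for example with or without the trailing dot) need to be checked.

```python
from collections import defaultdict

LEMMA, UPOS, XPOS = 2, 3, 4  # CoNLL-U column indices (0-based)


def inconsistent_tags(conllu_text, lemmas):
    """Return {lemma: {(UPOS, XPOS), ...}} for lemmas tagged in more than one way."""
    seen = defaultdict(set)
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) == 10 and cols[LEMMA] in lemmas:
            seen[cols[LEMMA]].add((cols[UPOS], cols[XPOS]))
    return {lemma: pairs for lemma, pairs in seen.items() if len(pairs) > 1}
```

Running it over puupankki.conllu with the seven numerals listed above as `lemmas` would show which of them still need to be unified as NUM num.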
