
giellalt / lang-rus

5 stars · 27 watchers · 1 fork · 49.15 MB

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Russian language

Home Page: https://giellalt.uit.no

License: GNU General Public License v3.0

Makefile 0.32% Shell 0.44% M4 0.40% Python 0.80% Regular Expression 0.18% XML 0.03% YAML 0.52% Text 97.32%
finite-state-transducers constraint-grammar nlp language-resources minority-language proofing-tools giellalt-langs maturity-beta geo-russia langfam-indoeuropean

lang-rus's Introduction

The Russian morphology and tools


This repository contains finite state source files for the Russian language, for building morphological analysers, proofing tools and dictionaries. The data and implementation are licensed under the GNU GPLv3, as also detailed in the LICENSE file. The authors named in the AUTHORS file can be contacted about other licensing options.

Install proofing tools and keyboards for the Russian language by using the Divvun Installer (some languages are only available via the nightly channel).

Download and test speller files

The speller files downloadable at the top of this page (the *.bhfst files) can be used with divvunspell to test their performance. These are exactly the same files as those installed on users' computers and mobile phones. Desktop and mobile speller files differ in their error models and should be tested separately, hence the two separate downloads.
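For a quick local test, something along these lines should work with the divvunspell command-line tool; the suggest subcommand, the --archive flag and the archive file name are assumptions, not confirmed by this README, so check divvunspell --help for the actual interface:

# Hypothetical invocation: ask the downloaded desktop speller archive for suggestions.
# Replace rus.bhfst with the actual name of the downloaded archive.
divvunspell suggest --archive rus.bhfst "превет"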

Documentation

Documentation can be found on the GiellaLT site: https://giellalt.uit.no

Core dependencies

In order to compile and use the Russian language morphology and dictionaries, you need HFST and VislCG3 (installation commands below).

To install VislCG3 and HFST, just copy/paste this into your Terminal on Mac OS X:

curl https://apertium.projectjj.com/osx/install-nightly.sh | sudo bash

or terminal on Ubuntu, Debian or Windows Subsystem for Linux:

wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
sudo apt-get install cg3 hfst

or terminal on RedHat, Fedora, CentOS or Windows Subsystem for Linux:

wget https://apertium.projectjj.com/rpm/install-nightly.sh -O - | sudo bash
sudo dnf install cg3 hfst

Alternatively, the Apertium wiki has good instructions on how to install the dependencies on Mac OS X and on Linux.

Further details and dependencies are described on the GiellaLT Getting Started pages.

Downloading

Using Git:

git clone https://github.com/giellalt/lang-rus

Using Subversion:

svn checkout https://github.com/giellalt/lang-rus.git/trunk lang-rus

Building and installation

INSTALL describes the GNU build system in detail, but for most users it is the usual:

./autogen.sh # This will automatically clone or check out other GiellaLT dependencies
./configure
make
(as root) make install
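If you do not want to install system-wide, the usual GNU build system options apply. A minimal sketch installing into your home directory (no root needed):

./configure --prefix="$HOME/giella"
make
make install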

Citing

If you use language data from more than one GiellaLT language, consider citing our LREC 2022 article on the whole infrastructure:

Linda Wiechetek, Katri Hiovain-Asikainen, Inga Lill Sigga Mikkelsen, Sjur Moshagen, Flammie Pirinen, Trond Trosterud, and Børre Gaup. 2022. Unmasking the Myth of Effortless Big Data - Making an Open Source Multi-lingual Infrastructure and Building Language Resources from Scratch. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1167–1177, Marseille, France. European Language Resources Association.

If you use BibTeX, the entry below is as it appears in the ACL Anthology:

@inproceedings{wiechetek-etal-2022-unmasking,
    title = "Unmasking the Myth of Effortless Big Data - Making an Open Source
    Multi-lingual Infrastructure and Building Language Resources from Scratch",
    author = "Wiechetek, Linda  and
      Hiovain-Asikainen, Katri  and
      Mikkelsen, Inga Lill Sigga  and
      Moshagen, Sjur  and
      Pirinen, Flammie  and
      Trosterud, Trond  and
      Gaup, B{\o}rre",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation
    Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.125",
    pages = "1167--1177"
}

lang-rus's People

Contributors

albbas, anghyflawn, bbqsrc, flammie, ftyers, leneantonsen, reynoldsnlp, rtxanson, rueter, snomos, trondtr, trondtynnol, ulp16, unhammer


lang-rus's Issues

Make ambiguous/optional transitivity tag

Taken from reynoldsnlp/udar#24 (some discussion can be seen there).

Russian verbs do not inflect for transitivity, so having multiple readings distinguished by transitivity is grammatically inaccurate.

Transitivity tags can be helpful for the CG, so we should specify transitivity when possible, but if the transitivity is ambiguous, there should only be one reading.
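A hypothetical lookup illustrating the problem; the +TV/+IV tag names and the analyses shown in the comments are assumptions for illustration, not actual analyser output:

# Hypothetical: two readings that differ only in the transitivity tag.
echo "читать" | hfst-lookup -q analyser-gt-desc.hfstol
# читать	читать+V+Impf+TV+Inf	0.000000
# читать	читать+V+Impf+IV+Inf	0.000000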

Lemmas declared more than once

Taken from reynoldsnlp/udar#40

The following code using the lexc_parser module ...

import os
from sys import stderr

import lexc_parser as lp

# GTPATH points at the root of the GiellaLT checkout
GTPATH = os.environ['GTPATH']
filename = GTPATH + '/langs/rus/src/morphology/lexicon.tmp.lexc'

print('Parsing lexc file...', file=stderr)
with open(filename) as f:
    src = f.read()
lexc = lp.Lexc(src)

# Top-level lexicons reachable from Root (the Numeral lexicon is skipped)
primary_lexicons = [entry.cc.id for entry in lexc['Root']
                    if entry.cc is not None and entry.cc.id != 'Numeral']
for lex in primary_lexicons:
    # Accessing this property emits a UserWarning for any lemma
    # declared more than once within the same LEXICON
    lexc[lex].cc_lemmas_dict

...yields the following lists of lemmas that are declared more than once inside the same part of speech's LEXICON:

Parsing lexc file...
ryan.py:17: UserWarning: Lemmas declared more than once within Adverb:
{'коротко', 'наголо', 'верхом', 'чудно', 'страшно'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Noun:
{'бронирование', 'пояс', 'колонок', 'кочан', 'ничтожество', 'судзуки', 'лекарство', 'орган', 'рондо', 'видение', 'уголь', 'туника', 'сапожок', 'пресс-релиз', 'артикул', 'соболь', 'огнеупоры', 'кондуктор', 'индустрия', 'чижик', 'вязанка', 'воздвижение', 'недвижимость', 'пулярка', 'призрак', 'козырь', 'флагман', 'цоколь', 'бакан', 'нон-стоп', 'гитлерюгенд', 'сопло', 'ширма', 'предвозвестник', 'провидение', 'болванчик', 'генсовет', 'парилка', 'пугало', 'гигант', 'тягло', 'полиграфия', 'комплекс', 'микрометр', 'мебельщик', 'характерность', 'феномен', 'пристенок', 'хаханьки', 'натура', 'наркоминдел', 'чувиха', 'пергамент', 'водолей', 'сельдь', 'ламповая', 'напряг', 'ферула', 'хиханьки', 'глюк', 'настриг', 'туркменбаши', 'пролог', 'метчик', 'обрезание', 'туфелька', 'розан', 'речушка', 'чабер', 'порсканье', 'судья', 'светоч', 'урка', 'хаос', 'проводка', 'лиганд', 'колосс', 'дочушка', 'маки', 'транспорт', 'замглавы', 'полип', 'ирис', 'угольник', 'проволочка', 'лосось', 'единица', 'червец', 'тотем', 'холодность', 'плёночка', 'картель', 'нуклеокапсид', 'жертва', 'истукан', 'предвестник', 'кашица', 'кредит', 'взрослый', 'опрощение', 'сведение', 'ужин', 'отзыв', 'русло', 'солнечник', 'ход', 'ястребок', 'префикс', 'цитокин', 'ирей', 'синтип', 'бучение', 'книговедение', 'трапезная', 'безобразность', 'край', 'чучело', 'созданьице', 'зайчик', 'рол', 'подволока', 'разлив', 'солнышко', 'креветка', 'консерваторка', 'дядя', 'прототип', 'сметливость', 'гуарани', 'субъект', 'заворот', 'видик', 'катанье', 'ведение', 'создание', 'калига', 'устрица', 'хобот', 'прослушка', 'бодяга', 'зев', 'комроты', 'отчёт', 'фрик', 'конус', 'адрес', 'котик', 'камора', 'дышло', 'плазмодий', 'марионетка', 'отправитель', 'усадьба', 'селище', 'живчик', 'лоцман', 'дублет', 'светило', 'боливар', 'мшанка', 'целение', 'юнкер', 'спутник', 'скакунок', 'дуплет', 'ордер'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Predicative:
{'чудно', 'полно', 'страшно'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Pronoun:
{'возле', 'поперёд', 'обок', 'вне', 'внутрь', 'близь', 'помимо', 'посредине', 'напротив', 'поперёк', 'вблизи', 'посреди', 'вперёд', 'наместо', 'спереди', 'наперекор', 'подобно', 'согласно', 'насчёт', 'навроде', 'свыше', 'ниже', 'посередине', 'ради', 'позади', 'вдоль', 'под', 'чрез', 'вроде', 'вследствие', 'посредством', 'выключая', 'у', 'путём', 'касательно', 'превыше', 'накануне', 'относительно', 'вопреки', 'про', 'промежду', 'касаемо', 'около', 'над', 'из-за', 'по', 'сквозь', 'за', 'ввиду', 'соразмерно', 'противу', 'поверх', 'вовнутрь', 'наперерез', 'без', 'позадь', 'вкось', 'вослед', 'пред', 'мимо', 'сообразно', 'из-под', 'опричь', 'внизу', 'между', 'по-над', 'кроме', 'сверху', 'о', 'посередь', 'сверх', 'вкруг', 'внутри', 'промеж', 'через', 'к', 'против', 'от', 'наподобие', 'перед', 'посереди', 'сзади', 'кругом', 'на', 'включая', 'прежде', 'до', 'исключая', 'выше', 'снизу', 'соответственно', 'взамен', 'насупротив', 'для', 'из', 'округ', 'среди', 'меж', 'плюс', 'окрест', 'средь', 'с', 'благодаря', 'спустя', 'вслед', 'при', 'противно²', 'вместо', 'минус', 'вокруг', 'после', 'впереди', 'подле', 'близ', 'по-за', 'изнутри', 'супротив', 'в', 'середь'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Verb:
{'осветить', 'прояснеть', 'отползать', 'запыхаться¹', 'усугубиться', 'тикать', 'усугубить', 'икать'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Propernoun:
{'Мелани', 'Сандро', 'Филатов', 'Зощенко', 'Марго', 'Геркулесович', 'Люси', 'Симонович', 'Фениксович', 'Симон', 'Витольдович', 'Манагуа', 'Якобсон', 'Евтушенко', 'Гордон', 'Исидор', 'Терещенко', 'Геркулесовна', 'Бурденко', 'Исидорович', 'Григоренко', 'Симоновна', 'Фигаро', 'Макаренко', 'Стефанович', 'Филиппов', 'Короленко', 'Геркулес', 'Лонгин', 'Франко', 'Довженко', 'Пегасовна', 'Пегасович', 'Никарагуа', 'Лонгиновна', 'Мартиновна', 'Громыко', 'Элизабет', 'Федотов', 'Павлиновна', 'Лысенко', 'Шевченко', 'Гильфердинг', 'Павлин', 'Шульженко', 'Исаченко', 'Иванов', 'Робинсон', 'Пегас', 'Стефан', 'Мартин', 'Михалков', 'Павлинович', 'Персей', 'Стефановна', 'Семашко', 'Икария', 'Катанга', 'Мемфис', 'Лонгинович', 'Исидоровна', 'Фениксовна', 'Викторович', 'Феникс', 'Стефани', 'Персеевич', 'Новиков', 'Витольдовна', 'Мартинович', 'Любань', 'Витольд', 'Виктор', 'Нестеренко', 'Панченко', 'Гурченко', 'Обухов', 'Персеевна', 'Покров', 'Итака', 'Морган', 'Викторовна'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Punctuation:
{''}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within Symbols:
{'%'}
  lexc[lex].cc_lemmas_dict
ryan.py:17: UserWarning: Lemmas declared more than once within LexicalizedParticiple:
{'положить', 'сложить'}
  lexc[lex].cc_lemmas_dict

hfst-compose-intersect in src/Makefile_L2 leads to HfstFatalException

It may be that the best way to solve this problem is to properly integrate the L2 makefile into the automake build (see #10). Maybe @snomos can help determine how difficult that will be.

In the root directory, running $ make && cd src && make -f Makefile_L2 -B throws an HfstFatalException. The problem seems to stem from the number of error tags in L2_ORTH_ERRS. I have tried various combinations to see if there is some kind of conflict between the rules, but every small subset I have tried works without error. (However, maybe I just haven't tested the right combination yet.) I ran 12 different rotations of the 12 tags, and it fails on the 10th tag every time.

The regex files for L2_ORTH_ERRS are shown here (removing comments and empty lines):

$ tail -n +1 src/orthography/L2_*.regex | grep -v ^# | grep -v ^$
==> src/orthography/L2_Akn.regex <==
а (<-) о ;
==> src/orthography/L2_e2je.regex <==
е (<-) э ;
==> src/orthography/L2_H2S.regex <==
ь (<-) ъ ;
==> src/orthography/L2_i2j.regex <==
й (<-) и ;
==> src/orthography/L2_i2y.regex <==
ы (<-) и ;
==> src/orthography/L2_Ikn.regex <==
и (<-) е ,
и (<-) я ;
==> src/orthography/L2_j2i.regex <==
и (<-) й ;
==> src/orthography/L2_je2e.regex <==
э (<-) е ;
==> src/orthography/L2_NoSS.regex <==
0 (<-) ь ;
==> src/orthography/L2_sh2shch.regex <==
щ (<-) ш ;
==> src/orthography/L2_shch2sh.regex <==
ш (<-) щ ;
==> src/orthography/L2_y2i.regex <==
и (<-) ы ;

The offending code is this loop in Makefile_L2. It appears that hfst-compose-intersect is outputting a bad transducer and hfst-disjunct is choking on it:

	for tag in $(L2_ORTH_ERRS) ; \
	do \
		echo "[ ? -> ... \"\+Err\/L2_$${tag}\" || _ .#. ]" > add-tag-err-L2_$${tag}.regex.tmp ; \
		hfst-regexp2fst  --format=foma --xerox-composition=ON -v  \
			-S add-tag-err-L2_$${tag}.regex.tmp -o add-tag-err-L2_$${tag}.hfst ; \
		printf "read regex @\"orthography/L2_$${tag}.compose.hfst\" \
			.o. @\"analyser-gt-desc.hfst\" \
			;\n \
			save stack err.orth.tmp.hfst\n \
			quit\n" | hfst-xfst -p -v --format=foma ; \
		hfst-subtract -F err.orth.tmp.hfst \
			      analyser-gt-desc-L2.tmp.hfst \
			      > err.uniq.tmp.hfst ; \
		hfst-compose-intersect -v -1 err.uniq.tmp.hfst \
		      -2 add-tag-err-L2_$${tag}.hfst \
		      -o err.tagged.tmp.hfst ; \
		hfst-disjunct -1 analyser-gt-desc-L2.tmp.hfst \
		      -2 err.tagged.tmp.hfst \
		      | hfst-determinize \
		      | hfst-minimize \
		      > err.tmp.hfst ; \
		mv err.tmp.hfst analyser-gt-desc-L2.tmp.hfst ; \
		echo "слово" | hfst-lookup analyser-gt-desc-L2.tmp.hfst ; \
		hfst-summarize --verbose analyser-gt-desc-L2.tmp.hfst ; \
	done

The last relevant bit of output is the following:

Reading from add-tag-err-L2_sh2shch.regex.tmp, writing to add-tag-err-L2_sh2shch.hfst
Compiling expression #1
Using foma as output handler
Reading from standard input...
? bytes. 167693 states, 372271 arcs, ? paths
hfst[1]: hfst[1]: hfst[1]: .
hfst-subtract: warning: Warning: analyser-gt-desc-L2.tmp.hfst contains flag diacritics. The result of subtraction may be incorrect.
hfst-compose-intersect: warning:
Found output multi-char symbols ("+A") in
transducer in file err.uniq.tmp.hfst which are not found on the
input tapes of transducers in file add-tag-err-L2_sh2shch.hfst.
Reading from err.uniq.tmp.hfst and add-tag-err-L2_sh2shch.hfst, writing to err.tagged.tmp.hfst
Reading and minimizing rule xre(?)...
Reading lexicon... subtract(?stdin?, ?stdin?) read
Computing intersecting composition...
Storing result in err.tagged.tmp.hfst...
terminate called after throwing an instance of 'HfstFatalException'
hfst-determinize: Aborted (core dumped)
<stdin> is not a valid transducer file
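One way to narrow this down further, assuming L2_ORTH_ERRS is an ordinary make variable that can be overridden on the command line (the variable name is taken from the loop above, the tag names from the L2_*.regex listing):

# Rebuild with a reduced tag set to bisect which combination triggers the exception.
cd src && make -f Makefile_L2 -B L2_ORTH_ERRS="Akn e2je H2S i2j i2y Ikn j2i je2e NoSS"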

separate prepositions into lemmas by case?

Taken from reynoldsnlp/udar#27.

It would be helpful to language learners/teachers to be able to search for instances of a preposition that govern a certain case.

For example, с can govern INST, GEN and ACC. Each of these could be a different lemma, e.g. с¹, с², с³. The superscript numerals are something of a pain, and they are opaque. Perhaps this should instead be с+Pr+Acc, с+Pr+Gen, and с+Pr+Ins. This stretches the meaning of the case tags: here a case tag would mean that the preposition governs that case, rather than that the word is in that case.
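Purely as an illustration of the proposed scheme, analyses might then look like this (hypothetical output; the current analyser does not produce these readings):

echo "с" | hfst-lookup -q analyser-gt-desc.hfstol
# Hypothetical readings under the proposed scheme, one per governed case:
# с	с+Pr+Acc	0.000000
# с	с+Pr+Gen	0.000000
# с	с+Pr+Ins	0.000000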

Stress on multi-word expressions

Taken from reynoldsnlp/udar#19

The lexical underlying form needs to have a persistent stress mark that survives the two-level rule that reduces stresses to the right-most one. For example,...

красно-жёлтых
так как
так что
то есть

Search through an fst2strings version of a stressed transducer for any words with stresses on both sides of spaces and hyphens. Something like this: egrep ":.*[ё́̀].*(% |-).*[ё́̀]"
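A concrete version of that search, assuming a stressed analyser has already been built (the file name is a placeholder):

# Expand the stressed transducer to strings and keep entries with stress marks
# on both sides of a space or hyphen.
hfst-fst2strings analyser-gt-desc.stressed.hfst | egrep ":.*[ё́̀].*(% |-).*[ё́̀]"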

Can't build tokenizer

make[4]: *** No rule to make target 'tokeniser-disamb-gt-desc.accented.pmhfst', needed by 'all-am'. Stop.

эты

$ echo эты | hfst-lookup -q analyser-gt-desc.hfstol
эты	эта+N+Fem+Inan+Pl+Acc	0.000000
эты	эта+N+Fem+Inan+Pl+Nom	0.000000
эты	эта+N+Fem+Inan+Sg+Gen	0.000000

Add acronyms to analyzer

Many of the unrecognized tokens in running text are acronyms, such as СССР and США. Acronyms should have gender and number tags to show agreement.
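For example, with the desktop analyser (the exact formatting of unknown-word output may differ):

# An unrecognized acronym currently comes back unanalysed (+?).
echo "СССР" | hfst-lookup -q analyser-gt-desc.hfstol
# СССР	СССР+?	inf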

negative participles

Participles can generally be negated with не~ as in непрочитанный. The FST does not systematically include such forms.
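A quick coverage check for a не-prefixed participle, following the lookup pattern used elsewhere in this repository:

# If the FST lacks the negated form, this returns an unanalysed (+?) line.
echo "непрочитанный" | hfst-lookup -q analyser-gt-desc.hfstol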
