
trnltk's Introduction

PROJECT MOVED

This project was first ported to Java and then merged into the Zemberek project (code, Turkish Wikipedia article). Zemberek is an NLP project that has been used in industry for several years. It is used in products such as OpenOffice.

The current codebase for TRNLTK is kept for people who want to work on or with a Python morphological parser for Turkish.

""" Copyright 2012 Ali Ok (aliokATapacheDOTorg)

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. """

Turkish Natural Language Toolkit

This project tries to provide a toolkit for computational linguistics work on Turkish.

Some terms used in documentation and code

surface: Full word including the root and suffixes

root : The root of a word; its atomic part.

derivation : Deriving a new word

inflection : Conjugating a word with a person agreement / possession / tense etc.

suffix form : Form of a suffix. For example, suffix 'Progressive' has 2 suffix forms; '-iyor' and '-makta'

stem : Root + derivations. Doesn't include the inflections

syntactic category : Verb, Noun, Adjective etc.

inflectional suffix : A suffix that changes neither the stem nor the syntactic category of a surface

derivational suffix : A suffix that changes the stem and might change the syntactic category of a surface

morpheme : Elements of a surface; stem and suffixes

lemma : The root text that can be found in a dictionary

lexeme : Lemma + Syntactic category of the lemma

morphology : How a surface is constructed and how it can be decomposed into morphemes

morphotactics : Rules for when a suffix can be applied. For example "Progressive suffix can only be applied to a Verb, and it can't be applied to a surface which has Progressive suffix already"

orthographics : Rules of phonetics. For example the rules for voicing (kitap+a --> kitaba), devoicing (kitap+cı --> kitapçı), vowel drop (omuz+u --> omzu), etc.
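The voicing and devoicing examples above can be sketched in a few lines of Python. This is only an illustration, not the project's implementation (the real phonetic rules are taken from Zemberek3). The function name `attach` and the letter tables are assumptions made for the example. The sketch applies voicing to every root ending in p, ç, t or k, although in real Turkish this depends on the root, and it does not cover vowel drop, which is also a per-root property.

```python
# Hypothetical helper, not part of trnltk: joins a root and a suffix form
# while applying two of the phonetic rules mentioned in the glossary.
VOWELS = set("aeıioöuü")
VOICELESS = set("çfhkpsşt")
VOICING = {"p": "b", "ç": "c", "t": "d", "k": "ğ"}    # root-final change
DEVOICING = {"c": "ç", "d": "t", "g": "k"}            # suffix-initial change


def attach(root, suffix):
    """Concatenate root and suffix, applying voicing or devoicing."""
    # Voicing: a root-final voiceless stop softens before a vowel.
    # Assumption: every such root voices, which is not true in general.
    if suffix[0] in VOWELS and root[-1] in VOICING:
        return root[:-1] + VOICING[root[-1]] + suffix
    # Devoicing: a suffix-initial c/d/g hardens after a voiceless consonant.
    if root[-1] in VOICELESS and suffix[0] in DEVOICING:
        return root + DEVOICING[suffix[0]] + suffix[1:]
    return root + suffix


print(attach("kitap", "a"))    # kitaba
print(attach("kitap", "cı"))   # kitapçı
```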

for "yüzücülere":

surface : yüzücülere

root : yüz

stem : yüzücü

syntactic category of root : Verb

syntactic category of surface : Noun

suffixes and suffix forms:

  • derivational suffix 'Agentive' with form '-ücü'
  • inflectional suffix '3rd person plural agreement' with form '-ler'
  • inflectional suffix 'Dative' with form '-e'

morphemes:

  • root 'yüz'
  • derivational suffix 'Agentive' with form '-ücü'
  • inflectional suffix '3rd person plural agreement' with form '-ler'
  • inflectional suffix 'Dative' with form '-e'

lemma : yüz

lexeme : yüz + Verb

for 'kitaba':

surface : kitaba

root : kitab

stem : kitab (or kitap, doesn't matter)

syntactic category of root : Noun

syntactic category of surface : Noun

suffixes and suffix forms:

  • inflectional suffix 'Dative' with form '-a'

morphemes:

  • root 'kitab'
  • inflectional suffix 'Dative' with form '-a'

lemma : kitap

lexeme : kitap + Noun
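The two worked examples can also be written down as data. The sketch below is a hypothetical structure (the class names `Suffix` and `Parse` are not taken from the trnltk code) that shows how stem, surface and lexeme follow from the root and the applied suffix forms. The Agentive form is written out in full as 'ücü' so that concatenation reproduces the surface, and the stem is computed under the simplifying assumption that all derivations precede all inflections.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Suffix:
    name: str            # e.g. "Dative"
    form: str            # the suffix form actually applied, without the hyphen
    derivational: bool   # True for derivational, False for inflectional


@dataclass
class Parse:
    root: str
    lemma: str
    root_category: str
    suffixes: List[Suffix]

    @property
    def stem(self) -> str:
        # Root + derivations only; inflections are left out.
        return self.root + "".join(s.form for s in self.suffixes if s.derivational)

    @property
    def surface(self) -> str:
        # Root + every applied suffix form.
        return self.root + "".join(s.form for s in self.suffixes)

    @property
    def lexeme(self) -> Tuple[str, str]:
        return (self.lemma, self.root_category)


yuzuculere = Parse("yüz", "yüz", "Verb", [
    Suffix("Agentive", "ücü", True),
    Suffix("3rd person plural agreement", "ler", False),
    Suffix("Dative", "e", False),
])
kitaba = Parse("kitab", "kitap", "Noun", [Suffix("Dative", "a", False)])

print(yuzuculere.stem, yuzuculere.surface, yuzuculere.lexeme)
print(kitaba.stem, kitaba.surface, kitaba.lexeme)
```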

Plan

  1. [In Progress] A contextless morphological parser to extract roots and suffixes out of surfaces

  2. A contextless morphological generator that can generate surfaces from roots and suffixes by choosing the correct suffix form

  3. A playground with a data set that provides:
     • Concordance for surfaces
     • Concordance for roots
     • Concordance for lexemes
     • Concordance for transition words
     • Statistics for words, roots and suffixes
     • ...

  4. A context dependent morphological parser that uses N-Grams for determining the correct parse result among several results

  5. A context dependent morphological generator that uses N-Grams for determining the correct suffix form among several forms

  6. A rule based and statistical lexical category determining tool for sentences

Contextless Morphological Parser


A finite state machine is used for parsing surfaces. Its nodes are the different states of a word and its edges are suffixes.
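To make the idea concrete, here is a toy finite state machine in Python. It is a sketch, not the actual trnltk suffix graph: the state names, the two-entry root table and the suffix forms are all assumptions chosen for the example, and vowel harmony is not checked. Each state lists its outgoing edges as (suffix name, suffix forms, target state), and parsing is a depth-first walk that consumes the surface one suffix form at a time.

```python
# Toy suffix graph. All names below are illustrative, not trnltk identifiers.
ROOTS = {"ev": "NOUN_ROOT", "kitab": "NOUN_ROOT"}
EDGES = {
    "NOUN_ROOT": [("Plural", ("ler", "lar"), "NOUN_PLURAL"),
                  ("Dative", ("e", "a"), "NOUN_CASE")],
    "NOUN_PLURAL": [("Dative", ("e", "a"), "NOUN_CASE")],
    "NOUN_CASE": [],
}
TERMINAL = {"NOUN_ROOT", "NOUN_PLURAL", "NOUN_CASE"}


def _walk(state, rest, path, results):
    """Follow every edge whose suffix form matches the start of `rest`."""
    if not rest:
        if state in TERMINAL:
            results.append(path)
        return
    for name, forms, target in EDGES[state]:
        for form in forms:
            if rest.startswith(form):
                _walk(target, rest[len(form):], path + [f"{name}({form})"], results)


def parse(surface):
    """Return every path through the graph that consumes the whole surface."""
    results = []
    for root, state in ROOTS.items():
        if surface.startswith(root):
            _walk(state, surface[len(root):], [root], results)
    return results


print(parse("evlere"))    # [['ev', 'Plural(ler)', 'Dative(e)']]
print(parse("kitaba"))    # [['kitab', 'Dative(a)']]
print(parse("evlerler"))  # [] : no edge applies Plural twice
```

Because the morphotactic rules live in the edge table, a rule such as "a suffix cannot be applied twice" is expressed simply by not drawing that edge.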

As of October 2012, this graph is used to parse surfaces. The graph is drawn by this script with the following command:

suffixgraphplotter.py E /home/ali/Desktop/suffixGraphX.png

The format of the image is based on extension. So, for a more interactive image you can use dot:

suffixgraphplotter.py E /home/ali/Desktop/suffixGraphX.dot

Why another parsing tool and why FSM?

I've inspected other approaches and saw that tracking down problems was very hard with them. For example, one approach is creating a suffix graph by defining which suffix can come after which other suffix. But with that approach it is impossible to have an overview of the graph, since there would be thousands of nodes and edges.

The phonetic rules and their implementation are copied from the open-source Java library Zemberek3.

How is it tested?

There are thousands of parsing unit tests. In addition, I test against the METU-Sabanci treebank, which is closed-source. Unfortunately, its license doesn't allow anyone to publish any portion of the treebank, so I can only test the parser against it in my local environment.


trnltk's Issues

Extract intensifying syllables for adjectives and adverbs

masmavi, sımsıcak, yapayalnız, ipince, küpküçük etc.

Intensification doesn't follow strict rules, so the intensifying syllables need to be found from a big corpus.

If a word ends with a dictionary adverb/adjective and starts with the beginning of that same adjective/adverb, we can suggest that it is intensified.

Try to find as many as possible and add them to a dictionary!
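The heuristic described above could look roughly like this. It is a sketch under assumptions: the function names are made up, "the beginning of the adjective" is taken to mean everything up to and including its first vowel, and the adjective list stands in for a real dictionary.

```python
VOWELS = set("aeıioöuü")


def first_open_part(word):
    """Return the word up to and including its first vowel ('mavi' -> 'ma')."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[:i + 1]
    return word


def find_intensified(word, adjectives):
    """Return (intensifying prefix, adjective) if the heuristic matches, else None."""
    for adj in adjectives:
        if adj and len(word) > len(adj) and word.endswith(adj):
            prefix = word[:-len(adj)]
            # The prefix must start like the adjective itself.
            if prefix.startswith(first_open_part(adj)):
                return prefix, adj
    return None


ADJECTIVES = ["mavi", "yalnız", "ince", "küçük"]   # stand-in for a dictionary
for w in ["masmavi", "yapayalnız", "ipince", "küpküçük", "kitap"]:
    print(w, find_intensified(w, ADJECTIVES))
```

Running this over a corpus and collecting the returned prefixes would give the candidate list that the issue asks for. The candidates would still need a manual review.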

Statistical word tokenizer (sentence to words)

Rule based part is already available : https://github.com/aliok/trnltk/blob/master/trnltk/tokenizer/texttokenizer.py

It doesn't work well with:

  1. Abbreviations like M.Ö. or ing.
  2. Ordinals like 3.
  3. Roman numerals like III and III.
  4. Parentheses such as "(abc"
  5. Some phrases which are multiple words but should be considered as one : "hafta sonu" => "hafta_sonu"
  6. Proper nouns which are multiple words but should be considered as one : "İç Anadolu" => "İç_Anadolu"
  7. Duplications

Ideas for handling each case during tokenization:

  1. Check if M.Ö. is used as an abbreviation
  2. This is rule based I think. A sentence almost never ends with a cardinal number.
  3. Need morphologic support for that first.
  4. Seems rule based
  5. After tokenization, can have a look if there is a phrase like that. If so, words could be merged
  6. Same as 5
  7. Issue #32 is related

for 5 and 6, see http://www.tdk.gov.tr/index.php?option=com_content&view=article&id=221:Ayri-Yazilan-Birlesik-Kelimeler&catid=50:yazm-kurallar&Itemid=132

When using brute force roots, give penalty to roots which don't look like other roots

For example, with brute force, the parse results for the word "yapıyordum" are the following:

  • yap + Prog + Past + P1sg
  • yapa + Prog + Past + P1sg
  • yapiyo + Aor + Past + P1sg
  • yapiyor + Past + P1sg

Nr 2, nr 3 and nr 4 are false positives.

For nr 3 and nr 4 we can have a look at the similarity with other verb roots.
There are a lot of verbs (kapa, ...) in the form of nr 2, so this doesn't solve the problem completely. But it is a start.
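One simple way to measure "looks like other roots" is to count character bigrams that never occur in known roots. The sketch below is an assumption about how such a penalty could be computed, not something that exists in trnltk. The list of known roots is a tiny hand-picked sample, and `^` and `$` mark the start and end of a root.

```python
def bigrams(root):
    """Character bigrams of a root, with ^ and $ as boundary markers."""
    padded = "^" + root + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


# Hand-picked sample of known verb roots (an assumption for the example).
KNOWN_ROOTS = ["yaz", "kap", "kapa", "tara", "gel", "sor", "ara", "sap", "ye"]
SEEN = {bg for root in KNOWN_ROOTS for bg in bigrams(root)}


def penalty(candidate):
    """Fraction of the candidate's bigrams that no known root contains."""
    grams = bigrams(candidate)
    unseen = sum(1 for g in grams if g not in SEEN)
    return unseen / len(grams)


for candidate in ["yap", "yapa", "yapiyo", "yapiyor"]:
    print(candidate, round(penalty(candidate), 2))
```

With this sample, "yap" and "yapa" get no penalty while "yapiyo" and "yapiyor" do, which matches the observation that this kind of similarity helps with nr 3 and nr 4 but not with nr 2.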

Morphologic parsing tokenizer

This would be good for deciding what to do when a dot char is seen.
If it makes sense:

  • numerals
  • roman numerals
  • etc.

don't separate it.

Same would go with other ambiguous points.
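Until morphological parsing is available in the tokenizer, the decision can be approximated with rules. The following sketch is such a rule-based stand-in. The function names and the abbreviation set are assumptions, and the Roman numeral pattern only covers values up to 3999.

```python
import re

# Assumed, non-exhaustive abbreviation list for the example.
ABBREVIATIONS = {"M.Ö.", "ing."}
# Standard Roman numeral pattern; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"^(?=[IVXLCDM])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)


def keeps_dot(token):
    """True if the trailing dot belongs to the token itself."""
    if len(token) < 2 or not token.endswith("."):
        return False
    body = token[:-1]
    return token in ABBREVIATIONS or body.isdigit() or bool(ROMAN.match(body))


def split_token(token):
    """Separate a trailing dot unless it makes sense as part of the token."""
    if len(token) > 1 and token.endswith(".") and not keeps_dot(token):
        return [token[:-1], "."]
    return [token]


for t in ["3.", "III.", "M.Ö.", "geldi."]:
    print(t, split_token(t))
```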

Unsupervised statistical root extraction without a dictionary

Brute force root extractors already exist. However, they produce too many results, so it is better to do this statistically.

This might be useful for finding roots that don't exist in the dictionary (e.g. local words) and proper nouns.

  • Save the possible roots for a big corpus (10M words) in a file
  • ...

For proper noun recognition

  • check if the root has been used with an apostrophe in the corpus
  • or check if the word starts with upper case in the middle of a sentence in the corpus

For verb recognition, for example: for the non-dictionary word 'kıvışlıyordu', find the root as 'kıvışlamak'

  • Check if there are other surfaces with root candidates as "kıvışlamak", such as 'kıvışladım' 'kıvışla' 'kıvışlarsa'
  • Then we would eliminate some of the candidates : 'kıvışlımak' 'kıvışlıyormak' 'kıvışlıyomak' etc.
  • However, it doesn't eliminate the roots such as "kıvış+Noun" 'kıvmak' 'kıvımak' etc.
  • For them, check if there are other surfaces such as 'kıvışımı' 'kıvdım' 'kıvıyorum' etc.

That would help a lot.
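The first elimination step can be sketched as counting how many other corpus surfaces support each candidate. This is a deliberately naive illustration: `support` is a made-up helper, candidates are given as bare stems rather than infinitives, and "support" just means "another surface starts with this stem".

```python
def support(candidate, original, corpus):
    """Number of distinct surfaces, other than the original, starting with the candidate."""
    return sum(
        1 for surface in set(corpus)
        if surface != original and surface.startswith(candidate)
    )


# Tiny stand-in for the 10M word corpus mentioned above.
corpus = ["kıvışlıyordu", "kıvışladım", "kıvışla", "kıvışlarsa"]
original = "kıvışlıyordu"

for candidate in ["kıvışla", "kıvışlı", "kıvışlıyor", "kıvış"]:
    print(candidate, support(candidate, original, corpus))
```

As noted in the list, shorter candidates such as "kıvış" are not eliminated this way, because every surface that supports "kıvışla" also starts with "kıvış". Those need the additional check against surfaces like 'kıvışımı'.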

Mark "phrases" in dictionary

For example, it might make sense to mark the word "bilimsel" as "contains 'Related' suffix"

We don't want to break that word into its subparts, but need to reduce the number of parse results produced for that word.

bilimsel+Adj
bilim+A3sg+Pnon+Nom+Adj+Related

Phrase recognition and database

We need something like this for use in the tokenizer (sentence to words).
It can be done statistically with some rules, with the support of Issue #25

hafta sonu => hafta_sonu
Turkiye Cumhuriyeti ==> Turkiye_Cumhuriyeti
ilan etmek --> ilan_etmek

Doesn't make sense to parse "ilan" and "etmek" separately.

Zemberek already has a small database of these.

Issue #32 is related

See http://www.tdk.gov.tr/index.php?option=com_content&view=article&id=221:Ayri-Yazilan-Birlesik-Kelimeler&catid=50:yazm-kurallar&Itemid=132
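Once a phrase database exists, the merge step after tokenization is straightforward. The sketch below assumes two-word phrases stored as tuples. The set `PHRASES` and the function `merge_phrases` are hypothetical names, and the entries are just the examples from this issue.

```python
# Stand-in phrase database, limited to two-word phrases for simplicity.
PHRASES = {("hafta", "sonu"), ("Turkiye", "Cumhuriyeti"), ("ilan", "etmek")}


def merge_phrases(tokens, phrases=PHRASES):
    """Join adjacent tokens that form a known phrase with an underscore."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append("_".join(pair))
            i += 2   # both tokens are consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_phrases(["bu", "hafta", "sonu", "geliyorum"]))
# ['bu', 'hafta_sonu', 'geliyorum']
```

Inflected phrases such as "ilan etti" would not match this exact lookup. Handling them needs the morphological parser, which is why the issue suggests combining rules with statistics.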

Brute force proper noun root finder and a special suffix graph

Should accept:

  • Surfaces starting with uppercase letter
  • Including no apostrophe (since with apostrophe, the root is obvious)

Some suffixes which can be applied to this kind of root:

  • -ler :
    • Turkler,
    • Alilere gidiyorum
  • -gil
    • Ahmetgildeyim
  • -li
    • Kayserili
  • -lik
    • Turkluk

etc.

Check the rules from TDK: http://www.tdk.gov.tr/index.php?option=com_content&view=article&id=187:Noktalama-Isaretleri-Aciklamalar&catid=50:yazm-kurallar&Itemid=132

Some suffixes don't make sense to apply to proper nouns:

  • -len

The really problematic part is that the apostrophe is sometimes used and sometimes not.
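The acceptance conditions and the brute force step might be sketched as follows. The function name and the minimum root length are assumptions. Every prefix is returned as a candidate, and the special suffix graph would then have to decide which candidates leave a valid suffix sequence.

```python
def proper_noun_root_candidates(surface, min_len=2):
    """Brute force candidates for a capitalized surface without an apostrophe."""
    if not surface or not surface[0].isupper():
        return []
    if "'" in surface or "’" in surface:
        return []   # with an apostrophe, the root is obvious
    return [surface[:i] for i in range(min_len, len(surface) + 1)]


print(proper_noun_root_candidates("Kayserili"))
print(proper_noun_root_candidates("Ali'ye"))   # []
print(proper_noun_root_candidates("kitap"))    # []
```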

Suffix -si

Can be applied to nouns ending with a consonant.

erkeksi
cocuksu
raporsu

but not
masa-m-si
yasli-m-si

-msi is another suffix
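A sketch of the rule as stated: the suffix is only applied to nouns ending with a consonant, and the suffix form is chosen from the last vowel of the noun. The form table and the function name are assumptions for the example, and the words are written without diacritics as in the list above.

```python
VOWELS = "aeıioöuü"
# Suffix form by the last vowel of the noun (four-way vowel harmony).
SI_FORMS = {"a": "sı", "ı": "sı", "e": "si", "i": "si",
            "o": "su", "u": "su", "ö": "sü", "ü": "sü"}


def apply_si(noun):
    """Return noun + the matching -si form, or None if the rule doesn't apply."""
    if not noun or noun[-1] in VOWELS:
        return None   # nouns ending with a vowel take -msi instead
    last_vowel = next((ch for ch in reversed(noun) if ch in VOWELS), None)
    if last_vowel is None:
        return None
    return noun + SI_FORMS[last_vowel]


for noun in ["erkek", "cocuk", "rapor", "masa"]:
    print(noun, apply_si(noun))
```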

Duplication recognition in tokenization

Duplications:

  1. abur cubur : neither word makes sense on its own
  2. yemek memek : the second word doesn't make sense on its own
  3. iyi kotu : opposites
  4. zırıl zırıl : called "sound reflection" in Turkish
  5. sıcak sıcak : two adjectives turn into an adverb
  6. gide gide
  7. kırk elli kişi
  8. uc bes kurus
  9. bata cika
  10. enine boyuna
  11. ev bark
  12. bas basa, daldan dala, ucu ucuna
  13. gelir gelmez, yapar yapmaz --> Adverbs

Some of them can be done during tokenization.

Some of them need to be done after parsing, such as 13

_R+Verb+Pos+Aor+A3sg _R+Verb+Neg+Aor+A3sg %1 %2+Adverb+When
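Two of the patterns can be recognised on the token level without any parsing: plain repetition (zırıl zırıl, sıcak sıcak, gide gide) and the "m" variant (yemek memek). The sketch below covers only those two. The function name and the returned labels are made up for the example.

```python
VOWELS = set("aeıioöuü")


def duplication_kind(first, second):
    """Classify a token pair as 'repetition', 'm-reduplication' or None."""
    if first == second:
        return "repetition"
    # m-reduplication: the leading consonants are replaced by "m",
    # or "m" is prepended if the word starts with a vowel.
    index = next((i for i, ch in enumerate(first) if ch in VOWELS), None)
    if index is not None and second == "m" + first[index:]:
        return "m-reduplication"
    return None


print(duplication_kind("sıcak", "sıcak"))   # repetition
print(duplication_kind("yemek", "memek"))   # m-reduplication
print(duplication_kind("iyi", "kotu"))      # None
```

Pairs such as "gelir gelmez" cannot be found this way, since they depend on the parse results of both words, as the rule above shows.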

Support more numerals

1.'yi
2'nci
10000'er
biner
bininci (supported already)

instead of adding words to dictionary, maybe it is better to add the support with the suffix graph
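Before the suffix graph can take over, a digit based surface has to be split into its number and its suffix text. A possible sketch with a regular expression is shown below. The pattern and the function name are assumptions, and only the forms written with digits and an apostrophe are handled, not "biner" or "bininci".

```python
import re

# digits, an optional ordinal dot, an apostrophe, then the suffix text
NUMERAL = re.compile(r"^(\d+)(\.?)'(\w+)$")


def split_numeral(surface):
    """Return (number, is_ordinal, suffix text) or None if the pattern doesn't match."""
    match = NUMERAL.match(surface)
    if match is None:
        return None
    digits, dot, suffix = match.groups()
    return int(digits), dot == ".", suffix


for s in ["1.'yi", "2'nci", "10000'er", "biner"]:
    print(s, split_numeral(s))
```

The suffix text could then be fed to the suffix graph starting from a numeral state, instead of adding each numeral surface to the dictionary.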

Find if a verb is transitive, reciprocal, reflexive or not

Transitive verbs (verbs that accept an object) could be found from an annotated corpus.
A non-transitive verb can be converted to transitive by adding the Causative suffix.
This is a very advanced issue, related to POS tagging.

Finding reciprocal verbs is easy and rule based (verbs ending with "ş"; no need for POS tagging).

Reflexive verbs (such as giyinmek) are similar to reciprocal ones.
