Morphological lexicon Sloleks 2.0 parser

What does this parser do?

This parser extracts all data from the 1.5 GB XML file found here and loads it into a SQLite database so that it can be used for further processing.
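As a rough illustration of the idea (not the actual code in convert.py), here is a minimal sketch of streaming an LMF-style XML file into SQLite with the Python standard library. The element names (`LexicalEntry`, `Lemma`, `feat`), the `att`/`val` attributes, the sample data and the `load_lemmas` helper are assumptions made for this example; only the `lemmas` table and its two columns are taken from the export query further down.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in for the real 1.5 GB file; element and attribute names are assumed.
SAMPLE = b"""<Lexicon>
  <LexicalEntry id="LE_1">
    <feat att="besedna_vrsta" val="samostalnik"/>
    <Lemma><feat att="zapis_oblike" val="miza"/></Lemma>
  </LexicalEntry>
  <LexicalEntry id="LE_2">
    <Lemma><feat att="zapis_oblike" val="stol"/></Lemma>
  </LexicalEntry>
</Lexicon>"""


def load_lemmas(xml_source, conn):
    """Stream LexicalEntry elements into a `lemmas` table (hypothetical helper)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lemmas (lexical_entry_id TEXT, zapis_oblike TEXT)"
    )
    count = 0
    # iterparse yields each element once it is complete, so the whole
    # document never has to be parsed into memory up front.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "LexicalEntry":
            continue
        feat = elem.find("Lemma/feat[@att='zapis_oblike']")
        if feat is not None:
            conn.execute(
                "INSERT INTO lemmas VALUES (?, ?)",
                (elem.get("id"), feat.get("val")),
            )
            count += 1
        elem.clear()  # drop the processed subtree to limit memory use
    conn.commit()
    return count


conn = sqlite3.connect(":memory:")
inserted = load_lemmas(io.BytesIO(SAMPLE), conn)
rows = conn.execute(
    "SELECT lexical_entry_id, zapis_oblike FROM lemmas ORDER BY lexical_entry_id"
).fetchall()
```

Streaming with `iterparse` rather than `ET.parse` is the design choice that matters here: a file of this size would otherwise be held in memory as a single tree.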

Why did you convert the data from XML to SQL?

This parser played a key part in building my final project for Harvard's CS50x course: it allowed me to extract about 100,000 lemmas of the Slovenian language and use them in an IndexedDB database in a Chrome extension.

How did you extract the XML data?

Like this:

# download the data
wget https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1230/Sloleks2.0.LMF.zip
# activate a Python virtual environment
python3 -m venv env && source env/bin/activate
# install all dependencies
pip install -r requirements.txt
# run the parser
python convert.py -i Sloleks2.0.LMF.zip -x sloleks_clarin_2.0.xml -v

After about 45 minutes, the data will have been transferred to a 1 GB SQLite database called sloleks.db.
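The `-i` and `-x` flags in the command above take an archive name and an XML file name. One plausible way to handle such a pair, sketched here as an assumption rather than as a description of convert.py, is to read the XML member straight out of the zip archive without unpacking it to disk first. The `open_xml_member` helper and the in-memory archive are hypothetical and exist only for this example.

```python
import io
import zipfile


def open_xml_member(archive, member):
    """Return a binary file-like object for `member` inside `archive`."""
    zf = zipfile.ZipFile(archive)
    if member not in zf.namelist():
        raise FileNotFoundError(f"{member} not found in archive")
    return zf.open(member)  # data is decompressed lazily as it is read


# Build a small in-memory archive as a stand-in for Sloleks2.0.LMF.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr(
        "sloleks_clarin_2.0.xml",
        "<Lexicon><LexicalEntry id='LE_1'/></Lexicon>",
    )
buffer.seek(0)

with open_xml_member(buffer, "sloleks_clarin_2.0.xml") as stream:
    first_bytes = stream.read(9)
```

The returned stream can be handed to any parser that accepts a binary file object, so the uncompressed XML never has to exist as a separate file.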

To extract the data for further use in my Chrome and Firefox extensions, I exported it with this SQL query:

SELECT LOWER(fr.zapis_oblike) AS word,
       wf.msd,
       LOWER(l.zapis_oblike) AS lemma
FROM form_representations fr
JOIN word_forms wf ON fr.word_form_id = wf.id
JOIN lemmas l ON fr.lexical_entry_id = l.lexical_entry_id
WHERE SUBSTR(l.zapis_oblike, 1, 1) NOT IN ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9')
GROUP BY word;
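To show what this query returns, here is a minimal sketch that runs it against a toy in-memory database and serializes the result to JSON, a convenient format for seeding an IndexedDB store. The table and column names come from the query itself; the column types, the sample rows, the MSD values and the JSON step are assumptions made for illustration.

```python
import json
import sqlite3

EXPORT_QUERY = """
SELECT LOWER(fr.zapis_oblike) AS word,
       wf.msd,
       LOWER(l.zapis_oblike) AS lemma
FROM form_representations fr
JOIN word_forms wf ON fr.word_form_id = wf.id
JOIN lemmas l ON fr.lexical_entry_id = l.lexical_entry_id
WHERE SUBSTR(l.zapis_oblike, 1, 1)
      NOT IN ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9')
GROUP BY word
"""

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
conn.executescript(
    """
    -- Assumed minimal schema: only the columns the query touches.
    CREATE TABLE lemmas (lexical_entry_id TEXT, zapis_oblike TEXT);
    CREATE TABLE word_forms (id INTEGER PRIMARY KEY, msd TEXT);
    CREATE TABLE form_representations (
        zapis_oblike TEXT, word_form_id INTEGER, lexical_entry_id TEXT
    );
    -- Made-up sample rows; 'LE_2' starts with a digit and is filtered out.
    INSERT INTO lemmas VALUES ('LE_1', 'Miza'), ('LE_2', '3D');
    INSERT INTO word_forms VALUES (1, 'Sozei'), (2, 'Sozer'), (3, 'X');
    INSERT INTO form_representations VALUES
        ('Miza', 1, 'LE_1'), ('mize', 2, 'LE_1'), ('3D', 3, 'LE_2');
    """
)

# Sort in Python so the output order does not depend on the query plan.
records = sorted(
    (dict(row) for row in conn.execute(EXPORT_QUERY)),
    key=lambda r: r["word"],
)
payload = json.dumps(records, ensure_ascii=False)
```

Note that `GROUP BY word` keeps one row per lowercased word form, so a form that belongs to several lemmas or carries several MSD tags ends up with just one of them.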

Tell me more about Sloleks

Sloleks is the reference morphological lexicon for the Slovenian language, developed for use in NLP applications and language manuals. Encoded in LMF XML, the lexicon contains approximately 100,000 of the most frequent Slovenian lemmas, their inflected or derivative word forms, and the corresponding grammatical descriptions. Lemmatization rules, part-of-speech categorization and the set of feature-value pairs follow the JOS morphosyntactic specifications. In addition to grammatical information, each word form also carries its absolute corpus frequency and its compliance with the reference language standard.

More information about Sloleks can be found here.
