
sloleks-parser's Introduction

Morphological lexicon Sloleks 2.0 parser

What does this parser do?

This parser extracts all data from the 1.5 GB XML file found here and puts it in a SQLite database so that it can be used for further processing.
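The conversion described above can be sketched with the standard library alone. This is a minimal illustration, not the code in convert.py: it streams the XML with `iterparse` so the whole file never sits in memory, and it assumes that each `LexicalEntry` element carries an `id` attribute and a `Lemma` child whose `feat` elements hold `att`/`val` pairs. The function name `load_lemmas`, the two-column `lemmas` table, and the sample document are hypothetical.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in for the real file; the element layout is an assumption.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<LexicalResource>
  <Lexicon>
    <LexicalEntry id="LE_1">
      <feat att="besedna_vrsta" val="samostalnik"/>
      <Lemma><feat att="zapis_oblike" val="hiša"/></Lemma>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>
"""


def load_lemmas(xml_source, conn):
    """Stream LexicalEntry elements into a `lemmas` table, one entry at a time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lemmas ("
        "lexical_entry_id TEXT, zapis_oblike TEXT)"
    )
    count = 0
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "LexicalEntry":
            continue
        lemma = elem.find("Lemma")
        if lemma is not None:
            # Collect the <feat att="..." val="..."/> pairs of the lemma.
            feats = {f.get("att"): f.get("val") for f in lemma.findall("feat")}
            conn.execute(
                "INSERT INTO lemmas VALUES (?, ?)",
                (elem.get("id"), feats.get("zapis_oblike")),
            )
            count += 1
        # Release the entry's children so memory use stays low.
        elem.clear()
    conn.commit()
    return count


conn = sqlite3.connect(":memory:")
inserted = load_lemmas(io.BytesIO(SAMPLE.encode("utf-8")), conn)
rows = conn.execute("SELECT lexical_entry_id, zapis_oblike FROM lemmas").fetchall()
print(inserted, rows)
```

For the real file, `xml_source` would be a path or an open file object instead of the in-memory sample, and the word forms would need their own tables.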

Why did you convert the data from XML to SQL?

This parser played a key part in building my final project for Harvard's CS50x course: it allowed me to extract about 100,000 lemmas of the Slovenian language and use them in an IndexedDB in a Chrome extension.

How did you extract the XML data?

Like this:

# download the data
wget https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1230/Sloleks2.0.LMF.zip
# activate a Python virtual environment
python3 -m venv env && source env/bin/activate
# install all dependencies
pip install -r requirements.txt
# run the parser
python convert.py -i Sloleks2.0.LMF.zip -x sloleks_clarin_2.0.xml -v

After about 45 minutes the data will have been transferred to a 1 GB SQLite database called sloleks.db.

To extract the data for further use in my Chrome and Firefox extensions, I exported it with this SQL query:

SELECT LOWER(fr.zapis_oblike) AS 'word',
       wf.msd,
       LOWER(l.zapis_oblike) AS 'lemma'
FROM form_representations fr
JOIN word_forms wf ON fr.word_form_id = wf.id
JOIN lemmas l ON fr.lexical_entry_id = l.lexical_entry_id
WHERE SUBSTR(l.zapis_oblike, 1, 1) NOT IN ('0', '1','2','3','4','5','6','7','8','9')
GROUP BY word
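The same export can be run from Python and turned into a list of records, ready to be serialized for an extension. This is a sketch under assumptions: the table and column names are taken from the query above, while the function name `export_words` is hypothetical, and the only change to the query is an explicit `AS msd` alias so the column name is predictable.

```python
import sqlite3

# The export query from above, with wf.msd aliased explicitly.
EXPORT_QUERY = """
SELECT LOWER(fr.zapis_oblike) AS word,
       wf.msd AS msd,
       LOWER(l.zapis_oblike) AS lemma
FROM form_representations fr
JOIN word_forms wf ON fr.word_form_id = wf.id
JOIN lemmas l ON fr.lexical_entry_id = l.lexical_entry_id
WHERE SUBSTR(l.zapis_oblike, 1, 1) NOT IN ('0','1','2','3','4','5','6','7','8','9')
GROUP BY word
"""


def export_words(conn):
    """Run the export query and return one dict per distinct lowercased word."""
    cur = conn.execute(EXPORT_QUERY)
    # cursor.description holds one tuple per result column; item 0 is its name.
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur]
```

Calling `export_words(sqlite3.connect("sloleks.db"))` and passing the result to `json.dumps(..., ensure_ascii=False)` would produce a JSON array with the Slovenian characters intact. Note that SQLite's built-in `LOWER` only folds ASCII letters unless the ICU extension is loaded, so uppercase letters such as Č, Š and Ž are left unchanged by the query.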

Tell me more about Sloleks

Sloleks is the reference morphological lexicon for the Slovenian language, developed for use in NLP applications and language manuals. Encoded in LMF XML, the lexicon contains approximately 100,000 of the most frequent Slovenian lemmas, their inflected or derivative word forms, and the corresponding grammatical descriptions. Lemmatization rules, part-of-speech categorization, and the set of feature-value pairs follow the JOS morphosyntactic specifications. In addition to grammatical information, each word form is annotated with its absolute corpus frequency and its compliance with the reference language standard.

More information about Sloleks can be found here.

