EtymDB 2.1

EtymDB 2.1: an etymological database extracted from Wiktionary, described in "Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0".

Previous versions available here. Logo upgraded by Alix Chagué.

Organisation of the repo (and the database)

  • data

    • etymdb.csv is the raw extracted DB csv file
      • Extracted from wiktionary.xml, itself extracted from enwiktionary-latest-pages-articles.xml. Neither file has been added to the repo because of its size; if you need them, please contact the repo owner
    • split_etymdb contains the extracted database, split into several files for easier data analysis
      • etymdb_values: word ix, language identifier (in wiki code), lexeme, gloss (English translation)
      • etymdb_links_info: Direct relation type, child word ix, parent word ix
        • If the parent index is negative (usually for derivation or compounding relations), the relation involves several parents: the negative index can be looked up in etymdb_links_index, where it is associated with the indices of all the parents
      • etymdb_links_index: Multiple parents relation ix, parent 1 ix, parent 2 ix, ... parent n ix
  • extraction_scripts contains all the scripts used for data extraction, included for reproducibility

  • analysis_notebooks contains two Jupyter notebooks to help you get started quickly with the database. One of them reproduces part 7 of the paper

  • static contains the logos
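
As an illustration of the negative-index convention described for etymdb_links_info and etymdb_links_index, here is a minimal Python sketch that resolves the parents of each word. It works on small in-memory sample rows rather than on the real files. The sample indices and the relation labels ("inh", "cmpd") are hypothetical and only follow the column order listed above; `resolve_parents` and `parents_by_child` are illustrative helper names, not part of the repo.

```python
from typing import Dict, List, Tuple

# Hypothetical sample rows, following the column order described above.
# etymdb_links_info: (relation type, child word ix, parent word ix)
links_info: List[Tuple[str, int, int]] = [
    ("inh", 10, 3),    # single parent: word 10 comes from word 3
    ("cmpd", 11, -1),  # several parents: -1 must be looked up in links_index
]

# etymdb_links_index: multiple-parent relation ix -> list of parent word ix
links_index: Dict[int, List[int]] = {
    -1: [4, 5],
}


def resolve_parents(parent_ix: int, index: Dict[int, List[int]]) -> List[int]:
    """Return the real parent word indices behind a parent ix.

    A negative ix is a key into the multiple-parent index; a non-negative
    ix is already a word index.
    """
    if parent_ix < 0:
        return list(index[parent_ix])
    return [parent_ix]


def parents_by_child(
    info: List[Tuple[str, int, int]], index: Dict[int, List[int]]
) -> Dict[int, List[int]]:
    """Map each child word ix to the list of its resolved parent word ix."""
    result: Dict[int, List[int]] = {}
    for _relation, child_ix, parent_ix in info:
        result.setdefault(child_ix, []).extend(resolve_parents(parent_ix, index))
    return result
```

With the sample rows, `parents_by_child(links_info, links_index)` maps word 10 to its single parent and word 11 to both parents stored under index -1.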

Data extraction

You can reproduce all steps of data extraction by using the following commands on your data dump of interest.

Extract your data dump

Download and extract the XML data dump that you want to use, and put it in data/. The dump is a bzip2-compressed XML file, not a tar archive, so it is decompressed with bzip2.

bzip2 -d enwiktionary-date-pages-articles.xml.bz2
mv enwiktionary-date-pages-articles.xml data/
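
If the bzip2 command-line tool is not available, the same step can be sketched in Python with the standard library. This is only an alternative under stated assumptions: `decompress_bz2` is a hypothetical helper name, and the archive is assumed to be a plain .bz2 file whose name ends in ".bz2".

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(archive: Path, target_dir: Path) -> Path:
    """Decompress a .bz2 file into target_dir and return the output path.

    The data is streamed in chunks, so the dump is never fully loaded
    into memory. The original archive is left untouched.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    # "name.xml.bz2" -> "name.xml"
    target = target_dir / archive.stem
    with bz2.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, `decompress_bz2(Path("enwiktionary-date-pages-articles.xml.bz2"), Path("data"))` would write the decompressed dump directly into data/.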

From xml to csv

Then, from the extraction_scripts folder:

cat ../data/enwiktionary-date-pages-articles.xml | perl enwiktionary2xml.pl > ../data/enwiktionary.xml
cat ../data/enwiktionary.xml | perl etymology_analyser.pl > ../data/enwiktionary.csv

From csv to split csv

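From the data folder. The commands below read etymdb.csv, so they assume that the csv file produced in the previous step has been saved under that name.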

# Get only the links index (rows whose first field is a negative index)
awk '$1 ~ /^-/' etymdb.csv > split_etymdb/etymdb_links_index.csv
# Get everything that is not part of the links index
awk '$1 !~ /^-/' etymdb.csv > split_etymdb/etymdb_not_links_index.csv
# Get only lexeme info (rows with more than 3 fields)
awk 'NF > 3 { print $0 }' split_etymdb/etymdb_not_links_index.csv > split_etymdb/etymdb_values.csv
# Get only links info (rows with exactly 3 fields)
awk 'NF == 3 { print $0 }' split_etymdb/etymdb_not_links_index.csv > split_etymdb/etymdb_links_info.csv
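
The routing logic of the awk commands above can be sketched in Python as follows, which may help when checking how a given row gets classified. It mirrors awk's default behaviour of splitting fields on runs of whitespace. `classify_row` and `split_rows` are hypothetical helper names, and the rows in the usage note are made-up samples, not real database entries.

```python
from typing import Dict, Iterable, List, Optional

FILES = ("etymdb_links_index", "etymdb_values", "etymdb_links_info")


def classify_row(line: str) -> Optional[str]:
    """Return the split file a row belongs to, or None if it is dropped."""
    fields = line.split()  # like awk's default field splitting
    if not fields:
        return None  # empty rows match none of the NF tests
    if fields[0].startswith("-"):
        return "etymdb_links_index"  # awk: $1 ~ /^-/
    if len(fields) > 3:
        return "etymdb_values"  # awk: NF > 3
    if len(fields) == 3:
        return "etymdb_links_info"  # awk: NF == 3
    return None  # rows with 1 or 2 fields end up in no final file


def split_rows(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group rows by target file name (without the .csv extension)."""
    buckets: Dict[str, List[str]] = {name: [] for name in FILES}
    for line in lines:
        target = classify_row(line)
        if target is not None:
            buckets[target].append(line)
    return buckets
```

For instance, a made-up row such as `"1\ten\tcat\ta small feline"` is routed to etymdb_values, `"inh\t2\t1"` to etymdb_links_info, and `"-1\t3\t4"` to etymdb_links_index. Because fields are split on any whitespace, a lexeme row whose content collapses to exactly three fields would be classified as a link row, so the split files are worth a quick sanity check.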

Citation

@inproceedings{fourrier-sagot-2020-methodological,
    title = "Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing {E}tym{DB}-2.0",
    author = "Fourrier, Cl{\'e}mentine  and
      Sagot, Beno{\^\i}t",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.392",
    pages = "3207--3216",
    ISBN = "979-10-95546-34-4",
}
