wikt2dict

Wiktionary translation parser tool for many language editions.

Wikt2dict parses only the translation sections. It also has a triangulation mode which combines the extracted translation pairs to generate new ones.

News

Wikt2dict has been rewritten completely, hopefully for the better. If you would like to keep using the old version, it is available at: https://github.com/juditacs/wikt2dict/tree/a08cc896c22dc78db62e1b790c3ec157d00ad08f

Requirements

Wikt2dict should run on any mainstream Linux distribution. It needs Python 2.7 and basic command-line tools that are found on most Linux distributions (wget, bzcat). If you're working with large Wiktionaries such as the English Wiktionary, you need at least 10GB of free space, preferably more. For all supported Wiktionary editions, you need about 35GB of free space.

Installation

git clone https://github.com/juditacs/wikt2dict.git
cd wikt2dict
sudo pip install -e .

You can install wikt2dict in virtualenv if you do not have root access.

A very quick guide to virtualenv:

virtualenv w2d_env
source w2d_env/bin/activate
git clone https://github.com/juditacs/wikt2dict.git
cd wikt2dict
pip install -e .

Note that this way wikt2dict can only be used once the virtualenv has been activated. You need to run source w2d_env/bin/activate every time you log in.

Very quick start

Wikt2dict's basic functionalities can be accessed using the w2d.py script (which should be directly callable after running pip install).

$ w2d.py -h
Wikt2Dict

Usage:
  w2d.py (download|extract|triangulate|all) (--wikicodes=file|<wc>...)

Options:
  -h --help              Show this screen.
  --version              Show version.
  -w, --wikicodes=file   File containing a list of wikicodes.

W2d.py currently supports 3+1 actions. All actions need a list of Wiktionary codes to work with. You can either list the codes manually or provide them in a file (--wikicodes option).

The actions are:

  1. download: download the Wiktionary dumps. Convert them from XML to plaintext with a special page separator. The files are saved in the directory specified in config.py:wiktionary_defaults['dump_path_base']. The default is wikt2dict/dat/wiktionary/
  2. extract: extract translations. The translations are saved to the file specified in config.py:wiktionary_defaults['output_path']. By default this file is wikt2dict/dat/wiktionary//translation_pairs.
  3. triangulate: use triangulation to generate more translations. Triangles are saved to the directory config.py:wiktionary_defaults['triangle_dir'] in separate files named as __. This file would contain pairs in wc1-wc3 languages triangulated via wc2. For more information on triangulating, see: http://aclweb.org/anthology/W/W13/W13-2507.pdf Note that triangulating only makes sense if you specify at least 3 languages.
  4. all: do all of the above.
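
The idea behind the triangulate action can be sketched in a few lines of Python. This is only an illustration of the pivot principle (two words that share a translation in a third language are proposed as a new pair), not wikt2dict's actual implementation. The function name `triangulate` and the in-memory tuple format are assumptions made for this example.

```python
from collections import defaultdict


def triangulate(pairs):
    """Propose new translation pairs through a shared pivot word.

    `pairs` is an iterable of (wc1, word1, wc2, word2) tuples.
    Hypothetical helper for illustration; not part of wikt2dict.
    """
    known = set()
    neighbours = defaultdict(set)
    for wc1, w1, wc2, w2 in pairs:
        a, b = (wc1, w1), (wc2, w2)
        known.add(frozenset((a, b)))
        # Translation pairs are treated as symmetric.
        neighbours[a].add(b)
        neighbours[b].add(a)

    new_pairs = set()
    for pivot, words in neighbours.items():
        for a in words:
            for b in words:
                # Ordering by language code skips same-language pairs
                # and keeps only one direction of each candidate pair.
                if a[0] >= b[0]:
                    continue
                if frozenset((a, b)) not in known:
                    new_pairs.add(a + b)
    return new_pairs


sample = [
    ("hu", "kutya", "el", "σκύλος"),
    ("el", "σκύλος", "oc", "chin"),
]
for wc1, w1, wc2, w2 in sorted(triangulate(sample)):
    print(wc1, w1, wc2, w2)
```

With only two input pairs there is a single pivot (the Greek word), so the sketch proposes exactly one new Hungarian-Occitan pair. Real input would need many more languages for the candidates to be useful.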

Let's try it out on a few small Wiktionary editions.

Downloading the Slovak, the Slovenian and the Limburgish Wiktionaries:

w2d.py download sk sl li

The downloaded and textified Wiktionaries should appear in dat/wiktionary//wiktionary.txt

Extracting translations:

w2d.py extract sk sl li

The extracted translations should appear in dat/wiktionary//translation_pairs.

Now let's try triangulating to get a bunch of new translations:

w2d.py triangulate sk sl li

The results should appear in dat/triangle/ arranged in subdirectories with a maximum of 1000 files per directory to avoid filesystem problems. Using only 3 such small editions for triangulating does not make much sense (it yielded 4 pairs on the April 2014 dumps).

Or do all of it at once:

w2d.py all sk sl li

Output

The output is a tab-separated file. If you only want the translation pairs you should just cut the first 4 columns:

cut -f1-4 <output_file> > <dictionary>

Or without Wiktionary codes:

cut -f2,4 <output_file> > <dictionary>

Where <output_file> should be replaced by the output of either the Wiktionary extraction or the triangulating, and <dictionary> is the file where the filtered columns are saved.

The columns are explained in detail below.

The output extracted from the Wiktionaries has the following columns:

  1. Wiktionary code 1 (language 1)
  2. Word or expression in language 1
  3. Wiktionary code 2 (language 2)
  4. Word or expression in language 2
  5. Wiktionary code of the Wiktionary from which the pair was extracted
  6. Article from which the pair was extracted
  7. Type of parser used (you probably don't need this)

An example:

en      dog     fr      chien   en      dog     defaultparser
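
If you would rather process the extraction output in Python than with cut, a minimal sketch follows. The `Translation` field names and the `read_translations` helper are invented for this example and simply mirror the column list above. Lines that do not have exactly seven tab-separated fields are skipped.

```python
from collections import namedtuple

# Field names are hypothetical; they follow the seven columns listed above.
Translation = namedtuple(
    "Translation",
    ["wc1", "word1", "wc2", "word2", "source_wc", "article", "parser"],
)


def read_translations(lines):
    """Yield one Translation per line that has exactly seven fields."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 7:
            yield Translation(*fields)


sample = ["en\tdog\tfr\tchien\ten\tdog\tdefaultparser\n"]
for t in read_translations(sample):
    print(t.word1, "->", t.word2)  # prints "dog -> chien"
```

To read a real file, pass an open file object instead of the sample list, for example `open(path, encoding="utf-8")`, assuming the output is UTF-8 encoded.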

The triangulating output has the following columns:

  1. Wiktionary code 1 (language 1)
  2. Word or expression in language 1
  3. Wiktionary code 2 (language 2)
  4. Word or expression in language 2
  5-10. The articles and their source Wiktionary that were used to generate this pair

An example:

hu      kutya   oc      chin    hu      kutya   el      σκύλος  oc      chin

Each pair is listed once for every way it was found. I provided a small script that sorts and deduplicates the pairs and counts the number of times each pair appears. Usage (from the wikt2dict base directory):

cat <triangle_files_to_merge> | bash bin/merge_triangle.sh > output_file

To use with all triangle files:

cat <triangle_dir>/*/* | bash bin/merge_triangle.sh > output_file

where the <triangle_dir> should be replaced with the directory where the individual triangle files are stored (triangle_dir option).
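
The counting step can also be approximated in Python. The following is a sketch of the behaviour described above (collapse repeated pairs and count how often each one was found). It is not a port of bin/merge_triangle.sh, and the output format of that script is not reproduced here. `count_pairs` is a hypothetical name, and the second sample line is invented for the example.

```python
from collections import Counter


def count_pairs(lines):
    """Count how often each (wc1, word1, wc2, word2) pair occurs."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Only the first four columns identify the pair; the rest
        # describes how the pair was found.
        if len(fields) >= 4:
            counts[tuple(fields[:4])] += 1
    return counts


sample = [
    "hu\tkutya\toc\tchin\thu\tkutya\tel\tσκύλος\toc\tchin\n",
    # Made-up second route to the same pair, for illustration only.
    "hu\tkutya\toc\tchin\thu\tkutya\ten\tdog\toc\tchin\n",
]
for pair, n in count_pairs(sample).most_common():
    print("\t".join(pair), n, sep="\t")
```

A pair that is reached through several pivots gets a higher count, which can serve as a rough confidence signal when filtering the merged dictionary.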

Congratulations, you have successfully finished the test tutorial of wikt2dict. Please send your feedback to [email protected].

Cite

Please cite:

@InProceedings{acs-pajkossy-kornai:2013:BUCC,  
  author    = {Acs, Judit  and  Pajkossy, Katalin  and  Kornai, Andras},  
  title     = {Building basic vocabulary across 40 languages},  
  booktitle = {Proceedings of the Sixth Workshop on Building and Using Comparable Corpora},  
  month     = {August},  
  year      = {2013},  
  address   = {Sofia, Bulgaria},  
  publisher = {Association for Computational Linguistics},  
  pages     = {52--58},  
  url       = {http://www.aclweb.org/anthology/W13-2507}  
}  

Or this one:

@InProceedings{CS14.864,
author = {Judit Ács},
title = {Pivot-based multilingual dictionary building using Wiktionary},
booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
year = {2014},
month = {may},
date = {26-31},
address = {Reykjavik, Iceland},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-8-4},
language = {english}
}

Known Bugs

  • FIXED - Lithuanian and a few other Wiktionaries have translation tables in many articles, not only in those for Lithuanian words, and these were parsed as if they were Lithuanian words. Language detection for all articles should be added. This issue is fixed, but the configuration should be updated.

  • Logging is not always accurate

Upcoming

wikt2dict's Issues

Wikicodes list is ignored by langnames parser

The langnames type parser (the one that needs the languages' names in a given language) ignores the list of wikicodes specified and instead extracts all languages it has a name for.

Triangulating does not read all source files

Only at most three source files are read for each triangle. These are the ones extracted from the three Wiktionaries belonging to the 3 languages. There could be other translations in other Wiktionaries (such as the German, the Lithuanian, the Azerbaijani). These should also be handled in a memory-efficient way.

Unable to download ta (Tamil) wiktionary

Thanks for sharing this!
I installed it successfully and ran:
w2d.py download en ta
It downloaded the English bz2 file and also created the enwiktionary.txt file. However, Tamil wiktionary was not downloaded.
Looks like Tamil is not a supported language. Can you give me some tips how to add it?
