
mw2slob

This is a tool that converts MediaWiki content into slob dictionaries, using either Wikimedia Enterprise HTML dumps or CouchDB databases created by mwscrape as input.

Using HTML dumps is recommended for most users. mwscrape downloads rendered articles via the MediaWiki API and can be used to obtain article HTML for namespaces that are not available in the Enterprise HTML dumps.

Installation

mw2slob requires Python 3.7 or newer (or Python 3.6 with the dataclasses backport) and depends on the following components:

Install lxml (consult your operating system documentation for installation instructions). For example, on Ubuntu:

sudo apt-get install python3-lxml

Create a Python virtual environment and install slob.py as described at https://github.com/itkach/slob/.

In this virtual environment, run:

pip install git+https://github.com/itkach/mw2slob.git

To run on Python 3.6, mw2slob requires the additional dataclasses dependency:

pip install dataclasses
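
To verify the installation, run

mw2slob --help

This should print usage information listing the available subcommands (siteinfo, dump and scrape).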

Usage

With Wikimedia Enterprise HTML Dumps

# get site's metadata ("siteinfo")
mw2slob siteinfo http://en.wiktionary.org > enwikt.si.json
# compile dictionary
mw2slob dump --siteinfo enwikt.si.json ./enwiktionary-NS0-20220120-ENTERPRISE-HTML.json.tar.gz -f wikt common

Note the `-f wikt common` argument, which specifies the content filters to use when compiling this dictionary. A content filter is a text file containing a list of CSS selectors (one per line); HTML elements matching these selectors are removed during compilation. `mw2slob` includes several filters (see ./mw2slob/filters) that work well for most wikipedias and wiktionaries.
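
For illustration, a filter file might look like the following. These selectors are hypothetical examples of the kind of page chrome one might strip, not the actual contents of the bundled filters:

span.mw-editsection
table.navbox
.noprint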

Wikimedia Enterprise HTML dumps are available only for some namespaces. For most wikipedias the main namespace 0 (articles) is typically the only one of interest. Wiktionaries, on the other hand, often make use of other namespaces. For example, in English Wiktionary many articles link to pages in the Wiktionary or Appendix namespaces, so it makes sense to include their content in the compiled dictionary and turn these links into internal dictionary links rather than links to the Wiktionary web site.

These namespaces are not available as HTML dumps, but can be obtained via the MediaWiki API using mwscrape. Let’s say we want to compile English Wiktionary and include the following namespaces in addition to the main articles: Appendix, Wiktionary, Rhymes, Reconstruction and Thesaurus (sampling random articles indicates that these namespaces are often referenced).

First, we examine the siteinfo (saved in enwikt.si.json) and find the ids for these namespaces:

Wiktionary      4
Appendix        100
Rhymes          106
Thesaurus       110
Reconstruction  118
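
One way to list the ids is with a few lines of Python. This is a sketch that assumes the saved JSON follows the MediaWiki API siteinfo layout, where namespaces are a mapping keyed by id; adjust the lookup path if mw2slob stores the data differently:

import json

with open("enwikt.si.json") as f:
    si = json.load(f)

# Assumption: namespaces live either at the top level or under "query",
# as in a raw MediaWiki siteinfo response.
namespaces = si.get("namespaces") or si.get("query", {}).get("namespaces", {})
for ns_id, ns in sorted(namespaces.items(), key=lambda kv: int(kv[0])):
    # Namespace names are stored under "*" in the MediaWiki API format.
    print(ns_id, ns.get("*") or ns.get("name", ""))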

Then we download rendered articles for these namespaces with mwscrape:

mwscrape https://en.wiktionary.org --db enwikt-wiktionary --namespace 4
mwscrape https://en.wiktionary.org --db enwikt-appendix --namespace 100
mwscrape https://en.wiktionary.org --db enwikt-rhymes --namespace 106
mwscrape https://en.wiktionary.org --db enwikt-thesaurus --namespace 110
mwscrape https://en.wiktionary.org --db enwikt-reconstruction --namespace 118

Each scrape takes some time, but these namespaces are relatively small, so none of them takes too long.

Finally, compile the dictionary:

mw2slob dump --siteinfo enwikt.si.json \
        ./enwiktionary-NS0-20220420-ENTERPRISE-HTML.json.tar.gz \
        http://localhost:5984/enwikt-wiktionary \
        http://localhost:5984/enwikt-appendix \
        http://localhost:5984/enwikt-rhymes \
        http://localhost:5984/enwikt-thesaurus \
        http://localhost:5984/enwikt-reconstruction \
        -f wikt common --local-namespace 4 100 106 110 118

Note that `mw2slob dump` takes the CouchDB URLs of the databases we created with mwscrape in addition to the dump file name.

Also note the `--local-namespace` parameter. It tells the compiler to make links to these namespaces internal dictionary links, just like cross-article links; otherwise they would be converted to web links.

See mw2slob dump --help for a complete list of options.
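
Once compilation finishes, the result can be opened with the slob library installed earlier to check that it contains entries. This is a sketch: the file name is a placeholder for whatever mw2slob actually produced, and it assumes the reader attributes (blob_count, tags) shown in the slob README:

import slob

# Placeholder file name; use the actual output of mw2slob.
r = slob.open("enwiktionary.slob")
try:
    print("blobs:", r.blob_count)
    print("tags:", dict(r.tags))
finally:
    r.close()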

With mwscrape database

Assuming a CouchDB server runs at localhost on port 5984 and has an mwscrape database named simple-wikipedia-org (created with mwscrape simple.wikipedia.org), create a slob using the common and wiki content filters:

mw2slob scrape http://127.0.0.1:5984/simple-wikipedia-org -f common wiki
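
Before compiling, you can confirm the database is reachable via CouchDB's database information endpoint (GET /{db}). A minimal check, assuming the server and database name from the example above:

import json
from urllib.request import urlopen

# GET /{db} returns database metadata; doc_count gives a rough idea
# of how many pages mwscrape has saved.
with urlopen("http://127.0.0.1:5984/simple-wikipedia-org") as resp:
    info = json.load(resp)

print(info["db_name"], info["doc_count"])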

See mw2slob scrape --help for a complete list of options.
