wiki2net

Wiki2net is a set of scripts that extract a dynamic network from Wikipedia database backup dumps. The nodes of the extracted network are articles, and the edges are citations and redirections. Only citations internal to Wikipedia are considered, that is, one article linking to another article. Each link has a start and an end timestamp, so it is straightforward to extract a snapshot of the Wikipedia article network at a given point in time.

The network is stored in a SQLite3 database file, so that it is easy to manipulate with other tools. The scripts are written in Python.

Wikipedia database dumps can be obtained here: http://dumps.wikimedia.org.

Simple usage

The main script is wiki2net.py. It reads a Wikipedia dump file from stdin and writes the extracted network to the SQLite3 database file passed as the first parameter. If the file does not exist, it is created. A simple usage example would be:

bzcat enwiki-20110405-pages-meta-history1.xml.bz2 | ./wiki2net.py wikipedia.db

The idea is to use the bzcat UNIX command so that the dump files never have to be decompressed to disk. The simplest way to extract the full network is to process all the dump files sequentially into the same SQLite3 file.
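
The sequential approach can be sketched as a small Python driver. This is only an illustration, not part of wiki2net: the helper names build_pipeline and process_sequentially are made up for this example, and the glob pattern in the last lines is a guess at how the downloaded dump files are named.

```python
import glob
import subprocess


def build_pipeline(dump_path, db_path, script="./wiki2net.py"):
    """Return the two argv lists equivalent to `bzcat dump | ./wiki2net.py db`."""
    return ["bzcat", dump_path], [script, db_path]


def process_sequentially(dump_paths, db_path):
    """Feed each dump file, one after another, into the same database file."""
    # Sorting only makes the run order reproducible.
    for dump_path in sorted(dump_paths):
        bzcat_cmd, wiki2net_cmd = build_pipeline(dump_path, db_path)
        # Stream the decompressed XML through a pipe; nothing is written to disk.
        bzcat = subprocess.Popen(bzcat_cmd, stdout=subprocess.PIPE)
        try:
            subprocess.run(wiki2net_cmd, stdin=bzcat.stdout, check=True)
        finally:
            bzcat.stdout.close()
            bzcat.wait()
        if bzcat.returncode != 0:
            raise subprocess.CalledProcessError(bzcat.returncode, bzcat_cmd)


if __name__ == "__main__":
    # Hypothetical file pattern; adjust it to the dump files you downloaded.
    dumps = glob.glob("enwiki-*-pages-meta-history*.xml.bz2")
    process_sequentially(dumps, "wikipedia.db")
```

Stopping at the first failing file is a deliberate choice in this sketch, so that a broken download does not go unnoticed in a run that may take days.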

Database schema

The network is stored using the following database schema:

CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, parsed INTEGER DEFAULT 0);
CREATE TABLE link (id INTEGER PRIMARY KEY, orig_id INTEGER, targ_id INTEGER, start_ts INTEGER, end_ts INTEGER);
CREATE TABLE redirect (id INTEGER PRIMARY KEY, orig_id INTEGER, targ_id INTEGER, start_ts INTEGER, end_ts INTEGER);
CREATE INDEX article_id ON article (id);
CREATE INDEX article_title ON article (title);
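
Given this schema, a snapshot of the network at a point in time could be queried roughly as follows. This is a sketch under two assumptions that the schema alone does not confirm: that start_ts and end_ts are comparable integers (for example Unix timestamps), and that a link which still exists has a NULL end_ts. Check how wiki2net.py encodes open-ended links and adjust the WHERE clause accordingly. The function names and the sample rows are made up for illustration.

```python
import sqlite3

# A link belongs to the snapshot if it started at or before `ts` and has not
# ended yet. ASSUMPTION: a link that still exists is stored with end_ts NULL.
SNAPSHOT_SQL = """
SELECT o.title, t.title
FROM link
JOIN article AS o ON o.id = link.orig_id
JOIN article AS t ON t.id = link.targ_id
WHERE link.start_ts <= :ts
  AND (link.end_ts IS NULL OR link.end_ts > :ts)
ORDER BY o.title, t.title
"""


def snapshot_edges(conn, ts):
    """Return (origin title, target title) pairs for links active at `ts`."""
    return conn.execute(SNAPSHOT_SQL, {"ts": ts}).fetchall()


def demo_connection():
    """Build a tiny in-memory database with made-up rows for experimenting."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT,
                              parsed INTEGER DEFAULT 0);
        CREATE TABLE link (id INTEGER PRIMARY KEY, orig_id INTEGER,
                           targ_id INTEGER, start_ts INTEGER, end_ts INTEGER);
    """)
    conn.executemany("INSERT INTO article (id, title) VALUES (?, ?)",
                     [(1, "A"), (2, "B"), (3, "C")])
    conn.executemany(
        "INSERT INTO link (orig_id, targ_id, start_ts, end_ts) VALUES (?, ?, ?, ?)",
        [(1, 2, 100, 200), (1, 3, 150, None), (2, 3, 300, None)])
    return conn


if __name__ == "__main__":
    for origin, target in snapshot_edges(demo_connection(), 175):
        print(origin, "->", target)
```

The same query shape should work for the redirect table, since it has the same columns as link.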

Parallel processing

Processing the dump files takes a considerable amount of time. On the author's Core i7-2600 machine, each dump file takes approximately one day. An obvious way to speed things up is to process several dump files at the same time: process each file into a separate database file, then merge all the outputs with the merge.py script:

./merge.py target.db src1.db src2.db ...
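
The whole parallel workflow could be driven from Python along these lines. This is a minimal sketch, not something shipped with wiki2net: the function names, the one-database-per-dump naming convention and the default of four workers are all assumptions made for the example. Only the bzcat, wiki2net.py and merge.py command lines come from the usage shown above.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def shard_db_name(dump_path):
    """Derive a per-dump database name (hypothetical naming convention)."""
    name = os.path.basename(dump_path)
    suffix = ".xml.bz2"
    if name.endswith(suffix):
        name = name[: -len(suffix)]
    return name + ".db"


def merge_command(target_db, shard_dbs):
    """Build the argv for `./merge.py target.db src1.db src2.db ...`."""
    return ["./merge.py", target_db, *shard_dbs]


def extract(dump_path):
    """Run `bzcat dump | ./wiki2net.py shard.db` and return the shard name."""
    db_path = shard_db_name(dump_path)
    bzcat = subprocess.Popen(["bzcat", dump_path], stdout=subprocess.PIPE)
    try:
        subprocess.run(["./wiki2net.py", db_path], stdin=bzcat.stdout, check=True)
    finally:
        bzcat.stdout.close()
        bzcat.wait()
    return db_path


def process_in_parallel(dump_paths, target_db, workers=4):
    """Extract every dump into its own database, then merge the results."""
    # Threads are enough here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        shard_dbs = list(pool.map(extract, dump_paths))
    subprocess.run(merge_command(target_db, shard_dbs), check=True)
```

Calling process_in_parallel with a list of dump paths and "target.db" would then produce one database per dump plus the merged target. A sensible number of workers depends on the available cores and disk throughput.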

Technical details

Wikipedia dump files are very large XML files that include the content of every revision of every article. Wiki2net has to parse this content to find out when citations and redirections are created or removed.
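
To illustrate the general idea, the following sketch derives link lifetimes by comparing the set of links found in consecutive revisions of one article. It is not taken from wiki2net.py and may differ from what the script actually does. The regular expression is a deliberate simplification: it also matches namespaced links such as files and categories, and it does not treat redirect pages (#REDIRECT [[Target]]) separately. The names extract_links and link_intervals are made up for this example.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#section]]; only the
# target title is captured. Simplified on purpose (see the note above).
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext):
    """Return the set of link targets found in one revision's text."""
    return {m.group(1).strip() for m in WIKILINK.finditer(wikitext)}


def link_intervals(revisions):
    """Turn (timestamp, wikitext) pairs, sorted by timestamp, into a list of
    (target, start_ts, end_ts) tuples. end_ts is None for links that are
    still present in the last revision."""
    open_since = {}  # target -> timestamp of the revision that introduced it
    intervals = []
    for ts, text in revisions:
        current = extract_links(text)
        # Links missing from this revision are closed at its timestamp.
        for target in sorted(set(open_since) - current):
            intervals.append((target, open_since.pop(target), ts))
        # Links that first appear in this revision are opened.
        for target in current - set(open_since):
            open_since[target] = ts
    # Whatever is still open after the last revision stays open-ended.
    for target in sorted(open_since):
        intervals.append((target, open_since[target], None))
    return intervals


if __name__ == "__main__":
    history = [
        (10, "See [[A]] and [[B|b]]."),
        (20, "See [[A]]."),
        (30, "See [[A]], [[B]] and [[C#x]]."),
    ]
    for interval in link_intervals(history):
        print(interval)
```

A link that is removed and later added back yields two separate intervals, which matches the idea of storing a start and an end timestamp per link row.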

Author

Wiki2net is being developed by Telmo Menezes ([email protected]). Feel free to contact the author with any issues regarding this software.

License

Wiki2net is released under the GPLv2 open source license. The full text of the license can be found in the COPYING file.
