dumpster-dip

wikipedia dump parser
by Spencer Kelly, Devrim Yasar, and others

gets a wikipedia xml dump into tiny json files,
so you can get a bunch of easy data.

👍 〰〰〰〰〰〰〰〰 👍

dumpster-dip is a script that allows you to parse a wikipedia dump into ad-hoc data.

dumpster-dive is a script that puts it into mongodb, instead.

use whatever you prefer!

1. Download a dump
cruise the wikipedia dump page and look for ${LANG}wiki-latest-pages-articles.xml.bz2

2. Unzip the dump

bzip2 -d ./path/to/enwiki-latest-pages-articles.xml.bz2

3. Start the javascript

npm install dumpster-dip

import dip from 'dumpster-dip'

const opts = {
  input: '/path/to/my-wikipedia-article-dump.xml',
  parse: (doc) => {
    return doc.sentences()[0].text() // return the first sentence of each page
  }
}
// this promise takes ~4hrs
dip(opts).then(() => {
  console.log('done!')
})

en-wikipedia takes about 4hrs on a macbook.


This tool is intended to be a clean way to pull random bits out of wikipedia, like:

'all the birthdays of basketball players'

await dip({
  input: '/path/to/my-wikipedia-article-dump.xml',
  doPage: (doc) => doc.categories().includes(`American men's basketball players`),
  parse: (doc) => doc.infobox().get('birth_date')
})

It uses wtf_wikipedia as the wikiscript parser.

Outputs:

By default, it outputs an individual file for every wikipedia article. Sometimes operating systems don't like having ~6m files in one folder, though - so it nests them 2-deep, using the first 4 characters of the filename's hash:

/BE
  /EF
    /Dennis_Rodman.txt
    /Hillary_Clinton.txt

as a helper, this library exposes a function for navigating this directory scheme:

import getPath from 'dumpster-dip/nested-path'
let file = getPath('Dennis Rodman')
// ./BE/EF/Dennis_Rodman.txt

This is the same scheme that wikipedia uses internally.
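
To make the nesting idea concrete, here is a rough sketch of how a 2-deep path can be derived from the first 4 characters of a hash. It is not the library's code: the hash here is a small FNV-1a-style stand-in chosen only so the example has no dependencies, and the names fnv1aHex and sketchNestedPath are hypothetical. The folders it prints will most likely differ from what dumpster-dip writes, so use the getPath helper for real lookups.

```javascript
// Hypothetical stand-in hash (32-bit, FNV-1a-style, over UTF-16 code units).
// ASSUMPTION: the hash dumpster-dip really uses may be a different one.
function fnv1aHex(str) {
  let hash = 0x811c9dc5
  for (let i = 0; i < str.length; i += 1) {
    hash ^= str.charCodeAt(i)
    // multiply by the FNV prime, keeping the result an unsigned 32-bit integer
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash.toString(16).padStart(8, '0').toUpperCase()
}

// Build a 2-deep path from the first 4 hex characters of the hash.
function sketchNestedPath(title, root = '.') {
  const filename = title.trim().replace(/ /g, '_')
  const hex = fnv1aHex(filename)
  return `${root}/${hex.slice(0, 2)}/${hex.slice(2, 4)}/${filename}.txt`
}

console.log(sketchNestedPath('Dennis Rodman'))
```

Two hex characters per level gives at most 256 folders at each depth, which keeps every directory listing small even with millions of pages.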

Flat results:

if you want all files in one flat directory, you can do:

let opts = {
  outputDir: './results', 
  outputMode: 'flat', 
}

Results in one file:

if you want all results in one file, you can do:

let opts = {
  outputDir: './results', 
  outputMode: 'ndjson', 
}
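
Reading the ndjson output back is mostly a matter of parsing one JSON value per line. A minimal sketch, assuming each line holds whatever your parse function returned; the parseNdjson name and the sample text are hypothetical, and the name of the file written into outputDir is not shown here:

```javascript
// Parse ndjson text: one JSON value per non-empty line.
function parseNdjson(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line))
}

// Hypothetical sample, shaped like two lines of an ndjson results file.
const sample = '"First sentence of page one."\n{"title":"Page two"}\n'
const rows = parseNdjson(sample)
console.log(rows.length)
```

For a full-size dump the results file may be too big to load as one string, so reading it line by line (for example with Node's readline module) is probably the safer route.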

Options

let opts = {
  // directory for all our new files
  outputDir: './results', // (default)
  // how we should write the results
  outputMode: 'nested', // (default)

  // which wikipedia namespaces to handle (null will do all)
  namespace: 0, //(default article namespace)
  // define how many concurrent workers to run
  workers: cpuCount, // default is cpu count
  //interval to log status
  heartbeat: 5000, //every 5 seconds
  
  // parse redirects, too
  redirects: false, // (default)
  // parse disambiguation pages, too
  disambiguation: true, // (default)

  // what to return, for every page
  parse: (doc) => doc.json(), // (default)
  // should we return anything for this page?
  doPage: (doc) => true, // (default)
  // add plugins to wtf_wikipedia
  extend: (wtf) => {
    wtf.extend((models, templates, infoboxes) => {
      models.Doc.prototype.isPerson = function () {
        return this.categories().find((cat) => cat.match(/people/))
      }
    })
  },
}
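
As a rough mental model of how doPage and parse fit together: doPage decides whether a page is kept, and parse decides what gets written for it. The sketch below runs that flow over stub objects. Everything in it is hypothetical (makeStubDoc, runSketch, the page titles and values), it is not how dumpster-dip is implemented, and the stubs return plain strings where wtf_wikipedia returns richer objects.

```javascript
// Hypothetical stub mimicking the few doc methods used in this README.
function makeStubDoc(title, categories, infobox) {
  return {
    title: () => title,
    categories: () => categories,
    infobox: () => ({ get: (key) => infobox[key] }),
  }
}

// ASSUMED flow: skip pages where doPage is falsy, then collect parse results.
function runSketch(docs, opts) {
  return docs.filter((doc) => opts.doPage(doc)).map((doc) => opts.parse(doc))
}

// made-up pages, only for illustration
const docs = [
  makeStubDoc('Page A', ['Example players'], { birth_date: '1 January 1970' }),
  makeStubDoc('Page B', ['Example towns'], {}),
]

const results = runSketch(docs, {
  doPage: (doc) => doc.categories().includes('Example players'),
  parse: (doc) => doc.infobox().get('birth_date'),
})
console.log(results)
```

Trying your doPage and parse functions on a couple of stubs like this is a cheap way to catch mistakes before starting a run that takes hours.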

work in progress

MIT
