👪 Wiki-Family-Utils: Extracting plain text from Wikimedia Project Dumps

Simple scripts to extract plain text from the dump files of Wikimedia projects. They process the HTML data from Wikimedia Enterprise HTML dumps and clean it up to produce plain text.
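
To give a feel for what "cleaning HTML into plain text" involves, here is a minimal sketch using only Python's standard library. It is not the code in extract.py; the class name TextExtractor, the helper html_to_text, and the choice of which tags to skip or treat as line breaks are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    # Hypothetical tag sets chosen for this sketch, not taken from extract.py.
    SKIPPED_TAGS = {"script", "style"}
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "section",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the visible text of an HTML fragment, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Real dump pages contain navigation boxes, references, and other markup that needs project-specific rules, so a production cleaner will be considerably more involved than this.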

Installation

poetry install

Usage

Download

To download a dump file to process, run the following command:

poetry run python download.py ja wiktionary

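You need to specify the language and the project as arguments. The project values used in the example script in this document are wiki, wiktionary, wikibooks, wikinews, wikiquote, wikisource, wikiversity, and wikivoyage.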

By default, the file will be saved in the data directory.
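
If you want to peek inside a downloaded archive before running the extraction, the following sketch may help. It assumes the .json.tar.gz archive holds one or more files of newline-delimited JSON (one page per line); the function name iter_dump_records is hypothetical, and the record fields are not documented here, so the sketch only parses each line without relying on any particular key.

```python
import io
import json
import tarfile


def iter_dump_records(archive_path):
    """Yield one parsed JSON value per non-empty line of every file in the archive.

    Assumption: each regular file in the archive is UTF-8, newline-delimited JSON.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            raw = archive.extractfile(member)
            if raw is None:
                continue
            # Decode lazily so large members are never loaded into memory at once.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)
```

For example, `next(iter_dump_records(path)).keys()` shows which fields the first record carries, provided each record is a JSON object.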

Extract

To extract plain text from the downloaded file, run the following command:

PATH_TO_DOWNLOADED_FILE="data/jawiktionary-NS0-20231101-ENTERPRISE-HTML.json.tar.gz"
poetry run python extract.py "$PATH_TO_DOWNLOADED_FILE" plain_text

By default, the extracted data are saved as a JSONL file in the same directory as the input file.
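
A JSONL file stores one JSON value per line, so it can be read back with a few lines of Python. This is a minimal sketch; the helper name read_jsonl is hypothetical, and because the output schema is not described here, the code makes no assumption about which keys each record contains.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the function is a generator, large outputs can be processed record by record without loading the whole file.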

You can set the output type to either plain_text or passages.

  • plain_text: Each page is processed as a single plain text, which is useful for training a language model.
  • passages: Each page is processed as a list of passages, which is useful for training a passage retrieval model.
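
To illustrate the difference between the two output types, the sketch below turns one page of plain text into a list of passages. How extract.py actually forms passages is not described here, so this greedy paragraph packing, the function name split_into_passages, and the max_chars parameter are all assumptions for illustration only.

```python
def split_into_passages(text, max_chars=400):
    """Greedily pack consecutive paragraphs into passages.

    Each passage holds at most max_chars characters, except that a single
    paragraph longer than the limit is kept whole as its own passage.
    """
    passages = []
    current = ""
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: close the passage.
            passages.append(current)
            current = paragraph
    if current:
        passages.append(current)
    return passages
```

With plain_text, the whole page would stay one string; with a passage-style output, each element of the returned list would be indexed separately by a retrieval model.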

โš ๏ธ Note that some parameters in extract.py are hard-coded for Japanese text ๐Ÿ‡ฏ๐Ÿ‡ต. To get the cleanest text in your language, you may need to tweak the code.

Example

Here is an example script to download and process all the files of Wikimedia projects.

# Project names passed to download.py
wiki_sites=("wiki" "wiktionary" "wikibooks" "wikinews" "wikiquote" "wikisource" "wikiversity" "wikivoyage")
language="ja"

# Download the dump of every project for the chosen language
for site in "${wiki_sites[@]}"; do
    poetry run python download.py "$language" "$site"
done

# Extract plain text from every downloaded archive
directory="data"
for file in "$directory"/*.tar.gz; do
    poetry run python extract.py "$file" plain_text
done

License

Wiki-Family-Utils is licensed under the Apache License, Version 2.0.

Acknowledgement

This project is kind of a child 👦 of Wikipedia-Utils, serving similar use cases but with customized processing.

Contributors

ryokan0123
