German Wikipedia Text Corpus

A more recent version of the text corpus is published here: https://github.com/GermanT5/wikipedia2corpus

This is a German text corpus built from Wikipedia. It has been cleaned, preprocessed and split into sentences. Its purpose is to train NLP embeddings such as fastText or ELMo (deep contextualized word representations).

The advantage of this corpus is that it contains not only the article namespace of the wiki but also the talk pages (discussions). This makes the corpus larger and adds more informal language, which should improve the quality of downstream tasks that process conversational text such as emails, chats, tweets or support tickets.

How this corpus has been generated

A Wikipedia dump was used as the data source.

The WikiExtractor tool was then used to extract the text from the XML dump. To include the talk pages as well, its keepPage function was modified as follows:

def keepPage(ns, page):
    # keep only namespace '0' (articles) and '1' (talk pages)
    if ns not in ('0', '1'):
        print('skipped ns:', ns)
        return False
    # remove disambiguation pages if desired
    if options.filter_disambig_pages:
        for line in page:
            if filter_disambig_page_pattern.match(line):
                return False
    return True

A custom Python script was then used for further processing: https://github.com/PhilipMay/de-wiki-text-corpus-tools/blob/master/process_wiki_files.py

  • SoMaJo was used for tokenization and sentence splitting
  • spaCy and gensim were also tested for tokenization and sentence splitting, but they did not perform as well as SoMaJo on German text
  • article headlines and some markup were removed
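
The tokenization and sentence splitting step could look roughly like the following sketch. It is not the code from process_wiki_files.py. It assumes the SoMaJo Python package (version 2 or later) with its "de_CMC" model, and the helper names sentences_to_lines and split_paragraphs are hypothetical. Writing one sentence per line with space-separated tokens is also an assumption about the output format.

```python
from typing import Iterable, List


def sentences_to_lines(sentences: Iterable[Iterable]) -> List[str]:
    """Turn tokenized sentences into text lines, one sentence per line.

    Each sentence is an iterable of token objects with a ``text`` attribute,
    which is what SoMaJo yields. Tokens are joined with single spaces
    (an assumed output format, not confirmed by the original script).
    """
    return [" ".join(token.text for token in sentence) for sentence in sentences]


def split_paragraphs(paragraphs: List[str]) -> List[str]:
    """Tokenize and sentence-split German paragraphs with SoMaJo."""
    # Imported lazily so that sentences_to_lines works without SoMaJo installed.
    from somajo import SoMaJo

    # "de_CMC" is SoMaJo's model for German web and social media text.
    tokenizer = SoMaJo("de_CMC")
    return sentences_to_lines(tokenizer.tokenize_text(paragraphs))
```

Calling split_paragraphs with a list of paragraph strings returns a list of sentence strings that can be written to a file line by line.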

Finally, the whole corpus was shuffled at sentence level with the Linux shuf command.
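
For readers without shuf, a sentence-level shuffle can be approximated in Python. This is a minimal sketch that assumes one sentence per line and a file small enough to fit in memory. The function name shuffle_lines and the seed parameter are hypothetical and not part of the original tooling.

```python
import random
from pathlib import Path
from typing import Optional


def shuffle_lines(src: Path, dst: Path, seed: Optional[int] = None) -> int:
    """Write the lines of src to dst in random order and return the line count."""
    # Read every line into memory and strip only the trailing newline.
    with src.open(encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    # A dedicated Random instance makes the shuffle reproducible when seeded.
    random.Random(seed).shuffle(lines)
    with dst.open("w", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in lines)
    return len(lines)
```

Unlike shuf, this version holds the whole file in memory, so for a corpus of this size the command-line tool is likely the more practical choice.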

Download

You can download the texts here:

  • wiki-all-shuf.tgz.part-00
    • MD5: 9cd27b9a22ee4de391435b4bcbb30428
    • SHA1: 66ccc99ccfeb4b546f9c888af9b23e5fc1a67236
  • wiki-all-shuf.tgz.part-01
    • MD5: bf187bdda21ea9f7af1ecdf085ca54d5
    • SHA1: 73749be0285a6e359dead08acf65b33e9a55c9b4
  • wiki-all-shuf.tgz.part-02
    • MD5: b887df79f54d7d36d3da22e5e6f8add1
    • SHA1: f9c773821bee112b976ae5247ac55ffdba6e20f7
  • wiki-all-shuf.tgz (the combined archive that results from joining the three parts, see Unpack)
    • MD5: 51ddcca730dca6e48c29d6339c2059f9
    • SHA1: f1c7ef0245abca47d3be2657ac4c345a3dc8d121
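
The checksums above can be verified after downloading. The following sketch uses Python's standard hashlib module. The function names file_digest and verify are hypothetical, and the EXPECTED table only shows the entry for the first part, copied from the list above.

```python
import hashlib
from pathlib import Path

# (MD5, SHA1) pairs copied from the download list; extend with the other files.
EXPECTED = {
    "wiki-all-shuf.tgz.part-00": (
        "9cd27b9a22ee4de391435b4bcbb30428",
        "66ccc99ccfeb4b546f9c888af9b23e5fc1a67236",
    ),
}


def file_digest(path: Path, algorithm: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path: Path, md5: str, sha1: str) -> bool:
    """Return True only if both the MD5 and the SHA1 digest match."""
    return file_digest(path, "md5") == md5 and file_digest(path, "sha1") == sha1
```

For example, verify(Path("wiki-all-shuf.tgz.part-00"), *EXPECTED["wiki-all-shuf.tgz.part-00"]) should return True for an intact download.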

Unpack

Using these commands, you can unpack the files (Linux and macOS):

cat wiki-all-shuf.tgz.part-* > wiki-all-shuf.tgz
tar xvfz wiki-all-shuf.tgz
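
On systems without cat and tar (for example Windows), the same two steps can be sketched in Python with the standard library. The function names join_parts and extract are hypothetical. The sketch assumes that all part files are in one directory and that their names sort in the correct order, as part-00 to part-02 do.

```python
import shutil
import tarfile
from pathlib import Path


def join_parts(directory: Path, archive_name: str = "wiki-all-shuf.tgz") -> Path:
    """Concatenate <archive_name>.part-* files (in sorted order) into one archive."""
    parts = sorted(directory.glob(archive_name + ".part-*"))
    if not parts:
        raise FileNotFoundError(f"no parts of {archive_name} found in {directory}")
    target = directory / archive_name
    with target.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                # Stream the part into the target without loading it into memory.
                shutil.copyfileobj(src, out)
    return target


def extract(archive: Path, dest: Path) -> None:
    """Unpack a gzip-compressed tar archive into dest (only use trusted archives)."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```

A typical call would be extract(join_parts(Path(".")), Path(".")), which mirrors the two shell commands above.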

License

Like Wikipedia itself, this corpus is published under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

Contributors

philipmay

german-wikipedia-text-corpus's Issues

Fourth file is not available for download

Hi!

Of the four downloadable files listed on this page, only three have a link. The fourth, "wiki-all-shuf.tgz", cannot be downloaded, and therefore the archive can't be extracted.

Thanks for looking into this issue!
