henry

Description

Scripts to import the dictionary Lexique étymologique du breton moderne (Q19216625) by Victor Henry (Q1386172) from Wikisource into Wikidata's lexicographical data. The dictionary is written in French and describes the Breton language.

Dependencies

  • PHP 7
  • Python 3

Installation

Install the dependencies. Example on a Debian-like system:

apt install php python3 python3-pip

Download the project:

git clone "https://github.com/envlh/henry.git"

Install the Python requirements by running the following command at the root of the project:

pip3 install -r requirements.txt

Configuration

The bot uses Pywikibot. One way to log in to Wikidata is with a bot password.

Download Pywikibot:

git clone "https://gerrit.wikimedia.org/r/pywikibot/core"

After creating your bot password, generate the configuration files from the root of the Pywikibot checkout:

python3 pwb.py generate_user_files.py

Copy the generated files user-config.py and user-password.py to the root of the henry project.
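As a rough guide to what to expect in those two files, the sketch below renders a minimal bot-password setup for Wikidata. The user name, bot name, and secret are placeholders, the helper render_pywikibot_config is hypothetical, and the files written by the generator script may contain more settings than shown here.

```python
def render_pywikibot_config(username, bot_name, bot_secret):
    """Return the text of the two Pywikibot files, keyed by file name.

    Hypothetical helper for illustration; the generator script normally
    writes these files for you.
    """
    user_config = (
        "family = 'wikidata'\n"
        "mylang = 'wikidata'\n"
        f"usernames['wikidata']['wikidata'] = '{username}'\n"
        "password_file = 'user-password.py'\n"
    )
    # One tuple per account: the main user name plus its bot password.
    user_password = (
        f"('{username}', BotPassword('{bot_name}', '{bot_secret}'))\n"
    )
    return {"user-config.py": user_config, "user-password.py": user_password}


# Placeholder values, to be replaced with your own account details.
files = render_pywikibot_config("ExampleUser", "henry-import", "REPLACE_WITH_SECRET")
for name, text in files.items():
    print(f"--- {name} ---")
    print(text)
```

Comparing the printed text with your generated files is a quick way to check that the bot password, and not the main account password, is the one being used.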

Usage

Crawler

Retrieves the content from Wikisource, aggregates all pages into one file, and cleans up the wikitext.

php -f crawler.php

Two files are generated:

  • wikitext.txt: raw wikitext crawled from Wikisource (useful for debugging)
  • stripped.txt: wikitext after cleaning

Parser

Parses the file created by the crawler and converts it into a machine-readable format.

python3 parser.py

Several files are generated:

  • lexemes.json: lexemes that will be imported into Wikidata, serialized in Wikibase JSON format
  • lexemes.txt: a more human-readable list of the lexemes that will be imported
  • errors.json: rejected lexemes, with the reason for each rejection
  • monograms.json and bigrams.json: frequencies of single letters and of letter pairs in lemmas
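To illustrate the kind of statistics stored in monograms.json and bigrams.json, here is a minimal sketch of counting single letters and letter pairs over a list of lemmas. It is not the code used by parser.py: the function name ngram_frequencies, the lowercasing step, and the sample lemmas are assumptions made for the example.

```python
from collections import Counter


def ngram_frequencies(lemmas, n):
    """Count every run of n consecutive characters across all lemmas."""
    counts = Counter()
    for lemma in lemmas:
        word = lemma.lower()  # assumption: case is ignored
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


# Illustrative lemmas only, not taken from the parser output.
sample_lemmas = ["abad", "aber", "bara"]

monograms = ngram_frequencies(sample_lemmas, 1)
bigrams = ngram_frequencies(sample_lemmas, 2)

print(monograms.most_common(3))
print(bigrams.most_common(3))
```

Unusual letters or letter pairs in such counts can point to transcription or parsing problems worth checking before the import.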

Import

Imports the data into Wikidata's lexicographical data.

python3 bot.py
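For orientation, the sketch below builds a bare-bones lexeme in the general shape of Wikibase JSON. It only shows the top-level keys of that format. The helper build_lexeme and the sample lemma are hypothetical, and the entries actually written to lexemes.json are likely to carry more data (statements, forms, senses) than shown here.

```python
import json


def build_lexeme(lemma, lexical_category_qid):
    """Return a minimal lexeme as a dict (hypothetical helper)."""
    return {
        "type": "lexeme",
        "language": "Q12107",  # Wikidata item for the Breton language
        "lexicalCategory": lexical_category_qid,
        # Lemmas are keyed by language code; "br" is the code for Breton.
        "lemmas": {"br": {"language": "br", "value": lemma}},
    }


# Illustrative entry: Q1084 is the Wikidata item for "noun".
lexeme = build_lexeme("bara", "Q1084")
print(json.dumps(lexeme, ensure_ascii=False, indent=2))
```

Inspecting a few entries of lexemes.json in this way before running the bot helps confirm that the language and lexical category are what you expect.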

Copyright

This project, mainly by Envel Le Hir (@envlh) for the code and Nicolas Vigneron (@belett) for the Wikisource transcription, is released under the CC0 license (public domain dedication).
