go-wiktionary-parse

This is a tool to parse language dumps from Wiktionary and store the results in an SQLite database.

Quickstart

git clone https://github.com/macdub/go-wiktionary-parse
cd go-wiktionary-parse
wget https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2
bzip2 -d enwiktionary-latest-pages-articles.xml.bz2
go install .
go-wiktionary-parse -file enwiktionary-latest-pages-articles.xml -threads 20 -database test.db

Usage

Usage of wiktionary-parser:
    -cache_file string
        Use this as the cache file (default "xmlCache.gob")
    -database string
        Database file to use (default "database.db")
    -file string
        XML file to parse
    -lang string
        Language to target for parsing (default "English")
    -log_file string
        Log to this file
    -make_cache
        Make a cache file of the parsed XML
    -threads int
        Set the number of threads to use for parsing (default 5)
    -use_cache
        Use a 'gob' of the parsed XML file
    -purge
        Purge the existing database provided by the database flag
    -verbose
        Use verbose logging
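
The flag list above maps naturally onto Go's standard `flag` package. The following is a minimal sketch of how such a flag set could be declared, not the tool's actual source: the `Options` struct and the `parseOptions` helper are hypothetical names introduced here, while the flag names, defaults, and help strings are copied from the usage text.

```go
package main

import (
	"flag"
	"fmt"
)

// Options mirrors the flags in the usage text. The struct itself is an
// assumption made for this sketch; the real tool may store them differently.
type Options struct {
	CacheFile string
	Database  string
	File      string
	Lang      string
	LogFile   string
	MakeCache bool
	Threads   int
	UseCache  bool
	Purge     bool
	Verbose   bool
}

// parseOptions parses args (without the program name) into Options,
// applying the defaults shown in the usage text.
func parseOptions(args []string) (Options, error) {
	var o Options
	fs := flag.NewFlagSet("wiktionary-parser", flag.ContinueOnError)
	fs.StringVar(&o.CacheFile, "cache_file", "xmlCache.gob", "Use this as the cache file")
	fs.StringVar(&o.Database, "database", "database.db", "Database file to use")
	fs.StringVar(&o.File, "file", "", "XML file to parse")
	fs.StringVar(&o.Lang, "lang", "English", "Language to target for parsing")
	fs.StringVar(&o.LogFile, "log_file", "", "Log to this file")
	fs.BoolVar(&o.MakeCache, "make_cache", false, "Make a cache file of the parsed XML")
	fs.IntVar(&o.Threads, "threads", 5, "Set the number of threads to use for parsing")
	fs.BoolVar(&o.UseCache, "use_cache", false, "Use a 'gob' of the parsed XML file")
	fs.BoolVar(&o.Purge, "purge", false, "Purge the existing database provided by the database flag")
	fs.BoolVar(&o.Verbose, "verbose", false, "Use verbose logging")
	err := fs.Parse(args)
	return o, err
}

func main() {
	// Same arguments as the Quickstart invocation.
	opts, err := parseOptions([]string{
		"-file", "enwiktionary-latest-pages-articles.xml",
		"-threads", "20",
		"-database", "test.db",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", opts)
}
```

Flags that are not passed keep their defaults, so the Quickstart invocation would still target "English" and use "xmlCache.gob" as the cache file name.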

Build

Dependencies

Build

$ go build -o wiktionary-parser main.go

Current Limitations

  • Only 14 lemmas are parsed.
  • Definitions are not cleaned, so they are stored as raw wiki markup. This will be fixed in the near future.
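
Until definitions are cleaned by the tool itself, a consumer of the database can strip the most common markup on its own. The sketch below is one possible approach and not part of this project: `cleanDefinition` is a hypothetical helper that unwraps `[[links]]`, removes bold and italic quote runs, and drops non-nested `{{templates}}`. Dropping templates is a deliberate simplification, since Wiktionary templates often carry meaning (labels, glosses) that would be lost.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// [[target|label]] -> label
	pipedLinkRe = regexp.MustCompile(`\[\[[^\]|]*\|([^\]]*)\]\]`)
	// [[target]] -> target
	plainLinkRe = regexp.MustCompile(`\[\[([^\]|]*)\]\]`)
	// {{...}} without nested braces -> removed (simplification)
	templateRe = regexp.MustCompile(`\{\{[^{}]*\}\}`)
	// ''italic'' and '''bold''' markers -> removed
	quotesRe = regexp.MustCompile(`'{2,}`)
	// runs of whitespace -> single space
	spacesRe = regexp.MustCompile(`\s+`)
)

// cleanDefinition removes a small subset of wiki markup from raw.
// It is a hypothetical helper for illustration only.
func cleanDefinition(raw string) string {
	s := pipedLinkRe.ReplaceAllString(raw, "$1")
	s = plainLinkRe.ReplaceAllString(s, "$1")
	s = templateRe.ReplaceAllString(s, "")
	s = quotesRe.ReplaceAllString(s, "")
	s = spacesRe.ReplaceAllString(s, " ")
	return strings.TrimSpace(s)
}

func main() {
	raw := "{{lb|en|informal}} A [[domestic]] [[cat|feline]] that is '''small'''."
	fmt.Println(cleanDefinition(raw))
}
```

Nested templates and HTML tags are left untouched by this sketch; handling them properly needs a real wikitext parser rather than regular expressions.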

Database

Structure

  • table name: dictionary

    COLUMN          TYPE
    id              integer
    word            text
    lemma           text
    etymology_no    integer
    definition_no   integer
    definition      text

  • Primary key is on id
  • An index is set up over word, lemma, etymology_no, definition_no
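
The structure above can be written out as SQL and read back from Go through `database/sql`. This is a sketch under assumptions, not the project's own code: the index name `dict_word_idx`, the single composite index, the `Entry` struct, and the `lookup` helper are all hypothetical. Only the table name, column names, and column types come from the description above. The caller is expected to open the database with whichever SQLite driver they prefer.

```go
package main

import (
	"database/sql"
	"fmt"
	"strings"
)

// Entry mirrors one row of the dictionary table.
type Entry struct {
	ID           int64
	Word         string
	Lemma        string
	EtymologyNo  int
	DefinitionNo int
	Definition   string
}

// schemaSQL is an assumed rendering of the documented structure.
// The index name is made up for this sketch.
const schemaSQL = `
CREATE TABLE IF NOT EXISTS dictionary (
    id            INTEGER PRIMARY KEY,
    word          TEXT,
    lemma         TEXT,
    etymology_no  INTEGER,
    definition_no INTEGER,
    definition    TEXT
);
CREATE INDEX IF NOT EXISTS dict_word_idx
    ON dictionary (word, lemma, etymology_no, definition_no);`

// columns lists the table columns in the order Entry is scanned.
var columns = []string{"id", "word", "lemma", "etymology_no", "definition_no", "definition"}

// lookupQuery builds the SELECT used to fetch every definition of one word.
func lookupQuery() string {
	return "SELECT " + strings.Join(columns, ", ") +
		" FROM dictionary WHERE word = ?" +
		" ORDER BY lemma, etymology_no, definition_no"
}

// lookup returns all entries for word. db must already be opened
// with a registered SQLite driver; none is imported here.
func lookup(db *sql.DB, word string) ([]Entry, error) {
	rows, err := db.Query(lookupQuery(), word)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []Entry
	for rows.Next() {
		var e Entry
		if err := rows.Scan(&e.ID, &e.Word, &e.Lemma,
			&e.EtymologyNo, &e.DefinitionNo, &e.Definition); err != nil {
			return nil, err
		}
		out = append(out, e)
	}
	return out, rows.Err()
}

func main() {
	fmt.Println(schemaSQL)
	fmt.Println(lookupQuery())
}
```

Because the assumed index starts with `word`, a lookup by word alone can use it, and the `ORDER BY` follows the remaining index columns.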

Statistics

  • The built database file (20200506) is ~127MB (51MB compressed)
    • 914,799 words
    • 1,098,087 definitions
    • 14 lemmas

go-wiktionary-parse's People

Contributors

faddat, macdub, soniccat