eiwa / 英和

Parses two types of Japanese-English dictionaries:

  • :jmdict_e - JMDict's English-only export of the WWWJDIC online Japanese dictionary.
  • :kanjidic2 - the KANJIDIC2 dictionary of roughly 13,000 kanji characters.

Usage

Install

Install the gem:

gem install eiwa

Or add it to your Gemfile:

gem 'eiwa'

Download a supported dictionary

Get a copy of a supported dictionary. eiwa parses JMDict (the Japanese-English export) and KANJIDIC2, both of which can be fetched from the EDRDG site directly or with a script like this:

# Download JMDICT-E:
$ curl http://ftp.edrdg.org/pub/Nihongo/JMdict_e.gz -o jmdict.xml.gz
# Unzip to jmdict.xml
$ gunzip jmdict.xml.gz

# Download KANJIDIC2:
$ curl http://www.edrdg.org/kanjidic/kanjidic2.xml.gz -o kanjidic2.xml.gz
# Unzip to kanjidic2.xml
$ gunzip kanjidic2.xml.gz

These files are updated daily, and are essentially an export of all vocabulary and kanji in the WWWJDIC application.
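
If you would rather do the unzip step from Ruby (for example, in a rake task that refreshes the daily files), a minimal sketch using the standard library's Zlib might look like the following. The `gunzip_file` helper and its `chunk_size` parameter are hypothetical names for illustration and are not part of eiwa. Unlike the `gunzip` command, this sketch leaves the original `.gz` file in place.

```ruby
require "zlib"

# Hypothetical helper (not part of eiwa): decompress a .gz file to dest_path
# in fixed-size chunks, so the whole dictionary is never held in memory.
def gunzip_file(src_path, dest_path, chunk_size: 64 * 1024)
  Zlib::GzipReader.open(src_path) do |gz|
    File.open(dest_path, "wb") do |out|
      # GzipReader#read(len) returns nil once the stream is exhausted.
      while (chunk = gz.read(chunk_size))
        out.write(chunk)
      end
    end
  end
  dest_path
end
```

With the files from the script above, `gunzip_file("jmdict.xml.gz", "jmdict.xml")` would produce the same `jmdict.xml` as the `gunzip` step.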

Parse the dictionary

The eiwa gem uses nokogiri's event-driven SAX parser to work through the very large XML file efficiently, because loading a full DOM into memory is very resource-intensive. With that in mind, eiwa's parsing method has two modes: one returns every dictionary entry in an array, and the other invokes a provided block with each entry without retaining a reference to it, allowing Ruby to garbage collect entries as it goes.
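
The difference between the two modes can be illustrated with a small, dependency-free sketch. This is not eiwa's implementation: `parse_entries` is a hypothetical stand-in that treats each non-blank line of its input as one "entry", whereas eiwa builds entries from SAX events over XML. Only the block-versus-array pattern is the point here.

```ruby
require "stringio"

# Hypothetical stand-in for a streaming parser (not eiwa's code): each
# non-blank line of the input is treated as one "entry".
def parse_entries(io)
  entries = block_given? ? nil : []
  io.each_line do |line|
    entry = line.strip
    next if entry.empty?

    if block_given?
      yield entry       # hand the entry to the caller; keep no reference
    else
      entries << entry  # retain every entry for the returned array
    end
  end
  entries
end

source = "食べる\n飲む\n\n見る\n"

# Block mode: entries are processed one at a time and can be collected
# by the garbage collector as soon as the block returns.
count = 0
parse_entries(StringIO.new(source)) { |_entry| count += 1 }

# Array mode: every entry stays in memory until the array is released.
all = parse_entries(StringIO.new(source))
```

In block mode the method's memory use stays roughly constant however large the input is, while in array mode it grows with the number of entries, which is the trade-off behind the two modes described above.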

Passing a block

If you just want to do some processing on each entry, it probably makes sense to invoke the library by passing a block (note that the only supported types are :jmdict_e and :kanjidic2):

Eiwa.parse_file("path/to/some.xml", type: :jmdict_e) do |entry|
  # Do something with that entry
end

This approach can parse the entire JMDICT-E dictionary in a Ruby 2.6 process that uses about 15MB of memory.

Return the results in an array

If you're just going to add all the entries to an array or otherwise retain them in memory, you can call the same method without a block, and it will return all the entries in an array.

entries = Eiwa.parse_file("path/to/some.xml", type: :jmdict_e)

Note that for the abridged Japanese-English dictionary, this will consume about 500MB of RAM.
