Bunpa

Bunpa is an extremely simple wrapper around the MeCab Japanese grammar parser. It was designed with two key features in mind:

  1. Simplicity - only returns the text and major part of speech for each component
  2. Completeness - ensures that whitespace and any unknown characters are preserved

Background

Bunpa parses Japanese text into an ordered sequence of components. Each component represents either a part of speech (noun, verb, etc.) or formatting (whitespace, etc.). Every component has a text value (exactly as it appears in the text provided) and a kind (usually its part of speech).

All grammatical information is provided by the excellent MeCab Japanese part-of-speech and morphological analyser. Formatting information is inserted into the sequence of components in a post-processing step (it is not done by MeCab). Because formatting is not a part of speech, these components are assigned a synthetic 'kind'. Bunpa currently handles the following kinds of formatting components:

  • spaces (スペース)
  • tabs (タブ)
  • newlines (改行)

Any components that cannot be identified by either MeCab or Bunpa are marked as unknown (未知).
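As a rough illustration of how these synthetic kinds might be used, the sketch below separates formatting components from grammatical ones. It does not call Bunpa or MeCab: `Component` is a stand-in Struct that only mimics the `text` and `kind` accessors, the helper names `formatting?` and `unknown?` are hypothetical, and it assumes the kind strings are exactly the Japanese labels listed above.

```ruby
# Stand-in for Bunpa::Text::Component (assumption: only `text` and `kind` matter here).
Component = Struct.new(:text, :kind)

# Kind labels taken from the list above; assumed to be returned verbatim.
FORMATTING_KINDS = ['スペース', 'タブ', '改行'].freeze
UNKNOWN_KIND = '未知'

# True when the component is whitespace inserted by the post-processing step.
def formatting?(component)
  FORMATTING_KINDS.include?(component.kind)
end

# True when neither MeCab nor Bunpa could identify the component.
def unknown?(component)
  component.kind == UNKNOWN_KIND
end

# Hand-built sample data, not real parser output.
components = [
  Component.new('元気', '名詞'),
  Component.new('です', '助動詞'),
  Component.new(' ', 'スペース'),
  Component.new("\n", '改行')
]

# Keep only the grammatical components.
words = components.reject { |c| formatting?(c) }
puts words.map(&:text).join('|')
```

With a real parser, the same helpers could be applied directly to the Array returned by `parse`, provided the kind labels match.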

Installation

From within your application's base directory:

  1. Edit your Gemfile and add:

     gem 'bunpa'
    
  2. Install the gem:

     bundle
    

Usage

Bunpa operates as a very simple parser. It returns the components it identifies as an Array of Bunpa::Text::Component objects, in the same order as they appear in the document. Each Component object has two accessors - 'text' and 'kind', which return the text value and part of speech of the component respectively.

Basic usage is as follows:

require 'bunpa'

# Create the parser
parser = Bunpa::JapaneseTextParser.new

# Get an Array of Bunpa::Text::Component objects
components = parser.parse("A: こんにちは! お元気ですか。\nB: はい、元気です!")

components.each do |component|
  puts "#{component.text}\t(#{component.kind})"
end

This would output:

A       (名詞)
:       (名詞)
        (スペース)
こんにちは      (感動詞)
!      (記号)
        (スペース)
お      (接頭詞)
元気    (名詞)
です    (助動詞)
か      (助詞)
。      (記号)

        (改行)
B       (名詞)
:       (名詞)
        (スペース)
は      (助詞)
い      (動詞)
、      (記号)
元気    (名詞)
です    (助動詞)
!      (記号)

For a slightly more detailed example, see the usage_example.rb script in the bin directory.
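Because every component keeps its text exactly as it appeared in the input, joining the texts in order should reproduce the original string. The following sketch shows that round trip, plus a simple tally of components per kind. It uses a stand-in `Component` Struct and hand-copied components from the second line of the output above rather than a live parser, and the helper names `reconstruct` and `count_by_kind` are hypothetical.

```ruby
# Stand-in for Bunpa::Text::Component (assumption: `text` and `kind` accessors only).
Component = Struct.new(:text, :kind)

# Rebuild the source text by concatenating component texts in order.
def reconstruct(components)
  components.map(&:text).join
end

# Count how many components there are of each kind.
def count_by_kind(components)
  components.each_with_object(Hash.new(0)) { |c, counts| counts[c.kind] += 1 }
end

# Components copied by hand from the documented sample output.
components = [
  Component.new('B', '名詞'),
  Component.new(':', '名詞'),
  Component.new(' ', 'スペース'),
  Component.new('は', '助詞'),
  Component.new('い', '動詞'),
  Component.new('、', '記号'),
  Component.new('元気', '名詞'),
  Component.new('です', '助動詞'),
  Component.new('!', '記号')
]

puts reconstruct(components)
count_by_kind(components).each { |kind, count| puts "#{kind}\t#{count}" }
```

Comparing `reconstruct(parser.parse(text))` against `text` could serve as a quick sanity check of the completeness guarantee when working with real input.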

Notes

This is very much a work in progress - it only has minimal testing at the moment, so use at your own risk :)

bunpa's People

Contributors

haniwaniwa, clownba0t