audubon

audubon is a collection of Japanese text processing tools for:

  • filling Japanese iteration marks
  • hiraganization, katakanization and romanization using hakatashi/japanese.js
  • segmentation by phrase using google/budoux and ‘TinySegmenter.js’
  • text normalization based on the rules of the ‘Sudachi’ morphological analyzer and ‘NEologd’ (the neologism dictionary for ‘MeCab’).

Some of the features above are not implemented in ‘ICU’ (i.e., the stringi package); the goal of the audubon package is to provide these additional features.

Installation

remotes::install_github("paithiov909/audubon")

Usage

Fill Japanese iteration marks (Odori-ji)

strj_fill_iter_mark replaces iteration marks with the preceding characters they repeat, provided the element has more than 5 characters. You can combine this feature with strj_normalize or strj_rewrite_as_def.

strj_fill_iter_mark(c(
  "あいうゝ〃かき",
  "金子みすゞ",
  "のたり〳〵かな",
  "しろ／″＼とした"
))
#> [1] "あいうううかき"  "金子みすず"     "のたりたりかな"  "しろじろとした"

strj_fill_iter_mark("いすゞエルフトラック") |>
  strj_normalize()
#> [1] "いすずエルフトラック"
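
To make the substitution concrete, here is a minimal base-R sketch that handles only the two single-character hiragana marks, ゝ and ゞ. fill_iter_mark_sketch is a hypothetical helper, not audubon's implementation: it ignores the length condition, the 〃 mark, and the multi-character marks, and it assumes that the voiced form of a kana sits one code point after the unvoiced form, which holds for common pairs such as す and ず.

```r
# Hypothetical helper for illustration only; not part of audubon.
fill_iter_mark_sketch <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)                 # string -> vector of code points
    for (i in seq_along(cp)[-1]) {     # a mark in first position has nothing to repeat
      if (cp[i] == 0x309D) {           # U+309D: plain repeat of the previous character
        cp[i] <- cp[i - 1]
      } else if (cp[i] == 0x309E) {    # U+309E: voiced repeat (assumed to be code point + 1)
        cp[i] <- cp[i - 1] + 1L
      }
    }
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

fill_iter_mark_sketch("金子みすゞ")
#> [1] "金子みすず"
```

The loop runs left to right, so consecutive marks each copy the character that was just filled in.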

Character class conversion

Character class conversion uses hakatashi/japanese.js.

strj_hiraganize("あのイーハトーヴォのすきとおった風")
#> [1] "あのいーはとーゔぉのすきとおった風"
strj_katakanize("あのイーハトーヴォのすきとおった風")
#> [1] "アノイーハトーヴォノスキトオッタ風"
strj_romanize("あのイーハトーヴォのすきとおった風")
#> [1] "anoīhatōvonosukitōtta"
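
As a rough illustration of what hiragana and katakana conversion involves, the base-R sketch below shifts code points between the two Unicode kana blocks, which are laid out in parallel 0x60 apart. The helper names are hypothetical and this is not how japanese.js is implemented; it covers only the basic kana ranges and does not attempt romanization.

```r
# Hypothetical helpers for illustration only; not part of audubon.
shift_range_sketch <- function(x, from, to, offset) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)
    hit <- cp >= from & cp <= to       # code points inside the source block
    cp[hit] <- cp[hit] + offset        # move them to the parallel block
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

# Hiragana U+3041..U+3096 and katakana U+30A1..U+30F6 differ by 0x60.
hira_to_kata_sketch <- function(x) shift_range_sketch(x, 0x3041L, 0x3096L, 0x60L)
kata_to_hira_sketch <- function(x) shift_range_sketch(x, 0x30A1L, 0x30F6L, -0x60L)

hira_to_kata_sketch("あのイーハトーヴォのすきとおった風")
#> [1] "アノイーハトーヴォノスキトオッタ風"
```

Characters outside the chosen range, such as the prolonged sound mark ー and kanji, pass through unchanged.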

Segmentation by phrase

strj_tokenize splits Japanese text into phrases using google/budoux, TinySegmenter, or other tokenizers.

strj_tokenize("あのイーハトーヴォのすきとおった風", engine = "budoux")
#> $`1`
#> [1] "あの"             "イーハトーヴォの" "すきとおった"     "風"

Japanese text normalization

strj_normalize normalizes text following rules based on the NEologd style.

strj_normalize("――南アルプスの 天然水- Sparking* Lemon+ レモン一絞り")
#> [1] "ー南アルプスの天然水-Sparking* Lemon+レモン一絞り"
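
One effect visible in the example above is that a run of dash-like characters collapses into a single prolonged sound mark (ー). The base-R sketch below covers only that step. collapse_long_marks_sketch is a hypothetical name, the set of characters it matches is an assumption chosen for illustration, and strj_normalize applies further rules that are not shown here.

```r
# Hypothetical helper for illustration only; not part of audubon.
collapse_long_marks_sketch <- function(x) {
  # Assumed character set: U+2014 em dash, U+2015 horizontal bar,
  # U+2500 and U+2501 box-drawing horizontals, U+30FC prolonged sound mark.
  # Any run of these becomes a single U+30FC.
  gsub("[\u2014\u2015\u2500\u2501\u30fc]+", "\u30fc", x)
}

collapse_long_marks_sketch("――南アルプスの天然水")
#> [1] "ー南アルプスの天然水"
```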

strj_rewrite_as_def is an R port of SudachiCharNormalizer that typically normalizes characters following a ‘*.def’ file.

The audubon package ships with several ‘*.def’ files; you can use them or write your own ‘rewrite.def’ file as follows.

# single characters will **never** be normalized.
…
# if two characters are separated with a tab,
# the left-side form is always rewritten to the right-side form
# before normalization.
斎	斉
齋	斉
齊	斉
# only single-character to single-character rewriting is supported,
# i.e., the following rule does not work.
アッ	ア
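
The rewrite rules described in these comments can be sketched in base R as follows. Both helpers are hypothetical and only illustrate the tab-separated, single-character rule format; they skip the lines that list characters excluded from normalization and perform no normalization themselves. In audubon, reading a definition file is done with read_rewrite_def.

```r
# Hypothetical helpers for illustration only; not part of audubon.
parse_def_sketch <- function(lines) {
  # Drop empty lines and comment lines.
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  parts <- strsplit(lines, "\t", fixed = TRUE)
  # Keep only "left<TAB>right" rules; single-character lines are skipped here.
  pairs <- parts[lengths(parts) == 2L]
  # Keep only rules that map one character to one character.
  ok <- vapply(pairs, function(p) all(nchar(p) == 1L), logical(1))
  pairs <- pairs[ok]
  list(
    from = vapply(pairs, `[`, character(1), 1L),
    to = vapply(pairs, `[`, character(1), 2L)
  )
}

rewrite_sketch <- function(x, def) {
  # chartr() maps the i-th character of `old` to the i-th character of `new`.
  chartr(paste(def$from, collapse = ""), paste(def$to, collapse = ""), x)
}

def <- parse_def_sketch(c(
  "# comment lines are skipped",
  "齋\t斉",
  "アッ\tア"
))
rewrite_sketch("齋藤", def)
#> [1] "斉藤"
```

The two-character rule is dropped by the parser, which mirrors the restriction stated in the file's comments.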

This feature is more powerful than stringi::stri_trans_* because it lets users control which characters are normalized. For instance, it can be used to convert kyuji-tai (old-form) characters to shinji-tai (new-form) characters.

stringi::stri_trans_nfkc("Ⅹⅳ")
#> [1] "Xiv"
strj_rewrite_as_def("Ⅹⅳ")
#> [1] "Ⅹⅳ"
strj_rewrite_as_def("惡と假面のルール", read_rewrite_def(system.file("def/kyuji.def", package = "audubon")))
#> [1] "悪と仮面のルール"

License

© 2024 Akiru Kato

Licensed under the Apache License, Version 2.0.

Icons made by iconixar from flaticon.
