
cutlet's Introduction


cutlet

cutlet by Irasutoya

Cutlet is a tool to convert Japanese to romaji. Check out the interactive demo! Also see the docs and the original blog post.

You do not need to write issues in English.

Features:

  • support for Modified Hepburn, Kunreisiki, Nihonsiki systems
  • custom overrides for individual mappings
  • custom overrides for specific words
  • built-in exceptions list (Tokyo, Osaka, etc.)
  • uses foreign spelling when available in UniDic
  • proper nouns are capitalized
  • slug mode for URL generation

Things not supported:

  • traditional Hepburn n-to-m: Shimbashi
  • macrons or circumflexes: Tōkyō, Tôkyô
  • passport Hepburn: Satoh (but you can use an exception)
  • hyphenating words
  • Traditional Hepburn in general is not supported

Internally, cutlet uses fugashi, so you can use the same dictionary you use for normal tokenization.

Installation

Cutlet can be installed through pip as usual.

pip install cutlet

Note that if you don't have a MeCab dictionary installed you'll also have to install one. If you're just getting started unidic-lite is a good choice.

pip install unidic-lite

Usage

A command-line script is included for quick testing. Just use cutlet and each line of stdin will be treated as a sentence. You can specify the system to use (hepburn, kunrei, nippon, or nihon) as the first argument.

$ cutlet
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.

In code:

import cutlet
katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい")
# => 'Cutlet curry wa oishii'

# you can print a slug suitable for urls
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'

# You can disable using foreign spelling too
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")
# => 'Katsu karee wa oishii'

# kunreisiki, nihonsiki work too
katu = cutlet.Cutlet('kunrei')
katu.romaji("富士山")
# => 'Huzi yama'

# comparison
nkatu = cutlet.Cutlet('nihon')

sent = "彼女は王への手紙を読み上げた。"
katsu.romaji(sent)
# => 'Kanojo wa ou e no tegami wo yomiageta.'
katu.romaji(sent)
# => 'Kanozyo wa ou e no tegami o yomiageta.'
nkatu.romaji(sent)
# => 'Kanozyo ha ou he no tegami wo yomiageta.'

Alternatives

  • kakasi: Historically important, but not updated since 2014.
  • pykakasi: self-contained; it does segmentation on its own and uses its own dictionary.
  • kuroshiro: JavaScript-based.
  • kana: Go-based.

cutlet's People

Contributors

4890a, hizuru3, kinow, kounoike, krackers, plmrx-hivestack, polm, stegayet


cutlet's Issues

KeyError: 'ゕ'

It looks like a small か causes some issues :(

% cutlet
夕陽ヵ丘三号館
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 8, in <module>
    sys.exit(main())
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cli.py", line 16, in main
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 130, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 195, in romaji_word
    return self.map_kana(kana)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 235, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 268, in get_single_mapping
    return self.table[kk]
KeyError: 'ゕ'

ImportError (circular import)

I have an issue and I don't know how to fix it. Does anyone know how?

Traceback (most recent call last):
  File "d:/Project/Farid/Python/OCR/fugashi.py", line 1, in <module>
    from cutlet import Cutlet
  File "C:\Users\Farid Fardiansyah\AppData\Roaming\Python\Python37\site-packages\cutlet\__init__.py", line 1, in <module>
    from .cutlet import *
  File "C:\Users\Farid Fardiansyah\AppData\Roaming\Python\Python37\site-packages\cutlet\cutlet.py", line 1, in <module>
    import fugashi
  File "d:\Project\Farid\Python\OCR\fugashi.py", line 1, in <module>
    from cutlet import Cutlet
ImportError: cannot import name 'Cutlet' from 'cutlet' (C:\Users\Farid Fardiansyah\AppData\Roaming\Python\Python37\site-packages\cutlet\__init__.py)
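
The traceback shows the script itself is named fugashi.py, so the `import fugashi` inside cutlet resolves to the script rather than the installed package, which produces the circular import. Renaming the script should avoid this. A minimal sketch of a check for this kind of name clash follows; the helper name `shadowed_names` is hypothetical and not part of cutlet or fugashi.

```python
from pathlib import Path


def shadowed_names(script_dir, package_names):
    """Return the package names that a local file or folder would shadow.

    When a script is run, its directory comes first on the import path, so
    a local `fugashi.py` wins over the installed `fugashi` package.
    """
    script_dir = Path(script_dir)
    return [
        name
        for name in package_names
        if (script_dir / f"{name}.py").exists() or (script_dir / name).is_dir()
    ]
```

Calling `shadowed_names("d:/Project/Farid/Python/OCR", ["fugashi", "cutlet"])` on the directory from the report would be expected to list `fugashi`.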

Convert values in a pandas column to romaji

Dear Polm,
Thanks for sharing this cutlet code 👍
I would like to try it on a dataframe whose column values are in Japanese. How do you pass pandas column values instead of a single string?
Many thanks in advance!

katsu.romaji(df['Column'])
=> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Pythonely,

Morgane
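
The error suggests that romaji() expects one string at a time, so the conversion has to be applied to each cell rather than to the whole column. A minimal sketch of that idea follows, written against a plain list so it runs without pandas; `romanize_column` is a hypothetical helper, and `romanize` stands for any callable such as `katsu.romaji`.

```python
def romanize_column(values, romanize):
    """Apply `romanize` to every string cell and pass other cells through.

    Non-string cells (None, NaN, numbers) are returned unchanged so that
    missing values do not reach the romanizer.
    """
    return [romanize(v) if isinstance(v, str) else v for v in values]
```

With pandas, the same per-element idea can be expressed as `df['Column'].map(katsu.romaji)`, assuming every cell in the column is a string.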

Align Romaji and Kana

Hello, thanks for the great work on this.

I have a use case where I need to make use of both romaji (nihon) and kana. In another issue regarding furigana you mention you can use fugashi as such:

import fugashi

tagger = fugashi.Tagger()
kana = [nn.feature.kana for nn in tagger("吾輩は猫である")]
# => ['ワガハイ', 'ハ', 'ネコ', 'デ', 'アル']

However, it seems the space-handling in this library is slightly customized.

import fugashi
import cutlet

tagger = fugashi.Tagger()
nihon = cutlet.Cutlet(use_foreign_spelling=False, system="nihon")

raw_text = 'また、東寺のように、五大明王と呼ばれる、主要な明王の中央に配されることも多い。'
romaji = nihon.romaji(raw_text)
kana = " ".join([nn.feature.kana for nn in tagger(raw_text)])
kana_romaji = nihon.romaji(kana)

print(f"Romaji text has {len(romaji.split(' '))} words, but kana text has {len(kana.split(' '))} words.")
print(f"This means that we can not align the romaji and kana text for any use case.")
print(f"Correct romaji: {romaji}\nKana: {kana}\nRomaji from kana: {kana_romaji}")

> Romaji text has 19 words, but kana text has 26 words.
> This means that we can not align the romaji and kana text for any use case.
> Correct romaji: Mata, Touzi no you ni, go daimyouou to yobareru, syuyou na myouou no tyuuou ni haisareru koto mo ooi.
> Kana: マタ  トウジ ノ ヨウ ニ  ゴ ダイ ミョウオウ ト ヨバ レル  シュヨウ ナ ミョウオウ ノ チュウオウ ニ ハイサ レル コト モ オオイ 
> Romaji from kana: Mata touzi no you ni godai myouou to yoba reru syuyou na myouou no tyuu Ou ni haisa reru koto mo ooi

Could we optionally provide the raw kana returned with the romaji? If so this would be the one Japanese processing library to rule them all.

UnicodeDecodeError on 'exceptions.tsv' with Windows 10 Japanese Locale

Windows attempts to decode exceptions.tsv with code page 932 instead of UTF-8 for some reason. Setting the open keyword argument encoding='utf-8' fixes it.

Traceback (most recent call last):
  File "cutlet_test.py", line 2, in <module>
    katsu = cutlet.Cutlet()
  File "C:\ProgramData\Miniconda3\envs\jpocr\lib\site-packages\cutlet\cutlet.py", line 80, in __init__
    self.exceptions = load_exceptions()
  File "C:\ProgramData\Miniconda3\envs\jpocr\lib\site-packages\cutlet\cutlet.py", line 59, in load_exceptions
    for line in open(cdir / 'exceptions.tsv'):
UnicodeDecodeError: 'cp932' codec can't decode byte 0x83 in position 10: illegal multibyte sequence
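
The fix described above amounts to passing an explicit encoding to open() so that the result no longer depends on the system locale. A minimal sketch follows, assuming a two-column, tab-separated file; the function below is an illustration of the idea and not cutlet's actual load_exceptions.

```python
def load_exceptions(path):
    """Read a tab-separated key/value file into a dict, always as UTF-8.

    Without encoding="utf-8", open() falls back to the locale's preferred
    encoding, which is cp932 on a Japanese-locale Windows system.
    """
    exceptions = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            key, value = line.split("\t", 1)
            exceptions[key] = value
    return exceptions
```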

Add api to get character to romaji map as list of dicts

I am writing a script that uses Whisper to transcribe Japanese speech, and I'd like to use cutlet to produce a romaji transcription. Right now I'm a little stuck, because Whisper's output with word_timestamps=True can produce word segments that break up multi-character words. When I use cutlet on the entire sentence segments output by Whisper, it works fine, but I'd like a map of the individual word timings so that I can create a text animation that highlights the romaji as the words are said in the audio.

Here's an example of the issue:

Full segment output from whisper and cutlet

raw whisper segment:	 大体私ら知らなくて 特にもいけない今日だって
cutlet full segment:	 daitai watakushira shiranakute tokuni mo ikenai kyou da tte

but the way this is broken up by whisper is the following:

whisper per word:	 大-体-私-ら-知-ら-なく-て- 特-に-も-い-け-ない-今日-だ-って
cutlet per word:	 oo-karada-watakushi-ra-chi-ra-naku-te-toku-ni-mo-i-ke-nai-kyou-da-tte

As you can see, using cutlet.romaji() on each "word" as defined by the Whisper transcription doesn't work. I tried using cutlet.romaji_word() but got this error:

AttributeError                            Traceback (most recent call last)
Cell In[38], line 14
     11 print(f'cutlet full segment:\t {katsu.romaji(full_segment, capitalize=False)}')
     13 single_word_whisper_line = '-'.join([word for word in word_list])
---> 14 single_word_romaji_line = '-'.join([katsu.romaji_word(word) for word in word_list])
     15 print(f'whisper per word:\t {single_word_whisper_line}')
     16 print(f'cutlet per word:\t {single_word_romaji_line}')

Cell In[38], line 14, in <listcomp>(.0)
     11 print(f'cutlet full segment:\t {katsu.romaji(full_segment, capitalize=False)}')
     13 single_word_whisper_line = '-'.join([word for word in word_list])
---> 14 single_word_romaji_line = '-'.join([katsu.romaji_word(word) for word in word_list])
     15 print(f'whisper per word:\t {single_word_whisper_line}')
     16 print(f'cutlet per word:\t {single_word_romaji_line}')

File ~/.pyenv/versions/karagen/lib/python3.11/site-packages/cutlet/cutlet.py:319, in Cutlet.romaji_word(self, word)
    316 def romaji_word(self, word):
    317     """Return the romaji for a single word (node)."""
--> 319     if word.surface in self.exceptions:
    320         return self.exceptions[word.surface]
    322     if word.surface.isdigit():

AttributeError: 'str' object has no attribute 'surface'

I've attached the full output of whisper transcription for the example above (includes the entire transcription of the content i'm transcribing):
whisper_transcription.json

(btw the speech i'm transcribing is the lyrics to the following song: https://www.youtube.com/watch?v=ZAJ3nfQTw4A)

Use the following code with the attached json to show the output.

import json
import cutlet

katsu = cutlet.Cutlet()
katsu.use_foreign_spelling = False

with open('/Users/silman/Desktop/whisper_transcription.json', 'r') as f:
    data = json.load(f)

for sentence_segment in data['segments']:
    word_list = list()
    for word_segment in sentence_segment['words']:
        word_list.append(word_segment['word'])

    full_segment = ''.join([word for word in word_list])
    print(f'raw whisper segment:\t {full_segment}')
    print(f'cutlet full segment:\t {katsu.romaji(full_segment, capitalize=False)}')

    single_word_whisper_line = '-'.join([word for word in word_list])
    single_word_romaji_line = '-'.join([katsu.romaji(word) for word in word_list])
    print(f'whisper per word:\t {single_word_whisper_line}')
    print(f'cutlet per word:\t {single_word_romaji_line}')

If I had a list mapping how the characters from the full sentence are used to create the romaji, I would be able to cycle through the characters in the mapping and find the start and end positions of those characters to create a map of the start/end positions of the romaji.

Thanks for your time and for this library! It's incredibly useful and easy to use!
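
One way to approach the character mapping described in this issue, without any new cutlet API, is to align by character offset: tokenize the full segment with a morphological tokenizer, then give each token the start time of the Whisper fragment holding its first character and the end time of the fragment holding its last. The sketch below assumes the tokens and the fragments cover the same text; `token_timings` is a hypothetical helper, and the token list would come from a tokenizer such as fugashi.

```python
from bisect import bisect_right


def token_timings(fragments, tokens):
    """Assign Whisper fragment timings to tokenizer tokens by character offset.

    fragments: list of (text, start_sec, end_sec), one per Whisper "word".
    tokens: list of token surface strings covering the same text.
    Whitespace is ignored on both sides.
    """
    starts, times, pos = [], [], 0
    for text, t0, t1 in fragments:
        text = "".join(text.split())
        if not text:
            continue
        starts.append(pos)        # offset where this fragment begins
        times.append((t0, t1))
        pos += len(text)

    out, offset = [], 0
    for tok in tokens:
        tok = "".join(tok.split())
        if not tok:
            continue
        first = bisect_right(starts, offset) - 1
        last = bisect_right(starts, offset + len(tok) - 1) - 1
        out.append((tok, times[first][0], times[last][1]))
        offset += len(tok)
    if offset != pos:
        raise ValueError("tokens and fragments cover different text")
    return out
```

Each token's surface could then be romanized separately, so the romaji inherits the token's start and end time. The timings in any example call are made up for illustration.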

Maintain formatting?

Hello,

Is it possible to maintain the formatting of input text? I like to structure Japanese lyrics for transliteration purposes, but have found that there are no services that maintain the formatting (other than RomajiDesu but it doesn't do Hepburn), so I usually end up with a huge block of text.
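
Since the command-line script already treats each input line as a sentence, one way to keep the line structure in code is to romanize line by line and leave blank lines alone. A minimal sketch follows; `romanize_preserving_lines` is a hypothetical helper, and `romanize` stands for any callable such as `katsu.romaji`.

```python
def romanize_preserving_lines(text, romanize):
    """Romanize each non-blank line and keep blank lines and line order."""
    out = []
    for line in text.split("\n"):
        # Blank or whitespace-only lines are structure, not text to convert.
        out.append(romanize(line) if line.strip() else line)
    return "\n".join(out)
```

This keeps verse breaks in lyrics intact, although indentation inside a line is left to whatever the romanizer returns.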


Unable to romanize with full katakana strings

I'm not sure if this is in the scope of cutlet, but it looks like any katakana-only sentences / phrases seem to not romanize:

% cutlet
アマガミ Sincerely Your S シンシアリーユアーズ
アマガミ Sincerely Your S シンシアリーユアーズ
ケメコデラックス
ケメコデラックス

Put use_foreign_spelling and ensure_ascii in constructor.

Is there a reason why use_foreign_spelling=True, ensure_ascii=True are not in the Cutlet constructor __init__? Placing these in the constructor would help IDE software provide the user information about these modifiable attributes and their defaults, and it is more intuitive (for me at least) to write katsu = cutlet.Cutlet(use_foreign_spelling=False).

KeyError: 'ゖ'

Here's another KeyError I found:

% cutlet
齋藤タヶオ
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 8, in <module>
    sys.exit(main())
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cli.py", line 16, in main
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 133, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 198, in romaji_word
    return self.map_kana(kana)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 238, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 271, in get_single_mapping
    return self.table[kk]
KeyError: 'ゖ'

It feels like we're slowly inching toward bullet proofing cutlet xD

スヽメ

katsu.romaji("スヽメ")

File "site-packages/cutlet/cutlet.py", line 145, in romaji
roma = self.romaji_word(word)
File "site-packages/cutlet/cutlet.py", line 214, in romaji_word
return self.map_kana(kana)
File "site-packages/cutlet/cutlet.py", line 254, in map_kana
out += self.get_single_mapping(pk, char, nk)
File "site-packages/cutlet/cutlet.py", line 287, in get_single_mapping
return self.table[kk]
KeyError: 'ゝ'
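
The character in this report is a kana iteration mark, which repeats the preceding kana. Until the mapping table handles it, one workaround is to expand the marks before romanizing. The sketch below is an assumption about a reasonable preprocessing step, not cutlet's behavior; it treats ゝ and ヽ as a plain repeat and ゞ and ヾ as a voiced repeat.

```python
import unicodedata

PLAIN_MARKS = {"ゝ", "ヽ"}   # repeat the previous kana
VOICED_MARKS = {"ゞ", "ヾ"}  # repeat the previous kana with a dakuten


def expand_iteration_marks(text):
    """Replace kana iteration marks with the kana they stand for."""
    out = []
    for ch in text:
        if out and ch in PLAIN_MARKS:
            out.append(out[-1])
        elif out and ch in VOICED_MARKS:
            # Add a combining dakuten (U+3099) and compose it where possible.
            out.append(unicodedata.normalize("NFC", out[-1] + "\u3099"))
        else:
            out.append(ch)
    return "".join(out)
```

With this, the input from the report becomes ススメ before it reaches the romanizer.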

KeyError: '゙'

I ran into an odd issue with the latest version:

% cutlet
青い春よさらば!
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 8, in <module>
    sys.exit(main())
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cli.py", line 16, in main
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 130, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 195, in romaji_word
    return self.map_kana(kana)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 235, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 268, in get_single_mapping
    return self.table[kk]
KeyError: '゙'

I'm not sure what's going on here. :/
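
One possible cause, assumed here rather than confirmed by the report, is that the input arrived in decomposed form, with ば stored as は followed by a combining dakuten (U+3099). In that case, normalizing to NFC before romanizing recombines the pair into a single kana. A minimal sketch:

```python
import unicodedata


def compose_kana(text):
    """Recombine base kana and combining (han)dakuten into single characters."""
    return unicodedata.normalize("NFC", text)
```

Text copied from macOS file names is a common source of decomposed kana, so applying this to input before calling romaji() may be worth trying.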

Japanese city names in romaji

Why are city names like 東京 and 大阪 converted to Tokyo and Osaka instead of Toukyou and Oosaka? I am working on a text-to-speech project and it caused the program to pronounce them incorrectly.

Cutlet creates additional spaces in some words written in Latin alphabet

I don't know if it's cutlet's or cutlet's dependency's fault, but I'm trying here.

Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> katsu = cutlet.Cutlet()
>>> text = '私は Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch にい ま す'
>>> katsu.romaji(text)
'Watakushi wa L l a n f a i r p w l l g w y n g y l l g o g e r y c h w y r n d robwllllantysiliogogogoch ni ima su'

KeyError: 'っ'

Here is a new KeyError I got:

% cutlet
ずっーと
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "$HOME/anaconda3/envs/py36_mecab/lib/python3.6/site-packages/cutlet/cutlet.py", line 133, in romaji
    roma = self.romaji_word(word)
  File "$HOME/anaconda3/envs/py36_mecab/lib/python3.6/site-packages/cutlet/cutlet.py", line 228, in romaji_word
    return self.map_kana(kana)
  File "$HOME/anaconda3/envs/py36_mecab/lib/python3.6/site-packages/cutlet/cutlet.py", line 238, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "$HOME/anaconda3/envs/py36_mecab/lib/python3.6/site-packages/cutlet/cutlet.py", line 254, in get_single_mapping
    if pk: return self.table[pk][-1]
KeyError: 'っ'

ムッォヴァ

katsu.romaji("ムッォヴァ")

File "/usr/local/lib/python3.6/site-packages/cutlet/cutlet.py", line 145, in romaji
roma = self.romaji_word(word)
File "/usr/local/lib/python3.6/site-packages/cutlet/cutlet.py", line 214, in romaji_word
return self.map_kana(kana)
File "/usr/local/lib/python3.6/site-packages/cutlet/cutlet.py", line 254, in map_kana
out += self.get_single_mapping(pk, char, nk)
File "/usr/local/lib/python3.6/site-packages/cutlet/cutlet.py", line 265, in get_single_mapping
return self.table[kk][:-1] + self.table[nk]
KeyError: 'っ'

Python < 3.7 support & possible bug

The use of str.isascii currently limits it to Python 3.7+. Since this is only used in one spot, it seems like you could use a polyfill similar to:

def isascii(s):
    """Fallback for str.isascii on Python < 3.7."""
    try:
        s.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

I'm also seeing an issue where "私は" gets converted to '代名詞 wa'

Cutlet CLI does not work on windows

I know it's a demo program, but I find it very useful and was surprised when I tried running it on Windows!

On windows 10 it fails with:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\XXX\AppData\Local\Programs\Python\Python311\Scripts\cutlet.exe\__main__.py", line 4, in <module>
  File "C:\Users\XXX\AppData\Local\Programs\Python\Python311\Lib\site-packages\cutlet\cli.py", line 6, in <module>
    from signal import signal, SIGPIPE, SIG_DFL
ImportError: cannot import name 'SIGPIPE' from 'signal' (C:\Users\XXX\AppData\Local\Programs\Python\Python311\Lib\signal.py)

It seems SIGPIPE is not supported on Windows: https://stackoverflow.com/questions/58718659/cannot-import-name-sigpipe-from-signal-in-windows-10

Thanks!
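
Because the signal module only defines SIGPIPE on platforms that have it, a guarded lookup avoids the ImportError shown above. The sketch below is one possible shape for such a guard, not the fix actually applied in cutlet; `restore_default_sigpipe` is a hypothetical name.

```python
import signal


def restore_default_sigpipe():
    """Reset SIGPIPE to its default action where the platform defines it.

    Returns True if the handler was set and False on platforms such as
    Windows where signal.SIGPIPE does not exist.
    """
    sigpipe = getattr(signal, "SIGPIPE", None)
    if sigpipe is None:
        return False
    signal.signal(sigpipe, signal.SIG_DFL)
    return True
```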

Is there a good way to detect non-Japanese text?

I'm calling Cutlet.romaji() to convert japanese text to romaji, and it's working great. Thanks for the awesome library.

But due to the nature of the data I'm working with, I get the occasional Korean or English string in the mix, and the output for Korean text looks like '???????'.

Rather than writing code to detect whether the output string contains mostly question marks, is there a clean way to detect non-Japanese text?
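
A simple heuristic that does not depend on cutlet is to inspect the Unicode blocks of the input before converting it. The sketch below treats any Hangul as non-Japanese and otherwise accepts text containing kana or kanji. It is a rough filter under those assumptions: kanji-only strings could also be Chinese, and `looks_japanese` is a hypothetical helper.

```python
def _any_in_range(text, low, high):
    """True if any character of text falls within [low, high]."""
    return any(low <= ch <= high for ch in text)


def looks_japanese(text):
    """Heuristic check based on Unicode blocks."""
    if _any_in_range(text, "\uac00", "\ud7a3"):  # Hangul syllables
        return False
    has_kana = _any_in_range(text, "\u3040", "\u30ff")   # hiragana + katakana
    has_kanji = _any_in_range(text, "\u4e00", "\u9fff")  # CJK unified ideographs
    return has_kana or has_kanji
```

Strings that fail the check can be skipped or routed elsewhere instead of being passed to romaji().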

KeyError: 'ー'

It looks like 'ー' causes an issue on 0.1.10. Here is an example:

% cutlet
押忍! ハト☆マツ学園男子寮! DC (12) プラトーーーン の巻
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 8, in <module>
    sys.exit(main())
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cli.py", line 16, in main
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 129, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 193, in romaji_word
    return self.map_kana(kana)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 233, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 249, in get_single_mapping
    if pk: return self.table[pk][-1]
KeyError: 'ー'

I'm guessing that a repeated sequence of ー is the issue :/

Support Title Case

It should be possible to support title case, so that all words except particles are capitalized. So この世界の片隅に would be "Kono Sekai no Katasumi ni".
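
As a rough illustration of the requested behavior, title casing can be approximated on the romaji string itself by capitalizing every word that is not in a particle list. This is only a string-level sketch: the particle set below is an assumption, and a real implementation would more likely rely on the tokenizer's part-of-speech tags, since a romaji string alone cannot tell a particle from a homograph.

```python
# Hypothetical particle list; extend as needed.
PARTICLES = {"wa", "ga", "wo", "o", "ni", "no", "e", "to", "de", "mo", "ka"}


def title_case(romaji, particles=PARTICLES):
    """Capitalize every space-separated word except the listed particles."""
    words = romaji.split(" ")
    return " ".join(
        w if w.lower() in particles else w[:1].upper() + w[1:] for w in words
    )
```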

KeyError: 'ー'

I ran into another issue parsing a title of a book, ティンクル☆くるせいだーすGoGo!(1). The error is below:

% cutlet
ティンクル☆くるせいだーすGoGo!(1)
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 14, in <module>
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 127, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 191, in romaji_word
    return self.map_kana(kana)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 231, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 264, in get_single_mapping
    return self.table[kk]
KeyError: 'ー'

This may be related to the changes from #7 and/or #8. This string did not error before either change. My guess is that the character in the middle is causing some issues.

How to use Exceptions properly?

Hey. First, thank you for your great work. I created an exceptions.csv like this:

"転生";"tensei"
"だって";"datte"
"ラノベ";"light novel"
"んですか";"ndesuka"
"でも";"demo"
"美少女";"bishoujo"
"んだが";"ndaga"

Only 転生 and ラノベ were applied. The others were not: for example, んだが was converted to "n-daga" and 美少女 became "bi-shoujo". Can you tell me what's wrong here?

FYI, I use the code below to apply the exceptions:

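import csv
import cutlet

katsu = cutlet.Cutlet()
# Each row is "surface";"romaji"; register it as a word-level exception.
with open("_exceptions.csv", encoding="utf-8", newline="") as fd:
    rd = csv.reader(fd, delimiter=";", quotechar='"')
    for row in rd:
        katsu.add_exception(row[0], row[1])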
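
A likely explanation, judging from the romaji_word traceback elsewhere on this page (which checks `word.surface in self.exceptions`), is that exceptions are looked up per token. A key that the tokenizer splits into several tokens would then never match. The toy sketch below imitates that lookup with plain dicts; the tokenization and the fallback table are invented for illustration and are not cutlet's data.

```python
def romanize_tokens(tokens, exceptions, fallback):
    """Look up each token surface in `exceptions`, else call `fallback`."""
    return [exceptions[t] if t in exceptions else fallback(t) for t in tokens]


# Hypothetical data: 美少女 is assumed to be split into 美 + 少女.
exceptions = {"美少女": "bishoujo", "転生": "tensei"}
toy_table = {"美": "bi", "少女": "shoujo"}

split_result = romanize_tokens(["美", "少女"], exceptions, toy_table.get)
whole_result = romanize_tokens(["転生"], exceptions, toy_table.get)
```

Here `split_result` is `['bi', 'shoujo']` because neither token equals the key 美少女, while `whole_result` is `['tensei']`. Under this reading, exceptions would need to be keyed on the individual tokens the tokenizer actually produces.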

KeyError being raised on certain characters

While experimenting with cutlet using the full unidic dictionary, I've had several KeyErrors being raised:

% cutlet
《月》
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 14, in <module>
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 122, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 175, in romaji_word
    return self.table[word.surface]
KeyError: '《'
% cutlet
くま クマ 熊 ベアー 2【電子版特典付】
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 14, in <module>
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 122, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 192, in romaji_word
    return self.map_kana(kana)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 201, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 234, in get_single_mapping
    return self.table[kk]
KeyError: '*'

Is this supposed to occur? I'm not aware if cutlet is meant to handle full-width characters in sentences.

Converting to romaji when the text is tokenized

Hi Paul

Thanks for the awesome library. I have a problem where I'm trying to convert tokenized Japanese text to romaji.

"何人ですか?" is correctly converted to "Nanijindesu ka?"

But if I tokenize the text to [何, 人, ですか, ?] and convert each token to romaji, the result is incorrect because the surrounding text is missing.

How would I convert Japanese text to romaji so that I get two matching tokenized arrays?

KeyError: 'ヸ'

katsu.romaji("秋の日のヸオロンのためいきの身にしみてひたぶるにうら悲し。")

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_2037/678387450.py in <module>
----> 1 katsu.romaji("秋の日のヸオロンのためいきの身にしみてひたぶるにうら悲し。")

~/miniconda3/envs/janlpbook/lib/python3.7/site-packages/cutlet/cutlet.py in romaji(self, text, capitalize, title)
    143 
    144             # resolve split verbs / adjectives
--> 145             roma = self.romaji_word(word)
    146             if roma and out and out[-1] == 'っ':
    147                 out = out[:-1] + roma[0]

~/miniconda3/envs/janlpbook/lib/python3.7/site-packages/cutlet/cutlet.py in romaji_word(self, word)
    212             if word.char_type == 6 or word.char_type == 7: # hiragana/katakana
    213                 kana = jaconv.kata2hira(word.surface)
--> 214                 return self.map_kana(kana)
    215 
    216             # At this point this is an unknown word and not kana. Could be

~/miniconda3/envs/janlpbook/lib/python3.7/site-packages/cutlet/cutlet.py in map_kana(self, kana)
    252             nk = kana[ki + 1] if ki < len(kana) - 1 else None
    253             pk = kana[ki - 1] if ki > 0 else None
--> 254             out += self.get_single_mapping(pk, char, nk)
    255         return out
    256 

~/miniconda3/envs/janlpbook/lib/python3.7/site-packages/cutlet/cutlet.py in get_single_mapping(self, pk, kk, nk)
    285             else: return 'n'
    286 
--> 287         return self.table[kk]
    288 

KeyError: 'ヸ'

I think this is an old variant of "ヴィ".

Source: https://tatoeba.org/en/sentences/show/2478013

パン = pao

パン is transcribed as pao, which I would say is wrong in any romanization system.
