polm / cutlet

Japanese to romaji converter in Python

Home Page: https://polm.github.io/cutlet/

License: MIT License

Python 100.00%

japanese romaji nlp

cutlet's Introduction

cutlet

Cutlet is a tool to convert Japanese to romaji. Check out the interactive demo! Also see the docs and the original blog post.

You do not need to write issues in English.

Features:

support for Modified Hepburn, Kunreisiki, Nihonsiki systems
custom overrides for individual mappings
custom overrides for specific words
built-in exceptions list (Tokyo, Osaka, etc.)
uses foreign spelling when available in UniDic
proper nouns are capitalized
slug mode for URL generation

Things not supported:

traditional Hepburn n-to-m: Shimbashi
macrons or circumflexes: Tōkyō, Tôkyô
passport Hepburn: Satoh (but you can use an exception)
hyphenating words
Traditional Hepburn in general is not supported

Internally, cutlet uses fugashi, so you can use the same dictionary you use for normal tokenization.

Installation

Cutlet can be installed through pip as usual.

pip install cutlet

Note that if you don't have a MeCab dictionary installed you'll also have to install one. If you're just getting started unidic-lite is a good choice.

pip install unidic-lite

Usage

A command-line script is included for quick testing. Just use cutlet and each line of stdin will be treated as a sentence. You can specify the system to use (hepburn, kunrei, nippon, or nihon) as the first argument.

$ cutlet
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.

In code:

import cutlet
katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい")
# => 'Cutlet curry wa oishii'

# you can print a slug suitable for urls
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'

# You can disable using foreign spelling too
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")
# => 'Katsu karee wa oishii'

# kunreisiki, nihonsiki work too
katu = cutlet.Cutlet('kunrei')
katu.romaji("富士山")
# => 'Huzi yama'

# comparison
nkatu = cutlet.Cutlet('nihon')

sent = "彼女は王への手紙を読み上げた。"
katsu.romaji(sent)
# => 'Kanojo wa ou e no tegami wo yomiageta.'
katu.romaji(sent)
# => 'Kanozyo wa ou e no tegami o yomiageta.'
nkatu.romaji(sent)
# => 'Kanozyo ha ou he no tegami wo yomiageta.'

Alternatives

kakasi: Historically important, but not updated since 2014.
pykakasi: Self-contained; it does segmentation on its own and uses its own dictionary.
kuroshiro: JavaScript-based.
kana: Go-based.

cutlet's Issues

KeyError: 'ゕ'

It looks like a small か causes some issues :(

% cutlet
夕陽ヵ丘三号館
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 8, in <module>
    sys.exit(main())
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cli.py", line 16, in main
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 130, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 195, in romaji_word
    return self.map_kana(kana)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 235, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 268, in get_single_mapping
    return self.table[kk]
KeyError: 'ゕ'

ImportError (circular import)

I'm running into an issue and don't know how to fix it.
Does anyone know the cause?

Traceback (most recent call last):
  File "d:/Project/Farid/Python/OCR/fugashi.py", line 1, in <module>
    from cutlet import Cutlet
  File "C:\Users\Farid Fardiansyah\AppData\Roaming\Python\Python37\site-packages\cutlet\__init__.py", line 1, in <module>
    from .cutlet import *
  File "C:\Users\Farid Fardiansyah\AppData\Roaming\Python\Python37\site-packages\cutlet\cutlet.py", line 1, in <module>
    import fugashi
  File "d:\Project\Farid\Python\OCR\fugashi.py", line 1, in <module>
    from cutlet import Cutlet
ImportError: cannot import name 'Cutlet' from 'cutlet' (C:\Users\Farid Fardiansyah\AppData\Roaming\Python\Python37\site-packages\cutlet\__init__.py)
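
The traceback shows `import fugashi` inside cutlet resolving to the local script `fugashi.py` rather than the installed fugashi package, which is what makes the import circular. Renaming the script is the usual way out. A minimal sketch for spotting such name clashes follows; `find_shadowing_scripts` is a hypothetical helper, not part of cutlet or fugashi.

```python
from pathlib import Path


def find_shadowing_scripts(script_dir, package_names):
    """Return local .py files whose names clash with the given package names."""
    wanted = set(package_names)
    return sorted(
        path.name
        for path in Path(script_dir).glob("*.py")
        if path.stem in wanted
    )


if __name__ == "__main__":
    # Package names taken from the traceback above.
    print(find_shadowing_scripts(".", ["cutlet", "fugashi"]))
```

Any file name this prints is a candidate for renaming, since Python searches the script's own directory before site-packages.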

Convert pandas column values to romaji

Dear Polm,
Thanks for sharing this cutlet code 👍
I would like to try it on a dataframe whose column values are in Japanese. How can I pass pandas column values instead of a single string?
Many thanks in advance!

katsu.romaji(df['Column'])
=> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Pythonely,

Morgane
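
The error suggests a whole Series was passed where `romaji()` expects one string, so the conversion probably needs to happen value by value. The sketch below shows that idea on a plain list with a stand-in converter so it runs without cutlet or pandas; `romanize_values` and `fake_romaji` are hypothetical names.

```python
def romanize_values(values, convert):
    """Apply a one-string converter to each value, leaving non-strings untouched."""
    return [convert(v) if isinstance(v, str) else v for v in values]


def fake_romaji(text):
    # Stand-in for katsu.romaji so the sketch runs without cutlet installed.
    return f"<romaji of {text}>"


column = ["カツカレー", None, "富士山"]
print(romanize_values(column, fake_romaji))
```

With pandas, the same per-value idea would be `df['Column'].apply(katsu.romaji)`, with missing values guarded as above.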

Align Romaji and Kana

Hello, thanks for the great work on this.

I have a use case where I need to make use of both romaji (nihon) and kana. In another issue regarding furigana you mention you can use fugashi as such:

import fugashi

tagger = fugashi.Tagger()
kana = [nn.feature.kana for nn in tagger("吾輩は猫である")]
# => ['ワガハイ', 'ハ', 'ネコ', 'デ', 'アル']

However, it seems the space-handling in this library is slightly customized.

import fugashi
import cutlet

tagger = fugashi.Tagger()
nihon = cutlet.Cutlet(use_foreign_spelling=False, system="nihon")

raw_text = 'また、東寺のように、五大明王と呼ばれる、主要な明王の中央に配されることも多い。'
romaji = nihon.romaji(raw_text)
kana = " ".join([nn.feature.kana for nn in tagger("また、東寺のように、五大明王と呼ばれる、主要な明王の中央に配されることも多い。")])
kana_romaji = nihon.romaji(kana)

print(f"Romaji text has {len(romaji.split(' '))} words, but kana text has {len(kana.split(' '))} words.")
print(f"This means that we can not align the romaji and kana text for any use case.")
print(f"Correct romaji: {romaji}\nKana: {kana}\nRomaji from kana: {kana_romaji}")

> Romaji text has 19 words, but kana text has 26 words.
> This means that we can not align the romaji and kana text for any use case.
> Correct romaji: Mata, Touzi no you ni, go daimyouou to yobareru, syuyou na myouou no tyuuou ni haisareru koto mo ooi.
> Kana: マタ  トウジ ノ ヨウ ニ  ゴ ダイ ミョウオウ ト ヨバ レル  シュヨウ ナ ミョウオウ ノ チュウオウ ニ ハイサ レル コト モ オオイ 
> Romaji from kana: Mata touzi no you ni godai myouou to yoba reru syuyou na myouou no tyuu Ou ni haisa reru koto mo ooi

Could we optionally provide the raw kana returned with the romaji? If so this would be the one Japanese processing library to rule them all.

Is it possible to generate furigana for Kanji using this library?

It would be a nice feature.
Hopefully, this is possible because we are already generating the romaji.

UnicodeDecodeError on 'exceptions.tsv' with Windows 10 Japanese Locale

Windows attempts to decode exceptions.tsv with code page 932 (cp932) instead of UTF-8 for some reason. Passing the keyword argument encoding='utf-8' to open() fixes it.

Traceback (most recent call last):
  File "cutlet_test.py", line 2, in <module>
    katsu = cutlet.Cutlet()
  File "C:\ProgramData\Miniconda3\envs\jpocr\lib\site-packages\cutlet\cutlet.py", line 80, in __init__
    self.exceptions = load_exceptions()
  File "C:\ProgramData\Miniconda3\envs\jpocr\lib\site-packages\cutlet\cutlet.py", line 59, in load_exceptions
    for line in open(cdir / 'exceptions.tsv'):
UnicodeDecodeError: 'cp932' codec can't decode byte 0x83 in position 10: illegal multibyte sequence
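
Without an `encoding` argument, `open()` falls back to the locale's preferred encoding, which is cp932 under a Japanese Windows locale. A minimal sketch of reading a two-column TSV with the encoding pinned to UTF-8 follows; `load_tsv` is a hypothetical helper, not cutlet's own `load_exceptions`.

```python
import tempfile
from pathlib import Path


def load_tsv(path):
    """Read a two-column TSV into a dict, always decoding as UTF-8."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            key, value = line.split("\t", 1)
            table[key] = value
    return table


# Round-trip a small sample file to show the explicit encoding in use.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "exceptions.tsv"
    sample.write_text("東京\tTokyo\n大阪\tOsaka\n", encoding="utf-8")
    print(load_tsv(sample))
```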

Was the streamlit app removed?

This is what I see when I go to the link https://share.streamlit.io/polm/cutlet-demo/main/demo.py

Add API to get character-to-romaji map as a list of dicts

I am writing a script that uses Whisper to transcribe Japanese speech, and I'd like to use cutlet to produce a romaji transcription. Right now I'm a little stuck because Whisper's output with word_timestamps=True can contain word segments that break up multi-character words. When I use cutlet to transcribe entire sentence segments output by Whisper, it works fine, but I'd like a map of the individual word timings so that I can create a text animation that highlights the romaji as the words are said in the audio.

Here's an example of the issue:

Full segment output from whisper and cutlet

raw whisper segment:	 大体私ら知らなくて 特にもいけない今日だって
cutlet full segment:	 daitai watakushira shiranakute tokuni mo ikenai kyou da tte

but the way this is broken up by whisper is the following:

whisper per word:	 大-体-私-ら-知-ら-なく-て- 特-に-も-い-け-ない-今日-だ-って
cutlet per word:	 oo-karada-watakushi-ra-chi-ra-naku-te-toku-ni-mo-i-ke-nai-kyou-da-tte

As you can see, using cutlet.romaji() on each "word" as defined by the Whisper transcription doesn't work. I tried using cutlet.romaji_word() instead, but got this error:

AttributeError                            Traceback (most recent call last)
Cell In[38], line 14
     11 print(f'cutlet full segment:\t {katsu.romaji(full_segment, capitalize=False)}')
     13 single_word_whisper_line = '-'.join([word for word in word_list])
---> 14 single_word_romaji_line = '-'.join([katsu.romaji_word(word) for word in word_list])
     15 print(f'whisper per word:\t {single_word_whisper_line}')
     16 print(f'cutlet per word:\t {single_word_romaji_line}')

Cell In[38], line 14, in <listcomp>(.0)
     11 print(f'cutlet full segment:\t {katsu.romaji(full_segment, capitalize=False)}')
     13 single_word_whisper_line = '-'.join([word for word in word_list])
---> 14 single_word_romaji_line = '-'.join([katsu.romaji_word(word) for word in word_list])
     15 print(f'whisper per word:\t {single_word_whisper_line}')
     16 print(f'cutlet per word:\t {single_word_romaji_line}')

File ~/.pyenv/versions/karagen/lib/python3.11/site-packages/cutlet/cutlet.py:319, in Cutlet.romaji_word(self, word)
    316 def romaji_word(self, word):
    317     """Return the romaji for a single word (node)."""
--> 319     if word.surface in self.exceptions:
    320         return self.exceptions[word.surface]
    322     if word.surface.isdigit():

AttributeError: 'str' object has no attribute 'surface'

I've attached the full output of whisper transcription for the example above (includes the entire transcription of the content i'm transcribing):
whisper_transcription.json

(btw the speech i'm transcribing is the lyrics to the following song: https://www.youtube.com/watch?v=ZAJ3nfQTw4A)

Use the following code with the attached json to show the output.

import json
import cutlet

katsu = cutlet.Cutlet()
katsu.use_foreign_spelling = False

with open('/Users/silman/Desktop/whisper_transcription.json', 'r') as f:
    data = json.load(f)

for sentence_segment in data['segments']:
    word_list = list()
    for word_segment in sentence_segment['words']:
        word_list.append(word_segment['word'])

    full_segment = ''.join([word for word in word_list])
    print(f'raw whisper segment:\t {full_segment}')
    print(f'cutlet full segment:\t {katsu.romaji(full_segment, capitalize=False)}')

    single_word_whisper_line = '-'.join([word for word in word_list])
    single_word_romaji_line = '-'.join([katsu.romaji(word) for word in word_list])
    print(f'whisper per word:\t {single_word_whisper_line}')
    print(f'cutlet per word:\t {single_word_romaji_line}')

If I had a list showing how the characters of the full sentence map to the romaji, I could walk through those characters, find their start and end positions, and build a map of start/end positions for the romaji.
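
One way to approach this without a new API might be to carry the timings over at the character level: tokenize the full sentence (for example with fugashi, which cutlet uses internally), then give each token the start of its first character and the end of its last. The sketch below assumes fragments shaped as `(text, start, end)` tuples, which is a hypothetical layout rather than Whisper's actual JSON, and `token_timings` is an illustrative name.

```python
def token_timings(fragments, tokens):
    """Assign a (start, end) time to each token from per-fragment timings.

    fragments: list of (text, start, end) tuples, a hypothetical shape.
    tokens: non-empty strings whose concatenation equals the fragment text.
    """
    # Expand fragment timings to one (start, end) pair per character.
    char_times = []
    for text, start, end in fragments:
        char_times.extend([(start, end)] * len(text))

    if any(not tok for tok in tokens):
        raise ValueError("tokens must be non-empty")
    if sum(len(tok) for tok in tokens) != len(char_times):
        raise ValueError("tokens and fragments cover different text")

    timings = []
    pos = 0
    for tok in tokens:
        start = char_times[pos][0]               # start of the first character
        end = char_times[pos + len(tok) - 1][1]  # end of the last character
        timings.append((tok, start, end))
        pos += len(tok)
    return timings


fragments = [("大", 0.0, 0.2), ("体", 0.2, 0.5), ("私", 0.5, 0.9), ("ら", 0.9, 1.0)]
print(token_timings(fragments, ["大体", "私ら"]))
```

Whitespace inside fragments would need to be stripped or matched on both sides before the length check passes.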

Thanks for your time and for this library! It's incredibly useful and easy to use!

Maintain formatting?

Hello,

Is it possible to maintain the formatting of input text? I like to structure Japanese lyrics for transliteration purposes, but have found that there are no services that maintain the formatting (other than RomajiDesu but it doesn't do Hepburn), so I usually end up with a huge block of text.
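
A possible workaround is to convert the text one line at a time and rejoin the results, so that line breaks and blank lines between verses survive. This is a minimal sketch, assuming any converter that takes one string, such as `katsu.romaji`; `romanize_preserving_lines` is a hypothetical helper and `str.upper` stands in for the converter so the example runs without cutlet.

```python
def romanize_preserving_lines(text, convert):
    """Convert each non-blank line separately so line breaks survive."""
    return "\n".join(
        convert(line) if line.strip() else line
        for line in text.split("\n")
    )


lyrics = "first line\nsecond line\n\nnext verse"
print(romanize_preserving_lines(lyrics, str.upper))
```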

Unable to romanize with full katakana strings

I'm not sure if this is in the scope of cutlet, but it looks like any katakana-only sentences / phrases seem to not romanize:

% cutlet
アマガミ Sincerely Your S シンシアリーユアーズ
アマガミ Sincerely Your S シンシアリーユアーズ
ケメコデラックス
ケメコデラックス

Put use_foreign_spelling and ensure_ascii in constructor.

Is there a reason why use_foreign_spelling=True, ensure_ascii=True are not in the Cutlet constructor __init__? Placing these in the constructor would help IDE software provide the user information about these modifiable attributes and their defaults, and it is more intuitive (for me at least) to write katsu = cutlet.Cutlet(use_foreign_spelling=False).

KeyError: 'ゖ'

Here's another KeyError I found:

% cutlet
齋藤タヶオ
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 8, in <module>
    sys.exit(main())
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cli.py", line 16, in main
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 133, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 198, in romaji_word
    return self.map_kana(kana)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 238, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 271, in get_single_mapping
    return self.table[kk]
KeyError: 'ゖ'

It feels like we're slowly inching toward bullet proofing cutlet xD

スヽメ

katsu.romaji("スヽメ")

File "site-packages/cutlet/cutlet.py", line 145, in romaji
roma = self.romaji_word(word)
File "site-packages/cutlet/cutlet.py", line 214, in romaji_word
return self.map_kana(kana)
File "site-packages/cutlet/cutlet.py", line 254, in map_kana
out += self.get_single_mapping(pk, char, nk)
File "site-packages/cutlet/cutlet.py", line 287, in get_single_mapping
return self.table[kk]
KeyError: 'ゝ'

KeyError: '゙'

I ran into an odd issue with the latest version:

% cutlet
青い春よさらば！
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 8, in <module>
    sys.exit(main())
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cli.py", line 16, in main
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 130, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 195, in romaji_word
    return self.map_kana(kana)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 235, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 268, in get_single_mapping
    return self.table[kk]
KeyError: '゙'

I'm not sure what's going on here. :/
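
The key in the error looks like U+3099, the combining voiced sound mark, which would mean the input arrived in decomposed form (は followed by the mark, rather than the single character ば). If that is the cause, normalizing to NFC before conversion composes the pair again. A minimal sketch using the standard library; `to_nfc` is a hypothetical helper.

```python
import unicodedata


def to_nfc(text):
    """Compose base characters and combining marks into single code points."""
    return unicodedata.normalize("NFC", text)


decomposed = "さらは\u3099"  # は followed by the combining mark U+3099
composed = to_nfc(decomposed)
print(len(decomposed), len(composed))
```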

very useful and accurate, it would be even better if it could map kanji to kana

Japanese city names in romaji

Why are city names like 東京 and 大阪 converted to Tokyo and Osaka instead of Toukyou and Oosaka? I am working on a text-to-speech project and it caused the program to pronounce them incorrectly.

Cutlet creates additional spaces in some words written in Latin alphabet

I don't know whether this is the fault of cutlet or of one of its dependencies, but I'm reporting it here.

Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> katsu = cutlet.Cutlet()
>>> text = '私は Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch にい ま す'
>>> katsu.romaji(text)
'Watakushi wa L l a n f a i r p w l l g w y n g y l l g o g e r y c h w y r n d robwllllantysiliogogogoch ni ima su'

Cutlet converts こんにちは to Konnichiha instead of Konnichiwa

Cutlet converts こんにちは to Konnichiha instead of Konnichiwa. Is this intentional behaviour or a bug? こんにちは should be read as Konnichiwa.

KeyError: 'っ'

Here is a new KeyError I got:

% cutlet
ずっーと
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "$HOME/anaconda3/envs/py36_mecab/lib/python3.6/site-packages/cutlet/cutlet.py", line 133, in romaji
    roma = self.romaji_word(word)
  File "$HOME/anaconda3/envs/py36_mecab/lib/python3.6/site-packages/cutlet/cutlet.py", line 228, in romaji_word
    return self.map_kana(kana)
  File "$HOME/anaconda3/envs/py36_mecab/lib/python3.6/site-packages/cutlet/cutlet.py", line 238, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "$HOME/anaconda3/envs/py36_mecab/lib/python3.6/site-packages/cutlet/cutlet.py", line 254, in get_single_mapping
    if pk: return self.table[pk][-1]
KeyError: 'っ'

ムッォヴァ

katsu.romaji("ムッォヴァ")

File "/usr/local/lib/python3.6/site-packages/cutlet/cutlet.py", line 145, in romaji
roma = self.romaji_word(word)
File "/usr/local/lib/python3.6/site-packages/cutlet/cutlet.py", line 214, in romaji_word
return self.map_kana(kana)
File "/usr/local/lib/python3.6/site-packages/cutlet/cutlet.py", line 254, in map_kana
out += self.get_single_mapping(pk, char, nk)
File "/usr/local/lib/python3.6/site-packages/cutlet/cutlet.py", line 265, in get_single_mapping
return self.table[kk][:-1] + self.table[nk]
KeyError: 'っ'

Python < 3.7 support & possible bug

The use of str.isascii currently limits cutlet to Python 3.7+. Since it is only used in one spot, it seems like you could use a polyfill similar to:

def is_ascii(s):
    # Hypothetical helper name; fallback for str.isascii(), added in Python 3.7.
    try:
        s.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

I'm also seeing an issue where "私は" gets converted to '代名詞 wa'

Demo page not loading

Cutlet CLI does not work on Windows

I know it's a demo program, but I find it very useful and was surprised when it failed to run on Windows!

On Windows 10 it fails with:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\XXX\AppData\Local\Programs\Python\Python311\Scripts\cutlet.exe\__main__.py", line 4, in <module>
  File "C:\Users\XXX\AppData\Local\Programs\Python\Python311\Lib\site-packages\cutlet\cli.py", line 6, in <module>
    from signal import signal, SIGPIPE, SIG_DFL
ImportError: cannot import name 'SIGPIPE' from 'signal' (C:\Users\XXX\AppData\Local\Programs\Python\Python311\Lib\signal.py)

It seems SIGPIPE is not supported on Windows: https://stackoverflow.com/questions/58718659/cannot-import-name-sigpipe-from-signal-in-windows-10

Thanks!
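
Since `signal.SIGPIPE` is only defined on platforms that have it, one option is to look it up at runtime instead of importing it by name. The sketch below is not cutlet's actual fix; `restore_default_sigpipe` is a hypothetical helper, and its `sig` parameter exists only so the behaviour can be checked without touching real handlers.

```python
import signal


def restore_default_sigpipe(sig=signal):
    """Reset SIGPIPE to its default action where the platform defines it.

    Returns True if the handler was set, and False on platforms such as
    Windows where the signal module has no SIGPIPE attribute.
    """
    sigpipe = getattr(sig, "SIGPIPE", None)
    if sigpipe is None:
        return False
    sig.signal(sigpipe, sig.SIG_DFL)
    return True
```

A CLI entry point could call this once at startup in place of the unconditional import shown in the traceback.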

Romaji to original Japanese

Is there a way to use the tool to convert a text from Romaji to original Japanese?

Is there a good way to detect non-Japanese text?

I'm calling Cutlet.romaji() to convert Japanese text to romaji, and it's working great. Thanks for the awesome library.

But due to the nature of the data I'm working with, I get the occasional Korean or English string in the mix, and the output for Korean text looks like '???????'.

Rather than writing code to detect whether the output string contains mostly question marks, is there a clean way to detect non-Japanese text?
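
One approach that does not depend on cutlet is to check the input for kana or kanji before converting it. The sketch below tests Unicode block ranges; `contains_japanese` is a hypothetical helper, and because the ideograph block is shared with Chinese it is a rough filter rather than real language detection.

```python
def contains_japanese(text):
    """Return True if text has at least one kana or CJK ideograph."""
    for ch in text:
        code = ord(ch)
        if (
            0x3040 <= code <= 0x309F      # Hiragana
            or 0x30A0 <= code <= 0x30FF   # Katakana
            or 0x4E00 <= code <= 0x9FFF   # CJK Unified Ideographs (shared with Chinese)
        ):
            return True
    return False


for sample in ["カツカレーは美味しい", "안녕하세요", "hello"]:
    print(sample, contains_japanese(sample))
```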

KeyError: 'ー'

It looks like 'ー' causes an issue on 0.1.10. Here is an example:

% cutlet
押忍！ ハト☆マツ学園男子寮！ DC　（12）　プラトーーーン の巻
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 8, in <module>
    sys.exit(main())
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cli.py", line 16, in main
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 129, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 193, in romaji_word
    return self.map_kana(kana)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 233, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 249, in get_single_mapping
    if pk: return self.table[pk][-1]
KeyError: 'ー'

I'm guessing that a repeated sequence of ー is the issue :/

Support Title Case

It should be possible to support title case, so that all words except particles are capitalized. So この世界の片隅に would be "Kono Sekai no Katasumi ni".
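
As a rough illustration of the requested behaviour, the sketch below capitalizes every word of an already-romanized string except those in a small particle set. The `PARTICLES` set and `title_case` function are hypothetical, and matching on spelling alone will mislabel words that merely look like particles; a real implementation would more likely rely on part-of-speech tags from the tokenizer.

```python
PARTICLES = {"wa", "ga", "o", "wo", "ni", "no", "de", "to", "e", "mo"}


def title_case(romaji, particles=PARTICLES):
    """Capitalize every space-separated word except known particles."""
    words = romaji.split(" ")
    return " ".join(
        w if w.lower() in particles else w[:1].upper() + w[1:]
        for w in words
    )


print(title_case("kono sekai no katasumi ni"))
```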

KeyError: 'ｰ'

I ran into another issue parsing a title of a book, ティンクル☆くるせいだｰすGoGo！(1). The error is below:

% cutlet
ティンクル☆くるせいだｰすGoGo！(1)
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 14, in <module>
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 127, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 191, in romaji_word
    return self.map_kana(kana)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 231, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 264, in get_single_mapping
    return self.table[kk]
KeyError: 'ｰ'

This may be related to the changes from #7 and/or #8. This string did not raise an error before either change. My guess is that the ☆ in the middle is causing some issues.
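
The character in the error appears to be the halfwidth prolonged sound mark (U+FF70) rather than the usual ー (U+30FC). One possible workaround is to apply NFKC normalization before conversion, which folds halfwidth kana into their standard forms. A minimal sketch; `fold_width` is a hypothetical helper, and note that NFKC also turns fullwidth ASCII such as ！ into !, which may or may not be wanted.

```python
import unicodedata


def fold_width(text):
    """Apply NFKC so halfwidth kana and fullwidth ASCII take their usual forms."""
    return unicodedata.normalize("NFKC", text)


print(fold_width("くるせいだ\uff70す"))
```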

How to use Exceptions properly?

Hey. First, thank you for your great work. I created an exceptions.csv like this:

"転生";"tensei"
"だって";"datte"
"ラノベ";"light novel"
"んですか";"ndesuka"
"でも";"demo"
"美少女";"bishoujo"
"んだが";"ndaga"

Only 転生 and ラノベ were applied. The others were not: for example, んだが was converted to "n-daga" and 美少女 became "bi-shoujo". Can you tell me what's wrong here?

FYI, I use the code below to apply the exceptions:

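import csv
import cutlet

katsu = cutlet.Cutlet()
# Each row is "<Japanese>";"<romaji>"; read as UTF-8 so the Japanese keys decode correctly.
with open("_exceptions.csv", encoding="utf-8", newline="") as fd:
    rd = csv.reader(fd, delimiter=";", quotechar='"')
    for row in rd:
        katsu.add_exception(row[0], row[1])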

KeyError being raised on certain characters

While experimenting with cutlet using the full unidic dictionary, I've had several KeyErrors being raised:

% cutlet
《月》
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 14, in <module>
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 122, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 175, in romaji_word
    return self.table[word.surface]
KeyError: '《'
% cutlet
くま　クマ　熊　ベアー　２【電子版特典付】
Traceback (most recent call last):
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/bin/cutlet", line 14, in <module>
    print(katsu.romaji(line.strip()))
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 122, in romaji
    roma = self.romaji_word(word)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 192, in romaji_word
    return self.map_kana(kana)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 201, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "/Users/ykim/.local/share/virtualenvs/sandbox-nIHPi2Hu/lib/python3.8/site-packages/cutlet/cutlet.py", line 234, in get_single_mapping
    return self.table[kk]
KeyError: '*'

Is this supposed to occur? I'm not sure whether cutlet is meant to handle full-width characters in sentences.
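
Both tracebacks end in a plain dictionary lookup, which raises for any character that has no entry. A tolerant lookup that substitutes a placeholder is one way such errors could be avoided. This is only a sketch with a toy table, not cutlet's actual mapping or fix; `map_with_fallback` and `toy_table` are hypothetical names.

```python
def map_with_fallback(text, table, unknown="?"):
    """Map each character through table, substituting unknown for missing keys."""
    return "".join(table.get(ch, unknown) for ch in text)


toy_table = {"く": "ku", "ま": "ma"}  # toy data, not cutlet's real table
print(map_with_fallback("くま《", toy_table))
```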

Converting to romaji when the text is tokenized

Hi Paul

Thanks for the awesome library. I have a problem where I'm trying to convert tokenized Japanese text to romaji.

"何人ですか?" is correctly converted to "Nanijindesu ka?".

But if I tokenize the text to [何, 人, ですか, ?] and convert each token to romaji, the result is incorrect because the surrounding context is missing.

How would I convert Japanese text to romaji so that I get two matching tokenized arrays?

KeyError: 'ヸ'

katsu.romaji("秋の日のヸオロンのためいきの身にしみてひたぶるにうら悲し。")

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_2037/678387450.py in <module>
----> 1 katsu.romaji("秋の日のヸオロンのためいきの身にしみてひたぶるにうら悲し。")

~/miniconda3/envs/janlpbook/lib/python3.7/site-packages/cutlet/cutlet.py in romaji(self, text, capitalize, title)
    143 
    144             # resolve split verbs / adjectives
--> 145             roma = self.romaji_word(word)
    146             if roma and out and out[-1] == 'っ':
    147                 out = out[:-1] + roma[0]

~/miniconda3/envs/janlpbook/lib/python3.7/site-packages/cutlet/cutlet.py in romaji_word(self, word)
    212             if word.char_type == 6 or word.char_type == 7: # hiragana/katakana
    213                 kana = jaconv.kata2hira(word.surface)
--> 214                 return self.map_kana(kana)
    215 
    216             # At this point this is an unknown word and not kana. Could be

~/miniconda3/envs/janlpbook/lib/python3.7/site-packages/cutlet/cutlet.py in map_kana(self, kana)
    252             nk = kana[ki + 1] if ki < len(kana) - 1 else None
    253             pk = kana[ki - 1] if ki > 0 else None
--> 254             out += self.get_single_mapping(pk, char, nk)
    255         return out
    256 

~/miniconda3/envs/janlpbook/lib/python3.7/site-packages/cutlet/cutlet.py in get_single_mapping(self, pk, kk, nk)
    285             else: return 'n'
    286 
--> 287         return self.table[kk]
    288 

KeyError: 'ヸ'

I think this is an old variant of "ヴィ".

Source: https://tatoeba.org/en/sentences/show/2478013

パン = pao

パン is transcribed as pao which I would say is wrong in any romanization system

Handle 踊り字

々〃ゝゞヽゞ