Comments (5)
I appreciate the advice. This is definitely a case of bad input and not cutlet's fault... and short of raising a NotJapaneseError or something when presented with unrecognized characters, there isn't much cutlet could do about it.
I'm currently thinking I should run unknown text through chardet or something first, before deciding what to do with it.
Thanks!
from cutlet.
Got it, thanks for the clarification!
from cutlet.
There isn't a specific cutlet feature for detecting non-Japanese text.
Besides checking the input string yourself using regexes or something similar, MeCab has a feature called char_type, which is present on the Nodes you get in fugashi. It doesn't recognize hangul specifically, but it has categories like ALPHA and SYMBOL (separate from the categories for kanji and kana) that should let you detect it.
I also haven't tried this before, but you could maybe add hangul to the mapping tables cutlet uses internally.
You might also be able to make use of unihandecode, which handles Korean and Japanese.
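For the "check the input string yourself using regexes" option, here is a minimal sketch. It assumes that matching on Unicode block ranges is good enough for book titles. The helper names contains_hangul and looks_japanese are made up for illustration and are not part of cutlet, fugashi, or MeCab.

```python
import re

# Hangul Jamo, Hangul Compatibility Jamo, and Hangul Syllables blocks.
HANGUL = re.compile(r"[\u1100-\u11FF\u3130-\u318F\uAC00-\uD7AF]")
# Hiragana, katakana, and the main CJK Unified Ideographs block.
JAPANESE = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")


def contains_hangul(text: str) -> bool:
    """Return True if any hangul character appears in text."""
    return HANGUL.search(text) is not None


def looks_japanese(text: str) -> bool:
    """Return True if text has at least one kana or kanji character.

    Kanji-only text may also be Chinese, so this is only a rough filter.
    """
    return JAPANESE.search(text) is not None
```

A title could then be skipped or flagged when contains_hangul(title) is true, before it is ever passed to cutlet. Kana is the stronger signal here, since the kanji range is shared with Chinese.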
from cutlet.
Having an option to throw an error on text that would be rendered as ? actually sounds like it might be a good idea. I'll think about it!
To give me a little more idea of your use case, are you using this on:
- text for URLs
- titles (of books, articles, etc.)
- prose sentences / paragraphs
- something else?
from cutlet.
I'm using it for book titles. Mostly Japanese light novels, but there are occasional non-Japanese ones thrown in too.
from cutlet.