
Comments (5)

Infinoid commented on May 30, 2024

I appreciate the advice. This is definitely a case of bad input and not cutlet's fault... and short of raising a NotJapaneseError or something when presented with unrecognized characters, there isn't much cutlet could do about it.

I'm currently thinking I should run unknown text through chardet or something first, before deciding what to do with it.

Thanks!

from cutlet.

polm commented on May 30, 2024

Got it, thanks for the clarification!


polm commented on May 30, 2024

cutlet has no specific feature for detecting non-Japanese text.

Besides checking the input string yourself with regexes or something similar, MeCab has a feature called char_type, which is available on the Nodes you get in fugashi. It doesn't recognize hangul specifically, but it has categories like ALPHA and SYMBOL (separate from the categories for kanji and kana) that should let you detect it.
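As a rough illustration of the "check the input string yourself" route, here is a minimal sketch using only Python's standard library. The function names and the choice of Unicode ranges are my own assumptions, not anything cutlet or fugashi provides, and the check is deliberately coarse: kanji-only text could just as well be Chinese.

```python
import re

# Hangul Jamo, Hangul Compatibility Jamo, and Hangul Syllables blocks.
HANGUL_RE = re.compile(r"[\u1100-\u11FF\u3130-\u318F\uAC00-\uD7AF]")

# Hiragana, Katakana, and CJK Unified Ideographs (kanji).
JAPANESE_RE = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")


def contains_hangul(text: str) -> bool:
    """Return True if any character falls in a hangul block."""
    return HANGUL_RE.search(text) is not None


def looks_japanese(text: str) -> bool:
    """Return True if text has at least one kana or kanji and no hangul.

    This is a heuristic (hypothetical helper), not real language detection.
    """
    return JAPANESE_RE.search(text) is not None and not contains_hangul(text)
```

Text that fails `looks_japanese` could then be skipped or routed to a different tool before it ever reaches cutlet.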

I also haven't tried this before, but you could maybe add hangul to the mapping tables cutlet uses internally.

You might also be able to make use of unihandecode, which handles Korean and Japanese.


polm commented on May 30, 2024

Having an option to throw an error on text that would be rendered as "?" actually sounds like it might be a good idea. I'll think about it!

To give me a little more idea of your use case, are you using this on:

  • text for URLs
  • titles (of books, articles, etc.)
  • prose sentences / paragraphs
  • something else?
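Until such an option exists, the error-on-unsupported-text idea can be approximated in user code. The sketch below is only an assumption about how that might look: `NotJapaneseError` and `romanize_strict` are hypothetical names, not part of cutlet, and it only guards against hangul rather than every character that would come out as "?". The romanizer is passed in as a plain callable so the wrapper does not depend on any particular library.

```python
import unicodedata
from typing import Callable


class NotJapaneseError(ValueError):
    """Hypothetical error type; cutlet itself does not define this."""


def romanize_strict(text: str, romanize: Callable[[str], str]) -> str:
    """Reject text containing hangul, otherwise delegate to `romanize`."""
    for ch in text:
        # Unicode names for hangul jamo and syllables all start with "HANGUL".
        if unicodedata.name(ch, "").startswith("HANGUL"):
            raise NotJapaneseError(
                f"unsupported character {ch!r} in {text!r}"
            )
    return romanize(text)
```

With cutlet installed, the callable would typically be the `romaji` method of a `cutlet.Cutlet()` instance, for example `romanize_strict(title, cutlet.Cutlet().romaji)`.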


Infinoid commented on May 30, 2024

I'm using it for book titles. Mostly Japanese light novels, but there are occasional non-Japanese ones thrown in too.

