Comments (4)

ZelphirKaltstahl commented on May 28, 2024

@abhi18av You could use a pinyin input method to type the text in again, if there is not too much of it.

The problem here is that the snippet of text you showed is not in any standard format and is even irregular at times. For example, you usually have the diacritic marks in front of the vowels, as follows:

wˇo yˇe

... but in other places you have them on top of the vowel, as they should be, at least visually:

j ̄ınni ́an

My guess is that this is text from OCR (optical character recognition). Is this correct? I also guess that it is not likely that such a format will be supported, as it is so irregular and uncommon.

from pinyin-convert.

derhuerst commented on May 28, 2024

... but in other places you have them on top of the vowel, as they should be, at least visually:

j ̄ınni ́an

note that i don't see them on top, so it might be related to fonts.

abhi18av commented on May 28, 2024

@ZelphirKaltstahl and @derhuerst , thanks for the suggestions and pointing out the font issue.

The pinyin input won't be a viable alternative, since I don't really wish to type in everything again. It seems some text processing is needed here.

But my question is whether it's possible to convert well-formed pinyin to hanzi at all.

It seems possible, but I've skimmed GitHub and the net and couldn't really find anything useful.

ZelphirKaltstahl commented on May 28, 2024

@derhuerst Ha, good catch! Interesting that a font would do that. I guess it is the same kind of effect as a ligature, where t and h are combined into one special glyph.

@abhi18av Unfortunately there is yet another problem with such text: the example j ̄ınni ́an is ambiguous. It could be jīn and nián, or it could be jīn, ni and án.
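To make the ambiguity concrete, here is a minimal sketch that lists every way to split a toneless string into syllables. The `SYLLABLES` set and the `segmentations` function are made up for this illustration; a real syllable inventory has roughly 400 entries, and tone marks are ignored here.

```python
# Tiny, hypothetical syllable inventory, just large enough for the example.
SYLLABLES = {"jin", "ni", "nian", "an"}


def segmentations(text, syllables=SYLLABLES):
    """Return every way to split `text` into known syllables."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Keep this head and try every split of the remainder.
            for rest in segmentations(text[end:], syllables):
                results.append([head] + rest)
    return results


print(segmentations("jinnian"))
# → [['jin', 'ni', 'an'], ['jin', 'nian']]
```

Both splits are valid syllable sequences, so a program has no way to pick one without extra context.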
Converting pinyin to hanzi fully automatically is difficult, because one pinyin syllable may correspond to many hanzi. That is why, with a pinyin input method, you usually have to select the hanzi you want by pressing a number. Good input methods are clever about this and estimate probabilities for character combinations, so that they put the most probable hanzi first, but even that is sometimes wrong for what you want to write. For 今年 (jīn nián) it would probably work well, since it is a very common word.
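A rough sketch of that ranking idea, assuming a lookup table from toneless syllables to candidate words with counts. The `CANDIDATES` table, its numbers and `rank_candidates` are invented placeholders, not data from any real input method.

```python
# Hypothetical table: toneless syllables -> {hanzi word: made-up count}.
CANDIDATES = {
    ("jin", "nian"): {"今年": 120, "近年": 30},
    ("jin",): {"今": 50, "金": 40, "近": 35, "进": 45},
}


def rank_candidates(syllables, table=CANDIDATES):
    """Return candidate hanzi for the syllables, most frequent first."""
    counts = table.get(tuple(syllables), {})
    return sorted(counts, key=counts.get, reverse=True)


print(rank_candidates(["jin", "nian"]))
# → ['今年', '近年']
```

A real input method would get such counts from a large corpus and would also look at the neighbouring words, so the first candidate is only a guess that still needs checking.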
Maybe a combined algorithm that learns from real texts, clusters them into topics and then makes topic-specific predictions of the meaning you want to express with some pinyin would do a good job.
For the specific case of parsing your text, you could try to find all the rules for rewriting it into proper pīnyīn, then write a specialized program and check whether the produced pinyin makes sense. Then you could put it into some input method and let it guess the characters from the pīnyīn. Finally you would check whether the 汉字 make sense and choose alternative 汉字 where they do not.
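Such a specialized program could start out like the sketch below. It assumes the text only has the two irregularities seen in the samples: a spacing tone mark in front of the vowel (as in wˇo), or a space carrying a combining mark in front of the vowel, together with a dotless ı. The rule set and the name `normalize_pinyin` are my own guesses from those two samples, so the real text will probably need more rules.

```python
import re
import unicodedata

# Spacing tone marks and their combining equivalents.
SPACING_TO_COMBINING = {
    "\u02C9": "\u0304",  # ˉ macron, first tone
    "\u02CA": "\u0301",  # ˊ acute, second tone
    "\u02C7": "\u030C",  # ˇ caron, third tone
    "\u02CB": "\u0300",  # ˋ grave, fourth tone
}
COMBINING = "\u0304\u0301\u030C\u0300"


def normalize_pinyin(text):
    """Rewrite OCR-style pinyin so every tone mark sits on its vowel."""
    # Compose marks that already follow their vowel, so later steps skip them.
    text = unicodedata.normalize("NFC", text)
    # Turn spacing tone marks into combining marks.
    for spacing, combining in SPACING_TO_COMBINING.items():
        text = text.replace(spacing, combining)
    # Drop the space that a stray combining mark is attached to.
    text = re.sub(f" (?=[{COMBINING}])", "", text)
    # Replace the dotless ı with a normal i.
    text = text.replace("\u0131", "i")
    # Move a mark standing in front of a vowel to just behind that vowel.
    text = re.sub(f"([{COMBINING}])([aeiouü])", r"\2\1", text)
    # Compose vowel + mark into single characters such as ǒ.
    return unicodedata.normalize("NFC", text)


print(normalize_pinyin("w\u02C7o y\u02C7e"))              # → wǒ yě
print(normalize_pinyin("j \u0304\u0131nni \u0301an"))     # → jīnnián
```

The output still has to be checked by hand, and it does not resolve the syllable ambiguity mentioned above.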