Comments (4)
@abhi18av You could use a pinyin input method to retype the text, if there is not too much of it.
The problem here is that the snippet of text you showed is not in any standard format and is even irregular at times. For example, you usually have the diacritic marks in front of the vowels, as follows:
wˇo yˇe
... but in other places you have them on top of the vowel, as they should be, at least visually:
j ̄ınni ́an
My guess is that this text comes from OCR (optical character recognition). Is this correct? I also guess that such a format is unlikely to be supported, as it is so irregular and uncommon.
from pinyin-convert.
... but in other places you have them on top of the vowel, as they should be, at least visually:
j ̄ınni ́an
Note that I don't see them on top, so it might be related to fonts.
@ZelphirKaltstahl and @derhuerst , thanks for the suggestions and pointing out the font issue.
The pinyin input won't be a viable alternative, since I don't really wish to type in everything again. It seems some text processing is needed here.
But my question is whether it's possible to convert well-formed pinyin to hanzi at all.
It seems possible, but I've skimmed GitHub and the net and couldn't really find anything useful.
@derhuerst Ha, good catch! Interesting that a font would do that. I guess it is a similar or the same effect as taking t and h and combining them into one special character.
@abhi18av Unfortunately, there is yet another problem with such text. The example j ̄ınni ́an is ambiguous: it could be jīn and nián, or it could be jīn, ni and án.
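To make the ambiguity concrete, here is a minimal Python sketch that lists every way to split a toneless string into syllables. The function name `segmentations` and the tiny syllable inventory are assumptions for illustration only; a real tool would need the full pinyin syllable table and would also have to take the tone marks into account.

```python
# Toy syllable inventory: an assumption for illustration, not a full pinyin table.
SYLLABLES = {"jin", "ni", "an", "nian"}


def segmentations(text, syllables=SYLLABLES):
    """Return every way to split `text` into syllables from the inventory."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Keep this prefix and recurse on the remainder of the string.
            for rest in segmentations(text[end:], syllables):
                results.append([head] + rest)
    return results


print(segmentations("jinnian"))
# [['jin', 'ni', 'an'], ['jin', 'nian']]
```

Both readings come out as valid, so something beyond the spelling (tone marks, a dictionary, or context) has to decide between them.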
Converting pinyin to hanzi fully automatically is difficult, because one pinyin syllable may correspond to many hanzi. That is why, when using a pinyin input method, you usually have to select the hanzi you want by pressing a number. Good input methods are clever about this and estimate probabilities for character combinations, so that they put the most probable hanzi first, but even that is sometimes wrong for what you want to write. In the example of 今年 (jīn nián) it would probably work well, since it is a very common word.
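As a rough illustration of how such a ranking could work, the Python sketch below builds every hanzi combination for a list of syllables and sorts them by a word count. Everything here is hypothetical: the names `CHAR_CANDIDATES`, `WORD_COUNTS` and `rank_candidates` are made up, the candidate lists are deliberately incomplete, and the count is invented rather than taken from a corpus. Syllables are written with tone numbers (`jin1`, `nian2`) to keep the keys plain ASCII.

```python
from itertools import product

# A few hanzi per syllable; the lists are incomplete on purpose.
CHAR_CANDIDATES = {
    "jin1": ["今", "金", "斤"],
    "nian2": ["年", "黏"],
}

# Invented count for illustration only, not measured from real text.
WORD_COUNTS = {"今年": 100}


def rank_candidates(syllables):
    """Return all hanzi combinations, most frequent word first."""
    combos = (
        "".join(chars)
        for chars in product(*(CHAR_CANDIDATES[s] for s in syllables))
    )
    # Unknown combinations get a count of 0; sorted() is stable, so they
    # keep their original relative order behind the known words.
    return sorted(combos, key=lambda word: WORD_COUNTS.get(word, 0), reverse=True)


print(rank_candidates(["jin1", "nian2"])[0])
# 今年
```

A real input method would use far richer statistics than a single word count, but the basic idea of ranking candidates by how likely the combination is stays the same.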
Maybe a combined algorithm would do a good job: one that learns from real texts to cluster them into topics and then makes topic-specific predictions of the meaning you want to express with some pinyin.
For the specific case of parsing your text, you could try to find all the rules needed to rewrite it into proper pīnyīn, write a specialized program that applies them, and check whether the produced pinyin makes sense. Then you could feed it into an input method and let it guess the characters from the pīnyīn. Finally, you would check whether the 汉字 make sense and choose alternative 汉字 where they do not.
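The first step, rewriting the irregular text into proper pinyin, could start from something like the following Python sketch. It covers only the two patterns seen in this thread: a spacing tone mark placed in front of a vowel, and a combining mark that sits on a space in front of a (possibly dotless) vowel. The name `normalize_pinyin` and the rule set are assumptions; real OCR output will likely need more rules, and a space in front of a combining mark is assumed here to be a carrier for the mark rather than a word boundary.

```python
import re
import unicodedata

# Spacing tone marks mapped to the combining marks used in pinyin.
SPACING = {
    "\u00af": "\u0304", "\u02c9": "\u0304",  # macron, tone 1
    "\u00b4": "\u0301", "\u02ca": "\u0301",  # acute, tone 2
    "\u02c7": "\u030c",                      # caron, tone 3
    "\u02cb": "\u0300",                      # grave, tone 4
}
COMBINING = "\u0304\u0301\u030c\u0300"
VOWELS = "aeiou\u00fc\u0131"  # includes u-umlaut and dotless i

# Either an optional space plus a combining mark, or a spacing mark,
# followed by the vowel the mark belongs to.
PATTERN = re.compile(
    "(?: ?([" + COMBINING + "])|([" + "".join(SPACING) + "]))([" + VOWELS + "])"
)


def normalize_pinyin(text):
    """Move a tone mark that precedes a vowel onto that vowel."""
    def attach(match):
        combining, spacing, vowel = match.groups()
        mark = combining or SPACING[spacing]
        base = "i" if vowel == "\u0131" else vowel  # restore the dotted i
        # NFC composes base letter and mark into one precomposed character.
        return unicodedata.normalize("NFC", base + mark)
    return PATTERN.sub(attach, text)


print(normalize_pinyin("w\u02c7o y\u02c7e"))
print(normalize_pinyin("j \u0304\u0131nni \u0301an"))
```

With the two examples from this thread, the first call gives wǒ yě and the second gives jīnnián. The output still has to be checked by hand, since the sketch cannot tell where one syllable ends and the next begins.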