ikatyang / cjk-regex: regular expression for matching CJK text
License: MIT License
This is a follow-up to #47.
I don't think it is sustainable to maintain a long list of Unicode blocks (although, unfortunately, that is what most isCJK JS utilities do, as far as I know). We should take a step forward and make use of the properties defined in the UCD and maintained by Unicode experts. For example, we can choose all encoded characters satisfying the following constraints:
Script=Han
General_Category=Other_Letter|Letter_Number|Other_Symbol
The semantics of General_Category are described here. By doing so, we can abstract away from the concrete Unicode blocks and work with character properties instead.
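To make the two constraints above concrete, here is a minimal sketch using ES2018 Unicode property escapes. It assumes a runtime that supports `\p{…}` with the `u` flag; the names `hanLetterSource` and `isHanLetter` are made up for illustration and are not part of cjk-regex. Since JavaScript (without the newer `v` flag) has no set intersection in character classes, the sketch uses a lookahead for the script and a character class for the category.

```javascript
// Script=Han AND General_Category in {Other_Letter, Letter_Number, Other_Symbol}.
// The lookahead checks the script; the character class checks the category.
const hanLetterSource = String.raw`(?=\p{Script=Han})[\p{Lo}\p{Nl}\p{So}]`;

// Hypothetical helper: true when `char` is exactly one matching character.
function isHanLetter(char) {
  return new RegExp('^(?:' + hanLetterSource + ')$', 'u').test(char);
}

console.log(isHanLetter('中')); // Other_Letter, Script=Han
console.log(isHanLetter('〇')); // Letter_Number (U+3007), Script=Han
console.log(isHanLetter('。')); // Punctuation, Script=Common
```

Nothing here names a Unicode block; the result depends only on the properties of each code point in the Unicode version shipped with the engine.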
Character: An association between an abstract character and a code point (D11, Encoded Character, defined here)
Punctuation: Any character with General_Category = Punctuation
cjk-punctuation: (Some list of blocks, to be discussed; I mostly agree with the current blocks except for Hangul Syllables)
Letter: Any character with General_Category = Other_Letter | Letter_Number | Other_Symbol
cjk-letter: A Letter with Script = Han, Katakana, Hiragana, or Hangul
Other: Any character that is neither cjk-punctuation nor cjk-letter
As far as I can tell from npm, prettier is the only dependent of this package, so we can rethink its use case:
According to prettier/prettier#3026, the requirements of printer-markdown can be rephrased in the new terminology as:
Requirement 1 follows the requirements of Chinese Text Layout, Japanese Text Layout, and Korean Text Layout.
Although these documents all specify complicated line-breaking rules, we can and should implement only a tiny subset of them. On this principle, Requirement 2 is acceptable for both Japanese and Chinese. However, as noted on p. 518 of CJKV Information Processing, Korean text is composed of Hangul and uses conventional spaces, which makes it closer to Western typography than to Chinese or Japanese. So we had better do nothing between Hangul syllables or jamo. I guess this is why Hangul Syllables is currently categorized as cjk_punctuations.
Requirement 3 is acceptable as-is.
We should split cjk-letter into two classes:
cj-letter: cjk-letter with Script = Han, Katakana, or Hiragana
Hangul: cjk-letter with Script = Hangul
We should also revise the requirement to match the W3C typography requirements.
This part should be done on the prettier side, but it implies that cjk-regex should expose one more interface: Hangul.
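As a rough illustration of the proposed split, the sketch below builds the two classes from Unicode properties. The names `cjLetter` and `hangul` are hypothetical and do not reflect the current cjk-regex API, and the sketch assumes a runtime with ES2018 property escapes.

```javascript
// Letter as defined in this issue: Other_Letter | Letter_Number | Other_Symbol.
const LETTER = String.raw`[\p{Lo}\p{Nl}\p{So}]`;

// cj-letter: Letter with Script = Han, Katakana, or Hiragana.
const cjLetter = new RegExp(
  String.raw`(?=[\p{Script=Han}\p{Script=Katakana}\p{Script=Hiragana}])` + LETTER,
  'u'
);

// Hangul: Letter with Script = Hangul (covers both syllables and jamo).
const hangul = new RegExp(String.raw`(?=\p{Script=Hangul})` + LETTER, 'u');

console.log(cjLetter.test('漢'), hangul.test('漢'));
console.log(cjLetter.test('한'), hangul.test('한'));
```

With two separate classes, a consumer such as prettier could apply the Chinese/Japanese line-breaking behaviour to cj-letter only and leave text between Hangul characters untouched.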
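The current implementation, which builds the regex without the unicode (u) flag, will become unmaintainable once we support SIP (Supplementary Ideographic Plane) characters, because every supplementary code point then has to be written out as a surrogate pair. We should use the unicode flag and transpile to ES5 with regexpu-core.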
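To show why the unicode flag matters here, the sketch below compares two ways of matching CJK Unified Ideographs Extension B (U+20000 to U+2A6DF). The ES5-style pattern is written by hand and is only roughly the kind of output a transpiler would produce; it is not actual regexpu-core output.

```javascript
const extB = String.fromCodePoint(0x20000); // first Extension B ideograph

// Without the u flag, "." matches one UTF-16 code unit, so a
// supplementary character (two code units) is not a single match.
console.log(/^.$/.test(extB));  // false
console.log(/^.$/u.test(extB)); // true

// With the u flag the range can be written directly in code points.
const unicodeAware = /^[\u{20000}-\u{2A6DF}]$/u;

// Hand-written ES5 equivalent: the same range spelled as surrogate pairs.
const es5Equivalent =
  /^(?:\uD840[\uDC00-\uDFFF]|[\uD841-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDEDF])$/;

console.log(unicodeAware.test(extB), es5Equivalent.test(extB)); // true true
```

The surrogate-pair form is what would have to be maintained by hand for every supplementary range, which is the maintenance burden the transpile step is meant to remove.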
We don't have to maintain the blocks ourselves; we can simply use unicode-data to generate the code points, filter out the ones we need, and convert them back into a unicode regex.
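The last step, converting a filtered set of code points back into a unicode regex, could look roughly like the sketch below. The input array is a hand-picked stand-in for whatever the Unicode data package would provide, and the helper names (`toRanges`, `toCharacterClass`) are assumptions made for illustration.

```javascript
// Merge code points into sorted, contiguous [start, end] ranges.
function toRanges(codePoints) {
  const sorted = Array.from(new Set(codePoints)).sort((a, b) => a - b);
  const ranges = [];
  for (const cp of sorted) {
    const last = ranges[ranges.length - 1];
    if (last && cp === last[1] + 1) {
      last[1] = cp; // extend the current range
    } else {
      ranges.push([cp, cp]); // start a new range
    }
  }
  return ranges;
}

// Render one code point as a \u{...} escape (valid only with the u flag).
function escapeCodePoint(cp) {
  return '\\u{' + cp.toString(16).toUpperCase() + '}';
}

// Build the source of a character class covering all given code points.
function toCharacterClass(codePoints) {
  const body = toRanges(codePoints)
    .map(([start, end]) =>
      start === end
        ? escapeCodePoint(start)
        : escapeCodePoint(start) + '-' + escapeCodePoint(end)
    )
    .join('');
  return '[' + body + ']';
}

// Hypothetical sample; real input would be the filtered code point list.
const sample = [0x4e00, 0x4e01, 0x4e02, 0x3007, 0x20000, 0x20001];
const source = toCharacterClass(sample);
console.log(source); // [\u{3007}\u{4E00}-\u{4E02}\u{20000}-\u{20001}]
const regex = new RegExp(source, 'u');
```

The resulting source string can then be fed to a transpiler to obtain an ES5-compatible pattern.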
The extra computation this introduces at runtime is wasteful: once we pick a unicode-data version, the regex is generated deterministically, so we can use prepack to produce a pre-evaluated build.
Thanks for your patience in reading this long issue.