
cjk-regex's Issues

discussion: rewrite the package to meet w3c typography requirements and complete character coverage

This is a follow-up to #47.

Properties Coverage instead of Blocks Coverage

I don't think it is sustainable to maintain a long list of Unicode blocks (which, unfortunately, is what most isCJK JS utilities do, as far as I know). We should take a step forward and use the properties defined in the UCD (Unicode Character Database) and maintained by Unicode experts. For example, we can choose all encoded characters that satisfy the following constraints:

Script=Han
General_Category=Other_Letter|Letter_Number|Other_Symbol

The semantics of General_Category are defined by the Unicode Standard. By doing so, we can abstract away from concrete Unicode blocks and work with character properties.
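The two constraints above can be sketched with Unicode property escapes, assuming a runtime that supports `\p{...}` with the `u` flag. The helper name `isHanLetter` is hypothetical and only for illustration; it is not part of cjk-regex.

```javascript
// Matches exactly one character that satisfies both constraints:
//   Script=Han, and
//   General_Category is Other_Letter (Lo), Letter_Number (Nl) or Other_Symbol (So).
// The lookahead intersects the Script test with the General_Category class.
const hanLetter = /^(?=\p{Script=Han})[\p{Lo}\p{Nl}\p{So}]$/u;

// Hypothetical helper, expects a single character (one code point).
function isHanLetter(char) {
  return hanLetter.test(char);
}

console.log(isHanLetter("中"));         // U+4E2D: Lo, Script=Han
console.log(isHanLetter("〇"));         // U+3007: Nl, Script=Han
console.log(isHanLetter("\u{20000}"));  // U+20000: Lo, Script=Han (supplementary plane)
console.log(isHanLetter("。"));         // U+3002: punctuation, Script=Common
console.log(isHanLetter("a"));          // Latin letter
```

No block list appears anywhere in the pattern, so coverage follows whatever Unicode version the runtime ships with.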

Terminology accordance with Unicode

Characters: An association between an abstract character and a code point (D11, Encoded Character, in the Unicode Standard)
Punctuation: Any character with General_Category = Punctuation

Our definitions for our specific purpose

cjk-punctuation: (A list of blocks, to be discussed; I mostly agree with the current blocks, except for Hangul Syllables)
Letter: Any character with General_Category = Other_Letter | Letter_Number | Other_Symbol
cjk-letter: A Letter with Script = Han, Katakana, Hiragana, or Hangul.
Other: Any character that is neither cjk-punctuation nor cjk-letter.
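A minimal sketch of these definitions as a classifier. The block list for cjk-punctuation is still open in this issue, so the two blocks used below are a placeholder assumption, and `classify` is a hypothetical name.

```javascript
// cjk-letter: a Letter (Lo | Nl | So) whose Script is Han, Katakana, Hiragana or Hangul.
const cjkLetter =
  /^(?=[\p{Script=Han}\p{Script=Katakana}\p{Script=Hiragana}\p{Script=Hangul}])[\p{Lo}\p{Nl}\p{So}]$/u;

// ASSUMPTION: the real block list is "to be discussed". As a stand-in, take
// punctuation (General_Category = Punctuation) inside two blocks:
//   CJK Symbols and Punctuation   (U+3000-U+303F)
//   Halfwidth and Fullwidth Forms (U+FF00-U+FFEF)
const cjkPunctuation = /^(?=[\u3000-\u303F\uFF00-\uFFEF])\p{P}$/u;

// Hypothetical classifier, expects a single character (one code point).
function classify(char) {
  if (cjkPunctuation.test(char)) return "cjk-punctuation";
  if (cjkLetter.test(char)) return "cjk-letter";
  return "other";
}

for (const char of ["中", "あ", "ア", "한", "。", "，", "A", "!"]) {
  console.log(char, classify(char));
}
```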

Compliance with W3C typography requirements

As far as I can tell from npm, prettier is the only dependent of this project, so we can rethink the use case of this package:

According to prettier/prettier#3026, the requirements of printer-markdown can be rephrased in the new terminology as:

  1. put line(" " or "\n") between Other and cjk-letter
  2. put softline("" or "\n") between cjk-letter and cjk-letter
  3. put nothing between Other and cjk-punctuation, i.e. they're considered not breakable

Requirement 1 follows the requirements of Chinese Text Layout, Japanese Text Layout, and Korean Text Layout.

Although these documents all specify complicated line-breaking rules, we can, and should, implement only a tiny subset of them. On this principle, Requirement 2 is acceptable for both Japanese and Chinese. However, as noted on p. 518 of CJKV Information Processing, Korean text is composed of Hangul and uses conventional spaces, which makes it closer to Western typography than to Chinese or Japanese. So we had better insert nothing between Hangul Syllables/Jamo. I guess this is why Hangul Syllables is categorized under cjk_punctuations.

Requirement 3 is acceptable as-is.

Solution

We should split cjk-letter into two classes:
cj-letter: cjk-letter with Script=Han, Katakana, Hiragana
Hangul: cjk-letter with Script=Hangul

And revise the requirements to match the W3C typography requirements:

  1. put line(" " or "\n") between Other and cjk-letter
  2. put softline("" or "\n") between cjk-letter and cjk-letter, except between Hangul and Hangul.
  3. put nothing between Other and cjk-punctuation, i.e. they're considered not breakable

This part should be done on the prettier side, but it implies that cjk-regex should expose one more interface: Hangul.
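The revised requirements could be sketched on the consumer side roughly as follows. This is not prettier's actual printer code: `kindOf` and `separatorBetween` are hypothetical names, the cjk-punctuation block list is a placeholder assumption, and returning "none" for Hangul next to Hangul is one reading of "do nothing". Pairs the three requirements do not mention are reported as "unspecified".

```javascript
const cjLetter =
  /^(?=[\p{Script=Han}\p{Script=Katakana}\p{Script=Hiragana}])[\p{Lo}\p{Nl}\p{So}]$/u;
const hangul = /^(?=\p{Script=Hangul})[\p{Lo}\p{Nl}\p{So}]$/u;
// ASSUMPTION: placeholder block list, the real one is still to be discussed.
const cjkPunctuation = /^(?=[\u3000-\u303F\uFF00-\uFFEF])\p{P}$/u;

// Hypothetical: classify one character into the classes proposed above.
function kindOf(char) {
  if (cjkPunctuation.test(char)) return "cjk-punctuation";
  if (hangul.test(char)) return "hangul";
  if (cjLetter.test(char)) return "cj-letter";
  return "other";
}

// Hypothetical: which separator the requirements ask for between two characters.
function separatorBetween(left, right) {
  const a = kindOf(left);
  const b = kindOf(right);
  const isLetter = (kind) => kind === "cj-letter" || kind === "hangul";

  // Requirement 3: nothing between Other and cjk-punctuation.
  if ((a === "other" && b === "cjk-punctuation") ||
      (a === "cjk-punctuation" && b === "other")) {
    return "none";
  }
  // Requirement 1: line between Other and cjk-letter.
  if ((a === "other" && isLetter(b)) || (isLetter(a) && b === "other")) {
    return "line";
  }
  // Requirement 2: softline between cjk-letters, except Hangul next to Hangul.
  if (isLetter(a) && isLetter(b)) {
    return a === "hangul" && b === "hangul" ? "none" : "softline";
  }
  // Not covered by the three requirements.
  return "unspecified";
}

console.log(separatorBetween("a", "中"));
console.log(separatorBetween("中", "文"));
console.log(separatorBetween("한", "글"));
```

Keeping `hangul` as a separate pattern is what makes the Hangul exception expressible, which is why the package would need to expose it.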

Technical Notes

  1. The current regex implementation, which does not use the unicode flag, will become unmaintainable once we support SIP (Supplementary Ideographic Plane) characters. We should use the unicode flag and transpile to ES5 with regexpu-core.

  2. We don't have to maintain the blocks ourselves; we can use unicode-data to generate the code points, filter the ones we need, and convert them back into a unicode regex.

  3. The extra computation this introduces is wasteful at runtime: once we pick a unicode-data version, the regex is generated deterministically, so we can use prepack to produce a pre-evaluated build.
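As a rough illustration of note 2, the sketch below turns a list of code points into a `u`-flag character class. The input array is hand-written sample data, not the output of any Unicode data package, and `toRanges` and `toUnicodeRegex` are hypothetical helper names. Transpiling the result to ES5 (note 1) and pre-evaluating the build (note 3) are not shown.

```javascript
// Collapse code points into sorted, inclusive [start, end] ranges.
function toRanges(codePoints) {
  const sorted = [...new Set(codePoints)].sort((a, b) => a - b);
  const ranges = [];
  for (const cp of sorted) {
    const last = ranges[ranges.length - 1];
    if (last && cp === last[1] + 1) {
      last[1] = cp; // extend the current run
    } else {
      ranges.push([cp, cp]); // start a new run
    }
  }
  return ranges;
}

// Write a code point as a \u{...} escape, which requires the unicode flag.
const escapeCodePoint = (cp) => `\\u{${cp.toString(16).toUpperCase()}}`;

// Build a character class that matches exactly the given code points.
function toUnicodeRegex(codePoints) {
  const body = toRanges(codePoints)
    .map(([start, end]) =>
      start === end
        ? escapeCodePoint(start)
        : `${escapeCodePoint(start)}-${escapeCodePoint(end)}`
    )
    .join("");
  return new RegExp(`[${body}]`, "u");
}

// Hand-written sample: three BMP ideographs, U+3007, and two SIP ideographs.
const sample = [0x4e00, 0x4e01, 0x4e02, 0x3007, 0x20000, 0x20001];
const regex = toUnicodeRegex(sample);
console.log(regex.source);
console.log(regex.test("\u{20000}"));
```

With the `u` flag the SIP range is written directly as code points; without it, the same range would have to be spelled out as surrogate pairs, which is the maintenance burden note 1 refers to.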

Thanks for your patience reading this long issue. 😄
