
grapheme-splitter's Introduction

Background

In JavaScript there is not always a one-to-one relationship between string characters and what a user would call a separate visual "letter". Some symbols are represented by several characters. This can cause issues when splitting strings and inadvertently cutting a multi-char letter in half, or when you need the actual number of letters in a string.

For example, emoji characters like "🌷","🎁","💩","😜" and "👍" are represented by two JavaScript characters each (high surrogate and low surrogate). That is,

"🌷".length == 2

The combined emoji are even longer:

"🏳️‍🌈".length == 6
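The difference between the two lengths above comes from counting UTF-16 code units rather than code points. A minimal sketch using only built-in string methods (the helper name describe is made up for illustration):

```javascript
// .length counts UTF-16 code units; iterating a string yields code points.
const tulip = "\u{1F337}";                            // 🌷
const rainbowFlag = "\u{1F3F3}\uFE0F\u200D\u{1F308}"; // 🏳️‍🌈

// Hypothetical helper: report both counts for a string.
function describe(str) {
  return {
    codeUnits: str.length,
    codePoints: Array.from(str).length,
  };
}

console.log(describe(tulip));       // { codeUnits: 2, codePoints: 1 }
console.log(describe(rainbowFlag)); // { codeUnits: 6, codePoints: 4 }

// The two code units of the tulip are the high and low surrogate.
console.log(tulip.charCodeAt(0).toString(16)); // d83c
console.log(tulip.charCodeAt(1).toString(16)); // df37
```

Note that even the code point count (4) is not the number of user-perceived symbols (1) for the combined emoji.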

What's more, some languages often include combining marks - characters that are used to modify the letters before them. Common examples are the German letter ü and the Spanish letter ñ. Sometimes they can be represented alternatively both as a single character and as a letter + combining mark, with both forms equally valid:

var two = "ñ"; // unnormalized two-char n+◌̃  , i.e. "\u006E\u0303";
var one = "ñ"; // normalized single-char, i.e. "\u00F1"
console.log(one!=two); // prints 'true'
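As a small illustration of how normalization reconciles the two forms, here is a sketch using the built-in String.prototype.normalize with explicit escapes so the representation is unambiguous:

```javascript
const decomposed = "\u006E\u0303"; // n followed by a combining tilde
const precomposed = "\u00F1";      // ñ as a single code point

console.log(decomposed === precomposed);                  // false
console.log(decomposed.normalize("NFC") === precomposed); // true
console.log(precomposed.normalize("NFD") === decomposed); // true
console.log(decomposed.length);                           // 2
console.log(decomposed.normalize("NFC").length);          // 1
```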

Unicode normalization, as performed by the popular punycode.js library or ECMAScript 6's String.prototype.normalize, can sometimes fix those differences and turn two-char sequences into single characters. But it is not enough in all cases. Some languages like Hindi make extensive use of combining marks on their letters, and these combinations have no dedicated single-code-point representation in Unicode, due to the sheer number of possible combinations. For example, the Hindi word "अनुच्छेद" is composed of 5 letters and 3 combining marks:

अ + न + ु + च + ् + छ + े + द

which is in fact just 5 user-perceived letters:

अ + नु + च् + छे + द

and which Unicode normalization would not combine properly. There are also the unusual letter+combining mark combinations which have no dedicated Unicode codepoint. The string Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘ obviously has 5 separate letters, but is in fact comprised of 58 JavaScript characters, most of which are combining marks.
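To make the Hindi case concrete, the following sketch checks that NFC normalization leaves the word at 8 code units, and counts letters and marks with Unicode property escapes. It uses only built-in features; the variable names are illustrative.

```javascript
// अनुच्छेद written with explicit escapes.
const word = "\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926";

// NFC has nothing to compose here, so the string keeps its length.
const composed = word.normalize("NFC");
console.log(composed.length);   // 8
console.log(composed === word); // true

// \p{L} matches letters, \p{M} matches combining marks.
const letters = composed.match(/\p{L}/gu) || [];
const marks = composed.match(/\p{M}/gu) || [];
console.log(letters.length, marks.length); // 5 3
```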

Enter the grapheme-splitter.js library. It can be used to properly split JavaScript strings into what a human user would call separate letters (or "extended grapheme clusters" in Unicode terminology), no matter what their internal representation is. It is an implementation of the Default Grapheme Cluster Boundary rules of UAX #29.
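To see why a full UAX #29 implementation is needed, consider a deliberately naive approximation that only attaches combining marks to the preceding code point. This is not the library's algorithm, and the function name naiveSplit is invented for this sketch. It handles the simple cases but breaks combined emoji apart:

```javascript
// NOT the library's algorithm: a naive stand-in that groups a base code point
// with any combining marks (\p{M}) that follow it.
function naiveSplit(str) {
  return str.match(/\P{M}\p{M}*|\p{M}+/gu) || [];
}

const hindi = "\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926"; // अनुच्छेद
const flag = "\u{1F3F3}\uFE0F\u200D\u{1F308}";                     // 🏳️‍🌈

console.log(naiveSplit("n\u0303o").length);  // 2 (ñ and o)
console.log(naiveSplit(hindi).length);       // 5
console.log(naiveSplit("\u{1F337}").length); // 1 (the u flag keeps surrogate pairs together)

// The zero width joiner (U+200D) is not a combining mark, so the naive
// approach cuts the combined emoji into three pieces instead of one.
console.log(naiveSplit(flag).length);        // 3
```

The grapheme cluster rules cover joiners, Hangul syllables, regional indicators and more, which is what the naive version lacks.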

Installation

You can use the index.js file directly as-is. Or you can install grapheme-splitter into your project using the npm command below:

$ npm install --save grapheme-splitter

Tests

To run the tests on grapheme-splitter, use the command below:

$ npm test

Usage

Just initialize and use:

var splitter = new GraphemeSplitter();

// split the string to an array of grapheme clusters (one string each)
var graphemes = splitter.splitGraphemes(string);

// iterate the string to an iterable iterator of grapheme clusters (one string each)
var graphemes = splitter.iterateGraphemes(string);

// or do this if you just need their number
var graphemeCount = splitter.countGraphemes(string);
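One typical use is truncating text without cutting a symbol in half. The sketch below is an assumption about how that could look: truncate is a hypothetical helper, and standInSplitter is a simplified substitute that only mimics the splitGraphemes method name so the example runs without the package installed. In a real project you would pass new GraphemeSplitter() instead.

```javascript
// Stand-in for illustration only; replace with `new GraphemeSplitter()`.
// It groups combining marks with their base code point and nothing more.
const standInSplitter = {
  splitGraphemes: (str) => str.match(/\P{M}\p{M}*|\p{M}+/gu) || [],
};

// Hypothetical helper: keep at most maxGraphemes clusters, then add an ellipsis.
function truncate(splitter, text, maxGraphemes) {
  const graphemes = splitter.splitGraphemes(text);
  if (graphemes.length <= maxGraphemes) return text;
  return graphemes.slice(0, maxGraphemes).join("") + "\u2026";
}

const text = "ab\u{1F337}cd"; // ab🌷cd

// Slicing by code units leaves a lone high surrogate at the end.
console.log(text.slice(0, 3) === "ab\uD83C");                       // true

// Slicing by clusters keeps the emoji whole.
console.log(truncate(standInSplitter, text, 3) === "ab\u{1F337}\u2026"); // true
console.log(truncate(standInSplitter, "abc", 5));                   // abc
```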

Examples

var splitter = new GraphemeSplitter();

// plain latin alphabet - nothing spectacular
splitter.splitGraphemes("abcd"); // returns ["a", "b", "c", "d"]

// two-char emojis and six-char combined emoji
splitter.splitGraphemes("🌷🎁💩😜👍🏳️‍🌈"); // returns ["🌷","🎁","💩","😜","👍","🏳️‍🌈"]

// diacritics as combining marks, 10 JavaScript chars
splitter.splitGraphemes("Ĺo͂řȩm̅"); // returns ["Ĺ","o͂","ř","ȩ","m̅"]

// individual Korean characters (Jamo), 4 JavaScript chars
splitter.splitGraphemes("뎌쉐"); // returns ["뎌","쉐"]

// Hindi text with combining marks, 8 JavaScript chars
splitter.splitGraphemes("अनुच्छेद"); // returns ["अ","नु","च्","छे","द"]

// demonic multiple combining marks, 75 JavaScript chars
splitter.splitGraphemes("Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"); // returns ["Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍","A̴̵̜̰͔ͫ͗͢","L̠ͨͧͩ͘","G̴̻͈͍͔̹̑͗̎̅͛́","Ǫ̵̹̻̝̳͂̌̌͘","!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"]

TypeScript

Grapheme splitter includes TypeScript declarations.

import GraphemeSplitter = require('grapheme-splitter')

const splitter = new GraphemeSplitter()

const split: string[] = splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞')

Acknowledgements

This library is heavily influenced by Devon Govett's excellent grapheme-breaker CoffeeScript library at https://github.com/devongovett/grapheme-breaker with an emphasis on ease of integration and pure JavaScript implementation.

grapheme-splitter's People

Contributors

emilbader-ab, ffflorian, ianp, jlhwung, neoskunk, orling, petamoriken, wopian


grapheme-splitter's Issues

Hallanth letters are not split properly

I tried to split the letters of Hindi words using grapheme-splitter.

It works fine for almost all cases, except for hallanth (letters formed by combining two or more letters).

var GraphemeBreaker = require('grapheme-breaker');

console.log("Output is : " + GraphemeBreaker.break("लल्लनटॉप"));
Output  :  ल,ल्,ल,न,टॉ,प
Expected :  ल,ल्ल,न,टॉ,प

Is there a way to prevent emojis from turning into ASCII symbols?

Input:
'🙄😂❤😜✌👍'

Output:
[ '🙄', '😂', '❤', '😜', '✌', '👍' ] (actual)
[ '🙄', '😂', '♥', '😜', '✌', '👍' ] (what it looks like in code)

For some reason when pasting them into this comment they turned back into emojis, so I replaced them manually with what I see.

There is still a problem though, because if I copy them into another text field such as the address bar, I still see the ASCII shapes, even if I copy and paste them from what I see in this form. But if I copy and paste them from the original source (an HTML input field), they all stay as emoji, even in the address bar.

Here's an image of it from my console:
https://i.imgur.com/joYd4TF.png

Heart symbol not processed correctly

The symbol "\u200D\u2764\uFE0F\u200D" seems to be processed incorrectly.
I can string together an endless count of that symbol and it always counts as one grapheme, until the chain is interrupted by another character.

splitter.countGraphemes("x\u200D\u2764\uFE0F\u200Dx\u200D\u2764\uFE0F\u200D\u200D\u2764\uFE0F\u200D\u200D\u2764\uFE0F\u200Dx") === 3

(I would expect 7)

Example of how to use with Angular 5?

I can't use the lib because my file doesn't recognize the GraphemeSplitter function.

Can you provide a TypeScript example for Angular 5?

Support for Khmer language (non spacing mark U+17D2 COENG)

Thanks for your lib, it is very helpful.

However, I am experiencing issues with the Khmer language and the combining mark U+17D2 (see: https://r12a.github.io/scripts/khmer/block#char17D2), which is specific to Khmer and is used to attach the next consonant as a subscript of the previous one. For example, the glyph ញ្ច, which is the combination of the three code points ញ ្ ច, is treated by the splitter as the two glyphs ញ្ and ច. Note that it doesn't work as a ligature as in #12, but like the combinations of consonants and vowels in other Indic scripts (and such combinations are supported by the splitter).

Let me explain further with another example. The word ខ្ញុំ is composed of only one glyph. What is interesting with this word is that the vowel OM (U + NIKAHIT) is applied to the subscript consonant NYO and not to the consonant KHA but all the sequence forms only one glyph and it looks like the vowel is applied to the first consonant KHA:

  • ‎1781 KHMER LETTER KHA
  • ‎17D2 KHMER SIGN COENG
  • ‎1789 KHMER LETTER NYO
  • ‎17BB KHMER VOWEL SIGN U
  • ‎17C6 KHMER SIGN NIKAHIT

But the splitter considers this glyph as two glyphs (note that the combining mark ្ COENG is not discarded but just combined with ខ KHA as the algo considers it as Other character): ខ្ ញុំ

Btw, some useful tools:
https://r12a.github.io/app-analysestring/
https://r12a.github.io/uniview/
https://r12a.github.io/pickers/khmer/

And sample text for testing purpose:
ខ្ញុំ
កញ្ចក់

Combined emoji (ZWJ) are being split

  • 🏳️‍🌈 is being split as ["🏳️‍", "🌈"]
  • 🏃🏽‍♀️ is being split as ["🏃", "🏽‍", "♀️"]

keycap emoji like 8⃣ are correct

These should all be single emoji per character
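When reporting or debugging a split like this, it helps to list the code points involved. A small sketch with built-in methods only (toCodePoints is a name made up for this example):

```javascript
// Hypothetical helper: list the code points of a string in U+XXXX notation.
function toCodePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// 🏳️‍🌈: flag, variation selector, zero width joiner, rainbow
console.log(toCodePoints("\u{1F3F3}\uFE0F\u200D\u{1F308}"));
// [ 'U+1F3F3', 'U+FE0F', 'U+200D', 'U+1F308' ]

// 🏃🏽‍♀️: runner, skin tone modifier, zero width joiner, female sign, variation selector
console.log(toCodePoints("\u{1F3C3}\u{1F3FD}\u200D\u2640\uFE0F"));
// [ 'U+1F3C3', 'U+1F3FD', 'U+200D', 'U+2640', 'U+FE0F' ]
```

The U+200D entries are the zero width joiners that tie each sequence into a single emoji.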

Emojis split up unexpectedly (e.g. https://emojipedia.org/ninja-cat/)

Hi there,

first of all, thanks a lot for this library and the efforts you put in!

I've got a scenario, where some emojis seem to be split up the wrong way.

When splitting up the following emoji-sequence:
🐱‍💻🐱‍🚀🐱‍👤

I get the following string-tokens (notice the first two matching and the ninja-cat being split into two):
[image: screenshot of the resulting string tokens]

Is there an easy explanation for the behavior or is there a general guideline on which emojis are supported and which aren't?

I'm on Windows 10.

Thanks!

Telugu Letter Pa incorrect split

The character combination U+0C2A, U+0C4D, U+0C2A splits up into 2 graphemes, whereas my display only shows 1 grapheme:

var x = "ప" + "్" + "ప";
console.log(x)
console.log(new (require('grapheme-splitter'))().splitGraphemes(x))

gives

ప్ప
[ 'ప్', 'ప' ]

Refactor plan

@orling

Out of personal interest in Unicode, I would like to refactor this library. Here are some thoughts that came to me:

  • Transcribe the whole library into TypeScript so that we do not have to maintain index.d.ts and index.js separately. Before we publish any new version, the TypeScript compiler will compile the source into JavaScript and generate the proper type declarations. This will introduce a breaking change, as const GraphemeSplitter = require("grapheme-splitter") will be replaced by import GraphemeSplitter from "grapheme-splitter".

  • Update the library to Unicode 11.0 and draft another branch to Unicode 12.0 beta

  • Write benchmark of the library and tune performance if needed

  • Add ESLint + Prettier to maintain good code quality

I would like to work on some of these items and push them to https://github.com/JLHwung/grapheme-splitter/tree/next

If you are open to changes to this library, I am happy to raise a PR once I finish the refactor.

publish to npm

Thank you so much!

This is so sorely needed! I searched and searched and only found this by happenstance, since you used the exact example I was searching for:



'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'

Would you accept a pull request with a package.json so that this can be published to npm?

Issue splitting emoji skin tone

Hi,

I am writing because it would seem that this emoji 👩🏿‍👩🏿‍👧🏿‍👧🏿 is not properly split by the splitter. I would say it should split it into 4 emoji faces but it is considered as only one. Might be tricky to figure out the right approach for this one.

Thanks
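For context, a sequence like this is treated as one grapheme cluster because its members are linked by zero width joiners (U+200D). If the individual faces are wanted, one possible post-processing step is to split on the joiner. This is only a sketch, assuming the emoji consists of four person-plus-skin-tone pairs joined by U+200D; it is not a feature of the library, and it would also break apart emoji that are meant to stay joined.

```javascript
const ZWJ = "\u200D";

// Assumed composition: woman, woman, girl, girl, each with the dark skin tone modifier.
const family = [
  "\u{1F469}\u{1F3FF}",
  "\u{1F469}\u{1F3FF}",
  "\u{1F467}\u{1F3FF}",
  "\u{1F467}\u{1F3FF}",
].join(ZWJ);

console.log(family.length);          // 19 code units
console.log(Array.from(family).length); // 11 code points

// Splitting on the joiner yields the four individual faces.
const members = family.split(ZWJ);
console.log(members.length);         // 4
console.log(members[0] === "\u{1F469}\u{1F3FF}"); // true
```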

Non-existent engine is specified in package.json

Hi all,

I think this line of code should change -

"npm": "~7.3.0"
and specify an npm engine that exists rather than a future version. This is currently preventing me from making a clean/fresh install of my application.

I started getting this error:

lerna ERR! npm ERR! code ENOTSUP
lerna ERR! npm ERR! notsup Unsupported engine for [email protected]: wanted: {"npm":"~7.3.0"} (current: {"node":"8.11.3","npm":"6.3.0"})
lerna ERR! npm ERR! notsup Not compatible with your version of node/npm: [email protected]
lerna ERR! npm ERR! notsup Not compatible with your version of node/npm: [email protected]
lerna ERR! npm ERR! notsup Required: {"npm":"~7.3.0"}
lerna ERR! npm ERR! notsup Actual:   {"npm":"6.3.0","node":"8.11.3"}

I'm always happy to upgrade when it makes sense, but I don't think there is a npm version 7.3.0. I also know we are using lerna and perhaps that is non-standard; however, I think the intent behind that line of code is to specify dependencies on engine. Is it possible/does it make sense to update this package?

It seems to me something in the node/npm/lerna ecosystem has changed and as a result this package is not installable under certain circumstances.

अनुच्छेद => अ नु च्छे द

अनुच्छेद should return the 4 strings ["अ", "नु", "च्छे", "द"] and not ["अ","नु","च्","छे","द"]. Basically how the cursor acts in the string. The cursor skips over the 4 characters or graphemes to be more accurate.

index.d.ts is not included in npm package

The file index.d.ts is not included in node_modules after installation,
as demonstrated by the following:

$ npm install --save grapheme-splitter
+ [email protected]
added 1 package in 7.826s

$ tree node_modules/grapheme-splitter

node_modules/grapheme-splitter
├── index.js
├── LICENSE
├── package.json
├── README.md
└── tests
    ├── GraphemeBreakTest.txt
    └── grapheme_splitter_tests.js

1 directory, 6 files

$ 
