Code Monkey home page Code Monkey logo

Comments (14)

Trolldemorted avatar Trolldemorted commented on August 15, 2024 3

Any chance for grapheme cluster support in the near future? One of our projects requires that we match them, and I really would like to stick with rust and not switch to python, which unfortunately can do it:

import regex

haystack = "🏳️‍🌈F́́̂꣩⃧꣩ᷭ⳱꣪L⃒̲̀̀̀ᷛᷩ⃰᷃À̀̀̀꣢᪱̅̾ᷥG︪̯̰̀̀̀̀ͭ̇"
needle = "🏳️‍🌈\\X\\X\\X\\X"

foo = regex.search(needle, haystack)
print(foo)

prints

PS C:\Users\Benni\repositories\graphemeclusters> python .\sample.py
<regex.Match object; span=(0, 44), match='🏳️\u200d🌈F́́̂꣩⃧꣩ᷭ⳱꣪L⃒̲̀̀̀ᷛᷩ⃰᷃À̀̀̀꣢᪱̅̾ᷥG︪̯̰̀̀̀̀ͭ̇'>

from regex.

BurntSushi avatar BurntSushi commented on August 15, 2024 1

While I'm not against potentially adding this, I do think it would be a mighty big undertaking and it's not quite clear it could be feasibly done in the DFA. If someone wanted to move forward on working on this, then we can re-open this issue. Before writing any code, I would like to see a plan for how it would be added to the DFA though and at least some analysis on the memory required.

from regex.

BurntSushi avatar BurntSushi commented on August 15, 2024 1

No, I don't see it happening. You can write the regex yourself if you want, which is here: https://github.com/BurntSushi/bstr/blob/master/scripts/regex/grapheme.sh

The regex is quite gnarly though, but that's exactly what this crate would do if it supported \X. My guess is that it will be pretty slow, but you can try it.

If you describe the high level problem you're trying to solve, then there are almost certainly other approaches you could take here. It would be pretty surprising to me to switch your entire programming language based on this one thing.

You could also use the pcre2 crate, which I believe supports the \X syntax explicitly.

from regex.

BurntSushi avatar BurntSushi commented on August 15, 2024

Yup. The regex crate assumes a single character is a Unicode scalar value.

Do you know of other regex libraries that handle graphemes correctly?

from regex.

kwantam avatar kwantam commented on August 15, 2024

@tbu- one way to handle this is to apply a normalization to strings before matching. For example, converting a string to NFC form (or was it NFKC? See UAX#15 and the unicode-normalizations crate) should convert equivalent representations like the one you've listed to a single, canonical form. Then, of course, your regex would have to match against the canonical version of the character.

from regex.

tbu- avatar tbu- commented on August 15, 2024

@kwantam I believe that this still does not work for the . pattern which should match a single grapheme (which might well be more than one codepoint, even after normalization).

from regex.

kwantam avatar kwantam commented on August 15, 2024

@tbu- UAX#18§2.2 discusses matching grapheme clusters. The rule here, RL2.2, does not say that . should match a grapheme cluster.

For example, an implementation could interpret \X as matching any
extended grapheme cluster, while interpreting "." as matching any
single code point. It could interpret \b{g} as a zero-width match
against any extended grapheme cluster boundary, and \B{g} as
the negation of that.

I agree that it would be nice to have \b{g} (and \b{w}, \b{l}, and \b{s}) defined. I haven't yet implemented UAX#14 or the UAX#29 sentence break boundary algorithms in the unicode-segmentation crate, but it's on the roadmap. At that point, it would be pretty straightforward to implement \X as .+?\b{g}, as discussed in UAX#18§2.2.

As regards other languages, the perlunicode man page has a nice summary of perl's support for UAX#18. One assumes if perl's implementation is missing it, most everyone else's is, too, and indeed they do not support RL2.2 at least as of perl5.20.1.

from regex.

tbu- avatar tbu- commented on August 15, 2024

Oh, I incorrectly assumed it would be described in "UAX#18§2.2.1 Grapheme Cluster Mode".

from regex.

kwantam avatar kwantam commented on August 15, 2024

Nothing incorrect about that; it would be nice to have that mode available, too! :)

(And, I think, once RL2.2 is implemented, it would be pretty easy to add a switch for grapheme cluster mode.)

from regex.

BurntSushi avatar BurntSushi commented on August 15, 2024

I changed the title of this issue because this crate doesn't support grapheme clusters, so what you're seeing is intended behavior.

I actually don't know much about graphemes and how they're encoded, but if they can be made to fit into the Input abstraction, then it should be relatively easy to actually experiment with this. (The representation of InputAt would have to change.)

from regex.

phiresky avatar phiresky commented on August 15, 2024

If you describe the high level problem you're trying to solve, then there are almost certainly other approaches you could take here.

I'm also interested in having this. For me the use case is finding emojis in text. Sounds like a simple problem, but consider these strings found in the wild:

  • 👨\u200d👩\u200d👦.
    This one is rendered as a family of three people when supported: 👨‍👩‍👦 . Note that this works for any combination of genders and you can also add skin color modifiers etc. The zero-width joiner shows that just doing emoji + modifiers is not enough, because you can join an arbitrary amount of emojis together.

  • \ud83e\udd26\u200d\u2642\ufe0f\ud83c\udfff A sequence of the following:
    ['FACE PALM',
    'ZERO WIDTH JOINER',
    'MALE SIGN',
    'VARIATION SELECTOR-16',
    'EMOJI MODIFIER FITZPATRICK TYPE-6']
    So it's a black male person facepalming. The variation-selector-16 apparently means "interpret the previous character as an emoji and not a text symbol, referring to rendering the "male sign" as "\u2642\ufe0f" = ♂️ and not "\u2642\ufe0e" = '♂︎'.

The best algorithm I found to find these (and all other emoji) is this one:

import emoji
import regex

def split_count(text):

    emoji_list = []
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_list.append(word)

    return emoji_list

Everything else I've seen basically uses a whitelist of known emojis , but that misses variants that are supported in different devices by combining other emoji as shown above.

So for each grapheme cluster, if at least one of the characters is a known emoji character, the whole cluster is counted as an emoji. This python code uses the PCRE2 library with the \\X escape mentioned above, as well as a list of "base-emoji" generated from the unicode consortium website.

The algorithm PCRE2 uses is described in detail in the Extended grapheme clusters section of the docs. I guess it's pretty similar to the .sh file @BurntSushi linked above, though it looks somewhat simpler

from regex.

BurntSushi avatar BurntSushi commented on August 15, 2024

@phiresky I don't see why you need a regex for that. Just iterate over all graphemes in the text and you can use the same detection logic.

I'm not an expert on emojis but I eould consult Unicode's specification for them. They likely have an algorithm for detecting them already written.

And are you sure the regex package in Python uses PCRE2? I thought it didn't. On mobile, otherwise I would check.

from regex.

phiresky avatar phiresky commented on August 15, 2024

@BurntSushi

The regex package in python is confusingly not the internal regex engine (called re). The internal one does not support \X.

I don't see why you need a regex for that. Just iterate over all graphemes in the text and you can use the same detection logic.

That's true. It seems like the unicode_segmentation Rust crate does implement the same algorithm:
https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/ so I'll try that, thanks! I guess that might also be a solution for other people looking for grapheme clusters in the rust regex crate.

They likely have an algorithm for detecting them already written.

I guess that's off topic here then, but as far as I have found they only have a list of base emojis, and a list of "known" zero-width-joined sequences, as well as the general grapheme splitting algorithm. But sequences not in this list are not invalid, and they already don't have all combinations of skin tones etc in there I think. I think phone manufacturers etc just add emojis when they want so it's not really possible to create a comprehensive list.

https://unicode.org/emoji/charts/index.html
https://unicode.org/Public/emoji/13.1/emoji-zwj-sequences.txt

Edit: You're right, the regex pypi package does implement \X, but they have their own implementation separate from PCRE2.

from regex.

BurntSushi avatar BurntSushi commented on August 15, 2024

That's true. It seems like the unicode_segmentation Rust crate does implement the same algorithm:
https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/ so I'll try that, thanks! I guess that might also be a solution for other people looking for grapheme clusters in the rust regex crate.

Right. As does bstr, with the advantage of working on &[u8].

Also, it's worth mentioning that there are two reasonable interpretations to what "support Unicode grapheme clusters" actually means in the context of a regex engine. The simplest is support for \X, which matches a single grapheme cluster. That could be reasonably added to the regex crate via a simple substitution of \X with the complex regex linked to above. However, it's not clear to me just how useful this really is. As shown in your case, it would be better to just iterate over the graphemes.

The other interpretation is more pervasive, whereby the regex engine itself becomes "grapheme aware." That is, things like . and characters classes more generally become able to match grapheme clusters. This is much harder to implement, or rather, hard to implement efficiently. I feel confident that this aspect of grapheme support will never make it into the regex crate.

from regex.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.