Comments (14)
Any chance for grapheme cluster support in the near future? One of our projects requires that we match them, and I really would like to stick with rust and not switch to python, which unfortunately can do it:
import regex
haystack = "🏳️🌈F́́̂꣩⃧꣩ᷭ⳱꣪L⃒̲̀̀̀ᷛᷩ⃰᷃À̀̀̀꣢᪱̅̾ᷥG︪̯̰̀̀̀̀ͭ̇"
needle = "🏳️🌈\\X\\X\\X\\X"
foo = regex.search(needle, haystack)
print(foo)
prints
PS C:\Users\Benni\repositories\graphemeclusters> python .\sample.py
<regex.Match object; span=(0, 44), match='🏳️\u200d🌈F́́̂꣩⃧꣩ᷭ⳱꣪L⃒̲̀̀̀ᷛᷩ⃰᷃À̀̀̀꣢᪱̅̾ᷥG︪̯̰̀̀̀̀ͭ̇'>
from regex.
While I'm not against potentially adding this, I do think it would be a mighty big undertaking and it's not quite clear it could be feasibly done in the DFA. If someone wanted to move forward on working on this, then we can re-open this issue. Before writing any code, I would like to see a plan for how it would be added to the DFA though and at least some analysis on the memory required.
from regex.
No, I don't see it happening. You can write the regex yourself if you want, which is here: https://github.com/BurntSushi/bstr/blob/master/scripts/regex/grapheme.sh
The regex is quite gnarly though, but that's exactly what this crate would do if it supported \X
. My guess is that it will be pretty slow, but you can try it.
If you describe the high level problem you're trying to solve, then there are almost certainly other approaches you could take here. It would be pretty surprising to me to switch your entire programming language based on this one thing.
You could also use the pcre2
crate, which I believe supports the \X
syntax explicitly.
from regex.
Yup. The regex crate assumes a single character is a Unicode scalar value.
Do you know of other regex libraries that handle graphemes correctly?
from regex.
@tbu- one way to handle this is to apply a normalization to strings before matching. For example, converting a string to NFC form (or was it NFKC? See UAX#15 and the unicode-normalizations
crate) should convert equivalent representations like the one you've listed to a single, canonical form. Then, of course, your regex would have to match against the canonical version of the character.
from regex.
@kwantam I believe that this still does not work for the .
pattern which should match a single grapheme (which might well be more than one codepoint, even after normalization).
from regex.
@tbu- UAX#18§2.2 discusses matching grapheme clusters. The rule here, RL2.2, does not say that .
should match a grapheme cluster.
For example, an implementation could interpret
\X
as matching any
extended grapheme cluster, while interpreting "." as matching any
single code point. It could interpret\b{g}
as a zero-width match
against any extended grapheme cluster boundary, and\B{g}
as
the negation of that.
I agree that it would be nice to have \b{g}
(and \b{w}
, \b{l}
, and \b{s}
) defined. I haven't yet implemented UAX#14 or the UAX#29 sentence break boundary algorithms in the unicode-segmentation crate, but it's on the roadmap. At that point, it would be pretty straightforward to implement \X
as .+?\b{g}
, as discussed in UAX#18§2.2.
As regards other languages, the perlunicode man page has a nice summary of perl's support for UAX#18. One assumes if perl's implementation is missing it, most everyone else's is, too, and indeed they do not support RL2.2 at least as of perl5.20.1.
from regex.
Oh, I incorrectly assumed it would be described in "UAX#18§2.2.1 Grapheme Cluster Mode".
from regex.
Nothing incorrect about that; it would be nice to have that mode available, too! :)
(And, I think, once RL2.2 is implemented, it would be pretty easy to add a switch for grapheme cluster mode.)
from regex.
I changed the title of this issue because this crate doesn't support grapheme clusters, so what you're seeing is intended behavior.
I actually don't know much about graphemes and how they're encoded, but if they can be made to fit into the Input
abstraction, then it should be relatively easy to actually experiment with this. (The representation of InputAt
would have to change.)
from regex.
If you describe the high level problem you're trying to solve, then there are almost certainly other approaches you could take here.
I'm also interested in having this. For me the use case is finding emojis in text. Sounds like a simple problem, but consider these strings found in the wild:
-
👨\u200d👩\u200d👦
.
This one is rendered as a family of three people when supported: 👨👩👦 . Note that this works for any combination of genders and you can also add skin color modifiers etc. The zero-width joiner shows that just doing emoji + modifiers is not enough, because you can join an arbitrary amount of emojis together. -
\ud83e\udd26\u200d\u2642\ufe0f\ud83c\udfff
A sequence of the following:
['FACE PALM',
'ZERO WIDTH JOINER',
'MALE SIGN',
'VARIATION SELECTOR-16',
'EMOJI MODIFIER FITZPATRICK TYPE-6']
So it's a black male person facepalming. The variation-selector-16 apparently means "interpret the previous character as an emoji and not a text symbol, referring to rendering the "male sign" as "\u2642\ufe0f" = ♂️ and not "\u2642\ufe0e" = '♂︎'.
The best algorithm I found to find these (and all other emoji) is this one:
import emoji
import regex
def split_count(text):
emoji_list = []
data = regex.findall(r'\X', text)
for word in data:
if any(char in emoji.UNICODE_EMOJI for char in word):
emoji_list.append(word)
return emoji_list
Everything else I've seen basically uses a whitelist of known emojis , but that misses variants that are supported in different devices by combining other emoji as shown above.
So for each grapheme cluster, if at least one of the characters is a known emoji character, the whole cluster is counted as an emoji. This python code uses the PCRE2
library with the \\X
escape mentioned above, as well as a list of "base-emoji" generated from the unicode consortium website.
The algorithm PCRE2 uses is described in detail in the Extended grapheme clusters section of the docs. I guess it's pretty similar to the .sh file @BurntSushi linked above, though it looks somewhat simpler
from regex.
@phiresky I don't see why you need a regex for that. Just iterate over all graphemes in the text and you can use the same detection logic.
I'm not an expert on emojis but I eould consult Unicode's specification for them. They likely have an algorithm for detecting them already written.
And are you sure the regex
package in Python uses PCRE2? I thought it didn't. On mobile, otherwise I would check.
from regex.
The regex
package in python is confusingly not the internal regex engine (called re
). The internal one does not support \X
.
I don't see why you need a regex for that. Just iterate over all graphemes in the text and you can use the same detection logic.
That's true. It seems like the unicode_segmentation Rust crate does implement the same algorithm:
https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/ so I'll try that, thanks! I guess that might also be a solution for other people looking for grapheme clusters in the rust regex crate.
They likely have an algorithm for detecting them already written.
I guess that's off topic here then, but as far as I have found they only have a list of base emojis, and a list of "known" zero-width-joined sequences, as well as the general grapheme splitting algorithm. But sequences not in this list are not invalid, and they already don't have all combinations of skin tones etc in there I think. I think phone manufacturers etc just add emojis when they want so it's not really possible to create a comprehensive list.
https://unicode.org/emoji/charts/index.html
https://unicode.org/Public/emoji/13.1/emoji-zwj-sequences.txt
Edit: You're right, the regex pypi package does implement \X, but they have their own implementation separate from PCRE2.
from regex.
That's true. It seems like the unicode_segmentation Rust crate does implement the same algorithm:
https://docs.rs/unicode-segmentation/1.7.1/unicode_segmentation/ so I'll try that, thanks! I guess that might also be a solution for other people looking for grapheme clusters in the rust regex crate.
Right. As does bstr
, with the advantage of working on &[u8]
.
Also, it's worth mentioning that there are two reasonable interpretations to what "support Unicode grapheme clusters" actually means in the context of a regex engine. The simplest is support for \X
, which matches a single grapheme cluster. That could be reasonably added to the regex crate via a simple substitution of \X
with the complex regex linked to above. However, it's not clear to me just how useful this really is. As shown in your case, it would be better to just iterate over the graphemes.
The other interpretation is more pervasive, whereby the regex engine itself becomes "grapheme aware." That is, things like .
and characters classes more generally become able to match grapheme clusters. This is much harder to implement, or rather, hard to implement efficiently. I feel confident that this aspect of grapheme support will never make it into the regex
crate.
from regex.
Related Issues (20)
- Inconsistent behavior with zero-width matches on empty strings
- Valid prefix search (with ^) goes into dead state HOT 3
- The regex parse error while the expre is correct ! HOT 2
- Onepass DFA always has empty captures (user error) HOT 2
- dfa/onepass.rs: index out of bounds HOT 2
- Errors when running quickstart from docs HOT 2
- Add a flag for unescaped literal groups HOT 1
- Inconsistency with is_match and Python's search in Matching Specific Regex Patterns HOT 6
- regex-lite with a &[u8] haystack HOT 2
- Underscore will not match propblaly HOT 2
- Invalid regex with multiple repetition flags is accepted HOT 3
- Valgrind reports "possibly lost" when using static `Regex` HOT 7
- adding regex-automata to cargo.toml file reduces performance
- Add method to get the full match of a Captures without an unwrap HOT 2
- valgrind reports "Conditional jump or move depends on uninitialised value(s)" HOT 2
- no_std support and cargo.toml doc comments HOT 2
- calling `caputres_iter` in an `async` function may cause memory leak HOT 13
- Compilation error when compiling with nightly with `unstable` feature enabled HOT 4
- switch `once_cell` recommendation to `std::sync::LazyLock` HOT 1
- Version 1.5.0 should be yanked HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from regex.