
Comments (91)

BurntSushi avatar BurntSushi commented on July 17, 2024

This is an interesting question. When building regex, I never even considered the possibility of running a regex on something that wasn't Unicode. If you're matching against, say, &[u8], then how are Unicode structures inside the regex interpreted? For example, \pN or even \w (which is Unicode aware).

For example, how can one implement strings --all using regexes in Rust?

Would a lossy conversion to a Unicode string work? Seems like it should.

from regex.

vi avatar vi commented on July 17, 2024

In Python, when matching over a byte array, the regex itself is also a byte array (there is nice b"123" syntax for it).

When matching over byte buffer, I may want to capture some binary blob delimited by text, like in BEGIN_BINARY_BLOB\n(.*)\nEND_BINARY_BLOB.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

I don't follow. Sorry. I'm not talking about the encoding of the regex, I'm talking about the definition of its Unicode aware character classes. For example see http://doc.rust-lang.org/regex/regex/index.html#perl-character-classes-(unicode-friendly) and http://doc.rust-lang.org/regex/regex/index.html#matching-one-character.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

I think the best solution to this---if I understand correctly---is to do lossy Unicode conversion before handing it off to a regex.

If that is for some reason not acceptable, then another way to go about this would be to provide an alternate implementation of Input that restricts characters to single bytes. The problem is: how are non-ASCII single bytes handled? I think the Char type will have to become polymorphic.

It seems doable, but I'm skeptical of whether it's worth it. I'll leave this open for now.

from regex.

vi avatar vi commented on July 17, 2024

Lossy Unicode conversion can break binary data that we may want to extract using the regular expression.
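This breakage is easy to demonstrate with a std-only sketch (not the regex crate's API): lossy conversion collapses every invalid byte into U+FFFD, so the original bytes cannot be recovered afterwards.

```rust
// Demonstrates why lossy conversion is unacceptable for binary data:
// each invalid byte is replaced by U+FFFD and is gone for good.
fn lossy_roundtrip(data: &[u8]) -> Vec<u8> {
    String::from_utf8_lossy(data).into_owned().into_bytes()
}

fn main() {
    let blob: &[u8] = b"BEGIN\xFF\x00END";
    let round_tripped = lossy_roundtrip(blob);
    // 0xFF is not valid UTF-8, so it became the 3-byte encoding of U+FFFD.
    assert_ne!(round_tripped, blob);
    assert_eq!(round_tripped, b"BEGIN\xEF\xBF\xBD\x00END");
}
```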

One alternative is to decode from some 8-bit encoding like cp1251, run the regex, then encode the result back to cp1251. This way the binary data should be preserved, but it's hacky and not efficient.

The main idea is that regular expressions are not only for handling text. Without a special easy syntax for byte buffers like Python's b"123\x00\x00\xFF" they would obviously be less convenient.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Rust has byte string literals.

This does seem doable. Inputs to the various matching functions would need to become polymorphic (i.e., accept either a &str or a &[u8]), and input handling within the engines would need to be polymorphic too.
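One hypothetical shape for that polymorphism (a sketch with invented names, not the crate's actual types): a common trait implemented for both &str and &[u8], with the engine working on the underlying bytes.

```rust
// Hypothetical sketch: one trait lets a single set of matching
// functions accept either &str or &[u8]. All names here are invented.
trait Haystack {
    fn haystack_bytes(&self) -> &[u8];
}

impl Haystack for &str {
    fn haystack_bytes(&self) -> &[u8] {
        self.as_bytes()
    }
}

impl Haystack for &[u8] {
    fn haystack_bytes(&self) -> &[u8] {
        self
    }
}

// A stand-in for a real engine: naive substring search over bytes.
fn contains<H: Haystack>(haystack: H, needle: &[u8]) -> bool {
    !needle.is_empty()
        && haystack
            .haystack_bytes()
            .windows(needle.len())
            .any(|w| w == needle)
}

fn main() {
    assert!(contains("hello", b"ell"));             // &str input
    assert!(contains(&b"he\xFFllo"[..], b"\xFFl")); // non-UTF-8 &[u8] input
}
```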

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

I've been thinking about this lately, because I'm currently working on converting the matching engines to be byte based instead of character based. This means that, e.g., . matches Unicode scalar values, which means it would be effectively useless on anything other than UTF-8 encoded data. For example, say I wanted to search for .*?myname.*?. If the regex engine claims to be able to run on binary data, then those dots will foil your intention because they won't consume arbitrary bytes---only valid UTF-8 encodings of scalar values. I contend that this is actually a pretty valid use case, because it means a regex can operate on any data that contains ASCII, which is reasonably common in the wild in my experience. (e.g., latin-1.)

"OK, so add a flag that makes . match bytes instead of characters." Well, that's problematic, because the entire existing API is focused around strings, which must be UTF-8 encoded. If a . can successfully match any byte, then the byte boundaries returned are no longer guaranteed to be on UTF-8 boundaries. Whoops.
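The boundary problem can be seen with plain std string slicing: a byte offset that lands inside a multi-byte character simply cannot be represented as a &str slice.

```rust
// A byte offset produced by a byte-oriented `.` need not fall on a
// UTF-8 character boundary, and &str cannot express such a slice.
fn main() {
    let s = "snow: ☃"; // '☃' occupies bytes 6, 7, 8 (0xE2 0x98 0x83)
    assert!(s.is_char_boundary(6));
    assert!(!s.is_char_boundary(7)); // middle of the snowman
    // Fallible slicing refuses an offset inside a character:
    assert_eq!(s.get(0..7), None);
    assert_eq!(s.get(0..6), Some("snow: "));
}
```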

So this feature requires some serious thought at the API level.

from regex.

vi avatar vi commented on July 17, 2024

Missing byte regexes can invite hacks.

Imagine a sequence: somebody wants to parse apart some piece of data. At first he just uses strings and regexes, and everything works.

But later he adds features/refactors and realizes that this should be byte buffer, not a string. So regexes should sometimes now extract a string, but other times extract a byte buffer.

Case 1: byte regexes supported. User proceeds with the refactoring, using byte regexes. Nicer, more secure.

Case 2: only string regexes supported. User aborts the refactoring, leaving everything as a string (with some escaping or just latin-1 everywhere). Uglier, less secure.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

@vi I don't think I'm questioning the utility of byte based regexes. The question isn't, "Why should we want them?" The question is, "How do we do it?" My other comments have described some of the issues.

from regex.

vi avatar vi commented on July 17, 2024

Just a parallel set of structs/traits won't do? Like Java's InputStreamWhatever and ReaderWhatever.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

@vi That would double the size of today's API.

from regex.

flying-sheep avatar flying-sheep commented on July 17, 2024

i think it would be useful for fast operations on (mainly) ASCII buffers, e.g. ASCII-based protocols and gene sequence data (which is often stored as ASCII sequence of GTACGTACGATCGATCG...)

byte regexes wouldn’t know anything about unicode and operate bytewise, e.g. . would match any byte (except maybe \n which would be the canonical newline byte like in python bytestrings)

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

@flying-sheep I personally can't recall ever using regexes on sequence data during my tenure as a computational biologist. Regardless, that use case is supported today, since all valid ASCII encodings are valid UTF-8 encodings.

byte regexes wouldn’t know anything about unicode and operate bytewise, e.g. . would match any byte (except maybe \n which would be the canonical newline byte like in python bytestrings)

That is what I suggested in a comment above. The naive solution (add a new "byte based" Regex type with all the methods on Regex today, except on &[u8] instead of &str) doubles the size of the existing API.

from regex.

flying-sheep avatar flying-sheep commented on July 17, 2024

I personally can't recall ever using regexes on sequence data during my tenure as a computational biologist

incidentally that’s what i’m doing literally right now for a very good reason 😉

if you’re interested: some single cell sequencing protocols have the barcodes responsible for single cell IDs stored in RNA-Seq reads, in ways that read something like “the barcode is the first 5bp, then come 3-6 Gs, followed by the read sequence and an optional poly-A tail”. and i really don’t know a better way to express that kind of flexibility than .{5}G{3,6}.+?A*

since all valid ASCII encodings are valid UTF-8 encodings.

obviously, but i meant being able to ignore non-decodable bytes and indeed never going through a decoding step before applying regexes.

The naive solution (add a new "byte based" Regex type with all the methods on Regex today, except on &[u8] instead of &str) doubles the size of the existing API.

yup, sadly. but apart from facilitating that by using a code generator or macro magic i don’t see a better way. we all know strings aren’t as similar to byte sequences as most people like to think, but regexes make sense for both.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

obviously, but i meant being able to ignore non-decodable bytes and indeed never going through a decoding step before applying regexes.

Right, but does sequence data contain non-decodable bytes? (I would consider avoiding the decoding step for performance reasons to be a bit of a niche use case.)

yup, sadly. but apart from facilitating that by using a code generator or macro magic i don’t see a better way. we all know strings aren’t as similar to byte sequences as most people like to think, but regexes make sense for both.

I am just extremely hesitant to make any decisions whose result is "double the size of the API." That is an enormous cost IMO. I'm not concerned at all about the code duplication, as I'm sure there are plenty of ways to eliminate it. (The first step is making the matching engines byte based instead of codepoint based, which is required before any progress can be made on this issue.) I'm concerned about the size of the API for complexity's sake. Maybe we could put it in a separate sub-module or something.

from regex.

flying-sheep avatar flying-sheep commented on July 17, 2024

Right, but does sequence data contain non-decodable bytes?

no, that would be the other use case of ASCII protocols

(I would consider avoiding the decoding step for performance reasons to be a bit of a niche use case.)

things here have to be !!fast sometimes. not decoding seems like a useful step.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

I'm not saying it isn't a valid use case. I know I certainly need it sometimes. But use cases have to be weighed with their costs, which is all I'm trying to do.

Once again, to be clear, I really do believe supporting byte-based regexes is something we should pursue. But:

  1. The implementation to support it isn't there yet.
  2. The API needs some serious thought so that we don't blow our complexity budget.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

@flying-sheep Just wanted to clarify: addressing your use case of "I have ASCII text and I don't want to pay for Unicode" can be done without supporting byte based regexes in the API (with respect to CPU time). If the automata are byte based instead of codepoint based, then UTF-8 decoding is built into the automaton. If all you have is ASCII text, then the states that decode other codepoints are simply never reached. Their only cost would be memory usage (and take up space in the cache lines).
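The idea of UTF-8 decoding living inside a byte-based automaton can be sketched with just the lead-byte ranges (a simplification of my own; a real automaton also validates the continuation bytes):

```rust
/// Width of a UTF-8 sequence judged from its lead byte alone: a
/// simplified version of the byte ranges such an automaton encodes.
fn utf8_width(lead: u8) -> Option<usize> {
    match lead {
        0x00..=0x7F => Some(1), // ASCII: the only arm ASCII-only input ever takes
        0xC2..=0xDF => Some(2), // (0xC0/0xC1 excluded: overlong encodings)
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None, // continuation byte or invalid lead
    }
}

fn main() {
    assert_eq!(utf8_width(b'a'), Some(1));
    assert_eq!(utf8_width(0xE2), Some(3)); // lead byte of '☃'
    assert_eq!(utf8_width(0xFF), None);    // never valid in UTF-8
}
```

On pure-ASCII input only the first arm is ever taken, which is the point of the comment above: the non-ASCII states exist but are never reached.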

from regex.

vi avatar vi commented on July 17, 2024

Maybe byte regexes can be a separate crate, like binregex? Globally, the API would still be doubled, but projects not using byte regexes should not feel it.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

@vi That is unfortunately tricky to do. The actual implementation of byte based regexes and Unicode regexes should be the same. It's only the API that's different and maybe one or two steps in the compiler (e.g., . is any byte instead of any codepoint). Putting the byte based API into a separate crate means exposing enough of the internals of regex to make that possible. A sub-module I guess would be better.

from regex.

vi avatar vi commented on July 17, 2024

What about completely independent implementations? Obvious drawback: major duplication of code, increased resource usage when both byte and string regexes are in use...

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

@vi I can't imagine why anyone would subject themselves to the pain of maintaining such a thing.

from regex.

vi avatar vi commented on July 17, 2024

Code-generating robot doesn't feel pain.

from regex.

birkenfeld avatar birkenfeld commented on July 17, 2024

Since #146 is scheduled for 1.0, could this be considered for the same milestone?

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

@birkenfeld There is unfortunately significant work between #146 and this issue. #146 makes it possible to run any of the matching engines on raw bytes, but this issue requires at least these additional things:

  1. Some way to signal that . matches any byte as opposed to any Unicode scalar value. In the library today, this is always a Unicode scalar value. This has to be handled with care, because as soon as this semantic is added, the boundaries returned are no longer guaranteed to correspond to a valid UTF-8 sequence boundary.
  2. Given (1), how does one express "match any Unicode scalar value" if . is already "match any byte"? It seems like it would be useful to have both. A flag? A distinct syntax for it? If it's a flag, then that flag must be explicitly rejected (or ignored) by the String based regexes.
  3. Craft a public API that supports byte based matching. Is it a separate crate? Is it a sub-module? The latter seems like significantly less work, since the former would require splitting the internals of regex out into its own crate. (I am somewhat strongly opposed to putting a byte-based regex into the top-level API of this crate, if only because it's a niche use case that will otherwise clutter up the docs.)

from regex.

birkenfeld avatar birkenfeld commented on July 17, 2024

No pressure. Just thought I'd ask :)

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

@birkenfeld Gotcha. To be clear, I do very much want this! (Even if I call it niche.) Being able to exec a regex on an arbitrary &[u8] (which may not be UTF-8) is a really nice thing to have!

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

For those that want this, can you help me figure out the semantics?

For example, in a byte based regex, it seems like . should match "any byte" rather than "any Unicode scalar value." We could add a flag that lets you switch to and from Unicode mode. (That flag would not be allowed to be used for the normal &str based Regex API.) I think I'm OK with that.

So let's say the flag is switched so that things are byte based. Character classes like \pL still make sense, because it seems useful to be able to match Unicode things in arbitrary bytes.

But things get hairy when we consider negation. Some cases are easy. For example, [^a] seems like it should match any byte except for 0x61. But what about [^\pL], or even something simpler like, [^☃]. We have no actual way in regex speak to say, "match any byte unless it corresponds to this sequence of bytes." I could maybe work on inventing something along those lines, but it isn't quite clear how tricky that is.

What else can we do? Well:

  1. We could disable the negation of any character class that contains multi-byte characters. So e.g., [^☃] would be disallowed and so would \PL, which is the negation of all Unicode letters.
  2. \w, \s, and \d would revert to their ASCII definitions, negations would be allowed. Similarly, \b would revert to its ASCII definition.
  3. . would match any byte.
  4. Things like \pL would still be allowed? It seems strange to allow \pL but not \PL.

Ideas? Thoughts?

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

I sort of find the idea of making byte based regexes completely ASCII-only appealing, mostly because it's simple to explain and describe. I do however find it unfortunate because it means you can't really match text that might be UTF-8 inside arbitrary bytes.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

This thread seems to suggest that RE2 just gives up on Unicode when matching raw bytes: https://groups.google.com/forum/m/#!topic/re2-dev/26wVIHcowh4

from regex.

vi avatar vi commented on July 17, 2024

Maybe the pattern for a byte regex is also a byte buffer (not a Unicode string), so we don't have "characters" in it.

Matching by b"[^☃]" should give an error: byte constant must be ASCII, and matching by "[^☃]".as_bytes() should match any byte except 226, 152 and 131.
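(For the record, those three decimal values are simply the UTF-8 encoding of the snowman, easily checked in std:)

```rust
fn main() {
    // '☃' (U+2603) is encoded in UTF-8 as 0xE2 0x98 0x83,
    // i.e. 226, 152, 131 in decimal.
    assert_eq!("☃".as_bytes(), &[0xE2, 0x98, 0x83]);
    assert_eq!("☃".as_bytes(), &[226, 152, 131]);
}
```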

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Right. And that means things like \pL would also be disabled? And \PL?

from regex.

vi avatar vi commented on July 17, 2024

Probably yes, as \pL is something Unicode-ish and the content being matched by a byte regex may not be Unicode, creating confusion.

Unicode helpers inside byte regexes may need to refer to the encoding more explicitly (imagine a regex that matches some UTF-8 text delimited by UTF-16 "tags"). Another idea may be "converting" a usual string regex into a byte regex (specifying the encoding) and combining them: let my_byte_regex = b"1234" + UsualRegex::new("\pL").convert_to_byte_regex("UTF-8") + b"890"

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Well, the thing is, \pL would actually work today, since it can match on raw bytes. (That's how the DFA works.) I feel like it would be pretty cool to be able to say, "match any sequence of utf8 encoded characters in these arbitrary bytes." The issue is really that negating such things becomes tricky.

from regex.

vi avatar vi commented on July 17, 2024

What is the problem with negation? Won't usual (?!......) work?

Character classes (including ^negated ones) map to byte classes, but negation is not limited to character classes.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

What does [^\pL] match? Today, it matches any valid utf8 encoding of a non letter codepoint. What does it do in a byte based regex?

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Once you've answered that, then answer: what does [^\n] match?

from regex.

vi avatar vi commented on July 17, 2024

In byte regexes, [^\n] should match all bytes except 10. I'm not sure if \p is appropriate inside byte classes.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Agree on [^\n]. I guess what I'm trying to say is that i see things like \pL being quite useful in byte regexes, especially since it's already supported. I guess I'm just lamenting their loss because they don't always make sense. :(

from regex.

vi avatar vi commented on July 17, 2024

Maybe \pL can continue to work outside [], meaning "a byte sequence consisting of one Unicode codepoint denoting a letter in UTF-8". Negation can be done with (?!, negating not a byte class, but a sub-regex. \pL inside [] can be only in non-byte (i.e. string) regex.

Internally one can think \pL in byte regex context to be a "macro" which expands to a large byte-twiddling regex with multiple branches, matching 1 to 4 bytes.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Fyi, We don't have negative or positive lookahead.

What about this. Byte regexes by default are composed of bytes. Anything with unicode awareness is disabled as appropriate. But introduce a new flag, u, that enables all Unicode awareness as it exists in today's regexes.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

E.g. [^a] matches any byte except a by default. But with unicode enabled, it will match any codepoint except a, which wouldn't include arbitrary bytes.

from regex.

vi avatar vi commented on July 17, 2024

May be viable.

Is positive and negative look-ahead and look-behind a planned feature? Will there be recursive regexes like in Perl for mini-syntax analyser?

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

No. This crate uses finite automata and guarantees worst case linear time matching.

from regex.

vi avatar vi commented on July 17, 2024

"u" mode should also change the meaning of . (which can now also match multiple bytes) and so on. Or the documentation should recommend using something other than . when one regex needs to match both single arbitrary bytes and sequences denoting a single arbitrary codepoint.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Yup. . by default would match any byte except \n. With the u flag, it would match any codepoint other than \n.
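The difference in step count is visible with plain std iteration: é below is a single codepoint but two bytes, so a byte-oriented . and a codepoint-oriented . would cross it in a different number of steps.

```rust
fn main() {
    let s = "héllo";
    assert_eq!(s.bytes().count(), 6); // 'é' is two bytes (0xC3 0xA9)
    assert_eq!(s.chars().count(), 5); // but a single codepoint (U+00E9)
}
```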

from regex.

vi avatar vi commented on July 17, 2024

Does this crate settle the "official Rust regex API", so that a more sophisticated, not-necessarily-always-halting implementation may reuse existing traits and one could "switch the regex engine" for an application (without manually editing the dependencies that depend on the regex crate)?

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

This crate doesn't define any traits and I'm not sure I really want it to either. My primary goal is to build a sensible API for the implementation in this crate only. Most regex engines differ in subtle and important ways, so I'm not sure it's really possible to have an API for "swap regex engines without thinking about it."

from regex.

vi avatar vi commented on July 17, 2024

How does one match bytes in byte regexes in unicode mode? For example, there is a length-prefixed UTF-8 string embedded somewhere and you want to capture something like (....)(@*trigger@*?)\0 (where . denotes "any byte" and @ denotes "any character"), and the user expects the 4-byte length and the string itself to be available in groups

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Sorry for the ridiculous typos. On mobile!

from regex.

vi avatar vi commented on July 17, 2024

Also one may want to combine byte class and character class in one regex.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Hmm, i don't think i understand. You can toggle flags inside the regex. For example .(?u:.) would match any byte followed by any character (sans \n). Or (?s).(?u:.) matches any byte followed by any character (including \n).

from regex.

vi avatar vi commented on July 17, 2024

Maybe byte mode and unicode mode may be switched inside the regex itself? Like matchanybyte:. matchbyterange:[\xC0-\xFF] (?Umatchanycharacter:. matchcharnonletter:[^\pL]) - meaning of . and [] change inside the (?U ) group.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Yup, now you got it! That's the way flags work now, so it's perfect.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

This is sweet. I'm excited. I think it's next on my list!

from regex.

vi avatar vi commented on July 17, 2024

Yes, this way is OK. There may be a flag for non-Unicode text matching, or just assistance things (like (?H 0054 01AF FFFF ) instead of \x00\x54\x01\xAF\xFF\xFF, or (?N16 5222) matching the number \x14\x66 (5222 in 16-bit network order)).
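((?H …) and (?N16 …) are hypothetical syntax, not anything the crate supports, but the arithmetic can be checked with std:)

```rust
fn main() {
    // 5222 in 16-bit network (big-endian) byte order is 0x14 0x66.
    assert_eq!(5222u16.to_be_bytes(), [0x14, 0x66]);
    assert_eq!(u16::from_be_bytes([0x14, 0x66]), 5222);
}
```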

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

I would be willing to consider extensions like that, but after we get initial support. :)

from regex.

vi avatar vi commented on July 17, 2024

What about returning a String instead of a Vec<u8> for groups declared inside (?u )?

Something like (I'm not familiar with API):

let regex = b"qwe(.{3})uiop(?u:zxc(.{3})m)";
let match_ = regex.matcher(b"qwertyuiopzxcvbnm");
let a : Vec<u8> = match_.byte_group(1); // b"rty"
let b : String = match_.unicode_group(2); // "vbn"

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Yeah that would be pretty cool. Would probably be tricky to pull off unfortunately. Would have to think about it.

from regex.

birkenfeld avatar birkenfeld commented on July 17, 2024

Wow, lots of messages :) FWIW, I'd be happy with the design outlined here (1-byte-based matching by default, UTF-8 awareness with a switch). In the latter mode, invalid UTF-8 bytes should just let the match fail. The following escapes should be switched to ASCII meanings in non-Unicode mode: \d = [[:digit:]], \s = [[:space:]], \w = [[:word:]].
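The gap between the Unicode and ASCII definitions of \d is easy to see with std's char methods (using an Arabic-Indic digit as the example):

```rust
fn main() {
    // '٣' (ARABIC-INDIC DIGIT THREE, U+0663) is a Unicode digit...
    assert!('٣'.is_numeric());
    // ...but not an ASCII one, so an ASCII-only \d would skip it.
    assert!(!'٣'.is_ascii_digit());
    // ASCII digits satisfy both definitions:
    assert!('7'.is_ascii_digit() && '7'.is_numeric());
}
```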

By the way, why does this Regex crate allow the POSIX-like character classes outside of brackets? In all other regex incarnations I know, you have to write [[:alpha:]]. This crate seems to allow [:alpha:]. (Which I think makes more sense, but sadly it makes the syntax needlessly incompatible.)

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

In the latter mode, invalid UTF-8 bytes should just let the match fail.

Right. Not only is that the easiest path (yay!), but it's also really nice from the caller's perspective. As @vi pointed out, it means that any sub-capture with (?u:re) where re does not disable the u flag will be guaranteed to match valid UTF-8.

The following escapes should be switched to ASCII meanings in non-Unicode mode

Yup! Also, word boundaries.

By the way, why does this Regex crate allow the POSIX-like character classes outside of brackets? In all other regex incarnations I know, you have to write [[:alpha:]]. This crate seems to allow [:alpha:]. (Which I think makes more sense, but sadly it makes the syntax needlessly incompatible.)

Interesting. Maybe file a separate issue? (Note that syntaxes are already generally incompatible, I think. Every regex engine has its own peculiarities.)

from regex.

birkenfeld avatar birkenfeld commented on July 17, 2024

Filed #175.

I know that most regex libs have their own syntax extensions (and sometimes different behavior for the same syntax!), but character ranges and POSIX char classes are pretty basic, so they should work as they do in others IMO. In particular, it can bite you that (if it is implemented that way) you cannot start a char range with :.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

OK, so I've added all of the necessary (I think) features to the abstract syntax for this. You can see its new definition here: https://github.com/rust-lang-nursery/regex/blob/raw-bytes/regex-syntax/src/lib.rs#L93

What I ended up doing was adding a b flag (for bytes) instead of a u flag (for Unicode). I'm not sure if that's the right thing or not. Some thoughts:

  1. I like the idea that all flags are disabled when calling the main Regex::new.
  2. The b flag is completely disallowed by default when calling Regex::new.
  3. I'm thinking that the b flag should be enabled by default when calling bytes::Regex::new. But this means that in order to get Unicode features, you need to write (?-b:\pL), for example. That seems a little unfortunate.

Questions:

  1. Should the b flag be enabled by default for byte based regexes?
  2. If so, should there also be a u flag that is defined to be the inverted alias to b?

In any case, I actually think the hardest part of this ticket is done. Now all I need to do is wire it up, which is I think just tedium at this point!

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

cc @nikomatsakis @jneem You all will probably care about this, since the abstract syntax is getting some new variants. Briefly:

  1. We want the ability to define and run regexes on &[u8].
  2. Today, everything assumes valid Unicode (with the intention of matching on valid UTF-8 only).
  3. Modify the AST to support arbitrary byte variants and make character classes do the right thing.
  4. Add a new flag, b, that disables Unicode features and assumes the search text is just ASCII compatible. Things like . match any byte instead of any codepoint, \d is just ASCII and things like \xFF correspond to the byte 0xFF instead of the codepoint U+00FF.
  5. Do not change the current default behavior. That is, byte based features are disallowed in the parser unless they are explicitly turned on by the caller. (Use of the b flag will cause the parser to error.) This will guarantee that your Expr will be purely Unicode based, which I think just means having to add a few unreachable! clauses in whatever does case analysis on Expr. (Unfortunately.)
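The \xFF distinction in point 4 is worth spelling out with std: as a codepoint, \xFF is U+00FF ('ÿ') and occupies two bytes in UTF-8, whereas the bare byte 0xFF is never valid UTF-8 at all.

```rust
fn main() {
    // As a codepoint, \xFF is U+00FF, which UTF-8 encodes as TWO bytes:
    let mut buf = [0u8; 4];
    assert_eq!('\u{FF}'.encode_utf8(&mut buf).as_bytes(), &[0xC3, 0xBF]);
    // As a bare byte, 0xFF is not valid anywhere in a UTF-8 string:
    assert!(std::str::from_utf8(&[0xFF]).is_err());
}
```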

from regex.

birkenfeld avatar birkenfeld commented on July 17, 2024

Should the b flag be enabled by default for byte based regexes?

Yes please.

If so, should there also be a u flag that is defined to be the inverted alias to b?

Sounds good to me. Although at that point, you might just only have u. Since you can't touch the u flag for str regexes, I don't think it matters much that it's always enabled.

from regex.

nikomatsakis avatar nikomatsakis commented on July 17, 2024

@BurntSushi sounds good. (Seems like a good use case for rust-lang/rfcs#757 , though of course semver makes this kind of addition a non-issue.)

from regex.

jneem avatar jneem commented on July 17, 2024

I can see the use-case for byte-based regexes, but combined byte-based and UTF-8-based matching in the same regex seems a bit niche to me. If it was only one or the other then there would be a natural way to do them both in a type-safe way: split Expr into CompoundExpr<T> that contains all the combining things (Group, Repeat, etc) and Primitive{Char,Byte}Expr containing all the things that match {chars,bytes}. Then parsing a unicode regex could return CompoundExpr<PrimitiveCharExpr> and parsing a byte-based regex could return CompoundExpr<PrimitiveByteExpr>.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

It feels really wrong to prevent full UTF-8 matching on &[u8], even when the capability exists. I agree the complication to the AST is unfortunate. :-(

from regex.

jneem avatar jneem commented on July 17, 2024

I would argue that it's perfectly fine not to support a niche feature if it makes your public API worse. Not that it's the end of the world, of course, but I would feel much better about it if there was at least one person who had a specific application for the feature...

from regex.

vi avatar vi commented on July 17, 2024

CompoundExpr feels like overengineering. Byte-default, but with (?u: ... ), maybe also with an additional method to retrieve &str instead of (or in addition to) &[u8] from groups inside (?u: ... ), feels reasonable.

For example searching for (?u:(.*)) should extract valid UTF-8 snippets from mixed text/binary data.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Yes, it'd be quite useful if, for example, you were writing a tool to search in arbitrary data whose encoding may not be known (e.g., files on a file system). Imagine being able to use, for example, \p{Lu} with grep. It won't always work of course, but there's a lot of UTF-8 out there. :-)

from regex.

birkenfeld avatar birkenfeld commented on July 17, 2024

It won't always work of course, but there's a lot of UTF-8 out there. :-)

I agree.

from regex.

jneem avatar jneem commented on July 17, 2024

If you're reading files on a file system then each file is probably encoded consistently, even if encodings might differ between files. So then you could just use a UTF-8 regex (provided that its API were extended to allow searching in a &[u8]). The only reason I can think of for having a mixed regex is if you're searching in some file that has embedded UTF-8 but also some non-ASCII binary structure that you need to match. While such things could exist in principle, I've never needed to run a regex on one.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

HTML documents are a perfect example of common documents that aren't always encoded correctly. Everyone loves running regexes on HTML. :-)

The only reason I can think of for having a mixed regex is if you're searching in some file that has embedded UTF-8 but also some non-ASCII binary structure that you need to match.

It's not necessarily just for when you need to match non-ASCII binary structure, it's necessary for matching UTF-8 among possibly non-UTF-8 bytes. The whole problem with a UTF-8 regex today is that it will immediately stall at the very first invalid byte it sees. You can't utter a regex that will let you skip over it. For example, when one writes [^a] in a UTF-8 regex today, that doesn't match "any byte except a," it matches "any UTF-8 encoding of a codepoint that is not a."
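The "stall at the very first invalid byte" behavior mirrors what std's UTF-8 validation reports (a std-only illustration, not the regex engine itself):

```rust
fn main() {
    let haystack = b"caf\xFF\xFEe";
    // Validation stops at the first invalid byte, offset 3...
    let err = std::str::from_utf8(haystack).unwrap_err();
    assert_eq!(err.valid_up_to(), 3);
    // ...and a &str-based regex can never be handed the bytes past it.
}
```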

from regex.

jneem avatar jneem commented on July 17, 2024

It's not necessarily just for when you need to match non-ASCII binary structure, it's necessary for matching UTF-8 among possibly non-UTF-8 bytes.

My regex_dfa crate is capable of finding a UTF-8 encoded match within a text that is not fully UTF-8 (although not using the public API, which takes a &str). That part's an implementation detail of the regex engine, not of the regex grammar.

I still think that this mixed-mode regex is only useful if you actually want to find matches that span both UTF-8 and non-UTF-8 regions. I still can't imagine writing such a regex (even if I had a HTML file that I knew was a mishmash of different encodings). But since it seems that everyone but me thinks this is a useful feature, I'm going to bow out of the dispute now...

from regex.

vi avatar vi commented on July 17, 2024

Shall I try to think up some examples where I might need a mixed regex?


from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

@vi Examples are always useful. I think @jneem's main issue is that this feature complicates the AST, but I think any support of this at all (even without UTF-8 mixed matching) complicates the AST. It'd be nicer to encode the new invariants into the type system, but you end up losing mixed matching and the resulting AST isn't exactly nice either. :-/

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

@jneem also suggested building this directly into the matching engines, which is possible, but you still need to be able to support matching arbitrary bytes in the AST, which means using u8 in a few places instead of char and making sure things like \xFF correspond to the byte \xFF instead of the codepoint \xFF (which is represented in UTF-8 with two bytes).

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

I guess some of that stuff could be pushed off to the compiler if we didn't support mixed mode matching, but then you need to make sure to deal with the Perl classes (\s, \w and \d) correctly, which are translated to their character class forms in the parser. Blech.

from regex.

vi avatar vi commented on July 17, 2024

In any case, the public API should be designed in a way that adding mixed mode is not an incompatible change, so the engine may be changed under the hood.

It can even be a documented, but not implemented "teaser" API to gather votes for the feature request.

from regex.

vi avatar vi commented on July 17, 2024

Example 1

Hacky extraction of title from Matroska files

perl -ne 'if (/\x7B\xA9([\x80-\xFE]|[\x40-\x7F].)(.+)/u) { print "$2\n"; }' somefile.mkv

This example lacks mixed regexes, so it outputs binary garbage after the title. Switching on Unicode mode causes it to fail with "Malformed UTF-8 character".

let mkvsnippet = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\x53\x70\x6f\x72\x74\x4d\x80\x98\x54\x76\x68\x65";
let re = BinaryRegex::new(r"\x7B\xA9([\x80-\xFE]|[\x40-\xFF].)(?u:(.*))");
let caps = re.captures(mkvsnippet);
let approximate_title : String = caps.at_str(1);
let encoded_title_length: Vec<u8> = caps.at(0);

Example 2

Peeking at /proc/$SOMEPID/environ or /proc/$SOMEPID/cmdline and extracting some string from it. The files are binary (zero-separated), yet can contain UTF-8 text. I don't remember an easy example where UTF-8-encoded data gets passed through the environment... Maybe CGI?

Example 3

Simple /usr/bin/strings analogue, but for extracting UTF-8 snippets from arbitrary binary file.

Something like r"(?u:(\p{Assigned}{8,}))".

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

@vi Those are pretty cute, thanks. :-)

from regex.

Alphare avatar Alphare commented on July 17, 2024

Very sorry to necropost, but I can't seem to find an issue or discussion about the regex pattern itself being a &[u8] instead of a &str.

For context, I am working on the Mercurial Rust implementation of .hgignore.

These regexes are user-defined, written in any arbitrary encoding (though ASCII-compatible for the special characters, otherwise I don't know how that could possibly work) to match against filenames and paths that follow the same rule.
We already have an HgPath type (and its owned variant HgPathBuf) to handle filenames and paths in both an encoding- and platform-agnostic fashion, something that is not possible with OsStr, but I am struggling to see how I can use the bytes module of regex in that case, since Regex takes a &str, which must still be valid UTF-8.

I hope to be wrong or that the doc just needs to be clearer, hehe.
Thanks in advance.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

@Alphare Thanks for the question. In the future, I'd prefer a new issue for
questions like this. I don't mind bumping old PRs/issues per se, but your
question is orthogonal to this. That is, "matching on &[u8]" and "&[u8]
patterns" are quite different.

There is a somewhat related discussion in
#451

For context, I am working on the Mercurial Rust implementation of .hgignore.

Nice! IIRC, Mercurial's implementation of .hgignore uses Python to execute
regexes. There are several incompatibilities between this regex engine and
Python's, so it may be quite tricky to preserve backward compatibility here. I
don't know to what extent that's a concern, but I figured I'd mention it in
case it's a show-stopper.

These regexes are user-defined, written in any arbitrary encoding (though
ASCII-compatible for the special characters, otherwise I don't know how that
could possibly work) to match against filenames and paths that follow the
same rule.

The important aspect of matching on &[u8] is that regexes can match invalid
UTF-8. Combine that with the fact that you can write (?-u:\xFF) to refer to
the byte \xFF (rather than the codepoint U+00FF), and you should be able to
accomplish what you want. All you really need to do is replace every invalid
UTF-8 byte in your pattern with its corresponding hex escape sequence. And I
think that should do it.

from regex.

Alphare avatar Alphare commented on July 17, 2024

@BurntSushi Thank you for the quick response. I will open a separate issue in the future, sorry about that!

Mercurial uses either RE2 or Python's re module, the latter being a fallback for backtracking expressions and other things that RE2 does not support, both of which differ from regex, as you said. There is also the issue of regex not supporting empty alternations, maybe concerns over performance that have to be benchmarked, etc.

I think the safest bet for me and Mercurial as a project is to also use RE2 from Rust, and then reconsider using regex further down the road.

Again, I appreciate your help. Maybe I can ping ripgrep when I get a crate with good enough support for hgignore. ;)

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

If you're already using RE2, then regex should be extremely close to that, since RE2 directly inspired this library. If you're worried about corner cases like empty alternations, then I would just do whatever it is you're planning on doing to handle look-around and backreferences (since RE2 does not support those either).

from regex.

Alphare avatar Alphare commented on July 17, 2024

I would be very interested in a feature comparison between the two engines, actually. For now, RE2 will be "plug-and-play", will come with fewer unanswered questions, and will be a lot easier both to write and to get through review.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

The main difference is that this library is faster, primarily due to its more advanced literal optimizations. As for actual features, the most significant difference is that this library has much better support for Unicode and enables such things by default. e.g., \w is Unicode aware by default, where as RE2 is not. But Unicode can be selectively disabled to get ASCII only versions of classes like \w. This library also supports character class set operations (intersection, difference, symmetric difference) as well as nested character classes.

You can read more about Unicode support here: https://github.com/rust-lang/regex/blob/master/UNICODE.md

Otherwise, the two libraries are pretty much identical in terms of semantics.

It's still not clear to me how you propose to address backtracking and backreferences...

For now, RE2 will be "plug-and-play", will come with fewer unanswered questions, and will be a lot easier both to write and to get through review.

Was this always your plan? Otherwise, it seems strange to reverse course based on my comments here.

from regex.

Alphare avatar Alphare commented on July 17, 2024

The main difference is that this library is faster, primarily due to its more advanced literal optimizations.

Literal optimization as in a string literal in source code or is it something completely different?

Otherwise, the two libraries are pretty much identical in terms of semantics.

I see, I wonder how easy it would be to "fuzz out" all differences between the two aside from the obvious ones you cited.

It's still not clear to me how you propose to address backtracking and backreferences...

For now, the plan is to fall back to Python (barring a potentially incredible slowdown) for unsupported patterns, while maybe giving a warning to the user.

For now, RE2 will be "plug-and-play", will come with fewer unanswered questions, and will be a lot easier both to write and to get through review.

Was this always your plan? Otherwise, it seems strange to reverse course based on my comments here.

The plan among the people discussing this issue has so far been to use RE2, because it's already being used in the Python code and has been used on huge repositories where edge cases would have been found.
It's also what the first Rust POC used: at that time, the lack of support for empty alternations and the escaping needed to work on bytes was additional work that was deemed not worth it, as some benchmarking on a big repo showed that RE2 was faster. I believe the data for those benchmarks was lost, and the methodology may not be too relevant anymore because of different patterns used, etc.

The only reason I opened this issue was to make sure that there could be a way of using regex after the initial version with RE2, and to size up how much work it would be, because I prefer using a Rust-native engine to a C++ binding.

I hope that makes sense.

from regex.

BurntSushi avatar BurntSushi commented on July 17, 2024

Literal optimization as in a string literal in source code or is it something completely different?

I don't think source code has anything to do with it. e.g., In the regex (foo|bar|quux)+, there exists an alternation of literals foo, bar and quux. In this regex library, it will do an analysis to detect those literals and accelerate the search, using SIMD where possible. RE2 doesn't do this, other than in the very simple case where every match starts with the same byte.

as some benchmarking on a big repo showed that RE2 was faster

Interesting. If y'all ever do benchmarking again, I'd be curious to know more details.

I hope that makes sense.

Yup, thanks!

from regex.

Alphare avatar Alphare commented on July 17, 2024

I don't think source code has anything to do with it. e.g., In the regex (foo|bar|quux)+, there exists an alternation of literals foo, bar and quux. In this regex library, it will do an analysis to detect those literals and accelerate the search, using SIMD where possible. RE2 doesn't do this, other than in the very simple case where every match starts with the same byte.

Very cool indeed.

Interesting. If y'all ever do benchmarking again, I'd be curious to know more details.

I will let you know when we get there; it should be interesting. If the methodology is good enough, you could even use it in your docs.

from regex.
