regex's Issues

do simple literal prefix scanning in regex!

Hi,

I saw the news that regex had been refactored and optimized, and decided to rerun my old benchmark. I was very surprised that it now takes twice as long!

How to reproduce (using multirust for versions as older regex doesn't compile with newer nightly Rust):

git clone https://github.com/mkpankov/parse-rust.git
cd parse-rust
multirust override nightly-2015-06-24
git checkout 4076c404caf1560a466e9f0799817035089fe841
cargo build --release
time zcat mp3-logs-with-fake-ips.log.gz | ./target/release/parse-rust
// outputs around 4s on my machine
multirust override nightly-2015-05-25
git checkout e33d410291fa7f134eef628b5591d605cd68b218
cargo clean
cargo build --release
time zcat mp3-logs-with-fake-ips.log.gz | ./target/release/parse-rust
// outputs around 2s on my machine

I'm sorry I can't pinpoint it more accurately (maybe it's Rust changes, not regex), but the recent major changes to regex might be the cause. A twofold slowdown is severe in my opinion and needs action.

regex versions:

  • new, degraded:
 "regex 0.1.38 (registry+https://github.com/rust-lang/crates.io-index)",
 "regex_macros 0.1.20 (registry+https://github.com/rust-lang/crates.io-index)",
  • old, fast:
 "regex 0.1.30 (registry+https://github.com/rust-lang/crates.io-index)",
 "regex_macros 0.1.18 (registry+https://github.com/rust-lang/crates.io-index)",

Some background: back when I did this, I compared the Rust version to a C++ version (an almost literal translation) and Rust beat C++ by about 40% without using compile-time regex. This kind of degradation puts it back behind C++ 😞

implement matching on byte strings

Currently, as far as I can see, regexes here only work on Unicode text. But they could also be used to parse binary files (or a mixture of binary and text).

For example, how can one implement strings --all using regexes in Rust?
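To make the request concrete, here is a hand-rolled sketch of the behavior a byte-oriented regex such as [\x20-\x7e]{4,} would need to express. It uses only the standard library; the function name `printable_runs` and the choice of printable range are assumptions for illustration, not anything the crate provides.

```rust
/// Returns every run of printable ASCII bytes that is at least `min_len` long.
/// "Printable" here means 0x20..=0x7e, which is an assumption of this sketch.
fn printable_runs(data: &[u8], min_len: usize) -> Vec<&[u8]> {
    data.split(|b| !(0x20..=0x7e).contains(b))
        .filter(|run| run.len() >= min_len)
        .collect()
}

fn main() {
    let data = b"\x00\x01hello\xffab\x00world!";
    for run in printable_runs(data, 4) {
        // The runs are pure ASCII, so a lossy conversion never alters them.
        println!("{}", String::from_utf8_lossy(run));
    }
}
```

The input is never required to be valid UTF-8, which is exactly what a &str-based API cannot offer.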

iterating over empty matches is wrong

This code:

extern crate regex;

use regex::Regex;

fn main() {
    let re = Regex::new(".*?").unwrap();
    for m in re.find_iter("ΛΘΓΔα") {
        println!("{:?}", m);
    }
}

When run, it violates an assert in std::str:

[andrew@Liger play] ./target/scratch 
thread '<main>' panicked at 'assertion failed: (w != 0)', /home/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libcore/str/mod.rs:1477

The precise reason is probably in how find_iter (and captures_iter) handles zero-length matches. It has to make progress after a zero-length match in the input, otherwise the iterator won't terminate. Currently, it makes progress by adding 1 to a byte offset into a UTF-8 encoded string, which is obviously wrong. This should be relatively easily fixed by setting self.last_end = self.last_end + length_of_char_that_starts_at(self.last_end).
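The proposed fix can be sketched with the standard library alone. The helper names `advance_one_char` and `empty_match_offsets` are hypothetical; the real iterator keeps its position in self.last_end, which is modeled here by a local variable.

```rust
/// Returns the byte offset just past the character that starts at `pos`.
/// `pos` must lie on a char boundary of `text`.
fn advance_one_char(text: &str, pos: usize) -> usize {
    match text[pos..].chars().next() {
        Some(c) => pos + c.len_utf8(),
        // No character is left at the end of the input; step past the end
        // so that the caller's loop terminates.
        None => pos + 1,
    }
}

/// Collects the offsets at which an iterator over empty matches would report
/// a match, making progress by one whole character each time.
fn empty_match_offsets(text: &str) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut last_end = 0;
    while last_end <= text.len() {
        offsets.push(last_end);
        last_end = advance_one_char(text, last_end);
    }
    offsets
}

fn main() {
    // Every letter here is two bytes long in UTF-8.
    println!("{:?}", empty_match_offsets("ΛΘΓΔα"));
}
```

Adding 1 instead would land on offset 1, in the middle of the first two-byte character, which is the kind of invalid offset that trips the assertion.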

Regex documentation on \d possibly misleading

Not sure if it's just because I don't know enough about character classes, but when I read:

\d          Perl character class ([0-9])
\D          Negated Perl character class ([^0-9])

(Source: http://doc.rust-lang.org/regex/regex/index.html#matching-one-character)

I thought this meant that \d would only match the ASCII characters 0 through 9, but in fact it'll match other Unicode digits too:

#![feature(plugin)]
#![plugin(regex_macros)]
extern crate regex;

fn main() {
    let re = regex!(r"\d");
    println!("{}", re.is_match("٩")); // prints "true"
}

Maybe the wording should be changed?
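The distinction between the two readings can be illustrated without the regex crate, using only std methods on char. This is an approximation: char::is_numeric also accepts letter-like and other numeric characters, so it is broader than the decimal-digit class that \d appears to use. The two helper names are hypothetical.

```rust
/// What the documentation's "[0-9]" suggests: ASCII digits only.
fn matches_ascii_digit(c: char) -> bool {
    c.is_ascii_digit()
}

/// A rough stand-in for a Unicode-aware digit class.
fn matches_unicode_numeric(c: char) -> bool {
    c.is_numeric()
}

fn main() {
    let arabic_nine = '\u{669}'; // ARABIC-INDIC DIGIT NINE
    println!("ascii:   {}", matches_ascii_digit(arabic_nine));
    println!("unicode: {}", matches_unicode_numeric(arabic_nine));
}
```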

problem with regex compilation

On attempting to build my project, I see the following:

$ cargo build
    Updating registry `https://github.com/rust-lang/crates.io-index`
    Updating git repository `https://github.com/alexcrichton/toml-rs`
 Downloading docopt v0.6.42
 Downloading regex v0.1.16
 Downloading libc v0.1.2
 Downloading log v0.2.5
 Downloading rustc-serialize v0.2.15
 Downloading rustc-serialize v0.3.0
   Compiling regex v0.1.16
/Users/alec/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/re.rs:14:16: 14:23 error: unresolved import `std::str::Pattern`. There is no `Pattern` in `std::str`
/Users/alec/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/re.rs:14 use std::str::{Pattern, Searcher, SearchStep};
                                                                                                     ^~~~~~~
/Users/alec/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/re.rs:14:25: 14:33 error: unresolved import `std::str::Searcher`. There is no `Searcher` in `std::str`
/Users/alec/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/re.rs:14 use std::str::{Pattern, Searcher, SearchStep};
                                                                                                              ^~~~~~~~
/Users/alec/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/re.rs:14:35: 14:45 error: unresolved import `std::str::SearchStep`. There is no `SearchStep` in `std::str`
/Users/alec/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/re.rs:14 use std::str::{Pattern, Searcher, SearchStep};
                                                                                                                        ^~~~~~~~~~
error: aborting due to 3 previous errors
Could not compile `regex`.

[feature request] Expose libregex's parsing/compiling internals

Issue by andrew-d
Thursday Nov 06, 2014 at 19:54 GMT

For earlier discussion, see rust-lang/rust#18710

This issue was labelled with: in the Rust repository


I was looking at implementing something similar to this - a trigram-index-aided search. I'd rather not reproduce the code necessary to parse the regex, considering it already lives in libregex. It'd be nice if the parsing/compiling was exposed for use - perhaps similar to how Go does it with their regexp and regexp/syntax packages.

cc @BurntSushi

HEAD rust fails to compile

Sorry for not writing a PR, but my jetlagged brain isn't working for some reason. Seems easy:

   Compiling regex v0.1.16 (file:///home/steve/src/regex)
src/parse.rs:661:26: 661:41 error: mismatched types:
 expected `&'static [(&'static str, &'static &'static [(char, char)])]`,
    found `&'static [(&'static str, &'static [(char, char)])]`
(expected &-ptr,
    found slice) [E0308]
src/parse.rs:661         match find_class(UNICODE_CLASSES, name.as_slice()) {
                                          ^~~~~~~~~~~~~~~
error: aborting due to previous error
Could not compile `regex`.

crates.io seems to be out of sync for regex_macros 0.1.1

I tried to compile a project that uses "mysql" and got the following:

~/.cargo/registry/src/github.com-1ecc6299db9ec823/regex_macros-0.1.1/src/lib.rs:604 fn vec_expr<T, It: Iterator<T>>(&self, xs: It,

Looking in the actual code of 0.1.1 the line was updated. Is there a version bump missing?

Cannot build with rustc 1.0.0 (a59de37e9 2015-05-13) (built 2015-05-14)

Let's say we have the following Cargo.toml:

[package]

name = "simproc"
version = "0.1.0"
authors = ["Alvaro Polo <[email protected]>"]

[dependencies]
docopt = "*"
regex = "*"
regex_macros = "*"
rustc-serialize = "*"

[[bin]]
name = "spasm"

If we attempt to build with cargo build we obtain:

   Compiling regex v0.1.38
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1093:5: 1093:43 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1093     type Searcher = RegexSearcher<'r, 't>;
                                                                                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1095:5: 1101:6 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1095     fn into_searcher(self, haystack: &'t str) -> RegexSearcher<'r, 't> {
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1096         RegexSearcher {
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1097             it: self.find_iter(haystack),
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1098             last_step_end: 0,
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1099             next_match: None,
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1100         }
                                                                                            ...
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1092:14: 1092:25 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1092 impl<'r, 't> Pattern<'t> for &'r Regex {
                                                                                                         ^~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1106:5: 1109:6 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1106     #[inline]
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1107     fn haystack(&self) -> &'t str {
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1108         self.it.search
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1109     }
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1111:5: 1140:6 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1111     #[inline]
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1112     fn next(&mut self) -> SearchStep {
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1113         if let Some((s, e)) = self.next_match {
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1114             self.next_match = None;
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1115             self.last_step_end = e;
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1116             return SearchStep::Match(s, e);
                                                                                            ...
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1105:21: 1105:33 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1105 unsafe impl<'r, 't> Searcher<'t> for RegexSearcher<'r, 't> {
                                                                                                                ^~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1112:27: 1112:37 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1112     fn next(&mut self) -> SearchStep {
                                                                                                                      ^~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1116:20: 1116:37 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1116             return SearchStep::Match(s, e);
                                                                                                               ^~~~~~~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1120:46: 1120:56 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1120                 if self.last_step_end < self.haystack().len() {
                                                                                                                                         ^~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1122:47: 1122:57 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1122                     self.last_step_end = self.haystack().len();
                                                                                                                                          ^~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1123:51: 1123:61 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1123                     SearchStep::Reject(last, self.haystack().len())
                                                                                                                                              ^~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1123:21: 1123:39 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1123                     SearchStep::Reject(last, self.haystack().len())
                                                                                                                ^~~~~~~~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1125:21: 1125:37 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1125                     SearchStep::Done
                                                                                                                ^~~~~~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1131:21: 1131:38 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1131                     SearchStep::Match(s, e)
                                                                                                                ^~~~~~~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1136:21: 1136:39 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1136                     SearchStep::Reject(last, s)
                                                                                                                ^~~~~~~~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/lib.rs:394:34: 394:51 error: unstable feature
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/lib.rs:394 #![cfg_attr(feature = "pattern", feature(pattern))]
                                                                                                                             ^~~~~~~~~~~~~~~~~
note: this feature may not be used in the stable release channel
error: aborting due to 16 previous errors
Could not compile `regex`.

To learn more, run the command again with --verbose.

support Unicode grapheme clusters

The regex engine doesn't consider characters (graphemes) that consist of multiple code points correctly.

For example, the letter 'ä' has two representations that should both be matched by the regex ., but only the latter is.

Bash                 | Rust       | Codepoints
echo $'\x61\xcc\x88' | "a\u{308}" | U+0061 U+0308
echo $'\xc3\xa4'     | "\u{e4}"   | U+00e4
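The two encodings of 'ä' can be inspected with the standard library alone. The helper `code_points` is hypothetical; the sketch only shows why a matcher that works one code point at a time sees one string as a single unit and the other as two.

```rust
/// Lists the Unicode code points of `s` as integers.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

fn main() {
    let precomposed = "\u{e4}"; // one code point
    let decomposed = "a\u{308}"; // 'a' followed by a combining diaeresis

    println!("{:x?} {:x?}", code_points(precomposed), precomposed.as_bytes());
    println!("{:x?} {:x?}", code_points(decomposed), decomposed.as_bytes());
}
```

Both strings render as the same letter, yet they differ in code point count and in bytes, so they also compare unequal as strings.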

Regex on text streams

Issue by suhr
Wednesday May 07, 2014 at 14:18 GMT

For earlier discussion, see rust-lang/rust#14015

This issue was labelled with: A-libs in the Rust repository


Regex library defines methods for find/replace on strings, but what about text streams?

Very large copies in compiled regexes.

There is a ::std::mem::swap(&mut clist, &mut nlist); call generated when using regex!.
The two variables have the type Threads which can get very large, in this particular case, I estimate they're each 22kB.

According to valgrind, each exec call ends up with 200 memcpy's - this makes regex! five times slower than the dynamic regex, for me.
Adding a single line recovers performance of the generated code, even if it's only marginally better than the dynamic regex:

let (mut clist, mut nlist) = (&mut clist, &mut nlist);

No other changes are necessary because of deref coercions (as weird as &mut &mut T -> &mut T may be).
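A standalone sketch of why the one-line rebinding helps. The `Threads` struct below is a stand-in with an assumed 22kB payload, not the type the macro generates, and `swap_by_reference` is a hypothetical name.

```rust
use std::mem;

/// Stand-in for the generated thread list; the size is an assumption.
struct Threads {
    slots: [u8; 22 * 1024],
}

/// Swaps two lists through references and returns the first byte of each.
fn swap_by_reference() -> (u8, u8) {
    let mut clist = Threads { slots: [1; 22 * 1024] };
    let mut nlist = Threads { slots: [2; 22 * 1024] };

    // Rebind both names as mutable references, as in the one-line fix.
    let (mut clist, mut nlist) = (&mut clist, &mut nlist);

    // This now swaps two pointers instead of two 22kB values.
    mem::swap(&mut clist, &mut nlist);

    (clist.slots[0], nlist.slots[0])
}

fn main() {
    println!("value size:   {} bytes", mem::size_of::<Threads>());
    println!("pointer size: {} bytes", mem::size_of::<&mut Threads>());
    println!("{:?}", swap_by_reference());
}
```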

Optimize case_fold_and_combine_ranges?

I haven’t timed it, but the case_fold_and_combine_ranges function introduced in #78 is probably slow for large sets of character ranges like \p{Lu}. (That is, slow to parse/compile a regex.) It can probably be improved to not consider individual chars when we can determine that none in a given range (or subrange) is affected by case folding.

make case insensitive comparisons better

We should switch from equality comparison on the first character returned by {upper,lower}case iterators to "does this character match any character in the {upper,lower}case iterator."

Incorrect case-insensitive matching of character ranges

Character range matching is conceptually (range_start..range_end).any(|c| c == input_char), but as an optimization is implemented as range_start <= input_char && input_char <= range_end. This is fine.

Case-insensitive matching is implemented as uppercase(c) == uppercase(input_char). This is fine (modulo #55).

So case-insensitive range matching is conceptually (range_start..range_end).any(|c| uppercase(c) == uppercase(input_char)). It is currently implemented as uppercase(range_start) <= uppercase(input_char) && uppercase(input_char) <= uppercase(range_end) which is not equivalent.

One of the tests currently passing is that (?i)\p{Lu}+ matches ΛΘΓΔα entirely. That is, greek letters (both upper case and lower case) all match the category of upper case letters when matched case-insensitively. But the same test with \p{Ll} (category of lower case letters) instead of \p{Lu} currently fails because of this issue. (\p{Lu} and \p{Ll} expand to large unions of character ranges.)
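The gap between the two formulations shows up even in ASCII. The sketch below is not the crate's code: `upper` takes only the first character of the uppercase mapping, the range is treated as inclusive, and all three function names are hypothetical.

```rust
/// Simple uppercase mapping: the first char of the uppercase iterator.
fn upper(c: char) -> char {
    c.to_uppercase().next().unwrap_or(c)
}

/// The conceptual definition: some character in the range folds to the input.
fn conceptual_match(start: char, end: char, input: char) -> bool {
    (start..=end).any(|c| upper(c) == upper(input))
}

/// The shortcut described above: compare the folded endpoints.
fn bounds_match(start: char, end: char, input: char) -> bool {
    upper(start) <= upper(input) && upper(input) <= upper(end)
}

fn main() {
    // The range Z-a contains 'Z', so a case-insensitive match on 'z' should succeed.
    println!("conceptual: {}", conceptual_match('Z', 'a', 'z'));
    // Folding the endpoints gives 'Z' and 'A', an empty range.
    println!("bounds:     {}", bounds_match('Z', 'a', 'z'));
}
```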

implement &T for Replacer where T: Replacer

When attempting to call replace_all on a Regex with a str as the argument for rep (as opposed to a NoExpand), the Replacer trait implementation for str is apparently being ignored, instead falling back to the FnMut implementation. This project demonstrates the following compile error:

src\main.rs:13:23: 13:53 error: the trait `for<'r,'r> core::ops::Fn<(&'r regex::re::Captures<'r>,)>` is not implemented for the type `str` [E0277]
src\main.rs:13         true => regex.replace_all(&source, &replace.as_str()),

Where replace and source are both Strings.

Edit: just fixed the type of replace by adding as_str, as the & was not coercing the String to a str properly. See the latest commit.

implement a DFA matcher

One of the reasons why RE2/C++ is so fast is because it has two implementations of regex matching: a limited DFA matcher (no sub-capture support) and a full NFA simulation. This crate has the latter, but not the former.

Adding a DFA matcher should be an implementation detail and shouldn't require any public facing changes.

This is a pretty involved project. I hope to find time to do this some day, but if someone else wants to tackle it, I'd be happy to help mentor it. (Part of this will be figuring out how to handle the regex! macro. Do we replicate the DFA there too like we do the NFA?)

optimize literal alternations

From https://lwn.net/Articles/589009/

The handling of more complex alternations is a known (relatively) weak point of jrep (more precisely of rejit) in need of improvement. Grep uses a smart Boyer-Moore algorithm. To look for aaa|bbb|ccc at position p, it looks up the character at p + 2, and if it is not a, b, or c, knows it can jump three characters ahead to p + 3 (and then look at the character at p + 5).

On the other hand, like for single strings, rejit handles alternations simply: it applies brute force. But it does so relatively efficiently, so the performance is still good. To search for aaa|bbb|ccc at some position p in the text, rejit performs operations like:

    loop:
      find 'aaa' at position p
      if found goto match
      find 'bbb' at position p
      if found goto match
      find 'ccc' at position p
      if found goto match
      increment position and goto loop
    match:

The complexity is proportional to the number of alternations. Worse, when the number of alternated expressions exceeds a threshold (i.e. when the compiler cannot allocate a register per alternated expression), rejit falls back to some slow default code. This is what happens for the two regexps with eight or more alternated strings. The code generation should be fixed to allow an arbitrary number of alternated strings.

In other words, this is a way to bypass the regex machinery and degrade to a simple substring search.

In addition to being a common case to optimize, it should also give a small bump to the regex-dna benchmark because one of the regexes is just an alternation of literals: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=rust&id=1. The rest contain character classes, which complicates things somewhat.

The easy part of this optimization is the actual searching of literal strings and jumping ahead in the input (there is precedent for this already in the code with literal prefixes). The harder part, I think, is analyzing the regex to find where the optimization can be applied. The issue is that an alternation is compiled to a series of split and jump instructions. It is easiest to discover the opportunity to optimize by analyzing the AST of the regex---but there will need to be a way to carry that information through to the VM.

One approach might be to tag pieces of the syntax with possible optimization (this is hopefully the first of many). Then when the AST is compiled to instructions, that information can be stored and indexed by the current program counter. The VM can then ask, "Do there exist any optimizations for this PC?" The rest is gravy.

N.B. This only works for a regex that is of the form a|b|c|.... It might be possible to generalize this to other cases, but it seems tricky.
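As a baseline, the brute-force loop from the quoted article can be written as plain substring checks. This is only a sketch of the "bypass the regex machinery" idea, not the Boyer-Moore variant and not the crate's implementation; `find_alternation` is a hypothetical name and the literals are assumed to be non-empty.

```rust
/// Finds the leftmost match of any literal, preferring earlier alternatives
/// when several match at the same position. Returns (start, end) byte offsets.
fn find_alternation(haystack: &str, literals: &[&str]) -> Option<(usize, usize)> {
    for (start, _) in haystack.char_indices() {
        for lit in literals {
            if haystack[start..].starts_with(*lit) {
                return Some((start, start + lit.len()));
            }
        }
    }
    None
}

fn main() {
    let literals = ["aaa", "bbb", "ccc"];
    println!("{:?}", find_alternation("xxbbbaaa", &literals));
    println!("{:?}", find_alternation("xyz", &literals));
}
```

The cost is proportional to the number of alternatives at every position, which is the weakness a smarter skip-ahead search would address.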

Attempting to parse {2} causes panic

extern crate regex;

use regex::Regex;

fn main() {
    let _ = Regex::new("{2}");
}

/t/test (master|…) $ cargo run
     Running `target/debug/test`
thread '<main>' panicked at 'called `Option::unwrap()` on a `None` value', /Users/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-mac/build/src/libcore/option.rs:362
An unknown error occurred

To learn more, run the command again with --verbose.

Was found using https://github.com/kmcallister/afl.rs 👍

/(*)/ gives `Tried to unwrap non-AST item: Paren(0, 1, "")`

fn main() {
    regex::Regex::new("(*)");
}

panics

thread '<main>' panicked at 'Tried to unwrap non-AST item: Paren(0, 1, "")', /home/huon/.multirust/toolchains/nightly-2015-03-01/cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/parse.rs:163

(Something similar happens with (?:?) etc.)

Allow escaping `#` when using the x flag

It's nothing really important, but it would be nice if one could escape a literal # so it doesn't start a comment when using the x flag.

The current workaround is using the ASCII literal \x23, but something like \# would be clearer.

Workaround example:

let markdown_headline = regex!(r"(?x)
    ^
    (?P<level>[\x23]+)  # A bunch of hash symbols
    \s
    (?P<title>.+)       # Title
    $
");

regex negated character set doesn't work properly with character classes

Issue by kanaka
Tuesday Oct 14, 2014 at 15:11 GMT

For earlier discussion, see rust-lang/rust#18035

This issue was labelled with: A-libs in the Rust repository


$ rustc --version
rustc 0.13.0-nightly (1c3ddd297 2014-10-13 23:27:46 +0000)

Here is my test code:

#![feature(phase)]
#[phase(plugin)]
extern crate regex_macros;
extern crate regex;

fn main() {
    let re = regex!(r#"([^\s])"#);
    println!("\n1 {} on '1 2 3'", re);
    for cap in re.captures_iter("1 2 3") {
        println!("0: {}, 1: {}", cap.at(0), cap.at(1));
    }

    let re = regex!(r#"([^\s,])"#);
    println!("\n2 {} on '1 2 3'", re);
    for cap in re.captures_iter("1 2 3") {
        println!("0: {}, 1: {}", cap.at(0), cap.at(1));
    }

    let re = regex!(r#"([^A,])"#);
    println!("\n3 {} on '1A2A3'", re);
    for cap in re.captures_iter("1A2A3") {
        println!("0: {}, 1: {}", cap.at(0), cap.at(1));
    }

    let re = regex!(r#"([^[:alpha:],])"#);
    println!("\n4 {} on '1A2A3'", re);
    for cap in re.captures_iter("1A2A3") {
        println!("0: {}, 1: {}", cap.at(0), cap.at(1));
    }

    let re = regex!(r#"([^[:alpha:]Z])"#);
    println!("\n5 {} on '1A2A3'", re);
    for cap in re.captures_iter("1A2A3") {
        println!("0: {}, 1: {}", cap.at(0), cap.at(1));
    }
}

Results:

1 ([^\s]) on '1 2 3'
0: 1, 1: 1
0: 2, 1: 2
0: 3, 1: 3

2 ([^\s,]) on '1 2 3'
0: 1, 1: 1
0:  , 1:  
0: 2, 1: 2
0:  , 1:  
0: 3, 1: 3

3 ([^A,]) on '1A2A3'
0: 1, 1: 1
0: 2, 1: 2
0: 3, 1: 3

4 ([^[:alpha:],]) on '1A2A3'
0: 1, 1: 1
0: A, 1: A
0: 2, 1: 2
0: A, 1: A
0: 3, 1: 3

5 ([^[:alpha:]Z]) on '1A2A3'
0: 1, 1: 1
0: A, 1: A
0: 2, 1: 2
0: A, 1: A
0: 3, 1: 3

Cases 2, 4, 5 have the misbehavior. My expectation is that every case should have the same result as 1 and 3.

generate less code for regex plugin

Issue by huonw
Tuesday Apr 29, 2014 at 14:00 GMT

For earlier discussion, see rust-lang/rust#13842

This issue was labelled with: I-compiletime in the Rust repository


#![feature(phase)]
#![allow(dead_code)]

#[phase(syntax)] extern crate regex_macros;
extern crate regex;

#[cfg(short)]
fn short() {
    regex!("a");
}

#[cfg(medium)]
fn medium() {
    // 500
    regex!("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
}

#[cfg(long)]
fn long() {
    // 1000
    regex!("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
}

fn main() {}

$ for x in short medium long; do echo $x; time rustc regex.rs --cfg $x --no-trans; done
short

real    0m0.102s
user    0m0.092s
sys     0m0.012s
medium

real    0m1.384s
user    0m1.332s
sys     0m0.048s
long

real    0m3.612s
user    0m3.508s
sys     0m0.104s

They don't take nearly this long to just compile as dynamic ones, so I would guess it's the extra work that the generating macro is doing. (Note the --no-trans there, so it isn't just the extra code making LLVM slow.)

A perf trace identifies rustc librustc-4283bb68-0.11-pre.so [.] hashmap::HashMap$LT$K$C$$x20V$C$$x20H$GT$::search::h14555543045583792107::v0.11.pre as taking a lot (10.18%) of time.

Regex macros documentation should specify #[no_link]

Currently, the regex documentation shows this for using regex_macros:

#![feature(plugin)]
#[plugin]
// should have #[no_link] here, but doesn't
extern crate regex_macros;

This is slightly incorrect, as not specifying #[no_link] on extern crate regex_macros causes a runtime error on systems without rustc installed.

See rust-lang/rust#20769.

implement better code generation for the regex plugin

Issue by comex
Thursday May 08, 2014 at 05:23 GMT

For earlier discussion, see rust-lang/rust#14029

This issue was labelled with: A-libs in the Rust repository


Consider this code:

#![feature(phase)]
extern crate regex;
#[phase(syntax)]
extern crate regex_macros;

pub fn is_all_a(s: &str) -> bool {
    return regex!("^a+$").is_match(s);
}

Ideally this would optimize away to a small function that just iterates over the string and checks for characters other than 'a'.

Instead, it:

  • calls malloc several times to start out;
  • goes through an indirect call unless LTO is enabled - might not usually be a big deal, but I would like to eventually be able to efficiently match a regex on a single character in lieu of writing out all the possibilities manually
  • to the 'exec' function, which itself, even with LTO (and -O) enabled, makes many non-inlined calls, including to malloc, char_range_at, char_range_at_reverse, etc.

Without LTO, it generates about 7kb of code for one regex, or 34kb if I put 8 regexes in that function. Not the end of the world, but it adds up.
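For comparison, the "small function" described above as the ideal result might look like the hand-written version below. It assumes that $ matches only at the very end of the input, and it is a sketch of the target, not something the plugin currently emits.

```rust
/// Hand-written equivalent of matching ^a+$ against `s`:
/// at least one byte, and every byte is b'a'.
pub fn is_all_a(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b == b'a')
}

fn main() {
    for input in ["aaa", "", "aab"] {
        println!("{:?} -> {}", input, is_all_a(input));
    }
}
```

Checking bytes rather than chars is safe here because b'a' never appears inside a multi-byte UTF-8 sequence.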

I recognize the regex implementation is new, but I thought this was worth filing anyway as room for improvement.

rustc 0.11-pre-nightly (2dcbad5 2014-05-06 22:01:43 -0700)

optimize use of `^`/`\A`

At the moment, the VM always scans the entire input unless one of the two following things happens:

  • The regex has a literal prefix and that prefix cannot be found anywhere in the input. (This always scans the entire input.)
  • A match is found. This may not scan the entire input, but it requires a match.

Ideally, given a regex ^{something}..., it could terminate the search early if {something} doesn't match at the start of the input.
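A minimal sketch of the early exit, using a plain literal in place of {something}. The function `find_literal` and its `anchored` flag are hypothetical and only model the control flow; they are not part of the crate.

```rust
/// Searches for `needle`; when `anchored`, only offset 0 is considered.
fn find_literal(haystack: &str, needle: &str, anchored: bool) -> Option<usize> {
    if anchored {
        // Anchored at the start: one check at offset 0 decides the outcome,
        // so the rest of the input is never scanned.
        if haystack.starts_with(needle) {
            Some(0)
        } else {
            None
        }
    } else {
        haystack.find(needle)
    }
}

fn main() {
    println!("{:?}", find_literal("xxabc", "abc", true));
    println!("{:?}", find_literal("xxabc", "abc", false));
}
```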

Extra characters in capture groups

Using the following with rustc 1.0.0-nightly (d8be84eb4 2015-03-29) (built 2015-03-29), regex 0.1.24, and regex_macros 0.1.14:

#![feature(collections)]
#![feature(plugin)]
#![plugin(regex_macros)]
extern crate regex;

fn main() {
    let user_input = String::from_str("0,2,3-5 15\n");
    let re = regex!(r"(?i)\s*([0-9]+(?:-[0-9]+)?)(?:\s*|,)");

    for cap in re.captures_iter(&user_input) {
        println!("Captured: {:?}", cap.at(0).unwrap_or(""));
    }
}

I get the expected output on every println!, except for the last two. Neither the whitespace, nor the newline should be included in the capture group.

% ./test
Captured: "0"
Captured: "2"
Captured: "3-5 "
Captured: "15\n"

Switching the \s* in the regex! to \b matches as I would expect:

    let re = regex!(r"(?i)\b([0-9]+(?:-[0-9]+)?)(?:\b|,)");

% ./test2
Captured: "0"
Captured: "2"
Captured: "3-5"
Captured: "15"

It looks like the \s in the non-capture group immediately following the capture group is leaking in?

be more liberal with escape sequences in the parser

A bit of background: I'm trying to implement the BrowserScope user agent parser as seen here: https://github.com/ua-parser/uap-core/. The regular expression below can be found here: https://github.com/ua-parser/uap-core/blob/master/regexes.yaml#L17

The crate (0.1.38) encounters an error compiling the following regex:

/((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|Tailsweep)[ \-](\d+)(?:\.(\d+)(?:\.(\d+))?)?

with the following error:

Syntax(Error { pos: 92, surround: "p)[ \\-](\\d", kind: UnrecognizedEscape('-') })'

To reproduce:

use regex::Regex;
let re = Regex::new(r"/((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|Tailsweep)[ \-](\d+)(?:\.(\d+)(?:\.(\d+))?)?").unwrap();
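
What "more liberal" could mean in practice is sketched below. The policy shown is an assumption for illustration, not the crate's actual rule, and `escaped_literal` is a hypothetical helper: a backslash followed by ASCII punctuation is taken as that literal character, while letters and digits stay reserved for classes such as `\d`.

```rust
/// Decides what `\c` means when `c` is not a known escape.
/// Assumed policy: ASCII punctuation escapes to itself, so `\-` is `-`;
/// letters and digits are rejected so they stay free for classes like `\d`.
fn escaped_literal(c: char) -> Option<char> {
    if c.is_ascii_punctuation() {
        Some(c)
    } else {
        None
    }
}
```

Under this policy `escaped_literal('-')` returns `Some('-')`, so the `[ \-]` class in the regex above would parse, while `escaped_literal('q')` still returns `None`.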

Build fails: There is no `BinarySearchResult` in `std::slice`

rustc 0.13.0-nightly (10d99a973 2014-12-31 21:01:42 +0000)
cargo build 
   Compiling regex v0.1.4 (file://***/regex)
***/regex/src/parse.rs:21:5: 21:35 error: unresolved import `std::slice::BinarySearchResult`. There is no `BinarySearchResult` in `std::slice`
***/regex/src/parse.rs:21 use std::slice::BinarySearchResult;
                                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
error: aborting due to previous error
Could not compile `regex`.

To learn more, run the command again with --verbose.

It got removed from libcore: rust-lang/rust@67d1388#diff-91f9d2237c7851d61911b0ca64792a88L1228

Regex docs don't point people to crates.io dependency

Reading http://doc.rust-lang.org/regex/regex/index.html and http://doc.rust-lang.org/regex/regex/enum.Regex.html there is no hint to the user that they should be updating Cargo.toml with something like:

[dependencies]
regex = "*"
regex_macros = "*"

The error doesn't help either:

   Compiling example v0.0.1 (file:///path/to/projects/example)
src/main.rs:5:1: 5:27 error: can't find crate for `regex_macros`
src/main.rs:5 extern crate regex_macros;

impose a limit on the size of a compiled regex program

From the docs:

Currently, there are no counter-measures in place to prevent a malicious user from writing an expression that may use a lot of resources. One such example is to repeat counted repetitions: ((a{100}){100}){100} will try to repeat the a instruction 100^3 times. Essentially, this means it's very easy for an attacker to exhaust your system's memory if they are allowed to execute arbitrary regular expressions. A possible solution to this is to impose a hard limit on the size of a compiled expression, but it does not yet exist.

The conclusion is that regexes specified by a user cannot be blindly trusted, since they can trivially exhaust all memory on your system. We can fix this by imposing some limit on the size of a regex program. (In fact, this probably has to be a limit on the size of a regex AST, which will need to be checked during construction.)
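
A size check along these lines could be sketched as follows. The `Ast` type, `program_size`, and `nested_repeat` are hypothetical and much simpler than the crate's real parser. For brevity the check runs on an already-built AST rather than during construction.

```rust
/// A toy regex AST (hypothetical, not the crate's real type).
enum Ast {
    Literal(char),
    Concat(Vec<Ast>),
    /// `inner{count}`
    Repeat(Box<Ast>, u64),
}

/// Estimates how many instructions `ast` would compile to. Returns `None`
/// as soon as the estimate exceeds `limit` or overflows a `u64`.
fn program_size(ast: &Ast, limit: u64) -> Option<u64> {
    let size = match ast {
        Ast::Literal(_) => 1,
        Ast::Concat(parts) => {
            let mut total = 0u64;
            for part in parts {
                total = total.checked_add(program_size(part, limit)?)?;
            }
            total
        }
        // A counted repetition copies its body `count` times.
        Ast::Repeat(inner, count) => program_size(inner, limit)?.checked_mul(*count)?,
    };
    if size <= limit {
        Some(size)
    } else {
        None
    }
}

/// Builds `((a{count}){count})...` nested `depth` levels deep.
fn nested_repeat(depth: u32, count: u64) -> Ast {
    let mut ast = Ast::Literal('a');
    for _ in 0..depth {
        ast = Ast::Repeat(Box::new(ast), count);
    }
    ast
}
```

For `((a{100}){100}){100}`, built as `nested_repeat(3, 100)`, the estimate is 1,000,000 instructions, so any limit below that makes `program_size` return `None` before anything is compiled.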

Binary search is used with incorrectly-sorted array

The “dynamic” matching of CharClass uses a binary search within char ranges, which relies on the input being sorted. The input is indeed sorted in its “natural” order, but the comparison function in case-insensitive mode uses a different order.

This leads to incorrect results:

assert!(Regex::new(r"(?i)[a_]+$").unwrap().is_match("A_"));

The above fails, because _ in ASCII is between upper-case and lower-case letters. The comparison function maps the 'a'..'a' range to its upper case 'A'..'A', which has a different order relative to '_'..'_'.

Compare with e.g. the code below, which succeeds.

assert!(Regex::new(r"(?i)[a=]+$").unwrap().is_match("A="));

The comparison function has a FIXME to move the case mapping outside of it and have the Vec of ranges be already mapped. I assume this was intended for performance, but I think it’ll also fix this issue. I’ll try it.
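
The ordering problem can be reproduced without the crate. The sketch below assumes single-character ASCII ranges only, and all names in it are hypothetical. The binary search is hand-written so the probe order is explicit.

```rust
use std::cmp::Ordering;

type Range = (char, char);

fn upper(c: char) -> char {
    c.to_ascii_uppercase()
}

/// Plain binary search over `ranges`; `cmp` orders a range relative to `c`.
fn contains(ranges: &[Range], c: char, cmp: fn(Range, char) -> Ordering) -> bool {
    let (mut lo, mut hi) = (0, ranges.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        match cmp(ranges[mid], c) {
            Ordering::Equal => return true,
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
        }
    }
    false
}

/// Buggy variant: case is mapped inside the comparison, so the order it
/// assumes can differ from the order the slice is actually sorted in.
fn cmp_fold_late(r: Range, c: char) -> Ordering {
    cmp_plain((upper(r.0), upper(r.1)), upper(c))
}

/// Plain comparison, for ranges that were case-mapped up front.
fn cmp_plain(r: Range, c: char) -> Ordering {
    if c < r.0 {
        Ordering::Greater
    } else if c > r.1 {
        Ordering::Less
    } else {
        Ordering::Equal
    }
}

/// Fix sketch: case-map the ranges once, then re-sort them.
fn fold_ranges(ranges: &[Range]) -> Vec<Range> {
    let mut folded: Vec<Range> = ranges.iter().map(|&(s, e)| (upper(s), upper(e))).collect();
    folded.sort();
    folded
}
```

For the class `[a_]`, stored as `[('_', '_'), ('a', 'a')]`, the late-folding comparison fails to find `'_'`. After `fold_ranges`, both `'_'` and `'A'` are found with the plain comparison.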

regex v0.1.28 : compilation error

I am trying to compile regex as part of a project and I get the following error message:

$ cargo build
    Updating registry `https://github.com/rust-lang/crates.io-index`
   Compiling regex v0.1.28
/home/guillaume/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.28/src/re.rs:16:25: 16:32 error: unresolved import `std::str::pattern::Pattern`. Could not find `pattern` in `std::str`
/home/guillaume/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.28/src/re.rs:16 use std::str::pattern::{Pattern, Searcher, SearchStep};
                                                                                                                  ^~~~~~~
/home/guillaume/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.28/src/re.rs:16:34: 16:42 error: unresolved import `std::str::pattern::Searcher`. Could not find `pattern` in `std::str`
/home/guillaume/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.28/src/re.rs:16 use std::str::pattern::{Pattern, Searcher, SearchStep};
                                                                                                                           ^~~~~~~~
/home/guillaume/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.28/src/re.rs:16:44: 16:54 error: unresolved import `std::str::pattern::SearchStep`. Could not find `pattern` in `std::str`
/home/guillaume/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.28/src/re.rs:16 use std::str::pattern::{Pattern, Searcher, SearchStep};
                                                                                                                                     ^~~~~~~~~~
error: aborting due to 3 previous errors
Could not compile `regex`.

And the version of rustc:

$ rustc --version
rustc 1.0.0-beta (9854143cb 2015-04-02) (built 2015-04-02)

And the version of cargo:

$ cargo --version
cargo 0.0.1-pre-nightly (84d6d2c 2015-03-31) (built 2015-03-31)

unexpected panic with (invalid) regex "(+)"

Original issue: rust-lang/rust#22679 by @juliusikkala

When trying to compile regex!(r"(+)"), the compiler panics.

This is the code that leads to the crash:

#![feature(plugin)]
#![plugin(regex_macros)]
extern crate regex;

fn main() {
    let re = regex!(r"(+)");
}

Valid regexes have not caused any problems.
rustc output:

error: internal compiler error: unexpected panic
note: the compiler unexpectedly panicked. this is a bug.
note: we would appreciate a bug report: https://github.com/rust-lang/rust/blob/master/CONTRIBUTING.md#bug-reports
note: run with `RUST_BACKTRACE=1` for a backtrace
thread 'rustc' panicked at 'Tried to unwrap non-AST item: Paren(0, 1, "")', /home/julius/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.15/src/parse.rs:163


Could not compile `bug`.

Meta

rustc --version --verbose:

rustc 1.0.0-nightly (522d09dfe 2015-02-19) (built 2015-02-21)
binary: rustc
commit-hash: 522d09dfecbeca1595f25ac58c6d0178bbd21d7d
commit-date: 2015-02-19
build-date: 2015-02-21
host: x86_64-unknown-linux-gnu
release: 1.0.0-nightly

I am using Arch Linux and rustc is installed from AUR (it downloads the binaries from
https://static.rust-lang.org/dist/rust-nightly-x86_64-unknown-linux-gnu.tar.gz).
Backtrace:

   1:     0x7f6461e7d210 - sys::backtrace::write::h252031bd050bf19aKlC
   2:     0x7f6461ea5ac0 - panicking::on_panic::h8a07e978260e2c7btXL
   3:     0x7f6461de6720 - rt::unwind::begin_unwind_inner::h322bcb3f35268c19RBL
   4:     0x7f6461de71f0 - rt::unwind::begin_unwind_fmt::h9448a61362cc80d2nAL
   5:     0x7f6458228fd0 - parse::BuildAst::unwrap::hb5d44cbdb15d2c10v0a
   6:     0x7f64582370e0 - parse::Parser::pop_ast::h76189779d315f940ifb
   7:     0x7f645822f010 - parse::Parser::push_repeater::h9b348b199426ad950fb
   8:     0x7f64582294e0 - parse::Parser::parse::hc4adcc34190b3a5bk3a
   9:     0x7f64582292f0 - parse::parse::hbb40bb6fc331e453H2a
  10:     0x7f6458249370 - re::Regex::new::h8c68b01aada2952bwuc
  11:     0x7f64581b8e20 - native::hf8caf7fb4b19410c2aa
  12:     0x7f64600e5dc0 - ext::base::F.TTMacroExpander::expand::h867963691209911018
  13:     0x7f645f25caa0 - ext::expand::expand_expr::closure.58776
  14:     0x7f645f25c9b0 - ext::expand::expand_expr::h4383deb4367119b5wBd
  15:     0x7f645f210eb0 - ext::expand::MacroExpander<'a, 'b>.Folder::fold_expr::h000c01cd1b3edba9zRe
  16:     0x7f645f267ff0 - fold::noop_fold_expr::closure.58825
  17:     0x7f645f2847e0 - ext::expand::expand_non_macro_stmt::closure.59045
  18:     0x7f645f2841f0 - ext::expand::expand_non_macro_stmt::closure.59043
  19:     0x7f645f283eb0 - ext::expand::expand_non_macro_stmt::h51b42cda1f47cb38sge
  20:     0x7f645f2e2920 - ext::expand::MacroExpander<'a, 'b>.Folder::fold_stmt::closure.59789
  21:     0x7f645f2b0ba0 - ext::expand::expand_block_elts::closure.59361
  22:     0x7f645f2b0820 - iter::FlatMap<I, U, F>.Iterator::next::h12199045387838857590
  23:     0x7f645f2af6b0 - vec::Vec<T>.FromIterator<T>::from_iter::h209756562474197865
  24:     0x7f645f2aee50 - ext::expand::expand_block_elts::closure.59351
  25:     0x7f645f268e20 - ext::expand::expand_block_elts::hbc921c8f419d91ab9oe
  26:     0x7f645f2aec20 - ext::expand::expand_block::h98dad051e0f0e47fpoe
  27:     0x7f645f268070 - ext::expand::MacroExpander<'a, 'b>.Folder::fold_block::hba037cd9748a0507UTe
  28:     0x7f645f268500 - ext::expand::expand_and_rename_fn_decl_and_block::h58ecb6b0c982387fnPe
  29:     0x7f645f273830 - ext::expand::expand_item_underscore::hbfa576cbe10231f502d
  30:     0x7f645f2d9620 - fold::Folder::fold_item_simple::h16347799820789526401
  31:     0x7f645f2d8ca0 - ptr::P<T>::map::h5488296407772723657
  32:     0x7f645f26bd10 - ext::expand::expand_annotatable::h81cc6e36fef572560ze
  33:     0x7f645f268fe0 - ext::expand::expand_item::h2cff22418fb19f38aZd
  34:     0x7f645f278740 - ext::expand::MacroExpander<'a, 'b>.Folder::fold_item::h8c866cfe343d78bcdSe
  35:     0x7f645f278680 - fold::noop_fold_mod::closure.58931
  36:     0x7f645f278370 - iter::FlatMap<I, U, F>.Iterator::next::h7192467770078366644
  37:     0x7f645f277d00 - vec::Vec<T>.FromIterator<T>::from_iter::h1450071030007651331
  38:     0x7f645f277b60 - fold::noop_fold_mod::h17464563872470565435
  39:     0x7f645f273830 - ext::expand::expand_item_underscore::hbfa576cbe10231f502d
  40:     0x7f645f2d9620 - fold::Folder::fold_item_simple::h16347799820789526401
  41:     0x7f645f2d8ca0 - ptr::P<T>::map::h5488296407772723657
  42:     0x7f645f26bd10 - ext::expand::expand_annotatable::h81cc6e36fef572560ze
  43:     0x7f645f268fe0 - ext::expand::expand_item::h2cff22418fb19f38aZd
  44:     0x7f645f278740 - ext::expand::MacroExpander<'a, 'b>.Folder::fold_item::h8c866cfe343d78bcdSe
  45:     0x7f645f2e4b20 - ext::expand::expand_crate::h3aa1e0ab5c8c5f8bIYe
  46:     0x7f6462535dd0 - driver::phase_2_configure_and_expand::closure.20378
  47:     0x7f64624e9c90 - driver::phase_2_configure_and_expand::hf01f6e95dfef33e33ta
  48:     0x7f64624de070 - driver::compile_input::h0c8d8120f6194473Gba
  49:     0x7f64625ad2b0 - run_compiler::heccd2f43b844857cZbc
  50:     0x7f64625abbb0 - thunk::F.Invoke<A, R>::invoke::h14556331494175157254
  51:     0x7f64625aaaa0 - rt::unwind::try::try_fn::h8181016918202193540
  52:     0x7f6461f12880 - rust_try_inner
  53:     0x7f6461f12870 - rust_try
  54:     0x7f64625aada0 - thunk::F.Invoke<A, R>::invoke::h12418455792421810631
  55:     0x7f6461e91b60 - sys::thread::thread_start::h3defdaea150d8cd693G
  56:     0x7f645bd932b0 - start_thread
  57:     0x7f6461a6f249 - __clone
  58:                0x0 - <unknown>

Mismatched types for UNICODE_CLASSES

With rustc 1.0.0-nightly (fed12499e 2015-03-03) (built 2015-03-04)

error: mismatched types:
 expected `&'static [(&'static str, &'static &'static [(char, char)])]`,
    found `&'static [(&'static str, &'static [(char, char)])]`
(expected &-ptr,
    found slice) [E0308]
C:\...\cargo\registry\src\github.com-1285ae84e5963aae\regex-0.1.16\src\parse.rs:661
                match find_class(UNICODE_CLASSES, name.as_slice()) {
                                 ~~~~~~~~~~~~~~~~

plan for 1.0 beta/stable channel?

This is mostly about regex_macros and its future. Once beta hits, regex_macros will fail to compile on anything but the nightlies until a plugin API has been stabilized. How are we going to manage that in this repository? That is, the repo would contain one crate (regex) that should work on Rust 1.0 stable (which currently has a dev-dependency on regex_macros) and another crate (regex_macros) that can only work on Rust nightlies. Is it possible for them to cohabitate?

Of secondary concern are the docs. There is a lot of language strongly suggesting use of the regex! macro. Methinks this has to be removed or changed, because teasing users with such an awesome feature is going to make them mad. :-)

Of third concern is the programming shootout. There is a regex-dna benchmark. How do we handle it? Oh, wait, errmm, it appears to have disappeared? Ah, found the commit. That's reasonable. Will it not be included in the shootout any more?

N.B. @erickt is working on rust-syntex, which I think ought to support use of the regex! macro. This may help with the docs, but regex_macros still needs to be handled.

`unicode` crate availability?

At some point, it looks like all of the character class definitions got migrated out of regex and into a separate unicode crate: http://doc.rust-lang.org/unicode/regex/ --- What are the stability plans for unicode? My guess is that things like PERLD aren't going to be included in the Rust distribution, so we'll need to plan to move forward on that.

Cannot compile regex v0.1.24

I recently added regex to a project by adding this to my Cargo.toml:

regex = "*"

This is the error output that cargo displays:

    Updating registry `https://github.com/rust-lang/crates.io-index`
   Compiling regex v0.1.24
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:341:49: 341:55 error: type `char` does not implement any method in scope named `next`
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:341                 let uregc = regc.to_uppercase().next().unwrap();
                                                                                                                                                                 ^~~~~~
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:342:51: 342:57 error: type `char` does not implement any method in scope named `next`
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:342                 let utextc = textc.to_uppercase().next().unwrap();
                                                                                                                                                                   ^~~~~~
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:561:38: 561:44 error: type `char` does not implement any method in scope named `next`
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:561         textc = textc.to_uppercase().next().unwrap();
                                                                                                                                                      ^~~~~~
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:562:38: 562:44 error: type `char` does not implement any method in scope named `next`
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:562         start = start.to_uppercase().next().unwrap();
                                                                                                                                                      ^~~~~~
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:563:34: 563:40 error: type `char` does not implement any method in scope named `next`
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:563         end = end.to_uppercase().next().unwrap();
                                                                                                                                                  ^~~~~~
error: aborting due to 5 previous errors
Could not compile `regex`.

How can I solve it?

Underscore in replacement string treated as backspace

Using a _ in the replacement string of Regex::replace leads to unexpected behaviour: the _ seems to be treated as a backspace. Either the documentation should mention this, or it is a bug.

#[test]
fn replacement_with_underscore() {
    let re = regex!(r"(.)(.)");
    let s1 = re.replace("ab","$1-$2");
    let s2 = re.replace("ab","$1_$2");
    assert_eq!("a-b", &s1);
    assert_eq!("a_b", &s2); // Fails here "a_b" != "b"
}

remember matched group name and index in regex

let re = Regex::new(r"([a-zA-Z_][a-zA-Z0-9]*)|([0-9]+)|(\.)|(=)").unwrap();

for cap in re.captures_iter("asdf.aeg = 34") {
    let mut index = 0;
    for (i, name) in cap.iter().enumerate() {
        if i == 0 {continue}
        if let Some(_) = name {index = i; break;}
    }
    println!("group {:?}, match {:?}", index, cap.at(index).unwrap());
}

Currently, the only way to get the index or name of the matched group is to iterate over the captures, which costs O(n) time in the worst case.

Please add a way to remember the name and index of the matched group, so that the lookup takes only O(1) time.

[feature request] Feature to iterate over named groups

It could be useful to be able to iterate over named groups from a regex capture.

Something like:

let re = regex!(r"(?P<id>\d{2})(?P<name>\w+)");
let caps = re.captures("12david").unwrap();
for (k, v) in caps.iter_named() {
    println!("group name: {0}, value: {1}", k, v);
}

PS: I'm new to contributing to the Rust project and to open source in general, so I'm sorry if I'm not following standard procedures.

regex consts should have the same form if possible

Issue by mdinger
Monday Feb 23, 2015 at 03:45 GMT

For earlier discussion, see rust-lang/rust#22699

This issue was labelled with: in the Rust repository


In regex, the consts have these types:

pub static PERLD: &'static &'static [(char, char)] = ...
pub static PERLS: &'static &'static [(char, char)] = ...
pub static PERLW: &'static [(char, char)] = ...

They should probably all have the same type, but for some reason they don't. This mismatch causes the issue described in rust-lang/rfcs#730.

Regexes appear to be significantly slower than re2 in some cases

Issue by brson
Thursday Sep 04, 2014 at 18:15 GMT

For earlier discussion, see rust-lang/rust#16989

This issue was labelled with: A-an-interesting-project, A-libs, I-slow in the Rust repository


My understanding is that we believe them to be competitive, but two benchmarks I've seen were not really in the ballpark. The easiest one to test is the shootout's regexdna, which you can see from the upstream shootout is drastically slower than the re2-based C++ implementation.

cc @BurntSushi

Release version doesn't compile in latest cargo (0e42b4d)

Although the development version works fine, the released version still has errors with hyphenated crate names.

$ cargo build
Unable to get packages from source

Caused by:
  failed to parse manifest at `/home/daboross/.cargo/registry/src/github.com-1ecc6299db9ec823/regex_macros-0.1.12/Cargo.toml`

Caused by:
  target names cannot contain hyphens: shootout-regex-dna

implement one-pass NFA matcher

It turns out that when a regex is deterministic, the NFA simulation can be implemented much more efficiently because it only needs to keep track of one set of capture groups (instead of a set of capture groups for every thread).

There are two components to adding this:

  • Detecting whether a regex is deterministic or not. A regex is said to be deterministic if, at any point during the execution of a regex program, at most one thread can lead to a match state.
  • Writing a specialized NFA simulation.

In terms of code, it would be nice if we could find a way to reuse code with the full NFA simulation.

This should be easier to implement than #66, and should boost the performance of a lot of regexes. (Of course, we should do both a one pass NFA and #66!)

\p{Co} does not match all Co characters

\p{Co} only matches U+E000, U+F8FF, U+F0000, U+FFFFD, U+100000, and U+10FFFD, i.e. the first and last characters in each private-use character range.

This test fails:

extern crate regex;

use regex::Regex;

fn main() {
    let re = match Regex::new(r"\p{Co}") {
        Ok(re) => re,
        Err(err) => panic!("{}", err),
    };
    assert_eq!(re.is_match("\u{e001}"), true);
}
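
The expected behaviour can be pinned down without the crate. The sketch below lists the three private-use ranges as inclusive (start, end) pairs. `endpoints_only` is a hypothetical reconstruction of a table that would produce the reported symptom, not something the issue confirms.

```rust
/// Private-use (general category Co) ranges as inclusive (start, end) pairs.
const PRIVATE_USE: &[(char, char)] = &[
    ('\u{E000}', '\u{F8FF}'),
    ('\u{F0000}', '\u{FFFFD}'),
    ('\u{100000}', '\u{10FFFD}'),
];

/// True if `c` falls inside any inclusive range of `table`.
fn in_ranges(table: &[(char, char)], c: char) -> bool {
    table.iter().any(|&(start, end)| start <= c && c <= end)
}

/// Hypothetical broken table: each endpoint becomes its own one-character
/// range, which matches only the first and last character of every range.
fn endpoints_only(table: &[(char, char)]) -> Vec<(char, char)> {
    table
        .iter()
        .flat_map(|&(start, end)| vec![(start, start), (end, end)])
        .collect()
}
```

Here `in_ranges(PRIVATE_USE, '\u{e001}')` is true, whereas the same lookup against `endpoints_only(PRIVATE_USE)` is false even though U+E000 itself is still found.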

range notation

Just a heads up, getting a message about ranges.

examples/blah.rs:124:21: 124:49 warning: use of unstable library feature 'core': will be replaced by range notation
examples/blah.rs:124             .ignore(regex!(r"^\.|^#|~$|\.swp$"));
                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
examples/blah.rs:1:1: 132:1 note: in expansion of regex!
examples/blah.rs:124:21: 124:49 note: expansion site
examples/blah.rs:124:21: 124:49 help: add #![feature(core)] to the crate attributes to silence this warning
examples/blah.rs:124             .ignore(regex!(r"^\.|^#|~$|\.swp$"));
                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
examples/blah.rs:1:1: 132:1 note: in expansion of regex!
examples/blah.rs:124:21: 124:49 note: expansion site
