rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

Home Page: https://docs.rs/regex

License: Apache License 2.0

Rust 99.02% Shell 0.13% C 0.76% RenderScript 0.08%

regex regexp regular-expressions regex-engine regex-syntax regex-parser rust dfa nfa automata

regex

This crate provides routines for searching strings for matches of a regular expression (aka "regex"). The regex syntax supported by this crate is similar to that of other regex engines, but it lacks several features for which no efficient implementation is known. This includes, but is not limited to, look-around and backreferences. In exchange, all regex searches in this crate have worst case O(m * n) time complexity, where m is proportional to the size of the regex and n is proportional to the size of the string being searched.

Documentation

Module documentation with examples. The module documentation also includes a comprehensive description of the syntax supported.

Documentation with examples for the various matching functions and iterators can be found on the Regex type.

Usage

To bring this crate into your repository, either add regex to your Cargo.toml, or run cargo add regex.

Here's a simple example that matches a date in YYYY-MM-DD format and prints the year, month and day:

use regex::Regex;

fn main() {
    let re = Regex::new(r"(?x)
(?P<year>\d{4})  # the year
-
(?P<month>\d{2}) # the month
-
(?P<day>\d{2})   # the day
").unwrap();

    let caps = re.captures("2010-03-14").unwrap();
    assert_eq!("2010", &caps["year"]);
    assert_eq!("03", &caps["month"]);
    assert_eq!("14", &caps["day"]);
}

If you have lots of dates in text that you'd like to iterate over, then it's easy to adapt the above example with an iterator:

use regex::Regex;

fn main() {
    let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
    let hay = "On 2010-03-14, foo happened. On 2014-10-14, bar happened.";

    let mut dates = vec![];
    for (_, [year, month, day]) in re.captures_iter(hay).map(|c| c.extract()) {
        dates.push((year, month, day));
    }
    assert_eq!(dates, vec![
      ("2010", "03", "14"),
      ("2014", "10", "14"),
    ]);
}

Usage: Avoid compiling the same regex in a loop

It is an anti-pattern to compile the same regular expression in a loop since compilation is typically expensive. (It takes anywhere from a few microseconds to a few milliseconds depending on the size of the regex.) Not only is compilation itself expensive, but this also prevents optimizations that reuse allocations internally to the matching engines.

In Rust, it can sometimes be a pain to pass regular expressions around if they're used from inside a helper function. Instead, we recommend using the once_cell crate to ensure that regular expressions are compiled exactly once. For example:

use {
    once_cell::sync::Lazy,
    regex::Regex,
};

fn some_helper_function(haystack: &str) -> bool {
    static RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"...").unwrap());
    RE.is_match(haystack)
}

fn main() {
    assert!(some_helper_function("abc"));
    assert!(!some_helper_function("ac"));
}

Specifically, in this example, the regex will be compiled when it is used for the first time. On subsequent uses, it will reuse the previous compilation.
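
The compile-once behavior can be observed with a std-only sketch. Here `Pattern`, `Pattern::compile` and the `COMPILATIONS` counter are hypothetical stand-ins for `Regex` and `Regex::new`, used so the example needs no external crate, and `std::sync::OnceLock` (which requires Rust 1.70 or newer) plays the role of `once_cell::sync::Lazy`. It is an illustration of the pattern, not of how this crate is implemented.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Counts how many times the (hypothetical) expensive compilation ran.
static COMPILATIONS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for `regex::Regex`: it only does substring search.
struct Pattern {
    needle: String,
}

impl Pattern {
    // Stand-in for `Regex::new`; pretend this is expensive.
    fn compile(needle: &str) -> Pattern {
        COMPILATIONS.fetch_add(1, Ordering::SeqCst);
        Pattern { needle: needle.to_string() }
    }

    fn is_match(&self, haystack: &str) -> bool {
        haystack.contains(self.needle.as_str())
    }
}

fn some_helper_function(haystack: &str) -> bool {
    // Initialized on first use; later calls reuse the stored value.
    static PAT: OnceLock<Pattern> = OnceLock::new();
    PAT.get_or_init(|| Pattern::compile("abc")).is_match(haystack)
}

fn compilations() -> usize {
    COMPILATIONS.load(Ordering::SeqCst)
}
```

However many times `some_helper_function` is called, `compilations()` stays at 1 once the first call has completed.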

Usage: match regular expressions on `&[u8]`

The main API of this crate (regex::Regex) requires the caller to pass a &str for searching. In Rust, an &str is required to be valid UTF-8, which means the main API can't be used for searching arbitrary bytes.

To match on arbitrary bytes, use the regex::bytes::Regex API. The API is identical to the main API, except that it takes an &[u8] to search on instead of an &str. The &[u8] APIs also permit disabling Unicode mode in the regex even when the pattern would match invalid UTF-8. For example, (?-u:.) is not allowed in regex::Regex but is allowed in regex::bytes::Regex since (?-u:.) matches any byte except for \n. Conversely, . will match the UTF-8 encoding of any Unicode scalar value except for \n.

This example shows how to find all null-terminated strings in a slice of bytes:

use regex::bytes::Regex;

let re = Regex::new(r"(?-u)(?<cstr>[^\x00]+)\x00").unwrap();
let text = b"foo\xFFbar\x00baz\x00";

// Extract all of the strings without the null terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(text)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(vec![&b"foo\xFFbar"[..], &b"baz"[..]], cstrs);

Notice here that the [^\x00]+ will match any byte except for NUL, including bytes like \xFF which are not valid UTF-8. When using the main API, [^\x00]+ would instead match any valid UTF-8 sequence except for NUL.
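
To see why this haystack cannot be handed to the `&str` API at all, here is a small std-only sketch. The helper names `as_str_haystack` and `first_invalid_offset` are made up for illustration; only `std::str::from_utf8` and `Utf8Error::valid_up_to` come from the standard library.

```rust
/// Try to view raw bytes as a `&str` haystack; fails on invalid UTF-8.
fn as_str_haystack(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Byte offset at which the input stops being valid UTF-8, if it does.
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    as_str_haystack(bytes).err().map(|e| e.valid_up_to())
}
```

For `b"foo\xFFbar\x00baz\x00"` the conversion fails at offset 3 (the `\xFF` byte), so only the `regex::bytes` API can search the whole slice.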

Usage: match multiple regular expressions simultaneously

This demonstrates how to use a RegexSet to match multiple (possibly overlapping) regular expressions in a single scan of the search text:

use regex::RegexSet;

let set = RegexSet::new(&[
    r"\w+",
    r"\d+",
    r"\pL+",
    r"foo",
    r"bar",
    r"barfoo",
    r"foobar",
]).unwrap();

// Iterate over and collect all of the matches.
let matches: Vec<_> = set.matches("foobar").into_iter().collect();
assert_eq!(matches, vec![0, 2, 3, 4, 6]);

// You can also test whether a particular regex matched:
let matches = set.matches("foobar");
assert!(!matches.matched(5));
assert!(matches.matched(6));

Usage: regex internals as a library

The regex-automata directory contains a crate that exposes all of the internal matching engines used by the regex crate. The idea is that the regex crate exposes a simple API for 99% of use cases, but regex-automata exposes oodles of customizable behaviors.

Documentation for regex-automata.

Usage: a regular expression parser

This repository contains a crate that provides a well tested regular expression parser, abstract syntax and a high-level intermediate representation for convenient analysis. It provides no facilities for compilation or execution. This may be useful if you're implementing your own regex engine or otherwise need to do analysis on the syntax of a regular expression. It is otherwise not recommended for general use.

Documentation for regex-syntax.

Crate features

This crate comes with several features that permit tweaking the trade off between binary size, compilation time and runtime performance. Users of this crate can selectively disable Unicode tables, or choose which of the optimizations performed by this crate to disable.

When all of these features are disabled, runtime match performance may be much worse, but if you're matching on short strings, or if high performance isn't necessary, then such a configuration is perfectly serviceable. To disable all such features, use the following Cargo.toml dependency configuration:

[dependencies.regex]
version = "1.3"
default-features = false
# Unless you have a specific reason not to, it's good sense to enable standard
# library support. It enables several optimizations and avoids spin locks. It
# also shouldn't meaningfully impact compile times or binary size.
features = ["std"]

This will reduce the dependency tree of regex down to two crates: regex-syntax and regex-automata.

The full set of features one can disable is in the "Crate features" section of the documentation.

Performance

One of the goals of this crate is for the regex engine to be "fast." While that is a somewhat nebulous goal, it is usually interpreted in one of two ways. First, it means that all searches take worst case O(m * n) time, where m is proportional to len(regex) and n is proportional to len(haystack). Second, it means that even aside from the time complexity constraint, regex searches are "fast" in practice.

While the first interpretation is pretty unambiguous, the second one remains nebulous. While nebulous, it guides this crate's architecture and the sorts of trade offs it makes. For example, here are some general architectural statements that follow as a result of the goal to be "fast":

When given the choice between faster regex searches and faster Rust compile times, this crate will generally choose faster regex searches.
When given the choice between faster regex searches and faster regex compile times, this crate will generally choose faster regex searches. That is, it is generally acceptable for Regex::new to get a little slower if it means that searches get faster. (This is a somewhat delicate balance to strike, because the speed of Regex::new needs to remain somewhat reasonable. But this is why one should avoid re-compiling the same regex over and over again.)
When given the choice between faster regex searches and simpler API design, this crate will generally choose faster regex searches. For example, if one didn't care about performance, we could likely get rid of both of the Regex::is_match and Regex::find APIs and instead just rely on Regex::captures.

There are perhaps more ways that being "fast" influences things.

While this repository used to provide its own benchmark suite, it has since been moved to rebar. The benchmarks are quite extensive, and there are many more than what is shown in rebar's README (which is just limited to a "curated" set meant to compare performance between regex engines). To run all of this crate's benchmarks, first start by cloning and installing rebar:

$ git clone https://github.com/BurntSushi/rebar
$ cd rebar
$ cargo install --path ./

Then build the benchmark harness for just this crate:

$ rebar build -e '^rust/regex$'

Run all benchmarks for this crate as tests (each benchmark is executed once to ensure it works):

$ rebar measure -e '^rust/regex$' -t

Record measurements for all benchmarks and save them to a CSV file:

$ rebar measure -e '^rust/regex$' | tee results.csv

Explore benchmark timings:

$ rebar cmp results.csv

See the rebar documentation for more details on how it works and how to compare results with other regex engines.

Hacking

The regex crate is, for the most part, a pretty thin wrapper around the meta::Regex from the regex-automata crate. Therefore, if you're looking to work on the internals of this crate, you'll likely either want to look in regex-syntax (for parsing) or regex-automata (for construction of finite automata and the search routines).

My blog on regex internals goes into more depth.

Minimum Rust version policy

This crate's minimum supported rustc version is 1.65.0.

The policy is that the minimum Rust version required to use this crate can be increased in minor version updates. For example, if regex 1.0 requires Rust 1.20.0, then regex 1.0.z for all values of z will also require Rust 1.20.0 or newer. However, regex 1.y for y > 0 may require a newer minimum version of Rust.

License

This project is licensed under either of

Apache License, Version 2.0, (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)

at your option.

The data in regex-syntax/src/unicode_tables/ is licensed under the Unicode License Agreement (LICENSE-UNICODE).

regex's Issues

[feature request] Expose the parsing/compiling internals

Note: mirrored from rust-lang/rust#18710

I was looking at implementing something similar to this - a trigram-index-aided search. I'd rather not reproduce the code necessary to parse the regex, considering it already lives in libregex. It'd be nice if the parsing/compiling was exposed for use - perhaps similar to how Go does it with their regexp and regexp/syntax packages.

HEAD rust fails to compile

Sorry for not writing a PR, but my jetlagged brain isn't working for some reason. Seems easy:

   Compiling regex v0.1.16 (file:///home/steve/src/regex)
src/parse.rs:661:26: 661:41 error: mismatched types:
 expected `&'static [(&'static str, &'static &'static [(char, char)])]`,
    found `&'static [(&'static str, &'static [(char, char)])]`
(expected &-ptr,
    found slice) [E0308]
src/parse.rs:661         match find_class(UNICODE_CLASSES, name.as_slice()) {
                                          ^~~~~~~~~~~~~~~
error: aborting due to previous error
Could not compile `regex`.

Allow escaping `#` when using the x flag

It's nothing really important, but it would be nice if one could escape a literal # so it doesn't start a comment when using the x flag.

The current workaround is using the ASCII literal \x23, but something like \# would be clearer.

Workaround example:

let markdown_headline = regex!(r"(?x)
    ^
    (?P<level>[\x23]+)  # A bunch of hash symbols
    \s
    (?P<title>.+)       # Title
    $
");

plan for 1.0 beta/stable channel?

This is mostly about regex_macros and its future. Once beta hits, regex_macros will fail to compile on anything but the nightlies until a plugin API has been stabilized. How are we going to manage that in this repository? i.e., The repo would contain one crate (regex) that should work on Rust 1.0 stable (which currently has a dev-dependency on regex_macros) and another crate (regex_macros) that can only work on Rust nightlies. Is it possible for them to cohabitate?

Of secondary concern is the docs. There is a lot of language strongly suggesting use of the regex! macro. Methinks this has to be removed or changed, because teasing users with such an awesome feature is going to make them mad. :-)

Of third concern is the programming shootout. There is a regex-dna benchmark. How do we handle it? Oh, wait, errmm, it appears to have disappeared? Ah, found the commit. That's reasonable. Will it not be included in the shootout any more?

N.B. @erickt is working on rust-syntex, which I think ought to support use of the regex! macro. This may help with the docs, but regex_macros still needs to be handled.

Add backreferences and nested subpatterns

Named and numbered backreferences such as ^(a(?1)?b)$ are common in PCRE implementations, and would be very useful here.

regex v0.1.28 : compilation error

I am trying to compile regex as part of a project and I get the following error message:

$ cargo build
    Updating registry `https://github.com/rust-lang/crates.io-index`
   Compiling regex v0.1.28
/home/guillaume/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.28/src/re.rs:16:25: 16:32 error: unresolved import `std::str::pattern::Pattern`. Could not find `pattern` in `std::str`
/home/guillaume/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.28/src/re.rs:16 use std::str::pattern::{Pattern, Searcher, SearchStep};
                                                                                                                  ^~~~~~~
/home/guillaume/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.28/src/re.rs:16:34: 16:42 error: unresolved import `std::str::pattern::Searcher`. Could not find `pattern` in `std::str`
/home/guillaume/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.28/src/re.rs:16 use std::str::pattern::{Pattern, Searcher, SearchStep};
                                                                                                                           ^~~~~~~~
/home/guillaume/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.28/src/re.rs:16:44: 16:54 error: unresolved import `std::str::pattern::SearchStep`. Could not find `pattern` in `std::str`
/home/guillaume/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.28/src/re.rs:16 use std::str::pattern::{Pattern, Searcher, SearchStep};
                                                                                                                                     ^~~~~~~~~~
error: aborting due to 3 previous errors
Could not compile `regex`.

And the version of rustc:

$ rustc --version
rustc 1.0.0-beta (9854143cb 2015-04-02) (built 2015-04-02)

And the version of cargo:

$ cargo --version
cargo 0.0.1-pre-nightly (84d6d2c 2015-03-31) (built 2015-03-31)

\p{Co} does not match all Co characters

\p{Co} only matches U+E000, U+F8FF, U+F0000, U+FFFFD, U+100000, and U+10FFFD, i.e. the first and last characters in each private-use character range.

This test fails:

extern crate regex;

use regex::Regex;

fn main() {
    let re = match Regex::new(r"\p{Co}") {
        Ok(re) => re,
        Err(err) => panic!("{}", err),
    };
    assert_eq!(re.is_match("\u{e001}"), true);
}
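
The reported symptom can be reproduced without the crate. The sketch below contrasts a range-membership test with a test that only recognizes range endpoints; the table lists the private-use ranges named in this report, and the function names are hypothetical. It illustrates the observed behavior only and makes no claim about the actual cause inside the crate.

```rust
// Private-use (Co) ranges, taken from the endpoints listed in the report.
const PRIVATE_USE: [(char, char); 3] = [
    ('\u{E000}', '\u{F8FF}'),
    ('\u{F0000}', '\u{FFFFD}'),
    ('\u{100000}', '\u{10FFFD}'),
];

/// Expected behavior: a character matches if it lies inside any range.
fn in_ranges(c: char) -> bool {
    PRIVATE_USE.iter().any(|&(lo, hi)| lo <= c && c <= hi)
}

/// Observed behavior: only the first and last character of each range match.
fn endpoints_only(c: char) -> bool {
    PRIVATE_USE.iter().any(|&(lo, hi)| c == lo || c == hi)
}
```

With these definitions, `in_ranges('\u{e001}')` is true while `endpoints_only('\u{e001}')` is false, which mirrors the failing assertion above.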

Incorrect case-insensitive matching of character ranges

Character range matching is conceptually (range_start..range_end).any(|c| c == input_char), but as an optimization is implemented as range_start <= input_char && input_char <= range_end. This is fine.

Case-insensitive matching is implemented as uppercase(c) == uppercase(input_char). This is fine (modulo #55).

So case-insensitive range matching is conceptually (range_start..range_end).any(|c| uppercase(c) == uppercase(input_char)). It is currently implemented as uppercase(range_start) <= uppercase(input_char) && uppercase(input_char) <= uppercase(range_end) which is not equivalent.

One of the tests currently passing is that (?i)\p{Lu}+ matches ΛΘΓΔα entirely. That is, greek letters (both upper case and lower case) all match the category of upper case letters when matched case-insensitively. But the same test with \p{Ll} (category of lower case letters) instead of \p{Lu} currently fails because of this issue. (\p{Lu} and \p{Ll} expand to large unions of character ranges.)
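
A small std-only sketch makes the non-equivalence concrete. The function names are hypothetical, `upper` uses only the first character of `char::to_uppercase` (as the matching code described here does), and inclusive ranges are assumed.

```rust
/// Simple case mapping: first character produced by `to_uppercase`.
fn upper(c: char) -> char {
    c.to_uppercase().next().unwrap_or(c)
}

/// Conceptual definition: some character in the range equals the input
/// after both are uppercased.
fn conceptual(start: char, end: char, input: char) -> bool {
    (start..=end).any(|c| upper(c) == upper(input))
}

/// The shortcut described above: uppercase only the endpoints and the input.
fn endpoint_shortcut(start: char, end: char, input: char) -> bool {
    upper(start) <= upper(input) && upper(input) <= upper(end)
}
```

For the range `Y` to `b` and the input `A`, `conceptual` returns true (because `a` is in the range), while `endpoint_shortcut` returns false: uppercasing the endpoints gives `Y` and `B`, which no longer form a valid range.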

implement better code generation for the regex plugin

Issue by comex
Thursday May 08, 2014 at 05:23 GMT

For earlier discussion, see rust-lang/rust#14029

This issue was labelled with: A-libs in the Rust repository

Consider this code:

#![feature(phase)]
extern crate regex;
#[phase(syntax)]
extern crate regex_macros;

pub fn is_all_a(s: &str) -> bool {
    return regex!("^a+$").is_match(s);
}

Ideally this would optimize away to a small function that just iterates over the string and checks for characters other than 'a'.

Instead, it:

calls malloc several times to start out;
goes through an indirect call to the 'exec' function unless LTO is enabled (this might not usually be a big deal, but I would like to eventually be able to efficiently match a regex on a single character in lieu of writing out all the possibilities manually);
the 'exec' function itself, even with LTO (and -O) enabled, makes many non-inlined calls, including to malloc, char_range_at, char_range_at_reverse, etc.

Without LTO, it generates about 7kb of code for one regex, or 34kb if I put 8 regexes in that function. Not the end of the world, but it adds up.

I recognize the regex implementation is new, but I thought this was worth filing anyway as room for improvement.

rustc 0.11-pre-nightly (2dcbad5 2014-05-06 22:01:43 -0700)

make case insensitive comparisons better

We should switch from equality comparison on the first character returned by {upper,lower}case iterators to "does this character match any character in the {upper,lower}case iterator."

Regex documentation on \d possibly misleading

Not sure if it's just because I don't know enough about character classes, but when I read:

\d          Perl character class ([0-9])
\D          Negated Perl character class ([^0-9])

(Source: http://doc.rust-lang.org/regex/regex/index.html#matching-one-character)

I thought this meant that \d would only match the ASCII characters 0 through 9, but in fact it'll match other Unicode digits too:

#![feature(plugin)]
#![plugin(regex_macros)]
extern crate regex;

fn main() {
    let re = regex!(r"\d");
    println!("{}", re.is_match("٩")); // prints "true"
}

Maybe the wording should be changed?

remember matched group name and index in regex

let re = Regex::new(r"([a-zA-Z_][a-zA-Z0-9]*)|([0-9]+)|(\.)|(=)").unwrap();

for cap in re.captures_iter("asdf.aeg = 34") {
    let mut index = 0;
    for (i, name) in cap.iter().enumerate() {
        if i == 0 {continue}
        if let Some(_) = name {index = i; break;}
    }
    println!("group {:?}, match {:?}", index, cap.at(index).unwrap());
}

Currently, the only way to find out which group matched (by index or by name) is to iterate over the captures, which costs O(n) time in the worst case.

Please add a way to remember the name and index of the matched group, so that this lookup takes only O(1) time.

[feature request] Feature to iterate over named groups

It could be useful to be able to iterate over named groups from a regex capture.

Something like:

let re = regex!(r"(?P<id>\d{2})(?P<name>\w+)");
let caps = re.captures("12david").unwrap();
for (k, v) in caps.iter_named() {
    println!("group name: {0}, value: {1}", k, v);
}

PS: I'm new to contributing to the Rust project and to open source in general, so I'm sorry if I'm not following standard procedures.

Underscore in replacement string treated as backspace

Using a _ in the replacement string of Regex::replace leads to unexpected behaviour. The _ seems to be treated as a backspace. Either the documentation should mention this, or this is a bug.

#[test]
fn replacement_with_underscore() {
    let re = regex!(r"(.)(.)");
    let s1 = re.replace("ab","$1-$2");
    let s2 = re.replace("ab","$1_$2");
    assert_eq!("a-b", &s1);
    assert_eq!("a_b", &s2); // Fails here "a_b" != "b"
}
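
A plausible explanation, based on the replacement syntax documented for current versions of the crate, is that the longest possible group name is taken after `$`, so `$1_` refers to a (nonexistent) group named `1_` rather than to group 1 followed by an underscore. Whether the 2015 code behaved exactly this way is an assumption. The std-only sketch below uses a hypothetical `expand` function to model that rule: names are runs of ASCII letters, digits and `_`, braces delimit a name explicitly, and unknown groups expand to the empty string. Named groups and `$$` escaping are not modeled.

```rust
/// Expand `$name` and `${name}` references in `template`.
/// `groups[i]` holds the text of capture group `i`; any name that is not a
/// valid index into `groups` expands to the empty string.
fn expand(template: &str, groups: &[&str]) -> String {
    let is_name_char = |c: char| c.is_ascii_alphanumeric() || c == '_';
    let mut out = String::new();
    let mut rest = template;
    while let Some(pos) = rest.find('$') {
        out.push_str(&rest[..pos]);
        rest = &rest[pos + 1..];
        let (name, after) = match rest.strip_prefix('{') {
            // `${name}`: the name ends at the closing brace.
            Some(braced) => match braced.find('}') {
                Some(end) => (&braced[..end], &braced[end + 1..]),
                None => ("", rest),
            },
            // `$name`: take the longest run of name characters.
            None => {
                let len = rest.find(|c: char| !is_name_char(c)).unwrap_or(rest.len());
                rest.split_at(len)
            }
        };
        if name.is_empty() {
            // No usable name: keep the `$` as literal text.
            out.push('$');
        } else if let Some(text) = name.parse::<usize>().ok().and_then(|i| groups.get(i)) {
            out.push_str(text);
        }
        rest = after;
    }
    out.push_str(rest);
    out
}
```

With groups `["ab", "a", "b"]`, the template `$1-$2` yields `a-b`, `$1_$2` yields `b` (matching the failure above), and `${1}_$2` yields `a_b`.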

Extra characters in capture groups

Using the following with rustc 1.0.0-nightly (d8be84eb4 2015-03-29) (built 2015-03-29), regex 0.1.24, and regex_macros 0.1.14:

#![feature(collections)]
#![feature(plugin)]
#![plugin(regex_macros)]
extern crate regex;

fn main() {
    let user_input = String::from_str("0,2,3-5 15\n");
    let re = regex!(r"(?i)\s*([0-9]+(?:-[0-9]+)?)(?:\s*|,)");

    for cap in re.captures_iter(&user_input) {
        println!("Captured: {:?}", cap.at(0).unwrap_or(""));
    }
}

I get the expected output on every println!, except for the last two. Neither the whitespace, nor the newline should be included in the capture group.

% ./test
Captured: "0"
Captured: "2"
Captured: "3-5 "
Captured: "15\n"

Switching the \s* in the regex! to \b matches as I would expect:

    let re = regex!(r"(?i)\b([0-9]+(?:-[0-9]+)?)(?:\b|,)");

% ./test2
Captured: "0"
Captured: "2"
Captured: "3-5"
Captured: "15"

It looks like the \s in the non-capture group immediately following the capture group is leaking in?

unexpected panic with (invalid) regex "(+)"

Original issue: rust-lang/rust#22679 by @juliusikkala

When trying to compile regex!(r"(+)"), the compiler panics.

This is the code that leads to the crash:

#![feature(plugin)]
#![plugin(regex_macros)]
extern crate regex;

fn main() {
    let re = regex!(r"(+)");
}

Valid regexes have not caused any problems.
rustc output:

error: internal compiler error: unexpected panic
note: the compiler unexpectedly panicked. this is a bug.
note: we would appreciate a bug report: https://github.com/rust-lang/rust/blob/master/CONTRIBUTING.md#bug-reports
note: run with `RUST_BACKTRACE=1` for a backtrace
thread 'rustc' panicked at 'Tried to unwrap non-AST item: Paren(0, 1, "")', /home/julius/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.15/src/parse.rs:163


Could not compile `bug`.

Meta

rustc --version --verbose:

rustc 1.0.0-nightly (522d09dfe 2015-02-19) (built 2015-02-21)
binary: rustc
commit-hash: 522d09dfecbeca1595f25ac58c6d0178bbd21d7d
commit-date: 2015-02-19
build-date: 2015-02-21
host: x86_64-unknown-linux-gnu
release: 1.0.0-nightly

I am using Arch Linux and rustc is installed from AUR (it downloads the binaries from
https://static.rust-lang.org/dist/rust-nightly-x86_64-unknown-linux-gnu.tar.gz).
Backtrace:

   1:     0x7f6461e7d210 - sys::backtrace::write::h252031bd050bf19aKlC
   2:     0x7f6461ea5ac0 - panicking::on_panic::h8a07e978260e2c7btXL
   3:     0x7f6461de6720 - rt::unwind::begin_unwind_inner::h322bcb3f35268c19RBL
   4:     0x7f6461de71f0 - rt::unwind::begin_unwind_fmt::h9448a61362cc80d2nAL
   5:     0x7f6458228fd0 - parse::BuildAst::unwrap::hb5d44cbdb15d2c10v0a
   6:     0x7f64582370e0 - parse::Parser::pop_ast::h76189779d315f940ifb
   7:     0x7f645822f010 - parse::Parser::push_repeater::h9b348b199426ad950fb
   8:     0x7f64582294e0 - parse::Parser::parse::hc4adcc34190b3a5bk3a
   9:     0x7f64582292f0 - parse::parse::hbb40bb6fc331e453H2a
  10:     0x7f6458249370 - re::Regex::new::h8c68b01aada2952bwuc
  11:     0x7f64581b8e20 - native::hf8caf7fb4b19410c2aa
  12:     0x7f64600e5dc0 - ext::base::F.TTMacroExpander::expand::h867963691209911018
  13:     0x7f645f25caa0 - ext::expand::expand_expr::closure.58776
  14:     0x7f645f25c9b0 - ext::expand::expand_expr::h4383deb4367119b5wBd
  15:     0x7f645f210eb0 - ext::expand::MacroExpander<'a, 'b>.Folder::fold_expr::h000c01cd1b3edba9zRe
  16:     0x7f645f267ff0 - fold::noop_fold_expr::closure.58825
  17:     0x7f645f2847e0 - ext::expand::expand_non_macro_stmt::closure.59045
  18:     0x7f645f2841f0 - ext::expand::expand_non_macro_stmt::closure.59043
  19:     0x7f645f283eb0 - ext::expand::expand_non_macro_stmt::h51b42cda1f47cb38sge
  20:     0x7f645f2e2920 - ext::expand::MacroExpander<'a, 'b>.Folder::fold_stmt::closure.59789
  21:     0x7f645f2b0ba0 - ext::expand::expand_block_elts::closure.59361
  22:     0x7f645f2b0820 - iter::FlatMap<I, U, F>.Iterator::next::h12199045387838857590
  23:     0x7f645f2af6b0 - vec::Vec<T>.FromIterator<T>::from_iter::h209756562474197865
  24:     0x7f645f2aee50 - ext::expand::expand_block_elts::closure.59351
  25:     0x7f645f268e20 - ext::expand::expand_block_elts::hbc921c8f419d91ab9oe
  26:     0x7f645f2aec20 - ext::expand::expand_block::h98dad051e0f0e47fpoe
  27:     0x7f645f268070 - ext::expand::MacroExpander<'a, 'b>.Folder::fold_block::hba037cd9748a0507UTe
  28:     0x7f645f268500 - ext::expand::expand_and_rename_fn_decl_and_block::h58ecb6b0c982387fnPe
  29:     0x7f645f273830 - ext::expand::expand_item_underscore::hbfa576cbe10231f502d
  30:     0x7f645f2d9620 - fold::Folder::fold_item_simple::h16347799820789526401
  31:     0x7f645f2d8ca0 - ptr::P<T>::map::h5488296407772723657
  32:     0x7f645f26bd10 - ext::expand::expand_annotatable::h81cc6e36fef572560ze
  33:     0x7f645f268fe0 - ext::expand::expand_item::h2cff22418fb19f38aZd
  34:     0x7f645f278740 - ext::expand::MacroExpander<'a, 'b>.Folder::fold_item::h8c866cfe343d78bcdSe
  35:     0x7f645f278680 - fold::noop_fold_mod::closure.58931
  36:     0x7f645f278370 - iter::FlatMap<I, U, F>.Iterator::next::h7192467770078366644
  37:     0x7f645f277d00 - vec::Vec<T>.FromIterator<T>::from_iter::h1450071030007651331
  38:     0x7f645f277b60 - fold::noop_fold_mod::h17464563872470565435
  39:     0x7f645f273830 - ext::expand::expand_item_underscore::hbfa576cbe10231f502d
  40:     0x7f645f2d9620 - fold::Folder::fold_item_simple::h16347799820789526401
  41:     0x7f645f2d8ca0 - ptr::P<T>::map::h5488296407772723657
  42:     0x7f645f26bd10 - ext::expand::expand_annotatable::h81cc6e36fef572560ze
  43:     0x7f645f268fe0 - ext::expand::expand_item::h2cff22418fb19f38aZd
  44:     0x7f645f278740 - ext::expand::MacroExpander<'a, 'b>.Folder::fold_item::h8c866cfe343d78bcdSe
  45:     0x7f645f2e4b20 - ext::expand::expand_crate::h3aa1e0ab5c8c5f8bIYe
  46:     0x7f6462535dd0 - driver::phase_2_configure_and_expand::closure.20378
  47:     0x7f64624e9c90 - driver::phase_2_configure_and_expand::hf01f6e95dfef33e33ta
  48:     0x7f64624de070 - driver::compile_input::h0c8d8120f6194473Gba
  49:     0x7f64625ad2b0 - run_compiler::heccd2f43b844857cZbc
  50:     0x7f64625abbb0 - thunk::F.Invoke<A, R>::invoke::h14556331494175157254
  51:     0x7f64625aaaa0 - rt::unwind::try::try_fn::h8181016918202193540
  52:     0x7f6461f12880 - rust_try_inner
  53:     0x7f6461f12870 - rust_try
  54:     0x7f64625aada0 - thunk::F.Invoke<A, R>::invoke::h12418455792421810631
  55:     0x7f6461e91b60 - sys::thread::thread_start::h3defdaea150d8cd693G
  56:     0x7f645bd932b0 - start_thread
  57:     0x7f6461a6f249 - __clone
  58:                0x0 - <unknown>

implement matching on byte strings

Currently, as far as I can see, regexes here are only for Unicode text. But they can be used to parse binary files as well (or to parse a mixture of binary and text).

For example, how can one implement strings --all using regexes in Rust?

Mismatched types for UNICODE_CLASSES

With rustc 1.0.0-nightly (fed12499e 2015-03-03) (built 2015-03-04)

error: mismatched types:
 expected `&'static [(&'static str, &'static &'static [(char, char)])]`,
    found `&'static [(&'static str, &'static [(char, char)])]`
(expected &-ptr,
    found slice) [E0308]
C:\...\cargo\registry\src\github.com-1285ae84e5963aae\regex-0.1.16\src\parse.rs:661
                match find_class(UNICODE_CLASSES, name.as_slice()) {
                                 ~~~~~~~~~~~~~~~~

Very large copies in compiled regexes.

There is a ::std::mem::swap(&mut clist, &mut nlist); call generated when using regex!.
The two variables have the type Threads which can get very large, in this particular case, I estimate they're each 22kB.

According to valgrind, each exec call ends up with 200 memcpy's - this makes regex! five times slower than the dynamic regex, for me.
Adding a single line recovers performance of the generated code, even if it's only marginally better than the dynamic regex:

let (mut clist, mut nlist) = (&mut clist, &mut nlist);

No other changes are necessary because of deref coercions (as weird as &mut &mut T -> &mut T may be).
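
The effect of that one-line change can be sketched in isolation. The `Threads` struct below is a hypothetical stand-in sized to the roughly 22 kB estimate above, not the type generated by regex!; the point is only that `mem::swap` on two `&mut Threads` exchanges pointers rather than the structs themselves.

```rust
use std::mem;

/// Stand-in for the large `Threads` type from the report (about 22 kB).
struct Threads {
    slots: [u8; 22 * 1024],
}

impl Threads {
    /// Build a list whose first slot carries a tag so it can be identified.
    fn tagged(tag: u8) -> Threads {
        let mut slots = [0u8; 22 * 1024];
        slots[0] = tag;
        Threads { slots }
    }
}

/// Alternate the "current" and "next" lists `steps` times by swapping only
/// references, then report the tag of the list that ended up current.
fn current_after(steps: usize) -> u8 {
    let mut clist = Threads::tagged(1);
    let mut nlist = Threads::tagged(2);
    // The one-line fix from the report: rebind both names as `&mut Threads`.
    let (mut clist, mut nlist) = (&mut clist, &mut nlist);
    for _ in 0..steps {
        // Swaps two pointers, not two 22 kB structs.
        mem::swap(&mut clist, &mut nlist);
    }
    clist.slots[0]
}

/// Bytes moved per swapped operand: by value versus by reference.
fn swapped_bytes() -> (usize, usize) {
    (mem::size_of::<Threads>(), mem::size_of::<&mut Threads>())
}
```

Each swap by value exchanges two 22528-byte structs, while each swap by reference exchanges two pointer-sized values.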

Cannot compile rust v0.1.24

I recently added regex to a project by adding this to my Cargo.toml:

regex = "*"

This is the error output cargo displays:

    Updating registry `https://github.com/rust-lang/crates.io-index`
   Compiling regex v0.1.24
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:341:49: 341:55 error: type `char` does not implement any method in scope named `next`
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:341                 let uregc = regc.to_uppercase().next().unwrap();
                                                                                                                                                                 ^~~~~~
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:342:51: 342:57 error: type `char` does not implement any method in scope named `next`
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:342                 let utextc = textc.to_uppercase().next().unwrap();
                                                                                                                                                                   ^~~~~~
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:561:38: 561:44 error: type `char` does not implement any method in scope named `next`
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:561         textc = textc.to_uppercase().next().unwrap();
                                                                                                                                                      ^~~~~~
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:562:38: 562:44 error: type `char` does not implement any method in scope named `next`
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:562         start = start.to_uppercase().next().unwrap();
                                                                                                                                                      ^~~~~~
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:563:34: 563:40 error: type `char` does not implement any method in scope named `next`
/nfs/zfs-student-2/users/2013/gbersac/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.24/src/vm.rs:563         end = end.to_uppercase().next().unwrap();
                                                                                                                                                  ^~~~~~
error: aborting due to 5 previous errors
Could not compile `regex`.

How can I solve it?

impose a limit on the size of a compiled regex program

From the docs:

Currently, there are no counter-measures in place to prevent a malicious user from writing an expression that may use a lot of resources. One such example is to repeat counted repetitions: ((a{100}){100}){100} will try to repeat the a instruction 100^3 times. Essentially, this means it's very easy for an attacker to exhaust your system's memory if they are allowed to execute arbitrary regular expressions. A possible solution to this is to impose a hard limit on the size of a compiled expression, but it does not yet exist.

The conclusion of this is that regexes specified by a user cannot be blindly trusted, since they can trivially exhaust all memory on your system. We can fix this by imposing some limit on the size of a regex program. (In fact, this probably has to be a limit on the size of a regex AST, which will need to be checked during construction.)
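
As a rough sketch of such a check, the following estimates the compiled size from a toy AST before anything is allocated. The `Ast` enum, `checked_size`, and `nested_repeat` are hypothetical names for illustration, not the crate's real representation, and the cost model (one instruction per literal) is an assumption.

```rust
/// A toy regex AST (hypothetical; not the crate's real representation).
enum Ast {
    Literal(char),
    Concat(Vec<Ast>),
    /// Counted repetition such as `x{100}`.
    Repeat(Box<Ast>, u64),
}

/// Estimates how many instructions `ast` compiles to. Fails once the
/// estimate for any subexpression exceeds `limit`.
fn checked_size(ast: &Ast, limit: u64) -> Result<u64, String> {
    let size = match ast {
        Ast::Literal(_) => 1,
        Ast::Concat(parts) => {
            let mut total: u64 = 0;
            for part in parts {
                // Saturating arithmetic avoids overflow on hostile input.
                total = total.saturating_add(checked_size(part, limit)?);
            }
            total
        }
        Ast::Repeat(inner, count) => checked_size(inner, limit)?.saturating_mul(*count),
    };
    if size > limit {
        Err(format!("regex too big: {} instructions exceeds limit {}", size, limit))
    } else {
        Ok(size)
    }
}

/// Builds the AST for `((a{100}){100}){100}`.
fn nested_repeat() -> Ast {
    let a100 = Ast::Repeat(Box::new(Ast::Literal('a')), 100);
    let a100_100 = Ast::Repeat(Box::new(a100), 100);
    Ast::Repeat(Box::new(a100_100), 100)
}
```

With a limit of 10,000 the nested repetition is rejected at the outermost level, without ever materializing the 100^3 instructions.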

Cannot build with rustc 1.0.0 (a59de37e9 2015-05-13) (built 2015-05-14)

Let's say we have the following Cargo.toml:

[package]

name = "simproc"
version = "0.1.0"
authors = ["Alvaro Polo <[email protected]>"]

[dependencies]
docopt = "*"
regex = "*"
regex_macros = "*"
rustc-serialize = "*"

[[bin]]
name = "spasm"

If we attempt to build with cargo build we obtain:

   Compiling regex v0.1.38
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1093:5: 1093:43 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1093     type Searcher = RegexSearcher<'r, 't>;
                                                                                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1095:5: 1101:6 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1095     fn into_searcher(self, haystack: &'t str) -> RegexSearcher<'r, 't> {
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1096         RegexSearcher {
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1097             it: self.find_iter(haystack),
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1098             last_step_end: 0,
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1099             next_match: None,
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1100         }
                                                                                            ...
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1092:14: 1092:25 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1092 impl<'r, 't> Pattern<'t> for &'r Regex {
                                                                                                         ^~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1106:5: 1109:6 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1106     #[inline]
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1107     fn haystack(&self) -> &'t str {
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1108         self.it.search
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1109     }
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1111:5: 1140:6 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1111     #[inline]
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1112     fn next(&mut self) -> SearchStep {
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1113         if let Some((s, e)) = self.next_match {
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1114             self.next_match = None;
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1115             self.last_step_end = e;
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1116             return SearchStep::Match(s, e);
                                                                                            ...
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1105:21: 1105:33 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1105 unsafe impl<'r, 't> Searcher<'t> for RegexSearcher<'r, 't> {
                                                                                                                ^~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1112:27: 1112:37 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1112     fn next(&mut self) -> SearchStep {
                                                                                                                      ^~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1116:20: 1116:37 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1116             return SearchStep::Match(s, e);
                                                                                                               ^~~~~~~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1120:46: 1120:56 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1120                 if self.last_step_end < self.haystack().len() {
                                                                                                                                         ^~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1122:47: 1122:57 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1122                     self.last_step_end = self.haystack().len();
                                                                                                                                          ^~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1123:51: 1123:61 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1123                     SearchStep::Reject(last, self.haystack().len())
                                                                                                                                              ^~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1123:21: 1123:39 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1123                     SearchStep::Reject(last, self.haystack().len())
                                                                                                                ^~~~~~~~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1125:21: 1125:37 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1125                     SearchStep::Done
                                                                                                                ^~~~~~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1131:21: 1131:38 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1131                     SearchStep::Match(s, e)
                                                                                                                ^~~~~~~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1136:21: 1136:39 error: use of unstable library feature 'core'
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/re.rs:1136                     SearchStep::Reject(last, s)
                                                                                                                ^~~~~~~~~~~~~~~~~~
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/lib.rs:394:34: 394:51 error: unstable feature
/Users/apoloval/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.38/src/lib.rs:394 #![cfg_attr(feature = "pattern", feature(pattern))]
                                                                                                                             ^~~~~~~~~~~~~~~~~
note: this feature may not be used in the stable release channel
error: aborting due to 16 previous errors
Could not compile `regex`.

To learn more, run the command again with --verbose.

`unicode` crate availability?

At some point, it looks like all of the character class definitions got migrated out of regex and into a separate unicode crate: http://doc.rust-lang.org/unicode/regex/ --- What are the stability plans for unicode? My guess is that things like PERLD aren't going to be included in the Rust distribution, so we'll need to plan to move forward on that.

optimize literal alternations

From https://lwn.net/Articles/589009/

The handling of more complex alternations is a known (relatively) weak point of jrep (more precisely of rejit) in need of improvement. Grep uses a smart Boyer-Moore algorithm. To look for aaa|bbb|ccc at position p, it looks up the character at p + 2, and if it is not a, b, or c, knows it can jump three characters ahead to p + 3 (and then look at the character at p + 5).

On the other hand, like for single strings, rejit handles alternations simply: it applies brute force. But it does so relatively efficiently, so the performance is still good. To search for aaa|bbb|ccc at some position p in the text, rejit performs operations like:
    loop:
      find 'aaa' at position p
      if found goto match
      find 'bbb' at position p
      if found goto match
      find 'ccc' at position p
      if found goto match
      increment position and goto loop
    match:
The complexity is proportional to the number of alternations. Worse, when the number of alternated expressions exceeds a threshold (i.e. when the compiler cannot allocate a register per alternated expression), rejit falls back to some slow default code. This is what happens for the two regexps with eight or more alternated strings. The code generation should be fixed to allow an arbitrary number of alternated strings.

In other words, this is a way to bypass the regex machinery and degrade to a simple substring search.

In addition to being a common case to optimize, it should also give a small bump to the regex-dna benchmark because one of the regexes is just an alternation of literals: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=rust&id=1. The rest contain character classes, which complicates things somewhat.

The easy part of this optimization is the actual searching of literal strings and jumping ahead in the input (there is precedent for this already in the code with literal prefixes). The harder part, I think, is analyzing the regex to find where the optimization can be applied. The issue is that an alternation is compiled to a series of split and jump instructions. It is easiest to discover the opportunity to optimize by analyzing the AST of the regex---but there will need to be a way to carry that information through to the VM.

One approach might be to tag pieces of the syntax with possible optimizations (this is hopefully the first of many). Then when the AST is compiled to instructions, that information can be stored and indexed by the current program counter. The VM can then ask, "Do there exist any optimizations for this PC?" The rest is gravy.

N.B. This only works for a regex that is of the form a|b|c|.... It might be possible to generalize this to other cases, but it seems tricky.
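
The skip-ahead idea from the quoted article can be sketched with a Horspool-style shift table shared by all literals. This is only an illustration of the technique under simplifying assumptions, not how grep, rejit, or this crate implement it; `find_any` is a hypothetical name. The window is as wide as the shortest literal, and the byte at the end of the window decides how far the search may jump.

```rust
/// Finds the leftmost occurrence of any literal in `hay`.
/// Returns (byte offset, index of the literal that matched).
/// Returns None if `literals` is empty or contains an empty string.
fn find_any(hay: &str, literals: &[&str]) -> Option<(usize, usize)> {
    let hay = hay.as_bytes();
    // The window is as wide as the shortest literal.
    let m = literals.iter().map(|l| l.len()).min()?;
    if m == 0 {
        return None;
    }
    // shift[b]: how far the window may move when its last byte is `b`.
    // A byte that occurs in no literal allows a jump of the full width.
    let mut shift = [m; 256];
    for lit in literals {
        let bytes = lit.as_bytes();
        for i in 0..m - 1 {
            let dist = m - 1 - i;
            let slot = &mut shift[bytes[i] as usize];
            if dist < *slot {
                *slot = dist;
            }
        }
    }
    let mut p = 0;
    while p + m <= hay.len() {
        // Try the literals in order, like the branches of `a|b|c`.
        for (idx, lit) in literals.iter().enumerate() {
            if hay[p..].starts_with(lit.as_bytes()) {
                return Some((p, idx));
            }
        }
        p += shift[hay[p + m - 1] as usize];
    }
    None
}
```

For `aaa|bbb|ccc`, any window ending in a byte other than a, b, or c is skipped three positions at a time, which matches the behavior described for grep above.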

Regex macros documentation should specify #[no_link]

Currently, the regex documentation shows this for using regex_macros:

#![feature(plugin)]
#[plugin]
// should have #[no_link] here, but doesn't
extern crate regex_macros;

This is slightly incorrect, as not specifying #[no_link] on extern crate regex_macros causes a runtime error on systems without rustc installed.

See rust-lang/rust#20769.

/(*)/ gives `Tried to unwrap non-AST item: Paren(0, 1, "")`

fn main() {
    regex::Regex::new("(*)");
}

panics

thread '<main>' panicked at 'Tried to unwrap non-AST item: Paren(0, 1, "")', /home/huon/.multirust/toolchains/nightly-2015-03-01/cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/parse.rs:163

(Something similar happens with (?:?) etc.)

implement one-pass NFA matcher

It turns out that when a regex is deterministic, the NFA simulation can be implemented much more efficiently because it only needs to keep track of one set of capture groups (instead of a set of capture groups for every thread).

There are two components to adding this:

Detecting whether a regex is deterministic or not. A regex is said to be deterministic if, at any point during the execution of a regex program, at most one thread can lead to a match state.
Writing a specialized NFA simulation.

In terms of code, it would be nice if we could find a way to reuse code with the full NFA simulation.

This should be easier to implement than #66, and should boost the performance of a lot of regexes. (Of course, we should do both a one pass NFA and #66!)

Regexes appear to be significantly slower than re2 in some cases

Issue by brson
Thursday Sep 04, 2014 at 18:15 GMT

For earlier discussion, see rust-lang/rust#16989

This issue was labelled with: A-an-interesting-project, A-libs, I-slow in the Rust repository

My understanding is that we believe them to be competitive, but two benchmarks I've seen were not really in the ballpark. The easiest one to test is the shootout's regexdna, which you can see from the upstream shootout is drastically slower than the re2-based C++ implementation.

cc @BurntSushi

[feature request] Expose libregex's parsing/compiling internals

Issue by andrew-d
Thursday Nov 06, 2014 at 19:54 GMT

For earlier discussion, see rust-lang/rust#18710

This issue was labelled with: in the Rust repository

I was looking at implementing something similar to this - a trigram-index-aided search. I'd rather not reproduce the code necessary to parse the regex, considering it already lives in libregex. It'd be nice if the parsing/compiling was exposed for use - perhaps similar to how Go does it with their regexp and regexp/syntax packages.

cc @BurntSushi

Regex on text streams

Issue by suhr
Wednesday May 07, 2014 at 14:18 GMT

For earlier discussion, see rust-lang/rust#14015

This issue was labelled with: A-libs in the Rust repository

Regex library defines methods for find/replace on strings, but what about text streams?

Regex docs don't point people to crates.io dependency

Reading http://doc.rust-lang.org/regex/regex/index.html and http://doc.rust-lang.org/regex/regex/enum.Regex.html there is no hint to the user that they should be updating Cargo.toml with something like:

[dependencies]
regex = "*"
regex_macros = "*"

The error doesn't help either:

   Compiling example v0.0.1 (file:///path/to/projects/example)
src/main.rs:5:1: 5:27 error: can't find crate for `regex_macros`
src/main.rs:5 extern crate regex_macros;

support Unicode grapheme clusters

The regex engine doesn't consider characters (graphemes) that consist of multiple code points correctly.

For example, the letter 'ä' has two representations that should both be matched by the regex ., however only the latter is.

Bash                 | Rust       | Codepoints
echo $'\x61\xcc\x88' | "a\u{308}" | U+0061 U+0308
echo $'\xc3\xa4'     | "\u{e4}"   | U+00e4
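
The difference between the two representations can be checked with the standard library alone. The helper below (a hypothetical name, `code_points`) lists the code points of a string. The precomposed form is a single `char`, while the decomposed form is two, which is why a `.` that matches one code point covers only part of the decomposed grapheme.

```rust
/// Returns the Unicode scalar values of `s`, one entry per `char`.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}
```

Here `code_points("\u{e4}")` yields one value (U+00E4) and `code_points("a\u{308}")` yields two (U+0061, U+0308).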

do simple literal prefix scanning in regex!

Hi,

I saw the news that regex got refactored and optimized and decided to check my old benchmark. I was very surprised that it now takes twice as long to run!

How to reproduce (using multirust for versions as older regex doesn't compile with newer nightly Rust):

git clone https://github.com/mkpankov/parse-rust.git
cd parse-rust
multirust override nightly-2015-06-24
git checkout 4076c404caf1560a466e9f0799817035089fe841
cargo build --release
time zcat mp3-logs-with-fake-ips.log.gz | ./target/release/parse-rust
// outputs around 4s on my machine
multirust override nightly-2015-05-25
git checkout e33d410291fa7f134eef628b5591d605cd68b218
cargo clean
cargo build --release
time zcat mp3-logs-with-fake-ips.log.gz | ./target/release/parse-rust
// outputs around 2s on my machine

I'm sorry I can't pinpoint it more accurately (maybe it's Rust changes, not regex), but the recent major changes to regex might be the cause. A twofold slowdown is severe in my opinion, and needs action.

regex versions:

new, degraded:

 "regex 0.1.38 (registry+https://github.com/rust-lang/crates.io-index)",
 "regex_macros 0.1.20 (registry+https://github.com/rust-lang/crates.io-index)",

old, fast:

 "regex 0.1.30 (registry+https://github.com/rust-lang/crates.io-index)",
 "regex_macros 0.1.18 (registry+https://github.com/rust-lang/crates.io-index)",

Some background: back when I did this I compared Rust version to C++ version (doing almost stupid translation) and Rust beat C++ by about 40% w/o using compile-time regex. This kind of degradation puts it back behind C++ 😞

regex consts should have the same form if possible

Issue by mdinger
Monday Feb 23, 2015 at 03:45 GMT

For earlier discussion, see rust-lang/rust#22699

This issue was labelled with: in the Rust repository

In regex, the consts have these types:

pub static PERLD: &'static &'static [(char, char)] = ...
pub static PERLS: &'static &'static [(char, char)] = ...
pub static PERLW: &'static [(char, char)] = ...

They should probably be the same type but aren't for some reason. They cause this rust-lang/rfcs#730 issue.

crates.io seems to be out of sync for regex_macros 0.1.1

I tried to compile a project with "mysql" and got the following:

~/.cargo/registry/src/github.com-1ecc6299db9ec823/regex_macros-0.1.1/src/lib.rs:604 fn vec_expr<T, It: Iterator<T>>(&self, xs: It,

Looking at the actual code of 0.1.1, the line was updated. Is there a version bump missing?

optimize use of `^`/`\A`

At the moment, the VM always scans the entire input unless one of the two following things happens:

The regex has a literal prefix and that prefix cannot be found anywhere in the input. (This always scans the entire input.)
A match is found. This may not scan the entire input, but it requires a match.

Ideally, given a regex ^{something}..., it could terminate the search early if {something} doesn't match at the start of the input.
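
A minimal sketch of that early exit follows. It assumes the regex is not in multi-line mode, so a leading `^` can only match at offset 0. `search` and the `match_at` callback are hypothetical stand-ins for the VM's outer loop and a single anchored attempt.

```rust
/// Searches `hay` for a match. `match_at(hay, start)` is a hypothetical
/// stand-in for the VM: it returns the end offset of a match that begins
/// exactly at `start`, or None.
fn search<F>(hay: &str, anchored_start: bool, match_at: F) -> Option<(usize, usize)>
where
    F: Fn(&str, usize) -> Option<usize>,
{
    // With a leading `^`/`\A`, position 0 is the only viable start,
    // so the scan stops after a single attempt.
    let last_start = if anchored_start { 0 } else { hay.len() };
    (0..=last_start)
        .filter(|&start| hay.is_char_boundary(start))
        .find_map(|start| match_at(hay, start).map(|end| (start, end)))
}
```

In the anchored case a failed attempt at offset 0 ends the search immediately instead of retrying at every later offset.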

problem with regex compilation

When I attempt to build my project, I see the following:

$ cargo build
    Updating registry `https://github.com/rust-lang/crates.io-index`
    Updating git repository `https://github.com/alexcrichton/toml-rs`
 Downloading docopt v0.6.42
 Downloading regex v0.1.16
 Downloading libc v0.1.2
 Downloading log v0.2.5
 Downloading rustc-serialize v0.2.15
 Downloading rustc-serialize v0.3.0
   Compiling regex v0.1.16
/Users/alec/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/re.rs:14:16: 14:23 error: unresolved import `std::str::Pattern`. There is no `Pattern` in `std::str`
/Users/alec/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/re.rs:14 use std::str::{Pattern, Searcher, SearchStep};
                                                                                                     ^~~~~~~
/Users/alec/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/re.rs:14:25: 14:33 error: unresolved import `std::str::Searcher`. There is no `Searcher` in `std::str`
/Users/alec/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/re.rs:14 use std::str::{Pattern, Searcher, SearchStep};
                                                                                                              ^~~~~~~~
/Users/alec/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/re.rs:14:35: 14:45 error: unresolved import `std::str::SearchStep`. There is no `SearchStep` in `std::str`
/Users/alec/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-0.1.16/src/re.rs:14 use std::str::{Pattern, Searcher, SearchStep};
                                                                                                                        ^~~~~~~~~~
error: aborting due to 3 previous errors
Could not compile `regex`.

implement &T for Replacer where T: Replacer

When attempting to call replace_all on a Regex with a str as the argument for rep (as opposed to a NoExpand), the Replacer trait implementation for str is apparently being ignored, and the call falls back to the FnMut implementation. This project demonstrates the following compile error:

src\main.rs:13:23: 13:53 error: the trait `for<'r,'r> core::ops::Fn<(&'r regex::re::Captures<'r>,)>` is not implemented for the type `str` [E0277]
src\main.rs:13         true => regex.replace_all(&source, &replace.as_str()),

Where replace and source are both Strings.

Edit: just fixed the type of replace by adding as_str, as the & was not coercing the String to a str properly. See the latest commit.

macro undefined: regex!

I'm trying to fix regex_macros, but replacing #[plugin] #[no_link] extern crate regex_macros; with #![plugin(regex_macros)] doesn't seem to be working. Here's a gist of my change and output of running cargo test: https://gist.github.com/9a19a08b72413ef86e44 --- What am I missing?

cc @alexcrichton @kmcallister

range notation

Just a heads up, getting a message about ranges.

examples/blah.rs:124:21: 124:49 warning: use of unstable library feature 'core': will be replaced by range notation
examples/blah.rs:124             .ignore(regex!(r"^\.|^#|~$|\.swp$"));
                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
examples/blah.rs:1:1: 132:1 note: in expansion of regex!
examples/blah.rs:124:21: 124:49 note: expansion site
examples/blah.rs:124:21: 124:49 help: add #![feature(core)] to the crate attributes to silence this warning
examples/blah.rs:124             .ignore(regex!(r"^\.|^#|~$|\.swp$"));
                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
examples/blah.rs:1:1: 132:1 note: in expansion of regex!
examples/blah.rs:124:21: 124:49 note: expansion site

generate less code for regex plugin

Issue by huonw
Tuesday Apr 29, 2014 at 14:00 GMT

For earlier discussion, see rust-lang/rust#13842

This issue was labelled with: I-compiletime in the Rust repository

#![feature(phase)]
#![allow(dead_code)]

#[phase(syntax)] extern crate regex_macros;
extern crate regex;

#[cfg(short)]
fn short() {
    regex!("a");
}

#[cfg(medium)]
fn medium() {
    // 500
    regex!("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
}

#[cfg(long)]
fn long() {
    // 1000
    regex!("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
}

fn main() {}

$ for x in short medium long; do echo $x; time rustc regex.rs --cfg $x --no-trans; done
short

real    0m0.102s
user    0m0.092s
sys     0m0.012s
medium

real    0m1.384s
user    0m1.332s
sys     0m0.048s
long

real    0m3.612s
user    0m3.508s
sys     0m0.104s

They don't take nearly this long to just compile as dynamic ones, so I would guess it's the extra work that the generating macro is doing. (Note the --no-trans there, so it isn't just the extra code making LLVM slow.)

A perf trace identifies rustc librustc-4283bb68-0.11-pre.so [.] hashmap::HashMap$LT$K$C$$x20V$C$$x20H$GT$::search::h14555543045583792107::v0.11.pre as taking a lot (10.18%) of time.

be more liberal with escape sequences in the parser

A bit of background: I'm trying to implement the BrowserScope user agent parser as seen here: https://github.com/ua-parser/uap-core/. The regular expression below can be found here: https://github.com/ua-parser/uap-core/blob/master/regexes.yaml#L17

The crate (0.1.38) encounters an error compiling the following regex:

/((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|Tailsweep)[ \-](\d+)(?:\.(\d+)(?:\.(\d+))?)?

with the following error:

Syntax(Error { pos: 92, surround: "p)[ \\-](\\d", kind: UnrecognizedEscape('-') })'

To reproduce:

use regex::Regex;
let re = Regex::new(r"/((?:Ant-)?Nutch|[A-z]+[Bb]ot|[A-z]+[Ss]pider|Axtaris|fetchurl|Isara|ShopSalad|Tailsweep)[ \-](\d+)(?:\.(\d+)(?:\.(\d+))?)?").unwrap();

Release version doesn't compile in latest cargo (0e42b4d)

Although the development version works fine, the released version still has errors with hyphenated crate names.

$ cargo build
Unable to get packages from source

Caused by:
  failed to parse manifest at `/home/daboross/.cargo/registry/src/github.com-1ecc6299db9ec823/regex_macros-0.1.12/Cargo.toml`

Caused by:
  target names cannot contain hyphens: shootout-regex-dna

Optimize case_fold_and_combine_ranges?

I haven’t timed it, but the case_fold_and_combine_ranges function introduced in #78 is probably slow for large sets of character ranges like \p{Lu}. (That is, slow to parse/compile a regex.) It can probably be improved to not consider individual chars when we can determine that none in a given range (or subrange) is affected by case folding.

regex negated character set doesn't work properly with character classes

Issue by kanaka
Tuesday Oct 14, 2014 at 15:11 GMT

For earlier discussion, see rust-lang/rust#18035

This issue was labelled with: A-libs in the Rust repository

$ rustc --version
rustc 0.13.0-nightly (1c3ddd297 2014-10-13 23:27:46 +0000)

Here is my test code:

#![feature(phase)]
#[phase(plugin)]
extern crate regex_macros;
extern crate regex;

fn main() {
    let re = regex!(r#"([^\s])"#);
    println!("\n1 {} on '1 2 3'", re);
    for cap in re.captures_iter("1 2 3") {
        println!("0: {}, 1: {}", cap.at(0), cap.at(1));
    }

    let re = regex!(r#"([^\s,])"#);
    println!("\n2 {} on '1 2 3'", re);
    for cap in re.captures_iter("1 2 3") {
        println!("0: {}, 1: {}", cap.at(0), cap.at(1));
    }

    let re = regex!(r#"([^A,])"#);
    println!("\n3 {} on '1A2A3'", re);
    for cap in re.captures_iter("1A2A3") {
        println!("0: {}, 1: {}", cap.at(0), cap.at(1));
    }

    let re = regex!(r#"([^[:alpha:],])"#);
    println!("\n4 {} on '1A2A3'", re);
    for cap in re.captures_iter("1A2A3") {
        println!("0: {}, 1: {}", cap.at(0), cap.at(1));
    }

    let re = regex!(r#"([^[:alpha:]Z])"#);
    println!("\n5 {} on '1A2A3'", re);
    for cap in re.captures_iter("1A2A3") {
        println!("0: {}, 1: {}", cap.at(0), cap.at(1));
    }
}

Results:

1 ([^\s]) on '1 2 3'
0: 1, 1: 1
0: 2, 1: 2
0: 3, 1: 3

2 ([^\s,]) on '1 2 3'
0: 1, 1: 1
0:  , 1:  
0: 2, 1: 2
0:  , 1:  
0: 3, 1: 3

3 ([^A,]) on '1A2A3'
0: 1, 1: 1
0: 2, 1: 2
0: 3, 1: 3

4 ([^[:alpha:],]) on '1A2A3'
0: 1, 1: 1
0: A, 1: A
0: 2, 1: 2
0: A, 1: A
0: 3, 1: 3

5 ([^[:alpha:]Z]) on '1A2A3'
0: 1, 1: 1
0: A, 1: A
0: 2, 1: 2
0: A, 1: A
0: 3, 1: 3

Cases 2, 4, 5 have the misbehavior. My expectation is that every case should have the same result as 1 and 3.

Attempting to parse {2} causes panic

use regex::Regex;

extern crate regex;


fn main() {
    let _ = Regex::new("{2}");
}

/t/test (master|…) $ cargo run
     Running `target/debug/test`
thread '<main>' panicked at 'called `Option::unwrap()` on a `None` value', /Users/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-mac/build/src/libcore/option.rs:362
An unknown error occurred

To learn more, run the command again with --verbose.

Was found using https://github.com/kmcallister/afl.rs 👍

Build fails: There is no `BinarySearchResult` in `std::slice`

rustc 0.13.0-nightly (10d99a973 2014-12-31 21:01:42 +0000)

cargo build 
   Compiling regex v0.1.4 (file://***/regex)
***/regex/src/parse.rs:21:5: 21:35 error: unresolved import `std::slice::BinarySearchResult`. There is no `BinarySearchResult` in `std::slice`
***/regex/src/parse.rs:21 use std::slice::BinarySearchResult;
                                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
error: aborting due to previous error
Could not compile `regex`.

To learn more, run the command again with --verbose.

It got removed from libcore: rust-lang/rust@67d1388#diff-91f9d2237c7851d61911b0ca64792a88L1228

Binary search is used with incorrectly-sorted array

The “dynamic” matching of CharClass uses a binary search within char ranges, which relies on the input being sorted. The input is indeed sorted in its “natural” order, but the comparison function in case-insensitive mode uses a different order.

This leads to incorrect results:

assert!(Regex::new(r"(?i)[a_]+$").unwrap().is_match("A_"));

The above fails, because _ in ASCII is between upper-case and lower-case letters. The comparison function maps the 'a'..'a' range to its upper case 'A'..'A', which has a different order relative to '_'..'_'.

Compare with e.g. the code below, which succeeds.

assert!(Regex::new(r"(?i)[a=]+$").unwrap().is_match("A="));

The comparison function has a FIXME to move the case mapping outside of it and have the Vec of ranges be already mapped. I assume this was intended for performance, but I think it’ll also fix this issue. I’ll try it.
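The ordering problem can be reproduced without the crate. The sketch below is an illustration under stated assumptions, not the crate's actual code: `fold` is a hypothetical stand-in for case folding (ASCII uppercase only), the binary search is hand-written so its behavior on mis-ordered input is deterministic, and folding only the range endpoints is valid here solely because every range holds a single character.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for the crate's case folding: ASCII uppercase only.
fn fold(c: char) -> char {
    c.to_ascii_uppercase()
}

// Where does the inclusive range (start, end) sit relative to `c`?
fn range_vs_char(start: char, end: char, c: char) -> Ordering {
    if end < c {
        Ordering::Less
    } else if start > c {
        Ordering::Greater
    } else {
        Ordering::Equal
    }
}

// Plain binary search; `cmp` reports where a range sits relative to the target.
fn search<F: Fn(&(char, char)) -> Ordering>(ranges: &[(char, char)], cmp: F) -> bool {
    let (mut lo, mut hi) = (0, ranges.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        match cmp(&ranges[mid]) {
            Ordering::Equal => return true,
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
        }
    }
    false
}

// Buggy shape: ranges stay in natural order and are folded inside the comparator.
fn contains_fold_in_cmp(ranges: &[(char, char)], c: char) -> bool {
    let c = fold(c);
    search(ranges, |&(s, e)| range_vs_char(fold(s), fold(e), c))
}

// Fixed shape: fold the ranges up front, re-sort, then compare without folding.
fn contains_prefolded(ranges: &[(char, char)], c: char) -> bool {
    let mut folded: Vec<(char, char)> =
        ranges.iter().map(|&(s, e)| (fold(s), fold(e))).collect();
    folded.sort();
    let c = fold(c);
    search(&folded, |&(s, e)| range_vs_char(s, e, c))
}
```

With the class `[('_', '_'), ('a', 'a')]`, the first variant finds `'A'` but misses `'_'`, because after folding the second range sorts before the first. The second variant finds both. The class `[('=', '='), ('a', 'a')]` works in either variant since `'='` sorts before both `'A'` and `'a'`.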

iterating over empty matches is wrong

This code:

extern crate regex;

use regex::Regex;

fn main() {
    let re = Regex::new(".*?").unwrap();
    for m in re.find_iter("ΛΘΓΔα") {
        println!("{:?}", m);
    }
}

When run, this violates an assertion in std::str:

[andrew@Liger play] ./target/scratch 
thread '<main>' panicked at 'assertion failed: (w != 0)', /home/rustbuild/src/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libcore/str/mod.rs:1477

The precise reason is probably in how find_iter (and captures_iter) handles zero-length matches. It has to make progress after a zero-length match in the input, otherwise the iterator won't terminate. Currently, it makes progress by adding 1 to a byte offset into a UTF-8 encoded string, which is wrong whenever the next character is more than one byte long, because the new offset lands in the middle of that character. This should be relatively easy to fix by setting self.last_end = self.last_end + length_of_char_that_starts_at(self.last_end).
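A minimal sketch of that fix, using only the standard library. The names `next_char_boundary` and `empty_match_offsets` are hypothetical and chosen for illustration. The second function only simulates an iterator whose every match is empty; it does not call the regex crate.

```rust
// Byte offset just past the character that starts at `at`.
// Assumes `at` lies on a character boundary of `text`.
fn next_char_boundary(text: &str, at: usize) -> usize {
    match text[at..].chars().next() {
        Some(c) => at + c.len_utf8(),
        // At the end of the input: step past it so the caller's loop terminates.
        None => at + 1,
    }
}

// Offsets visited when every match is empty, advancing one character at a time.
fn empty_match_offsets(text: &str) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut last_end = 0;
    while last_end <= text.len() {
        offsets.push(last_end);
        last_end = next_char_boundary(text, last_end);
    }
    offsets
}
```

For the haystack "ΛΘΓΔα", where each letter occupies two bytes, the visited offsets are 0, 2, 4, 6, 8 and 10, all of which are valid character boundaries. Adding 1 instead would produce offset 1, which falls inside the first letter.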

implement a DFA matcher

One of the reasons RE2/C++ is so fast is that it has two implementations of regex matching: a limited DFA matcher (no sub-capture support) and a full NFA simulation. This crate has the latter, but not the former.

Adding a DFA matcher should be an implementation detail and shouldn't require any public facing changes.

This is a pretty involved project. I hope to find time to do this some day, but if someone else wants to tackle it, I'd be happy to help mentor it. (Part of this will be figuring out how to handle the regex! macro. Do we replicate the DFA there too like we do the NFA?)
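For readers unfamiliar with the distinction, here is a toy illustration of what "a limited DFA matcher" means. It is a hand-built automaton for the single pattern `^ab*c$`, written as an assumption-laden sketch: a real implementation would construct states from the compiled regex rather than hard-code them, and none of these names exist in the crate.

```rust
// Transition function of a hand-built DFA for `^ab*c$`.
// States: 0 = start, 1 = seen `a` followed by zero or more `b`,
// 2 = accepting, 3 = dead.
fn step(state: u8, byte: u8) -> u8 {
    match (state, byte) {
        (0, b'a') => 1,
        (1, b'b') => 1,
        (1, b'c') => 2,
        _ => 3,
    }
}

// Runs the DFA over the input: one table lookup per byte, a single current
// state, and no capture-group bookkeeping. It can only answer yes or no.
fn dfa_is_match(input: &[u8]) -> bool {
    let mut state = 0;
    for &b in input {
        state = step(state, b);
        if state == 3 {
            return false;
        }
    }
    state == 2
}
```

The matcher tracks exactly one state per input byte, which is where the speed comes from, and it records nothing about where a sub-group matched, which is why an NFA simulation is still needed for captures.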

Usage: match regular expressions on `&[u8]`