
Comments (3)

todd-richmond commented on September 28, 2024

Likely related to #8 (comment), which has our incomplete fix for non-ASCII matching. In general we always utf8::downgrade strings because native UTF-8 matching and capture in Perl can be horrendously slow, which is one reason we switched to RE2 in a few performance-critical areas.

from re-engine-re2.

dgl commented on September 28, 2024

The fact one person wants an upgrade and one a downgrade is why this is tricky to fix ;-)

Ideally, re::engine::RE2 needs to compile two RE2s, one as UTF-8 and one as Latin1, then pick between them (maybe compiling one lazily; it needs to compile at least one up front to handle errors). The first problem is that, however that's done, it has the potential to seriously affect performance in many cases.

It also isn't as simple as you may think, because some regexps can't even be compiled as Latin1 with re::engine::RE2. There's a 12-year-old note in the TODO that's still relevant: https://github.com/dgl/re-engine-RE2/blame/master/TODO

$ perl -Mutf8 -Mblib -e 'use re::engine::RE2 -strict => 1; /😀\x{1F01}/'
$ perl -Mutf8 -Mblib -e 'use re::engine::RE2 -strict => 1; /\x{1F01}/'
invalid escape sequence: \x{1F0 at -e line 1.

So maybe the overall approach is to compile as UTF-8, then lazily try Latin1 if needed. Maybe we could use the Perl regexp /a flag to actually mean Latin1, although that needs more design thought, in particular on how to make this more Perl-compatible without regressing people's potentially performance-critical code.
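
A minimal sketch of the "UTF-8 first, Latin1 lazily" idea, written as a user-level wrapper rather than as engine code. The helper name make_matcher is hypothetical and not part of re::engine::RE2. The sketch uses plain qr// so it runs with core Perl; the assumption is that with `use re::engine::RE2;` in scope the same qr// calls would be compiled by RE2 instead.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not part of re::engine::RE2): compile the UTF-8
# variant up front so pattern errors surface immediately, and build the
# Latin1 variant only when a byte-string target first needs it.
sub make_matcher {
    my ($source) = @_;
    my $wide_src = $source;
    utf8::upgrade($wide_src);
    my $wide = qr/$wide_src/;    # eager: dies here on a bad pattern
    my ($narrow, $tried);
    return sub {
        my ($target) = @_;
        if (!utf8::is_utf8($target)) {
            if (!$tried) {
                $tried = 1;
                my $narrow_src = $source;
                # Stays undef if the pattern cannot be expressed as Latin1.
                $narrow = eval { utf8::downgrade($narrow_src); qr/$narrow_src/ };
            }
            return $target =~ $narrow ? 1 : 0 if defined $narrow;
        }
        # UTF-8 targets, or patterns with no Latin1 form, use the UTF-8 variant.
        return $target =~ $wide ? 1 : 0;
    };
}
```

The Latin1 attempt happens at most once per matcher, so a pattern that cannot be downgraded only pays for the failed attempt on its first byte-string target.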


jbalazerpfpt commented on September 28, 2024

The workaround we actually use is to compile two patterns, one utf8 and one non-utf8, and then use the appropriate one for the target string. Successfully downgraded target strings get the non-utf8 pattern. That way we get the benefit of faster processing on target strings that don't need utf8 matching. It works for us, but that's because we don't need to use \x escapes above \xFF in our patterns.
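
The workaround described above might look roughly like the following. This is a sketch under assumptions, not the poster's actual code: compile_pair and match_pair are hypothetical names, and plain qr// is used so the example runs with core Perl (with `use re::engine::RE2;` in scope, the patterns would presumably be compiled by RE2 instead).

```perl
use strict;
use warnings;

# Hypothetical helper: build both variants of a pattern once.
sub compile_pair {
    my ($source) = @_;
    my $bytes = $source;
    utf8::downgrade($bytes);    # dies if $source has characters above 0xFF
    my $wide = $source;
    utf8::upgrade($wide);
    return { narrow => qr/$bytes/, wide => qr/$wide/ };
}

# Pick the variant based on whether the target string can be downgraded.
sub match_pair {
    my ($pair, $target) = @_;
    my $copy = $target;
    if (utf8::downgrade($copy, 1)) {    # FAIL_OK: returns false instead of dying
        return ('narrow', $copy =~ $pair->{narrow} ? 1 : 0);
    }
    return ('wide', $target =~ $pair->{wide} ? 1 : 0);
}

my $pair = compile_pair('caf\xE9');
my ($which, $matched) = match_pair($pair, "caf\xE9");
```

Downgrading a copy leaves the caller's string untouched, at the cost of one extra copy per match.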

