Comments (3)
Likely related to #8 (comment), which has our incomplete fix for non-ASCII matching. In general we always utf8::downgrade strings, because native UTF-8 Perl matching and capture can be horrendously slow, which is one reason we switched to RE2 in a few performance-critical areas.
from re-engine-re2.
The fact one person wants an upgrade and one a downgrade is why this is tricky to fix ;-)
Ideally, re::engine::RE2 would compile two RE2s, one as UTF-8 and one as Latin1, then pick between them (maybe compiling one lazily; it needs to compile at least one up front to report errors). The first problem is that, however that's done, it has the potential to seriously affect performance in many cases.
There's also the issue that it isn't as simple as you may think, because you can't even compile some regexps as Latin1 with re::engine::RE2. There's a 12-year-old note in the TODO that's still relevant: https://github.com/dgl/re-engine-RE2/blame/master/TODO
$ perl -Mutf8 -Mblib -e 'use re::engine::RE2 -strict => 1; /😀\x{1F01}/'
$ perl -Mutf8 -Mblib -e 'use re::engine::RE2 -strict => 1; /\x{1F01}/'
invalid escape sequence: \x{1F0 at -e line 1.
So maybe the overall approach is to compile as UTF-8, then lazily try Latin1 if needed. Maybe we could use the Perl regexp /a flag to actually mean Latin1, although that needs some more design thought, particularly about how to do this and become more Perl-compatible without regressing people's potentially performance-critical code.
from re-engine-re2.
The workaround we actually use is to compile two patterns, one utf8 and one non-utf8, and then use the appropriate one for the target string. Successfully downgraded target strings get the non-utf8 pattern. That way we get the benefit of faster processing on target strings that don't need utf8 matching. It works for us, but that's because we don't need to use \x escapes above \xFF in our patterns.
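For anyone wanting to try the same workaround, here is a minimal sketch of the idea rather than our actual code. The helper name make_dual_matcher is made up for illustration, and the sketch runs on the core regexp engine as written; the commented-out use line is where re::engine::RE2 would be enabled, on the assumption that it picks its encoding from the pattern's UTF-8 flag.

```perl
use strict;
use warnings;
# use re::engine::RE2;   # assumption: enable this to compile with RE2 instead of core Perl

# Hypothetical helper: keep two compiled copies of one pattern source,
# one without the UTF-8 flag and one with it, and pick per target string.
sub make_dual_matcher {
    my ($source) = @_;

    my $bytes_src = $source;
    utf8::downgrade($bytes_src);   # dies if the source holds literal characters above 0xFF
    my $utf8_src = $source;
    utf8::upgrade($utf8_src);      # same characters, UTF-8 flag on

    my $re_bytes = qr/$bytes_src/;
    my $re_utf8  = qr/$utf8_src/;

    return sub {
        my ($target) = @_;         # works on a copy, so the caller's string is untouched
        # A true second argument makes utf8::downgrade return false
        # instead of dying when the string cannot be downgraded.
        my $re = utf8::downgrade($target, 1) ? $re_bytes : $re_utf8;
        return $target =~ $re ? 1 : 0;
    };
}

my $match = make_dual_matcher("caf\x{e9}");
print $match->("un caf\x{e9}")       ? "match\n" : "no match\n";  # downgradable target
print $match->("caf\x{e9} \x{263A}") ? "match\n" : "no match\n";  # needs the UTF-8 pattern
```

The sketch only returns a boolean. Captures would have to be read inside the closure, since they are scoped to the match.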
from re-engine-re2.