Comments (9)
well that was a little longer than a couple weeks...
if you get a chance, please check out the chimera branch and give it a shot, or use this Dockerfile as a baseline. I'd like to do some more testing before releasing, to iron out any segfaults or weirdness, given the number of changes to almost every API wrapper in this extension.
there is a unit test example for compiling a database and scanning for matches here. from a user perspective it should be pretty straightforward, and it follows the same event handler signature as the C API.
note that the build requirements are updated:
- requires Hyperscan v5.4.0
- build Hyperscan with -fPIC (set CFLAGS and CXXFLAGS)
- libpcre is statically linked, so there should be no requirement to install it globally, however you probably need to download the source (the docker image uses 8.44) and untar it in your Hyperscan source directory in order for the Hyperscan cmake stuff to actually turn the Chimera headers on. i haven't tested without this yet, though.
from python-hyperscan.
Wow, cool to see progress on this!
I only got as far as building a wheel in your Dockerfile and installing it in Debian, but can at least confirm that works:
FROM darvid/manylinux-hyperscan:5.4.0 AS hyperscan
WORKDIR /python-hyperscan
ENV PCRE_PATH=/opt/pcre/.libs
RUN python3.9 -m pip wheel https://github.com/darvid/python-hyperscan/archive/chimera.zip#egg=hyperscan
FROM python:buster
ENV LD_LIBRARY_PATH=/opt/hyperscan/lib:${LD_LIBRARY_PATH}
COPY --from=hyperscan /opt/hyperscan/ /opt/hyperscan
COPY --from=hyperscan /python-hyperscan/ /python-hyperscan
RUN pip install --no-index --find-links=/python-hyperscan hyperscan
# python
Python 3.9.6 (default, Jun 29 2021, 19:18:53)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hyperscan
>>> db = hyperscan.Database(chimera=True, mode=hyperscan.CH_MODE_GROUPS)
>>> db.compile(expressions=[rb'(a.) (b.)'])
>>> def on_match(*args, **kwargs):
...     print(args, kwargs)
...
>>> db.scan(b'1x ax bx cx', on_match)
(0, 3, 8, 0, [(1, 3, 8), (1, 3, 5), (1, 6, 8)], None) {}
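To make that output easier to read, here is a minimal sketch that slices the scanned bytes using the values printed above. It assumes each entry in the list is a (flags, start, end) tuple and that the first entry covers the whole match, which is what the offsets suggest. The helper name captured_slices is made up for illustration and is not part of python-hyperscan.

```python
def captured_slices(data, captured):
    """Return the bytes covered by each (flags, start, end) capture tuple."""
    return [data[start:end] for _flags, start, end in captured]


# Arguments exactly as printed by the on_match callback above.
data = b'1x ax bx cx'
expr_id, start, end, flags, captured, context = (
    0, 3, 8, 0, [(1, 3, 8), (1, 3, 5), (1, 6, 8)], None
)

# Offsets 3..8 cover the overall match of (a.) (b.).
print(data[start:end])                  # b'ax bx'
# Assumed layout: whole match first, then one entry per capture group.
print(captured_slices(data, captured))  # [b'ax bx', b'ax', b'bx']
```

So the two groups from (a.) (b.) line up with the second and third tuples.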
When I get a chance I'll try adapting the eyecite library to use this -- that would exercise things a little more.
Given the new build requirements I'm curious about packaging. Prior to this, installing the requirements for this library on Debian is just:
echo 'deb http://deb.debian.org/debian buster-backports main' >> /etc/apt/sources.list.d/backports.list
apt-get -t buster-backports install -y libhyperscan-dev
And on Mac it is just brew install hyperscan.
Seems like this might bump up the benefits of static linking in #29? But I'm also curious if ideally Debian/homebrew/etc. should include chimera support in their ports -- I'm a little out of my depth about whether that would be a good idea in general.
FWIW I'm only seeing a size of 50MB rather than 200+, but I might be looking at the wrong thing:
root@c66cea9224eb:/eyecite# du -hs /opt/hyperscan/
51M /opt/hyperscan/
root@c66cea9224eb:/eyecite# du -hs /python-hyperscan/
168K /python-hyperscan/
from python-hyperscan.
@brianthelion looks like I forgot to document that the Chimera match event handler has a different signature: one of the arguments is actually a ch_capture struct, which in Python is a list of capture groups, each a tuple of (flags, start, end). See the official docs here.
Here's a quick and dirty example for capturing group names. Maybe there is a better way with the Chimera support, but this is analogous to how I would do it regardless of whether you're using Chimera or not:
#!/usr/bin/env python
import hyperscan

patterns = {
    '✨ α ✨': (b'1x', 0),
    '✨ β ✨': (b'ax', 0),
    '✨ γ ✨': (b'bx', 0),
    '✨ δ ✨': (b'cx', 0),
}


def match_event_handler(id, start, end, flags, captured, ctx):
    for i, (cap_flags, cap_from, cap_to) in enumerate(captured):
        print(f"Expr ID: {id}")
        print(f"Capture flags: {cap_flags}")
        print(f"Capture #: {i}")
        print(f"Capture group name: {ctx['group_names'][id]}")
        print(f"Test string: {repr(ctx['chunk'])}")
        print(f"Captured: {ctx['chunk'][cap_from:cap_to]}")
        print()
    print()


def main():
    db = hyperscan.Database(chimera=True, mode=hyperscan.CH_MODE_GROUPS)
    expressions, ids, flags = [], [], []
    group_names: dict[int, str] = {}
    for i, (group_name, (expr, flags_)) in enumerate(patterns.items()):
        ids.append(i)
        expressions.append(expr)
        flags.append(flags_)
        group_names[i] = group_name
    db.compile(expressions=expressions, ids=ids, flags=flags)
    chunk = b'1x ax bx cx'
    ctx = {
        "group_names": group_names,
        "chunk": chunk,
    }
    db.scan(chunk, match_event_handler, context=ctx)


if __name__ == '__main__':
    main()
$ python groups.py
Expr ID: 0
Capture flags: 1
Capture #: 0
Capture group name: ✨ α ✨
Test string: b'1x ax bx cx'
Captured: b'1x'
Expr ID: 1
Capture flags: 1
Capture #: 0
Capture group name: ✨ β ✨
Test string: b'1x ax bx cx'
Captured: b'ax'
Expr ID: 2
Capture flags: 1
Capture #: 0
Capture group name: ✨ γ ✨
Test string: b'1x ax bx cx'
Captured: b'bx'
Expr ID: 3
Capture flags: 1
Capture #: 0
Capture group name: ✨ δ ✨
Test string: b'1x ax bx cx'
Captured: b'cx'
from python-hyperscan.
hey - I have, and have avoided it because I wasn't sure anyone had a good use case for it - Chimera is essentially an entirely separate C API for interacting with scratch space, match event handlers, compiling, etc... even the flags are different.
so it will either require duplicating much of the Python interface to work with the Chimera side of Hyperscan, or some extensive use of macros to dynamically switch to the Chimera API.
I'll try to find time to take a closer look in the next couple of weeks and keep you updated.
from python-hyperscan.
> I only got as far as building a wheel in your Dockerfile and installing it in Debian
oh, in that case I don't think it requires (re-)building Hyperscan from source, that's good. I'll just update the documentation on building from source, which will require those additional steps.
> Seems like this might bump up the benefits of static linking in #29?
static linking would still be beneficial to people running LTS distros who don't/can't install a recent Hyperscan release from a PPA, but I don't know how many people/applications actually have that requirement (and also use this project). I personally think it would be a good idea for package maintainers to include Chimera support, yeah, or maybe add a secondary, "fatter" package with Chimera and static + shared libs.
> FWIW I'm only seeing a size of 50MB rather than 200+
so the /opt/hyperscan directory is just Hyperscan itself, the 200 MB was coming from this extension statically linked to the libs, which definitely seems a bit much. looking back, I think I relied on pkg-config --static, and long story short, not every package that supports static linking ships the proper *.pc file that pkg-config needs. I think I must have linked glibc and other stuff, instead of just libhs.
TLDR:
python-hyperscan on chimera [!] is 📦 v0.2.0 via 🐍 v3.9.1+ (.venv)took 3s
$ ldd src/hyperscan/_hyperscan.cpython-39-x86_64-linux-gnu.so
linux-vdso.so.1 (0x00007ffe627b1000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2884766000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2884574000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2884393000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2884378000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2884355000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2885563000)
python-hyperscan on chimera [!] is 📦 v0.2.0 via 🐍 v3.9.1+ (.venv)
$ ls -l src/hyperscan/_hyperscan.cpython-39-x86_64-linux-gnu.so
Permissions Size User Date Modified Name
.rwxr-xr-x 15Mi david 9 Jul 21:50 src/hyperscan/_hyperscan.cpython-39-x86_64-linux-gnu.so
15mb is a lot better! but I think we may want to provide both a statically linked and dynamically linked option, so there might be work there. if you're down to test the statically linked wheel, I can push the updated build script to the chimera branch in a bit.
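A rough way to repeat that check in a script is to parse the ldd output and see whether any libhs entry shows up. This is only a sketch: the helper names linked_libraries and links_dynamically are made up here, and the sample text is a shortened copy of the listing above.

```python
def linked_libraries(ldd_output):
    """Return the library file names listed in `ldd` output."""
    names = []
    for line in ldd_output.splitlines():
        fields = line.split()
        if fields:
            # The first field is the library name; drop any directory part
            # (the dynamic loader is listed with its full path).
            names.append(fields[0].rsplit("/", 1)[-1])
    return names


def links_dynamically(ldd_output, prefix="libhs"):
    """True if some shared library whose name starts with `prefix` is listed."""
    return any(name.startswith(prefix) for name in linked_libraries(ldd_output))


# Shortened sample copied from the ldd listing above.
sample = """\
linux-vdso.so.1 (0x00007ffe627b1000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2884766000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2884393000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2885563000)
"""

print(linked_libraries(sample))
print(links_dynamically(sample))  # False: no shared libhs in the list
```

No libhs line in the output is consistent with Hyperscan being linked statically into the extension, though it is not proof on its own.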
from python-hyperscan.
Cool to see this landed! Thanks for your work on it. Note that the readme still indicates 5.2 and no Chimera support.
from python-hyperscan.
thanks, updated
from python-hyperscan.
@darvid Thanks for this great addition!
Question: It appears that named capture groups are supported as part of PCRE2, but nothing like Python's re.Match.groupdict() is provided by Hyperscan out of the box. So how do we back into the group names from within the callback?
It seems that one option would be to use re.Pattern.groupindex to map group numbers onto names, but it's not totally obvious how the group numbers are represented in the callback args and kwargs in your example above:
>>> db.scan(b'1x ax bx cx', on_match)
(0, 3, 8, 0, [(1, 3, 8), (1, 3, 5), (1, 6, 8)], None) {}
The docstring for .scan says:
    match_event_handler (callable): The match callback, which is
        invoked for each match result, and passed the expression
        id, start offset, end offset, flags, and a context object.
I count 5 args mentioned in the docstring versus 6 passed to the callback in the example code.
I'm going to try a few things and report back, but if you know the answer off the top of your head it would save me some fumbling. Thanks!
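In the meantime, a handler can be written so it works with either argument count. The sketch below assumes the context is always the last positional argument and that Chimera mode adds the capture list just before it, as in the example output quoted above. make_handler is a made-up helper, not part of the library.

```python
def make_handler(results):
    """Build a match callback that accepts both handler signatures."""
    def on_match(expr_id, start, end, flags, *rest):
        # Assumption: the last argument is the context object, and in
        # Chimera mode the capture list is passed right before it.
        *maybe_captured, context = rest
        captured = maybe_captured[0] if maybe_captured else None
        results.append((expr_id, start, end, captured))
    return on_match


results = []
handler = make_handler(results)

# Five arguments, as described in the .scan docstring.
handler(0, 3, 8, 0, None)
# Six arguments, as seen in the Chimera example output.
handler(0, 3, 8, 0, [(1, 3, 8), (1, 3, 5), (1, 6, 8)], None)

print(results)
```

The first call records None for the captures, the second records the capture list.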
from python-hyperscan.
First, thanks for all the work on this extremely useful library –
Regarding named capture groups, does the above approach work for a pattern like b'(?<foo>ax) bx (?<bar>cx)', so that the names foo and bar can be recovered (along with their associated content) in the match handler?
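A possible sketch of the re.Pattern.groupindex idea suggested earlier in the thread, under a few assumptions: the capture list is indexed by group number with the whole match at index 0, a flags value of 0 marks a group that did not take part in the match, and the pattern is written with (?P<name>...) because Python's re does not accept the (?<name>...) spelling (PCRE accepts both). The capture list below is written by hand for illustration, not taken from a real scan, and named_captures is a made-up helper.

```python
import re


def named_captures(pattern, data, captured):
    """Map the named groups of `pattern` onto Chimera-style capture tuples."""
    # groupindex maps each group name to its 1-based group number.
    groupindex = re.compile(pattern).groupindex
    named = {}
    for name, number in groupindex.items():
        flags, start, end = captured[number]
        # Assumption: flags == 0 means the group did not participate.
        named[name] = data[start:end] if flags else None
    return named


pattern = rb'(?P<foo>ax) bx (?P<bar>cx)'
data = b'1x ax bx cx'
# Hypothetical capture list: whole match, then foo, then bar.
captured = [(1, 3, 11), (1, 3, 5), (1, 9, 11)]

print(named_captures(pattern, data, captured))  # {'foo': b'ax', 'bar': b'cx'}
```

In a real handler the group index would be built once per expression and passed in through the context object, much like group_names in the earlier example.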
from python-hyperscan.