
Comments (9)

darvid commented on June 11, 2024

well, that took a little longer than a couple of weeks...

if you get a chance, please check out the chimera branch and give it a shot, or use this Dockerfile as a baseline. I'd like to do some more testing before releasing, to iron out any segfaults or weirdness, given the number of changes to almost every API wrapper in this extension.

there is a unit test example for compiling a database and scanning for matches here. from a user perspective it should be pretty straightforward, and it follows the same event handler signature as the C API.

note that the build requirements are updated:

  • requires Hyperscan v5.4.0
  • build Hyperscan with -fPIC (set CFLAGS and CXXFLAGS)
  • libpcre is statically linked, so there should be no need to install it globally. however, you probably need to download the PCRE source (the docker image uses 8.44) and untar it in your Hyperscan source directory so that the Hyperscan cmake build actually turns the Chimera headers on. i haven't tested without this yet, though.

from python-hyperscan.

jcushman commented on June 11, 2024

Wow, cool to see progress on this!

I only got as far as building a wheel in your Dockerfile and installing it in Debian, but can at least confirm that works:

FROM darvid/manylinux-hyperscan:5.4.0 AS hyperscan
WORKDIR /python-hyperscan
ENV PCRE_PATH=/opt/pcre/.libs
RUN python3.9 -m pip wheel https://github.com/darvid/python-hyperscan/archive/chimera.zip#egg=hyperscan

FROM python:buster
ENV LD_LIBRARY_PATH=/opt/hyperscan/lib:${LD_LIBRARY_PATH}
COPY --from=hyperscan /opt/hyperscan/ /opt/hyperscan
COPY --from=hyperscan /python-hyperscan/ /python-hyperscan
RUN pip install --no-index --find-links=/python-hyperscan hyperscan
# python
Python 3.9.6 (default, Jun 29 2021, 19:18:53)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import hyperscan
>>> db = hyperscan.Database(chimera=True, mode=hyperscan.CH_MODE_GROUPS)
>>> db.compile(expressions=[rb'(a.) (b.)'])
>>> def on_match(*args, **kwargs):
...   print(args, kwargs)
...
>>> db.scan(b'1x ax bx cx', on_match)
(0, 3, 8, 0, [(1, 3, 8), (1, 3, 5), (1, 6, 8)], None) {}

When I get a chance I'll try adapting the eyecite library to use this -- that would exercise things a little more.

Given the new build requirements I'm curious about packaging. Prior to this, installing the requirements for this library on Debian was just:

echo 'deb http://deb.debian.org/debian buster-backports main' >> /etc/apt/sources.list.d/backports.list
apt-get -t buster-backports install -y libhyperscan-dev

And on Mac it is just brew install hyperscan.

Seems like this might bump up the benefits of static linking in #29? But I'm also curious if ideally Debian/homebrew/etc. should include chimera support in their ports -- I'm a little out of my depth about whether that would be a good idea in general.

FWIW I'm only seeing a size of 50MB rather than 200+, but I might be looking at the wrong thing:

root@c66cea9224eb:/eyecite# du -hs /opt/hyperscan/
51M	/opt/hyperscan/
root@c66cea9224eb:/eyecite# du -hs /python-hyperscan/
168K	/python-hyperscan/


darvid commented on June 11, 2024

@brianthelion looks like I forgot to document that the Chimera match event handler has a different signature - one of the arguments is actually a ch_capture struct, which in Python is a list of capture groups, which are tuples of (flags, start, end). See the official docs here.
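To make that signature concrete, here is a minimal sketch that replays the arguments from the REPL output earlier in the thread through a six-argument handler. It does not import hyperscan at all. The handler is called by hand, and the helper name decode_captures is made up for illustration.

```python
# Sketch only: replays the arguments printed by the rb'(a.) (b.)' scan earlier
# in this thread. Hyperscan is not needed to run it.

DATA = b'1x ax bx cx'


def decode_captures(data, captured):
    """Return the bytes covered by each (flags, start, end) capture tuple."""
    return [data[start:end] for _flags, start, end in captured]


def on_match(expr_id, start, end, flags, captured, context):
    # Six positional arguments: id, start, end, flags, then the capture
    # list, then the user-supplied context object.
    context.append((expr_id, decode_captures(DATA, captured)))


results = []
# Values copied from the scan output shown earlier in the thread.
on_match(0, 3, 8, 0, [(1, 3, 8), (1, 3, 5), (1, 6, 8)], results)
print(results)  # [(0, [b'ax bx', b'ax', b'bx'])]
```

Judging by that output, the first tuple covers the whole match and the following tuples cover groups 1 and 2 in order.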

Here's a quick and dirty example that I could think of for capturing group names. Maybe there is a better way with the Chimera support, but this is roughly how I would do it regardless of whether you're using Chimera or not:

#!/usr/bin/env python
import hyperscan

patterns = {
    '✨ α ✨': (b'1x', 0),
    '✨ β ✨': (b'ax', 0),
    '✨ γ ✨': (b'bx', 0),
    '✨ δ ✨': (b'cx', 0),
}


def match_event_handler(id, start, end, flags, captured, ctx):
    for i, (cap_flags, cap_from, cap_to) in enumerate(captured):
        print(f"Expr ID:             {id}")
        print(f"Capture flags:       {cap_flags}")
        print(f"Capture #:           {i}")
        print(f"Capture group name:  {ctx['group_names'][id]}")
        print(f"Test string:         {repr(ctx['chunk'])}")
        print(f"Captured:            {ctx['chunk'][cap_from:cap_to]}")
        print()
    print()


def main():
    db = hyperscan.Database(chimera=True, mode=hyperscan.CH_MODE_GROUPS)
    expressions, ids, flags = [], [], []
    group_names: dict[int, str] = {}
    for i, (group_name, (expr, flags_)) in enumerate(patterns.items()):
        ids.append(i)
        expressions.append(expr)
        flags.append(flags_)
        group_names[i] = group_name
    db.compile(expressions=expressions, ids=ids, flags=flags)
    chunk = b'1x ax bx cx'
    ctx = {
        "group_names": group_names,
        "chunk": chunk,
    }
    db.scan(chunk, match_event_handler, context=ctx)


if __name__ == '__main__':
    main()
$ python groups.py
Expr ID:             0
Capture flags:       1
Capture #:           0
Capture group name:  ✨ α ✨
Test string:         b'1x ax bx cx'
Captured:            b'1x'


Expr ID:             1
Capture flags:       1
Capture #:           0
Capture group name:  ✨ β ✨
Test string:         b'1x ax bx cx'
Captured:            b'ax'


Expr ID:             2
Capture flags:       1
Capture #:           0
Capture group name:  ✨ γ ✨
Test string:         b'1x ax bx cx'
Captured:            b'bx'


Expr ID:             3
Capture flags:       1
Capture #:           0
Capture group name:  ✨ δ ✨
Test string:         b'1x ax bx cx'
Captured:            b'cx'


darvid commented on June 11, 2024

hey - I have, and have avoided it because I wasn't sure anyone had a good use case for it - Chimera is essentially an entirely separate C API for interacting with scratch space, match event handlers, compiling, etc... even the flags are different.

so it will either require duplicating much of the Python interface to work with the Chimera side of Hyperscan, or some extensive use of macros to dynamically switch to the Chimera API.

I'll try to find time to take a closer look in the next couple of weeks and keep you updated.


darvid commented on June 11, 2024

> I only got as far as building a wheel in your Dockerfile and installing it in Debian

oh, in that case I don't think it requires (re-)building Hyperscan from source, that's good. I'll just update the documentation on building from source, which will require those additional steps.

> Seems like this might bump up the benefits of static linking in #29?

static linking would still be beneficial to people running LTS distros who don't/can't install a recent Hyperscan release from a PPA, but I don't know how many people/applications actually have that requirement (and also use this project). I personally think it would be a good idea for package maintainers to include Chimera support, yeah, or maybe add a secondary, "fatter" package with Chimera and static + shared libs.

> FWIW I'm only seeing a size of 50MB rather than 200+

so the /opt/hyperscan directory is just Hyperscan itself. the 200mb was coming from this extension statically linked to the libs, which definitely seems a bit much. looking back, I think I relied on pkg-config --static, and long story short, not every package that supports static linking ships a proper *.pc file for pkg-config to use. I think I must have linked glibc and other stuff, instead of just libhs.

TLDR:

python-hyperscan on chimera [!] is 📦 v0.2.0 via 🐍 v3.9.1+ (.venv) took 3s
$ ldd src/hyperscan/_hyperscan.cpython-39-x86_64-linux-gnu.so
        linux-vdso.so.1 (0x00007ffe627b1000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2884766000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2884574000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2884393000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2884378000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2884355000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2885563000)

python-hyperscan on chimera [!] is 📦 v0.2.0 via 🐍 v3.9.1+ (.venv)
$ ls -l src/hyperscan/_hyperscan.cpython-39-x86_64-linux-gnu.so
Permissions Size User  Date Modified Name
.rwxr-xr-x  15Mi david  9 Jul 21:50   src/hyperscan/_hyperscan.cpython-39-x86_64-linux-gnu.so

15mb is a lot better! but I think we may want to provide both a statically linked and dynamically linked option, so there might be work there. if you're down to test the statically linked wheel, I can push the updated build script to the chimera branch in a bit.
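For anyone who wants to repeat that check on their own build, here is a rough sketch that parses ldd output and reports whether any Hyperscan or PCRE shared library shows up. The sample text is the ldd output from this comment. The function name and the list of library prefixes are my own assumptions, and a missing libhs entry only suggests static linking, it does not prove it.

```python
# Sketch only: LDD_OUTPUT is the ldd listing pasted in this comment.
LDD_OUTPUT = """\
        linux-vdso.so.1 (0x00007ffe627b1000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2884766000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2884574000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2884393000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2884378000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2884355000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2885563000)
"""

# Assumed prefixes of the shared libraries we expect to be absent.
BUNDLED_PREFIXES = ('libhs', 'libchimera', 'libpcre')


def shared_deps(ldd_text):
    """Return the first field of every non-empty ldd line (the library name)."""
    return [line.split()[0] for line in ldd_text.splitlines() if line.strip()]


deps = shared_deps(LDD_OUTPUT)
dynamic_hs = [name for name in deps if name.startswith(BUNDLED_PREFIXES)]
print(deps)
print("dynamically linked Hyperscan/PCRE libs:", dynamic_hs or "none")
```

To check a real build, the same function could be fed the output of subprocess.run(["ldd", path], capture_output=True, text=True).stdout.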


jcushman commented on June 11, 2024

Cool to see this landed! Thanks for your work on it. Note that the readme still indicates 5.2 and no Chimera support.


darvid commented on June 11, 2024

thanks, updated


brianthelion commented on June 11, 2024

@darvid Thanks for this great addition!

Question: It appears that named capture groups are supported as part of PCRE2, but nothing like Python's re.Match.groupdict() is provided by Hyperscan out of the box. So how do we back into the group names from within the callback?

It seems that one option would be to use re.Pattern.groupindex to map group numbers onto names, but it's not totally obvious how the group numbers are represented in the callback args and kwargs in your example above:

>>> db.scan(b'1x ax bx cx', on_match)
(0, 3, 8, 0, [(1, 3, 8), (1, 3, 5), (1, 6, 8)], None) {}

Docstring for .scan says

         match_event_handler (callable): The match callback, which is
              invoked for each match result, and passed the expression
              id, start offset, end offset, flags, and a context object.

I count 5 args mentioned in the docstring versus 6 passed to the handler in the example code.

I'm going to try a few things and report back, but if you know the answer off the top of your head it would save me some fumbling. Thanks!


atomikalabs commented on June 11, 2024

First, thanks for all the work on this extremely useful library.

Regarding named capture groups, does the above approach lead to a case in which, for a pattern like b'(?<foo>ax) bx (?<bar>cx)', the names foo and bar can be recovered (along with their associated content) in the match handler?
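One way this could work, following the re.Pattern.groupindex idea raised earlier in the thread, is sketched below. It is not a confirmed python-hyperscan feature. It assumes the capture list arrives in group-number order with the whole match at index 0 (as the earlier scan output suggests) and that a nonzero capture flag means the group took part in the match. The pattern uses the (?P<name>...) spelling because Python's re rejects (?<name>...), while PCRE accepts both. The capture list here is simulated with re so the sketch runs without Hyperscan, and named_captures is a made-up helper.

```python
import re

# Python's re needs the (?P<name>...) spelling; PCRE accepts it as well.
PATTERN = rb'(?P<foo>ax) bx (?P<bar>cx)'
DATA = b'1x ax bx cx'

# groupindex maps name -> group number; invert it for lookups by number.
GROUP_NAMES = {num: name for name, num in re.compile(PATTERN).groupindex.items()}


def named_captures(data, captured, group_names):
    """Map group names to captured bytes, given (flags, start, end) tuples."""
    named = {}
    for number, (cap_flags, cap_from, cap_to) in enumerate(captured):
        name = group_names.get(number)
        # Assumption: a nonzero flag marks a group that actually matched.
        if name is not None and cap_flags:
            named[name] = data[cap_from:cap_to]
    return named


# Simulated capture list, standing in for the fifth argument that the
# Chimera match handler would receive for this pattern.
m = re.search(PATTERN, DATA)
captured = [(1, *m.span(i)) for i in range(m.re.groups + 1)]

print(named_captures(DATA, captured, GROUP_NAMES))
# {'foo': b'ax', 'bar': b'cx'}
```

In a real handler, the GROUP_NAMES table for each expression id could be passed in through the context argument, the same way group_names is passed in the example above.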

