
Comments (3)

imgurbot12 commented on June 1, 2024

Here is a sample I took from the tail end of the syslog (just to give you an idea of what's being parsed):

Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)
Jul 1 23:00:01 hostname crontab[10779]: (root) LIST (root)
Jul 1 23:00:01 hostname crontab[10780]: (root) LIST (root)
Jul 1 20:20:01 hostname sm-msp-queue[28678]: My unqualified host name (hostname) unknown; sleeping for retry
Jul 1 20:21:01 hostname sm-msp-queue[28678]: unable to qualify my own domain name (hostname) -- using short name

from python-hyperscan.

darvid commented on June 1, 2024

Few things:

First off, I don't think Hyperscan supports capture groups. You can test this by compiling a pattern like foo(bar), scanning 'foobar', and noticing that it only fires one match callback, with 0 and 6 as the start/end offsets. So bear that in mind if you're trying to use Hyperscan to parse log messages.
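
If the groups are still needed, one possible workaround (a sketch only, not something proposed in this thread) is to let the scanner find the span and then re-run Python's standard `re` module on just that span. The helper name `groups_for_span` is hypothetical, and the offsets 0 and 6 are the ones described above for 'foobar'. Note that `re` and Hyperscan do not accept exactly the same syntax, so this assumes a pattern both can handle.

```python
import re

# Hypothetical helper, not part of python-hyperscan: recover capture groups
# from a span that a match callback reported.
def groups_for_span(data, start, end, pattern):
    # fullmatch(string, pos, endpos) requires the pattern to cover
    # exactly data[start:end].
    m = pattern.fullmatch(data, start, end)
    return m.groups() if m else None

PATTERN = re.compile(rb"foo(bar)")

# Offsets as described above for scanning b"foobar": start 0, end 6.
print(groups_for_span(b"foobar", 0, 6, PATTERN))
```

This keeps the fast scanner for locating candidates and pays the cost of a backtracking regex only on the bytes that already matched.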

I think the end offsets you're seeing make sense based on the regex you are using. Here's your script modified to add some debugging context:

import hyperscan
from time import time

#timestamp regex
_timestamp = r"(\d{2}:\d{2}:\d{2})?"

#format: [timestamp] 00[00]/00/00[00] [timestamp]
_standard_date = (
    r"%s( )?"                                       #timestamp
    r"(([1-2][0-9]{3}|3000)|([0-2]?[0-9]|3[0-1]))"  #00[00]
    r"(\/|-)"                                       #/ or -
    r"([0-2]?[0-9]|3[0-1])"                         #00
    r"(\/|-)"                                       #/ or -
    r"(([1-2][0-9]{3}|3000)|([0-2]?[0-9]|3[0-1]))"  #00[00]
    r"%s( )?"                                       #timestamp
)%(_timestamp,_timestamp)

#format: [timestamp] [day or year] [month] [year or day] [timestamp] - not specifically in that order
_written_date = (
    r"%s( )?"
    r"(?:"
        r"([1-2][0-9]{3}|3000)|(([0-2]?[0-9]|3[0-1])(-([0-2]?[0-9]|3[0-1]))?)"
    r")?(?:, | of )?"
    r"(?:"
        r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?"
        r"|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?"
        r"|Nov(?:ember)?|Dec(?:ember)?)"
    r")(?:,|\.)?( )?"
    r"("
        r"([1-2][0-9]{3}|3000)|(([0-2]?[0-9]|3[0-1])(-([0-2]?[0-9]|3[0-1]))?)"
    r")?"
    r"( )?%s"
)%(_timestamp,_timestamp)

def database_stream():
    db = hyperscan.Database(mode=(hyperscan.HS_MODE_STREAM | hyperscan.HS_MODE_SOM_HORIZON_LARGE))
    expressions = (_standard_date.encode(), _written_date.encode())
    print(expressions[1])
    ids = (0, 1)
    flags = (0, 0)
    db.compile(expressions=expressions, ids=ids, elements=len(expressions), flags=flags)
    return db

def done(eid, start, end, flags, context):
    """
    On a found result:
        -Expression ID
        -Start Offset
        -End Offset
        -Flags (Currently Unused)
        -Context object passed to scan
    """
    print(context['line'])
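    # No start-of-match flag is set at compile time, so start is reported as 0;
    # clamp at 0 so the slice never begins at a negative (end-relative) index.
    start = max(start - context['offset'], 0)
    end -= context['offset']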
    print('eid: {} match: {}'.format(eid, context['line'][start:end].decode()))

def test_stream_scan(database_stream):
    with open("syslog.txt", "rb") as f: lines = f.read().splitlines()
    start = time()
    offset = 0
    with database_stream.stream(match_event_handler=done) as stream:
        for line in lines:
            stream.scan(line, context={'line': line, 'offset': offset})
            offset += len(line)
    print("lines/sec:", len(lines)/(time()-start))
    print("lines:", len(lines))

if __name__ == "__main__":
    stream = database_stream()
    test_stream_scan(stream)
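
In streaming mode the offsets passed to the callback are relative to the start of the whole stream, not the current chunk, which is why the script above carries a running offset. A minimal sketch of that bookkeeping in plain Python follows; the helper `line_relative` and the "reported" offsets are made up for illustration and no Hyperscan call is involved.

```python
def line_relative(start, end, line_offset):
    """Convert stream-absolute offsets to offsets within the current line."""
    # Without a start-of-match flag the reported start is 0, so clamp at 0
    # instead of producing a negative slice index.
    return max(start - line_offset, 0), end - line_offset

lines = [b"Jul 1 23:00:01 a", b"Jul 1 23:00:02 b"]
offset = 0
spans = []
for line in lines:
    # Pretend the scanner reported a match ending 14 bytes into this line,
    # with start 0 (hypothetical values for illustration).
    reported_start, reported_end = 0, offset + 14
    s, e = line_relative(reported_start, reported_end, offset)
    spans.append(line[s:e])
    offset += len(line)  # bytes fed to the stream so far

print(spans)
```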

And the output:

b'Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)'
eid: 1 match: Jul
b'Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)'
eid: 1 match: Jul
b'Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)'
eid: 1 match: Jul 1
b'Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)'
eid: 1 match: Jul 1
b'Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)'
eid: 1 match: Jul 1 23:00:01
b'Jul 1 23:00:01 hostname crontab[10779]: (root) LIST (root)'
eid: 1 match: Jul
b'Jul 1 23:00:01 hostname crontab[10779]: (root) LIST (root)'
eid: 1 match: Jul
b'Jul 1 23:00:01 hostname crontab[10779]: (root) LIST (root)'
eid: 1 match: Jul 1
b'Jul 1 23:00:01 hostname crontab[10779]: (root) LIST (root)'
eid: 1 match: Jul 1
b'Jul 1 23:00:01 hostname crontab[10779]: (root) LIST (root)'
eid: 1 match: Jul 1 23:00:01
b'Jul 1 23:00:01 hostname crontab[10780]: (root) LIST (root)'
eid: 1 match: Jul
b'Jul 1 23:00:01 hostname crontab[10780]: (root) LIST (root)'
eid: 1 match: Jul
b'Jul 1 23:00:01 hostname crontab[10780]: (root) LIST (root)'
eid: 1 match: Jul 1
b'Jul 1 23:00:01 hostname crontab[10780]: (root) LIST (root)'
eid: 1 match: Jul 1
b'Jul 1 23:00:01 hostname crontab[10780]: (root) LIST (root)'
eid: 1 match: Jul 1 23:00:01
b'Jul 1 20:20:01 hostname sm-msp-queue[28678]: My unqualified host name (hostname) unknown; sleeping for retry'
eid: 1 match: Jul
b'Jul 1 20:20:01 hostname sm-msp-queue[28678]: My unqualified host name (hostname) unknown; sleeping for retry'
eid: 1 match: Jul
b'Jul 1 20:20:01 hostname sm-msp-queue[28678]: My unqualified host name (hostname) unknown; sleeping for retry'
eid: 1 match: Jul 1
b'Jul 1 20:20:01 hostname sm-msp-queue[28678]: My unqualified host name (hostname) unknown; sleeping for retry'
eid: 1 match: Jul 1
b'Jul 1 20:20:01 hostname sm-msp-queue[28678]: My unqualified host name (hostname) unknown; sleeping for retry'
eid: 1 match: Jul 1 20:20:01
b'Jul 1 20:21:01 hostname sm-msp-queue[28678]: unable to qualify my own domain name (hostname) -- using short name'
eid: 1 match: Jul
b'Jul 1 20:21:01 hostname sm-msp-queue[28678]: unable to qualify my own domain name (hostname) -- using short name'
eid: 1 match: Jul
b'Jul 1 20:21:01 hostname sm-msp-queue[28678]: unable to qualify my own domain name (hostname) -- using short name'
eid: 1 match: Jul 1
b'Jul 1 20:21:01 hostname sm-msp-queue[28678]: unable to qualify my own domain name (hostname) -- using short name'
eid: 1 match: Jul 1
b'Jul 1 20:21:01 hostname sm-msp-queue[28678]: unable to qualify my own domain name (hostname) -- using short name'
eid: 1 match: Jul 1 20:21:01
lines/sec: 8361.850079744816
lines: 5

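The duplicate matches come from the zero-or-one quantifiers (?) at the end of the capture groups. Hyperscan does not settle on a single greedy match; it fires the callback at every end offset where the pattern matches, so each optional trailing piece gives you two matches, one without it and one with it. "Jul" and "Jul " (with the trailing space) print identically, which is why they look like duplicates. Since you have a zero-or-one quantifier at the end of each portion of the timestamp, you get multiple matches per line, e.g. "Jul", "Jul 1", and "Jul 1 23:00:01".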
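
The behaviour described above can be imitated without Hyperscan, which may help when reasoning about a pattern. The sketch below uses the standard `re` module to list every end offset at which a pattern matches from the start of the line, then keeps only the longest one. `PATTERN` is a simplified stand-in for the written-date expression (an assumption for brevity), and both function names are hypothetical.

```python
import re

# Simplified stand-in for the written-date pattern above (an assumption made
# for brevity, not the original expression).
PATTERN = re.compile(r"Jul( )?(\d{1,2})?( )?(\d{2}:\d{2}:\d{2})?")

def all_match_ends(line):
    """Return every end offset at which the pattern matches from offset 0."""
    return [end for end in range(1, len(line) + 1)
            if PATTERN.fullmatch(line, 0, end)]

def longest_match(line):
    """Keep only the longest of the reported matches, or None if there are none."""
    ends = all_match_ends(line)
    return line[:max(ends)] if ends else None

line = "Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)"
print(all_match_ends(line))
print(repr(longest_match(line)))
```

For the sample line this yields five end offsets, mirroring the five callbacks per line in the output above. In a real match handler, the same idea would mean collecting the matches for a line and keeping the one with the largest end offset.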

Hope that makes sense.


imgurbot12 commented on June 1, 2024

@darvid Thanks for the quick response.
With the example you've given, I understand much more than before. I can probably build a working model from your explanation. Much appreciated.

