Comments (3)
Here is a sample I took from the tail end of the syslog (just to give you an idea of what's being parsed):
Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)
Jul 1 23:00:01 hostname crontab[10779]: (root) LIST (root)
Jul 1 23:00:01 hostname crontab[10780]: (root) LIST (root)
Jul 1 20:20:01 hostname sm-msp-queue[28678]: My unqualified host name (hostname) unknown; sleeping for retry
Jul 1 20:21:01 hostname sm-msp-queue[28678]: unable to qualify my own domain name (hostname) -- using short name
from python-hyperscan.
A few things:
First off, I don't think Hyperscan supports capture groups. You can test this by compiling a pattern like foo(bar), scanning 'foobar', and noticing that it only fires one match callback, with 0 and 6 as the start/end offsets. So bear that in mind if you're trying to use Hyperscan to parse log messages.
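If you do need the individual fields, one workaround is to treat Hyperscan as a fast filter and run an ordinary capturing regex only on the span it reports. Here is a minimal sketch of that second stage using just Python's standard re module. The name extract_fields, the SYSLOG_PREFIX pattern, and the hard-coded offsets standing in for a match callback's values are my own assumptions for illustration, not part of python-hyperscan:

```python
import re

# Second-stage pattern with named groups (hypothetical; adjust to your log format).
SYSLOG_PREFIX = re.compile(
    rb"(?P<month>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2})"
)


def extract_fields(data, start, end):
    """Re-run a capturing regex on the span a matcher reported as data[start:end]."""
    m = SYSLOG_PREFIX.fullmatch(data, start, end)
    if m is None:
        return None
    return {name: value.decode() for name, value in m.groupdict().items()}


line = b"Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)"
# Pretend a match callback reported offsets 0 and 14 for this line.
fields = extract_fields(line, 0, 14)
print(fields)
```

The capturing regex only ever sees text that has already been flagged, so the slower engine stays off the hot path.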
I think the end offsets you're seeing make sense based on the regex you are using. Here's your script modified to add some debugging context:
import hyperscan
from time import time

# timestamp regex
_timestamp = r"(\d{2}:\d{2}:\d{2})?"

# format: [timestamp] 00[00]/00/00[00] [timestamp]
_standard_date = (
    r"%s( )?"  # timestamp
    r"(([1-2][0-9]{3}|3000)|([0-2]?[0-9]|3[0-1]))"  # 00[00]
    r"(\/|-)"  # / or -
    r"([0-2]?[0-9]|3[0-1])"  # 00
    r"(\/|-)"  # / or -
    r"(([1-2][0-9]{3}|3000)|([0-2]?[0-9]|3[0-1]))"  # 00[00]
    r"%s( )?"  # timestamp
) % (_timestamp, _timestamp)

# format: [timestamp] [day or year] [month] [year or day] [timestamp] - not specifically in that order
_written_date = (
    r"%s( )?"
    r"(?:"
    r"([1-2][0-9]{3}|3000)|(([0-2]?[0-9]|3[0-1])(-([0-2]?[0-9]|3[0-1]))?)"
    r")?(?:, | of )?"
    r"(?:"
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?"
    r"|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?"
    r"|Nov(?:ember)?|Dec(?:ember)?)"
    r")(?:,|\.)?( )?"
    r"("
    r"([1-2][0-9]{3}|3000)|(([0-2]?[0-9]|3[0-1])(-([0-2]?[0-9]|3[0-1]))?)"
    r")?"
    r"( )?%s"
) % (_timestamp, _timestamp)


def database_stream():
    db = hyperscan.Database(mode=(hyperscan.HS_MODE_STREAM | hyperscan.HS_MODE_SOM_HORIZON_LARGE))
    expressions = (_standard_date.encode(), _written_date.encode())
    print(expressions[1])
    ids = (0, 1)
    flags = (0, 0)
    db.compile(expressions=expressions, ids=ids, elements=len(expressions), flags=flags)
    return db


def done(eid, start, end, flags, context):
    """
    On a found result:
    - Expression ID
    - Start offset
    - End offset
    - Flags (currently unused)
    - Context object passed to scan
    """
    print(context['line'])
    # Stream offsets count from the start of the stream, so shift them to be
    # line-relative. Clamp start at 0: no start-of-match flag is set on the
    # patterns, so start should come back as 0 rather than the real match start.
    start = max(start - context['offset'], 0)
    end -= context['offset']
    print('eid: {} match: {}'.format(eid, context['line'][start:end].decode()))


def test_stream_scan(database_stream):
    with open("syslog.txt", "rb") as f:
        lines = f.read().splitlines()
    start = time()
    offset = 0
    with database_stream.stream(match_event_handler=done) as stream:
        for line in lines:
            stream.scan(line, context={'line': line, 'offset': offset})
            offset += len(line)
    print("lines/sec:", len(lines) / (time() - start))
    print("lines:", len(lines))


if __name__ == "__main__":
    stream = database_stream()
    test_stream_scan(stream)
And the output:
b'Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)'
eid: 1 match: Jul
b'Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)'
eid: 1 match: Jul
b'Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)'
eid: 1 match: Jul 1
b'Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)'
eid: 1 match: Jul 1
b'Jul 1 23:00:01 hostname CRON[10728]: (root) CMD (timeshift --check)'
eid: 1 match: Jul 1 23:00:01
b'Jul 1 23:00:01 hostname crontab[10779]: (root) LIST (root)'
eid: 1 match: Jul
b'Jul 1 23:00:01 hostname crontab[10779]: (root) LIST (root)'
eid: 1 match: Jul
b'Jul 1 23:00:01 hostname crontab[10779]: (root) LIST (root)'
eid: 1 match: Jul 1
b'Jul 1 23:00:01 hostname crontab[10779]: (root) LIST (root)'
eid: 1 match: Jul 1
b'Jul 1 23:00:01 hostname crontab[10779]: (root) LIST (root)'
eid: 1 match: Jul 1 23:00:01
b'Jul 1 23:00:01 hostname crontab[10780]: (root) LIST (root)'
eid: 1 match: Jul
b'Jul 1 23:00:01 hostname crontab[10780]: (root) LIST (root)'
eid: 1 match: Jul
b'Jul 1 23:00:01 hostname crontab[10780]: (root) LIST (root)'
eid: 1 match: Jul 1
b'Jul 1 23:00:01 hostname crontab[10780]: (root) LIST (root)'
eid: 1 match: Jul 1
b'Jul 1 23:00:01 hostname crontab[10780]: (root) LIST (root)'
eid: 1 match: Jul 1 23:00:01
b'Jul 1 20:20:01 hostname sm-msp-queue[28678]: My unqualified host name (hostname) unknown; sleeping for retry'
eid: 1 match: Jul
b'Jul 1 20:20:01 hostname sm-msp-queue[28678]: My unqualified host name (hostname) unknown; sleeping for retry'
eid: 1 match: Jul
b'Jul 1 20:20:01 hostname sm-msp-queue[28678]: My unqualified host name (hostname) unknown; sleeping for retry'
eid: 1 match: Jul 1
b'Jul 1 20:20:01 hostname sm-msp-queue[28678]: My unqualified host name (hostname) unknown; sleeping for retry'
eid: 1 match: Jul 1
b'Jul 1 20:20:01 hostname sm-msp-queue[28678]: My unqualified host name (hostname) unknown; sleeping for retry'
eid: 1 match: Jul 1 20:20:01
b'Jul 1 20:21:01 hostname sm-msp-queue[28678]: unable to qualify my own domain name (hostname) -- using short name'
eid: 1 match: Jul
b'Jul 1 20:21:01 hostname sm-msp-queue[28678]: unable to qualify my own domain name (hostname) -- using short name'
eid: 1 match: Jul
b'Jul 1 20:21:01 hostname sm-msp-queue[28678]: unable to qualify my own domain name (hostname) -- using short name'
eid: 1 match: Jul 1
b'Jul 1 20:21:01 hostname sm-msp-queue[28678]: unable to qualify my own domain name (hostname) -- using short name'
eid: 1 match: Jul 1
b'Jul 1 20:21:01 hostname sm-msp-queue[28678]: unable to qualify my own domain name (hostname) -- using short name'
eid: 1 match: Jul 1 20:21:01
lines/sec: 8361.850079744816
lines: 5
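A side note on the offsets: in streaming mode they count from the start of the whole stream, not from the start of the chunk being scanned, which is why the script carries a running offset in the context. If you would rather not pass a context around, the same bookkeeping can be done after the fact. A small sketch in plain Python, where line_starts and locate are hypothetical helper names of mine and nothing here calls Hyperscan:

```python
from bisect import bisect_left
from itertools import accumulate


def line_starts(lines):
    """Stream offset at which each line begins, assuming lines are scanned back to back."""
    return [0] + list(accumulate(len(line) for line in lines))[:-1]


def locate(end_offset, starts):
    """Map a stream-wide end offset to (line index, end offset within that line)."""
    # End offsets are exclusive, so an end equal to the next line's start
    # still belongs to the previous line.
    idx = bisect_left(starts, end_offset) - 1
    return idx, end_offset - starts[idx]


lines = [b"Jul 1 a", b"Jul 2 b"]
starts = line_starts(lines)
print(locate(10, starts))
```

This assumes the lines are fed to the stream without their newline characters, as in the script above.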
The duplicate matches come from the zero-or-one quantifier (?) at the end of the capture groups. As far as I know, Hyperscan ignores quantifier greediness and reports a match at every end offset where the pattern can match, so an optional trailing space gives you two matches, one without the space and one with. Since you have a zero-or-one quantifier at the end of each portion of the timestamp, you get multiple matches, e.g. "Jul", "Jul 1", and "Jul 1 23:00:01".
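To see how the optional pieces multiply the matches, here is a rough emulation that uses only Python's re module, not Hyperscan itself. It brute-forces every end offset at which a pattern matches some substring, which is my understanding of how the matches get reported. The function all_match_ends and both patterns are simplified stand-ins I made up for illustration:

```python
import re


def all_match_ends(pattern, text):
    """Return every end offset at which `pattern` matches some substring of `text`.

    This brute-force loop only imitates all-matches reporting; it is not how
    Hyperscan works internally.
    """
    compiled = re.compile(pattern)
    ends = set()
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if compiled.fullmatch(text, start, end):
                ends.add(end)
    return sorted(ends)


sample = "Jul 1 23:00:01 hostname"

# Simplified stand-in for the written-date pattern: every piece after the month is optional.
optional_pieces = r"Jul( )?(\d{1,2})?( )?(\d{2}:\d{2}:\d{2})?"
# The same pieces, but all required.
required_pieces = r"Jul \d{1,2} \d{2}:\d{2}:\d{2}"

for end in all_match_ends(optional_pieces, sample):
    print(repr(sample[:end]))
print(all_match_ends(required_pieces, sample))
```

The optional version reports five end offsets ('Jul', 'Jul ', 'Jul 1', 'Jul 1 ', 'Jul 1 23:00:01'), the same count per line as the output above, while the required version reports only one. The trade-off is that the required version no longer matches dates that are missing a piece.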
Hope that makes sense.
@darvid Thanks for the quick response.
With the example you've given, I understand much more than before. I can probably build a working model from your explanation. Much appreciated.