libertyaces / bitswanpump

BitSwan Pump: A real-time stream processor for Python 3.5+
Home Page: https://bitswanpump.readthedocs.io
License: BSD 3-Clause "New" or "Revised" License
Pipeline 'Pipeline' stopped due to a processing error: 'list_iterator' object is not callable (<class 'TypeError'>)
The idea behind the AnalyzingSource is that the analyze() method of the matrix could be triggered by a source, with the output of the analysis fed into a pipeline that contains the AnalyzingSource. The cycle could be configured by a timer or a PubSub event, but it is also bounded by the duty cycle of the pipeline: analyze() will be called only if the pipeline is not throttled, which gives this system a self-load-balancing property.
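A minimal sketch of the idea, assuming hypothetical method names (is_ready(), analyze() and process() as used here are assumptions, not confirmed BSPump API):

```python
import asyncio

class AnalyzingSourceSketch:
    """
    Illustrative only: trigger the matrix analysis on a timer, but only
    when the downstream pipeline is not throttled.
    """

    def __init__(self, pipeline, matrix, period=5.0):
        self.Pipeline = pipeline
        self.Matrix = matrix
        self.Period = period

    async def main(self):
        while True:
            await asyncio.sleep(self.Period)
            # Self-load-balancing: skip the cycle while the pipeline is
            # throttled, so analyze() never adds to an existing backlog.
            if not self.Pipeline.is_ready():
                continue
            result = self.Matrix.analyze()
            self.Pipeline.process(result)
```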
BitSwanPump/bspump/socket/udp.py
Line 30 in 30b543f
https://stackoverflow.com/questions/28563518/buffer-size-for-reading-udp-packets-in-python
self.Socket.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, <value-from-config>)
Name of the config item: "receiver_buffer_size"
E.g. for "acks", KafkaProducer expects the value to be one of (0, 1, -1, "all"), but the ASAB file configuration always passes a value of type string (e.g. "-1"), which fails with ValueError("Invalid ACKS parameter").
The same issue applies to all int and boolean parameters: with these, KafkaProducer either raises an error or silently ignores the invalid string value.
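A minimal sketch of a coercion step that could run over the config values before they are handed to KafkaProducer (this helper is hypothetical, not existing BSPump code):

```python
def coerce_kafka_value(value):
    """
    Coerce string values coming from the ASAB file configuration into
    the types kafka-python expects, leaving everything else untouched.
    """
    if not isinstance(value, str):
        return value
    if value in ('true', 'True'):
        return True
    if value in ('false', 'False'):
        return False
    try:
        return int(value)  # "acks" = "-1" -> -1, "0" -> 0, "1" -> 1
    except ValueError:
        return value  # "all" and other legitimate strings pass through
```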
Currently it is not possible to get information about the table name where the modification occurred.
The loop should stay blocked at the end of the bin-log and wait for the next replication events.
Version: 20.03.04 and current master
With the secondary pipeline throttling propagation mechanism it can happen that a pipeline is throttled more than once in a short time. In this case KafkaSource._not_ready_handler runs twice at the same time and breaks the _commit error recovery. If _commit fails with a recoverable error, it starts spawning connections exponentially until "too many open files" and "Device or resource busy" are raised and the application crashes.
This issue was introduced in commit 4a8f24b
Reproducer application:
https://gist.githubusercontent.com/bedlaj/ccc602d8bbccb2abd890eb60dfe92840/raw/a8db22c54ab54edac1f8d90405f7fb914f2e6c05/bspump-kafka-fail.py
A possible fix is to introduce self._commit_ready = asyncio.Event(loop=app.Loop) and use it in the _commit method to ensure the error recovery is not running in parallel (see the sketch below).
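A minimal sketch of the proposed guard (the surrounding class is illustrative, not the actual KafkaSource; _do_commit stands in for the real Kafka commit call):

```python
import asyncio

class CommitGuardSketch:
    """
    Serialize _commit error recovery with an asyncio.Event so that two
    overlapping _not_ready_handler runs cannot start two recoveries.
    """

    def __init__(self):
        self._commit_ready = asyncio.Event()
        self._commit_ready.set()  # set = no recovery in flight

    async def _commit(self):
        # Wait for any recovery already in progress instead of
        # spawning another connection in parallel.
        await self._commit_ready.wait()
        self._commit_ready.clear()
        try:
            await self._do_commit()
        finally:
            self._commit_ready.set()

    async def _do_commit(self):
        pass  # placeholder for the real Kafka commit
```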
BitSwanPump/bspump/abc/matrix.py
Line 99 in 02a19be
Matrix fields should be reset to their default values when calling close_row(), because when we use numpy functions directly on the matrix, we get hits from closed rows until they are reused.
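A minimal sketch of the reset, assuming the matrix is a numpy structured array (the function and its arguments are illustrative, not the actual Matrix API):

```python
import numpy as np

def close_row_sketch(matrix, row_index, closed_rows):
    """
    Illustrative only: mark the row as closed and wipe its fields so
    that vectorized numpy operations no longer pick up stale values.
    """
    closed_rows.add(row_index)
    # np.zeros(1, dtype=...)[0] yields the default (zeroed) record
    # for whatever structured dtype the matrix uses.
    matrix[row_index] = np.zeros(1, dtype=matrix.dtype)[0]
```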
https://github.com/LibertyAces/BitSwanPump/blob/master/bspump/elasticsearch/sink.py#L29 ... remove "time", put "fixed"
There is no time_window_id or sessions_id in SA (SessionAnalyzer).
TimeWindowAnalyzer has the default value analyze_on_clock=False, but TimeWindowMatrix has the default value clock_driven=True.
While the meaning is somewhat different, these should still be unified to prevent confusion when initialized separately (usually, when the analyzer is not wall-clock driven, the matrix won't be either).
The current concept of column_names and column_formats is unfortunately broken and has to be refactored:
The matrix has the following dimensions:
row ... the dimension that is expected to grow and shrink over the lifetime of the matrix
column (and more) ... optional dimensions that are more or less fixed; a change is possible but unlikely
dtype ... the data type of a cell (!!!) of the matrix; this one is given and fixed over the whole lifecycle of the matrix. It is expected that the user will define his/her own type.
So creating dtype from column_* is incorrect.
SessionAnalyzer uses a 1D matrix with an arbitrary dtype.
TimeWindowAnalyzer and GeoAnalyzer use a 2D matrix with an arbitrary dtype.
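For illustration, a user-defined cell dtype expressed as a numpy structured type (assumed usage, not the current BSPump signature):

```python
import numpy as np

# The user defines the cell type once, up front; rows and columns
# are the dimensions that the matrix manages around it.
cell_dtype = np.dtype([
    ('count', 'i8'),
    ('sum', 'f8'),
])

# A 2D matrix (rows x columns) of such cells, as a TimeWindowAnalyzer
# would use; a SessionAnalyzer would use a 1D shape instead.
matrix = np.zeros((16, 8), dtype=cell_dtype)
```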
When max_partition_fetch_bytes is too high, or if there is a problem in processing (like a slow lookup), the Kafka source is unable to commit in time, stalls, and is unable to recover.
This way, any small unexpected network/DB glitch can indirectly kill the whole processing and force a BSPump restart.
Replace None values with real defaults in FileCSVSink.ConfigDefaults.
Python's csv formatting documentation
Related change in asab
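For illustration, real defaults taken from the csv module's dialect defaults (the exact config keys are an assumption about FileCSVSink):

```python
class FileCSVSinkSketch:
    # Mirrors the csv dialect defaults instead of None placeholders,
    # so values can be passed straight to csv.writer().
    ConfigDefaults = {
        'dialect': 'excel',
        'delimiter': ',',
        'doublequote': True,
        'escapechar': '',  # empty meaning "not set" (csv default is None)
        'lineterminator': '\r\n',
        'quotechar': '"',
        'quoting': 'QUOTE_MINIMAL',
        'skipinitialspace': False,
        'strict': False,
    }
```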
column_formats, column_names should be optional arguments, because if I specify matrix_id, I don't have to specify column_formats, column_names (+ it is confusing).
Add a check for providing either column_formats, column_names OR matrix_id ... but in the signature, all the mentioned arguments are optional.
Document this well (say: specify column_formats, column_names to create your own matrix, OR matrix_id to use an existing matrix).
Add column_names = column_names[:] (a defensive copy; see the sketch below).
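A minimal sketch of the argument check (the function stands in for the analyzer constructor; names and error messages are illustrative):

```python
def analyzer_init_sketch(column_formats=None, column_names=None, matrix_id=None):
    """
    Either a matrix is described (column_formats + column_names) or an
    existing one is referenced (matrix_id), never both.
    """
    if matrix_id is not None:
        if column_formats is not None or column_names is not None:
            raise ValueError(
                "Specify either column_formats/column_names OR matrix_id, not both."
            )
    else:
        if column_formats is None or column_names is None:
            raise ValueError(
                "Specify column_formats and column_names to create a matrix, "
                "or matrix_id to use an existing one."
            )
        # Defensive copy, so the caller's list cannot be mutated later.
        column_names = column_names[:]
    return column_formats, column_names, matrix_id
```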
There is unreachable code in
BitSwanPump/bspump/pipeline.py
Line 176 in 372a592
A recent refactoring broke the PyPy support.
It should be relatively easy to fix.
BitSwanPump/bspump/pipeline.py
Line 216 in 5a81ecd
21-Aug-2019 22:49:16.978449 ERROR root Task exception was never retrieved
{'future': <Task finished coro=<DHCPDeviceTracker.throttle() done, defined at /var/pumpy/DHCP/custommodules/DHCPDeviceTracker.py:101> exception=KeyError(DHCPDeviceTracker('DhcpEventInPipeline.DHCPDeviceTracker'))>}
Traceback (most recent call last):
File "/var/pumpy/DHCP/custommodules/DHCPDeviceTracker.py", line 105, in throttle
self.pipeline.throttle(self, enable=False)
File "/usr/local/lib/python3.7/site-packages/bspump/pipeline.py", line 216, in throttle
self._throttles.remove(who)
KeyError: DHCPDeviceTracker('DhcpEventInPipeline.DHCPDeviceTracker')
21-Aug-2019 22:49:17.032356 ERROR root Task exception was never retrieved
The KeyError should clearly say "Object ... is not present among throttlers".
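A minimal sketch of the clearer message (a simplified Pipeline.throttle(); the real method also re-evaluates the pipeline's ready state):

```python
def throttle(self, who, enable=True):
    # Simplified sketch only: raise a descriptive error instead of the
    # bare KeyError from set.remove().
    if enable:
        self._throttles.add(who)
    else:
        if who not in self._throttles:
            raise KeyError(
                "Object '{}' is not present among throttlers of the pipeline '{}'.".format(
                    who, self.Id
                )
            )
        self._throttles.remove(who)
```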
and replace client-side filtering.
BitSwanPump/bspump/analyzer/analyzer.py
Line 22 in 75c3c2e
def predicate(self, context, event):
def evaluate(self, context, event):
It should start streaming from the start, but no event is transmitted if Config['log_file'] = '' (the default).
A bunch of useful metrics has to be added to the Matrix implementation, such as the number of rows (total and closed) and the number of columns.
output_bucket_max_size needs to be cast to int.
Also, the timeout parameter is not used at all.
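A minimal sketch of the cast, assuming the values come from ASAB config as strings (the helper and defaults are illustrative):

```python
def parse_sink_config(config):
    """
    ASAB delivers config values as strings, so cast before use
    instead of comparing strings to numbers.
    """
    output_bucket_max_size = int(config.get('output_bucket_max_size', 1000))
    timeout = float(config.get('timeout', 30))  # and actually use it
    return output_bucket_max_size, timeout
```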
BitSwanPump/bspump/file/filecsvsource.py
Line 15 in b683de3
Add a bspump.ipc module with a Sink and a Source that allow interconnecting pipelines in different processes using UDP, TCP and Unix streams/datagrams.
The select statement should be optional (*, which could be a default value).
self.Table is to be renamed to self.From.
Officially support (that includes documentation) a more complex way to specify what to select ... see the example below:
'table': ' `DHCPSTATUS` JOIN RADIUS_USERDB_O2TV ON AP_SERVICEID = CETINSERVICEID JOIN RADIUSUSERS ON CASEID = TV '
... so that a BitSwan pipeline can read/write files from/to Google Drive.
We will likely start with "write" (Sink) part.
It is a special type of InternalSource, but instead of a Queue, there is a user-provided aggregation function.
It means that incoming events are aggregated by such a function (e.g. count, sum or moving average), and when the secondary pipeline (with the InternalSource) is ready, it picks up that aggregation.
Possible scenario: the secondary pipeline is not saturated, so no aggregation happens. Maybe some minimum (of time and/or events) should be defined that enforces the aggregation if needed.
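A minimal sketch of the aggregation behaviour (the put()/pick_up() hand-off and all names here are illustrative, not the InternalSource API):

```python
class AggregatingSourceSketch:
    """
    Instead of queueing every event, fold incoming events into an
    aggregate; the secondary pipeline picks up the aggregate when ready.
    """

    def __init__(self, aggregate_fn, initial):
        self.AggregateFn = aggregate_fn  # e.g. lambda acc, e: acc + e
        self.Initial = initial
        self.Accumulator = initial
        self.Count = 0

    def put(self, event):
        # Fold the event in; no per-event queueing.
        self.Accumulator = self.AggregateFn(self.Accumulator, event)
        self.Count += 1

    def pick_up(self):
        # Called when the secondary pipeline is ready; resets the fold.
        aggregate, count = self.Accumulator, self.Count
        self.Accumulator, self.Count = self.Initial, 0
        return aggregate, count
```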
The following (valid) configuration breaks the lookup:
```
config = {
    'from': ' `DHCPSTATUS` JOIN RADIUS_USERDB_O2TV ON AP_SERVICEID = CETINSERVICEID JOIN RADIUSUSERS ON CASEID = TV ',
    'key': 'SUBNET',
    'statement': ' SUBNET, KRAJ, UNIX_TIMESTAMP() as lookup_fetch_time ',
}
```
where `_count(self)` fetches a MySQL error ...
The option to specify an SQL template is a questionable solution, since we are unable to modify the arguments...
Another point is what value this function actually has: there are scenarios (like mine) where it will say 250k, but there are actually only 10k unique key-value pairs. The function counts the number of rows, not unique rows, so it is questionable how reliable it is.
BitSwanPump/bspump/analyzer/analyzer.py
Line 22 in 75c3c2e
... currently it is only in TimeWindowAnalyzer but it should be available also for SA and others.
The current lookup convention violates the standard set by other top-level objects (Sinks, Pipelines, etc.).
E.g. here:
Correct example:
def __init__(self, app, connection, id=None, config=None):
Incorrect example:
def __init__(self, app, lookup_id, mysql_connection, config=None):
Spotted deviances:
(missing app argument)
id/lookup_id should be None with a default generated from the class name
mysql_connection ... should be only connection
(the lookup_id argument should be called id)
Every lookup needs to be adjusted across the whole BitSwan BSPump.
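A minimal sketch of a conforming lookup constructor, following the correct example above (the default-id generation shown is an assumption about the convention, not the actual base-class code):

```python
class MySQLLookupSketch:

    def __init__(self, app, connection, id=None, config=None):
        self.App = app
        self.Connection = connection
        # Default id generated from the class name, as the other
        # top-level objects (Sinks, Pipelines, ...) do.
        self.Id = id if id is not None else self.__class__.__name__
        self.Config = config if config is not None else {}
```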
BitSwanPump/bspump/kafka/source.py
Line 38 in 52ee17b
When there is more complex (slow) processing in the pipeline, it may time out on the Kafka side and the source crashes (the first commit gets an error, and the retry fails as well).
The default value should be significantly more defensive (10% of the current one?).
The user can always increase this value if required.
This will allow streaming of events to continue from the last position persisted on the master node.
Top-level objects are:
Members to be added:
.Loop
.time()
.App
BitSwanPump/bspump/mysql/lookup.py
Line 61 in 30b543f
It will fail when multiple fields are used. It should be:
SELECT COUNT(*) FROM {...}
The statement field has no value here and only breaks things when multiple fields are specified.
Currently, the operation "index" is hardcoded.
"_id" cannot be set from user code.
There is no option to set the target index dynamically.
Other metadata (e.g. "doc_as_upsert") cannot be set from user code.
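A minimal sketch of a per-event bulk action where the metadata comes from the event context (the context keys used here are hypothetical, not current BSPump behaviour):

```python
import json

def build_bulk_action(event, context, default_index):
    """
    Illustrative only: derive the Elasticsearch bulk metadata from the
    event context instead of hardcoding the "index" operation.
    """
    operation = context.get('es_operation', 'index')  # e.g. "update"
    meta = {'_index': context.get('es_index', default_index)}
    if 'es_id' in context:
        meta['_id'] = context['es_id']
    # The bulk API takes one metadata line followed by one source line.
    return json.dumps({operation: meta}) + '\n' + json.dumps(event) + '\n'
```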
Currently, storage-backed lookups (such as the MySQL or Mongo lookups) provide a very naive implementation of the cache in the form of a dictionary. E.g.:
BitSwanPump/bspump/mysql/lookup.py
Line 44 in 65b289d
This needs to be extended so that the user can specify the cache replacement policy that suits his purpose best, including a custom caching strategy implemented completely by the user (see https://en.wikipedia.org/wiki/Cache_replacement_policies for the theory).
A cache replacement policy, respectively the cache itself, is implemented as a dictionary. The lookup expects the MutableMapping interface as defined in https://docs.python.org/3/library/collections.abc.html#collections.abc.MutableMapping .
The lookup should accept an optional parameter cache that will be used for the cache attribute (the default is the current dict() until we have a saner implementation).
Additionally, we should implement a bspump.cache package with a few cache replacement policies (e.g. bspump.cache.LRUCacheDict() ...).
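A minimal sketch of what bspump.cache.LRUCacheDict() could look like: a MutableMapping (so it can drop in for the plain dict() cache) that evicts the least recently used entry once a size limit is exceeded. The class name and max_size parameter are assumptions:

```python
import collections
import collections.abc

class LRUCacheDictSketch(collections.abc.MutableMapping):
    """
    Illustrative LRU cache: usable wherever the lookup expects a
    MutableMapping, with bounded memory usage.
    """

    def __init__(self, max_size=10000):
        self.MaxSize = max_size
        self.Data = collections.OrderedDict()

    def __getitem__(self, key):
        value = self.Data[key]
        self.Data.move_to_end(key)  # mark as recently used
        return value

    def __setitem__(self, key, value):
        self.Data[key] = value
        self.Data.move_to_end(key)
        if len(self.Data) > self.MaxSize:
            self.Data.popitem(last=False)  # evict the LRU entry

    def __delitem__(self, key):
        del self.Data[key]

    def __iter__(self):
        return iter(self.Data)

    def __len__(self):
        return len(self.Data)
```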
Some lookups can be huge, so they cannot be fully initialized during the start of the BSPump.
When the MySQL sink receives a value such as an integer (this will be relevant for other non-strings as well), it crashes.