onyx-platform / onyx Goto Github PK

View Code? Open in Web Editor NEW

2.0K 122.0 204.0 16.61 MB

Distributed, masterless, high performance, fault tolerant data processing

Home Page: http://www.onyxplatform.org

License: Eclipse Public License 1.0

Clojure 99.72% Shell 0.28%

clojure distributed streaming batch data

onyx's People

Contributors

Stargazers

Watchers

Forkers

ahoy-jon zoowii vijaykiran ghosthamlet murraju esaul hlprmnky prasincs malcolmsparks martindale nullnotfound gsh45 snormore bblanton rukor ahammel l1x jwymanm ebottabi masegraye sourceops cswaroop brennonyork narma ilovejs jnbala johnswanson neuroradiology yuppiechef mpenet hardikus zoka gregvirool jackdempsey flaing lbradstreet leathekd erichmond weaver-viii tomasu82 ravisinghsfbay jethrotan cloudbring podviaznikov sleyzerzon noisesmith siathalysedi tvanhens michael-okeefe darth10 mccraigmccraig lenaschoenburg ideal-knee hkrishnan kevingreene fjdoria76 iperdomo deraen yashodhank kendru rowhit lgscofield tolitius extremenelson metasoarous jocrau mushketyk gigafone zhouwanfeng kakamessi99 reaperhulk zhanghuabin tiensonqin mindis tharanga-abeyseela fengshao0907 jango2015 prayagupa it-crasy dlnufox delkyd lemonhall rasom qingniufly clojurians-org neverfox greywolve devth jcf fysoft2006 mariusz-jachimowicz-83 miaohongbiao malesch anusornc dhineshns honne davengeo liutigger007 colinhicks leolujuyi

onyx's Issues

Cast job-id to str in `await-job-completion`

This API will compare it to a UUID and fail. Do the cast inside the function so users don't need to worry about it.

Alternate peer balancing strategy

As of 0.3.0, the only strategy for balancing peers across jobs and tasks is round robin/breadth-first, respectively. This ticket should break out the algorithms used for planning and coordination into functions behind multimethods, and allow for a greedy strategy. A greedy strategy will try to complete an entire job before moving on to the next.

Shutdown protocol needs to be reimplemented for dev-env and peer

Removed while construction took place on 0.5.0.

Maximum peers per task option

The catalog should off an optional onyx/max-peers parameter that takes an integer value representing the maximum number of peers that may be executing an instance of that task at any single point in time.

Peer can deadlock on task completion

Reproduced in core.async plugin tests with a high number of virtual peers. Sometimes, closing a peer will block as it tries to flush its pipeline. The pipeline will block on reading from an ingress queue. This queue should always provide the sentinel value. Something is hanging on to the sentinel as a consumer and never committing it back to the queue, hence the hang.

Virtual peers can be starved on Grouping tasks

This is a particularly rough edge case. If a virtual peer receives a grouping task, it's capable of being starved from receiving the sentinel segment off the queue due to the way that HornetQ pins messages as it groups. If a consumer closes out, it might not necessarily requeue the sentinel in a server node where other consumers can reach it. Hence, the other virtual peers may deadlock and wait forever. This only affects batch mode - streaming mode is fine.

Add :onyx/params to catalog entry

Allow value-level parameterization through the catalog.

[{...
  :my/param 42
  :my/other-param 44
  :onyx/params [:my/param :my/other-param]}]

Compress the user guide into a single page

Should enable better searchability of the docs.

Any task that follows a sequential task emits an exception

With any task proceeded by a sequential task, the following exception will be thrown:

org.hornetq.api.core.HornetQInternalErrorException: HQ119000: ClientSession closed while creating session
    type: #<INTERNAL_ERROR>
org.hornetq.core.client.impl.ClientSessionFactoryImpl.createSessionInternal      ClientSessionFactoryImpl.java:  782
        org.hornetq.core.client.impl.ClientSessionFactoryImpl.createSession      ClientSessionFactoryImpl.java:  366
                               sun.reflect.NativeMethodAccessorImpl.invoke0      NativeMethodAccessorImpl.java      
                                sun.reflect.NativeMethodAccessorImpl.invoke      NativeMethodAccessorImpl.java:   57
                            sun.reflect.DelegatingMethodAccessorImpl.invoke  DelegatingMethodAccessorImpl.java:   43
                                            java.lang.reflect.Method.invoke                        Method.java:  606
                                clojure.lang.Reflector.invokeMatchingMethod                     Reflector.java:   93
                           clojure.lang.Reflector.invokeNoArgInstanceMember                     Reflector.java:  313
                                            onyx.queue.hornetq/eval20966/fn                        hornetq.clj:  201
                                                clojure.lang.MultiFn.invoke                       MultiFn.java:  231
                                       onyx.peer.operation/start-lifecycle?                      operation.clj:   55
                                           onyx.peer.transform/eval21156/fn                      transform.clj:   95
                                                clojure.lang.MultiFn.invoke                       MultiFn.java:  231
                    onyx.peer.task-lifecycle-extensions/merge-api-levels/fn      task_lifecycle_extensions.clj:   19
                                             clojure.lang.ArrayChunk.reduce                    ArrayChunk.java:   63
                                                  clojure.core.protocols/fn                      protocols.clj:   98
                                                clojure.core.protocols/fn/G                      protocols.clj:   19

The virtual peer will shutdown and instantly reboot, continuing as normal. This bug is mostly harmless. It is causes by the concurrent optimizations set for the HornetQ configuration. The Session Factory is swapped out in favor of a different factory, but the new factory doesn't "stick" for new tasks. Virtual peers reuse the old Session Factory that has been closed. The exception is thrown. After reboot, a fresh Session Factory is used.

Harmless, but annoying to see in the logs.

Support full DAG workflows

In 0.4.0, we're going to move away from the tree/map based workflow to a vector-of-vectors. This will properly support multi input streams to any task, and continue to support multiple output streams. It will look like this:

[[:in-1 :inc]
 [:in-2 :inc]
 [:in-3 :inc]
 [:inc :out]]

Tasks:

Parameterize back-off timeout when peer fails to join cluster

Hard coded for 250ms.

Processor spikes after multiple Onyx start/stop cycles

Observing high processor load after starting and stopping Onyx many times in the same repl session. Reproducing what @prasincs saw a few weeks ago.

Standardize naming of lifecycle event map

There's been some confusion around what the difference between the "event map", "lifecycle event map", and "context map" are. They are all the same thing. This should be fixed in the docs. I think I'd like to choose "lifecycle event" as the canonical term.

Rename `await-job-completion` API

Seems like this function should block, but actually returns a future.

Task that can complete when only 1 input has been exhausted

Considering a feature that will let a task complete when only one (not all) of its upstream inputs have pushed the sentinel onto the input stream. This would aid use cases where a privileged kill stream is utilized.

Just a placeholder, needs more thought.

Use LMAX Disruptor for local execution

When we're running locally, it might be better for performance to use LMAX Disruptor instead of HornetQ for messaging. Requires benchmarking.

Validate workflow DAG

As mentioned in #2, workflow should be validated such that only input tasks are missing incoming edges, only output tasks should be missing output edges, and DAG should not have any cycles (dependency will throw an exception for you when creating the graph).

Implement Coverage Protection

Coverage Protection is described here: https://github.com/MichaelDrogalis/onyx/blob/12c72be61d056446b1b7fe0a54a33782bbedc03b/doc/design/masterless.md#partial-coverage-protection

Monitoring dashboard

This issue serves as a placeholder for the creation of another repository - onyx-dashboard. This dashboard will serve as a point of monitoring the status of what's happening inside Onyx by querying ZooKeeper. The data in ZooKeeper is immutable, and compressed with Fressian.

Onyx should throw a helpful error when lifecycle methods do not return maps

The exception right now is confusing. start-lifecycle?, inject-lifecycle-resources, inject-temporal-resources,close-temporal-resources, andclose-lifecycle-resources` all need a type check on their return values.

Update masterless examples

Some of the examples are incorrect due to design changes, or require pictures for better explanation.

Change logging to be less chatty

Unsure of how I want it to look, but make it less verbose.

Update log command API reference

Command section is mostly correct, but needs a final pass to ensure that it's update to date.

submit-job hangs if catalog contains task maps with non-serializable values

If you supply a core async channel (or other non-serializable object) in a task map within a catalog, and submit-job, it will hang silently.

Ideally there should be some kind of validation of the catalog for serializabilty.

Add log rotation to Onyx logging configuration

Logs gets pretty huge after a short time. Add Rotor to Timbre to fix that.

Job throwing an exception crashes the peer

If a task throws an exception, the Peer crashes and isn't able to service additional work.

Plugin request: HDFS

This plugin should be capable of reading off the Hadoop file system and writing segments back to it. The point of input and partitioning should be a single file, and the partitioning will happen over the byte sequence representing the file distributed over blocks in the cluster.

why virtual peer use so many threads and channels

Hi, I wonder why virtual peer use 15 threads to process data? Are there other considerations?
(-> (fn1 event) fn2 fn3), isn't this simpler?

Reintroduce task timeouts

In a very early version of Onyx, if the batch size of messages didn't accrue within a certain period of time, Onyx wouldn't attempt to keep reading and time out. This was removed due to a bug in HornetQ that didn't preserve sequential ordering. This is useful for sparse message streams, so I have found a workaround to add this back in.

Plugin request: Kafka

A Kafka plugin should be created that offers both input and output functionality. Additionally, it should be capable of working with Kafka partitions.

Log Coordinator and Peer output to different files

If the Coordinator and Peer are running on the same machine, they'll log to the same file. This can be a little bit of a pain during development. Each should log to its own file.

Aggregate does not respect onyx/batch-timeout

Reads block forever in aggregation readers. Fixed in 0.4.0-SNAPSHOT.

Coordinator logging

The Coordinator logs very infrequently as of 0.3.0. Logging using Dire should be implemented on events like job submission, task completion, peer birth/death, etc.

Throw an exception on bad catalog format

Submitting a catalog that doesn't conform to the specification of the informational model throws an unhelpful assertion inside the Coordinator. The catalog should be validated using something a library like Schema to obtain helpful error messages.

Make job ID available in lifecycle context map

Key it under onyx.core/job-id.

Implement kill-job core API function

There needs to be an API function that takes a job ID and halts any peer execution of that job's tasks. The job's tasks will no longer be eligible for execution.

Virtual peers can fail to make progress at end of task

Reproduced with the grouping test in Onyx core by turning up the number of virtual peers. Observed that two peers can continually take the sentinel segment off the queue and re-enqueue it infinitely, neither of them able to complete the task.

Throw an exception when a workflow keyword isn't in the catalog

This silently fails right now, the exception gets swallowed up and everything halts.

Emit an error if `:onyx/name` = `:onyx/type`

If both of these are the same, the same multimethod for lifecycle resources gets dispatched to. This is really confusing. See #36

Java API for Onyx

As of release 0.3.0, Clojure is the only supported language for Onyx. Java users can use the APIs that Clojure offers to tap some of the Onyx functionality, but this becomes problematic for areas such as lifecycle extensions that rely on implementations of multimethods.

Furthermore, EDN isn't the friendliest cross-language data format to send catalogs and workflows through. Part of this issue should explore options that Java users have on this front.

inject-lifecycle-resources is invoked twice for every task

Seeing this happen exactly twice, every single time. Never noticed until now.

Reimplement await-job-completion

This was removed by the underlying mechanism changed for 0.5.0. Reimplement this API function.

Map-style workflows don't validate anymore

Messed this one up during the redesign. Fix this before shipping 0.5.0.

Throw an exception on bad workflow format

Submitting a workflow that doesn't conform to the specification of the informational model throws an unhelpful assertion inside the Coordinator. The workflow should be validated using something a library like Schema to obtain helpful error messages.

Restart peer on failure

When a peer dies, it should attempt to recover by rebooting itself. See core.async test for failure.

Grouping configurable through catalog entry

From the mailing list:

"Also, it'd be really nice to specify a group-by operation on the input queue of each function, because then you could make things like wordcount really easy--you'd be able to say "send all instances of the same word to the the same downstream task", which would enable workflows that require an implicit sort/shuffle step."

https://groups.google.com/forum/#!topic/onyx-user/xniQcgCPEn8

Optimize aggregators

Aggregators are at a disadvantage, performance-wise, to transformers and groupers. Aggregators can only hold a single session open which needs to be reused across pipeline iterations in the peer. The reason for this is that if multiple sessions were used, all sessions needs to be read from at the same time since groupers will pin particular message ids to consumers. These sessions shouldn't be closed, otherwise the messages will be repinned. Further, once the sentinel is read, all other sessions will block indefinitely.

The goal of this issue is to speed up this aggregators using an alternate design approach.