nilenso / goose

The Next-Level background job processing library for Clojure

Home Page: https://github.com/nilenso/goose/wiki

License: MIT License

Clojure 95.20% Shell 1.06% CSS 2.10% JavaScript 1.64%

clojure redis asynchronous background-processing batch-processing rabbitmq scheduler

goose's Introduction

Goose

The Next-Level background job processing library for Clojure.

Simple. Pluggable. Reliable. Extensible. Scalable.

Performance

Please refer to the Benchmarking section.

Features

Reliable - Code/Hardware/Network failure won't cause data loss
Native support for RabbitMQ & Redis
Pluggable Message Broker & Metrics Backend
Scheduled Jobs
Batch Jobs
Periodic Jobs
Error Handling & Retries
Console
Extensible using Middlewares
Concurrency & Parallelism friendly
... more details in Goose Wiki

Getting Started

Add Goose as a dependency

;;; Clojure CLI/deps.edn
com.nilenso/goose {:mvn/version "0.5.1"}

;;; Leiningen/Boot
[com.nilenso/goose "0.5.1"]

Client

(ns my-app
  (:require
    [goose.brokers.rmq.broker :as rmq]
    [goose.client :as c]))

(defn my-fn
  [arg1 arg2]
  (println "my-fn called with" arg1 arg2))

(let [rmq-producer (rmq/new-producer rmq/default-opts)
      ;; Along with RabbitMQ, Goose supports Redis as well.
      client-opts (assoc c/default-opts :broker rmq-producer)]
  ;; Supply a fully-qualified function symbol for enqueuing.
  ;; Args to perform-async are variadic.
  (c/perform-async client-opts `my-fn "foo" :bar)
  (c/perform-in-sec client-opts 900 `my-fn "foo" :bar)
  ;; When shutting down client...
  (rmq/close rmq-producer))

Worker

(ns my-worker
  (:require
    [goose.brokers.rmq.broker :as rmq]
    [goose.worker :as w]))

;;; 'my-app' namespace should be resolvable by worker.
(let [rmq-consumer (rmq/new-consumer rmq/default-opts)
      ;; Along with RabbitMQ, Goose supports Redis as well.
      worker-opts (assoc w/default-opts :broker rmq-consumer)
      worker (w/start worker-opts)]
  ;; When shutting down worker...
  (w/stop worker) ; Performs graceful shutdown.
  (rmq/close rmq-consumer))

Refer to wiki for Redis, Periodic Jobs, Error Handling, Monitoring, Production Readiness, etc.

Getting Help

Please open an issue or ping us on #goose @Clojurians slack.

Companies using Goose in Production

Contributing

As a first step, go through all the architecture-decisions
Discuss with maintainers on the issues page or at #goose @Clojurians slack
See the contributing guide for setup & guidelines

Why the name "Goose"?

Named after LT Nick 'Goose' Bradshaw, the sidekick to Captain Pete 'Maverick' Mitchell in Top Gun.

License

goose's People

Contributors

Stargazers

Watchers

Forkers

olttwa alekcz davidalphafox kitallis alishamohanty siripr4 neenaoffline chage charan1973

goose's Issues

Reconsider client library for Redis

Context

To begin with, Goose chose Carmine because it was popular, stable, well-maintained & most importantly, it solved basic needs of pushing to & popping from a list.

Issues

With passage of time, we've discovered certain issues:

Carmine doesn't support closing connections cleanly. More details can be found in Carmine's issue #266 and issue #224.
Carmine doesn't support commands introduced in Redis 6.2.0. The LMOVE command is needed to enqueue in-progress jobs to the front of the job queue. Lua scripting or atomic transactions are difficult to implement for this task.

As a workaround, Goose sends a dummy value to a utility queue. This lets the blocking call exit, because it receives the message it was waiting on.

Requirements

Ideally, the client should help Goose handle its connection pool, be well-maintained, support Redis Cluster, etc.

Options

Celtuce and Obiwan have provisions for closing a connection.
Cons: They don't seem well-maintained & stable. When spiking them locally, I observed random issues like thread not closing, unexpected serializations, etc.
Write a simple wrapper around Jedis or use any stable Java library using Interop

Add Performance Tests for 0.3 Release

ADR for Multiple Brokers

For Redis, RabbitMQ & Amazon SQS, sketch out approach, Interface & feasible features.

Issue

Goose has many context-heavy keywords & the same keyword might mean multiple things depending on time & perspective of object.

Create a Glossary defining the following:

broker
threads
queues
- queue
- schedule-queue
- retry-queue
- dead-queue
- in-progress-queue, preservation-queue & orphan-queue
...

[Reliability] Handle abrupt shutdown of worker

Use an in-progress queue for abrupt worker process shutdowns
Use LMOVE (right to left), or BRPOPLPUSH, since Redis 6.2 is relatively new
Check out https://klotzandrew.com/blog/sidekiq-lost-messages
Goose has a single queue per worker because reliability cannot be achieved using multiple queues in redis as mentioned here
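
The in-progress queue idea can be illustrated with a small in-memory model. This is only a sketch of the LMOVE RIGHT LEFT semantics on plain Clojure vectors, not Goose's Redis code; the function name and the :queue and :in-progress keys are hypothetical.

```clojure
(defn lmove-right-left
  "Pure model of `LMOVE src dst RIGHT LEFT`: pops the tail of src and
  pushes it onto the head of dst in a single step."
  [state src dst]
  (if-let [job (peek (get state src))]
    (-> state
        (update src pop)
        (update dst #(into [job] %)))
    state))

;; Dequeue: the job is parked in :in-progress while a worker executes it,
;; so an abrupt shutdown leaves it recoverable instead of lost.
(def after-dequeue
  (lmove-right-left {:queue [:job-a :job-b] :in-progress []}
                    :queue :in-progress))
```

Because the pop and the push happen in one step, there is no window in which the job exists in neither list.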

Create an audit trail for a job

In the beginning a Job can be:

enqueued
scheduled
cron-scheduled

As part of its lifecycle, the job could be:

enqueued to ready-queue by a scheduler
fail execution
be retried
be dead

Since a job could stay in Goose ecosystem for 60+ days, an audit-trail would help with debugging.

The audit-trail can have following details:

event
time
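
A minimal sketch of how such a trail could be carried on the job map itself, assuming a hypothetical :audit-trail key and event keywords chosen only for illustration:

```clojure
(defn record-event
  "Appends an {:event :time} entry to the job's audit trail."
  [job event time-ms]
  (update job :audit-trail (fnil conj []) {:event event :time time-ms}))

(def audited-job
  (-> {:id "job-1"}
      (record-event :scheduled 1000)
      (record-event :enqueued 2000)
      (record-event :failed 2500)))
```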

Customized queues

While enqueuing, Clients can specify queue name
Workers can be initialized with a set of queues

Priority can be tweaked by concurrency of workers.

Clean Graceful Shutdown

Because of #14, Goose can either use long timeouts or shut down cleanly.

If we have a way to interrupt connections to Redis in a clean manner, Goose will have both long-polling & a clean graceful shutdown.

Support for Redis Cluster via Carmine Library

Context

In the past, we've struggled with closing the connection pool during graceful shutdown and with the lack of Redis v6.2.0 support.
Despite these struggles, after spiking various libraries (link to Redis ADR), we decided to stick with Carmine.

Problem

Carmine lacks support for Redis Cluster

Solution

Goose can have a small set of macros for the subset of Redis operations to map keys to the right cluster node. These macros will rewrite Redis calls to follow the cluster protocol: identify the node, then query it.
If these macros can be generalised, we can raise a PR to Carmine itself.

Coordinate multiple-worker scheduler polling interval

Issue

For multiple worker processes, amplify & randomize the polling interval.
This reduces load on Redis and approximately achieves the configured scheduled-queue polling interval despite n workers.

Solution

Blocked on #28
Sleep for (* poll-interval process-count (rand))
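
As a sketch, the sleep formula above could be wrapped in a function like the following. The name scheduler-sleep-sec is hypothetical; process-count would come from the worker count tracked in #28.

```clojure
(defn scheduler-sleep-sec
  "Randomized sleep in seconds. (rand) is uniform in [0, 1), so the result
  lies in [0, poll-interval-sec * process-count)."
  [poll-interval-sec process-count]
  (* poll-interval-sec process-count (rand)))
```

The mean sleep is half of poll-interval × process-count, so adding workers stretches each worker's sleep instead of multiplying the polls hitting Redis.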

[redis] Periodic Jobs Feature

Like perform-in-sec, schedule a job to run recurrently.

For example, a perform-every function that takes a CRON expression as input.

Implementation details:

Add a perform-every function to Goose Client that takes a cron expression
Calculate next date for the expression, say 3-Aug-1-PM
Schedule the job to run at next CRON 3-Aug-1-PM
When the scheduler finds a job due for execution, enqueue it to the front of the queue
Along with enqueuing, also schedule it back into the queue for recurring execution
Enqueuing & re-scheduling should happen in 1 transaction so as to not lose the job
Define number of times a periodic job should run and then stop
Limit the :scheduler-polling-interval-sec config to 60s, as the minimum interval of a periodic job is 1 min
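
The enqueue-and-reschedule step can be sketched as one pure state transition, which is the in-memory analogue of doing both in a single transaction. All names here are hypothetical, and next-run-fn stands in for real cron evaluation, which this sketch does not attempt.

```clojure
(defn enqueue-and-reschedule
  "Pushes a due periodic job to the front of the ready queue and records
  its next run time, as a single state transition."
  [state job now-ms next-run-fn]
  (-> state
      (update :ready-queue #(into [job] %))
      (update :schedule assoc (:id job) (next-run-fn now-ms))))

;; Assumption: a fixed 60s period instead of a parsed cron expression.
(def every-minute #(+ % 60000))

(def next-state
  (enqueue-and-reschedule {:ready-queue [:older-job] :schedule {}}
                          {:id "periodic-1"} 1000 every-minute))
```

Applied with `(swap! state-atom enqueue-and-reschedule job now every-minute)`, either both changes are visible or neither is.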

Open questions:

Should we reuse scheduled-jobs queue for periodic jobs?
How will we calculate latency? Calculate time difference from previous cron schedule?
Along with standard APIs, add an API to modify CRON period?

0.2 Release Laundry List

Total process count for statsD & scheduler sleep time
Add docstrings for cljdocs
Update API, StatsD & Middleware logic as per Wiki
Inject error service config into error & death handlers
Update README & its badges
Add prefixed-queue to Job
Add Redis as default broker

[rmq] Implement Publisher Confirms

Enable publisher-confirm mode on client
- If broker responds with basic.nack for 3 jobs in a row
  - Callback with failed jobs
  - switch to synchronous acks for future jobs
  - switch to async after 5 successful acks
- Do acks async because latency can be few hundred millis
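
The switching rules above can be sketched as a small state machine. This only models the mode changes; the failed-jobs callback and the actual RabbitMQ channel calls are left out, and every name is hypothetical.

```clojure
(def initial-state {:mode :async :nacks 0 :acks 0})

(defn on-confirm
  "Tracks consecutive acks/nacks and decides the confirm mode:
  3 nacks in a row switch to :sync, 5 acks in a row while in :sync
  switch back to :async."
  [{:keys [mode nacks acks] :as state} confirm]
  (case confirm
    :nack (let [n (inc nacks)]
            (if (>= n 3)
              {:mode :sync :nacks n :acks 0}
              (assoc state :nacks n :acks 0)))
    :ack  (let [a (inc acks)]
            (if (and (= mode :sync) (>= a 5))
              initial-state
              (assoc state :acks a :nacks 0)))))

(def after-three-nacks
  (reduce on-confirm initial-state [:nack :nack :nack]))
```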

References

Integration test for orphan-checker

Issue

How to kill a worker thread from inside a thread?
Calling .shutdownNow() causes in-progress job to fail and be scheduled for retry
Kill a worker thread & validate orphan job is re-picked by another worker thread
This implicitly tests heartbeat

Implement `goose` initialization

Acceptance Criteria:

Accepts redis connection params
Accepts parallelism config
Accepts backend URL

To be decided...

Inject a logger in Goose

Check out timbre & tools.logging
Idiomatic/Functional way would be the server injecting a logger & Goose sending events to the logger.
The interface can be: log('time', 'level', 'msg', & params...)
- try it out once with tools.logging

Test code written in README

Use library: seancorfield/readme
Reference: Nilenso honeysql-postgres

Internal Protocol for Multi-Broker support

This will be a first step in direction of supporting RabbitMQ
The protocol will have following functions:

- enqueue
- schedule
- start
  - reify (goose.worker/stop)
- middleware-chain?
- APIs
    - enqueued
    - dead
    - scheduled
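
One possible shape for such a protocol is sketched below with an in-memory implementation. The protocol name, method names and record are illustrative assumptions drawn from the list above, not Goose's actual internal protocol.

```clojure
(defprotocol Broker
  (enqueue [this job] "Adds a job to the ready queue.")
  (schedule [this job epoch-ms] "Registers a job to run at epoch-ms.")
  (enqueued-size [this] "API: number of jobs in the ready queue."))

;; In-memory stand-in; a Redis or RabbitMQ record would implement
;; the same protocol against the real broker.
(defrecord InMemoryBroker [state]
  Broker
  (enqueue [_ job]
    (swap! state update :ready conj job)
    job)
  (schedule [_ job epoch-ms]
    (swap! state update :scheduled assoc (:id job) epoch-ms)
    job)
  (enqueued-size [_]
    (count (:ready @state))))

(defn in-memory-broker []
  (->InMemoryBroker (atom {:ready [] :scheduled {}})))
```

Client and worker code would then depend only on the protocol, so adding a broker means adding a record.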

Current interface

(async
  `fn-sym
  {:args '(1 2 3)
   :other-opts :other-vals})

Better interface

(async
  opts
  `fn-sym
  "variadic" :args 1.0 2 {:map :val} '("list") ["vector"])

Multiple worker threads

Client can configure number of worker processes to run in parallel
Stick to 1 process spawning n threads for now

Test statsd-metrics

When integration tests are run, listen on the configured statsd port to verify stats are emitted as expected. Refer to this GitHub gist for a datagram listener in Clojure.

Auto-stop Periodic Jobs after a certain count/time

When registering a periodic job, set a run-count or run-until.
The job will be deregistered from cron-schedule once the run-count has been reached.

nil run-count/run-until means the job will run indefinitely.
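
A minimal sketch of the run-count rule, assuming a hypothetical cron entry map with an optional :run-count key (run-until is not modelled here):

```clojure
(defn after-run
  "Returns the updated cron entry after one execution, or nil when the
  entry should be deregistered from the cron schedule."
  [{:keys [run-count] :as entry}]
  (cond
    (nil? run-count) entry                        ; nil: run indefinitely
    (> run-count 1)  (update entry :run-count dec)
    :else            nil))                        ; last allowed run is done
```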

Done

Deferred to 0.2

All exposed functions have a docstring and are present in the Clojure docs. Host on cljdocs once done
Add logo to README
Flow-chart for scheduler, enqueue, failed jobs, failed jobs with custom queue, dead jobs
Maintainability Badges
wiki for every feature
Changelog

Emit stats

Goose should emit stats for:

enqueue-execute time diff
schedule SLA diff
successful/failed/orphaned/dead jobs

Implement `worker` function

Acceptance Criteria:

Pull jobs from redis
Deserialize the arguments and execute the functions
Gracefully shut down and enqueue in-progress jobs back

Stretch goal:

Multi-threaded workers

Reconsider number of threads long-polling redis

Issue

Goose polls redis n times for n threads.
To reduce load on redis, we might want to consider polling from just 1 thread, and enqueuing jobs' execution to the threadpool. To limit execution parallelism/concurrency to the user config, we can have an in-memory buffered queue.

N worker instances with T threads polling means O(N*T) operations per second (or long-polling if #65 gets resolved) slamming Redis.

What next?

Benchmark 2 approaches:
- Polling redis n times
- Polling redis once

While benchmarking, measure 2 things:

time taken to complete 1000 jobs averaging 50ms execution time
redis memory/CPU consumption

API to Pause/Unpause an execution Queue

Add a Metrics Protocol for custom metrics collection

By default, Goose sends metrics to statsd
Have a metrics protocol for users to inject their implementation of metrics
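
As an illustration, a metrics protocol could look roughly like this. The protocol and method names are assumptions made for this sketch and may differ from what Goose ends up exposing; the atom-backed record is only a test double.

```clojure
(defprotocol Metrics
  (increment [this metric-key] "Bumps a counter by one.")
  (timing [this metric-key duration-ms] "Records a duration in ms."))

;; A user-supplied implementation; a statsd-backed record would
;; implement the same two methods.
(defrecord AtomMetrics [store]
  Metrics
  (increment [_ metric-key]
    (swap! store update-in [:counters metric-key] (fnil inc 0)))
  (timing [_ metric-key duration-ms]
    (swap! store update-in [:timings metric-key] (fnil conj []) duration-ms)))
```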

Maintain worker process count using heartbeat

Issue

A count of worker process is needed to coordinate reliable processing of hanging jobs, scheduler-polling time, etc.

Solution

Generate process-id (based on VM host/container-id + random string)
- create key with TTL of 1 min
- Renew TTL every 30 sec
- Run GC every 1 min to handle abrupt shutdowns
On startup, add to processes list
On shutdown, remove from list
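
The heartbeat bookkeeping can be sketched in memory as a map from process-id to expiry time. In Redis the expiry would be handled by key TTLs; the function names below are hypothetical.

```clojure
(defn heartbeat
  "Registers or renews a process with a fresh TTL."
  [processes process-id now-ms ttl-ms]
  (assoc processes process-id (+ now-ms ttl-ms)))

(defn gc
  "Drops processes whose TTL has lapsed, e.g. after an abrupt shutdown."
  [processes now-ms]
  (into {} (remove (fn [[_ expiry-ms]] (<= expiry-ms now-ms))) processes))

(defn process-count
  [processes now-ms]
  (count (gc processes now-ms)))
```

A process that renews every 30 sec against a 1 min TTL survives one missed renewal before it is counted out.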

Generate smart job IDs

Instead of a random UUID, a job-id can have date+time of enqueuing, and other generic info limited to <60 characters
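
One way this could look, as a sketch: a UTC timestamp of enqueuing followed by a short random suffix. The exact format is an assumption; the issue only asks for date+time plus generic info within 60 characters.

```clojure
(import '(java.time Instant ZoneOffset)
        '(java.time.format DateTimeFormatter))

(def ^:private id-format
  (.withZone (DateTimeFormatter/ofPattern "yyyyMMdd'T'HHmmssSSS")
             ZoneOffset/UTC))

(defn job-id
  "Builds an ID such as 20240803T130000000-1a2b3c4d (27 characters):
  enqueue time in UTC plus the first 8 characters of a random UUID."
  [^Instant enqueued-at]
  (str (.format id-format enqueued-at)
       "-"
       (subs (str (java.util.UUID/randomUUID)) 0 8)))
```

Such IDs sort by enqueue time, but the 8-character suffix gives weaker uniqueness than a full UUID.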

Benchmark performance

Outcomes

Finalize between 2 interfaces: pre-defined jobs using code OR resolving jobs at runtime
Stats for users
Comparison for newer releases

Schedule a job

Ability to enqueue a job to be picked up at a certain time in future.

Add extensive Documentation & Flow-charts

Create either a Github wiki, or leverage clj-docs.

It should have following details:

docstring for public-facing functions
Broker config, Enqueuing/Dequeuing, Scheduling, Error-handling, API, etc.
Flow-charts for life-cycles of a job
- Enqueue-Dequeue
- Schedule-Enqueue-Dequeue
- Enqueue-Fail-Schedule-Dequeue
- Enqueue-Fail-Schedule-Dequeue-from-retry-queue
- Enqueue-Fail-Dead

[rmq] API to manage jobs from within Goose

Not all APIs from Redis can be replicated

Implement `async` function

Acceptance Criteria:

Arguments:

Takes the job function
Arguments for the job function

Validations:

Ensure that the job-fn is serializable
Ensure that the namespace in the job-fn is present
Ensure that the args are edn-serializable

Implementation:

Puts them in redis to be later picked up by the worker

Choose job-enqueuing interface

At present, Goose has 2 options for interfaces:

Provide fully-qualified & resolvable function symbol
Predefine functions that can be enqueued, & their retry/schedule configs

Choose between the 2 based on #33

Mark a job as poisonous after n recoveries

Sometimes, executing a job might crash the worker process.

Due to orphan-checks, such jobs will be re-enqueued & retried.

Keep a note of recovered jobs. If a job is recovered more than n times, it can be assumed to be poisonous, i.e. it crashes the worker.
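
A minimal sketch of that bookkeeping, assuming hypothetical :recovery-count and :poisonous? keys on the job map:

```clojure
(defn on-recovery
  "Called when the orphan-checker recovers a job. Counts the recovery and
  flags the job once it has been recovered more than max-recoveries times."
  [job max-recoveries]
  (let [job (update job :recovery-count (fnil inc 0))]
    (if (> (:recovery-count job) max-recoveries)
      (assoc job :poisonous? true)
      job)))
```

A worker could then route flagged jobs to the dead queue instead of re-enqueuing them.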

Add middleware support

Users should be able to inject code pre/post job execution
Pre/post Job enqueue isn't necessary because of Goose's interface
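
A sketch of what pre/post execution hooks could look like as a wrapping function. The `(fn [next-fn] (fn [opts job] ...))` shape is an assumption for illustration; check the Goose wiki for the actual middleware signature.

```clojure
(defn wrap-events
  "Middleware that records an event before and after the job executes."
  [events]
  (fn [next-fn]
    (fn [opts job]
      (swap! events conj [:before (:id job)])
      (let [result (next-fn opts job)]
        (swap! events conj [:after (:id job)])
        result))))

;; Usage: wrap a stand-in executor that just returns the job id.
(def events (atom []))
(def run-job ((wrap-events events) (fn [_opts job] (:id job))))
```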

Intelligent Args Validation

Issues

Args validation just checks if it is edn serializable. edn serializes anonymous & symbolized functions too, which isn't supported by Goose as they cannot be stored somewhere and retrieved in a different JVM.

Possible solutions:

Try them out, they aren't tested at time of writing

Check if parents of said object are serializable?
Define an exhaustive list of allowed types and validate against them
- Refer https://github.com/ptaoussanis/nippy for the exhaustive list
- Refer Sidekiq best practices on job params
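
The allow-list idea could be sketched as a recursive predicate. The set of allowed types here is a deliberately small assumption, not the exhaustive list the issue refers to, and like the other solutions it is untested against Goose itself.

```clojure
(defn allowed-arg?
  "True when x is built only from nil, strings, numbers, booleans,
  keywords and collections of those. Functions and symbols are rejected."
  [x]
  (cond
    (or (nil? x) (string? x) (number? x) (boolean? x) (keyword? x)) true
    (map? x) (every? allowed-arg? (concat (keys x) (vals x)))
    (or (sequential? x) (set? x)) (every? allowed-arg? x)
    :else false))
```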

API to manage jobs

An API that helps view:

List failed jobs in retry queue
List dead jobs (exhausted retries)
Retry dead jobs
Retry failed jobs now (instead of later)

[rmq] Error Handling & Retries

Implementation details

RETRY-EXCHANGE (Re-Enqueue with delay)
Use Dead-Letter Exchanges
https://dzone.com/articles/rabbitmq-consumer-retry-mechanism

Error handling & Retries

When a job throws an exception, worker re-enqueues with updated retry-count
Job is retried with an exponential back-off function. User can configure their own back-off function
Users can add an error service like Honeybadger or Sentry to report errors via email
If 0 retries, put in dead-letter queue
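
The retry flow above can be sketched as a pure function over the job map. The backoff formula and all key names are illustrative assumptions and are not Goose's actual defaults.

```clojure
(defn default-backoff-sec
  "Illustrative exponential back-off: 10s, 20s, 40s, ..."
  [retry-count]
  (* 10 (long (Math/pow 2 retry-count))))

(defn on-failure
  "Returns the job marked for retry with a back-off delay, or marked dead
  once retries are exhausted. Users may pass their own :backoff-fn."
  [{:keys [retry-count] :or {retry-count 0} :as job}
   {:keys [max-retries backoff-fn] :or {backoff-fn default-backoff-sec}}
   now-sec]
  (if (< retry-count max-retries)
    (assoc job
           :state :retry
           :retry-count (inc retry-count)
           :retry-at (+ now-sec (backoff-fn retry-count)))
    (assoc job :state :dead)))
```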

[rmq] Add enqueue-dequeue feature

Implementation details:

Client

connection factory
create a channel. (it should be reusable across threads)
create a queue (this operation should be memoized)
- durable: true, auto-delete: false, exclusive: false
publish
- persistent: true, priority: 0

Worker

connection factory
create a channel. (it should be reusable across threads)
- basic.qos = number of threads
- global:false (only share across consumer on the channel, not connection)
- https://www.rabbitmq.com/confirms.html#channel-qos-prefetch
- https://www.rabbitmq.com/consumer-prefetch.html
create a queue (only happens on initialization)
- durable: true, auto-delete: false, exclusive: false
subscribe
- manual_ack: true
channel.ACK
Set consumer_timeout in rabbitmq.conf
- https://www.rabbitmq.com/consumers.html#acknowledgement-timeout

Modify validations/assertions approach

Issue

The approach to validation doesn't feel functional. We avoided spec, :pre form in defn, expound, Metis, Validateur because of 2 reasons:

They simply print error statements. Client doesn't get an explicit error message describing what went wrong
Customized validations weren't possible

Unsolved requirement

A perfect solution would mean users generating/understanding function params just by looking at validations

FWIW, Claypoole, a popular Clojure library also does Validations exactly like Goose

Possible solutions

Whichever solution gets chosen, be sure it follows above 2 requirements.

Wrap spec inside exceptions like this gist

Todo

Use mocks to assert client/worker validates redis, queue, etc. To avoid duplication, we aren't validating redis from client tests, as redis tests already check that.
Use table-driven tests for validations

Client-side Callbacks when a Job executes

Should be picked up after #105 is completed

Allow clients to register for a callback when a Batch of Jobs is executed
Callback can contain either execution result (success scenario) or exception (failure scenario)

[rmq] Add Scheduling feature

Implementation details

If schedule-time is in the past, publish with priority: 1
- Check if this is necessary. i.e. does a negative delay result in job being enqueued to head of queue
publish(x-delay: 123 ms)

[rmq] Add Periodic Jobs feature

[rmq] Ready it for Production usage

RMQ sets redeliver to true. It can be used in middlewares, or to send stats
- https://www.rabbitmq.com/confirms.html#automatic-requeueing
Heartbeat vs TCP keepalive
- Detect if TCP keepalive is configured, then Heartbeat can be disabled
- Set heartbeat timeout to 15 seconds
- https://www.rabbitmq.com/heartbeats.html#false-positives
Best Practices Wiki
- Higher thread-count doesn't necessarily imply better throughput. Tune it by trial & error based on CPU cores & RAM, and avoid the law of diminishing returns: https://blog.rabbitmq.com/posts/2014/04/finding-bottlenecks-with-rabbitmq-3-3/#consumer-utilisation
- https://www.rabbitmq.com/production-checklist.html
- https://www.rabbitmq.com/monitoring.html
- use a cluster
