contribsys / faktory
Language-agnostic persistent background job server
Home Page: https://contribsys.com/faktory/
License: Other
What should Faktory look like in a world full of Kubernetes? My understanding is that Kubernetes could be very useful in scaling worker processes as queues grow. How can Faktory make this easy?
Docker is a newer way of distributing an entire stack of dependencies as a single unit. Anyone want or need a Docker install for Faktory?
Right now the password authentication is home-grown. I'd suggest we modify the protocol to use SASL PLAIN instead but I'm open to expert advice here. I've seen memcached add-ons on Heroku use SASL before.
It would be awesome to use Faktory on Heroku. No idea what's involved in doing this, mainly opening this issue as a placeholder.
2017/10/30 12:02:34 failed to push task to faktory: ERR default is too large, currently 100001, max size is 100000
Was 100,000 picked for any reason? Is this a limit in RocksDB?
If not, it would be great if this were configurable.
I noticed that if a connection doesn't send a heartbeat, it doesn't get reaped in the 60s timeout. Would an acceptable solution to this be to write a heartbeat to the map the moment a connection is opened (clientWorkerFromHello)? Alternative? I'm willing to open a PR—wondering what the intended behavior is.
Looks like this is what happens when master tries to boot an existing Store without priorities. @andrewstucki Any ideas on backwards compatibility or graceful fallback?
panic: runtime error: index out of range
goroutine 5 [running]:
encoding/binary.binary.bigEndian.Uint64(...)
/usr/local/Cellar/go/1.9.2/libexec/src/encoding/binary/binary.go:124
github.com/contribsys/faktory/storage.decodeKey(...)
/Users/mikeperham/src/github.com/contribsys/faktory/storage/queue_rocksdb.go:496
github.com/contribsys/faktory/storage.(*rocksQueue).Init(0xc42010d0a0, 0x0, 0x0)
/Users/mikeperham/src/github.com/contribsys/faktory/storage/queue_rocksdb.go:213 +0x751
github.com/contribsys/faktory/storage.(*rocksStore).GetQueue(0xc420144000, 0xc4200149c8, 0x7, 0x0, 0x0, 0x0, 0x0)
/Users/mikeperham/src/github.com/contribsys/faktory/storage/rocksdb.go:254 +0x239
github.com/contribsys/faktory/storage.(*rocksStore).init(0xc420144000, 0x0, 0x0)
/Users/mikeperham/src/github.com/contribsys/faktory/storage/rocksdb.go:197 +0x566
github.com/contribsys/faktory/storage.OpenRocks(0x43e23f8, 0x13, 0x0, 0x0, 0x0, 0x0)
/Users/mikeperham/src/github.com/contribsys/faktory/storage/rocksdb.go:94 +0x94f
github.com/contribsys/faktory/storage.Open(0x43dc968, 0x7, 0x43e23f8, 0x13, 0x0, 0x0, 0x0, 0x0)
/Users/mikeperham/src/github.com/contribsys/faktory/storage/types.go:82 +0x12d
github.com/contribsys/faktory/server.(*Server).Start(0xc42010e090, 0x0, 0x0)
/Users/mikeperham/src/github.com/contribsys/faktory/server/server.go:113 +0x7d
github.com/contribsys/faktory/webui.bootRuntime.func1(0xc42010e090)
/Users/mikeperham/src/github.com/contribsys/faktory/webui/web_test.go:87 +0x2b
created by github.com/contribsys/faktory/webui.bootRuntime
/Users/mikeperham/src/github.com/contribsys/faktory/webui/web_test.go:86 +0x12b
Right now, all connections are supposed to send a WID. Oops, I forgot: not all connections are from workers! It will be normal for app servers to push jobs to Faktory and these connections should not be part of the data displayed on the Busy page. Consuming processes == workers == listed on Busy page. Figure out how this will change the HELLO and heartbeat handling.
The cmd/repl.go code is a mess; it needs to be refactored and tests added for it.
Besides being more language-agnostic, I would love to understand the benefits of this over Sidekiq in more detail. As an extension: how does this differ from queuing systems like Kafka, ZeroMQ, RabbitMQ, etc., and what task-specific sugar is added on top?
Apologies if this is super apparent somewhere already! Looks like a cool project, excited to play around with it.
Using block comments to document methods does not render well in the end: https://godoc.org/github.com/contribsys/faktory
Small task, but good docs might smooth out the learning curve. Hoping to tackle bigger features soon.
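For reference, godoc renders best when a doc comment is a contiguous run of line comments directly above the declaration, starting with the identifier's name. A minimal sketch (the names here are illustrative, not Faktory's API):

```go
// Package main sketches godoc-friendly comment style.
package main

import "fmt"

// Greet returns a greeting for the given name. Because this comment begins
// with the function's name, godoc renders the first sentence as the summary.
func Greet(name string) string {
	return "Hello, " + name
}

func main() {
	fmt.Println(Greet("Faktory")) // prints "Hello, Faktory"
}
```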
When a FETCH returns no results, it currently responds with:
$0\r\n
\r\n
However, REDIS RESP defines the Null Bulk String specifically "to signal non-existence of a value":
$-1\r\n
Which seems more appropriate.
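To make the distinction concrete, here is a small sketch of the two encodings. RESP clients decode the first as an empty string and the second as nil/none, which is why the null bulk string is the better fit for "no job available":

```go
package main

import "fmt"

// emptyBulk is what Faktory currently sends for a no-result FETCH:
// a zero-length bulk string ("there is a value, and it is empty").
func emptyBulk() string { return "$0\r\n\r\n" }

// nullBulk is the RESP Null Bulk String ("there is no value"),
// which clients decode as nil/None.
func nullBulk() string { return "$-1\r\n" }

func main() {
	fmt.Printf("%q vs %q\n", emptyBulk(), nullBulk())
}
```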
Polling is a relatively common way of checking for job completion, but Faktory does not currently provide a way for applications to do that. To support this, it'd be useful to have an API call that checks whether a given job is still enqueued (i.e., has not been processed yet). Something like:
ENQUEUED {"queue": "<queue>", "jid": "<jid>"}
Which returns one of
+gone
+queued
+dead
queue should default to default if not given. We need the queue to be explicitly named to avoid searching all the queues and sets.
For further motivation and discussion, see the Gitter thread starting from https://gitter.im/contribsys/faktory?at=5a03413f7081b66876c7a6ae.
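A client-side sketch of how the proposed command might be built. To be clear, ENQUEUED is the proposal from this issue, not an implemented part of the Faktory protocol:

```go
package main

import "fmt"

// enqueuedCommand builds the proposed ENQUEUED line, applying the proposed
// fallback to the "default" queue when none is given. Hypothetical helper;
// a conforming server would reply with one of +gone, +queued, or +dead.
func enqueuedCommand(queue, jid string) string {
	if queue == "" {
		queue = "default"
	}
	return fmt.Sprintf(`ENQUEUED {"queue": %q, "jid": %q}`, queue, jid)
}

func main() {
	fmt.Println(enqueuedCommand("", "8a3f2b1c"))
}
```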
I’m working on a Python worker for Faktory and have a couple of questions.
Do all workers need to implement BEAT? What is FLUSH for?
Work is needed to tune RocksDB for our queue usage patterns. Some knobs are already visible in the gorocksdb bindings; others need direct C++ access.
Running the test suite with -race enabled detects a number of race conditions. Some of them are purely in the test suite, e.g. the count incremented from goroutines in storage/queue_test.go not being protected by a mutex, but some of them are deeper in the core, e.g. server/tasks.go appending to an internal tasks array in one goroutine and ranging over it in another.
Might want to consider enabling -race on the tests to track these down and to generate a list of places that need fixing.
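For the test-suite races, the fix is usually mechanical: protect the shared counter with an atomic or a mutex. A sketch of the pattern (not the actual queue_test.go code):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countConcurrently increments a counter from n goroutines. Using
// atomic.AddInt64 (or a mutex) instead of a plain count++ keeps
// `go test -race` quiet and makes the result deterministic.
func countConcurrently(n int) int64 {
	var count int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&count, 1)
		}()
	}
	wg.Wait()
	return count
}

func main() {
	fmt.Println(countConcurrently(100)) // deterministically 100
}
```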
We could use help building and distributing Faktory via Homebrew. This would make basic usage on OSX much, much easier. Anyone know how to do this?
Need @sethterpstra's help here:
Faktory looks like a decent message broker. But how do I advocate it to my teammates over, say, RabbitMQ?
The Web UI currently is missing CSRF protection because we don't have any notion of session to store the current token value. The hooks are all there in the forms to inject an authenticity_token, we just need to implement it.
A poison pill is a job which causes the worker process to crash. This will lead to the job being retried every hour (by default) when the job reservation times out, forever. How do we catch and prevent this? Should reservation recovery increment Job.Failure.RetryCount and follow the retry logic? Can we detect a poison pill and move it to the Dead set immediately?
They can be merged through some refaktoring.
Adding it should be as simple as running build.sh, I believe. But it doesn't need to install the Go distribution: if we specify the language in Travis, the build image already contains Go.
So we just need to install RocksDB and then run the tests.
Also, adding go vet and golint would be good, though that is a separate task.
Can we support simple job prioritization? Today the queue key in RocksDB is queueName|index; we could change the key to queueName|priority|index, but that would make it impossible to get the next job as an O(1) operation. We could use a live cursor that walks the RocksDB key tree in real time, but I'm not sure of the performance implications. Other ideas?
I was doing something unrelated and found a bug. Early software, love the project, no judgey mode here. 😄 🎈
./faktory-cli
Faktory 0.5.0
Copyright © 2017 Contributed Systems LLC
[snip]
> ^D
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x4129419]
goroutine 1 [running]:
main.repl(0xc42001a280, 0x18, 0x0, 0x0)
[snip]/faktory/cmd/repl.go:59 +0x419
main.main()
[snip]/faktory/cmd/repl.go:42 +0x168
exit and Ctrl-C do not panic. I'll poke around and see if anything is easy to fix, but I don't know this codebase yet.
If a worker has a concurrency of 10, then it may have 10 (or 11) connections open.
If the BEAT is once per 15s per worker (not per connection), is it worth adding a ping/pong command that a thread tied up with a long-running job can send to the server to keep the connection alive?
The worker could send a periodic BEAT but that updates the db (I think) which incurs some cost.
I’m happy to submit a PR.
Wondering what your thoughts are on using something like https://github.com/hashicorp/raft and making this work with clustering support baked in?
I'm keeping this short: I'm pretty sure the project and its users can benefit from exporting prometheus metrics. Metrics are important in modern application deployments, especially if there are many moving parts.
I would suggest exporting some basic metrics about the application itself, queues, jobs, and the web server (latency, HTTP status codes, etc.). They are also interesting for scaling Kubernetes deployments: see #19.
If this is desired, we can discuss it in detail and I would love to contribute a basic implementation.
Run coverage reports for the {storage,server,webui} packages. Find any non-trivial blocks of uncovered code. Write a test for that block. Rinse, repeat until no blocks remain.
The server doesn't populate the Stats from storage when booting so those numbers always start from zero. Read the current values from storage on bootup.
We have a very basic queue push/pop tool in test/load but it would be nice to have broader coverage of more commands, more connections, etc. What makes sense to add?
Plain Go Linux binaries can be used on any Linux distro but using CGO to pull in RocksDB means that we also link in libc and a bunch of other dynamic libraries, making them distro-specific. I'd like to minimize that and statically compile anything that we possibly can, in order to increase platform support and minimize bugs due to differing versions between distros.
Investigate statically linking as many things as possible. Today the list looks like this:
ubuntu@ubuntu-xenial:/vagrant$ make build
go generate ./...
go build -o faktory cmd/main.go
ubuntu@ubuntu-xenial:/vagrant$ ldd ./faktory
/lib64/ld-linux-x86-64.so.2 (0x00007f1f65321000)
linux-vdso.so.1 => (0x00007ffd3f1db000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1f65104000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1f64a79000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1f63e67000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f1f64d82000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f1f6485f000)
libbz2.so.1.0 => /lib/x86_64-linux-gnu/libbz2.so.1.0 (0x00007f1f6464f000)
libsnappy.so.1 => /usr/lib/x86_64-linux-gnu/libsnappy.so.1 (0x00007f1f64447000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1f64231000)
I suspect that we can statically compile the bottom five into the distributed binary. I believe CockroachDB did some investigation into this in the Go 1.1 timeframe but it's not clear to me if the situation has gotten any better.
When a worker calls FETCH, if there is no work to be done then the server will wait 2s before returning, which is great because there should usually be a worker waiting to accept a new job and the end-to-end latency is nice and low.
If a worker has its status set to quiet or stop, then when the worker calls FETCH it will get an immediate empty response. If a worker depends on the 2s FETCH timeout for flow control, it'll spin and send hundreds of FETCH requests per second that will never get any job.
If a worker is responsible for its own flow control, then it'll need to slow the frequency of FETCH requests, which will increase the latency of tasks through the system.
I think if a worker calls FETCH on a quiet'd/stop'd connection, the server should wait 1s and then return an ERR.
What are your thoughts? If you think it makes sense I'll create a PR.
Noticed that the idea of building on Windows was raised in Gitter. Thought that I'd take a stab at getting it building with CI on Windows. I'm not by any means even a fan of Windows systems, but figured that maybe it would help get more people contributing.
However, I'm not sure how much you want to actually support Windows--personally, with all the craziness of Windows systems and non-POSIX-compliant stuff... I wouldn't... but, it looks fairly possible given that the only real platform-specific dependency is pretty easy to install with a Microsoft supported package manager. Before I spend too much time on this though, I figured I'd get your thoughts.
On a side note, man, Windows builds of library dependencies are SLOW. It takes like 25 mins just to build all the required dependencies through appveyor. Thank goodness for caching.
Security is always tricky, with compromise and tension between usability and safety. My basic policy: development should be easy, production should be safe. More thoughts:
Development mode should be quickly usable, without needing to read docs or configure options in order to get working. If someone is running Faktory locally, she should be able to install and start it immediately. The default would be to listen on localhost only, with no TLS or passwords required in this mode.
In production, we want to listen on a non-localhost interface. Once ports are exposed to the outside world, we need to get more paranoid and require TLS and authentication.
We can't provide TLS without the user providing a cert for the hostname used to access the API and Web UI. The user will need to provide a TLS cert and CA chain, which I propose reside at a conventional location:
{~/.faktory,/etc/faktory}/public.crt
{~/.faktory,/etc/faktory}/private.key
{~/.faktory,/etc/faktory}/ca-chain.crt
Those filenames can be actual files or soft links to files elsewhere on disk.
In summary:
The user can run Faktory in production using e.g. stunnel or a proxy to expose insecure localhost ports instead of TLS. Since Faktory is listening on localhost, it doesn't require any further security; it's up to the user to expose those ports securely.
TLS doesn't solve our whole problem; we still need to limit access. I propose a simple password scheme for all connecting clients when sending the AHOY payload. Clients should send two attributes, pwdhash and salt, where pwdhash is sha256(password+salt). The salt should be, at minimum, a 12-char random string generated uniquely for each AHOY.
AHOY {"pwdhash":"d40b8917d7aff72a40a677c55992d2edc1b41331ec3b24641f2affa67b8dba09","salt":"aos,dfis33dkvn"}
This is the SHA256 for the password "helloworld" with that salt:
$ echo -n helloworldaos,dfis33dkvn | shasum -a 256
d40b8917d7aff72a40a677c55992d2edc1b41331ec3b24641f2affa67b8dba09 -
Without a valid pwdhash and salt in the AHOY, Faktory will disconnect the client immediately.
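The same computation in Go, reproducing the example above (the password and salt are the ones from this issue):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// pwdhash implements the proposed AHOY scheme: hex(sha256(password + salt)).
func pwdhash(password, salt string) string {
	sum := sha256.Sum256([]byte(password + salt))
	return fmt.Sprintf("%x", sum)
}

func main() {
	fmt.Println(pwdhash("helloworld", "aos,dfis33dkvn"))
	// d40b8917d7aff72a40a677c55992d2edc1b41331ec3b24641f2affa67b8dba09
}
```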
…ps output; I believe current best practices suggest an ENV variable to be best. Other ideas?
I noticed the Ruby client is named faktory_worker_ruby, but it has a top-level namespace Faktory. Think I'm going to follow that convention for the Elixir app.
How do you feel about faktory_worker_ex instead of faktory_worker_elixir (and curious why you didn't do faktory_worker_rb)?
Asking the important questions here... ;)
Why is there a hard limit of 60 seconds for reserve_for? We have a lot of small jobs (2-5 seconds each) which need to have a lower reserve_for to tolerate workers going down. Is there any chance of lowering the limit later?
Right now Faktory needs to be shut down to take a backup with faktory-cli. The debug page provides a live backup button but isn't easily scriptable. Faktory will get a bunch of adminy/opsy-type commands as it matures; how do we expose those?
How should someone be able to script a backup? Random thoughts:
- A BACKUP command, or a family of DB [VERB] commands?
- A /var/run/faktory.pipe named pipe which allows access to more adminy things without worrying about network security. Inspeqtor uses a pipe to provide introspection. This would provide a separate avenue for commands; do we need to split worker commands over the network from admin commands over a pipe?
Bachman-Turner Overdrive would probably really enjoy the shout-out, but only if you get those lyrics perfect:
Chorus:
And I'll be taking care of business (every day)
Taking care of business (every way)
I've been taking care of business (it's all mine)
Taking care of business and working overtime, work out
I think the Readme needs an update.
If an application schedules a job using at, and then discovers that the job is no longer necessary (likely due to a user action), it would be convenient to have a way of cancelling that job rather than expending resources on executing it. There are ways to do this out-of-band by having workers consult a "cancelled jobs" table somewhere, but this seems like unfortunate complexity to add to the application when Faktory already has the necessary control structures.
Note that this feature would not let you abort or cancel jobs that are currently running, nor does it give any guarantees that a job will not run. Rather, this is a hint to the system that a given job need not subsequently be given out to workers. We could scope this only to jobs with an at set, but that seems like an unnecessary restriction.
The proposed API call would be something like:
DEQUEUE {"queue": "<queue>", "jid": "<jid>"}
which returns one of
+OK
+gone
queue defaults to default if not given. As with #81, we want to explicitly name the queue to avoid having to search over all queues. See also the Gitter discussion starting at https://gitter.im/contribsys/faktory?at=5a02d85286d308b755c10755
Add a cli prune command to prune everything but the last N backups. See the API in contribsys/gorocksdb.
We might want to put in tunable soft and hard limits on the number of network connections Faktory allows.
To reproduce:
Terminal 1: `make run`
Terminal 2:
# should work fine
go run test/load/main.go 30000 10
# crash!
go run test/load/main.go 30000 500
dial tcp [::1]:7419: socket: too many open files
Hi, First of all, very interesting project 👏
I'm from The Netherlands and when I opened the web GUI it showed me everything in Dutch.
I looked it up, and it seems that it uses the Accept-Language
header to automagically select this.
I don't like my technical tools to be translated in Dutch. I prefer English for that.
Would it be possible to make this optional?
Right now Faktory requires TLS by default if not listening explicitly on "localhost". This ignores modern development practices like containers with host-only networking where the service in the container is listening on non-localhost but is still effectively local.
Instead, ignore the network binding and rely on the environment flag:
- -e development means we don't require TLS by default. We'll still check the TLS directory for certs and use them if possible.
- -e production means we require TLS by default. The user can pass -no-tls; otherwise we exit early if TLS certs aren't found.
How do we ensure that people actively enable the production flag when in production? Should it be the default, forcing people to explicitly enable development?
Clients should use "tcp+tls" for the protocol if they want TLS (tcp+tls://192.168.1.1:7419) or "tcp" for an unencrypted connection. This feels ugly and alternative suggestions are welcome.
Is sharing the same port between TLS and non-TLS sockets a bad idea? Should we be using 7419 for plain sockets and 7443 for TLS sockets, to make the security setup more explicit?
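For what it's worth, the "+" character is legal in URL schemes, so the convention parses cleanly with the standard library. A sketch of client-side handling (the scheme names are the proposal from this issue, not settled Faktory behavior):

```go
package main

import (
	"fmt"
	"net/url"
)

// useTLS parses a Faktory server URL and reports whether the proposed
// "tcp+tls" scheme requests an encrypted connection.
func useTLS(raw string) (host string, tls bool, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", false, err
	}
	return u.Host, u.Scheme == "tcp+tls", nil
}

func main() {
	host, tls, _ := useTLS("tcp+tls://192.168.1.1:7419")
	fmt.Println(host, tls) // 192.168.1.1:7419 true
}
```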
AFAIK Logrus is not a huge dependency, and its author has stated he doesn't have time to support the project anymore, so we should fork the basic functionality into the util package. Remember to preserve the copyright notice to comply with the MIT license.
The best description of the protocol for talking to Faktory we currently have is https://github.com/contribsys/faktory/wiki/Worker-Lifecycle, with bits and pieces in https://github.com/contribsys/faktory/wiki/The-Job-Payload. For anything missing, implementors must consult the Go client source. This is suboptimal.
Faktory should provide an in-tree (not Wiki), detailed description of the protocol, with complete examples of interactions (including the exact Redis RESP-formatted responses). I recommend looking at the IMAP RFC for inspiration (see, for example, the docs for SELECT in §6.3.1). Implementors should not need to look at the Go client source at all. This also means we need to sort out some current disagreements between the Wiki docs and the Go source. For example, what exactly does BEAT return? The docs say:
The response can be OK or a JSON hash with further data for the worker:
{"state":"quiet"}
Whereas the Go client assumes it's always a string:
Line 329 in 2e7a019
Line 325 in 2e7a019
Ideally, we should also supply a set of test cases with trace responses from the server, and expected messages from the client, but that can be another step down the line.
My worker processes persist in the web UI despite having been stopped for hours.
Also, the signals do not persist. For example, when I hit the quiet button in the UI, the next BEAT response will report quiet, but any subsequent BEATs revert back to ok despite the UI still showing it's quiet.
Here's a screenshot: http://storage.stochasticbytes.com.s3.amazonaws.com/4KIdyMeh.png
✌️