Comments (16)
I'll try to collect an end-to-end strace with the whole communication between the worker and the faktory server... I hope to have some results in a couple of hours.
from faktory_worker_python.
Managed to get the strace. Seems like the last command to the faktory server is BEAT. Here is the strace just before the infinite loop:
[pid 4950] sendto(3, "ACK {\"jid\": \"a125ff80a3654fa7bd1906f86651611f\"}\r\n", 49, 0, NULL, 0) = 49
[pid 4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid 4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid 4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid 4950] sendto(3, "FETCH etherbi_decode\r\n", 22, 0, NULL, 0) = 22
[pid 4950] recvfrom(3, "$218\r\n{\"jid\":\"bab858ac524f4027bfcf1c285023b37c\",\"queue\":\"etherbi_decode\",\"jobtype\":\"decode\",\"args\":[229469],\"priority\":5,\"created_at\":\"2018-02-03T01:33:14.101992958Z\",\"enqueued_at\":\"2018-02-03T01:33:14.101999803Z\",\"retry\":5}\r\n", 4096, 0, NULL, NULL) = 226
[pid 4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
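The repeated `recvfrom(...) = 0` results are the telltale sign: a recv() that returns zero bytes means the peer performed an orderly close, and every further recv() on that socket immediately returns zero bytes again. A read loop that doesn't treat the empty read as fatal will spin exactly like the trace above. A minimal sketch of a safe read loop (`read_line` is a hypothetical helper, not the actual faktory_worker_python code):

```python
import socket

def read_line(sock: socket.socket) -> bytes:
    """Read one CRLF-terminated reply, raising once the peer has closed."""
    buf = b""
    while not buf.endswith(b"\r\n"):
        chunk = sock.recv(4096)
        if chunk == b"":
            # An empty read means the remote end closed the connection;
            # retrying recv() will return b"" forever, which is exactly
            # the infinite loop visible in the strace above.
            raise ConnectionError("Faktory server closed the connection")
        buf += chunk
    return buf
```

Raising here turns the silent busy-loop into an exception the worker can act on (reconnect or exit).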
I think it failed earlier: on the 3rd line we send a BEAT and would expect a JSON blob back with a state key, but we just get +OK, which doesn't make sense.
Did anything change when this started? Update to faktory 0.7?
Indeed we upgraded the server to 0.7.0 at the end of last week. We downgraded to faktory 0.6.1 and the workers have been running for over 3 hours with no problems so far. Seems like the upgrade might be the cause of the above issue...
That makes sense then -- I've been running 0.6.1 in production and had no issues. I'll do some testing here with 0.7 and see if I can find a way to reproduce the issue.
Do you know if the elixir worker maintains a separate connection for sending BEAT
s?
I think the elixir client is using a pool of connections, so it will be using a different connection each time: https://github.com/cjbottaro/faktory_worker_ex/blob/60d265deb72de4deab4086b4704826b01265d475/lib/faktory/heartbeat.ex#L30
Also keep in mind that in our setup the elixir workers have been much less utilized than the python workers, so that might be the reason we didn't see any problems. Anecdotally, the elixir workers seem to be more stable so far...
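For comparison, a dedicated heartbeat connection (roughly what the Elixir pool gives you) keeps BEATs flowing even if the work socket wedges. A hypothetical Python sketch — `heartbeat_loop` is not part of faktory_worker_python's API, and the HELLO handshake and reply parsing are omitted:

```python
import json
import socket
import threading
from typing import Optional

def heartbeat_loop(sock: socket.socket, wid: str, interval: float = 15.0,
                   stop: Optional[threading.Event] = None) -> None:
    """Send BEATs on a connection reserved for heartbeats, so that a
    wedged work socket cannot starve them. Hypothetical sketch only."""
    stop = stop or threading.Event()
    try:
        while not stop.wait(interval):  # wake every `interval` seconds
            sock.sendall(("BEAT %s\r\n" % json.dumps({"wid": wid})).encode())
            if sock.recv(4096) == b"":
                # Empty read: server closed the heartbeat connection.
                raise ConnectionError("heartbeat connection closed")
    finally:
        sock.close()
```

Run in a daemon thread; setting the stop event ends the loop cleanly.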
We've been running faktory 0.6.1 all day today with no issues. This is really starting to smell like a regression in 0.7.0... Maybe we should move this issue into the faktory repo?
> I think it failed earlier: on the 3rd line we send a BEAT and would expect a JSON blob back with a state key, but we just get +OK, which doesn't make sense.
The server only sends a JSON state when the state changes; under normal conditions you get an +OK response. The wiki documents this, and I'm pretty sure it didn't change between 0.6.1 and 0.7.0.
https://github.com/contribsys/faktory/wiki/Worker-Lifecycle#heartbeat
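In other words, a client should accept both reply shapes. A sketch of that dispatch, assuming the state change arrives as a '+'-prefixed simple string wrapping a JSON blob (the exact wire framing is an assumption here, not something confirmed in this thread):

```python
import json
from typing import Optional

def parse_beat_reply(reply: str) -> Optional[str]:
    """Return None for the normal +OK heartbeat reply, or the requested
    worker state (e.g. "quiet" or "terminate") when the server pushes one.
    The '+'-prefixed JSON framing for state changes is an assumption."""
    body = reply.rstrip("\r\n")
    if body == "+OK":
        return None  # nothing changed; keep fetching jobs
    # Otherwise the reply carries a JSON blob with a "state" key.
    return json.loads(body.lstrip("+"))["state"]
```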
Here is a bit more from the log. Seems like BEAT is indeed returning +OK\r\n. The question is why the last BEAT caused the connection to fall into this infinite state:
[pid 4950] sendto(3, "FETCH etherbi_decode\r\n", 22, 0, NULL, 0) = 22
[pid 4950] recvfrom(3, "$218\r\n{\"jid\":\"a125ff80a3654fa7bd1906f86651611f\",\"queue\":\"etherbi_decode\",\"jobtype\":\"decode\",\"args\":[229415],\"priority\":5,\"created_at\":\"2018-02-03T01:31:32.598003581Z\",\"enqueued_at\":\"2018-02-03T01:31:32.598009936Z\",\"retry\":5}\r\n", 4096, 0, NULL, NULL) = 226
[pid 4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid 4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid 4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid 4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid 4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid 4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid 4950] sendto(3, "ACK {\"jid\": \"a125ff80a3654fa7bd1906f86651611f\"}\r\n", 49, 0, NULL, 0) = 49
[pid 4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid 4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid 4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid 4950] sendto(3, "FETCH etherbi_decode\r\n", 22, 0, NULL, 0) = 22
[pid 4950] recvfrom(3, "$218\r\n{\"jid\":\"bab858ac524f4027bfcf1c285023b37c\",\"queue\":\"etherbi_decode\",\"jobtype\":\"decode\",\"args\":[229469],\"priority\":5,\"created_at\":\"2018-02-03T01:33:14.101992958Z\",\"enqueued_at\":\"2018-02-03T01:33:14.101999803Z\",\"retry\":5}\r\n", 4096, 0, NULL, NULL) = 226
[pid 4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid 4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
For sure we are not seeing this behavior with faktory 0.6.1. We've been rolling with it for the past 2 days, with everything else the same, and no problems so far.
@mperham yes, you're right, my bad. The BEAT / +OK is fine.
@valo do you have multiple workers running and if you do, do they all fail at the same time?
I have 40-50 workers and they gradually start to fail. I think restarting a worker returns it to a good state. In that sense it is not like the server is going into a foobar state, since I don't need to restart the server; I only restart the workers to keep the queue progressing.
I can reproduce what I think is the same bug by starting a worker and then killing faktory -- that sends the worker into an infinite loop of read()ing from a dead socket.
I'm surprised this doesn't raise an exception.
Did this ever come up again after changing to 0.6? I am also running 0.7.0 with python_worker 0.3 and facing a similar situation where the worker hangs and stops fetching jobs.
It isn't an issue with Faktory 0.6 vs 0.7 -- the problem is that if the connection between the worker and Faktory is interrupted, the worker doesn't reconnect (or exit).
It seems like there is a bug in faktory 0.7 that causes the connection to close from time to time (it doesn't happen on 0.6), but I haven't been able to pin that down, and the worker needs to reconnect properly anyway.
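Whatever the server-side trigger, the robust fix on the worker side is a supervision loop that tears down the socket on any connection error and reconnects with backoff, instead of spinning on a dead file descriptor. A hypothetical sketch (`run_with_reconnect` is not part of faktory_worker_python's API; `connect` and `work` stand in for the real connection setup and fetch loop):

```python
import random
import time

def run_with_reconnect(connect, work,
                       initial_backoff: float = 1.0,
                       max_backoff: float = 30.0) -> None:
    """Supervision loop: on any connection error, drop the connection and
    retry with jittered exponential backoff; return only when work()
    finishes cleanly (e.g. shutdown was requested)."""
    backoff = initial_backoff
    while True:
        try:
            conn = connect()           # e.g. open socket + HELLO handshake
            backoff = initial_backoff  # reset after a successful connect
            work(conn)                 # runs FETCH/ACK/BEAT until the link drops
            return                     # clean shutdown
        except OSError:  # covers ConnectionError and socket errors
            time.sleep(backoff + random.uniform(0, backoff / 2))
            backoff = min(backoff * 2, max_backoff)
```

The jitter avoids 40-50 workers hammering a restarted server at the same instant.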
Just want to bump this issue. I'm experiencing a hung worker if the Faktory server dies/restarts. The worker does not resume after the Faktory server comes back up. This is with Faktory 0.8.0 and faktory_worker_python 0.4.0 (installed via pip today). Thanks for anything you can do to resolve this. Let me know if there's any testing I can do to help.