I have a single threaded worker defined as: <div class="highlight highlight-source

Worker doesn't reconnect to Faktory if the connection is reset about faktory_worker_python HOT 16 CLOSED

cdrx commented on July 2, 2024

Worker doesn't reconnect to Faktory if the connection is reset

from faktory_worker_python.

Comments (16)

valo commented on July 2, 2024

I'll try to collect an end-to-end strace with the whole communication between the worker and the faktory server... I hope to have some results in a couple of hours.

from faktory_worker_python.

valo commented on July 2, 2024

Managed to get the strace. Seems like the last command to the faktory server is BEAT. Here is the strace just before the infinite loop:

[pid  4950] sendto(3, "ACK {\"jid\": \"a125ff80a3654fa7bd1906f86651611f\"}\r\n", 49, 0, NULL, 0) = 49
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "FETCH etherbi_decode\r\n", 22, 0, NULL, 0) = 22
[pid  4950] recvfrom(3, "$218\r\n{\"jid\":\"bab858ac524f4027bfcf1c285023b37c\",\"queue\":\"etherbi_decode\",\"jobtype\":\"decode\",\"args\":[229469],\"priority\":5,\"created_at\":\"2018-02-03T01:33:14.101
992958Z\",\"enqueued_at\":\"2018-02-03T01:33:14.101999803Z\",\"retry\":5}\r\n", 4096, 0, NULL, NULL) = 226
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0

from faktory_worker_python.

cdrx commented on July 2, 2024

I think it failed before, on the 3rd line we send a BEAT and would expect a json blob back with a state key, but we just get +OK which doesn't make sense.

Did anything change when this started? Update to faktory 0.7?

from faktory_worker_python.

valo commented on July 2, 2024

Indeed we upgraded the server to 0.7.0 at end of last week. We downgraded to faktory 0.6.1 and the workers have been running for over 3 hours. So far no problems. Seems like the upgrade might be the cause for the above issue...

from faktory_worker_python.

cdrx commented on July 2, 2024

That makes sense then -- I've been running 0.6.1 in production and had no issues. I'll do some testing here with 0.7 and see if I can find a way to reproduce the issue.

Do you know if the elixir worker maintains a separate connection for sending BEATs?

from faktory_worker_python.

valo commented on July 2, 2024

I think the elixir client is using a pool of connections, so it will be using different connection each time https://github.com/cjbottaro/faktory_worker_ex/blob/60d265deb72de4deab4086b4704826b01265d475/lib/faktory/heartbeat.ex#L30

Also keep in mind that in our setup the elixir workers have been much less utilized than the python workers, so that might be the reason we didn't see any problems. Anecdotally, the elixir workers seem to be more stable so far...

from faktory_worker_python.

valo commented on July 2, 2024

Has been running faktory 0.6.1 all day today and no issues. This really starts to smell like a regression in 0.7.0... May be we should move this issue into the faktory repo?

from faktory_worker_python.

mperham commented on July 2, 2024

I think it failed before, on the 3rd line we send a BEAT and would expect a json blob back with a state key, but we just get +OK which doesn't make sense.

The server only sends a JSON state when the state changes. Normal conditions will get an OK response. The wiki documents this and I'm pretty sure it didn't change between 061 and 070.

https://github.com/contribsys/faktory/wiki/Worker-Lifecycle#heartbeat

from faktory_worker_python.

valo commented on July 2, 2024

Here is a bit more from the log. Seems like BEAT is indeed returning +OK\r\n. The question is: why the last BEAT caused the connection to fall into this infinite state:

[pid  4950] sendto(3, "FETCH etherbi_decode\r\n", 22, 0, NULL, 0) = 22
[pid  4950] recvfrom(3, "$218\r\n{\"jid\":\"a125ff80a3654fa7bd1906f86651611f\",\"queue\":\"etherbi_decode\",\"jobtype\":\"decode\",\"args\":[229415],\"priority\":5,\"created_at\":\"2018-02-03T01:31:32.598003581Z\",\"enqueued_at\":\"2018-02-03T01:31:32.598009936Z\",\"retry\":5}\r\n", 4096, 0, NULL, NULL) = 226
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "ACK {\"jid\": \"a125ff80a3654fa7bd1906f86651611f\"}\r\n", 49, 0, NULL, 0) = 49
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "FETCH etherbi_decode\r\n", 22, 0, NULL, 0) = 22
[pid  4950] recvfrom(3, "$218\r\n{\"jid\":\"bab858ac524f4027bfcf1c285023b37c\",\"queue\":\"etherbi_decode\",\"jobtype\":\"decode\",\"args\":[229469],\"priority\":5,\"created_at\":\"2018-02-03T01:33:14.101992958Z\",\"enqueued_at\":\"2018-02-03T01:33:14.101999803Z\",\"retry\":5}\r\n", 4096, 0, NULL, NULL) = 226
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0

from faktory_worker_python.

valo commented on July 2, 2024

For sure we are not seeing this behavior with faktory 0.6.1. We've been rolling with it for the past 2 days, while everything else is the same and not problems so far.

from faktory_worker_python.

cdrx commented on July 2, 2024

@mperham yes, your right, my bad. The BEAT / +OK is fine.

@valo do you have multiple workers running and if you do, do they all fail at the same time?

from faktory_worker_python.

valo commented on July 2, 2024

I have 40-50 workers and they gradually start to fail. I think restarting a worker returns it to a good state. In that sense it is not like the server is going into foobar state, as I don't need to restart the server. I only restart the workers to keep the queue progressing.

from faktory_worker_python.

cdrx commented on July 2, 2024

I can reproduce what I think is the same bug, by starting a worker and then killing faktory -- that sends the worker into an infinite loop of read()ing from a dead socket.

I'm surprised this doesn't raise an exception

from faktory_worker_python.

shauryashahi commented on July 2, 2024

Did this ever come up again after changing to 0.6 ?? ... I am also running 0.7.0 with python_worker 0.3 and facing a similar situation where worker hangs up and stops fetching jobs.

from faktory_worker_python.

cdrx commented on July 2, 2024

It isn't an issue with Faktory 0.6 vs 0.7 -- the problem is that if the connection between the worker and Faktory is interrupted, then the worker doesn't reconnect (or exit).

It seems like there is a bug with faktory 0.7 that causes the connection to close from time to time (doesn't happen on 0.6) but I haven't been able to pin that down and the worker needs to reconnect properly anyway.

from faktory_worker_python.

jakegatsby commented on July 2, 2024

Just want to bump this issue. I'm experiencing a hung worker if the Faktory server dies/restarts. The worker does not resume after the Faktory server comes back up. This is with Faktory 0.8.0 and faktory_worker_python 0.4.0 (installed via pip today). Thanks for anything you can do to resolve this. Let me know if there's any testing I can do to help.

from faktory_worker_python.

Worker doesn't reconnect to Faktory if the connection is reset about faktory_worker_python HOT 16 CLOSED

Comments (16)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent