Code Monkey home page Code Monkey logo

Comments (16)

valo avatar valo commented on July 2, 2024

I'll try to collect an end-to-end strace with the whole communication between the worker and the faktory server... I hope to have some results in a couple of hours.

from faktory_worker_python.

valo avatar valo commented on July 2, 2024

Managed to get the strace. Seems like the last command to the faktory server is BEAT. Here is the strace just before the infinite loop:

[pid  4950] sendto(3, "ACK {\"jid\": \"a125ff80a3654fa7bd1906f86651611f\"}\r\n", 49, 0, NULL, 0) = 49
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "FETCH etherbi_decode\r\n", 22, 0, NULL, 0) = 22
[pid  4950] recvfrom(3, "$218\r\n{\"jid\":\"bab858ac524f4027bfcf1c285023b37c\",\"queue\":\"etherbi_decode\",\"jobtype\":\"decode\",\"args\":[229469],\"priority\":5,\"created_at\":\"2018-02-03T01:33:14.101
992958Z\",\"enqueued_at\":\"2018-02-03T01:33:14.101999803Z\",\"retry\":5}\r\n", 4096, 0, NULL, NULL) = 226
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0

from faktory_worker_python.

cdrx avatar cdrx commented on July 2, 2024

I think it failed before, on the 3rd line we send a BEAT and would expect a json blob back with a state key, but we just get +OK which doesn't make sense.

Did anything change when this started? Update to faktory 0.7?

from faktory_worker_python.

valo avatar valo commented on July 2, 2024

Indeed we upgraded the server to 0.7.0 at end of last week. We downgraded to faktory 0.6.1 and the workers have been running for over 3 hours. So far no problems. Seems like the upgrade might be the cause for the above issue...

from faktory_worker_python.

cdrx avatar cdrx commented on July 2, 2024

That makes sense then -- I've been running 0.6.1 in production and had no issues. I'll do some testing here with 0.7 and see if I can find a way to reproduce the issue.

Do you know if the elixir worker maintains a separate connection for sending BEATs?

from faktory_worker_python.

valo avatar valo commented on July 2, 2024

I think the elixir client is using a pool of connections, so it will be using different connection each time https://github.com/cjbottaro/faktory_worker_ex/blob/60d265deb72de4deab4086b4704826b01265d475/lib/faktory/heartbeat.ex#L30

Also keep in mind that in our setup the elixir workers have been much less utilized than the python workers, so that might be the reason we didn't see any problems. Anecdotally, the elixir workers seem to be more stable so far...

from faktory_worker_python.

valo avatar valo commented on July 2, 2024

Has been running faktory 0.6.1 all day today and no issues. This really starts to smell like a regression in 0.7.0... May be we should move this issue into the faktory repo?

from faktory_worker_python.

mperham avatar mperham commented on July 2, 2024

I think it failed before, on the 3rd line we send a BEAT and would expect a json blob back with a state key, but we just get +OK which doesn't make sense.

The server only sends a JSON state when the state changes. Normal conditions will get an OK response. The wiki documents this and I'm pretty sure it didn't change between 061 and 070.

https://github.com/contribsys/faktory/wiki/Worker-Lifecycle#heartbeat

from faktory_worker_python.

valo avatar valo commented on July 2, 2024

Here is a bit more from the log. Seems like BEAT is indeed returning +OK\r\n. The question is: why the last BEAT caused the connection to fall into this infinite state:

[pid  4950] sendto(3, "FETCH etherbi_decode\r\n", 22, 0, NULL, 0) = 22
[pid  4950] recvfrom(3, "$218\r\n{\"jid\":\"a125ff80a3654fa7bd1906f86651611f\",\"queue\":\"etherbi_decode\",\"jobtype\":\"decode\",\"args\":[229415],\"priority\":5,\"created_at\":\"2018-02-03T01:31:32.598003581Z\",\"enqueued_at\":\"2018-02-03T01:31:32.598009936Z\",\"retry\":5}\r\n", 4096, 0, NULL, NULL) = 226
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "ACK {\"jid\": \"a125ff80a3654fa7bd1906f86651611f\"}\r\n", 49, 0, NULL, 0) = 49
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "+OK\r\n", 4096, 0, NULL, NULL) = 5
[pid  4950] sendto(3, "FETCH etherbi_decode\r\n", 22, 0, NULL, 0) = 22
[pid  4950] recvfrom(3, "$218\r\n{\"jid\":\"bab858ac524f4027bfcf1c285023b37c\",\"queue\":\"etherbi_decode\",\"jobtype\":\"decode\",\"args\":[229469],\"priority\":5,\"created_at\":\"2018-02-03T01:33:14.101992958Z\",\"enqueued_at\":\"2018-02-03T01:33:14.101999803Z\",\"retry\":5}\r\n", 4096, 0, NULL, NULL) = 226
[pid  4950] sendto(3, "BEAT {\"wid\": \"3ff3d2d33f254bd4b3a2e0c25c7b6c53\"}\r\n", 50, 0, NULL, 0) = 50
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0
[pid  4950] recvfrom(3, "", 4096, 0, NULL, NULL) = 0

from faktory_worker_python.

valo avatar valo commented on July 2, 2024

For sure we are not seeing this behavior with faktory 0.6.1. We've been rolling with it for the past 2 days, while everything else is the same and not problems so far.

from faktory_worker_python.

cdrx avatar cdrx commented on July 2, 2024

@mperham yes, your right, my bad. The BEAT / +OK is fine.

@valo do you have multiple workers running and if you do, do they all fail at the same time?

from faktory_worker_python.

valo avatar valo commented on July 2, 2024

I have 40-50 workers and they gradually start to fail. I think restarting a worker returns it to a good state. In that sense it is not like the server is going into foobar state, as I don't need to restart the server. I only restart the workers to keep the queue progressing.

from faktory_worker_python.

cdrx avatar cdrx commented on July 2, 2024

I can reproduce what I think is the same bug, by starting a worker and then killing faktory -- that sends the worker into an infinite loop of read()ing from a dead socket.

I'm surprised this doesn't raise an exception

from faktory_worker_python.

shauryashahi avatar shauryashahi commented on July 2, 2024

Did this ever come up again after changing to 0.6 ?? ... I am also running 0.7.0 with python_worker 0.3 and facing a similar situation where worker hangs up and stops fetching jobs.

from faktory_worker_python.

cdrx avatar cdrx commented on July 2, 2024

It isn't an issue with Faktory 0.6 vs 0.7 -- the problem is that if the connection between the worker and Faktory is interrupted, then the worker doesn't reconnect (or exit).

It seems like there is a bug with faktory 0.7 that causes the connection to close from time to time (doesn't happen on 0.6) but I haven't been able to pin that down and the worker needs to reconnect properly anyway.

from faktory_worker_python.

jakegatsby avatar jakegatsby commented on July 2, 2024

Just want to bump this issue. I'm experiencing a hung worker if the Faktory server dies/restarts. The worker does not resume after the Faktory server comes back up. This is with Faktory 0.8.0 and faktory_worker_python 0.4.0 (installed via pip today). Thanks for anything you can do to resolve this. Let me know if there's any testing I can do to help.

from faktory_worker_python.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.