Comments (22)

alexandrachakeres-wk commented on May 24, 2024

We've been having this exact issue. Since upgrading Ruby from 3.1.2 to 3.3.0 about a month ago, resque workers have been dropping to ~0% CPU and stopping most of their logging, but not dying. They stopped working one by one over the course of a couple of days. Bouncing the service restores them for a while, but they eventually stop working again.

In every pod whose logs I've checked, the last two real things logged are similar to the following (it always ends with Running before_fork hooks with..., just with different jobs/args):

got: (Job{schedule} | ScheduleJob | [195293])
Running before_fork hooks with [(Job{schedule} | ScheduleJob | [195293])]

Normally we'd next see something like Running after_fork hooks with [(Job{schedule} | ScheduleJob | [195293])] logged, but it never gets there. FYI, our code doesn't have any Resque.before_fork or Resque.after_fork blocks (as documented in the resque docs here). So it sounds like something could be going wrong in the default before_fork hook, considering the docs say "The before_fork hook will be run in the parent process. So, be careful - any changes you make will be permanent for the lifespan of the worker." Or maybe (more likely) something goes wrong during the fork itself. I would be curious whether others also saw Running before_fork hooks with... as the final real log before things went basically silent; that could help narrow in on a fix.
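
A minimal sketch of the kind of hook logging that could confirm how far the worker gets (the log messages and file name are illustrative; the hooks are the documented worker-level hooks):

# e.g. config/initializers/resque_hook_logging.rb, loaded by the workers
require "resque"

Resque.before_fork do |job|
  # Runs in the parent worker process, right before it forks for each job.
  Resque.logger.info "Resque hook: before_fork with #{job.inspect}"
end

Resque.after_fork do |job|
  # Runs in the forked child, just before the job is performed.
  Resque.logger.info "Resque hook: after_fork with #{job.inspect}"
end

If the after_fork line never appears, the hang is somewhere between the fork call and the child starting to run.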

Versions:

  • ruby: 3.3.0
  • resque: 2.6.0
  • rails: 6.1.7.6
  • redis gem: 4.6.0
  • redis: 5.0.6
  • We do not use the resque-pool gem

Update: We are no longer seeing lazy (0% CPU) workers after downgrading to Ruby 3.2.3. Obviously this is not a long-term solution though.

ChrisCPO commented on May 24, 2024

Small Update

Historically we have had some workers die: they freeze and also lose their heartbeat, which then expires. We check for expired workers every few minutes and kill them. Thanks to resque-retry those jobs are failed and a new worker performs them; not perfect, but it works.

However, with ruby 3.3 these workers never lose their heartbeat and thus never appear dead; they seem to "run" until their dyno/server is restarted. Let's call these ghost workers: they are neither alive nor dead. We also noticed what might be an increased number of ghost workers, though at the time we were running a large number of background workers.

@rexmadden, @leboshi, have you had a similar experience?

Seems like there is a historical issue here that has mutated with ruby 3.3.

leboshi commented on May 24, 2024

We didn't get a chance to dive into a diagnosis as deeply as you did, but the symptoms are consistent with yours. Workers would run a few jobs and then just hang forever. On the /jobs page, we'd see the workers reporting that they were still processing the previous job they had picked up, even hours later. We got tired of manually rebooting workers and just downgraded to Ruby 3.2.3. Obviously not tenable long-term.

nevans commented on May 24, 2024

I also haven't upgraded anything significant to ruby 3.3 in production yet (I've been waiting for the 3.3.1 release, which will fix some bugs that affect my codebases). So I can't really offer anything but wild guesses at this point, but...

Unfortunately, (if I remember correctly) the heartbeat runs in a separate thread in the worker process, so it's only good for knowing if the worker process itself has frozen. It can't tell you if the code in the main thread has hung indefinitely or entered an infinite loop. And (again, IIRC) the heartbeat only runs in the parent worker process, and not in the forked child process that works each job. The parent simply waits on the child with Process.waitpid(@child).

If (hypothetically) in ruby 3.2 some system call or other VM code could hang without a timeout while holding the global VM lock, and ruby 3.3 improved that code to no longer hold the global VM lock but it still hangs without a timeout, then that might explain your symptoms. This is just a wild guess of a hypothesis... but a quick git log -G waitpid in the ruby repo does indicate that changes were made to Process.waitpid between 3.2 and 3.3, and it's probably not related to this bug. But... maybe?

Have you tried running with ENV["FORK_PER_JOB"] = "false" on a percentage of your workers?

If you have the ability to point rbspy, rbtrace, and/or strace at your stuck processes, that could be very informative. Even simpler, sometimes just looking at the proclines (via something like ps faux | grep resque | grep -v grep) can be useful or revealing.

ChrisCPO commented on May 24, 2024

We have not tried running 3.3.1 on any production apps; I assume it would still be an issue. We might have some time to test on a staging app soon.

tylerCaineRhodes commented on May 24, 2024

We've been experiencing similar issues with frozen workers ever since upgrading to ruby 3.3.0:

Versions

  • ruby: 3.3.0
  • resque: 2.6.0
  • resque-pool: 0.7.1
  • rails: 7.1.3

PatrickTulskie commented on May 24, 2024

Just wanted to put an update here since I have been looking into this in the background, but also I'm not running Ruby 3.3 anywhere in production yet so I haven't been able to triage at scale.

I could use a little help in nailing down the exact behavior here, though, along with any logs or other info that people can provide. Specifically, are people seeing the actual forked workers die off/exit? Or are they still running but just no longer receiving jobs? Lastly, what versions of the redis server and gem are you running?

alexandrachakeres-wk commented on May 24, 2024

Thanks for taking a look! I just added some more version details to my comment above. We did not see the resqueworker service pods exit. They kept running but stopped processing jobs. If I remember correctly, we did continue to see some random logging (unrelated to processing jobs). Not sure if that answers your question.

ChrisCPO commented on May 24, 2024

I accidentally closed this and reopened it.

@PatrickTulskie What we noticed was that jobs that should take a few seconds were taking 45+ minutes, and the worker simply never stopped working that job. We also saw no logs from those workers, and we have extensive logging around our jobs. The only logs we got were from the enqueuer, before_enqueue and after_enqueue, which I think run in the enqueueing code (Resque.enqueue(...) etc.).

We often get workers similar to this, but those workers are missing a heartbeat. We have an outside task that detects these and runs Resque::Worker.all_workers_with_expired_heartbeats.each { |w| w.unregister_worker }, which kills the worker and re-enqueues the job for us.
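
A sketch of what such a cleanup task could look like (the task name and logging are illustrative, not the actual task):

namespace :resque do
  desc "Unregister workers whose heartbeats have expired"
  task reap_expired_workers: :environment do
    Resque::Worker.all_workers_with_expired_heartbeats.each do |worker|
      Resque.logger.warn "Unregistering worker with expired heartbeat: #{worker}"
      # unregister_worker reports the in-flight job as failed, so resque-retry can re-enqueue it
      worker.unregister_worker
    end
  end
end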

However, in this case these workers still had active heartbeats and their state was working. One of the console commands we ran was Resque::Worker.all.map { |w| [w.state, Time.now, w.heartbeat, Time.now - w.heartbeat, w.started] }. It's like these workers were "alive" but not doing anything.

When we got rid of these workers, either by looking up their IDs and unregistering them or by restarting our worker dynos, we noticed that their in-flight jobs vanished; they did not end up in the failed queue as expected.

tylerCaineRhodes commented on May 24, 2024

However, in this case these workers still had active heartbeats and their state was working. One of the console commands we ran was Resque::Worker.all.map { |w| [w.state, Time.now, w.heartbeat, Time.now - w.heartbeat, w.started] }. It's like these workers were "alive" but not doing anything.

This. The workers will not be marked as having expired heartbeats and will claim to still be working, but they'll be idle... We can also see via the logs that a particular job for that worker has finished, but the worker stays in this idle state anyway, not picking anything else up and still "working".

irb (main) > Resque::Worker.all_workers_with_expired_heartbeats
[]

irb (main) > some_stalled_worker.working?
true

Unfortunately we've resorted to migrating to Sidekiq for some business critical cron jobs.

agustin commented on May 24, 2024

Hey, is this still an issue being experienced on ruby 3.3.1? Has anyone tested it in production?

ozydingo commented on May 24, 2024

We are hitting what appears to be this issue, running ruby 3.3.1. I haven't tried to instrument additional logging on the fork hooks yet, though from the OP's comment it sounds like I won't see those logs anyway. We are getting workers that are registered as running job X, but that have none of the application-code logs we expect to see from that job, and are evidently idling, doing nothing. The last logs from the hung workers are from the previous job, which we have evidence came to a successful completion without errors.

In case it's worth any kind of hint, we had a similarly presenting issue previously that related to Resque workers hanging after failure, due to a combination of ddtrace and airbrake, and Airbrake's change in 6.2 to not use Thread#enclose is what solved that immediate problem. Otherwise, the problem manifested very similarly, where a worker would hang out idling.

ozydingo commented on May 24, 2024

After adding logging to the various available hooks, I have a bit more info in response to @PatrickTulskie's and @nevans's questions:

Context:

  • Ruby 3.3.1p55
  • Gems: resque 2.6.0, redis 4.8.1, rails 7.1.3.2
  • Redis server: 6.2 AWS Elasticache

Specific scenario:

  • We're running resque workers in a kubernetes setup, with one worker per pod
  • The affected pod / worker I am looking at is worker-auto-order-5688d5ccf-jph4w

We can see the worker in question has been running for hours.

{now: Time.zone.now.utc, state: worker.state, started: worker.started, hostname: worker.hostname, job: worker.job}
=>
{:now=>2024-05-16 02:18:18.731606145 UTC,
 :state=>:working,
 :started=>"2024-05-15 20:17:47 +0000",
 :hostname=>"worker-auto-order-5688d5ccf-jph4w",
 :job=>
  {"queue"=>"auto-order",
   "run_at"=>"2024-05-15T23:11:51Z",
   "payload"=>{"class"=>"AutoOrderVpi", "args"=>[671467342, 8846, 15]}}}

Logs from this worker filtered to the hook messages show the last two jobs (671467333, 671467337) forking, starting, and completing, but no logs whatsoever from the current job (671467342):

$ kubectl logs worker-auto-order-5688d5ccf-jph4w | grep "Resque hook" | tail -8

Defaulted container "worker-auto-order" out of: worker-auto-order, download-asset-manifest (init)
I, [2024-05-15T23:11:47.834201 #1]  INFO -- : Resque hook: before_fork
I, [2024-05-15T23:11:47.858373 #77212]  INFO -- : Resque hook: after_fork with job {"class"=>"AutoOrderVpi", "args"=>[671467333, 9392, 15]}
I, [2024-05-15T23:11:47.863781 #77212]  INFO -- : Resque hook: before_perform; AutoOrderVpi; [671467333, 9392, 15]
I, [2024-05-15T23:11:49.029556 #77212]  INFO -- : Resque hook: after_perform; AutoOrderVpi; [671467333, 9392, 15]
I, [2024-05-15T23:11:49.372468 #1]  INFO -- : Resque hook: before_fork
I, [2024-05-15T23:11:49.388899 #77281]  INFO -- : Resque hook: after_fork with job {"class"=>"AutoOrderVpi", "args"=>[671467337, 2615, 15]}
I, [2024-05-15T23:11:49.393569 #77281]  INFO -- : Resque hook: before_perform; AutoOrderVpi; [671467337, 2615, 15]
I, [2024-05-15T23:11:50.751182 #77281]  INFO -- : Resque hook: after_perform; AutoOrderVpi; [671467337, 2615, 15]

The unfiltered logs show that this after_perform is the last log line for this worker before hours of silence:

$ kubectl logs worker-auto-order-5688d5ccf-jph4w | tail -5

Defaulted container "worker-auto-order" out of: worker-auto-order, download-asset-manifest (init)
(77281) Auto-Order [ **REDACTED** ] -- Performing auto-order!
/home/runner/bundle/ruby/3.3.0/gems/curb-0.9.11/lib/curl/easy.rb:67: warning: undefining the allocator of T_DATA class Curl::Multi
(77281) Auto-Order [ **REDACTED** ] -- 0 new orders found.
(77281) Auto-Order [ **REDACTED** ] -- Fininshed with [ **REDACTED** ]
I, [2024-05-15T23:11:50.751182 #77281]  INFO -- : Resque hook: after_perform; AutoOrderVpi; [671467337, 2615, 15]

Notice that the last log line from this worker is immediately before the current reported job.run_at time of 2024-05-15T23:11:51Z.

The worker does not have an expired heartbeat

Resque::Worker.all_workers_with_expired_heartbeats
=> []

And indeed is still sending active heartbeats

heartbeats = Resque::Worker.all_heartbeats.select { |h| h.split(":").first == ENV["POD_NAME"] }; {now: Time.zone.now.utc, heatbeats: heartbeats }
=>
{:now=>2024-05-16 02:21:28.461478023 UTC,
 :heatbeats=>{"worker-auto-order-5688d5ccf-jph4w:1:auto-order-fast,auto-order"=>"2024-05-16T02:20:48+00:00"}}

ps output:

$ kubectl exec worker-auto-order-5688d5ccf-jph4w -- ps auxf

Defaulted container "worker-auto-order" out of: worker-auto-order, download-asset-manifest (init)
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
runner     80897  0.0  0.0   8648  3228 ?        Rs   02:26   0:00 ps auxf
runner         1  2.9  0.1 2365384 373608 ?      Ssl  May15  11:02 resque-2.6.0: Processing auto-order since 1715814711 [AutoOrderVpi]
runner     77315  0.0  0.1 2512380 355348 ?      S    May15   0:00 resque-2.6.0: Processing auto-order since 1715814711 [AutoOrderVpi]

Unfortunately I can't get strace or rbspy output, I believe because the docker container doesn't have sufficient privileges. The following errors are the same even if I run as the root user, which I am able to access.

$ strace -f -p 77315
strace: attach: ptrace(PTRACE_SEIZE, 77315): Operation not permitted

$ rbspy record --pid 77315
rbspy is recording traces. Press Ctrl+C to stop.


Time since start: 1s. Press Ctrl+C to stop.
Summary of profiling data so far:
% self  % total  name
Warning: no profile samples were collected
Wrote raw data to /home/runner/.cache/rbspy/2024-05-16-VNxnAu1e37.raw.gz
Wrote formatted output to /home/runner/.cache/rbspy/2024-05-16-1YGNUU5IEQ.flamegraph.svg
Something went wrong while rbspy was sampling the process. Here's what we know:
- new spy
- get ruby VM state
- Failed to read Ruby version symbol
- Permission denied (os error 13)
- Permission denied (os error 13)

That's about as much as I can get from this pod, which will likely get cycled before additional responses here. I'm not fully opposed to running a limited experiment with added capabilities on a limited set of pods to get rbspy and/or strace, though doing so is uncharted territory for me at this time.

To summarize:

  • A single worker has been working for hours
  • It completed its last job successfully, hooks and all, then was assigned a new job
  • No hooks from the new job were observed running, no code from the new job was observed, and no logs were emitted after the previous job completed.
  • The worker had been silent for hours when I found it in this state, still sending heartbeats and still allegedly (but definitely not actually) processing this new job.
  • The worker state is :working, with both parent and child processes reporting "Processing auto-order since 1715814711"; more than 3 hours ago for a job that takes minutes at most.
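
Since the expired-heartbeat check is useless here, one crude way to flag these ghost workers from a console, using only the fields shown above (the 3-hour threshold is arbitrary, and 3.hours.ago assumes ActiveSupport is loaded):

require "time"

ghost_workers = Resque.workers.select do |w|
  job = w.job
  # A worker that claims to be :working on a job it picked up hours ago is suspect.
  w.state == :working && job["run_at"] && Time.parse(job["run_at"]) < 3.hours.ago
end
ghost_workers.map(&:to_s)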

PatrickTulskie commented on May 24, 2024

@ozydingo thanks for that. Interesting that it just goes silent even though it's sending heartbeats.

For those of you experiencing this issue, are you able to consistently reproduce it or is it sporadic and doesn't always happen? I'm just trying to think of how to create a test setup for tracing where we can reproduce the bug. Something like a worker that is just constantly getting jobs where it sleeps for n seconds or something until it breaks.

ozydingo commented on May 24, 2024

Unfortunately it's inconsistent. Our system seems to go for hours without issue, then things start going south. And in the case of the worker I documented here, this is in a deployment of usually 5 worker pods processing a pretty constant churn of polling jobs (consistent with the previous job logs shown), so it doesn't seem to be related to any period of idleness. I wish I could give you more.

And I'm definitely stumped: from what I can tell from the above, we have hit lines 271 and 272 of worker.rb, such that the payload and procline are set. fork_per_job? is definitively true, so then we'd call perform_with_fork and therefore run_hook :before_fork, job, and I can't really see anything in that path that would just... stop or fail in any way prior to actually running my logging before_fork hook.

ozydingo commented on May 24, 2024

I was able to get strace and rbspy working, and got another example to look at. Since it's a bit buried, I'll highlight here the bit below about friendly_errors, exit, and RUN_AT_EXIT_HOOKS.

ps:

2024-05-16 11:44:32 worker-auto-order (prod):/app
$ date
Thu May 16 11:44:35 UTC 2024
2024-05-16 11:44:35 worker-auto-order (prod):/app
$ ps -auxf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
runner      4498  0.3  0.0   9880  7560 pts/0    Ss   11:44   0:00 /bin/bash
runner      4620  0.0  0.0   8652  3308 pts/0    R+   11:44   0:00  \_ ps -auxf
runner         1  2.2  0.1 2287068 368288 ?      Ssl  09:59   2:20 resque-2.6.0: Forked 3921 at 1715854204
runner      3921  0.0  0.1 2393472 341108 ?      S    10:10   0:00 resque-2.6.0: Processing auto-order since 1715854203

The parent process's procline says "Forked" here instead of "Processing", unlike the previous example.

rbspy:

$ sudo /home/runner/.cargo/bin/rbspy record -p 3921

Time since start: 78s. Press Ctrl+C to stop.
Summary of profiling data so far:
% self  % total  name
100.00   100.00  _fork [c function] - (unknown)
  0.00   100.00  work_one_job - /home/runner/bundle/ruby/3.3.0/gems/resque-2.6.0/lib/resque/worker.rb:285
  0.00   100.00  work - /home/runner/bundle/ruby/3.3.0/gems/resque-2.6.0/lib/resque/worker.rb:265
  0.00   100.00  with_friendly_errors - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/friendly_errors.rb:125
  0.00   100.00  top_level - /home/runner/bundle/ruby/3.3.0/gems/rake-13.1.0/lib/rake/application.rb:141
  0.00   100.00  synchronize [c function] - (unknown)
  0.00   100.00  start - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/vendor/thor/lib/thor/base.rb:495
  0.00   100.00  start - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/cli.rb:31
  0.00   100.00  standard_exception_handling - /home/runner/bundle/ruby/3.3.0/gems/rake-13.1.0/lib/rake/application.rb:219
  0.00   100.00  run_with_threads - /home/runner/bundle/ruby/3.3.0/gems/rake-13.1.0/lib/rake/application.rb:157
  0.00   100.00  run - /home/runner/bundle/ruby/3.3.0/gems/rake-13.1.0/lib/rake/application.rb:85
  0.00   100.00  run - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/vendor/thor/lib/thor/command.rb:37
  0.00   100.00  run - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/cli/exec.rb:30
  0.00   100.00  perform_with_fork - /home/runner/bundle/ruby/3.3.0/gems/resque-2.6.0/lib/resque/worker.rb:942
  0.00   100.00  loop - <internal:kernel>:192
  0.00   100.00  load [c function] - (unknown)
  0.00   100.00  kernel_load - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/cli/exec.rb:65
  0.00   100.00  invoke_with_call_chain - /home/runner/bundle/ruby/3.3.0/gems/rake-13.1.0/lib/rake/task.rb:226
  0.00   100.00  invoke_task - /home/runner/bundle/ruby/3.3.0/gems/rake-13.1.0/lib/rake/application.rb:183
  0.00   100.00  invoke_command - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/vendor/thor/lib/thor/invocation.rb:129

Line 285 appears to be the end of def work_one_job, confirmed on this specific machine:

$ cat /home/runner/bundle/ruby/3.3.0/gems/resque-2.6.0/lib/resque/worker.rb | head -n 285 | tail -19
    def work_one_job(job = nil, &block)
      return false if paused?
      return false unless job ||= reserve

      working_on job
      procline "Processing #{job.queue} since #{Time.now.to_i} [#{job.payload_class_name}]"

      log_with_severity :info, "got: #{job.inspect}"
      job.worker = self

      if fork_per_job?
        perform_with_fork(job, &block)
      else
        perform(job, &block)
      end

      done_working
      true
    end

friendly_errors may be of interest:

$ cat /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/friendly_errors.rb | head -n 125 | tail -n 11

  def self.with_friendly_errors
    FriendlyErrors.enable!
    yield
  rescue SignalException
    raise
  rescue Exception => e # rubocop:disable Lint/RescueException
    raise if FriendlyErrors.disabled?

    FriendlyErrors.log_error(e)
    exit FriendlyErrors.exit_status(e)
  end

Is there some exception bubbling up here causing exit to be hit? I know that might interact interestingly with resque. We have RUN_AT_EXIT_HOOKS=1, which was recommended by the datadog folks to get code profiling to work (however, I have already disabled that code in the current workers; that was my first move when I noticed the hanging workers).

strace:

$ sudo strace -f -p 3921 | head -n 100
strace: Process 3921 attached
futex(0x55f1e553c0b0, FUTEX_WAIT_PRIVATE, 2, NULL

Output hangs there with no closing paren and no newline. I don't know how to parse strace output so I'm just parroting it.

The parent process has more verbose but maybe not so informative output. Here's a sample -- it just keeps vomiting out more if I let it run:

[pid    44] futex(0x55f1ee79f8c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid    45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=42942872}, 0x7f2c8c42c7b0) = 0
[pid    45] futex(0x55f1ee79f8c0, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid    44] <... futex resumed>)        = 0
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid    44] futex(0x55f1ee79f870, FUTEX_WAKE_PRIVATE, 1) = 0
[pid    44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=166474748}) = 0
[pid    44] clock_gettime(0xfffffff6 /* CLOCK_??? */, {tv_sec=11, tv_nsec=478485871}) = 0
[pid    44] clock_gettime(0xffffffb6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=13040685}) = 0
[pid    44] clock_gettime(0xffffffae /* CLOCK_??? */, {tv_sec=0, tv_nsec=41377001}) = 0
[pid    44] clock_gettime(0xffffffa6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=172823169}) = 0
[pid    44] clock_gettime(0xffffff76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=265477740}) = 0
[pid    44] clock_gettime(0xffffff6e /* CLOCK_??? */, {tv_sec=0, tv_nsec=890385521}) = 0
[pid    44] clock_gettime(0xffffff66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=778745020}) = 0
[pid    44] clock_gettime(0xffffff5e /* CLOCK_??? */, {tv_sec=0, tv_nsec=817141373}) = 0
[pid    44] clock_gettime(0xffffff0e /* CLOCK_??? */, {tv_sec=0, tv_nsec=676852386}) = 0
[pid    44] clock_gettime(0xffffff06 /* CLOCK_??? */, {tv_sec=0, tv_nsec=316683}) = 0
[pid    44] clock_gettime(0xfffffeee /* CLOCK_??? */, {tv_sec=0, tv_nsec=672358079}) = 0
[pid    44] clock_gettime(0xfffffee6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=220416}) = 0
[pid    44] clock_gettime(0xfffffece /* CLOCK_??? */, {tv_sec=0, tv_nsec=851675597}) = 0
[pid    44] clock_gettime(0xfffffec6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=180611}) = 0
[pid    44] clock_gettime(0xfffffeae /* CLOCK_??? */, {tv_sec=0, tv_nsec=7388677}) = 0
[pid    44] clock_gettime(0xfffffe96 /* CLOCK_??? */, {tv_sec=10, tv_nsec=779297188}) = 0
[pid    44] clock_gettime(0xfffffe8e /* CLOCK_??? */, {tv_sec=9, tv_nsec=229016859}) = 0
[pid    44] clock_gettime(0xfffffe76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=882559104}) = 0
[pid    44] clock_gettime(0xfffffe6e /* CLOCK_??? */, {tv_sec=5, tv_nsec=667769214}) = 0
[pid    44] clock_gettime(0xfffffe66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=151259012}) = 0
[pid    44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=166910183}) = 0
[pid    44] futex(0x55f1ee79f8c4, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid    45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=44852822}, 0x7f2c8c42c7b0) = 0
[pid    45] futex(0x55f1ee79f8c4, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid    44] <... futex resumed>)        = 0
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid    44] futex(0x55f1ee79f870, FUTEX_WAKE_PRIVATE, 1) = 0
[pid    44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=166952274}) = 0
[pid    44] clock_gettime(0xfffffff6 /* CLOCK_??? */, {tv_sec=11, tv_nsec=478485871}) = 0
[pid    44] clock_gettime(0xffffffb6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=13040685}) = 0
[pid    44] clock_gettime(0xffffffae /* CLOCK_??? */, {tv_sec=0, tv_nsec=41377001}) = 0
[pid    44] clock_gettime(0xffffffa6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=172823169}) = 0
[pid    44] clock_gettime(0xffffff76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=265477740}) = 0
[pid    44] clock_gettime(0xffffff6e /* CLOCK_??? */, {tv_sec=0, tv_nsec=890385521}) = 0
[pid    44] clock_gettime(0xffffff66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=778745020}) = 0
[pid    44] clock_gettime(0xffffff5e /* CLOCK_??? */, {tv_sec=0, tv_nsec=817141373}) = 0
[pid    44] clock_gettime(0xffffff0e /* CLOCK_??? */, {tv_sec=0, tv_nsec=676852386}) = 0
[pid    44] clock_gettime(0xffffff06 /* CLOCK_??? */, {tv_sec=0, tv_nsec=316683}) = 0
[pid    44] clock_gettime(0xfffffeee /* CLOCK_??? */, {tv_sec=0, tv_nsec=672358079}) = 0
[pid    44] clock_gettime(0xfffffee6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=220416}) = 0
[pid    44] clock_gettime(0xfffffece /* CLOCK_??? */, {tv_sec=0, tv_nsec=851675597}) = 0
[pid    44] clock_gettime(0xfffffec6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=180611}) = 0
[pid    44] clock_gettime(0xfffffeae /* CLOCK_??? */, {tv_sec=0, tv_nsec=7388677}) = 0
[pid    44] clock_gettime(0xfffffe96 /* CLOCK_??? */, {tv_sec=10, tv_nsec=779354288}) = 0
[pid    44] clock_gettime(0xfffffe8e /* CLOCK_??? */, {tv_sec=9, tv_nsec=229016859}) = 0
[pid    44] clock_gettime(0xfffffe76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=882559104}) = 0
[pid    44] clock_gettime(0xfffffe6e /* CLOCK_??? */, {tv_sec=5, tv_nsec=667769214}) = 0
[pid    44] clock_gettime(0xfffffe66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=151259012}) = 0
[pid    44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=167329296}) = 0
[pid    44] futex(0x55f1ee79f8c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid    45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=41631252}, 0x7f2c8c42c7b0) = 0
[pid    45] futex(0x55f1ee79f8c0, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid    44] <... futex resumed>)        = 0
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid    44] futex(0x55f1ee79f870, FUTEX_WAKE_PRIVATE, 1) = 0
[pid    44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=167376646}) = 0
[pid    44] clock_gettime(0xfffffff6 /* CLOCK_??? */, {tv_sec=11, tv_nsec=478485871}) = 0
[pid    44] clock_gettime(0xffffffb6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=13040685}) = 0
[pid    44] clock_gettime(0xffffffae /* CLOCK_??? */, {tv_sec=0, tv_nsec=41377001}) = 0
[pid    44] clock_gettime(0xffffffa6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=172823169}) = 0
[pid    44] clock_gettime(0xffffff76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=265477740}) = 0
[pid    44] clock_gettime(0xffffff6e /* CLOCK_??? */, {tv_sec=0, tv_nsec=890385521}) = 0
[pid    44] clock_gettime(0xffffff66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=778745020}) = 0
[pid    44] clock_gettime(0xffffff5e /* CLOCK_??? */, {tv_sec=0, tv_nsec=817141373}) = 0
[pid    44] clock_gettime(0xffffff0e /* CLOCK_??? */, {tv_sec=0, tv_nsec=676852386}) = 0
[pid    44] clock_gettime(0xffffff06 /* CLOCK_??? */, {tv_sec=0, tv_nsec=316683}) = 0
[pid    44] clock_gettime(0xfffffeee /* CLOCK_??? */, {tv_sec=0, tv_nsec=672358079}) = 0
[pid    44] clock_gettime(0xfffffee6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=220416}) = 0
[pid    44] clock_gettime(0xfffffece /* CLOCK_??? */, {tv_sec=0, tv_nsec=851675597}) = 0
[pid    44] clock_gettime(0xfffffec6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=180611}) = 0
[pid    44] clock_gettime(0xfffffeae /* CLOCK_??? */, {tv_sec=0, tv_nsec=7388677}) = 0
[pid    44] clock_gettime(0xfffffe96 /* CLOCK_??? */, {tv_sec=10, tv_nsec=779409707}) = 0
[pid    44] clock_gettime(0xfffffe8e /* CLOCK_??? */, {tv_sec=9, tv_nsec=229016859}) = 0
[pid    44] clock_gettime(0xfffffe76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=882559104}) = 0
[pid    44] clock_gettime(0xfffffe6e /* CLOCK_??? */, {tv_sec=5, tv_nsec=667769214}) = 0
[pid    44] clock_gettime(0xfffffe66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=151259012}) = 0
[pid    44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=167790802}) = 0
[pid    44] futex(0x55f1ee79f8c4, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid    45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=42501013}, 0x7f2c8c42c7b0) = 0
[pid    45] futex(0x55f1ee79f8c4, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid    44] <... futex resumed>)        = 0
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid    44] futex(0x55f1ee79f870, FUTEX_WAKE_PRIVATE, 1) = 0
[pid    44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=167833694}) = 0
[pid    44] clock_gettime(0xfffffff6 /* CLOCK_??? */, {tv_sec=11, tv_nsec=478485871}) = 0
[pid    44] clock_gettime(0xffffffb6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=13040685}) = 0
[pid    44] clock_gettime(0xffffffae /* CLOCK_??? */, {tv_sec=0, tv_nsec=41377001}) = 0
[pid    44] clock_gettime(0xffffffa6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=172823169}) = 0
[pid    44] clock_gettime(0xffffff76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=265477740}) = 0
[pid    44] clock_gettime(0xffffff6e /* CLOCK_??? */, {tv_sec=0, tv_nsec=890385521}) = 0
[pid    44] clock_gettime(0xffffff66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=778745020}) = 0
[pid    44] clock_gettime(0xffffff5e /* CLOCK_??? */, {tv_sec=0, tv_nsec=817141373}) = 0
[pid    44] clock_gettime(0xffffff0e /* CLOCK_??? */, {tv_sec=0, tv_nsec=676852386}) = 0
[pid    44] clock_gettime(0xffffff06 /* CLOCK_??? */, {tv_sec=0, tv_nsec=316683}) = 0
[pid    44] clock_gettime(0xfffffeee /* CLOCK_??? */, {tv_sec=0, tv_nsec=672358079}) = 0
[pid    44] clock_gettime(0xfffffee6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=220416}) = 0
[pid    44] clock_gettime(0xfffffece /* CLOCK_??? */, {tv_sec=0, tv_nsec=851675597}) = 0
[pid    44] clock_gettime(0xfffffec6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=180611}) = 0
[pid    44] clock_gettime(0xfffffeae /* CLOCK_??? */, {tv_sec=0, tv_nsec=7388677}) = 0
[pid    44] clock_gettime(0xfffffe96 /* CLOCK_??? */, {tv_sec=10, tv_nsec=779466345}) = 0
[pid    44] clock_gettime(0xfffffe8e /* CLOCK_??? */, {tv_sec=9, tv_nsec=229016859}) = 0
[pid    44] clock_gettime(0xfffffe76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=882559104}) = 0
[pid    44] clock_gettime(0xfffffe6e /* CLOCK_??? */, {tv_sec=5, tv_nsec=667769214}) = 0
[pid    44] clock_gettime(0xfffffe66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=151259012}) = 0
[pid    44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=168185170}) = 0
[pid    44] futex(0x55f1ee79f8c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid    45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=42168942}, 0x7f2c8c42c7b0) = 0
[pid    45] futex(0x55f1ee79f8c0, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid    44] <... futex resumed>)        = 0
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid    44] futex(0x55f1ee79f870, FUTEX_WAKE_PRIVATE, 1) = 0
[pid    44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=168223530}) = 0
[pid    44] clock_gettime(0xfffffff6 /* CLOCK_??? */, {tv_sec=11, tv_nsec=478485871}) = 0
[pid    44] clock_gettime(0xffffffb6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=13040685}) = 0
[pid    44] clock_gettime(0xffffffae /* CLOCK_??? */, {tv_sec=0, tv_nsec=41377001}) = 0
[pid    44] clock_gettime(0xffffffa6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=172823169}) = 0
[pid    44] clock_gettime(0xffffff76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=265477740}) = 0
[pid    44] clock_gettime(0xffffff6e /* CLOCK_??? */, {tv_sec=0, tv_nsec=890385521}) = 0
[pid    44] clock_gettime(0xffffff66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=778745020}) = 0
[pid    44] clock_gettime(0xffffff5e /* CLOCK_??? */, {tv_sec=0, tv_nsec=817141373}) = 0
[pid    44] clock_gettime(0xffffff0e /* CLOCK_??? */, {tv_sec=0, tv_nsec=676852386}) = 0
[pid    44] clock_gettime(0xffffff06 /* CLOCK_??? */, {tv_sec=0, tv_nsec=316683}) = 0
[pid    44] clock_gettime(0xfffffeee /* CLOCK_??? */, {tv_sec=0, tv_nsec=672358079}) = 0
[pid    44] clock_gettime(0xfffffee6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=220416}) = 0
[pid    44] clock_gettime(0xfffffece /* CLOCK_??? */, {tv_sec=0, tv_nsec=851675597}) = 0
[pid    44] clock_gettime(0xfffffec6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=180611}) = 0
[pid    44] clock_gettime(0xfffffeae /* CLOCK_??? */, {tv_sec=0, tv_nsec=7388677}) = 0
[pid    44] clock_gettime(0xfffffe96 /* CLOCK_??? */, {tv_sec=10, tv_nsec=779523837}) = 0
[pid    44] clock_gettime(0xfffffe8e /* CLOCK_??? */, {tv_sec=9, tv_nsec=229016859}) = 0
[pid    44] clock_gettime(0xfffffe76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=882559104}) = 0
[pid    44] clock_gettime(0xfffffe6e /* CLOCK_??? */, {tv_sec=5, tv_nsec=667769214}) = 0
[pid    44] clock_gettime(0xfffffe66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=151259012}) = 0
[pid    44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=168589178}) = 0
[pid    44] futex(0x55f1ee79f8c4, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid    45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0

There are some potentially more interesting snippets:

[pid    44] futex(0x55f1ee79f8c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid    45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=39936383},  <unfinished ...>
[pid    18] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid    18] futex(0x55f1eff435b0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid    18] write(7, "\27\3\3\0Od\350\315\245\307\177\"\341\354\375\201,\325^\t_V\346\255\305zq\321\377\267\234\32"..., 84) = 84
[pid    18] read(7, 0x7f2c600d6fd3, 5)  = -1 EAGAIN (Resource temporarily unavailable)

and

[pid    44] futex(0x55f1ee79f8c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid    45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=48153081},  <unfinished ...>
[pid    49] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid    49] futex(0x55f1ec353050, FUTEX_WAKE_PRIVATE, 1) = 0
[pid    49] futex(0x55f1ee2fccf0, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3948641, tv_nsec=321738460}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid    19] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid    19] futex(0x55f1f0047c80, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid    45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid    19] <... futex resumed>)        = 0
[pid    45] getpid()                    = 1
[pid    45] tgkill(1, 19, SIGPROF)      = 0
[pid    19] --- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=1, si_uid=1000} ---
[pid    45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid    19] rt_sigreturn({mask=[]})     = 139829380472440

I'm not sure what those pids are, but as a reminder this container is only running bundle exec rake resque:work, so anything must come from that or from OS / k8s-invoked processes.

ozydingo commented on May 24, 2024

rbspy for the parent process

$ sudo /home/runner/.cargo/bin/rbspy record -p 1

Time since start: 101s. Press Ctrl+C to stop.
Summary of profiling data so far:
% self  % total  name
 98.28    98.28  _native_idle_sampling_loop [c function] - (unknown)
  1.41     1.41  sleep [c function] - (unknown)
  0.10     0.10  select [c function] - (unknown)
  0.07     0.07  count_objects [c function] - (unknown)
  0.03     0.03  _native_serialize [c function] - (unknown)
  0.01     0.26  block (2 levels) in create_timeout_thread - /usr/local/lib/ruby/3.3.0/timeout.rb:112
  0.01     0.03  map [c function] - (unknown)
  0.01     0.03  block in run_sdam_flow - /home/runner/bundle/ruby/3.3.0/gems/mongo-2.19.3/lib/mongo/server/monitor.rb:256
  0.01     0.01  sysread_nonblock [c function] - (unknown)
  0.01     0.01  read_status_line - /usr/local/lib/ruby/3.3.0/net/http/response.rb:162
  0.01     0.01  get_hash [c function] - (unknown)
  0.01     0.01  flush - /home/runner/bundle/ruby/3.3.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/telemetry.rb:97
  0.01     0.01  downcase [c function] - (unknown)
  0.01     0.01  block in record_loaded_specs - /home/runner/bundle/ruby/3.3.0/gems/ddtrace-1.22.0/lib/datadog/profiling/collectors/code_p
  0.01     0.01  block in initialize - /home/runner/bundle/ruby/3.3.0/gems/mongo-2.19.3/lib/mongo/address.rb:81
  0.01     0.01  _native_do_export [c function] - (unknown)
  0.00    98.28  block (2 levels) in start - /home/runner/bundle/ruby/3.3.0/gems/ddtrace-1.22.0/lib/datadog/profiling/collectors/idle_samp
  0.00     1.58  synchronize [c function] - (unknown)
  0.00     1.39  wait [c function] - (unknown)
  0.00     1.11  loop - <internal:kernel>:192

nevans commented on May 24, 2024

Sorry, I don't have time to do more than skim all of the details you've shared. But, with that caveat, I would personally experiment with 1) disabling or removing telemetry (ddtrace, airbrake, etc.) on a percentage of your workers, and 2) setting ENV["FORK_PER_JOB"] = "false" on a percentage of your workers. It might be related to neither of these, but those are my first thoughts after seeing what you shared.

Obviously you'll eventually still need whatever telemetry you disable, but if it is related to DataDog or Airbrake, at least you'll have another clue about where to debug. I've debugged similar issues in the past and tracked them down to airbrake and/or newrelic_rpm (IIRC). I'm curious, for the others who had issues: what telemetry gems are you running (and what versions)?
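
For the FORK_PER_JOB experiment, a minimal sketch of gating it per worker (RESQUE_NO_FORK is a hypothetical variable you would set on a subset of pods; resque itself only reads FORK_PER_JOB):

# In the worker boot path, before `rake resque:work` starts processing jobs.
if ENV["RESQUE_NO_FORK"] == "1"
  # Setting FORK_PER_JOB to "false" makes the worker perform jobs in-process instead of forking.
  ENV["FORK_PER_JOB"] = "false"
end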

ozydingo commented on May 24, 2024

Those are good suggestions to experiment with; unfortunately I think I've run out of time to keep this condition going in our prod application and will have to revert to ruby 3.2 for now to restore stability.

fwiw:

airbrake (13.0.3)
airbrake-ruby (6.2.1)
ddtrace (1.22.0)

alexandrachakeres-wk commented on May 24, 2024

We did not use airbrake or ddtrace, but did use newrelic_rpm 9.7.0 at the time the bug occurred.
