Comments (22)
We've been having this exact issue. Since upgrading Ruby from 3.1.2 to 3.3.0 about a month ago, resque workers are going to ~0% CPU / stopping most logs, but not dying. They stopped working one by one over the course of a couple days. Bouncing the service restored them for a while, but they eventually stop working again.
In every pod's logs I've checked, the last 2 real things logged are similar to the following (always ending with Running before_fork hooks with... but with different jobs/args):
got: (Job{schedule} | ScheduleJob | [195293])
Running before_fork hooks with [(Job{schedule} | ScheduleJob | [195293])]
Normally we'd next see something like Running after_fork hooks with [(Job{schedule} | ScheduleJob | [195293])] logged, but it never gets there. FYI, our code doesn't define any Resque.before_fork or Resque.after_fork blocks (as documented in the resque docs here). So it sounds like something could be going wrong in the default before_fork hook, considering the docs say "The before_fork hook will be run in the parent process. So, be careful - any changes you make will be permanent for the lifespan of the worker." Or maybe (more likely) something goes wrong during the fork itself. I'd be curious whether others also see Running before_fork hooks... as the final real log before things go basically silent; that could help narrow in on a fix.
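For anyone who wants to confirm where their own workers stop, a minimal hook-logging sketch (assuming resque 2.x block-style hook registration; the file location and log text are just illustrative):

# config/initializers/resque_hook_logging.rb (hypothetical location)
# Log each worker lifecycle hook so the last hook reached shows up in the pod logs.
Resque.before_fork do |job|
  Resque.logger.info "Resque hook: before_fork with #{job.inspect}"
end

Resque.after_fork do |job|
  Resque.logger.info "Resque hook: after_fork with #{job.inspect}"
end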
Versions:
- ruby: 3.3.0
- resque: 2.6.0
- rails: 6.1.7.6
- redis gem: 4.6.0
- redis: 5.0.6
- We do not use the resque-pool gem
Update: We are no longer seeing lazy (0% CPU) workers after downgrading to Ruby 3.2.3. Obviously this is not a long-term solution though.
Small Update
Historically we have had some workers die: they freeze and also lose their heartbeat, so it expires. We check for expired workers every few minutes and kill them. Thanks to resque-retry those jobs are failed and a new worker performs the job; not perfect, but it works.
However, with ruby 3.3 these workers never lose their heartbeat and thus never appear dead; they seem to "run" until their dyno/server is restarted. Let's call these ghost workers: they are neither alive nor dead. We also noticed that there might be an increased number of ghost workers, though at the time we were running a large number of background workers.
@rexmadden, @leboshi, have you had a similar experience?
Seems like there is a historical issue here that has mutated with ruby 3.3.
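As a rough sketch of how the ghost workers described above might be flagged from a console (the 30-minute threshold and the script itself are assumptions, not something we run):

require "time"

STUCK_AFTER = 30 * 60 # seconds; assumed threshold, tune to your longest legitimate job

# Workers that report a current job but have been on it far longer than any job should take.
ghosts = Resque::Worker.working.select do |worker|
  job = worker.job
  job.any? && (Time.now - Time.parse(job["run_at"])) > STUCK_AFTER
end

ghosts.each { |w| puts "possible ghost: #{w} working since #{w.job['run_at']}" }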
We didn't get a chance to dive into a diagnosis as deeply as you did, but the symptoms are consistent with yours. Workers would run a few jobs then just hang forever. In the /jobs page, we'd see the workers reporting that they were still processing the previous job they picked up, even hours later. We got tired of manually rebooting workers and just downgraded to Ruby 3.2.3. Obviously not tenable long-term.
I also haven't upgraded anything significant to ruby 3.3 in production yet (I've been waiting for the 3.3.1 release, which will fix some bugs that affect my codebases). So I can't really offer anything but wild guesses at this point, but...
Unfortunately, (if I remember correctly) the heartbeat runs in a separate thread in the worker process, so it's only good for knowing if the worker process itself has frozen. It can't tell you if the code in the main thread has hung indefinitely or entered an infinite loop. And (again, IIRC) the heartbeat only runs in the parent worker process, and not in the forked child process that works each job. The parent simply waits on the child with Process.waitpid(@child).
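A minimal, generic illustration of that pattern (not resque's actual code): the heartbeat thread keeps ticking even while the parent's main thread is blocked waiting on a child that never exits.

# Parent thread that keeps "beating" no matter what the main thread is doing.
heartbeat = Thread.new do
  loop do
    puts "heartbeat #{Time.now}"
    sleep 1
  end
end

# Stand-in for a forked job that hangs forever.
child = fork { sleep }

# The main thread blocks here indefinitely, yet the heartbeat thread above keeps
# running, so the worker never looks dead from the outside.
Process.waitpid(child)
heartbeat.kill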
If (hypothetically) in ruby 3.2 some system calls or other VM code could hang without a timeout while holding the global VM lock, and ruby 3.3 improved that code to no longer hold the global VM lock but it still hangs without a timeout, then that might explain your symptoms. This is just a wild guess of a hypothesis... but a quick git log -G waitpid in the ruby repo does indicate that changes were made to Process.waitpid between 3.2 and 3.3, and it's probably not related to this bug. But... maybe?
Have you tried running with ENV["FORK_PER_JOB"] = "false" on a percentage of your workers?
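For example, a tiny sketch of how that could be rolled out to a canary subset of workers (CANARY_NO_FORK is a hypothetical flag, not something resque reads):

# Early in worker boot, before the Resque worker starts; resque reads FORK_PER_JOB
# from the environment, and "false" disables forking per job.
ENV["FORK_PER_JOB"] = "false" if ENV["CANARY_NO_FORK"] == "1"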
If you have the ability to point rbspy, rbtrace, and/or strace at your stuck processes, that could be very informative. Even simpler, sometimes just looking at the proclines (via something like ps faux | grep resque | grep -v grep) can be useful or revealing.
We have not tried to run 3.3.1 on any production apps. I assume it would still be an issue. We might have some time to test on a staging app soon.
We've been experiencing similar issues with frozen workers, ever since upgrading to ruby 3.3.0:
Versions
- ruby: 3.3.0
- resque: 2.6.0
- resque-pool: 0.7.1
- rails: 7.1.3
Just wanted to put an update here since I have been looking into this in the background, but also I'm not running Ruby 3.3 anywhere in production yet so I haven't been able to triage at scale.
I could use a little help in nailing down the exact behavior here though and any logs or other info that people can provide. Specifically, are people seeing the actual forked workers die off/exit? Or are they actually running but just no longer receiving jobs? Lastly, what version of redis server and gem are you running?
Thanks for taking a look! I just added some more version details to my comment above. We did not see the resqueworker service pods exit. They kept running but stopped processing jobs. If I remember correctly, we did continue to see some random logging (unrelated to processing jobs). Not sure if that answers your question.
I accidentally closed this, and reopened it.
@PatrickTulskie What we noticed was jobs that should take a few seconds were taking 45+ minutes, and the worker simply never stopped working that job. We also saw no logs from those workers, and we have extensive logging around our jobs. The only logs we got were from the enqueuer, before_enqueue and after_enqueue, which I think happen in the enqueueing code, Resque.enqueue(...) etc.
We often get workers similar to this, but those workers are missing a heartbeat. We have an outside task that detects these and runs Resque::Worker.all_workers_with_expired_heartbeats.each { |w| w.unregister_worker }, which kills the worker and re-enqueues the job for us.
However, in this case these workers still had active heartbeats and their state was working. One of the console commands we ran was Resque::Worker.all.map { |w| [w.state, Time.now, w.heartbeat, Time.now - w.heartbeat, w.started] }. It's like these workers were "alive" but not doing anything.
When we unregistered these workers (by getting their ID and unregistering them) or restarted our worker dynos, we noticed that the failed jobs vanished; they did not end up in the failed queue as expected.
However in this case these workers still had active heartbeats and their state was working. One of the console commands we ran was Resque::Worker.all.map { |w| [w.state, Time.now, w.heartbeat, Time.now - w.heartbeat, w.started] }. It's like these workers were "alive" but not doing anything.
This. The workers will not be marked as having expired heartbeats and will claim to still be working, but they'll be idle... We can also see via the logs that a particular job for that worker has finished, but the worker will be in this idle state anyway, not picking anything else up and still "working".
irb (main) > Resque::Worker.all_workers_with_expired_heartbeats
[]
irb (main) > some_stalled_worker.working?
true
Unfortunately we've resorted to migrating to Sidekiq for some business critical cron jobs.
Hey, is this still an issue being experienced on ruby 3.3.1? Has anyone tested it in production?
We are hitting what appears to be this issue, running ruby 3.3.1. I haven't tried to instrument additional logging on the fork hooks yet, though it sounds like from OP's comment I won't see these logs anyway. We are getting workers that are registered as running job X, but have no logs from application code that we expect to see from that job, and are evidently idling doing nothing. The last logs from the hung workers are from the previous job, which we have evidence came to a successful completion without errors.
In case it's worth any kind of hint, we had a similarly presenting issue previously that related to Resque workers hanging after failure, due to a combination of ddtrace and airbrake, and Airbrake's change in 6.2 to not use Thread#enclose is what solved that immediate problem. Otherwise, the problem manifested very similarly, where a worker would hang out idling.
After adding logging to the various available hooks, I have a bit more info in response to @PatrickTulskie's and @nevans's questions:
Context:
- Ruby 3.3.1p55
- Gems: resque 2.6.0, redis 4.8.1, rails 7.1.3.2
- Redis server: 6.2 AWS Elasticache
Specific scenario:
- We're running resque workers in a kubernetes setup, with one worker per pod
- The affected pod / worker I am looking at is worker-auto-order-5688d5ccf-jph4w
We can see the worker in question has been running for hours.
{now: Time.zone.now.utc, state: worker.state, started: worker.started, hostname: worker.hostname, job: worker.job}
=>
{:now=>2024-05-16 02:18:18.731606145 UTC,
:state=>:working,
:started=>"2024-05-15 20:17:47 +0000",
:hostname=>"worker-auto-order-5688d5ccf-jph4w",
:job=>
{"queue"=>"auto-order",
"run_at"=>"2024-05-15T23:11:51Z",
"payload"=>{"class"=>"AutoOrderVpi", "args"=>[671467342, 8846, 15]}}}
Logs from this worker filtered to the hook messages show the last two jobs (671467333, 671467337) forking, starting, and completing, but no logs whatsoever from the current job (671467342):
$ kubectl logs worker-auto-order-5688d5ccf-jph4w | grep "Resque hook" | tail -8
Defaulted container "worker-auto-order" out of: worker-auto-order, download-asset-manifest (init)
I, [2024-05-15T23:11:47.834201 #1] INFO -- : Resque hook: before_fork
I, [2024-05-15T23:11:47.858373 #77212] INFO -- : Resque hook: after_fork with job {"class"=>"AutoOrderVpi", "args"=>[671467333, 9392, 15]}
I, [2024-05-15T23:11:47.863781 #77212] INFO -- : Resque hook: before_perform; AutoOrderVpi; [671467333, 9392, 15]
I, [2024-05-15T23:11:49.029556 #77212] INFO -- : Resque hook: after_perform; AutoOrderVpi; [671467333, 9392, 15]
I, [2024-05-15T23:11:49.372468 #1] INFO -- : Resque hook: before_fork
I, [2024-05-15T23:11:49.388899 #77281] INFO -- : Resque hook: after_fork with job {"class"=>"AutoOrderVpi", "args"=>[671467337, 2615, 15]}
I, [2024-05-15T23:11:49.393569 #77281] INFO -- : Resque hook: before_perform; AutoOrderVpi; [671467337, 2615, 15]
I, [2024-05-15T23:11:50.751182 #77281] INFO -- : Resque hook: after_perform; AutoOrderVpi; [671467337, 2615, 15]
The unfiltered logs show that this after_perform is the last log line for this worker before hours of silence:
$ kubectl logs worker-auto-order-5688d5ccf-jph4w | tail -5
Defaulted container "worker-auto-order" out of: worker-auto-order, download-asset-manifest (init)
(77281) Auto-Order [ **REDACTED** ] -- Performing auto-order!
/home/runner/bundle/ruby/3.3.0/gems/curb-0.9.11/lib/curl/easy.rb:67: warning: undefining the allocator of T_DATA class Curl::Multi
(77281) Auto-Order [ **REDACTED** ] -- 0 new orders found.
(77281) Auto-Order [ **REDACTED** ] -- Fininshed with [ **REDACTED** ]
I, [2024-05-15T23:11:50.751182 #77281] INFO -- : Resque hook: after_perform; AutoOrderVpi; [671467337, 2615, 15]
Notice that the last log line from this worker is immediately before the current reported job.run_at time of 2024-05-15T23:11:51Z.
The worker does not have an expired heartbeat
Resque::Worker.all_workers_with_expired_heartbeats
=> []
And indeed it is still sending active heartbeats
heartbeats = Resque::Worker.all_heartbeats.select { |h| h.split(":").first == ENV["POD_NAME"] }; {now: Time.zone.now.utc, heatbeats: heartbeats }
=>
{:now=>2024-05-16 02:21:28.461478023 UTC,
:heatbeats=>{"worker-auto-order-5688d5ccf-jph4w:1:auto-order-fast,auto-order"=>"2024-05-16T02:20:48+00:00"}}
ps output:
$ kubectl exec worker-auto-order-5688d5ccf-jph4w -- ps auxf
Defaulted container "worker-auto-order" out of: worker-auto-order, download-asset-manifest (init)
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
runner 80897 0.0 0.0 8648 3228 ? Rs 02:26 0:00 ps auxf
runner 1 2.9 0.1 2365384 373608 ? Ssl May15 11:02 resque-2.6.0: Processing auto-order since 1715814711 [AutoOrderVpi]
runner 77315 0.0 0.1 2512380 355348 ? S May15 0:00 resque-2.6.0: Processing auto-order since 1715814711 [AutoOrderVpi]
Unfortunately I can't get strace or rbspy output, I believe because the docker container doesn't have sufficient privileges. The following errors are the same even if I run as the root user, which I am able to access.
$ strace -f -p 77315
strace: attach: ptrace(PTRACE_SEIZE, 77315): Operation not permitted
$ rbspy record --pid 77315
rbspy is recording traces. Press Ctrl+C to stop.
Time since start: 1s. Press Ctrl+C to stop.
Summary of profiling data so far:
% self % total name
Warning: no profile samples were collected
Wrote raw data to /home/runner/.cache/rbspy/2024-05-16-VNxnAu1e37.raw.gz
Wrote formatted output to /home/runner/.cache/rbspy/2024-05-16-1YGNUU5IEQ.flamegraph.svg
Something went wrong while rbspy was sampling the process. Here's what we know:
- new spy
- get ruby VM state
- Failed to read Ruby version symbol
- Permission denied (os error 13)
- Permission denied (os error 13)
That's about as much as I can get from this pod, which will likely get cycled before additional responses here. I'm not fully opposed to running a limited experiment with added capabilities on a limited set of pods to get rbspy and/or strace, though doing so is uncharted territory for me at this time.
To summarize:
- A single worker has been working for hours
- It completed its last job successfully, hooks and all, then was assigned a new job
- No hooks from the new job were observed running, no code from the new job was observed, and no logs were emitted after the previous job completed.
- The worker had been silent for hours when I found it in this position, still sending heartbeats and still allegedly, but definitely not actually, processing this new job.
- The worker state is :working, with both parent and child processes reporting "Processing auto-order since 1715814711"; that's more than 3 hours ago, for a job that takes minutes at most.
@ozydingo thanks for that. Interesting that it just goes silent even though it's sending heartbeats.
For those of you experiencing this issue, are you able to consistently reproduce it or is it sporadic and doesn't always happen? I'm just trying to think of how to create a test setup for tracing where we can reproduce the bug. Something like a worker that is just constantly getting jobs where it sleeps for n seconds or something until it breaks.
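Along those lines, a minimal reproduction sketch (SleepJob and the :repro queue are made-up names; the idea is just a constant stream of short sleep jobs):

# A trivial job that sleeps for n seconds.
class SleepJob
  @queue = :repro

  def self.perform(seconds)
    sleep seconds
  end
end

# Feed the queue continuously (from a console or a small rake task), then watch
# for a worker that stops picking up jobs while still reporting :working.
loop do
  Resque.enqueue(SleepJob, 1)
  sleep 0.5
end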
Unfortunately it's inconsistent. Our system seems to go for hours without issue, then things start going south. And in the case of the worker I documented here, this is in a deployment of usually 5 worker pods processing a pretty constant churn of polling jobs (consistent with the previous job logs shown), so it doesn't seem to be related to any period of idleness. I wish I could give you more.
And I'm definitely stumped, as from what I can tell from the above, it seems we have hit lines 271 and 272 of worker.rb such that the payload and procline are set. fork_per_job? is definitively true, so then we'd call perform_with_fork and therefore run_hook :before_fork, job, and I can't really see anything in that path that would just... stop or fail in any way prior to actually running my logging before_fork hook.
I was able to get strace and rbspy working, and got another example to look at. Since it's a bit buried, I'll highlight here the bit below about friendly_errors, exit, and RUN_AT_EXIT_HOOKS.
ps:
2024-05-16 11:44:32 worker-auto-order (prod):/app
$ date
Thu May 16 11:44:35 UTC 2024
2024-05-16 11:44:35 worker-auto-order (prod):/app
$ ps -auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
runner 4498 0.3 0.0 9880 7560 pts/0 Ss 11:44 0:00 /bin/bash
runner 4620 0.0 0.0 8652 3308 pts/0 R+ 11:44 0:00 \_ ps -auxf
runner 1 2.2 0.1 2287068 368288 ? Ssl 09:59 2:20 resque-2.6.0: Forked 3921 at 1715854204
runner 3921 0.0 0.1 2393472 341108 ? S 10:10 0:00 resque-2.6.0: Processing auto-order since 1715854203
Unlike the previous example, the parent process procline here says "Forked" instead of "Processing".
rbspy:
$ sudo /home/runner/.cargo/bin/rbspy record -p 3921
Time since start: 78s. Press Ctrl+C to stop.
Summary of profiling data so far:
% self % total name
100.00 100.00 _fork [c function] - (unknown)
0.00 100.00 work_one_job - /home/runner/bundle/ruby/3.3.0/gems/resque-2.6.0/lib/resque/worker.rb:285
0.00 100.00 work - /home/runner/bundle/ruby/3.3.0/gems/resque-2.6.0/lib/resque/worker.rb:265
0.00 100.00 with_friendly_errors - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/friendly_errors.rb:125
0.00 100.00 top_level - /home/runner/bundle/ruby/3.3.0/gems/rake-13.1.0/lib/rake/application.rb:141
0.00 100.00 synchronize [c function] - (unknown)
0.00 100.00 start - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/vendor/thor/lib/thor/base.rb:495
0.00 100.00 start - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/cli.rb:31
0.00 100.00 standard_exception_handling - /home/runner/bundle/ruby/3.3.0/gems/rake-13.1.0/lib/rake/application.rb:219
0.00 100.00 run_with_threads - /home/runner/bundle/ruby/3.3.0/gems/rake-13.1.0/lib/rake/application.rb:157
0.00 100.00 run - /home/runner/bundle/ruby/3.3.0/gems/rake-13.1.0/lib/rake/application.rb:85
0.00 100.00 run - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/vendor/thor/lib/thor/command.rb:37
0.00 100.00 run - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/cli/exec.rb:30
0.00 100.00 perform_with_fork - /home/runner/bundle/ruby/3.3.0/gems/resque-2.6.0/lib/resque/worker.rb:942
0.00 100.00 loop - <internal:kernel>:192
0.00 100.00 load [c function] - (unknown)
0.00 100.00 kernel_load - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/cli/exec.rb:65
0.00 100.00 invoke_with_call_chain - /home/runner/bundle/ruby/3.3.0/gems/rake-13.1.0/lib/rake/task.rb:226
0.00 100.00 invoke_task - /home/runner/bundle/ruby/3.3.0/gems/rake-13.1.0/lib/rake/application.rb:183
0.00 100.00 invoke_command - /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/vendor/thor/lib/thor/invocation.rb:129
Line 285 appears to be the end of def work_one_job, confirmed on this specific machine:
$ cat /home/runner/bundle/ruby/3.3.0/gems/resque-2.6.0/lib/resque/worker.rb | head -n 285 | tail -19
def work_one_job(job = nil, &block)
return false if paused?
return false unless job ||= reserve
working_on job
procline "Processing #{job.queue} since #{Time.now.to_i} [#{job.payload_class_name}]"
log_with_severity :info, "got: #{job.inspect}"
job.worker = self
if fork_per_job?
perform_with_fork(job, &block)
else
perform(job, &block)
end
done_working
true
end
friendly_errors may be of interest:
$ cat /home/runner/bundle/ruby/3.3.0/gems/bundler-2.4.21/lib/bundler/friendly_errors.rb | head -n 125 | tail -n 11
def self.with_friendly_errors
FriendlyErrors.enable!
yield
rescue SignalException
raise
rescue Exception => e # rubocop:disable Lint/RescueException
raise if FriendlyErrors.disabled?
FriendlyErrors.log_error(e)
exit FriendlyErrors.exit_status(e)
end
Is there some exception bubbling up here causing exit to be hit? I know that might interact interestingly with resque. We have RUN_AT_EXIT_HOOKS=1, which was recommended by the datadog folks to get code profiling to work (however, I have already disabled that code in the current workers; that was my first move when I noticed the hanging workers).
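As a generic illustration of why RUN_AT_EXIT_HOOKS is interesting here (plain Ruby, not resque's code): exit runs at_exit handlers, any of which could hang the forked child indefinitely, whereas exit! skips them.

# In the child, an at_exit handler that never returns (stand-in for a telemetry flush).
pid = fork do
  at_exit { sleep }
  exit    # exit runs at_exit handlers, so this child hangs in the handler above
  # exit! # exit! would skip at_exit handlers and the child would terminate immediately
end

# The parent stays blocked here for as long as the child's at_exit handler runs.
Process.waitpid(pid)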
strace:
$ sudo strace -f -p 3921 | head -n 100
strace: Process 3921 attached
futex(0x55f1e553c0b0, FUTEX_WAIT_PRIVATE, 2, NULL
Output hangs there with no closing paren and no newline. I don't know how to parse strace output so I'm just parroting it.
The parent process has more verbose but maybe not so informative output. Here's a sample -- it just keeps vomiting out more if I let it run:
[pid 44] futex(0x55f1ee79f8c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=42942872}, 0x7f2c8c42c7b0) = 0
[pid 45] futex(0x55f1ee79f8c0, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid 44] <... futex resumed>) = 0
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 44] futex(0x55f1ee79f870, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=166474748}) = 0
[pid 44] clock_gettime(0xfffffff6 /* CLOCK_??? */, {tv_sec=11, tv_nsec=478485871}) = 0
[pid 44] clock_gettime(0xffffffb6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=13040685}) = 0
[pid 44] clock_gettime(0xffffffae /* CLOCK_??? */, {tv_sec=0, tv_nsec=41377001}) = 0
[pid 44] clock_gettime(0xffffffa6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=172823169}) = 0
[pid 44] clock_gettime(0xffffff76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=265477740}) = 0
[pid 44] clock_gettime(0xffffff6e /* CLOCK_??? */, {tv_sec=0, tv_nsec=890385521}) = 0
[pid 44] clock_gettime(0xffffff66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=778745020}) = 0
[pid 44] clock_gettime(0xffffff5e /* CLOCK_??? */, {tv_sec=0, tv_nsec=817141373}) = 0
[pid 44] clock_gettime(0xffffff0e /* CLOCK_??? */, {tv_sec=0, tv_nsec=676852386}) = 0
[pid 44] clock_gettime(0xffffff06 /* CLOCK_??? */, {tv_sec=0, tv_nsec=316683}) = 0
[pid 44] clock_gettime(0xfffffeee /* CLOCK_??? */, {tv_sec=0, tv_nsec=672358079}) = 0
[pid 44] clock_gettime(0xfffffee6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=220416}) = 0
[pid 44] clock_gettime(0xfffffece /* CLOCK_??? */, {tv_sec=0, tv_nsec=851675597}) = 0
[pid 44] clock_gettime(0xfffffec6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=180611}) = 0
[pid 44] clock_gettime(0xfffffeae /* CLOCK_??? */, {tv_sec=0, tv_nsec=7388677}) = 0
[pid 44] clock_gettime(0xfffffe96 /* CLOCK_??? */, {tv_sec=10, tv_nsec=779297188}) = 0
[pid 44] clock_gettime(0xfffffe8e /* CLOCK_??? */, {tv_sec=9, tv_nsec=229016859}) = 0
[pid 44] clock_gettime(0xfffffe76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=882559104}) = 0
[pid 44] clock_gettime(0xfffffe6e /* CLOCK_??? */, {tv_sec=5, tv_nsec=667769214}) = 0
[pid 44] clock_gettime(0xfffffe66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=151259012}) = 0
[pid 44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=166910183}) = 0
[pid 44] futex(0x55f1ee79f8c4, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=44852822}, 0x7f2c8c42c7b0) = 0
[pid 45] futex(0x55f1ee79f8c4, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid 44] <... futex resumed>) = 0
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 44] futex(0x55f1ee79f870, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=166952274}) = 0
[pid 44] clock_gettime(0xfffffff6 /* CLOCK_??? */, {tv_sec=11, tv_nsec=478485871}) = 0
[pid 44] clock_gettime(0xffffffb6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=13040685}) = 0
[pid 44] clock_gettime(0xffffffae /* CLOCK_??? */, {tv_sec=0, tv_nsec=41377001}) = 0
[pid 44] clock_gettime(0xffffffa6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=172823169}) = 0
[pid 44] clock_gettime(0xffffff76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=265477740}) = 0
[pid 44] clock_gettime(0xffffff6e /* CLOCK_??? */, {tv_sec=0, tv_nsec=890385521}) = 0
[pid 44] clock_gettime(0xffffff66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=778745020}) = 0
[pid 44] clock_gettime(0xffffff5e /* CLOCK_??? */, {tv_sec=0, tv_nsec=817141373}) = 0
[pid 44] clock_gettime(0xffffff0e /* CLOCK_??? */, {tv_sec=0, tv_nsec=676852386}) = 0
[pid 44] clock_gettime(0xffffff06 /* CLOCK_??? */, {tv_sec=0, tv_nsec=316683}) = 0
[pid 44] clock_gettime(0xfffffeee /* CLOCK_??? */, {tv_sec=0, tv_nsec=672358079}) = 0
[pid 44] clock_gettime(0xfffffee6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=220416}) = 0
[pid 44] clock_gettime(0xfffffece /* CLOCK_??? */, {tv_sec=0, tv_nsec=851675597}) = 0
[pid 44] clock_gettime(0xfffffec6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=180611}) = 0
[pid 44] clock_gettime(0xfffffeae /* CLOCK_??? */, {tv_sec=0, tv_nsec=7388677}) = 0
[pid 44] clock_gettime(0xfffffe96 /* CLOCK_??? */, {tv_sec=10, tv_nsec=779354288}) = 0
[pid 44] clock_gettime(0xfffffe8e /* CLOCK_??? */, {tv_sec=9, tv_nsec=229016859}) = 0
[pid 44] clock_gettime(0xfffffe76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=882559104}) = 0
[pid 44] clock_gettime(0xfffffe6e /* CLOCK_??? */, {tv_sec=5, tv_nsec=667769214}) = 0
[pid 44] clock_gettime(0xfffffe66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=151259012}) = 0
[pid 44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=167329296}) = 0
[pid 44] futex(0x55f1ee79f8c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=41631252}, 0x7f2c8c42c7b0) = 0
[pid 45] futex(0x55f1ee79f8c0, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid 44] <... futex resumed>) = 0
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 44] futex(0x55f1ee79f870, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=167376646}) = 0
[pid 44] clock_gettime(0xfffffff6 /* CLOCK_??? */, {tv_sec=11, tv_nsec=478485871}) = 0
[pid 44] clock_gettime(0xffffffb6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=13040685}) = 0
[pid 44] clock_gettime(0xffffffae /* CLOCK_??? */, {tv_sec=0, tv_nsec=41377001}) = 0
[pid 44] clock_gettime(0xffffffa6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=172823169}) = 0
[pid 44] clock_gettime(0xffffff76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=265477740}) = 0
[pid 44] clock_gettime(0xffffff6e /* CLOCK_??? */, {tv_sec=0, tv_nsec=890385521}) = 0
[pid 44] clock_gettime(0xffffff66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=778745020}) = 0
[pid 44] clock_gettime(0xffffff5e /* CLOCK_??? */, {tv_sec=0, tv_nsec=817141373}) = 0
[pid 44] clock_gettime(0xffffff0e /* CLOCK_??? */, {tv_sec=0, tv_nsec=676852386}) = 0
[pid 44] clock_gettime(0xffffff06 /* CLOCK_??? */, {tv_sec=0, tv_nsec=316683}) = 0
[pid 44] clock_gettime(0xfffffeee /* CLOCK_??? */, {tv_sec=0, tv_nsec=672358079}) = 0
[pid 44] clock_gettime(0xfffffee6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=220416}) = 0
[pid 44] clock_gettime(0xfffffece /* CLOCK_??? */, {tv_sec=0, tv_nsec=851675597}) = 0
[pid 44] clock_gettime(0xfffffec6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=180611}) = 0
[pid 44] clock_gettime(0xfffffeae /* CLOCK_??? */, {tv_sec=0, tv_nsec=7388677}) = 0
[pid 44] clock_gettime(0xfffffe96 /* CLOCK_??? */, {tv_sec=10, tv_nsec=779409707}) = 0
[pid 44] clock_gettime(0xfffffe8e /* CLOCK_??? */, {tv_sec=9, tv_nsec=229016859}) = 0
[pid 44] clock_gettime(0xfffffe76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=882559104}) = 0
[pid 44] clock_gettime(0xfffffe6e /* CLOCK_??? */, {tv_sec=5, tv_nsec=667769214}) = 0
[pid 44] clock_gettime(0xfffffe66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=151259012}) = 0
[pid 44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=167790802}) = 0
[pid 44] futex(0x55f1ee79f8c4, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=42501013}, 0x7f2c8c42c7b0) = 0
[pid 45] futex(0x55f1ee79f8c4, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid 44] <... futex resumed>) = 0
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 44] futex(0x55f1ee79f870, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=167833694}) = 0
[pid 44] clock_gettime(0xfffffff6 /* CLOCK_??? */, {tv_sec=11, tv_nsec=478485871}) = 0
[pid 44] clock_gettime(0xffffffb6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=13040685}) = 0
[pid 44] clock_gettime(0xffffffae /* CLOCK_??? */, {tv_sec=0, tv_nsec=41377001}) = 0
[pid 44] clock_gettime(0xffffffa6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=172823169}) = 0
[pid 44] clock_gettime(0xffffff76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=265477740}) = 0
[pid 44] clock_gettime(0xffffff6e /* CLOCK_??? */, {tv_sec=0, tv_nsec=890385521}) = 0
[pid 44] clock_gettime(0xffffff66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=778745020}) = 0
[pid 44] clock_gettime(0xffffff5e /* CLOCK_??? */, {tv_sec=0, tv_nsec=817141373}) = 0
[pid 44] clock_gettime(0xffffff0e /* CLOCK_??? */, {tv_sec=0, tv_nsec=676852386}) = 0
[pid 44] clock_gettime(0xffffff06 /* CLOCK_??? */, {tv_sec=0, tv_nsec=316683}) = 0
[pid 44] clock_gettime(0xfffffeee /* CLOCK_??? */, {tv_sec=0, tv_nsec=672358079}) = 0
[pid 44] clock_gettime(0xfffffee6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=220416}) = 0
[pid 44] clock_gettime(0xfffffece /* CLOCK_??? */, {tv_sec=0, tv_nsec=851675597}) = 0
[pid 44] clock_gettime(0xfffffec6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=180611}) = 0
[pid 44] clock_gettime(0xfffffeae /* CLOCK_??? */, {tv_sec=0, tv_nsec=7388677}) = 0
[pid 44] clock_gettime(0xfffffe96 /* CLOCK_??? */, {tv_sec=10, tv_nsec=779466345}) = 0
[pid 44] clock_gettime(0xfffffe8e /* CLOCK_??? */, {tv_sec=9, tv_nsec=229016859}) = 0
[pid 44] clock_gettime(0xfffffe76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=882559104}) = 0
[pid 44] clock_gettime(0xfffffe6e /* CLOCK_??? */, {tv_sec=5, tv_nsec=667769214}) = 0
[pid 44] clock_gettime(0xfffffe66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=151259012}) = 0
[pid 44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=168185170}) = 0
[pid 44] futex(0x55f1ee79f8c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=42168942}, 0x7f2c8c42c7b0) = 0
[pid 45] futex(0x55f1ee79f8c0, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid 44] <... futex resumed>) = 0
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 44] futex(0x55f1ee79f870, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=168223530}) = 0
[pid 44] clock_gettime(0xfffffff6 /* CLOCK_??? */, {tv_sec=11, tv_nsec=478485871}) = 0
[pid 44] clock_gettime(0xffffffb6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=13040685}) = 0
[pid 44] clock_gettime(0xffffffae /* CLOCK_??? */, {tv_sec=0, tv_nsec=41377001}) = 0
[pid 44] clock_gettime(0xffffffa6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=172823169}) = 0
[pid 44] clock_gettime(0xffffff76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=265477740}) = 0
[pid 44] clock_gettime(0xffffff6e /* CLOCK_??? */, {tv_sec=0, tv_nsec=890385521}) = 0
[pid 44] clock_gettime(0xffffff66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=778745020}) = 0
[pid 44] clock_gettime(0xffffff5e /* CLOCK_??? */, {tv_sec=0, tv_nsec=817141373}) = 0
[pid 44] clock_gettime(0xffffff0e /* CLOCK_??? */, {tv_sec=0, tv_nsec=676852386}) = 0
[pid 44] clock_gettime(0xffffff06 /* CLOCK_??? */, {tv_sec=0, tv_nsec=316683}) = 0
[pid 44] clock_gettime(0xfffffeee /* CLOCK_??? */, {tv_sec=0, tv_nsec=672358079}) = 0
[pid 44] clock_gettime(0xfffffee6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=220416}) = 0
[pid 44] clock_gettime(0xfffffece /* CLOCK_??? */, {tv_sec=0, tv_nsec=851675597}) = 0
[pid 44] clock_gettime(0xfffffec6 /* CLOCK_??? */, {tv_sec=0, tv_nsec=180611}) = 0
[pid 44] clock_gettime(0xfffffeae /* CLOCK_??? */, {tv_sec=0, tv_nsec=7388677}) = 0
[pid 44] clock_gettime(0xfffffe96 /* CLOCK_??? */, {tv_sec=10, tv_nsec=779523837}) = 0
[pid 44] clock_gettime(0xfffffe8e /* CLOCK_??? */, {tv_sec=9, tv_nsec=229016859}) = 0
[pid 44] clock_gettime(0xfffffe76 /* CLOCK_??? */, {tv_sec=0, tv_nsec=882559104}) = 0
[pid 44] clock_gettime(0xfffffe6e /* CLOCK_??? */, {tv_sec=5, tv_nsec=667769214}) = 0
[pid 44] clock_gettime(0xfffffe66 /* CLOCK_??? */, {tv_sec=0, tv_nsec=151259012}) = 0
[pid 44] clock_gettime(0xfffffe9e /* CLOCK_??? */, {tv_sec=98, tv_nsec=168589178}) = 0
[pid 44] futex(0x55f1ee79f8c4, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
There are some potentially more interesting snippets:
[pid 44] futex(0x55f1ee79f8c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=39936383}, <unfinished ...>
[pid 18] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 18] futex(0x55f1eff435b0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 18] write(7, "\27\3\3\0Od\350\315\245\307\177\"\341\354\375\201,\325^\t_V\346\255\305zq\321\377\267\234\32"..., 84) = 84
[pid 18] read(7, 0x7f2c600d6fd3, 5) = -1 EAGAIN (Resource temporarily unavailable)
and
[pid 44] futex(0x55f1ee79f8c0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=48153081}, <unfinished ...>
[pid 49] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 49] futex(0x55f1ec353050, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 49] futex(0x55f1ee2fccf0, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3948641, tv_nsec=321738460}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 19] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 19] futex(0x55f1f0047c80, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 45] <... clock_nanosleep resumed>0x7f2c8c42c7b0) = 0
[pid 19] <... futex resumed>) = 0
[pid 45] getpid() = 1
[pid 45] tgkill(1, 19, SIGPROF) = 0
[pid 19] --- SIGPROF {si_signo=SIGPROF, si_code=SI_TKILL, si_pid=1, si_uid=1000} ---
[pid 45] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 19] rt_sigreturn({mask=[]}) = 139829380472440
I'm not sure what those pids are, but as a reminder, this container is only running bundle exec rake resque:work, so anything must come from that or OS / k8s invoked processes.
rbspy for the parent process
$ sudo /home/runner/.cargo/bin/rbspy record -p 1
Time since start: 101s. Press Ctrl+C to stop.
Summary of profiling data so far:
% self % total name
98.28 98.28 _native_idle_sampling_loop [c function] - (unknown)
1.41 1.41 sleep [c function] - (unknown)
0.10 0.10 select [c function] - (unknown)
0.07 0.07 count_objects [c function] - (unknown)
0.03 0.03 _native_serialize [c function] - (unknown)
0.01 0.26 block (2 levels) in create_timeout_thread - /usr/local/lib/ruby/3.3.0/timeout.rb:112
0.01 0.03 map [c function] - (unknown)
0.01 0.03 block in run_sdam_flow - /home/runner/bundle/ruby/3.3.0/gems/mongo-2.19.3/lib/mongo/server/monitor.rb:256
0.01 0.01 sysread_nonblock [c function] - (unknown)
0.01 0.01 read_status_line - /usr/local/lib/ruby/3.3.0/net/http/response.rb:162
0.01 0.01 get_hash [c function] - (unknown)
0.01 0.01 flush - /home/runner/bundle/ruby/3.3.0/gems/dogstatsd-ruby-5.5.0/lib/datadog/statsd/telemetry.rb:97
0.01 0.01 downcase [c function] - (unknown)
0.01 0.01 block in record_loaded_specs - /home/runner/bundle/ruby/3.3.0/gems/ddtrace-1.22.0/lib/datadog/profiling/collectors/code_p
0.01 0.01 block in initialize - /home/runner/bundle/ruby/3.3.0/gems/mongo-2.19.3/lib/mongo/address.rb:81
0.01 0.01 _native_do_export [c function] - (unknown)
0.00 98.28 block (2 levels) in start - /home/runner/bundle/ruby/3.3.0/gems/ddtrace-1.22.0/lib/datadog/profiling/collectors/idle_samp
0.00 1.58 synchronize [c function] - (unknown)
0.00 1.39 wait [c function] - (unknown)
0.00 1.11 loop - <internal:kernel>:192
Sorry, I don't have time to do more than skim all of the details you've shared. But, with that caveat, I would personally experiment with 1) disabling or removing telemetry (ddtrace, airbrake, etc.) on a percentage of your workers, and 2) setting ENV["FORK_PER_JOB"] = "false" on a percentage of your workers. It might be related to neither of these, but those are my first thoughts after seeing what you shared.
Obviously you'll eventually still need any telemetry you disable, but if it is related to DataDog or Airbrake, at least you'll have another clue about where to debug. I've debugged similar issues in the past and tracked them down to airbrake and/or newrelic_rpm (IIRC). I'm curious, for the others who had issues, what telemetry gems are you running (and what versions)?
Those are good suggestions to experiment with; unfortunately I think I've run out of time to keep this condition going in our prod application and will have to resort to reverting to ruby 3.2 for now to restore stability.
fwiw:
airbrake (13.0.3)
airbrake-ruby (6.2.1)
ddtrace (1.22.0)
We did not use airbrake or ddtrace, but did use newrelic_rpm 9.7.0 at the time the bug occurred.