
Comments (33)

chanks commented on September 7, 2024

Actually, the opposite! I wrote that example and then fixed the gem to work that way, but we haven't had a new release since that change. So, on master, @attrs[:run_at] should be an instance of Time, whereas in 0.5.0 it would be a string representation of the time, I believe. I'm not sure where/why it would be turned into an instance of ActiveSupport::Duration, unless it has to do with your code?
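
For what it's worth, if run_at is still a string (the 0.5.0 behavior) and your job adds an interval to it, you'd get exactly that message. A rough sketch, with a made-up one-hour interval:

require 'active_support/all'   # provides 1.hour, an ActiveSupport::Duration
require 'time'

run_at = Time.now.to_s         # what 0.5.0 hands you: a string
run_at + 1.hour                # TypeError: no implicit conversion of
                               #   ActiveSupport::Duration into String
Time.parse(run_at) + 1.hour    # fine once it's parsed into a Time, which is
                               # what master now gives you directly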

On Mon, Jan 27, 2014 at 8:03 PM, Chris Kalafarski wrote:

Any time I try to use @attrs[:run_at], even in exactly the same manner as described in the docs for continuous jobs, I get an error:

"error":{"class":"TypeError","message":"no implicit conversion of ActiveSupport::Duration into String"}

Perhaps something has changed and the docs haven't caught up yet?


farski commented on September 7, 2024

Hmm, I don't think so, but maybe I was doing so inadvertently. I'm working on a very early-stage app; is master generally stable enough to use for development?

chanks commented on September 7, 2024

Not generally, but it is at the moment. The only thing preventing 0.6.0 is for @joevandyk (and whoever else) to play with the named queue system and give me feedback. To be safe, why don't you use:

gem 'que', :git => 'git://github.com/chanks/que.git', :ref => 'f4588ca'

The upgrade to 0.6.0 once it's out (in the next few days, maybe?) shouldn't be difficult.

chanks commented on September 7, 2024

Also, be sure to read the changelog and docs on the 0.6-docs branch: https://github.com/chanks/que/tree/0.6-docs

chanks commented on September 7, 2024

Actually, it's better to use the 1ab0859 ref. I just fixed a stupid bug.

farski commented on September 7, 2024

Thanks for the help. I'm going to stick it out on master until 0.6 is stable.

Also, totally hijacking this thread, I sometimes get jobs that stick around for a long time. Like, I'll shut down my workers and see one or two jobs that were running for several hours. Is there any way to make sure they just error out after a few minutes if they're not done?

chanks commented on September 7, 2024

That's tricky. Que doesn't offer that functionality out of the box because there isn't a way in Ruby to do that safely. DelayedJob, for example, uses Ruby's Timeout module to provide a max_run_time parameter, but it uses Thread#kill, which is not safe. You can read this and this for a ton of detail on why, but for example, if the timeout triggered while you were in a transaction, it could commit prematurely and corrupt your data. (I have an open pull request to make ActiveRecord more reliable in this regard.)

So, it depends on what you're doing that can take so long. If you're making API calls or HTTP requests that hang every once in a while, you could try wrapping those individual parts of your job in timeout blocks - if one took too long, a timeout error would be raised and Que would just retry the job a little while later like usual. It's not a perfect solution, since there's always a risk of unpredictable things happening when you use Timeout, but the worst-case scenario should still be fixable by restarting the process.

I'll be sure to add a section on timeouts to the Writing Reliable Jobs doc before 0.6.0. It'll basically be what I've written here, plus some code samples.
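
As a rough sketch - the job class, URL argument, and two-minute limit below are just placeholders for whatever your job actually does:

require 'timeout'
require 'net/http'

class FeedRefreshJob < Que::Job
  def run(feed_url)
    xml = Timeout.timeout(120) do    # give the slow part two minutes, tops
      Net::HTTP.get(URI(feed_url))   # the HTTP call that occasionally hangs
    end
    parse_and_persist(xml)           # stand-in for your existing parsing and
                                     # persisting logic; if the timeout fires,
                                     # Timeout::Error propagates and Que just
                                     # retries the job later
  end
end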

farski commented on September 7, 2024

Ok cool, thanks for the info. Haven't really dug into what's causing the problem; the jobs are HTTP requests for XML that gets parsed and then persisted. The HTTP is the likely culprit, since I've never seen the parser hang and I doubt it's postgres. I'll try the timeout block.

farski commented on September 7, 2024

Even with the timeout block I'm still getting jobs hanging for hours. Any suggestions on how to continue debugging this?

joevandyk commented on September 7, 2024

Maybe use strace (or related tool) to look at the process and see what it's stuck on? http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/

@chanks would that time zone bug cause hanging here? my tests would hang forever.

chanks commented on September 7, 2024

@joevandyk The time zone bug shouldn't cause this. It was causing jobs to be set with unpredictable run_at values, and then specs that were waiting on jobs before they could continue would just hang (the jobs themselves would run fine).

@farski You can use Que.log to output information on what your job is doing; that'll help you see exactly where it's blocking. See the logging doc for details.
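
Something like this sprinkled through your run method, for instance - I'm assuming Que.log just takes a hash of whatever attributes you want recorded (check the logging doc for the exact interface), and the event names and helper methods here are placeholders:

def run(feed_url)
  Que.log :event => 'fetch_start', :url => feed_url
  xml = fetch_feed(feed_url)             # placeholder for your HTTP call
  Que.log :event => 'fetch_done', :bytes => xml.bytesize
  update_episodes(xml)                   # placeholder for parsing/persisting
  Que.log :event => 'job_done'
end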

You can also set the logger level to DEBUG to get more information on what Que is doing. You can do that by setting QUE_LOG_LEVEL=DEBUG when running rake que:work - there isn't currently a way to do that when running workers inside the web process; I should fix that. But if you can post some logs of what's going on, that'll help.

You might also try manually running some jobs with Que::Job.work - that'll grab the most important job, work it, delete it, and return. If you run it in a loop many times and get a hang, it's probably a problem with your job. If you don't, it's probably a problem with the worker system.
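
For the loop, something as simple as this in a console should do - the sleep is just so an empty queue isn't a tight loop against the database:

loop do
  Que::Job.work   # grabs the most important job, works it, deletes it
  sleep 1
end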

Also, if you can provide an example that reproduces this issue outside your application, I'd be happy to take a look at it, but that may be time-consuming.

farski commented on September 7, 2024

@chanks Ok I'll do some digging. More logging may not get me very far, since I don't generally notice these jobs until they're a few hours old, and that would be a lot of logs to look through at that point to pick out details of a specific worker. I will try brute forcing it with Que::Job.work over the weekend hopefully.

chanks commented on September 7, 2024

Ok. The logs are emitted in JSON in order to make it easier to write scripts to parse and filter them, but it can still be a hassle.

farski commented on September 7, 2024

This is probably a dumb question, but if I start que:work with WORKER_COUNT=8, when I do Que.worker_states, shouldn't I get 8 states? I'm only seeing four.

chanks commented on September 7, 2024

Ack, my mistake. The environment variable to set the number of workers was changed from WORKER_COUNT to QUE_WORKER_COUNT in 0.5.0, but I neglected to update the docs. I'll do that now.

In general, though, Que.worker_states will return however many workers are currently working jobs. If you did have eight workers, but four of them were currently asleep, then you'd only see four in worker_states.

farski commented on September 7, 2024

So I've had a Que::Job.work loop going for about 18 hours now, and it hasn't gotten hung up on any jobs for more than a few minutes. That's quite different from when I run rake que:work, where I will tend to find at least one or two jobs that are several hours old even when the task has been running for only a few hours. That makes it seem to me like the issue isn't with my jobs, per se.

chanks commented on September 7, 2024

Ok, some questions:

  • What's Que.worker_states showing? It should show you the general state of the worker's PG connection (like the last query it ran), so if your job is touching the database, that should give you an indication of where it's hanging.
  • Since it hasn't been clear so far, what's the behavior you see that makes you think it's hanging? Is it that an entry in worker_states stays static for long periods of time, or are you looking at the effects of the job (its DB changes and whatnot), or is it something else?
  • Does it seem to occur with only some job types, or all job types, or some job types more than others? Does it seem to occur more often when lots of jobs are being worked simultaneously? Does the hanging occur in just one worker or in a few or in all of them? Does the hanging seem to only occur on startup, or is it ever the case that things run fine for a period of time and then hang later?
  • Try starting up a console and setting Que.worker_count = 8 to run jobs in the console process. If things work fine there, it's an indication that maybe something is wrong with the rake task. If jobs start to hang there, then we can at least inspect the worker threads and see what they're doing.
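
For that last one, roughly this - the worker count is whatever you want to test with, and depending on how your app configures Que you may also need Que.mode = :async:

Que.worker_count = 8             # start working jobs inside the console process

# then, whenever you like, check what each busy worker is doing:
puts Que.worker_states.inspect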

farski commented on September 7, 2024

  1. I will get back to you with this
  2. Hang may not be the right word, but what I'm seeing is: after the rake task has been running for some period, I will kill the task, and it takes a long time to clean up the current workers. Once they do finish and I see the results, a couple will have an elapsed time of, like, 20000 seconds. My only interpretation of that is that the job started over five hours ago and never finished for some reason.
  3. I only have one job class currently. Hard to say if it's more likely to happen with more workers; I can try to keep an eye on that. I don't think it happens right on startup.
  4. I will try this.

I have also made some other optimizations in the worker, so I will see how those change things.

chanks commented on September 7, 2024

I need to document this somewhere, but the intended usage is that SIGINT or SIGTERM tells the workers to stop after their current job (which may be a very long time, depending on the job), while SIGKILL shuts everything down immediately. This works well with Heroku's setup - they issue a SIGTERM and then a SIGKILL 10 seconds later if the process hasn't stopped yet.

It's also really the only safe way to handle things - if the Ruby process ends due to a SIGTERM, it kills all the still-running threads, which may result in database corruption (due to the transaction issues explained in the links above). That doesn't happen if the process ends due to a SIGKILL. So, as long as you're using transactions properly, you should be able to use SIGKILL to get an immediate shutdown without losing any data.

BTW, if you get a hang while running jobs in the console, you can get the backtrace for each worker like this:

Que::Worker.workers.map{|w| w.thread.backtrace}

That should provide some quick answers.

farski commented on September 7, 2024

Ideally, once this is in production, I won't be shutting the workers down very frequently. The delay in stopping them was really just how I noticed that there were some very long-running jobs. Given that I have a 2min timeout on the HTTP requests and the only other thing the job does is sometimes update records (at most about 2000 records), the 2+ hour jobs are still somewhat inexplicable at this point.

I will try to get a backtrace on one next time it happens.

I appreciate all the help!

chanks commented on September 7, 2024

No problem, I really appreciate you taking the time to help me work this stuff out. One of my fears is that people will try Que out, run into an error or unexpected behavior or a lack of documentation, and then give up and go back to a more familiar alternative without letting me know what I can do to fix it. I'd much rather have people opening issues than getting silently frustrated.

farski commented on September 7, 2024

So this is probably not very helpful, but here is some stuff I pulled together. This is all from running Que.worker_count = 8 at 2014-02-02 15:18:22 -0500. It's been running fine (meaning, it has been running through jobs successfully) since then (right now it's 2014-02-02 16:50:49 -0500).

https://gist.github.com/farski/5b402703952ad5490d9e

After dumping the worker states every few minutes for quite a while, it was obvious 39751 was not finishing. The job working on that Podcast could potentially be updating every Episode associated with it (actually, deleting all episodes and creating new ones from the RSS feed), so at ~1300 episodes, that's a decent amount of data.

If I'm reading that state dump correctly, it looks like it started that SELECT over an hour ago. Selecting fewer than 2000 rows probably shouldn't take an hour.

I'm guessing that this is a similar case to what I've been seeing all along when running several workers. Every time I've run this queue more directly (e.g. the Que::Job.work loop) I haven't seen any single job take anywhere close to this long to finish, and this one isn't even finished yet. I will let it sit there for a while to see if it ever completes.

joevandyk commented on September 7, 2024

Could there be any locks on the table? What does select * from pg_stat_activity where state != 'idle' show?
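
You could also look for lock requests that are stuck waiting - something along these lines, from psql or via Que.execute (the column list is just what I'd glance at first):

puts Que.execute(<<-SQL).inspect
  SELECT locktype, relation::regclass AS relation, pid, mode, granted
  FROM pg_locks
  WHERE NOT granted
SQL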

farski commented on September 7, 2024

https://gist.github.com/farski/a23fbdc7320631b34a7a

joevandyk commented on September 7, 2024

Do the other queries complete if you kill the vacuum process?

farski commented on September 7, 2024

Can you be more specific about which "other" queries you mean, and how I should kill the process?

joevandyk commented on September 7, 2024

You have an autovacuum process running with pid 22834 and 5 active select queries. Some of the select queries are inside a transaction that's more than an hour old.

If you kill the vacuum process (with select pg_terminate_backend(22834)), do the other select queries finish?

farski commented on September 7, 2024

So just do Que.execute('select pg_terminate_backend(22834)')? (Sorry, fairly new to postgres.)

chanks commented on September 7, 2024

That should do it, but if it works, it's not a very good long-term solution. It's strange to me that autovacuum would be interfering with this.

farski commented on September 7, 2024

https://gist.github.com/farski/19e5f58aeeb5f6359c8a

The queue is still doing work, but it looks like a bunch of workers are now not super happy.

chanks commented on September 7, 2024

Did the autovacuum start back up? That would just reintroduce the problem after a delay.

Vacuuming is going to be blocked by transactions that are left open for long periods of time, but I just don't know why a vacuum would cause a simple SELECT to hang like this. @joevandyk may have a better idea than I do, but I think you should probably ask in #postgresql on freenode or one of the pgsql mailing lists. You might mention that you're using a job queue that holds open some advisory locks (though not within a transaction), but I don't think Que is causing this problem.

farski commented on September 7, 2024

Not sure about vacuuming starting up; I shut things down before I saw your message and didn't check.

We are launching a thing at work, so my time to work on this side project over the next few days will be limited.

chanks commented on September 7, 2024

Ok. Since it was unrelated PG queries that were hanging, I'm going to close this for now. When you figure out what caused this, please let us know - I'm curious.
