
Comments (33)

chanks commented on September 7, 2024

Actually, the opposite! I wrote that example and then fixed the gem to work that way, but we haven't had a new release since that change. So, on master, @attrs[:run_at] should be an instance of Time, whereas in 0.5.0 it would be a string representation of the time, I believe. I'm not sure where/why it would be turned into an instance of ActiveSupport::Duration, unless it has to do with your code?
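
For what it's worth, if run_at is still a string (the 0.5.0 behavior) and your job adds an interval to it, you'd get exactly that message. A rough sketch, with a made-up one-hour interval:

require 'active_support/all'   # provides 1.hour, an ActiveSupport::Duration
require 'time'

run_at = Time.now.to_s         # what 0.5.0 hands you: a string
run_at + 1.hour                # TypeError: no implicit conversion of
                               #   ActiveSupport::Duration into String
Time.parse(run_at) + 1.hour    # fine once it's parsed into a Time, which is
                               # what master now gives you directly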

On Mon, Jan 27, 2014 at 8:03 PM, Chris Kalafarski wrote:

Any time I try to use @attrs[:run_at], even in exactly the same manner as described in the docs for continuous jobs, I get an error:

"error":{"class":"TypeError","message":"no implicit conversion of ActiveSupport::Duration into String"}

Perhaps something has changed and the docs haven't caught up yet?


farski commented on September 7, 2024

Hmm, I don't think so, but maybe I was doing so inadvertently. I'm working on a very early-stage app; is master generally stable enough to use for development?

chanks commented on September 7, 2024

Not generally, but it is at the moment. The only thing preventing 0.6.0 is for @joevandyk (and whoever else) to play with the named queue system and give me feedback. To be safe, why don't you use:

gem 'que', :git => 'git://github.com/chanks/que.git', :ref => 'f4588ca'

The upgrade to 0.6.0 once it's out (in the next few days, maybe?) shouldn't be difficult.

chanks commented on September 7, 2024

Also, be sure to read the changelog and docs on the 0.6-docs branch: https://github.com/chanks/que/tree/0.6-docs

chanks commented on September 7, 2024

Actually, it's better to use the 1ab0859 ref. I just fixed a stupid bug.

farski commented on September 7, 2024

Thanks for the help. I'm going to stick it out on master until 0.6 is stable.

Also, totally hijacking this thread, I sometimes get jobs that stick around for a long time. Like, I'll shut down my workers and see one or two jobs that were running for several hours. Is there any way to make sure they just error out after a few minutes if they're not done?

chanks commented on September 7, 2024

That's tricky. Que doesn't offer that functionality out of the box because there isn't a way in Ruby to do that safely. DelayedJob, for example, uses Ruby's Timeout module to provide a max_run_time parameter, but it uses Thread#kill, which is not safe. You can read this and this for a ton of detail on why, but for example, if the timeout triggered while you were in a transaction, it could commit prematurely and corrupt your data. (I have an open pull request to make ActiveRecord more reliable in this regard.)

So, it depends on what you're doing that can take so long. If you're making API calls or HTTP requests that hang every once in a while, you could try wrapping those individual parts of your job in timeout blocks - if one took too long, a timeout error would be raised and Que would just retry the job a little while later like usual. It's not a perfect solution, since there's always a risk of unpredictable things happening when you use Timeout, but the worst-case scenario should still be fixable by restarting the process.

I'll be sure to add a section on timeouts to the Writing Reliable Jobs doc before 0.6.0. It'll basically be what I've written here, plus some code samples.
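
As a rough sketch - the job class, URL argument, and two-minute limit below are just placeholders for whatever your job actually does:

require 'timeout'
require 'net/http'

class FeedRefreshJob < Que::Job
  def run(feed_url)
    xml = Timeout.timeout(120) do    # give the slow part two minutes, tops
      Net::HTTP.get(URI(feed_url))   # the HTTP call that occasionally hangs
    end
    parse_and_persist(xml)           # stand-in for your existing parsing and
                                     # persisting logic; if the timeout fires,
                                     # Timeout::Error propagates and Que just
                                     # retries the job later
  end
end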

farski commented on September 7, 2024

Ok cool, thanks for the info. Haven't really dug into what's causing the problem; the jobs are HTTP requests for XML that gets parsed and then persisted. The HTTP is the likely culprit, since I've never seen the parser hang and I doubt it's postgres. I'll try the timeout block.

farski commented on September 7, 2024

Even with the timeout block I'm still getting jobs hanging for hours. Any suggestions on how to continue debugging this?

joevandyk commented on September 7, 2024

Maybe use strace (or related tool) to look at the process and see what it's stuck on? http://chadfowler.com/blog/2014/01/26/the-magic-of-strace/

@chanks would that time zone bug cause hanging here? my tests would hang forever.

chanks commented on September 7, 2024

@joevandyk The time zone bug shouldn't cause this. It was causing jobs to be set with unpredictable run_at values, and then specs that were waiting on jobs before they could continue would just hang (the jobs themselves would run fine).

@farski You can use Que.log to output information on what your job is doing; that'll help you see exactly where it's blocking. See the logging doc for details.
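
Something like this sprinkled through your run method, for instance - I'm assuming Que.log just takes a hash of whatever attributes you want recorded (check the logging doc for the exact interface), and the event names and helper methods here are placeholders:

def run(feed_url)
  Que.log :event => 'fetch_start', :url => feed_url
  xml = fetch_feed(feed_url)             # placeholder for your HTTP call
  Que.log :event => 'fetch_done', :bytes => xml.bytesize
  update_episodes(xml)                   # placeholder for parsing/persisting
  Que.log :event => 'job_done'
end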

You can also set the logger level to DEBUG to get more information on what Que is doing. You can do that by setting QUE_LOG_LEVEL=DEBUG when running rake que:work - there isn't currently a way to do that when running workers inside the web process; I should fix that. But if you can post some logs of what's going on, that'll help.

You might also try manually running some jobs with Que::Job.work - that'll grab the most important job, work it, delete it, and return. If you run it in a loop many times and get a hang, it's probably a problem with your job. If you don't, it's probably a problem with the worker system.
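
For the loop, something as simple as this in a console should do - the sleep is just so an empty queue isn't a tight loop against the database:

loop do
  Que::Job.work   # grabs the most important job, works it, deletes it
  sleep 1
end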

Also, if you can provide an example that reproduces this issue outside your application, I'd be happy to take a look at it, but that may be time-consuming.

farski commented on September 7, 2024

@chanks Ok I'll do some digging. More logging may not get me very far, since I don't generally notice these jobs until they're a few hours old, and that would be a lot of logs to look through at that point to pick out details of a specific worker. I will try brute forcing it with Que::Job.work over the weekend hopefully.

chanks commented on September 7, 2024

Ok. The logs are emitted in JSON in order to make it easier to write scripts to parse and filter them, but it can still be a hassle.

farski commented on September 7, 2024

This is probably a dumb question, but if I start que:work with WORKER_COUNT=8, when I do Que.worker_states, shouldn't I get 8 states? I'm only seeing four.

chanks commented on September 7, 2024

Ack, my mistake. The environment variable to set the number of workers was changed from WORKER_COUNT to QUE_WORKER_COUNT in 0.5.0, but I neglected to update the docs. I'll do that now.

In general, though, Que.worker_states will return however many workers are currently working jobs. If you did have eight workers, but four of them were currently asleep, then you'd only see four in worker_states.

farski commented on September 7, 2024

So I've had a Que::Job.work loop going for about 18 hours now, and it hasn't gotten hung up on any jobs for more than a few minutes. That's quite different from when I run rake que:work, where I will tend to find at least one or two jobs that are several hours old even when the task has been running for only a few hours. That makes it seem to me like the issue isn't with my jobs, per se.

chanks commented on September 7, 2024

Ok, some questions:

  • What's Que.worker_states showing? It should show you the general state of the worker's PG connection (like the last query it ran), so if your job is touching the database, that should give you an indication of where it's hanging.
  • Since it hasn't been clear so far, what's the behavior you see that makes you think it's hanging? Is it that an entry in worker_states stays static for long periods of time, or are you looking at the effects of the job (its DB changes and whatnot), or is it something else?
  • Does it seem to occur with only some job types, or all job types, or some job types more than others? Does it seem to occur more often when lots of jobs are being worked simultaneously? Does the hanging occur in just one worker or in a few or in all of them? Does the hanging seem to only occur on startup, or is it ever the case that things run fine for a period of time and then hang later?
  • Try starting up a console and setting Que.worker_count = 8 to run jobs in the console process. If things work fine there, it's an indication that maybe something is wrong with the rake task. If jobs start to hang there, then we can at least inspect the worker threads and see what they're doing.
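
For that last one, roughly this - the worker count is whatever you want to test with, and depending on how your app configures Que you may also need Que.mode = :async:

Que.worker_count = 8             # start working jobs inside the console process

# then, whenever you like, check what each busy worker is doing:
puts Que.worker_states.inspect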

farski commented on September 7, 2024

  1. I will get back to you with this
  2. Hang may not be the right word, but what I'm seeing is: after the rake task has been running for some period, I will kill the task, and it takes a long time to clean up the current workers. Once they do finish and I see the results, a couple will have an elapsed time of, like, 20000 seconds. My only interpretation of that is that the job started over five hours ago and never finished for some reason.
  3. I only have one job class currently. Hard to say if it's more likely to happen with more workers; I can try to keep an eye on that. I don't think it happens right on startup.
  4. I will try this.

I have also made some other optimizations in the worker, so I will see how those change things.

chanks commented on September 7, 2024

I need to document this somewhere, but the intended usage is that SIGINT or SIGTERM tells the workers to stop after their current job (which may be a very long time, depending on the job), while SIGKILL shuts everything down immediately. This works well with Heroku's setup - they issue a SIGTERM and then a SIGKILL 10 seconds later if the process hasn't stopped yet.

It's also really the only safe way to handle things - if the Ruby process ends due to a SIGTERM, it kills all the still-running threads, which may result in database corruption (due to the transaction issues explained in the links above). That doesn't happen if the process ends due to a SIGKILL. So, as long as you're using transactions properly, you should be able to use SIGKILL to get an immediate shutdown without losing any data.

BTW, if you get a hang while running jobs in the console, you can get the backtrace for each worker like this:

Que::Worker.workers.map{|w| w.thread.backtrace}

That should provide some quick answers.

farski commented on September 7, 2024

Ideally, once this is in production, I won't be shutting the workers down very frequently. The delay in stopping them was really just how I noticed that there were some very long-running jobs. Given that I have a 2min timeout on the HTTP requests and the only other thing the job does is sometimes update records (at most about 2000 records), the 2+ hour jobs are still somewhat inexplicable at this point.

I will try to get a backtrace on one next time it happens.

I appreciate all the help!

chanks commented on September 7, 2024

No problem, I really appreciate you taking the time to help me work this stuff out. One of my fears is that people will try Que out, run into an error or unexpected behavior or a lack of documentation, and then give up and go back to a more familiar alternative without letting me know what I can do to fix it. I'd much rather have people opening issues than getting silently frustrated.

farski commented on September 7, 2024

So this is probably not very helpful, but here is some stuff I pulled together. This is all from running Que.worker_count = 8 at 2014-02-02 15:18:22 -0500. It's been running fine (meaning, it has been running through jobs successfully) since then (right now it's 2014-02-02 16:50:49 -0500).

https://gist.github.com/farski/5b402703952ad5490d9e

After dumping the worker states every few minutes for quite a while, it was obvious 39751 was not finishing. The job working on that Podcast could potentially be updating every Episode associated with it (actually, deleting all episodes and creating new ones from the RSS feed), so at ~1300 episodes, that's a decent amount of data.

If I'm reading that state dump correctly, it looks like it started that SELECT over an hour ago. Selecting fewer than 2000 rows probably shouldn't take an hour.

I'm guessing that this is a similar case to what I've been seeing all along when running several workers. Every time I've run this queue more directly (e.g. the Que::Job.work loop) I haven't seen any single job take anywhere close to this long to finish, and this one isn't even finished yet. I will let it sit there for a while to see if it ever completes.

joevandyk commented on September 7, 2024

Could there be any locks on the table? What does select * from pg_stat_activity where state != 'idle' show?
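
You could also look for lock requests that are stuck waiting - something along these lines, from psql or via Que.execute (the column list is just what I'd glance at first):

puts Que.execute(<<-SQL).inspect
  SELECT locktype, relation::regclass AS relation, pid, mode, granted
  FROM pg_locks
  WHERE NOT granted
SQL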

farski commented on September 7, 2024

https://gist.github.com/farski/a23fbdc7320631b34a7a

joevandyk commented on September 7, 2024

Do the other queries complete if you kill the vacuum process?

farski commented on September 7, 2024

Can you be more specific about which "other" queries you mean, and how I should kill the process?

joevandyk commented on September 7, 2024

You have an autovacuum process running with pid 22834 and 5 active select queries. Some of the select queries are inside a transaction that's more than an hour old.

If you kill the vacuum process (with select pg_terminate_backend(22834)), do the other select queries finish?

farski commented on September 7, 2024

So just do Que.execute('select pg_terminate_backend(22834)')? (Sorry, fairly new to postgres.)

chanks commented on September 7, 2024

That should do it, but if it works, it's not a very good long-term solution. It's strange to me that autovacuum would be interfering with this.

farski commented on September 7, 2024

https://gist.github.com/farski/19e5f58aeeb5f6359c8a

The queue is still doing work, but it looks like a bunch of workers are now not super happy.

chanks commented on September 7, 2024

Did the autovacuum start back up? That would just reintroduce the problem after a delay.

Vacuuming is going to be blocked by transactions that are left open for long periods of time, but I just don't know why a vacuum would cause a simple SELECT to hang like this. @joevandyk may have a better idea than I do, but I think you should probably ask in #postgresql on freenode or one of the pgsql mailing lists. You might mention that you're using a job queue that holds open some advisory locks (though not within a transaction), but I don't think Que is causing this problem.

farski commented on September 7, 2024

Not sure about vacuuming starting up; I shut things down before I saw your message and didn't check.

We are launching a thing at work, so my time to work on this side project over the next few days will be limited.

chanks commented on September 7, 2024

Ok. Since it was unrelated PG queries that were hanging, I'm going to close this for now. When you figure out what caused this, please let us know - I'm curious.
