EctoJob

A transactional job queue built with Ecto and GenStage.

It is compatible with PostgreSQL and MySQL, with one major difference:

  • PostgreSQL: job queue updates are pushed to ecto_job through the PostgreSQL LISTEN/NOTIFY feature.
  • MySQL: job queue updates are detected through database polling.

Goals

  • Transactional job processing
  • Retries
  • Scheduled jobs
  • Multiple queues
  • Low latency concurrent processing
  • Avoid frequent database polling
  • Library of functions, not a full OTP application

Getting Started

Add :ecto_job to your dependencies:

  {:ecto_job, "~> 3.1"}
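
In a typical mix.exs this goes in the deps list, for example:

def deps do
  [
    {:ecto_job, "~> 3.1"}
  ]
end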

Installation

Add a migration to install the notification function and create a job queue table:

mix ecto.gen.migration create_job_queue

defmodule MyApp.Repo.Migrations.CreateJobQueue do
  use Ecto.Migration

  @ecto_job_version 3

  def up do
    EctoJob.Migrations.Install.up()
    EctoJob.Migrations.CreateJobTable.up("jobs", version: @ecto_job_version)
  end

  def down do
    EctoJob.Migrations.CreateJobTable.down("jobs")
    EctoJob.Migrations.Install.down()
  end
end

By default, a job holds a map of arbitrary data (which corresponds to a jsonb field in the table). If you want to store an arbitrary Elixir/Erlang term in the job (bytea in the table), you can set the params_type option:

def up do
  EctoJob.Migrations.Install.up()
  EctoJob.Migrations.CreateJobTable.up("jobs", version: @ecto_job_version, params_type: :binary)
end

Compatibility

EctoJob leverages PostgreSQL-specific features, such as the notification mechanism triggered when a new job is inserted into a queue.

However, a non-optimized version of EctoJob can be used on top of MySQL >= 8.0.1. Older versions of MySQL / MariaDB may not work because of the use of the following syntax:

  • FOR UPDATE SKIP LOCKED
  • Default value for datetime column

Upgrading to version 3.0

To upgrade your project to version 3.0 of ecto_job, you must add a migration to update the pre-existing job queue tables:

mix ecto.gen.migration update_job_queue

defmodule MyApp.Repo.Migrations.UpdateJobQueue do
  use Ecto.Migration
  @ecto_job_version 3

  def up do
    EctoJob.Migrations.UpdateJobTable.up(@ecto_job_version, "jobs")
  end

  def down do
    EctoJob.Migrations.UpdateJobTable.down(@ecto_job_version, "jobs")
  end
end

Add a module for the queue, mixing in EctoJob.JobQueue. This declares an Ecto.Schema to use with the table created in the migration, and a start_link function so the worker supervision tree can be started conveniently.

defmodule MyApp.JobQueue do
  use EctoJob.JobQueue, table_name: "jobs"
end

For jobs holding arbitrary Elixir/Erlang terms, add the :params_type option:

defmodule MyApp.JobQueue do
  use EctoJob.JobQueue, table_name: "jobs", params_type: :binary
end

Add a perform/2 function to the job queue module; this is where jobs from the queue will be dispatched.

defmodule MyApp.JobQueue do
  use EctoJob.JobQueue, table_name: "jobs"

  def perform(multi = %Ecto.Multi{}, job = %{}) do
    # job logic here
  end
end

Add your new JobQueue module to the application supervision tree to run the worker supervisor:

def start(_type, _args) do

  children = [
    MyApp.Repo,
    {MyApp.JobQueue, repo: MyApp.Repo, max_demand: 100}
  ]

  opts = [strategy: :one_for_one, name: MyApp.Supervisor]
  Supervisor.start_link(children, opts)
end

If you want to run the workers on a separate node from the enqueuers, just leave your JobQueue module out of that node's supervision tree.

Usage

Enqueueing jobs

Jobs are Ecto schemas, with each queue backed by a different table. A job can be inserted directly into the Repo by constructing it with the new/2 function:

%{"type" => "SendEmail", "address" => "[email protected]", "body" => "Welcome!"}
|> MyApp.JobQueue.new()
|> MyApp.Repo.insert()

For inserting an arbitrary Elixir/Erlang term (requires the :params_type option set to :binary):

{"SendEmail", "[email protected]", "Welcome!"}
|> MyApp.JobQueue.new()
|> MyApp.Repo.insert()

or with a struct:

%MyStruct{}
|> MyApp.JobQueue.new()
|> MyApp.Repo.insert()

A job can be inserted with optional params:

  • :schedule : runs the job at the given %DateTime{}. The default value is DateTime.utc_now().
  • :max_attempts : the maximum number of attempts for this job. The default value is 5.
  • :priority : an integer; jobs with lower numbers run first. The default value is 0.

%{"type" => "SendEmail", "address" => "[email protected]", "body" => "Welcome!"}
|> MyApp.JobQueue.new(max_attempts: 10)
|> MyApp.Repo.insert()

%{"type" => "SendEmail", "address" => "[email protected]", "body" => "Welcome!"}
|> MyApp.JobQueue.new(priority: 1)
|> MyApp.Repo.insert()

%{"type" => "SendEmail", "address" => "[email protected]", "body" => "Welcome!"}
|> MyApp.JobQueue.new(priority: 2, max_attempts: 2)
|> MyApp.Repo.insert()
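
To run a job at a later time, pass a %DateTime{} via the :schedule option described above; for example, 30 minutes from now:

%{"type" => "SendEmail", "address" => "[email protected]", "body" => "Welcome!"}
|> MyApp.JobQueue.new(schedule: DateTime.add(DateTime.utc_now(), 30 * 60, :second))
|> MyApp.Repo.insert()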

The primary benefit of EctoJob is the ability to enqueue and process jobs transactionally. To achieve this, a job can be added to an Ecto.Multi, along with other application updates, using the enqueue/3 function:

Ecto.Multi.new()
|> Ecto.Multi.insert(:add_user, User.insert_changeset(%{name: "Joe", email: "[email protected]"}))
|> MyApp.JobQueue.enqueue(:email_job, %{"type" => "SendEmail", "address" => "[email protected]", "body" => "Welcome!"})
|> MyApp.Repo.transaction()

Handling Jobs

All jobs sent to a queue are eventually dispatched to the perform/2 function defined in the queue module. The first argument supplied is an Ecto.Multi which has been initialized with a delete operation, marking the job as complete. The Ecto.Multi struct must be passed to the Ecto.Repo.transaction function to complete the job, along with any other application updates.

defmodule MyApp.JobQueue do
  use EctoJob.JobQueue, table_name: "jobs"

  def perform(multi = %Ecto.Multi{}, job = %{"type" => "SendEmail", "address" => address, "body" => body}) do
    multi
    |> Ecto.Multi.run(:send, fn _repo, _changes -> EmailService.send(address, body) end)
    |> Ecto.Multi.insert(:stats, %EmailSendStats{recipient: address})
    |> MyApp.Repo.transaction()
  end
end

When a queue handles multiple job types, it is useful to pattern match on the job and delegate to separate modules:

defmodule MyApp.JobQueue do
  use EctoJob.JobQueue, table_name: "jobs"

  def perform(multi = %Ecto.Multi{}, job = %{"type" => "SendEmail"}),      do: MyApp.SendEmail.perform(multi, job)
  def perform(multi = %Ecto.Multi{}, job = %{"type" => "CustomerReport"}), do: MyApp.CustomerReport.perform(multi, job)
  def perform(multi = %Ecto.Multi{}, job = %{"type" => "SyncWithCRM"}),    do: MyApp.CRMSync.perform(multi, job)
  # ...
end

Options

You can customize how often the table is polled for scheduled jobs. The default is 60_000 ms.

config :ecto_job, :poll_interval, 15_000

Control how long a job is reserved while waiting for a worker to pick it up, before the poller makes it available again for dispatch by the producer. The default is 60_000 ms.

config :ecto_job, :reservation_timeout, 15_000

Control the delay between retries following a job execution failure. Keep in mind that retries are reactivated by polling, so a job will be retried no sooner than the next poll_interval allows. The default is 30_000 ms (30 seconds).

config :ecto_job, :retry_timeout, 30_000

Control the timeout for job execution, after which an "IN_PROGRESS" job is assumed to have failed. The timeout begins when the job is picked up by a worker. As with retry_timeout, an expired job becomes available again no sooner than the next poll_interval allows. The default is 300_000 ms (5 minutes).

config :ecto_job, :execution_timeout, 300_000

You can control whether logs are on or off and the log level. The defaults are true and :info.

config :ecto_job, log: true,
                  log_level: :debug

See EctoJob.Config for configuration details.

How it works

Each job queue is represented as a PostgreSQL table and Ecto schema.

Jobs are added to the queue by inserting into the table, using Ecto.Repo.transaction to transactionally enqueue jobs with other application updates.

A GenStage producer responds to demand for jobs by efficiently pulling jobs from the queue in batches. When there are not enough jobs in the queue, the demand for jobs is buffered.

As jobs are inserted into the queue, pg_notify notifies the producer that new work is available, allowing the producer to dispatch jobs immediately if there is pending demand.

A GenStage ConsumerSupervisor subscribes to the producer, and spawns a new Task for each job.

The callback for each job receives an Ecto.Multi structure, pre-populated with a delete command to remove the job from the queue.

Application code then adds additional commands to the Ecto.Multi and submits it to the Repo with a call to transaction, ensuring that application updates are performed atomically with the job removal.

Scheduled jobs and Failed jobs are reactivated by polling the database once per minute.

Job Lifecycle

Jobs scheduled to run at a future time start in the "SCHEDULED" state. Scheduled jobs transition to "AVAILABLE" after the scheduled time has passed.

Jobs that are intended to run immediately start in an "AVAILABLE" state.

The producer will update a batch of jobs, setting the state to "RESERVED" with an expiry controlled by the reservation_timeout option described in Options above.

Once a consumer is given a job, it increments the attempt counter and updates the state to "IN_PROGRESS", with an initial timeout configurable as execution_timeout, defaulting to 5 minutes. If the job is being retried, the expiry will be initial timeout * the attempt counter.

If successful, the consumer deletes the job from the queue by committing the preloaded multi passed to the perform/2 job handler. If an exception is raised in the worker, or a processing attempt fails to commit the preloaded multi, the job is transitioned to the "RETRY" state and scheduled to run again after retry_timeout * attempt counter. If the process is killed or is otherwise unable to transition to "RETRY", the job will remain "IN_PROGRESS" until the execution_timeout expires.

Jobs in the "RESERVED" or "IN_PROGRESS" state past the expiry time will be returned to the "AVAILABLE" state.

Expired jobs in the "IN_PROGRESS" state with attempts >= MAX_ATTEMPTS move to a "FAILED" state. Failed jobs are kept in the database so that application developers can handle the failure.
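
In summary, the transitions described above are:

SCHEDULED    --(scheduled time passes)-------------------> AVAILABLE
AVAILABLE    --(producer reserves a batch)---------------> RESERVED
RESERVED     --(consumer picks up the job)---------------> IN_PROGRESS
IN_PROGRESS  --(perform/2 multi commits)-----------------> deleted from the queue
IN_PROGRESS  --(failure)--> RETRY --(retry delay)--------> AVAILABLE
RESERVED or IN_PROGRESS --(expiry passes)----------------> AVAILABLE
IN_PROGRESS  --(expired, attempts >= max_attempts)-------> FAILED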

Job Timeouts and Transactional Safety

When performing long-running jobs, or when configuring a short execution timeout, keep in mind that a job may be retried before the original attempt has finished; there is no proactive mechanism to cancel a running job.

If the initial attempt then finishes and tries to commit a result, and the commit includes the preloaded multi passed as the first parameter to perform/2, the optimistic lock will fail the transaction.

Where the job performs other side effects outside of the transaction, such as calls to external APIs or additional database writes, it is recommended that these implement their own idempotency guarantees, as they will not be rolled back when a job fails or is duplicated.
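
For example, a perform/2 callback can distinguish a successful commit from a conflicting one by matching on the transaction result. This is a sketch: EmailService is a hypothetical module assumed to return {:ok, result} or {:error, reason}.

defmodule MyApp.JobQueue do
  use EctoJob.JobQueue, table_name: "jobs"

  def perform(multi = %Ecto.Multi{}, job = %{"type" => "SendEmail", "address" => address, "body" => body}) do
    multi
    |> Ecto.Multi.run(:send, fn _repo, _changes -> EmailService.send(address, body) end)
    |> MyApp.Repo.transaction()
    |> case do
      {:ok, _changes} ->
        :ok

      {:error, _failed_step, _reason, _changes_so_far} ->
        # Either the send step returned an error, or the delete of the job row
        # conflicted because the job was already completed or retried elsewhere.
        # The EmailService.send side effect is not rolled back, so it must be
        # idempotent.
        :error
    end
  end
end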

Copyright and License

Copyright (c) 2017 Mike Buhot

This library is released under the MIT License. See the LICENSE.md file for further details.

ecto_job's Issues

Honor host application timezone configuration

Problem Description

With an Ecto.Repo configured to use a different time type, like:

config :my_app, MyApp.Repo,
  ... other config ...
  migration_timestamps: [
    type: :timestamptz,
    usec: true
  ],
... other config ...

the jobs table is created with inserted_at and updated_at columns of the configured type, as expected, but the expires and schedule columns use the library's own type (also expected). This behavior is completely normal, but it forces application developers to write and maintain code that handles time differently for this library than elsewhere in the application.

Expected Behavior

Libraries should "play nice" with existing Ecto.Migration and Ecto.Repo configuration options, and should honor the configuration of the host application's Repo with respect to time column type and autogeneration.

Workaround

It isn't possible to use an application-defined time type, so the only workaround available seems to be to create the table via the migration as usual, then remember to deal with time differently when working with the application's jobs.

Potential Solutions

Todo: Support ecto 3 and ecto_sql

Ecto 3 is coming in the next couple of days, and with it the recommended upgrade path for those using it for database interaction is to change the dependency from ecto to ecto_sql.

I think that to support backwards compatibility for (presumably) a long time, we'll want to introduce an optional dependency.

Reference commit and announcement.

I'll be able to take this on eventually, as I'm working on a new project that will use ecto_sql and will want to use this lib, if no one else gets to it before then.

Using Multiple Queues

Hi,

I am wondering if I can set max_demand per job type.

The scenario is this: I have two kinds of jobs, 1. sequential jobs (processed one at a time), and 2. parallel jobs (processed many at a time).

How can I use ecto_job to accommodate these kinds of jobs?
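
One way to express this with the documented API is to back each kind of job with its own queue module and table, and start the sequential queue with max_demand: 1. A sketch, with hypothetical module and table names (each table would need its own CreateJobTable migration, as shown in the Installation section):

defmodule MyApp.SequentialJobQueue do
  use EctoJob.JobQueue, table_name: "sequential_jobs"
end

defmodule MyApp.ParallelJobQueue do
  use EctoJob.JobQueue, table_name: "parallel_jobs"
end

# In the application supervision tree:
children = [
  MyApp.Repo,
  {MyApp.SequentialJobQueue, repo: MyApp.Repo, max_demand: 1},   # one job at a time
  {MyApp.ParallelJobQueue, repo: MyApp.Repo, max_demand: 100}    # many jobs concurrently
]

Note that max_demand: 1 limits concurrency per running supervisor; with multiple nodes running the same queue, you would still get one job per node.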

Requeue seems inconsistent with the general use of Ecto.Multi

One would typically want to handle a multi in the following pattern:

existing_multi
|> Ecto.Multi.merge(fn %{job: job} ->
  Ecto.Multi.new()
  |> __MODULE__.requeue("requeue_job", job)
end)
|> MyApp.Repo.transaction()
|> case do
  {:error, :non_failed_job} -> do_something
end

The above would also allow you to pass the error on to fallback controllers and such.

However, as the requeue function is currently implemented, it does not just register an error on the multi, but rather returns an error directly. This forces one to add an inelegant error-handling block midstream, something like the following.

existing_multi
|> Ecto.Multi.merge(fn %{job: job} ->
  Ecto.Multi.new()
  |> __MODULE__.requeue("requeue_job", job)
  |> case do
    {:error, :non_failed_job} ->
      Ecto.Multi.new()
      |> Ecto.Multi.error("requeue_job", {:error, :non_failed_job})

    any ->
      any
  end
end)
|> MyApp.Repo.transaction()
|> case do
  {:error, :non_failed_job} -> do_something
end

Any interest in using postgres notifications?

Hey folks!

I'm looking at switching out our home-grown Postgres-based job library for ecto_job. One thing we do, however, is have a trigger on the queue table that does a pg_notify to a jobs topic. Then there's a Postgrex.Notifications process that listens for these and uses them to trigger a job fetch.

Notably, this is NOT used to replace getting the rows out of the database with lock: "FOR UPDATE SKIP LOCKED". It does replace the polling, however, and decreases the overall latency.

Any interest in having a PR that adds this (perhaps optionally?) to ecto_job?

Maintainers Wanted!

@ramondelemos @lukyanov @jeanparpaillon I'd really like to get your open PRs merged.
Unfortunately I'm only able to work on EctoJob in my free time, and haven't been able to thoroughly test out the changes.

Would any of you like to become a co-maintainer of EctoJob?

It's generally not a lot of work - EctoJob has been fairly stable over the years. I just like to be careful with updates since a bug in a job queue library could have a big impact on users :)

error list for unsuccessful jobs

Dear @mbuhot, I am thinking about how to create a PR that deals with job errors, adding an error list to the job data structure, so that every unsuccessful attempt adds an error message to the job record. It would help to have good visibility into what is happening behind the scenes on every job attempt.

Please, could you send me some small guidelines on how to implement it?
Regards,
Henry

LogLevel in configuration

Hi,

I tried to set log_level in the configuration (according to the README), but it still logs like crazy. Is this functionality working?

I'm just trying to get rid of the rather useless recurring log messages in the iex shell.

Thanks!

hex.pm

Excellent job!
Let me suggest delivering a tagged version to hex.pm.

EctoJob UI

Following from the proof-of-concept in #46 this issue is to track the integration between ecto_job and a separate EctoJob UI from @AaronCowan.

  • Update README to link to ecto_job_ui project
  • Add any additional calls to pg_notify for job state changes

Publish 3.0 to hex.pm

It appears that 3.0 is not published to hex.pm. Is there a plan to do that soonish? I like that the indexes were added to some columns. Nice project btw!

EDIT: Is 3.0 still in dev?

pg_notify not working when testing

When using notifications in tests, it seems that pg notifications are not triggered.
In the dev environment, notifications work.
I guess this is related to the Ecto Sandbox.

OTP 21
Elixir 1.8.1
ecto_job 2.0.0
ecto 3.0.0
postgrex 0.14.1

mysql compatible version?

Is it planned to have a MySQL-compatible version of ecto_job?

Some projects (unfortunately) rely on MySQL. It would be useful for ecto_job to be compatible with MySQL, even with limited features (a poll-based job producer).

Base expiry constant is not clear

Hey. So I've been reading through your library to determine if it's appropriate to use in a production codebase that needs to handle fairly modest load but with strong safety guarantees. There's one piece of behaviour that I didn't understand exactly, which is the 300s base expiry value: where it comes from, whether it makes sense to have it configurable, and how it interacts with the rest of the processing lifecycle.

I understand that this 300s for the RESERVED expiry is the maximum time between when the job is dispatched from the Producer and when the Worker must begin processing. For this use case it seems like an arbitrary but reasonable value, if not a bit high, though it's not obvious what the failure conditions would be in GenStage and when it would make sense to retry earlier.

For the IN_PROGRESS expiry during attempt 1, which would be 300s, is it correct that the job would only be retried on the next poll after the expiry has passed, assuming it had failed?

The reason I ask is that 300s seems too long for a first retry attempt. As far as I can understand, the real retry timeout is whichever is smaller between the expiry and the poll interval. Does it make sense to you @mbuhot to make this configurable as well? Can you think of any unintended consequences of setting it to, say, 10s, other than burning through the retry attempts more quickly?

EctoJob just suddenly stops.

Hi,

We are experiencing that EctoJob just stops working.

Our setup:

  • We have a number of Elixir nodes (usually 3) set up to connect to one PostgreSQL database.

We keep experiencing that EctoJob just stops working, and we have to restart our Elixir nodes to make things work again.

When EctoJob stops working, we can observe unprocessed records (around 15-30) in the AVAILABLE state in the job_queue table.

Please recommend how to investigate this issue and prevent it from happening again.

DBConnection error when running test with :manual mode for sandbox

I have an application using EctoJob and it seems I'm getting errors when I'm using the manual sandbox mode.

This is very strange because I have the same setup in another app and it doesn't happen there. I can't seem to find what's wrong.

12:02:12.163 [error] GenServer #PID<0.427.0> terminating
** (DBConnection.OwnershipError) cannot find ownership process for #PID<0.426.0>.

See Ecto.Adapters.SQL.Sandbox docs for more information.
    (ecto_sql) lib/ecto/adapters/sql.ex:626: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql) lib/ecto/adapters/sql.ex:562: Ecto.Adapters.SQL.execute/5
    (ecto) lib/ecto/repo/queryable.ex:177: Ecto.Repo.Queryable.execute/4
    (ecto_job) lib/ecto_job/producer.ex:184: EctoJob.Producer.dispatch_jobs/2
    (gen_stage) lib/gen_stage.ex:2103: GenStage.noreply_callback/3
    (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:"$gen_producer", {#PID<0.427.0>, #Reference<0.3185423543.3099852806.167003>}, {:ask, 100}}

test_helper.exs

ExUnit.start()

Ecto.Adapters.SQL.Sandbox.mode(Dispatch.Repo, :manual)

If I comment out the last line, everything works.

Configure max_demand and polling_interval similarly

max_demand is passed as a param to the start_link function, while polling_interval is configured through the application environment.

Unify the two approaches into a Config struct that takes values passed to start_link and falls back to the application environment.

completed jobs

Hi,
Please, is there any configuration or explicit message to send that marks jobs as completed?
After running my jobs, the state remains "IN_PROGRESS".

Do you think a configurable cleaner job should be added to clean up completed jobs after X time?

Non-concurrent queue

Hello,

can a queue be processed one job after another?

The use-case is generating invoices, where the transaction would fail if two invoices get the same number, so the jobs need to run one after another.

As I understand it, I could merge all the single jobs (per customer) into one big job to avoid possible transaction failures, but before that I wanted to explore whether there is a way to simply run them in sequence.

I can imagine this might be useful to other use-cases too.

Breaks with Repos configured to use UUIDs by default

Problem Description

With an Ecto.Repo configured to use UUIDs for new migrations by default, like:

config :my_app, MyApp.Repo,
  ... other config ...
  migration_primary_key: [
    id: :uuid,
    type: :binary_id,
    autogenerate: true
  ],
... other config ...

the jobs table is created with a primary-key id column of type UUID, as expected, but the application code expects an integer, preventing jobs from being created even when explicitly given a UUID.

Expected Behavior

Libraries should "play nice" with existing Ecto.Migration and Ecto.Repo configuration options, and should honor the configuration of the host application's Repo with respect to id column type and autogeneration.

Workaround

It isn't possible to use a UUID at all, so the only workaround available is to create the table with a primary key of type integer. The most straightforward way to do this is to simply comment out the relevant configuration before running the migration that creates the job queue table.

Potential Solutions

Negative Demand Errors

Hi there.

We've got EctoJob running on a reasonably busy queue and have recently started getting error reports such as these:

ERROR 2201W (invalid_row_count_in_limit_clause) LIMIT must not be negative

We were able to trace this back to a negative demand in the Producer state, for example:

%EctoJob.Producer.State{clock: &DateTime.utc_now/0, demand: -31, execution_timeout: 300000, notifications_listen_timeout: 5000, notifier: #PID<0.2601.0>, poll_interval: 60000, repo: Repo, reservation_timeout: 60000, schema: JobQueue}

After a bit of digging and debugging, we narrowed the error down to this part of the code: https://github.com/mbuhot/ecto_job/blob/master/lib/ecto_job/producer.ex#L184-L185

The count returned by JobQueue.reserve_available_jobs is greater than the demand, which results in the demand being updated to a negative number in the Producer's state. Then the following handle_demand fails.

We dug into this a little more because, looking at the JobQueue.reserve_available_jobs query, it seems this shouldn't be happening, since the limit is applied. Then we found this. Basically, there's an issue with using limit in a subquery of an update, which can result in the limit being ignored if the query planner chooses a Nested Loop in the execution plan.

The workaround for this is to use a CTE; however, Ecto doesn't support that. There is a PR on Ecto to add support for CTEs, after which you'd be able to apply the workaround (I hope). I couldn't find any other way to get Ecto to use a CTE.

Add to documentation used connections

Our application was using more connections than we had configured for our Repo. We later found out that every job queue uses an exclusive connection, which explained the unexpected number of connections.

The documentation should explain this use of connections somewhere, so that users can understand the behaviour of their applications.

Use clock from database rather than application?

Hi there, thanks for making EctoJob! I've been evaluating it recently to possibly use it at our company.

One of the things I noticed while skimming the source is the number of places where you're generating the current time in application code. This jumped out at me given the number of times I've seen issues caused by distributed application servers whose system clocks are not perfectly in sync, timezone misconfigurations, etc.

Given that Postgres has a now() function to fetch the current time from the start of the in-progress transaction, and given that this function wouldn't be susceptible to skewed clocks, I wondered whether you've given any thought to using the time from the database rather than always calling DateTime.utc_now(). It wouldn't necessarily protect you from issues arising from scheduled jobs, but it might eliminate whole classes of other potential problems.

perform later?

We started to use ecto_job on the project I am working on, but I cannot find out whether it's possible to schedule a job to run later (e.g. 30 minutes after it's scheduled, and not before). I'm very used to this feature from other job processing libraries such as Sidekiq. Is it supported? If not, are you planning on adding something like this?

Multi-node postgrex listener resulting in more queries than necessary.

We've been noticing some increased CPU usage that seems to be due to an abundance of calls to the database for available jobs.

We have in the neighborhood of hundreds of jobs inserted per minute. We have max_demand set to 100 and we currently have 2 nodes churning away. The workers seem to keep up with the queue, so we're almost always buffering demand.

The way I understand the issue, each insert into the table results in a call to dispatch_jobs on both nodes, so they are competing for these individual jobs. Also, due to the number of rows we're inserting and how the workers are almost always ahead of what's being queued, we end up doing this A LOT.

I think the desired behavior would be to let the jobs queue up a little more before dispatching. I'm wondering if I should just kill the pg_notify stuff in my implementation and rely on polling. Any better ideas?
