uber / cadence

Cadence is a distributed, durable, and highly available orchestration engine for executing asynchronous, long-running business logic in a scalable and resilient way.

Home Page: https://cadenceworkflow.io

License: MIT License

Languages: Go 99.61%, Makefile 0.22%, Shell 0.14%, Dockerfile 0.03%
Topics: uber, cadence, workflows, orchestration-engine, workflow-automation, distributed-systems, service-bus, service-fabric, services-platform, golang

cadence's People

Contributors

3vilhamster, abhishekj720, agautam478, andrewjdawson2016, anish531213, bowenxia, davidporter-id-au, demirkayaender, groxx, jakobht, ketsiambaku, longquanzheng, mantas-sidlauskas, meiliang86, mkolodezny, neil-xie, samarabbas, sankari165, shaddoll, shreyassrivatsan, sivakku, taylanisikdemir, timl3136, vancexu, venkat1109, vytautas-karpavicius, wxing1292, yiminc, yux0, yycptt

cadence's Issues

Frontend should not serve requests before it's ready

The frontend service registers its Thrift handler and starts the TChannel RPC server before it is fully initialized. We have seen cases where requests reaching the frontend before it is properly initialized cause panics.
We need to handle this the same way the history and matching services do: block incoming requests on a wait group until service initialization is complete.
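A minimal sketch of the wait-group gating described above, assuming a simplified handler; the names (Handler, StartWorkflowExecution) are illustrative, not the actual Cadence frontend types:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Handler gates incoming RPCs until startup has finished, mirroring the
// wait-group approach already used by the history and matching services.
type Handler struct {
	startWG sync.WaitGroup
}

func NewHandler() *Handler {
	h := &Handler{}
	h.startWG.Add(1) // held until Start completes
	return h
}

// Start performs the (possibly slow) initialization, then unblocks callers.
func (h *Handler) Start() {
	time.Sleep(100 * time.Millisecond) // stand-in for real initialization
	h.startWG.Done()
}

// StartWorkflowExecution is an example RPC entry point: it waits for Start
// to finish before touching any state, so early requests cannot panic.
func (h *Handler) StartWorkflowExecution(request string) error {
	h.startWG.Wait()
	fmt.Println("handling", request)
	return nil
}

func main() {
	h := NewHandler()
	go h.Start()
	_ = h.StartWorkflowExecution("example-request")
}
```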

Create Lock Manager to serialize access to executions

Right now, every request gets a WorkflowExecutionContext from the cache and then acquires a lock on that object. It is possible in edge conditions that two requests end up with two different context objects (request 1 gets the context, the context gets evicted from the cache, then request 2 creates a new object). This will break the guarantee that only one write per execution originates from the history engine at a time.
We can fix this by having a central lock manager that grants locks on executions instead of locking the context object itself.
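A sketch of such a lock manager, keyed by execution (run) ID rather than by the cached context object; this is illustrative only and omits eviction of idle entries, which a real implementation would need:

```go
package main

import "sync"

// LockManager grants a lock per workflow execution ID, so serialization no
// longer depends on which cached context object a request happens to hold.
type LockManager struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func NewLockManager() *LockManager {
	return &LockManager{locks: map[string]*sync.Mutex{}}
}

// Lock returns after acquiring the lock for the given execution ID.
func (lm *LockManager) Lock(executionID string) {
	lm.mu.Lock()
	l, ok := lm.locks[executionID]
	if !ok {
		l = &sync.Mutex{}
		lm.locks[executionID] = l
	}
	lm.mu.Unlock()
	l.Lock()
}

// Unlock releases the lock for the given execution ID.
func (lm *LockManager) Unlock(executionID string) {
	lm.mu.Lock()
	l := lm.locks[executionID]
	lm.mu.Unlock()
	l.Unlock()
}

func main() {
	lm := NewLockManager()
	lm.Lock("run-1")
	// ... mutate execution state under the per-execution lock ...
	lm.Unlock("run-1")
}
```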

Matching Engine can lose decision tasks

By design, the matching engine can lose tasks even before it records in the execution history that they started. This is OK for activity tasks, since they always have timeouts.
On the other hand, there is no ScheduledToStart timeout for decision tasks (to avoid unnecessary timeouts in case the decider is down or not polling for tasks). If a decision task is lost, the workflow execution will get stuck forever.

History cache invalidation on Cassandra timeouts

We have an issue where, if we get a timeout error while updating the workflow mutable state, we cannot guarantee that we read the correct, latest state on reload. This is because the write could still be applied after the read executes.
This could lead to corrupting the Events table if we try to use a stale next_event_id value for subsequent writes.

Handling of history corruption

Hopefully, execution history should never get corrupted. If, for any reason (bugs?), we do get into such a state, we should not just return a retriable error to the callers.

Deletion of history events on workflow completion

We mark the workflow execution row in the executions table with a TTL on completion. This takes care of the workflow execution entry in the executions table, but we still leak space in the events table because we do not clean up the history associated with that execution.
We could use the timer queue processor for this purpose and queue up a timer task to delete the execution history after the retention period.
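An illustrative sketch of the timer-task approach, assuming a hypothetical DeleteHistoryEventTask type; the real timer queue processor and retention lookup are not shown:

```go
package main

import (
	"fmt"
	"time"
)

// DeleteHistoryEventTask is an illustrative timer task queued at workflow
// completion; when it fires, the timer queue processor would delete the
// execution's rows from the events table.
type DeleteHistoryEventTask struct {
	DomainID   string
	WorkflowID string
	RunID      string
	FireTime   time.Time
}

// newDeleteHistoryTask schedules deletion for after the domain's retention
// period has elapsed, measured from workflow close time.
func newDeleteHistoryTask(domainID, workflowID, runID string, closeTime time.Time, retention time.Duration) DeleteHistoryEventTask {
	return DeleteHistoryEventTask{
		DomainID:   domainID,
		WorkflowID: workflowID,
		RunID:      runID,
		FireTime:   closeTime.Add(retention),
	}
}

func main() {
	task := newDeleteHistoryTask("domain", "wf", "run", time.Now(), 7*24*time.Hour)
	fmt.Printf("history for %s/%s scheduled for deletion at %v\n", task.WorkflowID, task.RunID, task.FireTime)
}
```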

History Engine: Timer optimizations

Currently, timer tasks are created for every activity and decision task. We need to implement the logic to create a single timer per workflow execution and, when that timer fires, set the next earliest timer.
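A sketch of tracking only the earliest pending timeout per execution and scheduling the next one when it fires; the types here are hypothetical and stand in for the real timer queue processor logic:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// timerQueue keeps the pending timeouts for one workflow execution but only
// materializes a single persisted timer task: the earliest one.
type timerQueue struct {
	pending []time.Time
}

func (q *timerQueue) add(t time.Time) {
	q.pending = append(q.pending, t)
	sort.Slice(q.pending, func(i, j int) bool { return q.pending[i].Before(q.pending[j]) })
}

// nextToSchedule returns the single timer that should be created now.
func (q *timerQueue) nextToSchedule() (time.Time, bool) {
	if len(q.pending) == 0 {
		return time.Time{}, false
	}
	return q.pending[0], true
}

// fire pops the earliest timer and returns the one to schedule next, if any.
func (q *timerQueue) fire() (time.Time, bool) {
	if len(q.pending) == 0 {
		return time.Time{}, false
	}
	q.pending = q.pending[1:]
	return q.nextToSchedule()
}

func main() {
	q := &timerQueue{}
	now := time.Now()
	q.add(now.Add(30 * time.Second)) // e.g. an activity timeout
	q.add(now.Add(10 * time.Second)) // e.g. a decision timeout
	next, _ := q.nextToSchedule()
	fmt.Println("schedule single timer at", next)
	if after, ok := q.fire(); ok {
		fmt.Println("after it fires, schedule next at", after)
	}
}
```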

Matching Service should not serve requests before it's ready

The matching service registers its Thrift handler and starts the TChannel RPC server before it is fully initialized. We have seen cases where requests reaching the matching engine before it is properly initialized cause panics.
We need to handle this the same way the history service does: block incoming requests on a wait group until service initialization is complete.

History Service: Fix timer task creation on activity heartbeat

The history service seems to create a timeout task on each heartbeat. Instead, we should record the last heartbeat time in mutable state and only create a new timeout task when the current one expires, based on the last recorded heartbeat value.
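A sketch of the proposed behavior, with hypothetical fields on a trimmed-down activity record: the heartbeat time is recorded in mutable state, and a new timeout task is only requested when none is outstanding or when the current one fires without an actual timeout:

```go
package main

import (
	"fmt"
	"time"
)

// activityInfo is an illustrative slice of mutable state for one activity.
type activityInfo struct {
	LastHeartbeatTime time.Time
	HeartbeatTimeout  time.Duration
	TimerTaskPending  bool // a heartbeat timeout task is already scheduled
}

// recordHeartbeat updates mutable state; it only asks for a new timer task
// when none is outstanding, instead of creating one per heartbeat.
func recordHeartbeat(ai *activityInfo, now time.Time) (createTimerTask bool) {
	ai.LastHeartbeatTime = now
	if ai.TimerTaskPending {
		return false
	}
	ai.TimerTaskPending = true
	return true
}

// onHeartbeatTimerFired decides, from the last recorded heartbeat, whether
// the activity actually timed out or whether a fresh timer should be set.
func onHeartbeatTimerFired(ai *activityInfo, now time.Time) (timedOut bool) {
	ai.TimerTaskPending = false
	if now.Sub(ai.LastHeartbeatTime) >= ai.HeartbeatTimeout {
		return true
	}
	ai.TimerTaskPending = true // schedule the next timeout from the last heartbeat
	return false
}

func main() {
	ai := &activityInfo{HeartbeatTimeout: 30 * time.Second}
	fmt.Println("create timer on first heartbeat:", recordHeartbeat(ai, time.Now()))
	fmt.Println("create timer on second heartbeat:", recordHeartbeat(ai, time.Now()))
	fmt.Println("timed out when the timer fires:", onHeartbeatTimerFired(ai, time.Now()))
}
```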

History Service: Mutable state API cleanup

Currently, mutable state is only used for a small part of the API. This work item tracks extending it so that all API calls on the history service use it and keep it updated with all relevant information, such as the following (a possible shape is sketched after the list):

  1. ActivityInfos
  2. TimerInfos
  3. OutstandingDecision
  4. NextEventID
  5. ChildWorkflows
  6. Potentially any Signal ID if it makes sense for any API
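An illustrative shape the expanded mutable state could take, covering the items above; all type and field names are placeholders, not the actual Cadence persistence structs:

```go
package main

// Placeholder types; the real persistence structs carry much more detail.
type (
	activityInfo       struct{ ScheduleID int64 }
	timerInfo          struct{ TimerID string }
	decisionInfo       struct{ ScheduleID int64 }
	childExecutionInfo struct{ InitiatedID int64 }
)

// mutableState is an illustrative record tracking everything the history
// service needs to serve API calls without replaying full history.
type mutableState struct {
	ActivityInfos       map[int64]*activityInfo       // 1. keyed by scheduled event ID
	TimerInfos          map[string]*timerInfo         // 2. keyed by user timer ID
	OutstandingDecision *decisionInfo                 // 3. nil when no decision is in flight
	NextEventID         int64                         // 4. next event ID to assign
	ChildWorkflows      map[int64]*childExecutionInfo // 5. keyed by initiated event ID
	SignalIDs           map[string]struct{}           // 6. signal de-duplication, if needed
}

func main() {
	ms := mutableState{NextEventID: 1}
	_ = ms
}
```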

Design task to expose mutable state to client-side

Certain workflows are easier to write if mutable state is exposed directly to the client for making decisions, instead of the history. Workflows like cron would prefer this model, and it is much better optimized for such scenarios. Also, using mutable state for things like activity retries is much preferable to having the client implement the retry logic.

Basic Server Side Throttling

Cadence is a multi-tenant service, and we need to protect against a single bad user bringing the entire system down. This task is to implement basic throttling and quotas for each client.
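A minimal sketch of per-client throttling using a token bucket per domain (via golang.org/x/time/rate); the limits and error shown are illustrative defaults, not Cadence's actual quotas:

```go
package main

import (
	"errors"
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

var errServiceBusy = errors.New("service busy: request throttled")

// clientRateLimiter keeps one token bucket per calling domain (or client),
// so a single noisy tenant cannot exhaust the whole service.
type clientRateLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit
	burst    int
}

func newClientRateLimiter(rps float64, burst int) *clientRateLimiter {
	return &clientRateLimiter{
		limiters: map[string]*rate.Limiter{},
		rps:      rate.Limit(rps),
		burst:    burst,
	}
}

// Allow returns an error the handler can map to a throttling response.
func (c *clientRateLimiter) Allow(domain string) error {
	c.mu.Lock()
	l, ok := c.limiters[domain]
	if !ok {
		l = rate.NewLimiter(c.rps, c.burst)
		c.limiters[domain] = l
	}
	c.mu.Unlock()
	if !l.Allow() {
		return errServiceBusy
	}
	return nil
}

func main() {
	rl := newClientRateLimiter(1, 1)  // 1 request/second, burst of 1
	fmt.Println(rl.Allow("tenant-a")) // <nil>
	fmt.Println(rl.Allow("tenant-a")) // service busy
}
```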

Cadence Feature: Restart failed workflows

This feature is to support restarting workflows from a given point in the workflow execution history. Basically, you want to preserve the history of an execution up to a point and restart from that location. This is very useful when a workflow fails due to a bug at a certain point and you want to restart it after fixing the bug.

Matching Engine: Rate limit creation of new tasks for any TaskList

Every TaskList is mapped to a single Cassandra partition, so if all shards are writing to a single TaskList, it becomes the scalability bottleneck for the system. If sync matching is not happening, we end up writing all the tasks to Cassandra, and under heavy load Cassandra transactions start timing out. This behavior generates a very large number of duplicate tasks.
We need to put a rate limiter on each TaskList to prevent this situation. We should return a throttle error back to the client and have the client back off and retry failures. This should cause the system to degrade gracefully under extreme load.
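A sketch of the client-side behavior, assuming the server returns a distinguishable throttle error: back off exponentially and retry rather than immediately re-adding the task; the error type and backoff parameters are illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errThrottled = errors.New("task list throttled")

// addTaskWithBackoff is an illustrative client-side retry loop: when the
// matching service rejects a task because the TaskList's rate limit is
// exceeded, back off exponentially instead of hammering Cassandra.
func addTaskWithBackoff(addTask func() error, maxAttempts int) error {
	backoff := 50 * time.Millisecond
	for attempt := 1; ; attempt++ {
		err := addTask()
		if err == nil || !errors.Is(err, errThrottled) || attempt == maxAttempts {
			return err
		}
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff; a real client would cap this
	}
}

func main() {
	calls := 0
	err := addTaskWithBackoff(func() error {
		calls++
		if calls < 3 {
			return errThrottled // simulated throttle responses
		}
		return nil
	}, 5)
	fmt.Println("attempts:", calls, "err:", err)
}
```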

DeleteWorkflowExecution is not transactional with workflow completion update

When the decider responds with a complete-workflow decision, we first update the execution with the new events and then delete the workflow execution in a separate transaction. This can cause issues when the update call times out from the caller's perspective but the write is still applied: the follow-up delete never happens, so the workflow execution is never deleted.
We need to make sure that the execution is updated and deleted in the same transaction.
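A sketch of folding the two writes into one Cassandra logged batch with gocql, so they are applied together; the table and column names here are illustrative, not the real Cadence schema, and batch atomicity assumes both statements target the same partition:

```go
package main

import "github.com/gocql/gocql"

// completeAndDeleteExecution folds the completion update and the
// execution-row delete into a single Cassandra batch, so either both are
// applied or neither is.
func completeAndDeleteExecution(session *gocql.Session, shardID int, runID string, nextEventID int64) error {
	batch := session.NewBatch(gocql.LoggedBatch)
	batch.Query(
		`UPDATE executions SET next_event_id = ?, state = ? WHERE shard_id = ? AND run_id = ?`,
		nextEventID, "completed", shardID, runID)
	batch.Query(
		`DELETE FROM current_executions WHERE shard_id = ? AND run_id = ?`,
		shardID, runID)
	return session.ExecuteBatch(batch)
}

func main() {
	// Connecting to a cluster and invoking completeAndDeleteExecution is
	// omitted; this only illustrates the batching pattern.
}
```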

History client support for host redirect

We now have support for returning the correct host information when API calls to the history service fail with ShardOwnershipLostError.
The history client needs to inspect the ShardOwnershipLostError and retry the request against the host carried in the error.
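A sketch of the redirect-and-retry behavior, using a stand-in error type that carries the new owner's address; the real ShardOwnershipLostError and membership resolution are not shown:

```go
package main

import (
	"errors"
	"fmt"
)

// shardOwnershipLostError is a stand-in for the error the history service
// returns when a shard has moved; it carries the new owner's address.
type shardOwnershipLostError struct {
	Owner string
}

func (e *shardOwnershipLostError) Error() string {
	return "shard ownership lost, owner=" + e.Owner
}

// callWithRedirect is an illustrative client wrapper: on ownership loss it
// re-resolves the target host from the error and retries the call there.
func callWithRedirect(host string, call func(host string) error, maxRedirects int) error {
	for i := 0; i <= maxRedirects; i++ {
		err := call(host)
		var solErr *shardOwnershipLostError
		if !errors.As(err, &solErr) {
			return err
		}
		host = solErr.Owner // redirect to the host named in the error
	}
	return fmt.Errorf("too many ownership-lost redirects")
}

func main() {
	err := callWithRedirect("host-a:7934", func(host string) error {
		if host == "host-a:7934" {
			return &shardOwnershipLostError{Owner: "host-b:7934"}
		}
		fmt.Println("request served by", host)
		return nil
	}, 2)
	fmt.Println("final error:", err)
}
```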

Create ActivityTaskScheduleFailed event in history on bad decisions

If RespondDecisionTaskCompleted sends in a bad request or corrupted data, then we just silently ignore the activity schedule decisions. Instead, we need to record a relevant failure event, such as ActivityTaskScheduleFailed, and then also create a new DecisionTask for the decider. Here is an instance of the failure:
{"RunID":"c09c5b10-d240-4f8b-bc4c-5735c0bb3805","ScheduleID":212,"Service":"cadence-frontend","WorkflowID":"48018f57-0c39-4d4e-b055-e3df3fff7464","level":"error","msg":"RespondDecisionTaskCompleted. Error: BadRequestError({Message:Missing StartToCloseTimeoutSeconds in the activity scheduling parameters.})","time":"2017-03-07T13:56:56-08:00"}
