


Temporal

Temporal is a durable execution platform that enables developers to build scalable applications without sacrificing productivity or reliability. The Temporal server executes units of application logic called Workflows in a resilient manner that automatically handles intermittent failures, and retries failed operations.

Temporal is a mature technology that originated as a fork of Uber's Cadence. It is developed by Temporal Technologies, a startup founded by the creators of Cadence.



Getting Started

Download and Start Temporal Server Locally

Execute the following commands to install the Temporal CLI and start a local development server along with all of its dependencies.

brew install temporal
temporal server start-dev

Refer to Temporal CLI documentation for more installation options.

Run the Samples

Clone or download samples for Go or Java and run them with the local Temporal server. We have a number of HelloWorld type scenarios available, as well as more advanced ones. Note that the sets of samples are currently different between Go and Java.

Use CLI

Use Temporal CLI to interact with the running Temporal server.

temporal operator namespace list
temporal workflow list

Use Temporal Web UI

Try the Temporal Web UI by opening http://localhost:8233 to view your sample Workflows executing on Temporal.

Repository

This repository contains the source code of the Temporal server. To implement Workflows, Activities and Workers, use one of the supported languages.

Contributing

We'd love your help in making Temporal great. Please review the internal architecture docs.

See CONTRIBUTING.md for how to build and run the server locally, run tests, etc.

If you'd like to work on or propose a new feature, first peruse feature requests and our proposals repo to discover existing active and accepted proposals.

Feel free to join the Temporal community forum or Slack to start a discussion or check if a feature has already been discussed. Once you're sure the proposal is not covered elsewhere, please follow our proposal instructions or submit a feature request.

License

MIT License


temporal's Issues

Allow signalling external workflow with no input

There is no reason to require input when a workflow signals an external workflow, since omitting input is already allowed when signalling from outside:

cause BAD_SIGNAL_WORKFLOW_EXECUTION_ATTRIBUTES
details "BadRequestError{Message: Input is not set on decision.}"

Refactor Makefile

After the gRPC migration is done, remove all Thrift targets and reorganize the remaining ones.

Add a golint install target using go get -u golang.org/x/lint/golint. If golint is not installed, the output contains a spam of command not found messages.

Remove mocks from repo

And generate them on the fly, the same way it currently works for gRPC mocks (see the sketch after the notes below).
Note:

  1. Many mocks are not used and can be safely removed.
  2. It is better to place mocks in a separate package (a mocks sub-dir, for example). Reason: mockgen builds the entire package to generate mocks, and if old incompatible mocks are left in place, it simply can't build the package to regenerate them. The old mocks could be deleted first, but that can leave the repo with no mocks at all if the build then fails because of some code errors.
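A minimal sketch of how generate-on-the-fly could look, assuming mockgen and a mocks sub-directory; the package, file, and interface names below are illustrative, not the actual ones in the repo:

// Illustrative only: regenerate mocks into a mocks sub-directory on demand
// instead of committing them. Run with `go generate ./...`.
package persistence

//go:generate mockgen -source=shard_manager.go -destination=mocks/shard_manager_mock.go -package=mocks

// ShardManager is an example interface whose mock the directive above would generate.
type ShardManager interface {
    GetName() string
}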

Consider migration from gogo to vtprotobuf

Currently we use https://github.com/golang/protobuf, which is the stable release. There is another repo under active development, https://github.com/protocolbuffers/protobuf-go, which is positioned as the "Next Generation Go Protocol Buffers". I found a number of improvements there, but I don't want to adopt it right away because it states:

WARNING: This repository is in active development. There are no guarantees about API stability. Breaking changes will occur until a stable release is made and announced.

Split GetWorkflowExecutionHistory to two RPCs

workflowhandler.GetWorkflowExecutionHistory has a WaitForNewEvent param which essentially turns it into a long-poll API. This RPC needs to be split into two RPCs (long poll and normal), and the client should set appropriate timeouts for each.

Also, QueryWorkflow needs a longer timeout.

Add check that prohibits cross domain calls for global domains

Cross-domain calls are supported for local domains, which the majority of users are using.
Cross-domain calls for global domains are currently broken, as domains can be active in different DCs.
This issue is to add validation that fails any attempt to make a cross-domain call for global domains. The validation will go away when cross-domain calls for global domains are fully supported.

Group and categorize tests

We currently have integration tests inside host, and all other tests are considered to be unit tests. But many of them actually require at least a database. We need to categorize all tests:

  1. Pure unit tests. Don't require any external dependencies and can be run quickly to verify basic things.
  2. Integration tests, Suite 0. Require a database and use mocks for other dependencies.
  3. Integration tests, Suite 1. Don't use mocks, but use the onebox implementation of the server (which needs to be refactored to share as much of the production code path as possible).
  4. End-to-end tests. Start a real temporal-server and make requests to it through the API and client.

Every test category should have a target in the Makefile and instructions for how to run it locally; a sketch of one way to tag categories follows.
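A minimal sketch of one way to wire these categories in, assuming Go build tags per category; the tag and package names are illustrative, not the project's current convention:

//go:build integration
// +build integration

// This file only compiles when the "integration" tag is passed, so a Makefile
// target can select a category with: go test -tags integration ./...
package host

import "testing"

func TestRequiresDatabase(t *testing.T) {
    t.Log("Suite 0 style test: talks to a real database, mocks everything else")
}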

Use a code-generator for boilerplate/'generic' code

In many parts of the codebase we have a lot of boilerplate, 'generic' code that is hand-maintained. Some offending examples:

persistenceRateLimitedClients - https://github.com/temporalio/temporal/blob/9c9ae82affc728f0e3466bb00e70bc4ddae1b050/common/persistence/persistenceRateLimitedClients.go

persistedMetricsClient - https://github.com/temporalio/temporal/blob/9c9ae82affc728f0e3466bb00e70bc4ddae1b050/common/persistence/persistenceMetricClients.go

resourceImpl - https://github.com/temporalio/temporal/blob/9c9ae82affc728f0e3466bb00e70bc4ddae1b050/common/resource/resourceImpl.go

Using a code generator in some of these places would be very beneficial to future development speed and safety by keeping changes in one place.
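For context, the hand-maintained pattern that such a generator would emit looks roughly like this; a sketch only, where the ShardStore interface and its method are illustrative rather than the actual persistence interfaces:

package persistence

import (
    "errors"

    "golang.org/x/time/rate"
)

// ShardStore stands in for one of the wide persistence interfaces.
type ShardStore interface {
    GetShardOwner(shardID int) (string, error)
}

// rateLimitedShardStore is the kind of wrapper that is currently written by
// hand for every method of every client; a code generator could emit it from
// the interface definition instead.
type rateLimitedShardStore struct {
    limiter *rate.Limiter
    next    ShardStore
}

func (s *rateLimitedShardStore) GetShardOwner(shardID int) (string, error) {
    if !s.limiter.Allow() {
        return "", errors.New("persistence request rate limit exceeded")
    }
    return s.next.GetShardOwner(shardID)
}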

Use gRPC load balancer for frontend

YARPC uses the concept of dispatchers, which are created via the DispatcherProvider interface with a single implementation, dnsDispatcherProvider. With the migration to gRPC, the call to DispatcherProvider was replaced with a simple host:port. gRPC has support for round-robin and custom load balancers as well as service configs.
These approaches need to be investigated, and Temporal should provide a flexible, pluggable mechanism to support different load balancers for the frontend.

https://medium.com/@ammar.daniel/grpc-client-side-load-balancing-in-go-cd2378b69242
https://itnext.io/on-grpc-load-balancing-683257c5b7b3
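For reference, gRPC-Go already supports client-side round robin through the service config; a minimal sketch, with a placeholder frontend address:

package main

import "google.golang.org/grpc"

func dialFrontend() (*grpc.ClientConn, error) {
    // The dns resolver returns every frontend address; round_robin spreads
    // calls across them. A custom load balancer could be plugged in similarly.
    return grpc.Dial(
        "dns:///temporal-frontend.example.com:7233",
        grpc.WithInsecure(),
        grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
    )
}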

Persisting Ringpop Membership & Custom Host Identity

Summary

  • Remove Ringpop seed membership via config and instead use the persistence layer to coordinate.
  • Support a custom host identity in order to support binding on 0.0.0.0 (see Samar's write-up on issue uber/cadence#2942).
  • Support a separate broadcast address vs bind address for NAT'd networking setups. See the Cassandra setting write-up here.

Scope of changes - Configuration Layer

  • Deprecate ringpop.bootstrapMode, ringpop.bootstrapHosts, ringpop.bootstrapFile
    • Bootstrapping from persistence will be the only way moving forward
  • Add rpc.BroadcastAddress to support a separate address that will be connected to for ringpop and rpc.
  • Add ringpop.CustomHostIdentity to support a unique custom host identity for the cluster and in ringpop.

Scope of changes - Persistence Layer

  • General Schema:
    host_id        string,
    rpc_address    inet,
    session_start  timestamp,
    last_heartbeat timestamp,
    record_expiry  timestamp,
    PRIMARY KEY (host_id)
  • host_id will be a unique identifier for the host to prevent collisions. The value used is fetched from ringpop.CustomHostIdentity if set, otherwise rpc.BroadcastAddress if set, otherwise from the listening address of ringpop.
  • rpc_address will be the address to connect to for rpc_operations on this host. The value used will be rpc.BroadcastAddress if set otherwise the listening address of ringpop.
  • session_start will be the timestamp of the service startup, i.e. tied to the process lifetime
  • last_heartbeat will be the timestamp of the last time this row was updated and considered fresh
  • TTLs will be supported by using the record_expiry field in postgres/mysql and using a db-level trigger created by the schema to perform limited deletions of expired rows on new inserts. This means these tables in these implementations would be lazily cleaned up.
    • The record_expiry field will not exist in the Cassandra implementation as the TTL option will be used on insert.
  • ClusterMetadata Store/Manager interfaces
    • Add GetActiveClusterMembers()
      • Request - LastHeartbeatWithin time.Duration
      • Response - List of Active Cluster Members that have heartbeated since duration
    • Add UpsertClusterMembership()
      • Request is represented by schema above with added recordExpiresOn field for TTL creation.

Scope of changes - Ringpop Layer
To bootstrap itself from the persistence layer, the service will:

  1. First, write its identity and metadata to the cluster_membership table using UpsertClusterMembership()
  2. Use GetActiveClusterMembers() to get back the currently healthy members, e.g. those seen within the last 10 minutes.
  3. Bootstrap ringpop using those members.
  4. Create a goroutine to continue writing its identity to the cluster_membership table on some interval (e.g. every minute). A sketch of this flow is shown after this list.
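A sketch of that flow, assuming a manager shaped like the interfaces above; the Go names and durations are illustrative, not the actual repo code:

package membership

import "time"

// clusterMembershipManager is an illustrative stand-in for the proposed
// ClusterMetadata store/manager interfaces.
type clusterMembershipManager interface {
    UpsertClusterMembership(hostID, rpcAddress string, recordExpiry time.Duration) error
    GetActiveClusterMembers(lastHeartbeatWithin time.Duration) ([]string, error)
}

func bootstrapAndHeartbeat(mgr clusterMembershipManager, hostID, rpcAddress string) ([]string, error) {
    // 1. Write our own identity first so other nodes can discover us.
    if err := mgr.UpsertClusterMembership(hostID, rpcAddress, 48*time.Hour); err != nil {
        return nil, err
    }
    // 2. Read back members that have heartbeated recently (e.g. last 10 minutes).
    members, err := mgr.GetActiveClusterMembers(10 * time.Minute)
    if err != nil {
        return nil, err
    }
    // 4. Keep refreshing our row on an interval (e.g. every minute) so other
    //    nodes continue to see this host as healthy.
    go func() {
        for range time.Tick(time.Minute) {
            _ = mgr.UpsertClusterMembership(hostID, rpcAddress, 48*time.Hour)
        }
    }()
    // 3. The returned members are then used to bootstrap ringpop.
    return members, nil
}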

Questions/Notes:

  • What happens if the goroutine crashes?
    • The inclination is to continue, with metrics/errors in the log, and allow the service to keep performing work. Its membership will still be propagated by the other healthy cluster members it had already connected to. If there is a bug, however, this could lead to a zombie cluster where new nodes would not find any other nodes in the table and would form their own cluster.

Different retry options based on failure type

Currently both workflow and activity retry options do not differentiate between failure types, beyond allowing a configured list of failures to not be retried.

The proposal is to allow specifying different retry options for different failure types. For example, intermittent errors can be retried immediately, while errors that are not intermittent (like an NPE) and require human intervention can be retried with much longer intervals.

Docker: Docker image with schema loaded

The initial execution of docker-compose up takes a long time, mostly because it has to create the DB schema. Consider shipping a docker container with the schema already populated to speed up the startup.

Any other ideas on how to speed up docker-compose up are welcome.

Add header to each payload

Currently payloads do not have any associated metadata, which is needed by various tools. For example, a payload can be a serialized, zipped, and encrypted protobuf, so the metadata would need to specify the compression and encryption algorithms as well as the proto file name and version.

Clear indication that service is up and running

docker-compose up and other ways to start a service should produce a clear message in the log indicating that the service is up and running. Something like:

 _____                                    _  
|_   _|__ _ __ ___  _ __   ___  _ __ __ _| | 
  | |/ _ \ '_ ` _ \| '_ \ / _ \| '__/ _` | | 
  | |  __/ | | | | | |_) | (_) | | | (_| | | 
  |_|\___|_| |_| |_| .__/ \___/|_|  \__,_|_| 
                   |_|                       
 ____  _             _           _ 
/ ___|| |_ __ _ _ __| |_ ___  __| |
\___ \| __/ _` | '__| __/ _ \/ _` |
 ___) | || (_| | |  | ||  __/ (_| |
|____/ \__\__,_|_|   \__\___|\__,_|

Make temporal docker not emit noisy logs

Currently temporal docker emits a lot of logs, with 99.9% of them being useless. It also has no clear indication of when the service is ready to serve requests. My proposal is to emit something like:

Temporal Service Starting...
Temporal Service Started

and then emit only fatal errors. For troubleshooting, provide an option to enable verbose logs.

Investigate why upgrading gocql breaks paging

This commit reverted an upgrade to gocql v0.0.0-20200203083758-81b8263d9fe5 back to v0.0.0-20171220143535-56a164ee9f31, which fixed an issue with paging behavior seen after the upgrade. Here is the buildkite before and after the revert.

We're unfortunately using quite an old version of gocql, so I need to investigate further what the behavioral difference is and whether there are any release notes or published changes in this area over the past 3 years.

Fix the values of attempt field

ActivityTaskStarted has an attempt field. It is usually set to 0 on the first activity execution and then to 1 or more in case of retries. I find a value of 0 for "attempt" very confusing. We should either start from 1 or rename attempt to retryCount.

Use a SQL Generator for SQL-type DBs

Any feature implementation requiring changes to the persistence layer currently needs multiple implementations, one for each of Cassandra, Postgres, and MySQL. As we continue to expand support for more databases, any change to the persistence layer will require ever-growing boilerplate changes, slowing down feature development with the added risk of database-specific bugs due to bespoke implementations per DB.

I've been looking into ways to reduce the time spent here and have been investigating SQL query builders such as:

These would give us object interfaces similar to C# LINQ / Java streams for SQL operations, with some of the libraries providing full compile-time type safety. There is no support for Cassandra, but this could collapse our MySQL and Postgres implementations. I would highly recommend we spend more time looking into this before implementing another SQL database type.
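As one illustration of what such a builder buys us (using github.com/Masterminds/squirrel purely as an example, not necessarily one of the candidates; the table and column names are placeholders), a single query definition can target both MySQL-style and Postgres-style placeholders:

package main

import (
    "fmt"

    sq "github.com/Masterminds/squirrel"
)

func main() {
    // One definition; switching PlaceholderFormat targets MySQL (?) or Postgres ($1).
    query := sq.Select("run_id", "workflow_id").
        From("executions").
        Where(sq.Eq{"shard_id": 1}).
        PlaceholderFormat(sq.Dollar)

    sql, args, err := query.ToSql()
    fmt.Println(sql, args, err)
    // Output (roughly): SELECT run_id, workflow_id FROM executions WHERE shard_id = $1 [1] <nil>
}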

Review JSON serialization

Native json.Marshal/Unmarshal should be used only for pure Go objects and only when necessary. Everywhere else it must be replaced with:

  1. proto objects and proto serialization similar to #198, if JSON is not required,
  2. JSONPBEncoder if JSON is required and object is a proto object or has proto object field (similar to #193).

Most cases are already fixed by the PRs above, but a few are left (run a global search for json.Marshal and json.Unmarshal to find all cases):

  1. archiver/*: tokens, archiver.ArchiveVisibilityRequest, archiver.historyIteratorState
  2. elasticsearch: esVisibilityPageToken, search attributes
  3. persistence: historyV2PagingToken, visibilityPageToken, timerTaskPageToken, getSearchAttributes
  4. mysql: clusterAckLevels
  5. history: toMutableStateJSON, recentBinaryChecksums

Multi-phase activities

A lot of real-life activities have multiple phases that require different timeouts. For example, a human task needs to be inserted into an external system with a pretty short timeout and then be picked up by a human within another timeout. A third timeout specifies how long the activity can be worked on after it was claimed.

The current solution to the above use case is to use an activity to insert the task into the external system, and then signal the workflow about each task state change. While this workaround works, it significantly complicates the workflow code, especially since the timeouts have to be enforced in the business logic through timers.

The proposal is to model an activity execution as a list of phases with each phase having its own timeout.

There are multiple ways to achieve this. Here are two options:

  1. Specify a list of (phase, timeout, retryPolicy) triples when scheduling an activity. Allow retrieving the current value of the triple when retrying an activity. Add an additional API to complete a phase, or augment CompleteActivityTask with additional fields so it can be used for phase completion. (A sketch of such phase descriptors follows this list.)

  2. Treat phases as an internal activity detail. In this case an activity can override the heartbeat timeout as part of a heartbeat. This way it can store whatever information it needs in the progress field and change the heartbeat timeout to conform to the current phase's requirements. For example, in the case of a human task the initial heartbeat timeout is going to be as small as needed to insert the task into the external system, then it is going to be extended to the maximum queue time, and so on.
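A rough illustration of option 1: the schedule request could carry a list of phase descriptors. The types below are hypothetical and not part of the Temporal API:

package main

import "time"

// ActivityPhase is a hypothetical descriptor for one phase of a multi-phase
// activity, per option 1 above.
type ActivityPhase struct {
    Name        string
    Timeout     time.Duration
    MaxAttempts int // stand-in for a full retry policy
}

// Example: a human task with three phases, each with its own timeout.
var humanTaskPhases = []ActivityPhase{
    {Name: "insert-into-external-system", Timeout: 30 * time.Second, MaxAttempts: 5},
    {Name: "wait-for-claim", Timeout: 72 * time.Hour, MaxAttempts: 1},
    {Name: "work-after-claim", Timeout: 8 * time.Hour, MaxAttempts: 1},
}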

retry activity get timeout error

With an activity retry policy, the user can set the activity to retry on all errors until an expiration timeout.
For example, keep retrying when the activity returns cadence.CustomError("please-retry", "some details useful to workflow").
The last retry attempt could be scheduled very close to the expiration deadline and is therefore likely to hit a StartToClose timeout. In this case the workflow would get a timeout error and would lose the last error with the details that could be useful to the workflow.
It would be much more useful to return the last application error instead of the timeout error.

Things that might be helpful:

  • Do not schedule a retry attempt if the next attempt would be very close to the retry expiration. Essentially, if next_retry_time + StartToClose > expiration_time, then do not retry (see the sketch after this list).
    Or
  • Store the last application error and return that application error when a timeout occurs on the last attempt.
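The first option is a simple check at retry-scheduling time; a sketch with illustrative names:

package history

import "time"

// shouldScheduleRetry implements the first option above: skip the retry if the
// next attempt could not finish its StartToClose window before the retry
// expiration anyway, so the caller can surface the last application error.
func shouldScheduleRetry(nextRetryTime time.Time, startToClose time.Duration, expirationTime time.Time) bool {
    return !nextRetryTime.Add(startToClose).After(expirationTime)
}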

NumHistoryShards should be persisted and enforced after being initially configured

NumHistoryShards can currently be changed, which should not be allowed.

1. How should the user configure this initially?

Current assumption is this will stay in config.

2. What is the proposed behavior?

There are three cases, assuming the config value is required to be set (a sketch of the check follows the list):

  1. Config set, not persisted (Initialization) - Proposed Behavior: Persist and continue normally
  2. Config set, same as persisted - Proposed Behavior: Continue normally
  3. Config set, differs from persisted value - Proposed Behavior: Fatal error with clear description of differing values
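A sketch of the proposed check; the store interface and method names are illustrative, not the actual Temporal interfaces:

package main

import "fmt"

// clusterMetadataStore is an illustrative stand-in for the persistence layer
// that would back the proposed CLUSTER_METADATA table.
type clusterMetadataStore interface {
    GetHistoryShardCount() (count int, found bool, err error)
    SaveHistoryShardCount(count int) error
}

// validateHistoryShardCount sketches the three cases described above.
func validateHistoryShardCount(configured int, store clusterMetadataStore) error {
    persisted, found, err := store.GetHistoryShardCount()
    if err != nil {
        return err
    }
    if !found {
        // Case 1: first start, persist the configured value and continue.
        return store.SaveHistoryShardCount(configured)
    }
    if persisted != configured {
        // Case 3: fatal error with a clear description of the differing values.
        return fmt.Errorf("NumHistoryShards mismatch: configured=%d, persisted=%d", configured, persisted)
    }
    // Case 2: values match, continue normally.
    return nil
}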

3. Where will this be persisted?

A new table named CLUSTER_METADATA with columns:

  1. Version - int - PK - Monotonically increasing
  2. HistoryShardCount - int

The table would only contain HistoryShardCount as a value column for now, but in the future could be extended with additional config values as columns.

An alternative schema could be to use an attribute store with a generic schema such as:

  1. Attribute - text
  2. Value - text or byte
  3. ItemVersion - int

where the PK would be a compound key of all columns, or a UUID.

Thoughts?

Add metadata header to all blob payload fields

Currently blob payload fields do not have any associated metadata. This works in monolithic applications, but might cause a lot of compatibility issues when Temporal is used as an asynchronous service mesh that connects multiple services. Another problem is that migrating applications from one encoding scheme to another is practically impossible while workflows are running.
The strawman solution is to change all the binary blobs to Payload (the name is up for discussion), which includes a Header field.
This would allow specifying something like: this is an encrypted field with version ## of the certificate, gzipped, JSON encoded.

The API proposal (Header already exists in common.proto):

message Header {
    map<string, bytes> fields = 1;
}

message Payload {
    Header header = 1;
    bytes data = 2;
}

Then, for example, ActivityTaskCompletedEventAttributes changes from:

message ActivityTaskCompletedEventAttributes {
    bytes result = 1;
    int64 scheduledEventId = 2;
    int64 startedEventId = 3;
    string identity = 4;
}

to

message ActivityTaskCompletedEventAttributes {
    Payload result = 1;
    int64 scheduledEventId = 2;
    int64 startedEventId = 3;
    string identity = 4;
}

A possible alternative that potentially reduces the size of the Payload is to use an enum instead of a string for metadata key names. But given that payload sizes are measured in hundreds of kilobytes, I'm not sure it is worth the complexity.
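To make the header concrete, a producer could fill it roughly like this (a Go sketch mirroring the proto above; the metadata keys are examples, not a defined convention):

package main

// Payload mirrors the proposed proto message for illustration.
type Payload struct {
    Header map[string][]byte
    Data   []byte
}

// wrapPayload attaches example metadata describing how Data was produced, so a
// consumer in another language or service knows how to decode it.
func wrapPayload(data []byte) Payload {
    return Payload{
        Header: map[string][]byte{
            "encoding":    []byte("proto3/json"),
            "compression": []byte("gzip"),
            "encryption":  []byte("aes-gcm; cert-version=42"),
        },
        Data: data,
    }
}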

Set Explicit Default for all Timestamp Values in MySQL Schema

During the clusterMembership table schema creation, I found that MySQL has some hidden default-value behavior for TIMESTAMP columns that are declared NOT NULL in some cases. This caused the schema to work correctly on my local MySQL (version: 8.0.19 Homebrew) but fail on the server MySQL (version: 5.7).

From https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_explicit_defaults_for_timestamp:

If explicit_defaults_for_timestamp is disabled, the server enables the nonstandard behaviors and handles TIMESTAMP columns as follows:

TIMESTAMP columns not explicitly declared with the NULL attribute are automatically declared with the NOT NULL attribute. Assigning such a column a value of NULL is permitted and sets the column to the current timestamp.

The first TIMESTAMP column in a table, if not explicitly declared with the NULL attribute or an explicit DEFAULT or ON UPDATE attribute, is automatically declared with the DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP attributes.

TIMESTAMP columns following the first one, if not explicitly declared with the NULL attribute or an explicit DEFAULT attribute, are automatically declared as DEFAULT '0000-00-00 00:00:00' (the “zero” timestamp). For inserted rows that specify no explicit value for such a column, the column is assigned '0000-00-00 00:00:00' and no warning occurs.

Depending on whether strict SQL mode or the NO_ZERO_DATE SQL mode is enabled, a default value of '0000-00-00 00:00:00' may be invalid. Be aware that the TRADITIONAL SQL mode includes strict mode and NO_ZERO_DATE. See Section 5.1.10, “Server SQL Modes”.

EventId and Version default values

In PR #86 the default values for EventId and Version were changed to common.EmptyEventID and common.EmptyVersion (they were nil before that, and nil is not a valid value for proto structs).

Although it works for now, this approach is error prone. Every time a new object with these values is created, they need to be explicitly set to those defaults; otherwise they will be 0, which is treated as a regular value.

My proposal is to change the constant values to 0. Event Ids start with 1, so 0 works well for "unset". Version is different, and it might be that 0 is a valid version. We need to check the entire code base and logic to be sure that versions also start with 1.

Add cron activity

Currently a workflow can be started with a cron option. When specified, the workflow does not close on completion, but schedules the next run according to the schedule. This is very useful for modeling periodic jobs.

For a variety of use cases (like polling for a result) it would be nice to specify cron on an activity. This would re-execute the activity upon its completion. Another option is to support the ability to schedule the next invocation from the activity itself. Something like:
Activity.executeAgainIn(Duration.ofMinutes(10));

Decide right default for EmitMetric attribute of Domain Config

Currently the default for EmitMetric is 'false', which means every time you register a Domain you need to pass EmitMetric = true if metrics should be enabled for the domain.

Following are the options:

  1. Default value as "true": We could have the default value for this boolean as true. This is not a recommended practice with protobuf.
  2. Default value as "false": The user has to explicitly pass "true" to enable metrics. This might be a bad experience if the intention is to enable it for most domains.
  3. Rename the attribute to "DisableMetric": The default value of false would then enable metrics by default for domains.

CLI SIGSEGV

maxpro:temporal maxim$ ./tctl -do default workflow list -op
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x522dbf7]

goroutine 1 [running]:
github.com/temporalio/temporal/tools/cli.appendWorkflowExecutionsToTable(0xc0004e0780, 0xc000010028, 0x1, 0x1, 0x15fd4b0000000001)
	/Users/maxim/temporal/temporal/tools/cli/workflowCommands.go:1065 +0x707
github.com/temporalio/temporal/tools/cli.listWorkflow.func1(0x0, 0x0, 0x0, 0xc000172280, 0x574cd7c, 0x57493d7, 0x4)
	/Users/maxim/temporal/temporal/tools/cli/workflowCommands.go:1032 +0x288
github.com/temporalio/temporal/tools/cli.ListWorkflow(0xc0001ecc60)
	/Users/maxim/temporal/temporal/tools/cli/workflowCommands.go:571 +0x263
github.com/temporalio/temporal/tools/cli.newWorkflowCommands.func8(0xc0001ecc60)
	/Users/maxim/temporal/temporal/tools/cli/workflow.go:140 +0x2b
github.com/urfave/cli.HandleAction(0x54219a0, 0x57df450, 0xc0001ecc60, 0xc0000e6a00, 0x0)
	/Users/maxim/go/pkg/mod/github.com/urfave/[email protected]/app.go:492 +0x7c
github.com/urfave/cli.Command.Run(0x57494b7, 0x4, 0x0, 0x0, 0xc00019e780, 0x1, 0x1, 0x578938c, 0x27, 0x0, ...)
	/Users/maxim/go/pkg/mod/github.com/urfave/[email protected]/command.go:210 +0x991
github.com/urfave/cli.(*App).RunAsSubcommand(0xc000532000, 0xc0001ec580, 0x0, 0x0)
	/Users/maxim/go/pkg/mod/github.com/urfave/[email protected]/app.go:379 +0x7fa
github.com/urfave/cli.Command.startApp(0x574ec9a, 0x8, 0x0, 0x0, 0xc00019ed50, 0x1, 0x1, 0x5769af0, 0x19, 0x0, ...)
	/Users/maxim/go/pkg/mod/github.com/urfave/[email protected]/command.go:298 +0x817
github.com/urfave/cli.Command.Run(0x574ec9a, 0x8, 0x0, 0x0, 0xc00019ed50, 0x1, 0x1, 0x5769af0, 0x19, 0x0, ...)
	/Users/maxim/go/pkg/mod/github.com/urfave/[email protected]/command.go:98 +0x11dc
github.com/urfave/cli.(*App).Run(0xc000181a00, 0xc0000be060, 0x6, 0x6, 0x0, 0x0)
	/Users/maxim/go/pkg/mod/github.com/urfave/[email protected]/app.go:255 +0x6ab
main.main()
	/Users/maxim/temporal/temporal/cmd/tools/cli/main.go:33 +0x4b
maxpro:temporal maxim$

Handling private repos with DockerFile

The Dockerfile was updated to work with the temporal repo while it is private.
#14
However, the repo is eventually going to be public. This issue is to investigate a better way to support this in the Dockerfile (perhaps keeping multiple versions of the Dockerfile, or using the Makefile to provide a docker target which takes care of specifying the keys if needed).

Refactor Failure representation in the APIs

Currently, failures in all customer-facing proto structures (not the gRPC ones) are represented by two fields: reason and details. This kind of works when all parts of a workflow are implemented in a single language. For example, the Java SDK serializes the complete exception chain, including the stack trace of each exception, into the details field. The problem is that this serialization/deserialization code is completely Java specific. Now that a Python client has been added, it uses a different format for the stack trace. Go (in case of panic) also might require stack trace serialization/deserialization. Even without stack traces, chaining errors coming from SDKs in different languages is practically impossible without unifying the serialization format.

The proposal is that instead of unifying the serialization format, we model the chain of errors with attached stack traces through a generic protobuf structure. This way no language-specific serialization is necessary, and errors coming from different components can be chained even if they come from different languages.

The strawman API changes:

message ApplicationFailureInfo {
    string sdk = 1; // java, go, python, etc
    string message = 2;
    string type = 3; // Exception class,  Go error type, etc.
    bytes details = 4; // Possibly error payload beyond the message.
}

message ActivityTaskFailureInfo {
    int64 scheduledEventId = 1;
    int64 startedEventId = 2;
    string identity = 3;
}

message ChildWorkflowExecutionFailureInfo {
    string domain = 1;
    WorkflowExecution workflowExecution = 2;
    WorkflowType workflowType = 3;
    int64 initiatedEventId = 4;
    int64 startedEventId = 5;
}

message Failure {
    oneof failureInfo {
        ActivityTaskFailureInfo activityTaskFailureInfo = 1;
        ChildWorkflowExecutionFailureInfo childWorkflowExecutionFailureInfo = 2;
        ApplicationFailureInfo applicationFailureInfo = 3;
    }
    repeated string backtrace = 4;
}

message FailureChain {
    repeated Failure failures = 1; // The initial cause is the first in the list
}

An open question is whether providing a typed backtrace would bring any value. For example, it could be modeled as:

message StackTraceElement {
    string file = 1;
    int32 line = 2;
    string component = 3; // declaringClass for Java, module for Go
    string function = 4;
}

But it might require more fields for some languages; for example, the PHP stack trace is much richer:
https://www.php.net/manual/en/function.debug-backtrace.php

Support continue as new for cron workflows

Cron workflows use continue-as-new internally to schedule the next run. This doesn't play well with workflows that call continue-as-new explicitly.
The proposal is to allow calling continue-as-new explicitly and to only schedule the next run automatically, with a delay, when the workflow completes.

Workflow service decomposition

Currently the workflow service contains RPCs which are not related to workflows. My suggestion is to extract a replicationservice and a metadataservice.

  1. replicationservice:
rpc GetReplicationMessages (GetReplicationMessagesRequest) returns (GetReplicationMessagesResponse) {}
rpc GetDomainReplicationMessages (GetDomainReplicationMessagesRequest) returns (GetDomainReplicationMessagesResponse) {}
rpc ReapplyEvents (ReapplyEventsRequest) returns (ReapplyEventsResponse) {}
  2. metadataservice:
rpc RegisterDomain (RegisterDomainRequest) returns (RegisterDomainResponse) {}
rpc DescribeDomain (DescribeDomainRequest) returns (DescribeDomainResponse) {}
rpc ListDomains (ListDomainsRequest) returns (ListDomainsResponse) {}
rpc UpdateDomain (UpdateDomainRequest) returns (UpdateDomainResponse) {}
rpc DeprecateDomain (DeprecateDomainRequest) returns (DeprecateDomainResponse) {}
rpc GetSearchAttributes (GetSearchAttributesRequest) returns (GetSearchAttributesResponse) {}
rpc GetClusterInfo(GetClusterInfoRequest) returns (GetClusterInfoResponse){}

Maybe join it with healthservice.

  3. workflowservice:
    All the rest.

TestMatcherSuite/TestMustOfferRemoteMatch is flaky

=== RUN   TestMatcherSuite/TestMustOfferRemoteMatch
    TestMatcherSuite/TestMustOfferRemoteMatch: matcher_test.go:404: 
        	Error Trace:	matcher_test.go:404
        	Error:      	Expected value not to be nil.
        	Test:       	TestMatcherSuite/TestMustOfferRemoteMatch
    TestMatcherSuite/TestMustOfferRemoteMatch: matcher_test.go:406: 
        	Error Trace:	matcher_test.go:406
        	Error:      	Should be true
        	Test:       	TestMatcherSuite/TestMustOfferRemoteMatch
    TestMatcherSuite/TestMustOfferRemoteMatch: matcher_test.go:407: 
        	Error Trace:	matcher_test.go:407
        	Error:      	Should be true
        	Test:       	TestMatcherSuite/TestMustOfferRemoteMatch
    TestMatcherSuite/TestMustOfferRemoteMatch: matcher_test.go:408: 
        	Error Trace:	matcher_test.go:408
        	Error:      	Not equal: 
        	            	expected: "/__temporal_sys/tl0/1"
        	            	actual  : ""
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-/__temporal_sys/tl0/1
        	            	+
        	Test:       	TestMatcherSuite/TestMustOfferRemoteMatch
    TestMatcherSuite/TestMustOfferRemoteMatch: matcher_test.go:409: 
        	Error Trace:	matcher_test.go:409
        	Error:      	Not equal: 
        	            	expected: "tl0"
        	            	actual  : ""
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-tl0
        	            	+
        	Test:       	TestMatcherSuite/TestMustOfferRemoteMatch

Validate dynamic configuration initialization

Make sure dynamic config works correctly on local dev setups. Currently the logs get spammed with errors like the following:

{"level":"warn","ts":"2020-03-17T10:57:04.577-0700","msg":"Failed to fetch key from dynamic config","key":"matching.idleTasklistCheckInterval","error":"unable to find key","logging-call-at":"config.go:61"}

Failure reason is passed as string with hardcoded format

Currently the server and client build the failure reason using one of the standard prefixes:

const (
	errReasonPanic    = "cadenceInternal:Panic"
	errReasonGeneric  = "cadenceInternal:Generic"
	errReasonCanceled = "cadenceInternal:Canceled"
	errReasonTimeout  = "cadenceInternal:Timeout"
)

The timeout reason also uses the TimeoutType proto enum converted to a string, in a format like cadenceInternal:Timeout TimeoutTypeStartToClose. This string is passed to the client; the client parses the string and retries based on the value. Therefore this string is part of the contract, and it relies on the proto/thrift enum's string representation.

There should be a better approach! timeSequence.go is a good place to start from.

Task list with rate limiting per key

There are multiple use cases that require rate limiting per a dynamically created set of keys. For example, some downstream API has a rate limit per calling customer. This is possible to achieve using task list throttling, but since the set of keys is usually dynamic, managing client-side workers becomes a non-trivial task.

The proposed feature is to provide a special task list type that supports rate limiting per some "partition key". When an activity is scheduled, the key would be provided as part of the invocation. On the worker side no changes are needed (besides the mechanism to configure the task list itself). A per-key limiter sketch is shown below.
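A minimal sketch of the per-key limiting the server side could do, assuming golang.org/x/time/rate; the keyedLimiter type is illustrative and not part of the proposal's API:

package matching

import (
    "sync"

    "golang.org/x/time/rate"
)

// keyedLimiter lazily creates one rate limiter per partition key.
type keyedLimiter struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
    rps      rate.Limit
    burst    int
}

func newKeyedLimiter(rps rate.Limit, burst int) *keyedLimiter {
    return &keyedLimiter{limiters: make(map[string]*rate.Limiter), rps: rps, burst: burst}
}

// Allow reports whether a task for the given partition key may be dispatched now.
func (k *keyedLimiter) Allow(key string) bool {
    k.mu.Lock()
    l, ok := k.limiters[key]
    if !ok {
        l = rate.NewLimiter(k.rps, k.burst)
        k.limiters[key] = l
    }
    k.mu.Unlock()
    return l.Allow()
}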

Absolute Expiry/Creation Time should be provided by Add*Tasks Callers

Currently scheduledToStartTimeout is provided by History to the Matching service in AddDecisionTask/AddActivityTask calls. This means that Matching recomputes the timeout with time.Now().Add(schedToStartTimeout). In forwarding scenarios, multiple matching services could see the same task, but each one recomputes the expiry.

Additionally, Task.CreationTime fields are stamped by the Matching service when they should be stamped by History. CreationTime is used to set the AsyncMatchLatency metric value, which means it is incorrect, especially in forwarding scenarios.

Long term, absolute timestamps for both Expiry and Creation should be provided by the callers rather than relative times.

Repos rename proposal

  1. temporal -> temporal-server
  2. temporal-go-client -> temporal-go-sdk
  3. temporal-java-client -> temporal-java-sdk
