googlecloudplatform / cloud-run-release-manager
[EXPERIMENTAL] Automated canary rollout tool for Cloud Run services.
License: Apache License 2.0
If I run go run ./cmd/operator 1 2 3
this works fine, but it shouldn't. We should not silently accept unused positional arguments.
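A minimal sketch of one way to reject leftover arguments, assuming the operator parses its flags with the standard flag package:

    package main

    import (
        "flag"
        "log"
    )

    func main() {
        flag.Parse()
        // Anything left over after flag parsing is a positional argument the
        // operator does not use; fail loudly instead of ignoring it silently.
        if flag.NArg() > 0 {
            log.Fatalf("unexpected positional arguments: %v", flag.Args())
        }
    }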
As mentioned in this comment, we should determine the default value for the project if possible.
As of now, config.IsValid is not checking that the metrics criteria are valid, which lets errors propagate further down the process when the anomaly should be detected much earlier.
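A hypothetical sketch of the kind of up-front check IsValid could perform; the metric types and field names below are assumptions for illustration, not the project's actual configuration schema:

    package config

    // Metric is an illustrative metrics criterion.
    type Metric struct {
        Type      string  // e.g. "error-rate-percent", "latency-p99"
        Threshold float64 // maximum acceptable value
    }

    // isValidMetric rejects unknown metric types and out-of-range thresholds
    // early, instead of letting them fail later during the rollout process.
    func isValidMetric(m Metric) bool {
        switch m.Type {
        case "error-rate-percent":
            return m.Threshold >= 0 && m.Threshold <= 100
        case "latency-p50", "latency-p95", "latency-p99":
            return m.Threshold > 0
        default:
            return false
        }
    }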
The README file contains diagrams that could be larger if we remove the paddings in the diagrams
Could the steps at https://github.com/GoogleCloudPlatform/cloud-run-release-operator#setup- be performed after I clicked on an Run on Google Cloud button?
In #50, we're introducing a flag to check for metrics in the last N minutes, while we expect to execute each rollout process after an interval (-cli-run-interval) or periodically using HTTP requests. The issue with this approach is that if the health check offset is much larger than the interval between rollout processes, the candidate can become stable very quickly even if it is buggy. This is because the health checks of the previous rollout process will carry a very large weight when determining the candidate's health.
Example:
Interval between each rollout process: 5 minutes
Health check offset: 30 minutes
The health check at minute 35 might determine that the candidate is healthy even though it is buggy, because most of the metrics are based on the 25 minutes before the current step (traffic percentage).
To solve this, an alternative is to allow the user to define the minimum time between each step (increment in traffic percentage). So, if the metrics show a healthy candidate but not enough time has passed, we do not roll forward. However, if the metrics show an unhealthy candidate, we roll back regardless of the time between the current and last rollout processes; see the sketch below.
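A minimal sketch of that gating logic, assuming a diagnosis enum along the lines of the project's health package (all type and constant names here are illustrative assumptions):

    package rollout

    import "time"

    // Diagnosis and Action are illustrative stand-ins for the project's types.
    type Diagnosis int
    type Action int

    const (
        Unknown Diagnosis = iota
        Inconclusive
        Healthy
        Unhealthy
    )

    const (
        Hold Action = iota
        RollForward
        RollBack
    )

    // nextAction rolls back immediately on an unhealthy diagnosis, but only
    // rolls forward when the candidate is healthy AND the minimum time
    // between steps has elapsed.
    func nextAction(d Diagnosis, lastStep time.Time, minStepTime time.Duration) Action {
        switch d {
        case Unhealthy:
            return RollBack // no time gate on rollbacks
        case Healthy:
            if time.Since(lastStep) >= minStepTime {
                return RollForward
            }
            return Hold // healthy, but not enough time since the last step
        default:
            return Hold // inconclusive or unknown diagnosis
        }
    }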
In order to allow other functionality around the Release Manager, we need to provide a way to subscribe to events. This would enable things like smoke tests and notification features. The following format could be useful:
{
  "event": "rollout", // or "rollback"
  "candidateRevisionName": "hello-002-abc",
  "candidateRevisionPercent": 50,
  "candidateRevisionURL": "https://candidate---hello-wh32u62c4l-ue.a.run.app",
  "candidateWasPromotedToStable": false,
  "service": {json}
}
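For illustration, here is a hypothetical Go shape for this payload; the field names mirror the JSON sketch above, and the service payload is kept as raw JSON since its exact shape is unspecified:

    package events

    import "encoding/json"

    // Event is a hypothetical Go representation of the proposed payload.
    type Event struct {
        Event                        string          `json:"event"` // "rollout" or "rollback"
        CandidateRevisionName        string          `json:"candidateRevisionName"`
        CandidateRevisionPercent     int             `json:"candidateRevisionPercent"`
        CandidateRevisionURL         string          `json:"candidateRevisionURL"`
        CandidateWasPromotedToStable bool            `json:"candidateWasPromotedToStable"`
        Service                      json.RawMessage `json:"service"` // full service object, shape unspecified
    }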
Currently, if we want to roll back due to business issues and return traffic to a previous revision, the release manager performs the rollout again.
Let's create some diagrams that show the workings of some of the parameters (e.g. a timeline showing calls from Cloud Scheduler)
Let's add some common troubleshooting techniques that users can follow to debug the manager. Also, let's include the debug mode in the setup instructions.
we currently fail with
there were 1 errors: [error#0] rollout failed: failed to perform rollout: failed to replace service: could not update service "hello": googleapi: Error 403: Permission 'iam.serviceaccounts.actAs' denied on service account [email protected] (or it may not exist)., forbidden
This is happening because we have a service account named release-manager@[...]. This account has the Cloud Run Admin role, but that is not enough to be able to deploy apps that use other service accounts as their identity (e.g. in this case my hello service uses the Cloud Run default account named "Compute Default service account"). So I think we need to give release-manager@... a project-wide role, roles/iam.serviceAccountUser, which contains the iam.serviceaccounts.actAs permission. That way, it can deploy any Cloud Run app that uses any --service-account in the same project.
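For reference, granting that project-wide role would look something like the following; PROJECT_ID is a placeholder, and the service account address is assumed to match the one above:

    gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="serviceAccount:release-manager@PROJECT_ID.iam.gserviceaccount.com" \
      --role="roles/iam.serviceAccountUser"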
Like #81,
there were 1 errors: [error#0] rollout failed: failed to perform rollout: failed to diagnose health for candidate "hello-00006-mid": failed to collect metrics: failed to obtain metrics "error-rate-percent": failed to get error rate metrics: error when querying for time series: error when retrieving time series: googleapi: Error 403: Permission monitoring.timeSeries.list denied (or the resource may not exist)., forbidden
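Presumably the fix is analogous to #81: grant the release manager's service account a role that includes the monitoring.timeSeries.list permission, for example roles/monitoring.viewer (this suggestion is an inference from the error above, not a documented fix):

    gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="serviceAccount:release-manager@PROJECT_ID.iam.gserviceaccount.com" \
      --role="roles/monitoring.viewer"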
@gvso can you please spend more time validating the setup instructions and see if rollouts/rollbacks work with the load-generating app that you wrote?
As discussed here, it might be good to test metrics and thresholds set to 0
As of now, only healthy and unhealthy diagnoses are being shown in the health report annotation. We need to keep the user updated on all the other possibilities as well, such as the not enough time since last rollout and inconclusive diagnoses.
The WithValue constructor doesn't make much sense now, and we should probably just call it New. Also, let's add more fields to the struct in the tests to make them easier to follow.
This is related to #78 but simpler to do since we already have most of the basics. Let's add support for Knative running on Cloud Run for Anthos to cover all the Cloud Run cases
The Latency and ErrorRate implementations in the stackdriver package return errors when no request was made in the provided interval. This error should be handled differently and should instead result in an Inconclusive health diagnosis.
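A hypothetical sketch of how the caller could translate that condition; the sentinel error is an assumption, since the actual error value the stackdriver package returns is not shown here:

    package stackdriver

    import (
        "context"
        "errors"
    )

    // errNoRequests stands in for whatever error the package returns when
    // the interval contains no data points.
    var errNoRequests = errors.New("no requests in the provided interval")

    // queryOrInconclusive wraps a metric query so that "no data" yields an
    // inconclusive flag instead of an error that would abort the rollout.
    func queryOrInconclusive(ctx context.Context, query func(context.Context) (float64, error)) (value float64, inconclusive bool, err error) {
        v, err := query(ctx)
        if errors.Is(err, errNoRequests) {
            return 0, true, nil // no traffic yet: inconclusive, not a failure
        }
        return v, false, err
    }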
We need to decouple the project from Cloud Run fully managed and start thinking about how to make it compatible with Knative in general.
go run ./cmd/operator -min-requests 0 -min-wait 0 -cli-run-interval 10s
INFO[0000] -project not specified, trying to autodetect one
INFO[0000] project detected: ahmetb-demo
INFO[0000] starting server addr=":8080"
The rollout tests are missing test cases for when health.Diagnose returns Inconclusive and Unknown.
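For example, two hypothetical additions to the rollout test table (health.Inconclusive and health.Unknown are the diagnoses named in this issue; the table fields are assumptions about the existing tests' shape):

    cases := []struct {
        name      string
        diagnosis health.Diagnosis
        wantStep  bool // whether traffic should move
    }{
        {"inconclusive diagnosis holds traffic", health.Inconclusive, false},
        {"unknown diagnosis holds traffic", health.Unknown, false},
    }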
Since it is clear we are getting the logger from the context, we can switch to a shorter name for the function.
UpdateService is making calls to update the service's routing configuration, which obscures the update of the traffic config. Instead of returning run.Service in these functions, we can return []run.Traffic. Let's refactor the methods for this purpose and move all the traffic-related methods to a new file, traffic.go.
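A small compilable sketch of the proposed direction (the Traffic fields and helper name are assumptions): traffic helpers return only the traffic configuration, so the routing change is explicit at the call site instead of buried in a returned service object.

    package runapi

    // Traffic is an illustrative stand-in for the project's run.Traffic type.
    type Traffic struct {
        RevisionName string
        Percent      int64
    }

    // splitTraffic returns the new traffic configuration rather than a
    // whole service, making the routing change visible to callers.
    func splitTraffic(stable, candidate string, candidatePercent int64) []Traffic {
        return []Traffic{
            {RevisionName: stable, Percent: 100 - candidatePercent},
            {RevisionName: candidate, Percent: candidatePercent},
        }
    }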
Let's separate the collection of metrics values from the diagnosis of the revision's health as discussed here
The current name of this interface is confusing. We should give it a different, clearer name, such as Provider.
As of now, the health diagnosis completely ignores the request count option. In fact, request count is still not a valid metrics configuration. However, metrics.Provider already includes a RequestCount method that should be leveraged to return an inconclusive diagnosis if not enough requests were made.
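A minimal sketch of that gate; the Provider subset and its signature here are assumptions:

    package health

    import "context"

    // Provider is an illustrative subset of the project's metrics interface.
    type Provider interface {
        RequestCount(ctx context.Context, offsetMinutes int) (int64, error)
    }

    // enoughRequests reports whether the candidate received enough requests
    // for a conclusive diagnosis; callers should return an inconclusive
    // diagnosis when it is false.
    func enoughRequests(ctx context.Context, p Provider, offsetMinutes int, minRequests int64) (bool, error) {
        count, err := p.RequestCount(ctx, offsetMinutes)
        if err != nil {
            return false, err
        }
        return count >= minRequests, nil
    }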
In #38, we introduced rollbacks and integration with the health package. When obtaining the metrics, we passed an offset for how far in the past we want to look (e.g. 10 minutes).
The current implementation always looks back a constant amount of time (e.g. 10 minutes, 1 hour, etc.). This has the (potential) advantage of letting us measure the latency/error rate given an average load (e.g. 6000 requests in 10 minutes = 10 requests/second). However, it also has the disadvantage of potentially taking longer to completely roll out a new revision (or not rolling it out at all if an unreachable request count is set). For instance, if the minimum request count is 6000 and we check with a constant offset of 10 minutes, the new revision may always get fewer than 6000 requests in the last 10 minutes. That is, the candidate never gets more traffic and stays a candidate for a very long time. This is especially likely when the candidate gets small shares of the traffic (at the beginning of the rollout).
The alternative solution would be to add an annotation recording the last time the candidate's traffic was increased, so we can calculate an offset = time.Now() - lastTime. This would basically determine the health based on the requests accumulated since the last roll forward.
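A minimal sketch of computing that dynamic offset, assuming the timestamp is stored in a service annotation (the annotation key is hypothetical):

    package rollout

    import "time"

    // dynamicOffset returns how far back metrics should be queried: the
    // time elapsed since the candidate's traffic share was last increased.
    func dynamicOffset(annotations map[string]string) (time.Duration, error) {
        // "rollout/lastRollForward" is a hypothetical annotation key.
        lastTime, err := time.Parse(time.RFC3339, annotations["rollout/lastRollForward"])
        if err != nil {
            return 0, err
        }
        return time.Since(lastTime), nil
    }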
We need to add more logs on what is currently happening when calling these functions.
Missed this during the reviews.
Right now, when I run without -project in -verbosity=debug mode, I only see:
INFO[0000] -project not specified, trying to autodetect one
INFO[0000] project detected: ahmetb-demo
I'd much rather see:
INFO[0000] -project not specified, trying to autodetect one
DEBUG[000] trying gce metadata service for project ID
DEBUG[000] gce metadata service not detected
DEBUG[000] trying gcloud for core/project
DEBUG[000] found project id on gcloud
INFO[0000] project detected: ahmetb-demo
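A minimal sketch of what that autodetection with debug logging could look like, using the GCE metadata client and a gcloud fallback (the function itself and its place in the codebase are assumptions; the log lines mirror the desired output above):

    package main

    import (
        "os/exec"
        "strings"

        "cloud.google.com/go/compute/metadata"
        "github.com/sirupsen/logrus"
    )

    // detectProject tries the GCE metadata service first, then falls back
    // to gcloud's core/project setting, logging each step at debug level.
    func detectProject(logger *logrus.Logger) (string, error) {
        logger.Debug("trying gce metadata service for project ID")
        if metadata.OnGCE() {
            return metadata.ProjectID()
        }
        logger.Debug("gce metadata service not detected")

        logger.Debug("trying gcloud for core/project")
        out, err := exec.Command("gcloud", "config", "get-value", "core/project").Output()
        if err != nil {
            return "", err
        }
        logger.Debug("found project id on gcloud")
        return strings.TrimSpace(string(out)), nil
    }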
We need to give more context to the user about when the health report was generated as discussed here