
[EXPERIMENTAL] Automated canary rollout tool for Cloud Run services.

License: Apache License 2.0



Cloud Run Release Manager

The Cloud Run Release Manager provides an automated way to gradually roll out new versions of your Cloud Run services. By using metrics, it automatically decides to slowly increase traffic to a new version or roll back to the previous one.

Disclaimer: This project is not an official Google product and is provided as-is.

You might encounter issues in production, since this project is currently in alpha.

How does it work?
- Examples
  - Scenario 1: Automated Rollouts
  - Scenario 2: Automated Rollbacks
Setup on GCP
Configuration
- Choosing services
- Rollout strategy
Try it out (locally)
Observability & Troubleshooting
- What's happening with my rollout?
- Release Manager logs

How does it work?

Cloud Run Release Manager is not a built-in Cloud Run feature. It’s an external tool deployed to your project. It periodically checks the Cloud Run services that have opted in to gradual rollouts, detects newly deployed revisions, monitors their metrics, and makes a rollout or rollback decision.

To opt services in to gradual rollouts, label them with rollout-strategy=gradual (the default label the Release Manager looks for). If a service has a newly deployed revision with 0% traffic, the Release Manager automatically assigns some initial traffic to the new revision (5% by default).

The Release Manager manages 2 revisions at a time: the last revision that reached 100% of the traffic (tagged as stable) and the newest deployment (tagged as candidate).

Depending on the candidate revision’s health and other configurable factors (such as served request count or time elapsed), this revision is either gradually rolled out to a higher percentage of traffic, or entirely rolled back.

Examples

Scenario 1: Automated Rollouts

I have version v1 of an application deployed to Cloud Run
I deploy a new version, v2, to Cloud Run with --no-traffic option (gets 0% of the traffic)
The new version is automatically detected and assigned 5% of the traffic
Every minute, metrics for v2 in the last 30 minutes are retrieved. Metrics show a "healthy" version and traffic to v2 is increased to 30% only after 30 minutes have passed since last update
Metrics show a "healthy" version again and traffic to v2 is increased to 50% only after 30 minutes have passed since last update
The process is repeated until the new version handles all the traffic and becomes stable
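
The progression in this scenario can be pictured as walking through an ordered list of traffic percentages. The following is a minimal sketch, not the Release Manager's actual code: the function name nextTraffic and the step list 5,30,50,80 are assumptions chosen to mirror the scenario (the final 80 is made up for illustration).

```go
package main

import "fmt"

// nextTraffic returns the next traffic percentage for the candidate.
// steps is assumed to be sorted in ascending order. Once the candidate
// has passed the last step, it receives 100% and becomes stable.
// (Hypothetical helper, not taken from the project.)
func nextTraffic(steps []int, current int) int {
	for _, s := range steps {
		if s > current {
			return s
		}
	}
	return 100
}

func main() {
	steps := []int{5, 30, 50, 80} // illustrative steps, not the defaults
	for current := 0; current < 100; {
		current = nextTraffic(steps, current)
		fmt.Printf("candidate traffic: %d%%\n", current)
	}
}
```

Each iteration of the loop corresponds to one successful health check after the minimum wait has passed.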

Scenario 2: Automated Rollbacks

I have version v1 of an application deployed to Cloud Run
I deploy a new version, v2, to Cloud Run with --no-traffic option (gets 0% of the traffic)
The new version is automatically detected and assigned 5% of the traffic
Every minute, metrics for v2 in the last 30 minutes are retrieved. Metrics show a "healthy" version and traffic to v2 is increased to 30% only after 30 minutes have passed since last update
Metrics for v2 are retrieved one more time and show an "unhealthy" version. Traffic to v2 is immediately dropped, and all traffic is redirected to v1

Setup on GCP

Cloud Run Release Manager is distributed as a service deployed to your GCP project, running on Cloud Run and invoked periodically by Cloud Scheduler.

To set up the Release Manager on Cloud Run, run the following steps on your shell:

Set your project ID in a variable:
```
PROJECT_ID=<your-project>
```

Create a new service account:

gcloud iam service-accounts create release-manager \
    --display-name "Cloud Run Release Manager"

Give it permissions to manage your services on the Cloud Run API:

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:release-manager@${PROJECT_ID}.iam.gserviceaccount.com \
    --role=roles/run.admin

Also, give it permissions to use other service accounts as its identity when updating Cloud Run services:

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:release-manager@${PROJECT_ID}.iam.gserviceaccount.com \
    --role=roles/iam.serviceAccountUser

Finally, give it access to metrics on your services:

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:release-manager@${PROJECT_ID}.iam.gserviceaccount.com \
    --role=roles/monitoring.viewer

Build and push the container image for the Release Manager to Google Container Registry:

git clone https://github.com/GoogleCloudPlatform/cloud-run-release-manager.git

gcloud builds submit ./cloud-run-release-manager \
    -t gcr.io/$PROJECT_ID/cloud-run-release-manager

Deploy the Release Manager as a Cloud Run service:
```
gcloud run deploy release-manager --quiet \
    --platform=managed \
    --region=us-central1 \
    --image=gcr.io/$PROJECT_ID/cloud-run-release-manager \
    --service-account=release-manager@${PROJECT_ID}.iam.gserviceaccount.com \
    --args=-verbosity=debug \
    --args=-healthcheck-offset=10m \
    --args=-min-requests=0 \
    --args=-max-error-rate=1 \
    --args=-min-wait=10m
```
In the command above, the configuration options provided using --args configure how the Cloud Run Release Manager increases or drops the traffic to a newly deployed revision:
- --args=-healthcheck-offset=10m: Look back 10 minutes to evaluate a new revision’s health while making a rollout or rollback decision.
- --args=-min-requests=0: Do not require a minimum number of requests arriving to the new revision while making a rollout decision.
- --args=-max-error-rate=1: Require new revision’s error rate to be ≤1% to roll out. If it is >1%, it will be rolled back.
- --args=-min-wait=10m: New revision should stay at least 10 minutes in its current percentage before it is rolled out further.
- --args=-verbosity=debug: Log more details from the tool (optional)
To edit these options later, you can redeploy using the command above, or go to Cloud Console.

Check out other configuration options available to more finely tune the Release Manager and its rollout strategy.

Find the URL of your Cloud Run service and store it in the URL variable:

URL=$(gcloud run services describe release-manager \
    --platform=managed --region=us-central1 \
    --format='value(status.url)')

Set up a Cloud Scheduler job to call the Release Manager (deployed on Cloud Run) every minute:

gcloud services enable cloudscheduler.googleapis.com

gcloud beta scheduler jobs create http cloud-run-release-manager --schedule "* * * * *" \
    --http-method=GET \
    --uri="${URL}/rollout" \
    --oidc-service-account-email=release-manager@${PROJECT_ID}.iam.gserviceaccount.com \
    --oidc-token-audience="${URL}/rollout"

At this point, you can start deploying services with the label rollout-strategy=gradual. Deploy new revisions with the --no-traffic option, and the Release Manager will gradually roll them out.

See the Troubleshooting guide to understand and observe the rollout status of your services.

Configuration

Currently, all the configuration options are specified through command-line arguments.

To customize these options, pass each one through the --args=... option when deploying this tool to Cloud Run (e.g. --args=-min-requests=0), instead of specifying it as a direct flag of the gcloud run deploy command.

Choosing services

Cloud Run Release Manager can manage the rollout of multiple services at the same time.

To opt a Cloud Run service in to automated rollouts and rollbacks, the service must have the configured label. By default, the Release Manager looks for services with the label rollout-strategy=gradual.

-project: Google Cloud project ID that has the Cloud Run services deployed
-regions: Regions where to look for opted-in services (default: all available Cloud Run regions)
-label: The label selector to match to the opted-in services (default: rollout-strategy=gradual)
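
To illustrate what matching a service against the -label value means, here is a small sketch. It assumes the selector is a single key=value pair, which is the only form shown in this document; the real selector syntax may accept more, and matchesSelector is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesSelector reports whether a service's labels satisfy a selector
// of the form key=value. Richer selector syntax is not handled here.
func matchesSelector(labels map[string]string, selector string) bool {
	key, want, ok := strings.Cut(selector, "=")
	if !ok {
		return false
	}
	got, present := labels[key]
	return present && got == want
}

func main() {
	selector := "rollout-strategy=gradual"
	optedIn := map[string]string{"rollout-strategy": "gradual", "team": "web"}
	optedOut := map[string]string{"team": "web"}

	fmt.Println(matchesSelector(optedIn, selector))
	fmt.Println(matchesSelector(optedOut, selector))
}
```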

Rollout strategy

The rollout strategy consists of the steps and health criteria.

⚠️ WARNING: Cloud Monitoring Metrics API has a reporting delay (up to 3 minutes). Therefore, the Release Manager will be subject to the same delay while querying the health of a new revision. Configuring the -healthcheck-offset or -min-wait for less than 5 minutes might result in misinterpreting a service’s health.

-healthcheck-offset: Time window to look back during health check to assess the candidate revision's health (default: 30m).
-min-requests: The minimum number of requests needed to determine the candidate's health (default: 100). This minimum value is expected in the time window determined by -healthcheck-offset
-min-wait: The minimum time before rolling out further (default: 30m)
-steps: Percentages of traffic the candidate should go through (default: 5,20,50,80)
-max-error-rate: Expected maximum rate (in percent) of server errors (default: 1)
-latency-p99: Expected maximum latency for 99th percentile of requests (in milliseconds), 0 to ignore (default: 0)
-latency-p95: Expected maximum latency for 95th percentile of requests (in milliseconds), 0 to ignore (default: 0)
-latency-p50: Expected maximum latency for 50th percentile of requests (in milliseconds), 0 to ignore (default: 0)
-cli-run-interval: The time between each health check (default: 60s). This is only needed if running with -cli.

The time arguments above follow Go time.Duration syntax (e.g. 30s, 10m, 1h30m).

Try it out (locally)

Note: This section applies only if you want to run Cloud Run Release Manager locally for troubleshooting, development and demo purposes.

Label the Cloud Run services (with label rollout-strategy=gradual) for them to be selected for gradual rollouts:
```
gcloud run services update <YOUR_SERVICE> \
  --labels rollout-strategy=gradual \
  --region us-east1
```
Clone this repository.

Make sure you have the Go compiler installed, then run:

go build -o cloud_run_release_manager ./cmd/operator

To start the program, run:

./cloud_run_release_manager -cli -verbosity=debug -project=<YOUR_PROJECT>

Once you run this command, it will check the health of Cloud Run services with the label rollout-strategy=gradual every minute by looking at the candidate's metrics for the past 30 minutes by default.

The health is determined using the metrics and configured health criteria. If metrics show a healthy candidate, traffic to the candidate revision is increased. But if metrics show an unhealthy candidate, a rollback is performed.

See the Troubleshooting guide to understand and observe the rollout status of your services.

Observability & Troubleshooting

What's happening with my rollout?

To check the status of your rollout, go to Cloud Run and click on your service.

Under the Revisions section, you can see how the traffic is currently split between your stable and candidate revisions.

For more detailed information, you can use the annotations automatically added by the Release Manager. To view the annotations, click on the YAML section:

Sample annotation:

rollout.cloud.run/stableRevision: hello-00040-opa
rollout.cloud.run/candidateRevision: hello-00039-boc
rollout.cloud.run/lastFailedCandidateRevision: hello-00032-doc
rollout.cloud.run/lastRollout: '2020-08-13T15:35:10-04:00'
rollout.cloud.run/lastHealthReport: |-
  status: healthy
  metrics:
  - request-count: 150 (needs 100)
  - error-rate-percent: 1.00 (needs 1.00)
  - request-latency[p99]: 503.23 (needs 750.00)
  lastUpdate: 2020-08-13T15:35:10-04:00

rollout.cloud.run/stableRevision is the name of the current stable revision
rollout.cloud.run/candidateRevision is the revision name of the current candidate
rollout.cloud.run/lastFailedCandidateRevision is the last revision that was considered a candidate but failed to meet the health criteria at some point of its rollout process
rollout.cloud.run/lastRollout contains the last time a rollout occurred (traffic to the candidate was increased)
rollout.cloud.run/lastHealthReport contains information on why a rollout or rollback occurred. It shows the results of the health assessment and the actual values for each of the metrics

Release Manager logs

Release Manager sends its logs to Cloud Logging. If there’s something preventing the tool from working properly, it will be logged. However, you can also use the logs to view a detailed history of the rollout or rollback decisions.

You can quickly find out if there are errors from Release Manager by using the Logs Viewer (click to run the query below).

resource.type = "cloud_run_revision"
resource.labels.service_name = "release-manager"
resource.labels.location = "us-central1"
severity >= ERROR

If you want to filter the errors for a specific service, you can include the service's name in the query:

jsonPayload.context.data.service = "<YOUR_SERVICE>"

You can also include a full list of the logs by changing the severity filter to severity >= DEBUG. You must have set the flag -verbosity=debug when deploying the Release Manager to have full logs about your rollouts.

This is not an official Google project. See LICENSE.


cloud-run-release-manager's Issues

-cli-run-interval can be specified without -cli

go run ./cmd/operator -min-requests 0 -min-wait 0 -cli-run-interval 10s
INFO[0000] -project not specified, trying to autodetect one
INFO[0000] project detected: ahmetb-demo
INFO[0000] starting server                               addr=":8080"

health: Return inconclusive diagnosis if request count is not met

As of now, the health diagnosis completely ignores the request count option. In fact, request count is still not a valid metrics configuration. However, metrics.Provider already includes a RequestCount that should be leveraged to return an inconclusive diagnosis if not enough requests were made.

Support Manual Rollback

Currently, if we want to roll back due to business issues and return traffic to a previous instance, the Release Manager performs the rollout again.

rollout: Add annotations for all health diagnosis

As of now, only healthy and unhealthy diagnoses are shown in the health report annotation. We need to keep the user updated on all other possibilities, such as not enough time since the last rollout and an inconclusive diagnosis.

Verbose logging for project detection

Missed this during the reviews.
Right now, when I run without -project in -verbosity=debug mode, I only see:

INFO[0000] -project not specified, trying to autodetect one
INFO[0000] project detected: ahmetb-demo

I'd much rather see:

INFO[0000] -project not specified, trying to autodetect one
DEBUG[000] trying gce metadata service for project ID
DEBUG[000] gce metadata service not detected
DEBUG[000] trying gcloud for core/project
DEBUG[000] found project id on gcloud
INFO[0000] project detected: ahmetb-demo

Publish to Pub/Sub after a rollout/rollback has occurred

In order to allow other functionality around the Release Manager, we need to provide a way to subscribe to events. This will allow things like smoke tests and notification features. The following format could be useful:

{
    "event": "rollout", // or "rollback"
    "candidateRevisionName": "hello-002-abc",
    "candidateRevisionPercent": 50,
    "candidateRevisionURL": "https://candidate---hello-wh32u62c4l-ue.a.run.app",
    "candidateWasPromotedToStable": false,
    "service": {json}
}

Add a minimum time between each step

In #50, we're introducing a flag to check for metrics in the last N minutes while we are expecting to execute every rollout process after an interval (-cli) or periodically using HTTP requests. The issue with this approach is that if the difference between the rollout interval and the health check offset is significant, the candidate can become stable very quickly even if buggy. This is because the metrics gathered before the current step will carry a very large weight when determining the candidate's health.

Example:
Interval between each rollout process: 5 minutes
Health check offset: 30 minutes

The health check at minute 35 might determine a healthy candidate even though the candidate is buggy, because most of the metrics are based on the 25 minutes before the current step (traffic percentage).

To solve this, an alternative is to allow the user to define the minimum time between each step (increment in traffic percentage). So, if the metrics show a healthy candidate but not enough time has passed, we do not roll forward. However, if the metrics show an unhealthy candidate, we roll back regardless of the time between the current and last rollout process.

install with a click of a button

Could the steps at https://github.com/GoogleCloudPlatform/cloud-run-release-operator#setup- be performed after I click on a Run on Google Cloud button?

rollout: Explicitly update traffic configuration in UpdateService

UpdateService is making calls to update its routing configuration, which obscures the update of the traffic config. Instead of returning run.Service in these functions, we can return []run.Traffic. Let's refactor the methods for this purpose and move all the traffic-related methods to a new file, traffic.go.

Automatically detect the default project value

As mentioned in this comment, we should determine the default value for the project if possible.

Issue in setup instructions (actAs)

we currently fail with

there were 1 errors: [error#0] rollout failed: failed to perform rollout: failed to replace service: could not update service "hello": googleapi: Error 403: Permission 'iam.serviceaccounts.actAs' denied on service account [email protected] (or it may not exist)., forbidden

this is happening because we have 1 service account named release-manager@[...].

This account has Cloud Run Admin role. But this is not enough to be able to deploy apps that use other serviceaccounts as their identity. (e.g. in this case my hello service uses the Cloud Run default account named "Compute Default service account").

So I think we need to give release-manager@... a project-wide role roles/iam.serviceAccountUser, which contains the iam.serviceaccounts.actAs permission. That way, it can deploy any Cloud Run app that uses any --service-account in the same project.

strict validation for positional arguments

If I run go run ./cmd/operator 1 2 3 this works fine, but it shouldn't. We should not accept unused input silently.

Include troubleshooting section in README

Let's add some common troubleshooting techniques that users can follow to debug the manager. Also, let's include debug mode in setup instructions.

metrics: Rename Metrics interface to Provider

The current name of this interface is confusing. We should give it a different, clearer name, such as Provider.

Create diagram to visualize parameters

Let's create some diagrams that show the workings of some of the parameters (e.g. a timeline showing calls from Cloud Scheduler)

util: Rename LoggerFromContext to LoggerFrom

Since it is clear we are getting the logger from the context, we can switch to a shorter name for the function.

health: Separate metrics collection from diagnosis

Let's separate the collection of metrics values from the diagnosis of the revision's health as discussed here

config: Rename constructor to New

The WithValue constructor doesn't make much sense now, and we should probably just call it New. Also, let's add more fields to the struct in the tests to make it easier to follow

Permissions issue in setup instructions (metrics read)

Like #81,

there were 1 errors: [error#0] rollout failed: failed to perform rollout: failed to diagnose health for candidate "hello-00006-mid": failed to collect metrics: failed to obtain metrics "error-rate-percent": failed to get error rate metrics: error when querying for time series: error when retrieving time series: googleapi: Error 403: Permission monitoring.timeSeries.list denied (or the resource may not exist)., forbidden

@gvso can you please spend more time validating the setup instructions and see if rollouts/rollbacks work with the load generating app that you wrote?

stackdriver: Log information on Latency and ErrorRate calls

We need to add more logs on what is currently happening when calling these functions.

rollout: Add updated time to health report annotation

We need to give more context to the user about when the health report was generated as discussed here

Create a Knative-compatible client

We need to decouple the project from Cloud Run fully managed and start thinking about how to make it compatible with Knative in general.

Set health check offset based on last updated time

In #38, we introduced rollbacks and integration with the health package. When obtaining the metrics, we passed an offset of how much time in the past we want to look up (e.g. 10 minutes).

The current implementation always looks back over a constant amount of time (e.g. 10 minutes, 1 hour, etc.). This has the (potential) advantage that it allows us to measure the latency/error rate given an average load (e.g. 6000 requests in 10 minutes = 10 requests/second). However, it also has the disadvantage of potentially taking longer to completely roll out a new revision (or not rolling out at all if an unreachable request count is set). For instance, if the min request count is 6000 and we check with a constant offset of 10 minutes, it can happen that the new revision always gets fewer than 6000 requests in the last 10 minutes. That is, the candidate never gets more traffic and stays a candidate for a really long time. This is especially likely when the candidate gets a small share of the traffic (at the beginning of the rollout).

The alternative solution would be to add an annotation about the last time the candidate's traffic was increased, so we can calculate an offset = time.Now() - lastTime. This would basically determine the health based on the accumulated requests since the last roll forward.
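
To make the request-count concern above concrete, the following back-of-the-envelope sketch computes how much total traffic the whole service needs for the candidate to reach 6000 requests in a 10-minute window at different traffic shares. It assumes requests are split exactly according to the configured percentage; requiredServiceRPS is a hypothetical helper.

```go
package main

import "fmt"

// requiredServiceRPS returns the total request rate (requests/second) the
// whole service must receive so that a candidate holding trafficPercent of
// the traffic sees minRequests within windowSeconds.
func requiredServiceRPS(minRequests int, windowSeconds, trafficPercent float64) float64 {
	candidateRPS := float64(minRequests) / windowSeconds
	return candidateRPS / (trafficPercent / 100)
}

func main() {
	const minRequests = 6000
	const windowSeconds = 600 // 10 minutes

	for _, pct := range []float64{5, 20, 50, 80} {
		fmt.Printf("%.0f%% traffic: service needs about %.1f req/s\n",
			pct, requiredServiceRPS(minRequests, windowSeconds, pct))
	}
}
```

At a 5% share, the service as a whole would need roughly 200 requests/second for the candidate alone to reach 10 requests/second, which is why a fixed window can stall a rollout in its earliest steps.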
