googlecloudplatform / cloud-ops-sandbox

Cloud Operations Sandbox is an open source collection of tools that helps practitioners learn O11y and R9y practices from Google and apply them using the Cloud Operations suite of tools.

Home Page: https://cloud-ops-sandbox.dev

License: Apache License 2.0

Shell 35.97% HCL 64.03%
sre cloud stackdriver stackdriver-logs stackdriver-monitoring stackdriver-trace opencensus profiler opentelemetry cloudops

cloud-ops-sandbox's Introduction

📈 📊 👣 🪵 Cloud Operations Sandbox

Cloud Operations (Ops) Sandbox is an end-to-end demo that helps practitioners learn about Cloud Operations (formerly Stackdriver) and Site Reliability Engineering practices from Google.

Sandbox is composed of the Online Boutique microservice demo application and a collection of various observability instruments. With it you can:

  • Study running a microservice application on GKE
  • Monitor the application's behavior using various system and application metrics displayed on per-service dashboards
  • Explore Uptime checks, Service SLOs and other instruments of the Cloud Operations suite of Google Cloud
  • Experiment with the observability instruments that Sandbox creates and build new ones
  • Run quick labs using Sandbox Recipes (🚧 temporarily unavailable)

Warning: Check the Discontinued Functionality section for the list of functions that are no longer supported or that changed in recent versions.

Using Cloud Ops Sandbox

Cloud Ops Sandbox runs on Google Cloud. To use it you will need a Google Cloud account with permission to create a new GCP project or to provision resources in an existing GCP project.

Launch

You can launch Cloud Ops Sandbox by using the Cloud Shell button below and following the walkthrough instructions:

Launch in Cloud Shell

Or, you can launch it on your workstation. To run it locally you will need to make sure that the following software is available:

You also need a Google Cloud project in which to launch Cloud Ops Sandbox. After that, run the following commands, replacing PROJECT_ID with your project ID:

git clone https://github.com/GoogleCloudPlatform/cloud-ops-sandbox
gcloud auth application-default login
cloud-ops-sandbox/provisioning/sandboxctl create -p PROJECT_ID

These commands clone this repo into the current directory of your local environment, acquire an authentication token for Terraform, and launch Cloud Ops Sandbox with default settings. The script will prompt you for additional information.

You can learn more about customized options by running:

cloud-ops-sandbox/provisioning/sandboxctl -h
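
The launch commands above can be wrapped in a small script that rejects an obviously wrong PROJECT_ID before anything is cloned or provisioned. The following is a minimal sketch, not part of the repository: the function names is_valid_project_id and launch_sandbox are hypothetical, and the check only covers the usual project ID format (6 to 30 characters, lowercase letters, digits and hyphens, starting with a letter and not ending with a hyphen), so legacy domain-scoped IDs would be rejected.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): succeeds when $1 looks
# like a Google Cloud project ID: 6-30 characters, lowercase letters,
# digits and hyphens, starting with a letter, not ending with a hyphen.
is_valid_project_id() {
  printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
}

# Hypothetical wrapper around the three documented launch commands.
# It stops at the first command that fails.
launch_sandbox() {
  project_id="$1"
  if ! is_valid_project_id "$project_id"; then
    echo "error: '$project_id' does not look like a project ID" >&2
    return 1
  fi
  git clone https://github.com/GoogleCloudPlatform/cloud-ops-sandbox &&
    gcloud auth application-default login &&
    cloud-ops-sandbox/provisioning/sandboxctl create -p "$project_id"
}
```

After sourcing the file, launch_sandbox my-project-123 would run the documented commands for that project, while launch_sandbox PROJECT_ID fails early because the placeholder was not replaced.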

Use Cloud Ops Sandbox

Read more about Cloud Ops Sandbox and how to use it in the documentation.

Discontinued Functionality

The following functionality is not available in versions ≥ 0.9:

  • Rating service is not a part of the demo application. It has the following effects:
    • Launch does not provision AppEngine services and CloudSQL DB.
    • Sandbox does not define a window-based SLO.
    • SLO recipe that uses the rating service will not be available.
  • One-click installation is no longer available. Users use the sandboxctl CLI tool to create and delete Sandbox, and can follow the walkthrough tutorial for launch instructions.
  • Starting with this version, Sandbox does not create custom Cloud Shell images.
  • Starting with this version, launch does not create a new Google Cloud project. Users have to provide the ID of the project that will host Sandbox as a parameter to the CLI.
  • The [Website] will be retired at the end of summer 2023. Until then, it will provide a link to launch version 0.8.2 of Sandbox.
  • This version uses version 0.6.0 of Online Boutique. The load generator in this version does not expose a GUI. As a result, it is not possible to customize the artificial load on the demo application. Follow GoogleCloudPlatform/microservices-demo#1692 to track the progress.
  • SRE recipe functionality is temporarily removed. Follow #1009 to track the progress.

Legacy version (0.8) of Cloud Ops Sandbox

The legacy version (0.8.2) is no longer supported. You should still be able to deploy it by pressing the button below:

Launch in Cloud Shell

Code of Conduct

Please see the code of conduct

Contributions

Please see the contributing guidelines

License

This product and Online Boutique application, its code and assets are licensed under Apache 2.0. Full license text is available in LICENSE.


Note: This is not an official Google project. Please report any issues or feature requests related to this project here.


cloud-ops-sandbox's Issues

Create terraform for productcatalogservice custom dashboard

#WHAT: Create dashboard that represents four golden signals of monitoring (latency, traffic, errors, and saturation) for the productcatalogservice

#WHY: Creates opinionated monitoring for this service

#HOW: Terraform resources added to initialization terraform scripts

Create terraform for paymentservice, emailservice, and shippingservice custom dashboard

#WHAT: Create dashboard that represents four golden signals of monitoring (latency, traffic, errors, and saturation) for the three services: paymentservice, emailservice, and shippingservice

#WHY: Creates opinionated monitoring for these services

#HOW: Terraform resources added to initialization terraform scripts. Should use a Terraform templatefile to provision the same dashboards for each of these services

Difference in code resulting in Invalid Snapshot Position Error

Description: A user may get an "Invalid Snapshot Position" error when following the steps for creating a snapshot in the Stackdriver Sandbox User Guide. The error makes it impossible to see any values coming from the application, because the snapshot is never activated.

Error: The error text reads: "Invalid snapshot position: Loaded script contained x lines. Please ensure that the snapshot was set in the same code version as the deployed source." A picture is shown below.
[Screenshot 2020-07-01 at 12 14 41 PM]

Fix: This error occurs because the service is deployed from container images that were built from source code older than the version currently checked in on GitHub. One possible fix is to deploy the service using container images built from the most recent version.

Create terraform for SLOs and alerting policies for frontend service

#WHAT: Create two SLOs for the frontend service (1 for availability and 1 for performance) and an alerting policy on budget burn for each one. SLO defined as 99.9% availability/performance over 30 day rolling window. Alert when 10% budget expended over 1 hour.

#WHY: Demonstrate best practice alerting on SLO error budget burn through terraform for this service

#HOW: Terraform resources added into initialization terraform scripts
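
The numbers in this SLO definition can be checked with a little arithmetic. The sketch below is only an illustration of that arithmetic, assuming awk is available; the function names error_budget_minutes and burn_rate_threshold are hypothetical and do not come from the repository or from Terraform. The error budget is expressed in minutes for intuition, although a request-based SLO would count failed requests instead.

```shell
#!/bin/sh
# Hypothetical helper: error budget of an SLO, expressed in minutes.
# Usage: error_budget_minutes SLO_PERCENT WINDOW_DAYS
error_budget_minutes() {
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

# Hypothetical helper: burn rate at which BUDGET_PERCENT of the error
# budget is consumed within LOOKBACK_HOURS of a WINDOW_DAYS rolling window.
# A burn rate of 1 spends the budget exactly over the whole window.
# Usage: burn_rate_threshold BUDGET_PERCENT LOOKBACK_HOURS WINDOW_DAYS
burn_rate_threshold() {
  awk -v pct="$1" -v hours="$2" -v days="$3" \
    'BEGIN { printf "%.1f\n", (pct / 100) * (days * 24) / hours }'
}

error_budget_minutes 99.9 30   # prints 43.2
burn_rate_threshold 10 1 30    # prints 72.0
```

In other words, a 99.9% target over 30 days leaves about 43 minutes of budget, and spending 10% of it within one hour corresponds to a burn rate of 72.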

docs: Error Reporting not as instructive as it could be

The Error Reporting section has two issues:

  1. It seems that we need to manually enable the Error Reporting API to get it to work. This should be added to the instructions in the docs.
  2. This demonstration would be more powerful if the user were able to manufacture some errors and see them reported on the UI. The docs could have some "how to break it" instructions, e.g. stopping nodes through the GCP console or a particular pattern using the load generator.

Create terraform for SLOs and alerting policies for checkoutservice

#WHAT: Create two SLOs for the checkoutservice (1 for availability and 1 for performance) and an alerting policy on budget burn for each one. SLO defined as 99.9% availability/performance over 30 day rolling window. Alert when 10% budget expended over 1 hour.

#WHY: Demonstrate best practice alerting on SLO error budget burn through terraform for this service

#HOW: Terraform resources added into initialization terraform scripts

Stackdriver Sandbox website does not link to User Guide or repo

The Stackdriver Sandbox website does not link to the official user guide, and only has set-up instructions through step 2 (run ./install.sh). A user arriving from that website may be confused about how to proceed from there.

It would be helpful to include a link from that website to the User Guide, which has more detailed documentation. For example, add to the website "Step 3: Follow Set-Up instructions in the User Guide" (along with a link to the User Guide).

Error reading Project "stackdriver-sandbox-4086952698": googleapi: Error 403: The caller does not have permission, forbidden

Full Cloud Shell output:

./install.sh
Checking Prerequisites...
Checking for billing accounts
using billing account: Personal Billing Account
Make sure Terraform is installed
Initialize terraform state

Initializing the backend...

Initializing provider plugins...

The following providers do not have any version constraints in configuration, so the latest version was installed.
To prevent automatic upgrades to new major versions that may contain breaking
changes, it is recommended to add version = "..." constraints to the
corresponding provider blocks in configuration, with the constraint strings
suggested below.

* provider.null: version = "~> 2.1"
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
Apply Terraform automation
null_resource.customize_manifests: Refreshing state... [id=8555647629071790355]
data.google_billing_account.acct: Refreshing state...
random_id.project: Refreshing state... [id=85ny-g]
google_project.project: Refreshing state... [id=stackdriver-sandbox-4086952698]
Error: Error reading Project "stackdriver-sandbox-4086952698": googleapi: Error 403: The caller does not have permission, forbidden

Running install.sh multiple times

How do we want the script to behave when it is run multiple times? Currently, terraform will skip deploying a new project, but the load balancer portion will create a new load balancer.

Do we want the script to cancel if it sees tfstate files in the current directory? Or should it always deploy another sandbox instance? Or something else?
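
One of the options raised above, cancelling when Terraform state already exists, could look roughly like the sketch below. It is an illustration of that single option only, not the behavior of install.sh; the function names has_terraform_state and guard_reinstall are hypothetical.

```shell
#!/bin/sh
# Hypothetical check: succeeds when directory $1 (default: current
# directory) contains Terraform state files from an earlier run.
has_terraform_state() {
  dir="${1:-.}"
  for f in "$dir"/*.tfstate "$dir"/*.tfstate.backup; do
    [ -e "$f" ] && return 0
  done
  return 1
}

# Hypothetical guard an install script could call before provisioning.
# It refuses to continue when state files are present.
guard_reinstall() {
  if has_terraform_state "${1:-.}"; then
    echo "Terraform state found in ${1:-.}; not deploying again." >&2
    return 1
  fi
}
```

A script would call guard_reinstall || exit 1 before running Terraform. Always deploying another instance, the other option mentioned in the issue, would simply skip this guard.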

Create terraform for SLOs and alerting policies for shippingservice

#WHAT: Create two SLOs for the shippingservice (1 for availability and 1 for performance) and an alerting policy on budget burn for each one. SLO defined as 99.9% availability/performance over 30 day rolling window. Alert when 10% budget expended over 1 hour.

#WHY: Demonstrate best practice alerting on SLO error budget burn through terraform for this service

#HOW: Terraform resources added into initialization terraform scripts

Replace "Stackdriver" with "Operations" and remove stackdriver-only items

Since Stackdriver has been rebranded to Operations, the following would be useful:

  • The install script links to Stackdriver Dashboard (app.google.stackdriver.com/accounts/create) at the end, but this page no longer contains any information. This should be removed from the install script and docs updated accordingly.
  • All instances of project name "Stackdriver" should be renamed to "Operations" in the docs, and tool names from "Stackdriver *" to "Cloud *".
  • There are a few images that may be misleading since they show "Stackdriver" in the UI where users should now expect to see "Operations"

Create terraform for checkoutservice custom dashboard

#WHAT: Create dashboard that represents four golden signals of monitoring (latency, traffic, errors, and saturation) for the checkoutservice

#WHY: Creates opinionated monitoring for this service

#HOW: Terraform resources added to initialization terraform scripts

Create terraform for SLOs and alerting policies for currencyservice

#WHAT: Create two SLOs for the currencyservice (1 for availability and 1 for performance) and an alerting policy on budget burn for each one. SLO defined as 99.9% availability/performance over 30 day rolling window. Alert when 10% budget expended over 1 hour.

#WHY: Demonstrate best practice alerting on SLO error budget burn through terraform for this service

#HOW: Terraform resources added into initialization terraform scripts

Create terraform for SLOs and alerting policies for paymentservice

#WHAT: Create two SLOs for the paymentservice (1 for availability and 1 for performance) and an alerting policy on budget burn for each one. SLO defined as 99.9% availability/performance over 30 day rolling window. Alert when 10% budget expended over 1 hour.

#WHY: Demonstrate best practice alerting on SLO error budget burn through terraform for this service

#HOW: Terraform resources added into initialization terraform scripts

emailservice crashloop

The emailservice appears to be in a crashloop when testing on my personal project. The error message is:

google.api_core.exceptions.PermissionDenied: 403 Stackdriver Trace API has not been used in project 1093971809799 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/cloudtrace.googleapis.com/overview?project=1093971809799 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.

Do we need to enable the API before using stackdriver sandbox? Is this not done as part of the install script?
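
If the API does have to be enabled separately, it can be done from the command line with gcloud services enable. The sketch below assumes an installed and authenticated gcloud; the function name enable_apis and the DRY_RUN switch are hypothetical, and the only API name taken from the error message above is cloudtrace.googleapis.com.

```shell
#!/bin/sh
# Hypothetical helper: enable one or more APIs on a project.
# Usage: enable_apis PROJECT_ID API [API...]
# With DRY_RUN=1 the gcloud commands are printed instead of executed.
enable_apis() {
  project_id="$1"
  shift
  for api in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "gcloud services enable $api --project $project_id"
    else
      gcloud services enable "$api" --project "$project_id" || return 1
    fi
  done
}
```

For example, enable_apis PROJECT_ID cloudtrace.googleapis.com would enable the Trace API named in the error. As the message notes, it can take a few minutes for the change to propagate.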

Create terraform for SLOs and alerting policies for emailservice

#WHAT: Create two SLOs for the emailservice (1 for availability and 1 for performance) and an alerting policy on budget burn for each one. SLO defined as 99.9% availability/performance over 30 day rolling window. Alert when 10% budget expended over 1 hour.

#WHY: Demonstrate best practice alerting on SLO error budget burn through terraform for this service

#HOW: Terraform resources added into initialization terraform scripts

Create terraform for SLOs and alerting policies for cartservice

#WHAT: Create two SLOs for the cartservice (1 for availability and 1 for performance) and an alerting policy on budget burn for each one. SLO defined as 99.9% availability/performance over 30 day rolling window. Alert when 10% budget expended over 1 hour.

#WHY: Demonstrate best practice alerting on SLO error budget burn through terraform for this service

#HOW: Terraform resources added into initialization terraform scripts

Create Terraform HTTP uptime check for demo application w/ alerting policy & email notification

#WHAT: Create an HTTP uptime check on the external IP of the demo application. The alerting policy is defined as the number of failures exceeding 1 for 1 minute over a 20-minute polling period. An email notification alerts the owner of the project

#WHY: Add best practice monitoring in the form of an uptime check and an associated alerting policy

#HOW: Terraform resources added to initialization terraform scripts

Background image is missing

main.css includes:

background-image: url(../images/content_copy.svg);

There is no content_copy.svg in the repository. Should the image be added or the css be updated?
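
A quick way to find this kind of dangling reference is to list every url(...) in a stylesheet and test whether the file exists relative to the stylesheet's directory. The sketch below is a hypothetical helper, not something in the repository; it assumes a grep that supports -o (GNU and BSD grep do), skips absolute and data URLs, and does not handle url() values that contain parentheses.

```shell
#!/bin/sh
# Hypothetical helper: print every relative url(...) reference in the CSS
# file $1 whose target does not exist relative to that file's directory.
missing_css_assets() {
  css="$1"
  base=$(dirname "$css")
  grep -o 'url([^)]*)' "$css" |
    sed -e 's/^url(//' -e 's/)$//' -e "s/^[\"']//" -e "s/[\"']$//" |
    while IFS= read -r ref; do
      case "$ref" in
        http:*|https:*|data:*|//*) continue ;;  # not local files
      esac
      [ -e "$base/$ref" ] || printf '%s\n' "$ref"
    done
}
```

Running missing_css_assets on main.css should then report ../images/content_copy.svg for as long as the image is absent.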

docs: add instructions to create Analysis Report

It would be clearer to show the user how to create an Analysis Report in the Trace section of the User Guide, because no reports have been generated yet when the user first opens Trace.

Between "Finally, click Analysis Reports in the navigation menu to see a list of reports that are generated" and "View one of the reports that was created (or the one you created yourself) to understand either the density or cumulative distribution of latency for the call you selected", a few sentences could be added that explain how to create a report.

Loadgen not always creates firewall rule for locust

The loadgenerator tool has the following option:

./loadgenerator-tool setup

This option creates a firewall rule that allows access to the VM that runs Locust through its external IP.
We need to remove this option and instead make sure that every time a new job is created, the corresponding firewall rule is created or updated as well.

Insufficient quota when running install.sh

There's an insufficient quota issue when running the script install.sh. It failed to create the cluster.

Error: googleapi: Error 403: Insufficient regional quota to satisfy request: resource "IN_USE_ADDRESSES": request requires '5.0' and is short '1.0'. project has a quota of '4.0' with '4.0' available. View and manage quotas at https://console.cloud.google.com/iam-admin/quotas?usage=USED&project=stackdriver-sandbox-69333807., forbidden

on 03_gke_cluster.tf line 42, in resource "google_container_cluster" "gke":
42: resource "google_container_cluster" "gke" {

Create Terraform for frontend service custom dashboard

#WHAT: Creates dashboard representing four golden signals of monitoring (latency, traffic, errors, and saturation) for the frontend service

#WHY: Creates opinionated monitoring of this service

#HOW: Terraform resources added to initialization terraform scripts

Create terraform for SLOs and alerting policies for recommendationservice

#WHAT: Create two SLOs for the recommendationservice (1 for availability and 1 for performance) and an alerting policy on budget burn for each one. SLO defined as 99.9% availability/performance over 30 day rolling window. Alert when 10% budget expended over 1 hour.

#WHY: Demonstrate best practice alerting on SLO error budget burn through terraform for this service

#HOW: Terraform resources added into initialization terraform scripts

Create terraform for cartservice custom dashboard

#WHAT: Create dashboard that represents four golden signals of monitoring (latency, traffic, errors, and saturation) for the cartservice

#WHY: Creates opinionated monitoring for this service

#HOW: Terraform resources added to initialization terraform scripts

docs: how to manufacture errors for Error Reporting

The Error Reporting section would be a more powerful demonstration if the user were able to manufacture some errors and see them reported on the UI. The docs could have some "how to break it" instructions, e.g. stopping nodes through the GCP console or a particular pattern using the load generator.

Create terraform for SLOs and alerting policies for productcatalogservice

#WHAT: Create two SLOs for the productcatalogservice (1 for availability and 1 for performance) and an alerting policy on budget burn for each one. SLO defined as 99.9% availability/performance over 30 day rolling window. Alert when 10% budget expended over 1 hour.

#WHY: Demonstrate best practice alerting on SLO error budget burn through terraform for this service

#HOW: Terraform resources added into initialization terraform scripts

Load generator instance was not created

I just created a new Google account to create a new sandbox. Here's the last part of the Cloud Shell output. It did not create the load generator instance.


Stackdriver Sandbox deployed successfully!
Stackdriver Dashboard: https://app.google.stackdriver.com/accounts/create
Google Cloud Console Dashboard: https://console.cloud.google.com/kubernetes/workload?project=stackdriver-sandbox-347738008
Hipstershop web app address: http://34.67.163.61
Load generator web interface: [not found]


Create terraform for currencyservice custom dashboard

#WHAT: Create dashboard that represents four golden signals of monitoring (latency, traffic, errors, and saturation) for the currencyservice

#WHY: Creates opinionated monitoring for this service

#HOW: Terraform resources added to initialization terraform scripts

Create terraform for SLOs and alerting policies for adservice

#WHAT: Create two SLOs for the adservice (1 for availability and 1 for performance) and an alerting policy on budget burn for each one. SLO defined as 99.9% availability/performance over 30 day rolling window. Alert when 10% budget expended over 1 hour.

#WHY: Demonstrate best practice alerting on SLO error budget burn through terraform for this service

#HOW: Terraform resources added into initialization terraform scripts

Create terraform for recommendationservice custom dashboard

#WHAT: Create dashboard that represents four golden signals of monitoring (latency, traffic, errors, and saturation) for the recommendationservice

#WHY: Creates opinionated monitoring for this service

#HOW: Terraform resources added to initialization terraform scripts

Create terraform for adservice custom dashboard

#WHAT: Create dashboard that represents four golden signals of monitoring (latency, traffic, errors, and saturation) for the adservice

#WHY: Creates opinionated monitoring for this service

#HOW: Terraform resources added to initialization terraform scripts

Enable API and set-up billing as part of Sandbox provisioning

When setting up Stackdriver Debugger following the steps in the Sandbox User Guide, I receive a message in Cloud Shell asking whether I would like to enable the API.

[Screenshot 2020-06-30 at 6 41 32 PM]

This is a step that could be done as a part of Sandbox provisioning.

Note: after choosing to enable the API, the user must also enable billing for the API method that was called before continuing. If the API is enabled during provisioning, it would be helpful either to add instructions to the User Guide specifying that API billing should be set up, or to find a way to do it automatically. A screenshot is shown below.

The earlier billing is enabled, the better, since it can take a few minutes for the system to register the updated billing information before the user can move on to the next step.
[Screenshot 2020-07-01 at 10 22 49 AM]

docs: stackdriver workspaces are now embedded in GCP console UI

The docs/README.md#Stackdriver Monitoring section has some outdated information:

  • The Stackdriver Monitoring console is no longer "a separate UI from the consoles for other GCP and Stackdriver products." The images should be updated too.
  • The new Monitoring dashboard does not have a Resources > (Infrastructure) Kubernetes Engine. I found similar info through Monitoring > Dashboards > Kubernetes Engine > Infrastructure tab, but I'm not sure if that's the right place.
  • Similarly, the route for finding Metrics Explorer is Monitoring > Metrics Explorer instead of going through Resources
  • In log query creation, resource is under Kubernetes Container > stackdriver-sandbox > default > server instead of GKE Container > stackdriver-sandbox > default + server from dropdown menu
  • In Logging, it looks like "Create Export" is now "Create Sink"

Move instructions on how to build/deploy hipstershop to contributing.md

Currently readme.md contains both an explanation of what Sandbox is and instructions for developers on how to build and deploy it using skaffold. The latter should move to contributing.md, which is relevant only to people interested in contributing to this repo, whereas readme.md targets everyone, including people who discover Sandbox and only want to use it as a finished product.

Create Terraform for User Experience Dashboard

#WHAT: Terraform support to create an example dashboard for User Experience that contains metrics relevant to user interaction with the application

#WHY: We want to provide a dashboard that reveals metrics about the demo application as a whole

#HOW: Additional terraform config file created alongside the initial provisioning terraform
