
fourkeys's Introduction

This repository is not currently maintained. We encourage you to explore it, fork it, or otherwise use it as inspiration for your own metrics instrumentation.

Four Keys

Four Keys YouTube Video

Background

Through six years of research, the DevOps Research and Assessment (DORA) team has identified four key metrics that indicate the performance of software delivery. Four Keys collects data from your development environment (such as GitHub or GitLab) and compiles it into a dashboard displaying these key metrics.

These four key metrics are:

  • Deployment Frequency
  • Lead Time for Changes
  • Time to Restore Services
  • Change Failure Rate

Who should use Four Keys

Use Four Keys if:

  • You want to measure your team's software delivery performance. For example, you may want to track the impact of new tooling or more automated test coverage, or you may want a baseline of your team's performance.
  • You have a project in GitHub or GitLab.
  • Your project has deployments.

Four Keys works well with projects that have deployments. Projects with releases but no deployments (for example, libraries) do not work well, because of how GitHub and GitLab present their data about releases.

For a quick baseline of your team's software delivery performance, you can also use the DORA DevOps Quick Check. The quick check also suggests DevOps capabilities you can work on to improve your performance, and the Four Keys project itself can help you improve several of these capabilities.

How it works

  1. Events are sent to a webhook target hosted on Cloud Run. Events are any occurrence in your development environment (for example, GitHub or GitLab) that can be measured, such as a pull request or new issue. Four Keys defines events to measure, and you can add others that are relevant to your project.
  2. The Cloud Run target publishes all events to Pub/Sub.
  3. A Cloud Run instance is subscribed to the Pub/Sub topics, performs some light data transformation, and loads the data into BigQuery.
  4. BigQuery views complete the data transformations and feed the dashboard.
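The flow in steps 2–4 can be sketched in Python. This is an illustrative helper, not the project's actual worker code; the envelope shape is the standard Pub/Sub push format, and the row fields mirror the events_raw schema described later.

```python
import base64
import json

def process_push(envelope):
    """Turn a Pub/Sub push envelope into a row for the events_raw table.

    Pub/Sub POSTs a JSON body of the form {"message": {"data": <base64>,
    "attributes": {...}, "messageId": ...}} to the subscribed service.
    """
    msg = envelope["message"]
    payload = json.loads(base64.b64decode(msg["data"]))
    return {
        "source": msg.get("attributes", {}).get("source", "unknown"),
        "event_type": payload.get("event_type"),
        "id": payload.get("id"),
        "metadata": json.dumps(payload),        # keep the raw body as JSON
        "time_created": payload.get("time_created"),
        "msg_id": msg.get("messageId"),
    }

# A hypothetical GitHub push event wrapped in a push envelope:
envelope = {
    "message": {
        "data": base64.b64encode(json.dumps({
            "event_type": "push",
            "id": "abc123",
            "time_created": "2021-01-01 00:00:00",
        }).encode()).decode(),
        "attributes": {"source": "github"},
        "messageId": "42",
    }
}
row = process_push(envelope)
```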

This diagram shows the design of the Four Keys system:

Diagram of the FourKeys Design

Code structure

  • bq-workers/
    • Contains the code for the individual BigQuery workers. Each data source has its own worker service with the logic for parsing the data from the Pub/Sub message. For example, GitHub has its own worker which only looks at events pushed to the GitHub-Hookshot Pub/Sub topic.
  • dashboard/
    • Contains the code for the Grafana dashboard displaying the Four Keys metrics
  • data-generator/
    • Contains a Python script for generating mock GitHub or GitLab data.
  • event-handler/
    • Contains the code for the event-handler, which is the public service that accepts incoming webhooks.
  • queries/
    • Contains the SQL queries for creating the derived tables.
  • setup/
    • Contains the code for setting up and tearing down the Four Keys pipeline. Also contains a script for extending the data sources.
  • shared/
    • Contains a shared module for inserting data into BigQuery, which is used by the bq-workers.
  • terraform/
    • Contains Terraform modules and submodules, and examples for deploying Four Keys using Terraform.

How to use

Out of the box

The project uses Python 3 and supports data extraction for Cloud Build and GitHub events.

  1. Fork this project.
  2. Run the automation scripts, which do the following (see the setup README for more details):
    1. Create and deploy the Cloud Run webhook target and ETL workers.
    2. Create the Pub/Sub topics and subscriptions.
    3. Enable Google Secret Manager and create a secret for your GitHub repo.
    4. Create a BigQuery dataset, tables and views.
    5. Output a URL for the newly generated Grafana dashboard.
  3. Set up your development environment to send events to the webhook created in the second step.
    1. Add the secret to your GitHub webhook.
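The secret lets the event-handler verify that each webhook delivery really came from your repo. Below is a minimal sketch of GitHub-style HMAC signature checking; it is illustrative only, not the project's actual event-handler code, and the secret value is hypothetical.

```python
import hashlib
import hmac

def valid_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Compare a GitHub X-Hub-Signature-256 header against the request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information
    return hmac.compare_digest(expected, signature_header)

secret = b"my-webhook-secret"        # hypothetical secret value
body = b'{"action": "opened"}'
header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
```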

NOTE: Make sure you don't use "Squash Merging" when merging back into trunk. Squashing breaks the link between the merge commit on trunk and the commits from the branch you developed on, so Lead Time for Changes cannot be measured for those commits. You can disable squash merging in your repo settings.

Generating mock data

The setup script includes an option to generate mock data. Generate mock data to play with and test the Four Keys project.

The data generator creates mocked GitHub events, which are ingested into the table with the source “githubmock.” It creates the following events:

  • 5 mock commits with timestamps no earlier than a week ago
    • Note: Number can be adjusted
  • 1 associated deployment
  • Associated mock incidents
    • Note: By default, less than 15% of deployments create a mock incident. This threshold can be adjusted in the script.
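The behaviour described above can be sketched as follows. Function and field names here are illustrative stand-ins; the real generator lives in data-generator/generate_data.py.

```python
import random
import uuid
from datetime import datetime, timedelta, timezone

def mock_events(num_commits=5, incident_rate=0.15):
    """Build a batch of mock events: commits, one deployment, maybe an incident."""
    now = datetime.now(timezone.utc)
    commits = [{
        "source": "githubmock",
        "event_type": "push",
        "id": uuid.uuid4().hex[:7],
        # timestamps no earlier than a week ago
        "time_created": (now - timedelta(days=random.randint(0, 6))).isoformat(),
    } for _ in range(num_commits)]
    deployment = {
        "source": "githubmock",
        "event_type": "deployment_status",
        "id": uuid.uuid4().hex[:7],
        "changes": [c["id"] for c in commits],
    }
    events = commits + [deployment]
    if random.random() < incident_rate:   # ~15% of deployments cause an incident
        events.append({
            "source": "githubmock",
            "event_type": "issue",
            "id": uuid.uuid4().hex[:7],
            "root_cause": deployment["id"],
        })
    return events

batch = mock_events()
```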

To run outside of the setup script:

  1. Ensure that you’ve saved your webhook URL and secret in your environment variables:

    export WEBHOOK={your event handler URL}
    export SECRET={your event-handler secret}
  2. Run the following command:

    python3 data-generator/generate_data.py --vc_system=github

    You can see these events being run through the pipeline:

    • The event handler logs show successful requests
    • The Pub/Sub topic shows messages posted
    • The BigQuery GitHub parser logs show successful requests
  3. You can query the events_raw table directly in BigQuery:

    SELECT * FROM four_keys.events_raw WHERE source = 'githubmock';

Reclassifying events / updating your queries

The scripts consider some events to be “changes”, “deploys”, and “incidents.” You may want to reclassify one or more of these events, for example, if you want to use a label other than “incident” for your incidents. Reclassifying an event requires no changes to the architecture or code of the project.

  1. Update the view in BigQuery for the following tables:

    • four_keys.changes
    • four_keys.deployments
    • four_keys.incidents

    To update the view, we recommend that you update the sql files in the queries folder, rather than in the BigQuery UI.

  2. Once you've edited the SQL, run terraform apply to update the view that populates the table:

    cd ./setup && terraform apply

Notes:

  • To feed into the dashboard, the table name should be one of changes, deployments, or incidents.

Extending to other event sources

To add other event sources:

  1. Add to the AUTHORIZED_SOURCES in sources.py.
    1. If you create a verification function, add that function to the file as well.
  2. Run the new_source.sh script in the setup directory. This script creates a Pub/Sub topic, a Pub/Sub subscription, and a new service based on new_source_template.
    1. Update the main.py in the new service to parse the data properly.
  3. Update the BigQuery script to classify the data properly.

If you add a common data source, please submit a pull request so that others may benefit from the functionality.

Running tests

This project uses nox to manage tests. The noxfile defines which tests run on the project. It's set up to run all the pytest files in all the directories, as well as a linter on each directory.
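A noxfile of the kind described might look like this sketch. The session names, Python versions, and linter choice are illustrative, not the project's actual noxfile.

```python
# noxfile.py (sketch)
import nox

@nox.session(python=["3.7", "3.8"])
def tests(session):
    """Run every pytest file found in the project."""
    session.install("pytest")
    session.run("pytest")

@nox.session
def lint(session):
    """Run a linter across all directories."""
    session.install("flake8")
    session.run("flake8", ".")
```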

To run nox:

  1. Ensure that nox is installed:

    pip install nox
  2. Use the following command to run nox:

    python3 -m nox

Listing tests

To list all the test sessions in the noxfile, use the following command:

python3 -m nox -l

Running a specific test

Once you have the list of test sessions, you can run a specific session with:

python3 -m nox -s "{name_of_session}" 

The "name_of_session" will be something like "py-3.6(folder='.....')".

Data schema

four_keys.events_raw

Field Name    Type       Notes
source        STRING     e.g. github
event_type    STRING     e.g. push
id*           STRING     ID of the development object, e.g. bug ID, commit ID, PR ID
metadata      JSON       Body of the event
time_created  TIMESTAMP  The time the event was created
signature     STRING     Encrypted signature key from the event; this is the unique key for the table
msg_id        STRING     Message ID from Pub/Sub

*indicates that the ID is generated by the original system, such as GitHub.
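As an illustration, a single events_raw row matching this schema might look like the following (all values hypothetical):

```python
import json

# One hypothetical events_raw row; field names follow the schema above.
row = {
    "source": "github",
    "event_type": "push",
    "id": "abc123",                                      # ID from the originating system
    "metadata": json.dumps({"ref": "refs/heads/main"}),  # raw event body
    "time_created": "2021-01-01 00:00:00",
    "signature": "sha256=deadbeef",                      # unique key for the table
    "msg_id": "1234567890",                              # Pub/Sub message ID
}
```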

This table will be used to create the following three derived tables:

four_keys.deployments

Note: Changes and deployments have a many-to-one relationship: a single deployment can include many changes. The table contains only successful deployments.

Field Name    Type              Notes
🔑deploy_id    string            ID of the deployment; foreign key to id in events_raw
changes       array of strings  List of IDs associated with the deployment, e.g. commit IDs, bug IDs
time_created  timestamp         Time the deployment was completed

four_keys.changes

Field Name    Type       Notes
🔑change_id    string     ID of the change; foreign key to id in events_raw
time_created  timestamp  time_created from events_raw
change_type   string     The event type

four_keys.incidents

Field Name     Type              Notes
🔑incident_id   string            ID of the failure incident
changes        array of strings  List of deployment IDs that caused the failure
time_created   timestamp         Min timestamp from changes
time_resolved  timestamp         Time the incident was resolved
time_resolved timestamp Time the incident was resolved

Dashboard

Image of the Four Keys dashboard.

The dashboard displays all four metrics with daily systems data, as well as a current snapshot of the last 90 days. The key metric definitions and description of the color coding are below.

For a deeper understanding of the metrics and intent of the dashboard, see the 2019 State of DevOps Report.

For details about how Four Keys calculates each metric in this dashboard, see the Four Keys Metrics calculation doc.

Key metrics definitions

This Four Keys project defines the key metrics as follows:

Deployment Frequency

  • How frequently a team successfully releases to production, e.g., daily, weekly, monthly, yearly.

Lead Time for Changes

  • The median amount of time for a commit to be deployed into production.

Time to Restore Services

  • For a failure, the median amount of time between the deployment which caused the failure and the remediation. The remediation is measured by closing an associated bug / incident report.

Change Failure Rate

  • The number of deployments that caused a failure, divided by the total number of deployments. For example, if there are four deployments in a day and one causes a failure, that is a 25% change failure rate.
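These four definitions can be made concrete with a few lines of Python. All numbers here are invented for illustration; the dashboard itself derives the metrics from BigQuery.

```python
import statistics

# Deployment Frequency: how often deploys happen, e.g. per day over a week.
deploys_per_day = [1, 0, 2, 0, 1, 0, 0]

# Lead Time for Changes: median hours from commit to production deploy.
lead_time = statistics.median([2.0, 5.5, 26.0, 3.0])     # 4.25 hours

# Time to Restore Services: median hours from failing deploy to remediation.
time_to_restore = statistics.median([0.5, 4.0, 12.0])    # 4.0 hours

# Change Failure Rate: failing deployments / total deployments.
deployments, failures = 4, 1
change_failure_rate = failures / deployments             # 0.25 -> 25%
```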

For more information on the calculation of the metrics, see the METRICS.md

Color coding

The dashboard has color coding to show the performance of each metric. Green is strong performance, yellow is moderate performance, and red is poor performance. Below is the description of the data that corresponds to the color for each metric.

The data ranges used for this color coding roughly follows the ranges for elite, high, medium, and low performers that are described in the 2019 State of DevOps Report.

Deployment Frequency

  • Purple: On-Demand (multiple deploys per day)
  • Green: Daily, Weekly
  • Yellow: Monthly
  • Red: Between once per month and once every 6 months.
    • This is expressed as “Yearly.”

Lead Time for Changes

  • Purple: Less than one day
  • Green: Less than one week
  • Yellow: Between one week and one month
  • Red: Between one month and 6 months.
  • Red: Anything greater than 6 months
    • This is expressed as “One year.”

Time to Restore Service

  • Purple: Less than one hour
  • Green: Less than one day
  • Yellow: Less than one week
  • Red: Between one week and a month
    • This is expressed as “One month”
  • Red: Anything greater than a month
    • This is expressed as “One year”

Change Failure Rate

  • Green: Less than 15%
  • Yellow: 16% - 45%
  • Red: Anything greater than 45%

The following chart is from the 2019 State of DevOps Report, and shows the ranges of each key metric for the different categories of performers.

Image of chart from the State of DevOps Report, showing the range of each key metric for elite, high, medium, and low software delivery performers.

Disclaimer: This is not an officially supported Google product


fourkeys's Issues

FR: Create tool to trigger the scheduled scripts on demand

Currently if you want to re-run the scripts that populate the derived tables, you have to find the query history and execute it again manually. We should have an easy python script or set of gcloud commands to run the scripts and update the tables.

Schedule setup fails on auth issue.

I'm getting...

python3 schedule.py --query_file=changes.sql --table=changes --access_token=ya29.a0AfH6SMB8FMz2I_ZuM76BZ44PtNIO37BZLyTwr7ZcWQFA-t9geupxv12zDGJRSZHzlRTIdSIY6mPrh-4n7OH8k1U_lA9hSuE7VqpHLJSOWTx-iGWUBhZ2hAbXVb_Oqz9h7mJaWG8TCrEFfAZddXyuLzEgNA2AFiuqWJP3F4e2pICRa-7p7XVKyQ
Traceback (most recent call last):
  File "/home/davidstanke/.local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/home/davidstanke/.local/lib/python3.6/site-packages/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/davidstanke/.local/lib/python3.6/site-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.FAILED_PRECONDITION
        details = "BigQuery Data Transfer service account is not found, this could be fixed by re-enabling BigQuery Data Transfer service API for project fourkeys-007886. Please ask the project owner to re-enable the API at https://console.cloud.google.com/apis/library/bigquerydatatransfer.googleapis.com."
        debug_error_string = "{"created":"@1611155717.256131935","description":"Error received from peer ipv4:172.217.8.10:443","file":"src/core/lib/surface/call.cc","file_line":1063,"grpc_message":"BigQuery Data Transfer service account is not found, this could be fixed by re-enabling BigQuery Data Transfer service API for project fourkeys-007886. Please ask the project owner to re-enable the API at https://console.cloud.google.com/apis/library/bigquerydatatransfer.googleapis.com.","grpc_status":9}"

Per the docs, the setup script should have prompted me to configure BigQuery interactively, but that didn't seem to happen(?).

I'm investigating.

Invalid syntax in instructions to get event handler endpoint

When I tried the command to get the event handler endpoint in INSTALL.md, I received the following error:

ERROR: (gcloud.run) unrecognized arguments: --platform (did you mean '--format'?)

It looks like the syntax for gcloud run services has changed as of version 322.0.0 and --platform managed --region ${FOURKEYS_REGION} now need to come after the positional arguments.

The following syntax works:

gcloud run services describe event-handler --platform managed --region ${FOURKEYS_REGION} --format=yaml | grep url | head -1 | sed -e 's/  *url: //g'

help with setup

Hi 4Keys!
My team needs some help setting this up for private repos. What appears simple has taken over a week, with no success in sight yet.
Is there any way one of you could help us out with 30 minutes of your time?

Ketan

BQ Data Transfer Auth causes failure in schedule.py

Running the python3 schedule.py --query_file=changes.sql --table=changes --access_token=${token} part of setup.sh initially fails with the following exception.

Traceback (most recent call last):
  File "/Users/segan3/Library/Python/3.8/lib/python/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/Users/segan3/Library/Python/3.8/lib/python/site-packages/grpc/_channel.py", line 923, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/Users/segan3/Library/Python/3.8/lib/python/site-packages/grpc/_channel.py", line 826, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INVALID_ARGUMENT
	details = "Failed to find a valid credential. The request to create a transfer config is supposed to contain an authorization code."
	debug_error_string = "{"created":"@1607609612.827897000","description":"Error received from peer ipv4:216.58.207.234:443","file":"src/core/lib/surface/call.cc","file_line":1063,"grpc_message":"Failed to find a valid credential. The request to create a transfer config is supposed to contain an authorization code.","grpc_status":3}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "schedule.py", line 105, in <module>
    app.run(create_or_update_scheduled_query)
  File "/Users/segan3/Library/Python/3.8/lib/python/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/segan3/Library/Python/3.8/lib/python/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "schedule.py", line 100, in create_or_update_scheduled_query
    response = client.create_transfer_config(parent, transfer_config)
  File "/Users/segan3/Library/Python/3.8/lib/python/site-packages/google/cloud/bigquery_datatransfer_v1/gapic/data_transfer_service_client.py", line 811, in create_transfer_config
    return self._inner_api_calls["create_transfer_config"](
  File "/Users/segan3/Library/Python/3.8/lib/python/site-packages/google/api_core/gapic_v1/method.py", line 145, in __call__
    return wrapped_func(*args, **kwargs)
  File "/Users/segan3/Library/Python/3.8/lib/python/site-packages/google/api_core/retry.py", line 281, in retry_wrapped_func
    return retry_target(
  File "/Users/segan3/Library/Python/3.8/lib/python/site-packages/google/api_core/retry.py", line 184, in retry_target
    return target()
  File "/Users/segan3/Library/Python/3.8/lib/python/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
    return func(*args, **kwargs)
  File "/Users/segan3/Library/Python/3.8/lib/python/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.InvalidArgument: 400 Failed to find a valid credential. The request to create a transfer config is supposed to contain an authorization code.

This was from running it with an auth token from my personal account, which had the Owner role in the project.

There are probably better solutions, but I managed to get around it by first creating a similar scheduled query manually to there get option to grant access to the needed scope.

Getting error when running setup.sh using ubuntu on a windows machine

Hey @dinagraves,

Thanks for all the work with this project, we are in the process of implementing it in our organization.

I'm getting an error when running setup.sh (following your YouTube video) and was hoping you could point me in the right direction. My environment is as follows:

Operating System

Running on Ubuntu 20.04.1 LTS

gcloud

I want to make use of an already existing project called "bcc-fourkeys" in our gcloud instance
I confirmed that billing is enabled and that I have owner access

executing setup.sh

When I attempt to run setup.sh I get the error shown in the attached screenshot.

Please let me know if you need an more information.
Greatly appreciate the help.

Kind Regards,
Philip

Make `cleanup.sh -b` more conservative

When the -b flag is used with setup/cleanup.sh, it will delete any projects named "fourkeys-*" ... this includes projects that were created by setup.sh, but also may include other projects, like fourkeys-foo, fourkeys-bar, or (eek!) fourkeys-testing.

The cleanup script should filter more narrowly for projects to delete.

Setup: Set Cloud Run region

Cloud Run configs should be set globally. The lack of a region tag caused the service account binding to fail:

+ gcloud iam service-accounts create cloud-run-pubsub-invoker --display-name 'Cloud Run Pub/Sub Invoker'
Created service account [cloud-run-pubsub-invoker].
+ gcloud run --platform managed services add-iam-policy-binding github-worker --member=serviceAccount:[email protected] --role=roles/run.invoker
ERROR: (gcloud.run.services.add-iam-policy-binding) Error parsing [service].
The [service] resource is not properly specified.
Failed to find attribute [region]. The attribute can be set in the following ways: 
- provide the argument [--region] on the command line
- set the property [run/region]
- specify from a list of available regions in a prompt
+ gcloud run --platform managed services add-iam-policy-binding cloud-build-worker --member=serviceAccount:[email protected] --role=roles/run.invoker
ERROR: (gcloud.run.services.add-iam-policy-binding) Error parsing [service].
The [service] resource is not properly specified.
Failed to find attribute [region]. The attribute can be set in the following ways: 
- provide the argument [--region] on the command line
- set the property [run/region]

setup.sh gcloud command flag order may be incorrect

I noticed that in some places where gcloud is run, the flags are placed before the positional arguments:

For example:

export WEBHOOK=$(gcloud run --platform managed --region ${FOURKEYS_REGION} services describe event-handler --format=yaml | grep url | head -1 | sed -e 's/ *url: //g')

Running the above, I get this error:

`ERROR: (gcloud.run) unrecognized arguments: --platform (did you mean '--format'?)

To search the help text of gcloud commands, run:
gcloud help -- SEARCH_TERMS`

Modifications that seem to work:

export WEBHOOK=$(gcloud run services describe event-handler --platform managed --region ${FOURKEYS_REGION} --format=yaml | grep url | head -1 | sed -e 's/ *url: //g')

Documentation is out of date

The INSTALL.md refers to a "hipster store" sample, which was subsequently changed to a hello world sample. The documentation should be updated to reflect the latest state of the setup script.

Metrics dependency on deployment patterns

The statistics work as they should if each merge to the main branch is deployed to production separately. If there are multiple merges to the main branch before the deploy to production, either the lead time or the deployment frequency will be incorrect.

Option 1: Create deployment event for each merge to main branch. Will include all commits in lead time, but will generate higher number of deployment events.
Option 2: Create one deployment event for last merge to main branch. Will only include commits for last merge in lead time calculation, but will generate correct number of deployment events.

Example to hopefully make it clearer (using option 2):
Create branch a, make 2 commits, merge into master. Create branch b, make 2 commits, merge into master.

Git log afterwards:

git log --decorate=no --date-order --reverse --pretty=oneline

6bd52a46ab919384b043a40750e0aed8e0e0d43b Branch a, commit 1
5a93561896c7c04758df9fe05eaaa4f7154e53f6 Branch a, commit 2
807b8acfc1007e544941944df182d10e6f9f52fd Merge pull request #3 from org-name/branch-a
964198b26178b4203f14a77c77080af70a445750 B - 1
bb13c56e5e9b214682c28647adc787ff061183e2 B - 2
21e706a11d41e467fe055dbdb5fe21a609427a20 Merge pull request #4 from org-name/branch-b

Create deployment with GitHub API for last merge:

curl -u $GITHUB_USER:$GITHUB_TOKEN \
  -X POST \
  -H "Accept: application/vnd.github.v3+json" \
  https://$HOSTNAME/api/v3/repos/org-name/test-four-keys/deployments \
  -d '{"ref":"21e706a11d41e467fe055dbdb5fe21a609427a20"}'

curl -u $GITHUB_USER:$GITHUB_TOKEN \
  -X POST \
  -H "Accept: application/vnd.github.v3+json" \
  https://$HOSTNAME/api/v3/repos/org-name/test-four-keys/deployments/8/statuses \
  -d '{"state":"success"}'

Resulting BigQuery contents in deployments table:

  {
    "source": "github",
    "deploy_id": "8",
    "time_created": "2020-12-09 11:44:48 UTC",
    "repository": "org-name/test-four-keys",
    "changes": [
      "21e706a11d41e467fe055dbdb5fe21a609427a20",
      "964198b26178b4203f14a77c77080af70a445750",
      "bb13c56e5e9b214682c28647adc787ff061183e2"
    ]
  }

Note that the commits related to branch a are not included.

In this case only the changes on branch b will be included in the lead time dashboard, but deployment frequency will show 1 deployment. If another deployment event had been created for merge commit of branch a, the frequency would be too high.

Existing event-handler secret not honored when setup is rerun - mocking data will fail

If you rerun setup.sh on an existing project where an event-handler secret already exists, the script tries to generate a new one instead of honoring the existing one.
Because of this, mocking data will fail and you'll get an exception from the event-handler.

Exception: Unverified Signature
at index (/app/event_handler.py:50)
at dispatch_request (/usr/local/lib/python3.7/site-packages/flask/app.py:1935)
at full_dispatch_request (/usr/local/lib/python3.7/site-packages/flask/app.py:1949)
at reraise (/usr/local/lib/python3.7/site-packages/flask/_compat.py:39)
at handle_user_exception (/usr/local/lib/python3.7/site-packages/flask/app.py:1820)
at full_dispatch_request (/usr/local/lib/python3.7/site-packages/flask/app.py:1951)
at wsgi_app (/usr/local/lib/python3.7/site-packages/flask/app.py:2446)

Setup Issue: Bigquery Job Latency

There is sometimes a delay in BigQuery job execution, which causes the Data Studio dashboard to load before the BigQuery scripts have finished populating the tables. This results in an empty dashboard.

The setup script should wait until the BigQuery jobs have finished before opening the Data Studio config.

How to integrate Jira as issue/bug tracker and change failure source?

Hi,

We are using Jira instead of Github issues to track our bugs or change failures. What could be a good way to integrate Jira as source?
As a metric we could use the time between opening and resolving a Jira issue with type bug (in our case).
I was thinking about polling Jira once daily to import all issues with a specific type "bug" into the four_keys "incidents" table directly. I guess we would need to deactivate the scheduled query for incidents since it would truncate other created entries.
What do you think?

BR, Jo

Use bash to schedule queries

As seen in #47, using Python to set up the scheduled queries is fragile. I had a better time calling the gcloud CLI directly from bash. Let's replace /queries/schedule.py with a bash script.

It's important to preserve the ability to re-run the schedule setter-upper any time (independent of setup.sh), so this script should be an independent module that setup.sh will invoke.

Setup: gcloud projects list delay

If the user chooses to create a new project via the setup script, there is sometimes a delay before it is listed in the gcloud projects list command. The result is that we are unable to fetch the project number, which in turn means that the rest of the setup script will fail.

The script should wait until there is a valid project number, retrying the gcloud command until it returns the number, or a maximum of 5 minutes.
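The retry-until-timeout behaviour requested here could be sketched generically as below; the `check` callable would wrap the gcloud lookup, and all names are illustrative, not part of the actual setup script.

```python
import time

def wait_for(check, timeout_s=300, interval_s=10):
    """Poll check() until it returns a truthy value, or raise after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = check()   # e.g. a function that fetches the project number
        if result:
            return result
        time.sleep(interval_s)
    raise TimeoutError("condition not met within timeout")
```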

parallelize unit tests

We can make the unit tests go faster by running all of the Python versions in parallel.

Per this doc, nox allows specifying python version on the command line, e.g.: python3 -m nox --python=3.7

We can use Cloud Build's parallel steps to test py 3.6, 3.7, and 3.8 (as specified in the noxfile) simultaneously.

Issue running setup script

djsanders@

Error running setup.sh

  • Selected "No" when asked to create project
  • Supplied existing project ID
  • Selected "Yes" when asked to create example project (helloworld)
Would you like to create a separate new project to test deployments for the four key metrics? (y/n):y
Setting up project for Helloworld...
+ gcloud projects create --folder=
ERROR: (gcloud.projects.create) Missing required argument [PROJECT_ID]: an id or a name must be provided for the new project
+ gcloud beta billing projects link --billing-account=billingAccounts/xxxxxx-xxxxxx-xxxxxx
ERROR: (gcloud.beta.billing.projects.link) argument PROJECT_ID: Must be specified.
Usage: gcloud beta billing projects link PROJECT_ID --billing-account=ACCOUNT_ID [optional flags]
  optional flags may be  --help

Show the dashboard in the README

Adding a screenshot of the dashboard that includes the sample data will help users visualize the end state and motivate them to go through the process of setting up the dashboard.

Documentation: INSTALL vs README

Include more information in the INSTALL.md. Currently the README explains how to do a variety of things (e.g., run the data generator at will) that should also be included in the INSTALL.md.

Datastudio dashboard template

I can't seem to figure out the Datastudio Dashboard template step here. Launching the Datastudio URL, it gives me a message "The linked connectorid was invalid".

I also tried looking for a template in Datastudio community but can't find any similar.

Update requirements.txt for event_handler

The requirements file for event_handler includes the lines:

pytest==5.3.0; python_version > "3.0"
pytest==4.6.6; python_version < "3.0"

...since we require python >= 3.6, and since pytest is now at version 6, I propose we update this to:

pytest~=6.0.0

Does that sound okay to you, @dinagraves

(this is what I'm using for the event_handler tests)

Also, we could get faster installs and slimmer prod builds if we isolate testing dependencies. AIUI, PIP doesn't support "dev dependencies" like e.g. npm, but I did find this: https://stackoverflow.com/questions/17803829/how-to-customize-a-requirements-txt-for-multiple-environments

Bitbucket Support

Hi all,
Are there any plans to add bitbucket support to this project? I may be able to help with this too, if it's not too much work.

Thanks

Example env.sh

In multiple parts of the project, an env.sh file is referenced.
I think an example of the file should be provided, to simplify setup and later use.

Datastudio dashboard is not showing my data

Hi @dinagraves

I followed the installation instructions and also generated some mock data. Finally, I connected the "Four Keys Dashboard" connector with my project, and it gave me a "Four Keys" data source and the shared public template.
But when opening the dashboard, I see some numbers from June but not my mocked data.

Honestly, I also can't figure out how the data source/connector is calculating/transforming the actual data in the BigQuery tables (changes, deployments, incidents) into the dashboard.
Could you shed some light on this?

Thanks anyways for this cool project and work you've done on this!
best regards,
Jo

How to get Git history from any repo to generate events

I have added my webhook to my gitlab repo, and now looking at how to review my recent project history.

Is there a mechanism to scan any git repo and generate the corresponding events_raw entries, so as to start with 6-9 months of historical ways of working on the four metrics? Or would that miss too much info to be of use?

Thinking this would be useful to maximise the info contained on any git repo (regardless of gitlab, github, bitbucket, etc...).

Data Studio Connector is Unverified

Today I found I can't browse the dashboards we created before.
When opening the dashboards, the following pop-up appeared.

And after clicking "承認 (Approval)", another pop-up appeared.

What happened to the "connector"? (Honestly, I'm not familiar with connectors.)
Maybe it is related to "publishing connector"? -> #42 (comment)

Or is something wrong on my end?

Refactor data generator script

There are several conditionals based on VCS: if <github>: ... else if <gitlab>: ...

This is already inelegant and will quickly become unmaintainable as we add VCSes.
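One common refactor is a dispatch table keyed by VCS name; the generator functions below are hypothetical stand-ins, not the project's actual code.

```python
def make_github_event():
    """Hypothetical GitHub event builder."""
    return {"source": "github", "event_type": "push"}

def make_gitlab_event():
    """Hypothetical GitLab event builder."""
    return {"source": "gitlab", "event_type": "Push Hook"}

# Adding a new VCS becomes a one-line registration instead of another elif.
GENERATORS = {
    "github": make_github_event,
    "gitlab": make_gitlab_event,
}

def make_event(vcs):
    try:
        return GENERATORS[vcs]()
    except KeyError:
        raise ValueError(f"unsupported VCS: {vcs}") from None
```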
