
bigquery-optimization-queries's Introduction

Introduction

Note that these queries go along with an ebook about optimizing BigQuery costs written by Sayle Matthews of DoiT International, as well as the accompanying DoiT Engineering Blog series.

These queries assist in optimizing BigQuery usage in projects and organizations. Most focus on costs, but a few focus on concurrency, and some recommend whether a query would run better under an on-demand or flat-rate pricing scheme.

Usage

In each file is a variable called interval_in_days, which is the number of days in the past that the query will look at when doing its work. To increase or decrease the amount of data processed by the query, just change this value and it will be reflected throughout the rest of the query, as illustrated in the sketch below.

Note that the default value was chosen to balance a useful timeframe against the amount of data the query processes.
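For illustration, here is a minimal sketch of how such a look-back variable might be declared and used at the top of a query; the 7-day default and the INFORMATION_SCHEMA view shown here are assumptions for the example, not an excerpt from the repository's files.

-- Sketch: a look-back window controlled by a single variable.
DECLARE interval_in_days INT64 DEFAULT 7;  -- assumed default; adjust as needed

SELECT job_id, total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL interval_in_days DAY);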

Additionally, each file is named after what it does and has a comment at the top explaining what the query does.

Note on Costs

Some of these queries can process through a LOT of data, so it's HIGHLY recommended to verify the estimated cost of each query before running it. Depending upon how much usage your dataset sees over the specified timeframe, this could easily be upwards of tens of gigabytes, if not more, per query.

Blindly running these queries is at your own risk! It is strongly recommended to reduce the interval_in_days value when a query will be processing a very large amount of data.

Audit Log vs Information Schema

The queries are broken up into two folders: audit_log and information_schema. These correspond to the different schemas that may need to be queried. In general, most people will use the information_schema queries because they do not have a BigQuery audit log sink set up.

If you already have an audit log sink set up for BigQuery (or are a DoiT customer with the BQ Lens feature enabled), read the blog entry for a detailed guide on how to discover the location of your sink.

Note that if you are currently a DoiT International customer and have the BQ Lens feature enabled in your Cloud Management Platform (CMP), then you should use the audit log queries, as you will already have the tables created.

Generating Queries for Your Project

If you look at the queries you will see placeholders for the project name and dataset location in the SQL code. These need to be replaced with the correct values prior to running the queries.

To make this easier, the repository contains a file called generate_sql_files.py. It performs a search-and-replace operation on all of the .sql files with your specified values.

generate_sql_files.py Usage

generate_sql_files.py [--location <dataset-location>] <project> <output-directory>

Note that this will create a copy of the files, with the placeholders filled in, in the directory you specify as the output-directory.

location:
This is the location of your BigQuery dataset. Note the format is the same as BigQuery's, such as 'region-us' for the US multi-region or 'us-central1' for the US Central 1 (Iowa) region. The default value is 'region-us' if you do not specify anything.

project:
This is the name of the GCP project you are going to be querying against.

output-directory:
The directory where you would like the generated files to be stored. The script will attempt to create this directory if it doesn't already exist. Note that this should not be the same directory as the one where the script is located.
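For example, a hypothetical invocation that generates the queries for a project named my-gcp-project into a directory named generated_queries, using the default US multi-region location (the project and directory names here are placeholders for illustration only):

generate_sql_files.py --location region-us my-gcp-project generated_queries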

Contributing

If you see any bugs, please feel free to reach out or open a pull request against the code.

bigquery-optimization-queries's People

Contributors

glen-park-doit, jisoo411, jkdabbert, sayle-doit


bigquery-optimization-queries's Issues

BigQuery slot computation: don't "average" the slots

When doing some aggregations to compute the amount of slots per unit of time (as in the queries query_slots_per_second.sql and slots_by_second.sql), the slots per unit of time are assessed by taking a sort of average of the slot usage ("SAFE_DIVIDE(total_slot_ms, ...)").

This makes the (false) assumption that the slot usage is constant over the whole duration of the query.

Instead of using the JOBS_BY_PROJECT tables, it is possible to use the JOBS_TIMELINE tables, which give better insight into what happens each second.
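For reference, here is a hedged sketch (not one of the repository's queries) of how per-second slot usage can be derived from the JOBS_TIMELINE view, whose period_slot_ms column reports the slot milliseconds consumed in each one-second period of a job:

-- Sketch: average slots in use during each one-second period.
SELECT
  period_start,
  SUM(period_slot_ms) / 1000 AS slots_used
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE job_creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND statement_type != 'SCRIPT'  -- see the double-counting issue below
GROUP BY period_start
ORDER BY period_start;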

information_schema/top_costly_queries.sql - serious duplications counting / summing happens

Hey, just stumbled upon this repository, very useful idea. However, when looking at the results of this query we see heavy duplications occurring.
A query that was triggered 1,141 times in the past 14 days appears in the result as if it was triggered over 1 million times.
The cost for that query alone comes out as if it were 100 times greater than our total project cost, which definitely made no sense.

I believe the issue comes from the cross join between src1 and src2 resulting in many duplications for each job, but I hope I'm understanding the query correctly.
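For comparison, here is a hedged sketch of an alternative formulation (an assumption, not the repository's query) that aggregates directly from JOBS_BY_PROJECT, so each job contributes its billed bytes exactly once per query text:

-- Sketch: cost per distinct query text without a self-join.
SELECT
  query,
  COUNT(*) AS run_count,
  SUM(total_bytes_billed) AS total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 14 DAY)
  AND job_type = 'QUERY'
GROUP BY query
ORDER BY total_bytes_billed DESC;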

BigQuery slot computation: double counting

When doing some aggregations to compute the amount of slots per unit of time (as in the queries query_slots_per_second.sql and slots_by_second.sql), the following filter should be added to the WHERE condition:

statement_type != 'SCRIPT'

Otherwise, the amount of slots will be counted twice in some situations. This is mentioned in the documentation at https://cloud.google.com/bigquery/docs/information-schema-jobs-timeline:

For script jobs, the parent job also reports the total slot usage from its children jobs. To avoid double counting, use WHERE statement_type != "SCRIPT" to exclude the parent job.
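Applied to the aggregation queries in this repository, the fix amounts to adding that predicate to the WHERE clause. A hedged sketch of the shape (the column and view names are the standard INFORMATION_SCHEMA ones, not an excerpt from the repository):

-- Sketch: exclude parent script jobs so their children are not counted twice.
SELECT SUM(total_slot_ms) AS total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND statement_type != 'SCRIPT';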

BigQuery pricing unit for onDemandCost is tebibyte instead of terabyte

Many queries used in this repo calculate the on-demand costs and billed bytes/MB/GB, as in this example:

       SUM(COALESCE(totalBilledBytes, 0)) AS totalBilledBytes,
        ROUND(SUM(COALESCE(totalBilledBytes,
              0)) / 1000000, 2) AS totalMegabytesBilled,
        ROUND(SUM(COALESCE(totalBilledBytes,
              0)) / 1000000000, 2) AS totalGigabytesBilled,
        ROUND(SUM(COALESCE(totalBilledBytes,
              0)) / 1000000000000, 2) AS totalTerabytesBilled,
        ROUND(SAFE_DIVIDE(SUM(COALESCE(totalBilledBytes, 0)),
          1000000000000) * 5, 2) AS onDemandCost,

But the pricing unit used by Google to calculate the costs is the tebibyte, not the terabyte:

  • Kibibyte (KiB) 1024¹ = 1,024
  • Mebibyte (MiB) 1024² = 1,048,576
  • Gibibyte (GiB) 1024³ = 1,073,741,824
  • Tebibyte (TiB) 1024⁴ = 1,099,511,627,776
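A hedged sketch of the corrected expressions, mirroring the style of the snippet above (the $5 multiplier is the per-TiB rate already used there; substitute your region's current on-demand price):

        ROUND(SUM(COALESCE(totalBilledBytes, 0)) / POW(1024, 4), 2) AS totalTebibytesBilled,
        ROUND(SAFE_DIVIDE(SUM(COALESCE(totalBilledBytes, 0)), POW(1024, 4)) * 5, 2) AS onDemandCost,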

I have used this query with the exported billing records table to check this:

select
  service.description as service_description, 
  sku.description as sku_description,
  usage.unit as usage_unit,
  usage.pricing_unit as usage_pricing_unit,
  count(*) as count,
  sum(cost) as cost_eur,
  sum(cost / currency_conversion_rate) as cost_usd,
  sum(usage.amount) as usage_amount,
  sum(usage.amount_in_pricing_units) as amount_in_pricing_units,
  sum(cost / currency_conversion_rate) / sum(usage.amount_in_pricing_units) as cost_per_pricing_unit_usd,
  sum(usage.amount) / sum(usage.amount_in_pricing_units) as amount_to_pricing_unit_conversion_rate,
  pow(1024, 4)  as tebibyte_in_bytes
from `<MY_BILLING_EXPORT_TABLE>`
where 1=1
  and _partitiontime >= timestamp_sub(current_timestamp, interval 7 day)
  and service.description = "BigQuery"
group by service.description, sku.description, usage_unit, usage_pricing_unit
order by service.description, sku.description 

BTW: Awesome that you have collected these SQL statements here, this will really help me! :-)
