
bigquery-optimization-queries's Introduction

Introduction

Note that these queries go along with an ebook about optimizing BigQuery costs written by Sayle Matthews of DoiT International, as well as the accompanying DoiT Engineering Blog series.

These queries assist in optimizing BigQuery usage in projects and organizations. Most focus on costs, but a few focus on concurrency, and some recommend whether a query would run better under an on-demand or flat-rate pricing scheme.

Usage

In each file is a variable called interval_in_days, which is the number of days in the past that the query will look at when doing its work. To increase or decrease the amount of data processed by the query, just change this value and it will be reflected throughout the rest of the query, as illustrated in the sketch below.

Note that the default value was chosen to balance a useful timeframe against the amount of data the query processes.
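For illustration, here is a minimal sketch of how such a look-back variable might be declared and used at the top of a query; the 7-day default and the INFORMATION_SCHEMA view shown here are assumptions for the example, not an excerpt from the repository's files.

-- Sketch: a look-back window controlled by a single variable.
DECLARE interval_in_days INT64 DEFAULT 7;  -- assumed default; adjust as needed

SELECT job_id, total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL interval_in_days DAY);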

Additionally, each file is named after what it does and has a comment at the top explaining what the query does.

Note on Costs

Some of these queries can process through a LOT of data, so it's HIGHLY recommended to verify the estimated cost of each query before running it. Depending upon how much usage your dataset sees over the specified timeframe, this could easily be upwards of tens of gigabytes, if not more, per query.

Blindly running these queries is at your own risk! It is strongly recommended to reduce the interval_in_days value when a query will be processing a very large amount of data.

Audit Log vs Information Schema

The queries are broken up into two folders: audit_log and information_schema. These correspond to the different schemas that may need to be queried. In general, most people will use the information_schema queries because they do not have a BigQuery audit log sink set up.

If you already have an audit log sink set up for BigQuery (or are a DoiT customer with the BQ Lens feature enabled), read the blog entry for a detailed guide on how to discover the location of your sink.

Note that if you are currently a DoiT International customer and have the BQ Lens feature enabled in your Cloud Management Platform (CMP), then you should use the audit log queries, as you will already have the tables created.

Generating Queries for Your Project

If you look at the queries you will see placeholders for the project name and dataset location in the SQL code. These need to be replaced with the correct values prior to running the queries.

To make this easier, the repository contains a file called generate_sql_files.py. It performs a search-and-replace operation on all of the .sql files with your specified values.

generate_sql_files.py Usage

generate_sql_files.py [--location <dataset-location>] <project> <output-directory>

Note that this will create a copy of the files, with the placeholders filled in, in the directory you specify as the output-directory.

location:
This is the location of your BigQuery dataset. Note the format is the same as BigQuery's, such as 'region-us' for the US multi-region or 'us-central1' for the US Central 1 (Iowa) region. The default value is 'region-us' if you do not specify anything.

project:
This is the name of the GCP project you are going to be querying against.

output-directory:
The directory where you would like the generated files to be stored. The script will attempt to create this directory if it doesn't already exist. Note that this should not be the same directory as the one where the script is located.
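For example, a hypothetical invocation that generates the queries for a project named my-gcp-project into a directory named generated_queries, using the default US multi-region location (the project and directory names here are placeholders for illustration only):

generate_sql_files.py --location region-us my-gcp-project generated_queries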

Contributing

If you see any bugs, please feel free to reach out or open a pull request against the code.

bigquery-optimization-queries's People

Contributors

glen-park-doit, jisoo411, jkdabbert, sayle-doit


bigquery-optimization-queries's Issues

BigQuery slot computation: don't "average" the slots

When doing some aggregations to compute the amount of slots per unit of time (as in the queries query_slots_per_second.sql and slots_by_second.sql), the slots per unit of time are assessed by taking a sort of average of the slot usage ("SAFE_DIVIDE(total_slot_ms, ...)").

This makes the (false) assumption that the slot usage is constant over the whole duration of the query.

Instead of using the JOBS_BY_PROJECT tables, it is possible to use the JOBS_TIMELINE tables, which give better insight into what happens each second.
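For reference, here is a hedged sketch (not one of the repository's queries) of how per-second slot usage can be derived from the JOBS_TIMELINE view, whose period_slot_ms column reports the slot milliseconds consumed in each one-second period of a job:

-- Sketch: average slots in use during each one-second period.
SELECT
  period_start,
  SUM(period_slot_ms) / 1000 AS slots_used
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE job_creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND statement_type != 'SCRIPT'  -- see the double-counting issue below
GROUP BY period_start
ORDER BY period_start;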

information_schema/top_costly_queries.sql - serious duplications counting / summing happens

Hey, just stumbled upon this repository, very useful idea. However, when looking at the results of this query we see heavy duplications occurring.
A query that was triggered 1,141 times in the past 14 days appears in the result as if it was triggered over 1 million times.
The cost for that query alone comes out as if it were 100 times greater than our total project cost, which definitely made no sense.

I believe the issue comes from the cross join between src1 and src2 resulting in many duplications for each job, but I hope I'm understanding the query correctly.
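For comparison, here is a hedged sketch of an alternative formulation (an assumption, not the repository's query) that aggregates directly from JOBS_BY_PROJECT, so each job contributes its billed bytes exactly once per query text:

-- Sketch: cost per distinct query text without a self-join.
SELECT
  query,
  COUNT(*) AS run_count,
  SUM(total_bytes_billed) AS total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 14 DAY)
  AND job_type = 'QUERY'
GROUP BY query
ORDER BY total_bytes_billed DESC;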

BigQuery slot computation: double counting

When doing some aggregations to compute the amount of slots per unit of time (as in the queries query_slots_per_second.sql and slots_by_second.sql), the following filter should be added to the WHERE condition:

statement_type != 'SCRIPT'

Otherwise, the amount of slots will be counted twice in some situations. This is mentioned in the documentation at https://cloud.google.com/bigquery/docs/information-schema-jobs-timeline:

For script jobs, the parent job also reports the total slot usage from its children jobs. To avoid double counting, use WHERE statement_type != "SCRIPT" to exclude the parent job.
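Applied to the aggregation queries in this repository, the fix amounts to adding that predicate to the WHERE clause. A hedged sketch of the shape (the column and view names are the standard INFORMATION_SCHEMA ones, not an excerpt from the repository):

-- Sketch: exclude parent script jobs so their children are not counted twice.
SELECT SUM(total_slot_ms) AS total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND statement_type != 'SCRIPT';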

BigQuery pricing unit for onDemandCost is tebibyte instead of terabyte

Many queries used in this repo calculate the on-demand costs and billed bytes/MB/GB, as in this example:

       SUM(COALESCE(totalBilledBytes, 0)) AS totalBilledBytes,
        ROUND(SUM(COALESCE(totalBilledBytes,
              0)) / 1000000, 2) AS totalMegabytesBilled,
        ROUND(SUM(COALESCE(totalBilledBytes,
              0)) / 1000000000, 2) AS totalGigabytesBilled,
        ROUND(SUM(COALESCE(totalBilledBytes,
              0)) / 1000000000000, 2) AS totalTerabytesBilled,
        ROUND(SAFE_DIVIDE(SUM(COALESCE(totalBilledBytes, 0)),
          1000000000000) * 5, 2) AS onDemandCost,

But the pricing unit used by Google to calculate the costs is the tebibyte, not the terabyte:

  • Kibibyte (KiB) 1024¹ = 1,024
  • Mebibyte (MiB) 1024² = 1,048,576
  • Gibibyte (GiB) 1024³ = 1,073,741,824
  • Tebibyte (TiB) 1024⁴ = 1,099,511,627,776
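A hedged sketch of the corrected expressions, mirroring the style of the snippet above (the $5 multiplier is the per-TiB rate already used there; substitute your region's current on-demand price):

        ROUND(SUM(COALESCE(totalBilledBytes, 0)) / POW(1024, 4), 2) AS totalTebibytesBilled,
        ROUND(SAFE_DIVIDE(SUM(COALESCE(totalBilledBytes, 0)), POW(1024, 4)) * 5, 2) AS onDemandCost,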

I have used this query with the exported billing records table to check this:

select
  service.description as service_description, 
  sku.description as sku_description,
  usage.unit as usage_unit,
  usage.pricing_unit as usage_pricing_unit,
  count(*) as count,
  sum(cost) as cost_eur,
  sum(cost / currency_conversion_rate) as cost_usd,
  sum(usage.amount) as usage_amount,
  sum(usage.amount_in_pricing_units) as amount_in_pricing_units,
  sum(cost / currency_conversion_rate) / sum(usage.amount_in_pricing_units) as cost_per_pricing_unit_usd,
  sum(usage.amount) / sum(usage.amount_in_pricing_units) as amount_to_pricing_unit_conversion_rate,
  pow(1024, 4)  as tebibyte_in_bytes
from `<MY_BILLING_EXPORT_TABLE>`
where 1=1
  and _partitiontime >= timestamp_sub(current_timestamp, interval 7 day)
  and service.description = "BigQuery"
group by service.description, sku.description, usage_unit, usage_pricing_unit
order by service.description, sku.description 

BTW: Awesome that you have collected these SQL statements here, this will really help me! :-)
