googlecloudplatform / bigquery-utils

Useful scripts, udfs, views, and other utilities for migration and data warehouse operations in BigQuery.

Home Page: https://cloud.google.com/bigquery/

License: Apache License 2.0

Shell 3.79% Python 25.15% Kotlin 7.04% Java 39.01% HTML 0.19% JavaScript 12.47% TypeScript 3.22% Starlark 1.06% C++ 5.95% Dockerfile 0.08% HCL 1.87% Go 0.16%
bigquery google-cloud-platform data-warehouse sql utilities

bigquery-utils's Introduction

BigQuery Utils

BigQuery is a serverless, highly-scalable, and cost-effective cloud data warehouse with an in-memory BI Engine and machine learning built in. This repository provides useful utilities to assist you in migration and usage of BigQuery.

Open in Cloud Shell

Getting Started

This repository is broken up into:

  • Dashboards - Pre-built dashboards for common use cases
  • Performance Testing - Examples for doing performance testing
    • JMeter - Examples for using JMeter to test BigQuery performance
  • Scripts - Python, Shell, & SQL scripts
    • billing - Example queries over the GCP billing export
    • optimization - Scripts to help identify areas for optimization in your BigQuery warehouse.
  • Stored Procedures - Example stored procedures
  • Third Party - Relevant third party libraries for BigQuery
    • compilerworks - BigQuery UDFs which mimic the behavior of proprietary functions in other databases
  • Tools - Custom tooling for working with BigQuery
  • UDFs - User-defined functions for common usage as well as migration
  • Views - Views over system tables such as audit logs or the INFORMATION_SCHEMA
    • query_audit - View to simplify querying the audit logs which can be used to power dashboards (example).

Public UDFs

For more information on UDFs and using those provided in the repository with BigQuery, see the README in the udfs folder.

Contributing

See the contributing instructions to get started contributing.

To contribute UDFs to this repository, see the instructions in the udfs folder.

License

Except as otherwise noted, the solutions within this repository are provided under the Apache 2.0 license. Please see the LICENSE file for more detailed terms and conditions.

Disclaimer

This repository and its contents are not an official Google Product.

bigquery-utils's People

Contributors

1987yama3, afleisc, apstndb, bdw-g, christippett, creativitry, damian-compilerworks, danieldeleo, drab, epishova, franklinwhaite, freedomofnet, hackerpilot, hashkanna, krishsuchdev, mingen-pan, mswapnilg, noah-kuo, paolomorandini, pdunn, phillcoletti, ron-gal, ryanmcdowell, sdepablos, spoiledhua, stankiewicz, takegue, vaishnavijha786, vicenteg, weifonghsia


bigquery-utils's Issues

Datastudio error

I'm trying to use the dashboard.

OVERALL USAGE :
User Configuration Error
This data source was improperly configured.

Invalid argument type.

Error ID: 2ae05114

and for all the other ones, like By Job ID, I'm getting:

Too Many Rows
Due to the number of rows, the chart can't be rendered.

Sorry, please try other metrics or dimensions with fewer rows, try a different data source or add a filter.

even if I only select today's date...

(Note: we currently don't have any reservation; we are using public slots. The dashboard is meant to help us identify whether one is needed.)

Export data as protocol buffer does not work for more than one field

I tried the example from the documentation, which has a STRUCT<STRING, BIGNUMERIC>, and it does not produce anything. For a single string field, the UDF does return values.

This is the proto file I used:

syntax = "proto3";

package dummypackage;

message DummyMessage {
  string word = 1;
  int64 num_word_count = 2;
}

The UDF:

CREATE OR REPLACE FUNCTION
  dev.toMyProtoMessage(input STRUCT<field_1 STRING, field_2 INT64>)
  RETURNS BYTES
    LANGUAGE js OPTIONS ( library=["gs:/xxxxxx/bigquery-udf/pbwrapper.js"]
) AS r"""
let message = pbwrapper.setup("dummypackage.DummyMessage")
return pbwrapper.parse(message, input)
  """;

Thanks.

leaked client id?

It seems we are violating a pre-commit check about secrets on the repo, here:
https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/tools/unsupervised_dataset/sql_crawler/tests/resources/googleCloudSite.html#L4

I get the following error when trying to merge latest master to my fork.
@noah-kuo Can we add this to .gitallowed on the repo if intentional or redact it if not?

tools/unsupervised_dataset/sql_crawler/tests/resources/googleCloudSite.html:4:                    <meta name="google-signin-client-id" content="721724668570-nbkv1cfusk7kk4eni4pjvepaus73b13t.apps.googleusercontent.com">

[ERROR] Matched one or more prohibited patterns

Possible mitigations:
- Mark false positives as allowed using: git config --add secrets.allowed ...
- Mark false positives as allowed by adding regular expressions to .gitallowed at repository's root directory
- List your configured patterns: git config --get-all secrets.patterns
- List your configured allowed patterns: git config --get-all secrets.allowed
- List your configured allowed patterns in .gitallowed at repository's root directory
- Use --no-verify if this is a one-time false positive
F1028 14:06:49.201319 2415103 google_hook.go:84] git-secrets failed with error: exit status 1

Too many subqueries or query is too complex for "Job Concurrency Slow"

Hi, I've successfully copied over most of the data sources for BigQuery System Tables Reports but hit an issue with the "Job Concurrency Slow" datasource. When I swap the INFORMATION_SCHEMA_PUBLIC_V2 references for `region-{region_name}`.INFORMATION_SCHEMA and reconnect I see the error:

Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.

Is this expected? Is there a simplified query I can use instead?

Why is slot_count negative when action <> 'UPDATE'?

Thank you for providing this report, I've got it up and running against our data and I'm sure it will be very useful.

One problem though, our slot commitment is appearing as a negative number:
[screenshot]

It's because the query for the "Commitments Timeline" data source includes this logic:

SUM(CASE WHEN cccp.action = 'UPDATE' THEN cccp.slot_count ELSE cccp.slot_count * -1 END)

This is causing our one commitment to be returned from the data source query:

SELECT
  change_timestamp,
  commitment_plan,
  action,
  EXTRACT(DATE FROM change_timestamp) AS start_date,
  IFNULL(
    LEAD(DATE_SUB(EXTRACT(DATE FROM change_timestamp), INTERVAL 1 DAY))
      OVER (PARTITION BY state ORDER BY change_timestamp),
    CURRENT_DATE()) AS stop_date,
  SUM(CASE WHEN cccp.action = 'UPDATE' THEN cccp.slot_count ELSE cccp.slot_count * -1 END)
    OVER (
      PARTITION BY state
      ORDER BY change_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS slot_cummulative,
  ROW_NUMBER()
    OVER (
      PARTITION BY EXTRACT(DATE FROM change_timestamp)
      ORDER BY change_timestamp DESC
    ) AS rn
FROM
  region-eu.INFORMATION_SCHEMA.CAPACITY_COMMITMENT_CHANGES_BY_PROJECT AS cccp
WHERE
  state = 'ACTIVE'
  AND commitment_plan != 'FLEX'
ORDER BY change_timestamp

as:

change_timestamp:  2022-07-29 13:53:58.843000 UTC
commitment_plan:   MONTHLY
action:            CREATE
start_date:        2022-07-29
stop_date:         2022-09-23
slot_cummulative:  -1500
rn:                1

Can you explain why this logic to multiply by -1 if action <> 'UPDATE' exists? TIA

UDF creation order matters when UDFs call other UDFs

Need to re-design the build process to accommodate UDFs which invoke other UDFs. Currently the build process won't take this into consideration, so a UDF might fail creation because it's invoking another UDF which does not exist yet (hasn't been created by the build process).

BigQuery UDFs not available outside the US and documentation unclear?

Hi, #252 asked for the community UDFs to be made available in regions other than the US - I'm unable to find them.
I searched for bqutil in the BigQuery UI explorer component and got no results.

I searched for a specific function and found it in bigquery-public-data.persistent_udfs - which looks like the content of this repo's community UDFs, which I can't find mentioned anywhere in this repo's docs.

That dataset is in the US multi-region and I can't find an equivalent for the EU multi-region where I am.

  • am I doing something dumb and not finding the right functions?
  • are these functions available in the EU? #252 seemed to ask for them to be made available but maybe that got lost.
    • if not, can they be made available? It's a fairly big barrier for non-US SQL wranglers to have to deploy them somewhere

random_value UDF result is re-used by query optimizer when invoked in subquery

BigQuery query optimizer will invoke once and re-use the output of random_value UDF if the UDF is called within a SQL subquery.

https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/random_value.sql

Possible Solution:

random_value(arr ANY TYPE, col ANY TYPE) AS ((
  SELECT val[OFFSET(0)]
  FROM (
    SELECT [value, col] AS val
    FROM UNNEST(arr) value
    ORDER BY RAND()
    LIMIT 1
  )
));

Caveat: the second input arg must have the same type as the array elements in first arg

Security Policy violation Binary Artifacts

This issue was automatically created by Allstar.

Security Policy Violation
Project is out of compliance with Binary Artifacts policy: binaries present in source code

Rule Description
Binary Artifacts are an increased security risk in your repository. Binary artifacts cannot be reviewed, allowing the introduction of possibly obsolete or maliciously subverted executables. For more information see the Security Scorecards Documentation for Binary Artifacts.

Remediation Steps
To remediate, remove the generated executable artifacts from the repository.

Artifacts Found

  • tools/automatic_query_fixer/libs/calcite-babel-1.23.0-SNAPSHOT.jar
  • tools/automatic_query_fixer/libs/calcite-core-1.23.0-SNAPSHOT.jar
  • tools/automatic_query_fixer/libs/calcite-linq4j-1.23.0-SNAPSHOT.jar
  • tools/automatic_query_fixer/libs/zetasql_helper_client.jar
  • tools/unsupervised_dataset/sql_classifier/classifier/bin/calcite-bigquery.jar
  • tools/unsupervised_dataset/sql_classifier/classifier/bin/calcite-core.jar
  • tools/unsupervised_dataset/sql_classifier/classifier/bin/calcite-defaultdialect.jar
  • tools/unsupervised_dataset/sql_classifier/classifier/bin/calcite-dialect1.jar
  • tools/unsupervised_dataset/sql_classifier/classifier/bin/calcite-postgresql.jar

Additional Information
This policy is drawn from Security Scorecards, which is a tool that scores a project's adherence to security best practices. You may wish to run a Scorecards scan directly on this repository for more details.


Allstar has been installed on all Google managed GitHub orgs. Policies are gradually being rolled out and enforced by the GOSST and OSPO teams. Learn more at http://go/allstar

This issue will auto resolve when the policy is in compliance.

Issue created by Allstar. See https://github.com/ossf/allstar/ for more information. For questions specific to the repository, please contact the owner or maintainer.

Add report showing projects using on demand slots

Recently encountered an issue with a Customer where they needed to verify if on demand slots were being used for any projects. They wrote this query against INFORMATION_SCHEMA. Is this something we would consider adding to the system tables dashboard?

SELECT
  project_id,
  SUM(total_bytes_billed) AS bytes_billed,
  SUM(total_slot_ms) AS slot_ms
FROM region-us.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
  AND job_id NOT IN (
    SELECT parent_job_id
    FROM region-us.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
      AND parent_job_id IS NOT NULL
  )
  AND reservation_id IS NULL
  AND parent_job_id IS NULL
  AND error_result IS NULL
  AND job_type = "QUERY"
  AND state = "DONE"
  AND priority = "INTERACTIVE"
  AND total_bytes_billed > 0
GROUP BY project_id
ORDER BY bytes_billed DESC

Missing values are not taken into account in UDF "median".

Since this UDF does not take missing values into account, I believe this behavior should either be documented in "README.md" or the function should be modified.

Currently, when an array containing missing values is given, the array length is used as-is.
Therefore, in cases like the following, unintended results may be returned.
I think we need a way to remove or impute the missing values before the calculation.

Current source code

SELECT IF (
  MOD(ARRAY_LENGTH(arr), 2) = 0,
  (arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
  arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
)
FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)

Input

DECLARE arr ARRAY<INT64>;
SET arr = [1, 2, 3, 4, 5, 6, NULL];

Current result

3.0

The real result (Remove missing values)

3.5

So I would like to propose a process using the existing function "PERCENTILE_CONT" in the following form.

SELECT
  PERCENTILE_CONT(val, 0.5) OVER()
FROM UNNEST(arr) AS val 
LIMIT 1
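To make the discrepancy concrete, here is a small Python sketch (an illustration only, not the UDF itself) of the middle-element logic, showing how a NULL shifts the result:

```python
def median(arr):
    # Mirrors the UDF: average the two middle elements for even length,
    # otherwise take the single middle element of the (already sorted) array.
    n = len(arr)
    mid = n // 2
    if n % 2 == 0:
        return (arr[mid - 1] + arr[mid]) / 2
    return arr[mid]

values = [1, 2, 3, 4, 5, 6, None]

# Emulate BigQuery's ascending sort, where NULLs order first:
with_nulls = sorted(values, key=lambda x: (x is not None, 0 if x is None else x))
print(median(with_nulls))  # 3 -- the NULL inflates the length and shifts the middle

# Dropping the missing values first gives the intended result:
print(median([v for v in values if v is not None]))  # 3.5
```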

Divide by zero error - reservation_utilization_week.sql

SUM(jbo.total_slot_ms) / (1000 * 60 * 60 * 24 * 7) AS average_weekly_slot_usage,

On lines 73 & 75 this can raise a divide-by-zero error, so please wrap the divisions in SAFE_DIVIDE as below:

 SAFE_DIVIDE(SUM(jbo.total_slot_ms),(1000 * 60 * 60 * 24 * 7)) AS average_weekly_slot_usage,
  AVG(rsc.slot_capacity) AS average_reservation_capacity,
  (SAFE_DIVIDE(SAFE_DIVIDE(SUM(jbo.total_slot_ms), (1000 * 60 * 60 * 24 * 7)),AVG(rsc.slot_capacity))) AS reservation_utilization,
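For context, SAFE_DIVIDE returns NULL instead of raising an error when the denominator is zero. A quick Python sketch of that semantics (illustration only; the real fix is the SQL above):

```python
def safe_divide(x, y):
    # Mimics BigQuery's SAFE_DIVIDE: return None (NULL) instead of
    # raising an error when the denominator is zero.
    return None if y == 0 else x / y

print(safe_divide(10, 2))   # 5.0
print(safe_divide(10, 0))   # None -- no "division by zero" error
```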

Discussion: Create bqutil UDFs in all other non-US datasets

I often implement BigQuery SQL for Google Analytics 4.
And these UDFs are very helpful in that implementation.

However, a single query cannot cross regions, and these useful UDFs are only available in the US region, which is frustrating.
I often work for clients in Japan, and in many cases their datasets are located in the Japan region.

Any ideas for deploying UDFs to other regions?
For example, bqutil.fn_ane1.get_value() and bqutil.fn_tyo.get_value() are suggested.
ane1 means asia-northeast-1, and TYO is the city code for Tokyo.

Thanks,

Bug: Jaccard UDF giving incorrect results in some cases

I found a bug in the Jaccard distance function https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/community/jaccard.sqlx (the function below is a copy of the Jaccard UDF, cast as a TEMP FUNCTION).

CREATE TEMP FUNCTION Jaccard(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS r"""
  let intersectSize = 0;

  if(a && b) {
    // Split word into characters, and make characters unique
    const sa = [...new Set(String(a).split(''))]
    const sb = [...new Set(String(b).split(''))]

    const la = (sa.length > sb.length) ? sa.length : sb.length
    const lb = (sa.length > sb.length) ? sb.length : sa.length

    for(let i = 0; i < la; i++) {
      for(let j = 0; j < lb; j++) {
        intersectSize += (sa[i] === sb[j]) ? 1 : 0
      }
    }

    // Compute Jaccard distance
    const union = (sa.length + sb.length) - intersectSize
    return parseFloat(((intersectSize / union) * 100 + Number.EPSILON) / 100).toFixed(2)
  } else {
    return 0.0
  }
""";

with datas as (
select 
'12' as a,
'132' as b
)

select
 Jaccard(a, b)
from datas

This should give 2/3 ≈ 0.67 (the intersection has size 2 and the union has size 3), but instead gives 0.25.

I believe the bug can be fixed by replacing

    const la = (sa.length > sb.length) ? sa.length : sb.length
    const lb = (sa.length > sb.length) ? sb.length : sa.length

with

    const la = sa.length
    const lb = sb.length

With the current version of these length variables, the array indices can go out of bounds, so matches are missed.
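For reference, a minimal Python sketch (illustration only; the actual UDF is JavaScript) of the corrected set-based computation, which sidesteps the loop-bounds problem entirely:

```python
def jaccard(a: str, b: str) -> float:
    # Character-set Jaccard similarity with the loop-bounds bug removed:
    # compute the intersection with set operations instead of mismatched loops.
    if not a or not b:
        return 0.0
    sa, sb = set(a), set(b)
    intersect = len(sa & sb)
    union = len(sa) + len(sb) - intersect
    return round(intersect / union, 2)

print(jaccard('12', '132'))  # 0.67, the expected 2/3
```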

Average slots calculation in job_execution.sql

In the SQL, it calculates average slots as:

total_slot_ms / (TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots

Should we use SUM(timeline.elapsed_ms) instead of "end_time - start_time"? It seems elapsed_ms does not include waiting time. Thanks!
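To illustrate the difference between the two denominators, a small Python sketch with made-up numbers (all values here are hypothetical, purely for illustration):

```python
# Illustrative (made-up) numbers for one job:
total_slot_ms = 7_200_000    # slot-milliseconds consumed by the job
start_to_end_ms = 120_000    # end_time - start_time, includes any wait time
active_elapsed_ms = 90_000   # hypothetical SUM(timeline.elapsed_ms), active time only

print(total_slot_ms / start_to_end_ms)    # 60.0 average slots, diluted by waiting
print(total_slot_ms / active_elapsed_ms)  # 80.0 average slots while actually running
```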

Request: Handling for objects in json_extract_values function

Issue:
bqutil.fn.json_extract_values returns unusable values ("[object Object]") when children are objects / sub-collections.

Example:
[screenshot]

Sample query:

-- (select body from`bigquery-public-data.geo_whos_on_first.geojson` limit 100 )

with geojson as 
(select """{"id": 1259392143, "type": "Feature", "properties": {"edtf:cessation": "uuuu", "edtf:inception": "uuuu", "geom:area": 0.0, "geom:area_square_m": 0.0, "geom:bbox": "116.97547,32.41412,116.97547,32.41412", "geom:latitude": 32.41412, "geom:longitude": 116.97547, "gn:admin1_code": "1.0", "gn:asciiname": "Liuyingcun", "gn:country_code": "CN", "gn:dem": 25, "gn:feature_class": "P", "gn:feature_code": "PPL", "gn:geonameid": 10417093, "gn:latitude": 32.41412, "gn:longitude": 116.97547, "gn:modification_date": "2015-08-08", "gn:name": "Liuyingcun", "gn:population": 0, "gn:timezone": "Asia/Shanghai", "iso:country": "CN", "mz:hierarchy_label": 1, "mz:is_current": -1, "name:zho_x_preferred": ["\u5218\u90e2\u6751"], "src:geom": "geonames", "wof:belongsto": [85669739, 102191569, 85632695, 136253041, 890516233], "wof:breaches": [], "wof:concordances": {"gn:id": 10417093}, "wof:country": "CN", "wof:geomhash": "cabdc55714fc98313d2246b1c4f7fb0b", "wof:hierarchy": [{"continent_id": 102191569, "country_id": 85632695, "county_id": 890516233, "empire_id": 136253041, "locality_id": 1259392143, "region_id": 85669739}], "wof:id": 1259392143, "wof:lastmodified": 1537613321, "wof:name": "Liuyingcun", "wof:parent_id": 890516233, "wof:placetype": "locality", "wof:repo": "whosonfirst-data", "wof:superseded_by": [], "wof:supersedes": [], "wof:tags": []}, "bbox": [116.97547, 32.41412, 116.97547, 32.41412], "geometry": {"coordinates": [116.97547, 32.41412], "type": "Point"}}""" as body)

SELECT bqutil.fn.json_extract_keys(body) as jkeys, bqutil.fn.json_extract_values(body) as jvalues
FROM geojson

I used a single row here because the public dataset is 14GB and unpartitioned.

Desired Output:
bqutil.fn.json_extract_values returns a usable value (even just a JSON string) of the contents of those objects.

Thanks!

Add UDFs for generating deep links to tables and jobs

Table Deep Link: https://console.cloud.google.com/bigquery?p=<project_id>&d=<dataset_id>&t=<table_id>&page=table

Job Deep Link: https://console.cloud.google.com/bigquery?project=<projectId>&j=bq:<location>:<jobId>
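A sketch of what such UDFs could compute, expressed in Python for illustration (the helper names and example IDs are hypothetical; the requested UDFs themselves would be SQL):

```python
def table_deep_link(project_id: str, dataset_id: str, table_id: str) -> str:
    # Fills the Table Deep Link template above.
    return ("https://console.cloud.google.com/bigquery"
            f"?p={project_id}&d={dataset_id}&t={table_id}&page=table")

def job_deep_link(project_id: str, location: str, job_id: str) -> str:
    # Fills the Job Deep Link template above.
    return ("https://console.cloud.google.com/bigquery"
            f"?project={project_id}&j=bq:{location}:{job_id}")

print(table_deep_link("my-project", "my_dataset", "my_table"))
print(job_deep_link("my-project", "US", "job_123"))
```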

Issue Accessing Custom JS Libraries specified in js_libs.yaml

Hi,

I am trying to package more npm js libraries and use them in UDFs. I have modified js_libs.yaml to include the following:

compromise:
  versions:
    - 11.14.3
js-levenshtein:
  versions:
    - 1.1.6
jstat:
  versions:
    - 1.9.3
    - 1.9.4
languagedetect:
  versions:
    - 2.0.0
cld:
  versions:
    - 2.8.4

Then I was using cloud shell to deploy via export BQ_LOCATION=US-CENTRAL1 && bash deploy.sh

After that I tried testing that I could access the new npm libraries, but I didn't have the permissions.

The default ones (js-levenshtein, jstat, compromise) worked, but not the new ones.

$ gsutil ls gs://bqutil-lib/bq_js_libs/js-levenshtein-v1.1.6.js
gs://bqutil-lib/bq_js_libs/js-levenshtein-v1.1.6.js
$ gsutil ls gs://bqutil-lib/bq_js_libs/languagedetect-v2.0.0.min.js
AccessDeniedException: 403 [email protected] does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist).

For the new ones (languagedetect, cld), I am not able to list or copy the packaged JS, and I am also not able to use it in place inside my BigQuery UDF.

Can you assist? Are there some permissions that need to be set so that I can read the new packages I request?

Also, is there a way to log the explicit path of any custom JS packages that are in the yaml file? I eventually found the code that specified that js-levenshtein should not be minimized, but it was not obvious why some were min.js and others were not.

Would also be helpful to have some logging to let the user know if the JS package name they entered was incorrect and that the package was not successfully retrieved from npm.

Thanks!
Jonathan

Errors while creating the BigQuery audit logs view

Hello,
Following the step-by-step tutorial (https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/views/audit), when running 'gcloud alpha logging sinks create' with the provided filter, the sink is created but the tables "cloudaudit_googleapis_com_data_access" and "cloudaudit_googleapis_com_activity" are not created in the mentioned dataset.

Running the same command without the filter creates the two tables.

Feature Request: Authorized View Generator / Refresher script

Often customers have a source table and an authorized view which drops a few blacklisted fields.
Sometimes the schema of the underlying table changes and the view definition drops fields it shouldn't.

Let's create a script that will pull the schema from an existing table and drop a list of blacklisted fields (or sub-fields) and generate the SQL for a view.

The script should take the following inputs:

  • source table (fully qualified project.dataset.table)
  • destination view ( fully qualified project.dataset.view)
  • list of fields to drop

Considerations:

  • Handle dropping fields from within structs in a nested schema.
  • Implementing this in Golang would be nice so it can be a compiled CLI utility and for ease of usage from terraform modules.

Nice to Have:

  • Include automation to create the view and reinstate its authorization on the source dataset.

Ugly Python Prototype: gist

System Table Dashboards - Load Slot Estimator

Hi there,

Apologies if I am just missing the obvious, but I am trying to create a copy of the system tables dashboard, and I can't find one of the tables referred to by the Load Slot Estimator resource.

The underlying query by default references INFORMATION_SCHEMA_PUBLIC_V1.LOAD_SLOT_ESTIMATOR but I could not find an equivalent table or any reference to it in projects / org.

Thanks.

Documentation for new users is difficult to follow

I've never used SQL, but I have a need to here, as I want to implement ts_tumble in BigQuery to reduce my data for graphing.

I tried following the user guide in Cloudshell Terminal, but I got stuck at:

user_name@cloudshell:~/cloud-console-sample-build (project_name)$ gcloud builds submit . --config=deploy.yaml --substitutions _BQ_LOCATION=US
ERROR: (gcloud.builds.submit) Unable to read file [deploy.yaml]: [Errno 2] No such file or directory: 'deploy.yaml'

I suspected it was because I hadn't cloned git to my project, so I ran:
git clone https://github.com/GoogleCloudPlatform/bigquery-utils.git

But got the same error for the deploy command, and still haven't figured out how to run the UDF in the SQL workspace.

Apologies for the broad issue, but perhaps more users could benefit from a documentation upgrade.

Thanks!

Incorrect BigQuery billing analytics SQL script

Issue

BigQuery billing analytics monthly_invoice_costs_credits.sql script uses join to unnest credits and adds them to total cost. This produces duplicate cost rows and incorrect total cost when there is more than one credit applied to the same line item.

[screenshot]

Fix

The join needs to be replaced with a subquery to prevent the duplication of rows, which fixes the total cost. I have the fix branch ready but I don't have access to push it; please let me know the steps to create a PR.

UDF: "url_param" is missing for first query

According to the README.md, I expect the following to return "bigquery + udf". However, it actually returns NULL.

SELECT bqutil.fn.url_param(
  'https://www.google.com/search?q=bigquery+udf&client=chrome', 'q')

The reason seems to be:

  • The implementation expects only the query-string part of the URL as its argument.
  • The sample usage, however, passes the entire URL, not just the query string.
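The behavior the reporter expected can be sketched in Python with the standard library (an illustration of the expectation, not the UDF's actual implementation):

```python
from urllib.parse import urlparse, parse_qs

def url_param(url: str, key: str):
    # Parse the full URL first, then look the key up in its query string,
    # so passing an entire URL works as the reporter expected.
    params = parse_qs(urlparse(url).query)
    return params[key][0] if key in params else None

print(url_param('https://www.google.com/search?q=bigquery+udf&client=chrome', 'q'))
# bigquery udf
```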

Missing Datasource for Hourly Utilization Heatmap section

I see that an "Average Hourly Usage" heatmap section was recently added to the "Hourly Usage Report" dashboard.

When I copy the dashboard, the heatmap section points to a "Hourly Utilization Heatmap" data source, but I am not able to find the corresponding Looker Studio data source or the SQL query behind it.

Could you please confirm whether this data source is missing and fix it?

[screenshot]

Some of the hyperlinks are not working

Hi team, I found that several links in this repo are not working; a couple of examples are below:

[screenshots of broken links]

(I remember the public dashboard was working a couple of weeks ago, but somehow it's missing now as well.)

This repo is super useful for BigQuery monitoring so hopefully it can be fixed soon.

Thanks!

Dataset - Job Concurrency Slow Job - change_timestamp Unavailable in INFORMATION_SCHEMA.RESERVATIONS

When copying the dataset "[PUBLIC] Job Concurrency - Slow Job", there is logic to order the results returned from INFORMATION_SCHEMA.RESERVATIONS by column "change_timestamp". There is no timestamp column available in INFORMATION_SCHEMA.RESERVATIONS. It appears that reservation_id should be the join between the two tables but the values do not match between INFORMATION_SCHEMA.RESERVATIONS and INFORMATION_SCHEMA.JOBS without some string manipulation which is inconsistent between rows.

There is also a second column "action" that is also unavailable in INFORMATION_SCHEMA.RESERVATIONS. There appears to be no soft delete of the reservation record which would be problematic in historical reporting.

This same logic is problematic for both the organization_capacity and reservation_capacity common table expressions

Make Dataset Location Configurable

Issue
Currently, the dataset location to which functions are deployed defaults to US in the build.sh script, with no option to override it via an input parameter. Add a way to override this default.

Workaround
Those who have this issue can work around it by manually updating the location in the build script and executing it.

Create bqutil UDFs in all other non-US datasets

I ran into some issues when trying to use CTEs in combination with bqutil.

This executes as expected:

SELECT `bqutil.fn.median`([1,1,1,2,3,4,5,100,1000]) as median

However, after adding a CTE:

WITH covid AS (
  SELECT date, daily_confirmed_cases 
  FROM `bigquery-public-data.covid19_ecdc_eu.covid_19_geographic_distribution_worldwide` 
)
SELECT `bqutil.fn.median`([1,1,1,2,3,4,5,100,1000]) as median

BQ throws the error: "Function not found: bqutil.fn.median".

Is there a way to explicitly import the BQ utils, or are there any other suggestions to address this issue?
