
snowplow-bigquery-loader's Introduction

Snowplow BigQuery Loader


This project contains applications used to load Snowplow enriched data into Google BigQuery.

Quickstart

Assuming git and SBT are installed:

$ git clone https://github.com/snowplow-incubator/snowplow-bigquery-loader
$ cd snowplow-bigquery-loader
$ sbt "project loader" test
$ sbt "project streamloader" test
$ sbt "project mutator" test
$ sbt "project repeater" test

Benchmarks

This project comes with sbt-jmh.

To run a specific benchmark test:

$ sbt 'project benchmark' '+jmh:run -i 20 -wi 10 -f2 -t3 .*TransformAtomic.*'

Or, to run all benchmark tests (once more are added):

$ sbt 'project benchmark' '+jmh:run -i 20 -wi 10 -f2 -t3'

The numbers of warm-up and measurement iterations are what the sbt-jmh project recommends, but they can be lowered for faster runs.

To see all sbt-jmh options: jmh:run -h.

Add new benchmarks to the benchmark module.

Building fatjars

You can build the jar files for Mutator, Repeater and Streamloader with sbt like so:

$ sbt clean 'project mutator' assembly
$ sbt clean 'project repeater' assembly
$ sbt clean 'project streamloader' assembly

Find out more

  • Technical Docs
  • Setup Guide
  • Contributing

Copyright and License

Snowplow BigQuery Loader is copyright 2018-2023 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

snowplow-bigquery-loader's People

Contributors

aldemirenes, alexanderdean, benjben, chuwy, colmsnowplow, dilyand, istreeter, lmath, oguzhanunlu, peel, pondzix, simplylizz, spenes, the-fine


snowplow-bigquery-loader's Issues

Add contexts flattening

If a user defines a context with an array root type, then the resulting LoaderRow contains the following structure:

"contexts_com_acme_mycontext_1_0_0": [
  [
    {
      "key": "foo",
      "value": "bar"
    },
    {
      "key": "one",
      "value": "two"
    }
  ]
]

The first-level array exists because contexts always have zero-or-more cardinality; the second-level array is the actual context.

This structure is not supported by BigQuery, which means we have to flatten it into a single-level array.
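
For example, the structure above could be flattened to a single-level array (a sketch of the intended result, not the loader's current output):

"contexts_com_acme_mycontext_1_0_0": [
  {
    "key": "foo",
    "value": "bar"
  },
  {
    "key": "one",
    "value": "two"
  }
]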

Repeater: consider reducing nack logging

Repeater checks whether an event is old enough to make an insertion attempt, and it does so after every pull, which happens very often. As a result, we get messages like:

DEBUG com.snowplowanalytics.snowplow.storage.bigquery.repeater.Repeater - Event fcbcc0b1-e51b-488b-9a90-7403cdfb19f4/2020-02-19T21:53:36.943Z is not ready yet. Nack

Printed every 300 milliseconds.

However, this is logged at DEBUG level, so users can technically suppress it.
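
For example, assuming the logging backend is Logback (an assumption; the actual logging setup may differ), a logback.xml override can silence this particular logger:

<!-- Hypothetical logback.xml snippet: raise this logger above DEBUG -->
<logger name="com.snowplowanalytics.snowplow.storage.bigquery.repeater.Repeater" level="INFO"/>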

Add Bintray credentials to .travis.yml

  • BINTRAY_SNOWPLOW_GENERIC_USER (unencrypted?)
  • BINTRAY_SNOWPLOW_GENERIC_API_KEY

Important note: we'll be publishing tar.gz archives with executables there, not fatjars, because Scio/Beam does not work with sbt-assembly.

We'll probably also need Docker registry credentials, as @BenFradet and I are thinking about using sbt-native-packager and its Docker integration to publish an image straight away.
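
For reference, a sketch of how such encrypted variables are typically added with the Travis CLI (the values shown are placeholders):

$ travis encrypt BINTRAY_SNOWPLOW_GENERIC_USER=<user> --add env.global
$ travis encrypt BINTRAY_SNOWPLOW_GENERIC_API_KEY=<key> --add env.global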

How do you actually run snowplow-bigquery-loader?

I don't know how to create an executable out of the git repo. I am clearly missing something, but after cloning and running the given command:

./snowplow-bigquery-loader \
    --config=$CONFIG \
    --resolver=$RESOLVER

the executable isn't found, which makes sense because I don't see it anywhere in the repo.
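
One way to get a runnable artifact is to build a fatjar as described in the "Building fatjars" section and run it directly. A sketch (the jar path and name are assumptions and vary by version):

$ sbt clean 'project streamloader' assembly
$ java -jar <path-to-assembled-jar> \
    --config=$CONFIG \
    --resolver=$RESOLVER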

Attach a loading timestamp

The most reliable way would be to achieve it with a SQL function, but it doesn't look like streaming inserts provide that kind of API.
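
A client-side alternative would be to stamp each row just before the streaming insert. A minimal Scala sketch, not the loader's actual code (the load_tstamp column name is a hypothetical example):

import java.time.Instant

// Streaming inserts offer no server-side default-timestamp hook,
// so stamp each row client-side at insert time.
def withLoadTstamp(row: Map[String, String]): Map[String, String] =
  row + ("load_tstamp" -> Instant.now().toString)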

Batch Mode doesn't send anything to BigQuery

I have configured the loader to send data every 100 seconds. I can see the records in Dataflow heading into the BigQuery write job; however, they never get written to BigQuery. Why would this be? If you need further information to help, please ask.

Thanks

Repeater: mention life span in statistics logs

Currently, Repeater's statistics message looks like the following:

Statistics: 43 rows inserted, 114 rows rejected

This message could be totally fine if Repeater has been running for a month and rejected just 114 rows, but it can also be alarming if it rejected 114 rows in an hour. To figure out how long it has been running, we need to manually check when it was launched and subtract that from the log entry timestamp.

It would be really nice if it were:

Statistics: 43 rows inserted, 114 rows rejected in last 4 hours
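
A minimal Scala sketch of how the life span could be tracked (this assumes nothing about Repeater's internals):

import java.time.{Duration, Instant}

// Recorded once at startup.
val launchedAt: Instant = Instant.now()

// Include the elapsed life span in every statistics message.
def statisticsMessage(inserted: Long, rejected: Long): String = {
  val uptime = Duration.between(launchedAt, Instant.now())
  s"Statistics: $inserted rows inserted, $rejected rows rejected in last ${uptime.toHours} hours"
}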

Support GEOGRAPHY types

Given that BigQuery now has some geography support, does it make sense to convert the latitude/longitude columns to a single GEOGRAPHY column built with ST_GEOGPOINT?
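
For illustration, a BigQuery SQL sketch (the table name is hypothetical, geo_latitude and geo_longitude are assumed to be the atomic event columns, and note that ST_GEOGPOINT takes longitude first):

SELECT ST_GEOGPOINT(geo_longitude, geo_latitude) AS geo_point
FROM `my-project.my_dataset.events`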

Add support for GCP labels in the Dataflow launcher Docker image

When deploying more than one pipeline on the same GCP project, there is no way to know where the costs are coming from.

GCP supports labeling resources and exporting the billing information into BigQuery. Costs can then be broken down using SQL queries that filter by label.

For reference, the google documentation on labels: https://cloud.google.com/compute/docs/labeling-resources

The plan is to label all the resources with "sp_env=prod1" or "sp_env=qa1".
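
Beam's Dataflow runner exposes a --labels pipeline option that accepts a JSON map of label keys to values. A hedged sketch, reusing the loader invocation shown elsewhere in this document (exact flag support should be verified against the Beam version in use):

$ ./snowplow-bigquery-loader \
    --config=$CONFIG \
    --resolver=$RESOLVER \
    --labels='{"sp_env": "prod1"}'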

Use sbt-scalafmt plugin

To add the plugin:

addSbtPlugin("org.scalameta" % "sbt-scalafmt" % "2.3.2")

And a .scalafmt.conf file can be found here.
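
For reference, a minimal .scalafmt.conf sketch (the settings shown are illustrative, not the project's actual configuration):

version = "2.3.2"
maxColumn = 100
align = more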

Batch Loading stops when one file fails

java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to create load job with id prefix <REDACTED>, reached max retries: 3, last failed load job

Once this error occurs, no more data is inserted; it seems to block any further inserts. Should this fail gracefully and move on to the next file?

Replace FLOAT types with NUMERIC

By replacing some of the current transaction FLOAT types with NUMERIC, we avoid the floating-point arithmetic issues that we can run into when using BigQuery's FLOAT64.
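
A quick illustration of the difference in BigQuery:

SELECT
  CAST(0.1 AS FLOAT64) + CAST(0.2 AS FLOAT64) AS float_sum,   -- 0.30000000000000004
  CAST(0.1 AS NUMERIC) + CAST(0.2 AS NUMERIC) AS numeric_sum  -- 0.3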
