
snowplow-bigquery-loader's Introduction

Snowplow BigQuery Loader


This project contains applications used to load Snowplow enriched data into Google BigQuery.

Quickstart

Assuming git and SBT are installed:

$ git clone https://github.com/snowplow-incubator/snowplow-bigquery-loader
$ cd snowplow-bigquery-loader
$ sbt "project loader" test
$ sbt "project streamloader" test
$ sbt "project mutator" test
$ sbt "project repeater" test

Benchmarks

This project comes with sbt-jmh.

To run a specific benchmark test:

$ sbt 'project benchmark' '+jmh:run -i 20 -wi 10 -f2 -t3 .*TransformAtomic.*'

Or, to run all benchmark tests (once more are added):

$ sbt 'project benchmark' '+jmh:run -i 20 -wi 10 -f2 -t3'

The numbers of warm-up and measurement iterations are what the sbt-jmh project recommends, but they can be lowered for faster runs.

To see all sbt-jmh options: jmh:run -h.

Add new benchmarks to the benchmark module.

Building fatjars

You can build the jar files for Mutator, Repeater and Streamloader with sbt like so:

$ sbt clean 'project mutator' assembly
$ sbt clean 'project repeater' assembly
$ sbt clean 'project streamloader' assembly

Find out more

  • Technical Docs
  • Setup Guide
  • Contributing

Copyright and License

Snowplow BigQuery Loader is copyright 2018-2023 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

snowplow-bigquery-loader's People

Contributors

aldemirenes, alexanderdean, benjben, chuwy, colmsnowplow, dilyand, istreeter, lmath, oguzhanunlu, peel, pondzix, simplylizz, spenes, the-fine


snowplow-bigquery-loader's Issues

Add contexts flattening

If a user defines a context with an array root type, then the resulting LoaderRow contains the following structure:

"contexts_com_acme_mycontext_1_0_0": [
  [
    {
      "key": "foo",
      "value": "bar"
    },
    {
      "key": "one",
      "value": "two"
    }
  ]
]

The first-level array exists because contexts always have zero-or-more cardinality; the second-level array is the actual context.

This structure is not supported by BigQuery, which means we have to flatten it into a single-level array.
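
For example, the structure above could be flattened to a single-level array (a sketch of the intended result, not the loader's current output):

"contexts_com_acme_mycontext_1_0_0": [
  {
    "key": "foo",
    "value": "bar"
  },
  {
    "key": "one",
    "value": "two"
  }
]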

Repeater: consider reducing nack logging

Repeater checks whether an event is old enough to make an insertion attempt, and it does so after every pull, which happens very often. As a result, we get messages like:

DEBUG com.snowplowanalytics.snowplow.storage.bigquery.repeater.Repeater - Event fcbcc0b1-e51b-488b-9a90-7403cdfb19f4/2020-02-19T21:53:36.943Z is not ready yet. Nack

Printed every 300 milliseconds.

However, this is logged at DEBUG level, so users can technically suppress it.
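
For example, assuming the logging backend is Logback (an assumption; the actual logging setup may differ), a logback.xml override can silence this particular logger:

<!-- Hypothetical logback.xml snippet: raise this logger above DEBUG -->
<logger name="com.snowplowanalytics.snowplow.storage.bigquery.repeater.Repeater" level="INFO"/>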

Add Bintray credentials to .travis.yml

  • BINTRAY_SNOWPLOW_GENERIC_USER (unencrypted?)
  • BINTRAY_SNOWPLOW_GENERIC_API_KEY

Important note: we'll be publishing tar.gz archives with executables there, not fatjars, because Scio/Beam does not work with sbt-assembly.

We'll probably also need Docker registry credentials, as @BenFradet and I are thinking about using sbt-native-packager and its Docker integration to publish an image straight away.
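
For reference, a sketch of how such encrypted variables are typically added with the Travis CLI (the values shown are placeholders):

$ travis encrypt BINTRAY_SNOWPLOW_GENERIC_USER=<user> --add env.global
$ travis encrypt BINTRAY_SNOWPLOW_GENERIC_API_KEY=<key> --add env.global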

How do you actually run snowplow-bigquery-loader?

I don't know how to create an executable out of the git repo. I am clearly missing something, but after cloning and running the given command:

./snowplow-bigquery-loader \
    --config=$CONFIG \
    --resolver=$RESOLVER

the executable isn't found, which makes sense because I don't see it anywhere in the repo.
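
One way to get a runnable artifact is to build a fatjar as described in the "Building fatjars" section and run it directly. A sketch (the jar path and name are assumptions and vary by version):

$ sbt clean 'project streamloader' assembly
$ java -jar <path-to-assembled-jar> \
    --config=$CONFIG \
    --resolver=$RESOLVER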

Attach a loading timestamp

The most reliable way would be to achieve it with a SQL function, but it doesn't look like streaming inserts provide that kind of API.
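
A client-side alternative would be to stamp each row just before the streaming insert. A minimal Scala sketch, not the loader's actual code (the load_tstamp column name is a hypothetical example):

import java.time.Instant

// Streaming inserts offer no server-side default-timestamp hook,
// so stamp each row client-side at insert time.
def withLoadTstamp(row: Map[String, String]): Map[String, String] =
  row + ("load_tstamp" -> Instant.now().toString)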

Batch Mode doesn't send anything to BigQuery

I have configured the loader to send data every 100 seconds. I can see the records in Dataflow heading into the BigQuery write job; however, they never get written to BigQuery. Why would this be? If you need further information to help, please ask.

Thanks

Repeater: mention life span in statistics logs

Currently, Repeater's statistics message looks like the following:

Statistics: 43 rows inserted, 114 rows rejected

This message could be totally fine if Repeater has been running for a month and rejected just 114 rows, but it can also be alarming if it rejected 114 rows in an hour. To figure out how long it has been running, we need to manually check when it was launched and subtract that from the log entry timestamp.

It would be really nice if it were:

Statistics: 43 rows inserted, 114 rows rejected in last 4 hours
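
A minimal Scala sketch of how the life span could be tracked (this assumes nothing about Repeater's internals):

import java.time.{Duration, Instant}

// Recorded once at startup.
val launchedAt: Instant = Instant.now()

// Include the elapsed life span in every statistics message.
def statisticsMessage(inserted: Long, rejected: Long): String = {
  val uptime = Duration.between(launchedAt, Instant.now())
  s"Statistics: $inserted rows inserted, $rejected rows rejected in last ${uptime.toHours} hours"
}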

Support GEOGRAPHY types

Given that BigQuery now has some geography support, does it make sense to convert the latitude/longitude columns to a single GEOGRAPHY column built with ST_GEOGPOINT?
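
For illustration, a BigQuery SQL sketch (the table name is hypothetical, geo_latitude and geo_longitude are assumed to be the atomic event columns, and note that ST_GEOGPOINT takes longitude first):

SELECT ST_GEOGPOINT(geo_longitude, geo_latitude) AS geo_point
FROM `my-project.my_dataset.events`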

Add support for GCP labels in the Dataflow launcher Docker image

When deploying more than one pipeline on the same GCP project, there is no way to know where the costs are coming from.

GCP supports labeling resources and exporting the billing information into BigQuery. Costs can then be broken down using SQL queries that filter by label.

For reference, the google documentation on labels: https://cloud.google.com/compute/docs/labeling-resources

The plan is to label all the resources with "sp_env=prod1" or "sp_env=qa1".
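
Beam's Dataflow runner exposes a --labels pipeline option that accepts a JSON map of label keys to values. A hedged sketch, reusing the loader invocation shown elsewhere in this document (exact flag support should be verified against the Beam version in use):

$ ./snowplow-bigquery-loader \
    --config=$CONFIG \
    --resolver=$RESOLVER \
    --labels='{"sp_env": "prod1"}'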

Use sbt-scalafmt plugin

To add the plugin:

addSbtPlugin("org.scalameta" % "sbt-scalafmt" % "2.3.2")

And a .scalafmt.conf file can be found here.
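
For reference, a minimal .scalafmt.conf sketch (the settings shown are illustrative, not the project's actual configuration):

version = "2.3.2"
maxColumn = 100
align = more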

Batch Loading stops when one file fails

java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: Failed to create load job with id prefix <REDACTED>, reached max retries: 3, last failed load job

Once this error occurs, no more data is inserted; it seems to block any further inserts. Should this fail gracefully and move on to the next file?

Replace FLOAT types with NUMERIC

By replacing some of the current transaction FLOAT types with NUMERIC, we avoid the floating-point arithmetic issues that we can run into when using BigQuery's FLOAT64.
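
A quick illustration of the difference in BigQuery:

SELECT
  CAST(0.1 AS FLOAT64) + CAST(0.2 AS FLOAT64) AS float_sum,   -- 0.30000000000000004
  CAST(0.1 AS NUMERIC) + CAST(0.2 AS NUMERIC) AS numeric_sum  -- 0.3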
