
bigquery-to-datastore's Introduction

bigquery-to-datastore


This tool exports a BigQuery table to a Google Datastore kind using Apache Beam on top of Google Dataflow.

The input table must not contain duplicate rows whose key values are the same, because Apache Beam's DatastoreIO does not allow writing the same key more than once in a single batch.
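
If the source table may contain duplicates, one way to deduplicate it beforehand is a ROW_NUMBER() query in BigQuery. This is a minimal sketch; the dataset, table, and column names (id, updated_at) are hypothetical:

# Keep only the latest row per key before exporting.
# All dataset/table/column names below are hypothetical.
bq query \
  --use_legacy_sql=false \
  --destination_table=test_dataset.test_table_deduped \
  'SELECT * EXCEPT(row_num)
   FROM (
     SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS row_num
     FROM test_dataset.test_table
   )
   WHERE row_num = 1'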

Data Pipeline

Requirements

  • Maven
  • Java 1.8+
  • Google Cloud Platform account

Usage

Required arguments

  • --project: Google Cloud Project
  • --inputBigQueryDataset: Input BigQuery dataset ID
  • --inputBigQueryTable: Input BigQuery table ID
  • --keyColumn: BigQuery column name for a key of Google Datastore kind
  • --outputDatastoreNamespace: Output Google Datastore namespace
  • --outputDatastoreKind: Output Google Datastore kind
  • --tempLocation: The Cloud Storage path to use for temporary files. Must be a valid Cloud Storage URL, beginning with gs://.
  • --gcpTempLocation: A GCS path for storing temporary files in GCP.

Optional arguments

  • --runner: Apache Beam runner.
    • When you don't set this option, it will run on your local machine, not Google Dataflow.
    • e.g. DataflowRunner
  • --parentPaths: Output Google Datastore parent path(s)
    • e.g. Parent1:p1,Parent2:p2 ==> KEY('Parent1', 'p1', 'Parent2', 'p2')
  • --indexedColumns: Indexed columns on Google Datastore.
    • e.g. col1,col2,col3 ==> col1, col2 and col3 are indexed on Google Datastore.
  • --numWorkers: The number of workers when you run it on top of Google Dataflow.
  • --workerMachineType: Google Dataflow worker instance type
    • e.g. n1-standard-1, n1-standard-4

Example to run on Google Dataflow

# compile
mvn clean package

# Run bigquery-to-datastore via the compiled JAR file
java -cp $(pwd)/target/bigquery-to-datastore-bundled-0.7.0.jar \
  com.github.yuiskw.beam.BigQuery2Datastore \
  --project=your-gcp-project \
  --runner=DataflowRunner \
  --inputBigQueryDataset=test_dataset \
  --inputBigQueryTable=test_table \
  --outputDatastoreNamespace=test_namespace \
  --outputDatastoreKind=TestKind \
  --parentPaths=Parent1:p1,Parent2:p2 \
  --keyColumn=id \
  --indexedColumns=col1,col2,col3 \
  --tempLocation=gs://test_bucket/test-log/ \
  --gcpTempLocation=gs://test_bucket/test-log/

How to run

How to build and run it with java

# compile
mvn clean package
# or
make package

# run
java -cp $(pwd)/target/bigquery-to-datastore-bundled-0.7.0.jar \
  com.github.yuiskw.beam.BigQuery2Datastore --help
# or
./bin/bigquery-to-datastore --help

How to run it on docker

We also offer Docker images for this project at yuiskw/bigquery-to-datastore on Docker Hub. There are several images, one per supported Apache Beam version.

docker run yuiskw/bigquery-to-datastore:0.7.0-beam-2.16.0 --help
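
If you need credentials inside the container, one approach is to mount a service account key and point the standard GOOGLE_APPLICATION_CREDENTIALS variable at it. This is a sketch relying on Google's usual auth convention rather than anything documented for this image; the key path and mount point are hypothetical:

docker run \
  -v /path/to/key.json:/key.json \
  -e GOOGLE_APPLICATION_CREDENTIALS=/key.json \
  yuiskw/bigquery-to-datastore:0.7.0-beam-2.16.0 \
  --help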

How to install it with homebrew

You can install it with homebrew from yu-iskw/homebrew-bigquery-to-datastore.

# install
brew install yu-iskw/bigquery-to-datastore/bigquery-to-datastore

# show help
bigquery-to-datastore --help

Type conversions between BigQuery and Google Datastore

The table below describes the type conversions between BigQuery and Google Datastore. Since Datastore unfortunately has no data type for time of day, bigquery-to-datastore ignores BigQuery columns whose data type is TIME.

BigQuery    Datastore
--------    ---------
BOOLEAN     bool
INTEGER     int
DOUBLE      double
STRING      string
TIMESTAMP   timestamp
DATE        timestamp
TIME        ignored (Google Datastore doesn't have a time type)
RECORD      array
STRUCT      Entity
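
As an illustration of the table above, a much simplified sketch of mapping one BigQuery value to a Datastore Value in the Datastore v1 API might look like the following. This is not the project's actual TableRow2EntityFn; the class and method names here are hypothetical:

import com.google.datastore.v1.Value;
import com.google.protobuf.Timestamp;

public class TypeMappingSketch {
  /** Maps one BigQuery value to a Datastore Value, per the table above. */
  static Value toDatastoreValue(String bigQueryType, Object v) {
    switch (bigQueryType) {
      case "BOOLEAN":   return Value.newBuilder().setBooleanValue((Boolean) v).build();
      case "INTEGER":   return Value.newBuilder().setIntegerValue((Long) v).build();
      case "DOUBLE":    return Value.newBuilder().setDoubleValue((Double) v).build();
      case "STRING":    return Value.newBuilder().setStringValue((String) v).build();
      case "TIMESTAMP": // fall through: DATE is also stored as a timestamp
      case "DATE":      return Value.newBuilder().setTimestampValue((Timestamp) v).build();
      default:          return null; // TIME (and anything unknown) is ignored
    }
  }
}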

Note

As you probably know, Google Datastore has no feature like MySQL's UPDATE statement. Since DatastoreIO.Write upserts the given input entities, it simply overwrites an entity whether or not it already exists. If you want to merge multiple pieces of data into a single entity, you have to combine them in BigQuery beforehand.
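
For example, merging two tables into one staging table before a single export run might look like this (a sketch; all table and column names are hypothetical):

bq query \
  --use_legacy_sql=false \
  --destination_table=test_dataset.users_combined \
  'SELECT u.id, u.name, s.score
   FROM test_dataset.users AS u
   LEFT JOIN test_dataset.user_scores AS s USING (id)'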

License

Copyright (c) 2017 Yu Ishikawa.

bigquery-to-datastore's People

Contributors

jontradesy, tadeegan, yu-iskw

bigquery-to-datastore's Issues

Install it with brew tap

It would be nice to support something like this:

brew tap yu-iskw/bigquery-to-datastore
brew install bigquery-to-datastore

Can specify indexed columns

Overview

No values are indexed at all as of version 0.2. I guess users sometimes want to index specific columns.

Command Line Options Spec

java -cp ...bigquery-to-datastore.jar
  ...
  --indexedColumns="age,name"
  ...

Timestamp Issue

Having an issue importing a timestamp back into Datastore.

: com.google.datastore.v1.client.DatastoreException: Invalid PROTO payload received. Timestamp seconds exceeds limit for field: timestampValue, code=INVALID_ARGUMENT
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:169)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:89)
at com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
at org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:1288)
at org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.finishBundle(DatastoreV1.java:1260)
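
One possible explanation (an assumption, not confirmed in this thread): Datastore rejects timestampValue fields outside the RFC 3339 range (years 0001-9999), so a malformed or extreme TIMESTAMP in the source table could trigger this error. A sketch for checking the source data beforehand, with hypothetical table and column names:

bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS out_of_range
   FROM test_dataset.test_table
   WHERE created_at < TIMESTAMP("0001-01-01")
      OR created_at > TIMESTAMP("9999-12-31")'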

Make a docker image

It would be nice to offer this tool via Docker.

docker run yuiskw/bigquery-to-datastore \
  --project=your-gcp-project \
  --runner=DataflowRunner \
  --inputBigQueryDataset=test_dataset \
  --inputBigQueryTable=test_table \
  --outputDatastoreNamespace=test_namespace \
  --outputDatastoreKind=TestKind \
  --parentPaths=Parent1:p1,Parent2:p2 \
  --keyColumn=id \
  --indexedColumns=col1,col2,col3 \
  --tempLocation=gs://test_bucket/test-log/ \
  --gcpTempLocation=gs://test_bucket/test-log/

Add flag for indexing

What is the reason for setting setExcludedFromIndexes to true? Ideally this would be an additional flag when running the main shell script.
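
For context: in the Datastore v1 API, indexing is controlled per value via excludeFromIndexes on the Value builder, so such a flag would flip this setting for the listed columns. A minimal sketch (a hypothetical class, not the project's code):

import com.google.datastore.v1.Value;

public class IndexFlagSketch {
  public static void main(String[] args) {
    // excludeFromIndexes=true means the property is NOT indexed; an
    // indexing flag would set this to false for the chosen columns.
    Value v = Value.newBuilder()
        .setStringValue("some value")
        .setExcludeFromIndexes(true)
        .build();
    System.out.println(v);
  }
}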

How do I auth against my Google Cloud project?

I have a Google Cloud login and am also logged in via the CLI. How do I tell either the JAR or the Docker image to pick up my credentials?
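
Both the JAR and the Docker image should pick up standard Google Cloud Application Default Credentials. This is general GCP auth rather than anything specific to this tool; the key path below is hypothetical:

# Option 1: use your own account as Application Default Credentials.
gcloud auth application-default login

# Option 2: point the standard env var at a service account key.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
java -cp target/bigquery-to-datastore-bundled-0.7.0.jar \
  com.github.yuiskw.beam.BigQuery2Datastore --help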

Import failing

Hey Yu,

Really great you put this together. I am finally getting successful builds; however, I am not seeing any data appear in my Datastore. Is there something I am doing wrong?

Output is:

[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building bigquery-to-datastore 0.2
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ bigquery-to-datastore ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /Users/cwilliams/Dropbox/Development/DevOps/Google/interview/bestbuy/bigquery-to-datastore/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.6.1:compile (default-compile) @ bigquery-to-datastore ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ bigquery-to-datastore ---
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.options.DataflowPipelineOptions$StagingLocationFactory create
INFO: No stagingLocation provided, falling back to gcpTempLocation
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.DataflowRunner fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 106 files. Enable logging at DEBUG level to see which files will be staged.
Nov 12, 2017 5:08:37 PM org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Read validate
INFO: Project of TableReference not set. The value of BigQueryOptions.getProject() at execution time will be used.
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.util.PackageUtil stageClasspathElements
INFO: Uploading 106 files from PipelineOptions.filesToStage to staging location to prepare for execution.
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.util.PackageUtil stageClasspathElements
INFO: Staging files complete: 106 files cached, 0 files newly uploaded
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/Read(BigQueryTableSource) as step s1
Nov 12, 2017 5:08:40 PM org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource setDefaultProjectIfAbsent
INFO: Project ID not set in TableReference. Using default project from BigQueryOptions.
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/ParMultiDo(Identity) as step s2
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/View.AsIterable/View.CreatePCollectionView/ParDo(ToIsmRecordForGlobalWindow) as step s3
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/View.AsIterable/View.CreatePCollectionView/CreateDataflowView as step s4
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/Create(CleanupOperation)/Read(CreateSource) as step s5
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/Cleanup as step s6
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding ParDo(TableRow2Entity) as step s7
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding DatastoreV1.Write/Convert to Mutation/Map as step s8
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding DatastoreV1.Write/Write Mutation to Datastore as step s9
Dataflow SDK version: 2.1.0
Nov 12, 2017 5:08:42 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: To access the Dataflow monitoring console, please navigate to https://console.developers.google.com/project/bestbuy-185314/dataflow/job/2017-11-12_08_08_41-5441556467331747849
Submitted job: 2017-11-12_08_08_41-5441556467331747849
Nov 12, 2017 5:08:42 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: To cancel the job using the 'gcloud' tool, run:

gcloud beta dataflow jobs --project=bestbuy-185314 cancel 2017-11-12_08_08_41-5441556467331747849
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12.604 s
[INFO] Finished at: 2017-11-12T17:08:42+01:00
[INFO] Final Memory: 34M/113M
[INFO] ------------------------------------------------------------------------

Any ideas?

Best
Chris

attempting to try this

Maybe I am missing something, but when I try to run the job I'm getting this error:

(dfb1d562509e1bce): java.lang.NullPointerException
at com.github.yuiskw.beam.TableRow2EntityFn.convertTableRowToEntity(TableRow2EntityFn.java:149)
at com.github.yuiskw.beam.TableRow2EntityFn.processElement(TableRow2EntityFn.java:55)
