
bigquery-to-datastore's Introduction

bigquery-to-datastore


This tool exports a BigQuery table to a Google Datastore kind using Apache Beam on top of Google Dataflow.

The input table must not contain duplicate rows whose key values are the same, because Apache Beam's DatastoreIO does not allow writing the same key more than once in a single batch.
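
If the source table may contain duplicates, one way to deduplicate it beforehand is a ROW_NUMBER() query in BigQuery. This is a minimal sketch; the dataset, table, and column names (id, updated_at) are hypothetical:

# Keep only the latest row per key before exporting.
# All dataset/table/column names below are hypothetical.
bq query \
  --use_legacy_sql=false \
  --destination_table=test_dataset.test_table_deduped \
  'SELECT * EXCEPT(row_num)
   FROM (
     SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS row_num
     FROM test_dataset.test_table
   )
   WHERE row_num = 1'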

Data Pipeline

Requirements

  • Maven
  • Java 1.8+
  • Google Cloud Platform account

Usage

Required arguments

  • --project: Google Cloud Project
  • --inputBigQueryDataset: Input BigQuery dataset ID
  • --inputBigQueryTable: Input BigQuery table ID
  • --keyColumn: BigQuery column name for a key of Google Datastore kind
  • --outputDatastoreNamespace: Output Google Datastore namespace
  • --outputDatastoreKind: Output Google Datastore kind
  • --tempLocation: The Cloud Storage path to use for temporary files. Must be a valid Cloud Storage URL, beginning with gs://.
  • --gcpTempLocation: A GCS path for storing temporary files in GCP.

Optional arguments

  • --runner: Apache Beam runner.
    • When you don't set this option, it will run on your local machine, not Google Dataflow.
    • e.g. DataflowRunner
  • --parentPaths: Output Google Datastore parent path(s)
    • e.g. Parent1:p1,Parent2:p2 ==> KEY('Parent1', 'p1', 'Parent2', 'p2')
  • --indexedColumns: Indexed columns on Google Datastore.
    • e.g. col1,col2,col3 ==> col1, col2 and col3 are indexed on Google Datastore.
  • --numWorkers: The number of workers when you run it on top of Google Dataflow.
  • --workerMachineType: Google Dataflow worker instance type
    • e.g. n1-standard-1, n1-standard-4

Example to run on Google Dataflow

# compile
mvn clean package

# Run bigquery-to-datastore via the compiled JAR file
java -cp $(pwd)/target/bigquery-to-datastore-bundled-0.7.0.jar \
  com.github.yuiskw.beam.BigQuery2Datastore \
  --project=your-gcp-project \
  --runner=DataflowRunner \
  --inputBigQueryDataset=test_dataset \
  --inputBigQueryTable=test_table \
  --outputDatastoreNamespace=test_namespace \
  --outputDatastoreKind=TestKind \
  --parentPaths=Parent1:p1,Parent2:p2 \
  --keyColumn=id \
  --indexedColumns=col1,col2,col3 \
  --tempLocation=gs://test_bucket/test-log/ \
  --gcpTempLocation=gs://test_bucket/test-log/

How to run

How to build and run it with java

# compile
mvn clean package
# or
make package

# run
java -cp $(pwd)/target/bigquery-to-datastore-bundled-0.7.0.jar \
  com.github.yuiskw.beam.BigQuery2Datastore --help
# or
./bin/bigquery-to-datastore --help

How to run it on docker

We also offer Docker images for this project at yuiskw/bigquery-to-datastore on Docker Hub. There are several images, one per supported Apache Beam version.

docker run yuiskw/bigquery-to-datastore:0.7.0-beam-2.16.0 --help
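
If you need credentials inside the container, one approach is to mount a service account key and point the standard GOOGLE_APPLICATION_CREDENTIALS variable at it. This is a sketch relying on Google's usual auth convention rather than anything documented for this image; the key path and mount point are hypothetical:

docker run \
  -v /path/to/key.json:/key.json \
  -e GOOGLE_APPLICATION_CREDENTIALS=/key.json \
  yuiskw/bigquery-to-datastore:0.7.0-beam-2.16.0 \
  --help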

How to install it with homebrew

You can install it with homebrew from yu-iskw/homebrew-bigquery-to-datastore.

# install
brew install yu-iskw/bigquery-to-datastore/bigquery-to-datastore

# show help
bigquery-to-datastore --help

Type conversions between BigQuery and Google Datastore

The table below describes the type conversions between BigQuery and Google Datastore. Since Datastore unfortunately has no data type for time of day, bigquery-to-datastore ignores BigQuery columns whose data type is TIME.

BigQuery    Datastore
--------    ---------
BOOLEAN     bool
INTEGER     int
DOUBLE      double
STRING      string
TIMESTAMP   timestamp
DATE        timestamp
TIME        ignored (Google Datastore doesn't have a time type)
RECORD      array
STRUCT      Entity
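
As an illustration of the table above, a much simplified sketch of mapping one BigQuery value to a Datastore Value in the Datastore v1 API might look like the following. This is not the project's actual TableRow2EntityFn; the class and method names here are hypothetical:

import com.google.datastore.v1.Value;
import com.google.protobuf.Timestamp;

public class TypeMappingSketch {
  /** Maps one BigQuery value to a Datastore Value, per the table above. */
  static Value toDatastoreValue(String bigQueryType, Object v) {
    switch (bigQueryType) {
      case "BOOLEAN":   return Value.newBuilder().setBooleanValue((Boolean) v).build();
      case "INTEGER":   return Value.newBuilder().setIntegerValue((Long) v).build();
      case "DOUBLE":    return Value.newBuilder().setDoubleValue((Double) v).build();
      case "STRING":    return Value.newBuilder().setStringValue((String) v).build();
      case "TIMESTAMP": // fall through: DATE is also stored as a timestamp
      case "DATE":      return Value.newBuilder().setTimestampValue((Timestamp) v).build();
      default:          return null; // TIME (and anything unknown) is ignored
    }
  }
}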

Note

As you probably know, Google Datastore has no feature like MySQL's UPDATE statement. Since DatastoreIO.Write upserts the given input entities, it simply overwrites an entity whether or not it already exists. If you want to merge multiple pieces of data into a single entity, you have to combine them in BigQuery beforehand.
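
For example, merging two tables into one staging table before a single export run might look like this (a sketch; all table and column names are hypothetical):

bq query \
  --use_legacy_sql=false \
  --destination_table=test_dataset.users_combined \
  'SELECT u.id, u.name, s.score
   FROM test_dataset.users AS u
   LEFT JOIN test_dataset.user_scores AS s USING (id)'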

License

Copyright (c) 2017 Yu Ishikawa.

bigquery-to-datastore's People

Contributors

jontradesy, tadeegan, yu-iskw

bigquery-to-datastore's Issues

Install it with brew tap

It would be nice to support something like this:

brew tap yu-iskw/bigquery-to-datastore
brew install bigquery-to-datastore

Can specify indexed columns

Overview

No values are indexed at all as of version 0.2. I guess users sometimes want to index specific columns.

Command Line Options Spec

java -cp ...bigquery-to-datastore.jar
  ...
  --indexedColumns="age,name"
  ...

Timestamp Issue

Having an issue importing a timestamp back into Datastore.

: com.google.datastore.v1.client.DatastoreException: Invalid PROTO payload received. Timestamp seconds exceeds limit for field: timestampValue, code=INVALID_ARGUMENT
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:126)
at com.google.datastore.v1.client.RemoteRpc.makeException(RemoteRpc.java:169)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:89)
at com.google.datastore.v1.client.Datastore.commit(Datastore.java:84)
at org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.flushBatch(DatastoreV1.java:1288)
at org.apache.beam.sdk.io.gcp.datastore.DatastoreV1$DatastoreWriterFn.finishBundle(DatastoreV1.java:1260)
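
One possible explanation (an assumption, not confirmed in this thread): Datastore rejects timestampValue fields outside the RFC 3339 range (years 0001-9999), so a malformed or extreme TIMESTAMP in the source table could trigger this error. A sketch for checking the source data beforehand, with hypothetical table and column names:

bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS out_of_range
   FROM test_dataset.test_table
   WHERE created_at < TIMESTAMP("0001-01-01")
      OR created_at > TIMESTAMP("9999-12-31")'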

Make a docker image

It would be nice to offer this tool via Docker.

docker run yuiskw/bigquery-to-datastore \
  --project=your-gcp-project \
  --runner=DataflowRunner \
  --inputBigQueryDataset=test_dataset \
  --inputBigQueryTable=test_table \
  --outputDatastoreNamespace=test_namespace \
  --outputDatastoreKind=TestKind \
  --parentPaths=Parent1:p1,Parent2:p2 \
  --keyColumn=id \
  --indexedColumns=col1,col2,col3 \
  --tempLocation=gs://test_bucket/test-log/ \
  --gcpTempLocation=gs://test_bucket/test-log/

Add flag for indexing

What is the reason for setting setExcludedFromIndexes to true? Ideally this would be an additional flag when running the main shell script.
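
For context: in the Datastore v1 API, indexing is controlled per value via excludeFromIndexes on the Value builder, so such a flag would flip this setting for the listed columns. A minimal sketch (a hypothetical class, not the project's code):

import com.google.datastore.v1.Value;

public class IndexFlagSketch {
  public static void main(String[] args) {
    // excludeFromIndexes=true means the property is NOT indexed; an
    // indexing flag would set this to false for the chosen columns.
    Value v = Value.newBuilder()
        .setStringValue("some value")
        .setExcludeFromIndexes(true)
        .build();
    System.out.println(v);
  }
}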

How do I auth against my Google Cloud project?

I have a Google Cloud login and am also logged in via the CLI. How do I tell either the JAR or the Docker image to pick up my credentials?
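
Both the JAR and the Docker image should pick up standard Google Cloud Application Default Credentials. This is general GCP auth rather than anything specific to this tool; the key path below is hypothetical:

# Option 1: use your own account as Application Default Credentials.
gcloud auth application-default login

# Option 2: point the standard env var at a service account key.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
java -cp target/bigquery-to-datastore-bundled-0.7.0.jar \
  com.github.yuiskw.beam.BigQuery2Datastore --help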

Import failing

Hey Yu,

Really great you put this together. I am finally getting successful builds; however, I am not seeing any data appear in my Datastore. Is there something I am doing wrong?

Output is:

[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building bigquery-to-datastore 0.2
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ bigquery-to-datastore ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /Users/cwilliams/Dropbox/Development/DevOps/Google/interview/bestbuy/bigquery-to-datastore/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.6.1:compile (default-compile) @ bigquery-to-datastore ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ bigquery-to-datastore ---
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.options.DataflowPipelineOptions$StagingLocationFactory create
INFO: No stagingLocation provided, falling back to gcpTempLocation
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.DataflowRunner fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 106 files. Enable logging at DEBUG level to see which files will be staged.
Nov 12, 2017 5:08:37 PM org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Read validate
INFO: Project of TableReference not set. The value of BigQueryOptions.getProject() at execution time will be used.
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
Nov 12, 2017 5:08:37 PM org.apache.beam.runners.dataflow.util.PackageUtil stageClasspathElements
INFO: Uploading 106 files from PipelineOptions.filesToStage to staging location to prepare for execution.
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.util.PackageUtil stageClasspathElements
INFO: Staging files complete: 106 files cached, 0 files newly uploaded
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/Read(BigQueryTableSource) as step s1
Nov 12, 2017 5:08:40 PM org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource setDefaultProjectIfAbsent
INFO: Project ID not set in TableReference. Using default project from BigQueryOptions.
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/ParMultiDo(Identity) as step s2
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/View.AsIterable/View.CreatePCollectionView/ParDo(ToIsmRecordForGlobalWindow) as step s3
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/View.AsIterable/View.CreatePCollectionView/CreateDataflowView as step s4
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/Create(CleanupOperation)/Read(CreateSource) as step s5
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding BigQueryIO.Read/PassThroughThenCleanup/Cleanup as step s6
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding ParDo(TableRow2Entity) as step s7
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding DatastoreV1.Write/Convert to Mutation/Map as step s8
Nov 12, 2017 5:08:40 PM org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator addStep
INFO: Adding DatastoreV1.Write/Write Mutation to Datastore as step s9
Dataflow SDK version: 2.1.0
Nov 12, 2017 5:08:42 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: To access the Dataflow monitoring console, please navigate to https://console.developers.google.com/project/bestbuy-185314/dataflow/job/2017-11-12_08_08_41-5441556467331747849
Submitted job: 2017-11-12_08_08_41-5441556467331747849
Nov 12, 2017 5:08:42 PM org.apache.beam.runners.dataflow.DataflowRunner run
INFO: To cancel the job using the 'gcloud' tool, run:

gcloud beta dataflow jobs --project=bestbuy-185314 cancel 2017-11-12_08_08_41-5441556467331747849
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12.604 s
[INFO] Finished at: 2017-11-12T17:08:42+01:00
[INFO] Final Memory: 34M/113M
[INFO] ------------------------------------------------------------------------

Any ideas?

Best
Chris

attempting to try this

Maybe I am missing something, but when I try to run the job I'm getting this error:

(dfb1d562509e1bce): java.lang.NullPointerException
at com.github.yuiskw.beam.TableRow2EntityFn.convertTableRowToEntity(TableRow2EntityFn.java:149)
at com.github.yuiskw.beam.TableRow2EntityFn.processElement(TableRow2EntityFn.java:55)
