
Snowplow on GCP

This project aims to provide a set of tools that let you easily deploy a Snowplow setup on Google Cloud Platform.

After following all the steps below you should have:

  • GKE cluster running:
    • Snowplow Scala Stream Collector
    • Beam Enrich
    • BigQuery Loader
  • Pub/Sub topics for collector and enrich stream
  • BigQuery dataset being the final destination of Snowplow events
  • A few GCS buckets

NOTE: This project is still a work in progress and some parts may not work yet, but you are welcome to help!

Prerequisites

To manage GCP resources you need the gcloud CLI installed. For installation options check the official documentation.
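
If you have just installed gcloud, a minimal first-time setup sketch (assuming you already know your project ID) is:

gcloud auth login
gcloud config set project project-name-here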

This project uses Terraform to bootstrap the infrastructure and kubectl to manage the Kubernetes cluster. On macOS you can easily install them using Homebrew:

brew install terraform
brew install kubectl

For installation options on other systems please check the documentation of those projects.
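
You can verify that both tools are available on your PATH by printing their versions:

terraform version
kubectl version --client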

Infrastructure setup

  1. Create a GCP project. You can also use an already existing one.

  2. Run the following commands:

    export PROJECT_ID=project-name-here
    export SERVICE_ACCOUNT_NAME=snowplow
    bash scripts/setup-iam.sh ${PROJECT_ID} ${SERVICE_ACCOUNT_NAME}

    This will create a service account and store its key in the keys directory. The service account will have the roles/editor role and will be used to create GCP resources. The script will also enable the required services (GKE).

  3. To bootstrap infrastructure required for Snowplow deployment run:

    export LOCATION=europe-west3
    export GCP_KEY=keys/${SERVICE_ACCOUNT_NAME}.json
    export CLIENT=client-name
    terraform apply -var "gcp_project=${PROJECT_ID}" -var "gcp_location=${LOCATION}" -var "gcp_key_admin=${GCP_KEY}" -var "client=${CLIENT}"

    CLIENT is a string that is appended to all resource names. It's recommended to use Terraform workspaces, e.g. terraform workspace new my_snowplow; a minimal first-run sketch is shown below.
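
    If this is the first run in this directory, Terraform also needs to be initialized (and, optionally, a workspace created) before terraform apply; a minimal first-run sketch, assuming the default local backend, is:

    terraform init
    terraform workspace new my_snowplow
    terraform plan -var "gcp_project=${PROJECT_ID}" -var "gcp_location=${LOCATION}" -var "gcp_key_admin=${GCP_KEY}" -var "client=${CLIENT}"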

At this point all required elements should be up and running. If you wish, you can check this in the GCP console. In the next steps you will deploy the Snowplow components.
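
A quick command-line sanity check (a minimal sketch, assuming the default gcloud credentials can list these resources) could look like:

gcloud container clusters list --project ${PROJECT_ID}
gcloud pubsub topics list --project ${PROJECT_ID}
gsutil ls -p ${PROJECT_ID}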

Collector deployment

Check the Snowplow documentation.

To get access to the newly created Kubernetes cluster run

gcloud container clusters get-credentials "snowplow-gke" --region ${LOCATION}
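
You can confirm that kubectl now points at the new cluster:

kubectl config current-context
kubectl get nodes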

The collector configuration requires you to provide the GCP project ID. You can do this by running the following substitution:

sed -i "" "s/googleProjectId =.*/googleProjectId = ${PROJECT_ID}/" k8s/collector/conf.yaml
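
Note that the command above uses the BSD (macOS) sed syntax; with GNU sed on Linux the equivalent substitution drops the empty string after -i:

sed -i "s/googleProjectId =.*/googleProjectId = ${PROJECT_ID}/" k8s/collector/conf.yaml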

Then apply the following manifests:

kubectl apply -f k8s/collector/conf.yaml
kubectl apply -f k8s/collector/deploy.yaml
kubectl apply -f k8s/collector/service.yaml

This will create the snowplow-collector deployment, which uses the official Snowplow image.

To check if the deployment works, run

kubectl get pods -A | grep snowplow

and you should see a few pods, all in the Running state. To verify that everything works smoothly you can run the health check script:

bash scripts/collector_health_check.sh

If there was no error, head to the Pub/Sub web console and after a few seconds you should observe some events in the good topic.
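
You can also probe the collector manually. A minimal sketch, assuming the service is called snowplow-collector and listens on port 8080 (check k8s/collector/service.yaml for the actual name and port):

kubectl port-forward svc/snowplow-collector 8080:8080 &
curl -i http://localhost:8080/health  # the Scala Stream Collector exposes a /health endpoint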

Stream enrich job

Check the Snowplow documentation.

The next step is to start a streaming job on Google Dataflow (Apache Beam). To do this you will use a one-time Kubernetes job.

Before that, the enrich configuration requires you to provide the GCP project ID. You can do this by running the following substitutions:

sed -i "" "s/googleProjectId =.*/googleProjectId = ${PROJECT_ID}/" k8s/enrich/conf.yaml
sed -i "" "s/\*PROJECT\*/${PROJECT_ID}/" k8s/enrich/job.yaml  # does not work

Then you need a key to write to GCS:

cp keys/snowplow-admin.json keys/credentials.json
kubectl create secret generic gcs-writer-sa --from-file keys/credentials.json
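
You can verify that the secret was created and contains the key file:

kubectl get secret gcs-writer-sa
kubectl describe secret gcs-writer-sa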

TODO: there should be a key with a limited scope (what scope?). TODO: some more configuration changes are needed.

Once your configuration is ready, run:

kubectl apply -f k8s/enrich/conf.yaml
kubectl apply -f k8s/enrich/job.yaml

After a few seconds run:

kubectl get jobs -A

and you should see that snowplow-enrich has completed.
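
If the job fails or hangs, its logs and the Dataflow console usually point at the problem. A minimal check, assuming the job is named snowplow-enrich as above:

kubectl logs job/snowplow-enrich
gcloud dataflow jobs list --region ${LOCATION} --project ${PROJECT_ID}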

BigQuery loader deployment

Check the Snowplow documentation.
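
The loader deployment is not covered here yet. Once it is running, a minimal way to confirm that the destination dataset exists (assuming the bq CLI that ships with the Cloud SDK) is:

bq ls --project_id ${PROJECT_ID}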

Contributing

We welcome all contributions! Please submit an issue or PR, no matter if it's a bug or a typo.

This project uses pre-commit to ensure code quality. To install pre-commit just do:

pip install pre-commit
# or
brew install pre-commit

Then, from the project directory, run pre-commit install.
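
To run all hooks against the whole repository once, for example before your first commit, you can use:

pre-commit run --all-files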


Open issues

Use Helm to deploy all deployments

We should use Helm to template and deploy all components.
The variables that have to be templated:

  • k8s/collector/config.yaml collector.streams.sink.googleProjectId
  • PubSub topics:
    • k8s/collector/config.yaml collector.streams.good
    • k8s/collector/config.yaml collector.streams.bad
    • k8s/enrich/config.yaml enrich.in.raw
    • k8s/enrich/config.yaml enrich.out.good
    • k8s/enrich/config.yaml enrich.out.bad
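
A hypothetical sketch of what a Helm-based deploy could look like once such a chart exists (the chart path and value names below are made up for illustration only):

helm upgrade --install snowplow ./chart \
  --set collector.googleProjectId=${PROJECT_ID} \
  --set topics.collectorGood=good \
  --set topics.collectorBad=bad \
  --set topics.enrichedGood=enriched-good \
  --set topics.enrichedBad=enriched-bad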
