Google Cloud Data Training

This markdown file contains an outline for a data training workshop.

Prerequisites

For this training you need:

  • owner permissions to a GCP project

  • a modern web browser

  • (optional) Google Cloud SDK installed on your laptop

Lab 1

In this lab we explore the Google Analytics data from a demo account.

Products: Google Analytics, Cloud Console, BigQuery, Cloud Shell, Cloud Datalab, Data Studio

  1. The demo account tracks data from the Google Merchandise Store

  2. Access the Google Analytics dashboard through this page

  3. Open your Google Cloud Console and navigate to BigQuery

  4. Explore the UI, find your google_analytics_sample dataset and the ga_sessions_* tables within

  5. See the documentation & tutorials

  6. Try out a simple query like
SELECT fullVisitorId, date, device.deviceCategory, geoNetwork.country
FROM `google_analytics_sample.ga_sessions_201707*`
GROUP BY 1,2,3,4

Exercise

  1. Approximately, how many distinct visitors were there on the Google Merchandise Store site in July 2017?

  2. How many distinct visitors were there by country? By device category (desktop, mobile, tablet)?

Choose your analytics tool: BigQuery UI, Data Studio, Datalab

BigQuery UI

  1. Just use the query editor!

Data Studio

  1. Sign in to Data Studio from Google Marketing Platform

  2. Open a blank report (and answer the questions)

  3. Create a new data source from BigQuery, follow the steps, select "Session level fields" and click "Add to report"

  4. Create a date range for July 2017

  5. Create a new field Distinct visitors with COUNT_DISTINCT(fullVisitorId) and place it in a scorecard

  6. Create a filter control with dimension Country and metric Distinct visitors (view it in View mode)

  7. Create similar filter controls for Device category

Datalab

  1. Open Cloud Console in a new tab and activate Cloud Shell

  2. Get help for Cloud Datalab

datalab --help
  3. Enable the Compute Engine and Source Repositories APIs
gcloud services enable compute.googleapis.com
gcloud services enable sourcerepo.googleapis.com
  4. Create a new Datalab instance
datalab create --zone europe-north1-a --disk-size-gb 20 my-datalab
  5. Open the Datalab UI in your browser from web preview (change port to 8081)

  6. Create a new notebook and make a query

%%bq query
SELECT ...
FROM `google_analytics_sample.ga_sessions_*` ...
  7. Explore the documentation notebooks in the docs folder

  8. View the general Datalab documentation and the Python reference (a short example of querying BigQuery from Python in a notebook is sketched after this list)

  9. Explore the web UI of the "ungit" version control tool
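
For reference, queries can also be run from Python inside a Datalab notebook. The snippet below is a minimal sketch using the google.datalab.bigquery module that ships with Datalab; the query itself is just the device-category example from the exercise.

import google.datalab.bigquery as bq

# The same kind of query as in the exercise, run through the Datalab Python API.
sql = """
SELECT device.deviceCategory, COUNT(DISTINCT fullVisitorId) AS number_of_visitors
FROM `google_analytics_sample.ga_sessions_201707*`
GROUP BY deviceCategory
"""

# Execute the query and load the results into a pandas DataFrame.
df = bq.Query(sql).execute().result().to_dataframe()
df.head()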

Solution

  1. Number of distinct visitors
SELECT COUNT(DISTINCT fullVisitorId) AS number_of_visitors
FROM `google_analytics_sample.ga_sessions_201707*`
  2. Visitors by country
SELECT geoNetwork.country, COUNT(DISTINCT fullVisitorId) AS number_of_visitors
FROM `google_analytics_sample.ga_sessions_201707*`
GROUP BY country ORDER BY number_of_visitors DESC
  3. Visitors by device category
SELECT device.deviceCategory, COUNT(DISTINCT fullVisitorId) AS number_of_visitors
FROM `google_analytics_sample.ga_sessions_201707*`
GROUP BY deviceCategory ORDER BY number_of_visitors DESC

Bonus exercise 1

  1. Learn about BigQuery as a data warehouse

Bonus exercise 2

  1. Run queries from BigQuery cookbook

Datalab clean up

  1. Shut down the notebooks and close the Datalab tabs

  2. Go back to Cloud Shell and close the SSH tunnel with CTRL-C

  3. View the state of your Datalab instance by

datalab list
  4. Stop the running instance by
datalab stop my-datalab

Lab 2

In this lab we

  1. run a streaming pipeline from Pub/Sub to BigQuery,
  2. run a batch pipeline from BigQuery to Datastore,
  3. schedule the batch pipeline and other tasks with Composer.

Products: Cloud Shell, Cloud Source Repositories, Pub/Sub, Cloud Dataflow, BigQuery, Cloud Datastore, Cloud Storage, Cloud Composer

Frameworks: Apache Beam, Apache Airflow

Preparations

  1. In Cloud Console, navigate to APIs & Services

  2. Enable APIs for Pub/Sub, Cloud Dataflow, and Cloud Composer

or, alternatively,

  1. Enable APIs from the Cloud Shell command line by
gcloud services enable pubsub.googleapis.com
gcloud services enable dataflow.googleapis.com
gcloud services enable composer.googleapis.com
  3. In the Cloud Shell Terminal settings, go to Terminal preferences / Keyboard and enable Alt is Meta. This ensures you can enter characters like [ and ] in the terminal without complications.

Pub/Sub

  1. Open Cloud Shell (preferably in a new tab) and clone this repository
git clone https://github.com/qvik/gcp-data-training.git

If you want to use your local code editor instead of the Cloud Shell code editor, follow these steps (you will need to have the Google Cloud SDK installed locally):

  • Create a repository in Cloud Source Repositories
gcloud source repos create gcp-data-training
  • In the repository folder, add remote
git remote add google https://source.developers.google.com/p/$GOOGLE_CLOUD_PROJECT/r/gcp-data-training
  • Push
git push google master
  • Clone the gcp-data-training repository to your laptop by following the instructions in Source Repositories

Back to Cloud Shell, everyone!

  2. Create a virtual environment for the publisher by
virtualenv --python=/usr/bin/python pubvenv
  3. Activate it and install the Python client for Pub/Sub
source pubvenv/bin/activate
pip install --upgrade google-cloud-pubsub numpy
  4. Open publisher.py in your code editor and inspect the code (a rough sketch of what such a publisher can look like is shown at the end of this section)

  5. In Cloud Console, navigate to Pub/Sub, create a topic stream_data_ingestion and a subscription process_stream_data for it

or, alternatively,

  1. Create the topic and its subscription from the command line
gcloud pubsub topics create stream_data_ingestion
gcloud pubsub subscriptions create --topic stream_data_ingestion process_stream_data
  6. Run publisher.py

  7. Open a new Cloud Shell tab and pull messages from the subscription to make sure data is flowing

gcloud pubsub subscriptions pull --auto-ack \
projects/$GOOGLE_CLOUD_PROJECT/subscriptions/process_stream_data
  8. Interrupt the Python process publisher.py with CTRL-C
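
For reference, the sketch below shows roughly what a publisher like publisher.py can look like; it is not the repository's actual code. The topic name matches the one created above, while the message fields (location, spend) are assumptions chosen to match the stream_data table used later in this lab.

# Illustrative sketch only -- the actual publisher.py in the repository may differ.
import json
import os
import random
import time
from datetime import datetime

from google.cloud import pubsub_v1

project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
topic_name = "stream_data_ingestion"  # the topic created above

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)

while True:
    # Build a small JSON message; the fields mirror the stream_data table schema.
    message = {
        "timestamp": datetime.utcnow().isoformat(),
        "location": random.choice(["Helsinki", "Stockholm", "Oslo"]),
        "spend": random.randint(1, 100),
    }
    # Pub/Sub message payloads are byte strings.
    publisher.publish(topic_path, json.dumps(message).encode("utf-8"))
    time.sleep(1)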

Streaming pipeline

  1. Open a new Cloud Shell tab and create a virtual environment for the pipeline
virtualenv --python=/usr/bin/python beamvenv
  2. Activate it and install the Apache Beam Python SDK
source beamvenv/bin/activate
pip install --upgrade apache-beam[gcp]
  3. Open stream_pipeline.py in your code editor and inspect the different suggestions for pipelines (a minimal end-to-end sketch is shown at the end of this section)

  4. Take a look at the Apache Beam Programming Guide, the Python reference, and the examples in GitHub

  5. Go to the BigQuery console and create a dataset my_dataset and an empty table stream_data with fields timestamp: TIMESTAMP, location: STRING, spend: INTEGER

  6. Launch publisher.py in another tab (in its virtual environment) and try out different pipelines with DirectRunner

python stream_pipeline.py --runner DirectRunner
  7. Interrupt the Python processes with CTRL-C

  8. Create a Cloud Storage bucket <project_id>-dataflow for Dataflow temp and staging either from the console or from the command line (see gsutil help)

gsutil mb -l europe-west1 gs://$GOOGLE_CLOUD_PROJECT-dataflow
  9. Take a look at the Dataflow documentation and run the pipeline in Dataflow
python stream_pipeline.py --runner DataflowRunner
  10. View the pipeline by navigating to Dataflow in Cloud Console

  11. Clean up by stopping the pipeline and the publisher script
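
For reference, a minimal streaming pipeline from Pub/Sub to BigQuery could look roughly like the sketch below. It is not the repository's stream_pipeline.py; the subscription and table names match the resources created above, and for the Dataflow run you would additionally pass --project, --region and --temp_location (the -dataflow bucket) on the command line.

# Illustrative sketch only -- stream_pipeline.py in the repository contains the actual exercise code.
import json
import os
import sys

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

def run(argv=None):
    # Command-line flags such as --runner are passed straight to Beam.
    options = PipelineOptions(argv)
    options.view_as(StandardOptions).streaming = True  # a Pub/Sub source needs streaming mode

    project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
    subscription = "projects/{}/subscriptions/process_stream_data".format(project_id)

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=subscription)
         | "ParseJson" >> beam.Map(json.loads)  # the publisher sends JSON-encoded bytes
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               "my_dataset.stream_data",  # the table created above, in the default project
               schema="timestamp:TIMESTAMP,location:STRING,spend:INTEGER",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

if __name__ == "__main__":
    run(sys.argv[1:])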

Batch pipeline

  1. In Cloud Console, navigate to Datastore and create a database

  2. Inspect batch_pipeline.py in your code editor and fill in the missing code (one possible shape for the pipeline is sketched at the end of this section)

  3. Run the pipeline to make sure it works (activate beamvenv first)

python batch_pipeline.py --runner DirectRunner
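
For reference, one possible shape for a batch pipeline from BigQuery to Datastore is sketched below; the actual exercise in batch_pipeline.py may aggregate the data differently. The query, the Datastore kind SpendByLocation, and the key layout are assumptions.

# Illustrative sketch only -- batch_pipeline.py in the repository is the actual exercise file.
import os
import sys

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore
from apache_beam.io.gcp.datastore.v1new.types import Entity, Key

PROJECT = os.environ["GOOGLE_CLOUD_PROJECT"]

# A hypothetical aggregation over the streamed data.
QUERY = """
SELECT location, SUM(spend) AS total_spend
FROM `my_dataset.stream_data`
GROUP BY location
"""

def to_entity(row):
    # Use the location as the key name; the kind name is an assumption.
    key = Key(["SpendByLocation", row["location"]], project=PROJECT)
    entity = Entity(key)
    entity.set_properties({"location": row["location"], "total_spend": row["total_spend"]})
    return entity

def run(argv=None):
    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (p
         | "ReadFromBigQuery" >> beam.io.Read(
               beam.io.BigQuerySource(query=QUERY, use_standard_sql=True))
         | "ToEntity" >> beam.Map(to_entity)
         | "WriteToDatastore" >> WriteToDatastore(PROJECT))

if __name__ == "__main__":
    run(sys.argv[1:])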

Cloud Composer

  1. In Cloud Console, navigate to Cloud Composer

  2. Create an environment named data-transfer-environment in europe-west1 (this takes a while to finish)

  3. Take a look at Cloud Composer documentation

  4. Create a Cloud Storage bucket for data export

gsutil mb -l europe-west1 gs://$GOOGLE_CLOUD_PROJECT-data-export
  5. Copy the pipeline into the storage bucket
gsutil cp pipelines/batch_pipeline.py gs://$GOOGLE_CLOUD_PROJECT-dataflow/pipelines/
  6. Take a look at the Apache Airflow API Reference

  7. Open scheduler.py in your code editor and fill in the missing code (a rough outline of such a DAG is sketched at the end of this section)

  8. Once the environment is ready, navigate to the Airflow web UI and explore it

  9. Set the Airflow variable

gcloud composer environments run data-transfer-environment \
    --location europe-west1 variables -- --set gcp_project $GOOGLE_CLOUD_PROJECT
  10. Submit your DAG to Composer by copying scheduler.py into the dags folder of your Composer environment bucket

  11. Run the pipeline manually, if necessary, and inspect the runs in the web UI

  12. View the pipeline by navigating to Dataflow in Cloud Console
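
For reference, a DAG like the one scheduler.py should contain could look roughly like the sketch below; it is not the repository's actual code. It assumes the gcp_project Airflow variable set above, the batch_pipeline.py file copied into the -dataflow bucket, and the DataFlowPythonOperator from Airflow 1.x contrib (available in Cloud Composer at the time); names such as batch_pipeline_scheduler are made up.

# Illustrative sketch only -- scheduler.py in the repository is the actual exercise file.
import datetime

from airflow import models
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

# The gcp_project Airflow variable was set with the gcloud command above.
project_id = models.Variable.get("gcp_project")

default_args = {
    "start_date": datetime.datetime(2019, 1, 1),
    # Default options picked up by the Dataflow operator below.
    "dataflow_default_options": {
        "project": project_id,
        "temp_location": "gs://{}-dataflow/temp".format(project_id),
        "runner": "DataflowRunner",
    },
}

# Run the batch pipeline once a day.
with models.DAG(
        "batch_pipeline_scheduler",
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_args) as dag:

    run_batch_pipeline = DataFlowPythonOperator(
        task_id="run_batch_pipeline",
        py_file="gs://{}-dataflow/pipelines/batch_pipeline.py".format(project_id),
    )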

Exercise

  1. Draw an architecture diagram of the pipelines in Lab 2

Clean up

  1. In Cloud Console, delete the Composer environment and its storage bucket

  2. Check that no Dataflow pipelines are running

Lab 3

In this lab we train a deep neural network TensorFlow model on Cloud ML Engine. The task is to predict whether a marketing phone call made by a Portuguese banking institution will be successful.

Products: Cloud ML Engine

Frameworks: TensorFlow

Preparations

  1. Enable the Cloud ML Engine API
gcloud services enable ml.googleapis.com

Preparing the model

  1. Visit the origin of the dataset at the UCI Machine Learning Repository

  2. For your convenience, the data has been prepared into training and evaluation sets in mlengine/data/. All the numerical variables except for age have been normalized.

  3. Open the model file trainer/model.py in your code editor and examine the objects CSV_COLUMNS, INPUT_COLUMNS, etc., which encode the data format

  4. Take a look at the TensorFlow documentation and fill in the missing feature columns in the build_estimator function (a rough sketch of possible feature columns follows this section)

  5. Navigate to the repository folder in your Cloud Shell and set the environment variables

TRAIN_DATA=$(pwd)/mlengine/data/bank_data_train.csv
EVAL_DATA=$(pwd)/mlengine/data/bank_data_eval.csv
MODEL_DIR=$(pwd)/mlengine/output
  6. Change directory to mlengine and try the training locally
gcloud ml-engine local train \
    --module-name trainer.task \
    --package-path trainer/ \
    --job-dir $MODEL_DIR \
    -- \
    --train-files $TRAIN_DATA \
    --eval-files $EVAL_DATA \
    --train-steps 1000 \
    --eval-steps 100
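
For reference, the feature columns for build_estimator might be defined along the lines of the sketch below. This is not the repository's trainer/model.py: the column names and vocabularies are examples from the UCI bank marketing data and may not match the prepared CSV files exactly.

# Illustrative sketch only -- trainer/model.py in the repository defines the real columns.
import tensorflow as tf

# A few example columns; the full lists live in CSV_COLUMNS / INPUT_COLUMNS.
age = tf.feature_column.numeric_column("age")
duration = tf.feature_column.numeric_column("duration")
job = tf.feature_column.categorical_column_with_vocabulary_list(
    "job", ["admin.", "blue-collar", "technician", "services", "management"])
marital = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital", ["married", "single", "divorced"])

def build_estimator(model_dir, hidden_units=None):
    # Categorical columns are one-hot encoded before being fed to the DNN.
    feature_columns = [
        age,
        duration,
        tf.feature_column.indicator_column(job),
        tf.feature_column.indicator_column(marital),
    ]
    return tf.estimator.DNNClassifier(
        model_dir=model_dir,
        feature_columns=feature_columns,
        hidden_units=hidden_units or [64, 32])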

Training the model in ML Engine

  1. Set the environment variables
BUCKET_NAME=$GOOGLE_CLOUD_PROJECT-mlengine
REGION=europe-west1
  2. Create a bucket for ML Engine jobs
gsutil mb -l $REGION gs://$BUCKET_NAME
  3. Copy the data into the bucket
gsutil cp $TRAIN_DATA $EVAL_DATA gs://$BUCKET_NAME/data/
  4. Reset the environment variables for data
TRAIN_DATA=gs://$BUCKET_NAME/data/bank_data_train.csv
EVAL_DATA=gs://$BUCKET_NAME/data/bank_data_eval.csv
  5. Set the environment variables for the training job
JOB_NAME=bank_marketing_1
OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
  6. Run the training job in ML Engine
gcloud ml-engine jobs submit training $JOB_NAME \
    --job-dir $OUTPUT_PATH \
    --runtime-version 1.10 \
    --module-name trainer.task \
    --package-path trainer/ \
    --region $REGION \
    -- \
    --train-files $TRAIN_DATA \
    --eval-files $EVAL_DATA \
    --train-steps 10000 \
    --eval-steps 1000 \
    --verbosity DEBUG
  7. View the job logs in Cloud Shell
gcloud ml-engine jobs stream-logs $JOB_NAME

or, alternatively,

  1. Inspect the training process on TensorBoard (open web preview on port 6006)
tensorboard --logdir=$OUTPUT_PATH

Hyperparameter tuning

  1. Learn more about hyperparameter tuning (see also here)

  2. Open hptuning_config.yaml in your code editor and fill in the missing code

  3. In the mlengine folder, set the environment variables

HPTUNING_CONFIG=$(pwd)/hptuning_config.yaml
JOB_NAME=bank_marketing_hptune_1
OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
  4. Run training with hyperparameter tuning
gcloud ml-engine jobs submit training $JOB_NAME \
    --job-dir $OUTPUT_PATH \
    --runtime-version 1.10 \
    --config $HPTUNING_CONFIG \
    --module-name trainer.task \
    --package-path trainer/ \
    --region $REGION \
    --scale-tier STANDARD_1 \
    -- \
    --train-files $TRAIN_DATA \
    --eval-files $EVAL_DATA \
    --train-steps 10000 \
    --eval-steps 1000 \
    --verbosity DEBUG
  5. View the job logs in Cloud Shell
gcloud ml-engine jobs stream-logs $JOB_NAME

or, alternatively,

  1. Inspect the training process on TensorBoard (open web preview on port 6006)
tensorboard --logdir=$OUTPUT_PATH/<trial_number>/

Deployment

  1. Set the environment variable
MODEL_NAME=bank_marketing
  2. Create a model in ML Engine
gcloud ml-engine models create $MODEL_NAME --regions=$REGION
  3. Select the job output to use and look up the path to model binaries
gsutil ls -r $OUTPUT_PATH
  4. Set the environment variable with the correct values for <trial_number> and <timestamp>
MODEL_BINARIES=$OUTPUT_PATH/<trial_number>/export/bank_marketing/<timestamp>/
  5. Create a version of the model
gcloud ml-engine versions create v1 \
    --model $MODEL_NAME \
    --origin $MODEL_BINARIES \
    --runtime-version 1.10
  6. From the mlengine folder, inspect the test instances
cat data/bank_data_test_no.json
cat data/bank_data_test_yes.json
  7. Get the prediction for two test instances (an equivalent call from Python is sketched after these commands)
gcloud ml-engine predict \
    --model $MODEL_NAME \
    --version v1 \
    --json-instances \
    data/bank_data_test_no.json
gcloud ml-engine predict \
    --model $MODEL_NAME \
    --version v1 \
    --json-instances \
    data/bank_data_test_yes.json
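
For reference, the same prediction can be requested from Python through the ML Engine online prediction API (via the google-api-python-client package). The sketch below assumes the test file contains one JSON instance per line, as the --json-instances flag expects, and your-project-id is a placeholder.

# Illustrative sketch: calling the deployed model from Python instead of gcloud.
import json

from googleapiclient import discovery

project = "your-project-id"  # replace with your project id
name = "projects/{}/models/{}/versions/{}".format(project, "bank_marketing", "v1")

# The test files contain one JSON instance per line.
with open("data/bank_data_test_yes.json") as f:
    instance = json.loads(f.readline())

service = discovery.build("ml", "v1")
response = service.projects().predict(name=name, body={"instances": [instance]}).execute()
print(response.get("predictions", response))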

The End
