Google Cloud Data Training

This markdown file contains an outline for a data training workshop.

Prerequisites

For this training you need:

  • owner permissions to a GCP project

  • a modern web browser

  • (optional) Google Cloud SDK installed on your laptop

Lab 1

In this lab we explore the Google Analytics data from a demo account.

Products: Google Analytics, Cloud Console, BigQuery, Cloud Shell, Cloud Datalab, Data Studio

  1. The demo account tracks data from the Google Merchandise Store

  2. Access the Google Analytics dashboard through this page

  3. Open your Google Cloud Console and navigate to BigQuery

  4. Explore the UI, find your google_analytics_sample dataset and the ga_sessions_* tables within

  5. See the documentation & tutorials

  6. Try out a simple query like
SELECT fullVisitorId, date, device.deviceCategory, geoNetwork.country
FROM `google_analytics_sample.ga_sessions_201707*`
GROUP BY 1,2,3,4

Exercise

  1. Approximately, how many distinct visitors were there on the Google Merchandise Store site in July 2017?

  2. How many distinct visitors were there by country? By device category (desktop, mobile, tablet)?

Choose your analytics tool: BigQuery UI, Data Studio, Datalab

BigQuery UI

  1. Just use the query editor!

Data Studio

  1. Sign in to Data Studio from Google Marketing Platform

  2. Open a blank report (and answer the questions)

  3. Create a new data source from BigQuery, follow the steps, select "Session level fields" and click "Add to report"

  4. Create a date range for July 2017

  5. Create a new field Distinct visitors with COUNT_DISTINCT(fullVisitorId) and place it in a scorecard

  6. Create a filter control with dimension Country and metric Distinct visitors (view it in View mode)

  7. Create similar filter controls for Device category

Datalab

  1. Open Cloud Console in a new tab and activate Cloud Shell

  2. Get help for Cloud Datalab

datalab --help
  3. Enable the Compute Engine and Source Repositories APIs
gcloud services enable compute.googleapis.com
gcloud services enable sourcerepo.googleapis.com
  4. Create a new Datalab instance
datalab create --zone europe-north1-a --disk-size-gb 20 my-datalab
  5. Open the Datalab UI in your browser from web preview (change port to 8081)

  6. Create a new notebook and make a query

%%bq query
SELECT ...
FROM `google_analytics_sample.ga_sessions_*` ...
  7. Explore the documentation notebooks in the docs folder

  8. View the general Datalab documentation and the Python reference (a short example of querying BigQuery from Python in a notebook is sketched after this list)

  9. Explore the web UI of the "ungit" version control tool
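
For reference, queries can also be run from Python inside a Datalab notebook. The snippet below is a minimal sketch using the google.datalab.bigquery module that ships with Datalab; the query itself is just the device-category example from the exercise.

import google.datalab.bigquery as bq

# The same kind of query as in the exercise, run through the Datalab Python API.
sql = """
SELECT device.deviceCategory, COUNT(DISTINCT fullVisitorId) AS number_of_visitors
FROM `google_analytics_sample.ga_sessions_201707*`
GROUP BY deviceCategory
"""

# Execute the query and load the results into a pandas DataFrame.
df = bq.Query(sql).execute().result().to_dataframe()
df.head()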

Solution

  1. Number of distinct visitors
SELECT COUNT(DISTINCT fullVisitorId) AS number_of_visitors
FROM `google_analytics_sample.ga_sessions_201707*`
  2. Visitors by country
SELECT geoNetwork.country, COUNT(DISTINCT fullVisitorId) AS number_of_visitors
FROM `google_analytics_sample.ga_sessions_201707*`
GROUP BY country ORDER BY number_of_visitors DESC
  3. Visitors by device category
SELECT device.deviceCategory, COUNT(DISTINCT fullVisitorId) AS number_of_visitors
FROM `google_analytics_sample.ga_sessions_201707*`
GROUP BY deviceCategory ORDER BY number_of_visitors DESC

Bonus exercise 1

  1. Learn about BigQuery as a data warehouse

Bonus exercise 2

  1. Run queries from BigQuery cookbook

Datalab clean up

  1. Shut down the notebooks and close the Datalab tabs

  2. Go back to Cloud Shell and close the SSH tunnel with CTRL-C

  3. View the state of your Datalab instance by

datalab list
  4. Stop the running instance by
datalab stop my-datalab

Lab 2

In this lab we

  1. run a streaming pipeline from Pub/Sub to BigQuery,
  2. run a batch pipeline from BigQuery to Datastore,
  3. schedule the batch pipeline and other tasks with Composer.

Products: Cloud Shell, Cloud Source Repositories, Pub/Sub, Cloud Dataflow, BigQuery, Cloud Datastore, Cloud Storage, Cloud Composer

Frameworks: Apache Beam, Apache Airflow

Preparations

  1. In Cloud Console, navigate to APIs & Services

  2. Enable APIs for Pub/Sub, Cloud Dataflow, and Cloud Composer

or, alternatively,

  1. Enable APIs from the Cloud Shell command line by
gcloud services enable pubsub.googleapis.com
gcloud services enable dataflow.googleapis.com
gcloud services enable composer.googleapis.com
  3. In the Cloud Shell Terminal settings, go to Terminal preferences / Keyboard and enable Alt is Meta. This ensures you can enter characters like [ and ] in the terminal without complications.

Pub/Sub

  1. Open Cloud Shell (preferably in a new tab) and clone this repository
git clone https://github.com/qvik/gcp-data-training.git

If you want to use your local code editor instead of the Cloud Shell code editor, follow these steps (you will need to have the Google Cloud SDK installed locally):

  • Create a repository in Cloud Source Repositories
gcloud source repos create gcp-data-training
  • In the repository folder, add remote
git remote add google https://source.developers.google.com/p/$GOOGLE_CLOUD_PROJECT/r/gcp-data-training
  • Push
git push google master
  • Clone the gcp-data-training repository to your laptop by following the instructions in Source Repositories

Back to Cloud Shell, everyone!

  2. Create a virtual environment for the publisher by
virtualenv --python=/usr/bin/python pubvenv
  3. Activate it and install the Python client for Pub/Sub
source pubvenv/bin/activate
pip install --upgrade google-cloud-pubsub numpy
  4. Open publisher.py in your code editor and inspect the code (a rough sketch of what such a publisher can look like is shown at the end of this section)

  5. In Cloud Console, navigate to Pub/Sub, create a topic stream_data_ingestion and a subscription process_stream_data for it

or, alternatively,

  1. Create the topic and its subscription from the command line
gcloud pubsub topics create stream_data_ingestion
gcloud pubsub subscriptions create --topic stream_data_ingestion process_stream_data
  6. Run publisher.py

  7. Open a new Cloud Shell tab and pull messages from the subscription to make sure data is flowing

gcloud pubsub subscriptions pull --auto-ack \
projects/$GOOGLE_CLOUD_PROJECT/subscriptions/process_stream_data
  8. Interrupt the Python process publisher.py with CTRL-C
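
For reference, the sketch below shows roughly what a publisher like publisher.py can look like; it is not the repository's actual code. The topic name matches the one created above, while the message fields (location, spend) are assumptions chosen to match the stream_data table used later in this lab.

# Illustrative sketch only -- the actual publisher.py in the repository may differ.
import json
import os
import random
import time
from datetime import datetime

from google.cloud import pubsub_v1

project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
topic_name = "stream_data_ingestion"  # the topic created above

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)

while True:
    # Build a small JSON message; the fields mirror the stream_data table schema.
    message = {
        "timestamp": datetime.utcnow().isoformat(),
        "location": random.choice(["Helsinki", "Stockholm", "Oslo"]),
        "spend": random.randint(1, 100),
    }
    # Pub/Sub message payloads are byte strings.
    publisher.publish(topic_path, json.dumps(message).encode("utf-8"))
    time.sleep(1)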

Streaming pipeline

  1. Open a new Cloud Shell tab and create a virtual environment for the pipeline
virtualenv --python=/usr/bin/python beamvenv
  2. Activate it and install the Apache Beam Python SDK
source beamvenv/bin/activate
pip install --upgrade apache-beam[gcp]
  3. Open stream_pipeline.py in your code editor and inspect the different suggestions for pipelines (a minimal end-to-end sketch is shown at the end of this section)

  4. Take a look at the Apache Beam Programming Guide, the Python reference, and the examples in GitHub

  5. Go to the BigQuery console and create a dataset my_dataset and an empty table stream_data with fields timestamp: TIMESTAMP, location: STRING, spend: INTEGER

  6. Launch publisher.py in another tab (in its virtual environment) and try out different pipelines with DirectRunner

python stream_pipeline.py --runner DirectRunner
  7. Interrupt the Python processes with CTRL-C

  8. Create a Cloud Storage bucket <project_id>-dataflow for Dataflow temp and staging either from the console or from the command line (see gsutil help)

gsutil mb -l europe-west1 gs://$GOOGLE_CLOUD_PROJECT-dataflow
  9. Take a look at the Dataflow documentation and run the pipeline in Dataflow
python stream_pipeline.py --runner DataflowRunner
  10. View the pipeline by navigating to Dataflow in Cloud Console

  11. Clean up by stopping the pipeline and the publisher script
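
For reference, a minimal streaming pipeline from Pub/Sub to BigQuery could look roughly like the sketch below. It is not the repository's stream_pipeline.py; the subscription and table names match the resources created above, and for the Dataflow run you would additionally pass --project, --region and --temp_location (the -dataflow bucket) on the command line.

# Illustrative sketch only -- stream_pipeline.py in the repository contains the actual exercise code.
import json
import os
import sys

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

def run(argv=None):
    # Command-line flags such as --runner are passed straight to Beam.
    options = PipelineOptions(argv)
    options.view_as(StandardOptions).streaming = True  # a Pub/Sub source needs streaming mode

    project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
    subscription = "projects/{}/subscriptions/process_stream_data".format(project_id)

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=subscription)
         | "ParseJson" >> beam.Map(json.loads)  # the publisher sends JSON-encoded bytes
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               "my_dataset.stream_data",  # the table created above, in the default project
               schema="timestamp:TIMESTAMP,location:STRING,spend:INTEGER",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

if __name__ == "__main__":
    run(sys.argv[1:])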

Batch pipeline

  1. In Cloud Console, navigate to Datastore and create a database

  2. Inspect batch_pipeline.py in your code editor and fill in the missing code (one possible shape for the pipeline is sketched at the end of this section)

  3. Run the pipeline to make sure it works (activate beamvenv first)

python batch_pipeline.py --runner DirectRunner
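
For reference, one possible shape for a batch pipeline from BigQuery to Datastore is sketched below; the actual exercise in batch_pipeline.py may aggregate the data differently. The query, the Datastore kind SpendByLocation, and the key layout are assumptions.

# Illustrative sketch only -- batch_pipeline.py in the repository is the actual exercise file.
import os
import sys

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import WriteToDatastore
from apache_beam.io.gcp.datastore.v1new.types import Entity, Key

PROJECT = os.environ["GOOGLE_CLOUD_PROJECT"]

# A hypothetical aggregation over the streamed data.
QUERY = """
SELECT location, SUM(spend) AS total_spend
FROM `my_dataset.stream_data`
GROUP BY location
"""

def to_entity(row):
    # Use the location as the key name; the kind name is an assumption.
    key = Key(["SpendByLocation", row["location"]], project=PROJECT)
    entity = Entity(key)
    entity.set_properties({"location": row["location"], "total_spend": row["total_spend"]})
    return entity

def run(argv=None):
    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (p
         | "ReadFromBigQuery" >> beam.io.Read(
               beam.io.BigQuerySource(query=QUERY, use_standard_sql=True))
         | "ToEntity" >> beam.Map(to_entity)
         | "WriteToDatastore" >> WriteToDatastore(PROJECT))

if __name__ == "__main__":
    run(sys.argv[1:])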

Cloud Composer

  1. In Cloud Console, navigate to Cloud Composer

  2. Create an environment named data-transfer-environment in europe-west1 (this takes a while to finish)

  3. Take a look at Cloud Composer documentation

  4. Create a Cloud Storage bucket for data export

gsutil mb -l europe-west1 gs://$GOOGLE_CLOUD_PROJECT-data-export
  5. Copy the pipeline into the storage bucket
gsutil cp pipelines/batch_pipeline.py gs://$GOOGLE_CLOUD_PROJECT-dataflow/pipelines/
  6. Take a look at the Apache Airflow API Reference

  7. Open scheduler.py in your code editor and fill in the missing code (a rough outline of such a DAG is sketched at the end of this section)

  8. Once the environment is ready, navigate to the Airflow web UI and explore it

  9. Set the Airflow variable

gcloud composer environments run data-transfer-environment \
    --location europe-west1 variables -- --set gcp_project $GOOGLE_CLOUD_PROJECT
  10. Submit your DAG to Composer by copying scheduler.py into the dags folder of your Composer environment bucket

  11. Run the pipeline manually, if necessary, and inspect the runs in the web UI

  12. View the pipeline by navigating to Dataflow in Cloud Console
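
For reference, a DAG like the one scheduler.py should contain could look roughly like the sketch below; it is not the repository's actual code. It assumes the gcp_project Airflow variable set above, the batch_pipeline.py file copied into the -dataflow bucket, and the DataFlowPythonOperator from Airflow 1.x contrib (available in Cloud Composer at the time); names such as batch_pipeline_scheduler are made up.

# Illustrative sketch only -- scheduler.py in the repository is the actual exercise file.
import datetime

from airflow import models
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

# The gcp_project Airflow variable was set with the gcloud command above.
project_id = models.Variable.get("gcp_project")

default_args = {
    "start_date": datetime.datetime(2019, 1, 1),
    # Default options picked up by the Dataflow operator below.
    "dataflow_default_options": {
        "project": project_id,
        "temp_location": "gs://{}-dataflow/temp".format(project_id),
        "runner": "DataflowRunner",
    },
}

# Run the batch pipeline once a day.
with models.DAG(
        "batch_pipeline_scheduler",
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_args) as dag:

    run_batch_pipeline = DataFlowPythonOperator(
        task_id="run_batch_pipeline",
        py_file="gs://{}-dataflow/pipelines/batch_pipeline.py".format(project_id),
    )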

Exercise

  1. Draw an architecture diagram of the pipelines in Lab 2

Clean up

  1. In Cloud Console, delete the Composer environment and its storage bucket

  2. Check that no Dataflow pipelines are running

Lab 3

In this lab we train a deep neural network TensorFlow model on Cloud ML Engine. The task is to predict whether a marketing phone call made by a Portuguese banking institution will be successful.

Products: Cloud ML Engine

Frameworks: TensorFlow

Preparations

  1. Enable the Cloud ML Engine API
gcloud services enable ml.googleapis.com

Preparing the model

  1. Visit the origin of the dataset at the UCI Machine Learning Repository

  2. For your convenience, the data has been prepared into training and evaluation sets in mlengine/data/. All the numerical variables except for age have been normalized.

  3. Open the model file trainer/model.py in your code editor and examine the objects CSV_COLUMNS, INPUT_COLUMNS, etc., which encode the data format

  4. Take a look at the TensorFlow documentation and fill in the missing feature columns in the build_estimator function (a rough sketch of possible feature columns follows this section)

  5. Navigate to the repository folder in your Cloud Shell and set the environment variables

TRAIN_DATA=$(pwd)/mlengine/data/bank_data_train.csv
EVAL_DATA=$(pwd)/mlengine/data/bank_data_eval.csv
MODEL_DIR=$(pwd)/mlengine/output
  6. Change directory to mlengine and try the training locally
gcloud ml-engine local train \
    --module-name trainer.task \
    --package-path trainer/ \
    --job-dir $MODEL_DIR \
    -- \
    --train-files $TRAIN_DATA \
    --eval-files $EVAL_DATA \
    --train-steps 1000 \
    --eval-steps 100
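
For reference, the feature columns for build_estimator might be defined along the lines of the sketch below. This is not the repository's trainer/model.py: the column names and vocabularies are examples from the UCI bank marketing data and may not match the prepared CSV files exactly.

# Illustrative sketch only -- trainer/model.py in the repository defines the real columns.
import tensorflow as tf

# A few example columns; the full lists live in CSV_COLUMNS / INPUT_COLUMNS.
age = tf.feature_column.numeric_column("age")
duration = tf.feature_column.numeric_column("duration")
job = tf.feature_column.categorical_column_with_vocabulary_list(
    "job", ["admin.", "blue-collar", "technician", "services", "management"])
marital = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital", ["married", "single", "divorced"])

def build_estimator(model_dir, hidden_units=None):
    # Categorical columns are one-hot encoded before being fed to the DNN.
    feature_columns = [
        age,
        duration,
        tf.feature_column.indicator_column(job),
        tf.feature_column.indicator_column(marital),
    ]
    return tf.estimator.DNNClassifier(
        model_dir=model_dir,
        feature_columns=feature_columns,
        hidden_units=hidden_units or [64, 32])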

Training the model in ML Engine

  1. Set the environment variables
BUCKET_NAME=$GOOGLE_CLOUD_PROJECT-mlengine
REGION=europe-west1
  2. Create a bucket for ML Engine jobs
gsutil mb -l $REGION gs://$BUCKET_NAME
  3. Copy the data into the bucket
gsutil cp $TRAIN_DATA $EVAL_DATA gs://$BUCKET_NAME/data/
  4. Reset the environment variables for data
TRAIN_DATA=gs://$BUCKET_NAME/data/bank_data_train.csv
EVAL_DATA=gs://$BUCKET_NAME/data/bank_data_eval.csv
  5. Set the environment variables for the training job
JOB_NAME=bank_marketing_1
OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
  6. Run the training job in ML Engine
gcloud ml-engine jobs submit training $JOB_NAME \
    --job-dir $OUTPUT_PATH \
    --runtime-version 1.10 \
    --module-name trainer.task \
    --package-path trainer/ \
    --region $REGION \
    -- \
    --train-files $TRAIN_DATA \
    --eval-files $EVAL_DATA \
    --train-steps 10000 \
    --eval-steps 1000 \
    --verbosity DEBUG
  7. View the job logs in Cloud Shell
gcloud ml-engine jobs stream-logs $JOB_NAME

or, alternatively,

  1. Inspect the training process on TensorBoard (open web preview on port 6006)
tensorboard --logdir=$OUTPUT_PATH

Hyperparameter tuning

  1. Learn more about hyperparameter tuning (see also here)

  2. Open hptuning_config.yaml in your code editor and fill in the missing code

  3. In the mlengine folder, set the environment variables

HPTUNING_CONFIG=$(pwd)/hptuning_config.yaml
JOB_NAME=bank_marketing_hptune_1
OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
  4. Run training with hyperparameter tuning
gcloud ml-engine jobs submit training $JOB_NAME \
    --job-dir $OUTPUT_PATH \
    --runtime-version 1.10 \
    --config $HPTUNING_CONFIG \
    --module-name trainer.task \
    --package-path trainer/ \
    --region $REGION \
    --scale-tier STANDARD_1 \
    -- \
    --train-files $TRAIN_DATA \
    --eval-files $EVAL_DATA \
    --train-steps 10000 \
    --eval-steps 1000 \
    --verbosity DEBUG
  5. View the job logs in Cloud Shell
gcloud ml-engine jobs stream-logs $JOB_NAME

or, alternatively,

  1. Inspect the training process on TensorBoard (open web preview on port 6006)
tensorboard --logdir=$OUTPUT_PATH/<trial_number>/

Deployment

  1. Set the environment variable
MODEL_NAME=bank_marketing
  2. Create a model in ML Engine
gcloud ml-engine models create $MODEL_NAME --regions=$REGION
  3. Select the job output to use and look up the path to model binaries
gsutil ls -r $OUTPUT_PATH
  4. Set the environment variable with the correct values for <trial_number> and <timestamp>
MODEL_BINARIES=$OUTPUT_PATH/<trial_number>/export/bank_marketing/<timestamp>/
  5. Create a version of the model
gcloud ml-engine versions create v1 \
    --model $MODEL_NAME \
    --origin $MODEL_BINARIES \
    --runtime-version 1.10
  6. From the mlengine folder, inspect the test instances
cat data/bank_data_test_no.json
cat data/bank_data_test_yes.json
  7. Get the prediction for two test instances (an equivalent call from Python is sketched after these commands)
gcloud ml-engine predict \
    --model $MODEL_NAME \
    --version v1 \
    --json-instances \
    data/bank_data_test_no.json
gcloud ml-engine predict \
    --model $MODEL_NAME \
    --version v1 \
    --json-instances \
    data/bank_data_test_yes.json
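
For reference, the same prediction can be requested from Python through the ML Engine online prediction API (via the google-api-python-client package). The sketch below assumes the test file contains one JSON instance per line, as the --json-instances flag expects, and your-project-id is a placeholder.

# Illustrative sketch: calling the deployed model from Python instead of gcloud.
import json

from googleapiclient import discovery

project = "your-project-id"  # replace with your project id
name = "projects/{}/models/{}/versions/{}".format(project, "bank_marketing", "v1")

# The test files contain one JSON instance per line.
with open("data/bank_data_test_yes.json") as f:
    instance = json.loads(f.readline())

service = discovery.build("ml", "v1")
response = service.projects().predict(name=name, body={"instances": [instance]}).execute()
print(response.get("predictions", response))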

The End
