Here is the scenario we'll be working through in this workshop. You are a data engineer at DoggySwag, an online retailer of dog products. Your analytics team wants to understand how your customers navigate your website in order to find and buy products, in order to optimize the user experience and increase spending.
User behavior data is available through events that have been captured with a web analytics tool, whereas the customer spend data is stored in your own SQL database. The web analytics tool has a reporting layer that provides user behavior insight, but this doesn't include data about customer spending. Your analysts can run SQL queries against the customer spend data, but can't correlate this with their online behavior data.
In this workshop you'll learn how to create a Redshift data warehouse that centralizes data from both these data sources. This way your analysts can run queries across the data sets.
You must have an AWS account and an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console and creating various resources. Your IAM permissions must also include access to create IAM roles and policies created by the AWS CloudFormation template.
In this section we'll set up a new VPC with the following resources:
- A Redshift cluster along with the security group rules necessary for accessing it from Etleap.
- An S3 bucket to host our data lake.
- A Glue Catalog database to store data lake metadata.
- An IAM role that will be assigned to the Redshift cluster and used to access the data lake.
- An IAM user that will be used by Etleap to access the data lake.
Log into your AWS account and go to the page to create the stack for this workshop.
- Make sure the AWS region selected is N. Virginia (us-east-1).
- Specify a root password for your Redshift cluster. This must consist of at least 8 alphanumeric characters only, and must contain a lower-case letter, an upper-case letter, and a number.
- The other fields have sensible defaults, but feel free to modify as you see fit.
After entering the all the parameter values:
- Click 'Next'.
- On the next screen, enter any required tags, an IAM role, or any advanced options, and then click 'Next'.
- Review the details on the final screen, Check the box for 'I acknowledge that AWS CloudFormation might create IAM resources' and then choose 'Create' to start building the resources.
- Click 'Create stack' and wait for the creation to complete.
We'll ETL data from two different data sources into your Redshift data warehouse. One source, an S3 bucket, contains JSON-formatted click event data. The other, a MySQL database, contains information about users.
The first thing you'll need to do is log into Etleap using the credentials that were provided to you via email. The email has the subject line 'Your Etleap Account'.
In the rest of this section we'll connect Etleap to the data sources and Redshift destination, so that we can create pipelines in the next section.
Set up the S3 Input connection here. Use the following values:
- Name:
Website Events
- Access ID: (see email with subject 'Etleap and AWS Redshift Dev Days Instructions')
- Secret Key: (see email with subject 'Etleap and AWS Redshift Dev Days Instructions')
- Data Bucket:
etleap-redshift-devdays
- Base Directory:
events
- Additional Properties: Leave as their defaults.
Click 'Create Connection'.
Set up the MySQL connection here. Use the following values:
- Name:
Webstore
- Connection Method: Direct
- Address:
dbtest.etleap.com
- Port:
3306
- Username:
etl
- Password:
Gpte7q3IOtzP
- Database:
webstore
- Additional properties: Leave as their defaults.
Click 'Create Connection'
For this setup you'll need the values from your CloudFormation stack. These are available on the Outputs tab in the Stack Info page.
Set up the Redshift connection here.
- Leave the name as
Amazon Redshift
- Connection Method: Direct
- Connection Information:
- Address: Use the 'Value' of 'RedshiftClusterHostnameOutput' from your CloudFormation stack. Make sure you remove any whitespace at the end of the input.
- Port:
5439
- Authentication:
- Username:
root
- Password: Use the 'Value' of 'RedshiftClusterPasswordOutput' from your CloudFormation stack.
- Username:
- Database properties:
- Database:
warehouse
- Schema:
public
- Database:
- Additional properties: Leave as their defaults.
Click 'Create Connection'.
In this section we'll configure pipelines that will ETL data from the sources into the Redshift database.
- Click the 'Create' button in the top nav-bar in Etleap.
- Pick 'Website Events' as the source.
- This page lists the files and folders available in S3. Click the radio button in the top-left to select the top-level directory.
- Click 'Wrangle Data'.
- Wrangle the data. At a minimum, specify the following rules:
- Split out the event type from the JSON object: highlight the whitespace between the event data and the JSON object in the first row, and pick the suggestion on the right that says
Split data once on ' '
. - Rename the event type column: double-click the header where it says 'split', enter 'event_type', and click enter.
- Parse the JSON object: single-click the 'split1' column and pick the first suggestion on the right.
- Split out the event type from the JSON object: highlight the whitespace between the event data and the JSON object in the first row, and pick the suggestion on the right that says
- Click 'Next'.
- Pick 'Amazon Redshift' as the destination.
- Specify the following destination values:
- Table name:
Website_Events
- Pipeline name:
Website Events
- Table name:
- Click 'Next'.
- Click 'Start ETLing'.
- Click the 'Create' button in the top nav-bar in Etleap.
- Pick 'Webstore' as the source.
- Pick 'purchases' as the table to import.
- Click 'Skip Wrangling'.
- Click 'Generate automatically'.
- Leave the 'Primary key' as 'id' and 'Update timestamp' as 'update_date', and click 'Next'.
- Pick 'Amazon Redshift' as the destination.
- Leave all the options as their defaults in this step and click 'Next'.
- Click 'Start ETLing'.
Etleap is now ETL'ing the data from the sources to Redshift. This will take 5-10 minutes. You can monitor progress here. Once you see events saying 'Website Events loaded successfully' and 'purchases loaded successfully' you can proceed to the next section.
Now that we have our data in Redshift we'll run a query that uses both datasets: we'll figure out how many clicks users have on average on our site segmented by whether or not they have made at least one purchase.
For this setup you'll need the values from your CloudFormation stack. These are available on the Outputs tab in the Stack Info page.
- Go to the Redshift query editor.
- Connect to your Redshift cluster in the 'Credentials' input:
- Cluster: Pick the cluster that begins with 'etleap-redshift-workshop'.
- Database:
warehouse
- Database user:
root
- Database password: Use the 'Value' of 'RedshiftClusterPasswordOutput' from your CloudFormation stack.
- Enter the following query:
WITH users_with_purchases AS (
SELECT DISTINCT p.user_id
FROM purchases p
), clicks_per_user AS (
SELECT split1_userid, COUNT(*) AS clicks
FROM Website_Events
WHERE event_type = 'Click'
GROUP BY split1_userid)
SELECT
SUM(CASE WHEN uwp.user_id IS NOT NULL THEN cpu.clicks ELSE 0 END) /
SUM(CASE WHEN uwp.user_id IS NOT NULL THEN 1 ELSE 0 END) AS with_purchase,
SUM(CASE WHEN uwp.user_id IS NULL THEN cpu.clicks ELSE 0 END) /
SUM(CASE WHEN uwp.user_id IS NULL THEN 1 ELSE 0 END) AS without_purchase
FROM clicks_per_user cpu
LEFT JOIN users_with_purchases uwp
ON cpu.split1_userid = uwp.user_id
- Click 'Run query'.
As you can see, users that have made a purchase have clicked about 36% more on average as those who haven't made a purchase.
In this section we'll configure pipelines that will ETL data from the sources into your S3/Glue Catalog data lake.
Set up the S3 Data Lake connection here. Use the following values:
- Leave the name as
Amazon S3 Data Lake
- Create an access ID and a secret key:
- Make a note of the
DataLakeIAMUser
output from your CloudFormation stack. - Going to the IAM users list and click this user.
- Go to the 'Security credentials' tab.
- Click 'Create access key'.
- Input these values into the Etleap page.
- Make a note of the
- For the bucket, use the
S3DataLakeBucket
output from your CloudFormation stack. Make sure you remove any whitespace at the end of the input. - Leave the base directory as '/'.
- For the Glue database, use the
GlueCatalogDBName
output from your CloudFormation stack. Make sure you remove any whitespace at the end of the input. - For the Glue catalog region, specify 'us-east-1'.
- Click 'Create Connection'. Click 'Ignore and Continue' for the warning about not having data in the input path.
This is similar to the S3-to-Redshift pipeline, except this time the destination is S3/Glue.
- Click the 'Create' button in the top nav-bar in Etleap.
- Pick 'Website Events' as the source.
- This page lists the files and folders available in S3. Click the radio button in the top-left to select the top-level directory.
- Click 'Skip Wrangling'.
- Select the script from the 'Website Events' pipeline and click 'Next'.
- Select
Amazon S3 Data Lake
as the destination. - Specify the following destination values:
- Table name:
Website_Events
- Pipeline name:
Website Events - Lake
- Table name:
- Click 'More destination options' and select 'Parquet' as the output format.
- Click 'Start ETLing'.
Now for the fun part - we're going to use Redshift Spectrum to query data stored both in Redshift and S3 in the same query. First we need to hook up Redshift to Glue Catalog:
- Go to the Redshift query editor.
- Replace the
DataLakeIAMUser
andRedshiftSpectrumIAMRole
output from your CloudFormation stack in the query below and execute it.
CREATE external SCHEMA spectrumdb
FROM data catalog DATABASE '<GlueCatalogDBName>'
IAM_ROLE '<RedshiftSpectrumIAMRole>'
CREATE external DATABASE if not exists;
- Now we're ready to execute the query. Let's use the same query as before, but now using the data in S3 instead. The only difference in this query is the 'spectrumdb.' prefix in the from-clause in the clicks_per_user common table expression.
WITH users_with_purchases AS (
SELECT DISTINCT p.user_id
FROM purchases p
), clicks_per_user AS (
SELECT split1_userid, COUNT(*) AS clicks
FROM spectrumdb.Website_Events
WHERE event_type = 'Click'
GROUP BY split1_userid)
SELECT
SUM(CASE WHEN uwp.user_id IS NOT NULL THEN cpu.clicks ELSE 0 END) /
SUM(CASE WHEN uwp.user_id IS NOT NULL THEN 1 ELSE 0 END) AS with_purchase,
SUM(CASE WHEN uwp.user_id IS NULL THEN cpu.clicks ELSE 0 END) /
SUM(CASE WHEN uwp.user_id IS NULL THEN 1 ELSE 0 END) AS without_purchase
FROM clicks_per_user cpu
LEFT JOIN users_with_purchases uwp
ON cpu.split1_userid = uwp.user_id
- Go to AWS CloudFormation console.
- Make sure the AWS region selected is N. Virginia (us-east-1).
- Select the CloudFormation Stack you created in this workshop.
- Click Delete.
- Click “Delete Stack”.
This will take few min to delete all the resources.