There appear to be unnecessary joins. We also need to ascertain that the queries scale to a million rows; if Redshift cannot handle that volume, we need to work out how to deal with it.
Create a simplified data model to get insights from transaction log data.
Approach
Consider the following entities:
Card Type: Information about a gift card type
Card Data: Information about a specific card (a sold, physical card)
Merchant: Information about a merchant
Customer: Customer data
Transaction log: Sufficiently denormalized transaction log data fit for insight extraction
Notes
We purposefully denormalize the transaction log data model to facilitate efficient queries from an OLAP standpoint.
All other entities are treated as reference data that holds true over time.
The data model presented here stems purely from an OLAP standpoint and is hence not normalized. It is assumed that some system of record is the ultimate owner/origin of the data and that it, in turn, holds the data in a fully normalized form. It is reasonable to assume that data flows into the DW from those systems and has been sufficiently pre-processed to facilitate OLAP operations.
Outcomes
DDL for the tables (a sketch follows this list)
Some sample data to start with that can be pushed into S3
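As a concrete starting point, here is a minimal sketch of the denormalized transaction log DDL in Redshift. All table and column names are illustrative assumptions rather than the final model; the DISTKEY/SORTKEY choices foreshadow the tuning discussed under Next Steps.

    -- Illustrative sketch only: names and types are assumptions.
    CREATE TABLE transaction_log (
        txn_id          BIGINT        NOT NULL,
        txn_ts          TIMESTAMP     NOT NULL,
        txn_type        VARCHAR(20)   NOT NULL,   -- e.g. ACTIVATION, REDEMPTION, RELOAD
        txn_amount      DECIMAL(12,2) NOT NULL,
        balance_after   DECIMAL(12,2),            -- card balance after this transaction
        card_id         BIGINT        NOT NULL,
        -- denormalized card/merchant/customer attributes, per the OLAP notes above
        card_type_code  VARCHAR(20),
        loop_type       VARCHAR(10),              -- OPEN / CLOSED / SEMI_OPEN
        merchant_id     BIGINT,
        merchant_name   VARCHAR(100),
        customer_id     BIGINT
    )
    DISTKEY (card_id)   -- co-locate a card's transactions on one slice
    SORTKEY (txn_ts);   -- range-restricted scans for time-window queries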
Do the following outcomes of the exercise sound reasonable?
An engine/model that can look at the data and provide breakage predictions? The thinking behind the model is what matters, not the actual data or training; in any case, real training data is unavailable. Is a rudimentary implementation of the engine fine for now?
Are the technology components used to surface the data and present it in a simple UI app acceptable?
Are a robust architecture and a broad vision for the platform important at this point? Is it acceptable to show the bigger picture but implement only the achievable parts for the moment (stubbing out the rest)?
The immediate goal is to gain some preliminary insights from the data. The data comprises transaction data and some reference data, essentially written once but read many times for analytical purposes.
How?
It would make sense to have transactional data (and reference data as well) loaded into S3, and then use a variety of tools to look at the data.
We use Amazon Redshift to start with: we copy data from S3 into Redshift and do some EDA (a sample COPY command follows this paragraph). Later on, we will attempt to use Redshift Spectrum instead of copying data over to Redshift.
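A minimal sketch of that load step, assuming a CSV extract; the bucket path and IAM role ARN are placeholders.

    -- Placeholder bucket and IAM role; adjust to the actual environment.
    COPY transaction_log
    FROM 's3://gc-demo-bucket/transaction-log/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1
    TIMEFORMAT 'auto';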
We will consume the Redshift queries in a thin API layer (REST APIs).
What?
Model card data, customer data, merchant data and transaction log data.
Do EDA to get some meaningful insights (a sample query follows this list).
Expose insights via REST APIs. Spring Boot stack.
Time permitting, build a ReactJS UI.
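To make the EDA step concrete, a first query might summarize redemption activity by card type and month. This is a sketch against the assumed transaction_log columns from the DDL above.

    -- Where are balances being spent, and where are they not?
    SELECT card_type_code,
           DATE_TRUNC('month', txn_ts) AS txn_month,
           COUNT(*)                    AS txn_count,
           SUM(txn_amount)             AS total_redeemed
    FROM transaction_log
    WHERE txn_type = 'REDEMPTION'
    GROUP BY 1, 2
    ORDER BY 1, 2;

Card types with declining redemption counts but outstanding balances are natural first candidates for breakage analysis.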
The overarching objective is a fully working end-to-end model that can be demonstrated. This demonstration should exhibit sound engineering practices, architectural maturity, design, logical thinking, and coding capability.
Exclusions
The model in this challenge is not expected to work at massive scale. The analytical queries can be tuned for scale progressively at a later time.
Do these make sense, vis-a-vis objectives and schedule constraints?
As a card issuer, I need to see which cards have what probability of breakage, at what point in time, and the overall breakage value.
As a card issuer, I need to know the schedule on which notifications should be triggered, and to go ahead and approve the notification triggers for customers (system-assisted, but perhaps human-approved); a supporting query sketch follows this list.
As a card issuer, I want a simple user interface that gives me the predicted escheatment risk value and the possible uplift from timely triggers.
Load up a decent amount of data so that tests cover positive, negative, and edge cases. The focus right now is not scale but validation and demonstration of insights.
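As a sketch of the system-assisted schedule in the second story, the query below finds cards with a remaining balance and no activity for twelve months, using the assumed transaction_log columns; the twelve-month dormancy window is an arbitrary placeholder, not a business rule.

    -- Dormant cards with money left on them: notification candidates.
    WITH latest AS (
        SELECT card_id, txn_ts, balance_after,
               ROW_NUMBER() OVER (PARTITION BY card_id
                                  ORDER BY txn_ts DESC) AS rn
        FROM transaction_log
    )
    SELECT card_id,
           txn_ts        AS last_activity,
           balance_after AS remaining_balance
    FROM latest
    WHERE rn = 1
      AND balance_after > 0
      AND txn_ts < DATEADD(month, -12, CURRENT_DATE)
    ORDER BY remaining_balance DESC;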
Outcome
Load up S3 with the following:
Load around 50 different gift cards covering open-loop, closed-loop, and semi-open-loop cards.
Load Merchant data
Load Transaction log data
Load customer data
Pre-calculate historical breakage rates by card for a few years and populate the historical breakage rate data; this is derived data (a sketch follows this list).
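One naive way to derive the historical breakage rate data is to compare activated versus redeemed value per card type and year, as sketched below. This cohorting ignores redemptions that cross year boundaries, so it is a placeholder definition for the demo, not the production definition of breakage.

    -- Derived data: rough yearly breakage rate per card type.
    CREATE TABLE historical_breakage_rate AS
    WITH by_year AS (
        SELECT card_type_code,
               DATE_PART(year, txn_ts) AS txn_year,
               SUM(CASE WHEN txn_type = 'ACTIVATION' THEN txn_amount ELSE 0 END) AS issued_value,
               SUM(CASE WHEN txn_type = 'REDEMPTION' THEN txn_amount ELSE 0 END) AS redeemed_value
        FROM transaction_log
        GROUP BY 1, 2
    )
    SELECT card_type_code,
           txn_year,
           issued_value,
           redeemed_value,
           (issued_value - redeemed_value) / NULLIF(issued_value, 0) AS breakage_rate
    FROM by_year;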
List the items you would like/need to do if given more time.
Next Steps
The analytical queries are not tuned for massive scale. They would need to be tuned to leverage Redshift's query optimization features, such as carefully chosen DISTKEY and SORTKEY settings (see the DDL sketch earlier).
Use Redshift Spectrum to query S3 data directly and have Redshift act as a conduit between the business application and the OLAP engine (a sketch follows this list).
Tweak the data model so that it mimics the actual GC business model more closely. There are opportunities to merge certain entities to facilitate better OLAP queries.
Build a UI that surfaces the data and presents powerful visualizations.
Implement a more advanced algorithm for breakage forecasting.
An ambitious goal would be to develop a robust breakage-forecast machine learning model trained on actual production data. This is of significant commercial value.
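For the Spectrum step above, here is a sketch of the external schema and table definitions; the catalog database name, IAM role, and S3 location are placeholders.

    -- Query the S3 extract in place instead of copying it into Redshift.
    CREATE EXTERNAL SCHEMA gc_spectrum
    FROM DATA CATALOG
    DATABASE 'gc_catalog'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    CREATE EXTERNAL TABLE gc_spectrum.transaction_log (
        txn_id      BIGINT,
        txn_ts      TIMESTAMP,
        txn_type    VARCHAR(20),
        txn_amount  DECIMAL(12,2),
        card_id     BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://gc-demo-bucket/transaction-log/';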
Thoughts on cost
Point 1 requires a massive amount of actual test data and some effort.
Points 2 through 4 in the list above are fairly straightforward.
Point 5 requires effort, a more comprehensive data model, and business research.
Point 6 is necessary to build production-worthy code. It takes a non-trivial amount of time and effort.
Point 7 is a significantly complex effort but has the highest commercial value. It is probably a valuable goal for the GC business and could be a real income generator.