There appear to be unnecessary joins. We also need to ascertain that the queries scale to a million rows; if Redshift cannot handle that volume, we need to work out how to deal with it.
Create a simplified data model to get insights from transaction log data.
Approach
Consider the following entities:
Card Type: Information about a gift card type
Card Data: Information about a specific card (a sold, physical card)
Merchant: Information about a merchant
Customer: Customer data
Transaction log: Sufficiently denormalized transaction log data fit for insight extraction
Notes
We purposefully denormalize the transaction log data model to facilitate efficient queries from an OLAP standpoint.
All other entities are treated as reference data that holds true over time.
The data model presented here stems purely from an OLAP standpoint and is hence not normalized. It is assumed that some system of record is the ultimate owner/origin of the data and that it, in turn, holds the data in a fully normalized form. It is reasonable to assume that data flows into the DW from those systems and has been sufficiently pre-processed to facilitate OLAP operations.
Outcomes
DDL for the tables (a sketch follows this list)
Some sample data to start with that can be pushed into S3
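As a concrete starting point, here is a minimal sketch of the denormalized transaction log DDL in Redshift. All table and column names are illustrative assumptions rather than the final model; the DISTKEY/SORTKEY choices foreshadow the tuning discussed under Next Steps.

    -- Illustrative sketch only: names and types are assumptions.
    CREATE TABLE transaction_log (
        txn_id          BIGINT        NOT NULL,
        txn_ts          TIMESTAMP     NOT NULL,
        txn_type        VARCHAR(20)   NOT NULL,   -- e.g. ACTIVATION, REDEMPTION, RELOAD
        txn_amount      DECIMAL(12,2) NOT NULL,
        balance_after   DECIMAL(12,2),            -- card balance after this transaction
        card_id         BIGINT        NOT NULL,
        -- denormalized card/merchant/customer attributes, per the OLAP notes above
        card_type_code  VARCHAR(20),
        loop_type       VARCHAR(10),              -- OPEN / CLOSED / SEMI_OPEN
        merchant_id     BIGINT,
        merchant_name   VARCHAR(100),
        customer_id     BIGINT
    )
    DISTKEY (card_id)   -- co-locate a card's transactions on one slice
    SORTKEY (txn_ts);   -- range-restricted scans for time-window queries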
Do the following outcomes of the exercise sound reasonable?
An engine/model that can look at the data and provide breakage predictions? The thinking behind the model is what matters, not the actual data or training; in any case, real training data is unavailable. Is a rudimentary implementation of the engine fine for now?
Are the technology components used to surface the data and present it in a simple UI app acceptable?
Are a robust architecture and a broad vision for the platform important at this point? Is it acceptable to show the bigger picture but implement only the achievable parts for the moment (stubbing out the rest)?
The immediate goal is to gain some preliminary insights from the data. The data comprises transaction data and some reference data, essentially written once but read many times for analytical purposes.
How?
It would make sense to have transactional data (and reference data as well) loaded into S3, and then use a variety of tools to look at the data.
We use Amazon Redshift to start with: we copy data from S3 into Redshift and do some EDA (a sample COPY command follows this paragraph). Later on, we will attempt to use Redshift Spectrum instead of copying data over to Redshift.
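A minimal sketch of that load step, assuming a CSV extract; the bucket path and IAM role ARN are placeholders.

    -- Placeholder bucket and IAM role; adjust to the actual environment.
    COPY transaction_log
    FROM 's3://gc-demo-bucket/transaction-log/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1
    TIMEFORMAT 'auto';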
We will consume the Redshift queries in a thin API layer (REST APIs).
What?
Model card data, customer data, merchant data and transaction log data.
Do EDA to get some meaningful insights (a sample query follows this list).
Expose insights via REST APIs. Spring Boot stack.
Time permitting, build a ReactJS UI.
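To make the EDA step concrete, a first query might summarize redemption activity by card type and month. This is a sketch against the assumed transaction_log columns from the DDL above.

    -- Where are balances being spent, and where are they not?
    SELECT card_type_code,
           DATE_TRUNC('month', txn_ts) AS txn_month,
           COUNT(*)                    AS txn_count,
           SUM(txn_amount)             AS total_redeemed
    FROM transaction_log
    WHERE txn_type = 'REDEMPTION'
    GROUP BY 1, 2
    ORDER BY 1, 2;

Card types with declining redemption counts but outstanding balances are natural first candidates for breakage analysis.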
The overarching objective is a fully working end-to-end model that can be demonstrated. This demonstration should exhibit sound engineering practices, architectural maturity, design, logical thinking, and coding capability.
Exclusions
The model in this challenge is not expected to work at massive scale. The analytical queries can be tuned for scale progressively at a later time.
Do these make sense, vis-a-vis objectives and schedule constraints?
As a card issuer, I need to see which cards have what probability of breakage, at what point in time, and the overall breakage value.
As a card issuer, I need to know the schedule on which notifications should be triggered, and to go ahead and approve the notification triggers for customers (system-assisted, but perhaps human-approved); a supporting query sketch follows this list.
As a card issuer, I want a simple user interface that gives me the predicted escheatment risk value and the possible uplift from timely triggers.
Load up a decent amount of data so that tests cover positive, negative, and edge cases. The focus right now is not scale but validation and demonstration of insights.
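As a sketch of the system-assisted schedule in the second story, the query below finds cards with a remaining balance and no activity for twelve months, using the assumed transaction_log columns; the twelve-month dormancy window is an arbitrary placeholder, not a business rule.

    -- Dormant cards with money left on them: notification candidates.
    WITH latest AS (
        SELECT card_id, txn_ts, balance_after,
               ROW_NUMBER() OVER (PARTITION BY card_id
                                  ORDER BY txn_ts DESC) AS rn
        FROM transaction_log
    )
    SELECT card_id,
           txn_ts        AS last_activity,
           balance_after AS remaining_balance
    FROM latest
    WHERE rn = 1
      AND balance_after > 0
      AND txn_ts < DATEADD(month, -12, CURRENT_DATE)
    ORDER BY remaining_balance DESC;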
Outcome
Load up S3 with the following:
Load around 50 different gift cards covering open-loop, closed-loop, and semi-open-loop cards.
Load Merchant data
Load Transaction log data
Load customer data
Pre-calculate historical breakage rates by card for a few years and populate the historical breakage rate data; this is derived data (a sketch follows this list).
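One naive way to derive the historical breakage rate data is to compare activated versus redeemed value per card type and year, as sketched below. This cohorting ignores redemptions that cross year boundaries, so it is a placeholder definition for the demo, not the production definition of breakage.

    -- Derived data: rough yearly breakage rate per card type.
    CREATE TABLE historical_breakage_rate AS
    WITH by_year AS (
        SELECT card_type_code,
               DATE_PART(year, txn_ts) AS txn_year,
               SUM(CASE WHEN txn_type = 'ACTIVATION' THEN txn_amount ELSE 0 END) AS issued_value,
               SUM(CASE WHEN txn_type = 'REDEMPTION' THEN txn_amount ELSE 0 END) AS redeemed_value
        FROM transaction_log
        GROUP BY 1, 2
    )
    SELECT card_type_code,
           txn_year,
           issued_value,
           redeemed_value,
           (issued_value - redeemed_value) / NULLIF(issued_value, 0) AS breakage_rate
    FROM by_year;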
List the items you would like/need to do if given more time.
Next Steps
The analytical queries are not tuned for massive scale. They would need to be tuned to leverage Redshift's query optimization features, such as carefully chosen DISTKEY and SORTKEY settings (see the DDL sketch earlier).
Use Redshift Spectrum to query S3 data directly and have Redshift act as a conduit between the business application and the OLAP engine (a sketch follows this list).
Tweak the data model so that it mimics the actual GC business model more closely. There are opportunities to merge certain entities to facilitate better OLAP queries.
Build a UI that surfaces the data and presents powerful visualizations.
Implement a more advanced algorithm for breakage forecasting.
An ambitious goal would be to develop a robust breakage-forecast machine learning model trained on actual production data. This is of significant commercial value.
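For the Spectrum step above, here is a sketch of the external schema and table definitions; the catalog database name, IAM role, and S3 location are placeholders.

    -- Query the S3 extract in place instead of copying it into Redshift.
    CREATE EXTERNAL SCHEMA gc_spectrum
    FROM DATA CATALOG
    DATABASE 'gc_catalog'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    CREATE EXTERNAL TABLE gc_spectrum.transaction_log (
        txn_id      BIGINT,
        txn_ts      TIMESTAMP,
        txn_type    VARCHAR(20),
        txn_amount  DECIMAL(12,2),
        card_id     BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://gc-demo-bucket/transaction-log/';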
Thoughts on cost
Point 1 requires a massive amount of actual test data and some effort.
Points 2 through 4 in the list above are fairly straightforward.
Point 5 requires effort, a more comprehensive data model, and business research.
Point 6 is necessary to build production-worthy code. It takes a non-trivial amount of time and effort.
Point 7 is a significantly complex effort but has the highest commercial value. It is probably a valuable goal for the GC business and could be a real income generator.