
athena's Issues

Next Steps: What more can be done beyond the programming challenge?

Ask

List the items you would like or need to do, given more time.

Next Steps

  1. The analytical queries are not tuned for massive scale. They would need to be tuned for scale, leveraging Redshift query-optimization techniques such as carefully chosen DISTKEY and SORTKEY settings (a sketch follows this list).
  2. Use Amazon Redshift Spectrum to query the S3 data directly, with Redshift acting as a conduit between the business application and the OLAP engine.
  3. Tweak the data model to mimic the actual GC business model more closely. There are opportunities to merge certain entities to facilitate better OLAP queries.
  4. Build a UI that floats up the data and presents powerful visualizations.
  5. Implement a more advanced algorithm for breakage forecasting.
  6. Productionize the APIs: security, exception handling, logging, monitoring, tracing, scaling, stress testing, deployment, containerization, etc.
  7. An ambitious goal would be to develop a robust breakage-forecast machine learning model trained on actual production data. This is of significant commercial value.
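
A minimal sketch of the kind of tuning point 1 refers to, assuming a hypothetical transaction_log table and columns (illustrative names, not the project's actual schema):

```sql
-- Hypothetical retuning of an existing table's distribution and sort
-- keys in Redshift; table and column names are placeholders.
ALTER TABLE transaction_log ALTER DISTKEY card_id;
ALTER TABLE transaction_log ALTER SORTKEY (txn_ts);

-- Check that the new keys actually help a representative query.
EXPLAIN
SELECT card_id, SUM(amount) AS gross_volume
FROM transaction_log
WHERE txn_ts >= DATEADD(year, -1, GETDATE())
GROUP BY card_id;
```

Distributing on card_id co-locates a card's transactions on one slice, and sorting on the transaction timestamp lets time-range scans skip blocks.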

Thoughts on cost

  1. Point 1 requires a massive amount of actual test data and some effort.
  2. Points 2 through 4 above are fairly straightforward.
  3. Point 5 requires effort, a more comprehensive data model and business research.
  4. Point 6 is necessary to build production-worthy code. It takes a non-trivial amount of time and effort.
  5. Point 7 is a significantly complex effort, but has the highest commercial value. It is probably a valuable goal for the GC business and could be a real income generator.

Implement skinny REST APIs to float up insights

Ask

Float up insights acquired from Redshift via REST APIs

Approach

  1. Use the Spring Boot REST infrastructure
  2. Use the Amazon Redshift JDBC driver
  3. Document the APIs using Swagger

Outcome

  1. A Swagger UI endpoint from which the APIs can be called
  2. Ensure unit tests exist and coverage is reported

User interface

A simple ReactJS UI would do for now. If nothing else (and if time is a constraint), Swagger UI is the fallback.

Data load for analytics

Ask

Populate data for analytics.

Approach

Load up a decent amount of data so that tests cover positive, negative and edge cases. The focus right now is not scale, but validation and demonstration of insights.

Outcome

Load up S3 with the following (a sketch of the load step follows the list):

  1. Load around 50 different gift cards, covering open-loop, closed-loop and semi-open-loop cards.
  2. Load merchant data
  3. Load transaction log data
  4. Load customer data
  5. Pre-calculate historical breakage rates by card for a few years and populate the historical breakage rate data (this is derived data).
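
A minimal sketch of one such load, assuming a hypothetical bucket, prefix and IAM role (all placeholders) and a transaction_log table already created in Redshift:

```sql
-- Hypothetical COPY from S3 into Redshift; the bucket, prefix and
-- IAM role ARN are placeholders, not the project's real values.
COPY transaction_log
FROM 's3://athena-gc-data/transaction_log/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1
TIMEFORMAT 'auto';
```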

Insights derived in programming challenge

Ask

Try to derive the following insights.

  • Top selling cards by quantity and gross volume
  • Cards in the 90th percentile by gross sales volume (a sample query sketch follows this list)
  • Top selling cards by business model
  • Highest grossing merchants
  • Top cards by breakage as of today
  • Gross breakage by merchant as of today
  • Cards that are about to have breakage for a customer
  • Breakage forecast for Merchant, categorized by aspects like card category, customer segment, business model, card medium etc.
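
As an illustration of the 90th-percentile insight, a minimal sketch assuming hypothetical table and column names (transaction_log, card_type_id, amount are illustrative, not the project's actual schema):

```sql
-- Hypothetical sketch: card types at or above the 90th percentile of
-- gross sales volume. All table and column names are placeholders.
WITH card_sales AS (
    SELECT card_type_id,
           COUNT(*)    AS units_sold,
           SUM(amount) AS gross_volume
    FROM transaction_log
    WHERE txn_type = 'SALE'
    GROUP BY card_type_id
)
SELECT card_type_id, units_sold, gross_volume
FROM (
    SELECT cs.*,
           PERCENTILE_CONT(0.9)
               WITHIN GROUP (ORDER BY gross_volume) OVER () AS p90
    FROM card_sales cs
) ranked
WHERE gross_volume >= p90
ORDER BY gross_volume DESC;
```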

Approach

  • Query Redshift data for now (will think about Spectrum later).
  • Optimize queries progressively; look at DISTKEY- and SORTKEY-level optimizations subsequently.

Outcome

  • A set of tested queries and sample results
  • It is fine to test with a 'nano scale' dataset to start with.
  • Next, load up a reasonable amount of data so that the insights are not badly skewed or trivial-looking.

Agree on key use cases

Do these make sense, vis-a-vis objectives and schedule constraints?

  1. As a card issuer, I need to see which cards have what probability of breakage, at what time, and the overall breakage value.
  2. As a card issuer, I need to know the schedule on which notifications should be triggered, and then go ahead and approve the notification triggers for customers (system assisted, but perhaps human approved).
  3. As a card issuer, I need a simple user interface that shows me the predicted escheatment risk value and the possible uplift from timely triggers.

Lay Out Broad Solution Architecture

Compile:

  • Architecturally significant use cases
  • Architectural constraints
  • Technology options and evaluation parameters
  • Technology fitment
  • Logical and physical architectures

Thoughts and approach on predicting breakage

Ask

What are the possible approaches for predicting breakage for a given card?

Thoughts

Multiple approaches could be used, of which two stand out.

  • One is based on historical data analysis (looking at past trends).
  • Another is based on machine learning models. We discuss both; a sketch of the historical approach follows this list.
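
A minimal sketch of the historical approach, assuming hypothetical card_data columns (face_value, redeemed_value, sold_ts, expiry_ts are illustrative names): the breakage rate is the unredeemed share of the sold value among expired cards.

```sql
-- Hypothetical historical breakage-rate calculation; all column names
-- are illustrative placeholders for whatever the actual schema uses.
SELECT card_type_id,
       DATE_TRUNC('year', sold_ts)      AS sale_year,
       SUM(face_value)                  AS total_sold,
       SUM(face_value - redeemed_value) AS unredeemed,
       SUM(face_value - redeemed_value)
           / NULLIF(SUM(face_value), 0) AS breakage_rate
FROM card_data
WHERE expiry_ts < GETDATE()  -- only cards whose value can no longer be redeemed
GROUP BY card_type_id, DATE_TRUNC('year', sold_ts)
ORDER BY card_type_id, sale_year;
```

Past rates by card type and sale year can then be extrapolated forward as a first-cut forecast.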

Outcome

See the wiki page on Gift Card Breakage Forecast Approach for a detailed discussion.

Programming Challenge

Why?

  1. The immediate goal is to gain some preliminary insights from the data. The data is transactional data and some reference data, essentially written once but read many times for analytical purposes.

How?

  1. It would make sense to load the transactional data (and the reference data as well) into S3, and then use a variety of tools to look at it.
  2. We use Amazon Redshift to start with: we copy data from S3 into Redshift and do some EDA. Later on, we will attempt to use Redshift Spectrum instead of copying the data over (a sketch follows this list).
  3. We will consume the Redshift queries in a thin API layer (REST APIs).
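
A minimal sketch of the Spectrum route, assuming a hypothetical Glue catalog database, IAM role and S3 location (all placeholders):

```sql
-- Hypothetical Redshift Spectrum setup; the catalog database, role ARN
-- and S3 location are placeholders. The external table is queried in
-- place, without copying the data into Redshift.
CREATE EXTERNAL SCHEMA spectrum_gc
FROM DATA CATALOG
DATABASE 'gc_analytics'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum_gc.transaction_log (
    txn_id  BIGINT,
    card_id BIGINT,
    txn_ts  TIMESTAMP,
    amount  DECIMAL(12,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://athena-gc-data/transaction_log/';
```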

What?

  1. Model card data, customer data, merchant data and transaction log data.
  2. Do EDA to get some meaningful insights.
  3. Expose insights via REST APIs, on the Spring Boot stack.
  4. Time permitting, build a ReactJS UI.
  5. The overarching objective is a fully working model that works end to end and can be demonstrated. The demonstration should exhibit sound engineering practices, architectural maturity, design, logical thinking and coding capability.

Exclusions

  1. The model in this challenge is not expected to work at massive scale. The analytical queries can be tuned progressively for scale at a later time.

What next?

See #18

Agree on physical outcomes

Do the following outcomes of the exercise sound reasonable?

  • An engine/model that can look at data and provide breakage predictions. The thinking behind the model is what matters, not the actual data or training; real training data is unavailable in any case. Is a rudimentary implementation of the engine fine for now?
  • Are the technology components to float up the data and present it in a simple UI app acceptable?
  • A robust architecture and a broad vision for the platform are what matter at this point. Is it acceptable to show the bigger picture but implement only the achievable parts for the moment (stubbing out the rest)?

Elevator pitch

  1. Elevator pitch
  2. Three-minute video; is just a screen recording fine for now?

Data model for programming challenge

Ask

Create a simplified data model to get insights from transaction log data.

Approach

Consider the following entities:

  • Card Type: Information about a gift card type
  • Card Data: Information about a specific card (a sold, physical/plastic card)
  • Merchant: Information about a merchant
  • Customer: Customer data
  • Transaction log: Sufficiently denormalized transaction log data fit for insight extraction

Notes

  • We purposefully denormalize the transaction log data model to facilitate efficient queries from an OLAP standpoint.
  • All other entities are treated as some form of truth that holds good over time.
  • The data model presented here stems purely from an OLAP standpoint and is hence not normalized at all. It is assumed that some system of record is the ultimate owner/origin of the data, and that it in turn holds the data in a fully normalized format. It probably makes sense to assume that data flows into the DW from those systems and has been sufficiently pre-processed to facilitate OLAP operations.

Outcomes

  • DDL for the tables (a sketch follows this list)
  • Some sample data to start with that can be pushed into S3
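
A minimal sketch of what the transaction log DDL could look like, assuming hypothetical column names and types that mirror the entities above (not the project's actual DDL):

```sql
-- Hypothetical, deliberately denormalized transaction log DDL; every
-- name and type is an illustrative placeholder. Card, merchant and
-- customer attributes are folded in to avoid joins in OLAP queries.
CREATE TABLE transaction_log (
    txn_id         BIGINT        NOT NULL,
    txn_ts         TIMESTAMP     NOT NULL,
    txn_type       VARCHAR(16)   NOT NULL,  -- e.g. SALE, REDEEM, RELOAD
    card_id        BIGINT        NOT NULL,
    card_type_id   INTEGER       NOT NULL,
    business_model VARCHAR(16),             -- open / closed / semi-open loop
    merchant_id    INTEGER,
    customer_id    BIGINT,
    amount         DECIMAL(12,2) NOT NULL
)
DISTKEY (card_id)
SORTKEY (txn_ts);
```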
