Code Monkey home page Code Monkey logo

postgres-analytics-tutorial's Introduction

Architecting Real-Time Analytics for your Customers

Slide deck

Spin up the cluster

Use this link.
Do not use the links in the UI to create the schema. We will create it manually.

Clone the repository

Downloads all the scripts and data needed for the tutorial.

git clone https://github.com/citusdata/pgopensv-analytics-tutorial.git
cd pgopensv-analytics-tutorial 

Connect to the database

Connect to the cluster/database using any postgresql client. Below is an example using psql.

psql "<connection-url>"

Follow the below steps one by one after connecting to the cluster.

Schema

Schema has 3 main tables:

  • events: raw table which captures every event. It is a partitioned table. You'd creating a partition every 5 minutes. Used pg_partman to create partitions.
  • rollup_events_5mins: table to store aggregated data every 5-minute intervals.
  • rollup_events_1hr: table to store aggregated data every 1-hour.
    Connect to postgres via psql and run the below command to create the above tables.
    Also note that we are sharding each of the tables on tenant_id column. Hence they are colocated.
\i schema-1.sql
\i schema-2.sql

Data Load

Load a csv file into the events table.

\COPY events(customer_id,event_type,country,browser,device_id,session_id) FROM data/1.csv WITH (FORMAT CSV,HEADER TRUE);

Setup incremental rollup setup

SQL Script to track the event_id until a rollup (5min or 1hour) has been completed. This is used by the actual rollup functions to continue the rollup from that event_id.

\i setup_rollup.sql

Creating rollup functions

Uses the bulk UPSERT (INSERT INTO SELECT ON CONFLICT) to perform the aggregation/rollup.
Rollup function to populate 5-minute rollup table:link to function definition

\i 5minutely_aggregation.sql

Rollup function to populate 1-hr rollup table:link to function definition

\i hourly_aggregation.sql

Run aggregation queries.

5-minute Aggregation

SELECT five_minutely_aggregation();

1-hr Aggregation

SELECT hourly_aggregation();

Load more data

Load a csv file into the events table.

\COPY events(customer_id,event_type,country,browser,device_id,session_id) FROM data/2.csv WITH (FORMAT CSV,HEADER TRUE);

There are more csv files which we can load later.

Re-run our aggregations.

Dashboard Queries

--Get me the total number of events and count of distinct devices in the last 5 minutes?

SELECT sum(event_count), hll_cardinality(sum(device_distinct_count)) 
FROM rollup_events_5min where minute >=now()-interval '5 minutes' AND minute <=now() AND customer_id=1;

--Get me the count of distinct sessions over the last week?

SELECT sum(event_count), hll_cardinality(sum(device_distinct_count)) FROM 
rollup_events_1hr where hour >=date_trunc('day',now())-interval '7 days' AND hour <=now() AND customer_id=1;

-- Get me the trend of my app usage in the last 2 days broken by hour

SELECT hour, sum(event_count) event_count, hll_cardinality(sum(device_distinct_count)) device_count, hll_cardinality(sum(session_distinct_count)) 
session_count FROM rollup_events_1hr where hour >=date_trunc('day',now())-interval '2 days' AND hour <=now() AND customer_id=1 GROUP BY hour;

-- Get me the top devices in the past 30 minutes
SELECT (topn(topn_union_agg(top_devices_1000), 10)).item device_id
FROM rollup_events_5min  where minute >=date_trunc('day',now())-interval '30 minutes' AND minute <=now() AND customer_id=2;

Schedule Aggregation Periodically:

You can run the above aggregations periodically (5min or 1hr) using pg_cron

SELECT cron.schedule('*/5 * * * *', 'SELECT five_minutely_aggregation();');
SELECT cron.schedule('*/5 * * * *', 'SELECT hourly_aggregation();');

Data Expiry

To expire data you can basically directly drop a partition. This can be done on a periodic basis.

DROP TABLE <events_partition>;

References

https://www.citusdata.com/blog/2017/12/27/real-time-analytics-dashboards-with-citus/
https://www.citusdata.com/blog/2018/06/14/scalable-incremental-data-aggregation/
https://www.citusdata.com/blog/2018/03/27/topn-for-your-postgres-database/
https://www.citusdata.com/blog/2017/04/04/distributed_count_distinct_with_postgresql/
https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partman-creating-a-scalable-time-series-database-on-PostgreSQL/

postgres-analytics-tutorial's People

Contributors

craigkerstiens avatar marcocitus avatar saicitus avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.