
ngods's Introduction

ngods: opensource data stack

This repository contains a Docker Compose script that creates an open source data analytics stack on your local machine.

ngods architecture

Currently, the stack consists of the following components (a quick sanity check of how they fit together is sketched after this list):

  • Trino for low-latency distributed SQL queries
  • Apache Spark for data pipelines
  • Apache Iceberg as the table format, providing atomic data pipelines, schema evolution, and partitioning for performance
  • Hive Metastore for metadata management (the metadata is stored in MariaDB)
  • MinIO for S3-compatible object storage on a local machine
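
Trino ties these components together: it reads table metadata from the Hive Metastore and the table data from the MinIO buckets. Once the stack is running, you can verify the wiring from any SQL console connected to Trino (a minimal sanity check, assuming the stack's default configuration, where warehouse is the catalog name used throughout this README):

show catalogs;
show schemas from warehouse;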

I plan to add more components soon:

  • Postgres for low-latency queries (in case Trino on Iceberg doesn't deliver satisfactory low-latency queries). I'll also add the Postgres Foreign Data Wrapper technology for more convenient ELT between Trino and Postgres
  • dbt for ELT on top of Spark SQL or Trino
  • GoodData.CN or Cube.dev for analytics model and metrics
  • Metabase or Apache Superset for dashboards and data visualization

How ngods works: Simple example

You can start ngods by executing

docker-compose up

from the top-level directory where you've cloned this repo.

Once all images are pulled and all containers are up, open MinIO in your browser at http://localhost:9000, log in with the username minio and the password minio123, and create a top-level bucket called bronze.

Then use your favorite SQL console tool (I use DBeaver), connect to the Trino instance running on your local machine (JDBC URL: jdbc:trino://localhost:8080, username trino, database left empty), and execute the following script:

-- create the bronze schema on top of the bronze bucket in MinIO
create schema if not exists warehouse.bronze with (location = 's3a://bronze/');

drop table if exists warehouse.bronze.employee;

-- create a Parquet-backed employee table
create table warehouse.bronze.employee (
   employee_id integer not null,
   employee_name varchar not null
)
with (
   format = 'parquet'
);

-- insert a few sample rows
insert into warehouse.bronze.employee (employee_id, employee_name) values (1, 'john doe');
insert into warehouse.bronze.employee (employee_id, employee_name) values (2, 'jane doe');
insert into warehouse.bronze.employee (employee_id, employee_name) values (3, 'joe doe');
insert into warehouse.bronze.employee (employee_id, employee_name) values (4, 'james doe');

select * from warehouse.bronze.employee;
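
Because the employee table is stored in the Iceberg format, schema changes such as adding a column are lightweight metadata operations. Here is a sketch of the schema evolution mentioned above (the employee_email column is purely illustrative):

-- add a new column; existing rows return null for it
alter table warehouse.bronze.employee add column employee_email varchar;

select employee_id, employee_name, employee_email from warehouse.bronze.employee;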

How ngods works: Loading a Parquet file as a table

Now we'll load the February 2022 NYC taxi trip Parquet file into ngods as a new table.

First, create a new nyc directory inside the ./data/stage directory and download the February 2022 NYC taxi trips Parquet file into it.

Then open and execute the ngods Spark notebook script that loads the data as a new table into the bronze schema of the warehouse database:

# read the staged Parquet file into a Spark DataFrame
df = spark.read.parquet("/home/data/stage/nyc/fhvhv_tripdata_2022-02.parquet")
# write it out as a new table in the bronze schema of the warehouse database
df.writeTo("warehouse.bronze.ny_taxis_feb").create()

Now open your SQL console again, connect to the Trino instance running on your local machine (JDBC URL: jdbc:trino://localhost:8080, username trino, database left empty), and execute these SQL queries:

select count(*) from warehouse.bronze.ny_taxis_feb;

select
    hour(pickup_datetime) as pickup_hour,
    sum(trip_miles) as total_trip_miles,
    sum(trip_time) as total_trip_time,
    sum(base_passenger_fare) as total_base_fare
from warehouse.bronze.ny_taxis_feb
group by 1
order by 1;
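
Because ny_taxis_feb is an Iceberg table, you can also inspect its snapshot history through the connector's hidden metadata tables (a sketch, assuming the warehouse catalog is backed by Trino's Iceberg connector):

select snapshot_id, committed_at, operation
from warehouse.bronze."ny_taxis_feb$snapshots";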

You should see your query results in no time!


ngods's Issues

Improvements: selective replacement for Spark and lakehousing

Integration between Spark and dbt has to be configured as a Thrift server or a Spark session when running locally, and I think this can allocate more resources than necessary unless the workload actually requires multi-node processing.

I think that with dbt-duckdb and its built-in delta-rs or pyiceberg plug-ins, we could dramatically reduce this weight in the composition of the Docker files in the aio directory that you've constructed in ngods-stocks.

In addition, dbt-duckdb currently has experimental support for the Iceberg external table option, and Delta external table support is also being developed in a recent PR.
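
For context, reading an Iceberg table from DuckDB (and therefore from dbt-duckdb) currently looks roughly like the following sketch, which uses DuckDB's experimental iceberg extension (the table path is hypothetical):

install iceberg;
load iceberg;
select count(*) from iceberg_scan('s3://bronze/ny_taxis_feb');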

I'd like to ask for your opinion on this.
