This is a simple, beginner-friendly project for learning how to orchestrate a data pipeline.
Tech stack: Spark, Kafka, Delta Lake, Trino, Airflow, and Postgres.
Data can be downloaded at this link: Link
- Python 3.9.12 (other Python versions have not been tested yet)
- Docker + Docker Compose
Clone this repository and install the required packages:
pip install -r requirements.txt
Run Docker Compose to start the data lake services:
docker compose up -d
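To confirm the services came up, a quick Python port check works as a sanity test. This is only a sketch; it probes the two ports mentioned in this guide (the MinIO console on 9001 and the Kafka UI on 9021), so adjust it if your docker-compose.yml maps different ports:

```python
# Sanity check: ports are taken from this guide and may differ in your compose file.
import socket

SERVICES = {
    "MinIO console": ("localhost", 9001),
    "Kafka UI": ("localhost", 9021),
}

for name, (host, port) in SERVICES.items():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(2)
        status = "up" if sock.connect_ex((host, port)) == 0 else "down"
        print(f"{name:15s} {host}:{port} -> {status}")
```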
- Push data to MinIO: either open localhost:9001 and upload the files through the web console, or push them from the command line:
python utils/upload_folder.py
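For reference, a script like utils/upload_folder.py typically iterates over a local folder and uploads each file to a MinIO bucket with the official minio client. The sketch below is illustrative only; the endpoint, credentials, bucket name, and folder path are assumptions and may not match the actual script:

```python
# Illustrative sketch: endpoint, credentials, bucket, and folder are assumed values.
import os

from minio import Minio

client = Minio(
    "localhost:9000",               # assumed MinIO API endpoint
    access_key="minio_access_key",  # assumed credentials
    secret_key="minio_secret_key",
    secure=False,
)

bucket = "taxi"  # matches the s3://taxi/ location used by Trino below
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

data_dir = "data"  # assumed local folder containing the downloaded files
for name in os.listdir(data_dir):
    path = os.path.join(data_dir, name)
    if os.path.isfile(path):
        client.fput_object(bucket, f"part0/{name}", path)
        print(f"uploaded {name} -> s3://{bucket}/part0/{name}")
```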
- Access the Trino container to create the schema and table:
docker exec -it datalake-trino bash
After that, open the Trino CLI and run the following statements:
CREATE SCHEMA IF NOT EXISTS lakehouse.taxi
WITH (location = 's3://taxi/');
CREATE TABLE IF NOT EXISTS lakehouse.taxi.taxi (
VendorID VARCHAR(50),
tpep_pickup_datetime VARCHAR (50),
tpep_dropoff_datetime VARCHAR (50),
passenger_count DECIMAL,
trip_distance DECIMAL,
RatecodeID DECIMAL,
store_and_fwd_flag VARCHAR(50),
PULocationID VARCHAR(50),
DOLocationID VARCHAR(50),
payment_type VARCHAR(50),
fare_amount DECIMAL,
extra DECIMAL,
mta_tax DECIMAL,
tip_amount DECIMAL,
tolls_amount DECIMAL,
improvement_surcharge VARCHAR(50),
total_amount DECIMAL,
congestion_surcharge DECIMAL,
Airport_fee DECIMAL
) WITH (
location = 's3://taxi/part0'
);
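Once the schema and table exist, you can run a quick smoke test from Python with the trino client. This is a sketch that assumes Trino is reachable on localhost:8080 with a plain user; the catalog and schema names come from the SQL above:

```python
# Smoke test: host, port, and user are assumptions; catalog/schema come from the SQL above.
import trino

conn = trino.dbapi.connect(
    host="localhost",
    port=8080,            # assumed Trino HTTP port
    user="admin",         # assumed user name
    catalog="lakehouse",
    schema="taxi",
)
cur = conn.cursor()
cur.execute("SELECT count(*) FROM taxi")
print("rows in lakehouse.taxi.taxi:", cur.fetchone()[0])
```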
- Check the Kafka service by visiting localhost:9021.
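Beyond the UI, a short produce/consume round trip confirms the broker itself is healthy. The sketch below uses kafka-python and assumes an external listener on localhost:9092 and a throwaway topic name:

```python
# Round-trip smoke test: broker address and topic name are assumptions.
from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"  # assumed external listener
TOPIC = "smoke_test"       # throwaway topic

producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send(TOPIC, b"hello lakehouse")
producer.flush()

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # give up after 5 seconds of silence
)
for message in consumer:
    print("received:", message.value)
    break
```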
- To bring up the Airflow service, run the following commands:
cd pipeline
docker compose up -d
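Airflow loads DAG files from the dags folder of the pipeline setup. As a point of reference, a minimal DAG looks like the sketch below; the DAG id, schedule, and task are placeholders rather than the project's actual pipeline:

```python
# Minimal illustrative DAG; the id, schedule, and task body are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    print("pipeline is wired up")


with DAG(
    dag_id="example_taxi_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    PythonOperator(task_id="say_hello", python_callable=say_hello)
```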