New York City has a large fleet of taxi cabs that transport people around the city, and the local government has been collecting data on taxi trips for years. This data is useful for many purposes, such as traffic planning, taxi dispatching, and fare regulation. In this project, we build a data pipeline that processes the taxi data and provides efficient access to it for further analysis.
The data source is the New York City Taxi & Limousine Commission (TLC) Trip Record Data, available on the TLC website. The data is stored as Parquet files in an AWS S3 bucket and is partitioned by year and month.
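Because the data is partitioned by year and month, the bronze download step mainly needs to enumerate the monthly file names. The public TLC files follow a `yellow_tripdata_YYYY-MM.parquet` naming pattern; a minimal sketch of the enumeration (the helper name is our own):

```python
def monthly_file_names(taxi_type: str, year: int) -> list[str]:
    """Build the per-month Parquet file names for one year of TLC trip data."""
    return [
        f"{taxi_type}_tripdata_{year}-{month:02d}.parquet"
        for month in range(1, 13)
    ]

# Example: the files to fetch for yellow taxis in 2023.
files = monthly_file_names("yellow", 2023)
print(files[0])  # yellow_tripdata_2023-01.parquet
```

In the pipeline, each of these names would be downloaded from the S3 bucket into a Databricks Volumes path in the bronze layer.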
The data pipeline consists of the following steps:
- Download the data from the S3 bucket to Databricks Volumes in the bronze layer.
- Load the data/files into a Spark DataFrame and save it as a Delta table in the silver layer.
- Clean the data from the silver layer: update column names and data types, and remove unnecessary columns.
- Save facts about trips in a Delta table in the gold layer.
- Save Payment, Rate, and Zone dimensions in Delta tables in the gold layer.
- Aggregate data about trip payments and save it in a Delta table in the data mart.
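The cleaning step can be sketched as a rename map plus a drop list. The snippet below uses plain Python lists to stand in for a Spark DataFrame's columns; the raw names come from the public yellow-taxi schema, while the snake_case targets and the drop list are illustrative assumptions:

```python
# Hypothetical rename map for the cleaning step: raw TLC column -> snake_case name.
RENAMES = {
    "VendorID": "vendor_id",
    "tpep_pickup_datetime": "pickup_datetime",
    "tpep_dropoff_datetime": "dropoff_datetime",
    "RatecodeID": "rate_code_id",
    "PULocationID": "pickup_location_id",
    "DOLocationID": "dropoff_location_id",
}

# Columns assumed not to be needed downstream.
DROP = {"store_and_fwd_flag"}

def clean_columns(columns: list[str]) -> list[str]:
    """Apply renames and drops to a raw column list (stand-in for DataFrame ops)."""
    return [RENAMES.get(c, c) for c in columns if c not in DROP]

raw = ["VendorID", "tpep_pickup_datetime", "store_and_fwd_flag", "fare_amount"]
print(clean_columns(raw))  # ['vendor_id', 'pickup_datetime', 'fare_amount']
```

In the actual pipeline, each entry in the rename map becomes a `withColumnRenamed` call and each drop-list entry a `drop` call on the Spark DataFrame before it is written to the silver Delta table.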
The data model for the lakehouse follows dimensional modeling principles with star schemas. The data model consists of the following tables:
Table Name | Table Type | Description |
---|---|---|
fct_yellow_taxi_trips | Fact Table | Contains facts about trips made by yellow taxis. |
dim_payments | Dimension Table | Contains information about payment types. |
dim_rates | Dimension Table | Contains information about rate codes. |
dim_zones | Dimension Table | Contains information about taxi zones. |
trip_payment | Data Mart Table | Contains aggregated data about trip payments. |
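The `trip_payment` mart is built by joining the fact table to `dim_payments` and aggregating. The sketch below uses tiny in-memory rows to stand in for the Delta tables; the table and column names follow the model above, but the sample values and the helper function are illustrative only:

```python
from collections import defaultdict

# Tiny stand-ins for the gold tables (values are made up for illustration).
dim_payments = {1: "Credit card", 2: "Cash"}
fct_trips = [
    {"payment_type": 1, "total_amount": 14.5},
    {"payment_type": 2, "total_amount": 9.0},
    {"payment_type": 1, "total_amount": 20.0},
]

def build_trip_payment(trips, payments):
    """Aggregate trip count and total revenue per payment type (the mart logic)."""
    agg = defaultdict(lambda: {"trips": 0, "total_amount": 0.0})
    for t in trips:
        # Join the fact row to its payment dimension, then aggregate.
        name = payments.get(t["payment_type"], "Unknown")
        agg[name]["trips"] += 1
        agg[name]["total_amount"] += t["total_amount"]
    return dict(agg)

print(build_trip_payment(fct_trips, dim_payments))
# {'Credit card': {'trips': 2, 'total_amount': 34.5}, 'Cash': {'trips': 1, 'total_amount': 9.0}}
```

In the pipeline itself, this is a `groupBy`/`agg` over the joined fact and dimension tables, with the result written to the data mart as a Delta table.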
The data model diagram is shown below:
In this project, we built a data pipeline to process New York City taxi data. We created a data lakehouse with a dimensional model and a data mart. The data pipeline is automated and can be scheduled to run periodically to update the data in the lakehouse.
The next steps for this project are:
- Add a machine learning model to predict trip fares.
- Provision MLOps tooling to monitor the data pipeline and the machine learning model.
- Create a dashboard to visualize the data.