nyc_taxi_data_pipeline's Introduction

New York City Taxi Data Pipeline Project

Introduction

New York City has a large number of taxi cabs that transport people around the city. The local government has been collecting data on taxi trips for years. This data is useful for many purposes, such as traffic planning, taxi dispatching, and fare regulation. In this project, we build a data pipeline that processes the taxi data and provides efficient access to it for further analysis.

Data Source

The data source is the New York City Taxi & Limousine Commission (TLC) Trip Record Data, which is available on the TLC website. The data is stored as Parquet files in an AWS S3 bucket and is partitioned by year and month.
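Because the files are partitioned by year and month, the download step needs one file path per monthly partition. The following is a minimal sketch of how such a list could be generated. The base URL is a placeholder, and the `yellow_tripdata_YYYY-MM.parquet` naming scheme and the helper names `month_range` and `partition_urls` are assumptions for illustration, not taken from this project's code.

```python
from datetime import date
from typing import Iterator

# Hypothetical base location; replace with the real bucket or URL prefix.
BASE_URL = "https://example.com/trip-data"


def month_range(start: date, end: date) -> Iterator[tuple[int, int]]:
    """Yield (year, month) pairs from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            # Roll over into January of the next year.
            year, month = year + 1, 1


def partition_urls(start: date, end: date, base_url: str = BASE_URL) -> list[str]:
    """Build one file URL per monthly partition (assumed naming scheme)."""
    return [
        f"{base_url}/yellow_tripdata_{y}-{m:02d}.parquet"
        for y, m in month_range(start, end)
    ]
```

Calling `partition_urls(date(2022, 11, 1), date(2023, 2, 1))` would produce four entries, one for each month from November 2022 through February 2023.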

Data Pipeline

The data pipeline consists of the following steps:

  1. Download the data from the S3 bucket to Databricks Volumes in the bronze layer.
  2. Load the data/files into a Spark DataFrame and save it as a Delta table in the silver layer.
  3. Clean the data from the silver layer.
     3.1. Update column names and data types.
     3.2. Remove unnecessary columns.
  4. Save facts about trips in a Delta table in the gold layer.
  5. Save Payment, Rate, and Zone dimensions in Delta tables in the gold layer.
  6. Aggregate data about trip payments and save it in a Delta table in the data mart.
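The cleaning step (step 3) can be illustrated with a small sketch. The project itself runs on Spark and Delta tables; the version below uses plain Python dictionaries only to show the logic of renaming columns, casting types, and dropping unused columns without extra dependencies. The target column names and the `COLUMN_SPEC` mapping are hypothetical; the raw names follow the public TLC yellow taxi schema.

```python
from datetime import datetime

# Hypothetical mapping: raw column name -> (cleaned name, type cast).
# Any raw column not listed here is treated as unnecessary and dropped.
COLUMN_SPEC = {
    "tpep_pickup_datetime": ("pickup_at", datetime.fromisoformat),
    "tpep_dropoff_datetime": ("dropoff_at", datetime.fromisoformat),
    "PULocationID": ("pickup_zone_id", int),
    "DOLocationID": ("dropoff_zone_id", int),
    "RatecodeID": ("rate_id", int),
    "payment_type": ("payment_id", int),
    "trip_distance": ("trip_distance", float),
    "total_amount": ("total_amount", float),
}


def clean_row(raw: dict) -> dict:
    """Rename and cast the columns in COLUMN_SPEC; drop everything else."""
    return {
        new_name: cast(raw[old_name])
        for old_name, (new_name, cast) in COLUMN_SPEC.items()
    }


# Example raw record with one column that the cleaned table does not keep.
raw_trip = {
    "tpep_pickup_datetime": "2023-01-01 00:32:10",
    "tpep_dropoff_datetime": "2023-01-01 00:40:36",
    "PULocationID": 161,
    "DOLocationID": 141,
    "RatecodeID": 1.0,
    "payment_type": 2,
    "trip_distance": "0.97",
    "total_amount": "14.3",
    "store_and_fwd_flag": "N",
}
cleaned_trip = clean_row(raw_trip)
```

In a Spark implementation the same idea would typically be expressed as a column selection with aliases and casts on the silver DataFrame; the mapping-driven approach keeps the rename, cast, and drop rules in one place.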

Data Model

The data model for the lakehouse follows dimensional modeling principles with star schemas. The data model consists of the following tables:

| Table Name | Table Type | Description |
|---|---|---|
| fct_yellow_taxi_trips | Fact Table | Contains facts about trips made by yellow taxis. |
| dim_payments | Dimension Table | Contains information about payment types. |
| dim_rates | Dimension Table | Contains information about rate codes. |
| dim_zones | Dimension Table | Contains information about taxi zones. |
| trip_payment | Data Mart Table | Contains aggregated data about trip payments. |

The data model diagram is shown below:

Data Model Diagram
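To make the relationship between the fact table, a dimension, and the data mart concrete, here is a minimal sketch of how `trip_payment` could be derived from `fct_yellow_taxi_trips` and `dim_payments`. It uses plain Python in place of Spark SQL. The column names (`payment_id`, `total_amount`), the sample rows, and the chosen aggregates (trip count and summed amount) are assumptions for illustration; the actual data mart may aggregate different measures.

```python
from collections import defaultdict

# Illustrative dimension: payment key -> label (assumed contents).
dim_payments = {1: "Credit card", 2: "Cash"}

# Illustrative fact rows; each references the dimension by key.
fct_yellow_taxi_trips = [
    {"payment_id": 1, "total_amount": 12.5},
    {"payment_id": 2, "total_amount": 8.0},
    {"payment_id": 1, "total_amount": 20.0},
]


def build_trip_payment(facts: list[dict], payments: dict[int, str]) -> dict:
    """Join facts to the payment dimension and aggregate per payment type."""
    totals = defaultdict(lambda: {"trip_count": 0, "total_amount": 0.0})
    for trip in facts:
        # Look up the dimension label; unmatched keys fall into "Unknown".
        label = payments.get(trip["payment_id"], "Unknown")
        totals[label]["trip_count"] += 1
        totals[label]["total_amount"] += trip["total_amount"]
    return dict(totals)


trip_payment = build_trip_payment(fct_yellow_taxi_trips, dim_payments)
```

This mirrors the star schema pattern: the fact table stores keys and measures, the dimension supplies descriptive labels, and the data mart holds the pre-aggregated result so downstream queries do not need to repeat the join.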

Conclusions

In this project, we built a data pipeline to process New York City taxi data. We created a data lakehouse with a dimensional model and a data mart. The data pipeline is automated and can be scheduled to run periodically to update the data in the lakehouse.

Next Steps

The next steps for this project are:

  • Add a machine learning model to predict trip fares.
  • Set up MLOps tooling to monitor the data pipeline and the machine learning model.
  • Create a dashboard to visualize the data.

Contributors

112523chen
