yellowcabs

A simple data pipeline that calculates the monthly average trip length and a 45-day rolling average trip length of New York yellow cabs.

Disclaimer: I interpreted "trip length" as duration, not as distance.
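As a rough illustration of "trip length as duration", the monthly average could be computed along the following lines. This is a minimal sketch and not the pipeline's actual code: the function name is hypothetical, and the column names `tpep_pickup_datetime` and `tpep_dropoff_datetime` are assumed from the public yellow cab trip data.

```python
import pandas as pd


def monthly_average_duration(trips: pd.DataFrame) -> pd.Series:
    """Return the mean trip duration in seconds per pickup month.

    Assumption: the frame has the columns tpep_pickup_datetime and
    tpep_dropoff_datetime (hypothetical here, not confirmed by this README).
    """
    pickup = pd.to_datetime(trips["tpep_pickup_datetime"])
    dropoff = pd.to_datetime(trips["tpep_dropoff_datetime"])
    # Duration of every trip in seconds.
    duration = (dropoff - pickup).dt.total_seconds()
    # Group by the calendar month of the pickup time.
    return duration.groupby(pickup.dt.to_period("M")).mean()


# Tiny made-up sample: two trips in January, one in February.
trips = pd.DataFrame(
    {
        "tpep_pickup_datetime": [
            "2019-01-01 00:00:00",
            "2019-01-15 12:00:00",
            "2019-02-01 08:00:00",
        ],
        "tpep_dropoff_datetime": [
            "2019-01-01 00:10:00",
            "2019-01-15 12:20:00",
            "2019-02-01 08:05:00",
        ],
    }
)
averages = monthly_average_duration(trips)
print(averages)
```

Grouping by pickup month is a design choice of this sketch; a trip that starts in one month and ends in the next is counted in the month it started.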

Requirements

  • Python 3.7
  • tox (optional, for running tests and linting)

Setup

Clone this repository and run

$ pip install .

or

$ pip install dist/yellowcabs-1.0.0-py3-none-any.whl

For production use, the wheel would probably be pushed to a private PyPI/devpi index and installed from there, or copied directly into a Docker image and installed there. Which option fits depends on how things are run in production.

Configuration

| Environment Variable | Description                  | Default                                       |
|----------------------|------------------------------|-----------------------------------------------|
| YC_BASE_URL          | Base URL of the taxi data    | "https://s3.amazonaws.com/nyc-tlc/trip+data/" |
| YC_TRIP_DATA         | Kind of data to analyze      | "yellow_tripdata"                             |
| YC_LOCAL_CACHE_DIR   | Location to store cache data | <python environment>/share                    |
| YC_DB_URL            | SQLAlchemy connection_string | "sqlite:///results.sqlite"                    |
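The settings above could be read roughly as follows. This is a sketch under stated assumptions, not the package's real configuration code: `load_config` and `data_url` are hypothetical helpers, `<python environment>/share` is assumed to mean `sys.prefix` plus `share`, and the file naming pattern in `data_url` is a guess.

```python
import os
import sys
from pathlib import Path

# Defaults as listed in the configuration table.
DEFAULTS = {
    "YC_BASE_URL": "https://s3.amazonaws.com/nyc-tlc/trip+data/",
    "YC_TRIP_DATA": "yellow_tripdata",
    # Assumption: "<python environment>/share" maps to sys.prefix / "share".
    "YC_LOCAL_CACHE_DIR": str(Path(sys.prefix) / "share"),
    "YC_DB_URL": "sqlite:///results.sqlite",
}


def load_config(environ=None):
    """Return the effective settings, preferring environment variables."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}


def data_url(config, month):
    """Build the download URL for one month given as "YYYY-MM".

    Assumption: files are named <trip data>_<YYYY-MM>.csv below the base URL.
    """
    return f"{config['YC_BASE_URL']}{config['YC_TRIP_DATA']}_{month}.csv"


config = load_config()
print(data_url(config, "2019-01"))
```

Passing a plain dict instead of `os.environ` makes it easy to try out overrides, for example `load_config({"YC_DB_URL": "sqlite:///other.sqlite"})`.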

Usage

CLI

$ yellowcabs 2019-01
The average trip duration in 01/2019 was 988 seconds.

The value is rounded to whole seconds for readability. More precise values are available in the results of the data pipeline.

luigi

$ luigi --local-scheduler --module yellowcabs.luigi NYTaxiTripDurationAnalytics --month 2019-01
$ luigi --local-scheduler --module yellowcabs.luigi NYTaxiTripDurationAnalytics --month 2019-02

You probably want to run the luigi pipeline on a monthly basis via a cron job. That way, at the start of each month, the data set of the previous month can be batch-ingested.
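A monthly job needs the previous month in the `YYYY-MM` form that `--month` expects. The following is a small sketch of how a wrapper script might derive it; `previous_month` and `luigi_command` are hypothetical helpers and not part of the package, while the command line itself is the one shown above.

```python
from datetime import date, timedelta


def previous_month(today: date) -> str:
    """Return the month before `today` formatted as "YYYY-MM"."""
    first_of_this_month = today.replace(day=1)
    # One day before the first of this month is always in the previous month.
    last_of_previous = first_of_this_month - timedelta(days=1)
    return last_of_previous.strftime("%Y-%m")


def luigi_command(today: date) -> list:
    """Build the luigi invocation for the month before `today`."""
    return [
        "luigi",
        "--local-scheduler",
        "--module",
        "yellowcabs.luigi",
        "NYTaxiTripDurationAnalytics",
        "--month",
        previous_month(today),
    ]


print(" ".join(luigi_command(date.today())))
```

The resulting list could be handed to `subprocess.run` from a script that cron starts on the first day of each month.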

The 45-day rolling average trip duration can be found in the table trip_duration_rolling_average, in the database defined by the configuration (YC_DB_URL).
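One way to compute such a rolling average with pandas is sketched below. It is an illustration only: how the pipeline actually fills trip_duration_rolling_average is not described here, the function name is hypothetical, and the column names are assumed from the public trip data. The sketch weights every trip equally by dividing the rolling sum of durations by the rolling trip count, instead of averaging daily averages.

```python
import pandas as pd


def rolling_average_duration(trips: pd.DataFrame, days: int = 45) -> pd.Series:
    """Return, per calendar day, the mean duration in seconds of all trips
    picked up within the `days` days ending on that day.

    Assumption: the columns tpep_pickup_datetime and tpep_dropoff_datetime
    exist (hypothetical here).
    """
    pickup = pd.to_datetime(trips["tpep_pickup_datetime"])
    dropoff = pd.to_datetime(trips["tpep_dropoff_datetime"])
    duration = (dropoff - pickup).dt.total_seconds()
    duration.index = pd.DatetimeIndex(pickup)
    # One row per calendar day; days without trips get sum 0 and count 0.
    daily = duration.resample("D").agg(["sum", "count"])
    # Because every day has a row, a window of `days` rows spans `days` days.
    window = daily.rolling(window=days, min_periods=1).sum()
    return window["sum"] / window["count"]


# Made-up sample: two trips in early January, one on 1 March.
trips = pd.DataFrame(
    {
        "tpep_pickup_datetime": [
            "2019-01-01 09:00:00",
            "2019-01-02 09:00:00",
            "2019-03-01 09:00:00",
        ],
        "tpep_dropoff_datetime": [
            "2019-01-01 09:10:00",
            "2019-01-02 09:20:00",
            "2019-03-01 09:05:00",
        ],
    }
)
rolling = rolling_average_duration(trips)
print(rolling.tail())
```

Days whose window contains no trips at all come out as NaN, because both the rolling sum and the rolling count are zero.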

Scaling

If the ingested data becomes too large to hold in RAM or to write to local temporary files, as is done now, the pipeline would need to be refactored, perhaps to use Dask instead of plain pandas, together with a proper data warehouse or at least a proper database to store temporary result sets.

Since data engineering is not a big part of my professional experience, my implementation is probably rather naive, but I learned something along the way.
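Before moving to Dask, a lighter option might be to stay with plain pandas and read the CSV in chunks, keeping only a running sum and count in memory. The following sketch shows that alternative technique, not Dask and not the current implementation; the function name and the column names are assumptions.

```python
import io

import pandas as pd


def average_duration_chunked(csv_source, chunksize: int = 100_000) -> float:
    """Return the mean trip duration in seconds without loading the whole file.

    Assumption: the CSV has the columns tpep_pickup_datetime and
    tpep_dropoff_datetime (hypothetical here).
    """
    columns = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
    total_seconds = 0.0
    trip_count = 0
    reader = pd.read_csv(
        csv_source, usecols=columns, parse_dates=columns, chunksize=chunksize
    )
    for chunk in reader:
        duration = (
            chunk["tpep_dropoff_datetime"] - chunk["tpep_pickup_datetime"]
        ).dt.total_seconds()
        # Only the running totals survive between chunks.
        total_seconds += duration.sum()
        trip_count += int(duration.count())
    if trip_count == 0:
        raise ValueError("no trips found")
    return total_seconds / trip_count


# Made-up sample data, read two rows at a time.
sample = io.StringIO(
    "tpep_pickup_datetime,tpep_dropoff_datetime\n"
    "2019-01-01 00:00:00,2019-01-01 00:10:00\n"
    "2019-01-01 01:00:00,2019-01-01 01:20:00\n"
    "2019-01-02 08:00:00,2019-01-02 08:05:00\n"
)
chunked_average = average_duration_chunked(sample, chunksize=2)
print(chunked_average)
```

This works for a plain mean because sums and counts can be combined across chunks; it only covers the in-memory part of the problem and does not replace a database for intermediate results.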

Contributors

  • dermorz
