Pachyderm: A Containerized Data Lake

News

WE'RE HIRING! Love Docker, Go and distributed systems? Learn more about our team and email us at [email protected].

Getting Started

If you already have a Kubernetes cluster:

$ kubectl create -f https://pachyderm.io/manifest.json

Otherwise, check out our setup instructions.

If you've never used Pachyderm before, start with the fruit stand example.

Pachyderm has a CLI called pachctl. You can install it with make install or with Homebrew:

$ brew tap pachyderm/tap && brew install pachctl

Docs for pachctl can be found here.

Pachyderm also provides a Go client library; the GoDocs are here.

What is Pachyderm?

Pachyderm is a Data Lake -- a place to dump and process gigantic data sets. Pachyderm is inspired by the Hadoop ecosystem but shares no code with it. Instead, we leverage the container ecosystem to provide the broad functionality of Hadoop with the ease of use of Docker.

Pachyderm offers the following core functionality:

  • Virtually limitless storage for any data.
  • Virtually limitless processing power using any tools.
  • Tracking of data history, provenance and ownership (version control for data).
  • Automatic processing of new data as it’s ingested (streaming).
  • Chaining processes together (pipelining).

What's new about Pachyderm? (How is it different from Hadoop?)

There are two bold new ideas in Pachyderm:

  • Containers as the core processing primitive
  • Version Control for data

These ideas lead directly to a system that's much more powerful, flexible and easy to use.

To process data, you simply create a containerized program which reads and writes to the local filesystem. You can use any tools you want because it's all just going in a container! Pachyderm will take your container and inject data into it by way of a FUSE volume. We'll then automatically replicate your container, showing each copy a different chunk of data. With this technique, Pachyderm can scale any code you write to process up to petabytes of data (Example: distributed grep).
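As a rough illustration of the kind of program described above, here is a minimal sketch of a grep-like job that only reads and writes the local filesystem. The function name grep_dir and the directory arguments are hypothetical choices for this example; the actual paths where Pachyderm mounts data are not specified here and depend on your setup.

```python
import re
from pathlib import Path


def grep_dir(input_dir: str, output_dir: str, pattern: str) -> int:
    """Copy every line matching `pattern` from the files in input_dir into
    same-named files in output_dir. Returns the number of matching lines.

    The program knows nothing about the cluster: it just reads and writes
    local files, which is all a containerized job needs to do.
    """
    regex = re.compile(pattern)
    out_root = Path(output_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    matched = 0
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open() as src:
            hits = [line for line in src if regex.search(line)]
        if hits:
            # Only create an output file when something matched.
            (out_root / path.name).write_text("".join(hits))
            matched += len(hits)
    return matched
```

Because each copy of the program only sees the files placed in its own input directory, the same code can run unchanged over one chunk or over many chunks in parallel.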

Pachyderm also version controls all data using a commit-based distributed filesystem (PFS), similar to what git does with code. Version control for data has far-reaching consequences in a distributed filesystem. You get the full history of your data, it's much easier to collaborate with teammates, and if anything goes wrong you can revert the entire cluster with one click!
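To make the commit model concrete, here is a toy in-memory sketch of a commit-based store with history and revert. It is an illustration only, not how PFS is implemented; the names Commit and Repo and their methods are invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot: file path -> contents, plus a parent link."""
    id: int
    parent: Optional["Commit"]
    files: Dict[str, str]


class Repo:
    """Toy commit-based store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.head: Optional[Commit] = None
        self._next_id = 0

    def commit(self, changes: Dict[str, str]) -> Commit:
        """Create a new commit by applying `changes` on top of the head."""
        files = dict(self.head.files) if self.head else {}
        files.update(changes)
        self.head = Commit(self._next_id, self.head, files)
        self._next_id += 1
        return self.head

    def history(self) -> List[int]:
        """Return commit ids from newest to oldest, following parent links."""
        ids, node = [], self.head
        while node is not None:
            ids.append(node.id)
            node = node.parent
        return ids

    def revert(self, commit: Commit) -> None:
        """Move the head back to an earlier commit; old data is untouched."""
        self.head = commit
```

Since commits are never modified after creation, reverting is just a matter of pointing the head at an older snapshot.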

Version control also complements our containerized processing engine. Pachyderm understands how your data changes and thus, as new data is ingested, can run your workload on the diff of the data rather than the whole thing. This means that there's no difference between a batched job and a streaming job: the same code will work for both!
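The idea of running a workload on the diff can be sketched with a toy model in which a snapshot is a mapping from file name to contents. The helper names diff and run are hypothetical, and this simplified version ignores deleted files; it is meant only to show why batch and streaming runs can share the same code.

```python
from typing import Callable, Dict


def diff(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, str]:
    """Return the files in `new` that are missing from or changed in `old`."""
    return {name: data for name, data in new.items() if old.get(name) != data}


def run(process: Callable[[str], str],
        old_input: Dict[str, str],
        new_input: Dict[str, str],
        old_output: Dict[str, str]) -> Dict[str, str]:
    """Process only the changed files and merge into the earlier output."""
    output = dict(old_output)
    for name, data in diff(old_input, new_input).items():
        output[name] = process(data)
    return output
```

A batch job is the special case where the previous input and output are empty, for example run(process, {}, snapshot, {}). A streaming job calls run again with each new snapshot and the prior output, and ends up with the same result while touching only the new data.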

Our Vision

Containers are a revolutionary new technology with a compelling application to big data. Our goal is to fully realize that use case. Hadoop has spawned a sprawling ecosystem of tools, but with each new tool the complexity of your cluster grows until maintaining it becomes a full-time job. Containers are the perfect antidote to this problem. What if adding a new tool to your data infrastructure were as easy as installing an app? Thanks to the magic of containers in Pachyderm, it really is!

The most exciting thing about this vision, though, is what comes next. Pachyderm can do big data with anything that runs on Linux. And anything you build can be easily shared with the rest of the community. After all, it's just a container, so it's completely reusable and will run the same every time. We have some ideas of our own about what the best starting building blocks will be, but that's just the tip of the iceberg -- we expect our users will have many more interesting ideas. We can't wait to see what they are!

Contributing

Deploying Pachyderm.

To get started, sign the Contributor License Agreement.

Send us PRs. We would love to see what you do!

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the environment variable METRICS to false in the pachd container.
