Awesome Big Data

A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data.

Your contributions are always welcome!

Awesome Big Data
Other Awesome Lists

Frameworks

Apache Hadoop - framework for distributed processing. Integrates MapReduce, YARN and HDFS.

Distributed Programming

AddThis Hydra - distributed data processing and storage system.
AMPLab SIMR - run Spark on Hadoop MapReduce v1.
Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
Apache Gora - framework for in-memory data model and persistence.
Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
Apache Pig - high level language to express data analysis programs for Hadoop.
Apache S4 - framework for stream processing, implementation of S4.
Apache Spark - framework for in-memory cluster computing.
Apache Spark Streaming - framework for stream processing, part of Spark.
Apache Storm - framework for stream processing by Twitter, also runs on YARN.
Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
Cascalog - data processing and querying library.
Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
Concurrent Cascading - framework for data management/analytics on Hadoop.
Damballa Parkour - MapReduce library for Clojure.
Datasalt Pangool - alternative MapReduce paradigm.
DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
Facebook Corona - Hadoop enhancement which removes single point of failure.
Facebook Peregrine - Map Reduce framework.
Facebook Scuba - distributed in-memory datastore.
Google MapReduce - map reduce framework.
Google MillWheel - fault tolerant stream processing framework.
HadoopDB - hybrid of MapReduce and DBMS.
JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
Kite - a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
Metamarkets Druid - framework for real-time analysis of large datasets.
Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
Nokia Disco - MapReduce framework developed by Nokia.
Pydoop - Python MapReduce and HDFS API for Hadoop.
Stratosphere - general purpose cluster computing framework.
Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.

Distributed Filesystem

Apache HDFS - a way to store large files across multiple machines.
Ceph Filesystem - software storage platform.
Facebook Haystack - object storage system.
Google Colossus - distributed filesystem (GFS2).
Google GFS - distributed filesystem.
Google Megastore - scalable, highly available storage.
GridGain - GGFS, Hadoop compliant in-memory file system.
Lustre file system - high-performance distributed filesystem.
Quantcast File System QFS - open-source distributed file system.
Red Hat GlusterFS - scale-out network-attached storage file system.
Tachyon - reliable file sharing at memory speed across cluster frameworks.

Column Data Model

Actian Vector - column-oriented analytic database.
Apache Accumulo - distributed key/value store, built on Hadoop.
Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.
Apache HBase - column-oriented distributed datastore, inspired by BigTable.
C-Store - column-oriented DBMS.
Facebook HydraBase - evolution of HBase made by Facebook.
Google BigTable - column-oriented distributed datastore.
Google Cloud Datastore - a fully managed, schemaless database for storing non-relational data over BigTable.
Hypertable - column-oriented distributed datastore, inspired by BigTable.
InfiniDB - accessed through a MySQL interface; uses massively parallel processing to parallelize queries.
MonetDB - column store database.
OhmData C5 - improved version of HBase.
Parquet - columnar storage format for Hadoop.
Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.

Document Data Model

Crate Data - an open source, massively scalable data store that requires zero administration.
Facebook Apollo - Facebook’s Paxos-like NoSQL database.
jumboDB - document oriented datastore over Hadoop.
LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
MongoDB - Document-oriented database system.
RethinkDB - document database that supports queries like table joins and group by.

Key-value Data Model

Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
Edis - protocol-compatible server replacement for Redis.
ElephantDB - Distributed database specialized in exporting data from Hadoop.
EventStore - distributed time series database.
LinkedIn Krati - a simple persistent data store with very low latency and high throughput.
LinkedIn Voldemort - distributed key/value storage system.
OpenTSDB - distributed time series database on top of HBase.
Redis - in-memory key-value datastore.
Riak - a decentralized datastore.
Storehaus - library to work with asynchronous key value stores, by Twitter.
Tarantool - an efficient NoSQL database and a Lua application server.

Graph Data Model

Apache Giraph - implementation of Pregel, based on Hadoop.
Apache Spark Bagel - implementation of Pregel, part of Spark.
ArangoDB - multi-model distributed database.
Facebook TAO - distributed data store widely used at Facebook to store and serve the social graph.
Gremlin - graph traversal language.
Google Cayley - open-source graph database.
Google Pregel - graph processing framework.
GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
GraphX - resilient Distributed Graph System on Spark.
Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
Neo4j - graph database written entirely in Java.
OrientDB - document and graph database.
Phoebus - framework for large scale graph processing.
Titan - distributed graph database, built over Cassandra.
Twitter FlockDB - distributed graph database.

NewSQL Databases

Amazon RedShift - data warehouse service, based on PostgreSQL.
BayesDB - statistic oriented SQL database.
FoundationDB - distributed database, inspired by F1.
Google F1 - distributed SQL database built on Spanner.
Google Spanner - globally distributed semi-relational database.
H-Store - an experimental main-memory, parallel database management system optimized for on-line transaction processing (OLTP) applications.
Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
HandlerSocket - NoSQL plugin for MySQL/MariaDB.
InfiniSQL - infinitely scalable RDBMS.
MemSQL - in-memory SQL database with optimized columnar storage on flash.
NuoDB - SQL/ACID compliant distributed database.
Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
SAP HANA - SQL based in-memory database.
SenseiDB - distributed, realtime, semi-structured database.
Sky - database used for flexible, high performance analysis of behavioral data.
SymmetricDS - open source software for both file and database synchronization.

Time-Series Databases

TempoDB - cloud-based time series database.
InfluxDB - open-source distributed time series database.
OpenTSDB - distributed time series database on top of HBase.
KairosDB - similar to OpenTSDB but allows for Cassandra.
Cube - uses MongoDB to store time series data.

SQL-like processing

AMPLAB Shark - data warehouse system for Spark.
Apache Drill - framework for interactive analysis, inspired by Dremel.
Apache HCatalog - table and storage management layer for Hadoop.
Apache Hive - SQL-like data warehouse system for Hadoop.
Apache Phoenix - SQL skin over HBase.
BlinkDB - massively parallel, approximate query engine.
Cloudera Impala - framework for interactive analysis, inspired by Dremel.
Concurrent Lingual - SQL-like query language for Cascading.
Datasalt Splout SQL - full SQL query engine for big datasets.
Facebook PrestoDB - distributed SQL query engine.
Google BigQuery - framework for interactive analysis, implementation of Dremel.
Pivotal HAWQ - SQL-like data warehouse system for Hadoop.
Spark Catalyst - query optimization framework for Spark and Shark.
SparkSQL - Manipulating Structured Data Using Spark.
Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
Stinger - interactive query for Hive.
Tajo - distributed data warehouse system on Hadoop.

Data Ingestion

Amazon Kinesis - real-time processing of streaming data at massive scale.
Apache Chukwa - data collection system.
Apache Flume - service to manage large amounts of log data.
Apache Kafka - distributed publish-subscribe messaging system.
Apache Samza - stream processing framework, based on Kafka and YARN.
Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
Cloudera Morphlines - framework that helps with ETL to Solr, HBase and HDFS.
Facebook Scribe - streamed log data aggregator.
Fluentd - tool to collect events and logs.
HIHO - framework for connecting disparate data sources with Hadoop.
Kestrel - distributed message queue system.
LinkedIn Databus - stream of change capture events for a database.
LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
LinkedIn White Elephant - log aggregator and dashboard.
Netflix Suro - log aggregator like Storm and Samza, based on Chukwa.
Pinterest Secor - service implementing Kafka log persistence.

Integrated Development Environments

RStudio - IDE for R.

Service Programming

Akka Toolkit - runtime for distributed and fault-tolerant event-driven applications on the JVM.
Apache Avro - data serialization system.
Apache Curator - Java libraries for Apache ZooKeeper.
Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
Apache Thrift - framework to build binary protocols.
Apache ZooKeeper - centralized service for process management.
Google Chubby - a lock service for loosely-coupled distributed systems.
LinkedIn Norbert - cluster manager.
OpenMPI - message passing framework.
Serf - decentralized solution for service discovery and orchestration.
Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
Twitter Elephant Bird - libraries for working with LZOP-compressed data.
Twitter Finagle - asynchronous network stack for the JVM.

Scheduling

Apache Aurora - service scheduler that runs on top of Apache Mesos.
Apache Falcon - data management framework.
Apache Oozie - workflow job scheduler.
Chronos - distributed and fault-tolerant scheduler.
LinkedIn Azkaban - batch workflow job scheduler.
Sparrow - scheduling platform.

Machine Learning

Apache Mahout - machine learning library for Hadoop.
brain - Neural networks in JavaScript.
Cloudera Oryx - real-time large-scale machine learning.
Concurrent Pattern - machine learning library for Cascading.
convnetjs - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
Decider - Flexible and Extensible Machine Learning in Ruby.
etcML - text classification with machine learning.
Etsy Conjecture - scalable Machine Learning in Scalding.
H2O - statistical, machine learning and math runtime for Hadoop.
MLbase - distributed machine learning libraries for the BDAS stack.
MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
scikit-learn - machine learning in Python.
Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
WEKA - suite of machine learning software.

Benchmarking

Apache Hadoop Benchmarking - micro-benchmarks for testing Hadoop performance.
Berkeley SWIM Benchmark - real-world big data workload benchmark.
Intel HiBench - a Hadoop benchmark suite.
PUMA Benchmarking - benchmark suite for MapReduce applications.
Yahoo Gridmix3 - Hadoop cluster benchmarking from the Yahoo engineering team.

Security

Apache Knox Gateway - single point of secure access for Hadoop clusters.
Apache Sentry - security module for data stored in Hadoop.

System Deployment

Apache Ambari - operational framework for Hadoop management.
Apache Bigtop - system deployment framework for the Hadoop ecosystem.
Apache Helix - cluster management framework.
Apache Mesos - cluster manager.
Apache Slider - YARN application to deploy existing distributed applications on YARN.
Apache Whirr - set of libraries for running cloud services.
Apache YARN - Cluster manager.
Brooklyn - library that simplifies application deployment and management.
Buildoop - similar to Apache Bigtop, based on the Groovy language.
Cloudera HUE - web application for interacting with Hadoop.
Facebook Prism - multi-datacenter replication system.
Google Borg - job scheduling and monitoring system.
Google Omega - job scheduling and monitoring system.
Hortonworks HOYA - application that can deploy HBase cluster on YARN.
Marathon - Mesos framework for long-running services.

Applications

Apache Kiji - framework to collect and analyze data in real-time, based on HBase.
Apache Nutch - open source web crawler.
Apache OODT - capturing, processing and sharing of data for NASA’s scientific archives.
Apache Tika - content analysis toolkit.
Eclipse BIRT - Eclipse-based reporting system.
Eventhub - open source event analytics platform.
HIPI Library - API for performing image processing tasks on Hadoop’s MapReduce.
Hunk - Splunk analytics for Hadoop.
MADlib - data-processing library of an RDBMS to analyze data.
PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
Qubole - auto-scaling Hadoop cluster, built-in data connectors.
Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
SparkR - R frontend for Spark.
Splunk - analyzer for machine-generated data.
Talend - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.

Search engine and framework

Apache Lucene - Search engine library.
Apache Solr - Search platform for Apache Lucene.
ElasticSearch - Search and analytics engine based on Apache Lucene.
Facebook Unicorn - social graph search platform.
Google Caffeine - continuous indexing system.
Google Percolator - continuous indexing system.
TeraGoogle - large search index.
HBase Coprocessor - implementation of Percolator, part of HBase.
LinkedIn Bobo - faceted search implementation written purely in Java, an extension to Apache Lucene.
LinkedIn Cleo - flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
LinkedIn Galene - search architecture at LinkedIn.
LinkedIn Zoie - realtime search/indexing system written in Java.
Sphinx Search Server - fulltext search engine.

MySQL forks and evolutions

Amazon RDS - MySQL databases in Amazon’s cloud.
Drizzle - evolution of MySQL 6.0.
Google Cloud SQL - MySQL databases in Google’s cloud.
MariaDB - enhanced, drop-in replacement for MySQL.
MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
Percona Server - enhanced, drop-in replacement for MySQL.
ProxySQL - High Performance Proxy for MySQL.
TokuDB - storage engine for MySQL and MariaDB.
WebScaleSQL - collaboration among engineers from several companies that face similar challenges in running MySQL at scale.

Memcached forks and evolutions

Facebook McDipper - key/value cache for flash storage.
Facebook Memcached - fork of Memcached.
Twemproxy - a fast, light-weight proxy for memcached and redis.
Twitter Fatcache - key/value cache for flash storage.
Twitter Twemcache - fork of Memcached.

Embedded Databases

BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
HanoiDB - Erlang LSM BTree Storage.
LevelDB - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
RocksDB - embeddable persistent key-value store for fast storage based on LevelDB.

Business Intelligence

Chartio - lean business intelligence platform to visualize and explore your data.
Jaspersoft - powerful business intelligence suite.
Jedox Palo - customisable business intelligence platform.
Microsoft - business intelligence software and platform.
Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
Pentaho - business intelligence platform.
Qlik - business intelligence and analytics platform.
Tableau - business intelligence platform.
Spango BI - open source business intelligence platform.

Data Visualization

Arbor - graph visualization library using web workers and jQuery.
Chart.js - open source HTML5 Charts visualizations.
Cubism - JavaScript library for time series visualization.
D3 - JavaScript library for manipulating documents.
Envisionjs - dynamic HTML5 visualization.
Grafana - graphite dashboard frontend, editor and graph composer.
Graphite - scalable Realtime Graphing.
Google Charts - simple charting API.
Highcharts - simple and flexible charting API.
Matplotlib - plotting with Python.
NVD3 - chart components for d3.js.
Peity - Progressive bar, line and pie charts.
Recline - simple but powerful library for building data applications in pure Javascript and HTML.
Sigma.js - JavaScript library dedicated to graph drawing.
Vega - a visualization grammar.

Interesting Readings

Big Data Benchmark - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
NoSQL Comparison - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.

Interesting Papers

2013 - 2014

2013 - AMPLab - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
2013 - AMPLab - MLbase: A Distributed Machine-learning System.
2013 - AMPLab - Shark: SQL and Rich Analytics at Scale.
2013 - AMPLab - GraphX: A Resilient Distributed Graph System on Spark.
2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
2013 - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud.
2013 - Metamarkets - Druid: A Real-time Analytical Data Store.
2013 - Google - Online, Asynchronous Schema Change in F1.
2013 - Google - F1: A Distributed SQL Database That Scales.
2013 - Google - MillWheel: Fault-Tolerant Stream Processing at Internet Scale.
2013 - Facebook - Scuba: Diving into Data at Facebook.
2013 - Facebook - Unicorn: A System for Searching the Social Graph.
2013 - Facebook - Scaling Memcache at Facebook.

2011 - 2012

2012 - AMPLab - Blink and It’s Done: Interactive Queries on Very Large Data.
2012 - AMPLab - Fast and Interactive Analytics over Hadoop Data with Spark.
2012 - AMPLab - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
2012 - Microsoft - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
2012 - Microsoft - Paxos Made Parallel.
2012 - AMPLab - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
2012 - Google - Processing a trillion cells per mouse click.
2012 - Google - Spanner: Google’s Globally-Distributed Database.
2011 - AMPLab - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
2011 - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
2011 - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

2001 - 2010

2010 - Facebook - Finding a needle in Haystack: Facebook’s photo storage.
2010 - AMPLab - Spark: Cluster Computing with Working Sets.
2010 - Google - Storage Architecture and Challenges.
2010 - Google - Pregel: A System for Large-Scale Graph Processing.
2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (basis of Percolator and Caffeine).
2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
2010 - Yahoo - S4: Distributed Stream Computing Platform.
2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
2008 - AMPLab - Chukwa: A large-scale monitoring system.
2007 - Amazon - Dynamo: Amazon’s Highly Available Key-value Store.
2006 - Google - The Chubby lock service for loosely-coupled distributed systems.
2006 - Google - Bigtable: A Distributed Storage System for Structured Data.
2004 - Google - MapReduce: Simplified Data Processing on Large Clusters.
2003 - Google - The Google File System.

Other Awesome Lists

Other amazingly awesome lists can be found in the awesome-awesomeness list.
