Code Monkey home page Code Monkey logo

tablesaw's Introduction

Tablesaw

Tablesaw is a high-performance, in-memory data table in Java.

Its design is driven by two ideas: First, few people need distributed analytics. On a single server, Tablesaw lets you work interactively with a 2,000,000,000 row table. (I plan to raise that ceiling, btw.) Second, it should be super easy to use: To that end I happily steal ideas from everything from spreadsheets to specialized column stores like KDB.

###On Maven Central:

<dependency>
    <groupId>com.github.lwhite1</groupId>
    <artifactId>tablesaw</artifactId>
    <version>0.7.1</version>
</dependency>

###Getting started?:

A 1.0 release is planned for early September.

What Tablesaw does today:

  • Import data from RDBMS and CSV files, local or remote (http, S3, etc.)
  • Add and remove columns
  • Sort
  • Filter
  • Group
  • Map and reduce operations
  • Descriptive stats (mean, min, max, median, etc.)
  • Store tables in a very-fast, compressed columnar storage format

What Tablesaw will do:

  • Integrated machine learning
  • More specialized column types and operations: (lat/lon, time interval, money)

Plotting

I'm starting to add plot support to Tablesaw, because you can't understand a dataset unless you can see it. We will gradually increase both the number of plot types and the supported options. Meanwhile, here's an example where each dot shows the starting location of a single tornado: Alt text

The goal is to support discovery by seamlessly integrating Tablesaw's data manipulation facilities with plotting libraries to make visualization as easy as possible. We'll take the same approach when it comes to integrating machine learning tools.

Current performance:

In its current state, some areas of Tablesaw perform better than others. To give you a sense of where we're going, you can now load a 500,000,000 row, 4 column csv file (35GB on disk) entirely into about 10 GB of memory. If it's in Tablesaw's .saw format, you can load it in 22 seconds. You can query that table in 1-2 ms: fast enough to use as a cache for a Web app.

BTW, those numbers were achieved on a laptop.

Easy to Use is Easy to Say

The goal in this example is to identify the production shifts with the worst performance. These few lines demonstrate data import, column-wise operations (differenceInSeconds()), filters (isInQ2()) grouping and aggegating (median() and .by()), and (top(n)) calculations.

    Table ops = Table.create("data/operations.csv");                             // load data
    
    LongColumn duration = start.differenceInSeconds(end);                        // calc duration
    duration.setName("Duration");
    ops.addColumn(duration);
    
    Table filtered = ops.selectWhere(                                            // filter
          allOf
              (column("date").isInQ2(),
              (column("SKU").startsWith("429")),
              (column("Operation").isEqualTo("Assembly"))));
   
    Table summary = filtered.median("Duration").by("Facility", "Shift");         // group medians
    FloatArrayList tops = summary.floatColumn("Median").top(5);                  // get "slowest"

If you see something that can be improved, let us know.

tablesaw's People

Contributors

lwhite1 avatar smarks avatar

Watchers

Kevin Kelley avatar James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.