csvtools, fast processing of CSV streams

As our data gets bigger, CSV files grow in size. The CSV format is not exactly pipe-friendly, due to embedded newlines and quoted separators. onyxfish/csvkit offers a great set of utilities for most tasks you would want to perform on CSVs in a GNU-toolset kind of way. However, it is not fast. For reasonably sized data sets this doesn't matter, but for CSVs of more than a few MBs, you start to feel the pain.
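To make the "not pipe-friendly" point concrete, here is a small Python sketch (not part of csvtools; the sample record is made up for illustration) comparing a line-and-comma split with an RFC 4180 aware parser on a cell that contains both a separator and a newline:

```python
import csv
import io

# A header row plus ONE record; its second cell holds a comma and a newline.
data = 'id,comment\r\n1,"hello, world\nsecond line"\r\n'

# Line-oriented tools split on every newline, so they see three "rows".
naive_rows = data.splitlines()

# Splitting the second line on commas cuts the quoted cell in half.
naive_cells = naive_rows[1].split(",")

# A CSV-aware parser keeps the quoted cell intact: two rows, two cells each.
parsed = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_rows), naive_cells)
print(parsed)
```

The naive approach reports three rows and a second line with three cells, while the CSV parser reports two rows of two cells each.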

This repository contains GNU-like tools for parsing RFC 4180 CSVs at high speed.

Tools

  • csvcut: a cut(1) equivalent to drop columns from a CSV file.
  • csvgrep: a grep(1) equivalent to match on one or more columns per row, and keep only the rows matching all or any of the patterns. (It uses PCRE for regular expression goodness.)
  • csvawk: a wrapper for awk(1) which correctly recognizes rows and cells (even across newlines). This is comparable to geoffroy-aubry/awk-csv-parser, except that it also supports embedded newlines.
  • csvpipe and csvunpipe: translate the newlines separating rows to \0, such that sort -z, uniq -z, and other tools that work on null-terminated lines can be used more correctly.
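The idea behind csvpipe and csvunpipe can be sketched in a few lines of Python. This is only an illustration of the translation described above, not the actual implementation: the function names `pipe_rows` and `unpipe_rows` are hypothetical, and the sketch assumes `\n` row endings, `"` as the quote character, and no NUL bytes in the data.

```python
def pipe_rows(data: bytes) -> bytes:
    """Replace newlines that separate rows with NUL; keep embedded newlines."""
    out = bytearray()
    in_quotes = False
    for byte in data:
        if byte == 0x22:  # '"' toggles quoted state; an escaped "" toggles twice
            in_quotes = not in_quotes
            out.append(byte)
        elif byte == 0x0A and not in_quotes:  # '\n' outside quotes ends a row
            out.append(0)
        else:
            out.append(byte)
    return bytes(out)


def unpipe_rows(data: bytes) -> bytes:
    """Reverse the translation (assumes the cells contain no NUL bytes)."""
    return data.replace(b"\0", b"\n")


data = b'b,"two\nlines"\na,plain\n'
piped = pipe_rows(data)

# Each NUL-terminated record is now one unit, which is what sort -z relies on.
records = sorted(piped.split(b"\0")[:-1])
print(records)
```

After the translation, the record with the embedded newline stays in one piece while sorting, and `unpipe_rows` restores the original stream.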

Performance

Benchmarking on the Canada 2011 census, we compare csvtools with other solutions. Note that these other solutions might not handle CSVs correctly.

The performance was measured with an 850 MB CSV file on an SSD, and the maximum speed was 519 MB/s.

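| scenario              | csvkit     | cut       | sed       | csvtools  |
|-----------------------|------------|-----------|-----------|-----------|
| dropping first column | 4.32 MiB/s | 195 MiB/s | 228 MiB/s | 318 MiB/s |
| dropping third column | 4.12 MiB/s | 224 MiB/s | 91 MiB/s  | 359 MiB/s |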

So even compared to sed or cut, which do not handle quoted separators correctly, our csvcut is much faster.

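| scenario     | csvkit     | grep      | awk       | csvtools  |
|--------------|------------|-----------|-----------|-----------|
| one pattern  | 1.86 MiB/s | 284 MiB/s | 208 MiB/s | 310 MiB/s |
| two patterns | 1.87 MiB/s | 224 MiB/s | 140 MiB/s | 258 MiB/s |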

csvgrep is faster than both grep and awk. For grep, this is because the column selection has to be expressed with negated character classes, repeated multiple times.

There are of course regular expressions for which PCRE is slower than grep.
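To illustrate what column selection with negated character classes looks like, here is a Python sketch. The exact patterns used in the benchmark are not given here, so the regex shape and the helper name `column_regex` are assumptions for illustration only.

```python
import re


def column_regex(column: int, pattern: str) -> "re.Pattern[str]":
    """Build a line regex that looks for `pattern` in the 0-based `column`.

    Each skipped column costs one repetition of "anything but a comma,
    then a comma", which is the repeated negated character class.
    """
    return re.compile(r"^(?:[^,]*,){" + str(column) + r"}[^,]*" + pattern)


third = column_regex(2, "foo")

print(bool(third.search("a,b,xfoo,d")))    # pattern is in the third column
print(bool(third.search("foo,b,c,d")))     # pattern is in the first column
print(bool(third.search('"a,b",c,foo')))   # quoted separator shifts the columns
```

The last line shows the correctness problem as well: `foo` is the third CSV cell, but the regex counts the comma inside the quoted cell and therefore looks at the wrong column.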

| scenario             | awk       | awk-csv-parser | csvtools  |
|----------------------|-----------|----------------|-----------|
| print second column  | 317 MiB/s | 2.21 MiB/s     | 307 MiB/s |
| count numeric column | 219 MiB/s | 2.23 MiB/s     | 217 MiB/s |

Since csvawk wraps awk, the raw awk speed is the upper limit. csvawk adds only a small overhead, in exchange for accurate results.

Why so fast?

No malloc & memcpy! Or as valgrind reports it:

==2473==   total heap usage: 18 allocs, 18 frees, 210 bytes allocated

In the critical path of tokenizing the CSV stream and writing it to stdout, there are no copies or memory allocations. The programs read into a buffer from stdin (or the file passed as last argument), the tokenizer stores offsets (into that buffer) and lengths in a cell array, and the printer writes from the same buffer, using the offsets and lengths from the cell array.
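The offsets-and-lengths design can be sketched in Python as follows. This is an illustration of the idea only, not the project's code: the names `tokenize_row` and `drop_column` are hypothetical, the sketch assumes `,` separators and `\n` row endings, and it holds the whole input in one buffer. Python itself still allocates objects, so only the "never copy cell contents" part carries over.

```python
import io


def tokenize_row(buf: bytes, start: int):
    """Return ([(offset, length), ...], next_start) for the row at `start`.

    Cells are described by positions in `buf`; no cell content is copied.
    """
    cells = []
    pos = cell_start = start
    in_quotes = False
    while pos < len(buf):
        c = buf[pos]
        if c == 0x22:  # '"' toggles quoted state
            in_quotes = not in_quotes
        elif not in_quotes and c == 0x2C:  # ',' ends a cell
            cells.append((cell_start, pos - cell_start))
            cell_start = pos + 1
        elif not in_quotes and c == 0x0A:  # '\n' ends the row
            break
        pos += 1
    cells.append((cell_start, pos - cell_start))
    return cells, min(pos + 1, len(buf))


def drop_column(buf: bytes, drop: int, out) -> None:
    """Write every row of `buf` to `out`, skipping the 0-based column `drop`."""
    view = memoryview(buf)  # slices of a memoryview do not copy the data
    start = 0
    while start < len(buf):
        cells, start = tokenize_row(buf, start)
        first = True
        for index, (offset, length) in enumerate(cells):
            if index == drop:
                continue
            if not first:
                out.write(b",")
            out.write(view[offset:offset + length])
            first = False
        out.write(b"\n")


buf = b'a,"x,y",c\n1,"two\nlines",3\n'
out = io.BytesIO()
drop_column(buf, 0, out)
print(out.getvalue())
```

The printer only ever reads from the original buffer, guided by the cell array, which mirrors the description above: tokenizing produces positions, and writing reuses the input bytes.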

Installation

  1. Clone this repository
  2. Navigate to it
  3. make install (or with prefix: make install prefix=~/.apps/)
  4. enjoy :)

Contributors

davylandman
