As our data gets bigger, CSV files grow in size. The CSV format is not exactly pipe-friendly due to embedded newlines and quoted separators. onyxfish/csvkit offers a great set of utilities for most tasks you would want to perform on CSVs in a GNU-toolset kind of way. However, it is not fast. For reasonably sized data sets this doesn't matter, but for CSVs of more than a few MBs you start to feel the pain.
This repository contains GNU-style tools for parsing RFC 4180 CSVs at high speed.
- `csvcut` — a `cut(1)` equivalent to drop columns from a CSV file
- `csvgrep` — a `grep(1)` equivalent to match on one or more columns per row, keeping only the rows matching all (or any) of the patterns (it uses PCRE for regular expression goodness)
- `csvawk` — a wrapper for `awk(1)` which correctly recognizes rows and cells (even across newlines). This is comparable to geoffroy-aubry/awk-csv-parser, except that it also supports embedded newlines.
- `csvpipe` and `csvunpipe` — translate the newlines separating rows to `\0`, so that `sort -z`, `uniq -z`, and other null-terminated-line based tools can be used correctly.
Benchmarking on the Canada 2011 census, we compare csvtools with other solutions. Note that these other solutions might not handle CSVs correctly.
Performance was measured with an 850 MB CSV file on an SSD; the maximum raw read speed was 519 MB/s.
scenario | csvkit | cut | sed | csvtools |
---|---|---|---|---|
dropping first column | 4.32 MiB/s | 195 MiB/s | 228 MiB/s | 318 MiB/s |
dropping third column | 4.12 MiB/s | 224 MiB/s | 91 MiB/s | 359 MiB/s |
So even compared to `sed` or `cut`, which don't handle quoted separators correctly, our `csvcut` is much faster.
scenario | csvkit | grep | awk | csvtools |
---|---|---|---|---|
one pattern | 1.86 MiB/s | 284 MiB/s | 208 MiB/s | 310 MiB/s |
two patterns | 1.87 MiB/s | 224 MiB/s | 140 MiB/s | 258 MiB/s |
It is faster than `grep` and `awk` because column selection in `grep` has to be done with negative character classes, repeated once per column to skip.
There are of course regular expressions for which PCRE is slower than `grep`.
scenario | awk | awk-csv-parser | csvtools |
---|---|---|---|
print second column | 317 MiB/s | 2.21 MiB/s | 307 MiB/s |
count numeric column | 219 MiB/s | 2.23 MiB/s | 217 MiB/s |
Since we wrap `awk`, the raw `awk` figure is the maximum attainable speed; the numbers show only a small overhead in exchange for accurate results.
No malloc & memcpy! Or as valgrind reports it:
==2473== total heap usage: 18 allocs, 18 frees, 210 bytes allocated
In the critical path of tokenizing the CSV stream and writing it to stdout, there are no copies or memory allocations. The programs read into a buffer from stdin (or the file passed as the last argument), the tokenizer stores offsets into that buffer and lengths in a cell array, and the printer writes from the same buffer, using the offsets and lengths from the cell array.
- Clone this repository
- Navigate to it
- `make install` (or with a prefix: `make install prefix=~/.apps/`)
- enjoy :)