Code Monkey home page Code Monkey logo

simdcsv's Introduction

simdcsv

A fast SIMD parser for CSV files as defined by RFC 4180.

This project will be a fast SIMD parser for CSV files. The approach will closely resemble simdjson in many respects; I plan to a number of similar tricks to the ones we did in that project. Initially, many techniques will be (regrettably) copy-pasted from that project; I hope to factor out some common functionality for this kind of code later.

Real-life parsing of CSV files has to deal with a huge range of optional variations on what a CSV might look like. My plan is to initially focus on standards-compliant CSV files and potentially add some variations later.

The broad outline of how the parsing will work:

  1. Read in a CSV file into a buffer - as per usual, the buffer will be cache-line-aligned and padded so that even an exuberantly long SIMD read in a unrolled loop can safely happen without having to worry about unsafe reads.

  2. Identification of CSV fields. This process will be considerably simpler, as unlike simdjson, we will not have to a implement a complex grammar.

a) We need to identify where are quotes are first - this ensures that escaped commas and CR-LF pairs are not treated as separators. Since RFC 4180 defines our quote convention as using "" for an escaped quote in all circumstances where they appear, and otherwise pairing quotes at the start and end of a field, this means that our quote detection code from simjson (see https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/ for a write-up) will allow us to identify all regions where we are 'inside' a quote quite easily.

The "edges" that we will identify here are relatively complex as we will nominally leave and reenter a quoted field every time we encounter a doubled-quote. So for example,

,"foo""bar,",

encountered in a field will cause us to 'leave and renenter' our quoted field between the 'foo' and the 'bar'. However, this will have no real effect on the main point of this pass, which is to identify unescaped commas and CR-LF sequences.

  1. Comma and CR-LF detection.

We need to then scan for commas and CR-LF pairs. This is relatively simple and the only new wrinkle on SIMD scanning techniques in simdjson is the fact that we have to detect a CR followed by a LF.

At this point, we can identify all our actual delimiters. There may be additional passes to be done in the SIMD domain, but it's possible that we might at this stage do a bits-to-indexes transform and start working on our CSV document as a series of indexes into our data in a 2-dimensional (at least nominally) array.

Other tasks that need to happen:

  • We should validate that the things that appear as "textdata" within the fields are valid ASCII as per the standard.
  • UTF validation is not covered by RFC 4180 but will surely be a necessity.
  • Numbers that appear within fields will likely need to be converted to integer or floating point values
  • The escaped text will need to be converted (in situ or in newly allocated storage) into unescaped variants
  • It should be possible to parse only some columns, without incurring much of a price for skipping the other columns.

The initial cut of the code will be for AVX2 capable machines. An ARM variant will appear shortly, as will AVX512 and possible SSE versions.

References

Ge, Chang and Li, Yinan and Eilebrecht, Eric and Chandramouli, Badrish and Kossmann, Donald, Speculative Distributed CSV Data Parsing for Big Data Analytics, SIGMOD 2019.

Mühlbauer, T., Rödiger, W., Seilbeck, R., Reiser, A., Kemper, A., & Neumann, T. (2013). Instant loading for main memory databases. Proceedings of the VLDB Endowment, 6(14), 1702-1713.

simdcsv's People

Contributors

geofflangdale avatar lemire avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

simdcsv's Issues

License

Any chance you could switch to a more permissive license like MIT?

Thanks for your hard work on this!

For column projection, only a partial indexing is necessary

Terminology: Selecting a subset of columns is called a projection.

Since extracting the indexes is expensive, you may want to only pick some of the indexes, never committing to memory the unneeded indexes. So maybe you want the 4th index and the 10th one. Then you want the 14th index and the 20th index and so forth. Skipping an index is cheap, it is a single instruction (blsr).

So, conceptually, getting just the indexes you need is easy. Practically, there is a software engineering issue in getting the whole thing to work... but it can be done.

use flag -march=native may cause trouble

So I got 2 different instances, both CPU supports AVX2, I was expecting the shared library built from one instance can be used in another, however that's not the case. I was able to narrow the issue down to the -march=native flag: When replace it with -mavx2 -mpclmul it works perfectly in both instances. Just a note for anyone who may got same issue.

bad msvc performance

I've proposed a PR (#10) that allows to compile on MSVC.

However the performance is awful (0.65GB/sec instead of 11GB/sec on the same machine, with MSVC Clang)

If anybody is able to look into that.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.