

An Empirical Evaluation of Columnar Storage Formats

This repository contains the code for the paper "An Empirical Evaluation of Columnar Storage Formats", to be published in VLDB Vol 17, No 2.

Directory structure

The code is split into several repositories. Some of them are forks of the main branch of the corresponding format's repository, modified for profiling purposes.

~/EvaluationOfColumnarFormats/
    \____ OpenFormat 
           \____ benchmark (real-world data analysis and benchmark generator)
           \____ python (experiment automation scripts and plotting)
                \____ ...
                \____ cudf (experiment for GPU decoding)
           \____ vector_data (experiment scripts for ML workload)
    \____ arrow-private (including executables and profiling utilities for testing Parquet's scan performance, and also nested overhead for Parquet and ORC to Arrow)
    \____ orc (including executables and profiling utilities for testing ORC's scan and select performance)
    \____ arrow-rs (including executables for testing Parquet's select performance)
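
Before building, it can help to confirm that the four top-level repositories sit side by side as the tree above shows. The following is a minimal sketch: the helper name missing_repos is hypothetical and not part of any of the repositories, and the root directory is wherever you cloned them.

```shell
# Hypothetical helper (not part of the repositories): print each expected
# top-level repository that is missing under the given root directory.
missing_repos() {
    root="$1"
    for repo in OpenFormat arrow-private orc arrow-rs; do
        [ -d "$root/$repo" ] || printf '%s\n' "$repo"
    done
}
```

Calling missing_repos ~/EvaluationOfColumnarFormats prints nothing when the layout is complete, and otherwise prints one missing directory name per line.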

Instructions

Follow each repo's build instructions to install the dependencies of arrow, orc, and arrow-rs, then build each repo in release mode.

arrow is built using the CMake preset "openformat-release".

Some utilities from arrow-rs need to be installed globally (i.e., added to PATH) to run the experiments. To install them, go to the arrow-rs directory and run cargo install --path parquet --features=cli.
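
After cargo install finishes, a quick check that the installed binaries are reachable from PATH can save a failed experiment run. A minimal sketch follows: the helper name missing_tools is hypothetical, and the tool names in the usage note are assumptions based on the upstream arrow-rs parquet crate's cli feature, so the fork used here may install a different set.

```shell
# Hypothetical helper: print each named command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, missing_tools parquet-read parquet-schema parquet-rowcount prints nothing if all three (assumed) binaries are installed. If any name is printed, check that cargo's bin directory (typically ~/.cargo/bin) is on PATH.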

OpenFormat contains further instructions on how to run the workload and data generator, and how to reproduce the results.

Contributors

xinyuzeng
