OpenFormat

The benchmark directory contains the real-world data set distribution analysis and the benchmark generator; the python directory contains the scripts for running experiments and plotting.

The executables that test the scan performance of Parquet and ORC live in separate repositories: arrow-Public, arrow-rs, and orc.

To run the experiments, you need to make sure all the repos are in the same home directory. For example:

/home/user/
    \__ OpenFormat 
           \__ benchmark (real-world data analysis and benchmark generator)
              \__ generator_v2 
                 \__ README.md (detailed instructions for the benchmark generator)
           \__ python (experiment automation scripts and plotting)
           \__ vector_data (embeddings: cr and performance)
    \__ arrow-Public (including executables and profiling utilities for testing Parquet's scan performance, and also nested overhead for Parquet and ORC to Arrow)
    \__ orc (including executables and profiling utilities for testing ORC's scan and select performance)
    \__ arrow-rs (including executables for testing Parquet's select performance)
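
Before running anything, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name missing_dirs and the use of the user's home directory as the shared parent are assumptions made for illustration.

```python
from pathlib import Path

# Directories named in the layout above, relative to the shared parent directory.
EXPECTED_DIRS = [
    "OpenFormat/benchmark/generator_v2",
    "OpenFormat/python",
    "arrow-Public",
    "orc",
    "arrow-rs",
]


def missing_dirs(parent):
    """Return the expected directories that do not exist under `parent`."""
    root = Path(parent)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Assumption: the repos are cloned directly under the home directory.
    missing = missing_dirs(Path.home())
    if missing:
        print("Missing directories:")
        for d in missing:
            print("  " + d)
    else:
        print("All expected directories are present.")
```

If the repos live somewhere other than the home directory, pass that parent directory to missing_dirs instead.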

You need to follow each repo's build instructions to install the dependencies of arrow, orc, and arrow-rs, and then build each repo in release mode.

arrow-Public is built using cmake preset "openformat-release".

Some utilities from arrow-rs need to be installed globally (i.e., added to PATH) to run the experiments. To install them, go to the arrow-rs directory and run cargo install --path parquet --features=cli.
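
A quick way to check that the install step worked is to look the utilities up on PATH. This is a sketch under assumptions: the binary names in TOOLS are examples of command-line tools shipped with the upstream parquet crate, and the set the experiment scripts actually call may differ, so adjust the list to match.

```python
import shutil

# Assumption: illustrative binary names; replace them with the utilities
# that the experiment scripts actually invoke.
TOOLS = ["parquet-read", "parquet-schema", "parquet-rowcount"]


def missing_tools(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed utilities are on PATH.")
```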

Benchmark

First, go to benchmark/generator_v2 and read README.md.

Multi Workloads

We provide five typical workloads:

  • classic: classic database and everyday-life workloads, including movie records (imdb), business reviews (yelp), product prices (UKPP), and a recipe collection (menu).
  • geo: datasets with geographic or location information, including cell towers' locations (cells), the Geoname database (geo), and flight information (flight).
  • log: server and website logs, including machine logs (mgbench) and website click logs (edgar).
  • ml: a machine learning dataset (ml).
  • bi: the Public BI benchmark (cwi).

The configs of each workload are in the directory benchmark/generator_v2/workload_config/{workload}_config/. Predicate configs are stored separately in benchmark/generator_v2/filter_config.
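
The path pattern above can be expressed as a small helper. This is only a sketch: the function names are hypothetical, and it assumes that {workload} is replaced by the workload names listed above (classic, geo, log, ml, bi), which should be checked against the actual directory names.

```python
from pathlib import Path

# Workload names as listed above; assumed to match the config directory prefixes.
WORKLOADS = ("classic", "geo", "log", "ml", "bi")
GENERATOR_DIR = Path("benchmark") / "generator_v2"


def workload_config_dir(workload, repo_root="."):
    """Build the config directory path for one workload."""
    if workload not in WORKLOADS:
        raise ValueError("unknown workload: " + repr(workload))
    return Path(repo_root) / GENERATOR_DIR / "workload_config" / (workload + "_config")


def filter_config_dir(repo_root="."):
    """Build the path of the directory that holds the predicate configs."""
    return Path(repo_root) / GENERATOR_DIR / "filter_config"


if __name__ == "__main__":
    for name in WORKLOADS:
        print(workload_config_dir(name))
    print(filter_config_dir())
```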

Experiment scripts

Experiment scripts are in the python directory, with the experiments for each format feature in a separate subdirectory.
