COOL: a COhort OnLine analytical processing system



Introduction to COOL

Different groups of people often behave differently or follow different trends. For example, the bones of older people are more porous than those of younger people. Exploring the behaviors and trends of different groups is valuable, especially in healthcare, because appropriate measures can then be taken in time to prevent harm. Cohort analysis is a straightforward way to do this.

However, as large and varied datasets accumulate over the years, query efficiency has become a problem for OnLine Analytical Processing (OLAP) systems, especially for cohort analysis. COOL was built to address this problem.

COOL is an online cohort analytical processing system that supports various types of data analytics, including cube query, iceberg query and cohort query.

Built on several newly proposed operators and a sophisticated storage layer, COOL can provide high-performance (near real-time) analytical responses for emerging data warehouse domains.

Key features of COOL

  1. Easy to use. COOL is easy to deploy locally or on the cloud via Docker.
  2. Near Real-time Responses. COOL is highly efficient and can answer cohort queries in near real time.
  3. Specialized Storage Layout. A specialized storage layout is designed for fast query processing and reduced space consumption.
  4. Self-designed Semantics. COOL defines novel semantics for cohort queries, which reduce their complexity and improve their functionality.
  5. Flexible Integration. COOL integrates with other data systems via common data formats (e.g., CSV, Parquet, Avro, and Arrow).
  6. Artificial Intelligence Model. A new neural network model will be introduced soon.

Quickstart

Build package

mvn clean package

Required sources

  1. dataset file: a CSV file with "," delimiter (normally dumped from a database table) and the table header removed.
  2. dataset schema file: a table.yaml file specifying the dataset's columns and their measure fields.
  3. query file: a YAML file specifying the parameters for running the query server.
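
If a database dump still carries its header row, the row can be dropped before loading. The following is a minimal sketch using the standard tail utility; the helper name strip_csv_header, the file names raw.csv and data.csv, and the sample columns are all hypothetical and not part of COOL.

```shell
# Hypothetical helper (not part of COOL): copy a CSV dump without its
# first line, which is assumed to be the table header.
strip_csv_header() {
    # $1: input CSV that still has a header row
    # $2: output CSV without the header row
    tail -n +2 "$1" > "$2"
}

# Demo on a throwaway file with made-up columns.
workdir=$(mktemp -d)
printf 'id,event,time\n1,launch,2013-05-20\n2,shop,2013-05-21\n' > "$workdir/raw.csv"
strip_csv_header "$workdir/raw.csv" "$workdir/data.csv"
cat "$workdir/data.csv"
```

The sketch assumes the header occupies exactly one line and that no quoted field in that line contains a newline.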

Load dataset

Before processing queries, the dataset must be loaded into COOL's native format. Sample code for loading a CSV dataset with the data loader can be found in CsvLoader.java.

./cool load \
    dataset \
    path/to/your/.yaml \
    path/to/your/datafile \
    path/to/output/datasource/directory

The four arguments in the command have the following meaning:

  1. the dataset name
  2. the table.yaml (the second required source)
  3. the dataset file (the first required source)
  4. the output directory for the compacted dataset
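
Because the load command takes its arguments by position, it can help to check them before invoking it. The following is a minimal sketch of such a wrapper; the function name load_dataset and the COOL_BIN variable (defaulting to ./cool) are assumptions made for illustration, and only the `load` subcommand and its argument order come from the command shown above.

```shell
# Hypothetical wrapper (not part of COOL) around the documented
# `./cool load` call. COOL_BIN is an assumed override for the launcher path.
load_dataset() {
    # Expect: dataset name, table.yaml, dataset file, output directory.
    if [ "$#" -ne 4 ]; then
        echo "usage: load_dataset <name> <table.yaml> <datafile> <output-dir>" >&2
        return 2
    fi
    # Fail early if the schema file ($2) or the data file ($3) is missing.
    for f in "$2" "$3"; do
        if [ ! -f "$f" ]; then
            echo "missing input file: $f" >&2
            return 1
        fi
    done
    # Forward the arguments in the documented order.
    "${COOL_BIN:-./cool}" load "$1" "$2" "$3" "$4"
}
```

A call then looks like `load_dataset sogamo datasets/sogamo/table.yaml datasets/sogamo/data.csv ./CubeRepo`.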

Execute queries

We provide an example for cohort query processing in CohortAnalysis.java.

Cohort Selection

./cool cohortselection \
    path/to/output/datasource/directory \
    path/to/your/queryfile

Cohort Query

./cool cohortquery \
    path/to/output/datasource/directory \
    path/to/your/cohortqueryfile

Funnel Query

./cool funnelquery \
    path/to/output/datasource/directory \
    path/to/your/funnelqueryfile

OLAP Query

./cool olapquery \
    path/to/output/datasource/directory \
    path/to/your/queryfile

Example: Cohort Analysis

Load dataset

Examples are provided in the sogamo and health_raw directories. We start with sogamo.

COOL supports the CSV data format by default, so the sogamo dataset can be loaded with the following command.

./cool load \
    sogamo \
    datasets/sogamo/table.yaml \
    datasets/sogamo/data.csv \
    ./CubeRepo

This generates a cube named sogamo under the ./CubeRepo directory.

Similarly, load the health_raw dataset with:

./cool load \
    health_raw \
    datasets/health_raw/table.yaml \
    datasets/health_raw/data.csv \
    ./CubeRepo
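
Before running queries, it may be worth confirming that both loads produced a cube. The following is a minimal sketch that assumes each cube appears as an entry named after its dataset directly under the repository directory, as described for sogamo above; the function name missing_cubes is hypothetical.

```shell
# Hypothetical check (not part of COOL): print the name of every expected
# cube that has no entry under the given cube repository directory.
missing_cubes() {
    repo=$1
    shift
    for cube in "$@"; do
        # -e accepts either a file or a directory named after the cube.
        [ -e "$repo/$cube" ] || echo "$cube"
    done
}
```

Running `missing_cubes ./CubeRepo sogamo health_raw` prints nothing when both cubes are present.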

Execute queries

We use the health_raw dataset to demonstrate cohort analysis.

Select the specific users

./cool cohortselection \
    ./CubeRepo \
    datasets/health_raw/sample_query_selection/query.json

where the arguments are:

  1. ./CubeRepo: the output directory for the compacted dataset
  2. datasets/health_raw/sample_query_selection/query.json: the cohort query (in JSON)

Execute cohort query

./cool cohortquery \
    ./CubeRepo \
    datasets/health_raw/sample_query_average/query.json

Funnel Analysis

We use the sogamo dataset to demonstrate funnel analysis.

./cool funnelquery \
    ./CubeRepo \
    datasets/sogamo/sample_funnel_analysis/query.json
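
The three example queries above can be chained so that a failure stops the run. The following is a minimal sketch; the function name run_examples and the COOL_BIN variable (defaulting to ./cool) are assumptions, and it presumes the working directory is the repository root and that both datasets have already been loaded.

```shell
# Hypothetical runner (not part of COOL): execute the documented example
# queries in order and stop at the first one that fails.
run_examples() {
    cool=${COOL_BIN:-./cool}   # assumed override for the launcher path
    repo=${1:-./CubeRepo}      # cube repository, as used in the examples
    "$cool" cohortselection "$repo" datasets/health_raw/sample_query_selection/query.json &&
    "$cool" cohortquery "$repo" datasets/health_raw/sample_query_average/query.json &&
    "$cool" funnelquery "$repo" datasets/sogamo/sample_funnel_analysis/query.json
}
```

Calling `run_examples` with no argument uses ./CubeRepo; the exit status is that of the last command that ran.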

Example: OLAP Analysis

Load dataset

An example is provided in the olap-tpch directory.

COOL supports the CSV data format by default, so the TPC-H dataset can be loaded with the following command.

./cool load \
    tpc-h-10g \
    datasets/olap-tpch/table.yaml \
    datasets/olap-tpch/scripts/data.csv \
    ./CubeRepo

This generates a cube named tpc-h-10g under the ./CubeRepo directory.

Execute queries

Run Server

  1. Put the application.property file at the same level as the .jar file.
  2. Edit the server configuration in the application.property file.
  3. Run the command below.

./cool server

Connect to external storage services

COOL has a StorageService interface, which will allow COOL standalone servers/workers (coming soon) to handle data movement between local storage and an external storage service. A sample implementation for an HDFS connection can be found under hdfs-extensions.

Publication

  • Q. Cai, K. Zheng, H.V. Jagadish, B.C. Ooi, J.W.L. Yip. CohortNet: Empowering Cohort Discovery for Interpretable Healthcare Analytics, in Proceedings of the VLDB Endowment, 17(10), 2024.

  • Z. Xie, H. Ying, C. Yue, M. Zhang, G. Chen, B. C. Ooi. Cool: a COhort OnLine analytical processing system, in 2020 IEEE 36th International Conference on Data Engineering, pp.577-588, 2020.

  • Q. Cai, Z. Xie, M. Zhang, G. Chen, H.V. Jagadish and B.C. Ooi. Effective Temporal Dependence Discovery in Time Series Data, in Proceedings of the VLDB Endowment, 11(8), pp.893-905, 2018.

  • Z. Xie, Q. Cai, F. He, G.Y. Ooi, W. Huang, B.C. Ooi. Cohort Analysis with Ease, in Proceedings of the 2018 International Conference on Management of Data, pp.1737-1740, 2018.

  • D. Jiang, Q. Cai, G. Chen, H. V. Jagadish, B. C. Ooi, K.-L. Tan, and A. K. H. Tung. Cohort Query Processing, in Proceedings of the VLDB Endowment, 10(1), 2016.
