Code Monkey home page Code Monkey logo

tpc-ds's Introduction

########################################################################################
TPC-DS benchmark scripts for HAWQ and Greenplum database.
########################################################################################

########################################################################################
Prerequisites
########################################################################################
1. Greenplum Database or Apache HAWQ installed and running
2. Connectivity is possible to the MASTER_HOST and from the Data Nodes / Segment Hosts
3. Root access

########################################################################################
Installation
########################################################################################
1. ssh to the host that will be home to these scripts and files with root
ssh root@mdw

2. Download the tpcds.sh file
curl https://raw.githubusercontent.com/pivotalguru/TPC-DS/master/tpcds.sh > tpcds.sh
chmod 755 tpcds.sh

3. Execute tpcds.sh and specify the scale in Gigabtes and the number of threads that
will create the data
./tpcds.sh 100 8

This example will create a 100 GB TPC-DS dataset and run all of the queries.

Note: Increase the number of threads when building larger datasets and the host has 
available cores to process the data.

Aditionally, a quiet mode with no user prompts is available when an optional third 
parameter is provided with any value.  Example:
./tpcds.sh 100 8 shhh

########################################################################################
Notes
########################################################################################
- tpch_variables.sh file will be created with variables you can adjust
- The default location for the repo is /pivotalguru/TPC-DS but can be changed in the 
tpch_variables.sh file.
- Additional storage options are in /pivotalguru/TPC-DS/functions.sh
- Files for the benchmark will be created in /pivotalguru/TPC-DS/data[1-8]/
You can update these directories to be symbolic links to better utilize the disk 
volumes you have available.
- Example of running tpcds as root as a background process:
nohup ./tpcds.sh 3000 48 shh > tpcds.log 2>&1 < tpcds.log &

########################################################################################
TPC-DS Minor Modifications
########################################################################################
1.  Change to SQL queries that subtracted or added days were modified slightly:

Old:
and (cast('2000-02-28' as date) + 30 days)

New:
and (cast('2000-02-28' as date) + '30 days'::interval)

This was done on queries: 12, 16, 20, 21, 32, 37, 40, 77, 80, 82, 92, 94, 95, and 98.

2.  Change to queries with ORDER BY on column alias to use sub-select.

Old: 
select  
    sum(ss_net_profit) as total_sum
   ,s_state
   ,s_county
   ,grouping(s_state)+grouping(s_county) as lochierarchy
   ,rank() over (
 	partition by grouping(s_state)+grouping(s_county),
 	case when grouping(s_county) = 0 then s_state end 
 	order by sum(ss_net_profit) desc) as rank_within_parent
 from
    store_sales
   ,date_dim       d1
   ,store
 where
    d1.d_month_seq between 1212 and 1212+11
 and d1.d_date_sk = ss_sold_date_sk
 and s_store_sk  = ss_store_sk
 and s_state in
             ( select s_state
               from  (select s_state as s_state,
 			    rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking
                      from   store_sales, store, date_dim
                      where  d_month_seq between 1212 and 1212+11
 			    and d_date_sk = ss_sold_date_sk
 			    and s_store_sk  = ss_store_sk
                      group by s_state
                     ) tmp1 
               where ranking <= 5
             )
 group by rollup(s_state,s_county)
 order by
   lochierarchy desc
  ,case when lochierarchy = 0 then s_state end
  ,rank_within_parent
 limit 100;

New:
select * from ( --new
select  
    sum(ss_net_profit) as total_sum
   ,s_state
   ,s_county
   ,grouping(s_state)+grouping(s_county) as lochierarchy
   ,rank() over (
 	partition by grouping(s_state)+grouping(s_county),
 	case when grouping(s_county) = 0 then s_state end 
 	order by sum(ss_net_profit) desc) as rank_within_parent
 from
    store_sales
   ,date_dim       d1
   ,store
 where
    d1.d_month_seq between 1212 and 1212+11
 and d1.d_date_sk = ss_sold_date_sk
 and s_store_sk  = ss_store_sk
 and s_state in
             ( select s_state
               from  (select s_state as s_state,
 			    rank() over ( partition by s_state order by sum(ss_net_profit) desc) as ranking
                      from   store_sales, store, date_dim
                      where  d_month_seq between 1212 and 1212+11
 			    and d_date_sk = ss_sold_date_sk
 			    and s_store_sk  = ss_store_sk
                      group by s_state
                     ) tmp1 
               where ranking <= 5
             )
 group by rollup(s_state,s_county)
) AS sub --new
 order by
   lochierarchy desc
  ,case when lochierarchy = 0 then s_state end
  ,rank_within_parent
 limit 100;

This was done on queries: 36 and 70.

tpc-ds's People

Contributors

runningjon avatar

Watchers

Lirong Jian avatar James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.