
tpcds-hdinsight

The goal of this project is to generate TPC-DS data with Hive and create your own HDInsight benchmarks for various engines:

  1. Hive
  2. Interactive Hive (LLAP)
  3. Spark
  4. Presto

How to use with Hive CLI

  1. Clone this repo.

    git clone https://github.com/hdinsight/tpcds-hdinsight/ && cd tpcds-hdinsight
  2. Run TPCDSDataGen.hql with the settings.hql file, setting the required config variables.

    /usr/bin/hive -i settings.hql -f TPCDSDataGen.hql -hiveconf SCALE=10 -hiveconf PARTS=10 -hiveconf LOCATION=/HiveTPCDS/ -hiveconf TPCHBIN=resources 

    Here,

    SCALE is the TPC-DS scale factor. Scale factor 10 generates roughly 10 GB of data, scale factor 1000 generates roughly 1 TB, and so on.

    PARTS is the number of tasks to use for data generation (parallelization). It must be set to the same value as SCALE.

    LOCATION is the directory where the data will be stored on HDFS.

    TPCHBIN is where the resources are found. You can set additional configuration in the settings.hql file.

  3. Now you can create tables on the generated data.

    /usr/bin/hive -i settings.hql -f ddl/createAllExternalTables.hql -hiveconf LOCATION=/HiveTPCDS/ -hiveconf DBNAME=tpcds

    Then generate the ORC tables and analyze them:

    hive -i settings.hql -f ddl/createAllORCTables.hql -hiveconf ORCDBNAME=tpcds_orc -hiveconf SOURCE=tpcds
    hive -i settings.hql -f ddl/analyze.hql -hiveconf ORCDBNAME=tpcds_orc 
  4. Run the queries!

    /usr/bin/hive -database tpcds_orc -i settings.hql -f queries/query12.sql 
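    The per-query invocation above can be wrapped in a small timing loop. The sketch below is a hypothetical helper (not part of this repo) that runs every file in queries/ once through the Hive CLI and appends one CSV row per query, mirroring the beeline timing loop shown later in this README. HIVE_CMD is an assumed override hook so the loop can be dry-run against a stub command.

    ```shell
    #!/bin/sh
    # Hypothetical wrapper around the Hive CLI invocation shown above.
    # HIVE_CMD defaults to the command from this README, but can be
    # overridden (e.g. HIVE_CMD=true for a dry run).
    HIVE_CMD="${HIVE_CMD:-/usr/bin/hive -database tpcds_orc -i settings.hql}"

    run_all_queries() {
        outfile="$1"
        for f in queries/*.sql; do
            start="$(date +%s)"
            # Intentionally unquoted so HIVE_CMD's flags are split into words.
            $HIVE_CMD -f "$f" > "$f.out" 2>&1
            status=$?
            end="$(date +%s)"
            # CSV row: query file, exit status, start, end, elapsed seconds.
            echo "$f,$status,$start,$end,$((end - start))" >> "$outfile"
        done
    }
    ```

    Usage: `run_all_queries times_hive.csv`.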

How to use with Beeline CLI

  1. Clone this repo.

    git clone https://github.com/hdinsight/tpcds-hdinsight && cd tpcds-hdinsight
  2. Upload the resources to DFS.

    hdfs dfs -copyFromLocal resources /tmp
  3. Run TPCDSDataGen.hql with the settings.hql file, setting the required config variables.

    beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f TPCDSDataGen.hql -hiveconf SCALE=10 -hiveconf PARTS=10 -hiveconf LOCATION=/HiveTPCDS/ -hiveconf TPCHBIN=`grep -A 1 "fs.defaultFS" /etc/hadoop/conf/core-site.xml | grep -o "wasb[^<]*"`/tmp/resources  
    Here,

    SCALE is the TPC-DS scale factor. Scale factor 10 generates roughly 10 GB of data, scale factor 1000 generates roughly 1 TB, and so on.

    PARTS is the number of tasks to use for data generation (parallelization). It must be set to the same value as SCALE.

    LOCATION is the directory where the data will be stored on HDFS.

    TPCHBIN is where the resources are found. You can set additional configuration in the settings.hql file.

  4. Now you can create tables on the generated data.

    beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/createAllExternalTables.hql -hiveconf LOCATION=/HiveTPCDS/ -hiveconf DBNAME=tpcds

    Then generate the ORC tables and analyze them:

    beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/createAllORCTables.hql -hiveconf ORCDBNAME=tpcds_orc -hiveconf SOURCE=tpcds
    beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/analyze.hql -hiveconf ORCDBNAME=tpcds_orc 
  5. Run the queries!

    beeline -u "jdbc:hive2://`hostname -f`:10001/tpcds_orc;transportMode=http" -n "" -p "" -i settings.hql -f queries/query12.sql 

If you want to run all the queries 10 times and measure how long each run takes, you can use the following command:

for f in queries/*.sql; do for i in {1..10} ; do STARTTIME="`date +%s`";  beeline -u "jdbc:hive2://`hostname -f`:10001/tpcds_orc;transportMode=http" -i settings.hql -f $f  > $f.run_$i.out 2>&1 ; SUCCESS=$? ; ENDTIME="`date +%s`"; echo "$f,$i,$SUCCESS,$STARTTIME,$ENDTIME,$(($ENDTIME-$STARTTIME))" >> times_orc.csv; done; done;
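The loop above writes one CSV row per run: query file, run number, exit status, start time, end time, and elapsed seconds. As a sketch (summarize_times is a hypothetical helper name, not part of the repo, and it assumes exactly that six-column layout), the averages per query over successful runs can be computed with awk:

```shell
# Hypothetical helper: read a times_orc.csv produced by the loop above
# (query,run,exit_status,start,end,elapsed) and print the average elapsed
# seconds per query, counting only runs that exited with status 0.
summarize_times() {
    awk -F, '$3 == 0 { sum[$1] += $6; n[$1]++ }
             END { for (q in sum) printf "%s,%.1f\n", q, sum[q] / n[q] }' "$1" | sort
}
```

Usage: `summarize_times times_orc.csv`.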

FAQ

Does it work with scale factor 1?

No. The parallel data generation assumes that SCALE > 1. If you are just starting out, I would suggest starting with 10 and then moving to the standard higher scale factors (100, 1000, 10000, ...).

Do I have to specify PARTS=SCALE ?

Yes.

How do I avoid my session getting killed by network errors during a long-running benchmark?

Use byobu. Typing byobu starts a new session; run your command inside it. The session will still be there when you come back, even if your network connection drops.

How do I generate partitioned text tables ?

After generating the raw data, use the following command:

hive -i settings.hql -f ddl/createAllTextTables.hql -hiveconf TEXTDBNAME=tpcds_text -hiveconf SOURCE=tpcds

This will create the tpcds_text database with all the tables in text format.

How do I generate Parquet data?

After generating the raw data, use the following command:

hive -i settings.hql -f ddl/createAllParquetTables.hql -hiveconf PARQUETDBNAME=tpcds_pqt -hiveconf SOURCE=tpcds

This will create the tpcds_pqt database with all the tables in Parquet format.

How do I run the queries with Spark?

The Spark Thrift Server listens on port 10002, whereas the Hive Thrift Server listens on 10001, so adjust the connection URL accordingly. For example, to run all the queries 10 times with Spark:

for f in queries/*.sql; do for i in {1..10} ; do STARTTIME="`date +%s`";  beeline -u "jdbc:hive2://`hostname -f`:10002/tpcds_orc;transportMode=http" -i sparksettings.hql -f $f  > $f.run_$i.out 2>&1 ; SUCCESS=$? ; ENDTIME="`date +%s`"; echo "$f,$i,$SUCCESS,$STARTTIME,$ENDTIME,$(($ENDTIME-$STARTTIME))" >> times_orc.csv; done; done;

How do I run the queries with Presto?

presto --schema tpcds_orc -f queries/query12.sql

You can run all the queries 10 times with Presto using the following command:

for f in queries/*.sql; do for i in {1..10} ; do STARTTIME="`date +%s`"; presto --schema tpcds_orc -f $f  > $f.run_$i.out 2>&1 ; SUCCESS=$? ; ENDTIME="`date +%s`"; echo "$f,$i,$SUCCESS,$STARTTIME,$ENDTIME,$(($ENDTIME-$STARTTIME))" >> times_orc.csv; done; done;

tpcds-hdinsight's People

Contributors

ashishthaps, dharmeshkakadia

tpcds-hdinsight's Issues

if using ADLS Gen2

beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f TPCDSDataGen.hql -hiveconf SCALE=10 -hiveconf PARTS=10 -hiveconf LOCATION=/HiveTPCDS/ -hiveconf TPCHBIN=`grep -A 1 "fs.defaultFS" /etc/hadoop/conf/core-site.xml | grep -o "wasb[^<]*"`/tmp/resources

wasb has to be replaced with abfs

Translate CTAS statements to allow LOCATION clause

I don't think Hive supports using CTAS with a LOCATION clause, meaning that statements like this can't be used to create ORC tables on secondary storage.

Is there a way to modify these to not use CTAS and allow for these ORC tables to be created on secondary storage?

GLIBC_2.14 not found

Hi,

Can this be run on CentOS 6, or is it targeting only CentOS 7?
I am trying to run the performance benchmarks on CentOS 6 and hitting system package issues like 'GLIBC_2.14 not found':

./dsdgen: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./dsdgen)
INFO:__main__:command ./dsdgen -dir . -force Y -scale 5 -child 3 -parallel 5 failed. Retries remaining 2. Sleeping for 342 before trying again
./dsdgen: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./dsdgen)
INFO:__main__:command ./dsdgen -dir . -force Y -scale 5 -child 3 -parallel 5 failed. Retries remaining 1. Sleeping for 550 before trying again
./dsdgen: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./dsdgen)
INFO:__main__:command ./dsdgen -dir . -force Y -scale 5 -child 3 -parallel 5 failed. Retries remaining 0. Sleeping for 528 before trying again
INFO:__main__:All retries for ./dsdgen -dir . -force Y -scale 5 -child 3 -parallel 5 exhauseted. Failing the attempt
org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20003]: An error occurred when trying to close the Operator running your custom script.
        at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:572)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)

Thanks

Failed to run TPCDSDataGen.hql

This command failed with an error that ADD FILE couldn't find the local file.

/usr/bin/hive -i settings.hql -f TPCDSDataGen.hql -hiveconf SCALE=10 -hiveconf PARTS=10 -hiveconf LOCATION=/HiveTPCDS/ -hiveconf TPCHBIN=resources

It worked after copying the files to ADLS and specifying the ADLS path as below:

/usr/bin/hive -i settings.hql -f TPCDSDataGen.hql -hiveconf SCALE=10 -hiveconf PARTS=10 -hiveconf LOCATION=/HiveTPCDS/ -hiveconf TPCHBIN=abfs:///tpcds-hdinsight/resources

Can you use secondary storage to test?

I'm attempting to populate data to a secondary storage location that I am able to query using hdfs commands (ls, df, etc.)

Using the argument LOCATION=protocol://hostname/<HDFS directory>, is there any obvious reason this wouldn't work with this benchmark/data-generation tool?

Queries failing to terminate and hanging indefinitely

With the exception of query28, almost all queries are hanging indefinitely and failing to terminate.

Generally, they follow a sequence like this:


Logging initialized using configuration in file:/etc/hive/2.6.1.3-4/0/hive-log4j.properties
OK
Time taken: 1.934 seconds
Query ID = sshuser_20170905205055_3edb68ee-a90c-455f-9864-759a7fd148c0
Total jobs = 11
Execution log at: /tmp/sshuser/sshuser_20170905205055_3edb68ee-a90c-455f-9864-759a7fd148c0.log
2017-09-05 21:02:46	Starting to launch local task to process map join;	maximum memory = 523239424
2017-09-05 21:02:48	Dump the side-table for tag: 1 with group count: 29 into file: file:/tmp/sshuser/ac6b001d-4fb9-4c78-8564-5b6975f50d54/hive_2017-09-05_20-50-55_505_2073659822727453965-1/-local-10018/HashTable-Stage-20/MapJoin-mapfile41--.hashtable
2017-09-05 21:02:48	Uploaded 1 File to: file:/tmp/sshuser/ac6b001d-4fb9-4c78-8564-5b6975f50d54/hive_2017-09-05_20-50-55_505_2073659822727453965-1/-local-10018/HashTable-Stage-20/MapJoin-mapfile41--.hashtable (872 bytes)
2017-09-05 21:02:48	End of local task; Time Taken: 1.546 sec.

Following hive.log and logs for the individual tasks, there doesn't appear to be anything obviously wrong. No errors, etc.

As they terminate, hive.log shows

2017-09-05 21:02:43,091 INFO  [main]: ql.Driver (Driver.java:execute(1411)) - Starting command(queryId=sshuser_20170905205055_3edb68ee-a90c-455f-9864-759a7fd148c0): select
     i_item_id
    ,i_item_desc
    ,s_store_id
    ,s_store_name
    ,sum(ss_quantity)        as store_sales_quantity
    ,sum(sr_return_quantity) as store_returns_quantity
    ,sum(cs_quantity)        as catalog_sales_quantity
 from
    store_sales
   ,store_returns
   ,catalog_sales
   ,date_dim             d1
   ,date_dim             d2
   ,date_dim             d3
   ,store
   ,item
 where
     d1.d_moy               = 2
 and d1.d_year              = 2000
 and d1.d_date_sk           = ss_sold_date_sk
 and i_item_sk              = ss_item_sk
 and s_store_sk             = ss_store_sk
 and ss_customer_sk         = sr_customer_sk
 and ss_item_sk             = sr_item_sk
 and ss_ticket_number       = sr_ticket_number
 and sr_returned_date_sk    = d2.d_date_sk
 and d2.d_moy               between 2 and  2 + 3
 and d2.d_year              = 2000
 and sr_customer_sk         = cs_bill_customer_sk
 and sr_item_sk             = cs_item_sk
 and cs_sold_date_sk        = d3.d_date_sk
 and d3.d_year              in (2000,2000+1,2000+2)
 group by
    i_item_id
   ,i_item_desc
   ,s_store_id
   ,s_store_name
 order by
    i_item_id
   ,i_item_desc
   ,s_store_id
   ,s_store_name
 limit 100
2017-09-05 21:02:43,095 INFO  [main]: hooks.ATSHook (ATSHook.java:<init>(114)) - Created ATS Hook
2017-09-05 21:02:43,095 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(149)) - <PERFLOG method=PreHook.org.apache.hadoop.hive.ql.hooks.ATSHook from=org.apache.hadoop.hive.ql.Driver>
2017-09-05 21:02:43,096 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(177)) - </PERFLOG method=PreHook.org.apache.hadoop.hive.ql.hooks.ATSHook start=1504645363095 end=1504645363096 duration=1 from=org.apache.hadoop.hive.ql.Driver>
2017-09-05 21:02:43,097 INFO  [main]: ql.Driver (SessionState.java:printInfo(984)) - Query ID = sshuser_20170905205055_3edb68ee-a90c-455f-9864-759a7fd148c0
2017-09-05 21:02:43,104 INFO  [main]: ql.Driver (SessionState.java:printInfo(984)) - Total jobs = 11
2017-09-05 21:02:43,105 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(149)) - <PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver>
2017-09-05 21:02:43,107 INFO  [main]: ql.Driver (Driver.java:launchTask(1746)) - Starting task [Stage-29:MAPREDLOCAL] in serial mode
2017-09-05 21:02:43,108 INFO  [main]: mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(158)) - Generating plan file file:/tmp/sshuser/ac6b001d-4fb9-4c78-8564-5b6975f50d54/hive_2017-09-05_20-50-55_505_2073659822727453965-1/-local-10028/plan.xml
2017-09-05 21:02:43,110 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(149)) - <PERFLOG method=serializePlan from=org.apache.hadoop.hive.ql.exec.Utilities>
2017-09-05 21:02:43,110 INFO  [main]: exec.Utilities (Utilities.java:serializePlan(1035)) - Serializing MapredLocalWork via kryo
2017-09-05 21:02:43,121 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(177)) - </PERFLOG method=serializePlan start=1504645363110 end=1504645363121 duration=11 from=org.apache.hadoop.hive.ql.exec.Utilities>
2017-09-05 21:02:43,183 INFO  [ATS Logger 0]: hooks.ATSHook (ATSHook.java:createPreHookEvent(302)) - Received pre-hook notification for :sshuser_20170905205055_3edb68ee-a90c-455f-9864-759a7fd148c0
2017-09-05 21:02:43,489 INFO  [main]: mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(287)) - Executing: /usr/hdp/2.6.1.3-4/hadoop/bin/hadoop jar /usr/hdp/2.6.1.3-4/hive/lib/hive-exec-1.2.1000.2.6.1.3-4.jar org.apache.hadoop.hive.ql.exec.mr.ExecDriver -localtask -plan file:/tmp/sshuser/ac6b001d-4fb9-4c78-8564-5b6975f50d54/hive_2017-09-05_20-50-55_505_2073659822727453965-1/-local-10028/plan.xml   -jobconffile file:/tmp/sshuser/ac6b001d-4fb9-4c78-8564-5b6975f50d54/hive_2017-09-05_20-50-55_505_2073659822727453965-1/-local-10029/jobconf.xml

Nothing appears out of place, except the query is being scheduled as a local task.

This is using the MapReduce execution engine and a secondary storage location. I'm unaware of any interaction there that would cause this behavior.

Additionally, simple queries/select statements seem to work fine.

Any ideas?

SQL query typos

All the queries with "interval #num days" will fail. "days" is a typo, it should be "day".

Error after running initial datagen

I receive the following errors after the initial data generation:

OK
DBGEN2 Population Generator (Version 1.4.0) NULL
Copyright Transaction Processing Performance Council (TPC) 2001 - 2015 NULL
log4j:ERROR Could not find value for key log4j.appender.CLA NULL
log4j:ERROR Could not instantiate appender named "CLA". NULL
log4j:ERROR Could not find value for key log4j.appender.CLA NULL
log4j:ERROR Could not instantiate appender named "CLA". NULL
log4j:ERROR Could not find value for key log4j.appender.CLA NULL
log4j:ERROR Could not instantiate appender named "CLA". NULL
log4j:ERROR Could not find value for key log4j.appender.CLA NULL
log4j:ERROR Could not instantiate appender named "CLA". NULL
log4j:ERROR Could not find value for key log4j.appender.CLA NULL
log4j:ERROR Could not instantiate appender named "CLA". NULL
log4j:ERROR Could not find value for key log4j.appender.CLA NULL

Getting constant Vertex failed error with ORC file creation

I am constantly getting vertex-failed errors with both 100-node Hive and LLAP clusters. My data size is 100 TB. Do you have any recommended settings for these cluster sizes? I have played with some settings, but they have not helped, and the error persists when I run the script below for certain tables.

beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/createAllORCTables.hql -hiveconf ORCDBNAME=tpcds_orc -hiveconf SOURCE=tpcds

ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1556442545309_0012_5_00, diagnostics=[Task failed, taskId=task_1556442545309_0012_5_00_000216, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : java.lang.OutOfMemoryError: Java heap space

TPCDSDataGen.hql not terminating once data is generated

For SCALE=2 (the scale factor doesn't matter; this is just an example, and the behavior is the same for SCALE=10 or SCALE=100), we see the following files on storage:

sshuser@hn0-scojhd:testnfs/withTezHiveTPCDS $ du -s * | cut -d " " -f 1
8	call_center
1608	catalog_page
42364	catalog_returns
583512	catalog_sales
18720	customer
7788	customer_address
39044	customer_demographics
10124	date_dim
160	household_demographics
8	income_band
169004	inventory
7188	item
48	promotion
8	reason
8	ship_mode
12	store
64872	store_returns
768672	store_sales
5016	time_dim
8	warehouse
12	web_page
19424	web_returns
289972	web_sales
16	web_site

However, the hive job gets stuck with 1-2 reducers still running:

sshuser@hn0-scojhd:~/tpcds-hdinsight ‹master*›$ hive -i settings.hql -f TPCDSDataGen.hql -hiveconf SCALE=2 -hiveconf PARTS=2 -hiveconf LOCATION=nfs://172.16.2.84:2049/withTezHiveTPCDS -hiveconf TPCHBIN=resources

Logging initialized using configuration in file:/etc/hive/2.6.1.3-4/0/hive-log4j.properties
Added resources: [resources/dsdgen]
Added resources: [resources/tpcds.idx]
Added resources: [resources/sequenceGenerator.py]
Added resources: [resources/TPCDSgen.py]
Query ID = sshuser_20170907212652_6bda2a22-8a76-453b-9e02-852c19ac10d6
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1504732778260_0013)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       1       0
Reducer 2 .....      RUNNING     20         19        1        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/02  [========================>>--] 95%   ELAPSED TIME: 592.40 s
--------------------------------------------------------------------------------

This will go on indefinitely until sending SIGINT to the hive process.
