databricks / spark-sql-perf
License: Apache License 2.0
According to the README, the data is on local disk; is there any way to put it into external storage, e.g. object storage?
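For what it's worth, genData only takes a location string, so pointing it at object storage is mostly a matter of passing a different URI. A minimal sketch, assuming an S3 bucket reachable through the s3a connector (the bucket name and the TPCDSTables arguments are illustrative, not from the README):
import com.databricks.spark.sql.perf.tpcds.TPCDSTables
val tables = new TPCDSTables(sqlContext, "/opt/tpcds-kit/tools", "1", false, false)
// Any Hadoop-compatible filesystem URI works here; s3a://my-bucket is a placeholder.
tables.genData("s3a://my-bucket/tpcds-data", "parquet", true, true, true, false, "", 100)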
The TPC-DS schemas differ between spark-sql-perf's TPCDSTables and spark-master/branch-3.1's TPCDSBase (string vs. char/varchar). For example:
// spark
"reason" ->
"""
|`r_reason_sk` INT,
|`r_reason_id` CHAR(16),
|`r_reason_desc` CHAR(100)
""".stripMargin,
// spark-sql-perf
Table("reason",
partitionColumns = Nil,
'r_reason_sk .int,
'r_reason_id .string,
'r_reason_desc .string),
To generate TPC-DS table data for Spark (master/branch-3.1), it would be nice to use CHAR/VARCHAR types in TPCDSTables.
NOTE: This ticket comes from apache/spark#31886
Hi experts @davies
I am using spark-sql-perf to generate TPC-DS 1TB data with partitioned tables enabled, i.e. tables.genData("hdfs://ip:8020/tpctest", "parquet", true, true, false, false, false). But I found that some of the big tables (e.g., store_sales) are slow to complete (about 3 hours on 4 slave nodes). I observed that all the data is first written to /tpcds_1t/store_sales/_temporary/0 and then moved to /tpcds_1t/store_sales on HDFS, and these 'moves' on HDFS take a long time to complete. Has anyone run into the same issue? How can it be resolved? BTW, we use the TPC-DS kit from https://github.com/davies/tpcds-kit
Thanks in advance !
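A common mitigation for the slow _temporary rename phase, offered as an assumption rather than a confirmed fix for this setup, is the v2 file output committer, which moves task output into place at task commit instead of renaming everything at job commit:
// Set on the driver before calling genData; tasks inherit this Hadoop configuration.
sqlContext.sparkContext.hadoopConfiguration.set(
  "mapreduce.fileoutputcommitter.algorithm.version", "2")
Whether this is safe depends on the storage system's rename semantics; on HDFS it mainly shifts the rename cost from job commit to task commit.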
I am new to the spark-sql-perf test tool and have a few questions about it. Hopefully you can answer them. Thanks in advance!
I deployed a Spark 1.6.2 cluster. Can I use spark-sql-perf v0.4.3 to run tests against it? I see the Spark version is set to 2.0.0 in build.sbt in v0.4.3.
How can I run one specific query (e.g., q19) in spark-shell? I only see the instructions below for running all of the queries:
import com.databricks.spark.sql.perf.tpcds.TPCDS
val tpcds = new TPCDS (sqlContext = sqlContext)
val experiment = tpcds.runExperiment(tpcds.tpcds1_4Queries)
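One way to run a single query, sketched on the assumption that query names follow the "q19" style used elsewhere on this page, is to filter the query list before running the experiment:
// Keep only the queries whose name starts with "q19".
val q19 = tpcds.tpcds1_4Queries.filter(_.name.startsWith("q19"))
val experiment = tpcds.runExperiment(q19)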
Looking at the most recent Travis build we see:
16/03/30 19:02:44 INFO SessionState: Created HDFS directory: file:/tmp/scratch-362b2e98-2401-4066-9a73-06f6dbbc3454/travis/447f0c84-1866-45d9-aeca-53fb780ef096/_tmp_space.db
Internal error when running tests: java.lang.OutOfMemoryError: PermGen space
Exception in thread "Thread-1" java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2598)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at sbt.React.react(ForkTests.scala:114)
at sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:74)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Thread-5" java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2598)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.scalatest.tools.Framework$ScalaTestRunner$Skeleton$1$React.react(Framework.scala:945)
at org.scalatest.tools.Framework$ScalaTestRunner$Skeleton$1.run(Framework.scala:934)
at java.lang.Thread.run(Thread.java:745)
[info] Run completed in 35 seconds, 691 milliseconds.
[info] Total number of tests run: 0
[info] Suites: completed 0, aborted 0
[info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
[info] No tests were executed.
[success] Total time: 193 s, completed Mar 30, 2016 7:02:57 PM
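Since the aborting error above is java.lang.OutOfMemoryError: PermGen space in the forked test JVM, a plausible build.sbt tweak, sketched for pre-Java-8 JVMs rather than verified against this Travis job, is to raise PermGen for forked tests:
// Fork the tests into their own JVM and give that JVM more PermGen space.
fork in Test := true
javaOptions in Test += "-XX:MaxPermSize=512m"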
I ran into this error:
Caused by: java.lang.RuntimeException: Could not find dsdgen at /home/dcos/tpcds-kit-master/tools/dsdgen or //home/dcos/tpcds-kit-master/tools/dsdgen. Run install
I have run 'make' under /home/dcos/tpcds-kit-master/tools/
My Scala code is:
val tables = new Tables(sqlContext, "/home/dcos/tpcds-kit-master/tools", 1)
tables.genData("hdfs://master1:9000/tpctest", "parquet", true, false, false, false, false)
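Since dsdgen is invoked inside Spark tasks, it has to exist at that same path on every worker node, not just on the machine running the driver (another report further down this page makes the same point). A quick hedged check from the shell:
// Returns true only if dsdgen is executable at this path on every node that runs a task.
val ok = sc.parallelize(1 to 100).map { _ =>
  new java.io.File("/home/dcos/tpcds-kit-master/tools/dsdgen").canExecute
}.reduce(_ && _)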
Hi,
on Spark 2.3.2:
scala> tables.genData(
| location = rootDir,
| format = format,
| overwrite = true, // overwrite the data that is already there
| partitionTables = true, // create the partitioned fact tables
| clusterByPartitionColumns = true, // shuffle to get partitions coalesced into single files.
| filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value
| tableFilter = "", // "" means generate all tables
| numPartitions = 1)
org.apache.spark.sql.AnalysisException: cannot resolve '`cs_sold_date_sk`' given input columns: [catalog_sales_text.value]; line 8 pos 2;
'RepartitionByExpression ['cs_sold_date_sk], 200
+- Project [value#64]
+- SubqueryAlias catalog_sales_text
+- LogicalRDD [value#64], false
Hi,
I am facing the issues below when trying to run this code. For this command:
WARN TaskSetManager: Stage 268 contains a task of very large size (331 KB). The maximum recommended task size is 100 KB.
// It takes a very long time here. When I press Enter, it returns to the scala> prompt.
error: value createResultsTable is not a member of com.databricks.spark.sql.perf.tpcds.TPCDS
Spark 2.0.1:
16/11/02 15:28:45 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 172.16.90.5): java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 0
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, cs_sold_date_sk), StringType), true) AS cs_sold_date_sk#0
(the same encoder expression tree is repeated for cs_sold_time_sk#1 and the remaining catalog_sales columns)
Is there a simple starter script to begin using the repository? I am new to Scala, so any guidance will be really helpful.
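A minimal end-to-end starter, assembled from the snippets elsewhere on this page (the paths, scale factor, and database name are placeholders to adapt):
import com.databricks.spark.sql.perf.tpcds.{TPCDS, TPCDSTables}
// Generate scale-factor-1 TPC-DS data with dsdgen from a tpcds-kit build.
val tables = new TPCDSTables(sqlContext, "/opt/tpcds-kit/tools", "1", false, false)
tables.genData("hdfs://master:9000/tpcds-data", "parquet", true, true, true, false, "", 100)
// Register the generated files as external tables, then run the 2.4 query set once.
sql("create database tpcds")
tables.createExternalTables("hdfs://master:9000/tpcds-data", "parquet", "tpcds", true, true)
sql("use tpcds")
val tpcds = new TPCDS(sqlContext = sqlContext)
val experiment = tpcds.runExperiment(tpcds.tpcds2_4Queries, iterations = 1)
experiment.waitForFinish(60 * 60)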
Hello,
I got the following error when trying to build the spark-sql-perf package for Spark 2.1.0. Has anyone seen this before, and any idea how to fix it? Thanks.
Steps:
git clone https://github.com/databricks/spark-sql-perf.git
cd spark-sql-perf
edit build.sbt to change sparkVersion := "2.1.0"
./build/sbt clean package
Compilation errors:
spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/Query.scala:89: Cannot prove that (org.apache.spark.sql.catalyst.trees.TreeNode[_$6], Int) forSome { type _$6 } <:< (T, U).
[error] val indexMap = physicalOperators.map { case (index, op) => (op, index) }.toMap
[error] ^
[error] spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/Query.scala:97: value execute is not a member of org.apache.spark.sql.catalyst.trees.TreeNode[_$6]
[error] newNode.execute().foreach((row: Any) => Unit)
[error] ^
I am trying to generate data for TPC-DS. The data is generated, but analyzing the tables throws an exception.
This is the log: it starts analyzing the tables but fails on the first table.
Analyzing table catalog_sales.
Analyzing table catalog_sales columns cs_sold_date_sk, cs_sold_time_sk, cs_ship_date_sk, cs_bill_customer_sk, cs_bill_cdemo_sk, cs_bill_hdemo_sk, cs_bill_addr_sk, cs_ship_customer_sk, cs_ship_cdemo_sk, cs_ship_hdemo_sk, cs_ship_addr_sk, cs_call_center_sk, cs_catalog_page_sk, cs_ship_mode_sk, cs_warehouse_sk, cs_item_sk, cs_promo_sk, cs_order_number, cs_quantity, cs_wholesale_cost, cs_list_price, cs_sales_price, cs_ext_discount_amt, cs_ext_sales_price, cs_ext_wholesale_cost, cs_ext_list_price, cs_ext_tax, cs_coupon_amt, cs_ext_ship_cost, cs_net_paid, cs_net_paid_inc_tax, cs_net_paid_inc_ship, cs_net_paid_inc_ship_tax, cs_net_profit.
org.apache.spark.sql.AnalysisException: Column cs_sold_date_sk does not exist.
at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.$anonfun$getColumnsToAnalyze$3(AnalyzeColumnCommand.scala:90)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.$anonfun$getColumnsToAnalyze$1(AnalyzeColumnCommand.scala:90)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.getColumnsToAnalyze(AnalyzeColumnCommand.scala:88)
at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.analyzeColumnInCatalog(AnalyzeColumnCommand.scala:116)
at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:51)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:234)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3702)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:249)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:836)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:199)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3700)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:234)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:104)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:836)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:101)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:671)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:836)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:666)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:672)
at com.databricks.spark.sql.perf.Tables$Table.analyzeTable(Tables.scala:280)
at com.databricks.spark.sql.perf.Tables.$anonfun$analyzeTables$2(Tables.scala:352)
at com.databricks.spark.sql.perf.Tables.$anonfun$analyzeTables$2$adapted(Tables.scala:351)
at scala.collection.immutable.List.foreach(List.scala:392)
at com.databricks.spark.sql.perf.Tables.analyzeTables(Tables.scala:351)
at line533d73f9645742d2b1ac0c523afb366f27.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-4211249281701897:48)
at line533d73f9645742d2b1ac0c523afb366f27.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-4211249281701897:100)
at line533d73f9645742d2b1ac0c523afb366f27.$read$$iw$$iw$$iw$$iw.<init>(command-4211249281701897:102)
at line533d73f9645742d2b1ac0c523afb366f27.$read$$iw$$iw$$iw.<init>(command-4211249281701897:104)
at line533d73f9645742d2b1ac0c523afb366f27.$read$$iw$$iw.<init>(command-4211249281701897:106)
at line533d73f9645742d2b1ac0c523afb366f27.$read$$iw.<init>(command-4211249281701897:108)
at line533d73f9645742d2b1ac0c523afb366f27.$read.<init>(command-4211249281701897:110)
at line533d73f9645742d2b1ac0c523afb366f27.$read$.<init>(command-4211249281701897:114)
at line533d73f9645742d2b1ac0c523afb366f27.$read$.<clinit>(command-4211249281701897)
at line533d73f9645742d2b1ac0c523afb366f27.$eval$.$print$lzycompute(:7)
at line533d73f9645742d2b1ac0c523afb366f27.$eval$.$print(:6)
at line533d73f9645742d2b1ac0c523afb366f27.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:745)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1021)
at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:574)
at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:41)
at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:37)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:600)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:570)
at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:219)
at com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$repl$1(ScalaDriverLocal.scala:204)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:773)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:726)
at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:204)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$10(DriverLocal.scala:431)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:239)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:234)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:231)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:48)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:276)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:269)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:48)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:408)
at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:653)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:645)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:486)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:598)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:391)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748)
Build errors with Spark 3.0.0.
build.sbt snippet
...
crossScalaVersions := Seq("2.11.12", "2.12.10")
...
sparkVersion := "3.0.0"
[info] Compiling 66 Scala sources to /opt/spark-sql-perf/target/scala-2.12/classes...
[info] 'compiler-interface' not yet compiled for Scala 2.12.10. Compiling...
[info] Compilation completed in 18.491 s
[error] /opt/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/Benchmark.scala:338: value table is not a member of Seq[String]
[error] case UnresolvedRelation(t) => t.table
[error] ^
[error] /opt/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/Query.scala:59: value table is not a member of Seq[String]
[error] tableIdentifier.table
[error] ^
[error] /opt/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/Query.scala:94: missing argument list for method simpleString in class QueryPlan
[error] Unapplied methods are only converted to functions when a function type is expected.
[error] You can make this conversion explicit by writing simpleString _ or simpleString(_) instead of simpleString.
[error] messages += s"Breakdown: ${node.simpleString}"
[error] ^
[error] /opt/spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/Query.scala:107: missing argument list for method simpleString in class QueryPlan
[error] Unapplied methods are only converted to functions when a function type is expected.
[error] You can make this conversion explicit by writing simpleString _ or simpleString(_) instead of simpleString.
[error] node.simpleString.replaceAll("#\\d+", ""),
[error] ^
[error] /opt/spark-sql-perf/src/main/scala/org/apache/spark/ml/ModelBuilderSSP.scala:62: not enough arguments for constructor NaiveBayesModel: (uid: String, pi: org.apache.spark.ml.linalg.Vector, theta: org.apache.spark.ml.linalg.Matrix, sigma: org.apache.spark.ml.linalg.Matrix)org.apache.spark.ml.classification.NaiveBayesModel.
[error] Unspecified value parameter sigma.
[error] val model = new NaiveBayesModel("naivebayes-uid", pi, theta)
[error] ^
[error] /opt/spark-sql-perf/src/main/scala/org/apache/spark/ml/ModelBuilderSSP.scala:163: not enough arguments for method getCalculator: (impurity: String, stats: Array[Double], rawCount: Long)org.apache.spark.mllib.tree.impurity.ImpurityCalculator.
[error] Unspecified value parameter rawCount.
[error] ImpurityCalculator.getCalculator("variance", Array.fillDouble(0.0))
[error] ^
[error] /opt/spark-sql-perf/src/main/scala/org/apache/spark/ml/ModelBuilderSSP.scala:165: not enough arguments for method getCalculator: (impurity: String, stats: Array[Double], rawCount: Long)org.apache.spark.mllib.tree.impurity.ImpurityCalculator.
[error] Unspecified value parameter rawCount.
[error] ImpurityCalculator.getCalculator("gini", Array.fillDouble(0.0))
[error] ^
[error] 7 errors found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 178 s, completed Aug 12, 2020, 4:00:12 AM
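For what it's worth, these all look like Spark 3.0 API changes rather than repo-specific breakage. Hedged sketches of the kind of source edits involved (not a tested patch; the empty sigma matrix and the 0L rawCount below are guesses):
// Benchmark.scala:338: UnresolvedRelation now carries a multipart identifier (Seq[String]).
case UnresolvedRelation(t) => t.last
// Query.scala:94/107: simpleString takes a maxFields argument in 3.0.
messages += s"Breakdown: ${node.simpleString(Int.MaxValue)}"
// ModelBuilderSSP.scala:62: NaiveBayesModel gained a sigma parameter (only used by the gaussian variant).
val model = new NaiveBayesModel("naivebayes-uid", pi, theta, Matrices.zeros(0, 0))
// ModelBuilderSSP.scala:163/165: getCalculator gained a rawCount parameter.
ImpurityCalculator.getCalculator("variance", stats, 0L)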
I split data generation and querying into two jar files: first generate the data, then query it in parallel. Most tasks take about 20 ms, but some take 700 ms. The reason is that those tasks access keys (files) that do not exist. Why do the query tasks access files that were never generated? (See also the note after the query code below.)
Below is part of my program.
Data generation code:
val tables = new TPCDSTables(sqlContext,
dsdgenDir = "/opt/tpcds-kit/tools", // location of dsdgen
scaleFactor = 70,
useDoubleForDecimal = false, // true to replace DecimalType with DoubleType
useStringForDate = false) // true to replace DateType with StringType
tables.genData(
location = rootDir,
format = format,
overwrite = true, // overwrite the data that is already there
partitionTables = true, // create the partitioned fact tables
clusterByPartitionColumns = true, // shuffle to get partitions coalesced into single files.
filterOutNullPartitionValues = false, // true to filter out the partition with NULL key value
tableFilter = "", // "" means generate all tables
numPartitions = 100) // how many dsdgen part
Query code:
val tables = new TPCDSTables(sqlContext,
dsdgenDir = "/opt/tpcds-kit/tools", // location of dsdgen
scaleFactor = 70,
useDoubleForDecimal = false, // true to replace DecimalType with DoubleType
useStringForDate = false) // true to replace DateType with StringType
// Create the specified database
sql(s"create database $databaseName")
// Create metastore tables in a specified database for your data.
// Once tables are created, the current database will be switched to the specified database.
tables.createExternalTables(rootDir, "parquet", databaseName, overwrite = true, discoverPartitions = false)
val tpcds = new TPCDS (sqlContext = sqlContext)
// Set:
val resultLocation = args(3) // place to write results
val iterations = 1 // how many iterations of queries to run.
//val queries = tpcds.tpcds2_4Queries // queries to run.
val timeout = 24 * 60 * 60 // timeout, in seconds.
def queries = {
if (args(4) == "all") {
tpcds.tpcds2_4Queries
} else {
val qa = args(4).split(",", 0)
val qs = qa.toSeq
tpcds.tpcds2_4Queries.filter(q => {
qs.contains(q.name)
})
}
}
println(queries.size)
sql(s"use $databaseName")
val experiment = tpcds.runExperiment(
queries,
iterations = iterations,
resultLocation = resultLocation,
forkThread = true)
experiment.waitForFinish(timeout)
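One thing worth checking, offered as an assumption rather than a confirmed diagnosis: the data is generated with partitionTables = true, but createExternalTables is called with discoverPartitions = false, so the metastore may have an incomplete view of the partition directories. A sketch of the alternative, matching the usage elsewhere on this page:
tables.createExternalTables(rootDir, "parquet", databaseName, overwrite = true, discoverPartitions = true)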
Spark 2.2.0:
Caused by: java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 0
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, cs_sold_date_sk), StringType), true) AS cs_sold_date_sk#939
(the same encoder expression is repeated for cs_sold_time_sk#940 through cs_net_profit#972)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:573)
at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:573)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:173)
at org.apache.spark.sql.Row$class.isNullAt(Row.scala:191)
at org.apache.spark.sql.catalyst.expressions.GenericRow.isNullAt(rows.scala:165)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfCondExpr$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
... 16 more
I ran ./bin/run --benchmark DatasetPerformance
I can see the Spark web UI at port 8080, but the terminal shows output like the following.
Also, do I need to start Spark first?
[error] 16/02/01 22:03:37 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 4185 ms on localhost (1/8)
[error] 16/02/01 22:03:37 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 4210 ms on localhost (2/8)
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 4193 ms on localhost (3/8)
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 4193 ms on localhost (4/8)
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 4195 ms on localhost (5/8)
[error] 16/02/01 22:03:37 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 4203 ms on localhost (6/8)
[error] 16/02/01 22:03:37 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 4249 ms on localhost (7/8)
[error] 16/02/01 22:03:37 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 915 bytes result sent to driver
[error] 16/02/01 22:03:37 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 4256 ms on localhost (8/8)
[error] 16/02/01 22:03:37 INFO DAGScheduler: ResultStage 0 (foreach at Benchmark.scala:604) finished in 4.287 s
[error] 16/02/01 22:03:37 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
[error] 16/02/01 22:03:37 INFO DAGScheduler: Job 0 finished: foreach at Benchmark.scala:604, took 4.463525 s
I'm new to Scala and Spark, and got overwhelmed by build errors due to unresolved dependencies like:
[warn] Note: Unresolved dependencies path:
[error] sbt.librarymanagement.ResolveException: Error downloading com.databricks:sbt-databricks;sbtVersion=1.0;scalaVersion=2.12:0.1.3
[error] Not found
[error] Not found
[error] not found: https://repo1.maven.org/maven2/com/databricks/sbt-databricks_2.12_1.0/0.1.3/sbt-databricks-0.1.3.pom
[error] not found: https://dl.bintray.com/spark-packages/maven/com/databricks/sbt-databricks_2.12_1.0/0.1.3/sbt-databricks-0.1.3.pom
[error] not found: https://oss.sonatype.org/content/repositories/releases/com/databricks/sbt-databricks_2.12_1.0/0.1.3/sbt-databricks-0.1.3.pom
[error] not found: https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.databricks/sbt-databricks/scala_2.12/sbt_1.0/0.1.3/ivys/ivy.xml
[error] not found: https://repo.typesafe.com/typesafe/ivy-releases/com.databricks/sbt-databricks/scala_2.12/sbt_1.0/0.1.3/ivys/ivy.xml
[error] Error downloading com.jsuereth:sbt-pgp;sbtVersion=1.0;scalaVersion=2.12:1.0.0
Any help?
Which TPC-DS queries are supposed to be working? I'm finding that several queries are not parseable, for a variety of reasons. Specifically, the following queries compile without exception:
Set(q25, q65, q50, q88, q64, q99, q46, q76, q97, q19, q17, q57, q75, q49, q90, q89, q78, q63, qSsMax, q71, q4, q93, q74, q34, q28, q82, q85, q96, q51, q40, q13, q62, q79, q59, q29, q48, q37, q52, q73, q68, q3, q15, q55, q39b, q84, q72, q31, q53, q7, q39a, q43, q61, q42, q11, q26, q21, q47, q91)
The following queries throw exceptions when compiling:
Set(q77, q60, q32, q69, q14b, q86, q98, q6, q35, q83, q66, q87, q94, q23b, q8, q27, q5, q38, q14a, q24b, q67, q56, q16, q45, q23a, q18, q12, q81, q92, q41, q33, q70, q24a, q30, q95, q22, q20, q44, q10, q1, q36, q9, q58, q80, q2, q54)
I've attached detailed output with the various exceptions that are thrown.
queries.txt
The code I used to produce this, from the shell, is:
import java.io._
import com.databricks.spark.sql.perf.tpcds.TPCDS
val tpcds = new TPCDS (sqlContext = sqlContext)
val writer = new PrintWriter(new File("queries.txt"))
tpcds.tpcds1_4QueriesMap.keys.foreach { k =>
  try {
    tpcds.tpcds1_4QueriesMap(k).tablesInvolved
    writer.write(k + ": success\n\n")
  } catch {
    case e: Exception => writer.write(k + ": failed, reason:\n" + e + "\n\n")
  }
}
Is this expected? Since there are several queries split over multiple files (Simple, ImpalaKit, and TPCDS_1_4), are there a set of queries that are considered "supported" right now?
When I try to set up the TPC-DS dataset on a cluster, I get an error that Spark is not able to infer the parquet schema. This happens only in cluster mode; in local mode the setup finishes successfully.
I have installed the tpcds-kit on all nodes under the same path, and the location of the data is the same as well.
Specifically, I try:
./bin/spark-shell --jars /root/spark-sql-perf/target/scala-2.11/spark-sql-perf-assembly-0.5.0-SNAPSHOT.jar --master spark://master:7077
scala> import spark.sqlContext.implicits._
scala> import com.databricks.spark.sql.perf.tpcds.TPCDSTables
scala> val tables = new TPCDSTables(spark.sqlContext, "/tmp/tpcds-kit-src/tools", "1", false, false)
scala> tables.genData("/tmp/tpcds-data", "parquet", true, true, true, false, "", 100)
scala> sql("create database tpcds")
scala> tables.createExternalTables("/tmp/tpcds-data", "parquet", "tpcds", true, true)
Creating external table catalog_sales in database tpcds using data stored in /tmp/tpcds-data/catalog_sales.
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:182)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:182)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:181)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:77)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:142)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:139)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:120)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:121)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
at org.apache.spark.sql.internal.CatalogImpl.createTable(CatalogImpl.scala:352)
at org.apache.spark.sql.internal.CatalogImpl.createTable(CatalogImpl.scala:319)
at org.apache.spark.sql.internal.CatalogImpl.createTable(CatalogImpl.scala:302)
at org.apache.spark.sql.SQLContext.createExternalTable(SQLContext.scala:544)
at com.databricks.spark.sql.perf.Tables$Table.createExternalTable(Tables.scala:242)
at com.databricks.spark.sql.perf.Tables$$anonfun$createExternalTables$1.apply(Tables.scala:311)
at com.databricks.spark.sql.perf.Tables$$anonfun$createExternalTables$1.apply(Tables.scala:309)
at scala.collection.immutable.List.foreach(List.scala:381)
at com.databricks.spark.sql.perf.Tables.createExternalTables(Tables.scala:309)
... 50 elided
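A hedged guess at the cause, not something the trace alone confirms: with --master spark://master:7077, a plain path like /tmp/tpcds-data resolves to each executor's local disk, so every node holds only its own slice of the parquet files, and the driver can find an empty or partial directory when it tries to infer the schema. Pointing genData and createExternalTables at shared storage avoids this (the HDFS URI below is a placeholder):
scala> tables.genData("hdfs://master:9000/tpcds-data", "parquet", true, true, true, false, "", 100)
scala> tables.createExternalTables("hdfs://master:9000/tpcds-data", "parquet", "tpcds", true, true)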
Hi @juliuszsompolski,
Can you add steps on how to run TPC-H testing? Some experiments still require TPC-H.
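A sketch of what this might look like, assuming the TPC-H classes in this repo mirror the TPC-DS ones (the class and parameter names below are not verified against the source):
import com.databricks.spark.sql.perf.tpch._
// dbgen plays the role dsdgen plays for TPC-DS; the path and scale factor are placeholders.
val tables = new TPCHTables(sqlContext, dbgenDir = "/opt/tpch-kit/dbgen", scaleFactor = "1")
tables.genData("hdfs://master:9000/tpch-data", "parquet", true, true, true, false, "", 100)
val tpch = new TPCH(sqlContext = sqlContext)
val experiment = tpch.runExperiment(tpch.queries)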
I'm running spark-sql-perf in a scenario where tables are cached in RAM for performance reasons. Code here.
The result is the BlockManager failing; see the trace below.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import com.databricks.spark.sql.perf.tpcds.Tables
import com.databricks.spark.sql.perf.tpcds.TPCDS
val tableNames = Array("call_center", "catalog_page",
"catalog_returns", "catalog_sales", "customer",
"customer_address", "customer_demographics", "date_dim",
"household_demographics", "income_band", "inventory", "item",
"promotion", "reason", "ship_mode", "store",
"store_returns", "store_sales", "time_dim", "warehouse",
"web_page", "web_returns", "web_sales", "web_site")
for (i <- 0 until tableNames.length) {
  sqlContext.read.parquet("hdfs://sparkc:9000/sqldata" + "/" + tableNames(i)).registerTempTable(tableNames(i))
  sqlContext.cacheTable(tableNames(i))
}
val tpcds = new TPCDS (sqlContext = sqlContext)
val experiment = tpcds.runExperiment(tpcds.tpcds1_4Queries, iterations = 1)
experiment.waitForFinish(60 * 60 * 10)
16/11/04 12:30:40 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 39.0 (TID 1545, 172.16.90.9): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_15_piece0 of broadcast_15
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1280)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:358)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryRelation.scala:151)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_15_piece0 of broadcast_15
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:146)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:125)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:125)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:186)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1273)
... 37 more
I am trying to run my benchmark as:
./bin/run -b DatasetPerformance -f RDD
which displays only the RDD-related benchmarks in the output, namely: average, back-to-back filters, back-to-back map, and range.
But is there a way to run only "RDD: back to back filters"?
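If -f is a name filter, which is an assumption based on how -f RDD behaves above rather than anything verified against the CLI, a more specific filter string might narrow the run to that one benchmark:
./bin/run -b DatasetPerformance -f "back to back filters"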
The IBM Spark Technology Center has gotten 86 of the 99 TPC-DS business queries to run successfully on Spark 1.4.1 and 1.5.0. The existing spark-sql-perf test kit (on GitHub from Databricks) contains a small subset of these. Could we add all of the following queries to the kit so the community can learn from and leverage them?
query01
query02
query03
query04
query05
query07
query08
query09
query11
query12
query13
query15
query16
query17
query20
query21
query22
query24a
query24b
query25
query26
query27
query28
query29
query30
query31
query32
query33
query34
query36
query37
query38
query39a
query39b
query40
query42
query43
query45
query46
query47
query48
query49
query50
query51
query52
query53
query56
query57
query58
query59
query60
query61
query62
query63
query66
query67
query68
query69
query72
query73
query74
query75
query76
query77
query78
query79
query80
query81
query82
query83
query84
query85
query86
query87
query88
query89
query90
query91
query92
query93
query94
query95
query96
query97
query98
query99
Please contact me for the working queries.
Jesse Chen
[email protected]
Hi experts,
I am using:
spark 1.6.1
scala 2.10.5
I get many errors when I run the following command:
./bin/run -benchmark DatasetPerformance
The error is:
[info] Compiling 40 Scala sources to /data/tpc-ds/spark-sql-perf-master/target/scala-2.10/classes...
[error] /data/tpc-ds/spark-sql-perf-master/src/main/scala/com/databricks/spark/sql/perf/AggregationPerformance.scala:17: type mismatch;
[error] found : org.apache.spark.sql.DataFrame
[error] required: org.apache.spark.sql.Dataset[_]
[error] }.toDF("a", "b"))
[error] ^
Can you help me?
Thanks a lot.
I've downloaded the TPC-DS package from tpc.org. However, it does not contain anything remotely resembling dsdgen.
Hello,
When using genData() to generate data for the store_sales table, the application crashes with the following stack trace (see next post).
The parameters used for genData are: HDFS folder, 'parquet', true, false, false, false, false.
Any suggestions?
Hi,
I am getting the following exception while running it. Please help. Thanks.
shuja@shuja:~/Desktop/data/spark-sql-perf-master$ bin/run --benchmark DatasetPerformance
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
[info] Loading project definition from /home/shuja/Desktop/data/spark-sql-perf-master/project
Missing bintray credentials /home/shuja/.bintray/.credentials. Some bintray features depend on this.
[info] Set current project to spark-sql-perf (in build file:/home/shuja/Desktop/data/spark-sql-perf-master/)
[warn] Credentials file /home/shuja/.bintray/.credentials does not exist
[info] Updating {file:/home/shuja/Desktop/data/spark-sql-perf-master/}spark-sql-perf-master...
[info] Resolving org.apache.spark#spark-sql_2.10;2.0.0-SNAPSHOT ...
[warn] module not found: org.apache.spark#spark-sql_2.10;2.0.0-SNAPSHOT
[warn] ==== local: tried
[warn] /home/shuja/.ivy2/local/org.apache.spark/spark-sql_2.10/2.0.0-SNAPSHOT/ivys/ivy.xml
[warn] ==== public: tried
[warn] https://repo1.maven.org/maven2/org/apache/spark/spark-sql_2.10/2.0.0-SNAPSHOT/spark-sql_2.10-2.0.0-SNAPSHOT.pom
[warn] ==== Spark Packages Repo: tried
[warn] https://dl.bintray.com/spark-packages/maven/org/apache/spark/spark-sql_2.10/2.0.0-SNAPSHOT/spark-sql_2.10-2.0.0-SNAPSHOT.pom
[warn] ==== apache-snapshots: tried
[warn] https://repository.apache.org/snapshots/org/apache/spark/spark-sql_2.10/2.0.0-SNAPSHOT/spark-sql_2.10-2.0.0-SNAPSHOT.pom
[info] Resolving org.apache.spark#spark-hive_2.10;2.0.0-SNAPSHOT ...
[warn] module not found: org.apache.spark#spark-hive_2.10;2.0.0-SNAPSHOT
[warn] ==== local: tried
[warn] /home/shuja/.ivy2/local/org.apache.spark/spark-hive_2.10/2.0.0-SNAPSHOT/ivys/ivy.xml
[warn] ==== public: tried
[warn] https://repo1.maven.org/maven2/org/apache/spark/spark-hive_2.10/2.0.0-SNAPSHOT/spark-hive_2.10-2.0.0-SNAPSHOT.pom
[warn] ==== Spark Packages Repo: tried
[warn] https://dl.bintray.com/spark-packages/maven/org/apache/spark/spark-hive_2.10/2.0.0-SNAPSHOT/spark-hive_2.10-2.0.0-SNAPSHOT.pom
[warn] ==== apache-snapshots: tried
[warn] https://repository.apache.org/snapshots/org/apache/spark/spark-hive_2.10/2.0.0-SNAPSHOT/spark-hive_2.10-2.0.0-SNAPSHOT.pom
[info] Resolving org.apache.spark#spark-mllib_2.10;2.0.0-SNAPSHOT ...
[warn] module not found: org.apache.spark#spark-mllib_2.10;2.0.0-SNAPSHOT
[warn] ==== local: tried
[warn] /home/shuja/.ivy2/local/org.apache.spark/spark-mllib_2.10/2.0.0-SNAPSHOT/ivys/ivy.xml
[warn] ==== public: tried
[warn] https://repo1.maven.org/maven2/org/apache/spark/spark-mllib_2.10/2.0.0-SNAPSHOT/spark-mllib_2.10-2.0.0-SNAPSHOT.pom
[warn] ==== Spark Packages Repo: tried
[warn] https://dl.bintray.com/spark-packages/maven/org/apache/spark/spark-mllib_2.10/2.0.0-SNAPSHOT/spark-mllib_2.10-2.0.0-SNAPSHOT.pom
[warn] ==== apache-snapshots: tried
[warn] https://repository.apache.org/snapshots/org/apache/spark/spark-mllib_2.10/2.0.0-SNAPSHOT/spark-mllib_2.10-2.0.0-SNAPSHOT.pom
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.spark#spark-sql_2.10;2.0.0-SNAPSHOT: not found
[warn] :: org.apache.spark#spark-hive_2.10;2.0.0-SNAPSHOT: not found
[warn] :: org.apache.spark#spark-mllib_2.10;2.0.0-SNAPSHOT: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] org.apache.spark:spark-sql_2.10:2.0.0-SNAPSHOT ((sbtsparkpackage.SparkPackagePlugin) SparkPackagePlugin.scala#L186)
[warn] +- com.databricks:spark-sql-perf_2.10:0.4.10-SNAPSHOT
[warn] org.apache.spark:spark-hive_2.10:2.0.0-SNAPSHOT ((sbtsparkpackage.SparkPackagePlugin) SparkPackagePlugin.scala#L186)
[warn] +- com.databricks:spark-sql-perf_2.10:0.4.10-SNAPSHOT
[warn] org.apache.spark:spark-mllib_2.10:2.0.0-SNAPSHOT ((sbtsparkpackage.SparkPackagePlugin) SparkPackagePlugin.scala#L186)
[warn] +- com.databricks:spark-sql-perf_2.10:0.4.10-SNAPSHOT
sbt.ResolveException: unresolved dependency: org.apache.spark#spark-sql_2.10;2.0.0-SNAPSHOT: not found
unresolved dependency: org.apache.spark#spark-hive_2.10;2.0.0-SNAPSHOT: not found
unresolved dependency: org.apache.spark#spark-mllib_2.10;2.0.0-SNAPSHOT: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:291)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:188)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:165)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:132)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
at sbt.IvySbt.withIvy(Ivy.scala:127)
at sbt.IvySbt.withIvy(Ivy.scala:124)
at sbt.IvySbt$Module.withModule(Ivy.scala:155)
at sbt.IvyActions$.updateEither(IvyActions.scala:165)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1369)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1365)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$87.apply(Defaults.scala:1399)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$87.apply(Defaults.scala:1397)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1402)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1396)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1419)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1348)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1310)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:235)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[error] sbt.ResolveException: unresolved dependency: org.apache.spark#spark-sql_2.10;2.0.0-SNAPSHOT: not found
[error] unresolved dependency: org.apache.spark#spark-hive_2.10;2.0.0-SNAPSHOT: not found
[error] unresolved dependency: org.apache.spark#spark-mllib_2.10;2.0.0-SNAPSHOT: not found
[error] Total time: 10 s, completed Aug 17, 2016 5:19:11 PM
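For context, the resolvers listed above only host released Spark artifacts; a 2.0.0-SNAPSHOT build exists only if you have published Spark to your local Ivy cache yourself. The usual fix is to build a release tag of spark-sql-perf, or to pin the build to a published Spark version. A build.sbt sketch, assuming the sbt-spark-package sparkVersion key that the plugin path in the warning points at (the exact version string is an assumption; pick one that exists on Maven Central):
// build.sbt sketch: depend on a published Spark release instead of a SNAPSHOT
sparkVersion := "2.0.0"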
Is it okay to configure executor-memory=1G?
I am running the tpcds.tpcds2_4Queries (q1-q99) against 100 GB of data on a Kubernetes cluster. I want to find the most suitable executor memory for these queries.
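There is no single right value; a practical approach is to sweep the setting and compare per-query timings. A sketch using the runExperiment/waitForFinish API from the README, assuming the tables are already registered (the timeout is illustrative); restart the shell with a different --executor-memory between sweeps:
import com.databricks.spark.sql.perf.tpcds.TPCDS

val tpcds = new TPCDS(sqlContext = sqlContext)
// One full pass over the 2.4 query set for the current executor-memory setting
val experiment = tpcds.runExperiment(tpcds.tpcds2_4Queries)
experiment.waitForFinish(60 * 60 * 10) // seconds; adjust to your run length
// Per-query timings land in the experiment's result location for comparison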
We are looking to validate whether the returned results of the TPC-DS queries are correct. We previously used IBM's TPC-DS benchmark suite, which includes a process for validating the correctness of the returned results. I tried to search for such a tool in this repo and could not find any... Did I miss anything? Any suggestions would be appreciated. Thanks!
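As far as I can tell there is no built-in answer-set validation in this repo; a common workaround is to diff each query's sorted output against a baseline you maintain yourself. A hypothetical sketch; the query and baseline paths below are placeholders, not files shipped with the benchmark:
import scala.io.Source

// Compare one query's sorted output to a saved baseline (paths are illustrative)
val queryText = Source.fromFile("/queries/q1.sql").mkString
val actual = sqlContext.sql(queryText).collect().map(_.mkString("|")).sorted
val expected = Source.fromFile("/baselines/q1.txt").getLines().toArray.sorted
assert(actual.sameElements(expected), "q1 output differs from baseline")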
Hi,
I'm using Spark 1.6.0 and I want to run the benchmark. However, I first need to set up the benchmark (I guess).
The tutorial says to execute these lines:
import com.databricks.spark.sql.perf.tpcds.Tables
val tables = new Tables(sqlContext, dsdgenDir, scaleFactor)
tables.genData(location, format, overwrite, partitionTables, useDoubleForDecimal, clusterByPartitionColumns, filterOutNullPartitionValues)
// Create metastore tables in a specified database for your data.
// Once tables are created, the current database will be switched to the specified database.
tables.createExternalTables(location, format, databaseName, overwrite)
// Or, if you want to create temporary tables
tables.createTemporaryTables(location, format)
// Setup TPC-DS experiment
import com.databricks.spark.sql.perf.tpcds.TPCDS
val tpcds = new TPCDS (sqlContext = sqlContext)
I understood that I have to start spark-shell first in order to run those lines, but when I do import com.databricks.spark.sql.perf.tpcds.Tables I get the error: error: object sql is not a member of package com.databricks.spark. Under com.databricks.spark there is only the avro package (I don't really know what that is).
Could you help me, please? Maybe I misunderstood something.
Thanks
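That error usually means the spark-sql-perf jar never made it onto the shell classpath (the avro package comes from a different Databricks library). A minimal sketch of the usual fix, assuming the jar was built with sbt package; the jar path, dsdgen directory, and scale factor below are placeholders:
// Start the shell with the benchmark jar on the classpath, e.g.:
//   spark-shell --jars target/scala-2.10/spark-sql-perf_2.10-<version>.jar
// Then the import should resolve:
import com.databricks.spark.sql.perf.tpcds.Tables
val tables = new Tables(sqlContext, "/path/to/dsdgen", 1)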
"bin/run --benchmark DatasetPerformance" shows too many errors.
log.txt
Hi,
First of all, thank you for your work developing this tool.
My environment: Cloudera CDH 5.6.0 + Spark 1.5.0.
I encountered this error when I tried to generate data using tables.genData():
scala> tables.genData("/hdfs2/ds", "text", true, true, true, true, true)
java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2570)
at java.lang.Class.getDeclaredMethod(Class.java:2002)
at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1431)
......
Previously I was just following the README.md; here's what I've done:
scala> import com.databricks.spark.sql.perf.tpcds.Tables
import com.databricks.spark.sql.perf.tpcds.Tables
scala> val tables = new Tables(sqlContext, "/root/tpc-ds/tools", 50)
tables: com.databricks.spark.sql.perf.tpcds.Tables = com.databricks.spark.sql.perf.tpcds.Tables@194765c4
I would like to know what could be going wrong here. Is it because I am using the Cloudera distribution of Spark, or because I am on version 1.5.0, which is not supported?
Thanks for your help in advance.
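For what it's worth, org.apache.spark.sql.Dataset first appeared in Spark 1.6.0, so a spark-sql-perf build compiled against 1.6 or later will throw exactly this NoClassDefFoundError on a 1.5.0 cluster, CDH or otherwise; use a release of the library built for your Spark version. A quick sanity check in the shell (nothing here is CDH-specific):
// The running Spark must be at least as new as the Spark the jar was built against
println(sc.version) // Dataset exists only from 1.6.0 onward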
The initial generation of data seems to fail:
15/09/07 23:54:23 INFO EventLoggingListener: Logging events to file:/tmp/spark-events/1/20150907-235419-84120842-5050-11875-0000
SELECT
ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk,ss_cdemo_sk,ss_hdemo_sk,ss_addr_sk,ss_store_sk,ss_promo_sk,ss_ticket_number,ss_quantity,ss_wholesale_cost,ss_list_price,ss_sales_price,ss_ext_discount_amt,ss_ext_sales_price,ss_ext_wholesale_cost,ss_ext_list_price,ss_ext_tax,ss_coupon_amt,ss_net_paid,ss_net_paid_inc_tax,ss_net_profit
FROM
store_sales_text
DISTRIBUTE BY
ss_sold_date_sk
Exception in thread "main" java.lang.RuntimeException: [6.12] failure: ``union'' expected but `by' found
DISTRIBUTE BY
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:115)
at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:166)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:166)
at org.apache.spark.sql.execution.datasources.DDLParser.parse(DDLParser.scala:42)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:189)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:719)
at com.databricks.spark.sql.perf.tpcds.Tables$Table.genData(Tables.scala:141)
at com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:204)
at com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:202)
at scala.collection.immutable.List.foreach(List.scala:318)
at com.databricks.spark.sql.perf.tpcds.Tables.genData(Tables.scala:202)
at SimpleApp$.tpc(submitApp.scala:110)
at SimpleApp$.main(submitApp.scala:147)
at SimpleApp.main(submitApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/09/07 23:54:26 INFO SparkContext: Invoking stop() from shutdown hook
15/09/07 23:54:26 INFO SparkUI: Stopped Spark web UI at http://10.141.3.5:4040
15/09/07 23:54:26 INFO DAGScheduler: Stopping DAGScheduler
I0907 23:54:26.484743 12297 sched.cpp:1286] Asked to stop the driver
I0907 23:54:26.484998 12275 sched.cpp:752] Stopping framework '20150907-235419-84120842-5050-11875-0000'
15/09/07 23:54:26 INFO MesosSchedulerBackend: driver.run() returned with code DRIVER_STOPPED
15/09/07 23:54:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/09/07 23:54:26 INFO MemoryStore: MemoryStore cleared
15/09/07 23:54:26 INFO BlockManager: BlockManager stopped
15/09/07 23:54:26 INFO BlockManagerMaster: BlockManagerMaster stopped
15/09/07 23:54:26 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/09/07 23:54:26 INFO SparkContext: Successfully stopped SparkContext
15/09/07 23:54:26 INFO ShutdownHookManager: Shutdown hook called
15/09/07 23:54:26 INFO ShutdownHookManager: Deleting directory /local/vdbogert/tmp1/spark-d25f80c9-46e9-45e9-a92a-bd29e766a1a7
Reservation 3423559 cancelled
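For what it's worth, the plain SQLContext parser in Spark 1.5 does not accept DISTRIBUTE BY, which the clustered data-generation path emits; Hive's parser does. A minimal sketch of the usual workaround, assuming a Hive-enabled Spark build (the dsdgen path, scale factor, and genData flags are placeholders):
import org.apache.spark.sql.hive.HiveContext
import com.databricks.spark.sql.perf.tpcds.Tables

// HiveContext understands DISTRIBUTE BY, so partitioned generation can parse
val hiveContext = new HiveContext(sc)
val tables = new Tables(hiveContext, "/path/to/dsdgen", 100)
tables.genData("hdfs:///tpcds", "parquet", true, true, false, true, false)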
Throughout the code I get the feeling that a pre-installed Hive installation is needed. Is this correct? When writing to an external table, I can see the Spark driver assumes the destination is a Hive database.
If it is indeed required, it would be important to note this in the README.md.
I'm confused about the difference between tpcds2_4Queries and tpcds1_4Queries. What distinguishes them?
My Spark version is 2.3.2 and my Scala version is 2.11.8.
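For reference: the suffixes track TPC-DS specification versions; tpcds1_4Queries holds the v1.4 query set and tpcds2_4Queries the v2.4 set, which revised a number of query templates. A quick way to compare the two sets side by side (a sketch; the name field comes from the Benchmark API):
import com.databricks.spark.sql.perf.tpcds.TPCDS

val tpcds = new TPCDS(sqlContext = sqlContext)
// Print the query names in each set to see where they diverge
tpcds.tpcds1_4Queries.map(_.name).foreach(println)
tpcds.tpcds2_4Queries.map(_.name).foreach(println)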
Hi, I'd like to contribute the changes below, and I'm looking for second opinions on whether this is the direction we want to take this benchmark.
I suggest the following options be added, and I'm happy to work on them:
-u for a URI so we can store our data in a database such as DB2 instead of the local file system (one concern: some database configurations would require passing a username and password to the benchmark, and we wouldn't want those ending up in someone's .bash_history)
-m for a custom master URL so we can use a cluster instead of just local[*]
-wsc to quickly enable WholeStageCodegen
-q to run only selected queries, e.g. -q 1 2 4 90 (see the sketch after this section)
We have a variety of whole-stage codegen changes that we'd like to contribute to Spark, and we think this benchmark is the best fit for our goals. With the above changes we could quickly execute individual queries, use databases, run across multiple machines (not just ours), and quickly see the benefits or drawbacks of WholeStageCodegen.
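For the -q use case, a name filter over the existing query sets already gets close; a sketch (assuming query names follow a qN-vX.Y pattern, which is how the released sets appear to be named):
import com.databricks.spark.sql.perf.tpcds.TPCDS

val tpcds = new TPCDS(sqlContext = sqlContext)
val wanted = Set("q1", "q2", "q4", "q90")
// Strip the version suffix before matching; the naming pattern is an assumption
val selected = tpcds.tpcds2_4Queries.filter(q => wanted.contains(q.name.split("-").head))
val experiment = tpcds.runExperiment(selected)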
Hello,
I am using Spark 1.5.2. I successfully generated the data, stored it in HDFS, created a TPCDS instance, and loaded the tables using createTemporaryTables(). But every time I try to run an experiment or access queries, interactiveQueries, etc. (from the Benchmark class), I get the following stack trace:
tpcds.queries
java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrame.queryExecution()Lorg/apache/spark/sql/execution/QueryExecution;
at com.databricks.spark.sql.perf.Benchmark$Query.toString(Benchmark.scala:543)
at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
at scala.runtime.ScalaRunTime$$anonfun$scala$runtime$ScalaRunTime$$inner$1$2.apply(ScalaRunTime.scala:320)
at scala.runtime.ScalaRunTime$$anonfun$scala$runtime$ScalaRunTime$$inner$1$2.apply(ScalaRunTime.scala:320)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
at scala.collection.AbstractIterator.addString(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
at scala.collection.AbstractIterator.mkString(Iterator.scala:1157)
at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:320)
at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
at .(:10)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Hi,
I am new to sbt and do not know how to use this package. I downloaded the zip file and unzipped it. When I run sbt assembly or sbt package, it requires DBC_USERNAME to be set. Do we need a Databricks Cloud account in order to use this package? Could you provide detailed instructions on how to use it? Thanks a lot!
Hi,
1) I have created the jar file of spark-sql-perf after building it and started spark-shell with that jar.
Please find the error details in the attached file:
spark shell command error.txt
Please let me know if any further information is required.
Hi:
I am facing the issues below when I try to run this code. For this command:
tables.createExternalTables("file:///home/tpctest/", "parquet", "mydata", false)
java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier CREATE found
CREATE DATABASE IF NOT EXISTS mydata
^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
..............
I used spark-sql-perf 0.2.4, Scala 2.10.5, and Spark 1.6.1.
However, this command has no problem:
tables.createTemporaryTables("file:///home/wl/tpctest/", "parquet")
Can you help me resolve this problem?
Also, tpcds.createResultsTable() fails in the same way as tables.createExternalTables().
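A likely root cause, for what it's worth: in Spark 1.6 the plain SQLContext parser rejects Hive-style DDL such as CREATE DATABASE (hence the ``with'' expected failure), while createTemporaryTables issues no DDL and so succeeds; createResultsTable() fails for the same reason. A minimal sketch of the workaround, assuming a Hive-enabled Spark build; the paths, database name, and scale factor are placeholders:
import org.apache.spark.sql.hive.HiveContext
import com.databricks.spark.sql.perf.tpcds.Tables

// Route DDL through HiveContext instead of the plain SQLContext
val hiveContext = new HiveContext(sc)
val tables = new Tables(hiveContext, "/root/tpc-ds/tools", 50)
tables.createExternalTables("file:///home/tpctest/", "parquet", "mydata", false)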
Hello,
Does this benchmark provide fine-grained statistics and measure the execution time of every node in a query plan for the Spark SQL library?
Thank you for your help.
Best regards
table.genData(tableLocation, format, overwrite, clusterByPartitionColumns,
What values does the format argument take when generating the TPC-DS data?
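As far as I can tell, the format string is handed through to Spark's data source writer, so any built-in source name should work; availability depends on your Spark version. A sketch with common values (the location and flags are placeholders):
// format maps to DataFrameWriter.format(...); common choices include
// "parquet", "orc", "json", and "text" ("csv" is built in from Spark 2.x)
tables.genData("hdfs:///tpcds-data", "parquet", true, true, false, false, false)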
Hi All,
I am new to Spark and Scala. I have the source code for the Spark SQL performance tests and dsdgen.
Can anyone tell me how to proceed next? I have finished building by running bin/run --help.
I am trying to execute bin/run --benchmark DatasetPerformance and it gives me an error. Before that, it would be really great if someone could tell me how to get started. I understand the README is still under development; is there any manual I can follow?
I used tables.genData(location, format, true, true, true, true, true) and hit the following problem; tables.genData(location, format, true, false, true, true, true) works fine. Does this benchmark not support partitioned tables?
java.lang.RuntimeException: [7.1] failure: ``union'' expected but identifier DISTRIBUTE found
After reading the source code, I found the problem occurs on these lines:
val query =
s"""
|SELECT
| $columnString
|FROM
| $tempTableName
|$predicates
|DISTRIBUTE BY
| $partitionColumnString
""".stripMargin
Is there any suggestion for solving this problem?
I am running this on Spark 1.5.
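Given the snippet above, the DISTRIBUTE BY clause only appears on the clustered generation path, so on Spark 1.5 (whose plain SQLContext parser rejects it) there look to be two ways out: construct the Tables helper with a HiveContext, whose parser accepts the clause, or skip clustering. A sketch of the latter, assuming the clause really is gated on the clusterByPartitionColumns flag (flag order follows the genData signature quoted earlier in this document):
// Keep partitioned output but skip the DISTRIBUTE BY clustering pass
// (location, format, overwrite, partitionTables, useDoubleForDecimal,
//  clusterByPartitionColumns, filterOutNullPartitionValues)
tables.genData(location, format, true, true, true, false, true)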
Hello everyone,
I ran the TPC-DS benchmark for Hawq and Spark SQL on Hadoop. Could you please help me verify the numbers?
Ref:
https://github.com/pivotalguru/TPC-DS
https://github.com/databricks/spark-sql-perf
A subset of 19 queries was executed to evaluate the performance of Hawq and Spark SQL.
I have attached the system setup and the results of the tests. Please share your comments.
hawq-vs-spark-verify.pdf
Thanks,
Tania
This is probably a really simple question. How can I point bin/run to a running Spark cluster?
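For what it's worth, bin/run launches the benchmark through sbt in a local JVM; to target a running cluster, the usual route is to build the jar, start a shell against the cluster's master URL, and drive the API directly. A sketch (the master URL and jar path are placeholders):
// Started with: spark-shell --master spark://master-host:7077 \
//   --jars target/scala-2.10/spark-sql-perf_2.10-<version>.jar
import com.databricks.spark.sql.perf.tpcds.TPCDS

val tpcds = new TPCDS(sqlContext = sqlContext)
val experiment = tpcds.runExperiment(tpcds.tpcds1_4Queries)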
The TPC-DS spec defines primary key constraints for tables (Section 2.2.1.2).
However, it looks like this constraint isn't enforced while generating data. For example, if I run the following two queries on the store_sales table, I get a different number of rows even though ss_item_sk is defined as the primary key.
scala> val df = sql("SELECT count(ss_item_sk) FROM store_sales")
df: org.apache.spark.sql.DataFrame = [count(ss_item_sk): bigint]
scala> df.show
+-----------------+
|count(ss_item_sk)|
+-----------------+
| 2879789|
+-----------------+
scala> val df = sql("SELECT count(distinct ss_item_sk) FROM store_sales")
df: org.apache.spark.sql.DataFrame = [count(DISTINCT ss_item_sk): bigint]
scala> df.show
+--------------------------+
|count(DISTINCT ss_item_sk)|
+--------------------------+
| 18000|
+--------------------------+
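One caveat worth checking before treating this as a generator bug: in the TPC-DS spec the primary key of store_sales is the composite (ss_item_sk, ss_ticket_number), not ss_item_sk alone, so repeated ss_item_sk values are expected. A quick uniqueness check on the composite key (a sketch):
// If the composite key is unique, total_rows equals distinct_keys
val check = sqlContext.sql(
  "SELECT count(*) AS total_rows, " +
  "count(DISTINCT ss_item_sk, ss_ticket_number) AS distinct_keys " +
  "FROM store_sales")
check.show()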
I get an error trying to import the class Tables from target/scala-2.10/spark-sql-perf_2.10-0.4.7-SNAPSHOT.jar.
I use $ spark-shell --jars ...
scala> import com.databricks.spark.sql.perf.tpcds.Tables
<console>:23: error: object databricks is not a member of package com
I also noticed:
WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Any pointers?
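Two quick diagnostics, sketched below. First confirm the class is actually inside the jar (from a terminal: jar tf spark-sql-perf_2.10-0.4.7-SNAPSHOT.jar | grep Tables); then confirm the running shell can load it. The YARN warning is unrelated to the import failure; it only means Spark uploads libraries from SPARK_HOME on each submit.
// In the REPL: throws ClassNotFoundException if the jar is not on the classpath
Class.forName("com.databricks.spark.sql.perf.tpcds.Tables")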