Comments (9)
It would be great to add these. I think the biggest barrier here is going to be adapting the data generation code to produce all of the fact tables. Right now, we use the streaming mode of dsdgen
and I don't think you can do this for the tables that have foreign key dependencies.
from spark-sql-perf.
What do you think about de-coupling dsdgen from the test kit itself and
simply providing instructions on how to run dsdgen by itself?
Is it correct to assume that most users are probably using Hive and have
created these tables and loaded data already?
For our tests, we used externally generated data on HDFS (path passed in),
created DataFrames, and used the spark-csv reader to load the data, like this:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StructType

def importTable(sqlContext: SQLContext, filename: String,
                schema: StructType, tablename: String): Unit = {
  // Read pipe-delimited dsdgen output with an explicit schema (no header row)
  val df = sqlContext.read.format("com.databricks.spark.csv")
    .schema(schema).option("delimiter", "|").load(filename)
  // Register the DataFrame so queries can refer to it by its TPC-DS table name
  df.registerTempTable(tablename)
}
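For reference, a minimal sketch of driving importTable over several tables. The tableFile helper and the one-.dat-file-per-table layout are illustrative assumptions about how the externally generated data sits on HDFS, not spark-sql-perf API:

```scala
// Hypothetical helper: map a TPC-DS table name to its data file under a base
// path, assuming dsdgen wrote one .dat file per table.
def tableFile(basePath: String, tablename: String): String =
  s"$basePath/$tablename.dat"

// Usage sketch (schemas come from the TPC-DS spec; sqlContext from the app):
//   for (t <- Seq("store_sales", "date_dim", "item"))
//     importTable(sqlContext, tableFile("hdfs:///tpcds/sf1000", t), schemas(t), t)
```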
A few queries do have modifications -- thought I'd mention that, but they
should be good enough for this kit.
Will package the queries up and send them soon.
From: Michael Armbrust [email protected]
To: databricks/spark-sql-perf [email protected]
Cc: Jesse F Chen/San Francisco/IBM@IBMUS
Date: 09/17/2015 12:26 PM
Subject: Re: [spark-sql-perf] Can we put all working queries into this
test kit? There are 86 out of 99 working in Spark 1.5 (#23)
from spark-sql-perf.
We don't necessarily need to block adding the queries on adding the data generation, but in my experience generating larger scale factors (SF1500 - SF15000) is actually a significant challenge. So I would definitely like to add support for generating them in the context of a Spark job.
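For what it's worth, the Spark-job approach can be sketched roughly as follows. dsdgen's -parallel/-child flags split a scale factor into independent chunks, so each Spark task can generate one chunk locally. The helper below only assembles the command line; the names (dsdgenCommand, outputDir) and the exact wiring into a Spark job are illustrative assumptions, not spark-sql-perf API:

```scala
// Sketch: build the dsdgen invocation for one chunk of a parallel run.
// Per the TPC-DS toolkit, `dsdgen -scale N -parallel P -child C` generates
// chunk C of P for scale factor N.
def dsdgenCommand(dsdgenDir: String,
                  scaleFactor: Int,
                  parallel: Int,
                  child: Int,
                  outputDir: String): Seq[String] =
  Seq(
    s"$dsdgenDir/dsdgen",
    "-scale", scaleFactor.toString,
    "-parallel", parallel.toString,
    "-child", child.toString,
    "-dir", outputDir,
    "-force"
  )

// Inside a Spark job, one chunk per task might look like:
//   sc.parallelize(1 to parallel, parallel).foreach { child =>
//     import scala.sys.process._
//     dsdgenCommand(dir, sf, parallel, child, out).!
//   }
// (assumes dsdgen binaries and outputDir are reachable on every executor)
```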
from spark-sql-perf.
Definitely nice to have data generation done in a Spark job. What's the best way to upload a gzip file containing all 86 queries in text files?
from spark-sql-perf.
I wouldn't upload them as a zip file. I'd do one of the following:
- Add the files in src/main/resources/... and create a harness that reads them from the classloader and creates query objects for each. Put this as another trait in the tpcds directory.
- Hard code them as strings as we have in the other tpcds files.
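A rough sketch of the first option, assuming the queries live as .sql files bundled under src/main/resources; the resource layout and the helper names (queryName, loadQuery) are illustrative, not existing spark-sql-perf code:

```scala
import scala.io.Source

// Derive a query name from a bundled resource path, e.g. ".../q23a.sql" -> "q23a"
def queryName(resourcePath: String): String =
  resourcePath.split('/').last.stripSuffix(".sql")

// Read one bundled query from the classpath and pair it with its name.
// getResourceAsStream resolves against the classloader, so the .sql files
// travel inside the jar with no external paths involved.
def loadQuery(resourcePath: String): (String, String) = {
  val stream = getClass.getClassLoader.getResourceAsStream(resourcePath)
  require(stream != null, s"missing resource: $resourcePath")
  val sql = Source.fromInputStream(stream, "UTF-8").mkString
  (queryName(resourcePath), sql)
}
```

A trait in the tpcds directory could then map loadQuery over the known resource paths to build the query objects the harness runs.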
from spark-sql-perf.
Do you have any update on this? I'd be interested in testing out the new queries... Thanks! E.
from spark-sql-perf.
This is still being worked on; stay tuned, please. We will implement this as the first option from Michael's comment above - that makes sense.
from spark-sql-perf.
Any news? :) I may have some free time in the next few days; if you could PR the queries, I can have a look at adding some Scala glue... thanks!
from spark-sql-perf.
Great job!
from spark-sql-perf.