
spear's Introduction

Overview


This project is a sandbox and playground of mine for experimenting with ideas and potential improvements to Spark SQL. It consists of:

  • A parser that parses a small SQL dialect into unresolved logical plans
  • A semantic analyzer that resolves unresolved logical plans into resolved ones
  • A query optimizer that optimizes resolved query plans into equivalent but more performant ones
  • A query planner that turns (optimized) logical plans into executable physical plans

Currently Spear only works with local Scala collections.
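The four phases listed above can be sketched over local Scala collections (Spear's execution target) with a toy plan ADT. All names below are illustrative stand-ins, not Spear's actual API:

```scala
// Toy sketch of the parse -> analyze -> optimize -> execute pipeline.
object PipelineSketch {
  sealed trait Plan
  case class UnresolvedRelation(name: String) extends Plan // parser output
  case class LocalRelation(name: String, rows: Seq[Long]) extends Plan
  case class Limit(child: Plan, n: Int) extends Plan

  val catalog = Map("t" -> Seq.tabulate(10)(_.toLong))

  // Analyzer: resolve table names against a catalog.
  def analyze(plan: Plan): Plan = plan match {
    case UnresolvedRelation(n) => LocalRelation(n, catalog(n))
    case Limit(child, n)       => Limit(analyze(child), n)
    case other                 => other
  }

  // Optimizer: drop a LIMIT that keeps at least as many rows as exist.
  def optimize(plan: Plan): Plan = plan match {
    case Limit(r @ LocalRelation(_, rows), n) if n >= rows.size => r
    case Limit(child, n) => Limit(optimize(child), n)
    case other           => other
  }

  // "Physical" execution over local Scala collections.
  def execute(plan: Plan): Seq[Long] = plan match {
    case LocalRelation(_, rows) => rows
    case Limit(child, n)        => execute(child).take(n)
    case _: UnresolvedRelation  => sys.error("cannot execute an unresolved plan")
  }
}
```

The same shape scales to the real thing: each phase is a pure function from plan to plan, and only the last one touches data.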

Build

Building Spear is as easy as:

$ ./build/sbt package

Run the REPL

Spear has an Ammonite-based REPL for interactive experiments. To start it:

$ ./build/sbt spear-repl/run

Let's create a simple DataFrame of numbers:

@ context range 10 show ()
╒══╕
│id│
├──┤
│ 0│
│ 1│
│ 2│
│ 3│
│ 4│
│ 5│
│ 6│
│ 7│
│ 8│
│ 9│
╘══╛

A sample query using the DataFrame API:

@ context.
    range(10).
    select('id as 'key, (rand(42) * 100) cast IntType as 'value).
    where('value % 2 === 0).
    orderBy('value.desc).
    show()
╒═══╤═════╕
│key│value│
├───┼─────┤
│  5│   90│
│  9│   78│
│  0│   72│
│  1│   68│
│  4│   66│
│  8│   46│
│  6│   36│
│  2│   30│
╘═══╧═════╛

Equivalent sample query using SQL:

@ context range 10 asTable 't // Registers a temporary table first

@ context.sql(
    """SELECT * FROM (
      |  SELECT id AS key, CAST(RAND(42) * 100 AS INT) AS value FROM t
      |) s
      |WHERE value % 2 = 0
      |ORDER BY value DESC
      |""".stripMargin
  ).show()
╒═══╤═════╕
│key│value│
├───┼─────┤
│  5│   90│
│  9│   78│
│  0│   72│
│  1│   68│
│  4│   66│
│  8│   46│
│  6│   36│
│  2│   30│
╘═══╧═════╛

We can also check the query plan using explain():

@ context.
    range(10).
    select('id as 'key, (rand(42) * 100) cast IntType as 'value).
    where('value % 2 === 0).
    orderBy('value.desc).
    explain(true)
# Logical plan
Sort: order=[$0] ⇒ [?output?]
│ ╰╴$0: `value` DESC NULLS FIRST
╰╴Filter: condition=$0 ⇒ [?output?]
  │ ╰╴$0: ((`value` % 2:INT) = 0:INT)
  ╰╴Project: projectList=[$0, $1] ⇒ [?output?]
    │ ├╴$0: (`id` AS `key`#11)
    │ ╰╴$1: (CAST((RAND(42:INT) * 100:INT) AS INT) AS `value`#12)
    ╰╴LocalRelation: data=<local-data> ⇒ [`id`#10:BIGINT!]

# Analyzed plan
Sort: order=[$0] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
│ ╰╴$0: `value`#12:INT! DESC NULLS FIRST
╰╴Filter: condition=$0 ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
  │ ╰╴$0: ((`value`#12:INT! % 2:INT) = 0:INT)
  ╰╴Project: projectList=[$0, $1] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
    │ ├╴$0: (`id`#10:BIGINT! AS `key`#11)
    │ ╰╴$1: (CAST((RAND(CAST(42:INT AS BIGINT)) * CAST(100:INT AS DOUBLE)) AS INT) AS `value`#12)
    ╰╴LocalRelation: data=<local-data> ⇒ [`id`#10:BIGINT!]

# Optimized plan
Sort: order=[$0] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
│ ╰╴$0: `value`#12:INT! DESC NULLS FIRST
╰╴Filter: condition=$0 ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
  │ ╰╴$0: ((`value`#12:INT! % 2:INT) = 0:INT)
  ╰╴Project: projectList=[$0, $1] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
    │ ├╴$0: (`id`#10:BIGINT! AS `key`#11)
    │ ╰╴$1: (CAST((RAND(42:BIGINT) * 100.0:DOUBLE) AS INT) AS `value`#12)
    ╰╴LocalRelation: data=<local-data> ⇒ [`id`#10:BIGINT!]

# Physical plan
Sort: order=[$0] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
│ ╰╴$0: `value`#12:INT! DESC NULLS FIRST
╰╴Filter: condition=$0 ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
  │ ╰╴$0: ((`value`#12:INT! % 2:INT) = 0:INT)
  ╰╴Project: projectList=[$0, $1] ⇒ [`key`#11:BIGINT!, `value`#12:INT!]
    │ ├╴$0: (`id`#10:BIGINT! AS `key`#11)
    │ ╰╴$1: (CAST((RAND(42:BIGINT) * 100.0:DOUBLE) AS INT) AS `value`#12)
    ╰╴LocalRelation: data=<local-data> ⇒ [`id`#10:BIGINT!]

spear's People

Contributors

cloud-fan, liancheng, zzl0


spear's Issues

Ammonite REPL doesn't work well with Scala 2.11.8

$ build/sbt repl/run
[info] Running scraper.repl.Main
Loading...
[error] (run-main-9) java.lang.AbstractMethodError: ammonite.repl.interp.CompilerCompatibility$$anon$2.printingOk(Lscala/reflect/internal/Trees$Tree;)Z
java.lang.AbstractMethodError: ammonite.repl.interp.CompilerCompatibility$$anon$2.printingOk(Lscala/reflect/internal/Trees$Tree;)Z
        at scala.tools.nsc.typechecker.TypersTracking$class.noPrintAdapt(TypersTracking.scala:164)
        at ammonite.repl.interp.CompilerCompatibility$$anon$2.noPrintAdapt(CompilerCompatibility.scala:10)
        at scala.tools.nsc.typechecker.TypersTracking$typingStack$.showAdapt(TypersTracking.scala:115)
        at scala.tools.nsc.typechecker.Typers$Typer.runTyper$1(Typers.scala:5415)
        at scala.tools.nsc.typechecker.Typers$Typer.scala$tools$nsc$typechecker$Typers$Typer$$typedInternal(Typers.scala:5423)
        at scala.tools.nsc.typechecker.Typers$Typer.body$2(Typers.scala:5370)
        at scala.tools.nsc.typechecker.Typers$Typer.typed(Typers.scala:5374)
        at scala.tools.nsc.typechecker.Typers$Typer.typedQualifier(Typers.scala:5472)
        at scala.tools.nsc.typechecker.Typers$Typer.typedQualifier(Typers.scala:5480)
        at scala.tools.nsc.typechecker.Typers$Typer.typedPackageDef$1(Typers.scala:5012)
        at scala.tools.nsc.typechecker.Typers$Typer.typedMemberDef$1(Typers.scala:5312)
        at scala.tools.nsc.typechecker.Typers$Typer.typed1(Typers.scala:5359)
        at scala.tools.nsc.typechecker.Typers$Typer.runTyper$1(Typers.scala:5396)
        at scala.tools.nsc.typechecker.Typers$Typer.scala$tools$nsc$typechecker$Typers$Typer$$typedInternal(Typers.scala:5423)
        at scala.tools.nsc.typechecker.Typers$Typer.body$2(Typers.scala:5370)
        at scala.tools.nsc.typechecker.Typers$Typer.typed(Typers.scala:5374)
        at scala.tools.nsc.typechecker.Typers$Typer.typed(Typers.scala:5448)
        at scala.tools.nsc.typechecker.Analyzer$typerFactory$$anon$3.apply(Analyzer.scala:102)
        at scala.tools.nsc.Global$GlobalPhase$$anonfun$applyPhase$1.apply$mcV$sp(Global.scala:440)
        at scala.tools.nsc.Global$GlobalPhase.withCurrentUnit(Global.scala:431)
        at scala.tools.nsc.Global$GlobalPhase.applyPhase(Global.scala:440)
        at scala.tools.nsc.typechecker.Analyzer$typerFactory$$anon$3$$anonfun$run$1.apply(Analyzer.scala:94)
        at scala.tools.nsc.typechecker.Analyzer$typerFactory$$anon$3$$anonfun$run$1.apply(Analyzer.scala:93)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.tools.nsc.typechecker.Analyzer$typerFactory$$anon$3.run(Analyzer.scala:93)
        at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1501)
        at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1486)
        at scala.tools.nsc.Global$Run.compileSources(Global.scala:1481)
        at scala.tools.nsc.Global$Run.compileFiles(Global.scala:1571)
        at ammonite.repl.interp.Compiler$$anon$4.compile(Compiler.scala:256)
        at ammonite.repl.interp.Interpreter$$anonfun$6$$anonfun$apply$7.apply(Interpreter.scala:339)
        at ammonite.repl.interp.Interpreter$$anonfun$6$$anonfun$apply$7.apply(Interpreter.scala:339)
        at ammonite.repl.interp.Evaluator$$anon$1.ammonite$repl$interp$Evaluator$$anon$$compileClass(Evaluator.scala:144)
        at ammonite.repl.interp.Evaluator$$anon$1.cachedCompileBlock(Evaluator.scala:288)
        at ammonite.repl.interp.Evaluator$$anon$1.processScriptBlock(Evaluator.scala:303)
        at ammonite.repl.interp.Interpreter$$anonfun$processModule$1.apply(Interpreter.scala:98)
        at ammonite.repl.interp.Interpreter$$anonfun$processModule$1.apply(Interpreter.scala:98)
        at ammonite.repl.interp.Interpreter.loop$1(Interpreter.scala:156)
        at ammonite.repl.interp.Interpreter.processScript(Interpreter.scala:169)
        at ammonite.repl.interp.Interpreter.processModule(Interpreter.scala:98)
        at ammonite.repl.interp.Interpreter.<init>(Interpreter.scala:400)
        at ammonite.repl.Repl.<init>(Repl.scala:38)
        at ammonite.repl.Main$.repl$lzycompute$1(Main.scala:130)
        at ammonite.repl.Main$.repl$1(Main.scala:130)
        at ammonite.repl.Main$.run(Main.scala:136)
        at scraper.repl.Main$.main(Main.scala:11)
        at scraper.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)

Should probably downgrade Scala back to 2.11.7.

Improve tree string of expressions

Currently, Expression.nodeCaption only contains the class name, thus Expression.prettyTree loses necessary information for some expressions. For example:

@ (lit(1) cast StringType).prettyTree
res21: String = """
Cast
╰╴1:INT
"""

In the above snippet, the target data type is missing. We should show all constructor arguments except for child nodes in Expression.nodeCaption, similar to what we did for QueryPlan.
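One possible shape for the fix (illustrative names, not Spear's actual classes): derive the caption from Product, keeping every constructor argument that is not a child node.

```scala
// Sketch: nodeCaption includes non-child constructor arguments.
sealed trait Expr extends Product {
  def children: Seq[Expr]
  def nodeCaption: String = {
    // Constructor arguments that are not child expressions.
    val args = productIterator.filterNot(a => children.exists(_ == a)).toSeq
    if (args.isEmpty) productPrefix else s"$productPrefix(${args.mkString(", ")})"
  }
}

case class Lit(value: Int) extends Expr { def children: Seq[Expr] = Nil }
case class Cast(child: Expr, dataType: String) extends Expr {
  def children: Seq[Expr] = Seq(child)
}
```

With this, `Cast(Lit(1), "STRING").nodeCaption` yields `Cast(STRING)` rather than just `Cast`, so the target data type survives in the pretty tree.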

Refactor data type ordering

Currently, all orderable data types extend from OrderedType. But this is not suitable for the case of complex types. For example, a StructType is orderable iff all of its fields are of orderable types. Thus, we can't let StructType extend from OrderedType since it's not always orderable.

The approach proposed here:

  1. Remove OrderedType
  2. Add an ordering: Option[Ordering[T]] field to DataType
  3. A DataType t is then orderable iff t.ordering.isDefined
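A minimal sketch of steps 2 and 3 (type names are illustrative, not Spear's actual hierarchy): orderability falls out of whether an Ordering can be built, and StructType builds one only when all of its fields can.

```scala
sealed trait DataType {
  // Proposal: a type is orderable iff it can supply an Ordering.
  def ordering: Option[Ordering[Any]]
}

case object IntType extends DataType {
  val ordering = Some(Ordering.Int.asInstanceOf[Ordering[Any]])
}

case class ArrayType(elementType: DataType) extends DataType {
  val ordering = None // kept non-orderable in this sketch
}

case class StructField(name: String, dataType: DataType)

case class StructType(fields: Seq[StructField]) extends DataType {
  // Orderable iff every field is orderable; compare field by field.
  def ordering: Option[Ordering[Any]] = {
    val fieldOrderings = fields.map(_.dataType.ordering)
    if (fieldOrderings.exists(_.isEmpty)) None
    else Some(new Ordering[Any] {
      def compare(x: Any, y: Any): Int = {
        val (xs, ys) = (x.asInstanceOf[Seq[Any]], y.asInstanceOf[Seq[Any]])
        fieldOrderings.flatten.zip(xs zip ys)
          .map { case (ord, (a, b)) => ord.compare(a, b) }
          .find(_ != 0).getOrElse(0)
      }
    })
  }
}
```

No marker trait is needed: a struct containing an unorderable field simply yields None.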

Spark SQL function

@liancheng I'd like to ask a question about the Spark SQL function from_utc_timestamp(ts: Column, tz: String).
I am using mongo-spark to load the "member" collection from MongoDB, which includes three fields: memberId, date, and timezone.
case class Member(memberId: String, date: Timestamp, timezone: String)
val memberDF: DataFrame = load[Member]("member")
I want to invoke from_utc_timestamp to get each member's timestamp in their own timezone, e.g. memberDF.select(memberId, from_utc_timestamp(date, timezone)). However, the tz parameter is a String, not a Column. How can I implement from_utc_timestamp(ts: Column, tz: Column)?

def from_utc_timestamp(ts: Column, tz: String): Column = withExpr {
  FromUTCTimestamp(ts.expr, Literal(tz))
}

But withExpr is a private method.

Thanks,
Aaron
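(Note: not part of the original thread.) One common workaround is to go through the SQL expression parser via functions.expr, where both arguments of from_utc_timestamp may be columns, sidestepping the String-only Scala signature. This assumes the memberDF from the question and a running Spark session:

```scala
import org.apache.spark.sql.functions.expr

// Both arguments are resolved as columns by the SQL parser.
val withLocalTs = memberDF.select(
  expr("memberId"),
  expr("from_utc_timestamp(date, timezone)").as("localTs"))

// Equivalent one-liner:
// memberDF.selectExpr("memberId", "from_utc_timestamp(date, timezone)")
```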

Query plan tree may show duplicated expressions

$ build/sbt repl/run
...
@ context range 10 groupBy 'id agg 'id explain ()
# Logical plan
UnresolvedAggregate: keys=[$0], projectList=[$0], havingConditions=[], order=[] ⇒ [?output?]
│ ├╴$0: `id`
│ ╰╴$1: `id`
╰╴LocalRelation: data=<local-data> ⇒ [`id`#0:BIGINT!]
...

id shows twice because it appears twice in UnresolvedAggregate: once among the grouping keys and once in the project list.

Proper case sensitivity handling

Case sensitivity is always a nasty source of bugs. It would be nice to have a safe, consistent, yet concise way to do case-sensitive and case-insensitive name resolution.
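One possible shape for this (purely illustrative, not existing Spear code) is to thread a single resolver value through all name lookups, so the case-sensitivity decision is made in exactly one place:

```scala
// A resolver decides name equality once, for the whole analyzer.
case class Resolver(caseSensitive: Boolean) {
  def apply(a: String, b: String): Boolean =
    if (caseSensitive) a == b else a.equalsIgnoreCase(b)
}

case class Attribute(name: String)

// Every lookup site delegates to the resolver instead of comparing directly.
def resolve(attrs: Seq[Attribute], name: String, resolver: Resolver): Option[Attribute] =
  attrs.find(a => resolver(a.name, name))
```

This keeps call sites concise while making it impossible to mix modes within one analysis run.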

Distinct aggregate function resolution

We need something similar to Spark's DistinctAggregationRewriter, which is used to resolve multiple distinct aggregate functions within a single SELECT query. E.g.:

SELECT
  SUM(a),
  COUNT(DISTINCT b),
  COUNT(DISTINCT c)
FROM t
GROUP BY d
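Spelled out over a local collection (Spear's execution target), the query above computes the following; each DISTINCT aggregate needs its own de-duplicated set per group, which is why a single-pass hash aggregate is not enough and a rewrite is needed. The sample data here is made up:

```scala
case class Row(a: Int, b: Int, c: Int, d: Int)
val t = Seq(Row(1, 10, 100, 7), Row(2, 10, 200, 7), Row(3, 20, 200, 8))

// SELECT SUM(a), COUNT(DISTINCT b), COUNT(DISTINCT c) FROM t GROUP BY d
val result = t.groupBy(_.d).map { case (d, rows) =>
  (d, rows.map(_.a).sum, rows.map(_.b).toSet.size, rows.map(_.c).toSet.size)
}.toSet
// result == Set((7, 3, 1, 2), (8, 3, 1, 1))
```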

Revise various tree string printing methods

The current code base uses a number of methods to build the pretty tree
string of a TreeNode:

  • TreeNode
    • .prettyTree
    • .nodeName
    • .nodeCaption
    • .buildNestedTree
    • .buildPrettyTree
  • QueryPlan
    • .argStrings
    • .argValueStrings
    • .outputStrings

This is too complicated; the current design should be revised and simplified.

Interface for table-valued functions

Once we have an interface for table-valued functions (TVFs), we can first make range a TVF, and then introduce more data sources as TVFs.

One abstraction I'd like to experiment with is to make external data sources TVFs.
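A minimal shape the TVF interface could take (illustrative only, not a committed design): a named function from argument values to rows, with range as the first instance.

```scala
// Sketch: a table-valued function produces a relation (rows of values).
trait TableValuedFunction {
  def name: String
  def apply(args: Seq[Any]): Seq[Seq[Any]]
}

// `range(n)` as the first TVF: a single BIGINT column with values 0 until n.
object RangeTVF extends TableValuedFunction {
  def name: String = "range"
  def apply(args: Seq[Any]): Seq[Seq[Any]] = {
    val n = args.head.asInstanceOf[Int]
    (0L until n).map(Seq(_))
  }
}
```

An external data source would then be just another TableValuedFunction whose apply reads rows from the outside world.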
