
verdict's Introduction

Verdict is Useful

Verdict is an interactive-speed, resource-efficient query processor. Verdict is useful because:

  1. 200x faster by sacrificing only 1% accuracy: Verdict can give you 99% accurate answers for your big data queries in a fraction of the time needed for calculating exact answers. If your data is too big to analyze in a couple of seconds, you will like Verdict.
  2. No change to your database: Verdict is a middleware standing between your application and your database. You can just issue the same queries as before and get approximate answers right away. Of course, Verdict handles exact query processing too.
  3. Runs on (almost) any database: Verdict can run on any database that supports standard SQL. We already have drivers for Hive, Impala, and MySQL. We’ll soon add drivers for some other popular databases.
  4. Ease of use: Verdict is a client-side library: no servers, no port configurations, no extra user authentication, etc. You can simply make a JDBC connection to Verdict; then, Verdict automatically reads data from your database. Verdict is also shipped with a command-line interface.
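To make the speed-for-accuracy trade-off in item 1 concrete, here is a minimal sketch of sample-based approximation in plain Python. It is not Verdict's actual algorithm: the function name approximate_mean, the 1% sample fraction, and the synthetic data are all assumptions made for illustration. It only shows the general idea of answering an aggregate query from a small uniform sample and reporting an error bound.

```python
import math
import random


def approximate_mean(values, sample_fraction=0.01, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation).
    This is a hypothetical helper, not part of Verdict.
    """
    rng = random.Random(seed)
    n = max(2, int(len(values) * sample_fraction))
    sample = rng.sample(values, n)
    mean = sum(sample) / float(n)
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    margin = z * math.sqrt(variance / n)
    return mean, margin


# Synthetic data standing in for one numeric column of a large table.
data_rng = random.Random(42)
data = [data_rng.gauss(100.0, 15.0) for _ in range(200000)]

exact = sum(data) / len(data)               # scans every row
estimate, margin = approximate_mean(data)   # reads only 1% of the rows
relative_error = abs(estimate - exact) / exact
print("exact=%.3f estimate=%.3f +/- %.3f" % (exact, estimate, margin))
```

The estimate touches 2,000 of the 200,000 values, which is where the speedup in approximate query processing comes from; the margin indicates how far the estimate is likely to be from the exact answer.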

Find more about Verdict at our website: VerdictDB.org.

Getting Started

Verdict can run on top of Apache Spark, Apache (incubating) Impala, and Apache Hive. We are adding drivers for other database systems.

Using Verdict is easy. Following this guide, you can finish setup in five minutes if you have any of those supported systems ready.

Downloading Verdict

  1. Download and unzip the latest release.
  2. Type mvn package in the unzipped directory. This command will download all the dependencies and compile Verdict's code. The command will create three jar files in the target directory.

Building Verdict is done!

More details

Verdict is tested on Oracle JDK 1.7 or above, but it should work with OpenJDK, too. mvn is the command for Apache Maven, the build tool that Verdict uses. If you do not have Maven, you will have to install it; see the official Apache Maven installation page.

Using Verdict

The steps for starting Verdict are slightly different depending on the database system it works on. Once connected, however, Verdict accepts the same SQL statements.

On Apache Spark

Verdict works with Spark by internally creating Spark's HiveContext. In this way, Verdict can load persisted tables through Hive metastore. Verdict is tested with Apache Spark 1.6.0 (in the Cloudera distribution CDH 5.11). We will support Spark 2.0 shortly.

We show how to use Verdict in spark-shell and pyspark. Using Verdict in a Spark application written in either Scala or Python works the same way.

Verdict-on-Spark

You can start spark-shell with Verdict as follows.

$ spark-shell --jars target/verdict-core-0.3.0-jar-with-dependencies.jar

After spark-shell starts, import and use Verdict as follows.

scala> import edu.umich.verdict.VerdictSparkHiveContext

scala> val vc = new VerdictSparkHiveContext(sc)   // sc: SparkContext instance

scala> vc.sql("show databases").show(false)       // Simply displays the databases (or often called schemas)

// Creates samples for the table. This step needs to be done only once for the table.
scala> vc.sql("create sample of database_name.table_name").show(false)

// Now Verdict automatically uses available samples for speeding up this query.
scala> vc.sql("select count(*) from database_name.table_name").show(false)

The return value of VerdictSparkHiveContext#sql() is a Spark DataFrame; thus, any method that works on a Spark DataFrame works on Verdict's answer seamlessly.

Verdict-on-PySpark

You can start pyspark shell with Verdict as follows.

$ export PYTHONPATH=$(pwd)/python:$PYTHONPATH

$ pyspark --driver-class-path target/verdict-core-0.3.0-jar-with-dependencies.jar

Limitation: Note that, in order for the --driver-class-path option to work, the jar file (i.e., target/verdict-core-0.3.0-jar-with-dependencies.jar) must be present on the Spark driver node. Verdict will support the --jars option shortly.

After pyspark shell starts, import and use Verdict as follows.

>>> from pyverdict import VerdictHiveContext

>>> vc = VerdictHiveContext(sc)        # sc: SparkContext instance

>>> vc.sql("show databases").show()    # Simply displays the databases (or often called schemas)

# Creates samples for the table. This step needs to be done only once for the table.
>>> vc.sql("create sample of database_name.table_name").show()

# Now Verdict automatically uses available samples for speeding up this query.
>>> vc.sql("select count(*) from database_name.table_name").show()

The return value of VerdictHiveContext#sql() is a pyspark DataFrame; thus, any method that works on a pyspark DataFrame works on Verdict's answer seamlessly.
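Because the result is an ordinary DataFrame, it can be pulled into plain Python values with the usual DataFrame methods. The following is a minimal sketch; approximate_count is a hypothetical helper name (not part of pyverdict), and it assumes only that vc.sql() returns a DataFrame whose collect() method yields a list of rows, as pyspark DataFrames do.

```python
def approximate_count(vc, table):
    """Return the row count Verdict reports for `table` as a Python number.

    `vc` is a context object with a sql() method, such as the
    VerdictHiveContext created above. Hypothetical helper for illustration.
    """
    df = vc.sql("select count(*) from {}".format(table))
    rows = df.collect()      # standard DataFrame method: list of Row objects
    return rows[0][0]        # first column of the first (and only) row
```

In the pyspark shell, this would be called as approximate_count(vc, "database_name.table_name") after the sample for that table has been created.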

On Apache Impala or Apache Hive

We will use our command-line interface (called veeline) to connect to those databases. You can also connect to Verdict programmatically through the standard JDBC interface. Please see our website for the JDBC instructions.

Verdict-on-Impala

Type the following command in a terminal to launch veeline and connect to Impala.

$ veeline/bin/veeline -h "impala://hostname:port/schema;key1=value1;key2=value2;..." -u username -p password

Note that parameters are delimited by semicolons (;). The connection string is quoted because semicolons have special meaning in bash. The user name and password can also be passed in the connection string as parameters.

Verdict supports Kerberos connections. For this, add principal=user/host@domain as one of the key-value pairs.
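If you launch veeline from a script, the connection string format described above can be assembled programmatically. The following is a minimal sketch; build_connection_string and build_veeline_command are hypothetical helpers (not part of Verdict), and the port 21050 and schema name default are placeholder values. The same helpers work for Hive by passing hive2 as the scheme.

```python
import shlex


def build_connection_string(scheme, host, port, schema, params=None):
    """Build 'scheme://host:port/schema;key1=value1;key2=value2'."""
    base = "{}://{}:{}/{}".format(scheme, host, port, schema)
    pairs = ["{}={}".format(k, v) for k, v in (params or {}).items()]
    return ";".join([base] + pairs)


def build_veeline_command(conn, user, password):
    """Return a veeline command line with every argument shell-quoted."""
    args = ["veeline/bin/veeline", "-h", conn, "-u", user, "-p", password]
    return " ".join(shlex.quote(a) for a in args)


# Placeholder host, port, and schema; the Kerberos principal is passed
# as one of the key-value pairs.
conn = build_connection_string(
    "impala", "hostname", 21050, "default",
    {"principal": "user/host@domain"},
)
print(build_veeline_command(conn, "username", "password"))
```

shlex.quote wraps the connection string in single quotes because it contains semicolons; this serves the same purpose as the double quotes in the command shown above.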

After veeline launches, you can issue regular SQL queries as follows.

verdict:impala> show databases;

verdict:impala> create sample of database_name.table_name;

verdict:impala> select count(*) from database_name.table_name;

Verdict-on-Hive

Type the following command in a terminal to launch veeline and connect to Hive.

$ veeline/bin/veeline -h "hive2://hostname:port/schema;key1=value1;key2=value2;..." -u username -p password

Note that parameters are delimited by semicolons (;). The connection string is quoted because semicolons have special meaning in bash. The user name and password can also be passed in the connection string as parameters.

Verdict supports Kerberos connections. For this, add principal=user/host@domain as one of the key-value pairs.

After veeline launches, you can issue regular SQL queries as follows.

verdict:Apache Hive> show databases;

verdict:Apache Hive> create sample of database_name.table_name;

verdict:Apache Hive> select count(*) from database_name.table_name;

Notes on using veeline

veeline makes a JDBC connection to the database system that Verdict works on top of (e.g., Impala or Hive). For this, it uses the JDBC drivers stored in the lib folder. By default, our code ships with Cloudera's Impala and Hive JDBC drivers (jar files). If these drivers are not compatible with your environment, delete them and put compatible JDBC drivers in the lib folder instead.

What's Next

See what types of queries are supported by Verdict on our website, and enjoy the speedup Verdict provides for those queries.

If you have use cases that are not supported by Verdict, please contact us at [email protected], or create an issue in our GitHub repository. We will answer your questions or requests shortly (within a few days at most).

