
brickhouse's Introduction

Welcome to the Brickhouse


Brickhouse is a collection of UDFs for Hive that improves developer productivity and the scalability and robustness of Hive queries.

Brickhouse covers a wide range of functionality, grouped in the following packages.

  • collect - An implementation of "collect" and various utilities for dealing with maps and arrays.

  • json - Translate between Hive structures and JSON strings

  • sketch - An implementation of KMV sketch sets, for reach estimation of large datasets.

  • bloom - UDF wrappers around the Hadoop BloomFilter implementation.

  • sanity - Tools for implementing sanity checks and managing Hive in a production environment.

  • hbase - Experimental UDFs for an alternative way to integrate Hive with HBase.

Requirements:

Brickhouse requires Hive 0.9.0 or later. Maven 2.0 and a Java JDK are required to build.

Getting Started

  1. Clone (or fork) the repo from https://github.com/klout/brickhouse
  2. Run "mvn package" from the command line.
  3. Add the jar "target/brickhouse-<version number>.jar" to your HIVE_AUX_JARS_FILE_PATH, or add it to the distributed cache from the Hive CLI with the "add jar" command.
  4. Source the UDF declarations defined in src/main/resource/brickhouse.hql

See the wiki on Github at https://github.com/klout/brickhouse/wiki for more information.

Also, see discussions on the Brickhouse Confessions blog on Wordpress

http://brickhouseconfessions.wordpress.com/


brickhouse's People

Contributors

ap-ensighten, bennies, dawnshue, dyrosss, feliperazeek, guanglefan, jdmaturen, jeromebanks, jsh2134, kmlk15, kukido, leemoonsoo, leventov, otistamp, proftodd, spidaman, wonder-mice, y-lan, yizeli


brickhouse's Issues

parameterize sketch sets

The size of sketch sets is currently hard-coded to 5000.
Parameterize the size of the sketch set, using a const parameter.
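
As a rough illustration of what a parameterized sketch size could look like, here is a minimal plain-Java sketch of a KMV-style set that keeps only the smallest hash values. The class name KmvSketch, the maxItems parameter and the addHash method are hypothetical and are not taken from the Brickhouse source; the caller is assumed to supply already-hashed values.

```java
import java.util.TreeSet;

// Hypothetical illustration only; this is not the Brickhouse sketch set class.
final class KmvSketch {
    private final int maxItems;                        // configurable instead of a fixed 5000
    private final TreeSet<Long> hashes = new TreeSet<>();

    KmvSketch(int maxItems) {
        if (maxItems < 1) {
            throw new IllegalArgumentException("maxItems must be positive");
        }
        this.maxItems = maxItems;
    }

    // Keeps only the maxItems smallest distinct hash values seen so far.
    void addHash(long hash) {
        if (hashes.size() < maxItems) {
            hashes.add(hash);
        } else if (hash < hashes.last() && hashes.add(hash)) {
            hashes.pollLast();                         // evict the largest retained hash
        }
    }

    int size() {
        return hashes.size();
    }

    long largestKept() {
        return hashes.last();
    }
}
```

In a UDAF, the size would presumably arrive as a constant argument and be read once at initialization, but that wiring is not shown here.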

Fix naming conventions

Right now our naming conventions are somewhat confusing and inconsistent.

We should change these to be more intuitive.

Right now ...
collect ... UDAFs for converting primitive to list or map
combine ... UDFs for manipulating lists and arrays
union ... UDAFs for grouping lists and maps together in various ways

implement collect_min

Need collect_min for completeness' sake. Maybe some code reuse is possible.

Eventually should allow a comparison function to be passed in somehow (i.e. HoneyDog)
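
One way the code reuse might work is to write a single "keep the n best" routine that takes a comparator, so that the min and max variants differ only in the ordering passed in. The following is a minimal plain-Java sketch under that assumption; CollectMin and smallest are hypothetical names, and this is not how the existing collect_max is implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of Brickhouse.
final class CollectMin {
    // Returns the n smallest values under 'order', sorted ascending by 'order'.
    static <T> List<T> smallest(Iterable<T> values, int n, Comparator<T> order) {
        // Max-heap under 'order', so the largest retained value is evicted first.
        PriorityQueue<T> heap = new PriorityQueue<>(order.reversed());
        for (T value : values) {
            heap.add(value);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        List<T> result = new ArrayList<>(heap);
        result.sort(order);
        return result;
    }
}
```

Passing a reversed comparator to the same method yields the n largest values, which is the sense in which one implementation could serve both functions.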

create jaccard distance UDF

Add a UDF for the sketch_set package to calculate jaccard distance, and demonstrate how to measure set similarity ...
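
For reference, Jaccard distance is 1 - |A ∩ B| / |A ∪ B|. The minimal sketch below computes it on exact in-memory sets in plain Java; it does not operate on sketch sets, and the class name Jaccard and the choice to treat two empty sets as distance 0 are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration on exact sets; a sketch-based UDF would estimate these sizes instead.
final class Jaccard {
    static <T> double distance(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;                                // assumption: two empty sets are identical
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return 1.0 - (double) intersection.size() / unionSize;
    }
}
```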

Efficient last() and first() UDFs

Actually passing in long arrays of values can be expensive with non-generic UDFs, because of object conversions. Add generic UDFs for some common array and map operations which avoid coercion to Java primitives, for more efficient access.

Deprecate json_map/split_json or allow types to be specified

from_json is a stand-in replacement for json_map and split_json, and offers the same functionality.

We should either deprecate these functions; or keep them as convenience functions, but allow the type of the value to be specified somehow.

Possible class cast exception in join_array

This stack trace popped up in the warehouse version of join_array. It might be present in the brickhouse version as well.
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.hadoop.io.Text
at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveJavaObject(WritableStringObjectInspector.java:45)
at com.klout.analytics.warehouse.udf.JoinArrayUDF.evaluate(JoinArrayUDF.java:28)
at com.klout.analytics.warehouse.udf.JoinArrayUDF.evaluate(JoinArrayUDF.java:39)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.evaluate(ExprNodeGenericFuncEvaluator.java:163)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator$DeferredExprObject.get(ExprNodeGenericFuncEvaluator.java:64)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:177)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.evaluate(ExprNodeGenericFuncEvaluator.java:163)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:76)

hbase UDFs should handle column families

Right now there is one column family, "c", which is assumed for HBase tables inserted into with the HBase UDFs. We should generalize this somehow, and allow the column family to be specified.

support negative array indexes and ranges for array_index

array_index could support -1 as a value, to emulate Ruby-style arrays and return the last value of the array, making it more general than the last_index() and first_index() functions.

Also support getting a range of values, i.e. array_index( 3,6), which returns an array of values instead of a single value.
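
The negative-index behaviour could be resolved roughly as follows. This is a minimal plain-Java sketch, not the array_index implementation; the ArrayIndex class, the at method, and the decision to return null for an out-of-range index are all assumptions.

```java
import java.util.List;

// Hypothetical illustration only; not the Brickhouse array_index UDF.
final class ArrayIndex {
    // Negative indexes count from the end, so -1 is the last element.
    static <T> T at(List<T> list, int index) {
        if (list == null) {
            return null;
        }
        int resolved = index < 0 ? list.size() + index : index;
        if (resolved < 0 || resolved >= list.size()) {
            return null;                               // assumption: out of range yields null
        }
        return list.get(resolved);
    }
}
```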

Create Brickhouse-branded AWS image

A lot of developers use AWS these days, and they don't have the opportunity to add custom jars to their Hive environment. To improve adoption, we should build a custom AWS image which includes Brickhouse (and perhaps other tools) from which developers can run pipelines in the cloud. It could also serve as a demo environment for people wanting to test-drive Brickhouse before creating their own environment.

Add group_count/rank UDF

The group_count UDF was never propagated from the initial warehouse package.

Even though other rank() UDFs exist out there, we should add ours to brickhouse.
We just need to add this, and possibly rename to rank()

Also, add a check that results are correct if called multiple times ...

hbase scan UDTF

Define a Hive UDTF to do an HBase scan, so one can do straight lookups starting from a particular location.
Syntax would be something like this:

FROM (
hbase_scan( "mytable", "mykey") ) hb
select col1, col2

extend array_index to return sub_list

Either add sub_list to brickhouse,
or allow two index arguments to be passed in

ie array_index( ["all","good","dogs","goto","heaven"], 1,2 ) = [ "good","dogs"]
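
The example above treats both index arguments as inclusive. A minimal plain-Java sketch of that reading follows; SubList and between are hypothetical names, and clamping the range to the list bounds is an assumption rather than something the issue specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Brickhouse.
final class SubList {
    // Returns elements from 'from' through 'to', both inclusive, clamped to the list bounds.
    static <T> List<T> between(List<T> list, int from, int to) {
        int start = Math.max(from, 0);
        int end = Math.min(to, list.size() - 1);
        if (start > end) {
            return new ArrayList<>();                  // empty or inverted range
        }
        return new ArrayList<>(list.subList(start, end + 1));
    }
}
```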

pom.xml outdated

Failed to execute goal on project brickhouse: Could not resolve dependencies for project com.klout:brickhouse:jar:0.5.1: The following artifacts could not be resolved: javax.jdo:jdo2-api:jar:2.3-ec, com.clearspring.analytics:stream:jar:2.4.0-SNAPSHOT: Failure to find javax.jdo:jdo2-api:jar:2.3-ec in http://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]

http://search.maven.org/#search|ga|1|g%3A%22com.clearspring.analytics%22

integrate with hive builtins jar

Hive has a "builtins" mechanism where a certain jar contains references to all the Hive UDFs, so that they don't have to be specifically declared in a "setup.hql" file.

Somehow use this mechanism to register all the brickhouse UDFs as well.

error building

I modified the pom to use hadoop 1.0.3, hive 0.11.0 and hbase 0.92.0 (versions on emr), and ran mvn package which failed with the message:

brickhouse/src/main/java/brickhouse/udf/json/JsonSplitUDF.java:[58,29] cannot find symbol
symbol : method writeValueAsString(java.lang.Object)
location: class org.codehaus.jackson.map.ObjectMapper

brickhouse/src/main/java/brickhouse/udf/json/FromJsonUDF.java:[65,36] cannot find symbol
symbol : method readTree(java.lang.String)
location: class org.codehaus.jackson.map.ObjectMapper

brickhouse/src/main/java/brickhouse/hbase/CachedGetUDF.java:[75,37] cannot find symbol
symbol : method readTree(java.lang.String)
location: class org.codehaus.jackson.map.ObjectMapper

The fix was to include the missing jackson dependencies in the pom.xml:

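    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
      <version>1.9.9</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>1.9.9</version>
    </dependency>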

I did verify the build succeeded using the pom file as it is on master.

ObjectInspector madness

A lot of the generic UDFs handle ObjectInspectors incorrectly, which can result in UDFs working fine in some queries but producing ClassCastExceptions in others.

Need to go through most UDFs and make sure ObjectInspectors are handled correctly.

revisit vector operations

There is some overlap between thunder's vector ( map ) operations and brickhouse.

Either expand the functionality of brickhouse's methods (i.e. support vector x vector, dot product, vector sum, normalization, etc.) or import thunder's functions.

Add HyperLogLog UDFs

KMV sketches are fine for certain applications, but we need something a little more space efficient. We should integrate a third-party HyperLogLog implementation (or write our own, depending upon license, dependency, etc.).

Also, if there is some way to convert a KMV sketch to a HyperLogLog, it would be useful. We could store sketch sets on disk, to retain some sampling, and then convert to HyperLogLogs to reduce sort space.

top k elements

Implement a Hive UDF which efficiently computes the top-k elements in a streaming fashion
(to avoid doing a count(*)/group by and a collect_max)
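
The issue does not name an algorithm. One well-known option for finding frequent items in a single pass is the Misra-Gries summary, sketched below in plain Java as an assumption about what could be used. With c counters, any item occurring more than n/(c+1) times in a stream of n items is guaranteed to remain among the candidates, and the stored counts are lower bounds rather than exact frequencies. FrequentItems, offer and candidates are hypothetical names.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical illustration of a Misra-Gries summary; not part of Brickhouse.
final class FrequentItems {
    private final int capacity;
    private final Map<String, Long> counters = new HashMap<>();

    FrequentItems(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    void offer(String item) {
        Long current = counters.get(item);
        if (current != null) {
            counters.put(item, current + 1);
        } else if (counters.size() < capacity) {
            counters.put(item, 1L);
        } else {
            // No free counter: decrement every counter and drop those that reach zero.
            Iterator<Map.Entry<String, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> entry = it.next();
                if (entry.getValue() == 1L) {
                    it.remove();
                } else {
                    entry.setValue(entry.getValue() - 1);
                }
            }
        }
    }

    // Candidate frequent items with lower-bound counts.
    Map<String, Long> candidates() {
        return new HashMap<>(counters);
    }
}
```

Because the counts are only lower bounds, an exact top-k would still need a second pass over the candidates; whether that trade-off is acceptable depends on the use case.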

cosine similarity UDF

Create a UDF which calculates cosine similarity, interpreting a map of strings to doubles as a vector in multidimensional space.
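
Cosine similarity of two vectors is their dot product divided by the product of their Euclidean norms. A minimal plain-Java sketch over sparse maps follows; the Cosine class name and the choice to return 0 for a zero vector are assumptions, and missing keys are treated as 0.

```java
import java.util.Map;

// Hypothetical illustration only; not part of Brickhouse.
final class Cosine {
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            double value = entry.getValue();
            normA += value * value;
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += value * other;                  // only shared keys contribute
            }
        }
        for (double value : b.values()) {
            normB += value * value;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;                                // assumption: undefined case reported as 0
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```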

Specify type in from_json by a string with the typename

Specifying template objects is too confusing for most people, so allow the template type in from_json() to be specified with a type string, i.e.

instead of from_json(" [ 1,2,3 ] ", array( 1 ) ) ,
do this from_json(" [ 1, 2, 3 ] ", "array" )

md5 or hash_md5 ???

Somehow we ended up with two md5 functions, md5 and hash_md5:

one which returns a string, and one which returns a hex string.

We need to clean this up, and perhaps include other hash types as well.
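
For comparison, the hex-string form of an MD5 digest can be produced with the JDK alone. This is a minimal sketch, not the code behind either existing function; the Md5 class name and the use of UTF-8 for encoding the input are assumptions.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not the Brickhouse md5 or hash_md5 UDF.
final class Md5 {
    // Returns the 32-character lowercase hex form of the MD5 digest of 'input'.
    static String hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            // BigInteger drops leading zeros, so pad back to 32 hex digits.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

MessageDigest.getInstance also accepts other algorithm names such as "SHA-1" and "SHA-256", which is one route to the other hash types mentioned above; the padding width would then need to follow the digest length.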

mvn package is not working

Tests in error:
testObama(brickhouse.analytics.uniques.SketchSetTest): src/test/resources/obama.txt (No such file or directory)

Am I missing anything?

~Aniket
