

RBHive - A Ruby Thrift client for Apache Hive


RBHive is a simple Ruby gem for communicating with Apache Hive's Thrift servers.

It supports:

  • Hiveserver (the original Thrift service shipped with Hive since early releases)
  • Hiveserver2 (the new, concurrent Thrift service shipped with Hive releases since 0.10)
  • Any other 100% Hive-compatible Thrift service (e.g. Sharkserver)

It is capable of using the following Thrift transports:

  • BufferedTransport (the default)
  • SaslClientTransport (SASL-enabled transport)
  • HTTPClientTransport (tunnels Thrift over HTTP)

As of version 1.0, RBHive supports asynchronous execution of queries: you can submit a query, disconnect, then reconnect later to check its status and retrieve the results. This frees clients from keeping a persistent TCP connection open.

About Thrift services and transports

Hiveserver

Hiveserver (the original Thrift interface) only supports a single client at a time. RBHive implements this with the RBHive::Connection class. It only supports a single transport, BufferedTransport.

Hiveserver2

Hiveserver2 (the new Thrift interface) can support many concurrent client connections. It is shipped with Hive 0.10 and later. In Hive 0.10, only BufferedTransport and SaslClientTransport are supported; starting with Hive 0.12, HTTPClientTransport is also supported.

Each of the versions after Hive 0.10 has a slightly different Thrift interface; when connecting, you must specify the Hive version or you may get an exception.

Hiveserver2 supports (in versions later than 0.12) asynchronous query execution. This works by submitting a query and retrieving a handle to the execution process; you can then reconnect at a later time and retrieve the results using this handle. Using the asynchronous methods has some caveats - please read the Asynchronous Execution section of the documentation thoroughly before using them.

RBHive implements this client with the RBHive::TCLIConnection class.

Warning!

We had to set the following in hive-site.xml to get the BufferedTransport Thrift service to work with RBHive:

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>

Otherwise you'll get this nasty-looking exception in the logs:

ERROR server.TThreadPoolServer: Error occurred during processing of message.
java.lang.ClassCastException: org.apache.thrift.transport.TSocket cannot be cast to org.apache.thrift.transport.TSaslServerTransport
  at org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:35)
  at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
  at java.lang.Thread.run(Thread.java:662) 

Other Hive-compatible services

Consult the documentation for the service, as this will vary depending on the service you're using.

Connecting to Hiveserver and Hiveserver2

Hiveserver

Since Hiveserver has no options, connection code is very simple:

RBHive.connect('hive.server.address', 10_000) do |connection|
  connection.fetch 'SELECT city, country FROM cities'
end 
➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]

Hiveserver2

Hiveserver2 can be run in several configurations, so the connection code takes an options hash with these possible keys:

  • :transport - one of :buffered (BufferedTransport), :http (HTTPClientTransport), or :sasl (SaslClientTransport)
  • :hive_version - the number after the period in the Hive version; e.g. 10, 11, 12, 13 or one of a set of symbols; see Hiveserver2 protocol versions below for details
  • :timeout - if using BufferedTransport or SaslClientTransport, the socket timeout in seconds
  • :sasl_params - if using SaslClientTransport, this is a hash of parameters to set up the SASL connection

If you pass either an empty hash or nil in place of the options (or do not supply them), the connection is attempted with the Hive version set to 0.10, using :buffered as the transport, and a timeout of 1800 seconds.
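Conceptually, the defaulting behaviour described above can be sketched as a merge over a defaults hash (illustrative only; the gem's internal option handling may be structured differently):

```ruby
# Sketch of how Hiveserver2 connection options default (illustrative only;
# the real option handling lives inside RBHive::TCLIConnection).
DEFAULT_TCLI_OPTIONS = {
  transport:    :buffered,  # BufferedTransport
  hive_version: 10,         # speak the Hive 0.10 protocol
  timeout:      1800        # socket timeout in seconds
}.freeze

def resolve_tcli_options(opts = nil)
  # nil or an empty hash leaves every default in place
  DEFAULT_TCLI_OPTIONS.merge(opts || {})
end
```

Any key you do supply overrides the matching default, so `{ hive_version: 12 }` still connects with the buffered transport and the 1800-second timeout.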

Connecting with the defaults:

RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
  connection.fetch('SHOW TABLES')
end

Connecting with a Logger:

RBHive.tcli_connect('hive.server.address', 10_000, { logger: Logger.new(STDOUT) }) do |connection|
  connection.fetch('SHOW TABLES')
end

Connecting with a specific Hive version (0.12 in this case):

RBHive.tcli_connect('hive.server.address', 10_000, { hive_version: 12 }) do |connection|
  connection.fetch('SHOW TABLES')
end

Connecting with a specific Hive version (0.12) and using the :http transport:

RBHive.tcli_connect('hive.server.address', 10_000, { hive_version: 12, transport: :http }) do |connection|
  connection.fetch('SHOW TABLES')
end

We have not tested the SASL connection, as we don't run SASL; pull requests and test reports are welcome.
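As a hedged sketch only: a SASL connection might be configured as below. The :username and :password keys mirror what users report in the issues further down this page; none of this has been verified by the maintainers.

```ruby
# Hypothetical SASL connection options (unverified, as noted above).
# The :sasl_params keys are taken from user reports, not tested code.
sasl_options = {
  transport:    :sasl,
  hive_version: 13,
  sasl_params:  { username: 'hive', password: 'secret' }
}

# With a live server this would be passed as the third argument:
# RBHive.tcli_connect('hive.server.address', 10_000, sasl_options) do |connection|
#   connection.fetch('SHOW TABLES')
# end
```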

Hiveserver2 protocol versions

Since the introduction of Hiveserver2 in Hive 0.10, there have been a number of revisions to the Thrift protocol it uses.

The following table lists the available values you can supply to the :hive_version parameter when making a connection to Hiveserver2.

value | Thrift protocol version | notes
------|-------------------------|------
10    | V1 | First version of the Thrift protocol, used only by Hive 0.10
11    | V2 | Used by the Hive 0.11 release (but not CDH5, which ships with Hive 0.11!); adds asynchronous execution
12    | V3 | Used by the Hive 0.12 release; adds the varchar type and primitive type qualifiers
13    | V7 | Used by the Hive 0.13 release; adds the features from V4, V5 and V6, plus token-based delegation connections
:cdh4 | V1 | CDH4 uses the V1 protocol, as it ships with upstream Hive 0.10
:cdh5 | V5 | CDH5 ships with upstream Hive 0.11, but adds patches that bring the Thrift protocol up to V5

In addition, you can explicitly set the Thrift protocol version according to this table:

value | Thrift protocol version | notes
------|-------------------------|------
:PROTOCOL_V1 | V1 | Used by the Hive 0.10 release
:PROTOCOL_V2 | V2 | Used by the Hive 0.11 release
:PROTOCOL_V3 | V3 | Used by the Hive 0.12 release
:PROTOCOL_V4 | V4 | Updated during Hive 0.13 development; adds decimal precision/scale and the char type
:PROTOCOL_V5 | V5 | Updated during Hive 0.13 development; adds error details when GetOperationStatus returns in an error state
:PROTOCOL_V6 | V6 | Updated during Hive 0.13 development; adds a binary type for binary payloads and uses columnar result sets
:PROTOCOL_V7 | V7 | Used by the Hive 0.13 release; adds support for token-based delegation connections
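For illustration, the two tables can be collapsed into a simple lookup (a sketch transcribed from the tables, not the gem's actual code):

```ruby
# Illustrative lookup from :hive_version values to Thrift protocol version
# symbols, transcribed from the tables above (not the gem's internal code).
HIVE_VERSION_TO_PROTOCOL = {
  10 => :PROTOCOL_V1, 11 => :PROTOCOL_V2, 12 => :PROTOCOL_V3, 13 => :PROTOCOL_V7,
  :cdh4 => :PROTOCOL_V1, :cdh5 => :PROTOCOL_V5
}.freeze

def protocol_for(hive_version)
  # Explicit :PROTOCOL_Vn symbols pass straight through;
  # numeric and CDH symbols go through the lookup table.
  return hive_version if hive_version.to_s.start_with?("PROTOCOL_V")
  HIVE_VERSION_TO_PROTOCOL.fetch(hive_version)
end
```

Note the jump from 13 to V7: the Hive 0.13 release skipped straight to V7, folding in the intermediate V4-V6 development protocols.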

Asynchronous execution with Hiveserver2

In versions of Hive later than 0.12, the Thrift server supports asynchronous execution.

The high-level view of using this feature is as follows:

  1. Submit your query using async_execute(query). This function returns a hash with the following keys: :guid, :secret, and :session. You don't need to care about the internals of this hash - all methods that interact with an async query require this hash, and you can just store it and hand it to the methods.
  2. To check the state of the query, call async_state(handles), where handles is the handles hash given to you when you called async_execute(query).
  3. To retrieve results, call either async_fetch(handles) or async_fetch_in_batch(handles); these work like the non-async methods.
  4. When you're done with the query, call async_close_session(handles).
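The four steps above can be sketched end to end. The StubConnection below is a stand-in we invented so the control flow runs without a live Hiveserver2; the method names and the handles hash follow the list above, but with a real server you would obtain the connection from RBHive.tcli_connect instead.

```ruby
# Sketch of the async workflow. StubConnection fakes a Hiveserver2 client
# so the control flow can be demonstrated without a live server.
class StubConnection
  def initialize
    @polls = 0
  end

  def async_execute(_query)
    # Opaque handles hash, as described in the method documentation below.
    { guid: "g".b, secret: "s".b, session: "n".b }
  end

  def async_state(_handles)
    @polls += 1
    @polls < 3 ? :running : :finished  # pretend the job finishes on poll 3
  end

  def async_fetch(_handles)
    [{ tab_name: "cities" }]
  end

  def async_close_session(_handles)
    :closed
  end
end

connection = StubConnection.new
handles = connection.async_execute("SHOW TABLES")          # 1. submit
sleep 0 while connection.async_state(handles) == :running  # 2. poll
results = connection.async_fetch(handles)                  # 3. fetch
connection.async_close_session(handles)                    # 4. always close!
```

In real code the polling would sleep for a sensible interval between checks, and the close call belongs in an ensure block so sessions are closed even on error (see the memory-leak warning below).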

Memory leaks

When you call async_close_session(handles), all async handles created during this session are closed.

If you do not close the sessions you create, you will leak memory in the Hiveserver2 process. Be very careful to close your sessions!

Method documentation

async_execute(query)

This method submits a query for async execution. The hash you get back is used in the other async methods, and will look like this:

{
  :guid => (binary string),
  :secret => (binary string),
  :session => (binary string)
}

The Thrift protocol specifies these strings as "binary", meaning they carry no character encoding (Ruby treats them as ASCII-8BIT). Be extremely careful when manipulating or storing these values: they can easily end up converted to UTF-8 strings, which makes them invalid when you later try to retrieve async data.
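For example, in plain Ruby (independent of the gem), the following shows why the encoding matters and one safe way to persist a handle value:

```ruby
require 'base64'

# Thrift "binary" strings arrive in Ruby as ASCII-8BIT (a.k.a. BINARY).
# The byte string below is a stand-in for a real handle value.
secret = "\x8f\xc2\x01".b
raise unless secret.encoding == Encoding::BINARY

# Forcing the bytes to UTF-8 changes how they are interpreted; arbitrary
# binary is usually not valid UTF-8, which is what corrupts stored handles.
utf8_view = secret.dup.force_encoding(Encoding::UTF_8)
raise if utf8_view.valid_encoding?  # these bytes aren't UTF-8

# A safe way to persist a handle: Base64-encode the raw bytes, and decode
# back to a BINARY string when you need the handle again.
stored   = Base64.strict_encode64(secret)
restored = Base64.strict_decode64(stored)
raise unless restored == secret && restored.encoding == Encoding::BINARY
```

Base64 (or any byte-faithful serialisation) sidesteps the problem entirely, because the stored value is plain ASCII and the decode step reconstructs the exact original bytes.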

async_state(handles)

handles is the hash returned by async_execute(query). The state will be a symbol with one of the following values and meanings:

symbol | meaning
-------|--------
:initialized | The query is initialized in Hive and ready to run
:running | The query is running (either as a MapReduce job or in-process)
:finished | The query has completed and results can be retrieved
:cancelled | The query was cancelled by a user
:closed | Unknown at present
:error | The query is semantically invalid or broken in another way
:unknown | The query is in an unknown state
:pending | The query is ready to run but is not running

There are also the utility methods async_is_complete?(handles), async_is_running?(handles), async_is_failed?(handles) and async_is_cancelled?(handles).

async_cancel(handles)

Calling this method will cancel the query in execution.

async_fetch(handles), async_fetch_in_batch(handles)

These methods let you fetch the results of the async query, if they are complete. If you call these methods on an incomplete query, they will raise an exception. They work in exactly the same way as the normal synchronous methods.

Examples

Fetching results

Hiveserver

RBHive.connect('hive.server.address', 10_000) do |connection|
  connection.fetch 'SELECT city, country FROM cities'
end 
➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]

Hiveserver2

RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
  connection.fetch 'SELECT city, country FROM cities'
end 
➔ [{:city => "London", :country => "UK"}, {:city => "Mumbai", :country => "India"}, {:city => "New York", :country => "USA"}]

Executing a query

Hiveserver

RBHive.connect('hive.server.address') do |connection|
  connection.execute 'DROP TABLE cities'
end
➔ nil

Hiveserver2

RBHive.tcli_connect('hive.server.address') do |connection|
  connection.execute 'DROP TABLE cities'
end
➔ nil

Creating tables

table = TableSchema.new('person', 'List of people that owe me money') do
  column 'name', :string, 'Full name of debtor'
  column 'address', :string, 'Address of debtor'
  column 'amount', :float, 'The amount of money borrowed'

  partition 'dated', :string, 'The date money was given'
  partition 'country', :string, 'The country the person resides in'
end

Then for Hiveserver:

RBHive.connect('hive.server.address', 10_000) do |connection|
  connection.create_table(table)
end  

Or Hiveserver2:

RBHive.tcli_connect('hive.server.address', 10_000) do |connection|
  connection.create_table(table)
end  

Modifying table schema

table = TableSchema.new('person', 'List of people that owe me money') do
  column 'name', :string, 'Full name of debtor'
  column 'address', :string, 'Address of debtor'
  column 'amount', :float, 'The amount of money borrowed'
  column 'new_amount', :float, 'The new amount this person somehow convinced me to give them'

  partition 'dated', :string, 'The date money was given'
  partition 'country', :string, 'The country the person resides in'
end

Then for Hiveserver:

RBHive.connect('hive.server.address') do |connection|
  connection.replace_columns(table)
end  

Or Hiveserver2:

RBHive.tcli_connect('hive.server.address') do |connection|
  connection.replace_columns(table)
end  

Setting properties

You can set various properties for Hive tasks, some of which change how they run. Consult the Apache Hive documentation and Hadoop's documentation for the various properties that can be set. For example, you can set the map-reduce job's priority with the following:

connection.set("mapred.job.priority", "VERY_HIGH")

Inspecting tables

Hiveserver

RBHive.connect('hive.hadoop.forward.co.uk', 10_000) {|connection| 
  result = connection.fetch("describe some_table")
  puts result.column_names.inspect
  puts result.first.inspect
}

Hiveserver2

RBHive.tcli_connect('hive.hadoop.forward.co.uk', 10_000) {|connection| 
  result = connection.fetch("describe some_table")
  puts result.column_names.inspect
  puts result.first.inspect
}

Testing

We use RBHive against Hive 0.10, 0.11 and 0.12, and have tested the BufferedTransport and HTTPClientTransport. We use it against both Hiveserver and Hiveserver2 with success.

We have not tested the SaslClientTransport, and would welcome reports on whether it works correctly.

Contributing

We welcome contributions, issues and pull requests. If there's a feature missing in RBHive that you need, or you think you've found a bug, please do not hesitate to create an issue.

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

rbhive's People

Contributors

abhinay, andykent, andytinycat, antoniogarrote, gjackson12, horaci, hypernova2002, jae, jkitching, kolobock, kovyrin, lloydpick, ojima-h, pingles, rafalcymerys, robertino, y-lan


rbhive's Issues

uninitialized constant Config (NameError) while installing gem for ruby version >= 2.2.x

gem install rbhive
Fetching: thrift-0.9.0.gem (100%)
Building native extensions. This could take a while...
ERROR: Error installing rbhive:
ERROR: Failed to build gem native extension.

/home/jyothu/.rvm/rubies/ruby-2.2.4/bin/ruby -r ./siteconf20170628-12526-wz2lr4.rb extconf.rb

*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers. Check the mkmf.log file for more details. You may
need configuration options.

Provided configuration options:
--with-opt-dir
--without-opt-dir
--with-opt-include
--without-opt-include=${opt-dir}/include
--with-opt-lib
--without-opt-lib=${opt-dir}/lib
--with-make-prog
--without-make-prog
--srcdir=.
--curdir
--ruby=/home/jyothu/.rvm/rubies/ruby-2.2.4/bin/$(RUBY_BASE_NAME)
extconf.rb:25:in `<main>': uninitialized constant Config (NameError)

Thrift::TransportException: Broken pipe

Hi
I have installed Hadoop, Hive, added gem rbhive
While executing,

RBHive.tcli_connect("localhost", 9998, { :hive_version => :cdh5, :transport => :sasl, :sasl_params => {} }) do |connection|
  connection.fetch('SHOW TABLES')
end

I am getting Error

Initializing transport sasl
Connecting to HiveServer2 localhost on port 9998
Thrift::TransportException: Broken pipe
    from /home/dheena/.rvm/gems/ruby-2.2.2/gems/thrift-0.9.3.0/lib/thrift/transport/socket.rb:90:in `rescue in write'

TCLI Connection Object Question

I would like to re-use a connection object.

I have something like this:

connection = RBHive::TCLIConnection.new(server, port, options)
connection.open
connection.open_session

I use the connection object to execute a query, and then:

connection.close_session
connection.close

If I try to re-use the same connection object it just seems to hang when I try to open a new session:

connection.open
connection.open_session

Any idea what could be happening and how to get around it? (my knowledge of Thrift is limited so I apologize in advance if this is a dumb question).

Hive metastore types in global namespace

The classes defined in hive_metastore_types.rb are created directly in the global namespace, not in a specific module and they are prone to class name collisions because of this.

For example, in my project my Role model is conflicting with the Role class defined by rbhive. There are other very generic names in this file that could also cause collisions, like Type and Version.

Can you put these classes inside a module? Is there any workaround in the meantime?

ParseException : missing EOF

I am using this gem to connect to the HiveServer via thrift and for the following sample query- i am getting parseException error

Query:
RBHive.tcli_connect('my.host.name', 10_000, {}) do |c|
  c.fetch("USE dbname;SHOW TABLES")
end

Initializing transport buffered
Connecting to HiveServer2 my.host.name on port 10000
Executing Hive Query: USE dbname;SHOW TABLES
/usr/local/lib/ruby/gems/1.9.1/gems/rbhive-0.6.0/lib/rbhive/t_c_l_i_connection.rb:336:in `raise_error_if_failed!': Error while processing statement: FAILED: ParseException line 1:7 missing EOF at ';' near 'dbname' (RuntimeError)
  from /usr/local/lib/ruby/gems/1.9.1/gems/rbhive-0.6.0/lib/rbhive/t_c_l_i_connection.rb:203:in `block in fetch'
  from /usr/local/lib/ruby/gems/1.9.1/gems/rbhive-0.6.0/lib/rbhive/t_c_l_i_connection.rb:294:in `block in safe'
  from <internal:prelude>:10:in `synchronize'
  from /usr/local/lib/ruby/gems/1.9.1/gems/rbhive-0.6.0/lib/rbhive/t_c_l_i_connection.rb:294:in `safe'
  from /usr/local/lib/ruby/gems/1.9.1/gems/rbhive-0.6.0/lib/rbhive/t_c_l_i_connection.rb:200:in `fetch'
  from hive.rb:4:in `block in <main>'
  from /usr/local/lib/ruby/gems/1.9.1/gems/rbhive-0.6.0/lib/rbhive/t_c_l_i_connection.rb:56:in `tcli_connect'
  from hive.rb:3:in `<main>'

However, the following works fine:

RBHive.tcli_connect('my.host.name', 10_000, {}) do |c|
  c.fetch("USE dbname")
  c.fetch("SHOW TABLES")
end

rbhive broken on cdh3u4 release/update (Apache Hive 0.7.1-cdh3u4)

I should mention that it was working fine on the cdh3u3 release.

bash-3.2$ cat test_rbhive.rb
require 'rbhive'

results = RBHive.connect('hive.server.address') do |connection|
  connection.fetch "select state_code from zip_code"
end

results.each do |data|
  puts "data = #{data.inspect}"
end

bash-3.2$ ruby test_rbhive.rb
Connecting to hive.server.address on port 10000
Executing Hive Query: select state_code from zip_code
/home/someuser/.rvm/gems/ruby-1.9.2-p290@somegemset/gems/rbhive-0.2.94/lib/thrift/thrift_hive.rb:26:in `recv_execute': Query returned non-zero code: 10, cause: FAILED: Error in semantic analysis: Unable to fetch table zip_code (HiveServerException)
  from /home/someuser/.rvm/gems/ruby-1.9.2-p290@somegemset/gems/rbhive-0.2.94/lib/thrift/thrift_hive.rb:17:in `execute'
  from /home/someuser/.rvm/gems/ruby-1.9.2-p290@somegemset/gems/rbhive-0.2.94/lib/rbhive/connection.rb:140:in `execute_unsafe'
  from /home/someuser/.rvm/gems/ruby-1.9.2-p290@somegemset/gems/rbhive-0.2.94/lib/rbhive/connection.rb:81:in `block in fetch'
  from /home/someuser/.rvm/gems/ruby-1.9.2-p290@somegemset/gems/rbhive-0.2.94/lib/rbhive/connection.rb:145:in `block in safe'
  from <internal:prelude>:10:in `synchronize'
  from /home/someuser/.rvm/gems/ruby-1.9.2-p290@somegemset/gems/rbhive-0.2.94/lib/rbhive/connection.rb:145:in `safe'
  from /home/someuser/.rvm/gems/ruby-1.9.2-p290@somegemset/gems/rbhive-0.2.94/lib/rbhive/connection.rb:80:in `fetch'
  from test_rbhive.rb:4:in `block in <main>'
  from /home/someuser/.rvm/gems/ruby-1.9.2-p290@somegemset/gems/rbhive-0.2.94/lib/rbhive/connection.rb:14:in `connect'
  from test_rbhive.rb:3:in `<main>'

requires with .. broken in jruby

If you use the rbhive gem in a framework that packs it into a jar (such as redstorm), you get an error in connection.rb where it attempts to import a file stepping up to a parent path using a ".."

Here is a simple fix for that:

diff --git a/lib/rbhive/connection.rb b/lib/rbhive/connection.rb
index e2a0e62..02cc088 100644
--- a/lib/rbhive/connection.rb
+++ b/lib/rbhive/connection.rb
@@ -1,7 +1,7 @@
 # suppress warnings
 old_verbose, $VERBOSE = $VERBOSE, nil
 # require thrift autogenerated files
-require File.join(File.dirname(__FILE__), *%w[.. thrift thrift_hive])
+require File.join(File.split(File.dirname(__FILE__)).first, *%w[thrift thrift_hive])
 # restore warnings
 $VERBOSE = old_verbose

Empty resultset using :sasl on Hive 0.13 and 0.14

Using SASL the query doesn't return rows:

RBHive.tcli_connect(host,10_000,{ logger: Logger.new(STDOUT), hive_version: 13, transport: :sasl, sasl_params: {username: 'hive'} }) do |connection|
  res = connection.fetch(query)
  puts res.column_names.inspect
  puts res.first.inspect
end

This is the output:

I, [2015-02-12T07:06:43.543538 #1333]  INFO -- : Executing Hive Query: SHOW TABLES
[:tab_name]
nil
=> nil

Also async execution doesn't yield anything:

RBHive.tcli_connect(host,10_000,{ logger: Logger.new(STDOUT), hive_version: 13, transport: :sasl, sasl_params: {username: 'hive'} }) do |connection|
  hash = connection.async_execute(query2)
  state = connection.async_state(hash)
  while state == :running do
      puts state
      sleep(2)
    state = connection.async_state(hash)
  end  
  puts state  
  puts connection.async_fetch(hash)
  connection.async_close_session(hash)
end
I, [2015-02-12T07:09:54.103844 #1333]  INFO -- : Executing query asynchronously: SHOW TABLES
2
finished
2
=> <Hive2::Thrift::TCloseSessionResp status:<Hive2::Thrift::TStatus statusCode:SUCCESS_STATUS (0)>>

No version identifier, old protocol client error with Spark 3

Hi, thanks for this gem! I'm hoping to use it with the Thrift JDBC/ODBC server that ships with Spark 3 (./sbin/start-thriftserver.sh), but I'm getting the following error:

Traceback (most recent call last):
	6: from main.rb:3:in `<main>'
	5: from /Users/andrew/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/rbhive-1.0.0/lib/rbhive/t_c_l_i_connection.rb:56:in `tcli_connect'
	4: from /Users/andrew/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/rbhive-1.0.0/lib/rbhive/t_c_l_i_connection.rb:155:in `open_session'
	3: from /Users/andrew/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/rbhive-1.0.0/lib/thrift/t_c_l_i_service.rb:18:in `OpenSession'
	2: from /Users/andrew/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/rbhive-1.0.0/lib/thrift/t_c_l_i_service.rb:26:in `recv_OpenSession'
	1: from /Users/andrew/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/thrift-0.11.0.0/lib/thrift/client.rb:54:in `receive_message'
/Users/andrew/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/thrift-0.11.0.0/lib/thrift/protocol/binary_protocol.rb:131:in `read_message_begin': No version identifier, old protocol client? (Thrift::ProtocolException)

Script

require "rbhive"

RBHive.tcli_connect('localhost', 10000, {logger: Logger.new(STDOUT), hive_version: 13}) do |connection|
  p connection.fetch('SHOW TABLES')
end

I've tried it with all the hive_versions, but no luck.

Add `rbhive` support for Hive Thrift Server 2

Hive now has a proper Thrift server which supports multiple connections, transactions, and more. I'm not sure if rbhive in its current state supports Hive Thrift Server 2.

Opening this issue to track testing and possibly augmenting rbhive to support it.

couldn't install rbhive using 'gem install ...'

localhost:ext liuxiao$ gem install rbhive

Building native extensions. This could take a while...
ERROR: Error installing rbhive:
ERROR: Failed to build gem native extension.

No such file or directory - getcwd

Gem files will remain installed in /Users/liuxiao/.rvm/gems/ruby-2.2.2/gems/thrift-0.9.0 for inspection.
Results logged to /Users/liuxiao/.rvm/gems/ruby-2.2.2/extensions/x86_64-darwin-14/2.2.0/thrift-0.9.0/gem_make.out

Connection hanging

When trying to create a table, I open the connection and pass the table, like below:

RBHive.connect('hive.server.address', 10_000) do |connection|
  connection.create_table(table)
end

But with Hiveserver, I never get a response, the table doesn't get created, and the connection stays, never disconnecting. It does say that it's Executing Hive Query, and spells out the table creation statement, but that's it.

It's similar with Hiveserver2, only I do get a message about the connection happening. After that, it behaves like Hiveserver.

Thanks!

Problems sending query to server

I'm facing some issues with RBHive to connect to Hive 1.1(CDH 5.5).

When connecting using the following:

:server: localhost
:port: 10000
:options:
  :hive_version: 13
  :transport: :sasl
  :sasl_params:
    :username: test
    :password: tttt

I'm receiving:

org.apache.hive.service.cli.HiveSQLException: Invalid OperationHandle: OperationHandle [opType=EXECUTE_STATEMENT...

in the Hiveserver logs, and "No operation state found for handles - has the session been closed?" on the client side.

I looked at the RBHive gem code, and the return value of


 response = @client.GetOperationStatus(
        Hive2::Thrift::TGetOperationStatusReq.new(operationHandle: prepare_operation_handle(handles))
      )

is nil.

Does anyone know a workaround for this?

Thanks in advance!

Question also on stackoverflow: http://stackoverflow.com/questions/36484318/rbhive-problems-sending-query-to-server

Set credentials

Hi,

How do we set authentication params, like user and password, when using the tcli_connect or connect methods?

Max Rows in Fetch

Hello!

First off, thank you very much for your hard work on this project!

I recently was surprised by the default of max_rows = 100 here, so I thought I'd ask if there was a reason that it wouldn't make sense to effectively default max_rows to +Infinity? Would changing that involve significant technical hurdles?

Thanks!

#execute SET with leading spaces faulty in Hive 2 with :hive_version 12

There appears to be an issue with using SET commands where there are leading spaces in the query. This affects Hive 2 #tcli_connect with options { :hive_version => 12 }. In Hive 1 with #connect, however, this appears to work as expected. Compare the working:

RBHive.tcli_connect('hive.example.com', 10002, { :hive_version => 12 }) do |conn|
  conn.execute("SET mapred.fairscheduler.pool=swimming")
  conn.execute("SELECT * FROM chairs")
end

which results in mapred.fairscheduler.pool being set to swimming, with the seemingly faulty:

RBHive.tcli_connect('hive.example.com', 10002, { :hive_version => 12 }) do |conn|
  conn.execute("    SET mapred.fairscheduler.pool=swimming")
  conn.execute("SELECT * FROM chairs")
end

RBHive.tcli_connect('hive.example.com', 10002, { :hive_version => 12 }) do |conn|
  conn.execute(%Q{
    SET mapred.fairscheduler.pool=swimming
  })
  conn.execute("SELECT * FROM chairs")
end

which results in SET mapred.fairscheduler.pool being set to swimming and the setting not being applied (note the leading SET).

Invalid value of field serverProtocolVersion! (Thrift::ProtocolException)

I've had to upgrade Cloudera to CDH5. With Hue, I can connect to Hiveserver2, however, when I try through rbhive, I get 'Invalid value of field serverProtocolVersion! (Thrift::ProtocolException)' as an error when trying to connect. I looked at the Hive logs, and it seems like things connected okay. Here are the relevant lines from the Hive logs:

2014-03-01 15:18:40,655 INFO org.apache.hive.service.cli.thrift.ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V3
2014-03-01 15:18:40,789 WARN org.apache.hadoop.hive.conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
2014-03-01 15:18:40,877 WARN org.apache.hadoop.hive.conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
2014-03-01 15:18:40,878 INFO org.apache.hive.service.cli.CLIService: SessionHandle [bdb9f8f3-439b-483f-9874-01a1e02c9f27]: openSession()

Can't get results from fetches

I'm connecting to a spark-sql thrift server, which supports the HiveServer2 protocol. When I make queries using fetch, I can see that the data is returned, but it is in fetch_results.results.columns rather than fetch_results.results.rows, so the code always returns an empty array. Is this a known issue? Perhaps it's only a problem for spark-sql? I'd be happy to submit a patch, but it's unclear from some of the comments here if the gem is currently working for queries.

require "rbhive" fails

related lib version:
rbhive (0.6.0)
thrift (0.9.0)
thrift_client (0.9.2)
ruby version: 2.0.0-p0

exception stacktrace:

/home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/rbhive-0.6.0/lib/thrift/hive_metastore_types.rb:29:in `<top (required)>': Version is not a class (TypeError)
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/rbhive-0.6.0/lib/thrift/thrift_hive_metastore.rb:9:in `require_relative'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/rbhive-0.6.0/lib/thrift/thrift_hive_metastore.rb:9:in `<top (required)>'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/rbhive-0.6.0/lib/thrift/thrift_hive.rb:8:in `require_relative'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/rbhive-0.6.0/lib/thrift/thrift_hive.rb:8:in `<top (required)>'
  from /home/CORP/xiao.li/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/site_ruby/2.0.0/rubygems/core_ext/kernel_require.rb:45:in `require'
  from /home/CORP/xiao.li/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/site_ruby/2.0.0/rubygems/core_ext/kernel_require.rb:45:in `require'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/activesupport-3.2.14/lib/active_support/dependencies.rb:251:in `block in require'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/activesupport-3.2.14/lib/active_support/dependencies.rb:236:in `load_dependency'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/activesupport-3.2.14/lib/active_support/dependencies.rb:251:in `require'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/rbhive-0.6.0/lib/rbhive/connection.rb:4:in `<top (required)>'
  from /home/CORP/xiao.li/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/site_ruby/2.0.0/rubygems/core_ext/kernel_require.rb:45:in `require'
  from /home/CORP/xiao.li/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/site_ruby/2.0.0/rubygems/core_ext/kernel_require.rb:45:in `require'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/activesupport-3.2.14/lib/active_support/dependencies.rb:251:in `block in require'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/activesupport-3.2.14/lib/active_support/dependencies.rb:236:in `load_dependency'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/activesupport-3.2.14/lib/active_support/dependencies.rb:251:in `require'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/rbhive-0.6.0/lib/rbhive.rb:1:in `<top (required)>'
  from /home/CORP/xiao.li/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/site_ruby/2.0.0/rubygems/core_ext/kernel_require.rb:110:in `require'
  from /home/CORP/xiao.li/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/site_ruby/2.0.0/rubygems/core_ext/kernel_require.rb:110:in `rescue in require'
  from /home/CORP/xiao.li/.rvm/rubies/ruby-2.0.0-p0/lib/ruby/site_ruby/2.0.0/rubygems/core_ext/kernel_require.rb:35:in `require'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/activesupport-3.2.14/lib/active_support/dependencies.rb:251:in `block in require'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/activesupport-3.2.14/lib/active_support/dependencies.rb:236:in `load_dependency'
  from /home/CORP/xiao.li/.rvm/gems/ruby-2.0.0-p0@aptf/gems/activesupport-3.2.14/lib/active_support/dependencies.rb:251:in `require'

Question

Is this gem still being maintained? I only ask because I notice there are a few PRs that have been open for quite some time.

If it is: whoever is maintaining it, would you like some help?

Limitation to send commands (create/drop table, select...) for different databases

I was looking into your code and found that you don't support executing commands against different databases without a "USE [dbname]". If we don't set one, Hive assumes the 'default' database. It is kind of ugly to require users to send a 'use [dbname]' every time they open a connection to HiveServer for a different database. It would be nice to have something like:

RBHive.tcli_connect('hive.server.address', 10_000, 'dbname') do |connection|
  connection.fetch('SHOW TABLES')
end

Please let me know if you want me to make a PR.

fetch support for ADD JAR queries

I'm using rbhive on HiveServer2 to execute arbitrary commands and, if there is a response, to capture that. The fetch command works generally quite well. However, it seems to not support "ADD JAR ..." type commands for loading UDFs.

I can alternatively use the execute command. However, this has the disadvantage of not loading results in a simple way and of suppressing all errors.

Is there a way to modify fetch so that it supports arbitrary queries?

Thanks,
Kevin
