
pandahouse's Introduction

Pandahouse

Pandas interface for the ClickHouse HTTP API

Install

pip install pandahouse

Usage

Writing a dataframe to ClickHouse

from pandahouse import read_clickhouse, to_clickhouse

connection = {'host': 'http://clickhouse-host:8123',
              'database': 'test'}
affected_rows = to_clickhouse(df, table='name', connection=connection)

Reading an arbitrary ClickHouse query into pandas

df = read_clickhouse('SELECT * FROM {db}.table', index_col='id',
                     connection=connection)

pandahouse's People

Contributors

haakonvt, iancal, jleech, kszucs, kung-foo, letfoolsdie, noff, r7ar7a

pandahouse's Issues

DateTime64(ns) KeyError, when ns is set in table definition

Hey, thanks for your great work, this module has been really helpful!

using clickhouse-server version: 20.3.8.53

I have a small problem with DateTime64(ns) where the precision ns has been explicitly set in the table definition.

With a clickhouse table as follows:

CREATE TABLE db_name.tbl_name (
   Timestamp DateTime64(6) CODEC(Delta(8), LZ4),
   SomeData Float32 CODEC(LZ4)
) 
ENGINE = MergeTree() PARTITION BY toYYYYMMDD(Timestamp)
ORDER BY intHash64(toInt64(Timestamp)) 
SAMPLE BY intHash64(toInt64(Timestamp))

A query:

query = "SELECT DISTINCT * FROM db_name.tbl_name"
connection = {
    'host': server_name,
    'database': db_name,
    'user': 'default'
}    
pandahouse.read_clickhouse(query=query, index=False, connection=connection)

Results in:

pandahouse/core.py in read_clickhouse(query, tables, index, connection, **kwargs)
     56     lines = execute(query, external=external, stream=True,
     57                     connection=connection)
---> 58     return to_dataframe(lines, **kwargs)
     59 
     60 

pandahouse/convert.py in to_dataframe(lines, **kwargs)
     65     dtypes, parse_dates, converters = {}, [], {}
     66     for name, chtype in zip(names, types):
---> 67         dtype = CH2PD[chtype]
     68         if dtype == 'object':
     69             converters[name] = decode_escapes

KeyError: 'DateTime64(6)'

If I understand correctly, the mapping defined earlier in the file convert.py

MAPPING = {'object': 'String',
           'uint64': 'UInt64',
           'uint32': 'UInt32',
           'uint16': 'UInt16',
           'uint8': 'UInt8',
           'float64': 'Float64',
           'float32': 'Float32',
           'int64': 'Int64',
           'int32': 'Int32',
           'int16': 'Int16',
           'int8': 'Int8',
           'datetime64[D]': 'Date',
           'datetime64[ns]': 'DateTime'}

does not cover the DateTime64(6) case, or by extension any other parametrized DateTime64 case.
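
One possible direction, offered as a sketch rather than the shipped code: normalize any parametrized DateTime64(...) type name to the existing DateTime key before the CH2PD lookup. The regex and helper name below are hypothetical:

import re

# Hypothetical normalizer: DateTime64(6), DateTime64(3, 'UTC'), ... all
# parse fine as pandas datetime64[ns], which is what CH2PD already maps
# the plain 'DateTime' key to.
DATETIME64_RE = re.compile(r"^DateTime64\(\d+(,\s*'.*')?\)$")

def normalize_chtype(chtype):
    if DATETIME64_RE.match(chtype):
        return 'DateTime'
    return chtype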

I would be happy to contribute a solution with a little guidance.

Thanks

License

Please add a license to your repo.

KeyError: 'Nullable(String)'

I get the following error with one column, even though I do have other empty columns which do not cause any errors (for example aircraftCategory, which is a string and is NaN).

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-19-d531a02e1a3b> in <module>()
      2 
      3 connection = {'host': 'http://localhost:8123'}
----> 4 df = ph.read_clickhouse('SELECT hex, flight, recorded_date, recorded_datetime,recorded_longitude, recorded_latitude, recorded_squawk, recorded_altitude, recorded_verticalRate,recorded_track,recorded_speed,aircraftCategory FROM flightLog_MT LIMIT 1000000', connection=connection)

~/.pyenv/versions/3.6.2/envs/general_362/lib/python3.6/site-packages/pandahouse/core.py in read_clickhouse(query, tables, index, connection, **kwargs)
     56     lines = execute(query, external=external, stream=True,
     57                     connection=connection)
---> 58     return to_dataframe(lines, **kwargs)
     59 
     60 

~/.pyenv/versions/3.6.2/envs/general_362/lib/python3.6/site-packages/pandahouse/convert.py in to_dataframe(lines, **kwargs)
     60     dtypes, parse_dates, converters = {}, [], {}
     61     for name, chtype in zip(names, types):
---> 62         dtype = CH2PD[chtype]
     63         if dtype == 'object':
     64             converters[name] = decode_escapes

KeyError: 'Nullable(String)'
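
The cause is the same exact-match lookup as in the DateTime64 issue above: CH2PD has a 'String' key but no 'Nullable(String)' key. A sketch of one possible unwrapping step (the helper name is hypothetical):

import re

# Hypothetical helper: map Nullable(T) to T before the CH2PD lookup.
# NULL markers in integer columns would still need care, since numpy
# integer dtypes cannot hold NaN.
NULLABLE_RE = re.compile(r'^Nullable\((.+)\)$')

def unwrap_nullable(chtype):
    match = NULLABLE_RE.match(chtype)
    return match.group(1) if match else chtype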

Quotes added to the dataframe at push mess up DateTime columns

I'm reading data from a clickhouse db and writing to another one.

my script is basically this:

df=ph.read_clickhouse("SELECT * FROM test.testing", index=True, connection=dict(host=url1))
ph.to_clickhouse(df, table='testing', index=False,  chunksize=5000, connection=dict(database='test', host=url2))

The dataframe seems clean.
This is the output I'm getting:

Code: 27, e.displayText() = DB::Exception: Cannot parse input: expected " before: 6-02","Donnees","Tous Sexes","Tous Ages",83,17,"","0000-00-00 00:00:00","0000-00-00 00:00:00"
"51 Marne","2020-04-09","2020-06-02","Donnees","Tous Sexes","Tous : (at row 1)

Row 1:
Column 0,   name: dep,              type: String,   parsed text: "<DOUBLE QUOTE>51 Marne<DOUBLE QUOTE>"
Column 1,   name: timestamp_data,   type: DateTime, ERROR: text "<DOUBLE QUOTE>2020-04-0" is not like DateTime

 (version 20.3.8.53 (official build))
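
Until the writer is fixed, one workaround, as a sketch with a placeholder host and table (note that csv.QUOTE_NONE raises if a field really contains a tab), is to serialize the frame to TSV yourself and POST it through the HTTP interface:

import csv

import requests

# Serialize without CSV-style quoting so ClickHouse's TabSeparated
# parser never sees stray double quotes; format datetimes explicitly.
body = df.to_csv(sep='\t', header=False, index=False,
                 quoting=csv.QUOTE_NONE,
                 date_format='%Y-%m-%d %H:%M:%S')
resp = requests.post(
    'http://clickhouse-host:8123',  # placeholder for url2
    params={'query': 'INSERT INTO test.testing FORMAT TabSeparated'},
    data=body.encode('utf-8'),
)
resp.raise_for_status()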

Unable to reach http.execute()'s external parameter from the to_clickhouse() function

The core.to_clickhouse() function uses the http.execute() function under the hood; some request preparation happens there via the http.prepare() function.

But it is impossible to pass any value other than the hardcoded one to the execute(external=) parameter (which is used in request preparation), since the to_clickhouse() function exposes no interface for it.

It would be great if such an opportunity existed :D
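
A sketch of the requested pass-through; the signature and body below are hypothetical and only illustrate forwarding external instead of hardcoding it:

def to_clickhouse(df, table, connection=None, external=None, **kwargs):
    # Illustrative body, not the real core.py: build the INSERT and
    # forward `external` to execute() instead of pinning it internally.
    query = 'INSERT INTO {} FORMAT TSV'.format(table)
    data = df.to_csv(sep='\t', header=False, index=False)
    return execute(query, connection=connection, data=data,
                   external=external)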

Provide support for Enum

Right now pandahouse throws this error when you query a column with an Enum value in it:

df = pdch.read_clickhouse(query, connection=connection)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<timed exec> in <module>

~/anaconda3/lib/python3.6/site-packages/pandahouse/core.py in read_clickhouse(query, tables, index, connection, **kwargs)
     56     lines = execute(query, external=external, stream=True,
     57                     connection=connection)
---> 58     return to_dataframe(lines, **kwargs)
     59 
     60 

~/anaconda3/lib/python3.6/site-packages/pandahouse/convert.py in to_dataframe(lines, **kwargs)
     65     dtypes, parse_dates, converters = {}, [], {}
     66     for name, chtype in zip(names, types):
---> 67         dtype = CH2PD[chtype]
     68         if dtype == 'object':
     69             converters[name] = decode_escapes
KeyError: "Enum8(\\'one\\' = 1, \\'two\\' = 2)"

You can learn more about Enum in the ClickHouse documentation, if you need it.
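
A sketch of one possible fix in convert.py (the names here are hypothetical): since ClickHouse emits the enum's string value in TSV output, mapping any Enum column to a plain object (string) dtype is a reasonable default:

import re

ENUM_RE = re.compile(r'^Enum(8|16)\(')  # hypothetical helper

def lookup_dtype(chtype, ch2pd):
    # Fall back to a string column for enums instead of requiring an
    # exact key in the reverse mapping (ch2pd stands in for CH2PD).
    if ENUM_RE.match(chtype):
        return 'object'
    return ch2pd[chtype]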

Create table from dataframe?

Is there a way to automatically create a table matching the column types of a dataframe, similar to pandas.DataFrame.to_sql()?
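
Nothing built in, as far as the code shows, but a sketch is easy to derive from the dtype mapping quoted in the DateTime64 issue above. The engine and ORDER BY clause below are placeholders to adapt:

# Sketch: derive CREATE TABLE DDL from a dataframe's dtypes.
MAPPING = {'object': 'String', 'int64': 'Int64', 'float64': 'Float64',
           'datetime64[ns]': 'DateTime'}  # abbreviated

def create_table_ddl(df, table, order_by='tuple()'):
    columns = ', '.join('`{}` {}'.format(name, MAPPING[str(dtype)])
                        for name, dtype in df.dtypes.items())
    return 'CREATE TABLE {} ({}) ENGINE = MergeTree() ORDER BY {}'.format(
        table, columns, order_by)

The resulting statement could then be run with ph.execute(create_table_ddl(df, 'db.name'), connection=connection).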

Issue with integer columns: KeyError: '0'

Describe the bug
I try to create a table with pandahouse:

  create_table = f"CREATE TABLE data.sma({fields}) ENGINE=Log"
  ph.execute(create_table, connection=connection)

and got as a result:

Traceback (most recent call last):
  File "./SMA_test_alter.py", line 128, in <module>
    main()
  File "./SMA_test_alter.py", line 125, in main
    ch_client.execute('INSERT into data.sma VALUES', dfs.to_dict('records'))
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/client.py", line 261, in execute
    rv = self.process_insert_query(
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/client.py", line 478, in process_insert_query
    rv = self.send_data(sample_block, data,
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/client.py", line 529, in send_data
    block = block_cls(sample_block.columns_with_types, chunk,
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/block.py", line 39, in __init__
    self.data = self.normalize(data or [])
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/block.py", line 127, in normalize
    self._mutate_dicts_to_rows(data)
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/block.py", line 161, in _mutate_dicts_to_rows
    data[i] = [row[name] for name in column_names]
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/block.py", line 161, in <listcomp>
    data[i] = [row[name] for name in column_names]
KeyError: '0'

To Reproduce

create_table = "CREATE TABLE data.sma(`3` Float64 , `4` Float64 , `5` Float64 , `6` Float64 , `7` Float64 , `8` Float64 , `9` Float64 , `10` Float64 , `1-2` LowCardinality(String), `1-3` LowCardinality(String), `1-4` LowCardinality(String), `1-5` LowCardinality(String), `1-6` LowCardinality(String), `1-7` LowCardinality(String), `1-8` LowCardinality(String), `2-3` LowCardinality(String), `2-4` LowCardinality(String), `2-5` LowCardinality(String), `2-6` LowCardinality(String), `2-7` LowCardinality(String), `2-8` LowCardinality(String), `3-4` LowCardinality(String), `3-5` LowCardinality(String), `3-6` LowCardinality(String), `3-7` LowCardinality(String), `3-8` LowCardinality(String), `4-5` LowCardinality(String), `4-6` LowCardinality(String), `4-7` LowCardinality(String), `4-8` LowCardinality(String), `5-6` LowCardinality(String), `5-7` LowCardinality(String), `5-8` LowCardinality(String), `6-7` LowCardinality(String), `6-8` LowCardinality(String), `7-8` LowCardinality(String)) ENGINE=Log"

ph.execute(create_table, connection=connection)

Expected behavior
The table is created (I just copied the create_table string, and it works if I push it manually using the ClickHouse client).

Versions

  • Version of package with the problem.
    pandahouse==0.2.7

  • Python version.
    python -V
    Python 3.8.10
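
Note the traceback above comes from clickhouse_driver's dict-based insert rather than from the CREATE itself. One likely cause, offered as a guess: dfs has integer column labels while the table's column names are strings, so the row[name] lookup misses. A workaround sketch:

# Workaround sketch (assumes the mismatch is int vs. str column labels):
dfs.columns = dfs.columns.map(str)
ch_client.execute('INSERT INTO data.sma VALUES', dfs.to_dict('records'))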

Column Order and missing columns handling

Hi,
I wanted to know the following about this library

  1. Do the columns of the dataframe passed to ph.to_clickhouse have to be in the same order as the table in ClickHouse?
  2. Do all columns need to be passed? Does it use the default values defined in the table for columns that are not passed in the dataframe?

Thanks for your help in advance!

Double quote converts into backquote

I'm using pandahouse to execute a query with a WHERE condition on a string value that contains an apostrophe (e.g. Rock'n'Roll).
Normally I use double quotes in DataGrip and everything works just fine.
But it seems to me the read_clickhouse function just converts all double quotes into backquotes, which forces ClickHouse to treat the expression as a column.
For example, the string that I'm passing to the read_clickhouse function is as follows:

select * from sometable
where genre == "Rock'n'Roll"

As a result of the read_clickhouse call I get an error that ClickHouse doesn't see the column Rock'n'Roll, because what it executes is not the same string, but:

select * from sometable
where genre == `Rock'n'Roll`

Actually, all queries that contain double quotes fail because of this.
I can't even use the replaceAll(genre, "'", ' ') ClickHouse function, because it cannot see the column ' (which is strange).

Why is this happening?
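
For what it's worth, this matches ClickHouse's own SQL rules rather than a pandahouse rewrite: double-quoted tokens, like backquoted ones, are identifiers, and string literals must use single quotes. A minimal sketch with the apostrophe backslash-escaped inside a Python string:

query = "select * from sometable where genre == 'Rock\\'n\\'Roll'"
df = read_clickhouse(query, connection=connection)  # pandahouse's reader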

Switch to Parquet and ORC format using pyarrow

More recent versions of ClickHouse support the Apache Parquet and Apache ORC formats via Apache Arrow. Switching to these formats from CSV/TSV would make the ClickHouse <-> pandas conversions trivial and much faster.
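
On servers new enough to emit FORMAT Parquet, the read side can already be sketched outside pandahouse via the HTTP interface (the host and table below are placeholders):

import io

import pyarrow.parquet as pq
import requests

# Fetch a query result as Parquet over HTTP and load it with pyarrow.
resp = requests.post(
    'http://clickhouse-host:8123',
    params={'query': 'SELECT * FROM test.table FORMAT Parquet'},
)
resp.raise_for_status()
df = pq.read_table(io.BytesIO(resp.content)).to_pandas()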

Impossible to read nan values of float fields

Due to na_values=set(), keep_default_na=False in to_dataframe, no NaN values are supported, but ClickHouse can store nan, inf and -inf for Float32/64 values, and this should be supported in read_clickhouse.

Without patching, there is currently no way to load NaN-able data.
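
The pandas side of the proposed fix is straightforward; here is a minimal self-contained demonstration that whitelists only 'nan' (whether 'inf' and '-inf' should round-trip as infinities rather than NaN is a separate decision):

import io

import pandas as pd

# With keep_default_na=False and na_values=set() (what to_dataframe does
# today), 'nan' would survive as a string; whitelisting it restores NaN.
tsv = io.StringIO('a\t1.5\nb\tnan\n')
df = pd.read_csv(tsv, sep='\t', names=['key', 'val'],
                 na_values={'nan'}, keep_default_na=False)
print(df['val'].dtype)  # float64, with NaN in the second row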
