
pandahouse's Introduction

Pandahouse

Pandas interface for the ClickHouse HTTP API

Install

pip install pandahouse

Usage

Writing a dataframe to ClickHouse

from pandahouse import read_clickhouse, to_clickhouse

connection = {'host': 'http://clickhouse-host:8123',
              'database': 'test'}
affected_rows = to_clickhouse(df, table='name', connection=connection)

Reading an arbitrary ClickHouse query into pandas

df = read_clickhouse('SELECT * FROM {db}.table', index_col='id',
                     connection=connection)

pandahouse's People

Contributors

haakonvt, iancal, jleech, kszucs, kung-foo, letfoolsdie, noff, r7ar7a

pandahouse's Issues

DateTime64(ns) KeyError, when ns is set in table definition

Hey, thanks for your great work, this module has been really helpful!

using clickhouse-server version: 20.3.8.53

I have a small problem with DateTime64(ns) where the precision ns has been explicitly set in the table definition.

With a clickhouse table as follows:

CREATE TABLE db_name.tbl_name (
   Timestamp DateTime64(6) CODEC(Delta(8), LZ4),
   SomeData Float32 CODEC(LZ4)
) 
ENGINE = MergeTree() PARTITION BY toYYYYMMDD(Timestamp)
ORDER BY intHash64(toInt64(Timestamp)) 
SAMPLE BY intHash64(toInt64(Timestamp))

A query:

query = "SELECT DISTINCT * FROM db_name.tbl_name"
connection = {
    'host': server_name,
    'database': db_name,
    'user': 'default'
}    
pandahouse.read_clickhouse(query=query, index=False, connection=connection)

Results in:

pandahouse/core.py in read_clickhouse(query, tables, index, connection, **kwargs)
     56     lines = execute(query, external=external, stream=True,
     57                     connection=connection)
---> 58     return to_dataframe(lines, **kwargs)
     59 
     60 

pandahouse/convert.py in to_dataframe(lines, **kwargs)
     65     dtypes, parse_dates, converters = {}, [], {}
     66     for name, chtype in zip(names, types):
---> 67         dtype = CH2PD[chtype]
     68         if dtype == 'object':
     69             converters[name] = decode_escapes

KeyError: 'DateTime64(6)'

If I understand correctly, the mapping defined earlier in the file convert.py

MAPPING = {'object': 'String',
           'uint64': 'UInt64',
           'uint32': 'UInt32',
           'uint16': 'UInt16',
           'uint8': 'UInt8',
           'float64': 'Float64',
           'float32': 'Float32',
           'int64': 'Int64',
           'int32': 'Int32',
           'int16': 'Int16',
           'int8': 'Int8',
           'datetime64[D]': 'Date',
           'datetime64[ns]': 'DateTime'}

does not cover the DateTime64(6) case, or by extension any other parametrized DateTime64 case.
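
One possible direction, offered as a sketch rather than the shipped code: normalize any parametrized DateTime64(...) type name to the existing DateTime key before the CH2PD lookup. The regex and helper name below are hypothetical:

import re

# Hypothetical normalizer: DateTime64(6), DateTime64(3, 'UTC'), ... all
# parse fine as pandas datetime64[ns], which is what CH2PD already maps
# the plain 'DateTime' key to.
DATETIME64_RE = re.compile(r"^DateTime64\(\d+(,\s*'.*')?\)$")

def normalize_chtype(chtype):
    if DATETIME64_RE.match(chtype):
        return 'DateTime'
    return chtype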

I would be happy to contribute a solution with a little guidance.

Thanks

License

Please add a license to your repo.

KeyError: 'Nullable(String)'

I get the following error with one column, even though I do have other empty columns which do not cause any errors (for example aircraftCategory, which is a string and is NaN).

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-19-d531a02e1a3b> in <module>()
      2 
      3 connection = {'host': 'http://localhost:8123'}
----> 4 df = ph.read_clickhouse('SELECT hex, flight, recorded_date, recorded_datetime,recorded_longitude, recorded_latitude, recorded_squawk, recorded_altitude, recorded_verticalRate,recorded_track,recorded_speed,aircraftCategory FROM flightLog_MT LIMIT 1000000', connection=connection)

~/.pyenv/versions/3.6.2/envs/general_362/lib/python3.6/site-packages/pandahouse/core.py in read_clickhouse(query, tables, index, connection, **kwargs)
     56     lines = execute(query, external=external, stream=True,
     57                     connection=connection)
---> 58     return to_dataframe(lines, **kwargs)
     59 
     60 

~/.pyenv/versions/3.6.2/envs/general_362/lib/python3.6/site-packages/pandahouse/convert.py in to_dataframe(lines, **kwargs)
     60     dtypes, parse_dates, converters = {}, [], {}
     61     for name, chtype in zip(names, types):
---> 62         dtype = CH2PD[chtype]
     63         if dtype == 'object':
     64             converters[name] = decode_escapes

KeyError: 'Nullable(String)'
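
The cause is the same exact-match lookup as in the DateTime64 issue above: CH2PD has a 'String' key but no 'Nullable(String)' key. A sketch of one possible unwrapping step (the helper name is hypothetical):

import re

# Hypothetical helper: map Nullable(T) to T before the CH2PD lookup.
# NULL markers in integer columns would still need care, since numpy
# integer dtypes cannot hold NaN.
NULLABLE_RE = re.compile(r'^Nullable\((.+)\)$')

def unwrap_nullable(chtype):
    match = NULLABLE_RE.match(chtype)
    return match.group(1) if match else chtype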

Quotes added to the dataframe at push mess up DateTime columns

I'm reading data from a clickhouse db and writing to another one.

my script is basically this:

df=ph.read_clickhouse("SELECT * FROM test.testing", index=True, connection=dict(host=url1))
ph.to_clickhouse(df, table='testing', index=False,  chunksize=5000, connection=dict(database='test', host=url2))

The dataframe seems clean.
This is the output I'm getting:

Code: 27, e.displayText() = DB::Exception: Cannot parse input: expected " before: 6-02","Donnees","Tous Sexes","Tous Ages",83,17,"","0000-00-00 00:00:00","0000-00-00 00:00:00"
"51 Marne","2020-04-09","2020-06-02","Donnees","Tous Sexes","Tous : (at row 1)

Row 1:
Column 0,   name: dep,              type: String,   parsed text: "<DOUBLE QUOTE>51 Marne<DOUBLE QUOTE>"
Column 1,   name: timestamp_data,   type: DateTime, ERROR: text "<DOUBLE QUOTE>2020-04-0" is not like DateTime

 (version 20.3.8.53 (official build))
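
Until the writer is fixed, one workaround, as a sketch with a placeholder host and table (note that csv.QUOTE_NONE raises if a field really contains a tab), is to serialize the frame to TSV yourself and POST it through the HTTP interface:

import csv

import requests

# Serialize without CSV-style quoting so ClickHouse's TabSeparated
# parser never sees stray double quotes; format datetimes explicitly.
body = df.to_csv(sep='\t', header=False, index=False,
                 quoting=csv.QUOTE_NONE,
                 date_format='%Y-%m-%d %H:%M:%S')
resp = requests.post(
    'http://clickhouse-host:8123',  # placeholder for url2
    params={'query': 'INSERT INTO test.testing FORMAT TabSeparated'},
    data=body.encode('utf-8'),
)
resp.raise_for_status()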

Unable to reach http.execute()'s external parameter from the to_clickhouse() function

The core.to_clickhouse() function uses the http.execute() function under the hood; some request preparation happens there via the http.prepare() function.

But it is impossible to pass any value other than the hardcoded one to the execute(external=) parameter (which is used in request preparation), since the to_clickhouse() function exposes no interface for it.

It would be great if such an opportunity existed :D
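
A sketch of the requested pass-through; the signature and body below are hypothetical and only illustrate forwarding external instead of hardcoding it:

def to_clickhouse(df, table, connection=None, external=None, **kwargs):
    # Illustrative body, not the real core.py: build the INSERT and
    # forward `external` to execute() instead of pinning it internally.
    query = 'INSERT INTO {} FORMAT TSV'.format(table)
    data = df.to_csv(sep='\t', header=False, index=False)
    return execute(query, connection=connection, data=data,
                   external=external)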

Provide support for Enum

Right now pandahouse throws this error when you query a column with an Enum value in it:

df = pdch.read_clickhouse(query, connection=connection)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<timed exec> in <module>

~/anaconda3/lib/python3.6/site-packages/pandahouse/core.py in read_clickhouse(query, tables, index, connection, **kwargs)
     56     lines = execute(query, external=external, stream=True,
     57                     connection=connection)
---> 58     return to_dataframe(lines, **kwargs)
     59 
     60 

~/anaconda3/lib/python3.6/site-packages/pandahouse/convert.py in to_dataframe(lines, **kwargs)
     65     dtypes, parse_dates, converters = {}, [], {}
     66     for name, chtype in zip(names, types):
---> 67         dtype = CH2PD[chtype]
     68         if dtype == 'object':
     69             converters[name] = decode_escapes
KeyError: "Enum8(\\'one\\' = 1, \\'two\\' = 2)"

You can learn more about Enum in the ClickHouse documentation, if you need it.
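
A sketch of one possible fix in convert.py (the names here are hypothetical): since ClickHouse emits the enum's string value in TSV output, mapping any Enum column to a plain object (string) dtype is a reasonable default:

import re

ENUM_RE = re.compile(r'^Enum(8|16)\(')  # hypothetical helper

def lookup_dtype(chtype, ch2pd):
    # Fall back to a string column for enums instead of requiring an
    # exact key in the reverse mapping (ch2pd stands in for CH2PD).
    if ENUM_RE.match(chtype):
        return 'object'
    return ch2pd[chtype]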

Create table from dataframe?

Is there a way to automatically create a table matching the column types of a dataframe, similar to pandas.DataFrame.to_sql()?
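
Nothing built in, as far as the code shows, but a sketch is easy to derive from the dtype mapping quoted in the DateTime64 issue above. The engine and ORDER BY clause below are placeholders to adapt:

# Sketch: derive CREATE TABLE DDL from a dataframe's dtypes.
MAPPING = {'object': 'String', 'int64': 'Int64', 'float64': 'Float64',
           'datetime64[ns]': 'DateTime'}  # abbreviated

def create_table_ddl(df, table, order_by='tuple()'):
    columns = ', '.join('`{}` {}'.format(name, MAPPING[str(dtype)])
                        for name, dtype in df.dtypes.items())
    return 'CREATE TABLE {} ({}) ENGINE = MergeTree() ORDER BY {}'.format(
        table, columns, order_by)

The resulting statement could then be run with ph.execute(create_table_ddl(df, 'db.name'), connection=connection).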

Issue with integer columns: KeyError: '0'

Describe the bug
I try to create a table with pandahouse:

  create_table = f"CREATE TABLE data.sma({fields}) ENGINE=Log"
  ph.execute(create_table, connection=connection)

and got as a result:

Traceback (most recent call last):
  File "./SMA_test_alter.py", line 128, in <module>
    main()
  File "./SMA_test_alter.py", line 125, in main
    ch_client.execute('INSERT into data.sma VALUES', dfs.to_dict('records'))
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/client.py", line 261, in execute
    rv = self.process_insert_query(
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/client.py", line 478, in process_insert_query
    rv = self.send_data(sample_block, data,
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/client.py", line 529, in send_data
    block = block_cls(sample_block.columns_with_types, chunk,
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/block.py", line 39, in __init__
    self.data = self.normalize(data or [])
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/block.py", line 127, in normalize
    self._mutate_dicts_to_rows(data)
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/block.py", line 161, in _mutate_dicts_to_rows
    data[i] = [row[name] for name in column_names]
  File "/home/1/.local/lib/python3.8/site-packages/clickhouse_driver/block.py", line 161, in <listcomp>
    data[i] = [row[name] for name in column_names]
KeyError: '0'

To Reproduce

create_table = "CREATE TABLE data.sma(`3` Float64 , `4` Float64 , `5` Float64 , `6` Float64 , `7` Float64 , `8` Float64 , `9` Float64 , `10` Float64 , `1-2` LowCardinality(String), `1-3` LowCardinality(String), `1-4` LowCardinality(String), `1-5` LowCardinality(String), `1-6` LowCardinality(String), `1-7` LowCardinality(String), `1-8` LowCardinality(String), `2-3` LowCardinality(String), `2-4` LowCardinality(String), `2-5` LowCardinality(String), `2-6` LowCardinality(String), `2-7` LowCardinality(String), `2-8` LowCardinality(String), `3-4` LowCardinality(String), `3-5` LowCardinality(String), `3-6` LowCardinality(String), `3-7` LowCardinality(String), `3-8` LowCardinality(String), `4-5` LowCardinality(String), `4-6` LowCardinality(String), `4-7` LowCardinality(String), `4-8` LowCardinality(String), `5-6` LowCardinality(String), `5-7` LowCardinality(String), `5-8` LowCardinality(String), `6-7` LowCardinality(String), `6-8` LowCardinality(String), `7-8` LowCardinality(String)) ENGINE=Log"

ph.execute(create_table, connection=connection)

Expected behavior
The table is created (I just copied the create_table string, and it works if I push it manually using the ClickHouse client).

Versions

  • Version of package with the problem.
    pandahouse==0.2.7

  • Python version.
    python -V
    Python 3.8.10
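
Note the traceback above comes from clickhouse_driver's dict-based insert rather than from the CREATE itself. One likely cause, offered as a guess: dfs has integer column labels while the table's column names are strings, so the row[name] lookup misses. A workaround sketch:

# Workaround sketch (assumes the mismatch is int vs. str column labels):
dfs.columns = dfs.columns.map(str)
ch_client.execute('INSERT INTO data.sma VALUES', dfs.to_dict('records'))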

Column Order and missing columns handling

Hi,
I wanted to know the following about this library

  1. Do the columns of the dataframe passed to ph.to_clickhouse have to be in the same order as the table in ClickHouse?
  2. Do all columns need to be passed? Does it use the default values defined in the table for columns that are not passed in the dataframe?

Thanks for your help in advance!

Double quote converts into backquote

I'm using pandahouse to execute a query with a WHERE condition on a string value that contains an apostrophe (e.g. Rock'n'Roll).
Normally I use double quotes in DataGrip and everything works just fine.
But it seems to me the read_clickhouse function just converts all double quotes into backquotes, which forces ClickHouse to treat the expression as a column.
For example, the string that I'm passing to the read_clickhouse function is as follows:

select * from sometable
where genre == "Rock'n'Roll"

As a result of the read_clickhouse call I get an error that ClickHouse doesn't see the column Rock'n'Roll, because what it executes is not the same string, but:

select * from sometable
where genre == `Rock'n'Roll`

Actually, all queries that contain double quotes fail because of this.
I can't even use the replaceAll(genre, "'", ' ') ClickHouse function, because it cannot see the column ' (which is strange).

Why is this happening?
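
For what it's worth, this matches ClickHouse's own SQL rules rather than a pandahouse rewrite: double-quoted tokens, like backquoted ones, are identifiers, and string literals must use single quotes. A minimal sketch with the apostrophe backslash-escaped inside a Python string:

query = "select * from sometable where genre == 'Rock\\'n\\'Roll'"
df = read_clickhouse(query, connection=connection)  # pandahouse's reader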

Switch to Parquet and ORC format using pyarrow

More recent versions of ClickHouse support the Apache Parquet and Apache ORC formats via Apache Arrow. Switching to these formats from CSV/TSV would make the ClickHouse <-> pandas conversions trivial and much faster.
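
On servers new enough to emit FORMAT Parquet, the read side can already be sketched outside pandahouse via the HTTP interface (the host and table below are placeholders):

import io

import pyarrow.parquet as pq
import requests

# Fetch a query result as Parquet over HTTP and load it with pyarrow.
resp = requests.post(
    'http://clickhouse-host:8123',
    params={'query': 'SELECT * FROM test.table FORMAT Parquet'},
)
resp.raise_for_status()
df = pq.read_table(io.BytesIO(resp.content)).to_pandas()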

Impossible to read nan values of float fields

Due to na_values=set(), keep_default_na=False in to_dataframe, no NaN values are supported, but ClickHouse can store nan, inf and -inf for Float32/64 values, and this should be supported in read_clickhouse.

Without patching, there is currently no way to load NaN-able data.
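
The pandas side of the proposed fix is straightforward; here is a minimal self-contained demonstration that whitelists only 'nan' (whether 'inf' and '-inf' should round-trip as infinities rather than NaN is a separate decision):

import io

import pandas as pd

# With keep_default_na=False and na_values=set() (what to_dataframe does
# today), 'nan' would survive as a string; whitelisting it restores NaN.
tsv = io.StringIO('a\t1.5\nb\tnan\n')
df = pd.read_csv(tsv, sep='\t', names=['key', 'val'],
                 na_values={'nan'}, keep_default_na=False)
print(df['val'].dtype)  # float64, with NaN in the second row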
