chdb-io / chdb

chDB is an embedded OLAP SQL Engine 🚀 powered by ClickHouse

Home Page: https://doc.chdb.io

License: Apache License 2.0

Topics: data-science database embedded-database olap python sql chdb clickhouse clickhouse-database clickhouse-server

chdb's Introduction


chDB

中文 (Chinese version)

chDB is an embedded SQL OLAP engine powered by ClickHouse [1]. For more details, see The birth of chDB.

Features

  • In-process SQL OLAP engine, powered by ClickHouse
  • No need to install ClickHouse
  • Minimized data copying from C++ to Python via the Python memoryview
  • Input and output support Parquet, CSV, JSON, Arrow, ORC, and 60+ more formats (see samples)
  • Supports the Python DB-API 2.0 (see example)

Arch

Get Started

Get started with chdb using our Installation and Usage Examples


Installation

Currently, chDB supports Python 3.8+ on macOS and Linux (x86_64 and ARM64).

pip install chdb

Usage

Run in command line

python3 -m chdb SQL [OutputFormat]

python3 -m chdb "SELECT 1,'abc'" Pretty

Data Input

The following methods are available to access on-disk and in-memory data formats:

๐Ÿ—‚๏ธ Query On File

(Parquet, CSV, JSON, Arrow, ORC and 60+)

You can execute SQL and get the result back in the desired format.

import chdb
res = chdb.query('select version()', 'Pretty'); print(res)

Work with Parquet or CSV

# See more output format types in tests/format_output.py
res = chdb.query('select * from file("data.parquet", Parquet)', 'JSON'); print(res)
res = chdb.query('select * from file("data.csv", CSV)', 'CSV');  print(res)
print(f"SQL read {res.rows_read()} rows, {res.bytes_read()} bytes, elapsed {res.elapsed()} seconds")

Pandas dataframe output

# See more in https://clickhouse.com/docs/en/interfaces/formats
chdb.query('select * from file("data.parquet", Parquet)', 'Dataframe')

๐Ÿ—‚๏ธ Query On Table

(Pandas DataFrame, Parquet file/bytes, Arrow bytes)

Query On Pandas DataFrame

import chdb.dataframe as cdf
import pandas as pd
# Join 2 DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ["one", "two", "three"]})
df2 = pd.DataFrame({'c': [1, 2, 3], 'd': ["①", "②", "③"]})
ret_tbl = cdf.query(sql="select * from __tbl1__ t1 join __tbl2__ t2 on t1.a = t2.c",
                  tbl1=df1, tbl2=df2)
print(ret_tbl)
# Query on the DataFrame Table
print(ret_tbl.query('select b, sum(a) from __table__ group by b'))

๐Ÿ—‚๏ธ Query with Stateful Session

from chdb import session as chs

## Create a DB, table, and view in a temp session; they are cleaned up automatically when the session is deleted.
sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic")
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table_xxx (x String, y Int) ENGINE = Log;")
sess.query("INSERT INTO db_xxx.log_table_xxx VALUES ('a', 1), ('b', 3), ('c', 2), ('d', 5);")
sess.query(
    "CREATE VIEW db_xxx.view_xxx AS SELECT * FROM db_xxx.log_table_xxx LIMIT 4;"
)
print("Select from view:\n")
print(sess.query("SELECT * FROM db_xxx.view_xxx", "Pretty"))

see also: test_stateful.py.

๐Ÿ—‚๏ธ Query with Python DB-API 2.0

import chdb.dbapi as dbapi
print("chdb driver version: {0}".format(dbapi.get_client_info()))

conn1 = dbapi.connect()
cur1 = conn1.cursor()
cur1.execute('select version()')
print("description: ", cur1.description)
print("data: ", cur1.fetchone())
cur1.close()
conn1.close()

๐Ÿ—‚๏ธ Query with UDF (User Defined Functions)

from chdb.udf import chdb_udf
from chdb import query

@chdb_udf()
def sum_udf(lhs, rhs):
    return int(lhs) + int(rhs)

print(query("select sum_udf(12,22)"))

see also: test_udf.py.

For more examples, see examples and tests.


Demos and Examples

Benchmark

Documentation

Events

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Here are some ways you can help:

  • Help test and report bugs
  • Help improve documentation
  • Help improve code quality and performance

Bindings

We welcome bindings for other languages, please refer to bindings for more details.

License

Apache 2.0, see LICENSE for more information.

Acknowledgments

chDB is mainly based on ClickHouse [1]. For trademark and other reasons, I named it chDB.

Contact


Footnotes

  1. ClickHouse® is a trademark of ClickHouse Inc. All trademarks, service marks, and logos mentioned or depicted are the property of their respective owners. The use of any third-party trademarks, brand names, product names, and company names does not imply endorsement, affiliation, or association with the respective owners.

chdb's People

Contributors

allcontributors[bot], auxten, daniel-robbins, laodouya, lmangani, nevinpuri, nmreadelf, reema93jain, yunyu


chdb's Issues

AVX2 Support

Great mod and very interesting binding approach! We're also playing with clickhouse-local custom builds towards a slimmer execution engine.

Question: Is there a specific reason for AVX2 to be enabled for Linux builds? This greatly reduces the scope of usage (i.e., Lambdas, virtualized CPUs, etc., fail with Instruction check fail. The CPU does not support AVX2 instruction set.)

Having this disabled (or possibly shipping multiple builds) could broaden the compatibility scope enormously. Also, the library doesn't seem to be stripped, and the distributed .so carries 100 MB+ of trimmable fat.

Thanks!

Enable jemalloc in chdb shared lib for linux

As analyzed in #19, we need to enable jemalloc in the chdb shared lib, following the tips in jemalloc/jemalloc#1237.

I disabled initial exec tls like this:

contrib/jemalloc-cmake/include_linux_x86_64/jemalloc/internal/jemalloc_internal_defs.h.in

@@ -139,7 +139,8 @@
 /* #undef JEMALLOC_MUTEX_INIT_CB */
 
 /* Non-empty if the tls_model attribute is supported. */
-#define JEMALLOC_TLS_MODEL __attribute__((tls_model("initial-exec")))
+// #define JEMALLOC_TLS_MODEL __attribute__((tls_model("initial-exec")))
+#define JEMALLOC_TLS_MODEL

But this will cause call_init error during dlopen:

Program received signal SIGSEGV, Segmentation fault.
0x00007f291e00fd29 in sallocx () from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
(gdb) bt
#0  0x00007f291e00fd29 in sallocx () from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
#1  0x00007f2916555a36 in operator delete(void*, unsigned long) () from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
#2  0x00007f291f45da44 in google::protobuf::internal::OnShutdownRun(void (*)(void const*), void const*) ()
   from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
#3  0x00007f291f493172 in google::protobuf::(anonymous namespace)::GeneratedDatabase() () from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
#4  0x00007f291f493370 in google::protobuf::DescriptorPool::InternalAddGeneratedFile(void const*, int) ()
   from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
#5  0x00007f291f50b937 in google::protobuf::(anonymous namespace)::AddDescriptors(google::protobuf::internal::DescriptorTable const*) ()
   from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
#6  0x00007f29212d0fe2 in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fff92a04d68, env=env@entry=0x1440b90) at dl-init.c:72
#7  0x00007f29212d10e9 in call_init (env=0x1440b90, argv=0x7fff92a04d68, argc=1, l=<optimized out>) at dl-init.c:30
#8  _dl_init (main_map=0x152ea80, argc=1, argv=0x7fff92a04d68, env=0x1440b90) at dl-init.c:119
#9  0x00007f292105caed in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:182
#10 0x00007f29212d5058 in dl_open_worker (a=a@entry=0x7fff92a034f0) at dl-open.c:758
#11 0x00007f292105ca90 in __GI__dl_catch_exception (exception=0x7fff92a034d0, operate=0x7f29212d4ca0 <dl_open_worker>, args=0x7fff92a034f0)
    at dl-error-skeleton.c:208
#12 0x00007f29212d48fa in _dl_open (file=0x7f2920d4e530 "/home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so", mode=-2147483646, caller_dlopen=0x61b611, 
    nsid=-2, argc=1, argv=0x7fff92a034d0, env=0x1440b90) at dl-open.c:837
#13 0x00007f2921293258 in dlopen_doit (a=a@entry=0x7fff92a03710) at dlopen.c:66
#14 0x00007f292105ca90 in __GI__dl_catch_exception (exception=exception@entry=0x7fff92a036b0, operate=0x7f2921293200 <dlopen_doit>, args=0x7fff92a03710)
    at dl-error-skeleton.c:208
#15 0x00007f292105cb4f in __GI__dl_catch_error (objname=0x1493150, errstring=0x1493158, mallocedp=0x1493148, operate=<optimized out>, args=<optimized out>)
    at dl-error-skeleton.c:227
#16 0x00007f2921293a65 in _dlerror_run (operate=operate@entry=0x7f2921293200 <dlopen_doit>, args=args@entry=0x7fff92a03710) at dlerror.c:170
#17 0x00007f29212932e4 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#18 0x000000000061b611 in ?? ()
#19 0x000000000061a2ca in ?? ()
#20 0x00000000005298c4 in ?? ()
#21 0x0000000000517b9b in _PyEval_EvalFrameDefault ()
#22 0x00000000005106ed in ?? ()
#23 0x0000000000528d21 in _PyFunction_Vectorcall ()
#24 0x0000000000516e76 in _PyEval_EvalFrameDefault ()
#25 0x0000000000528b63 in _PyFunction_Vectorcall ()
#26 0x0000000000512192 in _PyEval_EvalFrameDefault ()
#27 0x0000000000528b63 in _PyFunction_Vectorcall ()
#28 0x0000000000511fb5 in _PyEval_EvalFrameDefault ()
#29 0x0000000000528b63 in _PyFunction_Vectorcall ()
#30 0x0000000000511fb5 in _PyEval_EvalFrameDefault ()
#31 0x0000000000528b63 in _PyFunction_Vectorcall ()
#32 0x0000000000511fb5 in _PyEval_EvalFrameDefault ()
#33 0x0000000000528b63 in _PyFunction_Vectorcall ()
#34 0x000000000052842e in ?? ()
#35 0x000000000053f559 in _PyObject_CallMethodIdObjArgs ()
#36 0x000000000053e786 in PyImport_ImportModuleLevelObject ()
#37 0x00000000005144bd in _PyEval_EvalFrameDefault ()
#38 0x00000000005106ed in ?? ()
#39 0x0000000000510497 in _PyEval_EvalCodeWithName ()
#40 0x00000000005f5be3 in PyEval_EvalCode ()
#41 0x0000000000619de7 in ?? ()
#42 0x0000000000615610 in ?? ()
#43 0x0000000000459cb3 in ?? ()
#44 0x0000000000459911 in PyRun_InteractiveLoopFlags ()
#45 0x00000000006194f5 in PyRun_AnyFileExFlags ()
#46 0x000000000044bca9 in ?? ()
#47 0x00000000005ea6e9 in Py_BytesMain ()
#48 0x00007f2920f49d0a in __libc_start_main (main=0x5ea6b0, argc=1, argv=0x7fff92a04d68, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fff92a04d58) at ../csu/libc-start.c:308
#49 0x00000000005ea5ea in _start ()

need further dig...

Minified version with no table engines

A lot of the binary size comes from table engines that may not be relevant for in-process use cases, like the MergeTree engines, Log engines, etc.

It would be great to either have an easy way to compile with engines omitted, or a build that is effectively engine-free (except for some basics like url, s3, and file) for a far smaller binary. The expectation is that custom engines would be built on top of the url/s3 engines from #52 as sorts of aliases.

Cleaning up stale parts

👋

Just wanted to quickly check whether chdb automatically cleans up stale MergeTree parts after OPTIMIZE TABLE commands. And if so, how does it work?
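One way to check, sketched below with the session API from the README (the database and table names are made up, and this assumes chdb exposes system.parts the way clickhouse-local does): force a merge with OPTIMIZE TABLE ... FINAL, then look for inactive parts.

from chdb import session as chs

sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db ENGINE = Atomic")
sess.query("CREATE TABLE db.t (x Int64) ENGINE = MergeTree ORDER BY x")
sess.query("INSERT INTO db.t VALUES (1)")
sess.query("INSERT INTO db.t VALUES (2)")
sess.query("OPTIMIZE TABLE db.t FINAL")
# Stale (inactive) parts, if any, show up here with active = 0.
print(sess.query("SELECT name, active FROM system.parts WHERE table = 't'", "Pretty"))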

Segfault on handled exceptions

Error responses are causing core dumps on the latest version.

Example

The following query attempts to use a non-existing resource (could be local, remote, etc.):

import chdb
url = "https://does.not.exist"
query = f"""SELECT * FROM url('{url}', Parquet) LIMIT 5 """
res = chdb.query(query, 'Dataframe')
print(str(res.get_memview().tobytes()))

The expected exception is the following:

Code: 198. DB::Exception: Not found address of host: donot.exist: Cannot extract table structure from Parquet format file. You can specify the structure manually. (DNS_ERROR)

When using the latest chdb version (or any, really), errors consistently cause a core dump after the error response. This does not seem related to the chdb dynamic library, but rather to the Python changes:

Code: 198. DB::Exception: Not found address of host: donot.exist: Cannot extract table structure from Parquet format file. You can specify the structure manually. (DNS_ERROR)
/arrow/python/pyarrow/src/arrow/python/common.cc:178:  Check failed: (data_) != (nullptr) Null pointer in Py_buffer
/usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.1200(+0x1010758)[0x7fdd593b3758]
/usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.1200(_ZN5arrow4util8ArrowLogD1Ev+0xed)[0x7fdd593b3b3d]
/usr/local/lib/python3.8/dist-packages/pyarrow/libarrow_python.so(_ZN5arrow2py8PyBuffer4InitEP7_object+0xf5)[0x7fdd95ee0815]
/usr/local/lib/python3.8/dist-packages/pyarrow/libarrow_python.so(_ZN5arrow2py8PyBuffer12FromPyObjectEP7_object+0x13a)[0x7fdd95ee096a]
/usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1c5735)[0x7fdd85288735]
/usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0xd033e)[0x7fdd8519333e]
/usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x15a10c)[0x7fdd8521d10c]
/usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0xd033e)[0x7fdd8519333e]
/usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1b70d4)[0x7fdd8527a0d4]
python3(PyCFunction_Call+0x59)[0x5f6939]
python3(_PyObject_MakeTpCall+0x296)[0x5f7506]
python3(_PyEval_EvalFrameDefault+0x6259)[0x571019]
python3(_PyEval_EvalCodeWithName+0x26a)[0x5697da]
python3[0x59c396]
python3(_PyObject_MakeTpCall+0x1ff)[0x5f746f]
python3(_PyEval_EvalFrameDefault+0x5dce)[0x570b8e]
python3(_PyFunction_Vectorcall+0x1b6)[0x5f6ce6]
python3(_PyEval_EvalFrameDefault+0x72d)[0x56b4ed]
python3(_PyFunction_Vectorcall+0x1b6)[0x5f6ce6]
python3(_PyEval_EvalFrameDefault+0x72d)[0x56b4ed]
python3(_PyEval_EvalCodeWithName+0x26a)[0x5697da]
python3(_PyFunction_Vectorcall+0x393)[0x5f6ec3]
python3(_PyEval_EvalFrameDefault+0x5796)[0x570556]
python3(_PyEval_EvalCodeWithName+0x26a)[0x5697da]
python3(PyEval_EvalCode+0x27)[0x68e547]
python3[0x67dbf1]
python3[0x67dc6f]
python3[0x67dd11]
python3(PyRun_SimpleFileExFlags+0x197)[0x67fe37]
python3(Py_RunMain+0x212)[0x6b7c82]
python3(Py_BytesMain+0x2d)[0x6b800d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fddb3306083]
python3(_start+0x2e)[0x5fb85e]
Aborted (core dumped)

The same seems to apply to any error condition returned.

Opening for investigation.

Query csv with group by very slow

Describe the situation
import chdb
res=chdb.query('select count(*) cnt from file("/Users/xbsura/Downloads/organizations-2000000.csv", CSVWithNames) group by Name order by cnt desc', 'CSV')

wc -l /Users/xbsura/Downloads/organizations-2000000.csv
2000001 /Users/xbsura/Downloads/organizations-2000000.csv

head /Users/xbsura/Downloads/organizations-2000000.csv
Index,Organization Id,Name,Website,Country,Description,Founded,Industry,Number of employees
1,391dAA77fea9EC1,Daniel-Mcmahon,https://stuart-rios.biz/,Cambodia,Focused eco-centric help-desk,2013,Sports,1878
2,9FcCA4A23e6BcfA,"Mcdowell, Tate and Murray",http://jacobs.biz/,Guyana,Front-line real-time portal,2018,Legal Services,9743
3,DB23330238B7B3D,"Roberts, Carson and Trujillo",http://www.park.com/,Jordan,Innovative hybrid data-warehouse,1992,Hospitality,7537
4,bbf18835CFbEee7,"Poole, Jefferson and Merritt",http://hayden.com/,Cocos (Keeling) Islands,Extended regional Graphic Interface,1991,Food Production,9974

This SQL needs more than 1 minute to finish, and memory usage exceeds 100 GB.

  • Which ClickHouse server version to use
    res = chdb.query('select version()', 'Pretty'); print(res.data())
    โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
    โ”ƒ version() โ”ƒ
    โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
    โ”‚ 22.12.1.1 โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

  • Queries to run that lead to slow performance
    select count(*) cnt from file("/Users/xbsura/Downloads/organizations-2000000.csv", CSVWithNames) group by Name order by cnt desc

Expected performance
For a 200 MB file, less than 1 second would be reasonable.

Add chdb-cli to module

Discussed in #109

Originally posted by l1t1 September 13, 2023
Currently we must download and set up the chdb-cli.py file manually.
It would be easier if it could run as python -m chdb cli after pip install chdb; a sketch follows below.
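For reference, a minimal sketch of what such a built-in CLI loop might look like, using only the public chdb.query() API (the real chdb-cli.py may differ and would likely add history and multi-line input):

import chdb

while True:
    try:
        sql = input("chdb> ").strip()
    except EOFError:
        break
    if sql in ("quit", "exit"):
        break
    if sql:
        try:
            print(chdb.query(sql, "Pretty"))
        except Exception as e:  # keep the loop alive on bad SQL
            print(e)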

query remote clickhouse instance

Suppose I want to read foreign ClickHouse data.

If I typed the following:

select * from remote('database:port','database.table','user','password')

If I understand correctly, it just returns "Ok" and doesn't process the query.

Did I do something wrong? It seems like potentially desirable behavior to be able to read directly from Clickhouse instances.
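For comparison, a sketch of the same call with an explicit output format (host and credentials below are placeholders); without a format argument, some statements appear to return only "Ok":

import chdb

res = chdb.query(
    "SELECT * FROM remote('host:9000', 'db.table', 'user', 'password') LIMIT 10",
    "CSV",
)
print(res)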

Convert chdb result to pandas/DataFrame format

Use case

Convert chdb result to pandas/DataFrame format

Describe the solution you'd like

Make chdb query result work with pandas

Describe alternatives you've considered

Maybe a built-in result.to_df() method, or just an example of a to_df(result) helper.

Additional context

This requirement has the following constraints

  1. Do not make chdb itself depend on external Python libraries such as pyarrow
  2. Try not to add additional C++ library dependencies
  3. result.to_df() is recommended, but a utility function like to_df(result) is also acceptable (a sketch follows below)
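A minimal sketch of such a helper, assuming the documented Arrow output format and the result's get_memview() accessor, with pyarrow imported only inside the helper to respect constraint 1:

import chdb

def to_df(result):
    # Parse the Arrow IPC buffer returned by the "Arrow" output format
    # and hand it to pandas.
    import pyarrow as pa  # optional dependency, loaded lazily
    buf = pa.BufferReader(result.get_memview().tobytes())
    return pa.ipc.open_file(buf).read_all().to_pandas()

df = to_df(chdb.query("SELECT 1 AS x, 'a' AS y", "Arrow"))
print(df)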

Feature Request: chdb.exec() function

As discussed on Discord, we are considering adding an exec command to pre-provision any SET parameters, UDF functions, etc., with full persistence within a chdb session.

import chdb
exec = chdb.exec("CREATE FUNCTION hello AS () -> 'chDB'")
ret = chdb.query("SELECT hello()", "CSV")
print(ret.get_memview().tobytes())

Thread open for discussion and implementation proposals.

Pandas dataframe output introduced pyarrow and pandas dependency

Pandas DataFrame output (#6) introduced pyarrow and pandas dependencies and will be released in v0.5.0.
I'm wondering whether it's worth it; after all, pyarrow and pandas are not small. Starting a vote here:

  1. If you think it is worthwhile to add pyarrow and pandas dependencies to support DataFrame output, please click 👍
  2. If you think it is not worth it, please click 👎

Run a query on a pandas dataframe directly

Is there a plan to add the ability to run a SQL query directly on a pandas DataFrame?
Similar to https://pypi.org/project/sqldf/

import pandas as pd
import numpy as np
import chdb

# Create a dummy pd.Dataframe
df = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'], 'col2': ['F', np.NaN, 'G', 'H', 'I']})

# Define a SQL (Clickhouse) query
query = '''
SELECT *
FROM df
WHERE col1 IS NOT NULL;
'''

# Run the query
df_view = chdb.query(query, "Dataframe")

My current workaround is to 1) save the DataFrame to a Parquet file and then 2) run chdb.query on that file.
I could wrap this into a function that creates a temporary Parquet file,
but being able to query DataFrames directly would be much more convenient.
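A sketch of that wrapper (the __df__ placeholder is a made-up convention for this example, and df.to_parquet assumes pyarrow or fastparquet is installed):

import os
import tempfile

import chdb
import pandas as pd

def query_df(df: pd.DataFrame, sql: str, fmt: str = "CSV"):
    # Dump the DataFrame to a temporary Parquet file and rewrite the
    # hypothetical __df__ placeholder to chdb's file() table function.
    with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as f:
        path = f.name
    try:
        df.to_parquet(path)
        return chdb.query(sql.replace("__df__", f'file("{path}", Parquet)'), fmt)
    finally:
        os.remove(path)

df = pd.DataFrame({'col1': ['A', 'B', None, 'C', 'D']})
print(query_df(df, "SELECT * FROM __df__ WHERE col1 IS NOT NULL"))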

Installation on macOS Ventura 13.1 (M1) fails

pip install chdb

ERROR: Could not find a version that satisfies the requirement chdb (from versions: none)
ERROR: No matching distribution found for chdb

uname -a
Darwin xxxx 22.2.0 Darwin Kernel Version 22.2.0: Fri Nov 11 02:04:44 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T8103 arm64

python -V
Python 3.10.9

(Custom) table engine bindings

Being able to bind a custom table engine (not function) that exhibits dynamic behavior would be really powerful.

For example, similar to my Parquet merge engine IceDB, where I have an example of querying from a custom table function, it would be much nicer if the columns for filtering the metadata could be determined via the query predicate (WHERE conditions) rather than via function parameters.

This would allow you to fully decouple storage, compute, and metadata to build a serverless OLAP DB that natively supports multi-tenancy and dynamic schema (through column-name merging, an existing ClickHouse feature). This could then effectively be aliased to a list of Parquet files to read, with merge-on-name enabled (I can't find the ClickHouse docs for that setting to save my life).

Effectively this would intercept the AST in a very DX-friendly way and return a custom ClickHouse table function in its place before execution.

clickhouse-local and chdb performance issues on ClickBench Q.23 and Q.28

Performance on c6a.metal is really good, but on c6a.4xlarge is not good, I did some analysis.
SQLs on clickbench is start from 0.

So queries below on line N is Q.N-1 on clickbench.

Q.23

Q.23: SELECT * FROM file("hits_*.parquet", Parquet) WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10;
got:

chdb: [56.04940148999992,51.57905327000003,47.26720601400007]
duckdb: [55.858323720000044,11.130785724999896,11.137916557000153]
clickhouse-local: [56.544, 17.397, 17.414]

To prove my guess, I set LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 before running Q.23 (see screenshot).

But why doesn't chdb link jemalloc? That's a problem I haven't dug into deeply enough:

When jemalloc is simply linked into _chdb.cpython-xxxxx.so, importing it fails with:
ImportError: /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so: cannot allocate memory in static TLS block

As the performance impact is so large, I should solve this. Maybe just follow jemalloc/jemalloc#1237.

Q.28

Q.28: SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length(Referer)) AS l, COUNT(*) AS c, MIN(Referer) FROM file("hits_*.parquet", Parquet) WHERE Referer <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
got:

chdb: [10.385669515000018,10.016407472000083,10.284191555000007]
duckdb: [8.373684226000023,4.492152146999842,4.521108621999929]
clickhouse-local: [9.724, 9.377, 9.490]

Q.28 is mainly a regex performance problem; chdb eats all 16 cores of the c6a.4xlarge. Time consumption of chdb and its cousin clickhouse-local is nearly identical, but DuckDB runs this really fast. I think there are two possible explanations:

  1. re2 lib version or optimization issue.
  2. As we know, the ClickHouse engine doesn't use the min/max metadata in each Parquet file, so too many rows may go through REGEXP_REPLACE.

Tip: the ClickHouse Parquet file handling issue might be solved in v23.4: https://twitter.com/ClickHouseDB/status/1649085317000105985?s=20

I also expect the huge performance gap between the ClickHouse engine and DuckDB on Q.36, Q.37, Q.38, and Q.39 to be greatly improved in v23.4.

(Screenshot from ClickBench.)

As expected, LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 didn't improve Q.28 much (see screenshot).

Here is the raw test result of chdb on c6a.4xlarge:

  1. [0.4212477300000046,0.05675365600001214,0.05619672399996034],
  2. [0.4166838250000069,0.06531620699996665,0.06449468199997455],
  3. [1.321115270000007,0.11996398300004785,0.11983177199999773],
  4. [1.0048882890000073,0.19388485299998592,0.19680832800003145],
  5. [1.9418000499999835,1.3102912339999762,1.306730918000028],
  6. [1.7881873679999671,0.9799579250000079,0.9898529660000008],
  7. [0.4436242609999681,0.06424746199996889,0.06408176899998352],
  8. [0.4105360549999659,0.07395825700001524,0.07268457200001421],
  9. [1.4529841990000136,0.7407924660000162,0.7251662129999659],
  10. [3.591443298999991,0.908300875000009,0.8936401500000102],
  11. [2.30893248000001,0.3970227440000258,0.4045306449999657],
  12. [2.3094307900000217,0.44978438499998674,0.46010506099997883],
  13. [1.766951419999998,0.8745181759999809,0.8729436739999983],
  14. [3.655738981000013,1.3014796009999827,1.3102514689999794],
  15. [1.7923423009999624,0.9957754339999951,0.9786744300000123],
  16. [1.4113377190000165,0.904729954000004,0.8983218939999915],
  17. [4.241506091000019,2.110677882999994,2.108407218000025],
  18. [3.475192112000002,1.3591108809999923,1.3430213159999767],
  19. [7.629405652000003,4.1000120279999805,4.094294319000028],
  20. [0.9980636189999927,0.1998158640000156,0.20184096600002022],
  21. [11.467297670999983,3.3498150409999994,3.340530508000029],
  22. [12.402329799999961,3.8429053930000237,3.8139943720000247],
  23. [23.43172926599999,7.04715183899998,7.0258233459999815],
  24. [56.04940148999992,51.57905327000003,47.26720601400007],
  25. [3.778993186999969,0.9129657269999143,0.8343545320000203],
  26. [1.5488553679999768,0.7785552640000333,0.7703242139999702],
  27. [3.7575681749999603,0.7797505359999377,0.7788957500000606],
  28. [11.195327173999999,3.9467314390000183,4.043763985999931],
  29. [10.385669515000018,10.016407472000083,10.284191555000007],
  30. [3.109853629999975,2.6357215879999103,2.592264249999971],
  31. [4.27900334200001,1.0691001030000962,1.0309490470000355],
  32. [8.493144112999971,1.4274674520000872,1.41434988900005],
  33. [9.647242985000048,6.089477478999925,5.9719970529999955],
  34. [11.494503093000048,5.471921920999989,5.2813955599999645],
  35. [11.458392625000101,5.404140325000071,5.264485821999983],
  36. [1.4544796239999869,1.1437831629999664,1.1075418030000037],
  37. [13.741583547000005,3.498664385999973,3.5927922500000022],
  38. [12.745886966999933,3.499401634000037,3.489023333999967],
  39. [13.625658656999917,3.5325630399999,3.5357377320000296],
  40. [21.39596359199993,5.756981011999983,5.797184558000026],
  41. [6.0100831690000405,0.6963497419999385,0.6951165990000163],
  42. [6.212279076000073,0.6570028610000236,0.6571199239999714],
  43. [4.017423710000003,0.5795366049999302,0.505997210999908]

Example on how to query pyarrow table

The Query On Table section (Pandas DataFrame, Parquet file/bytes, Arrow bytes) of the README has a DataFrame example, but not a pyarrow one.

Also, a question: is the select zero-copy?

dbapi returns int as string

chdb driver version: 0.14.2
chdb version: 23.6.1.1
Python version: 3.9.6
macos catalina: 10.15.7

How to reproduce:
First, create a test table:

CREATE TABLE t1
(
    `id` Int64,
    c1 String 
)
ENGINE = ReplacingMergeTree
ORDER BY id;

insert into t1 values
(1,'a'),
(2,'b'),
(3,'c');

Python code that just reads data from table t1:

from chdb import dbapi

conn1 = dbapi.connect()
cur1 = conn1.cursor()

cur1.execute("select id, c1 from remote('host:9000','db','table','user','password') final")
result = cur1.fetchall()
print(result)

cur1.close()
conn1.close()

When I run the Python code above, I get the result below: the int id is returned as a string, and the rows are returned as tuples. Is it possible for dbapi to return results the same way as clickhouse_driver does in Python?

(('1', 'a'), ('2', 'b'), ('3', 'c'))

Support USE db_xxx in a chdb session

from chdb import session as chs

sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic;")
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table_xxx (x String, y Int) ENGINE = MergeTree ORDER BY x;")

sess.query("USE db_xxx;") # update current db to db_xxx

sess.query("INSERT INTO log_table_xxx (x, y) SELECT toString(rand()), rand() FROM numbers(1000000);")
sess.query("SELECT count(*) FROM log_table_xxx;")

See also: #82

Drop support for Python 3.7 after v0.9.0

Pandas started dropping Python 3.7 support in 2021:

  • Pandas 2.0 with the pyarrow backend requires at least Python 3.8.
  • The memfd_create feature for #36 was also introduced in Python 3.8.

So I decided to drop support for Python 3.7 after v0.9.0. For Python 3.7, the last available chdb version is 0.9.0.

Any feedback will be appreciated.

0.11.0: recursive_mutex lock failed: Invalid argument. (STD_EXCEPTION)

Stock queries are failing with chdb 0.11.0. Here's an example we know works as expected with chdb 0.10.x:

query

SELECT
    town,
    district,
    count() AS c,
    round(avg(price)) AS price
FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
GROUP BY
    town,
    district
LIMIT 10

Failing response

Code: 1001. DB::Exception: std::__1::system_error: recursive_mutex lock failed: Invalid argument. (STD_EXCEPTION)

Expected response

query results

Aggregate function bindings

Very similar to the function bindings, but can be used as an aggregate function. For example a query could be:

select myfunc(colA), colB from t group by colB

Where:

def myfunc(prev: int | None, row: int) -> int:
    # prev is the accumulator, row is the current row's value
    return 2 * row + prev if prev is not None else 0

(I'm not even sure that's valid Python; it's just a random example.)

prev being an accumulator, and row being the current row

query a parquet file 4 times slower than clickhouse local


Describe the situation
SELECT avg(i) FROM file('/data/t.parquet') group by round(log10(i));
chdb costs 400 s; clickhouse-local costs 100 s.
How to reproduce

  • ClickHouse server version: 23.6
  • Interface: CLI.py
  • Non-default settings, if any
  • CREATE TABLE statements for all tables involved
    select number::int i FROM numbers_mt(1,1000000000)t into outfile '/data/t.parquet';
  • Sample data for all these tables, use clickhouse-obfuscator if necessary
  • Queries to run that lead to slow performance
    SELECT avg(i) FROM file('/data/t.parquet') group by round(log10(i));

Expected performance
I hope chdb runs as fast as clickhouse-local.
Additional context
By the way, with the data-generation query
select number::int i FROM numbers_mt(1,1000000000)t into outfile '/data/t.parquet';
chdb runs as fast as clickhouse-local.

UPDATE and DELETE have no effect

Hey there, thanks for the amazing work on chdb!

I noticed that ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE statements don't seem to have any effect in chdb. Data simply stays unchanged.

The same does work in clickhouse-local.

How to reproduce

The examples below can also be seen in this Colab notebook.

UPDATE

from chdb import session as chs

sess = chs.Session()

sess.query("CREATE DATABASE test_db ENGINE = Atomic;")
sess.query("CREATE TABLE test_db.test_table (x String, y String) ENGINE = MergeTree ORDER BY tuple()")
sess.query("INSERT INTO test_db.test_table (x, y) VALUES ('A', 'B')")

print("Original values:")
print(sess.query("SELECT * FROM test_db.test_table"))

sess.query("ALTER TABLE test_db.test_table UPDATE y = 'updated' WHERE x = 'A' AND y = 'B'")

print('Values after UPDATE (expected "A", "updated"):')
print(sess.query("SELECT * FROM test_db.test_table"))

This prints:

Original values:
"A","B"

Values after UPDATE (expected "A", "updated"):
"A","B"

DELETE

from chdb import session as chs

sess = chs.Session()

sess.query("CREATE DATABASE test_db ENGINE = Atomic;")
sess.query("CREATE TABLE test_db.test_table (x String, y String) ENGINE = MergeTree ORDER BY tuple()")
sess.query("INSERT INTO test_db.test_table (x, y) VALUES ('A', 'B')")

print("Original values:")
print(sess.query("SELECT * FROM test_db.test_table"))

sess.query("ALTER TABLE test_db.test_table DELETE WHERE x = 'A' AND y = 'B'")

print("Values after DELETE (expected is no rows):")
print(sess.query("SELECT * FROM test_db.test_table"))

This prints:

Original values:
"A","B"

Values after DELETE (expected is no rows):
"A","B"

Expected behavior
Data is updated / deleted, the same way as in ClickHouse / clickhouse-local.

Additional context

I tested the same queries as shown above in a clickhouse-local session, started with clickhouse local --path .. There it works as expected.

I can reproduce the issue with chdb on my M1 Mac and on Google Colab

Question About Query On Pandas DataFrame

How can I handle a two-table join when querying pandas DataFrames?

import chdb.dataframe as cdf
import pandas as pd

tbl1 = cdf.Table(dataframe=pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}))
tbl2 = cdf.Table(dataframe=pd.DataFrame({'a': [4, 2, 3], 'c': ['c', 'd', 'c']}))


How can I join tbl1 and tbl2 on column 'a'? I can't work it out from the demo. Thanks!
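A sketch of one way to do it with the cdf.query() API from the README's own join example, where the __tbl1__/__tbl2__ placeholders resolve to the keyword arguments:

import chdb.dataframe as cdf
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
df2 = pd.DataFrame({'a': [4, 2, 3], 'c': ['c', 'd', 'c']})
# Join the two DataFrames on column 'a'.
ret = cdf.query(sql="SELECT * FROM __tbl1__ t1 JOIN __tbl2__ t2 ON t1.a = t2.a",
                tbl1=df1, tbl2=df2)
print(ret)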

Action Buildtime optimizations

The build_wheels.yml action is huge and could benefit from being split into multiple stages to gain a significant cross-build speedup. Currently all builders seem to run fully in parallel, leveraging neither git submodule caching nor library artifact sharing.

I might be wrong, as Python is not my forte, but assuming the same .so can be used by all binding builders on the same arch (is this the case?), splitting and chaining the clickhouse/chdb library build stage and the Python binding stage could save hours per release for the same OS/arch across a matrix of Python versions.

Note: not a request - posting to find a way to contribute and help out

Binaries instead of Strings in dataframe and arrow exports

Describe the unexpected behaviour
When querying Parquet files with chdb, strings become bytes in the DataFrame and Arrow formats. I don't see this issue with JSON or CSV.

How to reproduce

The query

SELECT AVG(prix) as prix_moy,
        pdvid,
        name,
        ville,
        type_carburant
FROM  s3('https://********.s3.eu-west-1.amazonaws.com/instantane.parquet', 'Parquet') AS p
LEFT JOIN s3('https://************.s3.eu-west-1.amazonaws.com/station.csv', '*****', '****', 'CSVWithNames') AS stations
ON p.pdvid = stations.id
GROUP BY all
ORDER BY prix_moy DESC;

The results with clickhouse-local: (screenshot "Capture d'écran 2023-08-19 à 14 41 11")

The result with chdb

In [25]: res
Out[25]:
       prix_moy     pdvid                                    name                        ville type_carburant
0         2.799  49480005  b"BP A11 AIRE DES PORTES D'ANGERS SUD"     b"Saint-Sylvain-D'Anjou"        b'SP98'
1         2.770  75014008                                    None                     b'Paris'        b'SP98'
2         2.740  75014008                                    None                     b'Paris'        b'SP95'
3         2.699  49160003                            b'SARL ROUX'    b'Longu\xc3\xa9-Jumelles'        b'SP98'
4         2.690  75016011                  b'Sarl STATION KLEBER'                     b'Paris'        b'SP98'
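Until this is fixed, a workaround sketch that decodes bytes cells back to str in pandas, assuming the data is UTF-8 (e.g. b'Longu\xc3\xa9-Jumelles' above decodes to 'Longué-Jumelles'):

# `res` is the DataFrame shown above.
for col in res.columns:
    res[col] = res[col].map(
        lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
    )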

Add arm64 build (and install from source)

Hello,

I tried to install chdb on Ubuntu arm64 and got this error from pip:

(venv) ubuntu@ip-172-31-31-61:~$ pip install chdb
ERROR: Could not find a version that satisfies the requirement chdb (from versions: none)
ERROR: No matching distribution found for chdb

Is it possible to add the source to PyPI or an arm64 build (or both 😉)?

Thanks

Unable to create MergeTree Table in a session: Database _temporary_and_external_tables doesn't exist.

Creating a MergeTree table results in an error with the message: Code: 81. DB::Exception: Database _temporary_and_external_tables doesn't exist. (UNKNOWN_DATABASE)

How to reproduce

  • modified from readme.md example
    from chdb import session as chs
    
    sess = chs.Session()
    sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic;")
    sess.query("use db_xxx;")
    sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table_xxx (x String, y Int) ENGINE = MergeTree ORDER BY x;")
    
    The use db_xxx; line doesn't seem to affect the outcome; the error is generated consistently.

Expected behavior
No error message. Table is created.

Rust support

I'm hoping to add support for Rust and would like to get involved. Where should I start?

Date type transformed to UInt16 when exported to DataFrame

Describe the issue
When a query returns a Date type and is exported to DataFrame or Arrow, it is transformed to a UInt16.

How to reproduce

res = chdb.query('SELECT toDate(now())', 'Dataframe')
res
>>>toDate(now())
>>>0          19585

res = chdb.query('SELECT toDate32(now())', 'Dataframe')
res
>>>toDate32(now())
>>>0      2023-08-16
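Until this is fixed, a workaround sketch: the UInt16 value is a day count since the Unix epoch (19585 days is 2023-08-16), so pandas can convert it back:

import chdb
import pandas as pd

df = chdb.query("SELECT toDate(now()) AS d", "Dataframe")
# Interpret the UInt16 day count as days since 1970-01-01.
df["d"] = pd.to_datetime(df["d"], unit="D", origin="unix")
print(df)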

Distributed query processing

Like distributed queries in clickhouse, it would be great to have a (semi-)native way to process distributed queries.

For example, being able to divide a list of Parquet files up onto multiple hosts running chdb in-process, and then having them reduce down to the initial node that ran the query. This is somewhat possible manually (in theory).

Processes like choosing which files go to which hosts could be left to the developer, but the cross-host map-reduce is the functionality that would ideally be native. Even if the data passing went through some binary RPC calls the developer has to implement, telling chDB that it is running a partial query and must deliver results that can be aggregated down on a single final worker is something that would need to live in chDB itself; a manual sketch follows below.
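A sketch of how this can be hand-rolled today with nothing but chdb.query() and the file() table function (the file globs, the k column, and the RPC transport between hosts are all assumptions):

import chdb

def worker_partial(files_glob: str) -> bytes:
    # Runs on each worker host: pre-aggregate its own subset of Parquet
    # files and ship the partial result back as Parquet bytes.
    res = chdb.query(
        f'SELECT k, count() AS c FROM file("{files_glob}", Parquet) GROUP BY k',
        "Parquet",
    )
    return res.get_memview().tobytes()

# Coordinator: gather partial results via your own RPC layer (not shown),
# persist them, and aggregate the aggregates in one final query.
partials = [worker_partial("hits_0*.parquet"), worker_partial("hits_1*.parquet")]
for i, part in enumerate(partials):
    with open(f"partial_{i}.parquet", "wb") as f:
        f.write(part)
print(chdb.query(
    'SELECT k, sum(c) AS c FROM file("partial_*.parquet", Parquet) GROUP BY k',
    "CSV",
))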

Cannot redirect stderr in chdb.query()

Describe the unexpected behaviour
In versions above 0.10, chdb.query() prints error messages to stderr,
but they cannot be redirected or captured into a variable.

How to reproduce

import chdb
from io import StringIO
import sys
from contextlib import redirect_stderr

f = StringIO()
with redirect_stderr(f):
    res = chdb.query("select 1", "csv")
    # sys.stderr.write("write stderr in python")
err = f.getvalue()

print("Capture err by redirect_stderr: ", err)

Expected behavior
The error message is caught and stored in the err variable; a workaround sketch follows below.
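A workaround sketch: contextlib.redirect_stderr only rebinds the Python-level sys.stderr object, while chdb's native code writes straight to file descriptor 2, so capturing has to happen at the fd level, e.g. with os.dup2:

import os
import tempfile

import chdb

def query_capture_stderr(sql, fmt="CSV"):
    # Temporarily point fd 2 at a temp file, run the query, then restore it.
    saved = os.dup(2)
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)
        try:
            res = chdb.query(sql, fmt)
        finally:
            os.dup2(saved, 2)
            os.close(saved)
        tmp.seek(0)
        return res, tmp.read().decode(errors="replace")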

Make static library

Thanks for your great work! Is it possible to compile a chdb static library?

RowBinary

👋

Is it possible to insert RowBinary using sessions? Right now the following query hangs, as I don't know how to provide it with the data:

from chdb import session as chs
s = chs.Session()
s.query("create database dev")
s.query("create table dev.example(a UInt64) engine MergeTree order by tuple()")
s.query("insert into dev.example format RowBinary")

cannot import in amazon2 linux

Describe what's wrong

I'm trying to use chdb on AWS Lambda and use Docker for testing and building the application. When I import chdb, it crashes because ClickHouse can't access /etc/localtime. The file exists; I also tried changing the timezone, and that failed too.

How to reproduce

(base) โžœ  ~ docker run -d --name chdb --rm public.ecr.aws/lambda/python:3.11
f500dc48b67d268020b2034d92b84f3af7505edcfbce2dc84cb955b296efc13c
docker exec -it chdb /bin/bash
bash-4.2# pip install chdb
Collecting chdb
  Downloading chdb-0.11.5-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (90.0 MB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 90.0/90.0 MB 3.5 MB/s eta 0:00:00
Installing collected packages: chdb
Successfully installed chdb-0.11.5
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: pip install --upgrade pip
Python 3.11.4 (main, Jul 31 2023, 08:41:59) [GCC 7.3.1 20180712 (Red Hat 7.3.1-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import chdb
Poco::Exception. Code: 1000, e.code() = 0, Exception: Cannot load time zone /etc/localtime (version 23.6.1.1)
