chdb-io / chdb

chDB is an embedded OLAP SQL Engine 🚀 powered by ClickHouse

Home Page: https://doc.chdb.io

License: Apache License 2.0

Topics: data-science database embedded-database olap python sql chdb clickhouse clickhouse-database clickhouse-server

chdb's Introduction


chDB

中文 (Chinese version)

chDB is an embedded SQL OLAP engine powered by ClickHouse [1]. For more details, see The birth of chDB.

Features

  • In-process SQL OLAP engine, powered by ClickHouse
  • No need to install ClickHouse
  • Minimized data copying from C++ to Python via the Python memoryview
  • Input and output support Parquet, CSV, JSON, Arrow, ORC, and 60+ more formats (see samples)
  • Supports the Python DB-API 2.0 (see example)

Arch

Get Started

Get started with chdb using our Installation and Usage Examples


Installation

Currently, chDB supports Python 3.8+ on macOS and Linux (x86_64 and ARM64).

pip install chdb

Usage

Run in command line

python3 -m chdb SQL [OutputFormat]

python3 -m chdb "SELECT 1,'abc'" Pretty

Data Input

The following methods are available to access on-disk and in-memory data formats:

๐Ÿ—‚๏ธ Query On File

(Parquet, CSV, JSON, Arrow, ORC and 60+)

You can execute SQL and get the result back in the desired format.

import chdb
res = chdb.query('select version()', 'Pretty'); print(res)

Work with Parquet or CSV

# See more output format types in tests/format_output.py
res = chdb.query('select * from file("data.parquet", Parquet)', 'JSON'); print(res)
res = chdb.query('select * from file("data.csv", CSV)', 'CSV');  print(res)
print(f"SQL read {res.rows_read()} rows, {res.bytes_read()} bytes, elapsed {res.elapsed()} seconds")

Pandas dataframe output

# See more in https://clickhouse.com/docs/en/interfaces/formats
chdb.query('select * from file("data.parquet", Parquet)', 'Dataframe')

๐Ÿ—‚๏ธ Query On Table

(Pandas DataFrame, Parquet file/bytes, Arrow bytes)

Query On Pandas DataFrame

import chdb.dataframe as cdf
import pandas as pd
# Join 2 DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ["one", "two", "three"]})
df2 = pd.DataFrame({'c': [1, 2, 3], 'd': ["①", "②", "③"]})
ret_tbl = cdf.query(sql="select * from __tbl1__ t1 join __tbl2__ t2 on t1.a = t2.c",
                  tbl1=df1, tbl2=df2)
print(ret_tbl)
# Query on the DataFrame Table
print(ret_tbl.query('select b, sum(a) from __table__ group by b'))

๐Ÿ—‚๏ธ Query with Stateful Session

from chdb import session as chs

## Create a DB, table, and view in a temp session; they are cleaned up automatically when the session is deleted.
sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic")
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table_xxx (x String, y Int) ENGINE = Log;")
sess.query("INSERT INTO db_xxx.log_table_xxx VALUES ('a', 1), ('b', 3), ('c', 2), ('d', 5);")
sess.query(
    "CREATE VIEW db_xxx.view_xxx AS SELECT * FROM db_xxx.log_table_xxx LIMIT 4;"
)
print("Select from view:\n")
print(sess.query("SELECT * FROM db_xxx.view_xxx", "Pretty"))

see also: test_stateful.py.

๐Ÿ—‚๏ธ Query with Python DB-API 2.0

import chdb.dbapi as dbapi
print("chdb driver version: {0}".format(dbapi.get_client_info()))

conn1 = dbapi.connect()
cur1 = conn1.cursor()
cur1.execute('select version()')
print("description: ", cur1.description)
print("data: ", cur1.fetchone())
cur1.close()
conn1.close()

๐Ÿ—‚๏ธ Query with UDF (User Defined Functions)

from chdb.udf import chdb_udf
from chdb import query

@chdb_udf()
def sum_udf(lhs, rhs):
    return int(lhs) + int(rhs)

print(query("select sum_udf(12,22)"))

see also: test_udf.py.

For more examples, see examples and tests.


Demos and Examples

Benchmark

Documentation

Events

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. Here are some ways you can help:

  • Help test and report bugs
  • Help improve documentation
  • Help improve code quality and performance

Bindings

We welcome bindings for other languages, please refer to bindings for more details.

License

Apache 2.0, see LICENSE for more information.

Acknowledgments

chDB is mainly based on ClickHouse [1]. For trademark and other reasons, I named it chDB.

Contact


Footnotes

  1. ClickHouse® is a trademark of ClickHouse Inc. All trademarks, service marks, and logos mentioned or depicted are the property of their respective owners. The use of any third-party trademarks, brand names, product names, and company names does not imply endorsement, affiliation, or association with the respective owners.

chdb's People

Contributors

allcontributors[bot], auxten, daniel-robbins, laodouya, lmangani, nevinpuri, nmreadelf, reema93jain, yunyu


chdb's Issues

AVX2 Support

Great mod and very interesting binding approach! We're also playing with clickhouse-local custom builds towards a slimmer execution engine.

Question: Is there a specific reason for AVX2 to be enabled for Linux builds? This greatly reduces the scope of usage (i.e., Lambdas, virtualized CPUs, etc., fail with Instruction check fail. The CPU does not support AVX2 instruction set.)

Having this disabled (or possibly shipping multiple builds) could broaden the compatibility scope enormously. Also, the library doesn't seem to be stripped, and the distributed .so carries 100 MB+ of trimmable fat.

Thanks!

Enable jemalloc in chdb shared lib for linux

As analyzed in #19, we need to enable jemalloc in the chdb shared lib, following the tips in jemalloc/jemalloc#1237.

I disabled initial exec tls like this:

contrib/jemalloc-cmake/include_linux_x86_64/jemalloc/internal/jemalloc_internal_defs.h.in

@@ -139,7 +139,8 @@
 /* #undef JEMALLOC_MUTEX_INIT_CB */
 
 /* Non-empty if the tls_model attribute is supported. */
-#define JEMALLOC_TLS_MODEL __attribute__((tls_model("initial-exec")))
+// #define JEMALLOC_TLS_MODEL __attribute__((tls_model("initial-exec")))
+#define JEMALLOC_TLS_MODEL

But this will cause call_init error during dlopen:

Program received signal SIGSEGV, Segmentation fault.
0x00007f291e00fd29 in sallocx () from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
(gdb) bt
#0  0x00007f291e00fd29 in sallocx () from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
#1  0x00007f2916555a36 in operator delete(void*, unsigned long) () from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
#2  0x00007f291f45da44 in google::protobuf::internal::OnShutdownRun(void (*)(void const*), void const*) ()
   from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
#3  0x00007f291f493172 in google::protobuf::(anonymous namespace)::GeneratedDatabase() () from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
#4  0x00007f291f493370 in google::protobuf::DescriptorPool::InternalAddGeneratedFile(void const*, int) ()
   from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
#5  0x00007f291f50b937 in google::protobuf::(anonymous namespace)::AddDescriptors(google::protobuf::internal::DescriptorTable const*) ()
   from /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
#6  0x00007f29212d0fe2 in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fff92a04d68, env=env@entry=0x1440b90) at dl-init.c:72
#7  0x00007f29212d10e9 in call_init (env=0x1440b90, argv=0x7fff92a04d68, argc=1, l=<optimized out>) at dl-init.c:30
#8  _dl_init (main_map=0x152ea80, argc=1, argv=0x7fff92a04d68, env=0x1440b90) at dl-init.c:119
#9  0x00007f292105caed in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:182
#10 0x00007f29212d5058 in dl_open_worker (a=a@entry=0x7fff92a034f0) at dl-open.c:758
#11 0x00007f292105ca90 in __GI__dl_catch_exception (exception=0x7fff92a034d0, operate=0x7f29212d4ca0 <dl_open_worker>, args=0x7fff92a034f0)
    at dl-error-skeleton.c:208
#12 0x00007f29212d48fa in _dl_open (file=0x7f2920d4e530 "/home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so", mode=-2147483646, caller_dlopen=0x61b611, 
    nsid=-2, argc=1, argv=0x7fff92a034d0, env=0x1440b90) at dl-open.c:837
#13 0x00007f2921293258 in dlopen_doit (a=a@entry=0x7fff92a03710) at dlopen.c:66
#14 0x00007f292105ca90 in __GI__dl_catch_exception (exception=exception@entry=0x7fff92a036b0, operate=0x7f2921293200 <dlopen_doit>, args=0x7fff92a03710)
    at dl-error-skeleton.c:208
#15 0x00007f292105cb4f in __GI__dl_catch_error (objname=0x1493150, errstring=0x1493158, mallocedp=0x1493148, operate=<optimized out>, args=<optimized out>)
    at dl-error-skeleton.c:227
#16 0x00007f2921293a65 in _dlerror_run (operate=operate@entry=0x7f2921293200 <dlopen_doit>, args=args@entry=0x7fff92a03710) at dlerror.c:170
#17 0x00007f29212932e4 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#18 0x000000000061b611 in ?? ()
#19 0x000000000061a2ca in ?? ()
#20 0x00000000005298c4 in ?? ()
#21 0x0000000000517b9b in _PyEval_EvalFrameDefault ()
#22 0x00000000005106ed in ?? ()
#23 0x0000000000528d21 in _PyFunction_Vectorcall ()
#24 0x0000000000516e76 in _PyEval_EvalFrameDefault ()
#25 0x0000000000528b63 in _PyFunction_Vectorcall ()
#26 0x0000000000512192 in _PyEval_EvalFrameDefault ()
#27 0x0000000000528b63 in _PyFunction_Vectorcall ()
#28 0x0000000000511fb5 in _PyEval_EvalFrameDefault ()
#29 0x0000000000528b63 in _PyFunction_Vectorcall ()
#30 0x0000000000511fb5 in _PyEval_EvalFrameDefault ()
#31 0x0000000000528b63 in _PyFunction_Vectorcall ()
#32 0x0000000000511fb5 in _PyEval_EvalFrameDefault ()
#33 0x0000000000528b63 in _PyFunction_Vectorcall ()
#34 0x000000000052842e in ?? ()
#35 0x000000000053f559 in _PyObject_CallMethodIdObjArgs ()
#36 0x000000000053e786 in PyImport_ImportModuleLevelObject ()
#37 0x00000000005144bd in _PyEval_EvalFrameDefault ()
#38 0x00000000005106ed in ?? ()
#39 0x0000000000510497 in _PyEval_EvalCodeWithName ()
#40 0x00000000005f5be3 in PyEval_EvalCode ()
#41 0x0000000000619de7 in ?? ()
#42 0x0000000000615610 in ?? ()
#43 0x0000000000459cb3 in ?? ()
#44 0x0000000000459911 in PyRun_InteractiveLoopFlags ()
#45 0x00000000006194f5 in PyRun_AnyFileExFlags ()
#46 0x000000000044bca9 in ?? ()
#47 0x00000000005ea6e9 in Py_BytesMain ()
#48 0x00007f2920f49d0a in __libc_start_main (main=0x5ea6b0, argc=1, argv=0x7fff92a04d68, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fff92a04d58) at ../csu/libc-start.c:308
#49 0x00000000005ea5ea in _start ()

need further dig...

Minified version with no table engines

A lot of the binary size comes from table engines that may not be relevant for in-process use cases, like the MergeTree engines, Log engines, etc.

It would be great to either have an easy way to compile with engines omitted, or a build that is effectively engine-free (except for some basics like url, s3, and file) for a far smaller binary. The expectation is that custom engines would be built on top of the url/s3 engines from #52 as sorts of aliases.

Cleaning up stale parts

👋

Just wanted to quickly check whether chdb automatically cleans up stale MergeTree parts after OPTIMIZE TABLE commands. And if so, how does it work?
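One way to check, sketched below with the session API from the README (the database and table names are made up, and this assumes chdb exposes system.parts the way clickhouse-local does): force a merge with OPTIMIZE TABLE ... FINAL, then look for inactive parts.

from chdb import session as chs

sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db ENGINE = Atomic")
sess.query("CREATE TABLE db.t (x Int64) ENGINE = MergeTree ORDER BY x")
sess.query("INSERT INTO db.t VALUES (1)")
sess.query("INSERT INTO db.t VALUES (2)")
sess.query("OPTIMIZE TABLE db.t FINAL")
# Stale (inactive) parts, if any, show up here with active = 0.
print(sess.query("SELECT name, active FROM system.parts WHERE table = 't'", "Pretty"))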

Segfault on handled exceptions

Error responses are causing core dumps on the latest version.

Example

The following query attempts to use a non-existing resource (could be local, remote, etc.):

import chdb
url = "https://does.not.exist"
query = f"""SELECT * FROM url('{url}', Parquet) LIMIT 5 """
res = chdb.query(query, 'Dataframe')
print(str(res.get_memview().tobytes()))

The expected exception is the following:

Code: 198. DB::Exception: Not found address of host: donot.exist: Cannot extract table structure from Parquet format file. You can specify the structure manually. (DNS_ERROR)

When using the latest chdb version (or any, really), errors consistently cause a core dump after the error response. This does not seem related to the chdb dynamic library, but rather to the Python changes:

Code: 198. DB::Exception: Not found address of host: donot.exist: Cannot extract table structure from Parquet format file. You can specify the structure manually. (DNS_ERROR)
/arrow/python/pyarrow/src/arrow/python/common.cc:178:  Check failed: (data_) != (nullptr) Null pointer in Py_buffer
/usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.1200(+0x1010758)[0x7fdd593b3758]
/usr/local/lib/python3.8/dist-packages/pyarrow/libarrow.so.1200(_ZN5arrow4util8ArrowLogD1Ev+0xed)[0x7fdd593b3b3d]
/usr/local/lib/python3.8/dist-packages/pyarrow/libarrow_python.so(_ZN5arrow2py8PyBuffer4InitEP7_object+0xf5)[0x7fdd95ee0815]
/usr/local/lib/python3.8/dist-packages/pyarrow/libarrow_python.so(_ZN5arrow2py8PyBuffer12FromPyObjectEP7_object+0x13a)[0x7fdd95ee096a]
/usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1c5735)[0x7fdd85288735]
/usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0xd033e)[0x7fdd8519333e]
/usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x15a10c)[0x7fdd8521d10c]
/usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0xd033e)[0x7fdd8519333e]
/usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1b70d4)[0x7fdd8527a0d4]
python3(PyCFunction_Call+0x59)[0x5f6939]
python3(_PyObject_MakeTpCall+0x296)[0x5f7506]
python3(_PyEval_EvalFrameDefault+0x6259)[0x571019]
python3(_PyEval_EvalCodeWithName+0x26a)[0x5697da]
python3[0x59c396]
python3(_PyObject_MakeTpCall+0x1ff)[0x5f746f]
python3(_PyEval_EvalFrameDefault+0x5dce)[0x570b8e]
python3(_PyFunction_Vectorcall+0x1b6)[0x5f6ce6]
python3(_PyEval_EvalFrameDefault+0x72d)[0x56b4ed]
python3(_PyFunction_Vectorcall+0x1b6)[0x5f6ce6]
python3(_PyEval_EvalFrameDefault+0x72d)[0x56b4ed]
python3(_PyEval_EvalCodeWithName+0x26a)[0x5697da]
python3(_PyFunction_Vectorcall+0x393)[0x5f6ec3]
python3(_PyEval_EvalFrameDefault+0x5796)[0x570556]
python3(_PyEval_EvalCodeWithName+0x26a)[0x5697da]
python3(PyEval_EvalCode+0x27)[0x68e547]
python3[0x67dbf1]
python3[0x67dc6f]
python3[0x67dd11]
python3(PyRun_SimpleFileExFlags+0x197)[0x67fe37]
python3(Py_RunMain+0x212)[0x6b7c82]
python3(Py_BytesMain+0x2d)[0x6b800d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fddb3306083]
python3(_start+0x2e)[0x5fb85e]
Aborted (core dumped)

The same seems to apply to any error condition returned.

Opening for investigation.

Query csv with group by very slow

Describe the situation
import chdb
res=chdb.query('select count(*) cnt from file("/Users/xbsura/Downloads/organizations-2000000.csv", CSVWithNames) group by Name order by cnt desc', 'CSV')

wc -l /Users/xbsura/Downloads/organizations-2000000.csv
2000001 /Users/xbsura/Downloads/organizations-2000000.csv

head /Users/xbsura/Downloads/organizations-2000000.csv
Index,Organization Id,Name,Website,Country,Description,Founded,Industry,Number of employees
1,391dAA77fea9EC1,Daniel-Mcmahon,https://stuart-rios.biz/,Cambodia,Focused eco-centric help-desk,2013,Sports,1878
2,9FcCA4A23e6BcfA,"Mcdowell, Tate and Murray",http://jacobs.biz/,Guyana,Front-line real-time portal,2018,Legal Services,9743
3,DB23330238B7B3D,"Roberts, Carson and Trujillo",http://www.park.com/,Jordan,Innovative hybrid data-warehouse,1992,Hospitality,7537
4,bbf18835CFbEee7,"Poole, Jefferson and Merritt",http://hayden.com/,Cocos (Keeling) Islands,Extended regional Graphic Interface,1991,Food Production,9974

This SQL needs more than 1 minute to finish, and memory usage exceeds 100 GB.

  • Which ClickHouse server version to use
    res = chdb.query('select version()', 'Pretty'); print(res.data())
    โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
    โ”ƒ version() โ”ƒ
    โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
    โ”‚ 22.12.1.1 โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

  • Queries to run that lead to slow performance
    select count(*) cnt from file("/Users/xbsura/Downloads/organizations-2000000.csv", CSVWithNames) group by Name order by cnt desc

Expected performance
For a 200 MB file, less than 1 second would be reasonable.

Add chdb-cli to module

Discussed in #109

Originally posted by l1t1 September 13, 2023
Currently we must download and set up the chdb-cli.py file manually.
It would be easier if it could run as python -m chdb cli after pip install chdb; a sketch follows below.
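For reference, a minimal sketch of what such a built-in CLI loop might look like, using only the public chdb.query() API (the real chdb-cli.py may differ and would likely add history and multi-line input):

import chdb

while True:
    try:
        sql = input("chdb> ").strip()
    except EOFError:
        break
    if sql in ("quit", "exit"):
        break
    if sql:
        try:
            print(chdb.query(sql, "Pretty"))
        except Exception as e:  # keep the loop alive on bad SQL
            print(e)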

query remote clickhouse instance

Suppose I want to read foreign ClickHouse data.

If I typed the following:

select * from remote('database:port','database.table','user','password')

If I understand correctly, it just returns "Ok" and doesn't process the query.

Did I do something wrong? It seems like potentially desirable behavior to be able to read directly from Clickhouse instances.
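For comparison, a sketch of the same call with an explicit output format (host and credentials below are placeholders); without a format argument, some statements appear to return only "Ok":

import chdb

res = chdb.query(
    "SELECT * FROM remote('host:9000', 'db.table', 'user', 'password') LIMIT 10",
    "CSV",
)
print(res)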

Convert chdb result to pandas/DataFrame format

Use case

Convert chdb result to pandas/DataFrame format

Describe the solution you'd like

Make chdb query result work with pandas

Describe alternatives you've considered

Maybe a built-in result.to_df() method, or just an example of a to_df(result) helper.

Additional context

This requirement has the following constraints

  1. Do not make chdb itself depend on external Python libraries such as pyarrow
  2. Try not to add additional C++ library dependencies
  3. result.to_df() is recommended, but a utility function like to_df(result) is also acceptable (a sketch follows below)
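A minimal sketch of such a helper, assuming the documented Arrow output format and the result's get_memview() accessor, with pyarrow imported only inside the helper to respect constraint 1:

import chdb

def to_df(result):
    # Parse the Arrow IPC buffer returned by the "Arrow" output format
    # and hand it to pandas.
    import pyarrow as pa  # optional dependency, loaded lazily
    buf = pa.BufferReader(result.get_memview().tobytes())
    return pa.ipc.open_file(buf).read_all().to_pandas()

df = to_df(chdb.query("SELECT 1 AS x, 'a' AS y", "Arrow"))
print(df)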

Feature Request: chdb.exec() function

As discussed on Discord, we are considering adding an exec command to pre-provision any SET parameters, UDF functions, etc., with full persistence within a chdb session.

import chdb
exec = chdb.exec("CREATE FUNCTION hello AS () -> 'chDB'")
ret = chdb.query("SELECT hello()", "CSV")
print(ret.get_memview().tobytes())

Thread open for discussion and implementation proposals.

Pandas dataframe output introduced pyarrow and pandas dependency

Pandas DataFrame output (#6) introduced pyarrow and pandas dependencies and will be released in v0.5.0.
I'm wondering whether it's worth it; after all, pyarrow and pandas are not small. Starting a vote here:

  1. If you think it is worthwhile to add pyarrow and pandas dependencies to support DataFrame output, please click 👍
  2. If you think it is not worth it, please click 👎

Run a query on a pandas dataframe directly

Is there a plan to add the ability to run a SQL query directly on a pandas DataFrame?
Similar to https://pypi.org/project/sqldf/

import pandas as pd
import numpy as np
import chdb

# Create a dummy pd.Dataframe
df = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'], 'col2': ['F', np.NaN, 'G', 'H', 'I']})

# Define a SQL (Clickhouse) query
query = '''
SELECT *
FROM df
WHERE col1 IS NOT NULL;
'''

# Run the query
df_view = chdb.query(query, "Dataframe")

My current workaround is to 1) save the DataFrame to a Parquet file and then 2) run chdb.query on that file.
I could wrap this into a function that creates a temporary Parquet file,
but being able to query DataFrames directly would be much more convenient.
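A sketch of that wrapper (the __df__ placeholder is a made-up convention for this example, and df.to_parquet assumes pyarrow or fastparquet is installed):

import os
import tempfile

import chdb
import pandas as pd

def query_df(df: pd.DataFrame, sql: str, fmt: str = "CSV"):
    # Dump the DataFrame to a temporary Parquet file and rewrite the
    # hypothetical __df__ placeholder to chdb's file() table function.
    with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as f:
        path = f.name
    try:
        df.to_parquet(path)
        return chdb.query(sql.replace("__df__", f'file("{path}", Parquet)'), fmt)
    finally:
        os.remove(path)

df = pd.DataFrame({'col1': ['A', 'B', None, 'C', 'D']})
print(query_df(df, "SELECT * FROM __df__ WHERE col1 IS NOT NULL"))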

Installation on macOS Ventura 13.1 (M1) fails

pip install chdb

ERROR: Could not find a version that satisfies the requirement chdb (from versions: none)
ERROR: No matching distribution found for chdb

uname -a
Darwin xxxx 22.2.0 Darwin Kernel Version 22.2.0: Fri Nov 11 02:04:44 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T8103 arm64

python -V
Python 3.10.9

(Custom) table engine bindings

Being able to bind a custom table engine (not function) that exhibits dynamic behavior would be really powerful.

For example, similar to my Parquet merge engine IceDB, where I have an example of querying from a custom table function, it would be much nicer if the columns for filtering the metadata could be determined via the query predicate (WHERE conditions) rather than via function parameters.

This would allow you to fully decouple storage, compute, and metadata to build a serverless OLAP DB that natively supports multi-tenancy and dynamic schema (through column-name merging, an existing ClickHouse feature). This could then effectively be aliased to a list of Parquet files to read, with merge-on-name enabled (I can't find the ClickHouse docs for that setting to save my life).

Effectively this would intercept the AST in a very DX-friendly way and return a custom ClickHouse table function in its place before execution.

clickhouse-local and chdb performance issues on ClickBench Q.23 and Q.28

Performance on c6a.metal is really good, but on c6a.4xlarge is not good, I did some analysis.
SQLs on clickbench is start from 0.

So queries below on line N is Q.N-1 on clickbench.

Q.23

Q.23: SELECT * FROM file("hits_*.parquet", Parquet) WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10;
got:

chdb: [56.04940148999992,51.57905327000003,47.26720601400007]
duckdb: [55.858323720000044,11.130785724999896,11.137916557000153]
clickhouse-local: [56.544, 17.397, 17.414]

To prove my guess, I set LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 before running Q.23 (see screenshot).

But why doesn't chdb link jemalloc? That's a problem I haven't dug into deeply enough:

When jemalloc is simply linked into _chdb.cpython-xxxxx.so, importing it fails with:
ImportError: /home/Clickhouse/chdb/_chdb.cpython-39-x86_64-linux-gnu.so: cannot allocate memory in static TLS block

As the performance impact is so large, I should solve this. Maybe just follow jemalloc/jemalloc#1237.

Q.28

Q.28: SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length(Referer)) AS l, COUNT(*) AS c, MIN(Referer) FROM file("hits_*.parquet", Parquet) WHERE Referer <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
got:

chdb: [10.385669515000018,10.016407472000083,10.284191555000007]
duckdb: [8.373684226000023,4.492152146999842,4.521108621999929]
clickhouse-local: [9.724, 9.377, 9.490]

Q.28 is mainly a regex performance problem; chdb eats all 16 cores of the c6a.4xlarge. Time consumption of chdb and its cousin clickhouse-local is nearly identical, but DuckDB runs this really fast. I think there are two possible explanations:

  1. re2 lib version or optimization issue.
  2. As we know, the ClickHouse engine doesn't use the min/max metadata in each Parquet file, so too many rows may go through REGEXP_REPLACE.

Tip: the ClickHouse Parquet file handling issue might be solved in v23.4: https://twitter.com/ClickHouseDB/status/1649085317000105985?s=20

I also expect the huge performance gap between the ClickHouse engine and DuckDB on Q.36, Q.37, Q.38, and Q.39 to be greatly improved in v23.4.

(Screenshot from ClickBench.)

As expected, LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 didn't improve Q.28 much (see screenshot).

Here is the raw test result of chdb on c6a.4xlarge:

  1. [0.4212477300000046,0.05675365600001214,0.05619672399996034],
  2. [0.4166838250000069,0.06531620699996665,0.06449468199997455],
  3. [1.321115270000007,0.11996398300004785,0.11983177199999773],
  4. [1.0048882890000073,0.19388485299998592,0.19680832800003145],
  5. [1.9418000499999835,1.3102912339999762,1.306730918000028],
  6. [1.7881873679999671,0.9799579250000079,0.9898529660000008],
  7. [0.4436242609999681,0.06424746199996889,0.06408176899998352],
  8. [0.4105360549999659,0.07395825700001524,0.07268457200001421],
  9. [1.4529841990000136,0.7407924660000162,0.7251662129999659],
  10. [3.591443298999991,0.908300875000009,0.8936401500000102],
  11. [2.30893248000001,0.3970227440000258,0.4045306449999657],
  12. [2.3094307900000217,0.44978438499998674,0.46010506099997883],
  13. [1.766951419999998,0.8745181759999809,0.8729436739999983],
  14. [3.655738981000013,1.3014796009999827,1.3102514689999794],
  15. [1.7923423009999624,0.9957754339999951,0.9786744300000123],
  16. [1.4113377190000165,0.904729954000004,0.8983218939999915],
  17. [4.241506091000019,2.110677882999994,2.108407218000025],
  18. [3.475192112000002,1.3591108809999923,1.3430213159999767],
  19. [7.629405652000003,4.1000120279999805,4.094294319000028],
  20. [0.9980636189999927,0.1998158640000156,0.20184096600002022],
  21. [11.467297670999983,3.3498150409999994,3.340530508000029],
  22. [12.402329799999961,3.8429053930000237,3.8139943720000247],
  23. [23.43172926599999,7.04715183899998,7.0258233459999815],
  24. [56.04940148999992,51.57905327000003,47.26720601400007],
  25. [3.778993186999969,0.9129657269999143,0.8343545320000203],
  26. [1.5488553679999768,0.7785552640000333,0.7703242139999702],
  27. [3.7575681749999603,0.7797505359999377,0.7788957500000606],
  28. [11.195327173999999,3.9467314390000183,4.043763985999931],
  29. [10.385669515000018,10.016407472000083,10.284191555000007],
  30. [3.109853629999975,2.6357215879999103,2.592264249999971],
  31. [4.27900334200001,1.0691001030000962,1.0309490470000355],
  32. [8.493144112999971,1.4274674520000872,1.41434988900005],
  33. [9.647242985000048,6.089477478999925,5.9719970529999955],
  34. [11.494503093000048,5.471921920999989,5.2813955599999645],
  35. [11.458392625000101,5.404140325000071,5.264485821999983],
  36. [1.4544796239999869,1.1437831629999664,1.1075418030000037],
  37. [13.741583547000005,3.498664385999973,3.5927922500000022],
  38. [12.745886966999933,3.499401634000037,3.489023333999967],
  39. [13.625658656999917,3.5325630399999,3.5357377320000296],
  40. [21.39596359199993,5.756981011999983,5.797184558000026],
  41. [6.0100831690000405,0.6963497419999385,0.6951165990000163],
  42. [6.212279076000073,0.6570028610000236,0.6571199239999714],
  43. [4.017423710000003,0.5795366049999302,0.505997210999908]

Example on how to query pyarrow table

The Query On Table section (Pandas DataFrame, Parquet file/bytes, Arrow bytes) of the README has a DataFrame example, but not a pyarrow one.

Also, a question: is the select zero-copy?

dbapi returns int as string

chdb driver version: 0.14.2
chdb version: 23.6.1.1
Python version: 3.9.6
macos catalina: 10.15.7

How to reproduce:
First, create a test table:

CREATE TABLE t1
(
    `id` Int64,
    c1 String 
)
ENGINE = ReplacingMergeTree
ORDER BY id;

insert into t1 values
(1,'a'),
(2,'b'),
(3,'c');

Python code that just reads data from table t1:

from chdb import dbapi

conn1 = dbapi.connect()
cur1 = conn1.cursor()

cur1.execute("select id, c1 from remote('host:9000','db','table','user','password') final")
result = cur1.fetchall()
print(result)

cur1.close()
conn1.close()

When I run the Python code above, I get the result below: the int id is returned as a string, and the rows are returned as tuples. Is it possible for dbapi to return results the same way as clickhouse_driver does in Python?

(('1', 'a'), ('2', 'b'), ('3', 'c'))

Support USE db_xxx in a chdb session

from chdb import session as chs

sess = chs.Session()
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic;")
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table_xxx (x String, y Int) ENGINE = MergeTree ORDER BY x;")

sess.query("USE db_xxx;") # update current db to db_xxx

sess.query("INSERT INTO log_table_xxx (x, y) SELECT toString(rand()), rand() FROM numbers(1000000);")
sess.query("SELECT count(*) FROM log_table_xxx;")

See also: #82

Drop support for Python 3.7 after v0.9.0

Pandas started dropping Python 3.7 support in 2021:

  • Pandas 2.0 with the pyarrow backend requires at least Python 3.8.
  • The memfd_create feature for #36 was also introduced in Python 3.8.

So I decided to drop support for Python 3.7 after v0.9.0. For Python 3.7, the last available chdb version is 0.9.0.

Any feedback will be appreciated.

0.11.0: recursive_mutex lock failed: Invalid argument. (STD_EXCEPTION)

Stock queries are failing with chdb 0.11.0. Here's an example we know works as expected with chdb 0.10.x:

query

SELECT
    town,
    district,
    count() AS c,
    round(avg(price)) AS price
FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')
GROUP BY
    town,
    district
LIMIT 10

Failing response

Code: 1001. DB::Exception: std::__1::system_error: recursive_mutex lock failed: Invalid argument. (STD_EXCEPTION)

Expected response

query results

Aggregate function bindings

Very similar to the function bindings, but can be used as an aggregate function. For example a query could be:

select myfunc(colA), colB from t group by colB

Where:

def myfunc(prev: int | None, row: int) -> int:
    # prev is the accumulator, row is the current row's value
    return 2 * row + prev if prev is not None else 0

(I'm not even sure that's valid Python; it's just a random example.)

prev being an accumulator, and row being the current row

query a parquet file 4 times slower than clickhouse local


Describe the situation
SELECT avg(i) FROM file('/data/t.parquet') group by round(log10(i));
chdb costs 400 s; clickhouse-local costs 100 s.
How to reproduce

  • ClickHouse server version: 23.6
  • Interface: CLI.py
  • Non-default settings, if any
  • CREATE TABLE statements for all tables involved
    select number::int i FROM numbers_mt(1,1000000000)t into outfile '/data/t.parquet';
  • Sample data for all these tables, use clickhouse-obfuscator if necessary
  • Queries to run that lead to slow performance
    SELECT avg(i) FROM file('/data/t.parquet') group by round(log10(i));

Expected performance
I hope chdb runs as fast as clickhouse-local.
Additional context
By the way, with the data-generation query
select number::int i FROM numbers_mt(1,1000000000)t into outfile '/data/t.parquet';
chdb runs as fast as clickhouse-local.

UPDATE and DELETE have no effect

Hey there, thanks for the amazing work on chdb!

I noticed that ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE statements don't seem to have any effect in chdb. Data simply stays unchanged.

The same does work in clickhouse-local.

How to reproduce

The examples below can also be seen in this Colab notebook.

UPDATE

from chdb import session as chs

sess = chs.Session()

sess.query("CREATE DATABASE test_db ENGINE = Atomic;")
sess.query("CREATE TABLE test_db.test_table (x String, y String) ENGINE = MergeTree ORDER BY tuple()")
sess.query("INSERT INTO test_db.test_table (x, y) VALUES ('A', 'B')")

print("Original values:")
print(sess.query("SELECT * FROM test_db.test_table"))

sess.query("ALTER TABLE test_db.test_table UPDATE y = 'updated' WHERE x = 'A' AND y = 'B'")

print('Values after UPDATE (expected "A", "updated"):')
print(sess.query("SELECT * FROM test_db.test_table"))

This prints:

Original values:
"A","B"

Values after UPDATE (expected "A", "updated"):
"A","B"

DELETE

from chdb import session as chs

sess = chs.Session()

sess.query("CREATE DATABASE test_db ENGINE = Atomic;")
sess.query("CREATE TABLE test_db.test_table (x String, y String) ENGINE = MergeTree ORDER BY tuple()")
sess.query("INSERT INTO test_db.test_table (x, y) VALUES ('A', 'B')")

print("Original values:")
print(sess.query("SELECT * FROM test_db.test_table"))

sess.query("ALTER TABLE test_db.test_table DELETE WHERE x = 'A' AND y = 'B'")

print("Values after DELETE (expected is no rows):")
print(sess.query("SELECT * FROM test_db.test_table"))

This prints:

Original values:
"A","B"

Values after DELETE (expected is no rows):
"A","B"

Expected behavior
Data is updated / deleted, the same way as in ClickHouse / clickhouse-local.

Additional context

I tested the same queries as shown above in a clickhouse-local session, started with clickhouse local --path .. There it works as expected.

I can reproduce the issue with chdb on my M1 Mac and on Google Colab

Question About Query On Pandas DataFrame

How can I handle a two-table join when querying pandas DataFrames?

import chdb.dataframe as cdf
import pandas as pd

tbl1 = cdf.Table(dataframe=pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}))
tbl2 = cdf.Table(dataframe=pd.DataFrame({'a': [4, 2, 3], 'c': ['c', 'd', 'c']}))


How can I join tbl1 and tbl2 on column 'a'? I can't work it out from the demo. Thanks!
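A sketch of one way to do it with the cdf.query() API from the README's own join example, where the __tbl1__/__tbl2__ placeholders resolve to the keyword arguments:

import chdb.dataframe as cdf
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
df2 = pd.DataFrame({'a': [4, 2, 3], 'c': ['c', 'd', 'c']})
# Join the two DataFrames on column 'a'.
ret = cdf.query(sql="SELECT * FROM __tbl1__ t1 JOIN __tbl2__ t2 ON t1.a = t2.a",
                tbl1=df1, tbl2=df2)
print(ret)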

Action Buildtime optimizations

The build_wheels.yml action is huge and could benefit from being split into multiple stages to gain a significant cross-build speedup. Currently all builders seem to run fully in parallel, leveraging neither git submodule caching nor library artifact sharing.

I might be wrong, as Python is not my forte, but assuming the same .so can be used by all binding builders on the same arch (is this the case?), splitting and chaining the clickhouse/chdb library build stage and the Python binding stage could save hours per release for the same OS/arch across a matrix of Python versions.

Note: not a request - posting to find a way to contribute and help out

Binaries instead of Strings in dataframe and arrow exports

Describe the unexpected behaviour
When querying Parquet files with chdb, strings become bytes in the DataFrame and Arrow formats. I don't see this issue with JSON or CSV.

How to reproduce

The query

SELECT AVG(prix) as prix_moy,
        pdvid,
        name,
        ville,
        type_carburant
FROM  s3('https://********.s3.eu-west-1.amazonaws.com/instantane.parquet', 'Parquet') AS p
LEFT JOIN s3('https://************.s3.eu-west-1.amazonaws.com/station.csv', '*****', '****', 'CSVWithNames') AS stations
ON p.pdvid = stations.id
GROUP BY all
ORDER BY prix_moy DESC;

The results with clickhouse-local: (screenshot "Capture d'écran 2023-08-19 à 14 41 11")

The result with chdb

In [25]: res
Out[25]:
       prix_moy     pdvid                                    name                        ville type_carburant
0         2.799  49480005  b"BP A11 AIRE DES PORTES D'ANGERS SUD"     b"Saint-Sylvain-D'Anjou"        b'SP98'
1         2.770  75014008                                    None                     b'Paris'        b'SP98'
2         2.740  75014008                                    None                     b'Paris'        b'SP95'
3         2.699  49160003                            b'SARL ROUX'    b'Longu\xc3\xa9-Jumelles'        b'SP98'
4         2.690  75016011                  b'Sarl STATION KLEBER'                     b'Paris'        b'SP98'
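Until this is fixed, a workaround sketch that decodes bytes cells back to str in pandas, assuming the data is UTF-8 (e.g. b'Longu\xc3\xa9-Jumelles' above decodes to 'Longué-Jumelles'):

# `res` is the DataFrame shown above.
for col in res.columns:
    res[col] = res[col].map(
        lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
    )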

Add arm64 build (and install from source)

Hello,

I tried to install chdb on Ubuntu arm64 and got this error from pip:

(venv) ubuntu@ip-172-31-31-61:~$ pip install chdb
ERROR: Could not find a version that satisfies the requirement chdb (from versions: none)
ERROR: No matching distribution found for chdb

Is it possible to add the source to PyPI or an arm64 build (or both 😉)?

Thanks

Unable to create MergeTree Table in a session: Database _temporary_and_external_tables doesn't exist.

Creating a MergeTree table results in an error with the message: Code: 81. DB::Exception: Database _temporary_and_external_tables doesn't exist. (UNKNOWN_DATABASE)

How to reproduce

  • modified from readme.md example
    from chdb import session as chs
    
    sess = chs.Session()
    sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic;")
    sess.query("use db_xxx;")
    sess.query("CREATE TABLE IF NOT EXISTS db_xxx.log_table_xxx (x String, y Int) ENGINE = MergeTree ORDER BY x;")
    
    The use db_xxx; line doesn't seem to affect the outcome; the error is generated consistently.

Expected behavior
No error message. Table is created.

Rust support

I'm hoping to add support for Rust and would like to get involved. Where should I start?

Date type transformed to UInt16 when exported to DataFrame

Describe the issue
When a query returns a Date type and is exported to DataFrame or Arrow, it is transformed to a UInt16.

How to reproduce

res = chdb.query('SELECT toDate(now())', 'Dataframe')
res
>>>toDate(now())
>>>0          19585

res = chdb.query('SELECT toDate32(now())', 'Dataframe')
res
>>>toDate32(now())
>>>0      2023-08-16
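Until this is fixed, a workaround sketch: the UInt16 value is a day count since the Unix epoch (19585 days is 2023-08-16), so pandas can convert it back:

import chdb
import pandas as pd

df = chdb.query("SELECT toDate(now()) AS d", "Dataframe")
# Interpret the UInt16 day count as days since 1970-01-01.
df["d"] = pd.to_datetime(df["d"], unit="D", origin="unix")
print(df)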

Distributed query processing

Like distributed queries in clickhouse, it would be great to have a (semi-)native way to process distributed queries.

For example, being able to divide a list of Parquet files up onto multiple hosts running chdb in-process, and then having them reduce down to the initial node that ran the query. This is somewhat possible manually (in theory).

Processes like choosing which files go to which hosts could be left to the developer, but the cross-host map-reduce is the functionality that would ideally be native. Even if the data passing went through some binary RPC calls the developer has to implement, telling chDB that it is running a partial query and must deliver results that can be aggregated down on a single final worker is something that would need to live in chDB itself; a manual sketch follows below.
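A sketch of how this can be hand-rolled today with nothing but chdb.query() and the file() table function (the file globs, the k column, and the RPC transport between hosts are all assumptions):

import chdb

def worker_partial(files_glob: str) -> bytes:
    # Runs on each worker host: pre-aggregate its own subset of Parquet
    # files and ship the partial result back as Parquet bytes.
    res = chdb.query(
        f'SELECT k, count() AS c FROM file("{files_glob}", Parquet) GROUP BY k',
        "Parquet",
    )
    return res.get_memview().tobytes()

# Coordinator: gather partial results via your own RPC layer (not shown),
# persist them, and aggregate the aggregates in one final query.
partials = [worker_partial("hits_0*.parquet"), worker_partial("hits_1*.parquet")]
for i, part in enumerate(partials):
    with open(f"partial_{i}.parquet", "wb") as f:
        f.write(part)
print(chdb.query(
    'SELECT k, sum(c) AS c FROM file("partial_*.parquet", Parquet) GROUP BY k',
    "CSV",
))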

Cannot redirect stderr in chdb.query()

Describe the unexpected behaviour
In versions above 0.10, chdb.query() prints error messages to stderr,
but they cannot be redirected or captured into a variable.

How to reproduce

import chdb
from io import StringIO
import sys
from contextlib import redirect_stderr

f = StringIO()
with redirect_stderr(f):
    res = chdb.query("select 1", "csv")
    # sys.stderr.write("write stderr in python")
err = f.getvalue()

print("Capture err by redirect_stderr: ", err)

Expected behavior
The error message is caught and stored in the err variable; a workaround sketch follows below.
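A workaround sketch: contextlib.redirect_stderr only rebinds the Python-level sys.stderr object, while chdb's native code writes straight to file descriptor 2, so capturing has to happen at the fd level, e.g. with os.dup2:

import os
import tempfile

import chdb

def query_capture_stderr(sql, fmt="CSV"):
    # Temporarily point fd 2 at a temp file, run the query, then restore it.
    saved = os.dup(2)
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)
        try:
            res = chdb.query(sql, fmt)
        finally:
            os.dup2(saved, 2)
            os.close(saved)
        tmp.seek(0)
        return res, tmp.read().decode(errors="replace")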

Make static library

Thanks for your great work! Is it possible to compile a chdb static library?

RowBinary

👋

Is it possible to insert RowBinary using sessions? Right now the following query hangs, as I don't know how to provide it with the data:

from chdb import session as chs
s = chs.Session()
s.query("create database dev")
s.query("create table dev.example(a UInt64) engine MergeTree order by tuple()")
s.query("insert into dev.example format RowBinary")

cannot import in amazon2 linux

Describe what's wrong

I'm trying to use chdb on AWS Lambda and use Docker for testing and building the application. When I import chdb, it crashes because ClickHouse can't access /etc/localtime. The file exists; I also tried changing the timezone, and that failed too.

How to reproduce

(base) โžœ  ~ docker run -d --name chdb --rm public.ecr.aws/lambda/python:3.11
f500dc48b67d268020b2034d92b84f3af7505edcfbce2dc84cb955b296efc13c
docker exec -it chdb /bin/bash
bash-4.2# pip install chdb
Collecting chdb
  Downloading chdb-0.11.5-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (90.0 MB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 90.0/90.0 MB 3.5 MB/s eta 0:00:00
Installing collected packages: chdb
Successfully installed chdb-0.11.5
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: pip install --upgrade pip
Python 3.11.4 (main, Jul 31 2023, 08:41:59) [GCC 7.3.1 20180712 (Red Hat 7.3.1-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import chdb
Poco::Exception. Code: 1000, e.code() = 0, Exception: Cannot load time zone /etc/localtime (version 23.6.1.1)
