
datasketches-cpp's Introduction

Apache DataSketches Core C++ Library Component

This is the core C++ component of the Apache DataSketches library. It contains all of the key sketching algorithms that are in the Java component and can be accessed directly from user applications.

This component is also a dependency of other components of the library that create adaptors for target systems, such as PostgreSQL.

Note that we have parallel core components for [Java](https://github.com/apache/datasketches-java) and [Python](https://github.com/apache/datasketches-python) implementing the same sketch algorithms.

Please visit the main Apache DataSketches website for more information.

If you are interested in contributing to this project, please see our Community page for how to contact us.


This code requires C++11.

This library is header-only. The build process provided is only for building unit tests.

Building the unit tests requires CMake 3.12.0 or higher.

Installing the latest CMake on OSX: brew install cmake

Building and running unit tests using CMake for OSX and Linux:

    $ cmake -S . -B build/Release -DCMAKE_BUILD_TYPE=Release
    $ cmake --build build/Release -t all test

Building and running unit tests using CMake for Windows from the command line:

    $ mkdir build
    $ cd build
    $ cmake ..
    $ cd ..
    $ cmake --build build --config Release
    $ cmake --build build --config Release --target RUN_TESTS

To install a local distribution (OSX and Linux), use the following commands. The CMAKE_INSTALL_PREFIX variable controls the destination; if not specified, it defaults to installing in /usr (/usr/include, /usr/lib, etc). In the commands below, the installation will be in /tmp/install/DataSketches (/tmp/install/DataSketches/include, /tmp/install/DataSketches/lib, etc).

    $ cmake -S . -B build/Release -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/tmp/install/DataSketches
    $ cmake --build build/Release -t install

To generate an installable package using CMake's built-in CPack packaging tool, use the following commands. The type of packaging is controlled by the CPACK_GENERATOR variable (a semicolon-separated list). CPack supports packaging types such as RPM, DEB, STGZ, TGZ, TZ, ZIP, etc.

    $ cmake3 -S . -B build/Release -DCMAKE_BUILD_TYPE=Release -DCPACK_GENERATOR="RPM;STGZ;TGZ" 
    $ cmake3 --build build/Release -t package

The DataSketches project can be included in other projects' CMakeLists.txt files in one of two ways. If DataSketches has been installed on the host (using an RPM or DEB package, "make install" into /usr/local, or some other means), then CMake's find_package command can be used like this:

    find_package(DataSketches 3.2 REQUIRED)
    target_link_libraries(my_dependent_target PUBLIC ${DATASKETCHES_LIB})

When used with find_package, DataSketches exports several variables, including

  • DATASKETCHES_VERSION: The version number of the datasketches package that was imported.
  • DATASKETCHES_INCLUDE_DIR: The directory that should be added to access DataSketches include files. Because CMake automatically includes the interface directories for linked target libraries when using target_link_libraries, under normal circumstances there will be no need to include this directly.
  • DATASKETCHES_LIB: The name of the DataSketches target to include as a dependency. Projects pulling in DataSketches should reference this with target_link_libraries in order to set up all the correct dependencies and include paths.
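
For reference, a minimal consumer CMakeLists.txt using this mechanism might look like the following sketch (the project and target names here are placeholders, not from this repository):

```cmake
cmake_minimum_required(VERSION 3.12)
project(my_project CXX)

add_executable(my_dependent_target main.cpp)

# Locate an installed DataSketches package and link against the exported
# target; linking also propagates the include directories automatically.
find_package(DataSketches 3.2 REQUIRED)
target_link_libraries(my_dependent_target PUBLIC ${DATASKETCHES_LIB})
```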

If you don't have DataSketches installed locally, dependent projects can pull it directly from GitHub using CMake's ExternalProject module. The code would look something like this:

    cmake_policy(SET CMP0097 NEW)
    include(ExternalProject)
    ExternalProject_Add(datasketches
        GIT_REPOSITORY https://github.com/apache/datasketches-cpp.git
        GIT_TAG 3.2.0
        GIT_SHALLOW true
        GIT_SUBMODULES ""
        INSTALL_DIR /tmp/datasketches-prefix
        CMAKE_ARGS -DBUILD_TESTS=OFF -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} -DCMAKE_INSTALL_PREFIX=/tmp/datasketches-prefix

        # Override the install command to add DESTDIR
        # This is necessary to work around an oddity in the RPM (but not other) package
        # generation, as CMake otherwise picks up the DataSketches files when building
        # an RPM for a dependent package. (RPM scans the directory for files in addition to installing
        # those files referenced in an "install" rule in the cmake file)
        INSTALL_COMMAND env DESTDIR= ${CMAKE_COMMAND} --build . --target install
    )
    ExternalProject_Get_property(datasketches INSTALL_DIR)
    set(datasketches_INSTALL_DIR ${INSTALL_DIR})
    message("Source dir of datasketches = ${datasketches_INSTALL_DIR}")
    target_include_directories(my_dependent_target 
                                PRIVATE ${datasketches_INSTALL_DIR}/include/DataSketches)
    add_dependencies(my_dependent_target datasketches)

datasketches-cpp's People

Contributors

alexandersaydakov, alexey-milovidov, aseure, b-atanasov, bryanherger, c-dickens, chufucun, claudenw, etseidl, fivepapertigers, fluorinedog, gabm, gaborkaszab, jamie256, jghoman, jmalkin, leerho, mdhimes, tadejstajner, why520it, will-lauer

datasketches-cpp's Issues

theta_a_not_b misbehaving on ordered sketches

When called on two ordered sketches, a_not_b appears to severely overestimate its result in some cases, with a_not_b(a, b) having an estimate larger than that of a when b is larger than a.

Here is a small reproduction (on the current master):

>>> import random
>>> import uuid
>>> import datasketches
>>> ids = [str(uuid.uuid4()) for _ in range(100_000)]
>>> a_ids = random.sample(ids, 10_000)
>>> b_ids = random.sample(ids, 25_000)
>>> len(set(a_ids) - set(b_ids))
7548
>>> a = datasketches.update_theta_sketch()
>>> b = datasketches.update_theta_sketch()
>>> for x in a_ids:
...     a.update(x)
...
>>> for x in b_ids:
...     b.update(x)
...
>>> a_compact = a.compact()
>>> b_compact = b.compact()
>>> print(a)
### Update Theta sketch summary:
   lg nominal size      : 12
   lg current size      : 13
   num retained keys    : 5333
   resize factor        : 8
   sampling probability : 1
   seed hash            : 37836
   empty?               : false
   ordered?             : false
   estimation mode?     : true
   theta (fraction)     : 0.533299
   theta (raw 64-bit)   : 4918811584238243237
   estimate             : 10000
   lower bound 95% conf : 9813.24
   upper bound 95% conf : 10190.3
### End sketch summary

>>> print(b)
### Update Theta sketch summary:
   lg nominal size      : 12
   lg current size      : 13
   num retained keys    : 7074
   resize factor        : 8
   sampling probability : 1
   seed hash            : 37836
   empty?               : false
   ordered?             : false
   estimation mode?     : true
   theta (fraction)     : 0.28443
   theta (raw 64-bit)   : 2623405683183212657
   estimate             : 24870.8
   lower bound 95% conf : 24373.3
   upper bound 95% conf : 25378.4
### End sketch summary

>>> for a_ in (a, a_compact):
...     for b_ in (b, b_compact):
...             print(datasketches.theta_a_not_b().compute(a_, b_))
...
### Compact Theta sketch summary:
   num retained keys    : 2178
   seed hash            : 37836
   empty?               : false
   ordered?             : true
   estimation mode?     : true
   theta (fraction)     : 0.28443
   theta (raw 64-bit)   : 2623405683183212657
   estimate             : 7657.41
   lower bound 95% conf : 7382.58
   upper bound 95% conf : 7942.37
### End sketch summary

### Compact Theta sketch summary:
   num retained keys    : 2178
   seed hash            : 37836
   empty?               : false
   ordered?             : true
   estimation mode?     : true
   theta (fraction)     : 0.28443
   theta (raw 64-bit)   : 2623405683183212657
   estimate             : 7657.41
   lower bound 95% conf : 7382.58
   upper bound 95% conf : 7942.37
### End sketch summary

### Compact Theta sketch summary:
   num retained keys    : 2178
   seed hash            : 37836
   empty?               : false
   ordered?             : true
   estimation mode?     : true
   theta (fraction)     : 0.28443
   theta (raw 64-bit)   : 2623405683183212657
   estimate             : 7657.41
   lower bound 95% conf : 7382.58
   upper bound 95% conf : 7942.37
### End sketch summary

### Compact Theta sketch summary:
   num retained keys    : 4635
   seed hash            : 37836
   empty?               : false
   ordered?             : true
   estimation mode?     : true
   theta (fraction)     : 0.28443
   theta (raw 64-bit)   : 2623405683183212657
   estimate             : 16295.7
   lower bound 95% conf : 15893.5
   upper bound 95% conf : 16708
### End sketch summary

In the first three cases, a_not_b is called with at least one unordered sketch, and the result is correct.
In the last case it is called with two ordered sketches, and the estimate (16295.7) is higher than the estimate of a (10000).

(When calling it on unordered compact sketches, or when a > b, the results are correct.)
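
For reference, the invariant that the ordered/ordered path violates can be checked with plain Python sets. This is a stdlib-only sanity baseline (it does not use datasketches): the exact difference A \ B can never be larger than A, so a_not_b's estimate should never materially exceed a's estimate either.

```python
import random
import uuid

# Exact-set baseline mirroring the sketch reproduction above.
ids = [str(uuid.uuid4()) for _ in range(100_000)]
a_ids = set(random.sample(ids, 10_000))
b_ids = set(random.sample(ids, 25_000))

diff = a_ids - b_ids

# |A \ B| is bounded above by |A| and below by |A| - |B|.
assert len(a_ids) - len(b_ids) <= len(diff) <= len(a_ids)
print(len(diff))
```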

Examples of Tuple Sketches

Do we have some online code examples for theta sketches in Python? Also, can I use a theta sketch to store precomputed aggregates for machine learning models?

UndefinedBehaviorSanitizer failed, when serializing after using theta_a_not_b

We are integrating DataSketches with ClickHouse.
In one test case, serializing after using theta_a_not_b triggers undefined behavior (the binary is built with -fsanitize=undefined). The stack trace:

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../contrib/datasketches-cpp/common/include/memory_operations.hpp:51:15 in
../contrib/datasketches-cpp/common/include/memory_operations.hpp:51:15: runtime error: null pointer passed as argument 2, which is declared to never be null
#0 0x148f9160 in datasketches::copy_to_mem(void const*, void*, unsigned long) build_docker/../contrib/datasketches-cpp/common/include/memory_operations.hpp:51:3
    #1 0x148f9160 in datasketches::compact_theta_sketch_alloc<std::__1::allocator<unsigned long> >::serialize(unsigned int) const build_docker/../contrib/datasketches-cpp/theta/include/theta_sketch_impl.hpp:391:12
    #2 0x148f8849 in DB::ThetaSketchData<unsigned long>::write(DB::WriteBuffer&) const build_docker/../src/AggregateFunctions/ThetaSketchData.h:158:49
    #3 0x24aa975a in DB::SerializationAggregateFunction::serializeBinaryBulk(DB::IColumn const&, DB::WriteBuffer&, unsigned long, unsigned long) const build_docker/../src/DataTypes/Serializations/SerializationAggregateFunction.cpp:73:19
    #4 0x26fd7b85 in DB::writeData(DB::ISerialization const&, COW<DB::IColumn>::immutable_ptr<DB::IColumn> const&, DB::WriteBuffer&, unsigned long, unsigned long) build_docker/../src/Formats/NativeWriter.cpp:63:19
    #5 0x26fd73f2 in DB::NativeWriter::write(DB::Block const&) build_docker/../src/Formats/NativeWriter.cpp:164:13
    #6 0x26f946d4 in DB::TCPHandler::sendData(DB::Block const&) build_docker/../src/Server/TCPHandler.cpp:1742:26
    #7 0x26f90829 in DB::TCPHandler::processOrdinaryQueryWithProcessors() build_docker/../src/Server/TCPHandler.cpp:748:21
    #8 0x26f821f3 in DB::TCPHandler::runImpl() build_docker/../src/Server/TCPHandler.cpp:368:17
    #9 0x26fa22b9 in DB::TCPHandler::run() build_docker/../src/Server/TCPHandler.cpp:1866:9
    #10 0x27ee7e0b in Poco::Net::TCPServerConnection::start() build_docker/../contrib/poco/Net/src/TCPServerConnection.cpp:43:3
    #11 0x27ee82f9 in Poco::Net::TCPServerDispatcher::run() build_docker/../contrib/poco/Net/src/TCPServerDispatcher.cpp:115:20
    #12 0x2805ffe6 in Poco::PooledThread::run() build_docker/../contrib/poco/Foundation/src/ThreadPool.cpp:199:14
    #13 0x2805dace in Poco::ThreadImpl::runnableEntry(void*) build_docker/../contrib/poco/Foundation/src/Thread_POSIX.cpp:345:27
    #14 0x7fb8e0647608 in start_thread /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:477:8
    #15 0x7fb8e056c132 in __clone /build/glibc-SzIz7B/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The following test code reproduces it:

#include <theta_a_not_b.hpp>
#include <theta_union.hpp>

int main()
{
    datasketches::theta_union tmp_union = datasketches::theta_union::builder().build();
    datasketches::update_theta_sketch tmp_update = datasketches::update_theta_sketch::builder().build();
    datasketches::update_theta_sketch tmp_update2 = datasketches::update_theta_sketch::builder().build();

    for (int64_t i = 0; i < 6144; ++i)
    {
        tmp_update2.update(&i, 8);
        if (i > 1023)
            tmp_update.update(&i, 8);
    }

    tmp_union.update(tmp_update);
    tmp_update = datasketches::update_theta_sketch::builder().build();

    datasketches::theta_a_not_b a_not_b;
    datasketches::compact_theta_sketch result = a_not_b.compute(tmp_union.get_result(), tmp_update2);

    tmp_union = datasketches::theta_union::builder().build();
    tmp_union.update(result);
    auto bytes = tmp_union.get_result().serialize();
}

compiler: clang15
compile_options: -g -fno-omit-frame-pointer -DSANITIZER -fsanitize=undefined -fno-sanitize-recover=all -fno-sanitize=float-divide-by-zero
link_options: -fuse-ld=lld -fsanitize=undefined

Related: ClickHouse/ClickHouse#39919

Compatible with Pyodide, Error: `pure Python 3 wheel for 'datasketches'.`

Hello,

I am trying to import the Python library into Pyodide with a custom build, but when I try to install it I receive this error:

running build_ext
-- The CXX compiler identification is GNU 8.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /src/tools/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Warning (dev) at /usr/local/lib/python3.9/site-packages/cmake/data/share/cmake-3.22/Modules/GNUInstallDirs.cmake:239 (message):
Unable to determine default CMAKE_INSTALL_LIBDIR directory because no
target architecture is known. Please enable at least one language before
including GNUInstallDirs.
Call Stack (most recent call first):
CMakeLists.txt:23 (include)
This warning is for project developers. Use -Wno-dev to suppress it.

-- Found Python3: /usr/local/bin/python (found version "3.9.5") found components: Interpreter Development Development.Module Development.Embed
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- Found pybind11: /usr/local/include (found version "2.9.1")
-- Configuring done
-- Generating done
-- Build files have been written to: /src/packages/datasketches/build/datasketches-3.2.0.1/build/temp.emscripten_wasm32-3.9
[ 20%] Building CXX object python/CMakeFiles/python.dir/src/datasketches.cpp.o
[ 20%] Building CXX object python/CMakeFiles/python.dir/src/hll_wrapper.cpp.o
[ 30%] Building CXX object python/CMakeFiles/python.dir/src/kll_wrapper.cpp.o
[ 40%] Building CXX object python/CMakeFiles/python.dir/src/cpc_wrapper.cpp.o
[ 50%] Building CXX object python/CMakeFiles/python.dir/src/fi_wrapper.cpp.o
[ 60%] Building CXX object python/CMakeFiles/python.dir/src/theta_wrapper.cpp.o
[ 70%] Building CXX object python/CMakeFiles/python.dir/src/vo_wrapper.cpp.o
[ 80%] Building CXX object python/CMakeFiles/python.dir/src/req_wrapper.cpp.o
[ 90%] Building CXX object python/CMakeFiles/python.dir/src/vector_of_kll.cpp.o
[100%] Linking CXX shared module ../../lib.emscripten_wasm32-3.9/datasketches.cpython-39-x86_64-linux-gnu.so
make[3]: *** [python/CMakeFiles/python.dir/build.make:226: ../lib.emscripten_wasm32-3.9/datasketches.cpython-39-x86_64-linux-gnu.so] Error 1
make[3]: *** Deleting file '../lib.emscripten_wasm32-3.9/datasketches.cpython-39-x86_64-linux-gnu.so'
make[2]: *** [CMakeFiles/Makefile2:657: python/CMakeFiles/python.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:664: python/CMakeFiles/python.dir/rule] Error 2
make: *** [Makefile:309: python] Error 2
Traceback (most recent call last):
File "/src/packages/datasketches/build/datasketches-3.2.0.1/setup.py", line 82, in
setup(
File "/usr/local/lib/python3.9/site-packages/setuptools/init.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/usr/local/lib/python3.9/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/local/lib/python3.9/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/usr/local/lib/python3.9/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.9/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/usr/local/lib/python3.9/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/local/lib/python3.9/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/src/packages/datasketches/build/datasketches-3.2.0.1/setup.py", line 45, in run
self.build_extension(ext)
File "/src/packages/datasketches/build/datasketches-3.2.0.1/setup.py", line 78, in build_extension
subprocess.check_call(['cmake', '--build', '.', '--target', 'python'] + build_args,
File "/usr/local/lib/python3.9/subprocess.py", line 373, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'python', '--config', 'Release', '--', '-j2']' returned non-zero exit status 2.
[2022-02-09 21:23:39] Failed building package datasketches in 47.0 seconds.
Traceback (most recent call last):
File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/src/pyodide-build/pyodide_build/main.py", line 62, in
main()
File "/src/pyodide-build/pyodide_build/main.py", line 56, in main
args.func(args)
File "/src/pyodide-build/pyodide_build/buildpkg.py", line 821, in main
build_package(
File "/src/pyodide-build/pyodide_build/buildpkg.py", line 655, in build_package
compile(
File "/src/pyodide-build/pyodide_build/buildpkg.py", line 385, in compile
pywasmcross.capture_compile(
File "/src/pyodide-build/pyodide_build/pywasmcross.py", line 163, in capture_compile
result.check_returncode()
File "/usr/local/lib/python3.9/subprocess.py", line 460, in check_returncode
raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command '['/usr/local/bin/python', 'setup.py', 'build']' returned non-zero exit status 1.

Can anyone help me build a pure Python package?

datasketches does not work with python 3.9

The datasketches module 3.2.0.1 seems to have been compiled with Python 3.10 and does not work with Python 3.9.

root@907f428e7dc2:/temp# pip install datasketches --target .
Collecting datasketches
Downloading datasketches-3.2.0.1-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (474 kB)
|████████████████████████████████| 474 kB 658 kB/s
Collecting numpy
Downloading numpy-1.22.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
|████████████████████████████████| 16.8 MB 7.3 MB/s
Installing collected packages: numpy, datasketches
Successfully installed datasketches-3.2.0.1 numpy-1.22.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 21.2.4; however, version 22.0.3 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.

root@907f428e7dc2:/temp# ls -lrt
total 1056
drwxr-xr-x 2 root root 4096 Feb 12 06:16 numpy.libs
drwxr-xr-x 19 root root 4096 Feb 12 06:16 numpy
drwxr-xr-x 2 root root 4096 Feb 12 06:16 bin
drwxr-xr-x 2 root root 4096 Feb 12 06:16 numpy-1.22.2.dist-info
-rwxr-xr-x 1 root root 1049008 Feb 12 06:16 datasketches.cpython-310-x86_64-linux-gnu.so
drwxr-xr-x 3 root root 4096 Feb 12 06:16 tests
drwxr-xr-x 3 root root 4096 Feb 12 06:16 src
drwxr-xr-x 2 root root 4096 Feb 12 06:16 datasketches-3.2.0.1.dist-info

root@907f428e7dc2:/temp# python
Python 3.9.10 (main, Jan 26 2022, 20:56:53)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.

import datasketches
Traceback (most recent call last):
File "", line 1, in
ModuleNotFoundError: No module named 'datasketches'

As per the link below, 3.9 is supported:
https://www.piwheels.org/project/datasketches/

But the compiled library is that of 3.10 : datasketches.cpython-310-x86_64-linux-gnu.so

The Python package for ARM MacOS has an x86_64 datasketches.so in it

When I install datasketches 3.5.0 with pip and then try to import it, I get the following on an ARM-based MacBook:

>>> import datasketches
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: dlopen(/Users/username/venv/lib/python3.9/site-packages/datasketches.so, 0x0002): tried: '/Users/username/venv/lib/python3.9/site-packages/datasketches.so' (mach-o file, but is an incompatible architecture (have (x86_64), need (arm64e)))

I downloaded the file datasketches-3.5.0-cp39-cp39-macosx_11_0_arm64.whl from pypi.org, extracted datasketches.so from it, and then checked the platform; it's the wrong platform:

❯ file datasketches.so
datasketches.so: Mach-O 64-bit bundle x86_64

[python packaging] Top-level name problems

Hi, looking at the 3.4.0 wheel for datasketches as released, the *.dist-info/top_level.txt file contains:

datasketches
src
tests

The top-level dir "src" should not be included at all; the one file it contains is not useful.
The top-level dir "tests" should be called something more unique (like "datasketches_tests") if you intend to package it in the wheel. Alternatively just keep this in the sdist?

Both of these are likely to cause conflicts with other projects.

Getting Attribute Error: type object 'datasketches.theta_sketch' has no attribute 'deserialize'

We are using datasketches library to perform union/intersection operations and compute datasketches. Everything was working fine on an older laptop with datasketches===2.2.0-incubating-SNAPSHOT.

We tried installing the latest version of the datasketches module (datasketches==3.1.0.dev0) and now it's failing to run our Python script with AttributeError: type object 'datasketches.theta_sketch' has no attribute 'deserialize'.

I've attached the terminal commands and output in the following text file
Datasketches_installtion_terminal.txt

This is the function in which we're trying to call deserialize:

def get_sketches(bucket, values, logger):
    value_string = str([str(i) for i in values])[1:-1]

    SELECT_QUERY = "SELECT value, sketch FROM {table} WHERE bucket='{bucket}' AND value in ({values}) AND timestamp::DATE = '{date}'"
    query = SELECT_QUERY.format(
        table=SINGLE_DAY_SKETCH_TABLE,
        bucket=bucket,
        values=value_string,
        date=DATE
    )
    logger.info("SQL: " + query)
    result = run_select_query(query)
    logger.info("Data fetched from DB.")

    sketch_list = []
    for row in result:
        if bucket in ('brand', 'sic', 'dma'):
            value = int(row[0])
        else:
            value = row[0]
        sketch_string = row[1]
        sketch_obj = datasketches.theta_sketch.deserialize(bytes(sketch_string))
        sketch_list.append((value, sketch_obj))

    return sketch_list

This script works perfectly fine on the older laptop with version 2.2.0.

Would really appreciate any help on this.

Thanks

kll_sketch throws an exception when the number of inserted items is small

#include <iostream>
#include <kll_sketch.hpp>
using namespace datasketches;
using namespace std;

int main()
{
    using T = int8_t;
    kll_sketch<T> kll;
    int N = 3;
    kll.update(-8);
    kll.update(-8);
    kll.update(-8);
    
    auto blob = kll.serialize();
    std::cout << "blobsize=" <<  blob.size() << std::endl;
    auto kll2 = kll_sketch<T>::deserialize(blob.data(), blob.size());
    return 0;
}

will lead to

libc++abi: terminating with uncaught exception of type std::out_of_range: Insufficient buffer size detected: bytes available 29, minimum needed 32

Seems that ensure_minimum_memory(size, 1ULL << preamble_ints); at kll_sketch_impl.hpp:578 is incorrect.

Lack of cross-validation testing between C++, Java

Issue #149 would have been easily discovered if we had cross-validation testing between Java and C++. This is a critical bug against the release process. To clear this bug I am suggesting that we can demonstrate:

  • An automated or semi-automated cron job that can be easily extended. This job would run tests in Java and C++ that produce binary results that can be directly compared without having to resort to static stored binaries.

  • This job should be run at least nightly, and run against any Release Candidate.

The qualifications required to clear this bug are certainly open for discussion.

kll_sketch::update & get_rank

I have a use case where I want to get the approximate rank of each item inserted via update in a kll_sketch. Currently, calling update followed by get_rank each time reduces performance significantly (about 10x slower in my tests).

Instead, would it be possible to speed up the get_rank computation by leveraging the fact that we already know the position of the value in the items_ buffer, since it is returned by kll_sketch::internal_update?
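
For comparison, the desired update-then-rank pattern has a simple exact stdlib baseline using bisect (this is not the datasketches API, just a reference implementation of the semantics): the insert positions the value in a sorted buffer, and the rank query then reuses that position instead of searching from scratch, which is the kind of reuse being proposed for internal_update.

```python
import bisect

class ExactRankTracker:
    """Exact analogue of update() followed by get_rank(): keeps a sorted
    buffer and reports the fraction of items strictly below the value."""

    def __init__(self):
        self._sorted = []

    def update_and_rank(self, value):
        bisect.insort(self._sorted, value)               # insert in sorted order
        below = bisect.bisect_left(self._sorted, value)  # count of items < value
        return below / len(self._sorted)

tracker = ExactRankTracker()
ranks = [tracker.update_and_rank(v) for v in [5, 1, 9, 3]]
print(ranks)  # [0.0, 0.0, 0.6666666666666666, 0.25]
```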

Python wrapper doesn't work

Trying to use the Python wrapper for datasketches-cpp to use the theta sketches. Tried pip install from git as well as using the setup script. I get this error on importing datasketches:

ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/datasketches.so, 2): Symbol not found: __ZN12datasketches25update_theta_sketch_allocINSt3__19allocatorIvEEE7builder12DEFAULT_LG_KE

DRIFT_LIMIT in reverse_purge_hash_map

Hi, thanks for your work on this library!

I've begun testing the frequent items sketch, and I'm hitting the "drift limit reached" exception in reverse_purge_hash_map under production data streams. The comment indicates it is used only for stress testing, but it throws a std::logic_error. I tried doubling the limit to 2048 but still see this exception (less frequently), and I'd like not to have to adjust this value locally in my code base.

Can you discuss the significance of this and how to address it? The "drift" tracking seems to indicate hash collisions, but I'm not sure of the best way forward.

the theta_union estimate changes dramatically with merge order

I compared datasketches-java and datasketches-cpp and found different behavior between Java and C++.

See the code below:

datasketches::update_theta_sketch update_sketch1 = datasketches::update_theta_sketch::builder().set_lg_k(14).build();
for(int i = 0; i < 16384; i++) update_sketch1.update(i);

datasketches::update_theta_sketch update_sketch2 = datasketches::update_theta_sketch::builder().set_lg_k(14).build();
for(int i = 0; i < 26384; i++) update_sketch2.update(i);

datasketches::update_theta_sketch update_sketch3 = datasketches::update_theta_sketch::builder().set_lg_k(14).build();
for(int i = 0; i < 86384; i++) update_sketch3.update(i);

datasketches::theta_union sk_union1 = datasketches::theta_union::builder().set_lg_k(16).build();
sk_union1.update(update_sketch1);
sk_union1.update(update_sketch2);
sk_union1.update(update_sketch3);
std::cout <<"test2: result1: " << sk_union1.get_result().get_estimate() << std::endl;

datasketches::theta_union sk_union2 = datasketches::theta_union::builder().set_lg_k(16).build();
sk_union2.update(update_sketch1);
sk_union2.update(update_sketch3);
sk_union2.update(update_sketch2);
std::cout <<"test2: result2: " << sk_union2.get_result().get_estimate() << std::endl;   

The result is:
test2: result1: 151281
test2: result2: 126358

The result is incorrect! There is a big gap between 151281 and 126358 because of the different merge orders:
[update_sketch1 -> update_sketch2 -> update_sketch3]
[update_sketch1 -> update_sketch3 -> update_sketch2]

But doing the same thing with datasketches-java, I didn't find any problems.

I set the update sketches' lg_k = 14 and the theta_union's lg_k = 16. If the theta_union's lg_k is greater than the update sketches' lg_k there is a problem, but if the theta_union's lg_k is less than or equal to the update sketches' lg_k there is no problem.

Can we fix this problem?
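
For reference, the exact operation being approximated here is order-independent: set union is commutative and associative, so the two merge orders above must describe the same set. A stdlib-only illustration (not using datasketches) of the invariant the union estimates should respect:

```python
# Same three value ranges as the update sketches above, as exact Python sets.
s1 = set(range(16384))
s2 = set(range(26384))
s3 = set(range(86384))

# Union order must not change the result.
union_123 = (s1 | s2) | s3
union_132 = (s1 | s3) | s2
assert union_123 == union_132

print(len(union_123))  # 86384 exact distinct values
```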

error: redefinition of ‘uint16_t datasketches::compute_seed_hash(uint64_t)’

When a file includes the following two kinds of sketch headers, it reports error: redefinition of ‘uint16_t datasketches::compute_seed_hash(uint64_t)’:

#include <theta_sketch.hpp>
#include <cpc_sketch.hpp>

Test Case:
https://github.com/chufucun/datasketches-cpp/blob/test/theta/test/multi_sketch_test.cpp

====================[ Build | theta_test | Debug ]==============================
/home/impdev/Impala/toolchain/toolchain-packages-gcc7.5.0/cmake-3.14.3/bin/cmake --build /home/impdev/datasketches-cpp/cmake-build-debug --target theta_test -- -j 9
[ 22%] Built target common_test
Scanning dependencies of target theta_test
[ 33%] Building CXX object theta/test/CMakeFiles/theta_test.dir/multi_sketch_test.cpp.o
In file included from /home/impdev/datasketches-cpp/cpc/include/cpc_compressor_impl.hpp:28:0,
                 from /home/impdev/datasketches-cpp/cpc/include/cpc_compressor.hpp:145,
                 from /home/impdev/datasketches-cpp/cpc/include/cpc_sketch.hpp:30,
                 from /home/impdev/datasketches-cpp/theta/test/multi_sketch_test.cpp:25:
/home/impdev/datasketches-cpp/cpc/include/cpc_util.hpp: In function ‘uint16_t datasketches::compute_seed_hash(uint64_t)’:
/home/impdev/datasketches-cpp/cpc/include/cpc_util.hpp:27:24: error: redefinition of ‘uint16_t datasketches::compute_seed_hash(uint64_t)’
 static inline uint16_t compute_seed_hash(uint64_t seed) {
                        ^~~~~~~~~~~~~~~~~
In file included from /home/impdev/datasketches-cpp/theta/include/theta_sketch.hpp:23:0,
                 from /home/impdev/datasketches-cpp/theta/test/multi_sketch_test.cpp:24:
/home/impdev/datasketches-cpp/theta/include/theta_update_sketch_base.hpp:188:24: note: ‘uint16_t datasketches::compute_seed_hash(uint64_t)’ previously defined here
 static inline uint16_t compute_seed_hash(uint64_t seed) {
                        ^~~~~~~~~~~~~~~~~
gmake[3]: *** [theta/test/CMakeFiles/theta_test.dir/multi_sketch_test.cpp.o] Error 1
gmake[2]: *** [theta/test/CMakeFiles/theta_test.dir/all] Error 2
gmake[1]: *** [theta/test/CMakeFiles/theta_test.dir/rule] Error 2
gmake: *** [theta_test] Error 2

Test Case: chufucun@6effa9b

Cannot deserialize Theta EmptyCompactSketch from Java

In Java, an empty Theta compact sketch is represented by the singleton EmptyCompactSketch, which has a seed hash of 0.

It is represented by the constant byte array [1, 3, 3, 0, 0, 0x1E, 0, 0], as defined here.
Deserializing this byte array from Java returns EmptyCompactSketch as expected (snippet using Ammonite):

@ import $ivy.`org.apache.datasketches:datasketches-java:1.3.0-incubating`, org.apache.datasketches.theta._, org.apache.datasketches.memory._
import $ivy.$                                                           , org.apache.datasketches.theta._, org.apache.datasketches.memory._

@ val emptySketch = (new UpdateSketchBuilder).build.compact
emptySketch: CompactSketch =
### EmptyCompactSketch SUMMARY:
   Estimate                : 0.0
   Upper Bound, 95% conf   : 0.0
   Lower Bound, 95% conf   : 0.0
   Theta (double)          : 1.0
   Theta (long)            : 9223372036854775807
   Theta (long) hex        : 7fffffffffffffff
   EstMode?                : false
   Empty?                  : true
   Retained Entries        : 0
   Seed Hash               : 0 | 0
### END SKETCH SUMMARY


@ emptySketch.toByteArray
res2: Array[Byte] = Array(1, 3, 3, 0, 0, 30, 0, 0)

@ Sketch.wrap(Memory.wrap(Array[Byte](1, 3, 3, 0, 0, 0x1E, 0, 0)))
res3: Sketch =
### EmptyCompactSketch SUMMARY:
   Estimate                : 0.0
   Upper Bound, 95% conf   : 0.0
   Lower Bound, 95% conf   : 0.0
   Theta (double)          : 1.0
   Theta (long)            : 9223372036854775807
   Theta (long) hex        : 7fffffffffffffff
   EstMode?                : false
   Empty?                  : true
   Retained Entries        : 0
   Seed Hash               : 0 | 0
### END SKETCH SUMMARY

In C++ (and in the Python bindings), deserializing the same bytes fails with a seed hash mismatch:

>>> import datasketches
>>> datasketches.theta_sketch.deserialize(bytes([1, 3, 3, 0, 0, 0x1E, 0, 0]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Sketch seed hash mismatch: expected 37836, actual 0

I would expect to get an empty sketch instead.

Ignoring the seed hash for the empty sketch looks like an explicit design decision in the Java project, so I'm raising the issue here instead of over there.
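Since the Java empty sketch always carries a seed hash of 0, one workaround on the consuming side is to recognize the empty preamble before handing the bytes to the deserializer. The byte positions below are an assumption based on the published Theta serialization layout, not an API of this library:

```python
def is_empty_theta_preamble(buf: bytes) -> bool:
    # Assumed layout: byte 0 = number of preamble longs, byte 5 = flags byte;
    # bit 2 of the flags byte (value 4) marks the sketch as empty.
    EMPTY_FLAG = 1 << 2
    return len(buf) >= 6 and buf[0] == 1 and (buf[5] & EMPTY_FLAG) != 0

# the constant byte array that Java's EmptyCompactSketch serializes to
java_empty = bytes([1, 3, 3, 0, 0, 0x1E, 0, 0])
print(is_empty_theta_preamble(java_empty))  # True
```

A caller could substitute a freshly built empty sketch whenever this check fires, instead of letting the seed hash comparison throw.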

Is it possible to estimate the fraction of a single frequent value?

After building the kll_sketch with a stream of values, we call get_quantiles(101) to get 100 buckets with lower and upper bounds, then estimate each bucket's fraction by calling get_ratio for each bound to get the closest estimate.

We found that if there is a frequent value X with a 30% fraction, there will be around 30 buckets with lower_bound == upper_bound == X, which is understandable, and we merge these buckets into one.

However, I wonder if it's possible to get the estimated count of a single value directly from kll_sketch? That would make our life much easier.
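For reference, the rank-difference idea on exact data looks like the following; with a sketch one would substitute its rank queries (one bound inclusive of x, one exclusive) for the two bisect calls, at the cost of the sketch's rank error. The function name here is illustrative, not part of the KLL API:

```python
import bisect

def fraction_of_value(sorted_vals, x):
    # fraction of the stream equal to x, via a rank difference:
    # rank just above x minus rank just below x
    lo = bisect.bisect_left(sorted_vals, x)   # count of items strictly < x
    hi = bisect.bisect_right(sorted_vals, x)  # count of items <= x
    return (hi - lo) / len(sorted_vals)

vals = sorted([1.0] * 3 + [2.0] * 6 + [3.0] * 1)
print(fraction_of_value(vals, 2.0))  # 0.6
```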

Use non-colliding family id in count-min

In order to have proper binary compatibility between C++ and Java and allow valid deserialization checks, we need to ensure we have a common family_id namespace. In Java it exists in a single enum class: https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/common/Family.java

The count-min sketch's family_id currently collides with Java's Alpha variant of the theta sketch: https://github.com/apache/datasketches-cpp/blob/master/count/include/count_min.hpp#L322
Let's assign count-min an id of 18, and then make a PR against the Java repo to reserve that value in the Family enum.

Test failures on hll branch

$ uname -a
Darwin pc049 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 21 20:07:39 PDT 2018; root:xnu-3789.73.14~1/RELEASE_X86_64 x86_64

$ cc --version
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

$ c++ --version
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
$./test_runner 
..............................E...................E............### KLL sketch summary:
   K              : 200
   min K          : 200
   M              : 8
   N              : 1000
   Epsilon        : 1.33%
   Epsilon PMF    : 1.65%
   Empty          : false
   Estimation mode: true
   Levels         : 3
   Sorted         : true
   Capacity items : 422
   Retained items : 324
   Storage bytes  : 0
   Min value      : 0
   Max value      : 999
### End sketch summary



!!!FAILURES!!!
Test Results:
Run:  61   Failures: 0   Errors: 2


1) test: datasketches::ToFromByteArray::deserializeFromJava (E) 
uncaught exception of type std::exception (or derived).
- Attempt to deserialize Unknown object type


2) test: datasketches::kll_sketch_test::deserialize_from_java (E) 
uncaught exception of type std::exception (or derived).
- ios_base::clear: unspecified iostream_category error

Java <-> CPP Theta sketch compatibility issue

I am unsure if this is user error or a bug. I have been attempting to exchange serialized Theta sketches between the Java and C++ implementations. I am able to read sketches encoded from C++ in Java without any issue, but not the other way around. Is this perhaps an endianness issue? Both libraries are using Theta sketch serial version 3.

If I use the sketches contained in the C++ test suite that say they are from Java, they do work. If I use a sketch that I wrote out from Java (e.g. sketch.compact().toByteArray()), which reads back just fine from Java, I get a mismatch error. Error from the C++ version when opening a sketch written from Java:

sketch type mismatch: expected 3, actual 2

Relevant versions:

  • Java: datasketches-java-2.0.0
  • C++: datasketches-cpp-3.1.0

Hardware/OS:

  • MacBook Pro
  • Intel CPU
  • macOS 10.13.6

Hex of the first part of the sketch written from C++:

0000000 03 03 03 00 00 1a cc 93 a0 19 00 00 00 00 00 00
0000010 a1 37 94 51 48 60 d6 00 c8 39 1e 63 1c 06 00 00
0000020 11 58 cf 9c 60 1b 00 00 54 2a 25 b0 35 2a 00 00

From Java:

0000000 c3 03 02 0c 0d 00 cc 93 dc 19 00 00 00 00 80 3f
0000010 93 d3 21 31 ff 31 5e 05 00 c0 b0 da 96 a4 d1 03
0000020 0a 7f 4f 5c 38 62 d6 00 02 20 bd e2 7c 54 49 01

Any help is appreciated.

Is de-serialized size different from serialized size?

Hello again, I've run into a corner case I think is worth pointing out.

I was trying to serialize a vector of kll sketches into a fixed-size buffer backed by a heap-allocated char array. While the serialization works fine, when I try to de-serialize the elements into an array of the same size I run into issues with reading beyond the buffer length. I calculate the total size of the sketches by calling get_serialized_size_bytes() for each item; I know this size does not include one extra integer, as noted in the code. The serialized size of all the sketches (plus a uint64 for the vector size) is 9340 bytes.

Here's a printout of how many bytes are written/read with the buffers.

The error comes from within the kll_sketch(uint16_t k, uint8_t flags, std::istream& is) constructor, specifically from deserialize_items<T>(is, &items_[levels_[0]], num_items); (link to code), where my buffer fails with:

AssertError:read can not have position exceed buffer length: curr_ptr_: 9216 read size: 1024 buffer_size: 9340

Any idea why that line would trigger a read of 1024 bytes, while the remaining sketch data is only 9340 - 9216 = 124 bytes? When I use non-fixed-size buffers, ser/deser works fine.

Weighted version of the KLL sketch?

Hello,

There's a consideration at XGBoost about potentially using the KLL sketch to represent feature value histograms.

One potential blocker is the need for a weighted version of the sketch; this would allow us to use data points that are weighted and adjust their feature contributions accordingly (see Appendix A of the XGBoost paper).

I remember discussing in the past the possibility of using data weights with KLL, is that still an option?

Is it possible to estimate the max byte size of a kll sketch?

The kll sketch currently provides a get_serialized_size_bytes() function that returns the current byte size of a sketch.

For some communication frameworks (like rabit which is also backed by MPI) we need to allocate data structures of the maximum possible size, to ensure that after merging together sketches from multiple workers they don't overlap in the byte array being reduced.

For an example use case, think of a distributed tabular dataset where we want to create a sketch for each feature. We can create sketches locally and then communicate them through MPI to get the overall sketch per feature. The way rabit (and MPI, AFAIK) does communication, however, is "dense", requiring knowledge of the maximum possible size each sketch will occupy before the all-reduce step.
You can take a look at the relevant code here.

So I'm wondering whether, given the parameters of a sketch, it's possible to get the maximum possible byte size the sketch will occupy. A conservative estimate would probably work in this case as well.

Can we add an enhancement to merge update_theta_sketch?

Can we please add the method below to merge a theta sketch into an update_theta_sketch?
update_theta_sketch_alloc::merge(const theta_sketch_alloc& sketch)
Its implementation would be the same as theta_union_alloc::update(const theta_sketch_alloc& sketch).

REQ segfaults on get_pmf() with empty sketch

REQ's get_pmf() needs to return the same result as get_cdf() when the sketch is empty. In other words, it needs an added check:
if (is_empty()) return buckets;

Otherwise, the method assumes there are at least 2 valid buckets, which does not hold for an empty sketch.

std::iterator is deprecated; replace it

std::iterator is deprecated; see https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0174r2.html#2.1. It is used ten times in datasketches-cpp:

$ git grep -c std::iterator
fi/include/reverse_purge_hash_map.hpp:1
hll/include/HllArray.hpp:1
hll/include/coupon_iterator.hpp:1
kll/include/kll_sketch.hpp:1
quantiles/include/quantiles_sketch.hpp:1
req/include/req_sketch.hpp:1
sampling/include/var_opt_sketch.hpp:2
theta/include/theta_update_sketch_base.hpp:2

Replace these inheritances with explicit using/typedef declarations in each class.
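The replacement pattern looks like the following. The class here is illustrative, not one of the library's iterators; the point is that declaring the five member type aliases gives std::iterator_traits exactly what the deprecated base class used to provide:

```cpp
#include <cstddef>
#include <iterator>
#include <type_traits>

// Before (deprecated since C++17):
//   class int_span_iterator : public std::iterator<std::forward_iterator_tag, int> { ... };
// After: declare the five member type aliases directly.
class int_span_iterator {
public:
  using iterator_category = std::forward_iterator_tag;
  using value_type = int;
  using difference_type = std::ptrdiff_t;
  using pointer = int*;
  using reference = int&;

  explicit int_span_iterator(int* p): p_(p) {}
  reference operator*() const { return *p_; }
  int_span_iterator& operator++() { ++p_; return *this; }
  bool operator!=(const int_span_iterator& other) const { return p_ != other.p_; }

private:
  int* p_;
};

// std::iterator_traits picks up the aliases exactly as it did with the base class
static_assert(std::is_same<std::iterator_traits<int_span_iterator>::value_type, int>::value,
              "traits still work without the std::iterator base");
```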

sketch_union classes need reset() method

When unioning sketches together, it is often useful to be able to reset the union rather than freeing and reallocating a new object. reset() methods should be added to theta_union, tuple_union, and any and all other union classes to allow the computation to be reset back to its initial state.

Invalid free thrown at program exit.

Hello.

For my use case I'm creating a vector of KLL sketches that I'm populating with update as I iterate through the dataset. Example:

std::vector<datasketches::kll_sketch<float>> feature_sketches(NUM_FEATURES);

  // Create sketches for the complete data
  for (const auto &data_point : full_data) {
    // Iterate over non-zero values of data vector
    for (Eigen::SparseVector<float>::InnerIterator it(data_point); it; ++it) {
      feature_sketches[it.index()].update(it.value());
    }
  }

At program exit when the vector falls out of scope I get a double free error.

I ran the program with valgrind and have attached the output log here.

Any idea of the cause and how to deal with this? Should I be releasing memory in the vector manually at program exit?

Jaccard similarity doesn't handle custom seeds

If two sketches with custom seeds are loaded and passed to theta_jaccard_similarity::jaccard(), an exception is thrown for a seed hash mismatch. It looks like the problem is that Jaccard builds unions and intersections internally but has no way to pass the relevant seed to those sketches.

Remove c-style const casts in kll/ and common/ directories

I recently started integrating the Apache DataSketches C++ library into Apache Impala, and our build system (with clang-tidy) found some casts where const-ness is cast away using C-style casts. Best practice would be to use C++-style casts, like const_cast, in these cases instead.

The code popping up in the report:
1)
https://github.com/apache/incubator-datasketches-cpp/blob/762904bb168d44846c1fe4f178998c9a8a57ccba/common/include/serde.hpp#L53
This can be a reinterpret_cast<char*>(const_cast<T*>(items))
2)
https://github.com/apache/incubator-datasketches-cpp/blob/762904bb168d44846c1fe4f178998c9a8a57ccba/kll/include/kll_sketch_impl.hpp#L381
Note that fixing this line will cause the one 2 lines below to pop up in the report; fixing that reveals the same error another 2 lines below, and so on. Removing the const-ness of the local variables would solve the issue.

This ticket focuses only on the kll/ and common/ folders of the C++ library, as my work only touched those folders and I've only run the report on them. It also covers only those C-style casts where const-ness is lost during the cast.
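For illustration, the before/after shape of the first case looks roughly like this (a toy function, not the library's serde code; the real code passes the bytes to a stream write):

```cpp
#include <cstdint>

// C-style cast that silently drops const (flagged by clang-tidy):
//   char* bytes = (char*) items;
// The chained C++ casts make both the const removal and the
// type reinterpretation explicit and searchable:
char* as_writable_bytes(const std::uint64_t* items) {
  return reinterpret_cast<char*>(const_cast<std::uint64_t*>(items));
}
```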

random_utils is not thread-safe

In our project, we use multiple kll_sketch instances in parallel, but the sketch depends on random_bit:

static std::independent_bits_engine<std::mt19937, 1, uint32_t>
random_bit(static_cast<uint32_t>(std::chrono::system_clock::now().time_since_epoch().count()));
// common random declarations
namespace random_utils {
static std::random_device rd; // possibly unsafe in MinGW with GCC < 9.2
static std::mt19937_64 rand(rd());
static std::uniform_real_distribution<> next_double(0.0, 1.0);
}

which is not thread-safe, making Thread Sanitizer (TSAN) complain about a data race.

I've made a PR to fix this; it needs your approval to start CI. Please take a look.
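One common fix, sketched here under the assumption that callers only need a fresh double per call rather than a shared engine object, is to make the generator state thread_local so each thread owns its own engine:

```cpp
#include <random>

namespace random_utils {

// each thread gets its own engine and distribution,
// so concurrent calls no longer race on shared state
inline double next_double() {
  thread_local std::mt19937_64 engine(std::random_device{}());
  thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
  return dist(engine);
}

} // namespace random_utils
```

thread_local is available from C++11, which matches the library's stated baseline, though it does change the trade-off: per-thread engines are seeded independently rather than sharing one stream.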

Installation error

I've tried all the different approaches to installing the Python wrapper, but they all seem to boil down to this error:

incubator-datasketches-cpp/python/src/fi_wrapper.cpp:90:6: error: no matching member function for call to 'def' .def("merge", &frequent_items_sketch<T>::merge)

along with a few other pybind errors of the form "candidate template ignored".

What am I missing?

Issue in update_theta_sketch_alloc<A>::internal_deserialize() method

The update_theta_sketch_alloc::internal_deserialize() method returns an update_theta_sketch_alloc object, but while constructing this object the parameters lg_nom_size and lg_cur_size are not in the correct order.
Currently the object is constructed as:
update_theta_sketch_alloc(..., lg_nom_size, lg_cur_size, ...)
It should be as below so that the sketch de-serializes properly:
update_theta_sketch_alloc(..., lg_cur_size, lg_nom_size, ...)

Compile error: "error: invalid initialization of reference of type ‘std::move_iterator<const long unsigned int*>::reference {aka long unsigned int&&}’ from expression of type ‘std::remove_reference<const long unsigned int&>::type {aka const long unsigned int}’"

I attempted to compile current master (as of 05 Jul 2021) on CentOS 7.9.2009 with CMake 3.17.5 and g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), but I get the following error. Is there a minimum required g++ version? v4.8.5 seems a bit dated relative to my OS release.

[ 66%] Building CXX object theta/test/CMakeFiles/theta_test.dir/theta_a_not_b_test.cpp.o
In file included from /usr/include/c++/4.8.2/bits/stl_algobase.h:67:0,
from /usr/include/c++/4.8.2/bits/char_traits.h:39,
from /usr/include/c++/4.8.2/string:40,
from /home/bryan/github/datasketches-cpp/common/test/catch.hpp:468,
from /home/bryan/github/datasketches-cpp/theta/test/theta_a_not_b_test.cpp:20:
/usr/include/c++/4.8.2/bits/stl_iterator.h: In instantiation of ‘std::move_iterator<_Iterator>::value_type&& std::move_iterator<_Iterator>::operator*() const [with _Iterator = const long unsigned int*; std::move_iterator<_Iterator>::reference = long unsigned int&&; std::move_iterator<_Iterator>::value_type = long unsigned int]’:
/usr/include/c++/4.8.2/bits/stl_algo.h:962:13: required from ‘_OIter std::copy_if(_IIter, _IIter, _OIter, _Predicate) [with _IIter = std::move_iterator<const long unsigned int*>; _OIter = std::back_insert_iterator<std::vector<long unsigned int, std::allocator > >; _Predicate = datasketches::key_less_than<long unsigned int, long unsigned int, datasketches::trivial_extract_key>]’
/home/bryan/github/datasketches-cpp/theta/include/theta_set_difference_base_impl.hpp:49:47: required from ‘CS datasketches::theta_set_difference_base<Entry, ExtractKey, CompactSketch, Allocator>::compute(FwdSketch&&, const Sketch&, bool) const [with FwdSketch = const datasketches::wrapped_compact_theta_sketch_alloc<std::allocator >; Sketch = datasketches::wrapped_compact_theta_sketch_alloc<std::allocator >; Entry = long unsigned int; ExtractKey = datasketches::trivial_extract_key; CompactSketch = datasketches::compact_theta_sketch_alloc<std::allocator >; Allocator = std::allocator]’
/home/bryan/github/datasketches-cpp/theta/include/theta_a_not_b_impl.hpp:37:63: required from ‘datasketches::theta_a_not_b_alloc::CompactSketch datasketches::theta_a_not_b_alloc::compute(FwdSketch&&, const Sketch&, bool) const [with FwdSketch = const datasketches::wrapped_compact_theta_sketch_alloc<std::allocator >; Sketch = datasketches::wrapped_compact_theta_sketch_alloc<std::allocator >; Allocator = std::allocator; datasketches::theta_a_not_b_alloc::CompactSketch = datasketches::compact_theta_sketch_alloc<std::allocator >]’
/home/bryan/github/datasketches-cpp/theta/test/theta_a_not_b_test.cpp:186:3: required from here
/usr/include/c++/4.8.2/bits/stl_iterator.h:963:37: error: invalid initialization of reference of type ‘std::move_iterator<const long unsigned int*>::reference {aka long unsigned int&&}’ from expression of type ‘std::remove_reference<const long unsigned int&>::type {aka const long unsigned int}’
{ return std::move(*_M_current); }
^
make[2]: *** [theta/test/CMakeFiles/theta_test.dir/theta_a_not_b_test.cpp.o] Error 1
make[1]: *** [theta/test/CMakeFiles/theta_test.dir/all] Error 2
make: *** [all] Error 2

TupleSketch support for Python?

Hello devs,

Thanks for releasing and maintaining the DataSketches library and the Python bindings.
I was wondering whether there is any effort to add support for tuple sketches in the Python bindings,
or whether there was a prior effort and it's not supported for some reason?

Thanks!
Priyam

`theta_update_sketch_base` is not exception safe.

theta_update_sketch_base_impl.hpp:

template<typename EN, typename EK, typename A>
void theta_update_sketch_base<EN, EK, A>::resize() {
  ...
  lg_cur_size_ += factor;    <-- the size is increased here
  ...
  entries_ = allocator_.allocate(new_size);    <-- an exception can be thrown here
  ...
}

While unwinding the stack, destructor is called:

template<typename EN, typename EK, typename A>
theta_update_sketch_base<EN, EK, A>::~theta_update_sketch_base()
{
    ...
    const size_t size = 1 << lg_cur_size_;    <-- wrong size is calculated
    for (size_t i = 0; i < size; ++i) {
      if (EK()(entries_[i]) != 0) entries_[i].~EN();    <-- memory corruption
    }
    allocator_.deallocate(entries_, size);      <-- method is called with wrong argument
  }
}

The issue has been found while we tried to integrate the library into ClickHouse:
ClickHouse/ClickHouse#23334

ClickHouse has a powerful fuzzing infrastructure (you can read about it here).
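The usual cure is to perform the throwing allocation before mutating any member state, so that the destructor always observes a size consistent with the live allocation. A toy illustration of that ordering (not the library's code; names only mirror the report above):

```cpp
#include <cstddef>
#include <memory>

// Toy hash-table shell demonstrating the exception-safe ordering.
struct toy_table {
  std::size_t lg_cur_size = 4;
  std::unique_ptr<int[]> entries{new int[std::size_t(1) << 4]()};

  void resize(std::size_t factor) {
    const std::size_t new_lg = lg_cur_size + factor;
    // allocate first: if this throws, lg_cur_size and entries are untouched,
    // so the destructor still matches the old allocation
    std::unique_ptr<int[]> fresh(new int[std::size_t(1) << new_lg]());
    // ... rehash old entries into fresh here ...
    entries = std::move(fresh);  // no-throw
    lg_cur_size = new_lg;        // no-throw
  }
};
```

With a raw allocator, the same effect needs the new pointer held locally (and deallocated on rethrow) until both member updates can be done without throwing.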

.whl for Linux ARM64 error

I'm having trouble installing datasketches==4.0.0 for Python on Linux ARM64; I receive the following error when I run pip3 install datasketches==4.0.0:

Building wheels for collected packages: datasketches
  Building wheel for datasketches (PEP 517) ... error
  ERROR: Command errored out with exit status 1:
   command: /usr/bin/python3 /usr/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /tmp/tmpr2gn81le
       cwd: /tmp/pip-install-r9ab7pna/datasketches
  Complete output (68 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-aarch64-cpython-37
  creating build/lib.linux-aarch64-cpython-37/datasketches
  copying python/datasketches/PySerDe.py -> build/lib.linux-aarch64-cpython-37/datasketches
  copying python/datasketches/__init__.py -> build/lib.linux-aarch64-cpython-37/datasketches
  running build_ext
  -- The CXX compiler identification is unknown
  CMake Error at CMakeLists.txt:25 (project):
    No CMAKE_CXX_COMPILER could be found.
  
    Tell CMake where to find the compiler by setting either the environment
    variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
    to the compiler, or to the compiler name if it is in the PATH.
  
  
  -- Configuring incomplete, errors occurred!
  See also "/tmp/pip-install-r9ab7pna/datasketches/build/temp.linux-aarch64-cpython-37/CMakeFiles/CMakeOutput.log".
  See also "/tmp/pip-install-r9ab7pna/datasketches/build/temp.linux-aarch64-cpython-37/CMakeFiles/CMakeError.log".
  Traceback (most recent call last):
    File "/usr/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py", line 280, in <module>
      main()
    File "/usr/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py", line 263, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/usr/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py", line 205, in build_wheel
      metadata_directory)
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 414, in build_wheel
      wheel_directory, config_settings)
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 398, in _build_with_temp_dir
      self.run_setup()
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/build_meta.py", line 335, in run_setup
      exec(code, locals())
    File "<string>", line 109, in <module>
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/__init__.py", line 87, in setup
      return distutils.core.setup(**attrs)
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/dist.py", line 1208, in run_command
      super().run_command(command)
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 325, in run
      self.run_command("build")
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/dist.py", line 1208, in run_command
      super().run_command(command)
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/_distutils/command/build.py", line 132, in run
      self.run_command(cmd_name)
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
      self.distribution.run_command(command)
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/dist.py", line 1208, in run_command
      super().run_command(command)
    File "/tmp/pip-build-env-_wmx9kjy/overlay/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "<string>", line 46, in run
    File "<string>", line 78, in build_extension
    File "/usr/lib64/python3.7/subprocess.py", line 363, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-install-r9ab7pna/datasketches', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/tmp/pip-install-r9ab7pna/datasketches/build/lib.linux-aarch64-cpython-37', '-DWITH_PYTHON=True', '-DCMAKE_CXX_STANDARD=11', '-DPython3_EXECUTABLE=/usr/bin/python3', '-DCMAKE_BUILD_TYPE=Release']' returned non-zero exit status 1.
  ----------------------------------------
  ERROR: Failed building wheel for datasketches
Failed to build datasketches
ERROR: Could not build wheels for datasketches which use PEP 517 and cannot be installed directly

How can I generate a .whl of datasketches==4.0.0 that works on Linux ARM64?

DoubleQuantilesSketch

Is there any interest in providing the DoubleQuantilesSketch in C++? The Java library contains it, and Druid only supports the DoubleQuantilesSketch at the moment. We would like to send the serialized sketch to Druid, so it would be really useful to have this sketch in C++. Thank you.

State of the library?

Hello @AlexanderSaydakov and thanks for starting the effort in bringing DataSketches to C++!

I'm wondering what the state of this library is: would you say it's at least at a point where we can use the base quantile sketch in test projects, or is the functionality itself broken?

Regards,
Theodore

datasketches cannot be imported on Apple M1

I am currently experiencing a "wrong architecture" problem when trying to use datasketches on my M1 MacBook Pro:

davide-anastasia@Davides-MBP ~ % mkdir datasketches
davide-anastasia@Davides-MBP ~ % cd datasketches
davide-anastasia@Davides-MBP datasketches % pipenv shell
Creating a virtualenv for this project...
Pipfile: /Users/davide-anastasia/datasketches/Pipfile
Using /opt/homebrew/bin/python3 (3.9.7) to create virtualenv...
⠸ Creating virtual environment...created virtual environment CPython3.9.7.final.0-64 in 169ms
  creator CPython3Posix(dest=/Users/davide-anastasia/.local/share/virtualenvs/datasketches-yWoxoifi, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/davide-anastasia/Library/Application Support/virtualenv)
    added seed packages: pip==21.3.1, setuptools==60.2.0, wheel==0.37.1
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator

✔ Successfully created virtual environment!
Virtualenv location: /Users/davide-anastasia/.local/share/virtualenvs/datasketches-yWoxoifi
Creating a Pipfile for this project...
Launching subshell in virtual environment...
 . /Users/davide-anastasia/.local/share/virtualenvs/datasketches-yWoxoifi/bin/activate
davide-anastasia@Davides-MBP datasketches %  . /Users/davide-anastasia/.local/share/virtualenvs/datasketches-yWoxoifi
/bin/activate
(datasketches) davide-anastasia@Davides-MBP datasketches % pipenv install datasketches
Installing datasketches...
Adding datasketches to Pipfile's [packages]...
✔ Installation Succeeded
Pipfile.lock not found, creating...
Locking [dev-packages] dependencies...
Locking [packages] dependencies...
Building requirements...
Resolving dependencies...
✔ Success!
Updated Pipfile.lock (99e97b)!
Installing dependencies from Pipfile.lock (99e97b)...
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 0/0 — 00:00:00
(datasketches) davide-anastasia@Davides-MBP datasketches % python3 -c 'import datasketches'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: dlopen(/Users/davide-anastasia/.local/share/virtualenvs/datasketches-yWoxoifi/lib/python3.9/site-packages/datasketches.so, 2): no suitable image found.  Did find:
	/Users/davide-anastasia/.local/share/virtualenvs/datasketches-yWoxoifi/lib/python3.9/site-packages/datasketches.so: mach-o, but wrong architecture
	/Users/davide-anastasia/.local/share/virtualenvs/datasketches-yWoxoifi/lib/python3.9/site-packages/datasketches.so: mach-o, but wrong architecture

merge_sorted_blocks_direct

While testing the kll sketch, I found that VS complains about an invalid iterator in the method merge_sorted_blocks_direct.

The main problem is:

  const auto chunk_begin = temp.begin() + temp.size(); // this is the same as temp.end(), or at least, IS_TRUE(temp.end() == chunk_begin)
  merge_sorted_blocks_reversed(orig, temp, levels, starting_level_1, num_levels_1); // these lines add elements without reallocation
  merge_sorted_blocks_reversed(orig, temp, levels, starting_level_2, num_levels_2);
  const uint32_t num_items_1 = levels[starting_level_1 + num_levels_1] - levels[starting_level_1];
  std::merge( // This line makes VS in Debug crash
    std::make_move_iterator(chunk_begin), std::make_move_iterator(chunk_begin + num_items_1),
    std::make_move_iterator(chunk_begin + num_items_1), std::make_move_iterator(temp.end()),
    orig.begin() + levels[starting_level], compare_pair_by_first<C>()
  );
  temp.erase(chunk_begin, temp.end());

I don't have a quote from the standard, but I think VS is right (and that the code in the function is UB): there is no guarantee that temp.end() is actually a pointer to the element following the last. I also found an equivalent question on Stack Overflow that seems to agree with my interpretation.

So while in practice it doesn't make any difference, since vector iterators are implemented through pointers (in release mode with VS, and in any mode with gcc 9.3, including with -D_GLIBCXX_DEBUG), I think it could be improved.

A possible (if probably not very elegant) workaround could be:

diff --git a/kll/include/kll_quantile_calculator_impl.hpp b/kll/include/kll_quantile_calculator_impl.hpp
index 23efa4d..ff6e547 100644
--- a/kll/include/kll_quantile_calculator_impl.hpp
+++ b/kll/include/kll_quantile_calculator_impl.hpp
@@ -129,10 +129,11 @@ void kll_quantile_calculator<T, C, A>::merge_sorted_blocks_direct(Container& ori
   const uint8_t num_levels_2 = num_levels - num_levels_1;
   const uint8_t starting_level_1 = starting_level;
   const uint8_t starting_level_2 = starting_level + num_levels_1;
-  const auto chunk_begin = temp.begin() + temp.size();
+  const auto initial_size = temp.size();
   merge_sorted_blocks_reversed(orig, temp, levels, starting_level_1, num_levels_1);
   merge_sorted_blocks_reversed(orig, temp, levels, starting_level_2, num_levels_2);
   const uint32_t num_items_1 = levels[starting_level_1 + num_levels_1] - levels[starting_level_1];
+  const auto chunk_begin = temp.begin() + initial_size;
   std::merge(
     std::make_move_iterator(chunk_begin), std::make_move_iterator(chunk_begin + num_items_1),
     std::make_move_iterator(chunk_begin + num_items_1), std::make_move_iterator(temp.end()),

Please, correct me if I'm wrong about this.
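The index-before-iterator idea from the diff, shown in isolation on a toy vector (not the sketch code): record the index before the vector may grow, and materialize the iterator only after all insertions are done, when it is guaranteed valid.

```cpp
#include <vector>

int count_new_items() {
  std::vector<int> temp = {5, 1};
  const auto initial_size = temp.size();  // an index survives reallocation
  temp.push_back(4);                      // may reallocate, invalidating old iterators
  temp.push_back(2);
  const auto chunk_begin = temp.begin() + initial_size;  // computed afterwards: safe
  return static_cast<int>(temp.end() - chunk_begin);     // number of appended items
}
```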

Buffer overflow in ThetaSketch intersection update

Hi,

I am investigating a crash we saw in tests using ThetaSketch 1.0.0 which I have narrowed down to the following cause:

const uint32_t max_matches = std::min(num_keys_, sketch.get_num_retained());
    uint64_t* matched_keys = AllocU64().allocate(max_matches);
    uint32_t match_count = 0;
    for (auto key: sketch) {
      if (key < theta_) {
        if (update_theta_sketch_alloc<A>::hash_search(key, keys_, lg_size_)) matched_keys[match_count++] = key;
      } else if (sketch.is_ordered()) {
        break; // early stop
      }
    }

https://github.com/apache/incubator-datasketches-cpp/blob/master/theta/include/theta_intersection_impl.hpp#L131

As you can see in the debug output below, match_count is higher than max_matches, meaning we have written beyond the allocated buffer (this eventually results in a crash due to memory corruption).

Not entirely sure how hash_search could have found more matches than there are elements (hash collisions?).

But in any case, shouldn't the for loop stop once match_count == max_matches?

Screenshot 2019-12-18 at 17 54 26

I can provide the sketch if that can help.

Thanks!

Empty kll sketches behavior

Hi! I've been experimenting with the kll sketch lately, and I don't understand the behavior of the "empty" test.

On line 73 I see a counting loop and a REQUIRE(count == 0); which in my mind means "it should never enter the loop". But I was surprised that on my machine (Windows 10, Visual Studio 2019 Community latest version, both Release and Debug configurations, x64 platform) that is not the case: it just loops until count overflows and then exits the loop. This makes the test "pass", but (apart from the overflow being UB) I don't think it's the right behavior.

To make sure I'm not missing something, I changed the test with

diff --git a/kll/test/kll_sketch_test.cpp b/kll/test/kll_sketch_test.cpp
index def96f6..c29c135 100644
--- a/kll/test/kll_sketch_test.cpp
+++ b/kll/test/kll_sketch_test.cpp
@@ -74,6 +74,7 @@ TEST_CASE("kll sketch", "[kll_sketch]") {
     for (auto& it: sketch) {
       (void) it; // to suppress "unused" warning
       ++count;
+      FAIL("the sketch should be empty");
     }
     REQUIRE(count == 0);
 }

And the test immediately fails.

I also tested this behavior using GCC (on a WSL2 machine, Ubuntu 20.04, g++ 9.3.0) and the test fails, too.

Am I missing something?

static analyzer detects a possible bug

Recently, we ran a static analyzer on our project, and it reported the following problem:

const uint8_t pseudo_phase = determine_pseudo_phase(source.get_lg_k(), source.get_num_coupons());
const uint8_t* permutation = column_permutations_for_encoding[pseudo_phase];

Line 299 computes pseudo_phase, which could be 16 + x. But column_permutations_for_encoding is a fixed 2-D array with dimensions [16][59], so this is a possible out-of-bounds read.

If this is a false positive, maybe we should add a check like the following (found in the corresponding uncompress method) to make the static analyzer happy?

if (pseudo_phase >= 16) throw std::logic_error("pseudo phase >= 16");
