Datasketches sandbox project

Apache Datasketches is a software library of stochastic streaming algorithms.

This repository provides a simple HTTP interface to evaluate datasketches on your own data.

For large datasets, the following problems are typically difficult to measure exactly using limited resources:

distinct count
quantiles and histograms
frequent items
reservoir sampling

Datasketches uses sketches with mathematically proven error bounds to provide robust solutions to these problems. Moreover, sketches are insensitive to the order of the input data and only need to see each data item once ("one touch"), which makes them ideal for streaming and big data use cases.

Usage notes

The service maintains stateful in-memory data for each dataset, either a sketch or its exact equivalent, which can be queried periodically for results. Because this state is kept between requests, set operations between sketches are possible.

To use the exact equivalent of a sketch, append the ?exact flag to the endpoint URL.
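As an illustration of how the flag attaches to an endpoint, here is a minimal Python sketch that builds request URLs. The helper name `distinct_count_url` and the `BASE_URL` constant are hypothetical; the path `/v1/distinct/count/<key>` and the port are taken from the Distinct count examples further down in this README.

```python
from urllib.parse import quote

# Assumption: the service is reachable locally on the port used in this README.
BASE_URL = "http://127.0.0.1:8099"


def distinct_count_url(key: str, exact: bool = False, base_url: str = BASE_URL) -> str:
    """Build the distinct count URL for a sketch key (hypothetical helper).

    When exact is True, the ?exact flag is appended so that the request
    targets the exact equivalent instead of the sketch.
    """
    url = f"{base_url}/v1/distinct/count/{quote(key, safe='')}"
    return f"{url}?exact" if exact else url


if __name__ == "__main__":
    print(distinct_count_url("country-jp"))
    print(distinct_count_url("country-jp", exact=True))
```

When pasting such a URL into a shell command, quoting it avoids the shell treating the `?` as a wildcard.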

Each sketch needs to be assigned a key for reference, which typically adheres to the following format:

dataset-dimension1-dimension2-dimensionN

For example:

# country dataset, country code
country-jp
country-us

# occupation dataset, job name, state
occupation-technician-ca
occupation-surgeon-co
occupation-surgeon-tx
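The key format above can be sketched as a small helper. This is only an illustration: the function name `sketch_key` is hypothetical, and rejecting hyphens inside a part is an assumption made here to keep keys unambiguous, not a rule stated by the service.

```python
def sketch_key(dataset: str, *dimensions: str) -> str:
    """Join a dataset name and its dimensions as dataset-dimension1-...-dimensionN."""
    parts = [dataset, *dimensions]
    for part in parts:
        # Assumption: an empty part or an embedded hyphen would make the key ambiguous.
        if not part or "-" in part:
            raise ValueError(f"invalid key part: {part!r}")
    return "-".join(parts)


if __name__ == "__main__":
    print(sketch_key("country", "jp"))
    print(sketch_key("occupation", "surgeon", "tx"))
```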

Finally, see the useful helper scripts in the scripts directory.

Running in Docker

# Starts the published container from the GitHub Container Registry
docker run -d -p 8099:8080/tcp ghcr.io/davecromberge/datasketches-sandbox/ds-sandbox-server:latest
→ container-id

# Tests the container
curl -X GET http://127.0.0.1:8099/ping
→ pong

# Stops the container
docker stop container-id

Distinct count

Problem: Gather a distinct count of identities, independent of the order of the input.

curl -X PUT http://127.0.0.1:8099/v1/distinct/count/country-jp/user-id1
→ Accepted

curl -X PUT http://127.0.0.1:8099/v1/distinct/count/country-jp/user-id2
→ Accepted

curl -X PUT http://127.0.0.1:8099/v1/distinct/count/country-us/user-id2
→ Accepted

curl -X GET http://127.0.0.1:8099/v1/distinct/count/country-jp
→ {"value":2.0,"lowerBound":2.0,"upperBound":2.0}

curl -X GET http://127.0.0.1:8099/v1/distinct/count/country-us/union/country-jp
→ {"value":2.0,"lowerBound":2.0,"upperBound":2.0}

curl -X GET http://127.0.0.1:8099/v1/distinct/count/country-us/intersect/country-jp
→ {"value":1.0,"lowerBound":1.0,"upperBound":1.0}

curl -X GET http://127.0.0.1:8099/v1/distinct/count/country-jp/anotb/country-us
→ {"value":1.0,"lowerBound":1.0,"upperBound":1.0}

curl -F [email protected]  http://127.0.0.1:8099/v1/distinct/count/country-us
→ Accepted

curl -F [email protected]  http://127.0.0.1:8099/v1/distinct/count/country-jp?exact
→ Accepted

curl -X DELETE http://127.0.0.1:8099/v1/distinct/count/country-jp
→ Ok

curl -X DELETE http://127.0.0.1:8099/v1/distinct/count/country-us
→ Ok

For comparison purposes, any of the above URLs can have the ?exact flag set to perform an exact distinct count. Uploading large input streams to the exact endpoints can be orders of magnitude slower, whereas sketches grow sub-linearly in relation to the input data size.
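To make the set operations concrete, the following Python sketch replays the PUT requests from the examples above using plain sets. It models only what an exact count would return and is not the service's implementation; the names `put` and `count` are hypothetical, and the result shape mirrors the JSON shown in the examples.

```python
from collections import defaultdict

# One exact set of identities per key, standing in for the ?exact state.
exact = defaultdict(set)


def put(key: str, item: str) -> None:
    """Record an identity under a key; duplicates and ordering do not matter."""
    exact[key].add(item)


def count(items: set) -> dict:
    """An exact count has no error, so both bounds equal the value."""
    n = float(len(items))
    return {"value": n, "lowerBound": n, "upperBound": n}


put("country-jp", "user-id1")
put("country-jp", "user-id2")
put("country-us", "user-id2")

jp, us = exact["country-jp"], exact["country-us"]
results = {
    "country-jp": count(jp),
    "union": count(us | jp),      # identities seen in either dataset
    "intersect": count(us & jp),  # identities seen in both datasets
    "anotb": count(jp - us),      # identities in country-jp but not in country-us
}

if __name__ == "__main__":
    for name, result in results.items():
        print(name, result)
```

The exact sets grow with every new identity, which is the memory cost that sketches avoid.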

Environment variables

By default, the sketch nominal entries setting is 2^16. This setting affects the accuracy of the final estimate.

To alter the defaults, run the docker image with the relevant environment variables set:

docker run -d --env SKETCH_ACCURACY=12 -p 8099:8080/tcp datasketches-sandbox/ds-sandbox-server
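As a rough guide to what this setting means for accuracy, the Python sketch below applies the common rule of thumb that a Theta sketch with k nominal entries has a relative standard error of about 1/sqrt(k). Two things here are assumptions, not confirmed by this README: that the service is backed by a Theta sketch, and that SKETCH_ACCURACY is the base-2 logarithm of the nominal entries. The function name `approx_rse` is hypothetical.

```python
import math


def approx_rse(lg_nominal_entries: int) -> float:
    """Approximate relative standard error for k = 2**lg_nominal_entries.

    Assumption: the 1/sqrt(k) rule of thumb for Theta sketches applies.
    """
    k = 2 ** lg_nominal_entries
    return 1.0 / math.sqrt(k)


if __name__ == "__main__":
    for lg_k in (12, 16):
        print(f"lg_k={lg_k}: approx. RSE {approx_rse(lg_k):.4%}")
```

Under these assumptions, lowering the value from 16 to 12 trades a smaller sketch for a roughly four times larger relative error.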

Building a Linux executable

Build the Docker image in the docker directory

docker build -f docker/GraalDockerfile -t datasketches-sandbox/graalvm-native-image .

Run the nativeImage task from sbt. The result will be a Linux executable.

Build the lightweight docker image locally

docker build -f docker/SandboxDockerfile -t datasketches-sandbox/ds-sandbox-server .

Acknowledgements

The Apache Datasketches team and community for the incredibly useful library.
This blog post by Noel Welsh describes how to build a GraalVM service using SBT and docker.

Todos

Support more sketch types
Create a java equivalent for the Apache organisation
Add better documentation
GitHub Actions for automatically publishing the package to ghcr
