datasketches-sandbox's Introduction

DataSketches sandbox project

Apache DataSketches is a software library of stochastic streaming algorithms.

This repository provides a simple HTTP interface for evaluating DataSketches on your own data.

For large datasets, the following problems are typically difficult to solve exactly with limited resources:

  • distinct count
  • quantiles and histograms
  • frequent items
  • reservoir sampling

DataSketches uses sketches with mathematically proven error bounds to provide robust approximate solutions to these problems. Moreover, sketches are insensitive to the order of the input data and need to see each data item only once ("one touch"), which makes them well suited to streaming and big data use cases.
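To make the "order insensitive" and "one touch" properties concrete, here is a minimal sketch of a K-Minimum-Values (KMV) distinct-count estimator, a textbook technique. It is not the DataSketches implementation and is not what this service runs; the class name `KmvSketch`, the helper `count_distinct`, and the choice of SHA-1 as the hash are all illustrative assumptions.

```python
import hashlib
import heapq
import random


class KmvSketch:
    """Toy K-Minimum-Values distinct-count estimator (illustrative only)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []     # negated hashes: a max-heap over the k smallest hashes
        self._kept = set()  # hashes currently retained, used for de-duplication

    @staticmethod
    def _hash(item):
        # Map the item to a 64-bit integer that behaves like a uniform draw.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def update(self, item):
        h = self._hash(item)
        if h in self._kept:
            return  # duplicate of a retained item: nothing to do
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._kept.add(h)
        elif h < -self._heap[0]:
            # Smaller than the current k-th minimum: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._kept.discard(evicted)
            self._kept.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            # Fewer than k distinct hashes seen: the count is exact
            # (ignoring hash collisions).
            return float(len(self._heap))
        kth_min = -self._heap[0] / 2.0**64  # k-th smallest hash scaled to [0, 1)
        return (self.k - 1) / kth_min


def count_distinct(items, k=1024):
    sketch = KmvSketch(k)
    for item in items:  # single pass: each item is touched once
        sketch.update(item)
    return sketch.estimate()


stream = [f"user-{i % 50_000}" for i in range(150_000)]  # 50,000 distinct ids
shuffled = stream[:]
random.Random(7).shuffle(shuffled)

in_order = count_distinct(stream)
reordered = count_distinct(shuffled)
print(in_order == reordered)  # both orders retain the same k smallest hashes
print(round(in_order))        # an approximation of 50,000
```

The sketch retains at most k hashes however long the stream is, which is why the memory footprint stays small while the estimate remains close to the true count.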

Usage notes

The service keeps a stateful in-memory sketch (or exact copy) for each dataset, which can be queried periodically for approximate results. Because this state is retained, set operations between sketches are possible.

To use the exact equivalent of a sketch, append the ?exact flag to the endpoint.

Each sketch needs to be assigned a key for reference, which typically adheres to the following format:

dataset-dimension1-dimension2-dimensionN

For example:

# country dataset, country code
country-jp
country-us

# occupation dataset, job name, state
occupation-technician-ca
occupation-surgeon-co
occupation-surgeon-tx
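Keys in this format can be assembled programmatically. The helper below is a hypothetical client-side convenience, not part of the repository. Rejecting components that contain "-" or "/" is an assumption made here to keep keys unambiguous and safe to use as a URL path segment.

```python
def sketch_key(dataset, *dimensions):
    """Join a dataset name and its dimension values with hyphens."""
    parts = [dataset, *dimensions]
    for part in parts:
        # Assumption: components must be non-empty and free of separators.
        if not part or "-" in part or "/" in part:
            raise ValueError(f"invalid key component: {part!r}")
    return "-".join(parts)


print(sketch_key("country", "jp"))
print(sketch_key("occupation", "surgeon", "tx"))
```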

Finally, see the useful helper scripts in the scripts directory.

Running in Docker

# Starts the published container from the GitHub Container Registry
docker run -d -p 8099:8080/tcp ghcr.io/davecromberge/datasketches-sandbox/ds-sandbox-server:latest
→ container-id

# Tests the container
curl -X GET http://0.0.0.0:8099/ping
→ pong

# Stops the container
docker stop container-id

Distinct count

Problem: Gather a distinct count of identities, independent of the order of the input.

curl -X PUT http://127.0.0.1:8099/v1/distinct/count/country-jp/user-id1
→ Accepted

curl -X PUT http://127.0.0.1:8099/v1/distinct/count/country-jp/user-id2
→ Accepted

curl -X PUT http://127.0.0.1:8099/v1/distinct/count/country-us/user-id2
→ Accepted

curl -X GET http://127.0.0.1:8099/v1/distinct/count/country-jp
→ {"value":2.0,"lowerBound":2.0,"upperBound":2.0}

curl -X GET http://127.0.0.1:8099/v1/distinct/count/country-us/union/country-jp
→ {"value":2.0,"lowerBound":2.0,"upperBound":2.0}

curl -X GET http://127.0.0.1:8099/v1/distinct/count/country-us/intersect/country-jp
→ {"value":1.0,"lowerBound":1.0,"upperBound":1.0}

curl -X GET http://127.0.0.1:8099/v1/distinct/count/country-jp/anotb/country-us
→ {"value":1.0,"lowerBound":1.0,"upperBound":1.0}

curl -F [email protected]  http://127.0.0.1:8099/v1/distinct/count/country-us
→ Accepted

curl -F [email protected]  http://127.0.0.1:8099/v1/distinct/count/country-jp?exact
→ Accepted

curl -X DELETE http://127.0.0.1:8099/v1/distinct/count/country-jp
→ Ok

curl -X DELETE http://127.0.0.1:8099/v1/distinct/count/country-us
→ Ok
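The values returned by the GET requests above follow from ordinary set arithmetic on the three PUT requests. The snippet below reproduces them with plain Python sets as a sanity check. It only illustrates the expected results and makes no claim about how the service computes them.

```python
# Items sent by the three PUT requests, grouped by sketch key.
puts = [
    ("country-jp", "user-id1"),
    ("country-jp", "user-id2"),
    ("country-us", "user-id2"),
]

datasets = {}
for key, item in puts:
    datasets.setdefault(key, set()).add(item)

jp, us = datasets["country-jp"], datasets["country-us"]

# Path suffixes mirror the GET requests in the session above.
results = {
    "country-jp": len(jp),                            # distinct count
    "country-us/union/country-jp": len(us | jp),      # in either set
    "country-us/intersect/country-jp": len(us & jp),  # in both sets
    "country-jp/anotb/country-us": len(jp - us),      # in jp but not in us
}
for path, value in results.items():
    print(path, value)
```

With only a handful of items a sketch has seen everything exactly, which is consistent with the lower and upper bounds equalling the value in the responses.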

For comparison, any of the above URLs can take the ?exact flag to compute an exact distinct count. Uploading large input streams to the exact endpoints can be orders of magnitude slower, because exact state must grow with the number of distinct items, whereas sketches grow sub-linearly with the input data size.
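When comparing sketch and exact results from a script, it helps to build both URL variants from one place. The function below is a hypothetical helper: `count_url` and `BASE_URL` are names invented for this example, and the host and port are simply the ones used in the session above.

```python
from urllib.parse import quote

# Host and port as used in the examples; adjust to your deployment.
BASE_URL = "http://127.0.0.1:8099/v1/distinct/count"
SET_OPERATIONS = ("union", "intersect", "anotb")


def count_url(key, op=None, other=None, exact=False):
    """Build a distinct-count URL, optionally with a set operation and ?exact."""
    if (op is None) != (other is None):
        raise ValueError("op and other must be given together")
    if op is not None and op not in SET_OPERATIONS:
        raise ValueError(f"unknown set operation: {op!r}")
    path = quote(key, safe="")
    if op is not None:
        path += f"/{op}/{quote(other, safe='')}"
    url = f"{BASE_URL}/{path}"
    return f"{url}?exact" if exact else url


print(count_url("country-jp"))
print(count_url("country-jp", exact=True))
print(count_url("country-us", "intersect", "country-jp", exact=True))
```

The resulting strings can be passed to curl or to any HTTP client, once for the sketch and once with `exact=True`, to compare the two answers for the same key.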

Environment variables

By default, the sketch's nominal entries setting is 2^16. This setting affects the accuracy of the final estimate.

To change the defaults, run the Docker image with the relevant environment variables set:

docker run -d --env SKETCH_ACCURACY=12 -p 8099:8080/tcp datasketches-sandbox/ds-sandbox-server
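To get a feel for how nominal entries relate to accuracy, the snippet below applies a common rule of thumb for KMV-style sketches: the relative standard error is roughly 1/sqrt(k) for k nominal entries. Two assumptions are made here that the source does not confirm: that SKETCH_ACCURACY is the base-2 logarithm of the nominal entries, and that the service's sketch follows this rule of thumb. Treat the numbers as a rough guide only.

```python
import math


def approx_relative_error(lg_k):
    """Rule-of-thumb relative standard error for k = 2**lg_k nominal entries."""
    k = 2 ** lg_k
    return 1.0 / math.sqrt(k)


# 12 is the value from the docker example; 16 matches the stated default of 2^16.
for lg_k in (12, 16):
    error = approx_relative_error(lg_k)
    print(f"k = 2^{lg_k} = {2 ** lg_k}: ~{error:.2%} relative standard error")
```

Under these assumptions, 2^12 entries give about 1.56% and 2^16 entries about 0.39%, so lowering the setting trades accuracy for a smaller sketch.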

Building a Linux executable

  1. Build the Docker image in the docker directory
docker build -f docker/GraalDockerfile -t datasketches-sandbox/graalvm-native-image .

  2. Run the nativeImage task from sbt. The result will be a Linux executable.

  3. Build the lightweight Docker image locally

docker build -f docker/SandboxDockerfile -t datasketches-sandbox/ds-sandbox-server .

Acknowledgements

  • The Apache DataSketches team and community, for the incredibly useful library.
  • Noel Welsh, for the blog post describing how to build a GraalVM service using sbt and Docker.

Todos

  • Support more sketch types
  • Create a Java equivalent for the Apache organisation
  • Add better documentation
  • Add GitHub Actions to automatically publish the package to GHCR

Contributors

  • davecromberge