Code Monkey home page Code Monkey logo

go-hll's Introduction

go-hll CircleCI Go Report Card GoDoc

A Go implementation of HyperLogLog that is storage-compatible with the Aggregate Knowledge HLL Storage Spec.

Overview

HyperLogLog (HLL) is a fixed-size, set-like structure used for distinct value counting with tunable precision. For example, in 1280 bytes HLL can estimate the count of tens of billions of distinct values with only a few percent error.

In addition to the algorithm proposed in the original paper, this implementation is augmented to improve its accuracy and memory use without sacrificing much speed.

Motivation

While there are a handful of existing HLL implementations in Go, none of them implement the AK Storage Spec.
The unified storage format is useful for reading and writing HLLs in a multi-lingual environment. At Segment, most of our runtime code is written in Go, but we frequently persist data to PostgreSQL for long-term storage. The Postgres HLL plugin is fairly ubiquitous--it's available for the standalone server, AWS RDS, AWS Aurora, and CitusDB.

An excellent description for the motivation behind the storage strategy can be found in the Java HLL library's README.

Hashing

A good hashing algorithm is crucial to achieving the pseudorandomness that HLL requires in order to perform its calculations. The 64 bit variant of MurmurHash3 is recommended. If using a seed, it must be constant for all inputs to a given HLL. Further, if that HLL is to be unioned, then the same seed must be used for all inputs to the other HLL.

See the Java HLL README for a discussion on why MurmurHash3 is a good choice.

Adaptations to Go

The API is intended to be as similar as possible to Java HLL and Postgresql HLL. There are a couple of features, though, that make it more friendly to Go Programmers.

Zero Value

If default settings are specified using the hll.Defaults function, the zero value can be used directly as an empty HLL.

Since its impossible to reason about an HLL without the settings, operations on a zero value in lieu of default settings will panic.

StrictUnion

The other HLL implementations allow for two HLLs to be union-ed even if their log2m or regwidth parameters differ. However, doing so can produce wildly inaccurate results. This library provides an additional StrictUnion operation that will return an error if attempting a union on HLLs with incompatible settings.

Building

Dependencies are managed with dep. Before building, ensure that dep is installed and on the path.

Download Dependencies

make dep

Test

make test

Usage

package main 

import (
	"fmt"
	
	"github.com/segmentio/go-hll"
)

func main() {
	
	// install default settings.
	hll.Defaults(hll.Settings{
		Log2m: 10,
		Regwidth: 4,
		ExplicitThreshold: hll.AutoExplicitThreshold,
		SparseEnabled: true,
	})
	
	// add elements.
	h := hll.Hll{}
	h.AddRaw(123456789)
	fmt.Print(h.Cardinality())  // prints "1"
	
	// union Hlls
	h2 := hll.Hll{}
	h2.AddRaw(123456789)
	h2.AddRaw(987654321)
	h2.Union(h)
	fmt.Print(h2.Cardinality()) // prints "2"
 
	// write to/read from bytes. 
	h3, _ := hll.FromBytes(h2.ToBytes())
	fmt.Print(h3.Cardinality()) // prints "2"
}

There is a compatibility battery run as part of the unit tests. The battery was produced by the Test Generator in the java-hll package.

Additional Resources

License

Released under the MIT license.

go-hll's People

Contributors

achille-roussel avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.