Code Monkey home page Code Monkey logo

hyperloglog's People

Contributors

absolute8511 avatar bahlo avatar devarshishimpi avatar fabura avatar jameshartig avatar jdanford avatar koron avatar lahsivjar avatar lukasmalkmus avatar maciej avatar martinpinto avatar njpatel avatar pmbauer avatar raduberinde avatar seiflotfy avatar tamird avatar thecraftman avatar w-zr avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

hyperloglog's Issues

request: insert-hash method

It would be useful to have an insert method that just takes the hash value directly. This would allow the caller to choose the hash function. It is particularly useful if our data is just integers - we could use some function that works on integers directly instead of converting them to []byte.

Reduce size of new sparse sketch

A new instance of a sparse Sketch requires around 118 bytes of memory on amd64.

type Sketch struct {
	p          uint8                // 1 byte
	b          uint8                // 1 byte
	m          uint32               // 4 bytes
	alpha      float64              // 8 bytes
	tmpSet     set					// 56 bytes
	sparseList *compressedList      // 40 bytes  (8 + 2 * 4 + 24 (pointer + 2 x uint32 + slice))
	regs       *registers           // 8 bytes (nil pointer)
	                                // 118 bytes total
}

The scenario I would like to optimize for is storing many sketches that contain few elements. In my case I have a more-or-less zipfian distribution of set cardinalities (few large ones and a long tail of small ones).
118 bytes is considerably more than, for example, Redis which uses 21 bytes for a HLL containing a single element. See my gist on that.

Looking over the code today and previously I'm considering a few optimizations. @seiflotfy I'd like to get you opinion on some of them before I get my hands dirty.

  1. Remove tmpSet. This would already reduce the memory footprint by half. As far as I understand the code it's used to "cache" multiple elements inserted into a the sparse representation of the sketch. I believe the rationale here was to reduce the time spent on extending and traversing the compressed list. Lack of this cache can be mitigated for example by providing a InsertMany function instead.

  2. Remove 2 pointers to compressedList and registers. Both those types represent []uint8 slices with some additional metadata. That metadata could be rolled into the data structure. That's a 16 bytes saved for pointers. The tradeoff here is code readability.

  3. Don't precompute alpha and m. 12 bytes saving. A few CPU cycles are a tradeoff here.

The above memory optimizations would amount to savings of 75 bytes bringing down the size of an empty sketch to 43 bytes.

type Sketch struct {
	sparse bool     // 1 byte - bringing it back since there is no *compressedList to match against
	p      uint8    // 1 byte
	b      uint8    // 1 byte
	regs   []uint32 // 24 bytes
	nz     uint32   // 4 bytes
	count  uint32   // 4 bytes
	last   uint32   // 4 bytes
	                // 43 bytes total
}

What are your thoughts?

Make New Public Again

The public constructors at present allow only 14 or 16 registers.

func New() *Sketch
func New14() *Sketch
func New16() *Sketch
func New16NoSparse() *Sketch

The private constructor newSketch looks like it can take between 4-18 registers, and also a flag to say sparse or normal.

Looking at the commit history, the public constructor was removed in commit dba7ba9

Do you see any fundamental issue with the reduced bases? Is it possible to allow those again?

I have a use case where I need to maintain a large number of these HLL sketches, that can tolerate a higher error rate as well. So I am planning to use just 4 or 8 registers.

need help on TestHLLTC_Merge_Complex

I need a bit of help following TestHLLTC_Merge_Complex in hyperloglog_test.go:

Lines 374 and 384: why is ratio calculated against sk1?

Lines 391 and 395: I tried inserting the 500000 records into sk2 and then merge with empty sk3, like in the source. This gives me a ratio of about 1.18%. I made another version where the 500000 records were inserted into sk3, and then merge with sk2. This gets me 6.52%. Shouldn't they be the same?

My code for ref:

func TestHLLTC_Merge_Complex(t *testing.T) {
	sk1, err := newSketch(14, true)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	sk2, err := newSketch(14, true)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	sk3, err := newSketch(14, true)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	// sk1 will contain flow-[1..10000000]
	// sk2 will contain flow-[2,4,...10000000]
	unique := map[string]bool{}
	for i := 1; len(unique) <= 10000000; i++ {
		str := fmt.Sprintf("flow-%d", i)
		sk1.Insert([]byte(str))
		if i%2 == 0 {
			sk2.Insert([]byte(str))
		}
		unique[str] = true
	}

	// sk1
	exact1 := uint64(len(unique))
	res1 := uint64(sk1.Estimate())
	ratio := 100*estimateError(res1, exact1)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact1, res1, ratio)
	}
	t.Logf("sk1 estimate %d; exact %d (%.2f%% err)", res1, exact1, ratio)

	// sk2
	exact2 := uint64(len(unique))/2
	res2 := uint64(sk2.Estimate())
	ratio = 100*estimateError(res2, exact2)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact2, res2, ratio)
	}
	t.Logf("sk2 estimate %d; exact %d (%.2f%% err)", res2, exact2, ratio)

	// sk2 merge sk1
	if err := sk2.Merge(sk1); err != nil {
		t.Errorf("unexpected error: %v", err)
	}
	exact2 = uint64(len(unique))
	res2 = uint64(sk2.Estimate())
	ratio = 100*estimateError(res2, exact2)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact2, res2, ratio)
	}
	t.Logf("sk2 merge sk1 estimate %d; exact %d (%.2f%% err)", res2, exact2, ratio)

	for i := 1; i <= 500000; i++ {
		str := fmt.Sprintf("stream-%d", i)
		sk3.Insert([]byte(str)) // also try sk2.Insert(...)
		unique[str] = true
	}
	/*
	exact3 := uint64(500000)
	res3 := uint64(sk3.Estimate())
	ratio = 100*estimateError(res3, exact3)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact3, res3, ratio)
	}
	t.Logf("sk3 estimate %d; exact %d (%.2f%% err)", res3, exact3, ratio)
	*/

	// sk2 merge sk3
	if err := sk2.Merge(sk3); err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	exact2 = uint64(len(unique))
	res2 = uint64(sk2.Estimate())
	ratio = 100*estimateError(res2, exact2)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact2, res2, ratio)
	}
	t.Logf("sk2 merge sk3 estimate %d; exact %d (%.2f%% err)", res2, exact2, ratio)
}

document requirements for InsertHash

It would be good to provide some documentation about what properties are expected of the input hash for InsertHash. For example, a reasonable question is: if my data is a set of 64-bit ints, can I use them directly as the "hash"? I ran some experiments with that and got bad results. I also got bad results when I used a simplistic hash function (xor and multiply by large prime). Does it require good avalanche characteristics?

Add continuous integration

Hi @seiflotfy,

As a contributor I'd be somehow more confident creating PRs if the project had CI integration.

Does TravisCI sound good for you? I can make a PR with the changes.

Explicit public toNormal method

We have thousands of distinct datasets that we merge together and the underlying Sketch's get Merge called to get a resulting Sketch. We're seeing a significant amount of time spent in maybeToNormal.
image

Would it be better for us to store all of the datasets with the "normal" Sketch instead of a sparse one? In other words, we'd call ToNormal (or something like that) before calling MarshalBinary. Would this reduce the time spent merging?

Sparse representation is too big

I'm counting unique metrics accross different dimensions, and the number of used HLLs can go up to several millions (e.g. number of unique IPs for each user agent or vice versa).
It seems that even empty hyperloglog.Sketch will allocate at least 2^14 = 16384 bytes for the sparseList, which is quite a lot. Redis hyperloglog has way better memory footprint for small cardinalities - http://download.redis.io/redis-stable/src/hyperloglog.c
I wonder if it will be possible to align / merge these two approaches.

fatal error: concurrent map writes

var Counter *hyperloglog.Sketch

...

Counter = hyperloglog.New16()

...

		bpk := make([]byte, 8)
		binary.LittleEndian.PutUint64(bpk, pk)
		Counter.Insert(bpk)
fatal error: concurrent map writes

goroutine 458 [running]:
runtime.throw(0x788fc0, 0x15)
	/Go/src/runtime/panic.go:1114 +0x79 fp=0xc005f75e48 sp=0xc005f75e18 pc=0x435c49
runtime.mapassign_fast32(0x742f00, 0xc00813aba0, 0xc002276860, 0xa52e40)
	/Go/src/runtime/map_fast32.go:101 +0x328 fp=0xc005f75e88 sp=0xc005f75e48 pc=0x411028
github.com/axiomhq/hyperloglog.set.add(...)
	/Go/bin/src/github.com/axiomhq/hyperloglog/sparse.go:44
github.com/axiomhq/hyperloglog.(*Sketch).InsertHash(0xc0000615f0, 0x89da180ed69d24d0, 0x8)
	/Go/bin/src/github.com/axiomhq/hyperloglog/hyperloglog.go:220 +0x117 fp=0xc005f75ed0 sp=0xc005f75e88 pc=0x6d4797
github.com/axiomhq/hyperloglog.(*Sketch).Insert(0xc0000615f0, 0xc0020a7388, 0x8, 0x8, 0xc00848fa40)
	/Go/bin/src/github.com/axiomhq/hyperloglog/hyperloglog.go:214 +0x65 fp=0xc005f75f00 sp=0xc005f75ed0 pc=0x6d4665
main.BulkWorker(0x18ad287d)
	/src/BulkRequest.go:92 +0x1cb fp=0xc005f75f58 sp=0xc005f75f00 pc=0x6da52b
main.main.func1(0x70c3c0, 0xc00205eaa0)
	/src/api.go:71 +0x46 fp=0xc005f75f80 sp=0xc005f75f58 pc=0x6dccd6
github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc005d43290)
	/Go/bin/src/github.com/panjf2000/ants/worker_func.go:68 +0xb7 fp=0xc005f75fd8 sp=0xc005f75f80 pc=0x6d9b07
runtime.goexit()
	/Go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc005f75fe0 sp=0xc005f75fd8 pc=0x4626f1
created by github.com/panjf2000/ants.(*goWorkerWithFunc).run
	/Go/bin/src/github.com/panjf2000/ants/worker_func.go:48 +0x53

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.