axiomhq / hyperloglog
HyperLogLog with lots of sugar (Sparse, LogLog-Beta bias correction and TailCut space reduction) brought to you by Axiom
Home Page: https://axiom.co
License: MIT License
github.com/dgryski/go-metro is written largely in gc-syntax assembly, which makes it difficult to support gccgo downstream of it.
cc @dgryski
It would be useful to have an insert method that takes the hash value directly. This would allow the caller to choose the hash function. It is particularly useful when the data is just integers - we could use a function that works on integers directly instead of converting them to []byte first.
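Until such a method exists, integer keys have to be serialized before every Insert. A minimal sketch of that round-trip (the encode helper is illustrative, not part of the library):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encode turns an integer key into the []byte form Insert expects today;
// a hash-taking insert method would make this conversion unnecessary.
func encode(key uint64) [8]byte {
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], key)
	return buf
}

func main() {
	b := encode(123456789)
	fmt.Printf("% x\n", b) // the 8-byte little-endian form of the key
}
```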
A new instance of a sparse Sketch requires around 118 bytes of memory on amd64.
type Sketch struct {
	p          uint8           // 1 byte
	b          uint8           // 1 byte
	m          uint32          // 4 bytes
	alpha      float64         // 8 bytes
	tmpSet     set             // 56 bytes
	sparseList *compressedList // 40 bytes (8 + 2 * 4 + 24: pointer + 2 x uint32 + slice)
	regs       *registers      // 8 bytes (nil pointer)
	// 118 bytes total
}
The scenario I would like to optimize for is storing many sketches that contain few elements. In my case I have a more-or-less zipfian distribution of set cardinalities (few large ones and a long tail of small ones).
118 bytes is considerably more than, for example, Redis, which uses 21 bytes for an HLL containing a single element. See my gist on that.
Looking over the code today and previously, I'm considering a few optimizations. @seiflotfy I'd like to get your opinion on some of them before I get my hands dirty.
Remove tmpSet. This alone would reduce the memory footprint by half. As far as I understand the code, it is used to "cache" multiple elements inserted into the sparse representation of the sketch. I believe the rationale was to reduce the time spent extending and traversing the compressed list. The lack of this cache could be mitigated, for example, by providing an InsertMany function instead.
Remove the two pointers to compressedList and registers. Both types represent []uint8 slices with some additional metadata; that metadata could be rolled into the Sketch struct itself. That saves 16 bytes of pointers. The tradeoff here is code readability.
Don't precompute alpha and m. That saves 12 bytes, at the cost of a few CPU cycles.
The above memory optimizations would amount to a saving of 75 bytes, bringing the size of an empty sketch down to 43 bytes.
type Sketch struct {
	sparse bool     // 1 byte - bringing it back since there is no *compressedList to match against
	p      uint8    // 1 byte
	b      uint8    // 1 byte
	regs   []uint32 // 24 bytes
	nz     uint32   // 4 bytes
	count  uint32   // 4 bytes
	last   uint32   // 4 bytes
	// 43 bytes total
}
What are your thoughts?
The public constructors at present allow only precision 14 or 16 (2^14 or 2^16 registers).
func New() *Sketch
func New14() *Sketch
func New16() *Sketch
func New16NoSparse() *Sketch
The private constructor newSketch looks like it can take any precision between 4 and 18, plus a flag choosing sparse or normal representation.
Looking at the commit history, the more general public constructor was removed in commit dba7ba9. Do you see any fundamental issue with the lower precisions? Is it possible to allow them again?
I have a use case where I need to maintain a large number of these HLL sketches and can tolerate a higher error rate, so I am planning to use precision 4 or 8.
I need a bit of help following TestHLLTC_Merge_Complex in hyperloglog_test.go:
Lines 374 and 384: why is the ratio calculated against sk1?
Lines 391 and 395: I tried inserting the 500000 records into sk2 and then merging with the empty sk3, as in the source; this gives me an error ratio of about 1.18%. In another version I inserted the 500000 records into sk3 and then merged into sk2; that gives me 6.52%. Shouldn't they be the same?
My code for ref:
func TestHLLTC_Merge_Complex(t *testing.T) {
	sk1, err := newSketch(14, true)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	sk2, err := newSketch(14, true)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	sk3, err := newSketch(14, true)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}

	// sk1 will contain flow-[1..10000000]
	// sk2 will contain flow-[2,4,...10000000]
	unique := map[string]bool{}
	for i := 1; len(unique) <= 10000000; i++ {
		str := fmt.Sprintf("flow-%d", i)
		sk1.Insert([]byte(str))
		if i%2 == 0 {
			sk2.Insert([]byte(str))
		}
		unique[str] = true
	}

	// sk1
	exact1 := uint64(len(unique))
	res1 := uint64(sk1.Estimate())
	ratio := 100 * estimateError(res1, exact1)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact1, res1, ratio)
	}
	t.Logf("sk1 estimate %d; exact %d (%.2f%% err)", res1, exact1, ratio)

	// sk2
	exact2 := uint64(len(unique)) / 2
	res2 := uint64(sk2.Estimate())
	ratio = 100 * estimateError(res2, exact2)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact2, res2, ratio)
	}
	t.Logf("sk2 estimate %d; exact %d (%.2f%% err)", res2, exact2, ratio)

	// sk2 merge sk1
	if err := sk2.Merge(sk1); err != nil {
		t.Errorf("unexpected error: %v", err)
	}
	exact2 = uint64(len(unique))
	res2 = uint64(sk2.Estimate())
	ratio = 100 * estimateError(res2, exact2)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact2, res2, ratio)
	}
	t.Logf("sk2 merge sk1 estimate %d; exact %d (%.2f%% err)", res2, exact2, ratio)

	for i := 1; i <= 500000; i++ {
		str := fmt.Sprintf("stream-%d", i)
		sk3.Insert([]byte(str)) // also try sk2.Insert(...)
		unique[str] = true
	}

	/*
		exact3 := uint64(500000)
		res3 := uint64(sk3.Estimate())
		ratio = 100 * estimateError(res3, exact3)
		if ratio > 2 {
			t.Errorf("exact %d, got %d which is %.2f%% error", exact3, res3, ratio)
		}
		t.Logf("sk3 estimate %d; exact %d (%.2f%% err)", res3, exact3, ratio)
	*/

	// sk2 merge sk3
	if err := sk2.Merge(sk3); err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	exact2 = uint64(len(unique))
	res2 = uint64(sk2.Estimate())
	ratio = 100 * estimateError(res2, exact2)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact2, res2, ratio)
	}
	t.Logf("sk2 merge sk3 estimate %d; exact %d (%.2f%% err)", res2, exact2, ratio)
}
It would be good to provide some documentation about what properties the input hash for InsertHash is expected to have. For example, a reasonable question is: if my data is a set of 64-bit ints, can I use them directly as the "hash"? I ran some experiments doing that and got bad results. I also got bad results with a simplistic hash function (xor and multiply by a large prime). Does it require good avalanche characteristics?
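For what it's worth, the experiments suggest the answer is yes: the register index and rank are carved straight out of the hash bits, so the hash needs to spread entropy across all 64 bits. A finalizer with strong avalanche behavior, such as the splitmix64 mix shown here (my assumption about what suffices, not something the library documents), fixes the integer-key case:

```go
package main

import "fmt"

// mix64 is the splitmix64 finalizer: every input bit affects every
// output bit, which a bare "xor and multiply by a prime" lacks.
func mix64(x uint64) uint64 {
	x ^= x >> 30
	x *= 0xbf58476d1ce4e5b9
	x ^= x >> 27
	x *= 0x94d049bb133111eb
	x ^= x >> 31
	return x
}

func main() {
	// Consecutive integers land far apart after mixing, making the
	// result plausible input for a 64-bit hash-consuming insert.
	for _, v := range []uint64{1, 2, 3} {
		fmt.Printf("%d -> %016x\n", v, mix64(v))
	}
}
```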
Hi @seiflotfy,
As a contributor I'd be more confident creating PRs if the project had CI integration.
Does TravisCI sound good to you? I can make a PR with the changes.
We have thousands of distinct datasets that we merge together: Merge is called on the underlying Sketches to produce a resulting Sketch. We're seeing a significant amount of time spent in maybeToNormal.
Would it be better for us to store all of the datasets as "normal" Sketches instead of sparse ones? In other words, we'd call ToNormal (or something like that) before calling MarshalBinary. Would this reduce the time spent merging?
I'm counting unique metrics across different dimensions, and the number of live HLLs can go up to several million (e.g. the number of unique IPs for each user agent, or vice versa).
It seems that even an empty hyperloglog.Sketch allocates at least 2^14 = 16384 bytes for the sparseList, which is quite a lot. Redis's hyperloglog has a much better memory footprint for small cardinalities - http://download.redis.io/redis-stable/src/hyperloglog.c
I wonder if it would be possible to align / merge these two approaches.
When creating a compressedList via the newCompressedList function, the current implementation provides no way to specify the initial size of the list. This results in multiple memory allocations during the mergeSparse function when appending elements to the newList, potentially leading to increased allocation overhead.
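A capacity hint would follow the usual Go pattern of sizing the backing slice up front. A minimal illustration of the difference, using plain slices rather than the library's types (a size parameter on newCompressedList is hypothetical):

```go
package main

import "fmt"

// fillGrowing appends into a zero-capacity slice, forcing the runtime
// to reallocate and copy the backing array as the slice grows.
func fillGrowing(n int) []uint8 {
	var s []uint8
	for i := 0; i < n; i++ {
		s = append(s, uint8(i))
	}
	return s
}

// fillSized allocates the backing array once, so every append is a
// plain store - the behavior a capacity argument would enable.
func fillSized(n int) []uint8 {
	s := make([]uint8, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, uint8(i))
	}
	return s
}

func main() {
	fmt.Println(cap(fillSized(1024)))  // exactly 1024: one allocation
	fmt.Println(cap(fillGrowing(1024)) >= 1024)
}
```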
I see options for both 16384 and 65536 registers. How much memory does each instance actually use? The read-me says each register is 4 bits; what about the sketch structure overhead?
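As a back-of-envelope under the 4-bits-per-register packing the read-me describes (two registers per byte; any struct overhead on top of this is small by comparison):

```go
package main

import "fmt"

// registerBytes returns the bytes needed for m packed 4-bit registers.
func registerBytes(m int) int {
	return m / 2 // two registers per byte
}

func main() {
	for _, m := range []int{16384, 65536} {
		fmt.Printf("%d registers -> %d bytes (%d KiB)\n",
			m, registerBytes(m), registerBytes(m)/1024)
	}
}
```

So the register data alone is 8 KiB for 16384 registers and 32 KiB for 65536.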
https://github.com/axiomhq/hyperloglog/blob/master/hyperloglog.go#L50 is not thread-safe (compile with -race and run in a multi-threaded program to detect it). This can be worked around by adding a package init function.
func init() {
	// runs exactly once, before any goroutines can call into the package
	hash = hashFunc
}
The New() function is not thread-safe because it modifies a package-level variable without locks.
var Counter *hyperloglog.Sketch
...
Counter = hyperloglog.New16()
...
bpk := make([]byte, 8)
binary.LittleEndian.PutUint64(bpk, pk)
Counter.Insert(bpk)
fatal error: concurrent map writes
goroutine 458 [running]:
runtime.throw(0x788fc0, 0x15)
/Go/src/runtime/panic.go:1114 +0x79 fp=0xc005f75e48 sp=0xc005f75e18 pc=0x435c49
runtime.mapassign_fast32(0x742f00, 0xc00813aba0, 0xc002276860, 0xa52e40)
/Go/src/runtime/map_fast32.go:101 +0x328 fp=0xc005f75e88 sp=0xc005f75e48 pc=0x411028
github.com/axiomhq/hyperloglog.set.add(...)
/Go/bin/src/github.com/axiomhq/hyperloglog/sparse.go:44
github.com/axiomhq/hyperloglog.(*Sketch).InsertHash(0xc0000615f0, 0x89da180ed69d24d0, 0x8)
/Go/bin/src/github.com/axiomhq/hyperloglog/hyperloglog.go:220 +0x117 fp=0xc005f75ed0 sp=0xc005f75e88 pc=0x6d4797
github.com/axiomhq/hyperloglog.(*Sketch).Insert(0xc0000615f0, 0xc0020a7388, 0x8, 0x8, 0xc00848fa40)
/Go/bin/src/github.com/axiomhq/hyperloglog/hyperloglog.go:214 +0x65 fp=0xc005f75f00 sp=0xc005f75ed0 pc=0x6d4665
main.BulkWorker(0x18ad287d)
/src/BulkRequest.go:92 +0x1cb fp=0xc005f75f58 sp=0xc005f75f00 pc=0x6da52b
main.main.func1(0x70c3c0, 0xc00205eaa0)
/src/api.go:71 +0x46 fp=0xc005f75f80 sp=0xc005f75f58 pc=0x6dccd6
github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc005d43290)
/Go/bin/src/github.com/panjf2000/ants/worker_func.go:68 +0xb7 fp=0xc005f75fd8 sp=0xc005f75f80 pc=0x6d9b07
runtime.goexit()
/Go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc005f75fe0 sp=0xc005f75fd8 pc=0x4626f1
created by github.com/panjf2000/ants.(*goWorkerWithFunc).run
/Go/bin/src/github.com/panjf2000/ants/worker_func.go:48 +0x53
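Until the library guards its internals, the caller-side workaround is to serialize all access to a shared Sketch. The crash is a data race on an internal map, so the same mutex pattern shown here with a plain map (a stand-in for the Sketch, to keep the example self-contained) applies to wrapping Insert/Estimate:

```go
package main

import (
	"fmt"
	"sync"
)

// lockedSet guards a map the way a wrapper around a shared
// *hyperloglog.Sketch would guard its Insert and Estimate calls.
type lockedSet struct {
	mu sync.Mutex
	m  map[uint32]struct{}
}

func (s *lockedSet) add(v uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[v] = struct{}{}
}

// concurrentFill writes distinct keys from several goroutines; with the
// mutex in place no write is lost and -race reports nothing.
func concurrentFill(goroutines, perG int) int {
	s := &lockedSet{m: make(map[uint32]struct{})}
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < perG; i++ {
				s.add(uint32(g*perG + i))
			}
		}(g)
	}
	wg.Wait()
	return len(s.m)
}

func main() {
	fmt.Println(concurrentFill(8, 1000)) // 8000: no racing map writes
}
```

A sync.RWMutex is the natural refinement if reads (Estimate) vastly outnumber writes.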