axiomhq / hyperloglog
HyperLogLog with lots of sugar (Sparse, LogLog-Beta bias correction and TailCut space reduction) brought to you by Axiom
Home Page: https://axiom.co
License: MIT License
github.com/dgryski/go-metro is written largely in gc-syntax assembly, which makes it difficult to support gccgo downstream of it.
cc @dgryski
It would be useful to have an insert method that takes the hash value directly. This would allow the caller to choose the hash function. It is particularly useful when the data is just integers - we could use a function that works on integers directly instead of converting them to []byte first.
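Until such a method exists, integer keys have to be serialized before every Insert. A minimal sketch of that round-trip (the encode helper is illustrative, not part of the library):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encode turns an integer key into the []byte form Insert expects today;
// a hash-taking insert method would make this conversion unnecessary.
func encode(key uint64) [8]byte {
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], key)
	return buf
}

func main() {
	b := encode(123456789)
	fmt.Printf("% x\n", b) // the 8-byte little-endian form of the key
}
```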
A new instance of a sparse Sketch requires around 118 bytes of memory on amd64.
type Sketch struct {
	p          uint8           // 1 byte
	b          uint8           // 1 byte
	m          uint32          // 4 bytes
	alpha      float64         // 8 bytes
	tmpSet     set             // 56 bytes
	sparseList *compressedList // 40 bytes (8 + 2 * 4 + 24: pointer + 2 x uint32 + slice)
	regs       *registers      // 8 bytes (nil pointer)
	// 118 bytes total
}
The scenario I would like to optimize for is storing many sketches that contain few elements. In my case I have a more-or-less zipfian distribution of set cardinalities (few large ones and a long tail of small ones).
118 bytes is considerably more than, for example, Redis, which uses 21 bytes for an HLL containing a single element. See my gist on that.
Looking over the code today and previously, I'm considering a few optimizations. @seiflotfy I'd like to get your opinion on some of them before I get my hands dirty.
Remove tmpSet. This alone would reduce the memory footprint by half. As far as I understand the code, it is used to "cache" multiple elements inserted into the sparse representation of the sketch. I believe the rationale was to reduce the time spent extending and traversing the compressed list. The lack of this cache could be mitigated, for example, by providing an InsertMany function instead.
Remove the two pointers to compressedList and registers. Both types represent []uint8 slices with some additional metadata; that metadata could be rolled into the Sketch struct itself. That saves 16 bytes of pointers. The tradeoff here is code readability.
Don't precompute alpha and m. That saves 12 bytes, at the cost of a few CPU cycles.
The above memory optimizations would amount to a saving of 75 bytes, bringing the size of an empty sketch down to 43 bytes.
type Sketch struct {
	sparse bool     // 1 byte - bringing it back since there is no *compressedList to match against
	p      uint8    // 1 byte
	b      uint8    // 1 byte
	regs   []uint32 // 24 bytes
	nz     uint32   // 4 bytes
	count  uint32   // 4 bytes
	last   uint32   // 4 bytes
	// 43 bytes total
}
What are your thoughts?
The public constructors at present allow only precision 14 or 16 (2^14 or 2^16 registers).
func New() *Sketch
func New14() *Sketch
func New16() *Sketch
func New16NoSparse() *Sketch
The private constructor newSketch looks like it can take any precision between 4 and 18, plus a flag choosing sparse or normal representation.
Looking at the commit history, the more general public constructor was removed in commit dba7ba9. Do you see any fundamental issue with the lower precisions? Is it possible to allow them again?
I have a use case where I need to maintain a large number of these HLL sketches and can tolerate a higher error rate, so I am planning to use precision 4 or 8.
I need a bit of help following TestHLLTC_Merge_Complex in hyperloglog_test.go:
Lines 374 and 384: why is the ratio calculated against sk1?
Lines 391 and 395: I tried inserting the 500000 records into sk2 and then merging with the empty sk3, as in the source; this gives me an error ratio of about 1.18%. In another version I inserted the 500000 records into sk3 and then merged into sk2; that gives me 6.52%. Shouldn't they be the same?
My code for ref:
func TestHLLTC_Merge_Complex(t *testing.T) {
	sk1, err := newSketch(14, true)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	sk2, err := newSketch(14, true)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	sk3, err := newSketch(14, true)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}

	// sk1 will contain flow-[1..10000000]
	// sk2 will contain flow-[2,4,...10000000]
	unique := map[string]bool{}
	for i := 1; len(unique) <= 10000000; i++ {
		str := fmt.Sprintf("flow-%d", i)
		sk1.Insert([]byte(str))
		if i%2 == 0 {
			sk2.Insert([]byte(str))
		}
		unique[str] = true
	}

	// sk1
	exact1 := uint64(len(unique))
	res1 := uint64(sk1.Estimate())
	ratio := 100 * estimateError(res1, exact1)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact1, res1, ratio)
	}
	t.Logf("sk1 estimate %d; exact %d (%.2f%% err)", res1, exact1, ratio)

	// sk2
	exact2 := uint64(len(unique)) / 2
	res2 := uint64(sk2.Estimate())
	ratio = 100 * estimateError(res2, exact2)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact2, res2, ratio)
	}
	t.Logf("sk2 estimate %d; exact %d (%.2f%% err)", res2, exact2, ratio)

	// sk2 merge sk1
	if err := sk2.Merge(sk1); err != nil {
		t.Errorf("unexpected error: %v", err)
	}
	exact2 = uint64(len(unique))
	res2 = uint64(sk2.Estimate())
	ratio = 100 * estimateError(res2, exact2)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact2, res2, ratio)
	}
	t.Logf("sk2 merge sk1 estimate %d; exact %d (%.2f%% err)", res2, exact2, ratio)

	for i := 1; i <= 500000; i++ {
		str := fmt.Sprintf("stream-%d", i)
		sk3.Insert([]byte(str)) // also try sk2.Insert(...)
		unique[str] = true
	}

	/*
		exact3 := uint64(500000)
		res3 := uint64(sk3.Estimate())
		ratio = 100 * estimateError(res3, exact3)
		if ratio > 2 {
			t.Errorf("exact %d, got %d which is %.2f%% error", exact3, res3, ratio)
		}
		t.Logf("sk3 estimate %d; exact %d (%.2f%% err)", res3, exact3, ratio)
	*/

	// sk2 merge sk3
	if err := sk2.Merge(sk3); err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	exact2 = uint64(len(unique))
	res2 = uint64(sk2.Estimate())
	ratio = 100 * estimateError(res2, exact2)
	if ratio > 2 {
		t.Errorf("exact %d, got %d which is %.2f%% error", exact2, res2, ratio)
	}
	t.Logf("sk2 merge sk3 estimate %d; exact %d (%.2f%% err)", res2, exact2, ratio)
}
It would be good to provide some documentation about what properties the input hash for InsertHash is expected to have. For example, a reasonable question is: if my data is a set of 64-bit ints, can I use them directly as the "hash"? I ran some experiments doing that and got bad results. I also got bad results with a simplistic hash function (xor and multiply by a large prime). Does it require good avalanche characteristics?
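For what it's worth, the experiments suggest the answer is yes: the register index and rank are carved straight out of the hash bits, so the hash needs to spread entropy across all 64 bits. A finalizer with strong avalanche behavior, such as the splitmix64 mix shown here (my assumption about what suffices, not something the library documents), fixes the integer-key case:

```go
package main

import "fmt"

// mix64 is the splitmix64 finalizer: every input bit affects every
// output bit, which a bare "xor and multiply by a prime" lacks.
func mix64(x uint64) uint64 {
	x ^= x >> 30
	x *= 0xbf58476d1ce4e5b9
	x ^= x >> 27
	x *= 0x94d049bb133111eb
	x ^= x >> 31
	return x
}

func main() {
	// Consecutive integers land far apart after mixing, making the
	// result plausible input for a 64-bit hash-consuming insert.
	for _, v := range []uint64{1, 2, 3} {
		fmt.Printf("%d -> %016x\n", v, mix64(v))
	}
}
```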
Hi @seiflotfy,
As a contributor I'd be more confident creating PRs if the project had CI integration.
Does TravisCI sound good to you? I can make a PR with the changes.
We have thousands of distinct datasets that we merge together: Merge is called on the underlying Sketches to produce a resulting Sketch. We're seeing a significant amount of time spent in maybeToNormal.
Would it be better for us to store all of the datasets as "normal" Sketches instead of sparse ones? In other words, we'd call ToNormal (or something like that) before calling MarshalBinary. Would this reduce the time spent merging?
I'm counting unique metrics across different dimensions, and the number of live HLLs can go up to several million (e.g. the number of unique IPs for each user agent, or vice versa).
It seems that even an empty hyperloglog.Sketch allocates at least 2^14 = 16384 bytes for the sparseList, which is quite a lot. Redis's hyperloglog has a much better memory footprint for small cardinalities - http://download.redis.io/redis-stable/src/hyperloglog.c
I wonder if it would be possible to align / merge these two approaches.
When creating a compressedList via the newCompressedList function, the current implementation provides no way to specify the initial size of the list. This results in multiple memory allocations during the mergeSparse function when appending elements to the newList, potentially leading to increased allocation overhead.
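A capacity hint would follow the usual Go pattern of sizing the backing slice up front. A minimal illustration of the difference, using plain slices rather than the library's types (a size parameter on newCompressedList is hypothetical):

```go
package main

import "fmt"

// fillGrowing appends into a zero-capacity slice, forcing the runtime
// to reallocate and copy the backing array as the slice grows.
func fillGrowing(n int) []uint8 {
	var s []uint8
	for i := 0; i < n; i++ {
		s = append(s, uint8(i))
	}
	return s
}

// fillSized allocates the backing array once, so every append is a
// plain store - the behavior a capacity argument would enable.
func fillSized(n int) []uint8 {
	s := make([]uint8, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, uint8(i))
	}
	return s
}

func main() {
	fmt.Println(cap(fillSized(1024)))  // exactly 1024: one allocation
	fmt.Println(cap(fillGrowing(1024)) >= 1024)
}
```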
I see options for both 16384 and 65536 registers. How much memory does each instance actually use? The read-me says each register is 4 bits; what about the sketch structure overhead?
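As a back-of-envelope under the 4-bits-per-register packing the read-me describes (two registers per byte; any struct overhead on top of this is small by comparison):

```go
package main

import "fmt"

// registerBytes returns the bytes needed for m packed 4-bit registers.
func registerBytes(m int) int {
	return m / 2 // two registers per byte
}

func main() {
	for _, m := range []int{16384, 65536} {
		fmt.Printf("%d registers -> %d bytes (%d KiB)\n",
			m, registerBytes(m), registerBytes(m)/1024)
	}
}
```

So the register data alone is 8 KiB for 16384 registers and 32 KiB for 65536.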
https://github.com/axiomhq/hyperloglog/blob/master/hyperloglog.go#L50 is not thread-safe (compile with -race and run in a multi-threaded program to detect it). This can be worked around by adding a package init function.
func init() {
	// runs exactly once, before any goroutines can call into the package
	hash = hashFunc
}
The New() function is not thread-safe because it modifies a package-level variable without locks.
var Counter *hyperloglog.Sketch
...
Counter = hyperloglog.New16()
...
bpk := make([]byte, 8)
binary.LittleEndian.PutUint64(bpk, pk)
Counter.Insert(bpk)
fatal error: concurrent map writes
goroutine 458 [running]:
runtime.throw(0x788fc0, 0x15)
/Go/src/runtime/panic.go:1114 +0x79 fp=0xc005f75e48 sp=0xc005f75e18 pc=0x435c49
runtime.mapassign_fast32(0x742f00, 0xc00813aba0, 0xc002276860, 0xa52e40)
/Go/src/runtime/map_fast32.go:101 +0x328 fp=0xc005f75e88 sp=0xc005f75e48 pc=0x411028
github.com/axiomhq/hyperloglog.set.add(...)
/Go/bin/src/github.com/axiomhq/hyperloglog/sparse.go:44
github.com/axiomhq/hyperloglog.(*Sketch).InsertHash(0xc0000615f0, 0x89da180ed69d24d0, 0x8)
/Go/bin/src/github.com/axiomhq/hyperloglog/hyperloglog.go:220 +0x117 fp=0xc005f75ed0 sp=0xc005f75e88 pc=0x6d4797
github.com/axiomhq/hyperloglog.(*Sketch).Insert(0xc0000615f0, 0xc0020a7388, 0x8, 0x8, 0xc00848fa40)
/Go/bin/src/github.com/axiomhq/hyperloglog/hyperloglog.go:214 +0x65 fp=0xc005f75f00 sp=0xc005f75ed0 pc=0x6d4665
main.BulkWorker(0x18ad287d)
/src/BulkRequest.go:92 +0x1cb fp=0xc005f75f58 sp=0xc005f75f00 pc=0x6da52b
main.main.func1(0x70c3c0, 0xc00205eaa0)
/src/api.go:71 +0x46 fp=0xc005f75f80 sp=0xc005f75f58 pc=0x6dccd6
github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc005d43290)
/Go/bin/src/github.com/panjf2000/ants/worker_func.go:68 +0xb7 fp=0xc005f75fd8 sp=0xc005f75f80 pc=0x6d9b07
runtime.goexit()
/Go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc005f75fe0 sp=0xc005f75fd8 pc=0x4626f1
created by github.com/panjf2000/ants.(*goWorkerWithFunc).run
/Go/bin/src/github.com/panjf2000/ants/worker_func.go:48 +0x53
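Until the library guards its internals, the caller-side workaround is to serialize all access to a shared Sketch. The crash is a data race on an internal map, so the same mutex pattern shown here with a plain map (a stand-in for the Sketch, to keep the example self-contained) applies to wrapping Insert/Estimate:

```go
package main

import (
	"fmt"
	"sync"
)

// lockedSet guards a map the way a wrapper around a shared
// *hyperloglog.Sketch would guard its Insert and Estimate calls.
type lockedSet struct {
	mu sync.Mutex
	m  map[uint32]struct{}
}

func (s *lockedSet) add(v uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[v] = struct{}{}
}

// concurrentFill writes distinct keys from several goroutines; with the
// mutex in place no write is lost and -race reports nothing.
func concurrentFill(goroutines, perG int) int {
	s := &lockedSet{m: make(map[uint32]struct{})}
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < perG; i++ {
				s.add(uint32(g*perG + i))
			}
		}(g)
	}
	wg.Wait()
	return len(s.m)
}

func main() {
	fmt.Println(concurrentFill(8, 1000)) // 8000: no racing map writes
}
```

A sync.RWMutex is the natural refinement if reads (Estimate) vastly outnumber writes.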