
cardinalityestimation's Introduction

CardinalityEstimation

HyperLogLog-based set cardinality estimation library

This library estimates the number of unique elements in a set, in a quick and memory-efficient manner. It's based on the following:

  1. Flajolet et al., "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm", DMTCS proc. AH 2007, http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
  2. Heule, Nunkesser and Hall 2013, "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/40671.pdf

The accuracy/memory trade-off is user-selectable. Typically, the estimator gives an exact count for small cardinalities (up to 100 unique elements) and 97% accuracy or better (usually much better) for any cardinality up to nearly 2^64, while consuming at most a few kilobytes of memory (no more than 16 KB).
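
For example, here is a rough sketch of trading memory for accuracy, assuming the b: constructor parameter seen in the issues further down selects the number of HyperLogLog index bits (so 2^b registers are kept); this is an illustration, not the library's documented behavior:

// Sketch only: smaller b means fewer registers, less memory, and lower accuracy.
var defaultEstimator = new CardinalityEstimator();      // library default (larger, more accurate)
var compactEstimator = new CardinalityEstimator(b: 10); // 2^10 registers, smaller but less accurate

for (int i = 0; i < 1_000_000; i++)
{
    string element = "element-" + i;
    defaultEstimator.Add(element);
    compactEstimator.Add(element);
}

Console.WriteLine(defaultEstimator.Count());
Console.WriteLine(compactEstimator.Count());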

Usage

Usage is very simple:

ICardinalityEstimator<string> estimator = new CardinalityEstimator();

estimator.Add("Alice");
estimator.Add("Bob");
estimator.Add("Alice");
estimator.Add("George Michael");

ulong numberOfUniqueElements = estimator.Count(); // will be 3
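
Estimators built on different machines can also be combined; here is a minimal sketch using the Merge method referenced in the issues below:

// Sketch: combine two estimators, e.g. built on different servers.
var estimatorA = new CardinalityEstimator();
var estimatorB = new CardinalityEstimator();

estimatorA.Add("Alice");
estimatorA.Add("Bob");
estimatorB.Add("Bob");
estimatorB.Add("Carol");

estimatorA.Merge(estimatorB);
ulong combinedCount = estimatorA.Count(); // 3 (Alice, Bob, Carol)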

NuGet Package

This code is available as the NuGet package CardinalityEstimation.

To install, run the following command in the Package Manager Console:

Install-Package CardinalityEstimation
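
If you use the .NET CLI rather than the Package Manager Console, the equivalent command is:

dotnet add package CardinalityEstimation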

Release Notes

Keeping things friendly

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

cardinalityestimation's People

Contributors

charlessolar, olegsavin, oronnavon, pshrosbree, saguiitay, selvasingh, slurmlord, upasnarayan


cardinalityestimation's Issues

Provide efficient serializer

Since multiple estimators are expected to run on various servers, and their results merged together, we can expect that the estimators will be transmitted over the network. Providing an efficient (size-wise) serializer therefore makes sense, to reduce the amount of data transmitted.
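
A minimal sketch of what such a round-trip looks like with the CardinalityEstimatorSerializer used in the other issues below (the sizes printed are illustrative, not measured):

// Sketch: serialize an estimator to a stream (e.g. to send it to another server),
// then deserialize and count on the receiving side.
var estimator = new CardinalityEstimator();
for (int i = 0; i < 1_000_000; i++)
{
    estimator.Add(Guid.NewGuid().ToString());
}

var serializer = new CardinalityEstimatorSerializer();
using (var stream = new MemoryStream())
{
    serializer.Serialize(stream, estimator, true);
    Console.WriteLine($"Serialized size: {stream.Length:N0} bytes");

    stream.Seek(0, SeekOrigin.Begin);
    var roundTripped = serializer.Deserialize(stream);
    Console.WriteLine($"Count after round-trip: {roundTripped.Count():N0}");
}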

Murmur hash gives different hash values for same input value

When using CardinalityEstimation with long as the input type, MurmurHash seems to give different hashes for the same values (at least sometimes, on a 64-bit machine). Setting the hash function to FNV-1a resolves the issue.

I will probably look into this a bit more, but since this was very hard to debug, I thought I'd warn other users :-)

Regards,

Arthur

High error for sequential sets - consider MurmurHash instead of FNV

The current implementation shows much higher error rates in the edge case where the IDs of counted objects are sequential (e.g. 1, 2, 3... when counting integers). MurmurHash appears to have better avalanche characteristics than the FNV hash, and some experimentation seems to confirm this.
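
A sketch of the kind of measurement behind this observation, counting sequential IDs and comparing against the true cardinality (an Add(long) overload is assumed, as described in the previous issue):

// Sketch: sequential IDs stress the hash's avalanche behavior.
const long trueCount = 1_000_000;
var estimator = new CardinalityEstimator();

for (long id = 0; id < trueCount; id++)
{
    estimator.Add(id);
}

ulong estimate = estimator.Count();
double relativeError = Math.Abs((double)estimate - trueCount) / trueCount;
Console.WriteLine($"Estimate: {estimate:N0}, relative error: {relativeError:P2}");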

There is a good C# implementation of MurmurHash with a suitable license here https://github.com/darrenkopp/murmurhash-net . It needs some minor modification but will work well for CardinalityEstimation.

CardinalityEstimation is not supported on .net core

Any reason for that?

Restoring packages for ...
Package CardinalityEstimation 1.5.0 is not compatible with netcoreapp1.1 (.NETCoreApp,Version=v1.1). Package CardinalityEstimation 1.5.0 supports: net45 (.NETFramework,Version=v4.5)
Package murmurhash 1.0.0 is not compatible with netcoreapp1.1 (.NETCoreApp,Version=v1.1). Package murmurhash 1.0.0 supports:
  - net35 (.NETFramework,Version=v3.5)
  - net40 (.NETFramework,Version=v4.0)
  - net45 (.NETFramework,Version=v4.5)
One or more packages are incompatible with .NETCoreApp,Version=v1.1.
Package restore failed. Rolling back package changes.
Time Elapsed: 00:00:01.6802616
========== Finished ==========

Deserialization fails with NullReferenceException in certain cases

I was testing out the serializer/deserializer and noticed a potential issue.

When I use 10 bits or fewer and add 100 elements or fewer, there seems to be an exception during deserialization:

var estimator = new CardinalityEstimator(b: 10);

for (int i = 0; i < 100; i++)
{
    Guid g = Guid.NewGuid();
    estimator.Add(g.ToByteArray());
}

Console.WriteLine("Count before serialize: {0}", estimator.Count());

Stream stream = new MemoryStream();
var serializer = new CardinalityEstimatorSerializer();
serializer.Serialize(stream, estimator, true);

Console.WriteLine("Length of stream: {0}", stream.Length);

stream.Seek(0, SeekOrigin.Begin);
var deserialized = serializer.Deserialize(stream); // exception HERE

The exception happens in lines 500-503 in CardinalityEstimator.cs:

else
{
    this.lookupDense[substream] = Math.Max(this.lookupDense[substream], sigma); // HERE
}

It would be great if you could look into this. It's also possible that I'm using the library incorrectly and this is expected behavior; if so, I would like to understand why deserialization fails in this case.

Publish signed version of .NET Standard-compatible library

It looks like there are two NuGets available:

  • CardinalityEstimation: .NET Standard, not strong-name signed
  • CardinalityEstimation.Signed: .NET Framework, strong-name signed

We'd like to use the CardinalityEstimation library in a .NET Core project that requires strong-name signed assemblies. Would you be willing to update the CardinalityEstimation.Signed nuget to the .NET Standard version?

CardinalityEstimator.Merge() Keeps Direct Counting Items No Matter How Many Exist After The Merge

I've run into a case where the CardinalityEstimator's serialized size grows larger and larger because the direct-counted items list is never cleared.

We have a set of estimators that are later merged together (no additional Adds are done). If the individual estimators each hold fewer than 100 items, the merged estimator can end up holding 100+ direct-counted items after the merge. If more small estimators are later merged in, the direct-count list keeps growing.

In this example, I'm using serialization to demonstrate the increasing size of the mergedEstimator vs. the add-only estimator (a possible fix is sketched after the output below). If the loop is increased from 10k to 100k iterations, the mergedEstimator continues to increase in size:

var addedEstimator = new CardinalityEstimator();
var mergedEstimator = new CardinalityEstimator();

for(int i = 0; i < 10_000; i++) {
	var guid = Guid.NewGuid().ToString();
	
	addedEstimator.Add(guid);

	// Simulate some intermediate estimators being merged together
	var temporaryEstimator = new CardinalityEstimator();
	temporaryEstimator.Add(guid);
	mergedEstimator.Merge(temporaryEstimator);
}

var serializer = new CardinalityEstimatorSerializer();

var stream1 = new MemoryStream();
serializer.Serialize(stream1, addedEstimator, true);
Console.WriteLine($"Size of 'Just added' estimator: {stream1.Length.ToString("N0")}");

var stream2 = new MemoryStream();
serializer.Serialize(stream2, mergedEstimator, true);
Console.WriteLine($"Size of merged estimator: {stream2.Length.ToString("N0")}");

Output:

Size of 'Just added' estimator: 16,406
Size of merged estimator: 80,022
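
One possible direction for a fix is sketched below. The names directCount, DirectCounterMaxElements, and AddElementHash are hypothetical stand-ins for whatever the real internals are called; this only illustrates the idea of flushing the direct-count set into the HLL registers once it exceeds the threshold:

// Hypothetical sketch of a Merge fix, not the library's actual code.
public void Merge(CardinalityEstimator other)
{
    // ... merge dense/sparse registers as today ...

    if (this.directCount != null && other.directCount != null)
    {
        this.directCount.UnionWith(other.directCount);
        if (this.directCount.Count > DirectCounterMaxElements)
        {
            foreach (ulong elementHash in this.directCount)
            {
                AddElementHash(elementHash); // hypothetical: fold the hash into the HLL registers
            }
            this.directCount = null;         // stop direct counting from here on
        }
    }
}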

Add a strong-named version

  • Check in the strong-naming key
  • Build both versions side by side
  • The strong-named version should use murmurhash-signed
  • Publish both versions to NuGet

Make DirectCounterMaxElements configurable

It would be great to be able to change the 100 direct-counter elements to something smaller, like 10 or 20, perhaps via a constructor parameter similar to the number of bits (see the sketch below). We are working on a project with a significant memory restriction, so we want to reduce the amount of storage used as much as possible.
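
Something like the following is what is being asked for; the directCounterMaxElements parameter is hypothetical and does not exist in the current API:

// Hypothetical constructor parameter requested above; not part of the current API.
var estimator = new CardinalityEstimator(b: 10, directCounterMaxElements: 20);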

Re-estimate biases based on actual hash used

As done by Heule et al., it is important to estimate the bias of the CardinalityEstimator. However, the current implementation uses the bias-correction values published by Heule et al. while using a different hash function, which leads to sub-optimal accuracy. We can follow their method to estimate the bias for the hash actually used and update the bias-correction values; a sketch of the procedure follows.
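
The procedure from Heule et al., roughly: for each target cardinality, run many randomized trials, average the raw (uncorrected) estimate, and record (rawEstimate, bias) pairs to interpolate against at query time. RawEstimate() below is a hypothetical accessor for the uncorrected estimate, which the estimator would need to expose:

// Sketch of re-estimating the bias-correction table for the hash actually used.
const int trials = 5000;
long[] cardinalities = { 1_000, 2_000, 5_000, 10_000, 20_000, 50_000 };

foreach (long n in cardinalities)
{
    double rawSum = 0;
    for (int t = 0; t < trials; t++)
    {
        var estimator = new CardinalityEstimator();
        for (long i = 0; i < n; i++)
        {
            estimator.Add(Guid.NewGuid().ToString());
        }
        rawSum += estimator.RawEstimate(); // hypothetical: uncorrected HLL estimate
    }
    double avgRaw = rawSum / trials;
    Console.WriteLine($"n={n:N0}: rawEstimate={avgRaw:F1}, bias={avgRaw - n:F1}");
}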
