vla / bloomfilter.netcore

A bloom filter implementation

License: MIT License

C# 99.86% Batchfile 0.11% Shell 0.03%

bloomfilter.netcore's Issues

Remove uint usage (CLS)

https://github.com/vla/BloomFilter.NetCore/blob/0036a2c007fbb8419b3bbf86a8c6335f4209ed25/src/BloomFilter/FilterBuilder.cs#LL14C38-L14C43

uint is not CLS-compliant.
The base class library uses Int32 for sizes and capacities, e.g. StringBuilder(Int32), List<T>(Int32), etc.

https://learn.microsoft.com/en-us/dotnet/standard/language-independence#types-and-type-member-signatures

Unexpectedly high memory usage

Perhaps I'm missing some configuration options, but I would have expected a Bloom filter to have static memory usage:

[MemoryDiagnoser]
public class Benchmark
{
    private static readonly int items = 30_000_000;
    [Benchmark]
    public void BloomFilter()
    {
        var bf = FilterBuilder.Build(1000, 0.01);
        for (var i = 0; i < items; i++)
        {
            bf.Add($"property_{i}_name");
        }
        for (var i = 0; i < items; i++)
        {
            if (!bf.Contains($"property_{i}_name"))
            {
                Console.WriteLine($"False negative {i}");
            }
        }
    }

    [Benchmark]
    public void Dictionary()
    {
        var bf = new Dictionary<string, bool>();
        for (var i = 0; i < items; i++)
        {
            bf.Add($"property_{i}_name", true);
        }
        for (var i = 0; i < items; i++)
        {
            if (!bf.ContainsKey($"property_{i}_name"))
            {
                Console.WriteLine($"False negative {i}");
            }
        }
    }
}

BenchmarkDotNet v0.13.7, macOS Big Sur 11.6.5 (20G527) [Darwin 20.6.0]
Intel Core i5-1038NG7 CPU 2.00GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 6.0.202
  [Host]     : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT AVX2
  DefaultJob : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT AVX2


|      Method |     Mean |    Error |   StdDev |         Gen0 |        Gen1 |       Gen2 | Allocated |
|------------ |---------:|---------:|---------:|-------------:|------------:|-----------:|----------:|
| BloomFilter |  9.538 s | 0.0471 s | 0.0440 s | 3774000.0000 |           - |          - |  11.03 GB |
|  Dictionary | 17.162 s | 0.2386 s | 0.1993 s | 1008000.0000 | 131000.0000 | 11000.0000 |   6.37 GB |

// * Hints *
Outliers
  Benchmark.Dictionary: Default -> 2 outliers were removed (17.85 s, 18.53 s)

I tested with various different options, and Dictionary consistently uses less memory.
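For reference, the bit array a Bloom filter needs for a given capacity and error rate can be estimated with the standard sizing formulas. The sketch below is not taken from the library; the class and method names are hypothetical, and the library's own rounding may differ.

```csharp
using System;

// Hypothetical helper, not part of BloomFilter.NetCore.
public static class BloomSizingSketch
{
    // Optimal bit count m for n expected items at false-positive rate p:
    // m = -n * ln(p) / (ln 2)^2
    public static long OptimalBits(long n, double p) =>
        (long)Math.Ceiling(-n * Math.Log(p) / (Math.Log(2) * Math.Log(2)));

    // Optimal number of hash functions k for m bits and n items:
    // k = (m / n) * ln 2
    public static int OptimalHashes(long n, long m) =>
        Math.Max(1, (int)Math.Round((double)m / n * Math.Log(2)));

    // Expected false-positive rate once n items have actually been inserted:
    // (1 - e^(-k*n/m))^k
    public static double FalsePositiveRate(long n, long m, int k) =>
        Math.Pow(1 - Math.Exp(-(double)k * n / m), k);
}
```

With the benchmark's parameters (1000 expected items, 0.01 error rate), these formulas give 9586 bits, which is about 1.2 KB, and 7 hash functions. Inserting 30,000,000 items into a filter of that size drives the expected false-positive rate to essentially 1. Note also that BenchmarkDotNet's Allocated column reports total bytes allocated during the run rather than peak memory, so the interpolated strings and any per-call temporary buffers count toward it.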

Number of Hash Functions Design Question

My understanding is that Bloom filters work on the premise of using multiple hash functions (k = number of hash functions; see https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions). To add or test an element, feed it to each of the k hash functions to get k array positions. To add, set the bits at all these positions to 1; to test, check whether all of them are set to 1. If they are all set to 1, the item may be in the set. If any of the bits is 0, the item is definitely not in the set.

The implementation in Filter.cs uses only a single HashFunction. Why is that? While you can do this with a single hash function, the implementation should have the ability to use one or more hash functions.

To put it another way, a Bloom filter requires "different" hash functions. There must be k different hash functions defined.

Wikipedia discusses deriving k hash functions by passing k different initial values to a single hash function. I am assuming that is what you are doing. Could you confirm this?

"Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key" (from the Wikipedia passage below).

From Wikipedia:

"The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate.[3] (Specifically, Dillinger & Manolios (2004b) show the effectiveness of deriving the k indices using enhanced double hashing or triple hashing, variants of double hashing that are effectively simple random number generators seeded with the two or three hash values.)"
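To make the quoted idea concrete, here is a minimal sketch of plain double hashing: two base hash values are combined as h1 + i * h2 to produce k bit positions. This is an illustration only, not confirmed to be what Filter.cs does. The class name, the use of FNV-1a, and the second starting constant are all assumptions chosen for the example.

```csharp
using System;
using System.Text;

// Hypothetical illustration, not part of BloomFilter.NetCore.
public static class DoubleHashingSketch
{
    // FNV-1a over the UTF-8 bytes of the key, starting from a caller-supplied basis.
    private static uint Fnv1a(string key, uint basis)
    {
        uint hash = basis;
        foreach (byte b in Encoding.UTF8.GetBytes(key))
        {
            hash ^= b;
            hash = unchecked(hash * 16777619u); // FNV 32-bit prime
        }
        return hash;
    }

    // Derives k positions in [0, m) from two base hashes of the same key.
    public static int[] Indices(string key, int k, int m)
    {
        ulong h1 = Fnv1a(key, 2166136261u);      // standard FNV offset basis
        ulong h2 = Fnv1a(key, 0x9747b28cu) | 1u; // arbitrary second basis; forced odd so the step is never 0
        var positions = new int[k];
        for (int i = 0; i < k; i++)
        {
            positions[i] = (int)((h1 + (ulong)i * h2) % (ulong)m);
        }
        return positions;
    }
}
```

Calling DoubleHashingSketch.Indices("property_1_name", 7, 9586) returns seven positions from only two hash computations, which is why a filter can expose a single hash function setting and still probe k bits per item.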

Bloomfilter bit collision

Hi,
Can you explain how the Bloom filter deals with bit collisions, and how we can prevent them?

Thanks,
Wilson

Checking for bits set on Add?

I noticed that you check whether the bits were set on Add. Wouldn't it be better to just set the bits and not care about the results?

        lock (_sync)
        {
            for (var i = 0; i < hashes.Count; i++)
            {
                if (!_hashBits.Get(hashes[i]))
                {
                    _hashBits.Set(hashes[i], true);
                    processResults[i] = false;
                }
                else
                {
                    processResults[i] = true;
                }
            }
        }
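The two approaches can be compared side by side in a small sketch. It assumes the bit positions have already been computed and uses System.Collections.BitArray as a stand-in for the library's bit storage; the class and method names are hypothetical.

```csharp
using System.Collections;

// Hypothetical illustration, not part of BloomFilter.NetCore.
public sealed class AddVariantsSketch
{
    private readonly BitArray _bits;
    private readonly object _sync = new object();

    public AddVariantsSketch(int bitCount)
    {
        _bits = new BitArray(bitCount);
    }

    // Sets every bit unconditionally and reports nothing.
    public void AddWithoutCheck(int[] positions)
    {
        lock (_sync)
        {
            foreach (int p in positions)
            {
                _bits[p] = true;
            }
        }
    }

    // Sets every bit and returns true if at least one bit was previously 0,
    // i.e. the item was definitely not in the filter before this call.
    public bool AddWithCheck(int[] positions)
    {
        bool changed = false;
        lock (_sync)
        {
            foreach (int p in positions)
            {
                if (!_bits[p])
                {
                    _bits[p] = true;
                    changed = true;
                }
            }
        }
        return changed;
    }
}
```

The unconditional variant does slightly less work per bit, while the checking variant lets the caller learn from a single call whether the item was new. Which trade-off matters depends on whether callers use the return value of Add.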
