vla / bloomfilter.netcore Goto Github PK
View Code? Open in Web Editor NEWA bloom filter implementation
License: MIT License
A bloom filter implementation
License: MIT License
uint
is not CLS-compliant.
Ex: StringBuilder(Int32)
, List<T>(Int32)
etc
Perhaps I'm missing some configuration options, but I would have expected Bloom filter to have static memory usage:
[MemoryDiagnoser]
public class Benchmark
{
private static readonly int items = 30_000_000;
[Benchmark]
public void BloomFilter()
{
var bf = FilterBuilder.Build(1000, 0.01);
for (var i = 0; i < items; i++)
{
bf.Add($"property_{i}_name");
}
for (var i = 0; i < items; i++)
{
if (!bf.Contains($"property_{i}_name"))
{
Console.WriteLine($"False negative {i}");
}
}
}
[Benchmark]
public void Dictionary()
{
var bf = new Dictionary<string, bool>();
for (var i = 0; i < items; i++)
{
bf.Add($"property_{i}_name", true);
}
for (var i = 0; i < items; i++)
{
if (!bf.ContainsKey($"property_{i}_name"))
{
Console.WriteLine($"False negative {i}");
}
}
}
}
BenchmarkDotNet v0.13.7, macOS Big Sur 11.6.5 (20G527) [Darwin 20.6.0]
Intel Core i5-1038NG7 CPU 2.00GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 6.0.202
[Host] : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT AVX2
DefaultJob : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|------------ |---------:|---------:|---------:|-------------:|------------:|-----------:|----------:|
| BloomFilter | 9.538 s | 0.0471 s | 0.0440 s | 3774000.0000 | - | - | 11.03 GB |
| Dictionary | 17.162 s | 0.2386 s | 0.1993 s | 1008000.0000 | 131000.0000 | 11000.0000 | 6.37 GB |
// * Hints *
Outliers
Benchmark.Dictionary: Default -> 2 outliers were removed (17.85 s, 18.53 s)
Tested with various different options, and Dictionary uses consistently less memory
My understanding is that Bloom Filters work on the premise of using multiple hash functions (i.e., k = # of hash functions) (see https://en.wikipedia.org/wiki/Bloom_filter#Optimal_number_of_hash_functions). To add or test an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1 or check the positions are set to 1. If they are set to 1, the item maybe in the set. If any of the bits are 0, then the item is not in the set.
The implementation of Filter.cs only uses a single HashFunction. Why is that? While you can do this with a single hash function, the implementation should have the ability to use 1 or more hash functions.
To be put another way, the Bloom Filter requires "different" hash functions. There must be k different hash functions defined.
There is a discussion of using a hash function by passing k different initial values to a hash function. I am assuming that is what you are doing. Please help me confirm this.
"Alternatively, one can pass k different initial values (such as 0, 1, ..., k โ 1) to a hash function that takes an initial value; or add (or append) these values to the key" from WIKIPEDIA below.
*********************************** FROM WIKIPEDIA ************************************
The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k โ 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate.[3] (Specifically, Dillinger & Manolios (2004b) show the effectiveness of deriving the k indices using enhanced double hashing or triple hashing, variants of double hashing that are effectively simple random number generators seeded with the two or three hash values.)
"
Hi,
can you explain for me how does bloomfilter deal with bit collision and how we can prevent that?
Tks,
Wilson
I noticed that you check whether the bits were set on Add. Wouldn't it be better to just set the bits and not care about the results?
lock (_sync)
{
for (var i = 0; i < hashes.Count; i++)
{
if (!_hashBits.Get(hashes[i]))
{
_hashBits.Set(hashes[i], true);
processResults[i] = false;
}
else
{
processResults[i] = true;
}
}
}
I have 50 items to test the ContainsAsync of Redis filter. It took about 1.5 seconds. And 1000 items took 5 seconds. Its configuation is : Expected Elements:100000, errorate: 0.01. Could u pls tell my how to improve my performace?
Can you set the expiration time of RedisKey under Redis?
Could a feature be included to allow filters to be saved to a file or loaded from file?
Thanks
Hello
Unless I overlooked it, I didn't see the C# Hashset in the benchmarks. I think it's an interesting baseline to add.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.