yankun1992 / fastbloom
A fast bloom filter implemented in Rust for Python! 10x faster than pybloom!
License: Apache License 2.0
Hi Yankun,
I tested loading from a file test.bin created with 10 billion items. The file is 11.7 GB, but loading it needs 23.4 GB of RAM. My PC has 16 GB, so after loading (224 seconds) it spills into swap, and the bloom filter then runs slowly.
So does it need double the .bin file's size in RAM to load and run quickly?
Thanks,
I have a use case where I need to look up a key within multiple bloom filters of identical fixed size and properties. Unioning the filters together doesn't help in this scenario because I need to identify which bloom filters contain the key. Ideally, I'd hash the key once, then look up that set of hashes in each of the bloom filters to avoid re-hashing the content.
From Python, it might look like:

```python
hashes = bloom.get_hashes('hello')
if bloom.contains_hashes(hashes):
    ...  # do some stuff
```
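Pre-hashing a key once and reusing the positions across same-shaped filters is straightforward in principle. A pure-Python sketch, where slicing a single SHA-256 digest stands in for the library's actual hashing and all names are illustrative, not fastbloom's API:

```python
import hashlib

M = 1024  # bits per filter (all filters must share this size)
K = 3     # number of hash functions (must also match)

def indices(key: str) -> list[int]:
    # Derive K bit positions from one pass over the key by slicing
    # a single SHA-256 digest (a stand-in for the real hash scheme).
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

def add(bits: bytearray, idx: list[int]) -> None:
    for i in idx:
        bits[i // 8] |= 1 << (i % 8)

def contains(bits: bytearray, idx: list[int]) -> bool:
    return all(bits[i // 8] & (1 << (i % 8)) for i in idx)

filters = [bytearray(M // 8) for _ in range(3)]
add(filters[0], indices("hello"))
add(filters[2], indices("hello"))

# Hash once, then probe every filter with the cached positions,
# reporting which filters may contain the key.
h = indices("hello")
hit = [n for n, f in enumerate(filters) if contains(f, h)]
```

The key constraint is that all filters agree on m and k; otherwise the cached positions are meaningless across filters.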
Hi,
How do I save and load a created filter?
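The library's persistence API isn't quoted here, so as a generic sketch: a Bloom filter persists as its raw bit array plus its parameters (m, k, hash seed), and loading means reading those bytes back into a filter constructed with identical parameters. A file round-trip of a stand-in bit array:

```python
import os
import tempfile

# Stand-in for a populated filter's underlying bit array.
bits = bytearray(1024 // 8)
bits[3] = 0b10010001  # pretend some keys were added

# Save: dump the raw bytes to disk.
path = os.path.join(tempfile.mkdtemp(), "filter.bin")
with open(path, "wb") as f:
    f.write(bits)

# Load: read the bytes back; a real filter would also need its
# m, k, and seed restored alongside the bit array.
with open(path, "rb") as f:
    restored = bytearray(f.read())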
You should know that the use of #[cfg(target_pointer_width = "64")] and its 32-bit equivalent makes this crate incompatible with serde across architectures: a filter serialized on a 64-bit arch, sent over the network, and deserialized on a 32-bit arch will panic. This is especially a problem for code that communicates between natively compiled Rust on one side and a WASM32 JS runtime on the other.
Another request, if possible: for the counting bloom filter, could you add a function to get the current count?
E.g.:

```python
cbf.add("asdf")
cbf.add("asdf")
print(cbf.count("asdf"))  # --> 2
print(cbf.count("abcd"))  # --> 0
```

Not sure if this is possible, since I'm not a bloom filter expert. Looking at the code, I think it uses 4 bits per counter, so it would encode a count between 0 and 15, right?
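A counting filter can support this approximately: the minimum over a key's counters is an upper bound on how many times the key was inserted, because collisions can only inflate counters, never deflate them. A minimal sketch with saturating 4-bit counters (names and hashing are illustrative, not fastbloom's internals):

```python
import hashlib

M = 256  # number of counters
K = 3    # hash functions per key

def indices(key: str) -> list[int]:
    # Stand-in hashing: slice one SHA-256 digest into K positions.
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

# One byte per counter here for simplicity; a real implementation
# packs two 4-bit counters per byte.
counters = bytearray(M)

def add(key: str) -> None:
    for i in indices(key):
        if counters[i] < 15:  # 4-bit counters saturate at 15
            counters[i] += 1

def count(key: str) -> int:
    # The minimum over the key's counters: an upper bound on the
    # true insert count (exact unless every counter collided).
    return min(counters[i] for i in indices(key))

add("asdf")
add("asdf")
```

So yes: with 4-bit counters the reported count is capped at 15, and it can over-report under collisions, but never under-report.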
Hi!
I'd like to share my bloom filter output to Java services. Are there any Java libs that are compatible with the data structure provided by fastbloom?
Thanks
There are now some bulk insert APIs, e.g.:

```python
# in BloomFilter
def add_int_batch(self, array: Sequence[int]):
    ...

def add_str_batch(self, array: Sequence[str]):
    ...
```

Add more batch insert APIs and support batch check APIs.
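A batch check could mirror the existing batch inserts. Sketched below against a stub filter; contains_str_batch is the requested, hypothetical API shape, not something fastbloom is shown to expose:

```python
from typing import Sequence


class _StubFilter:
    # Stand-in with the same add()/contains() shape as the real
    # filter, used only to demonstrate the proposed batch signature.
    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add(self, s: str) -> None:
        self._seen.add(s)

    def contains(self, s: str) -> bool:
        return s in self._seen


def contains_str_batch(bloom, array: Sequence[str]) -> list[bool]:
    # A native batch API could do this in a single call across the
    # Python/Rust boundary; this loop only shows the intended shape.
    return [bloom.contains(s) for s in array]


bf = _StubFilter()
bf.add("a")
bf.add("b")
flags = contains_str_batch(bf, ["a", "b", "c"])
```

The real win of a native batch check is amortizing the FFI crossing, not the loop itself.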
Awesome project! This worked great out of the box and was a speedup of around 10x from the previous package I was using.
See XXH3 hashing here:
https://cyan4973.github.io/xxHash/
According to their benchmarks, it's about 10x the speed of murmur3, which I think would probably speed up this bloom filter implementation even further.
Hello,
I want to implement a set cardinality estimation function.
I'm working on a PR (#11) but I cannot manage to launch my unit tests.
Do you have any advice on how to build the project and run the Rust and Python test suites?
I installed maturin and ran maturin develop --release,
then tried to run the unit tests with python -m unittest py_tests.test_bloom.test_bloom_add,
but the test fails.
Thanks for your advice.
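For reference, the standard cardinality estimator for a Bloom filter with m bits, k hash functions, and X set bits is n ≈ -(m/k) · ln(1 - X/m). A self-contained sketch of that formula (the function name is illustrative, not the PR's actual API):

```python
import math


def estimate_cardinality(m: int, k: int, set_bits: int) -> float:
    # Standard Bloom filter cardinality estimate:
    #   n ≈ -(m / k) * ln(1 - X / m)
    # where X is the number of bits currently set to 1.
    if set_bits >= m:
        # A saturated filter carries no information about n.
        return float("inf")
    return -(m / k) * math.log(1.0 - set_bits / m)


# Example: a 1 Mi-bit filter with 7 hashes and 66,000 set bits.
est = estimate_cardinality(m=1 << 20, k=7, set_bits=66000)
```

Counting the set bits is a cheap popcount over the bit array, so this estimator needs no extra state beyond the filter itself.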
Thank you so much for this crate. Many serious applications of fastbloom may need serialization and deserialization support. Is this planned?
Your library looks great. My use case is deduplication, so I want to know whether a given string is already in the Bloom filter, and add it if not. I can certainly call contains and then add_str, but that seems to repeat computing all the hash values; I assume it would run roughly twice as slow as a combined function could. Would it be possible to add such a function? It would return whether the item was added or not. Thanks!
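A combined test-and-set that hashes only once is easy to sketch in pure Python; add_if_absent is a hypothetical name, and the SHA-256 indexing stands in for the library's real hashing:

```python
import hashlib

M = 1024  # filter size in bits
K = 3     # hash functions

def _indices(key: str) -> list[int]:
    # Stand-in hashing: slice one SHA-256 digest into K positions.
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

bits = bytearray(M // 8)

def add_if_absent(key: str) -> bool:
    # Hash once; test and set in the same pass. Returns True if the
    # key was (probably) new and has just been added.
    idx = _indices(key)
    present = all(bits[i // 8] & (1 << (i % 8)) for i in idx)
    if present:
        return False
    for i in idx:
        bits[i // 8] |= 1 << (i % 8)
    return True

first = add_if_absent("hello")   # new: inserts and reports True
second = add_if_absent("hello")  # already present: reports False
```

The usual caveat applies: "already present" is probabilistic, so a false positive will skip an insert that a set would have made.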
Found a tricky bug in my code which I think is caused by this.
CountingBloomFilter uses k hashes to track the count, but also adds one more hash to test membership (hash1). Even when enable_repeat_insert=true, this extra hash is still set and used. I'm not sure this additional hash is a typical design.
Maybe when enable_repeat_insert=true, it should not use hash1 at all and should rely on count > 0 as the only membership test.
In my weird case I believe the above is impacting me because:
Theoretically, even with collisions, this shouldn't affect any of the underlying numbers. But I think it does, because 'abc' was not actually present in the CountingBloomFilter (it was a collision case), so when it processed the first increment it also set a presence hash. That then breaks some other numbers in my bloom filter, causing the error.
Also, setting hash1 for presence tracking adds to the error rate unnecessarily in the enable_repeat_insert=true scenario.
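The suggestion above can be sketched directly: derive membership from the counters alone (every index's count > 0), so deletion stays consistent with insertion and no separate presence hash inflates the error rate. Names and hashing are illustrative, not fastbloom's internals:

```python
import hashlib

M = 256  # number of counters
K = 3    # hash functions

counters = bytearray(M)  # one byte per counter; saturates at 15

def _indices(key: str) -> list[int]:
    # Stand-in hashing: slice one SHA-256 digest into K positions.
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

def add(key: str) -> None:
    for i in _indices(key):
        if counters[i] < 15:
            counters[i] += 1

def remove(key: str) -> None:
    for i in _indices(key):
        if counters[i] > 0:
            counters[i] -= 1

def contains(key: str) -> bool:
    # Membership is derived purely from the counters: the key is
    # (probably) present iff every one of its counters is nonzero.
    # No separate presence bit means no extra false-positive source.
    return all(counters[i] > 0 for i in _indices(key))

add("abc")
```

With only counters, an add followed by a remove returns the structure to its prior state, which is exactly the invariant the hash1 presence bit breaks in the collision case described above.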