yankun1992 / fastbloom
A fast bloom filter implemented in Rust for Python! 10x faster than pybloom!
License: Apache License 2.0
Hi Yankun,
I tested loading from a file test.bin created with 10 billion items. The file is 11.7 GB, but loading it needs 23.4 GB of RAM. My PC has 16 GB, so after loading (224 seconds) it spills into swap, and the bloom filter then runs slowly.
So does it need double the .bin file's size in RAM to load and run quickly?
Thanks,
I have a use case where I need to look up a key within multiple bloom filters of identical fixed size and properties. Unioning the filters together doesn't help in this scenario because I need to identify which bloom filters contain the key. Ideally, I'd hash the key once, then look up that set of hashes in each of the bloom filters to avoid re-hashing the content.
From Python, it might look like:

```python
hashes = bloom.get_hashes('hello')
if bloom.contains_hashes(hashes):
    ...  # do some stuff
```
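Pre-hashing a key once and reusing the positions across same-shaped filters is straightforward in principle. A pure-Python sketch, where slicing a single SHA-256 digest stands in for the library's actual hashing and all names are illustrative, not fastbloom's API:

```python
import hashlib

M = 1024  # bits per filter (all filters must share this size)
K = 3     # number of hash functions (must also match)

def indices(key: str) -> list[int]:
    # Derive K bit positions from one pass over the key by slicing
    # a single SHA-256 digest (a stand-in for the real hash scheme).
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

def add(bits: bytearray, idx: list[int]) -> None:
    for i in idx:
        bits[i // 8] |= 1 << (i % 8)

def contains(bits: bytearray, idx: list[int]) -> bool:
    return all(bits[i // 8] & (1 << (i % 8)) for i in idx)

filters = [bytearray(M // 8) for _ in range(3)]
add(filters[0], indices("hello"))
add(filters[2], indices("hello"))

# Hash once, then probe every filter with the cached positions,
# reporting which filters may contain the key.
h = indices("hello")
hit = [n for n, f in enumerate(filters) if contains(f, h)]
```

The key constraint is that all filters agree on m and k; otherwise the cached positions are meaningless across filters.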
Hi,
How do I save and load a created filter?
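The library's persistence API isn't quoted here, so as a generic sketch: a Bloom filter persists as its raw bit array plus its parameters (m, k, hash seed), and loading means reading those bytes back into a filter constructed with identical parameters. A file round-trip of a stand-in bit array:

```python
import os
import tempfile

# Stand-in for a populated filter's underlying bit array.
bits = bytearray(1024 // 8)
bits[3] = 0b10010001  # pretend some keys were added

# Save: dump the raw bytes to disk.
path = os.path.join(tempfile.mkdtemp(), "filter.bin")
with open(path, "wb") as f:
    f.write(bits)

# Load: read the bytes back; a real filter would also need its
# m, k, and seed restored alongside the bit array.
with open(path, "rb") as f:
    restored = bytearray(f.read())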
You should know that the use of #[cfg(target_pointer_width = "64")] and its 32-bit equivalent makes this crate incompatible with serde across architectures: a filter serialized on a 64-bit arch, sent over the network, and deserialized on a 32-bit arch will panic. This is especially a problem for code that communicates between natively compiled Rust on one side and a WASM32 JS runtime on the other.
Another request, if possible: for the counting bloom filter, could you add a function to get the current count?
E.g.:

```python
cbf.add("asdf")
cbf.add("asdf")
print(cbf.count("asdf"))  # --> 2
print(cbf.count("abcd"))  # --> 0
```

Not sure if this is possible, since I'm not a bloom filter expert. Looking at the code, I think it uses 4 bits per counter, so it would encode a count between 0 and 15, right?
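A counting filter can support this approximately: the minimum over a key's counters is an upper bound on how many times the key was inserted, because collisions can only inflate counters, never deflate them. A minimal sketch with saturating 4-bit counters (names and hashing are illustrative, not fastbloom's internals):

```python
import hashlib

M = 256  # number of counters
K = 3    # hash functions per key

def indices(key: str) -> list[int]:
    # Stand-in hashing: slice one SHA-256 digest into K positions.
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

# One byte per counter here for simplicity; a real implementation
# packs two 4-bit counters per byte.
counters = bytearray(M)

def add(key: str) -> None:
    for i in indices(key):
        if counters[i] < 15:  # 4-bit counters saturate at 15
            counters[i] += 1

def count(key: str) -> int:
    # The minimum over the key's counters: an upper bound on the
    # true insert count (exact unless every counter collided).
    return min(counters[i] for i in indices(key))

add("asdf")
add("asdf")
```

So yes: with 4-bit counters the reported count is capped at 15, and it can over-report under collisions, but never under-report.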
Hi!
I'd like to share my bloom filter output to Java services. Are there any Java libs that are compatible with the data structure provided by fastbloom?
Thanks
There are now some bulk insert APIs, e.g.:

```python
# in BloomFilter
def add_int_batch(self, array: Sequence[int]):
    ...

def add_str_batch(self, array: Sequence[str]):
    ...
```

Add more batch insert APIs and support batch check APIs.
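A batch check could mirror the existing batch inserts. Sketched below against a stub filter; contains_str_batch is the requested, hypothetical API shape, not something fastbloom is shown to expose:

```python
from typing import Sequence


class _StubFilter:
    # Stand-in with the same add()/contains() shape as the real
    # filter, used only to demonstrate the proposed batch signature.
    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add(self, s: str) -> None:
        self._seen.add(s)

    def contains(self, s: str) -> bool:
        return s in self._seen


def contains_str_batch(bloom, array: Sequence[str]) -> list[bool]:
    # A native batch API could do this in a single call across the
    # Python/Rust boundary; this loop only shows the intended shape.
    return [bloom.contains(s) for s in array]


bf = _StubFilter()
bf.add("a")
bf.add("b")
flags = contains_str_batch(bf, ["a", "b", "c"])
```

The real win of a native batch check is amortizing the FFI crossing, not the loop itself.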
Awesome project! This worked great out of the box and was a speedup of around 10x from the previous package I was using.
See XXH3 hashing here:
https://cyan4973.github.io/xxHash/
According to their benchmarks, it's about 10x the speed of murmur3, which I think would probably speed up this bloom filter implementation even further.
Hello,
I want to implement a set cardinality estimation function.
I'm working on a PR (#11) but I cannot manage to launch my unit tests.
Do you have any advice on how to build the project and run the Rust and Python test suites?
I installed maturin and ran maturin develop --release,
then tried to run the unit tests with python -m unittest py_tests.test_bloom.test_bloom_add,
but the test fails.
Thanks for your advice.
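For reference, the standard cardinality estimator for a Bloom filter with m bits, k hash functions, and X set bits is n ≈ -(m/k) · ln(1 - X/m). A self-contained sketch of that formula (the function name is illustrative, not the PR's actual API):

```python
import math


def estimate_cardinality(m: int, k: int, set_bits: int) -> float:
    # Standard Bloom filter cardinality estimate:
    #   n ≈ -(m / k) * ln(1 - X / m)
    # where X is the number of bits currently set to 1.
    if set_bits >= m:
        # A saturated filter carries no information about n.
        return float("inf")
    return -(m / k) * math.log(1.0 - set_bits / m)


# Example: a 1 Mi-bit filter with 7 hashes and 66,000 set bits.
est = estimate_cardinality(m=1 << 20, k=7, set_bits=66000)
```

Counting the set bits is a cheap popcount over the bit array, so this estimator needs no extra state beyond the filter itself.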
Thank you so much for this crate. Many serious applications of fastbloom may need serialization and deserialization support. Is this planned?
Your library looks great. My use case is deduplication, so I want to know whether a given string is already in the Bloom filter, and add it if not. I can certainly call contains and then add_str, but that seems to repeat computing all the hash values; I assume it would run roughly twice as slow as a combined function could. Would it be possible to add such a function? It would return whether the item was added or not. Thanks!
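A combined test-and-set that hashes only once is easy to sketch in pure Python; add_if_absent is a hypothetical name, and the SHA-256 indexing stands in for the library's real hashing:

```python
import hashlib

M = 1024  # filter size in bits
K = 3     # hash functions

def _indices(key: str) -> list[int]:
    # Stand-in hashing: slice one SHA-256 digest into K positions.
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

bits = bytearray(M // 8)

def add_if_absent(key: str) -> bool:
    # Hash once; test and set in the same pass. Returns True if the
    # key was (probably) new and has just been added.
    idx = _indices(key)
    present = all(bits[i // 8] & (1 << (i % 8)) for i in idx)
    if present:
        return False
    for i in idx:
        bits[i // 8] |= 1 << (i % 8)
    return True

first = add_if_absent("hello")   # new: inserts and reports True
second = add_if_absent("hello")  # already present: reports False
```

The usual caveat applies: "already present" is probabilistic, so a false positive will skip an insert that a set would have made.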
Found a tricky bug in my code which I think is caused by this.
CountingBloomFilter uses k hashes to track the count, but also adds one more hash to test membership (hash1). Even when enable_repeat_insert=true, this extra hash is still set and used. I'm not sure this additional hash is a typical design.
Maybe when enable_repeat_insert=true, it should not use hash1 at all and should rely on count > 0 as the only membership test.
In my weird case I believe the above is impacting me because:
Theoretically, even with collisions, this shouldn't affect any of the underlying numbers. But I think it does, because 'abc' was not actually present in the CountingBloomFilter (it was a collision case), so when it processed the first increment it also set a presence hash. That then breaks some other numbers in my bloom filter, causing the error.
Also, setting hash1 for presence tracking adds to the error rate unnecessarily in the enable_repeat_insert=true scenario.
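The suggestion above can be sketched directly: derive membership from the counters alone (every index's count > 0), so deletion stays consistent with insertion and no separate presence hash inflates the error rate. Names and hashing are illustrative, not fastbloom's internals:

```python
import hashlib

M = 256  # number of counters
K = 3    # hash functions

counters = bytearray(M)  # one byte per counter; saturates at 15

def _indices(key: str) -> list[int]:
    # Stand-in hashing: slice one SHA-256 digest into K positions.
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

def add(key: str) -> None:
    for i in _indices(key):
        if counters[i] < 15:
            counters[i] += 1

def remove(key: str) -> None:
    for i in _indices(key):
        if counters[i] > 0:
            counters[i] -= 1

def contains(key: str) -> bool:
    # Membership is derived purely from the counters: the key is
    # (probably) present iff every one of its counters is nonzero.
    # No separate presence bit means no extra false-positive source.
    return all(counters[i] > 0 for i in _indices(key))

add("abc")
```

With only counters, an add followed by a remove returns the structure to its prior state, which is exactly the invariant the hash1 presence bit breaks in the collision case described above.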