
fastbloom's People

Contributors

bitcoin-eagle, yankun1992

fastbloom's Issues

Load from file (wsl, Python)

Hi Yankun,

I am testing loading from a file, test.bin, that was created with 10 billion items. The file is 11.7 GB, but loading it needs 23.4 GB of RAM. My PC has 16 GB of RAM, so it spills into swap: after loading (224 seconds), 6 GB sits in real RAM and 6 GB in swap, and the Bloom filter runs slowly.
So does loading need RAM equal to double the size of the .bin file in order to run quickly?

Thanks,

Request: Method for lookups based on pre-computed hashes

I have a use-case where I need to look up a key within multiple bloom filters of identical fixed size and properties. Unioning the filters together doesn't help in this scenario because I need to identify which bloom filters contain the key. Ideally, I'd create the hash of the key once, then look up the set of hashes within the multiple bloom filters to avoid the re-hashing of the content.

From Python, maybe it'd look like:

# get_hashes and contains_hashes are the proposed (not yet existing) methods
hashes = bloom.get_hashes('hello')
if bloom.contains_hashes(hashes):
    ...  # do some stuff
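
To make the request concrete, here is a minimal pure-Python sketch of the idea. `ToyBloom`, `add_hashes`, and the SHA-256 double-hashing scheme are assumptions made up for illustration; they are not fastbloom's implementation or API. The point is only that the bit positions are computed once and reused across several filters that share the same size and hash count.

```python
import hashlib


class ToyBloom:
    """Hypothetical toy Bloom filter, only to illustrate the proposed API."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def get_hashes(self, key):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_hashes(self, hashes):
        for pos in hashes:
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains_hashes(self, hashes):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in hashes)


# Three filters with identical size and hash count, so positions are reusable.
filters = {name: ToyBloom(1024, 4) for name in ("a", "b", "c")}
for name in ("a", "c"):
    filters[name].add_hashes(filters[name].get_hashes("hello"))

hashes = filters["a"].get_hashes("hello")  # hash the key once
matches = [name for name, f in filters.items() if f.contains_hashes(hashes)]
print(matches)  # filters "a" and "c" hold the key; "b" is empty
```

This only works when every filter uses the same number of bits, the same number of hashes, and the same hash seeds, which matches the "identical fixed size and properties" condition above.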

panic: index out of bounds

You should know that the use of #[cfg(target_pointer_width = "64")] and its 32-bit equivalent makes this crate incompatible with serde: a filter serialised on a 64-bit architecture, sent over the network, and deserialised on a 32-bit architecture will panic. This is especially relevant for code that communicates between a natively compiled Rust binary on one side and a WASM32 JS runtime on the other.

Request: Counting bloom filter function to get current value

Another request, if possible. For the counting bloom filter, is it possible to add a function to get the current count?

E.g.:

cbf.add("asdf")
cbf.add("asdf")
print(cbf.count("asdf"))  # --> 2
print(cbf.count("abcd"))  # --> 0

Not sure if this is possible since I'm not a bloom filter expert. Looking at the code, I think it uses 4 bits per counter, so each counter would hold a value between 0 and 15, right?
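
For reference, the usual way to read a count from a counting Bloom filter is to take the minimum of the key's k counters. Below is a minimal sketch under that assumption; `ToyCountingBloom` and its hashing are hypothetical and do not reflect fastbloom's internals. The result is an upper-bound estimate (collisions can inflate it), and with 4-bit counters it saturates at 15.

```python
import hashlib


class ToyCountingBloom:
    """Hypothetical counting Bloom filter with counters capped like 4-bit cells."""

    MAX_COUNT = 15  # largest value a 4-bit counter can hold

    def __init__(self, num_counters, num_hashes):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters  # one int per counter, for clarity

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_counters for i in range(self.num_hashes)]

    def add(self, key):
        # Deduplicate positions so one insert bumps each counter at most once.
        for pos in set(self._positions(key)):
            if self.counters[pos] < self.MAX_COUNT:
                self.counters[pos] += 1

    def count(self, key):
        # The smallest counter bounds how often the key can have been added.
        return min(self.counters[pos] for pos in self._positions(key))


cbf = ToyCountingBloom(1024, 4)
cbf.add("asdf")
cbf.add("asdf")
print(cbf.count("asdf"))  # 2: every counter for "asdf" was bumped twice
print(cbf.count("abcd"))  # 0 unless all of its counters collide with "asdf"
```

Because counters stop at 15, any key added more than 15 times would report 15, so the value is only meaningful below the saturation point.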

cross language support (java)

Hi!

I'd like to share my bloom filter output to Java services. Are there any Java libs that are compatible with the data structure provided by fastbloom?

Thanks

Features: enhance bulk operations API.

Currently there are some bulk insert APIs, e.g.:

# in BloomFilter
def add_int_batch(self, array: Sequence[int]):
    ...
def add_str_batch(self, array: Sequence[str]):
    ...
...

Add more batch insert APIs and support a batch check API.

Add set cardinality estimation

Hello,

I want to implement a set cardinality estimation function.
I'm working on a PR (#11), but I cannot manage to run my unit tests.
Do you have any advice on how to build the project and run the Rust and Python test suites?

I installed maturin and ran maturin develop --release,
then tried to run the unit tests with python -m unittest py_tests.test_bloom.test_bloom_add,
but the test fails.

Thanks for your advice
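
For context, a common estimator for the number of distinct items in a Bloom filter uses the filter size m, the number of hash functions k, and the number of set bits X: n ≈ -(m / k) · ln(1 - X / m). The sketch below assumes that formula; the function name and parameters are hypothetical and are not taken from the PR or from fastbloom.

```python
import math


def estimate_cardinality(num_bits, num_hashes, bits_set):
    """Estimate distinct insertions from the number of set bits (hypothetical helper)."""
    if bits_set >= num_bits:
        return math.inf  # a saturated filter carries no cardinality information
    # log1p(-x) computes ln(1 - x) accurately when x is small.
    return -(num_bits / num_hashes) * math.log1p(-bits_set / num_bits)


# Counting set bits in a raw bit array (a stand-in for a filter's storage).
bits = bytearray(b"\x0f\x00\xf0\x01")
bits_set = sum(bin(byte).count("1") for byte in bits)
print(estimate_cardinality(len(bits) * 8, 3, bits_set))
```

The estimate degrades as the filter fills up, which is why the saturated case is handled separately instead of taking the logarithm of zero.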

Serialize bloom filter using serde

Thank you so much for this crate. Many serious applications of fastbloom may need serialization and deserialization support. Is this planned?

Would it be possible to have add_if_not_contains?

Your library looks great. My use case is deduplication, so I want to know whether a given string is already in the Bloom filter, and add it if not. I can certainly call contains and then add_str, but it seems like that would compute all the hash values twice. I assume it would run roughly twice as slow as a combined function could. Would that be possible to add? It would return whether the string was added or not. Thanks!
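
A minimal sketch of what such a combined call could do, written against a toy filter: `ToyDedupBloom` and its hashing are assumptions for illustration only, not fastbloom code. The bit positions are computed once, and the same pass both checks and sets the bits.

```python
import hashlib


class ToyDedupBloom:
    """Hypothetical toy Bloom filter with a combined check-and-insert."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_if_not_contains(self, key):
        """Return True if the key was newly added, False if it already looked present."""
        was_present = True
        for pos in self._positions(key):  # hashes are computed only once
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                was_present = False
                self.bits[byte] |= mask
        return not was_present


seen = ToyDedupBloom(1 << 16, 4)
stream = ["a", "b", "a", "c", "b"]
unique = [s for s in stream if seen.add_if_not_contains(s)]
print(unique)  # first occurrences only, barring Bloom filter false positives
```

As with any Bloom filter, a false positive makes this drop an item that was never seen, so the deduplication is approximate.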

Bug: CountingBloomFilter should not use hash1 membership if enable_repeat_insert=true

Found a tricky bug in my code which I think is caused by this.

CountingBloomFilter is using k hashes to track the count, but is also adding one more hash to test membership (hash1). Even if enable_repeat_insert=true, this count is still created and leveraged. I'm not sure if this additional hash is typical design?

Maybe when enable_repeat_insert=true, it should not be using the hash1 anywhere and should use the count>0 as the membership test only.

In my weird case I believe the above is impacting me because:

  1. Test membership of key 'abc': the filter reports it as present because it has a non-zero count, but this is actually a false positive caused by a collision
  2. I loop, decrementing until 'abc' is no longer present, to get the count
  3. Now I loop, incrementing 'abc' until it's back to its original count

Theoretically even with collisions I think this shouldn't impact any of the underlying numbers. But I think it is, because 'abc' was not actually present in the CountingBloomFilter (it was a collision case), so when it processes the first increment it'll add a presence hash. This then breaks some other numbers in my bloom filter causing the error.

Also the addition of hash1 for presence tracking adds to the error rate unnecessarily in the enable_repeat_insert=true scenario.
