python-bloomfilter's Issues

How to set an expiry time on a Bloom filter

I want to use a Bloom filter in scrapy_redis, which has a large number of URLs to filter.
If I run the Bloom filter for a few weeks, memory usage on Linux blows up. So I want the Bloom filter to cover only the last 7 days (or some other period) of URLs. What should I do? Can you give me some advice?
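One common way to bound memory is to keep one filter per time bucket (for example, one per day) and drop buckets that fall outside the window. The sketch below is only an illustration of that idea: `RotatingFilter`, `bucket_seconds`, `max_buckets` and `make_filter` are hypothetical names, not part of pybloom or scrapy_redis, and a plain `set` stands in for each bucket so the example runs without dependencies.

```python
import time
from collections import OrderedDict


class RotatingFilter:
    """Keep one membership filter per time bucket and drop expired buckets.

    `make_filter` is a hypothetical factory; `set` is only a stand-in here.
    Any object supporting `add(item)` and `item in obj` should work.
    """

    def __init__(self, bucket_seconds=86400, max_buckets=7, make_filter=set):
        self.bucket_seconds = bucket_seconds
        self.max_buckets = max_buckets
        self.make_filter = make_filter
        self.buckets = OrderedDict()  # bucket index -> filter

    def _bucket(self, now):
        return int(now // self.bucket_seconds)

    def _expire(self, current):
        # Discard every bucket that has fallen out of the window.
        oldest_allowed = current - self.max_buckets + 1
        for index in [i for i in self.buckets if i < oldest_allowed]:
            del self.buckets[index]

    def add(self, item, now=None):
        current = self._bucket(time.time() if now is None else now)
        self._expire(current)
        if current not in self.buckets:
            self.buckets[current] = self.make_filter()
        self.buckets[current].add(item)

    def __contains__(self, item):
        return any(item in f for f in self.buckets.values())

    def seen(self, item, now=None):
        # Expire old buckets first, then test membership.
        current = self._bucket(time.time() if now is None else now)
        self._expire(current)
        return item in self
```

With the defaults, a URL added on day 0 is still reported as seen on day 6 and forgotten from day 7 onward. Memory is then bounded by `max_buckets` filters rather than growing for as long as the crawler runs.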

Error installing pybloom

When doing the install, the following error is obtained.

ERROR: Complete output from command python setup.py egg_info:
ERROR: Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-iwa1sif6/pybloom/setup.py", line 2, in <module>
    from ez_setup import use_setuptools
  File "/tmp/pip-install-iwa1sif6/pybloom/ez_setup.py", line 98
    except pkg_resources.VersionConflict, e:
                                        ^
SyntaxError: invalid syntax
----------------------------------------
ERROR: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-iwa1sif6/pybloom/

Is there any way to solve the problem?

Unit tests fail on Windows

I tested on Linux and they work fine.

On Windows, however, the bitarray library throws an exception:

======================================================================
ERROR: test_serialization (pybloom.tests.Serialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\users\taylor\src\python-bloomfilter\pybloom\tests.py", line 92, in test_serialization
    filter.tofile(f)
  File "c:\users\taylor\src\python-bloomfilter\pybloom\pybloom.py", line 251, in tofile
    else self.bitarray.tofile(f))
TypeError: open file expected

----------------------------------------------------------------------
Ran 14 tests in 0.391s

FAILED (errors=1)

Even though the file is definitely open:

<open file '<fdopen>', mode 'w+b' at 0x0000000003291420>

I realize this isn't a bug in the pybloom library, but I am filing it here since the unit tests do not pass on Windows.

Custom Hashes

Is it possible to seed the hash functions within an instance of BloomFilter?
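As an illustration of what seeding could look like, the sketch below mixes a seed into the salt of each hash function. It is not pybloom's API: `make_seeded_hashes` and its parameters are hypothetical, and SHA-256 is chosen here only for convenience.

```python
import hashlib
from struct import pack


def make_seeded_hashes(seed, num_hashes, num_bits):
    """Return a function mapping a bytes key to `num_hashes` bit positions.

    Hypothetical helper: the seed is mixed into each salt so that two
    filters built with different seeds probe different positions.
    """
    salts = tuple(
        hashlib.sha256(pack("<QI", seed, i)) for i in range(num_hashes)
    )

    def positions(key):
        result = []
        for salt in salts:
            h = salt.copy()      # reuse the pre-salted state
            h.update(key)
            # Interpret the first 8 digest bytes as an unsigned integer.
            result.append(int.from_bytes(h.digest()[:8], "big") % num_bits)
        return result

    return positions
```

Two filters built with the same seed map a key to the same positions, so they stay comparable; filters built with different seeds do not.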

Multi-level URL deduplication problem

After rendering, a website has tens of thousands of URLs or more, and these URLs are hierarchical. If a URL at one level is judged to be a duplicate, its next-level URLs are ignored outright. Can a Bloom filter solve this problem? Where would I need to make changes? Could anyone suggest a good approach?

How can I dump a BloomFilter object?

I want to dump the object out and load it again the next time.

But when I use pickle, the filter always comes back empty (.count()=0) after reloading.

How can I dump it out and preserve its contents?
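The tracebacks elsewhere on this page show that the filters have a `tofile` method and a `fromfile` class method, so one option is to use those instead of pickle. The helpers below are a minimal sketch under that assumption: `save_filter` and `load_filter` are hypothetical names, and whether the element count survives the round trip may depend on the library version.

```python
def save_filter(bloom, path):
    """Write a filter to disk via its own `tofile` method."""
    with open(path, "wb") as fh:
        bloom.tofile(fh)


def load_filter(filter_cls, path):
    """Rebuild a filter from disk via the class's `fromfile` method."""
    with open(path, "rb") as fh:
        return filter_cls.fromfile(fh)
```

Usage would look like `save_filter(bf, "urls.bloom")` followed later by `bf = load_filter(BloomFilter, "urls.bloom")`, where the file name is only an example.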

Confused about hashing twice

salts = tuple(hashfn(hashfn(pack('I', i)).digest()) for i in range_fn(num_salts))
Why is the salt hashed twice? Hashing only once would seem to serve the same purpose.
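To make the question concrete, the sketch below builds the salts both ways and hashes the same key with each. SHA-256 and `num_salts = 4` are assumptions for the example; the quoted line leaves `hashfn` and `num_salts` to the surrounding code. Both variants yield a distinct salt per index. The digests differ between the two, so filters built one way cannot be read the other way. The source does not say why the author chose the double hash.

```python
import hashlib
from struct import pack

num_salts = 4  # assumed value for the example

# Salt state seeded with the packed index only.
single = [hashlib.sha256(pack("I", i)) for i in range(num_salts)]
# Salt state seeded with the digest of that hash, as in the quoted line.
double = [
    hashlib.sha256(hashlib.sha256(pack("I", i)).digest())
    for i in range(num_salts)
]


def digest_with(salt, key):
    h = salt.copy()   # copy so the shared salt object is not mutated
    h.update(key)
    return h.digest()


single_digests = [digest_with(s, b"example") for s in single]
double_digests = [digest_with(s, b"example") for s in double]
```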

Tag request

Possible to add a tag for version 2.0? Would like to add this project to MacPorts, but they require fetching tagged versions and not directly from head.

Incompatibility with bitarray >= 2.0.0

Getting an exception using this library...

...
  File "/home/adam/v/tf/evenly/ws2/app/bloom.py", line 25, in get_bloom_filters
    green_bloom = ScalableBloomFilter.fromfile(infile)
  File "/home/adam/v/tf/evenly/ws2/buildout/eggs/pybloom_live-3.0.0-py3.7.egg/pybloom_live/pybloom.py", line 368, in fromfile
    filter.filters.append(BloomFilter.fromfile(f, fl))
  File "/home/adam/v/tf/evenly/ws2/buildout/eggs/pybloom_live-3.0.0-py3.7.egg/pybloom_live/pybloom.py", line 216, in fromfile
    if filter.num_bits != filter.bitarray.length() and \
NotImplementedError: self.length() has been deprecated since 1.5.1, and was removed in 2.0.0.  Use len(self) instead.

I understand that the NotImplementedError comes from bitarray, although the traceback doesn't make that very clear. It looks like it is raised from a C file: https://github.com/ilanschnell/bitarray/blob/cdb9b11cb144b373f49ac5b5b9015f1bfa2982d7/bitarray/_bitarray.c#L644
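The error message names the replacement: `len(self)` instead of `self.length()`. A small compatibility helper along these lines would avoid the removed method; `bit_count` is a hypothetical name, not something pybloom_live or bitarray provides.

```python
def bit_count(bits):
    """Number of bits in a bitarray-like object.

    Uses len(), which the error message above recommends in place of
    the .length() method removed in bitarray 2.0.0.
    """
    return len(bits)
```

Any comparison against `filter.bitarray.length()` could then be written against `bit_count(filter.bitarray)`.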

ratio in ScalableBloomFilter

Hi, I'm wondering why ScalableBloomFilter uses ratio; it seems that the first filter has a different error rate from the remaining filters. In the code, the first filter has an error rate of error_rate * (1 - ratio), and the remaining filters have an error rate of error_rate * ratio.
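One possible reading is that the per-filter rates form a geometric series. The sketch below assumes the first filter uses error_rate * (1 - ratio) and every later filter multiplies the previous filter's rate by ratio. That second part is an assumption, since the issue does not quote the code. Under it, the rates sum to at most error_rate, which bounds the compound false-positive probability however many filters are added. The values 0.001 and 0.9 are illustrative.

```python
def stage_error_rates(error_rate, ratio, stages):
    """Per-filter error rates, assuming each stage tightens the previous
    stage's rate by `ratio` and the first stage uses error_rate * (1 - ratio)."""
    first = error_rate * (1.0 - ratio)
    return [first * ratio ** i for i in range(stages)]


rates = stage_error_rates(error_rate=0.001, ratio=0.9, stages=200)
# Union bound on the compound false-positive probability.
total = sum(rates)
```

In closed form the sum over n stages is error_rate * (1 - ratio^n), which approaches error_rate from below as n grows.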

Bug: Inconsistency in how `fromfile`, `intersection` and `union` deal with self.bitarray

Once you deserialize a serialized BloomFilter object, the length of self.bitarray might differ because of added padding.

https://github.com/jaybaird/python-bloomfilter/blob/master/pybloom/pybloom.py#L271

Here the difference in length due to the trailing bits is ignored.

No such accounting for differing bitarray lengths is done here https://github.com/jaybaird/python-bloomfilter/blob/master/pybloom/pybloom.py#L224 or here https://github.com/jaybaird/python-bloomfilter/blob/master/pybloom/pybloom.py#L238 . There the bitarray union and intersection will fail if the bitarray.length() values differ. The lengths may differ after a round trip through serialization and deserialization, even when the capacities and error rates are the same.

I think the correct thing to do is to strip off the padding in fromfile, so that the bitarray representation is exactly the same.
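The proposed fix can be sketched as follows, with plain Python lists standing in for bitarray objects. `trim_padding` and `union_bits` are hypothetical helpers, not functions in pybloom.

```python
def trim_padding(bits, num_bits):
    """Drop trailing pad bits so the sequence has exactly `num_bits` entries."""
    if len(bits) < num_bits:
        raise ValueError("bit sequence is shorter than num_bits")
    return bits[:num_bits]


def union_bits(a, b, num_bits):
    """Bitwise OR of two filters' bits after normalising their lengths."""
    a, b = trim_padding(a, num_bits), trim_padding(b, num_bits)
    return [x | y for x, y in zip(a, b)]
```

For example, a 5-bit filter that was written to disk and read back may carry 8 bits because of byte padding; trimming both operands to 5 bits makes the union well defined.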

How is make_hashfuncs designed?

First, thanks for your Bloom filter; it's easy to use. I'm interested in "make_hashfuncs". Can you point me to an article that explains how the method is designed?

Why is fmt_code, chunk_size = 'Q', 8 chosen when num_bits >= (1 << 31), and how is total_hash_bits calculated, and so on?
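A plausible reading is that each hash digest is cut into fixed-width unsigned integers, one per slice. The integer width is chosen so that an integer can address every bit position, and total_hash_bits is the amount of digest output needed to supply num_slices such integers. The sketch below illustrates that reasoning only. Apart from the (1 << 31) case quoted above, the thresholds and the list of candidate hashes are assumptions, and `pick_format` is a hypothetical name.

```python
import hashlib
from struct import calcsize


def pick_format(num_slices, num_bits):
    """Choose a struct code wide enough to index `num_bits` positions and a
    hash whose digest is long enough to supply `num_slices` such integers."""
    if num_bits >= (1 << 31):
        fmt_code = "Q"      # 8-byte unsigned integer (case quoted above)
    elif num_bits >= (1 << 15):
        fmt_code = "I"      # 4-byte unsigned integer (assumed threshold)
    else:
        fmt_code = "H"      # 2-byte unsigned integer (assumed threshold)
    chunk_size = calcsize("=" + fmt_code)
    # Bits of hash output needed to cut one integer per slice.
    total_hash_bits = 8 * num_slices * chunk_size
    # Smallest candidate digest covering the need, else the largest one.
    candidates = (hashlib.sha1, hashlib.sha256, hashlib.sha384, hashlib.sha512)
    hashfn = next(
        (h for h in candidates if h().digest_size * 8 >= total_hash_bits),
        hashlib.sha512,
    )
    return fmt_code, chunk_size, total_hash_bits, hashfn
```

With 7 slices over 1000 bits, 2-byte integers suffice and 112 bits of digest are needed. With 10 slices over 2^31 bits, 640 bits are needed, more than any single digest here provides, which is presumably where several salted hashes come in.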

Incorrect unicode behavior

Line 71 of pybloom.py should test whether the key is unicode rather than whether it is not unicode. If the key is not unicode, encoding it as utf-8 will first attempt to decode the string using the default encoding. Usually this is ascii, so if the string is not ascii a UnicodeDecodeError will be raised. Similarly, if the key is unicode, str(key) tries to encode the key in the default encoding. If the key contains non-ascii characters, this will raise UnicodeEncodeError.
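The issue describes Python 2 behavior. In Python 3 terms, where text is `str` and binary data is `bytes`, the intended logic might look like the sketch below: encode text explicitly and leave bytes untouched. `to_bytes` is a hypothetical helper, not the code at line 71.

```python
def to_bytes(key):
    """Return a stable bytes representation of `key` for hashing."""
    if isinstance(key, bytes):
        return key                      # already bytes: hash as-is
    if isinstance(key, str):
        return key.encode("utf-8")      # text: encode explicitly
    return str(key).encode("utf-8")     # other objects: via their str() form
```

Because no step relies on a default encoding, non-ASCII keys are handled without raising either exception.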

BloomFilter (and ScalableBF) should support set operations (intersection, union ... )

I am not particularly sure that the backend algorithm can reliably support this, but it would be nice if the BloomFilter implementation could support the operations of the set primitive. Specifically:

>>> a = BloomFilter(10)
>>> a.add(1)
>>> b = BloomFilter(10)
>>> b.add(2)
>>> c = a | b
>>> 1 in c
True
>>> 2 in c
True
>>> 3 in c
False

I find sets to be incredibly useful in general coding practice, as it allows one to work naturally on data. Having a proxy to a set of data, without having to load said data, would increase the size of problems that one could easily work on, especially when one is more interested in set membership than the actual members.

This is just a feature suggestion (not even a request really), so no hurry. If I have time, I might even write a patch. I am unfamiliar with the hash function generation and the nuances there, so that's why I haven't jumped right into it.
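For two filters with the same size and the same hash functions, union reduces to a bitwise OR of the underlying bits. The sketch below shows the idea with a self-contained toy class; `TinyBloom` is illustrative only and is not pybloom's implementation.

```python
import hashlib


class TinyBloom:
    """Illustrative Bloom filter backed by a Python int used as a bitmask."""

    def __init__(self, num_bits=1024, num_hashes=4, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits

    def _positions(self, item):
        data = repr(item).encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

    def _check_compatible(self, other):
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share size and hash count")

    def __or__(self, other):
        # Union: a bit is set if it is set in either filter.
        self._check_compatible(other)
        return TinyBloom(self.num_bits, self.num_hashes, self.bits | other.bits)

    def __and__(self, other):
        # Intersection: may report extra false positives compared with a
        # filter built directly from the true intersection.
        self._check_compatible(other)
        return TinyBloom(self.num_bits, self.num_hashes, self.bits & other.bits)
```

The union has no false negatives: anything added to either operand is found in the result. The intersection is only an approximation, and a scalable filter made of several differently sized filters would need more care than this.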

Confused about num_slices

I am confused about num_slices in the code. The formula I know for the number of slices is num_slices = (M/n) * ln(2), but the code uses num_slices = log2(1/P).
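The two formulas agree when the filter is sized optimally. Assuming the standard optimal bit count for n elements at false-positive probability p, with k standing for num_slices and m for the total number of bits M, substituting m into the first formula gives the second:

```latex
\begin{aligned}
k &= \frac{m}{n}\ln 2, \qquad m = -\frac{n \ln p}{(\ln 2)^2} \\
\Rightarrow\; k &= -\frac{n \ln p}{(\ln 2)^2}\cdot\frac{\ln 2}{n}
   = -\frac{\ln p}{\ln 2} = \log_2\frac{1}{p}
\end{aligned}
```

So a filter built from a target error rate can compute the number of slices from p alone and then derive the bit count, rather than the other way round.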
