jaybaird / python-bloomfilter

Scalable Bloom Filter implemented in Python

License: MIT License

Python 100.00%

python-bloomfilter's Issues

How do I save my bloom filter to a file?

Is there some way I can store my full bloom filter to a file and load it again later to check for new values?

How to set an expire time when using bloomfilter

I want to use bloomfilter in scrapy_redis. scrapy_redis has a large number of URLs that need to be filtered.
If I use the bloomfilter for a few weeks, memory usage on Linux will blow up. So I want the bloomfilter to filter only the URLs from the last 7 days (or some other period). What should I do? Can you give me some advice?

Error installing pybloom

When doing the install, the following error is obtained.

ERROR: Complete output from command python setup.py egg_info:
ERROR: Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-iwa1sif6/pybloom/setup.py", line 2, in <module>
    from ez_setup import use_setuptools
  File "/tmp/pip-install-iwa1sif6/pybloom/ez_setup.py", line 98
    except pkg_resources.VersionConflict, e:
                                        ^
SyntaxError: invalid syntax
----------------------------------------
ERROR: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-iwa1sif6/pybloom/

Is there any way to solve the problem?

Unit tests fail on Windows

I tested on Linux and they work fine.

But on Windows the bitarray library throws an exception:

======================================================================
ERROR: test_serialization (pybloom.tests.Serialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\users\taylor\src\python-bloomfilter\pybloom\tests.py", line 92, in test_serialization
    filter.tofile(f)
  File "c:\users\taylor\src\python-bloomfilter\pybloom\pybloom.py", line 251, in tofile
    else self.bitarray.tofile(f))
TypeError: open file expected

----------------------------------------------------------------------
Ran 14 tests in 0.391s

FAILED (errors=1)

Even though the file is definitely open:

<open file '<fdopen>', mode 'w+b' at 0x0000000003291420>

I realize this isn't a bug in the pybloom library, but I am filing it here since the unit tests do not pass on Windows.

Custom Hashes

Is it possible to seed the hash functions within an instance of BloomFilter?
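
For illustration, seeding can be done by mixing a seed into the salt of each hash function. This is a minimal sketch under that assumption, not pybloom's API: `make_seeded_hashfuncs` and its `seed` parameter are hypothetical, and it relies on the `salt` argument of `hashlib.blake2b` from the standard library.

```python
import hashlib
import struct


def make_seeded_hashfuncs(num_hashes, num_bits, seed=0):
    """Return a function mapping a key to `num_hashes` bit positions.

    Hypothetical helper: `seed` must fit in an unsigned 64-bit integer.
    """
    # One 16-byte salt per hash function: (seed, function index).
    salts = [struct.pack("<QQ", seed, i) for i in range(num_hashes)]

    def indexes(key):
        data = key if isinstance(key, bytes) else str(key).encode("utf-8")
        return [
            int.from_bytes(
                hashlib.blake2b(data, digest_size=8, salt=s).digest(), "little"
            ) % num_bits
            for s in salts
        ]

    return indexes


hash_a = make_seeded_hashfuncs(4, 1 << 20, seed=1)
hash_b = make_seeded_hashfuncs(4, 1 << 20, seed=2)
```

Two filters built with different seeds would set different bit positions for the same key, so they cannot be combined or compared bit by bit.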

Multi-level URL deduplication problem

A website has tens of thousands of URLs or more after rendering, and these URLs are hierarchical. If a URL at one level is judged to be a duplicate, the URLs at the level below it are ignored outright. Can bloomfilter solve this problem? Where do I need to change the code?

Could anyone suggest a good approach?

How can I dump a bloomfilter object?

I want to dump the object and load it again the next time.

But when I use pickle, I find the filter is always empty (.count() = 0) after reloading.

How can I dump it and still keep its contents?

Is this available on python3?

confused about hash twice

salts = tuple(hashfn(hashfn(pack('I', i)).digest()) for i in range_fn(num_salts))
Why do I need to hash twice? We only use the result once, and hashing once gives the same result.

Tag request

Possible to add a tag for version 2.0? Would like to add this project to MacPorts, but they require fetching tagged versions and not directly from head.

What is the license of this Project?

Can I use part of the code (make_hashfuncs) in my project?

Incompatibility with bitarray >= 2.0.0

Getting an exception using this library...

...
  File "/home/adam/v/tf/evenly/ws2/app/bloom.py", line 25, in get_bloom_filters
    green_bloom = ScalableBloomFilter.fromfile(infile)
  File "/home/adam/v/tf/evenly/ws2/buildout/eggs/pybloom_live-3.0.0-py3.7.egg/pybloom_live/pybloom.py", line 368, in fromfile
    filter.filters.append(BloomFilter.fromfile(f, fl))
  File "/home/adam/v/tf/evenly/ws2/buildout/eggs/pybloom_live-3.0.0-py3.7.egg/pybloom_live/pybloom.py", line 216, in fromfile
    if filter.num_bits != filter.bitarray.length() and \
NotImplementedError: self.length() has been deprecated since 1.5.1, and was removed in 2.0.0.  Use len(self) instead.

I understand that the NotImplementedError comes from bitarray, although the traceback doesn't make that very clear. It looks like it is raised from a C file: https://github.com/ilanschnell/bitarray/blob/cdb9b11cb144b373f49ac5b5b9015f1bfa2982d7/bitarray/_bitarray.c#L644

ratio in ScalableBloomFilter

Hi, I'm wondering why you use ratio in ScalableBloomFilter. It seems that the first filter has a different error rate from the rest of the filters: in the code, the first filter has an error rate of error_rate * (1 - ratio), and the remaining filters have an error rate of error_rate * ratio.

How to remove an element from a ScalableBloomFilter?

I want to remove some elements, so that the next time I add them, add() will return False.

Is there a built-in function to do this?
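
A standard Bloom filter cannot delete keys, because clearing a bit might also erase other keys that share it. The usual workaround is a counting Bloom filter, which stores a small counter per position instead of a single bit. The class below is a toy sketch of that variant: `CountingBloom` and its methods are hypothetical and are not part of pybloom.

```python
import hashlib


class CountingBloom:
    """Toy counting Bloom filter; illustration only."""

    def __init__(self, num_counters=1024, num_hashes=4):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _indexes(self, key):
        data = str(key).encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_counters

    def __contains__(self, key):
        return all(self.counters[i] > 0 for i in self._indexes(key))

    def add(self, key):
        """Insert the key; return True if it already appeared to be present."""
        present = key in self
        for i in self._indexes(key):
            self.counters[i] += 1
        return present

    def remove(self, key):
        """Decrement the key's counters; return False if it was not present."""
        if key not in self:
            return False
        for i in self._indexes(key):
            self.counters[i] -= 1
        return True


cb = CountingBloom()
first_add = cb.add("url-1")    # key was not present before
removed = cb.remove("url-1")   # counters go back to zero
second_add = cb.add("url-1")   # behaves like a fresh insert again
```

Removing a key that was never added but happens to be a false positive would corrupt the counters, so removal is only safe for keys known to have been inserted.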

Submit 2.0 to PyPI

Can you submit 2.0 to PyPI? It seems that 1.1 is the latest: https://pypi.python.org/pypi/pybloom

Load bloom filter from bytes (instead of loading from file)

Hello.

Is there a way to load a Bloom Filter from bytes instead of loading from a file? The advantage is that we do not need to download a file to load a Bloom Filter.

Thank you in advance.
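
A general technique for this is to wrap the bytes in `io.BytesIO`, which gives a file-like object that a `fromfile`-style loader can read from. The sketch below shows the idea with a toy class: `TinyBloom`, its header layout, and its `tofile`/`fromfile` methods are assumptions made for illustration, and whether pybloom's own `fromfile` accepts a `BytesIO` object may depend on the installed bitarray version.

```python
import hashlib
import io
import struct


class TinyBloom:
    """Toy Bloom filter for illustration only; not pybloom's implementation."""

    HEADER = "<QI"  # num_bits (8 bytes), num_hashes (4 bytes)

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _indexes(self, key):
        data = key.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(struct.pack("<I", i) + data).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        for idx in self._indexes(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, key):
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(key))

    def tofile(self, f):
        f.write(struct.pack(self.HEADER, self.num_bits, self.num_hashes))
        f.write(self.bits)

    @classmethod
    def fromfile(cls, f):
        header = f.read(struct.calcsize(cls.HEADER))
        num_bits, num_hashes = struct.unpack(cls.HEADER, header)
        obj = cls(num_bits, num_hashes)
        obj.bits = bytearray(f.read((num_bits + 7) // 8))
        return obj


# Serialize to memory, as if the bytes came from a network call or a database.
buf = io.BytesIO()
original = TinyBloom()
original.add("https://example.com/a")
original.tofile(buf)
payload = buf.getvalue()

# Load straight from bytes by wrapping them in a file-like object.
restored = TinyBloom.fromfile(io.BytesIO(payload))
```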

Is this abandonware?

Is this project abandonware? Based on #9 it would seem so...

Bug: Inconsistency in how `fromfile`, `intersection` and `union` deal with self.bitarray

Once you deserialize a serialized BloomFilter object the self.bitarray length might differ because of added padding.

https://github.com/jaybaird/python-bloomfilter/blob/master/pybloom/pybloom.py#L271

Here difference in length due to the trailing bits is ignored.

No such accounting for differing bitarray lengths is done here https://github.com/jaybaird/python-bloomfilter/blob/master/pybloom/pybloom.py#L224 or here https://github.com/jaybaird/python-bloomfilter/blob/master/pybloom/pybloom.py#L238 . There, the bitarray union and intersection will fail if the bitarray.length() values are different. The lengths may differ because of a round trip through serialization and deserialization, even when the capacities and error rates are the same.

I think the correct thing to do here is to strip off the padding in fromfile to ensure that the bitarray representation is exactly the same.

How is make_hashfuncs designed?

First, thanks for your bloom filter, it's easy to use. I'm interested in "make_hashfuncs". Can you point me to an article that explains how the method is designed?

Why does num_bits >= (1 << 31) lead to fmt_code, chunk_size = 'Q', 8, and how are total_hash_bits and the other values calculated?
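
As a rough illustration of what these values control: a format code from the `struct` module decides how many digest bytes are unpacked into each integer, and a wider code is needed once the bit index no longer fits in a narrower one. The sketch below is not the library's code. Only the `'Q', 8` branch comes from the question above; the other thresholds, the fixed use of SHA-512, and the names `pick_format` and `make_indexer` are assumptions.

```python
import hashlib
import struct


def pick_format(num_bits):
    """Choose a struct code wide enough to index `num_bits` positions."""
    if num_bits >= (1 << 31):
        return "Q", 8  # unsigned 64-bit, 8 bytes per index
    if num_bits >= (1 << 15):  # assumed threshold
        return "I", 4  # unsigned 32-bit, 4 bytes per index
    return "H", 2      # unsigned 16-bit, 2 bytes per index


def make_indexer(num_slices, num_bits):
    fmt_code, chunk_size = pick_format(num_bits)
    # Digest bits needed to produce one index per slice.
    total_hash_bits = 8 * num_slices * chunk_size
    per_digest = hashlib.sha512().digest_size // chunk_size  # integers per digest
    num_salts = -(-num_slices // per_digest)                 # ceiling division
    fmt = "<" + fmt_code * per_digest

    def indexes(key):
        data = key.encode("utf-8")
        out = []
        for salt in range(num_salts):
            digest = hashlib.sha512(struct.pack("<I", salt) + data).digest()
            out.extend(v % num_bits for v in struct.unpack(fmt, digest))
        return out[:num_slices]

    return indexes, total_hash_bits


indexes, total_hash_bits = make_indexer(num_slices=7, num_bits=1000)
positions = indexes("example-key")
```

Here one digest is cut into `chunk_size`-byte integers, so `total_hash_bits` tells how much digest material a key needs and therefore how many salted digests must be computed.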

Incorrect unicode behavior

In pybloom.py line 71 should test if the key is unicode rather than testing if the key is not unicode. If the key is not unicode, encoding it as utf-8 will first attempt to decode the string using the default encoding. Usually this is ascii, so if the string is not ascii a UnicodeDecodeError will be raised. Similarly, if the key is unicode, str(key) tries to encode the key in the default encoding. If the key contains non-ascii characters, this will raise UnicodeEncodeError.
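
For comparison, here is how the same normalization step might look in Python 3, where text and bytes are separate types and no implicit ASCII decoding happens. This is a sketch only: `to_hashable_bytes` is a hypothetical helper and does not reproduce the line referenced above.

```python
def to_hashable_bytes(key):
    """Turn a key into bytes suitable for feeding to a hash function."""
    if isinstance(key, bytes):
        return key  # already bytes: hash as-is
    if isinstance(key, str):
        return key.encode("utf-8")  # text: encode explicitly, never implicitly
    return str(key).encode("utf-8")  # other objects: use their text form


encoded = to_hashable_bytes("café")
```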

Can I clear a bloom filter I created?

If I have a global bloom filter, is there a way to empty it out (set all the bits to zero)? Thanks

How to install pybloom on mac os with python3?

I can't install pybloom with pip3.

BloomFilter (and ScalableBF) should support set operations (intersection, union ... )

I am not particularly sure that the backend algorithm can reliably support this, but it would be nice if the BloomFilter implementation could support the operations of the set primitive. Specifically:

>>> a = BloomFilter(10)
>>> a.add(1)
>>> b = BloomFilter(10)
>>> b.add(2)
>>> c = a | b
>>> 1 in c
True
>>> 2 in c
True
>>> 3 in c
False

I find sets to be incredibly useful in general coding practice, as it allows one to work naturally on data. Having a proxy to a set of data, without having to load said data, would increase the size of problems that one could easily work on, especially when one is more interested in set membership than the actual members.

This is just a feature suggestion (not even a request really), so no hurry. If I have time, I might even write a patch. I am unfamiliar with the hash function generation and the nuances there, so that's why I haven't jumped right into it.
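
To make the idea concrete: when two Bloom filters share the same size and hash functions, their union is the bitwise OR of the bit arrays and an approximate intersection is the bitwise AND. The toy class below sketches this with a Python int as the bit array. `IntBloom` is a hypothetical name and this is not how pybloom implements anything.

```python
import hashlib


class IntBloom:
    """Toy Bloom filter whose bit array is a Python int; illustration only."""

    def __init__(self, num_bits=256, num_hashes=3, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits

    def _mask(self, key):
        data = str(key).encode("utf-8")
        mask = 0
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            mask |= 1 << (int.from_bytes(digest[:8], "little") % self.num_bits)
        return mask

    def add(self, key):
        self.bits |= self._mask(key)

    def __contains__(self, key):
        mask = self._mask(key)
        return (self.bits & mask) == mask

    def _check(self, other):
        # Set operations only make sense for identically configured filters.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share num_bits and num_hashes")

    def __or__(self, other):
        self._check(other)
        return IntBloom(self.num_bits, self.num_hashes, self.bits | other.bits)

    def __and__(self, other):
        self._check(other)
        return IntBloom(self.num_bits, self.num_hashes, self.bits & other.bits)


a = IntBloom()
a.add(1)
b = IntBloom()
b.add(2)
c = a | b  # contains everything added to a or b
```

The union is exact in the sense that it equals the filter built from both sets of keys. The AND is only an approximation of the intersection and can report more false positives than a filter built from the true intersection.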

Confused about num_slices

I am very confused about num_slices in the code. The formula I know for calculating it is num_slices = (M / n) * ln(2), but in this code the formula is num_slices = log2(1 / P).
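
The two formulas agree when the filter is sized optimally. Assuming the classic sizing M = -n * ln(P) / (ln 2)^2, substituting into (M / n) * ln(2) gives -ln(P) / ln(2), which is log2(1 / P). A quick numeric check under that assumption follows; `optimal_parameters` is a hypothetical helper and any rounding the library applies is ignored.

```python
import math


def optimal_parameters(capacity, error_rate):
    """Return (bits, k computed from M and n, k computed from P)."""
    # Classic optimal size: M = -n * ln(P) / (ln 2)^2
    bits = capacity * abs(math.log(error_rate)) / (math.log(2) ** 2)
    k_from_bits = bits / capacity * math.log(2)  # (M / n) * ln(2)
    k_from_error = math.log2(1 / error_rate)     # log2(1 / P)
    return bits, k_from_bits, k_from_error


bits, k_from_bits, k_from_error = optimal_parameters(100000, 0.001)
```

Both values of k come out the same up to floating-point error, so the code's formula is the same quantity written in terms of P instead of M and n.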