jaybaird / python-bloomfilter
Scalable Bloom Filter implemented in Python
License: MIT License
Is there some way to store my full Bloom filter in a file and load it again later to check new values?
I want to use bloomfilter in scrapy_redis, which has a large number of URLs to filter.
If I use the Bloom filter for a few weeks, memory usage on Linux will blow up. So I want the Bloom filter to cover only the last 7 days (or some other period) of URLs. What should I do? Can you give me some advice?
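One way to approach the time window described above is to keep one filter per day and discard the oldest one when a new day starts. The following is only a minimal sketch under that assumption: `RotatingFilter`, `make_filter`, and `bucket_id` are hypothetical names, not part of pybloom or scrapy_redis, and a plain `set` stands in for the per-day filter so the example runs without extra packages. Any object that supports `in` and `add` could be substituted.

```python
from collections import deque


class RotatingFilter:
    """Keep one membership filter per time bucket and drop the oldest ones."""

    def __init__(self, make_filter, max_buckets=7):
        # make_filter is a hypothetical factory that returns a fresh, empty
        # filter (anything supporting `in` and `.add`).
        self.make_filter = make_filter
        # A bounded deque discards the oldest (bucket_id, filter) pair
        # automatically once max_buckets is exceeded.
        self.buckets = deque(maxlen=max_buckets)

    def _current(self, bucket_id):
        # Start a new filter whenever the bucket id (for example the date) changes.
        if not self.buckets or self.buckets[-1][0] != bucket_id:
            self.buckets.append((bucket_id, self.make_filter()))
        return self.buckets[-1][1]

    def seen(self, url, bucket_id):
        """Return True if url is in any live bucket; otherwise record it."""
        current = self._current(bucket_id)
        if any(url in f for _, f in self.buckets):
            return True
        current.add(url)
        return False


# Usage: a two-bucket window with `set` as the stand-in filter.
window = RotatingFilter(set, max_buckets=2)
window.seen("http://example.com/a", bucket_id=1)
```

Memory stays bounded by `max_buckets` filters, at the cost of checking each live filter on every lookup.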
Installing the package fails with the following error:
ERROR: Complete output from command python setup.py egg_info:
ERROR: Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-iwa1sif6/pybloom/setup.py", line 2, in <module>
    from ez_setup import use_setuptools
  File "/tmp/pip-install-iwa1sif6/pybloom/ez_setup.py", line 98
    except pkg_resources.VersionConflict, e:
                                        ^
SyntaxError: invalid syntax
----------------------------------------
ERROR: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-iwa1sif6/pybloom/
Is there any way to solve the problem?
I ran the tests on Linux and they pass.
But on Windows the bitarray library throws an exception:
======================================================================
ERROR: test_serialization (pybloom.tests.Serialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\users\taylor\src\python-bloomfilter\pybloom\tests.py", line 92, in test_serialization
    filter.tofile(f)
  File "c:\users\taylor\src\python-bloomfilter\pybloom\pybloom.py", line 251, in tofile
    else self.bitarray.tofile(f))
TypeError: open file expected
----------------------------------------------------------------------
Ran 14 tests in 0.391s
FAILED (errors=1)
Even though the file is definitely open:
<open file '<fdopen>', mode 'w+b' at 0x0000000003291420>
I realize this isn't a bug in the pybloom library, but I am filing it here since the unit tests do not pass on Windows.
Is it possible to seed the hash functions within an instance of BloomFilter?
After rendering, a website has tens of thousands of URLs or more, and these URLs are hierarchical. If a URL at one level is judged to be a duplicate, the URLs at the next level below it are ignored outright. Can a Bloom filter solve this problem? Where would I need to make changes? Can anyone suggest a good approach?
I want to dump the object and load it again next time.
But when I use pickle, the filter is always empty (.count() = 0) after reloading.
How can I dump it and keep its contents?
salts = tuple(hashfn(hashfn(pack('I', i)).digest()) for i in range_fn(num_salts))
Why hash twice here? The salt is only used once, and hashing a single time would give the same result.
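To make the question concrete, here is a minimal sketch of how salts built by the quoted line can be used, assuming (as a simplification) that `hashfn` is `hashlib.md5` and that each digest is split into four 32-bit integers. `make_salts` and `indexes` are illustrative names, not the library's API. The inner hash turns the 4-byte counter into a full-width digest, and the outer call creates a hash object pre-seeded with that digest, which is then copied for every key. Whether the inner hash is strictly required is exactly what the question asks, and the sketch does not settle it.

```python
import hashlib
from struct import pack, unpack


def make_salts(num_salts, hashfn=hashlib.md5):
    # Each salt is a hash object whose state already contains the digest
    # of a counter value (the "hash twice" step from the quoted line).
    return tuple(hashfn(hashfn(pack('I', i)).digest()) for i in range(num_salts))


def indexes(key, salts, num_bits):
    """Yield bit positions for key (bytes), four per salt with md5."""
    for salt in salts:
        h = salt.copy()        # copy so the shared salt object is not mutated
        h.update(key)
        # An md5 digest is 16 bytes, i.e. four unsigned 32-bit integers.
        for word in unpack('4I', h.digest()):
            yield word % num_bits


salts = make_salts(2)
positions = list(indexes(b"example-key", salts, num_bits=1000))
```

Because each lookup works on a copy, the same salts produce the same positions for the same key every time.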
Possible to add a tag for version 2.0? Would like to add this project to MacPorts, but they require fetching tagged versions and not directly from head.
Can I use part of the code (make_hashfuncs) in my project?
Getting an exception using this library...
...
  File "/home/adam/v/tf/evenly/ws2/app/bloom.py", line 25, in get_bloom_filters
    green_bloom = ScalableBloomFilter.fromfile(infile)
  File "/home/adam/v/tf/evenly/ws2/buildout/eggs/pybloom_live-3.0.0-py3.7.egg/pybloom_live/pybloom.py", line 368, in fromfile
    filter.filters.append(BloomFilter.fromfile(f, fl))
  File "/home/adam/v/tf/evenly/ws2/buildout/eggs/pybloom_live-3.0.0-py3.7.egg/pybloom_live/pybloom.py", line 216, in fromfile
    if filter.num_bits != filter.bitarray.length() and \
NotImplementedError: self.length() has been deprecated since 1.5.1, and was removed in 2.0.0. Use len(self) instead.
I understand that the NotImplementedError comes from bitarray, although the traceback doesn't make that very clear. It looks like it is raised from a C file: https://github.com/ilanschnell/bitarray/blob/cdb9b11cb144b373f49ac5b5b9015f1bfa2982d7/bitarray/_bitarray.c#L644
Hi, I'm wondering why ratio is used in ScalableBloomFilter. It seems that the first filter has a different error rate from the remaining filters: in the code, the first filter has an error rate of error_rate * (1 - ratio), and the remaining filters have an error rate of error_rate * ratio.
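A small numeric sketch may help show what such a scheme achieves. It assumes that error_rate * ratio for the later filters refers to the previous filter's error rate, so the per-filter rates form a geometric series; that reading, and the helper names `filter_error_rates` and `compound_error`, are assumptions made for illustration. Under that assumption the rates sum to at most the requested error_rate, which is why the first filter starts at error_rate * (1 - ratio).

```python
def filter_error_rates(error_rate, ratio, n):
    """Per-filter error rates: P*(1-r), P*(1-r)*r, P*(1-r)*r**2, ..."""
    rates = [error_rate * (1.0 - ratio)]
    for _ in range(n - 1):
        rates.append(rates[-1] * ratio)
    return rates


def compound_error(rates):
    """Probability that at least one of the filters reports a false positive."""
    no_false_positive = 1.0
    for r in rates:
        no_false_positive *= (1.0 - r)
    return 1.0 - no_false_positive


rates = filter_error_rates(error_rate=0.001, ratio=0.9, n=50)
total = compound_error(rates)   # stays below the requested 0.001
```

The geometric sum P*(1-r)*(1 + r + r**2 + ...) converges to P, so adding more filters never pushes the overall rate past the target.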
I want to remove some elements, so that the next time I add() them it returns False.
Is there a built-in function to do this?
Can you submit 2.0 to PyPI? It seems that 1.1 is the latest: https://pypi.python.org/pypi/pybloom
Hello.
Is there a way to load a Bloom Filter from bytes instead of loading from a file? The advantage is that we do not need to download a file to load a Bloom Filter.
Thank you in advance.
Once you deserialize a serialized BloomFilter object, the length of self.bitarray might differ because of added padding.
https://github.com/jaybaird/python-bloomfilter/blob/master/pybloom/pybloom.py#L271
Here difference in length due to the trailing bits is ignored.
No such accounting for differing bitarray lengths is done here https://github.com/jaybaird/python-bloomfilter/blob/master/pybloom/pybloom.py#L224 or here https://github.com/jaybaird/python-bloomfilter/blob/master/pybloom/pybloom.py#L238 . There, the bitarray union and intersection will fail if the bitarray.length() values are different. The lengths may differ because of a round trip through serialization and deserialization, even when the capacity and error rates are the same.
I think the correct thing to do here is to strip off the padding in fromfile to ensure that the bitarray representation is exactly the same.
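The padding effect can be illustrated without the library. This is a rough sketch that uses a plain list of bits as a stand-in for the bitarray and assumes that serialization stores whole bytes. `padded_bit_length` and `strip_padding` are hypothetical helpers, not pybloom functions.

```python
def padded_bit_length(num_bits):
    """Number of bits occupied once num_bits are stored in whole bytes."""
    return 8 * ((num_bits + 7) // 8)


def strip_padding(bits, num_bits):
    """Drop trailing padding so the sequence has exactly num_bits entries."""
    return bits[:num_bits]


num_bits = 13
original = [0] * num_bits
# After a round trip through a byte-oriented file the sequence is longer.
reloaded = original + [0] * (padded_bit_length(num_bits) - num_bits)
restored = strip_padding(reloaded, num_bits)
```

With the padding removed, `restored` has the same length as `original`, so element-wise operations between the two no longer see a length mismatch.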
First, thanks for your Bloom filter; it's easy to use. I'm interested in "make_hashfuncs". Can you point me to an article that explains how the method is designed?
Why is fmt_code, chunk_size = 'Q', 8 chosen when num_bits >= (1 << 31), and how are total_hash_bits and the other values calculated?
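The following sketch walks through one plausible version of that selection logic. Only the 'Q', 8 branch at num_bits >= (1 << 31) comes from the question above; the other thresholds, the digest-size cut-offs, and the name `choose_layout` are assumptions reconstructed for illustration and should be checked against the actual source. The idea is that each hash digest is cut into fixed-width unsigned integers ('H' = 2 bytes, 'I' = 4 bytes, 'Q' = 8 bytes), the width must be large enough to address num_bits positions, and total_hash_bits is the number of digest bits needed to produce num_slices such integers.

```python
import hashlib


def choose_layout(num_slices, num_bits):
    # Pick an integer width wide enough to index num_bits positions.
    if num_bits >= (1 << 31):
        fmt_code, chunk_size = 'Q', 8      # 64-bit unsigned integers
    elif num_bits >= (1 << 15):
        fmt_code, chunk_size = 'I', 4      # 32-bit unsigned integers
    else:
        fmt_code, chunk_size = 'H', 2      # 16-bit unsigned integers

    # Bits of hash output needed for num_slices integers of that width.
    total_hash_bits = 8 * num_slices * chunk_size

    # Assumed cut-offs: use the smallest digest that covers the requirement.
    if total_hash_bits > 384:
        hashfn = hashlib.sha512
    elif total_hash_bits > 256:
        hashfn = hashlib.sha384
    elif total_hash_bits > 160:
        hashfn = hashlib.sha256
    elif total_hash_bits > 128:
        hashfn = hashlib.sha1
    else:
        hashfn = hashlib.md5

    # Integers obtainable from one digest, and how many salted digests
    # are needed to reach num_slices (ceiling division).
    ints_per_digest = hashfn().digest_size // chunk_size
    num_salts = -(-num_slices // ints_per_digest)
    return fmt_code, chunk_size, hashfn().name, num_salts


layout = choose_layout(num_slices=10, num_bits=1000)
```

For 10 slices over 1000 bits this selects 16-bit integers and needs 160 bits of hash output, which a single SHA-1 digest provides.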
Line 71 of pybloom.py should test whether the key is unicode rather than whether it is not unicode. If the key is not unicode, encoding it as utf-8 will first attempt to decode the string using the default encoding. Usually this is ascii, so if the string is not ascii a UnicodeDecodeError will be raised. Similarly, if the key is unicode, str(key) tries to encode the key in the default encoding. If the key contains non-ascii characters, this will raise UnicodeEncodeError.
If I have a global Bloom filter, is there a way to empty it out (set all the bits to zero)? Thanks
I can't install pybloom with pip3.
I am not particularly sure that the backend algorithm can reliably support this, but it would be nice if the BloomFilter implementation could support the operations of the set primitive. Specifically:
>>> a = BloomFilter(10)
>>> a.add(1)
>>> b = BloomFilter(10)
>>> b.add(2)
>>> c = a | b
>>> 1 in c
True
>>> 2 in c
True
>>> 3 in c
False
I find sets to be incredibly useful in general coding practice, as it allows one to work naturally on data. Having a proxy to a set of data, without having to load said data, would increase the size of problems that one could easily work on, especially when one is more interested in set membership than the actual members.
This is just a feature suggestion (not even a request really), so no hurry. If I have time, I might even write a patch. I am unfamiliar with the hash function generation and the nuances there, so that's why I haven't jumped right into it.
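As a rough illustration of why this can work, here is a self-contained toy filter that stores its bits in a Python integer. `TinyBloom` is a hypothetical class written for this sketch, not the pybloom implementation. When two filters share the same size and hash functions, the union is the bitwise OR of their bit arrays, and every element of either input is guaranteed to test as present in the result.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter backed by an int bitmask (illustration only)."""

    def __init__(self, num_bits=1024, num_hashes=4, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits

    def _positions(self, item):
        data = repr(item).encode('utf-8')
        for i in range(self.num_hashes):
            # Prefix a one-byte counter to derive several hash functions.
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], 'big') % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

    def __or__(self, other):
        # Union is only meaningful for identically configured filters.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share num_bits and num_hashes")
        return TinyBloom(self.num_bits, self.num_hashes, self.bits | other.bits)


a = TinyBloom()
a.add(1)
b = TinyBloom()
b.add(2)
c = a | b
```

Intersection by bitwise AND is less clean: the result still contains every common element, but its false-positive rate can be higher than that of a filter built directly from the intersected data.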
I am very confused about num_slices in the code. The usual formula for the number of hash functions is num_slices = (M/n) * ln(2); however, the code uses num_slices = log2(1/P).
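The two expressions agree when M is chosen optimally for the target error rate, which a quick numeric check can confirm. The sketch below uses the standard sizing result M = -n * ln(P) / ln(2)^2; substituting it into (M/n) * ln(2) gives -ln(P) / ln(2) = log2(1/P). The function names are illustrative only.

```python
import math


def optimal_bits(n, p):
    """Standard optimal bit count for n items at false-positive rate p."""
    return -n * math.log(p) / (math.log(2) ** 2)


def slices_from_bits(m, n):
    """Number of hash functions from the bits-per-item form: (m/n) * ln 2."""
    return (m / n) * math.log(2)


def slices_from_error(p):
    """Number of hash functions from the error-rate form: log2(1/p)."""
    return math.log2(1.0 / p)


n = 1000
for p in (0.1, 0.01, 0.001):
    m = optimal_bits(n, p)
    # Both forms give the same value up to floating-point error.
    assert abs(slices_from_bits(m, n) - slices_from_error(p)) < 1e-9
```

So a formula written in terms of P alone is the same rule with the optimal M already substituted in.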