splatlab / cqf Goto Github PK
View Code? Open in Web Editor NEWA General-Purpose Counting Filter: Counting Quotient Filter
License: BSD 3-Clause "New" or "Revised" License
A General-Purpose Counting Filter: Counting Quotient Filter
License: BSD 3-Clause "New" or "Revised" License
Is is possible to modify https://github.com/splatlab/cqf/blob/master/src/hashutil.c#L132 to add a uint128_t version ?
Hi,
I am trying to create cqf having key remainders of size less than 8 bits. From the usage example(main.c), the number of hash bits equals 8+ qbits. Changing the value 8 to other value leads to assertion error("(BITS_PER_SLOT == 0 || BITS_PER_SLOT == qf->bits_per_slot)"). However, Changing BITS_PER_SLOT in gqf.h from 8 to 0 bypass the assertion and the code passes the testcase written in main.c.
I want to make sure is it safe to set BITS_PER_SLOT to 0. I am curious if you did enough tests for this situation.
We are trying to make use of this implementation in khmer (specifically dib-lab/khmer#1675). It is pretty early on and the PR takes a lot of shortcuts at the moment.
One thing I've run into and would appreciate your help on is the interplay between qf_init
's nslots
parameter and the maximum value the key
in qf_insert
can take. I assumed that there was no limit on key
. However I end up with a segfault when using the following setup: nslots=2**8
and K=28
(length of the kmers). For example inserting "TAGCCTGGTTACTACGCCGAATTATGAT" (key = 6827347471803364) leaves me with:
Program received signal SIGSEGV, Segmentation fault.
offset_lower_bound (slot_index=26669326061731, qf=0x7c6e30) at gqf.c:417
417 const uint64_t boffset = b->offset;
this is the stack trace:
(gdb) where
#0 offset_lower_bound (slot_index=26669326061731, qf=0x7c6e30) at gqf.c:417
#1 is_empty (slot_index=26669326061731, qf=0x7c6e30) at gqf.c:429
#2 insert1 (hash=<optimized out>, qf=0x7c6e30) at gqf.c:1038
#3 qf_insert (qf=qf@entry=0x7c6e30, key=key@entry=6827347471803364, value=value@entry=0, count=count@entry=1) at gqf.c:1453
#4 0x00007ffff50b7612 in khmer::QFStorage::add (this=0x7c6e20, khash=6827347471803364) at lib/storage.hh:432
4 and beyond is inside khmer.
Because there are quite a few moving parts in integrating CQF with khmer I wanted to ask if this should work in principle or not.
(K=8 with nslots = 2**24 seems to work)
Hi, we are adding CQF support to khmer and one thing we've run into is that after inserting some number of unique kmers we hit this assertion: https://github.com/splatlab/cqf/blob/master/gqf.c#L487
For example when you create a QF with qf_init(cf, (1ULL << size), size+8, 0)
and size=7
. I then attempt to insert 400 random 20-mers. The assert gets triggered when I try to insert kmer 242.
I think this is because the QF is "full". Is that right? I'm asking because coming from a world of bloom filters (BF) you can keep adding and adding to your BF and while eventually all bits will be set to 1, you can always add more (even if it makes little sense).
For khmer this is nice. We can guarantee to the user that the program will not run out of memory (all allocated up front) and we can warn them if we think that they made their BF too small. But you can leave your program running for hours and get "some kind of result" instead of "ah we used up all the memory and crashed please start from scratch".
So finally my question: is there a way for me to check how full the CQF is and if adding one more kmer will fail or not? Or is there a way to use the CQF so that it will let me keep adding kmers until I go blue in the face regardless of how full it is/overflow the counter?
(I think you can reproduce this with squeakr: ./main 0 10 1 test.fastq
though for me that segfaults and doesn't print the assert it hit so I am not 100% sure.)
Very exciting work!
In comparing the Counting Quotient Filter to counting bloom filters and other probabilistic data structures, can you provide any numbers?
For example, the Cuckoo Filter paper (Table 1) lists the number of cache misses per lookup and space requirements as compared to the basic Bloom Filter, as well as whether or not deletion is supported. Of course, your work is more comparable to only the counting Bloom filter. (And, perhaps, comparable to HyperLogLog, which, while not optimized for it, provides a AMQ interface with counting.)
Thanks!
I ran into two problems with GCC 6.2 on our HPC --
cc1plus: error: invalid option argument ‘-Ofast’
cc1plus: error: unrecognized command line option "-std=c++11"
and
gqf.c: In function ‘void qf_multi_merge(QF**, int, QF*)’:
gqf.c:1593: error: ‘UINT64_MAX’ was not declared in this scope
and eventually figured out this set of fixes:
I don't think anything needs to be done here, and this may be for an old version of the CQF anyway (we're on github.com/dib-lab/khmer commit 1c2810cb, and the CQF code is under third-party/cqf/ FYI), but I wanted to link the problem and its fix into this repo.
I've been experiencing segfaults in shift_remainders(), and it seems to be happening because the value of empty_index is less than start_index, when called from line 1372 (from within insert1()).
I'm just getting started with cqf and it's quite possible that I've misunderstood something in the API. The keys I'm inserting to the QFs should all be within the allowable range, and I am not storing values, just keys. I've enabled auto resizing, and these crashes are happening during qf_resize_malloc.
Am I doing something wrong, or is this a bug?
Using homebrew installed openssl,
% make NH=1 CXXFLAGS='-I /usr/local/Cellar/openssl/1.0.2j/include' LDFLAGS='-L/usr/local/Cellar/openssl/1.0.2j/lib/ -lcrypto'
% ./main 24
worked!
I'm running on Mac OS X 10.12.3, Mac OS X Sierra:
% uname -a
Darwin Titus-MacBook-Air-3.local 16.4.0 Darwin Kernel Version 16.4.0: Thu Dec 22 22:53:21 PST 2016; root:xnu-3789.41.3~3/RELEASE_X86_64 x86_6
I followed the instructions
make
./test 24
and got
Please specify the log of the number of slots and the number of remainder bits in the CQF.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.