jermp / pthash
Fast and compact minimal perfect hash functions in C++.
License: MIT License
I'm trying to build maps using build, but the process of choosing the encoder is extremely frustrating. The help text suggests looking at the source file, which contains no documentation. I tried encoder names by trial and error (e.g., elias_fano), but, for example, compact_compact doesn't work. The possible values of -e should be documented.
The following line in util.hpp causes a compile error with gcc 10.2.1:
#define MAX_BUCKET_SIZE static_cast<bucket_size_type>(100)
The error message is: expected unqualified-id before 'static_cast'
I found that replacing that line with the following fixes the problem:
constexpr bucket_size_type MAX_BUCKET_SIZE = 100;
PTHash has great performance and is header-only, so it's easy to use. However, it currently only provides the hash function: is there a map implementation based on PTHash? For example, a map with put() and get().
Thank you.
This issue says that pthash has no map implementation, but it seems that we can use pthash as the hash function of std::unordered_map:
std::unordered_map<uint64_t, uint64_t, pthash_type> pt_umap(keys.size(), f);
To my surprise, pt_umap even performs worse than the default hash function. Is there a problem with the use of the above code?
Here is my benchmark code:
#include <cassert>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

#include "pthash/include/pthash.hpp"
#include "pthash/src/util.hpp"

using namespace std::chrono;

int main() {
    /* Generate 10M random 64-bit keys as input data. */
    static const uint64_t num_keys = 10000000;
    static const uint64_t seed = 1234567890;
    std::cout << "generating input data..." << std::endl;
    std::vector<uint64_t> keys = pthash::distinct_keys<uint64_t>(num_keys, seed);
    assert(keys.size() == num_keys);

    /* Set up a build configuration. */
    pthash::build_configuration config;
    config.c = 7.0;
    config.alpha = 0.99;
    config.minimal_output = true;  // mphf
    config.verbose_output = false;

    /* Declare the PTHash function. */
    typedef pthash::single_phf<pthash::murmurhash2_64,   // base hasher
                               pthash::compact_compact,  // encoder type
                               true                      // minimal
                               > pthash_type;
    pthash_type f;

    /* Build the function in internal memory. */
    std::cout << "building the function..." << std::endl;
    f.build_in_internal_memory(keys.begin(), keys.size(), config);

    // baseline: unordered_map with the default hash function
    std::unordered_map<uint64_t, uint64_t> normal;
    for (uint64_t x : keys) normal[x] = x;
    int collision = 0;
    for (size_t bucket = 0; bucket < normal.bucket_count(); bucket++) {
        if (normal.bucket_size(bucket) > 1) collision++;
    }
    std::cout << "normal collision=" << collision << ", total=" << normal.bucket_count() << std::endl;

    // plain vector indexed with pthash
    std::vector<uint64_t> pt_map(keys.size());
    for (uint64_t x : keys) pt_map[f(x)] = x;

    // unordered_map with pthash as the hash function
    std::unordered_map<uint64_t, uint64_t, pthash_type> pt_umap(keys.size(), f);
    for (uint64_t x : keys) pt_umap[x] = x;
    collision = 0;
    for (size_t bucket = 0; bucket < pt_umap.bucket_count(); bucket++) {
        if (pt_umap.bucket_size(bucket) > 1) collision++;
    }
    std::cout << "ptu collision=" << collision << ", total=" << pt_umap.bucket_count() << std::endl;

    // correctness
    for (size_t i = 0; i < keys.size(); i += 100) {
        uint64_t key = keys[i];
        assert(pt_map[f(key)] == normal[key]);
        assert(pt_umap[key] == normal[key]);
    }

    // performance (durations are reported in microseconds)
    auto start = system_clock::now();
    for (size_t i = 0; i < keys.size(); i += 100) {
        auto tmp = normal[keys[i]];
        (void)tmp;
    }
    std::cout << "normal cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "us" << std::endl;

    start = system_clock::now();
    for (size_t i = 0; i < keys.size(); i += 100) {
        auto tmp = pt_map[f(keys[i])];
        (void)tmp;
    }
    std::cout << "pt cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "us" << std::endl;

    start = system_clock::now();
    for (size_t i = 0; i < keys.size(); i += 100) {
        auto tmp = pt_umap[keys[i]];
        (void)tmp;
    }
    std::cout << "ptu cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "us" << std::endl;
}
The results are:
generating input data...
building the function...
normal collision=2327031, total=13169977
ptu collision=0, total=10000019
normal cost=59043us
pt cost=28636us
ptu cost=60522us
Hi @jermp,
I'm trying to put piscem up on bioconda, and the Linux build works fine. The OSX build, however, gives this error (https://dev.azure.com/bioconda/bioconda-recipes/_build/results?buildId=24063&view=logs&j=1b052f1d-4456-52f0-9d43-71c4c5bd734d&t=edc48dcd-1fc2-5e3b-7036-7be9cea00123&l=597). It seems to be caused by clang enforcing stricter conversion rules than g++. Any chance of fixing this upstream?
Thanks!
Rob
Hello again @jermp
What could be the reason for pthash getting stuck while creating an MPHF (see the log below)?
c = 6
alpha = 0.94
num_keys = 19
table_size = 20
num_buckets = 27
== map+sort took: 0 seconds
== merged 100% pairs
== merge+check took: 0 seconds
== mapping+ordering took 0 seconds
== max bucket size = 6
== lambda = 0.707988
== lambda_1 = 1.41598
== lambda_2 = 0.404565
== num_buckets of size 6 = 1 (estimated with Poisson = 0)
== num_buckets of size 5 = 0 (estimated with Poisson = 0)
== num_buckets of size 4 = 0 (estimated with Poisson = 0)
== num_buckets of size 3 = 0 (estimated with Poisson = 1)
== num_buckets of size 2 = 5 (estimated with Poisson = 3)
== num_buckets of size 1 = 3 (estimated with Poisson = 7)
2023-10-16 15:05:45: search starts
It could be that the configuration is not suitable for such a small number of keys. If that is the case, what do you recommend?
Thanks
Hello, I'm trying to use pthash on ~34*10^9 strings (~50 bytes each), and I'm hitting the runtime error in the title when using the main executable like this:
$ ./build --minimal -c 5 -a 0.99 -e dictionary_dictionary -n 34121566250 -i graph.nodes.csv -o graph.nodes.mph -t 48 --external -m 2048 --verbose --check --lookup
2023-11-25 16:39:19: construction starts
terminate called after throwing an instance of 'std::runtime_error'
what(): using too many buckets: change bucket_id_type to uint64_t or use a smaller c
It happens for c >= 5, but not with c = 4.
Does one need to recompile with the suggested type change in order to use a higher c on such an input set?
Thanks a lot for pthash, it's amazing :-)
Hello!
Very nice project and research (I don't fully understand it, but it seems very interesting). I wanted to port this to Rust, but it seems like it will be a lot of work.
I thought it might be easier if it had a C API, so I could call it via FFI from Rust.
Is this possible to do?
I can work on it if you give me some direction.
Hi,
I was using additive displacement in the play branch; however, it is generating duplicate hash values.
You can download the keyset that I was using from here. The file I am using is kmers_64bits_5776.txt.
What could be the issue?
Best,
With a pthash clone at commit 0d09102 I'm now getting the following error:
$ ./pthash-build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 34121566250 \
-p 3412 -t 96 --external -m 1024 -i <(zstdcat graph.nodes.csv.zst) -o graph.nodes.mph \
-d ./tmp.pthash-nodes.IH4bXCZpg9 \
--verbose
2023-11-29 07:01:49: construction starts
num_partitions 3412
using 1024 GB of RAM
== partitioned 100% keys
processing 3412/3412 partitions...
terminate called after throwing an instance of 'pthash::seed_runtime_error'
what(): seed did not work
$
I've tried two different datasets, encountering the same error.
I've also tried not using process substitution for -i, i.e., passing a regular file to it rather than the zstdcat substitution, and it does not fail in that case, which is quite weird. Note that I did not use the new --mmap flag, so I'm not sure what is different between the two cases from the point of view of ./build.
Make the description in the README consistent with the parameters in that branch.
Hi @jermp ,
I think pthash/include/encoders/util.hpp is missing the cstdint and cassert headers: g++ 12.3 complains about the uintX_t definitions and the asserts.
As suggested by @zacchiro, we should support process substitution to read keys from stdin.
Once done, we can specify, e.g., -i <(cat file) or -i <(zcat file.gz), etc.
@roberto-trani suggested to use the same trick as used in grep
(https://man7.org/linux/man-pages/man1/grep.1.html)
where
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. If this option
is used multiple times or is combined with the -e
(--regexp) option, search for all patterns given. The
empty file contains zero patterns, and therefore matches
nothing. If FILE is - , read patterns from standard
input.
The only caveat is that reading from std::cin prevents iterating twice over the stream; hence we can only build the function but cannot check its correctness nor benchmark lookups (because those require iterating again over the keys from the stream).
Hello @jermp @ByteHamster
I tried pthash on artificial data and it worked. However, when I moved to the hash values of k-mers, it raises an error related to the seed (see below).
c = 6
alpha = 0.94
num_keys = 21817
table_size = 23209
num_buckets = 9083
== map+sort took: 0.002 seconds
== merged 0% pairs
terminate called after throwing an instance of 'pthash::seed_runtime_error'
what(): seed did not work
I tried different values of the seed, but none of them work.
What could be the problem?
It is handy to pass "streaming" content as input to build -i, e.g., by using shell process substitution like this: build -i <(zstdcat foo.txt.zst).
Unfortunately, that does not work when --external is used.
Apparently, in that case (and only in that case) the input file passed to -i is mmap-ed, which obviously fails when the input is not a regular file (a FIFO, in the case of process substitution).
Would it be possible to either avoid mmap-ing the input file even when --external is used or, alternatively, to have an option to inhibit that selectively?
Thanks!
The following code removes duplicates but doesn't shrink the vector, so the leftover elements past the iterator that std::unique returns are still there:
Line 92 in 2607873
The line should probably be:
keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
With C++14, I encountered the following error when compiling the example code:
pthash/include/builders/util.hpp:148:24: error: use of class template 'payload_iterator' requires template arguments
payload_iterator(pairs.begin() + i - bucket_size));
^
I fixed this error by using C++17.
Is there any way to use pthash under the C++14 standard? Thanks in advance!
I'm getting the following runtime error with the main build executable:
== partitioned 0% keys
terminate called after throwing an instance of 'std::runtime_error'
what(): blank line detected after reading 0 non-empty lines
This is a real dataset where there is indeed a valid key which happens to be the empty string.
Is there any particular reason for rejecting the empty string?
I've used other MPHFs (specifically: GOV) on this dataset in the past, and I don't see why it shouldn't be possible to use PTHash.
Thanks in advance,
Cheers