jermp / pthash
Fast and compact minimal perfect hash functions in C++.
License: MIT License
I'm trying to build maps using build, but the process of choosing the encoder is extremely frustrating. The help text suggests looking at the source file, which contains no documentation. I tried encoder names by trial and error (e.g., elias_fano), but, for example, compact_compact doesn't work. The possible values of -e should be documented.
The following line in util.hpp causes a compile error with gcc 10.2.1:
#define MAX_BUCKET_SIZE static_cast<bucket_size_type>(100)
The error message is: expected unqualified-id before 'static_cast'
I found that replacing that line with the following fixes the problem:
constexpr bucket_size_type MAX_BUCKET_SIZE = 100;
PTHash has great performance and is header-only, so it's easy to use. However, it currently only provides the hash function: is there a map implementation based on PTHash? For example, a map with put() and get().
Thank you.
This issue says that pthash has no map implementation, but it seems that we can use pthash as the hash function of std::unordered_map:
std::unordered_map<uint64_t, uint64_t, pthash_type> pt_umap(keys.size(), f);
To my surprise, pt_umap even performs worse than the default hash function. Is there a problem with the use of the above code?
Here is my benchmark code:
#include <cassert>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

#include "pthash/include/pthash.hpp"
#include "pthash/src/util.hpp"

using namespace std::chrono;

int main() {
    /* Generate 10M random 64-bit keys as input data. */
    static const uint64_t num_keys = 10000000;
    static const uint64_t seed = 1234567890;
    std::cout << "generating input data..." << std::endl;
    std::vector<uint64_t> keys = pthash::distinct_keys<uint64_t>(num_keys, seed);
    assert(keys.size() == num_keys);

    /* Set up a build configuration. */
    pthash::build_configuration config;
    config.c = 7.0;
    config.alpha = 0.99;
    config.minimal_output = true;  // mphf
    config.verbose_output = false;

    /* Declare the PTHash function. */
    typedef pthash::single_phf<pthash::murmurhash2_64,   // base hasher
                               pthash::compact_compact,  // encoder type
                               true                      // minimal
                               > pthash_type;
    pthash_type f;

    /* Build the function in internal memory. */
    std::cout << "building the function..." << std::endl;
    f.build_in_internal_memory(keys.begin(), keys.size(), config);

    // baseline: unordered_map with the default hash function
    std::unordered_map<uint64_t, uint64_t> normal;
    for (uint64_t x : keys) normal[x] = x;
    int collision = 0;
    for (size_t bucket = 0; bucket < normal.bucket_count(); bucket++) {
        if (normal.bucket_size(bucket) > 1) collision++;
    }
    std::cout << "normal collision=" << collision << ", total=" << normal.bucket_count() << std::endl;

    // plain vector indexed with pthash
    std::vector<uint64_t> pt_map(keys.size());
    for (uint64_t x : keys) pt_map[f(x)] = x;

    // unordered_map with pthash as the hash function
    std::unordered_map<uint64_t, uint64_t, pthash_type> pt_umap(keys.size(), f);
    for (uint64_t x : keys) pt_umap[x] = x;
    collision = 0;
    for (size_t bucket = 0; bucket < pt_umap.bucket_count(); bucket++) {
        if (pt_umap.bucket_size(bucket) > 1) collision++;
    }
    std::cout << "ptu collision=" << collision << ", total=" << pt_umap.bucket_count() << std::endl;

    // correctness
    for (size_t i = 0; i < keys.size(); i += 100) {
        uint64_t key = keys[i];
        assert(pt_map[f(key)] == normal[key]);
        assert(pt_umap[key] == normal[key]);
    }

    // performance (durations are reported in microseconds)
    auto start = system_clock::now();
    for (size_t i = 0; i < keys.size(); i += 100) {
        auto tmp = normal[keys[i]];
        (void)tmp;
    }
    std::cout << "normal cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "us" << std::endl;

    start = system_clock::now();
    for (size_t i = 0; i < keys.size(); i += 100) {
        auto tmp = pt_map[f(keys[i])];
        (void)tmp;
    }
    std::cout << "pt cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "us" << std::endl;

    start = system_clock::now();
    for (size_t i = 0; i < keys.size(); i += 100) {
        auto tmp = pt_umap[keys[i]];
        (void)tmp;
    }
    std::cout << "ptu cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "us" << std::endl;
}
The results are:
generating input data...
building the function...
normal collision=2327031, total=13169977
ptu collision=0, total=10000019
normal cost=59043us
pt cost=28636us
ptu cost=60522us
Hi @jermp,
I'm trying to put piscem up on bioconda, and the Linux build works fine. The OSX build, however, gives this error (https://dev.azure.com/bioconda/bioconda-recipes/_build/results?buildId=24063&view=logs&j=1b052f1d-4456-52f0-9d43-71c4c5bd734d&t=edc48dcd-1fc2-5e3b-7036-7be9cea00123&l=597). It seems to be caused by clang enforcing stricter conversion rules than g++. Any chance of fixing this upstream?
Thanks!
Rob
Hello again @jermp
What could be the reason for pthash getting stuck while creating an MPHF (see the log below)?
c = 6
alpha = 0.94
num_keys = 19
table_size = 20
num_buckets = 27
== map+sort took: 0 seconds
== merged 100% pairs
== merge+check took: 0 seconds
== mapping+ordering took 0 seconds
== max bucket size = 6
== lambda = 0.707988
== lambda_1 = 1.41598
== lambda_2 = 0.404565
== num_buckets of size 6 = 1 (estimated with Poisson = 0)
== num_buckets of size 5 = 0 (estimated with Poisson = 0)
== num_buckets of size 4 = 0 (estimated with Poisson = 0)
== num_buckets of size 3 = 0 (estimated with Poisson = 1)
== num_buckets of size 2 = 5 (estimated with Poisson = 3)
== num_buckets of size 1 = 3 (estimated with Poisson = 7)
2023-10-16 15:05:45: search starts
It could be that the configuration is not suitable for such a small number of keys. If that is the case, what do you recommend?
Thanks
Hello, I'm trying to use pthash on ~34*10^9 strings (~50 bytes each), and I'm hitting the runtime error in the title when using the main executable like this:
$ ./build --minimal -c 5 -a 0.99 -e dictionary_dictionary -n 34121566250 -i graph.nodes.csv -o graph.nodes.mph -t 48 --external -m 2048 --verbose --check --lookup
2023-11-25 16:39:19: construction starts
terminate called after throwing an instance of 'std::runtime_error'
what(): using too many buckets: change bucket_id_type to uint64_t or use a smaller c
It happens for c >= 5, but not with c = 4.
Does one need to recompile with the suggested type change in order to use a higher c on such an input set?
Thanks a lot for pthash, it's amazing :-)
Hello!
Very nice project and research (I don't fully understand it, but it seems very interesting). I wanted to port this to Rust, but it seems like it will be a lot of work.
I thought it might be easier if it had a C API, so I could call it via FFI from Rust.
Is this possible to do?
I can work on it if you give me some direction.
Hi,
I was using additive displacement in the play branch; however, it is generating duplicate hash values.
You can download the keyset that I was using from here. The file I am using is kmers_64bits_5776.txt.
What could be the issue?
Best,
With a pthash clone at commit 0d09102 I'm now getting the following error:
$ ./pthash-build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 34121566250 \
-p 3412 -t 96 --external -m 1024 -i <(zstdcat graph.nodes.csv.zst) -o graph.nodes.mph \
-d ./tmp.pthash-nodes.IH4bXCZpg9 \
--verbose
2023-11-29 07:01:49: construction starts
num_partitions 3412
using 1024 GB of RAM
== partitioned 100% keys
processing 3412/3412 partitions...
terminate called after throwing an instance of 'pthash::seed_runtime_error'
what(): seed did not work
$
I've tried two different datasets, encountering the same error.
I've also tried not using process substitution for -i, i.e., passing a regular file to it rather than the zstdcat substitution, and it does not fail in that case, which is quite weird. Note that I did not use the new --mmap flag, so I'm not sure what is different between the two cases from the point of view of ./build.
Make the description in the README consistent with the parameters in that branch.
Hi @jermp ,
I think pthash/include/encoders/util.hpp is missing the cstdint and cassert headers: g++ 12.3 complains about the uintX_t definitions and the asserts.
As suggested by @zacchiro, we should support process substitution to read keys from stdin.
Once done, we can specify, e.g., -i <(cat file) or -i <(zcat file.gz), etc.
@roberto-trani suggested to use the same trick as used in grep
(https://man7.org/linux/man-pages/man1/grep.1.html)
where
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. If this option
is used multiple times or is combined with the -e
(--regexp) option, search for all patterns given. The
empty file contains zero patterns, and therefore matches
nothing. If FILE is - , read patterns from standard
input.
The only caveat is that reading from std::cin prevents iterating twice over the stream; hence we can only build the function but cannot check its correctness nor benchmark lookups (because those require iterating again over the keys from the stream).
Hello @jermp @ByteHamster
I tried pthash on artificial data and it worked. However, when I moved to the hash values of k-mers, it raises an error related to the seed (see below).
c = 6
alpha = 0.94
num_keys = 21817
table_size = 23209
num_buckets = 9083
== map+sort took: 0.002 seconds
== merged 0% pairs
terminate called after throwing an instance of 'pthash::seed_runtime_error'
what(): seed did not work
I tried different values of the seed, but none of them work.
What could be the problem?
It is handy to pass "streaming" content as input to build -i, e.g., by using shell process substitution like this: build -i <(zstdcat foo.txt.zst).
Unfortunately, that does not work when --external is used.
Apparently, in that case (and only in that case) the input file passed to -i is mmap-ed, which obviously fails when the input is not a regular file (a FIFO, in the case of process substitution).
Would it be possible to either avoid mmap-ing the input file even when --external is used or, alternatively, to have an option to inhibit that selectively?
Thanks!
The following code removes duplicates but doesn't shrink the vector, so the leftover elements past the iterator that std::unique returns are still there:
Line 92 in 2607873
The line should probably be:
keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
With C++14, I encountered the following error when compiling the example code:
pthash/include/builders/util.hpp:148:24: error: use of class template 'payload_iterator' requires template arguments
payload_iterator(pairs.begin() + i - bucket_size));
^
I fixed this error by using C++17.
Is there any way to use pthash under the C++14 standard? Thanks in advance!
I'm getting the following runtime error with the main build executable:
== partitioned 0% keys
terminate called after throwing an instance of 'std::runtime_error'
what(): blank line detected after reading 0 non-empty lines
This is a real dataset where there is indeed a valid key which happens to be the empty string.
Is there any particular reason for rejecting the empty string?
I've used other MPHFs (specifically: GOV) on this dataset in the past, and I don't see why it shouldn't be possible to use PTHash.
Thanks in advance,
Cheers