
pthash's People

Contributors: bytehamster, jermp, progval, roberto-trani

pthash's Issues

Encoders must be documented in the help text

I'm trying to build maps using build, but choosing the encoder is extremely frustrating. The help text suggests looking at the source file, which contains no documentation. I tried structure names by trial and error (e.g., elias_fano), but, for example, compact_compact doesn't work. The possible values of -e should be documented.

Compile issue in gcc 10.2.1

The following line in util.hpp causes a compile error in gcc 10.2.1:
#define MAX_BUCKET_SIZE static_cast<bucket_size_type>(100)
The error message is: expected unqualified-id before ‘static_cast’

I found that replacing that line with the following fixes the problem:
constexpr bucket_size_type MAX_BUCKET_SIZE = 100;

Is there a map implementation of PTHash?

PTHash has great performance and is header-only, so it is easy to use. However, it currently provides only the hash function. Is there a map implementation based on PTHash, for example a map with put() and get()?
Thank you.

Use pthash with unordered_map

This issue says that pthash has no map implementation, but it seems that we can use pthash as the hash function of std::unordered_map:

std::unordered_map<uint64_t, uint64_t, pthash_type> pt_umap(keys.size(), f);

To my surprise, pt_umap performs even worse than the default hash function. Is there a problem with the code above?
Here is my benchmark code:

#include <cassert>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

#include "pthash/include/pthash.hpp"
#include "pthash/src/util.hpp"

using namespace std::chrono;

int main()
{
    /* Generate 10M random 64-bit keys as input data. */
    static const uint64_t num_keys = 10000000;
    static const uint64_t seed = 1234567890;
    std::cout << "generating input data..." << std::endl;
    std::vector<uint64_t> keys = pthash::distinct_keys<uint64_t>(num_keys, seed);
    assert(keys.size() == num_keys);

    /* Set up a build configuration. */
    pthash::build_configuration config;
    config.c = 7.0;
    config.alpha = 0.99;
    config.minimal_output = true;  // mphf
    config.verbose_output = false;

    /* Declare the PTHash function. */
    typedef pthash::single_phf<pthash::murmurhash2_64,         // base hasher
            pthash::compact_compact,  // encoder type
            true                    // minimal
    > pthash_type;
    pthash_type f;

    /* Build the function in internal memory. */
    std::cout << "building the function..." << std::endl;
    f.build_in_internal_memory(keys.begin(), keys.size(), config);

    // normal, use the default hash function
    std::unordered_map<uint64_t, uint64_t> normal;
    for (uint64_t x : keys) {
        normal[x] = x;
    }

    int collision = 0;
    for (size_t bucket = 0; bucket < normal.bucket_count(); bucket++) {
        if (normal.bucket_size(bucket) > 1) {
            collision ++;
        }
    }
    std::cout << "normal collision=" << collision << ", total=" << normal.bucket_count() << std::endl;

    // vector with pthash
    std::vector<uint64_t> pt_map(keys.size());
    for (uint64_t x : keys) {
        pt_map[f(x)] = x;
    }

    // unordered_map with pthash
    std::unordered_map<uint64_t, uint64_t, pthash_type> pt_umap(keys.size(), f);
    for(uint64_t x : keys) {
        pt_umap[x] = x;
    }

    collision = 0;
    for (size_t bucket = 0; bucket < pt_umap.bucket_count(); bucket++) {
        if (pt_umap.bucket_size(bucket) > 1) {
            collision ++;
        }
    }
    std::cout << "ptu collision=" << collision << ", total=" << pt_umap.bucket_count() << std::endl;

    // correctness
    for (size_t i = 0; i < keys.size(); i += 100) {
        uint64_t key = keys[i];
        assert(pt_map[f(key)] == normal[key]);
        assert(pt_umap[key] == normal[key]);
    }

    // performance
    auto start = system_clock::now();
    for (size_t i = 0; i < keys.size(); i += 100) {
        auto tmp = normal[keys[i]];
        (void)tmp;  // avoid unused-variable warnings
    }
    std::cout << "normal cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "us" << std::endl;

    start = system_clock::now();
    for (size_t i = 0; i < keys.size(); i += 100) {
        auto tmp = pt_map[f(keys[i])];
        (void)tmp;
    }
    std::cout << "pt cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "us" << std::endl;

    start = system_clock::now();
    for (size_t i = 0; i < keys.size(); i += 100) {
        auto tmp = pt_umap[keys[i]];
        (void)tmp;
    }
    std::cout << "ptu cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "us" << std::endl;
}

The results are (timings are in microseconds, not milliseconds):

generating input data...
building the function...
normal collision=2327031, total=13169977
ptu collision=0, total=10000019
normal cost=59043us
pt cost=28636us
ptu cost=60522us

pthash gets stuck while creating an MPHF

Hello again @jermp

What could be the reason for pthash getting stuck while creating an MPHF (see the log below)?

c = 6 alpha = 0.94 num_keys = 19 table_size = 20 num_buckets = 27
== map+sort took: 0 seconds
== merged 100% pairs
== merge+check took: 0 seconds
== mapping+ordering took 0 seconds
== max bucket size = 6
== lambda = 0.707988
== lambda_1 = 1.41598
== lambda_2 = 0.404565
== num_buckets of size 6 = 1 (estimated with Poisson = 0)
== num_buckets of size 5 = 0 (estimated with Poisson = 0)
== num_buckets of size 4 = 0 (estimated with Poisson = 0)
== num_buckets of size 3 = 0 (estimated with Poisson = 1)
== num_buckets of size 2 = 5 (estimated with Poisson = 3)
== num_buckets of size 1 = 3 (estimated with Poisson = 7)
2023-10-16 15:05:45: search starts
It could be the configuration for such a small number of keys. If that is the case, what do you recommend?
Thanks

"std::runtime_error ... using too many buckets: change bucket_id_type to uint64_t or use a smaller c" with c >= 5 and 34*10^9 keys

Hello, I'm trying to use pthash on ~34*10^9 strings (~50 bytes each), and I'm hitting the runtime error in the title when using the main build executable like this:

$ ./build --minimal -c 5 -a 0.99 -e dictionary_dictionary -n 34121566250 -i graph.nodes.csv -o graph.nodes.mph -t 48 --external -m 2048 --verbose --check --lookup
2023-11-25 16:39:19: construction starts
terminate called after throwing an instance of 'std::runtime_error'
  what():  using too many buckets: change bucket_id_type to uint64_t or use a smaller c

It happens for c >= 5, but not with c = 4.

Does one need to recompile with the suggested type change in order to use a higher c on an input set of this size?

Thanks a lot for pthash, it's amazing :-)

C API

Hello!

Very nice project and research (I don't fully understand it, but it seems very interesting). I wanted to port this to Rust, but it looks like a lot of work.

I thought it might be easier if it had a C API, so I could call it from Rust via FFI.

Is this possible to do?

I can work on it if you give me some direction.

runtime error "seed did not work" when -i points to a non-regular file

With a pthash clone at commit 0d09102 I'm now getting the following error:

$ ./pthash-build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 34121566250 \
    -p 3412 -t 96 --external -m 1024 -i <(zstdcat graph.nodes.csv.zst) -o graph.nodes.mph \
    -d ./tmp.pthash-nodes.IH4bXCZpg9 \
    --verbose
2023-11-29 07:01:49: construction starts
num_partitions 3412
using 1024 GB of RAM
 == partitioned 100% keys
processing 3412/3412 partitions...
terminate called after throwing an instance of 'pthash::seed_runtime_error'
  what():  seed did not work
$

I've tried two different datasets, encountering the same error.

I've also tried not using process substitution for -i, i.e., passing a regular file instead of the zstdcat pipe, and it does not fail in that case, which is quite weird. Note that I did not use the new --mmap flag, so I'm not sure what differs between the two cases from the point of view of ./build.

missing include

Hi @jermp ,
I think pthash/include/encoders/util.hpp is missing the <cstdint> and <cassert> headers:
g++ 12.3 complains about the uintX_t definitions and the asserts.

Support process substitution to read keys from standard input

As suggested by @zacchiro,
we should support process substitution to read keys from standard input.

Once done, we can specify, e.g., -i <(cat file) or -i <(zcat file.gz), etc.

@roberto-trani suggested using the same trick as grep (https://man7.org/linux/man-pages/man1/grep.1.html),
where

     -f FILE, --file=FILE

          Obtain patterns from FILE, one per line.  If this option
          is used multiple times or is combined with the -e
          (--regexp) option, search for all patterns given.  The
          empty file contains zero patterns, and therefore matches
          nothing.  If FILE is - , read patterns from standard
          input.

The only caveat is that reading from std::cin prevents iterating twice over the stream; hence we can only build the
function, but not check its correctness nor benchmark lookups (because both require iterating again over the keys in
the stream).

Seed not working

Hello @jermp @ByteHamster

I tried pthash on artificial data and it worked. However, when I moved to the hash values of k-mers, it raises an error related to the seed (see below).

c = 6 alpha = 0.94 num_keys = 21817 table_size = 23209 num_buckets = 9083
== map+sort took: 0.002 seconds
== merged 0% pairs
terminate called after throwing an instance of 'pthash::seed_runtime_error'
  what(): seed did not work

I tried different seed values, but none of them work.

What could be the problem?

-i can't read from a non-mmap-friendly input file when using --external

It is handy to pass "streaming" content as input to build -i, e.g., by using shell process substitution like this: build -i <(zstdcat foo.txt.zst).

Unfortunately, that does not work when --external is used.
Apparently in that case (and only in that case) the input file passed to -i is mmap-ed, which obviously fails when the input is not a regular file (a FIFO, in the case of process substitution).

Would it be possible to either avoid mmap-ing the input file even when --external is used or, alternatively, to have an option to inhibit that selectively?

Thanks!

distinct_keys doesn't actually remove duplicates

The following code calls std::unique, which only moves the unique elements to the front; it does not shrink the vector, so the duplicates that std::unique shifted to the end remain:

std::unique(keys.begin(), keys.end()); // remove the duplicates

The line should probably be:

keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

C++14 support

With C++14, I encountered the following error while compiling the example code:

pthash/include/builders/util.hpp:148:24: error: use of class template 'payload_iterator' requires template arguments
                       payload_iterator(pairs.begin() + i - bucket_size));
                       ^

I fixed this error by using C++17.
Is there any way to use pthash under the C++14 standard? Thanks in advance!

runtime error "blank line detected after reading X non-empty lines"

I'm getting the following runtime error with the main build executable:

 == partitioned 0% keys
terminate called after throwing an instance of 'std::runtime_error'
  what():  blank line detected after reading 0 non-empty lines

This is a real dataset in which there is indeed a valid key that happens to be the empty string.

Is there any particular reason for rejecting the empty string?
I've used other MPHFs (specifically: GOV) on this dataset in the past, and I don't see why it shouldn't be possible with PTHash.

Thanks in advance,
Cheers
