Code Monkey home page Code Monkey logo

rollinghashcpp's Introduction

Randomized rolling hash functions in C++

Ubuntu 22.04 CI (GCC 11)

License: Apache 2.0

What is this?

This is a set of C++ classes implementing various recursive n-gram hashing techniques, also called rolling hashing (http://en.wikipedia.org/wiki/Rolling_hash), including:

  • Randomized Karp-Rabin (sometimes called Rabin-Karp)
  • Hashing by Cyclic Polynomials (also known as Buzhash)
  • Hashing by Irreducible Polynomials

This library is used by khmer: the in-memory nucleotide sequence k-mer engine.

These are randomized hash functions, meaning that each time you create a new hasher instance, you will get new hash values for a given input.

Code sample

        const uint n(3);//hash all sequences of 3 characters
        const uint L(7); // you need 7 bits
        CyclicHash<uint32> hf(n,L );// if you want 64-bit values replace uint32 by uint64
        for(uint32 k = 0; k<n;++k) {
                  chartype c = ... ; // grab some character
                  hf.eat(c); // feed it to the hasher
        }
        while(...) { // go over your string
           hf.hashvalue; // at all times, this contains the hash value
           chartype c = ... ;// point to the next character
           chartype out = ...; // character we want to forget
           hf.update(out,c); // update hash value
        }
        hf.reset(); // you can now hash a new string

Requirements

A recent GNU GCC C++ compiler or a recent CLANG.

What should I do after I download it?

It is a conventional Cmake projet.

cmake -B build
cmake --build build
ctest --test-dir build

Nim version

See Cyclic-Polynomial-Hash for a similar library written in Nim.

References

  • Daniel Lemire, Owen Kaser: Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language, Volume 24, Issue 4, October 2010, Pages 698-710 http://arxiv.org/abs/0705.4676
  • Daniel Lemire, The universality of iterated hashing over variable-length strings, Discrete Applied Mathematics 160 (4-5), 2012. http://arxiv.org/abs/1008.1715
  • Owen Kaser and Daniel Lemire, Strongly universal string hashing is fast, Computer Journal (2014) 57 (11): 1624-1638. http://arxiv.org/abs/1202.4961

This work has been used in genomics, see

rollinghashcpp's People

Contributors

dnbaker avatar jianshu93 avatar kotturtech avatar lemire avatar standage avatar zyan0 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

rollinghashcpp's Issues

64-bit hashes

Hi,

Is it possible to obtain somehow 64 bit hash values? I need hashes for a Bloom filter, so 2^32 bits are not enough: my application will require to use a filter of size much larger than 2^32 entries.

A Family of Hash Function

Hi,

Suppose that I need a family of N different rolling hash functions? Is your library usable in this scenario? It seems that the Rabin-Karp implementation is not randomized.

Moving from N-gram to (N+1)-gram

Hi,

Is it possible to get quickly hash of an (N+1)-gram given that we have a hash function object, representing an N-gram of that (N+1)-gram? I.e. given hash value of "ABCD", can I have value of
"ABCDE", without computing the whole hash value?

GeneralHash: the default polynomial isn't really irreducible

Hello (again). I wondered where you took the irreducible polynomials from, and unable to see that I tried to check them which failed for the 19-degree one. The other one seems OK according to Wolfram. (EDITed the link not to drop the + characters.)

I verified by different means that the factorization indeed does hold, e.g. 158543 ^ (158543<<1) ^ (158543<<2) == 987373.

For speed testing there's most likely no impact, even though there are conditionals on some of the bits but those should still be uniform enough. And to be honest, most rolling-hash implementations use much worse functions, but still...

Compile error ‘std::string’ has no member named ‘back’

I ran across this error attempting to run make

unit.cpp: In function ‘bool testExtendAndPrepend(uint)’:
unit.cpp:37:29: error: ‘std::string’ has no member named ‘back’
     if(hf.hash_extend(input.back()) != hf.hash(extend)) {
                             ^
unit.cpp:39:60: error: ‘std::string’ has no member named ‘back’
         std::cout << extend << " " << hf.hash_extend(input.back()) << " " << hf.hash(extend) << std::endl;
                                                            ^
In file included from /usr/include/c++/4.8/cassert:43:0,
                 from characterhash.h:8,
                 from cyclichash.h:4,
                 from unit.cpp:4:
unit.cpp:45:33: error: ‘std::string’ has no member named ‘back’
     assert(hf.hash_extend(input.back()) == hf.hash(extend));
                                 ^
make: *** [unit] Error 1

I need to add compile argument. in Makefile, CXXFLAGS, I added -std=c++11

Rolling CRC

It's quite possible to do an efficient rolling CRC. Let me know if you're interested in help with that. I could help with the mathematical theory, and practical implementations.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.