Code Monkey home page Code Monkey logo

umash's Introduction

UMASH: a fast almost universal 64-bit string hash

UMASH is a string hash function with throughput (22 GB/s on a 2.5 GHz Xeon 8175M) and latency (9-22 ns for input sizes up to 64 bytes on the same machine) comparable to that of performance-optimised hashes like MurmurHash3, XXH3, or farmhash. Its 64-bit output is almost universal, and it, as well as both its 32-bit halves, passes both Reini Urban's fork of SMHasher and Yves Orton's extended version (after expanding each seed to a 320-byte key for the latter).

Unlike most other non-cryptographic hash functions (CLHash is a rare exception) which do not prevent seed-independent collisions and thus usually suffer from such weaknesses, UMASH provably avoids parameter-independent collisions. For any two inputs of l bytes or fewer, the probability that a randomly parameterised UMASH assigns them the same 64 bit hash value is less than ceil(l / 2048) 2**-56. UMASH also offers a fingerprinting mode that simply computes two independent hashes at the same time. The resulting 128-bit fingerprint collides pairs of l-or-fewer-byte inputs with probability less than ceil(l / 2048)**2 * 2**-112; that's less than 2**-70 (1e-21) for up to 7.5 GB of data.

See umash_reference.py (pre-rendered in umash.pdf) for details and rationale about the design, and a proof sketch for the collision bound. The blog post announcing UMASH includes a higher level overview and may also provide useful context.

If you're not into details, you can also just copy umash.c and umash.h in your project: they're distributed under the MIT license.

The current implementation only build with gcc-compatible compilers that support the integer overflow builtins introduced by GCC 5 (April 2015) and targets x86-64 machines with the CLMUL extension (available since 2011 on Intel and AMD). That's simply because we only use UMASH on such platforms at Backtrace. There should be no reason we can't also target other compilers, or other architectures with carry-less multiplication instructions (e.g., VMULL on ARMv8).

Hacking on UMASH

The test suite calls into a shared object with test-only external symbols with Python 3, CFFI, and Hypothesis. As long as Python3 and venv are installed, you may execute t/run-tests.sh to download test dependencies, build the current version of UMASH and run all the pytests in the t/ directory. t/run-tests-public.sh only exercises the public interface, which may be helpful to test a production build or when making extensive internal changes.

The Python test code is automatically formatted with black. We try to make sure the C code sticks to something similar to the FreeBSD KNF; when in doubt, whatever t/format.sh does is good enough.

Help wanted

While the UMASH algorithm isn't frozen yet, we do not expect to change its overall structure (PH block compressor that feeds a polynomial string hash). There are plenty of lower hanging fruits.

  1. The short (8 or fewer bytes) input code can hopefully be simpler.
  2. The medium-length (9-15 bytes) input code path is a micro-optimised version of the general case, but does not actually share any machine code; can we improve the latency and maintain the collision bounds by replacing it with something completely different?
  3. It’s already nice that we can get away with a single round of xor-shift / multiply in the finaliser, but can we shave even more latency there?
  4. We only looked at x86-64 implementations; we will consider simple changes that improve performance on x86-64, or on other platforms as long they don't penalise x86-64.
  5. We currently only use incremental and one-shot hashing interfaces. If someone needs parallel hashing, we can collaborate to find out what that interface could look like.

And of course, portability to other C compilers or platforms is interesting.

umash's People

Contributors

pkhuong avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.