UMASH is a string hash function with throughput (22 GB/s on a 2.5 GHz Xeon 8175M) and latency (9-22 ns for input sizes up to 64 bytes on the same machine) comparable to that of performance-optimised hashes like MurmurHash3, XXH3, or farmhash. Its 64-bit output is almost universal, and it, as well as both its 32-bit halves, passes both Reini Urban's fork of SMHasher and Yves Orton's extended version (after expanding each seed to a 320-byte key for the latter).
Unlike most other non-cryptographic hash functions
(CLHash is a rare exception) which
do not prevent seed-independent collisions
and thus usually suffer from such weaknesses,
UMASH provably avoids parameter-independent collisions. For any two
inputs of l
bytes or fewer, the probability that a randomly
parameterised UMASH assigns them the same 64 bit hash value is less
than ceil(l / 2048) 2**-56
. UMASH also offers a fingerprinting mode
that simply computes two independent hashes at the same time. The
resulting 128-bit fingerprint
collides pairs of l
-or-fewer-byte inputs with probability less than
ceil(l / 2048)**2 * 2**-112
; that's less than 2**-70
(1e-21
) for
up to 7.5 GB of data.
See umash_reference.py
(pre-rendered in umash.pdf
) for details and
rationale about the design, and a proof sketch for the collision bound.
The blog post announcing UMASH
includes a higher level overview and may also provide useful context.
If you're not into details, you can also just copy umash.c
and
umash.h
in your project: they're distributed under the MIT license.
The current implementation only build with gcc-compatible compilers
that support the integer overflow builtins
introduced by GCC 5 (April 2015) and targets x86-64 machines with the
CLMUL extension
(available since 2011 on Intel and AMD). That's simply because we
only use UMASH on such platforms at Backtrace.
There should be no reason we can't also target other compilers, or
other architectures with carry-less multiplication instructions
(e.g., VMULL
on ARMv8).
The test suite calls into a shared object with test-only external
symbols with Python 3, CFFI,
and Hypothesis. As long as Python3 and
venv are installed, you
may execute t/run-tests.sh
to download test dependencies, build the
current version of UMASH and run all the pytests in the t/
directory. t/run-tests-public.sh
only exercises the public
interface, which may be helpful to test a production build or when
making extensive internal changes.
The Python test code is automatically formatted with
black. We try to make sure the C code
sticks to something similar to the
FreeBSD KNF;
when in doubt, whatever t/format.sh
does is good enough.
While the UMASH algorithm isn't frozen yet, we do not expect to change
its overall structure (PH
block compressor that feeds a polynomial
string hash). There are plenty of lower hanging fruits.
- The short (8 or fewer bytes) input code can hopefully be simpler.
- The medium-length (9-15 bytes) input code path is a micro-optimised version of the general case, but does not actually share any machine code; can we improve the latency and maintain the collision bounds by replacing it with something completely different?
- Itβs already nice that we can get away with a single round of
xor-shift
/ multiply in the finaliser, but can we shave even more latency there? - We only looked at x86-64 implementations; we will consider simple changes that improve performance on x86-64, or on other platforms as long they don't penalise x86-64.
- We currently only use incremental and one-shot hashing interfaces. If someone needs parallel hashing, we can collaborate to find out what that interface could look like.
And of course, portability to other C compilers or platforms is interesting.