Comments (66)
BLAKE3, a cryptographic hash function - https://github.com/BLAKE3-team/BLAKE3
from borg.
Python bindings for BLAKE3
https://github.com/oconnor663/blake3-py
from borg.
I reran the tests from above on a Ryzen 3900X using OpenSSL 1.1.1d.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 106384.13k 345054.19k 897293.69k 1434363.86k 1788742.25k 1807924.24k
sha512 43884.78k 176089.53k 370019.94k 634628.74k 804669.85k 783004.49k
blake2s256 61339.27k 242532.95k 390655.26k 465763.91k 495717.82k 499056.64k
blake2b512 49562.98k 198390.17k 535434.49k 761940.14k 914828.61k 865731.49k
aes-256-cbc 767444.38k 929565.78k 982629.23k 977272.05k 992434.91k 993516.26k
aes-256-gcm 494790.72k 1351056.42k 2639749.24k 3451936.94k 3881137.88k 3858543.08k
aes-256-ocb 467931.89k 1693760.31k 3525061.12k 5028728.75k 5624994.67k 5851113.04k
aes-256-ctr 618497.82k 2140440.58k 5050578.48k 6892584.36k 7558430.72k 7638058.87k
chacha20-poly1305 297504.81k 541595.35k 1089618.16k 2081096.12k 2004806.25k 1997594.55k
Note massive increase in SHA-256 performance due to the Ryzen supporting Intel SHA Extensions (which only cover SHA-1 and SHA-256). Predictions from a few years ago when we added BLAKE2 were thus correct; better than SHA-256, except if your CPU has hardware support for it.
from borg.
An interesting new encryption algorithm is AEGIS, which is based on AES but, from my understanding, builds on top of what has been learned with AES block cipher modes/encryption schemes…
https://datatracker.ietf.org/doc/draft-irtf-cfrg-aegis-aead/00/
from borg.
I have a branch where I worked on the LoggedIO write performance and managed to double it when processing large files (45 MB/s => 90 MB/s; vmlocked input, output onto tmpfs; dd does ~500 MB/s here), mainly by managing syncs in a way that gives the kernel a chance to do them when it wants to, without compromising transactionality (and indeed, syncs don't make a significant appearance in the profile anymore).
Adding a none64 encryption mode using SHA512-256 moved it to ~110 MB/s.
Profiled it there (with Cython profiling enabled):
- 40 % Chunker.__next__
- 35 % _hashlib.openssl_sha512
- 10 % CRC32 (which matches very well with stand-alone CRC32 giving me approx. 1 GB/s)
- 5 % buffered IO writes
- 4 % the bytes-join in Plaintext64Key.encrypt
- 3.5 % compression (none was enabled)
So it seems to me that the Chunker is the next big target for optimization, i.e. mainly to see what the compiler does there and whether there is anything left to optimize.
Btw. I'm using that branch in production currently, no issues so far. So a PR for that will probably come this weekend.
Extraction is basically 70 % SHA-512, 20 % CRC-32 and 10 % IO+misc (for ~210 MB/s). Normal plaintext w/ SHA-256 is 160 MB/s or so. I'd say extraction speed is acceptable for my CPU (which is old and has 'AMD' lasered onto the lid).
from borg.
I think the performance graph for BLAKE3 is single-threaded, see the title on the image: https://raw.githubusercontent.com/BLAKE3-team/BLAKE3/037de38bfec4e813ab6189a50cb7c4cbae47268a/media/speed.svg
See also: https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf
from borg.
@Maryse47 that is expected for 64bit platforms, where sha512 is usually faster than sha256.
so nowadays it is kind of stupid to use sha256 as a software implementation because one could just use sha512 (and throw away half of the result if 256bits are wanted). only exception (see above) is CPU hw accelerated sha256, that might be faster again if sha512 is not hw accelerated.
borg uses sha256 mostly due to historical reasons, but we also have the fast blake2b algo (fast in software, there is no hw acceleration for that).
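The "use sha512 and throw away half of the result" trick mentioned above is a one-liner with the stdlib; a minimal sketch (note this is plain truncation for illustration; the standardized SHA-512/256 variant additionally uses different initial hash values, so the digests differ from naive truncation):

```python
import hashlib

def sha512_trunc256(data: bytes) -> bytes:
    """Hash with SHA-512 and keep only the first 256 bits."""
    return hashlib.sha512(data).digest()[:32]

# 32 bytes = 256 bits, same output size as sha256
print(len(sha512_trunc256(b"some chunk data")))  # 32
```

On 64-bit CPUs without SHA hardware extensions this typically outpaces sha256 for large inputs, since SHA-512 processes 128-byte blocks using 64-bit arithmetic.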
from borg.
OpenSSL 1.1.1k on a Ryzen 5600X locked at 4.6 GHz all cores (if you're thinking "Hey, that seems way better than Zen 2 CPUs which almost never hit their advertised clocks even on the best core with light loads" you'd be right. Zen 3 parts always hit their advertised clocks because (1) AMD did not bullshit this time (2) the default GFL is 50 MHz above the advertised clock. These will always hit 4650 MHz on pretty much any core, even under load.)
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 293473.80k 731471.10k 1500114.58k 2005120.14k 2254721.35k 2250309.63k
sha512 93705.30k 378856.55k 635270.24k 927596.54k 1054145.53k 1064222.72k
blake2s256 59266.68k 228436.29k 443893.58k 599372.60k 665122.79k 665390.10k
blake2b512 43775.31k 175761.00k 511203.16k 813573.12k 997690.03k 1023606.78k
aes-256-cbc 975053.72k 1182504.63k 1231511.72k 1238711.03k 1236388.52k 1240951.47k
aes-256-gcm 579415.38k 1531545.54k 3248525.31k 4580585.19k 5483003.90k 5563645.95k
aes-256-ctr 625653.43k 2292196.01k 5539421.87k 8358842.09k 10065625.00k 10040295.42k
aes-256-ocb 470562.27k 1772509.65k 4480814.42k 6982971.63k 8479542.53k 8470701.18k
chacha20-poly1305 363805.49k 648929.49k 1576549.03k 2895763.59k 3068497.39k 3116335.10k
Note massive improvement in ChaCha20-Poly1305 over Zen 2 (almost +50 %), and all other pipelinable modes (GCM, CTR, OCB). Zen 3 has more SIMD EUs and seems to have gained another EU capable of AES-NI. Higher AES-CBC performance likely due to much higher sustained clocks under load compared to my 3900X above.
Also note how even all the hashes see massively improved performance.
During these benchmarks the active core pulls around 4-6 W. Whole CPU is running at around 40 W, 3/4 of that is uncore - MCM / chiplet architecture is a "gas guzzler".
from borg.
https://lwn.net/Articles/681616/ old, but partly still relevant I guess.
from borg.
"The many flavors of hashing" - an article about different types of hash functions and algorithms
from borg.
Some benchmarks again, in a roughly historical order.
Intel Xeon Gold 6230 CPU (Cascade Lake = Skylake, 14nm), OpenSSL 1.0.2k-fips 26 Jan 2017 (=RHEL 7.9)
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
md5 85504.16k 239891.14k 499248.12k 687352.49k 775319.37k
sha256 62643.81k 163584.06k 338441.59k 454458.71k 505399.82k
sha512 45636.55k 184678.49k 367671.74k 600479.76k 738538.84k
aes-256-cbc 969298.83k 1045541.12k 1059651.13k 1063793.53k 1069484.71k
aes-256-gcm 627073.36k 1413484.99k 2554710.39k 3718083.38k 4323027.63k
The remainder are OpenSSL 1.1.1k FIPS 25 Mar 2021 (RHEL 8)
Intel Xeon Platinum 8358 CPU (Ice Lake, 10nm / Intel 7)
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
md5 82856.64k 223238.38k 453702.49k 610063.36k 680492.18k 681306.79k
sha256 96116.06k 299662.45k 699202.90k 1039871.32k 1233153.54k 1243474.60k
sha512 42121.62k 170274.01k 321573.55k 501962.75k 599378.60k 610334.14k
blake2s256 64056.02k 255429.01k 386962.94k 450678.44k 479584.26k 483678.69k
blake2b512 53140.32k 215689.87k 544194.30k 701760.85k 775902.55k 787325.17k
aes-256-cbc 905916.56k 1120170.35k 1162550.02k 1171404.29k 1169083.05k 1169053.01k
aes-256-gcm 524914.40k 1490840.90k 3162930.69k 4295296.43k 5126362.45k 5202695.51k
aes-256-ocb 513380.86k 1833849.56k 4120666.03k 5847052.67k 6601149.10k 6706571.95k
chacha20-poly1305 283249.45k 574282.20k 1824270.34k 3251743.74k 3762856.84k 3789111.30k
Intel Gold 5318N CPU (also Ice Lake, different segment)
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
md5 65826.20k 184671.12k 374096.74k 500161.19k 556992.98k 561110.90k
sha256 77679.44k 241026.69k 569760.71k 858738.73k 1008492.74k 1017107.80k
sha512 33756.66k 141953.56k 265739.18k 412468.91k 491859.74k 499503.78k
blake2s256 51620.99k 212150.25k 319860.10k 371007.19k 394276.30k 394657.79k
blake2b512 44016.77k 178105.90k 450999.31k 578862.41k 636781.42k 641788.59k
aes-256-cbc 785405.33k 929806.40k 954328.27k 958396.87k 959707.87k 956410.54k
aes-256-gcm 467454.45k 1249941.55k 2595831.89k 3565374.50k 4200558.96k 4268459.41k
aes-256-ocb 414959.50k 1497825.82k 3194263.47k 4780095.02k 5413427.03k 5500996.49k
chacha20-poly1305 209454.20k 451417.77k 1462585.71k 2586606.53k 2970342.49k 2987840.85k
AMD EPYC 9454 (Zen 4, 5nm) @ 3.8 GHz. Zen 4 has VAES instructions, but it's unclear to me if this is supposed to double the AES throughput or just a different encoding for the existing AES-NI instructions. In any case, the OpenSSL version used in RHEL 8 is too old to know about VAES.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
md5 111832.52k 285128.70k 551276.22k 717113.71k 783878.83k 792075.99k
sha256 150723.91k 449164.61k 1048627.37k 1571731.44k 1837775.20k 1860225.11k
sha512 68714.20k 274766.41k 494378.72k 753474.27k 879861.76k 895667.80k
blake2s256 85720.28k 342488.17k 486073.51k 550913.37k 570242.39k 573834.53k
blake2b512 72041.86k 288009.01k 727135.91k 919075.96k 993648.50k 999615.79k
aes-256-cbc 906857.08k 1021701.54k 1052213.42k 1064056.21k 1066768.27k 1067009.37k
aes-256-gcm 714170.81k 1763196.91k 3377087.10k 4257207.40k 4673672.53k 4730570.40k
aes-256-ocb 626893.67k 2318789.95k 5100248.79k 6748232.70k 7473972.57k 7559254.30k
chacha20-poly1305 296382.31k 556920.14k 1814487.55k 3516808.13k 3887621.82k 3900347.73k
What do we learn from this? Well, in terms of SHA and AES-NI extensions, x86 CPUs are very, very uniform these days. Especially in server parts, where Intel cores typically have more FP resources than in client parts. If you normalize to clock speed, they're all pretty much the same.
Zen 3 to Zen 4 shows no changes at all here, unless VAES makes a difference.
Re-test with OpenSSL 3.1.1
VAES does seem to make a difference - a 2x difference. OpenSSL uses VAES for AES-GCM and AES-MB (multi-buffer, which interleaves encryption/decryption of independent streams and is not used here). It's also used in a few stitched implementations of AES-CBC with various SHAs, but not in AES-CTR or AES-OCB. Build flags:
version: 3.1.1
built on: Thu Jun 29 10:06:15 2023 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
gcc 8.5.0 (the RHEL 8 patchfest)
VAES (AVX512F) seems to just perform one encryption/decryption round on four independent blocks,
while VAES (AVX512VL) does the same, but on an AVX2 register with two blocks.
However, I'm not sure if the results below are actually VAES' doing, and whether this actually uses VAES with larger-than-128-bit registers: as far as I can tell the code generator uses xmm registers with VAESENC, which would use the AVX512VL encoding and hence should be equivalent to traditional AES-NI in terms of performance.
So maybe it's just a better implementation in OpenSSL 3.x compared to the old 1.1.x series.
In any case, despite being a somewhat terrible construction, AES-GCM just doesn't seem to be able to stop winning. Almost 11 GB/s at just 3.8 GHz is impeccable performance (that's 0.35 cpb). AES-CTR is quite a bit slower at just 8.6 GB/s. The 128 bit variants are not much faster; 12.5 GB/s and 9.8 GB/s, respectively.
The Ice Lake Xeon performs even a bit better than the Zen 4 EPYC still at just below 0.3 cpb.
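The cpb figures quoted above can be reproduced directly from the throughput numbers and the stated clock speeds; a tiny helper:

```python
def cycles_per_byte(throughput_bytes_per_s: float, clock_hz: float) -> float:
    """openssl speed prints throughput; dividing the clock rate by it
    gives the cycles spent per byte processed."""
    return clock_hz / throughput_bytes_per_s

# e.g. AES-256-GCM on the EPYC 9454: ~10.8 GB/s at 3.8 GHz
print(round(cycles_per_byte(10.8e9, 3.8e9), 2))  # 0.35
```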
AMD EPYC 9454
CPUINFO: OPENSSL_ia32cap=0x7efa320b078bffff:0x415fdef1bf97a9
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
md5 104800.85k 271984.80k 537514.55k 711293.35k 782983.17k 791643.10k
sha256 147387.25k 442508.86k 1041297.72k 1565552.50k 1830996.65k 1860800.47k
sha512 63776.48k 254963.31k 477623.57k 743322.63k 880691.88k 895717.12k
blake2s256 81892.92k 326048.83k 513686.77k 611124.91k 650505.08k 653639.41k
blake2b512 66261.11k 264983.76k 680883.63k 925331.26k 1036629.33k 1048839.02k
AES-256-CBC 907505.92k 1022436.75k 1052622.42k 1064306.90k 1066902.52k 1063479.98k
AES-256-GCM 735854.55k 2656460.46k 5900279.78k 6939782.94k 10419991.89k 10804415.10k
AES-256-OCB 697220.41k 2601784.53k 5363791.96k 6891523.42k 7504102.14k 7545640.28k
ChaCha20-Poly1305 296290.55k 555132.90k 1836174.68k 3647579.78k 3896452.20k 3916316.67k
Intel Xeon Platinum 8358 CPU (the Xeon Gold 5318N behaves the same way and has the same CPUID flags)
CPUINFO: OPENSSL_ia32cap=0x7ffef3f7ffebffff:0x40417f5ef3bfb7ef
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
md5 80569.01k 218127.89k 449260.99k 607212.20k 679779.83k 683201.88k
sha256 92481.01k 285519.68k 680692.36k 1033920.85k 1230120.58k 1241890.82k
sha512 42249.83k 170846.44k 323417.00k 504486.79k 599127.38k 608157.70k
blake2s256 62037.24k 247927.21k 397370.62k 473488.70k 501844.65k 506309.44k
blake2b512 50842.18k 205802.25k 538162.01k 716786.64k 797903.53k 805612.20k
AES-256-CBC 1025099.80k 1140883.63k 1162179.58k 1171024.48k 1168703.49k 1168632.49k
AES-256-GCM 654373.75k 2529894.21k 4690537.90k 6565636.10k 11066534.44k 11562456.41k
AES-256-OCB 566158.67k 1972074.24k 4249955.57k 5881506.47k 6632761.02k 6708450.65k
ChaCha20-Poly1305 278457.26k 569466.01k 1851996.13k 3319006.21k 3773874.18k 3813877.38k
from borg.
@infectormp IIRC, a talk from a blosc developer or user was the first time I heard about lz4 (and how they use it to get data faster into cpu cache than reading uncompressed memory). But blosc has quite a lot more stuff than we need.
from borg.
If/when you do feel you can use libsodium, that also opens up two things which could be useful:
- scrypt as a PBKDF. This should be (much) less vulnerable to brute-force password guessing.
- secretbox with XSalsa20 and Poly1305. This would solve worries over counter re-use, as you can randomly generate a 24-byte nonce for each new encryption with negligible chance of reuse.
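The "negligible chance of reuse" for a random 24-byte (192-bit) nonce can be quantified with the usual birthday bound; a back-of-the-envelope check:

```python
def nonce_collision_prob(n_messages: int, nonce_bits: int = 192) -> float:
    """Birthday-bound approximation: p ≈ n² / 2^(bits+1)."""
    return n_messages ** 2 / 2 ** (nonce_bits + 1)

# even a trillion encryptions under one key stay far below any risk
print(nonce_collision_prob(10**12))  # ≈ 8e-35
```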
from borg.
@namelessjon interesting, thanks. about the counter re-use issue: there is also this idea of creating per-backup(-thread) random session keys, start counter from 0, encrypt the keys with the master key and store them with the backup.
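The per-backup session key idea could look roughly like this - a minimal sketch only; the "encrypt the session key with the master key" step is described but not coded here, since it would use an actual AEAD cipher, which the stdlib doesn't provide:

```python
import os

class BackupSession:
    """One random session key per backup(-thread); the AEAD nonce is a
    plain counter starting from 0, so (key, nonce) pairs can never
    repeat across backup runs even though counters restart."""
    NONCE_BYTES = 12  # 96-bit nonce, as used by e.g. AES-GCM

    def __init__(self):
        self.session_key = os.urandom(32)  # fresh per backup run
        self._counter = 0

    def next_nonce(self) -> bytes:
        nonce = self._counter.to_bytes(self.NONCE_BYTES, 'big')
        self._counter += 1
        return nonce

# two backup runs: distinct keys, both counters may safely restart at 0
a, b = BackupSession(), BackupSession()
assert a.session_key != b.session_key
assert a.next_nonce() == b.next_nonce() == bytes(12)
```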
from borg.
@ThomasWaldmann That sounds less fragile than the current implementation at least (I think).
The secretbox option from libsodium, despite the larger nonce, adds the same 40-byte overhead to the files, because the Poly1305 MAC is only 16 bytes. I guess if you're now including a pointer to the encrypted session key with each encrypted blob, secretbox would have less overhead, but that shouldn't be that significant anyway?
from borg.
I had a look at libsodium yesterday, seems pretty nice and it also is in some stable linux distributions now.
It would be useful to get some comparative performance values:
aes256-ctr + hmac-sha256 vs aes256-gcm (hw accel.) vs. xsalsa20 + poly1305
same for sha256 against some faster hash from libsodium.
For interfacing, we have 2 options: either via cython (like we use it for openssl right now) or using some python wrapper for libsodium.
from borg.
https://www.imperialviolet.org/2014/02/27/tlssymmetriccrypto.html < not a perfect comparison (it's ChaCha20, not XSalsa20, but I believe the performance is supposed to be similar), and that's one or more intel processor generations ago, but there's that. However, I think since then libsodium has picked up an assembly version of Salsa20 which should be faster.
from borg.
pysodium crypto_generichash (256bit) is 2.8 times faster than sha256 from python stdlib.
pysodium crypto_generichash (512bit) is 1.8 times faster than sha512 from python stdlib.
note: sha256 eats most of the cpu time for borgbackup currently (when using hw accel. aes and lz4).
But: no AES256-CTR in libsodium yet. jedisct1/libsodium#317
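Since Python 3.6 the stdlib's hashlib also ships BLAKE2 directly, so a comparable single-threaded measurement no longer needs pysodium. A rough sketch (absolute numbers and ratios will of course vary by CPU and build):

```python
import hashlib
import os
import timeit

data = os.urandom(4 * 1024 * 1024)  # 4 MiB test buffer

def bench(name: str) -> float:
    # time 20 hashing passes over the buffer
    return timeit.timeit(lambda: hashlib.new(name, data).digest(), number=20)

t_sha256, t_blake2b = bench('sha256'), bench('blake2b')
print(f"blake2b speedup over sha256: {t_sha256 / t_blake2b:.2f}x")
```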
from borg.
Seems unlikely that aes-ctr will be added to libsodium, judging from how that thread has evolved. I think I agree with the reasoning, too. The nice thing about libsodium is that it inherits "hard to mess up" from NaCl.
from borg.
Though it does complicate a migration to different algorithms without adding more dependencies
from borg.
openssl 1.1.0 is scheduled for april 2016 release. update: borg uses openssl 1.1.x
https://www.openssl.org/news/changelog.html#x0
chacha20 / poly1305
ocb mode
from borg.
Hash news from Google :
- siphash
- highwayhash
from borg.
As debian stable and ubuntu lts now have libsodium, I've begun working on a cython-based libsodium binding for borg. It gives us chacha20-poly1305 as a new AEAD cipher and blake2b as a new hash.
Strange, I am seeing less than expected speedup:
speedup sha256 -> sha512: 1.4616587679833115
speedup sha256 -> blake2b: 1.5823430737200959
speedup sha512 -> blake2b: 1.0825666758755845
I first thought this is maybe caused by a slow blake2b 1.0.8 in ubuntu and I manually installed 1.0.10 (which has "record speed avx2 blake2b") - but it doesn't get faster. https://blake2.net/ says blake2b should be about 3x faster than sha512, so what's going wrong here?
from borg.
Quote from python docs: "An Adler-32 checksum is almost as reliable as a CRC32 but can be computed much more quickly."
Quote from stackexchange: "Do note that Adler32 is almost useless for short runs of data. Up to about 180 bytes, it produces numerous collisions."
>>> import time
>>> from zlib import crc32, adler32
>>> def dt(f):  # runtime of the given function in seconds
...     t0 = time.perf_counter(); f(); return time.perf_counter() - t0
...
>>> data = b'x' * 1000000000
>>> dt(lambda: crc32(data))
1.1496269702911377
>>> dt(lambda: adler32(data))
0.49709367752075195
from borg.
CRC32 is already around 1 GB/s (even on my older CPUs) and should be much faster on CPUs with CLMUL. I'm not sure whether zlib makes use of that; if it doesn't, getting an implementation that does (or nudging Python into using one) would make sense and comes for free, except for the hassle.
For 2.0 it would make sense to switch to something as fast as CRC32 (blake) but with much higher integrity guarantees, e.g. 128+ bit blake checksums on the Repository layer.
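hashlib's blake2b already supports variable digest sizes, so a 128-bit repository-layer checksum like the one suggested here is a one-liner (digest_size=16 is just an illustrative value):

```python
import hashlib

# 128-bit BLAKE2b checksum of a chunk; keyed hashing is also available
chk = hashlib.blake2b(b"some chunk data", digest_size=16).digest()
print(len(chk) * 8)  # 128 bits
```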
from borg.
All tests made with openssl speed -evp <algorithm>. (Note: AES-NI always includes CLMUL; I therefore don't mention it separately.)
AMD K10 'Thuban', 3.3 GHz, no AES-NI, OpenSSL master (to-be 1.1)
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 32807.44k 80855.72k 153107.35k 191174.64k 203030.09k 202454.73k
sha512 22194.31k 88633.71k 174282.49k 272886.07k 312806.06k 313252.22k
blake2s256 30131.61k 121603.55k 248297.91k 334311.34k 345933.32k 338420.01k
blake2b512 29113.06k 115848.20k 340357.74k 517671.73k 574694.83k 582163.52k
aes-256-cbc 65277.71k 71006.95k 73144.51k 184525.14k 185352.90k 186289.92k
aes-256-gcm 50956.21k 56946.58k 57792.17k 60589.06k 59232.56k 58938.56k
aes-256-ocb 61807.33k 65769.66k 66718.91k 66782.89k 66816.68k 67196.32k
chacha20-poly1305 168435.67k 321127.98k 360382.55k 381305.75k 384996.69k 382865.04k
Intel Xeon E3-1231v3, 3.4 GHz, AES-NI
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 55129.16k 145301.59k 311774.38k 428385.09k 468527.79k 476840.16k
sha512 36564.32k 148308.25k 301326.17k 534995.48k 683207.34k 697860.10k
blake2s256 42168.03k 163007.85k 341910.18k 473521.49k 532892.33k 540114.94k
blake2b512 41371.63k 163340.95k 474446.42k 712455.17k 835794.26k 846603.47k
aes-256-cbc 571487.46k 600316.16k 608165.89k 609246.21k 611030.36k 613452.03k
aes-256-ctr 420332.94k 1373547.81k 2840697.54k 3595480.49k 3863664.23k 3909044.31k
aes-256-gcm 373152.65k 1071874.69k 2080868.78k 2579107.16k 2893392.55k 2940644.01k
aes-256-ocb 345922.01k 1456611.33k 2691367.08k 3528726.53k 3820748.80k 3855936.17k
chacha20-poly1305 282856.40k 509213.58k 1095028.99k 1905031.85k 2016478.61k 2010589.87k
powermac G5, dual core, 2 GHz, OpenSSL master (to-be 1.1), configured for ppc64.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 8032.07k 24662.84k 57131.78k 86498.65k 102213.69k 102951.59k
sha512 7186.19k 28637.80k 66281.56k 118863.87k 155130.18k 157947.22k
blake2s256 8173.06k 32977.38k 69935.79k 98463.06k 112172.86k 112831.15k
blake2b512 7160.29k 28813.66k 87564.20k 152836.62k 195205.22k 198838.20k
aes-256-cbc 50287.13k 58998.33k 63115.43k 64246.10k 64804.47k 64823.65k
aes-256-gcm 33388.54k 37288.54k 38743.81k 39149.57k 39439.41k 39436.67k
aes-256-ocb 37836.55k 43085.70k 44221.53k 44683.95k 44921.75k 44772.01k
chacha20-poly1305 63010.86k 122622.37k 218386.26k 239075.57k 246295.21k 247820.33k
X200, Intel P8600, 2.4 GHz, no AES-NI, OpenSSL git 38e19eb96
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 37834.40k 82205.76k 142650.64k 172888.69k 188077.40k 190355.78k
sha512 25130.82k 101252.47k 156723.29k 219927.55k 250352.98k 252177.07k
blake2s256 23435.04k 92726.06k 184286.46k 246177.13k 273435.31k 275256.66k
blake2b512 21004.03k 82610.52k 246223.54k 382414.17k 457061.72k 460215.64k
aes-256-cbc 128602.15k 149870.77k 155655.25k 157678.25k 156516.85k 157543.08k
aes-256-gcm 40657.10k 46472.35k 120275.88k 129553.07k 131236.39k 131164.84k
aes-256-ocb 107630.54k 123701.66k 125957.38k 129277.61k 130329.51k 128839.82k
chacha20-poly1305 143071.17k 253906.65k 395730.28k 412351.74k 423469.06k 420422.21k
X201, Intel i5-520M (1st gen), AES-NI, 2.5 GHz, OpenSSL master (to-be 1.1)
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 27347.57k 70701.40k 132937.71k 180204.41k 198897.91k 200478.58k
sha512 18721.99k 74486.12k 138151.43k 203135.40k 254793.19k 259274.05k
blake2s256 29862.39k 120343.92k 213276.18k 268852.92k 276566.88k 283360.73k
blake2b512 22243.32k 92160.97k 260590.30k 400471.62k 474613.69k 479858.77k
aes-256-cbc 411130.34k 451687.22k 469638.53k 471725.87k 472540.95k 471177.45k
aes-256-gcm 198351.23k 435388.37k 570959.46k 620392.81k 632221.72k 630816.99k
aes-256-ocb 191981.68k 603404.07k 1036441.86k 1254013.08k 1342182.23k 1355308.67k
chacha20-poly1305 151109.65k 275360.32k 468868.72k 495142.83k 503227.96k 504269.83k
Odroid-C2, ARM Cortex-A53 (NEON acceleration), 2 GHz, AArch64 mode, 2G RAM, OpenSSL master (to-be-1.1)
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 10262.06k 29336.60k 62571.67k 87548.95k 99074.37k 100018.35k
sha512 7540.40k 29637.11k 68571.21k 119539.12k 152745.54k 155860.28k
blake2s256 6914.23k 27511.27k 45578.94k 54736.87k 58282.55k 58521.26k
blake2b512 5008.73k 20017.75k 51662.25k 67957.66k 74815.75k 75377.29k
aes-256-cbc 40206.14k 47970.14k 50627.05k 51336.65k 51544.50k 51555.35k
aes-256-gcm 23389.15k 26157.46k 27123.45k 27378.10k 27464.90k 27478.46k
aes-256-ocb 35397.02k 40917.59k 42634.34k 43232.19k 43095.87k 43238.57k
chacha20-poly1305 51664.38k 106305.89k 208580.15k 235916.72k 247165.12k 247984.32k
A modern ARM core with NEON performs quite well for AES, and extremely well for ChaCha20-Poly1305 (at ~250 MB/s). SHA-2 is faster than BLAKE2 since AArch64 includes instructions for SHA.
As expected, the chacha20-poly1305 scheme is by far the fastest in software[1]. AES-OCB is faster than GCM but doesn't quite get "nearly as fast" as CBC.
A test on HW with AES-NI and CLMUL would be interesting to see how GCM and OCB compare there.
Update: Thomas' results show that OCB is a good bit faster on his modern Intel. On the i5-520M, which is a bit older (2010ish) OCB is more than twice as fast as GCM.
Update: Added results for a Haswell desktop CPU. The ratios almost exactly match Thomas' results as one would expect (both are Haswell).
Update: Added results for ARM Cortex-A53 (amlogic s905), AArch64
[1] but I still find it surprisingly fast even on the G5.
from borg.
we don't need to compare gcm and cbc modes; cbc does not have auth, so the comparison would be gcm vs. cbc+auth (hmac or whatever).
i am a bit unsure about ocb. although the patent situation seems unproblematic by now, it hindered wide usage until recently, so one could suspect ocb is way less tested in practice than gcm.
also, i am not convinced whether we should wait until openssl 1.1 is widely available and packaged. we could also go for libsodium, which already is available and packaged (but adds extra dependency).
from borg.
i5-4200u with aes-ni, openssl 1.0.2:
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
sha256 36164.96k 96595.22k 207142.23k 288822.61k 327669.08k
sha512 26040.90k 104421.67k 221632.00k 371365.89k 470551.21k
aes-256-cbc 386736.39k 410446.76k 416114.01k 417011.03k 417680.04k
aes-256-gcm 289575.60k 828874.30k 1474338.99k 1708658.35k 1785738.58k
openssl 1.1.0 git master:
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 36976.78k 98577.73k 211887.62k 292908.18k 327972.18k 332016.30k
sha512 24729.87k 100302.29k 216779.00k 368207.19k 467856.04k 477997.74k
blake2s256 25418.42k 107899.43k 224348.93k 315508.10k 353039.58k 372823.38k
blake2b512 27473.39k 108173.14k 317090.85k 488826.20k 580064.60k 587563.01k
aes-256-cbc 386987.58k 410620.12k 416009.47k 418933.81k 417838.42k 412663.81k
aes-256-gcm 272809.14k 721350.98k 1485755.59k 1755807.06k 1979094.36k 2008274.26k
aes-256-ocb 258501.45k 993106.39k 1864350.38k 2342698.67k 2612682.75k 2639790.08k
chacha20-poly1305 190241.36k 332894.29k 709114.61k 1299729.07k 1386345.81k 1397286.70k
from borg.
also, i am not convinced whether we should wait until openssl 1.1 is widely available and packaged. we could also go for libsodium, which already is available and packaged (but adds extra dependency).
I used OpenSSL here mainly because it's a convenient way to test it: While on x86 I don't expect performance differences between *ssl and NaCl/libsodium, re-testing should be done with the library actually used in the end to ensure it has the performance level we expect(ed).
from borg.
Somehow embarrassing that we can encrypt+auth 4-8 times faster than we can compute any easily and separately available hash.
from borg.
AES is cheating with its dedicated per-round instructions :D One could use a hash/MAC constructed from AES, but they all have many more caveats than typical MACs in my perception.
Another thing to consider is that more recent ARM chips also include acceleration for AES. Newer Raspis (at least the v3) are running on an A53 core that includes that.
from borg.
I added another set of results above, for a 1st gen i5 (and also some for the previous Core2 processor). Generally in line with other observations, except...
Update: Thomas' results show that OCB is a good bit faster on his modern Intel. On the i5-520M, which is a bit older (2010ish) OCB is more than twice as fast as GCM.
from borg.
Wow, that's a surprising result. It's just a pity that it likely will take quite some time until aes-ocb (openssl 1.1) is widely available and packaged - and by then many of these 1st gen Core-i machines might be gone anyway.
from borg.
About AES-GCM, see Black Hat 2016, paper is public on iacr: "nonce disrespecting adversaries"
https://twitter.com/tqbf/status/760907360319660032
from borg.
http://www.nongnu.org/lzip/xz_inadequate.html interesting article, includes "longer is not necessarily better" fact about CRCs/hashes.
from borg.
While the analysis seems to be accurate for a compressed format, it doesn't really apply to Borg, since both the MAC and the repository-level CRC are done after the data may have been compressed. There is no equivalent to Pudd in Borg. The only kind of false positives we can get is where the checksum was corrupted but the data is actually not corrupted. Here the usual len(checksum) vs len(data) applies.
from borg.
Intel Cannon Lake (due early 2018 or so) will reportedly support the Intel SHA-2 extensions. These instructions (already available on some lower end Atom chips) reportedly perform at about 4 cpb on these (I don't know if the CL implementation would be faster).
from borg.
The AMD Ryzen 1800X cpu has this cpu flag: sha_ni : SHA1/SHA256 Instruction Extensions
from borg.
keccak can be used for single-pass authenticated encryption.
from borg.
+1 for libsodium, discussed at the beginning of this issue. It's really good (and modern), especially if you have no hardware AES support; ChaCha20 e.g. is much faster.
from borg.
http://preshing.com/20110504/hash-collision-probabilities/
from borg.
The benchmarks from #45 (comment) and #45 (comment) are benchmarking the implementation of BLAKE2 in openssl, which is probably much slower than some other implementations (including, perhaps, future upgrades of openssl).
For machines with AVX2 like the Xeon E3, https://github.com/sneves/blake2-avx2 reports ~3 cycles per byte, which is about 1.25 times as fast as shown above. For the ARM Cortex-A53, https://bench.cr.yp.to/results-hash.html#aarch64-par3 shows ~6 cycles per byte, which is about 4.4 times as fast as shown above.
from borg.
The BLAKE2 implementation in OpenSSL (and Python, Borg 1.1, libb2, most other stuff as well) is the reference implementation (with no vectorization whatsoever).
The other (SSE4, AVX, etc.) optimised implementations are not included in Borg (and most other packages mentioned above), because everything but AVX2 has practically no improvement in performance since Haswell. (It's a pretty darn good superscalar processor)
AVX2 is not shipped, because no one sat down yet and said "Yep, I declare this thing non-experimental" (all of them are marked experimental by their authors). The sole exception is libsodium which ships an experimental AVX2 implementation in the default configuration.
AVX2 is also not shipped in Borg because that's a huge PITA to do; see #2422 and the "crc32*" files here (esp. crc32_dispatch.c).
Ideally someone else (Python, OpenSSL) decides to ship a better implementation at some point and "someone" adds the necessary code to make Borg use OpenSSL's BLAKE2. Everyone wins. We might also use libsodium in the future. Though I don't know what to make of it yet.
from borg.
another one against gcm:
http://www.metzdowd.com/pipermail/cryptography/2016-March/028824.html
guess we'll just use ocb and chacha/poly.
from borg.
t1ha - Fast Positive Hash. In most cases up to 15% faster than City, xxHash, mum-hash, metro-hash, etc.
from borg.
blake3 seems to have nice speed.
pkgconfig support still missing?
library availability on linux (and other) dists?
platform compatibility?
https://github.com/BLAKE3-team/BLAKE3
from borg.
High performance fork of zlib
https://github.com/zlib-ng/zlib-ng/
from borg.
So are there any plans of adding any of these algorithms in the near future?
from borg.
@deathtrip some of the stuff we talked about is already done (like blake2b, openssl 1.1), some might come after 1.2, when we target crypto changes.
keep in mind that adding new code is always a burden on maintenance and also some risk (bugs?), so we try to find a good balance between pros and cons.
do you think something specific would be useful to add?
note: I added some updates (in bold) to some of the posts above and deleted a few now irrelevant ones.
from borg.
do you think something specific would be useful to add?
I think blake3 and zlib-ng look like good candidates. They were the ones i thought about when i asked the question.
from borg.
zlib
we use zlib via Python stdlib currently. also, i guess it is mostly for backwards compatibility with repos created with early borg and attic versions (when there was no lz4 and zstd yet) and users wanting something more advanced rather use lz4 and zstd?
so not sure what we would win by using zlib-ng, considering that we might have to bundle all of its code because it is not available on all platforms we support.
but in general (had a quick look), it looks like zlib-ng is a good project, so maybe some day python will switch from zlib to zlib-ng and we'll get it automagically.
blake3
C: no libs (like libb3?) for it yet, so we also would need to bundle all of its C code.
using blake3-py would be an option though (and we would not have to deal with the compiling / with rust).
can somebody measure how much faster b3 is (compared to b2) when used single-threaded?
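A quick way to answer this measurement question, assuming the blake3-py package mentioned above is installed (its default one-shot/.update() path is single-threaded; multithreading is opt-in, so this is the comparison we want):

```python
import hashlib
import os
import timeit

data = os.urandom(8 * 1024 * 1024)  # 8 MiB buffer

def bench(fn) -> float:
    # time 10 hashing passes over the buffer
    return timeit.timeit(lambda: fn(data), number=10)

t_b2 = bench(lambda d: hashlib.blake2b(d).digest())
try:
    import blake3  # https://github.com/oconnor663/blake3-py
    t_b3 = bench(lambda d: blake3.blake3(d).digest())
    print(f"b3 speedup over b2 (single-threaded): {t_b2 / t_b3:.2f}x")
except ImportError:
    print("blake3 package not installed")
```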
from borg.
blake3
C: no libs (like libb3?) for it yet, so we also would need to bundle all of its C code.
Here are instructions on how to build the libraries. Seems that for x86 the assembly library could be used, as the C one doesn't support multithreading yet.
from borg.
@deathtrip what i meant is lib availability on the usual OSes people use out there; i didn't want to build/package the lib myself.
also, i'd first like a perf comparison that is single-threaded, to see how much faster b3 is than b2 without going to multi-threading.
from borg.
PSA: execution times [s] for hashing 1 GB of random bytes on Apple M1 (MacBook Air 2020, Python 3.8.9):
sha256 2.95
sha512 1.80
blake2b 1.06 (fastest)
blake2s 1.75
% file `which python`
...env/bin/python: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit executable x86_64] [arm64:Mach-O 64-bit executable arm64]
...env/bin/python (for architecture x86_64): Mach-O 64-bit executable x86_64
...env/bin/python (for architecture arm64): Mach-O 64-bit executable arm64
Note:
- sha is slower than blake2b, so it seems that no SHA hw acceleration is used.
- tried `pip install blake3`, but there is no binary wheel for M1 yet.
from borg.
LibreSSL 2.8.3 on macOS 12.0b4, Apple M1 CPU
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
sha256 67477.74k 154138.47k 262916.43k 319208.81k 340417.22k
sha512 57932.16k 231396.08k 350646.38k 481570.49k 546502.26k
aes-256-cbc 228120.47k 237817.67k 234978.42k 239252.78k 240820.57k
aes-256-gcm 142843.72k 145801.57k 143040.33k 142580.08k 142298.70k
aes-256-ctr 251498.14k 268501.38k 270107.85k 271982.03k 273654.51k
chacha20-poly1305 43378.82k 169592.00k 287290.88k 345544.80k 369688.69k
Note: no aes-ocb, no blake2b.
from borg.
why is sha256 significantly slower than sha512 here?
from borg.
10GB/s AES in counter mode, woah! 3GB/s chacha also quite fast.
from borg.
Had a quick test with the pypi `blake3` package on Apple MBA, macOS 12, M1 CPU:
hmac-sha256 1GB 0.681s
blake2b-256 1GB 2.417s
blake3-256 1GB 1.070s
Notable:
- sha256 is CPU hw accelerated, thus super fast, faster than sw blake2 / blake3
- blake3 much faster than blake2
blake3
pypi even has wheels for macOS arm64
from borg.
Would be cool if PR #6463 could get some review.
from borg.
About adding blake3 support via https://github.com/oconnor663/blake3-py to borg:
How much platform / compile / installation and packaging issues would we likely get by doing so?
- due to it using rust? there is also C code, but "experimental".
- for pip installs, if no platform wheel is available?
- esp. considering the less widespread platform, like netbsd, openbsd, openindiana?
Other options for blake3 support?
Didn't find a libb(lake)3(-dev) package on ubuntu, debian, fedora.
Issue on the python tracker: https://bugs.python.org/issue39298
from borg.
I played a bit around with blake3:
- the portable (generic) blake3 C code is only a bit faster than the blake2b code we already use
- didn't get the NEON accelerated C code to compile
- code from the blake3-py (rust-based) pypi package (macOS ARM wheel) showed about double the blake2b performance
from borg.
Had a quick test with pypi `blake3` package on Apple MBA, macOS 12, M1 CPU:
hmac-sha256 1GB 0.681s
blake2b-256 1GB 2.417s
blake3-256 1GB 1.070s
Notable:
* sha256 is CPU hw accelerated, thus super fast, faster than sw blake2 / blake3
* blake3 much faster than blake2
* `blake3` pypi even has wheels for macOS arm64
This is even more impressive given the fact that HMAC runs SHA256 twice. It would be interesting to compare SHA256 with the SHA extensions against Blake2 with AVX2 (M1 does not have AVX2), although I do not know if hashlib's Blake2 implementation already makes use of AVX2. Unfortunately, I currently have neither the SHA extensions nor AVX2. Maybe I can add a benchmark when I get a new machine. Maybe someone else already has the possibility?
from borg.
This is even more impressive given the fact that HMAC runs SHA256 twice.
The hash function is invoked twice in HMAC, yes, but the message is only hashed once. The outer hash fn invocation only processes the outer key and inner hash.
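For illustration, the HMAC construction (RFC 2104) can be written out explicitly; the outer invocation only ever hashes one padded key block plus the 32-byte inner digest, regardless of message size (sketch using stdlib `hashlib`/`hmac`, not borg code):

```python
# HMAC-SHA256 from first principles, to show why the message is hashed only once.
import hashlib
import hmac

def hmac_sha256(key: bytes, msg: bytes) -> bytes:
    block = 64  # SHA-256 block size in bytes
    if len(key) > block:
        key = hashlib.sha256(key).digest()
    key = key.ljust(block, b"\x00")
    ipad = bytes(k ^ 0x36 for k in key)
    opad = bytes(k ^ 0x5C for k in key)
    inner = hashlib.sha256(ipad + msg).digest()   # hashes the full message once
    return hashlib.sha256(opad + inner).digest()  # hashes only 64 + 32 = 96 bytes

# matches the stdlib implementation
assert hmac_sha256(b"key", b"msg") == hmac.new(b"key", b"msg", hashlib.sha256).digest()
```

so for long messages, HMAC costs essentially one pass of the underlying hash over the data plus a constant-size overhead.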
Alder Lake results look basically the same as Zen 3 above, except equivalent performance at lower clocks and lower power.
from borg.
Also interesting: while sha512 used to be faster than sha256 in pure sw implementations, it's the other way around with sha2 hw acceleration, and hw-accelerated sha256 is even faster than pure sw blake2 (as expected).
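this is easy to reproduce with stdlib `hashlib` (a quick timing sketch, not borg code): on 64-bit CPUs without SHA extensions, sw sha512 usually wins because it processes 128-byte blocks with 64-bit word operations, while sha256 works on 64-byte blocks with 32-bit words; SHA-NI only accelerates sha1/sha256, which flips the ranking:

```python
# Compare software hashing speed of sha256 / sha512 / blake2b on this machine.
import time
import hashlib

def mb_per_s(name, data):
    """MB/s for one pass of hash `name` (a hashlib algorithm name) over data."""
    h = hashlib.new(name)
    t0 = time.perf_counter()
    h.update(data)
    h.digest()
    return len(data) / (time.perf_counter() - t0) / 1e6

data = b"\xab" * (64 * 1024 * 1024)  # 64 MiB
for name in ("sha256", "sha512", "blake2b"):
    print("%-8s %8.1f MB/s" % (name, mb_per_s(name, data)))
```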
from borg.
might be interesting
https://github.com/Blosc/c-blosc2
Blosc (c-blosc2) is a high-performance compressor focused on binary data for efficient storage of large binary data-sets in-memory or on-disk and helping to speed-up memory-bound computations.
from borg.
from borg.