
Tests for assorted rollsum implementations

Competitors

The following rollsum algorithms and implementations are tested:

Comments:

  • I partitioned out the 8k and 256k competitors in case the number of splits made a material difference to performance - in practice it makes some difference, but not enough to change the rankings (as you can see from the results)
  • currently the benchmark times include the initial load of the file - they shouldn't, but for now this is mitigated by doing cat file > /dev/null beforehand to get the file into the page cache
  • the benchmark attempts to be semi-realistic - given a file in cache, how long does it take to load it and determine the split points? In time I'll probably migrate this to exclude actually loading the file (a sketch of that shape of benchmark follows below)
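
To illustrate the planned change, here is a minimal Rust sketch of a benchmark that loads the file up front and times only the scan for split points. `find_splits` is a hypothetical stand-in for any of the tested rollsums, not code from this repo:

```
use std::fs;
use std::time::Instant;

// Hypothetical stand-in for one of the tested rollsums: just emits fixed 8k
// boundaries so the shape of the harness is clear.
fn find_splits(data: &[u8]) -> Vec<usize> {
    data.chunks(8192)
        .scan(0usize, |off, c| {
            *off += c.len();
            Some(*off)
        })
        .collect()
}

fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: bench <file>");
    let data = fs::read(&path)?; // the file load happens before the clock starts
    let start = Instant::now();
    let splits = find_splits(&data); // only the split scan is timed
    println!("{} splits in {:?}", splits.len(), start.elapsed());
    Ok(())
}
```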

Prerequisites

You need Podman or Docker (if the latter, just change the command in the Makefile and it should work).

Clone this repo, get the rollsum implementations and build a Docker image with the dependencies:

$ git clone https://github.com/aidanhs/rollsum-tests.git && cd rollsum-tests && git submodule update --init --recursive
$ make container # create an image for the container

Running Tests

Jump into the container and set up the test environment.

NOTE: all the implementations load the test file for a run fully into RAM before starting work. The test files are all below 1.5GB.

$ podman run -it -v $(pwd):/work --tmpfs /tmp --rm rollsum-tests bash
# make testfiles # generate deterministic test files
[...]
# make rebuilddep # wipe and build all the dependencies
# make preptest # build the assorted test programs
[...]

Finally:

# make test # actually run tests
[...]
HOW TO READ THE RESULTS
- Each cell is `time[err,count,pstdev,mem]`. `time` is in seconds, `err` indicates whether the split result failed
  to match bup exactly, `count` indicates how many splits there were, `pstdev` indicates the standard deviation of the
  split sizes and `mem` indicates max memory in MB.
- Standard deviation is only calculated for tests with >500 splits to try and give a rough judgement on the hash
  algorithm quality (intuitively I think random data should be a happy case for a hash).
- To save screen space, the common results from non-erroring lines are collapsed.

FILE     SIZE  BUP                         RSROBU             BITABU                      PERKEEP            RSROGE                      IPFSRA                      IPFSSPL                   RSROBU256                   RSROGE256                   BITABZ                      ASURANBZ                    IPFSRA256                   IPFSSPL256                IPFSBZ
01.test  4.0K  0.0[/,0c,-,6M]              0.0[",",",2M]      0.0[",",",2M]               0.0[",",",2M]      0.0[",",",3M]               0.0[X,1c,-,6M]              0.0[X,1c,-,6M]            0.0[",",",2M]               0.0[",",",2M]               0.0[",",",2M]               0.0[X,1c,-,2M]              0.0[X,1c,-,6M]              0.0[X,1c,-,6M]            0.0[X,1c,-,6M]
02.test  4.0K  0.0[/,0c,-,7M]              0.0[",",",2M]      0.0[",",",2M]               0.0[",",",2M]      0.0[",",",2M]               0.0[X,1c,-,6M]              0.0[X,1c,-,6M]            0.0[",",",2M]               0.0[",",",2M]               0.0[",",",2M]               0.0[X,1c,-,2M]              0.0[X,1c,-,6M]              0.0[X,1c,-,6M]            0.0[X,1c,-,6M]
03.test  8.0K  0.0[/,1c,-,7M]              0.0[",",",2M]      0.0[",",",2M]               0.0[",",",2M]      0.0[X,0c,-,2M]              0.0[X,1c,-,6M]              0.0[X,1c,-,6M]            0.0[X,0c,-,2M]              0.0[X,0c,-,2M]              0.0[X,0c,-,2M]              0.0[X,1c,-,2M]              0.0[X,1c,-,6M]              0.0[X,1c,-,6M]            0.0[X,1c,-,6M]
04.test  64K   0.01[/,11c,-,7M]            0.0[",",",2M]      0.0[",",",2M]               0.0[",",",2M]      0.0[X,6c,-,2M]              0.0[X,10c,-,6M]             0.0[X,8c,-,6M]            0.0[X,0c,-,2M]              0.0[X,0c,-,2M]              0.0[X,0c,-,2M]              0.0[X,1c,-,2M]              0.0[X,1c,-,6M]              0.0[X,1c,-,6M]            0.0[X,1c,-,6M]
05.test  384K  0.0[/,45c,-,7M]             0.0[",",",3M]      0.0[",",",3M]               0.0[",",",2M]      0.0[X,52c,-,3M]             0.0[X,48c,-,8M]             0.0[X,48c,-,7M]           0.0[X,1c,-,3M]              0.0[X,1c,-,3M]              0.0[X,1c,-,3M]              0.0[X,1c,-,3M]              0.0[X,1c,-,8M]              0.0[X,2c,-,7M]            0.0[X,2c,-,7M]
06.test  1.7M  0.01[/,188c,-,8M]           0.0[",",",4M]      0.0[X,188c,-,5M]            0.0[",",",4M]      0.0[X,211c,-,4M]            0.01[X,203c,-,11M]          0.0[X,206c,-,9M]          0.0[X,8c,-,4M]              0.0[X,7c,-,4M]              0.01[X,7c,-,6M]             0.01[X,6c,-,5M]             0.0[X,6c,-,12M]             0.0[X,7c,-,9M]            0.0[X,7c,-,9M]
07.test  5.5M  0.01[/,711c,7478,12M]       0.0[",",",8M]      0.03[X,711c,7478,9M]        0.02[",",",8M]     0.0[X,729c,8435,8M]         0.04[X,680c,3405,19M]       0.0[X,704c,89,18M]        0.01[X,27c,-,8M]            0.0[X,32c,-,8M]             0.02[X,24c,-,10M]           0.03[X,21c,-,9M]            0.03[X,21c,-,18M]           0.0[X,22c,-,18M]          0.0[X,26c,-,18M]
08.test  16M   0.03[/,2010c,8554,22M]      0.02[",",",18M]    0.1[X,2012c,8554,19M]       0.06[",",",19M]    0.01[X,2178c,8358,19M]      0.11[X,2019c,3438,39M]      0.0[X,2048c,0,39M]        0.03[X,85c,-,19M]           0.02[X,57c,-,18M]           0.08[X,73c,-,21M]           0.09[X,49c,-,20M]           0.1[X,64c,-,39M]            0.0[X,64c,-,39M]          0.02[X,65c,-,39M]
09.test  42M   0.07[/,5094c,8506,47M]      0.07[",",",44M]    0.25[X,5104c,8506,45M]      0.17[",",",44M]    0.02[X,5549c,8232,43M]      0.26[X,5154c,3467,89M]      0.01[X,5255c,31,90M]      0.07[X,213c,-,43M]          0.04[X,160c,-,43M]          0.19[X,163c,-,48M]          0.24[X,135c,-,44M]          0.23[X,155c,-,90M]          0.0[X,165c,-,90M]         0.04[X,172c,-,90M]
10.test  96M   0.18[/,12260c,8176,102M]    0.16[",",",98M]    0.59[X,12265c,8176,99M]     0.39[",",",100M]   0.06[X,12566c,8644,98M]     0.6[X,11970c,3435,199M]     0.03[X,12208c,72,201M]    0.16[X,471c,-,98M]          0.11[X,410c,-,98M]          0.45[X,366c,-,103M]         0.57[X,297c,-,99M]          0.54[X,370c,-,199M]         0.02[X,382c,-,199M]       0.11[X,400c,-,199M]
11.test  205M  0.34[/,26205c,8239,211M]    0.32[",",",207M]   1.3[X,26211c,8239,208M]     0.82[",",",213M]   0.14[X,27406c,8446,207M]    1.3[X,25630c,3425,420M]     0.08[X,26167c,7,422M]     0.35[X,1011c,223003,207M]   0.23[X,803c,263320,207M]    0.96[X,836c,261952,211M]    1.19[X,639c,249044,208M]    1.15[X,799c,111378,418M]    0.03[X,818c,2618,419M]    0.19[X,818c,111677,417M]
12.test  411M  0.72[/,52107c,8292,416M]    0.67[",",",412M]   2.57[X,52088c,8292,414M]    1.69[",",",426M]   0.26[X,54428c,8554,412M]    2.57[X,51216c,3439,835M]    0.14[X,52488c,0,840M]     0.72[X,1981c,212794,412M]   0.47[X,1787c,242932,412M]   1.93[X,1637c,272110,418M]   2.38[X,1365c,231810,414M]   2.27[X,1611c,110133,831M]   0.07[X,1641c,4852,833M]   0.38[X,1672c,110754,831M]
13.test  778M  1.3[/,99447c,8185,784M]     1.26[",",",780M]   4.97[X,99454c,8185,782M]    3.21[",",",806M]   0.5[X,103607c,8485,780M]    5.15[X,97600c,3443,1578M]   0.2[X,99577c,13,1587M]    1.4[X,3680c,228807,780M]    0.92[X,3166c,264622,780M]   3.96[X,3111c,260256,786M]   4.94[X,2623c,233203,781M]   4.58[X,3017c,110748,1570M]  0.12[X,3112c,1101,1574M]  0.77[X,3254c,107475,1568M]
14.test  1.4G  2.42[/,179861c,8153,1414M]  2.26[",",",1410M]  9.28[X,179978c,8153,1412M]  5.87[",",",1456M]  1.01[X,188641c,8448,1410M]  9.66[X,176327c,3444,2850M]  0.46[X,180151c,19,2866M]  2.46[X,6726c,216579,1410M]  1.66[X,5821c,265289,1410M]  6.82[X,5679c,254793,1416M]  8.56[X,4498c,243249,1411M]  8.11[X,5506c,111157,2835M]  0.24[X,5630c,1088,2841M]  1.34[X,5768c,110212,2833M]

You can limit the test files with e.g. make TEST_FILES="test/01.test test/02.test" test.
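
For reference, the `pstdev` column above is (as I read the output) the population standard deviation of the resulting chunk sizes. A minimal sketch of that calculation; the helper name is mine, not the harness's:

```
// Population standard deviation of chunk sizes - hypothetical helper, shown
// only to pin down what the pstdev column measures.
fn pstdev(chunk_sizes: &[f64]) -> f64 {
    let n = chunk_sizes.len() as f64;
    let mean = chunk_sizes.iter().sum::<f64>() / n;
    let variance = chunk_sizes.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    variance.sqrt()
}

fn main() {
    // e.g. three chunks of 7KiB, 8KiB and 9KiB
    println!("{:.0}", pstdev(&[7168.0, 8192.0, 9216.0]));
}
```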

Results Commentary

  • Of the three identical implementations ('bupsplit'):
    • rsroll edges out bup by small but consistent amounts, potentially due to bup's Python overhead
    • perkeep and bita are slow by comparison
  • All of the IPFS implementations seem to be leaking memory pretty badly (~1x the amount being split)
  • IPFS split is a useful baseline against which to consider the other implementations
  • Gear in rsroll and buzhash seem comparable, with the 'winner' fluctuating from run to run - I will be improving the benchmarks over time to make them more consistent
    • given FastCDC is claimed to be significantly faster than gear, this is probably one to look at
  • some of the standard deviations are artificially reduced by min and max chunk size limits - IPFS in particular is very aggressive with this (disabling those limits puts its buzhash and rabin on a par with Bita buzhash), and it remains to be seen in future analyses how much this is crippling the chunking (see the sketch below for how such limits interact with split decisions)
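
The following is an illustrative gear-style chunker showing how min/max chunk limits interact with boundary decisions. The gear table, mask and size limits are placeholder values, not those used by rsroll, bup, bita or the IPFS splitters:

```
// Illustrative gear-style content-defined chunking with min/max chunk limits.
fn gear_table() -> [u64; 256] {
    let mut t = [0u64; 256];
    for (i, e) in t.iter_mut().enumerate() {
        // any fixed pseudo-random table works; this one is a cheap mix of i
        *e = (i as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(31);
    }
    t
}

fn split_points(data: &[u8], min: usize, max: usize, mask: u64) -> Vec<usize> {
    let table = gear_table();
    let mut splits = Vec::new();
    let (mut start, mut hash) = (0usize, 0u64);
    for (i, &b) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(table[b as usize]);
        let len = i + 1 - start;
        // Only consider a boundary once `min` bytes have accumulated, and
        // force one at `max`. Both clamp the spread of chunk sizes, which is
        // why these limits shrink the measured standard deviation.
        if (len >= min && hash & mask == 0) || len >= max {
            splits.push(i + 1);
            start = i + 1;
            hash = 0;
        }
    }
    splits
}

fn main() {
    // deterministic pseudo-random-ish input, just to exercise the chunker
    let data: Vec<u8> = (0..1_000_000u32)
        .map(|x| (x.wrapping_mul(2654435761) >> 24) as u8)
        .collect();
    let cuts = split_points(&data, 2048, 65536, (1 << 13) - 1);
    println!("{} split points", cuts.len());
}
```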

Misc

To regenerate sums:

for f in test/*.test; do ./test_rust rsroll-bup $f | sha1sum > $f.sum; done

Bugs generally:

  • ipfs/notes#1 (comment) - I strongly agree with this comment; I also found most of them were wrong when I was writing my own

Bugs found/comments inspired by this repo:

TODO:
