deduplication's Introduction

deduplication

Fast multi-threaded content-dependent chunking deduplication for Buffers in C++ with a reference implementation in Javascript. Ships with extensive tests, a fuzz test and a benchmark.

Installation

npm install @ronomon/deduplication

Fast

@ronomon/deduplication is an adaptation of FastCDC written for Node.js as a native addon in C++.

FastCDC is about 10× faster than the best of open-source Rabin-based CDC, and about 3× faster than the state-of-the-art Gear- and AE-based CDC, while achieving nearly the same deduplication ratio as the classic Rabin-based approach.

@ronomon/deduplication achieves chunking speeds comparable to FastCDC. The benchmark results below also include the overhead of SHA-256 hashing:

           CPU: Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz
         Cores: 8
       Threads: 8
         Files: 64 x 4194304 Bytes

============================================================

         Chunk: 2048 Bytes (+1.42% E)
         Ratio: 97.96%
    Javascript: Latency: 196.375ms, Throughput: 160.36 MB/s
        Native: Latency: 30.911ms, Throughput: 1028.49 MB/s

         Chunk: 4096 Bytes (+7.57% E)
         Ratio: 97.40%
    Javascript: Latency: 164.572ms, Throughput: 192.70 MB/s
        Native: Latency: 30.058ms, Throughput: 1073.74 MB/s

         Chunk: 8192 Bytes (-1.76% E)
         Ratio: 96.75%
    Javascript: Latency: 150.184ms, Throughput: 211.03 MB/s
        Native: Latency: 29.742ms, Throughput: 1086.78 MB/s

         Chunk: 16384 Bytes (-4.88% E)
         Ratio: 95.19%
    Javascript: Latency: 142.566ms, Throughput: 222.58 MB/s
        Native: Latency: 29.792ms, Throughput: 1091.20 MB/s

         Chunk: 32768 Bytes (+20.57% E)
         Ratio: 89.73%
    Javascript: Latency: 139.043ms, Throughput: 227.68 MB/s
        Native: Latency: 29.426ms, Throughput: 1104.67 MB/s

         Chunk: 65536 Bytes (+3.61% E)
         Ratio: 84.11%
    Javascript: Latency: 138.047ms, Throughput: 229.63 MB/s
        Native: Latency: 29.470ms, Throughput: 1100.15 MB/s

         Chunk: 131072 Bytes (-16.45% E)
         Ratio: 75.44%
    Javascript: Latency: 135.161ms, Throughput: 233.42 MB/s
        Native: Latency: 29.472ms, Throughput: 1100.15 MB/s

Multi-threaded

All chunking and hashing algorithms are executed asynchronously in the Node.js threadpool for multi-core throughput, without blocking the event loop. This effectively treats the event loop as the control plane and the threadpool as the data plane. Multiple source buffers can be deduplicated across multiple threads by simply calling deduplicate() concurrently from the event loop. The number of deduplicate() calls in flight will be limited by the size of the threadpool, and further calls to deduplicate() will wait for these to finish before executing. Please see the crypto-async module for advice on increasing the size of the Node.js threadpool.
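
The sketch below is illustrative only; it assumes the package's top-level export exposes the same deduplicate() and targetSize() functions used in the demo further down, and deduplicateBuffer() is a hypothetical helper. It shows several independent source buffers being chunked and hashed concurrently simply by issuing the calls back to back from the event loop. The threadpool size itself can only be raised before any threadpool work has been scheduled, for example by setting the UV_THREADPOOL_SIZE environment variable when launching the process (UV_THREADPOOL_SIZE=16 node app.js):

var Deduplication = require('@ronomon/deduplication');

// Recommended chunk size constants (see "Compression and average chunk size"):
var average = 65536;
var minimum = Math.round(average / 4);
var maximum = average * 8;

// Deduplicate a single self-contained buffer (flags=1 marks it as the last
// and only source buffer, so the final chunk may be smaller than minimum):
function deduplicateBuffer(source, end) {
  var target = Buffer.alloc(Deduplication.targetSize(minimum, source.length));
  Deduplication.deduplicate(
    average, minimum, maximum,
    source, 0, source.length,
    target, 0,
    1, // flags
    function(error, sourceOffset, targetOffset) {
      if (error) return end(error);
      // target[0...targetOffset) now holds the [hash][size] records:
      end(undefined, target.slice(0, targetOffset));
    }
  );
}

// Issuing several calls from the event loop lets the threadpool spread the
// chunking and hashing work across multiple cores:
var buffers = [
  Buffer.alloc(4 * 1024 * 1024, 'a'),
  Buffer.alloc(4 * 1024 * 1024, 'b'),
  Buffer.alloc(4 * 1024 * 1024, 'c')
];
buffers.forEach(function(source, index) {
  deduplicateBuffer(source, function(error, records) {
    if (error) throw error;
    console.log('buffer ' + index + ': ' + (records.length / 36) + ' chunks');
  });
});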

Content-dependent chunking

Compared to fixed size chunking, which fails to detect most of the same chunk cut-points when file content is shifted slightly, variable size content-dependent chunking can find most chunk cut-points no matter how the chunks move around. You can tune the absolute minimum and maximum chunk sizes required, as well as the expected average chunk size required.
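
One quick way to see this shift-resistance in practice is the hypothetical script below (again assuming the package's top-level deduplicate() and targetSize() exports; chunkHashes() is an illustrative helper): it chunks a random buffer, prepends a single byte, chunks the result again, and counts how many chunk hashes survive. With fixed size chunking every chunk after the insertion point would change, whereas content-dependent chunking should let most chunk hashes reappear:

var crypto = require('crypto');
var Deduplication = require('@ronomon/deduplication');

var average = 8192;
var minimum = Math.round(average / 4);
var maximum = average * 8;

// Chunk a self-contained buffer and collect the set of chunk hashes:
function chunkHashes(source, end) {
  var target = Buffer.alloc(Deduplication.targetSize(minimum, source.length));
  Deduplication.deduplicate(
    average, minimum, maximum, source, 0, source.length, target, 0, 1,
    function(error, sourceOffset, targetOffset) {
      if (error) return end(error);
      var hashes = new Set();
      for (var offset = 0; offset < targetOffset; offset += 36) {
        hashes.add(target.toString('hex', offset, offset + 32));
      }
      end(undefined, hashes);
    }
  );
}

var original = crypto.randomBytes(4 * 1024 * 1024);
var shifted = Buffer.concat([crypto.randomBytes(1), original]);
chunkHashes(original, function(error, before) {
  if (error) throw error;
  chunkHashes(shifted, function(error, after) {
    if (error) throw error;
    var unchanged = 0;
    before.forEach(function(hash) { if (after.has(hash)) unchanged++; });
    console.log(unchanged + ' of ' + before.size + ' chunks survive the shift');
  });
});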

While content-dependent chunking is more CPU-intensive than fixed size chunking, the chunking algorithm of @ronomon/deduplication can detect chunks at a rate of more than 1.5 GB per second per CPU core, roughly 2.5× faster than the SHA-256 hashing algorithm by comparison.

The following optimizations and variations on FastCDC are involved in the chunking algorithm:

  • 31 bit integers to avoid 64 bit integers for the sake of the Javascript reference implementation.

  • A right shift instead of a left shift to remove the need for an additional modulus operator, which would otherwise have been necessary to prevent overflow.

  • Masks are no longer zero-padded since a right shift is used instead of a left shift (see the sketch after this list).

  • A more adaptive threshold based on a combination of average and minimum chunk size (rather than just average chunk size) to decide the pivot point at which to switch masks. A larger minimum chunk size now switches from the strict mask to the eager mask earlier.

  • Masks use 1 bit of chunk size normalization instead of 2 bits of chunk size normalization.
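
The following is a minimal sketch of the shift direction only, not the library's exact code; TABLE, gearLeft and gearRight are illustrative names, and TABLE is assumed to hold 256 pseudo-random 31-bit integers:

// 256 pseudo-random 31-bit table entries, one per possible byte value:
var TABLE = new Array(256);
for (var i = 0; i < 256; i++) TABLE[i] = (Math.random() * 0x7FFFFFFF) | 0;

// Classic FastCDC (left shift): a new byte enters at the low bits and older
// bytes drift toward the high bits, so the judgement mask must be zero-padded
// (spread over the high bits) to see the whole window, and the >>> 0
// truncation plays the role of the extra modulus needed to prevent overflow.
function gearLeft(hash, byte) {
  return ((hash << 1) + TABLE[byte]) >>> 0;
}

// Right-shift variant (as used here): older bytes drift toward the low bits,
// so a plain low-order mask already mixes the whole window, no zero-padding
// is required, and the value stays small enough that no separate modulus
// step is needed in the Javascript reference implementation.
function gearRight(hash, byte) {
  return (hash >>> 1) + TABLE[byte];
}

// In both cases a cut-point is declared when the masked bits are all zero:
//   if (!(hash & mask)) { /* chunk boundary */ }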

Deduplication

The 32 byte SHA-256 hash followed by the 4 byte UInt32BE size of each consecutive chunk will be written into the target buffer provided. You can use the SHA-256 hash combined with your own indexing scheme to determine whether a chunk should be stored on disk or transmitted across the network, and so reduce storage and bandwidth costs. You should apply deduplication before you apply compression.
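
Each record written into the target buffer is therefore 36 bytes. A minimal parse of the region [0, targetOffset) reported by the deduplicate() callback might look like the hypothetical helper below (the demo further down performs the same loop in context):

// Parse [32-byte SHA-256 hash][4-byte UInt32BE chunk size] records:
function parseTarget(target, targetOffset) {
  var chunks = [];
  var offset = 0;
  while (offset < targetOffset) {
    chunks.push({
      hash: target.toString('hex', offset, offset + 32),
      size: target.readUInt32BE(offset + 32)
    });
    offset += 36;
  }
  return chunks;
}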

Compression and average chunk size

Compression and deduplication work in tension. Larger average chunk sizes achieve better compression ratios, while smaller average chunk sizes achieve better deduplication ratios. An average chunk size of 64 KB is recommended for optimal end-to-end deduplication and compression efficiency, according to the recommendations of Primary Data Deduplication - Large Scale Study and System Design by Microsoft. An average chunk size of 64 KB will not only maximize the combined savings from deduplication and compression, but will also minimize metadata overhead through reducing the average number of chunks considerably compared to typical average chunk sizes of 4 KB and 8 KB.

Invariants

Most of these invariants are enforced with exceptions rather than asynchronous errors since they represent contract errors rather than operational errors:

  • @ronomon/deduplication will ensure that all chunks meet your minimum and maximum chunk size requirements (except for the last chunk in the last source buffer, which you can indicate by flags=1), and that the actual average chunk size is within ±20% of your expected average chunk size.

  • When tuning average, minimum and maximum chunk sizes, please ensure that the (maximum - minimum > average) invariant holds so that cut-points are not artificially forced instead of being content-dependent (a pre-flight check is sketched after this list).

  • Please ensure that average, minimum and maximum chunk sizes are within the reasonable and inclusive bounds determined by the respective _MIN and _MAX constants defined in the Javascript reference and C++ implementations.

  • All integers, offsets and sizes must be at most 31 bits (2 GB) to avoid overflow and to optimize the Javascript reference implementation. Note that this does not place a limit on the size of file which can be deduplicated, since a file can be deduplicated in streaming fashion through multiple calls to deduplicate() (setting flags=1 when the last source buffer is provided).

  • When deduplicating a file in streaming fashion through multiple calls to deduplicate(), please ensure that your source buffer is larger than the maximum chunk size required until such time as you set flags=1 to indicate that the last source buffer has been provided (when it can be smaller).
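
As an illustrative pre-flight check only, the invariants above can be asserted before calling deduplicate(). The bounds object and validateChunkSizes() below are hypothetical; bounds stands in for the _MIN and _MAX constants defined in the Javascript reference and C++ implementations, whose concrete values are not reproduced here:

var assert = require('assert');

// Illustrative only: check the documented invariants up front so that a bad
// configuration fails fast with a contract error.
function validateChunkSizes(average, minimum, maximum, bounds) {
  // Inclusive bounds taken from the implementation's _MIN and _MAX constants:
  assert(average >= bounds.averageMin && average <= bounds.averageMax);
  assert(minimum >= bounds.minimumMin && minimum <= bounds.minimumMax);
  assert(maximum >= bounds.maximumMin && maximum <= bounds.maximumMax);
  // Cut-points should stay content-dependent rather than forced by the limits:
  assert(maximum - minimum > average);
  // All integers, offsets and sizes must fit in 31 bits:
  assert(maximum <= Math.pow(2, 31) - 1);
}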

Usage

Please try out the included demo (node demo.js <file>):

var assert = require('assert');
var fs = require('fs');
var path = require('path');
var Deduplication = require(path.join(module.filename, '..', 'binding.node'));

var file = process.argv[2];
if (!file) {
  console.error('usage: node demo.js <file>');
  return;
}
var fd = fs.openSync(file, 'r');
var fileOffset = 0;
var chunkOffset = 0;

// Recommended average, minimum and maximum chunk size constants:
var average = 65536;
var minimum = Math.round(average / 4);
var maximum = average * 8;

var source = Buffer.alloc(4 * 1024 * 1024);
var target = Buffer.alloc(Deduplication.targetSize(minimum, source.length));

function close(error) {
  fs.closeSync(fd);
  if (error) throw error;
  assert(chunkOffset === fileOffset);
}

function read(sourceStart) {
  var length = source.length - sourceStart;
  assert(length > 0);
  var bytesRead = fs.readSync(fd, source, sourceStart, length, fileOffset);
  fileOffset += bytesRead;
  var flags = (bytesRead < length) ? 1 : 0;
  write(sourceStart + bytesRead, flags);
}

function write(sourceSize, flags) {
  Deduplication.deduplicate(
    average,
    minimum,
    maximum,
    source,
    0,          // sourceOffset
    sourceSize,
    target,
    0,          // targetOffset
    flags,      // flags=1 marks the last source buffer
    function(error, sourceOffset, targetOffset) {
      if (error) return close(error);
      assert(sourceOffset <= sourceSize);
      assert(sourceOffset <= source.length);
      assert(targetOffset <= target.length);
      var offset = 0;
      while (offset < targetOffset) {
        var hash = target.toString('hex', offset, offset + 32);
        offset += 32;
        var size = target.readUInt32BE(offset);
        offset += 4;
        console.log(
          'hash=' + hash + ' offset=' + chunkOffset + ' size=' + size
        );
        chunkOffset += size;
      }
      assert(offset === targetOffset);
      // Anything remaining in the source buffer should be moved to the
      // beginning of the source buffer, and become the sourceStart for the
      // next read so that we do not read data we have already read:
      var remaining = sourceSize - sourceOffset;
      if (remaining > 0) {
        // Assert that we can copy directly within the source buffer:
        assert(remaining < sourceOffset);
        source.copy(source, 0, sourceOffset, sourceOffset + remaining);
      }
      if (flags === 1) {
        assert(remaining === 0);
        close();
      } else {
        read(remaining);
      }
    }
  );
}

read(0);

Tests

To test the native and Javascript bindings:

node test.js

Benchmark

To benchmark the native and Javascript bindings:

node benchmark.js


deduplication's Issues

Shift-Right in Gear rollsum messes with resynchronisation after changes.

This implementation of FastCDC has changed the Gear rollsum to use a right-shift instead of a left-shift. This is done so that the least-significant bits of the hash include data from all 32 previous bytes in the rolling window, so that "zero-padding" of the hash judgement mask is not required.

However, this is problematic because it means bytes don't fully "expire" from the hash, so it no longer only includes the last 32 bytes, and bytes from the very start of the block can modify the hash. This is because the addition in the Gear rollsum also
propagates some bit changes upwards, so the shift-right might never fully expire a byte out of the hash. In some ways this improves the entropy/distribution of the hash, but it also means a change early in a chunk can change the break point at the end, messing with resynchronization.

One way to address this would be to replace the addition with an xor, so that bit changes don't propagate upwards. This would make it like a simplified buzzhash. However, this would further weaken the hash, and Gear is already very weak.

The number of bits in the Gear rollsum directly limits how many bytes are included in the sliding window. Reducing the number of bits from 32 to 31 directly reduces the sliding window size. The Gear rollsum is already a not particularly strong rollsum, so this could matter. In languages that don't have uint32 types this might be unavoidable, but copying this in languages without that limitation is unnecessarily limiting the rollsum strength.

This implementation has been widely copied by other implementations, so this problem affects at least the following implementations as well;

For the record, FastCDC's zero-padding solution when using a left-shift is also IMHO sub-optimal. We know that the most significant bits include the most information about the past 32 bytes, so the best solution is simply use the N most significant bits for a 2^N expected size. This can be done with a mask to check the upper bits are zero as in the FastCDC paper like this;

  MASK = (-1 << (32-N))

  if (!(hash & MASK))
    <it's a chunk boundary>

But a shift-right is probably just as fast and an even simpler way to do the same thing;

  SHIFT = 32 - N

  if (!(hash >> SHIFT))
     <it's a chunk boundary>

However, just as fast but even more flexible is to check if the hash value is smaller than a threshold. This allows arbitrary expected size values, not just powers-of-2. It uses all the bits in the hash and treats the most significant bits as most important. This matches nicely with the characteristics of the Gear rollsum;

  THRESHOLD = 2^32 / tgt_size

  if (hash < THRESHOLD)
     <it's a chunk boundary>

Note the above suggestions assume uint32 support. They will need obvious modifications if using a 31 bit hash.

Would a Promise-based example be useful?

The demo is all well and good, but for an application, something based on Promises seems more suitable. I offer this bit of code that is working for me:

function findFileChunks (infile, average) {
  const fd = fs.openSync(infile, 'r')
  const minimum = Math.round(average / 2)
  const maximum = average * 2
  const source = Buffer.alloc(maximum * 2)
  const target = Buffer.alloc(dedupe.targetSize(minimum, source.length))

  return new Promise((resolve, reject) => {
    let flags = 0
    const close = (error) => {
      fs.closeSync(fd)
      if (error) {
        // force the loop to exit
        flags = 1
        reject(error)
      }
    }

    let chunks = []
    let fileOffset = 0
    let chunkOffset = 0
    let sourceStart = 0

    while (flags === 0) {
      const length = source.length - sourceStart
      const bytesRead = fs.readSync(fd, source, sourceStart, length, fileOffset)
      fileOffset += bytesRead
      flags = (bytesRead < length) ? 1 : 0
      const sourceSize = sourceStart + bytesRead
      try {
        dedupe.deduplicate(average, minimum, maximum, source, 0, sourceSize, target, 0, flags,
          (error, sourceOffset, targetOffset) => {
            // n.b. the library throws the error, so this is always undefined
            if (error) {
              close(error)
              return
            }
            let offset = 0
            while (offset < targetOffset) {
              const hash = target.slice(offset, offset + 32)
              offset += 32
              const size = target.readUInt32BE(offset)
              offset += 4
              chunks.push({ hash, offset: chunkOffset, size })
              chunkOffset += size
            }
            // Anything remaining in the source buffer should be moved to the
            // beginning of the source buffer, and become the sourceStart for the
            // next read so that we do not read data we have already read:
            sourceStart = sourceSize - sourceOffset
            if (sourceStart > 0) {
              source.copy(source, 0, sourceOffset, sourceOffset + sourceStart)
            }
            if (flags !== 0) {
              // the last block has finished processing
              close()
              resolve(chunks)
            }
          }
        )
      } catch (err) {
        close(err)
      }
    }
  })
}

For demo purposes, this would need a .then() and some console logging to match the current demo. Let me know if this is useful and I can create a proper pull request. Or not, and feel free to close this.

Observations and Recommendations using this chunker.

This is not so much a bug as general observations and recommendations about using this chunker. I'm putting this here since this chunker is widely copied by other chunker implementations for other languages, and people copying this probably also want to know this. I'm not sure where else this kind of info should go. Apologies if this is annoying. This comes out of some testing and analysis I did of chunker algorithms here;

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

Observations related to this FastCDC implementation are;

  1. The average argument is not really the average block length. The actual average is very close to minimum + average/2 (for reasons explained below).
  2. This implementation changes the FastCDC "transition point" to max(0, average - 1.5*minimum). This means that the transition point is less than minimum unless minimum < 0.4*average.
  3. FastCDC gets most of its speed advantages from "cut-point-skipping" past minimum, and recommends minimum = 0.5*average or even minimum=average. This means for the FastCDC recommended minimum settings that give good speed, the first mask is never used, and only the second mask is used.
  4. When only the second mask is used, FastCDC's "normalized chunking" feature is not used. Since this implementation uses "NC1" with only one bit of normalization, it behaves like a normal exponential chunker with average set to half the value. An exponential chunker's actual average block length (excluding truncation from maximum) is minimum + average, hence this chunker's actual average being minimum + average/2 (provided maximum is set large enough to minimize truncations).
  5. Not using "normalized chunking" is probably a good thing anyway, since my testing suggests that is worse than simple exponential chunking when you compare using the same actual average and minimum chunk sizes.

With all of that in mind, based on my testing, for this chunker the best minimum/average/maximum settings for optimal deduplication with good speed would be minimum=average/2, maximum>2.5*average with an average chunk length of average. For faster chunking speed with slightly reduced deduplication use minimum=average, maximum>3*average with an average chunk length of 1.5*average. Unfortunately this chunker throws an error if minimum >= average, so it doesn't let you do that. However, you can get away with setting minimum = average -1.
