iscc / fastcdc-py
FastCDC implementation in Python: https://pypi.org/project/fastcdc/
License: MIT License
This implementation copies another implementation that uses a right-shift in its Gear rollsum. That approach has problems, which I reported here:
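The shift-direction difference can be illustrated with a minimal sketch (hedged: the table and function names are illustrative, not the code of either implementation):

```python
# Contrast the canonical left-shift Gear rollsum with the right-shift
# variant referred to above.  Illustrative sketch only.
import random

random.seed(0)
# 256-entry table of random 32-bit values, one per possible byte value.
GEAR = [random.getrandbits(32) for _ in range(256)]
MASK32 = 0xFFFFFFFF

def gear_left(data):
    """Canonical Gear: h = (h << 1) + GEAR[b].  Each byte's contribution
    is shifted toward the high bits and drops out after 32 steps, so a
    mask on the high bits sees a full 32-byte sliding window."""
    h = 0
    for b in data:
        h = ((h << 1) + GEAR[b]) & MASK32
    return h

def gear_right(data):
    """Right-shift variant: h = (h >> 1) + GEAR[b].  Older bytes are
    shifted *down*, so the high bits -- which a cut-point mask typically
    tests -- are dominated by only the most recent byte or two; this is
    the effective-window problem reported above."""
    h = 0
    for b in data:
        h = ((h >> 1) + GEAR[b]) & MASK32
    return h
```

With the left shift, any prefix more than 32 bytes before the current position has no effect on the hash, which is exactly the sliding-window property a CDC rollsum needs.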
Hi,
Could we get a new pip release that targets Python 3.12.3, or at least >=3.12? It's the latest stable release, and the current PyPI release does not allow targeting 3.12 on Windows:
$ pip download --verbose --python-version=3.12 --only-binary=":all:" --platform win_amd64 --dest ~/Downloads/ "fastcdc==1.5.0"
We could skip --only-binary and build from source to get around it, but that can put additional build requirements on the target system installing the package, so having the option to get a binary wheel would be nice.
I've noticed that the fat option appears to be broken when using the Cython version of the library. The pure Python version works correctly. I have not had time to investigate why the Cython version returns different data than the Python version (and I'm not very familiar with Cython anyway). I've added a test to expose the problem and will submit a PR for it. #8
fastcdc 1.4.0
Cython version 0.29.21
Python 3.8.2
Ubuntu 20.04 x64 based distro.
Hey there,
Thanks for a lovely piece of code. I couldn't help but notice a tremendous amount of runtime (at least 20%) spent on string copying, and it seems quite straightforward to resolve. The patch below tidies up the main loop to avoid any copying, netting a 42% runtime decrease on my aging 2015 XPS laptop, with throughput increasing from roughly 664 MiB/s to 1158 MiB/s for my test file (a 12 GB VMware vmdk). At least some of this is explained by the avoidance of copying; the rest is likely due to no longer thrashing the CPU cache.
I would have submitted this as a PR, but it wasn't clear what the correct semantics for the first argument of fastcdc() are, or whether you are happy to use mmap.mmap() at all. This was only tested on Linux, but similar or identical code should work on Windows too. It's also not clear whether there is any value in a fallback mode for situations where mmap() is not available; perhaps there is, but none occur to me just now. There is also a fixed cost to setting up an mmap(), which means that for smaller files it may still make sense for performance reasons to fall back to regular I/O.
diff --git a/fastcdc/fastcdc_cy.pyx b/fastcdc/fastcdc_cy.pyx
index c16ec81..acb8c1a 100644
--- a/fastcdc/fastcdc_cy.pyx
+++ b/fastcdc/fastcdc_cy.pyx
@@ -1,5 +1,6 @@
# -*- coding: utf-8 -*-
cimport cython
+import mmap
from libc.stdint cimport uint32_t, uint8_t
from libc.math cimport log2, lround
from io import BytesIO
@@ -17,9 +18,11 @@ def fastcdc_cy(data, min_size=None, avg_size=8192, max_size=None, fat=False, hf=
# Ensure we have a readable stream
if isinstance(data, str):
- stream = open(data, "rb")
+ with open(data, 'rb') as fp:
+ map = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
+ stream = memoryview(map)
elif not hasattr(data, "read"):
- stream = BytesIO(data)
+ stream = memoryview(data)
else:
stream = data
return chunk_generator(stream, min_size, avg_size, max_size, fat, hf)
@@ -32,17 +35,14 @@ def chunk_generator(stream, min_size, avg_size, max_size, fat, hf):
mask_s = mask(bits + 1)
mask_l = mask(bits - 1)
read_size = max(1024 * 64, max_size)
- blob = memoryview(stream.read(read_size))
offset = 0
- while blob:
- if len(blob) <= max_size:
- blob = memoryview(bytes(blob) + stream.read(read_size))
+ while offset < len(stream):
+ blob = stream[offset:offset + read_size]
cp = cdc_offset(blob, min_size, avg_size, max_size, cs, mask_s, mask_l)
raw = bytes(blob[:cp]) if fat else b''
h = hf(blob[:cp]).hexdigest() if hf else ''
yield Chunk(offset, cp, raw, h)
offset += cp
- blob = blob[cp:]
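The input handling from the patch above can be sketched as a small standalone helper (hedged: the as_buffer name is illustrative, not the library's API; note that ACCESS_READ is the portable constant, while PROT_READ is the Unix-only prot= spelling):

```python
# Hedged sketch of mmap-backed, zero-copy input handling.
import mmap

def as_buffer(data):
    """Return a zero-copy view over the input.

    Accepts a filename or a bytes-like object; stream objects that
    already expose read() are passed through unchanged.
    """
    if isinstance(data, str):
        with open(data, "rb") as fp:
            # Length 0 maps the whole file.  The mapping stays valid
            # after fp is closed because mmap holds its own handle.
            m = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
        return memoryview(m)
    if not hasattr(data, "read"):
        return memoryview(data)
    return data
```

Slicing the returned memoryview then yields views into the page cache rather than copies, which is where the speedup in the patch comes from.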
As a starting point, see:
https://github.com/grantjenks/python-c2f
https://github.com/RalfG/python-wheels-manylinux-build
Also create a workflow for publishing wheels to PyPI.
Hey there,
Here is what you get when you try to install fastcdc without poetry on your system:
$ pip install --user fastcdc
Collecting fastcdc
Downloading fastcdc-1.4.2.tar.gz (19 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
ERROR: Exception:
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 160, in exc_logging_wrapper
status = run_func(*args)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 247, in wrapper
return func(self, options, args)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/commands/install.py", line 400, in run
requirement_set = resolver.resolve(
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
result = self._result = resolver.resolve(
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 481, in resolve
state = resolution.resolve(requirements, max_rounds=max_rounds)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 348, in resolve
self._add_to_criteria(self.state.criteria, r, parent=None)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 172, in _add_to_criteria
if not criterion.candidates:
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/resolvelib/structs.py", line 151, in __bool__
return bool(self._sequence)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
return any(self)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
return (c for c in iterator if id(c) not in self._incompatible_ids)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
candidate = func()
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
self._link_candidate_cache[link] = LinkCandidate(
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 297, in __init__
super().__init__(
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 162, in __init__
self.dist = self._prepare()
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 231, in _prepare
dist = self._prepare_distribution()
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 308, in _prepare_distribution
return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 491, in prepare_linked_requirement
return self._prepare_linked_requirement(req, parallel_builds)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 577, in _prepare_linked_requirement
dist = _get_prepared_distribution(
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 69, in _get_prepared_distribution
abstract_dist.prepare_distribution_metadata(
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/distributions/sdist.py", line 48, in prepare_distribution_metadata
self._install_build_reqs(finder)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/distributions/sdist.py", line 118, in _install_build_reqs
build_reqs = self._get_build_requires_wheel()
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/distributions/sdist.py", line 95, in _get_build_requires_wheel
return backend.get_requires_for_build_wheel()
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/utils/misc.py", line 685, in get_requires_for_build_wheel
return super().get_requires_for_build_wheel(config_settings=cs)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/pep517/wrappers.py", line 173, in get_requires_for_build_wheel
return self._call_hook('get_requires_for_build_wheel', {
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/pep517/wrappers.py", line 319, in _call_hook
raise BackendUnavailable(data.get('traceback', ''))
pip._vendor.pep517.wrappers.BackendUnavailable: Traceback (most recent call last):
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 77, in _build_backend
obj = import_module(mod_path)
File "/opt/homebrew/Cellar/[email protected]/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'poetry'
Not everyone uses poetry, yet. :)
Currently, the fastcdc_(py|cy) functions leave the file descriptor open after returning the generator, since the generator requires it. This can result in a file-descriptor leak on non-CPython interpreters that do not use reference counting. To address this, I propose adding a boolean parameter called closefd to the chunk_generator function, along with some corresponding if code. I am confident that I can make these changes myself once my current pull request #16 is merged.
Alternative: convert the fastcdc_(py|cy) functions into generators themselves and use a with open(...) statement inside.
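The alternative can be sketched in a few lines (hedged: fastcdc_like and chunk_stream are hypothetical stand-in names, and the chunking itself is reduced to fixed reads for brevity):

```python
# Hedged sketch of the generator-based alternative: the entry point is
# itself a generator, so `with open(...)` closes the file
# deterministically on any interpreter, reference-counted or not.
def chunk_stream(stream, read_size=65536):
    """Stand-in for chunk_generator: yields raw read() slices here."""
    while True:
        blob = stream.read(read_size)
        if not blob:
            return
        yield blob

def fastcdc_like(data, read_size=65536):
    """If given a path, own the file handle for the generator's
    lifetime; the with-block closes it even if iteration stops early."""
    if isinstance(data, str):
        with open(data, "rb") as fp:
            yield from chunk_stream(fp, read_size)
    else:
        yield from chunk_stream(data, read_size)
```

Because the with-block lives inside the generator, closing or exhausting the generator triggers the file's close() without relying on garbage collection.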
When a chunk is smaller than min_size, such as for a small file or stream, the reported size is incorrect.
Consider the following example:
import fastcdc

data = b'\x04\xc9KM\x8a\xeaiH\x83\xaf\x01{\xd6\xe1\xab(# \xdb\xaf'  # from os.urandom(20)
print(f'{len(data) = }')
chunks = fastcdc.fastcdc(
    data,
    min_size=1024,       # 1 KiB
    avg_size=4 * 1024,   # 4 KiB
    max_size=16 * 1024,  # 16 KiB
    fat=True,            # for demo
)
chunk = next(chunks)
print(f'{chunk.length = }')
print(f'{len(chunk.data) = }')
print(f'{data == chunk.data = }')
print(f'{fastcdc.__version__ = }')
Out:
len(data) = 20
chunk.length = 1024
len(chunk.data) = 20
data == chunk.data = True
fastcdc.__version__ = '1.4.2'
As you can see, chunk.length is incorrect for a data stream of 20 bytes (20 ≪ 1024). When used with fat=True, I can ascertain the true size, but that needlessly uses extra memory.
Ok, figured it out.
The current implementation will tend to read a buffer well beyond max_size, because it always reads read_size (at least max_size) more bytes. As a result, it then yields multiple chunks in one go. That can be avoided by simply filling the buffer up to the designated size:
blob = memoryview(bytes(blob) + stream.read(max_size - len(blob)))
As far as I can tell from the implementation, chunks should never be larger than max_size (the code splits there regardless of a matching cut point), so this change should not alter the behavior in any way.
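Restated as a standalone loop, the refill logic looks like this (hedged sketch with a stand-in cut-point function instead of the real cdc_offset; the loop is written as while True so it still terminates when a cut consumes the whole buffer):

```python
# Hedged sketch of the buffered refill loop with the fix above: top the
# buffer up to max_size instead of appending a full read_size block, so
# at most one chunk's worth of data is buffered at a time.
def cut_point(blob, max_size):
    """Stand-in: always cut at max_size.  (The real code scans the Gear
    rollsum for a mask match and only falls back to max_size.)"""
    return min(len(blob), max_size)

def chunk_offsets(stream, max_size):
    """Yield (offset, length) pairs, buffering at most max_size bytes."""
    blob = b""
    offset = 0
    while True:
        if len(blob) < max_size:
            # Refill only what is missing from the buffer.
            blob += stream.read(max_size - len(blob))
        if not blob:
            return  # stream exhausted and buffer drained
        cp = cut_point(blob, max_size)
        yield offset, cp
        offset += cp
        blob = blob[cp:]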
[Edit]
This was a question about reducing the number of chunks that are emitted simultaneously.
Re-reading the paper, it became clear that one can effectively disable normalized chunking (NC) by setting min_size, avg_size and max_size to the same fixed value. Chunking is then effectively fixed-size and reliably emits one chunk at a time. That's enough for my purpose.
Sorry for the noise!
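The degenerate case described here can be sketched in a few lines (illustrative only, not the library's code):

```python
# Hedged sketch: when min_size == avg_size == max_size, no
# content-defined cut point can fire before the forced split at
# max_size, so FastCDC effectively degenerates to fixed-size chunking.
def fixed_chunks(data, size):
    """Yield (offset, length) pairs of fixed-size chunks."""
    for offset in range(0, len(data), size):
        yield offset, min(size, len(data) - offset)
```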
Dear FastCDC Developers,
Hello! I am currently using the FastCDC library for some project development and I greatly appreciate your work.
I have a question I would like to ask. I am considering using multiprocessing in my project, but I am unsure whether the FastCDC library can safely operate in a multiprocessing environment. I did not find related information in the documentation, so I thought I would inquire here.
If the FastCDC library can safely operate in a multiprocessing environment, are there any special steps or precautions that need to be taken to ensure its safety? If not, could you recommend any alternatives or solutions?
Thank you very much for your assistance! I look forward to your reply.