fastcdc-py's People

Contributors

bolapara, titusz, xyb


fastcdc-py's Issues

Python 3.12 & pip

Hi.

Could we get a new release on PyPI that targets Python 3.12 (3.12.3 is current, or at least >=3.12)?
It's the latest stable version, and the release currently on PyPI does not allow targeting 3.12 on Windows:

$ pip download --verbose --python-version=3.12 --only-binary=":all:" --platform win_amd64 --dest ~/Downloads/ "fastcdc==1.5.0"

We could skip --only-binary and install from source to get around it, but that would put additional build requirements on the target system installing the package, so having the option to get a prebuilt binary would be nice.
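
For reference, dropping the platform/version pins and letting pip resolve for the local environment does fetch the source distribution (exact output will vary):

$ pip download --dest ~/Downloads/ "fastcdc==1.5.0"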

Cython version does not appear to pass correct data when using the 'fat' option.

I've noticed that the 'fat' option appears to be broken when using the Cython version of the library; the pure Python version works correctly. I haven't had time to investigate why the Cython version returns different data than the Python version (and I'm not very familiar with Cython). I've added a test to expose the problem and will submit a PR for it. #8 A minimal reproduction sketch is included after the version details below.

fastcdc 1.4.0
Cython version 0.29.21
Python 3.8.2
Ubuntu 20.04 x64 based distro.
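
A minimal reproduction sketch (the module paths fastcdc.fastcdc_py / fastcdc.fastcdc_cy and the Chunk attributes are assumptions based on this issue tracker, not verified against the source):

import os

from fastcdc.fastcdc_py import fastcdc_py  # assumed pure-Python entry point
from fastcdc.fastcdc_cy import fastcdc_cy  # assumed Cython entry point

data = os.urandom(1024 * 1024)  # 1 MiB of random input

# With fat=True each chunk carries the raw bytes it covers; the two
# implementations should therefore yield identical (offset, data) pairs.
py_chunks = [(c.offset, c.data) for c in fastcdc_py(data, avg_size=8192, fat=True)]
cy_chunks = [(c.offset, c.data) for c in fastcdc_cy(data, avg_size=8192, fat=True)]

assert py_chunks == cy_chunks  # fails on the affected versions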

Reduce memory copying in Cython version

Hey there,

Thanks for a lovely piece of code. I couldn't help but notice a tremendous amount of runtime (at least 20%) wasted on string copying, and it seems quite straightforward to resolve. The patch below tidies up the main loop to avoid any copying, netting a 42% runtime decrease on my crappy old 2015 XPS laptop, with throughput increasing from roughly 664 MiB/s to 1158 MiB/s for my test file (a 12 GB VMware vmdk). At least some of this is explained by avoiding the copying; the rest is likely due to no longer thrashing the CPU cache.

I would have submitted this as a PR, but it wasn't clear what the correct semantics for the first argument of fastcdc() are, or whether you are happy to use mmap.mmap() at all. This was only tested on Linux, but similar or identical code should work fine on Windows too. It's also not clear whether there is any value in a fallback mode for situations where mmap() is not available; perhaps there are such situations, but none occur to me just now. There is also a fixed cost to setting up an mmap(), which means that for smaller files it may still make sense, for performance reasons, to fall back to regular IO (a sketch of such a fallback appears after the diff below).

diff --git a/fastcdc/fastcdc_cy.pyx b/fastcdc/fastcdc_cy.pyx
index c16ec81..acb8c1a 100644
--- a/fastcdc/fastcdc_cy.pyx
+++ b/fastcdc/fastcdc_cy.pyx
@@ -1,5 +1,6 @@
 # -*- coding: utf-8 -*-
 cimport cython
+import mmap
 from libc.stdint cimport uint32_t, uint8_t
 from libc.math cimport log2, lround
 from io import BytesIO
@@ -17,9 +18,11 @@ def fastcdc_cy(data, min_size=None, avg_size=8192, max_size=None, fat=False, hf=
 
     # Ensure we have a readable stream
     if isinstance(data, str):
-        stream = open(data, "rb")
+        with open(data, 'rb') as fp:
+            map = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
+            stream = memoryview(map)
     elif not hasattr(data, "read"):
-        stream = BytesIO(data)
+        stream = memoryview(data)
     else:
         stream = data
     return chunk_generator(stream, min_size, avg_size, max_size, fat, hf)
@@ -32,17 +35,14 @@ def chunk_generator(stream, min_size, avg_size, max_size, fat, hf):
     mask_s = mask(bits + 1)
     mask_l = mask(bits - 1)
     read_size = max(1024 * 64, max_size)
-    blob = memoryview(stream.read(read_size))
     offset = 0
-    while blob:
-        if len(blob) <= max_size:
-            blob  = memoryview(bytes(blob) + stream.read(read_size))
+    while offset < len(stream):
+        blob = stream[offset:offset + read_size]
         cp = cdc_offset(blob, min_size, avg_size, max_size, cs, mask_s, mask_l)
         raw = bytes(blob[:cp]) if fat else b''
         h = hf(blob[:cp]).hexdigest() if hf else ''
         yield Chunk(offset, cp, raw, h)
         offset += cp
-        blob = blob[cp:]
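
For reference, here is a minimal sketch of the fallback idea mentioned above. The threshold and helper name are purely illustrative assumptions, not part of the library:

import mmap
import os


def open_for_chunking(path, mmap_threshold=1024 * 1024):
    """Return a read-only buffer over the file contents.

    Uses mmap for larger files and falls back to a plain read into
    memory for small files or when mmap is not usable.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as fp:
        if size >= mmap_threshold:
            try:
                return memoryview(mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ))
            except (ValueError, OSError):
                pass  # e.g. empty file or mmap unsupported; fall back to a plain read
        return memoryview(fp.read())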

Can't install without poetry

Hey there,

Here is what you get when you try to install fastcdc without poetry on your system:

$ pip install --user fastcdc
Collecting fastcdc
  Downloading fastcdc-1.4.2.tar.gz (19 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
ERROR: Exception:
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 160, in exc_logging_wrapper
    status = run_func(*args)
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 247, in wrapper
    return func(self, options, args)
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/commands/install.py", line 400, in run
    requirement_set = resolver.resolve(
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
    result = self._result = resolver.resolve(
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 481, in resolve
    state = resolution.resolve(requirements, max_rounds=max_rounds)
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 348, in resolve
    self._add_to_criteria(self.state.criteria, r, parent=None)
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 172, in _add_to_criteria
    if not criterion.candidates:
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/resolvelib/structs.py", line 151, in __bool__
    return bool(self._sequence)
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
    return any(self)
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
    return (c for c in iterator if id(c) not in self._incompatible_ids)
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
    candidate = func()
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
    self._link_candidate_cache[link] = LinkCandidate(
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 297, in __init__
    super().__init__(
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 162, in __init__
    self.dist = self._prepare()
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 231, in _prepare
    dist = self._prepare_distribution()
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 308, in _prepare_distribution
    return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 491, in prepare_linked_requirement
    return self._prepare_linked_requirement(req, parallel_builds)
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 577, in _prepare_linked_requirement
    dist = _get_prepared_distribution(
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 69, in _get_prepared_distribution
    abstract_dist.prepare_distribution_metadata(
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/distributions/sdist.py", line 48, in prepare_distribution_metadata
    self._install_build_reqs(finder)
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/distributions/sdist.py", line 118, in _install_build_reqs
    build_reqs = self._get_build_requires_wheel()
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/distributions/sdist.py", line 95, in _get_build_requires_wheel
    return backend.get_requires_for_build_wheel()
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/utils/misc.py", line 685, in get_requires_for_build_wheel
    return super().get_requires_for_build_wheel(config_settings=cs)
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/pep517/wrappers.py", line 173, in get_requires_for_build_wheel
    return self._call_hook('get_requires_for_build_wheel', {
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/pep517/wrappers.py", line 319, in _call_hook
    raise BackendUnavailable(data.get('traceback', ''))
pip._vendor.pep517.wrappers.BackendUnavailable: Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 77, in _build_backend
    obj = import_module(mod_path)
  File "/opt/homebrew/Cellar/[email protected]/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'poetry'

Not everyone uses poetry, yet. :)
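
As a possible workaround (an assumption on my part, not an official fix), you can tell pip to refuse source builds and use a published wheel instead, which sidesteps the poetry build backend entirely:

$ pip install --user --only-binary=":all:" fastcdc

This fails cleanly if no wheel is available for your platform, instead of crashing inside the build backend.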

Proposal: Add a closefd boolean argument to the chunk_generator function.

Currently, the fastcdc_(py|cy) functions leave the file descriptor open after returning the generator, since the generator still needs it. This can result in a resource leak on non-CPython interpreters that do not use reference counting, because the descriptor is only closed when the generator is eventually garbage-collected. To address this, I propose adding a boolean parameter called closefd to the chunk_generator function, along with the corresponding conditional close logic. I am confident that I can make these changes myself once my current pull request #16 is merged.

Alternative: convert the fastcdc_(py|cy) functions into generators themselves and use a with open(...) block inside (sketched below).
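
A rough sketch of that alternative (not the library's actual code; the import path is assumed, and the size validation done by the real functions is replaced by hard-coded defaults for brevity):

import hashlib
import io

from fastcdc.fastcdc_py import chunk_generator  # assumed import path


def fastcdc_py(data, min_size=2048, avg_size=8192, max_size=65536, fat=False, hf=hashlib.sha256):
    # As a generator, this body only runs while being iterated, so the
    # file is closed as soon as the generator is exhausted or explicitly
    # closed, without relying on reference counting.
    if isinstance(data, str):
        with open(data, "rb") as stream:
            yield from chunk_generator(stream, min_size, avg_size, max_size, fat, hf)
    elif not hasattr(data, "read"):
        yield from chunk_generator(io.BytesIO(data), min_size, avg_size, max_size, fat, hf)
    else:
        yield from chunk_generator(data, min_size, avg_size, max_size, fat, hf)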

chunk length is incorrect for files less than min_size

When a chunk is smaller than min_size, such as for a small file or stream, the reported size is incorrect.

Consider the following example:

data = b'\x04\xc9KM\x8a\xeaiH\x83\xaf\x01{\xd6\xe1\xab(# \xdb\xaf' # from os.urandom(20)
print(f'{len(data) = }')

chunks = fastcdc.fastcdc(
    data, 
    min_size=1024, # 1 kb
    avg_size=4*1024, # 4 kb
    max_size=16*1024, # 16 kb
    fat=True, # for demo
)
chunk = next(chunks)

print(f'{chunk.length = }')
print(f'{len(chunk.data) = }')
print(f'{data == chunk.data = }')

print(f'{fastcdc.__version__ = }')

Out:

len(data) = 20
chunk.length = 1024
len(chunk.data) = 20
data == chunk.data = True
fastcdc.__version__ = '1.4.2'

As you can see, chunk.length is incorrect for a data stream of 20 bytes (20 << 1024). When used with fat=True, I can ascertain the true size, but that needlessly uses extra memory.
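
Until this is fixed, one caller-side workaround (purely illustrative, and assuming the input is an in-memory bytes object) is to clamp the reported length to the bytes that actually remain:

import fastcdc

def clamped_chunks(data, **kwargs):
    total = len(data)
    for chunk in fastcdc.fastcdc(data, **kwargs):
        # A trailing chunk of a short input may report length == min_size;
        # clamp it to the remaining input instead.
        yield chunk.offset, min(chunk.length, total - chunk.offset)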

Sensible bounds for read()

Ok, figured it out.

The current implementation tends to grow the buffer well beyond max_size, because each refill reads another full read_size regardless of how much is already buffered. As a result, it then yields multiple chunks in one go. That can be avoided by simply filling the buffer only up to the designated size:

            blob = memoryview(bytes(blob) + stream.read(max_size - len(blob)))

As far as I read the implementation, chunks should never be > max_size (where it splits regardless of a matching cut point), so this change should not alter the behavior in any way.

[Edit]
This was a question about reducing the number of chunks that are emitted at once.
Re-reading the paper, it became clear that one can essentially disable normalized chunking (NC) by setting min, avg and max to the same fixed chunk size. Chunking is then effectively fixed-size and reliably emits one chunk at a time, which is enough for my purpose (see the sketch below).

Sorry for the noise!
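
A minimal sketch of that fixed-size trick, assuming the library accepts equal values for all three size parameters (as the edit above suggests):

import fastcdc

CHUNK = 64 * 1024  # illustrative fixed chunk size

with open("some-file.bin", "rb") as fp:
    data = fp.read()

# With min == avg == max, normalized chunking is effectively disabled and
# every chunk except possibly the last is exactly CHUNK bytes long.
for chunk in fastcdc.fastcdc(data, min_size=CHUNK, avg_size=CHUNK, max_size=CHUNK):
    print(chunk.offset, chunk.length)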

Query Regarding the Safety of FastCDC in a Multiprocessing Environment

Dear FastCDC Developers,
Hello! I am currently using the FastCDC library for some project development and I greatly appreciate your work.

I have a question I would like to ask. I am considering using multiprocessing in my project, but I am unsure whether the FastCDC library can operate safely in a multiprocessing environment. I did not find related information in the documentation, so I thought I would ask here.

If the FastCDC library can safely operate in a multiprocessing environment, are there any special steps or precautions that need to be taken to ensure its safety? If not, could you recommend any alternatives or solutions?

Thank you very much for your assistance! I look forward to your reply.
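
For what it's worth, a common pattern here (shown as an assumption about typical usage, not as an official statement on the library's process safety) is to give each worker process its own file to chunk, so no state is shared between processes:

import multiprocessing

import fastcdc


def chunk_file(path):
    # Each worker runs in its own process with its own interpreter state,
    # so nothing is shared with other workers.
    return path, [(c.offset, c.length) for c in fastcdc.fastcdc(path, avg_size=8192)]


if __name__ == "__main__":
    paths = ["a.bin", "b.bin", "c.bin"]  # illustrative file list
    with multiprocessing.Pool() as pool:
        for path, chunks in pool.map(chunk_file, paths):
            print(path, len(chunks))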
