iscc / fastcdc-py
FastCDC implementation in Python: https://pypi.org/project/fastcdc/
License: MIT License
This implementation copies another implementation that uses a right-shift in its Gear rollsum. That approach has problems, which I reported here:
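The shift-direction difference can be illustrated with a minimal sketch (hedged: the table and function names are illustrative, not the code of either implementation):

```python
# Contrast the canonical left-shift Gear rollsum with the right-shift
# variant referred to above.  Illustrative sketch only.
import random

random.seed(0)
# 256-entry table of random 32-bit values, one per possible byte value.
GEAR = [random.getrandbits(32) for _ in range(256)]
MASK32 = 0xFFFFFFFF

def gear_left(data):
    """Canonical Gear: h = (h << 1) + GEAR[b].  Each byte's contribution
    is shifted toward the high bits and drops out after 32 steps, so a
    mask on the high bits sees a full 32-byte sliding window."""
    h = 0
    for b in data:
        h = ((h << 1) + GEAR[b]) & MASK32
    return h

def gear_right(data):
    """Right-shift variant: h = (h >> 1) + GEAR[b].  Older bytes are
    shifted *down*, so the high bits -- which a cut-point mask typically
    tests -- are dominated by only the most recent byte or two; this is
    the effective-window problem reported above."""
    h = 0
    for b in data:
        h = ((h >> 1) + GEAR[b]) & MASK32
    return h
```

With the left shift, any prefix more than 32 bytes before the current position has no effect on the hash, which is exactly the sliding-window property a CDC rollsum needs.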
Hi,
Could we get a new pip release that targets Python 3.12.3, or at least >=3.12? It's the latest stable release, and the current PyPI release does not allow targeting 3.12 on Windows:
$ pip download --verbose --python-version=3.12 --only-binary=":all:" --platform win_amd64 --dest ~/Downloads/ "fastcdc==1.5.0"
We could skip --only-binary and build from source to get around it, but that can put additional build requirements on the target system installing the package, so having the option to get a binary wheel would be nice.
I've noticed that the fat option appears to be broken when using the Cython version of the library. The pure Python version works correctly. I have not had time to investigate why the Cython version returns different data than the Python version (and I'm not very familiar with Cython anyway). I've added a test to expose the problem and will submit a PR for it. #8
fastcdc 1.4.0
Cython version 0.29.21
Python 3.8.2
Ubuntu 20.04 x64 based distro.
Hey there,
Thanks for a lovely piece of code. I couldn't help but notice a tremendous amount of runtime (at least 20%) spent on string copying, and it seems quite straightforward to resolve. The patch below tidies up the main loop to avoid any copying, netting a 42% runtime decrease on my aging 2015 XPS laptop, with throughput increasing from roughly 664 MiB/s to 1158 MiB/s for my test file (a 12 GB VMware vmdk). At least some of this is explained by the avoidance of copying; the rest is likely due to no longer thrashing the CPU cache.
I would have submitted this as a PR, but it wasn't clear what the correct semantics for the first argument of fastcdc() are, or whether you are happy to use mmap.mmap() at all. This was only tested on Linux, but similar or identical code should work on Windows too. It's also not clear whether there is any value in a fallback mode for situations where mmap() is not available; perhaps there is, but none occur to me just now. There is also a fixed cost to setting up an mmap(), which means that for smaller files it may still make sense for performance reasons to fall back to regular I/O.
diff --git a/fastcdc/fastcdc_cy.pyx b/fastcdc/fastcdc_cy.pyx
index c16ec81..acb8c1a 100644
--- a/fastcdc/fastcdc_cy.pyx
+++ b/fastcdc/fastcdc_cy.pyx
@@ -1,5 +1,6 @@
# -*- coding: utf-8 -*-
cimport cython
+import mmap
from libc.stdint cimport uint32_t, uint8_t
from libc.math cimport log2, lround
from io import BytesIO
@@ -17,9 +18,11 @@ def fastcdc_cy(data, min_size=None, avg_size=8192, max_size=None, fat=False, hf=
# Ensure we have a readable stream
if isinstance(data, str):
- stream = open(data, "rb")
+ with open(data, 'rb') as fp:
+ map = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
+ stream = memoryview(map)
elif not hasattr(data, "read"):
- stream = BytesIO(data)
+ stream = memoryview(data)
else:
stream = data
return chunk_generator(stream, min_size, avg_size, max_size, fat, hf)
@@ -32,17 +35,14 @@ def chunk_generator(stream, min_size, avg_size, max_size, fat, hf):
mask_s = mask(bits + 1)
mask_l = mask(bits - 1)
read_size = max(1024 * 64, max_size)
- blob = memoryview(stream.read(read_size))
offset = 0
- while blob:
- if len(blob) <= max_size:
- blob = memoryview(bytes(blob) + stream.read(read_size))
+ while offset < len(stream):
+ blob = stream[offset:offset + read_size]
cp = cdc_offset(blob, min_size, avg_size, max_size, cs, mask_s, mask_l)
raw = bytes(blob[:cp]) if fat else b''
h = hf(blob[:cp]).hexdigest() if hf else ''
yield Chunk(offset, cp, raw, h)
offset += cp
- blob = blob[cp:]
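The input handling from the patch above can be sketched as a small standalone helper (hedged: the as_buffer name is illustrative, not the library's API; note that ACCESS_READ is the portable constant, while PROT_READ is the Unix-only prot= spelling):

```python
# Hedged sketch of mmap-backed, zero-copy input handling.
import mmap

def as_buffer(data):
    """Return a zero-copy view over the input.

    Accepts a filename or a bytes-like object; stream objects that
    already expose read() are passed through unchanged.
    """
    if isinstance(data, str):
        with open(data, "rb") as fp:
            # Length 0 maps the whole file.  The mapping stays valid
            # after fp is closed because mmap holds its own handle.
            m = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
        return memoryview(m)
    if not hasattr(data, "read"):
        return memoryview(data)
    return data
```

Slicing the returned memoryview then yields views into the page cache rather than copies, which is where the speedup in the patch comes from.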
As a starting point, see:
https://github.com/grantjenks/python-c2f
https://github.com/RalfG/python-wheels-manylinux-build
Also create a workflow for publishing wheels to PyPI.
Hey there,
Here is what you get when you try to install fastcdc without poetry on your system:
$ pip install --user fastcdc
Collecting fastcdc
Downloading fastcdc-1.4.2.tar.gz (19 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
ERROR: Exception:
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/cli/base_command.py", line 160, in exc_logging_wrapper
status = run_func(*args)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/cli/req_command.py", line 247, in wrapper
return func(self, options, args)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/commands/install.py", line 400, in run
requirement_set = resolver.resolve(
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
result = self._result = resolver.resolve(
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 481, in resolve
state = resolution.resolve(requirements, max_rounds=max_rounds)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 348, in resolve
self._add_to_criteria(self.state.criteria, r, parent=None)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/resolvelib/resolvers.py", line 172, in _add_to_criteria
if not criterion.candidates:
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/resolvelib/structs.py", line 151, in __bool__
return bool(self._sequence)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
return any(self)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
return (c for c in iterator if id(c) not in self._incompatible_ids)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
candidate = func()
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
self._link_candidate_cache[link] = LinkCandidate(
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 297, in __init__
super().__init__(
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 162, in __init__
self.dist = self._prepare()
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 231, in _prepare
dist = self._prepare_distribution()
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/resolution/resolvelib/candidates.py", line 308, in _prepare_distribution
return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 491, in prepare_linked_requirement
return self._prepare_linked_requirement(req, parallel_builds)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 577, in _prepare_linked_requirement
dist = _get_prepared_distribution(
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/operations/prepare.py", line 69, in _get_prepared_distribution
abstract_dist.prepare_distribution_metadata(
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/distributions/sdist.py", line 48, in prepare_distribution_metadata
self._install_build_reqs(finder)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/distributions/sdist.py", line 118, in _install_build_reqs
build_reqs = self._get_build_requires_wheel()
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/distributions/sdist.py", line 95, in _get_build_requires_wheel
return backend.get_requires_for_build_wheel()
File "/opt/homebrew/lib/python3.10/site-packages/pip/_internal/utils/misc.py", line 685, in get_requires_for_build_wheel
return super().get_requires_for_build_wheel(config_settings=cs)
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/pep517/wrappers.py", line 173, in get_requires_for_build_wheel
return self._call_hook('get_requires_for_build_wheel', {
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/pep517/wrappers.py", line 319, in _call_hook
raise BackendUnavailable(data.get('traceback', ''))
pip._vendor.pep517.wrappers.BackendUnavailable: Traceback (most recent call last):
File "/opt/homebrew/lib/python3.10/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 77, in _build_backend
obj = import_module(mod_path)
File "/opt/homebrew/Cellar/[email protected]/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'poetry'
Not everyone uses poetry, yet. :)
Currently, the fastcdc_(py|cy) functions leave the file descriptor open after returning the generator, since the generator requires it. This can result in a file-descriptor leak on non-CPython interpreters that do not use reference counting. To address this, I propose adding a boolean parameter called closefd to the chunk_generator function, along with some corresponding if code. I am confident that I can make these changes myself once my current pull request #16 is merged.
Alternative: convert the fastcdc_(py|cy) functions into generators themselves and use a with open(...) statement inside.
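The alternative can be sketched in a few lines (hedged: fastcdc_like and chunk_stream are hypothetical stand-in names, and the chunking itself is reduced to fixed reads for brevity):

```python
# Hedged sketch of the generator-based alternative: the entry point is
# itself a generator, so `with open(...)` closes the file
# deterministically on any interpreter, reference-counted or not.
def chunk_stream(stream, read_size=65536):
    """Stand-in for chunk_generator: yields raw read() slices here."""
    while True:
        blob = stream.read(read_size)
        if not blob:
            return
        yield blob

def fastcdc_like(data, read_size=65536):
    """If given a path, own the file handle for the generator's
    lifetime; the with-block closes it even if iteration stops early."""
    if isinstance(data, str):
        with open(data, "rb") as fp:
            yield from chunk_stream(fp, read_size)
    else:
        yield from chunk_stream(data, read_size)
```

Because the with-block lives inside the generator, closing or exhausting the generator triggers the file's close() without relying on garbage collection.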
When a chunk is smaller than min_size, such as for a small file or stream, the reported size is incorrect.
Consider the following example:
import fastcdc

data = b'\x04\xc9KM\x8a\xeaiH\x83\xaf\x01{\xd6\xe1\xab(# \xdb\xaf'  # from os.urandom(20)
print(f'{len(data) = }')
chunks = fastcdc.fastcdc(
    data,
    min_size=1024,       # 1 KiB
    avg_size=4 * 1024,   # 4 KiB
    max_size=16 * 1024,  # 16 KiB
    fat=True,            # for demo
)
chunk = next(chunks)
print(f'{chunk.length = }')
print(f'{len(chunk.data) = }')
print(f'{data == chunk.data = }')
print(f'{fastcdc.__version__ = }')
Out:
len(data) = 20
chunk.length = 1024
len(chunk.data) = 20
data == chunk.data = True
fastcdc.__version__ = '1.4.2'
As you can see, chunk.length is incorrect for a data stream of 20 bytes (20 ≪ 1024). When used with fat=True, I can ascertain the true size, but that needlessly uses extra memory.
Ok, figured it out.
The current implementation will tend to read a buffer well beyond max_size, because it always reads read_size (at least max_size) more bytes. As a result, it then yields multiple chunks in one go. That can be avoided by simply filling the buffer up to the designated size:
blob = memoryview(bytes(blob) + stream.read(max_size - len(blob)))
As far as I can tell from the implementation, chunks should never be larger than max_size (the code splits there regardless of a matching cut point), so this change should not alter the behavior in any way.
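Restated as a standalone loop, the refill logic looks like this (hedged sketch with a stand-in cut-point function instead of the real cdc_offset; the loop is written as while True so it still terminates when a cut consumes the whole buffer):

```python
# Hedged sketch of the buffered refill loop with the fix above: top the
# buffer up to max_size instead of appending a full read_size block, so
# at most one chunk's worth of data is buffered at a time.
def cut_point(blob, max_size):
    """Stand-in: always cut at max_size.  (The real code scans the Gear
    rollsum for a mask match and only falls back to max_size.)"""
    return min(len(blob), max_size)

def chunk_offsets(stream, max_size):
    """Yield (offset, length) pairs, buffering at most max_size bytes."""
    blob = b""
    offset = 0
    while True:
        if len(blob) < max_size:
            # Refill only what is missing from the buffer.
            blob += stream.read(max_size - len(blob))
        if not blob:
            return  # stream exhausted and buffer drained
        cp = cut_point(blob, max_size)
        yield offset, cp
        offset += cp
        blob = blob[cp:]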
[Edit]
This was a question about reducing the number of chunks that are emitted simultaneously.
Re-reading the paper, it became clear that one can effectively disable normalized chunking (NC) by setting min_size, avg_size and max_size to the same fixed value. Chunking is then effectively fixed-size and reliably emits one chunk at a time. That's enough for my purpose.
Sorry for the noise!
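The degenerate case described here can be sketched in a few lines (illustrative only, not the library's code):

```python
# Hedged sketch: when min_size == avg_size == max_size, no
# content-defined cut point can fire before the forced split at
# max_size, so FastCDC effectively degenerates to fixed-size chunking.
def fixed_chunks(data, size):
    """Yield (offset, length) pairs of fixed-size chunks."""
    for offset in range(0, len(data), size):
        yield offset, min(size, len(data) - offset)
```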
Dear FastCDC Developers,
Hello! I am currently using the FastCDC library for some project development and I greatly appreciate your work.
I have a question I would like to ask. I am considering using multiprocessing in my project, but I am unsure whether the FastCDC library can safely operate in a multiprocessing environment. I did not find related information in the documentation, so I thought I would inquire here.
If the FastCDC library can safely operate in a multiprocessing environment, are there any special steps or precautions that need to be taken to ensure its safety? If not, could you recommend any alternatives or solutions?
Thank you very much for your assistance! I look forward to your reply.