Hmm, meanwhile I have the suspicion that there is some overhead per compress call and that it gets a bit inefficient for small block sizes. Is it due to thread startup overhead? Is it starting/stopping threads per compress call?
from python-blosc.
No, there is a pool of threads, so this should not add too much overhead. The performance problem is a bit hairy to describe because caches are strange beasts. My experience playing with buffers is pretty much summarized in the compute_blocksize() function (https://github.com/Blosc/c-blosc/blob/master/blosc/blosc.c#L819): 16 KB is the minimum per thread, so if you have, say, 4 cores, that would mean chunksizes of 64 KB.
Caches being complex creatures also means that it is difficult to document recommendations for users, other than to test with different chunksizes and numbers of threads. Sorry about that.
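As a back-of-envelope sketch of the figure above (this is not the real compute_blocksize() logic, which does considerably more):

```python
# Rough rule of thumb from the discussion above: ~16 KB per thread,
# so the chunksize should be at least 16 KB times the number of threads.
MIN_PER_THREAD = 16 * 1024

def min_useful_chunksize(nthreads):
    """Smallest chunksize that still gives every thread a ~16 KB block."""
    return MIN_PER_THREAD * nthreads

print(min_useful_chunksize(4))  # 65536, i.e. 64 KB for 4 cores
```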
In addition, it is worth mentioning that an LZ77 compressor works by looking at previously seen data. If the blocks are small, there are boundaries that the compressor cannot traverse, meaning many small blocks will likely compress worse overall than larger ones. Also, there is always a fixed overhead per block in the form of a header, and fewer blocks mean fewer headers. So overall, choosing a good blocksize is an art, hence the 'expert only' documentation.
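Both effects are easy to reproduce outside of blosc. Here is a small stand-in experiment with the stdlib zlib (also an LZ77-family compressor), compressing the same repetitive buffer as one block versus many small blocks; the numbers are illustrative only:

```python
import zlib

# Highly repetitive payload: an LZ77 coder thrives on repeats it can reference.
data = b"0123456789abcdef" * 1024  # 16 KiB

# One big block: the whole history is visible to the compressor.
one_block = zlib.compress(data)

# Many 64-byte blocks: repeats across block boundaries are invisible to the
# compressor, and every block pays its own stream header/checksum overhead.
small_blocks = [zlib.compress(data[i:i + 64]) for i in range(0, len(data), 64)]
total_small = sum(len(b) for b in small_blocks)

print(len(one_block), total_small)  # the per-block total is much larger
```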
Having said that, feel free to open a pull-request updating the docstring if you like.
Regarding the new release, it is being prepped already and should be out soon.
Thanks for working on the new release and for the explanations.
Yes, I see that the small chunks are increasing overhead; I'll see if it makes sense to increase the chunksize in attic. It would decrease overhead in other places too, but an increased chunksize might mean less deduplication, because larger chunks are less likely to be duplicates, so it's tricky...
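As a toy illustration of that trade-off (fixed-size chunking only; attic's actual chunker is more sophisticated), smaller chunks find more duplicates on repetitive data:

```python
import hashlib

data = b"0123456789" * 100  # 1000 bytes with a 10-byte period

def chunk_stats(chunksize):
    """Return (total chunks, unique chunks) for fixed-size chunking."""
    chunks = [data[i:i + chunksize] for i in range(0, len(data), chunksize)]
    unique = {hashlib.sha256(c).digest() for c in chunks}
    return len(chunks), len(unique)

total_small, unique_small = chunk_stats(10)   # chunk == one period: full dedup
total_large, unique_large = chunk_stats(333)  # chunks out of phase: no dedup
print(total_small, unique_small, total_large, unique_large)  # 100 1 4 4
```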
BTW, I did some benchmarks, and on my test data lz4 was about as fast as no compression. A bit strange: the lz4 compression level didn't change the output size, and level 9 (65s) even seemed slightly faster than level 1 (69s), but both produced 3.79 GB of compressed data.
I was somehow wondering how much overhead constructing the bytes object it wants adds (I have a memoryview as "data"):

    def compress(self, data):
        # bytes(data) copies the memoryview's buffer; the second argument is the typesize
        return blosc.compress(bytes(data), 1,
                              cname=self.CNAME, clevel=self._get_level())

I didn't find another way, especially not how to give it a pointer and a length.
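For what it's worth, bytes(memoryview) does make a full, independent copy of the underlying buffer, as a quick check shows:

```python
buf = bytearray(b"attic chunk data")
view = memoryview(buf)

copy = bytes(view)    # allocates and copies len(buf) bytes
buf[0:5] = b"XXXXX"   # mutate the original buffer afterwards

print(copy)  # b'attic chunk data' -- the copy is unaffected
```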
@ThomasWaldmann would blosc.compress_ptr() not help?
http://python-blosc.blosc.org/tutorial.html#compressing-from-a-data-pointer
The example is for a numpy array, but with ctypes you may be able to make it work with strings as well.
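For reference, here is one way to get a raw address out of a writable Python buffer with ctypes; the commented-out compress_ptr call is only sketched from the tutorial linked above, so treat its exact arguments as an assumption:

```python
import ctypes

buf = bytearray(b"some chunk of data")

# Map the bytearray into a ctypes array without copying, then take its address.
c_buf = (ctypes.c_char * len(buf)).from_buffer(buf)
address = ctypes.addressof(c_buf)

# Sanity check: the address really points at the original bytes.
assert ctypes.string_at(address, len(buf)) == bytes(buf)

# With python-blosc one could then try something like (untested sketch):
#   blosc.compress_ptr(address, len(buf), typesize=1,
#                      clevel=5, cname='lz4')
```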
I had a look at compress_ptr and also at memoryview's docs, but there is no (pure) Python way to get the address of the data in a memoryview. I don't use numpy, but thanks for the tip about using ctypes.
But: I think there should be an easier way; most Python devs won't invoke ctypes just to get at some pointer. See also the ticket I opened, maybe memoryviews could be supported better.
Agreed, supporting memoryviews would be cool. If you feel like you can contribute a PR for this, that would be fantastic.
Proposed fix for this in #81
Closing because this has been open for too long.
Feel free to reopen if you think the issue persists.
Here is the patch if you want to resurrect it:
commit 25cd5871d5732d8c29c13d92a6381cf2ef4d515f
Author: Valentin Haenel <[email protected]>
Date: Sat Mar 28 22:52:05 2015 +0100
update docs for set_blocksize, fixes #76
diff --git a/blosc/toplevel.py b/blosc/toplevel.py
index 82932d0e36..8f51afd81c 100644
--- a/blosc/toplevel.py
+++ b/blosc/toplevel.py
@@ -98,13 +98,25 @@ def set_nthreads(nthreads):
 def set_blocksize(blocksize):
     """set_blocksize(blocksize)

-    Force the use of a specific blocksize. If 0, an automatic
+    Force the use of a specific blocksize in bytes. If 0, an automatic
     blocksize will be used (the default).

     Notes
     -----
-    This is a low-level function and is recommened for expert users only.
+    This is a low-level function and is recommended for expert users only.
+
+    Changing the blocksize can have a profound effect on the performance of
+    blosc. If the blocksize is too large, each block may no longer fit into
+    the CPU caches, rendering the blocking technique ineffective. For example,
+    a block may have to travel to and from memory twice: once when applying
+    the shuffle filter and a second time for the actual compression. Also,
+    for a large blocksize, blosc may not be able to split the input,
+    depending on its size, which in turn means no multithreading. If the
+    blocksize is too small, the amount of constant overhead is increased,
+    since each block must store a header with information about its
+    compressed size. Additionally, LZ77-style compressors may not reach the
+    same compression ratio as with larger blocks, since their internal
+    dictionary cannot be reused across block boundaries.

     Examples
     --------