
Comments (8)

kiyo-masui commented on June 1, 2024

What do you mean by verify it is working? Do you mean verify that the data is compressed? You can just check the length of the output array. Which interface are you using: C, Python, or HDF5?
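For the Python interface, a minimal sketch of that kind of check might look like this (the array here is just a smooth placeholder standing in for real data):

import numpy as np
import bitshuffle

# smooth placeholder data; real data would come from your application
data = np.linspace(0, 1, 200000, dtype=np.float32).reshape(1000, 200)

compressed = bitshuffle.compress_lz4(data)  # 1-D uint8 buffer
print(data.nbytes, compressed.nbytes)       # the compressed buffer should be smaller

# round-trip to confirm the data survives compression
restored = bitshuffle.decompress_lz4(compressed, data.shape, data.dtype)
assert np.array_equal(data, restored)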


ranaivomahaleo commented on June 1, 2024

Yes. How can I verify that the data is compressed? I am using the Python and HDF5 interfaces.

The command I use to create the dataset is as follows:

import numpy as np
import h5py as hdf
from h5py import h5f, h5d, h5z, h5t, h5s, filters
from bitshuffle import h5

datasetfullpath = '...'
f = hdf.File(datasetfullpath, 'w')

filter_pipeline = (32008, 32000)
filter_opts = ((1000000, h5.H5_COMPRESS_LZ4), ())
h5.create_dataset(f,
                  'dataset_name',
                  (20000, 9801, 200),
                  np.float32,
                  chunks=(50, 50, 100),
                  filter_pipeline=filter_pipeline,
                  filter_opts=filter_opts)

f['dataset_name'][...] = ...
f.flush()

The size of the resulting HDF5 file is around 250 Gb, which I think is too big for a compressed file.
I expect a file size of 20,000 x 9,801 x 200 x 4 bytes (around 146 Gb) for an uncompressed file, so why do we get 250 Gb?

Is there something wrong with my filter configuration above?
How can I additionally configure GZIP or a third-party filter in the pipeline above (for example bitshuffle+LZ4, LZF, and GZIP as a pipeline)?


ranaivomahaleo commented on June 1, 2024

Just a correction concerning the comment above:
I created three datasets using the script above and stored them in one file, so the expected file size is 3 x 20,000 x 9,801 x 200 x 4 bytes, around 438 Gb. The resulting file has a size of 250 Gb, so a compression ratio of about 1.75:1. Good, but how can I gain more space (a smaller file size)?


kiyo-masui commented on June 1, 2024

Okay, a few things:

  1. Your OS probably reports file sizes in Gb = 10^9 bytes, so your data should be 470 Gb.
  2. There shouldn't be a need to additionally compress the compressed data. Adding LZF (32000) to the pipeline will mostly just slow things down and not compress things much over the LZ4 compression built into bitshuffle. That being said, you can in principle add an arbitrary number of filters to the pipeline in the way you have done. For GZIP you need to add the filter number for DEFLATE (h5z.FILTER_DEFLATE) to the pipeline (see the sketch after this list).
  3. Do you care about speed? If not, bitshuffle is not the compressor for you. BZIP2 is ridiculously slow but gets ridiculously high compression ratios. If you don't want to build the BZIP2 hdf5 filter, just try compressing the file on the command line to see what ratios you get. It will be similar to what you would get out of the filter. LZMA is another option, but I'm not sure if a filter exists for it yet.
  4. Bitshuffle works best if the fastest varying axis of the dataset (the one with length 200 in your case) is the one over which the data is most highly correlated. I.e. if the data doesn't change much from element to element.
  5. You have specified a block size for bitshuffle's internal compression of 1000000. You are probably better off not specifying one (set it to 0), but if you do, be sure to make it a multiple of 8. I should probably document this somewhere.
  6. Float data often doesn't compress well (for any compressor) due to numerical noise. I'm not particularly surprised that you are only getting compression ratios of 1.75:1.
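For reference, here is a sketch of what such a pipeline could look like (the file name, dataset name, and GZIP level of 4 are placeholders; bitshuffle's block size is left at 0 so it chooses its own):

import numpy as np
import h5py as hdf
from h5py import h5z
from bitshuffle import h5

f = hdf.File('example.h5', 'w')  # placeholder file name

# bitshuffle+LZ4 (32008) followed by GZIP/DEFLATE; the single DEFLATE option
# is the compression level, and a bitshuffle block size of 0 means "choose for me"
filter_pipeline = (32008, h5z.FILTER_DEFLATE)
filter_opts = ((0, h5.H5_COMPRESS_LZ4), (4,))
h5.create_dataset(f, 'dataset_name', (20000, 9801, 200), np.float32,
                  chunks=(50, 50, 100),
                  filter_pipeline=filter_pipeline,
                  filter_opts=filter_opts)

# once the dataset has been filled, its on-disk (compressed) size can be
# compared against its uncompressed size
ds = f['dataset_name']
print(ds.id.get_storage_size(), ds.size * ds.dtype.itemsize)
f.close()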


ranaivomahaleo commented on June 1, 2024

Thanks for your feedback. Just a few questions:

  1. ok
  2. ok
  3. Yes, I care about speed. I tested LZMA on the command line, but I do not gain much in terms of space.
  4. I do not understand the meaning of "the fastest varying axis of the dataset". In my case (the 20,000 x 9,801 x 200 dataset), the data is most correlated along the first axis (the 20,000 values). How can I configure the chunk size and bitshuffle to get more efficient compression?
    (Note: I changed the chunk size of the hdf5 file to (100, 100, 200).)
  5. I changed the chunk size of the hdf5 file to (100, 100, 200) and specified a block size equal to 64,000,000 for bitshuffle. Is this configuration fine?
  6. Would an integer representation compress better in this scheme?


kiyo-masui commented on June 1, 2024

For the fastest varying axis, you have two options: either transpose your data so that the index with length 20000 is the last one, or set your chunk size to be (20000, 1, 1). The former is preferred.
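A sketch of the first option, assuming the correlated axis (length 20,000) is currently the first one (the small array here is just a placeholder):

import numpy as np

# small placeholder; the real array has shape (20000, 9801, 200) with the
# highly correlated axis (length 20000) first
data = np.zeros((2000, 98, 20), dtype=np.float32)

# move the correlated axis to the end so it becomes the fastest varying one,
# and make the result contiguous in memory before writing it to HDF5
data_t = np.ascontiguousarray(np.moveaxis(data, 0, -1))
print(data_t.shape)  # (98, 20, 2000)

The dataset shape and chunk shape passed to create_dataset would then be transposed the same way.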

I would just let bitshuffle choose the block size. It will be faster.

Yes, integers often compress better (in all compression schemes).
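A sketch of one way to do that conversion, keeping 4 digits after the decimal point (the 10**4 scale factor and the uint32 type are assumptions; whether uint32 is wide enough depends on the data's range and sign):

import numpy as np

# placeholder float data; the real data would be the large float32 array
data = np.linspace(0.0, 12.3456, 1000, dtype=np.float32)

# keep 4 digits after the decimal point and store as unsigned integers;
# assumes the values are non-negative and that value * 10**4 fits in a uint32
scaled = np.round(data * 10**4).astype(np.uint32)

# approximate reconstruction of the original values when reading back
restored = scaled.astype(np.float32) / 10**4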


ranaivomahaleo commented on June 1, 2024

I got a compression ratio of 4:1, the best compression I have had with my data, with the following configuration:

  • Dataset: 3 x (200, 9,801, 20,000) with a chunk shape of (200, 100, 100)
  • Compression: bitshuffle + LZ4 plus LZF (LZF offers a small additional gain)
  • Using an integer representation (uint32) keeping 4 digits after the decimal point. Keeping only 4 digits is crucial for the compression ratio, but it depends on the data.
  • Using a bitshuffle block size of 64,000,000. With a block size of 0, I got worse compression.

Thanks for your advice.


kiyo-masui commented on June 1, 2024

Okay, try one more iteration:

  • Chunk shape (20, 10, 10000).
  • Block size 0
  • Drop LZF.

This might not improve the ratio much, but it will greatly improve speed. Having the third axis be long in the chunk is key: bitshuffle needs to see a run of similar elements to get good compression. 100 isn't enough; 1000 is the bare minimum.
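Putting that together, the call might look like this (the file name and dataset name are placeholders; the shape assumes the transposed (200, 9,801, 20,000) layout and the uint32 representation from the previous comments):

import numpy as np
import h5py as hdf
from bitshuffle import h5

f = hdf.File('example.h5', 'w')  # placeholder file name

# bitshuffle+LZ4 only; block size 0 lets bitshuffle pick its own block size
h5.create_dataset(f, 'dataset_name', (200, 9801, 20000), np.uint32,
                  chunks=(20, 10, 10000),
                  filter_pipeline=(32008,),
                  filter_opts=((0, h5.H5_COMPRESS_LZ4),))
f.close()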

