Comments (8)
What do you mean by verify it is working? Do you mean verify that the data is compressed? You can just check the length of the output array. Which interface are you using: C, Python, or HDF5?
from bitshuffle.
Yes. How do I verify that the data is compressed? I am using the Python and HDF5 interfaces.
The command I use to create the dataset is as follows:
import numpy as np
import h5py as hdf
from bitshuffle import h5

datasetfullpath = '...'
f = hdf.File(datasetfullpath, 'w')

# 32008 = bitshuffle, 32000 = LZF
filter_pipeline = (32008, 32000)
filter_opts = ((1000000, h5.H5_COMPRESS_LZ4), ())
h5.create_dataset(f,
                  'dataset_name',
                  (20000, 9801, 200),
                  np.float32,
                  chunks=(50, 50, 100),
                  filter_pipeline=filter_pipeline,
                  filter_opts=filter_opts)
f['dataset_name'][...] = ...
f.flush()
The size of the resulting HDF5 file is around 250Gb. I think that is too big for a compressed file.
I expect a file size of 20,000 x 9,801 x 200 x 4 bytes (around 146Gb) for an uncompressed file, so why do we have 250Gb?
Is there something wrong in my filter configuration above?
How can I additionally configure GZIP or a third-party filter in the configuration above (for example, having bitshuffle+LZ4, LZF, and GZIP as a pipeline)?
Just a correction concerning the comment above:
I created three datasets using the script above and stored them in one file, so the expected size is 3 x 20,000 x 9,801 x 200 x 4 bytes = 438Gb. The resulting file has a size of 250Gb, so a compression ratio of 1.75:1. Good, but how can I gain more space (a smaller file)?
Okay, a few things:
- Your OS probably reports file sizes in Gb = 10^9 bytes, so your data should be 470 Gb.
- There shouldn't be a need to additionally compress the already compressed data. Adding LZF (32000) to the pipeline will mostly just slow things down without compressing much beyond the LZ4 compression built into bitshuffle. That being said, you can in principle add an arbitrary number of filters to the pipeline in the way you have done. For GZIP you need to add the filter number for DEFLATE (h5z.FILTER_DEFLATE) to the pipeline.
- Do you care about speed? If not, bitshuffle is not the compressor for you. BZIP2 is ridiculously slow but gets ridiculously high compression ratios. If you don't want to build the BZIP2 HDF5 filter, just try compressing the file on the command line to see what ratios you get; it will be similar to what you would get out of the filter. LZMA is another option, but I'm not sure a filter exists for it yet.
- Bitshuffle works best if the fastest varying axis of the dataset (the one with length 200 in your case) is the one over which the data is most highly correlated. I.e. if the data doesn't change much from element to element.
- You have specified a block size for bitshuffle's internal compression of 1000000. You are probably better off not specifying one (set it to 0), but if you do, be sure to make it a multiple of 8. I should probably document this somewhere.
- Float data often doesn't compress well (for any compressor) due to numerical noise. I'm not particularly surprised that you are only getting compression ratios of 1.75:1.
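To see why element-to-element correlation matters, here is a pure-NumPy toy model of the bit transposition bitshuffle performs (uint8 only, no compression step; the function names are made up for illustration):

```python
import numpy as np

def toy_bitshuffle(a):
    """Transpose the (elements x bits) matrix of a uint8 array.

    After the transpose, bit 0 of every element is stored together,
    then bit 1, and so on. If neighbouring elements are similar, each
    bit plane becomes long runs of 0s and 1s, which LZ-style
    compressors handle very well.
    """
    bits = np.unpackbits(a).reshape(a.size, 8)
    return np.packbits(bits.T)

def toy_bitunshuffle(shuffled, n):
    """Invert toy_bitshuffle for an array of n original elements."""
    bits = np.unpackbits(shuffled).reshape(8, n)
    return np.packbits(bits.T)

a = np.arange(64, dtype=np.uint8)  # slowly varying toy data
s = toy_bitshuffle(a)
assert np.array_equal(toy_bitunshuffle(s, a.size), a)
```

This is only a model of the transform; the real library does this with vectorized instructions on larger types and follows it with LZ4.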
Thanks for your feedback. Just a few questions:
- ok
- ok
- Yes, I care about speed. I tested LZMA on the command line but I did not gain much in terms of space.
- I do not understand the meaning of "the fastest varying axis of the dataset". In my case (the 20,000 x 9,801 x 200 dataset), the data is mostly correlated along the first axis (20,000 values). How can I configure the chunk size and bitshuffle to get more efficient compression?
- I changed the chunk size of the HDF5 file to (100, 100, 200) and specified a block size of 64,000,000 for bitshuffle. Is this configuration fine?
- Would an integer representation compress better in this scheme?
For the fastest varying axis, you have two options: either transpose your data so that the index with length 20000 is the last one, or set your chunk size to be (20000, 1, 1). The former is preferred.
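The transpose option can be done once, up front, with NumPy (a sketch on a scaled-down, hypothetical stand-in shape):

```python
import numpy as np

# Scaled-down stand-in for the real (20000, 9801, 200) array,
# with the correlated axis first
data = np.zeros((200, 98, 20), dtype=np.float32)

# Move the correlated axis to the end so it varies fastest in C order
reordered = np.ascontiguousarray(np.transpose(data, (1, 2, 0)))
assert reordered.shape == (98, 20, 200)
```

`ascontiguousarray` matters: `np.transpose` alone only returns a view, and it is the in-memory layout that determines which axis varies fastest when the chunks are written.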
I would just let bitshuffle choose the block size. It will be faster.
Yes, integers often compress better (in all compression schemes).
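One common way to get an integer representation is fixed-point quantization, keeping a fixed number of digits after the decimal point (a sketch with made-up non-negative data; the number of digits to keep depends on your data and the precision you can afford to lose):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(1000) * 100        # hypothetical non-negative float data

scale = 10**4                        # keep 4 digits after the decimal point
q = np.round(data * scale).astype(np.uint32)

restored = q.astype(np.float64) / scale
# Quantization error is bounded by half a step
assert np.abs(restored - data).max() <= 0.5 / scale + 1e-12
```

Note this is lossy, and the values times the scale must fit in the integer type (signed data needs a signed type or an offset).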
I got a compression ratio of 4:1, the best I have had with my data, with the following configuration:
- dataset: 3 x (200, 9,801, 20,000) with a chunk shape of (200, 100, 100)
- Compression: bitshuffle + LZ4 and LZF compression (LZF offers a little gain)
- Using an integer representation (uint32) keeping 4 digits after the decimal point. The 4 digits after the decimal point are crucial for the compression ratio, but this depends on the data.
- Using a bitshuffle block size of 64,000,000. With a block size of 0, I had bad compression.
Thanks for your advice.
Okay, try one more iteration:
- Chunk shape (20, 10, 10000).
- Block size 0
- Drop LZF.
This might not improve ratios much, but it will greatly improve speed. Having the third axis be long in the chunk is key: bitshuffle needs to see a run of like elements to get good compression. 100 isn't enough; 1000 is the bare minimum.