SPERR

Overview

SPERR (pronounced like spur) is a lossy compressor for scientific data (2D or 3D floating-point data, mostly produced by numerical simulations). SPERR has one of the highest coding efficiencies among popular lossy compressors, meaning that it usually uses the least amount of storage to satisfy a prescribed error tolerance (e.g., a maximum point-wise error tolerance).

Under the hood, SPERR uses wavelet transforms, SPECK coding, and a custom outlier coding algorithm in its compression pipeline. This combination gives SPERR the flexibility to compress while targeting different quality controls, namely 1) bit-per-pixel (BPP), 2) peak signal-to-noise ratio (PSNR), and 3) point-wise error (PWE). The name SPERR stands for SPeck with ERRor bounding.

Quick Build

SPERR requires 1) a working C++ compiler and 2) CMake to build. On a Unix-like system, the build steps are as follows:

git clone https://github.com/NCAR/SPERR.git     # clone the repo
mkdir SPERR/build                               # create the build directory
cd SPERR/build                                  # enter the build directory
cmake ..                                        # Linux: use cmake to configure the project (with OpenMP)
cmake -DUSE_OMP=OFF ..                          # Mac: use cmake to configure the project (OpenMP disabled)
cmake -DCMAKE_INSTALL_PREFIX=/my/install/dir .. # Optional: specify a directory to install SPERR. The default is /usr/local .
make -j 8                                       # build the project
ctest .                                         # run unit tests; all tests should pass
make install                                    # install the library and CLI tools to the configured directory

Plugin for HDF5

SPERR is available as a dynamically loaded plugin for HDF5 with a registered ID of 32028. This plugin is available at this repo.

Wrapper for Fortran

A Fortran wrapper for SPERR has also been created by ofmla at this repo.

Documentation

SPERR documentation is hosted on GitHub Wiki pages. To get started, one might want to build SPERR from source and explore the compression and decompression utilities for 3D and 2D inputs. One can also use spack to install SPERR with a single command: spack install sperr. Finally, a collection of canonical scientific data sets is available at SDRBench for testing and evaluation purposes.

SPERR also provides programming APIs in C++ and C.

Publication

If SPERR benefits your work, please kindly cite this publication:

@INPROCEEDINGS{10177487,
  author={Li, Shaomeng and Lindstrom, Peter and Clyne, John},
  booktitle={2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)}, 
  title={Lossy Scientific Data Compression With SPERR}, 
  year={2023},
  pages={1007-1017},
  doi={10.1109/IPDPS54959.2023.00104}}

(Author's copy is available here.)

Presentations

  • FZ Workshop Hands-on: Feb 15 2024, Sarasota, FL. (handout and examples)
  • SC'23 Tutorial on lossy scientific data compression: Nov 13 2023, Denver, CO. (slides)
  • IPDPS'23 Lossy Scientific Data Compression With SPERR: May 18 2023, St. Petersburg, FL. (slides)

Contributors

clyne, codereport, robertu94, shaomeng


Issues

Optimization: bitstream container

Currently the bitstream container is a std::vector<bool>.

When encoding, an explicit translation is needed from the bits in std::vector<bool> to a packed output bitstream; when decoding, the reverse needs to happen. This bit packing/unpacking isn't cheap!

If a custom bitstream container is introduced, its contents can be dumped directly when encoding, and bytes read from disk can be placed directly into the container!
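
A minimal sketch of such a container, assuming nothing about SPERR's internals beyond what is described above (names are illustrative): bits are packed into 64-bit words as they arrive, so the backing buffer can be written out or read in wholesale.

#include <cstddef>
#include <cstdint>
#include <vector>

class Bitstream {
 public:
  void write_bit(bool b)
  {
    if (m_pos % 64 == 0)        // current word is full (or none yet)
      m_words.push_back(0);
    m_words.back() |= uint64_t{b} << (m_pos % 64);
    ++m_pos;
  }
  bool read_bit(std::size_t i) const { return (m_words[i / 64] >> (i % 64)) & 1; }

  // The payload is already packed: it can be written to disk as-is, and
  // bytes read from disk can be copied straight into m_words.
  const void* data() const { return m_words.data(); }
  std::size_t num_bytes() const { return m_words.size() * sizeof(uint64_t); }

 private:
  std::vector<uint64_t> m_words;
  std::size_t m_pos = 0;
};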

Optimization: use list of indices to represent significance map

In the first few iterations of SPECK encoding, there are very few positions whose coefficients are deemed significant. In this case, using a list of indices instead of a bit mask could be more performant.

Note, this optimization won't affect decompression speed.

Error bounding not so tight after double-->float cast

Observation: in some cases with input and output both being 32-bit float, reconstructed values can go slightly above the specified error bound.

Cause: currently, when floats are the input data type, the first step is to convert them to doubles. The entire compression pipeline (wavelet transform, normal SPECK encoding, normal SPECK decoding, finding outliers, encoding outliers) is then performed in double precision. In the decoder, the reverse of these steps is performed in double precision too; only right before writing out the decoded volume is every double cast to float. The problem is that some data points don't appear as outliers in double, but they do in float.

Fix: not a trivial effort, since the entire error-coding step assumes that an error is larger than the tolerance.

Impact: probably not too big, because the most a value can exceed the error bound by is the amount of precision lost during the double-->float cast.
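
A standalone illustration of the mechanism (contrived numbers, not SPERR code): a reconstruction that satisfies the tolerance in double can exceed it after the final cast to float.

#include <cmath>
#include <cstdio>

int main()
{
  const double orig = 1000.0;           // hypothetical true value
  const double tol = 0.01;              // prescribed point-wise tolerance
  const double recon = orig + 0.00999;  // reconstruction in double: within tol

  const float recon_f = static_cast<float>(recon);  // the final output cast
  const double err_d = std::fabs(recon - orig);
  const double err_f = std::fabs(static_cast<double>(recon_f) - orig);

  // Near 1000.0, a float ULP is ~6.1e-5, so the cast rounds the value past
  // the bound: err_f comes out to ~0.01000977 > 0.01.
  std::printf("error in double: %.8f (within tolerance)\n", err_d);
  std::printf("error in float : %.8f (%s)\n", err_f,
              err_f > tol ? "exceeds tolerance" : "within tolerance");
}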

Reproducible case: Using the attached cube at 128^3, perform compression and decompression using the following commands, then you'll notice a few values exceeding the specified 0.01 threshold:
./compressor_3d --dims 128 128 128 --chunks 128 128 128 -o /tmp/stream -t 0.01 -q -6 ./Bx.chk1
./decompressor_3d -o Bx.chk1.1e-2 --compare Bx.chk1 /tmp/stream

Bx.chk1.tar.gz

Optimization: single precision DWT

I think using single-precision floating-point values for encoding can improve performance without sacrificing accuracy. That's because SPECK encodes progressively, bit plane by bit plane, and any values that fall into the same bit-plane intervals will produce exactly the same encoding output.

On the other hand, the decoding steps could suffer more from single precision. That's because even if the IDWT starts from values that are identical in single and double precision, rounding errors can still accumulate through the calculation.

Conclusion: experiments are needed to evaluate the tradeoffs.
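
A toy illustration of the accumulation concern; this is just repeated rounding, not a DWT:

#include <cstdio>

int main()
{
  float sf = 0.0f;
  double sd = 0.0;
  // 0.1 is not exactly representable in binary; each addition rounds, and
  // the float sum drifts far more than the double sum over many steps.
  for (long i = 0; i < 10'000'000; ++i) {
    sf += 0.1f;
    sd += 0.1;
  }
  std::printf("float  sum: %.4f\n", sf);  // noticeably off from 1000000
  std::printf("double sum: %.4f\n", sd);  // very close to 1000000
}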

Better return values

Use a special return value to represent executions that complete normally but still need to report a non-default condition. One instance of this kind is "bit budget met."
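
A minimal sketch of such a return type (the names here are hypothetical, not SPERR's actual API):

// Distinguishes real errors from benign early exits.
enum class RTNType {
  Good = 0,      // normal completion
  BitBudgetMet,  // normal completion: the bit budget was reached
  WrongSize,     // error: input dimensions don't match
  IOError        // error: file read/write failed
};

// Callers can then treat benign conditions separately from real errors:
// if (rtn == RTNType::Good || rtn == RTNType::BitBudgetMet) { /* proceed */ }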

Parallelization opportunity

Upon close inspection, it seems refinement_pass can be parallelized with OpenMP with a reasonable amount of effort.

Fortunately, this is also an expensive routine at low compression levels, so the effort should pay off.
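
A hedged sketch of what the parallel loop could look like; the actual refinement_pass surely differs:

#include <cstdint>
#include <vector>

// Each significant pixel is refined independently, and each iteration
// writes to its own pre-sized output slot, so the loop parallelizes.
void refinement_pass(std::vector<double>& coeffs,
                     const std::vector<uint64_t>& lsp,  // significant pixels
                     std::vector<uint8_t>& bits_out,    // one bit per entry
                     double threshold)
{
  // uint8_t rather than std::vector<bool>: packed bits would introduce
  // data races under concurrent writes to adjacent indices.
  bits_out.resize(lsp.size());
#pragma omp parallel for
  for (long long i = 0; i < static_cast<long long>(lsp.size()); ++i) {
    double& c = coeffs[lsp[i]];
    const uint8_t bit = c >= threshold;
    bits_out[i] = bit;
    if (bit)
      c -= threshold;
  }
}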

Create specialized routines for small and large m_LSP, m_LIP cases.

The current code records the locations of significant and insignificant pixels in the m_LSP and m_LIP lists, respectively. These locations are processed in the order they were stored in the two lists, which means the locations accessed in the m_coeff_buf array are essentially random.

This design makes sense when the number of elements in m_LSP and m_LIP is small relative to the size of m_coeff_buf. When it's no longer small, it would make sense to sequentially examine every location in m_coeff_buf and process it accordingly, to avoid random memory access. The threshold for choosing between the two strategies is still to be found, but I imagine it's around two orders of magnitude (see the sketch after the statistics below).

Some statistics of these percentages at each iteration (from one experiment):

./compressor_3d --dims 512 512 512 --qz_level -6 -p ~/Syncthing/Vapor-Dev/BigFiles/wmag512.float
LIP perc : 0.000000,  LSP_old perc : 0.000000,  LSP_new perc : 0.000004
LIP perc : 0.000014,  LSP_old perc : 0.000004,  LSP_new perc : 0.000025
LIP perc : 0.000072,  LSP_old perc : 0.000029,  LSP_new perc : 0.000106
LIP perc : 0.000425,  LSP_old perc : 0.000135,  LSP_new perc : 0.000727
LIP perc : 0.003120,  LSP_old perc : 0.000862,  LSP_new perc : 0.005547
LIP perc : 0.024186,  LSP_old perc : 0.006409,  LSP_new perc : 0.033319
LIP perc : 0.137131,  LSP_old perc : 0.039728,  LSP_new perc : 0.141884
LIP perc : 0.535074,  LSP_old perc : 0.181612,  LSP_new perc : 0.447493
LIP perc : 1.545773,  LSP_old perc : 0.629105,  LSP_new perc : 1.139149
LIP perc : 3.627021,  LSP_old perc : 1.768254,  LSP_new perc : 2.478713
LIP perc : 7.327677,  LSP_old perc : 4.246967,  LSP_new perc : 4.810458
LIP perc : 13.263370,  LSP_old perc : 9.057425,  LSP_new perc : 8.284678
LIP perc : 20.688181,  LSP_old perc : 17.342103,  LSP_new perc : 12.303483
LIP perc : 27.211743,  LSP_old perc : 29.645587,  LSP_new perc : 15.214301
LIP perc : 29.552762,  LSP_old perc : 44.859888,  LSP_new perc : 15.741248
LIP perc : 27.113483,  LSP_old perc : 60.601135,  LSP_new perc : 13.927564
LIP perc : 20.914830,  LSP_old perc : 74.528699,  LSP_new perc : 10.322917
LIP perc : 13.686384,  LSP_old perc : 84.851616,  LSP_new perc : 6.708028
LIP perc : 8.094178,  LSP_old perc : 91.559644,  LSP_new perc : 3.989879
LIP perc : 4.412294,  LSP_old perc : 95.549523,  LSP_new perc : 2.184421
Compression takes time: 29900ms
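
A sketch of the two strategies (function names, the flag array, and the crossover constant are all illustrative):

#include <cstdint>
#include <vector>

// Count LIP entries that are significant, choosing the traversal strategy
// by list density.
uint64_t count_significant(const std::vector<double>& coeffs,
                           const std::vector<uint64_t>& lip,
                           const std::vector<uint8_t>& in_lip,  // per-location flag
                           double threshold)
{
  uint64_t n = 0;
  if (lip.size() * 100 < coeffs.size()) {
    // Sparse list: follow the list, accepting random access into coeffs.
    for (auto idx : lip)
      n += coeffs[idx] >= threshold;
  }
  else {
    // Dense list: scan coeffs in storage order for cache-friendly access,
    // consulting a flag array to skip locations not in the LIP.
    for (uint64_t i = 0; i < coeffs.size(); ++i)
      n += in_lip[i] && coeffs[i] >= threshold;
  }
  return n;
}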

Optimization: convert recursion to loops

The SPECK algorithm is extensively recursive. If the overhead of recursion turns out to be significant, we could convert the algorithm to an iterative form and save some execution time.

Note: this optimization would affect both encoding and decoding.
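
The transformation in its generic form (Set and the partition rule below are placeholders, not SPECK's actual types):

#include <stack>

struct Set { int level = 0; };  // stand-in for a coefficient subset

void process_all(Set root)
{
  // Replace recursive set partitioning with an explicit work stack.
  std::stack<Set> work;
  work.push(root);
  while (!work.empty()) {
    Set s = work.top();
    work.pop();
    // Test s for significance here; instead of recursing into subsets,
    // push them onto the stack:
    if (s.level < 3)                  // placeholder partition rule
      for (int i = 0; i < 8; ++i)     // a 3D set splits into 8 octants
        work.push(Set{s.level + 1});
  }
}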

break LSP into two lists

Currently, it seems refinement_pass_encode() is THE most expensive function. VTune also reports that querying m_LSP_newly is very expensive.

Now I'm thinking that the front of m_LSP holds old pixels and the back holds newly significant ones, with no interleaving between them. If we break m_LSP into two lists, one for newly significant pixels and one for old ones, the expensive query could be eliminated.

Note that in serial mode, this is still the most expensive routine, but it is less clear what the bottleneck inside it is...

QZ_TERM mode performance optimization

In QZ_TERM mode, one can apply multiple quantization steps to an identified significant pixel at once, until the defined QZ level is reached. Thus no refinement step is needed, which brings a huge performance gain.

OpenACC for DWT

Given that the DWT still takes a significant amount of time when decoding, OpenACC offloading might be worth exploring (a toy sketch follows).
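
As a toy sketch only, a single lifting-style loop, not the full CDF 9/7 filter (compile with an OpenACC compiler such as nvc++ -acc):

void lift_update(double* even, const double* odd, double beta, int n)
{
  // 'odd' is assumed to hold n + 1 entries.
#pragma acc parallel loop copy(even[0:n]) copyin(odd[0:n + 1])
  for (int i = 0; i < n; ++i)
    even[i] += beta * (odd[i] + odd[i + 1]);
}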

const constraint on viewing data

A few objects, including CDF97, have a method view_data() that is supposed to provide a read-only view of the underlying data. However, the return type poses no constraint against modifying the underlying values.
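
A minimal sketch of the fix (the class layout is illustrative): return a pointer to const so callers cannot modify the viewed data.

#include <cstddef>
#include <utility>
#include <vector>

class CDF97 {
 public:
  // Before: double* view_data();  -- allows mutation through the view.
  // After: the const return type enforces read-only access.
  std::pair<const double*, std::size_t> view_data() const
  {
    return {m_data.data(), m_data.size()};
  }

 private:
  std::vector<double> m_data;
};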

Use parallel stl algorithms

A few std algorithms in sequential portions of the program could make use of their parallel STL counterparts. Examples are std::mismatch and std::accumulate in speck_helper.cpp; there could be others too.
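
A hedged sketch of the switch, using C++17 execution policies (requires a standard library with parallel algorithm support):

#include <execution>
#include <numeric>
#include <vector>

double sum_coeffs(const std::vector<double>& v)
{
  // std::accumulate has no parallel overload; std::reduce is its
  // parallelizable counterpart (assumes reordering the sum is acceptable).
  return std::reduce(std::execution::par_unseq, v.begin(), v.end(), 0.0);
}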

Optimization: explore vectorization opportunities in DWT

After the recent performance optimizations of the SPECK algorithm, the DWT accounts for an ever larger percentage of execution time. At the same time, the VTune profiler reports that there is little vectorization in the code. Given this, the next step would be to explore vectorization opportunities in the DWT. This should also happen before #2 .
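
For illustration, the kind of loop that auto-vectorizes well: unit stride and no loop-carried dependence (a simplified stand-in for one lifting step, not SPERR's code):

void lift_predict(double* odd, const double* even, double alpha, int n)
{
  // 'even' is assumed to hold n + 1 entries.
#pragma omp simd
  for (int i = 0; i < n; ++i)
    odd[i] += alpha * (even[i] + even[i + 1]);
}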

Optimization: do std structures introduce significant overheads?

Namely, the current implementation uses the following std structures:

  1. std::unique_ptr<>
  2. std::vector<>

To investigate:

  1. Does std::unique_ptr<> introduce a lot of overhead compared to raw pointers?
  2. In cases where std::vector<> could be replaced, i.e., where the container size is fixed, does it make a difference compared to raw pointers?

Optimization: use uint16_t inside of SPECKSet3D

Major benefit: will save a good amount of memory.

Minor benefit: will speed up execution a bit, because it is more cache friendly.

Note: will need a closer examination of the LIS list sizes before making changes.
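
A sketch of the size saving (field names are illustrative; the change requires every coordinate and length to stay below 65536):

#include <cstdint>

struct Set3D_u32 {
  uint32_t start_x, start_y, start_z, len_x, len_y, len_z;
  uint8_t part_level, sig, unused[2];
};  // 28 bytes per set

struct Set3D_u16 {
  uint16_t start_x, start_y, start_z, len_x, len_y, len_z;
  uint8_t part_level, sig;
};  // 14 bytes per set: half the memory, twice the sets per cache line

static_assert(sizeof(Set3D_u32) == 28, "");
static_assert(sizeof(Set3D_u16) == 14, "");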

Q: should a value exactly equal to the threshold be considered significant or not?

Context: in the current implementation (and in QccPack), a value exactly equal to the threshold is considered significant.

Overall impact: exact equality in double precision is very rare, so whether such a value is considered significant won't have a big impact on the algorithm.

Analysis:
-- Imagine a value exactly at the threshold T. If we consider it significant, it costs one bit and is reconstructed as 1.5 x T. If considered insignificant, it consumes no bit and is reconstructed as 0.0.
-- Next iteration, with threshold T/2: if considered significant before, it consumes another (refinement) bit and is reconstructed as 1.25 x T. If considered insignificant before, it is now found significant and reconstructed as 0.75 x T.
-- At this point, when considered significant, it has cost 2 bits to reconstruct the value with 0.25 x T error; when considered insignificant, it has cost 1 bit to reconstruct the value with the same 0.25 x T error.
-- At the end of iteration one, when considered significant, it has cost 1 bit to reconstruct the value with 0.5 x T error; when considered insignificant, it has cost 0 bits to reconstruct the value with 1.0 x T error.

What should we choose?
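
The arithmetic above can be sanity-checked with a toy model of the two policies (not SPECK itself):

#include <cmath>
#include <cstdio>

int main()
{
  const double T = 1.0, v = T;  // a value exactly at the threshold

  // Policy 1: equal-to-threshold counts as significant.
  double r1 = 1.5 * T;          // pass 1: significance bit (1 bit so far)
  r1 -= 0.25 * T;               // pass 2: refinement bit 0 (2 bits total)
  std::printf("significant:   2 bits, error = %.2f x T\n", std::fabs(v - r1));

  // Policy 2: equal-to-threshold counts as insignificant.
  double r2 = 0.0;              // pass 1: no bit consumed
  r2 = 1.5 * (T / 2);           // pass 2: significant at T/2 (1 bit total)
  std::printf("insignificant: 1 bit,  error = %.2f x T\n", std::fabs(v - r2));
}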

Optimization: DWT3D Z direction operates plane by plane

Currently, when DWT3D performs the DWT along the Z direction, it does the following for each column:

  1. Copy a column of values to a buffer
  2. DWT1D using that buffer
  3. Copy back values from the buffer to the column.

The following change might improve memory access locality (see the sketch after this list):

  1. Copy dim_x columns into a buffer, organized as dim_x segments of dim_z values each.
  2. Perform the DWT1D on all segments.
  3. Copy the buffer back to the dim_x columns, then move to the next plane.
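
A hedged sketch of the gather/transform/scatter scheme, assuming an x-fastest memory layout (function and variable names are illustrative):

#include <cstddef>
#include <vector>

// Stand-in for the real 1D wavelet transform on a contiguous segment.
static void dwt1d(double* buf, std::size_t n) { (void)buf; (void)n; }

void dwt3d_z(std::vector<double>& vol, std::size_t dx, std::size_t dy, std::size_t dz)
{
  std::vector<double> buf(dx * dz);  // dx segments of dz values each
  for (std::size_t y = 0; y < dy; ++y) {
    // Gather: each pass over a z-slice touches dx consecutive values in
    // vol, rather than striding through the volume one column at a time.
    for (std::size_t z = 0; z < dz; ++z)
      for (std::size_t x = 0; x < dx; ++x)
        buf[x * dz + z] = vol[z * dx * dy + y * dx + x];

    for (std::size_t x = 0; x < dx; ++x)  // transform all dx segments
      dwt1d(&buf[x * dz], dz);

    for (std::size_t z = 0; z < dz; ++z)  // scatter back, then next plane
      for (std::size_t x = 0; x < dx; ++x)
        vol[z * dx * dy + y * dx + x] = buf[x * dz + z];
  }
}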

Optimization: m_partition_S_XYZ() operates on a class member object

instead of returning an array.

Return value optimization doesn't seem to apply to the returned std::array object, though I had the impression that it would. I need to study RVO a little more.

It seems RVO does work on std::vector<>, which makes sense because creating a new vector is more expensive.
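
A sketch of the member-object variant (names are illustrative):

#include <array>

struct Set3D { /* origin and extent of a subset */ };

class SPECK3D {
  // Before (returns an aggregate; the copy/move may or may not be elided):
  //   std::array<Set3D, 8> m_partition_S_XYZ(const Set3D& s);

  // After: results are written into a reusable member, sidestepping the
  // question of whether (N)RVO applies at all.
  void m_partition_S_XYZ(const Set3D& s)
  {
    (void)s;  // fill m_subsets from s here
  }

  std::array<Set3D, 8> m_subsets;
};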

Optimization: use pixel indices to represent insignificant pixels to reduce runtime memory

Problem description:
Current SPECK runtime memory consumption is a little too big. As an example, compressing a 512^3 cube with bpp=5 uses 6.7 GB of memory.
(./compressor_3d ../test_data/wmag512.float 512 512 512 XYZ 512_bpp_5.bitstream 5)

Investigation:
The last LIS, which is the one that keeps insignificant pixels, is nearly the largest among all the LIS lists, and it can grow to a very large size. In the example above, it contains 78M sets at one point. Given that each set is 28 bytes, those 78M sets take more than 2 GB of memory.

Solution:
Use an array of indices to represent these insignificant pixels, much like what we did with the LSP (see the sketch below).
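
A sketch of the saving, using the 28-byte set size quoted above (field names are illustrative):

#include <cstdint>
#include <vector>

// The 28-byte set descriptor from the investigation above: 78M of these
// cost ~2.2 GB.
struct Set3D {
  uint32_t start_x, start_y, start_z, len_x, len_y, len_z;
  uint8_t part_level, sig, unused[2];
};
static_assert(sizeof(Set3D) == 28, "");

// For single-pixel sets, a flattened 8-byte index carries the same
// information: 78M entries cost ~0.6 GB instead.
using PixelIdx = uint64_t;
std::vector<PixelIdx> insig_pixels;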
