SPERR

Overview

SPERR (pronounced like spur) is a lossy compressor for scientific data (2D or 3D floating-point data, mostly produced by numerical simulations). SPERR has one of the highest coding efficiencies among popular lossy compressors, meaning that it usually uses the least amount of storage to satisfy a prescribed error tolerance (e.g., a maximum point-wise error tolerance).

Under the hood, SPERR uses wavelet transforms, SPECK coding, and a custom outlier coding algorithm in its compression pipeline. This combination gives SPERR the flexibility to compress while targeting different quality controls, namely 1) bit-per-pixel (BPP), 2) peak signal-to-noise ratio (PSNR), and 3) point-wise error (PWE). The name SPERR stands for SPeck with ERRor bounding.

Quick Build

SPERR requires 1) a working C++ compiler and 2) CMake to build. On a Unix-like system, the build steps are as follows:

git clone https://github.com/NCAR/SPERR.git     # clone the repo
mkdir SPERR/build                               # create the build directory
cd SPERR/build                                  # enter the build directory
cmake ..                                        # Linux: use cmake to configure the project (with OpenMP)
cmake -DUSE_OMP=OFF ..                          # Mac: use cmake to configure the project (OpenMP disabled)
cmake -DCMAKE_INSTALL_PREFIX=/my/install/dir .. # Optional: specify a directory to install SPERR. The default is /usr/local .
make -j 8                                       # build the project
ctest .                                         # run unit tests; all tests should pass
make install                                    # install the library and CLI tools to the configured directory

Plugin for HDF5

SPERR is available as a dynamically loaded plugin for HDF5 with a registered ID of 32028. This plugin is available at this repo.

Wrapper for Fortran

A Fortran wrapper for SPERR has also been created by ofmla at this repo.

Documentation

SPERR documentation is hosted on GitHub Wiki pages. To get started, one might want to build SPERR from source and explore the compression and decompression utilities for 3D and 2D inputs. One can also use spack to install SPERR with a single command: spack install sperr. Finally, a collection of canonical scientific data sets is available at SDRBench for testing and evaluation purposes.

SPERR also provides programming APIs in C++ and C.

Publication

If SPERR benefits your work, please kindly cite this publication:

@INPROCEEDINGS{10177487,
  author={Li, Shaomeng and Lindstrom, Peter and Clyne, John},
  booktitle={2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)}, 
  title={Lossy Scientific Data Compression With SPERR}, 
  year={2023},
  pages={1007-1017},
  doi={10.1109/IPDPS54959.2023.00104}}

(Author's copy is available here.)

Presentations

  • FZ Workshop Hands-on: Feb 15 2024, Sarasota, FL. (handout and examples)
  • SC'23 Tutorial on lossy scientific data compression: Nov 13 2023, Denver, CO. (slides)
  • IPDPS'23 Lossy Scientific Data Compression With SPERR: May 18 2023, St. Petersburg, FL. (slides)

Contributors

clyne, codereport, robertu94, shaomeng


Issues

Optimization: bitstream container

Currently the bitstream container is a std::vector<bool>.

When encoding, an explicit translation is needed from the bits in std::vector<bool> to a packed output bitstream; when decoding, the reverse needs to happen. This bit packing/unpacking isn't cheap!

If a custom bitstream container is introduced, its contents can be dumped directly when encoding, and bytes read from disk can be placed directly into the container!
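
A minimal sketch of such a container, assuming nothing about SPERR's internals beyond what is described above (names are illustrative): bits are packed into 64-bit words as they arrive, so the backing buffer can be written out or read in wholesale.

#include <cstddef>
#include <cstdint>
#include <vector>

class Bitstream {
 public:
  void write_bit(bool b)
  {
    if (m_pos % 64 == 0)        // current word is full (or none yet)
      m_words.push_back(0);
    m_words.back() |= uint64_t{b} << (m_pos % 64);
    ++m_pos;
  }
  bool read_bit(std::size_t i) const { return (m_words[i / 64] >> (i % 64)) & 1; }

  // The payload is already packed: it can be written to disk as-is, and
  // bytes read from disk can be copied straight into m_words.
  const void* data() const { return m_words.data(); }
  std::size_t num_bytes() const { return m_words.size() * sizeof(uint64_t); }

 private:
  std::vector<uint64_t> m_words;
  std::size_t m_pos = 0;
};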

Optimization: use list of indices to represent significance map

In the first few iterations of SPECK encoding, there are very few positions whose coefficients are deemed significant. In this case, using a list of indices instead of a bit mask could be more performant.

Note, this optimization won't affect decompression speed.

Error bounding not so tight after double-->float cast

Observation: in some cases with input and output both being 32-bit float, reconstructed values can go slightly above the specified error bound.

Cause: currently, when floats are the input data type, the first step is to convert them to doubles. The entire compression pipeline (wavelet transform, normal SPECK encoding, normal SPECK decoding, finding outliers, encoding outliers) is then performed in double precision. In the decoder, the reverse of these steps is performed in double precision too; only right before writing out the decoded volume is every double cast to float. The problem is that some data points don't appear as outliers in double, but they do in float.

Fix: not a trivial effort, since the entire error-coding step assumes that an error is larger than the tolerance.

Impact: probably not too big, because the most a value can exceed the error bound by is the amount of precision lost during the double-->float cast.
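
A standalone illustration of the mechanism (contrived numbers, not SPERR code): a reconstruction that satisfies the tolerance in double can exceed it after the final cast to float.

#include <cmath>
#include <cstdio>

int main()
{
  const double orig = 1000.0;           // hypothetical true value
  const double tol = 0.01;              // prescribed point-wise tolerance
  const double recon = orig + 0.00999;  // reconstruction in double: within tol

  const float recon_f = static_cast<float>(recon);  // the final output cast
  const double err_d = std::fabs(recon - orig);
  const double err_f = std::fabs(static_cast<double>(recon_f) - orig);

  // Near 1000.0, a float ULP is ~6.1e-5, so the cast rounds the value past
  // the bound: err_f comes out to ~0.01000977 > 0.01.
  std::printf("error in double: %.8f (within tolerance)\n", err_d);
  std::printf("error in float : %.8f (%s)\n", err_f,
              err_f > tol ? "exceeds tolerance" : "within tolerance");
}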

Reproducible case: Using the attached cube at 128^3, perform compression and decompression using the following commands, then you'll notice a few values exceeding the specified 0.01 threshold:
./compressor_3d --dims 128 128 128 --chunks 128 128 128 -o /tmp/stream -t 0.01 -q -6 ./Bx.chk1
./decompressor_3d -o Bx.chk1.1e-2 --compare Bx.chk1 /tmp/stream

Bx.chk1.tar.gz

Optimization: single precision DWT

I think using single-precision floating-point values for encoding can improve performance without sacrificing accuracy. That's because SPECK encodes progressively, bit plane by bit plane, and any values that fall into the same bit-plane intervals will produce exactly the same encoding output.

On the other hand, the decoding steps could suffer more from single precision. That's because even if the IDWT starts from values that are identical in single and double precision, rounding errors can still accumulate through the calculation.

Conclusion: experiments are needed to evaluate the tradeoffs.
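
A toy illustration of the accumulation concern; this is just repeated rounding, not a DWT:

#include <cstdio>

int main()
{
  float sf = 0.0f;
  double sd = 0.0;
  // 0.1 is not exactly representable in binary; each addition rounds, and
  // the float sum drifts far more than the double sum over many steps.
  for (long i = 0; i < 10'000'000; ++i) {
    sf += 0.1f;
    sd += 0.1;
  }
  std::printf("float  sum: %.4f\n", sf);  // noticeably off from 1000000
  std::printf("double sum: %.4f\n", sd);  // very close to 1000000
}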

Better return values

Use a special return value to represent executions that complete normally but still need to report a non-default condition. One instance of this kind is "bit budget met."
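
A minimal sketch of such a return type (the names here are hypothetical, not SPERR's actual API):

// Distinguishes real errors from benign early exits.
enum class RTNType {
  Good = 0,      // normal completion
  BitBudgetMet,  // normal completion: the bit budget was reached
  WrongSize,     // error: input dimensions don't match
  IOError        // error: file read/write failed
};

// Callers can then treat benign conditions separately from real errors:
// if (rtn == RTNType::Good || rtn == RTNType::BitBudgetMet) { /* proceed */ }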

Parallelization opportunity

Upon close inspection, it seems refinement_pass can be parallelized with OpenMP with a reasonable amount of effort.

Fortunately, this is also an expensive routine at low compression levels, so the effort should pay off.
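
A hedged sketch of what the parallel loop could look like; the actual refinement_pass surely differs:

#include <cstdint>
#include <vector>

// Each significant pixel is refined independently, and each iteration
// writes to its own pre-sized output slot, so the loop parallelizes.
void refinement_pass(std::vector<double>& coeffs,
                     const std::vector<uint64_t>& lsp,  // significant pixels
                     std::vector<uint8_t>& bits_out,    // one bit per entry
                     double threshold)
{
  // uint8_t rather than std::vector<bool>: packed bits would introduce
  // data races under concurrent writes to adjacent indices.
  bits_out.resize(lsp.size());
#pragma omp parallel for
  for (long long i = 0; i < static_cast<long long>(lsp.size()); ++i) {
    double& c = coeffs[lsp[i]];
    const uint8_t bit = c >= threshold;
    bits_out[i] = bit;
    if (bit)
      c -= threshold;
  }
}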

Create specialized routines for small and large m_LSP, m_LIP cases.

The current code records the locations of significant and insignificant pixels in the m_LSP and m_LIP lists, respectively. These locations are processed in the order they were stored in the two lists, which means the locations accessed in the m_coeff_buf array are essentially random.

This design makes sense when the number of elements in m_LSP and m_LIP is small relative to the size of m_coeff_buf. When it's no longer small, it would make sense to sequentially examine every location in m_coeff_buf and process it accordingly, to avoid random memory access. The threshold for choosing between the two strategies is still to be found, but I imagine it's around two orders of magnitude (see the sketch after the statistics below).

Some statistics of these percentages at each iteration (from one experiment):

./compressor_3d --dims 512 512 512 --qz_level -6 -p ~/Syncthing/Vapor-Dev/BigFiles/wmag512.float
LIP perc : 0.000000,  LSP_old perc : 0.000000,  LSP_new perc : 0.000004
LIP perc : 0.000014,  LSP_old perc : 0.000004,  LSP_new perc : 0.000025
LIP perc : 0.000072,  LSP_old perc : 0.000029,  LSP_new perc : 0.000106
LIP perc : 0.000425,  LSP_old perc : 0.000135,  LSP_new perc : 0.000727
LIP perc : 0.003120,  LSP_old perc : 0.000862,  LSP_new perc : 0.005547
LIP perc : 0.024186,  LSP_old perc : 0.006409,  LSP_new perc : 0.033319
LIP perc : 0.137131,  LSP_old perc : 0.039728,  LSP_new perc : 0.141884
LIP perc : 0.535074,  LSP_old perc : 0.181612,  LSP_new perc : 0.447493
LIP perc : 1.545773,  LSP_old perc : 0.629105,  LSP_new perc : 1.139149
LIP perc : 3.627021,  LSP_old perc : 1.768254,  LSP_new perc : 2.478713
LIP perc : 7.327677,  LSP_old perc : 4.246967,  LSP_new perc : 4.810458
LIP perc : 13.263370,  LSP_old perc : 9.057425,  LSP_new perc : 8.284678
LIP perc : 20.688181,  LSP_old perc : 17.342103,  LSP_new perc : 12.303483
LIP perc : 27.211743,  LSP_old perc : 29.645587,  LSP_new perc : 15.214301
LIP perc : 29.552762,  LSP_old perc : 44.859888,  LSP_new perc : 15.741248
LIP perc : 27.113483,  LSP_old perc : 60.601135,  LSP_new perc : 13.927564
LIP perc : 20.914830,  LSP_old perc : 74.528699,  LSP_new perc : 10.322917
LIP perc : 13.686384,  LSP_old perc : 84.851616,  LSP_new perc : 6.708028
LIP perc : 8.094178,  LSP_old perc : 91.559644,  LSP_new perc : 3.989879
LIP perc : 4.412294,  LSP_old perc : 95.549523,  LSP_new perc : 2.184421
Compression takes time: 29900ms
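
A sketch of the two strategies (function names, the flag array, and the crossover constant are all illustrative):

#include <cstdint>
#include <vector>

// Count LIP entries that are significant, choosing the traversal strategy
// by list density.
uint64_t count_significant(const std::vector<double>& coeffs,
                           const std::vector<uint64_t>& lip,
                           const std::vector<uint8_t>& in_lip,  // per-location flag
                           double threshold)
{
  uint64_t n = 0;
  if (lip.size() * 100 < coeffs.size()) {
    // Sparse list: follow the list, accepting random access into coeffs.
    for (auto idx : lip)
      n += coeffs[idx] >= threshold;
  }
  else {
    // Dense list: scan coeffs in storage order for cache-friendly access,
    // consulting a flag array to skip locations not in the LIP.
    for (uint64_t i = 0; i < coeffs.size(); ++i)
      n += in_lip[i] && coeffs[i] >= threshold;
  }
  return n;
}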

Optimization: convert recursion to loops

The SPECK algorithm is extensively recursive. If the overhead of recursion turns out to be significant, we could convert the algorithm to an iterative form and save some execution time.

Note: this optimization would affect both encoding and decoding.
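
The transformation in its generic form (Set and the partition rule below are placeholders, not SPECK's actual types):

#include <stack>

struct Set { int level = 0; };  // stand-in for a coefficient subset

void process_all(Set root)
{
  // Replace recursive set partitioning with an explicit work stack.
  std::stack<Set> work;
  work.push(root);
  while (!work.empty()) {
    Set s = work.top();
    work.pop();
    // Test s for significance here; instead of recursing into subsets,
    // push them onto the stack:
    if (s.level < 3)                  // placeholder partition rule
      for (int i = 0; i < 8; ++i)     // a 3D set splits into 8 octants
        work.push(Set{s.level + 1});
  }
}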

break LSP into two lists

Currently, it seems refinement_pass_encode() is THE most expensive function. VTune also reports that querying m_LSP_newly is very expensive.

Now I'm thinking that the front of m_LSP holds old pixels and the back holds newly significant ones, with no interleaving between them. If we break m_LSP into two lists, one for newly significant pixels and one for old ones, the expensive query could be eliminated.

Note that in serial mode, this is still the most expensive routine, but it is less clear what the bottleneck inside it is...

QZ_TERM mode performance optimization

In QZ_TERM mode, one can apply multiple quantization steps to an identified significant pixel at once, until the defined QZ level is reached. Thus no refinement step is needed, which brings a huge performance gain.

OpenACC for DWT

Given that the DWT still takes a significant amount of time when decoding, OpenACC offloading might be worth exploring (a toy sketch follows).
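
As a toy sketch only, a single lifting-style loop, not the full CDF 9/7 filter (compile with an OpenACC compiler such as nvc++ -acc):

void lift_update(double* even, const double* odd, double beta, int n)
{
  // 'odd' is assumed to hold n + 1 entries.
#pragma acc parallel loop copy(even[0:n]) copyin(odd[0:n + 1])
  for (int i = 0; i < n; ++i)
    even[i] += beta * (odd[i] + odd[i + 1]);
}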

const constraint on viewing data

A few objects, including CDF97, have a method view_data() that is supposed to provide a read-only view of the underlying data. However, the return type poses no constraint against modifying the underlying values.
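
A minimal sketch of the fix (the class layout is illustrative): return a pointer to const so callers cannot modify the viewed data.

#include <cstddef>
#include <utility>
#include <vector>

class CDF97 {
 public:
  // Before: double* view_data();  -- allows mutation through the view.
  // After: the const return type enforces read-only access.
  std::pair<const double*, std::size_t> view_data() const
  {
    return {m_data.data(), m_data.size()};
  }

 private:
  std::vector<double> m_data;
};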

Use parallel stl algorithms

A few std algorithms in sequential portions of the program could make use of their parallel STL counterparts. Examples are std::mismatch and std::accumulate in speck_helper.cpp; there could be others too.
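
A hedged sketch of the switch, using C++17 execution policies (requires a standard library with parallel algorithm support):

#include <execution>
#include <numeric>
#include <vector>

double sum_coeffs(const std::vector<double>& v)
{
  // std::accumulate has no parallel overload; std::reduce is its
  // parallelizable counterpart (assumes reordering the sum is acceptable).
  return std::reduce(std::execution::par_unseq, v.begin(), v.end(), 0.0);
}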

Optimization: explore vectorization opportunities in DWT

After the recent performance optimizations of the SPECK algorithm, the DWT accounts for an ever larger percentage of execution time. At the same time, the VTune profiler reports that there is little vectorization in the code. Given this, the next step would be to explore vectorization opportunities in the DWT. This should also happen before #2 .
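
For illustration, the kind of loop that auto-vectorizes well: unit stride and no loop-carried dependence (a simplified stand-in for one lifting step, not SPERR's code):

void lift_predict(double* odd, const double* even, double alpha, int n)
{
  // 'even' is assumed to hold n + 1 entries.
#pragma omp simd
  for (int i = 0; i < n; ++i)
    odd[i] += alpha * (even[i] + even[i + 1]);
}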

Optimization: do std structures introduce significant overheads?

Namely, the current implementation uses the following std structures:

  1. std::unique_ptr<>
  2. std::vector<>

To investigate:

  1. Does std::unique_ptr<> introduce a lot of overhead compared to raw pointers?
  2. In cases where std::vector<> could be replaced, i.e., where the container size is fixed, does it make a difference compared to raw pointers?

Optimization: use uint16_t inside of SPECKSet3D

Major benefit: will save a good amount of memory.

Minor benefit: will speed up execution a bit, because it is more cache friendly.

Note: will need a closer examination of the LIS list sizes before making changes.
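
A sketch of the size saving (field names are illustrative; the change requires every coordinate and length to stay below 65536):

#include <cstdint>

struct Set3D_u32 {
  uint32_t start_x, start_y, start_z, len_x, len_y, len_z;
  uint8_t part_level, sig, unused[2];
};  // 28 bytes per set

struct Set3D_u16 {
  uint16_t start_x, start_y, start_z, len_x, len_y, len_z;
  uint8_t part_level, sig;
};  // 14 bytes per set: half the memory, twice the sets per cache line

static_assert(sizeof(Set3D_u32) == 28, "");
static_assert(sizeof(Set3D_u16) == 14, "");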

Q: should a value exactly equal to the threshold be considered significant or not?

Context: in the current implementation (and in QccPack), a value exactly equal to the threshold is considered significant.

Overall impact: exact equality in double precision is very rare, so whether such a value is considered significant won't have a big impact on the algorithm.

Analysis:
-- Imagine a value exactly at the threshold T. If we consider it significant, it costs one bit and is reconstructed as 1.5 x T. If considered insignificant, it consumes no bit and is reconstructed as 0.0.
-- Next iteration, with threshold T/2: if considered significant before, it consumes another (refinement) bit and is reconstructed as 1.25 x T. If considered insignificant before, it is now found significant and reconstructed as 0.75 x T.
-- At this point, when considered significant, it has cost 2 bits to reconstruct the value with 0.25 x T error; when considered insignificant, it has cost 1 bit to reconstruct the value with the same 0.25 x T error.
-- At the end of iteration one, when considered significant, it has cost 1 bit to reconstruct the value with 0.5 x T error; when considered insignificant, it has cost 0 bits to reconstruct the value with 1.0 x T error.

What should we choose?
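
The arithmetic above can be sanity-checked with a toy model of the two policies (not SPECK itself):

#include <cmath>
#include <cstdio>

int main()
{
  const double T = 1.0, v = T;  // a value exactly at the threshold

  // Policy 1: equal-to-threshold counts as significant.
  double r1 = 1.5 * T;          // pass 1: significance bit (1 bit so far)
  r1 -= 0.25 * T;               // pass 2: refinement bit 0 (2 bits total)
  std::printf("significant:   2 bits, error = %.2f x T\n", std::fabs(v - r1));

  // Policy 2: equal-to-threshold counts as insignificant.
  double r2 = 0.0;              // pass 1: no bit consumed
  r2 = 1.5 * (T / 2);           // pass 2: significant at T/2 (1 bit total)
  std::printf("insignificant: 1 bit,  error = %.2f x T\n", std::fabs(v - r2));
}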

Optimization: DWT3D Z direction operates plane by plane

Currently, when DWT3D performs the DWT along the Z direction, it does the following for each column:

  1. Copy a column of values to a buffer
  2. DWT1D using that buffer
  3. Copy back values from the buffer to the column.

The following change might improve memory access locality (see the sketch after this list):

  1. Copy dim_x columns into a buffer, organized as dim_x segments of dim_z values each.
  2. Perform the DWT1D on all segments.
  3. Copy the buffer back to the dim_x columns, then move to the next plane.
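
A hedged sketch of the gather/transform/scatter scheme, assuming an x-fastest memory layout (function and variable names are illustrative):

#include <cstddef>
#include <vector>

// Stand-in for the real 1D wavelet transform on a contiguous segment.
static void dwt1d(double* buf, std::size_t n) { (void)buf; (void)n; }

void dwt3d_z(std::vector<double>& vol, std::size_t dx, std::size_t dy, std::size_t dz)
{
  std::vector<double> buf(dx * dz);  // dx segments of dz values each
  for (std::size_t y = 0; y < dy; ++y) {
    // Gather: each pass over a z-slice touches dx consecutive values in
    // vol, rather than striding through the volume one column at a time.
    for (std::size_t z = 0; z < dz; ++z)
      for (std::size_t x = 0; x < dx; ++x)
        buf[x * dz + z] = vol[z * dx * dy + y * dx + x];

    for (std::size_t x = 0; x < dx; ++x)  // transform all dx segments
      dwt1d(&buf[x * dz], dz);

    for (std::size_t z = 0; z < dz; ++z)  // scatter back, then next plane
      for (std::size_t x = 0; x < dx; ++x)
        vol[z * dx * dy + y * dx + x] = buf[x * dz + z];
  }
}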

Optimization: m_partition_S_XYZ() operates on a class member object

instead of returning an array.

Return value optimization doesn't seem to apply to the returned std::array object, though I had the impression that it would. I need to study RVO a little more.

It seems RVO does work on std::vector<>, which makes sense because creating a new vector is more expensive.
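
A sketch of the member-object variant (names are illustrative):

#include <array>

struct Set3D { /* origin and extent of a subset */ };

class SPECK3D {
  // Before (returns an aggregate; the copy/move may or may not be elided):
  //   std::array<Set3D, 8> m_partition_S_XYZ(const Set3D& s);

  // After: results are written into a reusable member, sidestepping the
  // question of whether (N)RVO applies at all.
  void m_partition_S_XYZ(const Set3D& s)
  {
    (void)s;  // fill m_subsets from s here
  }

  std::array<Set3D, 8> m_subsets;
};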

Optimization: use pixel indices to represent insignificant pixels to reduce runtime memory

Problem description:
Current SPECK runtime memory consumption is a little too big. As an example, compressing a 512^3 cube with bpp=5 uses 6.7 GB of memory.
(./compressor_3d ../test_data/wmag512.float 512 512 512 XYZ 512_bpp_5.bitstream 5)

Investigation:
The last LIS, which is the one that keeps insignificant pixels, is nearly the largest among all the LIS lists, and it can grow to a very large size. In the example above, it contains 78M sets at one point. Given that each set is 28 bytes, those 78M sets take more than 2 GB of memory.

Solution:
Use an array of indices to represent these insignificant pixels, much like what we did with the LSP (see the sketch below).
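
A sketch of the saving, using the 28-byte set size quoted above (field names are illustrative):

#include <cstdint>
#include <vector>

// The 28-byte set descriptor from the investigation above: 78M of these
// cost ~2.2 GB.
struct Set3D {
  uint32_t start_x, start_y, start_z, len_x, len_y, len_z;
  uint8_t part_level, sig, unused[2];
};
static_assert(sizeof(Set3D) == 28, "");

// For single-pixel sets, a flattened 8-byte index carries the same
// information: 78M entries cost ~0.6 GB instead.
using PixelIdx = uint64_t;
std::vector<PixelIdx> insig_pixels;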
