rballester / tthresh
C++ compressor for multidimensional grid data using the Tucker decomposition
License: GNU Lesser General Public License v3.0
What input file format does tthresh expect?
Thanks.
TTHRESH won't build on a Mac using AppleClang 11.0.0.11000033. The compiler complains about missing __float128 and __float80, which I believe are gcc extensions.
I tested tthresh on Argonne's Bebop cluster (https://www.lcrc.anl.gov/systems/resources/bebop/), but the results look wrong.
No matter how I set -r or -p, the output is always identical and the decompressed data has a very low PSNR, as shown below:
[-@beboplogin3 data]$ tthresh -i QCLOUDf01.dat -t float -s 500 500 100 -r 0.0001 -o /tmp/QCLOUDf01.dat.ttresh.out -c /tmp/QCLOUDf01.dat.tthresh
oldbits = 800000000, newbits = 13072, compressionratio = 61199.5, bpv = 0.00052288
eps = 0.394786, rmse = 6.97548e-06, psnr = 34.9702
[-@beboplogin3 data]$ tthresh -i QCLOUDf01.dat -t float -s 500 500 100 -r 0.01 -o /tmp/QCLOUDf01.dat.ttresh.out -c /tmp/QCLOUDf01.dat.tthresh
oldbits = 800000000, newbits = 13072, compressionratio = 61199.5, bpv = 0.00052288
eps = 0.394786, rmse = 6.97548e-06, psnr = 34.9702
[-@beboplogin3 data]$ tthresh -i QCLOUDf01.dat -t float -s 500 500 100 -p 60 -o /tmp/QCLOUDf01.dat.ttresh.out -c /tmp/QCLOUDf01.dat.tthresh
oldbits = 800000000, newbits = 13072, compressionratio = 61199.5, bpv = 0.00052288
eps = 0.394786, rmse = 6.97548e-06, psnr = 34.9702
[sdi@beboplogin3 data]$
I then installed tthresh on my laptop (Fedora25), and the compression looks normal.
tthresh -i QCLOUDf01.dat -t float -s 500 500 100 -p 60 -o /tmp/QCLOUDf01.dat.out -c /tmp/QCLOUDf01.t
oldbits = 800000000, newbits = 643728, compressionratio = 1242.76, bpv = 0.0257491
eps = 0.0193903, rmse = 3.42608e-07, psnr = 61.1457
FYI, on the Bebop cluster, I checked the version of tthresh using git log, shown below:
commit 595655c
Author: Rafael Ballester-Ripoll
Date: Thu Sep 5 09:42:25 2019 +0200
Update encode.hpp
commit 930ce68
Merge: 506431e 5e421d4
Author: rballester
Date: Fri Jun 28 12:20:36 2019 +0200
.......
So, it should be the latest version.
I used the default compiler, gcc 4.8. I also recompiled with gcc 7.3, but the problem persists.
uname -a:
Linux beboplogin3 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
I am using a server of this cluster (Bebop), so more detailed information can be found here: https://www.lcrc.anl.gov/systems/resources/bebop/
Any idea why it does not work correctly on Bebop?
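For reference when reading the summary lines above, the reported figures follow from simple arithmetic: oldbits is the number of grid values times the bits per input value, compressionratio is oldbits / newbits, and bpv is newbits divided by the number of values. Below is a minimal C++ sketch that reproduces those numbers, assuming a 32-bit float input. The struct and function names are made up for illustration and are not taken from tthresh.

```cpp
#include <cstdint>

// Illustrative names only; not part of tthresh.
struct CompressionStats {
    double compression_ratio;  // oldbits / newbits
    double bits_per_value;     // newbits / number of grid values
};

// num_values: total number of grid points (product of the -s sizes)
// bits_per_input_value: 32 for float input, 64 for double input
// newbits: size of the compressed stream in bits
CompressionStats summarize(std::uint64_t num_values,
                           unsigned bits_per_input_value,
                           std::uint64_t newbits) {
    const double oldbits =
        static_cast<double>(num_values) * bits_per_input_value;
    return { oldbits / static_cast<double>(newbits),
             static_cast<double>(newbits) / static_cast<double>(num_values) };
}
```

With 500 x 500 x 100 values at 32 bits each and newbits = 13072, this gives a ratio of about 61199.5 and 0.00052288 bits per value, which is what all three Bebop runs print regardless of the requested target.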
One cannot specify target rms error in scientific notation, e.g., as -r 1e-6. Rather, this must be specified as -r 0.000001. This becomes cumbersome when integrating tthresh with other tools (like awk) that output very small or large floating-point numbers in scientific notation.
Can the is_number() function be replaced with something based off sscanf(), which parses numbers in any format supported by printf()? For example, one could specify a number like 2^-23 (single-precision machine epsilon) as 0x1p-23.
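As a rough illustration of the request, here is a minimal sketch of such a check built on std::strtod, which accepts decimal, scientific, and hexadecimal floating-point notation. The helper name parse_number is hypothetical, and this is not how tthresh's is_number() is written; it only shows one way the parsing could be done.

```cpp
#include <cmath>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of tthresh): parse a complete string as a
// finite double. std::strtod accepts "0.000001", "1e-6" and "0x1p-23".
bool parse_number(const std::string& text, double& value) {
    if (text.empty()) return false;
    const char* begin = text.c_str();
    char* end = nullptr;
    const double parsed = std::strtod(begin, &end);
    // Reject input with nothing parsed or with trailing characters ("1e-6x").
    if (end == begin || *end != '\0') return false;
    // strtod also accepts "inf" and "nan" and returns infinity on overflow;
    // none of these is a usable error target.
    if (!std::isfinite(parsed)) return false;
    value = parsed;
    return true;
}
```

Note that std::strtod honors the current C locale, so the decimal separator is "." only in the default "C" locale.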
I am doing some compression experiments with a 3D data set generated from a memoryless Gaussian source with zero mean and unit variance, i.e., the data set has no autocorrelation and is essentially incompressible. Using the target PSNR setting (-p option), I cannot get TTHRESH to deliver a PSNR of more than 190 dB, and the corresponding rate does not exceed 31 bits.
The plot below shows the accuracy gain as a function of rate, where the accuracy gain is defined as
α = log₂(σ / E) - R
Here σ = 1 is the standard deviation of the source data, E is the RMS error, and R is the rate. Rate-distortion theory says that we cannot encode such data using an error less than E in fewer than log₂(σ / E) bits/value, and hence we expect α ≤ 0. Moreover, for each additional bit stored, we expect E to halve, so the accuracy gain ought to be constant. At high rates, finite roundoff errors (e.g., when converting from a compressed representation to IEEE floating point) may cause α to dip as E converges to some small, finite value, as exhibited by the zfp curve. The other three compressors all show surprising behavior, though for the sake of this discussion, I am interested only in why TTHRESH gets stuck close to R = 31 regardless of the PSNR setting.
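The definition above is easy to evaluate per measurement. The following is a small sketch; the function names are illustrative, and the lower bound is clamped at zero for the case E ≥ σ, where no bits are needed.

```cpp
#include <algorithm>
#include <cmath>

// Accuracy gain alpha = log2(sigma / E) - R, where
//   sigma: standard deviation of the source data,
//   rmse:  RMS error E of the reconstruction (must be > 0),
//   rate:  R in bits per value.
double accuracy_gain(double sigma, double rmse, double rate) {
    return std::log2(sigma / rmse) - rate;
}

// Lower bound on the rate (bits/value) for a memoryless Gaussian source
// at RMS error `rmse`; zero once the error reaches sigma.
double min_rate_gaussian(double sigma, double rmse) {
    return std::max(0.0, std::log2(sigma / rmse));
}
```

For example, with σ = 1, E = 0.25 and R = 2 the gain is exactly 0 (the bound is met), while the same error at R = 3 gives α = -1.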
Hello!
Is it possible to reduce the memory used by auxiliary data structures during compression/decompression, e.g., by splitting the input data into subsets (say, a hundred slices each) and processing the whole dataset part by part?
What side effects could such an approach have?
Best regards.
Hi!
Thank you for this project; it looks very promising.
I am trying to evaluate the functionality provided by tthresh.
I can compile it on Linux without any problem, but my target is the Windows platform, where I get a number of compilation errors.
So I would greatly appreciate any help getting it to work on this OS.
Here is what I did:
zlib_io.hpp:
Cannot open include file: 'zlib.h': No such file or directory
When I change CMakeLists.txt from set(ZLIB_INCLUDE_DIRS zlib ${ZLIB_BINARY_DIR}) to set(ZLIB_INCLUDE_DIRS external/zlib ${ZLIB_BINARY_DIR}), I can go further.
compress.hpp:
Cannot open include file: 'unistd.h': No such file or directory
I have made the following modification with success:
#ifdef WIN32
#include <io.h>
#else
#include <unistd.h>
#endif
After that I got stuck on the "and", "or" and "not" operators, which the VS compiler rejected.
The following definitions in CMakeLists.txt helped:
add_definitions (-Dand=&& -Dor=|| -Dnot=!)
tucker.hpp:
'M_PI': undeclared identifier
Solved by:
#define _USE_MATH_DEFINES
#include <math.h>
Next, several errors of the following kind in compress.hpp were fixed by changing the loop index type to int64_t:
'i': index variable in OpenMP 'for' statement must have signed integral type
Then the compiler refused to build zlib_io.hpp because a string is passed to SET_BINARY_MODE instead of a FILE*:
'int fileno(FILE *)': cannot convert argument 1 from 'const char *' to 'FILE *'
Solution:
Comment out all lines with SET_BINARY_MODE and add the "b" flag to each fopen call in the file.
And finally I got a working app!
But during compression it hung completely...
The cause was an infinite loop in encode.hpp: the "high" variable becomes zero because MAX_CODE is equal to 0.
I tried to fix it like this:
uint64_t MAX_CODE = (static_cast<uint64_t>(1)<<CODE_VALUE_BITS)-1;
After that, compression finishes correctly, but I am still unable to decompress the data back to its original state...
The two attached files contain the full output I got during compression/decompression using the following commands (I use BostonTeapot.raw from tc18.org for my experiments):
tthresh -i BostonTeapot.raw -c BostonTeapot.raw_comp -t uchar -s 256 256 178 -p 0 -d
tthresh -o BostonTeapot.raw_unc -c BostonTeapot.raw_comp -d
Best regards.
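A note on the MAX_CODE fix above: one plausible explanation, assuming the original expression shifted a long-typed constant such as 1UL, is that unsigned long is only 32 bits wide under MSVC (even on 64-bit Windows), so shifting it by 32 or more bits does not produce the intended value. The sketch below builds the mask in 64-bit arithmetic on every platform. The helper name is hypothetical, and the width 33 is only an illustrative value, since the actual CODE_VALUE_BITS is not stated in this report.

```cpp
#include <cstdint>

// Hypothetical helper: a mask with the lowest `bits` bits set, computed
// in 64-bit arithmetic regardless of how wide `long` is on the platform.
constexpr std::uint64_t low_bits_mask(unsigned bits) {
    return bits >= 64 ? ~std::uint64_t{0}
                      : (std::uint64_t{1} << bits) - 1;
}

// Illustrative width only: a 33-bit mask does not fit in 32 bits.
static_assert(low_bits_mask(33) == 0x1FFFFFFFFULL,
              "33-bit mask requires 64-bit arithmetic");
```

Any other constants in the coder that are derived from the same shift would need the same treatment for the encoder and decoder to agree.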
A SIGSEGV occurs when using the -k option to skip bytes prior to creating the tensor for compression.
I created two binary files with a 64 x 64 x 64 float32 array using numpy; one is written to disk using numpy.save(), including a 128-byte header that describes the array, and the other is just raw bytes written to disk:
-rw-rw-r-- 1 dkell dkell 1048576 Dec 8 19:17 data/3grid-gamma.float32.bin
-rw-rw-r-- 1 dkell dkell 1048704 Dec 7 19:21 data/3grid-gamma.float32.npy
Compressing the numpy file, selecting the float datatype, specifying the array dimensions, and requesting a PSNR of 40, yields a segfault:
> tthresh -i data/3grid-gamma.float32.npy -k 128 -c z3grid-gamma.float32.40.npy.tthresh -v -d -t float -s 64 64 64 -p 40
/***** Compression: 3D tensor of size 64 x 64 x 64 *****/
Loading and casting input data... Elapsed time: 13.489ms
Input statistics: min = 0, max = 22.2277, norm = 2806.03
We target eps = 0.0202788, rmse = 0.111139, psnr = 40
Tucker decomposition...
Unfold (1)... Project (1)...
Unfold (2)... Project (2)...
Unfold (3)... Project (3)...
Fold...
Elapsed time: 42.136ms
Preliminaries... Elapsed time: 16.678ms
Encoding core...
Encoding core's bit plane p = 63
Encoding core's bit plane p = 62
Encoding core's bit plane p = 61
Encoding core's bit plane p = 60
Encoding core's bit plane p = 59
Encoding core's bit plane p = 58
Encoding core's bit plane p = 57
Encoding core's bit plane p = 56
Encoding core's bit plane p = 55
Encoding core's bit plane p = 54
Encoding core's bit plane p = 53
Encoding core's bit plane p = 52
Encoding core's bit plane p = 51
Encoding core's bit plane p = 50
Encoding core's bit plane p = 49 <- breakpoint: coefficient 138180
Elapsed time: 37.892ms
Computing ranks... Elapsed time: 1.494ms
Compressed tensor ranks: 64 64 64
oldbits = 8388608, newbits = 1671248, compressionratio = 5.01937, bpv = 6.37531
Segmentation fault (core dumped)
and compressing the binary file without the header is successful:
tthresh -i data/3grid-gamma.float32.bin -c z3grid-gamma.float32.40.bin.tthresh -v -d -t float -s 64 64 64 -p 40
/***** Compression: 3D tensor of size 64 x 64 x 64 *****/
Loading and casting input data... Elapsed time: 3.146ms
Input statistics: min = 0.141503, max = 22.2277, norm = 2806.19
We target eps = 0.0201486, rmse = 0.110431, psnr = 40
Tucker decomposition...
Unfold (1)... Project (1)...
Unfold (2)... Project (2)...
Unfold (3)... Project (3)...
Fold...
Elapsed time: 24.269ms
Preliminaries... Elapsed time: 17.823ms
Encoding core...
Encoding core's bit plane p = 63
Encoding core's bit plane p = 62
Encoding core's bit plane p = 61
Encoding core's bit plane p = 60
Encoding core's bit plane p = 59
Encoding core's bit plane p = 58
Encoding core's bit plane p = 57
Encoding core's bit plane p = 56
Encoding core's bit plane p = 55
Encoding core's bit plane p = 54
Encoding core's bit plane p = 53
Encoding core's bit plane p = 52
Encoding core's bit plane p = 51
Encoding core's bit plane p = 50
Encoding core's bit plane p = 49 <- breakpoint: coefficient 139767
Elapsed time: 41.645ms
Computing ranks... Elapsed time: 1.12ms
Compressed tensor ranks: 64 64 64
oldbits = 8388608, newbits = 1672080, compressionratio = 5.01687, bpv = 6.37848
The following Python script can be used to generate the data:
import numpy as np
gen = np.random.default_rng(12345)
a = gen.gamma(5, size=(64, 64, 64))
a = a.astype(np.float32)
np.save("3grid-gamma.float32.npy", a, allow_pickle=False, fix_imports=False)
with open("3grid-gamma.float32.bin", "wb") as b:
    b.write(a.tobytes())
In tthresh.cpp, a delete[] statement is attempting to free memory at the address pointed to by double *data - skip_bytes:
/***************************/
// The real work starts here
/***************************/

double *data = NULL;
if (input_flag)
    data = compress(d, input_file, compressed_file, io_type, target, target_value, skip_bytes, verbose_flag, debug_flag);
if (output_flag)
    decompress(d, compressed_file, output_file, data, cutout, autocrop_flag, verbose_flag, debug_flag);
//delete[] (data-skip_bytes);
delete[] data;

return 0;
Removing the skip_bytes offset from the delete[] allows the program to run without a SIGSEGV:
tthresh -i data/3grid-gamma.float32.npy -k 128 -c z3grid-gamma.float32.40.npy.tthresh -v -d -t float -s 64 64 64 -p 40
/***** Compression: 3D tensor of size 64 x 64 x 64 *****/
Loading and casting input data... Elapsed time: 3.421ms
Input statistics: min = 0, max = 22.2277, norm = 2806.03
We target eps = 0.0202788, rmse = 0.111139, psnr = 40
Tucker decomposition...
Unfold (1)... Project (1)...
Unfold (2)... Project (2)...
Unfold (3)... Project (3)...
Fold...
Elapsed time: 259.872ms
Preliminaries... Elapsed time: 23.653ms
Encoding core...
Encoding core's bit plane p = 63
Encoding core's bit plane p = 62
Encoding core's bit plane p = 61
Encoding core's bit plane p = 60
Encoding core's bit plane p = 59
Encoding core's bit plane p = 58
Encoding core's bit plane p = 57
Encoding core's bit plane p = 56
Encoding core's bit plane p = 55
Encoding core's bit plane p = 54
Encoding core's bit plane p = 53
Encoding core's bit plane p = 52
Encoding core's bit plane p = 51
Encoding core's bit plane p = 50
Encoding core's bit plane p = 49 <- breakpoint: coefficient 138180
Elapsed time: 197.02ms
Computing ranks... Elapsed time: 44.122ms
Compressed tensor ranks: 64 64 64
oldbits = 8388608, newbits = 1671248, compressionratio = 5.01937, bpv = 6.37531
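More generally, the pointer-offset problem goes away if the buffer that is freed is always exactly the buffer that was allocated. Here is a minimal sketch of that idea, assuming float32 input. The function load_float32 is hypothetical and is not tthresh's loader; it skips the header while reading and returns a std::vector that owns its memory, so no manual delete[] is needed.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical loader (not tthresh's actual code): skip `skip_bytes` of
// header, read `count` float32 values, and widen them to double.
std::vector<double> load_float32(const std::string& path,
                                 std::size_t skip_bytes,
                                 std::size_t count) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    // The header is skipped in the file, not by offsetting a pointer.
    in.seekg(static_cast<std::streamoff>(skip_bytes), std::ios::beg);
    std::vector<float> raw(count);
    const std::streamsize wanted =
        static_cast<std::streamsize>(count * sizeof(float));
    in.read(reinterpret_cast<char*>(raw.data()), wanted);
    if (in.gcount() != wanted)
        throw std::runtime_error("file too short: " + path);
    // The returned vector owns its storage and frees it automatically.
    return std::vector<double>(raw.begin(), raw.end());
}
```

For the file above this would be called as load_float32("data/3grid-gamma.float32.npy", 128, 64 * 64 * 64).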
I can make a quick PR if it's desired.
Thanks!
TTHRESH cannot compress data composed of all zeros. For example, consider compressing a 10x10x10 array of doubles:
dd bs=8000 count=1 if=/dev/zero of=/tmp/input.bin
tthresh -t double -s 10 10 10 -p 100 -i /tmp/input.bin -o /tmp/output.bin -c /tmp/output.tthresh
Segmentation fault
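One thing that may be relevant here: the verbose logs elsewhere on this page ("We target eps = ..., rmse = ..., psnr = ...") are consistent with psnr = 20 log10((max - min) / (2 rmse)) and eps = rmse sqrt(N) / norm, where N is the number of values. These formulas are inferred from the printed numbers, not taken from the tthresh source. For an all-zero input both the range and the norm are zero, so the targets are undefined. Whether this is what actually triggers the segfault is not established here. A sketch of the conversion with an explicit guard, using made-up names:

```cpp
#include <cmath>
#include <cstddef>

// Illustrative names; formulas inferred from tthresh's verbose output.
struct Targets {
    double eps;   // relative error: rmse * sqrt(N) / norm
    double rmse;  // root-mean-square error
    double psnr;  // 20 * log10(range / (2 * rmse))
    bool valid;   // false when the input is constant or all zeros
};

Targets targets_from_psnr(double psnr, double lo, double hi,
                          double norm, std::size_t n) {
    const double range = hi - lo;
    // Constant input (range == 0) or zero norm: nothing meaningful to target.
    if (!(range > 0.0) || !(norm > 0.0) || n == 0)
        return {0.0, 0.0, psnr, false};
    const double rmse = range / (2.0 * std::pow(10.0, psnr / 20.0));
    const double eps = rmse * std::sqrt(static_cast<double>(n)) / norm;
    return {eps, rmse, psnr, true};
}
```

With min = 0, max = 22.2277, norm = 2806.03, N = 64^3 and a PSNR of 40, this gives rmse ≈ 0.111139 and eps ≈ 0.0202788, matching the log from the -k report above.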