LFZip

Multivariate floating-point time series lossy compression under maximum error distortion

See the update below on selecting the NLMS order for LFZip.

Download and install dependencies

Using Conda (Linux/MacOSX):

LFZip (NLMS prediction mode) is now available on conda through the conda-forge channel. For the neural network prediction mode or to run from source, see the next section.

conda create --name lfzip_env
conda activate lfzip_env
conda config --add channels conda-forge
conda install lfzip

After installation, LFZip (NLMS) can be run using the command lfzip-nlms. (The first two commands above create and activate a conda virtual environment for the install.)
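
As a quick sanity check (assuming the conda install completed successfully), print the command-line help:

lfzip-nlms -h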

From source (Linux/MacOSX):

Download:

git clone https://github.com/shubhamchandak94/LFZip.git

To set up virtual environment and dependencies (on Linux):

cd LFZip/src/
python3 -m venv env
source env/bin/activate
./install.sh

On macOS, you need the gcc compiler to build BSC, the entropy coder used in LFZip. To get it, install gcc@9 using brew as follows:

brew update
brew install gcc@9

and then replace the last command of the Linux instructions with

./install_macos.sh

If you get an error related to the compilation flags, see issue #6, which might help.

The latest TensorFlow packages require AVX instructions. For processors without AVX (e.g., Intel Pentium/Celeron), do the following instead (requires a working conda installation):

cd LFZip/src/
conda create --name no_avx_env python=3.6
conda activate no_avx_env
./install_without_avx.sh

General comments

  • Note that LFZip (NLMS), LFZip (NN) and CA (critical aperture) expect the input to be in numpy array (.npy) format and support only float32 arrays.
  • LFZip (NLMS) additionally supports multivariate time series with at most 256 variables, where the input is a numpy array of shape (k,T), with k the number of variables and T the length of the time series.
  • During compression, the reconstructed time series is also generated as a byproduct and stored as compressed_file.bsc.recon.npy. This can be used to verify the correctness of the compression-decompression pipeline.
  • Examples are shown after the usage descriptions below. A short sketch for preparing a compatible input file follows this list.
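
A minimal sketch (file name and values are illustrative) of preparing a float32 multivariate input in the expected (k,T) layout:

import numpy as np

# Two variables (k=2), one thousand time steps (T=1000); values here are synthetic.
data = np.random.randn(2, 1000).astype(np.float32)
np.save('example_input.npy', data)  # LFZip expects float32 .npy input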

LFZip (NLMS)

Compression/Decompression:

If installed using conda, replace python3 nlms_compress.py with lfzip-nlms.

python3 nlms_compress.py [-h] --mode MODE --infile INFILE --outfile OUTFILE
                        [--NLMS_order N [N ...]] [--mu MU [MU ...]]
                        [--absolute_error MAXERROR [MAXERROR ...]]
                        [--quantization_bytes QUANTIZATION_BYTES [QUANTIZATION_BYTES ...]]

with the parameters:

  -h, --help            show this help message and exit
  --mode MODE, -m MODE  c or d (compress/decompress)
  --infile INFILE, -i INFILE
                        infile .npy/.bsc
  --outfile OUTFILE, -o OUTFILE
                        outfile .bsc/.npy
  --NLMS_order N [N ...], -n N [N ...]
                        order of NLMS filter for compression (default 32) -
                        single value or one per variable
  --mu MU [MU ...]      learning rate of NLMS for compression (default 0.5) -
                        single value or one per variable
  --absolute_error MAXERROR [MAXERROR ...], -a MAXERROR [MAXERROR ...]
                        max allowed error for compression - single value or
                        one per variable
  --quantization_bytes QUANTIZATION_BYTES [QUANTIZATION_BYTES ...], -q QUANTIZATION_BYTES [QUANTIZATION_BYTES ...]
                        number of bytes used to encode quantized error -
                        decides number of quantization levels. Valid values
                        are 1, 2 (default: 2) - single value or one per variable

Note that nlms_compress_python.py is an older and slower version with a similar interface but with the core NLMS compression code written in Python instead of C++.

Update on NLMS order

While the default NLMS order is 32, we have found that for certain datasets the optimal order is 0 (i.e., the prediction step is skipped). We recommend trying both values with the -n flag for a given data source before selecting the order; an example comparison is shown below. We are currently working on making this process automatic.
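
For example, to compare the two settings on a given file (input and output names below are illustrative), compress once with each order and keep whichever yields the smaller file:

python3 nlms_compress.py -m c -i data.npy -o data_n32.bsc -a 0.01 -n 32
python3 nlms_compress.py -m c -i data.npy -o data_n0.bsc -a 0.01 -n 0
ls -l data_n32.bsc data_n0.bsc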

LFZip (NN)

Training a model

First select the appropriate function from models.py, e.g., FC or biGRU (a sketch of what such a function can look like appears after the parameter list below). Then call

python3 nn_trainer.py -train training_data.npy -val validation_data.npy -model_file saved_model.h5 \
-model_name model_name -model_params model_params [-lr lr -noise noise -epochs epochs]

with the parameters:

model_name:   (str) name of model (function name from models.py)
model_params: space separated list of parameters to the function model_name
lr:           (float) learning rate (default 1e-3 for Adam)
noise:        (float) noise added to input during training (uniform[-noise,noise]), default 0
epochs:       (int) number of epochs to train (0 means store random model)
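
For orientation, here is a hypothetical sketch of an FC-style model function, written against the Keras API (the .h5 model files suggest Keras); the actual FC in models.py may differ in details such as activations and output shape:

# Hypothetical illustration only -- not the actual models.py code.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def FC_sketch(input_dim, num_hidden_layers, hidden_layer_size):
    # Input: the last input_dim time-series values; output: a prediction of the next value.
    model = Sequential()
    model.add(Dense(hidden_layer_size, activation='relu', input_shape=(input_dim,)))
    for _ in range(num_hidden_layers - 1):
        model.add(Dense(hidden_layer_size, activation='relu'))
    model.add(Dense(1))
    return model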

Compression/Decompression:

CUDA_VISIBLE_DEVICES="" PYTHONHASHSEED=0 python3 nn_compress.py [-h] --mode MODE --infile INFILE --outfile OUTFILE
                      [--absolute_error MAXERROR] --model_file MODEL_FILE
                      [--quantization_bytes QUANTIZATION_BYTES]
                      [--model_update_period MODEL_UPDATE_PERIOD] [--lr LR]
                      [--epochs NUM_EPOCHS]

with the parameters:

  -h, --help            show this help message and exit
  --mode MODE, -m MODE  c or d (compress/decompress)
  --infile INFILE, -i INFILE
                        infile .npy/.bsc
  --outfile OUTFILE, -o OUTFILE
                        outfile .bsc/.npy
  --absolute_error MAXERROR, -a MAXERROR
                        max allowed error for compression
  --model_file MODEL_FILE
                        model file
  --quantization_bytes QUANTIZATION_BYTES, -q QUANTIZATION_BYTES
                        number of bytes used to encode quantized error -
                        decides number of quantization levels. Valid values
                        are 1, 2 (default: 2)
  --model_update_period MODEL_UPDATE_PERIOD
                        train model (both during compression & decompression)
                        after seeing this many symbols (default: never train)
  --lr LR               learning rate for Adam when model update used
  --epochs NUM_EPOCHS   number of epochs to train when model update used

The CUDA_VISIBLE_DEVICES="" PYTHONHASHSEED=0 environment variables are set so that decompression behaves exactly like compression and generates the correct reconstruction.

Critical aperture (CA)

WARNING: in some cases, the maxerror constraint can be slightly violated (~1e-5) due to numerical precision issues (this applies only to the CA implementation). The realized maximum error can be checked as sketched below.
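
A minimal check of the realized error (file names are illustrative, matching the examples further below):

import numpy as np

orig = np.load('nanopore_test.npy')
recon = np.load('nanopore_test_compressed.bsc.recon.npy')
# For NLMS/NN this should be <= the -a bound; for CA allow ~1e-5 slack.
print(np.max(np.abs(orig - recon)))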

Compression/Decompression:

python3 ca_compress.py [-h] --mode MODE --infile INFILE --outfile OUTFILE
                      [--absolute_error MAXERROR]

with the parameters:

  -h, --help            show this help message and exit
  --mode MODE, -m MODE  c or d (compress/decompress)
  --infile INFILE, -i INFILE
                        infile .npy/.bsc
  --outfile OUTFILE, -o OUTFILE
                        outfile .bsc/.npy
  --absolute_error MAXERROR, -a MAXERROR
                        max allowed error for compression

Other helpful scripts

  • data/dat_to_np.py: convert a .dat file (one plaintext time series value per line) to a .npy file (a minimal equivalent is sketched after this list)
  • data/npy_to_bin.py: convert a .npy file to the binary format used as input to SZ
  • data/bin_to_npy.py: convert a .bin file to a .npy file
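
For reference, a minimal sketch of the .dat-to-.npy conversion (file names illustrative; the bundled script may differ in details):

import numpy as np

# One plaintext value per line -> float32 numpy array saved as .npy.
data = np.loadtxt('input.dat', dtype=np.float32)
np.save('output.npy', data)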

Examples

LFZip (NLMS)

If installed using conda, replace python nlms_compress.py with lfzip-nlms. See also the update above on selecting the NLMS order.

Compression:

python nlms_compress.py -m c -i ../data/evaluation_datasets/dna/nanopore_test.npy -o nanopore_test_compressed.bsc -a 0.01

Decompression:

python nlms_compress.py -m d -i nanopore_test_compressed.bsc -o nanopore_test.decompressed.npy

Verification:

cmp nanopore_test.decompressed.npy nanopore_test_compressed.bsc.recon.npy

LFZip (NN)

Training a fully connected model (FC in models.py) with input_dim = 32, num_hidden_layers = 4, hidden_layer_size = 128 for 5 epochs, with uniform noise in [-0.05, 0.05] added to the input:

python nn_trainer.py -train ../data/evaluation_datasets/dna/nanopore_train.npy -val ../data/evaluation_datasets/dna/nanopore_val.npy -model_name FC -model_params 32 4 128 -model_file nanopore_trained.h5 -noise 0.05 -epochs 5

Compression:

CUDA_VISIBLE_DEVICES="" PYTHONHASHSEED=0 python nn_compress.py -m c -i ../data/evaluation_datasets/dna/nanopore_test.npy -o nanopore_test_compressed.bsc -a 0.01 --model_file nanopore_trained.h5

Decompression:

CUDA_VISIBLE_DEVICES="" PYTHONHASHSEED=0 python nn_compress.py -m d -i nanopore_test_compressed.bsc -o nanopore_test.decompressed.npy --model_file nanopore_trained.h5

Verification:

cmp nanopore_test.decompressed.npy nanopore_test_compressed.bsc.recon.npy

CA

Compression:

python ca_compress.py -m c -i ../data/evaluation_datasets/dna/nanopore_test.npy -o nanopore_test_compressed.bsc -a 0.01

Decompression:

python ca_compress.py -m d -i nanopore_test_compressed.bsc -o nanopore_test.decompressed.npy

Verification:

cmp nanopore_test.decompressed.npy nanopore_test_compressed.bsc.recon.npy

lfzip's Issues

No such file or directory: 'nanopore_test_compressed.bsc.tmp.dir/recon.bin'

I installed the tool using conda. When I run

python nlms_compress.py -m c -i ../data/evaluation_datasets/dna/nanopore_test.npy -o nanopore_test_compressed.bsc -a 0.01

I get the following error:

/bin/sh: 1: /root/LFZip/src/nlms_helper.out: not found
Traceback (most recent call last):
  File "/root/LFZip/src/nlms_compress.py", line 109, in <module>
    with open(tmpfile_recon, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'nanopore_test_compressed.bsc.tmp.dir/recon.bin'

How can I fix this?

FileNotFoundError: [Errno 2] No such file or directory: 'nanopore_test_compressed.bsc.tmp.dir/recon.bin'

When I run nlms_compress.py with the following command:
python nlms_compress.py -m c -i ../data/evaluation_datasets/dna/nanopore_test.npy -o nanopore_test_compressed.bsc -a 0.01

I encounter the following error:

Traceback (most recent call last):
  File "D:\Desktop\lfzip\LFZip-1.1\src\nlms_compress.py", line 109, in <module>
    with open(tmpfile_recon, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'nanopore_test_compressed.bsc.tmp.dir/recon.bin'

Could you please help me resolve this issue? Thank you very much.

Definition of Compression Ratio

Hi,

I'm wondering how the compression ratio is defined. Currently I'm assuming CR = size of the original data file / size of the compressed file. Is that correct? Many thanks!

-march=native not supported on all architectures

In install_macos.sh, the -march flag for the clang compiler doesn't work on all architectures.

To make it work on my M1, I had to substitute it with -mcpu=apple-m1, resulting in the command

g++ nlms_helper.cpp -std=c++11 -o nlms_helper.out -Wall -O3 -mcpu=apple-m1

Issues with error bounds

https://github.com/shubhamchandak94/LFZip/blob/14b2e9f79b1d9bc3f60d5cb69e20eb1529c709b4/src/nlms_compress.py#L85C18-L85C18

The value np.finfo(np.float32).resolution is used to reduce the error bound, but it is not fit for this purpose. The value only represents the approximate spacing between floating-point values around 1, rounded to the nearest power of 10; it is incorrect for magnitudes far from 1.

If we take an error bound of 10 and subtract this value, the resulting new error bound, when interpreted as float32, is the same as the original. Conversely, if we were to take a smaller error bound, the new bound would land multiple float32 steps lower than necessary. The correct function to use here would be np.spacing:

https://numpy.org/doc/stable/reference/generated/numpy.spacing.html

This returns the precise spacing between a value and the next one away from zero (despite what the documentation says).
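
A quick illustration of the difference (using an assumed error bound of 100, where the effect is unambiguous):

import numpy as np

bound = np.float32(100.0)
# Subtracting the fixed resolution (1e-6) is lost to rounding at this magnitude.
print(bound - np.float32(np.finfo(np.float32).resolution) == bound)  # True
# np.spacing gives the actual one-ulp gap, so the subtraction takes effect.
print(bound - np.spacing(bound) == bound)  # False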
