quixdb / squash
Compression abstraction library and utilities
Home Page: https://quixdb.github.io/squash/
License: MIT License
We need to be able to test flushing for plugins which support it (issue #41). This needs to be tested at various chunk sizes, down to a single byte.
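A minimal sketch of such a test, assuming Squash's streaming interface behaves roughly like zlib's; the names (squash_stream_new, squash_stream_process, squash_stream_flush, squash_stream_finish, the next_in/avail_in/next_out/avail_out fields, squash_object_unref) and the exact return codes are assumptions and may need adjusting:

#include <assert.h>
#include <squash/squash.h>  /* header path may differ between versions */

/* Compress `step` bytes at a time and flush after every chunk. Assumes the
 * stream updates next_in/avail_in/next_out/avail_out as it consumes data,
 * like zlib, and that success statuses are positive. */
static void
check_flush (const char* codec_name, const uint8_t* data, size_t data_length, size_t step) {
  uint8_t out[8192];
  SquashStream* stream = squash_stream_new (codec_name, SQUASH_STREAM_COMPRESS, NULL);
  size_t offset = 0;

  stream->next_out = out;
  stream->avail_out = sizeof (out);

  while (offset < data_length) {
    size_t chunk = ((data_length - offset) < step) ? (data_length - offset) : step;

    stream->next_in = data + offset;
    stream->avail_in = chunk;

    assert (squash_stream_process (stream) > 0);
    /* The flush after every chunk is the part being exercised; everything
     * written so far should be decompressible on its own. */
    assert (squash_stream_flush (stream) > 0);

    offset += chunk;
  }

  assert (squash_stream_finish (stream) > 0);
  squash_object_unref (stream);
}

The decompression side would do the same thing with SQUASH_STREAM_DECOMPRESS, with the step size varied from the full buffer down to a single byte, and the output compared against the original data.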
It would be great to have a GIO integration library provide GConverter implementations similar to GZlibCompressor/GZlibDecompressor.
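Implementing GConverter mostly comes down to mapping its convert vfunc onto a Squash stream. A rough sketch of just that vfunc follows; the GObject type boilerplate is omitted, SQUASH_GCONVERTER() and its stream member are made up for illustration, and the squash_stream_* calls and status handling are assumptions about the current streaming API:

static GConverterResult
squash_gconverter_convert (GConverter* converter,
                           const void* inbuf, gsize inbuf_size,
                           void* outbuf, gsize outbuf_size,
                           GConverterFlags flags,
                           gsize* bytes_read, gsize* bytes_written,
                           GError** error) {
  /* SQUASH_GCONVERTER() and the `stream` member are hypothetical. */
  SquashStream* stream = SQUASH_GCONVERTER (converter)->stream;
  SquashStatus status;

  stream->next_in = (const uint8_t*) inbuf;
  stream->avail_in = inbuf_size;
  stream->next_out = (uint8_t*) outbuf;
  stream->avail_out = outbuf_size;

  if (flags & G_CONVERTER_INPUT_AT_END)
    status = squash_stream_finish (stream);
  else if (flags & G_CONVERTER_FLUSH)
    status = squash_stream_flush (stream);
  else
    status = squash_stream_process (stream);

  if (status < 0) {  /* assuming negative statuses indicate failure */
    g_set_error (error, G_IO_ERROR, G_IO_ERROR_FAILED,
                 "Squash returned status %d", (int) status);
    return G_CONVERTER_ERROR;
  }

  *bytes_read = inbuf_size - stream->avail_in;
  *bytes_written = outbuf_size - stream->avail_out;

  /* A real implementation would only return G_CONVERTER_FINISHED once the
   * stream has actually finished, not merely when the flag is set. */
  if (flags & G_CONVERTER_INPUT_AT_END)
    return G_CONVERTER_FINISHED;
  if (flags & G_CONVERTER_FLUSH)
    return G_CONVERTER_FLUSHED;
  return G_CONVERTER_CONVERTED;
}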
It should be easy to use FILE streams for I/O. Maybe something like
SquashStatus squash_codec_compress_file_with_options (SquashCodec* codec, FILE* compressed, FILE* uncompressed, SquashOptions* options);
SquashStatus squash_codec_decompress_file_with_options (SquashCodec* codec, FILE* decompressed, FILE* compressed, SquashOptions* options);
And the associated convenience APIs (squash_compress_file, squash_decompress_file, squash_codec_compress_file, squash_codec_decompress_file, squash_compress_file_with_options, squash_decompress_file_with_options).
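Usage could then be as simple as this; squash_compress_file is one of the proposed convenience wrappers, so the exact signature (including how options are passed) is a guess:

#include <stdio.h>
#include <stdlib.h>
#include <squash/squash.h>  /* header path may differ between versions */

int main (void) {
  FILE* uncompressed = fopen ("input.txt", "rb");
  FILE* compressed = fopen ("input.txt.gz", "wb");

  /* Proposed convenience API; it does not exist yet. */
  SquashStatus status = squash_compress_file ("gzip", compressed, uncompressed, NULL);

  fclose (uncompressed);
  fclose (compressed);
  return (status == SQUASH_OK) ? EXIT_SUCCESS : EXIT_FAILURE;
}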
They're GPL, and AFAICT there is no English documentation, but they're available at http://freearc.org/Research.aspx
There is a related effort, fsbench (filesystem benchmark) at: http://encode.ru/threads/1371-Filesystem-benchmark/page5
It integrates a variety of algorithms with the aim of benchmarking their speed, so some of the work needed to invoke the various codecs has already been done.
Squash should work on Windows. I don't think it would take too much work. Squash depends on pthreads right now, but that's not exposed in the API, so it should be fairly easy to #ifdef around that and use whatever Windows has. Atomic functions for ref counting should probably be implemented using InterlockedCompareExchange. Again, easily #ifdef-able.
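For the atomics, something along these lines would probably be enough; the squash_atomic_* helper names are made up for illustration, and a CAS loop with InterlockedCompareExchange would work just as well as InterlockedIncrement/InterlockedDecrement:

#if defined(_WIN32)
#  include <windows.h>
/* Windows: use the Interlocked* intrinsics. */
static long squash_atomic_inc (volatile long* v) { return InterlockedIncrement (v); }
static long squash_atomic_dec (volatile long* v) { return InterlockedDecrement (v); }
#else
/* Elsewhere: GCC/Clang builtins. */
static long squash_atomic_inc (volatile long* v) { return __sync_add_and_fetch (v, 1); }
static long squash_atomic_dec (volatile long* v) { return __sync_sub_and_fetch (v, 1); }
#endif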
Most codecs are capable of (trying to) compress to a buffer which is smaller than the max compressed size, and simply returning an error if the data doesn't fit. This is nice for applications where you only want to compress data when you'll get a specified ratio (e.g., you'll just store it uncompressed unless you can get a 25% reduction in size).
Right now the codecs which can't do that will just return SQUASH_BUFFER_FULL if the buffer is smaller than the max compressed size.
This would also allow us to create a special case in Squash: if you try to compress with one of these codecs to a buffer which is smaller than the max compressed length, we malloc a buffer of the max compressed length, compress to that, then memcpy the result to the provided buffer if it fits and return SQUASH_BUFFER_FULL if it does not.
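A rough sketch of what that special case could look like inside Squash; the internal entry points and the argument order of the buffer-to-buffer compress call here are assumptions:

#include <stdlib.h>
#include <string.h>

/* Fallback for codecs which cannot compress into a buffer smaller than the
 * maximum compressed size: compress into a scratch buffer of that size and
 * copy the result back only if it fits. */
static SquashStatus
squash_codec_compress_with_scratch (SquashCodec* codec,
                                    uint8_t* compressed, size_t* compressed_length,
                                    const uint8_t* uncompressed, size_t uncompressed_length,
                                    SquashOptions* options) {
  size_t max_size = squash_codec_get_max_compressed_size (codec, uncompressed_length);
  size_t scratch_length = max_size;
  uint8_t* scratch = malloc (max_size);
  SquashStatus status;

  if (scratch == NULL)
    return SQUASH_MEMORY;  /* assumed out-of-memory status */

  /* Argument order of the real buffer-to-buffer API may differ. */
  status = squash_codec_compress_with_options (codec, scratch, &scratch_length,
                                               uncompressed, uncompressed_length, options);

  if (status == SQUASH_OK) {
    if (scratch_length <= *compressed_length) {
      memcpy (compressed, scratch, scratch_length);
      *compressed_length = scratch_length;
    } else {
      status = SQUASH_BUFFER_FULL;
    }
  }

  free (scratch);
  return status;
}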
Admittedly I haven't researched it much, but it seems like pigz and pbzip2 both do basically the same thing: create multiple streams (one per thread), compress a fixed-size block of the input data in each one, flush, then merge the streams. XZ Utils may also be taking the same approach.
If that's all that needs to happen, it is possible Squash could provide an API to do this generically (likely with individual codecs opting in to such a feature) instead of adding plugins for pigz, pbzip2, etc.
On OS X everything now compiles fine; there is just one remaining issue while linking:
Making all in benchmark
CC benchmark-timer.o
CC benchmark-benchmark.o
CCLD benchmark
ld: library not found for -lrt
collect2: error: ld returned 1 exit status
make[1]: *** [benchmark] Error 1
make: *** [all-recursive] Error 1
Removing the -lrt in benchmark/Makefile solved the problem.
Should be pretty trivial.
http://svnweb.freebsd.org/base/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c seems to be the best source I can find.
For many applications it might be nice to have an API to compress a file in a background thread. Squash already includes TinyCThread so it shouldn't require any extra dependencies. Something like
typedef void (*SquashWriteToFileBackgroundFinishFunc) (SquashCodec* codec, FILE* compressed, const uint8_t* uncompressed, size_t uncompressed_length, void* user_data);
void squash_codec_compress_to_file_background (SquashCodec* codec, FILE* compressed, const uint8_t* uncompressed, size_t uncompressed_length, SquashWriteToFileBackgroundFinishFunc cb, void* user_data);
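Usage of the proposed API might then look like this; everything here, including when and on which thread the callback runs, is part of the proposal rather than existing behaviour:

static void
on_compress_finished (SquashCodec* codec, FILE* compressed,
                      const uint8_t* uncompressed, size_t uncompressed_length,
                      void* user_data) {
  /* Called once the compressed data has been written; a real consumer might
   * close the file and free the input buffer here. */
  fclose (compressed);
  fprintf (stderr, "finished compressing %zu bytes\n", uncompressed_length);
}

/* ... */
squash_codec_compress_to_file_background (codec, fopen ("output.gz", "wb"),
                                          data, data_length,
                                          on_compress_finished, NULL);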
squash needs a man page.
https://code.google.com/p/zopfli/
AFAICT it's compressor-only, so it would still require zlib for the decompressor.
A tool similar to gzip/lzop/xz/… would be nice.
Potentially a bit of a PITA since they don't currently ship a library—may need to use a submodule.
It would be nice to have Python bindings.
It would probably be a very good idea to add unit tests to check each plugin for thread safety. Perhaps the best way would be to just reuse the existing tests, but execute each one many times in parallel.
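A sketch of how the existing checks could be fanned out over threads with TinyCThread (thrd_create/thrd_join are the C11-threads API TinyCThread provides; run_single_test stands in for whichever existing per-codec check gets reused, and the include path may differ):

#include "tinycthread.h"

#define N_THREADS    16
#define N_ITERATIONS 64

/* Stand-in for one of the existing per-codec checks. */
extern void run_single_test (SquashCodec* codec);

static int
test_thread_func (void* arg) {
  SquashCodec* codec = (SquashCodec*) arg;
  for (int i = 0; i < N_ITERATIONS; i++)
    run_single_test (codec);
  return 0;
}

static void
check_codec_thread_safety (SquashCodec* codec) {
  thrd_t threads[N_THREADS];

  for (int i = 0; i < N_THREADS; i++)
    thrd_create (&(threads[i]), test_thread_func, codec);
  for (int i = 0; i < N_THREADS; i++)
    thrd_join (threads[i], NULL);
}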
This is a bug in liblzf. I've already sent a message to Marc Lehmann.
When trying to compress n bytes of data whose compressed output requires exactly LZF_MAX_COMPRESSED_SIZE(n) bytes, lzf_compress will return 0 to indicate an error.
Test case:
#include <stdio.h>
#include <stdlib.h>
#include "liblzf/lzf.h"
#define DATA \
"\x83\x0d\x6b\xd9\xb3\x16\x2a\x83" \
"\x94\x52\x22\x23\xe\xe20\x59\x16" \
"\x5f\xc2\xe2\x51\xb4\x19\x40\xf9" \
"\x77\xbd\xcc\xe9\x5f\xa2\x4c\xcb" \
\
"\x2e\x10\x75\x54\x41\xd3\x2d\xd6" \
"\xfb\x5c\x9d\xa8\x9d\x42\x8e\xa2" \
"\x4d\xdf\x9e\x0a\xd0\x39\x61\x23" \
"\x01\x01\x3f\xbd\xa7\xc5\x88\xb1" \
\
"\x3d\x0b\x49\x6f\x10\x61\xb5\x66" \
"\x51\xcc\x96\xe4\xae\xd8\xbd\x76" \
"\x9e\x45\x46\x76\x2f\x18\x0a\xfb" \
"\x10\x0a\x08\xa7\xa4\x61\xda\x12" \
\
"\x75\xe9\xf1\xf9\x53\x48\xec\xea" \
"\x89\xc3\xa1\x71\x37\x46\x0b\x5f" \
"\x38\x37\xbf\x8e\x5d\x6a\x42\xaf" \
"\x3a\xe7\x20\xd1\xaa\x89\xe9\x12" \
\
"\x21\xdf\x78\xbb\x74\x31\xf5\xed" \
"\xa1\x21\xa6\xb9\x2d\x8b\x67\x04" \
"\x34\xf6\x2d\xb4\x6a\x01\xf2\x87" \
"\xe7\x33\x1d\xb8\x74\x1b\xbf\x1b" \
\
"\x33\x26\x1d\x6c\x03\xc4\x28\xa5" \
"\x8e\x56\x64\xff\x17\x0c\xae\xeb" \
"\xdc\x66\xf9\xb0\xfa\xcf\xd3\xbe" \
"\x82\xd8\xe1\xff\x67\x9c\xf3\xfe" \
\
"\xbe\x83\x05\x5b\x0a\x29\x5b\x9b" \
"\x84\x5f\x4a\x12\x6e\xb1\x10\xb0" \
"\xe3\xc2\x10\xe2\xf1\x83\x26\xb9" \
"\x2a\x95\x68\x4e\x27\x76\x95\x6a" \
\
"\xb7\x4f\xc7\x0f\x22\x43\x64\x7d" \
"\x04\x03\x5c\x1b\x35\x0f\x9f\x00" \
"\x44\xde\xa4\x41\x27\x8d\x90\xe0" \
"\xb2\xec\x4a\xd4\xbe"
#define DATA_LEN 253
int main (int argc, char** argv) {
  unsigned int res;
  unsigned int max_compressed_length = LZF_MAX_COMPRESSED_SIZE(DATA_LEN);
  void* compressed = malloc (max_compressed_length);

  res = lzf_compress (DATA, DATA_LEN, compressed, max_compressed_length);
  if (res == 0) {
    fprintf (stderr, "Error\n");
    return -1;
  }

  fprintf (stdout, "Compressed %u -> %u (of %u)\n", DATA_LEN, res, max_compressed_length);

  return 0;
}
Adding 1 to max_compressed_length results in success, so a potential work-around (if this isn't resolved quickly in liblzf) would be to just add one to the value in the lzf plugin.
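If liblzf isn't fixed quickly, that work-around could be as small as over-reporting the maximum size by one byte; something like the following inside the lzf plugin's max-compressed-size callback (the function name here is illustrative, not necessarily what the plugin actually calls it):

/* Work-around for the liblzf off-by-one: report one extra byte so
 * lzf_compress never sees an exactly-full output buffer. */
static size_t
squash_lzf_get_max_compressed_size (SquashCodec* codec, size_t uncompressed_length) {
  return LZF_MAX_COMPRESSED_SIZE (uncompressed_length) + 1;
}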
It's a great benefit to see many compressors integrated under a common and flexible interface. I can imagine a variety of use cases where the caller is interested in trying various algorithms and parameters. When those various algorithms are run, the resultant compressed size for that data is a valuable data point; this comes automatically as part of the return code. However, many compression comparison pages which describe the difference between algorithms and the parameters include additional information: the (de)compression time and memory used in these steps.
I'd value having a common and easy-to-use interface to the effective CPU and memory overhead for this particular algorithm and provided options, maybe as simple as a structure which is populated with that data. I'd like to learn how much memory was allocated during the process of (de)compression, and weigh that accordingly. A separate ticket item (another enhancement) is to deliberately put a hard limit on the available memory for an algorithm (wrapping its malloc () and free ()). But for now it would be helpful to learn how much memory the algorithm did use with the provided compression parameters. This could be returned in a very basic way. Also (again, willing to put it into another ticket), maybe the number of threads spawned can be provided as well as specified.
Finally, it would be helpful to see how long the (de)compression takes, measured in terms of CPU and real time. This wouldn't need to take over the aim of a benchmarking tool (which would run the test repeatedly to get a good estimate, or may flush caches before its run, and so on), but a basic time used during the algorithm would be a helpful statistic, measured at the finest resolution that is supported.
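As a starting point, even a plain struct filled in by the (de)compression call would cover most of this; purely illustrative, nothing below exists in Squash today:

/* Hypothetical per-operation statistics; every name and field here is only an
 * illustration of the request. */
typedef struct {
  size_t       peak_memory;  /* peak bytes allocated by the codec */
  double       cpu_time;     /* seconds of CPU time spent in the codec */
  double       wall_time;    /* seconds of real time spent in the codec */
  unsigned int threads_used; /* number of threads the codec spawned */
} SquashOperationStats;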
lzop uses a container which it would be nice to add support for. I can't find any documentation, but it doesn't look too complicated and lzop is open-source.
The ZPAQ plugin needs some work. It will occasionally get caught in a loop or crash. Reliably reproducible by setting step_size to 70 in the buffer_to_buffer_decompress_with_stream in tests/check-plugins.c.
Debian doesn't ship a package like Fedora does, so we should include a copy of liblzf (preferably as a git submodule), but prefer the system copy if one exists.
Another dual GPL/proprietary tool.
The following 253-byte binary input (shown in hexadecimal, byte by byte, concatenated) is reported to require at most 256 bytes with the "doboz" plugin (according to squash_get_max_compressed_size ()), but 258 bytes are actually allocated/returned. I came across this random stream of bytes while trying to test error paths. I can always just add a few extra bytes of safety beyond what that function returns, but I wanted to mention a case that causes an issue:
830d6bd9b3162a8394522223ee2059165fc2e251b41940f977bdcce95fa24ccb2e10755441d32dd6fb5c9da89d428ea24ddf9e0ad039612301013fbda7c588b13d0b496f1061b56651cc96e4aed8bd769e4546762f180afb100a08a7a461da1275e9f1f95348ecea89c3a17137460b5f3837bf8e5d6a42af3ae720d1aa89e91221df78bb7431f5eda121a6b92d8b670434f62db46a01f287e7331db8741bbf1b33261d6c03c428a58e5664ff170caeebdc66f9b0facfd3be82d8e1ff679cf3febe83055b0a295b9b845f4a126eb110b0e3c210e2f18326b92a95684e2776956ab74fc70f2243647d04035c1b350f9f0044dea441278d90e0b2ec4ad4be
There are a few issues when trying to compile on OS X. The first one is the following:
Making all in zpaq
CXX libsquash0.1_plugin_zpaq_la-libzpaq.lo
zpaq/libzpaq.cpp:22:21: fatal error: windows.h: No such file or directory
#include <windows.h>
^
compilation terminated.
make[2]: *** [libsquash0.1_plugin_zpaq_la-libzpaq.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1
zpaq uses windows.h, which is obviously not found on OS X; maybe an extra check in the Squash makefile could fix that?
After disabling zpaq, this is the second error, which is more problematic:
Making all in benchmark
CC benchmark-benchmark.o
benchmark.c:19:4: warning: #warning Unable to find a way to measure CPU time [-Wcpp]
# warning Unable to find a way to measure CPU time
^
benchmark.c: In function 'benchmark_timer_start':
benchmark.c:82:3: error: implicit declaration of function 'clock_gettime' [-Werror=implicit-function-declaration]
if (clock_gettime (CLOCK_REALTIME, &(timer->start_wall)) != 0) {
^
benchmark.c:82:3: warning: nested extern declaration of 'clock_gettime' [-Wnested-externs]
benchmark.c:82:22: error: 'CLOCK_REALTIME' undeclared (first use in this function)
if (clock_gettime (CLOCK_REALTIME, &(timer->start_wall)) != 0) {
^
benchmark.c:82:22: note: each undeclared identifier is reported only once for each function it appears in
benchmark.c: In function 'benchmark_timer_stop':
benchmark.c:20:42: error: 'CLOCK_REALTIME' undeclared (first use in this function)
# define SQUASH_BENCHMARK_CLOCK_CPUTIME CLOCK_REALTIME
^
benchmark.c:94:22: note: in expansion of macro 'SQUASH_BENCHMARK_CLOCK_CPUTIME'
if (clock_gettime (SQUASH_BENCHMARK_CLOCK_CPUTIME, &(timer->end_cpu)) != 0) {
^
cc1: some warnings being treated as errors
make[1]: *** [benchmark-benchmark.o] Error 1
make: *** [all-recursive] Error 1
On OS X, clock_gettime apparently does not exist, but there are alternatives, and this thread gives plenty of them: http://stackoverflow.com/questions/5167269/clock-gettime-alternative-in-mac-os-x
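For the wall-clock portion, one of the usual OS X alternatives from that thread is mach_absolute_time(); a small sketch (the benchmark_timer_now name is just for illustration, and CPU time would still need something like getrusage()):

#include <mach/mach_time.h>

/* Returns the current time in seconds using the Mach absolute clock. */
static double
benchmark_timer_now (void) {
  static mach_timebase_info_data_t timebase = { 0, 0 };
  if (timebase.denom == 0)
    mach_timebase_info (&timebase);
  return ((double) mach_absolute_time ()) * timebase.numer / timebase.denom / 1e9;
}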
We should have bindings for the Go programming language.
Node.js seems like a good candidate for bindings.
Documentation for the Node.js zlib binding is at http://nodejs.org/api/zlib.html
Some codecs can flush (zlib, bzip2, lzma, …), but most can't. Just as squash_codec_knows_uncompressed_size tests whether or not the uncompressed size can be extracted from a compressed buffer without decoding the whole thing, there should be a squash_codec_can_flush (or similar) function to detect whether or not a codec knows how to flush.
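The prototype would presumably mirror the existing predicate; a sketch of the proposed function, which does not exist yet:

#include <stdbool.h>

/* Proposed predicate, mirroring squash_codec_knows_uncompressed_size. */
bool squash_codec_can_flush (SquashCodec* codec);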
Patch: https://gist.github.com/nemequ/6460041
Depends on issues #37 and #39. SHARC also fails, but I need to re-write the SHARC plugin, if it still fails after that and I can verify that the problem is in SHARC, not Squash, I'll file a bug there, too.
https://github.com/mathieuchartier/mcm
Doesn't currently compile on Linux (mathieuchartier/mcm#1).
It would be good to have a PHP extension.
The Squash benchmark is very interesting due to the number of plugins and architectures included, but I find that with the faster algorithms the results are much less predictable because they run for such a short period of time.
For more accurate numbers I would suggest introducing a minimum runtime, for example ensuring that every algorithm runs for at least 30 seconds.
It can easily be done with a timer and a repeating loop: just repeat the current test until the elapsed time reaches the minimum runtime. The performance is then calculated by dividing the total time by the number of loops completed.
I think this will make the benchmark results much more stable.
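A minimal sketch of that loop; run_once and elapsed_seconds stand in for the benchmark's existing per-codec invocation and timer, and the 30-second figure is just the example above:

#define MIN_RUNTIME 30.0  /* seconds */

/* Repeat the measured operation until at least MIN_RUNTIME seconds have
 * elapsed, then report the average time per iteration. */
static double
benchmark_with_min_runtime (void) {
  unsigned int iterations = 0;
  double start = elapsed_seconds ();
  double now;

  do {
    run_once ();
    iterations++;
    now = elapsed_seconds ();
  } while ((now - start) < MIN_RUNTIME);

  return (now - start) / iterations;
}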