quixdb / squash
Compression abstraction library and utilities
Home Page: https://quixdb.github.io/squash/
License: MIT License
We need to be able to test flushing for plugins which support it (issue #41). This needs to be tested at various chunk sizes, down to a single byte.
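A minimal sketch of such a test, assuming Squash's streaming interface behaves roughly like zlib's; the names (squash_stream_new, squash_stream_process, squash_stream_flush, squash_stream_finish, the next_in/avail_in/next_out/avail_out fields, squash_object_unref) and the exact return codes are assumptions and may need adjusting:

#include <assert.h>
#include <squash/squash.h>  /* header path may differ between versions */

/* Compress `step` bytes at a time and flush after every chunk. Assumes the
 * stream updates next_in/avail_in/next_out/avail_out as it consumes data,
 * like zlib, and that success statuses are positive. */
static void
check_flush (const char* codec_name, const uint8_t* data, size_t data_length, size_t step) {
  uint8_t out[8192];
  SquashStream* stream = squash_stream_new (codec_name, SQUASH_STREAM_COMPRESS, NULL);
  size_t offset = 0;

  stream->next_out = out;
  stream->avail_out = sizeof (out);

  while (offset < data_length) {
    size_t chunk = ((data_length - offset) < step) ? (data_length - offset) : step;

    stream->next_in = data + offset;
    stream->avail_in = chunk;

    assert (squash_stream_process (stream) > 0);
    /* The flush after every chunk is the part being exercised; everything
     * written so far should be decompressible on its own. */
    assert (squash_stream_flush (stream) > 0);

    offset += chunk;
  }

  assert (squash_stream_finish (stream) > 0);
  squash_object_unref (stream);
}

The decompression side would do the same thing with SQUASH_STREAM_DECOMPRESS, with the step size varied from the full buffer down to a single byte, and the output compared against the original data.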
It would be great to have a GIO integration library provide GConverter implementations similar to GZlibCompressor/GZlibDecompressor.
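Implementing GConverter mostly comes down to mapping its convert vfunc onto a Squash stream. A rough sketch of just that vfunc follows; the GObject type boilerplate is omitted, SQUASH_GCONVERTER() and its stream member are made up for illustration, and the squash_stream_* calls and status handling are assumptions about the current streaming API:

static GConverterResult
squash_gconverter_convert (GConverter* converter,
                           const void* inbuf, gsize inbuf_size,
                           void* outbuf, gsize outbuf_size,
                           GConverterFlags flags,
                           gsize* bytes_read, gsize* bytes_written,
                           GError** error) {
  /* SQUASH_GCONVERTER() and the `stream` member are hypothetical. */
  SquashStream* stream = SQUASH_GCONVERTER (converter)->stream;
  SquashStatus status;

  stream->next_in = (const uint8_t*) inbuf;
  stream->avail_in = inbuf_size;
  stream->next_out = (uint8_t*) outbuf;
  stream->avail_out = outbuf_size;

  if (flags & G_CONVERTER_INPUT_AT_END)
    status = squash_stream_finish (stream);
  else if (flags & G_CONVERTER_FLUSH)
    status = squash_stream_flush (stream);
  else
    status = squash_stream_process (stream);

  if (status < 0) {  /* assuming negative statuses indicate failure */
    g_set_error (error, G_IO_ERROR, G_IO_ERROR_FAILED,
                 "Squash returned status %d", (int) status);
    return G_CONVERTER_ERROR;
  }

  *bytes_read = inbuf_size - stream->avail_in;
  *bytes_written = outbuf_size - stream->avail_out;

  /* A real implementation would only return G_CONVERTER_FINISHED once the
   * stream has actually finished, not merely when the flag is set. */
  if (flags & G_CONVERTER_INPUT_AT_END)
    return G_CONVERTER_FINISHED;
  if (flags & G_CONVERTER_FLUSH)
    return G_CONVERTER_FLUSHED;
  return G_CONVERTER_CONVERTED;
}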
It should be easy to use FILE streams for I/O. Maybe something like
SquashStatus squash_codec_compress_file_with_options (SquashCodec* codec, FILE* compressed, FILE* uncompressed, SquashOptions* options);
SquashStatus squash_codec_decompress_file_with_options (SquashCodec* codec, FILE* decompressed, FILE* compressed, SquashOptions* options);
And the associated convenience APIs (squash_compress_file, squash_decompress_file, squash_codec_compress_file, squash_codec_decompress_file, squash_compress_file_with_options, squash_decompress_file_with_options).
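Usage could then be as simple as this; squash_compress_file is one of the proposed convenience wrappers, so the exact signature (including how options are passed) is a guess:

#include <stdio.h>
#include <stdlib.h>
#include <squash/squash.h>  /* header path may differ between versions */

int main (void) {
  FILE* uncompressed = fopen ("input.txt", "rb");
  FILE* compressed = fopen ("input.txt.gz", "wb");

  /* Proposed convenience API; it does not exist yet. */
  SquashStatus status = squash_compress_file ("gzip", compressed, uncompressed, NULL);

  fclose (uncompressed);
  fclose (compressed);
  return (status == SQUASH_OK) ? EXIT_SUCCESS : EXIT_FAILURE;
}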
They're GPL, and AFAICT there is no English documentation, but they're available at http://freearc.org/Research.aspx
There is a related effort, fsbench (filesystem benchmark) at: http://encode.ru/threads/1371-Filesystem-benchmark/page5
It integrates a variety of algorithms with the aim of benchmarking their speed, so some of the work needed to invoke the various codecs has already been done.
Squash should work on Windows. I don't think it would take too much work. Squash depends on pthreads right now, but that's not exposed in the API, so it should be fairly easy to #ifdef around that and use whatever Windows has. Atomic functions for ref counting should probably be implemented using InterlockedCompareExchange. Again, easily #ifdef-able.
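For the atomics, something along these lines would probably be enough; the squash_atomic_* helper names are made up for illustration, and a CAS loop with InterlockedCompareExchange would work just as well as InterlockedIncrement/InterlockedDecrement:

#if defined(_WIN32)
#  include <windows.h>
/* Windows: use the Interlocked* intrinsics. */
static long squash_atomic_inc (volatile long* v) { return InterlockedIncrement (v); }
static long squash_atomic_dec (volatile long* v) { return InterlockedDecrement (v); }
#else
/* Elsewhere: GCC/Clang builtins. */
static long squash_atomic_inc (volatile long* v) { return __sync_add_and_fetch (v, 1); }
static long squash_atomic_dec (volatile long* v) { return __sync_sub_and_fetch (v, 1); }
#endif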
Most codecs are capable of (trying to) compress to a buffer which is smaller than the max compressed size, and simply returning an error if the data doesn't fit. This is nice for applications where you only want to compress data when you'll get a specified ratio (e.g., you'll just store it uncompressed unless you can get a 25% reduction in size).
Right now the codecs which can't do that will just return SQUASH_BUFFER_FULL if the buffer is smaller than the max compressed size.
This would also allow us to create a special case in Squash: if you try to compress with one of these codecs to a buffer which is smaller than the max compressed length, we malloc a buffer of the max compressed length, compress to that, then memcpy the result to the provided buffer if it fits and return SQUASH_BUFFER_FULL if it does not.
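A rough sketch of what that special case could look like inside Squash; the internal entry points and the argument order of the buffer-to-buffer compress call here are assumptions:

#include <stdlib.h>
#include <string.h>

/* Fallback for codecs which cannot compress into a buffer smaller than the
 * maximum compressed size: compress into a scratch buffer of that size and
 * copy the result back only if it fits. */
static SquashStatus
squash_codec_compress_with_scratch (SquashCodec* codec,
                                    uint8_t* compressed, size_t* compressed_length,
                                    const uint8_t* uncompressed, size_t uncompressed_length,
                                    SquashOptions* options) {
  size_t max_size = squash_codec_get_max_compressed_size (codec, uncompressed_length);
  size_t scratch_length = max_size;
  uint8_t* scratch = malloc (max_size);
  SquashStatus status;

  if (scratch == NULL)
    return SQUASH_MEMORY;  /* assumed out-of-memory status */

  /* Argument order of the real buffer-to-buffer API may differ. */
  status = squash_codec_compress_with_options (codec, scratch, &scratch_length,
                                               uncompressed, uncompressed_length, options);

  if (status == SQUASH_OK) {
    if (scratch_length <= *compressed_length) {
      memcpy (compressed, scratch, scratch_length);
      *compressed_length = scratch_length;
    } else {
      status = SQUASH_BUFFER_FULL;
    }
  }

  free (scratch);
  return status;
}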
Admittedly I haven't researched it much, but it seems like pigz and pbzip2 both do basically the same thing: create multiple streams (one per thread), compress a fixed-size block of the input data in each one, flush, then merge the streams. XZ Utils may also be taking the same approach.
If that's all that needs to happen, it is possible Squash could provide an API to do this generically (likely with individual codecs opting in to such a feature) instead of adding plugins for pigz, pbzip2, etc.
On OS X everything now compiles fine; there is just one remaining issue while linking:
Making all in benchmark
CC benchmark-timer.o
CC benchmark-benchmark.o
CCLD benchmark
ld: library not found for -lrt
collect2: error: ld returned 1 exit status
make[1]: *** [benchmark] Error 1
make: *** [all-recursive] Error 1
Removing the -lrt in benchmark/Makefile solved the problem.
Should be pretty trivial.
http://svnweb.freebsd.org/base/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c seems to be the best source I can find.
For many applications it might be nice to have an API to compress a file in a background thread. Squash already includes TinyCThread so it shouldn't require any extra dependencies. Something like
typedef void (*SquashWriteToFileBackgroundFinishFunc) (SquashCodec* codec, FILE* compressed, const uint8_t* uncompressed, size_t uncompressed_length, void* user_data);
void squash_codec_compress_to_file_background (SquashCodec* codec, FILE* compressed, const uint8_t* uncompressed, size_t uncompressed_length, SquashWriteToFileBackgroundFinishFunc cb, void* user_data);
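Usage of the proposed API might then look like this; everything here, including when and on which thread the callback runs, is part of the proposal rather than existing behaviour:

static void
on_compress_finished (SquashCodec* codec, FILE* compressed,
                      const uint8_t* uncompressed, size_t uncompressed_length,
                      void* user_data) {
  /* Called once the compressed data has been written; a real consumer might
   * close the file and free the input buffer here. */
  fclose (compressed);
  fprintf (stderr, "finished compressing %zu bytes\n", uncompressed_length);
}

/* ... */
squash_codec_compress_to_file_background (codec, fopen ("output.gz", "wb"),
                                          data, data_length,
                                          on_compress_finished, NULL);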
squash needs a man page.
https://code.google.com/p/zopfli/
AFAICT it's compressor-only, so it would still require zlib for the decompressor.
A tool similar to gzip/lzop/xz/… would be nice.
Potentially a bit of a PITA since they don't currently ship a library—may need to use a submodule.
It would be nice to have Python bindings.
It would probably be a very good idea to add unit tests to check each plugin for thread safety. Perhaps the best way would be to just reuse the existing tests, but execute each one many times in parallel.
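A sketch of how the existing checks could be fanned out over threads with TinyCThread (thrd_create/thrd_join are the C11-threads API TinyCThread provides; run_single_test stands in for whichever existing per-codec check gets reused, and the include path may differ):

#include "tinycthread.h"

#define N_THREADS    16
#define N_ITERATIONS 64

/* Stand-in for one of the existing per-codec checks. */
extern void run_single_test (SquashCodec* codec);

static int
test_thread_func (void* arg) {
  SquashCodec* codec = (SquashCodec*) arg;
  for (int i = 0; i < N_ITERATIONS; i++)
    run_single_test (codec);
  return 0;
}

static void
check_codec_thread_safety (SquashCodec* codec) {
  thrd_t threads[N_THREADS];

  for (int i = 0; i < N_THREADS; i++)
    thrd_create (&(threads[i]), test_thread_func, codec);
  for (int i = 0; i < N_THREADS; i++)
    thrd_join (threads[i], NULL);
}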
This is a bug in liblzf. I've already sent a message to Marc Lehmann.
When trying to compress n bytes of data whose compressed output requires exactly LZF_MAX_COMPRESSED_SIZE(n) bytes, lzf_compress will return 0 to indicate an error.
Test case:
#include <stdio.h>
#include <stdlib.h>
#include "liblzf/lzf.h"
#define DATA \
"\x83\x0d\x6b\xd9\xb3\x16\x2a\x83" \
"\x94\x52\x22\x23\xe\xe20\x59\x16" \
"\x5f\xc2\xe2\x51\xb4\x19\x40\xf9" \
"\x77\xbd\xcc\xe9\x5f\xa2\x4c\xcb" \
\
"\x2e\x10\x75\x54\x41\xd3\x2d\xd6" \
"\xfb\x5c\x9d\xa8\x9d\x42\x8e\xa2" \
"\x4d\xdf\x9e\x0a\xd0\x39\x61\x23" \
"\x01\x01\x3f\xbd\xa7\xc5\x88\xb1" \
\
"\x3d\x0b\x49\x6f\x10\x61\xb5\x66" \
"\x51\xcc\x96\xe4\xae\xd8\xbd\x76" \
"\x9e\x45\x46\x76\x2f\x18\x0a\xfb" \
"\x10\x0a\x08\xa7\xa4\x61\xda\x12" \
\
"\x75\xe9\xf1\xf9\x53\x48\xec\xea" \
"\x89\xc3\xa1\x71\x37\x46\x0b\x5f" \
"\x38\x37\xbf\x8e\x5d\x6a\x42\xaf" \
"\x3a\xe7\x20\xd1\xaa\x89\xe9\x12" \
\
"\x21\xdf\x78\xbb\x74\x31\xf5\xed" \
"\xa1\x21\xa6\xb9\x2d\x8b\x67\x04" \
"\x34\xf6\x2d\xb4\x6a\x01\xf2\x87" \
"\xe7\x33\x1d\xb8\x74\x1b\xbf\x1b" \
\
"\x33\x26\x1d\x6c\x03\xc4\x28\xa5" \
"\x8e\x56\x64\xff\x17\x0c\xae\xeb" \
"\xdc\x66\xf9\xb0\xfa\xcf\xd3\xbe" \
"\x82\xd8\xe1\xff\x67\x9c\xf3\xfe" \
\
"\xbe\x83\x05\x5b\x0a\x29\x5b\x9b" \
"\x84\x5f\x4a\x12\x6e\xb1\x10\xb0" \
"\xe3\xc2\x10\xe2\xf1\x83\x26\xb9" \
"\x2a\x95\x68\x4e\x27\x76\x95\x6a" \
\
"\xb7\x4f\xc7\x0f\x22\x43\x64\x7d" \
"\x04\x03\x5c\x1b\x35\x0f\x9f\x00" \
"\x44\xde\xa4\x41\x27\x8d\x90\xe0" \
"\xb2\xec\x4a\xd4\xbe"
#define DATA_LEN 253
int main (int argc, char** argv) {
  unsigned int res;
  unsigned int max_compressed_length = LZF_MAX_COMPRESSED_SIZE(DATA_LEN);
  void* compressed = malloc (max_compressed_length);

  res = lzf_compress (DATA, DATA_LEN, compressed, max_compressed_length);
  if (res == 0) {
    fprintf (stderr, "Error\n");
    return -1;
  }

  fprintf (stdout, "Compressed %u -> %u (of %u)\n", DATA_LEN, res, max_compressed_length);

  return 0;
}
Adding 1 to max_compressed_length results in success, so a potential work-around (if this isn't resolved quickly in liblzf) would be to just add one to the value in the lzf plugin.
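If liblzf isn't fixed quickly, that work-around could be as small as over-reporting the maximum size by one byte; something like the following inside the lzf plugin's max-compressed-size callback (the function name here is illustrative, not necessarily what the plugin actually calls it):

/* Work-around for the liblzf off-by-one: report one extra byte so
 * lzf_compress never sees an exactly-full output buffer. */
static size_t
squash_lzf_get_max_compressed_size (SquashCodec* codec, size_t uncompressed_length) {
  return LZF_MAX_COMPRESSED_SIZE (uncompressed_length) + 1;
}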
It's a great benefit to see many compressors integrated under a common and flexible interface. I can imagine a variety of use cases where the caller is interested in trying various algorithms and parameters. When those various algorithms are run, the resultant compressed size for that data is a valuable data point; this comes automatically as part of the return code. However, many compression comparison pages which describe the difference between algorithms and the parameters include additional information: the (de)compression time and memory used in these steps.
I'd value having a common and easy-to-use interface to the effective CPU and memory overhead for this particular algorithm and provided options, maybe as simple as a structure which is populated with that data. I'd like to learn how much memory was allocated during the process of (de)compression, and weigh that accordingly. A separate ticket item (another enhancement) is to deliberately put a hard limit on the available memory for an algorithm (wrapping its malloc () and free ()). But for now it would be helpful to learn how much memory the algorithm did use with the provided compression parameters. This could be returned in a very basic way. Also (again, willing to put it into another ticket), maybe the number of threads spawned can be provided as well as specified.
Finally, it would be helpful to see how long the (de)compression takes, measured in terms of CPU and real time. This wouldn't need to take over the aim of a benchmarking tool (which would run the test repeatedly to get a good estimate, or may flush caches before its run, and so on), but a basic time used during the algorithm would be a helpful statistic, measured at the finest resolution that is supported.
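As a starting point, even a plain struct filled in by the (de)compression call would cover most of this; purely illustrative, nothing below exists in Squash today:

/* Hypothetical per-operation statistics; every name and field here is only an
 * illustration of the request. */
typedef struct {
  size_t       peak_memory;  /* peak bytes allocated by the codec */
  double       cpu_time;     /* seconds of CPU time spent in the codec */
  double       wall_time;    /* seconds of real time spent in the codec */
  unsigned int threads_used; /* number of threads the codec spawned */
} SquashOperationStats;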
lzop uses a container which it would be nice to add support for. I can't find any documentation, but it doesn't look too complicated and lzop is open-source.
The ZPAQ plugin needs some work. It will occasionally get caught in a loop or crash. Reliably reproducible by setting step_size to 70 in the buffer_to_buffer_decompress_with_stream in tests/check-plugins.c.
Debian doesn't ship a package like Fedora does, so we should include a copy of liblzf (preferably as a git submodule), but prefer the system copy if one exists.
Another dual GPL/proprietary tool.
The following 253-byte binary input (shown in hexadecimal, byte by byte, concatenated) is reported to require at most 256 bytes with the "doboz" plugin (according to squash_get_max_compressed_size ()), but 258 bytes are actually allocated/returned. I came across this random stream of bytes while trying to test error paths. I can always just add a few extra bytes of safety beyond what that function returns, but I wanted to mention a case that causes an issue:
830d6bd9b3162a8394522223ee2059165fc2e251b41940f977bdcce95fa24ccb2e10755441d32dd6fb5c9da89d428ea24ddf9e0ad039612301013fbda7c588b13d0b496f1061b56651cc96e4aed8bd769e4546762f180afb100a08a7a461da1275e9f1f95348ecea89c3a17137460b5f3837bf8e5d6a42af3ae720d1aa89e91221df78bb7431f5eda121a6b92d8b670434f62db46a01f287e7331db8741bbf1b33261d6c03c428a58e5664ff170caeebdc66f9b0facfd3be82d8e1ff679cf3febe83055b0a295b9b845f4a126eb110b0e3c210e2f18326b92a95684e2776956ab74fc70f2243647d04035c1b350f9f0044dea441278d90e0b2ec4ad4be
There are a few issues when trying to compile on OS X. The first one is the following:
Making all in zpaq
CXX libsquash0.1_plugin_zpaq_la-libzpaq.lo
zpaq/libzpaq.cpp:22:21: fatal error: windows.h: No such file or directory
#include <windows.h>
^
compilation terminated.
make[2]: *** [libsquash0.1_plugin_zpaq_la-libzpaq.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1
zpaq uses windows.h, which is obviously not found on OS X; maybe an extra check in the Squash makefile could fix that?
After disabling zpaq, this is the second error, which is more problematic:
Making all in benchmark
CC benchmark-benchmark.o
benchmark.c:19:4: warning: #warning Unable to find a way to measure CPU time [-Wcpp]
# warning Unable to find a way to measure CPU time
^
benchmark.c: In function 'benchmark_timer_start':
benchmark.c:82:3: error: implicit declaration of function 'clock_gettime' [-Werror=implicit-function-declaration]
if (clock_gettime (CLOCK_REALTIME, &(timer->start_wall)) != 0) {
^
benchmark.c:82:3: warning: nested extern declaration of 'clock_gettime' [-Wnested-externs]
benchmark.c:82:22: error: 'CLOCK_REALTIME' undeclared (first use in this function)
if (clock_gettime (CLOCK_REALTIME, &(timer->start_wall)) != 0) {
^
benchmark.c:82:22: note: each undeclared identifier is reported only once for each function it appears in
benchmark.c: In function 'benchmark_timer_stop':
benchmark.c:20:42: error: 'CLOCK_REALTIME' undeclared (first use in this function)
# define SQUASH_BENCHMARK_CLOCK_CPUTIME CLOCK_REALTIME
^
benchmark.c:94:22: note: in expansion of macro 'SQUASH_BENCHMARK_CLOCK_CPUTIME'
if (clock_gettime (SQUASH_BENCHMARK_CLOCK_CPUTIME, &(timer->end_cpu)) != 0) {
^
cc1: some warnings being treated as errors
make[1]: *** [benchmark-benchmark.o] Error 1
make: *** [all-recursive] Error 1
On OS X, clock_gettime apparently does not exist, but there are alternatives, and this thread gives plenty of them: http://stackoverflow.com/questions/5167269/clock-gettime-alternative-in-mac-os-x
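For the wall-clock portion, one of the usual OS X alternatives from that thread is mach_absolute_time(); a small sketch (the benchmark_timer_now name is just for illustration, and CPU time would still need something like getrusage()):

#include <mach/mach_time.h>

/* Returns the current time in seconds using the Mach absolute clock. */
static double
benchmark_timer_now (void) {
  static mach_timebase_info_data_t timebase = { 0, 0 };
  if (timebase.denom == 0)
    mach_timebase_info (&timebase);
  return ((double) mach_absolute_time ()) * timebase.numer / timebase.denom / 1e9;
}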
We should have bindings for the Go programming language.
Node.js seems like a good candidate for bindings.
Documentation for the Node.js zlib binding is at http://nodejs.org/api/zlib.html
Some codecs can flush (zlib, bzip2, lzma, …), but most can't. Just as squash_codec_knows_uncompressed_size tests whether or not the uncompressed size can be extracted from a compressed buffer without decoding the whole thing, there should be a squash_codec_can_flush (or similar) function to detect whether or not a codec knows how to flush.
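The prototype would presumably mirror the existing predicate; a sketch of the proposed function, which does not exist yet:

#include <stdbool.h>

/* Proposed predicate, mirroring squash_codec_knows_uncompressed_size. */
bool squash_codec_can_flush (SquashCodec* codec);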
Patch: https://gist.github.com/nemequ/6460041
Depends on issues #37 and #39. SHARC also fails, but I need to re-write the SHARC plugin, if it still fails after that and I can verify that the problem is in SHARC, not Squash, I'll file a bug there, too.
https://github.com/mathieuchartier/mcm
Doesn't currently compile on Linux (mathieuchartier/mcm#1).
It would be good to have a PHP extension.
The Squash benchmark is very interesting due to the number of plugins and architectures included, but I find that with the faster algorithms the results are much less predictable because they run for such a short period of time.
For more accurate numbers I would suggest introducing a minimum runtime, for example ensuring that every algorithm runs for at least 30 seconds.
It can easily be done with a timer and a repeating loop: just repeat the current test until the elapsed time reaches the minimum runtime. The performance is then calculated by dividing the total time by the number of loops completed.
I think this will make the benchmark results much more stable.
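A minimal sketch of that loop; run_once and elapsed_seconds stand in for the benchmark's existing per-codec invocation and timer, and the 30-second figure is just the example above:

#define MIN_RUNTIME 30.0  /* seconds */

/* Repeat the measured operation until at least MIN_RUNTIME seconds have
 * elapsed, then report the average time per iteration. */
static double
benchmark_with_min_runtime (void) {
  unsigned int iterations = 0;
  double start = elapsed_seconds ();
  double now;

  do {
    run_once ();
    iterations++;
    now = elapsed_seconds ();
  } while ((now - start) < MIN_RUNTIME);

  return (now - start) / iterations;
}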