squash's People

Contributors

cloudef, dr-emann, eustas, gsenesac, itkatchev, jibsen, jrmuizel, mknjc, n3world, nemequ, pichi, sams96, stefanoberhumer, szabadka, tobijdc


squash's Issues

Add unit tests for flushing streams

We need to be able to test flushing for plugins which support it (issue #41):

  • Decode the flushed compressed stream and get all the data which has been compressed.
  • Compress more data after flushing / decoding and have it work as expected.
  • Repeat multiple times for a given stream.

This all needs to be tested at various sizes, down to a single byte.
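A minimal sketch of what such a test might look like, assuming the existing stream API (squash_stream_new, squash_stream_process, squash_stream_flush, squash_stream_finish) and one-shot squash_decompress; the exact signatures and field names here are from memory and may need adjusting against the real headers:

#include <assert.h>
#include <string.h>
#include <squash.h>   /* header name assumed */

/* Sketch: feed step_size-byte chunks, flush after each one, and verify that
 * everything fed so far can already be decoded. */
static void
check_flush_round_trip (const char* codec_name,
                        const uint8_t* data, size_t length, size_t step_size) {
  uint8_t compressed[8192];
  uint8_t decompressed[8192];
  SquashStream* stream = squash_stream_new (codec_name, SQUASH_STREAM_COMPRESS, NULL);
  size_t off = 0;

  stream->next_out = compressed;
  stream->avail_out = sizeof compressed;

  while (off < length) {
    const size_t chunk = (length - off) < step_size ? (length - off) : step_size;

    stream->next_in = data + off;
    stream->avail_in = chunk;
    while (squash_stream_process (stream) == SQUASH_PROCESSING) { }
    while (squash_stream_flush (stream) == SQUASH_PROCESSING) { }
    off += chunk;

    /* Decode the flushed stream so far; it should contain exactly data[0..off). */
    size_t decompressed_length = sizeof decompressed;
    squash_decompress (codec_name, &decompressed_length, decompressed,
                       sizeof compressed - stream->avail_out, compressed, NULL);
    assert (decompressed_length == off);
    assert (memcmp (decompressed, data, off) == 0);
  }

  while (squash_stream_finish (stream) == SQUASH_PROCESSING) { }
  squash_object_unref (stream);
}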

Add stdio stream-based convenience API

It should be easy to use FILE streams for I/O. Maybe something like

SquashStatus squash_codec_compress_file_with_options (SquashCodec* codec, FILE* compressed, FILE* uncompressed, SquashOptions* options);
SquashStatus squash_codec_decompress_file_with_options (SquashCodec* codec, FILE* decompressed, FILE* compressed, SquashOptions* options);

And the associated convenience APIs (squash_compress_file, squash_decompress_file, squash_codec_compress_file, squash_codec_decompress_file, squash_compress_file_with_options, squash_decompress_file_with_options).
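A rough sketch of how the compression side might be implemented on top of the existing buffer API; the squash_codec_compress_with_options and squash_codec_get_max_compressed_size signatures used here are assumptions, and slurping the whole file is just the simplest possible strategy (a real implementation would probably stream in chunks):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <squash.h>   /* header name assumed */

/* Sketch only; error handling is kept minimal. */
SquashStatus
squash_codec_compress_file_with_options (SquashCodec* codec,
                                         FILE* compressed, FILE* uncompressed,
                                         SquashOptions* options) {
  /* Read the entire uncompressed file into memory. */
  fseek (uncompressed, 0, SEEK_END);
  const long uncompressed_length = ftell (uncompressed);
  fseek (uncompressed, 0, SEEK_SET);

  uint8_t* in_buf = malloc ((size_t) uncompressed_length);
  if (in_buf == NULL ||
      fread (in_buf, 1, (size_t) uncompressed_length, uncompressed) != (size_t) uncompressed_length) {
    free (in_buf);
    return SQUASH_FAILED;
  }

  /* Compress into a worst-case-sized buffer, then write the result out. */
  size_t compressed_length = squash_codec_get_max_compressed_size (codec, (size_t) uncompressed_length);
  uint8_t* out_buf = malloc (compressed_length);
  SquashStatus res = squash_codec_compress_with_options (codec,
                                                         &compressed_length, out_buf,
                                                         (size_t) uncompressed_length, in_buf,
                                                         options);
  if (res == SQUASH_OK &&
      fwrite (out_buf, 1, compressed_length, compressed) != compressed_length)
    res = SQUASH_FAILED;

  free (in_buf);
  free (out_buf);
  return res;
}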

Windows support

Squash should work on Windows. I don't think it would take too much work. Squash depends on pthreads right now, but that's not exposed in the API, so it should be fairly easy to #ifdef around it and use whatever Windows has. Atomic functions for ref counting should probably be implemented using InterlockedCompareExchange. Again, easily #ifdef-able.

Provide a way to test whether a codec can compress to a buffer smaller than max compressed size

Most codecs are capable of (trying to) compress to a buffer which is smaller than the max compressed size, and simply returning an error if the data doesn't fit. This is nice for applications where you only want to compress data when you'll get a specified ratio (e.g., you'll just store it uncompressed unless you can get a 25% reduction in size).

Right now the codecs which can't do that will just return SQUASH_BUFFER_FULL if the buffer is smaller than the max compressed size.

This would also allow us to create a special case in Squash: if you try to compress with one of these codecs to a buffer which is smaller than the max compressed size, we malloc a buffer of the max compressed size, compress to that, then memcpy the result to the provided buffer if it fits, returning SQUASH_BUFFER_FULL if it does not.
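Written as a standalone helper, that special case might look roughly like this (squash_codec_can_compress_to_smaller_buffer is the hypothetical query this issue asks for, and the squash_codec_compress signature is an assumption):

#include <stdlib.h>
#include <string.h>

static SquashStatus
compress_to_small_buffer (SquashCodec* codec,
                          size_t* dest_length, uint8_t* dest,
                          size_t src_length, const uint8_t* src) {
  const size_t max_size = squash_codec_get_max_compressed_size (codec, src_length);

  /* Codec can target a small destination buffer itself, or the buffer is big
   * enough anyway: just pass it through. */
  if (*dest_length >= max_size || squash_codec_can_compress_to_smaller_buffer (codec))
    return squash_codec_compress (codec, dest_length, dest, src_length, src, NULL);

  /* Fallback: compress into a max-sized scratch buffer and only copy the
   * result over if it actually fits. */
  uint8_t* tmp = malloc (max_size);
  size_t tmp_length = max_size;
  SquashStatus res = squash_codec_compress (codec, &tmp_length, tmp, src_length, src, NULL);
  if (res == SQUASH_OK) {
    if (tmp_length <= *dest_length) {
      memcpy (dest, tmp, tmp_length);
      *dest_length = tmp_length;
    } else {
      res = SQUASH_BUFFER_FULL;
    }
  }
  free (tmp);
  return res;
}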

Automatic parallel compression

Admittedly I haven't researched it much, but it seems like pigz and pbzip2 both do basically the same thing: create multiple streams (one per thread), compress a fixed-size block of the input data in each one, flush, then merge the streams. XZ Utils may take the same approach.

If that's all that needs to happen, it is possible Squash could provide an API to do this generically (likely with individual codecs opting in to such a feature) instead of adding plugins for pigz, pbzip2, etc.

Mac OSX -lrt in benchmark Makefile causing linking problem

On OS X everything now compiles fine; there is just one remaining issue while linking:

Making all in benchmark
  CC       benchmark-timer.o
  CC       benchmark-benchmark.o
  CCLD     benchmark
ld: library not found for -lrt
collect2: error: ld returned 1 exit status
make[1]: *** [benchmark] Error 1
make: *** [all-recursive] Error 1

Removing the -lrt in benchmark/Makefile solved the problem.

API to compress to a file in background

For many applications it might be nice to have an API to compress a file in a background thread. Squash already includes TinyCThread so it shouldn't require any extra dependencies. Something like

typedef void (*SquashWriteToFileBackgroundFinishFunc) (SquashCodec* codec, FILE* compressed, const uint8_t* uncompressed, size_t uncompressed_length, void* user_data);
void squash_codec_compress_to_file_background (SquashCodec* codec, FILE* compressed, const uint8_t* uncompressed, size_t uncompressed_length, SquashWriteToFileBackgroundFinishFunc cb, void* user_data);
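A rough sketch of the plumbing using the bundled TinyCThread; the context struct and detach pattern are illustrative, and the actual compression step is left as a comment since it would just reuse the existing buffer API plus fwrite:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "tinycthread.h"

typedef struct {
  SquashCodec* codec;
  FILE* compressed;
  const uint8_t* uncompressed;
  size_t uncompressed_length;
  SquashWriteToFileBackgroundFinishFunc cb;
  void* user_data;
} BackgroundCtx;

static int
background_worker (void* data) {
  BackgroundCtx* ctx = (BackgroundCtx*) data;

  /* ... compress ctx->uncompressed here (e.g. buffer API + fwrite to
   * ctx->compressed), then report the result through the callback ... */

  ctx->cb (ctx->codec, ctx->compressed, ctx->uncompressed,
           ctx->uncompressed_length, ctx->user_data);
  free (ctx);
  return 0;
}

void
squash_codec_compress_to_file_background (SquashCodec* codec, FILE* compressed,
                                          const uint8_t* uncompressed, size_t uncompressed_length,
                                          SquashWriteToFileBackgroundFinishFunc cb, void* user_data) {
  BackgroundCtx* ctx = malloc (sizeof (BackgroundCtx));
  ctx->codec = codec;
  ctx->compressed = compressed;
  ctx->uncompressed = uncompressed;
  ctx->uncompressed_length = uncompressed_length;
  ctx->cb = cb;
  ctx->user_data = user_data;

  thrd_t thread;
  if (thrd_create (&thread, background_worker, ctx) == thrd_success)
    thrd_detach (thread);
}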

Add CLI

A tool similar to gzip/lzop/xz/… would be nice.

Add unit test for thread safety of plugins

It would probably be a very good idea to add unit tests to check each plugin for thread safety. Perhaps the best way would be to just reuse the existing tests, but execute each one many times in parallel.
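For example (sketch only; run_round_trip_test stands in for whichever existing per-plugin test gets reused, and TinyCThread is the threading shim Squash already ships):

#include "tinycthread.h"

#define N_THREADS    16
#define N_ITERATIONS 100

/* Hypothetical stand-in for an existing single-threaded round-trip test. */
extern void run_round_trip_test (const char* codec_name);

static int
hammer_plugin (void* data) {
  const char* codec_name = (const char*) data;
  for (int i = 0; i < N_ITERATIONS; i++)
    run_round_trip_test (codec_name);
  return 0;
}

static void
check_thread_safety (const char* codec_name) {
  thrd_t threads[N_THREADS];
  for (int i = 0; i < N_THREADS; i++)
    thrd_create (&threads[i], hammer_plugin, (void*) codec_name);
  for (int i = 0; i < N_THREADS; i++)
    thrd_join (threads[i], NULL);
}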

LZF fails when asked to compress uncompressible data

This is a bug in liblzf. I've already sent a message to Marc Lehmann.

When trying to compress n bytes of data whose compressed form requires the full LZF_MAX_COMPRESSED_SIZE(n) bytes, lzf_compress will return 0 to indicate an error.

Test case:

#include <stdio.h>
#include <stdlib.h>

#include "liblzf/lzf.h"

#define DATA \
    "\x83\x0d\x6b\xd9\xb3\x16\x2a\x83" \
    "\x94\x52\x22\x23\xee\x20\x59\x16" \
    "\x5f\xc2\xe2\x51\xb4\x19\x40\xf9" \
    "\x77\xbd\xcc\xe9\x5f\xa2\x4c\xcb" \
    \
    "\x2e\x10\x75\x54\x41\xd3\x2d\xd6" \
    "\xfb\x5c\x9d\xa8\x9d\x42\x8e\xa2" \
    "\x4d\xdf\x9e\x0a\xd0\x39\x61\x23" \
    "\x01\x01\x3f\xbd\xa7\xc5\x88\xb1" \
    \
    "\x3d\x0b\x49\x6f\x10\x61\xb5\x66" \
    "\x51\xcc\x96\xe4\xae\xd8\xbd\x76" \
    "\x9e\x45\x46\x76\x2f\x18\x0a\xfb" \
    "\x10\x0a\x08\xa7\xa4\x61\xda\x12" \
    \
    "\x75\xe9\xf1\xf9\x53\x48\xec\xea" \
    "\x89\xc3\xa1\x71\x37\x46\x0b\x5f" \
    "\x38\x37\xbf\x8e\x5d\x6a\x42\xaf" \
    "\x3a\xe7\x20\xd1\xaa\x89\xe9\x12" \
    \
    "\x21\xdf\x78\xbb\x74\x31\xf5\xed" \
    "\xa1\x21\xa6\xb9\x2d\x8b\x67\x04" \
    "\x34\xf6\x2d\xb4\x6a\x01\xf2\x87" \
    "\xe7\x33\x1d\xb8\x74\x1b\xbf\x1b" \
    \
    "\x33\x26\x1d\x6c\x03\xc4\x28\xa5" \
    "\x8e\x56\x64\xff\x17\x0c\xae\xeb" \
    "\xdc\x66\xf9\xb0\xfa\xcf\xd3\xbe" \
    "\x82\xd8\xe1\xff\x67\x9c\xf3\xfe" \
    \
    "\xbe\x83\x05\x5b\x0a\x29\x5b\x9b" \
    "\x84\x5f\x4a\x12\x6e\xb1\x10\xb0" \
    "\xe3\xc2\x10\xe2\xf1\x83\x26\xb9" \
    "\x2a\x95\x68\x4e\x27\x76\x95\x6a" \
    \
    "\xb7\x4f\xc7\x0f\x22\x43\x64\x7d" \
    "\x04\x03\x5c\x1b\x35\x0f\x9f\x00" \
    "\x44\xde\xa4\x41\x27\x8d\x90\xe0" \
    "\xb2\xec\x4a\xd4\xbe"

#define DATA_LEN 253

int main (int argc, char** argv) {
  unsigned int res;
  unsigned int max_compressed_length = LZF_MAX_COMPRESSED_SIZE(DATA_LEN);
  void* compressed = malloc (max_compressed_length);

  res = lzf_compress (DATA, DATA_LEN, compressed, max_compressed_length);

  if (res == 0) {
    fprintf (stderr, "Error\n");
    return -1;
  }

  fprintf (stdout, "Compressed %u -> %u (of %u)\n", DATA_LEN, res, max_compressed_length);

  return 0;
}

Adding 1 to max_compressed_length results in success, so a potential work-around (if this isn't resolved quickly in liblzf) would be to just add one to the value in the lzf plugin.

Report basic time and memory performance characteristics

It's a great benefit to see many compressors integrated under a common and flexible interface. I can imagine a variety of use cases where the caller is interested in trying various algorithms and parameters. When those various algorithms are run, the resultant compressed size for that data is a valuable data point; this comes automatically as part of the return code. However, many compression comparison pages which describe the difference between algorithms and the parameters include additional information: the (de)compression time and memory used in these steps.

I'd value a common, easy-to-use interface to the effective CPU and memory overhead of a particular algorithm and set of options, maybe as simple as a structure which is populated with that data. I'd like to learn how much memory was allocated during (de)compression and weigh that accordingly. A separate ticket item (another enhancement) is to deliberately put a hard limit on the memory available to an algorithm (wrapping its malloc () and free ()), but for now it would be helpful to learn how much memory the algorithm did use with the provided compression parameters. This could be returned in a very basic way. Also (again, willing to put it into another ticket), maybe the number of threads spawned could be reported as well as specified.

Finally, it would be helpful to see how long the (de)compression takes, measured in terms of CPU and real time. This wouldn't need to take over the aim of a benchmarking tool (which would run the test repeatedly to get a good estimate, or may flush caches before its run, and so on), but a basic time used during the algorithm would be a helpful statistic, measured at the finest resolution that is supported.
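Something as simple as an optional stats structure filled in by the library would cover most of this; a purely hypothetical sketch (none of these names exist in Squash today):

/* Hypothetical: basic resource-usage report for a single (de)compression call. */
typedef struct {
  double   cpu_time;       /* CPU seconds consumed */
  double   wall_time;      /* elapsed real time in seconds */
  size_t   peak_memory;    /* peak bytes allocated during the operation */
  unsigned threads_used;   /* number of worker threads spawned, if any */
} SquashStats;

/* e.g. a _with_stats variant of the buffer API (also hypothetical): */
SquashStatus squash_codec_compress_with_stats (SquashCodec* codec,
                                               size_t* compressed_length, uint8_t* compressed,
                                               size_t uncompressed_length, const uint8_t* uncompressed,
                                               SquashStats* stats);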

Add lzop format support to lzo plugin

lzop uses a container which it would be nice to add support for. I can't find any documentation, but it doesn't look too complicated and lzop is open-source.

ZPAQ plugin not reliable

The ZPAQ plugin needs some work. It will occasionally get caught in a loop or crash. Reliably reproducible by setting step_size to 70 in the buffer_to_buffer_decompress_with_stream function in tests/check-plugins.c.

Add internal copy of liblzf

Debian doesn't ship a package like Fedora does, so we should include a copy of liblzf (preferably as a git submodule), but prefer the system copy if one exists.

max compressed size not accurate for doboz

The following 253-byte binary data (shown in hexadecimal, byte by byte concatenated) is said to require at most 256 bytes with the "doboz" plugin (according to squash_get_max_compressed_size ()), but compressing it actually produces 258 bytes. I came across this random stream of bytes while trying to test error paths. I can always add a few extra bytes of safety margin beyond what that function returns, but I wanted to mention a case that causes an issue:

830d6bd9b3162a8394522223ee2059165fc2e251b41940f977bdcce95fa24ccb2e10755441d32dd6fb5c9da89d428ea24ddf9e0ad039612301013fbda7c588b13d0b496f1061b56651cc96e4aed8bd769e4546762f180afb100a08a7a461da1275e9f1f95348ecea89c3a17137460b5f3837bf8e5d6a42af3ae720d1aa89e91221df78bb7431f5eda121a6b92d8b670434f62db46a01f287e7331db8741bbf1b33261d6c03c428a58e5664ff170caeebdc66f9b0facfd3be82d8e1ff679cf3febe83055b0a295b9b845f4a126eb110b0e3c210e2f18326b92a95684e2776956ab74fc70f2243647d04035c1b350f9f0044dea441278d90e0b2ec4ad4be

Issues with Mac OSX compilation

There are a few issues when trying to compile on OS X. The first one is the following:

Making all in zpaq
  CXX      libsquash0.1_plugin_zpaq_la-libzpaq.lo
zpaq/libzpaq.cpp:22:21: fatal error: windows.h: No such file or directory
 #include <windows.h>
                     ^
compilation terminated.
make[2]: *** [libsquash0.1_plugin_zpaq_la-libzpaq.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

zpaq uses windows.h, which is obviously not found; maybe an extra check in the Squash makefile could fix that?

After disabling zpaq, this is the second error, which is more problematic:

Making all in benchmark
  CC       benchmark-benchmark.o
benchmark.c:19:4: warning: #warning Unable to find a way to measure CPU time [-Wcpp]
 #  warning Unable to find a way to measure CPU time
    ^
benchmark.c: In function 'benchmark_timer_start':
benchmark.c:82:3: error: implicit declaration of function 'clock_gettime' [-Werror=implicit-function-declaration]
   if (clock_gettime (CLOCK_REALTIME, &(timer->start_wall)) != 0) {
   ^
benchmark.c:82:3: warning: nested extern declaration of 'clock_gettime' [-Wnested-externs]
benchmark.c:82:22: error: 'CLOCK_REALTIME' undeclared (first use in this function)
   if (clock_gettime (CLOCK_REALTIME, &(timer->start_wall)) != 0) {
                      ^
benchmark.c:82:22: note: each undeclared identifier is reported only once for each function it appears in
benchmark.c: In function 'benchmark_timer_stop':
benchmark.c:20:42: error: 'CLOCK_REALTIME' undeclared (first use in this function)
 #  define SQUASH_BENCHMARK_CLOCK_CPUTIME CLOCK_REALTIME
                                          ^
benchmark.c:94:22: note: in expansion of macro 'SQUASH_BENCHMARK_CLOCK_CPUTIME'
   if (clock_gettime (SQUASH_BENCHMARK_CLOCK_CPUTIME, &(timer->end_cpu)) != 0) {
                      ^
cc1: some warnings being treated as errors
make[1]: *** [benchmark-benchmark.o] Error 1
make: *** [all-recursive] Error 1

On OS X, clock_gettime apparently does not exist, but there are alternatives, and this thread gives plenty: http://stackoverflow.com/questions/5167269/clock-gettime-alternative-in-mac-os-x
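A fallback for the benchmark timer, sketched with APIs that do exist on OS X (gettimeofday for wall time, getrusage for CPU time); mach_absolute_time would be another option for higher-resolution wall time:

#include <sys/time.h>
#include <sys/resource.h>

/* Wall-clock time in seconds, as a clock_gettime (CLOCK_REALTIME) substitute. */
static double
wall_time_now (void) {
  struct timeval tv;
  gettimeofday (&tv, NULL);
  return (double) tv.tv_sec + (double) tv.tv_usec / 1000000.0;
}

/* Process CPU time (user + system) in seconds, as a CLOCK_PROCESS_CPUTIME_ID substitute. */
static double
cpu_time_now (void) {
  struct rusage usage;
  getrusage (RUSAGE_SELF, &usage);
  return (double) usage.ru_utime.tv_sec + (double) usage.ru_utime.tv_usec / 1000000.0
       + (double) usage.ru_stime.tv_sec + (double) usage.ru_stime.tv_usec / 1000000.0;
}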

Go binding

We should have bindings for the Go programming language.

Provide a way to test whether a stream is flushable

Some codecs can flush (zlib, bzip2, lzma, …), most can't. Just like squash_codec_knows_uncompressed_size tests for whether or not the uncompressed size can be extracted from a compressed buffer without decoding the whole thing, there should be a squash_codec_can_flush (or similar) function to detect whether or not a codec knows how to flush.
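Mirroring that existing query, the proposed function might simply be (hypothetical prototype, following the style of squash_codec_knows_uncompressed_size):

bool squash_codec_can_flush (SquashCodec* codec);

Callers could then check it up front and decide whether to rely on squash_stream_flush or to fall back to finishing and re-creating the stream.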

Benchmark predictability

The Squash benchmark is very interesting due to the number of plugins and architectures included, but I find that the faster algorithms produce much less predictable results because they run for a much shorter period of time.
For more accurate numbers I would suggest introducing a minimum runtime, for example ensuring that every algorithm runs for at least 30 seconds.
This can easily be done with a timer and a repeating loop: just repeat the current test until the elapsed time reaches the required minimum, then calculate the performance by dividing the total time by the number of iterations completed (see the sketch below).
I think this will make the benchmark results much more stable.
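Sketch of that loop in C; get_time () and run_single_test () stand in for the benchmark's existing timer and test body:

/* Repeat each test until a minimum total runtime has elapsed, then report the
 * average time per iteration. */
#define MIN_RUNTIME 30.0   /* seconds */

static double
timed_run (void) {
  const double start = get_time ();
  double elapsed = 0.0;
  unsigned iterations = 0;

  do {
    run_single_test ();
    iterations++;
    elapsed = get_time () - start;
  } while (elapsed < MIN_RUNTIME);

  return elapsed / (double) iterations;   /* average time per iteration */
}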
