
gzip-hpp's Introduction

Gzip C++ lib for gzip compression and decompression. Extracted from mapnik-vector-tile for light-weight modularity.


Usage

// Include the specific gzip headers your code needs, for example...
#include <gzip/compress.hpp>
#include <gzip/config.hpp>
#include <gzip/decompress.hpp>
#include <gzip/utils.hpp>
#include <gzip/version.hpp>

// All function calls take a pointer to an immutable
// character sequence (const char *) and its size
std::string data = "hello";
const char * pointer = data.data();
std::size_t size = data.size();

// Check if compressed. Can check both gzip and zlib.
bool c = gzip::is_compressed(pointer, size); // false

// Compress returns a std::string
std::string compressed_data = gzip::compress(pointer, size);

// Decompress returns a std::string and decodes both zlib and gzip
const char * compressed_pointer = compressed_data.data();
std::string decompressed_data = gzip::decompress(compressed_pointer, compressed_data.size());

// Or like so
std::string compressed_data = gzip::compress(tile->data(), tile->data().size());

// Or like so
std::string value = gzip::compress(node::Buffer::Data(obj), node::Buffer::Length(obj));

// Or...etc

Compress

// Optionally pass a compression level as the third argument
std::size_t size = data.size(); // size of the input buffer (must be provided)
int level = Z_DEFAULT_COMPRESSION; // used when the level argument is omitted

std::string compressed_data = gzip::compress(pointer, size, level);

Decompress

// Pass the pointer and size, as above
std::string data = "hello";
std::string compressed_data = gzip::compress(data.data(), data.size());
const char * compressed_pointer = compressed_data.data();

std::string decompressed_data = gzip::decompress(compressed_pointer, compressed_data.size());

Test

# build test binaries
make

# run tests
make test

You can make Release test binaries as well

BUILDTYPE=Release make
BUILDTYPE=Release make test

Versioning

This library is semantically versioned using the /include/gzip/version.hpp file. This defines a number of macros that can be used to check the current major, minor, or patch versions, as well as the full version string.

Here's how you can check for a particular version to use specific API methods

#if GZIP_VERSION_MAJOR > 2
// use version 2 api
#else
// use older version apis
#endif

Here's how to check the version string

std::cout << "version: " << GZIP_VERSION_STRING << "\n";
// => version: 0.2.0

And lastly, mathematically checking for a specific version:

#if GZIP_VERSION_CODE > 20001
// use feature provided in v2.0.1
#endif
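The encoding behind that numeric code is an inference from the example above (20001 corresponding to v2.0.1 suggests major * 10000 + minor * 100 + patch); the authoritative definition lives in include/gzip/version.hpp. A minimal sketch of the inferred scheme:

```cpp
// Sketch of the version-code encoding implied by the example above
// (inferred from 20001 == v2.0.1; check include/gzip/version.hpp for
// the real macro definitions).
constexpr int version_code(int major, int minor, int patch) {
    return major * 10000 + minor * 100 + patch;
}

static_assert(version_code(2, 0, 1) == 20001, "matches the README example");
static_assert(version_code(0, 2, 0) == 200, "v0.2.0");
```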

gzip-hpp's People

Contributors

flippmoke, mapsam, springmeyer


gzip-hpp's Issues

utils.hpp missing cstdint

I get the following compile error on g++ 13.2:

../ext/gzip/utils.hpp:15:36: error: ‘uint8_t’ does not name a type
   15 |                        static_cast<uint8_t>(data[0]) == 0x78 &&
../ext/gzip/utils.hpp:2:1: note: ‘uint8_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
    1 | #include <cstdlib>
  +++ |+#include <cstdint>

Adding #include <cstdint> to utils.hpp fixes the issue

Soup up test data handling

Per #5 (comment), we can improve tests by updating them to better handle test data and ensure proper comparisons/assertions:

  • file based fixtures for input data
  • an adjacent file of gzip coded data
  • then compare them byte for byte
  • optional: add an UPDATE environment variable to be able to put the tests into a mode where they are updateable

This way we can ensure the output is exactly as we expect. And we would pick up differences, say, between our results and another implementation like #7.

cc @springmeyer

Move to single compress API

Uncovered that the std::string compress method is causing unintentional behaviour that could affect performance.

What happens is that when you fail to pass size to the const char * pointer method, it dispatches to the other std::string function, creates a std::string implicitly, and the size of that is used. So things work right, just not quite as you would hope, and an extra string copy is created that we do not want.

Per chat with @artemp re: strong types, we may want to be more explicit here. When performance and reliability are critical, then requiring more explicit types is well worth it since they can guarantee correct code.
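The dispatch described above can be reproduced in a standalone sketch. These are hypothetical stand-in overloads, not gzip-hpp's actual signatures; each returns a tag naming the overload that ran instead of real compressed output:

```cpp
#include <string>

// Hypothetical overload pair mimicking the dispatch described above;
// not gzip-hpp's actual signatures.
std::string compress(const std::string& input) {
    (void)input;
    return "string overload (an extra std::string copy was made)";
}

std::string compress(const char* data, std::size_t size) {
    (void)data;
    (void)size;
    return "pointer overload";
}
```

Calling `compress(p)` on a `const char *` without the size still compiles: the pointer is implicitly converted to a temporary `std::string`, so the string overload runs and an extra copy is silently paid for. A strong-typed API would turn that silent conversion into a compile error.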

Next Actions

cc @mapbox/core-tech

Where did this implementation originally come from?

This project exists, in part, to stop the dangerous practice of copying code around without properly understanding what it does and without there being a clear lineage of bug fixes and docs.

So, we are working on starting off on the right foot in this repo / fixing our checkered past. But the question has come up during the vetting of the code: "where did this come from and why was this decision made?".

I spent some time trying to figure out the history. We know that proximately we copied the code from mapnik-vector-tile to get this project started, but where did that code come from? Well, if you search on github for while (inflate_s.avail_out == 0) (which is a pretty unique line) we get a bunch of hits for the same code copied around: https://github.com/search?q=while+%28inflate_s.avail_out+%3D%3D+0%29&type=Code&utf8=%E2%9C%93. In looking at that I think the first to come along was https://github.com/kkaefer/DEPRECATED-node-zlib from @kkaefer. So I think that is the origination of the code that mapnik-vector-tile has been using. And the history of commits in that repo show some of the reasons for changes in the code (like the addition of handling of Z_BUF_ERROR in kkaefer/DEPRECATED-node-zlib@523477f#diff-1edfe6d0b0a3a9c5bcbb3ba0e11144a9).

Should we allow compressing an empty string?

Currently the API allows for compressing an empty string without throwing. It results in a compressed string that is 20 chars long. Should we continue to support this? Or should we catch this case and throw (since this usage is likely a programmer mistake)?

/cc @mapbox/core-tech

Decompress suppresses Z_BUF_ERROR

The reader loop in decompress.hpp suppresses Z_BUF_ERROR so it doesn't detect truncated input files.

I think the right way to do it is probably as in this change to Tippecanoe's code copied and pasted from the same place: mapbox/tippecanoe@07ab900.

With this change, the loop terminates only on Z_STREAM_END, so in the case of a truncated stream, it continues trying to decompress and reports Z_BUF_ERROR instead of the current behavior of terminating because the input has all been consumed even though the decompressor is still expecting more.

This also attempts to clean up what has always seemed like confusing logic to me about the size of the output buffer, the size of the output so far, and zlib's pointers into the buffer.

Add docs on gzip vs deflate

In gzip-hpp we support:

  • compressing data using gzip coding
  • decompressing data that has been gzip coded or deflate coded
  • detecting if data has been gzip coded or deflate coded

The difference between the two codings is important but small: gzip wraps the raw deflate data with an extra header and a CRC32 trailer. I'm creating this ticket so that our docs can cover this distinction.

@GretaCB - can you do a pass at the docs and code comments to figure out where best to link to https://tools.ietf.org/html/rfc7230#section-4.2?

I found that ietf doc via https://nodejs.org/api/zlib.html#zlib_compressing_http_requests_and_responses.

Initial buffer size

When compressing, the initial output buffer is sized as size / 2 + 1024 (https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/compress.hpp#L81). Where does this formula come from? Is it optimal?

Looking at the real-world tiles in the mvt-fixtures, compressed tiles are (on average) larger than half the size of the uncompressed tiles. So this means we'll go around the loop twice, doing extra work.

Maybe this should be configurable, because only the calling code knows what data it has and what the compression ratio might be?

A question

It is very nice to put zlib in a C++ setting; I was trying to do the same. Just a quick question: is there a wrapper for initializing the gz file?

Blue sky/future: investigate alternative, high-perf deflate implementations

I presume our wrapper around zlib is as fast as possible, and faster than the C++ wrapper boost provides (refs #7).

But in the past I've seen zlib compression be a meaningful % of the time taken to work with vector tiles. Often the usecase is:

  • Get a tile over a network or a db
  • Decompress it
  • Do an operation on it (query, shave, composite, etc)
  • Recompress
  • Send back over a network or put into db

When the "do an operation" is fairly speedy, the decompress and recompress times are often meaningful (at least 5-10% of the time taken on the CPU). In applications handling a lot of concurrent requests where the CPU may be working hard, we can increase the amount of concurrency possible on a single machine by reducing the amount of work on the CPU.

So, this is a long way of saying:

  • gzip is fairly cheap and likely already as fast as it can be
  • except in rare cases of high load with fairly optimized code (meaning where the vector tile operation is already pretty fast so the gzip part stands out more)...
  • In that case we might want to revisit trying to speed up gzip coding operations

When/if we do, then we should look into benchmarking https://github.com/ebiggers/libdeflate, which claims to be faster than zlib.

/cc @GretaCB @flippmoke

Docker update request

Any chance we can get a 20.04 dockerfile (and upgrade from 16.04) for master? Also perhaps update the zlib version from 1.2.8 to 1.2.11, and other dependencies?

Tag v1.0.0 / Package in Mason

Once @GretaCB feels this library is ready for release, let's plan to tag as v1.0.0 and package in mason.

The general flow will be:

  • Tag here: git tag v1.0.0 -a -m "Tagging gzip-hpp v1.0.0"
  • Push the tag: git push --tags
  • Then head over to mason to:
    • Create a new branch
    • Create a gzip package
    • Push the branch
    • Test a build with ./mason trigger gzip 1.0.0

I've pushed some updated docs on mason packaging at mapbox/mason#478.

/cc @GretaCB

Not for zip packages?

"C++ lib for gzip compression and decompression" means not for zip packages?
I have tried:

	std::string filename("c:/transfer/test.zip");
	std::ifstream ifs(filename, std::ios_base::in | std::ios_base::binary);
	if (ifs.is_open())
	{
		std::cout << "zip opened\n";
		std::string str_compressed((std::istreambuf_iterator<char>(ifs.rdbuf())), std::istreambuf_iterator<char>());
		ifs.close();
		std::string out{};
		gzip::Decompressor decomp{};
		try
		{
			decomp.decompress(out, str_compressed.data(), str_compressed.size());
			std::cout << out.c_str() << std::endl;
		}
		catch (std::exception& ex)
		{
			std::cout << ex.what() << std::endl;
		}
	}

But I got:
incorrect header check
So, gzip isn't for zip packages?

Code review

I looked at the code and have some comments:

  1. https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/compress.hpp#L27 That's confusing: The output parameter is of type InputType? And I would have expected the output parameter to come after the input parameters. But that's historically inconsistent in C APIs anyway...
  2. Not sure we need the Compressor and Decompressor classes. The only thing that ever happens to them is that they are instantiated and then the (de)compress function is called on them. That would work as a simple function? I can see why this might be useful when we (de)compress in blocks, but the library can't do that due to the inflateInit2 and deflateInit2 functions not being in the constructor. Okay, we don't have to set those compression parameters every time, but how often do we re-use an instantiated Compressor or Decompressor object to make this useful?
  3. Useless SECTION: https://github.com/mapbox/gzip-hpp/blob/master/test/test_io.cpp#L12
  4. https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/utils.hpp#L8 can be noexcept, and the function is hard to follow due to the many parentheses.
  5. https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/compress.hpp#L32-L38 Consider using assert() here.
  6. There should be helper functions like (de)compress() that take an existing std::string and fill it.
  7. https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/decompress.hpp#L64 If the second condition is true, the first is also true, so this can be simplified.
  8. https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/decompress.hpp#L57 I don't think this is correct. If std::size_t is 32-bit, size * 2 will be 32-bit and can overflow before it is assigned to the 64-bit size_64. const auto size_64 = static_cast<uint64_t>(size) * 2; should be correct.
  9. There are several places where inflateEnd is called (like here https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/decompress.hpp#L60). Maybe this can be wrapped in an RAII wrapper somehow?
  10. If we have gzipped-compressed data, the header can tell us how large the uncompressed data was. Can we use this to optimize this case and have the output buffer in the correct size from the beginning?
  11. https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/decompress.hpp#L55-L63 this can be moved above the inflateInit2 call saving an inflateEnd. Also consider making this an assert.
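Regarding item 10: the uncompressed size is actually stored in the gzip trailer, not the header (the ISIZE field, the last four bytes, holding the size mod 2^32, little-endian). A sketch of reading it, using a hypothetical helper name; note the value is only advisory, valid for single-member streams under 4 GiB, and untrusted input can lie about it:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: read gzip's ISIZE trailer (uncompressed size mod 2^32,
// little-endian, in the last 4 bytes of a single-member gzip stream).
std::uint32_t gzip_uncompressed_size(const unsigned char* data, std::size_t size) {
    if (size < 4) return 0; // not a complete gzip stream
    const unsigned char* p = data + size - 4;
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}
```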

Multiply-defined symbols linker error

I've included gzip.hpp in two CPP files in a VS2017 project, and when it links I get "already defined" errors for gzip::compress and gzip::decompress. Is there a way to allow multiple inclusion or do I need to include it just once in a wrapper class?
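The usual cause of this error (an assumption, since the exact code isn't shown) is a free function defined in a header without the `inline` keyword: every translation unit that includes the header then emits its own definition, violating the one-definition rule at link time. A minimal sketch of the header-only fix:

```cpp
// mini_header.hpp (sketch): free functions defined in a header must be
// marked `inline` (or be templates) so that including the header from
// multiple translation units does not produce duplicate-symbol errors.
#pragma once
#include <string>

inline std::string shout(const std::string& s) { return s + "!"; }
```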

Pragma warnings with MSVC

Not an expert here, but have seen other projects doing something like:

#ifdef __GNUC__
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfloat-equal"
#endif
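The MSVC counterpart to the GCC guard above uses `#pragma warning` push/pop. A sketch combining both (warning number 4244 is only an illustrative placeholder, not a warning this library necessarily triggers):

```cpp
// Sketch: suppress a specific warning around a header, portably.
#ifdef _MSC_VER
#pragma warning(push)
#pragma warning(disable : 4244) // illustrative example: narrowing conversion
#endif

#ifdef __GNUC__
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfloat-equal"
#endif

// ... code or include that triggers the warning ...

#ifdef __GNUC__
#pragma GCC diagnostic pop
#endif

#ifdef _MSC_VER
#pragma warning(pop)
#endif
```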

Make warnings errors

We should make all compiler warnings errors so they fail the travis build. Then we should fix the compiler warnings. refs mapbox/hpp-skel#26

/cc @GretaCB - pull me in to help fix the warnings when you get there.

Streaming vs strings

As a general design we should likely be encouraging the use of streaming operators with zlib, which is a streaming algorithm. There is no simple way to make or use a streaming C++ operation with the current design. I think this is a mistake, as I/O operations will almost always be slower than any gzip decompression or compression. Basic tools for operating on existing buffers are great, but we should focus on making our operations more stream-based for performance reasons.

benchmark_register.cc:(.text+0x26e6): undefined reference to `std::__throw_out_of_range_fmt(char const*, ...)

cmake did not error out

cmake ..
-- The CXX compiler identification is GNU 4.8.5
-- Check for working CXX compiler: /usr/lib64/ccache/c++
-- Check for working CXX compiler: /usr/lib64/ccache/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring release build
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /home1/newdbadmin/gzip/gzip-hpp/build

make has error:
/home/travis/build/mapbox/mason/mason_packages/.build/benchmark-1.3.0/src/benchmark_register.cc:(.text+0x26e6): undefined reference to `std::__throw_out_of_range_fmt(char const*, ...)'

make
Scanning dependencies of target bench-tests
[ 16%] Building CXX object CMakeFiles/bench-tests.dir/bench/run.cpp.o
[ 33%] Linking CXX executable bench-tests
../mason_packages/linux-x86_64/benchmark/1.3.0/lib/libbenchmark.a(benchmark_register.cc.o): In function `benchmark::internal::Benchmark::Ranges(std::vector<std::pair<int, int>, std::allocator<std::pair<int, int> > > const&)': /home/travis/build/mapbox/mason/mason_packages/.build/benchmark-1.3.0/src/benchmark_register.cc:(.text+0x26e6): undefined reference to `std::__throw_out_of_range_fmt(char const*, ...)'
collect2: error: ld returned 1 exit status
make[2]: *** [bench-tests] Error 1
make[1]: *** [CMakeFiles/bench-tests.dir/all] Error 2
make: *** [all] Error 2

Thanks

utils.hpp length check

utils.hpp contains the following length check for is_compressed:

return size > 2 && ...

However, it only uses the first two bytes. I would vote that the length check should be return size >= 2 && ...

This would allow me to do something like:

        char gzipHeader[2];
        file.read(gzipHeader, 2);
        file.seekg(0, ios::beg); // rewind

        if (gzip::is_compressed(gzipHeader, 2))
            // do something...
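A standalone sketch consistent with the `size >= 2` proposal (a hypothetical helper, not the library's actual utils.hpp code): gzip streams start with the magic bytes 0x1f 0x8b, and zlib streams commonly start with 0x78 followed by one of a few flag bytes, so exactly two bytes are needed:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of a magic-byte check that needs exactly two bytes,
// so size >= 2 is a sufficient length guard.
inline bool looks_compressed(const char* data, std::size_t size) {
    if (size < 2) return false;
    const auto b0 = static_cast<std::uint8_t>(data[0]);
    const auto b1 = static_cast<std::uint8_t>(data[1]);
    const bool is_gzip = b0 == 0x1f && b1 == 0x8b;
    const bool is_zlib = b0 == 0x78 &&
                         (b1 == 0x01 || b1 == 0x9c || b1 == 0xda);
    return is_gzip || is_zlib;
}
```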
