mapbox / gzip-hpp
Gzip header-only C++ library
License: BSD 2-Clause "Simplified" License
Currently the API allows compressing an empty string without throwing. The result is a compressed string about 20 characters long. Should we continue to support this, or should we catch this case and throw (since this usage is likely a programmer mistake)?
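For illustration, a minimal sketch of the current behavior, assuming the pointer/size overload of gzip::compress:

#include <gzip/compress.hpp>
#include <iostream>
#include <string>

int main()
{
    std::string empty;
    // Currently this succeeds instead of throwing...
    std::string compressed = gzip::compress(empty.data(), empty.size());
    // ...and yields a small non-empty result: just the gzip
    // header/trailer, roughly 20 bytes.
    std::cout << compressed.size() << std::endl;
    return 0;
}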
/cc @mapbox/core-tech
The solution is to look for the line with reinterpret_cast<z_const Bytef*>(data) and change it to reinterpret_cast<z_const Bytef*>(&data), for example. This needs to be done twice: once in compress.hpp and once in decompress.hpp.
If the output buffer given to the compress() call here https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/compress.hpp#L27 already contains some data (which might make sense if you want to compress several things one after the other before writing them out in one go), the code at https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/compress.hpp#L82-L89 will not do the right thing.
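Until that is fixed, a possible workaround sketch (my code, not the library's): compress each piece into its own string and append, rather than handing compress() a non-empty output buffer. Concatenated gzip members form a valid multi-member stream per RFC 1952, though not every decoder reads past the first member.

#include <gzip/compress.hpp>
#include <string>
#include <vector>

// `pieces` is a hypothetical set of buffers to compress in one go.
std::string compress_all(std::vector<std::string> const& pieces)
{
    std::string out;
    for (auto const& piece : pieces)
    {
        // Each call starts from an empty internal buffer, sidestepping
        // the bug described above; the results are simply appended.
        out += gzip::compress(piece.data(), piece.size());
    }
    return out;
}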
cmake did not error out:
cmake ..
-- The CXX compiler identification is GNU 4.8.5
-- Check for working CXX compiler: /usr/lib64/ccache/c++
-- Check for working CXX compiler: /usr/lib64/ccache/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring release build
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /home1/newdbadmin/gzip/gzip-hpp/build
make has an error:
/home/travis/build/mapbox/mason/mason_packages/.build/benchmark-1.3.0/src/benchmark_register.cc:(.text+0x26e6): undefined reference to `std::__throw_out_of_range_fmt(char const*, ...)'
make
Scanning dependencies of target bench-tests
[ 16%] Building CXX object CMakeFiles/bench-tests.dir/bench/run.cpp.o
[ 33%] Linking CXX executable bench-tests
../mason_packages/linux-x86_64/benchmark/1.3.0/lib/libbenchmark.a(benchmark_register.cc.o): In function `benchmark::internal::Benchmark::Ranges(std::vector<std::pair<int, int>, std::allocator<std::pair<int, int> > > const&)':
/home/travis/build/mapbox/mason/mason_packages/.build/benchmark-1.3.0/src/benchmark_register.cc:(.text+0x26e6): undefined reference to `std::__throw_out_of_range_fmt(char const*, ...)'
collect2: error: ld returned 1 exit status
make[2]: *** [bench-tests] Error 1
make[1]: *** [CMakeFiles/bench-tests.dir/all] Error 2
make: *** [all] Error 2
Thanks
As a general design, we should likely encourage the use of streaming operators, since zlib implements a streaming algorithm. There is no simple way to make or use a streaming C++ operation with the current design. I think this is a mistake: I/O operations will almost always be slower than any gzip decompression or compression. Basic tools to operate on existing buffers are great, but we should focus on making our operations more stream-based for performance reasons.
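Not proposing an API here, but to make the stream-based idea concrete, a minimal sketch using zlib's own streaming interface (in the spirit of zlib's zpipe.c example; the chunk size and error handling are my choices, not the library's):

#include <zlib.h>
#include <array>
#include <istream>
#include <ostream>
#include <stdexcept>

// Read from `in`, write gzip-framed output to `out` in fixed-size chunks,
// so compression overlaps I/O and no whole-input buffer is required.
inline void compress_stream(std::istream& in, std::ostream& out)
{
    z_stream strm{};
    // windowBits 15 + 16 selects the gzip (rather than zlib) wrapper
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 + 16, 8,
                     Z_DEFAULT_STRATEGY) != Z_OK)
    {
        throw std::runtime_error("deflateInit2 failed");
    }
    std::array<char, 16384> in_buf;
    std::array<char, 16384> out_buf;
    int flush = Z_NO_FLUSH;
    do
    {
        in.read(in_buf.data(), static_cast<std::streamsize>(in_buf.size()));
        strm.avail_in = static_cast<uInt>(in.gcount());
        strm.next_in = reinterpret_cast<Bytef*>(in_buf.data());
        flush = in.eof() ? Z_FINISH : Z_NO_FLUSH;
        do
        {
            strm.avail_out = static_cast<uInt>(out_buf.size());
            strm.next_out = reinterpret_cast<Bytef*>(out_buf.data());
            deflate(&strm, flush); // cannot fail with a valid setup
            out.write(out_buf.data(),
                      static_cast<std::streamsize>(out_buf.size() - strm.avail_out));
        } while (strm.avail_out == 0);
    } while (flush != Z_FINISH);
    deflateEnd(&strm);
}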
Uncovered that the std::string compress method is causing unintentional behaviour that could affect performance. What happens is that when you fail to pass a size to the const char* pointer method, the call dispatches to the other std::string function, creates a std::string implicitly, and uses the size of that. So things work correctly, just not quite as you would hope, and an extra string copy is created that we do not want.
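A sketch of the dispatch described above, assuming overloads roughly like std::string compress(const char* data, std::size_t size) and std::string compress(const std::string& data); get_buffer and get_buffer_size are hypothetical stand-ins:

const char* data = get_buffer();      // hypothetical byte source
std::size_t size = get_buffer_size(); // hypothetical size
// No size argument: overload resolution falls back to the std::string
// overload via an implicit conversion, copying the whole input first.
auto a = gzip::compress(data);
// Pointer/size overload: no temporary, no extra copy.
auto b = gzip::compress(data, size);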
Per chat with @artemp re: strong types, we may want to be more explicit here. When performance and reliability are critical, requiring more explicit types is well worth it, since they can guarantee correct code.
cc @mapbox/core-tech
The reader loop in decompress.hpp suppresses Z_BUF_ERROR, so it doesn't detect truncated input files.
I think the right way to do it is probably as in this change to Tippecanoe's code copied and pasted from the same place: mapbox/tippecanoe@07ab900.
With this change, the loop terminates only on Z_STREAM_END, so in the case of a truncated stream it continues trying to decompress and reports Z_BUF_ERROR, instead of the current behavior of terminating because the input has all been consumed even though the decompressor is still expecting more.
This also attempts to clean up what has always seemed like confusing logic to me about the size of the output buffer, the size of the output so far, and zlib's pointers into the buffer.
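Roughly, the proposed shape of the loop; this is a sketch, not the exact tippecanoe patch, and inflate_s is assumed to be already initialized via inflateInit2 and pointed at the whole input:

std::string output;
std::size_t bytes_written = 0;
std::size_t const chunk = 16384;
int ret = Z_OK;
do
{
    // Grow the output and point zlib at the unused tail.
    output.resize(bytes_written + chunk);
    inflate_s.next_out = reinterpret_cast<Bytef*>(&output[0] + bytes_written);
    inflate_s.avail_out = static_cast<uInt>(output.size() - bytes_written);
    ret = inflate(&inflate_s, Z_NO_FLUSH);
    if (ret != Z_OK && ret != Z_STREAM_END)
    {
        // On truncated input, inflate eventually returns Z_BUF_ERROR
        // (no progress possible), which is now reported instead of hidden.
        inflateEnd(&inflate_s);
        throw std::runtime_error(inflate_s.msg ? inflate_s.msg : "inflate failed");
    }
    bytes_written = output.size() - inflate_s.avail_out;
} while (ret != Z_STREAM_END);
output.resize(bytes_written);
inflateEnd(&inflate_s);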
This project exists, in part, to stop the dangerous practice of copying code around without properly understanding what it does and without there being a clear lineage of bug fixes and docs.
So, we are working on starting off on the right foot in this repo and fixing our checkered past. But during the vetting of the code the question has come up: "where did this come from and why was this decision made?".
I spent some time trying to figure out the history. We know that proximately we copied the code from mapnik-vector-tile to get this project started, but where did that code come from? Well, if you search on GitHub for while (inflate_s.avail_out == 0) (which is a pretty unique line) we get a bunch of hits for the same code copied around: https://github.com/search?q=while+%28inflate_s.avail_out+%3D%3D+0%29&type=Code&utf8=%E2%9C%93. In looking at that, I think the first to come along was https://github.com/kkaefer/DEPRECATED-node-zlib from @kkaefer. So I think that is the origination of the code that mapnik-vector-tile has been using. And the history of commits in that repo shows some of the reasons for changes in the code (like the addition of handling of Z_BUF_ERROR in kkaefer/DEPRECATED-node-zlib@523477f#diff-1edfe6d0b0a3a9c5bcbb3ba0e11144a9).
We should make all compiler warnings errors so they fail the travis build. Then we should fix the compiler warnings. refs mapbox/hpp-skel#26
/cc @GretaCB - pull me in to help fix the warnings when you get there.
Per #5 (comment), we can improve the tests by updating them to better handle test data and ensure proper comparisons/assertions: an UPDATE environment variable would put the tests into a mode where they are updateable. This way we can ensure the output is exactly as we expect, and we would pick up differences, say, between our results and another implementation like #7.
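A sketch of how such an UPDATE mode could look; the helper names and fixture layout are assumptions, not the repo's actual test code:

#include <cstdlib>
#include <fstream>
#include <iterator>
#include <string>

// When UPDATE is set in the environment, rewrite the fixture on disk;
// otherwise read it back and compare against the actual output.
inline bool update_mode()
{
    return std::getenv("UPDATE") != nullptr;
}

inline void check_fixture(std::string const& path, std::string const& actual)
{
    if (update_mode())
    {
        std::ofstream(path, std::ios::binary) << actual;
        return;
    }
    std::ifstream ifs(path, std::ios::binary);
    std::string expected((std::istreambuf_iterator<char>(ifs)),
                         std::istreambuf_iterator<char>());
    // compare with the project's assertion macro of choice, e.g.
    // REQUIRE(actual == expected);
}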
cc @springmeyer
"C++ lib for gzip compression and decompression" means not for zip packages?
I have tried:
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <gzip/decompress.hpp>

std::string filename("c:/transfer/test.zip");
std::ifstream ifs(filename, std::ios_base::in | std::ios_base::binary);
if (ifs.is_open())
{
    std::cout << "zip opened\n";
    std::string str_compressed((std::istreambuf_iterator<char>(ifs.rdbuf())),
                               std::istreambuf_iterator<char>());
    ifs.close();
    std::string out{};
    gzip::Decompressor decomp{};
    try
    {
        decomp.decompress(out, str_compressed.data(), str_compressed.size());
        std::cout << out.c_str() << std::endl;
    }
    catch (std::exception& ex)
    {
        std::cout << ex.what() << std::endl;
    }
}
But I got:
incorrect header check
So, gzip isn't for zip packages?
I've included gzip.hpp in two CPP files in a VS2017 project, and when it links I get "already defined" errors for gzip::compress and gzip::decompress. Is there a way to allow multiple inclusion or do I need to include it just once in a wrapper class?
utils.hpp contains the following length check for is_compressed:
return size > 2 && ...
However, it only uses the first two bytes. I would vote that the length check should be return size >= 2 && ...
This would allow me to do something like:
char gzipHeader[2];
file.read(gzipHeader, 2);
file.seekg(0, ios::beg); // rewind
if (gzip::is_compressed(gzipHeader, 2))
// do something...
Once @GretaCB feels this library is ready for release, let's plan to tag as v1.0.0 and package in mason.
The general flow will be:
git tag v1.0.0 -a -m "Tagging gzip-hpp v1.0.0"
git push --tags
Then add a gzip package to mason and trigger the binary build:
./mason trigger gzip 1.0.0
I've pushed some updated docs on mason packaging at mapbox/mason#478.
/cc @GretaCB
I get the following compile error on g++ 13.2:
../ext/gzip/utils.hpp:15:36: error: ‘uint8_t’ does not name a type
15 | static_cast<uint8_t>(data[0]) == 0x78 &&
../ext/gzip/utils.hpp:2:1: note: ‘uint8_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
1 | #include <cstdlib>
+++ |+#include <cstdint>
Adding #include <cstdint> to utils.hpp fixes the issue.
I noticed this repo is currently in an odd state due to https://docs.travis-ci.com/user/open-source-on-travis-ci-com/#Existing-Open-Source-Repositories-on-travis-ci.org. When trying to enable it on https://travis-ci.com/profile/mapbox I see:
[screenshot]
@artemp until this is fixed please ensure your tests are passing locally at #25 and ensure you get another reviewer to confirm as well.
/cc @mapsam for visibility
When compressing, the initial output buffer is sized as size / 2 + 1024 (https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/compress.hpp#L81). Where does this formula come from? Is it optimal?
Looking at the real-world tiles in the mvt-fixtures, compressed tiles are (on average) larger than half the size of the uncompressed tiles. So this means we'll go around the loop twice, doing extra work.
Maybe this should be configurable, because only the calling code knows what data it has and what the compression ratio might be?
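To illustrate with made-up numbers:

// Made-up numbers: a 100000-byte tile gets an initial output buffer of
// 100000 / 2 + 1024 = 51024 bytes. If the compressed tile ends up around
// 60000 bytes (plausible for dense real-world tiles), the buffer is too
// small, so the loop resizes and runs deflate again over the remainder.
std::size_t size = 100000;
std::size_t initial_size = size / 2 + 1024; // 51024 bytes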
In gzip-hpp we support both the gzip and zlib codings.
The difference between the two codings is important but small: gzip adds a header and a CRC. I'm creating this ticket so that our docs can explain the difference and link to the relevant spec.
@GretaCB - can you do a pass at the docs and code comments to figure out where best to link to https://tools.ietf.org/html/rfc7230#section-4.2?
I found that ietf doc via https://nodejs.org/api/zlib.html#zlib_compressing_http_requests_and_responses.
It'd be great to include an example section (or perhaps even a method?) about how to check if a string or buffer is compressed. Perhaps gzip::uncompress does this internally?
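In the meantime, utils.hpp does expose gzip::is_compressed (see the length-check issue above), which inspects the leading magic bytes; a minimal usage sketch:

#include <gzip/utils.hpp>
#include <string>

bool maybe_gzip(std::string const& buffer)
{
    // Checks the leading magic bytes (0x1f 0x8b for gzip, 0x78 for zlib),
    // so only a short prefix of the data is actually needed.
    return gzip::is_compressed(buffer.data(), buffer.size());
}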
Any chance we can get a 20.04 Dockerfile (and upgrade from 16.04) for master? Also, perhaps update the zlib version from 1.2.8 to 1.2.11? And other dependencies?
I looked at the code and have some comments:
- The output parameter is of type InputType? And I would have expected the output parameter to come after the input parameters. But that's historically inconsistent in C APIs anyway...
- Why have Compressor and Decompressor classes? The only thing that ever happens to them is that they are instantiated and then the (de)compress function is called on them. That would work as a simple function? I can see why classes might be useful when we (de)compress in blocks, but the library can't do that, due to the inflateInit2 and deflateInit2 functions not being in the constructor. Okay, we don't have to set those compression parameters every time, but how often do we re-use an instantiated Compressor or Decompressor object to make this useful?
- Consider noexcept; as written the function is hard to follow due to the many parentheses.
- Consider an assert() here.
- Consider overloads of (de)compress() that take an existing std::string and fill it.
- If std::size_t is 32 bit, size * 2 will be 32 bit and can overflow before it is assigned to the 64-bit size_64. const auto size_64 = static_cast<uint64_t>(size) * 2; should be correct.
- Make sure inflateEnd is called (like here https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/decompress.hpp#L60). Maybe this can be wrapped in an RAII wrapper somehow? (See the sketch after this list.)
- Check the result of the inflateInit2 call and return early on failure, saving an inflateEnd. Also consider making this an assert.
Not an expert here, but have seen other projects doing something like:
#ifdef __GNUC__
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfloat-equal"
#endif
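Regarding the inflateEnd item in the list above, a minimal RAII sketch (the guard name is mine, not the library's):

#include <zlib.h>

// Runs inflateEnd on every path out of the enclosing scope,
// including exceptions thrown while decompressing.
struct inflate_guard
{
    z_stream& strm;
    explicit inflate_guard(z_stream& s) : strm(s) {}
    ~inflate_guard() { inflateEnd(&strm); }
    inflate_guard(inflate_guard const&) = delete;
    inflate_guard& operator=(inflate_guard const&) = delete;
};
// Usage: construct one right after a successful inflateInit2, and the
// explicit inflateEnd calls on each error path can be removed.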
Per chat with @joto, we learned that zlib has another API that takes the original size of the decompressed data. See the example in libosmium. This could help performance in gzip-hpp by avoiding constant resizing of the write buffer, which requires reallocations.
This would mean grabbing the original file size/buffer length via the ISIZE field that gzip stores in the stream's trailer.
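For reference, ISIZE is the little-endian uncompressed size mod 2^32 held in the last four bytes of a gzip member (RFC 1952). A sketch of reading it (the function name is mine):

#include <cstddef>
#include <cstdint>

inline std::uint32_t gzip_isize(unsigned char const* data, std::size_t size)
{
    // Caller must guarantee size >= 4 and a single, untruncated member.
    return static_cast<std::uint32_t>(data[size - 4]) |
           (static_cast<std::uint32_t>(data[size - 3]) << 8) |
           (static_cast<std::uint32_t>(data[size - 2]) << 16) |
           (static_cast<std::uint32_t>(data[size - 1]) << 24);
}

The returned value could be used to size the output buffer once up front instead of growing it in the decompress loop, with some care: it is only the size mod 2^32 and only trustworthy for well-formed single-member input.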
cc @springmeyer
It is very nice to put zlib in a C++ setting; I was trying to do the same. Just a quick question: is there a wrapper for initializing the gz file?
I presume our wrapper around zlib is as fast as possible, and faster than the C++ wrapper Boost provides (refs #7).
But in the past I've seen zlib compression be a meaningful % of the time taken to work with vector tiles. Often the use case is: decompress a tile, do an operation on it, then recompress it.
When the "do an operation" step is fairly speedy, the decompress and recompress times are often meaningful (at least 5-10% of the time taken on the CPU). In applications handling a lot of concurrent requests, where the CPU may be working hard, we can increase the amount of concurrency possible on a single machine by reducing the amount of work on the CPU.
So, this is a long way of saying: compression performance here is worth benchmarking. When/if we do, we should look into benchmarking https://github.com/ebiggers/libdeflate, which claims to be faster than zlib.
/cc @GretaCB @flippmoke