mapbox / gzip-hpp
Gzip header-only C++ library
License: BSD 2-Clause "Simplified" License
Currently the API allows compressing an empty string without throwing. The result is a compressed string about 20 characters long. Should we continue to support this, or should we catch this case and throw (since this usage is likely a programmer mistake)?
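For illustration, a minimal sketch of the current behavior, assuming the pointer/size overload of gzip::compress:

#include <gzip/compress.hpp>
#include <iostream>
#include <string>

int main()
{
    std::string empty;
    // Currently this succeeds instead of throwing...
    std::string compressed = gzip::compress(empty.data(), empty.size());
    // ...and yields a small non-empty result: just the gzip
    // header/trailer, roughly 20 bytes.
    std::cout << compressed.size() << std::endl;
    return 0;
}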
/cc @mapbox/core-tech
The solution is to look for the line with reinterpret_cast<z_const Bytef*>(data) and change it to reinterpret_cast<z_const Bytef*>(&data), for example. This needs to be done twice: once in compress.hpp and once in decompress.hpp.
If the output buffer given to the compress() call here https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/compress.hpp#L27 already contains some data (which might make sense if you want to compress several things one after the other before writing them out in one go), the code at https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/compress.hpp#L82-L89 will not do the right thing.
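Until that is fixed, a possible workaround sketch (my code, not the library's): compress each piece into its own string and append, rather than handing compress() a non-empty output buffer. Concatenated gzip members form a valid multi-member stream per RFC 1952, though not every decoder reads past the first member.

#include <gzip/compress.hpp>
#include <string>
#include <vector>

// `pieces` is a hypothetical set of buffers to compress in one go.
std::string compress_all(std::vector<std::string> const& pieces)
{
    std::string out;
    for (auto const& piece : pieces)
    {
        // Each call starts from an empty internal buffer, sidestepping
        // the bug described above; the results are simply appended.
        out += gzip::compress(piece.data(), piece.size());
    }
    return out;
}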
cmake did not error out:
cmake ..
-- The CXX compiler identification is GNU 4.8.5
-- Check for working CXX compiler: /usr/lib64/ccache/c++
-- Check for working CXX compiler: /usr/lib64/ccache/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring release build
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /home1/newdbadmin/gzip/gzip-hpp/build
make has an error:
/home/travis/build/mapbox/mason/mason_packages/.build/benchmark-1.3.0/src/benchmark_register.cc:(.text+0x26e6): undefined reference to `std::__throw_out_of_range_fmt(char const*, ...)'
make
Scanning dependencies of target bench-tests
[ 16%] Building CXX object CMakeFiles/bench-tests.dir/bench/run.cpp.o
[ 33%] Linking CXX executable bench-tests
../mason_packages/linux-x86_64/benchmark/1.3.0/lib/libbenchmark.a(benchmark_register.cc.o): In function `benchmark::internal::Benchmark::Ranges(std::vector<std::pair<int, int>, std::allocator<std::pair<int, int> > > const&)':
/home/travis/build/mapbox/mason/mason_packages/.build/benchmark-1.3.0/src/benchmark_register.cc:(.text+0x26e6): undefined reference to `std::__throw_out_of_range_fmt(char const*, ...)'
collect2: error: ld returned 1 exit status
make[2]: *** [bench-tests] Error 1
make[1]: *** [CMakeFiles/bench-tests.dir/all] Error 2
make: *** [all] Error 2
Thanks
As a general design, we should likely encourage the use of streaming operators, since zlib implements a streaming algorithm. There is no simple way to make or use a streaming C++ operation with the current design. I think this is a mistake: I/O operations will almost always be slower than any gzip decompression or compression. Basic tools to operate on existing buffers are great, but we should focus on making our operations more stream-based for performance reasons.
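Not proposing an API here, but to make the stream-based idea concrete, a minimal sketch using zlib's own streaming interface (in the spirit of zlib's zpipe.c example; the chunk size and error handling are my choices, not the library's):

#include <zlib.h>
#include <array>
#include <istream>
#include <ostream>
#include <stdexcept>

// Read from `in`, write gzip-framed output to `out` in fixed-size chunks,
// so compression overlaps I/O and no whole-input buffer is required.
inline void compress_stream(std::istream& in, std::ostream& out)
{
    z_stream strm{};
    // windowBits 15 + 16 selects the gzip (rather than zlib) wrapper
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 + 16, 8,
                     Z_DEFAULT_STRATEGY) != Z_OK)
    {
        throw std::runtime_error("deflateInit2 failed");
    }
    std::array<char, 16384> in_buf;
    std::array<char, 16384> out_buf;
    int flush = Z_NO_FLUSH;
    do
    {
        in.read(in_buf.data(), static_cast<std::streamsize>(in_buf.size()));
        strm.avail_in = static_cast<uInt>(in.gcount());
        strm.next_in = reinterpret_cast<Bytef*>(in_buf.data());
        flush = in.eof() ? Z_FINISH : Z_NO_FLUSH;
        do
        {
            strm.avail_out = static_cast<uInt>(out_buf.size());
            strm.next_out = reinterpret_cast<Bytef*>(out_buf.data());
            deflate(&strm, flush); // cannot fail with a valid setup
            out.write(out_buf.data(),
                      static_cast<std::streamsize>(out_buf.size() - strm.avail_out));
        } while (strm.avail_out == 0);
    } while (flush != Z_FINISH);
    deflateEnd(&strm);
}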
Uncovered that the std::string compress method is causing unintentional behaviour that could affect performance. What happens is that when you fail to pass a size to the const char* pointer method, the call dispatches to the other std::string function, creates a std::string implicitly, and uses the size of that. So things work correctly, just not quite as you would hope, and an extra string copy is created that we do not want.
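A sketch of the dispatch described above, assuming overloads roughly like std::string compress(const char* data, std::size_t size) and std::string compress(const std::string& data); get_buffer and get_buffer_size are hypothetical stand-ins:

const char* data = get_buffer();      // hypothetical byte source
std::size_t size = get_buffer_size(); // hypothetical size
// No size argument: overload resolution falls back to the std::string
// overload via an implicit conversion, copying the whole input first.
auto a = gzip::compress(data);
// Pointer/size overload: no temporary, no extra copy.
auto b = gzip::compress(data, size);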
Per chat with @artemp re: strong types, we may want to be more explicit here. When performance and reliability are critical, requiring more explicit types is well worth it, since they can guarantee correct code.
cc @mapbox/core-tech
The reader loop in decompress.hpp suppresses Z_BUF_ERROR, so it doesn't detect truncated input files.
I think the right way to do it is probably as in this change to Tippecanoe's code copied and pasted from the same place: mapbox/tippecanoe@07ab900.
With this change, the loop terminates only on Z_STREAM_END, so in the case of a truncated stream it continues trying to decompress and reports Z_BUF_ERROR, instead of the current behavior of terminating because the input has all been consumed even though the decompressor is still expecting more.
This also attempts to clean up what has always seemed like confusing logic to me about the size of the output buffer, the size of the output so far, and zlib's pointers into the buffer.
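Roughly, the proposed shape of the loop; this is a sketch, not the exact tippecanoe patch, and inflate_s is assumed to be already initialized via inflateInit2 and pointed at the whole input:

std::string output;
std::size_t bytes_written = 0;
std::size_t const chunk = 16384;
int ret = Z_OK;
do
{
    // Grow the output and point zlib at the unused tail.
    output.resize(bytes_written + chunk);
    inflate_s.next_out = reinterpret_cast<Bytef*>(&output[0] + bytes_written);
    inflate_s.avail_out = static_cast<uInt>(output.size() - bytes_written);
    ret = inflate(&inflate_s, Z_NO_FLUSH);
    if (ret != Z_OK && ret != Z_STREAM_END)
    {
        // On truncated input, inflate eventually returns Z_BUF_ERROR
        // (no progress possible), which is now reported instead of hidden.
        inflateEnd(&inflate_s);
        throw std::runtime_error(inflate_s.msg ? inflate_s.msg : "inflate failed");
    }
    bytes_written = output.size() - inflate_s.avail_out;
} while (ret != Z_STREAM_END);
output.resize(bytes_written);
inflateEnd(&inflate_s);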
This project exists, in part, to stop the dangerous practice of copying code around without properly understanding what it does and without there being a clear lineage of bug fixes and docs.
So, we are working on starting off on the right foot in this repo and fixing our checkered past. But during the vetting of the code the question has come up: "where did this come from and why was this decision made?".
I spent some time trying to figure out the history. We know that proximately we copied the code from mapnik-vector-tile to get this project started, but where did that code come from? Well, if you search on GitHub for while (inflate_s.avail_out == 0) (which is a pretty unique line) we get a bunch of hits for the same code copied around: https://github.com/search?q=while+%28inflate_s.avail_out+%3D%3D+0%29&type=Code&utf8=%E2%9C%93. In looking at that, I think the first to come along was https://github.com/kkaefer/DEPRECATED-node-zlib from @kkaefer. So I think that is the origination of the code that mapnik-vector-tile has been using. And the history of commits in that repo shows some of the reasons for changes in the code (like the addition of handling of Z_BUF_ERROR in kkaefer/DEPRECATED-node-zlib@523477f#diff-1edfe6d0b0a3a9c5bcbb3ba0e11144a9).
We should make all compiler warnings errors so they fail the travis build. Then we should fix the compiler warnings. refs mapbox/hpp-skel#26
/cc @GretaCB - pull me in to help fix the warnings when you get there.
Per #5 (comment), we can improve the tests by updating them to better handle test data and ensure proper comparisons/assertions: an UPDATE environment variable would put the tests into a mode where they are updateable. This way we can ensure the output is exactly as we expect, and we would pick up differences, say, between our results and another implementation like #7.
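A sketch of how such an UPDATE mode could look; the helper names and fixture layout are assumptions, not the repo's actual test code:

#include <cstdlib>
#include <fstream>
#include <iterator>
#include <string>

// When UPDATE is set in the environment, rewrite the fixture on disk;
// otherwise read it back and compare against the actual output.
inline bool update_mode()
{
    return std::getenv("UPDATE") != nullptr;
}

inline void check_fixture(std::string const& path, std::string const& actual)
{
    if (update_mode())
    {
        std::ofstream(path, std::ios::binary) << actual;
        return;
    }
    std::ifstream ifs(path, std::ios::binary);
    std::string expected((std::istreambuf_iterator<char>(ifs)),
                         std::istreambuf_iterator<char>());
    // compare with the project's assertion macro of choice, e.g.
    // REQUIRE(actual == expected);
}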
cc @springmeyer
"C++ lib for gzip compression and decompression" means not for zip packages?
I have tried:
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <gzip/decompress.hpp>

std::string filename("c:/transfer/test.zip");
std::ifstream ifs(filename, std::ios_base::in | std::ios_base::binary);
if (ifs.is_open())
{
    std::cout << "zip opened\n";
    std::string str_compressed((std::istreambuf_iterator<char>(ifs.rdbuf())),
                               std::istreambuf_iterator<char>());
    ifs.close();
    std::string out{};
    gzip::Decompressor decomp{};
    try
    {
        decomp.decompress(out, str_compressed.data(), str_compressed.size());
        std::cout << out.c_str() << std::endl;
    }
    catch (std::exception& ex)
    {
        std::cout << ex.what() << std::endl;
    }
}
But I got:
incorrect header check
So, gzip isn't for zip packages?
I've included gzip.hpp in two CPP files in a VS2017 project, and when it links I get "already defined" errors for gzip::compress and gzip::decompress. Is there a way to allow multiple inclusion or do I need to include it just once in a wrapper class?
utils.hpp contains the following length check for is_compressed:
return size > 2 && ...
However, it only uses the first two bytes. I would vote that the length check should be return size >= 2 && ...
This would allow me to do something like:
char gzipHeader[2];
file.read(gzipHeader, 2);
file.seekg(0, ios::beg); // rewind
if (gzip::is_compressed(gzipHeader, 2))
// do something...
Once @GretaCB feels this library is ready for release, let's plan to tag as v1.0.0 and package in mason.
The general flow will be:
git tag v1.0.0 -a -m "Tagging gzip-hpp v1.0.0"
git push --tags
Then add a gzip package to mason and trigger the binary build:
./mason trigger gzip 1.0.0
I've pushed some updated docs on mason packaging at mapbox/mason#478.
/cc @GretaCB
I get the following compile error on g++ 13.2:
../ext/gzip/utils.hpp:15:36: error: ‘uint8_t’ does not name a type
15 | static_cast<uint8_t>(data[0]) == 0x78 &&
../ext/gzip/utils.hpp:2:1: note: ‘uint8_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
1 | #include <cstdlib>
+++ |+#include <cstdint>
Adding #include <cstdint> to utils.hpp fixes the issue.
I noticed this repo is currently in an odd state due to https://docs.travis-ci.com/user/open-source-on-travis-ci-com/#Existing-Open-Source-Repositories-on-travis-ci.org. When trying to enable it on https://travis-ci.com/profile/mapbox I see:
[screenshot]
@artemp until this is fixed please ensure your tests are passing locally at #25 and ensure you get another reviewer to confirm as well.
/cc @mapsam for visibility
When compressing, the initial output buffer is sized as size / 2 + 1024 (https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/compress.hpp#L81). Where does this formula come from? Is it optimal?
Looking at the real-world tiles in the mvt-fixtures, compressed tiles are (on average) larger than half the size of the uncompressed tiles. So this means we'll go around the loop twice, doing extra work.
Maybe this should be configurable, because only the calling code knows what data it has and what the compression ratio might be?
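To illustrate with made-up numbers:

// Made-up numbers: a 100000-byte tile gets an initial output buffer of
// 100000 / 2 + 1024 = 51024 bytes. If the compressed tile ends up around
// 60000 bytes (plausible for dense real-world tiles), the buffer is too
// small, so the loop resizes and runs deflate again over the remainder.
std::size_t size = 100000;
std::size_t initial_size = size / 2 + 1024; // 51024 bytes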
In gzip-hpp we support both the gzip and zlib codings.
The difference between the two codings is important but small: gzip adds a header and a CRC. I'm creating this ticket so that our docs can explain the difference and link to the relevant spec.
@GretaCB - can you do a pass at the docs and code comments to figure out where best to link to https://tools.ietf.org/html/rfc7230#section-4.2?
I found that ietf doc via https://nodejs.org/api/zlib.html#zlib_compressing_http_requests_and_responses.
It'd be great to include an example section (or perhaps even a method?) about how to check if a string or buffer is compressed. Perhaps gzip::uncompress does this internally?
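In the meantime, utils.hpp does expose gzip::is_compressed (see the length-check issue above), which inspects the leading magic bytes; a minimal usage sketch:

#include <gzip/utils.hpp>
#include <string>

bool maybe_gzip(std::string const& buffer)
{
    // Checks the leading magic bytes (0x1f 0x8b for gzip, 0x78 for zlib),
    // so only a short prefix of the data is actually needed.
    return gzip::is_compressed(buffer.data(), buffer.size());
}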
Any chance we can get a 20.04 Dockerfile (and upgrade from 16.04) for master? Also, perhaps update the zlib version from 1.2.8 to 1.2.11? And other dependencies?
I looked at the code and have some comments:
- The output parameter is of type InputType? And I would have expected the output parameter to come after the input parameters. But that's historically inconsistent in C APIs anyway...
- Why have Compressor and Decompressor classes? The only thing that ever happens to them is that they are instantiated and then the (de)compress function is called on them. That would work as a simple function? I can see why classes might be useful when we (de)compress in blocks, but the library can't do that, due to the inflateInit2 and deflateInit2 functions not being in the constructor. Okay, we don't have to set those compression parameters every time, but how often do we re-use an instantiated Compressor or Decompressor object to make this useful?
- Consider noexcept; as written the function is hard to follow due to the many parentheses.
- Consider an assert() here.
- Consider overloads of (de)compress() that take an existing std::string and fill it.
- If std::size_t is 32 bit, size * 2 will be 32 bit and can overflow before it is assigned to the 64-bit size_64. const auto size_64 = static_cast<uint64_t>(size) * 2; should be correct.
- Make sure inflateEnd is called (like here https://github.com/mapbox/gzip-hpp/blob/master/include/gzip/decompress.hpp#L60). Maybe this can be wrapped in an RAII wrapper somehow? (See the sketch after this list.)
- Check the result of the inflateInit2 call and return early on failure, saving an inflateEnd. Also consider making this an assert.
Not an expert here, but have seen other projects doing something like:
#ifdef __GNUC__
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfloat-equal"
#endif
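Regarding the inflateEnd item in the list above, a minimal RAII sketch (the guard name is mine, not the library's):

#include <zlib.h>

// Runs inflateEnd on every path out of the enclosing scope,
// including exceptions thrown while decompressing.
struct inflate_guard
{
    z_stream& strm;
    explicit inflate_guard(z_stream& s) : strm(s) {}
    ~inflate_guard() { inflateEnd(&strm); }
    inflate_guard(inflate_guard const&) = delete;
    inflate_guard& operator=(inflate_guard const&) = delete;
};
// Usage: construct one right after a successful inflateInit2, and the
// explicit inflateEnd calls on each error path can be removed.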
Per chat with @joto, we learned that zlib has another API that takes the original size of the decompressed data. See the example in libosmium. This could help performance in gzip-hpp by avoiding constant resizing of the write buffer, which requires reallocations.
This would mean grabbing the original file size/buffer length via the ISIZE field that gzip stores in the stream's trailer.
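For reference, ISIZE is the little-endian uncompressed size mod 2^32 held in the last four bytes of a gzip member (RFC 1952). A sketch of reading it (the function name is mine):

#include <cstddef>
#include <cstdint>

inline std::uint32_t gzip_isize(unsigned char const* data, std::size_t size)
{
    // Caller must guarantee size >= 4 and a single, untruncated member.
    return static_cast<std::uint32_t>(data[size - 4]) |
           (static_cast<std::uint32_t>(data[size - 3]) << 8) |
           (static_cast<std::uint32_t>(data[size - 2]) << 16) |
           (static_cast<std::uint32_t>(data[size - 1]) << 24);
}

The returned value could be used to size the output buffer once up front instead of growing it in the decompress loop, with some care: it is only the size mod 2^32 and only trustworthy for well-formed single-member input.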
cc @springmeyer
It is very nice to put zlib in a C++ setting; I was trying to do the same. Just a quick question: is there a wrapper for initializing the gz file?
I presume our wrapper around zlib is as fast as possible, and faster than the C++ wrapper Boost provides (refs #7).
But in the past I've seen zlib compression be a meaningful % of the time taken to work with vector tiles. Often the use case is: decompress a tile, do an operation on it, then recompress it.
When the "do an operation" step is fairly speedy, the decompress and recompress times are often meaningful (at least 5-10% of the time taken on the CPU). In applications handling a lot of concurrent requests, where the CPU may be working hard, we can increase the amount of concurrency possible on a single machine by reducing the amount of work on the CPU.
So, this is a long way of saying: compression performance here is worth benchmarking. When/if we do, we should look into benchmarking https://github.com/ebiggers/libdeflate, which claims to be faster than zlib.
/cc @GretaCB @flippmoke