
protozero

Minimalistic protocol buffer decoder and encoder in C++.

Designed for high performance. Suitable for writing zero-copy parsers and encoders with minimal need for run-time memory allocation.

Low-level: this is designed to be a building block for writing a very customized decoder for a stable protobuf schema. If your protobuf schema is changing frequently, or lazy decoding is not critical for your application, then this approach offers no value: just use the C++ API that can be generated with the Google Protobuf protoc program.


Depends

  • C++11 compiler
  • CMake
  • Some tests depend on the Google Protobuf library, but using protozero itself doesn't require it

How it works

The protozero code does not read the .proto files used by the usual Protobuf implementations. The developer using protozero has to manually "translate" the .proto description into code, so there is no way to access any of the information from the .proto description at run time. This results in a few restrictions:

  • The names of the fields are not available.
  • Enum names are not available; you'll have to use the values they are defined with.
  • Default values are not available.
  • Field types have to be hardcoded. The library does not know which types to expect, so the user of the library has to supply the right types. Some checks are made using assert(), but mostly the user has to take care of that.

The library will make sure not to overrun the buffer it was given, but basically all other checks have to be made in user code!
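
To make this concrete, here is a hedged sketch of what such hand-written decoding looks like. The message layout { string name = 1; uint32 id = 2; } is invented for this example; only the pbf_reader calls are protozero's API:

#include <protozero/pbf_reader.hpp>

#include <cstddef>
#include <string>

// Hand-written decoder for a hypothetical message { string name = 1; uint32 id = 2 }.
// The field numbers and types are hardcoded; protozero never sees the .proto file.
std::string decode_name(const char* data, std::size_t size) {
    std::string name;
    protozero::pbf_reader message{data, size};
    while (message.next()) {
        switch (message.tag()) {
            case 1:
                name = message.get_string();
                break;
            default:
                message.skip(); // unknown or unneeded field
        }
    }
    return name;
}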

Documentation

You have to have a working knowledge of how protocol buffer encoding works.

The build process will also build the Doxygen-based reference documentation if you have Doxygen installed. Then open doc/html/index.html in your browser to read it.

Endianness

Protozero uses a very simplistic test to check the byte order of the system it compiles on. If this check is wrong, you'll get test failures. If this is the case, please open an issue and tell us about your system.

Building tests

Extensive tests are included. Build them using CMake:

mkdir build
cd build
cmake ..
make

Call ctest to run the tests.

The unit and reader tests are always built; the writer tests are only built if the Google Protobuf library is found when running CMake.

See test/README.md for more details about the tests.

Coverage report

To get a coverage report set CXXFLAGS and LDFLAGS before calling CMake:

CXXFLAGS="--coverage" LDFLAGS="--coverage" cmake ..

Then call make as usual and run the tests using ctest.

If you are using g++, use gcov to generate a report (results are in *.gcov files):

gcov -lp $(find test/ -name '*.o')

If you are using clang++, use llvm-cov instead:

llvm-cov gcov -lp $(find test/ -name '*.o')

If you are using g++, you can use gcovr to generate nice HTML output:

mkdir -p coverage
gcovr . -r SRCDIR --html --html-details -o coverage/index.html

Open coverage/index.html in your browser to see the report.

Clang-tidy

After the CMake step, run

make clang-tidy

to check the code with clang-tidy. You might have to set the CLANG_TIDY variable in the CMake configuration.

Cppcheck

For extra checks with Cppcheck you can, after the CMake step, call

make cppcheck

Installation

After the CMake step, call make install to install the include files in /usr/local/include/protozero.

If you are using CMake to build your own software, you can copy the file cmake/FindProtozero.cmake and use it in your build. See the file for details.

Who is using Protozero?

Are you using Protozero? Tell us! Send a pull request with changes to this README.

protozero's People

Contributors

daniel-j-h, e-n-f, elfprince13, ffontaine, flippmoke, joto, mapsam, nigels-com, springmeyer, tomhughes, wilhelmberg


protozero's Issues

Thinking about asserts

In debug mode, we enable asserts on invalid/non-parsable data. When not in debug mode, these asserts do not fire, and the likely outcome is instead a runtime exception down the line.

Both behaviors (asserts and exceptions) seem useful for developers to catch bugs in their parsing code and/or invalid data.

One problem exists:

  • You write testing code that expects exceptions and uses protozero
  • You compile that code in debug mode, and asserts get enabled
  • Then the tests you wrote will likely fail if they expect exceptions

Ideas for solutions:

    1. Recommend users of protozero write different tests for when compiled in debug mode
    2. Recommend users of protozero not compile/run their unit tests at all in debug mode
    3. Add a public macro to disable asserts in debug mode
    4. Make asserts opt-in rather than the default - then they could be used in either release or debug mode (a possible shape for the last two ideas is sketched below)
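
A possible shape for ideas 3 and 4, assuming all internal checks are routed through one macro (PROTOZERO_DISABLE_ASSERTS is an illustrative name, not an existing option):

// Route every internal check through one macro so users can turn the
// checks off (or on) independently of NDEBUG.
#ifdef PROTOZERO_DISABLE_ASSERTS
# define protozero_assert(x)
#else
# include <cassert>
# define protozero_assert(x) assert(x)
#endif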

Windows Debug Builds are failing

e.g.
https://ci.appveyor.com/project/Mapbox/protozero/build/1.0.43/job/oeeb2ail1ue7atft#L289

test_cases.obj : error LNK2001: unresolved external symbol __imp___CrtDbgReportW [C:\projects\protozero\tests.vcxproj]
  msvcprtd.lib(locale0_implib.obj) : error LNK2019: unresolved external symbol __imp___free_dbg referenced in function "public: static void __cdecl std::_Fac_node::operator delete(void *)" (??3_Fac_node@std@@SAXPAX@Z) [C:\projects\protozero\tests.vcxproj]

  msvcprtd.lib(locale0_implib.obj) : error LNK2019: unresolved external symbol __imp___malloc_dbg referenced in function "public: static void * __cdecl std::_Fac_node::operator new(unsigned int)" (??2_Fac_node@std@@SAPAXI@Z) [C:\projects\protozero\tests.vcxproj]
  C:\projects\protozero\Debug\tests.exe : fatal error LNK1120: 4 unresolved externals [C:\projects\protozero\tests.vcxproj]

varint vs. varint2

Why do we have two functions? Should the developer choose the appropriate version on their own?

Minor warnings on OSX

Would be nice to avoid this noise (I see this locally for every .cpp file):

include/protozero/pbf_writer.hpp:30:5: warning: '__BYTE_ORDER' is not defined, evaluates to 0 [-Wundef]
#if __BYTE_ORDER != __LITTLE_ENDIAN
    ^
include/protozero/pbf_writer.hpp:30:21: warning: '__LITTLE_ENDIAN' is not defined, evaluates to 0 [-Wundef]
#if __BYTE_ORDER != __LITTLE_ENDIAN
                    ^
include/protozero/pbf_writer.hpp:74:5: warning: '__BYTE_ORDER' is not defined, evaluates to 0 [-Wundef]
#if __BYTE_ORDER == __LITTLE_ENDIAN
    ^
include/protozero/pbf_writer.hpp:74:21: warning: '__LITTLE_ENDIAN' is not defined, evaluates to 0 [-Wundef]
#if __BYTE_ORDER == __LITTLE_ENDIAN
                    ^

Add a writer as well

mapbox/pbf is both a reader and a writer, while pbf.hpp is just a reader. I think this repo eventually needs a writer too.

data_view constructor taking string cannot be marked constexpr

In file included from /home/karen/clones/the-alchemist/thirdparty/protozero/include/protozero/pbf_writer.hpp:27:0,
                 from /home/karen/clones/the-alchemist/main.cc:5:
/home/karen/clones/the-alchemist/thirdparty/protozero/include/protozero/types.hpp: In constructor ‘constexpr protozero::data_view::data_view(const string&)’:
/home/karen/clones/the-alchemist/thirdparty/protozero/include/protozero/types.hpp:73:26: error: call to non-constexpr function ‘const _CharT* std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::data() const [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’
         : m_data(str.data()),
                          ^

Affects both C++11 and 14. Found by including <protozero/pbf_writer.hpp> but not using it.
cc @joto @daniel-j-h

Release new version

The work on a new packed API (#45) is working well in node-mapnik/mapnik-vector-tile and, as discussed with @flippmoke and @joto, this should now go out in a release.

@joto - can you tag a new release at your earliest convenience?

/cc @danpat (who is looking at using this new protozero in an osrm-backend develop branch)

Explore possible SIMD speedups

Might be worth looking at this library:

https://github.com/lemire/simdcomp

and seeing if we can apply that to varint packing (or things like repeated uint32 geometry = 4 [ packed = true ];).

@mourner pointed out simdcomp is a pretty big codebase, it would likely not be a super-smooth integration, and it would greatly increase the complexity of pbf.hpp

fast and safe varint impl

Creating a ticket to document which varint implementation we are using and why.

The original one was from https://github.com/haberman/upb. I also experimented with an optimized one from folly, but removed it in #1 since we did not need two and I saw no distinct benefit from the folly impl.

But then @joto found a bug (I don't see a specific ticket on this) in the upb impl and switched formally to the folly impl in bbfddea.

Tonight I ran the old version of pbf (used in mbgl) through https://scan.coverity.com which found:

(screenshot: Coverity scan results, 2015-05-03)

@joto - is this related perhaps to the bug you found?

Next actions:

  • remove the upb license note since we are no longer using the upb impl.
  • keep this ticket open until I have a chance to run the upcoming pbf through coverity to ensure it is clean.

REQUIRE( buffer == load_data("bool/data-true") ) fails on big endian architectures

With the fix for #31 the Debian package builds for protozero 1.2.1 (buildlogs) now fail on big endian architectures with:

tests is a Catch v1.2.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
write bool field using pbf_writer
  true
-------------------------------------------------------------------------------
test/t/bool/test_cases.cpp:100
...............................................................................

test/t/bool/test_cases.cpp:112: FAILED:
  REQUIRE( buffer == load_data("bool/data-true") )
with expansion:
  "�" == "��"

-------------------------------------------------------------------------------
write bool field using pbf_builder
  true
-------------------------------------------------------------------------------
test/t/bool/test_cases.cpp:117
...............................................................................

test/t/bool/test_cases.cpp:129: FAILED:
  REQUIRE( buffer == load_data("bool/data-true") )
with expansion:
  "�" == "��"

===============================================================================
test cases:   86 |   84 passed | 2 failed
assertions: 4519 | 4517 passed | 2 failed

The unprintable characters in the expansion are "0x08 0x00" == "0x08 0x01" in hex.
0x08 0x01 is the content of test/t/bool/data-true.pbf and 0x08 0x00 of data-false.pbf.

Travis xcode6 run fails

https://travis-ci.org/mapbox/protozero/jobs/144398695

==> Downloading https://downloads.sf.net/project/machomebrew/Bottles/protobuf-2.
curl: (60) SSL certificate problem: Invalid certificate chain
More details here: http://curl.haxx.se/docs/sslcerts.html
curl performs SSL certificate verification by default, using a "bundle"
 of Certificate Authority (CA) public keys (CA certs). If the default
 bundle file isn't adequate, you can specify an alternate file
 using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
 the bundle, the certificate verification probably failed due to a
 problem with the certificate (it might be expired, or the name might
 not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
 the -k (or --insecure) option.
Error: Failed to download resource "protobuf"
Download failed: https://downloads.sf.net/project/machomebrew/Bottles/protobuf-2.6.0.mavericks.bottle.tar.gz
Warning: Bottle installation failed: building from source.
==> Downloading https://protobuf.googlecode.com/svn/rc/protobuf-2.6.0.tar.bz2
curl: (22) The requested URL returned error: 404 Not Found
Error: Failed to download resource "protobuf"
Download failed: https://protobuf.googlecode.com/svn/rc/protobuf-2.6.0.tar.bz2

Provide accessor functions with checks?

Currently the functions to access the pbf encoded data don't check whether it has the right format. Should we add functions that do?

auto a = pbf.varint(); // what if it is not a varint?
auto a = pbf.checked_varint(); // throws if it is not

Or should all functions check? This would add some overhead but make everything more robust.

We can add assert()s, but this will not solve the problem of intentionally or unintentionally corrupted data in production systems.
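
A hedged sketch of how such a checked accessor could look, reusing the wire type that next() already decoded (checked_varint(), m_wire_type and the exact exception type are illustrative, not the current API):

#include <cstdint>
#include <stdexcept>

std::uint64_t pbf::checked_varint() {
    // next() already recorded the current field's wire type; verify it
    // before decoding instead of blindly trusting the caller.
    if (m_wire_type != 0) { // wire type 0 is varint
        throw std::runtime_error{"field is not varint-encoded"};
    }
    return varint(); // the existing unchecked accessor
}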

/cc @springmeyer @artemp @kkaefer

Writer tests are not compiled and run.

There was another problem in #36 not fixed yet: the writer_tests are not compiled and run. This is a separate binary that needs to be compiled and run. It needs the Google Protobuf library installed, including the protoc compiler.

First tag

We should start versioning and tagging the code.

check_swap_1(-1) fails on various architectures

The Debian package build for protozero 1.2.0 failed on arm64, armel, powerpc, ppc64el, s390x & ppc64 (both big and little endian architectures) with the same test failure:

tests is a Catch v1.2.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
byte swapping
-------------------------------------------------------------------------------
test/t/endian/test_cases.cpp:42
...............................................................................

test/t/endian/test_cases.cpp:45: FAILED:
  REQUIRE( -1 == check_swap_1(-1) )
with expansion:

===============================================================================
test cases:   86 |   85 passed | 1 failed
assertions: 4510 | 4509 passed | 1 failed

Please advise how to best help troubleshoot this issue.

Byte order

The fixed size non-varint number types float, double, fixed32, and fixed64 are stored little-endian in Protobuf. There is no special handling in pbf.hpp, so it will only work on little-endian machines.
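
A minimal sketch of the handling that would be needed, assuming a hypothetical detection macro PROTOZERO_BIG_ENDIAN and the GCC/Clang byte-swap builtin:

#include <cstdint>
#include <cstring>

// Read a little-endian fixed32 portably: memcpy avoids alignment issues,
// and on big-endian machines the bytes are swapped afterwards.
std::uint32_t read_fixed32(const char* data) {
    std::uint32_t value;
    std::memcpy(&value, data, sizeof(value));
#ifdef PROTOZERO_BIG_ENDIAN // hypothetical detection macro
    value = __builtin_bswap32(value);
#endif
    return value;
}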

Alignment faults running tests on ARM

I'm seeing a SIGBUS fault running the tests on an ARM system. The backtrace is:

#0  Catch::ExpressionLhs<float const&>::captureExpression<(Catch::Internal::Operator)0, float> (this=0xbeffe2cc, this@entry=0xbeffe388, rhs=@0xbeffe2c4: 17.3400002) at /usr/include/catch/internal/catch_expression_lhs.hpp:90
#1  0x0006d0ec in Catch::ExpressionLhs<float const&>::operator==<float> (rhs=@0xbeffe2bc: 7.73589176e+34, this=0xbeffe2c4) at /usr/include/catch/internal/catch_expression_lhs.hpp:35
#2  ____C_A_T_C_H____T_E_S_T____4 () at test/t/repeated_packed_float/test_cases.cpp:23
#3  0x00042458 in Catch::FreeFunctionTestCase::invoke (this=<optimized out>) at /usr/include/catch/internal/catch_test_case_registry_impl.hpp:116
#4  Catch::TestCase::invoke (this=<optimized out>) at /usr/include/catch/internal/catch_test_case_info.hpp:169
#5  Catch::RunContext::invokeActiveTestCase (this=0xbeffed24) at /usr/include/catch/internal/catch_runner_impl.hpp:298
#6  Catch::RunContext::runCurrentTest (redirectedCerr="", redirectedCout="", this=0xbeffed24) at /usr/include/catch/internal/catch_runner_impl.hpp:270
#7  Catch::RunContext::runTest (testCase=..., this=0xbeffed24) at /usr/include/catch/internal/catch_runner_impl.hpp:108
#8  Catch::Runner::runTests (this=0x25be4 <Catch::FatalConditionHandler::handleSignal(int)>, this@entry=0xbefff178) at /usr/include/catch/catch_runner.hpp:59
#9  0x00042ef8 in Catch::Session::run (this=this@entry=0xbefff344) at /usr/include/catch/catch_runner.hpp:186
#10 0x00015324 in Catch::Session::run (argv=<optimized out>, argc=<optimized out>, this=0xbefff344) at /usr/include/catch/catch_runner.hpp:166
#11 main (argc=<optimized out>, argv=<optimized out>) at /usr/include/catch/internal/catch_default_main.hpp:15

The faulting instruction is a vldr loading the float referenced by *it_pair.first from repeated_packed_float/data-one: the address returned is only two-byte aligned, but NEON instructions like that require four-byte alignment for floats.

I don't think there's an easy fix here - obviously having the actual reader do anything to guarantee alignment is a potentially large performance hit. The alternative is to have the tests copy the value out to an aligned location before testing it.
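
A sketch of that test-side workaround, using the pair of const float* iterators from the failing test (Catch's REQUIRE and Approx are assumed from the test context):

#include <cstring>

// Copy the packed float to an aligned local via memcpy before comparing,
// instead of dereferencing the possibly unaligned iterator directly.
float value;
std::memcpy(&value, it_pair.first, sizeof(value));
REQUIRE(17.34f == Approx(value));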

I guess the question is, what is the pbf_reader supposed to be guaranteeing about the iterators it returns for packed values?

Renaming the repository?

I think the name pbf.hpp for a repository is rather confusing. It means you'll have a directory with suffix .hpp. Can we find a better name?

fuzzing

There are a variety of fuzzing methods and fuzzing software out there. I'm new to this domain but started playing around with http://lcamtuf.coredump.cx/afl/ simply because it has been getting some buzz.

Docs on actual usage are minimal, but so far I've figured out that it works best when you can pass data via stdin to a command line tool. So I added one in https://github.com/mapbox/pbf.hpp/tree/afl-fuzz. The shortcoming so far is that it only tries to parse the input as a double. We should likely extend it to try parsing it in multiple ways.

After installing afl and latest clang 3.6 I ran:

make clean
export CXX=afl-clang++
export PATH=/opt/llvm/bin/:$PATH
export AFL_HARDEN=1
export AFL_USE_ASAN=1
make ./bin/read-pbf
mkdir testcase
cp test/t/double/data-zero.pbf testcase/
afl-fuzz -i testcase/ -o findings_dir/ bin/read-pbf

This has not yet found any crashes. It did report hangs but I think they are spurious. Logging how far I got here because I'm not seeing value in pushing this much further until I have more bandwidth to experiment.

Optimizing varint

Decoding varints is a major bottleneck. When doing profiling, the decode_varint function always shows up high in the list. I have done some experiments which show that over half the varints in OSM PBF files are 1-byte varints. Optimizing this specific case speeds up reading of those files by about 10%! I optimized this by using a wrapper function around decode_varint that checks for this specific case and handles it locally and otherwise hands work over to the original decode_varint function. I believe the compiler inlines the outer function which reduces the overhead for this common case. Forcing the compiler to inline the whole function didn't do much, the function is too big to be inlined anyway.
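
A sketch of that wrapper, close to what is described above; decode_varint_impl() stands in for the original full decoder:

#include <cstdint>

std::uint64_t decode_varint_impl(const char** data, const char* end); // slow path

// Thin inlineable wrapper: if the first byte has the high bit clear, this is
// a one-byte varint and we decode it right here; otherwise defer to the
// full implementation.
inline std::uint64_t decode_varint(const char** data, const char* end) {
    if (*data != end && (static_cast<unsigned char>(**data) & 0x80u) == 0) {
        const auto value = static_cast<std::uint64_t>(static_cast<unsigned char>(**data));
        ++*data;
        return value;
    }
    return decode_varint_impl(data, end);
}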

There are some more things we need to figure out here:

  • Is the OSM PBF format unusual in some way that it has so many 1-byte varints? Part of this is that Protobuf uses varints for the field tags (which say which type of field something is). If the tag is <15, it will be encoded in a 1-byte varint. I suspect that most Protobuf formats will do this, because this optimization is specifically mentioned in the Protobuf documentation. And OSM PBF uses delta encoding of IDs etc. to get small integers in several places, so this also helps. But do other formats have similar ratios? We have to check how vector tiles fare here.
  • Does it make sense to optimize this, or could we make some other optimization that would help in all cases? The varint decoder we use is from folly; I suspect they have done their homework on this, but we don't know their exact circumstances.
  • Can we do some general optimization that just works in most cases, or do we want the programmer to give hints to the library, such as what size integer to expect?

I looked at the SIMD speedup things mentioned in #20 and #27, but that code is huge. Folly has some code to speed up multiple varints stored together which could help with repeated packed varints, but this will not help with the very common case of the varint-encoded tag.

SIMD acceleration (performance)

Protozero uses a reasonable vbyte decoder (https://github.com/mapbox/protozero/blob/master/include/protozero/varint.hpp).

However, for long compressible sequences of integers, it should be possible to be 2 to 3 times faster using SIMD instructions with Masked VByte:

http://maskedvbyte.org/

(A C library is freely available: https://github.com/lemire/MaskedVByte )

It would take some effort to bring it over to protozero... e.g., one needs to handle zigzag decoding. I know the code well and I could help if anyone is interested.

(Feel free to dismiss this issue. I post it just to share the information.)

Benchmarking

With v1.0.0 out, we should now turn our attention to performance again.

Overall I doubt we can do anything major to improve performance and I also doubt we need to. But we should still set up a few benchmarks to confirm these assumptions conclusively.

Plan:

  • Write a few benchmarks that time decoding.
  • We should use large vector tiles as test input.
  • To start, let's focus on two different tile test cases:
    • Mapbox Streets (v6) for Yaounde: added in 81eb90b - represents varied data with strings
    • @ericfischer twitter dots for midtown NYC: added in 453cd32 - represents long geometry arrays
  • Write a simplistic benchmark that times counting the number of geometries in the tile across all layers, skipping all other data (a rough sketch follows this list).
  • Compare parsing speed to using google::protobuf::io::CodedInputStream directly.
  • Based on profiling output and how we compare to google::protobuf::io::CodedInputStream, then decide whether it makes sense to try to optimize.
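
A rough sketch of the counting benchmark mentioned above, using the vector tile layout (Tile.layers is tag 3 and Layer.features is tag 2 in the vector tile spec; each feature carries one geometry):

#include <protozero/pbf_reader.hpp>

#include <cstddef>
#include <string>

// Count features across all layers, skipping everything else.
std::size_t count_features(const std::string& tile_data) {
    std::size_t count = 0;
    protozero::pbf_reader tile{tile_data};
    while (tile.next(3)) {              // Tile.layers
        protozero::pbf_reader layer = tile.get_message();
        while (layer.next(2)) {         // Layer.features
            layer.skip();
            ++count;
        }
    }
    return count;
}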

Background:

New interface pbf_message and pbf_builder

I have added new interfaces in f37b1be which augment or replace the pbf_reader and pbf_writer classes with template classes built on top of them called pbf_message and pbf_builder. I have used different names instead of changing the behaviour of the existing names to keep backwards compatibility.

The new interfaces are somewhat more type safe and allow working with symbolic names in an easy fashion. You can see an example in libosmium where I am already using this new interface. Basically, a .proto file like https://github.com/scrosby/OSM-binary/blob/master/src/osmformat.proto is turned into a bunch of enums like https://github.com/osmcode/libosmium/blob/master/include/osmium/io/detail/protobuf_tags.hpp . Then you write code like https://github.com/osmcode/libosmium/blob/master/include/osmium/io/detail/pbf_decoder.hpp#L242-L266 to parse it.
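
For reference, a hedged sketch of the resulting pattern (the enum value mirrors the stringtable message in the linked osmformat.proto; the decode function, use() helper, and header path are illustrative):

#include <protozero/pbf_message.hpp>

#include <string>

void use(const std::string&); // stand-in for real processing

// Tags from the .proto file become an enum class; pbf_message<E> then
// checks accesses against it.
enum class StringTable : protozero::pbf_tag_type {
    repeated_bytes_s = 1
};

void decode_string_table(protozero::pbf_message<StringTable> message) {
    while (message.next()) {
        switch (message.tag()) {
            case StringTable::repeated_bytes_s:
                use(message.get_string());
                break;
            default:
                message.skip();
        }
    }
}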

Because this interface is preliminary, it is not documented or tested yet, but I use it in libosmium and I think it leads to cleaner and safer code with minimal effort.

@springmeyer @artemp any comments?

Iterator Range Conversion fails on libc++

The iterator range's conversion to pair / tuple fails on libc++.
Here's a small self-contained reproducible test-case:

#include <protozero/iterators.hpp> // for protozero::iterator_range
#include <tuple>

using protozero::iterator_range;

int main() {
  iterator_range<const int*> range;

  const int* first;
  const int* last;

  std::tie(first, last) = range;
}

it.cc:105:25: error: no viable overloaded '='
std::tie(first, last) = range;

Reading the docs:

std::tie may be used to unpack a std::pair because std::tuple has a converting assignment from pairs

http://en.cppreference.com/w/cpp/utility/tuple/tie

For the record, with libstdc++ I can force a similar error by defining -DPROTOZERO_STRICT_API (with both clang 3.8 as well as gcc 6.2):

error: ‘std::pair<const int*, const int*>’ is an inaccessible base of ‘iterator_range<const int*>’

With libc++ neither defining nor not defining the symbol makes a difference: both compilations fail.

We hit this today on OSX builds:
Project-OSRM/osrm-backend#2974

Windows: failing tests

Some tests are failing for me locally, but succeed on AppVeyor (maybe LF vs. CRLF again?):

(screenshot: failing test output)

Add command line protozero-decode tool

We should add a tool that is able to decode arbitrary protobuf-encoded buffers. It would behave similarly to protoc --decode_raw. An example of running protoc --decode_raw < test/t/vector_tile/data.vector.pbf is at: https://gist.github.com/springmeyer/f7de2d636f8cd084c992af70b05805cb.

Note: reading from stdin is nice, but we could also just accept a file path.

@joto - could you work on adding this tool? I made a quick attempt at https://gist.github.com/springmeyer/6124f976b4e3c5a59561c6fbada29f3e that is probably buggy, but might offer a few steps forward.

pbf_writer: } vs finish function

At the moment, pbf_writer has to be carefully scoped in order to let its destructor "finish" the message by writing bytes into the buffer. The artificial nesting gets even worse with multiple writers.

Although this is clever and we all love }, it requires nesting artificial scopes that hide the "finish" logic. Well, it's not even hidden, since we need the artificial scopes.

Can we put the destructor logic into a public finish (or similar) member function that the user can call?

This design would also mimic what the stdlib does, e.g. in std::unique_lock: its destructor unlocks if it owns the mutex, but the user can still call unlock manually early on.
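
For illustration, the current pattern and the proposed alternative side by side (the tags 1 and 2 are arbitrary; finish() is the hypothetical member proposed above):

#include <protozero/pbf_writer.hpp>

#include <string>

void write_example(std::string& buffer) {
    protozero::pbf_writer message{buffer};
    {
        protozero::pbf_writer submessage{message, 1};
        submessage.add_string(2, "hello");
    } // <- the submessage is only "finished" here, by the destructor

    // With the proposed (hypothetical) member function, no extra scope:
    // protozero::pbf_writer submessage{message, 1};
    // submessage.add_string(2, "hello");
    // submessage.finish();
}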

/cc @karenzshea

Windows support

To check whether everything works on Windows, we need to

  • enable appveyor support (@springmeyer)
  • add Windows build configuration.

Appveyor Windows build broken

The Windows build using gyp is broken. Two issues here:

  • The first issue: the tests binary runs only a single test case:

    All tests passed (12 assertions in 1 test case)

The output should be:

All tests passed (4519 assertions in 86 test cases)

There are a lot of LNK4042 warnings ("object specified more than once; extras ignored"). Could it be because all the test case files have the same name (but are in different directories)?

  • The writer_tests are not compiled and run at all. The Google Protobuf library and the protoc program need to be installed for this.

@BergWerkGIS: Could you have a look?

Optimize pbf_reader::get_packed_bool

pbf_reader::get_packed_bool() currently calls get_packed_int32(). But with a one-byte bool there will never be any endianness or alignment issues, so we could optimize this case.
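
An illustrative sketch of the shortcut, under the issue's assumption that each encoded bool occupies exactly one byte (this helper is hypothetical, not protozero's API):

#include <cstddef>

// Each bool in the packed field is then a single byte (0x00 or 0x01), so
// the range can be read directly, bypassing the varint machinery entirely.
inline bool get_bool_at(const char* packed_data, std::size_t i) {
    return packed_data[i] != 0;
}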

API thoughts

Because the API of pbf.hpp is very low-level, it is really easy to shoot yourself in the foot using it. What can we do to make it more robust and easier to use? I think it makes sense to think about this a little before we port all the software using it to the new version.

Here are some random ideas/comments:

  • varint(), svarint(), and fixed() are templated functions. They use a default of uint32_t; maybe it would be better to not have a default, so users have to be explicit and there is less chance of accidental wrong casting? Or we could remove the templates (or make them private) and have functions like varint_int64() or whatever. There are already functions pbf::float32() and pbf::float64() which basically do just this.
  • If we keep the templates, they should probably check their argument: std::is_integral<T> etc. (see the sketch after this list).
  • We can sprinkle some noexcept around and maybe some constexpr.
  • The pbf::string() function has the overhead of allocating a std::string. It is the only function that needs the heap. We could add a lower-level function pbf::data() or so that gives back a std::pair<const char*, size_t> or similar. Maybe a pbf::string(std::string& str) that copies the data into an existing string?
  • There is currently no special support for packed repeated fields. Do we need that?
  • How do we handle the case when a varint is larger than the type we give it to?
  • The current code can only handle 32-bit sizes for strings and embedded messages. Is that intended? (I know it is unlikely that we want strings or embedded messages that large...)
  • We can probably make pbf a POD/trivial type. That could allow some more compiler optimizations.
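
A minimal sketch of the template-argument check suggested in the list above (varint_impl() stands in for the existing untyped decoder; all names are illustrative):

#include <cstdint>
#include <type_traits>

std::uint64_t varint_impl(); // stand-in for the existing untyped decoder

template <typename T = std::uint32_t>
T varint() {
    // Reject accidental use with pointers, floats, class types, ...
    static_assert(std::is_integral<T>::value, "varint() requires an integral type");
    return static_cast<T>(varint_impl());
}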

/cc @artemp @kkaefer @springmeyer

Messages have to fit into buffer

The code currently assumes that the full PBF message fits into some buffer. If we have larger messages this might not necessarily be true, i.e. we might want to read more data once the input is exhausted. Of course, assuming we always have a full message in the input buffer makes the code radically simpler, but it also means we will not be able to handle larger messages without loading them into memory first, resulting in a larger memory footprint. This could be problematic even for medium-sized messages if there are many parallel processes.

Even in cases where we have several messages concatenated one after the other, each being independent, we have to read them all in together, because before inspecting the size encoded internally in the message, we don't know where one message ends and the next one starts. So we can't just read them in one by one.

Are these real concerns, or do we assume this is all somehow solved outside pbf.hpp?

Type-safe named access to protobuf fields.

One problem with Protozero has always been that it doesn't read the .proto file, so it doesn't know about the symbolic names of messages and fields and can't make sure you are accessing the protobuf data using the right types. There is an attempt at fixing some of these problems with the pbf_message and pbf_builder classes, but it is somewhat ugly and only solves half of the problems.

I tried out a new approach in the v2-experimental branch. It uses some template and macro magic to make most of what we want possible. It doesn't read the .proto file, but it allows putting the same information into the C++ file using macros in a way that is very easy to translate from the .proto file:

PROTOZERO_MESSAGE(Sub) {
  PROTOZERO_FIELD(string, s, 1);
};

PROTOZERO_MESSAGE(Test) {
  PROTOZERO_FIELD(fixed32, f, 1);
  PROTOZERO_FIELD(int64, i, 2);
  PROTOZERO_FIELD(int64, j, 3);
  PROTOZERO_MESSAGE_FIELD(Sub, submessage, 5);
  PROTOZERO_FIELD(string, s, 8);
  PROTOZERO_FIELD(uint32, u, 4);
  PROTOZERO_PACKED_FIELD(sint32, d, 7);
};

Access is then through macros called FIELD_TAG() and FIELD_VALUE():

protozero::message<Test> item(buffer);
while (item.next()) {
    switch (item.tag()) {
        case FIELD_TAG(item, f): {
            auto value = FIELD_VALUE(item, f);
            ...
            break;
        }
        case FIELD_TAG(item, submessage): {
            auto subitem = FIELD_VALUE(item, submessage);
            subitem.next();
            auto str = FIELD_VALUE(subitem, s);
            ...
            break;
        }
    }
}

The code in Protozero is too complicated for my liking, but there are probably some simplifications and cleanups we can do. Using this is, I think, rather straightforward. And if the PROTOZERO_MESSAGE() and PROTOZERO_FIELD() macros are correct, messages and fields are accessed by name and it is no longer possible to get the types wrong.

What do you think?

/cc @springmeyer @daniel-j-h

G++ -5 -Wshadow warning

Seeing this when using protozero 1.4.0 with vector-tile, which uses -Wshadow -Werror:

mason_packages/.link/include/protozero/iterators.hpp: In instantiation of ‘constexpr protozero::iterator_range<T, P>::iterator_range(protozero::iterator_range<T, P>::iterator&&, protozero::iterator_range<T, P>::iterator&&) [with T = protozero::const_varint_iterator<int>; P = std::pair<protozero::const_varint_iterator<int>, protozero::const_varint_iterator<int> >; protozero::iterator_range<T, P>::iterator = protozero::const_varint_iterator<int>]’:
mason_packages/.link/include/protozero/pbf_reader.hpp:129:51:   required from ‘protozero::iterator_range<T> protozero::pbf_reader::get_packed() [with T = protozero::const_varint_iterator<int>]’
mason_packages/.link/include/protozero/pbf_reader.hpp:688:60:   required from here
mason_packages/.link/include/protozero/iterators.hpp:82:15: error: declaration of ‘first’ shadows a member of ‘protozero::iterator_range<protozero::const_varint_iterator<int> >’ [-Werror=shadow]
     constexpr iterator_range(iterator&& first, iterator&& last) :
