Server-client solution for cached map feature data retrieval. 🏞️

License: BSD 3-Clause "New" or "Revised" License

CMake 5.50% C++ 94.18% Python 0.04% Shell 0.28%

mapget's Introduction

mapget

mapget is a server-client solution for cached map feature data retrieval.

Main Capabilities:

Coordinating requests for map data to various map data source processes.
Integrated map data cache based on RocksDB, or a simple in-memory cache.
Simple data-source API with bindings for C++, Python and JS.
Compact GeoJSON feature storage model - 25 to 50% smaller than BSON/msgpack.
Integrated deep feature filter language based on (a subset of) JSONata.
PIP-installable server and client component.

Python Package and CLI

The mapget package is deployed to PyPI for any Python version between 3.8 and 3.11. Simply running pip install mapget is enough to get you started:

python -m mapget serve will run a server,
python -m mapget fetch allows you to talk to a remote server,
you can also use the Python package to write a data source, as documented here.

If you build mapget from source as described below, you obtain an executable that can be used analogously to the Python package with mapget serve or mapget fetch.

Configuration

The command line parameters for mapget and its subcommands can be viewed with:

mapget --help
mapget fetch --help
mapget serve --help

(or python -m mapget --help for the Python package).

The mapget executable can parse a config file with arguments supported by the command line interface. The path to the config file can be provided to mapget via command line by specifying the --config parameter.

Sample configuration files can be found under examples/config:

sample-first-datasource.toml and sample-second-datasource.toml will configure mapget to run a simple datasource with sample data. Note: the two formats in config files for subcommand parameters can be used interchangeably.
sample-service.toml will configure the mapget serve command. The resulting instance will fetch and serve data from the sources started with the sample-*-datasource.toml configs above.

Cache

mapget supports persistent tile caching using a RocksDB-backed cache, and non-persistent in-memory caching. The CLI options to configure caching behavior are:

| Option | Description | Default Value |
|---|---|---|
| `-c,--cache-type` | Choose between "memory" or "rocksdb" (Technology Preview). | memory |
| `--cache-dir` | Path to store RocksDB cache. | mapget-cache |
| `--cache-max-tiles` | Number of tiles to store. Tiles are purged from cache in FIFO order. Set to 0 for unlimited storage. | 1024 |
| `--clear-cache` | Clear existing cache entries at startup. | false |
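
The FIFO purging described for `--cache-max-tiles` can be illustrated with a minimal sketch. This is not mapget's cache implementation; the class `FifoTileCache` and its methods are hypothetical names chosen for illustration. It only shows the documented behavior: the oldest inserted tile is dropped first, and a limit of 0 disables purging.

```python
from collections import OrderedDict


class FifoTileCache:
    """Illustrative FIFO cache (hypothetical); max_tiles == 0 means unlimited."""

    def __init__(self, max_tiles=1024):
        self.max_tiles = max_tiles
        self._tiles = OrderedDict()  # preserves insertion order

    def put(self, key, tile):
        if key in self._tiles:
            # Overwrite in place: the tile keeps its original queue position.
            self._tiles[key] = tile
            return
        self._tiles[key] = tile
        if self.max_tiles and len(self._tiles) > self.max_tiles:
            # Purge the oldest inserted tile (first in, first out).
            self._tiles.popitem(last=False)

    def get(self, key):
        # A lookup does not change the eviction order (unlike an LRU cache).
        return self._tiles.get(key)

    def __len__(self):
        return len(self._tiles)


cache = FifoTileCache(max_tiles=2)
for tile_id in (1, 2, 3):
    cache.put(tile_id, f"tile-{tile_id}")
print(cache.get(1), cache.get(3))  # tile 1 was inserted first, so it was purged
```

Note that reading a tile does not protect it from being purged; only insertion order matters in a FIFO policy.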

Map Data Sources

At the heart of mapget are data sources, which provide map feature data for a specified tile area on the globe and a specified map layer. The data source must provide information as to

Which map it can serve (e.g. China/Michigan/Bavaria...). In the component overview, this is reflected in the DataSourceInfo class.
Which layers it can serve (e.g. Lanes/POIs/...). In the component overview, this is reflected in the LayerInfo class.
Which feature types are contained in a layer (e.g. Lane Boundaries/Lane Centerlines), and how they are uniquely identified. In the component overview, this is reflected in the FeatureTypeInfo class.

Feel free to check out the sample_datasource_info.json. When the mapget Service is asked for a tile, e.g. using the POST /tiles REST API, it first queries its cache for the relevant data. On a cache miss, it forwards the request to one of its connected data sources for the requested map.
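
The lookup flow described above (cache first, data source on a miss) can be sketched as follows. `SketchService`, its methods, and the tile dictionary layout are hypothetical stand-ins, not the mapget API; the real service streams tiles asynchronously and works with `DataSourceInfo` objects.

```python
class SketchService:
    """Hypothetical stand-in for the cache-then-data-source lookup flow."""

    def __init__(self):
        self.cache = {}         # (map_id, layer_id, tile_id) -> tile
        self.data_sources = {}  # map_id -> callable(layer_id, tile_id) -> tile

    def add_data_source(self, map_id, fetch):
        self.data_sources[map_id] = fetch

    def get_tile(self, map_id, layer_id, tile_id):
        key = (map_id, layer_id, tile_id)
        if key in self.cache:
            return self.cache[key]       # cache hit
        fetch = self.data_sources.get(map_id)
        if fetch is None:
            raise KeyError(f"no data source serves map {map_id!r}")
        tile = fetch(layer_id, tile_id)  # cache miss: ask the data source
        self.cache[key] = tile
        return tile


calls = []


def tropico_source(layer_id, tile_id):
    # Records each call so we can see how often the source is queried.
    calls.append((layer_id, tile_id))
    return {"layer": layer_id, "tile": tile_id, "features": []}


service = SketchService()
service.add_data_source("Tropico", tropico_source)
service.get_tile("Tropico", "WayLayer", 1)
service.get_tile("Tropico", "WayLayer", 1)  # second call is served from cache
print(len(calls))
```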

Map Features

The atomic units of geographic data which are served by mapget are Features. The content of a mapget feature is aligned with that of a feature in GeoJSON: A feature consists of a unique ID, some attributes, and some geometry. mapget also allows features to have a list of child feature IDs. Note: Feature geometry in mapget may always be 3D.

TODO: Document JSON representation.
TODO: Document Feature ID schemes.
TODO: Document Geometry Types.

Map Tiles

For performance reasons, mapget features are always served in a set covering a whole tile. Each tile is identified by a zoom level z and two grid coordinates x and y. mapget uses a binary tiling scheme for the Earth's surface: The zoom level z controls the number of subdivisions for the WGS84 longitudinal [-180,180] axis (columns) and latitudinal [-90,90] axis (rows). The tile x coordinate indicates the column, and the y coordinate indicates the row. On level zero, there are two columns and one row. In general, the number of rows is 2^z, and the number of columns is 2^(z+1).
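
The grid arithmetic can be made concrete with a short sketch. The function names are hypothetical, and the row direction is an assumption: the text does not say whether row 0 lies at the north or south edge, so the sketch assumes column 0 starts at longitude -180 and row 0 starts at latitude +90.

```python
def grid_size(z):
    """Return (columns, rows) at zoom level z: 2^(z+1) columns, 2^z rows."""
    return 2 ** (z + 1), 2 ** z


def tile_extent_degrees(z):
    """Return (width, height) of one tile in degrees."""
    cols, rows = grid_size(z)
    return 360.0 / cols, 180.0 / rows


def tile_for_position(lon, lat, z):
    """Return (x, y) for a WGS84 position.

    Assumption: column 0 starts at longitude -180 and row 0 starts at
    latitude +90 (north). The row direction is not stated in the text.
    """
    cols, rows = grid_size(z)
    width, height = tile_extent_degrees(z)
    # Clamp so that lon == 180 or lat == -90 fall into the last column/row.
    x = min(int((lon + 180.0) / width), cols - 1)
    y = min(int((90.0 - lat) / height), rows - 1)
    return x, y


print(grid_size(0))            # two columns, one row on level zero
print(tile_extent_degrees(3))  # width and height are equal on every level
```

Because there are always twice as many columns as rows, each tile spans the same number of degrees in both directions (180 / 2^z).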

The content of a tile is (leniently) coupled to the geographic extent of its tile id, but also to the map layer it belongs to. When a data source creates a tile, it associates the created tile with the name of the map, e.g. "Europe-HD", and a map data layer, e.g. "Roads" or "Lanes".

Component Overview

The following diagram provides an overview of the libraries, their contents, and their dependencies:

mapget consists of four main libraries:

The mapget-model library is the core library which contains the feature-model abstractions.
The mapget-service library contains the main Service, ICache and IDataSource abstractions. Using this library, it is possible to use mapget in-process without any HTTP dependencies or RPC calls.
The mapget-http-service library binds a mapget service to an HTTP server interface, as described here.
The mapget-http-datasource library provides a RemoteDataSource which can connect to a DataSourceServer. This allows running a data source in an external process, which may be written using any programming language.

Developer Setup

mapget has the following prerequisites:

C++17 toolchain
CMake 3.14+
Python3
Ninja build system (not required, but recommended)
gcovr, if you wish to run coverage tests:
```
pip install gcovr
```
Python wheel package, if you wish to build the mapget wheel:
```
pip install wheel
```

Build mapget with the following command:

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -G Ninja
cmake --build .

If you wish to skip building the mapget wheel, deactivate the MAPGET_WITH_WHEEL CMake option in the second command:

cmake .. -DCMAKE_BUILD_TYPE=Release -DMAPGET_WITH_WHEEL=OFF -G Ninja

CMake Build Options

The mapget build can be configured using the following variables:

| Variable Name | Details |
|---|---|
| `MAPGET_WITH_WHEEL` | Enable mapget Python wheel (output to WHEEL_DEPLOY_DIRECTORY). |
| `MAPGET_WITH_SERVICE` | Enable mapget-service library. Requires threads. |
| `MAPGET_WITH_HTTPLIB` | Enable mapget-http-datasource and mapget-http-service libraries. |
| `MAPGET_ENABLE_TESTING` | Enable testing. |
| `MAPGET_BUILD_EXAMPLES` | Build examples. |

Environment Settings

The logging behavior of mapget can be customized with the following environment variables:

| Variable Name | Details | Value |
|---|---|---|
| `MAPGET_LOG_LEVEL` | Set the spdlog output level. | "trace", "debug", "info", "warn", "err", "critical" |
| `MAPGET_LOG_FILE` | Optional file path to write the log. | string |
| `MAPGET_LOG_FILE_MAXSIZE` | Max size for the logfile in bytes. | string with unsigned integer |
Implementing a Data Source

`examples/cpp/local-datasource`

This example shows how you can use the basic non-networked mapget::Service in conjunction with a custom data source class which implements the mapget::DataSource interface.

`examples/cpp/http-datasource`

This example shows how you can write a minimal networked data source service.

`examples/python/datasource.py`

This example shows how you can write a data source service in Python. You can simply pip install mapget to get access to the mapget Python API.

REST API

The mapget library provides simple C++ and HTTP/REST interfaces, which may be used to satisfy the following use-cases:

Obtain streamed map feature tile data for given constraints.
Locate a feature by its ID within any of the connected sources.
Describe the available map data sources.
View a simple HTML server status page (only for REST API).
Instruct the cache to populate itself within given constraints from the connected sources.

The HTTP interface implemented in mapget::HttpService is a view on the C++ interface, which is implemented in mapget::Service. Detailed endpoint descriptions:

| Endpoint | Method | Description | Input | Output |
|---|---|---|---|---|
| `/sources` | GET | Describe the connected Data Sources | None | `application/json`: List of DataSourceInfo objects. |
| `/tiles` | POST | Get streamed features, according to hard constraints. Accepts encoding types `text/jsonl` or `application/binary` | List of objects containing `mapId`, `layerId`, `tileIds`, and optional `maxKnownFieldIds`. | `text/jsonl` or `application/binary` |
| `/status` | GET | Server status page | None | `text/html` |
| `/locate` | POST | Obtain a list of tile-layer combinations providing a feature that satisfies given ID field constraints. | `application/json`: List of external references, where each is a Request object with `mapId`, `typeId` and `featureId` (list of external ID parts). | `application/json`: List of lists of Resolution objects, where each corresponds to the Request object index. Each Resolution object includes `tileId`, `typeId`, and `featureId`. |

Curl Call Example

For example, the following curl call could be used to stream GeoJSON feature objects for the WayLayer layer of a map named Tropico:

curl -X POST \
    -H "Content-Type: application/json" \
    -H "Accept: application/jsonl" \
    -H "Connection: close" \
    -d '{
    "requests": [
       {
           "mapId": "Tropico",
           "layerId": "WayLayer",
           "tileIds": [1, 2, 3]
       }
    ]
}' "http://localhost:8080/tiles"
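
The same request can be assembled with the Python standard library. The sketch below only builds the request object and does not send it; the helper name `build_tiles_request` is hypothetical, and the URL, headers, and body are copied from the curl call above.

```python
import json
import urllib.request


def build_tiles_request(base_url, map_id, layer_id, tile_ids):
    """Build (but do not send) the POST /tiles request from the curl call."""
    body = {
        "requests": [
            {"mapId": map_id, "layerId": layer_id, "tileIds": list(tile_ids)}
        ]
    }
    return urllib.request.Request(
        url=f"{base_url}/tiles",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/jsonl",
            "Connection": "close",
        },
    )


request = build_tiles_request(
    "http://localhost:8080", "Tropico", "WayLayer", [1, 2, 3]
)
print(request.get_method(), request.full_url)
```

With a server running, passing `request` to `urllib.request.urlopen` and iterating over the response should yield one line per streamed object, assuming the line-delimited JSON output described for the `/tiles` endpoint.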

C++ Call Example

If we use "Accept: application/binary" instead, we get a binary stream of tile data which we can also parse in C++, Python or JS. Here is an example in C++, using the mapget::HttpClient class:

#include "mapget/http-service/http-client.h"
#include <iostream>
#include <memory>
#include <vector>

using namespace mapget;

int main(int argc, char const *argv[])
{
     // Assumes a mapget service is listening on localhost:8080,
     // as in the curl example above.
     HttpClient client("localhost", 8080);

     // Count every tile passed to the result callback.
     auto receivedTileCount = 0;
     client.request(std::make_shared<LayerTilesRequest>(
         "Tropico",
         "WayLayer",
         std::vector<TileId>{{1234, 5678, 9112, 1234}},
         [&](auto&& tile) { receivedTileCount++; }
     ))->wait();

     std::cout << receivedTileCount << std::endl;
     return 0;
}

Keep in mind that you can also run a mapget service without any RPCs in your application. Check out examples/cpp/local-datasource on how to do that.

About `locate`

The /locate endpoint allows clients to obtain a list of tile-layer combinations that provide a feature satisfying given ID field constraints. This is crucial for applications needing to find specific data points within the massive datasets typically associated with map services. The endpoint uses a POST method due to the complexity and length of the queries, which involve resolving external references to data.

Details:

Input: The input is a list of requests, each corresponding to an external reference that needs to be resolved. Each request object includes:
- mapId: Specifies the map in which to locate the feature.
- typeId: Specifies the type of feature to locate.
- featureId: An array representing the external ID parts, where each part consists of a field name and value. This array is used to identify the feature uniquely. The ID scheme used may be a secondary scheme.
Output: The output is a nested list structure where each outer list corresponds to an input request object. Each of these lists contains resolution objects that provide details about where the requested feature can be found within the map data. Each resolution object includes:
- tileId: The key of the map tile containing the feature.
- typeId: The type of feature found, which should match the typeId specified in the request.
- featureId: An array of ID parts similar to the input but typically using the primary feature ID scheme of the data source.

This design allows clients to batch queries for multiple features in a single request, improving efficiency and reducing the number of required HTTP requests. It also supports the use of different ID schemes, accommodating scenarios where the request and response might use different identifiers for the same data due to varying external reference standards.

Note that a locate resolution must be provided by a data source for the specified map which implements the onLocateRequest callback.
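
The index-based pairing of requests and resolution lists can be sketched as follows. The payloads are hypothetical: the field names follow the description above, but the encoding of `featureId` as `[name, value]` pairs, the type name `Way`, the ID part names, and the tile key string are assumptions made for illustration only.

```python
import json

# Hypothetical request payload: one entry per external reference.
locate_requests = [
    {"mapId": "Tropico", "typeId": "Way", "featureId": [["externalWayId", 42]]},
    {"mapId": "Tropico", "typeId": "Way", "featureId": [["externalWayId", 43]]},
]

# Hypothetical response: one inner list per request, in the same order.
response_text = json.dumps([
    [{"tileId": "hypothetical-tile-key", "typeId": "Way",
      "featureId": [["wayId", 7]]}],
    [],  # the second reference could not be resolved
])


def pair_resolutions(requests, resolutions):
    """Match each request with its list of resolutions by index."""
    if len(requests) != len(resolutions):
        raise ValueError("expected one resolution list per request")
    return list(zip(requests, resolutions))


pairs = pair_resolutions(locate_requests, json.loads(response_text))
for request, found in pairs:
    print(request["featureId"], "->", [r["tileId"] for r in found])
```

An empty inner list is how this sketch represents an unresolved reference; how mapget reports that case is not specified in the text.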

erdblick-mapget-datasource communication pattern

TODO: expand and polish this section stub.

Client (erdblick etc.) sends a composite list of requests to mapget. Requests are batched because browsers limit the number of concurrent requests to one domain, but we want to stream potentially hundreds of tiles.
mapget checks if all requested map+layer combinations can be fulfilled with data sources
- yes: create tile requests, stream responses back to client,
- no: return 400 Bad Request (client needs to refresh its info on map availability).
A data source drops offline / mapget request fails during processing?
- cpp-httplib cleanup callback returns timeout response (probably status code 408).
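
The availability check in the second step can be sketched as a small function. The function name `check_requests`, the shape of the `available` mapping, and the error payload are hypothetical; only the 400 status for unknown map+layer combinations comes from the text.

```python
def check_requests(requests, available):
    """Return (status, payload) for a batch of tile requests.

    `available` maps a map ID to the set of layer IDs that some connected
    data source can serve (hypothetical structure).
    """
    unknown = [
        (r["mapId"], r["layerId"])
        for r in requests
        if r["layerId"] not in available.get(r["mapId"], set())
    ]
    if unknown:
        # The client should refresh its info on map availability.
        return 400, {"error": f"unknown map+layer combinations: {unknown}"}
    return 200, requests


available = {"Tropico": {"WayLayer"}}
ok_status, _ = check_requests(
    [{"mapId": "Tropico", "layerId": "WayLayer", "tileIds": [1, 2, 3]}],
    available,
)
bad_status, bad_payload = check_requests(
    [{"mapId": "Tropico", "layerId": "LaneLayer", "tileIds": [1]}],
    available,
)
print(ok_status, bad_status)
```

The whole batch is rejected if any single combination is unknown, which matches the "all requested map+layer combinations" wording above.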

mapget's Issues

Basic mapget cache and service libraries

To interface with mapget, clients should use the mapget::cache API (in remote mode or as a C++ library). As a first step, we will facilitate a basic API which supports only the Cache::tiles() function (or the GET /tiles REST API). From the following UML, the Application, Cache, DataSourceConnection and Configuration classes will be implemented.

Arguments to mapget can be specified in a configuration file

Currently, starting mapget with custom data sources requires lengthy command line invocations. CLI11, the command line argument parser used by mapget, can be used to parse config files containing the arguments as documented here.

The goal is for mapget to support reading arguments from a configuration file in addition to the command line, and to document this option. Existing integration tests triggered by CLI arguments should be extended to include one example config file.

`mapget::Service` tile requests support basic filtering functionality.

mapget::Service::request currently allows basic queries based on mapId, layerId, and tiles. This issue aims to extend the existing API to include basic feature filtering capabilities. Implementing this enhancement will allow users to test the function's accuracy and scalability more efficiently. Special attention will be given to meeting general scalability requirements.

Extended 'tiles' Endpoint Example

// Extract from mapget's updated README
// Fetch streamed features based on strict constraints.
// Support for Accept-Encoding text/jsonl or application/binary

+ POST /tiles(list<{
    mapId: string,
    layerId: string,
    tileIds: list<TileId>,
    maxKnownFieldIds* 
  }>, filter: optional<string>): 
  bytes<TileLayerStream>

Acceptance Criteria:

Implement filtering in both local and HTTP mapget services
Automated tests for both services, with:
- Customizable request options
- Metrics on time spent fetching and filtering data
Performance regression detection using standardized tests on defined hardware/target

Open Questions:

How to restrict the filter language in this step?
Is it also intended to use this functionality for functions like 'viewport-based' filtering?

Persistent Cache is available

Background:

The existing in-memory cache in mapget serves its purpose for short-lived sessions but poses limitations for long-term or recurring data retrieval needs. This transient nature of the cache can lead to repetitive data fetching from source, negatively impacting performance and user experience. Implementing a persistent cache would alleviate this issue by storing data between sessions, saving time and computational resources.

Goal:

Introduce a persistent caching mechanism in mapget to improve long-term data retrieval efficiency and to complement the existing in-memory cache.

Acceptance Criteria:

Persistence: The cache should be capable of storing data between different sessions and system reboots.
Compatibility: The persistent cache should function alongside the existing in-memory cache, allowing users the flexibility to choose between the two depending on their needs.
Cache Flushing: Users should have the ability to manually clear the persistent cache when needed.
Future-Proofing: Like the in-memory cache, the implementation should be designed to be compatible with other caching mechanisms that may be introduced later.
Cache configuration: Users can specify the max number of tiles, path to cache-file and the type of used cache (memory, persistent)
Cache statistics: The persistent cache tracks the number of hits and misses. It's sufficient to get this information via logs.

Basic datasource API

A basic mapget DataSource abstraction should be implemented, which allows implementing a feature data provider.

For now, only the /tile- and /metadata-endpoints shall be supported. On startup, a DataSource shall automatically select a port to bind to, if no port is explicitly provided. The port shall be written in a message to stdout, which will allow a management application to register a newly launched data source with a mapget server.

Parallel execution of RemoteDataSource does not work due to httplib Client lock

Currently, all workers for a RemoteDataSource share the same httplib Client. However, it is not possible to run parallel requests through one socket, so the parallelization is effectively disabled.

Enhanced Diagnostics for Data Sources

The system should provide improved diagnostic capabilities that allow for tracking and displaying various statistics about the data from each active map data source. The diagnostic data should include, but not be limited to, the number of map features obtained from each data source and the associated data sizes.

This feature requires extending the mapget component to collate and store these statistics as part of its data aggregation and caching processes. Furthermore, the map data sources should be enhanced to provide diagnostic data to the mapget component during the data retrieval and conversion processes.

The diagnostic data should be accessible through the mapget's query interface. This way, users can get insights into the data's origins, the volume of data sourced from each source, and the associated data size, thus improving transparency and troubleshooting capabilities.

Implement TileLayerStream

Both to transfer data from sources to the mapget cache, and to receive tile data from the mapget cache, the TileLayerStream protocol is required. It will be implemented in two classes:

The TileLayerStreamReader is constructed with a callback for onParsedLayer. It has pushBytes and numUnparsedBytes methods, which may be used to insert binary data to parse, and to check whether there is any unparsed data.
The TileLayerStreamWriter is constructed with a callback for onWrite, which may be called when something is ready to be sent over the wire, and onGetFieldCacheOffset, which may be called when the stream must decide how much of the field cache must be re-sent. It has a pushFeatureLayer method, which is used to insert a (Tile-)FeatureLayer into the stream.
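
The reader's contract (push arbitrary chunks of bytes, get a callback per complete layer, query the unparsed remainder) can be sketched with a toy framing. The 4-byte length prefix used here is an assumption for illustration and is not mapget's wire format; the class and function names are hypothetical Python counterparts of the methods named above, and the field cache handling of the writer is left out.

```python
import struct


class SketchStreamReader:
    """Illustrative reader; the length-prefix framing is NOT mapget's format."""

    HEADER = struct.Struct("<I")  # little-endian uint32 payload length

    def __init__(self, on_parsed_layer):
        self._on_parsed_layer = on_parsed_layer
        self._buffer = bytearray()

    def push_bytes(self, data):
        self._buffer += data
        # Emit every message that is completely buffered.
        while len(self._buffer) >= self.HEADER.size:
            (length,) = self.HEADER.unpack_from(self._buffer)
            end = self.HEADER.size + length
            if len(self._buffer) < end:
                break  # incomplete message: wait for more bytes
            self._on_parsed_layer(bytes(self._buffer[self.HEADER.size:end]))
            del self._buffer[:end]

    def num_unparsed_bytes(self):
        return len(self._buffer)


def write_message(payload):
    """Writer counterpart: frame one payload for the wire."""
    return SketchStreamReader.HEADER.pack(len(payload)) + payload


layers = []
reader = SketchStreamReader(layers.append)
wire = write_message(b"layer-a") + write_message(b"layer-b")
reader.push_bytes(wire[:5])  # header plus one payload byte
pending_after_first_chunk = reader.num_unparsed_bytes()
reader.push_bytes(wire[5:])
print(layers, reader.num_unparsed_bytes())
```

Buffering inside the reader lets callers feed it whatever chunk sizes the network delivers, without aligning chunks to message boundaries.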

Evaluate switch away from tiny-process-library

tiny-process-library (gitlab link) was selected for the current DataSourceProcess implementation. However, it is badly documented. We should find an alternative or a good reason to keep it.

mapget provides cached data even when no DataSources are connected

When there are no datasources connected, the mapget service should still answer requests from cache.

Problem: mapget currently does not cache DataSourceInfo.

mapget supports '/locate' endpoint

The objective of this issue is to implement the '/locate' endpoint. This endpoint is crucial for enabling erdblick to obtain a list of tile-layer combinations that specify the TileFeatureLayer which contains the feature satisfying given ID field constraints.

Endpoint Functionality:

The mapget HTTP service '/locate' endpoint should accept feature typeId and a map of string to scalar values (ID parts) as input parameters. It must return a list of [MapTileKey,FeatureId] pairs.
The mapget base service must implement a locate-function, creating requests to all connected datasources without cache checking for now, and return a vector<pair<MapTileKey,string>> if a match is found.
The DataSource HTTP server interface has been extended to cover the /locate endpoint analogously to mapget.
Example data sources demonstrate how to implement this endpoint.
The mapget service interface documentation covers the implementation.

~~The mapget base service evaluates the connected data sources and raises the errors:~~
~~- [ ] 'No data source provides specified feature type' if no data source can provide the feature type ID.~~
~~- [ ] 'Unknown ID (parts)' when provided ID parts are not applicable to any known feature type with the provided ID.~~
~~- [ ] 'Missing non-optional ID parts.'~~

(advanced error detection and reporting left for future refinement)

Hints/Resources:

Have a look at the C++ codebase, particularly focusing on the IdPart and FeatureTypeInfo structures.

Pull RocksDB via Conan

~~The most recent RocksDB version on conan.io is 6.29.5 (current on github is 8.6.7). Let's wait for conan.io to catch up (or open a PR) and pull RocksDB via conan.~~

Pull RocksDB from conan.io.
See https://conan.io/center/recipes/rocksdb?version=8.8.1

Requests to unknown map+layer combinations receive a 400 response

When a fetch request is received for a map+layer combination for which no connected datasource can provide the response, a 400 response should be given with a descriptive error message.

Use conan as C++ dependency management solution and CMake's FetchContent as fallback.

At the moment mapget uses a mixture of CMake's FetchContent and git submodules (./cmake/cmake_modules). Using conan would (potentially) allow reducing build times. For example, the introduction of rocksdb as a dependency increased build times by up to ~6x (for example 10min vs 58min). If some aspects cannot be covered by conan (not even with a custom recipe), using CMake's FetchContent should still be fine.

Cache Prefilling Support

Background:
While the current lazy-loading approach to cache filling is effective in many situations, it falls short when dealing with multiple or slow data sources. Users often know the specific spatial extent they wish to analyze and would benefit from a cache prefilling phase, even if it takes time to complete. This also allows developers to initially focus on optimizing the interaction between mapget and client applications like a map viewer, before turning their attention to optimizing the entire data pipeline, including data source interactions.

Goal:
Implement cache prefilling in mapget to enhance user experience by making data rapidly available post-prefill.

Acceptance Criteria:

Spatial Extent Support: mapget should support 'populate' requests that specify a spatial extent for prefilling.
Cancellation: Users should be able to cancel populate requests if needed.
Configurable In-Memory Cache: The existing in-memory cache should be easily configurable in terms of its capacity, allowing users to tailor it to the volume of data they intend to explore.
Progress Indication: A way to track the prefilling phase's progress is essential, ideally with an estimated time-to-completion indicator.
Future-Proofing: The implementation should be designed to be compatible with other caching mechanisms that may be introduced later.

Basic Model Library

To facilitate integration of mapget in various upstream and downstream projects, an initial subset of the mapget::model library must be implemented. This should include all classes in the following UML, except for the TileLayerStream class.

Send end-of-stream message

Currently, the client relies on the server to close the connection as a signal that a /tiles request was fully processed. However, the Connection: close header is prohibited in modern browsers. So cpp-httplib waits for the response to time out until it closes the request. By sending EndOfStream, the client can now close the connection once it is received.

Add-on Datasources

For some use-cases, it is necessary that a datasource can annotate features of another datasource with additional attributes, relations, or geometry. For this purpose, the introduction of an add-on: true field in the datasource info is planned. Any such connected datasource will always be called to produce add-on feature data for each requested tile. Mapget will first request the tile from a non-add-on source, then request data from each add-on source, and then merge the retrieved set.
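
The planned request-then-merge order can be sketched with plain dictionaries. The function `merge_tile` and the data layout (feature ID mapped to an attribute dictionary) are hypothetical, and the merge policy is an assumption: here a later add-on overrides earlier values for the same attribute, which the issue does not specify.

```python
def merge_tile(base_features, addon_feature_sets):
    """Merge add-on attributes into base features, keyed by feature ID.

    Assumed policy: later add-ons win on conflicting attribute names.
    """
    merged = {fid: dict(attrs) for fid, attrs in base_features.items()}
    for addon in addon_feature_sets:
        for fid, attrs in addon.items():
            # Features unknown to the base tile are added as-is.
            merged.setdefault(fid, {}).update(attrs)
    return merged


base = {"way.1": {"name": "Main Street"}, "way.2": {"name": "Side Street"}}
speed_addon = {"way.1": {"speedLimit": 50}}
surface_addon = {"way.1": {"surface": "asphalt"}, "way.3": {"surface": "gravel"}}

tile = merge_tile(base, [speed_addon, surface_addon])
print(tile["way.1"])
```

The base dictionaries are copied before merging, so the data returned by the non-add-on source stays unchanged.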

Python Bindings for the locate callback

Currently, Python bindings for the /locate-request callback and the corresponding LocateResponse/LocateRequest structs are missing for the DataSourceServer class.

Features and references with missing ID parts are detected and reported

The mapget model implementation currently creates features from a tile without validating if the feature ID fields are set. This will lead to problems with implementing e.g. jump-to-feature down the line, where a feature reference must contain all the information to be resolved to a feature.

To fix the issue, the mapget model should check that the ID fields are set, before adding a feature or feature reference to the feature layer.
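
A minimal sketch of such a check is shown below. The function names, the representation of a feature ID as a dictionary, and the part names are hypothetical; in mapget the required parts would come from the ID scheme declared in the layer's FeatureTypeInfo.

```python
def missing_id_parts(required_parts, feature_id):
    """Return the names of required ID parts that are absent or None."""
    return [name for name in required_parts if feature_id.get(name) is None]


def add_feature(layer, required_parts, feature_id):
    """Append the feature ID to the layer only if all required parts are set."""
    missing = missing_id_parts(required_parts, feature_id)
    if missing:
        raise ValueError(f"feature ID is missing parts: {missing}")
    layer.append(feature_id)


layer = []
required = ["areaId", "wayId"]  # hypothetical non-optional ID parts
add_feature(layer, required, {"areaId": "TheBestArea", "wayId": 42})
try:
    add_feature(layer, required, {"areaId": "TheBestArea"})
except ValueError as error:
    print(error)
```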

Feature Model API covers advanced feature relations.

The current implementation of our Feature Model API permits basic linking of map features by utilizing their IDs as attributes. This method is effective for simple navigational tasks, allowing users to transition between related features with ease. However, this approach is somewhat limited, as it only facilitates basic ID-based connections without considering the complex relationships that can exist between map features.

We need to enhance the Feature Model API to encompass a broader range of feature relations. The revised API should enable defining relations that go beyond mere ID linking, allowing for associations based on specific segments or characteristics of other features. For instance, it should support scenarios like "Feature A is related to Feature B (e.g. visible from), but this relation is valid only within a particular section of Feature B's geometry."

Some additional thoughts/ideas:

How would we model "Feature A, a public park, is related to Feature B, a residential area, because they share a characteristic of being in a 'low-pollution zone'"? Should it be covered by relations too? The key difference from the previous example is that this is about grouping.

Persistent cache should not remain locked when mapget is stopped

Sometimes the LOCK file stays when I stop the source (with SIGTERM). Either we avoid this kind of locking behavior, or we ensure that it gets properly reset on demand (at every startup, as there should be only one user of the cache). At the moment I cannot share a way to reproduce this behavior. As soon as I have the needed details, I will share them in the scope of this issue.

Datasource processes exit with the mapget process

When the mapget process is killed with SIGKILL, the datasource processes spun up by it continue running.

There should be a POST /heartbeat request from mapget to datasources. Datasource subprocesses read the MAPGET_DATASOURCE_PROCESS_TIMEOUT environment variable, which indicates the timeout window in seconds. If this variable exists, the process shuts down when a heartbeat has not been received from mapget in the specified time window.

We should add a --datasource-executable-timeout command line argument to mapget.

The datasources should reply to the /heartbeat request with their unique datasource nodeID. In the future, mapget should use the keep-alive reply to:

Check if a datasource process is still alive. If not, mapget reports it and removes worker threads for the datasource.
Validate that the datasource at the given endpoint has the same ID as the originally registered datasource.
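
The datasource side of this proposal can be sketched as a small watchdog. The class `HeartbeatWatchdog` and its methods are hypothetical; only the environment variable name and the behavior (no watchdog when the variable is absent, shutdown when no heartbeat arrives within the window) come from the text. The clock is injectable so the logic can be exercised without waiting.

```python
import os
import time


class HeartbeatWatchdog:
    """Tracks the time since the last heartbeat (hypothetical helper)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_beat = clock()

    @classmethod
    def from_environment(cls, environ=os.environ, clock=time.monotonic):
        """Return a watchdog, or None if the variable is not set."""
        value = environ.get("MAPGET_DATASOURCE_PROCESS_TIMEOUT")
        if value is None:
            return None
        return cls(float(value), clock)

    def beat(self):
        """Call this from the POST /heartbeat handler."""
        self._last_beat = self._clock()

    def expired(self):
        """True if the process should shut down."""
        return self._clock() - self._last_beat > self._timeout


now = [0.0]  # fake clock for demonstration
dog = HeartbeatWatchdog.from_environment(
    {"MAPGET_DATASOURCE_PROCESS_TIMEOUT": "30"}, clock=lambda: now[0]
)
now[0] = 20.0
dog.beat()
now[0] = 45.0
alive_at_45 = not dog.expired()
now[0] = 51.0
print(alive_at_45, dog.expired())
```

In a real datasource process, a background thread or timer would poll `expired()` and trigger the shutdown.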

mapget uses spdlog with log level as CLI argument

A similar approach is used in zswag, which can be taken as inspiration. Specifically, the header should be adapted. Then all throws and cout/cerr calls should be changed.

Advanced attribute/relation validities

This issue reflects the following additional requirements towards attribute/relation validities:

It shall be possible to specify more than one validity extent for each attribute.
It shall be possible to specify a validity stretch indirectly using (fractional) lengths or positions as start/end bounds for the stretch. This will allow add-on datasources to specify validity stretches without needing access to the feature's base geometry.
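
The indirect specification can be illustrated as follows: an add-on datasource supplies only fractional bounds, and whoever holds the base geometry converts them into absolute lengths. The function names and the 2D point tuples are hypothetical, and the sketch uses planar distances for simplicity, whereas real feature geometry is geographic and may be 3D.

```python
import math


def polyline_length(points):
    """Sum of the straight-line segment lengths of a polyline."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def resolve_fractional_stretch(points, start_fraction, end_fraction):
    """Convert fractional bounds (0..1) into absolute lengths along a polyline."""
    if not 0.0 <= start_fraction <= end_fraction <= 1.0:
        raise ValueError("fractions must satisfy 0 <= start <= end <= 1")
    total = polyline_length(points)
    return start_fraction * total, end_fraction * total


geometry = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]  # segment lengths 5 and 6

# More than one validity stretch for the same attribute.
validities = [(0.0, 0.25), (0.5, 1.0)]
resolved = [resolve_fractional_stretch(geometry, s, e) for s, e in validities]
print(resolved)
```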

Switch from stx::format to fmt::format

Python Data Source Library

Use pybind11 to create the mapget Python package, which should initially include only bindings for the mapget-model and mapget-datasource libraries.