Goal
The goal is to update a raw OSM PostGIS database from the change data as it flows through Underpass.
User stories (kind of)
The raw OSM database will be used to support displaying maps on websites, extracting data (our export-tool), conflation of duplicates, etc.
Currently most people just download the daily snapshot from Geofabrik, but we want to do this in near real time, processing the minutely change files.
DB Schema Selection
The key decision is which OSM database schema to update, as there are three: the `ogr2ogr` one, the `osm2pgsql` one, and the `pgsnapshot` one.
Schema Comparison
The table below compares some features of the three tools/DB schemas.
| Name | Tool Language | OSC Support | Geometry in the main table |
| --- | --- | --- | --- |
| osm2pgsql | C++ | yes | yes |
| ogr2ogr | C++ | no | yes |
| pgsnapshot | -- | yes | no |
`osm2pgsql` seems the most suitable because:
- the DB schema and the associated C++ tool support updates from `.osc` files out of the box [1]
- there is a flexible option to fine-tune the import process through Lua config scripts
- the tool is actively maintained, well written and well documented
- the tool is written in C++ on top of the de-facto standard libosmium library
- workflows very similar to our goal are already used successfully in the industry [2]
Proposed workflow
- initial data import through the `osm2pgsql` binary tool, using `.pbf` data obtained for instance from Geofabrik
- an Underpass monitoring thread looks for change files in `.osc` format and downloads them
- Underpass pipes the downloaded change file content into a separate thread, where a spawned `osm2pgsql` process takes care of the DB update (see the sketch after this list)
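A minimal sketch of the two invocations involved is shown below. The database name and file names are placeholders; the `--create`, `--append`, `--slim`, `--hstore`, `--extra-attributes` and `-d` options are the ones documented in the osm2pgsql manual, with the last two assumed here so that the `osm_timestamp` pseudo-tag used later in this document gets stored.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

int main() {
    const std::string db = "gis";  // placeholder database name

    // 1. One-off initial import of a .pbf extract (e.g. from Geofabrik).
    //    --slim keeps the middle tables required for later --append runs;
    //    --extra-attributes + --hstore store osm_timestamp & co. as tags.
    const std::string import_cmd =
        "osm2pgsql --create --slim --hstore --extra-attributes -d " + db +
        " planet-latest.osm.pbf";

    // 2. Incremental update from a downloaded minutely change file.
    const std::string update_cmd =
        "osm2pgsql --append --slim --hstore --extra-attributes -d " + db +
        " 049.osc.gz";

    if (std::system(import_cmd.c_str()) != 0) {
        std::cerr << "initial import failed\n";
        return 1;
    }
    if (std::system(update_cmd.c_str()) != 0) {
        std::cerr << "update failed\n";
        return 1;
    }
    return 0;
}
```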
Future enhancements (out of scope for this task)
In step 3 above it may be possible to speed up operations by bypassing the `osm2pgsql` process and using a direct implementation of the DB CRUD methods, updating the DB and reconstructing geometries when necessary. This is a significant effort which is not recommended in the current development phase (it would be a form of premature optimization).
Another advantage of a direct implementation of the updates would be a reduced external dependency.
Open Issues
The main problem is how to identify the HTTPS address of the first OSC file that needs to be applied in order to keep the DB in sync. The current implementation in Underpass is not working (or is just a stub), but there is a configuration option to pass the address directly to the application.
Implementation
Update Strategy
- find the current DB timestamp
- find the correct OSC file
- download the OSC file and update the DB
1. Find the current DB timestamp
The current timestamp of the DB is required in order to identify the first change file to process. A running Underpass application can keep this information in its internal memory, obtaining it from the last update, but the initial value must be calculated from the DB itself. The following query can be used to get the initial value:
```sql
select max(foo.ts) from (
    select max(tags -> 'osm_timestamp') as ts from planet_osm_point
    union
    select max(tags -> 'osm_timestamp') as ts from planet_osm_line
    union
    select max(tags -> 'osm_timestamp') as ts from planet_osm_polygon
    union
    select max(tags -> 'osm_timestamp') as ts from planet_osm_roads
) as foo;
```
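As a minimal sketch of how Underpass could obtain this initial value at startup, the snippet below runs the query through libpqxx. The connection string is a placeholder, and the `osm_timestamp` tag is assumed to exist (i.e. the import was done with `--extra-attributes --hstore`); this is not existing Underpass code.

```cpp
#include <iostream>
#include <string>
#include <pqxx/pqxx>

int main() {
    try {
        pqxx::connection conn{"dbname=gis"};  // placeholder connection string
        pqxx::work txn{conn};

        // Newest osm_timestamp stored across the rendering tables.
        pqxx::result res = txn.exec(
            "select max(foo.ts) from ("
            "  select max(tags -> 'osm_timestamp') as ts from planet_osm_point"
            "  union"
            "  select max(tags -> 'osm_timestamp') as ts from planet_osm_line"
            "  union"
            "  select max(tags -> 'osm_timestamp') as ts from planet_osm_polygon"
            "  union"
            "  select max(tags -> 'osm_timestamp') as ts from planet_osm_roads"
            ") as foo");
        txn.commit();

        if (!res.empty() && !res[0][0].is_null()) {
            std::string db_timestamp = res[0][0].as<std::string>();
            std::cout << "current DB timestamp: " << db_timestamp << "\n";
        } else {
            std::cerr << "no osm_timestamp found (was the import done with"
                         " --extra-attributes --hstore?)\n";
        }
    } catch (const std::exception &e) {
        std::cerr << e.what() << "\n";
        return 1;
    }
    return 0;
}
```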
2. Find the correct OSC file
The OSC file that needs to be processed first is the closest (in time) OSC file that has a timestamp greater than the current DB timestamp.
The timestamp of an OSC file (for instance `049.osc.gz` [3]) is contained in the corresponding `.state.txt` file (for instance `049.state.txt` [4]), but there is no direct way to obtain the path given a timestamp.
Some research needs to be done to determine the best strategy for finding the address of the first OSC file that needs to be applied, also looking at how other applications have solved this problem. For the time being, a configuration option pointing directly to the first OSC file is already implemented and works just fine.
One possible strategy consists in recursively downloading the index pages of the change tree (i.e. https://planet.maps.mail.ru/replication/minute/) and analyzing the `state.txt` files until the change file closest to (and newer than) the DB timestamp is found; see the parsing sketch below.
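A minimal sketch of parsing a downloaded `state.txt` body follows. The `sequenceNumber` and `timestamp` keys (with colons escaped as `\:`) follow the replication state file format; the function name and the example body are illustrative, not actual Underpass code.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Extract a "key=value" entry from the body of a replication state.txt
// file. Lines starting with '#' are comments; colons in the timestamp
// are escaped as "\:" and must be unescaped.
std::string state_value(const std::string &body, const std::string &key) {
    std::istringstream in(body);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        auto eq = line.find('=');
        if (eq == std::string::npos || line.substr(0, eq) != key) continue;
        std::string value = line.substr(eq + 1);
        std::string::size_type pos;
        while ((pos = value.find("\\:")) != std::string::npos)
            value.erase(pos, 1);  // drop the escaping backslash
        return value;
    }
    return {};
}

int main() {
    // Hypothetical body in the replication state file format; the
    // sequence number matches the 004/679/049 path of [3] and [4].
    const std::string body =
        "#comment line written by the replication server\n"
        "sequenceNumber=4679049\n"
        "timestamp=2021-10-08T06\\:15\\:02Z\n";

    std::cout << "sequence:  " << state_value(body, "sequenceNumber") << "\n"
              << "timestamp: " << state_value(body, "timestamp") << "\n";
}
```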
The subsequent addresses can be calculated from the sequence number: each subfolder contains OSC files numbered from 000 to 999 and, once 999 is reached, the parent folder is incremented by one (see the sketch below).
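A sketch of that calculation, assuming the standard layout of a nine-digit, zero-padded sequence number split into three directory levels (the function name is hypothetical):

```cpp
#include <cstdio>
#include <iostream>
#include <string>

// Build the relative path of a change file from its replication sequence
// number, e.g. 4679049 -> "004/679/049.osc.gz". The sequence number is
// zero-padded to nine digits and split into three groups of three.
std::string osc_path(long sequence) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%03ld/%03ld/%03ld",
                  sequence / 1000000, (sequence / 1000) % 1000,
                  sequence % 1000);
    return std::string(buf) + ".osc.gz";
}

int main() {
    const std::string base = "https://planet.maps.mail.ru/replication/minute/";
    long sequence = 4679049;  // taken from the corresponding state.txt [4]
    std::cout << base << osc_path(sequence) << "\n";      // current file
    std::cout << base << osc_path(sequence + 1) << "\n";  // next file
}
```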
3. Download the OSC file and update the DB
The OSC download part, given an address, is already implemented in Underpass.
The update part will run in a separate thread and will launch a separate `osm2pgsql` process. Depending on the performance of the update operation, we may choose to join the thread or let it run in the background; in the latter case some sort of locking mechanism will be required to prevent out-of-order concurrent updates to the DB.
The simplest joining option will be attempted first; other, more complex schemes will be implemented in case they are required.
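A minimal sketch of the joining option, assuming the change file has already been downloaded to disk (the database name, file path and helper name are placeholders; the osm2pgsql flags are the documented ones):

```cpp
#include <cstdlib>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

// Serialize updates so that change files are never applied out of order,
// even if a future version lets the update thread run in the background.
std::mutex update_mutex;

// Apply one downloaded change file by spawning osm2pgsql in append mode.
bool apply_change_file(const std::string &osc_path) {
    std::lock_guard<std::mutex> lock(update_mutex);
    const std::string cmd =
        "osm2pgsql --append --slim --hstore --extra-attributes -d gis " +
        osc_path;
    return std::system(cmd.c_str()) == 0;
}

int main() {
    // Joining option: launch the update in a separate thread, then wait
    // for it to finish before fetching the next change file.
    std::thread updater([] {
        if (!apply_change_file("049.osc.gz"))
            std::cerr << "update failed\n";
    });
    updater.join();
    return 0;
}
```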
[1] https://osm2pgsql.org/doc/manual.html#import-and-update
[2] https://ircama.github.io/osm-carto-tutorials/updating-data/
[3] https://planet.maps.mail.ru/replication/minute/004/679/049.osc.gz
[4] https://planet.maps.mail.ru/replication/minute/004/679/049.state.txt