Goal
The goal is to update a raw OSM PostGIS database from the change data as it flows through Underpass.
User stories (kind of)
The raw OSM database will be used to support displaying maps on websites, extracting data (our export-tool), conflation of duplicates, etc.
Currently most people just download the daily snapshot from Geofabrik, but we want to do this in near real time, processing the minutely change files.
DB Schema Selection
The key decision is which OSM database schema to update, as there are three: the `ogr2ogr` one, the `osm2pgsql` one, and the `pgsnapshot` one.
Schema Comparison
The table below compares some features of the three tools/DB schemas.
| Name | Tool Language | OSC Support | Geometry in the main table |
| --- | --- | --- | --- |
| osm2pgsql | C++ | yes | yes |
| ogr2ogr | C++ | no | yes |
| pgsnapshot | -- | yes | no |
`osm2pgsql` seems the most suitable because:
- the DB schema and the associated C++ tool support updates from `.osc` files out of the box [1]
- there is a flexible option to fine-tune the import process through Lua config scripts
- the tool is actively maintained, well written and well documented
- the tool is written in C++ on top of the de-facto standard libosmium library
- workflows very similar to our goal are already used successfully in the industry [2]
Proposed workflow
- initial data import through the `osm2pgsql` binary tool, using `.pbf` data obtained for instance from Geofabrik
- an Underpass monitoring thread looks for change files in `.osc` format and downloads them
- Underpass pipes the downloaded change file content into a separate thread, where a spawned `osm2pgsql` process takes care of the DB update (see the sketch after this list)
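A minimal sketch of the two invocations involved is shown below. The database name and file names are placeholders; the `--create`, `--append`, `--slim`, `--hstore`, `--extra-attributes` and `-d` options are the ones documented in the osm2pgsql manual, with the last two assumed here so that the `osm_timestamp` pseudo-tag used later in this document gets stored.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

int main() {
    const std::string db = "gis";  // placeholder database name

    // 1. One-off initial import of a .pbf extract (e.g. from Geofabrik).
    //    --slim keeps the middle tables required for later --append runs;
    //    --extra-attributes + --hstore store osm_timestamp & co. as tags.
    const std::string import_cmd =
        "osm2pgsql --create --slim --hstore --extra-attributes -d " + db +
        " planet-latest.osm.pbf";

    // 2. Incremental update from a downloaded minutely change file.
    const std::string update_cmd =
        "osm2pgsql --append --slim --hstore --extra-attributes -d " + db +
        " 049.osc.gz";

    if (std::system(import_cmd.c_str()) != 0) {
        std::cerr << "initial import failed\n";
        return 1;
    }
    if (std::system(update_cmd.c_str()) != 0) {
        std::cerr << "update failed\n";
        return 1;
    }
    return 0;
}
```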
Future enhancements (out of scope for this task)
In step 3 above it may be possible to speed up operations by bypassing the `osm2pgsql` process and using a direct implementation of the DB CRUD methods, updating the DB and reconstructing geometries when necessary. This is a significant effort which is not recommended in the current development phase (it would be a form of premature optimization).
Another advantage of a direct implementation of the updates would be a reduced external dependency.
Open Issues
The main problem is how to identify the HTTPS address of the first OSC file that needs to be applied in order to keep the DB in sync. The current implementation in Underpass is not working (or is just a stub), but there is a configuration option to pass the address directly to the application.
Implementation
Update Strategy
- find the current DB timestamp
- find the correct OSC file
- download the OSC file and update the DB
1. Find the current DB timestamp
The current timestamp of the DB is required in order to identify the first change file to process. A running Underpass application can keep this information in its internal memory, obtaining it from the last update, but the initial value must be calculated from the DB itself. The following query can be used to get the initial value:
```sql
select max(foo.ts) from (
    select max(tags -> 'osm_timestamp') as ts from planet_osm_point
    union
    select max(tags -> 'osm_timestamp') as ts from planet_osm_line
    union
    select max(tags -> 'osm_timestamp') as ts from planet_osm_polygon
    union
    select max(tags -> 'osm_timestamp') as ts from planet_osm_roads
) as foo;
```
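As a minimal sketch of how Underpass could obtain this initial value at startup, the snippet below runs the query through libpqxx. The connection string is a placeholder, and the `osm_timestamp` tag is assumed to exist (i.e. the import was done with `--extra-attributes --hstore`); this is not existing Underpass code.

```cpp
#include <iostream>
#include <string>
#include <pqxx/pqxx>

int main() {
    try {
        pqxx::connection conn{"dbname=gis"};  // placeholder connection string
        pqxx::work txn{conn};

        // Newest osm_timestamp stored across the rendering tables.
        pqxx::result res = txn.exec(
            "select max(foo.ts) from ("
            "  select max(tags -> 'osm_timestamp') as ts from planet_osm_point"
            "  union"
            "  select max(tags -> 'osm_timestamp') as ts from planet_osm_line"
            "  union"
            "  select max(tags -> 'osm_timestamp') as ts from planet_osm_polygon"
            "  union"
            "  select max(tags -> 'osm_timestamp') as ts from planet_osm_roads"
            ") as foo");
        txn.commit();

        if (!res.empty() && !res[0][0].is_null()) {
            std::string db_timestamp = res[0][0].as<std::string>();
            std::cout << "current DB timestamp: " << db_timestamp << "\n";
        } else {
            std::cerr << "no osm_timestamp found (was the import done with"
                         " --extra-attributes --hstore?)\n";
        }
    } catch (const std::exception &e) {
        std::cerr << e.what() << "\n";
        return 1;
    }
    return 0;
}
```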
2. Find the correct OSC file
The OSC file that needs to be processed first is the closest (in time) OSC file that has a timestamp greater than the current DB timestamp.
The timestamp of an OSC file (for instance `049.osc.gz` [3]) is contained in the corresponding `.state.txt` file (for instance `049.state.txt` [4]), but there is no direct way to obtain the path given a timestamp.
Some research needs to be done to determine the best strategy for finding the address of the first OSC file that needs to be applied, also looking at how other applications have solved this problem. For the time being, a configuration option pointing directly to the first OSC file is already implemented and works just fine.
One possible strategy consists in recursively downloading the index pages of the change tree (i.e. https://planet.maps.mail.ru/replication/minute/) and analyzing the `state.txt` files until the change file closest to (and newer than) the DB timestamp is found; see the parsing sketch below.
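A minimal sketch of parsing a downloaded `state.txt` body follows. The `sequenceNumber` and `timestamp` keys (with colons escaped as `\:`) follow the replication state file format; the function name and the example body are illustrative, not actual Underpass code.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Extract a "key=value" entry from the body of a replication state.txt
// file. Lines starting with '#' are comments; colons in the timestamp
// are escaped as "\:" and must be unescaped.
std::string state_value(const std::string &body, const std::string &key) {
    std::istringstream in(body);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        auto eq = line.find('=');
        if (eq == std::string::npos || line.substr(0, eq) != key) continue;
        std::string value = line.substr(eq + 1);
        std::string::size_type pos;
        while ((pos = value.find("\\:")) != std::string::npos)
            value.erase(pos, 1);  // drop the escaping backslash
        return value;
    }
    return {};
}

int main() {
    // Hypothetical body in the replication state file format; the
    // sequence number matches the 004/679/049 path of [3] and [4].
    const std::string body =
        "#comment line written by the replication server\n"
        "sequenceNumber=4679049\n"
        "timestamp=2021-10-08T06\\:15\\:02Z\n";

    std::cout << "sequence:  " << state_value(body, "sequenceNumber") << "\n"
              << "timestamp: " << state_value(body, "timestamp") << "\n";
}
```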
The subsequent addresses can be calculated from the sequence number: each subfolder contains OSC files numbered from 000 to 999 and, once 999 is reached, the parent folder is incremented by one (see the sketch below).
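A sketch of that calculation, assuming the standard layout of a nine-digit, zero-padded sequence number split into three directory levels (the function name is hypothetical):

```cpp
#include <cstdio>
#include <iostream>
#include <string>

// Build the relative path of a change file from its replication sequence
// number, e.g. 4679049 -> "004/679/049.osc.gz". The sequence number is
// zero-padded to nine digits and split into three groups of three.
std::string osc_path(long sequence) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%03ld/%03ld/%03ld",
                  sequence / 1000000, (sequence / 1000) % 1000,
                  sequence % 1000);
    return std::string(buf) + ".osc.gz";
}

int main() {
    const std::string base = "https://planet.maps.mail.ru/replication/minute/";
    long sequence = 4679049;  // taken from the corresponding state.txt [4]
    std::cout << base << osc_path(sequence) << "\n";      // current file
    std::cout << base << osc_path(sequence + 1) << "\n";  // next file
}
```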
3. Download the OSC file and update the DB
The OSC download part, given an address, is already implemented in Underpass.
The update part will run in a separate thread and will launch a separate `osm2pgsql` process. Depending on the performance of the update operation, we may choose to join the thread or let it run in the background; in the latter case some sort of locking mechanism will be required to prevent out-of-order concurrent updates to the DB.
The simplest joining option will be attempted first; other, more complex schemes will be implemented in case they are required.
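A minimal sketch of the joining option, assuming the change file has already been downloaded to disk (the database name, file path and helper name are placeholders; the osm2pgsql flags are the documented ones):

```cpp
#include <cstdlib>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

// Serialize updates so that change files are never applied out of order,
// even if a future version lets the update thread run in the background.
std::mutex update_mutex;

// Apply one downloaded change file by spawning osm2pgsql in append mode.
bool apply_change_file(const std::string &osc_path) {
    std::lock_guard<std::mutex> lock(update_mutex);
    const std::string cmd =
        "osm2pgsql --append --slim --hstore --extra-attributes -d gis " +
        osc_path;
    return std::system(cmd.c_str()) == 0;
}

int main() {
    // Joining option: launch the update in a separate thread, then wait
    // for it to finish before fetching the next change file.
    std::thread updater([] {
        if (!apply_change_file("049.osc.gz"))
            std::cerr << "update failed\n";
    });
    updater.join();
    return 0;
}
```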
[1] https://osm2pgsql.org/doc/manual.html#import-and-update
[2] https://ircama.github.io/osm-carto-tutorials/updating-data/
[3] https://planet.maps.mail.ru/replication/minute/004/679/049.osc.gz
[4] https://planet.maps.mail.ru/replication/minute/004/679/049.state.txt