
underpass's Introduction

Tech @ HOTOSM

This repo exists to provide overall coordination of HOT technical interests and activities through the Technical Working Group.

Have a critical need or want to report a problem? Open an issue and someone will help follow up.

How the repo is organized:

Project ideas - The project ideas folder can contain briefs or more in-depth write-ups about project ideas. Have an idea for a project? Open an issue and tag it with the label, Project Idea.

Quarterly Goals - What is our focus for the next few months? How can other organizations and individuals help support or contribute to those goals? Want to contribute to the generation of these goals? Find an issue with the label, Quarterly Goals.

Principles - How do we think about open source software and development? What guidelines or principles do we consider when evaluating software or thinking about a project? How do we make decisions or organize staff, volunteers, and contributors to projects? These are the principles and guidelines that we adhere to and that drive our work.

Resources - This is a catch-all for simple help documents or other materials for getting things done. Info that doesn't fit into LearnOSM or another project repo can be found here. Have an idea for a new resource? Open an issue and label it with, Resource Need.

Wiki - The wiki is a place for other notes, documents, or links. For example, you can find the meeting notes and links for the Technical Working Group bi-weekly meeting in the Meetings wiki page.

Issue tracker

Please use the issue tracker to start discussions, report problems, or leave notices about any general technical or system administration needs related to HOT's technical infrastructure.

Bug reports or feature requests for specific HOT applications should be left on the software's specific GitHub repository.

We try to make heavy use of labels in the issue tracker to help organize.

View active Tech WG discussions now

underpass's People

Contributors

dakotabenjamin, elpaso, emi420, eternaltyro, helnershingthapa, jorgemartinezg, kshitijrajsharma, petya-kangalova, robsavoye, rsavoye, spwoodcock


underpass's Issues

Validate the raw OSM database

Underpass updates a raw OSM database so it can be used by extracts and conflation. While this has had some validation, when it comes to processing raw OSM data, much data has to be processed to find the more obscure problems that often break parsing.

Flag: Underpass is not scanning way features, only nodes, for building stats

Note: this test relates to building stats only.
This is one of the causes of the issue.

Let’s take this changeset as an example:
https://www.openstreetmap.org/changeset/110904136
It has lots of building tags, but they are not detected by Underpass. Insights reports 165 modified buildings for it, while Underpass has null.

Debugging: I filtered out the top 55 changesets containing way features, sorted by timestamp descending. Only 9 of them intersected the current Underpass boundary, and only 6 of those are found in Underpass; all 6 have null building stats, even though Insights detects buildings in them.
Let’s test this on the OSM pages; here are samples you can check:
https://www.openstreetmap.org/changeset/95764997
https://www.openstreetmap.org/changeset/95770243
https://www.openstreetmap.org/changeset/95772236
https://www.openstreetmap.org/changeset/95787689
https://www.openstreetmap.org/changeset/95862817

Nodes and lines tagged as buildings or amenities, however, are counted. Looking at the top building statistics, all of them are detected from node features; even going through the top 500 features I only see node features tagged as building.

Look for orphan nodes

The data validation in Underpass should be extended to look for orphan nodes in the osmchange file. Orphan nodes are nodes with no tags and not part of a way.

Changeset parsing bug

When parsing changeset data, all textual data is localized and some characters are escaped, like embedded quotes. This bug covers the case where an FS (file separator) control character appears in the comments field, which currently isn't handled properly and causes Underpass to core dump.

Refactor osmstats to galaxy

Rename everything from osmstats to galaxy to match the same change being made in the API. This will clear up some confusion, since many of us associate "osmstats" with the leaderboard.

Add integration tests for replicator

Currently we have no tests for the replicator application; all existing tests are library API tests.

In order to avoid merging changes that break the main replicator application, we should add a set of smoke tests that check that basic replicator functionality is not broken, for example:

  1. it does not crash when a DB connection is wrong or goes down
  2. it can perform the basic operations (monitoring, user sync, OSM replication, etc.)

Enhancement: validation for null changesets

Currently, we are ignoring changesets like this:
https://www.openstreetmap.org/changeset/96764200

These changesets have no geometry and no changes; they are null. Underpass currently drops them entirely, but instead of ignoring them we could keep them in the database and raise an issue in the validation table. There are plenty of changesets like this, as can be seen in Insights, since Insights stores them.

Flag: hashtags are being skipped

Let’s take this example:

select *
from changesets
where id = 112222185

In Underpass the hashtags column is null for this changeset, but looking at it on OSM:
https://www.openstreetmap.org/changeset/112222185
the hashtag is present in the comment. Because it isn't followed by a space or semicolon, we miss it. That means even when there are multiple hashtags, we can somehow miss one of them.
Another example can be seen here:
https://www.openstreetmap.org/changeset/105956885#map=10/27.2565/85.0356
Here we miss the first hashtag,
#hotosm-project-10963;

We are skipping the first hashtag everywhere: if a comment has a single hashtag we don't count it, and if it has more than one we don't count the first.
Let's take multiple examples: changeset IDs 88691316, 88691189, and 88691138 all have their first hashtag excluded in Underpass.

Re-enable state.txt file processing

State.txt files are used when launching Underpass to find the starting path to the replication files on the remote planet server. This was disabled during some refactoring and should be re-enabled, with any changes needed.

Can't write to osm2pgsql database

After merging the pull request, I tried running it with these options:
./replicator --osm2pgsqlserver localhost/osm2pgsql -v -d -l -f hour -u 000/068/372 -m

What I get is this error:
[893212:2] 14056 ERROR: Couldn't upsert node/points record: ERROR: relation "osm2pgsql_pgsql.planet_osm_nodes" does not exist
LINE 2: INSERT INTO osm2pgsql_pgsql.planet_osm_nodes

You can see that planet_osm_nodes is not part of the default schema; I assume it's only created when doing an update. This was while updating a database imported using osm2pgsql.

osm2pgsql=# \dt
List of relations
Schema | Name | Type | Owner
--------+--------------------+-------+-------
public | planet_osm_line | table | rob
public | planet_osm_point | table | rob
public | planet_osm_polygon | table | rob
public | planet_osm_roads | table | rob
public | spatial_ref_sys | table | rob
(5 rows)
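If I'm reading the osm2pgsql docs right, the middle tables (planet_osm_nodes, planet_osm_ways, planet_osm_rels) are only created when the import runs in slim mode, and they are required for applying updates later. If the database was imported without --slim, a re-import along these lines (database name and PBF path are placeholders) should create the missing relation:

```shell
# Re-import in slim mode so the middle tables exist for later updates.
# Database name and PBF path are examples only.
osm2pgsql --slim --database osm2pgsql priority_countries.osm.pbf
```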

Implement relation parsing from osmchange files

Currently Underpass doesn't fully parse the relations in an osmchange file, as it wasn't needed for statistics or data validation. This issue finishes implementing relations in Underpass, as that will be needed when updating a raw OSM database.

Cache state.txt information for diffs from planet

Goal

The goal is to identify the right OSC file to download given a timestamp and a planet url.

Use case

In order to keep the local osm2pgsql DB in sync and to compute statistics, Underpass needs to be able to identify the correct diff(s) for a given timestamp or time range.

The planet structure is described at https://wiki.openstreetmap.org/wiki/Planet.osm/diffs and allows retrieving the .osc.gz diff file for a known sequence number. Sequence numbers have an associated timestamp, which is only exposed in the state.txt file corresponding to the given sequence.

The Underpass API needs a method to retrieve this information from a planet server and cache it for future use.

Implementation plan

Underpass already has methods and classes to handle state information once it has been stored in the cache, but it currently lacks a way to populate the cache itself, which is the main objective of this enhancement.

The proposed implementation will build on top of the current underpass::Planet class by implementing a series of methods to retrieve state information for a given timestamp on demand; the information will first be looked up in the cache and, on a cache miss, transparently retrieved from the planet server and added to the cache.

Client code example

Planet planet("https://planet.maps.mail.ru");
const auto osc {planet.findData(replication::minutely, time_from_string("2016-08-04 09:03"))};

Changeset IDs in the changesets table don't make sense

Running Underpass to calculate stats starting July 2020 with hour sequence # 000/068/369.

When checking the loaded changeset statistics, I found some changeset IDs that don't make sense, as they don't exist in OSM yet, such as # 836844032, which has "highway"=>"2", "highway_km"=>"11597458" in its added hstore values.
Please advise.

Improve performance when updating an OSM database

Currently, when Underpass tries to update a raw OSM database, it does a select of all the ways, which for the priority countries can take a few hours. This obviously needs to be much faster. The current workaround is limiting queries to a specific area, thereby reducing the size of the data to be updated.

Refactor threading to improve performance

There are some issues with the current thread code in Underpass: database connections are opened frequently in many spawned threads, which hurts performance. This task is to refactor away from the external thread_pool header file and use boost::asio's thread pool instead. There should also be only one monitor thread per data source; those shouldn't be in the thread pool. The threadMonitor* functions should use the thread pool instead. All threads should share the database connection, instead of connecting every time.

Flag: Overlapping POI and polygon for building count

In changeset 96783104 a feature is a node, not a polygon, but it is tagged amenity=school and hence counted as a building. That alone is not a bug; the problem is that the building just beneath the POI is also counted, so Underpass currently counts the same building twice (once from the polygon and once from the POI when the two features intersect), producing false statistics. We should remove the duplication so that the building count is 1 even when the features overlap.
An example can be seen here:
https://www.openstreetmap.org/changeset/96783104#map=16/12.6326/-8.0501

Flag: changeset geometry is stored as null even when the changeset has one

Let’s take this changeset as an example:
https://www.openstreetmap.org/changeset/110814309
The changeset clearly has a geometry on OSM, but in Underpass it is null, while Insights has the geometry.
How will this affect our stats? It doesn't change the calculations Underpass is doing, but it raises the question of how we decide whether the changeset is inside our priority area. Also, once the API serves statistics for custom polygons, these changesets will fall inside no polygon and will be missing from the stats.
More examples:
https://www.openstreetmap.org/changeset/110814313
https://www.openstreetmap.org/changeset/88698947
https://www.openstreetmap.org/changeset/88698950
https://www.openstreetmap.org/changeset/88698964
https://www.openstreetmap.org/changeset/88698972
And many more: the Underpass changesets table shows null geometry where the Insights table for the same changeset IDs has one.

Underpass does not always start with the right file

Replicator sometimes starts with the right file to download from the replication server, but not always. If the change files are cached on disk it seems to work better, but still not always. The -u option also appears not to always work. Replicator must start reliably from a timestamp, regardless of whether any files have been cached.

Ensure raw data table is up to date with latest OSM snapshot

For now we have been relying on Insights to extract raw data for a given polygon (hotosm/galaxy-api#116). Output from Insights will be historical snapshots for a given time period; to get current snapshot data we have to rely on Underpass. For that to work, the OSM data needs to be kept up to date by the continuously running process.

@itskshitiz321 will be fixing this over the next few weeks. Please create any additional issues that might stem from this.

cc @itskshitiz321

Fix CI when using git worktree

When working on multiple branches using "git worktree", the CI fails to clone the git repository due to the path being different. This isn't a problem when using the older "git-new-workdir", but using "git worktree" is preferred.

BUG: different stats for added and modified fields in the changesets table

After the recent Underpass enhancements and fixes, and since it has been running continuously for weeks, I have been validating some added/modified stats at the changeset level.
For changeset ID 107235440, which Underpass scanned on 3 Nov, Underpass calculated 7 added highways and no modified highways, while the changeset on OSM has 8 added and 8 modified highways! The parsing of the osmchange file might be the issue.

Another invalid changeset is ID 104497858, which shows 3 added schools and 3 added buildings, while the changeset actually has no newly added schools, 3 modified ones, and no building tag at all.

@robsavoye

Replace replicatorconfig.hh file with external file

Currently Underpass uses a header file for database configuration. When installed from a Debian package, a config file is created at /etc/default/underpass with the same data. Rather than using environment variables, reading an external file allows changing the default connection parameters.

OSM raw data replication into PostGIS DB

Goal

The goal is to update a raw OSM postgis database from the change data as it flows through Underpass.

User stories (kind of)

The raw OSM database will be used to support displaying maps on websites, extracting data (our Export Tool), conflation of duplicates, etc.
Currently most people just download the daily snapshot from Geofabrik, but we want to do this in near real-time, processing the minutely change files.

DB Schema Selection

The key part is which OSM database schema to update, as there are three: the ogr2ogr one, the osm2pgsql one, and the pgsnapshot one.

Schema Comparison

The table below compares some features of the three tools/DB schemas.

Name         Tool language   OSC support   Geometry in the main table
osm2pgsql    C++             yes           yes
ogr2ogr      C++             no            yes
pgsnapshot   --              yes           no

osm2pgsql seems the most suitable because:

  • the DB schema and the associated C++ tool support updates from .osc files out of the box [1]
  • there is a flexible option to fine-tune the import process through lua config scripts
  • the tool is actively maintained, well written and well documented
  • the tool is written in C++ on top of the de-facto standard libosmium library
  • workflows very similar to our goal are already used successfully in industry [2]

Proposed workflow

  1. initial data import through the osm2pgsql binary tool, with the PBF data obtained from Geofabrik for instance
  2. an Underpass monitoring thread looks for change files in OSC format and downloads them
  3. Underpass pipes the downloaded change file content into a separate thread where a spawned osm2pgsql process takes care of the DB update

Future enhancements (out of scope for this task)

In step 3 above it may be possible to speed up operations by bypassing the osm2pgsql process and using a direct implementation of the DB CRUD methods, updating the DB and reconstructing geometries when necessary. This is a significant effort and is not recommended in the current development phase, because it would be a form of premature optimization.

Another advantage of a direct implementation of the updates would be removing an external dependency.

Open Issues

The main problem is how to identify the HTTPS address of the first OSC file that needs to be applied in order to keep the DB in sync. The current implementation in Underpass is not working, or is just a stub, but there is a configuration option to pass the address directly to the application.

Implementation

Update Strategy

  1. find the current DB timestamp
  2. find the correct OSC file
  3. download the OSC file and update the DB

1. Find the current DB timestamp

The current timestamp of the DB is required in order to identify the first change file to process. A running Underpass application can keep this information in its internal memory, obtaining it from the last update, but the initial value must be calculated from the DB itself. The following query can be used to get the initial value:

select max(foo.ts) from (
select max(tags -> 'osm_timestamp') as ts from planet_osm_point
union
select max(tags -> 'osm_timestamp') as ts from planet_osm_line
union
select max(tags -> 'osm_timestamp') as ts from planet_osm_polygon
union
select max(tags -> 'osm_timestamp') as ts from planet_osm_roads) as foo

2. Find the correct OSC file

The OSC file that needs to be processed first is the closest (in time) OSC file that has a timestamp greater than the current DB timestamp.

The timestamp of an OSC file (for instance 049.osc.gz [3]) is contained in the corresponding *.state.txt file (for instance 049.state.txt [4]), but there is no way to obtain the path given a timestamp.

Some research needs to be done to determine the best strategy for finding the address of the first OSC file that needs to be applied, also looking at how other applications have solved this problem. For the time being, a direct configuration option pointing at the first OSC file is already implemented and works just fine.

One possible strategy consists of recursively downloading the index pages of the change tree (i.e. https://planet.maps.mail.ru/replication/minute/) and analyzing the state.txt files until the change file closest to (and newer than) the DB timestamp is found.

The subsequent addresses can then be calculated, knowing that each subfolder contains at most 1000 OSC files (000 through 999), after which the parent directory component is incremented.

3. Download the OSC file and update the DB

The OSC download part, given an address, is already implemented in Underpass.
The update part will run in a separate thread and launch a separate osm2pgsql process. Depending on the performance of the update operation we may choose to join the thread or let it run in the background; in the latter case some sort of locking mechanism will be required to prevent out-of-order concurrent updates to the DB.
The simpler joining option will be attempted first; more complex schemes will be implemented if they are required.

[1] https://osm2pgsql.org/doc/manual.html#import-and-update
[2] https://ircama.github.io/osm-carto-tutorials/updating-data/
[3] https://planet.maps.mail.ru/replication/minute/004/679/049.osc.gz
[4] https://planet.maps.mail.ru/replication/minute/004/679/049.state.txt

Create python test cases

Now that Underpass exports some classes to Python3, there need to be a few additional test cases in Python, based on the C++ versions, to test compatibility.

Get organization data from TM

Currently Underpass gets some of the user fields from the TM. To support training-data integration, the organization data should also be used to populate the organization table in Galaxy so they match. The TM organization data is a subset of the Galaxy table, but this will keep the organization IDs and names in sync.

This is needed to integrate training data into reports, since reports reference the organization.

Stats Enhancement

Stats generated by Underpass do not match Insights; we need to find the discrepancy.
Buildings are the first priority.
