
Comments (24)

bdon avatar bdon commented on May 24, 2024 1

the page fault rate when accessing locations should be higher now

Locations were previously stored as 64-bit integers. The records for the "Locations" table in the osmx file occupy contiguous pages of storage on disk, ordered by node ID. Adding a 32-bit version number increases the record size by 50%, so fewer records fit on a single disk page.

When osmx extract is run, a way's member nd references are resolved into lat/lng by seeking over the locations table in order of increasing way ID. This has very poor locality: extracting Boston might include ways 12345 and 12346, but those two ways might reference nodes anywhere from 1 to 1000000. The node ID is essentially random (unless it's a set of ways and nodes that were all created around the same time and not edited heavily).

The osmx design (by using LMDB) implements no application-level caching. It relies on the kernel to cache pages as they are retrieved from disk; the kernel automatically manages a pool of cached disk pages in RAM. Since the locations table is now less dense, it's more likely when fetching Locations that you will need a page that has not been fetched yet or has been evicted from cache.
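
The density argument above can be put into rough numbers. This is only a back-of-the-envelope sketch: the 4096-byte page size is an assumption, and the calculation ignores keys and any per-record or per-page overhead that LMDB adds, so the real figures will differ.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; the real value depends on the system


def records_per_page(record_bytes, page_size=PAGE_SIZE):
    """How many fixed-size records fit in one page, ignoring all overhead."""
    return page_size // record_bytes


def pages_needed(num_records, record_bytes, page_size=PAGE_SIZE):
    """Pages required to hold num_records, rounded up to whole pages."""
    per_page = records_per_page(record_bytes, page_size)
    return -(-num_records // per_page)  # ceiling division


old_density = records_per_page(8)   # 64-bit location value
new_density = records_per_page(12)  # 96-bit value with version

print(old_density, new_density)
print(pages_needed(1_000_000, 8), pages_needed(1_000_000, 12))
```

Under these assumptions the same number of locations spreads over roughly 50% more pages, so a page cache of fixed size holds a smaller fraction of the table.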

This is just my performance hypothesis, I need to run some benchmarks to determine whether or not it makes any significant difference.

from osmexpress.

invisiblefunnel avatar invisiblefunnel commented on May 24, 2024

Hi @bdon. I'm a big fan of OSMExpress and the Protomaps extract service. At my company we have some internal tooling that relies on the object version number for caching. We don't need the changeset/user/timestamp. Would you consider adding version numbers to Protomaps extracts? Thanks.


bdon avatar bdon commented on May 24, 2024

Are you working with a .osmx locally or just a .pbf extract? If an .osmx, is it a region or the whole planet? I'm wary of implementing this because it will probably double the total db size.

Ideally, metadata is optional and you won't pay the storage cost for it if you don't use it, but I think this depends on migrating from capnproto to flatbuffers (#1) because of how empty fields are stored.


invisiblefunnel avatar invisiblefunnel commented on May 24, 2024

We are just working with .pbf extracts for now.


bdon avatar bdon commented on May 24, 2024
  • added version, timestamp, changeset, uid, username to database
  • currently working on a new planet import to confirm the expansion in size is reasonable
    • untagged nodes are ignored
    • still using capnproto


bdon avatar bdon commented on May 24, 2024

on an AWS i3.xlarge instance, osmx expand planet.osm.pbf planet.osmx took exactly 7 hours and resulted in a 643G planet.osmx file. The expansion in size when adding all metadata (ignoring untagged nodes) should be less than 10% total, so I'd prefer to always include metadata.


bdon avatar bdon commented on May 24, 2024

download server at http://protomaps.com/extracts now includes version and timestamp information

@invisiblefunnel let me know if this is working for you; I'm working on the ecosystem around these tools so I'm interested in what people are building!


invisiblefunnel avatar invisiblefunnel commented on May 24, 2024

Thanks @bdon! This is great news. I'll take a look this week and reply back.


blackboxlogic avatar blackboxlogic commented on May 24, 2024

I just grabbed an extract from protomaps, loaded it into josm, fixed a road's name, and uploaded the change. This demonstrates that the extract had the required meta-data (version). I also manually verified that elements had edited at and edited by attributes.

However... I cannot use this as a source to change the shape of a road, since most of the way's nodes are tag-less, and you don't provide them with a version.

Please reconsider including meta-data (or at least version) on tag-less points. That would allow the extract to be used for any type of edit.


invisiblefunnel avatar invisiblefunnel commented on May 24, 2024

Please reconsider including meta-data (or at least version) on tag-less points.

FWIW this is also a blocker for my use cases, which rely on the ID and version to uniquely identify objects in time and detect changes.
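
To illustrate the kind of change detection described here, a minimal sketch follows. It assumes each extract has already been parsed into a dict mapping an (element type, ID) pair to its version; the function name `diff_by_version` and that dict shape are hypothetical and not part of osmx or any tool mentioned in this thread.

```python
def diff_by_version(old, new):
    """Compare two snapshots keyed by (element_type, element_id) -> version.

    Returns three sets of keys: created, modified, deleted.
    """
    created = {key for key in new if key not in old}
    deleted = {key for key in old if key not in new}
    modified = {key for key in new if key in old and new[key] != old[key]}
    return created, modified, deleted


# Hypothetical snapshots taken from two extracts of the same region.
last_week = {("node", 1): 1, ("node", 2): 4, ("way", 10): 2}
today = {("node", 1): 1, ("node", 2): 5, ("way", 11): 1}

created, modified, deleted = diff_by_version(last_week, today)
print(created, modified, deleted)
```

Without a version on tag-less nodes, a moved node with no tags would look identical in both snapshots by ID alone, which is why the version matters for this use case.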


bdon avatar bdon commented on May 24, 2024

just to confirm - to make this work for your use cases only version is needed and no other metadata?


invisiblefunnel avatar invisiblefunnel commented on May 24, 2024

Yes, just the version is needed. We don't use timestamps at all.


blackboxlogic avatar blackboxlogic commented on May 24, 2024

Confirmed, version would make exports usable for editing projects. I can't think of a reason I'd want other meta-data on tag-less nodes, and I'm sure any reason I eventually think of won't justify the cost.


bdon avatar bdon commented on May 24, 2024

changed location values from a 64-bit integer to a 96-bit struct that includes the version
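
As a rough picture of what a 96-bit location value could look like, here is a sketch using Python's `struct` module. The field order, the little-endian byte order, and the fixed-point scaling of 1e7 are all assumptions made for illustration; the actual osmx layout is not shown in this thread and may differ.

```python
import struct

# Assumed layout: longitude and latitude as signed 32-bit fixed-point
# integers, followed by an unsigned 32-bit version. 3 x 32 bits = 96 bits.
LOCATION_FORMAT = "<iiI"
COORD_SCALE = 10_000_000  # assumed fixed-point scale (7 decimal places)


def encode_location(lon_deg, lat_deg, version):
    """Pack a coordinate pair and a version into 12 bytes."""
    return struct.pack(
        LOCATION_FORMAT,
        round(lon_deg * COORD_SCALE),
        round(lat_deg * COORD_SCALE),
        version,
    )


def decode_location(buf):
    """Unpack 12 bytes back into (lon_deg, lat_deg, version)."""
    lon, lat, version = struct.unpack(LOCATION_FORMAT, buf)
    return lon / COORD_SCALE, lat / COORD_SCALE, version


packed = encode_location(-71.06, 42.36, 3)
print(len(packed) * 8, decode_location(packed))
```

Two 32-bit coordinates alone fit in the previous 64 bits, so the extra 32-bit field accounts for the 50% growth per record.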

AWS i3.xlarge: osmx expand planet.osm.pbf planet.osmx took 7.38 hours and resulted in a 666G planet.osmx file. so another 3-5% bump in expand time and planet size. need to verify now that this is correct and benchmark some extracts, because the page fault rate when accessing locations should be higher now.


CloudNiner avatar CloudNiner commented on May 24, 2024

so another 3-5% bump in expand time and planet size

That seems pretty reasonable. For the augmented diff use case (#17), version information is useful for the same reason @invisiblefunnel mentioned above: it allows for unique identification of a particular node in order to match it to its metadata.

the page fault rate when accessing locations should be higher now

Can you describe this a bit more?


bdon avatar bdon commented on May 24, 2024

Here's my test region:

osmx extract planet.osmx benchmark.osm.pbf --bbox 38.462,-77.519,41.0130,-73.333

first run on versionless planet: 943 seconds
second run: 919 seconds

version planet: 873 seconds
version planet 2nd time: 773 seconds

echo 3 > /proc/sys/vm/drop_caches can be used to clear the page cache, but the extract is probably big enough that it doesn't make a difference. This isn't a very controlled experiment, because the versionless planet has been receiving updates for a few weeks and might be more fragmented. In any case, it doesn't look like adding versions to locations hurts extract speed by much.
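
For a more controlled comparison, the runs could be timed from a small harness that clears the page cache before each one. This is a sketch only: the helper names are hypothetical, `drop_page_cache` is Linux-specific and must run as root, and nothing here was used for the numbers above.

```python
import subprocess
import time


def drop_page_cache():
    """Flush dirty pages, then ask the Linux kernel to drop its caches.

    Requires root. Equivalent to: sync; echo 3 > /proc/sys/vm/drop_caches
    """
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def time_command(cmd):
    """Run cmd (a list of arguments) and return the wall-clock seconds taken."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


# Example usage with the extract command from this comment (run as root):
#
#   drop_page_cache()
#   cold = time_command(["osmx", "extract", "planet.osmx", "benchmark.osm.pbf",
#                        "--bbox", "38.462,-77.519,41.0130,-73.333"])
#   warm = time_command([...same command...])
```

Timing a cold run and a warm run of the same command on each planet file would separate the effect of the cache from the effect of the record size.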


bdon avatar bdon commented on May 24, 2024

@blackboxlogic @invisiblefunnel new planet with versions is now online - can you try on https://protomaps.com/extracts ?


invisiblefunnel avatar invisiblefunnel commented on May 24, 2024

Works perfectly for me. Many thanks @bdon.


blackboxlogic avatar blackboxlogic commented on May 24, 2024

Every element has a version number, so the extracts are usable for editing.
Tag-less nodes don't have edited at, which is expected. However, I'm noticing that edited by and changeset are both 0 for all objects. Is that intentional?


bdon avatar bdon commented on May 24, 2024

Yes, the data is stored, but I am intentionally excluding it. That seems to be the convention for GDPR compliance. Is that needed for any of your applications?


blackboxlogic avatar blackboxlogic commented on May 24, 2024

I definitely don't need it, but it could plausibly be useful*, and if you're storing it already then there isn't much to gain by withholding it. Other services handle GDPR by offering the "pii" only to OSM users who have signed in with OAuth, since they have agreed to the terms of service. That would, of course, complicate your service by involving OAuth.

*Possible use-case: A vandal changes all buildings into parks, I want to remove all leisure=park where [vandal's name] was the last editor. I've had to do this sort of thing a few times.


bdon avatar bdon commented on May 24, 2024

I have an auth system built which is separate from OSM OAuth. I could make PII available only to logged-in users.

Can you describe your editing workflow in more detail? I'd like to include it in my SOTM talk and I can mention your username if that's ok.


blackboxlogic avatar blackboxlogic commented on May 24, 2024

Re: "Describe your workflow"
Short version: "I've built up a collection of scripts which can be chained together" but I think that's just called "programming"?
Here's a recent example of my work, but I plan to do more and there are two parts of my pipeline I'm rewriting (pulling data from OSM, and schema translation). One of the more cumbersome parts was retrieving up-to-date large regions of data from OSM. It was awkward for multiple reasons and my future projects will benefit from your work.
If you want a longer description shoot me an email [blackboxlogic at gmail dot com] with your phone number and best time to call, I'd love to chat.

Yes, "ok" to mention my username.


bdon avatar bdon commented on May 24, 2024

Great, we can discuss over email.

