osm4scala's Introduction

osm4scala


High performance Scala library and Spark Polyglot (Scala, Python, SQL, etc.) connector for OpenStreetMap Pbf files.

Documentation and site

⚠ Full usage documentation at https://simplexspatial.github.io/osm4scala/


Dev information:

It is possible to develop on a Windows machine, but all documentation assumes that you are using Linux or macOS.

Prepare environment

The only special requirement is to execute sbt compile to generate the protobuf source code.

sbt compile

PATCH_211 flag

Depending on the Scala version, some projects are disabled (no Spark 3 for Scala 2.11) and different library dependencies are used. For this reason, there is a flag called PATCH_211 (default value false) that enables or disables Scala 2.11 compatibility.
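
A rough sketch of how such a flag could drive the build; this is illustrative only, the actual build.sbt may be wired differently and the version numbers below are examples:

    // build.sbt sketch: read PATCH_211 from the environment and pick the
    // cross-build Scala versions from it. Versions are examples only.
    val patch211: Boolean = sys.env.get("PATCH_211").exists(_.toBoolean)

    lazy val supportedScalaVersions =
      if (patch211) Seq("2.11.12") else Seq("2.12.15", "2.13.8")

    ThisBuild / crossScalaVersions := supportedScalaVersions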

Cross versions

The project uses cross-building to manage 2.11, 2.12 and 2.13 from the same code base, so remember to use '+' to run tasks against all versions.

For example, to run the tests:

PATCH_211=false sbt +test
PATCH_211=true sbt +test

Release process

Publication to Maven Central has been removed from the release process, so there are now a few steps:

  1. Release.

    git checkout master
    sbt release
  2. Publish to Maven Central. Information about the configuration is in the documentation of the plugins involved.

    Basically:

    git checkout v1.*.*
    sbt clean
    PATCH_211=false sbt +publishSigned
    PATCH_211=true sbt +publishSigned
    # At this point, tree target/sonatype-staging/ will show all artifacts to publish.
    sbt sonatypeBundleRelease
  3. Publish documentation and site.

    git checkout v1.*.*
    cd website
    nvm use
    export GIT_USER=<username>; export USE_SSH=true; npm run deploy

References.

PBF information:

Third-party OSS libraries:

osm4scala's People

Contributors

angelcervera, ericsun95, gitter-badger, thibauldcroonenborghs-tomtom


osm4scala's Issues

Unify the schema naming

To reduce redundant work, we should unify the naming to match the XML format data, something like this (also for the common shared fields):

StructType(
  StructField(id, LongType, false),
  StructField(type, ByteType, false),
  StructField(lat, DoubleType, true),
  StructField(lon, DoubleType, true),
  StructField(nd, ArrayType(LongType, true), true),
  StructField(relations, ArrayType(StructType(
    StructField(id, LongType, true),
    StructField(type, ByteType, true),
    StructField(role, StringType, true)
  ), true), true),
  StructField(tags, MapType(StringType, StringType, true), true)
)

Spark Splitting file

To take advantage of distributed storage systems like HDFS, I will try to split the file to exploit data locality.

Because of the nature of the osm.pbf format, maybe it is not possible to do it.
There are Spark interfaces and abstract classes, like FileFormat, that help to read files in chunks.

It needs more research and reverse engineering to find the right way to implement it. 😄

java.lang.ClassNotFoundException: osm.pbf.DefaultSource

Hi!

I'm running spark v2.4.6. I've started it with the following command:

spark-shell --packages com.acervera.osm4scala:osm4scala-core_2.11:1.0.3

and while trying to load the data:

val osmDF = spark.sqlContext.read.format("osm.pbf").load("<osm files path here>")

I'm getting the following error:

java.lang.ClassNotFoundException: Failed to find data source: osm.pbf. Please find packages at http://spark.apache.org/third-party-projects.html
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  ... 49 elided
Caused by: java.lang.ClassNotFoundException: osm.pbf.DefaultSource
  at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
  at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:953)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:898)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:881)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
  at scala.util.Try$.apply(Try.scala:192)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
  at scala.util.Try.orElse(Try.scala:84)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
  ... 51 more

Is there any other dependency I should add?
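
(A likely cause, inferred from this page rather than a confirmed answer: osm4scala-core only contains the core Scala library, so the osm.pbf data source is not on the classpath. The Spark connector is published as a separate artifact that would have to be passed to --packages instead, and per the Spark 2 compatibility note below it targets Spark 3 / Scala 2.12. The artifact id here is an assumption to be checked against the documentation.)

    # Sketch only: artifact id and version are assumptions, check the docs.
    spark-shell --packages com.acervera.osm4scala:osm4scala-spark3-shaded_2.12:<version>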

Spark 2 compatibility (2.12)

ATM, the connector is only for Spark 3. I did not spend time on a Spark 2 version, but it should not be difficult to add it to the Scala 2.11 branch.

Let's keep this ticket alive, and if people need it I will implement it. Please add a +1 reaction if you think it would be helpful.

Create site with examples and documentation

Integration test for Spark Connector

At the moment, the unit tests cover around 90% of the source code, which is good, but there are cases that need to be tested in real environments.

An example is the Spark Connector: if something is wrong in the packaging (e.g. the services file is not included), the error is not detected until it is used in a real cluster.

This will cover part of #9 and #103

Review Scala modeling and refactoring

  • Refactor the model to avoid enums / Don't use the new Scala 3 enums, to keep 2.11 backward compatibility.
  • RelationMemberEntity.relationTypes should be singular instead of plural, or better, renamed to reflect that it holds a single type.

2017 comments:
For example, WayEntity.scala uses enums and a Java-style class hierarchy.
Two good explanations of why to avoid enums, at least in Scala:
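
A minimal sketch of the ADT style that could replace the enum; the type and member names are illustrative, not the library's final model:

    // Sketch: a sealed trait with case objects instead of an Enumeration or
    // Java-style enum. This keeps exhaustive pattern-matching checks and
    // compiles unchanged on Scala 2.11, 2.12 and 2.13.
    sealed trait RelationMemberType
    object RelationMemberType {
      case object Node     extends RelationMemberType
      case object Way      extends RelationMemberType
      case object Relation extends RelationMemberType
    }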

Spark - Sequential parsing

I will implement a sequential reader before thinking about how to chunk the pbf file.
Reading the full planet takes 40 minutes, so let's see if even in this case the performance is good enough.

  • Processing in parallel with parsing the file: I need to research this, because the reader will be executed in one stage, so Spark will not start processing until the parsing stage ends.
  • Maybe data transfer and its serialization/deserialization will be a bottleneck.

This will be the initial implementation.

Logging system

We should come up with a logging system, e.g. extending a Logging trait in each main class, for better debugging.
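
A minimal sketch of what such a mixin could look like, using SLF4J; the trait name and approach are assumptions, not an agreed design:

    // Sketch: a mixin that gives each class its own SLF4J logger.
    import org.slf4j.{Logger, LoggerFactory}

    trait Logging {
      @transient protected lazy val logger: Logger =
        LoggerFactory.getLogger(getClass.getName)
    }

    // Usage: class PbfReader extends Logging { logger.debug("block parsed") }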

Block search reimplementation in the Spark connector.

In the Spark Connector, to support split files, it is necessary to search for the location of the first block.
A few improvements can be made:

  • At the moment, a naive search is used; other algorithms, like KMP adapted to searching bytes, could be better. It is necessary to evaluate whether this would really boost performance.
  • The maximum block size in OSM is 32MB, so it is possible to limit the search to, say, 128MB: if no block is found there, it is not an OSM file. An osm.pbf file also contains an OSMHeader block, so in the first chunk the location could be beyond 32MB. This will be helpful when trying to parse non-OSM files.
  • Implement validation. At the moment, the OSMData pattern is only used to find the header, but if this pattern appears in, for example, an uncompressed tag, we can get false positives. We need to validate that it really is a data block (see the sketch after this list). For example:
    • Check that the 4 bytes before the pattern encode a value lower than 32MB. This will remove 99% of cases, because if the pattern is in a tag value, it is preceded by a key, i.e. a string, and 4 characters read as an int4 will (I think) always be higher than 32MB.
    • Parse it as the last step. If parsing fails, it is not a block. This should be the last validation because it is expensive.
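
A sketch of the naive search plus the 4-byte heuristic described above; names and layout are assumptions, not the connector's current implementation:

    // Sketch: scan a buffer for the ASCII bytes of "OSMData" and keep only
    // occurrences where the 4 preceding bytes, read as a big-endian int, are
    // a plausible length below the 32MB block limit.
    object BlockSearchSketch {
      private val Marker: Array[Byte] = "OSMData".getBytes("US-ASCII")
      private val MaxBlockSize = 32 * 1024 * 1024

      def findCandidate(buffer: Array[Byte]): Int = {
        var i = 0
        while (i <= buffer.length - Marker.length) {
          if (buffer.slice(i, i + Marker.length).sameElements(Marker) && plausibleLength(buffer, i))
            return i
          i += 1
        }
        -1 // not found in this chunk
      }

      // The 4 bytes before the marker should look like a small length, not text.
      private def plausibleLength(buffer: Array[Byte], markerOffset: Int): Boolean =
        markerOffset >= 4 && {
          val len = java.nio.ByteBuffer.wrap(buffer, markerOffset - 4, 4).getInt
          len > 0 && len < MaxBlockSize
        }
    }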

Add writing support.

I need to generate pbf files, along with reading them, to be able to create sample files with specific content.

This is also useful to create testing data, for example an example dataset with anonymized info fields.

Support PBF Output

We should support writing data out in .pbf format. This will help us test the whole workflow.

Adding project to dependencies not working

Hi,

I've tried to add this lib as a dependency to my project, but it failed to load. Could you please advise what I'm doing wrong or how to fix it?

Dependency I've added:

"com.acervera.osm4scala" %% "osm4scala" % "1.0"

Resolvers I have:

resolvers ++= Seq(
  "scalaz-bintray" at "http://dl.bintray.com/scalaz/releases",
  Resolver.jcenterRepo
)

Error I get:

[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	::          UNRESOLVED DEPENDENCIES         ::
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	:: com.acervera.osm4scala#osm4scala_2.11;1.0: not found
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
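
(A possible cause, as an assumption rather than a verified answer: the artifact id and version do not match anything published. Elsewhere on this page the core artifact appears as osm4scala-core, so the dependency would look closer to the sketch below, with a version that actually exists.)

    // Sketch only: artifact id taken from the spark-shell example elsewhere on
    // this page; the version is a placeholder and must match a published release.
    "com.acervera.osm4scala" %% "osm4scala-core" % "<version>"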

String encoding

From the code here:

private def calculateTags(tags: Map[String, String]): MapData = ArrayBasedMapData(
  tags,
  k => UTF8String.fromString(k.toString),
  v => UTF8String.fromString(v.toString)
)

Why do we specify UTF8String? Could we make this generic? In some cases the tag string may not use UTF-8, which may raise problems.

Spark connector

It is possible to create a connector for easy access from Spark.
After that, it will no longer be necessary to put blocks in HDFS; we can work directly with the pbf file.

Related articles:

Other connectors, as examples:

Spark DataSets using OSMEntity case classes

At the moment, the Spark connector uses DataFrames. It would be useful to allow direct interaction between OSMEntity types and the Spark connector using Datasets.

Something like these cases should work.

      import spark.implicits._
      val dataset = Seq(
        NodeEntity(1, 11, 10, Map({ "nodeId" -> "1"})),
        NodeEntity(2, 12, 20, Map({ "nodeId" -> "2"})),
        NodeEntity(3, 13, 30, Map.empty),
        NodeEntity(4, 14, 40, Map.empty),
        NodeEntity(5, 15, 50, Map.empty),
        NodeEntity(6, 16, 60, Map.empty),
        WayEntity(7, Seq(1,2,3,4), Map({ "wayId" -> "7"})),
        WayEntity(8, Seq(4,5,6), Map({ "wayId" -> "8"})),
      ).toDS()

      dataset.show()

Or

      import spark.implicits._
      val monaco = spark.sqlContext.read
        .format("osm.pbf")
        .load("src/test/resources/monaco.osm.pbf")
        .persist()

      monaco.as[OSMEntity]

Prune fields and Filters

Mixing the reader with PrunedScan and PrunedFilteredScan allows us to keep only the requested fields and to filter before returning the parsed data.
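
A sketch of what such a relation could look like; the class name and the body of buildScan are hypothetical, not the connector's real code:

    // Sketch: a relation mixing BaseRelation with PrunedFilteredScan, so Spark
    // passes only the columns the query needs and the filters it can push down.
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.StructType

    class OsmPbfRelation(path: String,
                         override val schema: StructType,
                         @transient override val sqlContext: SQLContext)
        extends BaseRelation with PrunedFilteredScan {

      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
        // A real implementation would parse the pbf blocks here, projecting the
        // requested columns and applying the pushed-down filters.
        sqlContext.sparkContext.emptyRDD[Row]
      }
    }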

build failed

The following output is seen when I run compile in sbt:
[info] Resolving jline#jline;2.14.3 ...
[libprotobuf WARNING google/protobuf/compiler/parser.cc:546] No syntax specified for the proto file: fileformat.proto. Please use 'syntax = "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to proto2 syntax.)
[libprotobuf WARNING google/protobuf/compiler/parser.cc:546] No syntax specified for the proto file: osmformat.proto. Please use 'syntax = "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to proto2 syntax.)
Traceback (most recent call last):
File "C:\Temp\protocbridge7327161231499492139.py", line 6, in
s.sendall(content)
TypeError: a bytes-like object is required, not 'str'
[info] Done updating.
[info] Resolving org.scalactic#scalactic_2.12;3.0.1 ...
[info] Updating {file:/C:/Users/kzpt72/osm4scala/osm4scala/}examples-counter-akka...
[info] Resolving org.scala-lang#scala-reflect;2.12.2 ...
[info] Resolving jline#jline;2.14.3 ...
[info] Compiling 1 Scala source to C:\Users\kzpt72\osm4scala\osm4scala\examples\common-utilities\target\scala-2.12\classes...
[info] Done updating.
[info] Resolving jline#jline;2.14.3 ...
[info] Done updating.
[info] Resolving jline#jline;2.14.3 ...
[info] Done updating.
[trace] Stack trace suppressed: run last core/compile:protocGenerate for the full output.
[error] (core/compile:protocGenerate) protoc returned exit code: 1
[error] Total time: 8 s, completed Aug 25, 2017 12:31:58 PM

Please let me know if there are additional dependencies I would need to build this.
Thanks

Move away from Bintray

Currently, Bintray is used for jar publication.
Bintray will shut down on the 1st of May, so I need to migrate the library to another provider.

Options with OSS support or a free tier:

Other useful Links:

Tools helping to publish: https://github.com/olafurpg/sbt-ci-release

Links related to the current project status:

Publishing directly to Sonatype:

Don't forget to update all Bintray badges in the documentation!

Replace the Map used for tags with an Array/List

Using a map here is incorrect; we should use a list or an array. A real-life pbf can contain the same key twice with different values, and the reader should not change the data. A minimal sketch of the change follows.
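
The sketch below uses assumed field names, based on the entity names used elsewhere on this page:

    // Sketch: keep tags as an ordered sequence of key/value pairs instead of a
    // Map, so duplicate keys coming from the pbf are preserved as-is.
    case class NodeEntitySketch(
        id: Long,
        latitude: Double,
        longitude: Double,
        tags: Seq[(String, String)] // was Map[String, String]
    )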

Expose info fields

For my original use case, I did not need information fields like changeset, user id, etc. But it is true that other people with different use cases would need it, and it is not a huge effort to add it. (A rough sketch of a possible Scala representation follows the proto definitions below.)

/* Optional metadata that may be included into each primitive. */
message Info {
   optional int32 version = 1 [default = -1];
   optional int64 timestamp = 2;
   optional int64 changeset = 3;
   optional int32 uid = 4;
   optional uint32 user_sid = 5; // String IDs

   // The visible flag is used to store history information. It indicates that
   // the current object version has been created by a delete operation on the
   // OSM API.
   // When a writer sets this flag, it MUST add a required_features tag with
   // value "HistoricalInformation" to the HeaderBlock.
   // If this flag is not available for some object it MUST be assumed to be
   // true if the file has the required_features tag "HistoricalInformation"
   // set.
   optional bool visible = 6;
}

/** Optional metadata that may be included into each primitive. Special dense format used in DenseNodes. */
message DenseInfo {
   repeated int32 version = 1 [packed = true]; 
   repeated sint64 timestamp = 2 [packed = true]; // DELTA coded
   repeated sint64 changeset = 3 [packed = true]; // DELTA coded
   repeated sint32 uid = 4 [packed = true]; // DELTA coded
   repeated sint32 user_sid = 5 [packed = true]; // String IDs for usernames. DELTA coded

   // The visible flag is used to store history information. It indicates that
   // the current object version has been created by a delete operation on the
   // OSM API.
   // When a writer sets this flag, it MUST add a required_features tag with
   // value "HistoricalInformation" to the HeaderBlock.
   // If this flag is not available for some object it MUST be assumed to be
   // true if the file has the required_features tag "HistoricalInformation"
   // set.
   repeated bool visible = 6 [packed = true];
}
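
As a rough sketch only, with field names and optionality assumed from the proto above rather than taken from the library's final API, the exposed model could look like this:

    // Sketch: a possible Scala representation of the optional Info metadata.
    case class InfoSketch(
        version: Option[Int] = None,
        timestamp: Option[Long] = None,
        changeset: Option[Long] = None,
        uid: Option[Int] = None,
        userSid: Option[Int] = None, // index into the StringTable
        visible: Option[Boolean] = None
    )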

Subtasks:

  • DenseInfo in DenseNodes
  • Info in Nodes
  • Info in Ways
  • Info in Relations
  • visible field special case.
    • In DenseInfo
    • Not in DenseInfo
  • Update documentation

Support XML Output

We should support writing data out in .xml format. This will help us test the whole workflow.

Centralize StringTable processing

StringTable is used in several places and is going to be used in the Info section as well.
Creating an enricher for this class will simplify its use and remove duplicated code from the two OSMEntity factories (Relation and Way), and possibly from the DenseNodeIterator that generates the NodeEntity. A sketch of such an enricher follows.
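
This sketch assumes the generated StringTable exposes its repeated `s` field as a sequence of ByteStrings (as in osmformat.proto); the names are illustrative:

    // Sketch: an implicit "enricher" that centralizes decoding of string table
    // entries, instead of repeating the UTF-8 decoding in every factory.
    import com.google.protobuf.ByteString

    object StringTableEnricherSketch {
      implicit class RichStringTable(val entries: Seq[ByteString]) extends AnyVal {
        // Decode one entry of the string table as UTF-8.
        def stringAt(index: Int): String = entries(index).toStringUtf8
      }
    }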
