osm4scala's Introduction

osm4scala


High performance Scala library and Spark Polyglot (Scala, Python, SQL, etc.) connector for OpenStreetMap Pbf files.

Documentation and site

⚠ Full usage documentation at https://simplexspatial.github.io/osm4scala/


Dev information:

It is possible to develop on a Windows machine, but all documentation assumes that you are using Linux or macOS.

Prepare environment

The only special requirement is to execute sbt compile to generate the protobuf source code.

sbt compile

PATCH_211 flag

Depending on the Scala version, some projects are disabled (no Spark 3 for Scala 2.11) and different library dependencies are used. For this reason, there is a flag called PATCH_211 (default value false) that enables or disables Scala 2.11 compatibility.
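
A rough sketch of how such a flag could drive the build; this is illustrative only, the actual build.sbt may be wired differently and the version numbers below are examples:

    // build.sbt sketch: read PATCH_211 from the environment and pick the
    // cross-build Scala versions from it. Versions are examples only.
    val patch211: Boolean = sys.env.get("PATCH_211").exists(_.toBoolean)

    lazy val supportedScalaVersions =
      if (patch211) Seq("2.11.12") else Seq("2.12.15", "2.13.8")

    ThisBuild / crossScalaVersions := supportedScalaVersions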

Cross versions

The project uses cross-building to manage 2.11, 2.12 and 2.13 from the same code base, so remember to use '+' to run tasks against all versions.

For example, to run the tests:

PATCH_211=false sbt +test
PATCH_211=true sbt +test

Release process

Publication to Maven Central has been removed from the release process, so there are now a few steps:

  1. Release.

    git checkout master
    sbt release
  2. Publish to Maven Central. Information about the configuration is in the documentation of the plugins involved.

    Basically:

    git checkout v1.*.*
    sbt clean
    PATCH_211=false sbt +publishSigned
    PATCH_211=true sbt +publishSigned
    # At this point, tree target/sonatype-staging/ will show all artifacts to publish.
    sbt sonatypeBundleRelease
  3. Publish documentation and site.

    git checkout v1.*.*
    cd website
    nvm use
    export GIT_USER=<username>; export USE_SSH=true; npm run deploy

References.

PBF information:

Third-party OSS libraries:

osm4scala's People

Contributors

angelcervera, ericsun95, gitter-badger, thibauldcroonenborghs-tomtom


osm4scala's Issues

Unify the schema naming

To reduce redundant work, we should unify the naming to match the XML format data, something like this (also for the common shared fields):

StructType(
  StructField(id, LongType, false),
  StructField(type, ByteType, false),
  StructField(lat, DoubleType, true),
  StructField(lon, DoubleType, true),
  StructField(nd, ArrayType(LongType, true), true),
  StructField(relations, ArrayType(StructType(
    StructField(id, LongType, true),
    StructField(type, ByteType, true),
    StructField(role, StringType, true)
  ), true), true),
  StructField(tags, MapType(StringType, StringType, true), true)
)

Spark Splitting file

To take advantage of distributed storage systems like HDFS, I will try to split the file to exploit data locality.

Because of the nature of the osm.pbf format, maybe it is not possible to do it.
There are Spark interfaces and abstract classes, like FileFormat, that help to read files in chunks.

It needs more research and reverse engineering to find the right way to implement it. 😄

java.lang.ClassNotFoundException: osm.pbf.DefaultSource

Hi!

I'm running spark v2.4.6. I've started it with the following command:

spark-shell --packages com.acervera.osm4scala:osm4scala-core_2.11:1.0.3

and while trying to load the data:

val osmDF = spark.sqlContext.read.format("osm.pbf").load("<osm files path here>")

I'm getting the following error:

java.lang.ClassNotFoundException: Failed to find data source: osm.pbf. Please find packages at http://spark.apache.org/third-party-projects.html
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  ... 49 elided
Caused by: java.lang.ClassNotFoundException: osm.pbf.DefaultSource
  at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
  at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:953)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:898)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:881)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
  at scala.util.Try$.apply(Try.scala:192)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
  at scala.util.Try.orElse(Try.scala:84)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
  ... 51 more

Is there any other dependency I should add?
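
(A likely cause, inferred from this page rather than a confirmed answer: osm4scala-core only contains the core Scala library, so the osm.pbf data source is not on the classpath. The Spark connector is published as a separate artifact that would have to be passed to --packages instead, and per the Spark 2 compatibility note below it targets Spark 3 / Scala 2.12. The artifact id here is an assumption to be checked against the documentation.)

    # Sketch only: artifact id and version are assumptions, check the docs.
    spark-shell --packages com.acervera.osm4scala:osm4scala-spark3-shaded_2.12:<version>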

Spark 2 compatibility (2.12)

ATM, the connector is only for Spark 3. I did not spend time on a Spark 2 version, but it should not be difficult to add it to the Scala 2.11 branch.

Let's keep this ticket alive, and if people need it I will implement it. Please add a +1 reaction if you think it would be helpful.

Create site with examples and documentation

Integration test for Spark Connector

At the moment, the unit tests cover around 90% of the source code, which is good, but there are cases that need to be tested in real environments.

An example is the Spark Connector: if something is wrong in the packaging (e.g. the services file is not included), the error is not detected until it is used in a real cluster.

This will cover part of #9 and #103

Review Scala modeling and refactoring

  • Refactor the model to avoid enums / Don't use the new Scala 3 enums, to keep 2.11 backward compatibility.
  • RelationMemberEntity.relationTypes should be singular instead of plural, or better, renamed to reflect that it holds a single type.

2017 comments:
For example, WayEntity.scala uses enums and a Java-style class hierarchy.
Two good explanations of why to avoid enums, at least in Scala:
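
A minimal sketch of the ADT style that could replace the enum; the type and member names are illustrative, not the library's final model:

    // Sketch: a sealed trait with case objects instead of an Enumeration or
    // Java-style enum. This keeps exhaustive pattern-matching checks and
    // compiles unchanged on Scala 2.11, 2.12 and 2.13.
    sealed trait RelationMemberType
    object RelationMemberType {
      case object Node     extends RelationMemberType
      case object Way      extends RelationMemberType
      case object Relation extends RelationMemberType
    }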

Spark - Sequential parsing

I will implement a sequential reader before thinking about how to chunk the pbf file.
Reading the full planet takes 40 minutes, so let's see if even in this case the performance is good enough.

  • Processing in parallel with parsing the file: I need to research this, because the reader will be executed in one stage, so Spark will not start processing until the parsing stage ends.
  • Maybe data transfer and its serialization/deserialization will be a bottleneck.

This will be the initial implementation.

Logging system

We should come up with a logging system, e.g. extending a Logging trait in each main class, for better debugging.
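
A minimal sketch of what such a mixin could look like, using SLF4J; the trait name and approach are assumptions, not an agreed design:

    // Sketch: a mixin that gives each class its own SLF4J logger.
    import org.slf4j.{Logger, LoggerFactory}

    trait Logging {
      @transient protected lazy val logger: Logger =
        LoggerFactory.getLogger(getClass.getName)
    }

    // Usage: class PbfReader extends Logging { logger.debug("block parsed") }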

Block search reimplementation in the Spark connector.

In the Spark Connector, to support split files, it is necessary to search for the location of the first block.
A few improvements can be made:

  • At the moment, a naive search is used; other algorithms, like KMP adapted to searching bytes, could be better. It is necessary to evaluate whether this would really boost performance.
  • The maximum block size in OSM is 32MB, so it is possible to limit the search to, say, 128MB: if no block is found there, it is not an OSM file. An osm.pbf file also contains an OSMHeader block, so in the first chunk the location could be beyond 32MB. This will be helpful when trying to parse non-OSM files.
  • Implement validation. At the moment, the OSMData pattern is only used to find the header, but if this pattern appears in, for example, an uncompressed tag, we can get false positives. We need to validate that it really is a data block (see the sketch after this list). For example:
    • Check that the 4 bytes before the pattern encode a value lower than 32MB. This will remove 99% of cases, because if the pattern is in a tag value, it is preceded by a key, i.e. a string, and 4 characters read as an int4 will (I think) always be higher than 32MB.
    • Parse it as the last step. If parsing fails, it is not a block. This should be the last validation because it is expensive.
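
A sketch of the naive search plus the 4-byte heuristic described above; names and layout are assumptions, not the connector's current implementation:

    // Sketch: scan a buffer for the ASCII bytes of "OSMData" and keep only
    // occurrences where the 4 preceding bytes, read as a big-endian int, are
    // a plausible length below the 32MB block limit.
    object BlockSearchSketch {
      private val Marker: Array[Byte] = "OSMData".getBytes("US-ASCII")
      private val MaxBlockSize = 32 * 1024 * 1024

      def findCandidate(buffer: Array[Byte]): Int = {
        var i = 0
        while (i <= buffer.length - Marker.length) {
          if (buffer.slice(i, i + Marker.length).sameElements(Marker) && plausibleLength(buffer, i))
            return i
          i += 1
        }
        -1 // not found in this chunk
      }

      // The 4 bytes before the marker should look like a small length, not text.
      private def plausibleLength(buffer: Array[Byte], markerOffset: Int): Boolean =
        markerOffset >= 4 && {
          val len = java.nio.ByteBuffer.wrap(buffer, markerOffset - 4, 4).getInt
          len > 0 && len < MaxBlockSize
        }
    }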

Add writing support.

I need to generate pbf files, along with reading them, to be able to create sample files with specific content.

This is also useful to create testing data, for example an example dataset with anonymized info fields.

Support PBF Output

We should support writing data out in .pbf format. This will help us test the whole workflow.

Adding project to dependencies not working

Hi,

I've tried to add this lib as a dependency to my project, but it failed to load. Could you please advise what I'm doing wrong or how to fix it?

Dependency I've added:

"com.acervera.osm4scala" %% "osm4scala" % "1.0"

Resolvers I have:

resolvers ++= Seq(
  "scalaz-bintray" at "http://dl.bintray.com/scalaz/releases",
  Resolver.jcenterRepo
)

Error I get:

[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	::          UNRESOLVED DEPENDENCIES         ::
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	:: com.acervera.osm4scala#osm4scala_2.11;1.0: not found
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
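
(A possible cause, as an assumption rather than a verified answer: the artifact id and version do not match anything published. Elsewhere on this page the core artifact appears as osm4scala-core, so the dependency would look closer to the sketch below, with a version that actually exists.)

    // Sketch only: artifact id taken from the spark-shell example elsewhere on
    // this page; the version is a placeholder and must match a published release.
    "com.acervera.osm4scala" %% "osm4scala-core" % "<version>"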

String encoding

From the code here:

private def calculateTags(tags: Map[String, String]): MapData = ArrayBasedMapData(
  tags,
  k => UTF8String.fromString(k.toString),
  v => UTF8String.fromString(v.toString)
)

Why do we specify UTF8String? Could we make this generic? In some cases the tag string may not use UTF-8, which may raise problems.

Spark connector

It is possible to create a connector for easy access from Spark.
After that, it will no longer be necessary to put blocks in HDFS; we can work directly with the pbf file.

Related articles:

Other connectors, as examples:

Spark DataSets using OSMEntity case classes

At the moment, the Spark connector uses DataFrames. It would be useful to allow direct interaction between OSMEntity types and the Spark connector using Datasets.

Something like these cases should work.

      import spark.implicits._
      val dataset = Seq(
        NodeEntity(1, 11, 10, Map({ "nodeId" -> "1"})),
        NodeEntity(2, 12, 20, Map({ "nodeId" -> "2"})),
        NodeEntity(3, 13, 30, Map.empty),
        NodeEntity(4, 14, 40, Map.empty),
        NodeEntity(5, 15, 50, Map.empty),
        NodeEntity(6, 16, 60, Map.empty),
        WayEntity(7, Seq(1,2,3,4), Map({ "wayId" -> "7"})),
        WayEntity(8, Seq(4,5,6), Map({ "wayId" -> "8"})),
      ).toDS()

      dataset.show()

Or

      import spark.implicits._
      val monaco = spark.sqlContext.read
        .format("osm.pbf")
        .load("src/test/resources/monaco.osm.pbf")
        .persist()

      monaco.as[OSMEntity]

Prune fields and Filters

Mixing the reader with PrunedScan and PrunedFilteredScan allows us to keep only the requested fields and to filter before returning the parsed data.
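
A sketch of what such a relation could look like; the class name and the body of buildScan are hypothetical, not the connector's real code:

    // Sketch: a relation mixing BaseRelation with PrunedFilteredScan, so Spark
    // passes only the columns the query needs and the filters it can push down.
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
    import org.apache.spark.sql.types.StructType

    class OsmPbfRelation(path: String,
                         override val schema: StructType,
                         @transient override val sqlContext: SQLContext)
        extends BaseRelation with PrunedFilteredScan {

      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
        // A real implementation would parse the pbf blocks here, projecting the
        // requested columns and applying the pushed-down filters.
        sqlContext.sparkContext.emptyRDD[Row]
      }
    }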

build failed

The following output is seen when I run compile in sbt:
[info] Resolving jline#jline;2.14.3 ...
[libprotobuf WARNING google/protobuf/compiler/parser.cc:546] No syntax specified for the proto file: fileformat.proto. Please use 'syntax = "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to proto2 syntax.)
[libprotobuf WARNING google/protobuf/compiler/parser.cc:546] No syntax specified for the proto file: osmformat.proto. Please use 'syntax = "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to proto2 syntax.)
Traceback (most recent call last):
File "C:\Temp\protocbridge7327161231499492139.py", line 6, in
s.sendall(content)
TypeError: a bytes-like object is required, not 'str'
[info] Done updating.
[info] Resolving org.scalactic#scalactic_2.12;3.0.1 ...
[info] Updating {file:/C:/Users/kzpt72/osm4scala/osm4scala/}examples-counter-akka...
[info] Resolving org.scala-lang#scala-reflect;2.12.2 ...
[info] Resolving jline#jline;2.14.3 ...
[info] Compiling 1 Scala source to C:\Users\kzpt72\osm4scala\osm4scala\examples\common-utilities\target\scala-2.12\classes...
[info] Done updating.
[info] Resolving jline#jline;2.14.3 ...
[info] Done updating.
[info] Resolving jline#jline;2.14.3 ...
[info] Done updating.
[trace] Stack trace suppressed: run last core/compile:protocGenerate for the full output.
[error] (core/compile:protocGenerate) protoc returned exit code: 1
[error] Total time: 8 s, completed Aug 25, 2017 12:31:58 PM

Please let me know if there are additional dependencies I would need to build this.
Thanks

Move away from Bintray

Currently, Bintray is used for jar publication.
Bintray will shut down on the 1st of May, so I need to migrate the library to another provider.

Options with OSS support or a free tier:

Other useful Links:

Tools helping to publish: https://github.com/olafurpg/sbt-ci-release

Links related to the current project status:

Publishing directly to Sonatype:

Don't forget to update all Bintray badges in the documentation!

Replace the Map used for tags with an Array/List

Using a map here is incorrect; we should use a list or an array. A real-life pbf can contain the same key twice with different values, and the reader should not change the data. A minimal sketch of the change follows.
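
The sketch below uses assumed field names, based on the entity names used elsewhere on this page:

    // Sketch: keep tags as an ordered sequence of key/value pairs instead of a
    // Map, so duplicate keys coming from the pbf are preserved as-is.
    case class NodeEntitySketch(
        id: Long,
        latitude: Double,
        longitude: Double,
        tags: Seq[(String, String)] // was Map[String, String]
    )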

Expose info fields

For my original use case, I did not need information fields like changeset, user id, etc. But it is true that other people with different use cases would need it, and it is not a huge effort to add it. (A rough sketch of a possible Scala representation follows the proto definitions below.)

/* Optional metadata that may be included into each primitive. */
message Info {
   optional int32 version = 1 [default = -1];
   optional int64 timestamp = 2;
   optional int64 changeset = 3;
   optional int32 uid = 4;
   optional uint32 user_sid = 5; // String IDs

   // The visible flag is used to store history information. It indicates that
   // the current object version has been created by a delete operation on the
   // OSM API.
   // When a writer sets this flag, it MUST add a required_features tag with
   // value "HistoricalInformation" to the HeaderBlock.
   // If this flag is not available for some object it MUST be assumed to be
   // true if the file has the required_features tag "HistoricalInformation"
   // set.
   optional bool visible = 6;
}

/** Optional metadata that may be included into each primitive. Special dense format used in DenseNodes. */
message DenseInfo {
   repeated int32 version = 1 [packed = true]; 
   repeated sint64 timestamp = 2 [packed = true]; // DELTA coded
   repeated sint64 changeset = 3 [packed = true]; // DELTA coded
   repeated sint32 uid = 4 [packed = true]; // DELTA coded
   repeated sint32 user_sid = 5 [packed = true]; // String IDs for usernames. DELTA coded

   // The visible flag is used to store history information. It indicates that
   // the current object version has been created by a delete operation on the
   // OSM API.
   // When a writer sets this flag, it MUST add a required_features tag with
   // value "HistoricalInformation" to the HeaderBlock.
   // If this flag is not available for some object it MUST be assumed to be
   // true if the file has the required_features tag "HistoricalInformation"
   // set.
   repeated bool visible = 6 [packed = true];
}
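
As a rough sketch only, with field names and optionality assumed from the proto above rather than taken from the library's final API, the exposed model could look like this:

    // Sketch: a possible Scala representation of the optional Info metadata.
    case class InfoSketch(
        version: Option[Int] = None,
        timestamp: Option[Long] = None,
        changeset: Option[Long] = None,
        uid: Option[Int] = None,
        userSid: Option[Int] = None, // index into the StringTable
        visible: Option[Boolean] = None
    )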

Subtasks:

  • DenseInfo in DenseNodes
  • Info in Nodes
  • Info in Ways
  • Info in Relations
  • visible field special case.
    • In DenseInfo
    • Not in DenseInfo
  • Update documentation

Support XML Output

We should support writing data out in .xml format. This will help us test the whole workflow.

Centralize StringTable processing

StringTable is used in several places and is going to be used in the Info section as well.
Creating an enricher for this class will simplify its use and remove duplicated code from the two OSMEntity factories (Relation and Way), and possibly from the DenseNodeIterator that generates the NodeEntity. A sketch of such an enricher follows.
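
This sketch assumes the generated StringTable exposes its repeated `s` field as a sequence of ByteStrings (as in osmformat.proto); the names are illustrative:

    // Sketch: an implicit "enricher" that centralizes decoding of string table
    // entries, instead of repeating the UTF-8 decoding in every factory.
    import com.google.protobuf.ByteString

    object StringTableEnricherSketch {
      implicit class RichStringTable(val entries: Seq[ByteString]) extends AnyVal {
        // Decode one entry of the string table as UTF-8.
        def stringAt(index: Int): String = entries(index).toStringUtf8
      }
    }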
