
magellan's Introduction

Magellan: Geospatial Analytics Using Spark


Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries.

The application developer writes standard SQL or DataFrame queries to evaluate geometric expressions, while the execution engine takes care of efficiently laying out data in memory during query processing, picking the right query plan, and optimizing query execution with cheap and efficient spatial indices, all while presenting a declarative abstraction to the developer.

Magellan is the first library to extend Spark SQL to provide a relational abstraction for geospatial analytics. I see it as an evolution of geospatial analytics engines into the emerging world of big data: it provides developer-friendly abstractions that can be leveraged by anyone who understands or uses Apache Spark, while showcasing an execution engine that is state of the art for geospatial analytics on big data.

Version Release Notes

You can find notes on the various released versions here

Linking

You can link against the latest release using the following coordinates:

groupId: harsha2010
artifactId: magellan
version: 1.0.5-s_2.11

Requirements

v1.0.5 requires Spark 2.1+ and Scala 2.11

Capabilities

The library currently supports reading the following formats:

  • Shapefile
  • GeoJSON
  • OSM-XML

We aim to support the full suite of OpenGIS Simple Features for SQL spatial predicate functions and operators, together with additional topological functions.

The following geometries are currently supported:

Geometries:

  • Point
  • LineString
  • Polygon
  • MultiPoint
  • MultiPolygon (treated as a collection of Polygons and read in as a row per polygon by the GeoJSON reader)

The following predicates are currently supported:

  • Intersects
  • Contains
  • Within

The following languages are currently supported:

  • Scala

Reading Data

You can read Shapefile formatted data as follows:

val df = sqlCtx.read.
  format("magellan").
  load(path)
  
df.show()

+-----+--------+--------------------+--------------------+-----+
|point|polyline|             polygon|            metadata|valid|
+-----+--------+--------------------+--------------------+-----+
| null|    null|Polygon(5, Vector...|Map(neighborho ->...| true|
| null|    null|Polygon(5, Vector...|Map(neighborho ->...| true|
| null|    null|Polygon(5, Vector...|Map(neighborho ->...| true|
| null|    null|Polygon(5, Vector...|Map(neighborho ->...| true|
+-----+--------+--------------------+--------------------+-----+

df.select(df("metadata")("neighborho")).show()

+--------------------+
|metadata[neighborho]|
+--------------------+
|Twin Peaks       ...|
|Pacific Heights  ...|
|Visitacion Valley...|
|Potrero Hill     ...|
+--------------------+

To read data in GeoJSON format, pass in the type as geojson during load, as follows:

val df = sqlCtx.read.
  format("magellan").
  option("type", "geojson").
  load(path)

Scala API

Magellan is hosted on Spark Packages

When launching the Spark Shell, Magellan can be included like any other spark package using the --packages option:

> $SPARK_HOME/bin/spark-shell --packages harsha2010:magellan:1.0.5-s_2.11

A few common packages you might want to import when working with Magellan:

import magellan.{Point, Polygon}
import org.apache.spark.sql.magellan.dsl.expressions._
import org.apache.spark.sql.types._

Data Structures

Point

val points = sc.parallelize(Seq((-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0))).toDF("x", "y").select(point($"x", $"y").as("point"))

points.show()

+-----------------+
|            point|
+-----------------+
|Point(-1.0, -1.0)|
| Point(-1.0, 1.0)|
| Point(1.0, -1.0)|
+-----------------+
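
Points can also be created from a case class and converted to a DataFrame, mirroring the Polygon example below (PointRecord here is just an illustrative name):

case class PointRecord(point: Point)

// build the same three points as above, this time through a case class
val pointRecords = sc.parallelize(Seq(
    PointRecord(Point(-1.0, -1.0)),
    PointRecord(Point(-1.0, 1.0)),
    PointRecord(Point(1.0, -1.0))
  )).toDF()

pointRecords.show()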

Polygon

case class PolygonRecord(polygon: Polygon)

val ring = Array(Point(1.0, 1.0), Point(1.0, -1.0),
 Point(-1.0, -1.0), Point(-1.0, 1.0),
 Point(1.0, 1.0))
val polygons = sc.parallelize(Seq(
    PolygonRecord(Polygon(Array(0), ring))
  )).toDF()
  
polygons.show()

+--------------------+
|             polygon|
+--------------------+
|Polygon(5, Vector...|
+--------------------+
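
A PolyLine (the LineString geometry listed above) can be built the same way. This is a minimal sketch that assumes PolyLine's constructor mirrors Polygon's (an array of segment start indices plus the points array), which is not shown explicitly in the examples above:

import magellan.PolyLine

case class PolyLineRecord(polyline: PolyLine)

// a simple three-point line
val line = Array(Point(-1.0, -1.0), Point(1.0, -1.0), Point(1.0, 1.0))
val polylines = sc.parallelize(Seq(
    PolyLineRecord(PolyLine(Array(0), line)) // assumed signature: (indices, points)
  )).toDF()

polylines.show()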

Predicates

within

points.join(polygons).where($"point" within $"polygon").show()

intersects

points.join(polygons).where($"point" intersects $"polygon").show()

+-----------------+--------------------+
|            point|             polygon|
+-----------------+--------------------+
|Point(-1.0, -1.0)|Polygon(5, Vector...|
| Point(-1.0, 1.0)|Polygon(5, Vector...|
| Point(1.0, -1.0)|Polygon(5, Vector...|
+-----------------+--------------------+

contains

Since contains is an overloaded expression (contains is used for checking String containment by Spark SQL), Magellan uses the Binary Expression >? for checking shape containment.

points.join(polygons).where($"polygon" >? $"polygon").show()

A Databricks notebook with similar examples is published here for convenience.

Spatial indexes

Starting with v1.0.5, Magellan supports spatial indexes. The only spatial index currently supported is the so called Z-order curve.

Given a column of shapes, one can index the shapes to a given precision using a geohash indexer by doing the following:

df.withColumn("index", $"polygon" index 30)

This produces a new column called index, which is a list of Z-order curves of precision 30 that, taken together, cover the polygon.

Creating Indexes while loading data

The Spatial Relations (GeoJSON, Shapefile, OSM-XML) all have the ability to automatically index the geometries while loading them.

To turn this feature on, pass in the parameter magellan.index = true and optionally a value for magellan.index.precision (default = 30) while loading the data as follows:

spark.read.format("magellan")
  .option("magellan.index", "true")
  .option("magellan.index.precision", "25")
  .load(s"$path")

This creates an additional column called index which holds the list of Z-order curves of the given precision that cover each geometry in the dataset.

Spatial Joins

Magellan leverages Spark SQL and has support for joins by default. However, these joins are by default not aware that the columns are geometric so a join of the form

  points.join(polygons).where($"point" within $"polygon")

will be treated as a Cartesian join followed by a predicate. In some cases, especially when the polygon dataset is small (O(100-10000) polygons), this is fast enough. However, when the number of polygons is much larger than that, you will need spatial joins to scale this computation.

To enable spatial joins in Magellan, add a spatial join rule to Spark by injecting the following code before the join:

  magellan.Utils.injectRules(spark)

Furthermore, during the join, you will need to provide Magellan with a hint of the precision at which to create indices for the join.

You can do this by annotating either of the dataframes involved in the join by providing a Spatial Join Hint as follows:

val indexedDf = df.index(30) // after load, or
val df = spark.read.format(...).load(..).index(30) // during load

Then a join of the form

  points.join(polygons).where($"point" within $"polygon") // or
  
  points.join(polygons index 30).where($"point" within $"polygon")

automatically uses indexes to speed up the join.
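
Putting the pieces together, here is a minimal end-to-end sketch of an indexed spatial join, using only the APIs shown above (injectRules, index and within); points and path are assumed to come from the earlier examples:

import org.apache.spark.sql.magellan.dsl.expressions._
import spark.implicits._

// register Magellan's spatial join rule with Spark
magellan.Utils.injectRules(spark)

// load the polygons and hint that precision-30 indices should be used for the join
val polygons = spark.read
  .format("magellan")
  .load(path)
  .index(30)

// the optimizer can now plan an index-assisted join instead of a Cartesian join
points.join(polygons).where($"point" within $"polygon").show()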

Developer Channel

Please visit Gitter to discuss Magellan, obtain help from developers or report issues.

Magellan Blog

For more details on Magellan and thoughts around Geospatial Analytics and the optimizations chosen for this project, please visit my blog

magellan's People

Contributors

charmatzis, halfabrane, harsha2010, jasonmwhite, joshrosen, lagerspetz, njordan72, perados, ryancw, wtzuan


magellan's Issues

Magellan outside Spark

Hi,
Magellan seems to be the right library for geospatial processing. Is there support for using this outside Spark, say in Flink, or as a standalone library?

Varaga

Geojson import does not work on Spark 1.6

Hi Guys,

I was trying the new Library and it does not work on Spark 1.6

My Steps:

./spark-shell --jars /Users/jorge/Downloads/postgresql-9.4.1208.jar --driver-memory 4G --driver-cores 2 --packages harsha2010:magellan:1.0.3-s_2.10,com.databricks:spark-csv_2.11:1.4.0,com.databricks:spark-avro_2.10:2.0.1

import org.apache.spark.sql.{Row, SQLContext}
val mageldf = sqlContext.read.format("magellan").option("type","geojson").load("/Users/jorge/Downloads/magellan/src/test/resources/geojson/polygon/example.geojson")

scala> mageldf.show
16/04/01 09:12:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row
.....

Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row

zipimport.ZipImportError: can't find module 'magellan'

I'm on spark 1.5. Not sure why I get below error after launching pyspark with magellan package:

# pyspark --packages harsha2010:magellan:1.0.1-s_2.10
Python 3.5.0 (default, Nov  3 2015, 11:48:43)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
harsha2010#magellan added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found harsha2010#magellan;1.0.1-s_2.10 in spark-packages
    found commons-io#commons-io;2.4 in central
    found com.esri.geometry#esri-geometry-api;1.2.1 in central
    found org.json#json;20090211 in central
    found org.codehaus.jackson#jackson-core-asl;1.9.12 in central
downloading http://dl.bintray.com/spark-packages/maven/harsha2010/magellan/1.0.1-s_2.10/magellan-1.0.1-s_2.10.jar ...
    [SUCCESSFUL ] harsha2010#magellan;1.0.1-s_2.10!magellan.jar (91ms)
:: resolution report :: resolve 1307ms :: artifacts dl 97ms
    :: modules in use:
    com.esri.geometry#esri-geometry-api;1.2.1 from central in [default]
    commons-io#commons-io;2.4 from central in [default]
    harsha2010#magellan;1.0.1-s_2.10 from spark-packages in [default]
    org.codehaus.jackson#jackson-core-asl;1.9.12 from central in [default]
    org.json#json;20090211 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   5   |   1   |   1   |   0   ||   5   |   1   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    1 artifacts copied, 4 already retrieved (208kB/8ms)
15/12/03 18:44:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/03 18:44:06 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
15/12/03 18:44:07 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 3.5.0 (default, Nov  3 2015 11:48:43)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from magellan.types import Point, Polygon
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zipimport.ZipImportError: can't find module 'magellan'

val magellanContext = new MagellanContext(sc): No Class

When I enter val magellanContext = new MagellanContext(sc)

I receive the following error:

java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/DataSourceStrategy$
at org.apache.spark.sql.magellan.MagellanContext$$anon$1.(MagellanContext.scala:35)
at org.apache.spark.sql.magellan.MagellanContext.(MagellanContext.scala:32)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:67)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:72)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:74)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:76)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:78)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:80)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:82)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:84)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:86)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:88)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:90)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:92)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:94)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:96)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:98)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:100)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:102)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:104)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:106)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:108)
at $iwC$$iwC$$iwC$$iwC.(:110)
at $iwC$$iwC$$iwC.(:112)
at $iwC$$iwC.(:114)
at $iwC.(:116)
at (:118)
at .(:122)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Beforehand I had imported:

import magellan.{Point, Polygon, PolyLine}
import magellan.coord.NAD83
import org.apache.spark.sql.magellan.MagellanContext
import org.apache.spark.sql.magellan.dsl.expressions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

read.format("magellan") throws scala.NotImplementedError: an implementation is missing

I have tried to load many shape files for NYC. Although some of the files worked fine, others threw scala.NotImplementedError: an implementation is missing. I loaded the same files into PostGIS, and it worked fine, so I am sure the files are not corrupted.

Would like to know the cause of the problem, and to solve the issue if possible.

Spark breaks down when running points.show()

Hello,
I am trying to run test command points.show() like in the introduction page but then spark stops working and exits spark-shell.
Below is the whole log:
16/08/23 13:26:20 INFO spark.SparkContext: Starting job: show at :37
16/08/23 13:26:20 INFO scheduler.DAGScheduler: Got job 0 (show at :37) with 1 output partitions
16/08/23 13:26:20 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (show at :37)
16/08/23 13:26:20 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/08/23 13:26:20 INFO scheduler.DAGScheduler: Missing parents: List()
16/08/23 13:26:20 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at show at :37), which has no missing parents
16/08/23 13:26:20 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 5.5 KB, free 5.5 KB)
16/08/23 13:26:20 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.0 KB, free 8.6 KB)
16/08/23 13:26:20 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:52704 (size: 3.0 KB, free: 517.4 MB)
16/08/23 13:26:20 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/08/23 13:26:20 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[4] at show at :37)
16/08/23 13:26:20 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/08/23 13:26:20 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2634 bytes)
16/08/23 13:26:20 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/08/23 13:26:20 INFO executor.Executor: Fetching http://172.31.42.1:56863/jars/org.codehaus.jackson_jackson-core-asl-1.9.12.jar with timestamp 1471958708517
16/08/23 13:26:20 INFO util.Utils: Fetching http://172.31.42.1:56863/jars/org.codehaus.jackson_jackson-core-asl-1.9.12.jar to /tmp/spark-4c5e63ff-09dc-4f32-afec-350e0c244bb9/userFiles-b35893ec-53c1-4d25-b0f8-c1fd86430e74/fetchFileTemp2531640319641615074.tmp
16/08/23 13:26:21 INFO executor.Executor: Adding file:/tmp/spark-4c5e63ff-09dc-4f32-afec-350e0c244bb9/userFiles-b35893ec-53c1-4d25-b0f8-c1fd86430e74/org.codehaus.jackson_jackson-core-asl-1.9.12.jar to class loader
16/08/23 13:26:21 INFO executor.Executor: Fetching http://172.31.42.1:56863/jars/com.esri.geometry_esri-geometry-api-1.2.1.jar with timestamp 1471958708510
16/08/23 13:26:21 INFO util.Utils: Fetching http://172.31.42.1:56863/jars/com.esri.geometry_esri-geometry-api-1.2.1.jar to /tmp/spark-4c5e63ff-09dc-4f32-afec-350e0c244bb9/userFiles-b35893ec-53c1-4d25-b0f8-c1fd86430e74/fetchFileTemp7329359299396583743.tmp
16/08/23 13:26:21 INFO executor.Executor: Adding file:/tmp/spark-4c5e63ff-09dc-4f32-afec-350e0c244bb9/userFiles-b35893ec-53c1-4d25-b0f8-c1fd86430e74/com.esri.geometry_esri-geometry-api-1.2.1.jar to class loader
16/08/23 13:26:21 INFO executor.Executor: Fetching http://172.31.42.1:56863/jars/commons-io_commons-io-2.4.jar with timestamp 1471958708509
16/08/23 13:26:21 INFO util.Utils: Fetching http://172.31.42.1:56863/jars/commons-io_commons-io-2.4.jar to /tmp/spark-4c5e63ff-09dc-4f32-afec-350e0c244bb9/userFiles-b35893ec-53c1-4d25-b0f8-c1fd86430e74/fetchFileTemp5867381645425744231.tmp
16/08/23 13:26:21 INFO executor.Executor: Adding file:/tmp/spark-4c5e63ff-09dc-4f32-afec-350e0c244bb9/userFiles-b35893ec-53c1-4d25-b0f8-c1fd86430e74/commons-io_commons-io-2.4.jar to class loader
16/08/23 13:26:21 INFO executor.Executor: Fetching http://172.31.42.1:56863/jars/harsha2010_magellan-1.0.3-s_2.10.jar with timestamp 1471958708508
16/08/23 13:26:21 INFO util.Utils: Fetching http://172.31.42.1:56863/jars/harsha2010_magellan-1.0.3-s_2.10.jar to /tmp/spark-4c5e63ff-09dc-4f32-afec-350e0c244bb9/userFiles-b35893ec-53c1-4d25-b0f8-c1fd86430e74/fetchFileTemp2704526544117072793.tmp
16/08/23 13:26:21 INFO executor.Executor: Adding file:/tmp/spark-4c5e63ff-09dc-4f32-afec-350e0c244bb9/userFiles-b35893ec-53c1-4d25-b0f8-c1fd86430e74/harsha2010_magellan-1.0.3-s_2.10.jar to class loader
16/08/23 13:26:21 INFO executor.Executor: Fetching http://172.31.42.1:56863/jars/org.json_json-20090211.jar with timestamp 1471958708516
16/08/23 13:26:21 INFO util.Utils: Fetching http://172.31.42.1:56863/jars/org.json_json-20090211.jar to /tmp/spark-4c5e63ff-09dc-4f32-afec-350e0c244bb9/userFiles-b35893ec-53c1-4d25-b0f8-c1fd86430e74/fetchFileTemp8186701820175920799.tmp
16/08/23 13:26:21 INFO executor.Executor: Adding file:/tmp/spark-4c5e63ff-09dc-4f32-afec-350e0c244bb9/userFiles-b35893ec-53c1-4d25-b0f8-c1fd86430e74/org.json_json-20090211.jar to class loader
16/08/23 13:26:21 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.AbstractMethodError: org.apache.spark.sql.catalyst.expressions.Expression.genCode(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeGenContext;Lorg/apache/spark/sql/catalyst/expressions/codegen/GeneratedExpressionCode;)Ljava/lang/String;
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:104)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:100)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.expressions.Expression.gen(Expression.scala:100)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:459)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:459)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext.generateExpressions(CodeGenerator.scala:459)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:281)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:324)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.generate(GenerateUnsafeProjection.scala:313)
at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:151)
at org.apache.spark.sql.execution.Project$$anonfun$1.apply(basicOperators.scala:47)
at org.apache.spark.sql.execution.Project$$anonfun$1.apply(basicOperators.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/08/23 13:26:21 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.AbstractMethodError: org.apache.spark.sql.catalyst.expressions.Expression.genCode(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeGenContext;Lorg/apache/spark/sql/catalyst/expressions/codegen/GeneratedExpressionCode;)Ljava/lang/String;
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:104)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:100)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.expressions.Expression.gen(Expression.scala:100)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:459)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:459)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext.generateExpressions(CodeGenerator.scala:459)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:281)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:324)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.generate(GenerateUnsafeProjection.scala:313)
at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:151)
at org.apache.spark.sql.execution.Project$$anonfun$1.apply(basicOperators.scala:47)
at org.apache.spark.sql.execution.Project$$anonfun$1.apply(basicOperators.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/08/23 13:26:21 INFO spark.SparkContext: Invoking stop() from shutdown hook
16/08/23 13:26:21 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AbstractMethodError: org.apache.spark.sql.catalyst.expressions.Expression.genCode(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeGenContext;Lorg/apache/spark/sql/catalyst/expressions/codegen/GeneratedExpressionCode;)Ljava/lang/String;
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:104)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:100)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.expressions.Expression.gen(Expression.scala:100)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:459)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:459)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext.generateExpressions(CodeGenerator.scala:459)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:281)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:324)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.generate(GenerateUnsafeProjection.scala:313)
at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:151)
at org.apache.spark.sql.execution.Project$$anonfun$1.apply(basicOperators.scala:47)
at org.apache.spark.sql.execution.Project$$anonfun$1.apply(basicOperators.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/sql,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/json,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/08/23 13:26:21 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/08/23 13:26:21 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/08/23 13:26:21 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/08/23 13:26:21 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
16/08/23 13:26:21 INFO scheduler.DAGScheduler: ResultStage 0 (show at :37) failed in 0.763 s
16/08/23 13:26:21 INFO scheduler.DAGScheduler: Job 0 failed: show at :37, took 0.929117 s
16/08/23 13:26:21 INFO ui.SparkUI: Stopped Spark web UI at http://172.31.42.1:4040
16/08/23 13:26:21 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/08/23 13:26:21 INFO storage.MemoryStore: MemoryStore cleared
16/08/23 13:26:21 INFO storage.BlockManager: BlockManager stopped
16/08/23 13:26:21 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/08/23 13:26:21 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/08/23 13:26:21 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/08/23 13:26:21 INFO spark.SparkContext: Successfully stopped SparkContext
16/08/23 13:26:21 INFO util.ShutdownHookManager: Shutdown hook called
16/08/23 13:26:21 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-ffcbe845-e900-4c61-99a8-e90314b2c77c
16/08/23 13:26:21 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/08/23 13:26:21 ERROR util.ShutdownHookManager: Exception while deleting Spark temp dir: /tmp/spark-ffcbe845-e900-4c61-99a8-e90314b2c77c
java.io.IOException: Failed to delete: /tmp/spark-ffcbe845-e900-4c61-99a8-e90314b2c77c
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:928)
at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65)
at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:267)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:239)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:239)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:239)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:239)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:239)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:239)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:239)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:218)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
16/08/23 13:26:21 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-4c5e63ff-09dc-4f32-afec-350e0c244bb9
16/08/23 13:26:21 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-794081a7-8d13-48ba-ae14-105ca6e41fd6
16/08/23 13:26:21 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-4c5e63ff-09dc-4f32-afec-350e0c244bb9/httpd-adbe66d4-9643-4729-b9cb-6325cd5bb25a
16/08/23 13:26:21 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

Can someone please give me a hint what could be a problem?

Thanks!

Can magellan support Spark2.0?

I have tried running Magellan with Spark 2.0. It seems to fail with something like an UnsafeRow error.
Is this expected, or am I not using Magellan correctly?

Can magellan support Spark1.4.1?

I have tried running Magellan with Spark 1.4.1. It seems to fail with something like an InternalRow error.

When I enter val df = sqlContext.read.format("magellan").load(path)
I receive the following error:
java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/InternalRow
at magellan.SpatialRelation$class.$init$(SpatialRelation.scala:29)
at magellan.ShapeFileRelation.(ShapefileRelation.scala:32)
at magellan.DefaultSource.createRelation(DefaultSource.scala:37)
at magellan.DefaultSource.createRelation(DefaultSource.scala:30)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:269)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:46)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:48)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:50)
at $iwC$$iwC$$iwC$$iwC.(:52)
at $iwC$$iwC$$iwC.(:54)
at $iwC$$iwC.(:56)
at $iwC.(:58)
at (:60)
at .(:64)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.InternalRow
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 57 more

Beforehand I had imported:
import org.apache.spark.sql.catalyst

spark-shell example not working

Hi,

When I try to run through the examples on the wiki in spark-shell of Spark 1.5 (prebuilt with Hadoop 1.4), I am getting the following errors:
scala> val points = sc.parallelize(Seq((-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0))).toDF("x", "y").select(point($"x", $"y").as("point"))
points: org.apache.spark.sql.DataFrame = [point: poin]

scala>

scala> points.show()
16/02/12 18:06:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.AbstractMethodError: org.apache.spark.sql.catalyst.expressions.Expression.genCode(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeGenContext;Lorg/apache/spark/sql/catalyst/expressions/codegen/GeneratedExpressionCode;)Ljava/lang/String;
at org.apache.spark.sql.catalyst.expressions.Expression.gen(Expression.scala:98)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$$anonfun$1.apply(GenerateMutableProjection.scala:46)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$$anonfun$1.apply(GenerateMutableProjection.scala:43)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

Is the support for Spark 1.5 not yet available? I see a related topic in the issues list but wasn't sure whether that has been resolved.

Thanks,
Trang

Runtime issue running with spark 1.6

Could you tell me if magellan is compatible with spark 1.6?

I am running into this issue trying to convert a parsed RDD to DF:

case class Trip(trip_id: String,
probe_id: String,
provider_id: String,
start_time: Date,
start_hour: Integer,
start_point: Point,
trip_mean_speed: Double,
trip_distance_m: Double
)
import sqlContext.implicits._
val tripDF = tripSource.loadMovingTrips(processingDay, sourceUrl).toDF()

Error: Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;

Thanks,
Trang

Create and Push Spark Packages

I have added some scripts to make spark packages and have pushed the first Magellan Spark Package to spark-packages.org
It would be good to independently verify this packaging is working correctly.

Spatial matching query performance

Hi,

I have been trying to execute a within query with Magellan for two weeks, and the query never finishes.

The first dataframe is created using GeoJSON type loader .. A record of the dataset is shown below :-

{"type": "FeatureCollection", "features": [{"geometry": {"type": "Polygon", "coordinates": [[[-83.472, 42.71800120413723], [-83.472, 42.7202513456291]
, [-83.47152052359189, 42.72024826025985], [-83.47157381414216, 42.717998461684836], [-83.472, 42.71800120413723]]]}, "type": "Feature", "id": 0, "pro
perties": {"startTime": 1365674580000.0, "endTime": 1365674940000.0, "colorIndex": 6, "value": 0.2171378}}

The second dataframe contains data from a CSV file .. A case class is created before loading the data using Spark textFile() .. The case class is found below :-

case class CarTraces(deviceId: Long, trip: String, gpsUtcTime: Double, point: Point)

I am executing the query on AWS EC2 m4.xlarge instance .. The two datasets are only a sample, geoJson file is 200 MB, and the other one is only 80 MB ..

the query executed is the following :-

cars.join(radars).where($"point" within $"polygon").cache()

Although the show() call works fine on the resulting dataframe, whenever I call collect() or write() the execution keeps going and never ends.

I will appreciate any suggestion to resolve the problem, since I have been stuck on this issue for two weeks.

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row

Hi,

I receive the following error when trying to read a .shp folder:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row

I read that it is related to the magellan package trying to use "GenericMutableRow", which is no longer supported in recent versions of Spark. Is that the case? Any prospects of updating it?

Thank you

Add tests for Python APIs

This library should test its Python APIs. Note that Codecov has Python integration, so we can also measure coverage there.

Explaining how to use this library for standalone application

Hello,

I have been trying to use this library in standalone applications, but I ran into the problem of an undefined sc context. When I run pyspark/scala with the package from the command line everything works, but for standalone applications the context is not imported automatically.

I have tried initializing with SQLContext, but then it reports that sc does not have a parallelize method.

Can someone show me how to use this library in a standalone application?

Thanks!

python API

Hey,
after seeing the Europe Spark Summit lecture on Magellan, you said a Python API would be available.

Is there still a plan to add it?
Thanks,

Support Spatial Joins

Currently we override TableScan and are not able to take advantage of Catalyst optimizer to use Spatial Joins.
Some work is needed to refactor the SpatialRelation to override CatalystScan and use custom strategies to induce spatial joins on predicates that act on geometries.

Python API Spark 1.5.2 Magellan 1.0.4 Snapshot on Databricks not working

Hi there! I compiled Magellan from this branch #46 (halfabrane:MAGELLAN-SPARK1.6 / 1.0.4 SNAPSHOT).

I am using Databricks, and when I try to import modules from magellan it fails. The Scala version works fine.

ImportError: No module named magellan.types
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-b8ee42532bbb> in <module>()
----> 1 from magellan.types import Point, Polygon
      2 from pyspark.sql import Row, SQLContext
      3 import pyspark.sql.types
      4 import pyspark.sql.functions

ImportError: No module named magellan.types

I have uploaded the dependencies


Support for MultiPolygon

Currently Multi-Polygons are not supported in the GeoJSON reader.

Would it make sense to implement this by breaking down into multiple polygon records, or better to add a first-class MultiPolygon to the output schema?

Dependency (harsha2010:magellan:1.0.3-s_2.10) not found in any repo while building a jar through gradle

Hi there. I am trying to build a spark scala application using magellan. I am trying to build a jar but the dependency for this package, in my gradle build is not being found. Hence I can't compile and build the jar. I am using the following coordinates:

groupId: harsha2010
artifactId: magellan
version: 1.0.3-s_2.10

Snippet from build.gradle:
compile "harsha2010:magellan:1.0.3-s_2.10"

The repos that I have tried are standard gradle (mavenCentral) plus "http://maven.restlet.org".
Please advise. Thanks a lot.

spatial index?

Dear All @harsha2010

Does the system support spatial indexes, i.e., R-tree or Quadtree, at the RDD level?
Meanwhile, does Magellan support spatial kNN joins efficiently?

I am working on a spatial data management system based on Spark named LocationSpark (https://github.com/merlintang/SpatialSpark), which provides spatial indexes and other spatial functionality to speed up queries at the RDD level. It can deliver a two-order-of-magnitude speedup over naive Spark RDDs for spatial joins and kNN joins. Because Magellan works with Spark SQL, I am wondering whether it is possible to combine LocationSpark with Magellan?

thanks and best

Mingjie

Reader for OSM data

Open Street Map provides extensive and detailed public domain mapping data. It would be useful to be able to work with data from this source in Magellan.

The primary format for exports of OSM data is an xml based format:
http://wiki.openstreetmap.org/wiki/OSM_XML

The planet.osm export in XML format is about 45G under bzip2 compression, making it hard to parse with a single-file method, and hard to split, since XML elements cross line boundaries in the canonical format produced by many of the tools, preventing splitting by lines.

There is also a binary format which is based on Protocol Buffers:
http://wiki.openstreetmap.org/wiki/PBF_Format

This weighs in around 29G. It may be possible to extend existing protocol buffer input formats to work with PBF format, and use the dataframes API to join the nodes, ways and relations into a more magellan friendly format.

Both formats work on a schema based on a set of nodes (points with meta data), ways (sequences of nodes) and relations (collections of the above with additional metadata, also used to produce things like boundary polygons).

Many people use selected extracts in OSM or PBF format, making this a useful reader to include in magellan.

Find nearest points within certain radius

Hello!

I am using pyspark.
I have a dataframe of points, similar to the one provided you in wiki, but much bigger:

points = sc.parallelize([
  (0, Point(-1.0, -1.0)),
  (1, Point(-1.0, 1.0)),
  (2, Point(1.0, -1.0))])\
.map(lambda x: PointRecord(*x))\
.toDF()

What code should I write to find all points within a given radius r around a given point A with coordinates (x, y)?

Thank you in advance!

Query for proximity search in a radius to find n-nearest

Is it possible with the current API (I didn't find a way yet), given a Lat/Lon (WGS84) point "A" and a set of similar coordinates, find for a specific radius (e.g.: 1 Mile) the first 100 nearest points around "A"?

In case it helps, in Mongo would be similar to:
db.myData.find({ "location.coordinates" : { $near : { $geometry: { type: "Point", coordinates: [ 52.4702062,-0.2836129 ] }, $minDistance: 0, $maxDistance: 1600 } }, limit:100 })

Support common geospatial operators

Currently, intersection and shape assembly are the only supported geospatial operators.
Other common ones include:
Union, Distance, Intersection, Symmetric Difference, Convex Hull, Envelope, Buffer, Simplify, Polygon Assembly, Valid, Area, Length

JDBC connection to PostGIS throws *unsupportedType*

SparkSQL JDBC connection fails to retrieve a table with Geometry type. Please consider the following scenario :

I have a Database named "nycesri" in my postgres, with GIS Extension. It contains two tables, one with only primitive types named spatial_ref_sys, the other table named ny_counties_clip has a field of Geometry type.

I have launched spark_shell with PostgresJDBC.jar in the classPath with the following command :

SPARK_CLASSPATH=/home/mustafa/Desktop/postgresql-9.3-1104.jdbc41.jar ./spark-shell

Then, I have loaded the first table successfully into a SparkDF using the following query :

val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:postgresql:nycesri",
"dbtable" -> "spatial_ref_sys")).load()

However, when I have tried to load the other table, which contains the Geometry type using the following query :

val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:postgresql:nycesri",
"dbtable" -> "ny_counties_clip")).load()

I have received java.sql.SQLException: Unsupported type 1111

Upgrade to Spark 1.5

Spark 1.5 adds a null safe eval and codegen methods to Expressions, and also turns codegen on by default.
Currently these are breaking changes at runtime that don't allow Magellan compiled on Spark 1.4 to run on Spark Master. When Spark 1.5 comes out, we need to upgrade Magellan so that it can run on Spark 1.5.

Python API - Polygon creation and Within function

Hello,

I use Magellan and Spark 1.6.2

First, about Polygon creation, let's take the following example in Python (simplified from the Wiki):

from magellan.types import Point, Polygon

square = [Point(1.0, 1.0), Point(1.0, -1.0), Point(-1.0, -1.0), Point(-1.0, 1.0), Point(1.0, 1.0)]
list_polygons = [Polygon([0], square)]
polygons = sc.parallelize(list_polygons).toDF()
polygons.show()

If I use an older release of Magellan (1.0.1-s_2.10), I get the correct result:

+-----------+-------+--------------------+
|_shape_type|indices| points|
+-----------+-------+--------------------+
| 5| [0]|[[1,1.0,1.0], [1,...|
+-----------+-------+--------------------+

However, with the current release of Magellan (1.0.3-s_2.10) I get incorrect null values:

+-----------+-------+--------------------+
|_shape_type|indices| points|
+-----------+-------+--------------------+
| 5| [0]|[null, null, null...|
+-----------+-------+--------------------+

The second question is: how do you use the WITHIN operator in the Python API? Please add a corresponding example to the wiki.

Thanks in advance,
Camelia

Compile issue

Hi,

I'm running into a compile issue pulling in magellan as a Maven dependency:

<dependency>
  <groupId>harsha2010</groupId>
  <artifactId>magellan</artifactId>
  <version>1.0.3-s_2.10</version>
</dependency>

[ERROR] error: missing or invalid dependency detected while loading class file 'Shape.class'.
[INFO] Could not access term esri in package com,
[INFO] because it (or its dependencies) are missing. Check your build definition for
[INFO] missing or conflicting dependencies. (Re-run with -Ylog-classpath to see the problematic classpath.)
[INFO] A full rebuild may help if 'Shape.class' was compiled against an incompatible version of com.
[ERROR] error: missing or invalid dependency detected while loading class file 'Shape.class'.

I also tried hooking in the dependency on esri but it didn't solve the problem.

<dependency>
  <groupId>com.esri.geometry</groupId>
  <artifactId>esri-geometry-api</artifactId>
  <version>1.2.1</version>
</dependency>

Any suggestions would be appreciated.

Thanks,
Trang

Project still maintained?

Hi all,

I was wondering if the project is still being maintained, since the last commit was two or three months ago and there are outstanding PRs.

Thank you

Update on Magellan 1.0.4?

Can you give an update on Magellan 1.0.4? I have it working on HDP 2.3.2 (Spark 1.4.1) but would very much like to get it working on HDP 2.4 (Spark 1.6).

Thanks, Eoin

Polygon within Polygon not Working as Expected

I am attempting to join polygons of US states with polygons of countries. The only result I get is that Hawaii is in "United States Minor Outlying Islands." When I tried joining points against each of these shapefiles, it appeared to work fine.

States Shapefile: http://www.arcgis.com/home/item.html?id=b07a9393ecbd430795a6f6218443dccc

Countries Shapefile: http://thematicmapping.org/downloads/TM_WORLD_BORDERS-0.3.zip

Code used (all HDFS paths contain the appropriate contents of the files above):

val countries = magellanContext.read.format("magellan").
  load("/data/Countries").
  select($"polygon".as("country_polygon"), $"metadata".as("country_metadata"))

val states = magellanContext.read.format("magellan").
  load("/data/States").
  select($"polygon".as("state_polygon"), $"metadata".as("state_metadata"))

countries.join(states, states("state_polygon") within countries("country_polygon"))

// also tried the code below, with the same result

states.
  join(countries).
  where($"country_polygon" within $"state_polygon")

Thank you!
Steve Varner
@SteveVarner

ClassNotFoundException: magellan.PointUDT

With Spark 1.4.1 and magellan 1.0.3, when I do

prova = udf(lambda x,y: Point(x,y), Point())

I have

java.lang.ClassNotFoundException: magellan.PointUDT
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:225)
at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:166)
at org.apache.spark.sql.SQLContext.parseDataType(SQLContext.scala:988)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

Is this a bug? Thanks.

What's the plan of Magellan for Spark 2.0

Hi guys, I am using Spark 2.0 and I am dealing with geospatial data. It seems that Magellan does not support that version of Spark, and I can't find any other libraries. I was wondering what the plan is for Magellan going forward: is another version compatible with Spark 2.0 going to be released? When?
Thanks for any information.

NAD83 UTM ZONE 18N Transformer

I am working with data from NYC; the shapefile was projected using the following system:

PROJCS["NAD83 / UTM zone 18N",
GEOGCS["NAD83",
DATUM["North_American_Datum_1983",
SPHEROID["GRS 1980",6378137,298.257222101,
AUTHORITY["EPSG","7019"]],
AUTHORITY["EPSG","6269"]],
PRIMEM["Greenwich",0,
AUTHORITY["EPSG","8901"]],
UNIT["degree",0.01745329251994328,
AUTHORITY["EPSG","9122"]],
AUTHORITY["EPSG","4269"]],
UNIT["metre",1,
AUTHORITY["EPSG","9001"]],
PROJECTION["Transverse_Mercator"],
PARAMETER["latitude_of_origin",0],
PARAMETER["central_meridian",-75],
PARAMETER["scale_factor",0.9996],
PARAMETER["false_easting",500000],
PARAMETER["false_northing",0],
AUTHORITY["EPSG","26918"],
AXIS["Easting",EAST],
AXIS["Northing",NORTH]]

I do not know how to transform GPS coordinates to this datum.

Can anyone help me with this, or at least give me a direction on how to implement it?
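
One possible direction, sketched with the proj4j library rather than anything built into Magellan: transform WGS84 lon/lat into EPSG:26918 (NAD83 / UTM zone 18N), the authority code named in the projection string above.

import org.osgeo.proj4j.{CRSFactory, CoordinateTransformFactory, ProjCoordinate}

val crsFactory = new CRSFactory()
val wgs84 = crsFactory.createFromName("EPSG:4326")
val utm18n = crsFactory.createFromName("EPSG:26918")

val transform = new CoordinateTransformFactory().createTransform(wgs84, utm18n)

val src = new ProjCoordinate(-73.9857, 40.7484) // lon, lat (illustrative point)
val dst = new ProjCoordinate()
transform.transform(src, dst) // dst.x = easting, dst.y = northing, in metres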

Within function in magellan pyspark

Hi all,

I am trying to use the within function from PySpark SQL, but it's throwing an error. My SQL query is:

df = sqlContext.sql('select * from tablename where Point(32.000034, 94.003453) within polygoncolumnname')

I found sample code for Scala like the following:

val joined = neighborhoods.
  join(uberTransformed).
  where($"nad83" within $"polygon").
  select($"tripId", $"timestamp", explode($"metadata").as(Seq("k", "v"))).
  withColumnRenamed("v", "neighborhood").
  drop("k")

Please help me with how to use it from PySpark.

Read geospatial data straight from a database

I know magellan now supports reading straight from a file, as in

sqlCtx.read.format("magellan").load(path)

but unfortunately my data is saved in a MySQL table.
Is there an option to read straight from a database?
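
Magellan's data source only reads files; below is a sketch (URL, table and credentials are placeholders) of pulling the rows out of MySQL with Spark's generic JDBC source instead. Geometry stored as text (e.g. WKT) would still have to be parsed into Magellan shapes afterwards.

val df = sqlCtx.read.format("jdbc").
  option("url", "jdbc:mysql://localhost:3306/mydb").
  option("dbtable", "my_geo_table").
  option("user", "user").
  option("password", "password").
  load()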

GeoJSON MultiPolygon support

Hi,

It seems I am hitting a wall here with my GeoJSON consisting of MultiPolygons. If I look at GeoJSONRelation:93, it's clear this piece is not implemented yet.

Are there any plans to deliver this? Can you offer any hints on how to best implement it, assuming it's possible under the framework's architecture?

Cheers,
Lucas

Compile problem

I am running Magellan with Spark 2.0.0, but get an exception when trying the simple code below:

java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, Column 17: No applicable constructor/method found for actual parameters "int, java.lang.Object"

from pyspark.sql import Row
from magellan.types import Point, Polygon

PointRecord = Row("id", "point")
points = sc.parallelize([(0, Point(-1.0, -1.0)), (1, Point(-1.0, 1.0)), (2, Point(1.0, -1.0))]) \
    .map(lambda x: PointRecord(*x)).toDF()
points.show()

Support common geospatial predicates

Currently we support intersects, within and contains (though contains is exposed through an odd >? operator, since contains is a keyword).
Other common predicates include:
Touches, Disjoint, Crosses, Overlaps, Equals, Covers

An example implementation of within is here:
https://github.com/harsha2010/magellan/blob/master/src/main/scala/org/apache/magellan/catalyst/predicates.scala

It delegates to the implementation of within on the Shape class:
https://github.com/harsha2010/magellan/blob/master/src/main/scala/org/apache/magellan/Shape.scala#L176
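
Illustrative only (the method names on Shape are assumptions; see Shape.scala above for the real API): a new predicate such as Disjoint could piggyback on the existing intersects logic, mirroring how Within delegates its evaluation to the Shape class.

// Sketch: evaluate Disjoint as the negation of the existing intersection test.
// (Shape lives in org.apache.magellan in the linked sources; magellan in later versions.)
def disjoint(a: Shape, b: Shape): Boolean = !a.intersects(b)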

Non-UTF-8 encoded strings in .DBF files

Magellan seems to assume that the string attributes in the .dbf file are always UTF-8 encoded.
I am processing some files where this is not true: the strings are ISO-8859-1 encoded (French text with accents), and creating a DataFrame from them yields badly encoded accents, see below:

"Grand domaine hydrog�ologique des formations sableuses du littoral en Artois Picardie et d�p�ts holoc�nes du Quaternaire en Loire-Atlantique et Vend�e"

I am using the current state of the master branch, compiled with Scala 2.10.5, with Spark 1.6.2.

Sample shapefiles to reproduce the issue
Code to reproduce the problem

I did not find any option controlling this encoding when loading the files (maybe I missed something).

My current workaround is to re-encode the strings correctly before loading with Magellan, but it's cumbersome. It would be practical to have an option to specify the encoding, e.g.:

val df = sqlCtx.read.
  format("magellan").
  option("dbf-string-encoding", "iso-8859-1").
  load(path)

and let Magellan transcode to UTF-8.

Thanks in advance for the feedback.

Cannot import magellan.Point in pyspark

I ran magellan with the pyspark shell as the wiki instructs:

bin/pyspark --packages harsha2010:magellan:1.0.1-s_2.10

then ran into

>>> from magellan.types import Point, Polygon
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/rsriharsha/projects/hortonworks/spatialsdk/python/magellan/types.py", line 19, in <module>
    BooleanType = bool
ImportError: No module named shapely.geometry

Looks like the python path is hardcoded somewhere?

Add other functions into magellan

I can see that magellan has touch/within/intersect/intersection in Spark SQL.
Is it possible to add other OpenGIS functions to Spark SQL in similar ways?
