filegdb's Introduction

Spark GDB

In the wake of the unpredictable future of User Defined Types (UDT), this is a hasty, minimalist re-implementation of the spark-gdb project, so that the content of a File GeoDatabase can be mapped to a read-only Spark DataFrame. It is minimalist in that it only supports features with simple geometries (for now :-) with no M or Z values.

In the previous implementation, a GeometryType was defined using the UDT framework. In this implementation, however, points are stored in a field with two sub-fields, x and y. Polylines and polygons are stored as a string in the Esri JSON format. It is not the most efficient format, but it makes interoperability with the ArcGIS API for Python more seamless. Polyline and polygon shapes are stored as two sub-fields, parts and coords. Parts is an array of integers, where each value is the number of points in that part. Coords is an array of doubles holding a flat sequence of x,y pairs.
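
The parts and coords layout described above can be sketched in plain Python. This is a minimal illustration and not code from this project: the helper names split_parts and to_esri_json are hypothetical, and the default wkid of 4326 is an assumption. The "rings" and "paths" keys follow the Esri JSON geometry format.

```python
from typing import List, Tuple


def split_parts(parts: List[int], coords: List[float]) -> List[List[Tuple[float, float]]]:
    """Group a flat x,y sequence into parts, using the per-part point counts."""
    if sum(parts) * 2 != len(coords):
        raise ValueError("coords must hold exactly two doubles per point")
    result = []
    offset = 0  # current index into the flat coords array
    for count in parts:
        end = offset + 2 * count
        xy = coords[offset:end]
        # Even positions are x values, odd positions are y values.
        result.append(list(zip(xy[0::2], xy[1::2])))
        offset = end
    return result


def to_esri_json(parts: List[int], coords: List[float],
                 is_polygon: bool, wkid: int = 4326) -> dict:
    """Build an Esri JSON style dict; wkid is an assumed spatial reference."""
    key = "rings" if is_polygon else "paths"
    geometry = [[list(point) for point in part] for part in split_parts(parts, coords)]
    return {key: geometry, "spatialReference": {"wkid": wkid}}


# Two parts: the first has 3 points, the second has 2 points.
example = split_parts([3, 2], [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 5.0, 5.0, 6.0, 6.0])
```

Here example holds two lists of (x, y) tuples, one per part, which is the shape that the ArcGIS API for Python expects under "paths" or "rings" once the tuples are converted to lists.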

Notes:

  • This implementation does not support compressed file geo databases.
  • It is HIGHLY recommended to create a fully compacted feature class before using this implementation.
  • The best way to create a compacted feature class is to copy the edited feature class to a new feature class.
  • Date fields are read as timestamps in the UTC time zone.
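
Because date values are UTC timestamps, a consumer that needs epoch milliseconds can convert them without any local time zone shift. The following is a minimal sketch in plain Python. The function name to_epoch_millis is hypothetical, and treating a naive datetime as UTC is an assumption made for this illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(ts: datetime) -> int:
    """Return whole milliseconds since the Unix epoch for a UTC timestamp."""
    if ts.tzinfo is None:
        # Assumption: naive values are already in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating point rounding.
    return (ts - EPOCH) // timedelta(milliseconds=1)


one_day = to_epoch_millis(datetime(1970, 1, 2))  # naive value, taken as UTC
```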

Changes

  • Sep 10, 2021, Version 0.41 is a breaking change in the FileGDB object.

Building the project using Maven:

mvn clean install

Usage

The best demonstration of the usage of this implementation is with PySpark DataFrames and in conjunction with the ArcGIS API for Python.

Create a Python 3 conda environment:

conda remove --yes --all --name py36
conda create --yes -n py36 -c conda-forge python=3.6 openjdk=8 findspark py4j
conda create --name arcgis python=3.6
conda activate arcgis
conda install -c esri arcgis
conda install matplotlib

Assuming that the environment variable SPARK_HOME points to the location of a Spark installation, start a Jupyter notebook that is backed by PySpark:

export PATH=${SPARK_HOME}/bin:${PATH}
export SPARK_LOCAL_IP=$(hostname)
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export GDB_MIN=2.11 # Spark 2.3
# export GDB_MIN=2.12 # Spark 2.4
export GDB_VER=0.18
pyspark\
 --master local[*]\
 --num-executors 1\
 --driver-memory 16G\
 --executor-memory 16G\
 --packages com.esri:webmercator_${GDB_MIN}:1.4,com.esri:filegdb_${GDB_MIN}:${GDB_VER}

Check out the Broadcast and Countries example notebooks.

Here is yet another example in Scala:

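import com.esri.gdb._
import org.apache.spark.sql.SparkSession

val path = "World.gdb"
val name = "Countries"

val spark = SparkSession.builder().getOrCreate()
try
{
    // Read the feature class and register it as a temporary view.
    spark
      .read
      .gdb(path, name)
      .createTempView(name)

    // Print the 10 largest countries that are smaller than 10,000 sq km.
    spark
      .sql(s"select CNTRY_NAME,SQKM from $name where SQKM < 10000.0 ORDER BY SQKM DESC LIMIT 10")
      .collect()
      .foreach(println)
}
finally
{
    spark.stop()
}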

TODO

  • Write test cases. Come on Mansour, u know better !!
  • Save geometry as a struct(type,xmin,ymin,xmax,ymax,parts,coords)
  • Add option to skip reading the geometry.
  • Add option to return geometry envelope only.
  • Add option to return timestamp field as millis long.
  • Read geometry as WKB.
  • Add geometry extent as subfields to Shape.
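
The envelope and extent items above refer to the bounding box of a shape. As a rough illustration of what that means for the flat coords array, here is a minimal sketch in plain Python. It is not part of this project, and the function name envelope is hypothetical.

```python
from typing import List, Tuple


def envelope(coords: List[float]) -> Tuple[float, float, float, float]:
    """Return (xmin, ymin, xmax, ymax) for a flat sequence of x,y pairs."""
    if not coords or len(coords) % 2 != 0:
        raise ValueError("coords must be a non-empty sequence of x,y pairs")
    xs = coords[0::2]  # even positions hold x values
    ys = coords[1::2]  # odd positions hold y values
    return min(xs), min(ys), max(xs), max(ys)


bbox = envelope([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 5.0, 5.0, 6.0, 6.0])
```

The parts array is not needed here, since the extent covers every point regardless of which part it belongs to.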

Notes To Self

  • Install JDK-1.8
  • Set path to %JAVA_HOME%\bin,%JAVA_HOME%\jre\bin
  • keytool -import -alias cacerts -keystore cacerts -file C:\Windows\System32\documentdbemulatorcert.cer

filegdb's People

Contributors

dingsl-giser, mraad
