
biggis-landuse

Land use update detection based on Geotrellis and Spark

Quick and dirty usage example

# first, we compile everything and produce a fat jar
# which contains all the dependencies
mvn package

# now we can run the example app
java -cp target/biggis-landuse-0.0.8-SNAPSHOT.jar \
  biggis.landuse.spark.examples.GeotiffToPyramid \
  /path/to/raster.tif \
  new_layer_name \
  /path/to/catalog-dir

GettingStarted Example

The code for this example is located in src/main/scala/biggis.landuse.spark.examples/GettingStarted.scala

# based on https://github.com/geotrellis/geotrellis-landsat-tutorial
# download the example data from geotrellis-landsat-tutorial
# into data/geotrellis-landsat-tutorial
wget http://landsat-pds.s3.amazonaws.com/L8/107/035/LC81070352015218LGN00/LC81070352015218LGN00_B3.TIF
wget http://landsat-pds.s3.amazonaws.com/L8/107/035/LC81070352015218LGN00/LC81070352015218LGN00_B4.TIF
wget http://landsat-pds.s3.amazonaws.com/L8/107/035/LC81070352015218LGN00/LC81070352015218LGN00_B5.TIF
wget http://landsat-pds.s3.amazonaws.com/L8/107/035/LC81070352015218LGN00/LC81070352015218LGN00_BQA.TIF
wget http://landsat-pds.s3.amazonaws.com/L8/107/035/LC81070352015218LGN00/LC81070352015218LGN00_MTL.txt

Using an IDE

We strongly recommend using an IDE for Scala development, in particular IntelliJ IDEA, which has better support for Scala than Eclipse.

For IDE builds, please select the Maven profile IDE before running; this avoids the provided dependency scope, which is necessary only for cluster builds.

Since Geotrellis uses Apache Spark for processing, we need to set the spark.master property first.

  • For local debugging, the easiest option is to set the VM command line argument -Dspark.master=local[*], as sketched below.
  • Another option for local debugging, closer to a cluster setup, is to run Geotrellis in a Docker container as implemented in biggis-spark. In this case, use -Dspark.master=spark://localhost:7077.
  • The third option is to use a real cluster, which can run on the same Docker-based infrastructure from biggis-spark.
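
A minimal sketch of the first option, assuming the application creates the SparkContext itself (an illustration, not the project's actual Utils code; SparkConf picks up spark.* system properties such as -Dspark.master automatically):

import org.apache.spark.{SparkConf, SparkContext}

// -Dspark.master is read from the system properties by SparkConf;
// setIfMissing merely provides a fallback for local debugging
val sparkConf = new SparkConf()
  .setAppName("biggis-landuse-example")
  .setIfMissing("spark.master", "local[*]")

val sc = new SparkContext(sparkConf)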

Geotrellis always works with a "catalog", which is basically a directory either in the local filesystem or in HDFS. You might want to use target/geotrellis-catalog during development. This way, the catalog will be deleted when running mvn clean and won't be included in the git repository.
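
A minimal sketch of opening such a catalog for writing (hedged: the exact HadoopAttributeStore constructor may differ between Geotrellis versions):

import org.apache.hadoop.fs.Path
import geotrellis.spark.io.hadoop.{HadoopAttributeStore, HadoopLayerWriter}

// the catalog is just a directory; during development we keep it under target/
val catalogPath = new Path("target/geotrellis-catalog")
val attributeStore = HadoopAttributeStore(catalogPath, sc.hadoopConfiguration)
val writer = HadoopLayerWriter(catalogPath, attributeStore)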


biggis-landuse's Issues

ToDo: Get rid of hardcoded spark jars

In biggis.landuse.spark.examples.Utils.scala (originally using Spark 1.6.2) it was necessary to initialize the SparkContext with the proper JARs, which was hardcoded:

def initSparkClusterContext: SparkContext = {
  // TODO: get rid of the hardcoded JAR
  sparkConf.setJars(Seq("hdfs:///jobs/landuse-example/biggis-landuse-0.0.7-SNAPSHOT.jar"))

Spark 2.0 (we are using Spark 2.2) introduced SparkSession as a container for the SparkContext:

SparkSession.builder
  .config("spark.jars", "hdfs:///jobs/landuse-example/biggis-landuse-0.0.7-SNAPSHOT.jar")
  .getOrCreate()

To avoid hardcoded Spark JARs, "spark.jars" has to be set externally, e.g. in the JSON job description submitted to the Spark master.

Otherwise the following error occurs:

java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
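
A hedged sketch of setting "spark.jars" from the outside via a system property instead of hardcoding it (hypothetical code, not the repository's implementation):

import org.apache.spark.sql.SparkSession

// hypothetical: launch with -Dspark.jars=hdfs:///jobs/... ;
// the option is only applied when it was actually provided
val builder = SparkSession.builder.appName("landuse-example")
sys.props.get("spark.jars").foreach(jars => builder.config("spark.jars", jars))
val spark = builder.getOrCreate()
val sc = spark.sparkContext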

Maven build fails due to reflection issue in Java 10

biggis-landuse is developed with Scala for JDK 1.8; it is not compatible with Java 10.

If the Maven build fails with a reflection issue, it might be due to a wrong Java version. Please install Java 8:

sudo apt-get install openjdk-8-jdk

and set it to a higher priority than Java 10 (adjust the JVM path if you use OpenJDK instead of the Oracle JDK):

sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/java-8-oracle/jre/bin/java 1181

or change the default interactively:

sudo update-alternatives --config java

See: https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-on-ubuntu-18-04

MultibandGeotiffTilingExample invalid hdfs path

MultibandGeotiffTilingExample uses a wrong HDFS path when running in a Docker container that was not started in cluster mode (i.e. started via docker-compose).

When importing

  • hdfs:///landuse-demo/landuse/dop ("hdfs://" + "/landuse-demo/landuse/dop")

the path is incorrectly truncated to

  • hdfs:/landuse-demo/landuse/dop ("hdfs:" + "/landuse-demo/landuse/dop")

This is similar to the already closed issue #15, in which serializing the Hadoop config was necessary:

// the Hadoop configuration must be wrapped to be serializable for the executors
implicit val conf: Configuration = sc.hadoopConfiguration
val serConf = new SerializableConfiguration(conf)

Unfortunately, MultibandGeotiffTilingExample uses the SparkContext directly for hadoopMultibandGeoTiffRDD, not the (to-be-serialized) hadoopConfiguration:

sc.hadoopMultibandGeoTiffRDD
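
A hedged workaround sketch: qualify the path through Hadoop's Path/FileSystem API instead of concatenating strings, which avoids the hdfs:/ truncation (hypothetical code, not from the repository):

import org.apache.hadoop.fs.Path

// resolve the URI against the configured default filesystem
val raw = new Path("hdfs:///landuse-demo/landuse/dop")
val fs = raw.getFileSystem(sc.hadoopConfiguration)
val qualified = fs.makeQualified(raw) // e.g. hdfs://namenode:8020/landuse-demo/landuse/dop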

ERROR org.apache.spark.executor.Executor ... scala.MatchError: Some() at geotrellis.raster.io.geotiff.reader.GeoTiffCSParser.getEllipsoidInfo

When I try:

java -cp target/biggis-landuse-0.0.1-SNAPSHOT.jar \
  biggis.landuse.spark.examples.GeotiffToPyramid \
  ./data/DOP_RGBI_T2.tif \
  new_layer \
  ./data/pyramid

I get the following error message:

15:58:50 ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 0.0 (TID 0)
scala.MatchError: Some() (of class scala.Some)
at geotrellis.raster.io.geotiff.reader.GeoTiffCSParser.getEllipsoidInfo(GeoTiffCSParser.scala:570)
at geotrellis.raster.io.geotiff.reader.GeoTiffCSParser.createGeoTiffCSParameters(GeoTiffCSParser.scala:172)
at geotrellis.raster.io.geotiff.reader.GeoTiffCSParser.geoTiffCSParameters$lzycompute(GeoTiffCSParser.scala:78)
at geotrellis.raster.io.geotiff.reader.GeoTiffCSParser.geoTiffCSParameters(GeoTiffCSParser.scala:78)
at geotrellis.raster.io.geotiff.reader.GeoTiffCSParser.model(GeoTiffCSParser.scala:80)
at geotrellis.raster.io.geotiff.tags.TiffTags.crs$lzycompute(TiffTags.scala:207)
at geotrellis.raster.io.geotiff.tags.TiffTags.crs(TiffTags.scala:205)
at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readGeoTiffInfo(GeoTiffReader.scala:315)
at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readSingleband(GeoTiffReader.scala:67)
at geotrellis.raster.io.geotiff.reader.GeoTiffReader$.readSingleband(GeoTiffReader.scala:61)
at geotrellis.raster.io.geotiff.SinglebandGeoTiff$.apply(SinglebandGeoTiff.scala:40)
at geotrellis.spark.io.hadoop.formats.GeotiffInputFormat.read(GeotiffInputFormat.scala:28)
at geotrellis.spark.io.hadoop.formats.BinaryFileInputFormat$$anonfun$createRecordReader$1.apply(BinaryFileInputFormat.scala:34)
at geotrellis.spark.io.hadoop.formats.BinaryFileInputFormat$$anonfun$createRecordReader$1.apply(BinaryFileInputFormat.scala:34)
at geotrellis.spark.io.hadoop.formats.BinaryFileRecordReader.initialize(BinaryFileInputFormat.scala:18)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:158)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Maven build fails recently with maven-surefire-plugin 2.x

Tests fail with the maven-surefire-plugin in recent builds (affects all branches, even old snapshots). The failure affects version 2.18.1 (used in pom.xml) as well as all versions up to 2.22.1. It seems to be fixed in 3.0.0-M1:

https://mvnrepository.com/artifact/org.apache.maven.plugins/maven-surefire-plugin/3.0.0-M1

<!-- https://mvnrepository.com/artifact/org.apache.maven.plugins/maven-surefire-plugin -->
<!-- note: surefire is a build plugin, so this belongs under <build><plugins>, not <dependencies> -->
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <version>3.0.0-M1</version>
</plugin>

Might be related to:
https://issues.apache.org/jira/projects/SUREFIRE/issues/SUREFIRE-1574?filter=allopenissues

No configuration setting found for key 'akka.version'

When I try:

java -cp target/biggis-landuse-0.0.1-SNAPSHOT.jar \
  biggis.landuse.spark.examples.GeotiffToPyramid \
  ./data/DOP_RGBI_T2.tif \
  new_layer \
  ./data/pyramid

I get the following error message:

09:25:35 INFO  org.apache.spark.util.Utils                                   - Successfully started service 'sparkDriver' on port 54655.
09:25:35 ERROR org.apache.spark.SparkContext                                 - Error initializing SparkContext.
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'
        at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
        at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145)
        at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:151)
        at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
        at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
        at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206)
        at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:169)
        at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:505)
        at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
        at akka.actor.ActorSystem$.apply(ActorSystem.scala:119)
        at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
        at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
        at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2024)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
        at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2015)
        at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:266)
        at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
        at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:457)
        at biggis.landuse.spark.examples.GeotiffToPyramid$.apply(GeotiffToPyramid.scala:55)
        at biggis.landuse.spark.examples.GeotiffToPyramid$.main(GeotiffToPyramid.scala:41)
        at biggis.landuse.spark.examples.GeotiffToPyramid.main(GeotiffToPyramid.scala)

Raster tile to pixels and back

While writing a paper about the BigGIS architecture, I realized that for the "pixelization" operation we actually don't want the (ts, lat, lon) coordinates, because they are more complicated to use when converting pixels back to a tile.
Instead, I propose to use (sfc, offset) coordinates.

  • Each tile has a space-filling curve index sfc = HASH_SFC(gridx, gridy, ts) generated by geotrellis.
  • Each pixel within a tile has an offset = row * width + column

What do you think?
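
A minimal sketch of the proposed addressing (hypothetical; PixelRef is an illustrative name, and sfc stands for the space-filling curve index generated by geotrellis):

// a pixel is addressed by its tile's SFC index plus a flat offset within the tile
case class PixelRef(sfc: Long, offset: Int)

// converting between (row, column) within a tile and the flat offset
def toOffset(row: Int, col: Int, width: Int): Int = row * width + col
def fromOffset(offset: Int, width: Int): (Int, Int) = (offset / width, offset % width)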

LayerToGeotiff/MultibandLayerToGeotiff invalid hdfs path

When exporting a GeoTiff using an HDFS path, e.g.

  • hdfs:///landuse-demo/landuse/out ("hdfs://" + "/landuse-demo/landuse/out")

the path is incorrectly truncated to

  • hdfs:/landuse-demo/landuse/out ("hdfs:" + "/landuse-demo/landuse/out")

by GeoTiff.write(filename) or MultibandGeoTiff.write(filename), causing

java.io.FileNotFoundException: hdfs:/landuse-demo/landuse/out/result.tif (No such file or directory)

A possible reason is that GeoTiff.write uses standard java.io, which writes to the local filesystem.

A similar issue when reading/writing JSON files in HDFS could be solved by using

// the Hadoop config is accessible from the SparkContext
implicit val fs: FileSystem = FileSystem.get(sc.hadoopConfiguration)

and then replacing java.io.FileWriter with an OutputStreamWriter over a BufferedOutputStream and fs.create(new Path(filename)):

  • val bw = new java.io.BufferedWriter(new java.io.FileWriter(new java.io.File(fileNameJson)))

changed to:

  • val bw = new java.io.BufferedWriter(new java.io.OutputStreamWriter(new java.io.BufferedOutputStream(fs.create(new Path(fileNameJson)))))

The problem is that the java.io call is hidden inside GeoTiff.write, which only accepts a String as filename. There must be an alternative using org.apache.hadoop.fs in Geotrellis!
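
A possible workaround sketch, assuming GeoTiffWriter.write has an overload that returns the encoded bytes (signatures differ between Geotrellis versions, so treat this as an assumption):

import org.apache.hadoop.fs.{FileSystem, Path}
import geotrellis.raster.io.geotiff.writer.GeoTiffWriter

// encode the GeoTiff in memory, then stream the bytes through
// org.apache.hadoop.fs instead of GeoTiff.write's java.io
val fs = FileSystem.get(sc.hadoopConfiguration)
val out = fs.create(new Path("hdfs:///landuse-demo/landuse/out/result.tif"))
try out.write(GeoTiffWriter.write(tiff)) finally out.close()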

TODO: Changing / Creating new zoom level (Upsampling / Downsampling)

We need to be able to change resolution / zoom levels when merging (mosaicking or layer stacking) layers with different zoom levels. The current approach selects the highest zoom level common to all layers, but it fails if there is no common zoom level.
It would also be nice to be able to set the resolution to a specific value (also between two zoom levels).

We will try to use ZoomResample for this:

https://github.com/locationtech/geotrellis/blob/master/spark/src/test/scala/geotrellis/spark/resample/ZoomResampleSpec.scala

https://github.com/locationtech/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/resample/ZoomResample.scala

Another approach is to use Regrid:

https://github.com/locationtech/geotrellis/blob/master/spark/src/test/scala/geotrellis/spark/regrid/RegridSpec.scala

https://github.com/locationtech/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/regrid/Regrid.scala
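
A minimal usage sketch of the ZoomResample approach (hedged: the call shape follows ZoomResampleSpec, and the layer names are hypothetical):

import geotrellis.spark.resample.ZoomResample

// upsample a tile layer from zoom 12 to zoom 14 so it can be merged
// with a higher-resolution layer even without a common zoom level
val layerAtZoom14 = ZoomResample(layerAtZoom12, 12, 14)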

Read GeoJSON in Cluster fails with invalid hdfs path

UtilsShape.readGeoJSONMultiPolygonLongAttribute fails in the cluster due to a truncated HDFS path in

val collection = GeoJson.fromFile[WithCrs[JsonFeatureCollection]](geojsonName)

see #15 LayerToGeotiff/MultibandLayerToGeotiff invalid hdfs path
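
A hedged workaround sketch, analogous to #15: read the file content through the Hadoop FileSystem API and parse the string, instead of GeoJson.fromFile which goes through java.io (assumes GeoJson.parse accepts a JSON string):

import org.apache.hadoop.fs.{FileSystem, Path}
import geotrellis.vector.io.json.GeoJson

// read the GeoJSON from HDFS ourselves, then parse it from the string
val fs = FileSystem.get(sc.hadoopConfiguration)
val in = fs.open(new Path(geojsonName))
val json = try scala.io.Source.fromInputStream(in).mkString finally in.close()
val collection = GeoJson.parse[WithCrs[JsonFeatureCollection]](json)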

Hadoop Layer Writer - Directory already exists

The following code was working fine with Geotrellis 0.10.3 and Spark 1.6.2:

// Create the writer that we will use to store the tiles in the local catalog.
val writer = HadoopLayerWriter(catalogPathHdfs, attributeStore)
val layerId = LayerId(layerName, zoom)
[..]
logger debug "Writing reprojected tiles using space filling curve"
writer.write(layerId, reprojected, ZCurveKeyIndexMethod)

Using Geotrellis 1.0.0 and Spark 2.1.0, I now get the following failure:

10:29:29 INFO  org.apache.spark.storage.BlockManagerInfo                     - Removed broadcast_7_piece0 on 10.0.75.1:53190 in memory (size: 6.9 KB, free: 4.1 GB)
Exception in thread "main" geotrellis.spark.io.package$LayerWriteError: Failed to write Layer(name = "layer_label", zoom = 17)
                at geotrellis.spark.io.hadoop.HadoopLayerWriter._write(HadoopLayerWriter.scala:63)
                at geotrellis.spark.io.hadoop.HadoopLayerWriter._write(HadoopLayerWriter.scala:36)
                at geotrellis.spark.io.LayerWriter$class.write(LayerWriter.scala:59)
                at geotrellis.spark.io.hadoop.HadoopLayerWriter.write(HadoopLayerWriter.scala:36)
                at biggis.landuse.spark.examples.MultibandGeotiffTilingExample$.apply(MultibandGeotiffTilingExample.scala:76)
                at biggis.landuse.spark.examples.WorkflowExample$.apply(WorkflowExample.scala:52)
                at biggis.landuse.spark.examples.WorkflowExample$.main(WorkflowExample.scala:15)
                at biggis.landuse.spark.examples.WorkflowExample.main(WorkflowExample.scala)
Caused by: java.lang.Exception: Directory already exists: target/geotrellis-catalog/layer_label/17
                at geotrellis.spark.io.hadoop.HadoopRDDWriter$.write(HadoopRDDWriter.scala:85)
                at geotrellis.spark.io.hadoop.HadoopLayerWriter._write(HadoopLayerWriter.scala:65)
                ... 7 more

The write fails after creating the layer (zoom level 17 and the partitions inside exist, and the layer metadata exists as well). I cleaned the geotrellis-catalog beforehand (to avoid data conflicts between versions).
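
A hedged workaround sketch: delete the pre-existing layer before writing (the exact HadoopLayerDeleter constructor varies between Geotrellis versions, so check the signature for yours):

import geotrellis.spark.io.hadoop.HadoopLayerDeleter

// remove the stale layer from the catalog before writing again
if (attributeStore.layerExists(layerId))
  HadoopLayerDeleter(attributeStore).delete(layerId)

writer.write(layerId, reprojected, ZCurveKeyIndexMethod)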
