Perform Spatial Join on BigData using Hadoop MapReduce
In this example, we are joining a set of points from HDFS with a small set of polygons that is loaded from the distributed cache.
Create polygon sample data into local file system:
$ awk -f data.awk | awk -f polygons.awk > /tmp/polygons.txt
Create points sample data into HDFS:
$ awk -f data.awk | hadoop fs -put - points.txt
Build, package and run the tool:
$ mvn -PDensityTool clean package
$ hadoop jar target/SpatialJoin-1.0-SNAPSHOT-job.jar points.txt file:///tmp/polygons.txt output
Save the summary output into local file system:
$ hadoop fs -cat output/part-* > /tmp/relate.txt
Convert polygon set into a shapefile using ogr2ogr so it can be viewed in ArcMap and related with the above file for visualization:
$ awk -f featurecollection.awk /tmp/polygons.txt > /tmp/fc.json
$ ogr2ogr -f "ESRI Shapefile" /tmp/polygons /tmp/fc.json
In this example of a map side spatial join, I have a list of airport code pairs forming flight routes between two cities. In addition, I have a list of the airport code, city and location in lat/lon format. The task at hand it count the number of times an airport is overflown by the flight routes.
Place the data into HDFS:
$ hadoop fs -put data/airports.csv airports.csv
$ hadoop fs -put data/routes.csv routes.csv
Build, package and run the tool:
$ mvn -PJoinMapTool clean package
$ hadoop jar target/SpatialJoin-1.0-SNAPSHOT-job.jar routes.csv output
See the output:
$ hadoop fs -cat output/part-00000 | sort -n -r --key=2 | head -10
In this example of reduce side spatial join, we will join two polygon sets and return the intersection set.
Create the two polygon sets:
$ awk -f data.awk | hadoop fs -put - data1.txt
$ awk -f data.awk | hadoop fs -put - data2.txt
Build, package and run the tool - here I am specifying a cell size of 5 units:
$ mvn -PJoinRedTool clean package
$ hadoop jar target/SpatialJoin-1.0-SNAPSHOT-job.jar -D com.esri.size=5 data1.txt data2.txt output
To view the result (as long it is small :-), I am using the Esri-Leaflet project.
Create a folder under your web application and extract the data into that folder as GeoJSON formatted files.
$ hadoop fs -cat data1.txt | awk -f geojson.awk -v N=data1 > data1.js
$ hadoop fs -cat data2.txt | awk -f geojson.awk -v N=data2 > data2.js
$ hadoop fs -cat output/part-* | awk -f union.awk > union.js
Copy the file esri-leaflet.js and index.html in the webapp folder to that folder and launch your browser and view the result at http://localhost/myfolder/index.html
I highly recommend that you look at the test folder - There are two types of unit test cases in there. one that tests the mapper by itself, the reducer by itself and the combination of mapper and reducer. The other type is one that launches a mini cluster that launches a name node, a data node, a job tracker and a task tracker all in one VM. This enable the test case to write some well known data at startup time into HDFS and read the result of the MapReduce job back from HDFS to assert the values.