geocoding_with_geonames

Background

As a GIS and Data Associate working in the Brown library, I was asked to explore geocoding options for the Stolen Relations project (https://indigenousslavery.org/), which seeks to document histories of enslavement of indigenous peoples in the Americas. People associated with this project had scoured the archives and assembled a table with thousands of records of indigenous enslavement spanning several centuries. One of the columns in this table listed location information associated with each record, and they were looking to plot each record as a point on the map by identifying coordinates for each place name.

Rationale for not using a traditional geocoding service

Several aspects of the Stolen Relations dataset made it challenging to achieve good results using traditional geocoding services. Each location had an indeterminate number of components, all the components were in a single column and had not been split or labelled, and the level of spatial detail ranged from individual buildings to entire countries. Many of the place names themselves were colonial names that are no longer in use. And in some cases, what was listed in the location column was not a single place name with ordered components, but a collection of distinct place names that were all mentioned in a particular historical record.

Running the original dataset through the ArcGIS World Geocoding Service resulted in hundreds of incorrectly plotted points as well as missing points. Doing some pre-processing to identify and split country, state, or city names into their own columns and running that through ArcGIS's geocoding service reduced some (but not most) of the false matches, and still failed to plot points for hundreds of records.

Data and Methods

This script represents my first attempt at implementing a custom geocoding algorithm, and although it may have some utility for geocoding messy data beyond the Stolen Relations project, its features are tailored for the Stolen Relations data in certain ways. In particular, with the Stolen Relations data, I could more or less rely on each place name having its components separated with commas, even if I didn't know which components they were.

At the most basic level, the script uses a for loop to iterate through all the place names in the dataset. Within the for loop, a while loop iterates through the components of each place name, starting with the last component. Inside the while loop is a large if block. As the program enters different parts of the if block, it updates local variables that determine which branch is taken on later iterations of the while loop, depending on the kinds of place name components it has identified so far.

For example, if the last component of a place name matches the name of a US state, the search for the next-to-last component of that place name will be done on the counties and cities within that state. If a match is found, the coordinates for that place name will be updated from the generic state coordinates to the more specific county or city coordinates. If no match is found, the place name will still have the state coordinates, and the next component of the place name will be considered. The full logic for how the script searches for place name components is visualized in a schematic diagram (stolen_relations_script_diagram.png). The script itself also has inline comments pointing out how this logic is operationalized.
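The script itself is not reproduced in this README, so the sketch below is only a toy illustration of that control flow, written in Python for the sake of example (regardless of the actual script's language). The lookup tables, coordinates, and names are placeholders, and the real script has many more branches, including the GeoNames fallback described further down:

```python
# Toy illustration of the loop-and-branch structure described above.
# All data, names, and coordinates are illustrative placeholders.
STATES = {"Texas": (31.0, -100.0), "Massachusetts": (42.4, -71.4)}
CITIES_BY_STATE = {
    "Texas": {"Paris": (33.66, -95.56)},
    "Massachusetts": {"Plymouth": (41.96, -70.67)},
}

def geocode(place_names):
    results = {}
    for raw_place in place_names:                          # one pass per record
        components = [c.strip() for c in raw_place.split(",")]
        i = len(components) - 1                            # start with the last component
        state = None                                       # updated as components match;
        coords = None                                      # steers later branch choices
        while i >= 0:
            component = components[i]
            if state and component in CITIES_BY_STATE.get(state, {}):
                coords = CITIES_BY_STATE[state][component] # refine state -> city/county
            elif component in STATES:
                state = component
                coords = STATES[component]                 # generic state coordinates
            # the real script also checks countries, counties, world cities,
            # and finally falls back to a GeoNames search
            i -= 1
        results[raw_place] = coords
    return results

print(geocode(["Paris, Texas", "Plymouth, Massachusetts"]))
```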

Data for country, state, and county coordinates were downloaded from public domain sources and are read into the script from local CSV files as hashtables. For the coordinates of world cities, I downloaded a free dataset from https://simplemaps.com/data/world-cities, which is in turn based on data from sources such as the US Census Bureau, the USGS, and NASA, and is free to use under a Creative Commons Attribution 4.0 (CC BY 4.0) license.
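As a rough illustration of that setup, each reference CSV can be read into a dictionary keyed by place name so that lookups are constant-time. The file name and column headers below are assumptions made for the example, not the actual files used:

```python
import csv

def load_coords_table(path, name_column, lat_column="lat", lng_column="lng"):
    """Read a reference CSV into a dict mapping place name -> (lat, lng).
    Column names are assumptions for illustration; adjust to the real files."""
    coords = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            coords[row[name_column]] = (float(row[lat_column]), float(row[lng_column]))
    return coords

# For example, the simplemaps world cities file includes "city", "lat", and "lng" columns:
# world_cities = load_coords_table("worldcities.csv", name_column="city")
```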

To handle place names that don't show up in these tables of more "standard" place names, I decided to use the Search Webservice API offered by GeoNames (https://www.geonames.org/export/geonames-search.html), which supports open-ended and parameterized searches for place names and returns coordinates along with other information. GeoNames sets hourly, daily, and weekly limits on the number of requests that will be fulfilled for a given username, so to prevent redundant searches and avoid running into limits, the script stores information from previous searches in a hashtable and in a local csv file.
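A minimal sketch of that caching idea, assuming Python and the requests library: keep an in-memory dictionary keyed by the search query, persist it to a local CSV, and only call the API on a cache miss. The endpoint and query parameters below come from GeoNames' documented search webservice; the file name, column names, and username are placeholders.

```python
import csv
import os
import requests

CACHE_FILE = "geonames_cache.csv"     # illustrative file name, not the script's
GEONAMES_USERNAME = "demo"            # replace with a registered GeoNames username

def load_cache():
    """Load previously fetched results into a dict: query -> (lat, lng)."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                cache[row["query"]] = (float(row["lat"]), float(row["lng"]))
    return cache

def save_to_cache(cache, query, lat, lng):
    """Record a new result both in memory and in the local CSV."""
    cache[query] = (lat, lng)
    write_header = not os.path.exists(CACHE_FILE)
    with open(CACHE_FILE, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["query", "lat", "lng"])
        writer.writerow([query, lat, lng])

def search_geonames(query, cache):
    """Return (lat, lng) for a query, reusing cached results when possible."""
    if query in cache:                             # cache hit: no API request
        return cache[query]
    resp = requests.get(
        "http://api.geonames.org/searchJSON",
        params={"q": query, "maxRows": 1, "username": GEONAMES_USERNAME},
        timeout=10,
    )
    results = resp.json().get("geonames", [])
    if not results:
        return None
    lat, lng = float(results[0]["lat"]), float(results[0]["lng"])
    save_to_cache(cache, query, lat, lng)
    return (lat, lng)
```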

Every time the script needs to search GeoNames for a component of a place name, it makes an API request with specific parameters based on what is already known about that place name. For example, "Paris, Texas" would not be geocoded to Paris, France, because the already-matched component of the place name, Texas, would be passed into the GeoNames API request as a search parameter when looking up Paris, rather than running an open-ended search for Paris.
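For instance, the GeoNames search webservice accepts a `country` parameter and an `adminCode1` parameter (for US states, the two-letter postal code), so the search for "Paris" can be constrained to Texas rather than issued as an open-ended query. The snippet below is only an illustration of that kind of constrained request, not the script's exact code:

```python
import requests

# Illustrative: constrain the GeoNames search for "Paris" to the US state of
# Texas so it cannot match Paris, France. "demo" must be replaced with a
# registered GeoNames username.
params = {
    "name_equals": "Paris",   # the component currently being searched
    "country": "US",          # known from a previously matched component
    "adminCode1": "TX",       # GeoNames admin code for Texas
    "maxRows": 1,
    "username": "demo",
}
resp = requests.get("http://api.geonames.org/searchJSON", params=params, timeout=10)
top = resp.json().get("geonames", [])
if top:
    print(top[0]["name"], top[0]["adminName1"], top[0]["lat"], top[0]["lng"])
```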

The script prints out progress updates in the console every time it successfully processes 100 rows of the input table. At the end of each script run, it also prints out how many new GeoNames searches were performed during that run.
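Purely as an illustration of that reporting (not the script's actual output), a row counter and a search counter might be used like this:

```python
# Stand-alone illustration of progress reporting every 100 rows.
rows = [f"place {n}" for n in range(350)]   # placeholder input
new_search_count = 0                        # incremented whenever a GeoNames request is made
for row_number, raw_place in enumerate(rows, start=1):
    # ... geocoding work for raw_place would happen here ...
    if row_number % 100 == 0:
        print(f"Processed {row_number} of {len(rows)} rows")
print(f"New GeoNames searches performed this run: {new_search_count}")
```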

Geocoding performance and accuracy

Overall, the script took around 1 minute to process all 3,000+ rows of the original data, performing around 300 GeoNames requests along the way. Because the script stores everything it pulls from those requests for reuse, subsequent runs made for minor tweaks took only a few seconds each: they required no new GeoNames requests, and all the stored data live in hashtables with constant-time lookup rather than in lists or other structures that have to be searched.

The script failed to retrieve coordinates for only 10 of the original records in the Stolen Relations dataset, which I remedied without much pain by manually adding some information to those records based on the original source transcriptions (e.g., for records whose location was listed only as "Plymouth", I added ", Massachusetts" to ensure they weren't geocoded to Plymouth, England). Plotting the results on a map, I could not manually identify any points that had been plotted in the wrong country or the wrong US state, and most points were additionally accurate down to the subregion, city, town, or river.

