Code Monkey home page Code Monkey logo

csv-reconcile-geo's Introduction

CSV Reconcile Geo distance scoring plugin

A scoring plugin for csv-reconcile using geodesic distance. See csv-reconcile for details.

Example

You can generate a csv file from a SPARQL query using wikibase-cli like so:

$ cat /tmp/nyc-lakes.rq
SELECT ?itemLabel ?item (SAMPLE(?coords1) AS ?coords) WHERE {
  ?item wdt:P31 wd:Q23397.
  ?item wdt:P17 wd:Q30.
  ?item wdt:P131* wd:Q60.
  ?item wdt:P625 ?coords1.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} GROUP BY ?item ?itemLabel
$ wd sparql /tmp/nyc-lakes.rq --format tsv
?itemLabel	?item	?coords
"Collect Pond"@en	<http://www.wikidata.org/entity/Q5146020>	"Point(-74.00167 40.71639)"^^<http://www.opengis.net/ont/geosparql#wktLiteral>
"Harlem Meer"@en	<http://www.wikidata.org/entity/Q3508849>	"Point(-73.9517 40.7966)"^^<http://www.opengis.net/ont/geosparql#wktLiteral>
"The Lake"@en	<http://www.wikidata.org/entity/Q104523187>	"Point(-73.9725 40.7765)"^^<http://www.opengis.net/ont/geosparql#wktLiteral>
"Williamsbridge Reservoir"@en	<http://www.wikidata.org
$

You could just as well use the Query Service and Download to tsv. Note: tsv is the default separator in csv-reconcile, but this can be changed using a config file.

Note that SAMPLE is used here to ensure we have only one coordinate for each item.

Then you can fire up a csv service with this file which does fuzzy search against this data based on the coordinates.

(venv) $ python -m pip install csv-reconcile
...
(venv) $ python -m pip install csv-reconcile-geo
(venv) $ wd sparql /tmp/nyc-lakes.rq --format tsv > /tmp/nyc-lakes.tsv
(venv) $
(venv) $   # I had to do a little clean up to remove blank line at the end
(venv) $
(venv) $ csv-reconcile /tmp/nyc-lakes.tsv '?item' '?coords' --scorer geo
 * Serving Flask app 'csv-reconcile' (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Now generating a project with the following SPARQL:

SELECT ?itemLabel ?item (SAMPLE(?coords1) AS ?coords) WHERE {
  VALUES ?item { wd:Q9188 }
  ?item wdt:P625 ?coords1.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} GROUP BY ?item ?itemLabel

I can use reconciliation on the coords column to find out which of the lakes in NYC is closest to the Empire State Building.

Note that it handles the extra data types in the coordinate column just fine, though you may want to clean that up. Similarly for the Qid to make it easier to reconcile by id against Wikidata in order to find more data.

Also, see configuration options below.

Reconciliation

This plugin is used to reconcile values representing points on the globe. It expects those values to be in well-known text format for a point. That is, like so: POINT( longitude latitude ).

The pre-processor automatically strips off literal datatypes when present as well as double quotes.

The CSV column to be reconciled needs to be in the same format. In addition, there must be at most one instance of any id column. For instance, if reconciling against coordinate location for a wikidata item, there must be at most one location per item.

Scoring

The scoring used is more or less arbitrary but has the following properties:

  • The highest score is 100 and occurs when the distance to the reconciliation candidate is zero
  • The lower the score the greater the distance to the reconciliation candidate
  • The score is scaled so that a distance of 10km yields a score of 50

Configuration

The plugin can be controlled via SCOREOPTIONS in the csv-reconcile --config file. SCOREOPTIONS is a Python dictionary and thus has the following form SCOREOPTIONS={ "key1":"value1,"key2":"value2"}.

  • SCALE set distance in kilometers at which a score of 50 occurs. ( Default 10km ) e.g. ~”SCALE”:2~
  • COORDRANGE If supplied do a precheck that both the latitude and the longitude of the compared values are within range. This is for performance to avoid the more expensive distance calculation for points farther apart. e.g. ~”COORDRANGE”:”1”~

Future enhancements

Some of the current implementation was driven by the current design of csv-reconcile. Both may be updated to accommodate the following:

  • Allow for separate latitude and longitude column in the CSV file
  • Add some scoring options such as the following:
    • Allow for overriding the scaling function
    • etc.

csv-reconcile-geo's People

Contributors

gitonthescene avatar

Watchers

 avatar  avatar

csv-reconcile-geo's Issues

Remove dependency on csv-reconcile

We'll start assuming that csv-reconcile is installed to be able to use this package as a plugin. This allows us to not have to re-publish this package whenever the base package is updated. It does mean that we'll have to test that changes to the plugin API still work with this plugin. Tests have been added to the main package to achieve this.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.