Continuous integration?

CSV Reconcile

A reconciliation service for OpenRefine based on a CSV file similar to reconcile-csv. This one is written in Python and has some more configurability.

Quick start

Clone this repository

Run the service

$ python -m venv venv                                             # create virtualenv
$ venv/bin/pip install csv-reconcile                              # install package
$ source venv/bin/activate                                        # activate virtual environment
(venv) $ csv-reconcile init sample/reps.tsv item itemLabel        # initialize the service
(venv) $ csv-reconcile serve                                      # run the service
(venv) $ deactivate                                               # remove virtual environment

The service is run at http://127.0.0.1:5000/reconcile. You can point at a different host:port by adding SERVER_NAME to the sample.cfg. Since this is running from a virtualenv, you can simply delete the whole lot to clean up.

If you have a C compiler installed you may prefer to install the sdist dist/csv-reconcile-0.1.0.tar.gz which will build a Cython version of the computationally intensive fuzzy match routine for speed. With pip add the option --no-binary csv-reconcile.

Poetry

Prerequesites

You’ll need to have both poetry and poethepoet installed. For publishing to PyPI pandoc is required.

Running

This is packaged with poetry, so you can use those commands if you have it installed.

$ poe install
$ poetry run csv-reconcile init sample/reps.tsv item itemLabel
$ poetry run csv-reconcile serve

Building

Because this package uses a README.org file and pip requires a README.md, there are extra build steps beyond what poetry supplies. These are managed using poethepoet. Thus building is done as follows:

$ poe build

If you want to build a platform agnostic wheel, you’ll have to comment out the build = "build.py" line from pyproject.toml until poetry supports selecting build platform.

Description

This reconciliation service uses Dice coefficient scoring to reconcile values against a given column in a CSV file. The CSV file must contain a column containing distinct values to reconcile to. We’ll call this the id column. We’ll call the column being reconciled against the name column.

For performance reasons, the name column is preprocessed to normalized values which are stored in an sqlite database. This database must be initialized at least once by running the init sub-command. Once initialized this need not be run for subsequent runs.

Note that the service supplies all its data with a dummy type so there is no reason to reconcile against any particular type.

In addition to reconciling against the name column, the service also functions as a data extension service, which offers any of the other columns of the CSV file.

Note that Dice coefficient scoring is agnostic to word ordering.

Usage

Basic usage involves two steps:

initialization
running the service

Initialization primes the database with the data processed from the CSV file with the init subcommand. There are several options for running the service as described below.

Initialization

Basic usage of the init sub-command requires passing the name of the CSV file, the id column and the name column.

(venv) $ csv-reconcile --help
Usage: csv-reconcile [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  init
  run
  serve
(venv) $ csv-reconcile init --help
Usage: csv-reconcile init [OPTIONS] CSVFILE IDCOL NAMECOL

Options:
  --config TEXT  config file
  --scorer TEXT  scoring plugin to use
  --help         Show this message and exit.
(venv) $ poetry run csv-reconcile serve --help
Usage: csv-reconcile serve [OPTIONS]

Options:
  --help         Show this message and exit.
(venv) $

The --config option is used to point to a configuration file. The file is a Flask configuration and hence is Python code though most configuration is simply setting variables to constant values.

Running the service

The simplest way to run the service is to use Flask’s built-in web server with the serve subcommand which takes no arguments. However, as mentioned in the Flask documentation, this server is not suitable for production purposes.

For a more hardened service, you can use one of the other deployment options mentioned in that documentation. For example, gunicorn can be run as follows:

(venv) $ gunicorn -w 4 'csv_reconcile:create_app()'
1-11-16 17:40:20 +0900] [84625] [INFO] Starting gunicorn 20.1.0
1-11-16 17:40:20 +0900] [84625] [INFO] Listening at: http://127.0.0.1:8000 (84625)
1-11-16 17:40:20 +0900] [84625] [INFO] Using worker: sync
1-11-16 17:40:20 +0900] [84626] [INFO] Booting worker with pid: 84626
1-11-16 17:40:20 +0900] [84627] [INFO] Booting worker with pid: 84627
1-11-16 17:40:20 +0900] [84628] [INFO] Booting worker with pid: 84628
1-11-16 17:40:20 +0900] [84629] [INFO] Booting worker with pid: 84629
...

One thing to watch out for is that the default manifest points the extension service to port 5000, the default port for the Flask built-in web server. If you want to use the extension service when deploying to a different port, you’ll want to be sure to override that part of the manifest in your config file. You’ll need something like the following:

MANIFEST = {
    "extend": {
        "propose_properties": {
            "service_url": "http://localhost:8000",
            "service_path": "/properties"
        }
    }
}

Note also that the configuration is saved during the init step. If you change the config, you’ll need to re-run that step. You may also need to delete and re-add the service in OpenRefine.

Deprecated

The run subcommand mimics the old behavior which combined the initialization step with the running of the service. This may be removed in a future release.

Common configuration

SERVER_NAME - The host and port the service is bound to. e.g. SERVER_NAME=localhost:5555. ( Default localhost:5000 )
CSVKWARGS - Arguments to pass to csv.reader. e.g. CSVKWARGS={'delimiter': ',', 'quotechar': '"'} for comma delimited files using " as quote character.
CSVENCODING - Encoding of the CSV file. e.g. CSVENCODING="utf-8-sig" is the encoding used for data downloaded from GNIS.
SCOREOPTIONS - Options passed to scoring plugin during normalization. e.g. SCOREOPTIONS={'stopwords':['lake','reservoir']}
LIMIT - The maximum number of reonciliation candidates returned per entry. ( Default 10 ) e.g. LIMIT=10
THRESHOLD - The minimum score for returned reconciliation candidates. ( Default 30.0 ) e.g. THRESHOLD=80.5
DATABASE - The name of the generated sqlite database containing pre-processed values. (Default csvreconcile.db) e.g. DATABASE='lakes.db' You may want to change the name of the database if you regularly switch between databases being used.
MANIFEST - Overrides for the service manifest. e.g. MANIFEST={"name": "My service"} sets the name of the service to “My service”.

This last is most interesting. If your data is coming from Wikidata and your id column contains Q values, then a manifest like the following will allow your links to be clickable inside OpenRefine.

MANIFEST = {
  "identifierSpace": "http://www.wikidata.org/entity/",
  "schemaSpace": "http://www.wikidata.org/prop/direct/",
  "view": {"url":"https://www.wikidata.org/wiki/{{id}}"},
  "name": "My reconciliation service"
}

If your CSV is made up of data taken from another reconciliation service, you may similiarly copy parts of their manifest to make use of their features, such as the preview service. See the reconciliation spec for details.

Built-in preview service

There is a preview service built into the tool. (Thanks b2m!) You can turn it on by adding the following to your manifest:

"preview": {
   "url": "http://localhost:5000/preview/{{id}}",
   "width": 400,
   "height": 300
}

Note that if you reconcile against a service with a preview service enabled, a link to the service becomes part of the project. Thus if you bring the service down, your project will have hover over pop-ups to an unavailable service. One way around this is to copy the recon.match.id to a new column which can be re-reconciled to the column by id if you bring the service back up again whether or not you have preview service enabled. (Perhaps OpenRefine could be smarter about enabling this pop-ups only when the service is active.)

Scoring plugins

As mentioned above the default scoring method is to use Dice coefficient scoring, but this method can be overridden by implementing a cvs_reconcile.scorers plugin.

Implementing

A plugin module may override any of the methods in the csv_reconcile.scorers module by simply implementing a method of the same name with the decorator @cvs_reconcile.scorer.register.

See csv_reconcile_dice for how Dice coefficient scoring is implemented.

The basic hooks are as follows:

normalizedWord(word, **scoreOptions) preprocesses values to be reconciled to produce a tuple used in fuzzy match scoring. The value of SCOREOPTIONS in the configuration will be passed in to allow configuration of this preprocessing. This hook is required.
normalizedRow(word, row, **scoreOptions) preprocesses values to be reconciled against to produce a tuple used in fuzzy match scoring. Note that both the reconciled column and the entire row is available for calculating the normalized value and that the column reconciled against is required even when not used. The value of SCOREOPTIONS in the configuration will be passed in to allow configuration of this preprocessing. This defaults to calling normalizeWord(word,**scoreOptions).
getNormalizedFields() returns a tuple of names for the columns produced by normalizeWord(). The length of the return value from both functions must match. This defaults to calling normalizeWord(word,**scoreOptions). This hook is required.
processScoreOptions(options) is passed the value of SCOREOPTIONS to allow it to be adjusted prior to being used. This can be used for adding defaults and/or validating the configuration. This hook is optional
scoreMatch(left, right, **scoreOptions) gets passed two tuples as returned by normalizedWord(). The left value is the value being reconciled and the right value is the value being reconciled against. The value of SCOREOPTIONS in the configuration will be passed in to allow configuration of this preprocessing. Returning a score of None will not add tested value as a candidate. This hook is required.
valid(normalizedFields) is passed the normalized tuple prior to being scored to make sure it’s appropriate for the calculation. This hook is optional.
features(word, row, **scoreOptions) calculates features using the query string and the normalized row. By default calculating features is disabled. Implementions of this hook are automatically enabled. This hook is optional.

Installing

Hooks are automatically discovered as long as they provide a csv_reconcile.scorers setuptools entry point. Poetry supplies a plugins configuration which wraps the setuptools funtionality.

The default Dice coefficent scoring is supplied via the following snippet from pyproject.toml file.

[tool.poetry.plugins."csv_reconcile.scorers"]
"dice" = "csv_reconcile_dice"

Here dice becomes the name of the scoring option and csv_reconcile_dice is the package implementing the plugin.

Using

If there is only one scoring plugin available, that plugin is used. If there are more than one available, you will be prompted to pass the --scorer option to select among the scoring options.

Known plugins

See wiki for list of known plugins.

Testing

Though I long for the old days when a unit test was a unit test, these days things are a bit more complicated with various versions of Python and installation of plugins to manage. Now we have to wrestle with virtual environments. poetry handles the virtual environment for developing, but testing involves covering more options.

Tests layout

The tests directory structure is the following:

tests
    main
    plugins
        geo

Tests for the main package are found under main and don’t require installing any other packages whereas tests under plugins require the installation of the given plugin.

Running tests

Basic tests

These tests are written with pytest and can be running through poetry as follows:

$ poetry run pytest

To avoid the complications that come from installing plugins, there is a poe script for running only the tests under main which can be invoked as follows:

$ poe test

For steady state developing this is probably the command you’ll use most often.

Build matrices

The GitHub Actions for this project currently use a build matrix across a couple of architectures and several versions of Python, but a similar effect can be achieved using nox.

nox manages the creation of various virtual environments in what they call “sessions”, from which various commands can be run. This project’s noxfile.py manages the installation of the csv-reconcile-geo plugin for the plugin tests as well as running across several versions of Python. See the nox documentation for detail.

Some versions of this command you’re likely to run are as follows:

$ nox      # Run all the tests building virtual environemnts from scratch
$ nox -r   # Reuse previously built virtual environments for speed
$ nox -s test_geo  # Run only the tests for the csv-reconcile-geo plugin
$ nox -s test_main -p 3.8   # Run only the main tests with Python3.8

Eventually, the GitHub Actions may be changed to use setup-nox.

Future enhancements

It would be nice to add support for using properties as part of the scoring, so that more than one column of the csv could be taken into consideration.

ID	Search Column	Additional Data
1	John Doe	1970-01-01
2	John Doe	2020-01-01
...	...	...

gitonthescene / csv-reconcile Goto Github PK

csv-reconcile's Introduction